Wow, this looks very polished. (Out of frustration with Optimizely) I created and maintain a couple of open source A/B testing projects[0][1], but the statistical analysis was always the hardest part, so I'm keen to see what you are doing. We're currently relying on a commercial tool called Analytics Toolkit[2] for this part alone and have been quite happy with it. The owner is very knowledgeable and responsive (no affiliation, just happy customers). I wonder if you can adopt similar ideas/algorithms into the open source tool. That could be useful, I imagine.
Thanks for the comment and for your work on Alephbet! Open source A/B testing is a graveyard of abandoned projects, so it's always great to see more people actively working in this space.
Georgi at Analytics Toolkit definitely knows his stuff. We're taking a Bayesian approach instead, which I know he isn't the biggest fan of, but I think it is much easier to understand. Itamar Faran, the author of our stats engine, has a great article that goes into a lot more detail if you're interested: https://towardsdatascience.com/why-you-should-switch-to-baye...
> Open source A/B testing is a graveyard of abandoned projects
Quite true, sadly. FWIW, we keep maintaining Alephbet, even though honestly I have no clue who's actively using it besides us :) Luckily, the codebase is simple enough that it doesn't require a lot of work.
Re stats: I did implement a Bayesian dashboard of sorts with Alephbet, but I'm not sure it prevents the peeking problem on its own. It requires some discipline when planning the tests to decide ahead of time when to look at results. Disclaimer: my stats chops are virtually non-existent, but that's what I've learned over the years. Georgi's platform really helps structure this process of planning and deciding when to stop the experiment (either when it succeeds or when it fails).
Another small (but in my experience important) thing that sets Alephbet apart from other A/B testing platforms: ad blockers. Mixpanel, GA, Amplitude, etc. frequently and trivially get blocked by ad blockers on the client. For client-side A/B tests this can reduce data quality (even though A/B tests are typically not privacy-invasive). Alephbet's Lamed[0] backend allows you to create a custom AWS URL that's far less likely to get blocked. In my experience, data quality with Alephbet is higher than what we see in, e.g., Amplitude.
Does your system store the stats, or does it rely on them being stored in, e.g., GA, and then just let you analyze them?
Is it appropriate to send email alerts when "significance" is reached? Without adhering to minimum sample sizes calculated in advance, won't this result in a bunch of Type I errors?
Are the changes to the pages made client-side or server-side? I think client-side, but I'm not sure. If so, are they synchronous or asynchronous?
1. We don't store any raw user data. We pull things like mean and standard deviation from data sources, run the statistics, and store the result (there's a rough sketch of what gets stored right after this list).
2. We use a Bayesian statistics engine, which is much less susceptible to peeking problems and Type I errors than frequentist approaches.
3. Tests can be run either client-side or server-side. For client-side, we recommend bundling the SDK with your app (webpack, etc.). We really care about performance, so we never want to add additional HTTP requests or script tags if at all possible.
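To make point 1 concrete, the only thing a given analysis run needs to persist is a small per-variation summary plus the computed result, never user-level rows. A rough illustration (the field names and numbers here are made up for the example, not our exact schema):

```python
import json

# Illustrative only: per-variation aggregates pulled from the data source,
# plus the output of the stats engine. No raw user data anywhere.
snapshot = {
    "experiment": "new-checkout",
    "metric": "revenue_per_user",
    "variations": [
        {"name": "control", "users": 10000, "mean": 4.12, "stddev": 9.80},
        {"name": "variant", "users": 10000, "mean": 4.35, "stddev": 10.10},
    ],
    # The only "result" that gets stored and rendered in the dashboard.
    "results": {"chance_to_beat_control": 0.93, "expected_loss": 0.004},
}

print(json.dumps(snapshot, indent=2))
```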
1. How do you get around needing session-level data instead of aggregate data when working with non-parametric KPIs? GA in particular is notorious for sampling data.
2. True, but you can't get away from the fact that a split test run for only a day or two isn't going to give you trustworthy results. It's features like this, which abstract away the statistical reality for lay users, that cause poor decisions to be made under the guise of being "data driven". I think we as testers, and you as a provider of a testing system, have a duty not to lead businesses to believe that they are making statistically sound choices when they may not be.
1. GA is very limited as a data source because of sampling and the fact that they don't expose variance. So if using GA, we only support simple binomial metrics, count data (assuming Poisson distribution), and duration data (assuming exponential distribution). For SQL data sources and non-parametric data, we currently rely on the CLT and treat the sampling distribution as Normal. There's a good article that goes over the stats in more detail (Itamar, the author, wrote our stats engine) - https://towardsdatascience.com/how-to-do-bayesian-a-b-testin...
2. We have a minimum sample size threshold before we run any statistics on the data. To your point, we don't want to say something is "significant" if it's 5 conversions vs 1. This is one area we're looking to improve with better heuristics. We can't completely take the human out of the loop, but we can help give them all the info they need to make the best decision. On that front, we do show Bayesian expected loss (risk) and credible intervals in addition to just the "chance to beat control".
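To give a feel for what those numbers mean, here's a rough sketch of the standard Beta-Binomial version of these calculations for a simple conversion metric. This is illustrative only (flat Beta(1,1) priors, made-up counts, plain Monte Carlo), not our actual engine; Itamar's articles linked in this thread cover the real details:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up aggregate data: (users, conversions) per variation.
control = {"users": 10_000, "conversions": 510}
variant = {"users": 10_000, "conversions": 560}

# Beta(1, 1) prior + binomial likelihood -> Beta posterior on each conversion rate.
post_control = rng.beta(1 + control["conversions"],
                        1 + control["users"] - control["conversions"],
                        size=200_000)
post_variant = rng.beta(1 + variant["conversions"],
                        1 + variant["users"] - variant["conversions"],
                        size=200_000)

# "Chance to beat control": P(variant rate > control rate).
chance_to_beat = np.mean(post_variant > post_control)

# Expected loss ("risk") of shipping the variant: the average conversion rate
# you give up in the scenarios where the control is actually better.
expected_loss = np.mean(np.maximum(post_control - post_variant, 0))

# 95% credible interval on the relative lift.
lift = post_variant / post_control - 1
ci_low, ci_high = np.percentile(lift, [2.5, 97.5])

print(f"chance to beat control: {chance_to_beat:.1%}")
print(f"expected loss (risk):   {expected_loss:.5f}")
print(f"95% CI on lift:         [{ci_low:.1%}, {ci_high:.1%}]")
```

The expected loss is what makes a stopping decision feel safe: even if the "chance to beat control" is only 90%, you might still ship if the most you stand to lose is a rounding error on the conversion rate.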
Can you use the system to analyse results of tests it didn't run? I.e., if I run tests using some SaaS that only supports frequentist stats, could I use your system as a Bayesian analysis backend?
Yes. As long as the variation assignment data and success metrics are in a supported data source (SQL, GA, or Mixpanel currently), it can be queried and analyzed in Growth Book.
From the looks of it, the configuration can't be stored in the code repository itself. This is one of the key things to do - treating configuration as code and properly versioning it, blaming it, etc.
That's on our roadmap. We originally built the tool as a multi-tenant hosted platform so storing configs in a database made the most sense initially. For self hosting, we want to support defining db connections and metrics using yml.
I'm kind of curious how they've tackled the multi-armed bandit and early-stopping problems inherent to A/B testing, but so far I've only found that they use some form of Bayesian statistics with unknown priors and likelihoods (except when you pick binomial, I suppose, though the prior is still unknown).
They seem to allow filtering/drilling down by various categories, which would make statistical significance even more of a concern.
Hi! One of the authors here. We're using MongoDB to store cached A/B test results (among other things), which are deeply nested JSON objects. MongoDB let us develop features really quickly, so it's been a great choice for us so far. We're willing to add support for another data store if there's a lot of demand for it.
I would also suggest supporting an alternative to MongoDB. Postgres using jsonb is a great option.
I try to always use and support open source components, as open source provides much less business risk. Since MongoDB isn't itself open source, I would be hesitant to adopt it or a product that depends on it. Mongo also has a bad reputation...
I would definitely evaluate and likely use your product if it did not depend on MongoDB.
Totally think it's good to have lots of options. PostgreSQL using JSONB however is a way to hurt your head. Using SQL to manipulate JSON is pretty painful.
Why would MongoDB Community not be an ok choice? Unless you're planning on offering MongoDB as a cloud service, what would the concern be?
What's the bad reputation of MongoDB that you're concerned about?
Also, you seem to have a really strong bias against it - can you explain?
> PostgreSQL using JSONB however is a way to hurt your head
Really? I have used it pretty extensively and like it... I don't do a lot of complex manipulations though, it might be a pain for some use cases.
> Why would MongoDB Community not be an ok choice?
MongoDB Community is SSPL licensed, which is not Open Source. While I don't intend to offer a MongoDB hosting service, I want the option to fork the code and create (or pay someone else to fork the code and create) a hosting service for me to use. This is important because MongoDB Inc's business may not always align well with my business and my needs (or they may simply decide they don't want to do business with me; maybe they go out of business, their business focus shifts, or political pressures come to bear). The option to create a viable community fork is critical to ensuring that the software remains usable. The business risk of relying on proprietary software is great, and the more reliant you are on it, the bigger the risk.
> What's the bad reputation of MongoDB that you're concerned about?
Mongo has a long history of Jepsen test failures. See http://jepsen.io/analyses/mongodb-4.2.6 and the linked articles from that page. In addition, I have heard many confirmations of these issues from folks who have used it in production.
> Also, you seem to have a really strong bias against it - can you explain?
I think I have explained my position above. I don't have any interest in Mongo or any of its competitors. I don't personally know anyone involved with it or any of its competitors (Though I have naturally had professional contact with some.) My strong preference, as previously stated, is for Open Source software. This preference applies broadly to all software, but especially to infrastructure software, and is by no means specific to MongoDB.
Honestly, I'd love to see it just use SQLite as a backend. If it's just storing results, that seems feasible and it would reduce the complexity of the tech stack.
This was also one of the first things I noticed. Since we don't use MongoDB, it would be a decently large operational addition to our stack. Maybe it's needed at larger scales, but for most companies a SQL server should be good enough.
Do you mind explaining that a little more? As it's currently designed, a company could use DBT to model their raw data into dedicated metric tables, and then Growth Book sits on top of those with a really simple SQL query and some settings (e.g. whether the goal is to increase or decrease the metric).
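For what it's worth, here's roughly the shape I have in mind. Everything below is a made-up schema, and it uses SQLite purely as a stand-in for whatever warehouse you run DBT against; the point is just that once a metric table has one row per user, the experiment query is a join plus per-variation aggregates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-ins for what DBT models might produce: one table of experiment
# assignments and one metric table with a single row per user.
conn.executescript("""
    CREATE TABLE experiment_assignments (user_id TEXT, experiment TEXT, variation TEXT);
    CREATE TABLE metric_revenue (user_id TEXT, value REAL);

    INSERT INTO experiment_assignments VALUES
        ('u1', 'new-checkout', 'control'), ('u2', 'new-checkout', 'control'),
        ('u3', 'new-checkout', 'variant'), ('u4', 'new-checkout', 'variant');
    INSERT INTO metric_revenue VALUES ('u1', 0), ('u2', 12.5), ('u3', 20.0), ('u4', 8.0);
""")

# The per-variation aggregates are all the stats engine needs: n, mean, variance.
rows = conn.execute("""
    SELECT a.variation,
           COUNT(*)                  AS users,
           AVG(COALESCE(m.value, 0)) AS mean_value,
           -- population variance via E[x^2] - E[x]^2 (SQLite has no built-in STDDEV)
           AVG(COALESCE(m.value, 0) * COALESCE(m.value, 0))
             - AVG(COALESCE(m.value, 0)) * AVG(COALESCE(m.value, 0)) AS variance
    FROM experiment_assignments a
    LEFT JOIN metric_revenue m ON m.user_id = a.user_id
    WHERE a.experiment = 'new-checkout'
    GROUP BY a.variation
""").fetchall()

for variation, users, mean_value, variance in rows:
    print(variation, users, round(mean_value, 2), round(variance, 2))
```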
I'd love to see an open source standard way to define metrics, but haven't found anything yet.
We plan to add MySQL/MariaDB support soon which should let you use Matomo data as long as you have raw SQL access. For cloud-hosted Matomo, we would have to use the reporting API, which is doable but not as good since there's no way to get standard deviations out of it as far as I can tell.
I am also building a self-hosted analytics platform[0] that has a MySQL/MariaDB database, and I provide a way of recording A/B test data. Currently the visualization of the results is not that good, so using a tool like GrowthBook makes sense. I assume that once the MySQL support is added, it would be possible to import userTrack data into GrowthBook?
We don't have native mobile SDKs yet, but it's something we want to support in the future. Mobile is a little tricky since you either need to do a new release every time you want to start/stop a test or use remote config and deal with offline, slow networks, etc.
Not necessarily. Granted, you obviously need to make a network call at some point, but you can host the entire experiment configuration in a file served through a CDN, and the mobile client only needs to download it once (it can be cached however is appropriate for your use case).
The client SDK can then derive the experiment allocation locally: you pass the required parameters into the SDK, and it computes the assignment from the cached configuration without any further network calls.
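Once the config is cached, assignment can be a pure function of the user ID and the experiment definition, so the same user always gets the same variation on any device, with no stored state. A minimal sketch of the idea (Python just to show the logic, with a made-up config format; a real mobile SDK would do the same thing in Swift/Kotlin):

```python
import hashlib
import json

# Example of what a CDN-served experiment config might contain (made up).
CONFIG = json.loads("""
{
  "experiments": [
    {"key": "new-checkout", "variations": ["control", "variant"], "weights": [0.5, 0.5]}
  ]
}
""")

def assign_variation(user_id: str, experiment: dict) -> str:
    """Deterministically map a user to a variation: same input, same output,
    so no server round-trip or stored assignment is needed."""
    # Hash (experiment key, user_id) to a float in [0, 1].
    digest = hashlib.sha256(f"{experiment['key']}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF

    # Walk the cumulative weights to find which variation the bucket falls in.
    cumulative = 0.0
    for variation, weight in zip(experiment["variations"], experiment["weights"]):
        cumulative += weight
        if bucket < cumulative:
            return variation
    return experiment["variations"][-1]

exp = CONFIG["experiments"][0]
print(assign_variation("user-123", exp))  # stable across calls and devices
```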
Ax is for automated optimization using machine learning. You define the parameters and optimization function and it decides the variations, traffic splitting, and everything else for you.
Growth Book is for hypothesis testing. It lets you define and run a specific controlled experiment, and then you can analyze the results and make a decision.
[0] https://github.com/Alephbet/alephbet
[1] https://github.com/Alephbet/lamed
[2] https://www.analytics-toolkit.com/