Wow, this looks very polished. (Out of frustration with Optimizely) I created and maintain a couple of open source A/B testing projects[0][1], but the statistical analysis was always the hardest part, so I'm keen to see what you are doing. We're currently relying on a commercial tool called Analytics Toolkit[2] for this part alone and have been quite happy with it. The owner is very knowledgeable and responsive (no affiliation, just happy customers). I wonder if you can adopt similar ideas/algorithms into the open source tool. That could be useful, I imagine.
Thanks for the comment and for your work on Alephbet! Open source A/B testing is a graveyard of abandoned projects, so it's always great to see more people actively working in this space.
Georgi at Analytics Toolkit definitely knows his stuff. We're taking a Bayesian approach instead, which I know he isn't the biggest fan of, but I think it is much easier to understand. Itamar Faran, the author of our stats engine, has a great article that goes into a lot more detail if you're interested: https://towardsdatascience.com/why-you-should-switch-to-baye...
> Open source A/B testing is a graveyard of abandoned projects
Quite true, sadly. FWIW, we keep maintaining Alephbet, even though honestly I have no clue who's actively using it besides us :) Luckily, the codebase is simple enough that it doesn't require a lot of work.
Re stats: I did implement a Bayesian dashboard of sorts with Alephbet, but I'm not sure it prevents the peeking problem on its own. It requires some discipline when planning the tests to decide ahead of time when to look at results. Disclaimer: my stats chops are virtually non-existent, but that's what I've learned over the years. Georgi's platform really helps structure this process of planning and deciding when to stop the experiment (either when it succeeds or when it fails).
Another small (but in my experience important) thing that sets Alephbet apart from other A/B testing platforms: ad blockers. Mixpanel, GA, Amplitude, etc. frequently and trivially get blocked by ad blockers on the client. For client-side A/B tests this can reduce data quality (even though A/B tests are typically not privacy-invasive). Alephbet's Lamed[0] backend allows you to create a custom AWS URL that's far less likely to get blocked. In my experience, data quality with Alephbet is higher than what we see in, e.g., Amplitude.
Does your system store the stats, or does it rely on them being stored in, e.g., GA, and then just let you analyze them?
Is it appropriate to send email alerts when "significance" is reached? Without adhering to minimum sample sizes calculated in advance, won't this result in a bunch of Type I errors?
Are the changes to the pages made client-side or server-side? I think client-side, but I'm not sure. If so, are they synchronous or asynchronous?
1. We don't store any raw user data. We pull things like mean and standard deviation from data sources, run the statistics, and store the result (there's a rough sketch of what gets stored right after this list).
2. We use a Bayesian statistics engine, which is much less susceptible to peeking problems and Type I errors than frequentist approaches.
3. Tests can be run either client-side or server-side. For client-side, we recommend bundling the SDK with your app (webpack, etc.). We really care about performance, so we never want to add additional HTTP requests or script tags if at all possible.
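To make point 1 concrete, the only thing a given analysis run needs to persist is a small per-variation summary plus the computed result, never user-level rows. A rough illustration (the field names and numbers here are made up for the example, not our exact schema):

```python
import json

# Illustrative only: per-variation aggregates pulled from the data source,
# plus the output of the stats engine. No raw user data anywhere.
snapshot = {
    "experiment": "new-checkout",
    "metric": "revenue_per_user",
    "variations": [
        {"name": "control", "users": 10000, "mean": 4.12, "stddev": 9.80},
        {"name": "variant", "users": 10000, "mean": 4.35, "stddev": 10.10},
    ],
    # The only "result" that gets stored and rendered in the dashboard.
    "results": {"chance_to_beat_control": 0.93, "expected_loss": 0.004},
}

print(json.dumps(snapshot, indent=2))
```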
1. How do you get around needing session-level data instead of aggregate data when working with non-parametric KPIs? GA in particular is notorious for sampling data.
2. True, but you can't get away from the fact that a split test run for only a day or two isn't going to give you trustworthy results. It's features like this, which abstract away the statistical reality for lay users, that cause poor decisions to be made under the guise of being "data driven". I think we as testers, and you as a provider of a testing system, have a duty not to lead businesses to believe that they are making statistically sound choices when they may not be.
1. GA is very limited as a data source because of sampling and the fact that they don't expose variance. So if using GA, we only support simple binomial metrics, count data (assuming Poisson distribution), and duration data (assuming exponential distribution). For SQL data sources and non-parametric data, we currently rely on the CLT and treat the sampling distribution as Normal. There's a good article that goes over the stats in more detail (Itamar, the author, wrote our stats engine) - https://towardsdatascience.com/how-to-do-bayesian-a-b-testin...
2. We have a minimum sample size threshold before we run any statistics on the data. To your point, we don't want to say something is "significant" if it's 5 conversions vs 1. This is one area we're looking to improve with better heuristics. We can't completely take the human out of the loop, but we can help give them all the info they need to make the best decision. On that front, we do show Bayesian expected loss (risk) and credible intervals in addition to just the "chance to beat control".
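To give a feel for what those numbers mean, here's a rough sketch of the standard Beta-Binomial version of these calculations for a simple conversion metric. This is illustrative only (flat Beta(1,1) priors, made-up counts, plain Monte Carlo), not our actual engine; Itamar's articles linked in this thread cover the real details:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up aggregate data: (users, conversions) per variation.
control = {"users": 10_000, "conversions": 510}
variant = {"users": 10_000, "conversions": 560}

# Beta(1, 1) prior + binomial likelihood -> Beta posterior on each conversion rate.
post_control = rng.beta(1 + control["conversions"],
                        1 + control["users"] - control["conversions"],
                        size=200_000)
post_variant = rng.beta(1 + variant["conversions"],
                        1 + variant["users"] - variant["conversions"],
                        size=200_000)

# "Chance to beat control": P(variant rate > control rate).
chance_to_beat = np.mean(post_variant > post_control)

# Expected loss ("risk") of shipping the variant: the average conversion rate
# you give up in the scenarios where the control is actually better.
expected_loss = np.mean(np.maximum(post_control - post_variant, 0))

# 95% credible interval on the relative lift.
lift = post_variant / post_control - 1
ci_low, ci_high = np.percentile(lift, [2.5, 97.5])

print(f"chance to beat control: {chance_to_beat:.1%}")
print(f"expected loss (risk):   {expected_loss:.5f}")
print(f"95% CI on lift:         [{ci_low:.1%}, {ci_high:.1%}]")
```

The expected loss is what makes a stopping decision feel safe: even if the "chance to beat control" is only 90%, you might still ship if the most you stand to lose is a rounding error on the conversion rate.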
Can you use the system to analyse results of tests it didn't run? I.e., if I run tests using some SaaS that only supports frequentist stats, could I use your system as a Bayesian analysis backend?
Yes. As long as the variation assignment data and success metrics are in a supported data source (SQL, GA, or Mixpanel currently), it can be queried and analyzed in Growth Book.
From the looks of it, the configuration can't be stored in the code repository itself. This is one of the key things to do - treating configuration as code and properly versioning it, blaming it, etc.
That's on our roadmap. We originally built the tool as a multi-tenant hosted platform so storing configs in a database made the most sense initially. For self hosting, we want to support defining db connections and metrics using yml.
I'm kind of curious how they've tackled the multi-armed bandit and early-stopping problems inherent to A/B testing, but so far I've only found that they use some form of Bayesian statistics with unknown priors and likelihoods (except when you pick binomial, I suppose, though the prior is still unknown).
They seem to allow filtering/drilling down by various categories, which would make statistical significance even more of a concern.
Hi! One of the authors here. We're using MongoDB to store cached A/B test results (among other things), which are deeply nested JSON objects. MongoDB let us develop features really quickly, so it's been a great choice for us so far. We're willing to add support for another data store if there's a lot of demand for it.
I would also suggest supporting an alternative to MongoDB. Postgres using jsonb is a great option.
I try to always use and support open source components, as open source provides much less business risk. Since MongoDB isn't itself open source, I would be hesitant to adopt it or a product that depends on it. Mongo also has a bad reputation...
I would definitely evaluate and likely use your product if it did not depend on MongoDB.
Totally think it's good to have lots of options. PostgreSQL using JSONB however is a way to hurt your head. Using SQL to manipulate JSON is pretty painful.
Why would MongoDB Community not be an ok choice? Unless you're planning on offering MongoDB as a cloud service, what would the concern be?
What's the bad reputation of MongoDB that you're concerned about?
Also, you seem to have a really strong bias against it - can you explain?
> PostgreSQL using JSONB however is a way to hurt your head
Really? I have used it pretty extensively and like it... I don't do a lot of complex manipulations though, it might be a pain for some use cases.
> Why would MongoDB Community not be an ok choice?
MongoDB Community is SSPL licensed, which is not Open Source. While I don't intend to offer a MongoDB hosting service, I want the option to fork the code and create (or pay someone else to fork the code and create) a hosting service for me to use. This is important because MongoDB Inc's business may not always align well with my business and my needs (or they may simply decide they don't want to do business with me; maybe they go out of business, their business focus shifts, or political pressures come to bear). The option to create a viable community fork is critical to ensuring that the software remains usable. The business risk of relying on proprietary software is great, and the more reliant you are on it, the bigger the risk.
> What's the bad reputation of MongoDB that you're concerned about?
Mongo has a long history of Jepsen test failures. See http://jepsen.io/analyses/mongodb-4.2.6 and the linked articles from that page. In addition, I have heard many confirmations of these issues from folks who have used it in production.
> Also, you seem to have a really strong bias against it - can you explain?
I think I have explained my position above. I don't have any interest in Mongo or any of its competitors. I don't personally know anyone involved with it or any of its competitors (Though I have naturally had professional contact with some.) My strong preference, as previously stated, is for Open Source software. This preference applies broadly to all software, but especially to infrastructure software, and is by no means specific to MongoDB.
Honestly, I'd love to see it just use SQLite as a backend. If it's just storing results, that seems feasible and it would reduce the complexity of the tech stack.
This was also one of the first things I noticed. Since we don't use MongoDB, it would be a decently large operational addition to our stack. Maybe it's needed at larger scales, but for most companies a SQL server should be good enough.
Do you mind explaining that a little more? As it's currently designed, a company could use DBT to model their raw data into dedicated metric tables, and then Growth Book sits on top of those with a really simple SQL query and some settings (e.g. whether the goal is to increase or decrease the metric).
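For what it's worth, here's roughly the shape I have in mind. Everything below is a made-up schema, and it uses SQLite purely as a stand-in for whatever warehouse you run DBT against; the point is just that once a metric table has one row per user, the experiment query is a join plus per-variation aggregates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-ins for what DBT models might produce: one table of experiment
# assignments and one metric table with a single row per user.
conn.executescript("""
    CREATE TABLE experiment_assignments (user_id TEXT, experiment TEXT, variation TEXT);
    CREATE TABLE metric_revenue (user_id TEXT, value REAL);

    INSERT INTO experiment_assignments VALUES
        ('u1', 'new-checkout', 'control'), ('u2', 'new-checkout', 'control'),
        ('u3', 'new-checkout', 'variant'), ('u4', 'new-checkout', 'variant');
    INSERT INTO metric_revenue VALUES ('u1', 0), ('u2', 12.5), ('u3', 20.0), ('u4', 8.0);
""")

# The per-variation aggregates are all the stats engine needs: n, mean, variance.
rows = conn.execute("""
    SELECT a.variation,
           COUNT(*)                  AS users,
           AVG(COALESCE(m.value, 0)) AS mean_value,
           -- population variance via E[x^2] - E[x]^2 (SQLite has no built-in STDDEV)
           AVG(COALESCE(m.value, 0) * COALESCE(m.value, 0))
             - AVG(COALESCE(m.value, 0)) * AVG(COALESCE(m.value, 0)) AS variance
    FROM experiment_assignments a
    LEFT JOIN metric_revenue m ON m.user_id = a.user_id
    WHERE a.experiment = 'new-checkout'
    GROUP BY a.variation
""").fetchall()

for variation, users, mean_value, variance in rows:
    print(variation, users, round(mean_value, 2), round(variance, 2))
```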
I'd love to see an open source standard way to define metrics, but haven't found anything yet.
We plan to add MySQL/MariaDB support soon which should let you use Matomo data as long as you have raw SQL access. For cloud-hosted Matomo, we would have to use the reporting API, which is doable but not as good since there's no way to get standard deviations out of it as far as I can tell.
I am also building a self-hosted analytics platform[0] that has a MySQL/MariaDB database, and I provide a way of recording A/B test data. Currently the visualization of the results is not that good, so using a tool like GrowthBook makes sense. I assume that once the MySQL support is added, it would be possible to import userTrack data into GrowthBook?
We don't have native mobile SDKs yet, but it's something we want to support in the future. Mobile is a little tricky since you either need to do a new release every time you want to start/stop a test or use remote config and deal with offline, slow networks, etc.
Not necessarily. Granted, you obviously need to make a network call at some point, but you can host the entire experiment configuration in a file served through a CDN, and the mobile client only needs to download it once (it can be cached however is appropriate for your use case).
The client SDK can then derive the experiment allocation locally: you pass the required parameters into the SDK, and it computes the assignment from the cached configuration without any further network calls.
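Once the config is cached, assignment can be a pure function of the user ID and the experiment definition, so the same user always gets the same variation on any device, with no stored state. A minimal sketch of the idea (Python just to show the logic, with a made-up config format; a real mobile SDK would do the same thing in Swift/Kotlin):

```python
import hashlib
import json

# Example of what a CDN-served experiment config might contain (made up).
CONFIG = json.loads("""
{
  "experiments": [
    {"key": "new-checkout", "variations": ["control", "variant"], "weights": [0.5, 0.5]}
  ]
}
""")

def assign_variation(user_id: str, experiment: dict) -> str:
    """Deterministically map a user to a variation: same input, same output,
    so no server round-trip or stored assignment is needed."""
    # Hash (experiment key, user_id) to a float in [0, 1].
    digest = hashlib.sha256(f"{experiment['key']}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF

    # Walk the cumulative weights to find which variation the bucket falls in.
    cumulative = 0.0
    for variation, weight in zip(experiment["variations"], experiment["weights"]):
        cumulative += weight
        if bucket < cumulative:
            return variation
    return experiment["variations"][-1]

exp = CONFIG["experiments"][0]
print(assign_variation("user-123", exp))  # stable across calls and devices
```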
Ax is for automated optimization using machine learning. You define the parameters and optimization function and it decides the variations, traffic splitting, and everything else for you.
Growth Book is for hypothesis testing. It lets you define and run a specific controlled experiment, and then you can analyze the results and make a decision.
[0] https://github.com/Alephbet/alephbet
[1] https://github.com/Alephbet/lamed
[2] https://www.analytics-toolkit.com/