(Workbook) Quantifying Incrementality: A/B Testing, Mean as a Model, Statistical Analysis, and Confidence Intervals

(Note: for those interested, the Jupyter notebook with all the code for this post is available for download on my Decision Sciences GitHub repository)

In a previous post, we explored the pivotal roles of incrementality and lift in Marketing Decision Sciences. Today, we dive deeper into these measurements using a straightforward example—an A/B split test—to demonstrate how incremental gain and lift are calculated. This discussion also serves as a gentle introduction to some foundational statistical concepts:

  • Data as a distribution
  • The mean as a simple predictive model (i.e. point-forecast)
  • Comparing two means via hypothesis testing
  • Creating confidence intervals

Although I have spent a great deal of my career learning and applying models designed to outperform the mean (as the line of best fit), it is still beneficial to revisit it from time to time because it is so prevalent in all Marketing measurement. When doing predictive modelling and hypothesis testing, the mean can serve as a baseline against which to compare and evaluate the effectiveness of other models. It is also commonly used in more advanced simulation and probabilistic programming methods, which are important tools for Decision Scientists and which we will explore in more detail in the future.

In A/B testing, the difference between group means (control vs. treated) helps assess the impact of interventions.

A/B Testing to Measure Incrementality

A/B Split Testing is a straightforward comparative method where two versions of a single variable (like an offer, web page, advertisement, or product feature) are tested against each other to determine which one performs better on a specified metric. This method divides the audience randomly into two groups: the Control group (A) receives the original version, and the Treatment group (B) receives the variant. The key outcome measured is typically direct—such as click-through rates, conversion rates, or sales. The focus here is on which version is more effective under the same conditions. By employing randomization in our Control and Treatment groups, we aim to ensure that any observed difference is due to the intervention alone, thus establishing a causal relationship. We use the mean of outcomes from both groups to assess the effectiveness of the tested change, measuring the additional value, or incremental lift, of the Treatment group over the Control group (hopefully).

Note: A/B Split Testing vs. Incrementality Testing

There is another testing method actually called Incrementality Testing. This method is different: it focuses on understanding the additional value, or ‘lift’, that marketing efforts bring over and above what would have happened naturally without those efforts. It answers the critical question: “Is this marketing activity adding any extra value?”. This is intentionally different from comparing two treatments directly and then measuring incrementality, which is what we are doing here. In future posts I will cover the similarities and differences between A/B split testing and Incrementality Testing in more detail.

Scenario

To boost both customer loyalty and profitability, your company has introduced a rewards program that incentivizes repeat purchases and increased engagement. This program is designed to appeal to your existing customer base, offering points, discounts, and exclusive access to new products based on customer activity and purchase history. The primary promotional tool for this loyalty program is targeted email campaigns, which aim to educate customers about the benefits of participating in the program and encourage them to take actions that increase their rewards. Initial analyses suggest that customers who receive these promotional emails show a higher Average Order Value (AOV) compared to those who do not, indicating potential effectiveness of the campaign. However, a deeper dive into the data reveals that emails may have been sent disproportionately to customers already exhibiting higher engagement or purchasing frequencies.

This insight suggests a selection bias that could skew the perceived impact of the loyalty program on AOV. The higher AOV observed in the targeted group might not solely be a result of the loyalty program but also of their inherent purchasing behaviors. To isolate the true effect of the loyalty program on AOV, it is crucial to implement a randomized control trial. By randomly distributing the promotional emails across a broad and varied customer base, you ensure that any observed differences in AOV post-campaign are directly attributable to the program itself, rather than underlying differences in customer behavior. This methodological approach will provide a clear and unbiased assessment of how effectively the loyalty program enhances customer spending and retention, guiding more accurate and strategic enhancements to the program.


Here’s how we set it up for our mini use case:

  • Control Group received the traditional offer
  • Treatment Group received the new promotional offer

For simplicity, there are 100 customers in each group who made purchases.

Objective: to measure the Incremental Gain and Lift from the experiment to see if the new offer significantly impacts the Average Order Value (AOV).

Preliminary Results

Here are the AOVs for both groups:

  • Control Group: $74.17
  • Treatment Group: $80.22

Understanding Incremental Gain and Lift

Incremental Gain and Lift offer precise measures of how much a specific marketing activity contributes to overall performance. This quantification helps marketers prove the effectiveness of their strategies beyond general metrics like total sales or leads. By isolating the direct effects of a campaign, these metrics show the additional value generated specifically due to the marketing intervention. These metrics offer objective criteria for evaluating the success of marketing strategies and tactics, which is essential for developing long-term marketing plans and for scaling successful experiments.

Importantly, although these metrics are very simple to calculate, when applied to the right data sets or analyses, they directly link marketing actions to financial outcomes. By quantifying exactly how much value is added by specific initiatives, Incremental Gain and Lift provide a clear measure of return on investment (ROI), making them crucial for justifying marketing spend.

  • Incremental Gain: this measure reflects the additional value generated by the Treatment group over the Control group. It helps us understand the absolute increase in performance due to our intervention. It’s calculated as:
\Large \text{Incremental Gain} = \text{Metric}_{\text{treat}} - \text{Metric}_{\text{control}}

In our A/B test scenario, if the Treatment group (those who received the new offer) shows an AOV higher than that of the Control group, the difference between these values represents the Incremental Gain.

  • Lift (%): Lift provides a relative measure of how much better the Treatment group performed compared to the Control group. It’s expressed as a percentage:
\Large \text{Lift} (\%) = \left( \frac{ \text{Metric}_{\text{treat}} - \text{Metric}_{\text{control}}}{\text{Metric}_{\text{control}}} \right) \times 100\%

Lift helps contextualize the Incremental Gain by comparing it to the baseline performance of the Control group. A positive lift indicates that the Treatment was effective in enhancing the desired outcome.

Using these formulas, we calculate that the randomized experiment recorded an incremental gain of $6.05, meaning the Treatment group’s AOV is on average $6.05 higher than that of the Control group. The lift, calculated from the baseline AOV of the Control group, is approximately 8.16%. 
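For reference, here is a minimal Python sketch of these two calculations, using the group AOVs reported above:

```python
# Incremental Gain and Lift from the observed group AOVs
aov_control = 74.17    # Control group AOV ($)
aov_treatment = 80.22  # Treatment group AOV ($)

incremental_gain = aov_treatment - aov_control
lift_pct = (incremental_gain / aov_control) * 100

print(f"Incremental Gain: ${incremental_gain:.2f}")  # $6.05
print(f"Lift: {lift_pct:.2f}%")                      # 8.16%
```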

Mean as a Model

In statistics, the mean serves as one of the most fundamental metrics for summarizing data. The mean, or average, is essentially a point-forecast used to represent the central tendency of a data distribution. This single value is calculated by summing all the observations in a dataset and dividing by the number of observations, providing a quick snapshot of the ‘center’ of the data. By identifying where most values tend to cluster, the mean helps analysts make informed assumptions about the data as a whole, facilitating easier interpretation and comparison across different datasets, variables, or treatments.

Therefore, the mean not only summarizes the central tendency but also anchors a deeper statistical analysis, offering a simple way to predict and quantify the effect of new innovations or changes within the experimental framework of A/B testing.

The mean serves as a simple statistical model in this context. It can be expressed as each observed outcome being the overall mean plus some error term:

\Large \text{outcome}_i = \text{model}_i + \text{error}_i

This formula underpins many statistical models, including linear regression. Here, ‘Model’ is the mean (the value our experiment predicts for each customer), and ‘Error’ is the residual, showing how far each individual customer’s AOV falls from that mean. We could re-write the formula as follows:

\Large \text{outcome}_i = \text{mean} + \text{error}_i

Extending the “Mean as a Model” to Control and Test Groups

The mean extends beyond simply summarizing data; it also acts as a basic predictive model, particularly useful in scenarios like A/B split testing. In this context, the mean serves as a baseline model for forecasting outcomes. By calculating the mean of outcomes for each group in an A/B test (Control and Treatment), it predicts that future outcomes will, on average, resemble these central values. The mean of each group’s results provides a straightforward comparison point. By assessing whether the Treatment group’s mean significantly deviates from the Control group’s mean, Decision Scientists can infer the impact of the new variable being tested.

This is mathematically represented as:

  • For the Control group:
    • \Large \text{outcome}_{iA} = \overline{X}_A + \text{error}_{iA}
  • For the Treatment group:
    • \Large \text{outcome}_{iB} = \overline{X}_B + \text{error}_{iB}

Where:

  • \overline{X}_A and \overline{X}_B are the means of the Control and Treatment groups, respectively
  • \text{error}_{iA} and \text{error}_{iB} represent the variation or deviation of individual outcomes from their group means.


These formulas help us visualize each observed value in the experiment as composed of two parts:

  1. Predicted Mean: this is what our model (the mean of each group) predicts based on the group the individual belongs to.
  2. Error: this is the deviation (or spread) of the actual observed outcome from the predicted mean, capturing the variability within each group.

Visualizing the Variance and Error Terms:

Visualizations of data variance and individual deviations (errors) from the mean illustrate the spread and consistency of data within each group. These visualizations are helpful for understanding the underlying distribution of our data, which directly affects the reliability of our mean as a model.

Using the “mean as a model” equations above, we can assess how well our model predicts actual outcomes and understand the variability within each group. The error term in each equation provides insights into the spread and distribution of data around the group means, which is pivotal for further statistical analysis like hypothesis testing and creating confidence intervals.
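To make this concrete, here is a minimal sketch of the “mean as a model” decomposition. The order values are simulated stand-ins (normal draws centered on each group’s observed AOV, with an assumed spread of $20); the notebook works with the actual data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data only: normal draws around each group's observed AOV,
# with an assumed standard deviation of $20 (the real values live in the notebook)
control = rng.normal(loc=74.17, scale=20, size=100)
treatment = rng.normal(loc=80.22, scale=20, size=100)

# "Mean as a model": the prediction for every customer is their group mean...
pred_control = control.mean()
pred_treatment = treatment.mean()

# ...and the error term is each customer's deviation from that mean
errors_control = control - pred_control
errors_treatment = treatment - pred_treatment

print(f"Control mean:   {pred_control:.2f}, residual SD: {errors_control.std(ddof=1):.2f}")
print(f"Treatment mean: {pred_treatment:.2f}, residual SD: {errors_treatment.std(ddof=1):.2f}")
```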

Statistical Aside: Understanding Overlapping Distributions and Statistical Significance

A common misconception in interpreting results from statistical tests is the assumption that if the sample distributions of two groups overlap, there cannot be a statistically significant difference between their means. This conclusion, while intuitive, is not accurate and can lead to misunderstandings about the results of a test.

When we visualize data distributions from two groups, such as in A/B testing, seeing the distributions overlap is quite common. However, the degree of overlap does not necessarily negate the possibility of a statistically significant difference between the group means. This is because statistical significance in testing the difference between means (as conducted in a t-test, for instance) is influenced not just by the amount of overlap but by several key factors:

  1. Sample Size: Larger samples provide more reliable estimates of the population parameters and can detect smaller differences between means as statistically significant.
  2. Variability Within Samples: If the data within each group has low variability (i.e., the data points are closely clustered around the mean), even a small shift in the mean between two groups can be statistically significant.
  3. Magnitude of the Difference: The actual difference between the means relative to the variability within the groups plays a crucial role. A statistically significant difference is indicated when the calculated test statistic (reflecting this difference) exceeds the critical value determined by the desired confidence level.

The key takeaway is that the mere visual overlap of distributions in a graph does not provide a full picture of statistical significance. Statistical tests like the t-test quantify the likelihood that any observed difference in means could have arisen if there were actually no difference (the null hypothesis). If this likelihood (p-value) is very low (typically less than 5%), the result is declared statistically significant, meaning it is unlikely to have occurred by chance due to sampling error, regardless of how the distributions appear to overlap visually.

In essence, statistical significance is determined by a combination of the size of the difference, the variability of the data, and the sample size, not just by whether the distributions overlap. This understanding is crucial in avoiding erroneous conclusions and in making informed decisions based on statistical data analysis.
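A small simulation makes the point concrete: the two samples below overlap heavily, yet with a large enough sample size the t-test still detects the small shift in means. The means, spread, and sample sizes here are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two heavily overlapping distributions: the means differ by only 2 units
# against a within-group spread of 15, yet with n = 2,000 per group
# the t-test still flags the difference as highly significant
a = rng.normal(loc=100, scale=15, size=2000)
b = rng.normal(loc=102, scale=15, size=2000)

t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```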

Statistical Analysis

A t-test is a statistical method used in hypothesis testing when the objective is to determine whether there is a significant difference between the means of two groups, under the assumption that the data are approximately normally distributed. The t-test calculates the likelihood that any observed difference between the sample means reflects a true difference between the test segments rather than random chance.

The essence of the t-test lies in its ability to compare the actual difference to the variability or spread of the data. As discussed in more detail below, it uses the standard error of the difference between the means to normalize the difference, creating a ‘t-statistic’. This statistic is then compared against a critical value from the t-distribution, which acts as a threshold, or “hurdle rate”, that the statistic must clear for us to favor the alternative hypothesis. The t-distribution is primarily used in situations where you need to estimate how likely it is that a certain statistic (like the mean difference between two groups) actually represents a real difference.

Hypothesis Testing Framework

  • Null Hypothesis (H0): the means of both groups are equal (Mean Control = Mean Treatment)
  • Alternative Hypothesis (H1): the mean of the Treatment group is greater than the mean of the Control group (Mean Treatment > Mean Control)

Written using Expected-Value notation, our alternative hypothesis would look like the following equation, indicating that the expected AOV for customers who received the promotional email will be greater than the expected AOV for customers who received the traditional email:

\Large E[\text{AOV} \mid \text{Promo} = \text{Yes}] > E[\text{AOV} \mid \text{Promo} = \text{No}]

The One-Sided T-Test:

The one-sided t-test is a statistical method used when the hypothesis is that one group’s mean is expected to be higher than the other (as we defined above).

For our marketing A/B test, the one-sided t-test helps us evaluate:

  • The difference in means between the control and treatment groups.
  • The spread or variance within each group, as indicated by the residuals/errors.

In simple terms, we can think of the t-distribution (as illustrated in the plot below) as a “hypothetical” distribution that represents what the spread of the mean difference in AOV between groups would look like if the population means of the two groups were the same (i.e. if the null hypothesis is true).
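If you would like to reproduce this kind of plot yourself, here is a minimal sketch. It assumes the pooled two-sample setup from our scenario (100 customers per group, so 198 degrees of freedom) and shades the one-sided 5% rejection region:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Degrees of freedom for a pooled two-sample t-test with 100 customers
# per group: n1 + n2 - 2
df = 100 + 100 - 2

x = np.linspace(-4, 4, 500)
y = stats.t.pdf(x, df)

# One-sided critical value at 95% confidence
crit = stats.t.ppf(0.95, df)

plt.plot(x, y, label=f"t-distribution (df={df})")
plt.fill_between(x, y, where=x >= crit, alpha=0.3, label="rejection region (5%)")
plt.axvline(0, linestyle="--", linewidth=1)
plt.xlabel("t-statistic")
plt.ylabel("density")
plt.legend()
plt.show()
```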

If the calculated test statistic, called the t-statistic, from this test is greater than the critical value from the t-distribution (which corresponds to our desired confidence level, typically 95%), we reject the null hypothesis (that there is no difference) and accept the alternative hypothesis (that the treatment group performs better than the control group).

Note that the t-distribution above is centered at zero because in a t-test the null hypothesis states that there is no difference between the means of the two groups being compared.  Although I won’t cover the formula here (but do show the calculations in my notebook), just know that it measures how many standard errors — i.e. the error/residual values we plotted above, normalized — the difference between the sample means is from zero.

If the actual difference between the sample means is exactly as predicted under the null hypothesis (zero), the t-statistic will be zero. If the difference is greater or less than zero by a margin that exceeds what we might expect by chance alone, the t-statistic will move away from zero, potentially leading to the rejection of the null hypothesis.
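Here is a hedged sketch of running that one-sided test in Python with scipy. It reuses the simulated order values from the earlier sketch (normal draws with an assumed $20 spread); the notebook runs the same test on the real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=74.17, scale=20, size=100)    # illustrative stand-ins
treatment = rng.normal(loc=80.22, scale=20, size=100)

# One-sided t-test: H1 says the Treatment mean exceeds the Control mean
t_stat, p_value = stats.ttest_ind(treatment, control, alternative="greater")

print(f"t-statistic: {t_stat:.3f}")
print(f"one-sided p-value: {p_value:.4f}")
```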

Statistical Aside: The Critical Value and T-Statistic in a T-Test


Critical Value in a T-Test

The critical value is a threshold used to decide whether the test statistic is extreme enough to reject the null hypothesis. It’s determined by the level of confidence we want in our results (typically 95% confidence) and the degrees of freedom in our data (which generally relates to the number of data points in our sample).

In more technical terms, if the absolute value of the t-statistic exceeds the critical value (found in t-distribution tables or calculated using statistical software), the difference between groups is considered statistically significant. This threshold helps control for false positives, ensuring that when you detect a difference between groups, it’s unlikely to be just due to random noise. At 95% confidence, you would expect to flag a difference that isn’t real only 1 time in 20.

Test Statistic in a T-Test

The t-statistic in a t-test measures the ratio of the signal (difference between the means of the two groups being compared) to the noise (variability or dispersion within each group). It essentially tells us how far the observed difference between group means is from the expected difference (zero). 

Here’s how this works and why a larger ratio indicates a more statistically significant result:

Signal to Noise Ratio

  1. Signal: In the case of the t-test, the signal is the difference between the sample means of two groups. This difference is what we are testing—whether it is statistically significant or not. The signal reflects the actual effect or impact of the intervention or difference between groups that we are studying.
  2. Noise: The noise is represented by the standard error of the difference between the means. This standard error encompasses the variability or dispersion within the sample data. It accounts for how spread out the individual observations are around their respective sample means, considering the sample size of each group.

The Test Statistic Calculation

The test statistic for a t-test is calculated as:

t = \frac{\text{Difference between sample means (Signal)}}{\text{Standard error of the difference (Noise)}}
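As a sketch, here is that ratio computed by hand from summary statistics. The group means and sizes come from our scenario; the standard deviations are assumptions for illustration:

```python
import numpy as np

mean_t, mean_c = 80.22, 74.17  # group means from the experiment
sd_t, sd_c = 20.0, 20.0        # assumed sample standard deviations (illustrative)
n_t, n_c = 100, 100            # customers per group

signal = mean_t - mean_c                        # difference between sample means
noise = np.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)  # standard error of the difference

t_stat = signal / noise
print(f"t = {signal:.2f} / {noise:.2f} = {t_stat:.2f}")
```

With these inputs, t = 6.05 / 2.83 ≈ 2.14, comfortably above the one-sided 95% critical value of roughly 1.65 at 198 degrees of freedom.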

Interpretation of the Test Statistic

  • Larger Test Statistic: A larger test statistic indicates that the difference between the group means is large relative to the variability (noise) within the groups. Statistically, this means that the observed difference is less likely to have occurred by chance, making it easier to reject the null hypothesis (which states that there is no true difference between the groups).
  • Smaller Test Statistic: Conversely, a smaller test statistic indicates that the difference between the group means is small relative to the noise. This suggests that the observed difference could easily be due to random variability within the data, and there is insufficient evidence to reject the null hypothesis.

Statistical Significance

The significance of the test statistic is determined by comparing it to a critical value from the t-distribution, which considers the desired level of confidence (commonly 95%) and the degrees of freedom in the data. The critical value defines the threshold for what constitutes a “large enough” test statistic.

  • If the absolute value of the test statistic is greater than the critical value, it indicates that the probability of observing such a difference in means under the null hypothesis is very low (typically less than 5%), leading to rejection of the null hypothesis.
  • If it is less than the critical value, the null hypothesis cannot be rejected.

Confidence Intervals and Understanding Variability

Last but certainly not least, we discuss confidence intervals. The t-test above not only helps determine if two groups differ significantly, but it also enables the calculation of confidence intervals for the difference in means. A confidence interval provides a range of values that is likely to contain the true difference between the population means under study. These intervals are constructed based on the t-distribution, which, as we have discussed, adjusts for the sample size and variability in the data.

The importance of confidence intervals in a t-test lies in their ability to offer more than just a binary decision of ‘different’ or ‘not different.’ They provide a range within which the true effect size is likely to fall, with a specified level of confidence, usually 95%. This means if the experiment were repeated numerous times, 95% of the confidence intervals calculated from those experiments would contain the true difference between the population means.  Here is why they are important:

  1. Quantifying Uncertainty: confidence intervals quantify the uncertainty around the estimated difference between group means. Unlike a simple p-value, which only tells you if an effect exists, confidence intervals provide a magnitude of effect and the precision of this estimate.
  2. Informing Decisions: in practical decision-making, understanding the range of potential true outcomes is crucial. Confidence intervals allow decision-makers to assess the risk and potential range of outcomes of their decisions. For example, in the context of our study above, by calculating confidence intervals for the difference in AOV between customers who received the new offer (Treatment) and those who did not (Control), marketers can assess not only whether the new offer increases AOV but also the range of the potential increase (see the sketch after this list). If the confidence interval for the difference in AOV is both above zero and relatively narrow, it provides strong evidence that the new offer effectively increases sales and does so within a predictable range of improvement. This insight helps in deciding whether to roll out the offer more broadly, weighing the expected increase in sales against the cost of the promotion. It also helps to generate more accurate estimates of incremental customer lifetime value (which we will explore in future posts).
  3. Enhancing Transparency: providing confidence intervals in reporting results promotes transparency. It allows stakeholders to see the potential variability in the data and understand the confidence in the statistical analysis. This can influence investment, policy decisions, and strategic business moves.
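As a concrete sketch, here is one way to compute such an interval from summary statistics (again, the standard deviations are illustrative assumptions; the notebook computes everything from the raw order values):

```python
import numpy as np
from scipy import stats

mean_t, mean_c = 80.22, 74.17  # group means from the experiment
sd_t, sd_c = 20.0, 20.0        # assumed sample standard deviations (illustrative)
n_t, n_c = 100, 100            # customers per group

diff = mean_t - mean_c
se = np.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)  # standard error of the difference
df = n_t + n_c - 2                           # pooled degrees of freedom
t_crit = stats.t.ppf(0.975, df)              # two-sided 95% critical value

lower, upper = diff - t_crit * se, diff + t_crit * se
print(f"Difference: ${diff:.2f}, 95% CI: (${lower:.2f}, ${upper:.2f})")
```

With these assumed inputs the interval sits entirely above zero (roughly $0.47 to $11.63), the pattern described in point 2 above.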

In summary, confidence intervals derived from t-tests are indispensable for robust, data-driven decision-making. They provide a fuller picture of the data’s implications, guiding stakeholders in making informed, measured decisions based on a comprehensive understanding of the statistical confidence and potential variability of the results.

Conclusion

In the context of A/B testing, the mean serves not only as a straightforward measure of central tendency but also as a predictive model that simplifies complex data sets into actionable insights. This dual function of the mean allows marketers to assess and compare the central points of distributions between control and treatment groups, thereby evaluating the effectiveness of tested variables, such as new marketing offers. By calculating the mean for each group—whether it’s average order value, click rates, or any other metric—analysts can succinctly quantify the impact of specific changes or interventions. This approach not only aids in immediate decision-making but also sets a benchmark for measuring incremental gains and lifts, which are crucial for optimizing marketing strategies and resource allocation.

Moreover, understanding the mean within the framework of hypothesis testing, particularly through the use of t-tests, enhances its utility by providing a method to determine statistical significance and confidence intervals. These statistical tools offer deeper insights into the data, allowing marketers to draw more robust conclusions about the effectiveness of new initiatives. Confidence intervals, derived from t-tests, extend the analysis by quantifying the uncertainty around the mean difference and providing a range of expected outcomes. This level of analysis supports more nuanced decision-making, enabling marketers to implement strategies not only based on whether an offer was effective but also on the reliability and predictability of those results.

Ultimately, embracing the mean as a model in A/B testing empowers marketing teams to drive more precise and evidence-based strategies, ensuring that each decision is backed by solid data analysis and contributes to overall business growth.

For those interested, the Jupyter notebook with all the code for this post is available for download on my Decision Sciences GitHub repository.
