This post is about p-values and hypothesis testing. What they are, why they are needed, how to compute them and how to use them. It also includes a worked example, how to validate that an A/B test indeed produces a significant outcome. The article follows quite closely a chapter from “Data Science from Scratch” by Joel Grus and it is annotated with my own research and observations.

## Definitions

1. Null hypothesis (H0) - the hypothesis we want to test against. Equivalent to “there is nothing out of the ordinary”. For a coin toss, for instance, the hypothesis that the coin is fair. For a medicine, that it has no more effect than a sugar pill.

2. Alternative hypothesis (H1) - the hypothesis we want to test for. Something is happening, the coin is unfair or that the medicine has a significant effect.

3. Significance (probability) - how willing are we to make a false positive claim, to reject H0 even if it is true. Due to historical reasons and scientific prudence, it is usually set to 1% or 5%.

4. Type 1 Error - false positive. Reject H0 even if it is true. Say that there is something when there isn’t.

5. Type 2 Error - the opposite of type 1; failure to reject the null hypothesis even though there is something.

6. Power of the test - probability of not making a type 2 error.

7. P-value - instead of choosing bounds based on some probability cutoff (99% or 95%), we compute the probability— assuming H0 is true—that we would see a value at least as extreme as the one we actually observed. wikipedia

Before going forward, I would like to point out these two links:

## The maths

Let’s consider the following problem: a software routine is known to return the result of a complex computation in approx. 1.2 seconds. A software engineer tries to optimize the routine and obtained an average over 100 runs of 1.05 seconds, with a standard deviation of 0.5s. Has the routine been improved?

• H0: the software performs just as it did before.
• H1: the improvement had an effect (not saying wether improvement or worsening).

If we assume H0, that means the mean of the sample should be close enough to the mean of the total distribution.

We define the standard error of the mean for the population `SEM = population standard deviation / SQRT(populatin size)`

Since we have enough samples in our population, over 30, we can approximate to `SEM = sample standard deviation / SQRT(population size)` which leads to `SEM = 0.5/SQRT(100) = 0.5/10 = 0.05`.

We have `Z-score = (population mean - sample mean) / SEM = (1.2 - 1.05) / 0.05 = 3`, which means that our result (sample mean) is 3 standard deviations from the mean of the population, which means we have a `p-value = 0.3%` chance to observe a result as extreme or more extreme to the one we observed, which means we can consider that the improvement worked - we considered an extreme result on both the positive and the negative side (a two-tailed test).

Now, let’s assume the H1 were the software response has improved (lowered), a one-tailed test, our p-value would have been `0.015%`.

Assuming we would have had 10 samples, not 100, the sample mean would not have been normally distributed, but t-distributed and we would have had to use the T-distribution with n-1 degrees of freedom to find the p-values. T-score is computed exactly the same as the Z-score, but the p-values are computed based on the associated T-distribution with n-1 degrees of freedom.

## Code example - unfair coin

We are going to test if a coin is unfair (slightly biased towards the head, with a `p(head) = 0.55`). We are going to use two sample sizes and see how the sample size affects the results.

Before that, here is some prerequisite code:

Notes:

• The coin toss is represented by a `Binomial(n, p)` distribution, where `n` is the number of trials and `p` the probability of hitting head.

• A common rule of thumb is that if both `n*p` and `n * (1 – p)` are greater than `5`, the binomial distribution may be approximated by the normal distribution.

``````import math

#The binomial distribution with
#parameters n and p is the discrete probability distribution of the number of
#successes in a sequence of n independent experiments
def normal_approximation_to_binomial(n, p):
"""finds mu and sigma corresponding to a Binomial(n, p)"""

if n * p < 5 or n * (1 - p) < 5:
raise Exception("Cannot be approximated by a normal distribution")

mu = p * n
sigma = math.sqrt(p * (1 - p) * n)
return mu, sigma

def normal_probability_below(value, mu, sigma):
""" Considering the normal distribution of coin tosses and the central limit theorem,
computes the probability of a value to be below a certain given value. Same as normal_cdf"""
return (1 + math.erf((value - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_above(value, mu, sigma):
return 1 - normal_probability_below(value, mu, sigma)

def normal_probability_between(v1, v2, mu, sigma):
v1, v2 = (v2, v1) if v1 > v2 else (v1, v2)
return normal_probability_below(v2, mu, sigma) - normal_probability_below(v1, mu, sigma)

#Inverse through binary search; not optimal for more than a few values
def interval_probability_centered(p, mu, sigma):
"""
returns the interval (lower, upper) of values for which the probability
of the result to be in it is equal to p.
"""
hw = 9 * sigma # half width of intw
interval_low, interval_high = mu - hw, mu + hw
current_prob = 1.0

while abs(current_prob - p) > 1e-12:
hw /= 2
if p < current_prob:
interval_low, interval_high = interval_low + hw, interval_high - hw
else:
interval_low, interval_high = interval_low - hw, interval_high + hw

current_prob = normal_probability_between(interval_low, interval_high, mu, sigma)

return interval_low, interval_high

def normal_upper_bound(p, mu, sigma):
""" interval (-oo, value) for which sum of probabilities == p. Equivalent to inverse_normal_cdf"""

delta = 9 * sigma
intw_high = mu + delta
current_prob = 1.0
while abs(current_prob - p) > 1e-12:

delta /= 2

while p < current_prob:
intw_high -= delta
current_prob = normal_probability_below(intw_high, mu, sigma)

while p > current_prob:
intw_high += delta
current_prob = normal_probability_below(intw_high, mu, sigma)

return intw_high

def normal_lower_bound(p, mu, sigma):
return normal_upper_bound(1-p, mu, sigma)

``````

Let’s run some tests with the code above:

``````normal_approximation_to_binomial(1000, 0.5)
Out[47]: (500.0, 15.811388300841896)

normal_approximation_to_binomial(100, 0.5)
Out[49]: (50.0, 5.0)
``````

Considering the coin fair (`p=0.5`), the average is 500 heads in case of 1000 trials and 50 heads in case of 100 trials - obviously. The dispersion (sigma), if we consider it as a percentage of the number of trials, is much higher in the case of 100 trials versus the 1000. For 1000 trials, the distribution is significantly narrower.

Assuming 100 trials,

``````normal_probability_below(50, mu, sigma)
Out[53]: 0.5

normal_probability_between(45, 55, mu, sigma)
Out[54]: 0.6826894921370861
``````

50% of the results will be below 50 (first line) and the probability for the result (no. of observed heads) to be between 45 and 55 is 0.68.

``````interval_probability_centered(0.95, mu, sigma)
Out[55]: (40.20018007732688, 59.79981992267312)
``````

For the same `n=100` trials, the number of observed heads will be between 40 and 60 for 95% of the experiment runs.

First hypothesis - coin is biased, but we don’t specify towards which side:

``````# if coin not biased, considering the mu and sigma computed above, for 100 trials and p(head) = 0.5
low, high = interval_probability_centered(0.95, mu, sigma)

# consider the coin biased towards the head, with p(head) = 0.55
mu_biased, sigma_biased = normal_approximation_to_binomial(100, 0.55)
``````

Now we run the following test: what is the probability that we will get a value between `low` and `high` as the margins of 95% confidence if the coin is unbiased, if we consider the coin biased. As we see below, the higher the number of trials, the higher the confidence we have in rejecting the null hypothesis. Power of the test below is the probability of not making a type 2 error, in which we fail to reject H0 even though it’s false.

``````normal_probability_between(low, high, mu_biased, sigma_biased)
Out[57]: 0.8312119922463654

1 - normal_probability_between(low, high, mu_biased, sigma_biased) # power of the test
Out[58]: 0.16878800775363456
``````

Obviously, we cannot discharge the null hypothesis. p-value = 0.83 and the power of the test only 0.168. However, if we run the code for `n=1000` and `p_biased(head) = 0.55`, we get a different picture due to the more narrow distribution. We still cannot reject the null hypothesis, though.

``````normal_probability_between(low, high, mu_biased, sigma_biased)
Out[62]: 0.11345221890056983

1 - normal_probability_between(low, high, mu_biased, sigma_biased) # power of the test
Out[63]: 0.8865477810994302
``````

If we had 2000 trials on the other hand,

``````normal_probability_between(low, high, mu_biased, sigma_biased)
Out[65]: 0.005787749170240164

1 - normal_probability_between(low, high, mu_biased, sigma_biased) # power of the test
Out[66]: 0.9942122508297598
``````

We can pretty confidently reject the null hypothesis.

Now we assume the hypothesis we want to test is a little bit more specific - the coin is biased towards the head. The difference from the previous scenario is that now we make a one-sided test. In the previous case we were less specific, thus the two-sided test has wider margins.

Considering 1000 trials,

``````hi = normal_upper_bound(0.95, mu, sigma)

normal_probability_below(hi, mu_biased, sigma_biased)
Out[74]: 0.06362100214309152

1 - 0.06362100214309152
Out[75]: 0.9363789978569085
``````

While we still cannot confidently reject the null hypothesis with this test, but it is obviously a more powerful one.

## P-values:

Instead of choosing bounds based on some probability cutoff, we compute the probability, assuming H0 is true, that we would see a value at least as extreme as the one we actually observed. From wikipedia,

``````def two_sided_p_value(x, mu=0, sigma=1):
if x >= mu:
# if x is greater than the mean, the tail is what's greater than x
return 2 * normal_probability_above(x, mu, sigma)
else:
# if x is less than the mean, the tail is what's less than x
return 2 * normal_probability_below(x, mu, sigma)

upper_p_value = normal_probability_above
lower_p_value = normal_probability_below
``````

The first thing that is obvious is that the `two_sided_p_value` features a multiplication by 2, whereas the single-sided functions don’t. We talked about two sided versus one sided tests in the previuos chapter. There is one very important result to consider: we can only decide to opt for a single-sided test IF we didn’t look at the data before. That means, that the hypothesis we want to test is not derived from our understanding of the data, but rather is a pure theoretical (blind) speculation. If we already have an idea of what the data is about, we need to opt for the two sided test. This comes from the fact that the data might be already unintentionally biased towards our result (or in the opposite side of our result) by precisely half of our confidence interval.

Here is a more detailed explanation: hypothesis testing with p-values

Let’s use the p-values. While in the previous chapter we estimated the validity of the test for a certain hypothesis, now we are looking at actual results. The problem sounds like: assuming that we obtain `530` tails in a `1000` trials experiment, is the coin biased?

``````two_sided_p_value(529.5, mu, sigma) # continuity correction
Out[76]: 0.06207721579598857
``````

Unfortunately we cannot dismiss the null hypothesis. Had we observed `531` tails, then the result would have looked different.

Note: decreasing `530` by `0.5` in the test above is called continuity correction. It basically asserts that `p(530)` is better estimated by the average of `p(529.5)` and `p(530.5)` than by the average of `p(530)` and `p(531)`. Continuity Correction - Wikipedia

## Example - A/B testing

Let’s put all these things together in a worked example - computing the success of an A/B test campaign.

Let’s assume we have a banner for which we want to make a change and we want to understand if the change brings more clicks and thus profit. Our null hypothesis is “no, the change does not impact the number of clicks”.

We split the clients in two cohorts (`A` and `B`). For cohort `A` we keep the old banner, for cohort `B` we have the banner changed. After `N(A)` views for banner `A` we have `n(A)` click-throughs and after `N(B)` views for banner `B` we have `n(B)` click-throughs. Let’s assume `N(A) == 1000`, `N(B) == 1000`, `n(A) == 180`, `n(B) == 200`. Is `B` a better campaign?

Obviously we have a binomial variable with two posible outcomes: click (1) or not click (0), with a probability to click of `p`. Thus, for one trial, we have:

``````mu = p #expected value
sigma = sqrt(1/2 * ((0 - p)^2 + (1 - p)^2)) = sqrt(p * (1 - p))
``````

For `N` trials we apply the central limit theorem which states:

``````mu_clt = 1/N * (x0 + x1 +... + xN) = 1/N * (1 + 0 + 0 + ... +1) = n/N = mu = p
sigma_clt = sigma / sqrt(N) = sqrt(p * (1 - p)) / sqrt(N) = sqrt(p * (1 - p) / N)
``````

Now we want to test that distribution for `A` is the same as distribution for `B`, that is `p(A) == p(B)` - null hypothesis.

`p(A) == p(B)` means `p(A) - p(B) == 0` or, more precisely, has `mu == 0`. But `p(A) - p(B)` is a normally distributed random variable, thus

``````mu(p(A) - p(B)) = mu(A) - mu(B)
sigma(p(A) - p(B)) = sqrt(sigma(A) ^ 2 + sigma(B) ^ 2)
``````

If we are to `z-score` this distribution, we should obtain a distribution with `mu == 0` and `sigma == 1` - which is a mathematical expression of our null hypothesis. If our experimental `p(A) - p(B)` is within the constraints of this distribution, then `B` most likely does not provide any improvement (or worsening) over `A`. Now we can use our `two_sided_p_value` to test our hypothesis.

The code:

``````def abtest_estimated_params_trial(N, n):
mu = n / N
sigma = math.sqrt(mu * ( 1- mu) / N)
return (mu, sigma)

def pA_minus_pB(Na, na, Nb, nb):
muA, sigmaA = abtest_estimated_params_trial(Na, na)
muB, sigmaB = abtest_estimated_params_trial(Nb, nb)

# the following number should be normally distributed with mu = 0 and sigma == 1
return (muA - muB) / math.sqrt(sigmaA ** 2 + sigmaB ** 2)

two_sided_p_value(pA_minus_pB(1000, 200, 1000, 150), 0, 1)
``````

With the results:

``````two_sided_p_value(pA_minus_pB(1000, 200, 1000, 250), 0, 1)
Out[6]: 0.007313777221710671 # > 0.05, we can safely reject the null hypothesis

two_sided_p_value(pA_minus_pB(1000, 200, 1000, 350), 0, 1)
Out[7]: 2.531308496145357e-14 # > 0.5, we can safely reject the null hypothesis

two_sided_p_value(pA_minus_pB(1000, 200, 1000, 210), 0, 1)
Out[8]: 0.5796241923602059 # cannot reject the null

two_sided_p_value(pA_minus_pB(1000, 200, 1000, 180), 0, 1)
Out[9]: 0.25414197654223614 # cannot reject the null

two_sided_p_value(pA_minus_pB(1000, 200, 1000, 170), 0, 1)
Out[10]: 0.08382984264631776 # still cannot reject the null

two_sided_p_value(pA_minus_pB(1000, 200, 1000, 150), 0, 1)
Out[11]: 0.003189699706216853 # we can safely reject the null
``````