Hypothesis testing and Power

Hypothesis testing

What is a hypothesis?

  • A starting point for investigation

  • A statement that can be disproven with data

  • A proposition made as a basis for reasoning, without any assumption of its truth

Four basic elements of a statistical test:

  1. Null Hypothesis \(H_0\)

  2. Alternate hypothesis \(H_a\)

  3. Test statistic

  4. Region of rejection for the null hypothesis

1. Null Hypothesis \(H_0\)

Example:

In a calibration bath measuring temperature,

  • \(x\) - A set of sensor measurements, error prone (noise and bias)

  • \(\mu_0\) - Calibration bath temperature, known to many more significant digits than the sensitivity of the sensor being calibrated (for temperature calibration, the well-defined triple point of gallium is used as a reference point)

  • \(\mu\) - The true mean of the error-prone sensor's measurements in the calibration bath. This is unknown; it is what we are trying to estimate with the sample mean \(\bar{x}\).

For this example, the null hypothesis is

\(H_0\): \(\mu = \mu_0\).

(the sensor is unbiased)

2. Alternate hypothesis \(H_a\)

The alternate hypothesis must cover all possibilities not covered by the null hypothesis. The two hypotheses must be mutually exclusive, and their probabilities must sum to 1.

For the calibration bath example described above,

\(H_a\): \(\mu \neq \mu_0\).

3. Test statistic

The test statistic can either be parametric or non-parametric. Parametric tests are based on theoretical probability distributions, such as the normal distribution. Non-parametric tests do not assume a particular distribution, but are often less efficient or sacrifice statistical power (see below).

For the calibration bath example, the appropriate statistical test is the one-sample t-test. This is because we are comparing one group of samples (sensor measurements) to a known value (calibration bath temperature). We also expect the sensor measurements to be normally distributed.

For a normally distributed parent population the t-statistic,

\(t = \frac{ \bar{x} - \mu } {s/\sqrt{N}}\),

has a known distribution. If \(N\) samples are taken from the population many times, the resulting t-statistics follow a probability distribution called the t-distribution (see Probability and distributions).

4. Region of rejection of the null hypothesis

In order to decide whether to accept or reject the null hypothesis, we must define what constitutes an “extreme” value of the test statistic.

For a group of \(N\) = 4 samples (3 degrees of freedom) taken randomly from a normal distribution, there is a 5% chance that the t-statistic will be either greater than 3.18 or less than -3.18. This defines the rejection region for a 95% confidence level. In this case, \(\alpha =\) 0.05 is the probability that the null hypothesis will be wrongly rejected based on \(N\) samples, if the null hypothesis is true. Conversely, there is a 95% probability that the null hypothesis will be correctly accepted if it is true; whether the null hypothesis actually is true is, in general, not known.
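This 5% figure can be checked by simulation, echoing the repeated-sampling idea above. A minimal sketch (the population mean and standard deviation are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 4
ntrials = 100000

# draw many groups of N samples from a normal population
samples = rng.normal(loc=0, scale=1, size=(ntrials, N))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
tstats = (xbar - 0)/(s/np.sqrt(N))

# fraction of t-statistics falling in the two-tailed rejection region
print('rejection rate:', np.mean(np.abs(tstats) > 3.18))

The printed fraction should be close to 0.05.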

Example (two-tailed test):

reject the null hypothesis if: \(|t|\) > \(t_{1-\frac{\alpha}{2},N-1}\)

from scipy import stats
import numpy as np
from matplotlib import pyplot as plt

# number of samples
N = 4
alpha = 0.05

# plot t-distribution
tvalues = np.arange(-6,6,0.01)
tpdf = stats.t.pdf(tvalues,N-1)

plt.figure()
plt.plot(tvalues,tpdf,lw=3)
plt.xlabel('$t$')
plt.ylabel('probability density')
plt.title('$t$-distribution')
plt.gca().set_ylim(bottom=0)

# plot rejection regions
tcrit = stats.t.ppf(1-alpha/2,N-1)
upperi, = np.where(tvalues>tcrit)
loweri, = np.where(tvalues<-tcrit)
plt.fill_between(tvalues[upperi],tpdf[upperi],facecolor='red')
plt.fill_between(tvalues[loweri],tpdf[loweri],facecolor='red');
[Figure: t-distribution with the two-tailed rejection regions shaded in red]

In Python, the upper and lower critical values of a test statistic can be found by selecting a distribution from the stats library (stats.t in this case) and using the ppf function (the percent point function, which is the inverse of the cdf).

alpha = 0.05
N = 4
tupper = stats.t.ppf(1-alpha/2,N-1)
tlower = stats.t.ppf(alpha/2,N-1)
print('upper critical t value = '+str(tupper))
print('lower critical t value = '+str(tlower))
upper critical t value = 3.182446305284263
lower critical t value = -3.1824463052842638

The probability of obtaining a certain \(t\) value or less can be found from the cumulative distribution function, cdf.

stats.t.cdf(tupper,N-1)
0.9750000000000106

In testing a null hypothesis, there are four possible situations, depending on the actual truth of the null hypothesis, and the conclusion that is drawn from a test statistic calculated from a finite number of samples.

                 Accept H0                   Reject H0
H0 is true       correct decision (1-alpha)  Type I error (alpha)
H0 is false      Type II error (beta)        correct decision (1-beta)

Here, \((1-\alpha)\) is called the “confidence level” and \((1-\beta)\) is called the “statistical power”. \(\alpha\) is the probability of making a Type I error and \(\beta\) is the probability of making a Type II error.

Some version of this table is presented in nearly every textbook on statistics. However, it still leads to a lot of confusion.

Let’s say you are comparing a set of observations to a theory, or a set of sensor values with a known standard. In truth, there is almost certainly a difference between the sample mean and the true mean, say to 20+ significant digits. But if the data are noisy, it would still be difficult to prove that this small difference did not just occur by random chance. Failing to reject the null hypothesis at a 95% confidence level does not mean that you are 95% certain that it is true. It just means that your data are less extreme than 95% of random groups of samples drawn from the hypothesized distribution.

Scientists tend to focus on confidence levels rather than statistical power because this approach is conservative from a scientific point of view. By avoiding Type I errors, scientists reduce the likelihood of promoting an idea that is not actually true. This may come at the price of failing to detect actual differences. However, as explained below, power analysis can be an extremely valuable tool for planning experiments and determining how many samples you need.

Power analysis is described later in these notes, after covering the basic statistical procedure of t-tests.

Application of hypothesis testing: t-tests

The \(t\)-statistic is used to test whether sample means are different. It was developed by a statistician named William Sealy Gosset, who worked for the Guinness Brewery in Dublin in the early 20th century. He was interested in comparing the properties of ingredients of the beloved Irish stout with a small number of samples. To keep his corporate boss happy, Gosset published his statistical work under the pseudonym “Student”.

One-sample t-test

The example described above, in which a sample mean is compared with a single value (possibly a known value or a theoretical result), is called a one-sample t-test. All t-tests assume that the samples are drawn from a normally-distributed population.

In Python, a one-sample \(t\)-test can be conducted with the function stats.ttest_1samp(). Given a set of values x and a hypothesized population mean (the popmean argument, \(\mu_0\)), this function returns the \(t\)-statistic and a \(p\)-value. The \(p\)-value is the probability of obtaining a \(t\)-statistic of that magnitude, or more extreme, in the hypothetical case that the null hypothesis is true. If the \(p\)-value is less than \(\alpha\), then the null hypothesis is rejected.

Looking at an example where we have three samples, and we are comparing with a known value \(\mu_0\) = 10.0:

x = [1.,2.,4.]  # list of samples
mu0 = 10.0      # known value
t,p = stats.ttest_1samp(x, mu0)  
print("x:",x)
print("mu0:",mu0)
print("t:",round(t,3))
print("p:",round(p,3)) 
x: [1.0, 2.0, 4.0]
mu0: 10.0
t: -8.693
p: 0.013

In this case, the \(p\)-value is less than 0.05 but greater than 0.01, so the null hypothesis can be rejected at the 95% confidence level but not at the 99% confidence level.

Looking at another example where the known value is closer to the sample mean,

x = [1., 2., 4.]  # list of samples
mu0 = 3           # known value
t,p = stats.ttest_1samp(x, mu0)
print("x:",x)
print("mu0:",mu0)
print("t:",round(t,3))
print("p:",round(p,3))
x: [1.0, 2.0, 4.0]
mu0: 3
t: -0.756
p: 0.529

we see that the \(t\)-statistic is much closer to zero, and the \(p\)-value is much higher. In this case, the null hypothesis cannot be rejected at the 95% confidence level (or even a far more lenient 60% confidence level).

The p-value is related to the cumulative probability of the t-statistic. The p-value is the probability of obtaining a certain value of t, or more extreme, in the hypothetical case that the null hypothesis is true. The cumulative probability is the probability of obtaining a certain value of t or less when taking random samples from a normal distribution (again, in the hypothetical case that the null hypothesis is true).

tcdf = stats.t.cdf(-0.756,2)
print("cumulative probability:",round(tcdf,3))
cumulative probability: 0.264

There is a probability of 0.264 of randomly obtaining a \(t\) value of -0.756 or less if the null hypothesis is true. However, there is twice as much probability (p = 0.529) of obtaining a more extreme value in either direction (\(t \leq\) -0.756 or \(t \geq\) 0.756).
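A quick check of this doubling relation using the cdf (the values match the second one-sample example above):

# two-tailed p-value: twice the one-tailed probability, by symmetry
p = 2*stats.t.cdf(-0.756, 2)
print('p =', round(p, 3))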

One-tailed vs. two-tailed tests

The two previous examples are both two-tailed tests: the rejection region occupies both tails of the t-distribution. In a one-tailed test, we only care about differences in a certain direction, for example:

\(H_0\): \(\mu \leq \mu_0\)

\(H_a\): \(\mu > \mu_0\)

In this case, the null hypothesis is rejected only for large positive t values, those exceeding the critical value. The rejection region for the 95% confidence level is shown below for \(N\) = 4 samples (3 degrees of freedom).

plt.figure()
plt.plot(tvalues,tpdf,lw=3)
plt.xlabel('$t$')
plt.ylabel('probability density')
plt.title('$t$-distribution')
plt.gca().set_ylim(bottom=0)

# plot rejection regions
tcrit = stats.t.ppf(1-alpha,N-1)
upperi, = np.where(tvalues>tcrit)
plt.fill_between(tvalues[upperi],tpdf[upperi],facecolor='red');
[Figure: t-distribution with the one-tailed (upper) rejection region shaded in red]

When using a function such as stats.ttest_1samp() to do a one-tailed test, here is how to interpret the output:

  • Reject \(H_0\) only if the value of t has the correct sign

  • If t does have the correct sign, reject \(H_0\) if half of the p-value is less than \(\alpha\) (see the sketch below)
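A minimal sketch of this recipe (the data and \(\mu_0\) are just for illustration; recent versions of scipy also offer an alternative keyword argument in stats.ttest_1samp() that handles one-tailed tests directly):

from scipy import stats

x = [1., 2., 4.]
mu0 = 3.0

# one-tailed test of H0: mu >= mu0 against Ha: mu < mu0
t, p_two = stats.ttest_1samp(x, mu0)
if t < 0:                      # t has the sign predicted by Ha
    print('one-sided p =', round(p_two/2, 3))
else:
    print('wrong sign: do not reject H0')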

Two-sample t-test: comparing pairs of populations

A two-sample t-test is used to test whether the means of two groups of samples are different.

\(H_0: \mu_1 = \mu_2\)

\(H_a: \mu_1 \neq \mu_2\)

Here \(\mu_1\) and \(\mu_2\) are the true population means, estimated by the sample means \(\bar{x}_1\) and \(\bar{x}_2\).

The example below shows bacterial growth rates for two different sets of samples. The control is from a sample collected from an estuary and “GX” is from an experiment in which a polysaccharide compound (gum xanthan) was added to investigate the effects of transparent exopolymer particles (TEP).

[Figure: box plots of bacterial growth rates for the control and GX samples]

Source: Bar-Zeev and Rahav (2015) Microbial metabolism of transparent exopolymer particles during the summer months along a eutrophic estuary, Front. Microbiol. http://journal.frontiersin.org/article/10.3389/fmicb.2015.00403/full

Box Plots:

  • Central line represents median

  • Box brackets 25-75th percentile (50% of the data)

  • Whiskers bracket 10-90th percentile (80% of the data)

The \(*\) above the second box indicates a statistically significant difference according to a t-test. There is overlap in the range of growth rates, but the null hypothesis that the means are equal is rejected at a 99% confidence level.

Student’s two-sample t-test (equal variances)

A two-sample t-test is used to compare \(\bar{x}\) and \(\bar{y}\), assuming the populations being sampled have the same true variance, i.e. \(\sigma_x^2 = \sigma_y^2\). With \(N_x\) and \(N_y\) samples, the t statistic is given by:

\(t = \frac{\bar{x} - \bar{y}} {s_{xy} \sqrt{\frac{1}{N_x} + \frac{1}{N_y}}}\),

where \(s_{xy}\) is the pooled sample standard deviation,

\(s_{xy} = \sqrt{\frac{(N_x - 1)s_x^2 + (N_y - 1)s_y^2} {N_x + N_y -2}}\),

and the degrees of freedom is given by

\(\nu = N_x + N_y -2\).

Note: For paired samples that are not independent, you would not use this test. Instead, you would take the differences and use a one-sample t-test as described above. This has \(N-1\) degrees of freedom for \(N\) pairs.
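A minimal sketch of the paired approach (the numbers are made up): scipy’s stats.ttest_rel() gives the same result as a one-sample t-test on the differences.

from scipy import stats
import numpy as np

before = np.array([10.1, 9.8, 10.4, 10.0])
after = np.array([10.6, 10.1, 10.8, 10.3])

# paired t-test ...
t1, p1 = stats.ttest_rel(after, before)

# ... is equivalent to a one-sample t-test on the differences
t2, p2 = stats.ttest_1samp(after - before, 0.0)

print(round(t1, 3), round(p1, 4))
print(round(t2, 3), round(p2, 4))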

Welch’s two-sample t-test (unequal variances)

Welch’s t-test is more robust and does not assume equal variances of \(x\) and \(y\). The t statistic is given by

\(t = \frac{\bar{x} - \bar{y}} {\sqrt{\frac{s_x^2}{N_x} + \frac{s_y^2}{N_y}}}\).

The degrees of freedom in this case is a more complicated expression given by the Welch-Satterthwaite equation:

\(\nu = \frac{\left(\frac{s_x^2}{N_x} + \frac{s_y^2}{N_y}\right)^2}{\frac{1}{(N_x-1)}\left(\frac{s_x^2}{N_x}\right)^2 + \frac{1}{(N_y-1)}\left(\frac{s_y^2}{N_y}\right)^2}\)

Python implementation

For arrays of independent values x and y, the Student’s t-test can be performed using this function:

t,p = stats.ttest_ind(x,y)

For Welch’s t-test, use the same function but set the equal_var option to False.

t,p = stats.ttest_ind(x,y,equal_var=False)

It is recommended to use Welch’s t-test unless you have a good reason to assume that the population variances are equal. This reduces the chance of a Type I error.
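As a check on the formulas above, here is a sketch with made-up data that computes both t statistics by hand and compares them with the stats.ttest_ind() results:

import numpy as np
from scipy import stats

x = np.array([23., 25., 21., 24., 26.])
y = np.array([20., 22., 19., 23.])
Nx, Ny = len(x), len(y)
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)

# Student's t with pooled sample standard deviation
sxy = np.sqrt(((Nx - 1)*sx2 + (Ny - 1)*sy2)/(Nx + Ny - 2))
t_student = (x.mean() - y.mean())/(sxy*np.sqrt(1/Nx + 1/Ny))
print(t_student, stats.ttest_ind(x, y)[0])                 # should agree

# Welch's t (no pooled variance)
t_welch = (x.mean() - y.mean())/np.sqrt(sx2/Nx + sy2/Ny)
print(t_welch, stats.ttest_ind(x, y, equal_var=False)[0])  # should agree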

Exercise: comparing means of current meter data

In Section 3.14 of Emery and Thomson, an example is given where the January means of alongshore velocity (\(V\)) from current meter data are compared for two different years. The means and standard deviations from daily averages in January are given by

Year 1: \(\bar{V_1} = 23 \pm 3 \text{ cm/s}\)

Year 2: \(\bar{V_2} = 20 \pm 2 \text{ cm/s}\)

Perform a Student’s t-test to test the null hypothesis that the means are the same between these two years, for 95% confidence. You may assume that each daily average is an independent sample.
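One way to set this up (a sketch, not the full solution: it assumes the ± values are standard deviations and that each January contributes \(N\) = 31 independent daily averages) is to run the test directly from the summary statistics:

from scipy import stats

# Student's t-test from summary statistics (assumed N = 31 daily
# averages per January; +/- interpreted as standard deviations)
t, p = stats.ttest_ind_from_stats(mean1=23., std1=3., nobs1=31,
                                  mean2=20., std2=2., nobs2=31,
                                  equal_var=True)
print('t =', round(t, 3), ' p =', round(p, 5))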

Power analysis

Statistical power is the probability of correctly rejecting the null hypothesis when it is false. A conventional target power is \((1 - \beta) = 0.8\). This means that if the null hypothesis is false, there is a probability \(\beta\) = 0.2 (a one in five chance) that the null hypothesis will be accepted incorrectly (Type II error). This would mean incorrectly inferring that there is no difference between two sets of samples.

Statistical power and confidence levels are not independent. The confidence level for accepting or rejecting a null hypothesis is one of the primary factors that determine statistical power.

Power analysis can be useful for studies that inform management decisions, in which avoiding a Type II error might be a “conservative” action. For example, a Type II error in a study on the effect of pollution on abundance of an organism at two sites (pristine and disturbed) might mean “incorrectly” protecting habitat. The accepted level of risk of a Type II error in this case might vary between environmental and industry stakeholders.

Power analysis is also useful before conducting an experiment. It can help you determine how many samples you need to observe a certain effect with a statistical test. The effect that you want to observe might be determined by the resolution of your instrument, or what you think might be important in an ecological sense. Remember that determining a significant difference between two sets of samples does not mean that the difference is important.

There are four ingredients in a power analysis. If three are known, then the fourth can be calculated.

  1. Effect Size: \(d = \frac{|\mu_1 - \mu_2|} {\sigma}\)

    • d=0.2 “small”

    • d=0.8 “large”

  2. Sample size: \(N\)

  3. Confidence level: \((1-\alpha)\); equivalently specified by the significance level \(\alpha\), typically 0.05

  4. Target Power: (1-\(\beta\))

The effect size (\(d\)) is the minimum deviation from the null hypothesis that you expect to be able to detect. In this case, it is the effect size for a difference between two means (t-test).

  • d is non-dimensional

  • often called “Cohen’s d”

  • d = difference of the means / standard deviation

Example: Detecting change due to a restoration activity

Goal 1 (ecologically significant effect): increase the mean oxygen concentration by 20 \(\mu\)M (caring only about an increase indicates the need for a one-tailed z-test)

  • The natural variability is 50 \(\mu\)M (std. dev.)

  • \(\mu\) is the mean oxygen concentration before restoration

  • \(\bar{x}\) mean of the samples collected after restoration

  • \(d = \frac{|\bar{x} - \mu|} {\sigma}\)

  • \(d = \frac{20\ \mu M} {50\ \mu M} = 0.4\)

Goal 2: want to be able to show that a statistically significant difference is present, if the activity is a success. This is different from Goal 1.

  • In this case, \(H_0\): \(\bar{x} \le \mu\) and \(H_a\) : \(\bar{x} > \mu\)

One-tailed z-test (like a t-test, valid for a large N)

  • \(z = \frac{\bar{x} - \mu} {\frac {S} {\sqrt{N}}}\)

  • compare to \(z_{crit} = z_{1-\alpha}\) (computed in the sketch below)
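The critical value for the one-tailed z-test comes from the normal distribution’s ppf. A minimal sketch with \(\alpha\) = 0.05:

from scipy import stats

alpha = 0.05
zcrit = stats.norm.ppf(1 - alpha)   # one-tailed critical z value
print('critical z =', round(zcrit, 3))

A z value greater than this critical value (about 1.645) would lead to rejecting \(H_0\).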

Resources for calculating power

Example: detecting small differences with a noisy instrument

  • Want to be able to measure a difference of 2 \(\mu\)M

  • instrument noise = 5 \(\mu\)M

  • Significance level: \(\alpha\) = 0.05

  • Power: 1- \(\beta\) = 0.8

How many samples do we need to detect this difference?

In this case, the effect size can be thought of as the absolute difference of 2 \(\mu\)M, relative to the standard deviation (noise level) of 5 \(\mu\)M. The effect size is \(d = \) 0.4.

from statsmodels.stats import power
nobs = power.tt_solve_power(power=0.8,alpha=0.05,effect_size=0.4)
print('N = ',round(nobs,3))
N =  51.009

If the actual difference is 2 \(\mu\)M, then we will get a significant difference 80% of the time with \(N\) = 51. This example is for a one-sample t-test, but other functions in the power library can be used for other statistical tests. The Pingouin package (see below) also has functions for computing power.
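The same solver can be run the other way around: leaving the power argument unspecified returns the power achieved for a given sample size (a usage sketch):

from statsmodels.stats import power

# solve for power instead of N by omitting the power argument
achieved = power.tt_solve_power(effect_size=0.4, nobs=20, alpha=0.05)
print('power with N = 20:', round(achieved, 3))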

Critiques of null hypothesis significance testing

  • “The test of statistical significance in psychological research may be taken as an instance of a kind of essential mindlessness in the conduct of research” (Bakan, 1966)

  • Hypothesis testing is “a wrongheaded view about what constitutes scientific progress” (Luce, 1988)

  • “What’s wrong with [null hypothesis significance testing]? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!” (Cohen 1994)

Quotes collected in: https://rpsychologist.com/d3/nhst/

The modern process of null hypothesis significance testing (NHST) is a hybrid of approaches developed by Fisher, who developed p-values to quantify the strength of evidence against the null hypothesis, and Neyman and Pearson, who developed decision-making criteria for accepting the null hypothesis as opposed to a specific alternative hypothesis (a process which included power analysis). It is likely that none of these early pioneers would approve of the modern form of null hypothesis significance testing, which arose out of a desire for standard procedures (like cookbook recipes) to teach to non-experts.

The p-value in particular has been the subject of intense backlash in recent years. The journal Basic and Applied Social Psychology banned p-values. A critical p-value of 0.05 is arbitrary and encourages “p-hacking,” where marginal results may be tweaked to clear the bar. Statistically significant differences may be associated with small and unimportant effects, especially for large sample sizes (increasingly common). Furthermore, like many concepts in frequentist statistics, the meaning of the p-value is commonly misinterpreted. This is because the concept of the p-value is based on hypothetical data that have not actually been collected.

The p-value is commonly interpreted as \(P(H|D)\), but in actuality it represents \(P(D|H)\). These are not the same thing.

Bayes factors

The Bayes factor is an alternative metric which quantifies evidence in favor of an alternative hypothesis, over the null hypothesis (or vice-versa). Recall Bayes’ theorem, in which the probability of a hypothesis (\(H\)) is updated given new data (\(D\))

\[P(H|D) = \frac{P(D|H)P(H)}{P(D)}\]
  • \(P(H|D)\) - probability of a hypothesis after seeing data (“posterior”)

  • \(P(D|H)\) - probability of the data given the hypothesis (“likelihood”)

  • \(P(H)\) - probability of a hypothesis before seeing data (“prior”)

  • \(P(D)\) - probability of the data under any hypothesis (“normalizing constant”)

The odds of an alternative hypothesis (\(H_1\)) over the null hypothesis (\(H_0\)), after seeing the data is given by

\[\frac{P(H_1|D)}{P(H_0|D)} = \frac{P(D|H_1)}{P(D|H_0)}\frac{P(H_1)}{P(H_0)}\]

The Bayes factor is the first term on the right-hand side, \(\frac{P(D|H_1)}{P(D|H_0)}\), also known as the likelihood ratio. This term quantifies the amount by which the data change the posterior odds, \(\frac{P(H_1|D)}{P(H_0|D)}\), compared with the prior odds, \(\frac{P(H_1)}{P(H_0)}\). If the prior odds are 1 (the same odds given to both hypotheses), then the Bayes factor is equal to the posterior odds.

This Bayes factor is called \(BF_{10}\). The Bayes factor can also be expressed as the reciprocal, the ratio of the null over the alternative \(BF_{01}\).

In interpreting the Bayes factor, a value of 1-3 is considered only weak evidence in favor of the alternative hypothesis \(H_1\), a value of 3-20 “positive” evidence, a value of 20-150 “strong,” and higher values “very strong.”

Bayes factors are useful because they provide evidence of the relative strength of hypotheses. Frequentist statistics only provide insight into \(P(D|H_0)\), the probability of the data given the null hypothesis. In the frequentist framework, the null hypothesis can be rejected but not accepted. Bayes factors allow for the possibility that the evidence for the null hypothesis is stronger than the alternative. They also allow for comparison with a specific alternative (for example, that a parameter has a specific value, not just “not null”).

Note that the Bayes factor still relies on the underlying assumptions of the statistical test. For a \(t\)-test, there is still the assumption that the samples are drawn from a normally-distributed population.

The relatively new Pingouin package provides Bayes factors, along with effect sizes (Cohen’s \(d\)) and power analysis.

import pingouin as pg

x = [1.,2.,4.]  # list of samples
mu0 = 10.0      # known value
pg.ttest(x, mu0)  
        T          dof  alternative  p-val     CI95%          cohen-d   BF10   power
T-test  -8.693183  2    two-sided    0.012976  [-1.46, 6.13]  5.019011  5.531  0.976136

This Bayes factor (BF10) indicates that there is “positive” evidence in favor of an alternative hypothesis, but not “strong”.

Details of the Bayes factor calculation (advanced)

In common statistical tests, we are interested in the value of a parameter \(\theta\). In a t-test, this parameter is the true mean \(\mu\). In a one-sample t-test, the null hypothesis is that the mean parameter \(\theta\) has a specific value \(\theta_0\).

\(H_0\): \(\theta = \theta_0\)

\(H_1\): \(\theta \neq \theta_0\)

The likelihood of the null, \(P(D|\theta = \theta_0)\) can be calculated from the \(t\)-distribution. The likelihood of the alternative is more complicated because there is a distribution of alternative values, not just one. Calculating the likelihood involves integrating over all possible alternatives.

\[P(D|\theta\neq \theta_0) = \int_{\theta \neq \theta_0} P(D|\theta)\, P(\theta)\, d\theta\]

This integral is called the marginal likelihood. The likelihood of each possible parameter value is weighted by its prior. The prior distribution \(P(\theta)\) must integrate to 1.

Again, the \(P(D|\theta)\) term in the integral can be calculated from the \(t\)-distribution. However, this calculation also involves specifying a probability distribution for the prior \(P(\theta)\) over all possible values of \(\theta\). There are many ways to choose a prior distribution, and the subjective choices in this process are one critique of Bayesian statistics. One strategy is to be as objective as possible, using “non-informative” prior distributions. This is sometimes a “flat” prior that weights all probabilities equally.
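To make the marginal likelihood concrete, here is a deliberately simplified numerical sketch. It is not the calculation Pingouin performs: it assumes the population standard deviation is known and uses a normal prior on the mean, with all the numbers chosen for illustration only.

import numpy as np
from scipy import stats

x = np.array([1., 2., 4.])
mu0 = 10.0       # null value of the mean
sigma = 1.5      # assumed KNOWN population standard deviation
prior_sd = 5.0   # assumed width of the normal prior on mu under H1

def likelihood(mu):
    # P(D|mu): product of normal densities over the data points
    return np.prod(stats.norm.pdf(x, loc=mu, scale=sigma))

# likelihood under the null hypothesis (a single parameter value)
like_null = likelihood(mu0)

# marginal likelihood under H1: sum of likelihood*prior over a fine grid
mu_grid = np.linspace(-40., 60., 20001)
dmu = mu_grid[1] - mu_grid[0]
integrand = np.array([likelihood(mu) for mu in mu_grid])*stats.norm.pdf(mu_grid, mu0, prior_sd)
like_alt = integrand.sum()*dmu

print('BF10 =', like_alt/like_null)

With the data far from \(\mu_0\), this simplified BF10 comes out large, favoring \(H_1\); the exact value depends entirely on the assumed \(\sigma\) and prior width.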

The calculation used by Pingouin follows the method in Rouder et al. (2009). The prior distribution used in this method weights small differences from the mean more heavily than very extreme differences. The method also accounts for the fact that there are two parameters involved in the \(t\)-statistic: the mean (\(\mu\)) and the variance (\(\sigma^2\)).

Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., Iverson, G., 2009. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16, 225–237. https://doi.org/10.3758/PBR.16.2.225