19  Hypothesis testing

Hypothesis testing is a widely used method in research to make inferences about population parameters based on sample data. In the context of health sciences, it is often used to evaluate the effectiveness of interventions, compare treatment outcomes, assess the significance of risk factors, and investigate various health-related phenomena.

Learning objectives
  • Understand the hypothesis testing framework
  • Understand the types of errors in hypothesis testing

19.1 Foundations of hypothesis testing

Hypothesis testing is a statistical method used to assess whether the data are consistent with the null hypothesis (\(H_0\)). The null hypothesis typically states that there is no significant effect or difference, while the alternative hypothesis (\(H_1\)) represents the assertion being tested. We analyze the data to determine whether they provide enough evidence to reject the null hypothesis. A crucial step involves calculating the p-value. For a single outcome measure in a study, the hypothesis testing framework can be summarized in five steps.

19.1.1 Steps of hypothesis testing

Step 1: State the null hypothesis, \(H_{0}\), and alternative hypothesis, \(H_{1}\), based on the research question.

NOTE: We choose a non-directional \(H_{1}\) (also known as a two-sided hypothesis) when we test for effects in both directions (the most common case); otherwise, we use a directional (also known as one-sided) hypothesis.

Step 2: Set a level of significance, α (usually 0.05).

Step 3: Determine an appropriate statistical test, check for any assumptions that may exist, and calculate the test statistic.

NOTE: There are two basic types of statistical tests, parametric and non-parametric. Parametric tests (e.g., t-test, ANOVA) depend on assumptions about the distribution of the parameter being studied. Non-parametric tests (e.g., Mann-Whitney U test, Kruskal-Wallis test) are based on ranking the measurements and do not require such assumptions. However, non-parametric tests are typically about 95% as powerful as parametric tests.
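
As an illustrative sketch (the data below are simulated, not from a real study), both types of test can be applied to the same two-group comparison in R with t.test() and wilcox.test():

# Simulated data for illustration only
set.seed(123)
group_a <- rnorm(30, mean = 5.0, sd = 1.2)   # 30 simulated measurements, group A
group_b <- rnorm(30, mean = 5.8, sd = 1.2)   # 30 simulated measurements, group B

# Parametric: two-sample t-test (assumes normality)
t.test(group_a, group_b)

# Non-parametric: Mann-Whitney U (Wilcoxon rank-sum) test, based on ranks
wilcox.test(group_a, group_b)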

Step 4: Decide whether or not the result is “statistically” significant according to (a) the rejection region or (b) the p-value.

(a) The rejection region approach

Based on the known sampling distribution of the test statistic, the rejection region is a set of values for the test statistic for which the null hypothesis is rejected. If the observed test statistic falls within the rejection region, then we reject the null hypothesis.

(b) The p-value approach

The p-value is the probability of obtaining the observed results, or something more extreme, if the null hypothesis is true.

We compare the calculated p-value to the significance level α:

  • If the p-value < α, reject the null hypothesis, \(H_{0}\).
  • If the p-value ≥ α, do not reject the null hypothesis, \(H_{0}\).
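
A minimal sketch of this decision rule in R (the p-value and α below are illustrative placeholders):

alpha <- 0.05     # prespecified significance level
p_value <- 0.03   # illustrative p-value
ifelse(p_value < alpha, "Reject H0", "Do not reject H0")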

Table 19.1 shows how to interpret the strength of the evidence. However, always keep in mind the size of the study being considered.

Table 19.1: Strength of the evidence against \(H_{0}\).
p-value Interpretation
\(p \geq{0.10}\) No evidence to reject \(H_{0}\)
\(0.05\leq p < 0.10\) Weak evidence to reject \(H_{0}\)
\(0.01\leq p < 0.05\) Evidence to reject \(H_{0}\)
\(0.001\leq p < 0.01\) Strong evidence to reject \(H_{0}\)
\(p < 0.001\) Very strong evidence to reject \(H_{0}\)
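
As a sketch, the thresholds of Table 19.1 can be encoded in a small helper function (the function name evidence_label is hypothetical):

# Hypothetical helper mapping a p-value to the verbal labels of Table 19.1
evidence_label <- function(p) {
  cut(p,
      breaks = c(-Inf, 0.001, 0.01, 0.05, 0.10, Inf),
      labels = c("Very strong evidence to reject H0",
                 "Strong evidence to reject H0",
                 "Evidence to reject H0",
                 "Weak evidence to reject H0",
                 "No evidence to reject H0"),
      right = FALSE)   # intervals are closed on the left, matching the table
}
evidence_label(0.003)   # falls in [0.001, 0.01): "Strong evidence to reject H0"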

Step 5: Interpret the results and draw a “real world” conclusion.

NOTE: Even if a result is statistically significant, it may not be clinically significant if the effect size is small or if the findings do not have practical implications for patient care.

19.2 Types of errors in hypothesis testing

In the framework of hypothesis testing there are two types of errors: Type I error (False Positive) and Type II error (False Negative) (Table 19.2).

Type I error: we reject the null hypothesis when it is true (false positive), leading us to conclude that there is an effect when, in reality, there is none. The maximum chance (probability) of making a Type I error is denoted by α (alpha), which represents the significance level of the test and is typically set at 0.05 (5%); we reject the null hypothesis if our p-value is less than the significance level, i.e., if p < α.

Type II error: we do not reject the null hypothesis when it is false (false negative), and conclude that there is no effect when one really exists. The chance of making a Type II error is denoted by β (beta) and should be no more than 0.20 (20%); its complement, (1 - β), is the power of the test.

Table 19.2: Types of error in hypothesis testing.

Decision based on the sample    In the population, \(H_0\) is true    In the population, \(H_0\) is false
Do not reject \(H_0\)           Correct decision: \(1 - \alpha\)      Type II error (\(\beta\))
Reject \(H_0\)                  Type I error (\(\alpha\))             Correct decision: \(1 - \beta\) (power of the test)

 

19.3 Factors that influence the power in a study

The power (\(1 - \beta\)), therefore, is the probability of (correctly) rejecting the null hypothesis when it is false; i.e. it is the chance (usually expressed as a percentage) of detecting, as statistically significant, a real treatment effect of a given size. Table 19.3 presents the main factors that can influence the power in a study.

Table 19.3: Factors Influencing Power.

Factor                                            Influence on the study’s power
Effect size (e.g., mean difference, risk ratio)   As the effect size increases, power tends to increase (a larger effect is easier for the statistical test to detect, leading to a greater probability of a statistically significant result).
Sample size                                       As the sample size goes up, power generally goes up (this factor is the one most easily manipulated by researchers).
Standard deviation                                As variability decreases, power tends to increase (variability can be reduced by controlling extraneous variables, for example through the inclusion and exclusion criteria defining the sample in a study).
Significance level α                              As α goes up, power goes up (it is easier to reach statistical significance with a larger α, e.g., α = 0.10, than with a smaller α, e.g., α = 0.05).
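
These relationships can be explored with base R's power.t.test(); the sketch below uses purely illustrative numbers for a two-sample comparison:

# Baseline scenario (illustrative values): returns the power of the test
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)

# Larger sample size -> higher power
power.t.test(n = 60, delta = 0.5, sd = 1, sig.level = 0.05)

# Larger effect size -> higher power
power.t.test(n = 30, delta = 0.8, sd = 1, sig.level = 0.05)

# Smaller standard deviation -> higher power
power.t.test(n = 30, delta = 0.5, sd = 0.7, sig.level = 0.05)

# Larger significance level -> higher power
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.10)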

19.4 Hypothesis testing for a single sample

Let’s explore the hypothesis testing procedure through a simple example using the one-sample t-test.

  Example

Suppose the mean serum creatinine in healthy adult women is 0.75 mg/dl. A research study was conducted to examine the serum creatinine levels in female patients with diabetes. Twenty female patients were randomly enrolled, with a mean serum creatinine level of 0.96 mg/dl and a standard deviation of 0.35 mg/dl. Assuming that the serum creatinine follows a normal distribution, is the mean serum creatinine in diabetic patients significantly different from that in healthy adults, with a level of significance of \(α = 0.05\)?

Step 1: State the null and alternative hypotheses

In our example, the null hypothesis \(H_{0}\) is that the mean serum creatinine in female patients with diabetes, \(μ\), is the same as the mean serum creatinine in healthy adult women, \(μ_{o}\); here \(μ\) is unknown and \(μ_{o}\) is known (0.75 mg/dl). Conversely, the alternative hypothesis \(H_{1}\) suggests a difference between the mean serum creatinine levels in these two groups. Note that \(H_{1}\) may take several forms (greater than, less than, or not equal to).

  • \(H_{0}\) : μ=\(μ_{o}\), \(H_{1}\) : μ>\(μ_{o}\) (one-sided hypothesis; right-tailed)
  • \(H_{0}\) : μ=\(μ_{o}\), \(H_{1}\) : μ<\(μ_{o}\) (one-sided hypothesis; left-tailed)
  • \(H_{0}\) : μ=\(μ_{o}\), \(H_{1}\) : μ \(\neq\) \(μ_{o}\) (two-sided hypothesis)

In our example, we adopt a two-sided hypothesis, implying that the mean serum creatinine in female patients with diabetes may differ from \(μ_{o}\) = 0.75 mg/dl in either direction. That is:

  • \(H_{0}\) : μ = 0.75, \(H_{1}\) : μ \(\neq\) 0.75

Step 2: Set the level of significance, α

We set the significance level (type I error) as α = 0.05.

Step 3: Determine an appropriate statistical test, check for any assumptions that may exist, and calculate the test statistic.

We choose a t-test, a statistical method utilized to assess whether the mean of a single sample differs significantly from a known population mean. This statistical test is particularly effective when dealing with small sample sizes and when the population standard deviation is unknown. The main assumption underlying the t-test is that the data being analyzed follows a normal distribution.

The formula for the test statistic t is:

\[t = \frac{\bar{x} - \mu_{o}}{SE_{\bar{x}}} \tag{19.1}\]

so,

\[t = \frac{\bar{x} - \mu_{o}}{SE_{\bar{x}}}= \frac{\bar{x} - \mu_{o}}{s/ \sqrt{n}}=\frac{0.96 - 0.75}{0.35/ \sqrt{20}}=\frac{0.21}{0.35/ 4.472}=\frac{0.21}{0.0783}= 2.68\]

In R:

n <- 20          # sample size
mu <- 0.75       # reference mean under H0
x_bar <- 0.96    # sample mean
s <- 0.35        # sample standard deviation


se <- s/sqrt(n)     # standard error of the mean
t <- (x_bar-mu)/se  # t statistic
t
[1] 2.683282

Step 4: Decide whether or not the result is “statistically” significant.

(a) The rejection region approach

Next, we utilize the t-distribution with n-1 degrees of freedom (df = 20-1 = 19), encompassing all possible values of the test statistic t and their associated probabilities. The area under the curve is divided into one central non-rejection region and two-sided, grey rejection regions defined by the prespecified level of significance \(\alpha\) (Figure 19.1). These shaded regions denote ‘extreme’ values of the test statistic, indicating significant deviations from the null hypothesis. When an observed value t falls within these rejection regions, it suggests that the null hypothesis is less likely to be true, prompting rejection in favor of the alternative hypothesis.

Figure 19.1: T-distribution for df=19. The area under the curve is divided into one central non-rejection region and two-sided, grey rejection regions defined by the level of significance \(\alpha\).

In our example, the observed value of the test statistic, t = 2.68, falls within the upper grey rejection region, leading us to reject the null hypothesis \(H_{0}\).
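
In R, the boundary of the rejection region (the critical value) can be obtained from the quantile function of the t-distribution, and comparing |t| against it reproduces the graphical decision:

# Critical value for a two-sided test at alpha = 0.05 with df = 19
t_crit <- qt(0.975, df = n - 1)
t_crit            # approximately 2.09

abs(t) > t_crit   # TRUE: t = 2.68 falls in the rejection region, so H0 is rejected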

 

(b) The p-value approach

The second approach involves assigning a probability to the value of the test statistic. Particularly, if the test statistic is exceptionally extreme under the assumption that the null hypothesis holds true, it is given a low probability. Conversely, if the test statistic is less unusual, it receives a higher probability.

Figure 19.2: T-distribution for df=19. The p-value approach.

The corresponding probability for the test statistic t can be calculated by summing the two cumulative probabilities P(T ≤ -t) and P(T ≥ t). Therefore, the p-value is computed as 2*P(T ≥ t), given that we are conducting a two-tailed test and there is symmetry in the t-distribution with df = n-1 (Figure 19.2).

# upper-tail probability P(T ≥ t) for the t-distribution with df = n-1
P <- pt(t, df = n-1, lower.tail = F)

p_value <- 2*P   # two-sided p-value
p_value
[1] 0.01470991

The resulting p-value is 0.0147, which is less than 0.05, so we reject the null hypothesis.

Step 5: Interpret the results and draw a “real world” conclusion.

At the 5% significance level, the test result provides evidence against the null hypothesis (p = 0.015 < 0.05). The mean serum creatinine in women with diabetes (0.96 mg/dl) is significantly higher than that in healthy adult women (0.75 mg/dl).

 

INFO

Degrees of freedom (df) represents the number of independent pieces of information or parameters that can vary within the constraints of a statistical model or test.

In the context of the provided example, the calculation of degrees of freedom is based on the sample size minus any constraints imposed by the analysis. For instance, let’s consider 19 of the 20 patients in the sample, whose serum creatinine levels are as follows: 0.97, 1.27, 0.93, 0.95, 0.84, 1.27, 0.47, 0.61, 1.07, 1.23, 0.81, 1.19, 1.01, 0.49, 0.90, 0.53, 0.45, 1.88, 1.23. The sum of these values would be:

sum19 <- 0.97 + 1.27 + 0.93 + 0.95 + 0.84 + 1.27 + 0.47 + 0.61 + 1.07 +
  1.23 + 0.81 + 1.19 + 1.01 + 0.49 + 0.90 + 0.53 + 0.45 + 1.88 + 1.23
sum19
[1] 18.1

To maintain a sample mean of 0.96 mg/dl, the twentieth patient’s serum creatinine would need to be 1.1 mg/dl. We can derive this value by subtracting the sum of the 19 known values from the sum of all 20 values:

0.96*20 - sum19
[1] 1.1

Due to this constraint, only 19 values in the sample can vary freely. As a result, the final value is not free to vary; it has only one possible value. In this instance, the degrees of freedom are calculated as df = 20-1 = 19.
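
For completeness, when the individual measurements are available (the 19 values above plus the derived twentieth value of 1.1 mg/dl), the whole test can be run with a single call to t.test(). Because these values have a mean of exactly 0.96 mg/dl but a standard deviation of about 0.351 rather than exactly 0.35 mg/dl, the output is expected to be close to, though not identical with, the summary-based calculation (t ≈ 2.68, p ≈ 0.015):

creatinine <- c(0.97, 1.27, 0.93, 0.95, 0.84, 1.27, 0.47, 0.61, 1.07, 1.23,
                0.81, 1.19, 1.01, 0.49, 0.90, 0.53, 0.45, 1.88, 1.23, 1.1)

mean(creatinine)   # exactly 0.96
sd(creatinine)     # about 0.351

# One-sample, two-sided t-test against the reference mean of 0.75 mg/dl
t.test(creatinine, mu = 0.75, alternative = "two.sided")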

 

19.5 Confidence interval for a single sample

Hypothesis testing and confidence intervals are both inferential methods based on the concept of sampling distribution. Although closely related, they serve distinct purposes. Specifically, hypothesis testing involves a statistical decision (reject or not reject a pre-defined hypothesis) based on the p-value, while confidence intervals provide an interval estimate of the effect size and its associated precision.

Let’s calculate the 95% CI for the mean using the t-distribution:

\[95\% \ CI = \bar{x} \ \pm \ t^*_{n-1;\ \alpha/2} \times SE_{\bar{x}}\]

where \(t^*_{n-1;\ \alpha/2}\) is the critical value of the t-distribution with df = n-1 for a given level of significance α.

t_star <- qt(0.025, df = n-1, lower.tail = F)

# compute lower limit of 95% CI
lower_95CI <- x_bar - t_star*se
lower_95CI
[1] 0.796195

# compute upper limit of 95% CI
upper_95CI <- x_bar + t_star*se
upper_95CI
[1] 1.123805

We observe that the value specified by the null hypothesis, \(μ_{o}\) = 0.75, is not included in the 95% confidence interval [0.796, 1.124]. Consequently, the conclusion aligns with that of the hypothesis test (rejection of \(H_0\)).

19.6 The impact of study design: sample size

The impact of study design decisions, such as the size of the sample, on research outcomes and conclusions is a crucial aspect that researchers must carefully consider before conducting the study. Let’s return to the example provided earlier, but this time suppose the researchers designed a study with only 10 patients (everything else remaining the same).

First, we calculate the new value of the t test statistic:

n <- 10             # reduced sample size

se <- s/sqrt(n)     # standard error of the mean
t <- (x_bar-mu)/se  # t statistic
t
[1] 1.897367

Next, we will graphically present the two approaches, illustrating the rejection region and p-value methods.

(a) The rejection region approach

Figure 19.3: T-distribution for df=9. The rejection region approach.

In this instance, the observed test statistic value, t = 1.897, falls within the non-rejection region, thus indicating that we do not reject the null hypothesis \(H_{0}\).
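
The same critical-value check in R confirms this (with n = 10, df = 9):

# Critical value for a two-sided test at alpha = 0.05 with df = 9
t_crit <- qt(0.975, df = n - 1)
t_crit            # approximately 2.26

abs(t) > t_crit   # FALSE: t = 1.897 does not fall in the rejection region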

(b) The p-value approach

Figure 19.4: T-distribution for df=9. The p-value approach.

The p-value is computed as 2*P(T ≥ t) (Figure 19.4):

P <- pt(t, df = n-1, lower.tail = F)

p_value <- 2*P
p_value
[1] 0.09026733

The resulting p-value is 0.09, which is greater than \(\alpha\) = 0.05, so we do not reject \(H_0\).

 

Additionally, the 95% confidence interval (using the critical value for df = 9) is as follows:

t_star <- qt(0.025, df = n-1, lower.tail = F)   # critical value for df = 9

# compute lower limit of 95% CI
lower_95CI <- x_bar - t_star*se
lower_95CI
[1] 0.7096251

# compute upper limit of 95% CI
upper_95CI <- x_bar + t_star*se
upper_95CI
[1] 1.210375

We observe that the 95% confidence interval is wider, [0.709, 1.210], and includes the value specified by the null hypothesis, \(μ_{o}\) = 0.75.

 

Even when a difference appears substantial, a study with a small sample size may not have enough statistical power to rule out random chance as an explanation for the observed difference.
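
As a rough sketch, power.t.test() can quantify this for a one-sample design, plugging in the difference (0.21 mg/dl) and standard deviation (0.35 mg/dl) assumed in the example:

# Approximate power to detect a mean difference of 0.21 mg/dl (sd = 0.35 mg/dl)
power.t.test(n = 10, delta = 0.21, sd = 0.35, sig.level = 0.05, type = "one.sample")
power.t.test(n = 20, delta = 0.21, sd = 0.35, sig.level = 0.05, type = "one.sample")
# The power is noticeably lower with n = 10 than with n = 20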

As we highlighted, the sample size is a crucial aspect of research design, and we will delve into this topic in Chapter 38.