20 Two-sample t-test

Two sample t-test (Student’s t-test) can be used if we have two independent (unrelated) groups (e.g., males-females, treatment-non treatment) and one quantitative variable of interest.

When we have finished this Chapter, we should be able to:

Learning objectives

Applying hypothesis testing
Compare two independent samples applying Student’s t-test
Interpret the results

20.1 Research question and Hypothesis Testing

We consider the data in depression dataset. In an experiment designed to test the effectiveness of paroxetine for treating bipolar depression, the participants were randomly assigned into two groups (paroxetine Vs placebo). The researchers used the Hamilton Depression Rating Scale (HDRS) to measure the depression state of the participants and wanted to find out if the HDRS score is different in paroxetine group as compared to placebo group at the end of the experiment. The significance level $α$ was set to 0.05.

Note: A score of 0–7 in HDRS is generally accepted to be within the normal range, while a score of 20 or higher indicates at least moderate severity.

Null hypothesis and alternative hypothesis for the main research question

$H_{0}$ : the means of HDRS in the two groups are equal ( $μ_{1} = μ_{2}$ )
$H_{1}$ : the means of HDRS in the two groups are not equal ( $μ_{1} \neq μ_{2}$ )

20.2 Packages we need

We need to load the following packages:

# packages for graphs
library(ggrain)
library(ggsci)
library(ggpubr)
library(ggprism)

# packages for data description, transformation and analysis
library(dlookr)
library(descriptr)
library(rstatix)
library(here)
library(tidyverse)

# packages for reporting the results
library(gtsummary)
library(report)

20.3 Preparing the data

We import the data depression in R:

library(readxl)
depression <- read_excel(here("data", "depression.xlsx"))

	intervention	HDRS
1	placebo	19
2	placebo	21
3	placebo	28
4	placebo	22
5	placebo	22
6	placebo	28
7	placebo	23
8	placebo	17
9	placebo	19
10	placebo	20

Figure 20.1: Table with data from “depression” file.

We inspect the data and the type of variables:

glimpse(depression)

Rows: 76
Columns: 2
$ intervention <chr> "placebo", "placebo", "placebo", "placebo", "placebo", "p…
$ HDRS         <dbl> 19, 21, 28, 22, 22, 28, 23, 17, 19, 20, 26, 23, 23, 22, 1…

The data set depression has 76 patients (rows) and includes two variables (columns). The numeric (<dbl>) HDRS variable and the character (<chr>) intervention variable which should be converted to a factor (<fct>) variable using the factor() function as follows:

depression <- depression |> 
  mutate(intervention = factor(intervention))

glimpse(depression)

Rows: 76
Columns: 2
$ intervention <fct> placebo, placebo, placebo, placebo, placebo, placebo, pla…
$ HDRS         <dbl> 19, 21, 28, 22, 22, 28, 23, 17, 19, 20, 26, 23, 23, 22, 1…

20.4 Assumptions

Check if the following assumptions are satisfied

The data are normally distributed in both groups
The data in both groups have similar variance (also named as homogeneity of variance or homoscedasticity)

20.4.1 A. Explore the characteristics of distribution for each group and check for normality

The distributions can be explored visually with appropriate plots. Additionally, summary statistics and significance tests to check for normality (e.g., Shapiro-Wilk test) can be used.

Graphs

We can visualize the distribution of HDRS for the two groups:

ggplot(depression, aes(x= intervention, y = HDRS, fill = intervention)) +
  geom_rain(likert= TRUE, seed = 123, point.args = list(alpha = 0.3)) +
  #theme_prism(base_size = 20, base_line_size = 0.4, palette = "office") +
  labs(title = "Grouped Raincloud Plot: HDRS by intervention") +
  scale_fill_jco() +
  theme(legend.position = "none")

Figure 20.2: Raincloud plot of HDRS variable stratified by intervention.

The above figure shows that the data are close to symmetry and the assumption of a normal distribution is reasonable.

ggqqplot(depression, "HDRS", color = "intervention", conf.int = F) +
  #theme_prism(base_size = 20, base_line_size = 0.4, palette = "office") +
  scale_color_jco() +
  facet_wrap(~ intervention) + 
  theme(legend.position = "none")

Figure 20.3: Normality Q-Q plot for HDRS for paroxetine and placebo.

Summary statistics

The HDRS summary statistics for each group are:

Summary statistics by group

depression |> 
  group_by(intervention) |> 
  dplyr::summarise(
    n = n(),
    na = sum(is.na(HDRS)),
    min = min(HDRS, na.rm = TRUE),
    q1 = quantile(HDRS, 0.25, na.rm = TRUE),
    median = quantile(HDRS, 0.5, na.rm = TRUE),
    q3 = quantile(HDRS, 0.75, na.rm = TRUE),
    max = max(HDRS, na.rm = TRUE),
    mean = mean(HDRS, na.rm = TRUE),
    sd = sd(HDRS, na.rm = TRUE),
    skewness = EnvStats::skewness(HDRS, na.rm = TRUE),
    kurtosis= EnvStats::kurtosis(HDRS, na.rm = TRUE)
  ) |> 
  ungroup() |> 
  print(width = 100)

# A tibble: 2 × 12
  intervention     n    na   min    q1 median    q3   max  mean    sd skewness
  <fct>        <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1 paroxetine      33     0    13    18     21    22    27  20.3  3.65  0.00167
2 placebo         43     0    14    19     21    24    28  21.5  3.41  0.0276 
  kurtosis
     <dbl>
1   -0.574
2   -0.403

depression |>  
  group_by(intervention) |>  
  dlookr::describe(HDRS) |>  
  dplyr::select(intervention, n, na, mean, sd, p25, p50, p75, skewness, kurtosis) |>  
  ungroup() |> 
  print(width = 100)

# A tibble: 2 × 10
  intervention     n    na  mean    sd   p25   p50   p75 skewness kurtosis
  <fct>        <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 paroxetine      33     0  20.3  3.65    18    21    22  0.00167   -0.574
2 placebo         43     0  21.5  3.41    19    21    24  0.0276    -0.403

depression |> 
ds_group_summary(intervention, HDRS)

                       HDRS by intervention                         
-------------------------------------------------------------------
|     Statistic/Levels|           paroxetine|              placebo|
-------------------------------------------------------------------
|                  Obs|                   33|                   43|
|              Minimum|                   13|                   14|
|              Maximum|                   27|                   28|
|                 Mean|                20.33|                21.49|
|               Median|                   21|                   21|
|                 Mode|                   18|                   19|
|       Std. Deviation|                 3.65|                 3.41|
|             Variance|                13.35|                11.64|
|             Skewness|                    0|                 0.03|
|             Kurtosis|                -0.57|                 -0.4|
|       Uncorrected SS|                14071|                20344|
|         Corrected SS|               427.33|               488.74|
|      Coeff Variation|                17.97|                15.87|
|      Std. Error Mean|                 0.64|                 0.52|
|                Range|                   14|                   14|
|  Interquartile Range|                    4|                    5|
-------------------------------------------------------------------

The means are close to medians (20.3 vs 21 and 21.5 vs 21). The skewness is approximately zero (symmetric distribution) and the (excess) kurtosis falls into the acceptable range of [-1, 1] indicating approximately normal distributions for both groups.

Normality test

Hypothesis testing for Shapiro-Wilk test for normality

$H_{0}$ : the data came from a normally distributed population
$H_{1}$ : the data tested are not normally distributed

The Shapiro-Wilk test for normality for each group is:

depression |> 
  group_by(intervention) |> 
  shapiro_test(HDRS) |> 
  ungroup()

# A tibble: 2 × 4
  intervention variable statistic     p
  <fct>        <chr>        <dbl> <dbl>
1 paroxetine   HDRS         0.976 0.670
2 placebo      HDRS         0.979 0.614

The tests of normality suggest that the data for the HDRS in both groups are normally distributed (p=0.67 >0.05 and p=0.61 >0.05, respectively).

Normality tests frequently fail to be valuable indicators

For small sample sizes, the Shapiro-Wilk test (and other normality tests) has little power to reject the null hypothesis (under-powered test).
If the sample size is large normality tests may detect even trivial deviations from the normal distribution (over-powered test).

Comment

The decision about normality of data should be based on a careful consideration of all available information such as graphs (histograms, Q-Q plots), summary and shape measures and statistical tests.

20.4.2 B. Check Levene’s test for equality of variances

Hypothesis testing for Levene’s test for equality of variances

$H_{0}$ : the variances of data in two groups are equal
$H_{1}$ : the variances of data in two groups are not equal

The Levene’s test for equality of variances is:

depression |> 
  levene_test(HDRS ~ intervention)

# A tibble: 1 × 4
    df1   df2 statistic     p
  <int> <int>     <dbl> <dbl>
1     1    74     0.176 0.676

Since the p-value = 0.676 >0.05, the $H_{o}$ is not rejected. The variances are supposed to be equal.

Comment

If the assumption of equal variances is not satisfied ( $H_{o}$ of Levene’s test is rejected), the Welch’s t-test can be applied. This statistical test assumes normality but does not assume equal variances.

20.5 Run the t-test

20.5.1 Formulas

We will perform a pooled variance t-test (Student’s t-test) to test the null hypothesis that the mean HDRS score is the same for both groups (paroxetine and placebo).

Under this $H_{o}$ , the test statistic is:

$\begin{matrix} (20.1) & t = \frac{{\bar{x}}_{1} - {\bar{x}}_{2}}{S E_{{\bar{x}}_{1} - {\bar{x}}_{2}}} = \frac{{\bar{x}}_{1} - {\bar{x}}_{2}}{s_{p} \cdot \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} \end{matrix}$

where $n_{1}$ and $n_{2}$ are the sample sizes for paroxetine and placebo groups respectively, and $s_{p}$ is an estimate of the pooled standard deviation of the two groups which is calculated by the following equation:

$\begin{matrix} (20.2) & s_{p} = \sqrt{\frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}} \end{matrix}$

Under the null hypothesis, the t statistic follows the t-distribution with $n - 2$ degrees of freedom (d.f.).

The 95% confidence interval of the mean difference is:

$\begin{matrix} (20.3) & 95 % C I = {\bar{x}}_{1} - {\bar{x}}_{2} \pm t_{d f; a / 2}^{*} * S E_{{\bar{x}}_{1} - {\bar{x}}_{2}} \end{matrix}$

INFO

Sample size of the groups: The Student t-test does not have any restrictions on $n_{1}$ and $n_{2}$ —they can be equal or unequal. However, equal samples are preferred because this maximizes the power to detect a specified difference.
Degrees of freedom: The paroxetine group has $d f_{1} = n_{1} - 1$ and the placebo group has $d f_{2} = n_{2} - 1$ , so we have $d f = n_{1} + n_{2} - 2 = n - 2$ in total. Another way of thinking of this is that the complete sample size is $n$ , and we have estimated two parameters from the data (the two means), so we have $d f = n - 2$ (see also Chapter 14).

20.5.2 In R:

First, we calculate the mean difference:

mean_1 <- mean(depression$HDRS[depression$intervention == "paroxetine"])
mean_2 <- mean(depression$HDRS[depression$intervention == "placebo"])

mean_dif <- mean_1 - mean_2
mean_dif

[1] -1.155039

Second, we find the pooled standard deviation:

n_1 <- length(depression$HDRS[depression$intervention == "paroxetine"])
n_2 <- length(depression$HDRS[depression$intervention == "placebo"])
st_div_1 <- sd(depression$HDRS[depression$intervention == "paroxetine"])
st_div_2 <- sd(depression$HDRS[depression$intervention == "placebo"])
numerator <- (n_1-1)*st_div_1^2 + (n_2-1)*st_div_2^2
denominator <- n_1 + n_2 - 2

pooled_st_div <- sqrt(numerator/denominator)
pooled_st_div

[1] 3.518441

Third, we find the standard error of the mean difference:

st_error <- pooled_st_div*sqrt(1/n_1 + 1/n_2)
st_error

[1] 0.8142652

Therefore, the t statistic is:

t <- mean_dif / st_error
t

[1] -1.418504

The corresponding probability for this value can be calculated as the cumulative probability P(T ≤ t) for n-2 degrees of freedom. Then, the p-value is 2*P(T ≤ t) because we are doing a two-tailed test.

P <- pt(t, df = n_1 + n_2 - 2)

p_value <- 2*P
p_value

[1] 0.1602415

The 95% confidence interval of the mean difference is:

lower_CI <- mean_dif - qt(0.025, df = 74, lower.tail = FALSE)*st_error
lower_CI

[1] -2.777498

upper_CI <- mean_dif + qt(0.025, df = 74, lower.tail = FALSE)*st_error
upper_CI

[1] 0.46742

Note that the 95% confidence interval of the mean difference (-2.78, 0.47) includes the hypothesized null value of 0.

Next, we present R functions to carry out all the tasks on our behalf.

Student’s t-test

stats
rstatix

t.test(HDRS ~ intervention, var.equal = T, data = depression)


    Two Sample t-test

data:  HDRS by intervention
t = -1.4185, df = 74, p-value = 0.1602
alternative hypothesis: true difference in means between group paroxetine and group placebo is not equal to 0
95 percent confidence interval:
 -2.777498  0.467420
sample estimates:
mean in group paroxetine    mean in group placebo 
                20.33333                 21.48837

depression |> 
  t_test(HDRS ~ intervention, var.equal = T, detailed = T)

# A tibble: 1 × 15
  estimate estimate1 estimate2 .y.   group1   group2    n1    n2 statistic     p
*    <dbl>     <dbl>     <dbl> <chr> <chr>    <chr>  <int> <int>     <dbl> <dbl>
1    -1.16      20.3      21.5 HDRS  paroxet… place…    33    43     -1.42  0.16
# ℹ 5 more variables: df <dbl>, conf.low <dbl>, conf.high <dbl>, method <chr>,
#   alternative <chr>

NOTE: If we reject the null hypothesis of Levene’s test, we have to type var.equal = F (or type nothing as this is the default), so the Welch’s t-test is applied.

The difference between means (20.33 - 21.49) equals -1.16 units and it is not significant (failed to reject $H_{0}$ ; p = 0.16 > 0.05).

20.6 Present the results

Summary table

It is common practice to report the mean (sd) for each group in summary tables.

Show the code

depression |>  
  tbl_summary(
    by = intervention, 
    statistic = HDRS ~ "{mean} ({sd})", 
    digits = list(everything() ~ 1),
    label = list(HDRS ~ "HDRS score"), 
    missing = c("no")) |>  
    add_difference(test.args = all_tests("t.test") ~ list(var.equal = TRUE),
                   estimate_fun = HDRS ~ function(x) style_sigfig(x, digits = 2),
                   pvalue_fun = function(x) style_pvalue(x, digits = 2)) %>% 
  add_n()

Characteristic	N	paroxetine, N = 33¹	placebo, N = 43¹	Difference²	95% CI^2,3	p-value²
HDRS score	76	20.3 (3.7)	21.5 (3.4)	-1.2	-2.8, 0.47	0.16
¹ Mean (SD)
² Two Sample t-test
³ CI = Confidence Interval

Report the results

There is also a specific package with the name {report} that may be useful in reporting the results of the t-test:

report_results <- t.test(depression$HDRS ~ depression$intervention, var.equal = T) 
report(report_results)

Effect sizes were labelled following Cohen's (1988) recommendations.

The Two Sample t-test testing the difference of depression$HDRS by
depression$intervention (mean in group paroxetine = 20.33, mean in group
placebo = 21.49) suggests that the effect is negative, statistically not
significant, and small (difference = -1.16, 95% CI [-2.78, 0.47], t(74) =
-1.42, p = 0.160; Cohen's d = -0.33, 95% CI [-0.78, 0.13])

We can use the above information to write up a final report:

Final report

There is not evidence that HDRS score is significantly different in paroxetine group, mean = 20.3 (sd = 3.7), as compared to placebo group, 21.5 (3.4), (mean difference= -1.2 units, 95% CI = -2.8 to 0.47, p = 0.16).