31 Fisher’s exact test
If we want to see whether there’s an association between two categorical variables and the assumption for the expected frequencies in the contingency table is not fulfilled, an alternative test to the chi-square test can be used.
Fisher came up with a method for computing the exact probability of the chi-square statistic that is accurate when sample sizes are small. This method is called Fisher’s exact test even though it’s not so much a test as a way of computing the exact probability of the chi-square statistic. This procedure is normally used on 2×2
contingency tables and with small samples. However, it can be used on larger contingency tables and with large samples, but in this case it becomes computationally intensive.
When we have finished this Chapter, we should be able to:
31.1 Research question and Hypothesis Testing
We consider the data in hemophilia dataset. In a survey there are two treatment regimens studied for controlling bleeding in 28 patients with hemophilia undergoing surgery. We want to investigate if there is an association between the treatment regimen (treatment A or B) and the bleeding complications (no or yes). The null hypothesis (\(H_0\)) is that the bleeding complications are independent from the treatment regimen, while the alternative (\(H_1\)) is that are dependent.
NOTE: In practice, the null hypothesis of independence, for our particular question, is no difference in the proportion of patients with bleeding complications compared with patients with no bleeding complications (\(p_{bleeding} = p_{no bleeding}\)).
31.2 Packages we need
We need to load the following packages:
31.3 Preparing the data
We import the data meldata in R:
library(readxl)
hemophilia <- read_excel(here("data", "hemophilia.xlsx"))
We inspect the data and the type of variables:
glimpse(hemophilia)
Rows: 28
Columns: 2
$ treatment <chr> "A", "A", "A", "B", "A", "B", "B", "A", "A", "A", "B", "A", …
$ bleeding <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "yes", "no"…
The dataset hemophilia has 28 patients (rows) and includes 2 variables (columns), the character (<chr>
) variable named treatment
and the character (<chr>
) variable named bleeding
. Both of them should be converted to factor (<fct>
) variables using the convert_as_factor()
function as follows:
hemophilia <- hemophilia %>%
convert_as_factor(treatment, bleeding)
glimpse(hemophilia)
Rows: 28
Columns: 2
$ treatment <fct> A, A, A, B, A, B, B, A, A, A, B, A, B, B, A, A, B, B, B, A, …
$ bleeding <fct> no, no, no, yes, no, no, no, no, yes, no, no, no, yes, no, n…
31.4 Plot the data
We count the number of patients with bleeding in the two regimens. It is useful to plot this as counts but also as percentages and compare them.
p3 <- hemophilia %>%
ggplot(aes(x = treatment, fill = bleeding)) +
geom_bar(width = 0.7) +
scale_fill_jama() +
theme_bw(base_size = 14) +
theme(legend.position = "bottom")
p4 <- hemophilia %>%
ggplot(aes(x = treatment, fill = bleeding)) +
geom_bar(position = "fill", width = 0.7) +
scale_y_continuous(labels=scales::percent) +
scale_fill_jama() +
ylab("Percentage") +
theme_bw(base_size = 14) +
theme(legend.position = "bottom")
p3 + p4 +
plot_layout(guides = "collect") & theme(legend.position = 'bottom')
The above bar plots with counts show graphically that the number of patients who had bleeding complications was similar in the two regimens. Note that the number of patients included in the study is small (n=28).
31.5 Contigency table and Expected frequencies
First, we will create a contingency 2x2 table (two categorical variables with exactly two levels each) with the frequencies using the Base R.
tb2 <- table(hemophilia$treatment, hemophilia$bleeding)
tb2
no yes
A 13 2
B 10 3
Next, we will also create a more informative table with row percentages and marginal totals.
Using the function summary_factorlist()
which is included in finalfit package for obtaining row percentages and marginal totals:
row_tb2 <- hemophilia %>%
finalfit::summary_factorlist(dependent = "bleeding", add_dependent_label = T,
explanatory = "treatment", add_col_totals = T,
include_col_totals_percent = F,
column = FALSE, total_col = TRUE)
knitr::kable(row_tb2)
Dependent: bleeding | no | yes | Total | |
---|---|---|---|---|
Total N | 23 | 5 | 28 | |
treatment | A | 13 (86.7) | 2 (13.3) | 15 (100) |
B | 10 (76.9) | 3 (23.1) | 13 (100) |
The contingency table using the datasummary_crosstab()
from the modelsummary package:
modelsummary::datasummary_crosstab(treatment ~ bleeding, data = hemophilia)
treatment | no | yes | All | |
---|---|---|---|---|
A | N | 13 | 2 | 15 |
% row | 86.7 | 13.3 | 100.0 | |
B | N | 10 | 3 | 13 |
% row | 76.9 | 23.1 | 100.0 | |
All | N | 23 | 5 | 28 |
% row | 82.1 | 17.9 | 100.0 |
From the row frequencies, there is not actually difference, as we noted in the plot we made above.
Now, we will calculate the expected frequencies for each cell using the expected()
function from {epitools}
package:
epitools::expected(tb2)
no yes
A 12.32143 2.678571
B 10.67857 2.321429
In the above table there are 2 cells (50%) with expected counts less than 5 (specifically 2.67 and 2.32), so the Chi-square test is not the appropriate one. In this case the Fisher’s exact test should be used instead.
31.6 Run Fisher’s exact test
Finally, we run the Fisher’s exact test:
fisher.test(tb2)
Fisher's Exact Test for Count Data
data: tb2
p-value = 0.6389
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1807204 26.9478788
sample estimates:
odds ratio
1.90363
fisher_test(tb2)
# A tibble: 1 × 3
n p p.signif
* <int> <dbl> <chr>
1 28 0.639 ns
The p = 0.64 is higher than 0.05. There is absence of evidence for an association between the treatment regimens and bleeding complications (failed to reject \(H_0\)).
31.7 Having only the counts
When we read an article which reports a chi-square or a fisher exact analysis we will see only the counts in a table without having the raw data of the categorical variables. In this instance, we can create the table using the matrix()
function and run the tests. For our example of hemophilia we have the following table: