35  Measures of reliability

In this chapter, we explore measures of relative and absolute reliability (or agreement) that are used to assess the consistency, stability, and reproducibility of measurements or judgments.

When we have finished this chapter, we should be able to:

Learning objectives
  • Understand the concepts of relative and absolute reliability
  • Apply appropriate measures of reliability
  • Interpret the results of reliability analysis

35.1 Relative and absolute reliability

Two distinct types of reliability are used: relative reliability and absolute reliability (agreement) (Kottner and Streiner 2011).

  • Relative Reliability is defined as the ratio of variability between scores of the same subjects (e.g., by different raters or at different times) to the total variability of all scores in the sample. Reliability coefficients, such as the intra-class correlation coefficient for numerical data or Cohen’s kappa for categorical data, are employed as suitable metrics for this purpose.

  • Absolute Reliability (or Agreement) pertains to the assessment of whether scores or judgments are identical or comparable, as well as the extent to which they might differ. Typical statistical measures employed to quantify this degree of error are the standard error of measurement (SEM) and the limits of agreement (LOA) for numerical data.

35.2 Packages we need

We need to load the following packages:

# packages for graphs and tables
library(ggExtra)
library(janitor)

# packages for reliability analysis
library(irr)
library(irrCAC)
library(SimplyAgree)
library(blandr)
library(BlandAltmanLeh)
library(flextable)
library(vcd)

# packages for data import and manipulation
library(here)
library(tidyverse)

35.3 Reliability for continuous measurements

35.3.1 Research question

The parent version of the Gait Outcomes Assessment List (GOAL) questionnaire is a parent-reported outcome assessment of family priorities and functional mobility for ambulatory children with cerebral palsy. We aim to examine the test–retest reliability of the GOAL total score (score range: 0–100), which is an indicator of the stability of the questionnaire.

Test-Retest Reliability

Test-retest reliability is used to assess the consistency and stability of a measurement tool (e.g. self-report survey instrument) over time on the same subjects under the same conditions. Specifically, assessing test-retest reliability involves administering the measurement tool to a group of individuals initially (time 1), subsequently reapplying it to the same group at a later time (time 2), and finally examining the correlation between the two sets of scores obtained.

Table 35.1: Measurements at two time points, Time 1 and Time 2.
ID Measurement 1 (time 1) Measurement 2 (time 2)
1 \(x_{11}\) \(x_{12}\)
2 \(x_{21}\) \(x_{22}\)
… … …
n \(x_{n1}\) \(x_{n2}\)

35.3.2 Preparing the data

The GOAL questionnaire was completed twice, 30 days apart, in a prospective cohort study of 127 caregivers of children with cerebral palsy and the data were recorded as follows:

library(readxl)
goal <- read_excel(here("data", "goal.xlsx"))
Figure 35.1: Table with data from “goal” file.

We will begin our investigation into the association between the first (GOAL1) and the second (GOAL2) measurement by generating a scatter plot:

ggplot(goal, aes(GOAL1, GOAL2)) +
  geom_point(color = "black", size = 2) +
  lims(x = c(0, 100), y = c(0,100)) +
  geom_abline(intercept = 0, slope = 1, linewidth = 1.0, color = "blue") +
  coord_fixed(ratio = 1)
Figure 35.2: Scatter plot of the association between individual GOAL scores on the test and retest measurements (n = 127).

The scatter plot compares the GOAL1 and GOAL2 total scores. The solid blue diagonal line is the line of equality (i.e., the reference line Y = X), which represents perfect agreement between the two measurements.

35.3.3 True measurement and the measurement error

The observed scores (X) from an instrument are thought to be composed of the underlying true score (T) and the total (systematic and random) error of a measurement (E):

\[ X = True + Error \tag{35.1}\]

If the true measurement and the error term are uncorrelated, the measurement variance, \(\sigma^2_X\), is given by:

\[ \sigma^2_X = \sigma^2_{True} + \sigma^2_{Error} \tag{35.2}\]
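To illustrate Equation 35.2, here is a short simulation with made-up values (a sketch, not the GOAL data): when the true scores and the errors are uncorrelated, the observed variance is approximately the sum of the two variance components.

# simulate hypothetical true scores and independent measurement errors
set.seed(1)
true_score <- rnorm(10000, mean = 50, sd = 10)  # true-score variance = 100
error <- rnorm(10000, mean = 0, sd = 3)         # error variance = 9
x <- true_score + error                         # observed scores

# the observed variance should be close to 100 + 9 = 109
var(x)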

Comment

The total variability can be broken down into smaller components based on characteristics of the study design.

35.3.4 Intra-class correlation coefficient (ICC)

Test-retest data for continuous measurements are often assessed using the intra-class correlation coefficient \(ρ_{ICC}\). The \(ρ_{ICC}\) is a ratio generally defined as:

\[ ρ_{ICC} =\frac{\sigma^2_{True}}{\sigma^2_{X}} \tag{35.3}\]

The \(ρ_{ICC}\) correlation coefficient ranges from 0 to 1 where higher values indicate better test-retest reliability. The population intra-class coefficient \(ρ_{ICC}\) can be estimated using the ICC index. There are three different types of ICC representing different mathematical models:

  • one-way random effects model ICC(1)
  • two-way random effects model ICC(A,1)
  • two-way mixed-effects model ICC(C,1)

where A stands for “Agreement” and C stands for “Consistency”.

The choice of the appropriate ICC model depends on several factors, including how the data were collected, which variance components are considered relevant, and the specific type of reliability (agreement or consistency) we intend to assess (Liljequist, Elfving, and Skavberg Roaldsen 2019).

In the context of our example, it is recommended to use the “two-way” model rather than the “one-way” model and “agreement” rather than “consistency”, as systematic differences in the individual scores on the GOAL instrument over time are of interest (Qin et al. 2018).
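For reference, these three single-measure models can be requested from the icc() function of the {irr} package through its model, type, and unit arguments (a sketch with output omitted; only the two-way agreement model is used in the analysis below):

# one-way random effects model, ICC(1)
icc(goal, model = "oneway", unit = "single")

# two-way model for absolute agreement, ICC(A,1)
icc(goal, model = "twoway", type = "agreement", unit = "single")

# two-way model for consistency, ICC(C,1)
icc(goal, model = "twoway", type = "consistency", unit = "single")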

For this model, Equation 35.1 becomes:

\[ X = μ + ID + M + Residual \]

where \(μ\) is a constant (the mean value of X for the whole population), the ID term is the difference due to subjects, M is the difference due to measurements (in our case, different time points), and the Residual is the random error.

The variance \(\sigma^2_X\) for this model consists of three variance components:

\[ \sigma^2_X=\sigma^2_{ID} +\sigma^2_{M} +\sigma^2_{Residual} \]

In the population, the intra-class correlation coefficient, \(ρ_{A, 1}\), is defined as:

\[ ρ_{A, 1} = \frac{\sigma^2_{ID}}{\sigma^2_{ID} +\sigma^2_{M} +\sigma^2_{Residual}} \tag{35.4}\]

where \(\sigma^2_{ID}\), \(\sigma^2_{M}\), and \(\sigma^2_{Residual}\) are the variances of subjects (ID), measurements (M) and error (Residual), respectively.

A statistical estimate of the \(ρ_{A, 1}\) for agreement is given by the ICC(A,1) formula (McGraw and Wong 1996; Koo and Li 2016; Qin et al. 2018; Liljequist, Elfving, and Skavberg Roaldsen 2019):

\[ ICC(A, 1) = \frac{MSB-MSE}{MSB + (k-1) MSE + \frac{k}{n}(MSM-MSE)} \tag{35.5}\]

where \(MSB\) = Mean Square Between subjects, \(MSE\) = Mean Square Error, \(MSM\) = Mean Square (Between) Measurements, \(n\) = number of subjects, and \(k\) = number of measurements.

Comment

In test-retest reliability analysis there are only two measurements (Measurement 1 and Measurement 2; see Table 35.1) at different time points, so k = 2.

Table 35.2 provides a categorization of the ICC index (Koo and Li 2016), although the interpretation of ICC values can be somewhat arbitrary.

Table 35.2: Interpretation of intra-class correlation coefficient (ICC).
ICC Level of agreement
<0.50 Poor
0.50 - 0.74 Moderate
0.75 - 0.89 Good
0.90 - 1.00 Very good
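For convenience, the categorization in Table 35.2 can be wrapped in a small helper function (a hypothetical helper written for this example, not part of any loaded package):

# classify an ICC value according to Table 35.2 (Koo and Li 2016)
icc_level <- function(icc) {
  cut(icc,
      breaks = c(-Inf, 0.50, 0.75, 0.90, 1.00),
      labels = c("Poor", "Moderate", "Good", "Very good"),
      right = FALSE, include.lowest = TRUE)
}

icc_level(0.96)   # "Very good"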

In R:

First, we convert data from a wide format to a long format using the pivot_longer() function:

# convert data into long format
goal_long <- goal |> 
  mutate(ID = row_number()) |> 
  pivot_longer(cols = c("GOAL1", "GOAL2"), 
             names_to = "M", values_to = "score") |> 
  mutate(ID = as.factor(ID),
         M = factor(M))

head(goal_long)
# A tibble: 6 × 3
  ID    M     score
  <fct> <fct> <dbl>
1 1     GOAL1    22
2 1     GOAL2    21
3 2     GOAL1    36
4 2     GOAL2    27
5 3     GOAL1    35
6 3     GOAL2    36

Then, a two-way analysis of variance is applied with the subjects (ID) and the measurements (M) as factors:

aov.goal <- aov(score ~ ID + M, data = goal_long)
s.aov <- summary(aov.goal)
s.aov
             Df Sum Sq Mean Sq F value Pr(>F)    
ID          126  58691   465.8  49.219 <2e-16 ***
M             1     34    34.1   3.598 0.0601 .  
Residuals   126   1192     9.5                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The statistical results are arranged within a matrix and we extract the mean squares of interest:

stats <- matrix(unlist(s.aov), ncol = 5)
stats
     [,1]        [,2]      [,3]      [,4]         [,5]
[1,]  126 58691.37795 465.80459 49.219201 6.192816e-72
[2,]    1    34.05118  34.05118  3.598015 6.013766e-02
[3,]  126  1192.44882   9.46388        NA           NA
MSB <- stats[1,3]
MSM <- stats[2,3]
MSE <- stats[3,3]

Finally, we calculate the ICC(A, 1) for agreement based on the Equation 35.5:

n <- dim(goal)[1]
k <- dim(goal)[2]

iccA1 <- (MSB - MSE)/(MSB + (k - 1) * MSE + k/n * (MSM - MSE))
iccA1
[1] 0.959393

Next, we present R functions that carry out all the tasks for a reliability analysis.

Summary statistics of reliability analysis
icc(goal, model = "twoway", type = "agreement")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : agreement 

   Subjects = 127 
     Raters = 2 
   ICC(A,1) = 0.959

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
 F(126,121) = 49.2 , p = 1.48e-69 

 95%-Confidence Interval for ICC Population Values:
  0.943 < ICC < 0.971
SimplyAgree::jmvreli(
  data = goal,
  vars = vars(GOAL1, GOAL2),
  desc = TRUE)


 RELIABILITY ANALYSIS

 Coefficient of Variation (%): 4.83
 Standard Error of Measurement (SEM): 3.0763
 Standard Error of the Estimate (SEE): 4.2643
 Standard Error of Prediction (SEP): 6.0928

 Intraclass Correlation Coefficients                                                      
 ──────────────────────────────────────────────────────────────────────────────────────── 
   Model             Measures            Type     ICC          Lower C.I.    Upper C.I.   
 ──────────────────────────────────────────────────────────────────────────────────────── 
   one-way random           Agreement    ICC1     0.9593765     0.9458943     0.9695576   
   two-way random           Agreement    ICC2     0.9593930     0.9457079     0.9696539   
   two-way fixed          Consistency    ICC3     0.9601747     0.9469150     0.9701731   
   one-way random      Avg. Agreement    ICC1k    0.9792671     0.9721950     0.9845435   
   two-way random      Avg. Agreement    ICC2k    0.9792757     0.9720965     0.9845932   
   two-way fixed     Avg. Consistency    ICC3k    0.9796828     0.9727338     0.9848608   
 ──────────────────────────────────────────────────────────────────────────────────────── 


 Variance Components                         
 ─────────────────────────────────────────── 
   Component    Variance       Percent       
 ─────────────────────────────────────────── 
   ID           228.1706416     0.95939304   
   Items          0.1936130    8.140880e-4   
   Residual       9.4638647     0.03979288   
   Total        237.8281193     1.00000000   
 ─────────────────────────────────────────── 

The test–retest relative reliability between the initial measurement and the measurement recorded 30 days later was very good according to the intra-class correlation coefficient, ICC(A,1) = 0.96 (95% CI: 0.94–0.97).

Comment

The result of the significance test is not of interest here, because a high correlation between two repeated measurements obtained from the same individuals is expected.

35.3.5 Standard Error of Measurement (SEM)

The standard error of measurement (SEM; not to be confused with the standard error of the mean) is considered a measure that quantifies the amount of measurement error within an instrument, thus serving as an absolute estimate of the instrument’s reliability. The SEM is defined as the square root of the error variance (Lek and Van De Schoot 2018):

\[ SEM =\sqrt{\sigma^2_{Error}} \tag{35.6}\]

Now, let’s see how the SEM and \(\rho_{ICC}\) are mathematically related (Tesio 2012). From Equation 35.2, the error variance of randomly selected subjects is simply the difference between the total variance of the observed scores and the true-score variance:

\[ \sigma^2_{Error} = \sigma^2_X - \sigma^2_{True} \]

We multiply and divide the right-hand side of the equation by \(\sigma^2_X\):

\[ \begin{aligned} \sigma^2_{Error} &= \sigma^2_X \cdot \frac{(\sigma^2_X - \sigma^2_{True})}{\sigma^2_X} \\ &= \sigma^2_X \cdot (1 - \frac{\sigma^2_{True}}{\sigma^2_X}) \end{aligned} \tag{35.7}\]

and, using Equation 35.3, this becomes:

\[ \begin{aligned} \sigma^2_{Error} &= \sigma^2_X \cdot (1 - \rho_{ICC}) \end{aligned} \tag{35.8}\]

Finally, the SEM is obtained by taking the square root of both sides of Equation 35.8:

\[ SEM = \sigma_X \cdot \sqrt{1-\rho_{ICC}} \tag{35.9}\]

If \(\rho_{ICC}\) equals 0, the SEM will equal the standard deviation of the observed test scores. Conversely, if \(\rho_{ICC}\) equals 1, the SEM is zero.

Comment

In practice, \(σ_{X}\) is estimated from the sample data. Typically, in a test-retest analysis, it is either the standard deviation of the baseline score (Beninato and Portney 2011) or the higher of the two scores’ standard deviations (Goldberg and Schepens 2011). Using the score with the higher standard deviation decreases the chances of underestimating the minimum detectable change (MDC) (see Equation 35.11 below). Therefore, using the \(ICC(A,1)\) for agreement, Equation 35.9 becomes:

\[ SEM = SD_X \cdot \sqrt{1-ICC(A,1)} \tag{35.10}\]

# choose the higher standard deviation between GOAL1 and GOAL2
SD <- max(c(sd(goal$GOAL1), sd(goal$GOAL2)))

# calculate the Standard Error of Measurement for agreement
sem <- SD * sqrt(1 - iccA1)
sem
[1] 3.14089

Note that the higher the ICC, the smaller the SEM, which implies higher reliability, as it indicates that the observed scores are close to the true scores. The SEM estimates the variability of the observed scores likely to be obtained given an individual’s true score (McManus 2012; Harvill 1991). In our example, the SEM is approximately 3 units. This means that there is a probability of 0.95 that an individual’s observed score lies within ±6 units of the true score (Observed score = True score ± 2·SEM), assuming normally distributed errors.
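As a numerical illustration (a sketch using a hypothetical observed score of 60 and the sem value computed above):

# hypothetical observed GOAL score
obs <- 60

# approximate 95% interval of ±1.96·SEM around the hypothetical score
obs + c(-1, 1) * qnorm(0.975) * sem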

Comment

In a test-retest analysis, where there are only two measurements at different times, the SEM is often approximated by the formula:

\[ SEM = \frac{SD_{dif}}{\sqrt{2}} \]

where \(SD_{dif}\) is the standard deviation of the differences in scores.

# calculate the differences between the scores
dif <- goal$GOAL1 - goal$GOAL2

sd(dif)/sqrt(2)
[1] 3.076342

35.3.6 Minimum Detectable Change (MDC)

Now, the question that arises is whether a given change in score between test and retest is likely to be beyond the expected level of measurement error (real difference). In this case, the key measure to consider is the minimum detectable change (MDC) defined as (Goldberg and Schepens 2011):

\[ MDC = \sqrt{2} \cdot z_{a/2} \cdot SEM \tag{35.11}\]

where \(\sqrt{2}\) is introduced because we are interested in the change between two measurements (test and retest), and \(z_{\alpha/2}\) is the z-score associated with the 95% confidence level.

# calculate the z-value for a/2 = 0.05/2 = 0.025
z <- qnorm(0.025, mean = 0, sd = 1, lower.tail = FALSE)

# calculate the minimum detectable change
mdc <- sqrt(2) * z * sem
mdc
[1] 8.705942

Both random and systematic errors are taken into account in the MDC. In our example, an individual change in score smaller than 9 units can be due to measurement error and may not be a real change.

Comment

To define a 95% CI outside which one could be confident that a retest score reflects a real change, we should use a more complicated approach that uses the standard error of prediction (SEP; \(SEP = SD_X \cdot \sqrt{1-ICC^2}\)) instead of the SEM. This method can yield a more accurate result (Weir 2005).
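A minimal sketch of the SEP calculation from this formula, using the SD and ICC(A,1) estimated above (the result may differ slightly from the SEP reported by {SimplyAgree}, depending on which standard deviation is used):

# standard error of prediction (SEP)
sep <- SD * sqrt(1 - iccA1^2)
sep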

35.3.7 Limits of Agreement (LOA) and Bland-Altman Plot

  • Limits of Agreement (LOA)

If the differences between the two scores (GOAL1 - GOAL2) follow a normal distribution, we expect that approximately 95% of the differences will fall within the following range (see Chapter 15):

\[ 95\%\ LOA = \overline{dif} \pm 1.96 \cdot SD_{dif} \tag{35.12}\]

where \(\overline{dif}\) is the mean of the differences (bias), and \(SD_{dif}\) is the standard deviation of the differences that measure random fluctuations around this mean.

# calculate the mean of the differences (MD)
MD <- mean(dif)
MD
[1] -0.7322835
# compute lower limit of 95% LOA
lower_LOA <- mean(dif) - z * sd(dif)
lower_LOA
[1] -9.259311
# compute upper limit of 95% LOA
upper_LOA <- mean(dif) + z * sd(dif)
upper_LOA
[1] 7.794745

In our example, the mean of the differences (bias) is −0.73 (i.e., the scores at retest are on average 0.73 units higher than the scores at the first test), which is a small difference. Additionally, the limits of agreement indicate that approximately 95% of the differences are expected to lie in the range of −9 to 8 units.

  • The Bland-Altman method

The Bland-Altman method uses a scatter plot to quantify the measurement bias and a range of agreement by constructing 95% limits of agreement (LOA). The basic assumption of the Bland-Altman method is that the differences are normally distributed.

# calculate the mean of GOAL1 and GOAL2
mean_goal12 <- (goal$GOAL1 + goal$GOAL2)/2

# create a data frame with the means and the differences
dat_BA <- data.frame(mean_goal12, dif)

# Bland-Altman plot
BA_plot <- ggplot(dat_BA, aes(x = mean_goal12, y = dif)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5, linetype = "dashed") +
  geom_hline(yintercept = MD, color = "blue", linewidth = 1.0) +
  geom_hline(yintercept = lower_LOA, color = "red", linewidth = 0.5) +
  geom_hline(yintercept = upper_LOA, color = "red", linewidth = 0.5) +
  labs(title = "Bland-Altman plot of test-retest reliability", 
       x = "Mean of GOAL1 and GOAL2", 
       y = "GOAL1 - GOAL2") +
  annotate("text", x = 90, y = 8.6, label = "Upper LOA", color = "red", size = 4.0) +
  annotate("text", x = 90, y = -8.6, label = "Lower LOA", color = "red", size = 4.0) +
  annotate("text", x = 20, y = -1.5, label = "MD", color = "blue", size = 4.0) +
  annotate("text", x = 31.2, y = -0.2, label = "bias", color = "#EA74FC", size = 4) +
  geom_segment(x = 28, y = 0.1, xend = 28, yend = -0.85, linewidth = 0.8, colour = "#EA74FC",
               arrow = arrow(length = unit(0.07, "inches"), ends = "both"))

# add a marginal density plot of the differences on the right-hand side of the graph
ggMarginal(BA_plot, type = "density", margins = 'y',
           yparams = list(fill = "#61D04F"))
Figure 35.3: Plot of the differences between GOAL1 and GOAL2 vs. the mean of the two measurements. The bias of −0.73 units is denoted by the gap (purple arrow) between the line of zero difference (black dashed line) and the solid blue line of the mean difference.

For each pair of measurements, the difference between the two measurements is plotted on the Y axis, and the mean of the two measurements on the X axis. We can check the distribution of the differences by examining the marginal green density plot on the right-hand side of the graph. In our example, the normality assumption is met; however, if the distribution is skewed or has very long tails, the assumption of normality might not hold.
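If a visual check is not considered sufficient, the normality of the differences can also be examined with a Q-Q plot or a Shapiro-Wilk test (a supplementary sketch; the conclusion above is based on the marginal plot):

# Q-Q plot of the test-retest differences
qqnorm(dif)
qqline(dif)

# Shapiro-Wilk test of normality
shapiro.test(dif)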

The mean of the differences (MD), represented by the solid blue line, is an estimate of the systematic bias between the two measurements (Figure 35.3). In our case, the magnitude of the bias (purple arrow) is small (−0.73 units). The lower and upper red horizontal lines represent the lower and upper 95% limits of agreement (LOA), respectively. Under the normality assumption of the differences (green density plot), nearly 95% of the data points are expected to lie within the LOA. In our example, most of the points are randomly scattered around the zero dashed line within the limits of agreement (−9 to 8 units), and 7 out of 127 (5.5%) data points fall outside these limits, close to the expected 5%.

If we want to add confidence intervals for the MD and for the lower and upper limits of agreement (Lower LOA, Upper LOA) in Figure 35.3, we can use the bland.altman.stats() function from the {BlandAltmanLeh} package which provides these intervals:

# get the confidence intervals for the MD, Low and Upper LOA
ci_lines <- bland.altman.stats(goal$GOAL1, goal$GOAL2)$CI.lines
ci_lines
lower.limit.ci.lower lower.limit.ci.upper   mean.diff.ci.lower 
        -10.58273587          -7.93620048          -1.49627242 
  mean.diff.ci.upper upper.limit.ci.lower upper.limit.ci.upper 
          0.03170549           6.47163356           9.11816894 
# define the color of the lines of the confidence intervals
ci_colors <- c("red", "red", "blue", "blue", "red", "red")

# Bland-Altman plot
BA_plot2 <- BA_plot +
  geom_hline(yintercept = ci_lines, color = ci_colors, linewidth = 0.5, linetype = "dashed")

# add a marginal density plot on the right-hand side of the graph
ggMarginal(BA_plot2, type = "density", margins = 'y',
           yparams = list(fill = "#61D04F"))
Figure 35.4: Bland-Altman plot with confidence intervals for the MD and for the lower and upper limits of agreement (Lower LOA, Upper LOA).
Comment

If many data points in the Bland-Altman plot are close to the zero dashed line, it indicates that there is a good level of agreement between the two measurements.

The MD (bias) can be considered statistically non-significant, as zero (the line of equality) lies inside the confidence interval for the mean difference (−1.5, 0.03). We can also test this with a paired t-test:

t.test(goal$GOAL1, goal$GOAL2, paired = T)

    Paired t-test

data:  goal$GOAL1 and goal$GOAL2
t = -1.8968, df = 126, p-value = 0.06014
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -1.49627242  0.03170549
sample estimates:
mean difference 
     -0.7322835 

There are many packages on CRAN that include functions for creating Bland-Altman plots, such as {blandr} and {BlandAltmanLeh}:

Bland-Altman plots
blandr.draw(goal$GOAL1, goal$GOAL2) +
  annotate("text", x = 90, y = 8.6, label = "Upper LOA", color = "red", size = 4.0) +
  annotate("text", x = 90, y = -8.6, label = "Lower LOA", color = "red", size = 4.0) +
  annotate("text", x = 20, y = -1.5, label = "MD", color = "blue", size = 4.0)
Figure 35.5
bland.altman.plot(goal$GOAL1, goal$GOAL2, graph.sys = "ggplot2", conf.int=0.95) +
  annotate("text", x = 90, y = 8.6, label = "Upper LOA", color = "red", size = 4.0) +
  annotate("text", x = 90, y = -8.6, label = "Lower LOA", color = "red", size = 4.0) +
  annotate("text", x = 20, y = -1.5, label = "MD", color = "blue", size = 4.0)
Figure 35.6

35.4 Reliability for categorical measurements

35.4.1 Research question

A screening questionnaire for Parkinson’s disease was administered to a community sample twice, two weeks apart. We aim to examine the test–retest reliability of the following questions included in the questionnaire:

Q1: Do you have trouble buttoning buttons? (yes/no)

Q2: Do you have trouble arising from a chair? (yes/no)

Q3: Do your feet suddenly seem to freeze in door-ways? (yes/no)

Q4: Is your handwriting smaller than it once was? (yes/no)

35.4.2 Preparing the data

The file questions contains the data for the four questions, which required a ‘yes’ or ‘no’ response. The same set of questions was administered to 2000 individuals aged over 65 years at two different time points (T1 and T2), 14 days apart.

library(readxl)
qs <- read_excel(here("data", "questions.xlsx"))
Figure 35.7: Table with data from “questions” file.

35.4.3 Contingency table

The test-retest data are arranged in a 2x2 contingency table as follows:

Table 35.3: The 2×2 contingency table for test-retest data.
                               \(\textbf{Time 2 (T2)}\)
                               yes    no    Totals
\(\textbf{Time 1 (T1)}\) yes    a      b     g1
                         no     c      d     g2
                         Totals f1     f2    N

The marginal totals (f1, f2, g1, g2) provide important summary information about the distribution of the categories and the raters’ assessments.

Symmetry and balance in a contingency table (Dettori and Norvell 2020)

  • Symmetrical: The distribution across f1 and f2 is the same as g1 and g2
  • Asymmetrical: The distribution across f1 and f2 is in the opposite direction to g1 and g2
  • Balanced: The proportion of the total number of objects in f1 and g1 is equal to 0.5
  • Imbalanced: The proportion of the total number of objects in f1 and g1 is not equal to 0.5

35.4.4 Reliability coefficients for categorical data

Cohen’s kappa (unweighted)

Cohen’s kappa (κ), also known as the kappa statistic, is used to measure relative reliability for categorical items. It quantifies the degree of agreement beyond what would be expected by chance alone (Cohen 1960).

Cohen’s kappa statistic is defined as follows:

\[ k = \frac{p_o-p_{e\kappa}}{1-p_{e\kappa}} \]

where \(p_o = (a+d)/N\) is the proportion of observed overall agreement and \(p_{e\kappa} = (f_1 \cdot g_1 + f_2 \cdot g_2)/N^2\) is the proportion of agreement that would be expected by chance.

Kappa can range from negative values (no agreement) to +1 (perfect agreement).

Table 35.4: Interpretation of Cohen’s kappa statistic.
Cohen’s kappa Level of agreement
<0.20 Poor
0.21 - 0.40 Fair
0.41 - 0.60 Moderate
0.61 - 0.80 Good
0.81 - 1.00 Very good

Gwet’s AC1

Gwet’s AC1 is another measure of relative reliability (Gwet 2008). The AC1 statistic is given by the formula:

\[ AC1 = \frac{p_o-p_{eAC1}}{1-p_{eAC1}} \]

\(p_o\) remains the same, but \(p_{eAC1} = 2q \cdot (1 - q)\) is the proportion of agreement that would be expected by chance, where \(q = (f_1 + g_1)/(2N)\).

 

35.4.5 Examples

  Q1: Do you have trouble buttoning buttons? (symmetrical and balanced distribution of marginal totals)

The first step consists of creating the contingency table with marginal totals as follows:

qs$Q1_T1 <- factor(qs$Q1_T1, levels=c("yes", "no"))
qs$Q1_T2 <- factor(qs$Q1_T2, levels=c("yes", "no"))

table1 <- table(Q1_T1 = qs$Q1_T1, Q1_T2 = qs$Q1_T2)

tb1 <- addmargins(table1)
tb1
     Q1_T2
Q1_T1  yes   no  Sum
  yes  800  180  980
  no   120  900 1020
  Sum  920 1080 2000

The proportions for question Q1 are:

     Q1_T2
Q1_T1  yes   no  Sum
  yes 0.40 0.09 0.49
  no  0.06 0.45 0.51
  Sum 0.46 0.54 1.00
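A table of proportions such as the one above can be obtained from the contingency table, for example (a sketch):

# convert the counts to proportions and add the marginal totals
round(addmargins(prop.table(table1)), 2)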

We observe that the distribution of marginal totals of question Q1 is symmetrical (f1 = 0.46 is close to g1 = 0.49; f2 = 0.54 is close to g2 = 0.51) and approximately balanced (f1 = 0.46 and f2 = 0.54 are close to 0.5; g1 = 0.49 and g2 = 0.51 are close to 0.5).

  • kappa

First, we extract the elements of tb1:

a <- tb1[1,1]
b <- tb1[1,2]
c <- tb1[2,1]
d <- tb1[2,2]
f1 <- tb1[3,1]
f2 <- tb1[3,2]
g1 <- tb1[1,3]
g2 <- tb1[2,3]
N <- tb1[3,3]

The proportion of observed overall agreement is:

po <- (a + d)/N
po
[1] 0.85

The proportion of agreement that would be expected by chance is:

pek <- (f1 * g1 + f2 * g2)/N^2
pek
[1] 0.5008

Finally, we calculate Cohen’s kappa statistic:

k <- (po-pek)/(1-pek)
k
[1] 0.6995192

There are many packages on CRAN that include functions to calculate Cohen’s kappa statistic, such as {irr} and {vcd}:

Cohen’s kappa
Q1 <- qs[1:2]
kappa2(Q1)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 2000 
   Raters = 2 
    Kappa = 0.7 

        z = 31.3 
  p-value = 0 

Hypothesis Testing

Null hypothesis and alternative hypothesis for kappa
  • \(H_0\): The agreement is the same as chance agreement. (\(k = 0\))
  • \(H_1\): The agreement is different from chance agreement. (\(k \neq 0\))

The p-value is less than 0.05 (reject \(H_0\)), so the two assessments agree more than would be expected by chance. Note that measurements taken from the same individuals at two different time points are closely associated by nature.

k_Q1 <- Kappa(table1)
k_Q1
            value     ASE     z Pr(>|z|)
Unweighted 0.6995 0.01596 43.82        0
Weighted   0.6995 0.01596 43.82        0
Warning

Note that, in the results above, ASE is the asymptotic standard error of the kappa value, which is used for calculating the confidence intervals.

confint(k_Q1)
            
Kappa              lwr       upr
  Unweighted 0.6682302 0.7308083
  Weighted   0.6682302 0.7308083

Therefore, the relative agreement for question Q1 is good (k = 0.70, 95% CI: 0.67 to 0.73).

  • AC1

For the AC1 statistic, the proportion of agreement that would be expected by chance is:

q <- (f1 + g1)/(2*N)
peAC1 <- 2*q*(1-q)
peAC1
[1] 0.49875

The AC1 statistic is as follows:

AC1 <- (po-peAC1)/(1-peAC1)
AC1
[1] 0.7007481

We can also use the {irrCAC} package to get the relevant statistic with confidence intervals:

Q1 <- qs[1:2]
gwet.ac1.raw(Q1)$est
  coeff.name   pa      pe coeff.val coeff.se      conf.int p.value     w.name
1        AC1 0.85 0.49875   0.70075  0.01596 (0.669,0.732)       0 unweighted

We conclude that AC1 is 0.7, with a 95% confidence interval ranging from 0.67 to 0.73, consistent with the result obtained using Cohen’s kappa statistic.

 

  Q2: Do you have trouble arising from a chair? (symmetrical and imbalanced distribution of marginal totals)

qs$Q2_T1 <- factor(qs$Q2_T1, levels=c("yes", "no"))
qs$Q2_T2 <- factor(qs$Q2_T2, levels=c("yes", "no"))

table2 <- table(Q2_T1 = qs$Q2_T1, Q2_T2 = qs$Q2_T2)

tb2 <- addmargins(table2)
tb2
     Q2_T2
Q2_T1  yes   no  Sum
  yes 1600  200 1800
  no   100  100  200
  Sum 1700  300 2000

The proportions for question Q2 are:

     Q2_T2
Q2_T1  yes   no  Sum
  yes 0.80 0.10 0.90
  no  0.05 0.05 0.10
  Sum 0.85 0.15 1.00

The proportion of observed overall agreement is 0.80 + 0.05 = 0.85. The distribution of marginal totals of question Q2 is symmetrical (f1 = 0.85 is close to g1 = 0.90; f2 = 0.15 is close to g2 = 0.10) but imbalanced (f1 = 0.85 and f2 = 0.15 deviate greatly from 0.5; g1 = 0.90 and g2 = 0.10 are far from 0.5).

  • kappa
Q2 <- qs[3:4]
kappa2(Q2)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 2000 
   Raters = 2 
    Kappa = 0.318 

        z = 14.6 
  p-value = 0 

According to the value of the kappa statistic, k = 0.32, there appears to be fair agreement between test and retest for question Q2.

The first kappa paradox

Although questions Q1 and Q2 have the same high proportion of agreement (\(p_o\)), the imbalanced distribution of the marginal totals in Q2 lowers the kappa statistic, creating what appears to be a ‘paradox’: high percent agreement but low kappa (Feinstein and Cicchetti 1990; Cicchetti and Feinstein 1990).

  • AC1
gwet.ac1.raw(Q2)$est
  coeff.name   pa      pe coeff.val coeff.se      conf.int p.value     w.name
1        AC1 0.85 0.21875     0.808  0.01166 (0.785,0.831)       0 unweighted

The AC1 value is 0.81, which indicates that this measure is robust against the first kappa paradox.

 

  Q3: Do your feet suddenly seem to freeze in door-ways? (symmetrical and imbalanced distribution of marginal totals)

qs$Q3_T1 <- factor(qs$Q3_T1, levels=c("yes", "no"))
qs$Q3_T2 <- factor(qs$Q3_T2, levels=c("yes", "no"))

table3 <- table(Q3_T1 = qs$Q3_T1, Q3_T2 = qs$Q3_T2)

tb3 <- addmargins(table3)
tb3
     Q3_T2
Q3_T1  yes   no  Sum
  yes  900  300 1200
  no   500  300  800
  Sum 1400  600 2000

The proportions for question Q3 are:

     Q3_T2
Q3_T1  yes   no  Sum
  yes 0.45 0.15 0.60
  no  0.25 0.15 0.40
  Sum 0.70 0.30 1.00

The proportion of observed overall agreement is 0.45 + 0.15 = 0.60. The distribution of marginal totals of question Q3 is relatively symmetrical (same direction) but imbalanced (f1 = 0.70 and f2 = 0.30 deviate greatly from 0.5).

  • kappa
Q3 <- qs[5:6]
kappa2(Q3)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 2000 
   Raters = 2 
    Kappa = 0.13 

        z = 5.98 
  p-value = 2.28e-09 

According to the value of the kappa statistic, k = 0.13, there appears to be poor agreement between test and retest for question Q3.

  • AC1
gwet.ac1.raw(Q3)$est
  coeff.name  pa    pe coeff.val coeff.se      conf.int p.value     w.name
1        AC1 0.6 0.455   0.26606  0.02313 (0.221,0.311)       0 unweighted

 

  Q4: Is your handwriting smaller than it once was? (asymmetrical and imbalanced distribution of marginal totals)

qs$Q4_T1 <- factor(qs$Q4_T1, levels=c("yes", "no"))
qs$Q4_T2 <- factor(qs$Q4_T2, levels=c("yes", "no"))

table4 <- table(Q4_T1 = qs$Q4_T1, Q4_T2 = qs$Q4_T2)

tb4 <- addmargins(table4)
tb4
     Q4_T2
Q4_T1  yes   no  Sum
  yes  500  700 1200
  no   100  700  800
  Sum  600 1400 2000

The proportions for question Q4 are:

     Q4_T2
Q4_T1  yes   no  Sum
  yes 0.25 0.35 0.60
  no  0.05 0.35 0.40
  Sum 0.30 0.70 1.00

The proportion of observed overall agreement is 0.25 + 0.35 = 0.60. The distribution of marginal totals of question Q4 is asymmetrical (opposite direction) and imbalanced (f1 = 0.30 and f2 = 0.70 deviate greatly from 0.5).

  • kappa
# Q4: Is your handwriting smaller than it once was?
Q4 <- qs[7:8]
kappa2(Q4)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 2000 
   Raters = 2 
    Kappa = 0.259 

        z = 13.9 
  p-value = 0 

According to the value of the kappa statistic, k = 0.26, there appears to be fair agreement between test and retest for question Q4.

The asymmetrical imbalance in Q4 results in a higher value of k (0.26) compared to the more symmetrical imbalance in Q3, which yielded a lower k value (0.13).

The second kappa paradox

For the same proportion of observed agreement (\(p_o\)), the kappa value can be unexpectedly higher when the imbalance in the marginal totals is asymmetrical rather than symmetrical (Feinstein and Cicchetti 1990; Cicchetti and Feinstein 1990).

  • AC1
gwet.ac1.raw(Q4)$est
  coeff.name  pa    pe coeff.val coeff.se      conf.int p.value     w.name
1        AC1 0.6 0.495   0.20792  0.02215 (0.164,0.251)       0 unweighted