28 Correlation
Correlation is a statistical method used to assess a possible association between two numeric variables, X and Y. There are several statistical coefficients that we can use to quantify correlation, depending on the nature of the underlying association. In this chapter, we’ll learn about four correlation coefficients:
- Pearson’s \(r\)
- Spearman’s \(r_{s}\) and Kendall’s \(\tau\)
- Coefficient \(ξ\)
Pearson’s coefficient measures linear correlation, while Spearman’s and Kendall’s coefficients compare the ranks of the data and measure monotonic associations. The more recent \(ξ\) correlation coefficient is more appropriate for measuring the strength of non-monotonic associations.
When we have finished this chapter, we should be able to compute these correlation coefficients in R, test whether they differ significantly from zero, and interpret and report the results.
28.1 Research question
We consider the data in the BirthWeight dataset. Let’s say that we want to explore the association between weight (in g) and height (in cm) for a sample of 550 one-month-old infants.
28.2 Packages we need
We need to load the following packages:
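Based on the functions used later in this chapter, a minimal set is the following (a sketch; it assumes all of these packages are installed):

library(here)      # file paths relative to the project root
library(readxl)    # read Excel files
library(dplyr)     # glimpse() and general data manipulation
library(ggplot2)   # plots
library(ggExtra)   # ggMarginal() for marginal histograms
library(rstatix)   # cor_test()
library(gt)        # formatted summary tables
library(XICOR)     # xicor() for the xi correlation coefficient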
28.3 Preparing the data
We import the data BirthWeight in R:
library(readxl)

# read the Excel file from the data folder of the project (here() is from the here package)
BirthWeight <- read_excel(here("data", "BirthWeight.xlsx"))
We inspect the data and the type of variables:
glimpse(BirthWeight)
Rows: 550
Columns: 6
$ weight <dbl> 3950, 4630, 4750, 3920, 4560, 3640, 3550, 4530, 4970, 3740, …
$ height <dbl> 55.5, 57.0, 56.0, 56.0, 55.0, 51.5, 56.0, 57.0, 58.5, 52.0, …
$ headc <dbl> 37.5, 38.5, 38.5, 39.0, 39.5, 34.5, 38.0, 39.7, 39.0, 38.0, …
$ gender <chr> "Female", "Female", "Male", "Male", "Male", "Female", "Femal…
$ education <chr> "tertiary", "tertiary", "year12", "tertiary", "year10", "ter…
$ parity <chr> "2 or more siblings", "Singleton", "2 or more siblings", "On…
The data set BirthWeight contains 550 one-month-old infants (rows) and six variables (columns). Both weight and height are numeric (<dbl>) variables.
28.4 Plot the data
A first step that is usually useful in studying the association between two numeric variables is to prepare a scatter plot of the data. The pattern made by the points plotted on the scatter plot usually suggests the basic nature and strength of the association between two variables.
p <- ggplot(BirthWeight, aes(height, weight)) +
  geom_point(color = "blue", size = 2) +    # scatter plot of the raw data
  theme_minimal(base_size = 14)

# add marginal histograms for height (x-axis) and weight (y-axis)
ggMarginal(p, type = "histogram",
           xparams = list(fill = 7),
           yparams = list(fill = 3))
28.5 Linear correlation (Pearson’s coefficient \(r\))
28.5.1 The formula
Given a set of \({n}\) pairs of observations \((x_{1},y_{1}),\ldots ,(x_{n},y_{n})\) with means \(\bar{x}\) and \(\bar{y}\) respectively, \(r\) is defined as:
\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n(y_i - \bar{y})^2}} \tag{28.1}\]
We observe that Equation 28.1 is based on calculating the sum of the products \((x_i - \bar{x})(y_i - \bar{y})\). In our example, that is the sum of the products \((height_{i} - \overline{height}) \cdot (weight_{i} - \overline{weight})\). Our approach begins by examining the signs of these products.
Positive product: In the top-right pane of Figure 28.3, the deviations from the mean for both variables, height and weight, are positive. Consequently, their products will also be positive. In the bottom-left pane, the deviations from the mean for both variables are negative. Once again, their product will be positive.
Negative product: In the top-left pane of Figure 28.3, the deviation of height from its mean is negative, while the deviation of weight is positive. Therefore, their product will be negative. Similarly, in the bottom-right pane, the product will be negative.
We observe that most of the products are positive. By applying Equation 28.1, we can calculate Pearson’s correlation coefficient, a task that can be easily carried out using R:
cor(BirthWeight$height, BirthWeight$weight)
[1] 0.7131192
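To see how Equation 28.1 maps onto the data, we can also compute the coefficient directly from the sums of products and squared deviations (a quick check in base R; it reproduces the value returned by cor()):

x <- BirthWeight$height
y <- BirthWeight$weight

# numerator: sum of the products of the deviations from the means
num <- sum((x - mean(x)) * (y - mean(y)))

# denominator: square root of the product of the sums of squared deviations
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

num / den   # approximately 0.713, the same value as cor()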
Table 28.1 shows how to interpret the strength of an association according to Evans (1996).
Value of \(r\) | Strength of association
---|---
\(\lvert r \rvert \geq 0.8\) | very strong association
\(0.6 \leq \lvert r \rvert < 0.8\) | strong association
\(0.4 \leq \lvert r \rvert < 0.6\) | moderate association
\(0.2 \leq \lvert r \rvert < 0.4\) | weak association
\(\lvert r \rvert < 0.2\) | very weak association
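These labels can also be applied programmatically, for example with the interpret_r() function of the effectsize package (a sketch; it assumes the effectsize package is installed):

library(effectsize)

# apply the Evans (1996) thresholds to our coefficient
interpret_r(0.713, rules = "evans1996")   # labelled "strong"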
In our example, the coefficient equals r = 0.713, indicating that infants with greater height generally exhibit higher weight. We say that there is a positive linear association between the two variables. However, correlation does not imply causation (Altman and Krzywinski 2015).
Even though summary statistics, such as Pearson’s r, can provide useful information, they are simplified representations of the data and may not always capture the full picture. This is typically demonstrated with Anscombe’s quartet, which highlights the need to explore and understand the underlying patterns and associations in the data through graphical representations (Figure 28.6).
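Anscombe’s quartet is available in R as the built-in anscombe data frame; all four x-y pairs have practically the same Pearson’s r despite very different patterns, which is why a scatter plot should always accompany the coefficient:

# Pearson's r for the four x-y pairs of Anscombe's quartet (built-in dataset)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
# all four values are approximately 0.816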
28.5.2 Hypothesis Testing
\(H_{0}\): there is no linear association between the two variables in the population (\(\rho = 0\))
\(H_{1}\): there is a linear association between the two variables in the population (\(\rho \neq 0\))
28.5.3 Assumptions
Before we conduct a statistical test for the Pearson r coefficient, we should make sure that some assumptions are met: the association is approximately linear, both variables are approximately normally distributed, and there are no extreme outliers.
Based on Figure 28.2, the points seem to be scattered around an invisible line without any important outliers. Additionally, the marginal histograms show that the data are approximately normally distributed for both weight and height (we have a large sample, so the graphs are reliable). Therefore, we conclude that the assumptions are satisfied.
28.5.4 Run the test
To determine whether to reject the null hypothesis or not, a test is conducted based on the formula:
\[t = \frac{r}{SE_{r}}=\frac{r}{\sqrt{(1-r^2)/(n-2)}} \tag{28.2}\]
where n is the sample size.
For the data in our example, the number of observations is n = 550, r = 0.713, and \(SE_{r}=\sqrt{ \frac{(1-0.713^2)}{(550-2)}}= \sqrt{ \frac{(1-0.5084)}{548}} = \sqrt{\frac{0.4916}{548}}= 0.0299\).
According to Equation 28.2:
\[t = \frac{r}{SE_{r}}= \frac{0.713}{0.0299}= 23.8\]
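The same arithmetic can be reproduced in R (a quick check; the result matches the t statistic reported by cor.test() below):

n <- 550
r <- cor(BirthWeight$height, BirthWeight$weight)   # 0.713
se_r <- sqrt((1 - r^2) / (n - 2))                   # standard error of r
r / se_r                                            # t statistic, approximately 23.8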
In this example, the value of the test statistic equals 23.8. Using R, we can find the 95% confidence interval and the corresponding p-value for a two-tailed test:
cor.test(BirthWeight$height, BirthWeight$weight) # the default method is "pearson"
Pearson's product-moment correlation
data: BirthWeight$height and BirthWeight$weight
t = 23.813, df = 548, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6694248 0.7518965
sample estimates:
cor
0.7131192
BirthWeight |>
cor_test(height, weight) # the default method is "pearson"
# A tibble: 1 × 8
var1 var2 cor statistic p conf.low conf.high method
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 height weight 0.71 23.8 1.40e-86 0.669 0.752 Pearson
The result is significant (p < 0.001) and we reject the null hypothesis.
The significance of a correlation is influenced by the sample size. With a large sample, even a weak association may be statistically significant, whereas with a small sample, even a strong association may fail to reach significance.
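A quick numerical illustration of this point, based on the t statistic of Equation 28.2 (a sketch; p_from_r() is a small helper defined here, not a function from a package):

# two-sided p-value for a correlation r observed in a sample of size n
p_from_r <- function(r, n) {
  t <- r / sqrt((1 - r^2) / (n - 2))
  2 * pt(-abs(t), df = n - 2)
}

p_from_r(r = 0.10, n = 1000)   # weak association, large sample: p < 0.01
p_from_r(r = 0.60, n = 10)     # strong association, small sample: p > 0.05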
28.5.5 Present the results
Summary table
BirthWeight |>
cor_test(height, weight) |>
gt() |>
fmt_number(columns = starts_with(c("c", "st", "p")),
decimals = 3)
var1 | var2 | cor | statistic | p | conf.low | conf.high | method |
---|---|---|---|---|---|---|---|
height | weight | 0.710 | 23.813 | 0.000 | 0.669 | 0.752 | Pearson |
Report the results (according to Evans 1996)
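The text below can be generated automatically, for example with the report package (a sketch; it assumes the report package is installed and that its rules argument passes the Evans (1996) labels on to the effectsize package):

library(report)

# rules = "evans1996" is assumed to be forwarded to effectsize for the strength labels
report(cor.test(BirthWeight$height, BirthWeight$weight), rules = "evans1996")

The same call with method = "spearman" or method = "kendall" in cor.test() would produce the corresponding summaries shown in the sections below.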
Effect sizes were labelled following Evans's (1996) recommendations.
The Pearson's product-moment correlation between BirthWeight$height and
BirthWeight$weight is positive, statistically significant, and strong (r =
0.71, 95% CI [0.67, 0.75], t(548) = 23.81, p < .001)
We can use the above information to write up a final report:
We observed a strong, positive linear association between height and weight of one-month-old infants, which is statistically significant (Pearson’s r = 0.71, 95% CI [0.67, 0.75], n = 550, p < 0.001).
28.6 Rank correlation (Spearman’s \(r_{s}\) and Kendall’s \(\tau\) coefficients)
Spearman’s correlation \(r_{s}\) and Kendall’s coefficient \(\tau\) are both non-parametric correlation coefficients used to measure the strength and direction of association between two variables. Both coefficients are based on the concept of ranking the data but they employ distinct methods in their calculation and have some different characteristics (Puth, Neuhäuser, and Ruxton 2015).
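A useful way to see what “rank-based” means: Spearman’s \(r_s\) is simply Pearson’s \(r\) computed on the ranks of the two variables (a quick check with our data):

# Spearman's rs equals Pearson's r applied to the ranked data
cor(rank(BirthWeight$height), rank(BirthWeight$weight))            # about 0.711
cor(BirthWeight$height, BirthWeight$weight, method = "spearman")   # same value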
28.6.1 Hypothesis Testing
\(H_{0}\): there is no monotonic association between the two variables (\(r_{s} = 0\); \(\tau = 0\))
\(H_{1}\): there is a monotonic association between the two variables (\(r_{s} \neq 0\); \(\tau \neq 0\))
28.6.2 Assumptions
28.6.3 Run the test
cor.test(BirthWeight$height, BirthWeight$weight, method = "spearman")
Warning in cor.test.default(BirthWeight$height, BirthWeight$weight, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: BirthWeight$height and BirthWeight$weight
S = 8013119, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.711021
BirthWeight |>
cor_test(height, weight, method = "spearman")
# A tibble: 1 × 6
var1 var2 cor statistic p method
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 height weight 0.71 8013119. 7.4e-86 Spearman
cor.test(BirthWeight$height, BirthWeight$weight, method = "kendall")
Kendall's rank correlation tau
data: BirthWeight$height and BirthWeight$weight
z = 18.359, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.5408389
BirthWeight |>
cor_test(height, weight, method = "kendall")
# A tibble: 1 × 6
var1 var2 cor statistic p method
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 height weight 0.54 18.4 2.78e-75 Kendall
We observe that Kendall’s \(\tau\) is smaller than Spearman’s \(r_s\) correlation (0.54 vs 0.71).
28.6.4 Present the results for Spearman’s correlation test
Summary table
BirthWeight |>
cor_test(height, weight, method = "spearman") |>
gt() |>
fmt_number(columns = starts_with(c("c", "p")),
decimals = 3)
var1 | var2 | cor | statistic | p | method |
---|---|---|---|---|---|
height | weight | 0.710 | 8013119 | 0.000 | Spearman |
Report the results (according to Evans 1996)
Effect sizes were labelled following Evans's (1996) recommendations.
The Spearman's rank correlation rho between BirthWeight$height and
BirthWeight$weight is positive, statistically significant, and strong (rho =
0.71, S = 8.01e+06, p < .001)
We can use the above information to write up a final report:
We observed a strong, positive monotonic association between height and weight of one-month-old infants, which is statistically significant (Spearman’s \(r_s\) = 0.71, n = 550, p < 0.001).
28.6.5 Present the results for Kendall’s correlation test
Summary table
BirthWeight |>
cor_test(height, weight, method = "kendall") |>
gt() |>
fmt_number(columns = starts_with(c("c", "st", "p")),
decimals = 3)
var1 | var2 | cor | statistic | p | method |
---|---|---|---|---|---|
height | weight | 0.540 | 18.359 | 0.000 | Kendall |
Report the results (according to Evans 1996)
Effect sizes were labelled following Evans's (1996) recommendations.
The Kendall's rank correlation tau between BirthWeight$height and
BirthWeight$weight is positive, statistically significant, and moderate (tau =
0.54, z = 18.36, p < .001)
We can use the above information to write up a final report:
We observed a moderate, positive monotonic association between height and weight of one-month-old infants, which is statistically significant (Kendall’s \(\tau\) = 0.54, n = 550, p < 0.001).
28.7 Non-monotonic association (coefficient \(ξ\))
The correlation coefficient \(ξ\) ranges from 0 to 1 and is a measure of dependence between the X and Y variables (Chatterjee 2021). It equals 1 when Y is a function of X and 0 when X and Y are independent. Thus, \(ξ\) gives a measure of the strength of the association and can be used for non-monotonic associations. However, for monotonic associations, it does not indicate the direction of the association.
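To see why \(ξ\) is useful for non-monotonic associations, consider simulated data in which Y is a deterministic but non-monotonic function of X (a sketch with simulated data, not the BirthWeight data):

set.seed(1)
x <- runif(500, -1, 1)
y <- x^2                         # perfect, but non-monotonic, dependence

cor(x, y)                        # Pearson's r close to 0
cor(x, y, method = "spearman")   # Spearman's rs close to 0
xicor(x, y)                      # xi close to 1, detecting the dependence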
28.7.1 Hypothesis Testing
\(H_{0}\): the variables X and Y are independent (\(ξ = 0\))
\(H_{1}\): the variables X and Y are not independent (\(ξ > 0\))
28.7.2 Assumptions
28.7.3 Run the test
# xicor() from the XICOR package; with pvalue = TRUE it also returns the sd and the p-value
xicor(BirthWeight$height, BirthWeight$weight, pvalue = TRUE)
$xi
[1] 0.3163988
$sd
[1] 0.02697177
$pval
[1] 0
28.7.4 Present the results
Based on the \(ξ\) correlation coefficient, there is a significant association between height and weight (\(ξ\) = 0.32, sd = 0.027, p < 0.001).