# 28 Correlation

**Correlation** is a statistical method used to assess a possible **association** between two numeric variables, X and Y. There are several statistical coefficients that we can use to quantify correlation, depending on the underlying relationship in the data. In this chapter, we’ll learn about four correlation coefficients:

- Pearson’s \(r\)
- Spearman’s \(r_{s}\) and Kendall’s \(\tau\)
- Coefficient \(ξ\)

Pearson’s coefficient measures **linear** correlation, while Spearman’s and Kendall’s coefficients compare the **ranks** of the data and measure **monotonic** associations. The newer \(ξ\) correlation coefficient is better suited to measuring the strength of **non-monotonic** associations.

When we have finished this chapter, we should be able to calculate and interpret these four correlation coefficients and test their statistical significance in R.

## 28.1 Research question

We consider the data in the *BirthWeight* dataset. Let’s say that we want to explore the association between weight (in g) and height (in cm) for a sample of 550 one-month-old infants.

## 28.2 Packages we need

We need to load the following packages:
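The exact package list is not shown in this excerpt; judging from the functions used later in the chapter, a plausible set (a reconstruction, not necessarily the author’s exact list) is:

```
library(here)      # here(): locate files relative to the project root
library(readxl)    # read_excel(): import Excel files
library(dplyr)     # glimpse(), pipes
library(ggplot2)   # ggplot(): scatter plots
library(ggExtra)   # ggMarginal(): marginal histograms
library(rstatix)   # cor_test(): tidy correlation tests
library(gt)        # gt(), fmt_number(): summary tables
library(XICOR)     # xicor(): Chatterjee's ξ coefficient
```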

## 28.3 Preparing the data

We import the data *BirthWeight* in R:

```
library(here)    # needed for here()
library(readxl)

BirthWeight <- read_excel(here("data", "BirthWeight.xlsx"))
```

We inspect the data and the type of variables:

`glimpse(BirthWeight)`

```
Rows: 550
Columns: 6
$ weight <dbl> 3950, 4630, 4750, 3920, 4560, 3640, 3550, 4530, 4970, 3740, …
$ height <dbl> 55.5, 57.0, 56.0, 56.0, 55.0, 51.5, 56.0, 57.0, 58.5, 52.0, …
$ headc <dbl> 37.5, 38.5, 38.5, 39.0, 39.5, 34.5, 38.0, 39.7, 39.0, 38.0, …
$ gender <chr> "Female", "Female", "Male", "Male", "Male", "Female", "Femal…
$ education <chr> "tertiary", "tertiary", "year12", "tertiary", "year10", "ter…
$ parity <chr> "2 or more siblings", "Singleton", "2 or more siblings", "On…
```

The dataset *BirthWeight* has 550 one-month-old infants (rows) and includes six variables (columns). Both `weight` and `height` are numeric (`<dbl>`) variables.

## 28.4 Plot the data

A first step that is usually useful in studying the association between two numeric variables is to prepare a scatter plot of the data. The pattern made by the points plotted on the scatter plot usually suggests the basic nature and strength of the association between two variables.

```
p <- ggplot(BirthWeight, aes(height, weight)) +
  geom_point(color = "blue", size = 2) +
  theme_minimal(base_size = 14)

ggMarginal(p, type = "histogram",
           xparams = list(fill = 7),
           yparams = list(fill = 3))
```

## 28.5 Linear correlation (Pearson’s coefficient \(r\))

### 28.5.1 The formula

Given a set of \({n}\) pairs of observations \((x_{1},y_{1}),\ldots ,(x_{n},y_{n})\) with means \(\bar{x}\) and \(\bar{y}\) respectively, \(r\) is defined as:

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n(y_i - \bar{y})^2}} \tag{28.1}\]

We observe that Equation 28.1 is based on calculating the sum of the products \((x_i - \bar{x})(y_i - \bar{y})\). In our example, that is the sum of the products \((height_{i} - \overline{height}) \cdot (weight_{i} - \overline{weight})\). Our approach begins by examining the signs of these products.

**Positive product:** In the **top-right** panel of Figure 28.3, the deviations from the mean for both variables, height and weight, are positive. Consequently, their products will also be positive. In the **bottom-left** panel, the deviations from the mean for both variables are negative. Once again, their product will be positive.

**Negative product:** In the **top-left** panel of Figure 28.3, the deviation of height from its mean is negative, while the deviation of weight is positive. Therefore, their product will be negative. Similarly, in the **bottom-right** panel, the product will be negative.

We observe that most of the products are positive. By applying Equation 28.1, we can calculate Pearson’s correlation coefficient, a task that is easily carried out in R:

`cor(BirthWeight$height, BirthWeight$weight)`

`[1] 0.7131192`
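To see that `cor()` implements Equation 28.1, we can compute the coefficient by hand in base R. The sketch below uses a small set of made-up height/weight values, not the *BirthWeight* data:

```
# Toy data (hypothetical values for illustration only)
x <- c(50, 52, 54, 55, 57)            # heights (cm)
y <- c(3400, 3600, 3900, 4100, 4600)  # weights (g)

# Equation 28.1: sum of products of deviations, suitably scaled
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r_manual, cor(x, y))  # TRUE
```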

Table 28.1 shows how to interpret the strength of an association according to (Evans 1996).

Value of r | Strength of association |
---|---|
\(\lvert r \rvert \geq 0.8\) | very strong association |
\(0.6 \leq \lvert r \rvert < 0.8\) | strong association |
\(0.4 \leq \lvert r \rvert < 0.6\) | moderate association |
\(0.2 \leq \lvert r \rvert < 0.4\) | weak association |
\(\lvert r \rvert < 0.2\) | very weak association |

In our example, the coefficient equals r = 0.713, indicating that infants with greater height generally exhibit higher weight. We say that there is a **linear positive association** between the two variables. However, **correlation does not mean causation** (Altman and Krzywinski 2015).

Even though summary statistics, such as Pearson’s r, can provide useful information, they are simplified representations of the data and may not capture the full picture. This is typically demonstrated with Anscombe’s quartet, which highlights the need to explore the underlying patterns and associations within the data through graphical representations (Figure 28.6).
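R ships with Anscombe’s quartet as the built-in `anscombe` data frame, so we can verify this directly:

```
# Four different x–y pairs with nearly identical Pearson's r
r_vals <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r_vals, 3)  # all four are approximately 0.816
```

Despite sharing the same correlation, the four sets have very different shapes when plotted, which is exactly the point of the quartet.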

### 28.5.2 Hypothesis Testing

- \(H_{0}\): there is no linear association between the two variables (\(\rho = 0\))
- \(H_{1}\): there is a linear association between the two variables (\(\rho \neq 0\))

### 28.5.3 Assumptions

Before we conduct a statistical test for the Pearson r coefficient, we should make sure that some assumptions are met.

Based on Figure 28.2, the points seem to be scattered around an invisible line without any notable outliers. Additionally, the marginal histograms show that the data are approximately normally distributed for both weight and height (we have a large sample, so the graphs are reliable). Therefore, we conclude that the assumptions are satisfied.

### 28.5.4 Run the test

To decide whether to reject the null hypothesis, a t-test is conducted based on the formula:

\[t = \frac{r}{SE_{r}}=\frac{r}{\sqrt{(1-r^2)/(n-2)}} \tag{28.2}\]

where *n* is the sample size.

For the data in our example, the number of observations is n = 550, r = 0.713, and \(SE_{r}=\sqrt{ \frac{(1-0.713^2)}{(550-2)}}= \sqrt{ \frac{(1-0.5084)}{548}} = \sqrt{\frac{0.4916}{548}}= 0.0299\).

According to Equation 28.2:

\[t = \frac{r}{SE_{r}}= \frac{0.713}{0.0299}= 23.8\]
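This hand calculation, and the corresponding two-tailed p-value, can be reproduced from \(r\) and \(n\) alone in base R (a sketch using the values above):

```
n <- 550
r <- 0.7131192

se_r <- sqrt((1 - r^2) / (n - 2))            # standard error of r
t_stat <- r / se_r                           # Equation 28.2
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-tailed p-value

t_stat  # approximately 23.81
```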

In this example, the value of the test statistic equals 23.8. Using R, we can find the 95% confidence interval and the corresponding p-value for a two-tailed test:

`cor.test(BirthWeight$height, BirthWeight$weight) # the default method is "pearson"`

```
Pearson's product-moment correlation
data: BirthWeight$height and BirthWeight$weight
t = 23.813, df = 548, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6694248 0.7518965
sample estimates:
cor
0.7131192
```

```
BirthWeight |>
  cor_test(height, weight) # the default method is "pearson"
```

```
# A tibble: 1 × 8
var1 var2 cor statistic p conf.low conf.high method
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 height weight 0.71 23.8 1.40e-86 0.669 0.752 Pearson
```

The result is significant (p < 0.001), so we reject the null hypothesis.

The **significance** of correlation is influenced by **the size of the sample**. With a large sample size, even a weak association may be significant, whereas with a small sample size, even a strong association might or might not be significant.
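A quick simulation illustrates this point (a sketch; `sim_p` is a hypothetical helper, and the exact p-values vary with the random sample):

```
set.seed(1)

# p-value of Pearson's test for a weak true association
sim_p <- function(n) {
  x <- rnorm(n)
  y <- 0.1 * x + rnorm(n)  # weak underlying correlation (about 0.1)
  cor.test(x, y)$p.value
}

p_small <- sim_p(30)    # small sample: usually not significant
p_large <- sim_p(5000)  # large sample: the same weak association is significant
```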

### 28.5.5 Present the results

**Summary table**

```
BirthWeight |>
  cor_test(height, weight) |>
  gt() |>
  fmt_number(columns = starts_with(c("c", "st", "p")),
             decimals = 3)
```

var1 | var2 | cor | statistic | p | conf.low | conf.high | method |
---|---|---|---|---|---|---|---|
height | weight | 0.710 | 23.813 | 0.000 | 0.669 | 0.752 | Pearson |

**Report the results** (according to Evans 1996)

```
Effect sizes were labelled following Evans's (1996) recommendations.
The Pearson's product-moment correlation between BirthWeight$height and
BirthWeight$weight is positive, statistically significant, and strong (r =
0.71, 95% CI [0.67, 0.75], t(548) = 23.81, p < .001)
```

We can use the above information to write up a final report:

We observed a strong, positive, and statistically significant linear association between the height and weight of one-month-old infants (Pearson’s r = 0.71, 95% CI [0.67, 0.75], n = 550, p < 0.001).

## 28.6 Rank correlation (Spearman’s \(r_{s}\) and Kendall’s \(\tau\) coefficients)

**Spearman’s correlation** \(r_{s}\) and **Kendall’s coefficient** \(\tau\) are both non-parametric correlation coefficients used to measure the strength and direction of association between two variables. Both coefficients are based on the concept of ranking the data but they employ distinct methods in their calculation and have some different characteristics (Puth, Neuhäuser, and Ruxton 2015).
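In fact, Spearman’s \(r_s\) is simply Pearson’s \(r\) computed on the ranks of the data, which we can verify with a short base-R sketch on simulated data:

```
set.seed(2)
x <- rnorm(20)
y <- exp(x) + rnorm(20, sd = 0.1)  # monotonic but non-linear association

# Spearman's r_s equals Pearson's r applied to the ranks
all.equal(cor(x, y, method = "spearman"),
          cor(rank(x), rank(y)))  # TRUE
```

Because only the ranks enter the calculation, \(r_s\) is unaffected by any monotonic transformation of the data.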

### 28.6.1 Hypothesis Testing

- \(H_{0}\): there is no monotonic association between the two variables (\(r_{s} = 0\); \(\tau = 0\))
- \(H_{1}\): there is a monotonic association between the two variables (\(r_{s} \neq 0\); \(\tau \neq 0\))

### 28.6.2 Assumptions

Spearman’s and Kendall’s coefficients are non-parametric: they do not require normally distributed data; the variables only need to be measured on at least an ordinal scale.

### 28.6.3 Run the test

`cor.test(BirthWeight$height, BirthWeight$weight, method = "spearman")`

```
Warning in cor.test.default(BirthWeight$height, BirthWeight$weight, method =
"spearman"): Cannot compute exact p-value with ties
```

```
Spearman's rank correlation rho
data: BirthWeight$height and BirthWeight$weight
S = 8013119, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.711021
```

```
BirthWeight |>
  cor_test(height, weight, method = "spearman")
```

```
# A tibble: 1 × 6
var1 var2 cor statistic p method
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 height weight 0.71 8013119. 7.4e-86 Spearman
```

`cor.test(BirthWeight$height, BirthWeight$weight, method = "kendall")`

```
Kendall's rank correlation tau
data: BirthWeight$height and BirthWeight$weight
z = 18.359, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.5408389
```

```
BirthWeight |>
  cor_test(height, weight, method = "kendall")
```

```
# A tibble: 1 × 6
var1 var2 cor statistic p method
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 height weight 0.54 18.4 2.78e-75 Kendall
```

We observe that Kendall’s \(\tau\) is **smaller** than Spearman’s \(r_s\) correlation (0.54 vs 0.71).

### 28.6.4 Present the results for Spearman’s correlation test

**Summary table**

```
BirthWeight |>
  cor_test(height, weight, method = "spearman") |>
  gt() |>
  fmt_number(columns = starts_with(c("c", "p")),
             decimals = 3)
```

var1 | var2 | cor | statistic | p | method |
---|---|---|---|---|---|
height | weight | 0.710 | 8013119 | 0.000 | Spearman |

**Report the results** (according to Evans 1996)


```
Effect sizes were labelled following Evans's (1996) recommendations.
The Spearman's rank correlation rho between BirthWeight$height and
BirthWeight$weight is positive, statistically significant, and strong (rho =
0.71, S = 8.01e+06, p < .001)
```

We can use the above information to write up a final report:

We observed a strong, positive, and statistically significant monotonic association between the height and weight of one-month-old infants (Spearman’s \(r_s\) = 0.71, n = 550, p < 0.001).

### 28.6.5 Present the results for Kendall’s correlation test

**Summary table**

```
BirthWeight |>
  cor_test(height, weight, method = "kendall") |>
  gt() |>
  fmt_number(columns = starts_with(c("c", "st", "p")),
             decimals = 3)
```

var1 | var2 | cor | statistic | p | method |
---|---|---|---|---|---|
height | weight | 0.540 | 18.359 | 0.000 | Kendall |

**Report the results** (according to Evans 1996)

```
Effect sizes were labelled following Evans's (1996) recommendations.
The Kendall's rank correlation tau between BirthWeight$height and
BirthWeight$weight is positive, statistically significant, and moderate (tau =
0.54, z = 18.36, p < .001)
```

We can use the above information to write up a final report:

We observed a moderate, positive, and statistically significant monotonic association between the height and weight of one-month-old infants (Kendall’s \(\tau\) = 0.54, n = 550, p < 0.001).

## 28.7 Non-monotonic association (coefficient \(ξ\))

The correlation coefficient \(ξ\) ranges from 0 to 1 and is a measure of dependence between the variables X and Y (Chatterjee 2021). It equals 1 when Y is a function of X and 0 when X and Y are independent. Thus, \(ξ\) measures the **strength** of the association and can be used for **non-monotonic** associations. However, even for monotonic associations, it does not indicate the direction of the association.
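For data without ties, \(ξ\) has a simple closed form: order the pairs by X, take the ranks \(r_i\) of the corresponding Y values, and compute \(1 - 3\sum_{i=1}^{n-1} \lvert r_{i+1} - r_i \rvert / (n^2 - 1)\). The sketch below implements this in base R (`xi_hand` is a hypothetical helper; the chapter itself uses `xicor()`) and shows \(ξ\) detecting a quadratic, i.e. non-monotonic, dependence that Pearson’s \(r\) misses:

```
# Chatterjee's ξ for continuous (tie-free) data
xi_hand <- function(x, y) {
  r <- rank(y[order(x)])  # ranks of y after sorting by x
  n <- length(r)
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

set.seed(3)
x <- runif(1000, -1, 1)
y <- x^2 + rnorm(1000, sd = 0.02)  # strong but non-monotonic dependence

cor(x, y)      # near 0: the linear coefficient misses the association
xi_hand(x, y)  # clearly positive
```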

### 28.7.1 Hypothesis Testing

- \(H_{0}\): X and Y are independent
- \(H_{1}\): X and Y are not independent

### 28.7.2 Assumptions

### 28.7.3 Run the test

`xicor(BirthWeight$height, BirthWeight$weight, pvalue = TRUE)`

```
$xi
[1] 0.3163988
$sd
[1] 0.02697177
$pval
[1] 0
```

### 28.7.4 Present the results

Based on the \(ξ\) correlation coefficient, there is a significant association between height and weight (\(ξ\) = 0.31, sd = 0.027, p < 0.001).