9 Lists and data frames
When we have finished this chapter, we should be able to:
9.1 Creating a list
In R, a list enables us to organize diverse objects (e.g., 1-D vectors, matrices, even other lists) under a single data structure. There is no requirement for these objects to be associated or related to each other in any way. Essentially, a list can be considered an advanced data type, allowing us to store practically any kind of information within it.
We construct a list using the list() function. For example:
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] "apple" "carrot"
[[3]]
[1] TRUE TRUE FALSE
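A call that builds such a list can be sketched as follows. This is a minimal reconstruction, assuming the list holds three unnamed atomic vectors; the object name my_list is taken from its later use in this chapter.

```r
# Assumed reconstruction: three unnamed atomic vectors of different types
my_list <- list(1:5, c("apple", "carrot"), c(TRUE, TRUE, FALSE))

length(my_list)           # number of list items
sapply(my_list, typeof)   # storage type of each item
```

Printing my_list shows each item under a positional label ([[1]], [[2]], [[3]]) because no names were supplied.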
This list consists of three elements referred to as “list items” or “items”, which are atomic vectors of different types of data (numeric, character, and logical).
We can assign names to the list items:
$num
[1] 1 2 3 4 5
$fruits
[1] "apple" "carrot"
$TF
[1] TRUE TRUE FALSE
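One way to produce a named list like the one printed above is to pass name = value pairs to list(). This is a sketch; the names num, fruits, and TF are assumed from the printed output.

```r
# Assumed names (num, fruits, TF) are taken from the printed output
my_list <- list(
  num    = 1:5,
  fruits = c("apple", "carrot"),
  TF     = c(TRUE, TRUE, FALSE)
)

names(my_list)   # the names of the list items
```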
We can also confirm that the class of the object is list:
class(my_list)
[1] "list"
9.2 Subsetting a list
9.2.1 Subset list and preserve output as a list
We can use the extraction operator [ ] to extract one or more list items while preserving the output in list format:
my_list[2] # extract the second list item (indexing by position)
$fruits
[1] "apple" "carrot"
class(my_list[2])
[1] "list"
my_list["fruits"] # same as above but using the item's name
$fruits
[1] "apple" "carrot"
my_list[c(FALSE, TRUE, FALSE)] # same as above but using boolean indices (TRUE/FALSE)
$fruits
[1] "apple" "carrot"
9.2.2 Subset list and simplify the output
We can use the [[ ]] operator to extract a single list item while simplifying the output:
my_list[[2]] # extract the second list item and simplify it to a vector
[1] "apple" "carrot"
class(my_list[[2]])
[1] "character"
my_list[["fruits"]] # same as above but using the item's name
[1] "apple" "carrot"
We can also access the content of the list by typing the name of the list followed by a dollar sign $ followed by the name of the list item:
my_list$fruits # extract the fruits item and simplify it to a vector
[1] "apple" "carrot"
One thing that differentiates the [[ ]] operator from the $ operator is that [[ ]] can be used with computed indices and names (for example, a name stored in a variable). The $ operator can only be used with literal names.
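The difference can be illustrated with a short sketch. The variable item is hypothetical and simply holds the name of a list item; my_list is rebuilt here so the example runs on its own.

```r
my_list <- list(num = 1:5, fruits = c("apple", "carrot"), TF = c(TRUE, TRUE, FALSE))

item <- "fruits"   # hypothetical variable holding an item name

my_list[[item]]    # [[ ]] evaluates `item`, so this returns the fruits vector
my_list$item       # $ looks for an item literally named "item" and returns NULL
```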
9.2.3 Subset list to get individual elements out of a list item
To extract individual elements out of a specific list item, combine the [[ ]] (or $) operator with the [ ] operator:
my_list[[2]][2] # using the index
[1] "carrot"
my_list[["fruits"]][2] # using the name of the list item
[1] "carrot"
my_list$fruits[2] # using the $
[1] "carrot"
9.3 Unlist a list
We can turn a list into an atomic vector with unlist():
my_unlist <- unlist(my_list)
my_unlist
num1 num2 num3 num4 num5 fruits1 fruits2 TF1
"1" "2" "3" "4" "5" "apple" "carrot" "TRUE"
TF2 TF3
"TRUE" "FALSE"
class(my_unlist)
[1] "character"
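The result is a character vector because an atomic vector can hold only one type, so unlist() coerces all elements to the most flexible type present. A minimal sketch with hypothetical lists (num_list and the mixed list are not part of the chapter's data):

```r
# A list whose items are all numeric unlists to a numeric vector
num_list <- list(a = 1:3, b = c(4.5, 5.5))
num_vec <- unlist(num_list)
class(num_vec)

# Adding a character item forces every element to character
mixed_vec <- unlist(list(a = 1:3, b = "x"))
class(mixed_vec)
```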
9.4 Recursive vectors and Nested Lists
In R, lists are sometimes referred to as recursive vectors because they can include other lists within them. These sublists are known as nested lists. For example:
my_super_list <- list(item1 = 3.14,
                      item2 = list(item2a_num = 5:10,
                                   item2b_char = c("a", "b", "c")))
my_super_list
$item1
[1] 3.14
$item2
$item2$item2a_num
[1] 5 6 7 8 9 10
$item2$item2b_char
[1] "a" "b" "c"
In this example, item2, which is the second item of my_super_list, is a nested list.
Subsetting a nested list
We can access the list items of a nested list by combining the [[ ]] (or $) operator with the [ ] operator. For example:
# preserve the output as a list
my_super_list[[2]][1]
$item2a_num
[1] 5 6 7 8 9 10
class(my_super_list[[2]][1])
[1] "list"
# simplify the output
my_super_list[[2]][[1]]
[1] 5 6 7 8 9 10
class(my_super_list[[2]][[1]])
[1] "integer"
# same as above with names
my_super_list[["item2"]][["item2a_num"]]
[1] 5 6 7 8 9 10
# same as above with $ operator
my_super_list$item2$item2a_num
[1] 5 6 7 8 9 10
We can also extract individual elements from the list items of a nested list. For example:
# extract individual element
my_super_list[[2]][[2]][3]
[1] "c"
class(my_super_list[[2]][[2]][3])
[1] "character"
9.5 Data frames
A data frame is the most common way of organizing and storing data in R and is generally the preferred data structure for conducting data analysis tasks.
9.5.1 Creating a data frame with tibble()
We will create a small fictional data frame with eight rows based on the following information:
- age: age of the patient (in years)
- smoking: smoking status of the patient (0=non-smoker, 1=smoker)
- ABO: blood type of the patient based on the ABO blood group system (A, B, AB, O)
- bmi: Body Mass Index (BMI) category of the patient (1=underweight, 2=healthy weight, 3=overweight, 4=obesity)
- occupation: occupation of the patient
- adm_date: admission date to the hospital
A data frame can be created using the data.frame() function in base R, the tibble() function from the {tibble} package (part of the tidyverse), or the data.table() function from the {data.table} package. Let’s try tibble():
library(tidyverse) # load the tidyverse package
library(rstatix)
dat <- tibble(
  age = c(30, 65, 35, 25, 45, 55, 40, 20),
  smoking = c(0, 1, 1, 0, 1, 0, 0, 1),
  ABO = c("A", "O", "O", "O", "B", "O", "A", "A"),
  bmi = c(2, 3, 2, 2, 4, 4, 3, 1),
  occupation = c("Journalist", "Chef", "Doctor", "Teacher",
                 "Lawyer", "Musician", "Pharmacist", "Nurse"),
  adm_date = c("10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023",
               "11-04-2023", "11-09-2024", "11-22-2023", "12-02-2023")
)
dat
# A tibble: 8 × 6
age smoking ABO bmi occupation adm_date
<dbl> <dbl> <chr> <dbl> <chr> <chr>
1 30 0 A 2 Journalist 10-09-2023
2 65 1 O 3 Chef 10-12-2023
3 35 1 O 2 Doctor 10-18-2023
4 25 0 O 2 Teacher 10-27-2023
5 45 1 B 4 Lawyer 11-04-2023
6 55 0 O 4 Musician 11-09-2024
7 40 0 A 3 Pharmacist 11-22-2023
8 20 1 A 1 Nurse 12-02-2023
We can find the type, class, and dimensions of the created object dat:
The type is a list, but the class is tbl (tibble), which is a “tidy” data frame (tibbles work better in the tidyverse). The dimensions are 8x6 (8 rows and 6 columns).
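These checks can be sketched with typeof(), class(), and dim(). To keep the example self-contained it uses a small hypothetical base R data frame (df) rather than the tibble dat; for a tibble, class() would additionally report "tbl_df" and "tbl".

```r
# Hypothetical small data frame built with base R
df <- data.frame(age = c(30, 65, 35), smoking = c(0, 1, 1))

typeof(df)   # a data frame is stored as a list of columns
class(df)    # "data.frame"; a tibble adds "tbl_df" and "tbl"
dim(df)      # number of rows, then number of columns
```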
The attributes() function helps us explore the characteristics/attributes of our tibble:
attributes(dat)
$class
[1] "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8
$names
[1] "age" "smoking" "ABO" "bmi" "occupation"
[6] "adm_date"
9.5.2 Accessing variables in a data frame
In R, we can access variables in a data frame just like items in a list by using their names or indices. For example:
dat[["age"]]
[1] 30 65 35 25 45 55 40 20
dat[[2]]
[1] 0 1 1 0 1 0 0 1
or by using the dollar sign ($):
dat$age
[1] 30 65 35 25 45 55 40 20
We can also extract individual elements out of a specific variable as follows:
dat$age[2:5]
[1] 65 35 25 45
Another easy way of selecting one variable, similar to $, is by utilizing the pull() function from the {dplyr} package. For example:
pull(dat, age)
[1] 30 65 35 25 45 55 40 20
9.5.3 Converting to the appropriate data type
It’s critical to investigate each column’s data type and convert it to the appropriate type for analysis if necessary. Often we use the glimpse() function to have a quick look at the structure of the data frame:
glimpse(dat)
Rows: 8
Columns: 6
$ age <dbl> 30, 65, 35, 25, 45, 55, 40, 20
$ smoking <dbl> 0, 1, 1, 0, 1, 0, 0, 1
$ ABO <chr> "A", "O", "O", "O", "B", "O", "A", "A"
$ bmi <dbl> 2, 3, 2, 2, 4, 4, 3, 1
$ occupation <chr> "Journalist", "Chef", "Doctor", "Teacher", "Lawyer", "Music…
$ adm_date <chr> "10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023", "11…
Observe the series of abbreviations in angle brackets (<dbl>, <chr>). The abbreviations used in tibbles describe the type of data in each column and are presented in Table 9.1:
Data Type | Description | Abbreviation |
---|---|---|
character | strings: letters, numbers, symbols, and spaces | <chr> |
integer | numerical values: integer numbers | <int> |
double | numerical values: real numbers | <dbl> |
logical | logical data, typically representing TRUE or FALSE | <lgl> |
date | date (e.g., 2020-10-09) | <date> |
date+time | date plus time (e.g., 2020-10-09 10:03:25 UTC) | <dttm> |
factor | categorical variables with fixed and known set of possible values (e.g., male/female) | <fct> |
ordered factor | categorical variable with ordered fixed and known set of possible values | <ord> |
We can convert the categorical variables smoking, ABO, and bmi from <dbl>, <chr>, and <dbl> types, respectively, into factors <fct> since they have fixed and known values.
- Variable: smoking (numeric coded values → factor)
We convert the numeric variable representing smoking status into a factor with more meaningful labels, and then display the updated data frame along with the levels of the newly converted factor variable.
# A tibble: 8 × 6
age smoking ABO bmi occupation adm_date
<dbl> <fct> <chr> <dbl> <chr> <chr>
1 30 non-smoker A 2 Journalist 10-09-2023
2 65 smoker O 3 Chef 10-12-2023
3 35 smoker O 2 Doctor 10-18-2023
4 25 non-smoker O 2 Teacher 10-27-2023
5 45 smoker B 4 Lawyer 11-04-2023
6 55 non-smoker O 4 Musician 11-09-2024
7 40 non-smoker A 3 Pharmacist 11-22-2023
8 20 smoker A 1 Nurse 12-02-2023
levels(dat$smoking)
[1] "non-smoker" "smoker"
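A conversion of this kind can be sketched with factor(), mapping the numeric codes to labels. This is a minimal sketch, assuming the coding 0 = non-smoker and 1 = smoker given in the variable description; the standalone vector smoking stands in for dat$smoking so the example runs on its own.

```r
# Stand-in for the smoking column (0 = non-smoker, 1 = smoker)
smoking <- c(0, 1, 1, 0, 1, 0, 0, 1)

# levels lists the codes found in the data; labels gives their display names
smoking_fct <- factor(smoking, levels = c(0, 1),
                      labels = c("non-smoker", "smoker"))

levels(smoking_fct)
```

Applied to the data frame, the same call would be assigned back to the column, e.g. dat$smoking <- factor(dat$smoking, levels = c(0, 1), labels = c("non-smoker", "smoker")).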
- Variable: ABO (chr → factor)
It’s important to note that not all potential values may be present in a given dataset. For example, if we tabulate the variable ABO (e.g., using the table() function), we will get counts of the categories in the data:
# create a count table
table(dat$ABO)
A B O
3 1 4
The blood type “AB” of the ABO blood group system is absent from our data. In such cases, we can use the factor() function and supply a vector of all the valid levels:
# create a vector containing the blood types A, B, AB, and O
ABO_levels <- c("A", "B", "AB", "O")
dat$ABO <- factor(dat$ABO, levels = ABO_levels)
dat
# A tibble: 8 × 6
age smoking ABO bmi occupation adm_date
<dbl> <fct> <fct> <dbl> <chr> <chr>
1 30 non-smoker A 2 Journalist 10-09-2023
2 65 smoker O 3 Chef 10-12-2023
3 35 smoker O 2 Doctor 10-18-2023
4 25 non-smoker O 2 Teacher 10-27-2023
5 45 smoker B 4 Lawyer 11-04-2023
6 55 non-smoker O 4 Musician 11-09-2024
7 40 non-smoker A 3 Pharmacist 11-22-2023
8 20 smoker A 1 Nurse 12-02-2023
# show the levels of the ABO variable
levels(dat$ABO)
[1] "A" "B" "AB" "O"
# create a count table
table(dat$ABO)
A B AB O
3 1 0 4
- Variable: bmi (numeric coded values → ordered factor)
We might have noticed that the categorical variable bmi takes numerically coded values (1, 2, 3, 4) in our dataset, so it is recognized as a double <dbl> type. We can convert this variable into an ordered factor <ord> with levels (1=underweight, 2=healthy, 3=overweight, 4=obesity). Instead of overwriting the existing variable, we prefer to create a new variable, bmi1, as follows:
# create a vector containing the four bmi categories
bmi1_labels <- c("underweight", "healthy", "overweight", "obesity")
# convert the variable to factor
dat$bmi1 <- factor(dat$bmi, levels = c(1, 2, 3, 4),
                   labels = bmi1_labels, ordered = TRUE)
dat$bmi1
[1] healthy overweight healthy healthy obesity obesity
[7] overweight underweight
Levels: underweight < healthy < overweight < obesity
dat
# A tibble: 8 × 7
age smoking ABO bmi occupation adm_date bmi1
<dbl> <fct> <fct> <dbl> <chr> <chr> <ord>
1 30 non-smoker A 2 Journalist 10-09-2023 healthy
2 65 smoker O 3 Chef 10-12-2023 overweight
3 35 smoker O 2 Doctor 10-18-2023 healthy
4 25 non-smoker O 2 Teacher 10-27-2023 healthy
5 45 smoker B 4 Lawyer 11-04-2023 obesity
6 55 non-smoker O 4 Musician 11-09-2024 obesity
7 40 non-smoker A 3 Pharmacist 11-22-2023 overweight
8 20 smoker A 1 Nurse 12-02-2023 underweight
Now we can use, for example, a comparison operator such as > to check whether one element of the ordered vector is larger than another.
dat$bmi1[2] > dat$bmi1[6]
[1] FALSE
However, the use of these operators on factors is much less common than on numeric vectors. Therefore, we typically omit the ordered = TRUE argument, especially when we provide the order of categories explicitly in the levels argument.
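The practical consequence of omitting ordered = TRUE can be sketched as follows. The vector sizes is hypothetical; the levels mirror the bmi categories used in this section.

```r
sizes <- c("healthy", "overweight", "obesity")   # hypothetical values
lv <- c("underweight", "healthy", "overweight", "obesity")

# Ordered factor: comparisons follow the order of the levels
f_ord <- factor(sizes, levels = lv, ordered = TRUE)
f_ord[2] < f_ord[3]

# Unordered factor: the comparison is not meaningful, so R returns NA
# (with a warning, suppressed here)
f_unord <- factor(sizes, levels = lv)
suppressWarnings(f_unord[2] < f_unord[3])
```

In both cases the levels keep the order given in the levels argument, which is what matters for tables and plots.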
Now, let’s merge the “overweight” and “obesity” categories into a single category named “overweight/obesity” within a new variable called bmi2:
# recode the values
dat$bmi2 <- case_match(dat$bmi,
                       1 ~ "underweight",
                       2 ~ "healthy",
                       c(3, 4) ~ "overweight/obesity")
# set the levels in order
bmi2_levels <- c("underweight", "healthy", "overweight/obesity")
# convert the variable to factor
dat$bmi2 <- factor(dat$bmi2, levels = bmi2_levels, ordered = TRUE)
dat$bmi2
[1] healthy overweight/obesity healthy healthy
[5] overweight/obesity overweight/obesity overweight/obesity underweight
Levels: underweight < healthy < overweight/obesity
dat
# A tibble: 8 × 8
age smoking ABO bmi occupation adm_date bmi1 bmi2
<dbl> <fct> <fct> <dbl> <chr> <chr> <ord> <ord>
1 30 non-smoker A 2 Journalist 10-09-2023 healthy healthy
2 65 smoker O 3 Chef 10-12-2023 overweight overweight/obe…
3 35 smoker O 2 Doctor 10-18-2023 healthy healthy
4 25 non-smoker O 2 Teacher 10-27-2023 healthy healthy
5 45 smoker B 4 Lawyer 11-04-2023 obesity overweight/obe…
6 55 non-smoker O 4 Musician 11-09-2024 obesity overweight/obe…
7 40 non-smoker A 3 Pharmacist 11-22-2023 overweight overweight/obe…
8 20 smoker A 1 Nurse 12-02-2023 underweight underweight
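As an alternative sketch that does not rely on case_match(), base R's factor() can merge categories by repeating a label: codes that share a label collapse into one level. The standalone vector bmi below stands in for dat$bmi, and the result is left unordered for simplicity.

```r
bmi <- c(2, 3, 2, 2, 4, 4, 3, 1)   # numeric codes as in the bmi column

# Codes 3 and 4 share one label, so they collapse into a single level
bmi2 <- factor(bmi, levels = c(1, 2, 3, 4),
               labels = c("underweight", "healthy",
                          "overweight/obesity", "overweight/obesity"))

levels(bmi2)
table(bmi2)
```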
- Variable: adm_date (chr → date)
In R, by default, values of class Date are displayed as YYYY-MM-DD. Therefore, to convert a date stored as text such as “10-12-2023” (assuming it’s in month-day-year format) to class Date, we can use the mdy() function from the {lubridate} package:
dat$adm_date <- mdy(dat$adm_date)
dat
# A tibble: 8 × 8
age smoking ABO bmi occupation adm_date bmi1 bmi2
<dbl> <fct> <fct> <dbl> <chr> <date> <ord> <ord>
1 30 non-smoker A 2 Journalist 2023-10-09 healthy healthy
2 65 smoker O 3 Chef 2023-10-12 overweight overweight/obe…
3 35 smoker O 2 Doctor 2023-10-18 healthy healthy
4 25 non-smoker O 2 Teacher 2023-10-27 healthy healthy
5 45 smoker B 4 Lawyer 2023-11-04 obesity overweight/obe…
6 55 non-smoker O 4 Musician 2024-11-09 obesity overweight/obe…
7 40 non-smoker A 3 Pharmacist 2023-11-22 overweight overweight/obe…
8 20 smoker A 1 Nurse 2023-12-02 underweight underweight
class(dat$adm_date)
[1] "Date"
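For comparison, the same conversion can be sketched in base R with as.Date() and an explicit format string. The vector adm_chr is a hypothetical stand-in holding two of the admission dates as text.

```r
adm_chr <- c("10-09-2023", "10-12-2023")   # month-day-year strings

# %m = month, %d = day, %Y = four-digit year
adm_date <- as.Date(adm_chr, format = "%m-%d-%Y")

class(adm_date)
format(adm_date)   # displayed as YYYY-MM-DD
```

Unlike mdy(), as.Date() needs the format spelled out whenever the text is not already in YYYY-MM-DD form.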