9 Lists and data frames

In this chapter, we explore the two types of generic vectors in R: lists and data frames. Lists offer a flexible way to store various data types, while data frames provide a structured format that is especially useful for working with tabular data.

All the required add-on packages for this chapter are part of the core set of packages within the tidyverse ecosystem. As such, we have the option to either load them collectively using the tidyverse meta-package or load them individually. For demonstration purposes, we choose the latter.

# Required packages for this chapter
library(tibble)             
library(dplyr)               
library(lubridate)

9.1 Creating a list

In R, a list allows us to store and organize a diverse range of objects, such as atomic vectors, matrices, and even other lists, within a single data structure. The objects within a list do not need to be associated or related to each other. Essentially, a list serves as an advanced data type, providing the flexibility to store nearly any type of information.

We can create a list using the list() function and assign it to an object named my_list, as shown below:

my_list <- list(1:5, c("apple", "carrot"), c(TRUE, TRUE, FALSE))
my_list

[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "apple"  "carrot"

[[3]]
[1]  TRUE  TRUE FALSE

The list my_list consists of three elements, which we refer to as “list items” or simply “items”. Specifically, these items are:

A sequence of numbers from 1 to 5.
A character vector containing the strings “apple” and “carrot”.
A logical vector with the values TRUE, TRUE, and FALSE.

Therefore, each list item is an atomic vector of a different data type: numeric (integer), character, and logical.

We can assign names to the list items as follows:

my_list <- list(num = 1:5, 
                fruits = c("apple", "carrot"), 
                TF = c(TRUE, TRUE, FALSE))
my_list

$num
[1] 1 2 3 4 5

$fruits
[1] "apple"  "carrot"

$TF
[1]  TRUE  TRUE FALSE

Now, my_list is a named list containing three items: num, fruits, and TF.

We verify that the object’s class is list:

class(my_list)

[1] "list"
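A few base R helpers can be used to inspect a list such as the one above. The following is a minimal sketch that recreates my_list so it runs on its own; the object names n_items, item_names, and item_types are illustrative only.

```r
# Recreate the named list from this section
my_list <- list(num = 1:5,
                fruits = c("apple", "carrot"),
                TF = c(TRUE, TRUE, FALSE))

# Number of items in the list (not the number of individual values)
n_items <- length(my_list)

# Names of the list items
item_names <- names(my_list)

# Internal type of each item, returned as a named character vector
item_types <- sapply(my_list, typeof)

# Compact, human-readable overview of the whole structure
str(my_list)
```

Note that length() counts the items of the list, so it returns 3 here even though the items together hold ten values.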

9.2 Subsetting a list

9.2.1 Subsetting a list and preserving output as a list

We can use the subsetting operator [ ] to select one or more list items while preserving the output in list format. Let’s use the previously created list as an example to demonstrate different methods of item selection.

We can select the second list item by indexing its position:

my_list[2]

$fruits
[1] "apple"  "carrot"

class(my_list[2])

[1] "list"

The selected item is “fruits”, and it is returned as a list.

Alternatively, we can select the second list item by specifying its name within the square brackets:

# same as above, but using the item's name
my_list["fruits"]

$fruits
[1] "apple"  "carrot"

Finally, we can select the second list item using Boolean indices (TRUE/FALSE):

# same as above, but using boolean indices 
my_list[c(FALSE, TRUE, FALSE)]

$fruits
[1] "apple"  "carrot"

Since TRUE is at the second position, the [ ] operator selects the second item in the list, and the output is preserved as a list.
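The same three indexing styles also work when more than one item is selected. The sketch below recreates my_list and selects the first and third items in three equivalent ways; the result names are illustrative.

```r
# Recreate the named list from Section 9.1
my_list <- list(num = 1:5,
                fruits = c("apple", "carrot"),
                TF = c(TRUE, TRUE, FALSE))

# Select the first and third items by position
first_and_third <- my_list[c(1, 3)]

# Select the same items by name
by_name <- my_list[c("num", "TF")]

# Select the same items by excluding the second one (negative index)
without_fruits <- my_list[-2]

# In every case the result is still a list
class(first_and_third)
```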

9.2.2 Subsetting a list and simplifying the output

We can use double square brackets [[ ]] to select a single item from a list while simplifying the output. Unlike [ ], the [[ ]] operator extracts exactly one item at a time.

Select the second list item by indexing its position (simplifying the output):

# select the second list item and simplify it to a vector
my_list[[2]]

[1] "apple"  "carrot"

class(my_list[[2]])

[1] "character"

When using double square brackets [[ ]], R returns the actual data structure of the selected item, rather than preserving the list structure. Therefore, my_list[[2]] returns the character vector c("apple", "carrot"), not a list containing this vector.

Similarly, specifying the name “fruits” within the double square brackets allows us to select this list item while simplifying the output:

# same as above, but using the item's name
my_list[["fruits"]]

[1] "apple"  "carrot"

Another common method to select list items and simplify the output is by typing the name of the list (my_list), followed by a dollar sign ($), and then the name of the item within the list (e.g., “fruits”).

# select the list item "fruits" using the $ and simplify to a vector
my_list$fruits

[1] "apple"  "carrot"

IMPORTANT

One key difference between the [[ ]] operator and the $ is that the [[ ]] can be used with both indices and names, while the $ can only be used with names.
It is important to understand the distinction between simplifying and preserving when subsetting. Simplifying returns the simplest possible data structure that can represent the output, while preserving keeps the output in the same structure as the input.
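One practical consequence of this difference appears when the item name is stored in a variable. The sketch below, which recreates my_list and uses a hypothetical variable called item, shows that [[ ]] evaluates the variable whereas $ takes the text after it literally.

```r
# Recreate the named list from Section 9.1
my_list <- list(num = 1:5,
                fruits = c("apple", "carrot"),
                TF = c(TRUE, TRUE, FALSE))

# Store the name of the item we want in a variable (illustrative name)
item <- "fruits"

# [[ ]] evaluates the variable, so this selects the "fruits" item
via_brackets <- my_list[[item]]

# $ looks for an item literally called "item"; none exists, so the result is NULL
via_dollar <- my_list$item

# [[ ]] also accepts a position, which $ does not
via_position <- my_list[[2]]
```

For this reason, [[ ]] is usually the safer choice inside functions and loops, where item names are often held in variables.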

9.2.3 Selecting individual elements from a list item

By combining the [[ ]] or $ operator with the [ ] operator, we can select individual elements from a specific list item. For example, to select the second element, “carrot”, from the “fruits” item, we can use several approaches.

Select by indexing position:

my_list[[2]][2]

[1] "carrot"

Select by referencing the item name:

my_list[["fruits"]][2]

[1] "carrot"

Select using $ notation:

my_list$fruits[2]

[1] "carrot"

9.3 Unlisting a list

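To convert a list into an atomic vector, we can use the unlist() function. Because an atomic vector can hold only one data type, items of different types are coerced to a common type (here, character):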

my_unlist <- unlist(my_list)
my_unlist

    num1     num2     num3     num4     num5  fruits1  fruits2      TF1      TF2 
     "1"      "2"      "3"      "4"      "5"  "apple" "carrot"   "TRUE"   "TRUE" 
     TF3 
 "FALSE"

class(my_unlist)

[1] "character"
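The type of the result depends on what the list contains. As a small sketch with made-up lists (mixed and numbers_only are illustrative names), unlist() falls back to character only when at least one item is a character vector; a list of purely numeric items stays numeric.

```r
# A list mixing integers and strings: everything is coerced to character
mixed <- list(num = 1:2, fruits = c("apple", "carrot"))
unlisted_mixed <- unlist(mixed)
typeof(unlisted_mixed)

# A list holding only numbers: integers are promoted to double, not to character
numbers_only <- list(a = 1:3, b = c(4.5, 5.5))
unlisted_numbers <- unlist(numbers_only)
typeof(unlisted_numbers)

# use.names = FALSE drops the automatically generated names (a1, a2, ...)
unlisted_plain <- unlist(numbers_only, use.names = FALSE)
```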

9.4 Recursive vectors and nested lists

In R, lists are sometimes referred to as recursive vectors because they can contain other lists within them. These inner lists are known as nested lists. For example:

my_super_list <- list(item1 = 3.14,
                      item2 = list(item2a_num = 5:10,
                                   item2b_char = c("a", "b", "c")))

my_super_list

$item1
[1] 3.14

$item2
$item2$item2a_num
[1]  5  6  7  8  9 10

$item2$item2b_char
[1] "a" "b" "c"

In this example, item2 is a nested list because it is itself a list stored as an item of my_super_list.

Subsetting a nested list

To select items from a nested list, we can combine the [[ ]] (or $) operator and the [ ] operator.

Select the item2a_num and preserve the output as a list:

# preserve the output as a list
my_super_list[[2]][1]

$item2a_num
[1]  5  6  7  8  9 10

class(my_super_list[[2]][1])

[1] "list"

Select the item2a_num and simplify it as a vector:

my_super_list[[2]][[1]]

[1]  5  6  7  8  9 10

class(my_super_list[[2]][[1]])

[1] "integer"

Select using names (simplify output):

# same as above with names
my_super_list[["item2"]][["item2a_num"]]

[1]  5  6  7  8  9 10

# same as above with $ operator
my_super_list$item2$item2a_num

[1]  5  6  7  8  9 10

Selecting individual elements from a nested list

We can also select individual elements from the items of a nested list. For example, we select the “c” element as follows:

# select individual element
my_super_list[[2]][[2]][3]

[1] "c"

class(my_super_list[[2]][[2]][3])

[1] "character"

Alternatively, we can use the $ notation:

my_super_list$item2$item2b_char[3]

[1] "c"
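For deeper structures, str() gives a quick overview of the nesting, and [[ ]] accepts a vector of indices that is applied level by level. The sketch below recreates my_super_list; the result names are illustrative.

```r
# Recreate the nested list from this section
my_super_list <- list(item1 = 3.14,
                      item2 = list(item2a_num = 5:10,
                                   item2b_char = c("a", "b", "c")))

# Display the nesting levels and the type of each item
str(my_super_list)

# Chained [[ ]]: second item of the outer list, then second item of the inner list
chained <- my_super_list[[2]][[2]]

# Recursive indexing: a vector inside [[ ]] walks down one level per element
recursive <- my_super_list[[c(2, 2)]]

# The same recursive indexing using names
by_names <- my_super_list[[c("item2", "item2b_char")]]
```

All three approaches return the same character vector; the chained form is often easier to read, which is why it is used throughout this chapter.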

9.5 Data frames

A data frame is the most common way to store and organize rectangular data in R, and it is generally the preferred data structure for data analysis tasks. A data frame consists of rows and columns. While all elements within a column must have the same data type (e.g., numeric, character, or logical), different columns can have different data types. In this sense, a data frame is a special type of list whose elements are, typically, atomic vectors of equal length.

Various disciplines have different terms for the rows and columns in a data frame. In this textbook, we will consistently use the terms “observations” and “variables”, respectively.
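The view of a data frame as a list of equal-length vectors can be checked directly. The following sketch uses a small, made-up base R data frame (df, with only two variables) rather than the dataset built later in this chapter.

```r
# A small illustrative data frame with three observations and two variables
df <- data.frame(age = c(30, 65, 35),
                 ABO = c("A", "O", "O"))

# A data frame is stored as a list ...
is_a_list <- is.list(df)

# ... whose length is the number of variables (columns)
n_variables <- length(df)

# Every variable has the same length: the number of observations (rows)
variable_lengths <- sapply(df, length)

# nrow() and ncol() report the same information more directly
n_obs <- nrow(df)
```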

9.5.1 Creating a data frame

We will create a small example data frame with eight observations (rows) and six variables (columns) based on the following information:

age: age of the patient (in years). Values = {30, 65, 35, 25, 45, 55, 40, 20}.
smoking: smoking status of the patient (0=non-smoker, 1=smoker). Values = {0, 1, 1, 0, 1, 0, 0, 1}.
ABO: blood type of the patient based on the ABO blood group system (A, B, AB, O). Values = {A, O, O, O, B, O, A, A}.
bmi: Body Mass Index (BMI) category of the patient (1=underweight, 2=healthy weight, 3=overweight, 4=obesity). Values = {2, 3, 2, 2, 4, 4, 3, 1}.
occupation: occupation of the patient. Values = {Journalist, Chef, Doctor, Teacher, Lawyer, Musician, Pharmacist, Nurse}.
adm_date: admission date to the hospital. Values = {10-09-2023, 10-12-2023, 10-18-2023, 10-27-2023, 11-04-2023, 11-09-2023, 11-22-2023, 12-02-2023}.

The data frame can be created using the data.frame() function in base R, the tibble() function in the tibble package, the data.table() function in the data.table package, or the tidytable() function in the tidytable package. Let’s use tibble():

dat <- tibble(age = c(30, 65, 35, 25, 45, 55, 40, 20),
              smoking = c(0, 1, 1, 0, 1, 0, 0, 1),
              ABO = c("A", "O", "O", "O", "B", "O", "A", "A"),
              bmi = c(2, 3, 2, 2, 4, 4, 3, 1),
              occupation = c("Journalist", "Chef", "Doctor", "Teacher",
                             "Lawyer", "Musician", "Pharmacist", "Nurse"),
              adm_date = c("10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023",
                           "11-04-2023", "11-09-2023", "11-22-2023", "12-02-2023"))
dat

# A tibble: 8 × 6
    age smoking ABO     bmi occupation adm_date  
  <dbl>   <dbl> <chr> <dbl> <chr>      <chr>     
1    30       0 A         2 Journalist 10-09-2023
2    65       1 O         3 Chef       10-12-2023
3    35       1 O         2 Doctor     10-18-2023
4    25       0 O         2 Teacher    10-27-2023
5    45       1 B         4 Lawyer     11-04-2023
6    55       0 O         4 Musician   11-09-2023
7    40       0 A         3 Pharmacist 11-22-2023
8    20       1 A         1 Nurse      12-02-2023

We can find the type, class and dimensions for the created object dat:

typeof(dat)
class(dat)
dim(dat)

[1] "list"
[1] "tbl_df"     "tbl"        "data.frame"
[1] 8 6

The type is list, while the class is tbl_df (tibble), a “tidy” variant of the data frame that works well within the tidyverse. The dimensions are 8 × 6, that is, eight rows and six columns.

9.5.2 Accessing variables in a data frame

In R, we can access variables in a data frame just like items in a list by using their names or indices. For example:

dat[["age"]]
dat[[2]]

[1] 30 65 35 25 45 55 40 20
[1] 0 1 1 0 1 0 0 1

or by using the dollar sign ($):

dat$age

[1] 30 65 35 25 45 55 40 20

We can also select individual elements of a specific variable as follows:

dat$age[2:5]

[1] 65 35 25 45
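The preserving versus simplifying distinction from Section 9.2 carries over to data frames. The sketch below uses a small base R data frame (df, an illustrative stand-in for dat) so that it runs without any add-on package.

```r
# A small illustrative data frame
df <- data.frame(age = c(30, 65, 35, 25),
                 smoking = c(0, 1, 1, 0))

# Single brackets preserve the structure: the result is a one-column data frame
age_as_df <- df["age"]
class(age_as_df)

# Double brackets (or $) simplify: the result is a plain numeric vector
age_as_vector <- df[["age"]]
class(age_as_vector)

# Rows can be selected with the [rows, columns] form; an empty column slot keeps all variables
rows_2_to_3 <- df[2:3, ]
```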

9.5.3 Converting columns to the appropriate data type

It is important to inspect the data types of columns and convert them to the appropriate type if necessary to ensure accurate analysis. The glimpse() function is commonly used to quickly examine the structure and data types of a data frame:

glimpse(dat)

Rows: 8
Columns: 6
$ age        <dbl> 30, 65, 35, 25, 45, 55, 40, 20
$ smoking    <dbl> 0, 1, 1, 0, 1, 0, 0, 1
$ ABO        <chr> "A", "O", "O", "O", "B", "O", "A", "A"
$ bmi        <dbl> 2, 3, 2, 2, 4, 4, 3, 1
$ occupation <chr> "Journalist", "Chef", "Doctor", "Teacher", "Lawyer", "Musician",…
$ adm_date   <chr> "10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023", "11-04-2…

We observe a series of short abbreviations in angle brackets (e.g., <dbl>, <chr>), which represent the data type of each column in a tibble. A list of these abbreviations is provided in TABLE 9.1:

TABLE 9.1 Tibble abbreviations that describe the type of data in columns of a data frame.

Type of Data	Description	Abbreviation
character	strings: letters, numbers, symbols, and spaces	`<chr>`
integer	numerical values: integer numbers	`<int>`
double	numerical values: real numbers	`<dbl>`
logical	logical data, typically representing `TRUE` or `FALSE`	`<lgl>`
date	date (e.g., 2020-10-09)	`<date>`
date+time	date plus time (e.g., 2020-10-09 10:03:25 UTC)	`<dttm>`
factor	categorical variables with fixed and known set of possible values (e.g., male/female)	`<fct>`
ordered factor	categorical variable with ordered fixed and known set of possible values	`<ord>`

We can convert the categorical variables smoking, ABO, and bmi from <dbl>, <chr>, <dbl> types, respectively, to factors <fct> since they have fixed and known values.

Variable: smoking (numeric coded values → factor)

We will use the factor() function to convert the existing numeric coded (0/1) variable smoking to a factor with levels “non-smoker” and “smoker”.

dat$smoking <- factor(dat$smoking, levels = c(0, 1), 
                      labels = c("non-smoker", "smoker"))
levels(dat$smoking)

[1] "non-smoker" "smoker"
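One point worth keeping in mind is that any value not listed in the levels argument silently becomes NA. The sketch below uses a made-up vector, smoking_codes, in which the value 2 is a hypothetical data-entry error.

```r
# Illustrative coded values; 2 is not a valid code in our scheme
smoking_codes <- c(0, 1, 1, 0, 2)

# Convert to a factor using the same levels and labels as above
smoking_fct <- factor(smoking_codes, levels = c(0, 1),
                      labels = c("non-smoker", "smoker"))

# The unmatched value is converted to NA without a warning
smoking_fct

# Counting missing values after the conversion helps to spot such cases
n_missing <- sum(is.na(smoking_fct))
```

Comparing the number of missing values before and after the conversion is a simple way to detect unexpected codes.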

Variable: ABO (chr → factor)

It’s important to note that not all possible values may be present in a given dataset. For example, if we tabulate the variable ABO using the table() function, we obtain counts of the categories that are present in the data:

# create a count table
table(dat$ABO)


A B O 
3 1 4

The blood type “AB” of the ABO blood group system is absent from our data. In such cases, we can provide a list of all known levels in the factor() function:

# create a vector containing the blood types A, B, AB, and O
ABO_levels <- c("A", "B", "AB", "O")

dat$ABO <- factor(dat$ABO, levels = ABO_levels)

# show the levels of the ABO variable
levels(dat$ABO)

[1] "A"  "B"  "AB" "O"

# create a count table
table(dat$ABO)


 A  B AB  O 
 3  1  0  4

We observe that the count table includes the blood type “AB” even though it doesn’t exist in the original data.
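If the empty category is not wanted in a particular summary, the base R function droplevels() removes levels that do not occur in the data. The sketch below rebuilds the ABO factor as a standalone vector (abo is an illustrative name).

```r
# Rebuild the ABO factor with all four known blood types as levels
abo <- factor(c("A", "O", "O", "O", "B", "O", "A", "A"),
              levels = c("A", "B", "AB", "O"))

# The count table keeps the unused level "AB" with a count of zero
counts_all <- table(abo)

# droplevels() discards levels with no observations
abo_used <- droplevels(abo)
counts_used <- table(abo_used)
levels(abo_used)
```

Whether to keep or drop unused levels depends on the purpose: keeping them documents the full set of possible values, while dropping them avoids empty rows in tables and plots.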

Variable: bmi (numeric coded values → ordered factor)

As we may have noticed, the categorical variable bmi takes numerically coded values (1, 2, 3, 4) in our dataset, so it is recognized as a double <dbl> type. We can convert this variable into an ordered factor <ord> with levels (1=underweight, 2=healthy, 3=overweight, 4=obesity). Instead of overwriting the existing variable, we prefer to create a new variable bmi1, as follows:

# create a vector containing the four bmi categories in order
bmi1_labels <- c("underweight", "healthy", "overweight", "obesity")

# create a new variable bmi1 as an ordered factor with specified labels
dat$bmi1 <- factor(dat$bmi, levels = c(1, 2, 3, 4), 
                   labels = bmi1_labels, ordered = TRUE)
dat$bmi1

[1] healthy     overweight  healthy     healthy     obesity     obesity    
[7] overweight  underweight
Levels: underweight < healthy < overweight < obesity

Now we can use, for example, the comparison operator > to check whether one element of the ordered factor ranks higher than another.

dat$bmi1[2] > dat$bmi1[6]

[1] FALSE

However, the use of these operators on factors is much less common compared to numeric vectors. Therefore, we typically omit the ordered = TRUE argument, especially when we provide the order of categories explicitly in the levels argument.
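When the ordering is kept, comparisons can also be made against a level label, which is convenient for filtering. The sketch below rebuilds the ordered factor as a standalone vector; the names bmi_labels and at_least_overweight are illustrative.

```r
# Rebuild the ordered BMI factor from the coded values
bmi_labels <- c("underweight", "healthy", "overweight", "obesity")
bmi1 <- factor(c(2, 3, 2, 2, 4, 4, 3, 1), levels = 1:4,
               labels = bmi_labels, ordered = TRUE)

# Compare every element against a level label
at_least_overweight <- bmi1 >= "overweight"

# Count the patients in the "overweight" or "obesity" categories
n_at_least_overweight <- sum(at_least_overweight)

# max() and min() respect the level order of an ordered factor
highest_category <- max(bmi1)
```

These comparisons are not defined for unordered factors, for which R returns NA with a warning.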

Suppose we want to categorize patients into broader BMI categories. For example, we can recode the original variable bmi and create a new variable in our dataset called bmi2, merging “overweight” (coded as 3) and “obesity” (coded as 4) into a single category labeled “overweight”.

# create a new variable bmi2 with recoded values
dat$bmi2 <- case_match(dat$bmi, 
                       1 ~ "underweight",
                       2 ~ "healthy",
                       c(3, 4) ~ "overweight")

In the code above, the case_match() function from the dplyr package matches the values on the left side of the tilde (~) against the input vector dat$bmi. For example, if the values 3 or 4 are found in the bmi column, they are recoded as “overweight” in the new variable.
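Values that match none of the left-hand sides are returned as NA unless a fallback is supplied through the .default argument. The following sketch assumes dplyr version 1.1.0 or later (where case_match() is available) and uses a made-up vector, bmi_codes, in which 9 is a hypothetical invalid code.

```r
library(dplyr)  # case_match() is assumed to be available (dplyr >= 1.1.0)

# Illustrative coded values; 9 does not correspond to any BMI category
bmi_codes <- c(1, 2, 3, 4, 9)

# Without a default, the unmatched value becomes NA
recoded <- case_match(bmi_codes,
                      1 ~ "underweight",
                      2 ~ "healthy",
                      c(3, 4) ~ "overweight")

# With .default, unmatched values receive an explicit label instead
recoded_default <- case_match(bmi_codes,
                              1 ~ "underweight",
                              2 ~ "healthy",
                              c(3, 4) ~ "overweight",
                              .default = "unknown")
```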

Additionally, we can explicitly define the order of the levels for the bmi2 variable:

# set the levels in order
bmi2_levels <- c("underweight", "healthy", "overweight")

# convert bmi2 to a factor with specified levels
dat$bmi2 <- factor(dat$bmi2, levels = bmi2_levels)
dat$bmi2

[1] healthy     overweight  healthy     healthy     overweight  overweight 
[7] overweight  underweight
Levels: underweight healthy overweight

dat

# A tibble: 8 × 8
    age smoking    ABO     bmi occupation adm_date   bmi1        bmi2       
  <dbl> <fct>      <fct> <dbl> <chr>      <chr>      <ord>       <fct>      
1    30 non-smoker A         2 Journalist 10-09-2023 healthy     healthy    
2    65 smoker     O         3 Chef       10-12-2023 overweight  overweight 
3    35 smoker     O         2 Doctor     10-18-2023 healthy     healthy    
4    25 non-smoker O         2 Teacher    10-27-2023 healthy     healthy    
5    45 smoker     B         4 Lawyer     11-04-2023 obesity     overweight 
6    55 non-smoker O         4 Musician   11-09-2023 obesity     overweight 
7    40 non-smoker A         3 Pharmacist 11-22-2023 overweight  overweight 
8    20 smoker     A         1 Nurse      12-02-2023 underweight underweight

Variable: adm_date (chr → date)

The standard date format in R is YYYY-MM-DD. If the original data represents dates in a different format, we need to convert them to the standard format.

The lubridate package provides functions for parsing dates by specifying the order of day (d), month (m), and year (y). For example, if a dataset contains dates in the format MM-DD-YYYY (i.e., month-day-year), the mdy() function converts them into date objects. Likewise, if the dates are in the format DD-MM-YYYY (i.e., day-month-year), the dmy() function performs the conversion. In our example, the dates are in the MM-DD-YYYY format, so we must convert them accordingly.

dat$adm_date <- mdy(dat$adm_date)
dat

# A tibble: 8 × 8
    age smoking    ABO     bmi occupation adm_date   bmi1        bmi2       
  <dbl> <fct>      <fct> <dbl> <chr>      <date>     <ord>       <fct>      
1    30 non-smoker A         2 Journalist 2023-10-09 healthy     healthy    
2    65 smoker     O         3 Chef       2023-10-12 overweight  overweight 
3    35 smoker     O         2 Doctor     2023-10-18 healthy     healthy    
4    25 non-smoker O         2 Teacher    2023-10-27 healthy     healthy    
5    45 smoker     B         4 Lawyer     2023-11-04 obesity     overweight 
6    55 non-smoker O         4 Musician   2023-11-09 obesity     overweight 
7    40 non-smoker A         3 Pharmacist 2023-11-22 overweight  overweight 
8    20 smoker     A         1 Nurse      2023-12-02 underweight underweight

class(dat$adm_date)

[1] "Date"

In the code above, the mdy() function from the lubridate package parses dates in the “month-day-year” format and converts them into R’s standard date format (YYYY-MM-DD).
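Choosing the correct parsing function matters because some strings are valid under more than one format. The sketch below parses a single string, raw_date (an illustrative name), in both ways and also shows a base R equivalent that uses as.Date() with an explicit format string.

```r
library(lubridate)

# A date string that is valid both as month-day-year and as day-month-year
raw_date <- "10-09-2023"

# Interpreted as month-day-year: 9 October 2023
as_mdy <- mdy(raw_date)

# Interpreted as day-month-year: 10 September 2023
as_dmy <- dmy(raw_date)

# Base R alternative: state the format explicitly
base_mdy <- as.Date(raw_date, format = "%m-%d-%Y")

# Once parsed, dates support arithmetic, e.g. the number of days between two dates
days_between <- as.numeric(as.Date("2023-12-02") - as_mdy)
```

Neither function can detect the wrong choice for an ambiguous string like this one, so the format should be confirmed from the data source before converting.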