9  Lists and data frames

When we have finished this chapter, we should be able to:

Learning objectives
  • Create a list using the list() function.
  • Refer to a list item using its name or index number.
  • Create a data frame from equal length vectors using the tibble() function.
  • Refer to a column of a data frame using the $ notation.
  • Convert variables from character to factor variables.

 

9.1 Creating a list

In R, a list enables us to organize diverse objects (e.g., 1-D vectors, matrices, even other lists) under a single data structure. There is no requirement for these objects to be associated or related to each other in any way. Essentially, a list can be considered an advanced data type, allowing us to store practically any kind of information within it.

We construct a list using the list() function. For example:

my_list <- list(1:5, c("apple", "carrot"), c(TRUE, TRUE, FALSE))
my_list
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "apple"  "carrot"

[[3]]
[1]  TRUE  TRUE FALSE

This list consists of three elements referred to as “list items” or “items”, which are atomic vectors of different types of data (numeric, character, and logical).

We can assign names to the list items:

my_list <- list(
              num = 1:5, 
              fruits = c("apple", "carrot"), 
              TF = c(TRUE, TRUE, FALSE))
my_list
$num
[1] 1 2 3 4 5

$fruits
[1] "apple"  "carrot"

$TF
[1]  TRUE  TRUE FALSE

We can also confirm that the class of the object is list:

class(my_list)
[1] "list"
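
As a small supplementary sketch (the object name mixed_list and its contents are hypothetical), a list can also hold a matrix and another list. The functions length(), names() and str() give a quick overview of what it contains:

```r
# A list holding objects of different kinds (names are illustrative only)
mixed_list <- list(
  vec   = 1:5,                       # an integer vector
  mat   = matrix(1:6, nrow = 2),     # a 2 x 3 matrix
  inner = list(a = "x", b = TRUE)    # another list
)

length(mixed_list)   # number of list items
names(mixed_list)    # names of the list items
str(mixed_list)      # compact overview of the whole structure
```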

 

9.2 Subsetting a list

9.2.1 Subset list and preserve output as a list

We can use the extraction operator [ ] to extract one or more list items while preserving the output in list format:

my_list[2]    # extract the second list item (indexing by position)
$fruits
[1] "apple"  "carrot"
class(my_list[2])
[1] "list"
my_list["fruits"]   # same as above but using the item's name
$fruits
[1] "apple"  "carrot"
my_list[c(FALSE, TRUE, FALSE)]    # same as above but using logical indices (TRUE/FALSE)
$fruits
[1] "apple"  "carrot"

 

9.2.2 Subset list and simplify the output

We can use the [[ ]] operator to extract a single list item while simplifying the output:

my_list[[2]]   # extract the second list item and simplify it to a vector
[1] "apple"  "carrot"
class(my_list[[2]])
[1] "character"
my_list[["fruits"]]   # same as above but using the item's name
[1] "apple"  "carrot"

We can also access the content of the list by typing the name of the list followed by a dollar sign $ followed by the name of the list item:

my_list$fruits  # extract the fruits item and simplify to a vector
[1] "apple"  "carrot"

One thing that differentiates the [[ ]] operator from $ is that [[ ]] can be used with computed indices and names (for example, a name stored in a variable), whereas $ accepts only a literal name.
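
A minimal sketch of this difference, re-creating my_list so that the example is self-contained (the variable item_name is hypothetical):

```r
my_list <- list(
  num = 1:5,
  fruits = c("apple", "carrot"),
  TF = c(TRUE, TRUE, FALSE)
)

item_name <- "fruits"              # a name stored in a variable

by_name   <- my_list[[item_name]]  # [[ ]] evaluates the variable first
by_dollar <- my_list$item_name     # $ looks for an item literally called "item_name"

by_name     # the fruits vector
by_dollar   # NULL, because no item has that name

# a computed position also works with [[ ]]
last_item <- my_list[[length(my_list)]]
last_item
```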

Simplifying Vs Preserving subsetting

It’s important to understand the difference between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output. Preserving subsets keeps the structure of the output the same as the input.
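
The practical consequence can be sketched as follows, assuming the same my_list as above: a function such as sum() works on the simplified vector but not on the preserved one-item list.

```r
my_list <- list(
  num = 1:5,
  fruits = c("apple", "carrot"),
  TF = c(TRUE, TRUE, FALSE)
)

preserved  <- my_list["num"]     # preserving: a list with one item
simplified <- my_list[["num"]]   # simplifying: an integer vector

is.list(preserved)    # TRUE
is.list(simplified)   # FALSE

sum(simplified)       # works on the vector

# sum() cannot handle a list, so we catch the error for illustration
sum_of_list <- tryCatch(sum(preserved), error = function(e) "error")
sum_of_list
```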

 

9.2.3 Subset list to get individual elements out of a list item

To extract individual elements out of a specific list item, combine the [[ ]] (or $) operator with the [ ] operator:

my_list[[2]][2]          # using the index
[1] "carrot"
my_list[["fruits"]][2]  # using the name of the list item
[1] "carrot"
my_list$fruits[2]       # using the $
[1] "carrot"

 

9.3 Unlist a list

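We can turn a list into an atomic vector with unlist(). Because an atomic vector can hold only one data type, the items of my_list are all coerced to character: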

my_unlist <- unlist(my_list)
my_unlist
    num1     num2     num3     num4     num5  fruits1  fruits2      TF1 
     "1"      "2"      "3"      "4"      "5"  "apple" "carrot"   "TRUE" 
     TF2      TF3 
  "TRUE"  "FALSE" 
class(my_unlist)
[1] "character"
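
As a brief sketch with hypothetical lists, unlist() keeps the original type when all items share it, and otherwise coerces everything to the most flexible type present. The use.names = FALSE argument drops the generated names:

```r
# all items are integer, so the result stays integer
same_type <- list(a = 1:2, b = 3:4)
v1 <- unlist(same_type)
v1
class(v1)

# integer and logical items: the logical values are coerced to integer
mixed <- list(1:3, c(TRUE, FALSE))
v2 <- unlist(mixed)
v2

# drop the automatically generated names
v3 <- unlist(same_type, use.names = FALSE)
v3
```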

 

9.4 Recursive vectors and nested lists

In R, lists are sometimes referred to as recursive vectors because they can include other lists within them. These sublists are known as nested lists. For example:

my_super_list <- list(item1 = 3.14,
                      item2 = list(item2a_num = 5:10,
                                   item2b_char = c("a", "b", "c")))

my_super_list
$item1
[1] 3.14

$item2
$item2$item2a_num
[1]  5  6  7  8  9 10

$item2$item2b_char
[1] "a" "b" "c"

In this example, item2, which is the second item of my_super_list, is a nested list.

 

Subsetting a nested list

We can access the list items of a nested list by using the combination of [[ ]] (or $) operator and the [ ] operator. For example:

# preserve the output as a list
my_super_list[[2]][1]
$item2a_num
[1]  5  6  7  8  9 10
class(my_super_list[[2]][1])
[1] "list"
# simplify the output
my_super_list[[2]][[1]]
[1]  5  6  7  8  9 10
class(my_super_list[[2]][[1]])
[1] "integer"
# same as above with names
my_super_list[["item2"]][["item2a_num"]]
[1]  5  6  7  8  9 10
# same as above with $ operator
my_super_list$item2$item2a_num
[1]  5  6  7  8  9 10

 

We can also extract individual elements from the list items of a nested list. For example:

# extract individual element
my_super_list[[2]][[2]][3]
[1] "c"
class(my_super_list[[2]][[2]][3])
[1] "character"
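
As an optional shortcut, base R also accepts a vector of positions or names inside a single [[ ]], which is applied recursively. The sketch below re-creates my_super_list so that it runs on its own:

```r
my_super_list <- list(
  item1 = 3.14,
  item2 = list(
    item2a_num  = 5:10,
    item2b_char = c("a", "b", "c")
  )
)

# second item of the outer list, then first item of the nested list
a <- my_super_list[[c(2, 1)]]
a

# the same idea with names
b <- my_super_list[[c("item2", "item2b_char")]]
b

# equivalent to chaining [[ ]] twice
identical(a, my_super_list[[2]][[1]])
```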

 

9.5 Data frames

A data frame is the most common way of organizing and storing data in R and is generally the preferred data structure for conducting data analysis tasks.

Data frame

In R, rectangular data is often referred to as a “data frame” consisting of rows and columns. While all elements within a column must have the same data type (e.g., numeric, character, or logical), it’s possible for different columns to have different data types. Therefore, a data frame is a special type of list with equal-length atomic vectors as its items.

Various disciplines have different terms for the rows and columns in a data frame, such as observations and variables, records and fields, or examples and attributes. In this textbook, we will consistently use the terms “observations” and “variables”. Data in variables can be either categorical (categorical variables) or numerical (numerical variables) (see also Chapter 12).

9.5.1 Creating a data frame with tibble()

We will create a small fictional data frame with eight rows based on the following information:

  • age: age of the patient (in years)
  • smoking: smoking status of the patient (0=non-smoker, 1=smoker)
  • ABO: blood type of the patient based on the ABO blood group system (A, B, AB, O)
  • bmi: Body Mass Index (BMI) category of the patient (1=underweight, 2=healthy weight, 3=overweight, 4=obesity)
  • occupation: occupation of the patient
  • adm_date: admission date to the hospital

A data frame can be created using the data.frame() function in base R, the tibble() function from the tibble package (part of the tidyverse), or the data.table() function in the data.table package. Let’s try tibble():

library(tidyverse)   # load the tidyverse package
library(rstatix)

dat <- tibble(
  age = c(30, 65, 35, 25, 45, 55, 40, 20),
  smoking = c(0, 1, 1, 0, 1, 0, 0, 1),
  ABO = c("A", "O", "O", "O", "B", "O", "A", "A"),
  bmi = c(2, 3, 2, 2, 4, 4, 3, 1),
  occupation = c("Journalist", "Chef", "Doctor", "Teacher",
                  "Lawyer", "Musician", "Pharmacist", "Nurse"),
  adm_date = c("10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023",
               "11-04-2023", "11-09-2024", "11-22-2023", "12-02-2023")
)

dat
# A tibble: 8 × 6
    age smoking ABO     bmi occupation adm_date  
  <dbl>   <dbl> <chr> <dbl> <chr>      <chr>     
1    30       0 A         2 Journalist 10-09-2023
2    65       1 O         3 Chef       10-12-2023
3    35       1 O         2 Doctor     10-18-2023
4    25       0 O         2 Teacher    10-27-2023
5    45       1 B         4 Lawyer     11-04-2023
6    55       0 O         4 Musician   11-09-2024
7    40       0 A         3 Pharmacist 11-22-2023
8    20       1 A         1 Nurse      12-02-2023
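
For comparison, here is a minimal sketch of the base R alternative, data.frame(), using only the first three patients and two of the variables (the object names dat_base and msg are hypothetical). It also illustrates that the columns must have equal lengths:

```r
# base R data frame with two equal-length vectors
dat_base <- data.frame(
  age = c(30, 65, 35),
  ABO = c("A", "O", "O")
)

dat_base
class(dat_base)
dim(dat_base)

# vectors of unequal length (3 and 2) cannot be combined;
# we catch the error and keep its message for illustration
msg <- tryCatch(
  data.frame(age = c(30, 65, 35), ABO = c("A", "O")),
  error = function(e) conditionMessage(e)
)
msg
```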

We can find the type, class, and dimensions of the created object dat:

typeof(dat)
class(dat)
dim(dat)
[1] "list"
[1] "tbl_df"     "tbl"        "data.frame"
[1] 8 6

The type is a list but the class is a tbl (tibble) object which is a “tidy” data frame (tibbles work better in the tidyverse). The dimensions are 8x6 (eight rows and six columns).
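
Since a data frame is stored as a list, the usual list functions apply to it. The following sketch uses a small hypothetical base R data frame (small_df) to keep the example self-contained:

```r
small_df <- data.frame(
  age = c(30, 65, 35),
  smoking = c(0, 1, 1)
)

is.list(small_df)         # a data frame is a list
length(small_df)          # number of list items, i.e. columns
lengths(small_df)         # every column has the same length (number of rows)
sapply(small_df, class)   # class of each column
```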

The attributes() function helps us to explore the characteristics/attributes of our tibble:

attributes(dat)

$class
[1] "tbl_df"     "tbl"        "data.frame"

$row.names
[1] 1 2 3 4 5 6 7 8

$names
[1] "age"        "smoking"    "ABO"        "bmi"        "occupation"
[6] "adm_date"  

 

9.5.2 Accessing variables in a data frame

In R, we can access variables in a data frame just like items in a list by using their names or indices. For example:

dat[["age"]]
dat[[2]]
[1] 30 65 35 25 45 55 40 20
[1] 0 1 1 0 1 0 0 1

or by using the dollar sign ($) :

dat$age
[1] 30 65 35 25 45 55 40 20

We can also extract individual elements out of a specific variable as follows:

dat$age[2:5]
[1] 65 35 25 45

Another easy way of selecting one variable, similar to $, is by utilizing the pull() function from the {dplyr} package. For example:

pull(dat, age)
[1] 30 65 35 25 45 55 40 20
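
The distinction between preserving and simplifying subsetting also applies to data frames. A short sketch with a hypothetical base R data frame (small_df):

```r
small_df <- data.frame(
  age = c(30, 65, 35),
  smoking = c(0, 1, 1)
)

age_df  <- small_df["age"]     # preserving: a data frame with one column
age_vec <- small_df[["age"]]   # simplifying: a numeric vector

class(age_df)
class(age_vec)

identical(age_vec, small_df$age)   # $ gives the same vector as [[ ]]
```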

 

9.5.3 Converting to the appropriate data type

It’s critical to investigate the column’s data type and convert it to the appropriate type for analysis if necessary. Often we use the glimpse() function in order to have a quick look at the structure of the data frame:

glimpse(dat)
Rows: 8
Columns: 6
$ age        <dbl> 30, 65, 35, 25, 45, 55, 40, 20
$ smoking    <dbl> 0, 1, 1, 0, 1, 0, 0, 1
$ ABO        <chr> "A", "O", "O", "O", "B", "O", "A", "A"
$ bmi        <dbl> 2, 3, 2, 2, 4, 4, 3, 1
$ occupation <chr> "Journalist", "Chef", "Doctor", "Teacher", "Lawyer", "Music…
$ adm_date   <chr> "10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023", "11…

Observe the series of three- or four-letter abbreviations in angle brackets (e.g., <dbl>, <chr>). The abbreviations used in tibbles describe the type of data in each column and are presented in Table 9.1:

Table 9.1: Tibble abbreviations that describe the type of data in columns of a data frame

| Data type | Description | Abbreviation |
|---|---|---|
| character | strings: letters, numbers, symbols, and spaces | <chr> |
| integer | numerical values: integer numbers | <int> |
| double | numerical values: real numbers | <dbl> |
| logical | logical data, typically representing TRUE or FALSE | <lgl> |
| date | date (e.g., 2020-10-09) | <date> |
| date+time | date plus time (e.g., 2020-10-09 10:03:25 UTC) | <dttm> |
| factor | categorical variable with a fixed and known set of possible values (e.g., male/female) | <fct> |
| ordered factor | categorical variable with an ordered, fixed and known set of possible values | <ord> |

 

We can convert the categorical variables smoking, ABO, and bmi from <dbl>, <chr>, <dbl> types, respectively, into factors <fct> since they have fixed and known values.

 

  • Variable: smoking (numeric coded values → factor)

The following code converts a numeric variable representing smoking status into a factor variable with more meaningful labels, and then displays the updated data frame along with the levels of the newly converted factor variable.

dat$smoking <- factor(dat$smoking, levels = c(0, 1), 
                  labels = c("non-smoker", "smoker"))
dat
# A tibble: 8 × 6
    age smoking    ABO     bmi occupation adm_date  
  <dbl> <fct>      <chr> <dbl> <chr>      <chr>     
1    30 non-smoker A         2 Journalist 10-09-2023
2    65 smoker     O         3 Chef       10-12-2023
3    35 smoker     O         2 Doctor     10-18-2023
4    25 non-smoker O         2 Teacher    10-27-2023
5    45 smoker     B         4 Lawyer     11-04-2023
6    55 non-smoker O         4 Musician   11-09-2024
7    40 non-smoker A         3 Pharmacist 11-22-2023
8    20 smoker     A         1 Nurse      12-02-2023
levels(dat$smoking)
[1] "non-smoker" "smoker"    
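
To see what the conversion does, here is a small sketch on a standalone vector with the same values as the smoking variable (the names smoking and smoking_f are used only for this illustration). Internally, a factor stores integer codes that point to its levels:

```r
smoking <- c(0, 1, 1, 0, 1, 0, 0, 1)

smoking_f <- factor(smoking, levels = c(0, 1),
                    labels = c("non-smoker", "smoker"))

as.integer(smoking_f)     # underlying integer codes (1 = first level, 2 = second)
as.character(smoking_f)   # the labels as plain character strings
table(smoking_f)          # counts per level
```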

 

  • Variable: ABO (chr → factor)

It’s important to note that not all potential values may be present in a given dataset. For example, if we tabulate the variable ABO (e.g. using the table() function) we will get counts of the categories in the data:

# create a count table
table(dat$ABO)

A B O 
3 1 4 

The blood type “AB” of the ABO blood group system is absent from our data. In such cases, we can use the factor() function and supply a vector containing all the valid levels:

# create a vector containing the blood types A, B, AB, and O
ABO_levels <- c("A", "B", "AB", "O")

dat$ABO <- factor(dat$ABO, levels = ABO_levels)
dat
# A tibble: 8 × 6
    age smoking    ABO     bmi occupation adm_date  
  <dbl> <fct>      <fct> <dbl> <chr>      <chr>     
1    30 non-smoker A         2 Journalist 10-09-2023
2    65 smoker     O         3 Chef       10-12-2023
3    35 smoker     O         2 Doctor     10-18-2023
4    25 non-smoker O         2 Teacher    10-27-2023
5    45 smoker     B         4 Lawyer     11-04-2023
6    55 non-smoker O         4 Musician   11-09-2024
7    40 non-smoker A         3 Pharmacist 11-22-2023
8    20 smoker     A         1 Nurse      12-02-2023
# show the levels of the ABO variable
levels(dat$ABO)
[1] "A"  "B"  "AB" "O" 
# create a count table
table(dat$ABO)

 A  B AB  O 
 3  1  0  4 
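
One caveat worth checking when we fix the levels in advance: any value that is not among the declared levels becomes NA. A minimal sketch, where the lowercase "o" is a hypothetical data-entry error:

```r
ABO_levels <- c("A", "B", "AB", "O")

abo_raw <- c("A", "O", "o", "AB")   # "o" is not a valid level
abo_f <- factor(abo_raw, levels = ABO_levels)

abo_f
is.na(abo_f)                    # the invalid value has become NA
table(abo_f, useNA = "ifany")   # include NA in the count table
```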

 

  • Variable: bmi (numeric coded values → ordered factor)

We might have noticed that the categorical variable bmi takes numerically coded values (1, 2, 3, 4) in our dataset, so it is recognized as a double <dbl> type. We can convert this variable into an ordered factor <ord> with levels (1=underweight, 2=healthy, 3=overweight, 4=obesity). Instead of overwriting the existing variable, we prefer to create a new variable bmi1, as follows:

# create a vector containing the four bmi categories
bmi1_labels <- c("underweight", "healthy", "overweight", "obesity")

# convert the variable to factor
dat$bmi1 <- factor(dat$bmi, levels = c(1, 2, 3, 4), 
                   labels = bmi1_labels, ordered = TRUE)
dat$bmi1
[1] healthy     overweight  healthy     healthy     obesity     obesity    
[7] overweight  underweight
Levels: underweight < healthy < overweight < obesity
dat
# A tibble: 8 × 7
    age smoking    ABO     bmi occupation adm_date   bmi1       
  <dbl> <fct>      <fct> <dbl> <chr>      <chr>      <ord>      
1    30 non-smoker A         2 Journalist 10-09-2023 healthy    
2    65 smoker     O         3 Chef       10-12-2023 overweight 
3    35 smoker     O         2 Doctor     10-18-2023 healthy    
4    25 non-smoker O         2 Teacher    10-27-2023 healthy    
5    45 smoker     B         4 Lawyer     11-04-2023 obesity    
6    55 non-smoker O         4 Musician   11-09-2024 obesity    
7    40 non-smoker A         3 Pharmacist 11-22-2023 overweight 
8    20 smoker     A         1 Nurse      12-02-2023 underweight

Now we can use comparison operators such as > to check whether one element of the ordered factor is larger than another.

dat$bmi1[2] > dat$bmi1[6]
[1] FALSE

However, the use of these operators on factors is much less common compared to numeric vectors. Therefore, we typically omit the ordered = TRUE argument, especially when we provide the order of categories explicitly in the levels argument.
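
The difference between ordered and unordered factors can be sketched on a standalone vector (the object names below are hypothetical):

```r
bmi_labels <- c("underweight", "healthy", "overweight", "obesity")

bmi_ord <- factor(c("healthy", "overweight", "obesity", "underweight"),
                  levels = bmi_labels, ordered = TRUE)

# comparisons follow the order of the levels
at_least_overweight <- bmi_ord >= "overweight"
at_least_overweight

max(bmi_ord)   # the highest category present

# the same data as an unordered factor: comparisons are not meaningful,
# so R returns NA (with a warning, suppressed here)
bmi_unord <- factor(as.character(bmi_ord), levels = bmi_labels)
res <- suppressWarnings(bmi_unord > "healthy")
res
```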

 

Now, let’s merge the “overweight” and “obesity” categories into a single category named “overweight/obesity” within a new variable called bmi2:

# recode the values
dat$bmi2 <- case_match(dat$bmi, 
                      1 ~ "underweight",
                      2 ~ "healthy",
                      c(3, 4) ~ "overweight/obesity")

# set the levels in order
bmi2_levels <- c("underweight", "healthy", "overweight/obesity")

# convert the variable to factor
dat$bmi2 <- factor(dat$bmi2, levels = bmi2_levels, ordered = TRUE)
dat$bmi2
[1] healthy            overweight/obesity healthy            healthy           
[5] overweight/obesity overweight/obesity overweight/obesity underweight       
Levels: underweight < healthy < overweight/obesity
dat
# A tibble: 8 × 8
    age smoking    ABO     bmi occupation adm_date   bmi1        bmi2           
  <dbl> <fct>      <fct> <dbl> <chr>      <chr>      <ord>       <ord>          
1    30 non-smoker A         2 Journalist 10-09-2023 healthy     healthy        
2    65 smoker     O         3 Chef       10-12-2023 overweight  overweight/obe…
3    35 smoker     O         2 Doctor     10-18-2023 healthy     healthy        
4    25 non-smoker O         2 Teacher    10-27-2023 healthy     healthy        
5    45 smoker     B         4 Lawyer     11-04-2023 obesity     overweight/obe…
6    55 non-smoker O         4 Musician   11-09-2024 obesity     overweight/obe…
7    40 non-smoker A         3 Pharmacist 11-22-2023 overweight  overweight/obe…
8    20 smoker     A         1 Nurse      12-02-2023 underweight underweight    
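
As an alternative sketch in base R (not the approach used above), the levels of an existing factor can be merged by assigning a named list to levels(), where each name is a new level and each value lists the old levels it absorbs. The vector below is a hypothetical stand-in for the bmi1 variable, stored as an unordered factor:

```r
bmi1 <- factor(c("healthy", "overweight", "obesity", "underweight"),
               levels = c("underweight", "healthy", "overweight", "obesity"))

bmi2 <- bmi1   # work on a copy so that the original is kept

# new level = old level(s) to be merged into it
levels(bmi2) <- list(
  "underweight"        = "underweight",
  "healthy"            = "healthy",
  "overweight/obesity" = c("overweight", "obesity")
)

bmi2
levels(bmi2)
```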

 

  • Variable: adm_date (chr → date)

In R, by default, values of class Date are displayed as YYYY-MM-DD. Our dates are stored as character strings in month-day-year format (e.g., “10-12-2023”), so we can convert them to class Date with the mdy() function from the {lubridate} package (part of the tidyverse):

dat$adm_date <- mdy(dat$adm_date)
dat
# A tibble: 8 × 8
    age smoking    ABO     bmi occupation adm_date   bmi1        bmi2           
  <dbl> <fct>      <fct> <dbl> <chr>      <date>     <ord>       <ord>          
1    30 non-smoker A         2 Journalist 2023-10-09 healthy     healthy        
2    65 smoker     O         3 Chef       2023-10-12 overweight  overweight/obe…
3    35 smoker     O         2 Doctor     2023-10-18 healthy     healthy        
4    25 non-smoker O         2 Teacher    2023-10-27 healthy     healthy        
5    45 smoker     B         4 Lawyer     2023-11-04 obesity     overweight/obe…
6    55 non-smoker O         4 Musician   2024-11-09 obesity     overweight/obe…
7    40 non-smoker A         3 Pharmacist 2023-11-22 overweight  overweight/obe…
8    20 smoker     A         1 Nurse      2023-12-02 underweight underweight    
class(dat$adm_date)
[1] "Date"
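
For reference, a minimal base R sketch of the same conversion uses as.Date() with an explicit format string, where %m, %d and %Y stand for month, day and four-digit year. The vector adm_chr holds the first two admission dates only:

```r
adm_chr <- c("10-09-2023", "10-12-2023")

# tell as.Date() how the character strings are laid out
adm_date <- as.Date(adm_chr, format = "%m-%d-%Y")

adm_date
class(adm_date)

# once stored as Date, we can do arithmetic, e.g. days between admissions
days_between <- as.numeric(diff(adm_date))
days_between
```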