10  Data import, manipulation, and transformation

Thus far, our focus has been on utilizing data objects generated through functions. However, in real-world scenarios, we often interact with externally stored data. In this chapter, we will learn how to insert, transform, and manipulate data in R.

When we have finished this chapter, we should be able to:

Learning objectives
  • Import the data from spreadsheets in R.
  • Select rows in a data frame according to filtering conditions.
  • Select columns in a data frame.
  • Direct the output of one dplyr function to the input of another function with the native “pipe” operator |>.
  • Create new variables to a data frame that are functions of existing columns with mutate().

10.1 Importing data

Spreadsheets are frequently used to organize structured rectangular data (see also Chapter 12), available in formats like Excel, CSV, or Google Sheets. Each format offers unique features, from Excel’s comprehensive capabilities to CSV’s simplicity and Google Sheets’ collaborative editing. In this textbook, data utilized are stored in Excel files (.xlsx) and, generally, have the following basic characteristics:

  • The first line in the file is a header row indicating the names of the columns/variables.
  • Each value has its own cell.
  • Missing values are represented as empty cells.

For example, Figure 10.1 illustrates data in a single sheet from an Excel file.

Figure 10.1: Screenshot of an Excel file

The dataset shock_index includes data regarding the characteristics of 30 adult patients with suspected pulmonary embolism. The following variables were recorded:

  • sex: F for females and M for males
  • age: age of the patients in years
  • sbp: systolic blood pressure in mmHg
  • HR: resting heart rate in beats/min

10.1.1 Importing data from Excel using the RStudio interface

To import data using the RStudio interface, we navigate to the Environment pane, where we’ll find the “Import Dataset” button. Clicking this button will prompt a dropdown menu with options to import data from various file formats such as CSV, Excel, or other common formats. We select “From Excel…” (Figure 10.2).

Figure 10.2: Import data using the RStudio interface

A dialog box then appears, enabling us to navigate our file system and select the desired dataset. Once selected, we confirm the import by clicking “Open”. Finally, we proceed by clicking the “Import” button, which imports the selected dataset into our RStudio session.”

Figure 10.3: Import data using the RStudio interface

Upon successful import, the dataset will appear in the Environment pane, ready for analysis and manipulation using R’s powerful data analysis tools and functions.

10.1.2 Importing data from Excel into R with commands

Alternatively, if the dataset is saved in the subfolder named “data” of our RStudio project (see Chapter 2), we can read it with the read_excel() function from readxl package as follows:

library(here)
library(readxl)

dat <- read_excel(here("data", "shock_index.xlsx"))
INFO

The path of a file/directory is its location/address in the file system. There are two kinds of paths:

  • absolute paths such as “C:/My_name/../my_project/data/shock_index.xlsx” and
  • relative paths such as “data/shock_index.xlsx”.

The function here() allows us to navigate throughout each of the subfolders and files within a given RStudio Project using relative paths (see also Chapter 5).

Using head() is a convenient way to get a quick overview of our dataset and check for any obvious issues before proceeding with further analysis or data manipulation.

head(dat)
# A tibble: 6 × 4
  sex     age   sbp    HR
  <chr> <dbl> <dbl> <dbl>
1 M        54   139    85
2 F        47   151    84
3 F        58   201   106
4 M        NA   170    56
5 F        60   123    55
6 F        57   121    90

R displays the first six rows of the data frame by default. If we want to view a different number of rows (e.g., the first ten rows), we can specify the desired number as an argument as follows:

head(dat, 10)
# A tibble: 10 × 4
   sex     age   sbp    HR
   <chr> <dbl> <dbl> <dbl>
 1 M        54   139    85
 2 F        47   151    84
 3 F        58   201   106
 4 M        NA   170    56
 5 F        60   123    55
 6 F        57   121    90
 7 M        54   119    85
 8 M        NA   100    92
 9 F        53   164    72
10 M        58   161   102

When working with large datasets in R, it’s often useful to inspect the last few rows of a data frame. The tail() function is used for this purpose:

tail(dat)
# A tibble: 6 × 4
  sex     age   sbp    HR
  <chr> <dbl> <dbl> <dbl>
1 M        58   119    87
2 F        48   104    93
3 F        55   164    69
4 M        59    92   119
5 F        48   127   109
6 F        51   136    60

Now, R displays the last six rows of the data frame by default.

10.2 Basic data manipulations

10.2.1 Adding an unique identifier for each observation

A unique identifier for each observation is often useful for various data manipulation and analysis tasks. In R, we can add row numbers starting from 1 to the number of rows in our data frame using the rowid_to_column() function from the tibble package:

library(tibble)
dat <- rowid_to_column(dat)
dat
# A tibble: 30 × 5
   rowid sex     age   sbp    HR
   <int> <chr> <dbl> <dbl> <dbl>
 1     1 M        54   139    85
 2     2 F        47   151    84
 3     3 F        58   201   106
 4     4 M        NA   170    56
 5     5 F        60   123    55
 6     6 F        57   121    90
 7     7 M        54   119    85
 8     8 M        NA   100    92
 9     9 F        53   164    72
10    10 M        58   161   102
# ℹ 20 more rows

10.2.2 Renaming variables and cleaning names

library(dplyr)
dat <- rename(dat, 
               patient_ID = rowid,
               Heart.Rate = HR)
dat
# A tibble: 30 × 5
   patient_ID sex     age   sbp Heart.Rate
        <int> <chr> <dbl> <dbl>      <dbl>
 1          1 M        54   139         85
 2          2 F        47   151         84
 3          3 F        58   201        106
 4          4 M        NA   170         56
 5          5 F        60   123         55
 6          6 F        57   121         90
 7          7 M        54   119         85
 8          8 M        NA   100         92
 9          9 F        53   164         72
10         10 M        58   161        102
# ℹ 20 more rows
library(janitor)
dat <- clean_names(dat)
dat
# A tibble: 30 × 5
   patient_id sex     age   sbp heart_rate
        <int> <chr> <dbl> <dbl>      <dbl>
 1          1 M        54   139         85
 2          2 F        47   151         84
 3          3 F        58   201        106
 4          4 M        NA   170         56
 5          5 F        60   123         55
 6          6 F        57   121         90
 7          7 M        54   119         85
 8          8 M        NA   100         92
 9          9 F        53   164         72
10         10 M        58   161        102
# ℹ 20 more rows

10.2.3 Converting to the appropriate data type

We might have noticed that the categorical variable sex is coded as F (for females) and M (for males), so it is recognized of character type. We can use the factor() function to encode a variable as a factor:

dat$sex <- factor(dat$sex, levels = c("F", "M"), 
                  labels = c("female", "male"))
dat
# A tibble: 30 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1          1 male      54   139         85
 2          2 female    47   151         84
 3          3 female    58   201        106
 4          4 male      NA   170         56
 5          5 female    60   123         55
 6          6 female    57   121         90
 7          7 male      54   119         85
 8          8 male      NA   100         92
 9          9 female    53   164         72
10         10 male      58   161        102
# ℹ 20 more rows

10.3 Basic data transformations

10.3.1 Subsetting observations (rows)

A. Select observations by indexing position within square brackets

Just as we use square brackets [row, column] to select elements from matrices, we employ the same approach to select elements from a data frame. For example, by specifying the row indices [5:10, ] within the brackets, we can efficiently access the required data for rows 5 through 10:

dat1 <- dat[5:10, ]
dat1
# A tibble: 6 × 5
  patient_id sex      age   sbp heart_rate
       <int> <fct>  <dbl> <dbl>      <dbl>
1          5 female    60   123         55
2          6 female    57   121         90
3          7 male      54   119         85
4          8 male      NA   100         92
5          9 female    53   164         72
6         10 male      58   161        102

B. Select observations by indexing with conditions within the square brackets

dat2 <- dat[which(dat$age > 55), ]
dat2
# A tibble: 11 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1          3 female    58   201        106
 2          5 female    60   123         55
 3          6 female    57   121         90
 4         10 male      58   161        102
 5         13 female    57   118         89
 6         17 male      56   126        107
 7         19 male      59   138         85
 8         20 male      56   153         85
 9         24 female    57   121         90
10         25 male      58   119         87
11         28 male      59    92        119
dat3 <- dat[which(dat$age > 55 & dat$sex == "female"), ]
dat3
# A tibble: 5 × 5
  patient_id sex      age   sbp heart_rate
       <int> <fct>  <dbl> <dbl>      <dbl>
1          3 female    58   201        106
2          5 female    60   123         55
3          6 female    57   121         90
4         13 female    57   118         89
5         24 female    57   121         90

C. Select observations using the subset() function

dat4 <- subset(dat, age > 55)
dat4
# A tibble: 11 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1          3 female    58   201        106
 2          5 female    60   123         55
 3          6 female    57   121         90
 4         10 male      58   161        102
 5         13 female    57   118         89
 6         17 male      56   126        107
 7         19 male      59   138         85
 8         20 male      56   153         85
 9         24 female    57   121         90
10         25 male      58   119         87
11         28 male      59    92        119
dat5 <- subset(dat, age > 55 & sex == "female")
dat5
# A tibble: 5 × 5
  patient_id sex      age   sbp heart_rate
       <int> <fct>  <dbl> <dbl>      <dbl>
1          3 female    58   201        106
2          5 female    60   123         55
3          6 female    57   121         90
4         13 female    57   118         89
5         24 female    57   121         90

D. Select observations using the filter() function from {dplyr} package

We pass the data frame first and then one or more conditions separated by a comma:

dat6 <- filter(dat, age > 55)
dat6
# A tibble: 11 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1          3 female    58   201        106
 2          5 female    60   123         55
 3          6 female    57   121         90
 4         10 male      58   161        102
 5         13 female    57   118         89
 6         17 male      56   126        107
 7         19 male      59   138         85
 8         20 male      56   153         85
 9         24 female    57   121         90
10         25 male      58   119         87
11         28 male      59    92        119

If we want to select only female patients with age > 55:

dat7 <- filter(dat, age > 55, sex == "female")
dat7
# A tibble: 5 × 5
  patient_id sex      age   sbp heart_rate
       <int> <fct>  <dbl> <dbl>      <dbl>
1          3 female    58   201        106
2          5 female    60   123         55
3          6 female    57   121         90
4         13 female    57   118         89
5         24 female    57   121         90

10.3.2 Reordering observations (rows)

We can also sort observations in ascending or descending order of one or more variables (columns). When multiple column names are provided, each additional column helps resolve ties in preceding columns’ values. For example, we can arrange the rows of our data based on the values of sbp (in ascending order which is the default) as follows:

dat8 <- arrange(dat, sbp)
head(dat8, 10)
# A tibble: 10 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1         16 male      48    92        122
 2         28 male      59    92        119
 3          8 male      NA   100         92
 4         26 female    48   104         93
 5         15 female    50   109        142
 6         13 female    57   118         89
 7          7 male      54   119         85
 8         25 male      58   119         87
 9          6 female    57   121         90
10         24 female    57   121         90

We can use desc() to re-arrange in descending order. For example:

dat9 <- arrange(dat, desc(sbp))
head(dat9, 10)
# A tibble: 10 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1          3 female    58   201        106
 2         21 male      48   201        104
 3          4 male      NA   170         56
 4         22 female    52   170         57
 5          9 female    53   164         72
 6         11 female    49   164         90
 7         27 female    55   164         69
 8         10 male      58   161        102
 9         20 male      56   153         85
10          2 female    47   151         84
dat10 <- arrange(dat, sbp, heart_rate)
head(dat10, 10)
# A tibble: 10 × 5
   patient_id sex      age   sbp heart_rate
        <int> <fct>  <dbl> <dbl>      <dbl>
 1         28 male      59    92        119
 2         16 male      48    92        122
 3          8 male      NA   100         92
 4         26 female    48   104         93
 5         15 female    50   109        142
 6         13 female    57   118         89
 7          7 male      54   119         85
 8         25 male      58   119         87
 9          6 female    57   121         90
10         24 female    57   121         90

10.3.3 Subsetting variables (columns)

A. Select or exclude variables by indexing their name within square brackets

We can select only the sex, age, heart_rate, variables from the data frame:

dat11 <- dat[, c("sex", "age" , "heart_rate")]
head(dat11)
# A tibble: 6 × 3
  sex      age heart_rate
  <fct>  <dbl>      <dbl>
1 male      54         85
2 female    47         84
3 female    58        106
4 male      NA         56
5 female    60         55
6 female    57         90
dat12 <- dat[, !names(dat) %in% c("patient_id", "sbp")]
head(dat12)
# A tibble: 6 × 3
  sex      age heart_rate
  <fct>  <dbl>      <dbl>
1 male      54         85
2 female    47         84
3 female    58        106
4 male      NA         56
5 female    60         55
6 female    57         90

C. Select variables using the subset() function

dat15 <- subset(dat, select = c("sex", "age" , "heart_rate"))
head(dat15)
# A tibble: 6 × 3
  sex      age heart_rate
  <fct>  <dbl>      <dbl>
1 male      54         85
2 female    47         84
3 female    58        106
4 male      NA         56
5 female    60         55
6 female    57         90
dat16 <- subset(dat, select = c(2, 3, 5))
head(dat16)
# A tibble: 6 × 3
  sex      age heart_rate
  <fct>  <dbl>      <dbl>
1 male      54         85
2 female    47         84
3 female    58        106
4 male      NA         56
5 female    60         55
6 female    57         90

The select argument allows the selection by indexing the columns of interest in.

D. Select variables using the select() function from {dplyr} package

In select() function we pass the data frame first and then the variables separated by commas:

dat17 <- select(dat, sex, age, heart_rate)
head(dat17)
# A tibble: 6 × 3
  sex      age heart_rate
  <fct>  <dbl>      <dbl>
1 male      54         85
2 female    47         84
3 female    58        106
4 male      NA         56
5 female    60         55
6 female    57         90

10.3.4 Creating new variables

Suppose we want to create a new variable representing the ratio of heart rate to systolic blood pressure (HR/sbp), which we will call shock index. We can achieve this by using the mutate() function as follows:

dat18 <- mutate(dat, shock_index = round(heart_rate / sbp, digits = 2))

dat18
# A tibble: 30 × 6
   patient_id sex      age   sbp heart_rate shock_index
        <int> <fct>  <dbl> <dbl>      <dbl>       <dbl>
 1          1 male      54   139         85        0.61
 2          2 female    47   151         84        0.56
 3          3 female    58   201        106        0.53
 4          4 male      NA   170         56        0.33
 5          5 female    60   123         55        0.45
 6          6 female    57   121         90        0.74
 7          7 male      54   119         85        0.71
 8          8 male      NA   100         92        0.92
 9          9 female    53   164         72        0.44
10         10 male      58   161        102        0.63
# ℹ 20 more rows
INFO

By default, the mutate() function adds the new variable as the last column in the data frame.

Next, we want to categorize the new variable considering that normal values of this index range from 0.5 to 0.7 (beats/mmHg*min).

dat19 <- mutate(dat18,
                shock_index2 = cut(shock_index, breaks = c(-Inf, 0.5, 0.7, Inf),
                                   labels = c("low","normal","high"))
               )
dat19
# A tibble: 30 × 7
   patient_id sex      age   sbp heart_rate shock_index shock_index2
        <int> <fct>  <dbl> <dbl>      <dbl>       <dbl> <fct>       
 1          1 male      54   139         85        0.61 normal      
 2          2 female    47   151         84        0.56 normal      
 3          3 female    58   201        106        0.53 normal      
 4          4 male      NA   170         56        0.33 low         
 5          5 female    60   123         55        0.45 low         
 6          6 female    57   121         90        0.74 high        
 7          7 male      54   119         85        0.71 high        
 8          8 male      NA   100         92        0.92 high        
 9          9 female    53   164         72        0.44 low         
10         10 male      58   161        102        0.63 normal      
# ℹ 20 more rows

10.4 Using the native pipe operator in a sequence of commands

Given that each verb in dplyr specializes in a particular task, addressing complex queries often typically involves integrating several verbs, a process simplified by using the native pipe operator, |>. For example, we want to filter the original dataset to include only rows where the shock index category is “low” or “high”, using a sequence of dplyr verbs.

sub_dat <- dat |> 
  mutate(shock_index = round(heart_rate / sbp, digits = 2),
         shock_index2 = cut(shock_index, breaks = c(-Inf, 0.5, 0.7, Inf),
                            labels = c("low","normal","high"))
  ) |> 
  filter(shock_index2 == "low" | shock_index2 == "high")

We would read this sequence as:

  1. Begin with the dataset dat then

  2. Use it as input to the mutate() function to calculate the shock_index variable and create the shock_index2 categorical variable, then

  3. Use this output as the input to the filter() function to select patients with “low” or “high” value of shock index.

print(sub_dat, n = Inf)
# A tibble: 21 × 7
   patient_id sex      age   sbp heart_rate shock_index shock_index2
        <int> <fct>  <dbl> <dbl>      <dbl>       <dbl> <fct>       
 1          4 male      NA   170         56        0.33 low         
 2          5 female    60   123         55        0.45 low         
 3          6 female    57   121         90        0.74 high        
 4          7 male      54   119         85        0.71 high        
 5          8 male      NA   100         92        0.92 high        
 6          9 female    53   164         72        0.44 low         
 7         13 female    57   118         89        0.75 high        
 8         14 male      52   130        106        0.82 high        
 9         15 female    50   109        142        1.3  high        
10         16 male      48    92        122        1.33 high        
11         17 male      56   126        107        0.85 high        
12         18 female    54   139         61        0.44 low         
13         22 female    52   170         57        0.34 low         
14         23 male      55   122         56        0.46 low         
15         24 female    57   121         90        0.74 high        
16         25 male      58   119         87        0.73 high        
17         26 female    48   104         93        0.89 high        
18         27 female    55   164         69        0.42 low         
19         28 male      59    92        119        1.29 high        
20         29 female    48   127        109        0.86 high        
21         30 female    51   136         60        0.44 low         

Alternatively we can use the %in% operator. This operator helps us to easily create multiple OR arguments:

dat |> 
  mutate(shock_index = round(heart_rate / sbp, digits = 2),
         shock_index2 = cut(shock_index, breaks = c(-Inf, 0.5, 0.7, Inf),
                            labels = c("low","normal","high"))
  ) |> 
  filter(shock_index2 %in% c("low", "high"))
# A tibble: 21 × 7
   patient_id sex      age   sbp heart_rate shock_index shock_index2
        <int> <fct>  <dbl> <dbl>      <dbl>       <dbl> <fct>       
 1          4 male      NA   170         56        0.33 low         
 2          5 female    60   123         55        0.45 low         
 3          6 female    57   121         90        0.74 high        
 4          7 male      54   119         85        0.71 high        
 5          8 male      NA   100         92        0.92 high        
 6          9 female    53   164         72        0.44 low         
 7         13 female    57   118         89        0.75 high        
 8         14 male      52   130        106        0.82 high        
 9         15 female    50   109        142        1.3  high        
10         16 male      48    92        122        1.33 high        
# ℹ 11 more rows

10.5 Reshaping data

In a pre-post test study investigating the effects of margarine consumption on blood total cholesterol (TCH) levels, eighteen participants completed a 12-week dietary intervention. Before beginning the specialized diet, baseline measurements of blood total cholesterol (in mmol/L) were recorded for each participant (week 0). Follow-up evaluations took place at the 6-week and 12-week intervals, enabling the assessment of changes in cholesterol levels during the intervention period. The collected data were organized and stored within an Excel spreadsheet, structured as follows:

dat_TCH <- read_excel(here("data", "cholesterol.xlsx"))
head(dat_TCH)
# A tibble: 6 × 3
  week0 week6 week12
  <dbl> <dbl>  <dbl>
1  6.42  5.83   5.75
2  6.76  6.2    6.13
3  6.56  5.83   5.71
4  4.8   4.27   4.15
5  8.43  7.71   7.67
6  7.49  7.12   7.05

We observe that the data set is structured in what is commonly referred to as a “wide” format. The wide format of the data is particularly suited for a repeated measures design, also known as a longitudinal or within-subject design. In such a design, the same individuals are measured multiple times over a period, with each measurement representing a different time point.

10.5.1 Ensuring that every row has a unique identifier

Before proceeding with data manipulation and analysis tasks, we create a new column in the data set that assigns a unique identifier to each row using the rowid_to_column() function:

dat_TCH <- dat_TCH |> 
  rowid_to_column()
head(dat_TCH)
# A tibble: 6 × 4
  rowid week0 week6 week12
  <int> <dbl> <dbl>  <dbl>
1     1  6.42  5.83   5.75
2     2  6.76  6.2    6.13
3     3  6.56  5.83   5.71
4     4  4.8   4.27   4.15
5     5  8.43  7.71   7.67
6     6  7.49  7.12   7.05

10.5.2 From wide to long format

Wide format data may not always be the most suitable data shape for certain types of statistical analyses or visualizations compared to “long” format data. In long format, each row typically represents a single observation, while another variable denotes different time points or conditions.

Now, let’s reshape the dataset from a wide format to a long format.

library(tidyr)
dat_TCH2_long <- dat_TCH |> 
  pivot_longer(
    cols = starts_with("week"), 
    names_to = "week",
    values_to = "total_cholesterol"
  )
head(dat_TCH2_long)
# A tibble: 6 × 3
  rowid week   total_cholesterol
  <int> <chr>              <dbl>
1     1 week0               6.42
2     1 week6               5.83
3     1 week12              5.75
4     2 week0               6.76
5     2 week6               6.2 
6     2 week12              6.13

To understand how the reshaping process works, it’s helpful to break down the pivot_longer() function with its arguments:

The argument cols = starts_with("week") specifies which columns in the original data should be pivoted. In this instance, we select the columns that start with the prefix “week”, such as “week0”, “week6”, and “week12”.

First, the values within the identifier column (rowid) of wide format need to be repeated in the long format as many times as the number of columns being pivoted, which in this case is three (Figure 10.4).

Figure 10.4: Columns in the original data that should be reshaped.

By using the argument names_to = "week", the column names being pivoted (i.e., week0, week6, and week12) become values within a newly defined variable in the long format with the name week. They are repeated as many times as the number of rows in the wide format, which in this case is two (Figure 10.5).

Figure 10.5: Columns in the original data that should be reshaped.

Finally, by using the argument values_to = "total_cholesterol", the cell values from the pivoted variables are placed into a newly defined variable in the long format with the name total_cholesterol (Figure 10.6).

Figure 10.6: Columns in the original data that should be reshaped.

10.5.3 From long to wide format

dat_TCH2_long |> 
  pivot_wider(
    names_from = week,
    values_from = total_cholesterol  
    )
# A tibble: 18 × 4
   rowid week0 week6 week12
   <int> <dbl> <dbl>  <dbl>
 1     1  6.42  5.83   5.75
 2     2  6.76  6.2    6.13
 3     3  6.56  5.83   5.71
 4     4  4.8   4.27   4.15
 5     5  8.43  7.71   7.67
 6     6  7.49  7.12   7.05
 7     7  8.05  7.25   7.1 
 8     8  5.05  4.63   4.67
 9     9  5.77  5.31   5.33
10    10  3.91  3.7    3.66
11    11  6.77  6.15   5.96
12    12  6.44  5.59   5.64
13    13  6.17  5.56   5.51
14    14  7.67  7.11   6.96
15    15  7.34  6.84   6.82
16    16  6.85  6.4    6.29
17    17  5.13  4.52   4.45
18    18  5.73  5.13   5.17