A Scientist's Guide to R: Step 2.1. Data Transformation - Part 1

Last updated on May 19, 2021 38 min read 0 CommentsR, Reproducible Research, Data Cleaning and Transformation

A Scientist's Guide to R: Step 2.1. Data Transformation - Part 1

Last updated on May 19, 2021 38 min read 0 CommentsR, Reproducible Research, Data Cleaning and Transformation

1 TL;DR
2 Introduction
3 select()
- 3.1 Renaming Columns with select() or rename()
4 filter()
- 4.1 Subset Rows using Indices with slice()
5 mutate()
- 5.1 Recoding or Creating Indicator Variables using if_else(), case_when(), or recode()
6 summarise()
7 group_by()
8 Chaining Functions with the pipe operator (%>%)
9 Navigation
10 Notes

1 TL;DR

The 4th post in the Scientist’s Guide to R series introduces data transformation techniques useful for wrangling/tidying/cleaning data. Specifically, we will be learning how to use the 6 core functions (and a few others) from the popular dplyr package, perform similar operations in base R, and chain operations with the pipe operator (%>%) to streamline the process.

2 Introduction

The 4th post in the Scientist’s Guide to R series introduces data transformation techniques useful for wrangling/tidying/cleaning data, which often takes longer than any other step in the data analysis process. Specifically, we will be learning how to use the 6 core functions or “verbs” of the popular dplyr package to:

Easily select a subset of columns.
filter rows using logical tests with values of specified columns.
Modify columns or add new ones using mutate.
Obtain descriptive summaries of data using summarise (or the equivalent summarize()).
Assign a grouping structure to a data frame to enable subsequent dplyr package function calls to be executed within groups using group_by.
arrange a data frame for improved readability.

Some additional relevant functions that build on this foundation will also be covered where appropriate.

While other great packages (and other functions in base R) exist to help you manipulate data, using the dplyr package is perhaps the easiest way to learn these essential data processing techniques. However, if you find yourself working with very large datasets (e.g. > 1,000,000 rows) you might want to check out the data.table package, which emphasizes efficiency and performance over readability and ease of use.

Most dplyr functions accept a data frame or tibble as their first argument and return a data frame as their output. This is a key design feature that enables you to apply a series of operations to a data frame in a chain using the pipe operator (%>%), as demonstrated in section 8.

3 select()

select() is one of the dplyr functions you can use for subsetting. select() lets you extract a subset of columns from a data frame.

Columns may be specified using indices, names, or using string patterns with the assistance of a few tidyselect helper functions such as contains(), starts_with(), and ends_with().

library(dplyr) #load the dplyr package with the library function.

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#the dplyr package will be used throughout this tutorial so make sure it is
#loaded before trying to run the examples yourself

df <- starwars #assign the star wars data frame (imported when dplyr is loaded) to the label/name "df"

# inspect the structure

glimpse(df)

## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

#in base R, you would extract columns using either [, "name"] or [["name"]], or $name
df[, "name"] #[] returns another data frame

## # A tibble: 87 x 1
##    name              
##    <chr>             
##  1 Luke Skywalker    
##  2 C-3PO             
##  3 R2-D2             
##  4 Darth Vader       
##  5 Leia Organa       
##  6 Owen Lars         
##  7 Beru Whitesun lars
##  8 R5-D4             
##  9 Biggs Darklighter 
## 10 Obi-Wan Kenobi    
## # ... with 77 more rows

head(df[["name"]]) #[[]] returns a 1 dimensional vector. head() just shows the 1st 6 values

## [1] "Luke Skywalker" "C-3PO"          "R2-D2"          "Darth Vader"   
## [5] "Leia Organa"    "Owen Lars"

df[, c("name", "mass", "hair_color")] #[] can accept a vector of names

## # A tibble: 87 x 3
##    name                mass hair_color   
##    <chr>              <dbl> <chr>        
##  1 Luke Skywalker        77 blond        
##  2 C-3PO                 75 <NA>         
##  3 R2-D2                 32 <NA>         
##  4 Darth Vader          136 none         
##  5 Leia Organa           49 brown        
##  6 Owen Lars            120 brown, grey  
##  7 Beru Whitesun lars    75 brown        
##  8 R5-D4                 32 <NA>         
##  9 Biggs Darklighter     84 black        
## 10 Obi-Wan Kenobi        77 auburn, white
## # ... with 77 more rows

#df[[c("name", "mass", "hair_color")]] #[[]] can't and returns an error

#or a logical or integer vector the same length as the number of columns

df[, 1:ncol(df) %% 2 == 0] #columns with even indices

## # A tibble: 87 x 7
##    height hair_color    eye_color sex    homeworld films     starships
##     <int> <chr>         <chr>     <chr>  <chr>     <list>    <list>   
##  1    172 blond         blue      male   Tatooine  <chr [5]> <chr [2]>
##  2    167 <NA>          yellow    none   Tatooine  <chr [6]> <chr [0]>
##  3     96 <NA>          red       none   Naboo     <chr [7]> <chr [0]>
##  4    202 none          yellow    male   Tatooine  <chr [4]> <chr [1]>
##  5    150 brown         brown     female Alderaan  <chr [5]> <chr [0]>
##  6    178 brown, grey   blue      male   Tatooine  <chr [3]> <chr [0]>
##  7    165 brown         blue      female Tatooine  <chr [3]> <chr [0]>
##  8     97 <NA>          red       none   Tatooine  <chr [1]> <chr [0]>
##  9    183 black         brown     male   Tatooine  <chr [1]> <chr [1]>
## 10    182 auburn, white blue-gray male   Stewjon   <chr [6]> <chr [5]>
## # ... with 77 more rows

df[, 1:5] #1st 5 columns

## # A tibble: 87 x 5
##    name               height  mass hair_color    skin_color 
##    <chr>               <int> <dbl> <chr>         <chr>      
##  1 Luke Skywalker        172    77 blond         fair       
##  2 C-3PO                 167    75 <NA>          gold       
##  3 R2-D2                  96    32 <NA>          white, blue
##  4 Darth Vader           202   136 none          white      
##  5 Leia Organa           150    49 brown         light      
##  6 Owen Lars             178   120 brown, grey   light      
##  7 Beru Whitesun lars    165    75 brown         light      
##  8 R5-D4                  97    32 <NA>          white, red 
##  9 Biggs Darklighter     183    84 black         light      
## 10 Obi-Wan Kenobi        182    77 auburn, white fair       
## # ... with 77 more rows

#dplyr::select() makes extracting multiple columns even easier
select(df, 
       name, mass, hair_color) #note that names don't have to be quoted

## # A tibble: 87 x 3
##    name                mass hair_color   
##    <chr>              <dbl> <chr>        
##  1 Luke Skywalker        77 blond        
##  2 C-3PO                 75 <NA>         
##  3 R2-D2                 32 <NA>         
##  4 Darth Vader          136 none         
##  5 Leia Organa           49 brown        
##  6 Owen Lars            120 brown, grey  
##  7 Beru Whitesun lars    75 brown        
##  8 R5-D4                 32 <NA>         
##  9 Biggs Darklighter     84 black        
## 10 Obi-Wan Kenobi        77 auburn, white
## # ... with 77 more rows

select(df, name:hair_color) #and you can get all columns in a range using unquoted names

## # A tibble: 87 x 4
##    name               height  mass hair_color   
##    <chr>               <int> <dbl> <chr>        
##  1 Luke Skywalker        172    77 blond        
##  2 C-3PO                 167    75 <NA>         
##  3 R2-D2                  96    32 <NA>         
##  4 Darth Vader           202   136 none         
##  5 Leia Organa           150    49 brown        
##  6 Owen Lars             178   120 brown, grey  
##  7 Beru Whitesun lars    165    75 brown        
##  8 R5-D4                  97    32 <NA>         
##  9 Biggs Darklighter     183    84 black        
## 10 Obi-Wan Kenobi        182    77 auburn, white
## # ... with 77 more rows

select(df, 1:5, 9, 10) #select also accepts column indices and doesn't require c()

## # A tibble: 87 x 7
##    name               height  mass hair_color    skin_color  gender    homeworld
##    <chr>               <int> <dbl> <chr>         <chr>       <chr>     <chr>    
##  1 Luke Skywalker        172    77 blond         fair        masculine Tatooine 
##  2 C-3PO                 167    75 <NA>          gold        masculine Tatooine 
##  3 R2-D2                  96    32 <NA>          white, blue masculine Naboo    
##  4 Darth Vader           202   136 none          white       masculine Tatooine 
##  5 Leia Organa           150    49 brown         light       feminine  Alderaan 
##  6 Owen Lars             178   120 brown, grey   light       masculine Tatooine 
##  7 Beru Whitesun lars    165    75 brown         light       feminine  Tatooine 
##  8 R5-D4                  97    32 <NA>          white, red  masculine Tatooine 
##  9 Biggs Darklighter     183    84 black         light       masculine Tatooine 
## 10 Obi-Wan Kenobi        182    77 auburn, white fair        masculine Stewjon  
## # ... with 77 more rows

#select everything other than a specified column using the subtraction operator "-"
select(df, -hair_color, -mass)

## # A tibble: 87 x 12
##    name    height skin_color eye_color birth_year sex   gender homeworld species
##    <chr>    <int> <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>  
##  1 Luke S~    172 fair       blue            19   male  mascu~ Tatooine  Human  
##  2 C-3PO      167 gold       yellow         112   none  mascu~ Tatooine  Droid  
##  3 R2-D2       96 white, bl~ red             33   none  mascu~ Naboo     Droid  
##  4 Darth ~    202 white      yellow          41.9 male  mascu~ Tatooine  Human  
##  5 Leia O~    150 light      brown           19   fema~ femin~ Alderaan  Human  
##  6 Owen L~    178 light      blue            52   male  mascu~ Tatooine  Human  
##  7 Beru W~    165 light      blue            47   fema~ femin~ Tatooine  Human  
##  8 R5-D4       97 white, red red             NA   none  mascu~ Tatooine  Droid  
##  9 Biggs ~    183 light      brown           24   male  mascu~ Tatooine  Human  
## 10 Obi-Wa~    182 fair       blue-gray       57   male  mascu~ Stewjon   Human  
## # ... with 77 more rows, and 3 more variables: films <list>, vehicles <list>,
## #   starships <list>

select(df, -c(name:hair_color)) #everything other than a range of named columns

## # A tibble: 87 x 10
##    skin_color eye_color birth_year sex   gender homeworld species films vehicles
##    <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
##  1 fair       blue            19   male  mascu~ Tatooine  Human   <chr~ <chr [2~
##  2 gold       yellow         112   none  mascu~ Tatooine  Droid   <chr~ <chr [0~
##  3 white, bl~ red             33   none  mascu~ Naboo     Droid   <chr~ <chr [0~
##  4 white      yellow          41.9 male  mascu~ Tatooine  Human   <chr~ <chr [0~
##  5 light      brown           19   fema~ femin~ Alderaan  Human   <chr~ <chr [1~
##  6 light      blue            52   male  mascu~ Tatooine  Human   <chr~ <chr [0~
##  7 light      blue            47   fema~ femin~ Tatooine  Human   <chr~ <chr [0~
##  8 white, red red             NA   none  mascu~ Tatooine  Droid   <chr~ <chr [0~
##  9 light      brown           24   male  mascu~ Tatooine  Human   <chr~ <chr [0~
## 10 fair       blue-gray       57   male  mascu~ Stewjon   Human   <chr~ <chr [1~
## # ... with 77 more rows, and 1 more variable: starships <list>

# dplyr also provides a number of "select helper" functions that allow you to
# select variables using string patterns, e.g.

select(df, contains("m")) #all columns with names that contain the letter m

## # A tibble: 87 x 4
##    name                mass homeworld films    
##    <chr>              <dbl> <chr>     <list>   
##  1 Luke Skywalker        77 Tatooine  <chr [5]>
##  2 C-3PO                 75 Tatooine  <chr [6]>
##  3 R2-D2                 32 Naboo     <chr [7]>
##  4 Darth Vader          136 Tatooine  <chr [4]>
##  5 Leia Organa           49 Alderaan  <chr [5]>
##  6 Owen Lars            120 Tatooine  <chr [3]>
##  7 Beru Whitesun lars    75 Tatooine  <chr [3]>
##  8 R5-D4                 32 Tatooine  <chr [1]>
##  9 Biggs Darklighter     84 Tatooine  <chr [1]>
## 10 Obi-Wan Kenobi        77 Stewjon   <chr [6]>
## # ... with 77 more rows

select(df, starts_with("m")) #all columns with names that start with the letter m

## # A tibble: 87 x 1
##     mass
##    <dbl>
##  1    77
##  2    75
##  3    32
##  4   136
##  5    49
##  6   120
##  7    75
##  8    32
##  9    84
## 10    77
## # ... with 77 more rows

select(df, ends_with("s")) #all columns with names that end with the letter s

## # A tibble: 87 x 5
##     mass species films     vehicles  starships
##    <dbl> <chr>   <list>    <list>    <list>   
##  1    77 Human   <chr [5]> <chr [2]> <chr [2]>
##  2    75 Droid   <chr [6]> <chr [0]> <chr [0]>
##  3    32 Droid   <chr [7]> <chr [0]> <chr [0]>
##  4   136 Human   <chr [4]> <chr [0]> <chr [1]>
##  5    49 Human   <chr [5]> <chr [1]> <chr [0]>
##  6   120 Human   <chr [3]> <chr [0]> <chr [0]>
##  7    75 Human   <chr [3]> <chr [0]> <chr [0]>
##  8    32 Droid   <chr [1]> <chr [0]> <chr [0]>
##  9    84 Human   <chr [1]> <chr [0]> <chr [1]>
## 10    77 Human   <chr [6]> <chr [1]> <chr [5]>
## # ... with 77 more rows

# there is also a select helper called "matches" that enables you to select columns
# with more complex string patterns using regular expressions 

# a detailed introduction to the "matches" select helper, string manipulation,
# and regular expressions will be covered in a later post.

# see the documentation for select_helpers (from the tidyselect package, which
# is loaded with dplyr) for more info
# ?tidyselect::select_helpers

# you can rearrange columns and rename them during the selection process
select(df, 5:last_col(), 1:4) #select columns 5 to the end, then columns 1 to 4

## # A tibble: 87 x 14
##    skin_color eye_color birth_year sex   gender homeworld species films vehicles
##    <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
##  1 fair       blue            19   male  mascu~ Tatooine  Human   <chr~ <chr [2~
##  2 gold       yellow         112   none  mascu~ Tatooine  Droid   <chr~ <chr [0~
##  3 white, bl~ red             33   none  mascu~ Naboo     Droid   <chr~ <chr [0~
##  4 white      yellow          41.9 male  mascu~ Tatooine  Human   <chr~ <chr [0~
##  5 light      brown           19   fema~ femin~ Alderaan  Human   <chr~ <chr [1~
##  6 light      blue            52   male  mascu~ Tatooine  Human   <chr~ <chr [0~
##  7 light      blue            47   fema~ femin~ Tatooine  Human   <chr~ <chr [0~
##  8 white, red red             NA   none  mascu~ Tatooine  Droid   <chr~ <chr [0~
##  9 light      brown           24   male  mascu~ Tatooine  Human   <chr~ <chr [0~
## 10 fair       blue-gray       57   male  mascu~ Stewjon   Human   <chr~ <chr [1~
## # ... with 77 more rows, and 5 more variables: starships <list>, name <chr>,
## #   height <int>, mass <dbl>, hair_color <chr>

# to select() a single column and turn it into a vector (instead of leaving it as a data frame),

all.equal(
  pull(select(df, mass), mass), # wrap it in the pull function (also from dplyr)
  df$mass, # or use the $ symbol
  df[["mass"]] # or the equivalent `[[` operator
  )

## [1] TRUE

3.1 Renaming Columns with select() or rename()

Columns can also be renamed when subsetting with the select() function or without subsetting using the rename() function (also in the dplyr package).

#rename and subset using select
select(df, nm = name, hgt = height, byear = birth_year) #select columns and rename them

## # A tibble: 87 x 3
##    nm                   hgt byear
##    <chr>              <int> <dbl>
##  1 Luke Skywalker       172  19  
##  2 C-3PO                167 112  
##  3 R2-D2                 96  33  
##  4 Darth Vader          202  41.9
##  5 Leia Organa          150  19  
##  6 Owen Lars            178  52  
##  7 Beru Whitesun lars   165  47  
##  8 R5-D4                 97  NA  
##  9 Biggs Darklighter    183  24  
## 10 Obi-Wan Kenobi       182  57  
## # ... with 77 more rows

#to just rename columns without subsetting, use dplyr::rename() instead of select()
rename(df, nm = name, hgt = height, b_year = birth_year) #returns all columns

## # A tibble: 87 x 14
##    nm           hgt  mass hair_color   skin_color eye_color b_year sex   gender 
##    <chr>      <int> <dbl> <chr>        <chr>      <chr>      <dbl> <chr> <chr>  
##  1 Luke Skyw~   172    77 blond        fair       blue        19   male  mascul~
##  2 C-3PO        167    75 <NA>         gold       yellow     112   none  mascul~
##  3 R2-D2         96    32 <NA>         white, bl~ red         33   none  mascul~
##  4 Darth Vad~   202   136 none         white      yellow      41.9 male  mascul~
##  5 Leia Orga~   150    49 brown        light      brown       19   fema~ femini~
##  6 Owen Lars    178   120 brown, grey  light      blue        52   male  mascul~
##  7 Beru Whit~   165    75 brown        light      blue        47   fema~ femini~
##  8 R5-D4         97    32 <NA>         white, red red         NA   none  mascul~
##  9 Biggs Dar~   183    84 black        light      brown       24   male  mascul~
## 10 Obi-Wan K~   182    77 auburn, whi~ fair       blue-gray   57   male  mascul~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

#to save the updated names just assign the output back to the same data frame object
renamed_df <- rename(df, nm = name, hgt = height, b_year = birth_year) 

#In base R you can just assign names using a string vector. 
#The main disadvantage of this method is that all variable names have to be specified,
#instead of just the ones that you want to change (which is all you need to do with the rename function)
#Unlike select() or rename(), the base R method also requires that names be assigned in the correct order,
#according to the column indices.

df_names <- names(df) #store the original names

names(df) <- c("nm", "hgt", "mass", 
               "hair_color", "skin_color", "eye_color",
               "b_year") #returns a warning and then all non-specified names become NA

## Warning: The `value` argument of `names<-` must have the same length as `x` as of tibble 3.0.0.
## `names` must have length 14, not 7.

## Warning: The `value` argument of `names<-` can't be empty as of tibble 3.0.0.
## Columns 8, 9, 10, 11, 12, and 2 more must be named.

names(df) <- df_names #restore the original names from the vector saved above

4 filter()

filter() is the other core dplyr verb you can use for subsetting your data. filter() lets you extract a subset of rows using logical operators.

To filter a data frame, specify the name of the data frame as the 1st argument, then any number of logical tests for values of variables in the data frame that you want to use as filtering criteria.

# we'll just work with a few columns to simplify the output and highlight the filtering
df2 <- select(df, name, height, species, homeworld) 

unique(df2$homeworld) #check the unique values of the homeworld variable, which we will use for filtering

##  [1] "Tatooine"       "Naboo"          "Alderaan"       "Stewjon"       
##  [5] "Eriadu"         "Kashyyyk"       "Corellia"       "Rodia"         
##  [9] "Nal Hutta"      "Bestine IV"     NA               "Kamino"        
## [13] "Trandosha"      "Socorro"        "Bespin"         "Mon Cala"      
## [17] "Chandrila"      "Endor"          "Sullust"        "Cato Neimoidia"
## [21] "Coruscant"      "Toydaria"       "Malastare"      "Dathomir"      
## [25] "Ryloth"         "Vulpter"        "Troiken"        "Tund"          
## [29] "Haruun Kal"     "Cerea"          "Glee Anselm"    "Iridonia"      
## [33] "Iktotch"        "Quermia"        "Dorin"          "Champala"      
## [37] "Geonosis"       "Mirial"         "Serenno"        "Concord Dawn"  
## [41] "Zolan"          "Ojom"           "Aleen Minor"    "Skako"         
## [45] "Muunilinst"     "Shili"          "Kalee"          "Umbara"        
## [49] "Utapau"

# use filter to get a subset of all star wars characters from either Tatooine or Naboo

filter(df2, 
       homeworld == "Tatooine" | homeworld == "Naboo") #recall that the vertical bar `|` is the logical operator 'or'

## # A tibble: 21 x 4
##    name               height species homeworld
##    <chr>               <int> <chr>   <chr>    
##  1 Luke Skywalker        172 Human   Tatooine 
##  2 C-3PO                 167 Droid   Tatooine 
##  3 R2-D2                  96 Droid   Naboo    
##  4 Darth Vader           202 Human   Tatooine 
##  5 Owen Lars             178 Human   Tatooine 
##  6 Beru Whitesun lars    165 Human   Tatooine 
##  7 R5-D4                  97 Droid   Tatooine 
##  8 Biggs Darklighter     183 Human   Tatooine 
##  9 Anakin Skywalker      188 Human   Tatooine 
## 10 Palpatine             170 Human   Naboo    
## # ... with 11 more rows

# this is read: filter the data frame "df2" to extract rows where the value of
# the variable homeworld is equal to "Tatooine" or homeworld is equal to "Naboo"

#in base R, one equivalent option would be:

df2[(df2$homeworld == "Tatooine" | df2$homeworld == "Naboo"), ]

## # A tibble: 31 x 4
##    name               height species homeworld
##    <chr>               <int> <chr>   <chr>    
##  1 Luke Skywalker        172 Human   Tatooine 
##  2 C-3PO                 167 Droid   Tatooine 
##  3 R2-D2                  96 Droid   Naboo    
##  4 Darth Vader           202 Human   Tatooine 
##  5 Owen Lars             178 Human   Tatooine 
##  6 Beru Whitesun lars    165 Human   Tatooine 
##  7 R5-D4                  97 Droid   Tatooine 
##  8 Biggs Darklighter     183 Human   Tatooine 
##  9 Anakin Skywalker      188 Human   Tatooine 
## 10 <NA>                   NA <NA>    <NA>     
## # ... with 21 more rows

# notice that the $ is not needed to specify which data set the variable
# "homeworld" is in for filter but it is if you are passing a logical vector to
# subset a data frame using `[]` in base R

# this is because R reads (df2$homeworld == "Tatooine" | df2$homeworld == "Naboo")
# as a logical test and returns a logical vector, which is then used to specify
# which rows to extract (those with values == TRUE) and which to discard (those
# with values == FALSE). This becomes really clear if you just pull out the
# subsetting predicates and print the result

(df2$homeworld == "Tatooine" | df2$homeworld == "Naboo")

##  [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE    NA  TRUE FALSE    NA FALSE FALSE
## [25] FALSE FALSE FALSE    NA FALSE FALSE    NA FALSE FALSE  TRUE  TRUE  TRUE
## [37]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## [61] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73]    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA    NA
## [85]    NA    NA  TRUE

# to get indices instead of logical values, you can wrap the above in which(), e.g.

ind <- which(df2$homeworld == "Tatooine" | df2$homeworld == "Naboo") #returns the indices where the criteria are TRUE

ind #calling an object itself without applying any functions to it just prints it to the console

##  [1]  1  2  3  4  6  7  8  9 11 20 34 35 36 37 40 41 57 58 59 63 87

#note that you could also pass a vector of integers to subset a df2 using `[]`
df2[ind, ]

## # A tibble: 21 x 4
##    name               height species homeworld
##    <chr>               <int> <chr>   <chr>    
##  1 Luke Skywalker        172 Human   Tatooine 
##  2 C-3PO                 167 Droid   Tatooine 
##  3 R2-D2                  96 Droid   Naboo    
##  4 Darth Vader           202 Human   Tatooine 
##  5 Owen Lars             178 Human   Tatooine 
##  6 Beru Whitesun lars    165 Human   Tatooine 
##  7 R5-D4                  97 Droid   Tatooine 
##  8 Biggs Darklighter     183 Human   Tatooine 
##  9 Anakin Skywalker      188 Human   Tatooine 
## 10 Palpatine             170 Human   Naboo    
## # ... with 11 more rows

#to negate a logical test, you can use `!` which means "not"
# in the context of subsetting, this negates a logical vector 

!(df2$homeworld == "Tatooine" | df2$homeworld == "Naboo")

##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    NA FALSE  TRUE    NA  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE    NA  TRUE  TRUE    NA  TRUE  TRUE FALSE FALSE FALSE
## [37] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [61]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [73]    NA  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    NA    NA    NA
## [85]    NA    NA FALSE

#this is why the logical test for not equal to is "!="
identical(!(df2$homeworld == "Tatooine"), df2$homeworld != "Tatooine") #use identical() or all.equal() to test for equality between entire R objects

## [1] TRUE

# extract rows for characters who are at least 100 cm tall and are from Naboo
filter(df2, height >= 100 & homeworld == "Naboo")

## # A tibble: 10 x 4
##    name          height species homeworld
##    <chr>          <int> <chr>   <chr>    
##  1 Palpatine        170 Human   Naboo    
##  2 Jar Jar Binks    196 Gungan  Naboo    
##  3 Roos Tarpals     224 Gungan  Naboo    
##  4 Rugor Nass       206 Gungan  Naboo    
##  5 Ric Olié         183 <NA>    Naboo    
##  6 Quarsh Panaka    183 <NA>    Naboo    
##  7 Gregar Typho     185 Human   Naboo    
##  8 Cordé            157 Human   Naboo    
##  9 Dormé            165 Human   Naboo    
## 10 Padmé Amidala    165 Human   Naboo

# when using filter(), you can substitute a comma for `&` when specifying multiple criteria
identical(filter(df2, height >= 100 & homeworld == "Naboo"),
          
          filter(df2, height >= 100, homeworld == "Naboo"))

## [1] TRUE

# multiple or (`|`) clauses can instead be specified using the %in% infix operator
option1 <- filter(df2, homeworld == "Tatooine" | homeworld == "Naboo" | homeworld == "Alderaan")

option2 <- filter(filter(df2, homeworld %in% c("Tatooine", "Naboo", "Alderaan")))

identical(option1, option2)

## [1] TRUE

# since %in% also returns a logical vector, it can also be negated using `!`, 
# although they give different results if the data contain missing values (NA)

a <- !(df2$homeworld == "Tatooine" | df2$homeworld == "Naboo" | df2$homeworld == "Alderaan")

b <- !(df2$homeworld %in% c("Tatooine", "Naboo", "Alderaan"))

all.equal(a, b)

## [1] "'is.NA' value mismatch: 0 in current 10 in target"

b[is.na(a)] #these missing values were converted to TRUE by %in%

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# to avoid this and other problems many functions have with processing NA
# values, it is often best to either remove them or impute them, or use the
# na.rm argument for the function

x <- na.omit(df2$homeworld)

c <- !(x == "Tatooine" | x == "Naboo" | x == "Alderaan")

# negating a set of `or` clauses for equality is the same as multiple `and`
# clauses for inequality

d <- x != "Tatooine" & x != "Naboo" & x != "Alderaan"

e <- !(x %in% c("Tatooine", "Naboo", "Alderaan"))

all.equal(c, d, e)

## [1] TRUE

# key point: you can use logical comparisons and vectors for subsetting in R

4.1 Subset Rows using Indices with slice()

Unlike select(), filter() doesn’t allow the use of indices for subsetting. Instead, dplyr has another funciton called slice() for this purpose. slice() doesn’t require you to enclose multiple index values in c() either and it can also operate on indices within groups (see the section 7 below for an example).

slice(df2, 1:5, 9:14, 25:50)

## # A tibble: 37 x 4
##    name              height species homeworld
##    <chr>              <int> <chr>   <chr>    
##  1 Luke Skywalker       172 Human   Tatooine 
##  2 C-3PO                167 Droid   Tatooine 
##  3 R2-D2                 96 Droid   Naboo    
##  4 Darth Vader          202 Human   Tatooine 
##  5 Leia Organa          150 Human   Alderaan 
##  6 Biggs Darklighter    183 Human   Tatooine 
##  7 Obi-Wan Kenobi       182 Human   Stewjon  
##  8 Anakin Skywalker     188 Human   Tatooine 
##  9 Wilhuff Tarkin       180 Human   Eriadu   
## 10 Chewbacca            228 Wookiee Kashyyyk 
## # ... with 27 more rows

5 mutate()

mutate() lets you modify and/or create new columns in a data frame.

#Say for example that you wanted to calculate the BMI of each character in the starwars data frame

df2 <- select(df, name, height, mass) #extract columns of interest

df3 <- mutate(df2,
       height_in_m  = height/100, #convert height from cm to m, store in new column "height_in_m"
       bmi = mass/(height_in_m^2), #add a column called bmi for the calculated body mass index of each character
       bmi = signif(bmi, 1)) #round it to one significant digit/decimal place & assign back to the same name.
df3

## # A tibble: 87 x 5
##    name               height  mass height_in_m   bmi
##    <chr>               <int> <dbl>       <dbl> <dbl>
##  1 Luke Skywalker        172    77        1.72    30
##  2 C-3PO                 167    75        1.67    30
##  3 R2-D2                  96    32        0.96    30
##  4 Darth Vader           202   136        2.02    30
##  5 Leia Organa           150    49        1.5     20
##  6 Owen Lars             178   120        1.78    40
##  7 Beru Whitesun lars    165    75        1.65    30
##  8 R5-D4                  97    32        0.97    30
##  9 Biggs Darklighter     183    84        1.83    30
## 10 Obi-Wan Kenobi        182    77        1.82    20
## # ... with 77 more rows

# assigning a modified variable to the same name overwrites/modifies the column

# note in the above mutate statement I created a new variable then used it in the same function call. 
# this is possible because mutate() evaluates items from left to right

# the equivalent operations in base R would typically involve several steps
# requiring separate assignment of intermediate objects

df2$height_in_m <- df2$height/100

df2$bmi <- df2$mass/(df2$height_in_m^2)

df2$bmi <- signif(df2$bmi, 1)
  
all.equal(df2, df3)

## [1] TRUE

# As shown above, mutate lets you combine these steps into a single function call and doesn't
# require you to specify the dataframe$ component multiple times, you only have
# to supply the name of the data frame as the 1st argument.

# the transmute function (also in the dplyr package) is an alternative to mutate that
# only retains variables you specify (i.e. it combines select and mutate in a single step)
# You probably won't use it as often as mutate, but it can be helpful occassionally

transmute(df2,
       height_in_m  = height/100, #convert height from cm to m, store in new column "height_in_m"
       bmi = mass/(height_in_m^2), #add a column called bmi for the calculated body mass index of each character
       bmi = signif(bmi, 1))

## # A tibble: 87 x 2
##    height_in_m   bmi
##          <dbl> <dbl>
##  1        1.72    30
##  2        1.67    30
##  3        0.96    30
##  4        2.02    30
##  5        1.5     20
##  6        1.78    40
##  7        1.65    30
##  8        0.97    30
##  9        1.83    30
## 10        1.82    20
## # ... with 77 more rows

5.1 Recoding or Creating Indicator Variables using if_else(), case_when(), or recode()

A common use of mutate (or transmute) is to encode a new variable (or overwrite an existing one) based on values of an existing variable in a data frame using the recode(), if_else() or case_when() functions in the dplyr package.

if_else() is a condensed version of a simple if/else statement with only 2 outcomes: one if the condition specified as the 1st argument evaluates to TRUE and one if it evaluates to FALSE (i.e. no else if components). case_when() lets you use multiple if_else() calls together without having to nest them explicitly. General control flow operations using conditional statments (if, else, else if, etc.) may be covered in a later blog post… For now I recommend this section of the advanced R book for those who want to learn more about it before moving on.

# to create an indicator/dummy/2-level variable (e.g. 1/0, T/F) you can use
# if_else() (or ifelse()):

if_else(df$height > 180, #part 1. logical test
        "tall", #part 2. value if TRUE
        "short") #part 3. value if FALSE (should be same class as the value if TRUE)

##  [1] "short" "short" "short" "tall"  "short" "short" "short" "short" "tall" 
## [10] "tall"  "tall"  "short" "tall"  "short" "short" "short" "short" "short"
## [19] "short" "short" "tall"  "tall"  "tall"  "short" "short" "short" "short"
## [28] NA      "short" "short" "tall"  "tall"  "short" "tall"  "tall"  "tall" 
## [37] "tall"  "short" "short" "tall"  "short" "short" "short" "short" "short"
## [46] "short" "short" "tall"  "tall"  "tall"  "short" "tall"  "tall"  "tall" 
## [55] "tall"  "tall"  "tall"  "short" "tall"  "tall"  "short" "short" "short"
## [64] "tall"  "tall"  "tall"  "short" "tall"  "tall"  "tall"  "short" "short"
## [73] "short" "tall"  "tall"  "short" "tall"  "tall"  "tall"  "short" "tall" 
## [82] NA      NA      NA      NA      NA      "short"

# use mutate to add it to an existing data frame

mutate(select(df, name, height), #apply mutate to a subset of the data using a nested select call
       tall_or_short = if_else(height > 180, 
                              "tall", 
                              "short"))

## # A tibble: 87 x 3
##    name               height tall_or_short
##    <chr>               <int> <chr>        
##  1 Luke Skywalker        172 short        
##  2 C-3PO                 167 short        
##  3 R2-D2                  96 short        
##  4 Darth Vader           202 tall         
##  5 Leia Organa           150 short        
##  6 Owen Lars             178 short        
##  7 Beru Whitesun lars    165 short        
##  8 R5-D4                  97 short        
##  9 Biggs Darklighter     183 tall         
## 10 Obi-Wan Kenobi        182 tall         
## # ... with 77 more rows

# to specify more than 2 conditions/outputs you can use case_when():
mutate(select(df, name, height), 
       size_class = case_when(height > 200 ~ "tall", #logical_test ~ value_if_TRUE,
                              height < 200 &  height > 100 ~ "medium", 
                              height < 100 ~ "short")
       )

## # A tibble: 87 x 3
##    name               height size_class
##    <chr>               <int> <chr>     
##  1 Luke Skywalker        172 medium    
##  2 C-3PO                 167 medium    
##  3 R2-D2                  96 short     
##  4 Darth Vader           202 tall      
##  5 Leia Organa           150 medium    
##  6 Owen Lars             178 medium    
##  7 Beru Whitesun lars    165 medium    
##  8 R5-D4                  97 short     
##  9 Biggs Darklighter     183 medium    
## 10 Obi-Wan Kenobi        182 medium    
## # ... with 77 more rows

# you can recode a variable using recode():

recoded_using_recode <- mutate(select(df, name, hair_color),
       hair_color = recode(hair_color, #the variable you want to recode
                           "blond" = "BLOND",#old_value = new_value
                           "brown" = "BROWN") #you can recode any number of values this way
       #unspecified values are unaltered
       ) 

recoded_using_recode

## # A tibble: 87 x 2
##    name               hair_color   
##    <chr>              <chr>        
##  1 Luke Skywalker     BLOND        
##  2 C-3PO              <NA>         
##  3 R2-D2              <NA>         
##  4 Darth Vader        none         
##  5 Leia Organa        BROWN        
##  6 Owen Lars          brown, grey  
##  7 Beru Whitesun lars BROWN        
##  8 R5-D4              <NA>         
##  9 Biggs Darklighter  black        
## 10 Obi-Wan Kenobi     auburn, white
## # ... with 77 more rows

# unforunately, unlike other dplyr functions, recode()-ed values are specified as
# "old_value" = "new_value" instead of the more common "new" = "old" scheme used
# by select() and rename()

# an alternative would be to use case_when(), e.g.:

recoded_using_case_when <- mutate(select(df, name, hair_color),
       hair_color = case_when(hair_color == "blond" ~ "BLOND",
                              hair_color == "brown" ~ "BROWN", 
                              TRUE ~ as.character(hair_color)))

identical(recoded_using_recode, recoded_using_case_when)

## [1] TRUE

# This time we add in the "TRUE ~ as.character(hair_color)" part to signify that
# all other non-NA cases should be the existing value of hair color as a
# character string; otherwise non-matching values are coded as missing values (`NA`)

# note that recode, like most R functions, returns a copy of the data object, 
# that you have to assign to a name if you want to save it in the environment

#of course you can also recode variables in base R using the assingnment operator:

#`<-` (keyboard shortcut = )

df$hair_color[df$hair_color == "blond"] <- "BLOND"
df$hair_color[df$hair_color == "brown"] <- "BROWN"

#but this modifies the data frame in place, which isn't always what you want
#e.g. if you are just experimenting with a coding scheme and then decide you
#want to change it back you'll have to rerun earlier portions of your R script
#to recover the previous version

6 summarise()

summarise() makes it easy to turn a data frame into a smaller data frame of summary statistics.

summary1 <- summarise(df, #as for the other dplyr functions, the data source is specified as the 1st argument
                      n = n(), #n() is a special function for use in summarize that returns the number of values
                      mean_height = mean(height, na.rm = TRUE), #summary stat name in the output = function(column) in the input
                      median_mass = median(mass, na.rm = TRUE))

summary1

## # A tibble: 1 x 3
##       n mean_height median_mass
##   <int>       <dbl>       <dbl>
## 1    87        174.          79

#summarise() and summarize() do the same thing and just accommodate different user spelling preferences
summary2 <- summarize(df, #as for the other dplyr functions, the data source is specified as the 1st argument
                      n = n(), #n() is a special function for use in summarize that returns the number of values
                      mean_height = mean(height, na.rm = TRUE), #summary stat name in the output = function(column) in the input
                      median_mass = median(mass, na.rm = TRUE))


identical(summary1, summary2)

## [1] TRUE

7 group_by()

group_by() adds an explicit grouping structure to a data frame that can subsequently be used by other dplyr functions to perform operations within each level of the grouping variable or level combination of the grouping variables if you specify multiple grouping variables. This is particularly useful in conjunction with either summarize() or mutate(), enabling you to calculate summary statistics or transform variables separately for each of the groups defined by group_by().

#structure of the data before grouping
glimpse(df)

## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "BLOND", NA, NA, "none", "BROWN", "brown, grey", "BROWN", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

df3 <- select(df, name, species, height, mass)#1st. extract just the columns of interest

grouped_data <- group_by(df3, species) #group the data frame "df3" by the variable "species"

class(grouped_data) #new additional class = "grouped_df"

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

glimpse(grouped_data) #note that the structure now has a groups attribute

## Rows: 87
## Columns: 4
## Groups: species [38]
## $ name    <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organ~
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human",~
## $ height  <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228,~
## $ mass    <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, ~

slice(grouped_data, 1) #look at the 1st row of each group

## # A tibble: 38 x 4
## # Groups:   species [38]
##    name                  species   height  mass
##    <chr>                 <chr>      <int> <dbl>
##  1 Ratts Tyerell         Aleena        79    15
##  2 Dexter Jettster       Besalisk     198   102
##  3 Ki-Adi-Mundi          Cerean       198    82
##  4 Mas Amedda            Chagrian     196    NA
##  5 Zam Wesell            Clawdite     168    55
##  6 C-3PO                 Droid        167    75
##  7 Sebulba               Dug          112    40
##  8 Wicket Systri Warrick Ewok          88    20
##  9 Poggle the Lesser     Geonosian    183    80
## 10 Jar Jar Binks         Gungan       196    66
## # ... with 28 more rows

#next summarise calculates the specified summary statistics within species
summarise(grouped_data,
          n = n(), 
          mean_height = mean(height, na.rm = TRUE), 
          median_mass = median(mass, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 38 x 4
##    species       n mean_height median_mass
##    <chr>     <int>       <dbl>       <dbl>
##  1 Aleena        1         79         15  
##  2 Besalisk      1        198        102  
##  3 Cerean        1        198         82  
##  4 Chagrian      1        196         NA  
##  5 Clawdite      1        168         55  
##  6 Droid         6        131.        53.5
##  7 Dug           1        112         40  
##  8 Ewok          1         88         20  
##  9 Geonosian     1        183         80  
## 10 Gungan        3        209.        74  
## # ... with 28 more rows

#summarise and summarize do the same thing and just accommodate different user spelling preferences
identical(summary1, summary2)

## [1] TRUE

#to apply mutate() within each level of a grouping variable, simply use mutate
#on a grouped data frame:
mutated_grouped_data <- mutate(grouped_data, 
                               scaled_mass = scale(mass)) #convert raw masses into standardized masses (i.e. z-scores)

select(mutated_grouped_data, name, mass, scaled_mass) #just print the relevant parts

## Adding missing grouping variables: `species`

## # A tibble: 87 x 4
## # Groups:   species [38]
##    species name                mass scaled_mass[,1]
##    <chr>   <chr>              <dbl>           <dbl>
##  1 Human   Luke Skywalker        77         -0.298 
##  2 Droid   C-3PO                 75          0.103 
##  3 Droid   R2-D2                 32         -0.740 
##  4 Human   Darth Vader          136          2.75  
##  5 Human   Leia Organa           49         -1.74  
##  6 Human   Owen Lars            120          1.92  
##  7 Human   Beru Whitesun lars    75         -0.401 
##  8 Droid   R5-D4                 32         -0.740 
##  9 Human   Biggs Darklighter     84          0.0628
## 10 Human   Obi-Wan Kenobi        77         -0.298 
## # ... with 77 more rows

#note that this is done within species, and the results differ from the same
#operation applied to the ungrouped data
identical(
mutated_grouped_data$scaled_mass,
mutate(df3, scaled_mass = scale(mass))$scaled_mass,
)

## [1] FALSE

#it is recommended that you ungroup the data after you're finished mutating it
#if you want subsequent operations to be applied to the entire data frame
#instead of within groups
ungrouped_data <- ungroup(mutated_grouped_data)

glimpse(ungrouped_data) #notice that the groups attribute is no longer there

## Rows: 87
## Columns: 5
## $ name        <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O~
## $ species     <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum~
## $ height      <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, ~
## $ mass        <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77~
## $ scaled_mass <dbl[,1]> <matrix[26 x 1]>

#and grouped_df3 is no longer one of the data frame's classes
class(ungrouped_data)

## [1] "tbl_df"     "tbl"        "data.frame"

8 Chaining Functions with the pipe operator (%>%)

Thanks to the magrittr package, you can use the pipe operator, “%>%”, which allows you to apply a series of functions to an object using easily readable code without requiring you to save intermediate objects.

Note that you don’t have to load the magrittr package in a separate call to the library() function since the pipe operator is imported from magrittr automatically when you load the dplyr package.

The pipe operator takes the object on the left and passes it to the first argument of the function to the right. Functions in dplyr and other tidyverse packages take data as the 1st argument. Those function which output the same type of object that they accept as inputs (i.e. the data argument) can be chained using the “%>%” quite easily.

For example,

# %>% lets you turn this nested code (evaluated from the inside out)... 

arrange(summarize(group_by(select(starwars, species, mass), species), 
                  mean_mass = mean(mass, na.rm = T)), desc(mean_mass))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 38 x 2
##    species      mean_mass
##    <chr>            <dbl>
##  1 Hutt            1358  
##  2 Kaleesh          159  
##  3 Wookiee          124  
##  4 Trandoshan       113  
##  5 Besalisk         102  
##  6 Neimodian         90  
##  7 Kaminoan          88  
##  8 Nautolan          87  
##  9 Mon Calamari      83  
## 10 Human             82.8
## # ... with 28 more rows

# ...which could also be written more clearly like this:

arrange(#evaluated last
  
  summarize(#evaluated 3rd
    
    group_by(#evaluated 2nd
      
      select(#evaluated 1st
        
        starwars, species, mass), #1. from the starwars data frame, select() the columns species and mass
      
      species), #2. then use group_by() to group the data by species
    
    mean_mass = mean(mass, na.rm = T)),  #3. then use summarize() to calculate the mean() mass for each species
  
  desc(mean_mass)) #4. then arrange() the summary data frame by mean_mass in descending order

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 38 x 2
##    species      mean_mass
##    <chr>            <dbl>
##  1 Hutt            1358  
##  2 Kaleesh          159  
##  3 Wookiee          124  
##  4 Trandoshan       113  
##  5 Besalisk         102  
##  6 Neimodian         90  
##  7 Kaminoan          88  
##  8 Nautolan          87  
##  9 Mon Calamari      83  
## 10 Human             82.8
## # ... with 28 more rows

# ... into this tidy chain, which reads much more naturally

sorted_mass_by_sw_species_chained <- starwars %>% #take the starwars df  
  
  select(species, mass) %>% #pass it to the 1st agrument of select() (i.e. data), then extract species and mass
  
  group_by(species) %>% #output of select() becomes the input to the 1st argument of group_by(); group it by species
  
  summarise(mean_mass = mean(mass, na.rm = T)) %>% #summarise() the grouped data frame to get the mean mass per group
  
  arrange(desc(mean_mass)) # arrange the grouped summary by the mean mass column in descending order

## `summarise()` ungrouping output (override with `.groups` argument)

sorted_mass_by_sw_species_chained

## # A tibble: 38 x 2
##    species      mean_mass
##    <chr>            <dbl>
##  1 Hutt            1358  
##  2 Kaleesh          159  
##  3 Wookiee          124  
##  4 Trandoshan       113  
##  5 Besalisk         102  
##  6 Neimodian         90  
##  7 Kaminoan          88  
##  8 Nautolan          87  
##  9 Mon Calamari      83  
## 10 Human             82.8
## # ... with 28 more rows

# the order of operations using %>% is so much more obvious and produces code
# that can be read at a glance and doesn't require intermediate objects which
# each occupy memory, e.g.

starwars_sub <- select(starwars, 
                       species, mass) 

grouped_starwars_sub <- group_by(starwars_sub, 
                                 species) 

mass_by_sw_species <- summarise(grouped_starwars_sub, 
                                mean_mass = mean(mass, na.rm = T))

## `summarise()` ungrouping output (override with `.groups` argument)

sorted_mass_by_sw_species_unchained <- arrange(mass_by_sw_species, 
                                               desc(mean_mass)) 

# and then have to be deleted afterwards to reclaim the memory 
rm(starwars_sub)
rm(grouped_starwars_sub)
rm(mass_by_sw_species)

identical(sorted_mass_by_sw_species_chained, sorted_mass_by_sw_species_unchained) #the output is the same

## [1] TRUE

# note: the keyboard shortcut for %>% is ctrl + shift + M (at least on Windows machines)

We’ve seen that the default behaviour of %>% is to pass the output of the left hand side (LHS) to the 1st argument of the right hand side (RHS), i.e.

x %>% f(y) is equivalent to f(x, y)

This is great if you’re working with tidyverse functions that accept a data frame as the 1st argument, but what if you want to pass the LHS output to another argument or position in the function on the RHS (e.g. functions with data arguments in other positions)?

%>% enables you to do this by automatically assigning the LHS output to the special value “.”. To change the argument you want to assign the LHS output to, you just have to set that argument to “.”, e.g.

x %>% f(y) is equivalent to x %>% f(., y) and

y %>% f(x, .) is equivalent to f(x, y) and

z %>% f(x, y, .) is equivalent to f(x, y, z)

10 Notes

To explore dplyr and in more depth checkout chapter 5 of R for data science.
To learn more about subsetting using base R I highly recommend chapter 4 of the Advanced R book.
For more details on other data manipulation techniques in base R checkout the official R introductory manual.

Thank you for visiting my blog. I welcome any suggestions for future posts, comments or other feedback you might have. Feedback from beginners and science students/trainees (or with them in mind) is especially helpful in the interest of making this guide even better for them.

This blog is something I do as a volunteer in my free time. If you’ve found it helpful and want to give back, coffee donations would be appreciated.

R R basics

A Scientist's Guide to R: Step 2.1. Data Transformation - Part 1

A Scientist's Guide to R: Step 2.1. Data Transformation - Part 1

1 TL;DR

2 Introduction

3 select()

3.1 Renaming Columns with select() or rename()

4 filter()

4.1 Subset Rows using Indices with slice()

5 mutate()

5.1 Recoding or Creating Indicator Variables using if_else(), case_when(), or recode()

6 summarise()

7 group_by()

8 Chaining Functions with the pipe operator (%>%)

9 Navigation

10 Notes

Dr. Craig P. Hutton

Psychologist | Data Scientist

Related