A Scientist's Guide to R: Step 2.1 Data Transformation - part 2

A Scientist's Guide to R: Step 2.1 Data Transformation - part 2

1 TL;DR

In the 5th post of the Scientist’s Guide to R series we explore using the tidyr package to reshape data. You’ll learn all about splitting and combining columns and how to do wide to long or long to wide transformations.

2 Introduction

In the 5th post of the Scientist’s Guide to R series we explore using the tidyr package to reshape data. As you’ll see, this is really helpful if you are working with repeated-measures data, which is pretty common these days.

Specifically, we will be learning how to use 4 core functions of the tidyr package to:

  1. Perform “wide-to-long” transformations using pivot_longer(), converting multiple columns of the same measure into a single value column and associated key column.

  2. Spread a pair of value and key columns across multiple columns, AKA “long-to-wide” transformations, using pivot_wider().

  3. Combine multiple columns into a single column using unite().

  4. Perform the reverse operation to separate() a complex column into multiple simple columns.

Additional relevant functions that build on this foundation will also be covered where appropriate. If some of these terms don’t mean very much to you now don’t worry. Hopefully they will by the time you’re done reading this post!

In this post I will assume you know how to use the pipe operator (%>%), which I covered here.

Getting repeated-measures data ready for analysis will be no trouble at all once you’ve mastered these functions.

The examples below will make use of 2 datasets provided by R packages: the gapminder life expectancy data from the gapminder package and the OBrienKaiser data from the carData package. Since carData is automatically also loaded with the car package which we will be using later in the series for regression diagnostics (car = companion to applied regression), we’ll import the OBrienKaiser data indirectly via the car package.

These 2 datasets were chosen because they both contain repeated measurements. One, gapminder, is by default in the so-called “long” form. While the other, OBrienKaiser, is by default in the so-called “wide” form.

Recall that we load a package using the library() function, which imports any data structures and functions from the package and makes them available for use in the current R session.

library(gapminder) #load the gapminder package which mostly just contains the gapminder dataset

library(car) #load the car package with the associated OBrienKaiser data

Naturally the 1st thing you should do after importing any data is to examine its structure using either str() or dplyr::glimpse(), so that’s what we’ll do next. Data from packages also usually comes with helpful documentation you can view using ?data_object_name.

#since I prefer to use glimpse() over str() we'll also load the tidyverse
#package collection (glimpse is included with both tibble and dplyr) which also
#loads the tidyr package we will need later anyways. 
#For now this gives us access to both the pipe operator and the glimpse function.

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()
gapminder %>% glimpse
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
OBrienKaiser %>% glimpse
## Observations: 16
## Variables: 17
## $ treatment <fct> control, control, control, control, control, A, A, A, A, ...
## $ gender    <fct> M, M, M, F, F, M, M, F, F, M, M, M, F, F, F, F
## $ pre.1     <dbl> 1, 4, 5, 5, 3, 7, 5, 2, 3, 4, 3, 6, 5, 2, 2, 4
## $ pre.2     <dbl> 2, 4, 6, 4, 4, 8, 5, 3, 3, 4, 3, 7, 5, 2, 2, 5
## $ pre.3     <dbl> 4, 5, 5, 7, 6, 7, 6, 5, 4, 5, 4, 8, 6, 3, 3, 7
## $ pre.4     <dbl> 2, 3, 7, 5, 4, 9, 4, 3, 6, 3, 2, 6, 8, 1, 4, 5
## $ pre.5     <dbl> 1, 4, 7, 4, 3, 9, 5, 2, 4, 4, 3, 3, 6, 2, 4, 4
## $ post.1    <dbl> 3, 2, 4, 2, 6, 9, 7, 2, 4, 6, 5, 9, 4, 5, 6, 7
## $ post.2    <dbl> 2, 2, 5, 2, 7, 9, 7, 4, 5, 7, 4, 10, 6, 6, 6, 7
## $ post.3    <dbl> 5, 3, 7, 3, 8, 10, 8, 8, 6, 6, 7, 11, 6, 7, 7, 8
## $ post.4    <dbl> 3, 5, 5, 5, 6, 8, 10, 6, 4, 8, 5, 9, 8, 5, 9, 6
## $ post.5    <dbl> 2, 3, 4, 3, 3, 9, 8, 5, 1, 8, 4, 6, 6, 2, 7, 7
## $ fup.1     <dbl> 2, 4, 7, 4, 4, 9, 8, 6, 5, 8, 5, 8, 7, 6, 7, 7
## $ fup.2     <dbl> 3, 5, 6, 4, 3, 10, 9, 6, 4, 8, 6, 7, 7, 7, 7, 8
## $ fup.3     <dbl> 2, 6, 9, 5, 6, 11, 11, 7, 7, 9, 8, 10, 8, 8, 8, 10
## $ fup.4     <dbl> 4, 4, 7, 3, 4, 9, 9, 5, 5, 7, 6, 8, 10, 6, 6, 8
## $ fup.5     <dbl> 4, 1, 6, 4, 3, 6, 8, 6, 4, 8, 5, 7, 8, 3, 7, 7

When working with data from an add-on package it is often also a good idea to work on a copy of it rather than overwriting/modifying it under the original name. Using a copy for experimental tranformations is recommended in general in case the transformation doesn’t go quite as planned and you want to change it or recover the original data. To recover data provided in a package if you have accidentally modified it you’d have to remove the object from the global environment (deleting your custom copy of it but not the original version in the package) using the rm() function… which can be a bit confusing. Best to just work on a copy with a clearly different name.

The same recommendation applies to naming things in general. Try to use names that aren’t already assigned to other objects or functions. If you’re not sure if a name is in use or not, try to print it (just the name, enclosed in print() if you want) or look-up documentation for it with ?object or help(object).

#we'll save a copy of the gapminder data to gap_df (short for gapminder_dataframe)
gap_df <- gapminder

ob_df <- OBrienKaiser

#now we can do whatever we want to gap_df or ob_df without affecting gapminder or OBrienKaiser

#For the next section it helps to focus on just the names of the variables in each of these data frames
names(gap_df)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
names(ob_df)
##  [1] "treatment" "gender"    "pre.1"     "pre.2"     "pre.3"     "pre.4"    
##  [7] "pre.5"     "post.1"    "post.2"    "post.3"    "post.4"    "post.5"   
## [13] "fup.1"     "fup.2"     "fup.3"     "fup.4"     "fup.5"

If you haven’t see either of these datasets before I recommend reviewing the documentation for them, using:

?gapminder 

#and

?OBrienKaiser

Before reading on, try to answer the question: What is the main difference between the variable naming styles used by the authors of these two datasets?

2.1 “long” data, “wide” data, and “tidy” data

From the glimpses we requested, we can see that the gapminder data contains 1,704 observations (rows) and 6 variables (columns), while the OBrienKaiser data has only 16 rows but 17 columns. Related to the question I posed above, note that gapminder has unique names for each variable and that most of the columns in OBrienKaiser have generic names with a string.number format. Why is this?

gapminder is an example of “long” data because each row represents a unique observation and each column represents a unique variable. This is the form that data needs to be in to do most statistical analyses and generate visualizations in R. Because each unique variable is a single column, each column has a unique name.

OBrienKaiser is an example of “wide” data because each row represents a unique case or individual and all measures for these individuals are spread across multiple columns. Because several of the variables are spread across multiple columns, these have more generic names.

This is the form that data typically needs to be in to do statstical analyses in other programs like SPSS, so you’ll encounter it often if you’re importing data directly from SPSS or working on Excel files produced by people who primarily work in SPSS or just Excel itself.

The term “tidy” data refers to the long form arrangement of data with the additional criterion that each type of observational unit is in a single table rather than being split across multiple tables. Data cleaning and “tidy-ness” are so germane to real world data analysis and data science that the tidyverse was named after them. If you want to conduct an analysis using data that are spread across multiple data frames or tables, my next blog post will show you how to combine them using a set of functions called joins.

It won’t always be so obvious, but now you know that even the names of a data frame’s columns can tell you quite a bit about its structure.

3 pivot_longer()

To make data longer so that we have all values of the same measure in a single column we can use the aptly named pivot_longer(), which is a newer version of the tidyr gather() function. Arguments we need to specify are:

  1. data (or we can pipe the data in using the pipe operator)

  2. cols: the distributed columns we want to combine

  3. names_to: string name to use for the new character vector that keeps track of which columns the values originally came from, sometimes this is called the “key” column. Values of the “names_to” are by default the names of the combined columns

  4. values_to: string name to use for the new column that will contain all of the combined values

For example, this is how you could combine all of the distributed measurement columns from the ob_df data frame into a single tidy key-value column pair.

#first we'll add a column to keep track of the subject ID since the OBrienKaiser
#data doesn't already have one. you could do this either using dplyr's mutate() function,
#or using the shortcut function rowid_to_column, from the tidyverse package "tibble"

ob_df_v2 <- ob_df %>% 
  rowid_to_column("subject_id") #subject ID column values take on the row ID values
#by default this adds the column to the left side of the data frame (in the 1st position)

#alternative using mutate
ob_df_v2b <- ob_df %>% 
  mutate(subject_id = c(1:nrow(ob_df))) %>% #same end result
  # subject id becomes the numbers: [1 to the number of total rows]. 
  # By default, this adds the column to the end, so we will rearrange it using
  # select() with tidyselect helper everything()
  select(subject_id, everything()) #this selects the new subject_id column, then everything else

identical(ob_df_v2, ob_df_v2b) #check if the objects are identical
## [1] TRUE
#in this case both methods are equivalent so rowid_to_column() is preferred, but
#if you want to assign something else as subject ids, then mutate() would be
#better. Rearranging with select() is optional of course. I typically prefer to
#work with subject ID in the 1st position though

# next we make the data longer using pivot_longer()

ob_long <- pivot_longer(data = ob_df_v2, 
             cols = pre.1:fup.5, 
             names_to = "session",
             values_to = "score"
             )

ob_long %>% glimpse 
## Observations: 240
## Variables: 5
## $ subject_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,...
## $ treatment  <fct> control, control, control, control, control, control, co...
## $ gender     <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ session    <chr> "pre.1", "pre.2", "pre.3", "pre.4", "pre.5", "post.1", "...
## $ score      <dbl> 1, 2, 4, 2, 1, 3, 2, 5, 3, 2, 2, 3, 2, 4, 4, 4, 4, 5, 3,...
#now we have a total of 256 observations for our 16 subjects/cases, score
#measurements are all in a single column that we've called "score", and the data
#collection session is coded using the session variable (the values of which
#used to be the names of the original columns).

As you can see, pooling a single measure is pretty easy, but what if we want to assign multiple sets of measures to different columns, like the pre, post, and fup components of the ob_df_v2 data frame (e.g. if these represented different types of measurements on distinct scales like height, weight, and temperature). This requires an extra step and adjusting our arguments a bit, but pivot_longer() makes this pretty easy as well.

The main requirements are:

  1. A consistent and unique string pattern that can be used to determine which columns of values belong together and what to use as the values of the key colunn

  2. using the special “.value” string as part of the names_to argument and the names_sep argument instead of using values_to

In the lucky event that your column names already have a consistent pattern in them, then you can just make use of that pattern and skip to step 2. Otherwise it may be a good idea to create one using either the dplyr::rename() function (if you only have to modify a few names), or using some string manipulation functions from the stringr tidyverse package. This section will serve as a preview of the sort of monotony that stringr and regular expressions can free you from, which will be expanded upon in a future post covering stringr.

Applied to our ob_df_v2 dataset:

# step 1. consistent pattern across the set of value columns ----
 
names(ob_df_v2)
##  [1] "subject_id" "treatment"  "gender"     "pre.1"      "pre.2"     
##  [6] "pre.3"      "pre.4"      "pre.5"      "post.1"     "post.2"    
## [11] "post.3"     "post.4"     "post.5"     "fup.1"      "fup.2"     
## [16] "fup.3"      "fup.4"      "fup.5"
# Looking at the names again, we have a mere period "." as a consistent pattern.
# while you could use this to flag the columns you want to pivot, something to
# consider is that "." is also a special character used in regular expressions
# (a string pattern matching system) to represent "any character", so we would
# have to preface it with one or more of another special character combination,
# in this case "\\", to "escape it" for string pattern matching functions to
# interpret it as a literal "." instead of "any character". This is usually done
# as part of a regular expression made up of combinations of special characters.

# Don't worry if this strange business of escaping characters and regular
# expressions seems confusing at this point, we'll learn about it in much more
# detail in a later post.

# For now, you could subset the data and rename the columns, e.g.

ob_df_v2a <- ob_df_v2 %>%
  rename(pretest1 = pre.1, #rename("new_name" = "old_name")
         pretest2 = pre.2,
         pretest3 = pre.3,
         pretest4 = pre.4,
         pretest5 = pre.5,
         posttest1 = post.1, 
         posttest2 = post.2,
         posttest3 = post.3,
         posttest4 = post.4,
         posttest5 = post.5,
         fuptest1 = fup.1,
         fuptest2 = fup.2,
         fuptest3 = fup.3,
         fuptest4 = fup.4,
         fuptest5 = fup.5)

ob_df_v2a
##    subject_id treatment gender pretest1 pretest2 pretest3 pretest4 pretest5
## 1           1   control      M        1        2        4        2        1
## 2           2   control      M        4        4        5        3        4
## 3           3   control      M        5        6        5        7        7
## 4           4   control      F        5        4        7        5        4
## 5           5   control      F        3        4        6        4        3
## 6           6         A      M        7        8        7        9        9
## 7           7         A      M        5        5        6        4        5
## 8           8         A      F        2        3        5        3        2
## 9           9         A      F        3        3        4        6        4
## 10         10         B      M        4        4        5        3        4
## 11         11         B      M        3        3        4        2        3
## 12         12         B      M        6        7        8        6        3
## 13         13         B      F        5        5        6        8        6
## 14         14         B      F        2        2        3        1        2
## 15         15         B      F        2        2        3        4        4
## 16         16         B      F        4        5        7        5        4
##    posttest1 posttest2 posttest3 posttest4 posttest5 fuptest1 fuptest2 fuptest3
## 1          3         2         5         3         2        2        3        2
## 2          2         2         3         5         3        4        5        6
## 3          4         5         7         5         4        7        6        9
## 4          2         2         3         5         3        4        4        5
## 5          6         7         8         6         3        4        3        6
## 6          9         9        10         8         9        9       10       11
## 7          7         7         8        10         8        8        9       11
## 8          2         4         8         6         5        6        6        7
## 9          4         5         6         4         1        5        4        7
## 10         6         7         6         8         8        8        8        9
## 11         5         4         7         5         4        5        6        8
## 12         9        10        11         9         6        8        7       10
## 13         4         6         6         8         6        7        7        8
## 14         5         6         7         5         2        6        7        8
## 15         6         6         7         9         7        7        7        8
## 16         7         7         8         6         7        7        8       10
##    fuptest4 fuptest5
## 1         4        4
## 2         4        1
## 3         7        6
## 4         3        4
## 5         4        3
## 6         9        6
## 7         9        8
## 8         5        6
## 9         5        4
## 10        7        8
## 11        6        5
## 12        8        7
## 13       10        8
## 14        6        3
## 15        6        7
## 16        8        7
# However, I'm sure you'll appreciate how tedious this could get for more than a
# handful of columns. A better solution is to use a combination of the names()
# and stringr::str_replace_all() functions

ob_df_v2b <- ob_df_v2 #create a copy (optional)

names(ob_df_v2b) <- str_replace_all( #use stringr::str_replace_all() to replace the names
                                    
                                    names(ob_df_v2b), #character vector to replace matching values in
                                    
                                    "pre.", #string component pattern to be matched
                                    
                                    #here the "." doesn't matter because I've also
                                    #specified the rest of the start of the
                                    #string up to the "."
                                    
                                    "pretest" #string to replace it with
                                    ) 

names(ob_df_v2b) <- str_replace_all(names(ob_df_v2b), "post.", "posttest")

names(ob_df_v2b) <- str_replace_all(names(ob_df_v2b), "fup.", "fuptest") 

identical(ob_df_v2a, ob_df_v2b)
## [1] TRUE
#a bit better... but still inefficient. The obvious third option is to just
#escape the "." using "\\.":

ob_df_v2c <- ob_df_v2 
names(ob_df_v2c) <- str_replace_all(names(ob_df_v2), 
                                    "\\.", #anything with a name that contains a period
                                    "test")

identical(ob_df_v2a, ob_df_v2c)
## [1] TRUE
# Fortunately pivot_longer() incorporates the same string matching/detection
# behaviour of str_replace_all(), so you can actually just pass "\\." directly
# to the cols and names_sep arguments, as in the 2nd and 3rd examples for step 2
# below...

# step 2. use a string patten to pivot columns ----

#with one of the renamed versions of ob_df_v2
ob_long2 <- ob_df_v2c %>%
  pivot_longer(cols = contains("test"), #select columns with names containing a pattern
               
               names_to = c(".value", 
                            #special string that tells pivot_longer to use names
                            #on each side of the separator pattern

                            "session"), #name of the key column to use for the names
               
               names_sep = "test" #separator to use
               )

ob_long2
## # A tibble: 80 x 7
##    subject_id treatment gender session   pre  post   fup
##         <int> <fct>     <fct>  <chr>   <dbl> <dbl> <dbl>
##  1          1 control   M      1           1     3     2
##  2          1 control   M      2           2     2     3
##  3          1 control   M      3           4     5     2
##  4          1 control   M      4           2     3     4
##  5          1 control   M      5           1     2     4
##  6          2 control   M      1           4     2     4
##  7          2 control   M      2           4     2     5
##  8          2 control   M      3           5     3     6
##  9          2 control   M      4           3     5     4
## 10          2 control   M      5           4     3     1
## # ... with 70 more rows
#directly with a regular expression using the matches() tidyselect helper function
ob_long3 <- ob_df_v2 %>%
  pivot_longer(cols = matches("\\."), #select columns with names matching a regular expression pattern
               names_to = c(".value", "session"), #special ".value" indictor and name of the key column to use for the names
               names_sep = "\\.") #separator to use
               

ob_long3
## # A tibble: 80 x 7
##    subject_id treatment gender session   pre  post   fup
##         <int> <fct>     <fct>  <chr>   <dbl> <dbl> <dbl>
##  1          1 control   M      1           1     3     2
##  2          1 control   M      2           2     2     3
##  3          1 control   M      3           4     5     2
##  4          1 control   M      4           2     3     4
##  5          1 control   M      5           1     2     4
##  6          2 control   M      1           4     2     4
##  7          2 control   M      2           4     2     5
##  8          2 control   M      3           5     3     6
##  9          2 control   M      4           3     5     4
## 10          2 control   M      5           4     3     1
## # ... with 70 more rows
identical(ob_long2, ob_long3)
## [1] TRUE
#directly using the even easier contains() tidyselect helper function that
#absolves you of the need to worry about special characters/regular expressions

ob_long4 <- ob_df_v2 %>%
  pivot_longer(cols = contains("."), #select columns with names containing a literal string pattern
               
               names_to = c(".value", "session"), #special ".value" indictor and name of the key column to use for the names
               
               names_sep = "\\." #separator to use. 
               #names_sep unfortunately requires the escaping of special
               #characters if you want them matched literally. Unfortunately
               #contains() doesnt work in the names_sep argument.
               )

identical(ob_long3, ob_long4)
## [1] TRUE

Now you know something about making data longer as well as few helpful bits about using string patterns to change variable names.

The preceeding use case represented a situation when you have multiple value columns that you want to aggregate separately, yielding one value column for each separate set. This could occur if data on multiple measures (e.g. height & weight) were collected from the same subjects on multiple occassions. An alternative situation you might encounter (albeit probably less often) occurs when the values of a single measure of interest are spread across numerous columns with complex names consisting of components that specify the measurement group in combination. The simplest example of this might be something like a column naming format for daily temperature values of “d2019-01-23”, “d2019-02-01”, etc., being used to represent dates and you are interested in splitting the key column into its year, month, and day components (yyyy-mm-dd) and putting the values all in a single “temperature” column.

Doing this with pivot_longer() requires making use of the names_pattern argument (instead of names_sep) with a more complex regular expression that determines where to break up all of the name component. This is accomplished by enclosing each component with parentheses, e.g. 

names_pattern = "(component_1)(component_2)(component_3)"

To demonstrate this one we’ll use the WHO tuberculosis report data from the tidyr package to replicate the example of this use case from the pivot_longer() documentation, with minor modifications to simplify it a bit. This dataset is automatically imported when the tidyr package is loaded so you can use it by calling the name “who”

#?who #uncomment and run this to check the documentation for the who dataset

names(who) #check the names of the columns
##  [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
##  [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
## [11] "new_sp_m65"   "new_sp_f014"  "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
## [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
## [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"  
## [26] "new_sn_f014"  "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
## [31] "new_sn_f5564" "new_sn_f65"   "new_ep_m014"  "new_ep_m1524" "new_ep_m2534"
## [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
## [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
## [46] "new_ep_f65"   "newrel_m014"  "newrel_m1524" "newrel_m2534" "newrel_m3544"
## [51] "newrel_m4554" "newrel_m5564" "newrel_m65"   "newrel_f014"  "newrel_f1524"
## [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

Note from the documentation that many of the columns have a name with 3 components: 2 separated by an underscore and an “m” or “f” male/female indicator. These components represent three different variables that collectively specify the type of case and the values of these columns are all counts of tuberculosis cases. According to the documentation for the who data, the newrel* columns are part of this scheme as well, but they only have a single underscore.

# we can standardize the newrel* column names to match the structure of the
# other new* columns using str_replace_all() to simplify subsequent pattern
# matching in pivot_longer()

who2 <- who

names(who2) <- str_replace_all(names(who2), "newrel", "new_rel")


who_long <- who2 %>% pivot_longer(
  
  cols = new_sp_m014:new_rel_f65,
  
  names_to = c("diagnosis", "gender", "age"),
  
  names_pattern = "new_(.*)_(.)(.*)", 
  # pattern to be matched in this case is 
  # "new_(component 1)_(component2[m/f])_(component3)", 
  # where component 1 is everything between the 1st and 2nd underscore,
  # component 2 is the lowercase letter "m" or "f", and component 3 is
  # everything after component 2.
  # 
  # note: "*" is used to represent the repetition of the preceeding chraracter
  # (in this case "." for "any character") any number of times in a regular
  # expresion
  
  values_to = "count")

who_long %>% 
  arrange(desc(count)) # arrange by count in descending order
## # A tibble: 405,440 x 8
##    country iso2  iso3   year diagnosis gender age    count
##    <chr>   <chr> <chr> <int> <chr>     <chr>  <chr>  <int>
##  1 India   IN    IND    2007 sn        m      3544  250051
##  2 India   IN    IND    2007 sn        f      3544  148811
##  3 China   CN    CHN    2013 rel       m      65    124476
##  4 China   CN    CHN    2013 rel       m      5564  112558
##  5 India   IN    IND    2007 ep        m      3544  105825
##  6 India   IN    IND    2007 ep        f      3544  101015
##  7 China   CN    CHN    2013 rel       m      4554  100297
##  8 India   IN    IND    2009 sp        m      3544   90830
##  9 India   IN    IND    2008 sp        m      3544   90498
## 10 India   IN    IND    2010 sp        m      3544   90440
## # ... with 405,430 more rows
# now all of the values (i.e. counts) are in a single column and our distinct
# grouping variables are specified using a tidy set of 3 simple key columns
# instead of one complicated one

# see the documentation for the who data if you're wondering why some values of
# age are so high

4 pivot_wider()

pivot_wider() makes a data frame wider by spliting one or more value columns into a larger number of value columns, with names assigned according to the levels of key columns specified using the names_from argument. Here we’ll use it to make the gapminder data wider by putting the annual life expectancy column, “lifeExp” into separate columns for each year (named using the values of the key column “year”).

Arguments to specify for the simple case of spreading a single key-value column pair are:

  1. data

  2. id_cols: the columns that together uniquely identify each observation. by default this is all columns

  3. names_from: the key column(s) to pull the names from for the new set of replacement columns

  4. values_from: the value column(s) to pull values from for the new set of replacement columns

You might want to do this as a temporary transformation to facilitate the construction of new variables derived from a combination of the more widely distributed value columns that result from the application of pivot_wider() with mutate(), or if you plan to export the data in wide from to share with someone working in other statistical software programs which require wide data.

gap_df %>% 
    pivot_wider(id_cols = c(country, continent),  
                names_from = year, values_from = lifeExp)
## # A tibble: 142 x 14
##    country continent `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
##    <fct>   <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghan~ Asia        28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8
##  2 Albania Europe      55.2   59.3   64.8   66.2   67.7   68.9   70.4   72  
##  3 Algeria Africa      43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8
##  4 Angola  Africa      30.0   32.0   34     36.0   37.9   39.5   39.9   39.9
##  5 Argent~ Americas    62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8
##  6 Austra~ Oceania     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3
##  7 Austria Europe      66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9
##  8 Bahrain Asia        50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8
##  9 Bangla~ Asia        37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8
## 10 Belgium Europe      68     69.2   70.2   70.9   71.4   72.8   73.9   75.4
## # ... with 132 more rows, and 4 more variables: `1992` <dbl>, `1997` <dbl>,
## #   `2002` <dbl>, `2007` <dbl>
#now all the lifeExp values for each year are in separate columns

We can also reverse the first ob_df_v2 lengthening transformation we did by using the names_sep argument again. However, you should be aware that names_sep behaves differently for pivot_wider() than it did for pivot_longer(). Specifically, since we’re adding “.” to the names in pivot_wider() instead of searching for it (as we were in pivot_longer()), the names_sep argument text is read literally for pivot_wider() rather than read like a regular expression for pattern matching like it was with pivor_longer(). In most cases you can use the general rules:

adding/inserting text = “literal text” where you don’t need to worry about special characters

pattern searching/matching = “regular expresison” where you do need to worry about special characters

Some functions behave differently, but we’ll reserve that discussion for another post.

ob_long %>% glimpse
## Observations: 240
## Variables: 5
## $ subject_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,...
## $ treatment  <fct> control, control, control, control, control, control, co...
## $ gender     <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ session    <chr> "pre.1", "pre.2", "pre.3", "pre.4", "pre.5", "post.1", "...
## $ score      <dbl> 1, 2, 4, 2, 1, 3, 2, 5, 3, 2, 2, 3, 2, 4, 4, 4, 4, 5, 3,...
ob_wide_again <- pivot_wider(data = ob_long, 
                             names_from = "session",
                             values_from = "score",
                             names_sep = ".")
                            

ob_wide_again %>% glimpse
## Observations: 16
## Variables: 18
## $ subject_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ treatment  <fct> control, control, control, control, control, A, A, A, A,...
## $ gender     <fct> M, M, M, F, F, M, M, F, F, M, M, M, F, F, F, F
## $ pre.1      <dbl> 1, 4, 5, 5, 3, 7, 5, 2, 3, 4, 3, 6, 5, 2, 2, 4
## $ pre.2      <dbl> 2, 4, 6, 4, 4, 8, 5, 3, 3, 4, 3, 7, 5, 2, 2, 5
## $ pre.3      <dbl> 4, 5, 5, 7, 6, 7, 6, 5, 4, 5, 4, 8, 6, 3, 3, 7
## $ pre.4      <dbl> 2, 3, 7, 5, 4, 9, 4, 3, 6, 3, 2, 6, 8, 1, 4, 5
## $ pre.5      <dbl> 1, 4, 7, 4, 3, 9, 5, 2, 4, 4, 3, 3, 6, 2, 4, 4
## $ post.1     <dbl> 3, 2, 4, 2, 6, 9, 7, 2, 4, 6, 5, 9, 4, 5, 6, 7
## $ post.2     <dbl> 2, 2, 5, 2, 7, 9, 7, 4, 5, 7, 4, 10, 6, 6, 6, 7
## $ post.3     <dbl> 5, 3, 7, 3, 8, 10, 8, 8, 6, 6, 7, 11, 6, 7, 7, 8
## $ post.4     <dbl> 3, 5, 5, 5, 6, 8, 10, 6, 4, 8, 5, 9, 8, 5, 9, 6
## $ post.5     <dbl> 2, 3, 4, 3, 3, 9, 8, 5, 1, 8, 4, 6, 6, 2, 7, 7
## $ fup.1      <dbl> 2, 4, 7, 4, 4, 9, 8, 6, 5, 8, 5, 8, 7, 6, 7, 7
## $ fup.2      <dbl> 3, 5, 6, 4, 3, 10, 9, 6, 4, 8, 6, 7, 7, 7, 7, 8
## $ fup.3      <dbl> 2, 6, 9, 5, 6, 11, 11, 7, 7, 9, 8, 10, 8, 8, 8, 10
## $ fup.4      <dbl> 4, 4, 7, 3, 4, 9, 9, 5, 5, 7, 6, 8, 10, 6, 6, 8
## $ fup.5      <dbl> 4, 1, 6, 4, 3, 6, 8, 6, 4, 8, 5, 7, 8, 3, 7, 7
all.equal(ob_df_v2, ob_wide_again, check.attributes = F) 
## [1] TRUE
# In case you're wondering why I didn't use identical()... all.equal with
# check.atributes = F checks equality of the contents but not attributes of the
# objects. In this case ob_df_v2 is not a tibble (it's just a data frame) but
# ob_wide_again is a tibble and a data frame, so the "class" attributes of the
# objects are different, but the contents are the same.

…or the 2nd transformation with multiple distinct value columns… just specify them all as a character vector in the values_from argument

ob_long2 %>% glimpse
## Observations: 80
## Variables: 7
## $ subject_id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4,...
## $ treatment  <fct> control, control, control, control, control, control, co...
## $ gender     <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, F, F, F, F,...
## $ session    <chr> "1", "2", "3", "4", "5", "1", "2", "3", "4", "5", "1", "...
## $ pre        <dbl> 1, 2, 4, 2, 1, 4, 4, 5, 3, 4, 5, 6, 5, 7, 7, 5, 4, 7, 5,...
## $ post       <dbl> 3, 2, 5, 3, 2, 2, 2, 3, 5, 3, 4, 5, 7, 5, 4, 2, 2, 3, 5,...
## $ fup        <dbl> 2, 3, 2, 4, 4, 4, 5, 6, 4, 1, 7, 6, 9, 7, 6, 4, 4, 5, 3,...
ob_wide_again2 <- ob_long2 %>% 
  pivot_wider(names_from = "session",
              values_from = c("pre", "post", "fup"),
              names_sep = ".")

all.equal(ob_df_v2, ob_wide_again2, check.attributes = F)
## [1] TRUE

…or the who2 dataset lengthening transformation by passing a vector with the key column names to the names_from argument and the original starting “new_” portion to the names_prefix argument

who_long %>% glimpse
## Observations: 405,440
## Variables: 8
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan...
## $ iso2      <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF...
## $ iso3      <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "...
## $ year      <int> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 198...
## $ diagnosis <chr> "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp", "sp...
## $ gender    <chr> "m", "m", "m", "m", "m", "m", "m", "f", "f", "f", "f", "f...
## $ age       <chr> "014", "1524", "2534", "3544", "4554", "5564", "65", "014...
## $ count     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
who2_wide_again <- who_long %>% 
  pivot_wider(names_from = c("diagnosis", "gender", "age"),
              values_from = c("count"),
              names_prefix = "new_",
              names_sep = c("_")
              )

#remove the extra underscore that was not present between the 2nd and 3rd
#components of the original names

names(who2_wide_again) <- str_replace_all(names(who2_wide_again), "m_", "m")
names(who2_wide_again) <- str_replace_all(names(who2_wide_again), "f_", "f")

all.equal(who2, who2_wide_again, check.attributes = F)
## [1] TRUE

Next we move on to unite() and separate() which are specialized tidyr functions for selectively reshaping sets of columns without altering the rest of the manipulated data frame.

5 unite()

unite() allows you to merge 2 or more columns into a single column without changing the length of the data frame. You typically specify the following arguments:

  1. data

  2. col: the name of the new column containing the values of the merged columns

  3. : a selection of columns to unite, or if unspecified, all columns will be united. Follows similar rules to the … argument of the select() function.

  4. sep: string to use to demarcate the component values of the united column

  5. remove: set this to FALSE if you want to retain the original columns, leave it as the default value of TRUE if you want to drop them.

For example, we could combine the country and continent columns of the gapminder data frame copy gap_df:

#drop the original columns
united_gap <- gap_df %>% 
  unite(col = "country_continent",
        country, continent,
        sep = ";") #the default separator is "_"
united_gap
## # A tibble: 1,704 x 5
##    country_continent  year lifeExp      pop gdpPercap
##    <chr>             <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan;Asia   1952    28.8  8425333      779.
##  2 Afghanistan;Asia   1957    30.3  9240934      821.
##  3 Afghanistan;Asia   1962    32.0 10267083      853.
##  4 Afghanistan;Asia   1967    34.0 11537966      836.
##  5 Afghanistan;Asia   1972    36.1 13079460      740.
##  6 Afghanistan;Asia   1977    38.4 14880372      786.
##  7 Afghanistan;Asia   1982    39.9 12881816      978.
##  8 Afghanistan;Asia   1987    40.8 13867957      852.
##  9 Afghanistan;Asia   1992    41.7 16317921      649.
## 10 Afghanistan;Asia   1997    41.8 22227415      635.
## # ... with 1,694 more rows
#keep them
gap_df %>% 
  unite(col = "country_continent",
        country, continent,
        sep = ";",
        remove = FALSE) 
## # A tibble: 1,704 x 7
##    country_continent country     continent  year lifeExp      pop gdpPercap
##    <chr>             <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan;Asia  Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan;Asia  Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan;Asia  Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan;Asia  Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan;Asia  Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan;Asia  Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan;Asia  Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan;Asia  Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan;Asia  Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan;Asia  Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

6 separate()

separate() does the opposite of unite(); it splits the values of a column and distributes them across multiple columns based on a separator string pattern. It also does not affect the length of the data frame. Arguments you’ll use most often are:

  1. data

  2. col: the name of the column you want to separate into pieces

  3. into: a vector of names for the new columns. NA can be used to omit component columns

  4. sep: the string pattern to use to determine where to split the values. sep also accepts an integer for position-based splitting

  5. remove: drops the original complex column

  6. convert: if TRUE, re-evaluates and assigns appropriate classes to the new columns based on the type of their contents after splitting has occurred.

Here we’ll use separate() to reverse what we just did with unite()

gap_df_separated <- united_gap %>% 
  separate(col = "country_continent",
           into = c("country", "continent"),
           sep = ";") %>% 
  #one thing separate() won't do is convert strings to factors for you (the
  #original variables were factors), so we'll take of that using dplyr::mutate()
  mutate(country = as.factor(country), 
         continent = as.factor(continent))

gap_df_separated
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
identical(gap_df, gap_df_separated)
## [1] TRUE
#keep the complex column

united_gap %>% 
  separate(col = "country_continent",
           into = c("country", "continent"),
           sep = ";",
           remove = FALSE) 
## # A tibble: 1,704 x 7
##    country_continent country     continent  year lifeExp      pop gdpPercap
##    <chr>             <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan;Asia  Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan;Asia  Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan;Asia  Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan;Asia  Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan;Asia  Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan;Asia  Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan;Asia  Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan;Asia  Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan;Asia  Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan;Asia  Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
#drop one of the new columns using NA

united_gap %>% 
  separate(col = "country_continent",
           into = c("country", NA),
           sep = ";") 
## # A tibble: 1,704 x 5
##    country      year lifeExp      pop gdpPercap
##    <chr>       <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan  1952    28.8  8425333      779.
##  2 Afghanistan  1957    30.3  9240934      821.
##  3 Afghanistan  1962    32.0 10267083      853.
##  4 Afghanistan  1967    34.0 11537966      836.
##  5 Afghanistan  1972    36.1 13079460      740.
##  6 Afghanistan  1977    38.4 14880372      786.
##  7 Afghanistan  1982    39.9 12881816      978.
##  8 Afghanistan  1987    40.8 13867957      852.
##  9 Afghanistan  1992    41.7 16317921      649.
## 10 Afghanistan  1997    41.8 22227415      635.
## # ... with 1,694 more rows

Now just practice these operations a few times with your own data and reshaping data to make it tidyr (pun intended) should be a breeze!

8 Notes

  • You can learn more about tidy data here and here.

  • If you want to get head start on learning about regular expressions and string manipulation, check this out.

Thank you for visiting my blog. I welcome any suggestions for future posts, comments or other feedback you might have. Feedback from beginners and science students/trainees (or with them in mind) is especially helpful in the interest of making this guide even better for them.

Avatar
Dr. Craig P. Hutton
Data Scientist | Behavioural Neuroscientist

Related