Level 2: Data Manipulation

1. Introduction to basic data manipulation (mutate, select, filter, summarize)
2. Applying dplyr to real data
3. Intermediate data manipulation (change var class, lapply, composite scores, conditional mutate)
4. Restructuring datasets (wide to long, long to wide)
5. Using correlations to inspect your data
BONUS: Writing your own functions

Load Packages

This code loads the packages you need using the library function, but will install any packages that have not yet been installed prior to loading them.

p_needed <- c("tidyverse",
              "dplyr", 
              "haven",
              "tidyr",
              "gapminder",
              "purrr",
              "skimr")

packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]

if(length(p_to_install) > 0){
  install.packages(p_to_install)
}

lapply(p_needed, library, character.only = TRUE)

Examine Data

# extract variable names
colnames(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
skim(gapminder)
Data summary
Name gapminder
Number of rows 1704
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1 FALSE 142 Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent 0 1 FALSE 5 Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1979.50 17.27 1952.00 1965.75 1979.50 1993.25 2007.0 ▇▅▅▅▇
lifeExp 0 1 59.47 12.92 23.60 48.20 60.71 70.85 82.6 ▁▆▇▇▇
pop 0 1 29601212.32 106157896.74 60011.00 2793664.00 7023595.50 19585221.75 1318683096.0 ▇▁▁▁▁
gdpPercap 0 1 7215.33 9857.45 241.17 1202.06 3531.85 9325.46 113523.1 ▇▁▁▁▁
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Variable Class

Before you run any analyses, you’ll need to make sure the variable class is correct (e.g. factor vs numeric vs character).

Variable Classes

1. Integers are whole numbers without decimal points.

2. Numeric/double is a general class that includes both 
  decimals (floating-point numbers) and integers. 

3. Factor represents categorical data. 
  They can store both string and integer values.

4. Character represents a string in R. 

5. Logical represents boolean values, with 
  TRUE, FALSE, and NA as possible values.
# View class of each variable
lapply(gapminder, class)
## $country
## [1] "factor"
## 
## $continent
## [1] "factor"
## 
## $year
## [1] "integer"
## 
## $lifeExp
## [1] "numeric"
## 
## $pop
## [1] "integer"
## 
## $gdpPercap
## [1] "numeric"
# Factor = Nominal in SPSS

Intro to Dplyr

Select, Filter, Mutate, Arrange, Transmute

Select allows us to easily subset our data by only selecting the desired variables.

# new_df <-
gapminder %>%
  dplyr::select (country, lifeExp)

Filter, just like the filter function in SPSS, allows us to work with the cases that meet certain criteria

# filtered_df <- # Saves in global environment
gapminder %>%
  filter(continent == "Africa", year > 1990) %>%
  head()

SPSS Syntax: Filtering

DATASET COPY  filtered_df.
DATASET ACTIVATE  filtered_df.
FILTER OFF.
USE ALL.
SELECT IF ((continent = 'Africa')  & (year > 1990)).
EXECUTE.

Mutate creates new variables or changes existing variables (if you do not provide a new name). The SPSS equivalent is COMPUTE VARIABLE.

gapminder %>%
  mutate(log.gdp = log(gdpPercap)) %>%
  head()

SPSS Syntax: Compute variable

COMPUTE diff.gdp=gdpPercap - mean(gdpPercap).
EXECUTE.
gapminder %>%
  mutate(diff.gdp = gdpPercap - mean(gdpPercap, na.rm = TRUE)) %>%
  head()

Transmute is a mix between select and mutate

gapminder %>%
  transmute (country, population_divided = pop/2)

Summarize (or summarise) calculates a single value per group. Useful for aggregating data.

gapminder %>%
  group_by(year) %>%
  dplyr::summarize(mean_gdp = mean(gdpPercap)) %>%
  arrange(desc(year))

Other functions you may want to use with mutate and summarize:

1. sum (x, na.rm = TRUE)
2. max (x, na.rm = TRUE)
3. min (x, na.rm = TRUE)
4. n_distinct (x) <- count number of unique instances

Dplyr cheatsheet

Challenge!

  1. Create a dataframe that contains the average life expectancy for each continent.
gapminder %>% 
  group_by(continent) %>% 
  summarise(mean_life = mean(lifeExp))
gapminder %>% 
  group_by(continent) %>% 
  dplyr::select(lifeExp)
## Adding missing grouping variables: `continent`
  1. Generate a dataframe that shows the total population per year for each country in Europe.
gapminder %>% 
  filter(continent == "Europe") %>% 
  group_by(year, country) %>% 
  summarise(pop)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
gapminder %>% 
  filter(continent == "Europe") %>% 
  group_by(country, year) %>% 
  dplyr::select(pop)
## Adding missing grouping variables: `country`, `year`
  1. Create a dataframe that only includes Asian countries with a life expectancy greater than 70 years in the year 2007. This subset should ONLY include these variables: country, year, and lifeExp.
gapminder %>% 
  filter(continent == "Asia" & lifeExp > 70 & year == 2007) %>% 
  dplyr::select(country, year, lifeExp)
gapminder %>% 
  filter(continent == "Asia" & lifeExp > 70 & year == 2007) 

Using a real dataset

Here, we’re using some a real dataset with responses from the Interpersonal Reactivity Index (IRI). This is a common measure for trait empathy. Now, before we could analyze it, we need to make some changes to the data.

  1. Variables ID, Dyad, Condition need to be treated like factors
  2. We need to delete one variable (iri1)
  3. Some participants should be excluded, so we want to remove those cases
  4. Some items on the IRI need to be reverse-scored
  5. Create a composite score, representing each participant’s trait empathy. Based on the IRI documentation, this should be a sum not a mean (varies by measure)
emp_df <- read_csv("/Users/kareenadelrosario/Desktop/Local R Code/Empathy_Background_randomized.csv")

Examine Data

colnames(emp_df)
##  [1] "ID"        "Dyad"      "iri1"      "iri2"      "iri3"      "iri4"     
##  [7] "iri5"      "iri6"      "iri7"      "iri8"      "iri9"      "iri10"    
## [13] "iri11"     "iri12"     "iri13"     "iri14"     "iri15"     "iri16"    
## [19] "iri17"     "iri18"     "iri19"     "iri20"     "iri21"     "iri22"    
## [25] "iri23"     "iri24"     "iri25"     "iri26"     "iri27"     "iri28"    
## [31] "Condition"

Change Variable Class

#### Option 1
# Change ID, Dyad, and Condition to factor
emp_df$ID <- as.factor(emp_df$ID)
class(emp_df$ID)
## [1] "factor"
#### Option 2
## Useful when converting multiple variables. Note: [,X] = by column
# Convert categorical variables to factor
names <- c(1:2, 31)
emp_df[,names] <- lapply(emp_df[,names], factor)
head(lapply(emp_df, class))
## $ID
## [1] "factor"
## 
## $Dyad
## [1] "factor"
## 
## $iri1
## [1] "numeric"
## 
## $iri2
## [1] "numeric"
## 
## $iri3
## [1] "numeric"
## 
## $iri4
## [1] "numeric"
#### Option 3
## Use dplyr to change class and apply value labels
# Label gender and condition and make sure they're classified as factors
emp_df <- 
  emp_df %>%
  mutate(ID = factor(ID),
         Dyad = factor(Dyad),
         Condition = factor(Condition, labels = c("Sad", "Control"))) # Change from 0,1 to Control and Sad

#### Option 4
## Use dplyr to change class and apply value labels with across()
emp_df <- 
    emp_df %>%
    mutate(across(c(ID, Dyad, Condition), as.factor))  

emp_df <- 
    emp_df %>% 
    mutate(across(where(is.character), as.factor))


emp_df %>% 
  select(ID, Dyad, Condition) %>%
  head()

Remove Variables (Columns) and Cases (Rows)

Variables

# Delete variables
emp_df %>%
  select(-c(iri1)) %>%
  head()

Cases

# Exclude participants by dyad ID
emp_df <- emp_df[ !(emp_df$Dyad %in% c(121, 124, 158, 168, 153)), ]

# Or by row number
emp_df %>%
  slice(-c(58:59, 117:118, 137:138, 107:108))

Composite Scores

#### Option 1
reverse <- emp_df %>%
  mutate(iri3r = 6 - iri3,
         iri4r = 6 - iri4,
         iri7r = 6 - iri7,
         iri12r = 6 - iri12,
         iri13r = 6 - iri13,
         iri14r = 6 - iri14,
         iri15r = 6 - iri15,
         iri18r = 6 - iri18,
         iri19r = 6 - iri19)

#### Option 2
# rowwise tells r to go by ROW not column
# we can also reverse code in one step! (Add -6 before item)
emp_df <- emp_df %>%
  rowwise() %>%
  mutate(emp_sum = sum(c(iri1, iri2, 6-iri3, 6-iri4, iri5, iri6, 6-iri7, iri8, iri9, iri10,iri11,6-iri12,6-iri13,6-iri14,6-iri15,iri16,iri17,6-iri18,6-iri19,iri20,iri21,iri22,iri23,iri24,iri25,iri26,iri27,iri28)))

Amazing! Now let’s move onto something a bit harder. Let’s talk about conditional mutates and restructuring the dataframe.

Conditional mutate (ifelse)

Using the same dataset, we want to add a few new ID and condition variables. After completing the IRI, participants were paired in dyads, half of which were in the sad dyads while the other half were in the control dyads.

Within the sad dyad, 1 person was in the sad condition (told a sad story) and their partner was in the control condition (told a neutral story).

Both participants in the control dyads told a neutral story.

Right now, we just have the ID, Dyad, and Dyad-level Condition variables, but we want to create new variables that reflect this complexity (e.g., differentiate between the participants in the sad dyad: were they the sad actor or control partner?)

The ID variable is the same thing as the Dyad variable except it has a 1 (Actor) and 2 (Partner) at the end. For example, Dyad 200 has participant 2001 (Actor) and participant 2002 (Partner).

  1. Identify the actor vs partner in each dyad by creating a partner variable
  2. Use conditional mutate to use the dyad-level Condition variable to create a 3-level individual condition variable (p_cond).

Control dyads: p_cond = 3 Sad partners: p_cond = 2 Sad actors: p_cond = 1

In SPSS: COMPUTE VARIABLE (IF)

# This dataset does not have a "partner" variable, which is super important for dyadic data. 

# we'll need to make ID and Dyad numeric to calculate partner
emp_df %>% 
  transmute(ID.n = as.numeric(ID)) #issue with converting factor to numeric
## Use mutate to create a "partner" variable.
emp_df <- emp_df %>%
  mutate(ID.n = as.numeric(as.character(ID)),
         Dyad.n = as.numeric(as.character(Dyad)),
         partner = ID.n - (Dyad.n * 10)) %>%
  mutate(partner = factor(partner))

# Preview variables
emp_df %>%
  select(Dyad, ID, partner)
## If we wanted to recode an existing variable, we could use this function: 
emp_df$Condition[emp_df$Condition == 1] <- 'Sad'

## We'll need to create a new condition variable to distinguish each individual condition (sad actors vs sad partners vs control dyad)
emp_df_recode <- emp_df %>%
  mutate(p_cond = ifelse( (Condition == "Sad") & (partner == 1), 2,
                          ifelse( (Condition == "Sad") & (partner == 2), 1, 3))) %>%
  mutate(p_cond = factor(p_cond))

#### Sad Partner = 1
#### Sad Actor = 2
#### Control Dyad = 3

## Double-check new variable
emp_df_recode %>%
  select(ID, Condition, partner, p_cond)
# Remove missing values from emp_df
emp_df_recode <- emp_df_recode %>%
  drop_na(emp_sum, Condition) %>%
  group_by (Dyad) %>%
  filter(n() == 2) %>%
  ungroup()

Restructuring Dataset

Wide vs Long Format

Wide to Long

# Create the dataframe
df_wide <- data.frame(
  ID = c(9991, 9992, 9993, 9994),
  '1990' = c(3, 2, 1, 4),
  '1992' = c(4, 5, 5, 2),
  '1994' = c(2, 3, 5, 3),
  '1996' = c(4, 4, 2, 5),
  check.names = FALSE # R doesn't love numerical column names. this tells it to chill out.
)

print(df_wide)
##     ID 1990 1992 1994 1996
## 1 9991    3    4    2    4
## 2 9992    2    5    3    4
## 3 9993    1    5    5    2
## 4 9994    4    2    3    5

Pivot_longer() from tidyr

Now, we want to change this so that it’s in long format. There should be multiple entries for each ID and Year should be a single variable.

#### Option 1
df_long <- pivot_longer(df_wide,
                        cols = !ID, # variables that should be left alone
                        names_to = "Year", # header of wide_df
                        values_to = "Value") # values that correspond to variable names
head(df_long)
#### Option 2
df_wide %>%
 pivot_longer(
   cols = starts_with("199"), # could specify which columns to pivot
   names_to = "Year",
   values_to = "Value",
   values_drop_na = TRUE # exclude NAs
 )

Long to wide

Pivot_wider()

Let’s reverse it and spread out the Year.

df_wide2 <- pivot_wider(df_long,
                        names_from = Year,
                        values_from = Value,
                        values_fill = 999) # fill NA with 999

head(df_wide2)
### Is it identical to our other wide df?
setequal(df_wide, df_wide2)
## [1] TRUE

Quick guide to correlations in R

To compute basic correlation coefficients there are three main functions that can be used: cor(), cor.test() and rcorr(). functions. The functions cor() and cor.test() are part of the base system in R, but rcorr() is part of the Hmisc package. The output varies between the three: if you want a confidence interval then you will have to use cor.test(), and if you want correlation coefficients for multiple pairs of variables then you cannot use cor.test(); similarly, if you want p-values then cor() won’t help you, etc.

cor()

Here, we’re looking at the correlation between iri at Time 1 and Time 2. cor() only outputs correlation.

# Note that cor does not like NAs. Check doc using ?cor for NA options
# pearson correlation is default
cor(emp_df_recode$iri2, emp_df_recode$iri3, use = "complete.obs")
## [1] -0.07522498
# rank correlation coefficient
cor(emp_df_recode$iri2, emp_df_recode$iri3, use = "complete.obs", method = "spearman")
## [1] -0.1429502

QUESTION: For these data (Likert scale), which correlation is correct?

We could also run correlation matrix using the following:

cor(emp_df_recode[,c("iri2", "iri3", "iri4")], use = "complete.obs")
##             iri2        iri3       iri4
## iri2  1.00000000 -0.07522498 -0.3106040
## iri3 -0.07522498  1.00000000  0.2255922
## iri4 -0.31060405  0.22559218  1.0000000

How would you interpret this matrix?

cor.test()

cor.test is a correlation test between paired samples. It outputs the correlation plus the 95% CI, t-stat, df, and p-value.

cor.test(emp_df_recode$iri2, emp_df_recode$iri3)
## 
##  Pearson's product-moment correlation
## 
## data:  emp_df_recode$iri2 and emp_df_recode$iri3
## t = -1.0177, df = 182, p-value = 0.3102
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.21751893  0.07019996
## sample estimates:
##         cor 
## -0.07522498

rcorr()

load library(Hmisc). rcorr outputs correlation matrix, n, and p values.

library(Hmisc)

# convert df into a matrix
matrix1 <- as.matrix(emp_df_recode[,c("iri2", "iri3", "iri4")])

# correlation matrix
rcorr(matrix1)
##       iri2  iri3  iri4
## iri2  1.00 -0.08 -0.31
## iri3 -0.08  1.00  0.23
## iri4 -0.31  0.23  1.00
## 
## n= 184 
## 
## 
## P
##      iri2   iri3   iri4  
## iri2        0.3102 0.0000
## iri3 0.3102        0.0021
## iri4 0.0000 0.0021

BONUS: Introduction to Writing Functions

To introduce you to what goes into writing a function, let’s create a function that computes a mean.

var_mean <- function(variable) # what does the function require to be input
{
  mean <- sum(variable)/length(variable) # how is a mean calculated
  cat("Mean = ", mean) # what it should print
}

# Let's test it out!

# Create data
test_data <- c(5,12,98,23,45,7,86,34)

# Now let's try it out
var_mean(test_data)
## Mean =  38.75

Using Functions for Things We Normally Can’t Do in R

Option 1: create your own function

descriptive_table <- function(x) { # our function accepts just a variable. if in a df, it'll be df$variable
  tab_mean <- mean(x, na.rm = TRUE) # compute mean
  tab_sd <- sd(x, na.rm = TRUE) # compute sd
  tab_min <- min(x, na.rm = TRUE) # find minimum value
  tab_max <- max(x, na.rm = TRUE) # find maximum value
    cat("Summary of descriptive statistics\n") # print title
    cat("Mean:               ", tab_mean, "\n") # print mean... and so on.
    cat("Standard Deviation: ", tab_sd, "\n")
    cat("Minimum:            ", tab_min,    "\n")
    cat("Maximum:            ", tab_max,    "\n")
  
}


descriptive_table(emp_df_recode$iri12)
## Summary of descriptive statistics
## Mean:                2.26087 
## Standard Deviation:  1.172262 
## Minimum:             1 
## Maximum:             5

Option 2: shorten existing functions with partial() from the purrr package

library(purrr)

# adding shortcuts
mymean <- partial(mean, na.rm = TRUE)