1. Introduction to basic data manipulation (mutate, select, filter, summarize)
2. Applying dplyr to real data
3. Intermediate data manipulation (change var class, lapply, composite scores, conditional mutate)
4. Restructuring datasets (wide to long, long to wide)
5. Using correlations to inspect your data
BONUS: Writing your own functions
This code loads the packages you need using the library function, but will install any packages that have not yet been installed prior to loading them.
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
Name | gapminder |
Number of rows | 1704 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
country | 0 | 1 | FALSE | 142 | Afg: 12, Alb: 12, Alg: 12, Ang: 12 |
continent | 0 | 1 | FALSE | 5 | Afr: 624, Asi: 396, Eur: 360, Ame: 300 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1979.50 | 17.27 | 1952.00 | 1965.75 | 1979.50 | 1993.25 | 2007.0 | ▇▅▅▅▇ |
lifeExp | 0 | 1 | 59.47 | 12.92 | 23.60 | 48.20 | 60.71 | 70.85 | 82.6 | ▁▆▇▇▇ |
pop | 0 | 1 | 29601212.32 | 106157896.74 | 60011.00 | 2793664.00 | 7023595.50 | 19585221.75 | 1318683096.0 | ▇▁▁▁▁ |
gdpPercap | 0 | 1 | 7215.33 | 9857.45 | 241.17 | 1202.06 | 3531.85 | 9325.46 | 113523.1 | ▇▁▁▁▁ |
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Variable Classes
1. Integers are whole numbers without decimal points.
2. Numeric/double is a general class that includes both
decimals (floating-point numbers) and integers.
3. Factor represents categorical data.
They can store both string and integer values.
4. Character represents a string in R.
5. Logical represents boolean values, with
TRUE, FALSE, and NA as possible values.
## $country
## [1] "factor"
##
## $continent
## [1] "factor"
##
## $year
## [1] "integer"
##
## $lifeExp
## [1] "numeric"
##
## $pop
## [1] "integer"
##
## $gdpPercap
## [1] "numeric"
# filtered_df <- # Saves in global environment
gapminder %>%
filter(continent == "Africa", year > 1990) %>%
head()
SPSS Syntax: Filtering
DATASET COPY filtered_df.
DATASET ACTIVATE filtered_df.
FILTER OFF.
USE ALL.
SELECT IF ((continent = 'Africa') & (year > 1990)).
EXECUTE.
SPSS Syntax: Compute variable
COMPUTE diff.gdp=gdpPercap - mean(gdpPercap).
EXECUTE.
gapminder %>%
group_by(year) %>%
dplyr::summarize(mean_gdp = mean(gdpPercap)) %>%
arrange(desc(year))
Other functions you may want to use with mutate and summarize:
1. sum (x, na.rm = TRUE)
2. max (x, na.rm = TRUE)
3. min (x, na.rm = TRUE)
4. n_distinct (x) <- count number of unique instances
## Adding missing grouping variables: `continent`
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `country`, `year`
gapminder %>%
filter(continent == "Asia" & lifeExp > 70 & year == 2007) %>%
dplyr::select(country, year, lifeExp)
Here, we’re using some a real dataset with responses from the Interpersonal Reactivity Index (IRI). This is a common measure for trait empathy. Now, before we could analyze it, we need to make some changes to the data.
emp_df <- read_csv("/Users/kareenadelrosario/Desktop/Local R Code/Empathy_Background_randomized.csv")
## [1] "ID" "Dyad" "iri1" "iri2" "iri3" "iri4"
## [7] "iri5" "iri6" "iri7" "iri8" "iri9" "iri10"
## [13] "iri11" "iri12" "iri13" "iri14" "iri15" "iri16"
## [19] "iri17" "iri18" "iri19" "iri20" "iri21" "iri22"
## [25] "iri23" "iri24" "iri25" "iri26" "iri27" "iri28"
## [31] "Condition"
#### Option 1
# Change ID, Dyad, and Condition to factor
emp_df$ID <- as.factor(emp_df$ID)
class(emp_df$ID)
## [1] "factor"
#### Option 2
## Useful when converting multiple variables. Note: [,X] = by column
# Convert categorical variables to factor
names <- c(1:2, 31)
emp_df[,names] <- lapply(emp_df[,names], factor)
head(lapply(emp_df, class))
## $ID
## [1] "factor"
##
## $Dyad
## [1] "factor"
##
## $iri1
## [1] "numeric"
##
## $iri2
## [1] "numeric"
##
## $iri3
## [1] "numeric"
##
## $iri4
## [1] "numeric"
#### Option 3
## Use dplyr to change class and apply value labels
# Label gender and condition and make sure they're classified as factors
emp_df <-
emp_df %>%
mutate(ID = factor(ID),
Dyad = factor(Dyad),
Condition = factor(Condition, labels = c("Sad", "Control"))) # Change from 0,1 to Control and Sad
#### Option 4
## Use dplyr to change class and apply value labels with across()
emp_df <-
emp_df %>%
mutate(across(c(ID, Dyad, Condition), as.factor))
emp_df <-
emp_df %>%
mutate(across(where(is.character), as.factor))
emp_df %>%
select(ID, Dyad, Condition) %>%
head()
#### Option 1
reverse <- emp_df %>%
mutate(iri3r = 6 - iri3,
iri4r = 6 - iri4,
iri7r = 6 - iri7,
iri12r = 6 - iri12,
iri13r = 6 - iri13,
iri14r = 6 - iri14,
iri15r = 6 - iri15,
iri18r = 6 - iri18,
iri19r = 6 - iri19)
#### Option 2
# rowwise tells r to go by ROW not column
# we can also reverse code in one step! (Add -6 before item)
emp_df <- emp_df %>%
rowwise() %>%
mutate(emp_sum = sum(c(iri1, iri2, 6-iri3, 6-iri4, iri5, iri6, 6-iri7, iri8, iri9, iri10,iri11,6-iri12,6-iri13,6-iri14,6-iri15,iri16,iri17,6-iri18,6-iri19,iri20,iri21,iri22,iri23,iri24,iri25,iri26,iri27,iri28)))
Amazing! Now let’s move onto something a bit harder. Let’s talk about conditional mutates and restructuring the dataframe.
Using the same dataset, we want to add a few new ID and condition variables. After completing the IRI, participants were paired in dyads, half of which were in the sad dyads while the other half were in the control dyads.
Within the sad dyad, 1 person was in the sad condition (told a sad story) and their partner was in the control condition (told a neutral story).
Both participants in the control dyads told a neutral story.
Right now, we just have the ID, Dyad, and Dyad-level Condition variables, but we want to create new variables that reflect this complexity (e.g., differentiate between the participants in the sad dyad: were they the sad actor or control partner?)
The ID variable is the same thing as the Dyad variable except it has a 1 (Actor) and 2 (Partner) at the end. For example, Dyad 200 has participant 2001 (Actor) and participant 2002 (Partner).
Control dyads: p_cond = 3 Sad partners: p_cond = 2 Sad actors: p_cond = 1
# This dataset does not have a "partner" variable, which is super important for dyadic data.
# we'll need to make ID and Dyad numeric to calculate partner
emp_df %>%
transmute(ID.n = as.numeric(ID)) #issue with converting factor to numeric
## Use mutate to create a "partner" variable.
emp_df <- emp_df %>%
mutate(ID.n = as.numeric(as.character(ID)),
Dyad.n = as.numeric(as.character(Dyad)),
partner = ID.n - (Dyad.n * 10)) %>%
mutate(partner = factor(partner))
# Preview variables
emp_df %>%
select(Dyad, ID, partner)
## If we wanted to recode an existing variable, we could use this function:
emp_df$Condition[emp_df$Condition == 1] <- 'Sad'
## We'll need to create a new condition variable to distinguish each individual condition (sad actors vs sad partners vs control dyad)
emp_df_recode <- emp_df %>%
mutate(p_cond = ifelse( (Condition == "Sad") & (partner == 1), 2,
ifelse( (Condition == "Sad") & (partner == 2), 1, 3))) %>%
mutate(p_cond = factor(p_cond))
#### Sad Partner = 1
#### Sad Actor = 2
#### Control Dyad = 3
## Double-check new variable
emp_df_recode %>%
select(ID, Condition, partner, p_cond)
# Create the dataframe
df_wide <- data.frame(
ID = c(9991, 9992, 9993, 9994),
'1990' = c(3, 2, 1, 4),
'1992' = c(4, 5, 5, 2),
'1994' = c(2, 3, 5, 3),
'1996' = c(4, 4, 2, 5),
check.names = FALSE # R doesn't love numerical column names. this tells it to chill out.
)
print(df_wide)
## ID 1990 1992 1994 1996
## 1 9991 3 4 2 4
## 2 9992 2 5 3 4
## 3 9993 1 5 5 2
## 4 9994 4 2 3 5
Now, we want to change this so that it’s in long format. There should be multiple entries for each ID and Year should be a single variable.
#### Option 1
df_long <- pivot_longer(df_wide,
cols = !ID, # variables that should be left alone
names_to = "Year", # header of wide_df
values_to = "Value") # values that correspond to variable names
head(df_long)
#### Option 2
df_wide %>%
pivot_longer(
cols = starts_with("199"), # could specify which columns to pivot
names_to = "Year",
values_to = "Value",
values_drop_na = TRUE # exclude NAs
)
To compute basic correlation coefficients there are three main functions that can be used: cor(), cor.test() and rcorr(). functions. The functions cor() and cor.test() are part of the base system in R, but rcorr() is part of the Hmisc package. The output varies between the three: if you want a confidence interval then you will have to use cor.test(), and if you want correlation coefficients for multiple pairs of variables then you cannot use cor.test(); similarly, if you want p-values then cor() won’t help you, etc.
Here, we’re looking at the correlation between iri at Time 1 and Time 2. cor() only outputs correlation.
# Note that cor does not like NAs. Check doc using ?cor for NA options
# pearson correlation is default
cor(emp_df_recode$iri2, emp_df_recode$iri3, use = "complete.obs")
## [1] -0.07522498
# rank correlation coefficient
cor(emp_df_recode$iri2, emp_df_recode$iri3, use = "complete.obs", method = "spearman")
## [1] -0.1429502
QUESTION: For these data (Likert scale), which correlation is correct?
We could also run correlation matrix using the following:
## iri2 iri3 iri4
## iri2 1.00000000 -0.07522498 -0.3106040
## iri3 -0.07522498 1.00000000 0.2255922
## iri4 -0.31060405 0.22559218 1.0000000
cor.test is a correlation test between paired samples. It outputs the correlation plus the 95% CI, t-stat, df, and p-value.
##
## Pearson's product-moment correlation
##
## data: emp_df_recode$iri2 and emp_df_recode$iri3
## t = -1.0177, df = 182, p-value = 0.3102
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.21751893 0.07019996
## sample estimates:
## cor
## -0.07522498
load library(Hmisc). rcorr outputs correlation matrix, n, and p values.
library(Hmisc)
# convert df into a matrix
matrix1 <- as.matrix(emp_df_recode[,c("iri2", "iri3", "iri4")])
# correlation matrix
rcorr(matrix1)
## iri2 iri3 iri4
## iri2 1.00 -0.08 -0.31
## iri3 -0.08 1.00 0.23
## iri4 -0.31 0.23 1.00
##
## n= 184
##
##
## P
## iri2 iri3 iri4
## iri2 0.3102 0.0000
## iri3 0.3102 0.0021
## iri4 0.0000 0.0021
To introduce you to what goes into writing a function, let’s create a function that computes a mean.
var_mean <- function(variable) # what does the function require to be input
{
mean <- sum(variable)/length(variable) # how is a mean calculated
cat("Mean = ", mean) # what it should print
}
# Let's test it out!
# Create data
test_data <- c(5,12,98,23,45,7,86,34)
# Now let's try it out
var_mean(test_data)
## Mean = 38.75
descriptive_table <- function(x) { # our function accepts just a variable. if in a df, it'll be df$variable
tab_mean <- mean(x, na.rm = TRUE) # compute mean
tab_sd <- sd(x, na.rm = TRUE) # compute sd
tab_min <- min(x, na.rm = TRUE) # find minimum value
tab_max <- max(x, na.rm = TRUE) # find maximum value
cat("Summary of descriptive statistics\n") # print title
cat("Mean: ", tab_mean, "\n") # print mean... and so on.
cat("Standard Deviation: ", tab_sd, "\n")
cat("Minimum: ", tab_min, "\n")
cat("Maximum: ", tab_max, "\n")
}
descriptive_table(emp_df_recode$iri12)
## Summary of descriptive statistics
## Mean: 2.26087
## Standard Deviation: 1.172262
## Minimum: 1
## Maximum: 5