install.packages("tidyverse")Week 6: data manipulation in R and two variable statistics
by Grace Yuh Chwen Lee, Winter 2025
R packages and how to load them
There are many powerful R packages written by various scientists, and these R packages significantly augment the ability of R, making it super powerful!
To use a package, we first need to install it.
Or on the top bar, go to Tools -> install packages -> (type in the package name).
To use a package we already installed, we need to “call” it using library().
library("tidyverse")── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will be mainly using the dplyr package for manipulation of data.frame. Another commonly used package is ggplot2, which allows detailed specifications of plots.
Data manipulation in R
We will use the MLB data again - let’s load it!
data_mlb = read.table("Week2_R_basics_MLB_data.tsv", header = T)
length(data_mlb[,1])[1] 1034
names(data_mlb)[1] "Name" "Team" "Position" "Height" "Weight" "Age"
If want to get the mean Height of players of each position, we could do the hard way.
table(data_mlb$Position)
Catcher Designated_Hitter First_Baseman Outfielder
76 18 55 194
Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
315 58 52 221
Third_Baseman
45
mean(data_mlb$Height[data_mlb$Position == 'Catcher'])
mean(data_mlb$Height[data_mlb$Position == 'Designated_Hitter'])
mean(data_mlb$Height[data_mlb$Position == 'First_Baseman'])
mean(data_mlb$Height[data_mlb$Position == 'Outfielder'])
mean(data_mlb$Height[data_mlb$Position == 'Relief_Pitcher'])
mean(data_mlb$Height[data_mlb$Position == 'Second_Baseman'])
mean(data_mlb$Height[data_mlb$Position == 'Shortstop'])
mean(data_mlb$Height[data_mlb$Position == 'Starting_Pitcher'])
mean(data_mlb$Height[data_mlb$Position == 'Third_Baseman'])
Or we could use a function called tapply.
?tapply
tapply(data_mlb$Height, data_mlb$Position, mean) Catcher Designated_Hitter First_Baseman Outfielder
72.72368 74.22222 74.00000 73.01031
Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
74.37460 71.36207 71.90385 74.71946
Third_Baseman
73.04444
Or we could use functions in dplyr package.
#always remember to load your package
library("tidyverse")dplyr uses “verb” to manipulate the data.frame. After the manipulation, it always creates a new data.frame. It uses |> syntax (see below example), which is a bit different from how we use other R built-in functions.
There are three major types of “verbs” of dplyr, manipulating rows, columns, and groups.
filter()select rows based on values of a specific column (manipulation on rows)
Tip: when using dplyr functions/verb, you do not need $ to call a specific column within a data.frame.
Let’s select rows based on height (column) greater than 75.
The way you know:
logic_height = data_mlb$Height > 75
data_mlb_tall = data_mlb[logic_height,]
length(data_mlb[,1])[1] 1034
length(data_mlb_tall[,1])[1] 211
This can also be achieved using filter() in dplyr:
data_mlb_tall2 = data_mlb |>
filter(Height > 75)
length(data_mlb[,1])[1] 1034
length(data_mlb_tall2[,1])[1] 211
- mutate() add new columns that are calculated from existing columns (manipulation on columns)
Let’s add a new variable for the BMI of each of the player to the data.frame.
The way you know:
height_m = data_mlb$Height*2.54/100 #converting inch to m
weight_kg = data_mlb$Weight*0.454 #converting lbs to kg
bmi = weight_kg/(height_m^2)
#to create a new data frame
data_mlb_bmi = data.frame(data_mlb, BMI = bmi)
names(data_mlb_bmi)[1] "Name" "Team" "Position" "Height" "Weight" "Age" "BMI"
This can be also be achieved with mutation() in dplyr.
data_mlb_bmi2 = data_mlb |>
mutate(
BMI2 = Weight*0.454/((Height*2.54/100)^2)
)
names(data_mlb_bmi2)[1] "Name" "Team" "Position" "Height" "Weight" "Age" "BMI2"
See if the answers are the same:
#see if we get the same answer!
summary(data_mlb_bmi$BMI == data_mlb_bmi2$BMI2) Mode TRUE
logical 1034
mutate() can add more than one columns a time.
data_mlb_bmi3 = data_mlb |>
mutate(
height_m = Height*2.54/100,
BMI2 = Weight*0.454/((Height*2.54/100)^2)
)
names(data_mlb_bmi3)[1] "Name" "Team" "Position" "Height" "Weight" "Age" "height_m"
[8] "BMI2"
- group_by() group the data based on a specific column, which allows using sumarize() to summarize of data by group
Let’s summarize the mean of Height of each Position. We can use R’s built-in function tapply.
#if we want to get the mean height of players for each position.
#Catcher Designated_Hitter First_Baseman Outfielder Relief_Pitcher
# 76 18 55 194 315
# Second_Baseman Shortstop Starting_Pitcher Third_Baseman
# 58 52 221 45
tapply(data_mlb$Height, data_mlb$Position, mean) Catcher Designated_Hitter First_Baseman Outfielder
72.72368 74.22222 74.00000 73.01031
Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
74.37460 71.36207 71.90385 74.71946
Third_Baseman
73.04444
We can also use group_by() of dplyr.
mean_height = data_mlb |>
group_by(Position) |> #the outcome of group_by position is "piped" to summarize function
summarize(
mean_heigth_position = mean(Height)
)
mean_height# A tibble: 9 × 2
Position mean_heigth_position
<chr> <dbl>
1 Catcher 72.7
2 Designated_Hitter 74.2
3 First_Baseman 74
4 Outfielder 73.0
5 Relief_Pitcher 74.4
6 Second_Baseman 71.4
7 Shortstop 71.9
8 Starting_Pitcher 74.7
9 Third_Baseman 73.0
It is possible to use multiple functions of dplyr together, “piping” the result from one to the next!
mean_height_tall = data_mlb |>
filter(Height > 75)|> #filter out players with height <= 75
group_by(Position) |> #the outcome of group_by position is "piped" to summarize function
summarize(
mean_heigth_position = mean(Height)
)
mean_height_tall# A tibble: 7 × 2
Position mean_heigth_position
<chr> <dbl>
1 Catcher 76
2 Designated_Hitter 76.5
3 First_Baseman 77
4 Outfielder 76.5
5 Relief_Pitcher 76.9
6 Starting_Pitcher 77.2
7 Third_Baseman 76.2
Some other resources to help you understand these powerful functions of dplyr https://www.andrewheiss.com/blog/2024/04/04/group_by-summarize-ungroup-animations/
Count data - summarize data and statistical tests
When the numerical data is sparse, what we learned in Week 3 about summarizing data may not be that useful.
Do you think the histogram and boxplot helpful for summarizing v1 below?
It is likely that the data only consists of few values. table() is always a good function to see if that is true.
table(v1)v1
1 19 20
8 6 5
So, in this example, there is only three possible values of v1. Instead of a histogram, a barplot may be more appropriate for v1. To make a barplot using the the outcome of table() is easy.
barplot(table(v1))You can also specify the height of the bars by providing a vector. The names could be specify by option names = c( ). Remember, group names needs " " around them (since they are not variables).
v2 = c(1, 10, 4)
barplot(v2, names = c("control", "treatment 1", "treatment 2"))To compare the observed count with expectation, we could use fisher.test() (for 2x2 tables) and chisq.test().
A commonly used example to understand Fisher’s exact test and Chi-square test is coin tossing. Let’s say we are tossing a fair coin.
?sample
coin = c('head', 'tail')
ntoss = 100 #the number of random toss
random_coin = sample(coin, size = ntoss, replace = TRUE) #a vector of ntoss random toss
table(random_coin)random_coin
head tail
51 49
obs_toss = as.vector(table(random_coin)) #force the table outcome to be a vector
exp_toss = c(ntoss/2, ntoss/2) #expectation
#combine to vector into one long vector
m = c(obs_toss, exp_toss)
#make it into a 2x2 matrix
m = matrix(m, nrow = 2)
fisher.test(m)
Fisher's Exact Test for Count Data
data: m
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5759313 1.8811672
sample estimates:
odds ratio
1.040632
Let’s try tossing another “unfair” coin. You can specify the probability of each outcome in the sample() by prob = c( , ).
coin = c('head', 'tail')
ntoss = 100 #the number of random toss
random_coin2 = sample(coin, size = ntoss, replace = TRUE, prob = c(0.8, 0.2)) #tossing an unfair coin
table(random_coin2)random_coin2
head tail
79 21
obs_toss2 = as.vector(table(random_coin2)) #force the table outcome to be a vector
exp_toss2 = c(ntoss/2, ntoss/2) #expectation
#combine to vector into one long vector
m2 = c(obs_toss2, exp_toss2)
#make it into a 2x2 matrix
m2 = matrix(m2, nrow = 2)
m2 [,1] [,2]
[1,] 79 50
[2,] 21 50
fisher.test(m2)
Fisher's Exact Test for Count Data
data: m2
p-value = 2.945e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.940871 7.382548
sample estimates:
odds ratio
3.735756
Make sure you know how to interpret the Fisher’s Exact test results!
Fisher’s Exact tests can calculate the p-value exactly, while Chi-square tests approximate it. But for tables larger than 2x2, we have to use chiseq.test. The usage is very similar.
obs = c(100, 105, 50)
exp = c(100, 100, 100)
m3 = c(obs, exp)
m3 = matrix(m3, nrow = 2, byrow = T) ##by defult, R fill up the columns first!
chisq.test(m3)
Pearson's Chi-squared test
data: m3
X-squared = 13.227, df = 2, p-value = 0.001342
Tip: when generating a matrix, remember that R fills up the column firsts. If it is hard to remember (at least for me), then always specify byrow = T or byrow = F to avoid confusions.
Comparing distributions - Kolmogorov–Smirnov test
As we discussed last week, sometimes, the sample distributions could have the same mean and similar variance, but they in fact look very different when you make histograms. Unlike t.test() that tests whether the mean is the same and Wilcox.test() that tests whether the median is the same, ks.test() compares the whole distribution.
v1 = rnorm(100, 0, 4)
v2 = rnorm(100, 0, 1)
par(mfrow = c(1, 2)) #put two figures (1 row, 2 columns) in one plot
hist(v1)
hist(v2)#compare their distribution using ks test
ks.test(v1, v2)
Asymptotic two-sample Kolmogorov-Smirnov test
data: v1 and v2
D = 0.38, p-value = 1.071e-06
alternative hypothesis: two-sided
#let's try another distiburion with vey similar sd to v1
v3 = rnorm(100, 0, 3.9)
hist(v1)
hist(v3)ks.test(v1, v3)
Asymptotic two-sample Kolmogorov-Smirnov test
data: v1 and v3
D = 0.07, p-value = 0.9671
alternative hypothesis: two-sided
Testing associations - correlation tests
To test the associations between two variables, we have multiple ways to do it. In fact, t.test() and wilcox.test() are testing whether the values of one continuous variable (e.g., age - continuous variable) is associated with the value of a categorical variable (e.g., positions in baseball games).
Similarly, Fisher.test() and chisq.test() can also be used to test the associations between two categorical variables (see above examples).
For two continuous variables, we usually use Pearson correlation tests cor.test() to test whether their associations are statistically significant and how strong the association is. Pearson correlation is a parametric test, with the normality assumptions.
Let’s look at an example we covered last week.
Based on this figure, what null hypothesis (H0) would you propose?
cor.test(x = data_mlb2$Weight, y = data_mlb2$Height)
Pearson's product-moment correlation
data: data_mlb2$Weight and data_mlb2$Height
t = 20.192, df = 1032, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4869843 0.5744776
sample estimates:
cor
0.5321502
Pearson correlation tests can easily be biased by outliers.
Do you think there is an association between x and y in the below figure?
cor.test(v1, v2)
Pearson's product-moment correlation
data: v1 and v2
t = 3.5893, df = 11, p-value = 0.004248
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3081909 0.9151178
sample estimates:
cor
0.7344588
Nonparametric Spearman correlation test is less sensitive to outlier values. It can be specified using the method = "spearman option (remember the " ").
cor.test(v1, v2, method = "spearman")
Spearman's rank correlation rho
data: v1 and v2
S = 392, p-value = 0.8065
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.07692308