Week 6: data manipulation in R and two variable statistics

by Grace Yuh Chwen Lee, Winter 2025

R packages and how to load them

There are many powerful R packages written by various scientists, and these R packages significantly augment the ability of R, making it super powerful!

To use a package, we first need to install it.

install.packages("tidyverse")

Or on the top bar, go to Tools -> install packages -> (type in the package name).

To use a package we already installed, we need to “call” it using library().

library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We will be mainly using the dplyr package for manipulation of data.frame. Another commonly used package is ggplot2, which allows detailed specifications of plots.


Data manipulation in R

We will use the MLB data again - let’s load it!

data_mlb = read.table("Week2_R_basics_MLB_data.tsv", header = T)
length(data_mlb[,1])
[1] 1034
names(data_mlb)
[1] "Name"     "Team"     "Position" "Height"   "Weight"   "Age"     

If want to get the mean Height of players of each position, we could do the hard way.

table(data_mlb$Position)

          Catcher Designated_Hitter     First_Baseman        Outfielder 
               76                18                55               194 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
              315                58                52               221 
    Third_Baseman 
               45 
mean(data_mlb$Height[data_mlb$Position == 'Catcher'])
mean(data_mlb$Height[data_mlb$Position == 'Designated_Hitter'])
mean(data_mlb$Height[data_mlb$Position == 'First_Baseman'])
mean(data_mlb$Height[data_mlb$Position == 'Outfielder'])
mean(data_mlb$Height[data_mlb$Position == 'Relief_Pitcher'])
mean(data_mlb$Height[data_mlb$Position == 'Second_Baseman'])
mean(data_mlb$Height[data_mlb$Position == 'Shortstop'])
mean(data_mlb$Height[data_mlb$Position == 'Starting_Pitcher'])
mean(data_mlb$Height[data_mlb$Position == 'Third_Baseman'])


Or we could use a function called tapply.

?tapply
tapply(data_mlb$Height, data_mlb$Position, mean)
          Catcher Designated_Hitter     First_Baseman        Outfielder 
         72.72368          74.22222          74.00000          73.01031 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
         74.37460          71.36207          71.90385          74.71946 
    Third_Baseman 
         73.04444 

Q1 (1 point)

Among players who are older than 30, get the mean and variance of the Weight and Height of players of each Position. Please use tapply to do this.


Or we could use functions in dplyr package.

#always remember to load your package
library("tidyverse")

dplyr uses “verb” to manipulate the data.frame. After the manipulation, it always creates a new data.frame. It uses |> syntax (see below example), which is a bit different from how we use other R built-in functions.
There are three major types of “verbs” of dplyr, manipulating rows, columns, and groups.

  • filter() select rows based on values of a specific column (manipulation on rows)

Tip: when using dplyr functions/verb, you do not need $ to call a specific column within a data.frame.

Let’s select rows based on height (column) greater than 75.

The way you know:

logic_height = data_mlb$Height > 75
data_mlb_tall = data_mlb[logic_height,]
length(data_mlb[,1])
[1] 1034
length(data_mlb_tall[,1])
[1] 211

This can also be achieved using filter() in dplyr:

data_mlb_tall2 = data_mlb |> 
  filter(Height > 75)

length(data_mlb[,1])
[1] 1034
length(data_mlb_tall2[,1])
[1] 211


- mutate() add new columns that are calculated from existing columns (manipulation on columns)

Let’s add a new variable for the BMI of each of the player to the data.frame.

The way you know:

height_m = data_mlb$Height*2.54/100 #converting inch to m
weight_kg = data_mlb$Weight*0.454    #converting lbs to kg
bmi = weight_kg/(height_m^2)
#to create a new data frame
data_mlb_bmi = data.frame(data_mlb, BMI = bmi)
names(data_mlb_bmi)
[1] "Name"     "Team"     "Position" "Height"   "Weight"   "Age"      "BMI"     

This can be also be achieved with mutation() in dplyr.

data_mlb_bmi2 = data_mlb |>
  mutate(
    BMI2 = Weight*0.454/((Height*2.54/100)^2)
  )
names(data_mlb_bmi2)
[1] "Name"     "Team"     "Position" "Height"   "Weight"   "Age"      "BMI2"    

See if the answers are the same:

#see if we get the same answer!
summary(data_mlb_bmi$BMI == data_mlb_bmi2$BMI2)
   Mode    TRUE 
logical    1034 

mutate() can add more than one columns a time.

data_mlb_bmi3 = data_mlb |>
  mutate(
    height_m = Height*2.54/100,
    BMI2 = Weight*0.454/((Height*2.54/100)^2)
  )
names(data_mlb_bmi3)
[1] "Name"     "Team"     "Position" "Height"   "Weight"   "Age"      "height_m"
[8] "BMI2"    


- group_by() group the data based on a specific column, which allows using sumarize() to summarize of data by group

Let’s summarize the mean of Height of each Position. We can use R’s built-in function tapply.

#if we want to get the mean height of players for each position. 
#Catcher Designated_Hitter     First_Baseman        Outfielder    Relief_Pitcher 
#               76                18                55               194               315 
#   Second_Baseman         Shortstop  Starting_Pitcher     Third_Baseman 
#               58                52               221                45 
tapply(data_mlb$Height, data_mlb$Position, mean)
          Catcher Designated_Hitter     First_Baseman        Outfielder 
         72.72368          74.22222          74.00000          73.01031 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
         74.37460          71.36207          71.90385          74.71946 
    Third_Baseman 
         73.04444 

We can also use group_by() of dplyr.

mean_height = data_mlb |>
  group_by(Position) |>  #the outcome of group_by position is "piped" to summarize function
  summarize(
    mean_heigth_position = mean(Height)
  )
mean_height
# A tibble: 9 × 2
  Position          mean_heigth_position
  <chr>                            <dbl>
1 Catcher                           72.7
2 Designated_Hitter                 74.2
3 First_Baseman                     74  
4 Outfielder                        73.0
5 Relief_Pitcher                    74.4
6 Second_Baseman                    71.4
7 Shortstop                         71.9
8 Starting_Pitcher                  74.7
9 Third_Baseman                     73.0

It is possible to use multiple functions of dplyr together, “piping” the result from one to the next!

mean_height_tall = data_mlb |>
  filter(Height > 75)|>  #filter out players with height <= 75
  group_by(Position) |>  #the outcome of group_by position is "piped" to summarize function
  summarize(
    mean_heigth_position = mean(Height)
  )
mean_height_tall
# A tibble: 7 × 2
  Position          mean_heigth_position
  <chr>                            <dbl>
1 Catcher                           76  
2 Designated_Hitter                 76.5
3 First_Baseman                     77  
4 Outfielder                        76.5
5 Relief_Pitcher                    76.9
6 Starting_Pitcher                  77.2
7 Third_Baseman                     76.2

While filter() and mutate() do not alter the original data.frame, group_by() sort of does. If want to remove the effect of group_by(), use ungroup()

data_mlb |>
  ungroup()

Some other resources to help you understand these powerful functions of dplyr https://www.andrewheiss.com/blog/2024/04/04/group_by-summarize-ungroup-animations/

Q2 (2 point)

For players older than 25, players of which Position are the tallest (i.e., have largest mean Height)? which are the heaviest (i.e., have largest mean Weight)?

Please use filter(), group_by(), and summarize() of the dplyr to do this.


Count data - summarize data and statistical tests

When the numerical data is sparse, what we learned in Week 3 about summarizing data may not be that useful.

Do you think the histogram and boxplot helpful for summarizing v1 below?

It is likely that the data only consists of few values. table() is always a good function to see if that is true.

table(v1)
v1
 1 19 20 
 8  6  5 

So, in this example, there is only three possible values of v1. Instead of a histogram, a barplot may be more appropriate for v1. To make a barplot using the the outcome of table() is easy.

barplot(table(v1))

You can also specify the height of the bars by providing a vector. The names could be specify by option names = c( ). Remember, group names needs " " around them (since they are not variables).

v2 = c(1, 10, 4)
barplot(v2, names = c("control", "treatment 1", "treatment 2"))

To compare the observed count with expectation, we could use fisher.test() (for 2x2 tables) and chisq.test().

A commonly used example to understand Fisher’s exact test and Chi-square test is coin tossing. Let’s say we are tossing a fair coin.

?sample

coin = c('head', 'tail')
ntoss = 100 #the number of random toss
random_coin = sample(coin, size = ntoss, replace = TRUE) #a vector of ntoss random toss

table(random_coin)
random_coin
head tail 
  51   49 
obs_toss = as.vector(table(random_coin))  #force the table outcome to be a vector
exp_toss = c(ntoss/2, ntoss/2) #expectation

#combine to vector into one long vector
m = c(obs_toss, exp_toss)
#make it into a 2x2 matrix
m = matrix(m, nrow = 2)

fisher.test(m)

    Fisher's Exact Test for Count Data

data:  m
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5759313 1.8811672
sample estimates:
odds ratio 
  1.040632 

Let’s try tossing another “unfair” coin. You can specify the probability of each outcome in the sample() by prob = c( , ).

coin = c('head', 'tail')
ntoss = 100 #the number of random toss
random_coin2 = sample(coin, size = ntoss, replace = TRUE, prob = c(0.8, 0.2)) #tossing an unfair coin

table(random_coin2)
random_coin2
head tail 
  79   21 
obs_toss2 = as.vector(table(random_coin2))  #force the table outcome to be a vector
exp_toss2 = c(ntoss/2, ntoss/2) #expectation

#combine to vector into one long vector
m2 = c(obs_toss2, exp_toss2)
#make it into a 2x2 matrix
m2 = matrix(m2, nrow = 2)
m2
     [,1] [,2]
[1,]   79   50
[2,]   21   50
fisher.test(m2)

    Fisher's Exact Test for Count Data

data:  m2
p-value = 2.945e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.940871 7.382548
sample estimates:
odds ratio 
  3.735756 

Make sure you know how to interpret the Fisher’s Exact test results!

Q3 (1 point)

An anteater flipped a mysterious coin on a sunny day and on a cloudy day, and below is what they got. Do you think the number of head the anteater got depends on the weather? Please write out your null hypothesis and use appropriate statistical analysis to test your hypothesis.

Sunny day: 45 head, 55 tail

Cloudy day: 65 head, 35 tail

Fisher’s Exact tests can calculate the p-value exactly, while Chi-square tests approximate it. But for tables larger than 2x2, we have to use chiseq.test. The usage is very similar.

obs = c(100, 105, 50)
exp = c(100, 100, 100)

m3 = c(obs, exp)
m3 = matrix(m3, nrow = 2, byrow = T) ##by defult, R fill up the columns first!

chisq.test(m3)

    Pearson's Chi-squared test

data:  m3
X-squared = 13.227, df = 2, p-value = 0.001342

Tip: when generating a matrix, remember that R fills up the column firsts. If it is hard to remember (at least for me), then always specify byrow = T or byrow = F to avoid confusions.


Comparing distributions - Kolmogorov–Smirnov test

As we discussed last week, sometimes, the sample distributions could have the same mean and similar variance, but they in fact look very different when you make histograms. Unlike t.test() that tests whether the mean is the same and Wilcox.test() that tests whether the median is the same, ks.test() compares the whole distribution.

v1 = rnorm(100, 0, 4)
v2 = rnorm(100, 0, 1)

par(mfrow = c(1, 2)) #put two figures (1 row, 2 columns) in one plot
hist(v1)
hist(v2)

#compare their distribution using ks test
ks.test(v1, v2)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  v1 and v2
D = 0.38, p-value = 1.071e-06
alternative hypothesis: two-sided
#let's try another distiburion with vey similar sd to v1
v3 = rnorm(100, 0, 3.9)
hist(v1)
hist(v3)

ks.test(v1, v3)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  v1 and v3
D = 0.07, p-value = 0.9671
alternative hypothesis: two-sided

Testing associations - correlation tests

To test the associations between two variables, we have multiple ways to do it. In fact, t.test() and wilcox.test() are testing whether the values of one continuous variable (e.g., age - continuous variable) is associated with the value of a categorical variable (e.g., positions in baseball games).

Similarly, Fisher.test() and chisq.test() can also be used to test the associations between two categorical variables (see above examples).

For two continuous variables, we usually use Pearson correlation tests cor.test() to test whether their associations are statistically significant and how strong the association is. Pearson correlation is a parametric test, with the normality assumptions.

Let’s look at an example we covered last week.

Based on this figure, what null hypothesis (H0) would you propose?

cor.test(x = data_mlb2$Weight, y = data_mlb2$Height)

    Pearson's product-moment correlation

data:  data_mlb2$Weight and data_mlb2$Height
t = 20.192, df = 1032, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4869843 0.5744776
sample estimates:
      cor 
0.5321502 

Pearson correlation tests can easily be biased by outliers.

Do you think there is an association between x and y in the below figure?

cor.test(v1, v2)

    Pearson's product-moment correlation

data:  v1 and v2
t = 3.5893, df = 11, p-value = 0.004248
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3081909 0.9151178
sample estimates:
      cor 
0.7344588 

Nonparametric Spearman correlation test is less sensitive to outlier values. It can be specified using the method = "spearman option (remember the " ").

cor.test(v1, v2, method = "spearman")

    Spearman's rank correlation rho

data:  v1 and v2
S = 392, p-value = 0.8065
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.07692308 

Q4 (1 point)

Is there any association between the Age and Weight of the MLB players? Please write out your null hypothesis and use appropriate statistical analysis to test your hypothesis.