Week3 E115 L: R Basics (2) and one variable statistics

by Grace Yuh Chwen Lee, Winter 2025


Recap from last week…

Vectors contain multiple values and following are some general ways to create vectors

v1 = c(1, 3, 5, 7)
v2 = 10:20
v3 = seq(0,20,2)
v4 = rep(3, 4)

In addition to vectors, there are several other useful data structures in R, such as matrices, lists, and data frames!

Matrices

Matrices are very much like vectors (for example, they can only contain one kind of data), but they have 2 dimensions (rows and columns). By default, matrices are filled one column at a time.

m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
m2 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
m3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = FALSE)
m3
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
Q1

What are the above codes doing? Is m1 the same as m2 and why? Is m1 the same as m3 and why?

Hint: try ?matrix or help(matrix)

Logic operators can also be used on matrices. BUT the matrices need to be of the same dimensions.

Try the following code to get a sense what they are doing. What answers did you get?

m1 == m2
m1 == m3

You can transpose a matrix by t( ) function.

t(m1)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
t(m1) == m3
      [,1]  [,2]
[1,]  TRUE FALSE
[2,] FALSE FALSE
[3,] FALSE  TRUE

Matrices can also be created by combining vectors, either combining as columns cbind or as rows rbind.

a = 1:3
b = 4:6
cbind(a, b)
     a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
rbind(a, b)
  [,1] [,2] [,3]
a    1    2    3
b    4    5    6

Tip: if you are confused by what a function is doing, remember you can always consult help by typing ?fun.

Matrices can be sampled in the same way as vectors, but you need to specify rows and columns as matrix[row,column]

#recall matrix 1 again
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
#call some elements in matrix 1 by specifying row and column
m1[1, 3]
[1] 5
m1[2, 2]
[1] 4

Let’s spend some time and figure out what the following codes are doing.

m1[,2] > 3 
[1] FALSE  TRUE
bool = m1[,2] > 3
m1[bool,]
[1] 2 4 6
Q2

Write 6 temperatures (32, 69, 72, 80, 100, 621) in Fahrenheit and save it in a vector.

Make a matrix with two columns, one with degrees in Fahrenheit and one in Celsius.

Subset every row where the degrees Celsius are more than 25 (use a logical statement to do this).


Data frame

Data frames are the most commonly used data structure for data analysis. They are like spreadsheet in R. Technically, data frames are lists, where every column is a vector and all vectors are of the same size. There are two dimensions of a data frame: the number of columns ncol and the number of rows nrow.

We can create a data frame by merging vectors, using data.frame( ).

#create a data frame
time = 0:24
N = 2^time
growth = data.frame(time, N)

Check how growth shows up in the Environment window.

You can also change the name of a data.frame during this process. names(data.frame) tells you the column names of that data.frame.

growth2 = data.frame(new_time = time, new_N = N)

names(growth)
[1] "time" "N"   
names(growth2)
[1] "new_time" "new_N"   

Like other types of variables, we could type their names in console to see their values. However, oftentimes, data frames are big, making it hard to read if print out the whole spreadsheet. Instead, we could use head to see the first few rows of a data frame.

head(growth)
  time  N
1    0  1
2    1  2
3    2  4
4    3  8
5    4 16
6    5 32

Alternatively, we could specify which rows and columns we want to see, again using [ ].

Q3

Try the following code. Can you figure out which one is specifying row and which one is specifying column?

growth[3,1]
growth[3,2]
growth[3,]
growth[,1]

We can also call columns by their name using $.

growth$time
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
growth$time[2]
[1] 1
Q4

What the below code is doing? What answer do you get?

growth$N[growth$time > 3]

Get the basics of a data frame

Oftentimes, you get a data frame by loading data, which we will learn in a second. To get a quick idea how the data frame looks like (especially when it is big), try the following functions.

treatment = c(rep('control', 5), rep('drug1', 5), rep('drug2', 5))
genotype = rep(LETTERS[1:5], 3)
phenotype = runif(15, 0, 1)

tdata = data.frame(treatment, genotype, phenotype)
#get the nubmer of columns (ncol) for a data frame
length(tdata[1,])
[1] 3
#get the number of rows (nrow) for a data frame
length(tdata[,1])
[1] 15
#super helpful function for data frame!
summary(tdata)
  treatment           genotype           phenotype     
 Length:15          Length:15          Min.   :0.0448  
 Class :character   Class :character   1st Qu.:0.2994  
 Mode  :character   Mode  :character   Median :0.5526  
                                       Mean   :0.5341  
                                       3rd Qu.:0.7805  
                                       Max.   :0.9796  
Q5

Explain how you could use the following code to quickly get a sense of the type of treatment and genotypes as well as the phenotypic outcome of this hypothetical experimental data.

table(tdata$treatment)
table(tdata$genotype)
summary(tdata$phenotype)
table(tdata$phenotype > 0.5)

Clean up data frames

Sometimes (or oftentimes?), experiments failed. That results in some data missing for the entire experiment. However, as long as one has sufficient numbers of observations, one can still do statistical analysis. Even if the amount of data is insufficient, getting a sense of the data you already have also help plan for future experiments. Now we will learn how to deal with missing data in a variable and data.frame.

In R, missing data is specified as NA (without quotation marks; "NA" is interpreted as character). You can use is.na( )function to check if a variable is missing data.

a = NA
is.na(a)
[1] TRUE
v_wNA = c(19, 32, 28, NA, 10)
is.na(v_wNA)
[1] FALSE FALSE FALSE  TRUE FALSE

Many R functions returns NA if any of the value in a vector is missing data.

sum(v_wNA)
[1] NA

There are different ways dealing with missing data.

Q6

Try the following codes (including calling the manual of the functions) and describe how each function was used to address missing data.

?sum
sum(v_wNA, na.rm = TRUE)

?na.omit
sum(na.omit(v_wNA))

In the following classes, we will learn other techniques to deal with missing data in data.frame.


Setting the Working Directory (Important!!)

Setting the working directory gives R the specific location (i.e. file path) where you want it to read in data/files and write out results to. You would need this to load the data you collected during the wet labs!

You can set this manually by clicking Session > Set Working Directory > Choose Directory… Alternatively, you can set the working directory with the setwd() function (if you are familiar with “path” on your computer):

setwd("/Users/Anteater/Desktop")

To check your working directory, use the getwd() function.


Importing Data

After setting the working directories, we are now ready to load data.

Before loading your own data, let’s try playing with some existing data. If you haven’t already, please download the following two data sets from Canvas to where you set your working directory.

  • Week2_R_basics_MLB_data.tsv
Q7

Try the following two codes. What data types are the columns in data_mlb1 and data_mlb2?

What is the major difference between them and which one you think we should use?

Hint: try ?read.table

data_mlb1 = read.table("Week2_R_basics_MLB_data.tsv")
length(data_mlb1[,1])
names(data_mlb1)
summary(data_mlb1)
data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = T)
length(data_mlb2[,1])
names(data_mlb2)
summary(data_mlb2)

Remember, you need the " and " around the file name. It is always a good idea to check whether the whole data.frame was properly loaded using length(). To get a quick sense of the data, use summary().

Let’s learn how to get a subset of data

Q8 and Q9

Practice what you learn above for extracting a subset from data frames….

  • Get the first five rows of data_mlb2 and save it to data_mlb2_subset1. Then, get the mean Height for this subset of data.

  • Create a vector of logical values for rows with heights larger than 70. Use that vector to create a subset of data and save it to data_mlb2_subset2. Get the mean Weight for these individuals.

What is statistics?

In biology we often want to make inferences about a whole population, but most of the time we only have a sample. Statistics allows us to make inferences about the whole population.

Population - All the individual units of interest.

Sample - Is a subset of units taken from the whole population.

Are the followings populations or samples? It depends! Let’s take a look together.

All UCI students

All fruit flies in California

All voters in the US

When taking samples from populations, we want to make sure that the samples are unbiased with respect to the whole population.

All students in E115L as a sample of UCI students. Do you think this is an unbiased sample and why?

Statistics allows us to the following and we will learn how to use R to tackle them.

- Summarize data about a sample.

- Estimate an unknown quantity from a population using samples.

- Compare different samples (and populations), and therefore test different hypotheses.

- Predict future outcomes or unknown data.


Summarize data

We can use different measures to describe a population (if we ever had the values of the whole population). These are usually referred to as population parameters. There are two measures you are already pretty familiar with - mean and variance, which measure the central tendency and spread of the data respectively.

Measure of central tendency: the value which population centered around. These include mean (average), the mode (the most common value), and the median (the value separating the bottom and top half of the data). Two of these could be easily estimated from the sample using following R functions.

v1 = c(11, 13, 15, 17, 19, 12, 14, 11, 13, 15, 17, 19, 11, 13, 18, 17, 17, 17)
#mean
mean(v1)

#median 
median(v1)

Or we could use summary( ) to have a quick overview of the sample.

summary(v1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11.00   13.00   15.00   14.94   17.00   19.00 

R does not have built-in function to find mode, but we could try building our own code based on what we learn so far.

table(v1)
v1
11 12 13 14 15 17 18 19 
 3  1  3  1  2  5  1  2 
?table

sort(table(v1))
v1
12 14 18 15 19 11 13 17 
 1  1  1  2  2  3  3  5 
?sort

sort(table(v1), decreasing = T)
v1
17 11 13 15 19 12 14 18 
 5  3  3  2  2  1  1  1 

Measure of variation: the value that describes how spread out the data is. These include variance, standard deviation, and coefficient of variance.

#variance
var(v1)
[1] 7.702614
#standard deviation
sd(v1)
[1] 2.775358
#another way to calculate standard deviation
sqrt(var(v1))
[1] 2.775358

Coefficient of variance is defined as standard deviation/mean. It basically describes the amount of variability with respect to mean of the sample. This is a very useful measure, especially when compare between variables with different units. Let’s look at the following example.

#unit: cm
vlength_cm = c(123, 231, 145, 345, 138, 195, 162, 174, 158, 120)
sd(vlength_cm)

#unit: m
vlength_m = vlength_cm/100
sd(vlength_m)

Let’s see what if we use coefficient of variance instead.

#units: cm
sd(vlength_cm)/mean(vlength_cm)

#units: m
sd(vlength_m)/mean(vlength_m)
Q10

Are the variances the same for different when the length is measured in cm vs m above? How about for coefficient of variance? Does coefficient of variance have unit?

When comparing samples of two populations, in addition to comparing the mean, it is always helpful to also compare the variance/standard deviation/coefficient of variance. Some distributions could have the same mean, but very different shape.


Summarize data using plots

In addition to getting the centrality and spread of a sample, another very useful way to analyze data is to “plot it out.” R has many built in functions to make easy plots. I recommend plotting out the data even before estimating mean and variance.

We will be using the MLB data, which we loaded above. If you missed doing the exercise, please load the data now.

Tip: remember the " " around the file name.

data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = T)
summary(data_mlb2)
     Name               Team             Position             Height    
 Length:1034        Length:1034        Length:1034        Min.   :67.0  
 Class :character   Class :character   Class :character   1st Qu.:72.0  
 Mode  :character   Mode  :character   Mode  :character   Median :74.0  
                                                          Mean   :73.7  
                                                          3rd Qu.:75.0  
                                                          Max.   :83.0  
     Weight           Age       
 Min.   :150.0   Min.   :20.90  
 1st Qu.:187.0   1st Qu.:25.44  
 Median :200.0   Median :27.93  
 Mean   :201.7   Mean   :28.74  
 3rd Qu.:215.0   3rd Qu.:31.23  
 Max.   :290.0   Max.   :48.52  

Dot plots

#xlab: x-axis label, ylab: y-axis label, main: title of the plot
plot(data_mlb2$Age, xlab = "ith individual", ylab = "Age", main = "age of MLB players")

#let's make it prettier by making some asesthetic changes
#compare the above and below lines of code to see how it works!
plot(data_mlb2$Age, xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")

col specifies the color of the dot. R knows the names of commonly used colors (blue, red, green etc). You could try typing them in Rstudio. If R studio knows what it is, it will highlight the color names with corresponding color (Try it!).

Or you can consult this very exhaustive list https://stat.columbia.edu/~tzheng/files/Rcolor.pdf.

Tip: remember to have " " around the color name.

pch specifies the shape of the dot. A quick reference

After making the plot pretty, let’s think about - is this plot helpful? What might be more helpful ways presenting the data?

#let's order the individual by age and plot their age. Is this helpful?
plot(sort(data_mlb2$Age), xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")

Histograms

Histograms are really helpful summarizing data of continuous variables. Let’s make a histogram, again with the age of the MLB players.

hist(data_mlb2$Age, xlab = "age", main = "age of MLB players")

#try the following and see what makes the histogram looks different
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = 40)

breaks specifies the number of bins in a histogram, with a default of 20 (i.e., if you do not specify bin numbers, R will automatically use 20 bins).

With these histograms, could you provide a quick summary of the age of MLB players? Could you quickly spot “outliers”?

Another way to use breaks is to provide a vector for where the breaks need to be. This may be handy when you want to compare two distributions (hint: you may use it when analyzing the L-D data).

#if you forget what seq is doing
?seq

#specifying the breaks are between 20-50, with a step of 5
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = seq(20, 50, by = 5))

#specifing the breaks are between 20-50, with a step of 1
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = seq(20, 50, by = 1))

You can also change the color of the histogram. Try the following lines of code and see what they are doing.

#specify col
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", col = "olivedrab")

#specify border
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", border = "olivedrab")

Boxplots

Boxplot is another type of plots that provide a quick overall summary of the data.

boxplot(data_mlb2$Age)

Let’s look at the plot together to understand what it is saying (median, 1st and 3rd quartile, 95% range, and outliers).

Think about what information you could get with a boxplot but not a histogram and vice versa.

You can again change the appearance of the boxplot.

boxplot(data_mlb2$Age, pch = 20, col = "olivedrab")

X-Y plots

X-y plots are very commonly used to explore the relationship between two variables.

#exploring the relationship between age and weight
plot(x = data_mlb2$Age, y = data_mlb2$Weight, xlab = "Age", ylab = "weight", pch = 20)

#exploring the relationship between weigth and height
plot(x = data_mlb2$Weight, y = data_mlb2$Height, xlab = "weight", ylab = "height", pch = 20)

Do yo think there is any association between age and weight? How about between weight and height?

We will learn how to formally test these possibilities later!

Adding lines to existing plots

It is oftentimes helpful to add some lines to an existing plot to help us visualize the data. Look at the following examples.

va = rnorm(n = 10, mean = 0, sd = 2)
vb = rnorm(n = 1000, mean = 0, sd = 2)

#using lines() to add vertical line (here, adding a line at where the expected mean is)
hist(va, main = "va")
lines(c(0, 0), c(0, 1000), col = "red")

hist(vb, main = "vb")
lines(c(0, 0), c(0, 1000), col = "red")

#using lines to add a horizontal line, allowing easily identifying individuals younder/older than 30
plot(sort(data_mlb2$Age), xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")
lines(c(0, 10000), c(30, 30), col = "red")

vc = rnorm(20, 0, 20)
vd = rnorm(20, 0, 20)
plot(vc, vd, xlab = "vc", ylab = "vd", pch = 20, xlim = c(-40, 40), ylim = c(-40, 40))
abline(a = 0, b = 1, col = "red") #a diagnoal line, with a intercept and b slope
abline(h = 0, col = "blue")       #a horizontal line with h intercept
abline(v = 0, col = "olivedrab")  #a vertical line with v intercept

Note that lines() and abline() always need to add to existing plot and is always added to the most recent generated (and still opened) plot.

The range of x and y-axis can be specified by xlim = c(start, end) and ylim = c(start, end).


Summarize and compare between subsets of data

We will use the MLB data again to learn how to use plots and mean/variance estimates to summarize subsets of data.

#summary() only provides an overview of the numerical data
summary(data_mlb2)
     Name               Team             Position             Height    
 Length:1034        Length:1034        Length:1034        Min.   :67.0  
 Class :character   Class :character   Class :character   1st Qu.:72.0  
 Mode  :character   Mode  :character   Mode  :character   Median :74.0  
                                                          Mean   :73.7  
                                                          3rd Qu.:75.0  
                                                          Max.   :83.0  
     Weight           Age       
 Min.   :150.0   Min.   :20.90  
 1st Qu.:187.0   1st Qu.:25.44  
 Median :200.0   Median :27.93  
 Mean   :201.7   Mean   :28.74  
 3rd Qu.:215.0   3rd Qu.:31.23  
 Max.   :290.0   Max.   :48.52  
#use table() for categorical data, which we will use a lot in the class
table(data_mlb2$Position)

          Catcher Designated_Hitter     First_Baseman        Outfielder 
               76                18                55               194 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
              315                58                52               221 
    Third_Baseman 
               45 

Let’s say we only care about the age of starting pitcher.

Make sure you understand what the following lines of code are doing. This is very critical for analyzing your data.

bool = data_mlb2$Position == "Starting_Pitcher"
data_mlb2$Age[bool]

#the abvove code can be simplied as
data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
#we could save this subset of data to a new vector
age_sp = data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]

#then, we could make plots of only the age of outfielder
hist(age_sp, xlab = "age", main = "age of starting pitcher")

Let’s compare it to the age distribution of all MLB players. Can you see any difference?

Tip: when comparing between plots, be mindful about the range of axis.

hist(data_mlb2$Age, xlab = "age", main = "age of MLB players")

You may notice “oh wait, this plot also includes starting pitcher.” A more proper way is to exclude starting pitcher.

hist(data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], xlab = "age", main = "age of MLB players")

Can you explain what the above code is doing? Are the histograms with and without starting pitcher similar or different?

We could also use boxplot to compare the age between starting pitcher and other players.

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])

You can specify the names of the first and second boxes using names = c(" ", " ").

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], names = c("starting pitcher", "others"), pch = 20)

How would you describe the age of starting pitcher, compared to other MLB players?

Does adding a horizontal line help with your interpretation?

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], names = c("starting pitcher", "others"), pch = 20)
abline(h = median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher']), col = "red")

You can also compare the age between starting and relief pitcher.

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position == 'Relief_Pitcher'], names = c("starting pitcher", "relief pitcher"), pch = 20)
abline(h = median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher']), col = "red")

We will learn how to formally test whether the age of staring pitcher and all the MLB players are different later in class.


Estimate unknown of a population

Statistics allows us to the following .

- Summarize data about a sample.

- Estimate an unknown quantity from a population using samples.

- Compare different samples (and populations), and therefore test different hypotheses.

- Predict future outcomes or unknown data.

Because we could never measure every individual of a population, we can only take a sample out of the population. Sample size (the number of individuals you could measure) plays a big role in the estimation.

Let’s do some R simulations to explore this idea.

?rnorm
my_mean = 3
my_sd = 3

A normal distribution with mean = 3 and standard deviation = 3 should look like

Let’s try generating two sets of random samples.

Try it! You should get slightly different numbers from mine.

#only 10 samples
sample1 = rnorm(10, mean = my_mean, sd = my_sd)
hist(sample1, main = "10 samples")

mean(sample1)
[1] 2.22574
sd(sample1)
[1] 3.804489
#100000 samples
sample2 = rnorm(100000, mean = my_mean, sd = my_sd)
hist(sample2, main = "100000 samples")

mean(sample2)
[1] 2.97731
sd(sample2)
[1] 3.000579

Which one of the above sample distributions do you think is more similar to the true normal distribution? Why?


Compare - hypothesis testing

Statistics allows us to the following .

- Summarize data about a sample.

- Estimate an unknown quantity from a population using samples.

- Compare different samples (and populations), and therefore test different hypotheses.

- Predict future outcomes or unknown data.

When we compare between samples of different populations, we could compare plots and measures of the samples (i.e., mean and variance). But such a comparison is rarely clear cut.

Do you think the following two sample distributions have the same mean?

To formally test whether two distributions are similar (or whether a distribution is the same as the theoretical distribution), we can perform formal hypothesis testing. It is worth noting that, in statistics, hypothesis testing is not limited to comparing two distributions; and we will learn about other hypothesis testing later in class.

To do hypothesis testing, we need…

  1. A null hypothesis (H0). Usually (but not always), a null hypothesis states that there is no effect (.e.g, no difference, no association etc).
  2. Construct a distribution of a test statistics assuming H0 (e.g., t, x2 etc)
  3. Compare the test statistic estimated from the sample to (2) to obtain p-value.

p-value describes the probability of a value as extreme or more as the observed in our data under the null hypothesis H0. We usually use a threshold of 0.05. If the p-value of our statistical test is smaller than that, we say that the null hypothesis is rejected under a p-value threshold of 0.05 and the statistical test is significant.

For us, we only need to think carefully about (1) and choose the appropriate R function. Once we do that, R could help us do (2) and (3). Let’s do some statistical tests using R to put these steps into action.

Here is the boxplot for age of the MLB players again.

Do you think the mean age of starting pitcher and other players are the same or different based on the boxplot?

t.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], y = data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])

    Welch Two Sample t-test

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"] and data_mlb2$Age[data_mlb2$Position != "Starting_Pitcher"]
t = -1.8888, df = 335.69, p-value = 0.05978
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.29973562  0.02636856
sample estimates:
mean of x mean of y 
 28.23611  28.87279 

t.test(x = , y = ) tests whether the mean of two samples are the same or different. The null hypothesis (H0) is the mean of x and y are the same. Rejection of the null hypothesis under a specific p-value threshold means that the means of the two samples are significantly different.

Here, the p-value is greater than 0.05. So, we could not reject the null hypothesis under p-value threshold of 0.05. You could of course change your p-value threshold.

t.test can also be used to compare a sample to a null expectation. Say, based on the histogram, we assume that the mean age of starting pitcher should be around 25. Let’s test this idea.

t.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], mu = 25)

    One Sample t-test

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
t = 10.713, df = 220, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 27.64080 28.83141
sample estimates:
mean of x 
 28.23611 

Based on the R output, do you think the mean age of starting pitcher is around 25?

t-test is a parametric test, which means that, in order to get the p-value of the test statistics for mean comparisons, t-test assumes some underlying distribution of the test statistics (in this case, normality). What is important for you to remember is this assumption could sometimes be violated, and it is always good to also perform similar comparison using nonparametric tests.

A nonparametric counterpart of t-test is the Mann-Whitney U test (also known as Wilcoxon rank sum test). Instead of comparing the mean (t-test), MWU test compares the median. It’s usage is very similar to t-test.

wilcox.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], y = data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])

    Wilcoxon rank sum test with continuity correction

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"] and data_mlb2$Age[data_mlb2$Position != "Starting_Pitcher"]
W = 79807, p-value = 0.01085
alternative hypothesis: true location shift is not equal to 0

Unlike t-test, which summarizes the mean of the two samples, we have to calculate median ourselves.

#median age of starting pitcher
median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'])
[1] 27.39
#median age of other players
median(data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])
[1] 28.09

Based on the MWU test results, do you think the starting pitcher have significantly different age from other players? If so, older or younger?

wilcox.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], mu = 25)

    Wilcoxon signed rank test with continuity correction

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
V = 21069, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 25

We will introduce more tests in future classes!