v1 = c(1, 3, 5, 7)
v2 = 10:20
v3 = seq(0,20,2)
v4 = rep(3, 4)Week3 E115 L: R Basics (2) and one variable statistics
by Grace Yuh Chwen Lee, Winter 2025
Recap from last week…
Vectors contain multiple values and following are some general ways to create vectors
In addition to vectors, there are several other useful data structures in R, such as matrices, lists, and data frames!
Matrices
Matrices are very much like vectors (for example, they can only contain one kind of data), but they have 2 dimensions (rows and columns). By default, matrices are filled one column at a time.
m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m1 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
m2 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m2 [,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
m3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = FALSE)
m3 [,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
Logic operators can also be used on matrices. BUT the matrices need to be of the same dimensions.
Try the following code to get a sense what they are doing. What answers did you get?
m1 == m2m1 == m3You can transpose a matrix by t( ) function.
t(m1) [,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
t(m1) == m3 [,1] [,2]
[1,] TRUE FALSE
[2,] FALSE FALSE
[3,] FALSE TRUE
Matrices can also be created by combining vectors, either combining as columns cbind or as rows rbind.
a = 1:3
b = 4:6
cbind(a, b) a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
rbind(a, b) [,1] [,2] [,3]
a 1 2 3
b 4 5 6
Tip: if you are confused by what a function is doing, remember you can always consult help by typing ?fun.
Matrices can be sampled in the same way as vectors, but you need to specify rows and columns as matrix[row,column]
#recall matrix 1 again
m1 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
#call some elements in matrix 1 by specifying row and column
m1[1, 3][1] 5
m1[2, 2][1] 4
Let’s spend some time and figure out what the following codes are doing.
m1[,2] > 3 [1] FALSE TRUE
bool = m1[,2] > 3
m1[bool,][1] 2 4 6
Data frame
Data frames are the most commonly used data structure for data analysis. They are like spreadsheet in R. Technically, data frames are lists, where every column is a vector and all vectors are of the same size. There are two dimensions of a data frame: the number of columns ncol and the number of rows nrow.
We can create a data frame by merging vectors, using data.frame( ).
#create a data frame
time = 0:24
N = 2^time
growth = data.frame(time, N)Check how
growthshows up in the Environment window.
You can also change the name of a data.frame during this process. names(data.frame) tells you the column names of that data.frame.
growth2 = data.frame(new_time = time, new_N = N)
names(growth)[1] "time" "N"
names(growth2)[1] "new_time" "new_N"
Like other types of variables, we could type their names in console to see their values. However, oftentimes, data frames are big, making it hard to read if print out the whole spreadsheet. Instead, we could use head to see the first few rows of a data frame.
head(growth) time N
1 0 1
2 1 2
3 2 4
4 3 8
5 4 16
6 5 32
Alternatively, we could specify which rows and columns we want to see, again using [ ].
growth[3,1]
growth[3,2]
growth[3,]
growth[,1]We can also call columns by their name using $.
growth$time [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
growth$time[2][1] 1
growth$N[growth$time > 3]Get the basics of a data frame
Oftentimes, you get a data frame by loading data, which we will learn in a second. To get a quick idea how the data frame looks like (especially when it is big), try the following functions.
treatment = c(rep('control', 5), rep('drug1', 5), rep('drug2', 5))
genotype = rep(LETTERS[1:5], 3)
phenotype = runif(15, 0, 1)
tdata = data.frame(treatment, genotype, phenotype)#get the nubmer of columns (ncol) for a data frame
length(tdata[1,])[1] 3
#get the number of rows (nrow) for a data frame
length(tdata[,1])[1] 15
#super helpful function for data frame!
summary(tdata) treatment genotype phenotype
Length:15 Length:15 Min. :0.0448
Class :character Class :character 1st Qu.:0.2994
Mode :character Mode :character Median :0.5526
Mean :0.5341
3rd Qu.:0.7805
Max. :0.9796
table(tdata$treatment)table(tdata$genotype)summary(tdata$phenotype)
table(tdata$phenotype > 0.5)Clean up data frames
Sometimes (or oftentimes?), experiments failed. That results in some data missing for the entire experiment. However, as long as one has sufficient numbers of observations, one can still do statistical analysis. Even if the amount of data is insufficient, getting a sense of the data you already have also help plan for future experiments. Now we will learn how to deal with missing data in a variable and data.frame.
In R, missing data is specified as NA (without quotation marks; "NA" is interpreted as character). You can use is.na( )function to check if a variable is missing data.
a = NA
is.na(a)[1] TRUE
v_wNA = c(19, 32, 28, NA, 10)
is.na(v_wNA)[1] FALSE FALSE FALSE TRUE FALSE
Many R functions returns NA if any of the value in a vector is missing data.
sum(v_wNA)[1] NA
There are different ways dealing with missing data.
?sum
sum(v_wNA, na.rm = TRUE)
?na.omit
sum(na.omit(v_wNA))In the following classes, we will learn other techniques to deal with missing data in data.frame.
Setting the Working Directory (Important!!)
Setting the working directory gives R the specific location (i.e. file path) where you want it to read in data/files and write out results to. You would need this to load the data you collected during the wet labs!
You can set this manually by clicking Session > Set Working Directory > Choose Directory… Alternatively, you can set the working directory with the setwd() function (if you are familiar with “path” on your computer):
setwd("/Users/Anteater/Desktop")To check your working directory, use the getwd() function.
Importing Data
After setting the working directories, we are now ready to load data.
Before loading your own data, let’s try playing with some existing data. If you haven’t already, please download the following two data sets from Canvas to where you set your working directory.
- Week2_R_basics_MLB_data.tsv
data_mlb1 = read.table("Week2_R_basics_MLB_data.tsv")
length(data_mlb1[,1])
names(data_mlb1)
summary(data_mlb1)data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = T)
length(data_mlb2[,1])
names(data_mlb2)
summary(data_mlb2)Remember, you need the " and " around the file name. It is always a good idea to check whether the whole data.frame was properly loaded using length(). To get a quick sense of the data, use summary().
Let’s learn how to get a subset of data
What is statistics?
In biology we often want to make inferences about a whole population, but most of the time we only have a sample. Statistics allows us to make inferences about the whole population.
Population - All the individual units of interest.
Sample - Is a subset of units taken from the whole population.
Are the followings populations or samples? It depends! Let’s take a look together.
All UCI students
All fruit flies in California
All voters in the US
When taking samples from populations, we want to make sure that the samples are unbiased with respect to the whole population.
All students in E115L as a sample of UCI students. Do you think this is an unbiased sample and why?
Statistics allows us to the following and we will learn how to use R to tackle them.
- Summarize data about a sample.
- Estimate an unknown quantity from a population using samples.
- Compare different samples (and populations), and therefore test different hypotheses.
- Predict future outcomes or unknown data.
Summarize data
We can use different measures to describe a population (if we ever had the values of the whole population). These are usually referred to as population parameters. There are two measures you are already pretty familiar with - mean and variance, which measure the central tendency and spread of the data respectively.
Measure of central tendency: the value which population centered around. These include mean (average), the mode (the most common value), and the median (the value separating the bottom and top half of the data). Two of these could be easily estimated from the sample using following R functions.
v1 = c(11, 13, 15, 17, 19, 12, 14, 11, 13, 15, 17, 19, 11, 13, 18, 17, 17, 17)#mean
mean(v1)
#median
median(v1)Or we could use summary( ) to have a quick overview of the sample.
summary(v1) Min. 1st Qu. Median Mean 3rd Qu. Max.
11.00 13.00 15.00 14.94 17.00 19.00
R does not have built-in function to find mode, but we could try building our own code based on what we learn so far.
table(v1)v1
11 12 13 14 15 17 18 19
3 1 3 1 2 5 1 2
?table
sort(table(v1))v1
12 14 18 15 19 11 13 17
1 1 1 2 2 3 3 5
?sort
sort(table(v1), decreasing = T)v1
17 11 13 15 19 12 14 18
5 3 3 2 2 1 1 1
Measure of variation: the value that describes how spread out the data is. These include variance, standard deviation, and coefficient of variance.
#variance
var(v1)[1] 7.702614
#standard deviation
sd(v1)[1] 2.775358
#another way to calculate standard deviation
sqrt(var(v1))[1] 2.775358
Coefficient of variance is defined as standard deviation/mean. It basically describes the amount of variability with respect to mean of the sample. This is a very useful measure, especially when compare between variables with different units. Let’s look at the following example.
#unit: cm
vlength_cm = c(123, 231, 145, 345, 138, 195, 162, 174, 158, 120)
sd(vlength_cm)
#unit: m
vlength_m = vlength_cm/100
sd(vlength_m)Let’s see what if we use coefficient of variance instead.
#units: cm
sd(vlength_cm)/mean(vlength_cm)
#units: m
sd(vlength_m)/mean(vlength_m)When comparing samples of two populations, in addition to comparing the mean, it is always helpful to also compare the variance/standard deviation/coefficient of variance. Some distributions could have the same mean, but very different shape.
Summarize data using plots
In addition to getting the centrality and spread of a sample, another very useful way to analyze data is to “plot it out.” R has many built in functions to make easy plots. I recommend plotting out the data even before estimating mean and variance.
We will be using the MLB data, which we loaded above. If you missed doing the exercise, please load the data now.
Tip: remember the " " around the file name.
data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = T)
summary(data_mlb2) Name Team Position Height
Length:1034 Length:1034 Length:1034 Min. :67.0
Class :character Class :character Class :character 1st Qu.:72.0
Mode :character Mode :character Mode :character Median :74.0
Mean :73.7
3rd Qu.:75.0
Max. :83.0
Weight Age
Min. :150.0 Min. :20.90
1st Qu.:187.0 1st Qu.:25.44
Median :200.0 Median :27.93
Mean :201.7 Mean :28.74
3rd Qu.:215.0 3rd Qu.:31.23
Max. :290.0 Max. :48.52
Dot plots
#xlab: x-axis label, ylab: y-axis label, main: title of the plot
plot(data_mlb2$Age, xlab = "ith individual", ylab = "Age", main = "age of MLB players")#let's make it prettier by making some asesthetic changes
#compare the above and below lines of code to see how it works!
plot(data_mlb2$Age, xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")col specifies the color of the dot. R knows the names of commonly used colors (blue, red, green etc). You could try typing them in Rstudio. If R studio knows what it is, it will highlight the color names with corresponding color (Try it!).
Or you can consult this very exhaustive list https://stat.columbia.edu/~tzheng/files/Rcolor.pdf.
Tip: remember to have " " around the color name.
pch specifies the shape of the dot. A quick reference
#let's order the individual by age and plot their age. Is this helpful?
plot(sort(data_mlb2$Age), xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")Histograms
Histograms are really helpful summarizing data of continuous variables. Let’s make a histogram, again with the age of the MLB players.
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players")#try the following and see what makes the histogram looks different
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = 40)breaks specifies the number of bins in a histogram, with a default of 20 (i.e., if you do not specify bin numbers, R will automatically use 20 bins).
Another way to use breaks is to provide a vector for where the breaks need to be. This may be handy when you want to compare two distributions (hint: you may use it when analyzing the L-D data).
#if you forget what seq is doing
?seq
#specifying the breaks are between 20-50, with a step of 5
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = seq(20, 50, by = 5))#specifing the breaks are between 20-50, with a step of 1
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = seq(20, 50, by = 1))You can also change the color of the histogram. Try the following lines of code and see what they are doing.
#specify col
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", col = "olivedrab")#specify border
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", border = "olivedrab")Boxplots
Boxplot is another type of plots that provide a quick overall summary of the data.
boxplot(data_mlb2$Age)Let’s look at the plot together to understand what it is saying (median, 1st and 3rd quartile, 95% range, and outliers).
You can again change the appearance of the boxplot.
boxplot(data_mlb2$Age, pch = 20, col = "olivedrab")X-Y plots
X-y plots are very commonly used to explore the relationship between two variables.
#exploring the relationship between age and weight
plot(x = data_mlb2$Age, y = data_mlb2$Weight, xlab = "Age", ylab = "weight", pch = 20)#exploring the relationship between weigth and height
plot(x = data_mlb2$Weight, y = data_mlb2$Height, xlab = "weight", ylab = "height", pch = 20)Do yo think there is any association between age and weight? How about between weight and height?
We will learn how to formally test these possibilities later!
Adding lines to existing plots
It is oftentimes helpful to add some lines to an existing plot to help us visualize the data. Look at the following examples.
va = rnorm(n = 10, mean = 0, sd = 2)
vb = rnorm(n = 1000, mean = 0, sd = 2)
#using lines() to add vertical line (here, adding a line at where the expected mean is)
hist(va, main = "va")
lines(c(0, 0), c(0, 1000), col = "red")hist(vb, main = "vb")
lines(c(0, 0), c(0, 1000), col = "red")#using lines to add a horizontal line, allowing easily identifying individuals younder/older than 30
plot(sort(data_mlb2$Age), xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")
lines(c(0, 10000), c(30, 30), col = "red")vc = rnorm(20, 0, 20)
vd = rnorm(20, 0, 20)
plot(vc, vd, xlab = "vc", ylab = "vd", pch = 20, xlim = c(-40, 40), ylim = c(-40, 40))
abline(a = 0, b = 1, col = "red") #a diagnoal line, with a intercept and b slope
abline(h = 0, col = "blue") #a horizontal line with h intercept
abline(v = 0, col = "olivedrab") #a vertical line with v interceptSummarize and compare between subsets of data
We will use the MLB data again to learn how to use plots and mean/variance estimates to summarize subsets of data.
#summary() only provides an overview of the numerical data
summary(data_mlb2) Name Team Position Height
Length:1034 Length:1034 Length:1034 Min. :67.0
Class :character Class :character Class :character 1st Qu.:72.0
Mode :character Mode :character Mode :character Median :74.0
Mean :73.7
3rd Qu.:75.0
Max. :83.0
Weight Age
Min. :150.0 Min. :20.90
1st Qu.:187.0 1st Qu.:25.44
Median :200.0 Median :27.93
Mean :201.7 Mean :28.74
3rd Qu.:215.0 3rd Qu.:31.23
Max. :290.0 Max. :48.52
#use table() for categorical data, which we will use a lot in the class
table(data_mlb2$Position)
Catcher Designated_Hitter First_Baseman Outfielder
76 18 55 194
Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
315 58 52 221
Third_Baseman
45
Let’s say we only care about the age of starting pitcher.
Make sure you understand what the following lines of code are doing. This is very critical for analyzing your data.
bool = data_mlb2$Position == "Starting_Pitcher"
data_mlb2$Age[bool]
#the abvove code can be simplied as
data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]#we could save this subset of data to a new vector
age_sp = data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
#then, we could make plots of only the age of outfielder
hist(age_sp, xlab = "age", main = "age of starting pitcher")Let’s compare it to the age distribution of all MLB players. Can you see any difference?
Tip: when comparing between plots, be mindful about the range of axis.
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players")You may notice “oh wait, this plot also includes starting pitcher.” A more proper way is to exclude starting pitcher.
hist(data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], xlab = "age", main = "age of MLB players")Can you explain what the above code is doing? Are the histograms with and without starting pitcher similar or different?
We could also use boxplot to compare the age between starting pitcher and other players.
boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])You can specify the names of the first and second boxes using names = c(" ", " ").
boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], names = c("starting pitcher", "others"), pch = 20)How would you describe the age of starting pitcher, compared to other MLB players?
Does adding a horizontal line help with your interpretation?
boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], names = c("starting pitcher", "others"), pch = 20)
abline(h = median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher']), col = "red")You can also compare the age between starting and relief pitcher.
boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position == 'Relief_Pitcher'], names = c("starting pitcher", "relief pitcher"), pch = 20)
abline(h = median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher']), col = "red")We will learn how to formally test whether the age of staring pitcher and all the MLB players are different later in class.
Estimate unknown of a population
Statistics allows us to the following .
- Summarize data about a sample.
- Estimate an unknown quantity from a population using samples.
- Compare different samples (and populations), and therefore test different hypotheses.
- Predict future outcomes or unknown data.
Because we could never measure every individual of a population, we can only take a sample out of the population. Sample size (the number of individuals you could measure) plays a big role in the estimation.
Let’s do some R simulations to explore this idea.
?rnorm
my_mean = 3
my_sd = 3A normal distribution with mean = 3 and standard deviation = 3 should look like
Let’s try generating two sets of random samples.
Try it! You should get slightly different numbers from mine.
#only 10 samples
sample1 = rnorm(10, mean = my_mean, sd = my_sd)
hist(sample1, main = "10 samples")mean(sample1)[1] 2.22574
sd(sample1)[1] 3.804489
#100000 samples
sample2 = rnorm(100000, mean = my_mean, sd = my_sd)
hist(sample2, main = "100000 samples")mean(sample2)[1] 2.97731
sd(sample2)[1] 3.000579
Which one of the above sample distributions do you think is more similar to the true normal distribution? Why?
Compare - hypothesis testing
Statistics allows us to the following .
- Summarize data about a sample.
- Estimate an unknown quantity from a population using samples.
- Compare different samples (and populations), and therefore test different hypotheses.
- Predict future outcomes or unknown data.
When we compare between samples of different populations, we could compare plots and measures of the samples (i.e., mean and variance). But such a comparison is rarely clear cut.
Do you think the following two sample distributions have the same mean?
To formally test whether two distributions are similar (or whether a distribution is the same as the theoretical distribution), we can perform formal hypothesis testing. It is worth noting that, in statistics, hypothesis testing is not limited to comparing two distributions; and we will learn about other hypothesis testing later in class.
To do hypothesis testing, we need…
- A null hypothesis (H0). Usually (but not always), a null hypothesis states that there is no effect (.e.g, no difference, no association etc).
- Construct a distribution of a test statistics assuming H0 (e.g., t, x2 etc)
- Compare the test statistic estimated from the sample to (2) to obtain p-value.
p-value describes the probability of a value as extreme or more as the observed in our data under the null hypothesis H0. We usually use a threshold of 0.05. If the p-value of our statistical test is smaller than that, we say that the null hypothesis is rejected under a p-value threshold of 0.05 and the statistical test is significant.
For us, we only need to think carefully about (1) and choose the appropriate R function. Once we do that, R could help us do (2) and (3). Let’s do some statistical tests using R to put these steps into action.
Here is the boxplot for age of the MLB players again.
Do you think the mean age of starting pitcher and other players are the same or different based on the boxplot?
t.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], y = data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])
Welch Two Sample t-test
data: data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"] and data_mlb2$Age[data_mlb2$Position != "Starting_Pitcher"]
t = -1.8888, df = 335.69, p-value = 0.05978
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.29973562 0.02636856
sample estimates:
mean of x mean of y
28.23611 28.87279
t.test(x = , y = ) tests whether the mean of two samples are the same or different. The null hypothesis (H0) is the mean of x and y are the same. Rejection of the null hypothesis under a specific p-value threshold means that the means of the two samples are significantly different.
Here, the p-value is greater than 0.05. So, we could not reject the null hypothesis under p-value threshold of 0.05. You could of course change your p-value threshold.
t.test can also be used to compare a sample to a null expectation. Say, based on the histogram, we assume that the mean age of starting pitcher should be around 25. Let’s test this idea.
t.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], mu = 25)
One Sample t-test
data: data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
t = 10.713, df = 220, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
27.64080 28.83141
sample estimates:
mean of x
28.23611
Based on the R output, do you think the mean age of starting pitcher is around 25?
t-test is a parametric test, which means that, in order to get the p-value of the test statistics for mean comparisons, t-test assumes some underlying distribution of the test statistics (in this case, normality). What is important for you to remember is this assumption could sometimes be violated, and it is always good to also perform similar comparison using nonparametric tests.
A nonparametric counterpart of t-test is the Mann-Whitney U test (also known as Wilcoxon rank sum test). Instead of comparing the mean (t-test), MWU test compares the median. It’s usage is very similar to t-test.
wilcox.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], y = data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])
Wilcoxon rank sum test with continuity correction
data: data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"] and data_mlb2$Age[data_mlb2$Position != "Starting_Pitcher"]
W = 79807, p-value = 0.01085
alternative hypothesis: true location shift is not equal to 0
Unlike t-test, which summarizes the mean of the two samples, we have to calculate median ourselves.
#median age of starting pitcher
median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'])[1] 27.39
#median age of other players
median(data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])[1] 28.09
Based on the MWU test results, do you think the starting pitcher have significantly different age from other players? If so, older or younger?
wilcox.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], mu = 25)
Wilcoxon signed rank test with continuity correction
data: data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
V = 21069, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 25
We will introduce more tests in future classes!