Week 3 E115L: Statistics basics and the L-D experiment

by Grace Yuh Chwen Lee, Spring 2024


Recap from last week…..

Setting the Working Directory

Setting the working directory gives R the specific location (i.e. file path) where you want it to read in data/files and write out results to. You would need this to load the data you collected during the wet labs!

You can set this manually by clicking Session > Set Working Directory > Choose Directory… Alternatively, you can set the working directory with the setwd() function (if you are familiar with “path” on your computer):

setwd("/Users/Anteater/Desktop")

To check your working directory, use the getwd() function.
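For example:

getwd()

If you ran the setwd() line above, this should print "/Users/Anteater/Desktop".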


Importing Data

After setting the working directory, we are now ready to load data.

Before loading your own data, let’s try playing with some existing data. If you haven’t already, please download the following two data sets from Canvas to where you set your working directory.

  • Week2_R_basics_MLB_weights_vector.tsv

  • Week2_R_basics_MLB_data.tsv

v_mlb = scan(Week2_R_basics_MLB_weights_vector.tsv)
v_mlb = scan("Week2_R_basics_MLB_weights_vector.tsv")

Which one gives you an error message and why?

Let’s get a quick summary of the data.

length(v_mlb)
[1] 1034
summary(v_mlb)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  150.0   187.0   200.0   201.7   215.0   290.0 
data_mlb1 = read.table("Week2_R_basics_MLB_data.tsv")
summary(data_mlb1)
data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = T)
summary(data_mlb2)

What differs between these blocks of code?

Let’s try loading the serial dilution data together again, and figure out what might have gone wrong.

mydata = read.table("serial_dilution.csv")
summary(mydata)
      V1           
 Length:5          
 Class :character  
 Mode  :character  
mydata
                         V1
1 color,dilution_factor,CFU
2             blue,10e-6,57
3              blue,10e-7,9
4          orange,10e-5,201
5             orage,10e-7,1

My data should have three columns, but now it has only one column, with the values still separated by “,”. What should I do? Any other suggestions for how to modify the above code?
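One possible fix (a sketch) is to tell read.table() that the values are separated by commas and that the first row is a header; read.csv() uses these settings by default.

mydata = read.table("serial_dilution.csv", sep = ",", header = TRUE)
#or, equivalently
mydata = read.csv("serial_dilution.csv")
summary(mydata)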


What is statistics

In biology we often want to make inferences about a whole population, but most of the time we only have a sample. Statistics allows us to make inferences about the whole population from that sample.

Population - All the individual units of interest.

Sample - A subset of units taken from the whole population.

Are the following populations or samples? It depends! Let’s take a look together.

All UCI students

All fruit flies in California

All voters in the US

When taking samples from populations, we want to make sure that the samples are unbiased with respect to the whole population.

Consider all students in E115L as a sample of UCI students. Do you think this is an unbiased sample, and why?

Statistics allows us to do the following, and we will learn how to use R to tackle each of them.

- Summarize data about a sample.

- Estimate an unknown quantity from a population using samples.

- Compare different samples (and populations), and therefore test different hypotheses.

- Predict future outcomes or unknown data.


Summarize data

We can use different measures to describe a population (if we ever had the values of the whole population). These are usually referred to as population parameters. There are two measures you are already pretty familiar with - mean and variance, which measure the central tendency and spread of the data respectively.

In the following figures, red denotes the mean, blue denotes the median, and the olive lines denote the standard deviation.

Measures of central tendency: the value around which the population is centered. These include the mean (average), the mode (the most common value), and the median (the value separating the bottom half of the data from the top half). Two of these can be easily estimated from the sample using the following R functions.

v1 = c(11, 13, 15, 17, 19, 12, 14, 11, 13, 15, 17, 19, 11, 13, 18, 17, 17, 17)
#mean
mean(v1)

#median 
median(v1)

Or we could use summary() to get a quick overview of the sample.

summary(v1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11.00   13.00   15.00   14.94   17.00   19.00 

R does not have a built-in function to find the mode, but we can build our own based on what we have learned so far.

Try executing the following code and guess what each line is doing.

table(v1)
?table

sort(table(v1))
?sort

sort(table(v1), decreasing = T)
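Putting these pieces together, one possible sketch of a “find the mode” one-liner is:

#the first element of the decreasingly sorted table is the most common value
names(sort(table(v1), decreasing = TRUE))[1]
#note that names() returns a character string; wrap it in as.numeric() if you need a number
as.numeric(names(sort(table(v1), decreasing = TRUE))[1])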

Measures of variation: values that describe how spread out the data are. These include the variance, the standard deviation, and the coefficient of variation.

#variance
var(v1)

#standard deviation
sd(v1)

#another way to calculate standard deviation
sqrt(var(v1))

The coefficient of variation is defined as standard deviation/mean. It describes the amount of variability relative to the mean of the sample. This is a very useful measure, especially when comparing variables measured in different units. Let’s look at the following example.

#unit: cm
vlength_cm = c(123, 231, 145, 345, 138, 195, 162, 174, 158, 120)
sd(vlength_cm)

#unit: m
vlength_m = vlength_cm/100
sd(vlength_m)

Are the standard deviations the same or different? Note that the measurements are the same; only the units differ.

Let’s see what happens if we use the coefficient of variation instead.

#units: cm
sd(vlength_cm)/mean(vlength_cm)

#units: m
sd(vlength_m)/mean(vlength_m)

Is the coefficient of variation the same or different when the measurements are in different units?

When comparing samples of two populations, in addition to comparing the mean, it is always helpful to also compare the variance/standard deviation/coefficient of variation. Some distributions can have the same mean but very different shapes.


Summarize data using plots

In addition to getting the centrality and spread of a sample, another very useful way to analyze data is to “plot it out.” R has many built-in functions for making quick plots. I recommend plotting out the data even before estimating the mean and variance.

We will be using the MLB data, which we loaded above. If you missed doing the exercise, please load the data now.

Tip: remember the " " around the file name.

data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = T)
summary(data_mlb2)
     Name               Team             Position             Height    
 Length:1034        Length:1034        Length:1034        Min.   :67.0  
 Class :character   Class :character   Class :character   1st Qu.:72.0  
 Mode  :character   Mode  :character   Mode  :character   Median :74.0  
                                                          Mean   :73.7  
                                                          3rd Qu.:75.0  
                                                          Max.   :83.0  
     Weight           Age       
 Min.   :150.0   Min.   :20.90  
 1st Qu.:187.0   1st Qu.:25.44  
 Median :200.0   Median :27.93  
 Mean   :201.7   Mean   :28.74  
 3rd Qu.:215.0   3rd Qu.:31.23  
 Max.   :290.0   Max.   :48.52  

Line plots

#xlab: x-axis label, ylab: y-axis label, main: title of the plot
plot(data_mlb2$Age, xlab = "ith individual", ylab = "Age", main = "age of MLB players")

#let's make it prettier with some aesthetic changes
#compare the above and below lines of code to see how it works!
plot(data_mlb2$Age, xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")

col specifies the color of the points. R knows the names of commonly used colors (blue, red, green, etc.). You can try typing them in RStudio; if RStudio recognizes a color name, it will highlight it in the corresponding color (try it!).

Or you can consult this very exhaustive list https://stat.columbia.edu/~tzheng/files/Rcolor.pdf.
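R also has a built-in function, colors(), that lists every color name it knows:

#colors() returns a character vector of all built-in color names; head() shows the first few
head(colors())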

Tip: remember to have " " around the color name.

pch specifies the shape of the points. For a quick reference of the available pch values, see the help page for points (?points).
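A quick way to see the available shapes is to plot them all at once (a small sketch):

#plot the values 1 through 25, each drawn with its own pch shape
plot(1:25, pch = 1:25)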

After making the plot pretty, let’s think about whether this plot is actually helpful. What might be more helpful ways of presenting the data?

#let's order the individual by age and plot their age. Is this helpful?
plot(sort(data_mlb2$Age), xlab = "ith individual", ylab = "Age", main = "age of MLB players", pch = 20, col = "olivedrab")

Histograms

Histograms are really helpful for summarizing continuous variables. Let’s make a histogram, again with the age of the MLB players.

hist(data_mlb2$Age, xlab = "age", main = "age of MLB players")

#try the following and see what makes the histogram looks different
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = 40)

breaks suggests the number of bins in a histogram; if you do not specify it, R chooses the number of bins automatically.

With these histograms, could you provide a quick summary of the age of MLB players?

Another way to use breaks is to provide a vector specifying where the breaks should be. This can be handy when you want to compare two distributions (hint: you may use it when analyzing the L-D data).

#if you forget what seq is doing
?seq

#specifying breaks between 20 and 50, with a step of 5
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = seq(20, 50, by = 5))

#specifying breaks between 20 and 50, with a step of 1
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", breaks = seq(20, 50, by = 1))

You can also change the color of the histogram. Try the following lines of code and see what they are doing.

#specify col
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", col = "olivedrab")

#specify border
hist(data_mlb2$Age, xlab = "age", main = "age of MLB players", border = "olivedrab")

Boxplots

Boxplots are another type of plot that provides a quick overall summary of the data.

boxplot(data_mlb2$Age)

Let’s look at the plot together to understand what it is showing (the median, the 1st and 3rd quartiles, the whiskers, and any outliers).
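If you want the actual numbers behind the box and whiskers, one option (a sketch) is boxplot.stats():

#the five values drawn by the boxplot: lower whisker, lower hinge (~1st quartile), median, upper hinge (~3rd quartile), upper whisker
boxplot.stats(data_mlb2$Age)$stats
#any points flagged as outliers
boxplot.stats(data_mlb2$Age)$out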

Think about what information you could get with a boxplot but not a histogram and vice versa.

You can again change the appearance of the boxplot.

boxplot(data_mlb2$Age, pch = 20, col = "olivedrab")

X-Y plots

X-Y plots are very commonly used to explore the relationship between two variables.

#exploring the relationship between age and weight
plot(x = data_mlb2$Age, y = data_mlb2$Weight, xlab = "Age", ylab = "weight", pch = 20)

#exploring the relationship between weight and height
plot(x = data_mlb2$Weight, y = data_mlb2$Height, xlab = "weight", ylab = "height", pch = 20)

Do you think there is any association between age and weight? How about between weight and height?

We will learn how to formally test these possibilities later!


Summarize and compare between subsets of data

We will use the MLB data again to learn how to use plots and mean/variance estimates to summarize subsets of data.

#summary() only provides an overview of the numerical data
summary(data_mlb2)
     Name               Team             Position             Height    
 Length:1034        Length:1034        Length:1034        Min.   :67.0  
 Class :character   Class :character   Class :character   1st Qu.:72.0  
 Mode  :character   Mode  :character   Mode  :character   Median :74.0  
                                                          Mean   :73.7  
                                                          3rd Qu.:75.0  
                                                          Max.   :83.0  
     Weight           Age       
 Min.   :150.0   Min.   :20.90  
 1st Qu.:187.0   1st Qu.:25.44  
 Median :200.0   Median :27.93  
 Mean   :201.7   Mean   :28.74  
 3rd Qu.:215.0   3rd Qu.:31.23  
 Max.   :290.0   Max.   :48.52  
#use table() for categorical data, which we will use a lot in the class
table(data_mlb2$Position)

          Catcher Designated_Hitter     First_Baseman        Outfielder 
               76                18                55               194 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
              315                58                52               221 
    Third_Baseman 
               45 

Let’s say we only care about the age of starting pitchers.

Make sure you understand what the following lines of code are doing. This is very critical for analyzing your data.

bool = data_mlb2$Position == "Starting_Pitcher"
data_mlb2$Age[bool]

#the above code can be simplified as
data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
#we could save this subset of data to a new vector
age_sp = data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]

#then, we could make plots of only the age of starting pitchers
hist(age_sp, xlab = "age", main = "age of starting pitcher")

Let’s compare it to the age distribution of all MLB players. Can you see any difference?

Tip: when comparing plots, be mindful of the range of the axes.

hist(data_mlb2$Age, xlab = "age", main = "age of MLB players")

You may notice, “oh wait, this plot also includes starting pitchers.” A more proper comparison would exclude starting pitchers.

hist(data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], xlab = "age", main = "age of MLB players")

Can you explain what the above code is doing? Are the histograms with and without starting pitchers similar or different?

We can also use a boxplot to compare the age of starting pitchers and other players.

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])

You can specify the names of the first and second boxes using names = c(" ", " ").

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'], names = c("starting pitcher", "others"), pch = 20)

How would you describe the age of starting pitchers, compared to other MLB players?

You can also compare the ages of starting and relief pitchers.

boxplot(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], data_mlb2$Age[data_mlb2$Position == 'Relief_Pitcher'], names = c("starting pitcher", "relief pitcher"), pch = 20)

We will learn how to formally test whether the ages of starting pitchers and other MLB players are different later in class.


Estimate unknowns of a population

Recall that statistics allows us to do the following.

- Summarize data about a sample.

- Estimate an unknown quantity from a population using samples.

- Compare different samples (and populations), and therefore test different hypotheses.

- Predict future outcomes or unknown data.

Because we can rarely measure every individual of a population, we take a sample from the population. Sample size (the number of individuals you measure) plays a big role in how good the estimate is.

Let’s do some R simulations to explore this idea.

?rnorm
my_mean = 3
my_sd = 3

A normal distribution with mean = 3 and standard deviation = 3 should look like a bell-shaped curve centered at 3.
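If you would like to draw this theoretical curve yourself, a minimal sketch using dnorm() is:

#plot the normal density with mean = 3 and sd = 3 over a range of values
curve(dnorm(x, mean = my_mean, sd = my_sd), from = -6, to = 12,
      xlab = "value", ylab = "density", main = "normal distribution, mean = 3, sd = 3")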

Let’s try generating two sets of random samples.

Try it! You should get slightly different numbers from mine.

#only 10 samples
sample1 = rnorm(10, mean = my_mean, sd = my_sd)
hist(sample1, main = "10 samples")

mean(sample1)
[1] 2.235894
sd(sample1)
[1] 1.640717
#100000 samples
sample2 = rnorm(100000, mean = my_mean, sd = my_sd)
hist(sample2, main = "100000 samples")

mean(sample2)
[1] 3.006691
sd(sample2)
[1] 2.989525

Which one of the above sample distributions do you think is more similar to the true normal distribution? Why?


Compare - hypothesis testing

Recall again that statistics allows us to do the following.

- Summarize data about a sample.

- Estimate an unknown quantity from a population using samples.

- Compare different samples (and populations), and therefore test different hypotheses.

- Predict future outcomes or unknown data.

When we compare samples from different populations, we can compare plots and summary measures of the samples (e.g., mean and variance). But such a comparison is rarely clear cut.

Do you think the following two sample distributions have the same mean?

To formally test whether two distributions are similar (or whether a distribution is the same as a theoretical distribution), we can perform formal hypothesis testing. It is worth noting that, in statistics, hypothesis testing is not limited to comparing two distributions; we will learn about other hypothesis tests later in class.

To do hypothesis testing, we need…

  1. A null hypothesis (H0). Usually (but not always), a null hypothesis states that there is no effect (e.g., no difference, no association, etc.).
  2. Construct the distribution of a test statistic assuming H0 (e.g., t, χ², etc.).
  3. Compare the test statistic estimated from the sample to the distribution in (2) to obtain a p-value.

The p-value is the probability, under the null hypothesis H0, of observing a value as extreme as or more extreme than the one observed in our data. We usually use a threshold of 0.05: if the p-value of our statistical test is smaller than that, we say that the null hypothesis is rejected under a p-value threshold of 0.05 and that the test is significant.

For us, we only need to think carefully about (1) and choose the appropriate R function. Once we do that, R can do (2) and (3) for us. Let’s do some statistical tests using R to put these steps into action.

Here is the boxplot for the age of the MLB players again.

Based on the boxplot, do you think the mean ages of starting pitchers and other players are the same or different?

t.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], y = data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])

    Welch Two Sample t-test

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"] and data_mlb2$Age[data_mlb2$Position != "Starting_Pitcher"]
t = -1.8888, df = 335.69, p-value = 0.05978
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.29973562  0.02636856
sample estimates:
mean of x mean of y 
 28.23611  28.87279 

t.test(x = , y = ) tests whether the means of two samples are the same or different. The null hypothesis (H0) is that the means of x and y are the same. Rejecting the null hypothesis under a specific p-value threshold means that the means of the two samples are significantly different.

Here, the p-value is greater than 0.05, so we cannot reject the null hypothesis under a p-value threshold of 0.05. You could of course choose a different p-value threshold.

t.test() can also be used to compare a sample to a null expectation. Say, based on the histogram, we hypothesize that the mean age of starting pitchers should be around 25. Let’s test this idea.

t.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], mu = 25)

    One Sample t-test

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
t = 10.713, df = 220, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 27.64080 28.83141
sample estimates:
mean of x 
 28.23611 

Based on the R output, do you think the mean age of starting pitchers is around 25?

The t-test is a parametric test, which means that, in order to get the p-value of the test statistic for the mean comparison, the t-test assumes an underlying distribution of the data (in this case, normality). What is important to remember is that this assumption can sometimes be violated, so it is good practice to also perform a similar comparison using nonparametric tests.
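If you want a quick visual check of whether normality looks reasonable for a sample, one common sketch is a normal Q-Q plot (points falling roughly along the line suggest approximate normality):

#Q-Q plot of the age of starting pitchers against a normal distribution
qqnorm(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'])
qqline(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'])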

A nonparametric counterpart of the t-test is the Mann-Whitney U test (also known as the Wilcoxon rank sum test). Instead of comparing the means (as the t-test does), the MWU test compares the medians. Its usage is very similar to that of t.test().

wilcox.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], y = data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])

    Wilcoxon rank sum test with continuity correction

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"] and data_mlb2$Age[data_mlb2$Position != "Starting_Pitcher"]
W = 79807, p-value = 0.01085
alternative hypothesis: true location shift is not equal to 0

Unlike t.test(), which reports the means of the two samples, wilcox.test() does not report the medians, so we have to calculate them ourselves.

#median age of starting pitcher
median(data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'])
[1] 27.39
#median age of other players
median(data_mlb2$Age[data_mlb2$Position != 'Starting_Pitcher'])
[1] 28.09

Based on the MWU test results, do you think starting pitchers have a significantly different age from other players? If so, are they older or younger?

wilcox.test(x = data_mlb2$Age[data_mlb2$Position == 'Starting_Pitcher'], mu = 25)

    Wilcoxon signed rank test with continuity correction

data:  data_mlb2$Age[data_mlb2$Position == "Starting_Pitcher"]
V = 21069, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 25

We will introduce more tests in future classes!


How to analyze data from the Luria & Delbrück experiment

Now we are going to put what you have learned so far into action.

  1. prepare the L-D experiment data for the whole class so it can be read by R

  2. load data as a data frame into R

  3. quick summary/description of the data using plots and centrality/spread measures

    • Plots summarizing the distribution of your results, and those of the group (that is, the x axis should show the number of colonies per plate and the y axis the number or frequency of populations)

    • The mean and variance in CFUs across populations for your group and for the whole class. You can present this in parenthetic form, or as a table.

  4. Identify “jackpot” plates using summary plots

    • Identify “jackpots” (plates with a clearly higher number of colonies than the rest) in the class data.
  5. test the two alternative possibilities of the L-D experiment by using the mean and variance (for your own data and the whole group)

    • The ratio of the mean and variance. Calculate this ratio for the data of each group and for all of the groups together. Is there any value close to 1? (Bonus: you could plot the mean against the variance and a reference line using abline with slope = 1 and intercept = 0; what does it mean if all the values are above this reference line?) See the sketch after this list for one way to set this up in R.
  6. compare your own data to the whole class

    • are they similar or different?

    • why is there variability?

    • what might be the role of this variability in evolution?

  7. estimate mutation rate
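
Below is a minimal sketch of steps 3-5, assuming the class data have been combined into a tab-separated file with one row per plate and (hypothetical) columns named group and CFU; adjust the file name and column names to match your actual data.

#load the combined class data (the file and column names here are placeholders)
ld = read.table("LD_class_data.tsv", header = TRUE)

#step 3: histogram of colonies per plate across all populations
hist(ld$CFU, xlab = "colonies per plate", main = "L-D experiment, whole class")

#step 3: mean and variance of CFUs for each group
group_mean = tapply(ld$CFU, ld$group, mean)
group_var = tapply(ld$CFU, ld$group, var)

#step 5: ratio of the variance to the mean for each group; is it close to 1?
group_var / group_mean

#bonus in step 5: plot the mean against the variance with a 1:1 reference line
plot(group_mean, group_var, xlab = "mean", ylab = "variance", pch = 20)
abline(a = 0, b = 1)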