Today we will learn how to conduct hypothesis tests using t-tests and ANOVA. We will use the data we cleaned in Lab 3 for today’s exercise. Below is a list of functions that are newly introduced in this lab.

Function       Task
describeBy()   Creates basic summary statistics by a grouping variable (in the psych package)
t.test()       Performs one- and two-sample t-tests on vectors of data
aov()          Performs an analysis of variance
rnorm()        Draws a random sample from a normal distribution

Before we start, let’s define a custom function I made. Copy the entire code in the gray box below, paste it into your console window, and run it. We’ll get back to what it means later.

# Copy from here ---------------------------
histogram_overlay <- function(vector1, vector2){
  plot <- ggplot2::ggplot(data = data.frame(red = vector1, blue = vector2)) + 
    
    # Histogram and dashed mean line for the first vector (drawn in red)
    ggplot2::geom_histogram(ggplot2::aes(red), color = 'gray', fill = "red", alpha = 0.5, bins = 15) +
    ggplot2::geom_vline(xintercept = mean(vector1), color = "red", linetype = 'dashed', size = 2) + 
    
    # Histogram and dashed mean line for the second vector (drawn in blue)
    ggplot2::geom_histogram(ggplot2::aes(blue), color = 'gray', fill = "blue", alpha = 0.5, bins = 15) +
    ggplot2::geom_vline(xintercept = mean(vector2), color = "blue", linetype = 'dashed', size = 2) +
    
    ggplot2::labs(x = "Z-Score", y = "Frequency")
  return(plot)
}
# To here ----------------------------------

0. Before We Get Started: Setting the working directory & reading data

As always, we start by setting the working directory and reading required files.

If you have your cleaned file saved, you can use that file. Otherwise, you can download the CSV file (df_from_lab3.csv) I provided in this week’s module on Canvas.

# Install required packages
install.packages("psych") 
# Call the packages using library()
library(tidyverse)
library(psych)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Reference Materials\\_for_next_year12\\Labs\\Lab4_2020") # Use your own pathname

df.ready <- read.csv("df_from_lab3.csv") 

# Check the data
head(df.ready)
##                 GEO_ID               tract         county   state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia  2641
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia  3573
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia  4087
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia  5962
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia  6778
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia  4090
##   under00_50 under50_99     p.pov   p.pov.z p.pov.scale
## 1        288        261 0.2078758 0.2492940   0.2492940
## 2        498        381 0.2460118 0.5384034   0.5384034
## 3        477        397 0.2138488 0.2945753   0.2945753
## 4        939       1097 0.3414961 1.2622723   1.2622723
## 5        443       1345 0.2637946 0.6732157   0.6732157
## 6        982        762 0.4264059 1.9059745   1.9059745

1. T-Test

Let’s first draw a boxplot to visualize our data. Although t-tests and ANOVA are about means (and boxplots show medians), a boxplot can still provide some understanding of how the data are distributed.

# Boxplot to visualize
boxplot(df.ready$p.pov ~ df.ready$county,
        main = "Poverty Rate by County",
        xlab = "County",
        ylab = "Poverty Rate")

We can see that “DeKalb County” and “Fulton County” have similar median values, while the other two counties have distinctly different medians and less dispersion. We can also calculate descriptive statistics for each county using describeBy() in the psych package, which takes a minimum of two arguments: (1) a data frame or a vector containing an interval/ratio variable and (2) a variable that can be used to group it.

# Describe statistics by county
describeBy(df.ready$p.pov, df.ready$county)
## 
##  Descriptive statistics by group 
## group: Clayton County
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 49 0.23 0.09   0.22    0.22 0.08 0.04 0.46  0.42 0.37    -0.26 0.01
## ------------------------------------------------------------ 
## group: Cobb County
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 120 0.11 0.09   0.09     0.1 0.08   0 0.4  0.39 1.12     0.51 0.01
## ------------------------------------------------------------ 
## group: DeKalb County
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 143 0.18 0.11   0.16    0.17 0.11 0.02 0.65  0.64 0.96     1.36 0.01
## ------------------------------------------------------------ 
## group: Fulton County
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 202  0.2 0.16   0.16    0.18 0.16 0.01 0.81   0.8 0.93     0.37 0.01

   

One Sample, One-Tailed T-Test (Assuming unknown variance)

Let’s compare the mean poverty rate of Fulton County to a given value (e.g., a target poverty rate that Fulton County’s planners want to make sure the county does not exceed). Let’s say that target value is 0.1.

  • The null hypothesis is “the mean of Fulton’s poverty rate is equal to or less than 0.1”.
  • The alternative hypothesis is “the mean of Fulton’s poverty rate is greater than 0.1”. This “greater” is what we supply to the alternative argument of t.test().

# Create a separate vector for Fulton County

Fulton <- df.ready$p.pov[df.ready$county == "Fulton County"]
t.test(Fulton, mu = 0.1, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  Fulton
## t = 8.764, df = 201, p-value = 3.958e-16
## alternative hypothesis: true mean is greater than 0.1
## 95 percent confidence interval:
##  0.179328      Inf
## sample estimates:
## mean of x 
## 0.1977611

Unfortunately, the t-statistic is much larger than zero at 8.764, and the p-value is nearly 0. We can reject the null hypothesis (and the planners should work harder).
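As an optional aside, if you want to see the cut-off value that this t-statistic is compared against, you can use qt(), the quantile function of the t-distribution (a cousin of qnorm() from the previous lab). The sketch below is not required for the lab.

# One-tailed critical value at the 5% significance level with df = 201.
# Our t-statistic of 8.764 is far beyond this cut-off, which is why the p-value is so small.
qt(0.95, df = 201)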

   

Short Exercises

  1. We have the potato yield from 12 different farms. We know that the standard potato yield for the given variety is mu = 20. x = [21.5, 24.5, 18.5, 17.2, 14.5, 23.2, 22.1, 20.5, 19.4, 18.1, 24.1, 18.5]
     a. Test whether the potato yield from these farms is significantly better than the standard yield. Answer: test statistic = 0.20066, p-value = 0.4223 (a code sketch for checking this appears after the list)
  2. According to Automated Passenger Count (APC) data from MARTA in January 2020, the average weekday MARTA transit APC is 1300. Suppose that MARTA wants to know whether transit ridership on Tuesdays is less than the weekday average. To find out, they take a random sample of 18 routes on Tuesdays and measure their APCs as shown in Table 1. They want you to determine whether the average APC on a Tuesday is significantly less than the overall weekday average APC.
     a. Conduct the appropriate hypothesis test at the 1% significance level. Answer: test statistic = 0.64142, p-value = 0.7351
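To check your answer to the first exercise, a minimal sketch could look like the one below (x is simply the vector of yields from the exercise typed into R); it should reproduce the test statistic and p-value given above.

# Enter the 12 yields and run a one-sample, one-tailed t-test against the standard yield of 20
x <- c(21.5, 24.5, 18.5, 17.2, 14.5, 23.2, 22.1, 20.5, 19.4, 18.1, 24.1, 18.5)
t.test(x, mu = 20, alternative = "greater")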

   

Two Sample, Two-Tailed T-Test (Assuming unequal variance)

Let’s compare the mean of Fulton County with those of the other counties. To do so, let’s first subset the rows for each of the remaining counties to create three vectors and then run t-tests using them. Note that we have already created a separate vector for Fulton County.

# Create three separate vectors for each county. 

Dekalb <- df.ready$p.pov[df.ready$county == "DeKalb County"]
Cobb <- df.ready$p.pov[df.ready$county == "Cobb County"]
Clayton <- df.ready$p.pov[df.ready$county == "Clayton County"]

# T-Tests
t.test(Fulton, Dekalb)
## 
##  Welch Two Sample t-test
## 
## data:  Fulton and Dekalb
## t = 1.2151, df = 342.92, p-value = 0.2252
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01088587  0.04607498
## sample estimates:
## mean of x mean of y 
## 0.1977611 0.1801666
t.test(Fulton, Cobb)
## 
##  Welch Two Sample t-test
## 
## data:  Fulton and Cobb
## t = 6.3226, df = 319.55, p-value = 8.651e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06043043 0.11502784
## sample estimates:
## mean of x mean of y 
## 0.1977611 0.1100320
t.test(Fulton, Clayton)
## 
##  Welch Two Sample t-test
## 
## data:  Fulton and Clayton
## t = -1.5572, df = 121.59, p-value = 0.122
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.062116446  0.007418873
## sample estimates:
## mean of x mean of y 
## 0.1977611 0.2251099

The results indicate that, as we more or less expected from the boxplot, the mean of Fulton is not statistically different from the mean of DeKalb, while it is statistically different from the mean of Cobb. Surprisingly, the third test shows that the mean of Fulton is NOT statistically different from the mean of Clayton, which deviates somewhat from what we saw in the boxplot. This is likely because the distribution of Fulton is positively skewed (i.e., mean > median).
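If you want to verify that skewness claim yourself, a quick optional check is to compare Fulton’s mean and median directly.

# The mean is noticeably larger than the median, consistent with a positively skewed distribution
mean(Fulton)
median(Fulton)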

   

Let’s take a closer look at the output from t.test(). The image below shows what each component in the output means. The data line indicates that this is a t-test using Fulton and Clayton. The t-statistic, degrees of freedom, and p-value are the ingredients used in the hypothesis test; they show that we cannot reject the null hypothesis.

The 95 percent confidence interval shows that we are 95% confident that the true difference in mean between x and y (in this case, x would be the vector named Fulton; y would be the vector named Clayton) falls within the range of -0.062116446 and 0.007418873.

The last red box shows where you can find the mean of x and the mean of y (again, x and y are the two vectors you provided as arguments to the t.test() function, x being the first argument and y the second).
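Beyond reading the printed output, you can also pull these pieces out programmatically, because t.test() returns a list-like object. Below is a small sketch; the object name ttest.fc is just one I made up for this example.

# Store the test result and extract its components by name
ttest.fc <- t.test(Fulton, Clayton)
ttest.fc$statistic # the t-statistic
ttest.fc$p.value   # the p-value
ttest.fc$conf.int  # the 95 percent confidence interval
ttest.fc$estimate  # the mean of x and the mean of y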

   

Let’s do some more practice using simulated data. Remember qnorm() and pnorm() from the previous lab? They are two of the four members of the family of normal distribution functions. We can use another member of that family, rnorm(), to randomly sample a vector of numbers that follows a normal distribution.

The function rnorm() takes three arguments: n, mean, and sd. Here, n specifies how many values you want to sample from the distribution; mean and sd specify the mean and standard deviation of the normal distribution from which you’d like to sample. If you set, for example, mean to 0 and sd to 1, you would be sampling values from the standard normal distribution.
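As a quick optional check of what rnorm() returns, you can try a small draw in your console. The set.seed() call is not required; it simply fixes the random number generator so that your draws are reproducible from run to run.

# Five draws from the standard normal distribution
set.seed(123) # optional: makes the random draws reproducible
rnorm(5, mean = 0, sd = 1)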

As an example, let me create two vectors of 200 elements with different means and different standard deviations and plot them to visualize them.

 

If you are not sure what rnorm() is doing, that is fine. Just note that we are using rnorm() because it lets us create data with whatever mean and sd we like. We are going to adjust them so that we can examine how the t-test behaves when we change, for example, the mean from 0 to 0.5 or the sd from 1 to 2.

 

  • Note 1: Because the rnorm() function RANDOMLY samples values from a normal distribution, the exact values you get can be different each time you run it. That is expected and is not a problem at all.

  • Note 2: This is where the custom function is used. The custom function histogram_overlay() takes two vectors as arguments and returns an overlaid histogram.

# Draw numbers from the distribution
red_vector <- rnorm(200, mean = 0, sd = 1) # Create a vector that has 200 elements from the standard normal distribution.

blue_vector <- rnorm(200, mean = 2, sd = 2) # Create a vector that has 200 elements from a normal distribution with mean of 2 and sd of 2

print(red_vector)
##   [1]  0.15630886  0.87950126 -1.26789922 -1.02572101 -2.92502302  0.25306820
##   [7]  0.29324273  0.91444040 -1.34931937 -1.05015095 -0.22210232 -1.38386462
##  [13]  1.33025220 -0.19658973  0.58155923  1.01651394 -0.47345790  1.45708662
##  [19]  1.35726579  1.12013558 -0.72373036 -1.90006403 -0.61046439 -0.13762994
##  [25] -0.21230220 -0.47032103  1.52184460  1.21170454  0.10341299  0.57549740
##  [31] -0.02189756 -0.06898542 -0.91368097  0.02354278  0.34039749  1.10651503
##  [37]  0.85000181 -1.61941560 -0.55845877 -0.51975165  1.04341750  0.22861400
##  [43]  0.64460076 -0.64301873  0.19959414 -0.75696150  0.52538793  0.21646811
##  [49] -0.21509817 -0.69222822 -0.46222299 -1.36836998  0.47326134  1.54546409
##  [55] -0.11871823 -0.32261993  0.88275567  0.60012343  0.91939293  0.78786139
##  [61]  0.94589499 -0.12774962  0.03963747  0.32803821  0.20271884 -0.08182276
##  [67]  1.23141376  0.93728224  1.69135149 -1.83693240 -0.11622588 -0.52169247
##  [73] -1.83267711  0.08588009  0.74637502  0.77103823  0.70070760  0.57502777
##  [79] -0.32713391 -0.05587599 -0.77822324 -1.59132083 -0.06544094  0.27384509
##  [85] -0.87710509 -1.20023693  1.14668503 -0.34725727 -1.72361634 -0.71663550
##  [91]  2.32096682  0.17647745 -0.74410366 -0.96201918 -0.33438399  0.15558599
##  [97] -0.92398831 -0.43367561  0.61196690  1.06367011 -1.71126323  1.13307708
## [103]  0.81778682  1.63564167  0.15966225  1.41651565  0.99662963  0.90448894
## [109] -0.16431636  0.30370577 -0.79723676  0.92618106  0.72405363 -0.30638711
## [115]  0.64009733 -0.47574222 -0.56317890  0.30756077 -2.55979173  0.20862297
## [121] -1.29987578 -0.21031685 -0.77798735 -0.85323183  0.62173821 -1.22198663
## [127] -1.12169269 -1.81325641  0.71277017 -0.96287559  1.09483294  0.97825634
## [133]  0.70026338  0.50173003 -0.21153645 -1.69071117  0.39379345  0.31837625
## [139] -0.04035831 -0.88244529  0.11290474  0.92539386 -0.20574963  1.39770196
## [145] -0.94348956  0.81537421  0.01323583  0.08628050  0.90433015  0.80107796
## [151] -0.02073583  0.72312715 -0.26100406  0.11356715  0.18217908  0.73223993
## [157] -1.89143973  2.97742934 -1.47661945 -1.10438063  1.24609058 -1.95438158
## [163]  0.66513054 -1.03378149 -0.70157517  1.18367153 -0.27509734 -0.80500527
## [169]  0.83637886  1.61830110 -1.47668759  1.00290868 -0.20071281  0.52910344
## [175] -1.19986270 -0.08569601 -0.09013922 -1.03726324  0.48017324 -1.53381487
## [181] -1.16831856 -0.33468809  2.84249227  1.88416194  0.54999397  0.20880332
## [187]  1.34918744 -1.02651613 -0.19188860  1.44809604  0.60867972  0.08635366
## [193] -0.20455012 -0.24243183  1.46352518  0.17216840 -0.15462712  0.79798157
## [199] -1.13373234 -1.16538955
print(blue_vector)
##   [1]  1.35179988  2.82437437 -1.42348197 -0.75077308 -3.04137512  1.82791522
##   [7]  2.99168613  1.62537030  2.64068274  2.73869467  1.43368079  2.46171601
##  [13]  0.09777721  0.80158049  2.25259758  1.47469481  1.69665005  4.15510358
##  [19]  4.53693006 -1.04273030  2.11364942  2.15506706  3.40837919 -0.38264614
##  [25]  0.36775485  3.95631997  4.27048713 -0.25203230  0.43557006  0.82950668
##  [31]  2.86318903  4.48433905  2.93851603  4.42463716 -0.77498263  2.15217284
##  [37]  6.71208788  5.75757825  2.64128150  2.80620809 -0.53510903  7.33025964
##  [43]  0.80893823  4.59525306  6.14221269  7.15469906 -1.96085904  2.97522433
##  [49]  1.52542012  2.41794109  4.05492637  3.53140577  3.48168199 -1.82456690
##  [55]  2.70003707  2.78583103  4.28627150  4.56420874  0.99167416  3.49495196
##  [61]  2.95037601  0.27085592  3.86605905  0.66328140  3.23225428  0.89446283
##  [67]  2.93823875  0.84611235  5.36003484  2.41673900 -1.22826915 -0.22443918
##  [73] -4.14498627  1.02005703  3.72004022  0.14416973 -1.38590343  4.32975877
##  [79] -0.12365469  0.56978072  3.21578197  3.11707857  1.26758843  4.78799741
##  [85]  1.08069073  1.41205879  2.43161458  2.02935402  4.33448657  3.71710209
##  [91]  0.06001645 -0.42016949  2.51447050  1.82050228  0.40453675  0.84265173
##  [97]  5.12211305  0.10462183  0.05990062  0.87049581  3.70326220 -0.33201064
## [103]  6.26350496  2.21530379 -2.20785817  3.24599976 -0.64201789  3.47435844
## [109]  0.44848374 -0.14927254  2.73938377  1.36443444  0.29781809 -1.74536064
## [115]  1.64263490 -0.02922008  1.42002924  5.67899915  0.08514879  4.50921196
## [121]  1.08140266  3.70391394  1.20244718  2.39689629 -0.28540587  0.82605014
## [127]  2.30928626  7.42447980  4.38158567  1.07386965  2.09718236  1.02420985
## [133]  4.65761149  0.91159667  3.97561470  7.45622435 -0.86938356 -0.42222065
## [139] -0.74976971  1.26583245  3.01720681  0.38992285  4.32281472 -0.94699610
## [145]  2.15447551  2.90322238  1.82728479  1.77369819  4.85534536  4.07543417
## [151]  1.77051274  0.46193329  2.97291394  4.57533373  5.59798840  1.53421942
## [157]  3.04317276  5.41628381  3.47939637  1.55446092  5.36663200  2.20698208
## [163]  1.48004165  3.21717947 -0.64825514  3.02966540  0.90553717  0.89081390
## [169]  2.85843981  1.43491731  0.42191998  0.30462545  4.96838238  1.34199262
## [175]  2.76126247  3.16905185  2.70368519  0.96519406 -0.91061449  0.68231622
## [181]  3.87786480  0.93117105  2.81597011  2.01055733  1.26286252  4.81705954
## [187]  1.19724880  2.29995852  4.07940735  3.99086177  0.59376206  2.76872955
## [193]  3.64228679  3.47693818  2.72313843 -0.57617661 -0.61423936  3.16405249
## [199]  2.11830095  1.90519675
# Visualize the two vectors using the custom function.
histogram_overlay(red_vector, blue_vector)

You can see from the plot that red_vector and blue_vector are centered at different values, about 0.02 and 2.08 (the red and blue dashed lines, respectively). They also clearly differ in how spread out they are (because we set the sd of red_vector to 1 and that of blue_vector to 2). Let’s see what happens when we run a t-test using these two vectors.

t.test(red_vector, blue_vector)
## 
##  Welch Two Sample t-test
## 
## data:  red_vector and blue_vector
## t = -12.711, df = 285.56, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.383722 -1.744480
## sample estimates:
##  mean of x  mean of y 
## 0.02002835 2.08412920

The t-statistic is far from zero at -12.71 and, as expected, the p-value is essentially 0 (much smaller than 0.05). We can reject the null hypothesis that the means of red_vector and blue_vector are the same.

       

Next, let’s try putting the two distributions closer to each other and see what happens. This time, let’s keep the mean of red_vector at 0, change the mean of blue_vector to 0.2, and run the t-test.

# Draw numbers from the distribution
red_vector2 <- rnorm(200, mean = 0, sd = 1) # Create a vector that has 200 elements from the standard normal distribution.

blue_vector2 <- rnorm(200, mean = 0.2, sd = 2) # Create a vector that has 200 elements from a normal distribution with mean of 0.2 and sd of 2

# Visualize the two vectors
histogram_overlay(red_vector2, blue_vector2)

t.test(red_vector2, blue_vector2)
## 
##  Welch Two Sample t-test
## 
## data:  red_vector2 and blue_vector2
## t = 0.065613, df = 286.21, p-value = 0.9477
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3373510  0.3606178
## sample estimates:
##  mean of x  mean of y 
## 0.06764216 0.05600875

Now the t-statistic is much closer to zero, and the p-value indicates that we fail to reject the null hypothesis at the 95% confidence level. Note that even if you run the same code as above, your p-value can sometimes be lower than 0.05. That is because of random sampling!!
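If you’re curious, one way to see that sampling variability in action is sketched below: repeat the whole simulation many times and count how often the p-value falls below 0.05 (the exact proportion will differ from run to run).

# Re-draw both vectors and re-run the t-test 1,000 times, keeping only the p-values
p_values <- replicate(1000, t.test(rnorm(200, mean = 0, sd = 1),
                                   rnorm(200, mean = 0.2, sd = 2))$p.value)
# Proportion of runs in which the p-value is below 0.05 (i.e., how often we would reject the null)
mean(p_values < 0.05)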

Before you move on, try various combinations of the mean and standard deviation and see how the t-test results change!

For example, if you keep the means at 0 and 0.2 and change the STANDARD DEVIATION, what would happen to the t-statistic and p-value?
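Here is one possible starting point for that experiment (a sketch only; the vector names are my own, and your results will vary because of random sampling).

# Same difference in means (0 vs. 0.2), but much smaller standard deviations
red_vector3 <- rnorm(200, mean = 0, sd = 0.2)
blue_vector3 <- rnorm(200, mean = 0.2, sd = 0.2)
histogram_overlay(red_vector3, blue_vector3)
t.test(red_vector3, blue_vector3) # with less spread, the same 0.2 difference in means is much easier to detect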

   

Different Types of T-Test

In the lecture, Dr. Clayton covered other types of t-tests, such as t-tests with unequal variance and paired t-tests. Here is some sample code you can try. Note that the paired t-test fails for our data because Fulton and Dekalb are NOT PAIRED and have different lengths.

# Independent samples with equal variance 
t.test(Fulton, Dekalb, var.equal = TRUE)

# Independent samples with NOT equal variance
t.test(Fulton, Dekalb, var.equal = FALSE)

# Paired
t.test(Fulton, Dekalb, paired = TRUE) # This code fails because fulton and dekalb in our data are NOT PAIRED and have different lengths
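
If you want to see paired = TRUE actually work, here is a minimal sketch with made-up paired data (the vector names and numbers are purely illustrative).

# Simulate a "before" measurement at 10 locations and an "after" measurement at the SAME locations
before <- rnorm(10, mean = 5, sd = 1)
after <- before + rnorm(10, mean = 0.5, sd = 0.5)
# Because the two vectors are matched one-to-one and have equal length, a paired t-test is appropriate
t.test(before, after, paired = TRUE)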

   

Short Exercise

  1. We are interested in comparing the number of pedestrians crossing intersections A and B during peak hours on weekdays. The two intersections are in totally different neighborhoods, far apart. Pedestrian counts (during 15-minute intervals) at the two intersections are sampled during peak hours on a random weekday and shown in the Table below. For the following questions, conduct the appropriate hypothesis test at the 5% significance level and also report the corresponding p-values. (A code sketch follows this list.)
     a. Test the hypothesis that the mean number of pedestrians crossing intersection B differs from the mean number crossing intersection A, assuming the population variances are equal. Answer: test statistic = 1.2792, p-value = 0.2088
     b. Test the hypothesis that the mean number of pedestrians crossing intersection B differs from the mean number crossing intersection A, assuming the population variances are unequal. Answer: test statistic = 1.2792, p-value = 0.2088
     c. What’s the difference between your two answers (a and b)?
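A sketch of the code you would need is shown below, assuming you have typed the two columns of the table into vectors named ped_A and ped_B (those names are just placeholders).

# (a) Assuming equal population variances
t.test(ped_B, ped_A, var.equal = TRUE)
# (b) Assuming unequal population variances (Welch's t-test, the default)
t.test(ped_B, ped_A, var.equal = FALSE)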

   

2. ANOVA

Here is a preview of how to run an ANOVA test. Interpretation and set-up will be discussed in lecture next week.

The code for ANOVA is even simpler, and its format is similar to the boxplot code. Writing anova.pov <- aov(df.ready$p.pov ~ df.ready$county) is the same as saying “Run ANOVA on df.ready$p.pov using df.ready$county as the factor variable (i.e., grouping variable) and store the result in an R object called anova.pov.” To see what is stored in anova.pov, use the summary() function.

# ANOVA
anova.pov <- aov(df.ready$p.pov ~ df.ready$county)
summary(anova.pov)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## df.ready$county   3  0.738 0.24600   15.32 1.47e-09 ***
## Residuals       510  8.188 0.01606                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA result indicates that we can reject the null hypothesis that the mean poverty rates of the four counties are all the same. At least one county has a mean that is not equal to those of the other counties.
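As a small sanity check on the table above, note that the F value is simply the ratio of the two mean squares.

# Mean Sq for county divided by Mean Sq for the residuals reproduces the reported F value
0.24600 / 0.01606 # roughly 15.3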