Today we will learn how to conduct hypothesis test using t-test and ANOVA. We will use the data we cleaned in Lab 3 for today’s exercise. Below is a list of functions that are newly introduced in this lab.
| Functions | Tasks |
|---|---|
describeBy() |
Creates a basic summary statistic by a grouping variable (in psych package) |
t.test() |
Performs one and two sampe t-tests on vectors of data |
aov() |
Performs an analysis of variance |
rnorm() |
Draws a random sample from a normal distribution |
Before we start, let’s define a custom function I made. Copy the entire code in the gray box below and paste it into your console window and run it. We’ll get back to what it means later.
# Copy from here ---------------------------
histogram_overlay <- function(vector1, vector2){
plot <- ggplot2::ggplot(data = data.frame(red = vector1, blue = vector2)) +
ggplot2::geom_histogram(aes(red, ), color = 'gray', fill = "red", alpha = 0.5, bins = 15) +
ggplot2::geom_vline(xintercept = mean(vector1), color = "red", linetype = 'dashed', size = 2) +
ggplot2::geom_histogram(aes(blue), color = 'gray', fill = "blue", alpha = 0.5, bins = 15) +
ggplot2::geom_vline(xintercept = mean(vector2), color = "blue", linetype = 'dashed', size = 2) +
ggplot2::labs(x = "Z-Score", y = "Frequency")
return(plot)
}
# To here ----------------------------------
As always, we start by setting the working directory and reading required files.
If you have your cleaned file saved, you can use that file. Else, you can download the CSV file (df_from_lab3.csv) I provided from this week’s module on Canvas.
# Install required packages
install.packages("psych")
# Call the packages using library()
library(tidyverse)
library(psych)
# USE YOUR OWN working directory and file name
setwd("D:/Dropbox (Personal)/School/Fall2020/CP6025/Labs/Lab4_new") # Use your own pathname
df.ready <- read.csv("df_from_lab3.csv")
# Check the data
head(df.ready)
## GEO_ID tract county state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 2641
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 3573
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 4087
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 5962
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 6778
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 4090
## under00_50 under50_99 p.pov p.pov.z p.pov.scale
## 1 288 261 0.2078758 0.2492940 0.2492940
## 2 498 381 0.2460118 0.5384034 0.5384034
## 3 477 397 0.2138488 0.2945753 0.2945753
## 4 939 1097 0.3414961 1.2622723 1.2622723
## 5 443 1345 0.2637946 0.6732157 0.6732157
## 6 982 762 0.4264059 1.9059745 1.9059745
Let’s first draw a boxplot to visualize our data. Although t-test and ANOVA are about means (and that boxplots show median), boxplot can still provide some understanding of the distribution of the data.
# Boxplot to visualize
boxplot(df.ready$p.pov ~ df.ready$county,
main = "Poverty Rate by County",
xlab = "County",
ylab = "Poverty Rate")
We can see that “DeKalb County” and “Fulton County” have similar median values while the other two have distinctively different median values with less dispersion. We can also calculate the descriptive statistics for each county using describeBy in psych package which takes minimum of two arguments (1) a data.frame or a vector of interval/ratio variable and (2) a variable that can be used to group it.
# Describe statistics by county
describeBy(df.ready$p.pov, df.ready$county)
##
## Descriptive statistics by group
## group: Clayton County
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 49 0.23 0.09 0.22 0.22 0.08 0.04 0.46 0.42 0.37 -0.26 0.01
## ------------------------------------------------------------
## group: Cobb County
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 120 0.11 0.09 0.09 0.1 0.08 0 0.4 0.39 1.12 0.51 0.01
## ------------------------------------------------------------
## group: DeKalb County
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 143 0.18 0.11 0.16 0.17 0.11 0.02 0.65 0.64 0.96 1.36 0.01
## ------------------------------------------------------------
## group: Fulton County
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 202 0.2 0.16 0.16 0.18 0.16 0.01 0.81 0.8 0.93 0.37 0.01
Let’s compare the mean of “Fulton County” with other counties. To do so, let us first separate out rows for “Fulton County”, “DeKalb County”, etc, to create four vectors and then run t-tests using them.
# Create four separate vectors for each county
Fulton <- df.ready$p.pov[df.ready$county == "Fulton County"]
Dekalb <- df.ready$p.pov[df.ready$county == "DeKalb County"]
Cobb <- df.ready$p.pov[df.ready$county == "Cobb County"]
Clayton <- df.ready$p.pov[df.ready$county == "Clayton County"]
# T-Tests
t.test(Fulton, Dekalb)
##
## Welch Two Sample t-test
##
## data: Fulton and Dekalb
## t = 1.2151, df = 342.92, p-value = 0.2252
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01088587 0.04607498
## sample estimates:
## mean of x mean of y
## 0.1977611 0.1801666
t.test(Fulton, Cobb)
##
## Welch Two Sample t-test
##
## data: Fulton and Cobb
## t = 6.3226, df = 319.55, p-value = 8.651e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06043043 0.11502784
## sample estimates:
## mean of x mean of y
## 0.1977611 0.1100320
t.test(Fulton, Clayton)
##
## Welch Two Sample t-test
##
## data: Fulton and Clayton
## t = -1.5572, df = 121.59, p-value = 0.122
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.062116446 0.007418873
## sample estimates:
## mean of x mean of y
## 0.1977611 0.2251099
The results indicate that, as we sort of expected from the boxplot, the mean of Fulton is not statistically different from the mean of Dekalb while it is statistically different from the mean of Cobb. Surprisingly, the third test shows that the mean of Fulton is NOT statistically different from the mean of Clayton, which is somewhat deviating from what we saw in the boxplot. This is likely due to the fact that the distribution of Fulton is positively skewed (i.e., mean > median).
Let’s take a closer look at the output from t-test. The image below shows what each component in the output means. The data line indicates that it is a t-test using Fulton and Clayton. The t-statistics, degrees of freedom, and p-value show the ingridients used in the hypothesis testing. It shows that we cannot reject the null hypothesis.
The 95 percent confidence interval shows that we are 95% confident that the true difference in mean between x and y (in this case, x would be the vector named Fulton; y would be the vector named Clayton) falls within the range of -0.062116446 and 0.007418873.
The last red box shows where you can find the mean of x and mean of y (again, x and y are the two vectors you provided as the argument to t.test function, x being the first argument and y the second argument).
Let’s do some more practices using simulated data. Remember qnorm() and pnorm() from the previous lab and that they are two of the four family members of normal distribution functions? We can use one of the other two - rnorm() function to randomly sample a vector of numbers that follows a normal distribution.
The function rnorm() takes 3 arguments, n, mean, and sd. Here, n specifies how many values you want to sample from the distribution; mean and sd specifies the mean and standard deviation of the normal distribution from which you’d like to sample your numbers. If you set, for example, mean to 0 and sd to 1, you would be sampling values from the standard normal distribution.
As an example, let me create two vectors of 200 elements with different means but same standard deviations and plot them to visualize them.
If you are not sure what rnorm() is doing, that is fine. Just note that we are using rnorm() because we can create any data with mean and sd set to however we like them to be. We are going to adjust them to our liking so that we can examine how t-test behaves when you change, for example, mean from 0 to 0.5 and sd from 1 to 2.
Note 1: Notice that because rnorm() function RANDOMLY samples values from a normal distribution each time it is executed, the exact values you get can be different each time you run it. That is expected and is not a problem at all.
Note 2: This is where the custom function is used. The custom function histogram_overlay takes two vectors as the argument and prints out a histogram chart.
# Draw numbers from the distribution
red_vector <- rnorm(200, mean = 0, sd = 1) # Create a vector that has 200 elements from the standard normal distribution.
blue_vector <- rnorm(200, mean = 2, sd = 2) # Create a vector that has 200 elements from a normal distribution with mean of 2 and sd of 2
print(red_vector)
## [1] 0.1559178571 0.4505737990 -1.1088612090 -1.1763689895 0.1728376497
## [6] -1.7902511157 0.0742841048 1.9360782709 0.9093108399 0.1720170758
## [11] 1.5396896972 -0.3236830858 1.3738312936 -0.0258947022 -0.3391545336
## [16] -0.0883769310 0.7922095791 0.0108135423 1.0576726226 1.3699263441
## [21] 0.8223000586 -0.5377335824 0.4321957996 0.4461055794 -0.5750959540
## [26] 1.5529006573 -0.9850049915 -0.9314852991 0.7000478049 0.9121559277
## [31] -0.8457180456 -0.6976869695 0.3987432243 -0.1877708019 -0.6390749506
## [36] 1.1796200456 0.2987316232 -0.2023645477 0.0746709069 0.0101284743
## [41] -0.1494411537 -0.8060275290 1.2767097124 0.7556676946 0.3600823390
## [46] 1.7353650880 -1.1283716572 0.6184879934 0.6397492866 -0.0486213353
## [51] 0.3988992532 0.3572960189 0.5170306451 -1.2620572522 1.3298379517
## [56] 1.3643434716 0.9718808464 -1.4624019661 -0.0940336871 -1.1989479424
## [61] -1.1938742865 -0.9455794225 1.4617764162 -0.4953482843 -1.4273121183
## [66] 0.6974868995 -1.2255133746 -1.3480213840 0.4648101283 -1.0013976796
## [71] 0.9186610713 -1.3880639005 0.2531735778 -0.3014108645 0.1536579200
## [76] 2.3739707189 1.2153577681 -1.3538956812 -0.9314378825 0.2735528346
## [81] -0.7882233261 -0.7765469512 -2.0609694742 -1.4164855293 0.8487767487
## [86] 0.4440551221 0.8341227615 -0.9509570955 -1.4788338662 0.7683658875
## [91] -0.5108213549 -0.4198569308 0.8743104547 0.8057575260 0.3555808307
## [96] -0.1399573093 -1.5169545173 0.5037647723 0.7530727283 0.6911019484
## [101] -0.6440396019 0.4304743603 -0.1625604779 1.1397696050 0.2032198234
## [106] 0.8312017167 -0.5861985842 -0.2184942256 -0.6044933199 -1.0797966001
## [111] 0.2397229972 -0.5709569057 0.1498906550 -0.7619075628 0.0929348576
## [116] -0.3189116339 -1.5796464708 -0.7622227983 0.9994390089 0.2064731663
## [121] -0.5419875157 0.0578255212 0.1357311476 -0.5460246002 1.2397968601
## [126] 0.9400514437 0.3098115899 1.2155899772 1.4918728775 -0.2128565756
## [131] 0.9944014480 -1.8197701617 -1.1612776568 0.1564131859 0.4677413093
## [136] 0.5520295945 -0.1747264066 0.4332383505 -1.4171065371 0.4741815216
## [141] 0.5144593848 -0.8718010144 -0.2231451726 0.1320714800 -1.3168522537
## [146] -0.4743634424 0.2566081707 -0.9409988918 0.8871874577 -0.1927968197
## [151] -0.3428876993 -1.0261528618 2.1962437054 -0.3539498243 -0.0007122792
## [156] 0.1911220927 -0.3159944234 -0.2692576014 0.7379891570 -0.2839424388
## [161] 0.7134172244 1.4060493057 0.2806721441 0.9194637849 -2.4232767246
## [166] 0.3067706098 1.8196424058 1.3233493574 -0.5601422253 0.3576421927
## [171] -0.0399330062 1.0402669880 -0.3643446766 -0.5555210194 0.3209417967
## [176] -0.2082125161 -0.2068179055 0.4849856429 0.1059521973 0.0174513965
## [181] -1.6001065189 -1.7893831024 -1.2408960228 -0.6427210898 1.4845018856
## [186] -0.7402692137 0.4747442200 -1.0191843894 1.0949875290 -0.5475784604
## [191] 0.1844178074 0.1982336927 -1.4558375408 -1.2195017910 -0.0596416506
## [196] 0.0904648539 1.1215122386 1.7289503150 -0.6862096825 0.7413648970
print(blue_vector)
## [1] -0.528299518 2.654707876 5.323961508 1.065540747 0.842478182
## [6] 3.908694335 2.466801750 5.531973356 0.504802525 1.287020950
## [11] 0.643508072 7.494418344 3.411769943 0.009777829 1.884198828
## [16] 1.558825247 2.444173615 -0.332702028 0.132733958 -0.937382158
## [21] -2.500716778 1.762628604 6.850427981 -0.242953805 4.313724875
## [26] 0.423297001 4.482492491 3.255752927 2.038965394 2.911873731
## [31] -0.611083063 -0.105263339 0.795384012 -0.658101484 1.878091070
## [36] 2.718177638 3.724442150 4.736214798 0.256804871 3.930528175
## [41] 2.581450732 -1.852258199 5.047870419 0.933073012 1.439705588
## [46] 2.567872932 1.381533487 4.433385008 3.325824587 2.144265823
## [51] 8.636506462 2.544771998 5.883787873 4.052486671 3.572120161
## [56] 5.068724092 3.320490138 0.683193631 -0.385439941 1.355215773
## [61] 2.837541697 4.680265840 2.642567453 -0.329383248 7.483371910
## [66] 0.694742539 1.435971810 0.760057041 1.927193343 -0.068567665
## [71] 1.876015130 2.037217496 0.555257219 3.728219158 2.159644670
## [76] 3.215019392 1.373769686 -0.817797063 5.425698876 0.961243468
## [81] -3.857579931 -1.912948523 2.953877813 3.406829863 1.380685507
## [86] 3.758383757 2.234913427 3.436859685 2.863847596 2.032041043
## [91] 4.344746765 4.465352862 3.515378693 1.970055492 -0.304028411
## [96] 1.069502713 2.196536964 2.088492360 2.649540440 4.762755290
## [101] 1.828563572 0.970525261 4.711598176 2.273913552 4.357249757
## [106] 0.936374020 -0.733519324 3.733275097 2.488064761 -1.917371072
## [111] 5.522323002 2.871899804 -1.738250772 4.092919253 0.730750551
## [116] 4.030807506 1.870733043 1.846243269 2.516650742 2.088948404
## [121] 2.326779170 1.930401517 3.452942166 4.421079223 -0.997579934
## [126] 0.647754896 3.806559558 -0.929891238 3.009851883 -1.231249695
## [131] 2.035144875 -0.109635091 4.034326565 3.484031892 1.775324767
## [136] 3.216300058 5.072892232 3.217909012 1.863179698 4.576666831
## [141] 2.580699900 4.037410763 0.590570435 3.287509815 4.958546384
## [146] 4.684368394 -2.467445937 4.225680506 2.844361338 4.321381546
## [151] 4.024624002 2.754413368 4.686819595 8.579295558 3.363666999
## [156] 3.371005693 3.011015034 2.170796444 3.260552111 -0.284529308
## [161] 0.982241441 1.216473184 1.843649103 6.397895023 -2.171305679
## [166] -0.311096759 2.941377632 4.124016677 2.377757369 2.936888128
## [171] 4.754433726 2.803774200 3.505029170 -0.357213258 4.145179933
## [176] -0.049243874 2.184162524 0.924986917 2.334841996 4.367138714
## [181] -0.006429856 5.405599370 4.796126709 -2.044586801 5.153350565
## [186] 5.059407906 3.572390472 1.828243389 0.671986044 0.742387134
## [191] 0.599346539 6.944501270 5.527055687 0.790024436 0.593036459
## [196] 0.391254214 4.143892486 3.171529338 2.254594487 -2.204732140
# Visualize the two vectors using the custom function.
histogram_overlay(red_vector, blue_vector)
You can see from the plot that red_vector and blue_vector are centered at different z-score, 0.01 and 2.33 (the red line and blue line, respectively). They also show a clear difference in terms of how much they are spread out (because we had set sd of red_vector to 1 and blue_vector to 2). Let’s see what would happen if we run a t-test using these two vectors.
t.test(red_vector, blue_vector)
##
## Welch Two Sample t-test
##
## data: red_vector and blue_vector
## t = -13.842, df = 266.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.648743 -1.989050
## sample estimates:
## mean of x mean of y
## 0.01170722 2.33060378
The t-statistic is large at -13.84 and, obviously, the p-value is small (0, which is smaller than 0.05). We can reject the null hypothesis that the mean of red_vector and blue_vector is the same.
Next, let’s try putting the two distribution closer to each other and see what happens. This time, let’s keep the mean of red_vector to 0 and change the mean of blue_vector to 0.2 and run t-test.
# Draw numbers from the distribution
red_vector2 <- rnorm(200, mean = 0, sd = 1) # Create a vector that has 1000 elements from the standard normal distribution.
blue_vector2 <- rnorm(200, mean = 0.2, sd = 2) # Create a vector that has 1000 elements from a normal distribution with mean of 0.2 and sd of 1
# Visualize the two vectors
histogram_overlay(red_vector2, blue_vector2)
t.test(red_vector2, blue_vector2)
##
## Welch Two Sample t-test
##
## data: red_vector2 and blue_vector2
## t = -1.3926, df = 288.66, p-value = 0.1648
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5060326 0.0866683
## sample estimates:
## mean of x mean of y
## 0.03159841 0.24128056
Now, the t-statistic is much closer to zero, and the p-value indicates that we fail to reject the null hypothesis at 95% confidence level. Note that even if you run the same code as above, sometimes your p-value can be lower than 0.05. That is because of random sampling!!
For example, if you keep the means at 0 and 0.2, and change the STANDARD DEVIATION, what would happen to t-statistic and p-value?
Let’s go back to our 4-county data and try one-tailed t-test using Fulton county and Cobb county.
Let us set the null hypothesis as: “the true mean of Fulton’s poverty rate is higher than that of Cobb”. This is equivalent to saying “the true different in means is greater than zero”.
Then, the alternative hypothesis would be “the true difference in means is less than zero”. This less is what goes to alternative argument below.
t.test(Fulton, Cobb, alternative = "less")
##
## Welch Two Sample t-test
##
## data: Fulton and Cobb
## t = 6.3226, df = 319.55, p-value = 1
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1106185
## sample estimates:
## mean of x mean of y
## 0.1977611 0.1100320
The output shows that the t-value is much larger than zero (6.3226) and the p-value is essentially 1. We therefore fail to reject the null hypothesis, which makes sense given the mean values and their distributions.
Instead of comparing two different vectors (from different counties), let’s compare the mean value of Fulton County to a given value (e.g., a target poverty rate which Fulton County’s planners want to make sure that Fulton County is not higher than). Let’s say that target value is 10%.
alternative argument.t.test(Fulton, mu = 0.1, alternative = "greater")
##
## One Sample t-test
##
## data: Fulton
## t = 8.764, df = 201, p-value = 3.958e-16
## alternative hypothesis: true mean is greater than 0.1
## 95 percent confidence interval:
## 0.179328 Inf
## sample estimates:
## mean of x
## 0.1977611
Unfortunately, the t-value is much larger than zero at 8.764 and p-value nearly 0. We can reject the null hypothesis (and the planners should work harder).
Different types of t-test In the lecture, Dr. Clayton covered other types of t-test, such as with unequal variance and paired t-test. Here are some sample codes that you can try. Note that the code for paired t-test fails for our data because fulton and dekalb are NOT PAIRED and have different lengths.
# Independent samples with equal variance
t.test(Fulton, Dekalb, var.equal = TRUE)
# Independent samples with NOT equal variance
t.test(Fulton, Dekalb, var.equal = FALSE)
# Paired
t.test(Fulton, Dekalb, paired = TRUE) # This code fails because fulton and dekalb in our data are NOT PAIRED and have different lengths
Since ANOVA was addressed only very briefly in the lecture, I included just two lines of code as a trailer.
The code for ANOVA is even simpler. The format of the code is similar to boxplot. Writing aov(df.ready$p.pov ~ df.read$county) is the same as saying “Run ANOVA on df.ready$p.pov using df.ready$county as the factor variable (i.e., grouping variable) and store it in an R object called anova.pov.” To see what is stored in anova.pov, use summary() function.
# ANOVA
anova.pov <- aov(df.ready$p.pov ~ df.ready$county)
summary(anova.pov)
## Df Sum Sq Mean Sq F value Pr(>F)
## df.ready$county 3 0.738 0.24600 15.32 1.47e-09 ***
## Residuals 510 8.188 0.01606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result from ANOVA indicates that we can reject the null hypothesis that the mean of the poverty rate in the four counties are the same. At least one county has a mean that is not equal to those of other counties.