install.packages("tidyverse")
install.packages("car")
install.packages("haven")
install.packages("corrr")
install.packages("janitor")
install.packages("rstatix")
library(car) # Load the library car so we can run leveneTest, install if needed
library(tidyverse)
library(haven) #import files produced in other statistical software
library(corrr) #Includes functions to evaluate correlations within the tidyverse
library(janitor) #To analyze cross-tabulation – chi-squared
library(rstatix) #One of many packages used for t-testsImport the “anes_pilot_2016.dta” file and create a dataframe NESdta. This is an Stata file so we need to use the haven package. Hint: uses read_dta()
# Using R to import an Stata filesd
NESdta<- anes_pilot_2016View what we just imported. Did we get what we expected?
head(NESdta)NANow let’s create a new dataframe NESdata_sub with a subset of variables: fttrump, pid3, birthyr, gender, ftobama. Hint: use the select function from the tidyverse. Hint: make sure you have load your tidyverse package. The variables in our subset come from the American National Election Studies, where fttrump is a Feeling thermometer for Trump in 2016 (from 1 to 100, where 1 = cold and 100 = warm), pid3 is the respondent’s party affiliation (1 = Democrat, 2 = Independent, 3 = Republican), birthyr is the respondent’s birth year, gender is the respondent’s gender (1 = male and 2 = female) and ftobama is a feeling thermometer for Obama in 2016 (from 1 to 100, where 1 = cold and 100 = warm)
Using the pipe operator [ %>%] do the following:
replace all values where fttrump>100 for NA replace all values where ftobama>998 for NA create a new variable female drop missing values on the subset data
NESdta_sub <- NESdta %>%
dplyr::select(fttrump, pid3, birthyr, gender, ftobama) %>%
mutate(fttrump = replace(fttrump, fttrump > 100, NA),
ftobama = replace(ftobama, ftobama == 998, NA),
female = ifelse(gender == 2, 1, 0)) %>%
as.data.frame() %>%
drop_na()Using the NESdta_sub, create a new variable [Party] where: [Hint: use the mutate and case_when functions ] Party is equal to “Democrat” if pid3 is 1, Party is equal to “Republican” if pid3 is 2, Party is equal to “Independent” if pid3 is 3
NESdta_sub <- NESdta_sub %>%
mutate(Party = case_when(pid3 == 1 ~ "Democrat",
pid3 == 2 ~ "Republican", pid3 == 3 ~ "Independent"))Let’s use the glimpse function to check if the dataset looks okay
glimpse(NESdta_sub)Rows: 1,195
Columns: 7
$ fttrump <dbl> 1, 28, 100, 0, 13, 61, 5, 85, 7…
$ pid3 <dbl> 1, 3, 2, 1, 4, 3, 1, 2, 3, 1, 2…
$ birthyr <dbl> 1960, 1957, 1963, 1980, 1974, 1…
$ gender <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ftobama <dbl> 100, 39, 1, 89, 1, 0, 73, 0, 12…
$ female <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Party <chr> "Democrat", "Independent", "Rep…
t-statistics tests can be used to test difference in means between two (or more) groups using a t-distribution.
To compare the mean in a sample against a hypothetical population mean. For example,let’s compare the sample mean of approval for candidate Donald Trump (in the 2016 elections) against the hypothetical population mean above 50 (out of a 100 point scale, which we would interpret as the point where people view him more positively than negatively).
But first, let’s confirm what is the mean approval rating for Trump:
NESdta_sub %>%
summarize(mean_approval = mean(fttrump, na.rm = T))NANow let’s run the one sample t-test providing the population mean in the[mu] argument equal to 50. since it’s a one sample t-test and there is no other group in our comparison, we compare our sample value of 38.37845 against a null model using the [~ 1]
# t-test fit
NESdta_sub %>%
t_test(fttrump ~ 1, mu = 50, detailed = TRUE)We conclude that the difference between the mean we find in the sample and the hypothetical mean of 50 is statistically significant. The t-statistic is -11.07 and the p-value is < 0.001, so the difference in means is larger than what typically considered “statistically significant.”
We use the chi-square test to test whether the differences observed in a contingency tables are statistically significant, that is, if they are not due to random sampling error.
Let’s use the function tabyl from the package [janitor] to produce a contingency table:
NESdta_sub %>%
tabyl(Party, female) Party 0 1
Democrat 188 268
Independent 208 171
Republican 129 151
<NA> 44 36
We still have some missing values in our [Party] variable, so let’s exclude them using the drop. na():
NESdta_sub<-NESdta_sub %>% drop_na() Next, let’s derive a contingency table for Party and female
NESdta_sub %>%
tabyl(Party, female) Party 0 1
Democrat 188 268
Independent 208 171
Republican 129 151
And then we just add the chisq.test() to test the significance
NESdta_sub %>%
tabyl(Party, female) %>%
chisq.test()
Pearson's Chi-squared test
data: .
X-squared = 15.64, df = 2, p-value =
0.0004017
The chi-squared value is 15.64, and the corresponding p-value is 0.0004, which is lower than our 0.05 value. The results indicate that differences in party affiliations between male and female are much higher than we would have expected as a result of random chance.
meet -During the past 12 months, have you attended a meeting to talk about political- sign -Worn a campaign button, put a campaign sticker on your car, or placed a sign- pidlean -Party Id - Lean- vaccine -Favor/oppose requiring children to be vaccinated in order to attend public school- marstat -marital status-
NESsta_sub2<-NESdta %>% select(meet, info, sign, pidlean, vaccine, marstat)And remember to add the %>% drop_na() at the end to exclude the missing values for the purpose of this exercise (remember that in a full research analysis, we have to first explore if the missing values in a variable might be a sign of bias/issues in the data collection, that is, it’s important to explore whether and why some groups are more likely to show missing answers for a specific variable)
NESdta_sub2<- NESsta_sub2 %>% mutate(
plean_cat = case_when(pidlean == 1 ~ "Closer to the Republican Party",
pidlean == 2 ~ "Closer to the Democratic Party",
pidlean == 3 ~ "Neither")) %>% drop_na()vac2 = “Neither” if vaccine is 4
vac2 = “Oppose” if vaccine is either 5, 6 or 7
NESdta_sub2<- NESdta_sub2 %>% mutate(
vac2 = case_when(vaccine == 1 ~ "Favor",
vaccine == 2 ~ "Favor",
vaccine == 3 ~ "Favor",
vaccine == 4 ~ "Neither",
vaccine == 5 ~ "Oppose",
vaccine == 6 ~ "Oppose",
vaccine == 7 ~ "Oppose")) %>% drop_na()signcat = “Have NOT done this in the past 12 months” if sign =2
NESdta_sub2 <- NESdta_sub2 %>% mutate(signcat = case_when(sign == "1" ~ "Have done this in the past 12 months",
sign == "2" ~ "Have NOT done this in the past 12 months"), na.rm = TRUE) %>% drop_na()
NESdta_sub2tabyl(NESdta_sub2, plean_cat, vac2) plean_cat Favor Neither
Closer to the Democratic Party 101 9
Closer to the Republican Party 88 13
Neither 125 57
Oppose
8
26
32
tabyl(NESdta_sub2, plean_cat, signcat) plean_cat
Closer to the Democratic Party
Closer to the Republican Party
Neither
Have done this in the past 12 months
22
33
21
Have NOT done this in the past 12 months
96
94
193
tabyl(NESdta_sub2, plean_cat, vac2) %>% chisq.test()
Pearson's Chi-squared test
data: .
X-squared = 37.233, df = 4, p-value =
1.613e-07
Your answer here:
p-value = 1.613e-07 means that the test hypothesis is false or should be rejected.
tabyl(NESdta_sub2, plean_cat, sign) %>% chisq.test()
Pearson's Chi-squared test
data: .
X-squared = 15.586, df = 2, p-value =
0.0004126
Your answer here: p-value = 0.000412 means that the test hypothesis is false or should be rejected.
Let’s import the file w12datdep1.csv.
# Import our first data set
w12datdep1 <- read.csv(file.choose())Notice the difference between view() and our usual head().
# View what we just imported. Did we get what we expected?
View(w12datdep1)# Use the function for the t test
t.test(w12datdep1$Posttest, w12datdep1$Pretest, paired = TRUE, alternative = "greater")
Paired t-test
data: w12datdep1$Posttest and w12datdep1$Pretest
t = 2.4495, df = 24, p-value = 0.01099
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
0.3618424 Inf
sample estimates:
mean difference
1.2
Let’s look at a scatterplot before calculating a correlation
# Let's look at a scatterplot before calculating a correlation
ggplot(w12datdep1, aes(Pretest,Posttest)) +
geom_point() +
xlab("Pretest") +
ylab("Posttest") +
ggtitle("Pretest and Posttest Reading Achievement Scores") +
theme_minimal() +
scale_fill_brewer(palette="Blues") NA
NA
NANow let’s calculate the correlation:
cor(w12datdep1$Pretest, w12datdep1$Posttest)[1] 0.05071834
Not only is the difference significant, but the size of the difference is nontrivial.
Let’s import the w12data1.csv file.
# Import our first data set
w12data1 <- read_csv(file.choose())Rows: 36 Columns: 2── Column specification ─────────────────────────
Delimiter: ","
dbl (2): Qual.Marriage, Qual.PC
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View what we just imported. Did we get what we expected?
head(w12data1)Let’s calculate the correlation
# Calculate the correlation
cor(w12data1$Qual.Marriage, w12data1$Qual.PC)[1] -0.09996305
# Look at a scatterplot of the data
plot(w12data1$Qual.Marriage, w12data1$Qual.PC,
xlab = "Marriage Quality", ylab = "Parent-Child Interaction Quality",
main = "Scatterplot")+
theme_minimal() NULL
Now let’s obtain the correlation and a test of statistical significance
# Get the correlation and a test of statistical significance
cor.test(w12data1$Qual.Marriage, w12data1$Qual.PC)
Pearson's product-moment correlation
data: w12data1$Qual.Marriage and w12data1$Qual.PC
t = -0.58581, df = 34, p-value = 0.5619
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4148737 0.2363342
sample estimates:
cor
-0.09996305
And if we were interested in getting a one-tail test:
# Ask for one-tailed test results
cor.test(w12data1$Qual.Marriage, w12data1$Qual.PC, alternative = "greater")
Pearson's product-moment correlation
data: w12data1$Qual.Marriage and w12data1$Qual.PC
t = -0.58581, df = 34, p-value = 0.7191
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
-0.3684516 1.0000000
sample estimates:
cor
-0.09996305
Let’s import the w12data2.csv file.
# Import our first data set
w12data2 <- read.csv(file.choose())
head(w12data2)NAWhen we run the t-test with the var.equal = TRUE, we are telling the t.test() function that our groups have equal variance, we will learn a test for this later.
# Run the t test
t.test(MemoryTest ~ Group, data = w12data2, var.equal = TRUE)
Two Sample t-test
data: MemoryTest by Group
t = -0.1371, df = 58, p-value = 0.8914
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-1.560009 1.360009
sample estimates:
mean in group 1 mean in group 2
5.433333 5.533333
We are using MemoryTest ~ Group:, as MemoryTest is the score we are comparing across our groups, with 1 used for the treatment group and 2 for the control group in the variable Group.
And the var.equal = TRUE tells the t.test() function that the groups have equal variance.
The results, t = –0.1371, df = 58, p = .8914, tell us the t, the degrees of freedom for our test, and the exact p.
Let’s look at the boxplot between memory test and group, and save the plot in a new object [memory], so we can access some of the summary stats used to construct it.
memory <- boxplot(MemoryTest ~ Group, data = w12data2, main = "Memory Test Scores",
xlab = "Groups", ylab = "Scores")Now we can use our new memory object to look at the distribution of numbers in the two groups using the stats option after running the plot. Based on what we can see from the plot, we can see the variance is different between our groups, but we don’t really know whether it is VERY different. So we will use a statistic to answer this question, the Levene’s test.
# Show the numbers used to create the boxplot
# [1] Lower whisker
# [2] 25th percentile value
# [3] 50th percentile value - median
# [4] 75th percentile value
# [5] Uppoer whisker
memory$stats [,1] [,2]
[1,] 1 2.0
[2,] 3 4.0
[3,] 5 5.5
[4,] 8 7.0
[5,] 15 9.0
We will now use Levene’s test from the package [car] to test the null hypothesis that the variances are equal, and the research hypothesis is that they are different. Let’s use p < .05 to signal that we should reject our null hypothesis.
# Test to see if the variances are equal in the two groups
leveneTest(MemoryTest ~ as.factor(Group), data = w12data2)Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 3.3429 0.07264 .
58
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Levene’s test is F(1, 58) = 3.34, p = .07. The p for Levene’s test exceeds .05, so we fail to reject the null hypothesis of equal variance. We are going to omit var.equal statement, and what this means is that R will use a different t-test -the Welch’s t test-. This is going to change the number of degrees of freedom in the calculation to minimize Type I or Type II error. In our specific example, the p-value with the Welch’s test still leads to the same conclusion, and we fail to reject the null hypothesis of no difference in the memory test results between the treatment and control groups.
# Re-run t.test with var.equal omitted
t.test(MemoryTest ~ Group, data = w12data2)
Welch Two Sample t-test
data: MemoryTest by Group
t = -0.1371, df = 47.635, p-value = 0.8915
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-1.566803 1.366803
sample estimates:
mean in group 1 mean in group 2
5.433333 5.533333
# Look at help for t.test
?t.test# Import our first data set
w12data3 <- read.csv(file.choose())cor(w12data3$Motivation, w12data3$GPA)[1] 0.4340226
cor.test(w12data3$Motivation, w12data3$GPA)
Pearson's product-moment correlation
data: w12data3$Motivation and w12data3$GPA
t = 2.5493, df = 28, p-value = 0.01656
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.08742338 0.68688681
sample estimates:
cor
0.4340226
Your answer here: significant positive correlation but not absolute.
True or false? The more highly you are motivated, the more you will study. Your answer here:
True
# Import our first data set
w12data4 <- read.csv(file.choose())cor(w12data4$Income, w12data4$Level.of.Education)[1] 0.6289264
cor.test(w12data4$Income, w12data4$Level.of.Education)
Pearson's product-moment correlation
data: w12data4$Income and w12data4$Level.of.Education
t = 3.4321, df = 18, p-value = 0.002973
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2582914 0.8381728
sample estimates:
cor
0.6289264
Your answer here: positive correlation between education and income
# Import our first data set
w12data5 <- read.csv(file.choose())cor(w12data5$Hours.of.Study, w12data5$Grade)[1] 0.6858935
cor.test(w12data5$Hours.of.Study, w12data5$Grade)
Pearson's product-moment correlation
data: w12data5$Hours.of.Study and w12data5$Grade
t = 2.6659, df = 8, p-value = 0.02854
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.09903704 0.91875110
sample estimates:
cor
0.6858935
Your answer here: positive correlation among grade and study hours
0.6858935^2[1] 0.4704499
Your answer here: 0.4704499 The shared variance equals: 47%