We have a lot of columns; here’s what they are:
PARTY: Political Party
demFav: Average favorability towards Democrats
repFav: Average favorability towards Republicans
Q2A_A: Donald Trump, R
Q2A_B: Bill Weld, R
Q2A_C: Nancy Pelosi, D
Q2A_D: Alexandria Ocasio-Cortez, D
Q2A_E: Mitch McConnell, R
Q2A_F: Barack Obama, D
Q2B_A: Joe Biden, D
Q2B_B: Cory Booker, D
Q2B_C: Pete Buttigieg, D
Q2B_D: Kamala Harris, D
Q2B_E: Bernie Sanders, D
Q2B_F: Elizabeth Warren, D
Q2B_G: Steve Bullock, D
Q2B_H: Julian Castro, D
Q2B_I: Bill de Blasio, D
Q2B_J: Tulsi Gabbard, D
Q2B_K: Kirsten Gillibrand, D
Q2B_L: Amy Klobuchar, D
Q2B_M: Beto O’Rourke, D
Q2B_N: Tom Steyer, D
Q2B_O: Marianne Williamson, D
Q2B_P: Andrew Yang, D
This week, we’re going to talk about regression and t-tests, but we’ll start by reviewing the correlation we ran last week. Let’s get set up. Start by setting your working directory!
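For example (this path is just a placeholder; swap in whatever folder you saved the data to):
setwd("~/psych-stats/week5")  # hypothetical path; point this at your own data folder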
library(ggplot2)
library(ggeffects)
library(tidyr)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let’s remind ourselves about last week’s analyses by reading in our data:
df <- read.csv("Republican and Democrat Favorability Scale.csv")
Remember that when we visualized the data, we got a sense of what the relationship looks like:
ggplot(data = df, aes(x = demFav,y=repFav)) +
geom_point() +
xlab("Democrat Favorability") +
ylab("Republican Favorability") +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
Next, we calculated the Pearson’s r correlation. Notice that we first need to save the function’s output as an object; in this case, we’re calling the object ‘r’. Then we use the print() function to see our results:
library(psych)
r <- corr.test(df$demFav,df$repFav)
print(r,short=F)
## Call:corr.test(x = df$demFav, y = df$repFav)
## Correlation matrix
## [1] -0.59
## Sample Size
## [1] 2511
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## NA-NA -0.62 -0.59 -0.56 0 -0.62 -0.56
You can also use the cor() function, but it only gives you the correlation coefficient, without the other information (sample size, p-value, confidence interval) that you’ll usually need anyway.
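For example, a minimal sketch (use = "complete.obs" just drops the rows with missing values):
cor(df$demFav, df$repFav, use = "complete.obs")  # returns only the correlation coefficient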
As you can see, correlations are pretty straightforward. Let’s move on to regression.
First, let’s import some data to look at!
df1 <- read.csv("Week 5 Favorability Data.csv")
This is the same data we’ve used before. We can use the two favorability columns (demFav and repFav) to fit a regression. Again, we’re saving the output as an object, ‘reg’, so we can pull it up later:
reg <- lm(repFav~demFav, data=df1)
summary(reg)
##
## Call:
## lm(formula = repFav ~ demFav, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77262 -0.47053 -0.04061 0.40316 2.98520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.35856 0.03939 85.26 <2e-16 ***
## demFav -0.58594 0.01598 -36.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6418 on 2509 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3487
## F-statistic: 1345 on 1 and 2509 DF, p-value: < 2.2e-16
There’s a lot going on here and it may be a little confusing to read, but know that the “Estimate” is the b-coefficient for your predictor, and the Pr(>|t|) column is the p-value. The first “Estimate,” next to (Intercept), is the intercept! Here, we want to know Republican favorability as predicted by Democrat favorability. So, when Democrat favorability is 0, Republican favorability is 3.36. Then, for every one-unit change in Democrat favorability, Republican favorability DECREASES by .59 units (we know it decreases because the slope is negative). So the formula for our regression is Y-hat = 3.36 - .59X.
To summarize: demFav is the predictor and repFav is the outcome variable. The intercept is the value of the outcome when the predictor is at 0, so when demFav is 0, repFav is 3.36 units. For every one-unit increase in demFav, repFav changes by -.59 units: b = -.59, p < .001.
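To see the equation in action, we can plug in a value. Here’s a minimal sketch using predict() with a made-up demFav score of 2:
predict(reg, newdata = data.frame(demFav = 2))  # 3.36 - .59*2, about 2.19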
This is cool, but there’s more we can do! We can find confidence intervals around our values:
confint(reg)
## 2.5 % 97.5 %
## (Intercept) 3.2813153 3.4358015
## demFav -0.6172674 -0.5546108
We’re also using data that doesn’t have a truly meaningful zero, so we don’t get much out of knowing our intercept. Let’s change that by centering our data around the mean:
df$cDemFav <- scale(df$demFav,center=T,scale=F)
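As a quick sanity check (a minimal sketch), the centered variable should now have a mean of essentially zero:
mean(df$cDemFav, na.rm = T)  # ~0, since we subtracted the mean from every score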
Note that scale() with scale = F only centers our data; it doesn’t z-score it. But now, if we run another regression:
reg2 <- lm(repFav~cDemFav,data=df)
summary(reg2)
##
## Call:
## lm(formula = repFav ~ cDemFav, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77262 -0.47053 -0.04061 0.40316 2.98520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.99164 0.01281 155.50 <2e-16 ***
## cDemFav -0.58594 0.01598 -36.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6418 on 2509 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3487
## F-statistic: 1345 on 1 and 2509 DF, p-value: < 2.2e-16
Now our intercept has meaning, because 0 on the centered predictor is the average of demFav! When demFav is at its mean, repFav is 1.99, and for every one-unit increase in demFav there is still a -.59 unit change in repFav.
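In fact, with a centered predictor, the intercept is (approximately) just the mean of the outcome, which we can verify:
mean(df$repFav, na.rm = T)  # ~1.99, matching the intercept above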
We can also standardize our data to get beta coefficients in our regression:
df$zDemFav <- scale(df$demFav)
df$zRepFav <- scale(df$repFav)
reg3 <- lm(zRepFav~zDemFav,data=df)
summary(reg3)
##
## Call:
## lm(formula = zRepFav ~ zDemFav, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2281 -0.5914 -0.0510 0.5067 3.7522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001421 0.016099 -0.088 0.93
## zDemFav -0.590373 0.016097 -36.675 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8067 on 2509 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3487
## F-statistic: 1345 on 1 and 2509 DF, p-value: < 2.2e-16
When zDemFav is 0 (i.e., demFav is at its mean), repFav is just -.001 standard deviations from its own mean, essentially zero, and for every one standard deviation increase in demFav, repFav decreases by .59 standard deviations.
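A nice check: in a simple regression with one predictor, the standardized slope is just Pearson’s r, the same -.59 we got from corr.test() earlier:
coef(reg3)["zDemFav"]  # -0.59, identical to the correlation coefficient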
We can then plot our regression using ggpredict():
gg <- ggpredict(reg,terms=c("demFav"))
plot(gg)+labs(x="Democrat Favorability",y="Republican Favorability",title="")
Now, we’re going to talk about t-tests.
## One Sample
First, let’s start with a one-sample t-test. We might be wondering if our Republican or Democrat favorability is different from some known population mean. We’re going to assume that we know something about the population mean for Democrat favorability; let’s say the population average is 2.5. We can test that:
t.test(df$demFav,mu=2.5)
##
## One Sample t-test
##
## data: df$demFav
## t = -10.473, df = 2522, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 2.301568 2.364155
## sample estimates:
## mean of x
## 2.332861
What do our results tell us? We reject the null hypothesis that the mean = 2.5. The sample mean is 2.33, with p < .001, so we know there’s a significant difference.
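If you need to pull individual numbers out for a write-up, you can save the test as an object first (the name ‘t1’ here is just an example):
t1 <- t.test(df$demFav, mu = 2.5)
t1$statistic  # the t value
t1$p.value    # the p-value
t1$conf.int   # the 95% confidence interval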
A one-sample t-test can be informative, but we usually don’t know the population mean, and we’re often more interested in comparing two groups. So let’s talk about two-sample t-tests instead, using much the same code. We can compare, for example, whether people differ in their favorability towards Democrats versus Republicans. But first, let’s check our homogeneity of variance:
var.test(df$demFav,df$repFav)
##
## F test to compare two variances
##
## data: df$demFav and df$repFav
## F = 1.0152, num df = 2522, denom df = 2513, p-value = 0.7053
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9388773 1.0976992
## sample estimates:
## ratio of variances
## 1.01519
Based on this, what do we know? Should we assume the variances are equal? p is greater than .05, which is good and what we want: because the test was not significant, we can assume the variances are equal, meaning we have met our assumption of homoscedasticity, or homogeneity of variance (the assumption that the groups we want to compare have equal or similar variances). Given that we’ve met our assumption, we can now perform a t-test:
t.test(df$demFav,df$repFav,var.equal=T)
##
## Two Sample t-test
##
## data: df$demFav and df$repFav
## t = 15.112, df = 5035, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2959682 0.3842071
## sample estimates:
## mean of x mean of y
## 2.332861 1.992774
What type of t-test did we just run? Is it independent or paired? The t.test() function runs an independent samples test by default.
But now, is an independent samples t-test the right choice? Keep in mind that every participant was asked about their favorability towards both parties. So, what would make more sense? A paired samples t-test. We can switch to one by setting the ‘paired’ argument to TRUE:
t.test(df$demFav,df$repFav,var.equal= T,paired= T)
##
## Paired t-test
##
## data: df$demFav and df$repFav
## t = 11.935, df = 2510, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.2835196 0.3950007
## sample estimates:
## mean difference
## 0.3392602
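If it helps your intuition, a paired t-test is equivalent to a one-sample t-test on the within-person differences; this sketch should reproduce the same t, df, and p-value as above:
t.test(df$demFav - df$repFav, mu = 0)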
There’s also another way to do a t-test based on how your data are organized. Let’s say we want to only look at Democrats and Republicans and see how they vary in how they feel about the Republican party. First, we can isolate just them:
df2 <- subset(df1,df1$PARTY=="Republican"|df1$PARTY=="Democrat")
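A quick way to confirm the subset worked is to count the rows per group (other parties should be gone, or show 0 if PARTY kept its old factor levels):
table(df2$PARTY)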
And then we can run another t-test! Testing variance first:
var.test(repFav~PARTY,data=df2)
##
## F test to compare two variances
##
## data: repFav by PARTY
## F = 0.74949, num df = 892, denom df = 620, p-value = 8.491e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6476723 0.8656475
## sample estimates:
## ratio of variances
## 0.749488
What can we see here? Do we meet the assumption of homoscedasticity? Here, p < .05, so the variances are significantly different and we do NOT meet the assumption. Luckily for us, if we type ?t.test in the console and read the documentation, we learn that when the variances are not equal, R applies the Welch correction (an adjustment to the degrees of freedom), so we can still run the test.
Let’s now run the t-test. Notice that we’ve set the var.equal argument to FALSE:
t.test(repFav~PARTY,data=df2,var.equal= F)
##
## Welch Two Sample t-test
##
## data: repFav by PARTY
## t = -41.05, df = 1206.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Democrat and group Republican is not equal to 0
## 95 percent confidence interval:
## -1.302398 -1.183584
## sample estimates:
## mean in group Democrat mean in group Republican
## 1.495334 2.738325
Democrats (M = 1.50, SD = .53), Republicans (M = 2.74, SD = .61), t(1206.8) = -41.05, 95% CI [-1.30, -1.18], p < .001
R output gives you means for the groups, but you may also want standard deviations when writing up your results in APA style. Here’s how you would find standard deviation by group:
aggregate(repFav~PARTY, df2, sd)
## PARTY repFav
## 1 Democrat 0.5296485
## 2 Republican 0.6117943
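The same pattern works for any summary function; for example, the group means:
aggregate(repFav~PARTY, df2, mean)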
If we wanted the SDs for the earlier t-tests that compared whole columns (demFav vs. repFav), we can just run the standard deviation function on each column, as we’ve done before:
sd(df$demFav,na.rm=T)
## [1] 0.801605
sd(df$repFav,na.rm=T)
## [1] 0.7955854
We may want to graph t-tests as bar graphs with confidence intervals.
dfSumm <- df2 %>%
group_by(PARTY) %>%
summarise(
n=n(), #Tells it what n is for SE calculation later
sd = sd(repFav, na.rm=T), #Finds SD for SE calculation
repFav = mean(repFav,na.rm=T) #Mean for later
)%>%
mutate( se=sd/sqrt(n)) %>% #Calculates SE
mutate( ci=se * 1.96) #Calculates confidence interval. You may want to add in your t-crit
ggplot(dfSumm, aes(PARTY, repFav)) +
geom_col(fill=c("blue", "red")) +
geom_errorbar(aes(ymin = repFav-ci, ymax = repFav+ci),width=0.3)+
xlab("Political Party")+
ylab("Republican Favorability")
Did anyone else besides me get an error? Why did this happen? Earlier, we loaded a lot of different packages, including plyr (which Rmisc loads for us). The plyr library also has its own summarise() function, and if plyr ends up attached after dplyr, R will try to use plyr’s version, which doesn’t respect group_by(). So, we need to specify that we want the summarise() function from the dplyr library. All we need to do is go back to the line with the summarise() function and type dplyr:: right before it.
Sometimes, you may want to store your data in different ways. Most data sets are in what we call wide format, meaning each row represents a different participant and each column represents a different value for that participant. But sometimes we may want to transform our data to LONG format, in which each participant gets a separate row for each value we collected.
For t-tests, you can largely use data regardless of whether your groups are in separate columns or in the same column with a grouping variable elsewhere, but other analyses may require you to change the layout (or you may just want to). Two functions that may be especially helpful are pivot_wider and pivot_longer.
dfLong <- pivot_longer(df,cols=c("repFav","demFav"),
names_to="polParty",values_to="partyFav")
Let’s take a look at the new data frame to understand what it did
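For example, peeking at the first few rows and comparing row counts shows that each participant now has two rows, one per party:
head(dfLong)
nrow(df); nrow(dfLong)  # the long frame should have twice as many rows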
I would do this to make a bar chart, mostly because I just taught myself how to make a bar chart with data stored this way, and Google is not much help for making one any other way.
dfLong$polParty <- as.factor(dfLong$polParty)
dfSumm2 <- dfLong %>%
group_by(polParty) %>%
dplyr::summarise(
n=n(), #Tells it what n is for SE calculation later
sd = sd(partyFav, na.rm=T), #Finds SD for SE calculation
partyFav = mean(partyFav,na.rm=T) #Mean for later
)%>%
mutate( se=sd/sqrt(n)) %>% #Calculates SE
mutate( ci=se * 1.96) #Calculates confidence interval. You may want to add in your t-crit
ggplot(dfSumm2, aes(polParty, partyFav)) +
geom_col(fill=c("blue", "red")) +
geom_errorbar(aes(ymin = partyFav-ci, ymax = partyFav+ci),width=0.3)+
xlab("Political Party")+
ylab("Favorability")+
scale_x_discrete(labels=c("repFav" = "Republican Party", "demFav" = "Democratic Party"))
You may also want to do the reverse, going from long back to wide format:
dfWide <- pivot_wider(dfLong,names_from="polParty",values_from="partyFav")
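As before, a quick peek should confirm the round trip back to (roughly) one row per participant:
head(dfWide)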