We have a lot of columns; here’s what they are:
PARTY: Political Party
demFav: Average favorability towards Democrats
repFav: Average favorability towards Republicans
Q2A_A: Donald Trump, R
Q2A_B: Bill Weld, R
Q2A_C: Nancy Pelosi, D
Q2A_D: Alexandria Ocasio-Cortez, D
Q2A_E: Mitch McConnell, R
Q2A_F: Barack Obama, D
Q2B_A: Joe Biden, D
Q2B_B: Cory Booker, D
Q2B_C: Pete Buttigieg, D
Q2B_D: Kamala Harris, D
Q2B_E: Bernie Sanders, D
Q2B_F: Elizabeth Warren, D
Q2B_G: Steve Bullock, D
Q2B_H: Julian Castro, D
Q2B_I: Bill de Blasio, D
Q2B_J: Tulsi Gabbard, D
Q2B_K: Kirsten Gillibrand, D
Q2B_L: Amy Klobuchar, D
Q2B_M: Beto O’Rourke, D
Q2B_N: Tom Steyer, D
Q2B_O: Marianne Williamson, D
Q2B_P: Andrew Yang, D
This week, we’re going to talk about regression and t-tests, but we’ll start by reviewing the correlation we ran last week. Let’s get set up. Start by setting your working directory!
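For example (this path is just a placeholder; swap in whatever folder you saved the data to):
setwd("~/psych-stats/week5")  # hypothetical path; point this at your own data folder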
library(ggplot2)
library(ggeffects)
library(tidyr)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let’s remind ourselves about last week’s analyses by reading in our data:
df <- read.csv("Republican and Democrat Favorability Scale.csv")
Remember that when we visualized the data, we got a sense of what the relationship looks like:
ggplot(data = df, aes(x = demFav,y=repFav)) +
geom_point() +
xlab("Democrat Favorability") +
ylab("Republican Favorability") +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
Next, we calculated the Pearson’s r correlation. Notice that we first need to save the function’s output as an object; in this case, we’re calling the object ‘r’. Then we use the print() function to see our results:
library(psych)
r <- corr.test(df$demFav,df$repFav)
print(r,short=F)
## Call:corr.test(x = df$demFav, y = df$repFav)
## Correlation matrix
## [1] -0.59
## Sample Size
## [1] 2511
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## NA-NA -0.62 -0.59 -0.56 0 -0.62 -0.56
You can also use the cor() function, but it only gives you the correlation coefficient, without the other information (sample size, p-value, confidence interval) that you’ll usually need anyway.
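For example, a minimal sketch (use = "complete.obs" just drops the rows with missing values):
cor(df$demFav, df$repFav, use = "complete.obs")  # returns only the correlation coefficient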
As you can see, correlations are pretty straightforward. Let’s move on to regression.
First, let’s import some data to look at!
df1 <- read.csv("Week 5 Favorability Data.csv")
This is the same data we’ve used before. We can use the two favorability columns (demFav and repFav) to fit a regression. Again, we’re saving the output as an object, ‘reg’, so we can pull it up later:
reg <- lm(repFav~demFav, data=df1)
summary(reg)
##
## Call:
## lm(formula = repFav ~ demFav, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77262 -0.47053 -0.04061 0.40316 2.98520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.35856 0.03939 85.26 <2e-16 ***
## demFav -0.58594 0.01598 -36.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6418 on 2509 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3487
## F-statistic: 1345 on 1 and 2509 DF, p-value: < 2.2e-16
There’s a lot going on here and it may be a little confusing to read, but know that the “Estimate” is the b-coefficient for your predictor, and the Pr(>|t|) column is the p-value. The first “Estimate,” next to (Intercept), is the intercept! Here, we want to know Republican favorability as predicted by Democrat favorability. So, when Democrat favorability is 0, Republican favorability is 3.36. Then, for every one-unit change in Democrat favorability, Republican favorability DECREASES by .59 units (we know it decreases because the slope is negative). So the formula for our regression is Y-hat = 3.36 - .59X.
To summarize: demFav is the predictor and repFav is the outcome variable. The intercept is the value of the outcome when the predictor is at 0, so when demFav is 0, repFav is 3.36 units. For every one-unit increase in demFav, repFav changes by -.59 units: b = -.59, p < .001.
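To see the equation in action, we can plug in a value. Here’s a minimal sketch using predict() with a made-up demFav score of 2:
predict(reg, newdata = data.frame(demFav = 2))  # 3.36 - .59*2, about 2.19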
This is cool, but there’s more we can do! We can find confidence intervals around our values:
confint(reg)
## 2.5 % 97.5 %
## (Intercept) 3.2813153 3.4358015
## demFav -0.6172674 -0.5546108
We’re also using data that doesn’t have a truly meaningful zero, so we don’t get much out of knowing our intercept. Let’s change that by centering our data around the mean:
df$cDemFav <- scale(df$demFav,center=T,scale=F)
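As a quick sanity check (a minimal sketch), the centered variable should now have a mean of essentially zero:
mean(df$cDemFav, na.rm = T)  # ~0, since we subtracted the mean from every score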
Note that scale() with scale = F only centers our data; it doesn’t z-score it. But now, if we run another regression:
reg2 <- lm(repFav~cDemFav,data=df)
summary(reg2)
##
## Call:
## lm(formula = repFav ~ cDemFav, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77262 -0.47053 -0.04061 0.40316 2.98520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.99164 0.01281 155.50 <2e-16 ***
## cDemFav -0.58594 0.01598 -36.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6418 on 2509 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3487
## F-statistic: 1345 on 1 and 2509 DF, p-value: < 2.2e-16
Now our intercept has meaning, because 0 on the centered predictor is the average of demFav! When demFav is at its mean, repFav is 1.99, and for every one-unit increase in demFav there is still a -.59 unit change in repFav.
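In fact, with a centered predictor, the intercept is (approximately) just the mean of the outcome, which we can verify:
mean(df$repFav, na.rm = T)  # ~1.99, matching the intercept above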
We can also standardize our data to get beta coefficients in our regression:
df$zDemFav <- scale(df$demFav)
df$zRepFav <- scale(df$repFav)
reg3 <- lm(zRepFav~zDemFav,data=df)
summary(reg3)
##
## Call:
## lm(formula = zRepFav ~ zDemFav, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2281 -0.5914 -0.0510 0.5067 3.7522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001421 0.016099 -0.088 0.93
## zDemFav -0.590373 0.016097 -36.675 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8067 on 2509 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3487
## F-statistic: 1345 on 1 and 2509 DF, p-value: < 2.2e-16
When zDemFav is 0 (i.e., demFav is at its mean), repFav is just -.001 standard deviations from its own mean, essentially zero, and for every one standard deviation increase in demFav, repFav decreases by .59 standard deviations.
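A nice check: in a simple regression with one predictor, the standardized slope is just Pearson’s r, the same -.59 we got from corr.test() earlier:
coef(reg3)["zDemFav"]  # -0.59, identical to the correlation coefficient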
We can then plot our regression using ggpredict():
gg <- ggpredict(reg,terms=c("demFav"))
plot(gg)+labs(x="Democrat Favorability",y="Republican Favorability",title="")
Now, we’re going to talk about t-tests.
## One Sample
First, let’s start with a one-sample t-test. We might be wondering if our Republican or Democrat favorability is different from some known population mean. We’re going to assume that we know something about the population mean for Democrat favorability; let’s say the population average is 2.5. We can test that:
t.test(df$demFav,mu=2.5)
##
## One Sample t-test
##
## data: df$demFav
## t = -10.473, df = 2522, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 2.301568 2.364155
## sample estimates:
## mean of x
## 2.332861
What do our results tell us? We reject the null hypothesis that the mean = 2.5. The sample mean is 2.33, with p < .001, so we know there’s a significant difference.
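If you need to pull individual numbers out for a write-up, you can save the test as an object first (the name ‘t1’ here is just an example):
t1 <- t.test(df$demFav, mu = 2.5)
t1$statistic  # the t value
t1$p.value    # the p-value
t1$conf.int   # the 95% confidence interval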
A one-sample t-test can be informative, but we usually don’t know the population mean, and we’re often more interested in comparing two groups. So let’s talk about two-sample t-tests instead, using much the same code. We can compare, for example, whether people differ in their favorability towards Democrats versus Republicans. But first, let’s check our homogeneity of variance:
var.test(df$demFav,df$repFav)
##
## F test to compare two variances
##
## data: df$demFav and df$repFav
## F = 1.0152, num df = 2522, denom df = 2513, p-value = 0.7053
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9388773 1.0976992
## sample estimates:
## ratio of variances
## 1.01519
Based on this, what do we know? Should we assume the variances are equal? p is greater than .05, which is good and what we want: because the test was not significant, we can assume the variances are equal, meaning we have met our assumption of homoscedasticity, or homogeneity of variance (the assumption that the groups we want to compare have equal or similar variances). Given that we’ve met our assumption, we can now perform a t-test:
t.test(df$demFav,df$repFav,var.equal=T)
##
## Two Sample t-test
##
## data: df$demFav and df$repFav
## t = 15.112, df = 5035, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2959682 0.3842071
## sample estimates:
## mean of x mean of y
## 2.332861 1.992774
What type of t-test did we just run? Is it independent or paired? The t.test() function runs an independent samples test by default.
But now, is an independent samples t-test the right choice? Keep in mind that every participant was asked about their favorability towards both parties. So, what would make more sense? A paired samples t-test. We can switch to one by setting the ‘paired’ argument to TRUE:
t.test(df$demFav,df$repFav,var.equal= T,paired= T)
##
## Paired t-test
##
## data: df$demFav and df$repFav
## t = 11.935, df = 2510, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.2835196 0.3950007
## sample estimates:
## mean difference
## 0.3392602
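If it helps your intuition, a paired t-test is equivalent to a one-sample t-test on the within-person differences; this sketch should reproduce the same t, df, and p-value as above:
t.test(df$demFav - df$repFav, mu = 0)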
There’s also another way to do a t-test based on how your data are organized. Let’s say we want to only look at Democrats and Republicans and see how they vary in how they feel about the Republican party. First, we can isolate just them:
df2 <- subset(df1,df1$PARTY=="Republican"|df1$PARTY=="Democrat")
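A quick way to confirm the subset worked is to count the rows per group (other parties should be gone, or show 0 if PARTY kept its old factor levels):
table(df2$PARTY)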
And then we can run another t-test! Testing variance first:
var.test(repFav~PARTY,data=df2)
##
## F test to compare two variances
##
## data: repFav by PARTY
## F = 0.74949, num df = 892, denom df = 620, p-value = 8.491e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6476723 0.8656475
## sample estimates:
## ratio of variances
## 0.749488
What can we see here? Do we meet the assumption of homoscedasticity? Here, p < .05, so the variances are significantly different and we do NOT meet the assumption. Luckily for us, if we type ?t.test in the console and read the documentation, we learn that when the variances are not equal, R applies the Welch correction (an adjustment to the degrees of freedom), so we can still run the test.
Let’s now run the t-test. Notice that we’ve set the var.equal argument to FALSE:
t.test(repFav~PARTY,data=df2,var.equal= F)
##
## Welch Two Sample t-test
##
## data: repFav by PARTY
## t = -41.05, df = 1206.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Democrat and group Republican is not equal to 0
## 95 percent confidence interval:
## -1.302398 -1.183584
## sample estimates:
## mean in group Democrat mean in group Republican
## 1.495334 2.738325
Democrats (M = 1.50, SD = .53), Republicans (M = 2.74, SD = .61), t(1206.8) = -41.05, 95% CI [-1.30, -1.18], p < .001
R output gives you means for the groups, but you may also want standard deviations when writing up your results in APA style. Here’s how you would find standard deviation by group:
aggregate(repFav~PARTY, df2, sd)
## PARTY repFav
## 1 Democrat 0.5296485
## 2 Republican 0.6117943
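The same pattern works for any summary function; for example, the group means:
aggregate(repFav~PARTY, df2, mean)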
If we wanted the SDs for the earlier t-tests that compared whole columns (demFav vs. repFav), we can just run the standard deviation function on each column, as we’ve done before:
sd(df$demFav,na.rm=T)
## [1] 0.801605
sd(df$repFav,na.rm=T)
## [1] 0.7955854
We may want to graph t-tests as bar graphs with confidence intervals.
dfSumm <- df2 %>%
group_by(PARTY) %>%
summarise(
n=n(), #Tells it what n is for SE calculation later
sd = sd(repFav, na.rm=T), #Finds SD for SE calculation
repFav = mean(repFav,na.rm=T) #Mean for later
)%>%
mutate( se=sd/sqrt(n)) %>% #Calculates SE
mutate( ci=se * 1.96) #Calculates confidence interval. You may want to add in your t-crit
ggplot(dfSumm, aes(PARTY, repFav)) +
geom_col(fill=c("blue", "red")) +
geom_errorbar(aes(ymin = repFav-ci, ymax = repFav+ci),width=0.3)+
xlab("Political Party")+
ylab("Republican Favorability")
Did anyone else besides me get an error? Why did this happen? Earlier, we loaded a lot of different packages, including plyr (which Rmisc loads for us). The plyr library also has its own summarise() function, and if plyr ends up attached after dplyr, R will try to use plyr’s version, which doesn’t respect group_by(). So, we need to specify that we want the summarise() function from the dplyr library. All we need to do is go back to the line with the summarise() function and type dplyr:: right before it.
Sometimes, you may want to store your data in different ways. Most data sets are in what we call wide format, meaning each row represents a different participant and each column represents a different value for that participant. But sometimes we may want to transform our data to LONG format, in which each participant gets a separate row for each value we collected.
For t-tests, you can largely use data regardless of whether your groups are in separate columns or in the same column with a grouping variable elsewhere, but other analyses may require you to change the layout (or you may just want to). Two functions that may be especially helpful are pivot_wider and pivot_longer.
dfLong <- pivot_longer(df,cols=c("repFav","demFav"),
names_to="polParty",values_to="partyFav")
Let’s take a look at the new data frame to understand what it did
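For example, peeking at the first few rows and comparing row counts shows that each participant now has two rows, one per party:
head(dfLong)
nrow(df); nrow(dfLong)  # the long frame should have twice as many rows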
I would do this to make a bar chart, mostly because I just taught myself how to make a bar chart with data stored this way, and Google is not much help for making one any other way.
dfLong$polParty <- as.factor(dfLong$polParty)
dfSumm2 <- dfLong %>%
group_by(polParty) %>%
dplyr::summarise(
n=n(), #Tells it what n is for SE calculation later
sd = sd(partyFav, na.rm=T), #Finds SD for SE calculation
partyFav = mean(partyFav,na.rm=T) #Mean for later
)%>%
mutate( se=sd/sqrt(n)) %>% #Calculates SE
mutate( ci=se * 1.96) #Calculates confidence interval. You may want to add in your t-crit
ggplot(dfSumm2, aes(polParty, partyFav)) +
geom_col(fill=c("blue", "red")) +
geom_errorbar(aes(ymin = partyFav-ci, ymax = partyFav+ci),width=0.3)+
xlab("Political Party")+
ylab("Favorability")+
scale_x_discrete(labels=c("repFav" = "Republican Party", "demFav" = "Democratic Party"))
You may also want to do the reverse, going from long back to wide format:
dfWide <- pivot_wider(dfLong,names_from="polParty",values_from="partyFav")
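As before, a quick peek should confirm the round trip back to (roughly) one row per participant:
head(dfWide)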