1. Today’s Data

Load the data set “anes_2016.csv”

anes <- read.csv("anes_2016.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'anes_2016.csv'

The names and descriptions of variables in the data set

Name Description
age Age of individual at time of the survey
attentiontopolitics Some people don’t pay much attention to politics. How about you? 1 Not much / 5 Very much attention
attentiontonews Some people don’t pay much attention to news. How about you? 1 Not much / 5 Very much attention
votein2012 Did you vote for President in 2012? 1 Yes / 0 No
votein2016 Did you vote for President in 2016? 1 Yes / 0 No
female Indicator variable for whether individual identifies as female (1) or not (0)
white Indicator variable for whether individual identifies as white (1) or not (0)
latino Indicator variable for whether individual identifies as latino (1) or not (0)
black Indicator variable for whether individual identifies as black (1) or not (0)
asian Indicator variable for whether individual identifies as asian (1) or not (0)
registered Indicator variable for whether an individual was registered to vote (1) or not (0)
gotochurch Indicator variable for whether an individual goes to church (1) or not (0)
internetathome Indicator variable for whether an individual has internet at home (1) or not (0)
homeowner Indicator variable for whether an individual is a homeowner (1) or not (0)
clintonfeel Self-placement on feeling regarding Clinton from dislike (1) to like a lot (100)
trumpfeel Self-placement on feeling regarding Trump from dislike (1) to like a lot (100)
conservatism Self-placement on the level of conservatism from Low (1) to High (7)

2. R Cheat sheet

1. Quick review

Operators

  • # used to comment code
  • <- Assignment operator used to create new objects
  • $ used to access an element inside an object, such as a variable inside a dataframe (data$variable)
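  • == relational operator that tests whether two values are equal (returns TRUE or FALSE)
  • [] used to subset an object, e.g. to keep only the observations that meet a condition (data$variable[data$group == 1])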
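
The operators above can be combined in a few lines. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in, not the ANES data) so that it runs on its own.

```r
# Hypothetical toy data frame (not the ANES data)
toy <- data.frame(
  age    = c(25, 40, 31, 58, 47),
  female = c(1, 0, 1, 1, 0)
)

toy$age                                  # $ pulls one variable out of the data frame
toy$female == 1                          # == returns TRUE/FALSE for each row
women_age <- toy$age[toy$female == 1]    # [] keeps only the rows where the test is TRUE
mean(women_age)                          # mean age among rows with female == 1
```

The same pattern should carry over to the lab data, for example anes$age[anes$female == 1].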

Functions

  • read.csv("filename.csv") reads CSV files
  • head() or tail() shows the first/last observations in a dataframe
  • dim() provides the dimensions of a dataframe
  • mean(Data$Variable) calculates the mean of a variable
  • median(Data$Variable) calculates the median of a variable
  • sd(Data$Variable) calculates the standard deviation
  • var(Data$Variable) calculates the variance
  • table(Data$Variable) creates a frequency table
  • prop.table(table(Data$Variable)) creates a table of proportions
  • table(Data$Variable1, Data$Variable2) creates a two-way frequency table
  • hist(Data$Variable) creates a histogram
  • plot(Data$Variable1, Data$Variable2) creates a scatter plot
  • cor(Data$Variable1, Data$Variable2) calculates the correlation coefficient
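
As a quick refresher, here is a minimal sketch of several of these functions applied to a small made-up data frame. The object toy and its columns are hypothetical and chosen only so the code runs without the lab file.

```r
# Hypothetical toy data frame (not the ANES data)
toy <- data.frame(
  voted  = c(1, 0, 1, 1, 0, 1),
  female = c(1, 0, 1, 0, 0, 1),
  age    = c(22, 35, 47, 51, 29, 64)
)

dim(toy)                                  # number of rows, then number of columns
head(toy, 3)                              # first three observations
mean(toy$age)                             # mean of a variable
median(toy$age)                           # median of a variable
sd(toy$age)                               # standard deviation
vote_counts <- table(toy$voted)           # frequency table
vote_props  <- prop.table(vote_counts)    # table of proportions
two_way     <- table(toy$female, toy$voted)  # two-way table: rows = female, columns = voted
vote_counts
vote_props
two_way
```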

R functions for today

  • lm() fits a linear model. It requires a formula of the type Y~X, where Y identifies the outcome variable and X identifies the predictor. Usage: lm(data$y_var ~ data$x_var) or lm(y_var ~ x_var, data = data)

  • summary(lm()) provides a summary of the fitted linear model.

  • abline() adds a straight line to a graph. To add the fitted line, we specify as the main argument the object that contains the output of the lm() function. fit<-lm(Y~X);abline(fit)

2. Prediction

  • Step 1 - Fit a Model:

    • we use predictors \(X\) to explain/predict the outcome \(Y\)
  • Step 2 - Make a Prediction:

    • we use the model fitted in Step 1 to get a predicted value \(\widehat{Y}\) for a new observation

\[\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i\]

Interpretation of \(\hat{\alpha}\) and \(\hat{\beta}\)

  • \(\hat{\alpha}\) is the estimated intercept: the predicted value of \(Y\) when \(X_i = 0\).
  • \(\hat{\beta}\) is the estimated slope. A positive \(\hat{\beta}\) corresponds to a positive relationship, and a negative \(\hat{\beta}\) to a negative relationship.
  • The slope can be interpreted as follows: changing \(X_i\) by some amount \(\Delta X\) changes the predicted value \(\widehat{Y}_i\) by \(\hat{\beta} \Delta X\).
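
To make the two interpretations concrete, here is a small sketch with made-up coefficients. The values of alpha_hat and beta_hat and the helper predict_y() are hypothetical, not estimates from the ANES data.

```r
# Hypothetical coefficients, chosen only for illustration
alpha_hat <- 2.0   # estimated intercept
beta_hat  <- 0.5   # estimated slope

# Helper (hypothetical name): prediction from the fitted line
predict_y <- function(x) alpha_hat + beta_hat * x

predict_y(0)                  # at X = 0 the prediction equals the intercept
predict_y(6) - predict_y(4)   # raising X by 2 changes the prediction by beta_hat * 2
beta_hat * 2
```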

R functions

  • lm \(\rightarrow\) linear model
lm(data$outcomeYvariable  ~ data$predictorXVariable) 

or

lm(outcomeYvariable  ~ predictorXVariable, data=dataname) 

The outcome variable \(Y\) is predicted by the predictor variable \(X\)
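
The two ways of calling lm() should give the same estimates. The sketch below checks this on a small made-up data frame; toy, x_var and y_var are hypothetical names.

```r
# Hypothetical toy data (not the ANES data)
toy <- data.frame(
  x_var = c(1, 2, 3, 4, 5),
  y_var = c(2.1, 2.9, 4.2, 4.8, 6.0)
)

fit_dollar  <- lm(toy$y_var ~ toy$x_var)        # variables referenced with $
fit_dataarg <- lm(y_var ~ x_var, data = toy)    # variables looked up in data = toy

coef(fit_dollar)    # intercept and slope from the first call
coef(fit_dataarg)   # same numbers, only the coefficient labels differ

# unname() drops the labels so that only the values are compared
all.equal(unname(coef(fit_dollar)), unname(coef(fit_dataarg)))
```

The data = form is usually the more convenient of the two when you later want predictions for new observations with predict().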

2.1 Practice Prediction

Let’s estimate a linear model predicting attention to politics from education

\[\widehat{\text{Attention to Politics}}_i = \hat{\alpha} + \hat{\beta} \, \text{Education}_i\]

#
#

\[\widehat{\text{Attention to Politics}}_i = 3.364 + 0.093 \, \text{Education}_i\]

My friend Jane’s education level is 6. What is her predicted value for attention to politics?

3.364 + 0.093 * 6
## [1] 3.922
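
Instead of typing rounded coefficients by hand, the prediction can also be computed from the fitted object. This is a minimal sketch on made-up data; toy, educ and attention are hypothetical names and the numbers are not the ANES estimates.

```r
# Hypothetical toy data (not the ANES data)
toy <- data.frame(educ = c(1, 2, 3, 4, 5, 6, 7))
toy$attention <- 3 + 0.1 * toy$educ + c(0.2, -0.1, 0, -0.2, 0.1, 0.1, -0.1)

fit <- lm(attention ~ educ, data = toy)

# Prediction "by hand": intercept + slope * 6
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["educ"]] * 6

# Same prediction with predict(); newdata must use the predictor's column name
by_predict <- predict(fit, newdata = data.frame(educ = 6))

by_hand
by_predict
```

Both approaches use the unrounded coefficients, so they agree with each other and can differ slightly from a calculation based on rounded values.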

2.1 Summarizing predicted LM & R-Squared

To view a summary of the results we use summary(lm(data$Y ~ data$X))

\(R^2\) (R-Squared):

  • What proportion of variation in the outcome is explained by the model?
  • Ranges from 0 to 1

#summary(lm(anes$attentiontopolitics  ~ anes$education))

Note: \(R^2=0.018 \rightarrow\) this model explains 1.8% of the variation of the outcome (attention to politics)
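
\(R^2\) can also be pulled out of the summary as a number. The sketch below uses made-up data (toy, x_var and y_var are hypothetical names) and also checks a known property of a linear model with a single predictor: \(R^2\) equals the squared correlation between X and Y.

```r
# Hypothetical toy data (not the ANES data)
toy <- data.frame(
  x_var = c(1, 2, 3, 4, 5, 6),
  y_var = c(2.0, 3.5, 3.0, 5.5, 4.5, 6.0)
)

fit <- lm(y_var ~ x_var, data = toy)
fit_summary <- summary(fit)          # full summary of the fitted model

r_squared <- fit_summary$r.squared   # R-squared stored as a single number
r_squared

# With one predictor, R-squared equals the squared correlation coefficient
cor(toy$x_var, toy$y_var)^2
```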

2.2 Visualizing results

#abline(lm_attention) # adds a fitted line 

3. Practice

Estimate a linear model where you predict trumpfeel by education

1. Exploration before the prediction

1.1 Scatter plot - create a scatter plot showing the relationship

FORMAT: plot(data$variable1,data$variable2)

# read.csv

1.2 Correlation - calculate the correlation between trumpfeel and education

To calculate the correlation coefficient between two variables in R, we use the function cor(). FORMAT: cor(data$variable1, data$variable2)
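
As a reminder of how cor() behaves, here is a short sketch on made-up data. The names toy, x_var and y_var are hypothetical, and the values are not from the ANES file.

```r
# Hypothetical toy data (not the ANES data)
toy <- data.frame(
  x_var = c(1, 2, 3, 4, 5, 6),
  y_var = c(9, 7, 8, 4, 5, 2)
)

r_xy <- cor(toy$x_var, toy$y_var)   # correlation between x_var and y_var
r_yx <- cor(toy$y_var, toy$x_var)   # the order of the two variables does not matter

r_xy   # negative here, because y_var tends to fall as x_var rises
r_yx
```

The coefficient always lies between -1 and 1. Its sign gives the direction of the linear relationship, and values closer to -1 or 1 indicate a stronger linear relationship.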

# read.csv

Here write your interpretation:

2. Linear Model

2.1 Estimating a linear model

The fitted line: \[\hat Y_i=\hat\alpha+\hat\beta X_i\]

  • \(\hat\alpha\): the estimated intercept
  • \(\hat\beta\): the estimated slope

To estimate the coefficients of the linear model using the least squares method in R, we use the function lm(), which stands for linear model. This function requires that we specify as the main argument a formula of the type Y~X, where Y identifies the outcome variable and X identifies the predictor. There are two ways to specify a linear model.

FORMAT1: lm(data$outcomevariable ~ data$predictorvariable)

FORMAT2: lm(outcomevariable ~ predictorvariable, data=data)

variables: trumpfeel and education

# read.csv
  • the estimated intercept (\(\hat\alpha\)) is:

  • the estimated slope (\(\hat\beta\)), the coefficient for the variable education, is:

Here write the fitted linear model

Now write the interpretation

  • (Intercept):
  • (slope):

2.2 Summarize the linear model estimated

We can get more detailed information, including \(R^2\), with the function summary(). FORMAT: summary(lm(outcomevariable ~ predictorvariable, data=data))

# read.csv

What proportion of variation in the outcome is explained by the model? Answer here

2.3 Adding the fitted line to the scatter plot

abline() adds a straight line to a graph. To add the fitted line, we specify as the main argument the object that contains the output of the lm() function. The line is added to the most recently created plot, so abline() gives an error if no plot exists yet. If you get the error message ‘plot.new has not been called yet’, run plot() and abline() together.

FORMAT: fit<-lm(Y~X); abline(fit)
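
The order of the calls matters, as the sketch below illustrates on made-up data: the scatter plot is drawn first and the fitted line is added on top of it. The names toy, x_var and y_var are hypothetical.

```r
# Hypothetical toy data (not the ANES data)
toy <- data.frame(
  x_var = c(1, 2, 3, 4, 5, 6),
  y_var = c(2.0, 3.5, 3.0, 5.5, 4.5, 6.0)
)

fit <- lm(y_var ~ x_var, data = toy)   # fit the model and store the output

plot(toy$x_var, toy$y_var)   # plot(x, y): the predictor goes first, the outcome second
abline(fit)                  # adds the fitted line to the plot created just above
```

Note that plot() takes the X variable first, while the formula in lm() puts the Y variable first.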

4. More practice

Estimate a linear model predicting attentiontonews with age. In other words, Y=attentiontonews, X=age. Please, repeat what you did before and interpret the outcome.

# read.csv