Project Milestone 3

Milestone 3: Perform a Simple Linear Regression

Using the same variables from Milestone 2.

Continuing simple linear regression analysis:

–Create an ANOVA table and produce the F-Statistic and discuss the R-Squared value.

fec<-read.csv("fec_independent_expenditures.csv", header = TRUE)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.1
## v readr   1.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidyr)

opposition<-fec%>%
  filter(support_oppose_indicator=="O", report_year>=2013, candidate_office=="P",
         candidate_id!="P80002801",
         candidate_id!="")%>%
  group_by(candidate_id)%>%
  summarise(expenditure_amount=sum(expenditure_amount, na.rm = TRUE))

support<-fec%>%
  filter(support_oppose_indicator=="S", report_year>=2013, candidate_office=="P",
         candidate_id!="P00547984",
         candidate_id!="P20002721",
         candidate_id!="P60007671",
         candidate_id!="P60007895",
         candidate_id!="P60009354",
         candidate_id!="P60019239",
         candidate_id!="P60021102",
         candidate_id!="P60022118",
         candidate_id!="P60023215",
         candidate_id!="P80003353",
         candidate_id!="")%>%
  group_by(candidate_id)%>%
  summarise(expenditure_amount=sum(expenditure_amount, na.rm = TRUE))

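# Rows are paired by position, so both summaries must list the same
# candidates in the same order; stop early if they do not.
stopifnot(identical(as.character(support$candidate_id),
                    as.character(opposition$candidate_id)))
fec<-data.frame(candidate_id=support$candidate_id,
                opp_amt=opposition$expenditure_amount,  # total spent opposing
                sup_amt=support$expenditure_amount)     # total spent supporting
x<-fec$sup_amt  # predictor: support spending
y<-fec$opp_amt  # response: opposition spending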
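
The support and opposition totals are lined up row by row, which only works when both tables hold the same candidates in the same order. A minimal sketch of pairing them by candidate ID instead, using base R's merge(); the toy IDs and amounts below are made up for illustration and are not from the FEC file:

```r
# Hypothetical toy totals; IDs and amounts are invented for illustration only.
support <- data.frame(candidate_id = c("A1", "B2", "C3"),
                      expenditure_amount = c(100, 250, 75))
opposition <- data.frame(candidate_id = c("B2", "C3", "D4"),
                         expenditure_amount = c(40, 300, 10))

# merge() keeps only candidates present in both tables and pairs them by ID,
# so no manual list of excluded IDs is needed to make the rows line up.
paired <- merge(support, opposition, by = "candidate_id",
                suffixes = c("_sup", "_opp"))
names(paired) <- c("candidate_id", "sup_amt", "opp_amt")
paired
```

With this approach a candidate that appears in only one table is dropped automatically rather than silently shifting the pairing.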

mod<-lm(y~x)
AN<-anova(mod)
ssres<-AN$`Sum Sq`[2]
ssreg<-AN$`Sum Sq`[1]
n<-dim(fec)[1]
AN

## Analysis of Variance Table
## 
## Response: y
##           Df     Sum Sq    Mean Sq F value    Pr(>F)    
## x          1 2.5267e+16 2.5267e+16  45.936 3.227e-06 ***
## Residuals 17 9.3510e+15 5.5006e+14                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression:
- Degrees of Freedom: 1
- Sum of Squares: 2.5267e+16
- Mean Squares: 2.5267e+16
- F-Value: 45.936
- P-Value: 3.227e-06

Residual:
- Degrees of Freedom: 17
- Sum of Squares: 9.3510e+15
- Mean Squares: 5.5006e+14

Total:
- Degrees of Freedom: 18
- Sum of Squares: 3.4618e+16

Our F-Statistic value is:

f_stat<-(ssreg/1)/(ssres/(n-2))
f_stat

## [1] 45.93577

This matches the F value reported in the ANOVA table.
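
As a cross-check, the F-Statistic and its P-Value can be recomputed from the sums of squares alone. The sketch below uses the rounded values printed in the ANOVA table rather than the fitted model, so the last digits may differ slightly from the table:

```r
# Values copied from the ANOVA table in this report (rounded as printed).
ssreg <- 2.5267e16   # regression sum of squares
ssres <- 9.3510e15   # residual sum of squares
n     <- 19          # observations, since residual Df = n - 2 = 17

# F is the regression mean square divided by the residual mean square.
f_stat <- (ssreg / 1) / (ssres / (n - 2))

# The P-Value is the upper tail of the F distribution with (1, n - 2) Df.
p_value <- pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```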

Our R-Squared value is:

R2<-(AN$`Sum Sq`[1]/3.4618e+16)
R2

## [1] 0.7298875

This tells us that roughly 73% of the variation in our “y” variable (opposition spending) is explained by the regression on our “x” variable (support spending).
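
The total sum of squares does not have to be typed in by hand; it is the sum of the regression and residual sums of squares. A short sketch, again using the rounded values from the ANOVA table, that also shows how R-Squared and the F-Statistic are tied together in simple linear regression:

```r
# Values copied from the ANOVA table in this report (rounded as printed).
ssreg <- 2.5267e16   # regression sum of squares
ssres <- 9.3510e15   # residual sum of squares
n     <- 19          # observations

sstot <- ssreg + ssres          # total sum of squares
R2    <- ssreg / sstot          # share of the variation in y that is explained

# With one predictor, R-Squared can also be recovered from the F-Statistic.
f_stat    <- (ssreg / 1) / (ssres / (n - 2))
R2_from_F <- f_stat / (f_stat + (n - 2))

c(R2 = R2, R2_from_F = R2_from_F)
```

Both values agree, which is a quick way to confirm that the table entries are internally consistent.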

–Create diagnostic plots to assess assumptions. Summarize your findings.

plot(mod)

Of the diagnostic plots that plot(mod) produces, the three we’re interested in are the Residuals vs. Fitted plot, the Normal Q-Q plot, and the Residuals vs. Leverage plot. Based on these plots, the residuals look approximately normal, with few outliers that would affect our regression model. Three observations stand out as potentially influential in the Q-Q plot and the Residuals vs. Fitted plot. However, the Residuals vs. Leverage plot shows us that one of those observations won’t affect our model as much as we thought. The other two lie outside the Cook’s distance contour lines, which tells us that they are influential and that we might need to refit the model without them to see how much our estimates depend on them.
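
The influence judgments read off the plots can also be checked numerically. The sketch below is not the FEC model: it uses R's built-in cars data as a stand-in so that it runs on its own, and the 4/n cutoff is a common rule of thumb that we assume here, not something the plots themselves define:

```r
# Stand-in model on R's built-in `cars` data; NOT the FEC model from this report.
demo_mod <- lm(dist ~ speed, data = cars)

cd  <- cooks.distance(demo_mod)   # influence of each observation on the fit
lev <- hatvalues(demo_mod)        # leverage of each observation

# Rule-of-thumb threshold (an assumption; other cutoffs such as 0.5 or 1 are also used).
cutoff  <- 4 / nrow(cars)
flagged <- which(cd > cutoff)

# Observations that the rule of thumb marks as influential.
data.frame(row = flagged, cooks_d = cd[flagged], leverage = lev[flagged])
```

Applied to our own model, the same calls would be cooks.distance(mod) and hatvalues(mod).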

Summarize your findings:

Overall, our regression model fits the data fairly well. Our R-Squared value tells us that almost three quarters of the variation in our “y” variable (opposition spending) is explained by our “x” variable (support spending). This value might rise if we remove the influential observations identified in the diagnostic plots, especially the two observations that lie outside the Cook’s distance contours.
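
Whether removing influential observations actually raises R-Squared can be tested by refitting and comparing. A minimal sketch of that comparison, again on the built-in cars data as a stand-in and with the assumed 4/n Cook's distance cutoff:

```r
# Stand-in model on R's built-in `cars` data; NOT the FEC model from this report.
demo_mod <- lm(dist ~ speed, data = cars)

# Keep only observations at or below the rule-of-thumb cutoff (an assumption).
cd   <- cooks.distance(demo_mod)
keep <- cd <= 4 / nrow(cars)

# Refit on the retained rows and compare R-Squared before and after.
refit <- lm(dist ~ speed, data = cars[keep, ])
c(full    = summary(demo_mod)$r.squared,
  trimmed = summary(refit)$r.squared,
  dropped = sum(!keep))
```

Any change should be reported alongside the number of points dropped, since removing observations from a small sample can change the fit substantially.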