CUNY SPS - Master of Science in Data Science

Discussion 12

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this week’s discussion, I found an “Avengers” dataset that include characters’ names along with information about gender, appearances in comic books, current member of the Avengers, years since joining Avengers, and whether they are honorary members among other variables.

I will first load the data from FiveThirtyEight’s github repository:

# load libraries
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# load data
avengers <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv", header = TRUE)

We will also need to perform some data transformation in order to exclude some rows, that for some reason, don’t have a character’s name.

avengers2 <- avengers %>% filter(Name.Alias != "")

We have two dichotomous terms in our data, “gender” and “current”, I will use “gender” and “appearances” for the dichotomous vs. quantitative interaction. The “years since joining” will be squared in order to convert it into our quadratic term.

We will also convert our categorical variables into integers:

gender_factor <- as.factor(avengers2$Gender)
gender <- as.integer(gender_factor)

current_factor <- as.factor(avengers2$Current.)
current <- as.integer(current_factor)

honorary_factor <- as.factor(avengers2$Honorary)
honorary <- as.integer(honorary_factor)

interaction <- gender * avengers2$Appearances

years_join_square <- avengers2$Years.since.joining^2

avengers3 <- cbind(avengers2, gender, current, honorary, interaction, years_join_square)

We will take a look at the relationship of a character’s appearances in a comic book and how some of our other variables in the data might be able to explain this number.

m_Appearance <- lm(avengers3$Appearances ~ avengers3$current + avengers3$gender + avengers3$Years.since.joining + avengers3$honorary + avengers3$interaction + avengers3$years_join_square )

summary(m_Appearance)

## 
## Call:
## lm(formula = avengers3$Appearances ~ avengers3$current + avengers3$gender + 
##     avengers3$Years.since.joining + avengers3$honorary + avengers3$interaction + 
##     avengers3$years_join_square)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.97  -44.25   -1.13   17.68  672.71 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    226.76018   59.28959   3.825 0.000189 ***
## avengers3$current               25.46011   17.95020   1.418 0.158076    
## avengers3$gender              -148.63617   18.15678  -8.186 8.98e-14 ***
## avengers3$Years.since.joining    1.81657    0.93884   1.935 0.054811 .  
## avengers3$honorary               3.97441   18.68166   0.213 0.831804    
## avengers3$interaction            0.50593    0.00676  74.840  < 2e-16 ***
## avengers3$years_join_square     -0.01844    0.00830  -2.222 0.027726 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.6 on 156 degrees of freedom
## Multiple R-squared:  0.9788, Adjusted R-squared:  0.978 
## F-statistic:  1201 on 6 and 156 DF,  p-value: < 2.2e-16

We can see from our first model that the variable with the biggest p-value is “honorary”, which means that this variable does not help us explain the number of appearances of a character in comic books.

We will use the backward elimination process to remove this variable in order to improve the model’s performance:

m_Appearance <- lm(avengers3$Appearances ~ avengers3$current + avengers3$gender + avengers3$Years.since.joining + avengers3$interaction + avengers3$years_join_square)

summary(m_Appearance)

## 
## Call:
## lm(formula = avengers3$Appearances ~ avengers3$current + avengers3$gender + 
##     avengers3$Years.since.joining + avengers3$interaction + avengers3$years_join_square)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.72  -42.97   -1.88   17.11  672.86 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.349e+02  4.499e+01   5.222 5.53e-07 ***
## avengers3$current              2.460e+01  1.743e+01   1.411   0.1602    
## avengers3$gender              -1.482e+02  1.797e+01  -8.245 6.18e-14 ***
## avengers3$Years.since.joining  1.860e+00  9.140e-01   2.034   0.0436 *  
## avengers3$interaction          5.059e-01  6.733e-03  75.132  < 2e-16 ***
## avengers3$years_join_square   -1.893e-02  7.948e-03  -2.382   0.0184 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.3 on 157 degrees of freedom
## Multiple R-squared:  0.9788, Adjusted R-squared:  0.9781 
## F-statistic:  1450 on 5 and 157 DF,  p-value: < 2.2e-16

In one of my attempts I have also tried to remove the variable “current”, which seems to have a p-value a bit higher than we would want to for the model, but in doing so it brings the p-value of all the other variables up and the r-squared down, thus I will include it in my final model.

We can tell from our summary that our residuals are not nearly normally distributed. We do have a median value near zero, however, the min and max values are not the same magnitud and neither are the first and third quartile.

Gender and the quadratic variable seem to be negatively correlated to appearances, while current, years since joining, and the interaction variable seem to be positively correlated. Most of our variables have standard errors that are either too high or too low, but they all have p-values small enough to say that they are relevant in predicting appearances.

Additionally, our \(R^2\) 0.9788 indicate that we have a good-fitting model, and the p-value, which is close to zero, tells us that our variables are statistically significant predictors of our characeters appearing in comic books. Perhaps we may even say that we are overfitting the model with the chosen variables, but for the sake of this exercise we will leave the model as is for now.

Lastly, we will conduct a residual analysis to further substantiate our evidence that we have a good linear model.

plot(fitted(m_Appearance), resid(m_Appearance))

hist(m_Appearance$residuals)

qqnorm(m_Appearance$residuals)
qqline(m_Appearance$residuals)

There seems to be a pattern in the residual plot where variability is more apparent at lower number of appearances. Additionally, the residuals distribution is skewed to the right, and this is evident in the normal probability plot that shows how the points deviate from the line. The normal residuals condition has not been met, thus the linear model does not seem appropriate according to our residual analysis.

CUNY SPS - Master of Science in Data Science - DATA605

Mario Pena

April 22, 2020

Discussion 12