Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(readr)
library(ggplot2)
library(dplyr)
library(kernlab)
urlfile <- "https://raw.githubusercontent.com/catcho1632/DATA605/main/exams.csv"
exam <- read_csv(url(urlfile))
## Rows: 1000 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, race/ethnicity, parental level of education, lunch, test pr...
## dbl (3): math score, reading score, writing score
## 
head(exam)
## # A tibble: 6 × 8
##   gender `race/ethnicity` parental level…¹ lunch test …² math …³ readi…⁴ writi…⁵
##   <chr>  <chr>            <chr>            <chr> <chr>     <dbl>   <dbl>   <dbl>
## 1 female group D          some college     stan… comple…      59      70      78
## 2 male   group D          associate's deg… stan… none         96      93      87
## 3 female group D          some college     free… none         57      76      77
## 4 male   group B          some college     free… none         70      70      63
## 5 female group D          associate's deg… stan… none         83      85      86
## 6 male   group C          some high school stan… none         68      57      54
## # … with abbreviated variable names ¹​`parental level of education`,
## #   ²​`test preparation course`, ³​`math score`, ⁴​`reading score`,
## #   ⁵​`writing score`
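
Note that several column names contain spaces, so they must be wrapped in backticks wherever they appear. An optional alternative (not used in the rest of this write-up) is to rename them once with dplyr; `exam2` below is just an illustrative name.

exam2 <- exam %>%
  rename(math = `math score`,
         reading = `reading score`,
         writing = `writing score`,
         parent_ed = `parental level of education`)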

This model predicts writing score from the predictor variables math score, reading score, gender, and parental level of education. The pairwise plots below show that the dichotomous variables have different mean scores across their levels, so a continuous predictor such as math score may have a different effect at each level of a dichotomous variable. For that reason, the dichotomous vs. quantitative interaction term will be gender and math score.

library(GGally)
df <- subset(exam,select = c('gender','parental level of education','math score','reading score','writing score'))
df %>% ggpairs()
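
As a side note, a dichotomous vs. quantitative interaction can also be written directly in R's formula syntax; the one-liner below is only an illustrative sketch (with `*` expanding to both main effects plus the interaction) and is not the model fit later in this write-up.

#illustrative only: gender * `math score` expands to gender + `math score` + gender:`math score`;
#using `:` instead would add the interaction term alone
lm(`writing score` ~ gender * `math score`, data = exam)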

The second dichotomous predictor is parental level of education, collapsed to two levels: observations where the value is exactly 'high school' are labelled 'hs only', and every other level (including 'some high school') is labelled 'higher ed'. Gender is also recoded as an indicator, with male as 0 and female as 1.

#all the unique education level values
unique(df$`parental level of education`)
## [1] "some college"       "associate's degree" "some high school"  
## [4] "bachelor's degree"  "master's degree"    "high school"
#collapse parental education to two levels: 'hs only' when the value is exactly 'high school', 'higher ed' otherwise
df$`parental level of education` <- ifelse(df$`parental level of education`=='high school','hs only','higher ed')

#recode gender as a 0/1 indicator: 0 = male, 1 = female
df$gender <- ifelse(df$gender=='male',0,1)
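
As a quick sanity check (not part of the original workflow), the recodes can be tabulated; because lm() treats a character predictor as a factor whose levels sort alphabetically, 'higher ed' will serve as the reference level in the model below.

#diagnostic only: counts for each recoded level
table(df$`parental level of education`)
table(df$gender)
#lm() converts the character column to a factor; "higher ed" sorts first,
#so it becomes the reference and the model reports a dummy for "hs only"
levels(factor(df$`parental level of education`))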

Lastly, the quadratic predictor variable will be reading score, since reading score is likely to have a stronger (and possibly nonlinear) relationship with writing score than, say, math score.
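
For context, a quadratic term is often entered alongside the linear term; the two sketches below show standard ways to do that and are alternatives, not the model actually fit in this assignment.

#alternative specifications (not the model fit below): keep the linear reading
#score term alongside its square, written explicitly ...
lm(`writing score` ~ `reading score` + I(`reading score`^2), data = df)
#... or with orthogonal polynomials, which reduces collinearity between the terms
lm(`writing score` ~ poly(`reading score`, 2), data = df)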

The linear model is fit below. In its summary, the median residual is close to 0, the 1Q and 3Q are of similar magnitude, and so are the minimum and maximum. These traits are consistent with approximately normally distributed residuals.

Interpretation of coefficients (estimates shown in the model summary below):

- In general, a coefficient is the estimated change in the response for a one-unit change in that predictor, holding the other predictors constant.
- Intercept (32.78): the expected writing score when every term is 0, i.e., the interaction term is 0 (a male student or a math score of 0), the reading score is 0, and the parent has education beyond high school.
- Quadratic term (0.0069): a one-unit increase in the squared reading score is associated with a 0.0069-point increase in the expected writing score; equivalently, the effect of reading score grows as reading score increases.
- Dichotomous term (-1.56): students whose parents have a high school education only are expected to score about 1.56 points lower in writing than the 'higher ed' reference group, all else held constant.
- Interaction term (0.039): because gender is coded 0 for male and 1 for female, the term is 0 for male students; for female students, each additional math point is associated with a 0.039-point increase in the expected writing score.

A worked prediction using these estimates appears after the model summary.

model <- lm(df$`writing score` ~ I(df$gender*df$`math score`)+ df$`parental level of education` + I((df$`reading score`)**2))
model
## 
## Call:
## lm(formula = df$`writing score` ~ I(df$gender * df$`math score`) + 
##     df$`parental level of education` + I((df$`reading score`)^2))
## 
## Coefficients:
##                             (Intercept)  
##                               32.775991  
##          I(df$gender * df$`math score`)  
##                                0.038873  
## df$`parental level of education`hs only  
##                               -1.559147  
##               I((df$`reading score`)^2)  
##                                0.006882
summary(model)
## 
## Call:
## lm(formula = df$`writing score` ~ I(df$gender * df$`math score`) + 
##     df$`parental level of education` + I((df$`reading score`)^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.5182  -3.2183   0.0721   3.3883  16.9650 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              3.278e+01  4.564e-01  71.817  < 2e-16
## I(df$gender * df$`math score`)           3.887e-02  5.098e-03   7.625 5.70e-14
## df$`parental level of education`hs only -1.559e+00  3.859e-01  -4.040 5.75e-05
## I((df$`reading score`)^2)                6.882e-03  8.843e-05  77.828  < 2e-16
##                                            
## (Intercept)                             ***
## I(df$gender * df$`math score`)          ***
## df$`parental level of education`hs only ***
## I((df$`reading score`)^2)               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.003 on 996 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8891 
## F-statistic:  2672 on 3 and 996 DF,  p-value: < 2.2e-16
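
To make the interpretation concrete, a fitted value can be computed by hand from the estimates above; the student profile used here is purely illustrative.

#illustration: expected writing score for a female student (gender = 1) with a
#math score of 70 and a reading score of 75 whose parent has a high school
#education only, plugging into the fitted equation
32.776 + 0.038873 * (1 * 70) - 1.559147 + 0.006882 * 75^2   # about 72.6

#the fitted values for the observed data are also available directly
head(fitted(model))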

In the Residuals vs Fitted plot, the residuals are roughly evenly scattered about 0 with no obvious pattern, and the normal Q-Q plot follows a nearly straight diagonal line, so the residuals are approximately normal with near-constant variance. These diagnostics suggest the linear model is an appropriate fit for predicting writing scores.

par(mfrow=c(2,2))
plot(model)
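
As an optional supplement to the four diagnostic plots (not part of the original analysis), the residuals can also be checked numerically.

#optional numeric checks on the residuals, complementing the plots above
shapiro.test(residuals(model))   #formal test of normality of the residuals
mean(residuals(model))           #close to zero by construction for OLS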