I added head(chol) to allow for a refresher on what the cholesterol.tsv dataset looks like, and updated the state column to a factor. I found anova function used on the two models, one including age and the other not, to be very useful in understanding how the residual sum of squares can be used to determine fit. This was also my first time learning about stepAIC, and I found the systematic evaluation of each term to be very useful and helpful in determining model fit. I am a little confused still about how despite state not having an effect on cholesterol, the best fit still includes state.
library(readr)
#setwd("/Users/Nazija/Desktop/BIOS 621")
chol <- read_table("cholesterol.tsv",
col_types = cols(age = col_double()))
chol$state<- factor(chol$state)
head(chol)
## # A tibble: 6 × 3
## cholesterol age state
## <dbl> <dbl> <fct>
## 1 181 46 Iowa
## 2 228 52 Iowa
## 3 182 39 Iowa
## 4 249 65 Iowa
## 5 259 54 Iowa
## 6 201 33 Iowa
summary(chol)
## cholesterol age state
## Min. :112.0 Min. :18.00 Iowa :11
## 1st Qu.:181.2 1st Qu.:39.50 Nebraska:19
## Median :199.0 Median :48.00
## Mean :213.7 Mean :48.57
## 3rd Qu.:247.0 3rd Qu.:58.00
## Max. :356.0 Max. :78.00
library(ggplot2)
ggplot(chol, aes(x=age, y=cholesterol, shape=state, color=state)) +
geom_point(size=2) +
geom_smooth(method=lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
# Fit a linear model with age, state, and interaction as predictors
cholesterol as the outcome variable.
fit <- lm(cholesterol ~ age * state, data=chol)
summary(fit)
##
## Call:
## lm(formula = cholesterol ~ age * state, data = chol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.480 -31.907 -4.303 22.829 85.833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.8112 55.1166 0.650 0.52156
## age 3.2381 1.0088 3.210 0.00352 **
## stateNebraska 65.4866 61.9834 1.057 0.30045
## age:stateNebraska -0.7177 1.1628 -0.617 0.54247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.14 on 26 degrees of freedom
## Multiple R-squared: 0.5326, Adjusted R-squared: 0.4786
## F-statistic: 9.875 on 3 and 26 DF, p-value: 0.00016
anova(fit)
## Analysis of Variance Table
##
## Response: cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 48976 48976 26.3124 2.388e-05 ***
## state 1 5456 5456 2.9315 0.09877 .
## age:state 1 709 709 0.3809 0.54247
## Residuals 26 48395 1861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2, 2))
plot(fit)
fit1 is nested inside fit2 (as long as the observation set is the same )
fit1 <- lm(cholesterol ~ state, data=chol)
fit2 <- lm(cholesterol ~ state + age, data = chol)
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: cholesterol ~ state
## Model 2: cholesterol ~ state + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 102924
## 2 27 49104 1 53820 29.593 9.361e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Age is a meaningful predictor of cholesterol because adding it to the model significant reduces the RSS
library(MASS)
fit <- lm(cholesterol ~ age * state, data=chol)
step <- stepAIC(fit, direction = "backward") #removes terms from model to see if prediction becomes better
## Start: AIC=229.58
## cholesterol ~ age * state
##
## Df Sum of Sq RSS AIC
## - age:state 1 709.05 49104 228.01
## <none> 48395 229.58
##
## Step: AIC=228.01
## cholesterol ~ age + state
##
## Df Sum of Sq RSS AIC
## <none> 49104 228.01
## - state 1 5456 54560 229.18
## - age 1 53820 102924 248.22
AIC = Akaike’s Information Criterion The best model is the age + state one