library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library()

Question 1: Demonstration of the Central Limit Theorem — If you have a sample of 25 numbers drawn from a population with a mean of 60, and a standard deviation of 20. What is the expected sample mean and the expected sample standard deviation? What is the expected standard error of the mean? Do about 1000 simulations of sampling N=25 from a population distributed as N(μ = 60,σ2 = 202) and confirm that the mean of those samples is close to the expected mean, that the standard deviation of those samples is close to the expected standard deviation, and the standard error of the mean is close to the expected standard error of the mean. How would you describe the difference between the expected standard deviation of the sample and the expected standard error of the mean?

n <- 25
pop_mean <- 60
pop_sd <- 20

n_sim <- 1000

n_means <- rep(0, n_sim)
n_sd <- rep(0, n_sim)
n_se <- rep(0, n_sim)

for (i in 1:n_sim) {
  sample <- rnorm(n, mean = pop_mean, sd = pop_sd)
  n_means[i] <- mean(sample)
  n_sd[i] <- sd(sample)
  n_se[i] <- pop_sd/sqrt(n)
}

mean(n_means)
## [1] 59.86853
mean(n_sd)
## [1] 19.78673
mean(n_se)
## [1] 4

The expected mean and standard deviation of the sample is expected to be similar to that of the population. The standard error is the expected deviation of the sampling distributions’ mean from the population mean. The sample means are 59.97 which is very close to population mean 60 and the sample standard deviations is 19.79 which is very close to the population sd of 20. The expected sd of the sample is representative of the variability within the individual samples. The expected se of the mean is representative of the variability of the samples as a whole around the population’s mean.

Question 2:

crash_data <- read.csv("crash.csv")

Part A: Fit at least two models each with at least two inputs to understand the relationships between the variables. Explain your choices of predictor and outcome variables as well as any transformations you performed before fitting the models. Consider at least one model with interactions.

One predictor models

lm_no_rest_used <- lm(Alc_10 ~ KillUnre, data = crash_data)
summary(lm_no_rest_used)
## 
## Call:
## lm(formula = Alc_10 ~ KillUnre, data = crash_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6287  -5.0559  -0.6566   3.7442  11.5288 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.16397    5.29762   5.883 3.55e-07 ***
## KillUnre     0.02812    0.09176   0.306    0.761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.032 on 49 degrees of freedom
## Multiple R-squared:  0.001913,   Adjusted R-squared:  -0.01846 
## F-statistic: 0.09391 on 1 and 49 DF,  p-value: 0.7606

In the above model, I am looking at how well the percentage of those killed while not using restraint predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When the percentage of those killed while not using restraint is 0, or held constant, the percent of fatalities involving a BAC > .10g/dl is 31.16. For every one unit increase in the percent of those killed while using restraint there is a .03 increase in the fatalities involving a BAC > .10g/dl. This model accounts for 0% of variance (or -0.01846 according to the summary function output).

lm_rest_used <- lm(Alc_10 ~ KillRest, data = crash_data)
summary(lm_rest_used)
## 
## Call:
## lm(formula = Alc_10 ~ KillRest, data = crash_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7179  -4.8326  -0.5475   3.6748  12.2196 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.26163    3.71304   9.497 1.09e-12 ***
## KillRest    -0.07324    0.10616  -0.690    0.494    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.009 on 49 degrees of freedom
## Multiple R-squared:  0.00962,    Adjusted R-squared:  -0.01059 
## F-statistic: 0.476 on 1 and 49 DF,  p-value: 0.4935

In the above model, I am looking at how well the percentage of those killed while using restraint predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When the percentage of those killed while using restraint is 0 the percent of fatalities involving a BAC > .10g/dl is 35.26. For every one unit increase in the percent of those killed while using restraint there is a .07 decrease in the fatalities involving a BAC > .10g/dl. This model accounts for 0% of variance.

Two predictor models:

lm_restraint <- lm(Alc_10 ~ KillRest + BeltUse, data = crash_data)
summary(lm_restraint)
## 
## Call:
## lm(formula = Alc_10 ~ KillRest + BeltUse, data = crash_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6020  -4.5548  -0.1099   3.5751  12.6953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.5324     4.6347   6.372 7.34e-08 ***
## KillRest     -0.2586     0.1475  -1.753   0.0862 .  
## BeltUse       0.1883     0.0994   1.894   0.0643 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.888 on 47 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07742,    Adjusted R-squared:  0.03816 
## F-statistic: 1.972 on 2 and 47 DF,  p-value: 0.1505

In the above model, I am looking at how the state percentage of those killed while using restraint and the state estimated seat belt use rate predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When both predictors or independent variables are 0 the percent of fatalities is 29.53. For every one unit increase in those killed while using restraint there is a .26 decrease in fatalities involving a BAC > .10g/dl. For every one unit increase in state compliance with seat belt use there is a .19 unit increase in fatalities involving a BAC > .10g/dl. This model accounts for 4% of the total variance.

lm_no_restraint <- lm(Alc_10 ~ KillUnre + BeltUse, data = crash_data)
summary(lm_no_restraint)
## 
## Call:
## lm(formula = Alc_10 ~ KillUnre + BeltUse, data = crash_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8215  -5.1300  -0.7285   2.8652  13.4944 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  18.4565    11.6932   1.578    0.121
## KillUnre      0.1138     0.1218   0.934    0.355
## BeltUse       0.1219     0.0939   1.298    0.200
## 
## Residual standard error: 6.022 on 47 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03503,    Adjusted R-squared:  -0.006034 
## F-statistic: 0.853 on 2 and 47 DF,  p-value: 0.4326

In the above model, I am looking at how the state percentage of those killed while not using restraint and the state compliance with seat belt use predicts the percent of fatalities that involved blood alcohol levels above .10g/dl. When both predictors or independent variables are 0 the percent of fatalities involving a blood alcohol level above .10g/dl is 18.46. For every one unit increase in those killed while not using restraint there is a .11 increase in fatalities involving a BAC > .10g/dl. For every one unit increase in state compliance with seat belt use there is a .12 unit increase in fatalities involving a BAC > .10g/dl. This model accounts for 0% of the total variance.

model with an interaction

lm_restraint_interaction <- lm(Alc_10 ~ KillRest * BeltUse, data = crash_data)
summary(lm_restraint_interaction)
## 
## Call:
## lm(formula = Alc_10 ~ KillRest * BeltUse, data = crash_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2672  -4.2410  -0.7067   3.8862  12.5998 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      41.408465  18.275503   2.266   0.0282 *
## KillRest         -0.637154   0.582472  -1.094   0.2797  
## BeltUse           0.002414   0.294116   0.008   0.9935  
## KillRest:BeltUse  0.005755   0.008563   0.672   0.5049  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.923 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.08639,    Adjusted R-squared:  0.02681 
## F-statistic:  1.45 on 3 and 46 DF,  p-value: 0.2406

In the above model, I am looking to see if there is an interaction between state percentage of those killed while using restraint and the state estimated seat belt use rate, in which the relationship between the impact of the state percentage of those killed while using restraint and the percent of fatalities that involved blood alcohol levels above .10g/dl is dependent on the state’s estimated seat belt use rate. When both predictors or independent variables are 0 the percent of fatalities is 41.41. For every one unit increase in those killed while using restraint there is a .64 decrease in fatalities involving a BAC > .10g/dl. For every one unit increase in state compliance with seat belt use there is no unit change in fatalities involving a BAC > .10g/dl. Depending on the state percentage of those killed while using restraint for every increase in state seat belt use compliance there is a .01 increase in fatalities involving a BAC > .10g/dl. This model accounts for 2% of the total variance.

Part B: Display the fitted model graphically, and explain the meaning of each of the coefficients, along with the residual standard deviation. Plot the residuals versus fitted values. Comment on any relationships in the data that surprised you initially and whether or not you were able to make sense of them.

par(mfrow=c(2,2))
plot(lm_no_rest_used)

par(mfrow=c(2,2))
plot(lm_rest_used)

par(mfrow=c(2,2))
plot(lm_restraint)

par(mfrow=c(2,2))
plot(lm_no_restraint)

Looking at the last two graphs for the models with two predictors (lm_restraint and lm_no_restraint) I am see that there is slight clustering.

plotting the models with one predictor

plot(crash_data$KillUnre, crash_data$Alc_10, xlab = "Restraint Not Used", ylab = "Fatlities BAC > .10")
abline(lm_no_rest_used)

plot(crash_data$KillRest, crash_data$Alc_10, xlab = "Restraint Used", ylab = "Fatlities BAC > .10")
abline(lm_rest_used)

plotting the models with two predictors

plot(fitted(lm_no_restraint), resid(lm_no_restraint),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values")

plot(fitted(lm_restraint), resid(lm_restraint),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values")

In the first plot for lm_no_restraint there is slight clustering in the middle and in the second plot for lm_restraint there is slight clustering towards the left side.

Residuals are the difference between the observed and predicted values. The standard deviation of the residuals is measuring the variability of the residuals. The dispersion is pretty even so we can assume the assumption of homoscedasticity is not violated.

Question 3:

Part A: For two inputs in your models above (X1 and X2), compute their mean and standard deviation, X1 and sX1 , respectively, and vary them so that you have all 9 combinations of X1 − sX1 , X1, X1 +sX1 and X2 −sX2, X2, X2 +sX2.

(KillUnre <- mean(crash_data$KillUnre, na.rm=TRUE) +
    c(-1, 0, 1) * sd(crash_data$KillUnre, na.rm=TRUE))
## [1] 47.69781 56.99412 66.29043
(BeltUse <- mean(crash_data$BeltUse, na.rm=TRUE) +
    c(-1, 0, 1) * sd(crash_data$BeltUse, na.rm=TRUE))
## [1] 51.85677 63.72000 75.58323
(combs <- expand_grid(KillUnre, BeltUse))
## # A tibble: 9 × 2
##   KillUnre BeltUse
##      <dbl>   <dbl>
## 1     47.7    51.9
## 2     47.7    63.7
## 3     47.7    75.6
## 4     57.0    51.9
## 5     57.0    63.7
## 6     57.0    75.6
## 7     66.3    51.9
## 8     66.3    63.7
## 9     66.3    75.6

Part B: Now use the predict() function with one of your fits from lm() for question 2 to generate 9 predictions. Visualize those values and explain how they correspond to your understanding about the relationships you examined previously.

predictions <- predict(lm_no_restraint, newdata = combs)
predictions
##        1        2        3        4        5        6        7        8 
## 30.20733 31.65375 33.10018 31.26527 32.71170 34.15813 32.32322 33.76964 
##        9 
## 35.21607
plot(combs$KillUnre, predictions, type = "b", 
     xlab = "Predicted KillUnre", ylab = "Predicted Alc_10",
     main = "Predicted Alc_10 vs. Predicted KillUnre")

plot(combs$BeltUse, predictions, type = "b", 
     xlab = "Predicted BeltUse", ylab = "Predicted Alc_10",
     main = "Predicted Alc_10 vs. Predicted BeltUse")

They correspond in the sense that an increase in the percentage of those killed while not using restraint is associated with an increase in the the percent of fatalities that involved blood alcohol levels above .10g/dl.