#install.packages("interactions")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(interactions)

Question 1: Demonstration of the Central Limit Theorem — If you have a sample of 25 numbers drawn from a population with a mean of 60, and a standard deviation of 20. What is the expected sample mean and the expected sample standard deviation? What is the expected standard error of the mean? Do about 1000 simulations of sampling N=25 from a population distributed as N(μ = 60,σ2 = 202) and confirm that the mean of those samples is close to the expected mean, that the standard deviation of those samples is close to the expected standard deviation, and the standard error of the mean is close to the expected standard error of the mean. How would you describe the difference between the expected standard deviation of the sample and the expected standard error of the mean?

n <- 25
pop_mean <- 60
pop_sd <- 20

n_sim <- 1000

n_means <- rep(0, n_sim)
n_sd <- rep(0, n_sim)
n_se <- rep(0, n_sim)

for (i in 1:n_sim) {
  sample <- rnorm(n, mean = pop_mean, sd = pop_sd)
  n_means[i] <- mean(sample)
  n_sd[i] <- sd(sample)
  n_se[i] <- pop_sd/sqrt(n)
}

mean(n_means)
## [1] 59.84864
mean(n_sd)
## [1] 19.79219
mean(n_se)
## [1] 4

The expected mean and standard deviation of the sample is expected to be similar to that of the population. The standard error is the expected deviation of the sampling distributions’ mean from the population mean. The sample means are 59.97 which is very close to population mean 60 and the sample standard deviations is 19.79 which is very close to the population sd of 20. The expected sd of the sample is representative of the variability within the individual samples. The expected se of the mean is representative of the variability of the samples as a whole around the population’s mean.The expected sample mean is plus or minus 4 when we do it over and over again.

Question 2:

crash_data <- read.csv("crash.csv")

Part A: Fit at least two models each with at least two inputs to understand the relationships between the variables. Explain your choices of predictor and outcome variables as well as any transformations you performed before fitting the models. Consider at least one model with interactions.

crash_centered <- crash_data %>% 
  mutate(PerMiles_Centered = PerMiles - mean(PerMiles),
         BeltUse_Centered = BeltUse - mean(BeltUse, na.rm = TRUE))

In the code above I am centering the data so I can use the average instead of 0.

Two predictor models:

lm_restraint <- lm(Rate ~ PerMiles_Centered + BeltUse_Centered, data = crash_centered)
summary(lm_restraint)
## 
## Call:
## lm(formula = Rate ~ PerMiles_Centered + BeltUse_Centered, data = crash_centered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0089  -2.3551   0.0917   2.3786   7.8492 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       25.15523    0.59450   42.31   <2e-16 ***
## PerMiles_Centered 17.72165    1.38310   12.81   <2e-16 ***
## BeltUse_Centered  -0.04505    0.05060   -0.89    0.378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.202 on 47 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.778,  Adjusted R-squared:  0.7686 
## F-statistic: 82.38 on 2 and 47 DF,  p-value: 4.336e-16

In the above model, I am looking at how the average fatality rate per 100 million miles traveled and the average state compliance with seat belt use predicts the fatality rate per 100,000 drivers. When both fatality rate per 100 million miles traveled and state compliance with seat belt use are at the average there are 25.16 fatalities out of 100,000. For every one unit increase in fatality rate per 100 million miles traveled there is a 17.72 increase in fatality rate. When there is a one unit increase in seat belt use compliance there is a .04 decrease in fatality rate.

lm_restraint_interaction <- lm(Rate ~ PerMiles_Centered * BeltUse_Centered, data = crash_centered)
summary(lm_restraint_interaction)
## 
## Call:
## lm(formula = Rate ~ PerMiles_Centered * BeltUse_Centered, data = crash_centered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9769  -2.3605   0.0889   2.3904   7.8043 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        25.155104   0.600879  41.864  < 2e-16 ***
## PerMiles_Centered                  17.748253   1.429684  12.414 2.73e-16 ***
## BeltUse_Centered                   -0.046211   0.052784  -0.875    0.386    
## PerMiles_Centered:BeltUse_Centered  0.009146   0.103012   0.089    0.930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7781, Adjusted R-squared:  0.7636 
## F-statistic: 53.76 on 3 and 46 DF,  p-value: 4.473e-15

In the above model, I am looking at how the average fatality rate per 100 million miles traveled and the average state compliance with seat belt use predicts the fatality rate per 100,000 drivers. I am also looking at the interaction between the two predictors. Depending on the level of percent state compliance with seat belt use the effect of fatality rate per 100 million miles traveled on the fatality rate per 100,000 drivers is .009.

Part B: Display the fitted model graphically, and explain the meaning of each of the coefficients, along with the residual standard deviation. Plot the residuals versus fitted values. Comment on any relationships in the data that surprised you initially and whether or not you were able to make sense of them.

Displaying the fitted models graphically

interact_plot(model= lm_restraint, pred=PerMiles_Centered, modx=BeltUse_Centered, plot.points = TRUE)
## Warning: PerMiles_Centered and BeltUse_Centered are not included in an interaction
## with one another in the model.

interact_plot(model = lm_restraint_interaction, pred = BeltUse_Centered, modx = PerMiles_Centered, plot.points = TRUE)

Finding the standard deviation of the residuals

coef(lm_restraint_interaction)
##                        (Intercept)                  PerMiles_Centered 
##                       25.155104378                       17.748252688 
##                   BeltUse_Centered PerMiles_Centered:BeltUse_Centered 
##                       -0.046211236                        0.009145975
residuals <- residuals(lm_restraint_interaction)
residuals 
##            1            2            3            4            5            6 
##   5.49933329 -10.69944318  -2.77319093  -1.44565564   1.56004173  -2.37982742 
##            7            8            9           10           11           12 
##   1.01461771   1.27743306  -7.77293521  -7.46354999   6.44688246  -3.79391033 
##           13           14           15           16           17           18 
##   1.45695838  -2.44082581   4.55424574  -0.90986776  -1.14203469   2.15343948 
##           19           20           21           22           23           24 
##  -0.78405835  -0.46961826   0.31733031  -0.12295128  -2.30268743   3.16123088 
##           25           26           27           28           29           30 
##   7.80433161   5.61024358   6.43277878   0.09916438  -1.43420704   0.00703043 
##           31           32           33           34           35           36 
##  -2.56762142   7.75726915  -3.68497484   3.22213396   2.58944315   0.60956507 
##           37           38           39           40           41           42 
##   3.79775482  -4.16305508  -3.43530580  -1.56086998   0.72359489 -10.97687864 
##           43           44           45           46           47           48 
##   0.07867624   0.98783075   1.93793716   0.06513772   3.16846819   1.58726429 
##           49           50 
##  -4.06599523   2.46932709
sd_residuals <- sd(residuals(lm_restraint_interaction))
sd_residuals
## [1] 4.115054

Residuals are the difference between the observed and predicted values. The standard deviation of the residuals is measuring the variability of the residuals. The residual standard deviation is 4.12 meaning the average variability of the observed data points from the predicted values is about 4.12.

Plotting the residuals versus fitted values

par(mfrow=c(2,2))
plot(lm_restraint)

par(mfrow=c(2,2))
plot(lm_restraint_interaction)

#OR 

plot(lm_restraint, which = 1)
plot(lm_restraint_interaction, which = 1)

I am not completely able to make sense of the them.

Question 3:

Part A: For two inputs in your models above (X1 and X2), compute their mean and standard deviation, X1 and sX1 , respectively, and vary them so that you have all 9 combinations of X1 − sX1 , X1, X1 +sX1 and X2 −sX2, X2, X2 +sX2.

(PerMiles_Centered <- mean(crash_centered$PerMiles_Centered, na.rm=TRUE) +
    c(-1, 0, 1) * sd(crash_centered$PerMiles_Centered, na.rm=TRUE))
## [1] -4.384733e-01 -7.618859e-17  4.384733e-01
(BeltUse_Centered <- mean(crash_centered$BeltUse_Centered, na.rm=TRUE) +
    c(-1, 0, 1) * sd(crash_centered$BeltUse_Centered, na.rm=TRUE))
## [1] -1.186323e+01  1.137055e-15  1.186323e+01
(combs <- expand_grid(PerMiles_Centered, BeltUse_Centered))
## # A tibble: 9 × 2
##   PerMiles_Centered BeltUse_Centered
##               <dbl>            <dbl>
## 1         -4.38e- 1        -1.19e+ 1
## 2         -4.38e- 1         1.14e-15
## 3         -4.38e- 1         1.19e+ 1
## 4         -7.62e-17        -1.19e+ 1
## 5         -7.62e-17         1.14e-15
## 6         -7.62e-17         1.19e+ 1
## 7          4.38e- 1        -1.19e+ 1
## 8          4.38e- 1         1.14e-15
## 9          4.38e- 1         1.19e+ 1

-4, 0 , 4 and -1, 0, 1

Part B: Now use the predict() function with one of your fits from lm() for question 2 to generate 9 predictions. Visualize those values and explain how they correspond to your understanding about the relationships you examined previously.

predictions <- predict(lm_restraint, newdata = combs)
predictions
##        1        2        3        4        5        6        7        8 
## 17.91923 17.38476 16.85029 25.68970 25.15523 24.62076 33.46017 32.92570 
##        9 
## 32.39123
ggplot(data = combs, mapping = aes(x= combs$PerMiles_Centered, y = predictions, color = BeltUse_Centered)) +  geom_point() 
## Warning: Use of `combs$PerMiles_Centered` is discouraged.
## ℹ Use `PerMiles_Centered` instead.

The above visualization for the predictions looks similar to the lm_restraint and lm_restraint_interaction visualizations with them all looking like a positive correlation with an increase in one leading to an increase in the other.