Question 1:
For this question use the Japan_earthquakes.csv. This
data set contains information regarding earthquakes from 2001-2018
in/around Japan.
Please create an interactive map using leaflet . Answer
the following questions.
(Please export your maps as images for each part for the PDF file, as html maps will not be supported when converted into PDFs.
You can use "always_allow_html: true" on the
header of your rmd file, so that you can execute the codes without any
errors. Please see the header of the attached rmd
file.
You must include the all relevant codes as well. )
#1. Use geolocation of Japan (lat and long values) to set up a base
map. (You can use any providerTiles that is
appropriate).
library(leaflet)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
leaflet() %>%
addTiles() #base map
leaflet() %>%
addProviderTiles(providers$Esri.OceanBasemap)
leaflet() %>%
setView(lng = 139.412689, lat = 36.709055, zoom = 5.7) %>%
addProviderTiles(providers$Esri.OceanBasemap) %>%
addProviderTiles(providers$Stamen.TonerLabels)
#2. Create subset, of all earthquake data which has a magnitude
(mag) of 6 and higher.
# Import Japan earthquake data
quakes = dat = read.csv('/Users/ryankelly/Desktop/Desktop - Ryan’s MacBook Pro/GMU/Spring 2023/STAT 515/Homework/HW2/Japan_earthquakes.csv')
# Adjust the data set to only include magnitude and location data
quakes <- quakes[, c("mag","latitude","longitude","time","place","depth")]
# Create a subset of the quakes data frame where mag is greater than or equal to 6
quakes_6 <- subset(quakes, mag >= 6)
#3. Using the subset from part 2) plot all the earthquake locations on the map of Japan.
#4. Use Circles to represent the location of earthquakes
and make sure that the area of the circle corresponds to
the magnitude of each earthquake.(Hint: Use the following formula
for the radius, \(\sqrt(10^x)*c\)
: Here x is the magnitude and the c is a
arbitrary constant that you can choose your own).
#5. Adjust the circles that you got from part 4), so that the map is aesthetically pleasing. (Hint: colors, opacity etc).
#6. Add labels to each circle indicating the
place that each earthquake occurred.
#7. Add a popups to each circle to see the following
information:
- `Time` of the Earthquake
- `Magnitude` of the Earthquake
- `Depth` in (km) of the Earthquake
- `Place` of the Earthquake.
# Plot the earthquakes on the map
leaflet(quakes_6) %>%
addProviderTiles(providers$Esri.OceanBasemap) %>%
addProviderTiles(providers$Stamen.TonerLabels) %>%
addCircles(lng = ~longitude,
lat = ~latitude,
popup = paste("Time: ", quakes_6$time, "<br>", "Magnitude: ", quakes_6$mag,
"<br>", "Depth: ", quakes_6$depth, "<br>", "Place: ", quakes_6$place),
label = ~place, # adding labels
weight = ~sqrt(10^mag)*.002, # changing the size of circles
opacity = .75, # changing the opacity of circles
color = 'darkred' # usually we use hexadecimal colors
)
Question 2:
Is there a relationship between student math test scores and
socioeconomic variables? The data set CASchools.csv
contains data on test performance, school demographics, and student
demographic background for school districts in California. Remove the
variable county before the analysis.
Please use the Data set description document to learn more about the data set.
#1. Fit a linear regression model to predict math test
scores using all variables in the data set. Discuss your results, making
sure to cover the following points:
CAS <- read.csv('/Users/ryankelly/Desktop/Desktop - Ryan’s MacBook Pro/GMU/Spring 2023/STAT 515/Homework/HW2/CASchools.csv')
# Remove the variable "county" from the data frame
CAS <- CAS[, -which(names(CAS) == "county")]
CAS <- CAS[, -which(names(CAS) == "school")]
CAS <- CAS[, -which(names(CAS) == "grades")]
CAS <- CAS[, -which(names(CAS) == "district")]
# Change income to numeric for manual color change
CAS$income <- gsub("[^0-9.]", "", CAS$income)
CAS$income = as.numeric(CAS$income)
# Change computer to numeric for manual color change
library(stringr)
CAS$computer <- as.numeric(str_replace_all(CAS$computer, "[^[:digit:]]", ""))
## Consider removing missing data
#sum(is.na(CAS$?????)) # Replace ????? with columns with missing values
#CAS=na.omit(CAS) # removing rows with missing values
# Replace missing values with interpolated values rows where values are missing
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
set.seed(123) # Set the seed for reproducibility
CAS <- mice(CAS, m = 5, maxit = 50, method = "pmm") # Impute missing values using the mice package
##
## iter imp variable
## 1 1 calworks lunch computer income
## 1 2 calworks lunch computer income
## 1 3 calworks lunch computer income
## 1 4 calworks lunch computer income
## 1 5 calworks lunch computer income
## 2 1 calworks lunch computer income
## 2 2 calworks lunch computer income
## 2 3 calworks lunch computer income
## 2 4 calworks lunch computer income
## 2 5 calworks lunch computer income
## 3 1 calworks lunch computer income
## 3 2 calworks lunch computer income
## 3 3 calworks lunch computer income
## 3 4 calworks lunch computer income
## 3 5 calworks lunch computer income
## 4 1 calworks lunch computer income
## 4 2 calworks lunch computer income
## 4 3 calworks lunch computer income
## 4 4 calworks lunch computer income
## 4 5 calworks lunch computer income
## 5 1 calworks lunch computer income
## 5 2 calworks lunch computer income
## 5 3 calworks lunch computer income
## 5 4 calworks lunch computer income
## 5 5 calworks lunch computer income
## 6 1 calworks lunch computer income
## 6 2 calworks lunch computer income
## 6 3 calworks lunch computer income
## 6 4 calworks lunch computer income
## 6 5 calworks lunch computer income
## 7 1 calworks lunch computer income
## 7 2 calworks lunch computer income
## 7 3 calworks lunch computer income
## 7 4 calworks lunch computer income
## 7 5 calworks lunch computer income
## 8 1 calworks lunch computer income
## 8 2 calworks lunch computer income
## 8 3 calworks lunch computer income
## 8 4 calworks lunch computer income
## 8 5 calworks lunch computer income
## 9 1 calworks lunch computer income
## 9 2 calworks lunch computer income
## 9 3 calworks lunch computer income
## 9 4 calworks lunch computer income
## 9 5 calworks lunch computer income
## 10 1 calworks lunch computer income
## 10 2 calworks lunch computer income
## 10 3 calworks lunch computer income
## 10 4 calworks lunch computer income
## 10 5 calworks lunch computer income
## 11 1 calworks lunch computer income
## 11 2 calworks lunch computer income
## 11 3 calworks lunch computer income
## 11 4 calworks lunch computer income
## 11 5 calworks lunch computer income
## 12 1 calworks lunch computer income
## 12 2 calworks lunch computer income
## 12 3 calworks lunch computer income
## 12 4 calworks lunch computer income
## 12 5 calworks lunch computer income
## 13 1 calworks lunch computer income
## 13 2 calworks lunch computer income
## 13 3 calworks lunch computer income
## 13 4 calworks lunch computer income
## 13 5 calworks lunch computer income
## 14 1 calworks lunch computer income
## 14 2 calworks lunch computer income
## 14 3 calworks lunch computer income
## 14 4 calworks lunch computer income
## 14 5 calworks lunch computer income
## 15 1 calworks lunch computer income
## 15 2 calworks lunch computer income
## 15 3 calworks lunch computer income
## 15 4 calworks lunch computer income
## 15 5 calworks lunch computer income
## 16 1 calworks lunch computer income
## 16 2 calworks lunch computer income
## 16 3 calworks lunch computer income
## 16 4 calworks lunch computer income
## 16 5 calworks lunch computer income
## 17 1 calworks lunch computer income
## 17 2 calworks lunch computer income
## 17 3 calworks lunch computer income
## 17 4 calworks lunch computer income
## 17 5 calworks lunch computer income
## 18 1 calworks lunch computer income
## 18 2 calworks lunch computer income
## 18 3 calworks lunch computer income
## 18 4 calworks lunch computer income
## 18 5 calworks lunch computer income
## 19 1 calworks lunch computer income
## 19 2 calworks lunch computer income
## 19 3 calworks lunch computer income
## 19 4 calworks lunch computer income
## 19 5 calworks lunch computer income
## 20 1 calworks lunch computer income
## 20 2 calworks lunch computer income
## 20 3 calworks lunch computer income
## 20 4 calworks lunch computer income
## 20 5 calworks lunch computer income
## 21 1 calworks lunch computer income
## 21 2 calworks lunch computer income
## 21 3 calworks lunch computer income
## 21 4 calworks lunch computer income
## 21 5 calworks lunch computer income
## 22 1 calworks lunch computer income
## 22 2 calworks lunch computer income
## 22 3 calworks lunch computer income
## 22 4 calworks lunch computer income
## 22 5 calworks lunch computer income
## 23 1 calworks lunch computer income
## 23 2 calworks lunch computer income
## 23 3 calworks lunch computer income
## 23 4 calworks lunch computer income
## 23 5 calworks lunch computer income
## 24 1 calworks lunch computer income
## 24 2 calworks lunch computer income
## 24 3 calworks lunch computer income
## 24 4 calworks lunch computer income
## 24 5 calworks lunch computer income
## 25 1 calworks lunch computer income
## 25 2 calworks lunch computer income
## 25 3 calworks lunch computer income
## 25 4 calworks lunch computer income
## 25 5 calworks lunch computer income
## 26 1 calworks lunch computer income
## 26 2 calworks lunch computer income
## 26 3 calworks lunch computer income
## 26 4 calworks lunch computer income
## 26 5 calworks lunch computer income
## 27 1 calworks lunch computer income
## 27 2 calworks lunch computer income
## 27 3 calworks lunch computer income
## 27 4 calworks lunch computer income
## 27 5 calworks lunch computer income
## 28 1 calworks lunch computer income
## 28 2 calworks lunch computer income
## 28 3 calworks lunch computer income
## 28 4 calworks lunch computer income
## 28 5 calworks lunch computer income
## 29 1 calworks lunch computer income
## 29 2 calworks lunch computer income
## 29 3 calworks lunch computer income
## 29 4 calworks lunch computer income
## 29 5 calworks lunch computer income
## 30 1 calworks lunch computer income
## 30 2 calworks lunch computer income
## 30 3 calworks lunch computer income
## 30 4 calworks lunch computer income
## 30 5 calworks lunch computer income
## 31 1 calworks lunch computer income
## 31 2 calworks lunch computer income
## 31 3 calworks lunch computer income
## 31 4 calworks lunch computer income
## 31 5 calworks lunch computer income
## 32 1 calworks lunch computer income
## 32 2 calworks lunch computer income
## 32 3 calworks lunch computer income
## 32 4 calworks lunch computer income
## 32 5 calworks lunch computer income
## 33 1 calworks lunch computer income
## 33 2 calworks lunch computer income
## 33 3 calworks lunch computer income
## 33 4 calworks lunch computer income
## 33 5 calworks lunch computer income
## 34 1 calworks lunch computer income
## 34 2 calworks lunch computer income
## 34 3 calworks lunch computer income
## 34 4 calworks lunch computer income
## 34 5 calworks lunch computer income
## 35 1 calworks lunch computer income
## 35 2 calworks lunch computer income
## 35 3 calworks lunch computer income
## 35 4 calworks lunch computer income
## 35 5 calworks lunch computer income
## 36 1 calworks lunch computer income
## 36 2 calworks lunch computer income
## 36 3 calworks lunch computer income
## 36 4 calworks lunch computer income
## 36 5 calworks lunch computer income
## 37 1 calworks lunch computer income
## 37 2 calworks lunch computer income
## 37 3 calworks lunch computer income
## 37 4 calworks lunch computer income
## 37 5 calworks lunch computer income
## 38 1 calworks lunch computer income
## 38 2 calworks lunch computer income
## 38 3 calworks lunch computer income
## 38 4 calworks lunch computer income
## 38 5 calworks lunch computer income
## 39 1 calworks lunch computer income
## 39 2 calworks lunch computer income
## 39 3 calworks lunch computer income
## 39 4 calworks lunch computer income
## 39 5 calworks lunch computer income
## 40 1 calworks lunch computer income
## 40 2 calworks lunch computer income
## 40 3 calworks lunch computer income
## 40 4 calworks lunch computer income
## 40 5 calworks lunch computer income
## 41 1 calworks lunch computer income
## 41 2 calworks lunch computer income
## 41 3 calworks lunch computer income
## 41 4 calworks lunch computer income
## 41 5 calworks lunch computer income
## 42 1 calworks lunch computer income
## 42 2 calworks lunch computer income
## 42 3 calworks lunch computer income
## 42 4 calworks lunch computer income
## 42 5 calworks lunch computer income
## 43 1 calworks lunch computer income
## 43 2 calworks lunch computer income
## 43 3 calworks lunch computer income
## 43 4 calworks lunch computer income
## 43 5 calworks lunch computer income
## 44 1 calworks lunch computer income
## 44 2 calworks lunch computer income
## 44 3 calworks lunch computer income
## 44 4 calworks lunch computer income
## 44 5 calworks lunch computer income
## 45 1 calworks lunch computer income
## 45 2 calworks lunch computer income
## 45 3 calworks lunch computer income
## 45 4 calworks lunch computer income
## 45 5 calworks lunch computer income
## 46 1 calworks lunch computer income
## 46 2 calworks lunch computer income
## 46 3 calworks lunch computer income
## 46 4 calworks lunch computer income
## 46 5 calworks lunch computer income
## 47 1 calworks lunch computer income
## 47 2 calworks lunch computer income
## 47 3 calworks lunch computer income
## 47 4 calworks lunch computer income
## 47 5 calworks lunch computer income
## 48 1 calworks lunch computer income
## 48 2 calworks lunch computer income
## 48 3 calworks lunch computer income
## 48 4 calworks lunch computer income
## 48 5 calworks lunch computer income
## 49 1 calworks lunch computer income
## 49 2 calworks lunch computer income
## 49 3 calworks lunch computer income
## 49 4 calworks lunch computer income
## 49 5 calworks lunch computer income
## 50 1 calworks lunch computer income
## 50 2 calworks lunch computer income
## 50 3 calworks lunch computer income
## 50 4 calworks lunch computer income
## 50 5 calworks lunch computer income
CAS <- complete(CAS) # Extract the completed data set from the imputed data
pairs(CAS)
str(CAS)
## 'data.frame': 429 obs. of 9 variables:
## $ students : int 195 240 1550 243 1335 137 195 888 379 2247 ...
## $ teachers : num 10.9 11.1 82.9 14 71.5 ...
## $ calworks : num 0.51 15.42 55.03 36.48 33.11 ...
## $ lunch : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer : num 67 101 169 85 171 25 28 66 35 0 ...
## $ expenditure: num 6385 5099 5502 7102 5236 ...
## $ income : num 22.69 9.82 8.98 8.98 9.08 ...
## $ english : num 0 4.58 30 0 13.86 ...
## $ math : num 690 662 651 644 640 ...
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
pairs.panels(CAS)
fit = lm(math~., data=CAS)
summary(fit)
##
## Call:
## lm(formula = math ~ ., data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.2764 -6.6892 0.2302 5.9269 31.5397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.580e+02 4.328e+00 152.030 < 2e-16 ***
## students -7.837e-04 1.727e-03 -0.454 0.65018
## teachers 4.584e-03 3.799e-02 0.121 0.90400
## calworks -9.618e-02 6.625e-02 -1.452 0.14735
## lunch -3.782e-01 3.691e-02 -10.247 < 2e-16 ***
## computer 4.769e-03 3.246e-03 1.469 0.14253
## expenditure 8.990e-04 8.526e-04 1.054 0.29231
## income 6.547e-01 1.004e-01 6.523 1.98e-10 ***
## english -7.622e-02 2.607e-02 -2.924 0.00364 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.905 on 420 degrees of freedom
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.723
## F-statistic: 140.6 on 8 and 420 DF, p-value: < 2.2e-16
# It appears there are some noticeable trends in the observing the data points in the plot martix above.
# It is difficult the exact nature of the relationship, but there does appear to be some correlation
# between math scores and some of the predictors, such as 'lunch' and 'income.'
From the regression output above, it appears that sutudents, teachers, calworks, computer, and expenditure are insignificant.
It appears that lunch is explaining math scores the best, based on the p-value of < 2e-16 ***
The p-value of < 2e-16 for the response variable lunch indicates the predictor is significant. The *** indicates that this predictor variable is significant at the p < 0.001 level. This means the probability of observing the result by chance (null hypothesis) is less than 0.1%
#2. Do you think you can fit a better model than the model in part (1)? Fit a new model that you think is a better fit for the data. What do you observe about this new model?
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(fit)
## students teachers calworks lunch computer expenditure
## 196.504216 219.232247 2.489533 4.363760 8.811838 1.319400
## income english
## 2.468116 1.593449
# 'students' and 'teachers' have very high VIF values --> start with removing those from model
fit1 = lm(math~calworks+lunch+computer+expenditure+income+english, data=CAS); summary(fit1);
##
## Call:
## lm(formula = math ~ calworks + lunch + computer + expenditure +
## income + english, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.566 -6.970 0.306 5.936 31.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.573e+02 4.253e+00 154.543 < 2e-16 ***
## calworks -9.180e-02 6.607e-02 -1.389 0.16542
## lunch -3.837e-01 3.679e-02 -10.431 < 2e-16 ***
## computer 1.986e-04 1.153e-03 0.172 0.86333
## expenditure 1.051e-03 8.429e-04 1.247 0.21312
## income 6.563e-01 9.981e-02 6.576 1.43e-10 ***
## english -7.887e-02 2.605e-02 -3.028 0.00261 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.915 on 422 degrees of freedom
## Multiple R-squared: 0.7263, Adjusted R-squared: 0.7224
## F-statistic: 186.6 on 6 and 422 DF, p-value: < 2.2e-16
fit2 = lm(math~lunch+computer+income+english, data=CAS); summary(fit2);
##
## Call:
## lm(formula = math ~ lunch + computer + income + english, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.157 -6.939 0.002 5.909 31.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.619e+02 2.326e+00 284.564 < 2e-16 ***
## lunch -4.069e-01 2.786e-02 -14.605 < 2e-16 ***
## computer -1.284e-04 1.137e-03 -0.113 0.91014
## income 7.087e-01 8.983e-02 7.889 2.62e-14 ***
## english -7.248e-02 2.503e-02 -2.896 0.00398 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.927 on 424 degrees of freedom
## Multiple R-squared: 0.7243, Adjusted R-squared: 0.7217
## F-statistic: 278.5 on 4 and 424 DF, p-value: < 2.2e-16
fit3 = lm(math~lunch+income+english, data=CAS); summary(fit3);
##
## Call:
## lm(formula = math ~ lunch + income + english, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.126 -6.974 0.005 5.870 31.966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.89369 2.32328 284.896 < 2e-16 ***
## lunch -0.40704 0.02779 -14.647 < 2e-16 ***
## income 0.70721 0.08878 7.966 1.51e-14 ***
## english -0.07304 0.02452 -2.979 0.00306 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.915 on 425 degrees of freedom
## Multiple R-squared: 0.7243, Adjusted R-squared: 0.7224
## F-statistic: 372.2 on 3 and 425 DF, p-value: < 2.2e-16
fit4 = lm(math~lunch, data=CAS); summary(fit4);
##
## Call:
## lm(formula = math ~ lunch, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.313 -6.778 -0.250 6.423 31.535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 679.01320 0.98941 686.28 <2e-16 ***
## lunch -0.57315 0.01897 -30.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.64 on 427 degrees of freedom
## Multiple R-squared: 0.6813, Adjusted R-squared: 0.6805
## F-statistic: 912.7 on 1 and 427 DF, p-value: < 2.2e-16
# Compare original models to adjusted models
anova(fit,fit1)
## Analysis of Variance Table
##
## Model 1: math ~ students + teachers + calworks + lunch + computer + expenditure +
## income + english
## Model 2: math ~ calworks + lunch + computer + expenditure + income + english
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 420 41208
## 2 422 41490 -2 -282.29 1.4386 0.2384
anova(fit1,fit2)
## Analysis of Variance Table
##
## Model 1: math ~ calworks + lunch + computer + expenditure + income + english
## Model 2: math ~ lunch + computer + income + english
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 422 41490
## 2 424 41783 -2 -293.45 1.4924 0.226
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: math ~ lunch + computer + income + english
## Model 2: math ~ lunch + income + english
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 424 41783
## 2 425 41784 -1 -1.2569 0.0128 0.9101
anova(fit3,fit4) # fit4 is the best linear model based on analysis of variance
## Analysis of Variance Table
##
## Model 1: math ~ lunch + income + english
## Model 2: math ~ lunch
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 425 41784
## 2 427 48311 -2 -6526.2 33.19 4.041e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Non-linear models
fit5 = lm(math~poly(lunch, 5), data=CAS); summary(fit5);
##
## Call:
## lm(formula = math ~ poly(lunch, 5), data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.177 -7.188 0.105 6.740 30.392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 653.4639 0.4978 1312.574 < 2e-16 ***
## poly(lunch, 5)1 -321.3471 10.3116 -31.164 < 2e-16 ***
## poly(lunch, 5)2 33.5624 10.3116 3.255 0.00123 **
## poly(lunch, 5)3 -42.7860 10.3116 -4.149 4.03e-05 ***
## poly(lunch, 5)4 19.2891 10.3116 1.871 0.06209 .
## poly(lunch, 5)5 2.0707 10.3116 0.201 0.84094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.31 on 423 degrees of freedom
## Multiple R-squared: 0.7033, Adjusted R-squared: 0.6998
## F-statistic: 200.5 on 5 and 423 DF, p-value: < 2.2e-16
fit6 = lm(math~lunch+I(lunch^2), data=CAS); summary(fit6);
##
## Call:
## lm(formula = math ~ lunch + I(lunch^2), data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.601 -7.175 -0.216 6.756 32.843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.823e+02 1.422e+00 479.935 < 2e-16 ***
## lunch -7.835e-01 6.858e-02 -11.424 < 2e-16 ***
## I(lunch^2) 2.239e-03 7.021e-04 3.189 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.52 on 426 degrees of freedom
## Multiple R-squared: 0.6887, Adjusted R-squared: 0.6872
## F-statistic: 471.2 on 2 and 426 DF, p-value: < 2.2e-16
# Compare non-linear models
anova(fit5, fit6)
## Analysis of Variance Table
##
## Model 1: math ~ poly(lunch, 5)
## Model 2: math ~ lunch + I(lunch^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 423 44977
## 2 426 47184 -3 -2207 6.9188 0.000148 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare best linear model to best non-linear model
anova(fit4, fit6) # Result shows fit6 (model with quadratic term) is better than best linear model
## Analysis of Variance Table
##
## Model 1: math ~ lunch
## Model 2: math ~ lunch + I(lunch^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 427 48311
## 2 426 47184 1 1126.4 10.17 0.001533 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Finally, compare quadratic model to original model
anova(fit, fit6)
## Analysis of Variance Table
##
## Model 1: math ~ students + teachers + calworks + lunch + computer + expenditure +
## income + english
## Model 2: math ~ lunch + I(lunch^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 420 41208
## 2 426 47184 -6 -5976.8 10.153 1.702e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit6)
# Based on the modeling and comparisons above, fit6 (model with quadratic term) is a best model that all
# others tested.
#3. Write the equation for the least square fit from part (2). y = 6.823e+02 - 7.835e-01x + 2.239e-03x^2
#4. Compare the R2 and RSE for models from part (1) and (2). Part 1 Model: Part 2 Model: R^2: 0.7281 R^2: 0.6887 Adj R^2: 0.723 Adj R^2: 0.6872 RSE: 9.905 RSE: 10.52
#5. Using a residuals plot for part 1 and part 2 models, state the appropriateness of using a linear model for this problem.
# Residual plot for linear fit
plot(fit)
# Residual plot for quadratic fit
plot(fit6)
# The residual plot above for the original linear model shows a slight pattern, but not strong enough to
# conclude a linear model would not be appropriate. However, the distribution of errors in the residual
# plot for the non-linear model shows no discernible pattern, idicating the non-linear model could be a
# better fit (as supported by previous analysis).
#6. Incorporate an interaction term into your multiple linear regression and respond to the following:
# Begin with the linear model with strongest indicator variables
fit7 = lm(math~lunch+income+lunch*income, data=CAS)
summary(fit7)
##
## Call:
## lm(formula = math ~ lunch + income + lunch * income, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.597 -6.470 -0.045 6.056 32.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.220457 2.293128 289.221 < 2e-16 ***
## lunch -0.376059 0.039922 -9.420 < 2e-16 ***
## income 0.713577 0.090642 7.872 2.92e-14 ***
## lunch:income -0.007279 0.003220 -2.261 0.0243 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.959 on 425 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.72
## F-statistic: 367.8 on 3 and 425 DF, p-value: < 2.2e-16
fit8 = lm(math~lunch:income, data=CAS)
summary(fit8)
##
## Call:
## lm(formula = math ~ lunch:income, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.018 -8.526 0.214 8.719 34.762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 679.17559 1.45202 467.75 <2e-16 ***
## lunch:income -0.04685 0.00236 -19.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.59 on 427 degrees of freedom
## Multiple R-squared: 0.4799, Adjusted R-squared: 0.4787
## F-statistic: 394 on 1 and 427 DF, p-value: < 2.2e-16
# Model with main effects and interaction
fit9 = lm(math~lunch*income, data=CAS)
summary(fit9)
##
## Call:
## lm(formula = math ~ lunch * income, data = CAS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.597 -6.470 -0.045 6.056 32.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.220457 2.293128 289.221 < 2e-16 ***
## lunch -0.376059 0.039922 -9.420 < 2e-16 ***
## income 0.713577 0.090642 7.872 2.92e-14 ***
## lunch:income -0.007279 0.003220 -2.261 0.0243 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.959 on 425 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.72
## F-statistic: 367.8 on 3 and 425 DF, p-value: < 2.2e-16
# Based on the output above, we can see that the p-value of lunch:income is significant at the 0.001 level.
# Create the interaction plot
library(ggplot2)
# Create the interaction plot
interact_plot <- ggplot(CAS, aes(x=lunch, y=math, color=income, group=income)) +
geom_point() +
labs(title = "Interaction Plot: Lunch and Income Group",
x = "Lunch",
y = "Math Score",
color = "Income") +
theme_minimal()
# Display the plot
print(interact_plot)
# b. Compare your results with the models from parts (1) and (2).
# Compare the interaction model with the best linear model
anova(fit4, fit7)
## Analysis of Variance Table
##
## Model 1: math ~ lunch
## Model 2: math ~ lunch + income + lunch * income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 427 48311
## 2 425 42150 2 6160.7 31.059 2.573e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the interaction model with the best non-linear model
anova(fit6, fit7)
## Analysis of Variance Table
##
## Model 1: math ~ lunch + I(lunch^2)
## Model 2: math ~ lunch + income + lunch * income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 426 47184
## 2 425 42150 1 5034.2 50.76 4.497e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Based on the comparisons above, the interaction model with income and lunch (fit7) is the
# strongest model overall, with a p-value of 4.497e-12, which is significant at the 0.001 level.