Question 1:

For this question use the Japan_earthquakes.csv. This data set contains information regarding earthquakes from 2001-2018 in/around Japan.

Please create an interactive map using leaflet . Answer the following questions.

(Please export your maps as images for each part for the PDF file, as html maps will not be supported when converted into PDFs.

You can use "always_allow_html: true" on the header of your rmd file, so that you can execute the codes without any errors. Please see the header of the attached rmd file.

You must include the all relevant codes as well. )

#1. Use geolocation of Japan (lat and long values) to set up a base map. (You can use any providerTiles that is appropriate).

library(leaflet)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
leaflet() %>% 
  addTiles() #base map
leaflet() %>% 
  addProviderTiles(providers$Esri.OceanBasemap)
leaflet() %>% 
  setView(lng = 139.412689, lat = 36.709055, zoom = 5.7) %>% 
  addProviderTiles(providers$Esri.OceanBasemap) %>%
  addProviderTiles(providers$Stamen.TonerLabels)

#2. Create subset, of all earthquake data which has a magnitude (mag) of 6 and higher.

# Import Japan earthquake data
quakes = dat = read.csv('/Users/ryankelly/Desktop/Desktop - Ryan’s MacBook Pro/GMU/Spring 2023/STAT 515/Homework/HW2/Japan_earthquakes.csv')

# Adjust the data set to only include magnitude and location data
quakes <- quakes[, c("mag","latitude","longitude","time","place","depth")]

# Create a subset of the quakes data frame where mag is greater than or equal to 6
quakes_6 <- subset(quakes, mag >= 6)

#3. Using the subset from part 2) plot all the earthquake locations on the map of Japan.

#4. Use Circles to represent the location of earthquakes and make sure that the area of the circle corresponds to the magnitude of each earthquake.(Hint: Use the following formula for the radius, \(\sqrt(10^x)*c\) : Here x is the magnitude and the c is a arbitrary constant that you can choose your own).

#5. Adjust the circles that you got from part 4), so that the map is aesthetically pleasing. (Hint: colors, opacity etc).

#6. Add labels to each circle indicating the place that each earthquake occurred.

#7. Add a popups to each circle to see the following information:

-   `Time` of the Earthquake

-   `Magnitude` of the Earthquake

-   `Depth` in (km) of the Earthquake

-   `Place` of the Earthquake.
# Plot the earthquakes on the map
leaflet(quakes_6) %>% 
  addProviderTiles(providers$Esri.OceanBasemap) %>%
  addProviderTiles(providers$Stamen.TonerLabels) %>%
  addCircles(lng = ~longitude,
             lat = ~latitude,
             popup = paste("Time: ", quakes_6$time, "<br>", "Magnitude: ", quakes_6$mag,
                           "<br>", "Depth: ", quakes_6$depth, "<br>", "Place: ", quakes_6$place),
             label = ~place, # adding labels
             weight = ~sqrt(10^mag)*.002, # changing the size of circles
             opacity = .75, # changing the opacity of circles
             color = 'darkred' # usually we use hexadecimal colors
             ) 

Question 2:

Is there a relationship between student math test scores and socioeconomic variables? The data set CASchools.csv contains data on test performance, school demographics, and student demographic background for school districts in California. Remove the variable county before the analysis.

Please use the Data set description document to learn more about the data set.

#1. Fit a linear regression model to predict math test scores using all variables in the data set. Discuss your results, making sure to cover the following points:

CAS <- read.csv('/Users/ryankelly/Desktop/Desktop - Ryan’s MacBook Pro/GMU/Spring 2023/STAT 515/Homework/HW2/CASchools.csv')

# Remove the variable "county" from the data frame
CAS <- CAS[, -which(names(CAS) == "county")]
CAS <- CAS[, -which(names(CAS) == "school")]
CAS <- CAS[, -which(names(CAS) == "grades")]
CAS <- CAS[, -which(names(CAS) == "district")]

# Change income to numeric for manual color change
CAS$income <- gsub("[^0-9.]", "", CAS$income)
CAS$income = as.numeric(CAS$income)

# Change computer to numeric for manual color change
library(stringr)
CAS$computer <- as.numeric(str_replace_all(CAS$computer, "[^[:digit:]]", "")) 

## Consider removing missing data
#sum(is.na(CAS$?????)) # Replace ????? with columns with missing values
#CAS=na.omit(CAS) # removing rows with missing values


# Replace missing values with interpolated values rows where values are missing
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
set.seed(123) # Set the seed for reproducibility
CAS <- mice(CAS, m = 5, maxit = 50, method = "pmm") # Impute missing values using the mice package
## 
##  iter imp variable
##   1   1  calworks  lunch  computer  income
##   1   2  calworks  lunch  computer  income
##   1   3  calworks  lunch  computer  income
##   1   4  calworks  lunch  computer  income
##   1   5  calworks  lunch  computer  income
##   2   1  calworks  lunch  computer  income
##   2   2  calworks  lunch  computer  income
##   2   3  calworks  lunch  computer  income
##   2   4  calworks  lunch  computer  income
##   2   5  calworks  lunch  computer  income
##   3   1  calworks  lunch  computer  income
##   3   2  calworks  lunch  computer  income
##   3   3  calworks  lunch  computer  income
##   3   4  calworks  lunch  computer  income
##   3   5  calworks  lunch  computer  income
##   4   1  calworks  lunch  computer  income
##   4   2  calworks  lunch  computer  income
##   4   3  calworks  lunch  computer  income
##   4   4  calworks  lunch  computer  income
##   4   5  calworks  lunch  computer  income
##   5   1  calworks  lunch  computer  income
##   5   2  calworks  lunch  computer  income
##   5   3  calworks  lunch  computer  income
##   5   4  calworks  lunch  computer  income
##   5   5  calworks  lunch  computer  income
##   6   1  calworks  lunch  computer  income
##   6   2  calworks  lunch  computer  income
##   6   3  calworks  lunch  computer  income
##   6   4  calworks  lunch  computer  income
##   6   5  calworks  lunch  computer  income
##   7   1  calworks  lunch  computer  income
##   7   2  calworks  lunch  computer  income
##   7   3  calworks  lunch  computer  income
##   7   4  calworks  lunch  computer  income
##   7   5  calworks  lunch  computer  income
##   8   1  calworks  lunch  computer  income
##   8   2  calworks  lunch  computer  income
##   8   3  calworks  lunch  computer  income
##   8   4  calworks  lunch  computer  income
##   8   5  calworks  lunch  computer  income
##   9   1  calworks  lunch  computer  income
##   9   2  calworks  lunch  computer  income
##   9   3  calworks  lunch  computer  income
##   9   4  calworks  lunch  computer  income
##   9   5  calworks  lunch  computer  income
##   10   1  calworks  lunch  computer  income
##   10   2  calworks  lunch  computer  income
##   10   3  calworks  lunch  computer  income
##   10   4  calworks  lunch  computer  income
##   10   5  calworks  lunch  computer  income
##   11   1  calworks  lunch  computer  income
##   11   2  calworks  lunch  computer  income
##   11   3  calworks  lunch  computer  income
##   11   4  calworks  lunch  computer  income
##   11   5  calworks  lunch  computer  income
##   12   1  calworks  lunch  computer  income
##   12   2  calworks  lunch  computer  income
##   12   3  calworks  lunch  computer  income
##   12   4  calworks  lunch  computer  income
##   12   5  calworks  lunch  computer  income
##   13   1  calworks  lunch  computer  income
##   13   2  calworks  lunch  computer  income
##   13   3  calworks  lunch  computer  income
##   13   4  calworks  lunch  computer  income
##   13   5  calworks  lunch  computer  income
##   14   1  calworks  lunch  computer  income
##   14   2  calworks  lunch  computer  income
##   14   3  calworks  lunch  computer  income
##   14   4  calworks  lunch  computer  income
##   14   5  calworks  lunch  computer  income
##   15   1  calworks  lunch  computer  income
##   15   2  calworks  lunch  computer  income
##   15   3  calworks  lunch  computer  income
##   15   4  calworks  lunch  computer  income
##   15   5  calworks  lunch  computer  income
##   16   1  calworks  lunch  computer  income
##   16   2  calworks  lunch  computer  income
##   16   3  calworks  lunch  computer  income
##   16   4  calworks  lunch  computer  income
##   16   5  calworks  lunch  computer  income
##   17   1  calworks  lunch  computer  income
##   17   2  calworks  lunch  computer  income
##   17   3  calworks  lunch  computer  income
##   17   4  calworks  lunch  computer  income
##   17   5  calworks  lunch  computer  income
##   18   1  calworks  lunch  computer  income
##   18   2  calworks  lunch  computer  income
##   18   3  calworks  lunch  computer  income
##   18   4  calworks  lunch  computer  income
##   18   5  calworks  lunch  computer  income
##   19   1  calworks  lunch  computer  income
##   19   2  calworks  lunch  computer  income
##   19   3  calworks  lunch  computer  income
##   19   4  calworks  lunch  computer  income
##   19   5  calworks  lunch  computer  income
##   20   1  calworks  lunch  computer  income
##   20   2  calworks  lunch  computer  income
##   20   3  calworks  lunch  computer  income
##   20   4  calworks  lunch  computer  income
##   20   5  calworks  lunch  computer  income
##   21   1  calworks  lunch  computer  income
##   21   2  calworks  lunch  computer  income
##   21   3  calworks  lunch  computer  income
##   21   4  calworks  lunch  computer  income
##   21   5  calworks  lunch  computer  income
##   22   1  calworks  lunch  computer  income
##   22   2  calworks  lunch  computer  income
##   22   3  calworks  lunch  computer  income
##   22   4  calworks  lunch  computer  income
##   22   5  calworks  lunch  computer  income
##   23   1  calworks  lunch  computer  income
##   23   2  calworks  lunch  computer  income
##   23   3  calworks  lunch  computer  income
##   23   4  calworks  lunch  computer  income
##   23   5  calworks  lunch  computer  income
##   24   1  calworks  lunch  computer  income
##   24   2  calworks  lunch  computer  income
##   24   3  calworks  lunch  computer  income
##   24   4  calworks  lunch  computer  income
##   24   5  calworks  lunch  computer  income
##   25   1  calworks  lunch  computer  income
##   25   2  calworks  lunch  computer  income
##   25   3  calworks  lunch  computer  income
##   25   4  calworks  lunch  computer  income
##   25   5  calworks  lunch  computer  income
##   26   1  calworks  lunch  computer  income
##   26   2  calworks  lunch  computer  income
##   26   3  calworks  lunch  computer  income
##   26   4  calworks  lunch  computer  income
##   26   5  calworks  lunch  computer  income
##   27   1  calworks  lunch  computer  income
##   27   2  calworks  lunch  computer  income
##   27   3  calworks  lunch  computer  income
##   27   4  calworks  lunch  computer  income
##   27   5  calworks  lunch  computer  income
##   28   1  calworks  lunch  computer  income
##   28   2  calworks  lunch  computer  income
##   28   3  calworks  lunch  computer  income
##   28   4  calworks  lunch  computer  income
##   28   5  calworks  lunch  computer  income
##   29   1  calworks  lunch  computer  income
##   29   2  calworks  lunch  computer  income
##   29   3  calworks  lunch  computer  income
##   29   4  calworks  lunch  computer  income
##   29   5  calworks  lunch  computer  income
##   30   1  calworks  lunch  computer  income
##   30   2  calworks  lunch  computer  income
##   30   3  calworks  lunch  computer  income
##   30   4  calworks  lunch  computer  income
##   30   5  calworks  lunch  computer  income
##   31   1  calworks  lunch  computer  income
##   31   2  calworks  lunch  computer  income
##   31   3  calworks  lunch  computer  income
##   31   4  calworks  lunch  computer  income
##   31   5  calworks  lunch  computer  income
##   32   1  calworks  lunch  computer  income
##   32   2  calworks  lunch  computer  income
##   32   3  calworks  lunch  computer  income
##   32   4  calworks  lunch  computer  income
##   32   5  calworks  lunch  computer  income
##   33   1  calworks  lunch  computer  income
##   33   2  calworks  lunch  computer  income
##   33   3  calworks  lunch  computer  income
##   33   4  calworks  lunch  computer  income
##   33   5  calworks  lunch  computer  income
##   34   1  calworks  lunch  computer  income
##   34   2  calworks  lunch  computer  income
##   34   3  calworks  lunch  computer  income
##   34   4  calworks  lunch  computer  income
##   34   5  calworks  lunch  computer  income
##   35   1  calworks  lunch  computer  income
##   35   2  calworks  lunch  computer  income
##   35   3  calworks  lunch  computer  income
##   35   4  calworks  lunch  computer  income
##   35   5  calworks  lunch  computer  income
##   36   1  calworks  lunch  computer  income
##   36   2  calworks  lunch  computer  income
##   36   3  calworks  lunch  computer  income
##   36   4  calworks  lunch  computer  income
##   36   5  calworks  lunch  computer  income
##   37   1  calworks  lunch  computer  income
##   37   2  calworks  lunch  computer  income
##   37   3  calworks  lunch  computer  income
##   37   4  calworks  lunch  computer  income
##   37   5  calworks  lunch  computer  income
##   38   1  calworks  lunch  computer  income
##   38   2  calworks  lunch  computer  income
##   38   3  calworks  lunch  computer  income
##   38   4  calworks  lunch  computer  income
##   38   5  calworks  lunch  computer  income
##   39   1  calworks  lunch  computer  income
##   39   2  calworks  lunch  computer  income
##   39   3  calworks  lunch  computer  income
##   39   4  calworks  lunch  computer  income
##   39   5  calworks  lunch  computer  income
##   40   1  calworks  lunch  computer  income
##   40   2  calworks  lunch  computer  income
##   40   3  calworks  lunch  computer  income
##   40   4  calworks  lunch  computer  income
##   40   5  calworks  lunch  computer  income
##   41   1  calworks  lunch  computer  income
##   41   2  calworks  lunch  computer  income
##   41   3  calworks  lunch  computer  income
##   41   4  calworks  lunch  computer  income
##   41   5  calworks  lunch  computer  income
##   42   1  calworks  lunch  computer  income
##   42   2  calworks  lunch  computer  income
##   42   3  calworks  lunch  computer  income
##   42   4  calworks  lunch  computer  income
##   42   5  calworks  lunch  computer  income
##   43   1  calworks  lunch  computer  income
##   43   2  calworks  lunch  computer  income
##   43   3  calworks  lunch  computer  income
##   43   4  calworks  lunch  computer  income
##   43   5  calworks  lunch  computer  income
##   44   1  calworks  lunch  computer  income
##   44   2  calworks  lunch  computer  income
##   44   3  calworks  lunch  computer  income
##   44   4  calworks  lunch  computer  income
##   44   5  calworks  lunch  computer  income
##   45   1  calworks  lunch  computer  income
##   45   2  calworks  lunch  computer  income
##   45   3  calworks  lunch  computer  income
##   45   4  calworks  lunch  computer  income
##   45   5  calworks  lunch  computer  income
##   46   1  calworks  lunch  computer  income
##   46   2  calworks  lunch  computer  income
##   46   3  calworks  lunch  computer  income
##   46   4  calworks  lunch  computer  income
##   46   5  calworks  lunch  computer  income
##   47   1  calworks  lunch  computer  income
##   47   2  calworks  lunch  computer  income
##   47   3  calworks  lunch  computer  income
##   47   4  calworks  lunch  computer  income
##   47   5  calworks  lunch  computer  income
##   48   1  calworks  lunch  computer  income
##   48   2  calworks  lunch  computer  income
##   48   3  calworks  lunch  computer  income
##   48   4  calworks  lunch  computer  income
##   48   5  calworks  lunch  computer  income
##   49   1  calworks  lunch  computer  income
##   49   2  calworks  lunch  computer  income
##   49   3  calworks  lunch  computer  income
##   49   4  calworks  lunch  computer  income
##   49   5  calworks  lunch  computer  income
##   50   1  calworks  lunch  computer  income
##   50   2  calworks  lunch  computer  income
##   50   3  calworks  lunch  computer  income
##   50   4  calworks  lunch  computer  income
##   50   5  calworks  lunch  computer  income
CAS <- complete(CAS)  # Extract the completed data set from the imputed data

pairs(CAS)

str(CAS)
## 'data.frame':    429 obs. of  9 variables:
##  $ students   : int  195 240 1550 243 1335 137 195 888 379 2247 ...
##  $ teachers   : num  10.9 11.1 82.9 14 71.5 ...
##  $ calworks   : num  0.51 15.42 55.03 36.48 33.11 ...
##  $ lunch      : num  2.04 47.92 76.32 77.05 78.43 ...
##  $ computer   : num  67 101 169 85 171 25 28 66 35 0 ...
##  $ expenditure: num  6385 5099 5502 7102 5236 ...
##  $ income     : num  22.69 9.82 8.98 8.98 9.08 ...
##  $ english    : num  0 4.58 30 0 13.86 ...
##  $ math       : num  690 662 651 644 640 ...
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pairs.panels(CAS)

a. What do you observe about the relationship between these predictors and math test scores?

fit = lm(math~., data=CAS)

summary(fit)
## 
## Call:
## lm(formula = math ~ ., data = CAS)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.2764  -6.6892   0.2302   5.9269  31.5397 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.580e+02  4.328e+00 152.030  < 2e-16 ***
## students    -7.837e-04  1.727e-03  -0.454  0.65018    
## teachers     4.584e-03  3.799e-02   0.121  0.90400    
## calworks    -9.618e-02  6.625e-02  -1.452  0.14735    
## lunch       -3.782e-01  3.691e-02 -10.247  < 2e-16 ***
## computer     4.769e-03  3.246e-03   1.469  0.14253    
## expenditure  8.990e-04  8.526e-04   1.054  0.29231    
## income       6.547e-01  1.004e-01   6.523 1.98e-10 ***
## english     -7.622e-02  2.607e-02  -2.924  0.00364 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.905 on 420 degrees of freedom
## Multiple R-squared:  0.7281, Adjusted R-squared:  0.723 
## F-statistic: 140.6 on 8 and 420 DF,  p-value: < 2.2e-16
# It appears there are some noticeable trends in the observing the data points in the plot martix above.
# It is difficult the exact nature of the relationship, but there does appear to be some correlation
# between math scores and some of the predictors, such as 'lunch' and 'income.'

b. Are there any insignificant variables?

      From the regression output above, it appears that sutudents, teachers, calworks, computer, and expenditure are insignificant.
      

c. Which predictor is explaining math scores the best?

      It appears that lunch is explaining math scores the best, based on the p-value of < 2e-16 ***

d. Explain the interpretation of the coefficient on lunch in terms of the response variable.

      The p-value of < 2e-16 for the response variable lunch indicates the predictor is significant.  The ***                indicates that this predictor variable is significant at the p < 0.001 level.  This means the probability of observing the result by chance (null hypothesis) is less than 0.1%

#2. Do you think you can fit a better model than the model in part (1)? Fit a new model that you think is a better fit for the data. What do you observe about this new model?

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(fit)
##    students    teachers    calworks       lunch    computer expenditure 
##  196.504216  219.232247    2.489533    4.363760    8.811838    1.319400 
##      income     english 
##    2.468116    1.593449
# 'students' and 'teachers' have very high VIF values --> start with removing those from model
fit1 = lm(math~calworks+lunch+computer+expenditure+income+english, data=CAS); summary(fit1);
## 
## Call:
## lm(formula = math ~ calworks + lunch + computer + expenditure + 
##     income + english, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.566  -6.970   0.306   5.936  31.774 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.573e+02  4.253e+00 154.543  < 2e-16 ***
## calworks    -9.180e-02  6.607e-02  -1.389  0.16542    
## lunch       -3.837e-01  3.679e-02 -10.431  < 2e-16 ***
## computer     1.986e-04  1.153e-03   0.172  0.86333    
## expenditure  1.051e-03  8.429e-04   1.247  0.21312    
## income       6.563e-01  9.981e-02   6.576 1.43e-10 ***
## english     -7.887e-02  2.605e-02  -3.028  0.00261 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.915 on 422 degrees of freedom
## Multiple R-squared:  0.7263, Adjusted R-squared:  0.7224 
## F-statistic: 186.6 on 6 and 422 DF,  p-value: < 2.2e-16
fit2 = lm(math~lunch+computer+income+english, data=CAS); summary(fit2);
## 
## Call:
## lm(formula = math ~ lunch + computer + income + english, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.157  -6.939   0.002   5.909  31.942 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.619e+02  2.326e+00 284.564  < 2e-16 ***
## lunch       -4.069e-01  2.786e-02 -14.605  < 2e-16 ***
## computer    -1.284e-04  1.137e-03  -0.113  0.91014    
## income       7.087e-01  8.983e-02   7.889 2.62e-14 ***
## english     -7.248e-02  2.503e-02  -2.896  0.00398 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.927 on 424 degrees of freedom
## Multiple R-squared:  0.7243, Adjusted R-squared:  0.7217 
## F-statistic: 278.5 on 4 and 424 DF,  p-value: < 2.2e-16
fit3 = lm(math~lunch+income+english, data=CAS); summary(fit3);
## 
## Call:
## lm(formula = math ~ lunch + income + english, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.126  -6.974   0.005   5.870  31.966 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 661.89369    2.32328 284.896  < 2e-16 ***
## lunch        -0.40704    0.02779 -14.647  < 2e-16 ***
## income        0.70721    0.08878   7.966 1.51e-14 ***
## english      -0.07304    0.02452  -2.979  0.00306 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.915 on 425 degrees of freedom
## Multiple R-squared:  0.7243, Adjusted R-squared:  0.7224 
## F-statistic: 372.2 on 3 and 425 DF,  p-value: < 2.2e-16
fit4 = lm(math~lunch, data=CAS); summary(fit4);  
## 
## Call:
## lm(formula = math ~ lunch, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.313  -6.778  -0.250   6.423  31.535 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 679.01320    0.98941  686.28   <2e-16 ***
## lunch        -0.57315    0.01897  -30.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.64 on 427 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6805 
## F-statistic: 912.7 on 1 and 427 DF,  p-value: < 2.2e-16
# Compare original models to adjusted models
anova(fit,fit1)
## Analysis of Variance Table
## 
## Model 1: math ~ students + teachers + calworks + lunch + computer + expenditure + 
##     income + english
## Model 2: math ~ calworks + lunch + computer + expenditure + income + english
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    420 41208                           
## 2    422 41490 -2   -282.29 1.4386 0.2384
anova(fit1,fit2)
## Analysis of Variance Table
## 
## Model 1: math ~ calworks + lunch + computer + expenditure + income + english
## Model 2: math ~ lunch + computer + income + english
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    422 41490                           
## 2    424 41783 -2   -293.45 1.4924  0.226
anova(fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: math ~ lunch + computer + income + english
## Model 2: math ~ lunch + income + english
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    424 41783                           
## 2    425 41784 -1   -1.2569 0.0128 0.9101
anova(fit3,fit4) # fit4 is the best linear model based on analysis of variance
## Analysis of Variance Table
## 
## Model 1: math ~ lunch + income + english
## Model 2: math ~ lunch
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    425 41784                                 
## 2    427 48311 -2   -6526.2 33.19 4.041e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Non-linear models
fit5 = lm(math~poly(lunch, 5), data=CAS); summary(fit5);
## 
## Call:
## lm(formula = math ~ poly(lunch, 5), data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.177  -7.188   0.105   6.740  30.392 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      653.4639     0.4978 1312.574  < 2e-16 ***
## poly(lunch, 5)1 -321.3471    10.3116  -31.164  < 2e-16 ***
## poly(lunch, 5)2   33.5624    10.3116    3.255  0.00123 ** 
## poly(lunch, 5)3  -42.7860    10.3116   -4.149 4.03e-05 ***
## poly(lunch, 5)4   19.2891    10.3116    1.871  0.06209 .  
## poly(lunch, 5)5    2.0707    10.3116    0.201  0.84094    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.31 on 423 degrees of freedom
## Multiple R-squared:  0.7033, Adjusted R-squared:  0.6998 
## F-statistic: 200.5 on 5 and 423 DF,  p-value: < 2.2e-16
fit6 = lm(math~lunch+I(lunch^2), data=CAS); summary(fit6);
## 
## Call:
## lm(formula = math ~ lunch + I(lunch^2), data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.601  -7.175  -0.216   6.756  32.843 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.823e+02  1.422e+00 479.935  < 2e-16 ***
## lunch       -7.835e-01  6.858e-02 -11.424  < 2e-16 ***
## I(lunch^2)   2.239e-03  7.021e-04   3.189  0.00153 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.52 on 426 degrees of freedom
## Multiple R-squared:  0.6887, Adjusted R-squared:  0.6872 
## F-statistic: 471.2 on 2 and 426 DF,  p-value: < 2.2e-16
# Compare non-linear models
anova(fit5, fit6)
## Analysis of Variance Table
## 
## Model 1: math ~ poly(lunch, 5)
## Model 2: math ~ lunch + I(lunch^2)
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
## 1    423 44977                                 
## 2    426 47184 -3     -2207 6.9188 0.000148 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare best linear model to best non-linear model
anova(fit4, fit6) # Result shows fit6 (model with quadratic term) is better than best linear model
## Analysis of Variance Table
## 
## Model 1: math ~ lunch
## Model 2: math ~ lunch + I(lunch^2)
##   Res.Df   RSS Df Sum of Sq     F   Pr(>F)   
## 1    427 48311                               
## 2    426 47184  1    1126.4 10.17 0.001533 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Finally, compare quadratic model to original model
anova(fit, fit6)
## Analysis of Variance Table
## 
## Model 1: math ~ students + teachers + calworks + lunch + computer + expenditure + 
##     income + english
## Model 2: math ~ lunch + I(lunch^2)
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    420 41208                                  
## 2    426 47184 -6   -5976.8 10.153 1.702e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit6)

# Based on the modeling and comparisons above, fit6 (model with quadratic term) is a best model that all
# others tested.

#3. Write the equation for the least square fit from part (2). y = 6.823e+02 - 7.835e-01x + 2.239e-03x^2

#4. Compare the R2 and RSE for models from part (1) and (2). Part 1 Model: Part 2 Model: R^2: 0.7281 R^2: 0.6887 Adj R^2: 0.723 Adj R^2: 0.6872 RSE: 9.905 RSE: 10.52

#5. Using a residuals plot for part 1 and part 2 models, state the appropriateness of using a linear model for this problem.

# Residual plot for linear fit
plot(fit)

# Residual plot for quadratic fit
plot(fit6)

# The residual plot above for the original linear model shows a slight pattern, but not strong enough to 
# conclude a linear model would not be appropriate.  However, the distribution of errors in the residual
# plot for the non-linear model shows no discernible pattern, idicating the non-linear model could be a 
# better fit (as supported by previous analysis).

#6. Incorporate an interaction term into your multiple linear regression and respond to the following:

# Begin with the linear model with strongest indicator variables
fit7 = lm(math~lunch+income+lunch*income, data=CAS)
summary(fit7)
## 
## Call:
## lm(formula = math ~ lunch + income + lunch * income, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.597  -6.470  -0.045   6.056  32.821 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  663.220457   2.293128 289.221  < 2e-16 ***
## lunch         -0.376059   0.039922  -9.420  < 2e-16 ***
## income         0.713577   0.090642   7.872 2.92e-14 ***
## lunch:income  -0.007279   0.003220  -2.261   0.0243 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.959 on 425 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:   0.72 
## F-statistic: 367.8 on 3 and 425 DF,  p-value: < 2.2e-16
fit8 = lm(math~lunch:income, data=CAS)
summary(fit8)
## 
## Call:
## lm(formula = math ~ lunch:income, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.018  -8.526   0.214   8.719  34.762 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  679.17559    1.45202  467.75   <2e-16 ***
## lunch:income  -0.04685    0.00236  -19.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.59 on 427 degrees of freedom
## Multiple R-squared:  0.4799, Adjusted R-squared:  0.4787 
## F-statistic:   394 on 1 and 427 DF,  p-value: < 2.2e-16
# Model with main effects and interaction
fit9 = lm(math~lunch*income, data=CAS)
summary(fit9)
## 
## Call:
## lm(formula = math ~ lunch * income, data = CAS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.597  -6.470  -0.045   6.056  32.821 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  663.220457   2.293128 289.221  < 2e-16 ***
## lunch         -0.376059   0.039922  -9.420  < 2e-16 ***
## income         0.713577   0.090642   7.872 2.92e-14 ***
## lunch:income  -0.007279   0.003220  -2.261   0.0243 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.959 on 425 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:   0.72 
## F-statistic: 367.8 on 3 and 425 DF,  p-value: < 2.2e-16

a. Provide analytical justification for your interaction term.

# Based on the output above, we can see that the p-value of lunch:income is significant at the 0.001 level. 

# Create the interaction plot
library(ggplot2)

# Create the interaction plot
interact_plot <- ggplot(CAS, aes(x=lunch, y=math, color=income, group=income)) +
  geom_point() +
  labs(title = "Interaction Plot: Lunch and Income Group",
       x = "Lunch",
       y = "Math Score",
       color = "Income") +
  theme_minimal()

# Display the plot
print(interact_plot)

# b. Compare your results with the models from parts (1) and (2).

# Compare the interaction model with the best linear model
anova(fit4, fit7)
## Analysis of Variance Table
## 
## Model 1: math ~ lunch
## Model 2: math ~ lunch + income + lunch * income
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    427 48311                                  
## 2    425 42150  2    6160.7 31.059 2.573e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the interaction model with the best non-linear model
anova(fit6, fit7)
## Analysis of Variance Table
## 
## Model 1: math ~ lunch + I(lunch^2)
## Model 2: math ~ lunch + income + lunch * income
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    426 47184                                 
## 2    425 42150  1    5034.2 50.76 4.497e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Based on the comparisons above, the interaction model with income and lunch (fit7) is the 
# strongest model overall, with a p-value of 4.497e-12, which is significant at the 0.001 level.