(2) Relationship between HS graduation rate and Income

           Population Income Illiteracy LifeExp Murder Grad Frost   Area  GradSq
Alabama          3615   3624        2.1   69.05   15.1 41.3    20  50708 1705.69
Alaska            365   6315        1.5   69.31   11.3 66.7   152 566432 4448.89
Arizona          2212   4530        1.8   70.55    7.8 58.1    15 113417 3375.61
Arkansas         2110   3378        1.9   70.66   10.1 39.9    65  51945 1592.01
California      21198   5114        1.1   71.71   10.3 62.6    20 156361 3918.76
Colorado         2541   4884        0.7   72.06    6.8 63.9   166 103766 4083.21

Fit a simple linear regression model to this data. Plot this line over the points.


Call:
lm(formula = Income ~ Grad, data = df)

Coefficients:
(Intercept)         Grad  
    1931.10        47.16  
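
The fit above could be reproduced with a sketch like the following. It assumes that `df` was built from R's built-in `state.x77` matrix with the `Life Exp` and `HS Grad` columns renamed to `LifeExp` and `Grad`, which is consistent with the rows printed at the top of this problem. The axis labels and line styling are illustrative choices.

```r
# Assumption: df comes from the built-in state.x77 matrix, with the
# "Life Exp" and "HS Grad" columns renamed to LifeExp and Grad.
df <- as.data.frame(state.x77)
names(df) <- c("Population", "Income", "Illiteracy", "LifeExp",
               "Murder", "Grad", "Frost", "Area")

# Simple linear regression of income on HS graduation rate.
fit_lin <- lm(Income ~ Grad, data = df)
print(coef(fit_lin))

# Scatterplot with the fitted line drawn over the points.
plot(Income ~ Grad, data = df,
     xlab = "HS graduation rate (%)", ylab = "Per capita income")
abline(fit_lin, col = "blue", lwd = 2)
```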

Fit a linear regression model with a quadratic effect for HS graduation rate to this data. Plot this line over the points.


Call:
lm(formula = Income ~ Grad + GradSq, data = df)

Coefficients:
(Intercept)         Grad       GradSq  
  -1505.424      183.196       -1.313  
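
A minimal sketch of the quadratic fit, again assuming `df` is derived from `state.x77` with renamed columns. The `GradSq` helper column mirrors the one shown in the data listing; the `I(Grad^2)` version is an equivalent formulation that avoids the extra column. The prediction grid `grid` is an illustrative name, not something from the original analysis.

```r
# Assumption: df comes from the built-in state.x77 matrix with renamed columns.
df <- as.data.frame(state.x77)
names(df) <- c("Population", "Income", "Illiteracy", "LifeExp",
               "Murder", "Grad", "Frost", "Area")
df$GradSq <- df$Grad^2

# Quadratic effect via an explicit squared column.
fit_quad <- lm(Income ~ Grad + GradSq, data = df)
print(coef(fit_quad))

# Equivalent model without the helper column.
fit_quad_alt <- lm(Income ~ Grad + I(Grad^2), data = df)

# Evaluate the fitted curve on a fine grid so it plots smoothly.
grid <- data.frame(Grad = seq(min(df$Grad), max(df$Grad), length.out = 200))
grid$GradSq <- grid$Grad^2

plot(Income ~ Grad, data = df,
     xlab = "HS graduation rate (%)", ylab = "Per capita income")
lines(grid$Grad, predict(fit_quad, newdata = grid), col = "red", lwd = 2)
```

Because the curve is not a straight line, `abline()` no longer applies; predicting on a grid and drawing with `lines()` is one common way around that.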

Fit a loess model to this data with an appropriate bandwidth. Plot this line over the points.

Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
parametric, : pseudoinverse used at 52.5
Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
parametric, : neighborhood radius 0.2
Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
parametric, : reciprocal condition number 4.7944e-17

Call:
loess(formula = Income ~ Grad, data = df, span = 1.25)

Number of Observations: 50 
Equivalent Number of Parameters: 3.2 
Residual Standard Error: 476.9 
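
The loess call above can be sketched in base R as follows, under the same assumption that `df` is derived from `state.x77`. The curve is drawn through the fitted values at the observed graduation rates, sorted by `Grad`; that plotting approach is an illustrative choice and may differ from the one used to produce the original figure.

```r
# Assumption: df comes from the built-in state.x77 matrix with renamed columns.
df <- as.data.frame(state.x77)
names(df) <- c("Population", "Income", "Illiteracy", "LifeExp",
               "Murder", "Grad", "Frost", "Area")

# Local regression with the span reported in the output above.
fit_lo <- loess(Income ~ Grad, data = df, span = 1.25)
print(fit_lo)

# Draw the fitted curve over the points, ordered by the predictor.
ord <- order(df$Grad)
plot(Income ~ Grad, data = df,
     xlab = "HS graduation rate (%)", ylab = "Per capita income")
lines(df$Grad[ord], fitted(fit_lo)[ord], col = "black", lwd = 2)
```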

Comment on the similarities and differences between these three different fits.

The scatterplot suggests a curved, roughly quadratic relationship, so the linear regression model with a quadratic effect (red) and the loess model (black) appear to fit the data better than the simple linear regression (blue). The quality of the loess fit depends on how the span \(\alpha\) is tuned: small spans overfit by chasing noise in the data, while large spans oversmooth. Somewhere between the two is a span that lets loess, which performs weighted least squares regressions on local subsets of the data, illustrate the relationship between the variables more accurately than either parametric fit. When the span is too high, the loess curve looks much like the quadratic fit because the local subsets are not small enough to pick up subtleties in the data. That is the case for the fit reported here: with span = 1.25 every local regression uses all 50 observations, and the equivalent number of parameters (3.2) is close to the 3 coefficients of the quadratic model.
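
One way to check how the span changes the loess fit is to refit at several spans and compare each fit with the quadratic model. The sketch below assumes `df` is derived from `state.x77`; the span values and the name `summary_by_span` are illustrative, and `surface = "direct"` is used so that the fitted values are computed exactly and not interpolated.

```r
# Assumption: df comes from the built-in state.x77 matrix with renamed columns.
df <- as.data.frame(state.x77)
names(df) <- c("Population", "Income", "Illiteracy", "LifeExp",
               "Murder", "Grad", "Frost", "Area")

fit_quad <- lm(Income ~ Grad + I(Grad^2), data = df)

# Illustrative spans, from fairly local to effectively global.
spans <- c(0.3, 0.75, 1.25, 100)

summary_by_span <- do.call(rbind, lapply(spans, function(sp) {
  fit <- loess(Income ~ Grad, data = df, span = sp,
               control = loess.control(surface = "direct"))
  data.frame(
    span = sp,
    enp = fit$enp,  # equivalent number of parameters
    # Largest gap between the loess and quadratic fitted values.
    max_gap_to_quadratic = max(abs(fitted(fit) - fitted(fit_quad)))
  )
}))
print(summary_by_span)
```

As the span grows, the tricube weights flatten out and each local quadratic fit approaches the global one, so the gap to the quadratic model should shrink toward zero.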

(4) Previous problem with values of \(\sigma\) equal to 0.1, 0.5, 1, and 1.5

Discuss the relationship between the random variability in the data and the range of reasonable spans.

As the random variability in the data increases, the range of reasonable spans appears to shift right and widen, since noisier data need more observations in each local fit to average the noise out. For the lower \(\sigma\) noise levels, spans of roughly 0.2 to 0.6 seem appropriate; for the higher \(\sigma\) noise levels, a more appropriate range might be 0.5 to 1 or above.
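
The relationship between noise level and span could be explored with a small simulation along these lines. This is only a sketch: the data-generating function from the previous problem is not shown here, so the sine curve `f`, the sample size, the span grid, and the 10% tolerance used to define a "reasonable" span are all hypothetical stand-ins.

```r
set.seed(1)

n <- 100
x <- sort(runif(n))
f <- function(x) sin(2 * pi * x)  # hypothetical true curve (assumption)

sigmas <- c(0.1, 0.5, 1, 1.5)     # noise levels from the problem statement
spans <- seq(0.2, 1.5, by = 0.1)  # illustrative span grid
reps <- 20                        # simulated data sets per noise level

# Average RMSE against the true curve: rows are spans, columns are sigmas.
rmse <- sapply(sigmas, function(s) {
  rowMeans(replicate(reps, {
    sim <- data.frame(x = x, y = f(x) + rnorm(n, sd = s))
    sapply(spans, function(sp) {
      fit <- loess(y ~ x, data = sim, span = sp)
      sqrt(mean((fitted(fit) - f(x))^2))
    })
  }))
})
dimnames(rmse) <- list(span = spans, sigma = sigmas)

# Best span for each sigma, and the range of spans within 10% of the best RMSE.
best_span <- spans[apply(rmse, 2, which.min)]
reasonable <- apply(rmse, 2, function(col) range(spans[col <= 1.10 * min(col)]))

print(best_span)
print(reasonable)
```

Comparing the columns of `reasonable` across noise levels gives a rough numerical counterpart to the visual judgment described above; the result will depend on the assumed curve and tolerance.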