HW Instructions

Students should create a separate R markdown file that includes the necessary code and graphics to answer each question. The R markdown file should be submitted as well as a knitted version in either Word or HTML. Although the homeworks are designed to offer students a jumping off point when it comes to coding, students are expected to utilize the discussions and R examples used in the live sessions.

For problems that include making graphics or performing analysis, students should be sure to articulate their answers in writing unless explicitly told not to. That is, if you provide a figure or table of results then it is expected that commentary will also be provided.

Exercise 1: Air Pollution Study

Researchers wish to assess whether air pollution has a negative impact on well populated cities by investigating the cities mortality rates (per 100,000). It is generally accepted that mortality rates depend on a a number of factors including socioeconomic and environmental variables.

To assess whether air pollution has an effect on the mortality rate, we wish to determine whether measurements of pollutants such as nitrogen oxides and sulfur dioxide have an effect after accounting for well-known factors. The following code chunk loads in the data set and provides an overall summary graph. Precipitation is mean rainfall in inches. Education is median number of school years completed among adults and nonwhite is the percentage of the cities population that is nonwhite.

# A
# Load necessary libraries
library(Sleuth3)

## Warning: package 'Sleuth3' was built under R version 4.3.2

library(GGally)

## Warning: package 'GGally' was built under R version 4.3.2

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# Load the data
air <- ex1123
# View the first few rows of the data
head(air)

##              City   Mort Precip Educ NonWhite NOX SO2
## 1    San Jose, CA 790.73     13 12.2      3.0  32   3
## 2     Wichita, KS 823.76     28 12.1      7.5   2   1
## 3   San Diego, CA 839.71     10 12.1      5.9  66  20
## 4   Lancaster, PA 844.05     43  9.5      2.9   7  32
## 5 Minneapolis, MN 857.62     25 12.1      2.0  11  26
## 6      Dallas, TX 860.10     35 11.8     14.8   1   1

# Create ggpairs plot
ggpairs(air[, c(3:7, 2)])

#B
library(stats)
# Fit a linear regression model using all predictors
model <- lm(Mort ~ Precip + Educ + NonWhite + NOX + SO2, data = air)
# Summary of the model to check coefficients
summary(model)

## 
## Call:
## lm(formula = Mort ~ Precip + Educ + NonWhite + NOX + SO2, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.893 -18.986  -3.433  15.872  91.528 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1000.1026    92.3982  10.824 3.85e-15 ***
## Precip         1.3792     0.7000   1.970 0.053943 .  
## Educ         -15.0791     7.0706  -2.133 0.037518 *  
## NonWhite       3.1602     0.6287   5.026 5.84e-06 ***
## NOX           -0.1076     0.1359  -0.792 0.432030    
## SO2            0.3554     0.0914   3.889 0.000278 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.36 on 54 degrees of freedom
## Multiple R-squared:  0.6698, Adjusted R-squared:  0.6392 
## F-statistic:  21.9 on 5 and 54 DF,  p-value: 6.478e-12

# Residual diagnostics
par(mfrow=c(2,2))
plot(model)

# Transforming response variable using log transformation to address potential heteroscedasticity
air$log_Mort <- log(air$Mort)
# Fitting a new model with log-transformed Mortality
model_log <- lm(log_Mort ~ Precip + Educ + NonWhite + NOX + SO2, data = air)
# Checking the new model's diagnostics
par(mfrow=c(2,2))
plot(model_log)

#B 2 - Modifying  model to address issues
library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

# Checking for multicollinearity using VIF
vif(model_log)

##   Precip     Educ NonWhite      NOX      SO2 
## 2.064568 1.509714 1.329690 1.675305 1.418590

# Adding a squared term for NOX
air$NOX2 <- air$NOX^2
# Fitting the model with the squared term
model_poly <- lm(log_Mort ~ Precip + Educ + NonWhite + NOX + I(NOX^2) + SO2, data = air)
model_poly

## 
## Call:
## lm(formula = log_Mort ~ Precip + Educ + NonWhite + NOX + I(NOX^2) + 
##     SO2, data = air)
## 
## Coefficients:
## (Intercept)       Precip         Educ     NonWhite          NOX     I(NOX^2)  
##   6.901e+00    1.640e-03   -1.597e-02    3.182e-03    6.308e-05   -5.611e-07  
##         SO2  
##   3.587e-04

# Diagnostic plots for the polynomial model
par(mfrow=c(2,2))
plot(model_poly)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

# Summary statistics for the polynomial model
summary(model_poly)

## 
## Call:
## lm(formula = log_Mort ~ Precip + Educ + NonWhite + NOX + I(NOX^2) + 
##     SO2, data = air)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.102765 -0.019367 -0.002096  0.019278  0.097152 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.901e+00  1.003e-01  68.804  < 2e-16 ***
## Precip       1.640e-03  7.880e-04   2.081  0.04232 *  
## Educ        -1.597e-02  7.746e-03  -2.062  0.04412 *  
## NonWhite     3.182e-03  7.085e-04   4.492 3.86e-05 ***
## NOX          6.308e-05  5.257e-04   0.120  0.90495    
## I(NOX^2)    -5.611e-07  1.576e-06  -0.356  0.72321    
## SO2          3.587e-04  1.238e-04   2.896  0.00548 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04054 on 53 degrees of freedom
## Multiple R-squared:  0.6641, Adjusted R-squared:  0.626 
## F-statistic: 17.46 on 6 and 53 DF,  p-value: 5.007e-11

# Identify potential high leverage points
influence_measures <- influence.measures(model_poly)
print(influence_measures)

## Influence measures of
##   lm(formula = log_Mort ~ Precip + Educ + NonWhite + NOX + I(NOX^2) +      SO2, data = air) :
## 
##       dfb.1_  dfb.Prcp dfb.Educ  dfb.NnWh   dfb.NOX  dfb.I.NO   dfb.SO2
## 1  -0.194726  4.91e-01  0.03848 -0.023997 -2.41e-01  0.315034  0.327974
## 2   0.096787  2.24e-01 -0.21477 -0.102355  2.50e-01 -0.203655 -0.075169
## 3  -0.097439  1.41e-01  0.06501  0.005985 -2.49e-01  0.265290  0.219642
## 4  -0.667314 -1.30e-01  0.75149  0.538008 -2.25e-01  0.169234  0.342568
## 5   0.014597  2.98e-02 -0.03232  0.011608  2.09e-02 -0.012774 -0.012576
## 6   0.110861  1.28e-01 -0.18620 -0.200296  2.19e-01 -0.192314 -0.045076
## 7   0.992661 -1.32e+00 -0.76855  0.454590 -2.01e-01  0.082098  0.099954
## 8  -0.006044  3.56e-02 -0.01116 -0.038537  2.80e-01 -0.592790 -0.140664
## 9  -0.085986  9.20e-02  0.05595  0.009383  5.39e-02 -0.044579  0.020607
## 10  0.019424 -1.39e-01  0.02937  0.048313 -8.57e-02  0.063469  0.041425
## 11 -0.041794  5.93e-02  0.01682  0.017675  5.02e-02 -0.041729  0.000124
## 12  0.083772 -8.71e-02 -0.08132  0.062028  1.33e-02 -0.017337  0.001152
## 13 -0.016174  5.58e-02 -0.00402 -0.037389  4.76e-02 -0.040531 -0.000806
## 14  0.012889 -1.08e-02 -0.01261  0.011510 -6.44e-03  0.006628  0.000185
## 15  0.013298 -3.21e-02 -0.00923  0.036277 -7.55e-03  0.004686  0.007446
## 16 -0.055240  1.60e-02  0.06461 -0.018095 -1.96e-02  0.015547  0.017427
## 17 -0.012976 -7.43e-02  0.02724  0.085557 -3.41e-02  0.023557  0.055745
## 18  0.020177 -4.04e-02 -0.01540  0.041233 -4.80e-03  0.001650  0.003283
## 19 -0.006117  7.50e-02 -0.02674 -0.064876  4.30e-01 -0.388577 -0.280276
## 20 -0.396991  3.38e-02  0.42626  0.171642 -6.41e-02  0.045159  0.128491
## 21  0.005367  1.03e-03 -0.00554 -0.006092  3.53e-04  0.000062 -0.003259
## 22 -0.004979  2.20e-03  0.00417  0.002462  6.02e-04 -0.000447  0.002702
## 23 -0.044580 -1.35e-02  0.06070  0.029675 -3.93e-02  0.033253  0.009416
## 24  0.002655 -6.42e-04 -0.00332  0.001392  1.09e-03 -0.000623 -0.002208
## 25 -0.002396  3.29e-03  0.00206 -0.001897  7.55e-05  0.000178 -0.000504
## 26  0.000542  6.23e-03 -0.00344  0.003886  7.05e-03 -0.004756 -0.014832
## 27 -0.226215  2.56e-01  0.19046 -0.205719  1.73e-01 -0.163647 -0.024394
## 28 -0.010767 -2.60e-02  0.02621  0.029593 -4.41e-02  0.038498  0.007675
## 29  0.076025  3.45e-02 -0.08588 -0.099138  2.43e-02 -0.015879 -0.050806
## 30  0.100227 -1.36e-01 -0.06587  0.082619 -6.98e-02  0.057254 -0.019631
## 31 -0.016985 -1.79e-03  0.01848  0.016565  9.07e-04 -0.001399 -0.003898
## 32 -0.094804  8.40e-02  0.11170 -0.153329 -5.47e-02  0.050862  0.033993
## 33  0.007948 -5.33e-03 -0.00771 -0.008171  1.95e-03 -0.002314  0.003165
## 34 -0.162192  1.26e-01  0.14661 -0.080026  3.03e-02 -0.021946  0.015066
## 35  0.026281 -9.29e-03 -0.02361  0.000890  4.39e-03 -0.005676 -0.016491
## 36 -0.148153  1.83e-02  0.17733  0.055152 -5.13e-02  0.038855  0.007434
## 37 -0.014011  2.76e-02  0.00828 -0.019138 -2.13e-03  0.008103 -0.019051
## 38 -0.076911  1.16e-02  0.08773 -0.108666  4.05e-02 -0.040713 -0.031081
## 39  0.154264  1.00e-01 -0.18644 -0.228716  7.13e-02 -0.051050 -0.087183
## 40  0.149326 -3.74e-02 -0.15508 -0.088416  2.63e-02 -0.016217 -0.080278
## 41 -0.053150 -1.16e-02  0.07069  0.064886 -6.90e-02  0.061031  0.033520
## 42 -0.020715 -2.41e-03  0.02426  0.012111  1.82e-02 -0.012434 -0.082738
## 43  0.001229 -1.01e-03 -0.00109  0.001966 -6.87e-04  0.000681 -0.000796
## 44  0.234176 -2.55e-01 -0.16861  0.079405 -1.13e-01  0.088337 -0.040209
## 45  0.012789 -4.80e-03 -0.01240 -0.033815  8.79e-03 -0.008993  0.002374
## 46  0.007269 -5.17e-02  0.01136  0.060428  3.02e-03 -0.017533  0.010194
## 47 -0.036160  3.86e-02  0.03032 -0.010114  1.63e-02 -0.009433 -0.045903
## 48  0.043959 -4.33e-02 -0.03492  0.067689 -1.70e-02  0.024984 -0.140211
## 49 -0.030760  7.04e-02  0.01604 -0.054851  3.61e-02 -0.036524  0.045788
## 50  0.074050 -9.38e-03 -0.03393 -0.264455 -3.14e-02  0.015170  0.004142
## 51  0.240694 -9.76e-02 -0.22038 -0.077309  2.89e-02 -0.035895 -0.106452
## 52 -0.211816  2.32e-01  0.18664 -0.082138  1.51e-02 -0.006000  0.029588
## 53 -0.025510  1.69e-05  0.04717 -0.293141 -3.70e-02  0.029870  0.102598
## 54 -0.006666  6.86e-03  0.00469  0.002790 -2.72e-03  0.001130  0.025725
## 55  0.040851  2.00e-02 -0.05371  0.021824  1.60e-02 -0.009049 -0.037588
## 56  0.008369 -7.34e-05 -0.00864 -0.000625  2.33e-03 -0.000130 -0.025767
## 57 -0.031586 -3.62e-02  0.04232  0.185527 -8.32e-02  0.076358  0.035941
## 58 -0.000275 -9.12e-02  0.05762 -0.301333 -1.53e-01  0.137030  0.126074
## 59  0.011810 -5.98e-03 -0.01419  0.017968 -7.44e-03  0.005330  0.032289
## 60  0.330075  1.96e-01 -0.47260  0.433016  4.33e-01 -0.360470 -0.648280
##       dffit  cov.r   cook.d    hat inf
## 1  -0.84819  0.885 9.83e-02 0.1738    
## 2  -0.49957  0.859 3.45e-02 0.0812    
## 3  -0.38410  1.612 2.14e-02 0.3202   *
## 4  -1.04225  0.460 1.36e-01 0.1162   *
## 5  -0.07951  1.223 9.19e-04 0.0752    
## 6  -0.41665  0.887 2.41e-02 0.0662    
## 7  -1.49210  0.477 2.77e-01 0.2012   *
## 8  -1.29773 47.878 2.45e-01 0.9763   *
## 9  -0.16364  1.128 3.86e-03 0.0518    
## 10  0.17800  1.361 4.60e-03 0.1769    
## 11 -0.13187  1.128 2.51e-03 0.0410    
## 12 -0.15607  1.138 3.52e-03 0.0528    
## 13 -0.08878  1.198 1.15e-03 0.0617    
## 14 -0.02021  1.240 5.95e-05 0.0792    
## 15 -0.04861  1.246 3.44e-04 0.0858    
## 16  0.08473  1.212 1.04e-03 0.0695    
## 17 -0.14266  1.166 2.95e-03 0.0605    
## 18 -0.06015  1.218 5.26e-04 0.0678    
## 19  0.48625  2.698 3.43e-02 0.5860   *
## 20 -0.50008  1.117 3.55e-02 0.1512    
## 21  0.01110  1.221 1.79e-05 0.0644    
## 22 -0.00959  1.181 1.34e-05 0.0328    
## 23  0.09910  1.191 1.43e-03 0.0603    
## 24 -0.00664  1.174 6.42e-06 0.0269    
## 25  0.00489  1.211 3.48e-06 0.0562    
## 26 -0.02149  1.224 6.73e-05 0.0673    
## 27  0.34022  1.292 1.67e-02 0.1772    
## 28  0.08844  1.152 1.13e-03 0.0359    
## 29  0.16231  1.188 3.81e-03 0.0775    
## 30  0.17193  1.179 4.27e-03 0.0761    
## 31 -0.03114  1.264 1.41e-04 0.0973    
## 32  0.29079  0.926 1.19e-02 0.0422    
## 33 -0.01866  1.222 5.07e-05 0.0653    
## 34 -0.19088  1.223 5.27e-03 0.1039    
## 35  0.06060  1.141 5.33e-04 0.0211    
## 36  0.26210  0.997 9.74e-03 0.0472    
## 37 -0.05896  1.202 5.06e-04 0.0565    
## 38 -0.23333  1.039 7.77e-03 0.0478    
## 39  0.34195  1.122 1.67e-02 0.1066    
## 40 -0.20123  1.324 5.87e-03 0.1613    
## 41  0.15048  1.093 3.26e-03 0.0361    
## 42 -0.13687  1.173 2.71e-03 0.0624    
## 43  0.00288  1.238 1.20e-06 0.0769    
## 44  0.35483  0.890 1.76e-02 0.0518    
## 45 -0.04843  1.242 3.41e-04 0.0834    
## 46  0.16191  1.013 3.74e-03 0.0237    
## 47 -0.08327  1.328 1.01e-03 0.1445    
## 48 -0.21481  1.471 6.70e-03 0.2386   *
## 49  0.14427  1.120 3.00e-03 0.0423    
## 50  0.47279  0.514 2.89e-02 0.0329   *
## 51  0.36728  0.704 1.82e-02 0.0327    
## 52  0.33934  0.852 1.60e-02 0.0429    
## 53 -0.38556  1.229 2.13e-02 0.1616    
## 54  0.03997  1.232 2.33e-04 0.0748    
## 55  0.09587  1.250 1.34e-03 0.0966    
## 56 -0.03467  1.533 1.75e-04 0.2548   *
## 57  0.22223  1.185 7.13e-03 0.0947    
## 58 -0.49395  1.228 3.48e-02 0.1901    
## 59  0.05861  1.343 5.00e-04 0.1514    
## 60  1.08611  0.738 1.57e-01 0.1897

# Examine the Cook's distance specifically
cooks_d <- cooks.distance(model_poly)
plot(cooks_d, type="h")
#C VIF already coded.
#D
mymodel<-lm(Mort~Precip+Educ+NonWhite+log(NOX)+log(SO2),data=air)

Using the graph, provide commentary on what trends you see with the predictors and the mortality rate. Do any of them look nonlinear? ANSWER:

Precipitation (Precip): The distribution appears left-skewed, indicating that most cities have lower precipitation values. The correlation with mortality (Mort) is negative, suggesting that higher precipitation might be associated with lower mortality rates, although this is not a strong relationship. The moderate negative correlation with Mort, suggesting that as precipitation increases, mortality rate might decrease.

Education (Educ): The distribution seems to be normally distributed. Education has a significant negative correlation with mortality, which could imply that higher education levels are associated with lower mortality rates.

Percentage of Nonwhite Population (NonWhite): This distribution is right-skewed, suggesting a lower percentage of nonwhite population in most cities. The correlation with mortality is positive, indicating that a higher percentage of nonwhite population might be associated with higher mortality rates.

Nitrogen Oxides (NOX): The distribution of NOX is right-skewed with some potential outliers at higher values. The correlation with mortality is positive, suggesting that higher NOX levels could be associated with higher mortality rates. But, the relationship in non-linear and seems to increase at a decreasing rate.

Sulfur dioxide (SO2) does not show a strong linear relationship with mortality rate, as indicated by a low correlation coefficient.

Regarding nonlinearity, both NOX and SO2 show potential nonlinear relationships with Mort. The scatter plots for these variables with Mort show that the relationship is not exactly a straight line, especially for higher values of NOX and SO2 where the mortality rate does not increase proportionally.

Fit a regression model using all five predictors. Provide the residual diagnostics and comment on whether the assumptions look reasonable.

ANSWER:

Code is included above.

Results: Residuals vs Fitted Plot: There appears to be a slight curve in the residuals, suggesting a potential non-linearity in the relationship between predictors and the response variable. The residuals don’t seem to be randomly distributed around the horizontal line, which could imply that the model doesn’t capture all the patterns in the data.

Normal Q-Q Plot: The points in the Q-Q plot largely follow the line, but there are some deviations at the ends. This suggests that the residuals are approximately normally distributed, although there may be some issues with extreme values.

Scale-Location Plot: The spread of residuals doesn’t appear to be constant (there’s a slight increase in spread for higher fitted values), indicating potential heteroscedasticity.

Residuals vs Leverage Plot: There are a few points with higher leverage, but they do not appear to have large residuals. No points are beyond the Cook’s distance lines, which suggests that there aren’t any particularly influential outliers that are having a disproportionate impact on the model.

Based on these observations, there are indications that some of the assumptions required for linear regression may not be fully met. In particular, the potential non-linearity and heteroscedasticity could be addressed by transforming variables or using a different model specification.

B2 To address issues we could modify the model with the following steps: (see code “#B2”) 1. Addressing Non-linearity: We can try adding polynomial terms for predictors where we observed non-linear relationships in the plots.

Addressing Heteroscedasticity: If the variance of the residuals increases with the fitted values, a transformation of the response variable may help. Common transformations include the logarithmic, square root, or inverse transformation.
Checking for and Addressing Multicollinearity: We should check for multicollinearity, as it can affect the estimates of the coefficients. We can use the Variance Inflation Factor (VIF) to identify if multicollinearity is present.

This model shows the coefficients from the polynomial regression model where we’ve included both the NOX variable and its squared term (NOX^2) to account for the potential non-linear relationship. To interpret the model coefficients:

(Intercept): This represents the estimated log mortality rate when all other predictors are zero. Precip: The coefficient for precipitation suggests a very small positive association with log mortality rate.

Educ: The negative coefficient for education level suggests that higher education is associated with a lower log mortality rate.

NonWhite: The positive coefficient here indicates a small positive association between the percentage of nonwhite population and log mortality rate.

NOX: The positive coefficient for NOX indicates that as the level of nitrogen oxides increases, the log mortality rate slightly increases. I(NOX^2): The negative coefficient for the squared term of NOX suggests that the relationship between NOX and log mortality rate could be non-linear, with the effect of NOX diminishing at higher levels. The NOX squared term (I(NOX^2)) indicates the change in the effect of NOX on the log mortality rate for each one-unit increase in NOX.

SO2: The coefficient for sulfur dioxide shows a slight positive association with log mortality rate.

The intercept and coefficients for Precip, Educ, NonWhite, NOX, and SO2 represent the expected change in the log-transformed mortality rate for a one-unit change in the predictor, holding all other variables constant.

When checking specific points for cook’s d the plots suggest there are a few observations with a relatively high Cook’s distance, which might be influential.

Use the vif function within the car package to examine the VIFs. Are there any concerns for this particular study’s goals and research question? What if there were really high VIFs among Education, Nonwhite, and Precip? Would that change your answer?

ANSWER: Precip Educ NonWhite NOX SO2 2.064568 1.509714 1.329690 1.675305 1.418590 There don’t seem to be concerns regarding multicollinearity for this particular study, as all VIFs are well below the threshold of 5. These VIF values suggest that each predictor in our model has a relatively low level of multicollinearity with the other predictors. This is a positive indication for the study, as it suggests that the coefficients for each predictor are estimated without a large amount of distortion due to correlations with other predictors. However, if there were really high VIFs among Education, Nonwhite, and Precip, it would indicate that these predictors are highly correlated with each other. This could potentially distort the individual effect of each predictor. If high VIFs were observed, we would need to reconsider the interpretation of our regression coefficients, as they may not accurately reflect the independent effect. Things we could do: 1. Remove one of the correlated predictors to reduce multicollinearity. 2. Combine the correlated variables into a single predictor through techniques like Principal Component Analysis (PCA) which reduces dimensionality while retaining most of the information. 3. Regularization methods such as Ridge or Lasso regression can be applied, which are designed to handle multicollinearity by penalizing large coefficients.

Assuming the assumptions of this model are reasonable, what can generally be said about the association of the pollution variables with mortality? Directly interpret the regression coefficient

ANSWER:

In examining the data and the resulting model, I have observed that the coefficient for nitrogen oxides (NOX) is positive. This indicates that there is a positive association between NOX levels and the log mortality rate in the initial range of NOX values. However, this relationship appears to be complex rather than straightforward.

Upon including a squared term for NOX in the model, the negative coefficient of this term suggests a non-linear effect. Specifically, while lower levels of NOX are associated with more significant increases in mortality rates, these increases are less pronounced at higher levels of NOX. This pattern of diminishing returns suggests that the impact of NOX on mortality rates grows but at a decreasing rate as NOX values rise.

For sulfur dioxide (SO2), the positive coefficient points to a direct and linear relationship with the log mortality rate. As SO2 levels rise, the log mortality rate is expected to increase correspondingly, with each unit increase in SO2 predicted to raise the log mortality rate by the coefficient’s value.

In sum, my analysis suggests that both NOX and SO2 contribute to higher mortality rates, with NOX exhibiting a non-linear relationship and SO2 showing a linear one. It is crucial to note that these findings are contingent on the model’s assumptions being valid and that the model accurately captures the data’s underlying relationships. If the model’s assumptions are violated, or if the model does not fully account for the data’s complexity, then the interpretations I have made may not hold true.

Additionally, these interpretations, while suggesting associations within the model, cannot establish causality on their own. It is possible that other unmodeled factors may influence both pollution levels and mortality rates, necessitating a broader investigation to include other potential influences.

Lastly, when drawing conclusions from this analysis, I must consider the broader context and any limitations of the data. External variables such as the quality of healthcare, lifestyle choices, and additional environmental factors mIt is essential to approach these results with a critical understanding that they are part of a larger web of causes and effects, and that my conclusions are based on the current model and available data. Further research could be required to deepen the understanding of these relationships and to explore the potential for other variables to affect the findings.ight also have significant impacts of martality rates and could be considered and/pr taken into account with a more extensive study. It is essential to approach these results with a critical understanding that they are part of a larger web of causes and effects, and that my conclusions are based on the current model and available data. Further research could be required to deepen the understanding of these relationships and to explore the potential for other variables to affect the findings.

The interested student might wonder whether a potentially better fit to the data set may be modeling NOX and SO2 on the log scale. The graphs below highlight the basic differences. Update your previous model and comment on whether the general conclusion would change or not. Directly interpret the regression coefficient on the $SO_2$ variable now.

ANSWER: par(mfrow=c(2,2)) plot(Mort~SO2,data=air) plot(Mort~log(SO2),data=air) plot(Mort~NOX,data=air) plot(Mort~log(NOX),data=air) par(mfrow=c(1,1)) model_log_NOX_SO2 <- lm(log_Mort ~ Precip + Educ + NonWhite + log(NOX) + log(SO2), data = air) summary(model_log_NOX_SO2) plot(model_log_NOX_SO2) log(SO2) Exploring whether using the logarithms of NOX and SO2 might provide a better fit to the model, as opposed to their raw values. This is a common practice when dealing with variables that do not have a linear relationship with the response variable. By using a log transformation, we can potentially linearize the relationship and stabilize the variance of the residuals, which can lead to a better model fit. Key takeaways from results: Model Fit:

The multiple R-squared of 0.6877 indicates that 68.77% of the variation in log-transformed mortality can be explained by the predictors in the model. The adjusted R-squared of 0.6588 accounts for the number of predictors and suggests a reasonably good fit. The F-statistic is significant (p-value 1.49e-12), indicating that the model as a whole is statistically significant. Regression Coefficients:

log(SO2) Coefficient: 0.0141281 Interpretation: A 1% increase in SO2 is associated with an estimated 1.41% increase in mortality, holding other variables constant. This suggests a positive relationship between SO2 levels and mortality. Statistical Significance: The p-value of 0.01546 indicates that this association is statistically significant at the 0.05 level. Other Predictors:

Precip and NonWhite also have statistically significant positive associations with mortality. Educ has a negative association with mortality, but it’s only marginally significant. log(NOX) doesn’t show a significant association in this model.

Exercise 2: Predicting Fish Weights

This will be a quick exercise in writing a contrast. The fish data set includes weight, height, width, and 3 length measurements for seven different species of fish.

A quick exploration between the relationships between log(weight) and Length1 for the different species suggests that they potentially have different slopes.

fish<-read.csv("fish.csv",header=T,stringsAsFactors = T)
summary(fish)

##   Species       Weight           Length1         Length2         Length3     
##  Perch:56   Min.   :   5.90   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
##  Pike :17   1st Qu.:  90.25   1st Qu.:18.20   1st Qu.:19.85   1st Qu.:21.45  
##  Roach:19   Median : 175.00   Median :22.00   Median :24.00   Median :26.35  
##  Smelt:14   Mean   : 347.37   Mean   :25.61   Mean   :27.61   Mean   :29.65  
##             3rd Qu.: 513.00   3rd Qu.:34.58   3rd Qu.:37.00   3rd Qu.:39.38  
##             Max.   :1650.00   Max.   :59.00   Max.   :63.40   Max.   :68.00  
##      Height           Width      
##  Min.   : 1.728   Min.   :1.048  
##  1st Qu.: 5.534   1st Qu.:3.205  
##  Median : 6.495   Median :3.823  
##  Mean   : 6.884   Mean   :4.158  
##  3rd Qu.: 8.507   3rd Qu.:5.135  
##  Max.   :12.800   Max.   :8.142

library(ggplot2)
ggplot(fish,aes(x=Length1,y=log(Weight),color=Species))+geom_point()+geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

A. Fit a model so that individual intercepts and slopes are fitted for each species. Use the contrast framework to estimate the slope for the Smelt species. Provide a formal interpretation of this slope and include a 95 percent confidence interval. You do not need to provide a formal interpretation of the interval, just the slope value (point estimate). ANSWER: The coefficient for the interaction term Length1:SpeciesSmelt is 0.139809, with a 95% confidence interval ranging from approximately 0.0424 to 0.2372. This coefficient suggests that for the Smelt species, for every one-unit increase in Length1, the log of the Weight increases by approximately 0.139809 units, holding all other variables constant. The positive sign indicates that there is a positive association between Length1 and the log of the Weight for the Smelt species.

Fit the model with individual intercepts and slopes for each species

mymodel <- lm(log(Weight) ~ Length1 * Species, data = fish)

Get the summary of the model

summary(mymodel) Residuals: Min 1Q Median 3Q Max -1.42900 -0.08097 0.03575 0.10868 0.43677

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.276152 0.091656 24.834 < 2e-16 Length1 0.123707 0.003382 36.574 < 2e-16 SpeciesPike 1.044957 0.273696 3.818 0.000236 SpeciesRoach -0.339018 0.314780 -1.077 0.284125
SpeciesSmelt -2.883465 0.561749 -5.133 1.45e-06 Length1:SpeciesPike -0.051934 0.006841 -7.592 1.87e-11 * Length1:SpeciesRoach 0.021607 0.014724 1.468 0.145441
Length1:SpeciesSmelt 0.139809 0.049085 2.848 0.005356 — Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2148 on 98 degrees of freedom Multiple R-squared: 0.9794, Adjusted R-squared: 0.9779 F-statistic: 665.2 on 7 and 98 DF, p-value: < 2.2e-16

Extract 95% confidence interval for the slope of Smelt species

confint_smelt <- confint(mymodel, “Length1:SpeciesSmelt”, level=0.95) confint_smelt 2.5 % 97.5 % Length1:SpeciesSmelt 0.04240155 0.237216

B. (Stretch question) Suppose I wanted to formally test if the slope for the pike species is different from the slope of the smelt. What contrast could we construct to formally test this? Hint: Determine the beta’s that are involved in each one of the individuals slopes and subtract one from the other.

mymodel<-lm(log(Weight)~Length1*Species,data=fish)
summary(mymodel)

## 
## Call:
## lm(formula = log(Weight) ~ Length1 * Species, data = fish)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42900 -0.08097  0.03575  0.10868  0.43677 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.276152   0.091656  24.834  < 2e-16 ***
## Length1               0.123707   0.003382  36.574  < 2e-16 ***
## SpeciesPike           1.044957   0.273696   3.818 0.000236 ***
## SpeciesRoach         -0.339018   0.314780  -1.077 0.284125    
## SpeciesSmelt         -2.883465   0.561749  -5.133 1.45e-06 ***
## Length1:SpeciesPike  -0.051934   0.006841  -7.592 1.87e-11 ***
## Length1:SpeciesRoach  0.021607   0.014724   1.468 0.145441    
## Length1:SpeciesSmelt  0.139809   0.049085   2.848 0.005356 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2148 on 98 degrees of freedom
## Multiple R-squared:  0.9794, Adjusted R-squared:  0.9779 
## F-statistic: 665.2 on 7 and 98 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(mymodel)

ANSWER: The hypothesis test conducted using linearHypothesis(mymodel, “Length1:SpeciesPike - Length1:SpeciesSmelt = 0”) checks whether there is a statistically significant difference between the slopes for Pike and Smelt species. The result (p-value < 0.0001846) indicates that there is a significant difference between the slopes for these two species. Specifically, the slope for Pike is significantly different from the slope for Smelt with a high degree of confidence.

The F-statistic value of 15.11 further confirms that the difference in slopes is statistically significant. This implies that the relationship between Length1 and the log of the Weight is different for Pike compared to Smelt, which could be due to biological differences between the species affecting how Length1 relates to Weight.

Retrieve model coefficients

coefficients <- summary(mymodel)$coefficients

Contrast for the slopes

pike_slope <- coefficients[“Length1:SpeciesPike”, ] smelt_slope <- coefficients[“Length1:SpeciesSmelt”, ]

contrast is the difference in slopes

slope_contrast <- pike_slope - smelt_slope

perform a formal test

library(car)

Test if the slope for Pike is different from Smelt

linearHypothesis(mymodel, “Length1:SpeciesPike - Length1:SpeciesSmelt = 0”)

Exercise 3

The following code simulates a data set that follows a trend that is a 4th degree polynomial. Below is a plot of the data along with residual plots from regression model.

set.seed(1234)
x<-runif(100,-6,6.5)
trend<- 2*x^4-2*x^3-50*x^2+100*x+2
y<-trend+rnorm(100,0,100)
mymodel<-lm(y~poly(x,2))

par(mfrow=c(1,3))
plot(x,y)
plot(mymodel$fitted.values,mymodel$residuals,ylab="Residuals",xlab="Fitted")
plot(x,mymodel$residuals,ylab="Resdiduals")

par(mfrow=c(1,1))

A. Using the graphics, explain to a layman why the current model being fitted is not a good fit to the data set. ANSWER: Explaining the Poor Fit:

Imagine a flexible ruler trying to follow the data’s path: The current model (a quadratic curve) is like a ruler that can bend a bit, but not enough to capture the intricate twists and turns in the data. Data points are scattered far from the model’s line: The first plot shows many points not closely hugging the curve, suggesting a mismatch. Residual plots reveal patterns: The second plot shows residuals (distances between points and the curve) forming a distinct U-shape, indicating a systematic error in fitting. The third plot shows residuals clustering differently for different x values, suggesting the model’s shape doesn’t match the data’s trend.

B. To help the laymen better understand, repeat the graphics using different polynomial fits to illustrate what a better model fit would look. ANSWER:

1. Fit models with higher-degree polynomials:

mymodel3 <- lm(y ~ poly(x, 3)) # Cubic model mymodel4 <- lm(y ~ poly(x, 4)) # Quartic model (matching the true trend)

2. Generate the same graphics for each model:

par(mfrow=c(1,3)) plot(x, y) lines(x, predict(mymodel3), col = “red”) # Add cubic model’s line lines(x, predict(mymodel4), col = “blue”) # Add quartic model’s line plot(mymodel3$fitted.values, mymodel3$residuals, ylab=“Residuals”, xlab=“Fitted”) plot(mymodel4$fitted.values, mymodel4$residuals, ylab=“Residuals”, xlab=“Fitted”) plot(x, mymodel3$residuals, ylab="Residuals") plot(x, mymodel4$residuals, ylab=“Residuals”) par(mfrow=c(1,1))

Explain the improvements: ANSWER: Code: set.seed(1234) x <- runif(100, -6, 6.5) trend <- 2x^4 - 2x^3 - 50x^2 + 100x + 2 y <- trend + rnorm(100, 0, 100)

Fit models with polynomial of degree 2, 3, and 4

model2 <- lm(y ~ poly(x, 2)) model3 <- lm(y ~ poly(x, 3)) model4 <- lm(y ~ poly(x, 4))

Plotting the data and fits for a better model illustration

par(mfrow=c(2, 2))

Original data with degree 2 polynomial fit

plot(x, y, main = “Original Data with Degree 2 Fit”) points(x, predict(model2), col = “red”, pch = 20)

Original data with degree 3 polynomial fit

plot(x, y, main = “Original Data with Degree 3 Fit”) points(x, predict(model3), col = “blue”, pch = 20)

Original data with degree 4 polynomial fit

plot(x, y, main = “Original Data with Degree 4 Fit”) points(x, predict(model4), col = “green”, pch = 20)

Residuals for the degree 4 polynomial fit

plot(predict(model4), residuals(model4), ylab = “Residuals”, xlab = “Fitted”, main = “Residuals for Degree 4 Fit”)

par(mfrow=c(1, 1))

The quartic model’s line should closely follow the data’s path in the first plot. Its residual plots should show more random scatter without clear patterns, indicating a better fit. The cubic model might improve but still show some residual patterns, suggesting it’s not quite flexible enough.

Degree 2 Fit: The first plot shows the data points with a second-degree polynomial fit (a parabola). The red curve doesn’t capture the pattern of the data well—it only turns once, so it misses the more complex ups and downs in the data. This is like trying to fit a small hat on a head with lots of hair; it won’t cover everything.

Degree 3 Fit: The second plot with a third-degree polynomial fit (a cubic curve) shows a blue curve that turns twice, getting closer to the true pattern but still not quite there. It’s like a larger hat on the same head—it covers more hair but still not all.

Degree 4 Fit: The third plot shows the fourth-degree polynomial fit. This green curve turns three times, closely following the true pattern of the data. This fit is like a perfectly shaped hat that covers all the hair neatly.

Residuals for Degree 4 Fit: The last plot shows the residuals (the differences between the actual data points and the fitted values) for the fourth-degree polynomial fit. The residuals are scattered randomly around the horizontal line at 0, which suggests that the fourth-degree model captures the pattern of the data well. There’s no obvious pattern in the residuals, which is what we want. It means that the “misses” are random and the model isn’t systematically off in any particular direction.

In layman’s terms, when we have a complex pattern, we need a model that’s flexible enough to follow all the twists and turns. The fourth-degree polynomial is like a flexible hat that can mold itself to the shape of the head, covering everything just right. The residuals plot confirms this, showing that there aren’t any systematic areas where the model is consistently too high or too low—it fits just right.

Key Points for Layman: Flexibility matters: Higher-degree polynomials can bend and twist more to capture complex patterns. Residuals reveal fit: Randomly scattered residuals indicate a good fit; patterns suggest a mismatch. Visualizing is key: Seeing how models match the data helps us choose the best fit.

Unit1 HW

MSDS 6372: Jessica McPhaul - Link: https://rpubs.com/Texaschikkita/1138180