Please hand this result in on Canvas no later than 11:59pm on Wednesday, February 28! Do not work in groups!

  1. Consider the `cpus` data from the R package `MASS`. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable `estperf`). Do not use published performance as a predictor of performance in this problem.

    library(MASS)
    ## Warning: package 'MASS' was built under R version 4.3.2
    library(ggplot2)
    ## Warning: package 'ggplot2' was built under R version 4.3.2
    library(reshape2)
    ## Warning: package 'reshape2' was built under R version 4.3.2
    library(leaps)
    ## Warning: package 'leaps' was built under R version 4.3.2
    library(car)
    ## Warning: package 'car' was built under R version 4.3.2
    ## Loading required package: carData
    library(caret)
    ## Warning: package 'caret' was built under R version 4.3.2
    ## Loading required package: lattice
    library(TeachingDemos)
    ## Warning: package 'TeachingDemos' was built under R version 4.3.2
    data(cpus)
    str(cpus)
    ## 'data.frame':    209 obs. of  9 variables:
    ##  $ name   : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
    ##  $ syct   : int  125 29 29 29 29 26 23 23 23 23 ...
    ##  $ mmin   : int  256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
    ##  $ mmax   : int  6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
    ##  $ cach   : int  256 32 32 32 32 64 64 64 64 128 ...
    ##  $ chmin  : int  16 8 8 8 8 8 16 16 16 32 ...
    ##  $ chmax  : int  128 32 32 32 16 32 32 32 32 64 ...
    ##  $ perf   : int  198 269 220 172 132 318 367 489 636 1144 ...
    ##  $ estperf: int  199 253 253 253 132 290 381 381 749 1238 ...
    model <- lm(estperf ~ cach + chmax, data = cpus)
    summary(model)
    ## 
    ## Call:
    ## lm(formula = estperf ~ cach + chmax, data = cpus)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -551.72  -29.97   -2.69   11.19  855.31 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  14.6551     9.4153   1.557    0.121    
    ## cach          1.7982     0.2105   8.542 2.94e-15 ***
    ## chmax         2.1540     0.3290   6.547 4.59e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 107.7 on 206 degrees of freedom
    ## Multiple R-squared:  0.5205, Adjusted R-squared:  0.5158 
    ## F-statistic: 111.8 on 2 and 206 DF,  p-value: < 2.2e-16
plot(model, which = 1)  # Residuals vs Fitted

plot(model, which = 2)  # Normal Q-Q

plot(model, which = 3)  # Scale-Location

plot(model, which = 5)  # Residuals vs Leverage

  1. Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.
numeric_vars <- sapply(cpus, is.numeric)
cor_matrix <- cor(cpus[, numeric_vars])
cor_melted <- melt(cor_matrix)
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  theme_minimal() +
  scale_fill_gradient(low = "blue", high = "red", na.value = "grey50") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 0, hjust = 1),
        legend.position = "none") +
  labs(title = "Correlation Heatmap")

plot(cpus$estperf, cpus$cach, main = "Cache Size vs. Estimated Performance", xlab = "estperf", ylab = "cach")
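
Beyond a single scatterplot, pairwise scatterplots of all numeric variables give a fuller picture of the relationships (a sketch; the memory and performance variables are strongly right-skewed, so log scales may also be worth trying):

pairs(cpus[, numeric_vars], main = "Pairwise Scatterplots of cpus Variables")  # all numeric variables at once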

  1. Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider `perf` in this modeling approach. Explain the process used to arrive at your final model.
final_model <- lm(estperf ~ . - name, data = cpus)
step_model <- step(final_model)
## Start:  AIC=1452.58
## estperf ~ (name + syct + mmin + mmax + cach + chmin + chmax + 
##     perf) - name
## 
##         Df Sum of Sq    RSS    AIC
## - chmin  1         1 201983 1450.6
## <none>               201981 1452.6
## - cach   1      2593 204574 1453.2
## - chmax  1      7252 209233 1458.0
## - syct   1     16104 218086 1466.6
## - mmin   1     24068 226049 1474.1
## - mmax   1     72509 274491 1514.7
## - perf   1    242027 444009 1615.2
## 
## Step:  AIC=1450.58
## estperf ~ syct + mmin + mmax + cach + chmax + perf
## 
##         Df Sum of Sq    RSS    AIC
## <none>               201983 1450.6
## - cach   1      2743 204726 1451.4
## - chmax  1      7920 209902 1456.6
## - syct   1     16144 218127 1464.7
## - mmin   1     24795 226778 1472.8
## - mmax   1     72679 274661 1512.8
## - perf   1    242172 444155 1613.3
summary(step_model)
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf, 
##     data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.541   -9.553    2.867   15.213  182.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.424e+01  4.713e+00  -7.264 8.01e-12 ***
## syct         3.778e-02  9.403e-03   4.018 8.27e-05 ***
## mmin         5.476e-03  1.100e-03   4.980 1.36e-06 ***
## mmax         3.375e-03  3.959e-04   8.526 3.55e-15 ***
## cach         1.238e-01  7.473e-02   1.656  0.09920 .  
## chmax        3.443e-01  1.223e-01   2.814  0.00537 ** 
## perf         5.770e-01  3.708e-02  15.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared:  0.9595, Adjusted R-squared:  0.9582 
## F-statistic: 796.7 on 6 and 202 DF,  p-value: < 2.2e-16

The step() function iteratively removes (or re-adds) predictors based on the AIC criterion until no single change improves the AIC. Starting from the full model, it drops `chmin` and then stops, so the final model retains the predictors that contribute most to explaining the variability in the response variable (estimated performance).
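
Backward elimination via step() is one common approach; since the leaps package is already loaded, an all-subsets search could be used instead (a sketch, output not shown):

subsets <- regsubsets(estperf ~ . - name, data = cpus)  # best subset of each size
summary(subsets)$adjr2  # adjusted R-squared of the best model at each size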

  1. Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
residuals <- residuals(step_model)
plot(step_model$fitted.values, residuals, main = "Residual Plot", 
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)

In the residual plot, the residuals should be randomly scattered around the horizontal line at y = 0. A curved pattern suggests a violation of the linearity assumption, while a funnel shape points instead to non-constant variance.

The residuals should be approximately normally distributed. You can check this using a Q-Q plot or a histogram of the residuals.

The spread of residuals should be roughly constant across all levels of the predicted values. If there’s a systematic change in the spread, it indicates heteroscedasticity.
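
If the residual plots show a funnel shape, a standard remedy is a log transformation of the response (a sketch, not run here; note that the interpretation question below assumes such a transformation):

log_model <- lm(log(estperf) ~ syct + mmin + mmax + cach + chmax + perf, data = cpus)
plot(log_model, which = 1)  # recheck the residuals-vs-fitted plot after transforming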

updated_model <- lm(estperf ~ . - name, data = cpus)  # same specification as before
updated_step_model <- step(updated_model)  # the selection is re-run unchanged, so the trace and summary below repeat the earlier output
## Start:  AIC=1452.58
## estperf ~ (name + syct + mmin + mmax + cach + chmin + chmax + 
##     perf) - name
## 
##         Df Sum of Sq    RSS    AIC
## - chmin  1         1 201983 1450.6
## <none>               201981 1452.6
## - cach   1      2593 204574 1453.2
## - chmax  1      7252 209233 1458.0
## - syct   1     16104 218086 1466.6
## - mmin   1     24068 226049 1474.1
## - mmax   1     72509 274491 1514.7
## - perf   1    242027 444009 1615.2
## 
## Step:  AIC=1450.58
## estperf ~ syct + mmin + mmax + cach + chmax + perf
## 
##         Df Sum of Sq    RSS    AIC
## <none>               201983 1450.6
## - cach   1      2743 204726 1451.4
## - chmax  1      7920 209902 1456.6
## - syct   1     16144 218127 1464.7
## - mmin   1     24795 226778 1472.8
## - mmax   1     72679 274661 1512.8
## - perf   1    242172 444155 1613.3
summary(updated_step_model)
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf, 
##     data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.541   -9.553    2.867   15.213  182.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.424e+01  4.713e+00  -7.264 8.01e-12 ***
## syct         3.778e-02  9.403e-03   4.018 8.27e-05 ***
## mmin         5.476e-03  1.100e-03   4.980 1.36e-06 ***
## mmax         3.375e-03  3.959e-04   8.526 3.55e-15 ***
## cach         1.238e-01  7.473e-02   1.656  0.09920 .  
## chmax        3.443e-01  1.223e-01   2.814  0.00537 ** 
## perf         5.770e-01  3.708e-02  15.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared:  0.9595, Adjusted R-squared:  0.9582 
## F-statistic: 796.7 on 6 and 202 DF,  p-value: < 2.2e-16
  1. How well does the final model fit the data? Comment on some model fit criteria from the model built in c).

The multiple R-squared of 0.9595 indicates that approximately 96% of the variability in estimated performance is explained by the included predictors; the adjusted R-squared, which penalizes for the number of predictors, is nearly as high at 0.9582, so the fit is not an artifact of model size.

The residual standard error is an estimate of the standard deviation of the residuals. It gives a measure of how much the actual responses deviate from the predicted responses. In this model, the residual standard error is 31.62.

The F-statistic tests the overall significance of the model. A high F-statistic with a low p-value suggests that at least one predictor variable is significantly related to the response variable. In this case, the F-statistic is 796.7 with an extremely low p-value, indicating the model is statistically significant.

The AIC is used for model comparison, with smaller values indicating a better-fitting model. In the final model, the AIC is 1450.6, lower than the starting value of 1452.58, indicating a modest improvement.
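
The AIC values printed by step() come from extractAIC(), which drops constant terms from the log-likelihood; base R's AIC() is on a different additive scale, so the two should not be mixed when comparing models:

extractAIC(step_model)  # equivalent degrees of freedom and the AIC on step()'s scale
AIC(step_model)         # full log-likelihood AIC; differs by an additive constant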

Overall, the final model appears to fit the data well based on the criteria mentioned above. The high adjusted R-squared, low residual standard error, and significant F-statistic suggest that the included predictors collectively explain a significant amount of variation in the estimated performance. Individual coefficients provide insights into the direction and strength of the relationships between each predictor and the response variable. However, it’s important to consider the context of the data and the assumptions of linear regression in the interpretation of the model.

  1. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)

Because the final model in c) was fit on the raw (untransformed) `estperf`, the coefficients below are interpreted directly in units of estimated performance. The exp() back-transformation described in the prompt applies only when the response is modeled as log(estperf), since the inverse of the log function is the exponential.

  1. syct (System Cycle Time, in nanoseconds):

    • Interpretation: Each additional nanosecond of cycle time is associated with an increase of about 0.038 units of estimated performance, holding other variables constant. The positive sign is counterintuitive (a slower clock should not raise performance) and reflects adjustment for the other predictors.
  2. mmin (Minimum Main Memory, in kB):

    • Interpretation: Each additional kB of minimum main memory is associated with an increase of about 0.0055 units of estimated performance; in more meaningful units, each additional MB (1024 kB) corresponds to an increase of about 5.6 units, keeping other variables constant.
  3. mmax (Maximum Main Memory, in kB):

    • Interpretation: Each additional kB of maximum main memory is associated with an increase of about 0.0034 units of estimated performance, or roughly 3.5 units per additional MB, holding other variables constant.
  4. cach (Cache Memory, in kB):

    • Interpretation: Each additional kB of cache memory is associated with an increase of about 0.124 units of estimated performance, assuming other variables are constant.
  5. chmax (Maximum Channels):

    • Interpretation: Each additional channel is associated with an increase of about 0.344 units of estimated performance, holding other variables constant.
  6. perf (Published Performance, relative to an IBM 370/158-3):

    • Interpretation: Each one-unit increase in published performance is associated with an increase of about 0.577 units of estimated performance, assuming other variables remain constant.
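
Since the memory coefficients are per kB, a quick conversion puts them on the MB scale used above (a sketch, assuming 1 MB = 1024 kB):

coef(step_model)[c("mmin", "mmax")] * 1024  # effect on estimated performance per additional MB
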
  1. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.

    vif_values <- car::vif(updated_step_model)
    vif_values
    ##     syct     mmin     mmax     cach    chmax     perf 
    ## 1.245769 3.783895 4.482589 1.917387 2.103658 7.396922

In general, VIF values below 5 are often considered acceptable, while values above 10 indicate a cause for concern regarding multicollinearity.

Based on the VIF values above, only `perf` (VIF ≈ 7.4) exceeds the threshold of 5; `mmax` (≈ 4.5) and `mmin` (≈ 3.8) approach it but stay below, and none exceed 10. The multicollinearity is therefore moderate rather than severe, and since it is a matter of degree, context and specific goals should guide the decision to act.

predictor_to_remove <- names(vif_values)[which.max(vif_values)]  # "perf" has the largest VIF
updated_model_no_high_vif <- update(updated_step_model, . ~ . - predictor_to_remove)  # caution: "predictor_to_remove" is treated as a literal term name here, so no variable is dropped
summary(updated_model_no_high_vif)
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf, 
##     data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.541   -9.553    2.867   15.213  182.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.424e+01  4.713e+00  -7.264 8.01e-12 ***
## syct         3.778e-02  9.403e-03   4.018 8.27e-05 ***
## mmin         5.476e-03  1.100e-03   4.980 1.36e-06 ***
## mmax         3.375e-03  3.959e-04   8.526 3.55e-15 ***
## cach         1.238e-01  7.473e-02   1.656  0.09920 .  
## chmax        3.443e-01  1.223e-01   2.814  0.00537 ** 
## perf         5.770e-01  3.708e-02  15.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared:  0.9595, Adjusted R-squared:  0.9582 
## F-statistic: 796.7 on 6 and 202 DF,  p-value: < 2.2e-16

As written, the `update()` call above does not actually exclude anything: the formula `. ~ . - predictor_to_remove` looks for a term literally named `predictor_to_remove`, so the refit model is identical to `updated_step_model`, and `perf` (the predictor with the highest VIF) still appears in the summary. To remove `perf`, the reduced formula must be constructed programmatically, and the impact of the change on the model's interpretability and validity then assessed.
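
A minimal sketch of one way to actually drop the flagged predictor, building the formula as text so that the variable is expanded (assuming we want `perf` removed):

drop_formula <- as.formula(paste(". ~ . -", predictor_to_remove))  # ". ~ . - perf"
model_without_perf <- update(updated_step_model, drop_formula)
summary(model_without_perf)  # refit without the highest-VIF predictor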

residuals_updated <- residuals(updated_model_no_high_vif)
plot(updated_model_no_high_vif$fitted.values, residuals_updated, 
     main = "Residual Plot (Updated Model)",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)

vif_updated <- vif(updated_model_no_high_vif)
vif_updated  # identical to before, since no predictor was actually dropped
##     syct     mmin     mmax     cach    chmax     perf 
## 1.245769 3.783895 4.482589 1.917387 2.103658 7.396922
summary(updated_model_no_high_vif)
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf, 
##     data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.541   -9.553    2.867   15.213  182.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.424e+01  4.713e+00  -7.264 8.01e-12 ***
## syct         3.778e-02  9.403e-03   4.018 8.27e-05 ***
## mmin         5.476e-03  1.100e-03   4.980 1.36e-06 ***
## mmax         3.375e-03  3.959e-04   8.526 3.55e-15 ***
## cach         1.238e-01  7.473e-02   1.656  0.09920 .  
## chmax        3.443e-01  1.223e-01   2.814  0.00537 ** 
## perf         5.770e-01  3.708e-02  15.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared:  0.9595, Adjusted R-squared:  0.9582 
## F-statistic: 796.7 on 6 and 202 DF,  p-value: < 2.2e-16

The model’s adjusted R-squared of 0.9582 indicates that the predictors explain most of the variability in estimated performance, and the highly significant F-statistic supports the overall model.

The residual plot does not exhibit obvious patterns, suggesting the linearity and homoscedasticity assumptions are reasonably met. The VIF values are unchanged from before, since no predictor was actually removed: all are below 5 except `perf`, at about 7.4, which is moderate rather than severe (below 10). The model fits well and is interpretable, but refitting without `perf`, as the problem statement requires, would resolve both the remaining multicollinearity and the logical issue of predicting estimated performance from published performance.

  1. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.

The code below generates a Cook’s distance plot and identifies influential observations using the common threshold of 4/n, where n is the number of observations.

model_diagnostics <- influence.measures(updated_model_no_high_vif)
cooksd <- cooks.distance(updated_model_no_high_vif)
leverage_values <- hatvalues(updated_model_no_high_vif)
standardized_residuals <- rstandard(updated_model_no_high_vif)
plot(cooksd, pch = 20, main = "Cook's Distance Plot", ylab = "Cook's Distance", xlab = "Observation Index")
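
A horizontal reference line at the 4/n cutoff makes the flagged points easier to see (a small addition, drawn on the plot above):

abline(h = 4/length(cooksd), col = "red", lty = 2)  # rule-of-thumb 4/n cutoff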

influential_obs <- which(cooksd > 4/length(cooksd))
influential_obs
##   1   8   9  10  31  32  66  83  96  97  98 153 154 157 169 199 200 
##   1   8   9  10  31  32  66  83  96  97  98 153 154 157 169 199 200

The identified influential observations (observations 1, 8, 9, 10, 31, 32, 66, 83, 96, 97, 98, 153, 154, 157, 169, 199, and 200) have a disproportionately large influence on the regression model under the 4/n rule of thumb, meaning they may substantially alter the estimated coefficients or model fit. These data points should be examined further, checking for data-entry issues or genuinely unusual machines and assessing how the model's conclusions change without them, before trusting the results.

plot(leverage_values, pch = 20, main = "Leverage Plot", ylab = "Leverage", xlab = "Observation Index")

high_leverage_obs <- which(leverage_values > 2 * (length(updated_model_no_high_vif$coefficients) + 1) / length(leverage_values))
high_leverage_obs
##   1   7   9  10  20  31  32  83 123 124 138 157 169 197 199 200 
##   1   7   9  10  20  31  32  83 123 124 138 157 169 197 199 200
plot(standardized_residuals, pch = 20, main = "Standardized Residuals Plot", ylab = "Standardized Residuals", xlab = "Observation Index")

The identified high-leverage observations (indexed 1, 7, 9, 10, 20, 31, 32, 83, 123, 124, 138, 157, 169, 197, 199, and 200) exhibit higher influence on the model due to extreme values in predictor variables. These observations might have an outsized impact on regression coefficients. It’s important to investigate whether these high-leverage points are influential due to genuine characteristics of the data or if they represent potential anomalies or errors. Further exploration of these observations will aid in understanding their impact on the model and whether adjustments or additional data checks are warranted for a more robust analysis.

These plots and indices help identify potential outliers and influential observations. The Cook’s distance plot identifies observations with large influence on the model, while the leverage plot and standardized residuals plot help identify observations with high leverage or large residuals.

  1. Consider the `birthwt` data from the R package `MASS`. We will investigate the relationship between low birthweight and the predictors in the data using logistic regression and discriminant analysis.
data(birthwt)
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                      family = binomial, data = birthwt)
summary(logit_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.078975   1.276254  -0.062  0.95066   
## age         -0.035845   0.036472  -0.983  0.32569   
## lwt         -0.012387   0.006614  -1.873  0.06111 . 
## race         0.453424   0.215294   2.106  0.03520 * 
## smoke        0.937275   0.398458   2.352  0.01866 * 
## ptl          0.542087   0.346168   1.566  0.11736   
## ht           1.830720   0.694135   2.637  0.00835 **
## ui           0.721965   0.463174   1.559  0.11906   
## ftv          0.063461   0.169765   0.374  0.70854   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 204.19  on 180  degrees of freedom
## AIC: 222.19
## 
## Number of Fisher Scoring iterations: 4
da_model <- lda(low ~ ., data = birthwt)  # note: "." includes bwt, the raw birthweight from which low is defined
summary(da_model)
##         Length Class  Mode     
## prior    2     -none- numeric  
## counts   2     -none- numeric  
## means   18     -none- numeric  
## scaling  9     -none- numeric  
## lev      2     -none- character
## svd      1     -none- numeric  
## N        1     -none- numeric  
## call     3     -none- call     
## terms    3     terms  call     
## xlevels  0     -none- list
  1. Investigate the relationship between variables in the dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.
summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990
cor_matrix <- cor(birthwt[, -1])
cor_matrix
##               age         lwt         race       smoke          ptl          ht
## age    1.00000000  0.18007315 -0.172817953 -0.04434618  0.071606386 -0.01583700
## lwt    0.18007315  1.00000000 -0.165048544 -0.04417908 -0.140029003  0.23636040
## race  -0.17281795 -0.16504854  1.000000000 -0.33903074  0.007951293  0.01992992
## smoke -0.04434618 -0.04417908 -0.339030745  1.00000000  0.187557063  0.01340704
## ptl    0.07160639 -0.14002900  0.007951293  0.18755706  1.000000000 -0.01539958
## ht    -0.01583700  0.23636040  0.019929917  0.01340704 -0.015399579  1.00000000
## ui    -0.07515558 -0.15276317  0.053602088  0.06215900  0.227585340 -0.10858506
## ftv    0.21539394  0.14052746 -0.098336254 -0.02801314 -0.044429660 -0.07237255
## bwt    0.09031781  0.18573328 -0.194713487 -0.19044806 -0.154653390 -0.14598189
##                ui         ftv         bwt
## age   -0.07515558  0.21539394  0.09031781
## lwt   -0.15276317  0.14052746  0.18573328
## race   0.05360209 -0.09833625 -0.19471349
## smoke  0.06215900 -0.02801314 -0.19044806
## ptl    0.22758534 -0.04442966 -0.15465339
## ht    -0.10858506 -0.07237255 -0.14598189
## ui     1.00000000 -0.05952341 -0.28392741
## ftv   -0.05952341  1.00000000  0.05831777
## bwt   -0.28392741  0.05831777  1.00000000

The correlation matrix reveals relationships between variables in the birthwt dataset. Birthweight (`bwt`) is positively correlated with the mother's weight at her last menstrual period (`lwt`, r ≈ 0.19) and negatively correlated with race, smoking status, previous premature labors, hypertension, and uterine irritability; `ui` shows the strongest association with `bwt` (r ≈ -0.28). Age and the number of physician visits (`ftv`) are only weakly correlated with birthweight. These findings underscore the importance of maternal weight, race, smoking, and uterine irritability in understanding and predicting low birthweight.

pairs(birthwt[, -1], col = birthwt$low + 1, pch = 16)  # points colored by outcome (black = normal, red = low birthweight)

par(mfrow = c(2, 2))
barplot(table(birthwt$race, birthwt$low), beside = TRUE, col = c("lightblue", "salmon", "palegreen"), legend = TRUE)  # bars for each race (1/2/3) within each level of low
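
For the categorical predictors, within-group proportions of low birthweight are more informative than raw correlations, and boxplots suit the quantitative predictors (a sketch):

# proportion of low-birthweight babies within each smoking and hypertension group
prop.table(table(birthwt$smoke, birthwt$low), margin = 1)
prop.table(table(birthwt$ht, birthwt$low), margin = 1)
# distribution of mother's weight by outcome
boxplot(lwt ~ low, data = birthwt, xlab = "low", ylab = "lwt (lbs)")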

  1. Fit a logistic regression model using methods discussed in class/the book, similar to problem 1). Be careful to understand each variable in `birthwt` to avoid including variables that are not logically acceptable for inclusion in the model.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                      family = binomial, data = birthwt)
summary(logit_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.078975   1.276254  -0.062  0.95066   
## age         -0.035845   0.036472  -0.983  0.32569   
## lwt         -0.012387   0.006614  -1.873  0.06111 . 
## race         0.453424   0.215294   2.106  0.03520 * 
## smoke        0.937275   0.398458   2.352  0.01866 * 
## ptl          0.542087   0.346168   1.566  0.11736   
## ht           1.830720   0.694135   2.637  0.00835 **
## ui           0.721965   0.463174   1.559  0.11906   
## ftv          0.063461   0.169765   0.374  0.70854   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 204.19  on 180  degrees of freedom
## AIC: 222.19
## 
## Number of Fisher Scoring iterations: 4
  1. What do you notice regarding the variables `ptl` and `ftv`? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?

In the model from b), `ptl` (number of previous premature labors) and `ftv` (number of physician visits in the first trimester) enter as numeric counts. The model therefore implicitly assumes that each additional premature labor, and each additional physician visit, changes the log odds of low birthweight by the same constant amount (a strictly linear effect on the log-odds scale). This is questionable: very few women have more than one previous premature labor or more than a couple of visits, so the upper counts are estimated from almost no data, and there is no reason to expect the jump from 0 to 1 premature labors to have the same effect as the jump from 2 to 3. Neither coefficient is significant in b) (`ptl` p = 0.117, `ftv` p = 0.709), which may partly reflect this misspecification.
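
A quick check on the linearity assumption is to tabulate the observed proportion of low birthweight at each count (a sketch; note how thin the data are at high counts):

with(birthwt, tapply(low, ptl, mean))  # observed proportion of low birthweight by ptl count
with(birthwt, tapply(low, ftv, mean))  # and by ftv count
table(birthwt$ptl)  # sample sizes behind each ptl value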

  1. Create a new variable for `ptl` named `ptl2` which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
birthwt$ptl2 <- cut(birthwt$ptl, breaks = c(-1, 0, 1, max(birthwt$ptl)),
                   labels = c("0", "1", "2+"), include.lowest = TRUE)
table(birthwt$ptl2)
## 
##   0   1  2+ 
## 159  24   6
  1. Create a new variable for `ftv` named `ftv2` which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the new variable in order to better understand the relationship between the probability of low birthweight and the newly created variable.
birthwt$ftv2 <- cut(birthwt$ftv, breaks = c(-1, 0, 1, max(birthwt$ftv)),
                   labels = c("0", "1", "2+"), include.lowest = TRUE)

prob_lowbw_ftv2 <- with(birthwt, tapply(low, ftv2, mean))

table_ftv2 <- data.frame(ftv2 = names(prob_lowbw_ftv2), 
                         Probability_lowbw = prob_lowbw_ftv2)

table_ftv2
##    ftv2 Probability_lowbw
## 0     0         0.3600000
## 1     1         0.2340426
## 2+   2+         0.2857143
  1. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use `ptl2` and `ftv2` in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?
updated_logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, 
                              family = binomial, data = birthwt)

summary(updated_logit_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.559059   1.358598   0.411  0.68071   
## age         -0.046977   0.038618  -1.216  0.22382   
## lwt         -0.013482   0.006922  -1.948  0.05143 . 
## race         0.363091   0.229779   1.580  0.11407   
## smoke        0.747336   0.427465   1.748  0.08041 . 
## ptl21        1.753895   0.543666   3.226  0.00126 **
## ptl22+      -0.072242   0.962742  -0.075  0.94019   
## ht           1.885743   0.726685   2.595  0.00946 **
## ui           0.723304   0.478396   1.512  0.13055   
## ftv21       -0.467064   0.486267  -0.961  0.33680   
## ftv22+       0.125850   0.459799   0.274  0.78431   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 194.92  on 178  degrees of freedom
## AIC: 216.92
## 
## Number of Fisher Scoring iterations: 4

With the recoded variables `ptl2` and `ftv2`, the model identifies having exactly one previous premature labor (`ptl21`) as a strong predictor: it raises the log odds of low birthweight by 1.75 (p = 0.00126). The `2+` category is not significant, but it rests on only 6 observations, so its coefficient is very imprecise. Hypertension (`ht`) remains a strong predictor (coefficient = 1.89, p = 0.00946), the mother's weight (`lwt`) is marginally significant with a negative association (coefficient = -0.013, p = 0.051), and smoking is marginal (p = 0.080). Race, uterine irritability, and both `ftv2` categories are not significant. Overall, the recoded `ptl2` is clearly important in predicting low birthweight, while `ftv2` adds little; the model's AIC improves from 222.19 in b) to 216.92 here.
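
To express these effects as odds ratios rather than log odds, the coefficients can be exponentiated (a sketch; confint() on a glm computes profile-likelihood intervals):

exp(cbind(OR = coef(updated_logit_model), confint(updated_logit_model)))  # odds ratios with 95% CIs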

  1. In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the `TeachingDemos` package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
levels(birthwt$low)
## NULL
birthwt$low <- factor(birthwt$low, levels = c(0, 1))
levels(birthwt$low)
## [1] "0" "1"
set.seed(12345)
index <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[index, ]
test_data <- birthwt[-index, ]
logit_model_f <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, family = binomial, data = train_data)  # model from f), refit on the training set
logit_model_b <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, 
                              family = binomial, data = birthwt)  # note: same recoded formula fit on the full data (test rows included), rather than the raw-count model from b)
lda_model <- lda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
qda_model <- qda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
lda_pred <- predict(lda_model, newdata = test_data)$class
qda_pred <- predict(qda_model, newdata = test_data)$class
logistic_pred <- ifelse(predict(logit_model_f, newdata = test_data, type = "response") > 0.5, 1, 0)
revised_logistic_pred <- ifelse(predict(logit_model_b, newdata = test_data, type = "response") > 0.5, 1, 0)
test_data$low <- factor(test_data$low, levels = levels(lda_pred))
lda_pred_factor <- factor(lda_pred, levels = levels(test_data$low))
qda_pred_factor <- factor(qda_pred, levels = levels(test_data$low))
logistic_pred_factor <- factor(logistic_pred, levels = levels(test_data$low))
revised_logistic_pred_factor <- factor(revised_logistic_pred, levels = levels(test_data$low))
lda_confusion <- confusionMatrix(lda_pred_factor, test_data$low)
qda_confusion <- confusionMatrix(qda_pred_factor, test_data$low)
logistic_confusion <- confusionMatrix(logistic_pred_factor, test_data$low)
revised_logistic_confusion <- confusionMatrix(revised_logistic_pred_factor, test_data$low)
lda_sensitivity <- lda_confusion$byClass["Sensitivity"]
lda_specificity <- lda_confusion$byClass["Specificity"]
lda_accuracy <- lda_confusion$overall["Accuracy"]

qda_sensitivity <- qda_confusion$byClass["Sensitivity"]
qda_specificity <- qda_confusion$byClass["Specificity"]
qda_accuracy <- qda_confusion$overall["Accuracy"]

logistic_sensitivity <- logistic_confusion$byClass["Sensitivity"]
logistic_specificity <- logistic_confusion$byClass["Specificity"]
logistic_accuracy <- logistic_confusion$overall["Accuracy"]

revised_logistic_sensitivity <- revised_logistic_confusion$byClass["Sensitivity"]
revised_logistic_specificity <- revised_logistic_confusion$byClass["Specificity"]
revised_logistic_accuracy <- revised_logistic_confusion$overall["Accuracy"]
print("Linear Discriminant Analysis (LDA):")
## [1] "Linear Discriminant Analysis (LDA):"
print(paste("Sensitivity:", lda_sensitivity))
## [1] "Sensitivity: 0.923076923076923"
print(paste("Specificity:", lda_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", lda_accuracy))
## [1] "Accuracy: 0.72972972972973"
print("Quadratic Discriminant Analysis (QDA):")
## [1] "Quadratic Discriminant Analysis (QDA):"
print(paste("Sensitivity:", qda_sensitivity))
## [1] "Sensitivity: 0.846153846153846"
print(paste("Specificity:", qda_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", qda_accuracy))
## [1] "Accuracy: 0.675675675675676"
print("Logistic Regression:")
## [1] "Logistic Regression:"
print(paste("Sensitivity:", logistic_sensitivity))
## [1] "Sensitivity: 0.923076923076923"
print(paste("Specificity:", logistic_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", logistic_accuracy))
## [1] "Accuracy: 0.72972972972973"
print("Revised Logistic Regression:")
## [1] "Revised Logistic Regression:"
print(paste("Sensitivity:", revised_logistic_sensitivity))
## [1] "Sensitivity: 0.961538461538462"
print(paste("Specificity:", revised_logistic_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", revised_logistic_accuracy))
## [1] "Accuracy: 0.756756756756757"
  1. Using your final model from f), interpret the estimates for all covariates.
  1. Intercept (0.559059): The log-odds of low birthweight is 0.559059 when all continuous predictors are zero and `ptl2` and `ftv2` are at their reference levels (no previous premature labors, no visits). Since an age or weight of zero is impossible, this value anchors the model rather than describing a realistic mother.

  2. Age (-0.046977): For each additional year of the mother's age, the log-odds of low birthweight decrease by 0.046977, holding other variables constant.

  3. LWT (-0.013482): For each additional pound of the mother's weight at her last menstrual period, the log-odds of low birthweight decrease by 0.013482, holding other variables constant.

  4. Race (0.363091): Race is coded numerically (1 = white, 2 = black, 3 = other), so the model assumes each one-unit increase in the race code raises the log-odds by 0.363091, holding other variables constant; treating race as a factor with separate indicator terms would be more defensible.

  5. Smoke (0.747336): Smoking during pregnancy increases the log-odds of low birthweight by 0.747336 compared to non-smokers, holding other variables constant.

  6. PTL (Preterm Labor):

    • PTL=1 (1.753895): Having one previous preterm labor increases the log-odds by 1.753895 compared to those with no previous preterm labor.

    • PTL>=2 (-0.072242): Having two or more previous preterm labors changes the log-odds by -0.072242 (a slight decrease) compared to those with no previous preterm labor; with only 6 such women in the data, this estimate is very imprecise and far from significant.

  7. HT (Hypertension): Having hypertension increases the log-odds by 1.885743 compared to those without hypertension, holding other variables constant.

  8. UI (Uterine Irritability): Having uterine irritability increases the log-odds by 0.723304 compared to those without uterine irritability, holding other variables constant.

  9. FTV (Physician Visits):

    • FTV=1 (-0.467064): Having one physician visit decreases the log-odds by 0.467064 compared to having no physician visits.

    • FTV>=2 (0.125850): Having two or more physician visits increases the log-odds by 0.125850 compared to having no physician visits.