Consider the cpus data from the R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance (perf) as a predictor of performance in this problem.
a. Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.
library(MASS)
library(ggplot2)
# Load the cpus dataset (from MASS)
data(cpus, package="MASS")
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
cor_matrix <- cor(cpus[, -1])
print(cor_matrix)
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
## perf -0.3070821 0.7949233 0.8629942 0.6626135 0.6089025 0.6052193
## estperf -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556
## perf estperf
## syct -0.3070821 -0.2883956
## mmin 0.7949233 0.8192915
## mmax 0.8629942 0.9012024
## cach 0.6626135 0.6486203
## chmin 0.6089025 0.6105802
## chmax 0.6052193 0.5921556
## perf 1.0000000 0.9664687
## estperf 0.9664687 1.0000000
pairs(cpus[, -1], main = "Scatterplot Matrix of cpus Data")
Observations:
- Strongest predictors of estperf:
  - mmax has the highest correlation (0.901)
  - mmin shows a strong correlation (0.819)
  - cach (0.649), chmin (0.611), and chmax (0.592) show moderate correlations
- Notable multicollinearity concerns:
  - mmin and mmax are highly correlated (0.758)
  - several moderate correlations exist between predictors (0.52-0.58 range)
  - perf and estperf are highly correlated (0.966) as expected, but perf will not be used as a predictor
- Many variables show right-skewed distributions, and non-linear relationships appear in several scatterplots (see the sketch after this list)
- Potential outliers exist at higher variable values
- The relationships between mmax/mmin and estperf appear particularly strong
- Data points tend to cluster at lower values for most variables
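To make the skewness and non-linearity concrete, here is a small sketch (using the cpus data loaded above) that plots estperf against mmax, its most strongly correlated predictor, on the raw and log-log scales; the log-log view is noticeably closer to linear.
# Sketch: raw vs log-log view of estperf against mmax
par(mfrow = c(1, 2))
plot(cpus$mmax, cpus$estperf, xlab = "mmax", ylab = "estperf", main = "Raw scale")
plot(log(cpus$mmax), log(cpus$estperf), xlab = "log(mmax)", ylab = "log(estperf)",
     main = "Log-log scale")
par(mfrow = c(1, 1))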
b. Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider perf in this modeling approach. Explain the process used to arrive at your final model.
Looking at the scatterplots, I notice that many variables exhibit right-skewed distributions with non-linear relationships to estperf. The first step is to apply a log transformation to both the response and the predictor variables in order to linearize relationships between variables, reduce skewness in the distributions, and stabilize variance across the range of predictions.
data <- cpus[, !(names(cpus) %in% c("name", "perf"))]
cpus_log <- data
cpus_log$log_estperf <- log(data$estperf)
cpus_log$log_chmin <- log(data$chmin)
cpus_log$log_chmax <- log(data$chmax)
cpus_log$log_mmin <- log(data$mmin)
cpus_log$log_mmax <- log(data$mmax)
cpus_log$log_cach <- log(data$cach)
cpus_log <- as.data.frame(cpus_log)
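A side note before fitting: cach, chmin, and chmax contain zeros (see the summary output above), so log() returns -Inf for those machines, and the filtering step below drops them, leaving 139 of the 209 machines for modeling. A minimal alternative sketch, not used in the rest of this analysis, would keep every machine by applying a log(x + 1) transform to the variables that can be zero:
# Alternative (not used below): log1p() keeps machines with zero cach, chmin, or chmax
cpus_log_alt <- data
cpus_log_alt$log_estperf <- log(data$estperf)
cpus_log_alt$log_mmin <- log(data$mmin)
cpus_log_alt$log_mmax <- log(data$mmax)
cpus_log_alt$log_cach <- log1p(data$cach)   # log(cach + 1) is finite at 0
cpus_log_alt$log_chmin <- log1p(data$chmin) # log(chmin + 1) is finite at 0
cpus_log_alt$log_chmax <- log1p(data$chmax) # log(chmax + 1) is finite at 0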
I’ll use a combination of techniques to determine the best predictors:
cpus_log <- na.omit(cpus_log) # Removes rows with NAs
cpus_log <- cpus_log[!apply(cpus_log, 1, function(x) any(is.infinite(x))), ] # Removes rows with Inf values
full_model <- lm(log_estperf ~ log_chmin + log_chmax + log_mmin + log_mmax + log_cach, data=cpus_log)
summary(full_model)
##
## Call:
## lm(formula = log_estperf ~ log_chmin + log_chmax + log_mmin +
## log_mmax + log_cach, data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42690 -0.16513 -0.02378 0.12030 0.83766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.46464 0.23226 -10.611 < 2e-16 ***
## log_chmin 0.11114 0.02681 4.145 6.02e-05 ***
## log_chmax 0.08956 0.02815 3.181 0.001824 **
## log_mmin 0.12792 0.03482 3.674 0.000345 ***
## log_mmax 0.51887 0.03581 14.492 < 2e-16 ***
## log_cach 0.23201 0.02388 9.718 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.235 on 133 degrees of freedom
## Multiple R-squared: 0.9385, Adjusted R-squared: 0.9362
## F-statistic: 406.1 on 5 and 133 DF, p-value: < 2.2e-16
step_model <- step(full_model, direction="backward")
## Start: AIC=-396.68
## log_estperf ~ log_chmin + log_chmax + log_mmin + log_mmax + log_cach
##
## Df Sum of Sq RSS AIC
## <none> 7.3473 -396.68
## - log_chmax 1 0.5591 7.9064 -388.49
## - log_mmin 1 0.7458 8.0930 -385.24
## - log_chmin 1 0.9490 8.2962 -381.80
## - log_cach 1 5.2168 12.5641 -324.10
## - log_mmax 1 11.6012 18.9485 -266.99
The backward stepwise procedure retains all five predictors: removing any term would increase the AIC, and log_mmax, the variable most strongly correlated with estperf (0.901), is by far the largest contributor. However, given the multicollinearity between mmin and mmax noted earlier, I prefer not to keep both memory terms.
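To quantify that concern before dropping terms, a quick sketch (using the car package, which is loaded again in part f) computes variance inflation factors for the full log-scale model:
library(car)     # provides vif(); also used in part f)
vif(full_model)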
Based on the relative contributions in the step output and these multicollinearity considerations, I arrive at a more parsimonious final model:
final_model <- lm(log_estperf ~ log_mmax + log_cach + log_chmin, data=cpus_log)
summary(final_model)
##
## Call:
## lm(formula = log_estperf ~ log_mmax + log_cach + log_chmin, data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41808 -0.17007 -0.02499 0.13284 0.81832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.32079 0.24169 -9.602 < 2e-16 ***
## log_mmax 0.61502 0.02915 21.097 < 2e-16 ***
## log_cach 0.26745 0.02388 11.199 < 2e-16 ***
## log_chmin 0.17315 0.02391 7.242 3.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2497 on 135 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.928
## F-statistic: 593.8 on 3 and 135 DF, p-value: < 2.2e-16
This model prioritizes maximum memory (mmax) and cache size (cach) as the primary predictors, with minimum channels (chmin) also contributing significant explanatory power after accounting for the other variables.
Interpretation:
- Intercept (-2.32079): the predicted value of log_estperf when all log-transformed predictors are zero (i.e., when mmax, cach, and chmin all equal 1); it anchors the model rather than having a direct practical interpretation.
- log_mmax (0.61502, p < 2e-16): a 1% increase in mmax (maximum main memory) is associated with an approximate 0.615% increase in estimated performance, holding the other variables constant.
- log_cach (0.26745, p < 2e-16): a 1% increase in cach (cache size) is associated with an approximate 0.267% increase in estimated performance.
- log_chmin (0.17315, p = 3.05e-11): a 1% increase in chmin (minimum number of channels) is associated with an approximate 0.173% increase in estimated performance.
- Residual standard error (RSE): 0.2497, indicating moderate spread in the residuals.
- F-statistic (593.8, p < 2.2e-16): strong evidence that at least one predictor is significant.
- Residuals: roughly symmetric around zero, but residual plots should still be checked for normality and homoscedasticity.
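As a quick numerical check on the log-log (elasticity) interpretation, a sketch using the fitted coefficient for log_mmax computes the multiplicative change in estperf implied by a 1% increase in mmax:
# Effect of a 1% increase in mmax on estperf, holding the other predictors fixed
b_mmax <- coef(final_model)["log_mmax"]
exp(b_mmax * log(1.01)) - 1   # roughly 0.0061, i.e. about a 0.61% increase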
c. Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
plot(final_model$fitted.values, final_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red")
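Beyond the single residuals-versus-fitted plot, the standard diagnostic panels for an lm object (QQ plot, scale-location, residuals versus leverage) help check normality and homoscedasticity; a minimal sketch:
# Standard diagnostic panels for the final model
par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1))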
Observations:
Non-linearity: The residuals exhibit a clear U-shaped pattern with:
Mostly positive residuals at low fitted values (2-3)
Predominantly negative residuals in the middle range (4-5)
Returning to positive residuals at high fitted values (6+)
Heteroscedasticity: The spread of residuals is not constant:
Greater variability at the extremes of fitted values
More compressed in the middle range
Potential outliers: Several observations show large residuals (around 0.8) at both the low and high ends of fitted values
Based on these observations, two major linear regression assumptions are being violated:
Linearity assumption: The U-shaped pattern indicates non-linear relationships
Homoscedasticity: The varying spread of residuals indicates non-constant variance
To address these issues, I modify the model as follows:
poly_model <- lm(log(estperf) ~ poly(mmax, 2) + poly(mmin, 2) + log(cach), data=cpus_log)
summary(poly_model)
##
## Call:
## lm(formula = log(estperf) ~ poly(mmax, 2) + poly(mmin, 2) + log(cach),
## data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31136 -0.08852 -0.03328 0.06795 0.83412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.59567 0.04975 72.278 < 2e-16 ***
## poly(mmax, 2)1 6.76230 0.26490 25.527 < 2e-16 ***
## poly(mmax, 2)2 -1.40273 0.19400 -7.231 3.40e-11 ***
## poly(mmin, 2)1 1.50458 0.26160 5.752 5.79e-08 ***
## poly(mmin, 2)2 -0.46350 0.20188 -2.296 0.0232 *
## log(cach) 0.26285 0.01583 16.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1684 on 133 degrees of freedom
## Multiple R-squared: 0.9684, Adjusted R-squared: 0.9673
## F-statistic: 816.3 on 5 and 133 DF, p-value: < 2.2e-16
plot(poly_model$fitted.values, poly_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red")
interaction_model <- lm(log(estperf) ~ log(mmax)*log(cach), data=cpus_log)
summary(interaction_model)
##
## Call:
## lm(formula = log(estperf) ~ log(mmax) * log(cach), data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36543 -0.16143 -0.04096 0.08368 1.01237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.75866 0.62762 1.209 0.228859
## log(mmax) 0.27227 0.07055 3.859 0.000176 ***
## log(cach) -0.86929 0.19885 -4.372 2.44e-05 ***
## log(mmax):log(cach) 0.13063 0.02140 6.104 1.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2605 on 135 degrees of freedom
## Multiple R-squared: 0.9233, Adjusted R-squared: 0.9216
## F-statistic: 542.1 on 3 and 135 DF, p-value: < 2.2e-16
plot(interaction_model$fitted.values, interaction_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red")
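Since all three candidate models use log(estperf) as the response and are fit to the same filtered cpus_log data, a quick sketch compares them on AIC and adjusted R-squared before settling on the polynomial specification:
# Compare candidate models; lower AIC and higher adjusted R-squared are better
AIC(final_model, poly_model, interaction_model)
c(final = summary(final_model)$adj.r.squared,
  poly = summary(poly_model)$adj.r.squared,
  interaction = summary(interaction_model)$adj.r.squared)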
Observations from the residual plot of the polynomial model:
Non-random distribution: The residuals show a non-random pattern, with clusters of positive residuals around fitted values of 3.5-4.5, while most other areas have residuals closer to zero or slightly negative.
Potential outliers: Several points stand out with large positive residuals (around 0.6-0.8) in the 3.5-4.5 fitted value range. These could be influential observations affecting the model.
Slight heteroscedasticity: The spread of residuals appears somewhat uneven across fitted values, with more variability in the lower fitted value range and less at higher values.
Mean shift: The average residual appears slightly negative for higher fitted values (5.5-7), suggesting possible systematic prediction errors in that range.
d. How well does the final model fit the data? Comment on some model fit criteria from the model built in c)
Based on the residual plot from the final polynomial model, I can evaluate several aspects of model fit:
The model demonstrates excellent overall fit based on several key metrics:
R-squared: 0.9684 - This indicates that approximately 96.84% of the variance in log(estperf) is explained by the model, which is exceptionally high
Adjusted R-squared: 0.9673 - The adjustment for the number of predictors shows the model is not overfitting
F-statistic: 816.3 (p < 2.2e-16) - The extremely high F-value and significant p-value confirm the model is statistically valid
Residual standard error: 0.1684 - A relatively small value indicating good prediction accuracy
All predictors are statistically significant, with most at p < 0.001
The quadratic terms for both mmax and mmin are significant, justifying the polynomial approach
The log(cach) term is highly significant (t = 16.601), confirming cache size is an important predictor
Looking at the residual plot:
Central tendency: Most residuals are clustered around zero, with a median of -0.03328
Outliers: Several notable positive outliers (maximum of 0.83412) appear in the lower fitted value range
Pattern: While improved from earlier models, some non-random pattern remains visible with more positive residuals at lower fitted values
Heteroscedasticity: Some inconsistency in residual spread across fitted values remains, with greater variance at lower fitted values
The polynomial model represents a substantial improvement over simpler models, capturing the non-linear relationships in the data effectively. The extremely high R-squared and significant coefficients indicate a strong model. While the residual plot still shows some pattern and outliers, the majority of observations are well-predicted by the model.
The remaining outliers might represent specific CPU models with unusual performance characteristics that could be investigated further, but they don’t significantly compromise the overall excellent fit of the model.
e. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
The final model uses a log-transformed response variable (log(estperf)) with polynomial terms for mmax and mmin, and a log-transformed cache variable. To interpret coefficients, we must remember that the inverse of the log function is the exponential function (e^x), which allows us to convert back to the original performance scale.
Cache size coefficient: 0.26285
Since both the predictor (cach) and the response (estperf) are log-transformed, this coefficient represents an elasticity. A 1% increase in cache size is associated with approximately a 0.26% increase in estimated performance, holding maximum and minimum memory constant. For example, increasing cache size from 100 KB to 101 KB (a 1% increase; cach is recorded in kilobytes in this dataset) would result in approximately a 0.26% increase in the CPU's estimated performance.
Maximum memory (mmax) polynomial terms:
Linear term coefficient: 6.76230
Quadratic term coefficient: -1.40273
The relationship between maximum memory and estimated performance is non-linear. The positive linear term (6.76) combined with the negative quadratic term (-1.40) indicates that increases in maximum memory are associated with increases in estimated performance, but with diminishing returns at higher memory levels. This means the performance benefit of adding memory capacity decreases as the maximum memory gets larger.
Minimum memory (mmin) polynomial terms:
Linear term coefficient: 1.50458
Quadratic term coefficient: -0.46350
Similar to maximum memory, minimum memory shows a non-linear relationship with performance. The positive linear coefficient (1.50) and negative quadratic coefficient (-0.46) indicate that increasing minimum memory is associated with improved performance, but these improvements diminish at higher levels of minimum memory.
To convert predicted values from the log scale back to the original performance scale, we use the exponential function:
Original estimated performance = e^(predicted log(estperf))
For example, if our model predicts a log(estperf) value of 5.0, the estimated performance would be e^5.0 ≈ 148.4 units on the original performance scale.
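A short sketch of this back-transformation using the fitted polynomial model (the first few rows of the data are used purely for illustration):
# Back-transform predictions from the log scale to the original estperf scale
pred_log <- predict(poly_model, newdata = head(cpus_log))
pred_raw <- exp(pred_log)   # the exponential undoes the natural log
cbind(log_scale = round(pred_log, 2), original_scale = round(pred_raw, 1))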
The combined effect of all variables creates a model that captures both the importance of cache size for performance and the diminishing returns of increasing memory capacity beyond certain thresholds.
f. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
vif_values <- vif(poly_model)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## poly(mmax, 2) 3.226432 2 1.340234
## poly(mmin, 2) 3.292278 2 1.347020
## log(cach) 1.677777 1 1.295290
The GVIF values show moderate levels of multicollinearity in the final model:
- Memory variables: both polynomial terms, poly(mmax, 2) for maximum memory and poly(mmin, 2) for minimum memory, show similar GVIF values around 3.2-3.3, which indicates some correlation between these predictors.
- Cache variable: the log(cach) term shows a lower GVIF of approximately 1.68, suggesting it has less correlation with the other predictors.
- Standardized measure: the GVIF^(1/(2*Df)) values, which account for the degrees of freedom and allow for better comparison across terms with different dimensions, are all between 1.29 and 1.35, which is well below concerning levels.
The observed level of multicollinearity is:
Mild to moderate - All GVIF values are below 5, which is generally considered acceptable
Expected - Given the correlation (0.758) previously observed between mmin and mmax
Managed effectively - the use of orthogonal polynomials through the poly() function has helped control potential collinearity issues
No further action is necessary because:
The multicollinearity is not severe enough to compromise the model’s reliability
The model already demonstrates excellent fit (R² = 0.9684)
All predictors remain statistically significant despite the presence of some multicollinearity
This analysis confirms that the polynomial model specification effectively balances capturing non-linear relationships while maintaining acceptable levels of multicollinearity.
g. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.
The residual plot shows clear evidence of potential outliers:
Several observations with extremely large positive residuals (approximately 0.6-0.8) around fitted values of 3.5-4.0
One particularly extreme point with a residual of approximately 0.8 at a fitted value of around 4.0
Most residuals are concentrated between -0.2 and 0.2, making these extreme points stand out significantly
# Evaluate standardized residuals
std_resid <- rstandard(poly_model)
# Plot standardized residuals
plot(std_resid, type="h", main="Standardized Residuals")
abline(h=c(-3,3), col="red") # Threshold lines at ±3
Rule of thumb: Points with standardized residuals > |3| are considered outliers.
# Calculate Cook's distance
cooks_d <- cooks.distance(poly_model)
n <- 139  # sample size used to fit poly_model (after removing NA/Inf rows)
p <- 5    # number of predictor terms (excluding the intercept)
# Plot Cook's distance
plot(cooks_d, type="h", main="Cook's Distance")
abline(h=4/(n-p-1), col="red")
Rule of thumb: Points with Cook’s distance > 4/(n-p-1) are influential.
# Calculate leverage values
leverage <- hatvalues(poly_model)
# Plot leverage values
plot(leverage, type="h", main="Leverage Values")
abline(h=2*p/n, col="red")
Rule of thumb: Points with leverage > 2p/n are high-leverage observations.
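A sketch that applies these rules of thumb directly, so the flagged rows can be listed rather than read off the plots (indices refer to rows of the filtered cpus_log data):
# Observations flagged by each rule of thumb
which(abs(std_resid) > 3)          # outliers by standardized residual
which(cooks_d > 4 / (n - p - 1))   # influential points by Cook's distance
which(leverage > 2 * p / n)        # high-leverage points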
The standardized residuals plot shows:
Several observations with values exceeding the critical threshold of ±3 (indicated by the red line)
Notable outliers appear at indices ~10, ~20, ~30, and ~60
These points have standardized residuals ranging from approximately 3 to 5
Most observations fall within the expected range of -2 to +2
The Cook’s distance plot reveals:
Observation #10 has an extremely high Cook’s distance of approximately 2.5
This far exceeds the threshold (red line) of 4/(n-p-1), which appears to be close to 0.03
Several other observations show minor elevations in Cook’s distance (near indices 60, 110, and 140)
Observation #10 has substantially greater influence than any other point in the dataset
The leverage plot indicates:
Observation #10 has remarkably high leverage (~0.9)
Two observations near the end of the dataset (around index 140) also have elevated leverage (~0.35)
These values exceed the typical threshold of 2p/n (shown by the red line)
Most observations have very low leverage values
The diagnostic plots provide strong evidence that:
Observation #10 is an extremely influential point with high leverage, high Cook’s distance, and a large standardized residual. This single observation significantly influences the regression coefficients and warrants careful investigation.
Several additional outliers exist that have large standardized residuals but lower leverage and influence.
A few observations near the end of the dataset have high leverage but relatively low influence according to Cook’s distance.
These findings suggest that observation #10 should potentially be investigated for data entry errors or removed in a sensitivity analysis to assess its impact on model results. Despite these outliers, the overall model appears to maintain good fit (as seen in previous analyses with R² = 0.9684).
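A minimal sensitivity-analysis sketch along those lines refits the polynomial model without the single most influential observation (identified from Cook's distance rather than hard-coded) and compares the coefficient estimates:
# Refit without the most influential point and compare coefficients
infl_idx <- which.max(cooks_d)
poly_refit <- update(poly_model, data = cpus_log[-infl_idx, ])
round(cbind(all_data = coef(poly_model), without_point = coef(poly_refit)), 3)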
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the birthwt dataset
data("birthwt", package = "MASS")
# Transform categorical variables for better visualization
bwt <- with(birthwt, {
race <- factor(race, labels = c("white", "black", "other"))
ptd <- factor(ptl > 0, labels = c("no", "yes"))
ftv <- factor(ftv)
levels(ftv)[-(1:2)] <- "2+"
data.frame(low = factor(low, labels = c("normal", "low")),
age, lwt, race, smoke = factor(smoke, labels = c("no", "yes")),
ptd, ht = factor(ht, labels = c("no", "yes")),
ui = factor(ui, labels = c("no", "yes")), ftv)
})
# Summary statistics for numeric variables
summary(bwt)
## low age lwt race smoke ptd
## normal:130 Min. :14.00 Min. : 80.0 white:96 no :115 no :159
## low : 59 1st Qu.:19.00 1st Qu.:110.0 black:26 yes: 74 yes: 30
## Median :23.00 Median :121.0 other:67
## Mean :23.24 Mean :129.8
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## ht ui ftv
## no :177 no :161 0 :100
## yes: 12 yes: 28 1 : 47
## 2+: 42
##
##
##
# Correlation matrix for the numeric predictors (age and lwt)
cor_matrix <- cor(bwt %>% select_if(is.numeric))
print(cor_matrix)
## age lwt
## age 1.0000000 0.1800732
## lwt 0.1800732 1.0000000
Continuous predictors like age and lwt (mother's weight) are summarized numerically. The correlation matrix here involves only the two continuous predictors, and their correlation is modest (0.18), so collinearity between them is not a concern; relationships with the binary outcome (low) are examined graphically below.
Boxplots of the continuous predictors by birth weight status (low):
# Boxplot for mother's age vs low birth weight
ggplot(bwt, aes(x = low, y = age)) +
geom_boxplot() +
labs(title = "Mother's Age vs Low Birth Weight", x = "Birth Weight Status", y = "Age")
# Boxplot for mother's weight vs low birth weight
ggplot(bwt, aes(x = low, y = lwt)) +
geom_boxplot() +
labs(title = "Mother's Weight vs Low Birth Weight", x = "Birth Weight Status", y = "Weight (lbs)")
Bar plots of the categorical predictors by birth weight status (low):
# Bar plot for race vs low birth weight
ggplot(bwt, aes(x = race, fill = low)) +
geom_bar(position = "fill") +
labs(title = "Race vs Low Birth Weight", x = "Race", y = "Proportion")
# Bar plot for smoking status vs low birth weight
ggplot(bwt, aes(x = smoke, fill = low)) +
geom_bar(position = "fill") +
labs(title = "Smoking Status vs Low Birth Weight", x = "Smoking Status", y = "Proportion")
# Bar plot for hypertension history vs low birth weight
ggplot(bwt, aes(x = ht, fill = low)) +
geom_bar(position = "fill") +
labs(title = "Hypertension History vs Low Birth Weight", x = "Hypertension History", y = "Proportion")
Continuous Predictors:
- Mothers with lower weights (lwt) tend to have higher proportions of low-birthweight babies.
- Younger mothers (age) appear to have slightly higher proportions of low-birthweight babies.
Categorical Predictors:
Race: Black mothers have a higher proportion of low-birthweight babies compared to white or other racial groups.
Smoking: Smoking during pregnancy is strongly associated with a higher proportion of low-birthweight babies.
Hypertension (ht): Mothers with a history of hypertension show a higher proportion of low-birthweight babies.
Surprising Findings:
- The presence of uterine irritability (ui) does not show as strong an association with low birth weight as expected.
- The number of physician visits during the first trimester (ftv) does not show a clear trend in reducing the risk of low birth weight.
This analysis provides a comprehensive view of the relationships between the predictors and the outcome variable (low). Continuous predictors are visualized using boxplots to highlight differences in distributions across birthweight categories, and categorical predictors are visualized using bar plots to show proportions within each category. These insights will inform subsequent modeling steps using logistic regression and discriminant analysis.
b. Fit a logistic regression model using methods discussed in class/the book, similar to as in problem 1). Be careful to understand each variable in birthwt to avoid including variables that are not logically acceptable for inclusion in the model.
# Create a cleaned version of the 'birthwt' dataset
bwt <- with(birthwt, {
# Convert race to a factor with meaningful labels:
race <- factor(race, labels = c("white", "black", "other"))
# Convert ptl (number of previous premature labors) into a binary indicator:
ptd <- factor(ptl > 0, labels = c("none", "yes"))
# Convert ftv (first trimester physician visits) into a factor,
# recoding levels beyond the first two as "2+"
ftv <- factor(ftv)
levels(ftv)[-c(1,2)] <- "2+"
# Build a new data frame with logically acceptable predictors.
data.frame(low = factor(low, labels = c("normal", "low")),
age, # Maternal age (years)
lwt, # Maternal weight (lbs) at last menstrual period
race, # Categorical: white, black, or other
smoke = factor(smoke, labels = c("no", "yes")),
ptd, # Indicator of previous premature labors (none or yes)
ht = factor(ht, labels = c("no", "yes")), # History of hypertension
ui = factor(ui, labels = c("no", "yes")), # Uterine irritability
ftv) # Number of first trimester physician visits
})
# Fit the logistic regression model.
# We include only those predictors that are logically acceptable and that are known risk factors.
bw.glm <- glm(low ~ age + lwt + race + smoke + ptd + ht + ui + ftv,
data = bwt,
family = binomial)
summary(bw.glm)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptd + ht + ui +
## ftv, family = binomial, data = bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## raceblack 1.19241 0.53597 2.225 0.02609 *
## raceother 0.74069 0.46174 1.604 0.10869
## smokeyes 0.75553 0.42502 1.778 0.07546 .
## ptdyes 1.34376 0.48062 2.796 0.00518 **
## htyes 1.91317 0.72074 2.654 0.00794 **
## uiyes 0.68019 0.46434 1.465 0.14296
## ftv1 -0.43638 0.47939 -0.910 0.36268
## ftv2+ 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
Data Preparation:
The code above converts several variables appropriately:
race is recoded into a factor with levels “white,” “black,” and “other.”
ptl becomes a binary indicator (stored as ptd) for whether a mother has had any previous premature labors.
ftv is recoded so that values beyond 2 are grouped into “2+.”
The outcome, low, is set as a factor using the label “normal” for a birth weight of at least 2.5 kg and “low” for less than 2.5 kg.
Model Fitting:
The logistic regression model is fit with glm using a binomial family, which models the log odds of a low-birthweight outcome. The predictors include maternal age (age), weight (lwt), race, smoking status (smoke), previous premature labors (ptd), history of hypertension (ht), uterine irritability (ui), and first-trimester physician visits (ftv). This specification avoids including the actual birth weight (bwt) or any other variable that would not be logically acceptable for predicting low birth weight, as discussed in class and in the MASS documentation.
Model Interpretation
This logistic regression model indicates that lower maternal weight, being black (relative to white), having a history of prior premature labor, and having hypertension are significant predictors of low birth weight. The estimated odds ratios (obtained by exponentiating the coefficients) allow for a more intuitive interpretation. For example, black mothers in this dataset have about 3.3 times the odds of delivering a low-birthweight infant than white mothers, and a history of hypertension increases the odds by nearly 6.8 times.
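A short sketch of that calculation (the confidence intervals use the profile-likelihood method, so confint() may take a moment):
# Odds ratios and 95% confidence intervals for the logistic model in b)
round(exp(coef(bw.glm)), 2)
round(exp(confint(bw.glm)), 2)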
c. What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables' effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
ptl (Previous Premature Labors):
In our analysis, rather than using the raw count of previous premature
labors, we recoded this variable into a binary indicator (often labeled
“none” versus “yes”). By doing so, the model implicitly assumes that the
risk associated with previous premature labor operates as a threshold
effect. That is, it assumes that having any history of
premature labor increases the log odds of low birth weight by a fixed
amount, and that having one versus two or more preterm labors does not
make any additional difference. Essentially, once the threshold of “at
least one” is crossed, the risk is elevated by the same amount for all
mothers in that category.
ftv (First Trimester Physician Visits):
The variable ftv was converted into a factor with levels such as “0,”
“1,” and “2+” visits. By treating ftv as a categorical variable, the
model assumes that each category has its own fixed effect on the log
odds of low birth weight relative to a baseline (usually the “0 visits”
category). This means that the effect of having, say, 1 visit versus 0
visits is assumed to be constant, and any differences among higher
numbers of visits are grouped together under “2+.” The model does not
assume a continuous, incremental change in risk with each additional
visit; rather, it imposes a step function where the effect jumps at the
boundaries of these categories.
For ptl:
The assumption that any history of premature labor has the same effect,
regardless of how many times it has occurred, may be an
oversimplification. In clinical reality, the risk might further increase
with multiple prior preterm labors. However, if the data are sparse for
mothers with more than one such event, grouping them into a “yes”
category can be a practical move to avoid unstable estimates.
For ftv:
By grouping first-trimester physician visits into a few categories, the
model assumes that the impact on the log odds is constant within those
groups. This discretization may not capture more subtle or linear
dose‐response relationships. If prenatal care exhibits a more gradual
protective effect with increasing visits, then modeling it as a
continuous variable or exploring non-linear relationships (e.g., using
splines) might provide better insight. The categorical approach
simplifies the analysis and interpretation but at the potential cost of
missing a more nuanced association.
The logistic regression model is implicitly assuming:
That for ptl, there is a binary (threshold) effect—no risk if none, and the same elevated risk if one or more preterm labors have occurred.
That for ftv, the effects on the log odds are constant across categorically defined groups (0, 1, or 2+ visits).
These assumptions simplify the modeling process and are often used when the exact nature of the relationship is unclear or when data in some categories are sparse. However, if additional information or clinical insight suggests that the risk changes gradually (or escalates further with more preterm labors), then alternative strategies (e.g., keeping the variable continuous or using non-linear transformations such as splines) might be more realistic and informative.
d. Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
Below is one approach to create a new variable, ptl2, that collapses the original ptl information into a smaller number of categories. In the birthwt data the number of previous premature labors (ptl) is low for almost all mothers, and the cleaned bwt data already stores this information as the binary indicator ptd, so we distinguish mothers with no history of premature labor from those with at least one event.
Assuming bwt is our cleaned birthwt dataset, we create ptl2 so that:
“0” means no previous premature labor,
“1” means there was a previous premature labor history.
bwt$ptl2 <- with(bwt, ifelse(ptd == "none", "0", "1"))
#Convert ptl2 into a factor with ordered levels.
bwt$ptl2 <- factor(bwt$ptl2, levels = c("0", "1"))
#count of each category
table(bwt$ptl2)
##
## 0 1
## 159 30
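For reference, looking back at the raw counts in birthwt itself (the cleaned bwt data keeps only the binary ptd indicator) shows why the higher categories are collapsed; counts above one are very sparse:
# Raw distribution of previous premature labours in the original data
table(birthwt$ptl)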
This new variable ptl2 is often more useful for analysis, especially when the sample sizes in the higher count categories are very small. By collapsing ptl into two groups, no previous premature labor versus at least one, we capture the elevated risk associated with any history of premature labor while avoiding unstable estimates for very rare categories.
e. Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
The original variable ftv represents the number of first-trimester physician visits. Since higher values (e.g., 3, 4, etc.) are rare, we can collapse the levels into broader categories:
"0": No visits
"1": One visit
"2+": Two or more visits
# Collapse ftv into broader categories to create ftv2
bwt$ftv2 <- with(bwt,
ifelse(ftv == 0, "0",
ifelse(ftv == 1, "1", "2+")))
# Convert ftv2 into a factor with ordered levels
bwt$ftv2 <- factor(bwt$ftv2, levels = c("0", "1", "2+"))
# frequency of each category
table(bwt$ftv2)
##
## 0 1 2+
## 100 47 42
To better understand the relationship between ftv2 and the probability of low birthweight, we calculate the proportion of low-birthweight babies (low == "low") within each category of ftv2.
# Summarize probabilities of low birthweight by ftv2 levels
summary_table <- bwt %>%
group_by(ftv2) %>%
summarize(
total = n(),
low_count = sum(low == "low"),
probability_low = mean(low == "low")
)
print(summary_table)
## # A tibble: 3 × 4
## ftv2 total low_count probability_low
## <fct> <int> <int> <dbl>
## 1 0 100 36 0.36
## 2 1 47 11 0.234
## 3 2+ 42 12 0.286
No Visits (ftv2 = "0"):
Out of 100 mothers who had no first-trimester physician visits, 36 delivered low-birthweight babies.
The probability of low birthweight in this group is 36%.
This suggests that mothers who did not receive prenatal care during the first trimester are at higher risk for low birthweight.
One Visit (ftv2 = "1"):
Out of 47 mothers who had exactly one visit, 11 delivered low-birthweight babies.
The probability of low birthweight in this group is 23.4%.
This indicates that even one prenatal visit reduces the risk of low birthweight compared to no visits.
Two or More Visits (ftv2 = "2+"):
Out of 42 mothers who had two or more visits, 12 delivered low-birthweight babies.
The probability of low birthweight in this group is 28.6%.
While the risk is lower than for no visits, it is slightly higher than for one visit, suggesting that the protective effect may not increase substantially beyond one visit in this dataset.
The probability of low birthweight decreases significantly from 36% (no visits) to 23.4% (one visit), highlighting the importance of at least one prenatal care visit during the first trimester.
Interestingly, the probability increases slightly to 28.6% for two or more visits, which may be counterintuitive. This could suggest:
Mothers with higher-risk pregnancies might attend more prenatal visits, which could confound the relationship between ftv2 and low birthweight.
Additional analysis, such as adjusting for other risk factors (e.g., maternal weight, smoking status), might be needed to clarify this relationship.
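As one such follow-up, a sketch of a formal (unadjusted) test of association checks whether the differences in proportions across ftv2 levels are larger than would be expected by chance:
# Unadjusted chi-squared test of association between ftv2 and low birthweight
chisq.test(table(bwt$ftv2, bwt$low))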
f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ptl2 and ftv2 in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?
bwt$race <- as.factor(bwt$race)
bwt$smoke <- as.factor(bwt$smoke)
bwt$ptl2 <- as.factor(bwt$ptl2)
bwt$ht <- as.factor(bwt$ht)
bwt$ui <- as.factor(bwt$ui)
bwt$ftv2 <- as.factor(bwt$ftv2)
# Fit a logistic regression model with ptl2 and ftv2
updated_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = bwt,
family = binomial)
summary(updated_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## raceblack 1.19241 0.53597 2.225 0.02609 *
## raceother 0.74069 0.46174 1.604 0.10869
## smokeyes 0.75553 0.42502 1.778 0.07546 .
## ptl21 1.34376 0.48062 2.796 0.00518 **
## htyes 1.91317 0.72074 2.654 0.00794 **
## uiyes 0.68019 0.46434 1.465 0.14296
## ftv21 -0.43638 0.47939 -0.910 0.36268
## ftv22+ 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
Observations:
1. ptl2 (Previous Premature Labors):
- The new variable ptl2 appears in the output as indicator variables
(e.g., “ptl21” representing mothers with exactly one previous premature
labor, with “0” as the reference).
- The coefficient for ptl21 is 1.34376 and is statistically
significant (p = 0.00518).
- Exponentiating this coefficient (exp(1.34376) ≈ 3.83) suggests
that, compared to mothers with no previous premature labor, mothers with
one previous premature labor have about 3.8 times higher odds of
delivering a low-birthweight infant.
- If there were also a “ptl22+” category in the model (e.g., mothers
with two or more previous premature labors), we would interpret its
coefficient similarly; in this output only “ptl21” appears (indicating
that, after collapsing the categories, the important contrast detected
is between none versus one or more).
- Thus, the recoding of ptl into ptl2 clearly distinguishes
risk groups and is important in predicting low birthweight.
2. ftv2 (First-Trimester Physician Visits):
- The new variable ftv2 is represented in two contrasts: “ftv21” (one
visit vs. none) and “ftv22+” (two or more visits vs. none).
- Neither of these contrasts are statistically significant (p =
0.36268 and p = 0.69488, respectively).
- The coefficient for ftv21 is -0.43638, which (if significant) would
have indicated a protective effect (odds ratio exp(-0.43638) ≈ 0.646),
but its lack of significance suggests that, after controlling for other
factors, the difference between having no visits versus one visit is not
statistically apparent.
- Likewise, the contrast for ftv22+ does not reach
significance.
- Thus, the recoded ftv2 variable does not appear to be
important in predicting low birthweight in this multivariable
model.
ptl2 is a strong and significant predictor: The recoding of prior premature labors into ptl2 has clarified that even a single previous premature labor markedly increases the odds of low birthweight; mothers with any prior event have substantially higher risk than those with none.
ftv2 does not add significant predictive value: Although initial univariate summaries of ftv2 may have hinted at differences in low birthweight probabilities (with no visits having a higher rate), once other factors like maternal weight, race, and hypertension are controlled for in the logistic regression, ftv2 does not have a statistically significant independent effect. This may be because the effect of prenatal visits is confounded by other risk factors or because only a subset of mothers drive the association.
g. In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don't use my name to set the seed!)
library(TeachingDemos) # For setting the seed
## Warning: package 'TeachingDemos' was built under R version 4.4.3
library(MASS) # For the birthwt dataset and lda/qda functions
library(caret) # For confusionMatrix
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Set the seed using TeachingDemos
char2seed("birthwt_analysis")
### Split the Data into Training (80%) and Test (20%) Sets
set.seed(123) # Ensure reproducibility in splitting
n <- nrow(bwt)
train_idx <- sample(1:n, size = round(0.8 * n), replace = FALSE)
train_bwt <- bwt[train_idx, ]
test_bwt <- bwt[-train_idx, ]
cat("Training set size:", nrow(train_bwt), "\n")
## Training set size: 151
cat("Test set size:", nrow(test_bwt), "\n")
## Test set size: 38
### Discriminant Analysis Models
# For discriminant analysis, we include only continuous predictors that are more likely to satisfy
# the multivariate normality assumption. Here we choose age and lwt.
lda_model <- lda(low ~ age + lwt, data = train_bwt)
qda_model <- qda(low ~ age + lwt, data = train_bwt)
# Make predictions on the test set
lda_pred <- predict(lda_model, newdata = test_bwt)$class
qda_pred <- predict(qda_model, newdata = test_bwt)$class
# Create confusion matrices using caret's confusionMatrix function (treat "low" as the positive outcome)
conf_lda <- confusionMatrix(lda_pred, test_bwt$low, positive = "low")
conf_qda <- confusionMatrix(qda_pred, test_bwt$low, positive = "low")
cat("LDA Confusion Matrix:\n")
## LDA Confusion Matrix:
print(conf_lda$table)
## Reference
## Prediction normal low
## normal 27 11
## low 0 0
cat("\nLDA Metrics:\n")
##
## LDA Metrics:
print(conf_lda$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity
## 0 1
cat("LDA Accuracy:", conf_lda$overall["Accuracy"], "\n\n")
## LDA Accuracy: 0.7105263
cat("QDA Confusion Matrix:\n")
## QDA Confusion Matrix:
print(conf_qda$table)
## Reference
## Prediction normal low
## normal 27 11
## low 0 0
cat("\nQDA Metrics:\n")
##
## QDA Metrics:
print(conf_qda$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity
## 0 1
cat("QDA Accuracy:", conf_qda$overall["Accuracy"], "\n\n")
## QDA Accuracy: 0.7105263
## Logistic Regression Models
# The logistic regression models from parts b) and f) use a larger set of predictors.
# Because ptl2 and ftv2 encode exactly the same groupings as ptd and ftv, the models from
# b) and f) are equivalent, so a single fit covers both.
# Here we fit the updated logistic regression using age, lwt, race, smoke, ptl2, ht, ui, and ftv2.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = train_bwt, family = binomial)
# Predict probabilities on the test set and convert them to classes using threshold 0.5.
logit_pred_prob <- predict(logit_model, newdata = test_bwt, type = "response")
logit_pred <- ifelse(logit_pred_prob > 0.5, "low", "normal")
logit_pred <- factor(logit_pred, levels = c("normal", "low"))
conf_logit <- confusionMatrix(logit_pred, test_bwt$low, positive = "low")
cat("Logistic Regression Confusion Matrix:\n")
## Logistic Regression Confusion Matrix:
print(conf_logit$table)
## Reference
## Prediction normal low
## normal 24 9
## low 3 2
cat("\nLogistic Regression Metrics:\n")
##
## Logistic Regression Metrics:
print(conf_logit$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity
## 0.1818182 0.8888889
cat("Logistic Regression Accuracy:", conf_logit$overall["Accuracy"], "\n")
## Logistic Regression Accuracy: 0.6842105
Interpretation:
1. LDA and QDA models:
- Both methods predict every observation as “normal” (i.e. the
majority class). This results in a moderate overall accuracy of about
71%, but it completely fails to identify any low birthweight cases
(sensitivity = 0).
- Although high specificity (100%) is obtained, for a serious clinical
condition like low birthweight, leaving all “low” cases undetected is
not acceptable—even if overall accuracy seems reasonable.
2. Logistic Regression:
- The logistic regression model, which uses a richer set of predictors
(including ptl2 and ftv2), correctly identifies 2 out of 11 low
birthweight cases (sensitivity ≈ 18.2%).
- Its overall accuracy is slightly lower (68.4%), and specificity is
slightly lower (88.9%) compared to LDA/QDA, but it has the advantage of
at least detecting some of the at-risk (“low”) cases.
3. Which Model Performs the Best?
- Overall Accuracy: LDA/QDA have a slight edge (≈71%)
compared to logistic regression (≈68%); however, this is largely because
the data are imbalanced and the “normal” class dominates.
- Sensitivity: Logistic regression is the only model
that correctly identifies any low birthweight cases (18.2% versus 0% for
LDA and QDA), which is critical if the goal is to screen for high-risk
pregnancies.
- Conclusion: In a real-world scenario—especially in
a clinical setting where missing a high-risk case (low birthweight) has
serious consequences—you would prioritize sensitivity.
Therefore, logistic regression performs better in
this context because, despite its slightly lower overall accuracy, it
does identify a fraction of the low birthweight cases while LDA and QDA
completely miss them.
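One way to act on that priority, sketched below with an illustrative (not tuned) cutoff of 0.3, is to lower the classification threshold of the logistic model, trading some specificity for sensitivity:
# Reclassify test-set predictions with a lower probability cutoff
logit_pred_low_cut <- factor(ifelse(logit_pred_prob > 0.3, "low", "normal"),
                             levels = c("normal", "low"))
confusionMatrix(logit_pred_low_cut, test_bwt$low, positive = "low")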
h. Using your final model from f), interpret the estimates for all covariates.
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = train_bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.257670 1.474143 -0.175 0.8612
## age -0.006270 0.042593 -0.147 0.8830
## lwt -0.013617 0.008361 -1.629 0.1034
## raceblack 1.104833 0.643369 1.717 0.0859 .
## raceother 1.136605 0.545198 2.085 0.0371 *
## smokeyes 0.981383 0.507433 1.934 0.0531 .
## ptl21 1.178750 0.510017 2.311 0.0208 *
## htyes 1.796111 0.859947 2.089 0.0367 *
## uiyes 0.821373 0.508352 1.616 0.1061
## ftv21 -0.388984 0.521600 -0.746 0.4558
## ftv22+ -0.462438 0.543327 -0.851 0.3947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 188.83 on 150 degrees of freedom
## Residual deviance: 155.79 on 140 degrees of freedom
## AIC: 177.79
##
## Number of Fisher Scoring iterations: 4
Interpretation:
1. (Intercept): –0.25767
- This is the estimated log-odds of low birthweight when all
predictors are at their reference (or zero, for continuous variables)
values.
- Practically, since age and lwt are continuous and the reference
categories for the factors are “white” for race, “no” for smoke, ht, ui,
“0” for ftv2, and “0” for ptl2, the intercept is not directly
interpretable on its own—but it serves as the baseline for combining
effects from the other variables.
2. age: –0.00627
- Each additional year of maternal age is associated with a decrease
in the log-odds of low birthweight by 0.00627.
- Expressed as an OR: exp(–0.00627) ≈ 0.994; that is, for each extra
year, the odds of low birthweight decrease by about 0.6%. This effect is
very small and not statistically significant (p = 0.883).
3. lwt (maternal weight): –0.01362
- For each additional pound in maternal weight, the log-odds of low
birthweight decrease by 0.01362.
- OR ≈ exp(–0.01362) ≈ 0.986, meaning roughly a 1.4% reduction in the
odds per additional pound. This suggests that heavier mothers tend to
have slightly lower odds of delivering a low-birthweight baby, although
this effect is only marginally significant (p = 0.103).
4. race (categorical; reference: white)
- raceblack: 1.10483
– Black mothers have log-odds of low birthweight that are 1.10483
higher than those for white mothers.
– OR ≈ exp(1.10483) ≈ 3.02, so black mothers have about 3 times higher
odds than white mothers; this effect is marginally significant (p =
0.086).
- raceother: 1.13661
– Mothers classified as “other” race have a log-odds increase of
1.13661 relative to white mothers.
– OR ≈ exp(1.13661) ≈ 3.12, meaning their odds are roughly 3.1 times
higher than white mothers; this effect is statistically significant (p =
0.037).
5. smokeyes: 0.98138
- Mothers who smoke (compared to those who do not) have a 0.98138
higher log-odds of low birthweight.
- OR ≈ exp(0.98138) ≈ 2.67, so smoking is associated with about 2.7
times higher odds of low birthweight (p = 0.053, marginal
significance).
6. ptl2 (previous premature labor; reference: “0” =
none)
- ptl21: 1.17875
– Mothers with one previous premature labor have log-odds that are
1.17875 higher than mothers with no previous premature labor.
– OR ≈ exp(1.17875) ≈ 3.25, meaning these mothers have roughly 3.3
times greater odds of having a low-birthweight baby; this finding is
statistically significant (p = 0.021).
7. ht (hypertension; htyes): 1.79611
- Mothers with a history of hypertension have log-odds that are
1.79611 higher than mothers without hypertension.
- OR ≈ exp(1.79611) ≈ 6.02, suggesting that hypertension increases the
odds of low birthweight by about sixfold (p = 0.037).
8. ui (uterine irritability; uiyes): 0.82137
- Mothers with uterine irritability have 0.82137 higher log-odds of
low birthweight than those without.
- OR ≈ exp(0.82137) ≈ 2.27, indicating roughly double the odds, though
this effect is not statistically significant (p = 0.106).
9. ftv2 (first-trimester physician visits; reference: “0”
visits)
- ftv21: –0.38898
– Mothers with exactly one visit (compared to no visits) have a
0.38898 lower log-odds of low birthweight.
– OR ≈ exp(–0.38898) ≈ 0.678, implying about a 32% reduction in the
odds, but this effect is not statistically significant (p =
0.456).
- ftv22+: –0.46244
– Mothers with two or more visits (compared to no visits) have a
0.46244 lower log-odds of low birthweight.
– OR ≈ exp(–0.46244) ≈ 0.630, corresponding to about a 37% reduction
in the odds. This effect is also not statistically significant (p =
0.395).
Maternal weight (lwt) shows a protective effect: higher weight is associated with lower odds of low birthweight.
Race: Both black and “other” race categories are associated with notably higher odds than white mothers.
Smoking tends to increase the odds of low birthweight, although significance is borderline.
Prior premature labor (ptl2): Having one previous premature labor significantly raises the odds by over threefold.
Hypertension (htyes): Has a very strong association, increasing the odds of low birthweight nearly sixfold.
Uterine irritability (uiyes) potentially doubles the odds, but this effect is not statistically robust in this model.
Prenatal visits (ftv2): The protective effects suggested by negative coefficients for 1 or 2+ visits versus no visits are not statistically significant after adjusting for other risk factors.
These interpretations help illustrate which maternal factors play a key role in predicting low birthweight and provide insight into where preventive or interventional efforts might be focused.
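For reference, the odds ratios quoted above (with 95% confidence intervals) can be produced in one step; the intervals use the profile-likelihood method, so confint() may take a moment:
# Odds ratios and 95% confidence intervals for the training-set model from f)
round(exp(cbind(OR = coef(logit_model), confint(logit_model))), 2)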