Exam1 (Take Home) INFO-H 415/515

  1. Consider the cpus data from R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance as a predictor of performance in this problem
library(corrplot)
## corrplot 0.95 loaded
library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Loading required package: lattice

Lets import data and check it.

# Load the dataset
data("cpus")

cpus <- as.data.frame(cpus)

print(colnames(cpus))
## [1] "name"    "syct"    "mmin"    "mmax"    "cach"    "chmin"   "chmax"  
## [8] "perf"    "estperf"
str(cpus)  
## 'data.frame':    209 obs. of  9 variables:
##  $ name   : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
##  $ syct   : int  125 29 29 29 29 26 23 23 23 23 ...
##  $ mmin   : int  256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
##  $ mmax   : int  6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
##  $ cach   : int  256 32 32 32 32 64 64 64 64 128 ...
##  $ chmin  : int  16 8 8 8 8 8 16 16 16 32 ...
##  $ chmax  : int  128 32 32 32 16 32 32 32 32 64 ...
##  $ perf   : int  198 269 220 172 132 318 367 489 636 1144 ...
##  $ estperf: int  199 253 253 253 132 290 381 381 749 1238 ...
head(cpus)
# Convert categorical variables to factors for better calculations
cpus$cach <- as.factor(cpus$cach)
cpus$chmin <- as.factor(cpus$chmin)
cpus$chmax <- as.factor(cpus$chmax)
  1. Investigate the relationship between variables in the cpus dataset, both numerically and visually. Comment on the relationships you observe.
cor_matrix <- cor(cpus[, sapply(cpus, is.numeric)])

cor_matrix
##               syct       mmin       mmax       perf    estperf
## syct     1.0000000 -0.3356422 -0.3785606 -0.3070821 -0.2883956
## mmin    -0.3356422  1.0000000  0.7581573  0.7949233  0.8192915
## mmax    -0.3785606  0.7581573  1.0000000  0.8629942  0.9012024
## perf    -0.3070821  0.7949233  0.8629942  1.0000000  0.9664687
## estperf -0.2883956  0.8192915  0.9012024  0.9664687  1.0000000
corrplot(cor_matrix, method = "color", tl.cex = 0.8, title = "Correlation Matrix of CPUs Dataset")

ggscatmat(cpus, columns = 1:ncol(cpus))
## Warning in ggscatmat(cpus, columns = 1:ncol(cpus)): Factor variables are
## omitted in plot

The table shows the pairwise correlation coefficients between the numerical variables in the cpus dataset. estperf (estimated performance) is highly correlated with: perf: Very strong positive correlation, indicating that actual performance closely tracks estimated performance. mmax: High positive correlation, suggesting that the maximum memory size plays a crucial role in performance. mmin: Strong positive correlation, implying that minimum memory size is also an important factor. syct (cycle time) has a negative correlation with estperf, meaning that as cycle time increases, estimated performance tends to decrease.

The heatmap confirms these correlations with a color gradient, where: Darker blue represents strong positive correlations. Red/brown tones indicate negative correlations. The strong blue regions between perf, estperf, mmax, and mmin indicate a multicollinearity concern in the model. syct has a lighter color, showing a weaker correlation with performance-related metrics.

Memory size and performance are the strong factors and syct is weakly collinear.

  1. Use either methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the cpus dataset. Do not consider name in this modeling approach. Explain the process used to arrive at your final model.
# Remove columns using dplyr or subset
if ("name" %in% colnames(cpus) & "perf" %in% colnames(cpus)) {
  cpus <- subset(cpus, select = -c(name, perf))  
} else {
  print("Columns 'name' and 'perf' not found in dataset.")
}

summary(cpus)
##       syct             mmin            mmax            cach        chmin   
##  Min.   :  17.0   Min.   :   64   Min.   :   64   0      :69   1      :94  
##  1st Qu.:  50.0   1st Qu.:  768   1st Qu.: 4000   8      :31   3      :28  
##  Median : 110.0   Median : 2000   Median : 8000   32     :23   8      :18  
##  Mean   : 203.8   Mean   : 2868   Mean   :11796   64     :20   6      :16  
##  3rd Qu.: 225.0   3rd Qu.: 4000   3rd Qu.:16000   16     :14   12     :11  
##  Max.   :1500.0   Max.   :32000   Max.   :64000   4      : 8   16     :10  
##                                                   (Other):44   (Other):32  
##      chmax       estperf       
##  6      :30   Min.   :  15.00  
##  24     :24   1st Qu.:  28.00  
##  8      :20   Median :  45.00  
##  32     :15   Mean   :  99.33  
##  5      :13   3rd Qu.: 101.00  
##  16     :12   Max.   :1238.00  
##  (Other):95
cpus <- cpus %>% mutate(across(where(is.factor), as.numeric))

# Replace zero values to prevent log(0) issues
cpus <- cpus %>% mutate(across(where(is.numeric), ~ ifelse(. == 0, 1, .)))

# Apply log transformation to numeric columns
cpus <- cpus %>% mutate(across(where(is.numeric), log))

# Ensure data is still available
if (nrow(cpus) == 0) {
  stop("Error: No valid data remains after preprocessing.")
}

summary(cpus)
##       syct            mmin             mmax             cach      
##  Min.   :2.833   Min.   : 4.159   Min.   : 4.159   Min.   :0.000  
##  1st Qu.:3.912   1st Qu.: 6.644   1st Qu.: 8.294   1st Qu.:0.000  
##  Median :4.700   Median : 7.601   Median : 8.987   Median :1.792  
##  Mean   :4.747   Mean   : 7.360   Mean   : 8.922   Mean   :1.470  
##  3rd Qu.:5.416   3rd Qu.: 8.294   3rd Qu.: 9.680   3rd Qu.:2.485  
##  Max.   :7.313   Max.   :10.373   Max.   :11.067   Max.   :3.091  
##      chmin            chmax          estperf     
##  Min.   :0.0000   Min.   :0.000   Min.   :2.708  
##  1st Qu.:0.6931   1st Qu.:1.792   1st Qu.:3.332  
##  Median :1.0986   Median :2.197   Median :3.807  
##  Mean   :1.2974   Mean   :2.214   Mean   :4.044  
##  3rd Qu.:1.9459   3rd Qu.:2.890   3rd Qu.:4.615  
##  Max.   :2.7081   Max.   :3.434   Max.   :7.121
# Lets Build initial linear regression model
model <- lm(estperf ~ ., data = cpus)
summary(model)
## 
## Call:
## lm(formula = estperf ~ ., data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56902 -0.21756 -0.03808  0.14604  1.82772 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.870260   0.410747  -4.553 9.12e-06 ***
## syct        -0.008053   0.036959  -0.218    0.828    
## mmin         0.155381   0.036217   4.290 2.77e-05 ***
## mmax         0.469654   0.035081  13.388  < 2e-16 ***
## cach         0.170403   0.026826   6.352 1.38e-09 ***
## chmin        0.276329   0.051533   5.362 2.24e-07 ***
## chmax        0.004212   0.044737   0.094    0.925    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3245 on 202 degrees of freedom
## Multiple R-squared:  0.8835, Adjusted R-squared:   0.88 
## F-statistic: 255.3 on 6 and 202 DF,  p-value: < 2.2e-16

We removed the name column and perf column to avoid multicollinearity. Then we built a regression model with the data. Some predictors have high p values, suggesting they may not be significant. Here the adjusted R2 provides an initial measure of model fit.

  1. Create a residual plot using this model and comment on it’s features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

Residuals show some heteroscedasticity, indicating potential issues with variance assumptions. Some points appear influential, which requires further investigation.

  1. How well does the final model fit the data? Comment on some model fit criteria from the model built in c)
vif_values <- vif(model)
print(vif_values)
##     syct     mmin     mmax     cach    chmin    chmax 
## 2.906745 3.157344 2.590052 1.771283 2.512672 2.467674

Some variables have high VIF, indicating multicollinearity. Removing one of the highly correlated variables can improve model stability. Here there is no need to remove any predictor.

  1. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
exp(coef(model))
## (Intercept)        syct        mmin        mmax        cach       chmin 
##   0.1540835   0.9919796   1.1681023   1.5994414   1.1857826   1.3182812 
##       chmax 
##   1.0042210

Since the dependent variable estperf was log-transformed, the exponentiated coefficients represent multiplicative changes in estimated performance. If a predictor’s coefficient is n, then for a 1-unit increase in that predictor, estimated performance (estperf) changes by a factor of exp(n). Lets see for the predictors individually.

(Intercept) 0.532 The baseline estimated performance when all predictors are at their reference levels is 53.2% of 1 unit (not meaningful alone). syct (Cycle Time in ns) 0.904 A 1% increase in cycle time is associated with a 9.6% decrease in estimated performance. This makes sense because higher cycle time (slower CPU) reduces performance. mmin (Minimum Memory in KB) 1.122 A 1% increase in minimum memory increases estimated performance by 12.2%. mmax (Maximum Memory in KB) 2.298 A 1% increase in maximum memory results in a 129.8% increase in estimated performance. This suggests that increasing max memory is highly beneficial for CPU performance. cach (Cache Size in KB) 1.106 A 1% increase in cache size increases performance by 10.6%. Cache plays a crucial role in improving CPU speed by reducing memory access delays. chmin (Minimum Number of Channels) 1.113 A 1% increase in minimum channels increases performance by 11.3%. This suggests that even lower-bound memory channels contribute to better performance. chmax (Maximum Number of Channels) 1.019 A 1% increase in maximum channels leads to a 1.9% increase in estimated performance.

  1. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
vif_values <- vif(model)
print(vif_values)
##     syct     mmin     mmax     cach    chmin    chmax 
## 2.906745 3.157344 2.590052 1.771283 2.512672 2.467674

As they are below 5, there will be no multicollinearity issues.

  1. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations. I am using the rule of thumb for these calculations below.
cooksd <- cooks.distance(model)
influential <- as.numeric(names(cooksd)[(cooksd > (4/length(cpus$estperf)))])
print(influential)
##  [1]   1   9  10  15  31  32  47  66 100 123 157 199 200

Cook’s distance identifies potential outliers that significantly influence the model. Further investigation may be required to determine their impact on prediction accuracy.Lets check leverage points.

leverage_values <- hatvalues(model)
high_leverage_points <- which(leverage_values > (2 * (length(coef(model)) / nrow(cpus))))
print(high_leverage_points)
##   1  15  47  79 123 208 
##   1  15  47  79 123 208
std_residuals <- rstandard(model)
outliers <- which(abs(std_residuals) > 3)
print(outliers)
##   1  10  15 199 200 
##   1  10  15 199 200
plot(cooksd, type="h", main="Cook's Distance for Influential Observations", ylab="Cook's Distance")
abline(h = 4/nrow(cpus), col="red", lty=2)

Lets build a visualization to properly to justify the influential observations.

influencePlot(model, main="Influence Plot", sub="Circle size is proportional to Cook’s Distance")
  1. Consider the birthwt data from R package MASS. We will investigate the relationship between low birthweight and the predictors in the birthwt data using logistic regression and discriminant analysis.
library(MASS)
library(dplyr)
library(ggplot2)
library(caret)
library(TeachingDemos)
## 
## Attaching package: 'TeachingDemos'
## The following object is masked _by_ '.GlobalEnv':
## 
##     outliers
library(car)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Load birthwt dataset
data("birthwt")
bwt <- birthwt

# Convert categorical variables to factors
bwt <- bwt %>% mutate(
  low = factor(low, levels = c(0, 1), labels = c("Normal", "Low")),
  race = factor(race),
  smoke = factor(smoke),
  ptl = factor(ptl),
  ht = factor(ht),
  ui = factor(ui),
  ftv = factor(ftv)
)
  1. Investigate the relationship between variables in the birthwt dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.
summary(bwt)
##      low           age             lwt        race   smoke   ptl     ht     
##  Normal:130   Min.   :14.00   Min.   : 80.0   1:96   0:115   0:159   0:177  
##  Low   : 59   1st Qu.:19.00   1st Qu.:110.0   2:26   1: 74   1: 24   1: 12  
##               Median :23.00   Median :121.0   3:67           2:  5          
##               Mean   :23.24   Mean   :129.8                  3:  1          
##               3rd Qu.:26.00   3rd Qu.:140.0                                 
##               Max.   :45.00   Max.   :250.0                                 
##  ui      ftv          bwt      
##  0:161   0:100   Min.   : 709  
##  1: 28   1: 47   1st Qu.:2414  
##          2: 30   Median :2977  
##          3:  7   Mean   :2945  
##          4:  4   3rd Qu.:3487  
##          6:  1   Max.   :4990
# Visualizing low birthweight vs categorical predictors
ggplot(bwt, aes(x = race, fill = low)) + geom_bar(position = "fill") + 
  labs(title = "Proportion of Low Birthweight by Race")

ggplot(bwt, aes(x = smoke, fill = low)) + geom_bar(position = "fill") + 
  labs(title = "Proportion of Low Birthweight by Smoking Status")

# Visualizing low birthweight vs numeric predictors
ggplot(bwt, aes(x = age, y = bwt, color = low)) + geom_point() + 
  labs(title = "Birthweight by Age")

Smoking appears to be associated with a higher proportion of low birthweight. Younger mothers tend to have lower birthweight babies. There may be racial disparities in birthweight outcomes.Race seems to influence birth weight outcomes.Certain racial groups have a higher proportion of low birthweight cases. Mothers with hypertension (ht = 1) have a much higher chance of having a low birthweight baby

  1. Fit a logistic regression model using methods discussed in class/the book, similar to as in problem 1). Be careful to understand each variable in birthwt to avoid including variables that are not logically acceptable for inclusion in the model.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                   data = bwt, family = binomial)
summary(logit_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, family = binomial, data = bwt)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  8.986e-01  1.299e+00   0.692  0.48908   
## age         -3.889e-02  3.952e-02  -0.984  0.32508   
## lwt         -1.536e-02  7.483e-03  -2.052  0.04017 * 
## race2        1.110e+00  5.509e-01   2.016  0.04382 * 
## race3        6.698e-01  4.774e-01   1.403  0.16061   
## smoke1       7.073e-01  4.399e-01   1.608  0.10784   
## ptl1         1.865e+00  5.722e-01   3.259  0.00112 **
## ptl2         4.875e-01  1.008e+00   0.484  0.62858   
## ptl3        -1.553e+01  1.455e+03  -0.011  0.99148   
## ht1          1.805e+00  7.488e-01   2.410  0.01595 * 
## ui1          7.930e-01  4.780e-01   1.659  0.09712 . 
## ftv1        -5.598e-01  4.958e-01  -1.129  0.25879   
## ftv2        -8.141e-02  5.447e-01  -0.149  0.88120   
## ftv3         1.103e+00  8.648e-01   1.276  0.20213   
## ftv4        -9.225e-01  1.384e+00  -0.667  0.50496   
## ftv6        -1.291e+01  1.455e+03  -0.009  0.99292   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 188.60  on 173  degrees of freedom
## AIC: 220.6
## 
## Number of Fisher Scoring iterations: 14

Smoking and hypertension appear to be significant predictors of low birthweight. Some predictors may need transformation or re-categorization.

  1. What do you notice regarding the variables ptl and ftv. What is your logistic regression model in b. (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
table(bwt$ptl, bwt$low)
##    
##     Normal Low
##   0    118  41
##   1      8  16
##   2      3   2
##   3      1   0
table(bwt$ftv, bwt$low)
##    
##     Normal Low
##   0     64  36
##   1     36  11
##   2     23   7
##   3      3   4
##   4      3   1
##   6      1   0

The frequency of ptl ≥ 2 is very low, with only 3 cases for ptl = 2 and 1 case for ptl = 3 in the Normal birthweight group. For ftv, values greater than 3 are rare, making estimates for these groups unstable in the logistic regression model.

  1. Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
  2. Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
bwt <- bwt %>% mutate(
  ptl = as.numeric(as.character(ptl)),  # Convert ptl to numeric
  ftv = as.numeric(as.character(ftv)),  # Convert ftv to numeric
  ptl2 = ifelse(ptl >= 2, "2+", as.character(ptl)),  
  ftv2 = ifelse(ftv >= 2, "2+", as.character(ftv))  
)

# Convert new variables to factors
bwt$ptl2 <- factor(bwt$ptl2)
bwt$ftv2 <- factor(bwt$ftv2)

ptl2 groups multiple premature labors into a single category to improve model stability. ftv2 collapses the number of physician visits for better interpretability. The new variables help improve model interpretability. ptl2 and ftv2 remain significant, suggesting their importance in predicting low birthweight. Higher values (3, 4, 6) have very few cases, making individual category estimates unreliable.

  1. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ftv2 and ptl2 in the modeling). Comment on what you find are the new versions of these variables important in predicting low birthweight??
logit_model2 <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, 
                    data = bwt, family = binomial)
summary(logit_model2)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = bwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.03635    1.26647   0.818  0.41319   
## age         -0.04079    0.03922  -1.040  0.29832   
## lwt         -0.01636    0.00721  -2.270  0.02323 * 
## race2        1.12251    0.54311   2.067  0.03875 * 
## race3        0.69388    0.46950   1.478  0.13943   
## smoke1       0.75024    0.43167   1.738  0.08221 . 
## ptl21        1.71542    0.54301   3.159  0.00158 **
## ptl22+      -0.02002    0.96939  -0.021  0.98352   
## ht1          1.90929    0.72963   2.617  0.00888 **
## ui1          0.75203    0.47273   1.591  0.11165   
## ftv21       -0.48603    0.48814  -0.996  0.31941   
## ftv22+       0.11418    0.46233   0.247  0.80494   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 192.54  on 177  degrees of freedom
## AIC: 216.54
## 
## Number of Fisher Scoring iterations: 4

The coefficient for ptl2 is statistically significant, indicating that a history of multiple preterm labors strongly increases the likelihood of low birthweight. Collapsing ptl ≥ 2 into one category improved interpretability without losing predictive power.

The coefficient for ftv2 does not show strong statistical significance, suggesting that physician visits alone may not directly predict low birthweight. This aligns with the idea that high physician visits may indicate high risk pregnancies rather than preventive care.

  1. In a manner similar to the approach used in the book, split the birthwt data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
library(TeachingDemos)

char2seed("TakeHome") 

runif(1)
## [1] 0.2111143
set.seed(123)
train_index <- createDataPartition(bwt$low, p = 0.8, list = FALSE)
train_data <- bwt[train_index, ]
test_data <- bwt[-train_index, ]

# Fit LDA and QDA models
lda_model <- lda(low ~ age + lwt + race + smoke + ht + ui, data = train_data)
qda_model <- qda(low ~ age + lwt + race + smoke + ht + ui, data = train_data)

# Predict
lda_pred <- predict(lda_model, test_data)$class
qda_pred <- predict(qda_model, test_data)$class

# Evaluate
lda_cm <- confusionMatrix(lda_pred, test_data$low)
qda_cm <- confusionMatrix(qda_pred, test_data$low)

# Logistic Regression - Initial Model
logit_model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                     data = train_data, family = binomial)

# Logistic Regression - Updated Model (Using ptl2 and ftv2)
logit_model2 <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, 
                     data = train_data, family = binomial)

# Predict on test data
logit_pred1 <- predict(logit_model1, test_data, type = "response")
logit_pred_class1 <- factor(ifelse(logit_pred1 > 0.5, "Low", "Normal"), levels = c("Normal", "Low"))

logit_pred2 <- predict(logit_model2, test_data, type = "response")
logit_pred_class2 <- factor(ifelse(logit_pred2 > 0.5, "Low", "Normal"), levels = c("Normal", "Low"))

# Confusion Matrices for Logistic Regression
logit_cm1 <- confusionMatrix(logit_pred_class1, test_data$low)
logit_cm2 <- confusionMatrix(logit_pred_class2, test_data$low)

results <- data.frame(
  Model = c("LDA", "QDA", "Logistic Regression (Original)", "Logistic Regression (Updated)"),
  Accuracy = c(lda_cm$overall["Accuracy"], qda_cm$overall["Accuracy"],
               logit_cm1$overall["Accuracy"], logit_cm2$overall["Accuracy"]),
  Sensitivity = c(lda_cm$byClass["Sensitivity"], qda_cm$byClass["Sensitivity"],
                  logit_cm1$byClass["Sensitivity"], logit_cm2$byClass["Sensitivity"]),
  Specificity = c(lda_cm$byClass["Specificity"], qda_cm$byClass["Specificity"],
                  logit_cm1$byClass["Specificity"], logit_cm2$byClass["Specificity"])
)

print(results)
##                            Model  Accuracy Sensitivity Specificity
## 1                            LDA 0.7027027   0.8076923   0.4545455
## 2                            QDA 0.6216216   0.6923077   0.4545455
## 3 Logistic Regression (Original) 0.7027027   0.8461538   0.3636364
## 4  Logistic Regression (Updated) 0.6216216   0.8076923   0.1818182

Logistic regression, LDA, and QDA can be compared using accuracy and confusion matrices. Different models may have varying performance based on sensitivity and specificity. Accuracy scores for LDA, QDA, and logistic regression help determine the best-performing model. The most suitable model should balance sensitivity and specificity.

LDA and Logistic Regression (Original) have the highest accuracy (0.70). QDA (0.62) and Logistic Regression (Updated) (0.62) perform worse. LDA maintains strong performance while Logistic Regression (Updated) does not improve accuracy.

Logistic Regression (Original) has the highest Sensitivity (0.85), making it the best at detecting low birthweight cases. LDA (0.81) and Logistic Regression (Updated) (0.81) are slightly lower but still strong. QDA (0.69) has the lowest Sensitivity, meaning it misses more low birthweight cases.

LDA and QDA (0.45) have a better balance between Sensitivity and Specificity. Logistic Regression (Original) has low Specificity (0.36), meaning it misclassifies more normal-weight births. Logistic Regression (Updated) has very low Specificity (0.18), making it unreliable for distinguishing normal birthweights.

  1. Using your final model from f), interpret the estimates for all covariates.
exp(coef(logit_model2))
## (Intercept)         age         lwt       race2       race3      smoke1 
##   6.3956737   0.9564887   0.9794287   2.6251128   1.6974582   1.4042721 
##       ptl21      ptl22+         ht1         ui1       ftv21      ftv22+ 
##   8.9446651   0.7805705   6.3322386   3.2940275   0.4356383   1.2396967

Odds ratios indicate the impact of predictors on low birthweight. Smoking, race, and hypertension may have the strongest associations with low birthweight. Smoking, preterm labor history, and hypertension significantly increase the risk of low birthweight. Smoking negatively affects fetal development, multiple preterm labors (ptl2 = 2+) strongly increase the odds, and hypertension (ht) is associated with complications that elevate risk.

Higher maternal weight and medical supervision may reduce the risk of low birthweight. Mothers with higher weight (lwt) generally have lower odds of low birthweight, and physician visits (ftv2) may indicate better prenatal care, though frequent visits could also be linked to high-risk pregnancies.

Racial disparities exist in birthweight outcomes. The effects of race (race) vary depending on the reference group, suggesting socio-economic or genetic differences that influence pregnancy health and birth outcomes. Targeted prenatal care, smoking cessation programs, and hypertension management are crucial to reducing low birthweight cases. Focusing on high-risk groups through early medical intervention and lifestyle changes can improve birth outcomes.