if (!require("vcd")) install.packages("vcd")
## Loading required package: vcd
## Loading required package: grid
library(vcd)
library(MASS)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(car)
## Loading required package: carData
library(ggplot2)
# Loading Dataset and observing its summary:
data(cpus)
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
head(cpus)
## name syct mmin mmax cach chmin chmax perf estperf
## 1 ADVISOR 32/60 125 256 6000 256 16 128 198 199
## 2 AMDAHL 470V/7 29 8000 32000 32 8 32 269 253
## 3 AMDAHL 470/7A 29 8000 32000 32 8 32 220 253
## 4 AMDAHL 470V/7B 29 8000 32000 32 8 32 172 253
## 5 AMDAHL 470V/7C 29 8000 16000 32 8 16 132 132
## 6 AMDAHL 470V/8 26 8000 32000 64 8 32 318 290
# Correlation Matrix:
# Only numerical variables are considered here for better clarity:
vars <- c("syct", "mmin", "mmax", "cach", "chmin", "chmax")
cor_matrix <- cor(cpus[, vars])
print(cor_matrix)
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
The correlation matrix shows that the memory-related predictors, minimum memory size (mmin) and maximum memory size (mmax), are the most strongly associated with each other, with a high positive correlation of 0.758. This suggests potential multicollinearity in modeling.
The channel variables chmin and chmax are moderately correlated with each other (0.548) and with the memory variables.
Cache size (cach) shows moderate correlations with the other predictors (roughly 0.49 to 0.58), indicating a supportive but less dominant role compared to the memory-related variables.
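As a quick numeric screen (a sketch using an illustrative 0.7 cutoff and a helper object high_cor), we can list the predictor pairs whose absolute correlation exceeds 0.7; in this data only the mmin/mmax pair qualifies:
# Flag predictor pairs with |r| > 0.7 as a rough multicollinearity screen:
high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           r    = cor_matrix[high_cor])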
# Visualizing relationships using a scatterplot matrix:
ggpairs(cpus[, vars], title = "Scatterplot Matrix for cpus Data")
The scatterplot matrix above reveals non-linear patterns, particularly between mmin and mmax, suggesting that log-transforming these variables could improve linearity. The wide range of values for mmin and mmax dominates the plots and can obscure trends. Additionally, the increasing spread at higher values suggests heteroscedasticity, indicating a need for transformation or alternative modeling approaches.
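As a rough check of that suggestion (a sketch, not part of the modeling below), the same matrix can be redrawn with the strictly positive, skewed variables syct, mmin, and mmax on the log scale; cach, chmin, and chmax contain zeros and are left untransformed. The data frame log_vars is just an illustrative helper.
# Scatterplot matrix with the skewed, strictly positive predictors log-transformed:
log_vars <- data.frame(log_syct = log(cpus$syct),
                       log_mmin = log(cpus$mmin),
                       log_mmax = log(cpus$mmax),
                       cach  = cpus$cach,
                       chmin = cpus$chmin,
                       chmax = cpus$chmax)
ggpairs(log_vars, title = "Scatterplot Matrix with Log-Transformed syct, mmin, mmax")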
# Linear regression model:
fit <- lm(estperf ~ . - name - perf, data = cpus)
summary(fit)
##
## Call:
## lm(formula = estperf ~ . - name - perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.198 -24.064 1.429 23.136 288.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.648e+01 6.288e+00 -10.574 < 2e-16 ***
## syct 6.596e-02 1.369e-02 4.818 2.85e-06 ***
## mmin 1.431e-02 1.428e-03 10.021 < 2e-16 ***
## mmax 6.590e-03 5.016e-04 13.138 < 2e-16 ***
## cach 4.945e-01 1.091e-01 4.533 9.94e-06 ***
## chmin -1.723e-01 6.687e-01 -0.258 0.797
## chmax 1.201e+00 1.720e-01 6.985 4.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.88 on 202 degrees of freedom
## Multiple R-squared: 0.9109, Adjusted R-squared: 0.9082
## F-statistic: 344.1 on 6 and 202 DF, p-value: < 2.2e-16
From the above results, we can see that every predictor except chmin is strongly significant, with very small p-values. Among them, mmax and mmin have the largest t-statistics and are the most significant variables.
par(mfrow = c(2, 2))
plot(fit)
The diagnostic plots show heteroscedasticity: in the residuals-vs-fitted plot the spread of the residuals is clearly unequal across the fitted values, and the Q-Q plot deviates slightly from the reference line in the left tail. The residuals-vs-leverage plot flags a few potential outliers, specifically points that stand out in terms of Cook's distance. A better model may be needed to stabilize the variance.
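As a formal complement to the visual impression (optional, using the car package that is already loaded), the score test for non-constant error variance can be run on this fit; a small p-value would support the heteroscedasticity seen in the plots.
# Score test for non-constant variance (car::ncvTest); the same check can be
# repeated on the log-transformed refit below for comparison.
ncvTest(fit)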
Let’s take the log of estperf and refit the model.
# The original model was fit as:
# fit <- lm(estperf ~ . - name - perf, data = cpus)
# Fit a new model with log-transformed response
fit_log <- lm(log(estperf) ~ . - name - perf, data = cpus)
# Summarize the new model
summary(fit_log)
##
## Call:
## lm(formula = log(estperf) ~ . - name - perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96828 -0.12609 0.02829 0.15471 0.46886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.293e+00 2.934e-02 112.254 < 2e-16 ***
## syct -4.510e-04 6.389e-05 -7.059 2.64e-11 ***
## mmin 1.278e-05 6.662e-06 1.919 0.0565 .
## mmax 5.386e-05 2.341e-06 23.011 < 2e-16 ***
## cach 7.024e-03 5.090e-04 13.801 < 2e-16 ***
## chmin 4.808e-03 3.120e-03 1.541 0.1249
## chmax -1.598e-03 8.024e-04 -1.992 0.0478 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.9455
## F-statistic: 602.1 on 6 and 202 DF, p-value: < 2.2e-16
# Check diagnostics again
par(mfrow = c(2, 2))
plot(fit_log)
From the above plots, we can see that the model has improved: the red trend line in the residuals-vs-fitted plot is now nearly flat, and the Q-Q plot follows the reference line more closely than before. The fit is still not perfect, but the heteroscedasticity is noticeably reduced.
The final model has an R-squared of 0.947, meaning about 94.7% of the variance in log(estperf) is captured by the model. The residual standard error drops from 46.88 in the first model to 0.2188 in the final model, although the two values are not directly comparable because the response is now on the log scale.
Comparing the residual plots of the first and final models also shows a better fit, as evidenced by the reduced heteroscedasticity.
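Since the two residual standard errors are on different scales, a rough comparison on the original estperf scale can be made by naively back-transforming the log-model fits with exp() (a sketch that ignores retransformation bias; rmse_raw and rmse_log are illustrative names):
# RMSE of both models on the original estperf scale (naive back-transform of
# the log model; no bias correction applied).
rmse_raw <- sqrt(mean((cpus$estperf - fitted(fit))^2))
rmse_log <- sqrt(mean((cpus$estperf - exp(fitted(fit_log)))^2))
c(raw_model = rmse_raw, log_model_backtransformed = rmse_log)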
Intercept (𝛃0 = 3.293):
Although not a realistic configuration, when all other predictors are zero the baseline estimated performance is e^(3.293) ≈ 26.9.
syct (𝛃(syct) = -4.510e-04):
A 1-unit increase in cycle time decreases log(estperf) by 0.000451, which corresponds to multiplying estperf by e^(-0.000451) ≈ 0.99955, i.e., roughly a 0.045% decrease.
mmin (𝛃(mmin) = 1.278e-05):
Since mmin is recorded in KB and 1 MB = 1024 KB, the effect of one additional MB is 1.278e-05 * 1024 ≈ 0.0131 on the log scale, so e^(0.0131) ≈ 1.013, i.e., about a 1.3% increase in estperf per additional MB of mmin (note that this coefficient is only marginally significant, p ≈ 0.057).
mmax (𝛃(mmax) = 5.386e-05):
In case of MB, 1 MB = 1024 KB. So, 5.386e-05 * 1024 = 0.0552 => e^(0.0552) = approx 1.0567, i.e., about a 5.7% increase in estperf can happen for each additional 1 MB of mmax.
cach (𝛃(cach) = 7.024e-03):
A 1 KB increase in cache size adds 0.007024 to the log(estperf).
Exponentially: e^(0.007024) = approx 1.00705, about a 0.70% increase in estperf.
chmin (𝛃(chmin) = 4.808e-03):
Each 1-unit increase in chmin multiplies estperf by e^(0.004808) ≈ 1.0048, or about a 0.48% increase. However, the p-value is 0.1249, so this effect is not statistically significant, and we cannot be confident that chmin affects estperf.
chmax (𝛃(chmax) = -1.598e-03):
Each 1-unit increase in chmax multiplies estperf by e^(-0.001598) ≈ 0.9984, or about a 0.16% decrease. The p-value is 0.0478, so the effect is only marginally significant at the 5% level.
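The percentage effects quoted above can be reproduced directly from the fitted coefficients: (exp(beta) - 1) * 100 gives the approximate percent change in estperf per one-unit increase in each predictor (per KB for the memory variables).
# Approximate percent change in estperf per one-unit increase in each predictor
# (intercept excluded).
round((exp(coef(fit_log)[-1]) - 1) * 100, 4)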
vif_values <- vif(fit_log)
vif_values
## syct mmin mmax cach chmin chmax
## 1.201644 2.902002 3.274002 1.858349 1.966234 1.891392
The VIF values for the new model range from about 1.2 to 3.3, all well below the usual cutoff of 5, which indicates no serious multicollinearity. mmin and mmax have somewhat higher VIFs than the others, consistent with their strong mutual correlation, but they are still within an acceptable range. Hence, the new model is performing fairly well.
# Assume your final model is:
fit_final <- lm(log(estperf) ~ . - name - perf, data = cpus)
# 1. Cook's Distance -------------------------------------------
# Calculate Cook's distances
cooksd <- cooks.distance(fit_final)
# Plot Cook's distance values
plot(cooksd, type = "h", main = "Cook's Distance",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = 4/length(cooksd), col = "red", lty = 2)
text(x = 1:length(cooksd), y = cooksd, labels = ifelse(cooksd > 4/length(cooksd), names(cooksd), ""), pos = 4, cex = 0.7)
# Rule of thumb: Observations with Cook's distance > 4/n (n = number of observations) may be influential.
threshold <- 4/length(cooksd)
cat("Cook's distance threshold (4/n):", threshold, "\n")
## Cook's distance threshold (4/n): 0.01913876
# 2. Leverage (Hat Values) -------------------------------------
# Calculate hat values (leverages)
hats <- hatvalues(fit_final)
# Plot hat values
plot(hats, type = "h", main = "Leverage Values",
ylab = "Leverage", xlab = "Observation Index")
# Rule of thumb: Leverage > 2*(p+1)/n might be considered high, where p = number of predictors.
p <- length(coef(fit_final)) - 1 # number of predictors
leverage_threshold <- 2*(p + 1)/nrow(cpus)
abline(h = leverage_threshold, col = "red", lty = 2)
cat("Leverage threshold (2*(p+1)/n):", leverage_threshold, "\n")
## Leverage threshold (2*(p+1)/n): 0.06698565
# 3. Influence Plot --------------------------------------------
# Generate an influence plot that shows Cook's distance, leverage, and studentized residuals.
influencePlot(fit_final, main="Influence Plot",
sub="Circle size is proportional to Cook's Distance")
## StudRes Hat CookD
## 9 -4.023115 0.1316038 0.325910095
## 10 -5.819145 0.3273595 2.024883077
## 197 -0.328550 0.3417243 0.008040722
## 200 -3.383141 0.2644439 0.558937693
# 4. Standardized Residuals -------------------------------------
# Calculate standardized residuals
std_resid <- rstandard(fit_final)
# Plot standardized residuals vs. fitted values
plot(fitted(fit_final), std_resid,
xlab = "Fitted Values",
ylab = "Standardized Residuals",
main = "Standardized Residuals vs Fitted Values")
abline(h = c(-2, 2), col = "red", lty = 2)
Outliers / Influential Points:
#10 is the most influential observation, with by far the largest Cook's distance, high leverage, and a large negative studentized residual.
#9 and #200 also show elevated influence, combining large negative residuals with Cook's distances well above the 4/n threshold.
#197 has very high leverage but a small residual, so its Cook's distance remains low; no other observation comes close to #10's level of influence.
Implications: these influential points, especially #10, could be pulling the coefficient estimates, so it is worth inspecting them and checking how sensitive the fit is to their removal.
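One way to act on this (a sketch; fit_sens is an illustrative name) is to refit the final model without the three observations flagged by influencePlot() and compare the coefficients to gauge how sensitive the fit is to them:
# Sensitivity check: refit without the flagged influential observations and
# compare coefficients side by side.
fit_sens <- lm(log(estperf) ~ . - name - perf, data = cpus[-c(9, 10, 200), ])
round(cbind(full_data = coef(fit_final), without_9_10_200 = coef(fit_sens)), 5)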
data("birthwt")
# Inspect structure and basic summary
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
#Visualizations:
# Example: Mosaic plot of low birthweight vs. race
mosaic(~ low + race, data = birthwt, shade = TRUE, main = "Mosaic Plot: Low Birthweight vs. Race")
# Example: Boxplot of mother's weight (lwt) by low birthweight status
boxplot(lwt ~ low, data = birthwt,
main = "Mother's Weight by Birthweight Status",
xlab = "Low Birthweight",
ylab = "Mother's Weight (lbs)")
From the above summaries, we can see that, unlike in question #1, not all predictors are numerical; most of the variables here are categorical, representing binary indicators.
The visualizations above also help in understanding the data set. The mosaic plot shows that most White mothers did not have a low-birthweight child, whereas comparatively few Black mothers fall in the "no low birthweight" group.
The boxplot shows that mothers whose babies had low birthweight tended to weigh less themselves than mothers whose babies did not.
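The mosaic-plot impression can be quantified with row proportions of low birthweight by race (here still using the raw integer coding, where 1 = White, 2 = Black, 3 = Other); this is just a descriptive cross-check:
# Proportion of low-birthweight outcomes within each race category.
round(prop.table(table(race = birthwt$race, low = birthwt$low), margin = 1), 2)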
# Load necessary package and data
library(MASS)
data(birthwt)
# Coding variables so that categorical variables are factors:
birthwt$low <- factor(birthwt$low, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$ht <- factor(birthwt$ht, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$ui <- factor(birthwt$ui, levels = c(0, 1), labels = c("No", "Yes"))
# Logistic regression model:
# We exclude bwt because it is used to define low birthweight.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
data = birthwt, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.480623 1.196888 0.402 0.68801
## age -0.029549 0.037031 -0.798 0.42489
## lwt -0.015424 0.006919 -2.229 0.02580 *
## raceBlack 1.272260 0.527357 2.413 0.01584 *
## raceOther 0.880496 0.440778 1.998 0.04576 *
## smokeYes 0.938846 0.402147 2.335 0.01957 *
## ptl 0.543337 0.345403 1.573 0.11571
## htYes 1.863303 0.697533 2.671 0.00756 **
## uiYes 0.767648 0.459318 1.671 0.09467 .
## ftv 0.065302 0.172394 0.379 0.70484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 201.28 on 179 degrees of freedom
## AIC: 221.28
##
## Number of Fisher Scoring iterations: 4
The model is assuming that the effect of each additional unit increase in ptl and ftv is constant on the log‐odds scale. In other words, it implies that an increase of one previous preterm labor (ptl) always increases the log odds of a low‐birth weight outcome by about 0.54, and that one extra physician visit in the first trimester (ftv) increases the log odds by about 0.065—regardless of the starting point. This is the standard “linearity in the log-odds” assumption inherent in logistic regression.
However, these assumptions may not be realistic. For count variables like ptl and ftv, the effect may not be strictly linear over their entire range. For example, going from 0 to 1 previous preterm labor might have a much different impact than going from 2 to 3, or the effect of additional physician visits might plateau after a certain number of visits. The non-significant coefficient for ftv (p = 0.70) and the moderate effect (with high uncertainty) for ptl (p ≈ 0.116) suggest that the relationship may not be adequately captured by a simple linear term.
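The raw count distributions make this concern concrete: most mothers have zero previous preterm labors and zero or one visit, leaving very few observations to pin down a linear effect at higher counts.
# Frequency tables of the two count predictors before recoding.
table(ptl = birthwt$ptl)
table(ftv = birthwt$ftv)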
# Creating a new variable 'ptl2': "0" if ptl equals 0, "1+" otherwise.
birthwt$ptl2 <- factor(ifelse(birthwt$ptl == 0, "0", "1+"))
table(birthwt$ptl2)
##
## 0 1+
## 159 30
We can see that 159 mothers had no previous premature labors and 30 had at least one.
# Creating a new variable 'ftv2' by collapsing ftv into three categories:
# "0" for no visits, "1" for one visit, and "2+" for two or more visits.
birthwt$ftv2 <- with(birthwt, ifelse(ftv == 0, "0",
ifelse(ftv == 1, "1", "2+")))
birthwt$ftv2 <- factor(birthwt$ftv2, levels = c("0", "1", "2+"))
print(table(birthwt$ftv2))
##
## 0 1 2+
## 100 47 42
# Creating a cross-tabulation of ftv2 with the outcome 'low'
ftv2_table <- table(birthwt$ftv2, birthwt$low)
print(ftv2_table)
##
## No Yes
## 0 64 36
## 1 36 11
## 2+ 30 12
# Calculating row proportions to see the probability of low birthweight by ftv2 level
ftv2_prop <- prop.table(ftv2_table, margin = 1)
print(round(ftv2_prop, 2))
##
## No Yes
## 0 0.64 0.36
## 1 0.77 0.23
## 2+ 0.71 0.29
From the table above, 100 mothers had no first-trimester doctor visits, 47 had one visit, and 42 had two or more visits.
Within those groups, 64%, 77%, and 71% respectively had no low-birthweight outcome, while 36%, 23%, and 29% respectively did.
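As an informal check (ignoring the other covariates, unlike the regression refit below), a chi-squared test of independence on this cross-tabulation gives a sense of whether ftv2 is associated with low birthweight at all:
# Chi-squared test of association between ftv2 and low birthweight.
chisq.test(ftv2_table)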
# Refit the logistic regression model using ptl2 and ftv2
logit_model_new <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = birthwt, family = binomial)
summary(logit_model_new)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## raceBlack 1.19241 0.53597 2.225 0.02609 *
## raceOther 0.74069 0.46174 1.604 0.10869
## smokeYes 0.75553 0.42502 1.778 0.07546 .
## ptl21+ 1.34376 0.48062 2.796 0.00518 **
## htYes 1.91317 0.72074 2.654 0.00794 **
## uiYes 0.68019 0.46434 1.465 0.14296
## ftv21 -0.43638 0.47939 -0.910 0.36268
## ftv22+ 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
Previous Preterm Labors (ptl2): having one or more previous preterm labors is associated with a 1.344 increase in the log odds of low birthweight (p ≈ 0.005), a clearly significant effect.
First-Trimester Physician Visits (ftv2): neither one visit (estimate -0.436, p ≈ 0.363) nor two or more visits (estimate 0.179, p ≈ 0.695) differs significantly from the zero-visit reference group.
Overall, the new variable ptl2 appears to capture an important risk factor for low birth weight, while the recoded ftv2 does not seem to add predictive value in this model.
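This can be checked more formally with likelihood-ratio tests for dropping each term from the refitted model; the ftv2 row tests both dummy coefficients jointly, which is a cleaner way to judge whether the recoded visit variable contributes anything:
# Likelihood-ratio (chi-squared) tests for dropping each term from the model.
drop1(logit_model_new, test = "Chisq")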
# Install packages if not already installed
if (!require("caret")) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
if (!require("MASS")) install.packages("MASS")
library(caret)
library(MASS)
# Loading the birthwt data
data(birthwt)
# Recoding variables to factors as appropriate
birthwt$low <- factor(birthwt$low, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0, 1),
labels = c("No", "Yes"))
birthwt$ht <- factor(birthwt$ht, levels = c(0, 1),
labels = c("No", "Yes"))
birthwt$ui <- factor(birthwt$ui, levels = c(0, 1),
labels = c("No", "Yes"))
# Creating new variables:
birthwt$ptl2 <- factor(ifelse(birthwt$ptl == 0, "0", "1+"),
levels = c("0", "1+"))
birthwt$ftv2 <- factor(with(birthwt, ifelse(ftv == 0, "0",
ifelse(ftv == 1, "1", "2+"))),
levels = c("0", "1", "2+"))
# Splitting Data (Stratified on 'low')
set.seed(1234)
trainIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train <- birthwt[trainIndex, ]
test <- birthwt[-trainIndex, ]
train$ptl2 <- factor(train$ptl2, levels = c("0", "1+"))
train$ftv2 <- factor(train$ftv2, levels = c("0", "1", "2+"))
# Fitting Models on the Training Set
# (a) Logistic Regression
logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = train, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.265238 1.443592 0.184 0.85422
## age -0.003943 0.043030 -0.092 0.92699
## lwt -0.017885 0.007891 -2.267 0.02342 *
## raceBlack 1.448961 0.612146 2.367 0.01793 *
## raceOther 0.911531 0.520036 1.753 0.07963 .
## smokeYes 1.070048 0.476025 2.248 0.02458 *
## ptl21+ 1.368684 0.510787 2.680 0.00737 **
## htYes 1.341059 0.812738 1.650 0.09893 .
## uiYes 0.474728 0.512900 0.926 0.35467
## ftv21 -0.348842 0.524607 -0.665 0.50608
## ftv22+ -0.236118 0.512911 -0.460 0.64527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 189.59 on 151 degrees of freedom
## Residual deviance: 157.21 on 141 degrees of freedom
## AIC: 179.21
##
## Number of Fisher Scoring iterations: 4
# (b) LDA and QDA
lda_model <- lda(low ~ age + lwt, data = train)
print(lda_model)
## Call:
## lda(low ~ age + lwt, data = train)
##
## Prior probabilities of groups:
## No Yes
## 0.6842105 0.3157895
##
## Group means:
## age lwt
## No 23.58654 134.9615
## Yes 22.54167 121.1667
##
## Coefficients of linear discriminants:
## LD1
## age -0.05315507
## lwt -0.03006282
qda_model <- qda(low ~ age + lwt, data = train)
print(qda_model)
## Call:
## qda(low ~ age + lwt, data = train)
##
## Prior probabilities of groups:
## No Yes
## 0.6842105 0.3157895
##
## Group means:
## age lwt
## No 23.58654 134.9615
## Yes 22.54167 121.1667
# Defining a function to compute sensitivity, specificity, and accuracy from a
# confusion matrix with predicted classes in rows and actual classes in columns
# (the orientation produced by table(Predicted = ..., Actual = ...) below).
calc_metrics <- function(conf_matrix) {
  TP <- conf_matrix["Yes", "Yes"]  # predicted Yes, actually Yes
  TN <- conf_matrix["No", "No"]    # predicted No, actually No
  FP <- conf_matrix["Yes", "No"]   # predicted Yes, actually No
  FN <- conf_matrix["No", "Yes"]   # predicted No, actually Yes
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  accuracy <- (TP + TN) / sum(conf_matrix)
  return(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
}
# Evaluating Models on the Test Set
## Logistic Regression Predictions
pred_probs_logit <- predict(logit_model, newdata = test, type = "response")
pred_class_logit <- ifelse(pred_probs_logit > 0.5, "Yes", "No")
conf_logit <- table(Predicted = pred_class_logit, Actual = test$low)
cat("Logistic Regression Confusion Matrix:\n")
## Logistic Regression Confusion Matrix:
print(conf_logit)
## Actual
## Predicted No Yes
## No 23 7
## Yes 3 4
metrics_logit <- calc_metrics(conf_logit)
cat("Logistic Regression Metrics:\n")
## Logistic Regression Metrics:
print(metrics_logit)
## sensitivity specificity accuracy
## 0.3636364 0.8846154 0.7297297
## LDA Predictions (using age and lwt)
lda_pred <- predict(lda_model, newdata = test)
conf_lda <- table(Predicted = lda_pred$class, Actual = test$low)
cat("LDA Confusion Matrix:\n")
## LDA Confusion Matrix:
print(conf_lda)
## Actual
## Predicted No Yes
## No 26 11
## Yes 0 0
metrics_lda <- calc_metrics(conf_lda)
cat("LDA Metrics:\n")
## LDA Metrics:
print(metrics_lda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
## QDA Predictions (using age and lwt)
qda_pred <- predict(qda_model, newdata = test)
conf_qda <- table(Predicted = qda_pred$class, Actual = test$low)
cat("QDA Confusion Matrix:\n")
## QDA Confusion Matrix:
print(conf_qda)
## Actual
## Predicted No Yes
## No 26 11
## Yes 0 0
metrics_qda <- calc_metrics(conf_qda)
cat("QDA Metrics:\n")
## QDA Metrics:
print(metrics_qda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
# 6. Comparing Model Performance
cat("Final Performance Comparison:\n")
## Final Performance Comparison:
cat("Logistic Regression (Full Model):\n")
## Logistic Regression (Full Model):
print(metrics_logit)
## sensitivity specificity accuracy
## 0.3636364 0.8846154 0.7297297
cat("LDA (age and lwt):\n")
## LDA (age and lwt):
print(metrics_lda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
cat("QDA (age and lwt):\n")
## QDA (age and lwt):
print(metrics_qda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
From the above results, the test-set performance can be summarized as follows:
Logistic Regression (Full Model):
Sensitivity: 36.4% (the model correctly identifies 4 of the 11 low-birthweight cases)
Specificity: 88.5%
Accuracy: 73.0%
LDA and QDA (using age and lwt):
Sensitivity: 0% (they did not classify any case as low birth weight)
Specificity: 100%
Accuracy: 70.3%
Because sensitivity (the ability to correctly detect low-birthweight cases) is critical here, and its overall accuracy is also slightly higher, the logistic regression model performs best among these methods.
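Because sensitivity is the priority, one natural follow-up (a sketch, not part of the comparison above; pred_class_03 and conf_03 are illustrative names) is to lower the 0.5 classification cutoff for the logistic model and see how the sensitivity/specificity trade-off shifts on the same test set:
# Re-classify the logistic-regression predictions with a lower cutoff of 0.3.
pred_class_03 <- ifelse(pred_probs_logit > 0.3, "Yes", "No")
conf_03 <- table(Predicted = factor(pred_class_03, levels = c("No", "Yes")),
                 Actual = test$low)
calc_metrics(conf_03)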
When all continuous predictors are 0 and all categorical variables are at their reference levels (race = “White”, smoke = “No”, ptl2 = “0”, ht = “No”, ui = “No”, ftv2 = “0”), the log odds of having a low-birthweight baby is 0.823. However, since age and lwt of 0 are not realistic, the intercept mainly serves as a baseline for the model.
Each additional year in maternal age is associated with a decrease of 0.037 in the log odds of low birthweight. The corresponding odds ratio is exp(–0.03723) ≈ 0.964, meaning roughly a 3.6% decrease in the odds per year. However, this effect is not statistically significant (p ≈ 0.336).
For each additional pound in the mother’s weight at the last menstrual period, the log odds of low birthweight decrease by 0.01565. The odds ratio is exp(–0.01565) ≈ 0.9845, implying about a 1.6% decrease in odds per pound. This effect is statistically significant (p ≈ 0.027).
Compared to the reference group (White), Black mothers have 1.192 higher log odds of low birthweight. This corresponds to an odds ratio of exp(1.19241) ≈ 3.29, meaning Black mothers have about 3.3 times the odds of a low‐birthweight baby, holding other factors constant (p ≈ 0.026).
Compared to White mothers, those in the “Other” race category have 0.741 higher log odds of low birthweight. The odds ratio is exp(0.74069) ≈ 2.10, indicating roughly 2.1 times the odds, though this effect is only marginally significant (p ≈ 0.109).
Mothers who smoke have a 0.756 unit higher log odds of low birthweight compared to non-smokers. The odds ratio is exp(0.75553) ≈ 2.13, meaning smoking is associated with roughly double the odds of low birthweight (p ≈ 0.075, marginal significance).
Mothers with one or more previous preterm labors (ptl2 = “1+”) have log odds of low birthweight that are 1.344 units higher than mothers with no previous preterm labors. This translates to an odds ratio of exp(1.34376) ≈ 3.83, so they have nearly 4 times the odds of a low‐birthweight baby (p ≈ 0.005).
A history of hypertension increases the log odds of low birthweight by 1.913 compared to no hypertension. The odds ratio is exp(1.91317) ≈ 6.77, meaning mothers with hypertension have roughly 6.8 times the odds of low birthweight (p ≈ 0.008).
Mothers with uterine irritability have 0.680 higher log odds of low birthweight compared to those without. The corresponding odds ratio is exp(0.68019) ≈ 1.97, nearly doubling the odds, although this effect is not statistically significant (p ≈ 0.143).
ftv2 (reference level “0”):
ftv21 (-0.43638): Having 1 first-trimester physician visit (versus 0) is associated with a decrease of 0.436 in the log odds of low birthweight. The odds ratio is exp(-0.43638) ≈ 0.646, or about a 35% reduction in odds; however, this effect is not statistically significant (p ≈ 0.363).
ftv22+ (0.17901): Having 2 or more first-trimester visits (versus 0) increases the log odds by 0.179, with an odds ratio of exp(0.17901) ≈ 1.196 (about a 20% increase in odds), but this effect is also not statistically significant (p ≈ 0.695).
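The interpretations above can be summarized compactly as odds ratios with profile-likelihood confidence intervals (confint() on a glm uses the profiling method from MASS and prints a short profiling message):
# Odds ratios and 95% confidence intervals for the refitted full-data model.
round(exp(cbind(OR = coef(logit_model_new), confint(logit_model_new))), 3)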