if (!require("vcd")) install.packages("vcd")
## Loading required package: vcd
## Loading required package: grid
library(vcd)
library(MASS)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(car)
## Loading required package: carData
library(ggplot2)
# Loading Dataset and observing its summary:
data(cpus)
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
head(cpus)
## name syct mmin mmax cach chmin chmax perf estperf
## 1 ADVISOR 32/60 125 256 6000 256 16 128 198 199
## 2 AMDAHL 470V/7 29 8000 32000 32 8 32 269 253
## 3 AMDAHL 470/7A 29 8000 32000 32 8 32 220 253
## 4 AMDAHL 470V/7B 29 8000 32000 32 8 32 172 253
## 5 AMDAHL 470V/7C 29 8000 16000 32 8 16 132 132
## 6 AMDAHL 470V/8 26 8000 32000 64 8 32 318 290
# Correlation Matrix:
# Only numerical variables are considered here for better clarity:
vars <- c("syct", "mmin", "mmax", "cach", "chmin", "chmax")
cor_matrix <- cor(cpus[, vars])
print(cor_matrix)
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
The correlation matrix shows that the memory-related predictors, minimum memory size (mmin) and maximum memory size (mmax), are the most strongly associated with each other, with a high positive correlation of 0.758. This suggests potential multicollinearity in modeling.
The channel variables chmin and chmax are moderately correlated with each other (0.548) and with the memory variables.
Cache size (cach) shows moderate correlations with the other predictors (roughly 0.49 to 0.58), indicating a supportive but less dominant role compared to the memory-related variables.
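As a quick numeric screen (a sketch using an illustrative 0.7 cutoff and a helper object high_cor), we can list the predictor pairs whose absolute correlation exceeds 0.7; in this data only the mmin/mmax pair qualifies:
# Flag predictor pairs with |r| > 0.7 as a rough multicollinearity screen:
high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           r    = cor_matrix[high_cor])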
# Visualizing relationships using a scatterplot matrix:
ggpairs(cpus[, vars], title = "Scatterplot Matrix for cpus Data")
The scatterplot matrix above reveals non-linear patterns, particularly between mmin and mmax, suggesting that log-transforming these variables could improve linearity. The wide range of values for mmin and mmax dominates the plots and can obscure trends. Additionally, the increasing spread at higher values suggests heteroscedasticity, indicating a need for transformation or alternative modeling approaches.
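As a rough check of that suggestion (a sketch, not part of the modeling below), the same matrix can be redrawn with the strictly positive, skewed variables syct, mmin, and mmax on the log scale; cach, chmin, and chmax contain zeros and are left untransformed. The data frame log_vars is just an illustrative helper.
# Scatterplot matrix with the skewed, strictly positive predictors log-transformed:
log_vars <- data.frame(log_syct = log(cpus$syct),
                       log_mmin = log(cpus$mmin),
                       log_mmax = log(cpus$mmax),
                       cach  = cpus$cach,
                       chmin = cpus$chmin,
                       chmax = cpus$chmax)
ggpairs(log_vars, title = "Scatterplot Matrix with Log-Transformed syct, mmin, mmax")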
# Linear regression model:
fit <- lm(estperf ~ . - name - perf, data = cpus)
summary(fit)
##
## Call:
## lm(formula = estperf ~ . - name - perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.198 -24.064 1.429 23.136 288.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.648e+01 6.288e+00 -10.574 < 2e-16 ***
## syct 6.596e-02 1.369e-02 4.818 2.85e-06 ***
## mmin 1.431e-02 1.428e-03 10.021 < 2e-16 ***
## mmax 6.590e-03 5.016e-04 13.138 < 2e-16 ***
## cach 4.945e-01 1.091e-01 4.533 9.94e-06 ***
## chmin -1.723e-01 6.687e-01 -0.258 0.797
## chmax 1.201e+00 1.720e-01 6.985 4.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.88 on 202 degrees of freedom
## Multiple R-squared: 0.9109, Adjusted R-squared: 0.9082
## F-statistic: 344.1 on 6 and 202 DF, p-value: < 2.2e-16
From the above results, we can see that every predictor except chmin is strongly significant, with very small p-values. Among them, mmax and mmin have the largest t-statistics and are the most significant variables.
par(mfrow = c(2, 2))
plot(fit)
The diagnostic plots show heteroscedasticity: in the residuals-vs-fitted plot the spread of the residuals is clearly unequal across the fitted values, and the Q-Q plot deviates slightly from the reference line in the left tail. The residuals-vs-leverage plot flags a few potential outliers, specifically points that stand out in terms of Cook's distance. A better model may be needed to stabilize the variance.
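As a formal complement to the visual impression (optional, using the car package that is already loaded), the score test for non-constant error variance can be run on this fit; a small p-value would support the heteroscedasticity seen in the plots.
# Score test for non-constant variance (car::ncvTest); the same check can be
# repeated on the log-transformed refit below for comparison.
ncvTest(fit)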
Let’s take the log of estperf and refit the model.
# The original model was fit as:
# fit <- lm(estperf ~ . - name - perf, data = cpus)
# Fit a new model with log-transformed response
fit_log <- lm(log(estperf) ~ . - name - perf, data = cpus)
# Summarize the new model
summary(fit_log)
##
## Call:
## lm(formula = log(estperf) ~ . - name - perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96828 -0.12609 0.02829 0.15471 0.46886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.293e+00 2.934e-02 112.254 < 2e-16 ***
## syct -4.510e-04 6.389e-05 -7.059 2.64e-11 ***
## mmin 1.278e-05 6.662e-06 1.919 0.0565 .
## mmax 5.386e-05 2.341e-06 23.011 < 2e-16 ***
## cach 7.024e-03 5.090e-04 13.801 < 2e-16 ***
## chmin 4.808e-03 3.120e-03 1.541 0.1249
## chmax -1.598e-03 8.024e-04 -1.992 0.0478 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.9455
## F-statistic: 602.1 on 6 and 202 DF, p-value: < 2.2e-16
# Check diagnostics again
par(mfrow = c(2, 2))
plot(fit_log)
From the above plots, we can see that the model has improved: the red trend line in the residuals-vs-fitted plot is now nearly flat, and the Q-Q plot follows the reference line more closely than before. The fit is still not perfect, but the heteroscedasticity is noticeably reduced.
The final model has an R-squared of 0.947, meaning about 94.7% of the variance in log(estperf) is captured by the model. The residual standard error drops from 46.88 in the first model to 0.2188 in the final model, although the two values are not directly comparable because the response is now on the log scale.
Comparing the residual plots of the first and final models also shows a better fit, as evidenced by the reduced heteroscedasticity.
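Since the two residual standard errors are on different scales, a rough comparison on the original estperf scale can be made by naively back-transforming the log-model fits with exp() (a sketch that ignores retransformation bias; rmse_raw and rmse_log are illustrative names):
# RMSE of both models on the original estperf scale (naive back-transform of
# the log model; no bias correction applied).
rmse_raw <- sqrt(mean((cpus$estperf - fitted(fit))^2))
rmse_log <- sqrt(mean((cpus$estperf - exp(fitted(fit_log)))^2))
c(raw_model = rmse_raw, log_model_backtransformed = rmse_log)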
Intercept (𝛃0 = 3.293):
Although not a realistic configuration, when all other predictors are zero the baseline estimated performance is e^(3.293) ≈ 26.9.
syct (𝛃(syct) = -4.510e-04):
A 1-unit increase in cycle time decreases log(estperf) by 0.000451, which corresponds to multiplying estperf by e^(-0.000451) ≈ 0.99955, i.e., roughly a 0.045% decrease.
mmin (𝛃(mmin) = 1.278e-05):
Since mmin is recorded in KB and 1 MB = 1024 KB, the effect of one additional MB is 1.278e-05 * 1024 ≈ 0.0131 on the log scale, so e^(0.0131) ≈ 1.013, i.e., about a 1.3% increase in estperf per additional MB of mmin (note that this coefficient is only marginally significant, p ≈ 0.057).
mmax (𝛃(mmax) = 5.386e-05):
In case of MB, 1 MB = 1024 KB. So, 5.386e-05 * 1024 = 0.0552 => e^(0.0552) = approx 1.0567, i.e., about a 5.7% increase in estperf can happen for each additional 1 MB of mmax.
cach (𝛃(cach) = 7.024e-03):
A 1 KB increase in cache size adds 0.007024 to the log(estperf).
Exponentially: e^(0.007024) = approx 1.00705, about a 0.70% increase in estperf.
chmin (𝛃(chmin) = 4.808e-03):
Each 1-unit increase in chmin multiplies estperf by e^(0.004808) ≈ 1.0048, or about a 0.48% increase. However, the p-value is 0.1249, so this effect is not statistically significant, and we cannot be confident that chmin affects estperf.
chmax (𝛃(chmax) = -1.598e-03):
Each 1-unit increase in chmax multiplies estperf by e^(-0.001598) ≈ 0.9984, or about a 0.16% decrease. The p-value is 0.0478, so the effect is only marginally significant at the 5% level.
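The percentage effects quoted above can be reproduced directly from the fitted coefficients: (exp(beta) - 1) * 100 gives the approximate percent change in estperf per one-unit increase in each predictor (per KB for the memory variables).
# Approximate percent change in estperf per one-unit increase in each predictor
# (intercept excluded).
round((exp(coef(fit_log)[-1]) - 1) * 100, 4)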
vif_values <- vif(fit_log)
vif_values
## syct mmin mmax cach chmin chmax
## 1.201644 2.902002 3.274002 1.858349 1.966234 1.891392
The VIF values for the new model range from about 1.2 to 3.3, all well below the usual cutoff of 5, which indicates no serious multicollinearity. mmin and mmax have somewhat higher VIFs than the others, consistent with their strong mutual correlation, but they are still within an acceptable range. Hence, the new model is performing fairly well.
# Assume your final model is:
fit_final <- lm(log(estperf) ~ . - name - perf, data = cpus)
# 1. Cook's Distance -------------------------------------------
# Calculate Cook's distances
cooksd <- cooks.distance(fit_final)
# Plot Cook's distance values
plot(cooksd, type = "h", main = "Cook's Distance",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = 4/length(cooksd), col = "red", lty = 2)
text(x = 1:length(cooksd), y = cooksd, labels = ifelse(cooksd > 4/length(cooksd), names(cooksd), ""), pos = 4, cex = 0.7)
# Rule of thumb: Observations with Cook's distance > 4/n (n = number of observations) may be influential.
threshold <- 4/length(cooksd)
cat("Cook's distance threshold (4/n):", threshold, "\n")
## Cook's distance threshold (4/n): 0.01913876
# 2. Leverage (Hat Values) -------------------------------------
# Calculate hat values (leverages)
hats <- hatvalues(fit_final)
# Plot hat values
plot(hats, type = "h", main = "Leverage Values",
ylab = "Leverage", xlab = "Observation Index")
# Rule of thumb: Leverage > 2*(p+1)/n might be considered high, where p = number of predictors.
p <- length(coef(fit_final)) - 1 # number of predictors
leverage_threshold <- 2*(p + 1)/nrow(cpus)
abline(h = leverage_threshold, col = "red", lty = 2)
cat("Leverage threshold (2*(p+1)/n):", leverage_threshold, "\n")
## Leverage threshold (2*(p+1)/n): 0.06698565
# 3. Influence Plot --------------------------------------------
# Generate an influence plot that shows Cook's distance, leverage, and studentized residuals.
influencePlot(fit_final, main="Influence Plot",
sub="Circle size is proportional to Cook's Distance")
## StudRes Hat CookD
## 9 -4.023115 0.1316038 0.325910095
## 10 -5.819145 0.3273595 2.024883077
## 197 -0.328550 0.3417243 0.008040722
## 200 -3.383141 0.2644439 0.558937693
# 4. Standardized Residuals -------------------------------------
# Calculate standardized residuals
std_resid <- rstandard(fit_final)
# Plot standardized residuals vs. fitted values
plot(fitted(fit_final), std_resid,
xlab = "Fitted Values",
ylab = "Standardized Residuals",
main = "Standardized Residuals vs Fitted Values")
abline(h = c(-2, 2), col = "red", lty = 2)
Outliers / Influential Points:
#10 is the most influential observation, with by far the largest Cook's distance, high leverage, and a large negative studentized residual.
#9 and #200 also show elevated influence, combining large negative residuals with Cook's distances well above the 4/n threshold.
#197 has very high leverage but a small residual, so its Cook's distance remains low; no other observation comes close to #10's level of influence.
Implications: these influential points, especially #10, could be pulling the coefficient estimates, so it is worth inspecting them and checking how sensitive the fit is to their removal.
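One way to act on this (a sketch; fit_sens is an illustrative name) is to refit the final model without the three observations flagged by influencePlot() and compare the coefficients to gauge how sensitive the fit is to them:
# Sensitivity check: refit without the flagged influential observations and
# compare coefficients side by side.
fit_sens <- lm(log(estperf) ~ . - name - perf, data = cpus[-c(9, 10, 200), ])
round(cbind(full_data = coef(fit_final), without_9_10_200 = coef(fit_sens)), 5)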
data("birthwt")
# Inspect structure and basic summary
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
#Visualizations:
# Example: Mosaic plot of low birthweight vs. race
mosaic(~ low + race, data = birthwt, shade = TRUE, main = "Mosaic Plot: Low Birthweight vs. Race")
# Example: Boxplot of mother's weight (lwt) by low birthweight status
boxplot(lwt ~ low, data = birthwt,
main = "Mother's Weight by Birthweight Status",
xlab = "Low Birthweight",
ylab = "Mother's Weight (lbs)")
From the above summaries, we can see that, unlike in question #1, not all predictors are numerical; most of the variables here are categorical, representing binary indicators.
The visualizations above also help in understanding the data set. The mosaic plot shows that most White mothers did not have a low-birthweight child, whereas comparatively few Black mothers fall in the "no low birthweight" group.
The boxplot shows that mothers whose babies had low birthweight tended to weigh less themselves than mothers whose babies did not.
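The mosaic-plot impression can be quantified with row proportions of low birthweight by race (here still using the raw integer coding, where 1 = White, 2 = Black, 3 = Other); this is just a descriptive cross-check:
# Proportion of low-birthweight outcomes within each race category.
round(prop.table(table(race = birthwt$race, low = birthwt$low), margin = 1), 2)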
# Load necessary package and data
library(MASS)
data(birthwt)
# Coding variables so that categorical variables are factors:
birthwt$low <- factor(birthwt$low, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$ht <- factor(birthwt$ht, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$ui <- factor(birthwt$ui, levels = c(0, 1), labels = c("No", "Yes"))
# Logistic regression model:
# We exclude bwt because it is used to define low birthweight.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
data = birthwt, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.480623 1.196888 0.402 0.68801
## age -0.029549 0.037031 -0.798 0.42489
## lwt -0.015424 0.006919 -2.229 0.02580 *
## raceBlack 1.272260 0.527357 2.413 0.01584 *
## raceOther 0.880496 0.440778 1.998 0.04576 *
## smokeYes 0.938846 0.402147 2.335 0.01957 *
## ptl 0.543337 0.345403 1.573 0.11571
## htYes 1.863303 0.697533 2.671 0.00756 **
## uiYes 0.767648 0.459318 1.671 0.09467 .
## ftv 0.065302 0.172394 0.379 0.70484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 201.28 on 179 degrees of freedom
## AIC: 221.28
##
## Number of Fisher Scoring iterations: 4
The model is assuming that the effect of each additional unit increase in ptl and ftv is constant on the log‐odds scale. In other words, it implies that an increase of one previous preterm labor (ptl) always increases the log odds of a low‐birth weight outcome by about 0.54, and that one extra physician visit in the first trimester (ftv) increases the log odds by about 0.065—regardless of the starting point. This is the standard “linearity in the log-odds” assumption inherent in logistic regression.
However, these assumptions may not be realistic. For count variables like ptl and ftv, the effect may not be strictly linear over their entire range. For example, going from 0 to 1 previous preterm labor might have a much different impact than going from 2 to 3, or the effect of additional physician visits might plateau after a certain number of visits. The non-significant coefficient for ftv (p = 0.70) and the moderate effect (with high uncertainty) for ptl (p ≈ 0.116) suggest that the relationship may not be adequately captured by a simple linear term.
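The raw count distributions make this concern concrete: most mothers have zero previous preterm labors and zero or one visit, leaving very few observations to pin down a linear effect at higher counts.
# Frequency tables of the two count predictors before recoding.
table(ptl = birthwt$ptl)
table(ftv = birthwt$ftv)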
# Creating a new variable 'ptl2': "0" if ptl equals 0, "1+" otherwise.
birthwt$ptl2 <- factor(ifelse(birthwt$ptl == 0, "0", "1+"))
table(birthwt$ptl2)
##
## 0 1+
## 159 30
We can see that 159 mothers had no previous premature labors and 30 had at least one.
# Creating a new variable 'ftv2' by collapsing ftv into three categories:
# "0" for no visits, "1" for one visit, and "2+" for two or more visits.
birthwt$ftv2 <- with(birthwt, ifelse(ftv == 0, "0",
ifelse(ftv == 1, "1", "2+")))
birthwt$ftv2 <- factor(birthwt$ftv2, levels = c("0", "1", "2+"))
print(table(birthwt$ftv2))
##
## 0 1 2+
## 100 47 42
# Creating a cross-tabulation of ftv2 with the outcome 'low'
ftv2_table <- table(birthwt$ftv2, birthwt$low)
print(ftv2_table)
##
## No Yes
## 0 64 36
## 1 36 11
## 2+ 30 12
# Calculating row proportions to see the probability of low birthweight by ftv2 level
ftv2_prop <- prop.table(ftv2_table, margin = 1)
print(round(ftv2_prop, 2))
##
## No Yes
## 0 0.64 0.36
## 1 0.77 0.23
## 2+ 0.71 0.29
From the table above, 100 mothers had no first-trimester doctor visits, 47 had one visit, and 42 had two or more visits.
Within those groups, 64%, 77%, and 71% respectively had no low-birthweight outcome, while 36%, 23%, and 29% respectively did.
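As an informal check (ignoring the other covariates, unlike the regression refit below), a chi-squared test of independence on this cross-tabulation gives a sense of whether ftv2 is associated with low birthweight at all:
# Chi-squared test of association between ftv2 and low birthweight.
chisq.test(ftv2_table)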
# Refit the logistic regression model using ptl2 and ftv2
logit_model_new <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = birthwt, family = binomial)
summary(logit_model_new)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## raceBlack 1.19241 0.53597 2.225 0.02609 *
## raceOther 0.74069 0.46174 1.604 0.10869
## smokeYes 0.75553 0.42502 1.778 0.07546 .
## ptl21+ 1.34376 0.48062 2.796 0.00518 **
## htYes 1.91317 0.72074 2.654 0.00794 **
## uiYes 0.68019 0.46434 1.465 0.14296
## ftv21 -0.43638 0.47939 -0.910 0.36268
## ftv22+ 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
Previous Preterm Labors (ptl2): having one or more previous preterm labors is associated with a 1.344 increase in the log odds of low birthweight (p ≈ 0.005), a clearly significant effect.
First-Trimester Physician Visits (ftv2): neither one visit (estimate -0.436, p ≈ 0.363) nor two or more visits (estimate 0.179, p ≈ 0.695) differs significantly from the zero-visit reference group.
Overall, the new variable ptl2 appears to capture an important risk factor for low birth weight, while the recoded ftv2 does not seem to add predictive value in this model.
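This can be checked more formally with likelihood-ratio tests for dropping each term from the refitted model; the ftv2 row tests both dummy coefficients jointly, which is a cleaner way to judge whether the recoded visit variable contributes anything:
# Likelihood-ratio (chi-squared) tests for dropping each term from the model.
drop1(logit_model_new, test = "Chisq")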
# Install packages if not already installed
if (!require("caret")) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
if (!require("MASS")) install.packages("MASS")
library(caret)
library(MASS)
# Loading the birthwt data
data(birthwt)
# Recoding variables to factors as appropriate
birthwt$low <- factor(birthwt$low, levels = c(0, 1), labels = c("No", "Yes"))
birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0, 1),
labels = c("No", "Yes"))
birthwt$ht <- factor(birthwt$ht, levels = c(0, 1),
labels = c("No", "Yes"))
birthwt$ui <- factor(birthwt$ui, levels = c(0, 1),
labels = c("No", "Yes"))
# Creating new variables:
birthwt$ptl2 <- factor(ifelse(birthwt$ptl == 0, "0", "1+"),
levels = c("0", "1+"))
birthwt$ftv2 <- factor(with(birthwt, ifelse(ftv == 0, "0",
ifelse(ftv == 1, "1", "2+"))),
levels = c("0", "1", "2+"))
# Splitting Data (Stratified on 'low')
set.seed(1234)
trainIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train <- birthwt[trainIndex, ]
test <- birthwt[-trainIndex, ]
train$ptl2 <- factor(train$ptl2, levels = c("0", "1+"))
train$ftv2 <- factor(train$ftv2, levels = c("0", "1", "2+"))
# Fitting Models on the Training Set
# (a) Logistic Regression
logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = train, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.265238 1.443592 0.184 0.85422
## age -0.003943 0.043030 -0.092 0.92699
## lwt -0.017885 0.007891 -2.267 0.02342 *
## raceBlack 1.448961 0.612146 2.367 0.01793 *
## raceOther 0.911531 0.520036 1.753 0.07963 .
## smokeYes 1.070048 0.476025 2.248 0.02458 *
## ptl21+ 1.368684 0.510787 2.680 0.00737 **
## htYes 1.341059 0.812738 1.650 0.09893 .
## uiYes 0.474728 0.512900 0.926 0.35467
## ftv21 -0.348842 0.524607 -0.665 0.50608
## ftv22+ -0.236118 0.512911 -0.460 0.64527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 189.59 on 151 degrees of freedom
## Residual deviance: 157.21 on 141 degrees of freedom
## AIC: 179.21
##
## Number of Fisher Scoring iterations: 4
# (b) LDA and QDA
lda_model <- lda(low ~ age + lwt, data = train)
print(lda_model)
## Call:
## lda(low ~ age + lwt, data = train)
##
## Prior probabilities of groups:
## No Yes
## 0.6842105 0.3157895
##
## Group means:
## age lwt
## No 23.58654 134.9615
## Yes 22.54167 121.1667
##
## Coefficients of linear discriminants:
## LD1
## age -0.05315507
## lwt -0.03006282
qda_model <- qda(low ~ age + lwt, data = train)
print(qda_model)
## Call:
## qda(low ~ age + lwt, data = train)
##
## Prior probabilities of groups:
## No Yes
## 0.6842105 0.3157895
##
## Group means:
## age lwt
## No 23.58654 134.9615
## Yes 22.54167 121.1667
# Defining a function to compute sensitivity, specificity, and accuracy from a
# confusion matrix with predicted classes in rows and actual classes in columns
# (the orientation produced by table(Predicted = ..., Actual = ...) below).
calc_metrics <- function(conf_matrix) {
  TP <- conf_matrix["Yes", "Yes"]  # predicted Yes, actually Yes
  TN <- conf_matrix["No", "No"]    # predicted No, actually No
  FP <- conf_matrix["Yes", "No"]   # predicted Yes, actually No
  FN <- conf_matrix["No", "Yes"]   # predicted No, actually Yes
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  accuracy <- (TP + TN) / sum(conf_matrix)
  return(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
}
# Evaluating Models on the Test Set
## Logistic Regression Predictions
pred_probs_logit <- predict(logit_model, newdata = test, type = "response")
pred_class_logit <- ifelse(pred_probs_logit > 0.5, "Yes", "No")
conf_logit <- table(Predicted = pred_class_logit, Actual = test$low)
cat("Logistic Regression Confusion Matrix:\n")
## Logistic Regression Confusion Matrix:
print(conf_logit)
## Actual
## Predicted No Yes
## No 23 7
## Yes 3 4
metrics_logit <- calc_metrics(conf_logit)
cat("Logistic Regression Metrics:\n")
## Logistic Regression Metrics:
print(metrics_logit)
## sensitivity specificity accuracy
## 0.3636364 0.8846154 0.7297297
## LDA Predictions (using age and lwt)
lda_pred <- predict(lda_model, newdata = test)
conf_lda <- table(Predicted = lda_pred$class, Actual = test$low)
cat("LDA Confusion Matrix:\n")
## LDA Confusion Matrix:
print(conf_lda)
## Actual
## Predicted No Yes
## No 26 11
## Yes 0 0
metrics_lda <- calc_metrics(conf_lda)
cat("LDA Metrics:\n")
## LDA Metrics:
print(metrics_lda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
## QDA Predictions (using age and lwt)
qda_pred <- predict(qda_model, newdata = test)
conf_qda <- table(Predicted = qda_pred$class, Actual = test$low)
cat("QDA Confusion Matrix:\n")
## QDA Confusion Matrix:
print(conf_qda)
## Actual
## Predicted No Yes
## No 26 11
## Yes 0 0
metrics_qda <- calc_metrics(conf_qda)
cat("QDA Metrics:\n")
## QDA Metrics:
print(metrics_qda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
# 6. Comparing Model Performance
cat("Final Performance Comparison:\n")
## Final Performance Comparison:
cat("Logistic Regression (Full Model):\n")
## Logistic Regression (Full Model):
print(metrics_logit)
## sensitivity specificity accuracy
## 0.3636364 0.8846154 0.7297297
cat("LDA (age and lwt):\n")
## LDA (age and lwt):
print(metrics_lda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
cat("QDA (age and lwt):\n")
## QDA (age and lwt):
print(metrics_qda)
## sensitivity specificity accuracy
## 0.0000000 1.0000000 0.7027027
From the above results, the test-set performance can be summarized as follows:
Logistic Regression (Full Model):
Sensitivity: 36.4% (the model correctly identifies 4 of the 11 low-birthweight cases)
Specificity: 88.5%
Accuracy: 73.0%
LDA and QDA (using age and lwt):
Sensitivity: 0% (they did not classify any case as low birth weight)
Specificity: 100%
Accuracy: 70.3%
Because sensitivity (the ability to correctly detect low-birthweight cases) is critical here, and its overall accuracy is also slightly higher, the logistic regression model performs best among these methods.
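Because sensitivity is the priority, one natural follow-up (a sketch, not part of the comparison above; pred_class_03 and conf_03 are illustrative names) is to lower the 0.5 classification cutoff for the logistic model and see how the sensitivity/specificity trade-off shifts on the same test set:
# Re-classify the logistic-regression predictions with a lower cutoff of 0.3.
pred_class_03 <- ifelse(pred_probs_logit > 0.3, "Yes", "No")
conf_03 <- table(Predicted = factor(pred_class_03, levels = c("No", "Yes")),
                 Actual = test$low)
calc_metrics(conf_03)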
When all continuous predictors are 0 and all categorical variables are at their reference levels (race = “White”, smoke = “No”, ptl2 = “0”, ht = “No”, ui = “No”, ftv2 = “0”), the log odds of having a low-birthweight baby is 0.823. However, since age and lwt of 0 are not realistic, the intercept mainly serves as a baseline for the model.
Each additional year in maternal age is associated with a decrease of 0.037 in the log odds of low birthweight. The corresponding odds ratio is exp(–0.03723) ≈ 0.964, meaning roughly a 3.6% decrease in the odds per year. However, this effect is not statistically significant (p ≈ 0.336).
For each additional pound in the mother’s weight at the last menstrual period, the log odds of low birthweight decrease by 0.01565. The odds ratio is exp(–0.01565) ≈ 0.9845, implying about a 1.6% decrease in odds per pound. This effect is statistically significant (p ≈ 0.027).
Compared to the reference group (White), Black mothers have 1.192 higher log odds of low birthweight. This corresponds to an odds ratio of exp(1.19241) ≈ 3.29, meaning Black mothers have about 3.3 times the odds of a low‐birthweight baby, holding other factors constant (p ≈ 0.026).
Compared to White mothers, those in the “Other” race category have 0.741 higher log odds of low birthweight. The odds ratio is exp(0.74069) ≈ 2.10, indicating roughly 2.1 times the odds, though this effect is only marginally significant (p ≈ 0.109).
Mothers who smoke have a 0.756 unit higher log odds of low birthweight compared to non-smokers. The odds ratio is exp(0.75553) ≈ 2.13, meaning smoking is associated with roughly double the odds of low birthweight (p ≈ 0.075, marginal significance).
Mothers with one or more previous preterm labors (ptl2 = “1+”) have log odds of low birthweight that are 1.344 units higher than mothers with no previous preterm labors. This translates to an odds ratio of exp(1.34376) ≈ 3.83, so they have nearly 4 times the odds of a low‐birthweight baby (p ≈ 0.005).
A history of hypertension increases the log odds of low birthweight by 1.913 compared to no hypertension. The odds ratio is exp(1.91317) ≈ 6.77, meaning mothers with hypertension have roughly 6.8 times the odds of low birthweight (p ≈ 0.008).
Mothers with uterine irritability have 0.680 higher log odds of low birthweight compared to those without. The corresponding odds ratio is exp(0.68019) ≈ 1.97, nearly doubling the odds, although this effect is not statistically significant (p ≈ 0.143).
ftv2 (reference level “0”):
ftv21 (-0.43638): Having 1 first-trimester physician visit (versus 0) is associated with a decrease of 0.436 in the log odds of low birthweight. The odds ratio is exp(-0.43638) ≈ 0.646, or about a 35% reduction in odds; however, this effect is not statistically significant (p ≈ 0.363).
ftv22+ (0.17901): Having 2 or more first-trimester visits (versus 0) increases the log odds by 0.179, with an odds ratio of exp(0.17901) ≈ 1.196 (about a 20% increase in odds), but this effect is also not statistically significant (p ≈ 0.695).
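The interpretations above can be summarized compactly as odds ratios with profile-likelihood confidence intervals (confint() on a glm uses the profiling method from MASS and prints a short profiling message):
# Odds ratios and 95% confidence intervals for the refitted full-data model.
round(exp(cbind(OR = coef(logit_model_new), confint(logit_model_new))), 3)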