Please hand this result in on Canvas no later than 11:59pm on Wednesday, March 12! Do not work in groups!

  1. Consider the `cpus` data from the R package `MASS`. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable `estperf`). Do not use published performance (`perf`) as a predictor of performance in this problem.

    a. Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.

library(MASS)
library(ggplot2)
# Load the cpus dataset (from MASS)
data(cpus, package="MASS")
str(cpus)
## 'data.frame':    209 obs. of  9 variables:
##  $ name   : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
##  $ syct   : int  125 29 29 29 29 26 23 23 23 23 ...
##  $ mmin   : int  256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
##  $ mmax   : int  6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
##  $ cach   : int  256 32 32 32 32 64 64 64 64 128 ...
##  $ chmin  : int  16 8 8 8 8 8 16 16 16 32 ...
##  $ chmax  : int  128 32 32 32 16 32 32 32 32 64 ...
##  $ perf   : int  198 269 220 172 132 318 367 489 636 1144 ...
##  $ estperf: int  199 253 253 253 132 290 381 381 749 1238 ...
summary(cpus)
##              name          syct             mmin            mmax      
##  ADVISOR 32/60 :  1   Min.   :  17.0   Min.   :   64   Min.   :   64  
##  AMDAHL 470/7A :  1   1st Qu.:  50.0   1st Qu.:  768   1st Qu.: 4000  
##  AMDAHL 470V/7 :  1   Median : 110.0   Median : 2000   Median : 8000  
##  AMDAHL 470V/7B:  1   Mean   : 203.8   Mean   : 2868   Mean   :11796  
##  AMDAHL 470V/7C:  1   3rd Qu.: 225.0   3rd Qu.: 4000   3rd Qu.:16000  
##  AMDAHL 470V/8 :  1   Max.   :1500.0   Max.   :32000   Max.   :64000  
##  (Other)       :203                                                   
##       cach            chmin            chmax             perf       
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :   6.0  
##  1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.:  27.0  
##  Median :  8.00   Median : 2.000   Median :  8.00   Median :  50.0  
##  Mean   : 25.21   Mean   : 4.699   Mean   : 18.27   Mean   : 105.6  
##  3rd Qu.: 32.00   3rd Qu.: 6.000   3rd Qu.: 24.00   3rd Qu.: 113.0  
##  Max.   :256.00   Max.   :52.000   Max.   :176.00   Max.   :1150.0  
##                                                                     
##     estperf       
##  Min.   :  15.00  
##  1st Qu.:  28.00  
##  Median :  45.00  
##  Mean   :  99.33  
##  3rd Qu.: 101.00  
##  Max.   :1238.00  
## 
cor_matrix <- cor(cpus[, -1])
print(cor_matrix)
##               syct       mmin       mmax       cach      chmin      chmax
## syct     1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin    -0.3356422  1.0000000  0.7581573  0.5347291  0.5171892  0.2669074
## mmax    -0.3785606  0.7581573  1.0000000  0.5379898  0.5605134  0.5272462
## cach    -0.3209998  0.5347291  0.5379898  1.0000000  0.5822455  0.4878458
## chmin   -0.3010897  0.5171892  0.5605134  0.5822455  1.0000000  0.5482812
## chmax   -0.2505023  0.2669074  0.5272462  0.4878458  0.5482812  1.0000000
## perf    -0.3070821  0.7949233  0.8629942  0.6626135  0.6089025  0.6052193
## estperf -0.2883956  0.8192915  0.9012024  0.6486203  0.6105802  0.5921556
##               perf    estperf
## syct    -0.3070821 -0.2883956
## mmin     0.7949233  0.8192915
## mmax     0.8629942  0.9012024
## cach     0.6626135  0.6486203
## chmin    0.6089025  0.6105802
## chmax    0.6052193  0.5921556
## perf     1.0000000  0.9664687
## estperf  0.9664687  1.0000000
pairs(cpus[, -1], main = "Scatterplot Matrix of cpus Data")

Observations:

Correlation Analysis: estperf is most strongly correlated with mmax (r ≈ 0.90) and mmin (r ≈ 0.82), moderately correlated with cach, chmin, and chmax (r ≈ 0.59-0.65), and negatively correlated with syct (r ≈ -0.29).

Inter-variable Relationships: The memory variables mmin and mmax are themselves highly correlated (r ≈ 0.76), and the channel and cache variables are moderately intercorrelated, which hints at possible multicollinearity among the predictors.

Visual Observations: The scatterplot matrix shows strongly right-skewed distributions for most variables and curved, fan-shaped relationships with estperf, suggesting that transformations (e.g., logs) may be appropriate.

b. Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider `perf` in this modeling approach. Explain the process used to arrive at your final model.

Looking at the scatterplots, I notice that many variables exhibit right-skewed distributions and non-linear relationships with estperf. The first step is to apply a log transformation to the response and to several predictors in order to linearize the relationships, reduce skewness in the distributions, and stabilize variance across the range of predictions.
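As a quick check of that skewness (a minimal sketch using base graphics), the raw and log-scaled distributions of estperf can be compared side by side:

# Compare the raw and log-transformed distributions of estperf
par(mfrow = c(1, 2))
hist(cpus$estperf, main = "estperf (raw)", xlab = "estperf")
hist(log(cpus$estperf), main = "log(estperf)", xlab = "log(estperf)")
par(mfrow = c(1, 1))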

data <- cpus[, !(names(cpus) %in% c("name", "perf"))]

cpus_log <- data
cpus_log$log_estperf <- log(data$estperf)
cpus_log$log_chmin <- log(data$chmin)
cpus_log$log_chmax <- log(data$chmax)
cpus_log$log_mmin <- log(data$mmin)
cpus_log$log_mmax <- log(data$mmax)
cpus_log$log_cach <- log(data$cach)

cpus_log <- as.data.frame(cpus_log)

Variable Selection Process

I’ll use a combination of techniques to determine the best predictors:

  1. Start with a full model: regress log_estperf on all of the log-transformed predictors (log_chmin, log_chmax, log_mmin, log_mmax, log_cach), excluding name and the untransformed estperf
cpus_log <- na.omit(cpus_log)  # Remove rows with NAs
cpus_log <- cpus_log[!apply(cpus_log, 1, function(x) any(is.infinite(x))), ]  # Remove rows where log(0) produced -Inf (cach, chmin, or chmax equal to 0); this leaves 139 of the 209 rows
full_model <- lm(log_estperf ~ log_chmin + log_chmax + log_mmin + log_mmax + log_cach, data=cpus_log)
summary(full_model)
## 
## Call:
## lm(formula = log_estperf ~ log_chmin + log_chmax + log_mmin + 
##     log_mmax + log_cach, data = cpus_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42690 -0.16513 -0.02378  0.12030  0.83766 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.46464    0.23226 -10.611  < 2e-16 ***
## log_chmin    0.11114    0.02681   4.145 6.02e-05 ***
## log_chmax    0.08956    0.02815   3.181 0.001824 ** 
## log_mmin     0.12792    0.03482   3.674 0.000345 ***
## log_mmax     0.51887    0.03581  14.492  < 2e-16 ***
## log_cach     0.23201    0.02388   9.718  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.235 on 133 degrees of freedom
## Multiple R-squared:  0.9385, Adjusted R-squared:  0.9362 
## F-statistic: 406.1 on 5 and 133 DF,  p-value: < 2.2e-16
  2. Apply backward elimination: remove variables that contribute least to the model, based on AIC
step_model <- step(full_model, direction="backward")
## Start:  AIC=-396.68
## log_estperf ~ log_chmin + log_chmax + log_mmin + log_mmax + log_cach
## 
##             Df Sum of Sq     RSS     AIC
## <none>                    7.3473 -396.68
## - log_chmax  1    0.5591  7.9064 -388.49
## - log_mmin   1    0.7458  8.0930 -385.24
## - log_chmin  1    0.9490  8.2962 -381.80
## - log_cach   1    5.2168 12.5641 -324.10
## - log_mmax   1   11.6012 18.9485 -266.99
  3. Check for multicollinearity: from the correlation matrix, mmin and mmax are highly correlated (0.758), which could cause multicollinearity issues. A VIF (Variance Inflation Factor) analysis, as sketched below, can confirm whether this needs addressing.
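A minimal VIF check on the full model might look like the following (this assumes the car package is available; it is loaded later in part f anyway):

library(car)      # provides vif()
vif(full_model)   # rule of thumb: values above roughly 5-10 suggest problematic collinearity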

Model Refinement

Looking at the scatterplots, I notice that mmax has the strongest relationship with estperf (correlation 0.901). The stepwise procedure would likely retain this variable. However, due to multicollinearity concerns, I might need to choose between mmin and mmax.

Backward elimination actually retains all five predictors (the full model has the lowest AIC), but given the strong correlation between log_mmin and log_mmax and a preference for a simpler model, I drop log_mmin and log_chmax and arrive at a final model with:

final_model <- lm(log_estperf ~ log_mmax + log_cach + log_chmin, data=cpus_log)
summary(final_model)
## 
## Call:
## lm(formula = log_estperf ~ log_mmax + log_cach + log_chmin, data = cpus_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41808 -0.17007 -0.02499  0.13284  0.81832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.32079    0.24169  -9.602  < 2e-16 ***
## log_mmax     0.61502    0.02915  21.097  < 2e-16 ***
## log_cach     0.26745    0.02388  11.199  < 2e-16 ***
## log_chmin    0.17315    0.02391   7.242 3.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2497 on 135 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.928 
## F-statistic: 593.8 on 3 and 135 DF,  p-value: < 2.2e-16

This model would prioritize maximum memory (mmax) and cache size (cach) as the primary predictors, potentially including minimum channels (chmin) if it adds significant explanatory power after accounting for the other variables.

Interpretation:

  1. Intercept (-2.32079): When all log-transformed predictors equal zero, the predicted estimated performance is exp(-2.32079) ≈ 0.098. This baseline is not practically meaningful, but it anchors the other effects.

  2. log_mmax (0.61502, p < 2e-16): A 1% increase in mmax (maximum main memory) is associated with an approximate 0.615% increase in estimated performance, holding the other variables constant.

  3. log_cach (0.26745, p < 2e-16): A 1% increase in cach (cache size) is associated with an approximate 0.267% increase in estimated performance, holding the other variables constant.

  4. log_chmin (0.17315, p = 3.05e-11): A 1% increase in chmin (minimum number of channels) is associated with an approximate 0.173% increase in estimated performance, holding the other variables constant.

Model Diagnostics

  • Residual Standard Error (RSE): 0.2497, indicating moderate spread in residuals.

  • F-statistic (593.8, p < 2.2e-16): Strong evidence that at least one predictor is significant.

  • Residuals: Symmetrically distributed around zero, but residual plots should still be checked for normality and homoscedasticity, as sketched below.
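A quick way to produce those standard diagnostics (a sketch using base R's plot method for lm objects):

# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1))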

c. Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.

plot(final_model$fitted.values, final_model$residuals, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red")

Observations:

  1. Non-linearity: The residuals exhibit a clear U-shaped pattern with:

    • Mostly positive residuals at low fitted values (2-3)

    • Predominantly negative residuals in the middle range (4-5)

    • Returning to positive residuals at high fitted values (6+)

  2. Heteroscedasticity: The spread of residuals is not constant:

    • Greater variability at the extremes of fitted values

    • More compressed in the middle range

  3. Potential outliers: Several observations show large residuals (around 0.8) at both the low and high ends of fitted values

Assumption Violations

Based on these observations, two major linear regression assumptions are being violated: linearity (the residuals show a systematic U-shaped trend rather than a mean of zero across fitted values) and homoscedasticity (the residual variance is not constant).

To address these issues, I modify the model as follows:

  1. Polynomial transformation: Add quadratic terms for key predictors:
poly_model <- lm(log(estperf) ~ poly(mmax, 2) + poly(mmin, 2) + log(cach), data=cpus_log)
summary(poly_model)
## 
## Call:
## lm(formula = log(estperf) ~ poly(mmax, 2) + poly(mmin, 2) + log(cach), 
##     data = cpus_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31136 -0.08852 -0.03328  0.06795  0.83412 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.59567    0.04975  72.278  < 2e-16 ***
## poly(mmax, 2)1  6.76230    0.26490  25.527  < 2e-16 ***
## poly(mmax, 2)2 -1.40273    0.19400  -7.231 3.40e-11 ***
## poly(mmin, 2)1  1.50458    0.26160   5.752 5.79e-08 ***
## poly(mmin, 2)2 -0.46350    0.20188  -2.296   0.0232 *  
## log(cach)       0.26285    0.01583  16.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1684 on 133 degrees of freedom
## Multiple R-squared:  0.9684, Adjusted R-squared:  0.9673 
## F-statistic: 816.3 on 5 and 133 DF,  p-value: < 2.2e-16
plot(poly_model$fitted.values, poly_model$residuals, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red")

  2. Interaction terms: Consider interactions between key predictors:
interaction_model <- lm(log(estperf) ~ log(mmax)*log(cach), data=cpus_log)
summary(interaction_model)
## 
## Call:
## lm(formula = log(estperf) ~ log(mmax) * log(cach), data = cpus_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36543 -0.16143 -0.04096  0.08368  1.01237 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.75866    0.62762   1.209 0.228859    
## log(mmax)            0.27227    0.07055   3.859 0.000176 ***
## log(cach)           -0.86929    0.19885  -4.372 2.44e-05 ***
## log(mmax):log(cach)  0.13063    0.02140   6.104 1.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2605 on 135 degrees of freedom
## Multiple R-squared:  0.9233, Adjusted R-squared:  0.9216 
## F-statistic: 542.1 on 3 and 135 DF,  p-value: < 2.2e-16
plot(interaction_model$fitted.values, interaction_model$residuals, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red")

Observations from the polynomial-model residual plot:

  1. Non-random distribution: The residuals show a non-random pattern, with clusters of positive residuals around fitted values of 3.5-4.5, while most other areas have residuals closer to zero or slightly negative.

  2. Potential outliers: Several points stand out with large positive residuals (around 0.6-0.8) in the 3.5-4.5 fitted value range. These could be influential observations affecting the model.

  3. Slight heteroscedasticity: The spread of residuals appears somewhat uneven across fitted values, with more variability in the lower fitted value range and less at higher values.

  4. Mean shift: The average residual appears slightly negative for higher fitted values (5.5-7), suggesting possible systematic prediction errors in that range.

d. How well does the final model fit the data? Comment on some model fit criteria for the model built in c).

Based on the residual plot from the final polynomial model, I can evaluate several aspects of model fit:

Quantitative Measures of Fit

The model demonstrates strong overall fit based on several key metrics: R-squared = 0.9684 and adjusted R-squared = 0.9673 (up from 0.928 for the simpler log-log model), a residual standard error of 0.1684 on the log scale (down from 0.2497), and a highly significant F-statistic of 816.3 on 5 and 133 degrees of freedom (p < 2.2e-16).

Coefficient Assessment

Every term in the model is statistically significant at the 0.05 level, including both polynomial terms for mmax and mmin and the log(cach) term, so each retained predictor contributes explanatory power.

Residual Analysis

Looking at the residual plot: the residuals are mostly small and centered near zero, but a cluster of large positive residuals remains around fitted values of 3.5-4.5, along with some mild heteroscedasticity, so a modest amount of structure is still unexplained.

Overall Assessment

The polynomial model represents a substantial improvement over simpler models, capturing the non-linear relationships in the data effectively. The extremely high R-squared and significant coefficients indicate a strong model. While the residual plot still shows some pattern and outliers, the majority of observations are well-predicted by the model.

The remaining outliers might represent specific CPU models with unusual performance characteristics that could be investigated further, but they don’t significantly compromise the overall excellent fit of the model.

e. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)

Interpretation of Variables in the Final Polynomial Model

The final model uses a log-transformed response variable (log(estperf)) with polynomial terms for mmax and mmin, and a log-transformed cache variable. To interpret coefficients, we must remember that the inverse of the log function is the exponential function (e^x), which allows us to convert back to the original performance scale.

Interpretation of Cache Size (log(cach))

Cache size coefficient: 0.26285

Since both the predictor (cach) and response (estperf) are log-transformed, this coefficient represents an elasticity. A 1% increase in cache size is associated with approximately a 0.26% increase in estimated performance, holding maximum and minimum memory constant. For example, increasing cache size from 100 KB to 101 KB (a 1% increase; cach is recorded in kilobytes) would be associated with roughly a 0.26% increase in the CPU's estimated performance.
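As a quick arithmetic check of this elasticity interpretation, the multiplicative change in estperf implied by a 1% increase in cach is 1.01 raised to the estimated coefficient:

1.01^0.26285 - 1   # approximately 0.0026, i.e. about a 0.26% increase in estperf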

Interpretation of Maximum Memory (poly(mmax, 2))

Linear term coefficient: 6.76230
Quadratic term coefficient: -1.40273

The relationship between maximum memory and estimated performance is non-linear. The positive linear term (6.76) combined with the negative quadratic term (-1.40) indicates that increases in maximum memory are associated with increases in estimated performance, but with diminishing returns at higher memory levels. This means the performance benefit of adding memory capacity decreases as the maximum memory gets larger.

Interpretation of Minimum Memory (poly(mmin, 2))

Linear term coefficient: 1.50458
Quadratic term coefficient: -0.46350

Similar to maximum memory, minimum memory shows a non-linear relationship with performance. The positive linear coefficient (1.50) and negative quadratic coefficient (-0.46) indicate that increasing minimum memory is associated with improved performance, but these improvements diminish at higher levels of minimum memory.

Converting to Original Performance Scale

To convert predicted values from the log scale back to the original performance scale, we use the exponential function:

Original estimated performance = e^(predicted log(estperf))

For example, if our model predicts a log(estperf) value of 5.0, the estimated performance would be e^5.0 ≈ 148.4 units on the original performance scale.
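A minimal sketch of this back-transformation in R, using the fitted poly_model from part c): exponentiating the fitted values recovers the original units of estperf.

# Predictions on the log scale, then back-transformed to the original estperf scale
pred_log <- predict(poly_model)   # log(estperf) scale
pred_raw <- exp(pred_log)         # original estperf scale
head(cbind(log_scale = pred_log, original_scale = pred_raw))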

The combined effect of all variables creates a model that captures both the importance of cache size for performance and the diminishing returns of increasing memory capacity beyond certain thresholds.

f. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.

Variance Inflation Factor (VIF) Analysis

library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
vif_values <- vif(poly_model)
print(vif_values)
##                   GVIF Df GVIF^(1/(2*Df))
## poly(mmax, 2) 3.226432  2        1.340234
## poly(mmin, 2) 3.292278  2        1.347020
## log(cach)     1.677777  1        1.295290

Interpretation of Results

The GVIF values show moderate levels of multicollinearity in the final model:

  1. Memory variables: Both polynomial terms for maximum memory (poly(mmax, 2)) and minimum memory (poly(mmin, 2)) show similar GVIF values around 3.2-3.3, which indicates some correlation between these predictors.

  2. Cache variable: The log(cach) term shows a lower GVIF of approximately 1.68, suggesting it has less correlation with other predictors.

  3. Standardized measure: The GVIF^(1/(2*Df)) values, which account for the degrees of freedom and allow for better comparison across terms with different dimensions, are all between 1.29-1.35, which is well below concerning levels.

The observed level of multicollinearity is moderate and well within acceptable limits: the adjusted GVIF^(1/(2*Df)) values are all below 1.35, far under common rule-of-thumb cutoffs (e.g., a VIF of 5-10).

No further action is necessary because:

  1. The multicollinearity is not severe enough to compromise the model’s reliability

  2. The model already demonstrates excellent fit (R² = 0.9684)

  3. All predictors remain statistically significant despite the presence of some multicollinearity

This analysis confirms that the polynomial model specification effectively balances capturing non-linear relationships while maintaining acceptable levels of multicollinearity.

g. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.

Visual Identification from Residual Plot

The residual plot shows clear evidence of potential outliers:

# Evaluate standardized residuals
std_resid <- rstandard(poly_model)

# Plot standardized residuals
plot(std_resid, type="h", main="Standardized Residuals")
abline(h=c(-3,3), col="red")  # Threshold lines at ±3

Rule of thumb: Points with standardized residuals > |3| are considered outliers.

# Calculate Cook's distance
cooks_d <- cooks.distance(poly_model)

n <- 139  # sample size after removing NA/Inf rows
p <- 5    # number of predictor terms in poly_model (excluding the intercept)
  
# Plot Cook's distance
plot(cooks_d, type="h", main="Cook's Distance")
abline(h=4/(n-p-1), col="red") 

Rule of thumb: Points with Cook’s distance > 4/(n-p-1) are influential.

# Calculate leverage values
leverage <- hatvalues(poly_model)

# Plot leverage values
plot(leverage, type="h", main="Leverage Values")
abline(h=2*p/n, col="red")

Rule of thumb: Points with leverage > 2p/n are high-leverage observations.
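Using the quantities already computed above, the observations flagged by each rule of thumb can be listed directly (a minimal sketch; the specific index numbers depend on the cleaned data):

which(abs(std_resid) > 3)          # outliers by standardized residual
which(cooks_d > 4/(n - p - 1))     # influential points by Cook's distance
which(leverage > 2*p/n)            # high-leverage observations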

Standardized Residuals Analysis

The standardized residuals plot shows:

Cook’s Distance Analysis

The Cook’s distance plot reveals:

Leverage Analysis

The leverage plot indicates:

Conclusion on Outliers and Influential Points

The diagnostic plots provide strong evidence that:

  1. Observation #10 is an extremely influential point with high leverage, high Cook’s distance, and a large standardized residual. This single observation significantly influences the regression coefficients and warrants careful investigation.

  2. Several additional outliers exist that have large standardized residuals but lower leverage and influence.

  3. A few observations near the end of the dataset have high leverage but relatively low influence according to Cook’s distance.

These findings suggest that observation #10 should potentially be investigated for data entry errors or removed in a sensitivity analysis to assess its impact on model results. Despite these outliers, the overall model appears to maintain good fit (as seen in previous analyses with R² = 0.9684).


  2. Consider the `birthwt` data from the R package `MASS`. We will investigate the relationship between low birthweight and the predictors in the data using logistic regression and discriminant analysis.

    a. Investigate the relationship between variables in the dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.
library(MASS)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the birthwt dataset
data("birthwt", package = "MASS")
# Transform categorical variables for better visualization
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0, labels = c("no", "yes"))
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low, labels = c("normal", "low")),
             age, lwt, race, smoke = factor(smoke, labels = c("no", "yes")),
             ptd, ht = factor(ht, labels = c("no", "yes")),
             ui = factor(ui, labels = c("no", "yes")), ftv)
})
# Summary statistics for numeric variables
summary(bwt)
##      low           age             lwt           race    smoke      ptd     
##  normal:130   Min.   :14.00   Min.   : 80.0   white:96   no :115   no :159  
##  low   : 59   1st Qu.:19.00   1st Qu.:110.0   black:26   yes: 74   yes: 30  
##               Median :23.00   Median :121.0   other:67                      
##               Mean   :23.24   Mean   :129.8                                 
##               3rd Qu.:26.00   3rd Qu.:140.0                                 
##               Max.   :45.00   Max.   :250.0                                 
##    ht        ui      ftv     
##  no :177   no :161   0 :100  
##  yes: 12   yes: 28   1 : 47  
##                      2+: 42  
##                              
##                              
## 
# Correlation matrix for numeric predictors and outcome
cor_matrix <- cor(bwt %>% select_if(is.numeric))
print(cor_matrix)
##           age       lwt
## age 1.0000000 0.1800732
## lwt 0.1800732 1.0000000

Observations:

The two continuous predictors, age and lwt, are only weakly correlated with each other (r ≈ 0.18), so collinearity among the numeric predictors is not a concern. The outcome is imbalanced, with 130 normal-weight and 59 low-birthweight deliveries.

Continuous Predictors vs Outcome (low)

# Boxplot for mother's age vs low birth weight
ggplot(bwt, aes(x = low, y = age)) +
  geom_boxplot() +
  labs(title = "Mother's Age vs Low Birth Weight", x = "Birth Weight Status", y = "Age")

# Boxplot for mother's weight vs low birth weight
ggplot(bwt, aes(x = low, y = lwt)) +
  geom_boxplot() +
  labs(title = "Mother's Weight vs Low Birth Weight", x = "Birth Weight Status", y = "Weight (lbs)")

Categorical Predictors vs Outcome (low)

# Bar plot for race vs low birth weight
ggplot(bwt, aes(x = race, fill = low)) +
  geom_bar(position = "fill") +
  labs(title = "Race vs Low Birth Weight", x = "Race", y = "Proportion")

# Bar plot for smoking status vs low birth weight
ggplot(bwt, aes(x = smoke, fill = low)) +
  geom_bar(position = "fill") +
  labs(title = "Smoking Status vs Low Birth Weight", x = "Smoking Status", y = "Proportion")

# Bar plot for hypertension history vs low birth weight
ggplot(bwt, aes(x = ht, fill = low)) +
  geom_bar(position = "fill") +
  labs(title = "Hypertension History vs Low Birth Weight", x = "Hypertension History", y = "Proportion")

Observations from Visualizations

  1. Continuous Predictors:

    • Mothers with lower weights (lwt) tend to have higher proportions of low-birthweight babies.

    • Younger mothers (age) appear to have slightly higher proportions of low-birthweight babies.

  2. Categorical Predictors:

    • Race: Black mothers have a higher proportion of low-birthweight babies compared to white or other racial groups.

    • Smoking: Smoking during pregnancy is strongly associated with a higher proportion of low-birthweight babies.

    • Hypertension (ht): Mothers with a history of hypertension show a higher proportion of low-birthweight babies.

  3. Surprising Findings:

    • The presence of uterine irritability (ui) does not show as strong an association with low birth weight as expected.

    • The number of physician visits during the first trimester (ftv) does not show a clear trend in reducing the risk of low birth weight.

This analysis provides a comprehensive view of the relationships between predictors and the outcome variable (low). Continuous predictors are visualized using boxplots to highlight differences in distributions across birthweight categories. Categorical predictors are visualized using bar plots to show proportions within each category. These insights will inform subsequent modeling steps using logistic regression and discriminant analysis.

b. Fit a logistic regression model using methods discussed in class/the book, similar to problem 1). Be careful to understand each variable in `birthwt` to avoid including variables that are not logically acceptable for inclusion in the model.

# Create a cleaned version of the 'birthwt' dataset
bwt <- with(birthwt, {
  # Convert race to a factor with meaningful labels:
  race <- factor(race, labels = c("white", "black", "other"))
  
  # Convert ptl (number of previous premature labors) into a binary indicator:
  ptd <- factor(ptl > 0, labels = c("none", "yes"))
  
  # Convert ftv (first trimester physician visits) into a factor,
  # recoding levels beyond the first two as "2+"
  ftv <- factor(ftv)
  levels(ftv)[-c(1,2)] <- "2+"
  
  # Build a new data frame with logically acceptable predictors.
  data.frame(low = factor(low, labels = c("normal", "low")),
             age,          # Maternal age (years)
             lwt,          # Maternal weight (lbs) at last menstrual period
             race,         # Categorical: white, black, or other 
             smoke = factor(smoke, labels = c("no", "yes")),
             ptd,          # Indicator of previous premature labors (none or yes)
             ht = factor(ht, labels = c("no", "yes")),  # History of hypertension
             ui = factor(ui, labels = c("no", "yes")),  # Uterine irritability
             ftv)          # Number of first trimester physician visits
})
# Fit the logistic regression model.
# We include only those predictors that are logically acceptable and that are known risk factors.
bw.glm <- glm(low ~ age + lwt + race + smoke + ptd + ht + ui + ftv,
              data = bwt,
              family = binomial)

summary(bw.glm)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptd + ht + ui + 
##     ftv, family = binomial, data = bwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.82302    1.24471   0.661  0.50848   
## age         -0.03723    0.03870  -0.962  0.33602   
## lwt         -0.01565    0.00708  -2.211  0.02705 * 
## raceblack    1.19241    0.53597   2.225  0.02609 * 
## raceother    0.74069    0.46174   1.604  0.10869   
## smokeyes     0.75553    0.42502   1.778  0.07546 . 
## ptdyes       1.34376    0.48062   2.796  0.00518 **
## htyes        1.91317    0.72074   2.654  0.00794 **
## uiyes        0.68019    0.46434   1.465  0.14296   
## ftv1        -0.43638    0.47939  -0.910  0.36268   
## ftv2+        0.17901    0.45638   0.392  0.69488   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 195.48  on 178  degrees of freedom
## AIC: 217.48
## 
## Number of Fisher Scoring iterations: 4

Conclusions

This logistic regression model indicates that lower maternal weight, being black (relative to white), having a history of prior premature labor, and having hypertension are significant predictors of low birth weight. The estimated odds ratios (obtained by exponentiating the coefficients) allow for a more intuitive interpretation. For example, black mothers in this dataset have about 3.3 times the odds of delivering a low-birthweight infant than white mothers, and a history of hypertension increases the odds by nearly 6.8 times.
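A short sketch of how those odds ratios (and profile-likelihood confidence intervals) can be obtained directly from the fitted model:

# Odds ratios and 95% profile-likelihood confidence intervals for bw.glm
exp(coef(bw.glm))
exp(confint(bw.glm))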

c. What do you notice regarding the variables `ptl` and `ftv`? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables' effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?

  1. ptl (Previous Premature Labors):
    In our analysis, rather than using the raw count of previous premature labors, we recoded this variable into a binary indicator (often labeled “none” versus “yes”). By doing so, the model implicitly assumes that the risk associated with previous premature labor operates as a threshold effect. That is, it assumes that having any history of premature labor increases the log odds of low birth weight by a fixed amount, and that having one versus two or more preterm labors does not make any additional difference. Essentially, once the threshold of “at least one” is crossed, the risk is elevated by the same amount for all mothers in that category.

  2. ftv (First Trimester Physician Visits):
    The variable ftv was converted into a factor with levels such as “0,” “1,” and “2+” visits. By treating ftv as a categorical variable, the model assumes that each category has its own fixed effect on the log odds of low birth weight relative to a baseline (usually the “0 visits” category). This means that the effect of having, say, 1 visit versus 0 visits is assumed to be constant, and any differences among higher numbers of visits are grouped together under “2+.” The model does not assume a continuous, incremental change in risk with each additional visit; rather, it imposes a step function where the effect jumps at the boundaries of these categories.

Are These Assumptions Realistic?

They are simplifications. Collapsing ptl to a binary indicator is reasonable here because mothers with more than one previous premature labor are rare in these data, but it discards any dose-response information; likewise, treating ftv as a three-level factor avoids assuming a constant effect per visit, but lumps all mothers with two or more visits together.

In Summary

The logistic regression model is implicitly assuming:

  • a threshold effect for ptl: any history of premature labor shifts the log odds of low birthweight by a single fixed amount, with no additional risk for two or more prior events; and

  • a step function for ftv: each visit category ("0", "1", "2+") has its own fixed effect relative to no visits, with no incremental change in risk for additional visits within a category.

These assumptions simplify the modeling process and are often used when the exact nature of the relationship is unclear or when data in some categories are sparse. However, if additional information or clinical insight suggests that the risk changes gradually (or escalates further with more preterm labors), then alternative strategies (e.g., keeping the variable continuous or using non-linear transformations such as splines) might be more realistic and informative.
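As one illustration of the alternative strategy of keeping the variable continuous, a model using the raw ptl count from the original birthwt data could be fit as below (raw_count_model is a hypothetical name introduced here for illustration, not part of the analysis above):

# Sketch: treat ptl as a raw count, so each additional premature labor shifts the log odds by a constant amount
raw_count_model <- glm(low ~ age + lwt + factor(race) + smoke + ptl + ht + ui,
                       data = birthwt, family = binomial)
summary(raw_count_model)$coefficients["ptl", ]   # per-event change in the log odds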

d. Create a new variable for `ptl` named `ptl2` which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.

Below is one approach to create a new variable, ptl2, that collapses the original ptl counts into a smaller number of categories. In the birthwt data the number of previous premature labors (ptl) is low, and counts of two or more are very rare, so we simply distinguish mothers with no history from mothers with one or more previous premature labors.

Assuming bwt is our cleaned birthwt dataset (which already contains the binary indicator ptd), we create ptl2 so that "0" = no previous premature labors and "1" = one or more:

# ptd is already a binary indicator (none vs. yes), so ptl2 simply recodes it as "0"/"1"
bwt$ptl2 <- with(bwt, ifelse(ptd == "none", "0", "1"))

# Convert ptl2 into a factor with ordered levels
bwt$ptl2 <- factor(bwt$ptl2, levels = c("0", "1"))
#count of each category
table(bwt$ptl2)
## 
##   0   1 
## 159  30

This new variable ptl2 is often more useful for analysis, especially when the sample sizes in the higher count categories are very small. By collapsing ptl into two groups (no previous premature labors versus one or more), we avoid unstable estimates for very rare categories, at the cost of no longer being able to examine a dose-response effect.
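A simple cross-tabulation (analogous to the one built for ftv2 in part e) below) shows how the low-birthweight proportion differs across the two ptl2 groups:

# Proportion of low-birthweight deliveries within each ptl2 category
prop.table(table(bwt$ptl2, bwt$low), margin = 1)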

e. Create a new variable for `ftv` named `ftv2` which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.

The original variable ftv represents the number of first-trimester physician visits. Since higher values (e.g., 3, 4, etc.) are rare, we can collapse levels into broader categories such as:

# Collapse ftv into broader categories to create ftv2
bwt$ftv2 <- with(bwt, 
                 ifelse(ftv == 0, "0", 
                        ifelse(ftv == 1, "1", "2+")))
# Convert ftv2 into a factor with ordered levels
bwt$ftv2 <- factor(bwt$ftv2, levels = c("0", "1", "2+"))

# frequency of each category
table(bwt$ftv2)
## 
##   0   1  2+ 
## 100  47  42

To better understand the relationship between ftv2 and the probability of low birthweight, we calculate the proportion of low-birthweight babies (low = "low") within each category of ftv2.

# Summarize probabilities of low birthweight by ftv2 levels
summary_table <- bwt %>%
  group_by(ftv2) %>%
  summarize(
    total = n(),
    low_count = sum(low == "low"),
    probability_low = mean(low == "low")
  )

print(summary_table)
## # A tibble: 3 × 4
##   ftv2  total low_count probability_low
##   <fct> <int>     <int>           <dbl>
## 1 0       100        36           0.36 
## 2 1        47        11           0.234
## 3 2+       42        12           0.286

Interpretation

  1. No Visits (ftv2 = "0"):

    • Out of 100 mothers who had no first-trimester physician visits, 36 delivered low-birthweight babies.

    • The probability of low birthweight in this group is 36%.

    • This suggests that mothers who did not receive prenatal care during the first trimester are at higher risk for low birthweight.

  2. One Visit (ftv2 = "1"):

    • Out of 47 mothers who had exactly one visit, 11 delivered low-birthweight babies.

    • The probability of low birthweight in this group is 23.4%.

    • In this sample, mothers with one prenatal visit had a noticeably lower rate of low birthweight than mothers with no visits, though this is an association rather than evidence of a causal effect.

  3. Two or More Visits (ftv2 = "2+"):

    • Out of 42 mothers who had two or more visits, 12 delivered low-birthweight babies.

    • The probability of low birthweight in this group is 28.6%.

    • While the risk is lower than for no visits, it is slightly higher than for one visit, suggesting that the protective effect may not increase substantially beyond one visit in this dataset.

Observations

f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use `ptl2` and `ftv2` in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?

bwt$race <- as.factor(bwt$race)
bwt$smoke <- as.factor(bwt$smoke)
bwt$ptl2 <- as.factor(bwt$ptl2)
bwt$ht <- as.factor(bwt$ht)
bwt$ui <- as.factor(bwt$ui)
bwt$ftv2 <- as.factor(bwt$ftv2)
# Fit a logistic regression model with ptl2 and ftv2
updated_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
                     data = bwt,
                     family = binomial)

summary(updated_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = bwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.82302    1.24471   0.661  0.50848   
## age         -0.03723    0.03870  -0.962  0.33602   
## lwt         -0.01565    0.00708  -2.211  0.02705 * 
## raceblack    1.19241    0.53597   2.225  0.02609 * 
## raceother    0.74069    0.46174   1.604  0.10869   
## smokeyes     0.75553    0.42502   1.778  0.07546 . 
## ptl21        1.34376    0.48062   2.796  0.00518 **
## htyes        1.91317    0.72074   2.654  0.00794 **
## uiyes        0.68019    0.46434   1.465  0.14296   
## ftv21       -0.43638    0.47939  -0.910  0.36268   
## ftv22+       0.17901    0.45638   0.392  0.69488   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 195.48  on 178  degrees of freedom
## AIC: 217.48
## 
## Number of Fisher Scoring iterations: 4

Observations:

1. ptl2 (Previous Premature Labors):
  -  The new variable ptl2 appears in the output as an indicator variable ("ptl21", representing mothers with at least one previous premature labor, with "0" as the reference).
  -  The coefficient for ptl21 is 1.34376 and is statistically significant (p = 0.00518).
  -  Exponentiating this coefficient (exp(1.34376) ≈ 3.83) suggests that, compared to mothers with no previous premature labor, mothers with one or more previous premature labors have about 3.8 times higher odds of delivering a low-birthweight infant.
  -  Because ptl2 collapses all mothers with any history of premature labor into a single category, the contrast captured here is none versus one or more; no separate "2+" effect is estimated.
  -  Thus, the recoding of ptl into ptl2 clearly distinguishes risk groups and is important in predicting low birthweight.

2. ftv2 (First-Trimester Physician Visits):
  -  The new variable ftv2 is represented in two contrasts: “ftv21” (one visit vs. none) and “ftv22+” (two or more visits vs. none).
  -  Neither of these contrasts are statistically significant (p = 0.36268 and p = 0.69488, respectively).
  -  The coefficient for ftv21 is -0.43638, which (if significant) would have indicated a protective effect (odds ratio exp(-0.43638) ≈ 0.646), but its lack of significance suggests that, after controlling for other factors, the difference between having no visits versus one visit is not statistically apparent.
  -  Likewise, the contrast for ftv22+ does not reach significance.
  -  Thus, the recoded ftv2 variable does not appear to be important in predicting low birthweight in this multivariable model.
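Rather than exponentiating individual coefficients by hand, the full set of odds ratios (with confidence intervals) for the updated model can be obtained in one step, as in this sketch:

# Odds ratios and 95% profile-likelihood confidence intervals for the updated model
round(exp(cbind(OR = coef(updated_model), confint(updated_model))), 3)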

Overall Comments

Replacing ptl with ptl2 sharpens the model's ability to flag mothers with any history of premature labor, while the collapsed ftv2 adds little predictive value. Note that the fitted model is numerically identical to the one in b) (same coefficients, residual deviance 195.48, AIC 217.48), because ptl2 and ftv2 encode exactly the same information as the ptd and ftv factors used there.

g. In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the `TeachingDemos` package in a manner similar to the notes (but don't use my name to set the seed!)

library(TeachingDemos)  # For setting the seed
## Warning: package 'TeachingDemos' was built under R version 4.4.3
library(MASS)          # For the birthwt dataset and lda/qda functions
library(caret)         # For confusionMatrix
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Set the seed using TeachingDemos
char2seed("birthwt_analysis")
### Split the Data into Training (80%) and Test (20%) Sets

set.seed(123)  # Note: this call overrides the char2seed() seed above, so the split below is governed by set.seed(123)
n <- nrow(bwt)
train_idx <- sample(1:n, size = round(0.8 * n), replace = FALSE)
train_bwt <- bwt[train_idx, ]
test_bwt  <- bwt[-train_idx, ]

cat("Training set size:", nrow(train_bwt), "\n")
## Training set size: 151
cat("Test set size:", nrow(test_bwt), "\n")
## Test set size: 38
### Discriminant Analysis Models

# For discriminant analysis, we include only continuous predictors that are more likely to satisfy
# the multivariate normality assumption. Here we choose age and lwt.
lda_model <- lda(low ~ age + lwt, data = train_bwt)
qda_model <- qda(low ~ age + lwt, data = train_bwt)

# Make predictions on the test set
lda_pred <- predict(lda_model, newdata = test_bwt)$class
qda_pred <- predict(qda_model, newdata = test_bwt)$class
# Create confusion matrices using caret's confusionMatrix function (treat "low" as the positive outcome)
conf_lda <- confusionMatrix(lda_pred, test_bwt$low, positive = "low")
conf_qda <- confusionMatrix(qda_pred, test_bwt$low, positive = "low")
cat("LDA Confusion Matrix:\n")
## LDA Confusion Matrix:
print(conf_lda$table)
##           Reference
## Prediction normal low
##     normal     27  11
##     low         0   0
cat("\nLDA Metrics:\n")
## 
## LDA Metrics:
print(conf_lda$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity 
##           0           1
cat("LDA Accuracy:", conf_lda$overall["Accuracy"], "\n\n")
## LDA Accuracy: 0.7105263
cat("QDA Confusion Matrix:\n")
## QDA Confusion Matrix:
print(conf_qda$table)
##           Reference
## Prediction normal low
##     normal     27  11
##     low         0   0
cat("\nQDA Metrics:\n")
## 
## QDA Metrics:
print(conf_qda$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity 
##           0           1
cat("QDA Accuracy:", conf_qda$overall["Accuracy"], "\n\n")
## QDA Accuracy: 0.7105263
## Logistic Regression Models

# The logistic regression models from parts b) and f) use a larger set of predictors.
# Here we fit the updated logistic regression using age, lwt, race, smoke, ptl2, ht, ui, and ftv2.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
                   data = train_bwt, family = binomial)

# Predict probabilities on the test set and convert them to classes using threshold 0.5.
logit_pred_prob <- predict(logit_model, newdata = test_bwt, type = "response")
logit_pred <- ifelse(logit_pred_prob > 0.5, "low", "normal")
logit_pred <- factor(logit_pred, levels = c("normal", "low"))

conf_logit <- confusionMatrix(logit_pred, test_bwt$low, positive = "low")

cat("Logistic Regression Confusion Matrix:\n")
## Logistic Regression Confusion Matrix:
print(conf_logit$table)
##           Reference
## Prediction normal low
##     normal     24   9
##     low         3   2
cat("\nLogistic Regression Metrics:\n")
## 
## Logistic Regression Metrics:
print(conf_logit$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity 
##   0.1818182   0.8888889
cat("Logistic Regression Accuracy:", conf_logit$overall["Accuracy"], "\n")
## Logistic Regression Accuracy: 0.6842105

Interpretation:

1. LDA and QDA models:
  - Both methods predict every observation as “normal” (i.e. the majority class). This results in a moderate overall accuracy of about 71%, but it completely fails to identify any low birthweight cases (sensitivity = 0).
  - Although high specificity (100%) is obtained, for a serious clinical condition like low birthweight, leaving all “low” cases undetected is not acceptable—even if overall accuracy seems reasonable.

2. Logistic Regression:
  - The logistic regression model, which uses a richer set of predictors (including ptl2 and ftv2), correctly identifies 2 out of 11 low birthweight cases (sensitivity ≈ 18.2%).
  - Its overall accuracy is slightly lower (68.4%), and specificity is slightly lower (88.9%) compared to LDA/QDA, but it has the advantage of at least detecting some of the at-risk (“low”) cases.

3. Which Model Performs the Best?
  - Overall Accuracy: LDA/QDA have a slight edge (≈71%) compared to logistic regression (≈68%); however, this is largely because the data are imbalanced and the “normal” class dominates.
  - Sensitivity: Logistic regression is the only model that correctly identifies any low birthweight cases (18.2% versus 0% for LDA and QDA), which is critical if the goal is to screen for high-risk pregnancies.
  - Conclusion: In a real-world scenario, especially in a clinical setting where missing a high-risk case (low birthweight) has serious consequences, you would prioritize sensitivity.
    Therefore, logistic regression performs better in this context because, despite its slightly lower overall accuracy, it identifies at least some of the low birthweight cases while LDA and QDA miss them entirely. One simple way to improve sensitivity further is to lower the 0.5 classification threshold, as sketched below.
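As an illustration of that trade-off (the 0.3 cutoff below is an arbitrary choice for demonstration, not a recommendation), the test-set predictions can be reclassified with a lower threshold:

# Reclassify with a lower probability threshold to favor sensitivity over specificity
logit_pred_30 <- factor(ifelse(logit_pred_prob > 0.3, "low", "normal"),
                        levels = c("normal", "low"))
confusionMatrix(logit_pred_30, test_bwt$low, positive = "low")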

h. Using your final model from f), interpret the estimates for all covariates.

summary(logit_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = train_bwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.257670   1.474143  -0.175   0.8612  
## age         -0.006270   0.042593  -0.147   0.8830  
## lwt         -0.013617   0.008361  -1.629   0.1034  
## raceblack    1.104833   0.643369   1.717   0.0859 .
## raceother    1.136605   0.545198   2.085   0.0371 *
## smokeyes     0.981383   0.507433   1.934   0.0531 .
## ptl21        1.178750   0.510017   2.311   0.0208 *
## htyes        1.796111   0.859947   2.089   0.0367 *
## uiyes        0.821373   0.508352   1.616   0.1061  
## ftv21       -0.388984   0.521600  -0.746   0.4558  
## ftv22+      -0.462438   0.543327  -0.851   0.3947  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 188.83  on 150  degrees of freedom
## Residual deviance: 155.79  on 140  degrees of freedom
## AIC: 177.79
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

1. (Intercept): –0.25767
 -  This is the estimated log-odds of low birthweight when all predictors are at their reference (or zero, for continuous variables) values.
 -  Practically, since age and lwt are continuous and the reference categories for the factors are “white” for race, “no” for smoke, ht, ui, “0” for ftv2, and “0” for ptl2, the intercept is not directly interpretable on its own—but it serves as the baseline for combining effects from the other variables.


2. age: –0.00627
 -  Each additional year of maternal age is associated with a decrease in the log-odds of low birthweight by 0.00627.
 -  Expressed as an OR: exp(–0.00627) ≈ 0.994; that is, for each extra year, the odds of low birthweight decrease by about 0.6%. This effect is very small and not statistically significant (p = 0.883).


3. lwt (maternal weight): –0.01362
 -  For each additional pound in maternal weight, the log-odds of low birthweight decrease by 0.01362.
 -  OR ≈ exp(–0.01362) ≈ 0.986, meaning roughly a 1.4% reduction in the odds per additional pound. This suggests that heavier mothers tend to have slightly lower odds of delivering a low-birthweight baby, although this effect is only marginally significant (p = 0.103).


4. race (categorical; reference: white)
 -  raceblack: 1.10483
  – Black mothers have log-odds of low birthweight that are 1.10483 higher than those for white mothers.
  – OR ≈ exp(1.10483) ≈ 3.02, so black mothers have about 3 times higher odds than white mothers; this effect is marginally significant (p = 0.086).
 -  raceother: 1.13661
  – Mothers classified as “other” race have a log-odds increase of 1.13661 relative to white mothers.
  – OR ≈ exp(1.13661) ≈ 3.12, meaning their odds are roughly 3.1 times higher than white mothers; this effect is statistically significant (p = 0.037).


5. smokeyes: 0.98138
 -  Mothers who smoke (compared to those who do not) have a 0.98138 higher log-odds of low birthweight.
 -  OR ≈ exp(0.98138) ≈ 2.67, so smoking is associated with about 2.7 times higher odds of low birthweight (p = 0.053, marginal significance).


6. ptl2 (previous premature labor; reference: "0" = none)
 -  ptl21: 1.17875
  – Mothers with one or more previous premature labors have log-odds that are 1.17875 higher than mothers with no previous premature labor.
  – OR ≈ exp(1.17875) ≈ 3.25, meaning these mothers have roughly 3.3 times greater odds of having a low-birthweight baby; this finding is statistically significant (p = 0.021).


7. ht (hypertension; htyes): 1.79611
 -  Mothers with a history of hypertension have log-odds that are 1.79611 higher than mothers without hypertension.
 -  OR ≈ exp(1.79611) ≈ 6.02, suggesting that hypertension increases the odds of low birthweight by about sixfold (p = 0.037).


8. ui (uterine irritability; uiyes): 0.82137
 -  Mothers with uterine irritability have 0.82137 higher log-odds of low birthweight than those without.
 -  OR ≈ exp(0.82137) ≈ 2.27, indicating roughly double the odds, though this effect is not statistically significant (p = 0.106).


9. ftv2 (first-trimester physician visits; reference: “0” visits)
 -  ftv21: –0.38898
  – Mothers with exactly one visit (compared to no visits) have a 0.38898 lower log-odds of low birthweight.
  – OR ≈ exp(–0.38898) ≈ 0.678, implying about a 32% reduction in the odds, but this effect is not statistically significant (p = 0.456).
 -  ftv22+: –0.46244
  – Mothers with two or more visits (compared to no visits) have a 0.46244 lower log-odds of low birthweight.
  – OR ≈ exp(–0.46244) ≈ 0.630, corresponding to about a 37% reduction in the odds. This effect is also not statistically significant (p = 0.395).

Overall Summary

These interpretations help illustrate which maternal factors play a key role in predicting low birthweight and provide insight into where preventive or interventional efforts might be focused.