Consider the cpus data from the R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance (perf) as a predictor of performance in this problem.
a. Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.
library(MASS)
library(ggplot2)
# Load the cpus dataset (from MASS)
data(cpus, package="MASS")
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
cor_matrix <- cor(cpus[, -1])
print(cor_matrix)
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
## perf -0.3070821 0.7949233 0.8629942 0.6626135 0.6089025 0.6052193
## estperf -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556
## perf estperf
## syct -0.3070821 -0.2883956
## mmin 0.7949233 0.8192915
## mmax 0.8629942 0.9012024
## cach 0.6626135 0.6486203
## chmin 0.6089025 0.6105802
## chmax 0.6052193 0.5921556
## perf 1.0000000 0.9664687
## estperf 0.9664687 1.0000000
pairs(cpus[, -1], main = "Scatterplot Matrix of cpus Data")
Observations:
- Strongest predictors of estperf:
  - mmax has the highest correlation (0.901)
  - mmin shows a strong correlation (0.819)
  - cach (0.649), chmin (0.611), and chmax (0.592) show moderate correlations
- Notable multicollinearity concerns:
  - mmin and mmax are highly correlated (0.758)
  - several moderate correlations exist between predictors (0.52-0.58 range)
  - perf and estperf are highly correlated (0.966) as expected, but perf will not be used as a predictor
- Many variables show right-skewed distributions, and non-linear relationships appear in several scatterplots (see the sketch after this list)
- Potential outliers exist at higher variable values
- The relationships between mmax/mmin and estperf appear particularly strong
- Data points tend to cluster at lower values for most variables
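To make the skewness and non-linearity concrete, here is a small sketch (using the cpus data loaded above) that plots estperf against mmax, its most strongly correlated predictor, on the raw and log-log scales; the log-log view is noticeably closer to linear.
# Sketch: raw vs log-log view of estperf against mmax
par(mfrow = c(1, 2))
plot(cpus$mmax, cpus$estperf, xlab = "mmax", ylab = "estperf", main = "Raw scale")
plot(log(cpus$mmax), log(cpus$estperf), xlab = "log(mmax)", ylab = "log(estperf)",
     main = "Log-log scale")
par(mfrow = c(1, 1))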
b. Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider perf in this modeling approach. Explain the process used to arrive at your final model.
Looking at the scatterplots, I notice that many variables exhibit right-skewed distributions with non-linear relationships to estperf. The first step is to apply a log transformation to both the response and the predictor variables in order to linearize relationships between variables, reduce skewness in the distributions, and stabilize variance across the range of predictions.
data <- cpus[, !(names(cpus) %in% c("name", "perf"))]
cpus_log <- data
cpus_log$log_estperf <- log(data$estperf)
cpus_log$log_chmin <- log(data$chmin)
cpus_log$log_chmax <- log(data$chmax)
cpus_log$log_mmin <- log(data$mmin)
cpus_log$log_mmax <- log(data$mmax)
cpus_log$log_cach <- log(data$cach)
cpus_log <- as.data.frame(cpus_log)
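A side note before fitting: cach, chmin, and chmax contain zeros (see the summary output above), so log() returns -Inf for those machines, and the filtering step below drops them, leaving 139 of the 209 machines for modeling. A minimal alternative sketch, not used in the rest of this analysis, would keep every machine by applying a log(x + 1) transform to the variables that can be zero:
# Alternative (not used below): log1p() keeps machines with zero cach, chmin, or chmax
cpus_log_alt <- data
cpus_log_alt$log_estperf <- log(data$estperf)
cpus_log_alt$log_mmin <- log(data$mmin)
cpus_log_alt$log_mmax <- log(data$mmax)
cpus_log_alt$log_cach <- log1p(data$cach)   # log(cach + 1) is finite at 0
cpus_log_alt$log_chmin <- log1p(data$chmin) # log(chmin + 1) is finite at 0
cpus_log_alt$log_chmax <- log1p(data$chmax) # log(chmax + 1) is finite at 0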
I’ll use a combination of techniques to determine the best predictors:
cpus_log <- na.omit(cpus_log) # Removes rows with NAs
cpus_log <- cpus_log[!apply(cpus_log, 1, function(x) any(is.infinite(x))), ] # Removes rows with Inf values
full_model <- lm(log_estperf ~ log_chmin + log_chmax + log_mmin + log_mmax + log_cach, data=cpus_log)
summary(full_model)
##
## Call:
## lm(formula = log_estperf ~ log_chmin + log_chmax + log_mmin +
## log_mmax + log_cach, data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42690 -0.16513 -0.02378 0.12030 0.83766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.46464 0.23226 -10.611 < 2e-16 ***
## log_chmin 0.11114 0.02681 4.145 6.02e-05 ***
## log_chmax 0.08956 0.02815 3.181 0.001824 **
## log_mmin 0.12792 0.03482 3.674 0.000345 ***
## log_mmax 0.51887 0.03581 14.492 < 2e-16 ***
## log_cach 0.23201 0.02388 9.718 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.235 on 133 degrees of freedom
## Multiple R-squared: 0.9385, Adjusted R-squared: 0.9362
## F-statistic: 406.1 on 5 and 133 DF, p-value: < 2.2e-16
step_model <- step(full_model, direction="backward")
## Start: AIC=-396.68
## log_estperf ~ log_chmin + log_chmax + log_mmin + log_mmax + log_cach
##
## Df Sum of Sq RSS AIC
## <none> 7.3473 -396.68
## - log_chmax 1 0.5591 7.9064 -388.49
## - log_mmin 1 0.7458 8.0930 -385.24
## - log_chmin 1 0.9490 8.2962 -381.80
## - log_cach 1 5.2168 12.5641 -324.10
## - log_mmax 1 11.6012 18.9485 -266.99
The backward stepwise procedure retains all five predictors: removing any term would increase the AIC, and log_mmax, the variable most strongly correlated with estperf (0.901), is by far the largest contributor. However, given the multicollinearity between mmin and mmax noted earlier, I prefer not to keep both memory terms.
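To quantify that concern before dropping terms, a quick sketch (using the car package, which is loaded again in part f) computes variance inflation factors for the full log-scale model:
library(car)     # provides vif(); also used in part f)
vif(full_model)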
Based on the relative contributions in the step output and these multicollinearity considerations, I arrive at a more parsimonious final model:
final_model <- lm(log_estperf ~ log_mmax + log_cach + log_chmin, data=cpus_log)
summary(final_model)
##
## Call:
## lm(formula = log_estperf ~ log_mmax + log_cach + log_chmin, data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41808 -0.17007 -0.02499 0.13284 0.81832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.32079 0.24169 -9.602 < 2e-16 ***
## log_mmax 0.61502 0.02915 21.097 < 2e-16 ***
## log_cach 0.26745 0.02388 11.199 < 2e-16 ***
## log_chmin 0.17315 0.02391 7.242 3.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2497 on 135 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.928
## F-statistic: 593.8 on 3 and 135 DF, p-value: < 2.2e-16
This model prioritizes maximum memory (mmax) and cache size (cach) as the primary predictors, with minimum channels (chmin) also contributing significant explanatory power after accounting for the other variables.
Interpretation:
- Intercept (-2.32079): the predicted value of log_estperf when all log-transformed predictors are zero (i.e., when mmax, cach, and chmin all equal 1); it anchors the model rather than having a direct practical interpretation.
- log_mmax (0.61502, p < 2e-16): a 1% increase in mmax (maximum main memory) is associated with an approximate 0.615% increase in estimated performance, holding the other variables constant.
- log_cach (0.26745, p < 2e-16): a 1% increase in cach (cache size) is associated with an approximate 0.267% increase in estimated performance.
- log_chmin (0.17315, p = 3.05e-11): a 1% increase in chmin (minimum number of channels) is associated with an approximate 0.173% increase in estimated performance.
- Residual standard error (RSE): 0.2497, indicating moderate spread in the residuals.
- F-statistic (593.8, p < 2.2e-16): strong evidence that at least one predictor is significant.
- Residuals: roughly symmetric around zero, but residual plots should still be checked for normality and homoscedasticity.
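As a quick numerical check on the log-log (elasticity) interpretation, a sketch using the fitted coefficient for log_mmax computes the multiplicative change in estperf implied by a 1% increase in mmax:
# Effect of a 1% increase in mmax on estperf, holding the other predictors fixed
b_mmax <- coef(final_model)["log_mmax"]
exp(b_mmax * log(1.01)) - 1   # roughly 0.0061, i.e. about a 0.61% increase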
c. Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
plot(final_model$fitted.values, final_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red")
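Beyond the single residuals-versus-fitted plot, the standard diagnostic panels for an lm object (QQ plot, scale-location, residuals versus leverage) help check normality and homoscedasticity; a minimal sketch:
# Standard diagnostic panels for the final model
par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1))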
Observations:
Non-linearity: The residuals exhibit a clear U-shaped pattern with:
Mostly positive residuals at low fitted values (2-3)
Predominantly negative residuals in the middle range (4-5)
Returning to positive residuals at high fitted values (6+)
Heteroscedasticity: The spread of residuals is not constant:
Greater variability at the extremes of fitted values
More compressed in the middle range
Potential outliers: Several observations show large residuals (around 0.8) at both the low and high ends of fitted values
Based on these observations, two major linear regression assumptions are being violated:
Linearity assumption: The U-shaped pattern indicates non-linear relationships
Homoscedasticity: The varying spread of residuals indicates non-constant variance
To address these issues, I modify the model as follows:
poly_model <- lm(log(estperf) ~ poly(mmax, 2) + poly(mmin, 2) + log(cach), data=cpus_log)
summary(poly_model)
##
## Call:
## lm(formula = log(estperf) ~ poly(mmax, 2) + poly(mmin, 2) + log(cach),
## data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31136 -0.08852 -0.03328 0.06795 0.83412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.59567 0.04975 72.278 < 2e-16 ***
## poly(mmax, 2)1 6.76230 0.26490 25.527 < 2e-16 ***
## poly(mmax, 2)2 -1.40273 0.19400 -7.231 3.40e-11 ***
## poly(mmin, 2)1 1.50458 0.26160 5.752 5.79e-08 ***
## poly(mmin, 2)2 -0.46350 0.20188 -2.296 0.0232 *
## log(cach) 0.26285 0.01583 16.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1684 on 133 degrees of freedom
## Multiple R-squared: 0.9684, Adjusted R-squared: 0.9673
## F-statistic: 816.3 on 5 and 133 DF, p-value: < 2.2e-16
plot(poly_model$fitted.values, poly_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red")
interaction_model <- lm(log(estperf) ~ log(mmax)*log(cach), data=cpus_log)
summary(interaction_model)
##
## Call:
## lm(formula = log(estperf) ~ log(mmax) * log(cach), data = cpus_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36543 -0.16143 -0.04096 0.08368 1.01237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.75866 0.62762 1.209 0.228859
## log(mmax) 0.27227 0.07055 3.859 0.000176 ***
## log(cach) -0.86929 0.19885 -4.372 2.44e-05 ***
## log(mmax):log(cach) 0.13063 0.02140 6.104 1.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2605 on 135 degrees of freedom
## Multiple R-squared: 0.9233, Adjusted R-squared: 0.9216
## F-statistic: 542.1 on 3 and 135 DF, p-value: < 2.2e-16
plot(interaction_model$fitted.values, interaction_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red")
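Since all three candidate models use log(estperf) as the response and are fit to the same filtered cpus_log data, a quick sketch compares them on AIC and adjusted R-squared before settling on the polynomial specification:
# Compare candidate models; lower AIC and higher adjusted R-squared are better
AIC(final_model, poly_model, interaction_model)
c(final = summary(final_model)$adj.r.squared,
  poly = summary(poly_model)$adj.r.squared,
  interaction = summary(interaction_model)$adj.r.squared)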
Observations from the residual plot of the polynomial model:
Non-random distribution: The residuals show a non-random pattern, with clusters of positive residuals around fitted values of 3.5-4.5, while most other areas have residuals closer to zero or slightly negative.
Potential outliers: Several points stand out with large positive residuals (around 0.6-0.8) in the 3.5-4.5 fitted value range. These could be influential observations affecting the model.
Slight heteroscedasticity: The spread of residuals appears somewhat uneven across fitted values, with more variability in the lower fitted value range and less at higher values.
Mean shift: The average residual appears slightly negative for higher fitted values (5.5-7), suggesting possible systematic prediction errors in that range.
d. How well does the final model fit the data? Comment on some model fit criteria from the model built in c)
Based on the residual plot from the final polynomial model, I can evaluate several aspects of model fit:
The model demonstrates excellent overall fit based on several key metrics:
R-squared: 0.9684 - This indicates that approximately 96.84% of the variance in log(estperf) is explained by the model, which is exceptionally high
Adjusted R-squared: 0.9673 - The adjustment for the number of predictors shows the model is not overfitting
F-statistic: 816.3 (p < 2.2e-16) - The extremely high F-value and significant p-value confirm the model is statistically valid
Residual standard error: 0.1684 - A relatively small value indicating good prediction accuracy
All predictors are statistically significant, with most at p < 0.001
The quadratic terms for both mmax and mmin are significant, justifying the polynomial approach
The log(cach) term is highly significant (t = 16.601), confirming cache size is an important predictor
Looking at the residual plot:
Central tendency: Most residuals are clustered around zero, with a median of -0.03328
Outliers: Several notable positive outliers (maximum of 0.83412) appear in the lower fitted value range
Pattern: While improved from earlier models, some non-random pattern remains visible with more positive residuals at lower fitted values
Heteroscedasticity: Some inconsistency in residual spread across fitted values remains, with greater variance at lower fitted values
The polynomial model represents a substantial improvement over simpler models, capturing the non-linear relationships in the data effectively. The extremely high R-squared and significant coefficients indicate a strong model. While the residual plot still shows some pattern and outliers, the majority of observations are well-predicted by the model.
The remaining outliers might represent specific CPU models with unusual performance characteristics that could be investigated further, but they don’t significantly compromise the overall excellent fit of the model.
e. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
The final model uses a log-transformed response variable (log(estperf)) with polynomial terms for mmax and mmin, and a log-transformed cache variable. To interpret coefficients, we must remember that the inverse of the log function is the exponential function (e^x), which allows us to convert back to the original performance scale.
Cache size coefficient: 0.26285
Since both the predictor (cach) and the response (estperf) are log-transformed, this coefficient represents an elasticity. A 1% increase in cache size is associated with approximately a 0.26% increase in estimated performance, holding maximum and minimum memory constant. For example, increasing cache size from 100 KB to 101 KB (a 1% increase; cach is recorded in kilobytes in this dataset) would result in approximately a 0.26% increase in the CPU's estimated performance.
Maximum memory (mmax) polynomial terms:
Linear term coefficient: 6.76230
Quadratic term coefficient: -1.40273
The relationship between maximum memory and estimated performance is non-linear. The positive linear term (6.76) combined with the negative quadratic term (-1.40) indicates that increases in maximum memory are associated with increases in estimated performance, but with diminishing returns at higher memory levels. This means the performance benefit of adding memory capacity decreases as the maximum memory gets larger.
Minimum memory (mmin) polynomial terms:
Linear term coefficient: 1.50458
Quadratic term coefficient: -0.46350
Similar to maximum memory, minimum memory shows a non-linear relationship with performance. The positive linear coefficient (1.50) and negative quadratic coefficient (-0.46) indicate that increasing minimum memory is associated with improved performance, but these improvements diminish at higher levels of minimum memory.
To convert predicted values from the log scale back to the original performance scale, we use the exponential function:
Original estimated performance = e^(predicted log(estperf))
For example, if our model predicts a log(estperf) value of 5.0, the estimated performance would be e^5.0 ≈ 148.4 units on the original performance scale.
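A short sketch of this back-transformation using the fitted polynomial model (the first few rows of the data are used purely for illustration):
# Back-transform predictions from the log scale to the original estperf scale
pred_log <- predict(poly_model, newdata = head(cpus_log))
pred_raw <- exp(pred_log)   # the exponential undoes the natural log
cbind(log_scale = round(pred_log, 2), original_scale = round(pred_raw, 1))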
The combined effect of all variables creates a model that captures both the importance of cache size for performance and the diminishing returns of increasing memory capacity beyond certain thresholds.
f. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
vif_values <- vif(poly_model)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## poly(mmax, 2) 3.226432 2 1.340234
## poly(mmin, 2) 3.292278 2 1.347020
## log(cach) 1.677777 1 1.295290
The GVIF values show moderate levels of multicollinearity in the final model:
- Memory variables: both polynomial terms, poly(mmax, 2) for maximum memory and poly(mmin, 2) for minimum memory, show similar GVIF values around 3.2-3.3, which indicates some correlation between these predictors.
- Cache variable: the log(cach) term shows a lower GVIF of approximately 1.68, suggesting it has less correlation with the other predictors.
- Standardized measure: the GVIF^(1/(2*Df)) values, which account for the degrees of freedom and allow for better comparison across terms with different dimensions, are all between 1.29 and 1.35, which is well below concerning levels.
The observed level of multicollinearity is:
Mild to moderate - All GVIF values are below 5, which is generally considered acceptable
Expected - Given the correlation (0.758) previously observed between mmin and mmax
Managed effectively - the use of orthogonal polynomials through the poly() function has helped control potential collinearity issues
No further action is necessary because:
The multicollinearity is not severe enough to compromise the model’s reliability
The model already demonstrates excellent fit (R² = 0.9684)
All predictors remain statistically significant despite the presence of some multicollinearity
This analysis confirms that the polynomial model specification effectively balances capturing non-linear relationships while maintaining acceptable levels of multicollinearity.
g. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.
The residual plot shows clear evidence of potential outliers:
Several observations with extremely large positive residuals (approximately 0.6-0.8) around fitted values of 3.5-4.0
One particularly extreme point with a residual of approximately 0.8 at a fitted value of around 4.0
Most residuals are concentrated between -0.2 and 0.2, making these extreme points stand out significantly
# Evaluate standardized residuals
std_resid <- rstandard(poly_model)
# Plot standardized residuals
plot(std_resid, type="h", main="Standardized Residuals")
abline(h=c(-3,3), col="red") # Threshold lines at ±3
Rule of thumb: Points with standardized residuals > |3| are considered outliers.
# Calculate Cook's distance
cooks_d <- cooks.distance(poly_model)
n <- 139  # sample size used to fit poly_model (after removing NA/Inf rows)
p <- 5    # number of predictor terms (excluding the intercept)
# Plot Cook's distance
plot(cooks_d, type="h", main="Cook's Distance")
abline(h=4/(n-p-1), col="red")
Rule of thumb: Points with Cook’s distance > 4/(n-p-1) are influential.
# Calculate leverage values
leverage <- hatvalues(poly_model)
# Plot leverage values
plot(leverage, type="h", main="Leverage Values")
abline(h=2*p/n, col="red")
Rule of thumb: Points with leverage > 2p/n are high-leverage observations.
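A sketch that applies these rules of thumb directly, so the flagged rows can be listed rather than read off the plots (indices refer to rows of the filtered cpus_log data):
# Observations flagged by each rule of thumb
which(abs(std_resid) > 3)          # outliers by standardized residual
which(cooks_d > 4 / (n - p - 1))   # influential points by Cook's distance
which(leverage > 2 * p / n)        # high-leverage points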
The standardized residuals plot shows:
Several observations with values exceeding the critical threshold of ±3 (indicated by the red line)
Notable outliers appear at indices ~10, ~20, ~30, and ~60
These points have standardized residuals ranging from approximately 3 to 5
Most observations fall within the expected range of -2 to +2
The Cook’s distance plot reveals:
Observation #10 has an extremely high Cook’s distance of approximately 2.5
This far exceeds the threshold (red line) of 4/(n-p-1), which appears to be close to 0.03
Several other observations show minor elevations in Cook’s distance (near indices 60, 110, and 140)
Observation #10 has substantially greater influence than any other point in the dataset
The leverage plot indicates:
Observation #10 has remarkably high leverage (~0.9)
Two observations near the end of the dataset (around index 140) also have elevated leverage (~0.35)
These values exceed the typical threshold of 2p/n (shown by the red line)
Most observations have very low leverage values
The diagnostic plots provide strong evidence that:
Observation #10 is an extremely influential point with high leverage, high Cook’s distance, and a large standardized residual. This single observation significantly influences the regression coefficients and warrants careful investigation.
Several additional outliers exist that have large standardized residuals but lower leverage and influence.
A few observations near the end of the dataset have high leverage but relatively low influence according to Cook’s distance.
These findings suggest that observation #10 should potentially be investigated for data entry errors or removed in a sensitivity analysis to assess its impact on model results. Despite these outliers, the overall model appears to maintain good fit (as seen in previous analyses with R² = 0.9684).
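A minimal sensitivity-analysis sketch along those lines refits the polynomial model without the single most influential observation (identified from Cook's distance rather than hard-coded) and compares the coefficient estimates:
# Refit without the most influential point and compare coefficients
infl_idx <- which.max(cooks_d)
poly_refit <- update(poly_model, data = cpus_log[-infl_idx, ])
round(cbind(all_data = coef(poly_model), without_point = coef(poly_refit)), 3)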
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the birthwt dataset
data("birthwt", package = "MASS")
# Transform categorical variables for better visualization
bwt <- with(birthwt, {
race <- factor(race, labels = c("white", "black", "other"))
ptd <- factor(ptl > 0, labels = c("no", "yes"))
ftv <- factor(ftv)
levels(ftv)[-(1:2)] <- "2+"
data.frame(low = factor(low, labels = c("normal", "low")),
age, lwt, race, smoke = factor(smoke, labels = c("no", "yes")),
ptd, ht = factor(ht, labels = c("no", "yes")),
ui = factor(ui, labels = c("no", "yes")), ftv)
})
# Summary statistics for numeric variables
summary(bwt)
## low age lwt race smoke ptd
## normal:130 Min. :14.00 Min. : 80.0 white:96 no :115 no :159
## low : 59 1st Qu.:19.00 1st Qu.:110.0 black:26 yes: 74 yes: 30
## Median :23.00 Median :121.0 other:67
## Mean :23.24 Mean :129.8
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## ht ui ftv
## no :177 no :161 0 :100
## yes: 12 yes: 28 1 : 47
## 2+: 42
##
##
##
# Correlation matrix for the numeric predictors (age and lwt)
cor_matrix <- cor(bwt %>% select_if(is.numeric))
print(cor_matrix)
## age lwt
## age 1.0000000 0.1800732
## lwt 0.1800732 1.0000000
Continuous predictors like age and lwt (mother's weight) are summarized numerically. The correlation matrix here involves only the two continuous predictors, and their correlation is modest (0.18), so collinearity between them is not a concern; relationships with the binary outcome (low) are examined graphically below.
Boxplots of the continuous predictors by birth weight status (low):
# Boxplot for mother's age vs low birth weight
ggplot(bwt, aes(x = low, y = age)) +
geom_boxplot() +
labs(title = "Mother's Age vs Low Birth Weight", x = "Birth Weight Status", y = "Age")
# Boxplot for mother's weight vs low birth weight
ggplot(bwt, aes(x = low, y = lwt)) +
geom_boxplot() +
labs(title = "Mother's Weight vs Low Birth Weight", x = "Birth Weight Status", y = "Weight (lbs)")
Bar plots of the categorical predictors by birth weight status (low):
# Bar plot for race vs low birth weight
ggplot(bwt, aes(x = race, fill = low)) +
geom_bar(position = "fill") +
labs(title = "Race vs Low Birth Weight", x = "Race", y = "Proportion")
# Bar plot for smoking status vs low birth weight
ggplot(bwt, aes(x = smoke, fill = low)) +
geom_bar(position = "fill") +
labs(title = "Smoking Status vs Low Birth Weight", x = "Smoking Status", y = "Proportion")
# Bar plot for hypertension history vs low birth weight
ggplot(bwt, aes(x = ht, fill = low)) +
geom_bar(position = "fill") +
labs(title = "Hypertension History vs Low Birth Weight", x = "Hypertension History", y = "Proportion")
Continuous Predictors:
- Mothers with lower weights (lwt) tend to have higher proportions of low-birthweight babies.
- Younger mothers (age) appear to have slightly higher proportions of low-birthweight babies.
Categorical Predictors:
Race: Black mothers have a higher proportion of low-birthweight babies compared to white or other racial groups.
Smoking: Smoking during pregnancy is strongly associated with a higher proportion of low-birthweight babies.
Hypertension (ht): Mothers with a history of hypertension show a higher proportion of low-birthweight babies.
Surprising Findings:
- The presence of uterine irritability (ui) does not show as strong an association with low birth weight as expected.
- The number of physician visits during the first trimester (ftv) does not show a clear trend in reducing the risk of low birth weight.
This analysis provides a comprehensive view of the relationships between the predictors and the outcome variable (low). Continuous predictors are visualized using boxplots to highlight differences in distributions across birthweight categories, and categorical predictors are visualized using bar plots to show proportions within each category. These insights will inform subsequent modeling steps using logistic regression and discriminant analysis.
b. Fit a logistic regression model using methods discussed in class/the book, similar to as in problem 1). Be careful to understand each variable in birthwt to avoid including variables that are not logically acceptable for inclusion in the model.
# Create a cleaned version of the 'birthwt' dataset
bwt <- with(birthwt, {
# Convert race to a factor with meaningful labels:
race <- factor(race, labels = c("white", "black", "other"))
# Convert ptl (number of previous premature labors) into a binary indicator:
ptd <- factor(ptl > 0, labels = c("none", "yes"))
# Convert ftv (first trimester physician visits) into a factor,
# recoding levels beyond the first two as "2+"
ftv <- factor(ftv)
levels(ftv)[-c(1,2)] <- "2+"
# Build a new data frame with logically acceptable predictors.
data.frame(low = factor(low, labels = c("normal", "low")),
age, # Maternal age (years)
lwt, # Maternal weight (lbs) at last menstrual period
race, # Categorical: white, black, or other
smoke = factor(smoke, labels = c("no", "yes")),
ptd, # Indicator of previous premature labors (none or yes)
ht = factor(ht, labels = c("no", "yes")), # History of hypertension
ui = factor(ui, labels = c("no", "yes")), # Uterine irritability
ftv) # Number of first trimester physician visits
})
# Fit the logistic regression model.
# We include only those predictors that are logically acceptable and that are known risk factors.
bw.glm <- glm(low ~ age + lwt + race + smoke + ptd + ht + ui + ftv,
data = bwt,
family = binomial)
summary(bw.glm)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptd + ht + ui +
## ftv, family = binomial, data = bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## raceblack 1.19241 0.53597 2.225 0.02609 *
## raceother 0.74069 0.46174 1.604 0.10869
## smokeyes 0.75553 0.42502 1.778 0.07546 .
## ptdyes 1.34376 0.48062 2.796 0.00518 **
## htyes 1.91317 0.72074 2.654 0.00794 **
## uiyes 0.68019 0.46434 1.465 0.14296
## ftv1 -0.43638 0.47939 -0.910 0.36268
## ftv2+ 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
Data Preparation:
The code above converts several variables appropriately:
race is recoded into a factor with levels “white,” “black,” and “other.”
ptl becomes a binary indicator (stored as ptd) for whether a mother has had any previous premature labors.
ftv is recoded so that values beyond 2 are grouped into “2+.”
The outcome, low, is set as a factor using the label “normal” for a birth weight of at least 2.5 kg and “low” for less than 2.5 kg.
Model Fitting:
The logistic regression model is fit with glm using a binomial family, which models the log odds of a low-birthweight outcome. The predictors include maternal age (age), weight (lwt), race, smoking status (smoke), previous premature labors (ptd), history of hypertension (ht), uterine irritability (ui), and first-trimester physician visits (ftv). This specification avoids including the actual birth weight (bwt) or any other variable that would not be logically acceptable for predicting low birth weight, as discussed in class and in the MASS documentation.
Model Interpretation
This logistic regression model indicates that lower maternal weight, being black (relative to white), having a history of prior premature labor, and having hypertension are significant predictors of low birth weight. The estimated odds ratios (obtained by exponentiating the coefficients) allow for a more intuitive interpretation. For example, black mothers in this dataset have about 3.3 times the odds of delivering a low-birthweight infant than white mothers, and a history of hypertension increases the odds by nearly 6.8 times.
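A short sketch of that calculation (the confidence intervals use the profile-likelihood method, so confint() may take a moment):
# Odds ratios and 95% confidence intervals for the logistic model in b)
round(exp(coef(bw.glm)), 2)
round(exp(confint(bw.glm)), 2)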
c. What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables' effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
ptl (Previous Premature Labors):
In our analysis, rather than using the raw count of previous premature
labors, we recoded this variable into a binary indicator (often labeled
“none” versus “yes”). By doing so, the model implicitly assumes that the
risk associated with previous premature labor operates as a threshold
effect. That is, it assumes that having any history of
premature labor increases the log odds of low birth weight by a fixed
amount, and that having one versus two or more preterm labors does not
make any additional difference. Essentially, once the threshold of “at
least one” is crossed, the risk is elevated by the same amount for all
mothers in that category.
ftv (First Trimester Physician Visits):
The variable ftv was converted into a factor with levels such as “0,”
“1,” and “2+” visits. By treating ftv as a categorical variable, the
model assumes that each category has its own fixed effect on the log
odds of low birth weight relative to a baseline (usually the “0 visits”
category). This means that the effect of having, say, 1 visit versus 0
visits is assumed to be constant, and any differences among higher
numbers of visits are grouped together under “2+.” The model does not
assume a continuous, incremental change in risk with each additional
visit; rather, it imposes a step function where the effect jumps at the
boundaries of these categories.
For ptl:
The assumption that any history of premature labor has the same effect,
regardless of how many times it has occurred, may be an
oversimplification. In clinical reality, the risk might further increase
with multiple prior preterm labors. However, if the data are sparse for
mothers with more than one such event, grouping them into a “yes”
category can be a practical move to avoid unstable estimates.
For ftv:
By grouping first-trimester physician visits into a few categories, the
model assumes that the impact on the log odds is constant within those
groups. This discretization may not capture more subtle or linear
dose‐response relationships. If prenatal care exhibits a more gradual
protective effect with increasing visits, then modeling it as a
continuous variable or exploring non-linear relationships (e.g., using
splines) might provide better insight. The categorical approach
simplifies the analysis and interpretation but at the potential cost of
missing a more nuanced association.
The logistic regression model is implicitly assuming:
That for ptl, there is a binary (threshold) effect—no risk if none, and the same elevated risk if one or more preterm labors have occurred.
That for ftv, the effects on the log odds are constant across categorically defined groups (0, 1, or 2+ visits).
These assumptions simplify the modeling process and are often used when the exact nature of the relationship is unclear or when data in some categories are sparse. However, if additional information or clinical insight suggests that the risk changes gradually (or escalates further with more preterm labors), then alternative strategies (e.g., keeping the variable continuous or using non-linear transformations such as splines) might be more realistic and informative.
d. Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
Below is one approach to create a new variable, ptl2, that collapses the original ptl information into a smaller number of categories. In the birthwt data the number of previous premature labors (ptl) is low for almost all mothers, and the cleaned bwt data already stores this information as the binary indicator ptd, so we distinguish mothers with no history of premature labor from those with at least one event.
Assuming bwt is our cleaned birthwt dataset, we create ptl2 so that:
“0” means no previous premature labor,
“1” means there was a previous premature labor history.
bwt$ptl2 <- with(bwt, ifelse(ptd == "none", "0", "1"))
#Convert ptl2 into a factor with ordered levels.
bwt$ptl2 <- factor(bwt$ptl2, levels = c("0", "1"))
#count of each category
table(bwt$ptl2)
##
## 0 1
## 159 30
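For reference, looking back at the raw counts in birthwt itself (the cleaned bwt data keeps only the binary ptd indicator) shows why the higher categories are collapsed; counts above one are very sparse:
# Raw distribution of previous premature labours in the original data
table(birthwt$ptl)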
This new variable ptl2 is often more useful for analysis, especially when the sample sizes in the higher count categories are very small. By collapsing ptl into two groups, no previous premature labor versus at least one, we capture the elevated risk associated with any history of premature labor while avoiding unstable estimates for very rare categories.
e. Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
The original variable ftv represents the number of first-trimester physician visits. Since higher values (e.g., 3, 4, etc.) are rare, we can collapse the levels into broader categories:
"0": No visits
"1": One visit
"2+": Two or more visits
# Collapse ftv into broader categories to create ftv2
bwt$ftv2 <- with(bwt,
ifelse(ftv == 0, "0",
ifelse(ftv == 1, "1", "2+")))
# Convert ftv2 into a factor with ordered levels
bwt$ftv2 <- factor(bwt$ftv2, levels = c("0", "1", "2+"))
# frequency of each category
table(bwt$ftv2)
##
## 0 1 2+
## 100 47 42
To better understand the relationship between ftv2 and the probability of low birthweight, we calculate the proportion of low-birthweight babies (low == "low") within each category of ftv2.
# Summarize probabilities of low birthweight by ftv2 levels
summary_table <- bwt %>%
group_by(ftv2) %>%
summarize(
total = n(),
low_count = sum(low == "low"),
probability_low = mean(low == "low")
)
print(summary_table)
## # A tibble: 3 × 4
## ftv2 total low_count probability_low
## <fct> <int> <int> <dbl>
## 1 0 100 36 0.36
## 2 1 47 11 0.234
## 3 2+ 42 12 0.286
No Visits (ftv2 = "0"):
Out of 100 mothers who had no first-trimester physician visits, 36 delivered low-birthweight babies.
The probability of low birthweight in this group is 36%.
This suggests that mothers who did not receive prenatal care during the first trimester are at higher risk for low birthweight.
One Visit (ftv2 = "1"):
Out of 47 mothers who had exactly one visit, 11 delivered low-birthweight babies.
The probability of low birthweight in this group is 23.4%.
This indicates that even one prenatal visit reduces the risk of low birthweight compared to no visits.
Two or More Visits (ftv2 = "2+"):
Out of 42 mothers who had two or more visits, 12 delivered low-birthweight babies.
The probability of low birthweight in this group is 28.6%.
While the risk is lower than for no visits, it is slightly higher than for one visit, suggesting that the protective effect may not increase substantially beyond one visit in this dataset.
The probability of low birthweight decreases significantly from 36% (no visits) to 23.4% (one visit), highlighting the importance of at least one prenatal care visit during the first trimester.
Interestingly, the probability increases slightly to 28.6% for two or more visits, which may be counterintuitive. This could suggest:
Mothers with higher-risk pregnancies might attend more prenatal visits, which could confound the relationship between ftv2 and low birthweight.
Additional analysis, such as adjusting for other risk factors (e.g., maternal weight, smoking status), might be needed to clarify this relationship.
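As one such follow-up, a sketch of a formal (unadjusted) test of association checks whether the differences in proportions across ftv2 levels are larger than would be expected by chance:
# Unadjusted chi-squared test of association between ftv2 and low birthweight
chisq.test(table(bwt$ftv2, bwt$low))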
f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ptl2 and ftv2 in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?
bwt$race <- as.factor(bwt$race)
bwt$smoke <- as.factor(bwt$smoke)
bwt$ptl2 <- as.factor(bwt$ptl2)
bwt$ht <- as.factor(bwt$ht)
bwt$ui <- as.factor(bwt$ui)
bwt$ftv2 <- as.factor(bwt$ftv2)
# Fit a logistic regression model with ptl2 and ftv2
updated_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = bwt,
family = binomial)
summary(updated_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## raceblack 1.19241 0.53597 2.225 0.02609 *
## raceother 0.74069 0.46174 1.604 0.10869
## smokeyes 0.75553 0.42502 1.778 0.07546 .
## ptl21 1.34376 0.48062 2.796 0.00518 **
## htyes 1.91317 0.72074 2.654 0.00794 **
## uiyes 0.68019 0.46434 1.465 0.14296
## ftv21 -0.43638 0.47939 -0.910 0.36268
## ftv22+ 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
Observations:
1. ptl2 (Previous Premature Labors):
- The new variable ptl2 appears in the output as indicator variables
(e.g., “ptl21” representing mothers with exactly one previous premature
labor, with “0” as the reference).
- The coefficient for ptl21 is 1.34376 and is statistically
significant (p = 0.00518).
- Exponentiating this coefficient (exp(1.34376) ≈ 3.83) suggests
that, compared to mothers with no previous premature labor, mothers with
one previous premature labor have about 3.8 times higher odds of
delivering a low-birthweight infant.
- If there were also a “ptl22+” category in the model (e.g., mothers
with two or more previous premature labors), we would interpret its
coefficient similarly; in this output only “ptl21” appears (indicating
that, after collapsing the categories, the important contrast detected
is between none versus one or more).
- Thus, the recoding of ptl into ptl2 clearly distinguishes
risk groups and is important in predicting low birthweight.
2. ftv2 (First-Trimester Physician Visits):
- The new variable ftv2 is represented in two contrasts: “ftv21” (one
visit vs. none) and “ftv22+” (two or more visits vs. none).
- Neither of these contrasts are statistically significant (p =
0.36268 and p = 0.69488, respectively).
- The coefficient for ftv21 is -0.43638, which (if significant) would
have indicated a protective effect (odds ratio exp(-0.43638) ≈ 0.646),
but its lack of significance suggests that, after controlling for other
factors, the difference between having no visits versus one visit is not
statistically apparent.
- Likewise, the contrast for ftv22+ does not reach
significance.
- Thus, the recoded ftv2 variable does not appear to be
important in predicting low birthweight in this multivariable
model.
ptl2 is a strong and significant predictor: The recoding of prior premature labors into ptl2 has clarified that even a single previous premature labor markedly increases the odds of low birthweight; mothers with any prior event have substantially higher risk than those with none.
ftv2 does not add significant predictive value: Although initial univariate summaries of ftv2 may have hinted at differences in low birthweight probabilities (with no visits having a higher rate), once other factors like maternal weight, race, and hypertension are controlled for in the logistic regression, ftv2 does not have a statistically significant independent effect. This may be because the effect of prenatal visits is confounded by other risk factors or because only a subset of mothers drive the association.
g. In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don't use my name to set the seed!)
library(TeachingDemos) # For setting the seed
## Warning: package 'TeachingDemos' was built under R version 4.4.3
library(MASS) # For the birthwt dataset and lda/qda functions
library(caret) # For confusionMatrix
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Set the seed using TeachingDemos
char2seed("birthwt_analysis")
### Split the Data into Training (80%) and Test (20%) Sets
set.seed(123) # Ensure reproducibility in splitting
n <- nrow(bwt)
train_idx <- sample(1:n, size = round(0.8 * n), replace = FALSE)
train_bwt <- bwt[train_idx, ]
test_bwt <- bwt[-train_idx, ]
cat("Training set size:", nrow(train_bwt), "\n")
## Training set size: 151
cat("Test set size:", nrow(test_bwt), "\n")
## Test set size: 38
### Discriminant Analysis Models
# For discriminant analysis, we include only continuous predictors that are more likely to satisfy
# the multivariate normality assumption. Here we choose age and lwt.
lda_model <- lda(low ~ age + lwt, data = train_bwt)
qda_model <- qda(low ~ age + lwt, data = train_bwt)
# Make predictions on the test set
lda_pred <- predict(lda_model, newdata = test_bwt)$class
qda_pred <- predict(qda_model, newdata = test_bwt)$class
# Create confusion matrices using caret's confusionMatrix function (treat "low" as the positive outcome)
conf_lda <- confusionMatrix(lda_pred, test_bwt$low, positive = "low")
conf_qda <- confusionMatrix(qda_pred, test_bwt$low, positive = "low")
cat("LDA Confusion Matrix:\n")
## LDA Confusion Matrix:
print(conf_lda$table)
## Reference
## Prediction normal low
## normal 27 11
## low 0 0
cat("\nLDA Metrics:\n")
##
## LDA Metrics:
print(conf_lda$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity
## 0 1
cat("LDA Accuracy:", conf_lda$overall["Accuracy"], "\n\n")
## LDA Accuracy: 0.7105263
cat("QDA Confusion Matrix:\n")
## QDA Confusion Matrix:
print(conf_qda$table)
## Reference
## Prediction normal low
## normal 27 11
## low 0 0
cat("\nQDA Metrics:\n")
##
## QDA Metrics:
print(conf_qda$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity
## 0 1
cat("QDA Accuracy:", conf_qda$overall["Accuracy"], "\n\n")
## QDA Accuracy: 0.7105263
## Logistic Regression Models
# The logistic regression models from parts b) and f) use a larger set of predictors.
# Because ptl2 and ftv2 encode exactly the same groupings as ptd and ftv, the models from
# b) and f) are equivalent, so a single fit covers both.
# Here we fit the updated logistic regression using age, lwt, race, smoke, ptl2, ht, ui, and ftv2.
logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = train_bwt, family = binomial)
# Predict probabilities on the test set and convert them to classes using threshold 0.5.
logit_pred_prob <- predict(logit_model, newdata = test_bwt, type = "response")
logit_pred <- ifelse(logit_pred_prob > 0.5, "low", "normal")
logit_pred <- factor(logit_pred, levels = c("normal", "low"))
conf_logit <- confusionMatrix(logit_pred, test_bwt$low, positive = "low")
cat("Logistic Regression Confusion Matrix:\n")
## Logistic Regression Confusion Matrix:
print(conf_logit$table)
## Reference
## Prediction normal low
## normal 24 9
## low 3 2
cat("\nLogistic Regression Metrics:\n")
##
## Logistic Regression Metrics:
print(conf_logit$byClass[c("Sensitivity", "Specificity")])
## Sensitivity Specificity
## 0.1818182 0.8888889
cat("Logistic Regression Accuracy:", conf_logit$overall["Accuracy"], "\n")
## Logistic Regression Accuracy: 0.6842105
Interpretation:
1. LDA and QDA models:
- Both methods predict every observation as “normal” (i.e. the
majority class). This results in a moderate overall accuracy of about
71%, but it completely fails to identify any low birthweight cases
(sensitivity = 0).
- Although high specificity (100%) is obtained, for a serious clinical
condition like low birthweight, leaving all “low” cases undetected is
not acceptable—even if overall accuracy seems reasonable.
2. Logistic Regression:
- The logistic regression model, which uses a richer set of predictors
(including ptl2 and ftv2), correctly identifies 2 out of 11 low
birthweight cases (sensitivity ≈ 18.2%).
- Its overall accuracy is slightly lower (68.4%), and specificity is
slightly lower (88.9%) compared to LDA/QDA, but it has the advantage of
at least detecting some of the at-risk (“low”) cases.
3. Which Model Performs the Best?
- Overall Accuracy: LDA/QDA have a slight edge (≈71%)
compared to logistic regression (≈68%); however, this is largely because
the data are imbalanced and the “normal” class dominates.
- Sensitivity: Logistic regression is the only model
that correctly identifies any low birthweight cases (18.2% versus 0% for
LDA and QDA), which is critical if the goal is to screen for high-risk
pregnancies.
- Conclusion: In a real-world scenario—especially in
a clinical setting where missing a high-risk case (low birthweight) has
serious consequences—you would prioritize sensitivity.
Therefore, logistic regression performs better in
this context because, despite its slightly lower overall accuracy, it
does identify a fraction of the low birthweight cases while LDA and QDA
completely miss them.
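One way to act on that priority, sketched below with an illustrative (not tuned) cutoff of 0.3, is to lower the classification threshold of the logistic model, trading some specificity for sensitivity:
# Reclassify test-set predictions with a lower probability cutoff
logit_pred_low_cut <- factor(ifelse(logit_pred_prob > 0.3, "low", "normal"),
                             levels = c("normal", "low"))
confusionMatrix(logit_pred_low_cut, test_bwt$low, positive = "low")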
h. Using your final model from f), interpret the estimates for all covariates.
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = train_bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.257670 1.474143 -0.175 0.8612
## age -0.006270 0.042593 -0.147 0.8830
## lwt -0.013617 0.008361 -1.629 0.1034
## raceblack 1.104833 0.643369 1.717 0.0859 .
## raceother 1.136605 0.545198 2.085 0.0371 *
## smokeyes 0.981383 0.507433 1.934 0.0531 .
## ptl21 1.178750 0.510017 2.311 0.0208 *
## htyes 1.796111 0.859947 2.089 0.0367 *
## uiyes 0.821373 0.508352 1.616 0.1061
## ftv21 -0.388984 0.521600 -0.746 0.4558
## ftv22+ -0.462438 0.543327 -0.851 0.3947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 188.83 on 150 degrees of freedom
## Residual deviance: 155.79 on 140 degrees of freedom
## AIC: 177.79
##
## Number of Fisher Scoring iterations: 4
Interpretation:
1. (Intercept): –0.25767
- This is the estimated log-odds of low birthweight when all
predictors are at their reference (or zero, for continuous variables)
values.
- Practically, since age and lwt are continuous and the reference
categories for the factors are “white” for race, “no” for smoke, ht, ui,
“0” for ftv2, and “0” for ptl2, the intercept is not directly
interpretable on its own—but it serves as the baseline for combining
effects from the other variables.
2. age: –0.00627
- Each additional year of maternal age is associated with a decrease
in the log-odds of low birthweight by 0.00627.
- Expressed as an OR: exp(–0.00627) ≈ 0.994; that is, for each extra
year, the odds of low birthweight decrease by about 0.6%. This effect is
very small and not statistically significant (p = 0.883).
3. lwt (maternal weight): –0.01362
- For each additional pound in maternal weight, the log-odds of low
birthweight decrease by 0.01362.
- OR ≈ exp(–0.01362) ≈ 0.986, meaning roughly a 1.4% reduction in the
odds per additional pound. This suggests that heavier mothers tend to
have slightly lower odds of delivering a low-birthweight baby, although
this effect is only marginally significant (p = 0.103).
4. race (categorical; reference: white)
- raceblack: 1.10483
– Black mothers have log-odds of low birthweight that are 1.10483
higher than those for white mothers.
– OR ≈ exp(1.10483) ≈ 3.02, so black mothers have about 3 times higher
odds than white mothers; this effect is marginally significant (p =
0.086).
- raceother: 1.13661
– Mothers classified as “other” race have a log-odds increase of
1.13661 relative to white mothers.
– OR ≈ exp(1.13661) ≈ 3.12, meaning their odds are roughly 3.1 times
higher than white mothers; this effect is statistically significant (p =
0.037).
5. smokeyes: 0.98138
- Mothers who smoke (compared to those who do not) have a 0.98138
higher log-odds of low birthweight.
- OR ≈ exp(0.98138) ≈ 2.67, so smoking is associated with about 2.7
times higher odds of low birthweight (p = 0.053, marginal
significance).
6. ptl2 (previous premature labor; reference: “0” =
none)
- ptl21: 1.17875
– Mothers with one previous premature labor have log-odds that are
1.17875 higher than mothers with no previous premature labor.
– OR ≈ exp(1.17875) ≈ 3.25, meaning these mothers have roughly 3.3
times greater odds of having a low-birthweight baby; this finding is
statistically significant (p = 0.021).
7. ht (hypertension; htyes): 1.79611
- Mothers with a history of hypertension have log-odds that are
1.79611 higher than mothers without hypertension.
- OR ≈ exp(1.79611) ≈ 6.02, suggesting that hypertension increases the
odds of low birthweight by about sixfold (p = 0.037).
8. ui (uterine irritability; uiyes): 0.82137
- Mothers with uterine irritability have 0.82137 higher log-odds of
low birthweight than those without.
- OR ≈ exp(0.82137) ≈ 2.27, indicating roughly double the odds, though
this effect is not statistically significant (p = 0.106).
9. ftv2 (first-trimester physician visits; reference: “0”
visits)
- ftv21: –0.38898
– Mothers with exactly one visit (compared to no visits) have a
0.38898 lower log-odds of low birthweight.
– OR ≈ exp(–0.38898) ≈ 0.678, implying about a 32% reduction in the
odds, but this effect is not statistically significant (p =
0.456).
- ftv22+: –0.46244
– Mothers with two or more visits (compared to no visits) have a
0.46244 lower log-odds of low birthweight.
– OR ≈ exp(–0.46244) ≈ 0.630, corresponding to about a 37% reduction
in the odds. This effect is also not statistically significant (p =
0.395).
Maternal weight (lwt) shows a protective effect: higher weight is associated with lower odds of low birthweight.
Race: Both black and “other” race categories are associated with notably higher odds than white mothers.
Smoking tends to increase the odds of low birthweight, although significance is borderline.
Prior premature labor (ptl2): Having one previous premature labor significantly raises the odds by over threefold.
Hypertension (htyes): Has a very strong association, increasing the odds of low birthweight nearly sixfold.
Uterine irritability (uiyes) potentially doubles the odds, but this effect is not statistically robust in this model.
Prenatal visits (ftv2): The protective effects suggested by negative coefficients for 1 or 2+ visits versus no visits are not statistically significant after adjusting for other risk factors.
These interpretations help illustrate which maternal factors play a key role in predicting low birthweight and provide insight into where preventive or interventional efforts might be focused.
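For reference, the odds ratios quoted above (with 95% confidence intervals) can be produced in one step; the intervals use the profile-likelihood method, so confint() may take a moment:
# Odds ratios and 95% confidence intervals for the training-set model from f)
round(exp(cbind(OR = coef(logit_model), confint(logit_model))), 2)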