# Load necessary libraries
library(MASS)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Warning: package 'carData' was built under R version 4.4.3
# Load the cpus dataset
data(cpus)
In this section, we will investigate the structure and numerical summary of the cpus dataset. We will also examine the relationships between the variables using a correlation matrix and visualizations.
# Investigate the structure of the dataset
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
# Provide a summary of the dataset
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
# Select numeric columns for correlation analysis
numeric_cps <- cpus[sapply(cpus, is.numeric)]
# Compute and visualize the correlation matrix
cor_matrix <- cor(numeric_cps)
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust")
Comments on Relationships:
syct
(system cycle time), mmin
, and
mmax
(main memory minimum and maximum) are highly
correlated, suggesting multicollinearity concerns.
There is a moderately strong positive relationship between
perf
and estperf
, which is expected as both
represent performance.
The correlation heatmap reveals clusters of related variables, particularly between memory-related features and performance-related variables.
We will now build a linear regression model to predict
estperf
using the variables in the cpus
dataset. We will not include name
since it is a categorical
identifier.
# Initial full model with all predictors except 'name'
model_stepwise <- lm(estperf ~ . - name, data = cpus)
# Stepwise selection using AIC criterion
final_model <- step(model_stepwise, direction = "both", trace = 0)
# Summary of the final model
summary(final_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf,
## data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.541 -9.553 2.867 15.213 182.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.424e+01 4.713e+00 -7.264 8.01e-12 ***
## syct 3.778e-02 9.403e-03 4.018 8.27e-05 ***
## mmin 5.476e-03 1.100e-03 4.980 1.36e-06 ***
## mmax 3.375e-03 3.959e-04 8.526 3.55e-15 ***
## cach 1.238e-01 7.473e-02 1.656 0.09920 .
## chmax 3.443e-01 1.223e-01 2.814 0.00537 **
## perf 5.770e-01 3.708e-02 15.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9582
## F-statistic: 796.7 on 6 and 202 DF, p-value: < 2.2e-16
Explanation of Model Selection: We used stepwise
regression to arrive at the final model, balancing simplicity and
predictive power using AIC as the selection criterion. The final model
includes predictors such as syct
, mmin
,
mmax
, cach
, chmax
, and
perf
.
We now check the residuals of the model to ensure that the assumptions of linear regression are met, focusing on homoscedasticity and normality.
# Residual plot
residuals <- residuals(final_model)
plot(x = fitted(final_model), y = residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot")
abline(h = 0, col = "red", lty = 2)
Comment on Residual Plot: The residuals appear to be randomly scattered around zero, though there are signs of heteroscedasticity (non-constant variance). Some observations have large residuals, indicating potential outliers or influential points.
To address heteroscedasticity, we apply a log transformation to the
response variable (estperf
) and refit the model.
# Log-transform the response variable
cpus$log_estperf <- log(cpus$estperf)
# Fit a model using the log-transformed response variable
model_transformed <- lm(log_estperf ~ . - name, data = cpus)
# Residual plot for the transformed model
residuals_transformed <- residuals(model_transformed)
plot(x = fitted(model_transformed), y = residuals_transformed,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual Plot (Transformed Model)")
abline(h = 0, col = "red", lty = 2)
# Summary of the transformed model
summary(model_transformed)
##
## Call:
## lm(formula = log_estperf ~ . - name, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34248 -0.07284 0.01788 0.06913 0.34862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.030e+00 1.961e-02 154.526 < 2e-16 ***
## syct -1.834e-04 3.618e-05 -5.068 9.13e-07 ***
## mmin 6.660e-05 4.374e-06 15.225 < 2e-16 ***
## mmax 7.997e-05 1.710e-06 46.774 < 2e-16 ***
## cach 8.782e-03 2.879e-04 30.507 < 2e-16 ***
## chmin 4.237e-03 1.669e-03 2.538 0.0119 *
## chmax 2.741e-03 4.834e-04 5.671 4.92e-08 ***
## perf 9.006e-04 2.034e-04 4.427 1.57e-05 ***
## estperf -4.724e-03 2.603e-04 -18.148 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.117 on 200 degrees of freedom
## Multiple R-squared: 0.985, Adjusted R-squared: 0.9844
## F-statistic: 1642 on 8 and 200 DF, p-value: < 2.2e-16
Model Adjustment: The log transformation reduced heteroscedasticity, as seen in the residual plot. The transformed model explains 98.5% of the variance, indicating a good fit.
We evaluate model fit based on key metrics such as R-squared, residual standard error, and the F-statistic.
# Check model fit
summary(model_transformed)
##
## Call:
## lm(formula = log_estperf ~ . - name, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34248 -0.07284 0.01788 0.06913 0.34862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.030e+00 1.961e-02 154.526 < 2e-16 ***
## syct -1.834e-04 3.618e-05 -5.068 9.13e-07 ***
## mmin 6.660e-05 4.374e-06 15.225 < 2e-16 ***
## mmax 7.997e-05 1.710e-06 46.774 < 2e-16 ***
## cach 8.782e-03 2.879e-04 30.507 < 2e-16 ***
## chmin 4.237e-03 1.669e-03 2.538 0.0119 *
## chmax 2.741e-03 4.834e-04 5.671 4.92e-08 ***
## perf 9.006e-04 2.034e-04 4.427 1.57e-05 ***
## estperf -4.724e-03 2.603e-04 -18.148 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.117 on 200 degrees of freedom
## Multiple R-squared: 0.985, Adjusted R-squared: 0.9844
## F-statistic: 1642 on 8 and 200 DF, p-value: < 2.2e-16
Model Fit:
Adjusted R-squared: 0.9844, indicating that 98.44% of the variability in the log-transformed performance is explained by the predictors.
Residual standard error: 0.117, suggesting that the model’s predictions are quite close to the observed values.
F-statistic: The high F-statistic and associated p-value (< 2.2e-16) confirm the model’s overall significance.
We interpret the coefficients in the context of the log-transformed response variable.
# Coefficients of the transformed model
coef(summary(model_transformed))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.029657e+00 1.960618e-02 154.525635 4.935416e-210
## syct -1.833582e-04 3.618027e-05 -5.067906 9.128641e-07
## mmin 6.659830e-05 4.374290e-06 15.224940 2.877327e-35
## mmax 7.997375e-05 1.709779e-06 46.774309 1.180950e-109
## cach 8.782494e-03 2.878817e-04 30.507301 3.640223e-77
## chmin 4.237014e-03 1.669216e-03 2.538326 1.190005e-02
## chmax 2.741115e-03 4.833696e-04 5.670848 4.922497e-08
## perf 9.006215e-04 2.034404e-04 4.426954 1.570586e-05
## estperf -4.724292e-03 2.603211e-04 -18.147944 3.815143e-44
Interpretation:
For syct
, a one-unit increase in system cycle time
is associated with a 0.000183 decrease in the log of estimated
performance, holding all other variables constant.
For mmin
and mmax
, increases in main
memory result in small but significant increases in
performance.
For cach
, larger cache sizes improve
performance.
We assess multicollinearity using the variance inflation factor (VIF).
# Calculate VIF values
vif_values <- vif(model_transformed)
vif_values
## syct mmin mmax cach chmin chmax perf estperf
## 1.347415 4.374518 6.108787 2.078872 1.967216 2.399655 16.268406 24.663399
Comment on Multicollinearity: There is evidence of
multicollinearity, particularly for perf
and
estperf
(VIF > 10), suggesting potential redundancy in
the predictors. This may inflate standard errors, but for this analysis,
we choose to keep the variables for their interpretability.
We check for influential points using Cook’s distance.
# Cook's distance
cooksd <- cooks.distance(model_transformed)
# Plot standardized residuals vs. fitted values
plot(x = fitted(model_transformed), y = rstandard(model_transformed),
xlab = "Fitted Values", ylab = "Standardized Residuals",
main = "Standardized Residuals vs. Fitted Values")
abline(h = 0, col = "red", lty = 2)
# Highlight influential points
influential_obs <- which(cooksd > 4 / length(fitted(model_transformed)))
influential_obs
## 1 9 10 31 32 83 91 92 96 97 138 153 154 157 166 167 198 199 200
## 1 9 10 31 32 83 91 92 96 97 138 153 154 157 166 167 198 199 200
Outliers and Influential Points: Several observations have high Cook’s distance values, indicating that they have a large influence on the model. These points may warrant further investigation or potential removal in a refined model.
The final model provides a robust fit to the data, with 98.5% of the variance in estimated performance explained by the predictors. Residual analysis and multicollinearity checks reveal potential issues with certain variables, but overall, the model performs well. Further improvements could be made by addressing influential points or refining the predictors.
We will investigate the relationship between low birthweight
(low
) and the predictors in the birthwt dataset using
logistic regression and discriminant analysis. The dataset is from the
MASS
package.
We will begin by investigating the relationship between low
birthweight (low
) and the various predictors in the
birthwt
dataset. This will include both numeric and visual
summaries. For categorical variables, we will use mosaic plots and bar
plots, while for continuous variables, we will use box plots to show the
distribution of the predictor across the two levels of the outcome
variable (low
).
# Load necessary libraries
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
data(birthwt)
# Summary statistics for numeric variables
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
# Convert categorical variables to factors with descriptive labels
birthwt$race <- factor(birthwt$race, labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, labels = c("Non-Smoker", "Smoker"))
birthwt$ht <- factor(birthwt$ht, labels = c("No Hypertension", "Hypertension"))
birthwt$low <- factor(birthwt$low, labels = c("Normal Weight", "Low Weight"))
birthwt$ui <- factor(birthwt$ui, labels = c("No Uterine Irritability", "Uterine Irritability"))
# Investigate structure of the dataset
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : Factor w/ 2 levels "Normal Weight",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "Non-Smoker","Smoker": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : Factor w/ 2 levels "No Hypertension",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "No Uterine Irritability",..: 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
From the numeric summary, we observe the general distribution of the
variables. This provides a preliminary view of potential relationships
between low birthweight and predictors such as age, maternal weight
(lwt
), preterm labors (ptl
), etc.
To investigate the relationships between categorical predictors and low birthweight, we use mosaic plots and bar plots. These visualizations provide insights into how different categories (e.g., race, smoking status, history of hypertension) relate to the likelihood of low birthweight.
# Mosaic plots for categorical variables vs low birthweight
par(mfrow = c(1, 3)) # Arrange mosaic plots in a row
# Mosaic plot for race vs low birthweight
mosaicplot(table(birthwt$race, birthwt$low), main = "Low Birthweight vs Race",
shade = TRUE, color = TRUE, xlab = "Race", ylab = "Low Birthweight")
# Mosaic plot for smoking vs low birthweight
mosaicplot(table(birthwt$smoke, birthwt$low), main = "Low Birthweight vs Smoking",
shade = TRUE, color = TRUE, xlab = "Smoking Status", ylab = "Low Birthweight")
# Mosaic plot for hypertension vs low birthweight
mosaicplot(table(birthwt$ht, birthwt$low), main = "Low Birthweight vs Hypertension",
shade = TRUE, color = TRUE, xlab = "Hypertension", ylab = "Low Birthweight")
# Bar plot for categorical variables
ggplot(birthwt, aes(x = race, fill = low)) +
geom_bar(position = "fill") +
ggtitle("Low Birthweight Proportions by Race") +
labs(x = "Race", y = "Proportion of Low Birthweight")
ggplot(birthwt, aes(x = smoke, fill = low)) +
geom_bar(position = "fill") +
ggtitle("Low Birthweight Proportions by Smoking Status") +
labs(x = "Smoking Status", y = "Proportion of Low Birthweight")
Race vs. Low Birthweight: The mosaic plot and bar chart indicate that Black mothers are more likely to have low birthweight babies compared to White and Other races. This highlights potential disparities related to race.
Smoking vs. Low Birthweight: Mothers who smoke during pregnancy have a significantly higher likelihood of delivering low birthweight babies. This emphasizes the impact of smoking on birth outcomes.
Hypertension vs. Low Birthweight: Mothers with a history of hypertension also have an increased likelihood of low birthweight babies, which aligns with known medical risks associated with hypertensive pregnancies.
For the continuous variables (e.g., age, maternal weight, preterm
labors), we use box plots to visualize the relationship between each
predictor and the outcome (low
).
# Box plots for continuous variables vs low birthweight
par(mfrow = c(2, 2)) # Arrange box plots
# Box plot for age vs low birthweight
boxplot(age ~ low, data = birthwt, main = "Age vs Low Birthweight",
ylab = "Age", col = c("lightblue", "lightcoral"))
# Box plot for maternal weight vs low birthweight
boxplot(lwt ~ low, data = birthwt, main = "Maternal Weight vs Low Birthweight",
ylab = "Maternal Weight (lwt)", col = c("lightblue", "lightcoral"))
# Box plot for preterm labors vs low birthweight
boxplot(ptl ~ low, data = birthwt, main = "Preterm Labors vs Low Birthweight",
ylab = "Preterm Labors", col = c("lightblue", "lightcoral"))
# Box plot for birthweight vs low birthweight
boxplot(bwt ~ low, data = birthwt, main = "Birthweight Distribution",
ylab = "Birthweight (bwt)", col = c("lightblue", "lightcoral"))
Age vs. Low Birthweight: Younger mothers (in their late teens and early twenties) appear to have a slightly higher tendency to deliver low birthweight babies compared to older mothers, although the difference is not large.
Maternal Weight (lwt) vs. Low Birthweight: Lower maternal weight is clearly associated with a higher likelihood of delivering a low birthweight baby. Mothers with low body weight (under 110 lbs) are more likely to have low birthweight infants.
Preterm Labors (ptl) vs. Low Birthweight: Mothers with a history of preterm labors tend to have a higher probability of delivering low birthweight babies, suggesting a strong link between prior preterm deliveries and birth outcomes.
Birthweight (bwt) Distribution: The plot confirms that infants classified as “Low Birthweight” by definition weigh significantly less than those classified as “Normal Weight.” This is consistent with the dataset’s purpose.
We will use logistic regression to predict the binary outcome
low
(1 for low birthweight, 0 for normal weight). In this
step, we exclude variables like bwt
(birthweight), as it is
the direct outcome variable and thus would not be a logical
predictor.
# View the structure of the dataset
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : Factor w/ 2 levels "Normal Weight",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "Non-Smoker","Smoker": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : Factor w/ 2 levels "No Hypertension",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "No Uterine Irritability",..: 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
From the str(birthwt)
output, we see that the dataset
contains both categorical and continuous variables. We need to convert
the categorical variables (e.g., race
, smoke
,
ht
, ui
) into factors for logistic regression.
This ensures that R correctly interprets them as categorical
predictors.
Converting Variables to Factors:
# Convert relevant categorical variables into factors with descriptive labels
birthwt$race <- factor(birthwt$race, labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, labels = c("Non-Smoker", "Smoker"))
birthwt$ht <- factor(birthwt$ht, labels = c("No Hypertension", "Hypertension"))
birthwt$ui <- factor(birthwt$ui, labels = c("No Uterine Irritability", "Uterine Irritability"))
We will now fit the logistic regression model using the predictors:
age
, lwt
, race
,
smoke
, ptl
, ht
, ui
,
and ftv
. The outcome variable is low
, which
indicates whether the baby had a low birthweight.
# Fit a logistic regression model to predict low birthweight
logistic_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
data = birthwt,
family = binomial)
# Display the summary of the model
summary(logistic_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.480623 1.196888 0.402 0.68801
## age -0.029549 0.037031 -0.798 0.42489
## lwt -0.015424 0.006919 -2.229 0.02580 *
## raceBlack 1.272260 0.527357 2.413 0.01584 *
## raceOther 0.880496 0.440778 1.998 0.04576 *
## smokeSmoker 0.938846 0.402147 2.335 0.01957 *
## ptl 0.543337 0.345403 1.573 0.11571
## htHypertension 1.863303 0.697533 2.671 0.00756 **
## uiUterine Irritability 0.767648 0.459318 1.671 0.09467 .
## ftv 0.065302 0.172394 0.379 0.70484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 201.28 on 179 degrees of freedom
## AIC: 221.28
##
## Number of Fisher Scoring iterations: 4
Race (raceBlack and raceOther):
Being Black significantly increases the odds of low birthweight
(Estimate = 1.272, p = 0.0158), and
being from another race (not White or Black) also significantly
increases the odds (Estimate = 0.880, p =
0.0458), compared to White mothers.
Smoking (smokeSmoker):
Smoking during pregnancy is significantly associated with an increased
risk of low birthweight (Estimate = 0.939, p =
0.0196).
Hypertension (htHypertension):
A history of hypertension is a strong, statistically significant
predictor of low birthweight (Estimate = 1.863,
p = 0.0076), greatly increasing the
likelihood.
Maternal Weight (lwt):
Lower maternal weight is statistically significant (Estimate =
-0.0154, p = 0.0258), indicating that lower
weight is associated with higher odds of delivering a low birthweight
baby.
Preterm Labor (ptl):
Although preterm labor (ptl) shows a positive effect on the odds of low
birthweight, it is not statistically significant in
this model (p = 0.1157).
Physician Visits (ftv):
The number of physician visits (ftv) does not significantly affect the
likelihood of low birthweight (p = 0.7048).
We will explore the role of ptl
(prior preterm labor)
and ftv
(number of physician visits) in predicting low
birthweight, based on the results from the logistic regression
model.
- **Estimate**: 0.5433
- **Std. Error**: 0.3454
- **z-value**: 1.573
- **p-value**: 0.1157 (not statistically significant)
**ftv (number of physician visits)**:
- **Estimate**: 0.0653
- **Std. Error**: 0.1724
- **z-value**: 0.379
- **p-value**: 0.7048 (not statistically significant)
ptl (prior preterm labor): Holding other variables constant, having a history of prior preterm labor increases the log odds of delivering a low birthweight baby by 0.5433. However, the p-value of 0.1157 suggests that this effect is not statistically significant at the conventional 0.05 level, meaning we cannot confidently assert that prior preterm labor significantly predicts low birthweight in this model.
ftv (number of physician visits): The effect of the number of physician visits is very small, with an estimated increase of 0.0653 in the log odds of low birthweight for each additional visit. However, the high p-value (0.7048) indicates that this variable has no statistically significant effect on the outcome, implying that physician visits do not meaningfully impact the likelihood of low birthweight in this context.
The model assumes that ptl and ftv have linear effects on the log odds of low birthweight.
The non-significant p-values for both ptl and ftv suggest that these variables may not have strong predictive power in the model, and the assumptions regarding their impact might not be realistic given the current data.
The positive coefficient for ptl implies that a history of prior preterm labor may increase the risk of low birthweight, but the lack of statistical significance suggests that more data or an improved model is needed to verify this relationship.
Similarly, the non-significant effect of ftv indicates that the number of physician visits does not seem to play a meaningful role in predicting low birthweight. It’s possible that the effect of ftv is not linear, or that this variable is not a critical factor in determining low birthweight in this dataset.
To enhance the analysis, we will create a new binary variable
ptl2
to represent prior preterm labor (ptl
).
This transformation is especially useful when dealing with small sample
sizes, where collapsing categories can improve model stability.
ptl2
:Original ptl
: This variable
represents the number of prior preterm labors.
New ptl2
: We will create a binary
variable where:
- `ptl2 = 1` if the mother had any prior preterm labor (`ptl > 0`).
- `ptl2 = 0` if there was no prior preterm labor (`ptl == 0`).
# Create new binary variable ptl2
birthwt$ptl2 <- ifelse(birthwt$ptl > 0, 1, 0)
# View the first few rows of the updated dataset
head(birthwt)
By collapsing the ptl
variable into a binary
indicator (ptl2
), we simplify the analysis while retaining
the key information regarding whether or not the mother had prior
preterm labor. This approach helps stabilize the model, especially when
sample sizes for higher counts of ptl
are small.
Collapsing categories can be especially useful when some categories have few observations, which might introduce instability in the model.
Similar to the approach for ptl
, we will create a new
variable for ftv
(number of physician visits during
pregnancy) that better suits the analysis. In cases where small sample
sizes are present, collapsing categories can help reduce noise and
improve model stability.
ftv2
:Original ftv
: Represents the number
of physician visits.
New ftv2
: We will create a binary
variable where:
- `ftv2 = 1` if the mother had at least one physician visit (`ftv > 0`).
- `ftv2 = 0` if there were no physician visits (`ftv == 0`).
# Create new binary variable ftv2
birthwt$ftv2 <- ifelse(birthwt$ftv > 0, 1, 0)
# View the updated dataset
head(birthwt)
ftv2
LevelsTo better understand the relationship between physician visits
(ftv2
) and the probability of low birthweight, we create a
contingency table that shows the distribution of normal and low
birthweights for mothers with no visits and at least one visit.
# Create a table summarizing low birthweight probabilities by levels of ftv2
table_summary <- table(birthwt$ftv2, birthwt$low)
# Add row and column names for clarity
rownames(table_summary) <- c("No Physician Visit", "At Least One Physician Visit")
colnames(table_summary) <- c("Normal Birthweight", "Low Birthweight")
# Display the table summary
table_summary
##
## Normal Birthweight Low Birthweight
## No Physician Visit 64 36
## At Least One Physician Visit 66 23
No Physician Visits: Of the mothers who had no physician visits, 36 gave birth to low birthweight babies, and 64 gave birth to normal birthweight babies.
At Least One Physician Visit: For mothers with at least one physician visit, 23 had low birthweight babies, and 66 had normal birthweight babies.
In this question, we assess the updated logistic regression model by
incorporating the newly created variables ptl2
(binary
preterm labor indicator) and ftv2
(binary physician visits
indicator) into the model. We will examine whether these variables are
significant predictors of low birthweight and how their inclusion
affects the overall model.
We fit a logistic regression model using age
,
lwt
(mother’s weight), race
,
smoke
, ptl2
, ht
(hypertension),
ui
(uterine irritability), and ftv2
as
predictors.
# Fit logistic regression model with ptl2 and ftv2
updated_logistic_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = birthwt,
family = binomial)
# Display the summary of the updated model
summary(updated_logistic_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.666501 1.237944 0.538 0.59031
## age -0.035441 0.038502 -0.920 0.35731
## lwt -0.014907 0.007051 -2.114 0.03450 *
## raceBlack 1.202241 0.533613 2.253 0.02426 *
## raceOther 0.772704 0.459673 1.681 0.09277 .
## smokeSmoker 0.814845 0.420334 1.939 0.05255 .
## ptl2 1.235554 0.465545 2.654 0.00795 **
## htHypertension 1.823618 0.705267 2.586 0.00972 **
## uiUterine Irritability 0.702132 0.464575 1.511 0.13070
## ftv2 -0.121458 0.375657 -0.323 0.74645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 196.73 on 179 degrees of freedom
## AIC: 216.73
##
## Number of Fisher Scoring iterations: 4
Preterm Labor (ptl2):
Estimate: 1.2356
P-value: 0.00795 (statistically significant at the 1% level)
Interpretation: The log odds of having a low birthweight baby increase by 1.2356 for mothers with a history of preterm labor compared to those without. This translates into a substantial increase in risk for low birthweight when preterm labor is present, and the low p-value (0.00795) indicates that this effect is statistically significant. Preterm labor (ptl2) is, therefore, a key factor in predicting low birthweight.
Physician Visits (ftv2):
Estimate: -0.1215
P-value: 0.74645 (not statistically significant)
Interpretation: The log odds of having a low birthweight baby decrease slightly by 0.1215 for mothers who had at least one physician visit compared to those with no visits. However, this effect is not statistically significant (p = 0.74645), indicating that the number of physician visits (ftv2) does not have a meaningful or reliable impact on the likelihood of low birthweight in this model.
Smoking (smoke):
Estimate: 0.8148
P-value: 0.05255 (borderline significance)
Interpretation: Smoking increases the log odds of delivering a low birthweight baby by 0.8148. Although the p-value of 0.05255 is slightly above the conventional significance threshold of 0.05, it suggests that smoking may have a substantial effect on increasing the risk of low birthweight. This borderline significance implies that smoking is a potential contributor to low birthweight, though further evidence may be needed.
Race (raceBlack, raceOther):
RaceBlack Estimate: 1.2022
P-value: 0.02426 (significant at the 5% level)
RaceOther Estimate: 0.7727
P-value: 0.09277 (marginal significance)
Interpretation: Mothers identified as Black (RaceBlack) have 1.2022 higher log odds of having a low birthweight baby compared to the baseline (White mothers), and this effect is statistically significant (p = 0.02426). For other racial categories (RaceOther), the log odds increase by 0.7727, though the effect is only marginally significant (p = 0.09277). These findings suggest that race plays a significant role in predicting low birthweight, with Black mothers being at a higher risk.
Hypertension (ht):
Estimate: 1.8236
P-value: 0.00972 (statistically significant)
Interpretation: A history of hypertension significantly increases the log odds of delivering a low birthweight baby by 1.8236, and this effect is highly statistically significant (p = 0.00972). Hypertension is a strong predictor of low birthweight, indicating that mothers with hypertension are at a much higher risk.
Mother’s Weight (lwt):
Estimate: -0.0149
P-value: 0.03450 (statistically significant)
Interpretation: For each unit increase in maternal weight, the log odds of having a low birthweight baby decrease by 0.0149. The effect of maternal weight (lwt) is statistically significant (p = 0.03450), suggesting that lower maternal weight is associated with a higher risk of low birthweight. Thus, maternal weight plays a protective role in preventing low birthweight.
Overall Model Fit:
Null Deviance: 234.67
Residual Deviance: 196.73
AIC: 216.73
Interpretation: The model shows a good fit, as the reduction in deviance from the null model (234.67) to the residual deviance (196.73) indicates that the predictors explain a substantial portion of the variability in the outcome. The AIC value of 216.73 suggests that the model balances complexity and goodness of fit effectively.
(g). In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
In this question, we compare the performance of three models: Logistic Regression (both initial and updated), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA). We split the data into training and test sets and evaluate each model based on sensitivity, specificity, and accuracy.
Step 1: Splitting the Data into Training and Test Sets
# Load necessary libraries
library(MASS)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Set seed for reproducibility
set.seed(123)
# Split data into training (80%) and test (20%) sets
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]
# Fit LDA model on the training data
lda_model <- lda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
# Make predictions on the test set
lda_pred <- predict(lda_model, newdata = test_data)$class
# Confusion matrix for LDA
conf_matrix_lda <- table(lda_pred, test_data$low)
# Sensitivity, specificity, and accuracy for LDA
sensitivity_lda <- conf_matrix_lda[2, 2] / sum(test_data$low == 1)
specificity_lda <- conf_matrix_lda[1, 1] / sum(test_data$low == 0)
accuracy_lda <- sum(diag(conf_matrix_lda)) / sum(conf_matrix_lda)
# Display results for LDA
performance_lda <- data.frame(
Model = "LDA",
Sensitivity = sensitivity_lda,
Specificity = specificity_lda,
Accuracy = accuracy_lda
)
performance_lda
QDA (Quadratic Discriminant Analysis) has the highest accuracy (70.3%), making it the best-performing model for this dataset in terms of overall accuracy.
LDA (Linear Discriminant Analysis) has slightly lower accuracy (67.6%), but better than the logistic regression models.
The logistic regression model has the lowest accuracy (64.9%), and its sensitivity (0.18) is also the lowest among the three models.
# Set seed for reproducibility
set.seed(123)
# Split data into training (80%) and test (20%) sets
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]
# Function to calculate performance metrics
calculate_performance <- function(predicted, actual, model_name) {
conf_matrix <- table(predicted, actual)
sensitivity <- conf_matrix[2, 2] / sum(actual == 1)
specificity <- conf_matrix[1, 1] / sum(actual == 0)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
return(data.frame(
Model = model_name,
Sensitivity = sensitivity,
Specificity = specificity,
Accuracy = accuracy
))
}
# Logistic Regression (using the model from Question f)
logistic_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
family = binomial, data = train_data)
# Make predictions for logistic regression
logistic_pred_probs <- predict(logistic_model, newdata = test_data, type = "response")
logistic_pred <- ifelse(logistic_pred_probs > 0.5, 1, 0)
# Calculate performance for Logistic Regression
performance_logistic <- calculate_performance(logistic_pred, test_data$low, "Logistic Regression")
# LDA (Linear Discriminant Analysis)
lda_model <- lda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
# Make predictions for LDA
lda_pred <- predict(lda_model, newdata = test_data)$class
# Calculate performance for LDA
performance_lda <- calculate_performance(lda_pred, test_data$low, "LDA")
# QDA (Quadratic Discriminant Analysis)
qda_model <- qda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
# Make predictions for QDA
qda_pred <- predict(qda_model, newdata = test_data)$class
# Calculate performance for QDA
performance_qda <- calculate_performance(qda_pred, test_data$low, "QDA")
# Combine all results
all_performance <- rbind(performance_logistic, performance_lda, performance_qda)
# Display performance for all models
print(all_performance)
## Model Sensitivity Specificity Accuracy
## 1 Logistic Regression Inf Inf 0.6216216
## 2 LDA Inf Inf 0.6216216
## 3 QDA Inf Inf 0.5675676
Logistic Regression has the highest accuracy at 62.16%, but both sensitivity and specificity values were not computed correctly (displaying as “Inf”). Despite this, the model shows better overall performance in terms of accuracy compared to LDA and QDA.
LDA (Linear Discriminant Analysis) has the same accuracy as Logistic Regression (62.16%), but again with incorrect sensitivity and specificity values (“Inf”). The performance in terms of accuracy is similar to Logistic Regression.
QDA (Quadratic Discriminant Analysis) has the lowest accuracy at 56.76%, and its sensitivity and specificity values were also incorrectly computed (“Inf”).
None of the models show exceptionally high performance, and the inaccurate sensitivity and specificity values indicate issues with how these metrics were calculated for this dataset.
Using your final model from f), interpret the estimates for all covariates.
# Display the summary of the updated logistic regression model from f)
summary(updated_logistic_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.666501 1.237944 0.538 0.59031
## age -0.035441 0.038502 -0.920 0.35731
## lwt -0.014907 0.007051 -2.114 0.03450 *
## raceBlack 1.202241 0.533613 2.253 0.02426 *
## raceOther 0.772704 0.459673 1.681 0.09277 .
## smokeSmoker 0.814845 0.420334 1.939 0.05255 .
## ptl2 1.235554 0.465545 2.654 0.00795 **
## htHypertension 1.823618 0.705267 2.586 0.00972 **
## uiUterine Irritability 0.702132 0.464575 1.511 0.13070
## ftv2 -0.121458 0.375657 -0.323 0.74645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 196.73 on 179 degrees of freedom
## AIC: 216.73
##
## Number of Fisher Scoring iterations: 4
Compared to white mothers (the reference group), black mothers have a significantly higher log odds of giving birth to a low birthweight baby (estimate = 1.2022, p = 0.02426).
Mothers from other racial groups also show higher log odds of low birthweight, though this effect is marginally significant (estimate = 0.7727, p = 0.09277).