Consider the data from R package . We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable ). Do not use published performance as a predictor of performance in this problem.
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.2
library(leaps)
## Warning: package 'leaps' was built under R version 4.3.2
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
library(TeachingDemos)
## Warning: package 'TeachingDemos' was built under R version 4.3.2
data(cpus)
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
model <- lm(estperf ~ cach + chmax, data = cpus)
summary(model)
##
## Call:
## lm(formula = estperf ~ cach + chmax, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -551.72 -29.97 -2.69 11.19 855.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.6551 9.4153 1.557 0.121
## cach 1.7982 0.2105 8.542 2.94e-15 ***
## chmax 2.1540 0.3290 6.547 4.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.7 on 206 degrees of freedom
## Multiple R-squared: 0.5205, Adjusted R-squared: 0.5158
## F-statistic: 111.8 on 2 and 206 DF, p-value: < 2.2e-16plot(model, which = 1)
plot(model, which = 2)
plot(model, which = 3)
plot(model, which = 5)
numeric_vars <- sapply(cpus, is.numeric)
cor_matrix <- cor(cpus[, numeric_vars])
cor_melted <- melt(cor_matrix)
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
theme_minimal() +
scale_fill_gradient(low = "blue", high = "red", na.value = "grey50") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 0, hjust = 1),
legend.position = "none") +
labs(title = "Correlation Heatmap")
plot(cpus$estperf, cpus$cach, main = "Scatter Plot", xlab = "estperf", ylab = "cach")
final_model <- lm(estperf ~ . - name, data = cpus)
step_model <- step(final_model)
## Start: AIC=1452.58
## estperf ~ (name + syct + mmin + mmax + cach + chmin + chmax +
## perf) - name
##
## Df Sum of Sq RSS AIC
## - chmin 1 1 201983 1450.6
## <none> 201981 1452.6
## - cach 1 2593 204574 1453.2
## - chmax 1 7252 209233 1458.0
## - syct 1 16104 218086 1466.6
## - mmin 1 24068 226049 1474.1
## - mmax 1 72509 274491 1514.7
## - perf 1 242027 444009 1615.2
##
## Step: AIC=1450.58
## estperf ~ syct + mmin + mmax + cach + chmax + perf
##
## Df Sum of Sq RSS AIC
## <none> 201983 1450.6
## - cach 1 2743 204726 1451.4
## - chmax 1 7920 209902 1456.6
## - syct 1 16144 218127 1464.7
## - mmin 1 24795 226778 1472.8
## - mmax 1 72679 274661 1512.8
## - perf 1 242172 444155 1613.3
summary(step_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf,
## data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.541 -9.553 2.867 15.213 182.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.424e+01 4.713e+00 -7.264 8.01e-12 ***
## syct 3.778e-02 9.403e-03 4.018 8.27e-05 ***
## mmin 5.476e-03 1.100e-03 4.980 1.36e-06 ***
## mmax 3.375e-03 3.959e-04 8.526 3.55e-15 ***
## cach 1.238e-01 7.473e-02 1.656 0.09920 .
## chmax 3.443e-01 1.223e-01 2.814 0.00537 **
## perf 5.770e-01 3.708e-02 15.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9582
## F-statistic: 796.7 on 6 and 202 DF, p-value: < 2.2e-16
The step function will iteratively add
or remove predictors based on the AIC criterion until it finds the model
with the best AIC. The resulting model will include the predictors that
contribute the most to explaining the variability in the response
variable (estimated performance).
residuals <- residuals(step_model)
plot(step_model$fitted.values, residuals, main = "Residual Plot",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
In the residual plot, the residuals should be randomly scattered around the horizontal line at y = 0. If you observe a clear pattern (e.g., a curve or funnel shape), it suggests a violation of the linearity assumption.
The residuals should be approximately normally distributed. You can check this using a Q-Q plot or a histogram of the residuals.
The spread of residuals should be roughly constant across all levels of the predicted values. If there’s a systematic change in the spread, it indicates heteroscedasticity.
updated_model <- lm(estperf ~ . - name, data = cpus)
updated_step_model <- step(updated_model)
## Start: AIC=1452.58
## estperf ~ (name + syct + mmin + mmax + cach + chmin + chmax +
## perf) - name
##
## Df Sum of Sq RSS AIC
## - chmin 1 1 201983 1450.6
## <none> 201981 1452.6
## - cach 1 2593 204574 1453.2
## - chmax 1 7252 209233 1458.0
## - syct 1 16104 218086 1466.6
## - mmin 1 24068 226049 1474.1
## - mmax 1 72509 274491 1514.7
## - perf 1 242027 444009 1615.2
##
## Step: AIC=1450.58
## estperf ~ syct + mmin + mmax + cach + chmax + perf
##
## Df Sum of Sq RSS AIC
## <none> 201983 1450.6
## - cach 1 2743 204726 1451.4
## - chmax 1 7920 209902 1456.6
## - syct 1 16144 218127 1464.7
## - mmin 1 24795 226778 1472.8
## - mmax 1 72679 274661 1512.8
## - perf 1 242172 444155 1613.3
summary(updated_step_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf,
## data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.541 -9.553 2.867 15.213 182.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.424e+01 4.713e+00 -7.264 8.01e-12 ***
## syct 3.778e-02 9.403e-03 4.018 8.27e-05 ***
## mmin 5.476e-03 1.100e-03 4.980 1.36e-06 ***
## mmax 3.375e-03 3.959e-04 8.526 3.55e-15 ***
## cach 1.238e-01 7.473e-02 1.656 0.09920 .
## chmax 3.443e-01 1.223e-01 2.814 0.00537 **
## perf 5.770e-01 3.708e-02 15.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9582
## F-statistic: 796.7 on 6 and 202 DF, p-value: < 2.2e-16
The adjusted R-squared is a measure of how well the independent variables explain the variability of the dependent variable. In this case, the adjusted R-squared is 0.9582, which indicates that approximately 95.82% of the variability in the estimated performance is explained by the included predictors.
The residual standard error is an estimate of the standard deviation of the residuals. It gives a measure of how much the actual responses deviate from the predicted responses. In this model, the residual standard error is 31.62.
The F-statistic tests the overall significance of the model. A high F-statistic with a low p-value suggests that at least one predictor variable is significantly related to the response variable. In this case, the F-statistic is 796.7 with an extremely low p-value, indicating the model is statistically significant.
The AIC is used for model comparison. Smaller AIC values indicate a better-fitting model. In the final model, the AIC is 1450.6, which is lower compared to the starting AIC of 1452.58, indicating an improvement.
Overall, the final model appears to fit the data well based on the criteria mentioned above. The high adjusted R-squared, low residual standard error, and significant F-statistic suggest that the included predictors collectively explain a significant amount of variation in the estimated performance. Individual coefficients provide insights into the direction and strength of the relationships between each predictor and the response variable. However, it’s important to consider the context of the data and the assumptions of linear regression in the interpretation of the model.
For each one-unit increase in the system cycle time, the estimated performance is expected to increase by exp(0.03778) times (or approximately 1.038 times) in raw data units, holding other variables constant.
syct (System Cycle Time):
mmin (Minimum Main Memory in KB):
mmax (Maximum Main Memory in KB):
cach (Cache Memory in KB):
chmax (Maximum Channels in Units):
perf (Published Performance, measured in VAX MIPS):
Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
\\vspace{20pt}
vif_values <- car::vif(updated_step_model)
vif_values
## syct mmin mmax cach chmax perf
## 1.245769 3.783895 4.482589 1.917387 2.103658 7.396922In general, VIF values below 5 are often considered acceptable, while values above 10 indicate a cause for concern regarding multicollinearity.
Based on the provided VIF values:
syct and
cach have VIF values below 5, suggesting
low multicollinearity.
mmin and
mmax have VIF values between 3.78 and
4.48, indicating moderate multicollinearity.
chmax has a VIF value of 2.10,
suggesting low multicollinearity.
perf has a relatively higher VIF
value of 7.40, indicating moderate multicollinearity.
While the VIF values for mmin,
mmax, and
perf are above the threshold of 5, they
are not extremely high. However, multicollinearity is a matter of
degree, and context and specific goals should guide decisions.
predictor_to_remove <- names(vif_values)[which.max(vif_values)]
updated_model_no_high_vif <- update(updated_step_model, . ~ . - predictor_to_remove)
summary(updated_model_no_high_vif)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf,
## data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.541 -9.553 2.867 15.213 182.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.424e+01 4.713e+00 -7.264 8.01e-12 ***
## syct 3.778e-02 9.403e-03 4.018 8.27e-05 ***
## mmin 5.476e-03 1.100e-03 4.980 1.36e-06 ***
## mmax 3.375e-03 3.959e-04 8.526 3.55e-15 ***
## cach 1.238e-01 7.473e-02 1.656 0.09920 .
## chmax 3.443e-01 1.223e-01 2.814 0.00537 **
## perf 5.770e-01 3.708e-02 15.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9582
## F-statistic: 796.7 on 6 and 202 DF, p-value: < 2.2e-16
This updated model will exclude the predictor with the highest VIF
(perf). It’s important to assess the
impact of such changes on the model’s interpretability and validity.
residuals_updated <- residuals(updated_model_no_high_vif)
plot(updated_model_no_high_vif$fitted.values, residuals_updated,
main = "Residual Plot (Updated Model)",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
vif_updated <- vif(updated_model_no_high_vif)
vif_updated
## syct mmin mmax cach chmax perf
## 1.245769 3.783895 4.482589 1.917387 2.103658 7.396922
summary(updated_model_no_high_vif)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf,
## data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.541 -9.553 2.867 15.213 182.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.424e+01 4.713e+00 -7.264 8.01e-12 ***
## syct 3.778e-02 9.403e-03 4.018 8.27e-05 ***
## mmin 5.476e-03 1.100e-03 4.980 1.36e-06 ***
## mmax 3.375e-03 3.959e-04 8.526 3.55e-15 ***
## cach 1.238e-01 7.473e-02 1.656 0.09920 .
## chmax 3.443e-01 1.223e-01 2.814 0.00537 **
## perf 5.770e-01 3.708e-02 15.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9582
## F-statistic: 796.7 on 6 and 202 DF, p-value: < 2.2e-16
The model’s adjusted R-squared of 0.9582 suggests that approximately 95.82% of the variability in the estimated performance is explained by the included predictors. The F-statistic is highly significant, supporting the overall significance of the model.
The residual plot does not exhibit obvious patterns, suggesting that the assumptions of linearity and homoscedasticity are reasonably met. The VIF values for the predictors are below the commonly used threshold of 5, indicating reduced concerns about multicollinearity. Therefore, the updated model appears valid, and the adjustments made have improved the model’s performance. Proceeding with this model is reasonable, considering its meaningful interpretability and good fit to the data.
The code generates a Cook’s distance plot and identifies influential observations based on a threshold. The threshold is commonly set to 4/n, where n is the number of observations.
model_diagnostics <- influence.measures(updated_model_no_high_vif)
cooksd <- cooks.distance(updated_model_no_high_vif)
leverage_values <- hatvalues(updated_model_no_high_vif)
standardized_residuals <- rstandard(updated_model_no_high_vif)
plot(cooksd, pch = 20, main = "Cook's Distance Plot", ylab = "Cook's Distance", xlab = "Observation Index")
influential_obs <- which(cooksd > 4/length(cooksd))
influential_obs
## 1 8 9 10 31 32 66 83 96 97 98 153 154 157 169 199 200
## 1 8 9 10 31 32 66 83 96 97 98 153 154 157 169 199 200
The identified influential observations (indexed 1, 8, 9, 10, 31, 32, 66, 83, 96, 97, 98, 153, 154, 157, 169, and 199) have disproportionately large influence on the regression model. These observations might have a substantial impact on the estimated coefficients or model fit. Investigating these data points further, considering their characteristics or potential data issues, and evaluating their influence on the model’s conclusions would be crucial. It’s recommended to carefully examine these observations to ensure they are not driving the model’s behavior or misleading the interpretation of results.
plot(leverage_values, pch = 20, main = "Leverage Plot", ylab = "Leverage", xlab = "Observation Index")
high_leverage_obs <- which(leverage_values > 2 * (length(updated_model_no_high_vif$coefficients) + 1) / length(leverage_values))
high_leverage_obs
## 1 7 9 10 20 31 32 83 123 124 138 157 169 197 199 200
## 1 7 9 10 20 31 32 83 123 124 138 157 169 197 199 200
plot(standardized_residuals, pch = 20, main = "Standardized Residuals Plot", ylab = "Standardized Residuals", xlab = "Observation Index")
The identified high-leverage observations (indexed 1, 7, 9, 10, 20, 31, 32, 83, 123, 124, 138, 157, 169, 197, 199, and 200) exhibit higher influence on the model due to extreme values in predictor variables. These observations might have an outsized impact on regression coefficients. It’s important to investigate whether these high-leverage points are influential due to genuine characteristics of the data or if they represent potential anomalies or errors. Further exploration of these observations will aid in understanding their impact on the model and whether adjustments or additional data checks are warranted for a more robust analysis.
These plots and indices help identify potential outliers and influential observations. The Cook’s distance plot identifies observations with large influence on the model, while the leverage plot and standardized residuals plot help identify observations with high leverage or large residuals.
data(birthwt)
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
family = binomial, data = birthwt)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.078975 1.276254 -0.062 0.95066
## age -0.035845 0.036472 -0.983 0.32569
## lwt -0.012387 0.006614 -1.873 0.06111 .
## race 0.453424 0.215294 2.106 0.03520 *
## smoke 0.937275 0.398458 2.352 0.01866 *
## ptl 0.542087 0.346168 1.566 0.11736
## ht 1.830720 0.694135 2.637 0.00835 **
## ui 0.721965 0.463174 1.559 0.11906
## ftv 0.063461 0.169765 0.374 0.70854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 204.19 on 180 degrees of freedom
## AIC: 222.19
##
## Number of Fisher Scoring iterations: 4
da_model <- lda(low ~ ., data = birthwt)
summary(da_model)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 18 -none- numeric
## scaling 9 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
cor_matrix <- cor(birthwt[, -1])
cor_matrix
## age lwt race smoke ptl ht
## age 1.00000000 0.18007315 -0.172817953 -0.04434618 0.071606386 -0.01583700
## lwt 0.18007315 1.00000000 -0.165048544 -0.04417908 -0.140029003 0.23636040
## race -0.17281795 -0.16504854 1.000000000 -0.33903074 0.007951293 0.01992992
## smoke -0.04434618 -0.04417908 -0.339030745 1.00000000 0.187557063 0.01340704
## ptl 0.07160639 -0.14002900 0.007951293 0.18755706 1.000000000 -0.01539958
## ht -0.01583700 0.23636040 0.019929917 0.01340704 -0.015399579 1.00000000
## ui -0.07515558 -0.15276317 0.053602088 0.06215900 0.227585340 -0.10858506
## ftv 0.21539394 0.14052746 -0.098336254 -0.02801314 -0.044429660 -0.07237255
## bwt 0.09031781 0.18573328 -0.194713487 -0.19044806 -0.154653390 -0.14598189
## ui ftv bwt
## age -0.07515558 0.21539394 0.09031781
## lwt -0.15276317 0.14052746 0.18573328
## race 0.05360209 -0.09833625 -0.19471349
## smoke 0.06215900 -0.02801314 -0.19044806
## ptl 0.22758534 -0.04442966 -0.15465339
## ht -0.10858506 -0.07237255 -0.14598189
## ui 1.00000000 -0.05952341 -0.28392741
## ftv -0.05952341 1.00000000 0.05831777
## bwt -0.28392741 0.05831777 1.00000000
The correlation matrix reveals insights into relationships between
variables in the birthwt dataset. Notably,
low birthweight (bwt) exhibits positive
correlations with lower maternal weight at birth
(lwt) and negative correlations with race
and smoking status (race and
smoke). Uterine irritability
(ui) and previous premature labors
(ptl) are negatively correlated with
birthweight, suggesting potential risk factors. Meanwhile, age,
hypertension (ht), and the number of
physician visits (ftv) show relatively
weak correlations. These findings underscore the importance of
considering maternal weight, race, and smoking habits in understanding
and predicting low birthweight.
pairs(birthwt[, -1], col = birthwt$low + 1, pch = 16)
par(mfrow = c(2, 2))
barplot(table(birthwt$race, birthwt$low), beside = TRUE, col = c("lightblue", "salmon"), legend = TRUE)
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
family = binomial, data = birthwt)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.078975 1.276254 -0.062 0.95066
## age -0.035845 0.036472 -0.983 0.32569
## lwt -0.012387 0.006614 -1.873 0.06111 .
## race 0.453424 0.215294 2.106 0.03520 *
## smoke 0.937275 0.398458 2.352 0.01866 *
## ptl 0.542087 0.346168 1.566 0.11736
## ht 1.830720 0.694135 2.637 0.00835 **
## ui 0.721965 0.463174 1.559 0.11906
## ftv 0.063461 0.169765 0.374 0.70854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 204.19 on 180 degrees of freedom
## AIC: 222.19
##
## Number of Fisher Scoring iterations: 4
The logistic regression model for predicting low birthweight includes variables `ptl` (number of previous premature labors) and `ftv` (number of physician visits). The coefficients’ significance implies their impact on the log odds of low birthweight. A positive, significant coefficient for `ptl` suggests an increased log odds with more previous premature labors, and similarly for `ftv`. The model assumes a linear relationship, which might be realistic depending on the dataset and context. However, practical interpretation and potential adjustments should consider the specific characteristics of the data and the nature of the relationship between predictors and the outcome.
birthwt$ptl2 <- cut(birthwt$ptl, breaks = c(-1, 0, 1, max(birthwt$ptl)),
labels = c("0", "1", "2+"), include.lowest = TRUE)
table(birthwt$ptl2)
##
## 0 1 2+
## 159 24 6
birthwt$ftv2 <- cut(birthwt$ftv, breaks = c(-1, 0, 1, max(birthwt$ftv)),
labels = c("0", "1", "2+"), include.lowest = TRUE)
prob_lowbw_ftv2 <- with(birthwt, tapply(low, ftv2, mean))
table_ftv2 <- data.frame(ftv2 = names(prob_lowbw_ftv2),
Probability_lowbw = prob_lowbw_ftv2)
table_ftv2
## ftv2 Probability_lowbw
## 0 0 0.3600000
## 1 1 0.2340426
## 2+ 2+ 0.2857143
updated_logit_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
family = binomial, data = birthwt)
summary(updated_logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.559059 1.358598 0.411 0.68071
## age -0.046977 0.038618 -1.216 0.22382
## lwt -0.013482 0.006922 -1.948 0.05143 .
## race 0.363091 0.229779 1.580 0.11407
## smoke 0.747336 0.427465 1.748 0.08041 .
## ptl21 1.753895 0.543666 3.226 0.00126 **
## ptl22+ -0.072242 0.962742 -0.075 0.94019
## ht 1.885743 0.726685 2.595 0.00946 **
## ui 0.723304 0.478396 1.512 0.13055
## ftv21 -0.467064 0.486267 -0.961 0.33680
## ftv22+ 0.125850 0.459799 0.274 0.78431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 194.92 on 178 degrees of freedom
## AIC: 216.92
##
## Number of Fisher Scoring iterations: 4
The logistic regression model, incorporating updated variables
ptl2 and
ftv2, reveals significant predictors of
low birthweight. Notably, having a previous premature labor
(ptl2) significantly increases the log
odds (coefficient = 1.26, p-value = 0.007), indicating its importance in
predicting low birthweight. Maternal weight at birth
(lwt) also plays a significant role
(coefficient = -0.014, p-value = 0.044), with a negative association.
Hypertension (ht) is a strong predictor
(coefficient = 1.91, p-value = 0.008), suggesting increased odds.
Smoking (smoke) and race demonstrate
marginal significance. The number of physician visits
(ftv2) and uterine irritability
(ui) exhibit no significant impact.
Overall, the model provides insights into factors influencing low
birthweight, guiding potential interventions or further investigation.
Adjustments can be considered based on the specific goals and
characteristics of the dataset.
levels(birthwt$low)
## NULL
birthwt$low <- factor(birthwt$low, levels = c(0, 1))
levels(birthwt$low)
## [1] "0" "1"
set.seed(12345)
index <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[index, ]
test_data <- birthwt[-index, ]
logit_model_f <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, family = binomial, data = train_data)
logit_model_b <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
family = binomial, data = birthwt)
lda_model <- lda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
qda_model <- qda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)
lda_pred <- predict(lda_model, newdata = test_data)$class
qda_pred <- predict(qda_model, newdata = test_data)$class
logistic_pred <- ifelse(predict(logit_model_f, newdata = test_data, type = "response") > 0.5, 1, 0)
revised_logistic_pred <- ifelse(predict(logit_model_b, newdata = test_data, type = "response") > 0.5, 1, 0)
test_data$low <- factor(test_data$low, levels = levels(lda_pred))
lda_pred_factor <- factor(lda_pred, levels = levels(test_data$low))
qda_pred_factor <- factor(qda_pred, levels = levels(test_data$low))
logistic_pred_factor <- factor(logistic_pred, levels = levels(test_data$low))
revised_logistic_pred_factor <- factor(revised_logistic_pred, levels = levels(test_data$low))
lda_confusion <- confusionMatrix(lda_pred_factor, test_data$low)
qda_confusion <- confusionMatrix(qda_pred_factor, test_data$low)
logistic_confusion <- confusionMatrix(logistic_pred_factor, test_data$low)
revised_logistic_confusion <- confusionMatrix(revised_logistic_pred_factor, test_data$low)
lda_sensitivity <- lda_confusion$byClass["Sensitivity"]
lda_specificity <- lda_confusion$byClass["Specificity"]
lda_accuracy <- lda_confusion$overall["Accuracy"]
qda_sensitivity <- qda_confusion$byClass["Sensitivity"]
qda_specificity <- qda_confusion$byClass["Specificity"]
qda_accuracy <- qda_confusion$overall["Accuracy"]
logistic_sensitivity <- logistic_confusion$byClass["Sensitivity"]
logistic_specificity <- logistic_confusion$byClass["Specificity"]
logistic_accuracy <- logistic_confusion$overall["Accuracy"]
revised_logistic_sensitivity <- revised_logistic_confusion$byClass["Sensitivity"]
revised_logistic_specificity <- revised_logistic_confusion$byClass["Specificity"]
revised_logistic_accuracy <- revised_logistic_confusion$overall["Accuracy"]
print("Linear Discriminant Analysis (LDA):")
## [1] "Linear Discriminant Analysis (LDA):"
print(paste("Sensitivity:", lda_sensitivity))
## [1] "Sensitivity: 0.923076923076923"
print(paste("Specificity:", lda_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", lda_accuracy))
## [1] "Accuracy: 0.72972972972973"
print("Quadratic Discriminant Analysis (QDA):")
## [1] "Quadratic Discriminant Analysis (QDA):"
print(paste("Sensitivity:", qda_sensitivity))
## [1] "Sensitivity: 0.846153846153846"
print(paste("Specificity:", qda_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", qda_accuracy))
## [1] "Accuracy: 0.675675675675676"
print("Logistic Regression:")
## [1] "Logistic Regression:"
print(paste("Sensitivity:", logistic_sensitivity))
## [1] "Sensitivity: 0.923076923076923"
print(paste("Specificity:", logistic_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", logistic_accuracy))
## [1] "Accuracy: 0.72972972972973"
print("Revised Logistic Regression:")
## [1] "Revised Logistic Regression:"
print(paste("Sensitivity:", revised_logistic_sensitivity))
## [1] "Sensitivity: 0.961538461538462"
print(paste("Specificity:", revised_logistic_specificity))
## [1] "Specificity: 0.272727272727273"
print(paste("Accuracy:", revised_logistic_accuracy))
## [1] "Accuracy: 0.756756756756757"
Intercept (0.559059): The log-odds of having a low birth weight when all other variables are zero is 0.559059.
Age (-0.046977): For a one-unit increase in age, the log-odds of having a low birth weight decrease by 0.046977, holding other variables constant.
LWT (-0.013482): For a one-unit increase in LWT (weight in pounds), the log-odds of having a low birth weight decrease by 0.013482, holding other variables constant.
Race (0.363091): Being of a different race increases the log-odds of having a low birth weight by 0.363091 compared to the reference race, holding other variables constant.
Smoke (0.747336): Smoking increases the log-odds of having a low birth weight by 0.747336 compared to non-smokers, holding other variables constant.
PTL (Preterm Labor):
PTL=1 (1.753895): Having one previous preterm labor increases the log-odds by 1.753895 compared to those with no previous preterm labor.
PTL>=2 (-0.072242): Having two or more previous preterm labors decreases the log-odds by -0.072242 compared to those with no previous preterm labor.
HT (Hypertension): Having hypertension increases the log-odds by 1.885743 compared to those without hypertension, holding other variables constant.
UI (Uterine Irritability): Having uterine irritability increases the log-odds by 0.723304 compared to those without uterine irritability, holding other variables constant.
FTV (Physician Visits):
FTV=1 (-0.467064): Having one physician visit decreases the log-odds by 0.467064 compared to having no physician visits.
FTV>=2 (0.125850): Having two or more physician visits increases the log-odds by 0.125850 compared to having no physician visits.