Exam1 (Take Home) INFO-H 415/515
library(corrplot)
## corrplot 0.95 loaded
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
Let’s import the data and inspect it.
# Load the dataset
data("cpus")
cpus <- as.data.frame(cpus)
print(colnames(cpus))
## [1] "name" "syct" "mmin" "mmax" "cach" "chmin" "chmax"
## [8] "perf" "estperf"
str(cpus)
## 'data.frame': 209 obs. of 9 variables:
## $ name : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
## $ syct : int 125 29 29 29 29 26 23 23 23 23 ...
## $ mmin : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ mmax : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ cach : int 256 32 32 32 32 64 64 64 64 128 ...
## $ chmin : int 16 8 8 8 8 8 16 16 16 32 ...
## $ chmax : int 128 32 32 32 16 32 32 32 32 64 ...
## $ perf : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ estperf: int 199 253 253 253 132 290 381 381 749 1238 ...
head(cpus)
# Convert categorical variables to factors for better calculations
cpus$cach <- as.factor(cpus$cach)
cpus$chmin <- as.factor(cpus$chmin)
cpus$chmax <- as.factor(cpus$chmax)
cor_matrix <- cor(cpus[, sapply(cpus, is.numeric)])
cor_matrix
## syct mmin mmax perf estperf
## syct 1.0000000 -0.3356422 -0.3785606 -0.3070821 -0.2883956
## mmin -0.3356422 1.0000000 0.7581573 0.7949233 0.8192915
## mmax -0.3785606 0.7581573 1.0000000 0.8629942 0.9012024
## perf -0.3070821 0.7949233 0.8629942 1.0000000 0.9664687
## estperf -0.2883956 0.8192915 0.9012024 0.9664687 1.0000000
corrplot(cor_matrix, method = "color", tl.cex = 0.8, title = "Correlation Matrix of CPUs Dataset")
ggscatmat(cpus, columns = 1:ncol(cpus))
## Warning in ggscatmat(cpus, columns = 1:ncol(cpus)): Factor variables are
## omitted in plot
The table shows the pairwise correlations between the numerical variables in the cpus dataset. estperf (estimated performance) is most strongly correlated with perf (0.97), indicating that actual performance closely tracks the estimate, with mmax (0.90), suggesting that maximum memory size plays a large role, and with mmin (0.82), so minimum memory size matters as well. syct (cycle time) is negatively correlated with estperf (-0.29): as cycle time increases, estimated performance tends to decrease.
The heatmap presents the same correlations as a color gradient: darker blue marks strong positive correlations and red/brown tones mark negative ones. The dark blue block covering perf, estperf, mmax, and mmin points to a multicollinearity concern for any model that includes all of them, while syct is lighter, reflecting its weaker association with the performance-related variables.
In short, the memory-size variables are strongly related to performance, while syct is only weakly correlated with the other predictors.
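As a quick, hedged check on the multicollinearity concern, the variable pairs whose absolute correlation exceeds a conventional 0.7 cutoff can be listed directly from the matrix computed above (the 0.7 threshold is an illustrative choice, not part of the original analysis):
# List variable pairs with |r| > 0.7 (upper triangle only, to avoid duplicates)
high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_matrix)[high_cor[, 1]],
  var2 = colnames(cor_matrix)[high_cor[, 2]],
  r = round(cor_matrix[high_cor], 3)
)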
# Drop the identifier column (name) and the actual performance column (perf)
if ("name" %in% colnames(cpus) & "perf" %in% colnames(cpus)) {
cpus <- subset(cpus, select = -c(name, perf))
} else {
print("Columns 'name' and 'perf' not found in dataset.")
}
summary(cpus)
## syct mmin mmax cach chmin
## Min. : 17.0 Min. : 64 Min. : 64 0 :69 1 :94
## 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000 8 :31 3 :28
## Median : 110.0 Median : 2000 Median : 8000 32 :23 8 :18
## Mean : 203.8 Mean : 2868 Mean :11796 64 :20 6 :16
## 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000 16 :14 12 :11
## Max. :1500.0 Max. :32000 Max. :64000 4 : 8 16 :10
## (Other):44 (Other):32
## chmax estperf
## 6 :30 Min. : 15.00
## 24 :24 1st Qu.: 28.00
## 8 :20 Median : 45.00
## 32 :15 Mean : 99.33
## 5 :13 3rd Qu.: 101.00
## 16 :12 Max. :1238.00
## (Other):95
# Convert the factor columns back to numeric; note that as.numeric() on a factor
# returns the level index (1, 2, ...), not the original value
cpus <- cpus %>% mutate(across(where(is.factor), as.numeric))
# Replace zero values to prevent log(0) issues
cpus <- cpus %>% mutate(across(where(is.numeric), ~ ifelse(. == 0, 1, .)))
# Apply log transformation to numeric columns
cpus <- cpus %>% mutate(across(where(is.numeric), log))
# Ensure data is still available
if (nrow(cpus) == 0) {
stop("Error: No valid data remains after preprocessing.")
}
summary(cpus)
## syct mmin mmax cach
## Min. :2.833 Min. : 4.159 Min. : 4.159 Min. :0.000
## 1st Qu.:3.912 1st Qu.: 6.644 1st Qu.: 8.294 1st Qu.:0.000
## Median :4.700 Median : 7.601 Median : 8.987 Median :1.792
## Mean :4.747 Mean : 7.360 Mean : 8.922 Mean :1.470
## 3rd Qu.:5.416 3rd Qu.: 8.294 3rd Qu.: 9.680 3rd Qu.:2.485
## Max. :7.313 Max. :10.373 Max. :11.067 Max. :3.091
## chmin chmax estperf
## Min. :0.0000 Min. :0.000 Min. :2.708
## 1st Qu.:0.6931 1st Qu.:1.792 1st Qu.:3.332
## Median :1.0986 Median :2.197 Median :3.807
## Mean :1.2974 Mean :2.214 Mean :4.044
## 3rd Qu.:1.9459 3rd Qu.:2.890 3rd Qu.:4.615
## Max. :2.7081 Max. :3.434 Max. :7.121
# Let's build the initial linear regression model
model <- lm(estperf ~ ., data = cpus)
summary(model)
##
## Call:
## lm(formula = estperf ~ ., data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56902 -0.21756 -0.03808 0.14604 1.82772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.870260 0.410747 -4.553 9.12e-06 ***
## syct -0.008053 0.036959 -0.218 0.828
## mmin 0.155381 0.036217 4.290 2.77e-05 ***
## mmax 0.469654 0.035081 13.388 < 2e-16 ***
## cach 0.170403 0.026826 6.352 1.38e-09 ***
## chmin 0.276329 0.051533 5.362 2.24e-07 ***
## chmax 0.004212 0.044737 0.094 0.925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3245 on 202 degrees of freedom
## Multiple R-squared: 0.8835, Adjusted R-squared: 0.88
## F-statistic: 255.3 on 6 and 202 DF, p-value: < 2.2e-16
We removed the name column because it is only an identifier and the perf column because it is nearly collinear with the response estperf, then fitted a regression model on the remaining variables. The model fits well (adjusted R-squared = 0.88), but syct and chmax have high p-values, suggesting they contribute little once the other predictors are included.
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
The residual plots show some heteroscedasticity, so the constant-variance assumption is questionable, and a few points appear influential and deserve further investigation; a formal check of the variance assumption is sketched below.
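Since car is already loaded, its score test for non-constant error variance is one way to back up the visual impression (a sketch; the conclusion depends on the p-value actually obtained when it is run):
# Score test for non-constant variance (H0: homoscedastic errors)
ncvTest(model)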
vif_values <- vif(model)
print(vif_values)
## syct mmin mmax cach chmin chmax
## 2.906745 3.157344 2.590052 1.771283 2.512672 2.467674
All VIF values are below 5, so multicollinearity among the predictors is not severe and none of them needs to be removed; had the values been higher, dropping one of the correlated variables would have improved model stability.
exp(coef(model))
## (Intercept) syct mmin mmax cach chmin
## 0.1540835 0.9919796 1.1681023 1.5994414 1.1857826 1.3182812
## chmax
## 1.0042210
Since both estperf and the predictors were log-transformed, the exponentiated coefficients represent multiplicative changes: a one-unit increase in a log-scale predictor (i.e., multiplying the original predictor by e ≈ 2.72) multiplies estimated performance by exp(coefficient). Interpreting the values shown above: mmax has the largest effect (factor ≈ 1.60, equivalent to roughly a 0.47% increase in estperf for every 1% increase in maximum memory), followed by chmin (≈ 1.32), cach (≈ 1.19), and mmin (≈ 1.17), all of which are highly significant. The factors for syct (≈ 0.99) and chmax (≈ 1.00) are close to 1 and not significant, so cycle time and maximum channels add little once memory and cache are accounted for. The intercept (≈ 0.15) is only the baseline when all log-predictors are zero and is not meaningful on its own. The equivalent per-1% changes are sketched below.
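Because the model is log-log in the untouched numeric predictors, a tiny sketch can translate each coefficient into the approximate percentage change in estperf for a 1% increase in that predictor (illustrative only; the intercept entry is not meaningful, and cach, chmin, and chmax are on a level-index scale rather than their original units):
# Approximate % change in estperf for a 1% increase in each predictor (log-log model)
round((1.01 ^ coef(model) - 1) * 100, 2)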
As all the VIF values reported above are below 5, multicollinearity is not a concern for this model.
cooksd <- cooks.distance(model)
influential <- as.numeric(names(cooksd)[(cooksd > (4/length(cpus$estperf)))])
print(influential)
## [1] 1 9 10 15 31 32 47 66 100 123 157 199 200
Cook’s distance flags observations that exert a strong influence on the fitted model; their impact on prediction accuracy deserves further investigation. Let’s also check the leverage points.
leverage_values <- hatvalues(model)
high_leverage_points <- which(leverage_values > (2 * (length(coef(model)) / nrow(cpus))))
print(high_leverage_points)
## 1 15 47 79 123 208
## 1 15 47 79 123 208
std_residuals <- rstandard(model)
outliers <- which(abs(std_residuals) > 3)
print(outliers)
## 1 10 15 199 200
## 1 10 15 199 200
plot(cooksd, type="h", main="Cook's Distance for Influential Observations", ylab="Cook's Distance")
abline(h = 4/nrow(cpus), col="red", lty=2)
Let’s build a visualization to better justify which observations are influential.
influencePlot(model, main="Influence Plot", sub="Circle size is proportional to Cook’s Distance")
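To gauge whether the flagged observations materially change the fit, one option is to refit the model without them and compare the coefficients side by side (a sketch only, reusing the model, cpus, and influential objects created above; it is not part of the original analysis):
# Refit without the rows flagged by Cook's distance and compare coefficients
model_noinfl <- lm(estperf ~ ., data = cpus[-influential, ])
round(cbind(full = coef(model), without_influential = coef(model_noinfl)), 3)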
library(MASS)
library(dplyr)
library(ggplot2)
library(caret)
library(TeachingDemos)
##
## Attaching package: 'TeachingDemos'
## The following object is masked _by_ '.GlobalEnv':
##
## outliers
library(car)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Load birthwt dataset
data("birthwt")
bwt <- birthwt
# Convert categorical variables to factors
bwt <- bwt %>% mutate(
low = factor(low, levels = c(0, 1), labels = c("Normal", "Low")),
race = factor(race),
smoke = factor(smoke),
ptl = factor(ptl),
ht = factor(ht),
ui = factor(ui),
ftv = factor(ftv)
)
summary(bwt)
## low age lwt race smoke ptl ht
## Normal:130 Min. :14.00 Min. : 80.0 1:96 0:115 0:159 0:177
## Low : 59 1st Qu.:19.00 1st Qu.:110.0 2:26 1: 74 1: 24 1: 12
## Median :23.00 Median :121.0 3:67 2: 5
## Mean :23.24 Mean :129.8 3: 1
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## ui ftv bwt
## 0:161 0:100 Min. : 709
## 1: 28 1: 47 1st Qu.:2414
## 2: 30 Median :2977
## 3: 7 Mean :2945
## 4: 4 3rd Qu.:3487
## 6: 1 Max. :4990
# Visualizing low birthweight vs categorical predictors
ggplot(bwt, aes(x = race, fill = low)) + geom_bar(position = "fill") +
labs(title = "Proportion of Low Birthweight by Race")
ggplot(bwt, aes(x = smoke, fill = low)) + geom_bar(position = "fill") +
labs(title = "Proportion of Low Birthweight by Smoking Status")
# Visualizing low birthweight vs numeric predictors
ggplot(bwt, aes(x = age, y = bwt, color = low)) + geom_point() +
labs(title = "Birthweight by Age")
Smoking is associated with a higher proportion of low-birthweight babies, and younger mothers tend to have lower-birthweight babies. There also appear to be racial disparities: certain racial groups show a higher proportion of low-birthweight cases. Mothers with hypertension (ht = 1) have a much higher chance of having a low-birthweight baby, as the additional plot below suggests.
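The hypertension claim can be checked with the same kind of stacked bar chart used above (an additional plot, not part of the original output):
# Proportion of low birthweight by hypertension status
ggplot(bwt, aes(x = ht, fill = low)) + geom_bar(position = "fill") +
  labs(title = "Proportion of Low Birthweight by Hypertension", y = "Proportion")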
logit_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
data = bwt, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui +
## ftv, family = binomial, data = bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.986e-01 1.299e+00 0.692 0.48908
## age -3.889e-02 3.952e-02 -0.984 0.32508
## lwt -1.536e-02 7.483e-03 -2.052 0.04017 *
## race2 1.110e+00 5.509e-01 2.016 0.04382 *
## race3 6.698e-01 4.774e-01 1.403 0.16061
## smoke1 7.073e-01 4.399e-01 1.608 0.10784
## ptl1 1.865e+00 5.722e-01 3.259 0.00112 **
## ptl2 4.875e-01 1.008e+00 0.484 0.62858
## ptl3 -1.553e+01 1.455e+03 -0.011 0.99148
## ht1 1.805e+00 7.488e-01 2.410 0.01595 *
## ui1 7.930e-01 4.780e-01 1.659 0.09712 .
## ftv1 -5.598e-01 4.958e-01 -1.129 0.25879
## ftv2 -8.141e-02 5.447e-01 -0.149 0.88120
## ftv3 1.103e+00 8.648e-01 1.276 0.20213
## ftv4 -9.225e-01 1.384e+00 -0.667 0.50496
## ftv6 -1.291e+01 1.455e+03 -0.009 0.99292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 188.60 on 173 degrees of freedom
## AIC: 220.6
##
## Number of Fisher Scoring iterations: 14
Maternal weight (lwt), race group 2, a history of preterm labor (ptl1), and hypertension (ht1) come out as significant predictors of low birthweight, while smoking falls just short of the 0.05 level. The sparse categories of ptl and ftv produce unstable estimates (note the enormous standard errors for ptl3 and ftv6), so these predictors may need re-categorization.
table(bwt$ptl, bwt$low)
##
## Normal Low
## 0 118 41
## 1 8 16
## 2 3 2
## 3 1 0
table(bwt$ftv, bwt$low)
##
## Normal Low
## 0 64 36
## 1 36 11
## 2 23 7
## 3 3 4
## 4 3 1
## 6 1 0
The frequency of ptl ≥ 2 is very low: only 5 cases with ptl = 2 and a single case with ptl = 3. For ftv, counts of 3 or more are also rare (7, 4, and 1 cases for ftv = 3, 4, and 6), which makes the corresponding coefficient estimates in the logistic regression unstable.
bwt <- bwt %>% mutate(
ptl = as.numeric(as.character(ptl)), # Convert ptl to numeric
ftv = as.numeric(as.character(ftv)), # Convert ftv to numeric
ptl2 = ifelse(ptl >= 2, "2+", as.character(ptl)),
ftv2 = ifelse(ftv >= 2, "2+", as.character(ftv))
)
# Convert new variables to factors
bwt$ptl2 <- factor(bwt$ptl2)
bwt$ftv2 <- factor(bwt$ftv2)
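A quick cross-tabulation of the collapsed variables against the outcome confirms that the sparse levels are now pooled (a sanity check added here, not shown in the original run):
# Verify the collapsed categories
table(bwt$ptl2, bwt$low)
table(bwt$ftv2, bwt$low)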
ptl2 groups two or more premature labors into a single '2+' category, and ftv2 does the same for physician visits. Because the higher values (3, 4, 6) have very few cases, their individual coefficient estimates are unreliable; collapsing them improves the stability and interpretability of the model.
logit_model2 <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = bwt, family = binomial)
summary(logit_model2)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui +
## ftv2, family = binomial, data = bwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03635 1.26647 0.818 0.41319
## age -0.04079 0.03922 -1.040 0.29832
## lwt -0.01636 0.00721 -2.270 0.02323 *
## race2 1.12251 0.54311 2.067 0.03875 *
## race3 0.69388 0.46950 1.478 0.13943
## smoke1 0.75024 0.43167 1.738 0.08221 .
## ptl21 1.71542 0.54301 3.159 0.00158 **
## ptl22+ -0.02002 0.96939 -0.021 0.98352
## ht1 1.90929 0.72963 2.617 0.00888 **
## ui1 0.75203 0.47273 1.591 0.11165
## ftv21 -0.48603 0.48814 -0.996 0.31941
## ftv22+ 0.11418 0.46233 0.247 0.80494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 192.54 on 177 degrees of freedom
## AIC: 216.54
##
## Number of Fisher Scoring iterations: 4
The coefficient for one previous preterm labor (ptl21) is highly significant, indicating that a history of preterm labor strongly increases the likelihood of low birthweight; the '2+' level rests on too few cases to be estimated precisely. Collapsing ptl ≥ 2 into one category improved interpretability and lowered the AIC (216.5 vs. 220.6) without losing predictive power, as the comparison sketched below shows.
The coefficient for ftv2 does not show strong statistical significance, suggesting that physician visits alone may not directly predict low birthweight. This aligns with the idea that high physician visits may indicate high risk pregnancies rather than preventive care.
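The two specifications can also be compared directly. Because the collapsed coding is a restriction of the original dummy coding, a likelihood-ratio test is valid, and the AIC values reported above (216.5 vs. 220.6) already favor the re-coded model. A brief sketch, assuming the full-data model objects logit_model and logit_model2 from above:
# Compare the original and re-coded logistic models
AIC(logit_model, logit_model2)
anova(logit_model2, logit_model, test = "Chisq")  # collapsed model is nested in the original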
library(TeachingDemos)
char2seed("TakeHome")
runif(1)
## [1] 0.2111143
set.seed(123)
train_index <- createDataPartition(bwt$low, p = 0.8, list = FALSE)
train_data <- bwt[train_index, ]
test_data <- bwt[-train_index, ]
# Fit LDA and QDA models
lda_model <- lda(low ~ age + lwt + race + smoke + ht + ui, data = train_data)
qda_model <- qda(low ~ age + lwt + race + smoke + ht + ui, data = train_data)
# Predict
lda_pred <- predict(lda_model, test_data)$class
qda_pred <- predict(qda_model, test_data)$class
# Evaluate
lda_cm <- confusionMatrix(lda_pred, test_data$low)
qda_cm <- confusionMatrix(qda_pred, test_data$low)
# Logistic Regression - Initial Model
logit_model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
data = train_data, family = binomial)
# Logistic Regression - Updated Model (Using ptl2 and ftv2)
logit_model2 <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
data = train_data, family = binomial)
# Predict on test data
logit_pred1 <- predict(logit_model1, test_data, type = "response")
logit_pred_class1 <- factor(ifelse(logit_pred1 > 0.5, "Low", "Normal"), levels = c("Normal", "Low"))
logit_pred2 <- predict(logit_model2, test_data, type = "response")
logit_pred_class2 <- factor(ifelse(logit_pred2 > 0.5, "Low", "Normal"), levels = c("Normal", "Low"))
# Confusion Matrices for Logistic Regression
logit_cm1 <- confusionMatrix(logit_pred_class1, test_data$low)
logit_cm2 <- confusionMatrix(logit_pred_class2, test_data$low)
results <- data.frame(
Model = c("LDA", "QDA", "Logistic Regression (Original)", "Logistic Regression (Updated)"),
Accuracy = c(lda_cm$overall["Accuracy"], qda_cm$overall["Accuracy"],
logit_cm1$overall["Accuracy"], logit_cm2$overall["Accuracy"]),
Sensitivity = c(lda_cm$byClass["Sensitivity"], qda_cm$byClass["Sensitivity"],
logit_cm1$byClass["Sensitivity"], logit_cm2$byClass["Sensitivity"]),
Specificity = c(lda_cm$byClass["Specificity"], qda_cm$byClass["Specificity"],
logit_cm1$byClass["Specificity"], logit_cm2$byClass["Specificity"])
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 LDA 0.7027027 0.8076923 0.4545455
## 2 QDA 0.6216216 0.6923077 0.4545455
## 3 Logistic Regression (Original) 0.7027027 0.8461538 0.3636364
## 4 Logistic Regression (Updated) 0.6216216 0.8076923 0.1818182
Logistic regression, LDA, and QDA can be compared through the accuracy, sensitivity, and specificity of their confusion matrices. Note that caret's confusionMatrix() takes the first factor level ("Normal") as the positive class by default, so Sensitivity here measures how well normal-weight births are identified and Specificity how well low-birthweight births are identified.
LDA and the original logistic regression have the highest accuracy (0.70); QDA and the updated logistic regression both reach only 0.62, so the re-coded variables do not improve out-of-sample accuracy.
The original logistic regression identifies normal-weight births best (Sensitivity 0.85), with LDA and the updated model slightly lower (0.81) and QDA lowest (0.69).
For detecting low-birthweight cases, which is the outcome of real interest, all models are weak: LDA and QDA catch 45% of them (Specificity 0.45), the original logistic regression 36%, and the updated model only 18%, making it the least reliable. Overall, LDA offers the best balance between the two error types.
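Since pROC is already loaded, a threshold-free comparison via ROC curves and AUC can supplement these 0.5-threshold metrics (a sketch reusing the test-set predictions above; the exact AUC values depend on the split):
# ROC / AUC on the test set, using the predicted probability or posterior for the "Low" class
roc_lda    <- roc(test_data$low, predict(lda_model, test_data)$posterior[, "Low"])
roc_qda    <- roc(test_data$low, predict(qda_model, test_data)$posterior[, "Low"])
roc_logit1 <- roc(test_data$low, logit_pred1)
roc_logit2 <- roc(test_data$low, logit_pred2)
c(LDA = as.numeric(auc(roc_lda)), QDA = as.numeric(auc(roc_qda)),
  Logit_original = as.numeric(auc(roc_logit1)), Logit_updated = as.numeric(auc(roc_logit2)))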
exp(coef(logit_model2))
## (Intercept) age lwt race2 race3 smoke1
## 6.3956737 0.9564887 0.9794287 2.6251128 1.6974582 1.4042721
## ptl21 ptl22+ ht1 ui1 ftv21 ftv22+
## 8.9446651 0.7805705 6.3322386 3.2940275 0.4356383 1.2396967
The exponentiated coefficients are odds ratios. A history of preterm labor (ptl21, OR ≈ 8.9), hypertension (ht1, OR ≈ 6.3), and uterine irritability (ui1, OR ≈ 3.3) show the strongest associations with low birthweight; belonging to race group 2 (OR ≈ 2.6) and smoking (OR ≈ 1.4) also raise the odds. Note that the '2+' preterm-labor level has an odds ratio below 1, but it rests on very few cases and should not be over-interpreted.
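Profile-likelihood confidence intervals make the uncertainty around these odds ratios explicit; an interval that excludes 1 indicates a statistically reliable association (a sketch using the refitted model object above):
# Odds ratios with 95% profile-likelihood confidence intervals
round(exp(cbind(OR = coef(logit_model2), confint(logit_model2))), 2)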
Higher maternal weight is protective: each additional pound of maternal weight (lwt) slightly lowers the odds of low birthweight (OR ≈ 0.98). Physician visits (ftv2) may reflect better prenatal care, though frequent visits could also be linked to high-risk pregnancies rather than prevention.
Racial disparities exist in birthweight outcomes. The effects of race (race) vary depending on the reference group, suggesting socio-economic or genetic differences that influence pregnancy health and birth outcomes. Targeted prenatal care, smoking cessation programs, and hypertension management are crucial to reducing low birthweight cases. Focusing on high-risk groups through early medical intervention and lifestyle changes can improve birth outcomes.