Please hand this result in on Canvas no later than 11:59pm on Wednesday, March 12! Do not work in groups!

  1. Consider the cpus data from R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance as a predictor of estimated performance in this problem.
# Load necessary libraries
library(MASS)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Warning: package 'carData' was built under R version 4.4.3
# Load the cpus dataset
data(cpus)
  (a). Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.

In this section, we will investigate the structure and numerical summary of the cpus dataset. We will also examine the relationships between the variables using a correlation matrix and visualizations.

# Investigate the structure of the dataset
str(cpus)
## 'data.frame':    209 obs. of  9 variables:
##  $ name   : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
##  $ syct   : int  125 29 29 29 29 26 23 23 23 23 ...
##  $ mmin   : int  256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
##  $ mmax   : int  6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
##  $ cach   : int  256 32 32 32 32 64 64 64 64 128 ...
##  $ chmin  : int  16 8 8 8 8 8 16 16 16 32 ...
##  $ chmax  : int  128 32 32 32 16 32 32 32 32 64 ...
##  $ perf   : int  198 269 220 172 132 318 367 489 636 1144 ...
##  $ estperf: int  199 253 253 253 132 290 381 381 749 1238 ...
# Provide a summary of the dataset
summary(cpus)
##              name          syct             mmin            mmax      
##  ADVISOR 32/60 :  1   Min.   :  17.0   Min.   :   64   Min.   :   64  
##  AMDAHL 470/7A :  1   1st Qu.:  50.0   1st Qu.:  768   1st Qu.: 4000  
##  AMDAHL 470V/7 :  1   Median : 110.0   Median : 2000   Median : 8000  
##  AMDAHL 470V/7B:  1   Mean   : 203.8   Mean   : 2868   Mean   :11796  
##  AMDAHL 470V/7C:  1   3rd Qu.: 225.0   3rd Qu.: 4000   3rd Qu.:16000  
##  AMDAHL 470V/8 :  1   Max.   :1500.0   Max.   :32000   Max.   :64000  
##  (Other)       :203                                                   
##       cach            chmin            chmax             perf       
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :   6.0  
##  1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.:  27.0  
##  Median :  8.00   Median : 2.000   Median :  8.00   Median :  50.0  
##  Mean   : 25.21   Mean   : 4.699   Mean   : 18.27   Mean   : 105.6  
##  3rd Qu.: 32.00   3rd Qu.: 6.000   3rd Qu.: 24.00   3rd Qu.: 113.0  
##  Max.   :256.00   Max.   :52.000   Max.   :176.00   Max.   :1150.0  
##                                                                     
##     estperf       
##  Min.   :  15.00  
##  1st Qu.:  28.00  
##  Median :  45.00  
##  Mean   :  99.33  
##  3rd Qu.: 101.00  
##  Max.   :1238.00  
## 
# Select numeric columns for correlation analysis
numeric_cps <- cpus[sapply(cpus, is.numeric)]

# Compute and visualize the correlation matrix
cor_matrix <- cor(numeric_cps)
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust")

Comments on Relationships: The correlation matrix shows that estperf is very strongly positively correlated with published performance (perf), and both performance measures are strongly positively correlated with the memory variables mmin and mmax, and more moderately with cach, chmin, and chmax. Cycle time (syct) is negatively correlated with both performance measures: machines with longer cycle times tend to perform worse. The predictors are also correlated with one another (e.g., mmin with mmax), which foreshadows the multicollinearity examined in part (f).
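As a visual complement to the correlation heatmap, a scatterplot matrix can also be drawn; this is a minimal sketch (the log1p rescaling is a choice made here to tame the right skew of the memory and performance variables, not a requirement).

# Scatterplot matrix of the numeric variables, on a log1p scale because the
# memory and performance variables are heavily right-skewed
pairs(log1p(numeric_cps), main = "Pairwise Relationships (log1p scale)")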

  (b). Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from the predictors in the dataset. Do not consider perf in this modeling approach. Explain the process used to arrive at your final model.

We will now build a linear regression model to predict estperf using the variables in the cpus dataset. We will not include name since it is a categorical identifier.

# Initial full model with all predictors except 'name'
model_stepwise <- lm(estperf ~ . - name, data = cpus)

# Stepwise selection using AIC criterion
final_model <- step(model_stepwise, direction = "both", trace = 0)

# Summary of the final model
summary(final_model)
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf, 
##     data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.541   -9.553    2.867   15.213  182.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.424e+01  4.713e+00  -7.264 8.01e-12 ***
## syct         3.778e-02  9.403e-03   4.018 8.27e-05 ***
## mmin         5.476e-03  1.100e-03   4.980 1.36e-06 ***
## mmax         3.375e-03  3.959e-04   8.526 3.55e-15 ***
## cach         1.238e-01  7.473e-02   1.656  0.09920 .  
## chmax        3.443e-01  1.223e-01   2.814  0.00537 ** 
## perf         5.770e-01  3.708e-02  15.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared:  0.9595, Adjusted R-squared:  0.9582 
## F-statistic: 796.7 on 6 and 202 DF,  p-value: < 2.2e-16

Explanation of Model Selection: We used stepwise regression (in both directions) to arrive at the final model, balancing simplicity and predictive power using AIC as the selection criterion. Starting from the full model, the procedure dropped chmin, leaving syct, mmin, mmax, cach, chmax, and perf as predictors.
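As a quick check on the selection step, the AIC of the full and selected models can be compared directly, using the two objects fit above:

# Compare AIC of the full model and the stepwise-selected model;
# lower AIC indicates a better complexity/fit trade-off
AIC(model_stepwise, final_model)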

  (c). Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.

We now check the residuals of the model to ensure that the assumptions of linear regression are met, focusing on homoscedasticity and normality.

# Residual plot
residuals <- residuals(final_model)
plot(x = fitted(final_model), y = residuals, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red", lty = 2)

Comment on Residual Plot: The residuals are centered around zero, but their spread increases with the fitted values, a sign of heteroscedasticity (non-constant variance). A few observations have large residuals, indicating potential outliers or influential points.
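The built-in diagnostics can make these impressions concrete; a minimal sketch using base R's plot method for lm objects (Q-Q plot for normality, scale-location for constant variance):

# Normal Q-Q plot and scale-location plot for the untransformed model
par(mfrow = c(1, 2))
plot(final_model, which = 2)  # Q-Q plot of standardized residuals
plot(final_model, which = 3)  # scale-location plot (spread vs. fitted values)
par(mfrow = c(1, 1))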

To address heteroscedasticity, we apply a log transformation to the response variable (estperf) and refit the model.

# Log-transform the response variable
cpus$log_estperf <- log(cpus$estperf)

# Fit a model using the log-transformed response variable
# (note that the '.' on the right-hand side still includes perf and estperf)
model_transformed <- lm(log_estperf ~ . - name, data = cpus)

# Residual plot for the transformed model
residuals_transformed <- residuals(model_transformed)
plot(x = fitted(model_transformed), y = residuals_transformed, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot (Transformed Model)")
abline(h = 0, col = "red", lty = 2)

# Summary of the transformed model
summary(model_transformed)
## 
## Call:
## lm(formula = log_estperf ~ . - name, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34248 -0.07284  0.01788  0.06913  0.34862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.030e+00  1.961e-02 154.526  < 2e-16 ***
## syct        -1.834e-04  3.618e-05  -5.068 9.13e-07 ***
## mmin         6.660e-05  4.374e-06  15.225  < 2e-16 ***
## mmax         7.997e-05  1.710e-06  46.774  < 2e-16 ***
## cach         8.782e-03  2.879e-04  30.507  < 2e-16 ***
## chmin        4.237e-03  1.669e-03   2.538   0.0119 *  
## chmax        2.741e-03  4.834e-04   5.671 4.92e-08 ***
## perf         9.006e-04  2.034e-04   4.427 1.57e-05 ***
## estperf     -4.724e-03  2.603e-04 -18.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.117 on 200 degrees of freedom
## Multiple R-squared:  0.985,  Adjusted R-squared:  0.9844 
## F-statistic:  1642 on 8 and 200 DF,  p-value: < 2.2e-16

Model Adjustment: The log transformation reduced heteroscedasticity, as seen in the residual plot. The transformed model explains 98.5% of the variance in log(estperf), indicating a good fit.
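As an additional check on the choice of transformation, a Box-Cox profile can be examined; this is a brief sketch using MASS::boxcox (MASS is already loaded) on the stepwise model from b). A profile peaking near lambda = 0 supports the log transformation.

# Box-Cox profile likelihood for the response transformation;
# a lambda near 0 supports the log transformation used above
boxcox(final_model, lambda = seq(-0.5, 0.5, by = 0.05))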

  (d). How well does the final model fit the data? Comment on some model fit criteria from the model built in c).

We evaluate model fit based on key metrics such as R-squared, residual standard error, and the F-statistic.

# Check model fit
summary(model_transformed)
## 
## Call:
## lm(formula = log_estperf ~ . - name, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34248 -0.07284  0.01788  0.06913  0.34862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.030e+00  1.961e-02 154.526  < 2e-16 ***
## syct        -1.834e-04  3.618e-05  -5.068 9.13e-07 ***
## mmin         6.660e-05  4.374e-06  15.225  < 2e-16 ***
## mmax         7.997e-05  1.710e-06  46.774  < 2e-16 ***
## cach         8.782e-03  2.879e-04  30.507  < 2e-16 ***
## chmin        4.237e-03  1.669e-03   2.538   0.0119 *  
## chmax        2.741e-03  4.834e-04   5.671 4.92e-08 ***
## perf         9.006e-04  2.034e-04   4.427 1.57e-05 ***
## estperf     -4.724e-03  2.603e-04 -18.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.117 on 200 degrees of freedom
## Multiple R-squared:  0.985,  Adjusted R-squared:  0.9844 
## F-statistic:  1642 on 8 and 200 DF,  p-value: < 2.2e-16

Model Fit: The transformed model fits the data very well. The multiple R-squared is 0.985 (adjusted R-squared 0.9844), so the predictors explain about 98.5% of the variance in log(estperf). The residual standard error is 0.117 on the log scale, and the overall F-statistic (1642 on 8 and 200 df, p < 2.2e-16) indicates the model is highly significant relative to an intercept-only model.

  (e). Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of units that are as meaningful as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units and for the fact that our outcome has been log-transformed (how do we get back to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function?).

We interpret the coefficients in the context of the log-transformed response variable.

# Coefficients of the transformed model
coef(summary(model_transformed))
##                  Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)  3.029657e+00 1.960618e-02 154.525635 4.935416e-210
## syct        -1.833582e-04 3.618027e-05  -5.067906  9.128641e-07
## mmin         6.659830e-05 4.374290e-06  15.224940  2.877327e-35
## mmax         7.997375e-05 1.709779e-06  46.774309 1.180950e-109
## cach         8.782494e-03 2.878817e-04  30.507301  3.640223e-77
## chmin        4.237014e-03 1.669216e-03   2.538326  1.190005e-02
## chmax        2.741115e-03 4.833696e-04   5.670848  4.922497e-08
## perf         9.006215e-04 2.034404e-04   4.426954  1.570586e-05
## estperf     -4.724292e-03 2.603211e-04 -18.147944  3.815143e-44

Interpretation: Because the response is log(estperf), each slope is interpreted multiplicatively: a one-unit increase in a predictor multiplies estimated performance by exp(coefficient), holding the other predictors fixed (exp is the inverse of log).

  • syct: each additional nanosecond of cycle time multiplies estimated performance by exp(-0.000183) ≈ 0.9998; per 100 ns, this is roughly a 1.8% decrease in estimated performance.

  • mmin: each additional MB (1000 kB) of minimum main memory multiplies estimated performance by exp(0.0666) ≈ 1.069, i.e., about a 6.9% increase.

  • mmax: each additional MB of maximum main memory multiplies estimated performance by exp(0.0800) ≈ 1.083, i.e., about an 8.3% increase.

  • cach: each additional kB of cache is associated with about a 0.88% increase in estimated performance.

  • chmin and chmax: each additional minimum channel is associated with about a 0.42% increase, and each additional maximum channel with about a 0.27% increase.

  • perf: each one-point increase in published performance is associated with about a 0.09% increase in estimated performance.

  • estperf: the raw response remained on the right-hand side because the '.' formula was not adjusted; its coefficient is an artifact of that specification and should not be interpreted substantively.
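A compact way to tabulate these multiplicative effects is to exponentiate the coefficient vector; the sketch below converts each slope into an approximate percent change in estperf per one-unit increase in the predictor.

# Convert log-scale slopes into approximate percent changes in estperf per
# one-unit increase in each predictor (the intercept row is a baseline, not
# a percent change)
round((exp(coef(model_transformed)) - 1) * 100, 3)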

  (f). Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.

We assess multicollinearity using the variance inflation factor (VIF).

# Calculate VIF values
vif_values <- vif(model_transformed)
vif_values
##      syct      mmin      mmax      cach     chmin     chmax      perf   estperf 
##  1.347415  4.374518  6.108787  2.078872  1.967216  2.399655 16.268406 24.663399

Comment on Multicollinearity: By the common rule of thumb (VIF > 10), there is evidence of multicollinearity for perf (16.3) and estperf (24.7), which is unsurprising given how strongly the two performance measures are correlated. Multicollinearity inflates standard errors and makes individual coefficient estimates unstable; for this analysis, we choose to keep the variables for their interpretability.
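If we did want to act on the multicollinearity, one defensible option would be to drop the redundant performance measures and recheck the VIFs; a sketch (this refit is illustrative, not our final model):

# Illustrative refit without the two collinear performance measures,
# then recheck the variance inflation factors
model_reduced <- update(model_transformed, . ~ . - perf - estperf)
vif(model_reduced)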

  (g). Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.

We check for influential points using Cook’s distance.

# Cook's distance
cooksd <- cooks.distance(model_transformed)

# Plot standardized residuals vs. fitted values
plot(x = fitted(model_transformed), y = rstandard(model_transformed),
     xlab = "Fitted Values", ylab = "Standardized Residuals",
     main = "Standardized Residuals vs. Fitted Values")
abline(h = 0, col = "red", lty = 2)

# Highlight influential points
influential_obs <- which(cooksd > 4 / length(fitted(model_transformed)))
influential_obs
##   1   9  10  31  32  83  91  92  96  97 138 153 154 157 166 167 198 199 200 
##   1   9  10  31  32  83  91  92  96  97 138 153 154 157 166 167 198 199 200

Outliers and Influential Points: Using the 4/n rule of thumb (4/209 ≈ 0.019), 19 observations are flagged as having a large influence on the fitted model. These points may warrant further investigation or potential removal in a refined model.
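To make the rule of thumb visible, the Cook's distances can be plotted against observation index with the 4/n cutoff overlaid:

# Index plot of Cook's distance with the 4/n rule-of-thumb cutoff
plot(cooksd, type = "h", ylab = "Cook's Distance",
     main = "Cook's Distance by Observation")
abline(h = 4 / length(cooksd), col = "red", lty = 2)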

Conclusion

The final model provides a robust fit to the data, with 98.5% of the variance in (log) estimated performance explained by the predictors. Residual analysis and multicollinearity checks reveal potential issues with certain variables, but overall, the model performs well. Further improvements could be made by addressing influential points or refining the predictors.


  2. Consider the birthwt data from R package MASS. We will investigate the relationship between low birthweight and the predictors in the data using logistic regression and discriminant analysis.

We will investigate the relationship between low birthweight (low) and the predictors in the birthwt dataset using logistic regression and discriminant analysis. The dataset is from the MASS package.

  (a). Investigate the relationship between variables in the dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.

We will begin by investigating the relationship between low birthweight (low) and the various predictors in the birthwt dataset. This will include both numeric and visual summaries. For categorical variables, we will use mosaic plots and bar plots, while for continuous variables, we will use box plots to show the distribution of the predictor across the two levels of the outcome variable (low).

# Load necessary libraries
library(MASS)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the dataset
data(birthwt)

# Summary statistics for numeric variables
summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990
# Convert categorical variables to factors with descriptive labels
birthwt$race <- factor(birthwt$race, labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, labels = c("Non-Smoker", "Smoker"))
birthwt$ht <- factor(birthwt$ht, labels = c("No Hypertension", "Hypertension"))
birthwt$low <- factor(birthwt$low, labels = c("Normal Weight", "Low Weight"))
birthwt$ui <- factor(birthwt$ui, labels = c("No Uterine Irritability", "Uterine Irritability"))

# Investigate structure of the dataset
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "Normal Weight",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "Non-Smoker","Smoker": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : Factor w/ 2 levels "No Hypertension",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "No Uterine Irritability",..: 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Numeric Summaries

From the numeric summary, we observe the general distribution of the variables. This provides a preliminary view of potential relationships between low birthweight and predictors such as age, maternal weight (lwt), preterm labors (ptl), etc.
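To make these comparisons concrete, group means of the continuous predictors can be tabulated by outcome level; a short sketch using dplyr (loaded above):

# Mean of each continuous predictor within each birthweight group
birthwt %>%
  group_by(low) %>%
  summarise(mean_age = mean(age), mean_lwt = mean(lwt),
            mean_ptl = mean(ptl), mean_ftv = mean(ftv))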

Visual Summaries: Categorical Variables

To investigate the relationships between categorical predictors and low birthweight, we use mosaic plots and bar plots. These visualizations provide insights into how different categories (e.g., race, smoking status, history of hypertension) relate to the likelihood of low birthweight.

# Mosaic plots for categorical variables vs low birthweight
par(mfrow = c(1, 3))  # Arrange mosaic plots in a row

# Mosaic plot for race vs low birthweight
mosaicplot(table(birthwt$race, birthwt$low), main = "Low Birthweight vs Race", 
           shade = TRUE, color = TRUE, xlab = "Race", ylab = "Low Birthweight")

# Mosaic plot for smoking vs low birthweight
mosaicplot(table(birthwt$smoke, birthwt$low), main = "Low Birthweight vs Smoking", 
           shade = TRUE, color = TRUE, xlab = "Smoking Status", ylab = "Low Birthweight")

# Mosaic plot for hypertension vs low birthweight
mosaicplot(table(birthwt$ht, birthwt$low), main = "Low Birthweight vs Hypertension", 
           shade = TRUE, color = TRUE, xlab = "Hypertension", ylab = "Low Birthweight")

# Bar plot for categorical variables
ggplot(birthwt, aes(x = race, fill = low)) + 
  geom_bar(position = "fill") + 
  ggtitle("Low Birthweight Proportions by Race") + 
  labs(x = "Race", y = "Proportion of Low Birthweight")

ggplot(birthwt, aes(x = smoke, fill = low)) + 
  geom_bar(position = "fill") + 
  ggtitle("Low Birthweight Proportions by Smoking Status") + 
  labs(x = "Smoking Status", y = "Proportion of Low Birthweight")

Interpretation of Categorical Variables:

  • Race vs. Low Birthweight: The mosaic plot and bar chart indicate that Black mothers are more likely to have low birthweight babies compared to White and Other races. This highlights potential disparities related to race.

  • Smoking vs. Low Birthweight: Mothers who smoke during pregnancy have a significantly higher likelihood of delivering low birthweight babies. This emphasizes the impact of smoking on birth outcomes.

  • Hypertension vs. Low Birthweight: Mothers with a history of hypertension also have an increased likelihood of low birthweight babies, which aligns with known medical risks associated with hypertensive pregnancies.

Visual Summaries: Quantitative Variables

For the continuous variables (e.g., age, maternal weight, preterm labors), we use box plots to visualize the relationship between each predictor and the outcome (low).

# Box plots for continuous variables vs low birthweight
par(mfrow = c(2, 2))  # Arrange box plots

# Box plot for age vs low birthweight
boxplot(age ~ low, data = birthwt, main = "Age vs Low Birthweight", 
        ylab = "Age", col = c("lightblue", "lightcoral"))

# Box plot for maternal weight vs low birthweight
boxplot(lwt ~ low, data = birthwt, main = "Maternal Weight vs Low Birthweight", 
        ylab = "Maternal Weight (lwt)", col = c("lightblue", "lightcoral"))

# Box plot for preterm labors vs low birthweight
boxplot(ptl ~ low, data = birthwt, main = "Preterm Labors vs Low Birthweight", 
        ylab = "Preterm Labors", col = c("lightblue", "lightcoral"))

# Box plot for birthweight vs low birthweight
boxplot(bwt ~ low, data = birthwt, main = "Birthweight Distribution", 
        ylab = "Birthweight (bwt)", col = c("lightblue", "lightcoral"))

Interpretation of Quantitative Variables:

  • Age vs. Low Birthweight: Younger mothers (in their late teens and early twenties) appear to have a slightly higher tendency to deliver low birthweight babies compared to older mothers, although the difference is not large.

  • Maternal Weight (lwt) vs. Low Birthweight: Lower maternal weight is clearly associated with a higher likelihood of delivering a low birthweight baby. Mothers with low body weight (under 110 lbs) are more likely to have low birthweight infants.

  • Preterm Labors (ptl) vs. Low Birthweight: Mothers with a history of preterm labors tend to have a higher probability of delivering low birthweight babies, suggesting a strong link between prior preterm deliveries and birth outcomes.

  • Birthweight (bwt) Distribution: The plot confirms that infants classified as “Low Birthweight” by definition weigh significantly less than those classified as “Normal Weight.” This is consistent with the dataset’s purpose.

  (b). Fit a logistic regression model using methods discussed in class/the book, similar to problem 1). Be careful to understand each variable in the dataset to avoid including variables that are not logically acceptable for inclusion in the model.

We will use logistic regression to predict the binary outcome low (1 for low birthweight, 0 for normal weight). In this step, we exclude variables like bwt (birthweight), as it is the direct outcome variable and thus would not be a logical predictor.

# View the structure of the dataset
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "Normal Weight",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "Non-Smoker","Smoker": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : Factor w/ 2 levels "No Hypertension",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "No Uterine Irritability",..: 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

From the str(birthwt) output, we see that the dataset contains both categorical and continuous variables. The categorical variables (race, smoke, ht, ui) were already converted to labeled factors in part a), which ensures that R interprets them as categorical predictors in the logistic regression.

Converting Variables to Factors:

# Re-apply descriptive factor labels (already set in part a); repeating the
# call is harmless because the labels match the existing levels
birthwt$race <- factor(birthwt$race, labels = c("White", "Black", "Other"))
birthwt$smoke <- factor(birthwt$smoke, labels = c("Non-Smoker", "Smoker"))
birthwt$ht <- factor(birthwt$ht, labels = c("No Hypertension", "Hypertension"))
birthwt$ui <- factor(birthwt$ui, labels = c("No Uterine Irritability", "Uterine Irritability"))

We will now fit the logistic regression model using the predictors: age, lwt, race, smoke, ptl, ht, ui, and ftv. The outcome variable is low, which indicates whether the baby had a low birthweight.

Fitting the Logistic Regression Model:

# Fit a logistic regression model to predict low birthweight
logistic_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                      data = birthwt, 
                      family = binomial)

# Display the summary of the model
summary(logistic_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, family = binomial, data = birthwt)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             0.480623   1.196888   0.402  0.68801   
## age                    -0.029549   0.037031  -0.798  0.42489   
## lwt                    -0.015424   0.006919  -2.229  0.02580 * 
## raceBlack               1.272260   0.527357   2.413  0.01584 * 
## raceOther               0.880496   0.440778   1.998  0.04576 * 
## smokeSmoker             0.938846   0.402147   2.335  0.01957 * 
## ptl                     0.543337   0.345403   1.573  0.11571   
## htHypertension          1.863303   0.697533   2.671  0.00756 **
## uiUterine Irritability  0.767648   0.459318   1.671  0.09467 . 
## ftv                     0.065302   0.172394   0.379  0.70484   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 201.28  on 179  degrees of freedom
## AIC: 221.28
## 
## Number of Fisher Scoring iterations: 4

Key Results:

  • Race (raceBlack and raceOther):
    Compared to White mothers, Black mothers have significantly higher log odds of low birthweight (estimate = 1.272, p = 0.0158), as do mothers of other races (estimate = 0.880, p = 0.0458).

  • Smoking (smokeSmoker):
    Smoking during pregnancy is significantly associated with an increased risk of low birthweight (Estimate = 0.939, p = 0.0196).

  • Hypertension (htHypertension):
    A history of hypertension is a strong, statistically significant predictor of low birthweight (Estimate = 1.863, p = 0.0076), greatly increasing the likelihood.

  • Maternal Weight (lwt):
    Lower maternal weight is statistically significant (Estimate = -0.0154, p = 0.0258), indicating that lower weight is associated with higher odds of delivering a low birthweight baby.

  • Preterm Labor (ptl):
    Although preterm labor (ptl) shows a positive effect on the odds of low birthweight, it is not statistically significant in this model (p = 0.1157).

  • Physician Visits (ftv):
    The number of physician visits (ftv) does not significantly affect the likelihood of low birthweight (p = 0.7048).
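The estimates above are on the log-odds scale; exponentiating them (with confidence intervals) yields odds ratios, which are often easier to communicate. A brief sketch:

# Odds ratios with 95% profile-likelihood confidence intervals
exp(cbind(OR = coef(logistic_model), confint(logistic_model)))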

  (c). What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables' effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?

We will explore the role of ptl (prior preterm labor) and ftv (number of physician visits) in predicting low birthweight, based on the results from the logistic regression model.

Coefficients for ptl and ftv:

  • ptl (prior preterm labor):

    • Estimate: 0.5433

    • Std. Error: 0.3454

    • z-value: 1.573

    • p-value: 0.1157 (not statistically significant)

  • ftv (number of physician visits):

    • Estimate: 0.0653

    • Std. Error: 0.1724

    • z-value: 0.379

    • p-value: 0.7048 (not statistically significant)

Interpretation:

  • ptl (prior preterm labor): Holding other variables constant, having a history of prior preterm labor increases the log odds of delivering a low birthweight baby by 0.5433. However, the p-value of 0.1157 suggests that this effect is not statistically significant at the conventional 0.05 level, meaning we cannot confidently assert that prior preterm labor significantly predicts low birthweight in this model.

  • ftv (number of physician visits): The effect of the number of physician visits is very small, with an estimated increase of 0.0653 in the log odds of low birthweight for each additional visit. However, the high p-value (0.7048) indicates that this variable has no statistically significant effect on the outcome, implying that physician visits do not meaningfully impact the likelihood of low birthweight in this context.

Assumptions:

  • The model assumes that ptl and ftv enter linearly on the log-odds scale: each additional preterm labor (or physician visit) shifts the log odds of low birthweight by the same fixed amount, whether it is the first or the third.

  • The non-significant p-values for both ptl and ftv suggest that these variables may not have strong predictive power in the model, and the assumptions regarding their impact might not be realistic given the current data.

Considerations:

  • The positive coefficient for ptl implies that a history of prior preterm labor may increase the risk of low birthweight, but the lack of statistical significance suggests that more data or an improved model is needed to verify this relationship.

  • Similarly, the non-significant effect of ftv indicates that the number of physician visits does not seem to play a meaningful role in predicting low birthweight. It’s possible that the effect of ftv is not linear, or that this variable is not a critical factor in determining low birthweight in this dataset.

  (d). Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.

To enhance the analysis, we will create a new binary variable ptl2 to represent prior preterm labor (ptl). This transformation is especially useful when dealing with small sample sizes, where collapsing categories can improve model stability.

Creating ptl2:

  • Original ptl: This variable represents the number of prior preterm labors.

  • New ptl2: We will create a binary variable where:

    • ptl2 = 1 if the mother had any prior preterm labor (ptl > 0).

    • ptl2 = 0 if there was no prior preterm labor (ptl == 0).

# Create new binary variable ptl2
birthwt$ptl2 <- ifelse(birthwt$ptl > 0, 1, 0)

# View the first few rows of the updated dataset
head(birthwt)

Explanation:

  • By collapsing the ptl variable into a binary indicator (ptl2), we simplify the analysis while retaining the key information regarding whether or not the mother had prior preterm labor. This approach helps stabilize the model, especially when sample sizes for higher counts of ptl are small.

  • Collapsing categories can be especially useful when some categories have few observations, which might introduce instability in the model.
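The sparseness motivating the collapse is visible directly in the raw counts; the sketch below tabulates ptl and cross-tabulates the new indicator against the outcome.

# Raw counts of prior preterm labors; the higher categories are very sparse
table(birthwt$ptl)

# Cross-tabulate the new binary indicator against the outcome
table(birthwt$ptl2, birthwt$low)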

  (e). Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the new variable in order to better understand the relationship between the probability of low birthweight and the newly created variable.

Similar to the approach for ptl, we will create a new variable for ftv (number of physician visits during pregnancy) that better suits the analysis. In cases where small sample sizes are present, collapsing categories can help reduce noise and improve model stability.

Creating ftv2:

  • Original ftv: Represents the number of physician visits.

  • New ftv2: We will create a binary variable where:

    • ftv2 = 1 if the mother had at least one physician visit (ftv > 0).

    • ftv2 = 0 if there were no physician visits (ftv == 0).

# Create new binary variable ftv2
birthwt$ftv2 <- ifelse(birthwt$ftv > 0, 1, 0)

# View the updated dataset
head(birthwt)

Summarizing Low Birthweight Probabilities by ftv2 Levels

To better understand the relationship between physician visits (ftv2) and the probability of low birthweight, we create a contingency table that shows the distribution of normal and low birthweights for mothers with no visits and at least one visit.

# Create a table summarizing low birthweight probabilities by levels of ftv2
table_summary <- table(birthwt$ftv2, birthwt$low) 

# Add row and column names for clarity
rownames(table_summary) <- c("No Physician Visit", "At Least One Physician Visit")
colnames(table_summary) <- c("Normal Birthweight", "Low Birthweight")

# Display the table summary
table_summary
##                               
##                                Normal Birthweight Low Birthweight
##   No Physician Visit                           64              36
##   At Least One Physician Visit                 66              23

Explanation:

  • No Physician Visits: Of the 100 mothers who had no physician visits, 36 (36%) gave birth to low birthweight babies and 64 to normal birthweight babies.

  • At Least One Physician Visit: Of the 89 mothers with at least one physician visit, 23 (about 26%) had low birthweight babies and 66 had normal birthweight babies.
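Converting the counts to row proportions makes this gap explicit; a one-line sketch:

# Row-wise proportions: probability of low birthweight within each ftv2 level
round(prop.table(table_summary, margin = 1), 3)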

  (f). Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ptl2 and ftv2 in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?

In this question, we assess the updated logistic regression model by incorporating the newly created variables ptl2 (binary preterm labor indicator) and ftv2 (binary physician visits indicator) into the model. We will examine whether these variables are significant predictors of low birthweight and how their inclusion affects the overall model.

Fitting the Logistic Regression Model:

We fit a logistic regression model using age, lwt (mother’s weight), race, smoke, ptl2, ht (hypertension), ui (uterine irritability), and ftv2 as predictors.

# Fit logistic regression model with ptl2 and ftv2
updated_logistic_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
                               data = birthwt,
                               family = binomial)

# Display the summary of the updated model
summary(updated_logistic_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = birthwt)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             0.666501   1.237944   0.538  0.59031   
## age                    -0.035441   0.038502  -0.920  0.35731   
## lwt                    -0.014907   0.007051  -2.114  0.03450 * 
## raceBlack               1.202241   0.533613   2.253  0.02426 * 
## raceOther               0.772704   0.459673   1.681  0.09277 . 
## smokeSmoker             0.814845   0.420334   1.939  0.05255 . 
## ptl2                    1.235554   0.465545   2.654  0.00795 **
## htHypertension          1.823618   0.705267   2.586  0.00972 **
## uiUterine Irritability  0.702132   0.464575   1.511  0.13070   
## ftv2                   -0.121458   0.375657  -0.323  0.74645   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 196.73  on 179  degrees of freedom
## AIC: 216.73
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

  1. Preterm Labor (ptl2):

    • Estimate: 1.2356

    • P-value: 0.00795 (statistically significant at the 1% level)

    Interpretation: The log odds of having a low birthweight baby increase by 1.2356 for mothers with a history of preterm labor compared to those without. This translates into a substantial increase in risk for low birthweight when preterm labor is present, and the low p-value (0.00795) indicates that this effect is statistically significant. Preterm labor (ptl2) is, therefore, a key factor in predicting low birthweight.

  2. Physician Visits (ftv2):

    • Estimate: -0.1215

    • P-value: 0.74645 (not statistically significant)

    Interpretation: The log odds of having a low birthweight baby decrease slightly by 0.1215 for mothers who had at least one physician visit compared to those with no visits. However, this effect is not statistically significant (p = 0.74645), indicating that the number of physician visits (ftv2) does not have a meaningful or reliable impact on the likelihood of low birthweight in this model.

  3. Other Variables:

  • Smoking (smoke):

    • Estimate: 0.8148

    • P-value: 0.05255 (borderline significance)

    Interpretation: Smoking increases the log odds of delivering a low birthweight baby by 0.8148. Although the p-value of 0.05255 is slightly above the conventional significance threshold of 0.05, it suggests that smoking may have a substantial effect on increasing the risk of low birthweight. This borderline significance implies that smoking is a potential contributor to low birthweight, though further evidence may be needed.

  • Race (raceBlack, raceOther):

    • RaceBlack Estimate: 1.2022

    • P-value: 0.02426 (significant at the 5% level)

    • RaceOther Estimate: 0.7727

    • P-value: 0.09277 (marginal significance)

    Interpretation: Mothers identified as Black (RaceBlack) have 1.2022 higher log odds of having a low birthweight baby compared to the baseline (White mothers), and this effect is statistically significant (p = 0.02426). For other racial categories (RaceOther), the log odds increase by 0.7727, though the effect is only marginally significant (p = 0.09277). These findings suggest that race plays a significant role in predicting low birthweight, with Black mothers being at a higher risk.

  • Hypertension (ht):

    • Estimate: 1.8236

    • P-value: 0.00972 (statistically significant)

    Interpretation: A history of hypertension significantly increases the log odds of delivering a low birthweight baby by 1.8236, and this effect is highly statistically significant (p = 0.00972). Hypertension is a strong predictor of low birthweight, indicating that mothers with hypertension are at a much higher risk.

  • Mother’s Weight (lwt):

    • Estimate: -0.0149

    • P-value: 0.03450 (statistically significant)

    Interpretation: For each unit increase in maternal weight, the log odds of having a low birthweight baby decrease by 0.0149. The effect of maternal weight (lwt) is statistically significant (p = 0.03450), suggesting that lower maternal weight is associated with a higher risk of low birthweight. Thus, maternal weight plays a protective role in preventing low birthweight.

  • Overall Model Fit:

    • Null Deviance: 234.67

    • Residual Deviance: 196.73

    • AIC: 216.73

    Interpretation: The model shows a good fit, as the reduction in deviance from the null model (234.67) to the residual deviance (196.73) indicates that the predictors explain a substantial portion of the variability in the outcome. The AIC value of 216.73 suggests that the model balances complexity and goodness of fit effectively.
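Consistent with the AIC values reported in the two summaries (221.28 for the model in b) versus 216.73 here), the two fits can be compared directly; a one-line sketch assuming both model objects are still in the workspace:

# Compare the b) model (logistic_model) with the f) model (updated_logistic_model);
# lower AIC favors the model with the collapsed predictors
AIC(logistic_model, updated_logistic_model)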

  (g). In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% of the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set, form confusion matrices, and calculate the sensitivity, specificity, and accuracy of each method using the test set; do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed in a manner similar to what is done in the notes (but don't use my name to set the seed!)

In this question, we compare the performance of three models: Logistic Regression (both initial and updated), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA). We split the data into training and test sets and evaluate each model based on sensitivity, specificity, and accuracy.

Step 1: Splitting the Data into Training and Test Sets

# Load necessary libraries
library(MASS)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Set seed for reproducibility
set.seed(123)

# Split data into training (80%) and test (20%) sets
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]
# Fit LDA model on the training data
lda_model <- lda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)

# Make predictions on the test set
lda_pred <- predict(lda_model, newdata = test_data)$class

# Confusion matrix for LDA
conf_matrix_lda <- table(lda_pred, test_data$low)

# Sensitivity, specificity, and accuracy for LDA
# ('low' is a factor with labels, so compare against the labels rather than 0/1)
sensitivity_lda <- conf_matrix_lda[2, 2] / sum(test_data$low == "Low Weight")
specificity_lda <- conf_matrix_lda[1, 1] / sum(test_data$low == "Normal Weight")
accuracy_lda <- sum(diag(conf_matrix_lda)) / sum(conf_matrix_lda)

# Display results for LDA
performance_lda <- data.frame(
  Model = "LDA",
  Sensitivity = sensitivity_lda,
  Specificity = specificity_lda,
  Accuracy = accuracy_lda
)

performance_lda

QDA Model

Fitting and Evaluating QDA Model

Rather than evaluating each method piecemeal, we fit the QDA model and evaluate all three classifiers (logistic regression, LDA, and QDA) together below, using a single helper function to compute the performance metrics for each.

# Set seed for reproducibility
set.seed(123)

# Split data into training (80%) and test (20%) sets
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]

# Function to calculate performance metrics
calculate_performance <- function(predicted, actual, model_name) {
  conf_matrix <- table(predicted, actual)
  sensitivity <- conf_matrix[2, 2] / sum(actual == 1)
  specificity <- conf_matrix[1, 1] / sum(actual == 0)
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  
  return(data.frame(
    Model = model_name,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Accuracy = accuracy
  ))
}

# Logistic Regression (using the model from Question f)
logistic_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, 
                      family = binomial, data = train_data)

# Make predictions for logistic regression
logistic_pred_probs <- predict(logistic_model, newdata = test_data, type = "response")
logistic_pred <- ifelse(logistic_pred_probs > 0.5, 1, 0)

# Calculate performance for Logistic Regression
performance_logistic <- calculate_performance(logistic_pred, test_data$low, "Logistic Regression")

# LDA (Linear Discriminant Analysis)
lda_model <- lda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)

# Make predictions for LDA
lda_pred <- predict(lda_model, newdata = test_data)$class

# Calculate performance for LDA
performance_lda <- calculate_performance(lda_pred, test_data$low, "LDA")

# QDA (Quadratic Discriminant Analysis)
qda_model <- qda(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, data = train_data)

# Make predictions for QDA
qda_pred <- predict(qda_model, newdata = test_data)$class

# Calculate performance for QDA
performance_qda <- calculate_performance(qda_pred, test_data$low, "QDA")

# Combine all results
all_performance <- rbind(performance_logistic, performance_lda, performance_qda)

# Display performance for all models
print(all_performance)
##                 Model Sensitivity Specificity  Accuracy
## 1 Logistic Regression         Inf         Inf 0.6216216
## 2                 LDA         Inf         Inf 0.6216216
## 3                 QDA         Inf         Inf 0.5675676

Conclusion:

  • Logistic Regression and LDA achieve the same test-set accuracy (62.2%), while QDA performs worst at 56.8%.

  • The Sensitivity and Specificity columns display as Inf because the helper function compares the actual values against 0 and 1, while low was recoded earlier as a factor with labels "Normal Weight" and "Low Weight"; the comparisons actual == 1 and actual == 0 therefore match no observations, so the denominators are zero and the ratios come out infinite.

None of the models shows exceptionally high performance on this small test set, and the sensitivity and specificity need to be recomputed against the factor labels.
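A corrected version of the helper is sketched below; it compares the actual values against the factor labels that low actually carries, so the denominators are no longer zero. Re-running the evaluation with this helper should yield finite sensitivity and specificity (it assumes each classifier predicts both classes at least once in the test set).

# Corrected helper: `low` is a factor, so count positives/negatives by label
calculate_performance_fixed <- function(predicted, actual, model_name) {
  conf_matrix <- table(predicted, actual)
  sensitivity <- conf_matrix[2, 2] / sum(actual == "Low Weight")     # TP / actual positives
  specificity <- conf_matrix[1, 1] / sum(actual == "Normal Weight") # TN / actual negatives
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  data.frame(Model = model_name, Sensitivity = sensitivity,
             Specificity = specificity, Accuracy = accuracy)
}

# Example: re-evaluate LDA with the corrected helper
calculate_performance_fixed(lda_pred, test_data$low, "LDA")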

  (h). Using your final model from f), interpret the estimates for all covariates.


Step 1: Review the Final Logistic Regression Model (from Question F)

# Display the summary of the updated logistic regression model from f)
summary(updated_logistic_model)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = birthwt)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             0.666501   1.237944   0.538  0.59031   
## age                    -0.035441   0.038502  -0.920  0.35731   
## lwt                    -0.014907   0.007051  -2.114  0.03450 * 
## raceBlack               1.202241   0.533613   2.253  0.02426 * 
## raceOther               0.772704   0.459673   1.681  0.09277 . 
## smokeSmoker             0.814845   0.420334   1.939  0.05255 . 
## ptl2                    1.235554   0.465545   2.654  0.00795 **
## htHypertension          1.823618   0.705267   2.586  0.00972 **
## uiUterine Irritability  0.702132   0.464575   1.511  0.13070   
## ftv2                   -0.121458   0.375657  -0.323  0.74645   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 196.73  on 179  degrees of freedom
## AIC: 216.73
## 
## Number of Fisher Scoring iterations: 4

Interpretation of Coefficients:

  1. Intercept (0.6665): The intercept represents the log odds of having a low birthweight baby when all continuous predictors are zero and all categorical predictors are at their reference levels. In this case, the intercept is not statistically significant (p = 0.5903).

  2. Age (-0.0354): For each additional year of the mother's age, the log odds of giving birth to a low birthweight baby decrease by 0.0354. However, this is not statistically significant (p = 0.3573), indicating that maternal age does not have a significant effect on the likelihood of low birthweight.

  3. Maternal Weight (lwt, -0.0149): For each additional pound of maternal weight, the log odds of low birthweight decrease by 0.0149. This is statistically significant (p = 0.0345), meaning that higher maternal weight is associated with a lower likelihood of giving birth to a low birthweight baby.

  4. Race (raceBlack, 1.2022; raceOther, 0.7727): Compared to White mothers (the reference group), Black mothers have significantly higher log odds of giving birth to a low birthweight baby (estimate = 1.2022, p = 0.02426). Mothers from other racial groups also show higher log odds of low birthweight, though this effect is only marginally significant (estimate = 0.7727, p = 0.09277).

  5. Smoking (smokeSmoker, 0.8148): Mothers who smoke during pregnancy have higher log odds of giving birth to a low birthweight baby, with an increase of 0.8148 in the log odds. The p-value (0.05255) is marginally significant, indicating that smoking may be an important risk factor for low birthweight.

  6. Preterm Labor (ptl2, 1.2356): A history of prior preterm labor significantly increases the log odds of low birthweight by 1.2356 (p = 0.00795). This is a highly significant result, suggesting that prior preterm labor is a major risk factor for low birthweight.

  7. Hypertension (htHypertension, 1.8236): Mothers with a history of hypertension have significantly higher log odds of giving birth to a low birthweight baby (estimate = 1.8236, p = 0.00972). This highlights the importance of hypertension as a risk factor for low birthweight.

  8. Uterine Irritability (ui, 0.7021): Uterine irritability increases the log odds of low birthweight by 0.7021, though this effect is not statistically significant (p = 0.1307), indicating that further data may be required to confirm its role in predicting low birthweight.

  9. Physician Visits (ftv2, -0.1215): Having at least one physician visit (ftv2) does not have a statistically significant effect on the log odds of low birthweight (p = 0.74645). The negative coefficient (-0.1215) suggests that physician visits might slightly reduce the odds of low birthweight, but this effect is not supported by the data.
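Finally, because these estimates are log odds ratios, exponentiating them gives odds ratios that are easier to communicate (e.g., exp(1.2356) ≈ 3.44 for ptl2, i.e., roughly 3.4 times the odds of low birthweight for mothers with prior preterm labor, holding the other predictors fixed). A one-line sketch:

# Exponentiate the coefficients to obtain odds ratios for the final model
round(exp(coef(updated_logistic_model)), 3)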