1.)

#Load in Data
library(MASS)
data("cpus")

#Summarize Data
summary(cpus)
##              name          syct             mmin            mmax      
##  ADVISOR 32/60 :  1   Min.   :  17.0   Min.   :   64   Min.   :   64  
##  AMDAHL 470/7A :  1   1st Qu.:  50.0   1st Qu.:  768   1st Qu.: 4000  
##  AMDAHL 470V/7 :  1   Median : 110.0   Median : 2000   Median : 8000  
##  AMDAHL 470V/7B:  1   Mean   : 203.8   Mean   : 2868   Mean   :11796  
##  AMDAHL 470V/7C:  1   3rd Qu.: 225.0   3rd Qu.: 4000   3rd Qu.:16000  
##  AMDAHL 470V/8 :  1   Max.   :1500.0   Max.   :32000   Max.   :64000  
##  (Other)       :203                                                   
##       cach            chmin            chmax             perf       
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :   6.0  
##  1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.:  27.0  
##  Median :  8.00   Median : 2.000   Median :  8.00   Median :  50.0  
##  Mean   : 25.21   Mean   : 4.699   Mean   : 18.27   Mean   : 105.6  
##  3rd Qu.: 32.00   3rd Qu.: 6.000   3rd Qu.: 24.00   3rd Qu.: 113.0  
##  Max.   :256.00   Max.   :52.000   Max.   :176.00   Max.   :1150.0  
##                                                                     
##     estperf       
##  Min.   :  15.00  
##  1st Qu.:  28.00  
##  Median :  45.00  
##  Mean   :  99.33  
##  3rd Qu.: 101.00  
##  Max.   :1238.00  
## 
head(cpus)
##             name syct mmin  mmax cach chmin chmax perf estperf
## 1  ADVISOR 32/60  125  256  6000  256    16   128  198     199
## 2  AMDAHL 470V/7   29 8000 32000   32     8    32  269     253
## 3  AMDAHL 470/7A   29 8000 32000   32     8    32  220     253
## 4 AMDAHL 470V/7B   29 8000 32000   32     8    32  172     253
## 5 AMDAHL 470V/7C   29 8000 16000   32     8    16  132     132
## 6  AMDAHL 470V/8   26 8000 32000   64     8    32  318     290

A.)

#1a.)

#Pairwise Scatterplots
pairs(cpus)

#Correlation Matrix 
cor(cpus[, sapply(cpus, is.numeric)]) 
##               syct       mmin       mmax       cach      chmin      chmax
## syct     1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin    -0.3356422  1.0000000  0.7581573  0.5347291  0.5171892  0.2669074
## mmax    -0.3785606  0.7581573  1.0000000  0.5379898  0.5605134  0.5272462
## cach    -0.3209998  0.5347291  0.5379898  1.0000000  0.5822455  0.4878458
## chmin   -0.3010897  0.5171892  0.5605134  0.5822455  1.0000000  0.5482812
## chmax   -0.2505023  0.2669074  0.5272462  0.4878458  0.5482812  1.0000000
## perf    -0.3070821  0.7949233  0.8629942  0.6626135  0.6089025  0.6052193
## estperf -0.2883956  0.8192915  0.9012024  0.6486203  0.6105802  0.5921556
##               perf    estperf
## syct    -0.3070821 -0.2883956
## mmin     0.7949233  0.8192915
## mmax     0.8629942  0.9012024
## cach     0.6626135  0.6486203
## chmin    0.6089025  0.6105802
## chmax    0.6052193  0.5921556
## perf     1.0000000  0.9664687
## estperf  0.9664687  1.0000000

1a Interpretations:

-The first thing that stood out to me in the pairwise scatterplot is that syct tends to slope downward against every other variable, so we could infer that higher syct usually corresponds with lower performance. This makes sense: syct is cycle time, and lower cycle times are better. Next, mmin and mmax slope upward against performance, suggesting that performance rises as minimum and maximum main memory increase, which also fits what we know about computer hardware. The cach, chmin, and chmax variables likewise show mostly positive relationships with performance, though weaker than those of mmin and mmax. Additionally, mmin/mmax and chmin/chmax are each highly correlated within their pair, which makes sense since the two values typically increase together. Lastly, name is not numeric, so its panels are not meaningful in this scenario.

-In the correlation matrix, the strongest correlates of estperf are mmax (0.90) and mmin (0.82); cach (0.65), chmin (0.61), and chmax (0.59) are moderate; and syct (-0.29) is the only negative correlation. The two performance measures, perf and estperf, correlate at 0.97, and among the predictors mmin and mmax (0.76) stand out as strongly correlated with each other, consistent with the scatterplot observations above.
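
-As a quick programmatic check (a sketch), the predictors can also be ranked by the absolute size of their correlation with estperf:

#Rank numeric predictors by absolute correlation with estperf
cors <- cor(cpus[, sapply(cpus, is.numeric)])[, "estperf"]
sort(abs(cors[names(cors) != "estperf"]), decreasing = TRUE)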

B.)

#1b.)
#Linear Regression Model excluding name and published performance.
model <- lm(estperf ~ . - name - perf, data = cpus)
summary(model)
## 
## Call:
## lm(formula = estperf ~ . - name - perf, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.198  -24.064    1.429   23.136  288.718 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.648e+01  6.288e+00 -10.574  < 2e-16 ***
## syct         6.596e-02  1.369e-02   4.818 2.85e-06 ***
## mmin         1.431e-02  1.428e-03  10.021  < 2e-16 ***
## mmax         6.590e-03  5.016e-04  13.138  < 2e-16 ***
## cach         4.945e-01  1.091e-01   4.533 9.94e-06 ***
## chmin       -1.723e-01  6.687e-01  -0.258    0.797    
## chmax        1.201e+00  1.720e-01   6.985 4.04e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.88 on 202 degrees of freedom
## Multiple R-squared:  0.9109, Adjusted R-squared:  0.9082 
## F-statistic: 344.1 on 6 and 202 DF,  p-value: < 2.2e-16

Model Creation Process

-The prior question allowed me to confirm the correlations, and the scatterplots allowed me to visually confirm whether linear relationships were present. From both the question prompt and our exploratory analysis, name is simply an identifier with no predictive relationship to estimated performance, so we do not include it. We also exclude published performance (perf), since question #1 instructs us not to use it as a predictor. All other variables were included due to their strong correlations with estimated performance. I included syct even though its correlation was not strong, because it is the only variable negatively correlated with estimated performance.
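
-One hedged way to sanity-check these choices after fitting is a single-term deletion test, which reports the effect of dropping each predictor in turn:

#Single-term deletions as a sanity check on the chosen predictors
drop1(model, test = "F")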

C.)

#1c.) 

#Residual vs Fitted
plot(model, which = 1)

#Normal Q-Q Plot
plot(model, which = 2)

Interpretations

-Residuals vs Fitted: Most of the residuals bunch up around the zero line, which indicates that the model is capturing much of the linear signal. However, the red trend line dips below zero and then rises slightly toward the end of the plot, which may indicate a mildly curved pattern; the relationship may not be perfectly linear in parts of the dataset. There does not appear to be a severe "funnel" shape that would indicate heteroscedasticity, but there is some non-linearity present. Additionally, there are a few outliers to note, mainly 199, 200, and 10 in the top right of the graph; leverage should be checked for these points.

-QQ Plot: The majority of residuals hug the diagonal, indicating reasonably normal residuals. The upper tail deviates more than the lower tail because of the outliers, suggesting a heavy upper tail. Overall, most residuals are well behaved, with a small set standing out, especially in the upper tail.

-Outliers: Cook's distance and leverage are also visualized below and show that the three main outliers (10, 199, 200) do significantly influence the fit. I viewed those rows specifically to check for typos or data-entry errors and found nothing definitive. However, looking at the data and its summary, these estimated-performance values do not fit with the rest: the mean of estperf is 99.33 while its maximum is 1238, far above the bulk of the data. To address this I fit a model with a log transformation, which alleviates some of the influence of the largest values on the model's fit. I am keeping the outliers in the dataset because I cannot confidently say they do not belong.
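
-For reference, here is a minimal sketch of how those flagged rows can be pulled up for inspection (row numbers taken from the diagnostic plots):

#View the rows flagged by the diagnostic plots
cpus[c(10, 199, 200), ]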

-The final model, seen below, is my best model. To address the slight curvature in the residuals plot I tried both a logarithmic and a polynomial transformation. I chose the log transformation for the final model since its fit statistics are very strong and its residual plot looks much better behaved. Some curvature can still be seen in the graph, but this is acceptable given the high predictive accuracy. Below, I also test the polynomial transformation and a refit with the outliers removed (see the comparison sketch at the end of this part); I opted not to remove the outliers, and the polynomial transformation did not give the best fit for this scenario.

#Final Model
model_log <- lm(log(estperf) ~ . - name - perf, data = cpus)
summary(model_log)
## 
## Call:
## lm(formula = log(estperf) ~ . - name - perf, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96828 -0.12609  0.02829  0.15471  0.46886 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.293e+00  2.934e-02 112.254  < 2e-16 ***
## syct        -4.510e-04  6.389e-05  -7.059 2.64e-11 ***
## mmin         1.278e-05  6.662e-06   1.919   0.0565 .  
## mmax         5.386e-05  2.341e-06  23.011  < 2e-16 ***
## cach         7.024e-03  5.090e-04  13.801  < 2e-16 ***
## chmin        4.808e-03  3.120e-03   1.541   0.1249    
## chmax       -1.598e-03  8.024e-04  -1.992   0.0478 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9455 
## F-statistic: 602.1 on 6 and 202 DF,  p-value: < 2.2e-16
plot(model_log, which = 1:2)

#Investigating Outliers
plot(model_log, which = 4)  # Cook's distance

plot(model_log, which = 5)  # Residuals vs. leverage

#New Polynomial Model
model_poly <- lm(estperf ~ syct + I(syct^2) + mmin + mmax + cach + chmin + chmax, data = cpus)
summary(model_poly)
## 
## Call:
## lm(formula = estperf ~ syct + I(syct^2) + mmin + mmax + cach + 
##     chmin + chmax, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -144.102  -23.062    0.181   22.600  275.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.933e+01  8.103e+00  -9.789  < 2e-16 ***
## syct         1.569e-01  3.926e-02   3.996 9.03e-05 ***
## I(syct^2)   -8.052e-05  3.264e-05  -2.467   0.0145 *  
## mmin         1.464e-02  1.417e-03  10.335  < 2e-16 ***
## mmax         6.721e-03  4.982e-04  13.490  < 2e-16 ***
## cach         5.238e-01  1.084e-01   4.833 2.67e-06 ***
## chmin       -6.019e-02  6.620e-01  -0.091   0.9276    
## chmax        1.164e+00  1.705e-01   6.826 1.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.3 on 201 degrees of freedom
## Multiple R-squared:  0.9135, Adjusted R-squared:  0.9105 
## F-statistic: 303.2 on 7 and 201 DF,  p-value: < 2.2e-16
#Check POLY Model
plot(model_poly, which = 1)  

plot(model_poly, which = 2)  

# Clean dataset of outliers
cpus_clean <- cpus[-c(10, 199, 200), ]

# Re-fit the model
model_no_outliers <- lm(log(estperf) ~ . - name - perf, data = cpus_clean)
summary(model_no_outliers)
## 
## Call:
## lm(formula = log(estperf) ~ . - name - perf, data = cpus_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11276 -0.10900  0.02333  0.12428  0.39123 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.192e+00  2.756e-02 115.831  < 2e-16 ***
## syct        -3.582e-04  5.533e-05  -6.473 7.32e-10 ***
## mmin         2.753e-05  6.470e-06   4.255 3.22e-05 ***
## mmax         5.690e-05  2.168e-06  26.242  < 2e-16 ***
## cach         6.567e-03  4.386e-04  14.971  < 2e-16 ***
## chmin       -4.558e-04  2.868e-03  -0.159   0.8739    
## chmax        1.627e-03  8.202e-04   1.983   0.0487 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1861 on 199 degrees of freedom
## Multiple R-squared:  0.9561, Adjusted R-squared:  0.9548 
## F-statistic: 722.1 on 6 and 199 DF,  p-value: < 2.2e-16
#Check No Outliers Model
plot(model_no_outliers, which = 1)  

plot(model_no_outliers, which = 2) 
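
-To compare the candidates on a common scale, one rough sketch is to back-transform the log model's fitted values and compute in-sample RMSE on the original estperf scale (this ignores retransformation bias, so treat it only as a rough check):

#Rough in-sample RMSE comparison on the original estperf scale
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(linear = rmse(cpus$estperf, fitted(model)),
  poly   = rmse(cpus$estperf, fitted(model_poly)),
  log    = rmse(cpus$estperf, exp(fitted(model_log))))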

D.)

#1d.)

#Final Model
model_log <- lm(log(estperf) ~ . - name - perf, data = cpus)
summary(model_log)
## 
## Call:
## lm(formula = log(estperf) ~ . - name - perf, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96828 -0.12609  0.02829  0.15471  0.46886 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.293e+00  2.934e-02 112.254  < 2e-16 ***
## syct        -4.510e-04  6.389e-05  -7.059 2.64e-11 ***
## mmin         1.278e-05  6.662e-06   1.919   0.0565 .  
## mmax         5.386e-05  2.341e-06  23.011  < 2e-16 ***
## cach         7.024e-03  5.090e-04  13.801  < 2e-16 ***
## chmin        4.808e-03  3.120e-03   1.541   0.1249    
## chmax       -1.598e-03  8.024e-04  -1.992   0.0478 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9455 
## F-statistic: 602.1 on 6 and 202 DF,  p-value: < 2.2e-16

-Model fit stats: the log model explains about 94.7% of the variance in log(estperf) (multiple R-squared = 0.947, adjusted R-squared = 0.9455), with a residual standard error of 0.2188 on the log scale and a highly significant overall F-statistic (602.1 on 6 and 202 df, p < 2.2e-16), indicating a strong overall fit.
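
-These figures can also be pulled programmatically from the summary object; a minimal sketch:

#Extract the headline fit statistics from the summary object
s <- summary(model_log)
c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, sigma = s$sigma)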

E.)

#Final Model
model_log <- lm(log(estperf) ~ . - name - perf, data = cpus)
summary(model_log)
## 
## Call:
## lm(formula = log(estperf) ~ . - name - perf, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96828 -0.12609  0.02829  0.15471  0.46886 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.293e+00  2.934e-02 112.254  < 2e-16 ***
## syct        -4.510e-04  6.389e-05  -7.059 2.64e-11 ***
## mmin         1.278e-05  6.662e-06   1.919   0.0565 .  
## mmax         5.386e-05  2.341e-06  23.011  < 2e-16 ***
## cach         7.024e-03  5.090e-04  13.801  < 2e-16 ***
## chmin        4.808e-03  3.120e-03   1.541   0.1249    
## chmax       -1.598e-03  8.024e-04  -1.992   0.0478 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9455 
## F-statistic: 602.1 on 6 and 202 DF,  p-value: < 2.2e-16

Interpretations

-Intercept: If all predictors were zero, log(estperf) would be 3.293, i.e., exp(3.293) ≈ 26.9 on the original scale.

-syct: For each +1 ns of cycle time, log(estperf) changes by -4.51e-04, a roughly -0.045% change in estperf per additional ns. This reinforces that faster cycle times go with slightly higher estimated performance.

-mmin: For each +1 kB of minimum memory, log(estperf) increases by 1.278e-05, a roughly +0.0013% change per kB, or about +1.3% per additional MB of mmin.

-mmax: For each +1 kB of maximum memory, log(estperf) increases by 5.386e-05, a roughly +0.0054% change per kB, or about +5.5% per additional MB of mmax.

-cach: For each +1 kB of cache, log(estperf) increases by 7.024e-03, a roughly +0.7% change per kB.

-chmin: For each additional minimum channel, log(estperf) increases by 4.808e-03 (~+0.48%), though this coefficient is not significant (p = 0.125).

-chmax: For each additional maximum channel, log(estperf) decreases by 1.598e-03 (~-0.16%), a small negative effect that is only marginally significant (p = 0.048).
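
-The percent changes quoted above come from exponentiating each coefficient; a minimal sketch:

#Approximate percent change in estperf per 1-unit increase in each predictor
round(100 * (exp(coef(model_log)[-1]) - 1), 4)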

F.)

library(car)
## Loading required package: carData
vif(model_log)
##     syct     mmin     mmax     cach    chmin    chmax 
## 1.201644 2.902002 3.274002 1.858349 1.966234 1.891392

-None of these VIF values exceeds the 5-10 threshold that would indicate problematic multicollinearity. The mmax variable has the highest value (3.27), which suggests it is the most correlated with the other predictors, but it is still not high enough to be a concern. Since no severe multicollinearity is present, the model's coefficients are not inflated or unstable due to correlated predictors.

G.)

plot(model_log, which = 4:5)

-Interpretations:

From our plots we can observe three significant outliers (10, 157, 200). Observation 10 has the highest Cook's distance of the three, suggesting it would change the fit the most if it were omitted. Observations 157 and 200 also pass the 0.5 threshold, though not as dramatically as 10. We checked the data's structure to ensure these weren't input errors; nothing I found allows me to confidently remove the points from our data.
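
As a sketch, the flagged observations can also be pulled out programmatically rather than read off the plots:

#Flag observations past the 0.5 Cook's distance guideline
cd <- cooks.distance(model_log)
cd[cd > 0.5]
cpus[cd > 0.5, ]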

2.)

A.)

# Load package and data
library(MASS)
data("birthwt")

summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
# Convert 'low' 
birthwt$low <- factor(birthwt$low, levels = c(0,1), labels = c("No","Yes"))

# Boxplot: Age by Low Birthweight
boxplot(age ~ low, data=birthwt,
        main="Age by Low Birthweight",
        xlab="Low Birthweight?",
        ylab="Age (years)")

# Boxplot: Weight by Low Birthweight
boxplot(lwt ~ low, data=birthwt,
        main="Weight by Low Birthweight",
        xlab="Low Birthweight?",
        ylab="Weight at LMP (lbs)")

# Convert 'smoke' 
birthwt$smoke <- factor(birthwt$smoke, levels=c(0,1), labels=c("No","Yes"))

# Barplot: Smoking vs Low Birthweight
smoke_tab <- table(birthwt$smoke, birthwt$low)
barplot(smoke_tab, beside=TRUE, legend=TRUE,
        xlab="Low Birthweight?",
        ylab="Count",
        main="Smoking vs Low Birthweight")

?birthwt  # consult the help page for variable definitions

Interpretation:

-From our initial data exploration we can inspect the data's structure and see things like variable means and the total number of observations (189). The first boxplot compares mothers' ages between the normal and low birthweight groups; it suggests that younger mothers may be more likely to have babies with low birthweights, a relationship worth investigating further. The second boxplot examines the mother's weight at last menstrual period; it suggests that mothers with lower pre-pregnancy weights are more likely to have low-birthweight babies. I also investigated the smoking variable, as it is a common point of controversy in healthcare. The barplot suggests an association between smoking and low birthweight. These findings are in line with common risk factors in childbirth: young maternal age, lower maternal weight, and smoking.
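
-The smoking relationship can also be checked numerically; as a quick sketch, the row proportions of the table behind the barplot:

#Share of low-birthweight outcomes within each smoking group
prop.table(smoke_tab, margin = 1)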

B.)

data("birthwt")

birthwt$low   <- factor(birthwt$low,   levels = c(0,1), labels = c("No","Yes"))
birthwt$race  <- factor(birthwt$race,  levels = c(1,2,3), labels = c("white","black","other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0,1), labels = c("No","Yes"))
birthwt$ht    <- factor(birthwt$ht,    levels = c(0,1), labels = c("No","Yes"))
birthwt$ui    <- factor(birthwt$ui,    levels = c(0,1), labels = c("No","Yes"))

# Fit a logistic regression model
model_glm <- glm(
  low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
  data = birthwt,
  family = binomial
)

# View summary of the model
summary(model_glm)
## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.480623   1.196888   0.402  0.68801   
## age         -0.029549   0.037031  -0.798  0.42489   
## lwt         -0.015424   0.006919  -2.229  0.02580 * 
## raceblack    1.272260   0.527357   2.413  0.01584 * 
## raceother    0.880496   0.440778   1.998  0.04576 * 
## smokeYes     0.938846   0.402147   2.335  0.01957 * 
## ptl          0.543337   0.345403   1.573  0.11571   
## htYes        1.863303   0.697533   2.671  0.00756 **
## uiYes        0.767648   0.459318   1.671  0.09467 . 
## ftv          0.065302   0.172394   0.379  0.70484   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 201.28  on 179  degrees of freedom
## AIC: 221.28
## 
## Number of Fisher Scoring iterations: 4
#Residual vs Fitted
plot(model_glm, which = 1)

#Normal Q-Q Plot
plot(model_glm, which = 2)

-We have converted the variables that lend themselves to it into categorical factors. In the fitted model several predictors are significant, but most p-values sit near the 0.05 border rather than being exceptionally low. Being black, having hypertension, and smoking show the strongest associations with increased odds of a low birthweight. The residual deviance (201.28 on 179 df) is still fairly high relative to the null deviance (234.67 on 188 df). Below, I remove some variables via backward stepwise selection; the deviance barely changes, but the model now retains only the factors we found to be associated with low birthweight.

step_model <- step(model_glm, direction="backward", trace=FALSE)
summary(step_model)
## 
## Call:
## glm(formula = low ~ lwt + race + smoke + ptl + ht + ui, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.086550   0.951760  -0.091  0.92754   
## lwt         -0.015905   0.006855  -2.320  0.02033 * 
## raceblack    1.325719   0.522243   2.539  0.01113 * 
## raceother    0.897078   0.433881   2.068  0.03868 * 
## smokeYes     0.938727   0.398717   2.354  0.01855 * 
## ptl          0.503215   0.341231   1.475  0.14029   
## htYes        1.855042   0.695118   2.669  0.00762 **
## uiYes        0.785698   0.456441   1.721  0.08519 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 201.99  on 181  degrees of freedom
## AIC: 217.99
## 
## Number of Fisher Scoring iterations: 4
# Final logistic model (removing ptl, ui)
final_model <- glm(
  low ~ lwt + race + smoke + ht, 
  data = birthwt,
  family = binomial
)
summary(final_model)
## 
## Call:
## glm(formula = low ~ lwt + race + smoke + ht, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.352049   0.924438   0.381  0.70333   
## lwt         -0.017907   0.006799  -2.634  0.00844 **
## raceblack    1.287662   0.521648   2.468  0.01357 * 
## raceother    0.943645   0.423382   2.229  0.02583 * 
## smokeYes     1.071566   0.387517   2.765  0.00569 **
## htYes        1.749163   0.690820   2.532  0.01134 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 208.25  on 183  degrees of freedom
## AIC: 220.25
## 
## Number of Fisher Scoring iterations: 4
#Residual vs Fitted
plot(final_model, which = 1)

#Normal Q-Q Plot
plot(final_model, which = 2)
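
-As a sketch, the three candidates can be compared by AIC, and since final_model is nested in model_glm, a likelihood-ratio test can check whether the dropped terms mattered:

#Compare candidate models by AIC and a likelihood-ratio test
AIC(model_glm, step_model, final_model)
anova(final_model, model_glm, test = "Chisq")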

C.)

summary(final_model)
## 
## Call:
## glm(formula = low ~ lwt + race + smoke + ht, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.352049   0.924438   0.381  0.70333   
## lwt         -0.017907   0.006799  -2.634  0.00844 **
## raceblack    1.287662   0.521648   2.468  0.01357 * 
## raceother    0.943645   0.423382   2.229  0.02583 * 
## smokeYes     1.071566   0.387517   2.765  0.00569 **
## htYes        1.749163   0.690820   2.532  0.01134 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 208.25  on 183  degrees of freedom
## AIC: 220.25
## 
## Number of Fisher Scoring iterations: 4
# Exponentiate coefficients to get odds ratios
exp_coef <- exp(coef(final_model))
exp_coef
## (Intercept)         lwt   raceblack   raceother    smokeYes       htYes 
##   1.4219787   0.9822528   3.6243042   2.5693292   2.9199496   5.7497858
# Optional: confidence intervals for odds ratios
conf_int <- exp(confint(final_model))
## Waiting for profiling to be done...
conf_int
##                 2.5 %     97.5 %
## (Intercept) 0.2452984  9.3479425
## lwt         0.9683866  0.9946908
## raceblack   1.3074813 10.2836018
## raceother   1.1357960  6.0302269
## smokeYes    1.3869773  6.3939742
## htYes       1.5395790 24.4078437

Interpretation: ptl and ftv are numeric count variables. If they were entered as numeric predictors, the model would implicitly assume a linear effect on the log-odds of low birthweight for each 1-unit increase. Is that assumption realistic? Often it is NOT: going from 0 to 1 prior premature labor might have a much bigger impact than going from 2 to 3, and a third first-trimester visit could matter less than the first. In practice, these variables may exhibit threshold or other nonlinear effects. In the final model above, ptl (number of previous premature labors) and ftv (number of first-trimester physician visits) were omitted because they were highly insignificant when entered as raw counts.

D + E.)

-We will now create new variables. The new variable for ptl collapses the count into two levels, which better captures the idea that any prior premature labor is the main jump in risk. The ftv variable will be binned into three categories.

# ptl2: Two categories -> "0" if ptl=0, "1plus" if ptl>=1
birthwt$ptl2 <- ifelse(birthwt$ptl == 0, "0", "1plus")
birthwt$ptl2 <- factor(birthwt$ptl2, levels=c("0","1plus"))

table(birthwt$ptl, birthwt$ptl2)
##    
##       0 1plus
##   0 159     0
##   1   0    24
##   2   0     5
##   3   0     1
# ftv2: Three categories -> "0", "1to2", "3plus"
birthwt$ftv2 <- cut(
  birthwt$ftv,
  breaks=c(-0.1,0.9,2.9,99),
  labels=c("0","1to2","3plus"),
  right=TRUE
)
birthwt$ftv2 <- factor(birthwt$ftv2, levels=c("0","1to2","3plus"))

table(birthwt$ftv, birthwt$ftv2)
##    
##       0 1to2 3plus
##   0 100    0     0
##   1   0   47     0
##   2   0   30     0
##   3   0    0     7
##   4   0    0     4
##   6   0    0     1

F.)

# Fit a new model, adding ptl2 and ftv2:
model_glm_final <- glm(
  low ~ lwt + race + smoke + ht + ptl2 + ftv2, 
  data = birthwt,
  family = binomial
)
summary(model_glm_final)
## 
## Call:
## glm(formula = low ~ lwt + race + smoke + ht + ptl2 + ftv2, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.282455   0.977003   0.289  0.77250   
## lwt         -0.016713   0.006905  -2.420  0.01551 * 
## raceblack    1.220303   0.534237   2.284  0.02236 * 
## raceother    0.774835   0.449109   1.725  0.08448 . 
## smokeYes     0.771590   0.416443   1.853  0.06391 . 
## htYes        1.718945   0.715277   2.403  0.01625 * 
## ptl21plus    1.277156   0.452558   2.822  0.00477 **
## ftv21to2    -0.360098   0.389036  -0.926  0.35465   
## ftv23plus    0.391374   0.689491   0.568  0.57029   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 199.00  on 180  degrees of freedom
## AIC: 217
## 
## Number of Fisher Scoring iterations: 4

-The new model moderately changes the results. With the inclusion of the transformed variables we can now see significance that was not visible before the transformation: any prior preterm labor significantly increases the odds of low birthweight (ptl21plus, p = 0.005). Aside from slight changes to the other p-values, the deviance did not change much, and the slightly lower AIC (217 vs. 220.25) suggests a modestly better fit.
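
-Because the earlier final_model is nested inside this one, a likelihood-ratio test offers one way to confirm that adding ptl2 and ftv2 helps; a minimal sketch:

#LRT: does adding ptl2 and ftv2 improve on the model without them?
anova(final_model, model_glm_final, test = "Chisq")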

G.)

#Set seed 
set.seed(123)

#Train/Test Split 
n <- nrow(birthwt)
train_idx <- sample(seq_len(n), size = 0.8 * n)
train_data <- birthwt[train_idx, ]
test_data  <- birthwt[-train_idx, ]

#Logistic Model with ptl2, ftv2 on Training
model_glm_final <- glm(
  low ~ lwt + race + smoke + ht + ptl2 + ftv2,
  data = train_data,
  family = binomial
)
summary(model_glm_final)
## 
## Call:
## glm(formula = low ~ lwt + race + smoke + ht + ptl2 + ftv2, family = binomial, 
##     data = train_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.100900   1.174321  -0.086  0.93153   
## lwt         -0.014436   0.008302  -1.739  0.08206 . 
## raceblack    1.080805   0.632869   1.708  0.08768 . 
## raceother    1.125216   0.533566   2.109  0.03496 * 
## smokeYes     0.935765   0.495109   1.890  0.05876 . 
## htYes        1.539385   0.876670   1.756  0.07910 . 
## ptl21plus    1.315245   0.497726   2.643  0.00823 **
## ftv21to2    -0.664959   0.440970  -1.508  0.13157   
## ftv23plus    0.609648   0.845861   0.721  0.47107   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 188.83  on 150  degrees of freedom
## Residual deviance: 156.31  on 142  degrees of freedom
## AIC: 174.31
## 
## Number of Fisher Scoring iterations: 4
#LDA Model
model_lda_final <- lda(low ~ lwt + race + smoke + ht + ptl2 + ftv2, 
                       data = train_data)

#QDA Model
model_qda_final <- qda(low ~ lwt + race + smoke + ht + ptl2 + ftv2,
                       data = train_data)

#Predictions on Test Set

## Logistic
prob_glm <- predict(model_glm_final, newdata=test_data, type="response")
pred_glm <- ifelse(prob_glm > 0.5, "Yes", "No")

## LDA
pred_lda <- predict(model_lda_final, newdata=test_data)$class

## QDA
pred_qda <- predict(model_qda_final, newdata=test_data)$class

#Confusion Matrices
cm_glm <- table(Pred=pred_glm, True=test_data$low)
cm_lda <- table(Pred=pred_lda, True=test_data$low)
cm_qda <- table(Pred=pred_qda, True=test_data$low)

cm_glm
##      True
## Pred  No Yes
##   No  25   7
##   Yes  2   4
cm_lda
##      True
## Pred  No Yes
##   No  25   7
##   Yes  2   4
cm_qda
##      True
## Pred  No Yes
##   No  22   6
##   Yes  5   5
#Accuracy
acc_glm <- sum(diag(cm_glm)) / sum(cm_glm)
acc_lda <- sum(diag(cm_lda)) / sum(cm_lda)
acc_qda <- sum(diag(cm_qda)) / sum(cm_qda)

acc_glm
## [1] 0.7631579
acc_lda
## [1] 0.7631579
acc_qda
## [1] 0.7105263
#Sens/Spec

calc_sens_spec <- function(cm, positive="Yes") {
  if(!all(dim(cm)==c(2,2))) stop("Confusion matrix must be 2x2!")
  rowYes <- which(rownames(cm)==positive)  # predicted-positive row
  colYes <- which(colnames(cm)==positive)  # true-positive column
  TP <- cm[rowYes, colYes]    # predicted Yes, truly Yes
  FN <- cm[-rowYes, colYes]   # predicted No,  truly Yes
  FP <- cm[rowYes, -colYes]   # predicted Yes, truly No
  TN <- cm[-rowYes, -colYes]  # predicted No,  truly No
  sens <- TP / (TP + FN)
  spec <- TN / (TN + FP)
  list(sensitivity=sens, specificity=spec)
}

calc_sens_spec(cm_glm)
## $sensitivity
## [1] 0.3636364
## 
## $specificity
## [1] 0.9259259
calc_sens_spec(cm_lda)
## $sensitivity
## [1] 0.3636364
## 
## $specificity
## [1] 0.9259259
calc_sens_spec(cm_qda)
## $sensitivity
## [1] 0.4545455
## 
## $specificity
## [1] 0.8148148

Interpretations:

-In terms of overall accuracy, our logistic regression is just as good as the LDA model (76.3% each), while the QDA model falls somewhat short (71.1%). The sensitivity and specificity numbers show that all three models catch well under half of the true low-weight births at the 0.5 cutoff (sensitivity 0.36 for logistic/LDA, 0.45 for QDA), while specificity stays high (0.93 for logistic/LDA, 0.81 for QDA). QDA trades some specificity for slightly better sensitivity; if catching low-weight births is the priority, the classification threshold may matter more than the choice among these three models.
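
-As a sketch of that threshold trade-off, lowering the logistic cutoff from 0.5 to an illustrative 0.3 flags more births as high-risk at the cost of more false positives:

#Lower cutoff: trades specificity for sensitivity (0.3 chosen arbitrarily)
pred_glm_03 <- ifelse(prob_glm > 0.3, "Yes", "No")
table(Pred = pred_glm_03, True = test_data$low)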

H.)

#Final Model
model_glm_final <- glm(
  low ~ lwt + race + smoke + ht + ptl2 + ftv2, 
  data = birthwt,
  family = binomial
)
summary(model_glm_final)
## 
## Call:
## glm(formula = low ~ lwt + race + smoke + ht + ptl2 + ftv2, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.282455   0.977003   0.289  0.77250   
## lwt         -0.016713   0.006905  -2.420  0.01551 * 
## raceblack    1.220303   0.534237   2.284  0.02236 * 
## raceother    0.774835   0.449109   1.725  0.08448 . 
## smokeYes     0.771590   0.416443   1.853  0.06391 . 
## htYes        1.718945   0.715277   2.403  0.01625 * 
## ptl21plus    1.277156   0.452558   2.822  0.00477 **
## ftv21to2    -0.360098   0.389036  -0.926  0.35465   
## ftv23plus    0.391374   0.689491   0.568  0.57029   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 199.00  on 180  degrees of freedom
## AIC: 217
## 
## Number of Fisher Scoring iterations: 4

-Interpretation:

-lwt (-0.0167, p = 0.01551): Moderately significant; each additional pound of the mother's weight reduces the odds of low birthweight by about 1.7% (exp(-0.0167) ≈ 0.983).

-raceblack (1.2203, p = 0.02236): Moderately significant; it suggests that black mothers have about 3.4× the odds of low birthweight compared to white mothers.

-raceother (0.7748, p = 0.08448): Though not significant at the 0.05 level, it is worth noting the potential 2.17× increase in the odds of a low-weight birth for mothers who are neither white nor black.

-smoke (0.7716, p = 0.06391): Smokers have ~2.16× the odds of low birthweight compared to non-smokers, but the p-value of ~0.064 makes this borderline.

-ht (1.7189, p = 0.01625): Hypertension strongly raises the odds (over 5×). This is typically one of the most significant risk factors.

-ptl2 (1.2772, p = 0.00477): Having any previous premature labor multiplies the odds of low birthweight by about 3.6.

-Overall, hypertension and prior preterm labor appear to have the strongest relationships with low-weight childbirth.
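
-These odds ratios come from exponentiating the coefficients; a minimal sketch that also attaches Wald confidence intervals (confint.default, as opposed to the profile intervals computed earlier):

#Odds ratios with Wald confidence intervals for the final logistic model
exp(cbind(OR = coef(model_glm_final), confint.default(model_glm_final)))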