Statistical Learning Take Home Exam1

Please hand this result in on Canvas no later than 11:59pm on Wednesday, March 12! Do not work in groups!

Consider the data from R package . We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable ). Do not use published performance as a predictor of performance in this problem.

Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.

library(MASS) 
library(ggplot2) 
data("cpus") #load the data 
head(cpus)

##             name syct mmin  mmax cach chmin chmax perf estperf
## 1  ADVISOR 32/60  125  256  6000  256    16   128  198     199
## 2  AMDAHL 470V/7   29 8000 32000   32     8    32  269     253
## 3  AMDAHL 470/7A   29 8000 32000   32     8    32  220     253
## 4 AMDAHL 470V/7B   29 8000 32000   32     8    32  172     253
## 5 AMDAHL 470V/7C   29 8000 16000   32     8    16  132     132
## 6  AMDAHL 470V/8   26 8000 32000   64     8    32  318     290

#build a correlation matrix to see correlations between estimated performance (estperf) and other variables
cpus = na.omit(cpus) #drop NA values from dataset

corr = cor(cpus[, c("syct", "mmin", "mmax", "cach", "chmin", "chmax", "perf", "estperf")])
corr["estperf",]

##       syct       mmin       mmax       cach      chmin      chmax       perf 
## -0.2883956  0.8192915  0.9012024  0.6486203  0.6105802  0.5921556  0.9664687 
##    estperf 
##  1.0000000

#scatterplot of published vs syct (cycle time in nanoseconds)
ggplot(cpus, aes(x = syct, y = estperf)) + geom_point()

#scatterplot between mmin and estperf
ggplot(cpus, aes(x = mmin, y = estperf)) + geom_point()

#scatterplot between mmax and estperf

ggplot(cpus, aes(x = mmax, y = estperf)) + geom_point()

#Scatterplot between cach and estperf

ggplot(cpus, aes(x = cach, y = estperf)) + geom_point()

#scatterplot between chmin and estperf

ggplot(cpus, aes(x = chmin, y = estperf)) + geom_point()

#scatterplot between chmax and estperf
ggplot(cpus, aes(x = chmax, y = estperf)) + geom_point()

#scatterplot between per and estperf

ggplot(cpus, aes(x = perf, y = estperf)) + geom_point()

Using the correlation matrix and scatterplots to assess the relationship between ‘estperf’ and the other variables in this dataset:

The correlation between ‘estperf’ and ‘syct’ is -0.288, implying that there is a negative correlation between the two variables (as the cycle time increases by 1 nanosecond, the estimated performance decreases by ~ 0.288). However, this factor is relatively small, so I would say that the relationship between these two variables is relatively week overall since -0.288 is close to 0 (0 = no correlation). Looking at the scatterplot between ‘syct’ and ‘estperf’, it’s also apparent that there is no semblance of a relationship between the two variables. The points on the graph have no pattern, which makes sense because of the weak correlation between the two variables.
The correlation between ‘estperf’ and ‘mmin’ is 0.819, which implies a strong positive correlation between the two variables - as mmin (minimum main memory in KB) increases by 1 KB, the estimated performance increases by ~ 0.82. This positive linear relationship is pretty visible in the scatterplot showcasing the relationship between estperf and mmin - there is a general positive linear theme to the graph with only a few outliers.
The correlation between ‘estperf’ and ‘mmax’ is 0.901, which again implies a strong positive correlation between these two variables. So as mmax (maximum main memory in KB) increases by 1KB, the estimated performance increases by ~0.90. Looking at the scatterplot, this is again confirmed - the relationship looks pretty linear with only a few outliers. Based on the fact that mmin and mmax are highly positively correlated with estperf, I feel comfortable concluding that the main memory and the performance of a CPU are related.
The correlation between ‘estperf’ and ‘cach’ is 0.649, which is a somewhat strong positive correlation, although not as strong as mmin and mmax. As the cache size, in KB, increases by 1KB, the estimated performance increases by ~0.65. The scatterplot of these two variables starts out displaying a somewhat linear trend towards the left-hand side of the graph, but the values fan out as the cache size increases, which says to me that maybe another related factor might be the cause for the increasing estimated performance.
The correlation between ‘estperf’ and ‘chmin’ is 0.611, which, like ‘cach’, is a relatively strong positive correlation, but not as strong as between ‘estperf’ and some other variables. As the minimum number of channels increases by 1, so does the estimated performance by 0.61. The scatterplot of these two variables doesn’t show a very promising linear relationship, however, as the values are stacked on top of one another on the left-hand side of the graph and fanned out as the minimum channel number increases.
The correlation between ‘estperf’ and ‘chmax’ is 0.592, which like ‘chmin’ and ‘cach’ is a positive correlation, but not very strong. As the maximum number of channels increases (by a factor of 1), the estimated performance increases by 0.592. Looking at the scatterplot of these two variables, the same thing that was true for ‘chmin’ is true here: there is not much visual evidence of a positive relationship. The values have no apparent pattern, and are stacked atop each other on the left side and fanned out on the right.
The correlation between ‘estperf’ and ‘perf’ is 0.966, which is almost 1, meaning this relationship is a very strong positive correlation - as the published performance increases by one unit, the estimated performance increases by 0.966. The scatterplot of these two variables showcases this strong positive correlation, with a very apparent positive linear relationship. However, this is to be expected since both variables are concerned with ‘performance’ and I would assume that the estimated performance is, in some capacity, based on the recorded performance. In the rest of this analysis, the ‘perf’ variable will not be used.

Use either methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider in this modeling approach. Explain the process used to arrive at your final model.
```
cpus_model = lm(estperf ~ syct + mmin + mmax + cach + chmax, data = cpus)

summary(cpus_model)
```
```
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -136.633  -23.149    1.574   22.957  287.869 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.660e+01  6.257e+00 -10.643  < 2e-16 ***
## syct         6.613e-02  1.365e-02   4.846 2.50e-06 ***
## mmin         1.424e-02  1.397e-03  10.188  < 2e-16 ***
## mmax         6.584e-03  4.999e-04  13.171  < 2e-16 ***
## cach         4.871e-01  1.050e-01   4.639 6.28e-06 ***
## chmax        1.187e+00  1.623e-01   7.314 5.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.78 on 203 degrees of freedom
## Multiple R-squared:  0.9108, Adjusted R-squared:  0.9086 
## F-statistic: 414.8 on 5 and 203 DF,  p-value: < 2.2e-16
```
My final model used the variables ‘syct’, ‘mmin’, ‘mmax’, ‘cach’, and ‘chmax’ as predictors for ‘estperf’. Per request of question (b), the variable ‘name’ is not included, and ‘perf’ is not used as a predictor as was requested in the initial instructions for this question. Initially, my model also included ‘chmin’, but after running a summary, I found that it was the only predictor that wasn’t statistically significant (a p-value of 0.797). The adjusted R-Squared value also increased slightly (from 0.9082 to 0.9086) with the removal of ‘chmin’. The fact that all of the included variables are statistically significant and the R-Sqaured values are ~ 90% indicate that this linear regression model is a strong fit, and this particular one is the best representation of the predictors and ‘estperf’.
Create a residual plot using this model and comment on it’s features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
```
plot(cpus_model)
```
Based on the 4 residual plots made (Residuals vs Fitted, Q-Q Plot, Scale-Location, and Residuals vs Leverage), I believe that this model does violate some of the assumptions of linear regression, primarily the normality assumption and the homoscedasticity assumption. Specifically looking at the Scale-Location graph, the red line and the points show a general positive trend, which implies that the residuals are not normalized across the model. Additionally, looking at the Q-Q Plot, it’s apparent that for the most part the residuals fit normally, they vary greatly at either end of the plot, implying that the distribution is not entirely normal. Finally, there are clearly outliers present in this model/data based on the four generated graphs, specifically observations 199, 200, and 10, which are highlighted in every graph.

In order to adjust my model, I think that transforming the estperf variable might be effective as it would address the issue of the model violating the homoscedasticity assumption.

#trying a log transformation of 'estperf' in my modeland re-generating the 4 plots

log_model = lm((log(estperf)) ~ syct + mmin + mmax + cach + chmax, data = cpus)

plot(log_model)

Looking at these 4 new residual plots using my new model with the ‘estperf’ variable logarithmically transformed, it appears that there was not much of a change in the violation of the homoscedasticity assumption, as the red trend line is still not flat and is gradually increasing as you move to the right of the graph, implying that the log function didn’t fix this violation. The Q-Q Plot also doesn’t show any significant improvement, with the residuals on the tail ends varying from the average. Additionally, the same 3 outlier points from my initial model are still causing problems/being shown as high leverage points in these graphs.

#trying one last thing: removing the three 'high leverage' observations from my model (10, 9, 200, 157, and 199)

clean = cpus[-c(10, 9, 200, 157, 199),]

clean_model = lm((log(estperf)) ~ syct + mmin + mmax + cach + chmax, data = clean)

plot(clean_model)

Using the data free from the outliers initially discovered in my first two models/residual plots made somewhat of an improvement in my opinion - the red trend line on the Residuals vs. Fitted plot is a lot less wavy/varied, although it still does not fit to the dotted grey line at all. The Q-Q Plot didn’t show much improvement, as it hasn’t in any of the updated models I’ve created. However, looking at the Scale-Location plot and the Residuals vs. Leverage plot, the red trend line definitely has flattened out with the removal of the initially identified outliers. Despite this, new outliers/high-leverage points have been identified. I think that my last/most recent model is the best fit, but the outliers could be better assessed, or there might be another way that the ‘estperf’ variable could be transformed to better fit the other variables to create a model with a better fit.

d. How well does the final model fit the data? Comment on some model fit criteria from the model built in c)

summary(clean_model)

## 
## Call:
## lm(formula = (log(estperf)) ~ syct + mmin + mmax + cach + chmax, 
##     data = clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46999 -0.11111  0.02302  0.10538  0.38234 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.139e+00  2.320e-02 135.301  < 2e-16 ***
## syct        -3.014e-04  4.566e-05  -6.601 3.66e-10 ***
## mmin         3.398e-05  5.350e-06   6.352 1.43e-09 ***
## mmax         5.983e-05  1.827e-06  32.745  < 2e-16 ***
## cach         7.093e-03  3.783e-04  18.748  < 2e-16 ***
## chmax        9.444e-04  6.071e-04   1.556    0.121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1524 on 198 degrees of freedom
## Multiple R-squared:  0.9681, Adjusted R-squared:  0.9673 
## F-statistic:  1203 on 5 and 198 DF,  p-value: < 2.2e-16

The final model fits the data pretty well. Compared to the original model, the chmax variable is no longer statistically significant (as it was in the original model - all variables included were). However, the R-squared value on the final model (with the log transformation and the cleaning of outliers) is higher than that of the original model (~96% on the new model versus 91% on the original model), meaning that this cleaned/transformed model is a better fit on the data, and explains more of the variation in ‘estperf’ than my original model did. Additionally, the F-statistic is relatively high in this cleaned/transformed model (1203) compared to the original model (414.8), though the p-value is the same. This higher F-stat, coupled with the equally small p-value, implies that the cleaned/transformed model is statistically significant and does a better job at predicting the ‘estperf’ model than my original model despite one of the variables (‘chmax’) not being statistically significant in the new model.

e. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)

Intercept: 3.139. The intercept is the estimate of what log(estperf) would be when every other predictor in the model is held constant at 0. This is not exactly a meaningful value by itself, as it is unrealistic for most of the other variables in this model to ever be 0. I am unsure of what units log(estperf) would be given in, as no units were given on the cpus dataset documentation for the estperf variable.
syct: -3.014e-04. This intercept for cycle time says that for every 1-nanosecond increase in cycle time, the value of log(estperf) will decrease by 3.014e-04 units, while every other variable is held constant. Taking into consideration the fact that my model uses log(estperf) and not just estperf, the translation would be that for every 1-nanosecond increase in cycle time, the estperf would be multiplied by 0.9997 (taking exp(-3.014e-04)).
mmin: 3.395e-05. This intercept for minimum main memory says that for every 1-KB increase in minimum main memory, the value of log(estperf) will increase by 3.395e-05 units, every other variable being held constant. Taking into consideration the log transformation that has occurred in my model, for every 1 KB increase in mmin, the true estperf value would be multiplied by a factor of around 1. However, because this seems like it’s doing nothing in the equation, then, imagining the mmin in MB would show that for every 1MB increase in mmin (1024KB), the true estperf value would increase by around 3.5%.
mmax: 5.983e-05. This intercept for maximum main memory says that for every 1-KB increase in maximum main memory, the value of log(estperf) will increase by 5.983e-05 units, holding every other variable in the model constant. As I did in my analysis for mmin’s intercept, taking into account the log transformation, for every 1 KB increase in mmax, the true estperf value would again be multiplied by around a factor of 1, just a little above. This means that, imagining mmax in MB instead of KB, for every 1MB increase in mmax, the true estperf value would increase by around 6%.
cach: 7.093e-03. This intercept for cache means that for every 1 KB increase in the cache size, the value of log(estperf) will increase by 7.093e-03 units while all other variables are held constant. Because my variable estperf is log-transformed in this model, that means that for every 1 KB increase in cache size, the true estperf value will increase by a factor of around 1, of 1.00712. Looking at this in terms of MB to get a better representation, for every 1 MB increase in cache size, the estimated performance would increase by a factor of around 1230, which is an incredibly large increase compared to the other variables.
chmax: 9.444e-04. This intercept for chmax, or maximum number of channels, says that for every 1-channel increase in the maximum number of channels, the log(estperf) value will increase by a factor of 9.444e-04. Converting this to represent the true value of estperf, this means that for every 1-channel increase in the maximum number of channels, the value of estperf will increase by a factor of 1.000945, or 0.0945%.

f. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.

To assess multicollinearity between predictors in my model, I chose to calculate VIFs.

library(car)

## Loading required package: carData

vif(clean_model)

##     syct     mmin     mmax     cach    chmax 
## 1.250982 2.295930 2.450264 1.609008 1.393809

All of these VIF scores for my final model indicate that there is no multicollinearity, as a VIF score below 5 is usually seen as ‘good’, or no multicollinear. This implies that these coefficients are good in the model, and are not skewing it in any way. Because the VIF scores are so low, there is no action that needs to be taken here.

g. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.

Because I’ve already made some residual plots that discuss outliers/influential observations, I chose to calculate some indices that would further justify my answer.

I started by calculating Cook’s D, which is a way to calculate influence on a model.

cooks_dist = cooks.distance(clean_model)

cooks_dist

##            1            2            3            4            5            6 
## 7.971881e-02 4.894154e-04 4.894154e-04 4.894154e-04 1.658968e-02 4.321608e-03 
##            7            8           11           12           13           14 
## 1.894834e-02 1.894834e-02 7.205744e-04 6.927192e-04 2.267221e-03 8.281804e-04 
##           15           16           17           18           19           20 
## 9.718355e-03 1.055740e-03 2.181600e-03 7.112512e-04 2.993073e-03 3.029138e-02 
##           21           22           23           24           25           26 
## 1.091103e-07 5.618754e-04 3.237095e-03 1.389289e-02 4.653070e-04 2.014168e-03 
##           27           28           29           30           31           32 
## 4.548499e-04 2.022378e-03 2.160842e-04 4.881519e-04 9.150915e-02 9.150915e-02 
##           33           34           35           36           37           38 
## 2.884952e-03 2.884952e-03 1.674338e-03 1.383015e-03 4.159333e-03 8.343650e-04 
##           39           40           41           42           43           44 
## 2.048984e-04 3.317341e-04 2.027232e-05 1.090116e-03 1.041045e-03 1.041045e-03 
##           45           46           47           48           49           50 
## 3.108017e-04 4.056381e-05 3.092140e-03 2.588650e-03 8.890847e-06 2.181273e-07 
##           51           52           53           54           55           56 
## 2.200003e-03 2.198288e-03 8.452814e-04 6.978808e-06 8.057133e-04 2.942519e-04 
##           57           58           59           60           61           62 
## 9.130214e-05 4.949352e-03 4.949352e-03 4.949352e-03 4.949352e-03 4.949352e-03 
##           63           64           65           66           67           68 
## 7.692913e-03 3.891510e-03 1.217026e-03 4.541794e-04 1.028488e-02 6.252920e-03 
##           69           70           71           72           73           74 
## 5.453067e-04 4.399360e-06 8.799166e-04 8.616169e-03 6.375433e-03 1.875719e-03 
##           75           76           77           78           79           80 
## 8.704567e-04 7.901257e-04 4.175629e-04 4.636948e-04 7.990296e-03 5.120191e-04 
##           81           82           83           84           85           86 
## 9.893844e-04 6.213793e-04 1.218576e-02 3.032611e-03 6.014061e-04 5.184297e-04 
##           87           88           89           90           91           92 
## 1.698620e-04 1.854256e-05 1.589074e-02 3.014309e-03 1.623972e-02 1.623972e-02 
##           93           94           95           96           97           98 
## 1.854256e-05 2.886243e-03 9.216655e-05 3.331992e-02 4.593814e-02 1.182355e-03 
##           99          100          101          102          103          104 
## 3.729997e-02 1.026874e-02 3.014000e-03 1.618980e-04 3.428123e-04 9.018540e-04 
##          105          106          107          108          109          110 
## 1.539365e-03 2.194745e-06 3.435355e-03 1.740545e-03 4.071389e-03 2.371680e-03 
##          111          112          113          114          115          116 
## 5.603895e-03 7.756365e-03 3.589826e-06 3.589826e-06 9.464006e-05 1.085367e-03 
##          117          118          119          120          121          122 
## 1.054906e-03 7.051297e-04 9.155860e-04 1.669259e-03 9.023958e-03 5.686226e-03 
##          123          124          125          126          127          128 
## 1.771739e-02 3.425327e-02 5.509885e-04 1.416522e-04 1.927424e-04 1.927424e-04 
##          129          130          131          132          133          134 
## 7.873028e-04 7.873028e-04 4.621530e-03 2.143560e-05 7.656564e-05 7.656564e-05 
##          135          136          137          138          139          140 
## 1.729251e-03 1.854754e-03 1.854754e-03 1.698261e-03 2.425223e-04 3.620558e-04 
##          141          142          143          144          145          146 
## 4.850600e-04 4.850600e-04 3.036927e-05 1.213483e-03 8.960220e-04 8.568292e-04 
##          147          148          149          150          151          152 
## 1.027146e-03 1.252178e-03 1.212513e-03 4.659152e-04 1.181711e-02 1.102219e-02 
##          153          154          155          156          158          159 
## 9.619517e-03 2.587835e-01 7.331918e-04 1.068034e-02 7.607999e-03 5.682083e-03 
##          160          161          162          163          164          165 
## 2.046136e-03 2.372655e-07 1.071450e-04 1.322358e-03 1.906570e-03 1.819424e-03 
##          166          167          168          169          170          171 
## 1.894308e-02 1.894308e-02 4.553806e-03 1.145658e-01 3.422975e-03 3.084048e-03 
##          172          173          174          175          176          177 
## 1.285282e-03 2.520307e-03 1.381900e-03 1.381900e-03 1.185021e-03 1.642450e-03 
##          178          179          180          181          182          183 
## 5.491611e-03 2.995055e-04 4.838045e-05 4.534725e-04 5.811235e-03 2.276836e-03 
##          184          185          186          187          188          189 
## 1.758983e-04 1.602384e-04 2.427468e-04 1.231990e-03 2.134571e-03 1.986226e-03 
##          190          191          192          193          194          195 
## 6.794543e-03 2.913073e-03 8.185000e-03 4.639300e-02 9.359615e-04 5.697509e-03 
##          196          197          198          201          202          203 
## 3.013216e-03 3.419496e-02 3.288438e-02 1.961712e-03 2.091749e-03 1.961712e-03 
##          204          205          206          207          208          209 
## 2.091749e-03 8.863780e-06 2.631124e-04 1.718192e-04 2.609550e-03 2.102238e-04

Using one rule of thumb, which says that a point with a Cook’s D value over 1 is highly influential, between 0.5 and 1 is moderately influential, and below 0.5 is not influential, would show that none of these points are influential in the model. However, I don’t think that is true, simply because I previously graphed some residual plots that had ‘high leverage’ or outliers on them.

Using another rule of thumb, 4/N, where N is the # of observations, shows a different story.

threshold = 4/204 #204 observations in my cooks.distance function
threshold

## [1] 0.01960784

This implied that our threshold for an outlier or high-leverage point is 0.01960784, or 1.97e-02. Now, there are a handful of high-leverage points/outliers that still remain in my model. I put them into a table for easier analysis:

cooks_d = data.frame(Observation = 1:length(cooks_dist), Cooks_Distance = cooks_dist)

influential = which(cooks_dist > threshold)

influential

##   1  20  31  32  96  97  99 124 154 169 193 197 198 
##   1  18  29  30  94  95  97 122 152 166 190 194 195

From this, we can see that the columns, or observations, in my dataset that are outliers/high-leverage points based on Cook’s D are columns:

This makes sense, as some of these observations were present in my earlier residual graphs, like 31, 169, 154, and 166. Removing all of these points from my model is an option, but because my model otherwise has a very good fit and no multicollinearity, I don’t think that is entirely necessary so make the model ‘better’.

Consider the data from R package . We will investigate the relationship between low birthweight and the predictors in the data using logistic regression and discriminant analysis.

a. Investigate the relationship between variables in the dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.

data("birthwt")
head(birthwt)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

#correlation between  variables in the 'birthwt' dataset

bw_corr = cor(birthwt[, c("low", "age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv", "bwt")])

bw_corr

##               low         age         lwt         race       smoke          ptl
## low    1.00000000 -0.11893928 -0.16962694  0.137792751  0.16140431  0.196087267
## age   -0.11893928  1.00000000  0.18007315 -0.172817953 -0.04434618  0.071606386
## lwt   -0.16962694  0.18007315  1.00000000 -0.165048544 -0.04417908 -0.140029003
## race   0.13779275 -0.17281795 -0.16504854  1.000000000 -0.33903074  0.007951293
## smoke  0.16140431 -0.04434618 -0.04417908 -0.339030745  1.00000000  0.187557063
## ptl    0.19608727  0.07160639 -0.14002900  0.007951293  0.18755706  1.000000000
## ht     0.15237025 -0.01583700  0.23636040  0.019929917  0.01340704 -0.015399579
## ui     0.16904283 -0.07515558 -0.15276317  0.053602088  0.06215900  0.227585340
## ftv   -0.06296026  0.21539394  0.14052746 -0.098336254 -0.02801314 -0.044429660
## bwt   -0.78480516  0.09031781  0.18573328 -0.194713487 -0.19044806 -0.154653390
##                ht          ui         ftv         bwt
## low    0.15237025  0.16904283 -0.06296026 -0.78480516
## age   -0.01583700 -0.07515558  0.21539394  0.09031781
## lwt    0.23636040 -0.15276317  0.14052746  0.18573328
## race   0.01992992  0.05360209 -0.09833625 -0.19471349
## smoke  0.01340704  0.06215900 -0.02801314 -0.19044806
## ptl   -0.01539958  0.22758534 -0.04442966 -0.15465339
## ht     1.00000000 -0.10858506 -0.07237255 -0.14598189
## ui    -0.10858506  1.00000000 -0.05952341 -0.28392741
## ftv   -0.07237255 -0.05952341  1.00000000  0.05831777
## bwt   -0.14598189 -0.28392741  0.05831777  1.00000000

Looking at the correlation matrix of the variables, not many of them seem to be strong correlated, as most are around or below 0.1 or -0.1. However, the correlation between ‘low’, which is a binary variable that indicates whether or not the birthweight (bwt) was less than 2.5kg, and the ‘bwt’ variable is very strongly negatively correlated (-0.78), which makes sense considering that they are both variables relating to weight at birth.

#visualizing a scatterplot between 'bwt' and 'age' (mother's age)

ggplot(birthwt, aes(x = age, y = bwt)) +geom_point()

Looking at the scatterplot between ‘bwt’ and ‘age’, it does not appear that there is any trend between the mother’s age and the weight of the child. There seems to be a high concentration of mothers between 20 and 25, but the weight of their children doesn’t follow any specific trend. There appears to be one outlier - a 45-year-old mother had a child that weighed 5000 grams, or 5kg, or ~11 lbs.

#visualizing a scatterplot between 'bwt' and 'lwt' (mothers weight at last menstrual cycle in lbs)

ggplot(birthwt, aes(x = lwt, y = bwt)) +geom_point()

Again, here, there isn’t a very clear trend between bwt and and lwt, likely because a child’s birthweight is more so dependent on genetics and environmental factors as well as mother’s health, but not particularly mother’s weight. It does seem like children of mothers who have a higher weight typically weight more at birth, but there are also children who weigh 5000 g at birth from mothers who weigh 125lbs, so again, there is no real trend here.

#visualizing a barplot for count of 'low' (whether or not the birthweight was below 2.5kg, 0=no, 1=yes)

ggplot(birthwt, aes(x = low)) +geom_bar()

Based on this bar plot, it appears that most children are born weighing above 2.5kg, which is ~5.5lbs. Around 125 babies in this dataset are born above 2.5kg, and around half of that is how many babies were born under 2.5kg.

#visualizing a barplot for counts of 'race' (race of mother: 1=white, 2=black, 3=other)

ggplot(birthwt, aes(x = race)) +geom_bar()

This bar plot shows that the majority of mothers in the dataset are white, and then ‘other’, and then black. I think it would be beneficial to have broken the ‘other’ category into more races since it’s such a large portion of the dataset and would give a better picture of the demographic of mothers included in the dataset, especially because there can be disparities within that whole category that might be indicative of one race but not the other.

#visualizing a barplot for count of 'smoke' (whether or not the mother smoked during pregnancy: 0 = no, 1= yes)

ggplot(birthwt, aes(x = smoke)) +geom_bar()

It appears that most of the mothers in this dataset did not smoke during their pregnancy, but a large minority did smoke during their pregnancy. This might correlate to a lower birthweight, as we previously discussed that mother’s health plays a role in determining birthweight.

#visualizing a barplot for count of 'ptl' (number of premature labors)

ggplot(birthwt, aes(x = ptl)) +geom_bar()

From this, we can see that most mothers have not ever had a premature labor before, and a very small amount of mothers have had 3 premature labors before – almost a negligible amount compared to those who have had 0-2.

#visualizing a barplot for count of 'ht' (whether or not the mother has a history of hypertension, 0=no, 1=yes)

ggplot(birthwt, aes(x = ht)) +geom_bar()

Here we can see that a large majority of mothers do not have a history of hypertension, with only a handful (~12 or 13) of mother’s do. Those with a history of hypertension might have children with a lower birthweight, again, due to birthweight and mother’s health being related.

#visualizing a barplot for count of 'ui' (whether or not the mother has a history of uterine irritability, 0=no, 1=yes)

ggplot(birthwt, aes(x = ui)) +geom_bar()

Again here, we can see that most mothers do not have a history of uterine irritability, which is a condition that causes frequent contractions during pregnancy and could, in very rare cases, cause premature labor. However, only around 25 women in this dataset have a history of this.

#visualizing a barplot for count of 'ftv' (# of physicians visits during the first trimester)

ggplot(birthwt, aes(x = ftv)) +geom_bar()

A large number of women did not visit a physician at all during their first trimester; this could be because they weren’t aware that they were pregnant, or potentially because they lacked the resources to get to a doctor during that time or at all during their pregnancy. There are also a few women who had 4+ visits to their physician during the first trimester of their pregnancy, which could indicate a history or presence of complicated pregnancies or an issue with their child.

b. Fit a logistic regression model using methods discussed in class/the book, similar to as in problem 1). Be careful to understand each variable in to avoid including variables that are not logically acceptable for inclusion in the model.

#fit logistic regression model predicting whether or not a child's weight will be below 2.5kg or not using variables relating to mother's health

bw_lr = glm(low ~ age + lwt + race + smoke + ht + ui, family = "binomial", data = birthwt)

summary(bw_lr)

## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ht + ui, family = "binomial", 
##     data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.137148   1.270543  -0.108  0.91404   
## age         -0.024657   0.034772  -0.709  0.47825   
## lwt         -0.013268   0.006539  -2.029  0.04247 * 
## race         0.464659   0.212688   2.185  0.02891 * 
## smoke        1.030252   0.390802   2.636  0.00838 **
## ht           1.824340   0.683039   2.671  0.00756 **
## ui           0.859435   0.451075   1.905  0.05674 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 206.83  on 182  degrees of freedom
## AIC: 220.83
## 
## Number of Fisher Scoring iterations: 4

Even though most of my visualizations are comparing ‘bwt’ and other variables, it is not possible to build a logistic regression model with ‘bwt’ as the outcome because it is continuous and logistic regression models predict a binary outcome. As a solution to this, I selected ‘low’ as the predictor, because it is strongly correlated with ‘bwt’, so we can use my variables to predict whether or not a child will be born weighing more or less than 2.5kg.

For my model, I chose to use the variables of age, lwt, race, smoke, ht, and ui for predictors. These predictors are related to maternal health, and thus can be a good predictor of the child’s birthweight. I chose not to include ptl because while it might be a logical explanation for why a child is at a lower weight, it’s not directly related to the mother’s health, and a small number of women in the dataset have ever had a premature labor before. I opted to not include ftv as well because, again, this is not a variable that is indicative of the mother’s health. Finally, I obviously did not include the ‘bwt’ variable because this directly tells us whether or not the child’s weight is below/above 2.5kg, so it shouldn’t be used as a factor for prediction.

c. What do you notice regarding the variables and . What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?

I did not include ptl and ftv in my final model, one because they were not statistically significant, and more importantly because I chose to focus my model on maternal health. However, I noticed that specifically ‘ptl’ had a high-ish intercept, at 0.542087, whereas ftv had a smaller intercept, but both were positive, implying an increase of the log odds of low birth weight as these variables increased. The assumption that an increase of premature labor would increase the log odds of low birth weight since a premature birth means not enough time for the child to fully develop. However, realistically, more physician visits doesn’t really have any bearing on the weight of the child; as we’ve already discussed, the birthweight of a child is primarily due to genetics, environment, and maternal health. It’s possible that a physician visit might expose ailments in a mother’s health that could then be fixed and have a positive impact on the birthweight, but more doesn’t equal better in this case.

d. Create a new variable for named which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.

#creating ptl2, where 0 = no premature labors and 1 = yes

birthwt$ptl2 = ifelse(birthwt$ptl > 0, 1, 0)

birthwt

##     low age lwt race smoke ptl ht ui ftv  bwt ptl2
## 85    0  19 182    2     0   0  0  1   0 2523    0
## 86    0  33 155    3     0   0  0  0   3 2551    0
## 87    0  20 105    1     1   0  0  0   1 2557    0
## 88    0  21 108    1     1   0  0  1   2 2594    0
## 89    0  18 107    1     1   0  0  1   0 2600    0
## 91    0  21 124    3     0   0  0  0   0 2622    0
## 92    0  22 118    1     0   0  0  0   1 2637    0
## 93    0  17 103    3     0   0  0  0   1 2637    0
## 94    0  29 123    1     1   0  0  0   1 2663    0
## 95    0  26 113    1     1   0  0  0   0 2665    0
## 96    0  19  95    3     0   0  0  0   0 2722    0
## 97    0  19 150    3     0   0  0  0   1 2733    0
## 98    0  22  95    3     0   0  1  0   0 2751    0
## 99    0  30 107    3     0   1  0  1   2 2750    1
## 100   0  18 100    1     1   0  0  0   0 2769    0
## 101   0  18 100    1     1   0  0  0   0 2769    0
## 102   0  15  98    2     0   0  0  0   0 2778    0
## 103   0  25 118    1     1   0  0  0   3 2782    0
## 104   0  20 120    3     0   0  0  1   0 2807    0
## 105   0  28 120    1     1   0  0  0   1 2821    0
## 106   0  32 121    3     0   0  0  0   2 2835    0
## 107   0  31 100    1     0   0  0  1   3 2835    0
## 108   0  36 202    1     0   0  0  0   1 2836    0
## 109   0  28 120    3     0   0  0  0   0 2863    0
## 111   0  25 120    3     0   0  0  1   2 2877    0
## 112   0  28 167    1     0   0  0  0   0 2877    0
## 113   0  17 122    1     1   0  0  0   0 2906    0
## 114   0  29 150    1     0   0  0  0   2 2920    0
## 115   0  26 168    2     1   0  0  0   0 2920    0
## 116   0  17 113    2     0   0  0  0   1 2920    0
## 117   0  17 113    2     0   0  0  0   1 2920    0
## 118   0  24  90    1     1   1  0  0   1 2948    1
## 119   0  35 121    2     1   1  0  0   1 2948    1
## 120   0  25 155    1     0   0  0  0   1 2977    0
## 121   0  25 125    2     0   0  0  0   0 2977    0
## 123   0  29 140    1     1   0  0  0   2 2977    0
## 124   0  19 138    1     1   0  0  0   2 2977    0
## 125   0  27 124    1     1   0  0  0   0 2922    0
## 126   0  31 215    1     1   0  0  0   2 3005    0
## 127   0  33 109    1     1   0  0  0   1 3033    0
## 128   0  21 185    2     1   0  0  0   2 3042    0
## 129   0  19 189    1     0   0  0  0   2 3062    0
## 130   0  23 130    2     0   0  0  0   1 3062    0
## 131   0  21 160    1     0   0  0  0   0 3062    0
## 132   0  18  90    1     1   0  0  1   0 3062    0
## 133   0  18  90    1     1   0  0  1   0 3062    0
## 134   0  32 132    1     0   0  0  0   4 3080    0
## 135   0  19 132    3     0   0  0  0   0 3090    0
## 136   0  24 115    1     0   0  0  0   2 3090    0
## 137   0  22  85    3     1   0  0  0   0 3090    0
## 138   0  22 120    1     0   0  1  0   1 3100    0
## 139   0  23 128    3     0   0  0  0   0 3104    0
## 140   0  22 130    1     1   0  0  0   0 3132    0
## 141   0  30  95    1     1   0  0  0   2 3147    0
## 142   0  19 115    3     0   0  0  0   0 3175    0
## 143   0  16 110    3     0   0  0  0   0 3175    0
## 144   0  21 110    3     1   0  0  1   0 3203    0
## 145   0  30 153    3     0   0  0  0   0 3203    0
## 146   0  20 103    3     0   0  0  0   0 3203    0
## 147   0  17 119    3     0   0  0  0   0 3225    0
## 148   0  17 119    3     0   0  0  0   0 3225    0
## 149   0  23 119    3     0   0  0  0   2 3232    0
## 150   0  24 110    3     0   0  0  0   0 3232    0
## 151   0  28 140    1     0   0  0  0   0 3234    0
## 154   0  26 133    3     1   2  0  0   0 3260    1
## 155   0  20 169    3     0   1  0  1   1 3274    1
## 156   0  24 115    3     0   0  0  0   2 3274    0
## 159   0  28 250    3     1   0  0  0   6 3303    0
## 160   0  20 141    1     0   2  0  1   1 3317    1
## 161   0  22 158    2     0   1  0  0   2 3317    1
## 162   0  22 112    1     1   2  0  0   0 3317    1
## 163   0  31 150    3     1   0  0  0   2 3321    0
## 164   0  23 115    3     1   0  0  0   1 3331    0
## 166   0  16 112    2     0   0  0  0   0 3374    0
## 167   0  16 135    1     1   0  0  0   0 3374    0
## 168   0  18 229    2     0   0  0  0   0 3402    0
## 169   0  25 140    1     0   0  0  0   1 3416    0
## 170   0  32 134    1     1   1  0  0   4 3430    1
## 172   0  20 121    2     1   0  0  0   0 3444    0
## 173   0  23 190    1     0   0  0  0   0 3459    0
## 174   0  22 131    1     0   0  0  0   1 3460    0
## 175   0  32 170    1     0   0  0  0   0 3473    0
## 176   0  30 110    3     0   0  0  0   0 3544    0
## 177   0  20 127    3     0   0  0  0   0 3487    0
## 179   0  23 123    3     0   0  0  0   0 3544    0
## 180   0  17 120    3     1   0  0  0   0 3572    0
## 181   0  19 105    3     0   0  0  0   0 3572    0
## 182   0  23 130    1     0   0  0  0   0 3586    0
## 183   0  36 175    1     0   0  0  0   0 3600    0
## 184   0  22 125    1     0   0  0  0   1 3614    0
## 185   0  24 133    1     0   0  0  0   0 3614    0
## 186   0  21 134    3     0   0  0  0   2 3629    0
## 187   0  19 235    1     1   0  1  0   0 3629    0
## 188   0  25  95    1     1   3  0  1   0 3637    1
## 189   0  16 135    1     1   0  0  0   0 3643    0
## 190   0  29 135    1     0   0  0  0   1 3651    0
## 191   0  29 154    1     0   0  0  0   1 3651    0
## 192   0  19 147    1     1   0  0  0   0 3651    0
## 193   0  19 147    1     1   0  0  0   0 3651    0
## 195   0  30 137    1     0   0  0  0   1 3699    0
## 196   0  24 110    1     0   0  0  0   1 3728    0
## 197   0  19 184    1     1   0  1  0   0 3756    0
## 199   0  24 110    3     0   1  0  0   0 3770    1
## 200   0  23 110    1     0   0  0  0   1 3770    0
## 201   0  20 120    3     0   0  0  0   0 3770    0
## 202   0  25 241    2     0   0  1  0   0 3790    0
## 203   0  30 112    1     0   0  0  0   1 3799    0
## 204   0  22 169    1     0   0  0  0   0 3827    0
## 205   0  18 120    1     1   0  0  0   2 3856    0
## 206   0  16 170    2     0   0  0  0   4 3860    0
## 207   0  32 186    1     0   0  0  0   2 3860    0
## 208   0  18 120    3     0   0  0  0   1 3884    0
## 209   0  29 130    1     1   0  0  0   2 3884    0
## 210   0  33 117    1     0   0  0  1   1 3912    0
## 211   0  20 170    1     1   0  0  0   0 3940    0
## 212   0  28 134    3     0   0  0  0   1 3941    0
## 213   0  14 135    1     0   0  0  0   0 3941    0
## 214   0  28 130    3     0   0  0  0   0 3969    0
## 215   0  25 120    1     0   0  0  0   2 3983    0
## 216   0  16  95    3     0   0  0  0   1 3997    0
## 217   0  20 158    1     0   0  0  0   1 3997    0
## 218   0  26 160    3     0   0  0  0   0 4054    0
## 219   0  21 115    1     0   0  0  0   1 4054    0
## 220   0  22 129    1     0   0  0  0   0 4111    0
## 221   0  25 130    1     0   0  0  0   2 4153    0
## 222   0  31 120    1     0   0  0  0   2 4167    0
## 223   0  35 170    1     0   1  0  0   1 4174    1
## 224   0  19 120    1     1   0  0  0   0 4238    0
## 225   0  24 116    1     0   0  0  0   1 4593    0
## 226   0  45 123    1     0   0  0  0   1 4990    0
## 4     1  28 120    3     1   1  0  1   0  709    1
## 10    1  29 130    1     0   0  0  1   2 1021    0
## 11    1  34 187    2     1   0  1  0   0 1135    0
## 13    1  25 105    3     0   1  1  0   0 1330    1
## 15    1  25  85    3     0   0  0  1   0 1474    0
## 16    1  27 150    3     0   0  0  0   0 1588    0
## 17    1  23  97    3     0   0  0  1   1 1588    0
## 18    1  24 128    2     0   1  0  0   1 1701    1
## 19    1  24 132    3     0   0  1  0   0 1729    0
## 20    1  21 165    1     1   0  1  0   1 1790    0
## 22    1  32 105    1     1   0  0  0   0 1818    0
## 23    1  19  91    1     1   2  0  1   0 1885    1
## 24    1  25 115    3     0   0  0  0   0 1893    0
## 25    1  16 130    3     0   0  0  0   1 1899    0
## 26    1  25  92    1     1   0  0  0   0 1928    0
## 27    1  20 150    1     1   0  0  0   2 1928    0
## 28    1  21 200    2     0   0  0  1   2 1928    0
## 29    1  24 155    1     1   1  0  0   0 1936    1
## 30    1  21 103    3     0   0  0  0   0 1970    0
## 31    1  20 125    3     0   0  0  1   0 2055    0
## 32    1  25  89    3     0   2  0  0   1 2055    1
## 33    1  19 102    1     0   0  0  0   2 2082    0
## 34    1  19 112    1     1   0  0  1   0 2084    0
## 35    1  26 117    1     1   1  0  0   0 2084    1
## 36    1  24 138    1     0   0  0  0   0 2100    0
## 37    1  17 130    3     1   1  0  1   0 2125    1
## 40    1  20 120    2     1   0  0  0   3 2126    0
## 42    1  22 130    1     1   1  0  1   1 2187    1
## 43    1  27 130    2     0   0  0  1   0 2187    0
## 44    1  20  80    3     1   0  0  1   0 2211    0
## 45    1  17 110    1     1   0  0  0   0 2225    0
## 46    1  25 105    3     0   1  0  0   1 2240    1
## 47    1  20 109    3     0   0  0  0   0 2240    0
## 49    1  18 148    3     0   0  0  0   0 2282    0
## 50    1  18 110    2     1   1  0  0   0 2296    1
## 51    1  20 121    1     1   1  0  1   0 2296    1
## 52    1  21 100    3     0   1  0  0   4 2301    1
## 54    1  26  96    3     0   0  0  0   0 2325    0
## 56    1  31 102    1     1   1  0  0   1 2353    1
## 57    1  15 110    1     0   0  0  0   0 2353    0
## 59    1  23 187    2     1   0  0  0   1 2367    0
## 60    1  20 122    2     1   0  0  0   0 2381    0
## 61    1  24 105    2     1   0  0  0   0 2381    0
## 62    1  15 115    3     0   0  0  1   0 2381    0
## 63    1  23 120    3     0   0  0  0   0 2410    0
## 65    1  30 142    1     1   1  0  0   0 2410    1
## 67    1  22 130    1     1   0  0  0   1 2410    0
## 68    1  17 120    1     1   0  0  0   3 2414    0
## 69    1  23 110    1     1   1  0  0   0 2424    1
## 71    1  17 120    2     0   0  0  0   2 2438    0
## 75    1  26 154    3     0   1  1  0   1 2442    1
## 76    1  20 105    3     0   0  0  0   3 2450    0
## 77    1  26 190    1     1   0  0  0   0 2466    0
## 78    1  14 101    3     1   1  0  0   0 2466    1
## 79    1  28  95    1     1   0  0  0   2 2466    0
## 81    1  14 100    3     0   0  0  0   2 2495    0
## 82    1  23  94    3     1   0  0  0   0 2495    0
## 83    1  17 142    2     0   0  1  0   0 2495    0
## 84    1  21 130    1     1   0  1  0   3 2495    0

For this part, I chose to collapse ptl into 2 variables, 0 and 1. 0 = no premature labors, and 1 = 1+ premature labors. This eliminates the need for a dummy variable, and makes it easier to determine the relationship between whether or not having a premature labor before has any impact on birth weight, especially since the number of women who have had more than 1 premature labor is so small.

e. Create a new variable for named which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.

#creating ftv2, where 0 = no physician visits in first trimester, 1 = 1+ physician visits in first trimester

birthwt$ftv2 = ifelse(birthwt$ftv > 0, 1, 0)

birthwt

##     low age lwt race smoke ptl ht ui ftv  bwt ptl2 ftv2
## 85    0  19 182    2     0   0  0  1   0 2523    0    0
## 86    0  33 155    3     0   0  0  0   3 2551    0    1
## 87    0  20 105    1     1   0  0  0   1 2557    0    1
## 88    0  21 108    1     1   0  0  1   2 2594    0    1
## 89    0  18 107    1     1   0  0  1   0 2600    0    0
## 91    0  21 124    3     0   0  0  0   0 2622    0    0
## 92    0  22 118    1     0   0  0  0   1 2637    0    1
## 93    0  17 103    3     0   0  0  0   1 2637    0    1
## 94    0  29 123    1     1   0  0  0   1 2663    0    1
## 95    0  26 113    1     1   0  0  0   0 2665    0    0
## 96    0  19  95    3     0   0  0  0   0 2722    0    0
## 97    0  19 150    3     0   0  0  0   1 2733    0    1
## 98    0  22  95    3     0   0  1  0   0 2751    0    0
## 99    0  30 107    3     0   1  0  1   2 2750    1    1
## 100   0  18 100    1     1   0  0  0   0 2769    0    0
## 101   0  18 100    1     1   0  0  0   0 2769    0    0
## 102   0  15  98    2     0   0  0  0   0 2778    0    0
## 103   0  25 118    1     1   0  0  0   3 2782    0    1
## 104   0  20 120    3     0   0  0  1   0 2807    0    0
## 105   0  28 120    1     1   0  0  0   1 2821    0    1
## 106   0  32 121    3     0   0  0  0   2 2835    0    1
## 107   0  31 100    1     0   0  0  1   3 2835    0    1
## 108   0  36 202    1     0   0  0  0   1 2836    0    1
## 109   0  28 120    3     0   0  0  0   0 2863    0    0
## 111   0  25 120    3     0   0  0  1   2 2877    0    1
## 112   0  28 167    1     0   0  0  0   0 2877    0    0
## 113   0  17 122    1     1   0  0  0   0 2906    0    0
## 114   0  29 150    1     0   0  0  0   2 2920    0    1
## 115   0  26 168    2     1   0  0  0   0 2920    0    0
## 116   0  17 113    2     0   0  0  0   1 2920    0    1
## 117   0  17 113    2     0   0  0  0   1 2920    0    1
## 118   0  24  90    1     1   1  0  0   1 2948    1    1
## 119   0  35 121    2     1   1  0  0   1 2948    1    1
## 120   0  25 155    1     0   0  0  0   1 2977    0    1
## 121   0  25 125    2     0   0  0  0   0 2977    0    0
## 123   0  29 140    1     1   0  0  0   2 2977    0    1
## 124   0  19 138    1     1   0  0  0   2 2977    0    1
## 125   0  27 124    1     1   0  0  0   0 2922    0    0
## 126   0  31 215    1     1   0  0  0   2 3005    0    1
## 127   0  33 109    1     1   0  0  0   1 3033    0    1
## 128   0  21 185    2     1   0  0  0   2 3042    0    1
## 129   0  19 189    1     0   0  0  0   2 3062    0    1
## 130   0  23 130    2     0   0  0  0   1 3062    0    1
## 131   0  21 160    1     0   0  0  0   0 3062    0    0
## 132   0  18  90    1     1   0  0  1   0 3062    0    0
## 133   0  18  90    1     1   0  0  1   0 3062    0    0
## 134   0  32 132    1     0   0  0  0   4 3080    0    1
## 135   0  19 132    3     0   0  0  0   0 3090    0    0
## 136   0  24 115    1     0   0  0  0   2 3090    0    1
## 137   0  22  85    3     1   0  0  0   0 3090    0    0
## 138   0  22 120    1     0   0  1  0   1 3100    0    1
## 139   0  23 128    3     0   0  0  0   0 3104    0    0
## 140   0  22 130    1     1   0  0  0   0 3132    0    0
## 141   0  30  95    1     1   0  0  0   2 3147    0    1
## 142   0  19 115    3     0   0  0  0   0 3175    0    0
## 143   0  16 110    3     0   0  0  0   0 3175    0    0
## 144   0  21 110    3     1   0  0  1   0 3203    0    0
## 145   0  30 153    3     0   0  0  0   0 3203    0    0
## 146   0  20 103    3     0   0  0  0   0 3203    0    0
## 147   0  17 119    3     0   0  0  0   0 3225    0    0
## 148   0  17 119    3     0   0  0  0   0 3225    0    0
## 149   0  23 119    3     0   0  0  0   2 3232    0    1
## 150   0  24 110    3     0   0  0  0   0 3232    0    0
## 151   0  28 140    1     0   0  0  0   0 3234    0    0
## 154   0  26 133    3     1   2  0  0   0 3260    1    0
## 155   0  20 169    3     0   1  0  1   1 3274    1    1
## 156   0  24 115    3     0   0  0  0   2 3274    0    1
## 159   0  28 250    3     1   0  0  0   6 3303    0    1
## 160   0  20 141    1     0   2  0  1   1 3317    1    1
## 161   0  22 158    2     0   1  0  0   2 3317    1    1
## 162   0  22 112    1     1   2  0  0   0 3317    1    0
## 163   0  31 150    3     1   0  0  0   2 3321    0    1
## 164   0  23 115    3     1   0  0  0   1 3331    0    1
## 166   0  16 112    2     0   0  0  0   0 3374    0    0
## 167   0  16 135    1     1   0  0  0   0 3374    0    0
## 168   0  18 229    2     0   0  0  0   0 3402    0    0
## 169   0  25 140    1     0   0  0  0   1 3416    0    1
## 170   0  32 134    1     1   1  0  0   4 3430    1    1
## 172   0  20 121    2     1   0  0  0   0 3444    0    0
## 173   0  23 190    1     0   0  0  0   0 3459    0    0
## 174   0  22 131    1     0   0  0  0   1 3460    0    1
## 175   0  32 170    1     0   0  0  0   0 3473    0    0
## 176   0  30 110    3     0   0  0  0   0 3544    0    0
## 177   0  20 127    3     0   0  0  0   0 3487    0    0
## 179   0  23 123    3     0   0  0  0   0 3544    0    0
## 180   0  17 120    3     1   0  0  0   0 3572    0    0
## 181   0  19 105    3     0   0  0  0   0 3572    0    0
## 182   0  23 130    1     0   0  0  0   0 3586    0    0
## 183   0  36 175    1     0   0  0  0   0 3600    0    0
## 184   0  22 125    1     0   0  0  0   1 3614    0    1
## 185   0  24 133    1     0   0  0  0   0 3614    0    0
## 186   0  21 134    3     0   0  0  0   2 3629    0    1
## 187   0  19 235    1     1   0  1  0   0 3629    0    0
## 188   0  25  95    1     1   3  0  1   0 3637    1    0
## 189   0  16 135    1     1   0  0  0   0 3643    0    0
## 190   0  29 135    1     0   0  0  0   1 3651    0    1
## 191   0  29 154    1     0   0  0  0   1 3651    0    1
## 192   0  19 147    1     1   0  0  0   0 3651    0    0
## 193   0  19 147    1     1   0  0  0   0 3651    0    0
## 195   0  30 137    1     0   0  0  0   1 3699    0    1
## 196   0  24 110    1     0   0  0  0   1 3728    0    1
## 197   0  19 184    1     1   0  1  0   0 3756    0    0
## 199   0  24 110    3     0   1  0  0   0 3770    1    0
## 200   0  23 110    1     0   0  0  0   1 3770    0    1
## 201   0  20 120    3     0   0  0  0   0 3770    0    0
## 202   0  25 241    2     0   0  1  0   0 3790    0    0
## 203   0  30 112    1     0   0  0  0   1 3799    0    1
## 204   0  22 169    1     0   0  0  0   0 3827    0    0
## 205   0  18 120    1     1   0  0  0   2 3856    0    1
## 206   0  16 170    2     0   0  0  0   4 3860    0    1
## 207   0  32 186    1     0   0  0  0   2 3860    0    1
## 208   0  18 120    3     0   0  0  0   1 3884    0    1
## 209   0  29 130    1     1   0  0  0   2 3884    0    1
## 210   0  33 117    1     0   0  0  1   1 3912    0    1
## 211   0  20 170    1     1   0  0  0   0 3940    0    0
## 212   0  28 134    3     0   0  0  0   1 3941    0    1
## 213   0  14 135    1     0   0  0  0   0 3941    0    0
## 214   0  28 130    3     0   0  0  0   0 3969    0    0
## 215   0  25 120    1     0   0  0  0   2 3983    0    1
## 216   0  16  95    3     0   0  0  0   1 3997    0    1
## 217   0  20 158    1     0   0  0  0   1 3997    0    1
## 218   0  26 160    3     0   0  0  0   0 4054    0    0
## 219   0  21 115    1     0   0  0  0   1 4054    0    1
## 220   0  22 129    1     0   0  0  0   0 4111    0    0
## 221   0  25 130    1     0   0  0  0   2 4153    0    1
## 222   0  31 120    1     0   0  0  0   2 4167    0    1
## 223   0  35 170    1     0   1  0  0   1 4174    1    1
## 224   0  19 120    1     1   0  0  0   0 4238    0    0
## 225   0  24 116    1     0   0  0  0   1 4593    0    1
## 226   0  45 123    1     0   0  0  0   1 4990    0    1
## 4     1  28 120    3     1   1  0  1   0  709    1    0
## 10    1  29 130    1     0   0  0  1   2 1021    0    1
## 11    1  34 187    2     1   0  1  0   0 1135    0    0
## 13    1  25 105    3     0   1  1  0   0 1330    1    0
## 15    1  25  85    3     0   0  0  1   0 1474    0    0
## 16    1  27 150    3     0   0  0  0   0 1588    0    0
## 17    1  23  97    3     0   0  0  1   1 1588    0    1
## 18    1  24 128    2     0   1  0  0   1 1701    1    1
## 19    1  24 132    3     0   0  1  0   0 1729    0    0
## 20    1  21 165    1     1   0  1  0   1 1790    0    1
## 22    1  32 105    1     1   0  0  0   0 1818    0    0
## 23    1  19  91    1     1   2  0  1   0 1885    1    0
## 24    1  25 115    3     0   0  0  0   0 1893    0    0
## 25    1  16 130    3     0   0  0  0   1 1899    0    1
## 26    1  25  92    1     1   0  0  0   0 1928    0    0
## 27    1  20 150    1     1   0  0  0   2 1928    0    1
## 28    1  21 200    2     0   0  0  1   2 1928    0    1
## 29    1  24 155    1     1   1  0  0   0 1936    1    0
## 30    1  21 103    3     0   0  0  0   0 1970    0    0
## 31    1  20 125    3     0   0  0  1   0 2055    0    0
## 32    1  25  89    3     0   2  0  0   1 2055    1    1
## 33    1  19 102    1     0   0  0  0   2 2082    0    1
## 34    1  19 112    1     1   0  0  1   0 2084    0    0
## 35    1  26 117    1     1   1  0  0   0 2084    1    0
## 36    1  24 138    1     0   0  0  0   0 2100    0    0
## 37    1  17 130    3     1   1  0  1   0 2125    1    0
## 40    1  20 120    2     1   0  0  0   3 2126    0    1
## 42    1  22 130    1     1   1  0  1   1 2187    1    1
## 43    1  27 130    2     0   0  0  1   0 2187    0    0
## 44    1  20  80    3     1   0  0  1   0 2211    0    0
## 45    1  17 110    1     1   0  0  0   0 2225    0    0
## 46    1  25 105    3     0   1  0  0   1 2240    1    1
## 47    1  20 109    3     0   0  0  0   0 2240    0    0
## 49    1  18 148    3     0   0  0  0   0 2282    0    0
## 50    1  18 110    2     1   1  0  0   0 2296    1    0
## 51    1  20 121    1     1   1  0  1   0 2296    1    0
## 52    1  21 100    3     0   1  0  0   4 2301    1    1
## 54    1  26  96    3     0   0  0  0   0 2325    0    0
## 56    1  31 102    1     1   1  0  0   1 2353    1    1
## 57    1  15 110    1     0   0  0  0   0 2353    0    0
## 59    1  23 187    2     1   0  0  0   1 2367    0    1
## 60    1  20 122    2     1   0  0  0   0 2381    0    0
## 61    1  24 105    2     1   0  0  0   0 2381    0    0
## 62    1  15 115    3     0   0  0  1   0 2381    0    0
## 63    1  23 120    3     0   0  0  0   0 2410    0    0
## 65    1  30 142    1     1   1  0  0   0 2410    1    0
## 67    1  22 130    1     1   0  0  0   1 2410    0    1
## 68    1  17 120    1     1   0  0  0   3 2414    0    1
## 69    1  23 110    1     1   1  0  0   0 2424    1    0
## 71    1  17 120    2     0   0  0  0   2 2438    0    1
## 75    1  26 154    3     0   1  1  0   1 2442    1    1
## 76    1  20 105    3     0   0  0  0   3 2450    0    1
## 77    1  26 190    1     1   0  0  0   0 2466    0    0
## 78    1  14 101    3     1   1  0  0   0 2466    1    0
## 79    1  28  95    1     1   0  0  0   2 2466    0    1
## 81    1  14 100    3     0   0  0  0   2 2495    0    1
## 82    1  23  94    3     1   0  0  0   0 2495    0    0
## 83    1  17 142    2     0   0  1  0   0 2495    0    0
## 84    1  21 130    1     1   0  1  0   3 2495    0    1

Similarly to ptl2, I chose to make ftv2 a binary variable where 0 = no physicians visits in the first trimester and 1 = 1+ visits in the first trimester. As I mentioned in a previous discussion, I don’t think that multiple physicians visits necessarily has any bearing on the birthweight of a child, especially since it was not statistically significant in the initial logistic regression model.

#correlation table to assess relationship between ftv2 and low

ftv2_low_corr = cor(birthwt[,c('low', 'ftv2')])

ftv2_low_corr

##             low       ftv2
## low   1.0000000 -0.1094147
## ftv2 -0.1094147  1.0000000

Looking at this correlation table, it does not appear that there is a very strong correlation between ‘low’ and ‘ftv2’. However, the correlation is negative, which implies that when ‘low’ is 0 (below 2.5kg at birth), ftv2 is ‘more likely’ to be 1 (1+ physician’s visits in the first trimester), but because the correlation coefficient is so small, this is likely not actually correlated, so it’s unreasonable to assume that not going to see a physician would therefore make your child more likely to have a birthweight above 2.5kg.

f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use and in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight??

new_bw_lr = glm(low ~ age + lwt + race + smoke + ht + ui +ptl2 + ftv2, family = "binomial", data = birthwt)

summary(new_bw_lr)

## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ht + ui + ptl2 + 
##     ftv2, family = "binomial", data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.132302   1.325225   0.100  0.92048   
## age         -0.041760   0.037844  -1.103  0.26981   
## lwt         -0.011848   0.006755  -1.754  0.07943 . 
## race         0.404931   0.224863   1.801  0.07174 . 
## smoke        0.816451   0.416669   1.959  0.05006 . 
## ht           1.795749   0.702141   2.558  0.01054 * 
## ui           0.657830   0.468951   1.403  0.16069   
## ptl2         1.249407   0.465305   2.685  0.00725 **
## ftv2        -0.103561   0.373413  -0.277  0.78152   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 199.45  on 180  degrees of freedom
## AIC: 217.45
## 
## Number of Fisher Scoring iterations: 4

Based on this new logistic regression model with my new variables, ptl2 and ftv2, it appears that ptl2 is now actually the strongest statistically significant variable in the model, implying that this new variable is very important in predicting whether or not a child’s birthweight will be below 2.5kg. However, ftv2 is still not statistically significant, which I predicted based on the very small correlation between ftv2 and ‘low’, which is what my model is aiming to predict.

g. In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the package in a manner similar to as done in the notes (but don’t use my name to set the seed!)

#split the data into an 80/20 training/testing set and set the seed
library(TeachingDemos)
char2seed("TakehomeExam") #setting the seed using TeachingDemos

bw_train_index = sample(1:nrow(birthwt), 0.8*nrow(birthwt))
bw_train = birthwt[bw_train_index, ]
bw_test = birthwt[-bw_train_index, ]

#fit LDA model using the variables with statistical significance in most recent logistic regression

bw_lda = lda(low ~ lwt + race + smoke + ht + ptl2, data = bw_train)

#confusion matrix
lda_predictions = predict(bw_lda, newdata = bw_test)
lda_class = lda_predictions$class
lda_conf = table(Predicted = lda_class, Actual = bw_test$low)
lda_conf

##          Actual
## Predicted  0  1
##         0 24  6
##         1  3  5

#calculate TN/FN/TP/FP
metrics <- function(confusion_matrix) {
  TN <- confusion_matrix[1, 1]
  FP <- confusion_matrix[1, 2]
  FN <- confusion_matrix[2, 1]
  TP <- confusion_matrix[2, 2]
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  accuracy <- (TP + TN) / (TP + TN + FP + FN)

  return(list(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
}

#sensitivity, specificity, and accuracy
lda_metrics = metrics(lda_conf)
lda_metrics

## $sensitivity
## [1] 0.625
## 
## $specificity
## [1] 0.8
## 
## $accuracy
## [1] 0.7631579

Looking at the LDA model, it appears that the sensitivity is 62.5%, the specificity is 80%, and the model is 76.3% accurate.

#fit QDA with the same variables as LDA

bw_qda = lda(low ~ lwt + race + smoke + ht + ptl2, data = bw_train)

#confusion matrix
qda_predictions = predict(bw_qda, newdata = bw_test)
qda_class = qda_predictions$class
qda_conf = table(Predicted = qda_class, Actual = bw_test$low)
qda_conf

##          Actual
## Predicted  0  1
##         0 24  6
##         1  3  5

#sensitivity, specificity, and accuracy
qda_metrics = metrics(qda_conf)
qda_metrics

## $sensitivity
## [1] 0.625
## 
## $specificity
## [1] 0.8
## 
## $accuracy
## [1] 0.7631579

Looking at the QDA model, it has the same metrics as the LDA model, so I can conclude that they perform the same.

#training the logistic regression model from (f) on the training data

f_lr = glm(low ~ age + lwt + race + smoke + ht + ui +ptl2 + ftv2, family = "binomial", data = bw_train)

#confusion matrix
f_predictions = predict(f_lr, newdata = bw_test, type = "response") 
f_class = factor(ifelse(f_predictions > 0.5, "Low", "Normal"))
f_conf = table(Predicted = f_class, Actual = bw_test$low)
f_conf

##          Actual
## Predicted  0  1
##    Low     2  5
##    Normal 25  6

#metrics
f_metrics = metrics(f_conf)
f_metrics

## $sensitivity
## [1] 0.1935484
## 
## $specificity
## [1] 0.2857143
## 
## $accuracy
## [1] 0.2105263

Looking at the logistic regression model from part (f), it has a much lower sensitivity that the LDA/QDA at 19.4%. The specificity is also lower at 28.6%, and the accuracy is much lower at around 21%.

#training the logistic regression model from (b) on the training data

b_lr = glm(low ~ age + lwt + race + smoke + ht + ui, family = "binomial", data = bw_train)

#confusion matrix
b_predictions = predict(b_lr, newdata = bw_test, type = "response") 
b_class = factor(ifelse(b_predictions > 0.5, "Low", "Normal"))
b_conf = table(Predicted = b_class, Actual = bw_test$low)
b_conf

##          Actual
## Predicted  0  1
##    Low     2  4
##    Normal 25  7

#metrics
b_metrics = metrics(b_conf)
b_metrics

## $sensitivity
## [1] 0.21875
## 
## $specificity
## [1] 0.3333333
## 
## $accuracy
## [1] 0.2368421

Finally, looking at the logistic regression model from part (b), it’s sensitivity is a bit higher than the one from part (f) at 21.88%, but is still less than the LDA/QDA model. The specificity is a bit higher than the model from part (f) at 33.3%, and the accuracy is also a little bit higher at 23.7%.

The model(s) that perform the best are LDA and QDA. I would recommend using the QDA model, however, because none of this data appears to be linear, so QDA might fit it better.

h. Using your final model from f), interpret the estimates for all covariates.

Because QDA models don’t print coefficients in the way that logistic/linear regression models do, I had to analyze the model in a slightly different way. I instead printed the means of each predictor variable in the two groups, not below 2.5kg and below 2.5kg.

bw_qda$means

##        lwt     race     smoke         ht       ptl2
## 0 133.4369 1.757282 0.3592233 0.02912621 0.08737864
## 1 123.8333 1.937500 0.5208333 0.12500000 0.29166667

Looking at the comparison of ‘lwt’, or mother’s weight at last menstrual period, we can see that the average weight of mothers whose children are born above 2.5kg is higher than the average weight of mothers whose children are born below 2.5kg, at 133 lbs vs 124 lbs. This implies that a lower weight of mother at last menstrual period might be related with a lower birth weight.

Because race, smoke, ht, and ptl2 are categorical variables, not continuous, I decided to interpret them using contingency tables.

#contingency table for 'race' variable

table(birthwt$race, birthwt$low)

##    
##      0  1
##   1 73 23
##   2 15 11
##   3 42 25

prop.table(table(birthwt$race, birthwt$low), margin = 1)

##    
##             0         1
##   1 0.7604167 0.2395833
##   2 0.5769231 0.4230769
##   3 0.6268657 0.3731343

Analyzing race, keeping in mind that 1 = white, 2 = black, and 3 = other, we can see that the majority of children born to white mothers (76%) are born above 2.5kg, whereas 57% of babies born to black mothers are born above 2.5kg, and 62% of babies born to mothers in the ‘other’ categories are born above 2.5kg. This implies that non-white mothers are more likely to have children born under 2.5kg, which might be due to genetics, but could potentially be due to healthcare disparities or lack of access to care.

#contingency table for 'smoke' variable

table(birthwt$smoke, birthwt$low)

##    
##      0  1
##   0 86 29
##   1 44 30

prop.table(table(birthwt$smoke, birthwt$low), margin = 1)

##    
##             0         1
##   0 0.7478261 0.2521739
##   1 0.5945946 0.4054054

From this, we can see that of babies born above 2.5kg, 75% of their mothers did NOT smoke during pregnancy, while for babies born below 2.5kg, only around 60% of their mothers did not smoke, and 40% did. This implies that smoking during pregnancy might lead to a child being born at a lower weight, and this might be due to the fact that smoking impacts a mother’s health and a child’s development in the womb.

#contingency table for 'ht' variable

table(birthwt$ht, birthwt$low)

##    
##       0   1
##   0 125  52
##   1   5   7

prop.table(table(birthwt$ht, birthwt$low), margin = 1)

##    
##             0         1
##   0 0.7062147 0.2937853
##   1 0.4166667 0.5833333

Looking at the comparison of low birthweight to mother’s who do or do not have hypertension, we can see that more mother’s without hypertension have babies above 2.5kg, and of babies born below 2.5kg, almost 60% of mothers have hypertension. However, it should ve noted that the majority of mothers do not have hypertension at all, and there are only 12 babies in this case born below 2.5kg versus the almost 180 babies born above 2.5kg, so it’s difficult to draw realistic conclusions from this dataset

#contingency table for 'ptl2' variable

table(birthwt$ptl2, birthwt$low)

##    
##       0   1
##   0 118  41
##   1  12  18

prop.table(table(birthwt$ptl2, birthwt$low), margin = 1)

##    
##             0         1
##   0 0.7421384 0.2578616
##   1 0.4000000 0.6000000

Finally, looking at the ptl2 variable, where 0=no physicians visit in the first trimester and 1= 1+ physicians visits in the first trimester, we can see that, actually of the babies born above 2.5kg, 74% of mothers did not attend any physicians visits in the first trimester, and for babies born below 2.5kg, 60% of mothers visited physicians at least once. Based on our previous analysis, I would not say that more physicians visits = a child weighing more at birth, but rather would posit that babies born below 2.5kg caused more concern during pregnancy, and therefore required mothers to attend more physicians visits.

Statistical Learning Take Home Exam1

2025-03-06

Please hand this result in on Canvas no later than 11:59pm on Wednesday, March 12! Do not work in groups!