Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.
library(MASS)
library(ggplot2)
data("cpus") #load the data
head(cpus)
## name syct mmin mmax cach chmin chmax perf estperf
## 1 ADVISOR 32/60 125 256 6000 256 16 128 198 199
## 2 AMDAHL 470V/7 29 8000 32000 32 8 32 269 253
## 3 AMDAHL 470/7A 29 8000 32000 32 8 32 220 253
## 4 AMDAHL 470V/7B 29 8000 32000 32 8 32 172 253
## 5 AMDAHL 470V/7C 29 8000 16000 32 8 16 132 132
## 6 AMDAHL 470V/8 26 8000 32000 64 8 32 318 290
#build a correlation matrix to see correlations between estimated performance (estperf) and other variables
cpus = na.omit(cpus) #drop NA values from dataset
corr = cor(cpus[, c("syct", "mmin", "mmax", "cach", "chmin", "chmax", "perf", "estperf")])
corr["estperf",]
## syct mmin mmax cach chmin chmax perf
## -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556 0.9664687
## estperf
## 1.0000000
#scatterplot of estperf (estimated performance) vs syct (cycle time in nanoseconds)
ggplot(cpus, aes(x = syct, y = estperf)) + geom_point()
#scatterplot between mmin and estperf
ggplot(cpus, aes(x = mmin, y = estperf)) + geom_point()
#scatterplot between mmax and estperf
ggplot(cpus, aes(x = mmax, y = estperf)) + geom_point()
#Scatterplot between cach and estperf
ggplot(cpus, aes(x = cach, y = estperf)) + geom_point()
#scatterplot between chmin and estperf
ggplot(cpus, aes(x = chmin, y = estperf)) + geom_point()
#scatterplot between chmax and estperf
ggplot(cpus, aes(x = chmax, y = estperf)) + geom_point()
#scatterplot between perf and estperf
ggplot(cpus, aes(x = perf, y = estperf)) + geom_point()
Using the correlation matrix and scatterplots to assess the relationship between ‘estperf’ and the other variables in this dataset:
The correlation between ‘estperf’ and ‘syct’ is -0.288, implying a negative association between the two variables: machines with longer cycle times tend to have lower estimated performance. However, this coefficient is relatively small, so the relationship between these two variables is fairly weak overall since -0.288 is close to 0 (0 = no correlation). Looking at the scatterplot between ‘syct’ and ‘estperf’, it’s also apparent that there is little semblance of a relationship between the two variables. The points on the graph show no clear pattern, which makes sense given the weak correlation between the two variables.
The correlation between ‘estperf’ and ‘mmin’ is 0.819, which implies a strong positive association between the two variables: machines with more minimum main memory (in kB) tend to have higher estimated performance. This positive linear relationship is clearly visible in the scatterplot of estperf against mmin - there is a general positive linear trend with only a few outliers.
The correlation between ‘estperf’ and ‘mmax’ is 0.901, which again implies a strong positive association between these two variables: machines with more maximum main memory (in kB) tend to have higher estimated performance. Looking at the scatterplot, this is again confirmed - the relationship looks fairly linear with only a few outliers. Based on the fact that mmin and mmax are both highly positively correlated with estperf, I feel comfortable concluding that main memory and the performance of a CPU are related.
The correlation between ‘estperf’ and ‘cach’ is 0.649, a moderately strong positive association, although not as strong as for mmin and mmax: machines with larger cache sizes (in kB) tend to have higher estimated performance. The scatterplot of these two variables starts out displaying a somewhat linear trend on the left-hand side of the graph, but the values fan out as the cache size increases, which suggests that another related factor might be driving the increase in estimated performance.
The correlation between ‘estperf’ and ‘chmin’ is 0.611, which, like ‘cach’, is a moderately strong positive association, but not as strong as between ‘estperf’ and some other variables: machines with more minimum channels tend to have higher estimated performance. The scatterplot of these two variables doesn’t show a very convincing linear relationship, however, as the values are stacked on top of one another on the left-hand side of the graph and fan out as the minimum channel number increases.
The correlation between ‘estperf’ and ‘chmax’ is 0.592, which, like ‘chmin’ and ‘cach’, is a positive but not very strong association: estimated performance tends to rise with the maximum number of channels. Looking at the scatterplot of these two variables, the same thing that was true for ‘chmin’ is true here: there is not much visual evidence of a positive relationship. The values have no apparent pattern, and are stacked atop each other on the left side and fanned out on the right.
The correlation between ‘estperf’ and ‘perf’ is 0.966, which is almost 1, meaning this is a very strong positive association: machines with higher published performance have correspondingly higher estimated performance. The scatterplot of these two variables showcases this, with a very apparent positive linear relationship. However, this is to be expected since both variables measure ‘performance’, and I would assume that the estimated performance is, in some capacity, based on the published performance. In the rest of this analysis, the ‘perf’ variable will not be used.
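As a supplement to the individual scatterplots above, a pairs plot shows all of these pairwise relationships at once; a minimal sketch using the same cpus data (dropping the non-numeric name column):
#pairs plot of all numeric variables in cpus (column 1, name, is excluded)
pairs(cpus[, -1])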
Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider ‘perf’ in this modeling approach. Explain the process used to arrive at your final model.
cpus_model = lm(estperf ~ syct + mmin + mmax + cach + chmax, data = cpus)
summary(cpus_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.633 -23.149 1.574 22.957 287.869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.660e+01 6.257e+00 -10.643 < 2e-16 ***
## syct 6.613e-02 1.365e-02 4.846 2.50e-06 ***
## mmin 1.424e-02 1.397e-03 10.188 < 2e-16 ***
## mmax 6.584e-03 4.999e-04 13.171 < 2e-16 ***
## cach 4.871e-01 1.050e-01 4.639 6.28e-06 ***
## chmax 1.187e+00 1.623e-01 7.314 5.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.78 on 203 degrees of freedom
## Multiple R-squared: 0.9108, Adjusted R-squared: 0.9086
## F-statistic: 414.8 on 5 and 203 DF, p-value: < 2.2e-16
My final model used the variables ‘syct’, ‘mmin’, ‘mmax’, ‘cach’, and ‘chmax’ as predictors for ‘estperf’. Per the request of question (b), the variable ‘name’ is not included, and ‘perf’ is not used as a predictor as requested in the initial instructions for this question. Initially, my model also included ‘chmin’, but after running a summary, I found that it was the only predictor that wasn’t statistically significant (a p-value of 0.797). The adjusted R-squared value also increased slightly (from 0.9082 to 0.9086) with the removal of ‘chmin’. The fact that all of the included variables are statistically significant and the R-squared values are ~90% indicates that this linear regression model is a strong fit, and that this particular one is the best representation of the relationship between the predictors and ‘estperf’.
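For completeness, the elimination step described above can be reproduced with a short sketch (assuming the same cpus data; it yields the same final model as fit earlier):
#start from the model that still includes chmin
full_model = lm(estperf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
summary(full_model) #chmin is the only non-significant predictor (p ~ 0.80)
#drop chmin and refit
cpus_model = update(full_model, . ~ . - chmin)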
Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
plot(cpus_model)
Based on the 4 residual plots (Residuals vs Fitted, Q-Q Plot, Scale-Location, and Residuals vs Leverage), I believe that this model does violate some of the assumptions of linear regression, primarily the normality assumption and the homoscedasticity assumption. Looking specifically at the Scale-Location graph, the red line and the points show a general positive trend, which implies that the residual variance is not constant across fitted values. Additionally, looking at the Q-Q Plot, it’s apparent that while the residuals mostly follow the normal reference line, they deviate greatly at either end of the plot, implying that the distribution is not entirely normal. Finally, there are clearly outliers present in this model/data based on the four generated graphs, specifically observations 199, 200, and 10, which are highlighted in every graph.
In order to adjust my model, I think that transforming the estperf variable might be effective as it would address the issue of the model violating the homoscedasticity assumption.
#trying a log transformation of 'estperf' in my model and re-generating the 4 plots
log_model = lm((log(estperf)) ~ syct + mmin + mmax + cach + chmax, data = cpus)
plot(log_model)
Looking at these 4 new residual plots from the model with ‘estperf’ log-transformed, it appears that there was not much of a change in the violation of the homoscedasticity assumption: the red trend line is still not flat and gradually increases as you move to the right of the graph, so the log transformation didn’t fix this violation. The Q-Q Plot also doesn’t show any significant improvement, with the residuals at the tail ends still deviating from the reference line. Additionally, the same 3 outlier points from my initial model are still causing problems/being shown as high-leverage points in these graphs.
#trying one last thing: removing the five 'high leverage' observations flagged in the residual plots (9, 10, 157, 199, and 200)
clean = cpus[-c(10, 9, 200, 157, 199),]
clean_model = lm((log(estperf)) ~ syct + mmin + mmax + cach + chmax, data = clean)
plot(clean_model)
Using the data with the initially discovered outliers removed made somewhat of an improvement in my opinion - the red trend line on the Residuals vs. Fitted plot is a lot less wavy, although it still does not sit on the dotted grey line. The Q-Q Plot didn’t show much improvement, as it hasn’t in any of the updated models I’ve created. However, looking at the Scale-Location plot and the Residuals vs. Leverage plot, the red trend line has definitely flattened out with the removal of the initially identified outliers. Despite this, new outliers/high-leverage points have been identified. I think that my most recent model is the best fit, but the outliers could be assessed further, and there might be another way to transform ‘estperf’ that produces an even better-fitting model; one way to check this is sketched below.
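One way to check whether some other power transformation of ‘estperf’ might work better than the log is a Box-Cox plot; a sketch assuming the cleaned data from above (MASS is already loaded):
#Box-Cox plot for the model on the cleaned data; a lambda near 0 supports the log transformation
boxcox(lm(estperf ~ syct + mmin + mmax + cach + chmax, data = clean),
       lambda = seq(-0.5, 0.5, 0.05))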
d. How well does the final model fit the data? Comment on some model fit criteria from the model built in c)
summary(clean_model)
##
## Call:
## lm(formula = (log(estperf)) ~ syct + mmin + mmax + cach + chmax,
## data = clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46999 -0.11111 0.02302 0.10538 0.38234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.139e+00 2.320e-02 135.301 < 2e-16 ***
## syct -3.014e-04 4.566e-05 -6.601 3.66e-10 ***
## mmin 3.398e-05 5.350e-06 6.352 1.43e-09 ***
## mmax 5.983e-05 1.827e-06 32.745 < 2e-16 ***
## cach 7.093e-03 3.783e-04 18.748 < 2e-16 ***
## chmax 9.444e-04 6.071e-04 1.556 0.121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1524 on 198 degrees of freedom
## Multiple R-squared: 0.9681, Adjusted R-squared: 0.9673
## F-statistic: 1203 on 5 and 198 DF, p-value: < 2.2e-16
The final model fits the data pretty well. Compared to the original model, the chmax variable is no longer statistically significant (in the original model, all included variables were). However, the R-squared value for the final model (with the log transformation and the removal of outliers) is higher than that of the original model (~96% on the new model versus ~91% on the original), meaning that this cleaned/transformed model explains more of the variation in its (log-transformed) outcome than my original model did in ‘estperf’. Additionally, the F-statistic is much higher in this cleaned/transformed model (1203) than in the original model (414.8), while the p-value is equally small. This higher F-statistic, coupled with the equally small p-value, implies that the cleaned/transformed model is statistically significant as a whole and does a better job of predicting ‘estperf’ than my original model, despite one of its variables (‘chmax’) not being statistically significant. A rough way to compare the two fits on the same scale is sketched below.
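Because the final model predicts log(estperf) while the original model predicts estperf directly, the two R-squared values are not measured on exactly the same scale. One rough way to compare them on the raw scale is to back-transform the fitted values; a sketch assuming clean_model and clean from above:
#back-transform the log-scale fitted values and compute a pseudo R-squared on the raw estperf scale
pred_raw = exp(predict(clean_model))
cor(pred_raw, clean$estperf)^2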
e. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
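A sketch of the back-transformation step this question points to, assuming clean_model from part c): since the outcome is log(estperf), exponentiating a coefficient gives a multiplicative change in estperf, and coefficients can be rescaled to more meaningful units first (e.g., 1 MB = 1024 kB for the memory variables).
#exponentiate the log-scale coefficients to get multiplicative effects on estperf
exp(coef(clean_model))
#example: multiplicative change in estperf per additional MB (1024 kB) of maximum main memory
exp(coef(clean_model)["mmax"] * 1024)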
f. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
To assess multicollinearity between predictors in my model, I chose to calculate VIFs.
library(car)
## Loading required package: carData
vif(clean_model)
## syct mmin mmax cach chmax
## 1.250982 2.295930 2.450264 1.609008 1.393809
All of these VIF scores for my final model indicate that there is no problematic multicollinearity, as a VIF below 5 is usually seen as acceptable. This implies that the predictors are not strongly correlated with one another, so the coefficient estimates and their standard errors are not being inflated. Because the VIF scores are so low, there is no action that needs to be taken here.
g. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.
Because I’ve already made some residual plots that discuss outliers/influential observations, I chose to calculate some indices that would further justify my answer.
I started by calculating Cook’s D, which is a way to calculate influence on a model.
cooks_dist = cooks.distance(clean_model)
cooks_dist
## 1 2 3 4 5 6
## 7.971881e-02 4.894154e-04 4.894154e-04 4.894154e-04 1.658968e-02 4.321608e-03
## 7 8 11 12 13 14
## 1.894834e-02 1.894834e-02 7.205744e-04 6.927192e-04 2.267221e-03 8.281804e-04
## 15 16 17 18 19 20
## 9.718355e-03 1.055740e-03 2.181600e-03 7.112512e-04 2.993073e-03 3.029138e-02
## 21 22 23 24 25 26
## 1.091103e-07 5.618754e-04 3.237095e-03 1.389289e-02 4.653070e-04 2.014168e-03
## 27 28 29 30 31 32
## 4.548499e-04 2.022378e-03 2.160842e-04 4.881519e-04 9.150915e-02 9.150915e-02
## 33 34 35 36 37 38
## 2.884952e-03 2.884952e-03 1.674338e-03 1.383015e-03 4.159333e-03 8.343650e-04
## 39 40 41 42 43 44
## 2.048984e-04 3.317341e-04 2.027232e-05 1.090116e-03 1.041045e-03 1.041045e-03
## 45 46 47 48 49 50
## 3.108017e-04 4.056381e-05 3.092140e-03 2.588650e-03 8.890847e-06 2.181273e-07
## 51 52 53 54 55 56
## 2.200003e-03 2.198288e-03 8.452814e-04 6.978808e-06 8.057133e-04 2.942519e-04
## 57 58 59 60 61 62
## 9.130214e-05 4.949352e-03 4.949352e-03 4.949352e-03 4.949352e-03 4.949352e-03
## 63 64 65 66 67 68
## 7.692913e-03 3.891510e-03 1.217026e-03 4.541794e-04 1.028488e-02 6.252920e-03
## 69 70 71 72 73 74
## 5.453067e-04 4.399360e-06 8.799166e-04 8.616169e-03 6.375433e-03 1.875719e-03
## 75 76 77 78 79 80
## 8.704567e-04 7.901257e-04 4.175629e-04 4.636948e-04 7.990296e-03 5.120191e-04
## 81 82 83 84 85 86
## 9.893844e-04 6.213793e-04 1.218576e-02 3.032611e-03 6.014061e-04 5.184297e-04
## 87 88 89 90 91 92
## 1.698620e-04 1.854256e-05 1.589074e-02 3.014309e-03 1.623972e-02 1.623972e-02
## 93 94 95 96 97 98
## 1.854256e-05 2.886243e-03 9.216655e-05 3.331992e-02 4.593814e-02 1.182355e-03
## 99 100 101 102 103 104
## 3.729997e-02 1.026874e-02 3.014000e-03 1.618980e-04 3.428123e-04 9.018540e-04
## 105 106 107 108 109 110
## 1.539365e-03 2.194745e-06 3.435355e-03 1.740545e-03 4.071389e-03 2.371680e-03
## 111 112 113 114 115 116
## 5.603895e-03 7.756365e-03 3.589826e-06 3.589826e-06 9.464006e-05 1.085367e-03
## 117 118 119 120 121 122
## 1.054906e-03 7.051297e-04 9.155860e-04 1.669259e-03 9.023958e-03 5.686226e-03
## 123 124 125 126 127 128
## 1.771739e-02 3.425327e-02 5.509885e-04 1.416522e-04 1.927424e-04 1.927424e-04
## 129 130 131 132 133 134
## 7.873028e-04 7.873028e-04 4.621530e-03 2.143560e-05 7.656564e-05 7.656564e-05
## 135 136 137 138 139 140
## 1.729251e-03 1.854754e-03 1.854754e-03 1.698261e-03 2.425223e-04 3.620558e-04
## 141 142 143 144 145 146
## 4.850600e-04 4.850600e-04 3.036927e-05 1.213483e-03 8.960220e-04 8.568292e-04
## 147 148 149 150 151 152
## 1.027146e-03 1.252178e-03 1.212513e-03 4.659152e-04 1.181711e-02 1.102219e-02
## 153 154 155 156 158 159
## 9.619517e-03 2.587835e-01 7.331918e-04 1.068034e-02 7.607999e-03 5.682083e-03
## 160 161 162 163 164 165
## 2.046136e-03 2.372655e-07 1.071450e-04 1.322358e-03 1.906570e-03 1.819424e-03
## 166 167 168 169 170 171
## 1.894308e-02 1.894308e-02 4.553806e-03 1.145658e-01 3.422975e-03 3.084048e-03
## 172 173 174 175 176 177
## 1.285282e-03 2.520307e-03 1.381900e-03 1.381900e-03 1.185021e-03 1.642450e-03
## 178 179 180 181 182 183
## 5.491611e-03 2.995055e-04 4.838045e-05 4.534725e-04 5.811235e-03 2.276836e-03
## 184 185 186 187 188 189
## 1.758983e-04 1.602384e-04 2.427468e-04 1.231990e-03 2.134571e-03 1.986226e-03
## 190 191 192 193 194 195
## 6.794543e-03 2.913073e-03 8.185000e-03 4.639300e-02 9.359615e-04 5.697509e-03
## 196 197 198 201 202 203
## 3.013216e-03 3.419496e-02 3.288438e-02 1.961712e-03 2.091749e-03 1.961712e-03
## 204 205 206 207 208 209
## 2.091749e-03 8.863780e-06 2.631124e-04 1.718192e-04 2.609550e-03 2.102238e-04
Using one rule of thumb, which says that a point with a Cook’s D value over 1 is highly influential, between 0.5 and 1 is moderately influential, and below 0.5 is not influential, none of these points would be considered influential in the model. However, I don’t think that tells the whole story, simply because the residual plots I generated earlier flagged several high-leverage points/outliers.
Using another rule of thumb, 4/N, where N is the number of observations, tells a different story.
threshold = 4/204 #204 observations in my cooks.distance function
threshold
## [1] 0.01960784
This implies that our threshold for an outlier or high-leverage point is 0.01960784, or about 1.96e-02, and a handful of high-leverage points/outliers in my model exceed it. I pulled them out for easier analysis:
cooks_d = data.frame(Observation = 1:length(cooks_dist), Cooks_Distance = cooks_dist)
influential = which(cooks_dist > threshold)
influential
## 1 20 31 32 96 97 99 124 154 169 193 197 198
## 1 18 29 30 94 95 97 122 152 166 190 194 195
From this, we can see that the observations in my dataset that are outliers/high-leverage points based on Cook’s D are rows 1, 20, 31, 32, 96, 97, 99, 124, 154, 169, 193, 197, and 198 (the second line of the output above gives each observation’s position in the cleaned data).
This makes sense, as some of these observations were present in my earlier residual graphs, like 31, 169, 154, and 166. Removing all of these points from my model is an option, but because my model otherwise has a very good fit and no multicollinearity, I don’t think that is entirely necessary to make the model ‘better’.
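A simple visualization of the same information, assuming cooks_dist and threshold from above:
#plot Cook's distance for every observation with the 4/N cutoff drawn as a dashed line
plot(cooks_dist, type = "h", ylab = "Cook's distance")
abline(h = threshold, lty = 2)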
a. Investigate the relationship between variables in the dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.
data("birthwt")
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
#correlation between variables in the 'birthwt' dataset
bw_corr = cor(birthwt[, c("low", "age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv", "bwt")])
bw_corr
## low age lwt race smoke ptl
## low 1.00000000 -0.11893928 -0.16962694 0.137792751 0.16140431 0.196087267
## age -0.11893928 1.00000000 0.18007315 -0.172817953 -0.04434618 0.071606386
## lwt -0.16962694 0.18007315 1.00000000 -0.165048544 -0.04417908 -0.140029003
## race 0.13779275 -0.17281795 -0.16504854 1.000000000 -0.33903074 0.007951293
## smoke 0.16140431 -0.04434618 -0.04417908 -0.339030745 1.00000000 0.187557063
## ptl 0.19608727 0.07160639 -0.14002900 0.007951293 0.18755706 1.000000000
## ht 0.15237025 -0.01583700 0.23636040 0.019929917 0.01340704 -0.015399579
## ui 0.16904283 -0.07515558 -0.15276317 0.053602088 0.06215900 0.227585340
## ftv -0.06296026 0.21539394 0.14052746 -0.098336254 -0.02801314 -0.044429660
## bwt -0.78480516 0.09031781 0.18573328 -0.194713487 -0.19044806 -0.154653390
## ht ui ftv bwt
## low 0.15237025 0.16904283 -0.06296026 -0.78480516
## age -0.01583700 -0.07515558 0.21539394 0.09031781
## lwt 0.23636040 -0.15276317 0.14052746 0.18573328
## race 0.01992992 0.05360209 -0.09833625 -0.19471349
## smoke 0.01340704 0.06215900 -0.02801314 -0.19044806
## ptl -0.01539958 0.22758534 -0.04442966 -0.15465339
## ht 1.00000000 -0.10858506 -0.07237255 -0.14598189
## ui -0.10858506 1.00000000 -0.05952341 -0.28392741
## ftv -0.07237255 -0.05952341 1.00000000 0.05831777
## bwt -0.14598189 -0.28392741 0.05831777 1.00000000
Looking at the correlation matrix of the variables, not many of them seem to be strongly correlated, as most are around or below 0.1 or -0.1. However, the correlation between ‘low’, which is a binary variable that indicates whether or not the birthweight (bwt) was less than 2.5kg, and the ‘bwt’ variable is very strongly negative (-0.78), which makes sense considering that they are both variables relating to weight at birth.
#visualizing a scatterplot between 'bwt' and 'age' (mother's age)
ggplot(birthwt, aes(x = age, y = bwt)) +geom_point()
Looking at the scatterplot between ‘bwt’ and ‘age’, it does not appear that there is any trend between the mother’s age and the weight of the child. There seems to be a high concentration of mothers between 20 and 25, but the weight of their children doesn’t follow any specific trend. There appears to be one outlier - a 45-year-old mother had a child that weighed 5000 grams, or 5kg, or ~11 lbs.
#visualizing a scatterplot between 'bwt' and 'lwt' (mother's weight in lbs at last menstrual period)
ggplot(birthwt, aes(x = lwt, y = bwt)) +geom_point()
Again, there isn’t a very clear trend between bwt and lwt, likely because a child’s birthweight depends more on genetics, environmental factors, and the mother’s overall health than on the mother’s weight specifically. It does seem like children of mothers with higher weights tend to weigh more at birth, but there are also children who weigh 5000 g at birth from mothers who weigh 125 lbs, so again, there is no real trend here.
#visualizing a barplot for count of 'low' (whether or not the birthweight was below 2.5kg, 0=no, 1=yes)
ggplot(birthwt, aes(x = low)) +geom_bar()
Based on this bar plot, it appears that most children are born weighing above 2.5kg, which is ~5.5 lbs. Around 130 babies in this dataset were born weighing more than 2.5kg, and roughly 60, about half as many, were born under 2.5kg.
#visualizing a barplot for counts of 'race' (race of mother: 1=white, 2=black, 3=other)
ggplot(birthwt, aes(x = race)) +geom_bar()
This bar plot shows that the majority of mothers in the dataset are white, then ‘other’, and then black. I think it would have been beneficial to break the ‘other’ category into more specific races, since it makes up such a large portion of the dataset; doing so would give a better picture of the demographics of the mothers included, especially because there can be disparities within that combined category that apply to some groups but not others.
#visualizing a barplot for count of 'smoke' (whether or not the mother smoked during pregnancy: 0 = no, 1= yes)
ggplot(birthwt, aes(x = smoke)) +geom_bar()
It appears that most of the mothers in this dataset did not smoke during their pregnancy, but a large minority did. This might be associated with lower birthweight, as we previously discussed that the mother’s health plays a role in determining birthweight; a boxplot (sketched below) would show this more directly.
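Since smoke is categorical, a boxplot of bwt by smoking status is a more appropriate visualization than a scatterplot or a correlation; a minimal sketch using the already-loaded ggplot2:
#boxplot of birthweight by smoking status (0 = did not smoke, 1 = smoked during pregnancy)
ggplot(birthwt, aes(x = factor(smoke), y = bwt)) + geom_boxplot()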
#visualizing a barplot for count of 'ptl' (number of premature labors)
ggplot(birthwt, aes(x = ptl)) +geom_bar()
From this, we can see that most mothers have never had a premature labor before, and a very small number of mothers have had 3 premature labors – almost a negligible amount compared to those who have had 0-2.
#visualizing a barplot for count of 'ht' (whether or not the mother has a history of hypertension, 0=no, 1=yes)
ggplot(birthwt, aes(x = ht)) +geom_bar()
Here we can see that a large majority of mothers do not have a history of hypertension, with only a handful (about 12) of mothers who do. Those with a history of hypertension might have children with a lower birthweight, again because birthweight and the mother’s health are related.
#visualizing a barplot for count of 'ui' (whether or not the mother has a history of uterine irritability, 0=no, 1=yes)
ggplot(birthwt, aes(x = ui)) +geom_bar()
Again here, we can see that most mothers do not have a history of uterine irritability, which is a condition that causes frequent contractions during pregnancy and could, in very rare cases, cause premature labor. However, only around 25 women in this dataset have a history of this.
#visualizing a barplot for count of 'ftv' (# of physicians visits during the first trimester)
ggplot(birthwt, aes(x = ftv)) +geom_bar()
A large number of women did not visit a physician at all during their first trimester; this could be because they weren’t aware that they were pregnant, or potentially because they lacked the resources to get to a doctor during that time or at all during their pregnancy. There are also a few women who had 4+ visits to their physician during the first trimester of their pregnancy, which could indicate a history or presence of complicated pregnancies or an issue with their child.
b. Fit a logistic regression model using methods discussed in class/the book, similar to as in problem 1). Be careful to understand each variable in the dataset to avoid including variables that are not logically acceptable for inclusion in the model.
#fit logistic regression model predicting whether or not a child's weight will be below 2.5kg or not using variables relating to mother's health
bw_lr = glm(low ~ age + lwt + race + smoke + ht + ui, family = "binomial", data = birthwt)
summary(bw_lr)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ht + ui, family = "binomial",
## data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.137148 1.270543 -0.108 0.91404
## age -0.024657 0.034772 -0.709 0.47825
## lwt -0.013268 0.006539 -2.029 0.04247 *
## race 0.464659 0.212688 2.185 0.02891 *
## smoke 1.030252 0.390802 2.636 0.00838 **
## ht 1.824340 0.683039 2.671 0.00756 **
## ui 0.859435 0.451075 1.905 0.05674 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 206.83 on 182 degrees of freedom
## AIC: 220.83
##
## Number of Fisher Scoring iterations: 4
Even though most of my visualizations compare ‘bwt’ with other variables, it is not possible to build a logistic regression model with ‘bwt’ as the outcome because it is continuous and logistic regression predicts a binary outcome. As a solution to this, I selected ‘low’ as the outcome, since it is strongly correlated with ‘bwt’, so we can use the remaining variables to predict whether or not a child will be born weighing less than 2.5kg.
For my model, I chose to use the variables age, lwt, race, smoke, ht, and ui as predictors. These predictors are related to maternal health, and thus can be good predictors of the child’s birthweight. I chose not to include ptl because, while it might be a logical explanation for why a child is at a lower weight, it’s not directly related to the mother’s current health, and only a small number of women in the dataset have had a premature labor before. I opted not to include ftv as well because, again, this is not a variable that is indicative of the mother’s health. Finally, I did not include the ‘bwt’ variable because it directly tells us whether the child’s weight is below/above 2.5kg, so it shouldn’t be used as a predictor.
c. What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
I did not include ptl and ftv in my final model, partly because they were not statistically significant, and more importantly because I chose to focus my model on maternal health. However, when they are included, the model implicitly assumes that each additional premature labor or physician visit shifts the log odds of low birth weight by the same fixed amount. I noticed that ‘ptl’ in particular had a fairly large coefficient (0.542087), whereas ftv had a smaller one, but both were positive, implying an increase in the log odds of low birth weight as these variables increase. The assumption that additional premature labors would increase the log odds of low birth weight seems reasonable, since a premature birth means there is not enough time for the child to fully develop. However, realistically, more physician visits don’t really have any bearing on the weight of the child; as we’ve already discussed, the birthweight of a child is primarily driven by genetics, environment, and maternal health. It’s possible that a physician visit might expose issues in a mother’s health that could then be addressed and have a positive impact on the birthweight, but more visits don’t simply mean better outcomes.
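One way to check that implicit assumption is to refit with ptl and ftv entered as factors, so each count level gets its own effect instead of a single common slope; a hypothetical sketch (these models are not part of my final analysis):
#linear-in-the-log-odds version: each extra premature labor/visit shifts the log odds by a fixed amount
lin_fit = glm(low ~ ptl + ftv, family = "binomial", data = birthwt)
#factor version: every observed count gets its own coefficient, relaxing the linearity assumption
fac_fit = glm(low ~ factor(ptl) + factor(ftv), family = "binomial", data = birthwt)
summary(fac_fit)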
d. Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
#creating ptl2, where 0 = no premature labors and 1 = yes
birthwt$ptl2 = ifelse(birthwt$ptl > 0, 1, 0)
birthwt
## low age lwt race smoke ptl ht ui ftv bwt ptl2
## 85 0 19 182 2 0 0 0 1 0 2523 0
## 86 0 33 155 3 0 0 0 0 3 2551 0
## 87 0 20 105 1 1 0 0 0 1 2557 0
## 88 0 21 108 1 1 0 0 1 2 2594 0
## 89 0 18 107 1 1 0 0 1 0 2600 0
## 91 0 21 124 3 0 0 0 0 0 2622 0
## 92 0 22 118 1 0 0 0 0 1 2637 0
## 93 0 17 103 3 0 0 0 0 1 2637 0
## 94 0 29 123 1 1 0 0 0 1 2663 0
## 95 0 26 113 1 1 0 0 0 0 2665 0
## 96 0 19 95 3 0 0 0 0 0 2722 0
## 97 0 19 150 3 0 0 0 0 1 2733 0
## 98 0 22 95 3 0 0 1 0 0 2751 0
## 99 0 30 107 3 0 1 0 1 2 2750 1
## 100 0 18 100 1 1 0 0 0 0 2769 0
## 101 0 18 100 1 1 0 0 0 0 2769 0
## 102 0 15 98 2 0 0 0 0 0 2778 0
## 103 0 25 118 1 1 0 0 0 3 2782 0
## 104 0 20 120 3 0 0 0 1 0 2807 0
## 105 0 28 120 1 1 0 0 0 1 2821 0
## 106 0 32 121 3 0 0 0 0 2 2835 0
## 107 0 31 100 1 0 0 0 1 3 2835 0
## 108 0 36 202 1 0 0 0 0 1 2836 0
## 109 0 28 120 3 0 0 0 0 0 2863 0
## 111 0 25 120 3 0 0 0 1 2 2877 0
## 112 0 28 167 1 0 0 0 0 0 2877 0
## 113 0 17 122 1 1 0 0 0 0 2906 0
## 114 0 29 150 1 0 0 0 0 2 2920 0
## 115 0 26 168 2 1 0 0 0 0 2920 0
## 116 0 17 113 2 0 0 0 0 1 2920 0
## 117 0 17 113 2 0 0 0 0 1 2920 0
## 118 0 24 90 1 1 1 0 0 1 2948 1
## 119 0 35 121 2 1 1 0 0 1 2948 1
## 120 0 25 155 1 0 0 0 0 1 2977 0
## 121 0 25 125 2 0 0 0 0 0 2977 0
## 123 0 29 140 1 1 0 0 0 2 2977 0
## 124 0 19 138 1 1 0 0 0 2 2977 0
## 125 0 27 124 1 1 0 0 0 0 2922 0
## 126 0 31 215 1 1 0 0 0 2 3005 0
## 127 0 33 109 1 1 0 0 0 1 3033 0
## 128 0 21 185 2 1 0 0 0 2 3042 0
## 129 0 19 189 1 0 0 0 0 2 3062 0
## 130 0 23 130 2 0 0 0 0 1 3062 0
## 131 0 21 160 1 0 0 0 0 0 3062 0
## 132 0 18 90 1 1 0 0 1 0 3062 0
## 133 0 18 90 1 1 0 0 1 0 3062 0
## 134 0 32 132 1 0 0 0 0 4 3080 0
## 135 0 19 132 3 0 0 0 0 0 3090 0
## 136 0 24 115 1 0 0 0 0 2 3090 0
## 137 0 22 85 3 1 0 0 0 0 3090 0
## 138 0 22 120 1 0 0 1 0 1 3100 0
## 139 0 23 128 3 0 0 0 0 0 3104 0
## 140 0 22 130 1 1 0 0 0 0 3132 0
## 141 0 30 95 1 1 0 0 0 2 3147 0
## 142 0 19 115 3 0 0 0 0 0 3175 0
## 143 0 16 110 3 0 0 0 0 0 3175 0
## 144 0 21 110 3 1 0 0 1 0 3203 0
## 145 0 30 153 3 0 0 0 0 0 3203 0
## 146 0 20 103 3 0 0 0 0 0 3203 0
## 147 0 17 119 3 0 0 0 0 0 3225 0
## 148 0 17 119 3 0 0 0 0 0 3225 0
## 149 0 23 119 3 0 0 0 0 2 3232 0
## 150 0 24 110 3 0 0 0 0 0 3232 0
## 151 0 28 140 1 0 0 0 0 0 3234 0
## 154 0 26 133 3 1 2 0 0 0 3260 1
## 155 0 20 169 3 0 1 0 1 1 3274 1
## 156 0 24 115 3 0 0 0 0 2 3274 0
## 159 0 28 250 3 1 0 0 0 6 3303 0
## 160 0 20 141 1 0 2 0 1 1 3317 1
## 161 0 22 158 2 0 1 0 0 2 3317 1
## 162 0 22 112 1 1 2 0 0 0 3317 1
## 163 0 31 150 3 1 0 0 0 2 3321 0
## 164 0 23 115 3 1 0 0 0 1 3331 0
## 166 0 16 112 2 0 0 0 0 0 3374 0
## 167 0 16 135 1 1 0 0 0 0 3374 0
## 168 0 18 229 2 0 0 0 0 0 3402 0
## 169 0 25 140 1 0 0 0 0 1 3416 0
## 170 0 32 134 1 1 1 0 0 4 3430 1
## 172 0 20 121 2 1 0 0 0 0 3444 0
## 173 0 23 190 1 0 0 0 0 0 3459 0
## 174 0 22 131 1 0 0 0 0 1 3460 0
## 175 0 32 170 1 0 0 0 0 0 3473 0
## 176 0 30 110 3 0 0 0 0 0 3544 0
## 177 0 20 127 3 0 0 0 0 0 3487 0
## 179 0 23 123 3 0 0 0 0 0 3544 0
## 180 0 17 120 3 1 0 0 0 0 3572 0
## 181 0 19 105 3 0 0 0 0 0 3572 0
## 182 0 23 130 1 0 0 0 0 0 3586 0
## 183 0 36 175 1 0 0 0 0 0 3600 0
## 184 0 22 125 1 0 0 0 0 1 3614 0
## 185 0 24 133 1 0 0 0 0 0 3614 0
## 186 0 21 134 3 0 0 0 0 2 3629 0
## 187 0 19 235 1 1 0 1 0 0 3629 0
## 188 0 25 95 1 1 3 0 1 0 3637 1
## 189 0 16 135 1 1 0 0 0 0 3643 0
## 190 0 29 135 1 0 0 0 0 1 3651 0
## 191 0 29 154 1 0 0 0 0 1 3651 0
## 192 0 19 147 1 1 0 0 0 0 3651 0
## 193 0 19 147 1 1 0 0 0 0 3651 0
## 195 0 30 137 1 0 0 0 0 1 3699 0
## 196 0 24 110 1 0 0 0 0 1 3728 0
## 197 0 19 184 1 1 0 1 0 0 3756 0
## 199 0 24 110 3 0 1 0 0 0 3770 1
## 200 0 23 110 1 0 0 0 0 1 3770 0
## 201 0 20 120 3 0 0 0 0 0 3770 0
## 202 0 25 241 2 0 0 1 0 0 3790 0
## 203 0 30 112 1 0 0 0 0 1 3799 0
## 204 0 22 169 1 0 0 0 0 0 3827 0
## 205 0 18 120 1 1 0 0 0 2 3856 0
## 206 0 16 170 2 0 0 0 0 4 3860 0
## 207 0 32 186 1 0 0 0 0 2 3860 0
## 208 0 18 120 3 0 0 0 0 1 3884 0
## 209 0 29 130 1 1 0 0 0 2 3884 0
## 210 0 33 117 1 0 0 0 1 1 3912 0
## 211 0 20 170 1 1 0 0 0 0 3940 0
## 212 0 28 134 3 0 0 0 0 1 3941 0
## 213 0 14 135 1 0 0 0 0 0 3941 0
## 214 0 28 130 3 0 0 0 0 0 3969 0
## 215 0 25 120 1 0 0 0 0 2 3983 0
## 216 0 16 95 3 0 0 0 0 1 3997 0
## 217 0 20 158 1 0 0 0 0 1 3997 0
## 218 0 26 160 3 0 0 0 0 0 4054 0
## 219 0 21 115 1 0 0 0 0 1 4054 0
## 220 0 22 129 1 0 0 0 0 0 4111 0
## 221 0 25 130 1 0 0 0 0 2 4153 0
## 222 0 31 120 1 0 0 0 0 2 4167 0
## 223 0 35 170 1 0 1 0 0 1 4174 1
## 224 0 19 120 1 1 0 0 0 0 4238 0
## 225 0 24 116 1 0 0 0 0 1 4593 0
## 226 0 45 123 1 0 0 0 0 1 4990 0
## 4 1 28 120 3 1 1 0 1 0 709 1
## 10 1 29 130 1 0 0 0 1 2 1021 0
## 11 1 34 187 2 1 0 1 0 0 1135 0
## 13 1 25 105 3 0 1 1 0 0 1330 1
## 15 1 25 85 3 0 0 0 1 0 1474 0
## 16 1 27 150 3 0 0 0 0 0 1588 0
## 17 1 23 97 3 0 0 0 1 1 1588 0
## 18 1 24 128 2 0 1 0 0 1 1701 1
## 19 1 24 132 3 0 0 1 0 0 1729 0
## 20 1 21 165 1 1 0 1 0 1 1790 0
## 22 1 32 105 1 1 0 0 0 0 1818 0
## 23 1 19 91 1 1 2 0 1 0 1885 1
## 24 1 25 115 3 0 0 0 0 0 1893 0
## 25 1 16 130 3 0 0 0 0 1 1899 0
## 26 1 25 92 1 1 0 0 0 0 1928 0
## 27 1 20 150 1 1 0 0 0 2 1928 0
## 28 1 21 200 2 0 0 0 1 2 1928 0
## 29 1 24 155 1 1 1 0 0 0 1936 1
## 30 1 21 103 3 0 0 0 0 0 1970 0
## 31 1 20 125 3 0 0 0 1 0 2055 0
## 32 1 25 89 3 0 2 0 0 1 2055 1
## 33 1 19 102 1 0 0 0 0 2 2082 0
## 34 1 19 112 1 1 0 0 1 0 2084 0
## 35 1 26 117 1 1 1 0 0 0 2084 1
## 36 1 24 138 1 0 0 0 0 0 2100 0
## 37 1 17 130 3 1 1 0 1 0 2125 1
## 40 1 20 120 2 1 0 0 0 3 2126 0
## 42 1 22 130 1 1 1 0 1 1 2187 1
## 43 1 27 130 2 0 0 0 1 0 2187 0
## 44 1 20 80 3 1 0 0 1 0 2211 0
## 45 1 17 110 1 1 0 0 0 0 2225 0
## 46 1 25 105 3 0 1 0 0 1 2240 1
## 47 1 20 109 3 0 0 0 0 0 2240 0
## 49 1 18 148 3 0 0 0 0 0 2282 0
## 50 1 18 110 2 1 1 0 0 0 2296 1
## 51 1 20 121 1 1 1 0 1 0 2296 1
## 52 1 21 100 3 0 1 0 0 4 2301 1
## 54 1 26 96 3 0 0 0 0 0 2325 0
## 56 1 31 102 1 1 1 0 0 1 2353 1
## 57 1 15 110 1 0 0 0 0 0 2353 0
## 59 1 23 187 2 1 0 0 0 1 2367 0
## 60 1 20 122 2 1 0 0 0 0 2381 0
## 61 1 24 105 2 1 0 0 0 0 2381 0
## 62 1 15 115 3 0 0 0 1 0 2381 0
## 63 1 23 120 3 0 0 0 0 0 2410 0
## 65 1 30 142 1 1 1 0 0 0 2410 1
## 67 1 22 130 1 1 0 0 0 1 2410 0
## 68 1 17 120 1 1 0 0 0 3 2414 0
## 69 1 23 110 1 1 1 0 0 0 2424 1
## 71 1 17 120 2 0 0 0 0 2 2438 0
## 75 1 26 154 3 0 1 1 0 1 2442 1
## 76 1 20 105 3 0 0 0 0 3 2450 0
## 77 1 26 190 1 1 0 0 0 0 2466 0
## 78 1 14 101 3 1 1 0 0 0 2466 1
## 79 1 28 95 1 1 0 0 0 2 2466 0
## 81 1 14 100 3 0 0 0 0 2 2495 0
## 82 1 23 94 3 1 0 0 0 0 2495 0
## 83 1 17 142 2 0 0 1 0 0 2495 0
## 84 1 21 130 1 1 0 1 0 3 2495 0
For this part, I chose to collapse ptl into 2 categories, 0 and 1, where 0 = no premature labors and 1 = one or more premature labors. This turns ptl into a single binary indicator and makes it easier to determine whether having had a premature labor before has any impact on birth weight, especially since the number of women who have had more than 1 premature labor is so small.
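A quick summary table along the lines suggested in the next part, assuming the ptl2 variable just created, shows how the probability of low birthweight differs between the two ptl2 groups:
#proportion of low-birthweight babies within each level of ptl2 (rows sum to 1)
prop.table(table(ptl2 = birthwt$ptl2, low = birthwt$low), margin = 1)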
e. Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
#creating ftv2, where 0 = no physician visits in first trimester, 1 = 1+ physician visits in first trimester
birthwt$ftv2 = ifelse(birthwt$ftv > 0, 1, 0)
birthwt
## low age lwt race smoke ptl ht ui ftv bwt ptl2 ftv2
## 85 0 19 182 2 0 0 0 1 0 2523 0 0
## 86 0 33 155 3 0 0 0 0 3 2551 0 1
## 87 0 20 105 1 1 0 0 0 1 2557 0 1
## 88 0 21 108 1 1 0 0 1 2 2594 0 1
## 89 0 18 107 1 1 0 0 1 0 2600 0 0
## 91 0 21 124 3 0 0 0 0 0 2622 0 0
## 92 0 22 118 1 0 0 0 0 1 2637 0 1
## 93 0 17 103 3 0 0 0 0 1 2637 0 1
## 94 0 29 123 1 1 0 0 0 1 2663 0 1
## 95 0 26 113 1 1 0 0 0 0 2665 0 0
## 96 0 19 95 3 0 0 0 0 0 2722 0 0
## 97 0 19 150 3 0 0 0 0 1 2733 0 1
## 98 0 22 95 3 0 0 1 0 0 2751 0 0
## 99 0 30 107 3 0 1 0 1 2 2750 1 1
## 100 0 18 100 1 1 0 0 0 0 2769 0 0
## 101 0 18 100 1 1 0 0 0 0 2769 0 0
## 102 0 15 98 2 0 0 0 0 0 2778 0 0
## 103 0 25 118 1 1 0 0 0 3 2782 0 1
## 104 0 20 120 3 0 0 0 1 0 2807 0 0
## 105 0 28 120 1 1 0 0 0 1 2821 0 1
## 106 0 32 121 3 0 0 0 0 2 2835 0 1
## 107 0 31 100 1 0 0 0 1 3 2835 0 1
## 108 0 36 202 1 0 0 0 0 1 2836 0 1
## 109 0 28 120 3 0 0 0 0 0 2863 0 0
## 111 0 25 120 3 0 0 0 1 2 2877 0 1
## 112 0 28 167 1 0 0 0 0 0 2877 0 0
## 113 0 17 122 1 1 0 0 0 0 2906 0 0
## 114 0 29 150 1 0 0 0 0 2 2920 0 1
## 115 0 26 168 2 1 0 0 0 0 2920 0 0
## 116 0 17 113 2 0 0 0 0 1 2920 0 1
## 117 0 17 113 2 0 0 0 0 1 2920 0 1
## 118 0 24 90 1 1 1 0 0 1 2948 1 1
## 119 0 35 121 2 1 1 0 0 1 2948 1 1
## 120 0 25 155 1 0 0 0 0 1 2977 0 1
## 121 0 25 125 2 0 0 0 0 0 2977 0 0
## 123 0 29 140 1 1 0 0 0 2 2977 0 1
## 124 0 19 138 1 1 0 0 0 2 2977 0 1
## 125 0 27 124 1 1 0 0 0 0 2922 0 0
## 126 0 31 215 1 1 0 0 0 2 3005 0 1
## 127 0 33 109 1 1 0 0 0 1 3033 0 1
## 128 0 21 185 2 1 0 0 0 2 3042 0 1
## 129 0 19 189 1 0 0 0 0 2 3062 0 1
## 130 0 23 130 2 0 0 0 0 1 3062 0 1
## 131 0 21 160 1 0 0 0 0 0 3062 0 0
## 132 0 18 90 1 1 0 0 1 0 3062 0 0
## 133 0 18 90 1 1 0 0 1 0 3062 0 0
## 134 0 32 132 1 0 0 0 0 4 3080 0 1
## 135 0 19 132 3 0 0 0 0 0 3090 0 0
## 136 0 24 115 1 0 0 0 0 2 3090 0 1
## 137 0 22 85 3 1 0 0 0 0 3090 0 0
## 138 0 22 120 1 0 0 1 0 1 3100 0 1
## 139 0 23 128 3 0 0 0 0 0 3104 0 0
## 140 0 22 130 1 1 0 0 0 0 3132 0 0
## 141 0 30 95 1 1 0 0 0 2 3147 0 1
## 142 0 19 115 3 0 0 0 0 0 3175 0 0
## 143 0 16 110 3 0 0 0 0 0 3175 0 0
## 144 0 21 110 3 1 0 0 1 0 3203 0 0
## 145 0 30 153 3 0 0 0 0 0 3203 0 0
## 146 0 20 103 3 0 0 0 0 0 3203 0 0
## 147 0 17 119 3 0 0 0 0 0 3225 0 0
## 148 0 17 119 3 0 0 0 0 0 3225 0 0
## 149 0 23 119 3 0 0 0 0 2 3232 0 1
## 150 0 24 110 3 0 0 0 0 0 3232 0 0
## 151 0 28 140 1 0 0 0 0 0 3234 0 0
## 154 0 26 133 3 1 2 0 0 0 3260 1 0
## 155 0 20 169 3 0 1 0 1 1 3274 1 1
## 156 0 24 115 3 0 0 0 0 2 3274 0 1
## 159 0 28 250 3 1 0 0 0 6 3303 0 1
## 160 0 20 141 1 0 2 0 1 1 3317 1 1
## 161 0 22 158 2 0 1 0 0 2 3317 1 1
## 162 0 22 112 1 1 2 0 0 0 3317 1 0
## 163 0 31 150 3 1 0 0 0 2 3321 0 1
## 164 0 23 115 3 1 0 0 0 1 3331 0 1
## 166 0 16 112 2 0 0 0 0 0 3374 0 0
## 167 0 16 135 1 1 0 0 0 0 3374 0 0
## 168 0 18 229 2 0 0 0 0 0 3402 0 0
## 169 0 25 140 1 0 0 0 0 1 3416 0 1
## 170 0 32 134 1 1 1 0 0 4 3430 1 1
## 172 0 20 121 2 1 0 0 0 0 3444 0 0
## 173 0 23 190 1 0 0 0 0 0 3459 0 0
## 174 0 22 131 1 0 0 0 0 1 3460 0 1
## 175 0 32 170 1 0 0 0 0 0 3473 0 0
## 176 0 30 110 3 0 0 0 0 0 3544 0 0
## 177 0 20 127 3 0 0 0 0 0 3487 0 0
## 179 0 23 123 3 0 0 0 0 0 3544 0 0
## 180 0 17 120 3 1 0 0 0 0 3572 0 0
## 181 0 19 105 3 0 0 0 0 0 3572 0 0
## 182 0 23 130 1 0 0 0 0 0 3586 0 0
## 183 0 36 175 1 0 0 0 0 0 3600 0 0
## 184 0 22 125 1 0 0 0 0 1 3614 0 1
## 185 0 24 133 1 0 0 0 0 0 3614 0 0
## 186 0 21 134 3 0 0 0 0 2 3629 0 1
## 187 0 19 235 1 1 0 1 0 0 3629 0 0
## 188 0 25 95 1 1 3 0 1 0 3637 1 0
## 189 0 16 135 1 1 0 0 0 0 3643 0 0
## 190 0 29 135 1 0 0 0 0 1 3651 0 1
## 191 0 29 154 1 0 0 0 0 1 3651 0 1
## 192 0 19 147 1 1 0 0 0 0 3651 0 0
## 193 0 19 147 1 1 0 0 0 0 3651 0 0
## 195 0 30 137 1 0 0 0 0 1 3699 0 1
## 196 0 24 110 1 0 0 0 0 1 3728 0 1
## 197 0 19 184 1 1 0 1 0 0 3756 0 0
## 199 0 24 110 3 0 1 0 0 0 3770 1 0
## 200 0 23 110 1 0 0 0 0 1 3770 0 1
## 201 0 20 120 3 0 0 0 0 0 3770 0 0
## 202 0 25 241 2 0 0 1 0 0 3790 0 0
## 203 0 30 112 1 0 0 0 0 1 3799 0 1
## 204 0 22 169 1 0 0 0 0 0 3827 0 0
## 205 0 18 120 1 1 0 0 0 2 3856 0 1
## 206 0 16 170 2 0 0 0 0 4 3860 0 1
## 207 0 32 186 1 0 0 0 0 2 3860 0 1
## 208 0 18 120 3 0 0 0 0 1 3884 0 1
## 209 0 29 130 1 1 0 0 0 2 3884 0 1
## 210 0 33 117 1 0 0 0 1 1 3912 0 1
## 211 0 20 170 1 1 0 0 0 0 3940 0 0
## 212 0 28 134 3 0 0 0 0 1 3941 0 1
## 213 0 14 135 1 0 0 0 0 0 3941 0 0
## 214 0 28 130 3 0 0 0 0 0 3969 0 0
## 215 0 25 120 1 0 0 0 0 2 3983 0 1
## 216 0 16 95 3 0 0 0 0 1 3997 0 1
## 217 0 20 158 1 0 0 0 0 1 3997 0 1
## 218 0 26 160 3 0 0 0 0 0 4054 0 0
## 219 0 21 115 1 0 0 0 0 1 4054 0 1
## 220 0 22 129 1 0 0 0 0 0 4111 0 0
## 221 0 25 130 1 0 0 0 0 2 4153 0 1
## 222 0 31 120 1 0 0 0 0 2 4167 0 1
## 223 0 35 170 1 0 1 0 0 1 4174 1 1
## 224 0 19 120 1 1 0 0 0 0 4238 0 0
## 225 0 24 116 1 0 0 0 0 1 4593 0 1
## 226 0 45 123 1 0 0 0 0 1 4990 0 1
## 4 1 28 120 3 1 1 0 1 0 709 1 0
## 10 1 29 130 1 0 0 0 1 2 1021 0 1
## 11 1 34 187 2 1 0 1 0 0 1135 0 0
## 13 1 25 105 3 0 1 1 0 0 1330 1 0
## 15 1 25 85 3 0 0 0 1 0 1474 0 0
## 16 1 27 150 3 0 0 0 0 0 1588 0 0
## 17 1 23 97 3 0 0 0 1 1 1588 0 1
## 18 1 24 128 2 0 1 0 0 1 1701 1 1
## 19 1 24 132 3 0 0 1 0 0 1729 0 0
## 20 1 21 165 1 1 0 1 0 1 1790 0 1
## 22 1 32 105 1 1 0 0 0 0 1818 0 0
## 23 1 19 91 1 1 2 0 1 0 1885 1 0
## 24 1 25 115 3 0 0 0 0 0 1893 0 0
## 25 1 16 130 3 0 0 0 0 1 1899 0 1
## 26 1 25 92 1 1 0 0 0 0 1928 0 0
## 27 1 20 150 1 1 0 0 0 2 1928 0 1
## 28 1 21 200 2 0 0 0 1 2 1928 0 1
## 29 1 24 155 1 1 1 0 0 0 1936 1 0
## 30 1 21 103 3 0 0 0 0 0 1970 0 0
## 31 1 20 125 3 0 0 0 1 0 2055 0 0
## 32 1 25 89 3 0 2 0 0 1 2055 1 1
## 33 1 19 102 1 0 0 0 0 2 2082 0 1
## 34 1 19 112 1 1 0 0 1 0 2084 0 0
## 35 1 26 117 1 1 1 0 0 0 2084 1 0
## 36 1 24 138 1 0 0 0 0 0 2100 0 0
## 37 1 17 130 3 1 1 0 1 0 2125 1 0
## 40 1 20 120 2 1 0 0 0 3 2126 0 1
## 42 1 22 130 1 1 1 0 1 1 2187 1 1
## 43 1 27 130 2 0 0 0 1 0 2187 0 0
## 44 1 20 80 3 1 0 0 1 0 2211 0 0
## 45 1 17 110 1 1 0 0 0 0 2225 0 0
## 46 1 25 105 3 0 1 0 0 1 2240 1 1
## 47 1 20 109 3 0 0 0 0 0 2240 0 0
## 49 1 18 148 3 0 0 0 0 0 2282 0 0
## 50 1 18 110 2 1 1 0 0 0 2296 1 0
## 51 1 20 121 1 1 1 0 1 0 2296 1 0
## 52 1 21 100 3 0 1 0 0 4 2301 1 1
## 54 1 26 96 3 0 0 0 0 0 2325 0 0
## 56 1 31 102 1 1 1 0 0 1 2353 1 1
## 57 1 15 110 1 0 0 0 0 0 2353 0 0
## 59 1 23 187 2 1 0 0 0 1 2367 0 1
## 60 1 20 122 2 1 0 0 0 0 2381 0 0
## 61 1 24 105 2 1 0 0 0 0 2381 0 0
## 62 1 15 115 3 0 0 0 1 0 2381 0 0
## 63 1 23 120 3 0 0 0 0 0 2410 0 0
## 65 1 30 142 1 1 1 0 0 0 2410 1 0
## 67 1 22 130 1 1 0 0 0 1 2410 0 1
## 68 1 17 120 1 1 0 0 0 3 2414 0 1
## 69 1 23 110 1 1 1 0 0 0 2424 1 0
## 71 1 17 120 2 0 0 0 0 2 2438 0 1
## 75 1 26 154 3 0 1 1 0 1 2442 1 1
## 76 1 20 105 3 0 0 0 0 3 2450 0 1
## 77 1 26 190 1 1 0 0 0 0 2466 0 0
## 78 1 14 101 3 1 1 0 0 0 2466 1 0
## 79 1 28 95 1 1 0 0 0 2 2466 0 1
## 81 1 14 100 3 0 0 0 0 2 2495 0 1
## 82 1 23 94 3 1 0 0 0 0 2495 0 0
## 83 1 17 142 2 0 0 1 0 0 2495 0 0
## 84 1 21 130 1 1 0 1 0 3 2495 0 1
Similarly to ptl2, I chose to make ftv2 a binary variable where 0 = no physician visits in the first trimester and 1 = one or more visits in the first trimester. As I mentioned in a previous discussion, I don’t think that multiple physician visits necessarily have any bearing on the birthweight of a child, especially since ftv was not statistically significant in my initial logistic regression modeling.
#correlation table to assess relationship between ftv2 and low
ftv2_low_corr = cor(birthwt[,c('low', 'ftv2')])
ftv2_low_corr
## low ftv2
## low 1.0000000 -0.1094147
## ftv2 -0.1094147 1.0000000
Looking at this correlation table, there does not appear to be a very strong correlation between ‘low’ and ‘ftv2’. The correlation is negative, which suggests that mothers with at least one first-trimester physician visit (ftv2 = 1) were slightly less likely to have a low-birthweight baby (low = 1). However, because the correlation coefficient is so small (about -0.11), this relationship is weak, and it would be unreasonable to conclude that visiting a physician by itself makes a child more likely to be born weighing above 2.5kg.
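A table of low-birthweight proportions by ftv2, as the question suggests, may be easier to read than a correlation between two binary variables; a sketch assuming the ftv2 variable created above:
#proportion of low-birthweight babies within each level of ftv2 (rows sum to 1)
prop.table(table(ftv2 = birthwt$ftv2, low = birthwt$low), margin = 1)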
f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ptl2 and ftv2 in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?
new_bw_lr = glm(low ~ age + lwt + race + smoke + ht + ui +ptl2 + ftv2, family = "binomial", data = birthwt)
summary(new_bw_lr)
##
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ht + ui + ptl2 +
## ftv2, family = "binomial", data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.132302 1.325225 0.100 0.92048
## age -0.041760 0.037844 -1.103 0.26981
## lwt -0.011848 0.006755 -1.754 0.07943 .
## race 0.404931 0.224863 1.801 0.07174 .
## smoke 0.816451 0.416669 1.959 0.05006 .
## ht 1.795749 0.702141 2.558 0.01054 *
## ui 0.657830 0.468951 1.403 0.16069
## ptl2 1.249407 0.465305 2.685 0.00725 **
## ftv2 -0.103561 0.373413 -0.277 0.78152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 199.45 on 180 degrees of freedom
## AIC: 217.45
##
## Number of Fisher Scoring iterations: 4
Based on this new logistic regression model with my new variables, ptl2 and ftv2, it appears that ptl2 is now the most statistically significant predictor in the model, implying that this new variable is very important in predicting whether or not a child’s birthweight will be below 2.5kg. However, ftv2 is still not statistically significant, which I predicted based on the very small correlation between ftv2 and ‘low’, the outcome my model is aiming to predict. A direct comparison of this model with the one from part b) is sketched below.
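Since the part b) model is nested inside this one (it simply omits ptl2 and ftv2), the two can also be compared directly; a sketch assuming bw_lr and new_bw_lr from above:
#compare the two logistic regression models: lower AIC favors the part f) model
AIC(bw_lr, new_bw_lr)
#likelihood ratio test for adding ptl2 and ftv2
anova(bw_lr, new_bw_lr, test = "Chisq")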
g. In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
#split the data into an 80/20 training/testing set and set the seed
library(TeachingDemos)
char2seed("TakehomeExam") #setting the seed using TeachingDemos
bw_train_index = sample(1:nrow(birthwt), 0.8*nrow(birthwt))
bw_train = birthwt[bw_train_index, ]
bw_test = birthwt[-bw_train_index, ]
#fit LDA model using the variables with statistical significance in most recent logistic regression
bw_lda = lda(low ~ lwt + race + smoke + ht + ptl2, data = bw_train)
#confusion matrix
lda_predictions = predict(bw_lda, newdata = bw_test)
lda_class = lda_predictions$class
lda_conf = table(Predicted = lda_class, Actual = bw_test$low)
lda_conf
## Actual
## Predicted 0 1
## 0 24 6
## 1 3 5
#calculate TN/FN/TP/FP
metrics <- function(confusion_matrix) {
  #assumes a 2x2 table with predicted classes in rows and actual classes in columns,
  #with the negative class (0) listed first in each dimension
  TN <- confusion_matrix[1, 1] #predicted 0, actual 0
  FN <- confusion_matrix[1, 2] #predicted 0, actual 1
  FP <- confusion_matrix[2, 1] #predicted 1, actual 0
  TP <- confusion_matrix[2, 2] #predicted 1, actual 1
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  return(list(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
}
#sensitivity, specificity, and accuracy
lda_metrics = metrics(lda_conf)
lda_metrics
## $sensitivity
## [1] 0.625
##
## $specificity
## [1] 0.8
##
## $accuracy
## [1] 0.7631579
Looking at the LDA model, it appears that the sensitivity is 62.5%, the specificity is 80%, and the model is 76.3% accurate.
#fit QDA with the same variables as LDA
bw_qda = qda(low ~ lwt + race + smoke + ht + ptl2, data = bw_train)
#confusion matrix
qda_predictions = predict(bw_qda, newdata = bw_test)
qda_class = qda_predictions$class
qda_conf = table(Predicted = qda_class, Actual = bw_test$low)
qda_conf
## Actual
## Predicted 0 1
## 0 24 6
## 1 3 5
#sensitivity, specificity, and accuracy
qda_metrics = metrics(qda_conf)
qda_metrics
## $sensitivity
## [1] 0.625
##
## $specificity
## [1] 0.8
##
## $accuracy
## [1] 0.7631579
Looking at the QDA model, it has the same metrics as the LDA model, so I can conclude that they perform the same.
#training the logistic regression model from (f) on the training data
f_lr = glm(low ~ age + lwt + race + smoke + ht + ui +ptl2 + ftv2, family = "binomial", data = bw_train)
#confusion matrix
f_predictions = predict(f_lr, newdata = bw_test, type = "response")
f_class = factor(ifelse(f_predictions > 0.5, "Low", "Normal"))
f_conf = table(Predicted = f_class, Actual = bw_test$low)
f_conf
## Actual
## Predicted 0 1
## Low 2 5
## Normal 25 6
#metrics
f_metrics = metrics(f_conf)
f_metrics
## $sensitivity
## [1] 0.1935484
##
## $specificity
## [1] 0.2857143
##
## $accuracy
## [1] 0.2105263
Looking at the logistic regression model from part (f), it has a much lower sensitivity than the LDA/QDA models at 19.4%. The specificity is also lower at 28.6%, and the accuracy is much lower at around 21%.
#training the logistic regression model from (b) on the training data
b_lr = glm(low ~ age + lwt + race + smoke + ht + ui, family = "binomial", data = bw_train)
#confusion matrix
b_predictions = predict(b_lr, newdata = bw_test, type = "response")
b_class = factor(ifelse(b_predictions > 0.5, "Low", "Normal"))
b_conf = table(Predicted = b_class, Actual = bw_test$low)
b_conf
## Actual
## Predicted 0 1
## Low 2 4
## Normal 25 7
#metrics
b_metrics = metrics(b_conf)
b_metrics
## $sensitivity
## [1] 0.21875
##
## $specificity
## [1] 0.3333333
##
## $accuracy
## [1] 0.2368421
Finally, looking at the logistic regression model from part (b), its sensitivity is a bit higher than the one from part (f) at 21.88%, but still lower than the LDA/QDA models. The specificity is a bit higher than the model from part (f) at 33.3%, and the accuracy is also a little bit higher at 23.7%.
The model(s) that perform the best are LDA and QDA. Between the two, I would recommend the QDA model, because QDA does not assume the two classes share the same covariance structure, and the relationships in this data do not appear to be linear, so QDA may fit it better.
h. Using your final model from f), interpret the estimates for all covariates.
Because QDA models don’t print coefficients in the way that logistic/linear regression models do, I had to analyze the model in a slightly different way. I instead printed the means of each predictor variable in the two groups: not below 2.5kg (low = 0) and below 2.5kg (low = 1).
bw_qda$means
## lwt race smoke ht ptl2
## 0 133.4369 1.757282 0.3592233 0.02912621 0.08737864
## 1 123.8333 1.937500 0.5208333 0.12500000 0.29166667
Because race, smoke, ht, and ptl2 are categorical variables, not continuous, I decided to interpret them using contingency tables.
#contingency table for 'race' variable
table(birthwt$race, birthwt$low)
##
## 0 1
## 1 73 23
## 2 15 11
## 3 42 25
prop.table(table(birthwt$race, birthwt$low), margin = 1)
##
## 0 1
## 1 0.7604167 0.2395833
## 2 0.5769231 0.4230769
## 3 0.6268657 0.3731343
#contingency table for 'smoke' variable
table(birthwt$smoke, birthwt$low)
##
## 0 1
## 0 86 29
## 1 44 30
prop.table(table(birthwt$smoke, birthwt$low), margin = 1)
##
## 0 1
## 0 0.7478261 0.2521739
## 1 0.5945946 0.4054054
#contingency table for 'ht' variable
table(birthwt$ht, birthwt$low)
##
## 0 1
## 0 125 52
## 1 5 7
prop.table(table(birthwt$ht, birthwt$low), margin = 1)
##
## 0 1
## 0 0.7062147 0.2937853
## 1 0.4166667 0.5833333
#contingency table for 'ptl2' variable
table(birthwt$ptl2, birthwt$low)
##
## 0 1
## 0 118 41
## 1 12 18
prop.table(table(birthwt$ptl2, birthwt$low), margin = 1)
##
## 0 1
## 0 0.7421384 0.2578616
## 1 0.4000000 0.6000000