The purpose of this study was to first identify any observable trends and/or correlations between our outcome variable of interest (birth weight of newborn children) and the various health states of their mothers. Our goal was to develop a multiple regression model that could feasibly predict what a given child’s birth weight would be based on certain conditions of the mother (ie: smoking status, age, race, etc.) While we ultimately found that certain variables displayed a strong significance with the birth weight variable, our overall model did not yield much predictive power (with an R squared rating of 0.2383), despite multiple approaches in development.
As stated above, our team’s primary objective was to find meaningful relationships between a given mother’s health state and the birth weight of her child, and develop a model that could reasonably predict said birth weight. Previous studies have shown that variables such as smoking status and prior premature labors can have a significant impact on a newborn’s overall health and weight, and we wanted to confirm these correlations for ourselves so that we may better educate and consult future parents about maintaining the health of their unborn children throughout the pregancy process.
This dataset is an R dataframe called “Risk Factors Associated with Low Infant Birth Weight,” this data was collected at Baystate Medical Center, Springfield, Mass during 1986. Variable descriptions are below (as defined verbatim by the creator of this set):
low - indicator of birth weight less than 2.5kg
age - mother’s age in years
lwt - mother’s weight in pounds at last menstrual period
race - mother’s race (1=white, 2=black, 3=other)
smoke - smoking status during pregnancy
ptl - number of previous premature labors
ht - history of hypertension
ui - presence of uterine irritability
ftv - number of physician visits during the first trimester
bwt - birth weight in grams
#Import table, drop first column, and run pairs matrix
birthwt <- read_csv("birthwt.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## low = col_double(),
## age = col_double(),
## lwt = col_double(),
## race = col_double(),
## smoke = col_double(),
## ptl = col_double(),
## ht = col_double(),
## ui = col_double(),
## ftv = col_double(),
## bwt = col_double()
## )
birthwt$X1<- NULL
#Convert factor variables
birthwt$low <- factor(birthwt$low, levels = c(0,1),labels = c("No", "Yes"))
birthwt$race <- factor(birthwt$race, levels = c(1:3), labels=c("white","black","other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0,1), labels = c("No", "Yes"))
birthwt$ht <- factor(birthwt$ht, levels = c(0,1), labels = c("No", "Yes"))
birthwt$ui <- factor(birthwt$ui, levels = c(0,1),labels = c("No", "Yes"))
birthwt$ptl <- factor(birthwt$ptl)
birthwt$ftv <- factor(birthwt$ftv)
our.obs <-birthwt[7,]
summary(birthwt)
low age lwt race smoke ptl
No :130 Min. :14.00 Min. : 80.0 white:96 No :115 0:159
Yes: 59 1st Qu.:19.00 1st Qu.:110.0 black:26 Yes: 74 1: 24
Median :23.00 Median :121.0 other:67 2: 5
Mean :23.24 Mean :129.8 3: 1
3rd Qu.:26.00 3rd Qu.:140.0
Max. :45.00 Max. :250.0
ht ui ftv bwt
No :177 No :161 0:100 Min. : 709
Yes: 12 Yes: 28 1: 47 1st Qu.:2414
2: 30 Median :2977
3: 7 Mean :2945
4: 4 3rd Qu.:3487
6: 1 Max. :4990
barplot(table(birthwt$race), xlab= "Race", ylab="Frequency")
barplot(table(birthwt$smoke), xlab= "Smoking Status", ylab="Frequency")
barplot(table(birthwt$ht), xlab= "Hypertension State", ylab="Frequency")
barplot(table(birthwt$ui), xlab= "Uterine Irritability", ylab="Frequency")
barplot(table(birthwt$ptl), xlab= "Previous Premature Labors", ylab="Frequency")
ggplot(data=birthwt, aes(x=lwt, y=bwt)) +
geom_point()
#Boxplots of categorical variables
ggplot(data=birthwt, aes(x=ptl, y=bwt)) +
geom_boxplot()
ggplot(data=birthwt, aes(x=smoke, y=bwt)) +
geom_boxplot()
ggplot(data=birthwt, aes(x=ui, y=bwt)) +
geom_boxplot()
ggplot(data=birthwt, aes(x=race, y=bwt)) +
geom_boxplot()
#Pairs plots and correlations of numerical variables
pairs(birthwt)
cor(x=birthwt$age,y=birthwt$bwt)
[1] 0.09031781
cor(x=birthwt$lwt,y=birthwt$bwt)
[1] 0.1857333
Notes
-The spread of data points in numerical pair plots don’t appear to show any discernable linear (or non-linear) trends.
-Boxplots of categorical variables appear to show observable differences between various factor levels, especially with ptl, smoke, and ui. Factor levels of race appear to be much closer together in terms of data spread.
-Correlations between numerical predictors and the outcome variable are very low.
#Start with a backward stepwise method to narrow down variables
bwt.fit <- lm(bwt~.*., data=birthwt)
summary(bwt.fit)
stepAIC(bwt.fit, direction="backward")
#Copy the model called by the above code, and remove insignificant/non-applicable variables.
#Removed all interactions along with ftv and the singular "low" variable, as it is defined by our outcome variable of bwt.
bwt.fit2<-lm(formula = bwt ~ lwt + race + smoke + ptl + ht +
ui, data = birthwt)
summary(bwt.fit2)
Call: lm(formula = bwt ~ lwt + race + smoke + ptl + ht + ui, data = birthwt)
Residuals: Min 1Q Median 3Q Max -1830.54 -441.19 44.76 482.39 1626.09
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2834.324 241.596 11.732 < 2e-16 lwt 4.306 1.659 2.595 0.01024
raceblack -445.614 144.003 -3.094 0.00229 raceother -310.801 111.480 -2.788 0.00588 ** smokeYes -330.181 103.875 -3.179 0.00174 ** ptl1 -294.426 143.421 -2.053 0.04154 *
ptl2 -15.485 293.639 -0.053 0.95800
ptl3 1266.335 654.582 1.935 0.05462 .
htYes -573.974 196.937 -2.915 0.00402 ** uiYes -542.511 136.105 -3.986 9.77e-05 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1
Residual standard error: 636.4 on 179 degrees of freedom Multiple R-squared: 0.2747, Adjusted R-squared: 0.2383 F-statistic: 7.534 on 9 and 179 DF, p-value: 2.505e-09
Note: Adjusted R squared of this model is 0.2383, indicating low predictive power. Both manual selection and ASR methods were also explored when selecting variables, though this backwards stepwise approach yielded the best result in our case.
leveragePlots(bwt.fit2, pch=16)
bwt.fit2.df <- augment(bwt.fit2)
ggplot(bwt.fit2.df, aes(x=.fitted, y=.resid))+
geom_point()+
geom_hline(yintercept=0, linetype=2)+
labs(x="Fitted Values", y="Residuals")
ncvTest(bwt.fit2)
Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 0.02383701, Df = 1, p = 0.8773
shapiro.test(bwt.fit2$residuals)
Shapiro-Wilk normality test
data: bwt.fit2$residuals W = 0.99294, p-value = 0.4973
-NCV test shows a very high p-value, indicating a constant variance among residuals. No concerns.
-Shapiro-Wilk test is showing an average p-value, indicating fair normality amongst the spread of residuals.
predict(bwt.fit2, newdata = our.obs, interval="predict")
fit lwr upr
1 3342.386 2073.566 4611.206
For record #7, our model produced a prediction of 3,342g with a prediction interval of (2,074, 4,611). In other words, we’re 95% confident that the true value of bwt given the values of lwt, race, smoke, ptl, ht, and ui for record #7 is between 2,074 and 4,611g. In this case, the actual vale does fall within this interval at 2,637g. The large difference between the actual and predicted values was expected, most likely due to the lower R squared rating of our model.
General Observations:
• Dataset is small (189 observations), would have preferred a larger set
• Data was pulled from only one city at one medical facility. Would have preferred more variety here.
• There isn’t much spread in data points across factor levels of certain categorical variables, particularly ui, ht, and ptl.
• Data was collected in 1986; fairly dated at this point.
• There is no indication of who collected the data.
• Categorical variables could have been more granular. ie: amount and frequency of smoking per patient at various time periods, or actually listing out other races instead of just using an “other” factor level.
• While the predictor varibles chosen for this study were shown to be significant for buliding our model, the model itself turned out to be a rather weak fit for the spread of data. There are likely a number of reasons for this, such as the relatively small size of the dataset, the broadness of some of our categorical varibles (ie: simply using “Yes” or “No” for smoking status), and the limited number of levels for some variables.
• Furthermore, the distribution of the data points in the residual plot did not seem to indicate any obvious nonlinear visual trends (quadratic, exponential, etc.), therefore there’s no guarantee that transforming certain variables would have improved our model’s performance.
Variables Missing:
• How long the mother had been smoking
• If they smoked the entire pregnancy, or just through a certain period of it
• How many cigarettes smoked a single day, on average
• Height of both mother and infant, or bmi preferrably
• Indicator of recreational drug use
• Gestational age of infant
• Average alcohol consumption over the course of the pregnancy
• Insurance coverage/access to wellness programs specifically targetted for pregnancies
• Indicator of gestational diabetes