MSDS 5043 Regression Project

Abstract

The purpose of this study was to first identify any observable trends and/or correlations between our outcome variable of interest (birth weight of newborn children) and the various health states of their mothers. Our goal was to develop a multiple regression model that could feasibly predict what a given child’s birth weight would be based on certain conditions of the mother (ie: smoking status, age, race, etc.) While we ultimately found that certain variables displayed a strong significance with the birth weight variable, our overall model did not yield much predictive power (with an R squared rating of 0.2383), despite multiple approaches in development.

Introduction

As stated above, our team’s primary objective was to find meaningful relationships between a given mother’s health state and the birth weight of her child, and develop a model that could reasonably predict said birth weight. Previous studies have shown that variables such as smoking status and prior premature labors can have a significant impact on a newborn’s overall health and weight, and we wanted to confirm these correlations for ourselves so that we may better educate and consult future parents about maintaining the health of their unborn children throughout the pregancy process.

Overview of the Dataset

This dataset is an R dataframe called “Risk Factors Associated with Low Infant Birth Weight,” this data was collected at Baystate Medical Center, Springfield, Mass during 1986. Variable descriptions are below (as defined verbatim by the creator of this set):

low - indicator of birth weight less than 2.5kg

age - mother’s age in years

lwt - mother’s weight in pounds at last menstrual period

race - mother’s race (1=white, 2=black, 3=other)

smoke - smoking status during pregnancy

ptl - number of previous premature labors

ht - history of hypertension

ui - presence of uterine irritability

ftv - number of physician visits during the first trimester

bwt - birth weight in grams

Import Dataset and Convert Factor Variables

#Import table, drop first column, and run pairs matrix
birthwt <- read_csv("birthwt.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   low = col_double(),
##   age = col_double(),
##   lwt = col_double(),
##   race = col_double(),
##   smoke = col_double(),
##   ptl = col_double(),
##   ht = col_double(),
##   ui = col_double(),
##   ftv = col_double(),
##   bwt = col_double()
## )

birthwt$X1<- NULL

#Convert factor variables
birthwt$low <- factor(birthwt$low, levels = c(0,1),labels = c("No", "Yes"))
birthwt$race <- factor(birthwt$race, levels = c(1:3), labels=c("white","black","other"))
birthwt$smoke <- factor(birthwt$smoke, levels = c(0,1), labels = c("No", "Yes"))
birthwt$ht <- factor(birthwt$ht, levels = c(0,1), labels = c("No", "Yes"))
birthwt$ui <- factor(birthwt$ui, levels = c(0,1),labels = c("No", "Yes"))
birthwt$ptl <- factor(birthwt$ptl)
birthwt$ftv <- factor(birthwt$ftv)

Selected 7th Record of Dataset to Use for Prediction Later

our.obs <-birthwt[7,]

Exploratory Analysis

View Summary

summary(birthwt)

low age lwt race smoke ptl
No :130 Min. :14.00 Min. : 80.0 white:96 No :115 0:159
Yes: 59 1st Qu.:19.00 1st Qu.:110.0 black:26 Yes: 74 1: 24
Median :23.00 Median :121.0 other:67 2: 5
Mean :23.24 Mean :129.8 3: 1
3rd Qu.:26.00 3rd Qu.:140.0
Max. :45.00 Max. :250.0
ht ui ftv bwt
No :177 No :161 0:100 Min. : 709
Yes: 12 Yes: 28 1: 47 1st Qu.:2414
2: 30 Median :2977
3: 7 Mean :2945
4: 4 3rd Qu.:3487
6: 1 Max. :4990

Histograms of Categorical/Discrete Variables

barplot(table(birthwt$race), xlab= "Race", ylab="Frequency")

barplot(table(birthwt$smoke), xlab= "Smoking Status", ylab="Frequency")

barplot(table(birthwt$ht), xlab= "Hypertension State", ylab="Frequency")

barplot(table(birthwt$ui), xlab= "Uterine Irritability", ylab="Frequency")

barplot(table(birthwt$ptl), xlab= "Previous Premature Labors", ylab="Frequency")

Scatter Plot of Mother’s Weight (lwt) vs. Birth Weight (bwt)

ggplot(data=birthwt, aes(x=lwt, y=bwt)) +
  geom_point()

Get Boxplots of Categorical Variables in Relation to Birth Weight

#Boxplots of categorical variables
ggplot(data=birthwt, aes(x=ptl, y=bwt)) +
  geom_boxplot()

ggplot(data=birthwt, aes(x=smoke, y=bwt)) +
  geom_boxplot()

ggplot(data=birthwt, aes(x=ui, y=bwt)) +
  geom_boxplot()

ggplot(data=birthwt, aes(x=race, y=bwt)) +
  geom_boxplot()

#Pairs plots and correlations of numerical variables
pairs(birthwt)

cor(x=birthwt$age,y=birthwt$bwt)

[1] 0.09031781

cor(x=birthwt$lwt,y=birthwt$bwt)

[1] 0.1857333

Notes

-The spread of data points in numerical pair plots don’t appear to show any discernable linear (or non-linear) trends.

-Boxplots of categorical variables appear to show observable differences between various factor levels, especially with ptl, smoke, and ui. Factor levels of race appear to be much closer together in terms of data spread.

-Correlations between numerical predictors and the outcome variable are very low.

Identification and Evaluation of a Suitable Model

Backwards Stepwise Method

#Start with a backward stepwise method to narrow down variables
bwt.fit <- lm(bwt~.*., data=birthwt)
summary(bwt.fit)

stepAIC(bwt.fit, direction="backward")

Remove Insignificant Variables and Rerun

#Copy the model called by the above code, and remove insignificant/non-applicable variables.
#Removed all interactions along with ftv and the singular "low" variable, as it is defined by our outcome variable of bwt.

bwt.fit2<-lm(formula = bwt ~ lwt + race + smoke + ptl + ht + 
               ui, data = birthwt)
summary(bwt.fit2)

Call: lm(formula = bwt ~ lwt + race + smoke + ptl + ht + ui, data = birthwt)

Residuals: Min 1Q Median 3Q Max -1830.54 -441.19 44.76 482.39 1626.09

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2834.324 241.596 11.732 < 2e-16 lwt 4.306 1.659 2.595 0.01024
raceblack -445.614 144.003 -3.094 0.00229 raceother -310.801 111.480 -2.788 0.00588 ** smokeYes -330.181 103.875 -3.179 0.00174 ** ptl1 -294.426 143.421 -2.053 0.04154 *
ptl2 -15.485 293.639 -0.053 0.95800
ptl3 1266.335 654.582 1.935 0.05462 .
htYes -573.974 196.937 -2.915 0.00402 ** uiYes -542.511 136.105 -3.986 9.77e-05 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 636.4 on 179 degrees of freedom Multiple R-squared: 0.2747, Adjusted R-squared: 0.2383 F-statistic: 7.534 on 9 and 179 DF, p-value: 2.505e-09

Note: Adjusted R squared of this model is 0.2383, indicating low predictive power. Both manual selection and ASR methods were also explored when selecting variables, though this backwards stepwise approach yielded the best result in our case.

Leverage Plots

leveragePlots(bwt.fit2, pch=16)

Residual Plot

bwt.fit2.df <- augment(bwt.fit2)
ggplot(bwt.fit2.df, aes(x=.fitted, y=.resid))+
  geom_point()+
  geom_hline(yintercept=0, linetype=2)+
  labs(x="Fitted Values", y="Residuals")

NCV and Shapiro Tests

ncvTest(bwt.fit2)

Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 0.02383701, Df = 1, p = 0.8773

shapiro.test(bwt.fit2$residuals)

Shapiro-Wilk normality test

data: bwt.fit2$residuals W = 0.99294, p-value = 0.4973

-NCV test shows a very high p-value, indicating a constant variance among residuals. No concerns.

-Shapiro-Wilk test is showing an average p-value, indicating fair normality amongst the spread of residuals.

Application of the Model

Apply Randomly Selected Observation to Model for Prediction

predict(bwt.fit2, newdata = our.obs, interval="predict")

   fit      lwr      upr

1 3342.386 2073.566 4611.206

For record #7, our model produced a prediction of 3,342g with a prediction interval of (2,074, 4,611). In other words, we’re 95% confident that the true value of bwt given the values of lwt, race, smoke, ptl, ht, and ui for record #7 is between 2,074 and 4,611g. In this case, the actual vale does fall within this interval at 2,637g. The large difference between the actual and predicted values was expected, most likely due to the lower R squared rating of our model.

Limitations and Assumptions

General Observations:

• Dataset is small (189 observations), would have preferred a larger set

• Data was pulled from only one city at one medical facility. Would have preferred more variety here.

• There isn’t much spread in data points across factor levels of certain categorical variables, particularly ui, ht, and ptl.

• Data was collected in 1986; fairly dated at this point.

• There is no indication of who collected the data.

• Categorical variables could have been more granular. ie: amount and frequency of smoking per patient at various time periods, or actually listing out other races instead of just using an “other” factor level.

• While the predictor varibles chosen for this study were shown to be significant for buliding our model, the model itself turned out to be a rather weak fit for the spread of data. There are likely a number of reasons for this, such as the relatively small size of the dataset, the broadness of some of our categorical varibles (ie: simply using “Yes” or “No” for smoking status), and the limited number of levels for some variables.

• Furthermore, the distribution of the data points in the residual plot did not seem to indicate any obvious nonlinear visual trends (quadratic, exponential, etc.), therefore there’s no guarantee that transforming certain variables would have improved our model’s performance.

Variables Missing:

• How long the mother had been smoking

• If they smoked the entire pregnancy, or just through a certain period of it

• How many cigarettes smoked a single day, on average

• Height of both mother and infant, or bmi preferrably

• Indicator of recreational drug use

• Gestational age of infant

• Average alcohol consumption over the course of the pregnancy

• Insurance coverage/access to wellness programs specifically targetted for pregnancies

• Indicator of gestational diabetes