I am going to build a multiple regression model to find out which variables affect a newborn's weight. Source: Low Birth Weight Study data from https://www.statcrunch.com/5.0/shareddata.php?keywords=birth+weight
newborn.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/data605_week11/master/DS5.csv?token=Ac3_PhlQcevZsvGDro1Q3wc09ffcsX3Mks5b9iPfwA%3D%3D', header=TRUE)
head(newborn.dt)
## ID LOW AGE LWT RACE SMOKE PTL HT UI FTV BWT
## 1 85 0 19 182 2 0 0 0 1 0 2523
## 2 86 0 33 155 3 0 0 0 0 3 2551
## 3 87 0 20 105 1 1 0 0 0 1 2557
## 4 88 0 21 108 1 1 0 0 1 2 2594
## 5 89 0 18 107 1 1 0 0 1 0 2600
## 6 91 0 21 124 3 0 0 0 0 0 2622
Description of data:
ID Identification Code
LOW Low Birth Weight (0=Birth Weight >= 2500g, 1=Birth Weight < 2500g)
AGE Age of the Mother in Years
LWT Weight of Mother in Pounds at the Last Menstrual Period
RACE Race (1 = White, 2 = Black, 3 = Other)
SMOKE Smoking Status During Pregnancy (1 = Yes, 0 = No)
PTL History of Premature Labor (0 = None, 1 = One, etc.)
HT History of Hypertension (1 = Yes, 0 = No)
UI Presence of Uterine Irritability (1 = Yes, 0 = No)
FTV Number of Physician Visits During the First Trimester
BWT Birth Weight in Grams
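The coding above suggests two things worth checking before modelling: RACE is a set of category labels rather than a quantity, and LOW is simply BWT cut at 2500 g. The following is a minimal sketch that uses only the six rows printed by head(), retyped by hand so it runs on its own. The names `demo`, `RACE_F` and `low_from_bwt` are illustrative and not part of the original analysis.

```r
# The six rows shown by head(newborn.dt), retyped so the example is self-contained.
demo <- data.frame(
  ID    = c(85, 86, 87, 88, 89, 91),
  LOW   = c(0, 0, 0, 0, 0, 0),
  AGE   = c(19, 33, 20, 21, 18, 21),
  LWT   = c(182, 155, 105, 108, 107, 124),
  RACE  = c(2, 3, 1, 1, 1, 3),
  SMOKE = c(0, 0, 1, 1, 1, 0),
  PTL   = c(0, 0, 0, 0, 0, 0),
  HT    = c(0, 0, 0, 0, 0, 0),
  UI    = c(1, 0, 0, 1, 1, 0),
  FTV   = c(0, 3, 1, 2, 0, 0),
  BWT   = c(2523, 2551, 2557, 2594, 2600, 2622)
)

# RACE holds codes 1/2/3; a factor keeps lm() from treating them as a numeric scale.
demo$RACE_F <- factor(demo$RACE, levels = c(1, 2, 3),
                      labels = c("White", "Black", "Other"))

# LOW is defined from BWT, so it can be rebuilt from BWT alone.
low_from_bwt <- as.integer(demo$BWT < 2500)
all(demo$LOW == low_from_bwt)

table(demo$RACE_F)
```

If the same check holds on the full data, LOW carries no information beyond BWT itself, which is worth keeping in mind when it appears as a predictor of BWT.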
summary(newborn.dt)
## ID LOW AGE LWT
## Min. : 4.0 Min. :0.0000 Min. :14.00 Min. : 80.0
## 1st Qu.: 68.0 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0
## Median :123.0 Median :0.0000 Median :23.00 Median :121.0
## Mean :121.1 Mean :0.3122 Mean :23.24 Mean :129.8
## 3rd Qu.:176.0 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :226.0 Max. :1.0000 Max. :45.00 Max. :250.0
## RACE SMOKE PTL HT
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :1.847 Mean :0.3915 Mean :0.1958 Mean :0.06349
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :3.000 Max. :1.0000 Max. :3.0000 Max. :1.00000
## UI FTV BWT
## Min. :0.0000 Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :0.0000 Median :2977
## Mean :0.1481 Mean :0.7937 Mean :2945
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3475
## Max. :1.0000 Max. :6.0000 Max. :4990
pairs(newborn.dt, gap = 0.5)
model<- lm(BWT~ LOW + AGE + LWT + RACE + SMOKE + PTL+ HT + UI + FTV , data = newborn.dt)
summary(model)
##
## Call:
## lm(formula = BWT ~ LOW + AGE + LWT + RACE + SMOKE + PTL + HT +
## UI + FTV, data = newborn.dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -991.06 -299.92 -7.87 275.53 1639.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3615.947 229.159 15.779 < 2e-16 ***
## LOW -1131.941 73.861 -15.325 < 2e-16 ***
## AGE -6.329 6.338 -0.999 0.319382
## LWT 1.049 1.132 0.927 0.355225
## RACE -101.837 38.494 -2.646 0.008883 **
## SMOKE -172.572 71.907 -2.400 0.017423 *
## PTL 80.987 68.463 1.183 0.238400
## HT -181.672 137.482 -1.321 0.188046
## UI -335.511 93.193 -3.600 0.000412 ***
## FTV -7.587 30.951 -0.245 0.806649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 433.2 on 179 degrees of freedom
## Multiple R-squared: 0.6639, Adjusted R-squared: 0.647
## F-statistic: 39.28 on 9 and 179 DF, p-value: < 2.2e-16
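The adjusted R-squared in the output can be reproduced by hand from the multiple R-squared, which shows how the penalty for the number of predictors works. This sketch assumes n = 189 observations (179 residual degrees of freedom plus 10 estimated coefficients) and p = 9 predictors. The variable names are illustrative.

```r
# Values read off the summary(model) output.
r2 <- 0.6639   # multiple R-squared
n  <- 189      # observations: 179 residual df + 10 coefficients
p  <- 9        # predictors, excluding the intercept

# Adjusted R-squared shrinks R-squared according to the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)   # 0.647, matching the summary output
```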
par(mfrow=c(2,2))
hist(model$residuals, main = "Histogram of Residuals", xlab= "")
plot(fitted(model), model$residuals, xlab = "Fitted values", ylab = "Residuals")
qqnorm(model$residuals)
qqline(model$residuals)
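The plots give a visual impression of normality, and a formal test can complement them. The sketch below makes a loud assumption: it uses `MASS::birthwt`, which appears to be the same 189-row low birth weight study with lowercase column names, so that the example runs without the CSV download. With the original data, `shapiro.test(residuals(model))` would be the equivalent call.

```r
library(MASS)  # recommended package, normally installed with R

# Assumption: MASS::birthwt matches the CSV used above (lowercase column names).
bw  <- MASS::birthwt
fit <- lm(bwt ~ low + age + lwt + race + smoke + ptl + ht + ui + ftv, data = bw)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(fit))
sw$statistic   # W close to 1 is consistent with normality
sw$p.value     # a small p-value would be evidence against normality
```

A non-significant result does not prove normality. It only means the test found no strong evidence against it, so it is best read alongside the histogram and Q-Q plot.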
Conclusion:
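The multiple R-squared is 0.6639 (adjusted 0.647), so the model explains about 66 percent of the variation in birth weight. At the 0.05 level the significant predictors are LOW, RACE, SMOKE and UI; AGE, LWT, PTL, HT and FTV are not significant. Because LOW is defined directly from BWT (1 when BWT < 2500 g), its large coefficient and much of the R-squared reflect that definition rather than an independent risk factor. The histogram and Q-Q plot suggest that the residuals are approximately normally distributed, which supports the normality assumption of the linear model.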
Model: BWT = 3615.947 - 1131.941 * LOW - 6.329 * AGE + 1.049 * LWT - 101.837 * RACE - 172.572 * SMOKE + 80.987 * PTL - 181.672 * HT - 335.511 * UI - 7.587 * FTV
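To see how the fitted equation is used, here is a small worked example that plugs the first row printed by head() into the coefficients copied from the summary(model) output. The helper `predict_bwt` and the vector names are illustrative. With the fitted model object available, `predict(model, newdata = ...)` would do the same job.

```r
# Coefficients copied from the summary(model) output.
coefs <- c(Intercept = 3615.947, LOW = -1131.941, AGE = -6.329, LWT = 1.049,
           RACE = -101.837, SMOKE = -172.572, PTL = 80.987, HT = -181.672,
           UI = -335.511, FTV = -7.587)

# Predicted birth weight (grams) for one named vector of predictor values.
predict_bwt <- function(x) {
  unname(coefs["Intercept"] + sum(coefs[names(x)] * x))
}

# First row of head(newborn.dt): ID 85, observed BWT = 2523 g.
first_row <- c(LOW = 0, AGE = 19, LWT = 182, RACE = 2, SMOKE = 0,
               PTL = 0, HT = 0, UI = 1, FTV = 0)

pred <- predict_bwt(first_row)
pred          # about 3147.4 g
2523 - pred   # residual, about -624.4 g
```

The residual of roughly -624 g lies well inside the residual range reported in the summary (-991.06 to 1639.26).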