I am going to build a multiple regression model to find out which variables affect a newborn's weight. Source: Low Birth Weight Study data from https://www.statcrunch.com/5.0/shareddata.php?keywords=birth+weight

newborn.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/data605_week11/master/DS5.csv?token=Ac3_PhlQcevZsvGDro1Q3wc09ffcsX3Mks5b9iPfwA%3D%3D', header=TRUE)
head(newborn.dt)
##   ID LOW AGE LWT RACE SMOKE PTL HT UI FTV  BWT
## 1 85   0  19 182    2     0   0  0  1   0 2523
## 2 86   0  33 155    3     0   0  0  0   3 2551
## 3 87   0  20 105    1     1   0  0  0   1 2557
## 4 88   0  21 108    1     1   0  0  1   2 2594
## 5 89   0  18 107    1     1   0  0  1   0 2600
## 6 91   0  21 124    3     0   0  0  0   0 2622

Description of data:

ID Identification Code

LOW Low Birth Weight (0=Birth Weight >= 2500g, 1=Birth Weight < 2500g)

AGE Age of the Mother in Years

LWT Weight of Mother in Pounds at the Last Menstrual Period

RACE Race (1 = White, 2 = Black, 3 = Other)

SMOKE Smoking Status During Pregnancy (1 = Yes, 0 = No)

PTL History of Premature Labor (0 = None, 1 = One, etc.)

HT History of Hypertension (1 = Yes, 0 = No)

UI Presence of Uterine Irritability (1 = Yes, 0 = No)

FTV Number of Physician Visits During the First Trimester

BWT Birth Weight in Grams
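RACE, SMOKE, HT and UI are category codes rather than quantities, so it can help to store them as labelled factors before modelling. The sketch below is only an illustration: it rebuilds three columns of the six rows shown by head() above in a small data frame (the name newborn.head is made up for this example) and recodes them with the labels from the description.

```r
# First six rows of RACE, SMOKE and BWT, copied from the head() output above.
# newborn.head is a hypothetical name used only for this illustration.
newborn.head <- data.frame(
  RACE  = c(2, 3, 1, 1, 1, 3),
  SMOKE = c(0, 0, 1, 1, 1, 0),
  BWT   = c(2523, 2551, 2557, 2594, 2600, 2622)
)

# Recode the numeric codes as labelled factors, following the data description
newborn.head$RACE <- factor(newborn.head$RACE,
                            levels = c(1, 2, 3),
                            labels = c("White", "Black", "Other"))
newborn.head$SMOKE <- factor(newborn.head$SMOKE,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

str(newborn.head)
```

With RACE stored as a factor, lm() would estimate a separate offset for each group instead of a single slope across the codes 1, 2, 3.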

summary(newborn.dt)
##        ID             LOW              AGE             LWT       
##  Min.   :  4.0   Min.   :0.0000   Min.   :14.00   Min.   : 80.0  
##  1st Qu.: 68.0   1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0  
##  Median :123.0   Median :0.0000   Median :23.00   Median :121.0  
##  Mean   :121.1   Mean   :0.3122   Mean   :23.24   Mean   :129.8  
##  3rd Qu.:176.0   3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0  
##  Max.   :226.0   Max.   :1.0000   Max.   :45.00   Max.   :250.0  
##       RACE           SMOKE             PTL               HT         
##  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :1.847   Mean   :0.3915   Mean   :0.1958   Mean   :0.06349  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :3.000   Max.   :1.0000   Max.   :3.0000   Max.   :1.00000  
##        UI              FTV              BWT      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :0.0000   Median :2977  
##  Mean   :0.1481   Mean   :0.7937   Mean   :2945  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3475  
##  Max.   :1.0000   Max.   :6.0000   Max.   :4990
pairs(newborn.dt, gap = 0.5)

model <- lm(BWT ~ LOW + AGE + LWT + RACE + SMOKE + PTL + HT + UI + FTV, data = newborn.dt)
summary(model)
## 
## Call:
## lm(formula = BWT ~ LOW + AGE + LWT + RACE + SMOKE + PTL + HT + 
##     UI + FTV, data = newborn.dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -991.06 -299.92   -7.87  275.53 1639.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3615.947    229.159  15.779  < 2e-16 ***
## LOW         -1131.941     73.861 -15.325  < 2e-16 ***
## AGE            -6.329      6.338  -0.999 0.319382    
## LWT             1.049      1.132   0.927 0.355225    
## RACE         -101.837     38.494  -2.646 0.008883 ** 
## SMOKE        -172.572     71.907  -2.400 0.017423 *  
## PTL            80.987     68.463   1.183 0.238400    
## HT           -181.672    137.482  -1.321 0.188046    
## UI           -335.511     93.193  -3.600 0.000412 ***
## FTV            -7.587     30.951  -0.245 0.806649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 433.2 on 179 degrees of freedom
## Multiple R-squared:  0.6639, Adjusted R-squared:  0.647 
## F-statistic: 39.28 on 9 and 179 DF,  p-value: < 2.2e-16
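One possible refinement, sketched here under assumptions: LOW is defined from BWT itself, and RACE is a three-level category entered as a number, so an alternative fit could drop LOW and treat RACE as a factor. To keep the example self-contained it uses the birthwt data set from the MASS package, which appears to hold the same study (189 rows, lowercase column names). That substitution and the name fit.alt are assumptions, not part of the original analysis.

```r
# Assumption: MASS::birthwt is used as a stand-in for newborn.dt
birthwt <- MASS::birthwt

# Alternative model: no LOW (it is derived from the response) and
# RACE as a factor, so each group gets its own coefficient
fit.alt <- lm(bwt ~ age + lwt + factor(race) + smoke + ptl + ht + ui + ftv,
              data = birthwt)

summary(fit.alt)
```

The coefficients and R-squared of this fit will differ from the output above, because LOW no longer carries information about the response into the predictors.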
par(mfrow=c(2,2))
hist(model$residuals, main = "Histogram of Residuals", xlab= "")
plot(fitted(model), model$residuals, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
qqnorm(model$residuals)
qqline(model$residuals)
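The plots above give a visual check of the residuals. A numeric complement could be a Shapiro-Wilk test, sketched below. As an assumption, the example refits the same formula on MASS::birthwt (lowercase column names) so that it runs on its own; the names fit.doc and sw are made up for this sketch.

```r
# Assumption: MASS::birthwt stands in for newborn.dt (same variables, lowercase)
birthwt <- MASS::birthwt

# Same predictors as the model above
fit.doc <- lm(bwt ~ low + age + lwt + race + smoke + ptl + ht + ui + ftv,
              data = birthwt)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value would indicate a departure from normality
sw <- shapiro.test(residuals(fit.doc))
sw
```

A test like this is sensitive to sample size, so it is best read together with the histogram and Q-Q plot rather than instead of them.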

Conclusion:

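The multiple R-squared is 0.6639 (adjusted 0.647), so the model explains about 66 percent of the variation in birth weight. At the 0.05 level the significant predictors are LOW, RACE, SMOKE and UI; AGE, LWT, PTL, HT and FTV are not. Note that LOW is defined directly from BWT (birth weight below 2500g), so its large coefficient is expected by construction rather than evidence of a separate risk factor. The histogram and Q-Q plot suggest the residuals are approximately normally distributed, which supports using the linear model.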

Model: BWT = 3615.947 - 1131.941 * LOW - 6.329 * AGE + 1.049 * LWT - 101.837 * RACE - 172.572 * SMOKE + 80.987 * PTL - 181.672 * HT - 335.511 * UI - 7.587 * FTV
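To show how the fitted equation is used, here is a small worked prediction. The coefficients are copied from the summary output; the mother's profile (age 25, 130 pounds, RACE = 1, non-smoker, one first-trimester visit, all other indicators 0) is invented for illustration.

```r
# Coefficients copied from the summary(model) output
coefs <- c("(Intercept)" = 3615.947, LOW = -1131.941, AGE = -6.329,
           LWT = 1.049, RACE = -101.837, SMOKE = -172.572,
           PTL = 80.987, HT = -181.672, UI = -335.511, FTV = -7.587)

# Hypothetical mother; the 1 under "(Intercept)" multiplies the intercept
mother <- c("(Intercept)" = 1, LOW = 0, AGE = 25, LWT = 130, RACE = 1,
            SMOKE = 0, PTL = 0, HT = 0, UI = 0, FTV = 1)

# Predicted birth weight in grams: sum of coefficient * value
predicted.bwt <- sum(coefs * mother[names(coefs)])
predicted.bwt
# 3484.668
```

Setting SMOKE to 1 in the same profile lowers the prediction by 172.572 grams, which is how the SMOKE coefficient is read with the other variables held fixed.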