MSDS Spring 2018

DATA 605 Fundamentals of Computational Mathematics

Jiadi Li

Week 11 Discussion: Regression Model

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The data is from OpenIntro Statistics (https://www.openintro.org/stat/).

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
variable        description
fage            father’s age in years.
mage            mother’s age in years.
mature          maturity status of mother.
weeks           length of pregnancy in weeks.
premie          whether the birth was classified as premature (premie) or full-term.
visits          number of hospital visits during pregnancy.
marital         whether mother is married or not married at birth.
gained          weight gained by mother during pregnancy in pounds.
weight          weight of the baby at birth in pounds.
lowbirthweight  whether baby was classified as low birthweight (low) or not (not low).
gender          gender of the baby, female or male.
habit           status of the mother as a nonsmoker or a smoker.
whitemom        whether mom is white or not white.
dim(nc)
## [1] 1000   13
summary(nc)
##       fage            mage            mature        weeks      
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00  
##  Median :30.00   Median :27                     Median :39.00  
##  Mean   :30.26   Mean   :27                     Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                     Max.   :45.00  
##  NA's   :171                                    NA's   :2      
##        premie        visits            marital        gained     
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.00  
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.00  
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.00  
##                  Mean   :12.1                     Mean   :30.33  
##                  3rd Qu.:15.0                     3rd Qu.:38.00  
##                  Max.   :30.0                     Max.   :85.00  
##                  NA's   :9                        NA's   :27     
##      weight       lowbirthweight    gender          habit    
##  Min.   : 1.000   low    :111    female:503   nonsmoker:873  
##  1st Qu.: 6.380   not low:889    male  :497   smoker   :126  
##  Median : 7.310                               NA's     :  1  
##  Mean   : 7.101                                              
##  3rd Qu.: 8.060                                              
##  Max.   :11.750                                              
##                                                              
##       whitemom  
##  not white:284  
##  white    :714  
##  NA's     :  2  
##                 
##                 
##                 
## 

Data Preparation

For the linear regression, several numerical variables are selected.
Independent variables: mage, weeks
Dependent variable: weight
(To avoid the effect of gender, only female babies are considered.)

female.babies <- subset(nc,gender=='female')
female.babies.reg <- female.babies[,c('mage','weeks','weight')]
female.babies.reg <- female.babies.reg[complete.cases(female.babies.reg),]
summary(female.babies.reg)
##       mage           weeks           weight      
##  Min.   :15.00   Min.   :20.00   Min.   : 1.000  
##  1st Qu.:22.00   1st Qu.:37.00   1st Qu.: 6.265  
##  Median :26.00   Median :39.00   Median : 7.130  
##  Mean   :27.03   Mean   :38.28   Mean   : 6.909  
##  3rd Qu.:32.00   3rd Qu.:40.00   3rd Qu.: 7.750  
##  Max.   :50.00   Max.   :44.00   Max.   :11.630
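
To make the filtering steps above easier to follow, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its values are hypothetical and are not taken from the nc sample; the sketch only illustrates how `subset()` and `complete.cases()` behave, including how to count the rows that are dropped.

```r
# Hypothetical toy data standing in for nc (all values are made up).
toy <- data.frame(
  gender = c("female", "male", "female", "female"),
  mage   = c(25, 31, 19, 40),
  weeks  = c(39, 40, NA, 37),
  weight = c(7.1, 8.0, 6.5, 6.9)
)

toy.female <- subset(toy, gender == "female")           # keep female rows only
toy.reg    <- toy.female[, c("mage", "weeks", "weight")] # keep the model variables
kept       <- complete.cases(toy.reg)                    # TRUE where a row has no NA
toy.reg    <- toy.reg[kept, ]

sum(!kept)     # number of rows dropped because of missing values
nrow(toy.reg)  # number of rows left for the regression
```

Running the same two counts on female.babies.reg would show how many female births were excluded for missing values before the models were fit.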

Observation

plot(female.babies.reg$mage,female.babies.reg$weight,xlab = 'mother\'s age in years',ylab = 'weight of the baby at birth in pounds')

cor(female.babies.reg$weight,female.babies.reg$mage)
## [1] 0.01726081

No linear relationship is apparent between mother’s age and baby’s weight at birth.
The correlation coefficient confirms this, since 0.017 is very close to zero.

plot(female.babies.reg$weeks,female.babies.reg$weight,xlab = 'length of pregnancy in weeks',ylab = 'weight of the baby at birth in pounds')

cor(female.babies.reg$weight,female.babies.reg$weeks)
## [1] 0.6895168

The relationship between length of pregnancy and baby’s weight looks linear.
The strength of the relationship can be quantified by the correlation coefficient (about 0.69), which indicates a moderate positive relationship.
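
As a reading aid for the two correlations above: in a linear regression with a single predictor, the squared correlation equals the share of the variation in weight that the predictor explains. A minimal sketch, with both coefficients copied by hand from the cor() output (the names `r.mage` and `r.weeks` are used only for illustration):

```r
# Correlation coefficients copied from the cor() output above.
r.mage  <- 0.01726081
r.weeks <- 0.6895168

# With one predictor, R-squared of the linear model equals r squared.
r.mage^2   # roughly 0.0003: mother's age explains almost none of the variation
r.weeks^2  # roughly 0.475: length of pregnancy explains a little under half
```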

Linear Model

m.mage <- lm(weight ~ mage,data = female.babies.reg)
summary(m.mage)
## 
## Call:
## lm(formula = weight ~ mage, data = female.babies.reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9174 -0.6300  0.2187  0.8511  4.7003 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.799240   0.292838  23.218   <2e-16 ***
## mage        0.004076   0.010559   0.386      0.7    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.471 on 500 degrees of freedom
## Multiple R-squared:  0.0002979,  Adjusted R-squared:  -0.001701 
## F-statistic: 0.149 on 1 and 500 DF,  p-value: 0.6996
plot(female.babies.reg$weight ~ female.babies.reg$mage)
abline(m.mage)

Based on the table, the least squares regression line for the linear model is:
\[ \hat{y} = 6.799240 + 0.004076 \times \text{mage} \]

m.weeks <- lm(weight ~ weeks,data = female.babies.reg)
summary(m.weeks)
## 
## Call:
## lm(formula = weight ~ weeks, data = female.babies.reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1123 -0.6210 -0.0435  0.6241  3.7347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.92509    0.60478  -9.797   <2e-16 ***
## weeks        0.33529    0.01575  21.288   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.066 on 500 degrees of freedom
## Multiple R-squared:  0.4754, Adjusted R-squared:  0.4744 
## F-statistic: 453.2 on 1 and 500 DF,  p-value: < 2.2e-16
plot(female.babies.reg$weight ~ female.babies.reg$weeks)
abline(m.weeks)

Based on the table, the least squares regression line for the linear model is:
\[ \hat{y} = -5.92509 + 0.33529 \times \text{weeks} \]
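
To show how this line is used, the sketch below plugs one value of weeks into the fitted equation. The coefficients are copied by hand from the summary table; the choice of 39 weeks (the sample median) and the names `b0`, `b1`, `weeks.new` and `weight.hat` are only for illustration.

```r
# Coefficients copied from the summary(m.weeks) table.
b0 <- -5.92509  # intercept
b1 <- 0.33529   # slope: pounds per additional week of pregnancy

weeks.new  <- 39                  # illustrative value (the sample median)
weight.hat <- b0 + b1 * weeks.new # predicted birth weight in pounds
weight.hat                        # about 7.15 pounds
```

With the fitted model object available, predict(m.weeks, newdata = data.frame(weeks = 39)) should give the same value up to rounding of the coefficients. The negative intercept has no practical meaning, since weeks = 0 lies far outside the observed range of 20 to 44 weeks.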

Residual Analysis

hist(m.mage$residuals) #nearly normal residuals

qqnorm(m.mage$residuals) #normal probability plot of the residuals
qqline(m.mage$residuals) #adds diagonal line to the normal prob plot

The histogram shows a slight left skew (the residual summary has a median above zero and a longer lower tail), but not enough to invalidate the model. The QQ plot shows a similar tendency, so the nearly normal residuals condition is reasonably met.

hist(m.weeks$residuals) #nearly normal residuals

qqnorm(m.weeks$residuals) #normal probability plot of the residuals
qqline(m.weeks$residuals) #adds diagonal line to the normal prob plot

The histogram is nearly normal and the points in the QQ plot lie close to the line, so the nearly normal residuals condition is met.
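
The histogram and QQ plot only address normality. A residuals versus fitted values plot is a common additional check for linearity and constant variability. The sketch below is not part of the original analysis: it uses synthetic data (the data frame `toy` and model `m.toy` are hypothetical stand-ins, with made-up parameters) so that it runs on its own.

```r
# Synthetic stand-in data (NOT the nc sample): a linear trend plus noise.
set.seed(605)
toy <- data.frame(weeks = round(runif(200, min = 30, max = 44)))
toy$weight <- -6 + 0.34 * toy$weeks + rnorm(200, sd = 1)

m.toy <- lm(weight ~ weeks, data = toy)

# Residuals vs fitted values: curvature suggests non-linearity,
# a fan shape suggests non-constant variability.
plot(fitted(m.toy), resid(m.toy),
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 3)  # dotted reference line at zero
```

The same two calls applied to m.weeks, that is plot(m.weeks$fitted.values, m.weeks$residuals) followed by abline(h = 0, lty = 3), would show whether the residuals scatter evenly around zero across the range of fitted weights.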

Conclusion

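The linear model of weight on weeks appears to be appropriate based on the model diagnostics examined (nearly normal residuals condition) and on the fit itself: the slope is highly significant and length of pregnancy explains about 48% of the variation in birth weight (R-squared = 0.4754). Mother’s age, by contrast, has almost no linear relationship with birth weight (R-squared = 0.0003, p-value = 0.70), so the model on mage is of little practical use even though its residuals are roughly normal.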