Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The data come from OpenIntro Statistics (https://www.openintro.org/stat/).
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
variable | description |
---|---|
fage | father’s age in years. |
mage | mother’s age in years. |
mature | maturity status of mother. |
weeks | length of pregnancy in weeks. |
premie | whether the birth was classified as premature (premie) or full-term. |
visits | number of hospital visits during pregnancy. |
marital | whether mother is married or not married at birth. |
gained | weight gained by mother during pregnancy in pounds. |
weight | weight of the baby at birth in pounds. |
lowbirthweight | whether baby was classified as low birthweight (low) or not (not low). |
gender | gender of the baby, female or male. |
habit | status of the mother as a nonsmoker or a smoker. |
whitemom | whether mom is white or not white. |
dim(nc)
## [1] 1000 13
summary(nc)
## fage mage mature weeks
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00
## Median :30.00 Median :27 Median :39.00
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## premie visits marital gained
## full term:846 Min. : 0.0 married :386 Min. : 0.00
## premie :152 1st Qu.:10.0 not married:613 1st Qu.:20.00
## NA's : 2 Median :12.0 NA's : 1 Median :30.00
## Mean :12.1 Mean :30.33
## 3rd Qu.:15.0 3rd Qu.:38.00
## Max. :30.0 Max. :85.00
## NA's :9 NA's :27
## weight lowbirthweight gender habit
## Min. : 1.000 low :111 female:503 nonsmoker:873
## 1st Qu.: 6.380 not low:889 male :497 smoker :126
## Median : 7.310 NA's : 1
## Mean : 7.101
## 3rd Qu.: 8.060
## Max. :11.750
##
## whitemom
## not white:284
## white :714
## NA's : 2
##
##
##
##
Data Preparation
For the linear regression, three numerical variables are selected.
independent variables: mage, weeks
dependent variable: weight
(To avoid the effect of gender, only female babies are taken into consideration.)
female.babies <- subset(nc,gender=='female')
female.babies.reg <- female.babies[,c('mage','weeks','weight')]
female.babies.reg <- female.babies.reg[complete.cases(female.babies.reg),]
summary(female.babies.reg)
## mage weeks weight
## Min. :15.00 Min. :20.00 Min. : 1.000
## 1st Qu.:22.00 1st Qu.:37.00 1st Qu.: 6.265
## Median :26.00 Median :39.00 Median : 7.130
## Mean :27.03 Mean :38.28 Mean : 6.909
## 3rd Qu.:32.00 3rd Qu.:40.00 3rd Qu.: 7.750
## Max. :50.00 Max. :44.00 Max. :11.630
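The filtering step above silently drops any row with a missing value in the three selected columns. A minimal sketch of how one might count what gets dropped, using a toy data frame with made-up values (the names `toy`, `keep`, `n_dropped` and `toy_clean` are illustrative and not part of the nc data):

```r
# Toy data frame with hypothetical values (NOT the nc sample)
toy <- data.frame(
  mage   = c(25, 31, NA, 40),
  weeks  = c(39, NA, 38, 41),
  weight = c(7.1, 6.8, 7.5, 8.0)
)

# TRUE for rows with no missing value in any column
keep <- complete.cases(toy)

# Number of rows that will be removed, and the cleaned data
n_dropped <- sum(!keep)
toy_clean <- toy[keep, ]

# Missing values per column, to see which variable causes the loss
colSums(is.na(toy))
```

Applied to this report, the same idea would be `sum(!complete.cases(female.babies[, c('mage','weeks','weight')]))`. The model output later reports 500 residual degrees of freedom, which corresponds to 502 rows out of the 503 female babies.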
Observation
plot(female.babies.reg$mage,female.babies.reg$weight,xlab = 'mother\'s age in years',ylab = 'weight of the baby at birth in pounds')
cor(female.babies.reg$weight,female.babies.reg$mage)
## [1] 0.01726081
No linear relationship is apparent between mother’s age and baby’s weight at birth.
The correlation coefficient confirms this conclusion, since 0.017 is close to zero.
plot(female.babies.reg$weeks,female.babies.reg$weight,xlab = 'length of pregnancy in weeks',ylab = 'weight of the baby at birth in pounds')
cor(female.babies.reg$weight,female.babies.reg$weeks)
## [1] 0.6895168
The relationship between length of pregnancy and baby’s weight looks linear.
The strength of the relationship is quantified by the correlation coefficient: at about 0.69, it indicates a moderately strong positive association.
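In a simple regression with a single predictor, the squared correlation equals the model's R-squared, so the two correlations above can also be read as shares of the variance in birth weight:

\[
r_{\text{weeks}}^2 = 0.6895168^2 \approx 0.475, \qquad r_{\text{mage}}^2 = 0.01726081^2 \approx 0.0003
\]

Length of pregnancy therefore accounts for roughly 48% of the variation in birth weight in this sample, and mother's age for almost none. These values agree with the Multiple R-squared reported by summary() in the next section.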
Linear Model
m.mage <- lm(weight ~ mage,data = female.babies.reg)
summary(m.mage)
##
## Call:
## lm(formula = weight ~ mage, data = female.babies.reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9174 -0.6300 0.2187 0.8511 4.7003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.799240 0.292838 23.218 <2e-16 ***
## mage 0.004076 0.010559 0.386 0.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.471 on 500 degrees of freedom
## Multiple R-squared: 0.0002979, Adjusted R-squared: -0.001701
## F-statistic: 0.149 on 1 and 500 DF, p-value: 0.6996
plot(female.babies.reg$weight ~ female.babies.reg$mage)
abline(m.mage)
Based on the table, the least squares regression line for the linear model is:
\[
\hat{y} = 6.799240 + 0.004076 \times \text{mage}
\]
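As a check on how to read the coefficient table, each t value is the estimate divided by its standard error. For the slope on mage:

\[
t = \frac{0.004076}{0.010559} \approx 0.386
\]

With 500 residual degrees of freedom, this corresponds to the two-sided p-value of about 0.70 shown in the table, so the slope cannot be distinguished from zero.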
m.weeks <- lm(weight ~ weeks,data = female.babies.reg)
summary(m.weeks)
##
## Call:
## lm(formula = weight ~ weeks, data = female.babies.reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1123 -0.6210 -0.0435 0.6241 3.7347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.92509 0.60478 -9.797 <2e-16 ***
## weeks 0.33529 0.01575 21.288 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.066 on 500 degrees of freedom
## Multiple R-squared: 0.4754, Adjusted R-squared: 0.4744
## F-statistic: 453.2 on 1 and 500 DF, p-value: < 2.2e-16
plot(female.babies.reg$weight ~ female.babies.reg$weeks)
abline(m.weeks)
Based on the table, the least squares regression line for the linear model is:
\[
\hat{y} = -5.92509 + 0.33529 \times \text{weeks}
\]
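To make the fitted line concrete, the sketch below plugs a pregnancy length into the equation. The coefficients are copied from the summary output above; the names `b0`, `b1` and `predict_weight` are illustrative helpers, not part of the original analysis.

```r
# Coefficients copied from the summary(m.weeks) output above
b0 <- -5.92509   # intercept
b1 <-  0.33529   # slope: pounds per additional week of pregnancy

# Hypothetical helper: predicted birth weight (pounds) for a given
# pregnancy length in weeks
predict_weight <- function(weeks) b0 + b1 * weeks

predict_weight(40)         # about 7.49 pounds
predict_weight(c(36, 40))  # vectorised over several lengths
```

With the fitted model in the session, `predict(m.weeks, newdata = data.frame(weeks = 40))` should give the same value up to rounding. The line is only meaningful within the observed range of 20 to 44 weeks; the negative intercept is an extrapolation to zero weeks and has no practical interpretation.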
Residual Analysis
hist(m.mage$residuals) #nearly normal residuals
qqnorm(m.mage$residuals) #normal probability plot of the residuals
qqline(m.mage$residuals) #adds diagonal line to the normal prob plot
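The histogram shows a slight skew, but not enough to invalidate the model. The residual summary above (median 0.22, minimum -5.92, maximum 4.70) suggests the longer tail is on the left. The QQ plot shows a similar tendency, so the nearly normal residuals condition is approximately met.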
hist(m.weeks$residuals) #nearly normal residuals
qqnorm(m.weeks$residuals) #normal probability plot of the residuals
qqline(m.weeks$residuals) #adds diagonal line to the normal prob plot
The histogram is nearly normal and the points in the QQ plot lie close to the line, so the nearly normal residuals condition is met.
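The histogram and QQ plot only address the normality of the residuals. Linearity and constant variance are usually checked with a plot of residuals against fitted values. The sketch below shows that check on simulated data (the data frame `sim` and model `m.sim` are made up for illustration and are not the nc sample):

```r
# Simulated data for illustration only (NOT the nc sample)
set.seed(1)
weeks  <- round(runif(200, min = 30, max = 44))
weight <- -5.9 + 0.34 * weeks + rnorm(200, sd = 1)
sim    <- data.frame(weeks = weeks, weight = weight)

m.sim <- lm(weight ~ weeks, data = sim)

# Residuals vs fitted values: curvature would suggest non-linearity,
# a fan shape would suggest non-constant variance
plot(fitted(m.sim), resid(m.sim),
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 3)  # reference line at zero
```

For the models in this report, the equivalent call would be `plot(fitted(m.weeks), resid(m.weeks))` followed by `abline(h = 0, lty = 3)`. A patternless band around zero supports the linear model.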
Conclusion
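The linear model of weight on length of pregnancy appears to be appropriate: its residuals are nearly normal, the slope is highly significant (p < 2.2e-16), and the model explains about 48% of the variation in birth weight (R-squared 0.4754). This matches the expectation that length of pregnancy affects a baby’s weight at birth. Mother’s age, by contrast, has almost no linear relationship with birth weight (R-squared 0.0003, p = 0.70), so the model on mage, while not violating the normality condition badly, has no practical explanatory value.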