Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The data set used here is the “babies” data from the openintro package. It records pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The columns are birth weight in ounces (bwt), gestation in days, parity, the mother's age, the mother's height, the mother's weight, and a binary indicator of whether the mother smoked. Here, gestation is modeled as a function of the mother's weight.
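library(openintro)  # provides the babies data
library(tidyverse)  # provides drop_na() and ggplot()
data(babies)
babies <- drop_na(babies)  # keep complete cases only
summary(babies)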
##       case             bwt          gestation         parity
##  Min.   :   1.0   Min.   : 55.0   Min.   :148.0   Min.   :0.0000
##  1st Qu.: 317.2   1st Qu.:108.0   1st Qu.:272.0   1st Qu.:0.0000
##  Median : 625.5   Median :120.0   Median :280.0   Median :0.0000
##  Mean   : 624.8   Mean   :119.5   Mean   :279.1   Mean   :0.2624
##  3rd Qu.: 934.8   3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000
##  Max.   :1236.0   Max.   :176.0   Max.   :353.0   Max.   :1.0000
##       age            height          weight          smoke
##  Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.000
##  1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.2   1st Qu.:0.000
##  Median :26.00   Median :64.00   Median :125.0   Median :0.000
##  Mean   :27.23   Mean   :64.05   Mean   :128.5   Mean   :0.391
##  3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.000
##  Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.000
Fit the linear regression model with the lm function and summarize it.
m1 <- lm(gestation ~ weight, data = babies)
summary(m1)
##
## Call:
## lm(formula = gestation ~ weight, data = babies)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -131.220   -6.937    0.917    8.898   74.145
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 276.75463    2.93450   94.31   <2e-16 ***
## weight        0.01827    0.02255    0.81    0.418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.01 on 1172 degrees of freedom
## Multiple R-squared: 0.0005596, Adjusted R-squared: -0.0002932
## F-statistic: 0.6562 on 1 and 1172 DF, p-value: 0.4181
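To make the summary above easier to read, the slope estimate, its p-value, and a 95% confidence interval can be pulled out of the fitted model directly. The following is a minimal sketch using base R's `coef()` and `confint()`. It assumes the openintro package is installed, refits the same model so the block stands alone, and uses `na.omit()` as a base-R stand-in for `drop_na()`; the names `babies_cc`, `slope_row`, `slope_ci`, and `ci_contains_zero` are illustrative.

```r
# Sketch: extract the slope, its p-value, and a 95% CI from the fitted model.
# Assumes the openintro package is installed; na.omit() mirrors drop_na().
library(openintro)

data(babies)
babies_cc <- na.omit(babies)                    # complete cases only
m1 <- lm(gestation ~ weight, data = babies_cc)

# Row of the coefficient table for weight:
# Estimate, Std. Error, t value, Pr(>|t|)
slope_row <- coef(summary(m1))["weight", ]

# 95% confidence interval for the slope (1 x 2 matrix: lower, upper)
slope_ci <- confint(m1, "weight", level = 0.95)

print(slope_row)
print(slope_ci)

# An interval that contains 0 is consistent with no linear association
ci_contains_zero <- slope_ci[1, 1] < 0 && slope_ci[1, 2] > 0
print(ci_contains_zero)
```

With the reported estimate of 0.01827 and standard error of 0.02255, an interval of roughly two standard errors on either side of the estimate spans zero, which agrees with the p-value of 0.418.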
Scatterplot of gestation against the mother's weight, with the least-squares line.
ggplot(data = babies, aes(x = weight, y = gestation)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
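Checking normality of the residuals with a histogram. The residual summary above (minimum -131.220, maximum 74.145, median 0.917) points to a longer left tail, that is, a left skew rather than a right skew.
ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(binwidth = 3)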
Checking constant variance of residuals.
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
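A rough numeric companion to the residual plot is to compare the spread of the residuals across bins of the fitted values. The sketch below is one informal way to do that, not a formal test. It assumes the openintro package is installed, refits the same model so it stands alone, and uses the illustrative names `babies_cc`, `bins`, and `spread_by_bin`.

```r
# Sketch: compare residual spread across quartile bins of the fitted values.
# Assumes the openintro package is installed; na.omit() mirrors drop_na().
library(openintro)

data(babies)
babies_cc <- na.omit(babies)                    # complete cases only
m1 <- lm(gestation ~ weight, data = babies_cc)

# Split the fitted values into four bins at their quartiles
bins <- cut(fitted(m1),
            breaks = quantile(fitted(m1), probs = seq(0, 1, by = 0.25)),
            include.lowest = TRUE)

# Standard deviation of the residuals within each bin
spread_by_bin <- tapply(resid(m1), bins, sd)
print(round(spread_by_bin, 2))
```

Similar standard deviations across the four bins would be consistent with constant variance, while a steady increase or decrease would suggest a fan shape.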
Checking for nearly normal residuals.
ggplot(data = m1, aes(sample = .resid)) +
  stat_qq()
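A Q-Q plot is easier to judge with a reference line, and a numeric summary can back up the visual impression. The sketch below adds ggplot2's `stat_qq_line()`, computes the sample skewness by hand, and runs base R's `shapiro.test()`. It assumes the openintro and ggplot2 packages are installed and refits the same model so it stands alone; the names `babies_cc`, `res`, `qq_plot`, `skew`, and `sw` are illustrative.

```r
# Sketch: Q-Q plot with a reference line, plus skewness and Shapiro-Wilk test.
# Assumes openintro and ggplot2 are installed; na.omit() mirrors drop_na().
library(openintro)
library(ggplot2)

data(babies)
babies_cc <- na.omit(babies)                    # complete cases only
m1 <- lm(gestation ~ weight, data = babies_cc)

res <- resid(m1)

# Normal Q-Q plot; points far from the line indicate non-normal tails
qq_plot <- ggplot(data.frame(res = res), aes(sample = res)) +
  stat_qq() +
  stat_qq_line()
print(qq_plot)

# Sample skewness: negative means a longer left tail, positive a longer right tail
skew <- mean((res - mean(res))^3) / sd(res)^3
print(skew)

# Shapiro-Wilk test of normality (valid for sample sizes from 3 to 5000)
sw <- shapiro.test(res)
print(sw)
```

With more than a thousand observations the Shapiro-Wilk test will flag even small departures from normality, so the plot and the skewness are usually the more informative of the three.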
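The residual plots were read as broadly consistent with linearity and constant variance, although the residual range (-131.220 to 74.145) suggests a long left tail, so normality holds only approximately. Passing these checks does not by itself establish a relationship. The estimated slope for weight is 0.01827 with a p-value of 0.418, and the multiple R-squared is 0.0005596, so the mother's weight explains essentially none of the variation in gestation. The model therefore provides no evidence of a linear relationship between gestation and the mother's weight, and a linear model with weight as the only predictor is not a useful description of these data.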