Discussion Objective:

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Dataset

The data set used here is the “babies” data from the OpenIntro package. It records pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The variables are a case identifier, birth weight in ounces (bwt), gestation in days, parity, age of the mother, height of the mother, weight of the mother, and a binary indicator of whether the mother smoked. The model below regresses gestation on the weight of the mother.

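library(openintro)  # provides the babies data set
library(tidyr)      # provides drop_na()
library(ggplot2)    # used for the plots below
data(babies)
babies <- drop_na(babies)  # keep only rows with no missing values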
summary(babies)
##       case             bwt          gestation         parity      
##  Min.   :   1.0   Min.   : 55.0   Min.   :148.0   Min.   :0.0000  
##  1st Qu.: 317.2   1st Qu.:108.0   1st Qu.:272.0   1st Qu.:0.0000  
##  Median : 625.5   Median :120.0   Median :280.0   Median :0.0000  
##  Mean   : 624.8   Mean   :119.5   Mean   :279.1   Mean   :0.2624  
##  3rd Qu.: 934.8   3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000  
##  Max.   :1236.0   Max.   :176.0   Max.   :353.0   Max.   :1.0000  
##       age            height          weight          smoke      
##  Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.000  
##  1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.2   1st Qu.:0.000  
##  Median :26.00   Median :64.00   Median :125.0   Median :0.000  
##  Mean   :27.23   Mean   :64.05   Mean   :128.5   Mean   :0.391  
##  3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.000  
##  Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.000

Fit the linear regression model with the lm function and summarize it

m1 <- lm(gestation ~ weight, data = babies)
summary(m1)
## 
## Call:
## lm(formula = gestation ~ weight, data = babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -131.220   -6.937    0.917    8.898   74.145 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 276.75463    2.93450   94.31   <2e-16 ***
## weight        0.01827    0.02255    0.81    0.418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.01 on 1172 degrees of freedom
## Multiple R-squared:  0.0005596,  Adjusted R-squared:  -0.0002932 
## F-statistic: 0.6562 on 1 and 1172 DF,  p-value: 0.4181
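
As a rough cross-check of the summary above, the t value, p-value, and a 95% confidence interval for the weight slope can be recomputed from the printed estimate and standard error. This is a minimal sketch in base R. The variable names are illustrative, and the inputs are the rounded values from the output, so the results agree with the summary only to a few digits.

```r
# Values copied from the summary(m1) output above (rounded as printed).
est <- 0.01827   # slope estimate for weight
se  <- 0.02255   # its standard error
df  <- 1172      # residual degrees of freedom

# t statistic: estimate divided by its standard error.
t_stat <- est / se

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the slope.
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]), 4))
```

The interval spans zero, which matches the large p-value: the data are consistent with no linear association between gestation and the weight of the mother.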

Scatterplot with the fitted least-squares line

ggplot(data = babies, aes(x = weight, y = gestation)) +  # same response as m1
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

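Checking normality of the residuals with a histogram. The residual summary above (minimum -131.2, maximum 74.1, median 0.9) points to a long left tail, so the distribution is skewed to the left rather than symmetric.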

ggplot(m1, aes(x = .resid)) +
  geom_histogram(binwidth = 3) 


Checking constant variance of residuals.

ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

Checking for nearly normal residuals.

ggplot(data = m1, aes(sample = .resid)) +
  stat_qq()
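
A Q-Q plot is easier to read with a reference line, and a numeric skewness value can back up the visual impression. The following is a sketch, assuming the openintro and ggplot2 packages are installed. The names babies_cc, res, and skew are illustrative, and na.omit() is used here as a base R stand-in for drop_na().

```r
library(openintro)
library(ggplot2)

# Refit the model from this post on complete cases.
data(babies)
babies_cc <- na.omit(babies)
m1 <- lm(gestation ~ weight, data = babies_cc)

# Put the residuals in a plain data frame for plotting.
res <- data.frame(resid = resid(m1))

# Q-Q plot with a reference line through the first and third quartiles.
qq_plot <- ggplot(res, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line()
print(qq_plot)

# Moment-based sample skewness: a negative value means a longer left tail,
# a positive value a longer right tail.
skew <- mean((res$resid - mean(res$resid))^3) / sd(res$resid)^3
print(skew)
```

Points that fall well away from the reference line at either end indicate tails heavier than a normal distribution would produce.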

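The linear model is not a useful description of these data. The estimated slope for weight is 0.018 with a p-value of 0.418, and R-squared is about 0.0006, so the model explains almost none of the variation in gestation and provides no evidence of a linear relationship between gestation and the weight of the mother. The residual checks are also not fully satisfied: the residuals range from about -131 to 74, which indicates a long left tail rather than nearly normal residuals.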