Discussion 11

Load the US births data for 2000-2014

## 
## | year| month| date_of_month| day_of_week| births|
## |----:|-----:|-------------:|-----------:|------:|
## | 2000|     1|             1|           6|   9083|
## | 2000|     1|             2|           7|   8006|
## | 2000|     1|             3|           1|  11363|
## | 2000|     1|             4|           2|  13032|
## | 2000|     1|             5|           3|  12558|
## | 2000|     1|             6|           4|  12466|

Columns in the dataset

colnames(birth_data)
## [1] "year"          "month"         "date_of_month" "day_of_week"  
## [5] "births"

Group by the Year and sum the births

## [1] "data.frame"
##   Year Total_Births
## 1 2000      4149598
## 2 2001      4110963
## 3 2002      4099313
## 4 2003      4163060
## 5 2004      4186863
## 6 2005      4211941
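The chunk that produced this table is not shown. A minimal sketch of one way to get the same shape follows, assuming dplyr was used. The small `birth_data` frame is a stand-in built from the six daily rows printed earlier, so the total here covers only those six days, not a full year.

```r
library(dplyr)

# Stand-in for the real birth_data: only the six daily rows shown above
birth_data <- data.frame(
  year          = rep(2000, 6),
  month         = rep(1, 6),
  date_of_month = 1:6,
  day_of_week   = c(6, 7, 1, 2, 3, 4),
  births        = c(9083, 8006, 11363, 13032, 12558, 12466)
)

# Sum daily births within each year, then rename and convert to a plain data.frame
birthdata <- birth_data %>%
  group_by(year) %>%
  summarise(Total_Births = sum(births)) %>%
  rename(Year = year) %>%
  as.data.frame()

class(birthdata)
head(birthdata)
```

With the full daily data in `birth_data`, the same pipeline would give one row per year.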

Fit a simple linear model of Total_Births on Year and summarize it

model <- lm(Total_Births ~ Year, data = birthdata)
summary(model)
## 
## Call:
## lm(formula = Total_Births ~ Year, data = birthdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107161  -87672  -50894   55664  234982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 28337549   14338700   1.976   0.0697 .
## Year          -12054       7144  -1.687   0.1154  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119500 on 13 degrees of freedom
## Multiple R-squared:  0.1796, Adjusted R-squared:  0.1165 
## F-statistic: 2.847 on 1 and 13 DF,  p-value: 0.1154
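The numbers in the coefficient table can also be pulled out of the summary object directly. The sketch below refits the same formula on only the six yearly totals printed earlier (2000 to 2005), so its estimates will differ from the 15-year model above. The names `birthdata_head` and `fit` are invented here for illustration.

```r
# Illustrative subset: the six yearly totals shown earlier (not the full data)
birthdata_head <- data.frame(
  Year         = 2000:2005,
  Total_Births = c(4149598, 4110963, 4099313, 4163060, 4186863, 4211941)
)

fit   <- lm(Total_Births ~ Year, data = birthdata_head)
s     <- summary(fit)
coefs <- coef(s)  # matrix with Estimate, Std. Error, t value, Pr(>|t|)

slope     <- coefs["Year", "Estimate"]    # change in births per year
std_error <- coefs["Year", "Std. Error"]
t_value   <- coefs["Year", "t value"]
p_value   <- coefs["Year", "Pr(>|t|)"]
r_squared <- s$r.squared
f_value   <- unname(s$fstatistic["value"])

# The t value is the estimate divided by its standard error
all.equal(t_value, slope / std_error)

# With a single predictor, the F-statistic equals the squared t value
all.equal(f_value, t_value^2)
```

The same two relationships hold in the output above: -12054 / 7144 is about -1.687, and its square is about 2.847, which is why the slope and the F-test share the p-value 0.1154.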

Plot total births by year with the fitted regression line

# Scatter plot of yearly totals with the least-squares line (no confidence band)
ggplot(data = birthdata, aes(x = Year, y = Total_Births)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

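Residuals of the model

# Residuals of the fitted model, plotted against the observed yearly totals
res <- resid(model)
plot(birthdata$Total_Births, res,
     ylab = "Residuals", xlab = "Births")
abline(h = 0)  # horizontal reference line at zero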

# check_model() comes from the performance package
library(performance)
check_model(model)
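check_model() draws several diagnostic panels at once. As a rough base-R sketch of two of those checks (residuals against fitted values, and normality of residuals), the example below again uses only the six yearly totals printed earlier, so it illustrates the mechanics rather than reproducing the report's plots. The names `birthdata_head`, `fit`, and `res_head` are invented here.

```r
# Illustrative subset: the six yearly totals shown earlier (not the full data)
birthdata_head <- data.frame(
  Year         = 2000:2005,
  Total_Births = c(4149598, 4110963, 4099313, 4163060, 4186863, 4211941)
)

fit      <- lm(Total_Births ~ Year, data = birthdata_head)
res_head <- resid(fit)

# Residuals against fitted values: look for curvature or a funnel shape
plot(fitted(fit), res_head,
     xlab = "Fitted births", ylab = "Residuals")
abline(h = 0)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res_head)
qqline(res_head)

# Shapiro-Wilk test as a numeric companion to the Q-Q plot
shapiro_result <- shapiro.test(res_head)
shapiro_result$p.value
```

With so few observations, both the plots and the test have little power, so they are best read as a rough check rather than a verdict.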

Summary

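This analysis looked at total US births per year from 2000 to 2014. The scatter plot shows no clear linear relationship between year and total births, and the model agrees: the estimated slope is about -12,054 births per year, but it is not statistically significant (p = 0.1154), and year explains only about 18% of the variation in total births (R-squared = 0.1796). The residuals also do not appear to be normally distributed: in the diagnostic plots above, the residual points do not fall along the reference line. The variability of the residuals appears roughly constant across the years.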