This dataset of size n = 51 are for the 50 states and the District of Columbia in the United States. The variables are y = year 2002 birth rate per 1000 females 15 to 17 years old and x = poverty rate, which is the percent of the state’s population living in households with incomes below the federally defined poverty level. (Data source: Mind On Statistics, 3rd edition, Utts and Heckard).
poverty <- read.table('/Users/meganduff/Downloads/poverty.txt', header=T)
head(poverty)
## Location PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
## 1 Alabama 20.1 31.5 88.7 11.2 54.5
## 2 Alaska 7.1 18.9 73.7 9.1 39.5
## 3 Arizona 16.1 35.0 102.5 10.4 61.2
## 4 Arkansas 14.9 31.6 101.7 10.4 59.9
## 5 California 16.7 22.6 69.1 11.2 41.1
## 6 Colorado 8.8 26.2 79.1 5.8 47.0
PovPct <- poverty$PovPct
Brth15to17 <- poverty$Brth15to17
plot(PovPct, Brth15to17, xlim=c(5,26), ylim=c(0,50), xlab="PovPct", ylab="Brth15to17", main="Scatter Plot for Brth15to17 vs PovPct")
Let's first calculate the linear regression by hand! Using the formulas derived in class, \(\beta_0=\bar{Y}-\beta_1\bar{X}\) and \(\beta_1= \frac{\sum_{i=1}^{n}(X_iY_i-\bar{Y}X_i)}{\sum_{i=1}^{n}(X_i^2-\bar{X}X_i)}\). Let's find \(\beta_1\) first and then \(\beta_0\):
beta_1 <- sum(PovPct*Brth15to17-mean(Brth15to17)*PovPct)/sum(PovPct^2-mean(PovPct)*PovPct)
beta_0<-mean(Brth15to17)-beta_1*mean(PovPct)
beta_1
## [1] 1.373345
beta_0
## [1] 4.267293
Therefore our linear regression is: \(E(Brth15to17|PovPct)=4.267+1.373*PovPct\).
Now, let's make our life's 100x easier and use the built in R function:
model<-lm(Brth15to17 ~ PovPct)
summary(model)
##
## Call:
## lm(formula = Brth15to17 ~ PovPct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2275 -3.6554 -0.0407 2.4972 10.5152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2673 2.5297 1.687 0.098 .
## PovPct 1.3733 0.1835 7.483 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.551 on 49 degrees of freedom
## Multiple R-squared: 0.5333, Adjusted R-squared: 0.5238
## F-statistic: 56 on 1 and 49 DF, p-value: 1.188e-09
We can see our \(\beta_0\) and \(\beta_1\) we found by hand matches R's calculations (which is always what we want)! With the model summary function, this tells us not only the estimated coefficients but gives us additional information, like \(R^2\) and test statistics. We will talk more about this later!
Add the regression line to the previous scatterplot using the 'abline' function, where 'a' is the intercept, \(\beta_0\), and 'b' is the coefficient, \(\beta_1\).
{plot(PovPct, Brth15to17, xlim=c(5,26), ylim=c(0,50), xlab="PovPct", ylab="Brth15to17", main="Scatter Plot for Brth15to17 vs PovPct")
abline(a=4.2673, b=1.3733, col='red')}
new.data<-data.frame(PovPct=c(10,20,30,40,50))
predict(model, new.data)
## 1 2 3 4 5
## 18.00075 31.73420 45.46765 59.20111 72.93456