Data

This dataset of size n = 51 covers the 50 states and the District of Columbia in the United States. The variables are y = the year 2002 birth rate per 1000 females 15 to 17 years old (Brth15to17) and x = the poverty rate (PovPct), the percent of the state’s population living in households with incomes below the federally defined poverty level. (Data source: Mind on Statistics, 3rd edition, Utts and Heckard.)

poverty <- read.table('/Users/meganduff/Downloads/poverty.txt', header=T)
head(poverty)
##     Location PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
## 1    Alabama   20.1       31.5       88.7      11.2     54.5
## 2     Alaska    7.1       18.9       73.7       9.1     39.5
## 3    Arizona   16.1       35.0      102.5      10.4     61.2
## 4   Arkansas   14.9       31.6      101.7      10.4     59.9
## 5 California   16.7       22.6       69.1      11.2     41.1
## 6   Colorado    8.8       26.2       79.1       5.8     47.0
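
As a quick check (a small addition for verification), the number of rows should match the stated sample size; nrow is the standard way to count rows:

nrow(poverty)  # 50 states plus the District of Columbia, so this should return 51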
PovPct <- poverty$PovPct 
Brth15to17 <- poverty$Brth15to17
plot(PovPct, Brth15to17, xlim=c(5,26), ylim=c(0,50), xlab="PovPct", ylab="Brth15to17", main="Scatter Plot for Brth15to17 vs PovPct")
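
The scatterplot suggests a positive linear association. As an optional numeric check (not part of the original walkthrough), base R's cor() gives the sample correlation:

cor(PovPct, Brth15to17)  # positive, consistent with the R-squared of about 0.53 reported later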

Linear Regression

Let's first calculate the linear regression by hand! From the formulas derived in class, \(\beta_0=\bar{Y}-\beta_1\bar{X}\) and \(\beta_1= \frac{\sum_{i=1}^{n}(X_iY_i-\bar{Y}X_i)}{\sum_{i=1}^{n}(X_i^2-\bar{X}X_i)}\), which is algebraically the same as the familiar form \(\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\), since \(\sum_{i=1}^{n}\bar{Y}X_i=n\bar{X}\bar{Y}\) and \(\sum_{i=1}^{n}\bar{X}X_i=n\bar{X}^2\). Because \(\beta_0\) depends on \(\beta_1\), let's find \(\beta_1\) first and then \(\beta_0\):

beta_1 <- sum(PovPct*Brth15to17 - mean(Brth15to17)*PovPct)/sum(PovPct^2 - mean(PovPct)*PovPct)  # slope estimate
beta_0 <- mean(Brth15to17) - beta_1*mean(PovPct)  # intercept estimate
beta_1
## [1] 1.373345
beta_0
## [1] 4.267293

Therefore our estimated regression equation is \(E(Brth15to17|PovPct)=4.267+1.373*PovPct\).
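
For example, at a poverty rate of 15 percent (a value inside the observed range, chosen just for illustration), the estimated mean birth rate is \(4.267+1.373\times 15\approx 24.9\) births per 1000 females aged 15 to 17.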

Now, let's make our lives 100x easier and use the built-in R function 'lm':

model<-lm(Brth15to17 ~ PovPct)
summary(model)
## 
## Call:
## lm(formula = Brth15to17 ~ PovPct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2275  -3.6554  -0.0407   2.4972  10.5152 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2673     2.5297   1.687    0.098 .  
## PovPct        1.3733     0.1835   7.483 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.551 on 49 degrees of freedom
## Multiple R-squared:  0.5333, Adjusted R-squared:  0.5238 
## F-statistic:    56 on 1 and 49 DF,  p-value: 1.188e-09

We can see that the \(\beta_0\) and \(\beta_1\) we found by hand match R's calculations (which is always what we want)! The summary function reports not only the estimated coefficients but also additional information, like \(R^2\), standard errors, and test statistics. We will talk more about this later!
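
As a side note (not part of the original walkthrough), individual pieces of this output can be pulled out directly with standard accessors:

coef(model)               # the estimated intercept and slope
confint(model)            # 95% confidence intervals for the coefficients
summary(model)$r.squared  # multiple R-squared (0.5333 in the output above)
summary(model)$sigma      # residual standard error (5.551 in the output above)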

Plotting the linear regression line

We add the regression line to the previous scatterplot using the 'abline' function, where 'a' is the intercept, \(\beta_0\), and 'b' is the slope, \(\beta_1\).

plot(PovPct, Brth15to17, xlim=c(5,26), ylim=c(0,50), xlab="PovPct", ylab="Brth15to17", main="Scatter Plot for Brth15to17 vs PovPct")
abline(a=4.2673, b=1.3733, col='red')
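
Equivalently, the fitted model object can be passed straight to 'abline', which reads off the coefficients at full precision rather than the rounded values typed above:

abline(model, col='red')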

Making Predictions with linear regression

To predict the birth rate at new poverty levels, we pass a data frame of new PovPct values to the 'predict' function:

new.data<-data.frame(PovPct=c(10,20,30,40,50))
predict(model, new.data)
##        1        2        3        4        5 
## 18.00075 31.73420 45.46765 59.20111 72.93456
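
Two notes. First, the PovPct values of 30, 40, and 50 lie well above the poverty rates observed in this dataset (roughly 5 to 26 percent, as in the scatterplot), so those predictions are extrapolations and should be interpreted with caution. Second, 'predict' can return interval estimates as well as point predictions; for example (interval and level are standard arguments of predict.lm):

predict(model, new.data, interval="confidence")  # confidence intervals for the mean response
predict(model, new.data, interval="prediction")  # prediction intervals for a new observation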