Simple Linear Regression

Introduction

Linear regression is, I suppose, the starting point of this major and the focus of this course. Estimation is the first step in building a linear model: it is the process of identifying a coefficient for each independent variable so that the model explains as much of the response (dependent) variable as possible.

Least squares is the fundamental approach to finding the best coefficient estimates: it chooses the values that minimize the sum of the squared errors.
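
In matrix notation (a standard refresher, not part of the original challenge write-up), least squares picks the coefficient vector that minimizes the residual sum of squares, and the minimizer has the closed form that is computed by hand later in this post:

\[
RSS(\beta) = (y - X\beta)^{\top}(y - X\beta), \qquad \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y
\]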

Once the coefficients have been estimated by least squares, goodness of fit is an important way to evaluate any linear regression model.
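
A common goodness-of-fit measure, and the one reported as Multiple R-squared in the summary() output below, is the coefficient of determination:

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]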

These concepts are the basis for the course and for understanding linear regression models as a whole. Given any dataset with a single response variable, a linear regression model can be fit to evaluate how well the independent variables describe the dependent variable. With the packages available in R, building a naive linear regression model takes just a few lines of code, as sketched below. In fact, the simplicity of the R language and its modeling functions lets a user, perhaps a data scientist, create models with little grasp of the underlying concepts.
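
As a minimal sketch (using simulated data rather than the cancer dataset analyzed below), fitting a naive linear model really does take only a couple of lines:

set.seed(1)
x_toy <- runif(50)                           # a single simulated predictor
y_toy <- 2 + 3 * x_toy + rnorm(50, sd = 0.5) # response with known intercept, slope, and noise
toy_mod <- lm(y_toy ~ x_toy)                 # fit the model
coef(toy_mod)                                # estimated intercept and slope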

Implementation

The dataset from https://data.world/nrippner/ols-regression-challenge poses a challenge: predict the cancer mortality rate for counties of the United States. I use it here to build a naive linear model, and I work through the R code snippets from Chapter 2 to confirm the mathematics behind the lm() function and to better understand the underlying principles.

library(faraway)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Read the county-level cancer data
cancer_data <- read.csv("cancer_reg.csv")

# Drop the binnedinc and geography columns, which are not used in this analysis
cancer_data <- subset(cancer_data, select = -c(binnedinc, geography))

# Just use full cases for now
cancer_data <- cancer_data[complete.cases(cancer_data), ]

summary(cancer_data)
##   avganncount      avgdeathsperyear  target_deathrate incidencerate   
##  Min.   :    6.0   Min.   :    3.0   Min.   : 94.4    Min.   : 234.0  
##  1st Qu.:   87.0   1st Qu.:   30.0   1st Qu.:161.3    1st Qu.: 425.2  
##  Median :  183.0   Median :   64.0   Median :177.9    Median : 453.5  
##  Mean   :  682.3   Mean   :  199.4   Mean   :179.1    Mean   : 452.7  
##  3rd Qu.:  542.0   3rd Qu.:  131.0   3rd Qu.:195.4    3rd Qu.: 481.6  
##  Max.   :38150.0   Max.   :14010.0   Max.   :293.9    Max.   :1014.2  
##    medincome       popest2015       povertypercent   studypercap     
##  Min.   :23047   Min.   :     829   Min.   : 4.80   Min.   :   0.00  
##  1st Qu.:39305   1st Qu.:   12328   1st Qu.:12.00   1st Qu.:   0.00  
##  Median :45235   Median :   27755   Median :15.60   Median :   0.00  
##  Mean   :47033   Mean   :  114553   Mean   :16.72   Mean   : 122.73  
##  3rd Qu.:52686   3rd Qu.:   66217   3rd Qu.:19.90   3rd Qu.:  60.95  
##  Max.   :97279   Max.   :10170292   Max.   :45.10   Max.   :3657.32  
##    medianage      medianagemale   medianagefemale avghouseholdsize
##  Min.   : 23.20   Min.   :23.70   Min.   :22.30   Min.   :0.0221  
##  1st Qu.: 38.20   1st Qu.:36.60   1st Qu.:39.30   1st Qu.:2.3700  
##  Median : 40.90   Median :39.50   Median :42.30   Median :2.5000  
##  Mean   : 45.99   Mean   :39.57   Mean   :42.06   Mean   :2.4719  
##  3rd Qu.: 43.60   3rd Qu.:42.25   3rd Qu.:44.90   3rd Qu.:2.6500  
##  Max.   :508.80   Max.   :58.50   Max.   :58.00   Max.   :3.7800  
##  percentmarried   pctnohs18_24     pcths18_24    pctsomecol18_24
##  Min.   :26.20   Min.   : 0.50   Min.   : 7.10   Min.   : 7.10  
##  1st Qu.:48.20   1st Qu.:12.75   1st Qu.:29.30   1st Qu.:34.00  
##  Median :52.70   Median :17.00   Median :34.80   Median :40.30  
##  Mean   :51.89   Mean   :18.02   Mean   :35.17   Mean   :40.78  
##  3rd Qu.:56.50   3rd Qu.:22.35   3rd Qu.:41.20   3rd Qu.:46.15  
##  Max.   :65.90   Max.   :48.90   Max.   :72.50   Max.   :79.00  
##  pctbachdeg18_24   pcths25_over   pctbachdeg25_over pctemployed16_over
##  Min.   : 0.000   Min.   :12.70   Min.   : 3.90     Min.   :19.50     
##  1st Qu.: 2.850   1st Qu.:31.20   1st Qu.: 9.25     1st Qu.:49.30     
##  Median : 5.500   Median :35.60   Median :12.20     Median :55.00     
##  Mean   : 6.022   Mean   :35.27   Mean   :13.15     Mean   :54.47     
##  3rd Qu.: 8.250   3rd Qu.:39.50   3rd Qu.:15.80     3rd Qu.:60.30     
##  Max.   :28.500   Max.   :54.80   Max.   :35.50     Max.   :73.20     
##  pctunemployed16_over pctprivatecoverage pctprivatecoveragealone
##  Min.   : 0.800       Min.   :31.50      Min.   :23.70          
##  1st Qu.: 5.700       1st Qu.:57.65      1st Qu.:41.90          
##  Median : 7.400       Median :65.80      Median :49.00          
##  Mean   : 7.819       Mean   :64.60      Mean   :48.75          
##  3rd Qu.: 9.500       3rd Qu.:72.45      3rd Qu.:55.90          
##  Max.   :27.000       Max.   :87.50      Max.   :76.60          
##  pctempprivcoverage pctpubliccoverage pctpubliccoveragealone    pctwhite     
##  Min.   :16.30      Min.   :14.80     Min.   : 2.60          Min.   : 11.01  
##  1st Qu.:34.75      1st Qu.:30.95     1st Qu.:14.95          1st Qu.: 78.08  
##  Median :41.60      Median :36.50     Median :18.70          Median : 90.24  
##  Mean   :41.53      Mean   :36.19     Mean   :19.20          Mean   : 84.22  
##  3rd Qu.:48.55      3rd Qu.:40.70     3rd Qu.:22.90          3rd Qu.: 95.27  
##  Max.   :68.80      Max.   :62.20     Max.   :41.40          Max.   :100.00  
##     pctblack          pctasian        pctotherrace     pctmarriedhouseholds
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   :24.02       
##  1st Qu.: 0.6818   1st Qu.: 0.2679   1st Qu.: 0.2843   1st Qu.:48.11       
##  Median : 2.4508   Median : 0.5698   Median : 0.7552   Median :51.67       
##  Mean   : 8.7276   Mean   : 1.1837   Mean   : 1.9213   Mean   :51.32       
##  3rd Qu.:11.3957   3rd Qu.: 1.1972   3rd Qu.: 2.1518   3rd Qu.:55.31       
##  Max.   :84.8660   Max.   :23.2381   Max.   :37.6110   Max.   :71.70       
##    birthrate      
##  Min.   : 0.4132  
##  1st Qu.: 4.6065  
##  Median : 5.4978  
##  Mean   : 5.7397  
##  3rd Qu.: 6.7070  
##  Max.   :17.8771

Output a summary of the cancer data to confirm the attributes and check for missing data. Because only complete cases were retained above, there is no missing data; a quick explicit check is sketched below.
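
A quick way to verify this (a small sketch, using only base R functions):

sum(is.na(cancer_data))      # total number of missing values, expected to be 0 after the complete-case filter
colSums(is.na(cancer_data))  # missing values per column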

lmod <- lm(target_deathrate ~ incidencerate + pctbachdeg25_over + pctemployed16_over + pctotherrace + pctmarriedhouseholds, data=cancer_data)
summary(lmod)
## 
## Call:
## lm(formula = target_deathrate ~ incidencerate + pctbachdeg25_over + 
##     pctemployed16_over + pctotherrace + pctmarriedhouseholds, 
##     data = cancer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.363  -10.195    0.511   11.814  109.693 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          185.45807   11.75731  15.774  < 2e-16 ***
## incidencerate          0.17407    0.01664  10.458  < 2e-16 ***
## pctbachdeg25_over     -1.89949    0.20522  -9.256  < 2e-16 ***
## pctemployed16_over    -0.39575    0.14754  -2.682 0.007517 ** 
## pctotherrace          -0.99722    0.26394  -3.778 0.000174 ***
## pctmarriedhouseholds  -0.71520    0.15140  -4.724  2.9e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.11 on 585 degrees of freedom
## Multiple R-squared:  0.4062, Adjusted R-squared:  0.4012 
## F-statistic: 80.05 on 5 and 585 DF,  p-value: < 2.2e-16

The model summary indicates that each of the selected independent variables is highly significant. The variables were chosen by their level of significance in a previously generated linear model (sketched below).
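
That earlier model is not shown in this post; one way it might have been fit and screened (assuming all remaining columns were candidate predictors) is:

full_mod <- lm(target_deathrate ~ ., data = cancer_data)  # regress on every other column
summary(full_mod)                                          # inspect p-values to shortlist predictors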

# Extract the X matrix
x <- model.matrix( ~ incidencerate + pctbachdeg25_over + pctemployed16_over + pctotherrace + pctmarriedhouseholds, cancer_data)

# Response y
y <- cancer_data$target_deathrate
xtxi <- solve(t(x) %*% x)   # (X'X)^{-1}

# Least squares coefficients: (X'X)^{-1} X'y
(xtxi %*% t(x) %*% y)
##                             [,1]
## (Intercept)          185.4580731
## incidencerate          0.1740683
## pctbachdeg25_over     -1.8994863
## pctemployed16_over    -0.3957486
## pctotherrace          -0.9972177
## pctmarriedhouseholds  -0.7151985

The direct matrix computation of the coefficients works, matching the output of the lm() function above.

# Solve the normal equations (X'X) b = X'y without explicitly forming the inverse
(solve(crossprod(x, x), crossprod(x, y)))
##                             [,1]
## (Intercept)          185.4580731
## incidencerate          0.1740683
## pctbachdeg25_over     -1.8994863
## pctemployed16_over    -0.3957486
## pctotherrace          -0.9972177
## pctmarriedhouseholds  -0.7151985

Solving the normal equations with crossprod() is another mechanism for calculating the same coefficients, and it avoids explicitly inverting X'X.
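
Rather than eyeballing the printouts, the estimates can also be compared to coef(lmod) programmatically (a quick sketch):

beta_normal <- solve(crossprod(x, x), crossprod(x, y))  # normal-equation solution from above
all.equal(as.vector(beta_normal), unname(coef(lmod)))   # TRUE up to numerical tolerance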

lmodsum <- summary(lmod)

# Calculate sigma
sqrt(deviance(lmod) / df.residual(lmod))
## [1] 21.10569
# Directly output sigma from the model summary
lmodsum$sigma
## [1] 21.10569

Yep, the sigma values match.
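
For reference, the quantity computed above is the usual estimate of the residual standard error, here with n = 591 observations and p = 6 coefficients:

\[
\hat{\sigma} = \sqrt{\frac{RSS}{n - p}} = \sqrt{\frac{\mathrm{deviance}}{591 - 6}}
\]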

# Compute standard errors of the coefficients
xtxi <- lmodsum$cov.unscaled   # unscaled covariance matrix, i.e. (X'X)^{-1}
sqrt(diag(xtxi)) * 21.10569    # scale by sigma (rounded value from above)
##          (Intercept)        incidencerate    pctbachdeg25_over 
##          11.75731322           0.01664402           0.20522389 
##   pctemployed16_over         pctotherrace pctmarriedhouseholds 
##           0.14753668           0.26393722           0.15140039
# Standard errors as reported by summary()
lmodsum$coef[,2]
##          (Intercept)        incidencerate    pctbachdeg25_over 
##          11.75731480           0.01664402           0.20522392 
##   pctemployed16_over         pctotherrace pctmarriedhouseholds 
##           0.14753670           0.26393725           0.15140041
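
The hand computation mirrors the standard formula for the coefficient standard errors, with cov.unscaled playing the role of the inverse of X transpose X; the tiny differences in the last digits come from multiplying by the rounded value 21.10569 rather than lmodsum$sigma:

\[
\widehat{\mathrm{se}}(\hat{\beta}_j) = \hat{\sigma}\,\sqrt{\big[(X^{\top}X)^{-1}\big]_{jj}}
\]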

Next, compute the QR decomposition of X and use it to recover the coefficients.
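
As a brief refresher (standard linear algebra, and also the route lm() takes internally): factoring X = QR, with Q having orthonormal columns and R upper triangular, reduces least squares to a triangular system that backsolve() can handle:

\[
X = QR, \qquad R\hat{\beta} = Q^{\top}y
\]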

qrx <- qr(x)

# Extract and output dimensions of the Q matrix
dim(qr.Q(qrx))
## [1] 591   6
# Compute f = Q'y
(f <- t(qr.Q(qrx)) %*% y)
##             [,1]
## [1,] -4353.78691
## [2,]   249.11148
## [3,]   302.32930
## [4,]   -99.88445
## [5,]   -70.03778
## [6,]    99.70092
# Confirm the coefficients by back-substitution: solve R b = f
backsolve(qr.R(qrx), f)
##             [,1]
## [1,] 185.4580731
## [2,]   0.1740683
## [3,]  -1.8994863
## [4,]  -0.3957486
## [5,]  -0.9972177
## [6,]  -0.7151985

And sure enough, the coefficients match again.

Conclusion

To me, the implementation above shows how much of the linear model is abstracted away from the user. This first blog post also shows what a journey I expect this course to be in truly learning the different techniques and approaches to constructing proper linear models. With data-driven analysis, journalism, and the like so popular today, the fundamental techniques of building linear models will be valuable for understanding and justifying decisions in my career as a software engineer and, hopefully, future data scientist.