Linear regression I suppose is the starting point of this major and the focus of this course. Estimation is the first step of building the linear models. Estimation is the process of identifying the coefficient for each independent variable so as to explain as much of the response (dependent) variable as possible.
The least squares estimation is a fundamental approach to identifying the best estimation of coefficient as the least squares estimation attempts to minimize the sum of the squared errors.
After identifying a proper estimation based on the least squares estimation, the goodness of fit is also an important evaluation for any linear regression model.
These concepts serve as the basis for the course and also the comprehension of linear regression models as a whole. Given any dataset with a single response variable, a linear regression model could be attempted to evaluate the explanation of the independent variables in describing the dependent variable. Given the modules of R, building a naive linear regression model in R requires just a few lines of code. In fact, the simplicity of the R language and models gives the user, perhaps data scientist, the ability to create models with little grasp of the underlying concepts.
Using the dataset from https://data.world/nrippner/ols-regression-challenge, the challenge is an attempt to predict the mortality rate of cancer in counties of the United States. I’ve used the dataset for purposes of building a naive linear model. I’ve used the R code snippets throughout Chapter 2 to confirm the mathematics of the linear model function to better understand the underlying principles.
library(faraway)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
cancer_data <- read.csv("cancer_reg.csv")
cancer_data <- subset(cancer_data, select=-c(binnedinc,geography))
# Just use full cases for now
cancer_data <- cancer_data[complete.cases(cancer_data), ]
summary(cancer_data)
## avganncount avgdeathsperyear target_deathrate incidencerate
## Min. : 6.0 Min. : 3.0 Min. : 94.4 Min. : 234.0
## 1st Qu.: 87.0 1st Qu.: 30.0 1st Qu.:161.3 1st Qu.: 425.2
## Median : 183.0 Median : 64.0 Median :177.9 Median : 453.5
## Mean : 682.3 Mean : 199.4 Mean :179.1 Mean : 452.7
## 3rd Qu.: 542.0 3rd Qu.: 131.0 3rd Qu.:195.4 3rd Qu.: 481.6
## Max. :38150.0 Max. :14010.0 Max. :293.9 Max. :1014.2
## medincome popest2015 povertypercent studypercap
## Min. :23047 Min. : 829 Min. : 4.80 Min. : 0.00
## 1st Qu.:39305 1st Qu.: 12328 1st Qu.:12.00 1st Qu.: 0.00
## Median :45235 Median : 27755 Median :15.60 Median : 0.00
## Mean :47033 Mean : 114553 Mean :16.72 Mean : 122.73
## 3rd Qu.:52686 3rd Qu.: 66217 3rd Qu.:19.90 3rd Qu.: 60.95
## Max. :97279 Max. :10170292 Max. :45.10 Max. :3657.32
## medianage medianagemale medianagefemale avghouseholdsize
## Min. : 23.20 Min. :23.70 Min. :22.30 Min. :0.0221
## 1st Qu.: 38.20 1st Qu.:36.60 1st Qu.:39.30 1st Qu.:2.3700
## Median : 40.90 Median :39.50 Median :42.30 Median :2.5000
## Mean : 45.99 Mean :39.57 Mean :42.06 Mean :2.4719
## 3rd Qu.: 43.60 3rd Qu.:42.25 3rd Qu.:44.90 3rd Qu.:2.6500
## Max. :508.80 Max. :58.50 Max. :58.00 Max. :3.7800
## percentmarried pctnohs18_24 pcths18_24 pctsomecol18_24
## Min. :26.20 Min. : 0.50 Min. : 7.10 Min. : 7.10
## 1st Qu.:48.20 1st Qu.:12.75 1st Qu.:29.30 1st Qu.:34.00
## Median :52.70 Median :17.00 Median :34.80 Median :40.30
## Mean :51.89 Mean :18.02 Mean :35.17 Mean :40.78
## 3rd Qu.:56.50 3rd Qu.:22.35 3rd Qu.:41.20 3rd Qu.:46.15
## Max. :65.90 Max. :48.90 Max. :72.50 Max. :79.00
## pctbachdeg18_24 pcths25_over pctbachdeg25_over pctemployed16_over
## Min. : 0.000 Min. :12.70 Min. : 3.90 Min. :19.50
## 1st Qu.: 2.850 1st Qu.:31.20 1st Qu.: 9.25 1st Qu.:49.30
## Median : 5.500 Median :35.60 Median :12.20 Median :55.00
## Mean : 6.022 Mean :35.27 Mean :13.15 Mean :54.47
## 3rd Qu.: 8.250 3rd Qu.:39.50 3rd Qu.:15.80 3rd Qu.:60.30
## Max. :28.500 Max. :54.80 Max. :35.50 Max. :73.20
## pctunemployed16_over pctprivatecoverage pctprivatecoveragealone
## Min. : 0.800 Min. :31.50 Min. :23.70
## 1st Qu.: 5.700 1st Qu.:57.65 1st Qu.:41.90
## Median : 7.400 Median :65.80 Median :49.00
## Mean : 7.819 Mean :64.60 Mean :48.75
## 3rd Qu.: 9.500 3rd Qu.:72.45 3rd Qu.:55.90
## Max. :27.000 Max. :87.50 Max. :76.60
## pctempprivcoverage pctpubliccoverage pctpubliccoveragealone pctwhite
## Min. :16.30 Min. :14.80 Min. : 2.60 Min. : 11.01
## 1st Qu.:34.75 1st Qu.:30.95 1st Qu.:14.95 1st Qu.: 78.08
## Median :41.60 Median :36.50 Median :18.70 Median : 90.24
## Mean :41.53 Mean :36.19 Mean :19.20 Mean : 84.22
## 3rd Qu.:48.55 3rd Qu.:40.70 3rd Qu.:22.90 3rd Qu.: 95.27
## Max. :68.80 Max. :62.20 Max. :41.40 Max. :100.00
## pctblack pctasian pctotherrace pctmarriedhouseholds
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. :24.02
## 1st Qu.: 0.6818 1st Qu.: 0.2679 1st Qu.: 0.2843 1st Qu.:48.11
## Median : 2.4508 Median : 0.5698 Median : 0.7552 Median :51.67
## Mean : 8.7276 Mean : 1.1837 Mean : 1.9213 Mean :51.32
## 3rd Qu.:11.3957 3rd Qu.: 1.1972 3rd Qu.: 2.1518 3rd Qu.:55.31
## Max. :84.8660 Max. :23.2381 Max. :37.6110 Max. :71.70
## birthrate
## Min. : 0.4132
## 1st Qu.: 4.6065
## Median : 5.4978
## Mean : 5.7397
## 3rd Qu.: 6.7070
## Max. :17.8771
Output a summary of the cancer data to confirm attributes and check for any missing data. There is no missing data.
lmod <- lm(target_deathrate ~ incidencerate + pctbachdeg25_over + pctemployed16_over + pctotherrace + pctmarriedhouseholds, data=cancer_data)
summary(lmod)
##
## Call:
## lm(formula = target_deathrate ~ incidencerate + pctbachdeg25_over +
## pctemployed16_over + pctotherrace + pctmarriedhouseholds,
## data = cancer_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.363 -10.195 0.511 11.814 109.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 185.45807 11.75731 15.774 < 2e-16 ***
## incidencerate 0.17407 0.01664 10.458 < 2e-16 ***
## pctbachdeg25_over -1.89949 0.20522 -9.256 < 2e-16 ***
## pctemployed16_over -0.39575 0.14754 -2.682 0.007517 **
## pctotherrace -0.99722 0.26394 -3.778 0.000174 ***
## pctmarriedhouseholds -0.71520 0.15140 -4.724 2.9e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.11 on 585 degrees of freedom
## Multiple R-squared: 0.4062, Adjusted R-squared: 0.4012
## F-statistic: 80.05 on 5 and 585 DF, p-value: < 2.2e-16
Output of the linear model indicates high significance for the independent variables selected. The variables selected by level of significance from previously generated linear model.
# Extract the X matrix
x <- model.matrix( ~ incidencerate + pctbachdeg25_over + pctemployed16_over + pctotherrace + pctmarriedhouseholds, cancer_data)
# Response y
y <- cancer_data$target_deathrate
xtxi <- solve(t(x) %*% x)
(xtxi %*% t(x) %*% y)
## [,1]
## (Intercept) 185.4580731
## incidencerate 0.1740683
## pctbachdeg25_over -1.8994863
## pctemployed16_over -0.3957486
## pctotherrace -0.9972177
## pctmarriedhouseholds -0.7151985
As compared to above, the matrix multiply to compute the coefficients does work, matching the output of the lm() function.
(solve(crossprod(x,x), crossprod(x,y)))
## [,1]
## (Intercept) 185.4580731
## incidencerate 0.1740683
## pctbachdeg25_over -1.8994863
## pctemployed16_over -0.3957486
## pctotherrace -0.9972177
## pctmarriedhouseholds -0.7151985
As compared to above, another mechanism by which to calculate coefficients for the data set.
lmodsum <- summary(lmod)
# Calculate sigma
sqrt(deviance(lmod) / df.residual(lmod))
## [1] 21.10569
# Directly outpu sigma
lmodsum$sigma
## [1] 21.10569
Yep, the sigma values match.
# Compute standard errors of the coefficients
xtxi <- lmodsum$cov.unscaled
sqrt(diag(xtxi)) * 21.10569
## (Intercept) incidencerate pctbachdeg25_over
## 11.75731322 0.01664402 0.20522389
## pctemployed16_over pctotherrace pctmarriedhouseholds
## 0.14753668 0.26393722 0.15140039
lmodsum$coef[,2]
## (Intercept) incidencerate pctbachdeg25_over
## 11.75731480 0.01664402 0.20522392
## pctemployed16_over pctotherrace pctmarriedhouseholds
## 0.14753670 0.26393725 0.15140041
Compute the QR decomposition
qrx <- qr(x)
# Extract and output dimensions of the Q matrix
dim(qr.Q(qrx))
## [1] 591 6
# Solve for f
(f <- t(qr.Q(qrx)) %*% y)
## [,1]
## [1,] -4353.78691
## [2,] 249.11148
## [3,] 302.32930
## [4,] -99.88445
## [5,] -70.03778
## [6,] 99.70092
# Confirm coeffients
backsolve(qr.R(qrx), f)
## [,1]
## [1,] 185.4580731
## [2,] 0.1740683
## [3,] -1.8994863
## [4,] -0.3957486
## [5,] -0.9972177
## [6,] -0.7151985
And sure enough, the coefficients match again.
To me, the above implementation shows just how much of the linear model is abstracted from the user. This first blog post shows just how much of a journey I expect this course to be in truly learning the different techniques and approaches to constructing proper linear models. With data driven analysis, journalism, etc. so popular today, the fundamental techniques of building linear models will be valuable in understanding the justification for decisions made in my career as a software engineer and hopefully future data scientist.