March 25, 2020

Simple Linear Regression

Linear Regression is a statistical approach to estimating how changes in one or more independent variables relate to changes in a dependent variable. Simple Linear Regression uses a single independent variable. In Linear Regression, the dependent variable must be a continuous variable, not a categorical one.

By understanding the relationship between independent and dependent variables, new observations of independent variables can be used to predict expected values of the dependent variable. These relationships are estimated by finding the line of best fit, the line that minimizes the sum of squared differences between the observed and predicted values.

Line of Best Fit Equation

\(\hat{y}_i = b_0 + b_1\cdot x_i\)

Calculating "Line of Best Fit" Parameters

The parameters of the line of best fit equation are calculated as follows:

\(b_0 = \bar{y} - b_1\cdot \bar{x}\)

\(b_1 = {\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \over \sum_{i=1}^n (x_i - \bar{x})^2}\)

We are left with the least squares regression line. Plugging a value of the independent variable into this line gives the predicted value of the dependent variable.
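The two formulas above can be checked by hand on a small example. The sketch below uses made-up toy values (not the King County data), computes \(b_0\) and \(b_1\) directly, and compares them with the coefficients from R's built-in `lm()`.

```r
# Toy data (hypothetical values, not from the King County dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: sum of cross-deviations divided by sum of squared deviations in x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Compare against R's built-in least squares fit
fit     <- lm(y ~ x)
manual  <- c(b0, b1)
builtin <- unname(coef(fit))
print(rbind(manual, builtin))

# Predict the dependent variable for a new observation, x = 6
y_hat_new <- b0 + b1 * 6
print(y_hat_new)
```

The manual and built-in rows should agree up to floating-point error, since `lm()` solves the same least squares problem.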

A Kaggle Dataset called House Sales in King County, USA will be used to illustrate Linear Regression.

## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

The target variable will be price.

The price variable will be transformed to \(\sqrt{price}\).
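A square-root transformation is commonly used to reduce right skew in a variable such as price; that this is the motivation here is an assumption. The sketch below illustrates the effect on simulated data (a lognormal sample standing in for prices, not the actual dataset) using a hand-written sample skewness function.

```r
set.seed(42)  # for reproducibility

# Simulated right-skewed "prices" (hypothetical; lognormal is only a stand-in)
sim_price <- rlnorm(10000, meanlog = 13, sdlog = 0.5)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw  <- skewness(sim_price)
skew_sqrt <- skewness(sqrt(sim_price))

# The square-root scale should show noticeably less right skew
print(c(raw = skew_raw, sqrt = skew_sqrt))
```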

Code for the target variable density plot

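library(ggplot2)  # assumes `data` is the King County data frame shown above

# Histogram (density scale) of sqrt(price) with a kernel density overlay
ggplot(data, aes(x = sqrt(price))) +
  geom_histogram(aes(y = after_stat(density)),  # replaces deprecated ..density..
                 colour = "black",
                 fill = "white") +
  geom_density(alpha = 0.2, fill = "blue") +
  labs(x = "Square Root of Sale Price",
       y = "Density",
       title = "Distribution of Square Root of Sale Price") +
  theme_classic()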

Use sqft_living to predict price.

\(\hat{price} = \beta_0 + \beta_1\cdot (sqft.living)\)

Try a second-order polynomial regression.

\(\hat{price} = \beta_0 + \beta_1\cdot (sqft.living) + \beta_2\cdot (sqft.living)^2\)
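A second-order polynomial regression includes both a linear and a squared term. In R, `poly(x, 2)` builds an orthogonal polynomial basis by default, so its coefficients do not map directly onto \(\beta_1\) and \(\beta_2\), although the fitted curve is the same as with raw terms. The sketch below shows this on simulated data; the variable names and generating values are hypothetical.

```r
set.seed(1)  # for reproducibility

# Simulated data (hypothetical; stands in for sqft_living and price)
sqft  <- runif(200, min = 500, max = 5000)
price <- 50000 + 100 * sqft + 0.02 * sqft^2 + rnorm(200, sd = 20000)
sim   <- data.frame(sqft = sqft, price = price)

# Orthogonal polynomial basis (the default for poly())
fit_orth <- lm(price ~ poly(sqft, 2), data = sim)

# Raw polynomial terms: coefficients correspond to beta_1 and beta_2
fit_raw <- lm(price ~ sqft + I(sqft^2), data = sim)

# Same fitted curve, different coefficient values
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(same_fit)
print(coef(fit_raw))
```

Passing `raw = TRUE` to `poly()` is another way to get coefficients on the original scale; the orthogonal basis is usually preferred for numerical stability.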

Add grade to the model.

mod <- lm(price ~ poly(sqft_living,2) + grade, data = data)
summary(mod)
## 
## Call:
## lm(formula = price ~ poly(sqft_living, 2) + grade, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5668007  -128773   -27868    88013  2549048 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -296784      16396  -18.10   <2e-16 ***
## poly(sqft_living, 2)1 23484670     367466   63.91   <2e-16 ***
## poly(sqft_living, 2)2 12001335     238156   50.39   <2e-16 ***
## grade                   109297       2131   51.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 237000 on 21609 degrees of freedom
## Multiple R-squared:  0.5835, Adjusted R-squared:  0.5834 
## F-statistic: 1.009e+04 on 3 and 21609 DF,  p-value: < 2.2e-16
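Several numbers in this summary can be reproduced from the others, which is a useful way to read the table. The sketch below copies the printed (rounded) values from the output above and recomputes each t value as estimate divided by standard error, and the adjusted R-squared from the multiple R-squared. Small discrepancies are possible because the inputs are rounded.

```r
# Values copied from the summary() output above
est <- c(intercept = -296784, poly1 = 23484670, poly2 = 12001335, grade = 109297)
se  <- c(intercept = 16396,   poly1 = 367466,   poly2 = 238156,   grade = 2131)

# t value = estimate / standard error
t_val <- est / se
print(round(t_val, 2))

# Adjusted R-squared from R-squared, n observations, and p predictor terms
r2 <- 0.5835
n  <- 21613   # observations in the data frame
p  <- 3       # two polynomial terms plus grade
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The residual degrees of freedom reported in the output, 21609, equal \(n - p - 1\).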

Multiple Regression Inputs and Output

\(x = sqft.living\); \(y = grade\); \(z = price\)