Introduction

Greetings! These are my course notes from Chapter 5 of the course Inference for Linear Regression. The instructor for this course is Jo Hardin, and the course is available on DataCamp.

Let us load the required packages

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(broom)
library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
library(infer)

Let us import the data

LAhomes <- read.csv("https://assets.datacamp.com/production/course_3623/datasets/LAhomes.csv")

Let us attempt to inspect the data

head(LAhomes)
##         city type bed bath garage sqft pool spa  price
## 1 Long Beach        0    1         513       NA 119000
## 2 Long Beach        0    1         550       NA 153000
## 3 Long Beach        0    1         550       NA 205000
## 4 Long Beach        0    1      1 1030       NA 300000
## 5 Long Beach        0    1      1 1526       NA 375000
## 6 Long Beach        1    1         552       NA 159900
tail(LAhomes)
##          city type bed bath garage sqft pool spa   price
## 1589 Westwood  SFR   3 1.25   <NA> 1594       NA  949000
## 1590 Westwood  SFR   3 2.00      2 1579       NA 1034000
## 1591 Westwood  SFR   3 2.50   <NA> 2372    Y  NA 1250000
## 1592 Westwood  SFR   3 3.00      2 1870       NA 1350000
## 1593 Westwood  SFR   3 3.00   <NA> 1488       NA 1198000
## 1594 Westwood  SFR   5 3.50   <NA> 3656       NA 1995000
str(LAhomes)
## 'data.frame':    1594 obs. of  9 variables:
##  $ city  : Factor w/ 4 levels "Beverly Hills",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ type  : Factor w/ 3 levels "","Condo/Twh",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bed   : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ bath  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ garage: Factor w/ 5 levels "","1","2","3",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ sqft  : int  513 550 550 1030 1526 552 558 596 744 750 ...
##  $ pool  : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ spa   : logi  NA NA NA NA NA NA ...
##  $ price : num  119000 153000 205000 300000 375000 ...
summary(LAhomes)
##             city             type          bed              bath       
##  Beverly Hills: 232            : 39   Min.   : 0.000   Min.   : 1.000  
##  Long Beach   :1062   Condo/Twh:639   1st Qu.: 2.000   1st Qu.: 1.000  
##  Santa Monica : 204   SFR      :916   Median : 3.000   Median : 2.000  
##  Westwood     :  96                   Mean   : 2.755   Mean   : 2.444  
##                                       3rd Qu.: 3.000   3rd Qu.: 3.000  
##                                       Max.   :17.000   Max.   :30.000  
##   garage         sqft       pool       spa              price         
##      :388   Min.   :  403    :1448   Mode:logical   Min.   :  100000  
##  1   :260   1st Qu.:  960   Y: 146   NA's:1594      1st Qu.:  275000  
##  2   :666   Median : 1380                           Median :  488944  
##  3   : 37   Mean   : 1963                           Mean   : 1254851  
##  4+  :  6   3rd Qu.: 2078                           3rd Qu.:  988750  
##  NA's:237   Max.   :28000                           Max.   :89950000
names(LAhomes)
## [1] "city"   "type"   "bed"    "bath"   "garage" "sqft"   "pool"   "spa"   
## [9] "price"

Transformed Model

Instructions:

  • Using tidy output, run an lm analysis on price versus sqft for the LAhomes dataset
  • Run one more analysis, but this time on transformed variables: log(price) versus log(sqft)

# Create a tidy model
lm(price ~ sqft, data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -1661892.   64460.      -25.8 8.85e-123
## 2 sqft            1486.      22.7      65.4 0.
# Create a tidy model using the log of both variables
lm(log(price) ~ log(sqft), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     2.70    0.144       18.8 1.97e-71
## 2 log(sqft)       1.44    0.0195      73.8 0.

Interpreting transformed coefficients

Transforming variables is a powerful tool when running linear regressions. However, the parameter estimates in a model with transformed variables must be interpreted carefully.

Consider data collected by Andrew Bray at Reed College on characteristics of LA homes in 2010. The model is given below, and your task is to provide the appropriate interpretation of the coefficient on log(sqft).

Note: you must be careful to avoid causative interpretations. Additional square footage does not necessarily cause the price of a specific house to go up. The interpretation of the coefficient describes the estimate of the average price of homes at a given square footage.

You will need to run the linear model before answering the question:

lm(log(price) ~ log(sqft), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     2.70    0.144       18.8 1.97e-71
## 2 log(sqft)       1.44    0.0195      73.8 0.

ANSWER: Each additional 1% of square footage produces an estimate of the average price which is 1.44% higher.
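
The percentage reading follows from back-transforming the model: if log(price) = b0 + b1 * log(sqft), then price is proportional to sqft^b1, so multiplying sqft by 1.01 multiplies the predicted average price by 1.01^b1. The short sketch below checks this arithmetic using the rounded slope 1.44 from the tidy output; the variable names are illustrative and not part of the course code.

```r
# Slope on log(sqft), rounded, taken from the tidy output above
beta_log_sqft <- 1.44

# Multiplicative change in the predicted average price
# when sqft increases by 1% and by 10%
ratio_1pct  <- 1.01 ^ beta_log_sqft
ratio_10pct <- 1.10 ^ beta_log_sqft

# Express the changes as percentages
pct_1  <- (ratio_1pct - 1) * 100
pct_10 <- (ratio_10pct - 1) * 100
round(c(pct_1, pct_10), 2)
```

For a 1% increase the result is very close to the slope itself (about 1.44%), which is why the coefficient can be read directly as a percentage. For larger changes such as 10%, the shortcut is only approximate (about 14.7% rather than 14.4%).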

LA Homes, multicollinearity (1)

Instructions:

  • Run a tidy lm on the log transformed variables price and sqft from the dataset LAhomes
  • Notice whether the relationship is positive or negative and whether or not the relationship is significant

# Output the tidy model
lm(log(price) ~ log(sqft), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     2.70    0.144       18.8 1.97e-71
## 2 log(sqft)       1.44    0.0195      73.8 0.

LA Homes, multicollinearity (2)

Instructions:

  • Run a tidy lm on the log transformed variables price and bath from the dataset LAhomes
  • Notice whether the relationship is positive or negative and whether or not the relationship is significant

# Output the tidy model
lm(log(price) ~ log(bath), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    12.2     0.0280     437.  0.       
## 2 log(bath)       1.43    0.0306      46.6 9.66e-300

LA Homes, multicollinearity (3)

Instructions:

  • Run a tidy lm on the log transformed variables price and both of sqft and bath from the dataset LAhomes. Use the formula: log(price) ~ log(sqft) + log(bath)
  • Now look at the coefficients separately. What happened to the signs of each of the coefficients? What happened to the significance of each of the coefficients?

# Output the tidy model
lm(log(price) ~ log(sqft) + log(bath), data = LAhomes) %>% tidy()
## # A tibble: 3 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   2.51      0.262      9.60  2.96e- 21
## 2 log(sqft)     1.47      0.0395    37.2   1.19e-218
## 3 log(bath)    -0.0390    0.0453    -0.862 3.89e-  1
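
In the output above, the coefficient on log(bath) changes from positive and significant (when it is the only predictor) to slightly negative and not significant once log(sqft) is in the model. A small simulation can illustrate how correlated predictors produce this kind of behaviour. This is only a sketch on made-up data, not the LAhomes data: x1, x2 and y are hypothetical stand-ins, and y is constructed to depend on x1 alone.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)                   # hypothetical stand-in for log(sqft)
x2 <- x1 + rnorm(n, sd = 0.3)    # strongly correlated with x1, like log(bath)
y  <- 1.5 * x1 + rnorm(n)        # response depends on x1 only

# Correlation between the two predictors
r_x1x2 <- cor(x1, x2)

# Coefficient tables: each predictor alone, then both together
fit_x1   <- coef(summary(lm(y ~ x1)))
fit_x2   <- coef(summary(lm(y ~ x2)))
fit_both <- coef(summary(lm(y ~ x1 + x2)))

r_x1x2
fit_x2
fit_both
```

On its own, x2 picks up the effect of x1 and looks like a strong positive predictor. With both predictors in the model, the x2 coefficient only measures what x2 adds beyond x1, and the standard error on x1 grows because the two predictors carry overlapping information.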

Inference on coefficients

Importing the necessary data for this exercise

Before working through the instructions for this exercise, let us import the necessary data

restNYC <- read.csv("https://assets.datacamp.com/production/course_3623/datasets/restNYC.csv")

Let us examine the data

head(restNYC)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
## 4    4             Bottino    41   20    20      17    0
## 5    5          Da Umberto    54   24    19      21    0
## 6    6            Le Madri    52   22    22      21    0
tail(restNYC)
##     Case         Restaurant Price Food Decor Service East
## 163  163 Sambuca, Trattoria    31   19    16      18    0
## 164  164               Baci    31   17    15      16    0
## 165  165            Puccini    26   20    16      17    0
## 166  166         Bella Luna    31   18    16      17    0
## 167  167            Métisse    38   22    17      21    0
## 168  168            Gennaro    34   24    10      16    0
str(restNYC)
## 'data.frame':    168 obs. of  7 variables:
##  $ Case      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Restaurant: Factor w/ 168 levels "Amarone","Anche Vivolo",..: 47 149 18 20 46 89 90 119 16 37 ...
##  $ Price     : int  43 32 34 41 54 52 34 34 39 44 ...
##  $ Food      : int  22 20 21 20 24 22 22 20 22 21 ...
##  $ Decor     : int  18 19 13 20 19 22 16 18 19 17 ...
##  $ Service   : int  20 19 18 17 21 21 21 21 22 19 ...
##  $ East      : int  0 0 0 0 0 0 0 1 1 1 ...
summary(restNYC)
##       Case               Restaurant      Price           Food     
##  Min.   :  1.00   Amarone     :  1   Min.   :19.0   Min.   :16.0  
##  1st Qu.: 42.75   Anche Vivolo:  1   1st Qu.:36.0   1st Qu.:19.0  
##  Median : 84.50   Andiamo     :  1   Median :43.0   Median :20.5  
##  Mean   : 84.50   Arno        :  1   Mean   :42.7   Mean   :20.6  
##  3rd Qu.:126.25   Artusi      :  1   3rd Qu.:50.0   3rd Qu.:22.0  
##  Max.   :168.00   Baci        :  1   Max.   :65.0   Max.   :25.0  
##                   (Other)     :162                                
##      Decor          Service          East      
##  Min.   : 6.00   Min.   :14.0   Min.   :0.000  
##  1st Qu.:16.00   1st Qu.:18.0   1st Qu.:0.000  
##  Median :18.00   Median :20.0   Median :1.000  
##  Mean   :17.69   Mean   :19.4   Mean   :0.631  
##  3rd Qu.:19.00   3rd Qu.:21.0   3rd Qu.:1.000  
##  Max.   :25.00   Max.   :24.0   Max.   :1.000  
## 
names(restNYC)
## [1] "Case"       "Restaurant" "Price"      "Food"       "Decor"     
## [6] "Service"    "East"

Instructions for this exercise

  • Run a tidy lm regressing Price on Service
  • Run a tidy lm regressing Price on Service, Food and Decor
  • What happened to the significance of Service when additional variables were added to the model?

# Output the first model
lm(Price ~ Service, data = restNYC) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -12.0      5.11      -2.34 2.02e- 2
## 2 Service         2.82     0.262     10.8  7.88e-21
# Output the second model
lm(Price ~ Service + Food + Decor, data = restNYC) %>% tidy()
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -24.6       4.75     -5.18  6.33e- 7
## 2 Service        0.135     0.396     0.341 7.33e- 1
## 3 Food           1.56      0.373     4.17  4.93e- 5
## 4 Decor          1.85      0.218     8.49  1.17e-14

Interpreting coefficients

What is the correct interpretation of the coefficient on Service in the linear model which regresses Price on Service, Food, and Decor?

You will need to run the linear model before answering the question: lm(Price ~ Service + Food + Decor, data=restNYC) %>% tidy()

# Let us run the model 
lm(Price ~ Service + Food + Decor, data = restNYC) %>% tidy()
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -24.6       4.75     -5.18  6.33e- 7
## 2 Service        0.135     0.396     0.341 7.33e- 1
## 3 Food           1.56      0.373     4.17  4.93e- 5
## 4 Decor          1.85      0.218     8.49  1.17e-14

ANSWER: Given that Food and Decor are in the model, Service is not significant, and we cannot know whether it has an effect on modeling Price.
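
One way to see why no conclusion can be drawn about Service is to rebuild its t-statistic, p-value and a 95% confidence interval from the tidy output. The sketch below uses the rounded estimate and standard error printed above, so its results only approximately match the table; the residual degrees of freedom are taken as 168 observations minus 4 estimated coefficients. The variable names are illustrative.

```r
# Rounded values for Service from the tidy output above
estimate  <- 0.135
std_error <- 0.396

# Residual degrees of freedom: observations minus estimated coefficients
n_obs    <- 168
n_coef   <- 4
df_resid <- n_obs - n_coef

# t-statistic and two-sided p-value
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the Service coefficient
t_crit <- qt(0.975, df = df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_error

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 3)
```

The interval covers zero as well as clearly positive and negative values, so, given Food and Decor, the data are consistent with Service having no association with Price as well as with a modest one. With the fitted model object available, `tidy(conf.int = TRUE)` from broom reports the interval directly.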