Greetings! These are my course notes from Chapter 5 of the course Inference for Linear Regression. The instructor for this course is Jo Hardin, and the course is available on DataCamp.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(broom)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(infer)
LAhomes <- read.csv("https://assets.datacamp.com/production/course_3623/datasets/LAhomes.csv")
head(LAhomes)
##         city type bed bath garage sqft pool spa  price
## 1 Long Beach        0    1         513       NA 119000
## 2 Long Beach        0    1         550       NA 153000
## 3 Long Beach        0    1         550       NA 205000
## 4 Long Beach        0    1      1 1030       NA 300000
## 5 Long Beach        0    1      1 1526       NA 375000
## 6 Long Beach        1    1         552       NA 159900
tail(LAhomes)
##          city type bed bath garage sqft pool spa   price
## 1589 Westwood  SFR   3 1.25   <NA> 1594       NA  949000
## 1590 Westwood  SFR   3 2.00      2 1579       NA 1034000
## 1591 Westwood  SFR   3 2.50   <NA> 2372    Y  NA 1250000
## 1592 Westwood  SFR   3 3.00      2 1870       NA 1350000
## 1593 Westwood  SFR   3 3.00   <NA> 1488       NA 1198000
## 1594 Westwood  SFR   5 3.50   <NA> 3656       NA 1995000
str(LAhomes)
## 'data.frame': 1594 obs. of 9 variables:
## $ city : Factor w/ 4 levels "Beverly Hills",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ type : Factor w/ 3 levels "","Condo/Twh",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bed : int 0 0 0 0 0 1 1 1 1 1 ...
## $ bath : num 1 1 1 1 1 1 1 1 1 1 ...
## $ garage: Factor w/ 5 levels "","1","2","3",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ sqft : int 513 550 550 1030 1526 552 558 596 744 750 ...
## $ pool : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ spa : logi NA NA NA NA NA NA ...
## $ price : num 119000 153000 205000 300000 375000 ...
summary(LAhomes)
##             city             type          bed              bath
##  Beverly Hills: 232            : 39   Min.   : 0.000   Min.   : 1.000
##  Long Beach   :1062   Condo/Twh:639   1st Qu.: 2.000   1st Qu.: 1.000
##  Santa Monica : 204   SFR      :916   Median : 3.000   Median : 2.000
##  Westwood     :  96                   Mean   : 2.755   Mean   : 2.444
##                                       3rd Qu.: 3.000   3rd Qu.: 3.000
##                                       Max.   :17.000   Max.   :30.000
##   garage         sqft        pool       spa              price
##      :388   Min.   :  403    :1448   Mode:logical   Min.   :  100000
##  1   :260   1st Qu.:  960   Y: 146   NA's:1594      1st Qu.:  275000
##  2   :666   Median : 1380                           Median :  488944
##  3   : 37   Mean   : 1963                           Mean   : 1254851
##  4+  :  6   3rd Qu.: 2078                           3rd Qu.:  988750
##  NA's:237   Max.   :28000                           Max.   :89950000
names(LAhomes)
## [1] "city" "type" "bed" "bath" "garage" "sqft" "pool" "spa"
## [9] "price"
Instructions:
- Using tidy output, run an lm analysis on price versus sqft for the LAhomes dataset.
- Run one more analysis, but this time on transformed variables: log(price) versus log(sqft).
# Create a tidy model
lm(price ~ sqft, data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1661892. 64460. -25.8 8.85e-123
## 2 sqft 1486. 22.7 65.4 0.
# Create a tidy model using the log of both variables
lm(log(price) ~ log(sqft), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.70 0.144 18.8 1.97e-71
## 2 log(sqft) 1.44 0.0195 73.8 0.
Transforming variables is a powerful tool to use when running linear regressions. However, the parameter estimates must be carefully interpreted in a model with transformed variables.
Consider data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010. The model is given below, and your task is to provide the appropriate interpretation of the coefficient on log(sqft).
Note: you must be careful to avoid causative interpretations. Additional square footage does not necessarily cause the price of a specific house to go up. The interpretation of the coefficient describes the estimate of the average price of homes at a given square footage.
You will need to run the linear model before answering the question:
lm(log(price) ~ log(sqft), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.70 0.144 18.8 1.97e-71
## 2 log(sqft) 1.44 0.0195 73.8 0.
ANSWER: Each additional 1% of square footage produces an estimate of the average price which is 1.44% higher.
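To see where the 1.44% figure comes from, here is a minimal sketch that plugs the reported slope into the log-log relationship. The variable names (`slope`, `sqft_ratio`, `price_ratio`) are my own, and the slope is hard-coded from the rounded tidy output above rather than re-estimated from the data.

```r
# Slope on log(sqft), copied from the tidy output above (rounded value)
slope <- 1.44

# In a log-log model, multiplying sqft by a factor k multiplies the
# estimated price by k^slope
sqft_ratio  <- 1.01                 # a 1% increase in square footage
price_ratio <- sqft_ratio^slope

# Convert the ratio into a percentage change
pct_change <- (price_ratio - 1) * 100
round(pct_change, 2)                # about 1.44
```

The "1% more sqft gives about 1.44% higher estimated price" reading is an approximation that works well for small changes. For larger changes (say `sqft_ratio <- 1.10`), it is safer to compute `sqft_ratio^slope` directly than to multiply the slope by the percentage.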
Instructions:
- Run a tidy lm on the log transformed variables price and sqft from the dataset LAhomes.
- Notice whether the relationship is positive or negative and whether or not the relationship is significant.
# Output the tidy model
lm(log(price) ~ log(sqft), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.70 0.144 18.8 1.97e-71
## 2 log(sqft) 1.44 0.0195 73.8 0.
Instructions:
- Run a tidy lm on the log transformed variables price and bath from the dataset LAhomes.
- Notice whether the relationship is positive or negative and whether or not the relationship is significant.
# Output the tidy model
lm(log(price) ~ log(bath), data = LAhomes) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.2 0.0280 437. 0.
## 2 log(bath) 1.43 0.0306 46.6 9.66e-300
Instructions:
- Run a tidy lm on the log transformed variables price and both of sqft and bath from the dataset LAhomes. Use the formula: log(price) ~ log(sqft) + log(bath)
- Now look at the coefficients separately. What happened to the signs of each of the coefficients? What happened to the significance of each of the coefficients?
# Output the tidy model
lm(log(price) ~ log(sqft) + log(bath), data = LAhomes) %>% tidy()
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.51 0.262 9.60 2.96e- 21
## 2 log(sqft) 1.47 0.0395 37.2 1.19e-218
## 3 log(bath) -0.0390 0.0453 -0.862 3.89e- 1
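In the output above, the coefficient on log(bath) goes from 1.43 (highly significant) on its own to -0.0390 (p-value of about 0.39) once log(sqft) is in the model. One common reason for this kind of change is that the explanatory variables are correlated with each other. The sketch below illustrates the idea on simulated data, not on LAhomes: `x1`, `x2`, and `y` are hypothetical variables I made up, and only base R is used so that it runs without extra packages.

```r
set.seed(42)
n <- 500

# Hypothetical, simulated data: x2 is strongly correlated with x1,
# but only x1 actually drives y
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
y  <- 2 * x1 + rnorm(n)

# On its own, x2 looks like a strong predictor of y
fit_x2_alone <- lm(y ~ x2)
summary(fit_x2_alone)$coefficients

# With x1 in the model, x2 carries little extra information, so its
# coefficient should be close to zero with a much larger standard error
fit_both <- lm(y ~ x1 + x2)
summary(fit_both)$coefficients

# Correlation between the two explanatory variables
cor(x1, x2)
```

On the real data, a comparable check would be to look at the correlation between log(sqft) and log(bath) in LAhomes. I have not run that here, so treat the collinearity explanation as a plausible reading rather than a confirmed one.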
Before working through the instructions for this exercise, let us import the necessary data.
restNYC <- read.csv("https://assets.datacamp.com/production/course_3623/datasets/restNYC.csv")
head(restNYC)
## Case Restaurant Price Food Decor Service East
## 1 1 Daniella Ristorante 43 22 18 20 0
## 2 2 Tello's Ristorante 32 20 19 19 0
## 3 3 Biricchino 34 21 13 18 0
## 4 4 Bottino 41 20 20 17 0
## 5 5 Da Umberto 54 24 19 21 0
## 6 6 Le Madri 52 22 22 21 0
tail(restNYC)
## Case Restaurant Price Food Decor Service East
## 163 163 Sambuca, Trattoria 31 19 16 18 0
## 164 164 Baci 31 17 15 16 0
## 165 165 Puccini 26 20 16 17 0
## 166 166 Bella Luna 31 18 16 17 0
## 167 167 Métisse 38 22 17 21 0
## 168 168 Gennaro 34 24 10 16 0
str(restNYC)
## 'data.frame': 168 obs. of 7 variables:
## $ Case : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Restaurant: Factor w/ 168 levels "Amarone","Anche Vivolo",..: 47 149 18 20 46 89 90 119 16 37 ...
## $ Price : int 43 32 34 41 54 52 34 34 39 44 ...
## $ Food : int 22 20 21 20 24 22 22 20 22 21 ...
## $ Decor : int 18 19 13 20 19 22 16 18 19 17 ...
## $ Service : int 20 19 18 17 21 21 21 21 22 19 ...
## $ East : int 0 0 0 0 0 0 0 1 1 1 ...
summary(restNYC)
## Case Restaurant Price Food
## Min. : 1.00 Amarone : 1 Min. :19.0 Min. :16.0
## 1st Qu.: 42.75 Anche Vivolo: 1 1st Qu.:36.0 1st Qu.:19.0
## Median : 84.50 Andiamo : 1 Median :43.0 Median :20.5
## Mean : 84.50 Arno : 1 Mean :42.7 Mean :20.6
## 3rd Qu.:126.25 Artusi : 1 3rd Qu.:50.0 3rd Qu.:22.0
## Max. :168.00 Baci : 1 Max. :65.0 Max. :25.0
## (Other) :162
## Decor Service East
## Min. : 6.00 Min. :14.0 Min. :0.000
## 1st Qu.:16.00 1st Qu.:18.0 1st Qu.:0.000
## Median :18.00 Median :20.0 Median :1.000
## Mean :17.69 Mean :19.4 Mean :0.631
## 3rd Qu.:19.00 3rd Qu.:21.0 3rd Qu.:1.000
## Max. :25.00 Max. :24.0 Max. :1.000
##
names(restNYC)
## [1] "Case" "Restaurant" "Price" "Food" "Decor"
## [6] "Service" "East"
# Output the first model
lm(Price ~ Service, data = restNYC) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -12.0 5.11 -2.34 2.02e- 2
## 2 Service 2.82 0.262 10.8 7.88e-21
# Output the second model
lm(Price ~ Service + Food + Decor, data = restNYC) %>% tidy()
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -24.6 4.75 -5.18 6.33e- 7
## 2 Service 0.135 0.396 0.341 7.33e- 1
## 3 Food 1.56 0.373 4.17 4.93e- 5
## 4 Decor 1.85 0.218 8.49 1.17e-14
What is the correct interpretation of the coefficient on Service in the linear model which regresses Price on Service, Food, and Decor?
You will need to run the linear model before answering the question: lm(Price ~ Service + Food + Decor, data=restNYC) %>% tidy()
# Let us run the model
lm(Price ~ Service + Food + Decor, data = restNYC) %>% tidy()
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -24.6 4.75 -5.18 6.33e- 7
## 2 Service 0.135 0.396 0.341 7.33e- 1
## 3 Food 1.56 0.373 4.17 4.93e- 5
## 4 Decor 1.85 0.218 8.49 1.17e-14
ANSWER: Given that Food and Decor are in the model, Service is not significant (p-value of about 0.73), so we cannot conclude whether it has an effect on Price once the other two variables are accounted for.
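One way to make "we cannot know" more concrete is to look at a confidence interval for the Service coefficient. The sketch below rebuilds an approximate 95% interval by hand from the rounded estimate and standard error in the tidy output above. It assumes all 168 restaurants were used in the fit (the summary shows no missing values), and the variable names are my own.

```r
# Values copied from the tidy output above (rounded)
estimate  <- 0.135
std_error <- 0.396

# Residual degrees of freedom: 168 restaurants minus 4 estimated coefficients
df_resid <- 168 - 4

# Two-sided 95% confidence interval based on the t distribution
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

The interval covers both negative and positive values, which is consistent with the large p-value: the data are compatible with Service having a small negative, zero, or modest positive association with Price after adjusting for Food and Decor. With the fitted model object available, `tidy(conf.int = TRUE)` from broom should give the same interval without the rounding.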