Before beginning this assignment, please ensure you have access to R and RStudio.
Download the problemset7.Rmd file from Canvas. Open problemset7.Rmd in RStudio and supply your solutions to the assignment by editing problemset7.Rmd.
Replace the “Insert Your Name Here” text in the author: field with your own full name. Any collaborators must be listed on the top of your assignment.
All materials and resources that you use (with the exception of lecture slides) must be appropriately referenced within your assignment. In particular, note that Stack Overflow content is licensed under Creative Commons (CC BY-SA). This means you must attribute any code you adapt from SO.
Partial credit will be awarded for each question for which a serious attempt at finding an answer has been shown. But please DO NOT submit pages and pages of hard-to-read code and attempts that are impossible to grade; that is, avoid redundancy. Remember that one of the key goals of a data scientist is to produce coherent reports that others can easily follow. Students are encouraged to attempt each question and to document their reasoning process even if they cannot find the correct answer. If you would like to include R code to show this process, but it does not run without errors, you can do so with the eval=FALSE option as follows:
a + b # these objects don't exist
# if you run this on its own it will give an error
When you have completed the assignment and have checked that your code both runs in the Console and knits correctly when you click Knit PDF, rename the knitted PDF file to ps7_YourLastName_YourFirstName.pdf, and submit the PDF file on Canvas.
Collaboration is often fun and useful, but each student must turn in an individual write-up in their own words as well as code/work that is their own. Regardless of whether you work with others, what you turn in must be your own work; this includes code and interpretation of results. The names of all collaborators must be listed on each assignment. Do not copy-and-paste from other students’ responses or code.
In this problem set you will need, at minimum, the following R packages.
# Load standard libraries
library(tidyverse)  # includes dplyr and ggplot2
library(MASS)       # modern applied statistics functions (Boston data, stepAIC)
library(data.table)
library(corrplot)   # correlation matrix plots
The Italian restaurants in New York City are legendary, and it’s time to put your newly developed regression modeling skills to work to understand how they operate. What are the factors that contribute to the price of a meal at Italian restaurants in New York City?
You will need to address this question with a series of multiple regression models. The Zagat guide is an influential review of restaurants. You will be looking at the numeric ratings published in the Zagat guide: each restaurant is rated on a scale of 0 to 30 for the quality of its food, decor, and service. The data consists of Zagat reviews of 168 Italian restaurants in New York City from 2001.
\begin{enumerate}
Inspect the data using your usual inspect data functions to get a sense of how the variables are encoded and what values they typically take on. For example, the East variable records whether the restaurant is located east or west of Fifth Avenue, which historically divides the island of Manhattan. Describe the data and variables.
nycData <- read.csv('nyc.csv') ## loading nyc data
## Inspect the data
nycData <- data.table(nycData)
head(nycData)
str(nycData)
## Classes 'data.table' and 'data.frame': 168 obs. of 7 variables:
## $ Case : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Restaurant: Factor w/ 168 levels "Amarone","Anche Vivolo",..: 47 149 18 20 46 89 90 119 16 37 ...
## $ Price : int 43 32 34 41 54 52 34 34 39 44 ...
## $ Food : int 22 20 21 20 24 22 22 20 22 21 ...
## $ Decor : int 18 19 13 20 19 22 16 18 19 17 ...
## $ Service : int 20 19 18 17 21 21 21 21 22 19 ...
## $ East : int 0 0 0 0 0 0 0 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(nycData)
## Case Restaurant Price Food
## Min. : 1.00 Amarone : 1 Min. :19.0 Min. :16.0
## 1st Qu.: 42.75 Anche Vivolo: 1 1st Qu.:36.0 1st Qu.:19.0
## Median : 84.50 Andiamo : 1 Median :43.0 Median :20.5
## Mean : 84.50 Arno : 1 Mean :42.7 Mean :20.6
## 3rd Qu.:126.25 Artusi : 1 3rd Qu.:50.0 3rd Qu.:22.0
## Max. :168.00 Baci : 1 Max. :65.0 Max. :25.0
## (Other) :162
## Decor Service East
## Min. : 6.00 Min. :14.0 Min. :0.000
## 1st Qu.:16.00 1st Qu.:18.0 1st Qu.:0.000
## Median :18.00 Median :20.0 Median :1.000
## Mean :17.69 Mean :19.4 Mean :0.631
## 3rd Qu.:19.00 3rd Qu.:21.0 3rd Qu.:1.000
## Max. :25.00 Max. :24.0 Max. :1.000
##
nycData %>%
  select_if(is.numeric) %>%
  dplyr::select(-Case) %>%  # drop the row-index column
  cor() %>%
  corrplot()
Findings: The nyc data set has 168 observations and seven variables: six integer variables and one factor. Each observation is one Italian restaurant in NYC, and the variables represent:
- ‘Case’: a numeric index for the 168 restaurants.
- ‘Restaurant’: the name of each restaurant that was reviewed.
- ‘Price’: the price of a meal.
- ‘Food’: the rated quality of the food.
- ‘Decor’: the rated quality of the decor.
- ‘Service’: the rated quality of the customer service.
- ‘East’: whether the restaurant is east (1) or west (0) of Fifth Avenue.
Based on your knowledge of the restaurant industry, do you think that the quality of the food in a restaurant is an important determinant of the price of a meal at that restaurant? How will you prove your intuition (that quality does or does not determine price) using the nyc data and your newly developed regression modeling skills? Before writing code, explain your choice of response variable, explanatory variable(s), and modeling technique.
Finding: - Yes, my experience suggests that food quality is an important determinant of the price of a meal at a restaurant. - Price is the response variable, and Food, Decor, Service, and East are the explanatory variables. A linear regression model is a suitable technique for finding the relationship between price and food quality while controlling for the other factors.
Now build the model based on the choices you made in the previous question (i.e. write code below).
### a linear regression model is the simplest way to develop and execute my intuition
lm_nycdata <- lm(Price ~ Food + Service + Decor + East, data = nycData)
Visualize the fitted model
### visualizing the fitted model (note: geom_smooth fits its own simple regression of Price on Food, not the full multiple regression)
ggplot(nycData, aes(Food, Price)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
Interpret the model and its coefficients, and see whether you were able to prove or disprove your intuition. On top of reporting the coefficient values and model fit, you need to interpret the coefficients in plain English.
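The coefficient values interpreted below come from the model's summary, which can be printed as follows (a minimal sketch using the lm_nycdata object fitted above; output omitted here for brevity):
## Coefficient table and fit statistics for the model fitted above
summary(lm_nycdata)
## 95% confidence intervals for the coefficients
confint(lm_nycdata)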
Finding: - Based on my model, the intercept of -24.023800 is the predicted price when all ratings are zero (and the restaurant is on the west side); since no restaurant has ratings anywhere near zero, the intercept has no practical interpretation on its own. - Holding Decor, Service, and location fixed, each one-unit increase in the food quality rating is associated with a $1.538120 increase in the price of a meal. This supports my intuition that food quality is an important determinant of price.
You plan to visit an Italian restaurant for lunch. You check the Zagat review for a restaurant called Plaza that your colleague has been raving about, and you find that Zagat has rated the quality of its food as 20. What's your best estimate of the price of a meal you would need to pay if you ended up going to Plaza for lunch?
## predict the price when the food quality rating is 20
lm_pricefood <- lm(Price ~ Food, data = nycData)
predict(lm_pricefood, newdata = tibble(Food = 20))
## 1
## 40.94705
Finding: - The best estimate of the price of a meal at Plaza is about $40.95, the fitted value of the simple regression of Price on Food at a food rating of 20.
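A point estimate alone hides the uncertainty around a single meal's price; a prediction interval can be requested from the same model (a minimal sketch; output not shown):
## 95% prediction interval for the price of one meal when Food = 20
predict(lm_pricefood, newdata = tibble(Food = 20), interval = "prediction")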
\end{enumerate}
Fifth Avenue is one of the most well-known streets in the world, renowned for its flagship stores for shopping, landmark hotels, and internationally recognized sites. 5th Avenue divides the island of Manhattan vertically. The city's east side has historically been home to expensive residences and opulent attractions. Maybe everything is more expensive on the East side. Maybe even food is more expensive on the east side.
## average meal price for restaurants east and west of Fifth Avenue
nycData %>%
  group_by(East) %>%
  summarize(mean_price = mean(Price))
## `summarise()` ungrouping output (override with `.groups` argument)
Finding: - The east side's mean meal price is 3.58339 higher than the west side's (44.01887 versus 40.43548, respectively).
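Whether this difference is statistically distinguishable from zero can be checked with a two-sample t-test (a minimal sketch; output not shown):
## Compare mean meal price east vs. west of Fifth Avenue
t.test(Price ~ East, data = nycData)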
## multiple regression model
Price_foodService <- lm(Price ~ Food + Service, data = nycData)
Finding: - Multiple regression lets us estimate the effect of each predictor while holding the others fixed. - After controlling for food quality, people are willing to pay $1.704 more for each additional point of the service rating.
## simple linear regression model for price and decor
Price_decor <- lm(Price ~ Decor, data = nycData)
## multiple regression model for price, food, service, and decor
Price_foodServiceDecor <- lm(Price ~ Food + Service + Decor, data = nycData)
Finding: - In the simple model, each additional point of the decor rating is associated with people paying $2.491 more per meal. - After controlling for food and service, the decor effect is attenuated: each additional point of the decor rating is associated with a $1.847 increase in the price of a meal.
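To see this attenuation side by side, the coefficients of the two models can be printed together (a minimal sketch using the model objects above; output not shown):
coef(Price_decor)             # Decor alone
coef(Price_foodServiceDecor)  # Decor after controlling for Food and Service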
You are given an auction dataset, the eBay auction data (https://www.modelingonlineauctions.com/datasets). You want to know how the final sale price of a particular item (“Palm Pilot M515 PDA”) is affected by the opening price and the type of auction. To do so, respond to the following:
## loading the eBay auction data
ebay_auction <- read.csv('auction.csv')
## inspecting the data
head(ebay_auction)
str(ebay_auction)
## 'data.frame': 10681 obs. of 9 variables:
## $ auctionid : num 1.64e+09 1.64e+09 1.64e+09 1.64e+09 1.64e+09 ...
## $ bid : num 175 100 120 150 178 ...
## $ bidtime : num 2.23 2.6 2.6 2.6 2.91 ...
## $ bidder : Factor w/ 3387 levels "-kim-","**balancedbody4u**",..: 2729 627 1764 1764 1046 354 2723 354 2723 354 ...
## $ bidderrate : int 0 0 2 2 4 2 1 2 1 2 ...
## $ openbid : num 99 99 99 99 99 1 1 1 1 1 ...
## $ price : num 178 178 178 178 178 ...
## $ item : Factor w/ 3 levels "Cartier wristwatch",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ auction_type: Factor w/ 3 levels "3 day auction",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(ebay_auction)
## auctionid bid bidtime
## Min. :1.639e+09 Min. : 0.01 Min. :0.000567
## 1st Qu.:3.015e+09 1st Qu.: 72.00 1st Qu.:1.949931
## Median :3.021e+09 Median : 140.00 Median :4.140833
## Mean :4.136e+09 Mean : 207.59 Mean :3.979628
## 3rd Qu.:8.212e+09 3rd Qu.: 210.00 3rd Qu.:6.448060
## Max. :8.216e+09 Max. :5400.00 Max. :6.999990
##
## bidder bidderrate openbid price
## warrencheryl: 45 Min. : -4.00 Min. : 0.01 Min. : 26.0
## mregestr : 33 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 186.5
## zebedin : 33 Median : 5.00 Median : 4.99 Median : 228.5
## babygirljrt : 32 Mean : 31.94 Mean : 52.25 Mean : 335.0
## macdonn : 31 3rd Qu.: 21.00 3rd Qu.: 50.00 3rd Qu.: 255.0
## (Other) :10491 Max. :3140.00 Max. :5000.00 Max. :5400.0
## NA's : 16 NA's :11
## item auction_type
## Cartier wristwatch :1953 3 day auction:2023
## Palm Pilot M515 PDA:5917 5 day auction:1617
## Xbox game console :2811 7 day auction:7041
##
## I created a new column based on the bid and price variables:
## if the bid equals the final price, that bid won the auction, coded "1";
## if the bid does not equal the final price, the bid did not win, coded "0".
ebay_auction1 <- ebay_auction %>%
  mutate(sold = ifelse(price == bid, 1, 0))
## Filtering "Palm Pilot M515 PDA" data
ebay_auction_PDA <- ebay_auction1 %>%
filter(item =="Palm Pilot M515 PDA")
table(ebay_auction1$sold)
##
## 0 1
## 9969 712
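As a sanity check on this construction, each auction should typically contribute exactly one winning bid. A minimal sketch using ebay_auction1 from above (output not shown):
## Count winning bids per auction; most auctions should show exactly 1
ebay_auction1 %>%
  group_by(auctionid) %>%
  summarize(n_winning = sum(sold)) %>%
  count(n_winning)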
## convert openbid and sold to factors
ebay_auction_PDA$openbid = as.factor(ebay_auction_PDA$openbid)
ebay_auction_PDA$sold = as.factor(ebay_auction_PDA$sold)
## build a logistic regression model, which is an appropriate model for the binary sold outcome
ebay_PDA_model <- glm(sold ~ openbid, family = "binomial", data = ebay_auction_PDA)
summary(ebay_PDA_model)
##
## Call:
## glm(formula = sold ~ openbid, family = "binomial", data = ebay_auction_PDA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2435 -0.3312 -0.3200 -0.2828 2.6444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.199311 0.127524 -25.088 < 2e-16 ***
## openbid0.07 -0.167985 1.025059 -0.164 0.869827
## openbid0.11 0.491260 1.040639 0.472 0.636873
## openbid1 0.252823 0.173007 1.461 0.143921
## openbid2 1.119869 1.068299 1.048 0.294513
## openbid5 -0.009515 0.525703 -0.018 0.985560
## openbid9 -0.266425 1.023481 -0.260 0.794622
## openbid9.95 -0.007181 0.310324 -0.023 0.981538
## openbid9.99 0.323865 0.226882 1.427 0.153448
## openbid10 0.411218 0.385946 1.065 0.286659
## openbid15 0.131258 0.734519 0.179 0.858174
## openbid19.99 0.031728 0.526107 0.060 0.951911
## openbid20 -0.096526 1.026304 -0.094 0.925068
## openbid25 0.178886 0.527691 0.339 0.734612
## openbid39 0.406103 0.608279 0.668 0.504373
## openbid40 1.002086 1.061778 0.944 0.345282
## openbid45.99 0.154788 1.031446 0.150 0.880710
## openbid50 0.479212 0.336392 1.425 0.154283
## openbid70 0.228896 0.736141 0.311 0.755846
## openbid74.9 0.366097 1.036863 0.353 0.724027
## openbid89 0.491260 1.040639 0.472 0.636873
## openbid99 1.119869 0.760764 1.472 0.141012
## openbid99.99 0.615313 0.533976 1.152 0.249188
## openbid100 0.742575 0.369996 2.007 0.044752 *
## openbid125 0.801415 1.052222 0.762 0.446274
## openbid140 0.366097 1.036863 0.353 0.724027
## openbid149.99 0.634361 0.744798 0.852 0.394368
## openbid150 0.840030 0.327826 2.562 0.010394 *
## openbid155 0.366097 1.036863 0.353 0.724027
## openbid175 1.615191 0.215385 7.499 6.43e-14 ***
## openbid179.99 1.059245 0.543759 1.948 0.051415 .
## openbid180 2.793846 0.921735 3.031 0.002437 **
## openbid185 1.119869 1.068299 1.048 0.294513
## openbid190 1.589873 0.785024 2.025 0.042841 *
## openbid199 2.862838 0.433233 6.608 3.89e-11 ***
## openbid199.95 1.813016 1.125283 1.611 0.107144
## openbid199.99 17.765378 882.743384 0.020 0.983944
## openbid200 1.667834 0.410231 4.066 4.79e-05 ***
## openbid200.99 0.308939 1.035286 0.298 0.765391
## openbid202.5 17.765378 882.743384 0.020 0.983944
## openbid215 1.813016 0.360415 5.030 4.90e-07 ***
## openbid219.99 3.199311 1.419952 2.253 0.024252 *
## openbid220 2.100698 0.826395 2.542 0.011022 *
## openbid225 3.199311 0.718514 4.453 8.48e-06 ***
## openbid230 3.199311 1.419952 2.253 0.024252 *
## openbid239.99 17.765378 882.743384 0.020 0.983944
## openbid240 3.353461 0.413551 8.109 5.11e-16 ***
## openbid245 17.765378 624.193840 0.028 0.977294
## openbid249.99 17.765378 882.743384 0.020 0.983944
## openbid250 17.765378 882.743384 0.020 0.983944
## openbid255 3.199311 0.826395 3.871 0.000108 ***
## openbid259 1.946548 0.811862 2.398 0.016501 *
## openbid259.95 17.765378 882.743384 0.020 0.983944
## openbid265 17.765378 882.743384 0.020 0.983944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2816.2 on 5916 degrees of freedom
## Residual deviance: 2573.2 on 5863 degrees of freedom
## AIC: 2681.2
##
## Number of Fisher Scoring iterations: 13
# visualize the model: scatterplot with jitter
ebay_PDA_openbid <- ggplot(data = ebay_auction_PDA, aes(y = sold, x = openbid)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve (geom_smooth expects method.args as a list)
ebay_PDA_openbid +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'))
## `geom_smooth()` using formula 'y ~ x'
Finding: - Most openbid levels have p-values greater than 0.05 and are not statistically significant, so at typical opening prices there is little evidence that the opening bid predicts whether a bid wins. - Several high opening bids (e.g., 175, 199, 200, 215, 225, 240, 255) do have significant positive coefficients; one plausible reading is that an opening price close to the eventual sale price leaves little room for competing bids, so a bid placed in such an auction is much more likely to end up as the winning one.
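Note that converting openbid to a factor gives every distinct opening price its own dummy variable, which is what produces the long, hard-to-read coefficient table above. An alternative sketch treats the opening bid as a numeric predictor with a single slope (pda_numeric and pda_model_numeric are hypothetical names; results not shown):
## Keep openbid numeric so the model estimates one slope instead of ~50 dummies
pda_numeric <- ebay_auction1 %>%
  filter(item == "Palm Pilot M515 PDA")
pda_model_numeric <- glm(sold ~ openbid, family = "binomial", data = pda_numeric)
summary(pda_model_numeric)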
## convert auction_type to a factor
ebay_auction_PDA$auction_type = as.factor(ebay_auction_PDA$auction_type)
## build a logistic regression model adding auction type, which is an appropriate model for the binary sold outcome
ebay_PDA_model2 <- glm(sold ~ openbid + auction_type, family = "binomial", data = ebay_auction_PDA)
summary(ebay_PDA_model2)
##
## Call:
## glm(formula = sold ~ openbid + auction_type, family = "binomial",
## data = ebay_auction_PDA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2506 -0.3288 -0.3133 -0.2843 2.6444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.249569 0.186433 -17.430 < 2e-16 ***
## openbid0.07 -0.141529 1.039058 -0.136 0.891656
## openbid0.11 0.480028 1.041007 0.461 0.644713
## openbid1 0.259693 0.174050 1.492 0.135685
## openbid2 1.108637 1.068657 1.037 0.299544
## openbid5 -0.007903 0.525854 -0.015 0.988009
## openbid9 -0.277657 1.023855 -0.271 0.786246
## openbid9.95 -0.018413 0.311556 -0.059 0.952872
## openbid9.99 0.335122 0.229249 1.462 0.143789
## openbid10 0.420039 0.395094 1.063 0.287719
## openbid15 0.139965 0.739306 0.189 0.849842
## openbid19.99 0.065957 0.533286 0.124 0.901568
## openbid20 -0.107758 1.026677 -0.105 0.916409
## openbid25 0.194510 0.529389 0.367 0.713303
## openbid39 0.394871 0.608909 0.648 0.516669
## openbid40 0.990854 1.062139 0.933 0.350879
## openbid45.99 0.181245 1.045360 0.173 0.862353
## openbid50 0.477775 0.336547 1.420 0.155714
## openbid70 0.255352 0.755512 0.338 0.735374
## openbid74.9 0.354865 1.037233 0.342 0.732255
## openbid89 0.517717 1.054431 0.491 0.623432
## openbid99 1.170128 0.772824 1.514 0.130002
## openbid99.99 0.629735 0.545451 1.155 0.248286
## openbid100 0.748056 0.376368 1.988 0.046860 *
## openbid125 0.790183 1.052586 0.751 0.452829
## openbid140 0.392554 1.050705 0.374 0.708695
## openbid149.99 0.623129 0.745312 0.836 0.403118
## openbid150 0.847466 0.329113 2.575 0.010024 *
## openbid155 0.354865 1.037233 0.342 0.732255
## openbid175 1.657811 0.243553 6.807 9.98e-12 ***
## openbid179.99 1.055854 0.544680 1.938 0.052564 .
## openbid180 2.782614 0.922151 3.018 0.002548 **
## openbid185 1.170128 1.076920 1.087 0.277236
## openbid190 1.624522 0.790737 2.054 0.039933 *
## openbid199 2.892531 0.440980 6.559 5.41e-11 ***
## openbid199.95 1.863275 1.133471 1.644 0.100204
## openbid199.99 17.754146 882.743385 0.020 0.983954
## openbid200 1.700239 0.420157 4.047 5.20e-05 ***
## openbid200.99 0.359197 1.044180 0.344 0.730847
## openbid202.5 17.815637 882.743395 0.020 0.983898
## openbid215 1.828723 0.365381 5.005 5.59e-07 ***
## openbid219.99 3.188079 1.420221 2.245 0.024783 *
## openbid220 2.150957 0.837510 2.568 0.010221 *
## openbid225 3.206923 0.722936 4.436 9.17e-06 ***
## openbid230 3.225767 1.430090 2.256 0.024093 *
## openbid239.99 17.754146 882.743385 0.020 0.983954
## openbid240 3.358426 0.413849 8.115 4.85e-16 ***
## openbid245 17.773116 624.177638 0.028 0.977284
## openbid249.99 17.791835 882.743401 0.020 0.983920
## openbid250 17.754146 882.743385 0.020 0.983954
## openbid255 3.229734 0.838893 3.850 0.000118 ***
## openbid259 1.983544 0.819846 2.419 0.015546 *
## openbid259.95 17.754146 882.743385 0.020 0.983954
## openbid265 17.754146 882.743385 0.020 0.983954
## auction_type5 day auction 0.023802 0.214043 0.111 0.911456
## auction_type7 day auction 0.061490 0.161408 0.381 0.703230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2816.2 on 5916 degrees of freedom
## Residual deviance: 2573.1 on 5861 degrees of freedom
## AIC: 2685.1
##
## Number of Fisher Scoring iterations: 13
# scatterplot with jitter, colored by auction type
ebay_PDA_auction <- ggplot(data = ebay_auction_PDA, aes(y = sold, x = openbid, color = auction_type)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve
ebay_PDA_auction +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'))
## `geom_smooth()` using formula 'y ~ x'
Finding: - The p-values for both auction_type coefficients (the 5-day and 7-day auctions, relative to the 3-day baseline) are well above 0.05, so neither is statistically significant. There is no evidence that auction type affects the probability that a bid wins once the opening bid is accounted for; adding auction_type barely changes the residual deviance (2573.2 to 2573.1) while costing two degrees of freedom.
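The same conclusion can be checked with a likelihood-ratio test of the two nested models (a minimal sketch using the model objects above; output not shown):
## Does adding auction_type significantly improve the fit?
anova(ebay_PDA_model, ebay_PDA_model2, test = "Chisq")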
Let’s revisit the Boston dataset from the previous problem (PS6). In PS6, you had investigated linear relationships between predictor variables and the response. What we didn’t investigate is the presence of any non-linear association.
\begin{enumerate}
Is there evidence of a non-linear association between any of the predictors and the response? Hint: To answer this question, for each predictor X fit a model of the form:
\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon
\]
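As an alternative to the predictor-by-predictor calls below, a compact sketch loops the cubic fit over every predictor; poly(..., 3, raw = TRUE) produces the same fits as the I() terms used below, and names like cubic_fits are hypothetical (output not shown):
# Fit medv ~ X + X^2 + X^3 for each predictor in one pass (a sketch)
library(MASS)
data(Boston)
predictors <- setdiff(names(Boston), "medv")
cubic_fits <- lapply(predictors, function(x) {
  f <- as.formula(paste0("medv ~ poly(", x, ", 3, raw = TRUE)"))
  lm(f, data = Boston)
})
names(cubic_fits) <- predictors
# e.g., summary(cubic_fits$lstat) reproduces the lstat model shown below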
library(MASS)
data(Boston)
## fit model for each predictor X
summary(lm(medv~crim + I(crim^2) + I(crim^3), data = Boston))
##
## Call:
## lm(formula = medv ~ crim + I(crim^2) + I(crim^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.983 -4.975 -1.940 2.881 33.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.519e+01 4.355e-01 57.846 < 2e-16 ***
## crim -1.136e+00 1.444e-01 -7.868 2.24e-14 ***
## I(crim^2) 2.378e-02 6.808e-03 3.494 0.000518 ***
## I(crim^3) -1.489e-04 6.641e-05 -2.242 0.025411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.159 on 502 degrees of freedom
## Multiple R-squared: 0.2177, Adjusted R-squared: 0.213
## F-statistic: 46.57 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~zn + I(zn^2) + I(zn^3), data = Boston))
##
## Call:
## lm(formula = medv ~ zn + I(zn^2) + I(zn^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.449 -5.549 -1.049 3.225 29.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.4485972 0.4359536 46.905 < 2e-16 ***
## zn 0.6433652 0.1105611 5.819 1.06e-08 ***
## I(zn^2) -0.0167646 0.0038872 -4.313 1.94e-05 ***
## I(zn^3) 0.0001257 0.0000316 3.978 7.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.43 on 502 degrees of freedom
## Multiple R-squared: 0.1649, Adjusted R-squared: 0.1599
## F-statistic: 33.05 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~indus + I(indus^2) + I(indus^3), data = Boston))
##
## Call:
## lm(formula = medv ~ indus + I(indus^2) + I(indus^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.760 -4.725 -1.009 2.932 32.038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.080160 1.663326 22.293 < 2e-16 ***
## indus -2.806994 0.509349 -5.511 5.71e-08 ***
## I(indus^2) 0.140462 0.041554 3.380 0.000781 ***
## I(indus^3) -0.002399 0.001011 -2.373 0.018026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.844 on 502 degrees of freedom
## Multiple R-squared: 0.2768, Adjusted R-squared: 0.2725
## F-statistic: 64.06 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~chas + I(chas^2) + I(chas^3), data = Boston))
##
## Call:
## lm(formula = medv ~ chas + I(chas^2) + I(chas^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.094 -5.894 -1.417 2.856 27.906
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.0938 0.4176 52.902 < 2e-16 ***
## chas 6.3462 1.5880 3.996 7.39e-05 ***
## I(chas^2) NA NA NA NA
## I(chas^3) NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.064 on 504 degrees of freedom
## Multiple R-squared: 0.03072, Adjusted R-squared: 0.02879
## F-statistic: 15.97 on 1 and 504 DF, p-value: 7.391e-05
summary(lm(medv~nox + I(nox^2) + I(nox^3), data = Boston))
##
## Call:
## lm(formula = medv ~ nox + I(nox^2) + I(nox^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.104 -5.020 -2.144 2.747 32.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.49 38.52 -0.584 0.5596
## nox 315.10 195.10 1.615 0.1069
## I(nox^2) -615.83 320.48 -1.922 0.0552 .
## I(nox^3) 350.19 170.92 2.049 0.0410 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.282 on 502 degrees of freedom
## Multiple R-squared: 0.1939, Adjusted R-squared: 0.189
## F-statistic: 40.24 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~rm + I(rm^2) + I(rm^3), data = Boston))
##
## Call:
## lm(formula = medv ~ rm + I(rm^2) + I(rm^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.102 -2.674 0.569 3.011 35.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 241.3108 47.3275 5.099 4.85e-07 ***
## rm -109.3906 22.9690 -4.763 2.51e-06 ***
## I(rm^2) 16.4910 3.6750 4.487 8.95e-06 ***
## I(rm^3) -0.7404 0.1935 -3.827 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.11 on 502 degrees of freedom
## Multiple R-squared: 0.5612, Adjusted R-squared: 0.5586
## F-statistic: 214 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~age + I(age^2) + I(age^3), data = Boston))
##
## Call:
## lm(formula = medv ~ age + I(age^2) + I(age^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.443 -4.909 -2.234 2.185 32.944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.893e+01 2.992e+00 9.668 <2e-16 ***
## age -1.224e-01 2.014e-01 -0.608 0.544
## I(age^2) 2.355e-03 3.930e-03 0.599 0.549
## I(age^3) -2.318e-05 2.279e-05 -1.017 0.310
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.472 on 502 degrees of freedom
## Multiple R-squared: 0.1566, Adjusted R-squared: 0.1515
## F-statistic: 31.06 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~dis + I(dis^2) + I(dis^3), data = Boston))
##
## Call:
## lm(formula = medv ~ dis + I(dis^2) + I(dis^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.571 -5.242 -2.037 2.397 34.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.03789 2.91134 2.417 0.01599 *
## dis 8.59284 2.06633 4.158 3.77e-05 ***
## I(dis^2) -1.24953 0.41235 -3.030 0.00257 **
## I(dis^3) 0.05602 0.02428 2.307 0.02146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.727 on 502 degrees of freedom
## Multiple R-squared: 0.105, Adjusted R-squared: 0.09968
## F-statistic: 19.64 on 3 and 502 DF, p-value: 4.736e-12
summary(lm(medv~rad + I(rad^2) + I(rad^3), data = Boston))
##
## Call:
## lm(formula = medv ~ rad + I(rad^2) + I(rad^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.630 -5.151 -2.017 3.169 33.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.251303 2.567860 11.781 < 2e-16 ***
## rad -3.799454 1.307156 -2.907 0.003815 **
## I(rad^2) 0.616347 0.186057 3.313 0.000991 ***
## I(rad^3) -0.020086 0.005717 -3.514 0.000482 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.37 on 502 degrees of freedom
## Multiple R-squared: 0.1767, Adjusted R-squared: 0.1718
## F-statistic: 35.91 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~tax + I(tax^2) + I(tax^3), data = Boston))
##
## Call:
## lm(formula = medv ~ tax + I(tax^2) + I(tax^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.109 -4.952 -1.878 2.957 33.694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.222e+01 1.397e+01 3.739 0.000206 ***
## tax -1.635e-01 1.133e-01 -1.443 0.149646
## I(tax^2) 3.029e-04 2.872e-04 1.055 0.292004
## I(tax^3) -2.079e-07 2.236e-07 -0.930 0.353061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.115 on 502 degrees of freedom
## Multiple R-squared: 0.2261, Adjusted R-squared: 0.2215
## F-statistic: 48.89 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~ptratio + I(ptratio^2) + I(ptratio^3), data = Boston))
##
## Call:
## lm(formula = medv ~ ptratio + I(ptratio^2) + I(ptratio^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7795 -5.0364 -0.9778 3.4766 31.1636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 312.28642 152.48693 2.048 0.0411 *
## ptratio -48.69114 26.88441 -1.811 0.0707 .
## I(ptratio^2) 2.83995 1.56413 1.816 0.0700 .
## I(ptratio^3) -0.05686 0.03005 -1.892 0.0590 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.898 on 502 degrees of freedom
## Multiple R-squared: 0.2669, Adjusted R-squared: 0.2625
## F-statistic: 60.91 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~black + I(black^2) + I(black^3), data = Boston))
##
## Call:
## lm(formula = medv ~ black + I(black^2) + I(black^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.005 -4.802 -1.613 2.852 28.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.260e+01 2.517e+00 5.006 7.7e-07 ***
## black -1.703e-02 6.150e-02 -0.277 0.782
## I(black^2) 2.036e-04 3.258e-04 0.625 0.532
## I(black^3) -2.224e-07 4.765e-07 -0.467 0.641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.685 on 502 degrees of freedom
## Multiple R-squared: 0.1135, Adjusted R-squared: 0.1082
## F-statistic: 21.43 on 3 and 502 DF, p-value: 4.463e-13
summary(lm(medv~lstat + I(lstat^2) + I(lstat^3), data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5441 -3.7122 -0.5145 2.4846 26.4153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.6496253 1.4347240 33.909 < 2e-16 ***
## lstat -3.8655928 0.3287861 -11.757 < 2e-16 ***
## I(lstat^2) 0.1487385 0.0212987 6.983 9.18e-12 ***
## I(lstat^3) -0.0020039 0.0003997 -5.013 7.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.396 on 502 degrees of freedom
## Multiple R-squared: 0.6578, Adjusted R-squared: 0.6558
## F-statistic: 321.7 on 3 and 502 DF, p-value: < 2.2e-16
Finding: - There is evidence of a non-linear association for most predictors: the squared and/or cubed terms are statistically significant for crim, zn, indus, rm, dis, rad, and lstat, and the cubic term for nox is significant at the 5% level. - There is no evidence of non-linearity for age, tax, or black (no polynomial term is significant), the ptratio terms are only marginally significant, and chas is binary, so its squared and cubed terms are dropped due to singularities.
In PS6, question 4, you had built a kitchen-sink model by fitting a multiple regression model to predict the response using all of the predictors. Now consider performing a stepwise model selection procedure to determine the best fit model. Discuss your results. How is this model different from the kitchen-sink model?
## kitchen-sink model from PS6
Mul_reg_Boston <- (lm(medv~., data = Boston))
summary(Mul_reg_Boston)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
# Stepwise regression model
step_reg_model <- MASS::stepAIC(Mul_reg_Boston, k=2, direction = "both")
## Start: AIC=1589.64
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1587.7
## - indus 1 2.52 11081 1587.8
## <none> 11079 1589.6
## - chas 1 218.97 11298 1597.5
## - tax 1 242.26 11321 1598.6
## - crim 1 243.22 11322 1598.6
## - zn 1 257.49 11336 1599.3
## - black 1 270.63 11349 1599.8
## - rad 1 479.15 11558 1609.1
## - nox 1 487.16 11566 1609.4
## - ptratio 1 1194.23 12273 1639.4
## - dis 1 1232.41 12311 1641.0
## - rm 1 1871.32 12950 1666.6
## - lstat 1 2410.84 13490 1687.3
##
## Step: AIC=1587.65
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1585.8
## <none> 11079 1587.7
## + age 1 0.06 11079 1589.6
## - chas 1 219.91 11299 1595.6
## - tax 1 242.24 11321 1596.6
## - crim 1 243.20 11322 1596.6
## - zn 1 260.32 11339 1597.4
## - black 1 272.26 11351 1597.9
## - rad 1 481.09 11560 1607.2
## - nox 1 520.87 11600 1608.9
## - ptratio 1 1200.23 12279 1637.7
## - dis 1 1352.26 12431 1643.9
## - rm 1 1959.55 13038 1668.0
## - lstat 1 2718.88 13798 1696.7
##
## Step: AIC=1585.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1585.8
## + indus 1 2.52 11079 1587.7
## + age 1 0.06 11081 1587.8
## - chas 1 227.21 11309 1594.0
## - crim 1 245.37 11327 1594.8
## - zn 1 257.82 11339 1595.4
## - black 1 270.82 11352 1596.0
## - tax 1 273.62 11355 1596.1
## - rad 1 500.92 11582 1606.1
## - nox 1 541.91 11623 1607.9
## - ptratio 1 1206.45 12288 1636.0
## - dis 1 1448.94 12530 1645.9
## - rm 1 1963.66 13045 1666.3
## - lstat 1 2723.48 13805 1695.0
Finding: - The stepwise procedure starts from the kitchen-sink model and, at each step, removes the predictor whose removal most lowers the AIC (and may re-add predictors that would lower it). Here it dropped age and then indus, the two predictors with the highest p-values in the kitchen-sink model, reducing the AIC from 1589.6 to 1585.8. - The kitchen-sink model, by contrast, retains all 13 predictors regardless of whether they contribute; the stepwise model keeps only the 11 predictors that earn their place by the AIC criterion, and dropping age and indus changes the residual sum of squares from 11079 to only 11081.
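To compare the two models directly, their AICs and adjusted R-squared values can be printed side by side (a minimal sketch using the model objects above; output not shown):
## AIC for the kitchen-sink vs. stepwise model
AIC(Mul_reg_Boston, step_reg_model)
## Adjusted R-squared for both models
c(full = summary(Mul_reg_Boston)$adj.r.squared,
  step = summary(step_reg_model)$adj.r.squared)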
Evaluate the statistical assumptions in your regression analysis from this model (the best fit model) by performing a basic analysis of model residuals and any unusual observations. Discuss any concerns you have about your model.
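A basic residual analysis of the best-fit (stepwise) model can be sketched as follows, assuming the step_reg_model object from above; the 4/n Cook's distance cutoff is one common rule of thumb (plots and output not shown here):
## Standard diagnostic plots: residuals vs fitted, normal Q-Q,
## scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(step_reg_model)
par(mfrow = c(1, 1))
## Flag potentially influential observations via Cook's distance
which(cooks.distance(step_reg_model) > 4 / nrow(Boston))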
From the stepwise model above: - The first step removed age (p-value = 0.958 in the kitchen-sink model), lowering the AIC to 1587.65. - The second step removed indus (p-value = 0.738), lowering the AIC to 1585.76. - Dropping these two predictors barely changed the residual sum of squares, so the simpler model fits essentially as well while being easier to interpret. - Concerns: the diagnostic plots sketched above should be inspected for curvature in the residuals-vs-fitted plot and for departures from normality in the Q-Q plot. The large maximum residual in the kitchen-sink fit (26.2) already hints at right-skewed residuals, and the significant polynomial terms found earlier (e.g., for lstat and rm) suggest that a purely linear specification misses some curvature, so adding polynomial terms or transforming the response may be warranted.
\end{enumerate}