Before beginning this assignment, please ensure you have access to R and RStudio.
Download the problemset7.Rmd file from Canvas. Open problemset7.Rmd in RStudio and supply your solutions to the assignment by editing problemset7.Rmd.
Replace the “Insert Your Name Here” text in the author: field with your own full name. Any collaborators must be listed on the top of your assignment.
All materials and resources that you use (with the exception of lecture slides) must be appropriately referenced within your assignment. In particular, note that Stack Overflow content is licensed under Creative Commons (CC BY-SA). This means you must attribute any code you adapt from SO.
Partial credit will be awarded for each question for which a serious attempt at finding an answer has been shown. But please DO NOT submit pages and pages of hard-to-read code and attempts that are impossible to grade; that is, avoid redundancy. Remember that one of the key goals of a data scientist is to produce coherent reports that others can easily follow. Students are encouraged to attempt each question and to document their reasoning process even if they cannot find the correct answer. If you would like to include R code to show this process, but it does not run without errors, you can do so with the eval=FALSE option as follows:
a + b # these objects don't exist
# if you run this on its own it will give an error
When you have completed the assignment and have checked that your code both runs in the Console and knits correctly when you click Knit PDF, rename the knitted PDF file to ps7_YourLastName_YourFirstName.pdf, and submit the PDF file on Canvas.
Collaboration is often fun and useful, but each student must turn in an individual write-up in their own words as well as code/work that is their own. Regardless of whether you work with others, what you turn in must be your own work; this includes code and interpretation of results. The names of all collaborators must be listed on each assignment. Do not copy-and-paste from other students’ responses or code.
In this problem set you will need, at minimum, the following R packages.
# Load standard libraries
library(tidyverse)  # includes dplyr and ggplot2
library(MASS)       # modern applied statistics functions (Boston data, stepAIC)
library(data.table)
library(corrplot)   # correlation matrix plots
The Italian restaurants in New York City are legendary, and it’s time to put your newly developed regression modeling skills to work to understand how they operate. What are the factors that contribute to the price of a meal at Italian restaurants in New York City?
You will need to address this question with a series of multiple regression models. The Zagat guide is an influential review of restaurants. You will be looking at the numeric ratings published in the Zagat guide: each restaurant is rated on a scale of 0 to 30 for the quality of its food, decor, and service. The data consists of Zagat reviews of 168 Italian restaurants in New York City from 2001.
\begin{enumerate}
Inspect the data using your usual inspect data functions to get a sense of how the variables are encoded and what values they typically take on. For example, the East variable records whether the restaurant is located east or west of Fifth Avenue, which historically divides the island of Manhattan. Describe the data and variables.
nycData <- read.csv('nyc.csv') ## loading nyc data
## Inspect the data
nycData <- data.table(nycData)
head(nycData)
str(nycData)
## Classes 'data.table' and 'data.frame': 168 obs. of 7 variables:
## $ Case : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Restaurant: Factor w/ 168 levels "Amarone","Anche Vivolo",..: 47 149 18 20 46 89 90 119 16 37 ...
## $ Price : int 43 32 34 41 54 52 34 34 39 44 ...
## $ Food : int 22 20 21 20 24 22 22 20 22 21 ...
## $ Decor : int 18 19 13 20 19 22 16 18 19 17 ...
## $ Service : int 20 19 18 17 21 21 21 21 22 19 ...
## $ East : int 0 0 0 0 0 0 0 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(nycData)
## Case Restaurant Price Food
## Min. : 1.00 Amarone : 1 Min. :19.0 Min. :16.0
## 1st Qu.: 42.75 Anche Vivolo: 1 1st Qu.:36.0 1st Qu.:19.0
## Median : 84.50 Andiamo : 1 Median :43.0 Median :20.5
## Mean : 84.50 Arno : 1 Mean :42.7 Mean :20.6
## 3rd Qu.:126.25 Artusi : 1 3rd Qu.:50.0 3rd Qu.:22.0
## Max. :168.00 Baci : 1 Max. :65.0 Max. :25.0
## (Other) :162
## Decor Service East
## Min. : 6.00 Min. :14.0 Min. :0.000
## 1st Qu.:16.00 1st Qu.:18.0 1st Qu.:0.000
## Median :18.00 Median :20.0 Median :1.000
## Mean :17.69 Mean :19.4 Mean :0.631
## 3rd Qu.:19.00 3rd Qu.:21.0 3rd Qu.:1.000
## Max. :25.00 Max. :24.0 Max. :1.000
##
nycData %>%
  select_if(is.numeric) %>%
  dplyr::select(-Case) %>%  # drop the row-index column
  cor() %>%
  corrplot()
Findings: The nyc data set has 168 observations and seven variables: six integer variables and one factor. Each observation is one Italian restaurant in NYC, and the variables represent:
- ‘Case’: a numeric index for the 168 restaurants.
- ‘Restaurant’: the name of each restaurant that was reviewed.
- ‘Price’: the price of a meal.
- ‘Food’: the rated quality of the food.
- ‘Decor’: the rated quality of the decor.
- ‘Service’: the rated quality of the customer service.
- ‘East’: whether the restaurant is east (1) or west (0) of Fifth Avenue.
Based on your knowledge of the restaurant industry, do you think that the quality of the food in a restaurant is an important determinant of the price of a meal at that restaurant? How will you prove your intuition (that quality does or does not determine price) using the nyc data and your newly developed regression modeling skills? Before writing code, explain your choice of response variable, explanatory variable(s), and modeling technique.
Finding: - Yes, my experience suggests that food quality is an important determinant of the price of a meal at a restaurant. - Price is the response variable, and Food, Decor, Service, and East are the explanatory variables. A linear regression model is a suitable technique for finding the relationship between price and food quality while controlling for the other factors.
Now build the model based on the choices you made in the previous question (i.e. write code below).
### a linear regression model is the simplest way to develop and execute my intuition
lm_nycdata <- lm(Price ~ Food + Service + Decor + East, data = nycData)
Visualize the fitted model
### visualizing the fitted model (note: geom_smooth fits its own simple regression of Price on Food, not the full multiple regression)
ggplot(nycData, aes(Food, Price)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
Interpret the model and its coefficients, and see whether you were able to prove or disprove your intuition. On top of reporting the coefficient values and model fit, you need to interpret the coefficients in plain English.
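The coefficient values interpreted below come from the model's summary, which can be printed as follows (a minimal sketch using the lm_nycdata object fitted above; output omitted here for brevity):
## Coefficient table and fit statistics for the model fitted above
summary(lm_nycdata)
## 95% confidence intervals for the coefficients
confint(lm_nycdata)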
Finding: - Based on my model, the intercept of -24.023800 is the predicted price when all ratings are zero (and the restaurant is on the west side); since no restaurant has ratings anywhere near zero, the intercept has no practical interpretation on its own. - Holding Decor, Service, and location fixed, each one-unit increase in the food quality rating is associated with a $1.538120 increase in the price of a meal. This supports my intuition that food quality is an important determinant of price.
You plan to visit an Italian restaurant for lunch. You check the Zagat review for a restaurant called Plaza that your colleague has been raving about, and you find that Zagat has rated the quality of its food as 20. What's your best estimate of the price of a meal you would need to pay if you ended up going to Plaza for lunch?
## predict the price when the food quality rating is 20
lm_pricefood <- lm(Price ~ Food, data = nycData)
predict(lm_pricefood, newdata = tibble(Food = 20))
## 1
## 40.94705
Finding: - The best estimate of the price of a meal at Plaza is about $40.95, the fitted value of the simple regression of Price on Food at a food rating of 20.
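A point estimate alone hides the uncertainty around a single meal's price; a prediction interval can be requested from the same model (a minimal sketch; output not shown):
## 95% prediction interval for the price of one meal when Food = 20
predict(lm_pricefood, newdata = tibble(Food = 20), interval = "prediction")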
\end{enumerate}
Fifth Avenue is one of the most well-known streets in the world, renowned for its flagship stores for shopping, landmark hotels, and internationally recognized sites. 5th Avenue divides the island of Manhattan vertically. The city's east side has historically been home to expensive residences and opulent attractions. Maybe everything is more expensive on the East side. Maybe even food is more expensive on the east side.
## average meal price for restaurants east and west of Fifth Avenue
nycData %>%
  group_by(East) %>%
  summarize(mean_price = mean(Price))
## `summarise()` ungrouping output (override with `.groups` argument)
Finding: - The east side's mean meal price is 3.58339 higher than the west side's (44.01887 versus 40.43548, respectively).
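Whether this difference is statistically distinguishable from zero can be checked with a two-sample t-test (a minimal sketch; output not shown):
## Compare mean meal price east vs. west of Fifth Avenue
t.test(Price ~ East, data = nycData)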
## multiple regression model
Price_foodService <- lm(Price ~ Food + Service, data = nycData)
Finding: - Multiple regression lets us estimate the effect of each predictor while holding the others fixed. - After controlling for food quality, people are willing to pay $1.704 more for each additional point of the service rating.
## simple linear regression model for price and decor
Price_decor <- lm(Price ~ Decor, data = nycData)
## multiple regression model for price, food, service, and decor
Price_foodServiceDecor <- lm(Price ~ Food + Service + Decor, data = nycData)
Finding: - In the simple model, each additional point of the decor rating is associated with people paying $2.491 more per meal. - After controlling for food and service, the decor effect is attenuated: each additional point of the decor rating is associated with a $1.847 increase in the price of a meal.
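To see this attenuation side by side, the coefficients of the two models can be printed together (a minimal sketch using the model objects above; output not shown):
coef(Price_decor)             # Decor alone
coef(Price_foodServiceDecor)  # Decor after controlling for Food and Service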
You are given an auction dataset, the eBay auction data (https://www.modelingonlineauctions.com/datasets). You want to know how the final sale price of a particular item (“Palm Pilot M515 PDA”) is affected by the opening price and the type of auction. To do so, respond to the following:
## loading the eBay auction data
ebay_auction <- read.csv('auction.csv')
## inspecting the data
head(ebay_auction)
str(ebay_auction)
## 'data.frame': 10681 obs. of 9 variables:
## $ auctionid : num 1.64e+09 1.64e+09 1.64e+09 1.64e+09 1.64e+09 ...
## $ bid : num 175 100 120 150 178 ...
## $ bidtime : num 2.23 2.6 2.6 2.6 2.91 ...
## $ bidder : Factor w/ 3387 levels "-kim-","**balancedbody4u**",..: 2729 627 1764 1764 1046 354 2723 354 2723 354 ...
## $ bidderrate : int 0 0 2 2 4 2 1 2 1 2 ...
## $ openbid : num 99 99 99 99 99 1 1 1 1 1 ...
## $ price : num 178 178 178 178 178 ...
## $ item : Factor w/ 3 levels "Cartier wristwatch",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ auction_type: Factor w/ 3 levels "3 day auction",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(ebay_auction)
## auctionid bid bidtime
## Min. :1.639e+09 Min. : 0.01 Min. :0.000567
## 1st Qu.:3.015e+09 1st Qu.: 72.00 1st Qu.:1.949931
## Median :3.021e+09 Median : 140.00 Median :4.140833
## Mean :4.136e+09 Mean : 207.59 Mean :3.979628
## 3rd Qu.:8.212e+09 3rd Qu.: 210.00 3rd Qu.:6.448060
## Max. :8.216e+09 Max. :5400.00 Max. :6.999990
##
## bidder bidderrate openbid price
## warrencheryl: 45 Min. : -4.00 Min. : 0.01 Min. : 26.0
## mregestr : 33 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 186.5
## zebedin : 33 Median : 5.00 Median : 4.99 Median : 228.5
## babygirljrt : 32 Mean : 31.94 Mean : 52.25 Mean : 335.0
## macdonn : 31 3rd Qu.: 21.00 3rd Qu.: 50.00 3rd Qu.: 255.0
## (Other) :10491 Max. :3140.00 Max. :5000.00 Max. :5400.0
## NA's : 16 NA's :11
## item auction_type
## Cartier wristwatch :1953 3 day auction:2023
## Palm Pilot M515 PDA:5917 5 day auction:1617
## Xbox game console :2811 7 day auction:7041
##
## I created a new column based on the bid and price variables:
## if the bid equals the final price, that bid won the auction, coded "1";
## if the bid does not equal the final price, the bid did not win, coded "0".
ebay_auction1 <- ebay_auction %>%
  mutate(sold = ifelse(price == bid, 1, 0))
## Filtering "Palm Pilot M515 PDA" data
ebay_auction_PDA <- ebay_auction1 %>%
filter(item =="Palm Pilot M515 PDA")
table(ebay_auction1$sold)
##
## 0 1
## 9969 712
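As a sanity check on this construction, each auction should typically contribute exactly one winning bid. A minimal sketch using ebay_auction1 from above (output not shown):
## Count winning bids per auction; most auctions should show exactly 1
ebay_auction1 %>%
  group_by(auctionid) %>%
  summarize(n_winning = sum(sold)) %>%
  count(n_winning)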
## convert openbid and sold to factors
ebay_auction_PDA$openbid = as.factor(ebay_auction_PDA$openbid)
ebay_auction_PDA$sold = as.factor(ebay_auction_PDA$sold)
## build a logistic regression model, which is an appropriate model for the binary sold outcome
ebay_PDA_model <- glm(sold ~ openbid, family = "binomial", data = ebay_auction_PDA)
summary(ebay_PDA_model)
##
## Call:
## glm(formula = sold ~ openbid, family = "binomial", data = ebay_auction_PDA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2435 -0.3312 -0.3200 -0.2828 2.6444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.199311 0.127524 -25.088 < 2e-16 ***
## openbid0.07 -0.167985 1.025059 -0.164 0.869827
## openbid0.11 0.491260 1.040639 0.472 0.636873
## openbid1 0.252823 0.173007 1.461 0.143921
## openbid2 1.119869 1.068299 1.048 0.294513
## openbid5 -0.009515 0.525703 -0.018 0.985560
## openbid9 -0.266425 1.023481 -0.260 0.794622
## openbid9.95 -0.007181 0.310324 -0.023 0.981538
## openbid9.99 0.323865 0.226882 1.427 0.153448
## openbid10 0.411218 0.385946 1.065 0.286659
## openbid15 0.131258 0.734519 0.179 0.858174
## openbid19.99 0.031728 0.526107 0.060 0.951911
## openbid20 -0.096526 1.026304 -0.094 0.925068
## openbid25 0.178886 0.527691 0.339 0.734612
## openbid39 0.406103 0.608279 0.668 0.504373
## openbid40 1.002086 1.061778 0.944 0.345282
## openbid45.99 0.154788 1.031446 0.150 0.880710
## openbid50 0.479212 0.336392 1.425 0.154283
## openbid70 0.228896 0.736141 0.311 0.755846
## openbid74.9 0.366097 1.036863 0.353 0.724027
## openbid89 0.491260 1.040639 0.472 0.636873
## openbid99 1.119869 0.760764 1.472 0.141012
## openbid99.99 0.615313 0.533976 1.152 0.249188
## openbid100 0.742575 0.369996 2.007 0.044752 *
## openbid125 0.801415 1.052222 0.762 0.446274
## openbid140 0.366097 1.036863 0.353 0.724027
## openbid149.99 0.634361 0.744798 0.852 0.394368
## openbid150 0.840030 0.327826 2.562 0.010394 *
## openbid155 0.366097 1.036863 0.353 0.724027
## openbid175 1.615191 0.215385 7.499 6.43e-14 ***
## openbid179.99 1.059245 0.543759 1.948 0.051415 .
## openbid180 2.793846 0.921735 3.031 0.002437 **
## openbid185 1.119869 1.068299 1.048 0.294513
## openbid190 1.589873 0.785024 2.025 0.042841 *
## openbid199 2.862838 0.433233 6.608 3.89e-11 ***
## openbid199.95 1.813016 1.125283 1.611 0.107144
## openbid199.99 17.765378 882.743384 0.020 0.983944
## openbid200 1.667834 0.410231 4.066 4.79e-05 ***
## openbid200.99 0.308939 1.035286 0.298 0.765391
## openbid202.5 17.765378 882.743384 0.020 0.983944
## openbid215 1.813016 0.360415 5.030 4.90e-07 ***
## openbid219.99 3.199311 1.419952 2.253 0.024252 *
## openbid220 2.100698 0.826395 2.542 0.011022 *
## openbid225 3.199311 0.718514 4.453 8.48e-06 ***
## openbid230 3.199311 1.419952 2.253 0.024252 *
## openbid239.99 17.765378 882.743384 0.020 0.983944
## openbid240 3.353461 0.413551 8.109 5.11e-16 ***
## openbid245 17.765378 624.193840 0.028 0.977294
## openbid249.99 17.765378 882.743384 0.020 0.983944
## openbid250 17.765378 882.743384 0.020 0.983944
## openbid255 3.199311 0.826395 3.871 0.000108 ***
## openbid259 1.946548 0.811862 2.398 0.016501 *
## openbid259.95 17.765378 882.743384 0.020 0.983944
## openbid265 17.765378 882.743384 0.020 0.983944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2816.2 on 5916 degrees of freedom
## Residual deviance: 2573.2 on 5863 degrees of freedom
## AIC: 2681.2
##
## Number of Fisher Scoring iterations: 13
# visualize the model: scatterplot with jitter
ebay_PDA_openbid <- ggplot(data = ebay_auction_PDA, aes(y = sold, x = openbid)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve (geom_smooth expects method.args as a list)
ebay_PDA_openbid +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'))
## `geom_smooth()` using formula 'y ~ x'
Finding: - Most openbid levels have p-values greater than 0.05 and are not statistically significant, so at typical opening prices there is little evidence that the opening bid predicts whether a bid wins. - Several high opening bids (e.g., 175, 199, 200, 215, 225, 240, 255) do have significant positive coefficients; one plausible reading is that an opening price close to the eventual sale price leaves little room for competing bids, so a bid placed in such an auction is much more likely to end up as the winning one.
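Note that converting openbid to a factor gives every distinct opening price its own dummy variable, which is what produces the long, hard-to-read coefficient table above. An alternative sketch treats the opening bid as a numeric predictor with a single slope (pda_numeric and pda_model_numeric are hypothetical names; results not shown):
## Keep openbid numeric so the model estimates one slope instead of ~50 dummies
pda_numeric <- ebay_auction1 %>%
  filter(item == "Palm Pilot M515 PDA")
pda_model_numeric <- glm(sold ~ openbid, family = "binomial", data = pda_numeric)
summary(pda_model_numeric)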
## convert auction_type to a factor
ebay_auction_PDA$auction_type = as.factor(ebay_auction_PDA$auction_type)
## build a logistic regression model adding auction type, which is an appropriate model for the binary sold outcome
ebay_PDA_model2 <- glm(sold ~ openbid + auction_type, family = "binomial", data = ebay_auction_PDA)
summary(ebay_PDA_model2)
##
## Call:
## glm(formula = sold ~ openbid + auction_type, family = "binomial",
## data = ebay_auction_PDA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2506 -0.3288 -0.3133 -0.2843 2.6444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.249569 0.186433 -17.430 < 2e-16 ***
## openbid0.07 -0.141529 1.039058 -0.136 0.891656
## openbid0.11 0.480028 1.041007 0.461 0.644713
## openbid1 0.259693 0.174050 1.492 0.135685
## openbid2 1.108637 1.068657 1.037 0.299544
## openbid5 -0.007903 0.525854 -0.015 0.988009
## openbid9 -0.277657 1.023855 -0.271 0.786246
## openbid9.95 -0.018413 0.311556 -0.059 0.952872
## openbid9.99 0.335122 0.229249 1.462 0.143789
## openbid10 0.420039 0.395094 1.063 0.287719
## openbid15 0.139965 0.739306 0.189 0.849842
## openbid19.99 0.065957 0.533286 0.124 0.901568
## openbid20 -0.107758 1.026677 -0.105 0.916409
## openbid25 0.194510 0.529389 0.367 0.713303
## openbid39 0.394871 0.608909 0.648 0.516669
## openbid40 0.990854 1.062139 0.933 0.350879
## openbid45.99 0.181245 1.045360 0.173 0.862353
## openbid50 0.477775 0.336547 1.420 0.155714
## openbid70 0.255352 0.755512 0.338 0.735374
## openbid74.9 0.354865 1.037233 0.342 0.732255
## openbid89 0.517717 1.054431 0.491 0.623432
## openbid99 1.170128 0.772824 1.514 0.130002
## openbid99.99 0.629735 0.545451 1.155 0.248286
## openbid100 0.748056 0.376368 1.988 0.046860 *
## openbid125 0.790183 1.052586 0.751 0.452829
## openbid140 0.392554 1.050705 0.374 0.708695
## openbid149.99 0.623129 0.745312 0.836 0.403118
## openbid150 0.847466 0.329113 2.575 0.010024 *
## openbid155 0.354865 1.037233 0.342 0.732255
## openbid175 1.657811 0.243553 6.807 9.98e-12 ***
## openbid179.99 1.055854 0.544680 1.938 0.052564 .
## openbid180 2.782614 0.922151 3.018 0.002548 **
## openbid185 1.170128 1.076920 1.087 0.277236
## openbid190 1.624522 0.790737 2.054 0.039933 *
## openbid199 2.892531 0.440980 6.559 5.41e-11 ***
## openbid199.95 1.863275 1.133471 1.644 0.100204
## openbid199.99 17.754146 882.743385 0.020 0.983954
## openbid200 1.700239 0.420157 4.047 5.20e-05 ***
## openbid200.99 0.359197 1.044180 0.344 0.730847
## openbid202.5 17.815637 882.743395 0.020 0.983898
## openbid215 1.828723 0.365381 5.005 5.59e-07 ***
## openbid219.99 3.188079 1.420221 2.245 0.024783 *
## openbid220 2.150957 0.837510 2.568 0.010221 *
## openbid225 3.206923 0.722936 4.436 9.17e-06 ***
## openbid230 3.225767 1.430090 2.256 0.024093 *
## openbid239.99 17.754146 882.743385 0.020 0.983954
## openbid240 3.358426 0.413849 8.115 4.85e-16 ***
## openbid245 17.773116 624.177638 0.028 0.977284
## openbid249.99 17.791835 882.743401 0.020 0.983920
## openbid250 17.754146 882.743385 0.020 0.983954
## openbid255 3.229734 0.838893 3.850 0.000118 ***
## openbid259 1.983544 0.819846 2.419 0.015546 *
## openbid259.95 17.754146 882.743385 0.020 0.983954
## openbid265 17.754146 882.743385 0.020 0.983954
## auction_type5 day auction 0.023802 0.214043 0.111 0.911456
## auction_type7 day auction 0.061490 0.161408 0.381 0.703230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2816.2 on 5916 degrees of freedom
## Residual deviance: 2573.1 on 5861 degrees of freedom
## AIC: 2685.1
##
## Number of Fisher Scoring iterations: 13
# scatterplot with jitter, colored by auction type
ebay_PDA_auction <- ggplot(data = ebay_auction_PDA, aes(y = sold, x = openbid, color = auction_type)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve
ebay_PDA_auction +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'))
## `geom_smooth()` using formula 'y ~ x'
Finding: - The p-values for both auction_type coefficients (the 5-day and 7-day auctions, relative to the 3-day baseline) are well above 0.05, so neither is statistically significant. There is no evidence that auction type affects the probability that a bid wins once the opening bid is accounted for; adding auction_type barely changes the residual deviance (2573.2 to 2573.1) while costing two degrees of freedom.
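The same conclusion can be checked with a likelihood-ratio test of the two nested models (a minimal sketch using the model objects above; output not shown):
## Does adding auction_type significantly improve the fit?
anova(ebay_PDA_model, ebay_PDA_model2, test = "Chisq")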
Let’s revisit the Boston dataset from the previous problem (PS6). In PS6, you had investigated linear relationships between predictor variables and the response. What we didn’t investigate is the presence of any non-linear association.
\begin{enumerate}
Is there evidence of a non-linear association between any of the predictors and the response? Hint: To answer this question, for each predictor X fit a model of the form:
\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon
\]
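As an alternative to the predictor-by-predictor calls below, a compact sketch loops the cubic fit over every predictor; poly(..., 3, raw = TRUE) produces the same fits as the I() terms used below, and names like cubic_fits are hypothetical (output not shown):
# Fit medv ~ X + X^2 + X^3 for each predictor in one pass (a sketch)
library(MASS)
data(Boston)
predictors <- setdiff(names(Boston), "medv")
cubic_fits <- lapply(predictors, function(x) {
  f <- as.formula(paste0("medv ~ poly(", x, ", 3, raw = TRUE)"))
  lm(f, data = Boston)
})
names(cubic_fits) <- predictors
# e.g., summary(cubic_fits$lstat) reproduces the lstat model shown below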
library(MASS)
data(Boston)
## fit model for each predictor X
summary(lm(medv~crim + I(crim^2) + I(crim^3), data = Boston))
##
## Call:
## lm(formula = medv ~ crim + I(crim^2) + I(crim^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.983 -4.975 -1.940 2.881 33.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.519e+01 4.355e-01 57.846 < 2e-16 ***
## crim -1.136e+00 1.444e-01 -7.868 2.24e-14 ***
## I(crim^2) 2.378e-02 6.808e-03 3.494 0.000518 ***
## I(crim^3) -1.489e-04 6.641e-05 -2.242 0.025411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.159 on 502 degrees of freedom
## Multiple R-squared: 0.2177, Adjusted R-squared: 0.213
## F-statistic: 46.57 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~zn + I(zn^2) + I(zn^3), data = Boston))
##
## Call:
## lm(formula = medv ~ zn + I(zn^2) + I(zn^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.449 -5.549 -1.049 3.225 29.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.4485972 0.4359536 46.905 < 2e-16 ***
## zn 0.6433652 0.1105611 5.819 1.06e-08 ***
## I(zn^2) -0.0167646 0.0038872 -4.313 1.94e-05 ***
## I(zn^3) 0.0001257 0.0000316 3.978 7.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.43 on 502 degrees of freedom
## Multiple R-squared: 0.1649, Adjusted R-squared: 0.1599
## F-statistic: 33.05 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~indus + I(indus^2) + I(indus^3), data = Boston))
##
## Call:
## lm(formula = medv ~ indus + I(indus^2) + I(indus^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.760 -4.725 -1.009 2.932 32.038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.080160 1.663326 22.293 < 2e-16 ***
## indus -2.806994 0.509349 -5.511 5.71e-08 ***
## I(indus^2) 0.140462 0.041554 3.380 0.000781 ***
## I(indus^3) -0.002399 0.001011 -2.373 0.018026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.844 on 502 degrees of freedom
## Multiple R-squared: 0.2768, Adjusted R-squared: 0.2725
## F-statistic: 64.06 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~chas + I(chas^2) + I(chas^3), data = Boston))
##
## Call:
## lm(formula = medv ~ chas + I(chas^2) + I(chas^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.094 -5.894 -1.417 2.856 27.906
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.0938 0.4176 52.902 < 2e-16 ***
## chas 6.3462 1.5880 3.996 7.39e-05 ***
## I(chas^2) NA NA NA NA
## I(chas^3) NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.064 on 504 degrees of freedom
## Multiple R-squared: 0.03072, Adjusted R-squared: 0.02879
## F-statistic: 15.97 on 1 and 504 DF, p-value: 7.391e-05
summary(lm(medv~nox + I(nox^2) + I(nox^3), data = Boston))
##
## Call:
## lm(formula = medv ~ nox + I(nox^2) + I(nox^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.104 -5.020 -2.144 2.747 32.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.49 38.52 -0.584 0.5596
## nox 315.10 195.10 1.615 0.1069
## I(nox^2) -615.83 320.48 -1.922 0.0552 .
## I(nox^3) 350.19 170.92 2.049 0.0410 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.282 on 502 degrees of freedom
## Multiple R-squared: 0.1939, Adjusted R-squared: 0.189
## F-statistic: 40.24 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~rm + I(rm^2) + I(rm^3), data = Boston))
##
## Call:
## lm(formula = medv ~ rm + I(rm^2) + I(rm^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.102 -2.674 0.569 3.011 35.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 241.3108 47.3275 5.099 4.85e-07 ***
## rm -109.3906 22.9690 -4.763 2.51e-06 ***
## I(rm^2) 16.4910 3.6750 4.487 8.95e-06 ***
## I(rm^3) -0.7404 0.1935 -3.827 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.11 on 502 degrees of freedom
## Multiple R-squared: 0.5612, Adjusted R-squared: 0.5586
## F-statistic: 214 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~age + I(age^2) + I(age^3), data = Boston))
##
## Call:
## lm(formula = medv ~ age + I(age^2) + I(age^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.443 -4.909 -2.234 2.185 32.944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.893e+01 2.992e+00 9.668 <2e-16 ***
## age -1.224e-01 2.014e-01 -0.608 0.544
## I(age^2) 2.355e-03 3.930e-03 0.599 0.549
## I(age^3) -2.318e-05 2.279e-05 -1.017 0.310
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.472 on 502 degrees of freedom
## Multiple R-squared: 0.1566, Adjusted R-squared: 0.1515
## F-statistic: 31.06 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~dis + I(dis^2) + I(dis^3), data = Boston))
##
## Call:
## lm(formula = medv ~ dis + I(dis^2) + I(dis^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.571 -5.242 -2.037 2.397 34.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.03789 2.91134 2.417 0.01599 *
## dis 8.59284 2.06633 4.158 3.77e-05 ***
## I(dis^2) -1.24953 0.41235 -3.030 0.00257 **
## I(dis^3) 0.05602 0.02428 2.307 0.02146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.727 on 502 degrees of freedom
## Multiple R-squared: 0.105, Adjusted R-squared: 0.09968
## F-statistic: 19.64 on 3 and 502 DF, p-value: 4.736e-12
summary(lm(medv~rad + I(rad^2) + I(rad^3), data = Boston))
##
## Call:
## lm(formula = medv ~ rad + I(rad^2) + I(rad^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.630 -5.151 -2.017 3.169 33.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.251303 2.567860 11.781 < 2e-16 ***
## rad -3.799454 1.307156 -2.907 0.003815 **
## I(rad^2) 0.616347 0.186057 3.313 0.000991 ***
## I(rad^3) -0.020086 0.005717 -3.514 0.000482 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.37 on 502 degrees of freedom
## Multiple R-squared: 0.1767, Adjusted R-squared: 0.1718
## F-statistic: 35.91 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~tax + I(tax^2) + I(tax^3), data = Boston))
##
## Call:
## lm(formula = medv ~ tax + I(tax^2) + I(tax^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.109 -4.952 -1.878 2.957 33.694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.222e+01 1.397e+01 3.739 0.000206 ***
## tax -1.635e-01 1.133e-01 -1.443 0.149646
## I(tax^2) 3.029e-04 2.872e-04 1.055 0.292004
## I(tax^3) -2.079e-07 2.236e-07 -0.930 0.353061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.115 on 502 degrees of freedom
## Multiple R-squared: 0.2261, Adjusted R-squared: 0.2215
## F-statistic: 48.89 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~ptratio + I(ptratio^2) + I(ptratio^3), data = Boston))
##
## Call:
## lm(formula = medv ~ ptratio + I(ptratio^2) + I(ptratio^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7795 -5.0364 -0.9778 3.4766 31.1636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 312.28642 152.48693 2.048 0.0411 *
## ptratio -48.69114 26.88441 -1.811 0.0707 .
## I(ptratio^2) 2.83995 1.56413 1.816 0.0700 .
## I(ptratio^3) -0.05686 0.03005 -1.892 0.0590 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.898 on 502 degrees of freedom
## Multiple R-squared: 0.2669, Adjusted R-squared: 0.2625
## F-statistic: 60.91 on 3 and 502 DF, p-value: < 2.2e-16
summary(lm(medv~black + I(black^2) + I(black^3), data = Boston))
##
## Call:
## lm(formula = medv ~ black + I(black^2) + I(black^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.005 -4.802 -1.613 2.852 28.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.260e+01 2.517e+00 5.006 7.7e-07 ***
## black -1.703e-02 6.150e-02 -0.277 0.782
## I(black^2) 2.036e-04 3.258e-04 0.625 0.532
## I(black^3) -2.224e-07 4.765e-07 -0.467 0.641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.685 on 502 degrees of freedom
## Multiple R-squared: 0.1135, Adjusted R-squared: 0.1082
## F-statistic: 21.43 on 3 and 502 DF, p-value: 4.463e-13
summary(lm(medv~lstat + I(lstat^2) + I(lstat^3), data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2) + I(lstat^3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5441 -3.7122 -0.5145 2.4846 26.4153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.6496253 1.4347240 33.909 < 2e-16 ***
## lstat -3.8655928 0.3287861 -11.757 < 2e-16 ***
## I(lstat^2) 0.1487385 0.0212987 6.983 9.18e-12 ***
## I(lstat^3) -0.0020039 0.0003997 -5.013 7.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.396 on 502 degrees of freedom
## Multiple R-squared: 0.6578, Adjusted R-squared: 0.6558
## F-statistic: 321.7 on 3 and 502 DF, p-value: < 2.2e-16
Finding: - There is evidence of a non-linear association for most predictors: the squared and/or cubed terms are statistically significant for crim, zn, indus, rm, dis, rad, and lstat, and the cubic term for nox is significant at the 5% level. - There is no evidence of non-linearity for age, tax, or black (no polynomial term is significant), the ptratio terms are only marginally significant, and chas is binary, so its squared and cubed terms are dropped due to singularities.
In PS6, question 4, you had built a kitchen-sink model by fitting a multiple regression model to predict the response using all of the predictors. Now consider performing a stepwise model selection procedure to determine the best fit model. Discuss your results. How is this model different from the kitchen-sink model?
## kitchen-sink model from PS6
Mul_reg_Boston <- (lm(medv~., data = Boston))
summary(Mul_reg_Boston)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
# Stepwise regression model
step_reg_model <- MASS::stepAIC(Mul_reg_Boston, k=2, direction = "both")
## Start: AIC=1589.64
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1587.7
## - indus 1 2.52 11081 1587.8
## <none> 11079 1589.6
## - chas 1 218.97 11298 1597.5
## - tax 1 242.26 11321 1598.6
## - crim 1 243.22 11322 1598.6
## - zn 1 257.49 11336 1599.3
## - black 1 270.63 11349 1599.8
## - rad 1 479.15 11558 1609.1
## - nox 1 487.16 11566 1609.4
## - ptratio 1 1194.23 12273 1639.4
## - dis 1 1232.41 12311 1641.0
## - rm 1 1871.32 12950 1666.6
## - lstat 1 2410.84 13490 1687.3
##
## Step: AIC=1587.65
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1585.8
## <none> 11079 1587.7
## + age 1 0.06 11079 1589.6
## - chas 1 219.91 11299 1595.6
## - tax 1 242.24 11321 1596.6
## - crim 1 243.20 11322 1596.6
## - zn 1 260.32 11339 1597.4
## - black 1 272.26 11351 1597.9
## - rad 1 481.09 11560 1607.2
## - nox 1 520.87 11600 1608.9
## - ptratio 1 1200.23 12279 1637.7
## - dis 1 1352.26 12431 1643.9
## - rm 1 1959.55 13038 1668.0
## - lstat 1 2718.88 13798 1696.7
##
## Step: AIC=1585.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1585.8
## + indus 1 2.52 11079 1587.7
## + age 1 0.06 11081 1587.8
## - chas 1 227.21 11309 1594.0
## - crim 1 245.37 11327 1594.8
## - zn 1 257.82 11339 1595.4
## - black 1 270.82 11352 1596.0
## - tax 1 273.62 11355 1596.1
## - rad 1 500.92 11582 1606.1
## - nox 1 541.91 11623 1607.9
## - ptratio 1 1206.45 12288 1636.0
## - dis 1 1448.94 12530 1645.9
## - rm 1 1963.66 13045 1666.3
## - lstat 1 2723.48 13805 1695.0
Finding: - The stepwise procedure starts from the kitchen-sink model and, at each step, removes the predictor whose removal most lowers the AIC (and may re-add predictors that would lower it). Here it dropped age and then indus, the two predictors with the highest p-values in the kitchen-sink model, reducing the AIC from 1589.6 to 1585.8. - The kitchen-sink model, by contrast, retains all 13 predictors regardless of whether they contribute; the stepwise model keeps only the 11 predictors that earn their place by the AIC criterion, and dropping age and indus changes the residual sum of squares from 11079 to only 11081.
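To compare the two models directly, their AICs and adjusted R-squared values can be printed side by side (a minimal sketch using the model objects above; output not shown):
## AIC for the kitchen-sink vs. stepwise model
AIC(Mul_reg_Boston, step_reg_model)
## Adjusted R-squared for both models
c(full = summary(Mul_reg_Boston)$adj.r.squared,
  step = summary(step_reg_model)$adj.r.squared)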
Evaluate the statistical assumptions in your regression analysis from this model (the best fit model) by performing a basic analysis of model residuals and any unusual observations. Discuss any concerns you have about your model.
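A basic residual analysis of the best-fit (stepwise) model can be sketched as follows, assuming the step_reg_model object from above; the 4/n Cook's distance cutoff is one common rule of thumb (plots and output not shown here):
## Standard diagnostic plots: residuals vs fitted, normal Q-Q,
## scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(step_reg_model)
par(mfrow = c(1, 1))
## Flag potentially influential observations via Cook's distance
which(cooks.distance(step_reg_model) > 4 / nrow(Boston))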
From the stepwise model above: - The first step removed age (p-value = 0.958 in the kitchen-sink model), lowering the AIC to 1587.65. - The second step removed indus (p-value = 0.738), lowering the AIC to 1585.76. - Dropping these two predictors barely changed the residual sum of squares, so the simpler model fits essentially as well while being easier to interpret. - Concerns: the diagnostic plots sketched above should be inspected for curvature in the residuals-vs-fitted plot and for departures from normality in the Q-Q plot. The large maximum residual in the kitchen-sink fit (26.2) already hints at right-skewed residuals, and the significant polynomial terms found earlier (e.g., for lstat and rm) suggest that a purely linear specification misses some curvature, so adding polynomial terms or transforming the response may be warranted.
\end{enumerate}