Load the data. I have embedded it in a hidden code chunk, so the compiled document carries the data with it.
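For anyone reproducing this, a minimal sketch of what such a hidden chunk might contain (assumed, not my actual chunk; the file name is hypothetical, since my source types the values in directly). In R Markdown, a chunk with echo=FALSE (or include=FALSE) runs without displaying its code:
# Hypothetical sketch of a hidden data-loading chunk; the compiled document
# shows only results that use Restaurants, not this code.
Restaurants <- read.csv("Restaurants.csv", stringsAsFactors = TRUE)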
> summary(Restaurants)
Location Food. Décor Service
City :50 Min. :16.00 Min. : 8.00 Min. :14.00
Suburban:50 1st Qu.:20.00 1st Qu.:16.00 1st Qu.:18.00
Median :21.00 Median :19.00 Median :20.00
Mean :21.51 Mean :18.84 Mean :20.15
3rd Qu.:23.25 3rd Qu.:21.00 3rd Qu.:22.00
Max. :26.00 Max. :27.00 Max. :26.00
Summated.Rating Coded.Location Cost
Min. :46.0 Min. :0.0 Min. :22.00
1st Qu.:55.0 1st Qu.:0.0 1st Qu.:38.75
Median :61.0 Median :0.5 Median :46.50
Mean :60.5 Mean :0.5 Mean :46.98
3rd Qu.:65.0 3rd Qu.:1.0 3rd Qu.:55.00
Max. :76.0 Max. :1.0 Max. :85.00
Almost all of the variables are quantitative, save Location, and even Location has a 0/1 counterpart in Coded.Location. In short, all of the data are quantitatively represented (a quick check of that coding appears after the list below).
- Location is split half and half between City and Suburban.
- Restaurant costs range from $22 to $85, with the mean ($46.98) and median ($46.50) both close to $47.
- The scales of the individual ratings are not documented in the data. They are integers, appear to run roughly from 0 to 30, and have means and medians near 20.
- The summated rating, which is just the sum of Food, Decor, and Service, accordingly has its mean and median around 60.
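Since Coded.Location should simply be Location recoded as 0/1, a quick cross-tabulation can confirm it (a sketch; the grouped summaries below show City = 0 and Suburban = 1):
# Sketch: verify that Coded.Location is Location recoded as 0/1.
with(Restaurants, table(Location, Coded.Location))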
Let me generate two scatterplot matrices to examine these relationships. First, I will look at a generic scatterplot matrix. It is worth noting this occasionally fails when there are character variables in the data; if needed, remove them from the data first. Here, that could be Location. The way to do that is the select command with negation, i.e. select everything except Location: select(-Location).
> library(car)
Loading required package: carData
> library(gplots)
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
> library(tidyverse)
-- Attaching packages ------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0 v purrr 0.2.5
v tibble 1.4.2 v dplyr 0.7.8
v tidyr 0.8.2 v stringr 1.3.1
v readr 1.2.1 v forcats 0.3.0
-- Conflicts ---------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
x dplyr::recode() masks car::recode()
x purrr::some() masks car::some()
> Restaurants2 <- Restaurants %>% select(-Location)
> plot(Restaurants2)
Cost is on the y-axis throughout the last row of the scatterplot matrix. It is a bit lower in the suburbs, increases with Decor and Service, and increases somewhat less with Food. Cost also rises with the summated rating.
Next let me visualize the correlations among the variables. Note, this requires that I eliminate character variables.
> library(corrplot)
corrplot 0.84 loaded
> corrplot(cor(Restaurants2), method="circle")
Let’s first look at costs overall and then broken down between cities and suburbs.
> library(skimr)
> Restaurants %>% group_by(Location) %>% skim()
Skim summary statistics
n obs: 100
n variables: 7
group variables: Location
-- Variable type:integer -----------------------------------
Location variable missing complete n mean sd p0 p25 p50 p75 p100
City Coded.Location 0 50 50 0 0 0 0 0 0 0
City Cost 0 50 50 47.68 12.31 24 39.25 46.5 55 85
City Décor 0 50 50 18.84 3.47 11 16 19 21 26
City Food. 0 50 50 21.36 2.35 16 20 21 23 26
City Service 0 50 50 19.9 2.28 14 18 20 21 26
City Summated.Rating 0 50 50 60.1 6.81 46 54.25 61.5 65 75
Suburban Coded.Location 0 50 50 1 0 1 1 1 1 1
Suburban Cost 0 50 50 46.28 11.5 22 38.25 46.5 53 71
Suburban Décor 0 50 50 18.84 4.18 8 16 18 21.75 27
Suburban Food. 0 50 50 21.66 2.58 16 20 22 24 26
Suburban Service 0 50 50 20.4 2.15 16 19 20 22 24
Suburban Summated.Rating 0 50 50 60.9 7.21 47 55 60.5 66.75 76
[skim's inline histogram column does not render in this output and is omitted.]
Let’s describe average costs and ask whether or not they differ by location. Restaurants are not paired; these are independent samples of city and suburban restaurants.
> with(Restaurants, (t.test(Cost,alternative='two.sided', mu=0.0, conf.level=.95)))
One Sample t-test
data: Cost
t = 39.58, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
44.6248 49.3352
sample estimates:
mean of x
46.98
> t.test(Cost~Location, alternative='two.sided', conf.level=.95, var.equal=FALSE, data=Restaurants)
Welch Two Sample t-test
data: Cost by Location
t = 0.58779, df = 97.547, p-value = 0.558
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.326909 6.126909
sample estimates:
mean in group City mean in group Suburban
47.68 46.28
Overall, the mean cost is $46.98 with a 95% confidence interval ranging from $44.62 to $49.34. With no other information taken into account, average cost shows little evidence of differing systematically between cities and suburbs: with 95% confidence, the city mean could be anywhere from $3.33 below to $6.13 above the suburban mean. We can use a plot of means with 95% confidence intervals to display this.
> plotmeans(Cost~Location, data=Restaurants, ci.label = T, ylim=c(30,60))
This section takes up three questions:
1. Are summated ratings related to Cost?
2. Net of ratings, are cities and suburbs different?
3. Is the relationship between the summated rating and Cost the same in cities and suburbs, or different?
Let me start with the scatterplot.
> with(Restaurants, plot(x=Summated.Rating, y=Cost))
> abline(lm(Cost~Summated.Rating, data=Restaurants))
And the regression to estimate the least-squares line.
> SRatings <- lm(Cost~Summated.Rating, data=Restaurants)
> summary(SRatings)
Call:
lm(formula = Cost ~ Summated.Rating, data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-16.794 -5.232 -0.148 4.801 25.710
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.1862 6.7756 -4.603 1.25e-05 ***
Summated.Rating 1.2920 0.1113 11.612 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.74 on 98 degrees of freedom
Multiple R-squared: 0.5791, Adjusted R-squared: 0.5748
F-statistic: 134.8 on 1 and 98 DF, p-value: < 2.2e-16
The estimated equation is:
\[ \text{Cost} = -31.1862 + 1.2920 \times \text{Summated.Rating} + \varepsilon \]
Taken literally, a restaurant with no rating would serve meals with cash attached; the negative intercept is an extrapolation well below the observed ratings. For each rating point, cost increases by about $1.30. Overall, the summated rating explains about 58% of the variation in costs. Half of the predictions fall within roughly $5 of the actual value (the first and third residual quartiles are -$5.23 and $4.80), with a residual standard error of $7.74, driven in part by residuals as extreme as +$25.71 and -$16.79. Let's now look at the residuals.
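As a quick check on reading the equation, here is a sketch of the fitted cost at a summated rating of 60 (a hypothetical value near the mean), first by hand and then with predict():
# Sketch: fitted cost at a summated rating of 60, from the coefficients by
# hand and via predict(); both should agree at about $46.33.
coef(SRatings)
-31.1862 + 1.2920 * 60
predict(SRatings, newdata = data.frame(Summated.Rating = 60))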
> Restaurants<- within(Restaurants, {
+ residuals.SRatings <- residuals(SRatings)
+ })
> qqPlot(Restaurants$residuals.SRatings)
[1] 47 36
> shapiro.test(Restaurants$residuals.SRatings)
Shapiro-Wilk normality test
data: Restaurants$residuals.SRatings
W = 0.97124, p-value = 0.02753
The residuals are not normal. There are a few too many costs that are much higher than predicted [top right]. These points will require explanation. Let’s first see if that is simply a result of city/suburb differences.
> SRandLoc <- lm(Cost~Coded.Location+Summated.Rating, data=Restaurants)
> summary(SRandLoc)
Call:
lm(formula = Cost ~ Coded.Location + Summated.Rating, data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-18.0603 -5.3309 0.2408 4.8460 24.5649
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.5759 6.7347 -4.540 1.62e-05 ***
Coded.Location -2.4417 1.5386 -1.587 0.116
Summated.Rating 1.3021 0.1106 11.774 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.68 on 97 degrees of freedom
Multiple R-squared: 0.5898, Adjusted R-squared: 0.5813
F-statistic: 69.73 on 2 and 97 DF, p-value: < 2.2e-16
The estimated equation is:
\[ \text{Cost} = -30.58 + 1.302 \times \text{Summated.Rating} - 2.44 \times \text{Suburban} + \varepsilon \]
where Suburban is Coded.Location (1 for suburban restaurants, 0 for city restaurants).
> Restaurants<- within(Restaurants, {
+ residuals.SRandLoc <- residuals(SRandLoc)
+ })
> qqPlot(Restaurants$residuals.SRandLoc)
[1] 47 36
> shapiro.test(Restaurants$residuals.SRandLoc)
Shapiro-Wilk normality test
data: Restaurants$residuals.SRandLoc
W = 0.97983, p-value = 0.1289
That seems to have fixed the problem: with location in the model, the Shapiro-Wilk test no longer rejects normality (p = 0.13). Now let me turn to a few residual diagnostics.
The diagnostic plots below check out. So what does the relationship look like?
> # The following code plots four panels in a graphic and returns the graphics settings to a single panel.
> oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,2))
> plot(SRandLoc)
> par(oldpar)
> library(effects)
> plot(allEffects(SRandLoc, partial.residuals=TRUE), span=1)
In the effect plots, the line is the fitted value and the shaded region is the 95% confidence interval for the average cost at a given summated rating.
> Confint(SRandLoc, level=0.95)
Estimate 2.5 % 97.5 %
(Intercept) -30.575857 -43.94241 -17.2093020
Coded.Location -2.441675 -5.49540 0.6120497
Summated.Rating 1.302094 1.08260 1.5215879
Each summated rating point is worth between $1.08 and $1.52 in cost. A focus on the summated rating assumes that the slopes on Decor, Food, and Service are all the same. Is this true?
> ThreeRatings <- lm(Cost~Décor+Food.+Service, data=Restaurants)
> summary(ThreeRatings)
Call:
lm(formula = Cost ~ Décor + Food. + Service, data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-13.945 -5.635 -0.910 5.082 22.812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.70814 7.06629 -2.931 0.004228 **
Décor 1.81781 0.22536 8.066 2.06e-12 ***
Food. -0.09981 0.39941 -0.250 0.803200
Service 1.76613 0.49427 3.573 0.000554 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.173 on 96 degrees of freedom
Multiple R-squared: 0.6459, Adjusted R-squared: 0.6348
F-statistic: 58.37 on 3 and 96 DF, p-value: < 2.2e-16
It seems not: Decor and Service each add roughly $1.80 per point while Food's slope is near zero. Let's compare the two regressions formally.
> anova(SRatings, ThreeRatings)
Analysis of Variance Table
Model 1: Cost ~ Summated.Rating
Model 2: Cost ~ Décor + Food. + Service
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 5870.4
2 96 4939.2 2 931.15 9.0491 0.000251 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic tests the claim that the two models explain the same amount of variance; the alternative is that the separate rating slopes fit better. The very small p-value (the probability of a gap this large if the single-slope model were adequate) suggests that the ratings have different impacts on cost.
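As a sketch of where that F comes from, it can be recomputed by hand from the two residual sums of squares reported above:
# Sketch: the partial F statistic from the RSS values in the anova table.
rss_restricted <- 5870.4   # Cost ~ Summated.Rating (98 df)
rss_full <- 4939.2         # Cost ~ Décor + Food. + Service (96 df)
F_stat <- ((rss_restricted - rss_full) / 2) / (rss_full / 96)
F_stat                                              # about 9.05, as reported
pf(F_stat, df1 = 2, df2 = 96, lower.tail = FALSE)   # about 0.00025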
> Restaurants<- within(Restaurants, {
+ residuals.ThreeRatings <- residuals(ThreeRatings)
+ })
> qqPlot(Restaurants$residuals.ThreeRatings)
[1] 47 36
> shapiro.test(Restaurants$residuals.ThreeRatings)
Shapiro-Wilk normality test
data: Restaurants$residuals.ThreeRatings
W = 0.97196, p-value = 0.03131
The residuals are again not normal. Perhaps we need to account for the city/suburb difference once more.
> RegModel.5 <- lm(Cost~Coded.Location+Décor+Food.+Service, data=Restaurants)
> summary(RegModel.5)
Call:
lm(formula = Cost ~ Coded.Location + Décor + Food. + Service,
data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-15.1604 -5.7500 -0.5895 4.9411 21.7674
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.6592 7.0098 -2.947 0.004035 **
Coded.Location -2.2964 1.4367 -1.598 0.113275
Décor 1.7909 0.2242 7.988 3.19e-12 ***
Food. -0.1145 0.3963 -0.289 0.773258
Service 1.8616 0.4939 3.769 0.000285 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.115 on 95 degrees of freedom
Multiple R-squared: 0.6552, Adjusted R-squared: 0.6406
F-statistic: 45.12 on 4 and 95 DF, p-value: < 2.2e-16
> anova(ThreeRatings, RegModel.5)
Analysis of Variance Table
Model 1: Cost ~ Décor + Food. + Service
Model 2: Cost ~ Coded.Location + Décor + Food. + Service
Res.Df RSS Df Sum of Sq F Pr(>F)
1 96 4939.2
2 95 4809.9 1 129.35 2.5549 0.1133
> Restaurants<- within(Restaurants, {
+ residuals.RegModel.5 <- residuals(RegModel.5)
+ })
> qqPlot(Restaurants$residuals.RegModel.5)
[1] 47 24
> shapiro.test(Restaurants$residuals.RegModel.5)
Shapiro-Wilk normality test
data: Restaurants$residuals.RegModel.5
W = 0.98143, p-value = 0.1715
That works. Now I have a model. Given a set of ratings, I can use a confidence interval to predict the mean cost or a prediction interval to predict an individual restaurant's cost (a sketch of both appears after the effect plots below). I can also describe confidence intervals for the coefficients. Let me do that here.
> confint(RegModel.5)
2.5 % 97.5 %
(Intercept) -34.5753968 -6.7429184
Coded.Location -5.1486597 0.5557957
Décor 1.3457949 2.2359313
Food. -0.9013008 0.6722778
Service 0.8809753 2.8421666
Decor and Service influence costs. Food appears not to matter. Suburbs are a bit less expensive, but there is much uncertainty and the 95 percent confidence interval includes 0. I should also mention that Decor and Service could have the same effect: their intervals overlap, so there are common values of the cost per point.
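One way to check that directly is an equal-slopes test with car's linearHypothesis (a sketch; the Ambience construction below is the route I actually take):
# Sketch: test whether the Décor and Service slopes in RegModel.5 are equal.
# A large p-value is consistent with a single common cost per point.
linearHypothesis(RegModel.5, "Décor = Service")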
> plot(allEffects(RegModel.5, partial.residuals=TRUE), span=1)
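And here is the promised sketch of a mean versus an individual prediction from this model, at a hypothetical set of ratings (a city restaurant scoring 19 on Decor, 21 on Food, and 20 on Service):
# Sketch: hypothetical ratings, not a restaurant from the data set.
new_rest <- data.frame(Coded.Location = 0, Décor = 19, Food. = 21, Service = 20)
predict(RegModel.5, newdata = new_rest, interval = "confidence")  # mean cost
predict(RegModel.5, newdata = new_rest, interval = "prediction")  # one restaurant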
I will create a new variable that is the sum of Decor and Service. I call it Ambience. Let me fit a regression of cost as a function of location, Food, and Ambience, and compare it to the regression with three ratings.
> Restaurants$Ambience <- with(Restaurants, Décor+ Service)
> AmbienceLocF <- lm(Cost~Ambience+Coded.Location+Food., data=Restaurants)
> summary(AmbienceLocF)
Call:
lm(formula = Cost ~ Ambience + Coded.Location + Food., data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-15.1923 -5.8603 -0.6167 4.9205 21.7746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.42085 6.64832 -3.072 0.00277 **
Ambience 1.80884 0.15656 11.554 < 2e-16 ***
Coded.Location -2.27670 1.41863 -1.605 0.11181
Food. -0.09241 0.34252 -0.270 0.78790
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.079 on 96 degrees of freedom
Multiple R-squared: 0.6551, Adjusted R-squared: 0.6443
F-statistic: 60.78 on 3 and 96 DF, p-value: < 2.2e-16
> anova(AmbienceLocF, RegModel.5)
Analysis of Variance Table
Model 1: Cost ~ Ambience + Coded.Location + Food.
Model 2: Cost ~ Coded.Location + Décor + Food. + Service
Res.Df RSS Df Sum of Sq F Pr(>F)
1 96 4810.5
2 95 4809.9 1 0.64209 0.0127 0.9106
It would seem that Ambience is a useful simplification: the fit is essentially unchanged (p = 0.91).
> Restaurants<- within(Restaurants, {
+ residuals.AmbienceLocF <- residuals(AmbienceLocF)
+ })
> qqPlot(Restaurants$residuals.AmbienceLocF)
[1] 47 24
> shapiro.test(Restaurants$residuals.AmbienceLocF)
Shapiro-Wilk normality test
data: Restaurants$residuals.AmbienceLocF
W = 0.98087, p-value = 0.1554
From here, it is important to point out a few things. We can use this model for prediction or for characterization, but Food remains part of the model even though its slope is indistinguishable from zero. Can we eliminate it?
> Ambience.NoF <- lm(Cost ~ Location +Ambience, data=Restaurants)
> summary(Ambience.NoF)
Call:
lm(formula = Cost ~ Location + Ambience, data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-15.4358 -5.8003 -0.5707 4.8410 22.0012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.526 5.211 -4.131 7.67e-05 ***
LocationSuburban -2.293 1.411 -1.626 0.107
Ambience 1.786 0.132 13.530 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.045 on 97 degrees of freedom
Multiple R-squared: 0.6548, Adjusted R-squared: 0.6477
F-statistic: 92.02 on 2 and 97 DF, p-value: < 2.2e-16
> anova(Ambience.NoF,AmbienceLocF)
Analysis of Variance Table
Model 1: Cost ~ Location + Ambience
Model 2: Cost ~ Ambience + Coded.Location + Food.
Res.Df RSS Df Sum of Sq F Pr(>F)
1 97 4814.1
2 96 4810.5 1 3.6472 0.0728 0.7879
> Restaurants<- within(Restaurants, {
+ residuals.Ambience.NoF <- residuals(Ambience.NoF)
+ })
> qqPlot(Restaurants$residuals.Ambience.NoF)
[1] 47 24
> shapiro.test(Restaurants$residuals.Ambience.NoF)
Shapiro-Wilk normality test
data: Restaurants$residuals.Ambience.NoF
W = 0.98263, p-value = 0.212
Eliminating Food changes essentially nothing (p = 0.79), and the residuals remain acceptable. That leaves the third question: is the relationship between Ambience and Cost the same in cities and suburbs? Allow the slope to differ by location and compare.
> LinearModel.12 <- lm(Cost ~ Location + Ambience:Location, data=Restaurants)
> summary(LinearModel.12)
Call:
lm(formula = Cost ~ Location + Ambience:Location, data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-15.212 -5.797 -0.558 4.719 21.900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.4745 7.3943 -2.769 0.00675 **
LocationSuburban -4.3778 10.4445 -0.419 0.67604
LocationCity:Ambience 1.7593 0.1891 9.303 4.68e-15 ***
LocationSuburban:Ambience 1.8127 0.1862 9.733 5.57e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.08 on 96 degrees of freedom
Multiple R-squared: 0.655, Adjusted R-squared: 0.6442
F-statistic: 60.75 on 3 and 96 DF, p-value: < 2.2e-16
> anova(Ambience.NoF, LinearModel.12)
Analysis of Variance Table
Model 1: Cost ~ Location + Ambience
Model 2: Cost ~ Location + Ambience:Location
Res.Df RSS Df Sum of Sq F Pr(>F)
1 97 4814.1
2 96 4812.1 1 2.0343 0.0406 0.8408
No: allowing separate slopes adds essentially nothing (F = 0.04, p = 0.84), so the Ambience-Cost relationship is the same in cities and suburbs.
We have a final model: Cost as a function of Ambience and Location.
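Written in the same form as the earlier fits, using the Ambience.NoF coefficients above:
\[ \text{Cost} = -21.53 + 1.786 \times \text{Ambience} - 2.293 \times \text{Suburban} + \varepsilon \]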
It is important to note that we could remove Location as well, but if we do, the residuals drift away from normality because suburban restaurants are typically, though not universally, somewhat cheaper.
> RegModel.13 <- lm(Cost~Ambience, data=Restaurants)
> summary(RegModel.13)
Call:
lm(formula = Cost ~ Ambience, data = Restaurants)
Residuals:
Min 1Q Median 3Q Max
-14.2088 -5.6033 -0.7277 5.3314 23.1077
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.281 5.234 -4.257 4.75e-05 ***
Ambience 1.776 0.133 13.357 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.104 on 98 degrees of freedom
Multiple R-squared: 0.6454, Adjusted R-squared: 0.6418
F-statistic: 178.4 on 1 and 98 DF, p-value: < 2.2e-16
> anova(Ambience.NoF, RegModel.13)
Analysis of Variance Table
Model 1: Cost ~ Location + Ambience
Model 2: Cost ~ Ambience
Res.Df RSS Df Sum of Sq F Pr(>F)
1 97 4814.1
2 98 4945.3 -1 -131.18 2.6432 0.1072
> Restaurants<- within(Restaurants, {
+ residuals.RegModel.13 <- residuals(RegModel.13)
+ })
> qqPlot(Restaurants$residuals.RegModel.13)
[1] 47 36
> shapiro.test(Restaurants$residuals.RegModel.13)
Shapiro-Wilk normality test
data: Restaurants$residuals.RegModel.13
W = 0.97429, p-value = 0.0475
> plot(density(Restaurants$residuals.RegModel.13))
It appears as though Decor and Service are the key drivers, and it is not clear that they have different effects. Location does not clearly matter, but a few high-priced city restaurants skew the residuals when Location is not taken into account. A simple graphical depiction is shown below. The 95% confidence intervals for the final model are:
> confint(Ambience.NoF)
2.5 % 97.5 %
(Intercept) -31.868961 -11.1830439
LocationSuburban -5.092707 0.5062849
Ambience 1.524363 2.0484818
> plot(allEffects(Ambience.NoF, partial.residuals=TRUE), span=0.95)