## # A tibble: 6 × 17
## ...1 index date week weekday area count rabate price operator driver
## <dbl> <dbl> <date> <dbl> <dbl> <chr> <dbl> <lgl> <dbl> <chr> <chr>
## 1 1 1 2014-03-01 9 6 Camden 5 TRUE 65.7 Rhonda Taylor
## 2 2 2 2014-03-01 9 6 Westm… 2 FALSE 27.0 Rhonda Butch…
## 3 3 3 2014-03-01 9 6 Westm… 3 FALSE 41.0 Allanah Butch…
## 4 4 4 2014-03-01 9 6 Brent 2 FALSE 26.0 Allanah Taylor
## 5 5 5 2014-03-01 9 6 Brent 5 TRUE 57.6 Rhonda Carter
## 6 6 6 2014-03-01 9 6 Camden 1 FALSE 14.0 Allanah Taylor
## # ℹ 6 more variables: delivery_min <dbl>, temperature <dbl>,
## # wine_ordered <dbl>, wine_delivered <dbl>, wrongpizza <lgl>, quality <chr>
## [1] "...1" "index" "date" "week"
## [5] "weekday" "area" "count" "rabate"
## [9] "price" "operator" "driver" "delivery_min"
## [13] "temperature" "wine_ordered" "wine_delivered" "wrongpizza"
## [17] "quality"
Hypothesis:
price = 5 + 20 × wine_ordered + 0.5 × delivery_min + 15 × count - 10 × rabate + error
In words: a baseline of $5, plus $20 per bottle of wine ordered, plus $0.50 per minute of delivery time, plus $15 per pizza, minus $10 if rabate is true, plus an error term.
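To make the hypothesis concrete, the sketch below evaluates the hypothesized equation for a single made-up order. The helper name `hypothesized_price` and the example order are illustrative assumptions; nothing here is estimated from the data.

```r
# Hypothesized (not estimated) price equation from the hypothesis above.
# `hypothesized_price` is an illustrative helper, not part of the original analysis.
hypothesized_price <- function(wine_ordered, delivery_min, count, rabate) {
  5 + 20 * wine_ordered + 0.5 * delivery_min + 15 * count - 10 * as.numeric(rabate)
}

# Made-up order: 1 bottle of wine, 30-minute delivery, 2 pizzas, rabate applied.
example_price <- hypothesized_price(wine_ordered = 1, delivery_min = 30,
                                    count = 2, rabate = TRUE)
print(example_price)
```

Under the hypothesis this order would cost 5 + 20 + 15 + 30 - 10 = $60, which gives a reference point for comparing against the fitted coefficients.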
Interpretation: Some pairs of variables are fairly highly correlated, but none at a concerning level, and there is certainly no perfect collinearity.
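A correlation plot can be backed up with numbers. The sketch below computes pairwise correlations and variance inflation factors (VIFs) in base R. It runs on simulated stand-in data, not the pizza data; the data frame `sim` and the rule that generates `rabate` are assumptions made only so the example is self-contained.

```r
# Simulated stand-in for the predictors (assumption: not the real pizza data).
set.seed(1)
n <- 200
sim <- data.frame(
  wine_ordered = rbinom(n, 1, 0.2),
  count        = sample(1:6, n, replace = TRUE),
  delivery_min = rnorm(n, mean = 30, sd = 8)
)
# Assumption for illustration: rabate applies to larger orders.
sim$rabate <- as.numeric(sim$count >= 4)

# Pairwise correlations among the predictors.
cor_mat <- cor(sim)
print(round(cor_mat, 2))

# VIF for each predictor: 1 / (1 - R^2) from regressing that predictor
# on all the others. Values near 1 indicate little collinearity.
vif <- sapply(names(sim), function(v) {
  f <- reformulate(setdiff(names(sim), v), response = v)
  1 / (1 - summary(lm(f, data = sim))$r.squared)
})
print(round(vif, 2))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a warning sign; perfect collinearity would make a VIF infinite.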
##
## Call:
## lm(formula = price ~ wine_ordered, data = pizza)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.570 -14.007 -1.017 10.359 63.041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.9870 0.5859 75.08 <2e-16 ***
## wine_ordered 30.3535 1.4822 20.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.62 on 1195 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.2598, Adjusted R-squared: 0.2592
## F-statistic: 419.4 on 1 and 1195 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ wine_ordered + count, data = pizza)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4712 -3.1707 0.2092 3.5142 20.0108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6008 0.4086 11.26 <2e-16 ***
## wine_ordered 31.4329 0.4545 69.16 <2e-16 ***
## count 11.3885 0.1061 107.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.708 on 1194 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9305, Adjusted R-squared: 0.9304
## F-statistic: 7991 on 2 and 1194 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ wine_ordered + count + delivery_min, data = pizza)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0633 -3.2107 0.2249 3.4361 20.3481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.37159 0.55027 6.127 1.21e-09 ***
## wine_ordered 31.31640 0.45399 68.980 < 2e-16 ***
## count 11.37502 0.10574 107.575 < 2e-16 ***
## delivery_min 0.05043 0.01520 3.318 0.000935 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.684 on 1193 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9311, Adjusted R-squared: 0.9309
## F-statistic: 5376 on 3 and 1193 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ wine_ordered + count + delivery_min + rabate,
## data = pizza)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6048 -2.8197 0.0533 3.0642 17.9538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.37691 0.51734 2.662 0.00788 **
## wine_ordered 31.63795 0.41417 76.388 < 2e-16 ***
## count 13.06341 0.14464 90.317 < 2e-16 ***
## delivery_min 0.03615 0.01388 2.605 0.00931 **
## rabateTRUE -7.03613 0.44956 -15.651 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.179 on 1192 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9429, Adjusted R-squared: 0.9427
## F-statistic: 4917 on 4 and 1192 DF, p-value: < 2.2e-16
## R² values:
## Model 1: 0.2597706
## Model 2: 0.9304851
## Model 3: 0.9311206
## Model 4: 0.9428623
##
## Regression Results
## ====================================================================================================================================
##                      Dependent variable:
##                      ---------------------------------------------------------------------------------------------------------------
##                      Price
##                      Wine Only                   Wine + Pizza                Add Delivery                Full Model
##                      (1)                         (2)                         (3)                         (4)
## ------------------------------------------------------------------------------------------------------------------------------------
## wine_ordered         30.354***                   31.433***                   31.316***                   31.638***
##                      (1.482)                     (0.455)                     (0.454)                     (0.414)
## count                                            11.388***                   11.375***                   13.063***
##                                                  (0.106)                     (0.106)                     (0.145)
## delivery_min                                                                 0.050***                    0.036***
##                                                                              (0.015)                     (0.014)
## rabate                                                                                                   -7.036***
##                                                                                                          (0.450)
## Constant             43.987***                   4.601***                    3.372***                    1.377***
##                      (0.586)                     (0.409)                     (0.550)                     (0.517)
## ------------------------------------------------------------------------------------------------------------------------------------
## Observations         1,197                       1,197                       1,197                       1,197
## R2                   0.260                       0.930                       0.931                       0.943
## Adjusted R2          0.259                       0.930                       0.931                       0.943
## Residual Std. Error  18.619 (df = 1195)          5.708 (df = 1194)           5.684 (df = 1193)           5.179 (df = 1192)
## F Statistic          419.364*** (df = 1; 1195)   7,991.087*** (df = 2; 1194) 5,375.712*** (df = 3; 1193) 4,917.472*** (df = 4; 1192)
## ====================================================================================================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01
4
4a: Full model: price = 1.38 + 31.64 × wine_ordered + 13.06 × count + 0.036 × delivery_min - 7.04 × rabate
Holding the other variables constant, each additional bottle of wine increases the bill by about $31.64, while each additional pizza adds about $13.06.
Delivery time has only a small effect, with each extra minute adding roughly 3.6 cents.
If a rebate is applied (rabate is TRUE), the total bill decreases by about $7.04.
The intercept of about $1.38 is the predicted price when nothing is ordered, delivery time is zero, and no rebate applies, so it has little practical meaning.
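As a worked example of how these coefficients combine, the sketch below computes the predicted price for one hypothetical order. The coefficients are copied from the full-model summary output; the order itself is made up for illustration.

```r
# Coefficients copied from the full-model summary output above.
b <- c(intercept = 1.37691, wine_ordered = 31.63795, count = 13.06341,
       delivery_min = 0.03615, rabateTRUE = -7.03613)

# Hypothetical order: 1 bottle of wine, 3 pizzas, 30-minute delivery, rabate applied.
x <- c(intercept = 1, wine_ordered = 1, count = 3, delivery_min = 30, rabateTRUE = 1)

# Predicted price is the sum of coefficient-times-value over all terms.
predicted_price <- sum(b * x)
print(round(predicted_price, 2))
```

For this order the model predicts about $66.25: 1.38 + 31.64 + 39.19 + 1.08 - 7.04.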
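4b: The full model fits best. It has the highest R² and adjusted R² (both about 0.943) and the lowest residual standard error (5.179). The overall F-test p-value does not separate the models, because it is below 2.2e-16 for all four. What does support the full model is that every predictor in it is individually significant (rabate at p < 2e-16, delivery_min at p = 0.009).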
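One way to compare the four nested models more formally is to look at adjusted R² and run sequential F-tests with `anova()`. The sketch below does this on simulated stand-in data. The data frame `sim`, its generating coefficients (chosen to loosely resemble the reported estimates), and the model names `m1` to `m4` are all assumptions for illustration.

```r
# Simulated stand-in data (assumption: not the real pizza data).
set.seed(42)
n <- 500
sim <- data.frame(
  wine_ordered = rbinom(n, 1, 0.2),
  count        = sample(1:6, n, replace = TRUE),
  delivery_min = rnorm(n, mean = 30, sd = 8)
)
sim$rabate <- sim$count >= 4
sim$price  <- 1.4 + 31.6 * sim$wine_ordered + 13 * sim$count +
  0.04 * sim$delivery_min - 7 * sim$rabate + rnorm(n, sd = 5)

# Build the four nested models by adding one predictor at a time.
m1 <- lm(price ~ wine_ordered, data = sim)
m2 <- update(m1, . ~ . + count)
m3 <- update(m2, . ~ . + delivery_min)
m4 <- update(m3, . ~ . + rabate)

# Adjusted R-squared penalizes extra predictors, unlike plain R-squared.
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4),
                 function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 3))

# Sequential F-tests: does each added predictor significantly reduce the
# residual sum of squares relative to the previous, smaller model?
nested <- anova(m1, m2, m3, m4)
print(nested)
```

Each row of the `anova()` table after the first tests whether the newly added predictor improves on the previous model, which is a sharper criterion than comparing overall p-values.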
4c:
4d: The residual plot shows heteroskedasticity: the spread of the residuals widens as the fitted values increase, which gives the plot its cone shape.
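The visual impression from a residual plot can be checked with a Breusch-Pagan test. The sketch below computes the studentized (Koenker) version by hand in base R. It uses simulated data in which the error spread grows with order size; the variables `count`, `price`, and `fit` here are stand-ins, not the models fitted above.

```r
# Simulated stand-in data (assumption): error spread grows with order size,
# so there is heteroskedasticity for the test to detect.
set.seed(7)
n <- 500
count <- sample(1:6, n, replace = TRUE)
price <- 5 + 13 * count + rnorm(n, sd = 1.5 * count)
fit <- lm(price ~ count)

# Breusch-Pagan (studentized version): regress the squared residuals on the
# predictors. Under homoskedasticity, n * R^2 is asymptotically chi-squared
# with degrees of freedom equal to the number of predictors (here 1).
u2 <- resid(fit)^2
aux <- lm(u2 ~ count)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
cat("BP statistic:", round(bp_stat, 2), " p-value:", signif(bp_pval, 3), "\n")
```

A small p-value rejects constant error variance. For the full model the auxiliary regression would include all four predictors, with `df = 4`.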
Gauss-Markov assumptions
Linearity in parameters: The model is linear in its coefficients, so this assumption is satisfied.
Random sampling and independent observations: Because I don’t know how the data were gathered, I can’t be sure if this holds. Orders from the same store, customer, or time period could be correlated.
No perfect multicollinearity: My correlation plot showed no extreme relationships, and lm() estimated every coefficient without dropping any predictor, so this assumption appears to hold.
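Zero conditional mean (exogeneity): This holds only if no omitted factor that affects price is correlated with the included predictors, which I cannot confirm from the data alone. Separately, the QQ plot of residuals is roughly linear through the middle with slight deviations in the tails, so the errors are approximately normal apart from a few extreme values. Normality is not a Gauss-Markov assumption, but it supports the t- and F-tests used for inference.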
Homoskedasticity: The residual plot shows the residuals spreading out more at higher fitted values, so this assumption is likely violated (see 4d). The coefficient estimates remain unbiased, but the usual standard errors may be unreliable unless heteroskedasticity-robust ones are used.
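Robust standard errors can be computed directly from the sandwich formula. The sketch below does so in base R with the HC1 small-sample correction, again on simulated heteroskedastic data rather than the pizza data. The names `fit`, `vcov_hc1`, and `se_table` are illustrative; in practice a package such as sandwich is typically used instead of hand-rolled matrix algebra.

```r
# Simulated stand-in data (assumption): error spread grows with order size.
set.seed(7)
n <- 500
count <- sample(1:6, n, replace = TRUE)
price <- 5 + 13 * count + rnorm(n, sd = 1.5 * count)
fit <- lm(price ~ count)

X <- model.matrix(fit)  # n x k design matrix, including the intercept column
u <- resid(fit)         # OLS residuals
k <- ncol(X)

# Sandwich estimator: (X'X)^-1 [X' diag(u^2) X] (X'X)^-1, scaled by the
# HC1 correction n / (n - k). X * u multiplies row i of X by u[i].
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

# Compare classical and robust standard errors side by side.
se_table <- cbind(classical  = sqrt(diag(vcov(fit))),
                  robust_hc1 = sqrt(diag(vcov_hc1)))
print(round(se_table, 3))
```

If the two columns differ noticeably, inference based on the classical standard errors should be treated with caution.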