Part (a)

Using my TipData from the first project, I chose to predict Tip based upon the numerical variables of bill amount (“Bill”) and the percent tip (“PctTip”), as well as the categorical variable of “Credit”, which indicates whether the bill was paid using a credit card (“y”) or cash (“n”). The multiple regression can be seen in R below:

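# stringsAsFactors = TRUE keeps Credit, Day, and Server as factors (the default changed in R 4.0)
tipdata = read.csv("http://bit.ly/1StTazL", header = TRUE, stringsAsFactors = TRUE)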
str(tipdata)
## 'data.frame':    140 obs. of  7 variables:
##  $ Bill  : num  10.2 18.4 11.7 9.2 18.1 ...
##  $ Tip   : num  1.83 2.75 2.28 1.8 4 3.13 5 3.35 7.25 3 ...
##  $ Credit: Factor w/ 2 levels "n","y": 1 1 2 1 1 2 2 2 2 1 ...
##  $ Guests: int  1 2 1 1 3 2 2 2 2 2 ...
##  $ Day   : Factor w/ 5 levels "F","M","R","T",..: 5 2 5 5 5 5 3 4 5 1 ...
##  $ Server: Factor w/ 3 levels "A","B","C": 1 2 1 1 3 2 3 1 1 3 ...
##  $ PctTip: num  18 14.9 19.5 19.6 22.1 15 19.9 18 18.2 13.4 ...
myreg1 = lm(Tip ~ Bill + PctTip + Credit, data = tipdata)
summary(myreg1)
## 
## Call:
## lm(formula = Tip ~ Bill + PctTip + Credit, data = tipdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27293 -0.17670 -0.04499  0.07162  2.92768 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.477222   0.149157 -23.313   <2e-16 ***
## Bill         0.172838   0.003176  54.426   <2e-16 ***
## PctTip       0.204232   0.008163  25.020   <2e-16 ***
## Credity      0.010282   0.083186   0.124    0.902    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4353 on 136 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:  0.9698 
## F-statistic:  1487 on 3 and 136 DF,  p-value: < 2.2e-16

Therefore, the fitted equation for this multiple regression model to predict Tip is: y= -3.4772 + 0.1728x1 + 0.2042x2 + 0.0103x3, where x1= Bill, x2= PctTip, and x3= Credit. Because Credit is a categorical variable, it enters the model as an indicator variable. As the coefficient name “Credity” in the R output shows, x3= 0 for “n” (“no”, meaning the bill was paid in cash) and x3= 1 for “y” (“yes”, meaning the bill was paid with a credit card).
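To make the indicator coding concrete, the sketch below builds the design matrix for a small made-up data frame. The `toy` data and its values are hypothetical and are not taken from TipData; only the factor levels “n” and “y” match the original variable.

```r
# Toy data (hypothetical values, not taken from TipData)
toy <- data.frame(
  Tip    = c(2.0, 3.5, 1.8, 4.2),
  Credit = factor(c("n", "y", "n", "y"))
)

# model.matrix() shows the design matrix that lm() would build from the formula
X <- model.matrix(Tip ~ Credit, data = toy)
print(X)

# The Credity column is 1 for "y" (credit card) and 0 for "n" (cash)
credit_indicator <- unname(X[, "Credity"])
print(credit_indicator)
```

Because “n” comes first alphabetically, R treats it as the baseline level, which is why the coefficient in the summary is labeled “Credity” rather than “Creditn”.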

Part (b)

The regression ANOVA table for the model constructed in Part (a) was built from the following R output:

aov(myreg1)
## Call:
##    aov(formula = myreg1)
## 
## Terms:
##                     Bill   PctTip   Credit Residuals
## Sum of Squares  726.4925 118.8237   0.0029   25.7665
## Deg. of Freedom        1        1        1       136
## 
## Residual standard error: 0.4352691
## Estimated effects may be unbalanced
var(tipdata$Tip)*139
## [1] 871.0855
pf(1486.9, 3, 136, lower.tail = FALSE)
## [1] 9.976404e-104

The regression ANOVA table is attached to the back of this document.
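As a rough sketch of how the entries of that table follow from the output above, the code below combines the printed sums of squares into Model, Error, and Total rows. The variable names are my own, and the inputs are the rounded values printed by `aov()`, so the results agree with the R output only up to rounding.

```r
# Sequential sums of squares copied from the aov() output above
ss_terms <- c(Bill = 726.4925, PctTip = 118.8237, Credit = 0.0029)
ss_error <- 25.7665
df_model <- length(ss_terms)   # 3 predictors
df_error <- 136

ss_model <- sum(ss_terms)
ss_total <- ss_model + ss_error   # should agree with var(tipdata$Tip) * 139

ms_model <- ss_model / df_model
ms_error <- ss_error / df_error
f_stat   <- ms_model / ms_error
p_value  <- pf(f_stat, df_model, df_error, lower.tail = FALSE)

anova_table <- data.frame(
  Source = c("Model", "Error", "Total"),
  df     = c(df_model, df_error, df_model + df_error),
  SS     = c(ss_model, ss_error, ss_total),
  MS     = c(ms_model, ms_error, NA),
  F      = c(f_stat, NA, NA)
)
print(anova_table)
print(p_value)
```

The total sum of squares computed this way should match `var(tipdata$Tip)*139`, and the F statistic should match the one reported by `summary(myreg1)` up to rounding.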

Part (c)

The 95% confidence intervals for the coefficients of the variables for the model in Part (a) can be seen in R below:

confint(myreg1)
##                  2.5 %     97.5 %
## (Intercept) -3.7721884 -3.1822559
## Bill         0.1665581  0.1791182
## PctTip       0.1880897  0.2203739
## Credity     -0.1542228  0.1747863

In other words, the interval for Bill is (0.167, 0.179), the interval for PctTip is (0.188, 0.220), and the interval for Credit is (-0.154, 0.175).

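At a significance level of 0.05, the coefficients that are significantly different from 0 are those of Bill and PctTip, since neither of their 95% confidence intervals contains 0. The interval for Credit does contain 0 (and its p-value is 0.902), so Credit is not significant at this level.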
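For reference, these intervals can be reproduced by hand as estimate ± t* × standard error, using the t distribution with 136 residual degrees of freedom. The sketch below uses the rounded estimates and standard errors from the `summary(myreg1)` output, so it matches `confint()` only to about four decimal places; the variable names are my own.

```r
# Estimates and standard errors copied from summary(myreg1)
est <- c(Bill = 0.172838, PctTip = 0.204232, Credity = 0.010282)
se  <- c(Bill = 0.003176, PctTip = 0.008163, Credity = 0.083186)

# Critical value for a 95% interval with 136 residual degrees of freedom
t_crit <- qt(0.975, df = 136)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# A coefficient is not significant at the 0.05 level if its interval contains 0
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
print(contains_zero)
```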

Part (d) i.

Because Credit was not significant, I constructed another regression model excluding Credit:

myreg2 = lm(Tip ~ Bill + PctTip, data = tipdata)
summary(myreg2)
## 
## Call:
## lm(formula = Tip ~ Bill + PctTip, data = tipdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27272 -0.17601 -0.04493  0.07691  2.92888 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.477561   0.148595  -23.40   <2e-16 ***
## Bill         0.172978   0.002958   58.48   <2e-16 ***
## PctTip       0.204270   0.008127   25.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4337 on 137 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:   0.97 
## F-statistic:  2247 on 2 and 137 DF,  p-value: < 2.2e-16

Therefore, the fitted equation for this regression model is: y= -3.4776 + 0.1730x1 + 0.2043x2, where x1= “Bill” and x2= “PctTip”.
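As an illustration of how this equation would be used, the sketch below plugs a bill amount and tip percentage into the fitted coefficients. The function name `predict_tip` and the example inputs (a $20 bill with an 18% tip) are hypothetical choices of mine, not values from the assignment.

```r
# Coefficients copied from summary(myreg2)
predict_tip <- function(bill, pct_tip) {
  -3.477561 + 0.172978 * bill + 0.204270 * pct_tip
}

# Hypothetical example: a $20 bill with an 18% tip
example_tip <- predict_tip(bill = 20, pct_tip = 18)
print(round(example_tip, 2))
```

With the fitted model object available, `predict(myreg2, newdata = data.frame(Bill = 20, PctTip = 18))` should give the same value up to rounding of the coefficients.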

Part (e)

Of the two multiple regression models that I have constructed, I would choose myreg2, the model excluding Credit, to make a prediction about Tip. Because Bill and PctTip have such a strong influence on Tip, the difference between the two models is not drastic: both have a Multiple R-squared of 0.9704. Nevertheless, because Credit was not a significant variable at the 0.05 significance level, excluding it from myreg2 produced a simpler model that fits essentially as well. This is evidenced by the Adjusted R-squared values, which penalize a model for each additional predictor. While the Adjusted R-squared value for myreg1 was 0.9698, the value for myreg2 was slightly higher at 0.9700. Thus, I would use myreg2 to make a prediction.
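The Adjusted R-squared comparison can be checked by hand from the sums of squares in Part (b). The sketch below assumes the residual sum of squares for myreg2 equals that of myreg1 plus the sequential sum of squares for Credit (0.0029), which holds because Credit was the last term entered in the ANOVA. The helper `adj_r2` and the variable names are my own, and the inputs are rounded printed values.

```r
n   <- 140
sst <- 871.0855                    # var(tipdata$Tip) * 139, from Part (b)
sse_full    <- 25.7665             # residual SS for myreg1
sse_reduced <- sse_full + 0.0029   # add back the sequential SS for Credit

# Adjusted R-squared: 1 - (SSE / (n - p - 1)) / (SST / (n - 1)), p = number of predictors
adj_r2 <- function(sse, sst, n, p) {
  1 - (sse / (n - p - 1)) / (sst / (n - 1))
}

adj_full    <- adj_r2(sse_full, sst, n, p = 3)
adj_reduced <- adj_r2(sse_reduced, sst, n, p = 2)
print(round(c(myreg1 = adj_full, myreg2 = adj_reduced), 4))
```

Dropping Credit raises the residual sum of squares only slightly while freeing one degree of freedom, which is why the adjusted value goes up even though the unadjusted R-squared does not.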