Executive Summary

Objective

The objective of this exercise is to predict the tip amount received by a waiter using linear regression.

Task, Performance and Experience

Task

Scenario 1:

Task: Given a bill amount paid by people who ate at the restaurant, predict the tip amount

Scenario 2:

Task: Given the number of people who ate at the restaurant, predict the tip amount

Scenario 3:

Task: Given a bill amount and number of people, predict the tip amount

Performance and Experience

Performance: Minimize the difference between the predicted and actual tip amounts
Experience: The machine learning model will learn from data collected over a period of time
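The performance measure above can be written out as a least-squares objective. This is a sketch for the single-predictor scenarios, assuming "difference" means squared error (as the SSE figures reported later suggest); the symbols b_0, b_1, x_i, y_i are introduced here for illustration and do not appear in the original report.

```latex
\hat{y}_i = b_0 + b_1 x_i, \qquad
\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad
(b_0, b_1) = \arg\min_{b_0, b_1} \mathrm{SSE}
```

Here x_i is the predictor (bill amount or number of people), y_i is the observed tip, and \hat{y}_i is the predicted tip for observation i.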

Dataset

The dataset used for this exercise is provided as part of the OpenIntro package. It contains 95 observations of 5 variables. It is a simulated dataset of tips over a few weeks, based on a couple of days per week. Each tip is associated with a single group, which may include several bills and tables (i.e., groups paid in one lump sum in the simulations).

week - Week number.
day - Day, either Friday or Tuesday.
nPeop - Number of people associated with the group.
bill - Total bill for the group.
tip - Total tip received from the group.

Model #1

Bill Amount

Model #2

No. of People

Model #3

Bill & No. of people

Conclusion

After comparing the three models for this dataset, I conclude that Model #3 is the best model because its sum of squared errors (SSE) is lower than that of the other two models.

Number of Observations in the dataset

Number of Observations in training data

Number of Observations in testing data

Mean Tip Amount

Mean Bill Amount

Mean No. of People

Model #1

Tips Prediction

Bill Amount

Total Sum of Squares (SST)

1400

Sum of Squares due to Regression (SSR)

1214

Sum of Squared Error (SSE)

186

Histogram and Normal Q-Q plot of Residuals

Regression Plot - Bill Amount

Findings

The linear regression equation is y = 0.179556860294802 * x - 0.0776209757925994, where x is the bill amount and y is the predicted tip amount
The sum of squared errors/residuals (SSE) is 185.9090242
The R^2 is 0.8671673
Observation: The plot shows a strong linear relationship between bill amount and tip amount (dependent variable)
Inference: Each additional $1 in the bill amount is associated with an increase of $0.1795569 in the predicted tip amount
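To make the Model #1 equation concrete, the following minimal sketch applies it to a few bill amounts. The coefficients are copied from the findings above; the function name and the example bill amounts are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the Model #1 findings above.
SLOPE = 0.179556860294802
INTERCEPT = -0.0776209757925994


def predict_tip_from_bill(bill):
    """Return the predicted tip (dollars) for a bill amount (dollars)."""
    return SLOPE * bill + INTERCEPT


# Hypothetical bill amounts, not rows from the dataset.
for bill in (20.0, 40.0, 60.0):
    tip = predict_tip_from_bill(bill)
    print(f"bill ${bill:.2f} -> predicted tip ${tip:.2f}")
```

Raising the bill by $1 changes the prediction by exactly the slope, which is what the inference line states.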

Model #2

Tips Prediction

No. of People

Total Sum of Squares (SST)

1400

Sum of Squares due to Regression (SSR)

1206

Sum of Squared Error (SSE)

194

Box Plot using Training data - Number of people against Tip Amount

Regression Plot using Training Data - No. of People

Findings

The linear regression equation is y = 2.60273051356081 * x + 0.359084576878193, where x is the number of people in the group and y is the predicted tip amount
The sum of squared errors/residuals (SSE) is 193.98694
The R^2 is 0.8613956
Observation: The plot shows a strong linear relationship between number of people and tip amount (dependent variable)
Inference: Each additional person in a group is associated with an increase of $2.6027305 in the predicted tip amount
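As a consistency check on the Model #2 figures, R^2 can be approximately recovered from the reported SSE and SST. This is a rough check only: it assumes the displayed SST of 1400 is a rounded value, so the result agrees with the reported R^2 of 0.8613956 to about three decimal places.

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    \approx 1 - \frac{193.99}{1400}
    \approx 0.861
```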

Model #3

Tips Prediction

Bill & No. of People

Total Sum of Squares (SST)

1400

Sum of Squares due to Regression (SSR)

1223

Sum of Squared Error (SSE)

176

Normal Q-Q Plot of residuals

Findings

The linear regression equation is y = 0.10450730575244 * x1 + 1.11551597060224 * x2 + 0.0400253520867612, where x1 is the bill amount, x2 is the number of people, and y is the predicted tip amount
The sum of squared errors/residuals (SSE) is 176.4777821
The R^2 is 0.873906
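Observation: The model shows a strong linear relationship between the two predictors (bill amount and number of people) and the tip amount (dependent variable)
Inference: Holding the number of people constant, each additional $1 in the bill amount is associated with an increase of $0.1045073 in the predicted tip; holding the bill amount constant, each additional person is associated with an increase of $1.1155160. Increasing both together (a $1 higher bill and 1 more person) raises the predicted tip by $1.2200233, the sum of the two coefficients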
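The effect of each predictor in Model #3 can be read off by changing one input at a time. The sketch below uses the coefficients from the equation above; the function name and the baseline group (a $40 bill, 3 people) are hypothetical values picked for illustration.

```python
# Coefficients copied from the Model #3 findings above.
B_BILL = 0.10450730575244
B_PEOPLE = 1.11551597060224
INTERCEPT = 0.0400253520867612


def predict_tip(bill, n_people):
    """Return the predicted tip for a bill amount and a group size."""
    return B_BILL * bill + B_PEOPLE * n_people + INTERCEPT


# Hypothetical baseline group, not a row from the dataset.
base = predict_tip(40.0, 3)

# Change one predictor at a time, then both together.
plus_dollar = predict_tip(41.0, 3) - base   # equals B_BILL
plus_person = predict_tip(40.0, 4) - base   # equals B_PEOPLE
plus_both = predict_tip(41.0, 4) - base     # equals B_BILL + B_PEOPLE

print(f"baseline prediction: {base:.4f}")
print(f"+$1 bill:            {plus_dollar:.7f}")
print(f"+1 person:           {plus_person:.7f}")
print(f"+$1 and +1 person:   {plus_both:.7f}")
```

Because the model is linear with no interaction term, the combined change is simply the sum of the two coefficients.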

Data Table

Summary Report

Model #1 SSE

185.909024246021

Model #2 SSE

193.986940026561

Model #3 SSE

176.477782076843

Report

This report is based on 95 tip observations.
The average bill amount is 41.1952632.
The average number of people who ate at the restaurant is 2.7227368.
The average tip amount received by the waiter is 7.3386842.
Model #3 is the best model because its SSE is the lowest of the three models.
This report was generated on November 21, 2018.
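The model comparison in the report comes down to picking the smallest of the three SSE values listed above. A minimal sketch of that selection follows; the dictionary labels are descriptive names added here, not identifiers from the original analysis.

```python
# SSE values copied from the summary report above.
sse_by_model = {
    "Model #1 (bill)": 185.909024246021,
    "Model #2 (people)": 193.986940026561,
    "Model #3 (bill + people)": 176.477782076843,
}

# The preferred model is the one with the smallest SSE.
best_model = min(sse_by_model, key=sse_by_model.get)

# List the models from lowest to highest SSE.
for name, sse in sorted(sse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: SSE = {sse:.2f}")
print(f"lowest SSE: {best_model}")
```

The report does not state whether these SSE values were computed on the training or the testing data, so this sketch only reproduces the ranking as reported.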

Definitions

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
The case of one explanatory variable is called simple linear regression.
The case of more than one explanatory variable is called multiple linear regression.
The Total Sum of Squares (SST) is the sum, over all observations, of the squared differences between each observation and the overall mean.
The Sum of Squares due to Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the observed responses.
The Sum of Squares Due to Error (SSE), also known as the summed squares of residuals, measures the total deviation of the observed response values from the fitted values.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
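The definitions above can be illustrated with a short standard-library sketch that fits a simple linear regression by ordinary least squares and computes SST, SSR, SSE, and R-squared. The data values are made up for illustration and are NOT rows from the tips dataset; the function names are likewise hypothetical.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def sums_of_squares(x, y):
    """Return SST, SSR, SSE and R-squared for a simple linear fit."""
    slope, intercept = fit_simple_linear(x, y)
    y_bar = mean(y)
    y_hat = [slope * xi + intercept for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error
    return {"SST": sst, "SSR": ssr, "SSE": sse, "R2": 1 - sse / sst}


# Made-up illustrative values, NOT taken from the tips dataset.
bills = [10.0, 20.0, 30.0, 40.0, 50.0]
tips = [2.0, 3.5, 5.5, 7.0, 9.5]

result = sums_of_squares(bills, tips)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

For a least-squares fit with an intercept, SST equals SSR plus SSE, which is why the three figures shown for each model add up (after rounding) and why R-squared can be computed as 1 - SSE / SST.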