Myra Hallman Mini Project 2
library(ggplot2)
library(scales)
library(dplyr)
library(tidyverse)
library(forecast)
tayko.df <- read.csv("Tayko.csv")
tayko.df <- na.omit(tayko.df)
tayko <- mutate(tayko.df,
GENDER = factor(GENDER, labels = c("Female", "Male")),
WEB = factor(WEB, labels = c("Non_Web", "Web")),
ADDRESS_RES = factor(ADDRESS_RES, labels = c("Non_Residential", "Residential")),
ADDRESS_US = factor(ADDRESS_US, labels = c("Non_US", "US")))
According to the Gender Pivot Table below, it appears that females spend on average, approximately $9 more than males from the test mailing.
gendergroup.df <- group_by(tayko, GENDER)
gender.by.spending.df <- summarize(gendergroup.df,
mean_spending=mean(SPENDING),
median_spending=median(SPENDING),
standard_dev=sd(SPENDING, na.rm = TRUE))
(gender.by.spending.df)
The pivot table below shows that those buying on the web at least once tend to spend more than those who do not use the web for purchasing.
webgroup.df <- group_by(tayko, WEB)
web.by.spending.df <- summarize(webgroup.df,
mean_spending=mean(SPENDING),
median_spending=median(SPENDING),
standard_dev=sd(SPENDING, na.rm = TRUE))
(web.by.spending.df)
Commercial addresses have a higher mean spending than residential. See the Pivot table below for details.
address.res.df <- group_by(tayko, ADDRESS_RES)
address.res.by.spending.df <- summarize(address.res.df,
mean_spending=mean(SPENDING),
median_spending=median(SPENDING),
standard_dev=sd(SPENDING, na.rm = TRUE))
(address.res.by.spending.df)
Although it’s close, US customers do appear to spend slightly more than non_US customers, on average. The Pivot Table below shows that there is about $1.71 difference in mean spending between the two.
address.us.group.df <- group_by(tayko, ADDRESS_US)
address.us.by.spending.df <- summarize(address.us.group.df,
mean_spending=mean(SPENDING),
median_spending=median(SPENDING),
standard_dev=sd(SPENDING, na.rm = TRUE))
(address.us.by.spending.df)
There does not appear to be a relationship between the last update and the mean price that I can tell. Please reference the below Pivot Table for data.
last.update.group.df <- group_by(tayko, LAST_UPDATE)
last.update.by.spending.df <- summarize(last.update.group.df,
mean_spending=mean(SPENDING),
median_spending=median(SPENDING),
standard_dev=sd(SPENDING, na.rm = TRUE))
(last.update.by.spending.df)
The Pivot Table below shows the relationship between frequency and spending. It appears there is a relationship that with increased frequency, the average spending also increases. That is until about the 10th purchase.
freq.group.df <- group_by(tayko, FREQ)
freq.by.spending.df <- summarize(freq.group.df,
mean_spending=mean(SPENDING),
median_spending=median(SPENDING),
standard_dev=sd(SPENDING, na.rm = TRUE))
(freq.by.spending.df)
The scatter plot below shows the relationship between spending and frequency of purchase. It appears that when the frequency increases, so then does the amount spent. The customers that have shopped less that 5 times tend to buy more of the less expensive items.
plot(tayko.df$SPENDING~tayko.df$FREQ, xlab="Frequency", ylab="Spending")
The relationship between spending and last update seems to be less obvious. There does not seem to be a known relationship between the two variable. See the plot below for reference.
plot(tayko.df$SPENDING~tayko.df$LAST_UPDATE, xlab="Last Update", ylab="Spending")
Develop a predictive model for Spending: 3.A. Partition the 2000 records into training and validation sets.
set.seed(1)
TrainIndex <- as.numeric(sample(row.names(tayko.df), 1200))
train.df <- tayko.df [TrainIndex, ]
valid.df <- tayko.df [-TrainIndex, ]
3.B. Run a multiple linear regression model for Spending vs. all six predictors. Report the estimated predictive equation.
Let B1=Freq; B2=Last Update; B3=Web; B4=Gender; B5=Address_Res; B6=Address_US
y = 14.5 + 89B1 - 0.009B2 + 14.14B3 - 3.19B4 - 71.75B5 - 11.62B6
spending.lm.all <- lm(SPENDING ~ ., data = train.df)
summary(spending.lm.all)
Call:
lm(formula = SPENDING ~ ., data = train.df)
Residuals:
Min 1Q Median 3Q Max
-413.83 -75.88 -3.06 35.78 1336.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.518218 13.448542 1.080 0.2806
FREQ 89.089779 2.983947 29.856 < 2e-16 ***
LAST_UPDATE -0.009212 0.003439 -2.678 0.0075 **
WEB 14.140303 7.390405 1.913 0.0559 .
GENDER -3.189701 7.292959 -0.437 0.6619
ADDRESS_RES -71.654020 9.146249 -7.834 1.04e-14 ***
ADDRESS_US -11.624718 9.221134 -1.261 0.2077
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 125.8 on 1193 degrees of freedom
Multiple R-squared: 0.4813, Adjusted R-squared: 0.4787
F-statistic: 184.5 on 6 and 1193 DF, p-value: < 2.2e-16
accuracy(predict(spending.lm.all, valid.df), valid.df$SPENDING)
ME RMSE MAE MPE MAPE
Test set 6.974396 137.6987 81.38876 NaN Inf
3.C. Use backward elimination to reduce the number of predictors. Report the estimated predictive equation.
Let B1=Freq; B2=Last Update; B3=Web; B4=Gender; B5=Address_Res; B6=Address_US
y = 4.26 + 88.85B1 - 0.009B2 + 14.16B3 - 71.56B5
spending.lm.backward <- step(spending.lm.all, direction = "backward")
Start: AIC=11611
SPENDING ~ FREQ + LAST_UPDATE + WEB + GENDER + ADDRESS_RES +
ADDRESS_US
Df Sum of Sq RSS AIC
- GENDER 1 3029 18894916 11609
- ADDRESS_US 1 25167 18917054 11611
<none> 18891887 11611
- WEB 1 57972 18949859 11613
- LAST_UPDATE 1 113609 19005495 11616
- ADDRESS_RES 1 971919 19863806 11669
- FREQ 1 14115891 33007778 12279
Step: AIC=11609.19
SPENDING ~ FREQ + LAST_UPDATE + WEB + ADDRESS_RES + ADDRESS_US
Df Sum of Sq RSS AIC
- ADDRESS_US 1 25492 18920408 11609
<none> 18894916 11609
- WEB 1 58087 18953003 11611
- LAST_UPDATE 1 112439 19007355 11614
- ADDRESS_RES 1 969982 19864898 11667
- FREQ 1 14157517 33052433 12278
Step: AIC=11608.81
SPENDING ~ FREQ + LAST_UPDATE + WEB + ADDRESS_RES
Df Sum of Sq RSS AIC
<none> 18920408 11609
- WEB 1 58155 18978562 11610
- LAST_UPDATE 1 120215 19040623 11614
- ADDRESS_RES 1 969872 19890280 11667
- FREQ 1 14148682 33069089 12277
summary(spending.lm.backward)
Call:
lm(formula = SPENDING ~ FREQ + LAST_UPDATE + WEB + ADDRESS_RES,
data = train.df)
Residuals:
Min 1Q Median 3Q Max
-416.80 -76.07 -2.86 34.72 1332.54
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.264397 10.908492 0.391 0.69592
FREQ 88.854127 2.972357 29.893 < 2e-16 ***
LAST_UPDATE -0.009450 0.003429 -2.755 0.00595 **
WEB 14.162450 7.389719 1.917 0.05554 .
ADDRESS_RES -71.560735 9.143212 -7.827 1.1e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 125.8 on 1195 degrees of freedom
Multiple R-squared: 0.4805, Adjusted R-squared: 0.4788
F-statistic: 276.4 on 4 and 1195 DF, p-value: < 2.2e-16
accuracy(predict(spending.lm.backward, valid.df), valid.df$SPENDING)
ME RMSE MAE MPE MAPE
Test set 6.502126 137.6235 81.32966 NaN Inf
3.D. Use your own intuition and the results you observed so far to propose a linear model for predicting Spending. If you want, you can use any subset of the predictors, any combination of them, or any non-linear transformation of predictors. Report the estimated predictive equation.
Let B1=Freq; B2=Last Update; B3=Web; B4=Gender; B5=Address_Res; B6=Address_US
y = 101.38 - 10.23B4 +4.48B6
spending.lm.gender.address_us <- lm(SPENDING ~ GENDER + ADDRESS_US, data = train.df)
summary(spending.lm.gender.address_us)
accuracy(predict(spending.lm.gender.address_us, valid.df), valid.df$SPENDING)
3.E. Evaluate the predictive accuracy of the model by examining its performance on the validation set. Report which model (3.B, 3.C, or 3.D) has the best performance. Explain according to what metric you evaluate the performance of the model. ((The lower the numbers the better the prediction))
Test 1= all Test 2= backward Test 3= gender and US addresses
According to the predictive accuracy reports, it appears the backwards model is slightly more accurate than the all model. The model comparing gender and address seems like a much worse predictor than the other two.
accuracy(predict(spending.lm.all, valid.df), valid.df$SPENDING)
ME RMSE MAE MPE MAPE
Test set 6.974396 137.6987 81.38876 NaN Inf
accuracy(predict(spending.lm.backward, valid.df), valid.df$SPENDING)
ME RMSE MAE MPE MAPE
Test set 6.502126 137.6235 81.32966 NaN Inf
accuracy(predict(spending.lm.gender.address_us, valid.df), valid.df$SPENDING)
ME RMSE MAE MPE MAPE
Test set 7.319066 204.116 120.2655 -Inf Inf