Our goal is to build a linear regression model from customer data to help decision makers at an e-commerce business. The data comes from the Udemy course "Python for Data Science and Machine Learning Bootcamp".
The dataset contains 500 customers and 8 variables: Email, Address, Avatar, Avg. Session Length, Time on App, Time on Website, Length of Membership, and Yearly Amount Spent.
str(my_data)
## 'data.frame': 500 obs. of 8 variables:
## $ Email : Factor w/ 500 levels "aaron04@yahoo.com",..: 343 191 356 392 342 18 260 44 461 57 ...
## $ Address : Factor w/ 500 levels "0001 Mack Mill\nNorth Jennifer, NE 42021-5936",..: 382 227 124 63 61 314 329 483 392 460 ...
## $ Avatar : Factor w/ 138 levels "AliceBlue","AntiqueWhite",..: 133 26 7 115 81 44 35 3 116 12 ...
## $ Avg..Session.Length : num 34.5 31.9 33 34.3 33.3 ...
## $ Time.on.App : num 12.7 11.1 11.3 13.7 12.8 ...
## $ Time.on.Website : num 39.6 37.3 37.1 36.7 37.5 ...
## $ Length.of.Membership: num 4.08 2.66 4.1 3.12 4.45 ...
## $ Yearly.Amount.Spent : num 588 392 488 582 599 ...
Review the first 6 rows
head(my_data)
## Email
## 1 mstephenson@fernandez.com
## 2 hduke@hotmail.com
## 3 pallen@yahoo.com
## 4 riverarebecca@gmail.com
## 5 mstephens@davidson-herman.com
## 6 alvareznancy@lucas.biz
## Address
## 1 835 Frank Tunnel\nWrightmouth, MI 82180-9605
## 2 4547 Archer Common\nDiazchester, CA 06566-8576
## 3 24645 Valerie Unions Suite 582\nCobbborough, DC 99414-7564
## 4 1414 David Throughway\nPort Jason, OH 22070-1220
## 5 14023 Rodriguez Passage\nPort Jacobville, PR 37242-1057
## 6 645 Martha Park Apt. 611\nJeffreychester, MN 67218-7250
## Avatar Avg..Session.Length Time.on.App Time.on.Website
## 1 Violet 34.49727 12.65565 39.57767
## 2 DarkGreen 31.92627 11.10946 37.26896
## 3 Bisque 33.00091 11.33028 37.11060
## 4 SaddleBrown 34.30556 13.71751 36.72128
## 5 MediumAquaMarine 33.33067 12.79519 37.53665
## 6 FloralWhite 33.87104 12.02693 34.47688
## Length.of.Membership Yearly.Amount.Spent
## 1 4.082621 587.9511
## 2 2.664034 392.2049
## 3 4.104543 487.5475
## 4 3.120179 581.8523
## 5 4.446308 599.4061
## 6 5.493507 637.1024
Keep only the numeric variables. Based on the scatterplot matrix, Length of Membership appears to be the feature most strongly correlated with Yearly Amount Spent.
NumCol <- sapply(my_data,is.numeric)
Data <- my_data[,NumCol]
library(psych)
## Warning: package 'psych' was built under R version 3.5.3
pairs.panels(Data,
method = "pearson", # correlation method
hist.col = "#66A61E",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
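The visual impression from the scatterplot matrix can also be checked numerically. The sketch below ranks predictors by the absolute Pearson correlation with the response. It runs on a synthetic data frame named `demo` (a hypothetical stand-in with made-up values that only borrows the column names), so the printed numbers are illustrative and are not results for the real dataset.

```r
# Synthetic stand-in for the numeric columns (hypothetical values, NOT the real data)
set.seed(1)
n <- 200
demo <- data.frame(
  Avg..Session.Length  = rnorm(n, 33, 1),
  Time.on.App          = rnorm(n, 12, 1),
  Time.on.Website      = rnorm(n, 37, 1),
  Length.of.Membership = rnorm(n, 3.5, 1)
)
# Made-up linear relationship plus noise, chosen only for illustration
demo$Yearly.Amount.Spent <- -1050 + 26 * demo$Avg..Session.Length +
  39 * demo$Time.on.App + 61 * demo$Length.of.Membership + rnorm(n, 0, 10)

# Pearson correlation of every column with the response
cors <- cor(demo)[, "Yearly.Amount.Spent"]
# Drop the response's correlation with itself
cors <- cors[names(cors) != "Yearly.Amount.Spent"]
# Rank predictors by absolute correlation, strongest first
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 3))
```

With the real data, the same three lines starting at `cor(...)` could be applied to `Data` in place of `demo`.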
A training set is a set of data used to discover potentially predictive relationships. A test set is a set of data used to assess the strength and utility of a predictive relationship. We train our model on 70% of the data and then test the model performance on 30% of the data that is withheld.
library(caTools)
## Warning: package 'caTools' was built under R version 3.5.3
set.seed(101)
sample <- sample.split(Data$Yearly.Amount.Spent, SplitRatio = 0.7)
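# Filter with `==`; writing `sample=TRUE` passes a named argument and keeps every row
train <- subset(Data, sample == TRUE)
test <- subset(Data, sample == FALSE)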
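It is worth confirming that a split really produces the intended 70/30 row counts. The sketch below uses base R only and a hypothetical data frame `demo`; the names `in_train`, `train_demo`, `test_demo`, and `not_split` are illustrative. It also shows why the filter needs `==`: with a single `=`, `subset()` receives a named argument rather than a row condition and returns all rows.

```r
set.seed(101)
# Hypothetical data, used only to demonstrate the mechanics of the split
demo <- data.frame(x = rnorm(500), y = rnorm(500))

# Logical mask that is TRUE for 70% of the rows
# (sample.split() from caTools likewise returns a logical vector)
n <- nrow(demo)
in_train <- rep(FALSE, n)
in_train[sample(n, size = round(0.7 * n))] <- TRUE

train_demo <- subset(demo, in_train == TRUE)   # equivalent to demo[in_train, ]
test_demo  <- subset(demo, in_train == FALSE)  # equivalent to demo[!in_train, ]

# Pitfall: a single `=` is a named argument, not a filter, so nothing is dropped
not_split <- subset(demo, in_train = TRUE)

print(c(train = nrow(train_demo), test = nrow(test_demo), not_split = nrow(not_split)))
```

A quick `nrow()` check like this after splitting catches the problem before any model is fit.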
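According to the model, Time on Website is not statistically significant (p = 0.326), while the other three predictors are highly significant. The model explains about 98% of the variability in Yearly Amount Spent (R-squared = 0.9843). Note that the summary reports 495 residual degrees of freedom, which corresponds to a fit on all 500 observations; a fit restricted to the 70% training subset would have fewer.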
model <- lm(Yearly.Amount.Spent~., data=train)
summary(model)
##
## Call:
## lm(formula = Yearly.Amount.Spent ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.4059 -6.2191 -0.1364 6.6048 30.3085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1051.5943 22.9925 -45.736 <2e-16 ***
## Avg..Session.Length 25.7343 0.4510 57.057 <2e-16 ***
## Time.on.App 38.7092 0.4510 85.828 <2e-16 ***
## Time.on.Website 0.4367 0.4441 0.983 0.326
## Length.of.Membership 61.5773 0.4483 137.346 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.973 on 495 degrees of freedom
## Multiple R-squared: 0.9843, Adjusted R-squared: 0.9842
## F-statistic: 7766 on 4 and 495 DF, p-value: < 2.2e-16
An important part of assessing regression models is visualizing residuals. The diagnostic plots show residuals in four different ways:
Residuals vs Fitted: used to check the linearity assumption. A horizontal line without distinct patterns indicates a linear relationship, which is good.
Normal Q-Q: used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line.
Scale-Location (or Spread-Location): used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
Residuals vs Leverage: used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot will be described further in the next sections.
confint(model)
## 2.5 % 97.5 %
## (Intercept) -1096.7692602 -1006.41925
## Avg..Session.Length 24.8481081 26.62043
## Time.on.App 37.8230294 39.59528
## Time.on.Website -0.4358024 1.30928
## Length.of.Membership 60.6964468 62.45820
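To see where these intervals come from, the sketch below rebuilds the 95% interval for Time.on.Website by hand as estimate ± t-quantile × standard error, using the rounded values printed in the model summary (estimate 0.4367, standard error 0.4441, 495 residual degrees of freedom). Because the inputs are rounded, the result agrees with the `confint()` output only to about three decimal places. The variable names are illustrative.

```r
# Values copied (rounded) from the summary(model) output for Time.on.Website
estimate  <- 0.4367
std_error <- 0.4441
resid_df  <- 495

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = resid_df)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(round(ci, 3))
```

The interval contains zero, which is consistent with the large p-value reported for this coefficient.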
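plot(model, which = 1, col = "blue")  # Residuals vs Fitted
plot(model, which = 2, col = "red")   # Normal Q-Q
plot(model, which = 3)                # Scale-Location
plot(model, which = 4)                # Cook's distance (which = 5 gives Residuals vs Leverage)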
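Influential observations can also be listed numerically rather than read off a plot. The sketch below computes Cook's distance with `cooks.distance()` and flags points above 4/n, a common rule of thumb rather than a strict cutoff. It uses a hypothetical data frame `demo` with one deliberately extreme point injected, so it illustrates the mechanics only and says nothing about the customer data.

```r
set.seed(42)
n <- 100
# Hypothetical data with a simple linear relationship
demo <- data.frame(x = rnorm(n))
demo$y <- 2 + 3 * demo$x + rnorm(n)

# Inject one extreme, high-leverage point so there is something to flag
demo$x[1] <- 6
demo$y[1] <- -20

fit <- lm(y ~ x, data = demo)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Rule-of-thumb threshold (an assumption, other cutoffs are also used)
threshold <- 4 / n
flagged <- which(cd > threshold)
print(flagged)
```

For the model in this report, `cooks.distance(model)` could be used the same way.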
## Evaluating the Model
# residual plot
library(ggplot2)
res <- residuals(model)
res <- as.data.frame(res)
ggplot(res, aes(res)) + geom_histogram(fill = 'blue', alpha = 0.5)
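Beyond inspecting residuals, a model can be scored on the withheld 30% of the data. The sketch below shows one way to do that with `predict()`, reporting RMSE and out-of-sample R-squared. It runs on a synthetic data frame `demo` (hypothetical values that only borrow two column names), and `train_demo`, `test_demo`, `rmse`, and `r2` are illustrative names, so the printed metrics are not results for the real dataset.

```r
set.seed(101)
n <- 500
# Synthetic stand-in data (hypothetical values, NOT the real dataset)
demo <- data.frame(
  Time.on.App          = rnorm(n, 12, 1),
  Length.of.Membership = rnorm(n, 3.5, 1)
)
demo$Yearly.Amount.Spent <- 39 * demo$Time.on.App +
  61 * demo$Length.of.Membership + rnorm(n, 0, 10)

# 70/30 split by row index
idx <- sample(n, size = round(0.7 * n))
train_demo <- demo[idx, ]
test_demo  <- demo[-idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(Yearly.Amount.Spent ~ ., data = train_demo)
pred <- predict(fit, newdata = test_demo)

# Root mean squared error and R-squared on the test set
actual <- test_demo$Yearly.Amount.Spent
rmse <- sqrt(mean((actual - pred)^2))
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
print(c(rmse = rmse, r2 = r2))
```

A test-set RMSE close to the residual standard error from the training fit suggests the model is not overfitting.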