E-commerce Data Analysis

Should we focus our efforts on the mobile app or on the website?

Our goal is to build a linear regression model from customer data to help decision makers at an e-commerce business. The data come from the Udemy course "Python for Data Science and Machine Learning Bootcamp".

Dataset

The dataset contains 500 customers and 8 variables: Email, Address, Avatar, Avg. Session Length, Time on App, Time on Website, Length of Membership, and Yearly Amount Spent.

str(my_data)

## 'data.frame':    500 obs. of  8 variables:
##  $ Email               : Factor w/ 500 levels "aaron04@yahoo.com",..: 343 191 356 392 342 18 260 44 461 57 ...
##  $ Address             : Factor w/ 500 levels "0001 Mack Mill\nNorth Jennifer, NE 42021-5936",..: 382 227 124 63 61 314 329 483 392 460 ...
##  $ Avatar              : Factor w/ 138 levels "AliceBlue","AntiqueWhite",..: 133 26 7 115 81 44 35 3 116 12 ...
##  $ Avg..Session.Length : num  34.5 31.9 33 34.3 33.3 ...
##  $ Time.on.App         : num  12.7 11.1 11.3 13.7 12.8 ...
##  $ Time.on.Website     : num  39.6 37.3 37.1 36.7 37.5 ...
##  $ Length.of.Membership: num  4.08 2.66 4.1 3.12 4.45 ...
##  $ Yearly.Amount.Spent : num  588 392 488 582 599 ...

Review the first 6 rows

head(my_data)

##                           Email
## 1     mstephenson@fernandez.com
## 2             hduke@hotmail.com
## 3              pallen@yahoo.com
## 4       riverarebecca@gmail.com
## 5 mstephens@davidson-herman.com
## 6        alvareznancy@lucas.biz
##                                                      Address
## 1               835 Frank Tunnel\nWrightmouth, MI 82180-9605
## 2             4547 Archer Common\nDiazchester, CA 06566-8576
## 3 24645 Valerie Unions Suite 582\nCobbborough, DC 99414-7564
## 4           1414 David Throughway\nPort Jason, OH 22070-1220
## 5    14023 Rodriguez Passage\nPort Jacobville, PR 37242-1057
## 6    645 Martha Park Apt. 611\nJeffreychester, MN 67218-7250
##             Avatar Avg..Session.Length Time.on.App Time.on.Website
## 1           Violet            34.49727    12.65565        39.57767
## 2        DarkGreen            31.92627    11.10946        37.26896
## 3           Bisque            33.00091    11.33028        37.11060
## 4      SaddleBrown            34.30556    13.71751        36.72128
## 5 MediumAquaMarine            33.33067    12.79519        37.53665
## 6      FloralWhite            33.87104    12.02693        34.47688
##   Length.of.Membership Yearly.Amount.Spent
## 1             4.082621            587.9511
## 2             2.664034            392.2049
## 3             4.104543            487.5475
## 4             3.120179            581.8523
## 5             4.446308            599.4061
## 6             5.493507            637.1024

Exploring the dataset

We keep only the numeric variables. Based on the plot, Length of Membership is the feature most strongly correlated with Yearly Amount Spent.

NumCol <- sapply(my_data,is.numeric)
Data <- my_data[,NumCol]
library(psych)

## Warning: package 'psych' was built under R version 3.5.3

pairs.panels(Data, 
             method = "pearson", # correlation method
             hist.col = "#66A61E",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)
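The visual impression from the pairs plot can also be checked numerically. The following is a minimal sketch, assuming a synthetic stand-in for `Data` (the values and the coefficients used to generate them are made up for illustration; only the column names come from the dataset above). It ranks the predictors by the absolute Pearson correlation with `Yearly.Amount.Spent`.

```r
# Synthetic stand-in for the numeric columns (values are made up for illustration).
set.seed(1)
n <- 200
Data <- data.frame(
  Avg..Session.Length  = rnorm(n, mean = 33, sd = 1),
  Time.on.App          = rnorm(n, mean = 12, sd = 1),
  Time.on.Website      = rnorm(n, mean = 37, sd = 1),
  Length.of.Membership = rnorm(n, mean = 3.5, sd = 1)
)
Data$Yearly.Amount.Spent <- -1000 + 25 * Data$Avg..Session.Length +
  38 * Data$Time.on.App + 60 * Data$Length.of.Membership + rnorm(n, sd = 10)

# Pearson correlation of every predictor with the response, strongest first.
cors <- cor(Data)[, "Yearly.Amount.Spent"]
cors <- cors[names(cors) != "Yearly.Amount.Spent"]
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 3))
```

Running the same three lines on the real `Data` frame should reproduce the ordering seen in the top row of the pairs plot.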

Training data and test data

A training set is a set of data used to discover potentially predictive relationships. A test set is a set of data used to assess the strength and utility of a predictive relationship. We train our model on 70% of the data and then test the model performance on 30% of the data that is withheld.

library(caTools)

## Warning: package 'caTools' was built under R version 3.5.3

set.seed(101)
sample <- sample.split(Data$Yearly.Amount.Spent, SplitRatio = 0.7)
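# Use `==`, not `=`: subset(Data, sample=TRUE) silently keeps all 500 rows,
# which is why the summary below reports 495 residual degrees of freedom.
train <- subset(Data, sample == TRUE)
test <- subset(Data, sample == FALSE)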
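As a cross-check on the split, the same 70/30 partition can be sketched in base R without `caTools`. This is an alternative technique, not the one used in this analysis, and the data frame here is a hypothetical placeholder with a made-up `id` column so that overlap between the two sets can be verified.

```r
set.seed(101)
n <- 500
# Placeholder data: `id` and the simulated spend values are illustrative only.
Data <- data.frame(id = seq_len(n), Yearly.Amount.Spent = rnorm(n, 500, 80))

# Draw 70% of the row indices at random for training.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- Data[train_idx, ]
test  <- Data[-train_idx, ]

# Sanity checks: sizes add up and no row appears in both sets.
print(c(train = nrow(train), test = nrow(test)))
print(length(intersect(train$id, test$id)))
```

Checking `nrow(train)` right after splitting is a cheap way to catch a split that silently returned every row.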

Create a linear regression model

According to the model summary, Time on Website is not statistically significant (p = 0.326), whereas the other three predictors are highly significant (p < 2e-16). The model explains about 98% of the variance in Yearly Amount Spent (R-squared = 0.9843).

model <- lm(Yearly.Amount.Spent~., data=train)
summary(model)

## 
## Call:
## lm(formula = Yearly.Amount.Spent ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.4059  -6.2191  -0.1364   6.6048  30.3085 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1051.5943    22.9925 -45.736   <2e-16 ***
## Avg..Session.Length     25.7343     0.4510  57.057   <2e-16 ***
## Time.on.App             38.7092     0.4510  85.828   <2e-16 ***
## Time.on.Website          0.4367     0.4441   0.983    0.326    
## Length.of.Membership    61.5773     0.4483 137.346   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.973 on 495 degrees of freedom
## Multiple R-squared:  0.9843, Adjusted R-squared:  0.9842 
## F-statistic:  7766 on 4 and 495 DF,  p-value: < 2.2e-16
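To make the coefficient table concrete, here is a small worked example that applies the estimates above to the first customer shown by `head(my_data)`. All numbers are copied from the outputs in this document; the variable names `coefs`, `x`, `predicted`, and `actual` are introduced only for this illustration.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(
  "(Intercept)"          = -1051.5943,
  "Avg..Session.Length"  = 25.7343,
  "Time.on.App"          = 38.7092,
  "Time.on.Website"      = 0.4367,
  "Length.of.Membership" = 61.5773
)

# First customer from head(my_data); the leading 1 multiplies the intercept.
x <- c(1, 34.49727, 12.65565, 39.57767, 4.082621)

# A linear model's prediction is the sum of coefficient * feature value.
predicted <- sum(coefs * x)
actual <- 587.9511
print(c(predicted = predicted, residual = actual - predicted))
```

The prediction comes out near 594.7 against an observed 587.95, a residual of roughly -6.8, which sits comfortably inside the residual range reported in the summary. The same arithmetic shows why the app matters more than the website here: one extra unit of Time on App moves the prediction by about 38.7, while one extra unit of Time on Website moves it by about 0.44.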

Regression diagnostics

An important part of assessing a regression model is visualizing its residuals. The diagnostic plots show the residuals in four different ways:

Residuals vs Fitted: used to check the linearity assumption. A horizontal line without distinct patterns indicates a linear relationship, which is good.

Normal Q-Q: used to examine whether the residuals are normally distributed. It’s good if the residual points follow the straight dashed line.

Scale-Location (or Spread-Location): used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.

Residuals vs Leverage: used to identify influential cases, that is, extreme values that might influence the regression results when included or excluded from the analysis. This plot will be described further in the next sections.

confint(model)

##                              2.5 %      97.5 %
## (Intercept)          -1096.7692602 -1006.41925
## Avg..Session.Length     24.8481081    26.62043
## Time.on.App             37.8230294    39.59528
## Time.on.Website         -0.4358024     1.30928
## Length.of.Membership    60.6964468    62.45820

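plot(model, which = 1, col = "blue")  # Residuals vs Fitted

plot(model, which = 2, col = "red")   # Normal Q-Q

plot(model, which = 3)                # Scale-Location

plot(model, which = 4)                # Cook's distance (which = 5 gives Residuals vs Leverage)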
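The influence plot can be complemented with a numeric check. The sketch below uses synthetic data with one deliberately extreme point (everything in it is made up for illustration) and flags observations whose Cook's distance exceeds 4/n. That cutoff is a common rule of thumb, assumed here for convenience, and not a formal test.

```r
# Synthetic data with one deliberately extreme observation (illustration only).
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
x[n] <- 6      # far from the other x values (high leverage)
y[n] <- -10    # far from the fitted trend (large residual)
fit <- lm(y ~ x)

# Cook's distance per observation; 4/n is a rule-of-thumb cutoff.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
print(flagged)
```

Applied to the real model, `cooks.distance(model)` returns one value per training row, so flagged row numbers can be looked up directly in `train`.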

Evaluating the model

# Residual histogram
library(ggplot2)
res <- data.frame(res = residuals(model))
ggplot(res, aes(x = res)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.5)
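A residual histogram describes the fit on the training rows only. To test performance on withheld data, as described in the training and test section, one option is to compute the root mean squared error (RMSE) on the held-out 30%. The following is a minimal sketch on simulated data: the data frame `sim`, its generating coefficients, and the base-R split are all assumptions made for illustration, not part of the original analysis.

```r
set.seed(101)
n <- 500
# Simulated stand-in data; coefficients and noise level are made up.
sim <- data.frame(
  Time.on.App          = rnorm(n, 12, 1),
  Length.of.Membership = rnorm(n, 3.5, 1)
)
sim$Yearly.Amount.Spent <- 40 * sim$Time.on.App +
  60 * sim$Length.of.Membership + rnorm(n, sd = 10)

# Fit on a random 70% of the rows.
idx <- sample(seq_len(n), size = round(0.7 * n))
fit <- lm(Yearly.Amount.Spent ~ ., data = sim[idx, ])

# Predict on the rows the model never saw and summarise the errors.
held_out <- sim[-idx, ]
pred <- predict(fit, newdata = held_out)
rmse <- sqrt(mean((held_out$Yearly.Amount.Spent - pred)^2))
print(rmse)
```

On the real data, the equivalent call would be `predict(model, newdata = test)`, with the RMSE compared against the residual standard error from the model summary to see whether the model generalizes.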