Channel Success: Leveraging Machine Learning in Python to Predict the Impact of Mobile App vs. Website on Ecommerce Sales

Independent Data Analysis Project

Published

November 3, 2024

Modified

November 3, 2024

Executive Summary

This study explores the relationship between customer interaction channels and purchasing behavior for a New York City-based Ecommerce company specializing in clothing sales and personal styling services. Using linear regression on customer data, we sought to determine whether the company’s mobile app or website more effectively drives customer purchases. Our model achieved a Root Mean Square Error (RMSE) of about 8.9 USD on the test set, roughly 1.8% of the average yearly spend, which indicates reliable predictive capacity with some room for further refinement. Findings suggest that customer engagement data can effectively guide strategic channel prioritization. These insights provide the company with a data-driven foundation to enhance customer experience and maximize revenue through targeted digital optimization efforts.

Keywords

Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn

Background

In the age of digital commerce, the retail industry faces evolving challenges and opportunities as customer interactions span both physical and digital spaces. Ecommerce companies must balance these multi-channel experiences, optimizing for mobile, desktop, and in-store interactions to meet shifting consumer expectations. As consumers increasingly shop online, many retailers find it critical to understand which channels—mobile applications or websites—better align with customer preferences and drive revenue.

This project was conducted in collaboration with an Ecommerce company based in New York City, specializing in both online clothing sales and in-store personal styling services. The company provides a hybrid shopping experience: customers have the option to visit in-store for style consultations and later complete their purchases via a mobile app or website. To enhance their business strategy, the company seeks to identify whether the mobile app or the website better serves customer needs and contributes more effectively to sales.

The core of this analysis focuses on building a predictive model using linear regression to understand the relationship between customer interaction channels and purchasing behavior. By leveraging a dataset containing customer session information, purchasing patterns, and channel preferences, this study aims to provide actionable insights that will help the company refine its digital strategy. The results of this project will guide the company’s decision on whether to prioritize development resources towards their mobile app or website, thereby improving customer experience and maximizing return on investment (James et al. 2013).

Data

We’ll work with the Ecommerce Customers csv file from the company. It has customer information, such as Email, Address, and the color of their Avatar. It also has the following numerical columns:

  • Avg. Session Length: Average length of in-store style advice sessions.
  • Time on App: Average time spent on the App, in minutes.
  • Time on Website: Average time spent on the Website, in minutes.
  • Length of Membership: How many years the customer has been a member.

We start by importing the relevant packages for the analysis.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

We then read in the Ecommerce Customers csv file as a DataFrame called customers.

customers = pd.read_csv("Ecommerce Customers")
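
A quick check right after loading can catch a wrong file or renamed columns before any modelling starts. The sketch below is illustrative only: the helper missing_columns is hypothetical, and the two-row in-memory CSV is a made-up stand-in for the real file, so the block runs without the company data.

```python
import io
import pandas as pd

# Column names the later analysis relies on (taken from the dataset description).
EXPECTED_NUMERIC = [
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
    "Yearly Amount Spent",
]

def missing_columns(df, expected=EXPECTED_NUMERIC):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Stand-in for the real file: two made-up rows with the same column layout.
sample_csv = io.StringIO(
    "Email,Avatar,Avg. Session Length,Time on App,Time on Website,"
    "Length of Membership,Yearly Amount Spent\n"
    "a@example.com,Violet,34.5,12.7,39.6,4.1,588.0\n"
    "b@example.com,Bisque,31.9,11.1,37.3,2.7,392.2\n"
)
sample = pd.read_csv(sample_csv)

# An empty list means every expected column is present.
print(missing_columns(sample))
```

On the real data, the same helper would be called as missing_columns(customers).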

Exploratory Data Analysis

Summary Statistics

We start by inspecting the first few rows of the data.

customers.head()
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
0 mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621 587.951054
1 hduke@hotmail.com 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034 392.204933
2 pallen@yahoo.com 24645 Valerie Unions Suite 582\nCobbborough, D... Bisque 33.000915 11.330278 37.110597 4.104543 487.547505
3 riverarebecca@gmail.com 1414 David Throughway\nPort Jason, OH 22070-1220 SaddleBrown 34.305557 13.717514 36.721283 3.120179 581.852344
4 mstephens@davidson-herman.com 14023 Rodriguez Passage\nPort Jacobville, PR 3... MediumAquaMarine 33.330673 12.795189 37.536653 4.446308 599.406092

The info() method gives us a description of the data.

customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

The data has 500 observations and 8 variables. Five of the variables are numeric, while the other three (Email, Address, and Avatar) are strings. We use the describe() method to get summary statistics for the data, starting with the numeric variables.

customers.describe()
       Avg. Session Length  Time on App  Time on Website  Length of Membership  Yearly Amount Spent
count           500.000000   500.000000       500.000000            500.000000           500.000000
mean             33.053194    12.052488        37.060445              3.533462           499.314038
std               0.992563     0.994216         1.010489              0.999278            79.314782
min              29.532429     8.508152        33.913847              0.269901           256.670582
25%              32.341822    11.388153        36.349257              2.930450           445.038277
50%              33.082008    11.983231        37.069367              3.533975           498.887875
75%              33.711985    12.753850        37.716432              4.126502           549.313828
max              36.139662    15.126994        40.005182              6.922689           765.518462

Next is a summary of the character variables.

customers.describe(include = "object")
Email Address Avatar
count 500 500 500
unique 500 500 138
top mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 SlateBlue
freq 1 1 7

Data Visualization

To begin, we use seaborn to create a jointplot comparing the Time on Website and Yearly Amount Spent columns.

import seaborn as sns
sns.jointplot(x = "Time on Website", y = "Yearly Amount Spent", data = customers, color = "gray")

While the two variables are reasonably normally distributed, the correlation between them does not appear to be strong.

We repeat the plot with the Time on App column instead.

import seaborn as sns
sns.jointplot(x = "Time on App", y = "Yearly Amount Spent", data = customers, color = "gray")

The plot shows a noticeably stronger positive relationship between Time on App and the amount spent.

We then use jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership.

import seaborn as sns
sns.jointplot(x = "Time on App", y = "Length of Membership", data = customers, color = "gray", kind = "hex")

Although there is no strong relationship between these two variables, the observations are heavily concentrated in the middle of the chart. Length of Membership also appears to be approximately normally distributed.

We examine the pairwise relationships between all numeric variables in the data. Here, we use the seaborn pairplot() function.

import seaborn as sns
sns.pairplot(customers)

It appears that Length of Membership has the strongest linear relationship with Yearly Amount Spent. We drill into this using the seaborn lmplot() function.

import seaborn as sns
sns.lmplot(x = "Length of Membership", y = "Yearly Amount Spent", data = customers)
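
The visual impression from the plots can be backed up with numbers by ranking each feature's Pearson correlation with the target. The sketch below uses synthetic data with made-up weights (not the company data and not the fitted model) so that it runs on its own; the frame name demo and the weights are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the customer data (illustrative values only).
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})

# Made-up relationship: spend depends on app time and membership, not website time.
demo["Yearly Amount Spent"] = (
    40 * demo["Time on App"]
    + 60 * demo["Length of Membership"]
    + rng.normal(0, 10, n)
)

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    demo.corr()["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real DataFrame, which also contains string columns, the equivalent call would be customers.corr(numeric_only=True), available in recent pandas versions.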

Machine Learning: Linear Regression

Training and Testing Data

Now that we’ve explored the data a bit, let’s go ahead and split the data into training and testing sets. We set a variable X equal to the numerical features of the customers and a variable y equal to the “Yearly Amount Spent” column.

X = customers[['Avg. Session Length', 'Time on App',
       'Time on Website', 'Length of Membership']]

y = customers['Yearly Amount Spent']

Next, we split the data into a training set and a testing set. The training set is used to fit the model, while the evaluation is done on the testing set, which the model has not seen.

ERROR ALERT!!

You may come across an error like “ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject”. This typically means that an installed package was compiled against a different major version of NumPy than the one currently installed. One solution is to install a NumPy version below 2:

pip install "numpy<2"

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
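
To make the effect of test_size and random_state concrete, the sketch below applies the same call to placeholder arrays with 500 rows, the size of this dataset. The arrays X_demo and y_demo are stand-ins rather than the customer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and features as the dataset.
X_demo = np.arange(500 * 4).reshape(500, 4)
y_demo = np.arange(500)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(np.array_equal(y_te, y_te2))
```

With 500 rows, this leaves 350 observations for training and 150 for testing.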

We then fit the linear regression model. We first create an instance of LinearRegression from sklearn.linear_model and then fit it to the training data.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression()

We look at the intercept and coefficients of the model.

lm.intercept_
-1047.9327822502391
import pandas as pd
values = lm.coef_
values = list(values)

params = list(X_train.columns)


pd.DataFrame(values, params)
                              0
Avg. Session Length   25.981550
Time on App           38.590159
Time on Website        0.190405
Length of Membership  61.279097
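
Each coefficient is the change in the predicted target for a one-unit increase in that feature, holding the others fixed, so a prediction is simply the intercept plus the weighted sum of the features. The sketch below checks this on synthetic data; the weights in true_coef are made up for illustration and are not the fitted values above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic features and target (made-up weights, for illustration only).
X_demo = rng.normal(size=(200, 4))
true_coef = np.array([26.0, 39.0, 0.2, 61.0])
y_demo = -1000.0 + X_demo @ true_coef + rng.normal(scale=5.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# A linear regression prediction is intercept + sum(coefficient * feature).
manual = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual, model.predict(X_demo)))
```

The same identity holds for the fitted model in this analysis: lm.intercept_ + X_test @ lm.coef_ should match lm.predict(X_test).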

Predicting on Test Data

Now that we have fit our model, let’s evaluate its performance by predicting off the test values!

We use lm.predict() to predict off the X_test set of the data.

predictions = lm.predict(X_test)

We then create a scatterplot of the actual test values versus the predicted values.

sns.scatterplot(x = y_test, y = predictions, color = "gray")
plt.title("Actuals versus Predicted Values")
Text(0.5, 1.0, 'Actuals versus Predicted Values')

Evaluating the Model

Let’s evaluate our model’s performance on the test set by calculating the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE).

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

The mean absolute error (MAE):

mean_absolute_error(y_test, predictions)
7.228148653430826

The mean squared error:

mean_squared_error(y_test, predictions)
79.81305165097427

The root mean squared error (RMSE):

mean_squared_error(y_test, predictions) ** 0.5
8.933815066978624
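
For readers who want to see what these metrics measure, the sketch below computes them by hand with NumPy, together with R^2 (the coefficient of determination), a complementary measure of fit. The helper regression_metrics and the four example values are hypothetical and are not taken from the data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))   # average absolute error
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # same units as the target

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Made-up actual and predicted values, each prediction off by exactly 10.
metrics = regression_metrics(
    [500.0, 450.0, 600.0, 550.0],
    [510.0, 440.0, 590.0, 560.0],
)
print(metrics)
```

In this toy example every prediction misses by 10, so MAE and RMSE are both 10 and MSE is 100. On the real test set the helper would be called as regression_metrics(y_test, predictions).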

We compare this RMSE with the average yearly spend across the full dataset.

avg = customers["Yearly Amount Spent"].mean()

mean_squared_error(y_test, predictions) ** 0.5 / avg * 100
1.7892176831511146

Hence the RMSE (about 8.9 USD) is roughly 1.8% of the average yearly spend, which is a small error relative to the size of the quantity being predicted.
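
As an additional point of reference, the RMSE can be set against the spread of the target itself. The sketch below reuses only figures already reported in this analysis (the test-set RMSE, and the mean and standard deviation from the describe() output); note that the mean and standard deviation come from the full dataset rather than the test set, so the comparison is indicative rather than exact.

```python
# Figures reported earlier in this analysis.
rmse = 8.933815066978624    # test-set RMSE
mean_spend = 499.314038     # mean of Yearly Amount Spent (describe() output)
std_spend = 79.314782       # std of Yearly Amount Spent (describe() output)

# RMSE expressed as a percentage of the average yearly spend.
rmse_pct_of_mean = rmse / mean_spend * 100

# RMSE relative to the standard deviation of spend. A naive model that always
# predicts the mean would have an RMSE close to the standard deviation.
rmse_vs_std = rmse / std_spend

print(f"RMSE as % of mean spend: {rmse_pct_of_mean:.2f}%")
print(f"RMSE relative to std of spend: {rmse_vs_std:.2f}")
```

A ratio well below 1 in the second line indicates that the model's typical error is much smaller than the natural variation in customer spending.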

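Overall, we advise management to concentrate on the App, which has by far the larger impact on sales of the two digital channels. A one-minute increase in a customer’s average time on the App corresponds to an increase of about 38.59 USD in yearly amount spent, compared with about 0.19 USD for the Website, all else remaining the same. Improving App functionality and experience is therefore likely to yield a higher return than an equivalent investment in the Website. Length of Membership has the largest coefficient overall (about 61.28 USD per additional year), which suggests that customer retention also matters.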

Residuals

The model fits the data well. Let’s quickly explore the residuals to check that nothing is obviously wrong with the fit.

We plot a histogram of the residuals and make sure it looks normally distributed. We have the option to use either seaborn displot, or just plt.hist().

sns.displot(x = (y_test - predictions), color = "gray")
plt.title("Distribution of Residuals")
Text(0.5, 1.0, 'Distribution of Residuals')

The residuals appear fairly normally distributed, although there are more sophisticated ways to visualise and test this.
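
One simple numerical complement to the histogram is to compute the skewness and excess kurtosis of the residuals, both of which are close to zero for normally distributed data. The sketch below uses NumPy only and runs on simulated residuals; the helper skew_and_excess_kurtosis is hypothetical, and the simulated values are not the model's actual residuals.

```python
import numpy as np

def skew_and_excess_kurtosis(residuals):
    """Return (skewness, excess kurtosis) using population moments."""
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    std = centered.std()  # population standard deviation (ddof=0)
    skew = np.mean(centered ** 3) / std ** 3
    excess_kurt = np.mean(centered ** 4) / std ** 4 - 3.0
    return skew, excess_kurt

# Simulated residuals standing in for (y_test - predictions).
rng = np.random.default_rng(42)
demo_residuals = rng.normal(loc=0.0, scale=9.0, size=150)

skew, kurt = skew_and_excess_kurtosis(demo_residuals)
print(f"skewness: {skew:.3f}, excess kurtosis: {kurt:.3f}")
```

On the real model, the input would be y_test - predictions. A Q-Q plot or a formal test such as Shapiro-Wilk (for example scipy.stats.shapiro) are other commonly used options.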

Conclusion

Our analysis provides valuable insights into the impact of customer interaction channels on sales for the Ecommerce company. Using linear regression, we achieved a model with a Root Mean Square Error (RMSE) of about 8.9 USD, roughly 1.8% of the average yearly spend, indicating a reasonably accurate prediction of customer spending from engagement data. While the model has some margin for improvement, it offers a solid foundation for guiding the company’s strategic focus between the mobile app and website, with the results favoring the mobile app.

The results of this regression highlight the predictive power of customer engagement data in determining purchasing behavior and underscore the importance of optimizing the digital experience. Moving forward, the company can use these insights to prioritize development efforts toward the channel that maximizes customer satisfaction and revenue potential (Muddana and Vinayakam 2024).

References

James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.