Channel Success: Leveraging Machine Learning in `Python` to Predict the Impact of Mobile App vs. Website on Ecommerce Sales

Independent Data Analysis Project

Author

Affiliations

John Karuitha, PhD

Karatina University, Department of Business and Economics

University of the Witwatersrand, School of Construction Economics & Management

Published

November 3, 2024

Modified

November 3, 2024

Executive Summary

This study explores the relationship between customer interaction channels and purchasing behavior for a New York City-based Ecommerce company specializing in clothing sales and personal styling services. Using linear regression on customer data, we sought to determine whether the company’s mobile app or website more effectively drives customer purchases. On held-out test data, the model achieved a Root Mean Square Error (RMSE) of about 8.9 USD, roughly 1.8% of the average yearly spend, which indicates good predictive accuracy with some room for refinement. The fitted coefficients associate time on the mobile app with far higher spending than time on the website. These insights give the company a data-driven foundation for prioritizing its digital optimization efforts.

Keywords

Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn

Background

In the age of digital commerce, the retail industry faces evolving challenges and opportunities as customer interactions span both physical and digital spaces. Ecommerce companies must balance these multi-channel experiences, optimizing for mobile, desktop, and in-store interactions to meet shifting consumer expectations. As consumers increasingly shop online, many retailers find it critical to understand which channels—mobile applications or websites—better align with customer preferences and drive revenue.

This project was conducted in collaboration with an Ecommerce company based in New York City, specializing in both online clothing sales and in-store personal styling services. The company provides a hybrid shopping experience: customers have the option to visit in-store for style consultations and later complete their purchases via a mobile app or website. To enhance their business strategy, the company seeks to identify whether the mobile app or the website better serves customer needs and contributes more effectively to sales.

The core of this analysis focuses on building a predictive model using linear regression to understand the relationship between customer interaction channels and purchasing behavior. By leveraging a dataset containing customer session information, purchasing patterns, and channel preferences, this study aims to provide actionable insights that will help the company refine its digital strategy. The results of this project will guide the company’s decision on whether to prioritize development resources towards their mobile app or website, thereby improving customer experience and maximizing return on investment (James et al. 2013).

Data

We’ll work with the Ecommerce Customers csv file from the company. It contains customer information such as Email, Address, and the color of their Avatar. It also has the following numerical columns:

Avg. Session Length: Average length of in-store style advice sessions.
Time on App: Average time spent on the App, in minutes.
Time on Website: Average time spent on the Website, in minutes.
Length of Membership: How many years the customer has been a member.
Yearly Amount Spent: The amount the customer spends per year (the target variable).

We start by importing the relevant packages for the analysis.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

We then read in the Ecommerce Customers csv file as a DataFrame called customers.

customers = pd.read_csv("Ecommerce Customers")

Exploratory Data Analysis

Summary Statistics

We start by inspecting the first few rows of the data.

customers.head()

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621	587.951054
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034	392.204933
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543	487.547505
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179	581.852344
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308	599.406092

The info() method gives us a description of the data.

customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

The data has 500 observations and 8 variables. Five of the variables are numeric, while the other three (Email, Address, and Avatar) are strings. I use the describe() method to get summary statistics for the data, starting with the numeric variables.

customers.describe()

	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
count	500.000000	500.000000	500.000000	500.000000	500.000000
mean	33.053194	12.052488	37.060445	3.533462	499.314038
std	0.992563	0.994216	1.010489	0.999278	79.314782
min	29.532429	8.508152	33.913847	0.269901	256.670582
25%	32.341822	11.388153	36.349257	2.930450	445.038277
50%	33.082008	11.983231	37.069367	3.533975	498.887875
75%	33.711985	12.753850	37.716432	4.126502	549.313828
max	36.139662	15.126994	40.005182	6.922689	765.518462

Next is a summary of the character variables.

customers.describe(include = "object")

	Email	Address	Avatar
count	500	500	500
unique	500	500	138
top	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	SlateBlue
freq	1	1	7

Data Visualization

To begin, I use seaborn to create a jointplot comparing the Time on Website and Yearly Amount Spent columns.

sns.jointplot(x = "Time on Website", y = "Yearly Amount Spent", data = customers, color = "gray")

While both variables are reasonably normally distributed, the correlation between them does not appear to be strong.

I repeat the plot, this time with the Time on App column.

sns.jointplot(x = "Time on App", y = "Yearly Amount Spent", data = customers, color = "gray")

The plot shows a clearer positive relationship between Time on App and the amount spent.

I then use jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership.

sns.jointplot(x = "Time on App", y = "Length of Membership", data = customers, color = "gray", kind = "hex")

Although there is no strong relationship between these two variables, the observations are heavily concentrated in the middle of the chart. Length of Membership also appears approximately normally distributed.

We examine the pairwise relationships between all numeric variables in the data, using the seaborn pairplot() function.

sns.pairplot(customers)

It appears that Length of Membership has the strongest linear relationship with annual spend. We drill into this using the seaborn lmplot() function.

sns.lmplot(x = "Length of Membership", y = "Yearly Amount Spent", data = customers)
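The visual impressions from the plots above can also be checked numerically. The following is a minimal sketch, assuming Pearson correlation is an adequate summary; the helper name rank_correlations and the toy rows are hypothetical and are not the company's data.

```python
import pandas as pd

def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every numeric column with `target`, largest first."""
    numeric = df.select_dtypes(include="number")  # drops string columns such as Email
    corr = numeric.corr()[target].drop(target)    # Pearson correlation by default
    return corr.sort_values(ascending=False)

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "Email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
    "Time on App": [11.0, 12.0, 13.0, 12.5, 11.5],
    "Time on Website": [37.0, 38.0, 36.5, 37.5, 36.0],
    "Length of Membership": [2.0, 3.0, 5.0, 4.0, 2.5],
    "Yearly Amount Spent": [400.0, 480.0, 620.0, 550.0, 430.0],
})

ranking = rank_correlations(toy, "Yearly Amount Spent")
print(ranking)
```

Calling rank_correlations(customers, "Yearly Amount Spent") on the real data would give the corresponding ranking for the study's variables.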

Machine Learning: Linear Regression

Training and Testing Data

Now that we’ve explored the data a bit, let’s go ahead and split the data into training and testing sets. We set a variable X equal to the numerical features of the customers and a variable y equal to the “Yearly Amount Spent” column.

X = customers[['Avg. Session Length', 'Time on App',
       'Time on Website', 'Length of Membership']]

y = customers['Yearly Amount Spent']

Next, we split the data into training and testing sets. The model is trained on the training set and evaluated on the testing set.

ERROR ALERT!!

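You may come across an error like “ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject”. This typically means that an installed package was compiled against a different major version of NumPy than the one currently installed. One workaround is to install a NumPy version below 2 by running: pip install "numpy<2"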
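Before downgrading, it can help to confirm which NumPy version is actually installed. This is a small sketch using only the standard library; the helper name installed_major_version is hypothetical, and the messages are illustrative rather than a diagnosis.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_major_version(package: str):
    """Return the installed major version of `package` as an int, or None if absent."""
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None

numpy_major = installed_major_version("numpy")
if numpy_major is None:
    print("NumPy is not installed")
elif numpy_major >= 2:
    print("NumPy 2.x detected; packages built against NumPy 1.x may fail to import")
else:
    print("NumPy 1.x detected")
```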

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
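To make the effect of test_size and random_state concrete, here is a minimal sketch on placeholder arrays with the same shape as the study's data (500 rows, 4 features). The arrays X_demo and y_demo are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays, not the customer data
X_demo = np.arange(2000).reshape(500, 4)
y_demo = np.arange(500)

split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 350 training rows and 150 test rows

# Reusing the same random_state reproduces the same partition
same_partition = np.array_equal(split_a[1], split_b[1])
print(same_partition)
```

With 500 customers and test_size = 0.3, the model is therefore trained on 350 observations and evaluated on the remaining 150.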

We then fit the linear regression model. We first create a LinearRegression instance from sklearn.linear_model and then fit it to the training data.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

lm.fit(X_train, y_train)

LinearRegression()

We look at the intercept and coefficients of the model.

lm.intercept_

-1047.9327822502391

# Pair each fitted coefficient with its feature name
pd.DataFrame(lm.coef_, index = X_train.columns)

	0
Avg. Session Length	25.981550
Time on App	38.590159
Time on Website	0.190405
Length of Membership	61.279097
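The intercept and coefficients define the prediction equation: predicted spend equals the intercept plus the sum of each coefficient times its feature value. The sketch below applies it by hand, using the values displayed above and, as an illustrative input, the column means from describe(). The function name predict_spend is hypothetical.

```python
# Values copied from the fitted model output above (coefficients as displayed)
INTERCEPT = -1047.9327822502391
COEFS = {
    "Avg. Session Length": 25.981550,
    "Time on App": 38.590159,
    "Time on Website": 0.190405,
    "Length of Membership": 61.279097,
}

def predict_spend(features: dict) -> float:
    """Intercept plus the sum of coefficient * feature value."""
    return INTERCEPT + sum(COEFS[name] * features[name] for name in COEFS)

# Illustrative customer: the column means reported by describe()
average_customer = {
    "Avg. Session Length": 33.053194,
    "Time on App": 12.052488,
    "Time on Website": 37.060445,
    "Length of Membership": 3.533462,
}
baseline = predict_spend(average_customer)

# The same customer with one extra minute on the app
more_app = {**average_customer, "Time on App": average_customer["Time on App"] + 1}
app_effect = predict_spend(more_app) - baseline

print(round(baseline, 2), round(app_effect, 2))
```

The difference between the two predictions equals the Time on App coefficient, which is what "all else remaining the same" means in a linear model.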

Predicting on Test Data

Now that we have fit our model, let’s evaluate its performance by predicting on the test set.

We use lm.predict() to generate predictions for X_test.

predictions = lm.predict(X_test)

We then create a scatterplot of the actual test values versus the predicted values.

sns.scatterplot(x = y_test, y = predictions, color = "gray")
plt.title("Actuals versus Predicted Values")

Text(0.5, 1.0, 'Actuals versus Predicted Values')

Evaluating the Model

Let’s evaluate model performance on the test set using three error metrics: the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE).

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

The mean absolute error (MAE):

mean_absolute_error(y_test, predictions)

7.228148653430826

The mean squared error:

mean_squared_error(y_test, predictions)

79.81305165097427

The root mean squared error (RMSE):

mean_squared_error(y_test, predictions) ** 0.5

8.933815066978624
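For readers who want to see what these metrics compute, the following sketch implements the three formulas directly. The function name regression_errors and the toy numbers are hypothetical; they are not the study's test set.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)  # mean of absolute errors
    mse = sum(e * e for e in errors) / len(errors)   # mean of squared errors
    return mae, mse, math.sqrt(mse)                  # RMSE is the square root of MSE

# Hypothetical toy values for illustration only
mae, mse, rmse = regression_errors([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(mae, mse, rmse)
```

Because squaring penalizes large errors more heavily, RMSE is never smaller than MAE, which is consistent with the values reported above (8.93 versus 7.23).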

We compare this RMSE with the average yearly spending across the full dataset.

avg = customers["Yearly Amount Spent"].mean()

mean_squared_error(y_test, predictions) ** 0.5 / avg * 100

1.7892176831511146

Hence the RMSE (about 8.9 USD) is roughly 1.8% of the average yearly spend, a small error relative to typical spending.
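A scale-free complement to RMSE is the coefficient of determination, R^2, the share of variance in the target that the predictions explain. This document does not report it, so the sketch below only illustrates the formula on hypothetical toy values. In scikit-learn the same quantity is available as sklearn.metrics.r2_score(y_test, predictions) or lm.score(X_test, y_test).

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values for illustration only
toy_r2 = r_squared([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
perfect_r2 = r_squared([3.0, 5.0, 7.0], [3.0, 5.0, 7.0])
print(toy_r2, perfect_r2)
```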

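Overall, we advise management to concentrate on the App, as it has by far the larger association with sales of the two channels. Holding the other variables constant, a one-minute increase in average time on the App corresponds to an increase of about 38.6 USD in yearly spend, compared with about 0.19 USD for the Website. Improving App functionality and experience is therefore likely to offer the higher ROI of the two channels. Note that Length of Membership has the largest coefficient overall (about 61.3 USD per additional year), so customer retention also merits attention.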
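Raw coefficients are only comparable when the features vary on similar scales. As a rough check, the sketch below multiplies each coefficient by that feature's standard deviation, giving the expected change in yearly spend for a one-standard-deviation increase. It assumes the full-data standard deviations from describe() are a fair stand-in for the training set, which the model was actually fit on.

```python
# Coefficients from the fitted model and standard deviations from describe(),
# both copied from the outputs earlier in this document
COEFFICIENTS = {
    "Avg. Session Length": 25.981550,
    "Time on App": 38.590159,
    "Time on Website": 0.190405,
    "Length of Membership": 61.279097,
}
STD_DEVS = {
    "Avg. Session Length": 0.992563,
    "Time on App": 0.994216,
    "Time on Website": 1.010489,
    "Length of Membership": 0.999278,
}

# Expected change in yearly spend (USD) per one-standard-deviation increase
effects = {name: COEFFICIENTS[name] * STD_DEVS[name] for name in COEFFICIENTS}

for name, effect in sorted(effects.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {effect:.2f}")

app_to_website_ratio = effects["Time on App"] / effects["Time on Website"]
```

Since all four standard deviations are close to 1, the ordering matches that of the raw coefficients: the App remains far ahead of the Website, and Length of Membership remains the largest effect.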

Residuals

The model fits the test data well. Let’s quickly explore the residuals to check that nothing is amiss.

We plot a histogram of the residuals and check that it looks approximately normally distributed. We could use either seaborn displot() or plt.hist().

sns.displot(x = (y_test - predictions), color = "gray")
plt.title("Distribution of Residuals")

Text(0.5, 1.0, 'Distribution of Residuals')

The residuals appear fairly normally distributed, although there are more formal ways to visualise and test this.
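One simple numerical complement to the histogram is to summarize the shape of the residuals. The sketch below computes the mean, standard deviation, skewness, and excess kurtosis; for roughly normal residuals, the skewness and excess kurtosis should both be near zero. The helper name residual_summary is hypothetical, and simulated values stand in for y_test - predictions.

```python
import numpy as np

def residual_summary(residuals):
    """Return mean, standard deviation, skewness, and excess kurtosis."""
    r = np.asarray(residuals, dtype=float)
    centered = r - r.mean()
    std = centered.std()  # population standard deviation
    return {
        "mean": r.mean(),
        "std": std,
        "skewness": np.mean(centered ** 3) / std ** 3,
        "excess_kurtosis": np.mean(centered ** 4) / std ** 4 - 3.0,
    }

# Simulated residuals, used here only because the real ones are not reproduced
rng = np.random.default_rng(0)
demo = residual_summary(rng.normal(loc=0.0, scale=9.0, size=10_000))
print(demo)
```

On the real data, residual_summary(y_test - predictions) would give the corresponding figures for the fitted model.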

Conclusion

Our analysis provides valuable insights into the impact of customer interaction channels on sales for the Ecommerce company. Using linear regression, we achieved a model with a Root Mean Square Error (RMSE) of about 8.9 USD, or roughly 1.8% of average yearly spend, indicating a reasonably accurate prediction of customer spending from engagement data. While the model has some margin for improvement, it offers a solid foundation for guiding the company’s strategic focus between the mobile app and website.

The results of this regression highlight the predictive power of customer engagement data in determining purchasing behavior and underscore the importance of optimizing the digital experience. Moving forward, the company can use these insights to prioritize development efforts toward the channel that maximizes customer satisfaction and revenue potential (Muddana and Vinayakam 2024).

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.