import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
Channel Success: Leveraging Machine Learning in Python
to Predict the Impact of Mobile App vs. Website on Ecommerce Sales
Independent Data Analysis Project
This study explores the relationship between customer interaction channels and purchasing behavior for a New York City-based Ecommerce company specializing in clothing sales and personal styling services. Using linear regression analysis on customer data, we sought to determine whether the company’s mobile app or website more effectively drives customer purchases. Our model achieved a Root Mean Square Error (RMSE) of about 8.9 USD on the test set, roughly 1.8% of average yearly spend, which indicates reliable predictive capacity with some room for further refinement. Findings suggest that customer engagement data can effectively guide strategic channel prioritization. These insights provide the company with a data-driven foundation to enhance customer experience and maximize revenue through targeted digital optimization efforts.
Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn
Background
In the age of digital commerce, the retail industry faces evolving challenges and opportunities as customer interactions span both physical and digital spaces. Ecommerce companies must balance these multi-channel experiences, optimizing for mobile, desktop, and in-store interactions to meet shifting consumer expectations. As consumers increasingly shop online, many retailers find it critical to understand which channels—mobile applications or websites—better align with customer preferences and drive revenue.
This project was conducted in collaboration with an Ecommerce company based in New York City, specializing in both online clothing sales and in-store personal styling services. The company provides a hybrid shopping experience: customers have the option to visit in-store for style consultations and later complete their purchases via a mobile app or website. To enhance their business strategy, the company seeks to identify whether the mobile app or the website better serves customer needs and contributes more effectively to sales.
The core of this analysis focuses on building a predictive model using linear regression to understand the relationship between customer interaction channels and purchasing behavior. By leveraging a dataset containing customer session information, purchasing patterns, and channel preferences, this study aims to provide actionable insights that will help the company refine its digital strategy. The results of this project will guide the company’s decision on whether to prioritize development resources towards their mobile app or website, thereby improving customer experience and maximizing return on investment (James et al. 2013).
Data
We’ll work with the Ecommerce Customers csv file from the company. It contains customer information, such as Email, Address, and color Avatar, along with the following numerical columns:
- Avg. Session Length: Average length of in-store style advice sessions.
- Time on App: Average time spent on the app, in minutes.
- Time on Website: Average time spent on the website, in minutes.
- Length of Membership: How many years the customer has been a member.
- Yearly Amount Spent: Amount the customer spends per year (the target variable).
We start by importing the relevant packages for the analysis.
We then read in the Ecommerce Customers csv file as a DataFrame called customers.
customers = pd.read_csv("Ecommerce Customers")
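Before exploring the data, it can help to confirm that the file loaded with the expected columns and types. The following is a minimal sketch of such a check. It assumes a tiny, hypothetical two-row CSV in place of the real file, and the names `sample`, `expected_numeric`, `missing`, and `non_numeric` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real "Ecommerce Customers" file,
# used only so this sketch runs without the company's data.
sample_csv = io.StringIO(
    "Email,Address,Avatar,Avg. Session Length,Time on App,"
    "Time on Website,Length of Membership,Yearly Amount Spent\n"
    'a@example.com,"1 Main St",Violet,34.5,12.7,39.6,4.1,588.0\n'
    'b@example.com,"2 Side St",Bisque,31.9,11.1,37.3,2.7,392.2\n'
)
sample = pd.read_csv(sample_csv)

expected_numeric = [
    "Avg. Session Length", "Time on App", "Time on Website",
    "Length of Membership", "Yearly Amount Spent",
]

# Expected columns that are absent from the file.
missing = [col for col in expected_numeric if col not in sample.columns]

# Expected columns that are present but were not parsed as numbers.
non_numeric = [
    col for col in expected_numeric
    if col in sample.columns
    and not pd.api.types.is_numeric_dtype(sample[col])
]

print("missing:", missing)
print("non-numeric:", non_numeric)
```

With the real file, the same check would be applied to `customers` instead of `sample`. Two empty lists suggest the columns are ready for modelling.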
Exploratory Data Analysis
Summary Statistics
We start by inspecting the first few rows of the data.
customers.head()
| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
The info() method gives us a description of the data.
customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Email 500 non-null object
1 Address 500 non-null object
2 Avatar 500 non-null object
3 Avg. Session Length 500 non-null float64
4 Time on App 500 non-null float64
5 Time on Website 500 non-null float64
6 Length of Membership 500 non-null float64
7 Yearly Amount Spent 500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB
The data has 500 observations and 8 variables. Five of the variables are numeric, while the other three (Email, Address, and Avatar) are strings. I use the describe() method to get summary statistics for the data, starting with the numeric variables.
customers.describe()
| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|
count | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
mean | 33.053194 | 12.052488 | 37.060445 | 3.533462 | 499.314038 |
std | 0.992563 | 0.994216 | 1.010489 | 0.999278 | 79.314782 |
min | 29.532429 | 8.508152 | 33.913847 | 0.269901 | 256.670582 |
25% | 32.341822 | 11.388153 | 36.349257 | 2.930450 | 445.038277 |
50% | 33.082008 | 11.983231 | 37.069367 | 3.533975 | 498.887875 |
75% | 33.711985 | 12.753850 | 37.716432 | 4.126502 | 549.313828 |
max | 36.139662 | 15.126994 | 40.005182 | 6.922689 | 765.518462 |
Next is a summary of the character variables.
customers.describe(include = "object")
| | Email | Address | Avatar |
---|---|---|---|
count | 500 | 500 | 500 |
unique | 500 | 500 | 138 |
top | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | SlateBlue |
freq | 1 | 1 | 7 |
Data Visualization
To begin, I use seaborn to create a jointplot comparing the Time on Website and Yearly Amount Spent columns.
import seaborn as sns
sns.jointplot(x = "Time on Website", y = "Yearly Amount Spent", data = customers, color = "gray")
While the two variables are reasonably normally distributed, the correlation between them does not appear to be strong.
I repeat the plot, this time with the Time on App column instead.
import seaborn as sns
sns.jointplot(x = "Time on App", y = "Yearly Amount Spent", data = customers, color = "gray")
The plot shows a noticeably stronger positive relationship between time on the app and the amount spent.
I then use jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership.
import seaborn as sns
sns.jointplot(x = "Time on App", y = "Length of Membership", data = customers, color = "gray", kind = "hex")
Although there is no strong relationship between these two variables, the observations are heavily concentrated in the middle of the chart. Length of Membership also appears approximately normally distributed.
We examine the pairwise relationships between all variables in the model. Here, we use the seaborn pairplot() function.
import seaborn as sns
sns.pairplot(customers)
It appears that Length of Membership has the strongest linear relationship with annual spend. We drill into this using the seaborn lmplot() method.
import seaborn as sns
sns.lmplot(x = "Length of Membership", y = "Yearly Amount Spent", data = customers)
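The visual impressions from these plots can be cross-checked numerically with Pearson correlations against the target. The sketch below is illustrative only. It builds a synthetic stand-in called `demo`, with feature means loosely based on the summary statistics and an assumed spend relationship, because the real file is not bundled here. The resulting numbers are therefore not the dataset's actual correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for `customers`: four independent features with
# roughly the means seen in the summary statistics and unit spread.
demo = pd.DataFrame({
    "Avg. Session Length": rng.normal(33.0, 1.0, n),
    "Time on App": rng.normal(12.0, 1.0, n),
    "Time on Website": rng.normal(37.0, 1.0, n),
    "Length of Membership": rng.normal(3.5, 1.0, n),
})

# Assumed spend relationship, chosen only for illustration.
demo["Yearly Amount Spent"] = (
    26 * demo["Avg. Session Length"]
    + 39 * demo["Time on App"]
    + 61 * demo["Length of Membership"]
    - 1048
    + rng.normal(0, 10, n)
)

# Pearson correlation of each feature with the target, strongest first.
target_corr = (
    demo.corr(numeric_only=True)["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real data, the equivalent call would be `customers.corr(numeric_only=True)["Yearly Amount Spent"]`. The `numeric_only=True` argument skips the string columns (Email, Address, Avatar).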
Machine Learning: Linear Regression
Training and Testing Data
Now that we’ve explored the data a bit, let’s go ahead and split the data into training and testing sets. We set a variable X equal to the numerical features of the customers and a variable y equal to the “Yearly Amount Spent” column.
X = customers[['Avg. Session Length', 'Time on App',
               'Time on Website', 'Length of Membership']]
y = customers['Yearly Amount Spent']
Next, we split the data into a training set and a testing set. The training set is used to fit the model, while evaluation is done on the testing set.
You may come across an error like “ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject”. One fix is to install a NumPy version below 2: pip install "numpy<2"
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
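To make the effect of `test_size` and `random_state` concrete, here is a small sketch on placeholder arrays with the same number of rows as the dataset (500). The arrays `X_demo` and `y_demo` are hypothetical and carry no real customer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 500 rows and 4 feature columns.
X_demo = np.arange(500 * 4).reshape(500, 4)
y_demo = np.arange(500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# Repeating the call with the same seed selects the same rows.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
same_split = np.array_equal(X_te, X_te_again)

print(X_tr.shape, X_te.shape, same_split)
```

With 500 rows and `test_size = 0.3`, 350 rows go to training and 150 to testing. Fixing `random_state` keeps the split reproducible across runs.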
We then fit the linear regression model. We start by specifying the appropriate model, in this case LinearRegression from sklearn.linear_model, and then fit it to the training data.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression()
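As a sanity check on what fitting does, the sketch below fits LinearRegression to synthetic data generated from known, made-up coefficients and confirms that the estimates land close to them. The values in `true_coef` and `true_intercept` are assumptions for illustration and are unrelated to the customer data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic features and a target built from known, made-up coefficients.
X_demo = rng.normal(size=(500, 2))
true_coef = np.array([40.0, 0.2])
true_intercept = 500.0
noise = rng.normal(scale=1.0, size=500)
y_demo = true_intercept + X_demo @ true_coef + noise

# Ordinary least squares fit, the same estimator used in this analysis.
model = LinearRegression().fit(X_demo, y_demo)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```

When the underlying relationship really is linear and the noise is modest, the fitted `intercept_` and `coef_` should sit near the values used to generate the data.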
We look at the intercept and coefficients of the model.
lm.intercept_
-1047.9327822502391
import pandas as pd
values = lm.coef_
values = list(values)
params = list(X_train.columns)
pd.DataFrame(values, params)
| | 0 |
---|---|
Avg. Session Length | 25.981550 |
Time on App | 38.590159 |
Time on Website | 0.190405 |
Length of Membership | 61.279097 |
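To see how the intercept and coefficients combine into a prediction, here is a small worked example in plain Python. It assumes a hypothetical customer sitting exactly at the sample means reported by describe(), and it uses the fitted values rounded to six decimals.

```python
# Intercept and coefficients as reported above (rounded to six decimals).
intercept = -1047.932782
coefficients = {
    "Avg. Session Length": 25.981550,
    "Time on App": 38.590159,
    "Time on Website": 0.190405,
    "Length of Membership": 61.279097,
}

# Hypothetical customer placed at the sample means from describe().
customer = {
    "Avg. Session Length": 33.053194,
    "Time on App": 12.052488,
    "Time on Website": 37.060445,
    "Length of Membership": 3.533462,
}


def predict(features):
    """Linear prediction: intercept plus the sum of coefficient * value."""
    return intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )


predicted_spend = predict(customer)

# Same customer with one extra minute on the app, all else unchanged.
customer_plus = {**customer, "Time on App": customer["Time on App"] + 1}
extra_from_app = predict(customer_plus) - predicted_spend

print(round(predicted_spend, 2), round(extra_from_app, 2))
```

The prediction for this average customer comes out at roughly 499.5 USD, close to the mean yearly spend. The extra app minute adds exactly the Time on App coefficient, which is what a coefficient means in a linear model.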
Predicting on Test Data
Now that we have fit our model, let’s evaluate its performance by predicting off the test values!
We use lm.predict() to predict off the X_test set of the data.
= lm.predict(X_test) predictions
We then create a scatterplot of the real test values versus the predicted values.
sns.scatterplot(x = y_test, y = predictions, color = "gray")
plt.title("Actuals versus Predicted Values")
Text(0.5, 1.0, 'Actuals versus Predicted Values')
Evaluating the Model
Let’s evaluate our model’s performance on the test set. We calculate the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE).
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
The Mean absolute error:
mean_absolute_error(y_test, predictions)
7.228148653430826
The mean squared error:
mean_squared_error(y_test, predictions)
79.81305165097427
The root mean squared error (RMSE):
mean_squared_error(y_test, predictions) ** 0.5
8.933815066978624
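To make these metrics less opaque, the sketch below computes MAE, MSE, and RMSE by hand on four toy values, together with the explained variance score (R^2), which is not reported above. The numbers in `actual` and `predicted` are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Toy values chosen for illustration; they are not from the dataset.
actual = np.array([500.0, 450.0, 600.0, 550.0])
predicted = np.array([510.0, 445.0, 590.0, 555.0])

errors = actual - predicted        # [-10, 5, 10, -5]

mae = np.mean(np.abs(errors))      # (10 + 5 + 10 + 5) / 4 = 7.5
mse = np.mean(errors ** 2)         # (100 + 25 + 100 + 25) / 4 = 62.5
rmse = np.sqrt(mse)                # square root of the MSE

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(errors ** 2)                      # 250
ss_tot = np.sum((actual - actual.mean()) ** 2)    # 12500
r2 = 1 - ss_res / ss_tot                          # 0.98

print(mae, mse, rmse, r2)
```

RMSE is expressed in the same units as the target (USD here), which makes it easier to interpret than MSE. On the real test set, `sklearn.metrics.r2_score(y_test, predictions)` would give the corresponding R^2.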
We compare this RMSE with the average yearly spend across all customers, expressed as a percentage.
avg = customers["Yearly Amount Spent"].mean()
mean_squared_error(y_test, predictions) ** 0.5 / avg * 100
1.7892176831511146
Hence, the RMSE (about 8.9 USD) is roughly 1.8% of the average yearly spend, a small error relative to typical spending.
Overall, we advise management to concentrate on the app, as it has a far larger estimated impact on sales than the website. An additional minute of customer time on the app corresponds to an increase of about 38.59 USD in yearly spend, compared with about 0.19 USD for the website, all else remaining the same. Of the two channels, improving the app’s functionality and experience would therefore likely offer the higher ROI.
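Raw coefficients depend on each feature's units, so one way to check the comparison is to scale each coefficient by its feature's standard deviation. The sketch below does this using the coefficients from the fitted model and the standard deviations from describe(). As a simplifying assumption, those standard deviations come from the full dataset, not only the training split.

```python
# Coefficients from the fitted model and standard deviations from describe().
coef = {
    "Avg. Session Length": 25.981550,
    "Time on App": 38.590159,
    "Time on Website": 0.190405,
    "Length of Membership": 61.279097,
}
std = {
    "Avg. Session Length": 0.992563,
    "Time on App": 0.994216,
    "Time on Website": 1.010489,
    "Length of Membership": 0.999278,
}

# Expected change in yearly spend (USD) for a one-standard-deviation
# increase in each feature, holding the others fixed.
per_std_effect = {name: coef[name] * std[name] for name in coef}

# Features ordered from largest to smallest scaled effect.
ranked = sorted(per_std_effect, key=per_std_effect.get, reverse=True)
app_vs_web = per_std_effect["Time on App"] / per_std_effect["Time on Website"]

for name in ranked:
    print(f"{name}: {per_std_effect[name]:.2f}")
print(f"app / website ratio: {app_vs_web:.0f}")
```

Because all four features have a spread close to 1, the ranking matches the raw coefficients. The app's effect is still far larger than the website's, while Length of Membership has the largest effect of all.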
Residuals
The model fits the test data well. Let’s quickly explore the residuals to check that nothing looks wrong with the data or the fit.
We plot a histogram of the residuals and make sure it looks normally distributed. We have the option to use either seaborn displot, or just plt.hist().
sns.displot(x = (y_test - predictions), color = "gray")
plt.title("Distribution of Residuals")
Text(0.5, 1.0, 'Distribution of Residuals')
The residuals appear fairly normally distributed, although there are more sophisticated ways to visualise and test this.
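One of those more formal checks is the Shapiro-Wilk test from SciPy, sketched below on synthetic residuals. The array `residuals` is a hypothetical stand-in for `y_test - predictions`, and its spread of 9 is only loosely based on the RMSE reported earlier.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic residuals standing in for `y_test - predictions`.
residuals = rng.normal(loc=0.0, scale=9.0, size=150)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal,
# so a small p-value would be evidence against normality.
statistic, p_value = stats.shapiro(residuals)

# Sample skewness; values near 0 suggest a symmetric distribution.
skewness = stats.skew(residuals)

print(f"W = {statistic:.3f}, p = {p_value:.3f}, skew = {skewness:.3f}")
```

A Q-Q plot via `scipy.stats.probplot` is a common visual companion to this test.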
Conclusion
Our analysis provides valuable insights into the impact of customer interaction channels on sales for the Ecommerce company. Using linear regression, we achieved a model with a Root Mean Square Error (RMSE) of about 8.9 USD, roughly 1.8% of average yearly spend, indicating a reasonably accurate prediction of customer spending based on engagement across channels. While the model has some margin for improvement, it offers a solid foundation for guiding the company’s strategic focus between the mobile app and website.
The results of this regression highlight the predictive power of customer engagement data in determining purchasing behavior and underscore the importance of optimizing the digital experience. Moving forward, the company can use these insights to prioritize development efforts toward the channel that maximizes customer satisfaction and revenue potential (Muddana and Vinayakam 2024).