Replace the example information below with your real information:
Based on CRISP-DM Template Version V15.
In this assignment, you will analyze a MODIFIED copy of the “Online Shoppers Purchasing Intention” dataset.
- Use the copy of the dataset provided to you for this assignment. The source of the data is documented in the file.
- The dataset was modified to highlight important aspects of data mining (e.g., data cleaning), so DO NOT publish or otherwise use this dataset for purposes other than this assignment.
This assignment covers steps two and three of the six steps of the CRISP-DM process model (Data Understanding, Data Preparation). (See the CRISP-DM materials on CARMEN.)
The objectives of this assignment are:
- By analyzing and thinking critically about the data, you may identify interesting characteristics that would be valuable when using the data.
- The final, cleaned-up dataset that you create may be used to build classification / regression models in future assignments.
Assume that you are the Director of Marketing Analytics for Buckeye Online Shopping Inc (BOS). BOS wants to improve its market share.
- BOS has obtained data on its existing marketing campaigns.
- BOS leadership has asked your team to analyze the data, and determine if it is suitable for developing analyses and models that would be effective in improving the marketing campaigns at BOS.
- Based on this, BOS may make improvements to specific aspects of their marketing, with a goal of improving BOS’s overall marketing performance.
You decide to look at the dataset, explore (understand) what it contains, and create (prepare) a clean dataset from it that contains the kind of information you think might be useful.
Assumption: In the future, you may be creating a Classification model that predicts one or more of the “class” or “target” variables. For this future assignment, assume you will be predicting the Revenue class variable. You do not yet know the specific Classification algorithm or model you will use. That will come in the future assignment.
Complete Sections 2 and 3 of CRISP-DM in this notebook (and the Conclusion).
- Data Understanding: Perform Exploratory Data Analysis and gain an understanding of the contents and usefulness of the dataset.
- Data Preparation: Clean up the data and save it for future use.
- Conclusion: Describe what you learned from this assignment.
It is essential that you communicate your goals, thought process, actions, results, and conclusions to the audience that will consume this work. It is not enough to show just the code. It is not appropriate to show long sections of unexplained printout, etc. Be kind to your readers and provide value to them!
ALWAYS follow this pattern when doing each portion of the work. This allows us to give feedback and assign scores, and to give partial credit. Make it easy for the reader to understand your work.
- Say (briefly) what you are trying to do, and why.
- Do it (code), and comment your code, so others can understand and evaluate it.
- Show or describe the result clearly (and briefly as needed), and explain the significant conclusions or insights derived from the results.
HAVE FUN!
import numpy as np
import pandas as pd
#!pip install matplotlib
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#matplotlib.use('Qt5Agg')
import seaborn as sns
pd.set_option('display.max_columns', 50) #include to avoid ... in middle of display
This step is essential (and occasionally ignored…). It focuses on understanding the business needs and objectives, and defining a data analysis problem that is relevant and useful to the business, so the right solution is built. Consider the following questions and discuss the possible answers to them in the space below. (You can just edit the existing cells if you like.)
- It is OK to say that you do not know. How might you get answers to the questions?
- You may consider some of the questions to be “not applicable”. If so, why?
- You are encouraged to use markdown instead of raw text in the answer cells, to make them “pretty” and easy to read.
- This list of questions is not comprehensive. Feel free to add other questions and comments.

Context - What is the overview, history, big picture?
Business Objectives - What is the problem or opportunity to be addressed / solved? - What is its business value?
Stakeholders - Who are they? - What are their needs?
General Goals - Understand the past/current state (descriptive)? - Predict the future (predictive)? - Mine for unknown patterns / insights? - Answer questions? What are they? - Support decisions? What are they?
Scope - What is the specific scope of the project? - What, specifically, is NOT in scope? (What are you not doing that someone might think you are doing?) Is the scope flexible? In what way?
Deliverables - Report? - Reusable code? - Presentation?
Success Criteria - What, specifically, must be accomplished for this project to be a success?
Solution Approach - Is this a “one time” (throw away) solution? - Or will the solution become part of repeated process that is used and enhanced over time?
Quality - What level of Correctness or Confidence is needed? - Is this life-critical or are we looking for ‘directionally correct’ results?
Constraints - Are there legal / regulatory implications (privacy, security, liability, HIPAA, FERPA)? - Are there architectural constraints?
Time / Cost - What is the deadline or budget?
Dependencies - Is this project dependent on any other project or activity? - Is any other project or activity dependent on this project?
Resources Needed - Do you potentially need access to specific people, data, etc?
Other Considerations - What other considerations, not covered above, might be relevant?
NOTE: Frequently, a first pass is made through the data to fix or remove “obvious” data quality issues (missing values, duplicate records, outliers (some), etc.). The deeper Data Understanding analysis then is done on the “cleaned” data. This mitigates the issue of bad data skewing the data analysis results. You may choose to do this separate first pass. If you do so, clearly document your approach (a minimal sketch follows).
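If you choose to do this first pass, here is a minimal sketch, assuming the assignment file and the column names examined later in this notebook; the Region filter is an illustrative assumption, not a documented rule:

import pandas as pd

# First-pass cleaning sketch: remove the most obvious quality issues before
# the deeper Data Understanding analysis (assumes the modified assignment file).
df = pd.read_excel("TEB_V2_FOR_ASSIGNMENT_online_shoppers_intention.xlsx",
                   sheet_name="data_EDITED_ERRORS")
df = df.drop_duplicates()      # remove exact duplicate records
df = df.dropna()               # drop the few rows containing missing values
df = df[df["Region"] != 999]   # drop an obviously impossible region code (assumption)
print("Shape after first pass:", df.shape)

***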
Before using a dataset, it is important to gain an initial understanding of the dataset, to see if it meets our needs. In particular, we need to:
- Identify important characteristics of the data (e.g., quality, completeness, currency, consistency, …).
- Decide: is this dataset relevant and appropriate for the goals of this project?
- If not, what could you do? (Find other data to replace or supplement this dataset, etc.)
To accomplish this:
- Find and examine the dataset’s metadata. Be careful! Hand-written metadata is frequently out of date or incorrect.
- Load and briefly examine the characteristics of the dataset.

***
Examine the Metadata.
Collect the information for each attribute.
- Categorical: Nominal, Ordinal; Numerical: Interval, Ratio; Other: Complex or unstructured data (e.g., text strings, images, …).
- This determines what operations are “meaningful” for each attribute. For example, you can’t add two Ordinal attributes.
- You can’t always tell the attribute types from the data itself. For example, an Integer attribute could be Ordinal or Interval.
Create a table, such as:

| Attribute | Type | DataType | UnitOfMeasure | Meaning | Notes |
| --- | --- | --- | --- | --- | --- |
| ROW | Ordinal | Integer | No Unit - Just an index | Record number; not real data | This is handy for documentation purposes |
| etc. | | | | | |
Note any concerns, considerations, or questions regarding this data.
Concerns identified so far (detailed in the output below): the Month column contains a misspelled value (“Fbe”) and both “Jun” and “June”; VisitorType contained the typo “eturning_Visitor”; Informational, Weekend, and Revenue each have one missing value; and several attributes show implausible extremes (Region = 999, OperatingSystems = 44, PageValues ≈ 1.87 billion). Also, ROW is just a record index, not real data.
import pandas as pd

# Load dataset
file_path = "TEB_V2_FOR_ASSIGNMENT_online_shoppers_intention.xlsx"
data_df = pd.read_excel(file_path, sheet_name="data_EDITED_ERRORS")

# --- Data Cleaning Example ---
# Fix the typo in VisitorType
data_df["VisitorType"] = data_df["VisitorType"].replace("eturning_Visitor", "Returning_Visitor")

# --- Explore Metadata ---
# Basic dataset info
print("Shape of dataset:", data_df.shape)
print("\nColumn Data Types:")
print(data_df.dtypes)

# Count missing values
print("\nMissing values per column:")
print(data_df.isnull().sum())

# Unique values in categorical columns
print("\nUnique values in categorical columns:")
for col in data_df.select_dtypes(include=["object", "bool"]).columns:
    print(f"{col}: {data_df[col].unique()}")

# --- Attribute Metadata Table ---
metadata = []
for col in data_df.columns:
    dtype = data_df[col].dtype
    if dtype == "object":
        col_type = "Categorical (Nominal)"
    elif dtype in ["int64", "float64"]:
        col_type = "Numerical"
    else:
        col_type = "Other"
    metadata.append({
        "Attribute": col,
        "Type": col_type,
        "DataType": str(dtype),
        "UnitOfMeasure": "N/A",  # update manually if known
    })

metadata_df = pd.DataFrame(metadata)
print("\nMetadata Table:")
print(metadata_df)
Shape of dataset: (12331, 19)
Column Data Types:
ROW int64
Administrative int64
Administrative_Duration float64
Informational float64
Informational_Duration float64
ProductRelated int64
ProductRelated_Duration float64
BounceRates float64
ExitRates float64
PageValues float64
SpecialDay float64
Month object
OperatingSystems int64
Browser int64
Region int64
TrafficType int64
VisitorType object
Weekend float64
Revenue float64
dtype: object
Missing values per column:
ROW 0
Administrative 0
Administrative_Duration 0
Informational 1
Informational_Duration 0
ProductRelated 0
ProductRelated_Duration 0
BounceRates 0
ExitRates 0
PageValues 0
SpecialDay 0
Month 0
OperatingSystems 0
Browser 0
Region 0
TrafficType 0
VisitorType 0
Weekend 1
Revenue 1
dtype: int64
Unique values in categorical columns:
Month: ['Feb' 'Fbe' 'Mar' 'May' 'Oct' 'June' 'Jul' 'Aug' 'Nov' 'Sep' 'Jun' 'Dec']
VisitorType: ['Returning_Visitor' 'New_Visitor' 'Other']
Metadata Table:
Attribute Type DataType UnitOfMeasure
0 ROW Numerical int64 N/A
1 Administrative Numerical int64 N/A
2 Administrative_Duration Numerical float64 N/A
3 Informational Numerical float64 N/A
4 Informational_Duration Numerical float64 N/A
5 ProductRelated Numerical int64 N/A
6 ProductRelated_Duration Numerical float64 N/A
7 BounceRates Numerical float64 N/A
8 ExitRates Numerical float64 N/A
9 PageValues Numerical float64 N/A
10 SpecialDay Numerical float64 N/A
11 Month Categorical (Nominal) object N/A
12 OperatingSystems Numerical int64 N/A
13 Browser Numerical int64 N/A
14 Region Numerical int64 N/A
15 TrafficType Numerical int64 N/A
16 VisitorType Categorical (Nominal) object N/A
17 Weekend Numerical float64 N/A
18 Revenue Numerical float64 N/A
data_df.head().T
 | 0 | 1 | 2 | 3 | 4 |
--- | --- | --- | --- | --- | --- |
ROW | 1 | 2 | 3 | 4 | 5 |
Administrative | 0 | 0 | 0 | 0 | 0 |
Administrative_Duration | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Informational | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Informational_Duration | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
ProductRelated | 1 | 2 | 1 | 2 | 10 |
ProductRelated_Duration | 0.0 | 64.0 | 0.0 | 2.666667 | 627.5 |
BounceRates | 0.2 | 0.0 | 0.2 | 0.05 | 0.02 |
ExitRates | 0.2 | 0.1 | 0.2 | 0.14 | 0.05 |
PageValues | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
SpecialDay | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Month | Feb | Feb | Feb | Feb | Feb |
OperatingSystems | 1 | 2 | 4 | 3 | 3 |
Browser | 1 | 2 | 1 | 2 | 3 |
Region | 1 | 1 | 9 | 2 | 1 |
TrafficType | 1 | 2 | 3 | 4 | 4 |
VisitorType | Returning_Visitor | Returning_Visitor | Returning_Visitor | Returning_Visitor | Returning_Visitor |
Weekend | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Revenue | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
data_df.describe(include='all').T
 | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
ROW | 12331.0 | NaN | NaN | NaN | 6166.0 | 3559.797419 | 1.0 | 3083.5 | 6166.0 | 9248.5 | 12331.0 |
Administrative | 12331.0 | NaN | NaN | NaN | 2.314979 | 3.321715 | 0.0 | 0.0 | 1.0 | 4.0 | 27.0 |
Administrative_Duration | 12331.0 | NaN | NaN | NaN | 80.812056 | 176.773437 | 0.0 | 0.0 | 7.5 | 93.2125 | 3398.75 |
Informational | 12330.0 | NaN | NaN | NaN | 0.503569 | 1.270156 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 |
Informational_Duration | 12331.0 | NaN | NaN | NaN | 34.469602 | 140.743929 | 0.0 | 0.0 | 0.0 | 0.0 | 2549.375 |
ProductRelated | 12331.0 | NaN | NaN | NaN | 31.729138 | 44.474452 | 0.0 | 7.0 | 18.0 | 38.0 | 705.0 |
ProductRelated_Duration | 12331.0 | NaN | NaN | NaN | 1194.651054 | 1913.620864 | 0.0 | 184.05 | 598.873809 | 1464.104809 | 63973.52223 |
BounceRates | 12331.0 | NaN | NaN | NaN | 0.02219 | 0.048487 | 0.0 | 0.0 | 0.003111 | 0.0168 | 0.2 |
ExitRates | 12331.0 | NaN | NaN | NaN | 0.043075 | 0.048595 | 0.0 | 0.014286 | 0.025161 | 0.05 | 0.2 |
PageValues | 12331.0 | NaN | NaN | NaN | 151358.845908 | 16806994.674174 | 0.0 | 0.0 | 0.0 | 0.0 | 1866333333.0 |
SpecialDay | 12331.0 | NaN | NaN | NaN | 0.061422 | 0.19891 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Month | 12331 | 12 | May | 3364 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
OperatingSystems | 12331.0 | NaN | NaN | NaN | 2.127321 | 0.98612 | 1.0 | 2.0 | 2.0 | 3.0 | 44.0 |
Browser | 12331.0 | NaN | NaN | NaN | 2.357068 | 1.71721 | 1.0 | 2.0 | 2.0 | 2.0 | 13.0 |
Region | 12331.0 | NaN | NaN | NaN | 3.227475 | 9.283858 | 1.0 | 1.0 | 3.0 | 4.0 | 999.0 |
TrafficType | 12331.0 | NaN | NaN | NaN | 4.069419 | 4.025049 | 1.0 | 2.0 | 2.0 | 4.0 | 20.0 |
VisitorType | 12331 | 3 | Returning_Visitor | 10551 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Weekend | 12330.0 | NaN | NaN | NaN | 0.232685 | 0.42256 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Revenue | 12330.0 | NaN | NaN | NaN | 0.154745 | 0.361676 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12331 entries, 0 to 12330
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ROW 12331 non-null int64
1 Administrative 12331 non-null int64
2 Administrative_Duration 12331 non-null float64
3 Informational 12330 non-null float64
4 Informational_Duration 12331 non-null float64
5 ProductRelated 12331 non-null int64
6 ProductRelated_Duration 12331 non-null float64
7 BounceRates 12331 non-null float64
8 ExitRates 12331 non-null float64
9 PageValues 12331 non-null float64
10 SpecialDay 12331 non-null float64
11 Month 12331 non-null object
12 OperatingSystems 12331 non-null int64
13 Browser 12331 non-null int64
14 Region 12331 non-null int64
15 TrafficType 12331 non-null int64
16 VisitorType 12331 non-null object
17 Weekend 12330 non-null float64
18 Revenue 12330 non-null float64
dtypes: float64(10), int64(7), object(2)
memory usage: 1.8+ MB
Discussion:
From your examination of the metadata and the dataset, what is your conclusion about this dataset? Does it appear (so far) to satisfy your needs? Are there any concerns?
The dataset contains 12,331 rows and 19 columns (including the ROW record index), capturing user behavior on an e-commerce site. It includes information on page visits and durations (Administrative, Informational, ProductRelated), engagement metrics (BounceRates, ExitRates, PageValues), technical data (OperatingSystems, Browser, Region), visitor type, and whether the session resulted in a purchase (Revenue). Most columns are numeric, with a few categorical variables like Month and VisitorType. There are a few missing values (one each in Informational, Weekend, and Revenue), which are minimal and easy to handle. Some columns, like PageValues (max ≈ 1.87 billion) and Region (max 999), have extreme outliers, and certain numeric columns (e.g., OperatingSystems, Browser, Region, TrafficType) actually represent categories and may need to be treated as such. Many duration and behavior features are skewed, with a lot of zeros and some very large values. Overall, the dataset is relevant and usable for analyzing user behavior and predicting revenue, but it will require some preprocessing, such as handling outliers, encoding categorical variables, and addressing skewed distributions.
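As a minimal sketch of the follow-up this suggests (using the already-loaded data_df; the column choices are based on the metadata above), the numeric-coded categorical attributes can be recast and the suspicious extremes inspected:

# Hedged sketch: recast numeric-coded categoricals on a copy (these integer
# codes are nominal labels, not quantities) and inspect the extreme values.
tmp = data_df.copy()
for col in ["OperatingSystems", "Browser", "Region", "TrafficType"]:
    tmp[col] = tmp[col].astype("category")
print(tmp.dtypes[["OperatingSystems", "Browser", "Region", "TrafficType"]])

# The suspicious extremes from describe(): Region == 999 and the huge PageValues max.
print("Rows with Region == 999:", (data_df["Region"] == 999).sum())
print(data_df["PageValues"].nlargest(5))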
print("\n DESCRIPTIVE STATISTICS")
print(data_df.describe(percentiles=[0.25, 0.5, 0.75])) # includes mean, std, min, 25%, 50%(median), 75%, max
DESCRIPTIVE STATISTICS
ROW Administrative Administrative_Duration Informational \
count 12331.000000 12331.000000 12331.000000 12330.000000
mean 6166.000000 2.314979 80.812056 0.503569
std 3559.797419 3.321715 176.773437 1.270156
min 1.000000 0.000000 0.000000 0.000000
25% 3083.500000 0.000000 0.000000 0.000000
50% 6166.000000 1.000000 7.500000 0.000000
75% 9248.500000 4.000000 93.212500 0.000000
max 12331.000000 27.000000 3398.750000 24.000000
Informational_Duration ProductRelated ProductRelated_Duration \
count 12331.000000 12331.000000 12331.000000
mean 34.469602 31.729138 1194.651054
std 140.743929 44.474452 1913.620864
min 0.000000 0.000000 0.000000
25% 0.000000 7.000000 184.050000
50% 0.000000 18.000000 598.873809
75% 0.000000 38.000000 1464.104809
max 2549.375000 705.000000 63973.522230
BounceRates ExitRates PageValues SpecialDay \
count 12331.000000 12331.000000 1.233100e+04 12331.000000
mean 0.022190 0.043075 1.513588e+05 0.061422
std 0.048487 0.048595 1.680699e+07 0.198910
min 0.000000 0.000000 0.000000e+00 0.000000
25% 0.000000 0.014286 0.000000e+00 0.000000
50% 0.003111 0.025161 0.000000e+00 0.000000
75% 0.016800 0.050000 0.000000e+00 0.000000
max 0.200000 0.200000 1.866333e+09 1.000000
OperatingSystems Browser Region TrafficType \
count 12331.000000 12331.000000 12331.000000 12331.000000
mean 2.127321 2.357068 3.227475 4.069419
std 0.986120 1.717210 9.283858 4.025049
min 1.000000 1.000000 1.000000 1.000000
25% 2.000000 2.000000 1.000000 2.000000
50% 2.000000 2.000000 3.000000 2.000000
75% 3.000000 2.000000 4.000000 4.000000
max 44.000000 13.000000 999.000000 20.000000
Weekend Revenue
count 12330.000000 12330.000000
mean 0.232685 0.154745
std 0.422560 0.361676
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 1.000000 1.000000
Discussion:
The dataset includes over 12,000 website visits and shows that most users interact more with product-related pages than with administrative or informational pages. On average, users visit about 2 administrative pages and spend around 81 seconds there, while they view roughly 32 product-related pages, spending about 1,200 seconds in total. Most users spend very little time on informational pages, but a few spend much longer, creating some extreme values. Bounce and exit rates are low, meaning users usually explore multiple pages, but only about 15% of sessions result in revenue. Most visits occur on regular days rather than special days, and only about 23% happen on weekends. The data also shows a variety of operating systems, browsers, and regions, though some extreme values suggest unusual or erroneous entries. Overall, most users have low activity, but a small number are highly engaged, creating skewed distributions for time spent and page interactions.
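Since the duration features are heavily right-skewed (many zeros, a few very large values), a log transform is one common remedy; here is a minimal sketch on the already-loaded data_df, shown without modifying it:

import numpy as np

# log1p (log(1 + x)) handles the zeros gracefully and compresses the long
# right tail of the duration features; whether to actually apply it depends
# on the model chosen in the future assignment.
duration_cols = ["Administrative_Duration", "Informational_Duration",
                 "ProductRelated_Duration"]
print(np.log1p(data_df[duration_cols]).describe())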
import matplotlib.pyplot as plt
# Revenue distribution
data_df["Revenue"].value_counts().plot(kind="bar")
plt.title("Revenue Distribution")
plt.xlabel("Revenue (0 = No Purchase, 1 = Purchase)")
plt.ylabel("Count")
plt.show()
# Avg PageValues by Revenue
data_df.groupby("Revenue")["PageValues"].mean().plot(kind="bar")
plt.title("Average PageValues by Revenue")
plt.xlabel("Revenue (0 = No Purchase, 1 = Purchase)")
plt.ylabel("Average PageValues")
plt.show()
# VisitorType distribution
data_df["VisitorType"].value_counts().plot(kind="bar")
plt.title("Visitor Type Distribution")
plt.xlabel("Visitor Type")
plt.ylabel("Count")
plt.show()
# BounceRates distribution
data_df["BounceRates"].plot(kind="hist", bins=30)
plt.title("Distribution of Bounce Rates")
plt.xlabel("Bounce Rate")
plt.ylabel("Frequency")
plt.show()
Discussion:
I looked at some of the most important attributes in the dataset. First, the revenue distribution shows that most shopping sessions do not end in a purchase, which means the dataset is imbalanced. When comparing average PageValues, sessions with purchases show much higher values, suggesting this feature is strongly linked to buying behavior (though the extreme PageValues outlier should be kept in mind). The visitor type distribution shows that most users are returning visitors, while new visitors make up a smaller portion. Finally, the bounce rate distribution shows that most sessions have low bounce rates, but there are some cases where users leave the site almost immediately. Together, these visualizations highlight which customer behaviors are most connected to revenue outcomes.
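A quick check quantifies the class imbalance mentioned above (a minimal sketch using the already-loaded data_df):

# Roughly 85% of sessions end without a purchase, so plain accuracy would be
# a misleading metric for the future Revenue classifier.
print(data_df["Revenue"].value_counts(normalize=True).round(3))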
# Check for missing values
print("Missing Values Per Column: \n\n", data_df.isnull().sum())
# Check for duplicate rows
print("Number of duplicate rows:", data_df.duplicated().sum())
Missing Values Per Column:
ROW 0
Administrative 0
Administrative_Duration 0
Informational 1
Informational_Duration 0
ProductRelated 0
ProductRelated_Duration 0
BounceRates 0
ExitRates 0
PageValues 0
SpecialDay 0
Month 0
OperatingSystems 0
Browser 0
Region 0
TrafficType 0
VisitorType 0
Weekend 1
Revenue 1
dtype: int64
Number of duplicate rows: 0
Discussion:
The dataset is mostly complete, with very few missing values. Only three columns have missing entries: Informational, Weekend, and Revenue each have 1 missing value. This indicates that almost all data points are present, making the dataset reliable for analysis. Additionally, the duplicate check found 0 duplicate rows, so no deduplication is needed. Overall, the dataset is clean; addressing the three missing values (by dropping or imputing those rows) would make the analysis robust.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Numerical vs Numerical: Correlation heatmap
corr = data_df.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap")
plt.show()
# Numerical vs Categorical: PageValues vs Revenue
data_df.boxplot(column="PageValues", by="Revenue")
plt.title("PageValues by Revenue")
plt.suptitle("") # removes automatic pandas title
plt.xlabel("Revenue (0 = No Purchase, 1 = Purchase)")
plt.ylabel("PageValues")
plt.show()
# Categorical vs Categorical: VisitorType vs Revenue
pd.crosstab(data_df["VisitorType"], data_df["Revenue"]).plot(kind="bar")
plt.title("VisitorType vs Revenue")
plt.xlabel("Visitor Type")
plt.ylabel("Count")
plt.show()
Discussion:
I explored relationships between attributes. For the numerical vs numerical case, the correlation heatmap shows that some features, like BounceRates and ExitRates, are strongly related to each other, while most other variables have weaker correlations. For the numerical vs categorical case, the boxplot of PageValues by Revenue clearly shows that customers who made purchases had much higher PageValues compared to those who did not, making this feature a strong indicator of buying behavior. Finally, the categorical vs categorical plot comparing VisitorType with Revenue shows that Returning Visitors are much more likely to generate revenue compared to New Visitors. These results highlight which factors tend to move together and which customer behaviors are most connected to purchase outcomes.
# Correlation of numerical features with target
print("Correlation with Revenue:\n",
data_df.corr(numeric_only=True)['Revenue'].sort_values(ascending=False))
Correlation with Revenue:
Revenue 1.000000
ProductRelated 0.158553
ProductRelated_Duration 0.152380
Administrative 0.138946
ROW 0.121233
Informational 0.095215
Administrative_Duration 0.093595
Informational_Duration 0.070366
Weekend 0.029277
Browser 0.023965
PageValues -0.003853
TrafficType -0.005113
Region -0.006700
OperatingSystems -0.015033
SpecialDay -0.082305
BounceRates -0.150666
ExitRates -0.207108
Name: Revenue, dtype: float64
Discussion:
The correlations show how different factors relate to whether a visit results in revenue. Product-related activity has the strongest positive relationship, with ProductRelated (0.16) and ProductRelated_Duration (0.15) indicating that users who view more products or spend more time on product pages are slightly more likely to make a purchase. Visits to administrative and informational pages also show a small positive correlation with revenue, while Weekend, Browser, and OperatingSystems have very weak positive or near-zero correlations.
On the other hand, negative correlations appear for engagement metrics like BounceRates (-0.15) and ExitRates (-0.21), meaning users who leave quickly or exit pages early are less likely to generate revenue. SpecialDay also has a small negative correlation (-0.08), suggesting that sales are not strongly tied to special events in this dataset. Notably, PageValues shows a near-zero correlation (-0.004) here even though the earlier plots suggested it separates purchasers from non-purchasers; this is likely an artifact of its extreme outlier (max ≈ 1.87 billion) inflating the variance, so the relationship should be re-checked after outlier handling. Overall, product-related engagement is the best predictor of revenue, while quick exits and bounces reduce the likelihood of purchase.
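One way to sanity-check the PageValues anomaly is a rank-based correlation, which is insensitive to extreme outliers; a minimal sketch using the already-loaded data_df:

# Spearman (rank) correlation is robust to the extreme PageValues outlier,
# so it gives a fairer picture of monotonic association with Revenue.
spearman = data_df.corr(numeric_only=True, method="spearman")["Revenue"]
print(spearman.sort_values(ascending=False))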
In this step, the “final” target dataset is constructed (although nothing is ever final…), based on the insights gleaned in the Data Understanding step. This may include:
- Normalizing / standardizing the data.
- Handling missing and erroneous data.
- Transforming or eliminating attributes.
- “Blending” data from multiple sources (with attention to consistency of the various sources).
- Deploying the data to storage in suitable formats, data models / structures.
NOTE: In each of these sections, it is OK to decide to not make changes to the dataset, but you must explain why you made the choice. Recall that the purpose of this step is to produce a clean dataset that is suitable for the intended purposes. So, for example, you might choose to not remove any attributes, but if you have 100 apparently irrelevant attributes, explain why you are keeping them - there may be a good reason. ***
# Fix known typo in VisitorType
data_df["VisitorType"] = data_df["VisitorType"].replace("eturning_Visitor", "Returning_Visitor")

# Check missing values
missing_counts = data_df.isnull().sum()
print("Missing values per column:")
print(missing_counts)
Missing values per column:
Administrative 0
Administrative_Duration 0
Informational 1
Informational_Duration 0
ProductRelated 0
ProductRelated_Duration 0
BounceRates 0
ExitRates 0
PageValues 0
SpecialDay 0
Month 0
OperatingSystems 0
Browser 0
Region 0
TrafficType 0
VisitorType 0
Weekend 1
Revenue 1
Value_to_Bounce 0
dtype: int64
Discussion:
We identified an error in the VisitorType column where “eturning_Visitor” appeared due to a typo; this was corrected to “Returning_Visitor”. The missing-value check shows that Informational, Weekend, and Revenue each still contain one missing value. Since only three of 12,331 rows are affected, they can be dropped (or imputed) with negligible impact, keeping the dataset consistent without introducing artificial data.
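A minimal sketch of that handling step, assuming we simply drop the three affected rows rather than impute:

# Drop the three rows with missing values (Informational, Weekend, Revenue).
# With 12,331 rows, losing 3 is negligible; imputation (e.g., median / mode
# fill) would be the alternative if every row had to be kept.
rows_before = len(data_df)
data_df = data_df.dropna()
print("Dropped", rows_before - len(data_df), "rows;", len(data_df), "remain.")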
# Drop ROW if it exists
data_df = data_df.drop(columns=["ROW"], errors="ignore")
print("Columns after cleaning: ", data_df.columns.tolist())
Columns after cleaning: ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue', 'Value_to_Bounce']
Discussion:
The ROW column was dropped because it is just a record index with no analytical value; keeping it could mislead a model (note its spurious 0.12 correlation with Revenue above). All other attributes were retained for now, pending the modeling assignment.
# Example: Create a new attribute combining PageValues and BounceRates
data_df["Value_to_Bounce"] = data_df["PageValues"] / (data_df["BounceRates"] + 1e-6)
print("New column 'Value_to_Bounce' added. Example values:\n", data_df["Value_to_Bounce"].head())
New column 'Value_to_Bounce' added. Example values:
0 -0.002456
1 0.019679
2 -0.002456
3 -0.015701
4 0.199430
Name: Value_to_Bounce, dtype: float64
Discussion:
A new attribute Value_to_Bounce was created to capture the relationship between PageValues and BounceRates. This may highlight sessions where users viewed valuable products but also had a high likelihood of leaving. This derived attribute could be useful for predicting purchasing intention. (Note: the example values above include negatives, which is impossible for a ratio of non-negative raw values; this cell was apparently executed after standardization, so the ratio should be recomputed on the raw data.)
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Scale numeric columns to mean 0, variance 1.
# NOTE: as selected, num_cols includes the Revenue target and binary flags
# such as Weekend; for a future classifier, the target would normally be
# excluded from scaling.
scaler = StandardScaler()
num_cols = data_df.select_dtypes(include=["int64", "float64"]).columns
data_df[num_cols] = scaler.fit_transform(data_df[num_cols])
print("Numeric columns scaled:\n", num_cols.tolist())

# Encode categories as integer labels.
# NOTE: LabelEncoder imposes an arbitrary numeric order on these nominal
# values; the same instance is re-fit per column, which works because each
# transform happens immediately after its fit.
encoder = LabelEncoder()
data_df["VisitorType"] = encoder.fit_transform(data_df["VisitorType"])
data_df["Month"] = encoder.fit_transform(data_df["Month"])
print("Categorical columns encoded: ['VisitorType', 'Month']")
print("Sample of transformed data:\n", data_df.head())
Numeric columns scaled:
['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue', 'Value_to_Bounce']
Categorical columns encoded: ['VisitorType', 'Month']
Sample of transformed data:
Administrative Administrative_Duration Informational \
0 -0.696951 -0.457169 -0.396478
1 -0.696951 -0.457169 -0.396478
2 -0.696951 -0.457169 -0.396478
3 -0.696951 -0.457169 -0.396478
4 -0.696951 -0.457169 -0.396478
Informational_Duration ProductRelated ProductRelated_Duration \
0 -0.24492 -0.690967 -0.624314
1 -0.24492 -0.668481 -0.590868
2 -0.24492 -0.690967 -0.624314
3 -0.24492 -0.668481 -0.622920
4 -0.24492 -0.488596 -0.296388
BounceRates ExitRates PageValues SpecialDay Month OperatingSystems \
0 3.667343 3.229376 -0.009006 -0.308808 3 -1.143235
1 -0.457661 1.171469 -0.009006 -0.308808 3 -0.129119
2 3.667343 3.229376 -0.009006 -0.308808 3 1.899113
3 0.573590 1.994632 -0.009006 -0.308808 3 0.884997
4 -0.045160 0.142516 -0.009006 -0.308808 3 0.884997
Browser Region TrafficType VisitorType Weekend Revenue \
0 -0.790307 -0.239940 -0.762610 2 -0.550677 -0.427872
1 -0.207943 -0.239940 -0.514156 2 -0.550677 -0.427872
2 -0.790307 0.621806 -0.265702 2 -0.550677 -0.427872
3 -0.207943 -0.132221 -0.017247 2 -0.550677 -0.427872
4 0.374420 -0.239940 -0.017247 2 1.815947 -0.427872
Value_to_Bounce
0 0.014942
1 0.022040
2 0.014942
3 0.010694
4 0.079685
Discussion:
Numerical features were standardized to mean 0 and variance 1, making them comparable across scales. VisitorType and Month were label-encoded for use in machine learning models. This ensures all attributes are in a machine-readable format, though label encoding imposes an arbitrary order on these nominal values; a one-hot alternative is sketched below.
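For models that are sensitive to the spurious ordering introduced by label encoding (e.g., linear models), one-hot encoding is a common alternative; a minimal sketch that produces a new frame rather than modifying data_df:

# One-hot encode the nominal columns: get_dummies creates one 0/1 indicator
# column per category value, avoiding any implied order.
onehot_df = pd.get_dummies(data_df, columns=["VisitorType", "Month"],
                           prefix=["Visitor", "Month"])
print(onehot_df.columns.tolist())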
# Example: random 50% sample (if dataset is very large)
sampled_df = data_df.sample(frac=0.5, random_state=42)
print(sampled_df)
Administrative Administrative_Duration Informational \
8916 0.206234 0.348980 -0.396478
772 1.109418 2.017236 1.178195
12251 -0.696951 -0.457169 0.390859
7793 -0.094828 0.340495 -0.396478
6601 4.722157 2.983193 4.327541
... ... ... ...
1370 -0.395889 -0.281796 -0.396478
12152 -0.395889 -0.400597 -0.396478
11950 -0.696951 -0.457169 -0.396478
9039 -0.395889 -0.457169 -0.396478
5551 -0.696951 -0.457169 -0.396478
Informational_Duration ProductRelated ProductRelated_Duration \
8916 -0.244920 0.365862 -0.074414
772 1.428755 1.152863 0.684192
12251 -0.209393 -0.511081 -0.478511
7793 -0.244920 -0.488596 -0.307275
6601 4.969015 3.064150 1.961674
... ... ... ...
1370 -0.244920 -0.578538 -0.610988
12152 -0.244920 4.997923 2.672222
11950 -0.244920 -0.578538 -0.501766
9039 -0.244920 0.298405 0.052964
5551 -0.244920 -0.645996 -0.606964
BounceRates ExitRates PageValues SpecialDay Month \
8916 -0.367987 -0.618015 -0.009006 -0.308808 9
772 -0.412331 -0.785268 -0.009006 -0.308808 7
12251 0.367340 -0.028976 -0.009006 -0.308808 9
7793 -0.285785 -0.343379 -0.009004 -0.308808 0
6601 -0.320880 -0.608054 -0.009005 -0.308808 0
... ... ... ... ... ...
1370 0.917341 0.828485 -0.009006 -0.308808 7
12152 -0.405825 -0.573988 -0.009006 -0.308808 9
11950 1.604841 1.857438 -0.009006 -0.308808 9
9039 -0.457661 -0.612050 -0.009006 -0.308808 9
5551 -0.457661 0.485500 -0.009006 -0.308808 6
OperatingSystems Browser Region TrafficType VisitorType \
8916 -1.143235 3.286238 0.298651 1.721932 2
772 -0.129119 -0.207943 -0.024503 -0.514156 2
12251 0.884997 -0.207943 0.406370 0.976570 0
7793 -0.129119 1.539148 0.406370 -0.017247 2
6601 -0.129119 -0.207943 -0.024503 -0.762610 2
... ... ... ... ... ...
1370 0.884997 -0.207943 -0.024503 -0.762610 2
12152 -0.129119 -0.207943 -0.024503 -0.514156 2
11950 -1.143235 -0.790307 0.406370 1.473478 2
9039 -0.129119 1.539148 -0.132221 -0.265702 2
5551 -1.143235 -0.790307 -0.024503 -0.762610 2
Weekend Revenue Value_to_Bounce
8916 -0.550677 -0.427872 0.023578
772 -0.550677 2.337149 0.022734
12251 1.815947 -0.427872 0.007867
7793 -0.550677 -0.427872 0.025833
6601 1.815947 -0.427872 0.024730
... ... ... ...
1370 -0.550677 -0.427872 0.012581
12152 -0.550677 -0.427872 0.022846
11950 1.815947 -0.427872 0.013930
9039 -0.550677 -0.427872 0.022040
5551 1.815947 -0.427872 0.022040
[6166 rows x 19 columns]
Discussion:
Since the dataset size is manageable, no record sampling was performed. If computational efficiency becomes an issue, a representative sample could be selected to reduce processing time without losing data diversity.
Discussion:
No additional preparation steps were required in memory; the pandas DataFrame structure was sufficient for analysis. However, since the assignment asks that the cleaned data be saved for future use, persisting it to disk is sketched below.
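A minimal sketch of persisting the cleaned dataset for the future modeling assignment (the output filename is illustrative):

# Save the cleaned, transformed dataset; CSV is portable, and index=False
# avoids writing the DataFrame's integer index as an extra column.
data_df.to_csv("online_shoppers_intention_CLEANED.csv", index=False)
print("Saved cleaned dataset with shape:", data_df.shape)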
In this step, the dataset is analyzed and used to create models and/or answer questions relevant to the business. This may include:
- Developing statistical models for describing or predicting business situations.
- Training data mining (e.g., machine learning) algorithms for classification, clustering, association analysis, outlier detection, and other uses.
NOTE: This step is tightly coupled and iterative with the Data Preparation and Evaluation steps. ***
Discussion:
Fill in…
In this step, the analyses and/or models are evaluated for quality and relevance / usefulness.
NOTE: This step is tightly coupled and iterative with the Modeling step. ***
Discussion:
Fill in…
In this step, data and resulting models and/or analyses are put to use. This may include:
- Deploying data mining / machine learning algorithms to production.
- Communicating the results of the analyses to business stakeholders, so they can take action.

***
Discussion:
Fill in…