Replace the example information below with your real information:
Based on CRISP-DM Template Version V15.
In this assignment, you will analyze a MODIFIED copy of the “Online Shoppers Purchasing Intention” dataset.
- Use the copy of the dataset provided to you for this assignment. The source of the data is documented in the file.
- The dataset was modified to highlight important aspects of data mining (e.g., data cleaning), so DO NOT publish or otherwise use this dataset for purposes other than this assignment.
This assignment covers steps two and three of the six steps of the CRISP-DM process model (Data Understanding, Data Preparation). (See the CRISP-DM materials on CARMEN.)
The objectives of this assignment are:
- By analyzing and thinking critically about the data, you may identify interesting characteristics that would be valuable when using the data.
- The final, cleaned-up dataset that you create may be used to build classification / regression models in future assignments.
Assume that you are the Director of Marketing Analytics for Buckeye Online Shopping Inc (BOS). BOS wants to improve its market share.
- BOS has obtained data on its existing marketing campaigns.
- BOS leadership has asked your team to analyze the data, and determine if it is suitable for developing analyses and models that would be effective in improving the marketing campaigns at BOS.
- Based on this, BOS may make improvements to specific aspects of their marketing, with a goal of improving BOS’s overall marketing performance.
You decide to look at the dataset, explore (understand) what it contains, and create (prepare) a clean dataset from it that contains the kind of information you think might be useful.
Assumption: In the future, you may be creating a Classification model that predicts one or more of the “class” or “target” variables. For this future assignment, assume you will be predicting the Revenue class variable. You do not yet know the specific Classification algorithm or model you will use. That will come in the future assignment.
Complete Sections 2 and 3 of CRISP-DM in this notebook (and the Conclusion).
- Data Understanding: Perform Exploratory Data Analysis and gain an understanding of the contents and usefulness of the dataset.
- Data Preparation: Clean up the data and save it for future use.
- Conclusion: Describe what you learned from this assignment.
It is essential that you communicate your goals, thought process, actions, results, and conclusions to the audience that will consume this work. It is not enough to show just the code. It is not appropriate to show long sections of unexplained printout, etc. Be kind to your readers and provide value to them!
ALWAYS follow this pattern when doing each portion of the work. This allows us to give feedback and assign scores, and to give partial credit. Make it easy for the reader to understand your work.
- Say (briefly) what you are trying to do, and why.
- Do it (code), and comment your code, so others can understand and evaluate it.
- Show or describe the result clearly (and briefly as needed), and explain the significant conclusions or insights derived from the results.
HAVE FUN!
import numpy as np
import pandas as pd
#!pip install matplotlib
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#matplotlib.use('Qt5Agg')
import seaborn as sns
pd.set_option('display.max_columns', 50) #include to avoid ... in middle of display
This step is essential (and occasionally ignored…). It focuses on understanding the business needs and objectives, and defining a data analysis problem that is relevant and useful to the business, so the right solution is built. Consider the following questions and discuss the possible answers to them in the space below. (You can just edit the existing cells if you like.)
- It is OK to say that you do not know. How might you get answers to the questions?
- You may consider some of the questions to be “not applicable”. If so, why?
- You are encouraged to use markdown instead of raw text in the answer cells, to make them “pretty” and easy to read.
- This list of questions is not comprehensive. Feel free to add other questions and comments.

Context - What is the overview, history, big picture?
Business Objectives - What is the problem or opportunity to be addressed / solved? - What is its business value?
Stakeholders - Who are they? - What are their needs?
General Goals - Understand the past/current state (descriptive)? - Predict the future (predictive)? - Mine for unknown patterns / insights? - Answer questions? What are they? - Support decisions? What are they?
Scope - What is the specific scope of the project? - What, specifically, is NOT in scope? (What are you not doing that someone might think you are doing?) Is the scope flexible? In what way?
Deliverables - Report? - Reusable code? - Presentation?
Success Criteria - What, specifically, must be accomplished for this project to be a success?
Solution Approach - Is this a “one time” (throw away) solution? - Or will the solution become part of repeated process that is used and enhanced over time?
Quality - What level of Correctness or Confidence is needed? - Is this life-critical or are we looking for ‘directionally correct’ results?
Constraints - Are there legal / regulatory implications (privacy, security, liability, HIPAA, FERPA)? - Are there architectural constraints?
Time / Cost - What is the deadline or budget?
Dependencies - Is this project dependent on any other project or activity? - Is any other project or activity dependent on this project?
Resources Needed - Do you potentially need access to specific people, data, etc?
Other Considerations - What other considerations, not covered above, might be relevant?
NOTE: Frequently, a first pass is made through the data to fix or remove “obvious” data quality issues (missing values, duplicate records, outliers (some), etc.). The deeper Data Understanding analysis then is done on the “cleaned” data. This mitigates the issue of bad data skewing the data analysis results. You may choose to do this separate first pass. If you do so, clearly document your approach (a minimal sketch follows).
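If you choose to do this first pass, here is a minimal sketch, assuming the assignment file and the column names examined later in this notebook; the Region filter is an illustrative assumption, not a documented rule:

import pandas as pd

# First-pass cleaning sketch: remove the most obvious quality issues before
# the deeper Data Understanding analysis (assumes the modified assignment file).
df = pd.read_excel("TEB_V2_FOR_ASSIGNMENT_online_shoppers_intention.xlsx",
                   sheet_name="data_EDITED_ERRORS")
df = df.drop_duplicates()      # remove exact duplicate records
df = df.dropna()               # drop the few rows containing missing values
df = df[df["Region"] != 999]   # drop an obviously impossible region code (assumption)
print("Shape after first pass:", df.shape)

***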
Before using a dataset, it is important to gain an initial understanding of the dataset, to see if it meets our needs. In particular, we need to:
- Identify important characteristics of the data (e.g., quality, completeness, currency, consistency, …).
- Decide: is this dataset relevant and appropriate for the goals of this project?
- If not, what could you do? (Find other data to replace or supplement this dataset, etc.)
To accomplish this:
- Find and examine the dataset’s metadata. Be careful! Hand-written metadata is frequently out of date or incorrect.
- Load and briefly examine the characteristics of the dataset.

***
Examine the Metadata.
Collect the information for each attribute.
- Categorical: Nominal, Ordinal; Numerical: Interval, Ratio; Other: Complex or unstructured data (e.g., text strings, images, …).
- This determines what operations are “meaningful” for each attribute. For example, you can’t add two Ordinal attributes.
- You can’t always tell the attribute types from the data itself. For example, an Integer attribute could be Ordinal or Interval.
Create a table, such as:

| Attribute | Type | DataType | UnitOfMeasure | Meaning | Notes |
| --- | --- | --- | --- | --- | --- |
| ROW | Ordinal | Integer | No Unit - Just an index | Record number; not real data | This is handy for documentation purposes |
| etc. | | | | | |
Note any concerns, considerations, or questions regarding this data.
Concerns identified so far (detailed in the output below): the Month column contains a misspelled value (“Fbe”) and both “Jun” and “June”; VisitorType contained the typo “eturning_Visitor”; Informational, Weekend, and Revenue each have one missing value; and several attributes show implausible extremes (Region = 999, OperatingSystems = 44, PageValues ≈ 1.87 billion). Also, ROW is just a record index, not real data.
import pandas as pd

# Load dataset
file_path = "TEB_V2_FOR_ASSIGNMENT_online_shoppers_intention.xlsx"
data_df = pd.read_excel(file_path, sheet_name="data_EDITED_ERRORS")

# --- Data Cleaning Example ---
# Fix the typo in VisitorType
data_df["VisitorType"] = data_df["VisitorType"].replace("eturning_Visitor", "Returning_Visitor")

# --- Explore Metadata ---
# Basic dataset info
print("Shape of dataset:", data_df.shape)
print("\nColumn Data Types:")
print(data_df.dtypes)

# Count missing values
print("\nMissing values per column:")
print(data_df.isnull().sum())

# Unique values in categorical columns
print("\nUnique values in categorical columns:")
for col in data_df.select_dtypes(include=["object", "bool"]).columns:
    print(f"{col}: {data_df[col].unique()}")

# --- Attribute Metadata Table ---
metadata = []
for col in data_df.columns:
    dtype = data_df[col].dtype
    if dtype == "object":
        col_type = "Categorical (Nominal)"
    elif dtype in ["int64", "float64"]:
        col_type = "Numerical"
    else:
        col_type = "Other"
    metadata.append({
        "Attribute": col,
        "Type": col_type,
        "DataType": str(dtype),
        "UnitOfMeasure": "N/A",  # update manually if known
    })

metadata_df = pd.DataFrame(metadata)
print("\nMetadata Table:")
print(metadata_df)
Shape of dataset: (12331, 19)
Column Data Types:
ROW int64
Administrative int64
Administrative_Duration float64
Informational float64
Informational_Duration float64
ProductRelated int64
ProductRelated_Duration float64
BounceRates float64
ExitRates float64
PageValues float64
SpecialDay float64
Month object
OperatingSystems int64
Browser int64
Region int64
TrafficType int64
VisitorType object
Weekend float64
Revenue float64
dtype: object
Missing values per column:
ROW 0
Administrative 0
Administrative_Duration 0
Informational 1
Informational_Duration 0
ProductRelated 0
ProductRelated_Duration 0
BounceRates 0
ExitRates 0
PageValues 0
SpecialDay 0
Month 0
OperatingSystems 0
Browser 0
Region 0
TrafficType 0
VisitorType 0
Weekend 1
Revenue 1
dtype: int64
Unique values in categorical columns:
Month: ['Feb' 'Fbe' 'Mar' 'May' 'Oct' 'June' 'Jul' 'Aug' 'Nov' 'Sep' 'Jun' 'Dec']
VisitorType: ['Returning_Visitor' 'New_Visitor' 'Other']
Metadata Table:
Attribute Type DataType UnitOfMeasure
0 ROW Numerical int64 N/A
1 Administrative Numerical int64 N/A
2 Administrative_Duration Numerical float64 N/A
3 Informational Numerical float64 N/A
4 Informational_Duration Numerical float64 N/A
5 ProductRelated Numerical int64 N/A
6 ProductRelated_Duration Numerical float64 N/A
7 BounceRates Numerical float64 N/A
8 ExitRates Numerical float64 N/A
9 PageValues Numerical float64 N/A
10 SpecialDay Numerical float64 N/A
11 Month Categorical (Nominal) object N/A
12 OperatingSystems Numerical int64 N/A
13 Browser Numerical int64 N/A
14 Region Numerical int64 N/A
15 TrafficType Numerical int64 N/A
16 VisitorType Categorical (Nominal) object N/A
17 Weekend Numerical float64 N/A
18 Revenue Numerical float64 N/A
data_df.head().T
 | 0 | 1 | 2 | 3 | 4 |
--- | --- | --- | --- | --- | --- |
ROW | 1 | 2 | 3 | 4 | 5 |
Administrative | 0 | 0 | 0 | 0 | 0 |
Administrative_Duration | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Informational | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Informational_Duration | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
ProductRelated | 1 | 2 | 1 | 2 | 10 |
ProductRelated_Duration | 0.0 | 64.0 | 0.0 | 2.666667 | 627.5 |
BounceRates | 0.2 | 0.0 | 0.2 | 0.05 | 0.02 |
ExitRates | 0.2 | 0.1 | 0.2 | 0.14 | 0.05 |
PageValues | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
SpecialDay | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Month | Feb | Feb | Feb | Feb | Feb |
OperatingSystems | 1 | 2 | 4 | 3 | 3 |
Browser | 1 | 2 | 1 | 2 | 3 |
Region | 1 | 1 | 9 | 2 | 1 |
TrafficType | 1 | 2 | 3 | 4 | 4 |
VisitorType | Returning_Visitor | Returning_Visitor | Returning_Visitor | Returning_Visitor | Returning_Visitor |
Weekend | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Revenue | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
data_df.describe(include='all').T
 | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
ROW | 12331.0 | NaN | NaN | NaN | 6166.0 | 3559.797419 | 1.0 | 3083.5 | 6166.0 | 9248.5 | 12331.0 |
Administrative | 12331.0 | NaN | NaN | NaN | 2.314979 | 3.321715 | 0.0 | 0.0 | 1.0 | 4.0 | 27.0 |
Administrative_Duration | 12331.0 | NaN | NaN | NaN | 80.812056 | 176.773437 | 0.0 | 0.0 | 7.5 | 93.2125 | 3398.75 |
Informational | 12330.0 | NaN | NaN | NaN | 0.503569 | 1.270156 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 |
Informational_Duration | 12331.0 | NaN | NaN | NaN | 34.469602 | 140.743929 | 0.0 | 0.0 | 0.0 | 0.0 | 2549.375 |
ProductRelated | 12331.0 | NaN | NaN | NaN | 31.729138 | 44.474452 | 0.0 | 7.0 | 18.0 | 38.0 | 705.0 |
ProductRelated_Duration | 12331.0 | NaN | NaN | NaN | 1194.651054 | 1913.620864 | 0.0 | 184.05 | 598.873809 | 1464.104809 | 63973.52223 |
BounceRates | 12331.0 | NaN | NaN | NaN | 0.02219 | 0.048487 | 0.0 | 0.0 | 0.003111 | 0.0168 | 0.2 |
ExitRates | 12331.0 | NaN | NaN | NaN | 0.043075 | 0.048595 | 0.0 | 0.014286 | 0.025161 | 0.05 | 0.2 |
PageValues | 12331.0 | NaN | NaN | NaN | 151358.845908 | 16806994.674174 | 0.0 | 0.0 | 0.0 | 0.0 | 1866333333.0 |
SpecialDay | 12331.0 | NaN | NaN | NaN | 0.061422 | 0.19891 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Month | 12331 | 12 | May | 3364 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
OperatingSystems | 12331.0 | NaN | NaN | NaN | 2.127321 | 0.98612 | 1.0 | 2.0 | 2.0 | 3.0 | 44.0 |
Browser | 12331.0 | NaN | NaN | NaN | 2.357068 | 1.71721 | 1.0 | 2.0 | 2.0 | 2.0 | 13.0 |
Region | 12331.0 | NaN | NaN | NaN | 3.227475 | 9.283858 | 1.0 | 1.0 | 3.0 | 4.0 | 999.0 |
TrafficType | 12331.0 | NaN | NaN | NaN | 4.069419 | 4.025049 | 1.0 | 2.0 | 2.0 | 4.0 | 20.0 |
VisitorType | 12331 | 3 | Returning_Visitor | 10551 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Weekend | 12330.0 | NaN | NaN | NaN | 0.232685 | 0.42256 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Revenue | 12330.0 | NaN | NaN | NaN | 0.154745 | 0.361676 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12331 entries, 0 to 12330
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ROW 12331 non-null int64
1 Administrative 12331 non-null int64
2 Administrative_Duration 12331 non-null float64
3 Informational 12330 non-null float64
4 Informational_Duration 12331 non-null float64
5 ProductRelated 12331 non-null int64
6 ProductRelated_Duration 12331 non-null float64
7 BounceRates 12331 non-null float64
8 ExitRates 12331 non-null float64
9 PageValues 12331 non-null float64
10 SpecialDay 12331 non-null float64
11 Month 12331 non-null object
12 OperatingSystems 12331 non-null int64
13 Browser 12331 non-null int64
14 Region 12331 non-null int64
15 TrafficType 12331 non-null int64
16 VisitorType 12331 non-null object
17 Weekend 12330 non-null float64
18 Revenue 12330 non-null float64
dtypes: float64(10), int64(7), object(2)
memory usage: 1.8+ MB
Discussion:
From your examination of the metadata and the dataset, what is your conclusion about this dataset? Does it appear (so far) to satisfy your needs? Are there any concerns?
The dataset contains 12,331 rows and 19 columns (including the ROW record index), capturing user behavior on an e-commerce site. It includes information on page visits and durations (Administrative, Informational, ProductRelated), engagement metrics (BounceRates, ExitRates, PageValues), technical data (OperatingSystems, Browser, Region), visitor type, and whether the session resulted in a purchase (Revenue). Most columns are numeric, with a few categorical variables like Month and VisitorType. There are a few missing values (one each in Informational, Weekend, and Revenue), which are minimal and easy to handle. Some columns, like PageValues (max ≈ 1.87 billion) and Region (max 999), have extreme outliers, and certain numeric columns (e.g., OperatingSystems, Browser, Region, TrafficType) actually represent categories and may need to be treated as such. Many duration and behavior features are skewed, with a lot of zeros and some very large values. Overall, the dataset is relevant and usable for analyzing user behavior and predicting revenue, but it will require some preprocessing, such as handling outliers, encoding categorical variables, and addressing skewed distributions.
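As a minimal sketch of the follow-up this suggests (using the already-loaded data_df; the column choices are based on the metadata above), the numeric-coded categorical attributes can be recast and the suspicious extremes inspected:

# Hedged sketch: recast numeric-coded categoricals on a copy (these integer
# codes are nominal labels, not quantities) and inspect the extreme values.
tmp = data_df.copy()
for col in ["OperatingSystems", "Browser", "Region", "TrafficType"]:
    tmp[col] = tmp[col].astype("category")
print(tmp.dtypes[["OperatingSystems", "Browser", "Region", "TrafficType"]])

# The suspicious extremes from describe(): Region == 999 and the huge PageValues max.
print("Rows with Region == 999:", (data_df["Region"] == 999).sum())
print(data_df["PageValues"].nlargest(5))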
print("\n DESCRIPTIVE STATISTICS")
print(data_df.describe(percentiles=[0.25, 0.5, 0.75])) # includes mean, std, min, 25%, 50%(median), 75%, max
DESCRIPTIVE STATISTICS
ROW Administrative Administrative_Duration Informational \
count 12331.000000 12331.000000 12331.000000 12330.000000
mean 6166.000000 2.314979 80.812056 0.503569
std 3559.797419 3.321715 176.773437 1.270156
min 1.000000 0.000000 0.000000 0.000000
25% 3083.500000 0.000000 0.000000 0.000000
50% 6166.000000 1.000000 7.500000 0.000000
75% 9248.500000 4.000000 93.212500 0.000000
max 12331.000000 27.000000 3398.750000 24.000000
Informational_Duration ProductRelated ProductRelated_Duration \
count 12331.000000 12331.000000 12331.000000
mean 34.469602 31.729138 1194.651054
std 140.743929 44.474452 1913.620864
min 0.000000 0.000000 0.000000
25% 0.000000 7.000000 184.050000
50% 0.000000 18.000000 598.873809
75% 0.000000 38.000000 1464.104809
max 2549.375000 705.000000 63973.522230
BounceRates ExitRates PageValues SpecialDay \
count 12331.000000 12331.000000 1.233100e+04 12331.000000
mean 0.022190 0.043075 1.513588e+05 0.061422
std 0.048487 0.048595 1.680699e+07 0.198910
min 0.000000 0.000000 0.000000e+00 0.000000
25% 0.000000 0.014286 0.000000e+00 0.000000
50% 0.003111 0.025161 0.000000e+00 0.000000
75% 0.016800 0.050000 0.000000e+00 0.000000
max 0.200000 0.200000 1.866333e+09 1.000000
OperatingSystems Browser Region TrafficType \
count 12331.000000 12331.000000 12331.000000 12331.000000
mean 2.127321 2.357068 3.227475 4.069419
std 0.986120 1.717210 9.283858 4.025049
min 1.000000 1.000000 1.000000 1.000000
25% 2.000000 2.000000 1.000000 2.000000
50% 2.000000 2.000000 3.000000 2.000000
75% 3.000000 2.000000 4.000000 4.000000
max 44.000000 13.000000 999.000000 20.000000
Weekend Revenue
count 12330.000000 12330.000000
mean 0.232685 0.154745
std 0.422560 0.361676
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 1.000000 1.000000
Discussion:
The dataset includes over 12,000 website visits and shows that most users interact more with product-related pages than with administrative or informational pages. On average, users visit about 2 administrative pages and spend around 81 seconds there, while they view roughly 32 product-related pages, spending about 1,200 seconds in total. Most users spend very little time on informational pages, but a few spend much longer, creating some extreme values. Bounce and exit rates are low, meaning users usually explore multiple pages, but only about 15% of sessions result in revenue. Most visits occur on regular days rather than special days, and only about 23% happen on weekends. The data also shows a variety of operating systems, browsers, and regions, though some extreme values suggest unusual or erroneous entries. Overall, most users have low activity, but a small number are highly engaged, creating skewed distributions for time spent and page interactions.
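Since the duration features are heavily right-skewed (many zeros, a few very large values), a log transform is one common remedy; here is a minimal sketch on the already-loaded data_df, shown without modifying it:

import numpy as np

# log1p (log(1 + x)) handles the zeros gracefully and compresses the long
# right tail of the duration features; whether to actually apply it depends
# on the model chosen in the future assignment.
duration_cols = ["Administrative_Duration", "Informational_Duration",
                 "ProductRelated_Duration"]
print(np.log1p(data_df[duration_cols]).describe())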
import matplotlib.pyplot as plt
# Revenue distribution
data_df["Revenue"].value_counts().plot(kind="bar")
plt.title("Revenue Distribution")
plt.xlabel("Revenue (0 = No Purchase, 1 = Purchase)")
plt.ylabel("Count")
plt.show()
# Avg PageValues by Revenue
data_df.groupby("Revenue")["PageValues"].mean().plot(kind="bar")
plt.title("Average PageValues by Revenue")
plt.xlabel("Revenue (0 = No Purchase, 1 = Purchase)")
plt.ylabel("Average PageValues")
plt.show()
# VisitorType distribution
data_df["VisitorType"].value_counts().plot(kind="bar")
plt.title("Visitor Type Distribution")
plt.xlabel("Visitor Type")
plt.ylabel("Count")
plt.show()
# BounceRates distribution
data_df["BounceRates"].plot(kind="hist", bins=30)
plt.title("Distribution of Bounce Rates")
plt.xlabel("Bounce Rate")
plt.ylabel("Frequency")
plt.show()
Discussion:
I looked at some of the most important attributes in the dataset. First, the revenue distribution shows that most shopping sessions do not end in a purchase, which means the dataset is imbalanced. When comparing average PageValues, sessions with purchases show much higher values, suggesting this feature is strongly linked to buying behavior (though the extreme PageValues outlier should be kept in mind). The visitor type distribution shows that most users are returning visitors, while new visitors make up a smaller portion. Finally, the bounce rate distribution shows that most sessions have low bounce rates, but there are some cases where users leave the site almost immediately. Together, these visualizations highlight which customer behaviors are most connected to revenue outcomes.
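A quick check quantifies the class imbalance mentioned above (a minimal sketch using the already-loaded data_df):

# Roughly 85% of sessions end without a purchase, so plain accuracy would be
# a misleading metric for the future Revenue classifier.
print(data_df["Revenue"].value_counts(normalize=True).round(3))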
# Check for missing values
print("Missing Values Per Column: \n\n", data_df.isnull().sum())
# Check for duplicate rows
print("Number of duplicate rows:", data_df.duplicated().sum())
Missing Values Per Column:
ROW 0
Administrative 0
Administrative_Duration 0
Informational 1
Informational_Duration 0
ProductRelated 0
ProductRelated_Duration 0
BounceRates 0
ExitRates 0
PageValues 0
SpecialDay 0
Month 0
OperatingSystems 0
Browser 0
Region 0
TrafficType 0
VisitorType 0
Weekend 1
Revenue 1
dtype: int64
Number of duplicate rows: 0
Discussion:
The dataset is mostly complete, with very few missing values. Only three columns have missing entries: Informational, Weekend, and Revenue each have 1 missing value. This indicates that almost all data points are present, making the dataset reliable for analysis. Additionally, the duplicate check found 0 duplicate rows, so no deduplication is needed. Overall, the dataset is clean; addressing the three missing values (by dropping or imputing those rows) would make the analysis robust.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Numerical vs Numerical: Correlation heatmap
corr = data_df.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap")
plt.show()
# Numerical vs Categorical: PageValues vs Revenue
data_df.boxplot(column="PageValues", by="Revenue")
plt.title("PageValues by Revenue")
plt.suptitle("") # removes automatic pandas title
plt.xlabel("Revenue (0 = No Purchase, 1 = Purchase)")
plt.ylabel("PageValues")
plt.show()
# Categorical vs Categorical: VisitorType vs Revenue
pd.crosstab(data_df["VisitorType"], data_df["Revenue"]).plot(kind="bar")
plt.title("VisitorType vs Revenue")
plt.xlabel("Visitor Type")
plt.ylabel("Count")
plt.show()
Discussion:
I explored relationships between attributes. For the numerical vs numerical case, the correlation heatmap shows that some features, like BounceRates and ExitRates, are strongly related to each other, while most other variables have weaker correlations. For the numerical vs categorical case, the boxplot of PageValues by Revenue clearly shows that customers who made purchases had much higher PageValues compared to those who did not, making this feature a strong indicator of buying behavior. Finally, the categorical vs categorical plot comparing VisitorType with Revenue shows that Returning Visitors are much more likely to generate revenue compared to New Visitors. These results highlight which factors tend to move together and which customer behaviors are most connected to purchase outcomes.
# Correlation of numerical features with target
print("Correlation with Revenue:\n",
data_df.corr(numeric_only=True)['Revenue'].sort_values(ascending=False))
Correlation with Revenue:
Revenue 1.000000
ProductRelated 0.158553
ProductRelated_Duration 0.152380
Administrative 0.138946
ROW 0.121233
Informational 0.095215
Administrative_Duration 0.093595
Informational_Duration 0.070366
Weekend 0.029277
Browser 0.023965
PageValues -0.003853
TrafficType -0.005113
Region -0.006700
OperatingSystems -0.015033
SpecialDay -0.082305
BounceRates -0.150666
ExitRates -0.207108
Name: Revenue, dtype: float64
Discussion:
The correlations show how different factors relate to whether a visit results in revenue. Product-related activity has the strongest positive relationship, with ProductRelated (0.16) and ProductRelated_Duration (0.15) indicating that users who view more products or spend more time on product pages are slightly more likely to make a purchase. Visits to administrative and informational pages also show a small positive correlation with revenue, while Weekend, Browser, and OperatingSystems have very weak positive or near-zero correlations.
On the other hand, negative correlations appear for engagement metrics like BounceRates (-0.15) and ExitRates (-0.21), meaning users who leave quickly or exit pages early are less likely to generate revenue. SpecialDay also has a small negative correlation (-0.08), suggesting that sales are not strongly tied to special events in this dataset. Notably, PageValues shows a near-zero correlation (-0.004) here even though the earlier plots suggested it separates purchasers from non-purchasers; this is likely an artifact of its extreme outlier (max ≈ 1.87 billion) inflating the variance, so the relationship should be re-checked after outlier handling. Overall, product-related engagement is the best predictor of revenue, while quick exits and bounces reduce the likelihood of purchase.
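One way to sanity-check the PageValues anomaly is a rank-based correlation, which is insensitive to extreme outliers; a minimal sketch using the already-loaded data_df:

# Spearman (rank) correlation is robust to the extreme PageValues outlier,
# so it gives a fairer picture of monotonic association with Revenue.
spearman = data_df.corr(numeric_only=True, method="spearman")["Revenue"]
print(spearman.sort_values(ascending=False))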
In this step, the “final” target dataset is constructed (although nothing is ever final…), based on the insights gleaned in the Data Understanding step. This may include:
- Normalizing / standardizing the data.
- Handling missing and erroneous data.
- Transforming or eliminating attributes.
- “Blending” data from multiple sources (with attention to consistency of the various sources).
- Deploying the data to storage in suitable formats, data models / structures.
NOTE: In each of these sections, it is OK to decide to not make changes to the dataset, but you must explain why you made the choice. Recall that the purpose of this step is to produce a clean dataset that is suitable for the intended purposes. So, for example, you might choose to not remove any attributes, but if you have 100 apparently irrelevant attributes, explain why you are keeping them - there may be a good reason. ***
# Fix known typo in VisitorType
data_df["VisitorType"] = data_df["VisitorType"].replace("eturning_Visitor", "Returning_Visitor")

# Check missing values
missing_counts = data_df.isnull().sum()
print("Missing values per column:")
print(missing_counts)
Missing values per column:
Administrative 0
Administrative_Duration 0
Informational 1
Informational_Duration 0
ProductRelated 0
ProductRelated_Duration 0
BounceRates 0
ExitRates 0
PageValues 0
SpecialDay 0
Month 0
OperatingSystems 0
Browser 0
Region 0
TrafficType 0
VisitorType 0
Weekend 1
Revenue 1
Value_to_Bounce 0
dtype: int64
Discussion:
We identified an error in the VisitorType column where “eturning_Visitor” appeared due to a typo; this was corrected to “Returning_Visitor”. The missing-value check shows that Informational, Weekend, and Revenue each still contain one missing value. Since only three of 12,331 rows are affected, they can be dropped (or imputed) with negligible impact, keeping the dataset consistent without introducing artificial data.
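A minimal sketch of that handling step, assuming we simply drop the three affected rows rather than impute:

# Drop the three rows with missing values (Informational, Weekend, Revenue).
# With 12,331 rows, losing 3 is negligible; imputation (e.g., median / mode
# fill) would be the alternative if every row had to be kept.
rows_before = len(data_df)
data_df = data_df.dropna()
print("Dropped", rows_before - len(data_df), "rows;", len(data_df), "remain.")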
# Drop ROW if it exists
data_df = data_df.drop(columns=["ROW"], errors="ignore")
print("Columns after cleaning: ", data_df.columns.tolist())
Columns after cleaning: ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue', 'Value_to_Bounce']
Discussion:
The ROW column was dropped because it is just a record index with no analytical value; keeping it could mislead a model (note its spurious 0.12 correlation with Revenue above). All other attributes were retained for now, pending the modeling assignment.
# Example: Create a new attribute combining PageValues and BounceRates
data_df["Value_to_Bounce"] = data_df["PageValues"] / (data_df["BounceRates"] + 1e-6)
print("New column 'Value_to_Bounce' added. Example values:\n", data_df["Value_to_Bounce"].head())
New column 'Value_to_Bounce' added. Example values:
0 -0.002456
1 0.019679
2 -0.002456
3 -0.015701
4 0.199430
Name: Value_to_Bounce, dtype: float64
Discussion:
A new attribute Value_to_Bounce was created to capture the relationship between PageValues and BounceRates. This may highlight sessions where users viewed valuable products but also had a high likelihood of leaving. This derived attribute could be useful for predicting purchasing intention. (Note: the example values above include negatives, which is impossible for a ratio of non-negative raw values; this cell was apparently executed after standardization, so the ratio should be recomputed on the raw data.)
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Scale numeric columns to mean 0, variance 1.
# NOTE: as selected, num_cols includes the Revenue target and binary flags
# such as Weekend; for a future classifier, the target would normally be
# excluded from scaling.
scaler = StandardScaler()
num_cols = data_df.select_dtypes(include=["int64", "float64"]).columns
data_df[num_cols] = scaler.fit_transform(data_df[num_cols])
print("Numeric columns scaled:\n", num_cols.tolist())

# Encode categories as integer labels.
# NOTE: LabelEncoder imposes an arbitrary numeric order on these nominal
# values; the same instance is re-fit per column, which works because each
# transform happens immediately after its fit.
encoder = LabelEncoder()
data_df["VisitorType"] = encoder.fit_transform(data_df["VisitorType"])
data_df["Month"] = encoder.fit_transform(data_df["Month"])
print("Categorical columns encoded: ['VisitorType', 'Month']")
print("Sample of transformed data:\n", data_df.head())
Numeric columns scaled:
['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue', 'Value_to_Bounce']
Categorical columns encoded: ['VisitorType', 'Month']
Sample of transformed data:
Administrative Administrative_Duration Informational \
0 -0.696951 -0.457169 -0.396478
1 -0.696951 -0.457169 -0.396478
2 -0.696951 -0.457169 -0.396478
3 -0.696951 -0.457169 -0.396478
4 -0.696951 -0.457169 -0.396478
Informational_Duration ProductRelated ProductRelated_Duration \
0 -0.24492 -0.690967 -0.624314
1 -0.24492 -0.668481 -0.590868
2 -0.24492 -0.690967 -0.624314
3 -0.24492 -0.668481 -0.622920
4 -0.24492 -0.488596 -0.296388
BounceRates ExitRates PageValues SpecialDay Month OperatingSystems \
0 3.667343 3.229376 -0.009006 -0.308808 3 -1.143235
1 -0.457661 1.171469 -0.009006 -0.308808 3 -0.129119
2 3.667343 3.229376 -0.009006 -0.308808 3 1.899113
3 0.573590 1.994632 -0.009006 -0.308808 3 0.884997
4 -0.045160 0.142516 -0.009006 -0.308808 3 0.884997
Browser Region TrafficType VisitorType Weekend Revenue \
0 -0.790307 -0.239940 -0.762610 2 -0.550677 -0.427872
1 -0.207943 -0.239940 -0.514156 2 -0.550677 -0.427872
2 -0.790307 0.621806 -0.265702 2 -0.550677 -0.427872
3 -0.207943 -0.132221 -0.017247 2 -0.550677 -0.427872
4 0.374420 -0.239940 -0.017247 2 1.815947 -0.427872
Value_to_Bounce
0 0.014942
1 0.022040
2 0.014942
3 0.010694
4 0.079685
Discussion:
Numerical features were standardized to mean 0 and variance 1, making them comparable across scales. VisitorType and Month were label-encoded for use in machine learning models. This ensures all attributes are in a machine-readable format, though label encoding imposes an arbitrary order on these nominal values; a one-hot alternative is sketched below.
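For models that are sensitive to the spurious ordering introduced by label encoding (e.g., linear models), one-hot encoding is a common alternative; a minimal sketch that produces a new frame rather than modifying data_df:

# One-hot encode the nominal columns: get_dummies creates one 0/1 indicator
# column per category value, avoiding any implied order.
onehot_df = pd.get_dummies(data_df, columns=["VisitorType", "Month"],
                           prefix=["Visitor", "Month"])
print(onehot_df.columns.tolist())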
# Example: random 50% sample (if dataset is very large)
sampled_df = data_df.sample(frac=0.5, random_state=42)
print(sampled_df)
Administrative Administrative_Duration Informational \
8916 0.206234 0.348980 -0.396478
772 1.109418 2.017236 1.178195
12251 -0.696951 -0.457169 0.390859
7793 -0.094828 0.340495 -0.396478
6601 4.722157 2.983193 4.327541
... ... ... ...
1370 -0.395889 -0.281796 -0.396478
12152 -0.395889 -0.400597 -0.396478
11950 -0.696951 -0.457169 -0.396478
9039 -0.395889 -0.457169 -0.396478
5551 -0.696951 -0.457169 -0.396478
Informational_Duration ProductRelated ProductRelated_Duration \
8916 -0.244920 0.365862 -0.074414
772 1.428755 1.152863 0.684192
12251 -0.209393 -0.511081 -0.478511
7793 -0.244920 -0.488596 -0.307275
6601 4.969015 3.064150 1.961674
... ... ... ...
1370 -0.244920 -0.578538 -0.610988
12152 -0.244920 4.997923 2.672222
11950 -0.244920 -0.578538 -0.501766
9039 -0.244920 0.298405 0.052964
5551 -0.244920 -0.645996 -0.606964
BounceRates ExitRates PageValues SpecialDay Month \
8916 -0.367987 -0.618015 -0.009006 -0.308808 9
772 -0.412331 -0.785268 -0.009006 -0.308808 7
12251 0.367340 -0.028976 -0.009006 -0.308808 9
7793 -0.285785 -0.343379 -0.009004 -0.308808 0
6601 -0.320880 -0.608054 -0.009005 -0.308808 0
... ... ... ... ... ...
1370 0.917341 0.828485 -0.009006 -0.308808 7
12152 -0.405825 -0.573988 -0.009006 -0.308808 9
11950 1.604841 1.857438 -0.009006 -0.308808 9
9039 -0.457661 -0.612050 -0.009006 -0.308808 9
5551 -0.457661 0.485500 -0.009006 -0.308808 6
OperatingSystems Browser Region TrafficType VisitorType \
8916 -1.143235 3.286238 0.298651 1.721932 2
772 -0.129119 -0.207943 -0.024503 -0.514156 2
12251 0.884997 -0.207943 0.406370 0.976570 0
7793 -0.129119 1.539148 0.406370 -0.017247 2
6601 -0.129119 -0.207943 -0.024503 -0.762610 2
... ... ... ... ... ...
1370 0.884997 -0.207943 -0.024503 -0.762610 2
12152 -0.129119 -0.207943 -0.024503 -0.514156 2
11950 -1.143235 -0.790307 0.406370 1.473478 2
9039 -0.129119 1.539148 -0.132221 -0.265702 2
5551 -1.143235 -0.790307 -0.024503 -0.762610 2
Weekend Revenue Value_to_Bounce
8916 -0.550677 -0.427872 0.023578
772 -0.550677 2.337149 0.022734
12251 1.815947 -0.427872 0.007867
7793 -0.550677 -0.427872 0.025833
6601 1.815947 -0.427872 0.024730
... ... ... ...
1370 -0.550677 -0.427872 0.012581
12152 -0.550677 -0.427872 0.022846
11950 1.815947 -0.427872 0.013930
9039 -0.550677 -0.427872 0.022040
5551 1.815947 -0.427872 0.022040
[6166 rows x 19 columns]
Discussion:
Since the dataset size is manageable, no record sampling was performed. If computational efficiency becomes an issue, a representative sample could be selected to reduce processing time without losing data diversity.
Discussion:
No additional preparation steps were required in memory; the pandas DataFrame structure was sufficient for analysis. However, since the assignment asks that the cleaned data be saved for future use, persisting it to disk is sketched below.
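A minimal sketch of persisting the cleaned dataset for the future modeling assignment (the output filename is illustrative):

# Save the cleaned, transformed dataset; CSV is portable, and index=False
# avoids writing the DataFrame's integer index as an extra column.
data_df.to_csv("online_shoppers_intention_CLEANED.csv", index=False)
print("Saved cleaned dataset with shape:", data_df.shape)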
In this step, the dataset is analyzed and used to create models and/or answer questions relevant to the business. This may include:
- Developing statistical models for describing or predicting business situations.
- Training data mining (e.g., machine learning) algorithms for classification, clustering, association analysis, outlier detection, and other uses.
NOTE: This step is tightly coupled and iterative with the Data Preparation and Evaluation steps. ***
Discussion:
Fill in…
In this step, the analyses and/or models are evaluated for quality and relevance / usefulness.
NOTE: This step is tightly coupled and iterative with the Modeling step. ***
Discussion:
Fill in…
In this step, data and resulting models and/or analyses are put to use. This may include:
- Deploying data mining / machine learning algorithms to production.
- Communicating the results of the analyses to business stakeholders, so they can take action.

***
Discussion:
Fill in…