Predicting Ad Clicks Using Logistic Regression in Python
Independent Data Analysis Project
This project explores the application of logistic regression to predict whether users will click on online advertisements based on demographic and behavioral data. Using a dataset containing information such as age, daily internet usage, area income, and engagement metrics, we conducted extensive exploratory data analysis (EDA) to uncover key patterns and relationships. After cleaning and transforming the data, including feature engineering to extract temporal components, we built a logistic regression model to predict ad clicks. The model achieved a strong balance between precision and recall, indicating its effectiveness in identifying factors that influence user behavior. Key findings suggest that user age, daily internet usage, and time spent on site significantly affect the likelihood of clicking on an ad. This analysis demonstrates the power of predictive modeling in digital marketing and highlights potential areas for future model enhancement using more advanced machine learning techniques.
Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn, Logistic Regression
Background
In this project, we analyze a dataset containing information on online advertisements to determine whether an internet user will click on an advertisement based on specific user features. The goal is to create a logistic regression model that can predict whether a user is likely to click on an ad using Python and its associated data science libraries.
The dataset includes various attributes like user demographics, online behavior, and ad-specific data. By leveraging exploratory data analysis (EDA) and feature engineering techniques, we aim to build a robust predictive model.
Key Takeaways
Older users and those who spend less time on the site are more likely to click on ads.
Feature engineering and data preprocessing play a crucial role in enhancing model performance.
Libraries Used
We load the Python libraries that we use in the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
Data Overview
We begin by loading and exploring the dataset to understand its structure and the information it contains (James et al. 2013; Muddana and Vinayakam 2024).
Loading the Data
We use Pandas to load the data.
ad_data = pd.read_csv("advertising.csv")
The dataset includes the following features:
- Daily Time Spent on Site: Time spent on the website in minutes.
- Age: Age of the user in years.
- Area Income: Average income of the user’s geographical area.
- Daily Internet Usage: Average minutes per day spent online.
- Ad Topic Line: Headline of the advertisement.
- City: User’s city.
- Male: Binary indicator of the user’s gender (1 for male, 0 for female).
- Country: User’s country.
- Timestamp: Date and time the user clicked on the ad or closed the window.
- Clicked on Ad: Target variable (1 if the user clicked on the ad, 0 otherwise).
Initial Data Inspection
To get an overview of the data, we display the first few rows and summary statistics:
ad_data.head()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
| 1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
| 2 | 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
| 3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
| 4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |
ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Daily Time Spent on Site 1000 non-null float64
1 Age 1000 non-null int64
2 Area Income 1000 non-null float64
3 Daily Internet Usage 1000 non-null float64
4 Ad Topic Line 1000 non-null object
5 City 1000 non-null object
6 Male 1000 non-null int64
7 Country 1000 non-null object
8 Timestamp 1000 non-null object
9 Clicked on Ad 1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.3+ KB
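All ten columns report 1000 non-null entries, so the dataset has no missing values to impute. As an optional sanity check (not part of the original workflow), this can be confirmed directly:

# Optional sanity check: count missing values per column (all should be 0).
ad_data.isnull().sum()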
ad_data.describe()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean | 65.000200 | 36.009000 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
| std | 15.853615 | 8.785562 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
| 25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
| 50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
| 75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
| max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |
="object") ad_data.describe(include
| | Ad Topic Line | City | Country | Timestamp |
|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 |
| unique | 1000 | 969 | 237 | 1000 |
| top | Cloned 5thgeneration orchestration | Lisamouth | France | 2016-03-27 00:53:11 |
| freq | 1 | 3 | 9 | 1 |
Data Visualization
We perform exploratory data analysis to visualize the distribution of key features and uncover potential patterns that may influence whether a user clicks on an advertisement.
Age Distribution
Let us look at the age distribution of the users.
"whitegrid")
sns.set_style(="Age", data=ad_data, palette="mako")
sns.displot(x"Age Distribution")
plt.title( plt.show()
Insight: The age distribution shows a concentration of users in the mid-30s, with fewer users at the extremes.
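As an optional numeric check (beyond the original analysis), the summary statistics of Age support this reading:

# Optional: central tendency of Age backs up the "mid-30s" observation.
ad_data["Age"].agg(["mean", "median", "std"])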
Analyzing Area Income vs. Age
To assess if there’s a relationship between a user’s age and their area’s income, we use a joint plot:
="Age", y="Area Income", data=ad_data, hue="Clicked on Ad")
sns.jointplot(x"Area Income vs Age")
plt.title( plt.show()
Insight: There doesn’t appear to be a strong linear correlation between age and area income, though there may be clusters where users are more likely to click on ads.
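An optional numeric check of this visual impression is the Pearson correlation between the two variables; a value near zero indicates a weak linear relationship:

# Optional: Pearson correlation between Age and Area Income.
ad_data["Age"].corr(ad_data["Area Income"])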
Daily Time Spent on Site vs. Age
We further explore the relationship between the time spent on the website and the user’s age:
="Age", y="Daily Time Spent on Site", data=ad_data, kind="kde", palette="rocket")
sns.jointplot(x"Daily Time Spent on Site vs Age")
plt.title( plt.show()
Internet Usage Analysis
We examine the relationship between daily time spent on the site and overall internet usage:
="Daily Time Spent on Site", y="Daily Internet Usage", data=ad_data, joint_kws={"color":"green"})
sns.jointplot(x"Daily Time Spent on Site vs Daily Internet Usage")
plt.title( plt.show()
Pair Plot Analysis
We generate a pair plot to observe relationships between multiple features:
="Clicked on Ad", palette="rocket", corner=True)
sns.pairplot(ad_data, hue"Pairwise Relationships")
plt.title( plt.show()
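As an optional complement to the pair plot (not part of the original analysis), a correlation heatmap of the numeric columns summarizes the same pairwise relationships in a single view:

# Optional: correlation heatmap of the numeric features only.
numeric_cols = ["Daily Time Spent on Site", "Age", "Area Income", "Daily Internet Usage", "Male", "Clicked on Ad"]
sns.heatmap(ad_data[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()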
Feature Engineering
To enhance our model, we engineer new features from the existing data.
Extracting Time Features
The Timestamp column contains valuable temporal information. We extract components such as the year, month, day, and hour:
"Timestamp"] = pd.to_datetime(ad_data["Timestamp"])
ad_data["year"] = ad_data["Timestamp"].dt.year
ad_data["hour"] = ad_data["Timestamp"].dt.hour
ad_data["month"] = ad_data["Timestamp"].dt.month_name()
ad_data["day"] = ad_data["Timestamp"].dt.day_name() ad_data[
Encoding Categorical Variables
We convert categorical features into dummy variables to make them suitable for modeling:
= ad_data.drop(["Ad Topic Line", "City", "Timestamp"], axis=1)
ad_data = pd.get_dummies(ad_data["Country"], drop_first=True)
countries = pd.get_dummies(ad_data["day"], drop_first=True)
days = pd.get_dummies(ad_data["month"], drop_first=True)
months = pd.concat([ad_data, countries, days, months], axis=1)
ad_data = ad_data.drop(["Country", "day", "month", "year"], axis=1) ad_data
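Because Country has 237 unique values, one-hot encoding expands the feature space considerably. An optional shape check makes the resulting dimensionality explicit; with roughly 1,000 rows and over 200 columns, the L2 regularization that scikit-learn's LogisticRegression applies by default helps guard against overfitting.

# Optional: inspect how many columns the dummy encoding produced.
print(ad_data.shape)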
Logistic Regression
With the data cleaned and prepared, we move on to building our logistic regression model.
Train-Test Split
We split the dataset into training and testing sets:
X = ad_data.drop(["Clicked on Ad"], axis=1)
y = ad_data["Clicked on Ad"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=103)
Model Training
We fit a logistic regression model on the training data:
mymodel = LogisticRegression()
mymodel.fit(X_train, y_train)
LogisticRegression()
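Note that LogisticRegression's solver is sensitive to feature scale; columns such as Area Income dwarf the binary dummies. If the solver raises a convergence warning, one remedy (a sketch under those assumptions, not the model evaluated below; scaled_model is an illustrative name) is to standardize the features in a pipeline and raise max_iter:

# Sketch of an alternative fit with feature scaling; not the model evaluated below.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)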
Making Predictions
We make predictions on the test set.
predictions = mymodel.predict(X_test)
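predict() applies a 0.5 threshold to the underlying class probabilities. If we instead wanted to rank users by their likelihood of clicking (an optional step beyond this analysis; click_probs is an illustrative name), predict_proba exposes those probabilities directly:

# Optional: predicted probability of the positive class (Clicked on Ad = 1).
click_probs = mymodel.predict_proba(X_test)[:, 1]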
Model Evaluation
To assess the performance of our model, we use a classification report and confusion matrix:
print(classification_report(y_test, predictions))

conf_matrix = confusion_matrix(y_test, predictions)
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt=".0f")
plt.title("Confusion Matrix")
plt.show()
              precision    recall  f1-score   support

           0       0.90      0.94      0.92       162
           1       0.93      0.88      0.90       138

    accuracy                           0.91       300
   macro avg       0.92      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300
Insights:
- The model achieves a good balance between precision and recall, with an overall accuracy of 91% on the test set.
- The confusion matrix confirms that misclassifications are few for both classes.
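Beyond precision and recall, a threshold-independent summary such as ROC AUC can be computed from the predicted probabilities (an optional metric, not part of the original evaluation):

# Optional: threshold-independent evaluation via ROC AUC.
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, mymodel.predict_proba(X_test)[:, 1]))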
Conclusion
This project demonstrates how logistic regression can be used to predict user behavior based on their online activity and demographics. Our model successfully identified key factors that influence whether users click on advertisements, with a focus on user age, internet usage, and site engagement.
Future Work
To improve the model’s accuracy, we could explore:
- Using more sophisticated algorithms, such as Random Forest or Gradient Boosting (see the sketch after this list).
- Incorporating additional features from user behavior data to refine predictions.
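As a starting point for the first suggestion, here is a sketch of a Random Forest baseline on the same train/test split (the hyperparameters and the rf_model name are illustrative, not tuned or part of the original analysis):

# Sketch: Random Forest baseline on the same split; hyperparameters are illustrative.
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=200, random_state=103)
rf_model.fit(X_train, y_train)
print(classification_report(y_test, rf_model.predict(X_test)))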