Predicting Ad Clicks Using Logistic Regression in Python

Independent Data Analysis Project

Author

John Karuitha, PhD

Affiliations

Karatina University, Department of Business and Economics

University of the Witwatersrand, School of Construction Economics & Management

Published

November 14, 2024

Modified

November 14, 2024

Executive Summary

This project applies logistic regression to predict whether users will click on online advertisements based on demographic and behavioral data. Using a dataset of 1,000 users with information such as age, daily internet usage, area income, and time spent on site, we conducted exploratory data analysis (EDA) to uncover key patterns and relationships. After cleaning and transforming the data, including feature engineering to extract temporal components, we built a logistic regression model to predict ad clicks. The model reached about 91% accuracy on the held-out test set, with precision and recall of roughly 0.9 for both classes. The analysis suggests that user age, daily internet usage, and time spent on site are associated with the likelihood of clicking on ads. This analysis illustrates the use of predictive modeling in digital marketing and highlights potential areas for future model enhancement using more advanced machine learning techniques.

Keywords

Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn, Logistic Regression

Background

In this project, we analyze a dataset containing information on online advertisements to determine whether an internet user will click on an advertisement based on specific user features. The goal is to create a logistic regression model that can predict whether a user is likely to click on an ad using Python and its associated data science libraries.

The dataset includes various attributes like user demographics, online behavior, and ad-specific data. By leveraging exploratory data analysis (EDA) and feature engineering techniques, we aim to build a robust predictive model.

Key Takeaways

Younger users and those spending more time on the site are more likely to click on ads.
Feature engineering and data preprocessing play a crucial role in enhancing model performance.

Libraries Used

We load the Python libraries that we use in the analysis.

# Data handling and numerics
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Modeling and evaluation (the unused statsmodels import was dropped)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression

Data Overview

We begin by loading and exploring the dataset to understand its structure and the information it contains (James et al. 2013; Muddana and Vinayakam 2024).

Loading the Data

We use Pandas to load the data.

ad_data = pd.read_csv("advertising.csv")

The dataset includes the following features:

Daily Time Spent on Site: Time spent on the website in minutes.
Age: Age of the user in years.
Area Income: Average income of the user’s geographical area.
Daily Internet Usage: Average minutes per day spent online.
Ad Topic Line: Headline of the advertisement.
City: User’s city.
Male: Binary indicator of the user’s gender (1 for male, 0 for female).
Country: User’s country.
Timestamp: Date and time the user clicked on the ad or closed the window.
Clicked on Ad: Target variable (1 if the user clicked on the ad, 0 otherwise).

Initial Data Inspection

To get an overview of the data, we display the first few rows and summary statistics:

ad_data.head()

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Ad Topic Line	City	Male	Country	Timestamp
0	68.95	35	61833.90	256.09	Cloned 5thgeneration orchestration	Wrightburgh	0	Tunisia	2016-03-27 00:53:11
1	80.23	31	68441.85	193.77	Monitored national standardization	West Jodi	1	Nauru	2016-04-04 01:39:02
2	69.47	26	59785.94	236.50	Organic bottom-line service-desk	Davidton	0	San Marino	2016-03-13 20:35:42
3	74.15	29	54806.18	245.89	Triple-buffered reciprocal time-frame	West Terrifurt	1	Italy	2016-01-10 02:31:19
4	68.37	35	73889.99	225.58	Robust logistical utilization	South Manuel	0	Iceland	2016-06-03 03:36:18

ad_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.3+ KB

ad_data.describe()

	Daily Time Spent on Site	Age	Area Income	Daily Internet Usage	Male	Clicked on Ad
count	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.00000
mean	65.000200	36.009000	55000.000080	180.000100	0.481000	0.50000
std	15.853615	8.785562	13414.634022	43.902339	0.499889	0.50025
min	32.600000	19.000000	13996.500000	104.780000	0.000000	0.00000
25%	51.360000	29.000000	47031.802500	138.830000	0.000000	0.00000
50%	68.215000	35.000000	57012.300000	183.130000	0.000000	0.50000
75%	78.547500	42.000000	65470.635000	218.792500	1.000000	1.00000
max	91.430000	61.000000	79484.800000	269.960000	1.000000	1.00000

ad_data.describe(include="object")

	Ad Topic Line	City	Country	Timestamp
count	1000	1000	1000	1000
unique	1000	969	237	1000
top	Cloned 5thgeneration orchestration	Lisamouth	France	2016-03-27 00:53:11
freq	1	3	9	1
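
The summary statistics show a balanced target (the mean of Clicked on Ad is 0.50), so a natural follow-up check is to compare feature means between users who clicked and those who did not. The sketch below shows the idea on a small made-up DataFrame; the values are invented for illustration, are not taken from advertising.csv, and imply nothing about the direction of any effect in the real data.

```python
import pandas as pd

# Hypothetical toy rows that reuse the dataset's column names; the values
# are made up for illustration and are not taken from advertising.csv.
toy = pd.DataFrame({
    "Daily Time Spent on Site": [80.0, 75.0, 45.0, 50.0],
    "Age": [28, 31, 44, 40],
    "Clicked on Ad": [0, 0, 1, 1],
})

# Mean of each numeric feature within each class of the target.
class_means = toy.groupby("Clicked on Ad").mean()
print(class_means)
```

On the real data, the equivalent call would be ad_data.groupby("Clicked on Ad").mean(numeric_only=True), which skips the text columns.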

Data Visualization

We perform exploratory data analysis to visualize the distribution of key features and uncover potential patterns that may influence whether a user clicks on an advertisement.

Age Distribution

Let us look at the age distribution of the users.

sns.set_style("whitegrid")
# palette only takes effect together with hue, so it is omitted here
sns.displot(x="Age", data=ad_data)
plt.title("Age Distribution")
plt.show()

Insight: The age distribution shows a concentration of users in the mid-30s, with fewer users at the extremes.

Analyzing Area Income vs. Age

To assess if there’s a relationship between a user’s age and their area’s income, we use a joint plot:

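g = sns.jointplot(x="Age", y="Area Income", data=ad_data, hue="Clicked on Ad")
# plt.title would label only the currently active axes; suptitle titles the whole figure
g.fig.suptitle("Area Income vs Age", y=1.02)
plt.show()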

Insight: There doesn’t appear to be a strong linear correlation between age and area income, though there may be clusters where users are more likely to click on ads.

Daily Time Spent on Site vs. Age

We further explore the relationship between the time spent on the website and the user’s age:

# palette is dropped because it only takes effect together with hue
g = sns.jointplot(x="Age", y="Daily Time Spent on Site", data=ad_data, kind="kde")
g.fig.suptitle("Daily Time Spent on Site vs Age", y=1.02)
plt.show()

Internet Usage Analysis

We examine the relationship between daily time spent on the site and overall internet usage:

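g = sns.jointplot(x="Daily Time Spent on Site", y="Daily Internet Usage", data=ad_data, joint_kws={"color": "green"})
# suptitle places the title above the whole joint plot rather than on one sub-axes
g.fig.suptitle("Daily Time Spent on Site vs Daily Internet Usage", y=1.02)
plt.show()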

Pair Plot Analysis

We generate a pair plot to observe relationships between multiple features:

g = sns.pairplot(ad_data, hue="Clicked on Ad", palette="rocket", corner=True)
# pairplot returns a grid of axes, so the title is set on the figure itself
g.fig.suptitle("Pairwise Relationships", y=1.02)
plt.show()

Feature Engineering

To enhance our model, we engineer new features from the existing data.

Extracting Time Features

The Timestamp column contains valuable temporal information. We parse it as a datetime and extract the year, the hour, the month name, and the day of the week:

ad_data["Timestamp"] = pd.to_datetime(ad_data["Timestamp"])
ad_data["year"] = ad_data["Timestamp"].dt.year
ad_data["hour"] = ad_data["Timestamp"].dt.hour
ad_data["month"] = ad_data["Timestamp"].dt.month_name()
ad_data["day"] = ad_data["Timestamp"].dt.day_name()

Encoding Categorical Variables

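We drop the columns that are almost unique per row (Ad Topic Line with 1,000 distinct values, City with 969) along with the raw Timestamp, then convert the remaining categorical features (Country, day, and month) into dummy variables to make them suitable for modeling. The year column is dropped as well: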

ad_data = ad_data.drop(["Ad Topic Line", "City", "Timestamp"], axis=1)
countries = pd.get_dummies(ad_data["Country"], drop_first=True)
days = pd.get_dummies(ad_data["day"], drop_first=True)
months = pd.get_dummies(ad_data["month"], drop_first=True)
ad_data = pd.concat([ad_data, countries, days, months], axis=1)
ad_data = ad_data.drop(["Country", "day", "month", "year"], axis=1)
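
To make the effect of drop_first=True concrete, here is a minimal sketch on a hypothetical three-row column (the toy frame is an assumption, not part of the dataset). One category is left out and acts as the baseline, which avoids perfectly collinear indicator columns.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column.
toy = pd.DataFrame({"day": ["Monday", "Tuesday", "Sunday"]})

# drop_first=True drops the first category in sorted order ("Monday" here),
# which becomes the baseline that the remaining indicators are compared to.
dummies = pd.get_dummies(toy["day"], drop_first=True)
print(list(dummies.columns))
```

A row with all indicators equal to zero therefore represents the baseline category.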

Logistic Regression

With the data cleaned and prepared, we move on to building our logistic regression model.
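
As a reminder of what the model computes, logistic regression passes a linear combination of the features through the logistic (sigmoid) function to obtain a click probability. The sketch below uses a single feature with a hypothetical intercept and coefficient; the numbers are illustrative assumptions, not estimates from this dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficient for one feature (illustrative values).
intercept, coef = -2.0, 0.05
age = np.array([20, 40, 60])

# Linear score, then the logistic transform gives P(click = 1 | age).
prob_click = sigmoid(intercept + coef * age)
print(prob_click.round(3))
```

A score of zero maps to a probability of exactly 0.5, which is the default decision boundary between the two classes.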

Train-Test Split

We split the dataset into training and testing sets:

X = ad_data.drop(['Clicked on Ad'], axis=1)
y = ad_data["Clicked on Ad"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=103)
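
A plain random split does not guarantee that the test set keeps the 50/50 class balance of the full data. If that matters, train_test_split accepts a stratify argument. The sketch below demonstrates it on synthetic placeholder data (X_demo and y_demo are assumptions, not the ad data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic balanced target: 500 zeros and 500 ones (mirrors the 50/50 split
# reported by describe(), but the feature values are random placeholders).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.repeat([0, 1], 500)

# stratify=y_demo keeps the class proportions the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=103, stratify=y_demo
)
print(np.bincount(y_te))
```

With stratification, the 300 test rows contain 150 observations of each class.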

Model Training

We fit a logistic regression model on the training data:

mymodel = LogisticRegression()
mymodel.fit(X_train, y_train)

LogisticRegression()
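
The features sit on very different scales (Area Income is in the tens of thousands, while Male is 0 or 1), and the default solver can be slow to converge on unscaled inputs. One common option, sketched here on synthetic stand-in data, is to standardize the features inside a pipeline before fitting. The names scaled_model, X_demo and y_demo, and the max_iter value, are assumptions for illustration and not the setup used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ad data (not the real advertising.csv).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=103)

# Scale features to zero mean and unit variance, then fit the classifier.
# max_iter=1000 is an assumed value, raised from the default of 100.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
test_accuracy = scaled_model.score(X_te, y_te)
print(round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into preprocessing.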


Making Predictions

We make the predictions on the test set.

predictions = mymodel.predict(X_test)
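
The predict method returns hard 0/1 labels. When a ranking of users or a custom decision threshold is needed, predict_proba gives the estimated click probability instead. The sketch below, on synthetic data with a hypothetical demo_model, shows that thresholding those probabilities at 0.5 reproduces the labels from predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for X_train / X_test (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1.
click_prob = demo_model.predict_proba(X_demo)[:, 1]

# Thresholding at 0.5 reproduces the labels returned by predict().
labels_from_prob = (click_prob > 0.5).astype(int)
print(np.array_equal(labels_from_prob, demo_model.predict(X_demo)))
```

Raising the threshold above 0.5 would trade recall for precision, and lowering it would do the opposite.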

Model Evaluation

To assess the performance of our model, we use a classification report and confusion matrix:

print(classification_report(y_test, predictions))
conf_matrix = confusion_matrix(y_test, predictions)
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt=".0f")
plt.title("Confusion Matrix")
plt.show()

              precision    recall  f1-score   support

           0       0.90      0.94      0.92       162
           1       0.93      0.88      0.90       138

    accuracy                           0.91       300
   macro avg       0.92      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300

Insights:

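The model reaches 91% accuracy on the 300 test observations, with precision and recall close to 0.9 for both classes.
For the click class (1), precision is 0.93 and recall is 0.88, so the model misses about 12% of actual clicks, and about 7% of its predicted clicks are false positives.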
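
The figures in the classification report can be derived directly from the confusion matrix. The sketch below works through the arithmetic on a hypothetical 2x2 matrix; the counts are illustrative and are not the counts produced by the model above.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
# These counts are illustrative, not the counts from the model above.
cm = np.array([[90, 10],
               [20, 80]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)          # share of predicted clicks that were real
recall = tp / (tp + fn)             # share of real clicks that were found
accuracy = (tp + tn) / cm.sum()     # share of all predictions that were right
print(precision, recall, accuracy)
```

Applying the same three formulas to conf_matrix would reproduce the class 1 row and the accuracy line of the report.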

Conclusion

This project demonstrates how logistic regression can be used to predict user behavior based on online activity and demographics. The fitted model classified about 91% of test-set users correctly. The analysis points to user age, internet usage, and site engagement as the main factors associated with ad clicks, although the individual model coefficients were not examined here.
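
One way to check which features a logistic regression leans on is to inspect its coefficients after standardizing the inputs. The sketch below does this on synthetic data in which the outcome is constructed to depend on a column named Age and not on a column named noise. The data, column names, and effect size are assumptions for illustration, not results from the ad dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: the outcome depends strongly on "Age" and not at all on
# "noise". Column names and effect sizes are assumptions for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Age": rng.normal(36, 9, 500), "noise": rng.normal(0, 1, 500)})
logit = 0.3 * (demo["Age"] - 36)
clicked = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(demo)
coef_model = LogisticRegression(max_iter=1000).fit(X_std, clicked)

# One coefficient per feature; exp(coef) is the odds ratio per standard deviation.
coefs = pd.Series(coef_model.coef_[0], index=demo.columns)
print(coefs.sort_values(key=np.abs, ascending=False))
```

In the real analysis, the same pattern would use X_train.columns as the index, with the caveat that hundreds of sparse country dummies make individual coefficients harder to interpret.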

Future Work

To improve the model’s accuracy, we could explore:

Using more sophisticated algorithms like Random Forest or Gradient Boosting.
Incorporating additional features from user behavior data to refine predictions.
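
As a starting point for the first item, the sketch below compares logistic regression with a random forest and gradient boosting using 5-fold cross-validation. It runs on synthetic data (X_demo and y_demo are placeholders), so the scores say nothing about how these models would rank on the ad dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; results here say nothing about the ad dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each candidate model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation gives a more stable comparison than a single train-test split, which matters with only 1,000 observations.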

References

James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.