Predicting Ad Clicks Using Logistic Regression in Python

Independent Data Analysis Project

Published

November 14, 2024

Modified

November 14, 2024

Executive Summary

This project applies logistic regression to predict whether users will click on online advertisements, based on demographic and behavioral data. Using a dataset of 1,000 records containing age, daily internet usage, area income, and engagement metrics, we conducted exploratory data analysis (EDA) to uncover key patterns and relationships. After cleaning and transforming the data, including feature engineering to extract temporal components, we built a logistic regression model to predict ad clicks. On the held-out test set the model reached 91% accuracy, with precision of 0.93 and recall of 0.88 for the click class, which suggests that the selected features carry useful signal about user behavior. Key findings suggest that user age, daily internet usage, and time spent on site significantly affect the likelihood of clicking on ads. This analysis illustrates the value of predictive modeling in digital marketing and highlights areas for future improvement using more advanced machine learning techniques.

Keywords

Data analysis, Python, Pandas, Seaborn, Numpy, Descriptive Analysis, Data Science, Machine Learning, Scikit-learn, Logistic Regression

Background

In this project, we analyze a dataset containing information on online advertisements to determine whether an internet user will click on an advertisement based on specific user features. The goal is to create a logistic regression model that can predict whether a user is likely to click on an ad using Python and its associated data science libraries.

The dataset includes various attributes like user demographics, online behavior, and ad-specific data. By leveraging exploratory data analysis (EDA) and feature engineering techniques, we aim to build a robust predictive model.
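
For readers new to the method, the following minimal sketch shows how logistic regression turns a weighted sum of features into a click probability. The weights, intercept, and feature values are hypothetical and chosen only for illustration; they are not fitted to the dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def click_probability(features, weights, intercept):
    """Logistic regression prediction: sigmoid of the linear score."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical standardized feature values and weights (illustration only).
features = [0.5, -1.2]   # e.g. two standardized user features
weights = [0.8, -0.4]    # assumed, not estimated from the data
intercept = 0.1

p = click_probability(features, weights, intercept)
predicted_click = int(p >= 0.5)  # default decision threshold of 0.5
print(f"P(click) = {p:.3f}, predicted class = {predicted_click}")
```

A fitted model estimates the weights and intercept from the training data; the prediction step itself is no more than this.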

Key Takeaways

  • Younger users and those spending more time on the site are more likely to click on ads.

  • Feature engineering and data preprocessing play a crucial role in enhancing model performance.

Libraries Used

We load the Python libraries that we use in the analysis.

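# Data handling and numerics
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Statistical modeling (imported here but not used in the code shown below)
import statsmodels as sts
# Model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression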

Data Overview

We begin by loading and exploring the dataset to understand its structure and the information it contains (James et al. 2013; Muddana and Vinayakam 2024).

Loading the Data

We use Pandas to load the data.

ad_data = pd.read_csv("advertising.csv")

The dataset includes the following features:

  • Daily Time Spent on Site: Time spent on the website in minutes.
  • Age: Age of the user in years.
  • Area Income: Average income of the user’s geographical area.
  • Daily Internet Usage: Average minutes per day spent online.
  • Ad Topic Line: Headline of the advertisement.
  • City: User’s city.
  • Male: Binary indicator of the user’s gender (1 for male, 0 for female).
  • Country: User’s country.
  • Timestamp: Date and time the user clicked on the ad or closed the window.
  • Clicked on Ad: Target variable (1 if the user clicked on the ad, 0 otherwise).

Initial Data Inspection

To get an overview of the data, we display the first few rows and summary statistics:

ad_data.head()
   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  Ad Topic Line                          City            Male  Country     Timestamp            Clicked on Ad
0  68.95                     35   61833.90     256.09                Cloned 5thgeneration orchestration     Wrightburgh     0     Tunisia     2016-03-27 00:53:11  0
1  80.23                     31   68441.85     193.77                Monitored national standardization     West Jodi       1     Nauru       2016-04-04 01:39:02  0
2  69.47                     26   59785.94     236.50                Organic bottom-line service-desk       Davidton        0     San Marino  2016-03-13 20:35:42  0
3  74.15                     29   54806.18     245.89                Triple-buffered reciprocal time-frame  West Terrifurt  1     Italy       2016-01-10 02:31:19  0
4  68.37                     35   73889.99     225.58                Robust logistical utilization          South Manuel    0     Iceland     2016-06-03 03:36:18  0
ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.3+ KB
ad_data.describe()
       Daily Time Spent on Site          Age   Area Income  Daily Internet Usage         Male  Clicked on Ad
count               1000.000000  1000.000000   1000.000000           1000.000000  1000.000000     1000.00000
mean                  65.000200    36.009000  55000.000080            180.000100     0.481000        0.50000
std                   15.853615     8.785562  13414.634022             43.902339     0.499889        0.50025
min                   32.600000    19.000000  13996.500000            104.780000     0.000000        0.00000
25%                   51.360000    29.000000  47031.802500            138.830000     0.000000        0.00000
50%                   68.215000    35.000000  57012.300000            183.130000     0.000000        0.50000
75%                   78.547500    42.000000  65470.635000            218.792500     1.000000        1.00000
max                   91.430000    61.000000  79484.800000            269.960000     1.000000        1.00000
ad_data.describe(include="object")
        Ad Topic Line                       City       Country  Timestamp
count   1000                                1000       1000     1000
unique  1000                                969        237      1000
top     Cloned 5thgeneration orchestration  Lisamouth  France   2016-03-27 00:53:11
freq    1                                   3          9        1
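
The info() output reports 1000 non-null values in every column, and the mean of 0.50 for Clicked on Ad indicates balanced classes. The sketch below shows one way to confirm both points programmatically. It uses a small made-up frame so that it runs on its own; the same calls would apply to ad_data.

```python
import pandas as pd

# Tiny stand-in frame with two of the dataset's column names (values are made up).
sample = pd.DataFrame({
    "Age": [35, 31, 26, 29],
    "Clicked on Ad": [0, 1, 0, 1],
})

missing_per_column = sample.isna().sum()          # count of missing values per column
duplicate_rows = int(sample.duplicated().sum())   # number of fully duplicated rows
class_share = sample["Clicked on Ad"].value_counts(normalize=True)  # class proportions

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(class_share)
```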

Data Visualization

We perform exploratory data analysis to visualize the distribution of key features and uncover potential patterns that may influence whether a user clicks on an advertisement.

Age Distribution

Let us look at the age distribution of the users.

sns.set_style("whitegrid")
# palette has no effect without a hue variable, so it is omitted here
sns.displot(x="Age", data=ad_data)
plt.title("Age Distribution")
plt.show()

Insight: The age distribution shows a concentration of users in the mid-30s, with fewer users at the extremes.

Analyzing Area Income vs. Age

To assess if there’s a relationship between a user’s age and their area’s income, we use a joint plot:

g = sns.jointplot(x="Age", y="Area Income", data=ad_data, hue="Clicked on Ad")
# A joint plot has several axes, so the title is set on the figure itself
g.figure.suptitle("Area Income vs Age", y=1.02)
plt.show()

Insight: There doesn’t appear to be a strong linear correlation between age and area income, though there may be clusters where users are more likely to click on ads.

Daily Time Spent on Site vs. Age

We further explore the relationship between the time spent on the website and the user’s age:

# palette has no effect without a hue variable, so it is omitted here
g = sns.jointplot(x="Age", y="Daily Time Spent on Site", data=ad_data, kind="kde")
g.figure.suptitle("Daily Time Spent on Site vs Age", y=1.02)
plt.show()

Internet Usage Analysis

We examine the relationship between daily time spent on the site and overall internet usage:

g = sns.jointplot(x="Daily Time Spent on Site", y="Daily Internet Usage", data=ad_data, joint_kws={"color":"green"})
# A joint plot has several axes, so the title is set on the figure itself
g.figure.suptitle("Daily Time Spent on Site vs Daily Internet Usage", y=1.02)
plt.show()

Pair Plot Analysis

We generate a pair plot to observe relationships between multiple features:

g = sns.pairplot(ad_data, hue="Clicked on Ad", palette="rocket", corner=True)
# A pair plot is a grid of axes, so the title is set on the figure itself
g.figure.suptitle("Pairwise Relationships", y=1.02)
plt.show()
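
The visual impressions from these plots can be complemented with numbers. As a rough sketch, the per-class feature means and the correlation of each feature with the target give a quick numeric summary. The values below are invented so that the example runs on its own and carry no information about the real data; only the column names follow the dataset.

```python
import pandas as pd

# Made-up values for illustration; column names follow the dataset.
sample = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Area Income": [52000.0, 61000.0, 58000.0, 47000.0, 66000.0, 55000.0],
    "Clicked on Ad": [0, 0, 0, 1, 1, 1],
})

# Average of each feature within each class of the target.
class_means = sample.groupby("Clicked on Ad").mean()

# Pearson correlation of every feature with the target.
target_corr = sample.corr()["Clicked on Ad"].drop("Clicked on Ad")

# Pearson correlation between two individual features.
age_income_corr = sample["Age"].corr(sample["Area Income"])

print(class_means)
print(target_corr)
print(f"Age vs Area Income: {age_income_corr:.2f}")
```

Applied to ad_data, the same three calls would show how strongly each numeric feature differs between users who clicked and users who did not.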

Feature Engineering

To enhance our model, we engineer new features from the existing data.

Extracting Time Features

The Timestamp column contains valuable temporal information. We extract components such as the year, month, day of the week, and hour:

ad_data["Timestamp"] = pd.to_datetime(ad_data["Timestamp"])
ad_data["year"] = ad_data["Timestamp"].dt.year
ad_data["hour"] = ad_data["Timestamp"].dt.hour
ad_data["month"] = ad_data["Timestamp"].dt.month_name()
ad_data["day"] = ad_data["Timestamp"].dt.day_name()
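
To make the result of these accessors concrete, the short sketch below applies them to two timestamps taken from the first rows of the dataset shown earlier. Note that day_name() returns the weekday, not the day of the month.

```python
import pandas as pd

# Two timestamps copied from the first rows of the dataset.
stamps = pd.to_datetime(pd.Series(["2016-03-27 00:53:11", "2016-04-04 01:39:02"]))

parts = pd.DataFrame({
    "year": stamps.dt.year,
    "hour": stamps.dt.hour,
    "month": stamps.dt.month_name(),  # full month name, e.g. "March"
    "day": stamps.dt.day_name(),      # weekday name, not day of month
})
print(parts)
```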

Encoding Categorical Variables

We convert categorical features into dummy variables to make them suitable for modeling:

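# Drop free-text and high-cardinality columns that are not used as model inputs
ad_data = ad_data.drop(["Ad Topic Line", "City", "Timestamp"], axis=1)
# One-hot encode the categorical columns; drop_first removes one redundant level per feature
countries = pd.get_dummies(ad_data["Country"], drop_first=True)
days = pd.get_dummies(ad_data["day"], drop_first=True)
months = pd.get_dummies(ad_data["month"], drop_first=True)
ad_data = pd.concat([ad_data, countries, days, months], axis=1)
# Remove the original categorical columns and the year, which is not used as a feature
ad_data = ad_data.drop(["Country", "day", "month", "year"], axis=1)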
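
The effect of drop_first=True is easiest to see on a toy column. In the sketch below (made-up values), the first level in alphabetical order is dropped and becomes the implicit baseline, which avoids perfectly collinear dummy columns.

```python
import pandas as pd

days = pd.Series(["Monday", "Sunday", "Tuesday", "Sunday"], name="day")

# Levels are sorted alphabetically; drop_first removes the first one ("Monday"),
# which then acts as the baseline category (all dummy columns equal to 0).
dummies = pd.get_dummies(days, drop_first=True)

print(list(dummies.columns))
print(dummies.astype(int))
```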

Logistic Regression

With the data cleaned and prepared, we move on to building our logistic regression model.

Train-Test Split

We split the dataset into training and testing sets:

X = ad_data.drop(['Clicked on Ad'], axis=1)
y = ad_data["Clicked on Ad"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=103)
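
The split above is purely random, which is why the test set in the evaluation further down contains 162 non-clicks and 138 clicks even though the full dataset is balanced. If equal class proportions in both parts are desired, one option is the stratify argument of train_test_split. The sketch below demonstrates it on a toy target; the names X_toy and y_toy are placeholders, and this is not the split that produced the reported results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Balanced toy target: 50 zeros and 50 ones, with a single dummy feature.
y_toy = np.array([0] * 50 + [1] * 50)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=103, stratify=y_toy  # keep class ratio
)

print("test size:", len(y_te), "positives in test:", int(y_te.sum()))
```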

Model Training

We fit a logistic regression model on the training data:

mymodel = LogisticRegression()
mymodel.fit(X_train, y_train)
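
The features used here sit on very different scales (Area Income is in the tens of thousands, while Male is 0 or 1), and scikit-learn's default solver can be slow to converge on unscaled inputs. One common remedy, sketched below on synthetic data, is to standardize the features inside a pipeline and allow more iterations. This is an assumption about what might help, not the configuration that produced the results reported in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
y_toy = np.repeat([0, 1], n // 2)

# Synthetic features on very different scales (loosely mimicking minutes vs. income).
informative = 60 + 15 * y_toy + rng.normal(0, 2.5, n)  # shifts with the class
noise = rng.normal(55000, 13000, n)                     # unrelated to the class
X_toy = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Standardize inside the pipeline so the scaler is fitted on training data only.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
test_accuracy = scaled_model.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.2f}")
```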

Making Predictions

We generate class predictions for the test set.

predictions = mymodel.predict(X_test)
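
The predict method returns hard class labels, which for a binary logistic regression corresponds to a 0.5 cut-off on the predicted probability. When a different trade-off between precision and recall is wanted, predict_proba exposes the probabilities directly. The sketch below uses toy data, and the 0.7 cut-off is a hypothetical choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data: larger values tend to belong to class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

click_proba = toy_model.predict_proba(X_toy)[:, 1]  # P(class 1) for each row
labels_default = toy_model.predict(X_toy)           # equivalent to a 0.5 cut-off
labels_strict = (click_proba >= 0.7).astype(int)    # hypothetical stricter cut-off

print(np.round(click_proba, 3))
print(labels_default, labels_strict)
```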

Model Evaluation

To assess the performance of our model, we use a classification report and confusion matrix:

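# Per-class precision, recall, and F1 on the test set
print(classification_report(y_test, predictions))
# Rows are true classes, columns are predicted classes
conf_matrix = confusion_matrix(y_test, predictions)
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix")
plt.show()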
              precision    recall  f1-score   support

           0       0.90      0.94      0.92       162
           1       0.93      0.88      0.90       138

    accuracy                           0.91       300
   macro avg       0.92      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300

Insights:

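  • For the click class (1), precision is 0.93 and recall is 0.88; for the no-click class (0), precision is 0.90 and recall is 0.94, so neither type of error dominates.
  • Overall accuracy is 0.91 on the 300 test observations, meaning the model classifies roughly nine out of ten users correctly.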
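
To show how the figures in the report follow from a confusion matrix, the worked example below recomputes them from raw counts. The counts are derived so that they match the rounded values and class supports in the report above; they are not read from the plotted matrix and should be treated as an assumption.

```python
# Counts with class 1 = "clicked"; derived to match the rounded report,
# not read from the plotted confusion matrix.
tn, fp, fn, tp = 153, 9, 17, 121

precision = tp / (tp + fp)   # of predicted clicks, the share that were real clicks
recall = tp / (tp + fn)      # of real clicks, the share the model found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.93 recall=0.88 f1=0.90 accuracy=0.91
```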

Conclusion

This project demonstrates how logistic regression can be used to predict user behavior from online activity and demographics. The model reached 91% accuracy on the held-out test set, and the analysis points to user age, internet usage, and site engagement as key factors influencing whether users click on advertisements.
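
One way to check which features a logistic regression relies on is to fit it on standardized inputs and compare the absolute size of the coefficients. The sketch below illustrates the idea on synthetic data; the feature names are hypothetical, and the ranking it prints says nothing about the advertising dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
y_toy = np.repeat([0, 1], n // 2)

# Synthetic stand-ins: one feature related to the class, one unrelated.
X_toy = pd.DataFrame({
    "related_feature": 2.0 * y_toy + rng.normal(0, 1, n),
    "unrelated_feature": rng.normal(0, 1, n),
})

X_scaled = StandardScaler().fit_transform(X_toy)
coef_model = LogisticRegression(max_iter=1000).fit(X_scaled, y_toy)

# On standardized inputs, a larger absolute coefficient means a stronger association.
coefficients = pd.Series(coef_model.coef_[0], index=X_toy.columns)
ranked = coefficients.reindex(coefficients.abs().sort_values(ascending=False).index)
print(ranked)
```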

Future Work

To improve the model’s accuracy, we could explore:

  • Using more sophisticated algorithms like Random Forest or Gradient Boosting.
  • Incorporating additional features from user behavior data to refine predictions.
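
A comparison along these lines could be set up as in the sketch below, which scores three scikit-learn classifiers with 5-fold cross-validation. It runs on synthetic data from make_classification as a stand-in for the prepared features, so the printed scores are not results for the advertising dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the prepared features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate model.
cv_accuracy = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here instead of a single split so that the comparison depends less on one particular partition of the data.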

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.