October 22, 2024
# Load the dataset
data <- read.csv("ad_impressions.csv")
Set your random seed to 157911 and then randomly partition the data into a 70% training data set and a 30% test data set. We will use the training data to “train” our prediction models, and the test data to test them. What fraction of instances in the training data have a click? What fraction of instances in the test data have a click?
set.seed(157911)
# Randomly split the dataset
sample_indices <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
train_data <- data[sample_indices, ]
test_data <- data[-sample_indices, ]
# Calculate fractions of clicks in training and test sets
train_click_fraction <- mean(train_data$Click)
test_click_fraction <- mean(test_data$Click)
cat("Fraction of instances in training data that have a click:", train_click_fraction*100, "%\n")
## Fraction of instances in training data that have a click: 4.913656 %
cat("Fraction of instances in test data that have a click:", test_click_fraction*100, "%\n")
## Fraction of instances in test data that have a click: 4.811627 %
We are going to build several prediction models for the variable “Click”. The simplest prediction model (“Naive Model”) would contain no X variables at all. Without any X variables, the Naive Model predicts the same value for every instance. Using the training data, the best predictor for the Naive Model is the average CTR (average of “Click”) in the training data. Compute the prediction error for this model for every instance in the test data. What RMSE (root mean square error) do you get for the prediction error in the test data?
# Naive Model prediction
naive_prediction <- mean(train_data$Click, na.rm = TRUE) # na.rm = TRUE in case of missing values
# Calculate RMSE for the test data
naive_rmse <- sqrt(mean((test_data$Click - naive_prediction)^2, na.rm = TRUE))
cat("RMSE for the prediction error in the test data:", naive_rmse, "\n")
## RMSE for the prediction error in the test data: 0.2140143
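As a sanity check on the Naive Model's RMSE, note that for a 0/1 outcome and a constant prediction p, the mean squared error splits into a variance term and a squared-bias term: RMSE = sqrt(ybar * (1 - ybar) + (ybar - p)^2), where ybar is the click fraction in the test data. The sketch below checks this identity on simulated data; the simulated outcome and the constant 0.049 are illustrative assumptions, not values taken from ad_impressions.csv.

```r
# Simulated 0/1 outcome (hypothetical data, not the ad impressions file)
set.seed(1)
y <- rbinom(1000, size = 1, prob = 0.05)  # stand-in for test-set clicks
p <- 0.049                                # constant prediction, e.g. a training-set CTR

# RMSE computed directly from the prediction errors
rmse_direct <- sqrt(mean((y - p)^2))

# Same quantity from the decomposition: outcome variance plus squared bias
y_bar <- mean(y)
rmse_decomposed <- sqrt(y_bar * (1 - y_bar) + (y_bar - p)^2)

all.equal(rmse_direct, rmse_decomposed)
```

Plugging the reported click fractions (0.04913656 for training, 0.04811627 for test) into the same formula gives approximately 0.214, consistent with the RMSE printed above. Almost all of that error comes from the variance term, which no constant prediction can reduce.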
Construct a prediction model (“Model 1”), by running a regression of Click on dummies for each Gender (remember that Gender has three possible values) and dummies for each Age (six possible values). Use the training data to run the regression. (Hint: the R factor command will come in useful here.) What is the interpretation of the intercept? What is the interpretation of the coefficient on Male (Gender=1)?
# Convert Gender and Age to factors
train_data$Gender <- as.factor(train_data$Gender)
train_data$Age <- as.factor(train_data$Age)
# Fit Model 1
model1 <- lm(Click ~ Gender + Age, data = train_data)
summary(model1)
##
## Call:
## lm(formula = Click ~ Gender + Age, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05484 -0.05090 -0.04955 -0.04603 0.95532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0513776 0.0068490 7.501 6.38e-14 ***
## Gender1 -0.0002298 0.0072149 -0.032 0.9746
## Gender2 0.0011137 0.0072381 0.154 0.8777
## Age2 -0.0038593 0.0034881 -1.106 0.2686
## Age3 -0.0015939 0.0032132 -0.496 0.6199
## Age4 -0.0064631 0.0033390 -1.936 0.0529 .
## Age5 -0.0012290 0.0035316 -0.348 0.7279
## Age6 0.0023518 0.0041839 0.562 0.5740
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2161 on 77531 degrees of freedom
## Multiple R-squared: 0.0001449, Adjusted R-squared: 5.465e-05
## F-statistic: 1.605 on 7 and 77531 DF, p-value: 0.1286
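To see why the summary shows only Gender1 and Gender2 (and Age2 to Age6), it can help to look at the dummy columns R builds from a factor. The sketch below uses a hypothetical toy data frame; the gender codes 0, 1, 2 are an assumption based on the three possible values mentioned in the question.

```r
# Hypothetical toy data: three gender codes (assumed to be 0, 1, 2)
toy <- data.frame(Gender = c(0, 1, 2, 1, 0))
toy$Gender <- as.factor(toy$Gender)

# The first factor level is the omitted baseline absorbed by the intercept
levels(toy$Gender)

# Design matrix used by lm(): one intercept column plus a dummy per non-baseline level
X <- model.matrix(~ Gender, data = toy)
colnames(X)   # "(Intercept)" "Gender1" "Gender2"

# relevel() changes which level is treated as the baseline
toy_releveled <- toy
toy_releveled$Gender <- relevel(toy_releveled$Gender, ref = "1")
colnames(model.matrix(~ Gender, data = toy_releveled))   # "(Intercept)" "Gender0" "Gender2"
```

Each coefficient on a dummy is therefore a difference relative to the omitted level, not an absolute click probability.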
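The intercept (0.0514) is the predicted probability of a click (Click = 1) when all dummy variables are set to their baseline levels, that is, for the omitted Gender category and the omitted Age category. For this dataset, users in that baseline group have a predicted click probability of about 5.1%.
The coefficient on Male (Gender1 = -0.0002298) is the difference in predicted click probability between males and the baseline gender, holding Age fixed: about 0.02 percentage points lower. With a p-value of 0.97, this difference is not statistically distinguishable from zero.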
Use Model 1 to predict Click for the test data set. Are there any predicted values less than 0 or greater than 1? What is the RMSE for Model 1 on the test data?
# Predictions using Model 1
test_data$Gender <- as.factor(test_data$Gender)
test_data$Age <- as.factor(test_data$Age)
model1_predictions <- predict(model1, newdata = test_data)
# Check predictions out of bounds
any(model1_predictions < 0 | model1_predictions > 1)
## [1] FALSE
# Calculate RMSE for Model 1
model1_rmse <- sqrt(mean((test_data$Click - model1_predictions)^2))
model1_rmse
## [1] 0.2139816
No predicted values fall below 0 or above 1. The RMSE of Model 1 when predicting Click on the test dataset is approximately 0.214: in the root-mean-square sense, the model's predicted click probabilities deviate from the actual 0/1 click values by about 21.4 percentage points. This is only marginally lower than the Naive Model's RMSE (0.2139816 vs. 0.2140143), which suggests adding further variables, such as Position and Impression, to improve model performance.
Construct a new prediction model (“Model 2”), by running a regression of Click on Position, Depth, Impression, and dummies for each Gender and each Age. Run the regression on the training data. What is the interpretation of the coefficient on Position?
# Fit Model 2
model2 <- lm(Click ~ Position + Depth + Impression + Gender + Age, data = train_data)
summary(model2)
##
## Call:
## lm(formula = Click ~ Position + Depth + Impression + Gender +
## Age, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41835 -0.06065 -0.05185 -0.03395 0.99243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0746594 0.0072479 10.301 < 2e-16 ***
## Position -0.0299465 0.0015202 -19.699 < 2e-16 ***
## Depth 0.0087917 0.0013535 6.495 8.34e-11 ***
## Impression 0.0031361 0.0006444 4.867 1.14e-06 ***
## Gender1 -0.0006708 0.0071946 -0.093 0.9257
## Gender2 0.0008316 0.0072177 0.115 0.9083
## Age2 -0.0041163 0.0034794 -1.183 0.2368
## Age3 -0.0014113 0.0032047 -0.440 0.6597
## Age4 -0.0060950 0.0033308 -1.830 0.0673 .
## Age5 -0.0008692 0.0035222 -0.247 0.8051
## Age6 0.0023123 0.0041725 0.554 0.5795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2155 on 77528 degrees of freedom
## Multiple R-squared: 0.00586, Adjusted R-squared: 0.005731
## F-statistic: 45.7 on 10 and 77528 DF, p-value: < 2.2e-16
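To put the size of the Position coefficient in context, the sketch below converts it to percentage points and compares it with the average click fraction in the training data. Both numbers are copied from the output above; the comparison assumes the linear model's constant per-unit effect and is only meant as a rough scale check.

```r
# Values copied from the Model 2 summary and the training click fraction above
b_position <- -0.0299465
train_ctr  <- 0.04913656

# Change in predicted click probability per one-unit increase in Position,
# expressed in percentage points
pp_change <- b_position * 100

# The same change expressed relative to the average training CTR
relative_change <- b_position / train_ctr

cat("Percentage-point change per unit of Position:", round(pp_change, 2), "\n")
cat("Change relative to average CTR:", round(relative_change * 100, 1), "%\n")
```

A shift of roughly 3 percentage points is large when the average click rate is just under 5%, which is consistent with Position being the strongest predictor in the table.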
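The coefficient on Position (-0.0299) represents the change in the predicted probability of a click (Click = 1) for a one-unit increase in the ad’s position, holding all other variables constant. Each one-unit increase in Position lowers the predicted click probability by about 3.0 percentage points, and the estimate is highly statistically significant (p < 2e-16).
Use Model 2 to predict Click for the test data set. Are there any predicted values less than 0 or greater than 1? What is the RMSE for Model 2 on the test data?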
# Predictions using Model 2
model2_predictions <- predict(model2, newdata = test_data)
# Check predictions out of bounds
any(model2_predictions < 0 | model2_predictions > 1)
## [1] FALSE
# Calculate RMSE for Model 2
model2_rmse <- sqrt(mean((test_data$Click - model2_predictions)^2))
model2_rmse
## [1] 0.2133507
No predicted values fall below 0 or above 1. The test RMSE of Model 2 is 0.2133507, lower than Model 1's 0.2139816, so adding Position, Depth, and Impression enhances the predictive performance of the model. These features likely capture meaningful patterns in the data that help predict click behavior more effectively than Gender and Age alone (as in Model 1).
Which model is the best prediction model: Naive, Model 1, or Model 2? Justify your answer.
# Compare RMSE values
naive_rmse
## [1] 0.2140143
model1_rmse
## [1] 0.2139816
model2_rmse
## [1] 0.2133507
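The three test RMSE values can also be compared in one step. The sketch below hard-codes the values printed above (rather than reusing the fitted objects) so that it runs on its own, then picks the smallest and expresses each model's improvement over the Naive Model as a percentage.

```r
# Test RMSE values copied from the output above
rmse <- c(Naive = 0.2140143, Model1 = 0.2139816, Model2 = 0.2133507)

# Model with the lowest test RMSE
best <- names(which.min(rmse))

# Percentage reduction in RMSE relative to the Naive Model
improvement_pct <- (rmse[["Naive"]] - rmse) / rmse[["Naive"]] * 100

best
round(improvement_pct, 3)
```

Expressing the gap as a percentage makes it clear how small the differences are in absolute terms, even though the ranking of the models is unambiguous.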
Model 2 is the best prediction model because it has the lowest RMSE on the test data.
Naive Model (RMSE 0.2140143): predicts the average of Click from the training data for every instance, so it lacks any feature-specific information.
Model 1 (RMSE 0.2139816): adds Gender and Age, improving slightly over the Naive Model, but these variables alone are weak predictors of click behavior.
Model 2 (RMSE 0.2133507): adds Position, Depth, and Impression, which are more meaningful predictors for click-through rates. The improvement is consistent but modest: the test RMSE is about 0.3% lower than the Naive Model's.
Can you find an even better prediction model by either adding or subtracting variables from your model? (Please Note: This question is very open ended, and you could work for weeks perfecting the model. Please do not obsess over it. Just try a couple things and report what you found. Note that it is possible to make the model better by removing some variables, so that is one simple option. In the next session we will learn a way of automating this model-building process for prediction models.)
# Experiment: Remove Depth and add interaction between Position and Gender
model3 <- lm(Click ~ Position * Gender + Impression + Age, data = train_data)
# Predictions using Model 3
model3_predictions <- predict(model3, newdata = test_data)
# Calculate RMSE for Model 3
model3_rmse <- sqrt(mean((test_data$Click - model3_predictions)^2))
model3_rmse
## [1] 0.2133589
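In Model 3, we removed Depth and added an interaction term between Position and Gender.
Rationale: Depth may contribute less predictive power than the interaction between Position and Gender, as ad placement (captured by Position) might have a different impact depending on gender.
Result: the test RMSE of Model 3 (0.2133589) is marginally higher than that of Model 2 (0.2133507), so this change did not improve on Model 2. The difference is tiny, which suggests that Depth does not add substantial predictive value in this case and that the interaction does not help either. Because both changes were made at once, their separate effects cannot be told apart from this comparison.
Conclusion: Model 2 remains the best model tried so far. Adding an interaction term such as Position * Gender is still a reasonable step when exploring potential improvements, as such terms can capture nuanced relationships in the data.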
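Trying several specifications is easier with a small helper that fits a formula on the training data and returns the test RMSE. The sketch below is one possible way to organize such experiments. The function name test_rmse, the simulated data frame, and the assumed click probabilities are all hypothetical; only the column names mirror the assignment.

```r
# Helper: fit a linear model on the training data and return its test-set RMSE
test_rmse <- function(formula, train, test) {
  fit <- lm(formula, data = train)
  outcome <- all.vars(formula)[1]   # name of the response variable
  sqrt(mean((test[[outcome]] - predict(fit, newdata = test))^2))
}

# Simulated stand-in data (hypothetical values; not ad_impressions.csv)
set.seed(42)
n <- 2000
sim <- data.frame(
  Position   = sample(1:3, n, replace = TRUE),
  Depth      = sample(1:3, n, replace = TRUE),
  Impression = sample(1:5, n, replace = TRUE),
  Gender     = factor(sample(0:2, n, replace = TRUE)),
  Age        = factor(sample(1:6, n, replace = TRUE))
)
# Assumed data-generating process: only Position affects the click probability
sim$Click <- rbinom(n, 1, prob = pmax(0.01, 0.10 - 0.03 * sim$Position))

# 70/30 split, as in the exercise
idx <- sample(seq_len(n), size = floor(0.7 * n))
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

# Candidate specifications, including ones that drop variables
candidates <- list(
  full     = Click ~ Position + Depth + Impression + Gender + Age,
  no_demo  = Click ~ Position + Depth + Impression,
  position = Click ~ Position
)
results <- sapply(candidates, test_rmse, train = sim_train, test = sim_test)
sort(results)
```

With the real data, the same call would use train_data and test_data in place of the simulated frames. Comparing one change at a time (for example, dropping only Depth, or adding only the interaction) makes it easier to attribute any change in RMSE to a specific variable.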