Lab Assignment #8

Agata Braja, Amar Amir

October 22, 2024

# Load the dataset
data <- read.csv("ad_impressions.csv")

Question 1

Set your random seed to 157911 and then randomly partition the data into a 70% training data set and a 30% test data set. We will use the training data to “train” our prediction models, and the test data to test them. What fraction of instances in the training data have a click? What fraction of instances in the test data have a click?

set.seed(157911)

# Randomly split the dataset
sample_indices <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[sample_indices, ]
test_data <- data[-sample_indices, ]

# Calculate fractions of clicks in training and test sets
train_click_fraction <- mean(train_data$Click)
test_click_fraction <- mean(test_data$Click)

cat("Fraction of instances in training data that have a click:", train_click_fraction*100, "%\n") 
## Fraction of instances in training data that have a click: 4.913656 %
cat("Fraction of instances in test data that have a click:", test_click_fraction*100, "%\n") 
## Fraction of instances in test data that have a click: 4.811627 %

Question 2

We are going to build several prediction models for the variable “Click”. The simplest prediction model (“Naive Model”) would contain no X variables at all. Without any X variables, the Naive Model predicts the same value for every instance. Using the training data, the best predictor for the Naive Model is the average CTR (average of “Click”) in the training data. Compute the prediction error for this model for every instance in the test data. What RMSE (root mean square error) do you get for the prediction error in the test data?

# Naive Model prediction
naive_prediction <- mean(train_data$Click, na.rm = TRUE) # na.rm = TRUE in case of missing values

# Calculate RMSE for the test data
naive_rmse <- sqrt(mean((test_data$Click - naive_prediction)^2, na.rm = TRUE))
cat("RMSE for the prediction error in the test data:", naive_rmse, "\n")
## RMSE for the prediction error in the test data: 0.2140143
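
Because Click is a 0/1 variable, the naive RMSE has a simple interpretation: for a constant prediction c and a test-set click rate p, the mean squared error equals p(1 − p) + (p − c)^2, so the naive RMSE is essentially the standard deviation of Click in the test data. A quick sketch verifying this, reusing the objects defined above (an illustration, not part of the required answer):

# For a 0/1 outcome, Click^2 equals Click, so the mean squared error of a constant
# prediction c is p*(1 - p) + (p - c)^2, where p is the click rate in the test data
p_test <- mean(test_data$Click)
analytic_rmse <- sqrt(p_test * (1 - p_test) + (p_test - naive_prediction)^2)
analytic_rmse  # should reproduce naive_rmse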

Question 3

Construct a prediction model (“Model 1”), by running a regression of Click on dummies for each Gender (remember that Gender has three possible values) and dummies for each Age (six possible values). Use the training data to run the regression. (Hint: the R factor command will come in useful here.) What is the interpretation of the intercept? What is the interpretation of the coefficient on Male (Gender=1)?

# Convert Gender and Age to factors
train_data$Gender <- as.factor(train_data$Gender)
train_data$Age <- as.factor(train_data$Age)

# Fit Model 1
model1 <- lm(Click ~ Gender + Age, data = train_data)
summary(model1)
## 
## Call:
## lm(formula = Click ~ Gender + Age, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05484 -0.05090 -0.04955 -0.04603  0.95532 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0513776  0.0068490   7.501 6.38e-14 ***
## Gender1     -0.0002298  0.0072149  -0.032   0.9746    
## Gender2      0.0011137  0.0072381   0.154   0.8777    
## Age2        -0.0038593  0.0034881  -1.106   0.2686    
## Age3        -0.0015939  0.0032132  -0.496   0.6199    
## Age4        -0.0064631  0.0033390  -1.936   0.0529 .  
## Age5        -0.0012290  0.0035316  -0.348   0.7279    
## Age6         0.0023518  0.0041839   0.562   0.5740    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2161 on 77531 degrees of freedom
## Multiple R-squared:  0.0001449,  Adjusted R-squared:  5.465e-05 
## F-statistic: 1.605 on 7 and 77531 DF,  p-value: 0.1286

Model 1: Interpretation of Results

Intercept

  • Estimate: 0.0513776
    • The intercept represents the predicted probability of a click (Click = 1) when all dummy variables are set to their baseline levels. For this dataset:
      • Baseline Gender: Likely Gender = 0 (unknown).
      • Baseline Age: Likely Age = 1 (0-12 years old).
    • Interpretation: When the gender is unknown and the user belongs to the youngest age group (0-12), the predicted probability of a click is approximately 5.14%.
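
The baseline levels assumed here can be checked directly: as.factor() orders levels by their sorted values, and lm() omits the first level as the reference category. A quick sketch (the exact labels depend on how Gender and Age are coded in the data):

# The first level of each factor is the reference (omitted) category in lm()
levels(train_data$Gender)  # expected "0" "1" "2" if Gender is coded 0/1/2
levels(train_data$Age)     # expected "1" through "6" if Age is coded 1-6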

Gender Variables

  • Gender = 1 (Male)
    • Estimate: -0.0002298
    • This coefficient represents the change in the predicted probability of a click when the user is male compared to the baseline gender (unknown).
    • Interpretation: Being male (Gender = 1) decreases the probability of a click by approximately 0.02 percentage points compared to the baseline (Gender = 0). However, the p-value (0.9746) suggests this effect is not statistically significant.

Question 4

Use Model 1 to predict Click for the test data set. Are there any predicted values less than 0 or greater than 1? What is the RMSE for Model 1 on the test data?

# Predictions using Model 1
test_data$Gender <- as.factor(test_data$Gender)
test_data$Age <- as.factor(test_data$Age)

model1_predictions <- predict(model1, newdata = test_data)

# Check predictions out of bounds
any(model1_predictions < 0 | model1_predictions > 1)
## [1] FALSE
# Calculate RMSE for Model 1
model1_rmse <- sqrt(mean((test_data$Click - model1_predictions)^2))
model1_rmse
## [1] 0.2139816

Model 1: Predictions and RMSE Interpretation

Predicted Values Range

  • After making predictions using Model 1, no predicted values were found to be less than 0 or greater than 1.
    • Interpretation: In this case the fitted values happen to stay within the valid range of [0, 1]. A linear probability model does not guarantee this in general, so it is worth checking, but here all predictions are valid probabilities.
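
A quick way to see how close the fitted values actually come to those bounds is to inspect their range (a sketch; output not shown):

# Smallest and largest predicted click probabilities from Model 1
range(model1_predictions)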

Root Mean Square Error (RMSE)

  • RMSE: 0.2139816
    • The RMSE summarizes the prediction error for Click on the test dataset. An RMSE of approximately 0.214 means that the model’s predicted click probabilities differ from the actual 0/1 click outcomes by about 0.214 on a root-mean-square basis. Because Click is a rare binary outcome (roughly a 5% CTR), most of this error comes from the few instances that were actually clicked.

Key Insight

  • While the RMSE provides a quantitative measure of prediction error, Model 1’s RMSE is barely below the Naive baseline, indicating that Gender and Age carry very little predictive information about click behavior. This suggests adding more informative variables, such as Position and Impression, to improve model performance.

Question 5

Construct a new prediction model (“Model 2”), by running a regression of Click on Position, Depth, Impression, and dummies for each Gender and each Age. Run the regression on the training data. What is the interpretation of the coefficient on Position?

# Fit Model 2
model2 <- lm(Click ~ Position + Depth + Impression + Gender + Age, data = train_data)
summary(model2)
## 
## Call:
## lm(formula = Click ~ Position + Depth + Impression + Gender + 
##     Age, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41835 -0.06065 -0.05185 -0.03395  0.99243 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0746594  0.0072479  10.301  < 2e-16 ***
## Position    -0.0299465  0.0015202 -19.699  < 2e-16 ***
## Depth        0.0087917  0.0013535   6.495 8.34e-11 ***
## Impression   0.0031361  0.0006444   4.867 1.14e-06 ***
## Gender1     -0.0006708  0.0071946  -0.093   0.9257    
## Gender2      0.0008316  0.0072177   0.115   0.9083    
## Age2        -0.0041163  0.0034794  -1.183   0.2368    
## Age3        -0.0014113  0.0032047  -0.440   0.6597    
## Age4        -0.0060950  0.0033308  -1.830   0.0673 .  
## Age5        -0.0008692  0.0035222  -0.247   0.8051    
## Age6         0.0023123  0.0041725   0.554   0.5795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2155 on 77528 degrees of freedom
## Multiple R-squared:  0.00586,    Adjusted R-squared:  0.005731 
## F-statistic:  45.7 on 10 and 77528 DF,  p-value: < 2.2e-16

Model 2: Interpretation of the Coefficient on Position

Position

  • Estimate: -0.0299465
    • The coefficient on Position represents the change in the predicted probability of a click (Click = 1) for a one-unit increase in the ad’s position, holding all other variables constant.
    • Interpretation: For each one-unit increase in the position of the ad (e.g., moving the ad further down the list), the predicted probability of a click decreases by approximately 3 percentage points.
    • This result is highly statistically significant, as indicated by the p-value (< 2e-16).

Key Insight

  • Ads appearing in more prominent positions (lower Position values) tend to have a higher probability of being clicked. This aligns with the general intuition that users are more likely to click on ads placed near the top of a search results page.
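
A simple descriptive check of this pattern (a sketch, separate from the regression) is to compute the raw click-through rate at each ad position in the training data; if the Model 2 coefficient reflects a real pattern, the raw CTR should generally fall as Position increases:

# Raw CTR by ad position in the training data
ctr_by_position <- aggregate(Click ~ Position, data = train_data, FUN = mean)
ctr_by_position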

Question 6

Use Model 2 to predict Click for the test data set. Are there any predicted values less than 0 or greater than 1? What is the RMSE for Model 2 on the test data?

# Predictions using Model 2
model2_predictions <- predict(model2, newdata = test_data)

# Check predictions out of bounds
any(model2_predictions < 0 | model2_predictions > 1)
## [1] FALSE
# Calculate RMSE for Model 2
model2_rmse <- sqrt(mean((test_data$Click - model2_predictions)^2))
model2_rmse
## [1] 0.2133507

Model 2: Predictions and RMSE Interpretation

Predicted Values Range

  • After making predictions using Model 2, no predicted values were found to be less than 0 or greater than 1.
    • Interpretation: As with Model 1, the fitted values happen to stay within the valid range of [0, 1], so Model 2 also produces valid click probabilities on the test data, even though a linear probability model does not guarantee this.

Root Mean Square Error (RMSE)

  • RMSE: 0.2133507
    • The RMSE for Model 2 is slightly lower than that of Model 1 (0.2139816), indicating that Model 2 achieves somewhat better predictive accuracy on the test dataset.
    • An RMSE of approximately 0.213 means that Model 2’s predicted click probabilities differ from the actual 0/1 click outcomes by about 0.213 on a root-mean-square basis.

Key Insight

  • The improved RMSE suggests that adding variables such as Position, Depth, and Impression enhances the predictive performance of the model. These features likely capture meaningful patterns in the data that help predict click behavior more effectively than just Gender and Age alone (as in Model 1).

Question 7

Which model is the best prediction model: Naive, Model 1, or Model 2? Justify your answer.

# Compare RMSE values
naive_rmse
## [1] 0.2140143
model1_rmse
## [1] 0.2139816
model2_rmse
## [1] 0.2133507

Comparison of Prediction Models: Naive, Model 1, and Model 2

RMSE Values

  • Naive Model RMSE: 0.2140143
  • Model 1 RMSE: 0.2139816
  • Model 2 RMSE: 0.2133507

Best Prediction Model

  • Based on the RMSE values, Model 2 has the lowest RMSE (0.2133507), indicating that it achieves the best predictive performance among the three models.
  • Justification:
    • The Naive Model provides a baseline prediction using only the mean of Click from the training data, which lacks any feature-specific information.
    • Model 1 incorporates Gender and Age, improving slightly over the Naive Model, but these variables alone are weak predictors of click behavior.
    • Model 2 further incorporates Position, Depth, and Impression, which are more meaningful predictors of click-through rates. These coefficients are highly statistically significant, and their inclusion yields the lowest test RMSE of the three models, although the improvement over the Naive baseline is modest in absolute terms.

Key Insight

  • Model 2 is the best prediction model because it leverages additional variables that better capture patterns influencing click probabilities. These variables likely account for ad placement and exposure, which are critical for predicting clicks.
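
To make the comparison concrete, the three test-set RMSEs can be collected into a small table and the gain of Model 2 over the Naive baseline expressed in relative terms (a sketch; with the values above, the relative improvement is only about 0.3%):

# Side-by-side comparison of the test-set RMSEs
rmse_comparison <- data.frame(
  model = c("Naive", "Model 1", "Model 2"),
  rmse  = c(naive_rmse, model1_rmse, model2_rmse)
)
rmse_comparison

# Relative improvement of Model 2 over the Naive baseline, in percent
(naive_rmse - model2_rmse) / naive_rmse * 100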

Question 8

Can you find an even better prediction model by either adding or subtracting variables from your model? (Please Note: This question is very open ended, and you could work for weeks perfecting the model. Please do not obsess over it. Just try a couple things and report what you found. Note that it is possible to make the model better by removing some variables, so that is one simple option. In the next session we will learn a way of automating this model-building process for prediction models.)

# Experiment: Remove Depth and add interaction between Position and Gender
model3 <- lm(Click ~ Position * Gender + Impression + Age, data = train_data)

# Predictions using Model 3
model3_predictions <- predict(model3, newdata = test_data)

# Calculate RMSE for Model 3
model3_rmse <- sqrt(mean((test_data$Click - model3_predictions)^2))
model3_rmse
## [1] 0.2133589

Exploring an Improved Prediction Model: Model 3

Experiment

  • In Model 3, we removed Depth and added an interaction term between Position and Gender.
    • Rationale:
      • Depth may contribute relatively little predictive power, while the effect of ad placement (captured by Position) might differ by gender, a relationship that an interaction term between Position and Gender can capture.

RMSE for Model 3

  • RMSE: 0.2133589
    • This is almost identical to, and in fact very slightly higher than, the RMSE of Model 2 (0.2133507). The marginal difference suggests that removing Depth and adding the Position × Gender interaction did not meaningfully change out-of-sample accuracy.

Key Insight

  • Model 3 performs essentially the same as Model 2 on the test data, suggesting that swapping Depth for a Position × Gender interaction neither adds nor removes much predictive value in this case.
  • Adding interaction terms like Position * Gender is a reasonable step when exploring potential improvements, as such terms can capture nuanced relationships in the data.

Takeaway

  • While Model 3 does not outperform Model 2, experimenting with variable selection and interaction terms is a useful part of refining predictive models. Further exploration of additional variables or interaction effects might yield better results.
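
One further simple experiment in the spirit of this question (a sketch only; its test RMSE is not reported here) is to drop the weak demographic predictors and keep only the ad-exposure variables from Model 2:

# Exploratory model: drop Gender and Age, keep Position, Depth, Impression
model4 <- lm(Click ~ Position + Depth + Impression, data = train_data)

# Evaluate on the test data the same way as before
model4_predictions <- predict(model4, newdata = test_data)
model4_rmse <- sqrt(mean((test_data$Click - model4_predictions)^2))
model4_rmse  # compare against model2_rmse and model3_rmse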