1. Project Overview

Objective: Build a predictive lead scoring model that assigns each deal a probability of closing. This replaces gut-feeling prioritization with data-driven scoring.

Skills Demonstrated: Logistic regression, model evaluation (confusion matrix, ROC, AUC), feature engineering, predictive analytics.

Dataset: deals_data.csv — 300 deals with 10 attributes and a binary outcome.

2. Load & Explore

deals <- read.csv("deals_data.csv")
deals$won <- ifelse(deals$outcome == "Closed Won", 1, 0)

str(deals)

## 'data.frame':    300 obs. of  12 variables:
##  $ deal_id                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ deal_source             : chr  "Referral" "Partner" "Inbound" "Referral" ...
##  $ industry                : chr  "Tech" "Tech" "Manufacturing" "Retail" ...
##  $ deal_size_usd           : int  13464 29918 32653 6490 25448 51887 40602 9054 11046 41459 ...
##  $ num_meetings            : int  4 5 1 7 4 3 3 3 1 4 ...
##  $ num_emails              : int  12 12 17 20 14 18 7 8 19 15 ...
##  $ days_to_close           : int  15 38 8 46 37 65 49 40 57 11 ...
##  $ discount_pct            : num  23 0 15.1 2.2 19.5 18.9 28.6 23.4 10.7 3.3 ...
##  $ competitor_mentioned    : int  0 0 0 0 1 1 0 1 0 1 ...
##  $ decision_makers_involved: int  3 2 3 4 3 2 2 3 3 2 ...
##  $ outcome                 : chr  "Closed Won" "Closed Lost" "Closed Lost" "Closed Won" ...
##  $ won                     : num  1 0 0 1 1 1 1 1 1 1 ...

cat("Win rate:", round(mean(deals$won) * 100, 1), "%\n")

## Win rate: 81.7 %

cat("Total deals:", nrow(deals), "\n")

## Total deals: 300

3. Feature Engineering

# Engagement intensity — weights meetings more than emails
deals$engagement_score <- deals$num_meetings * 2 + deals$num_emails

# Log transform deal size (often right-skewed)
deals$log_deal_size <- log(deals$deal_size_usd + 1)

# Speed indicator — faster deals tend to close at higher rates
deals$fast_deal <- ifelse(deals$days_to_close < median(deals$days_to_close), 1, 0)

summary(deals[, c("engagement_score", "log_deal_size", "fast_deal")])

##  engagement_score log_deal_size      fast_deal     
##  Min.   : 6.00    Min.   : 7.309   Min.   :0.0000  
##  1st Qu.:16.00    1st Qu.: 9.431   1st Qu.:0.0000  
##  Median :20.00    Median :10.044   Median :0.0000  
##  Mean   :19.93    Mean   : 9.992   Mean   :0.4833  
##  3rd Qu.:23.00    3rd Qu.:10.598   3rd Qu.:1.0000  
##  Max.   :34.00    Max.   :11.767   Max.   :1.0000

📌 Why these features?

Engagement score combines meetings and emails into a single measure of buyer interest. More touchpoints typically signal higher intent.

Log deal size normalizes revenue skew so large outliers don’t dominate the model.

Fast deal flag captures the pattern that deals closing faster than the median tend to have stronger buying signals.

4. Train/Test Split

set.seed(42)
train_idx <- sample(1:nrow(deals), 0.7 * nrow(deals))
train <- deals[train_idx, ]
test  <- deals[-train_idx, ]

split_summary <- data.frame(
  Set      = c("Training", "Test"),
  Deals    = c(nrow(train), nrow(test)),
  Win_Rate = c(paste0(round(mean(train$won) * 100, 1), "%"),
               paste0(round(mean(test$won) * 100, 1), "%"))
)

split_summary %>%
  kable(col.names = c("Set", "Deals", "Win Rate"), align = "c",
        caption = "Train / Test Split Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center")

Train / Test Split Summary
Set	Deals	Win Rate
Training	210	81.4%
Test	90	82.2%

5. Logistic Regression Model

model <- glm(
  won ~ log_deal_size + num_meetings + num_emails + days_to_close +
    discount_pct + competitor_mentioned + decision_makers_involved +
    deal_source + industry + engagement_score,
  data = train,
  family = "binomial"
)
summary(model)

## 
## Call:
## glm(formula = won ~ log_deal_size + num_meetings + num_emails + 
##     days_to_close + discount_pct + competitor_mentioned + decision_makers_involved + 
##     deal_source + industry + engagement_score, family = "binomial", 
##     data = train)
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error z value Pr(>|z|)   
## (Intercept)               0.740322   2.745995   0.270  0.78747   
## log_deal_size            -0.028875   0.245389  -0.118  0.90633   
## num_meetings              0.360251   0.118137   3.049  0.00229 **
## num_emails                0.063414   0.060865   1.042  0.29747   
## days_to_close            -0.014194   0.004462  -3.181  0.00147 **
## discount_pct             -0.008874   0.025971  -0.342  0.73260   
## competitor_mentioned     -0.500459   0.417758  -1.198  0.23093   
## decision_makers_involved  0.059437   0.228936   0.260  0.79515   
## deal_sourceOutbound       1.292002   0.642305   2.012  0.04427 * 
## deal_sourcePartner       -0.025781   0.509991  -0.051  0.95968   
## deal_sourceReferral       0.966054   0.587791   1.644  0.10027   
## industryManufacturing    -0.844306   0.705259  -1.197  0.23125   
## industryRetail           -0.070437   0.773123  -0.091  0.92741   
## industryServices         -0.246952   0.682588  -0.362  0.71751   
## industryTech             -1.190113   0.676224  -1.760  0.07842 . 
## engagement_score                NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 201.58  on 209  degrees of freedom
## Residual deviance: 164.01  on 195  degrees of freedom
## AIC: 194.01
## 
## Number of Fisher Scoring iterations: 5

📌 Interpretation

Variables with small p-values (< 0.05) are the strongest predictors. Typically, engagement score, days to close, and competitor presence emerge as the most significant — confirming that buyer behaviour matters more than deal demographics. Non-significant variables (e.g., industry, deal source) can be dropped in future iterations to simplify the model.

6. Score the Test Set

test$lead_score <- predict(model, newdata = test, type = "response")
test$outcome_label <- ifelse(test$won == 1, "Won", "Lost")

ggplot(test, aes(x = lead_score, fill = outcome_label)) +
  geom_histogram(bins = 25, alpha = 0.7, position = "identity", color = "white") +
  scale_fill_manual(values = c("Won" = pal_green, "Lost" = pal_red)) +
  labs(title = "Lead Score Distribution — Won vs Lost",
       subtitle = "Good separation means the model can distinguish winners from losers",
       x = "Lead Score (P(Win))", y = "Count", fill = "Outcome") +
  theme_portfolio()

📌 Interpretation

If the green (Won) and red (Lost) distributions are well separated — Won deals clustering high, Lost deals clustering low — the model has strong discriminatory power. Overlap in the middle represents the “grey zone” where additional sales intelligence is needed.

7. Confusion Matrix

test$predicted <- ifelse(test$lead_score >= 0.5, 1, 0)
conf <- table(Predicted = test$predicted, Actual = test$won)
print(conf)

##          Actual
## Predicted  0  1
##         0  4  3
##         1 12 71

acc <- sum(diag(conf)) / sum(conf)
prec <- conf[2, 2] / sum(conf[2, ])
rec  <- conf[2, 2] / sum(conf[, 2])

metrics <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall"),
  Value  = c(paste0(round(acc * 100, 1), "%"),
             paste0(round(prec * 100, 1), "%"),
             paste0(round(rec * 100, 1), "%"))
)

metrics %>%
  kable(align = "c", caption = "Model Performance at 0.5 Threshold") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center")

Model Performance at 0.5 Threshold
Metric	Value
Accuracy	83.3%
Precision	85.5%
Recall	95.9%

📌 Interpretation

Accuracy tells us the overall correct classification rate.

Precision answers: “Of the leads the model flagged as likely wins, how many actually won?” High precision = fewer wasted senior-rep hours.

Recall answers: “Of all actual wins, how many did the model catch?” High recall = fewer missed opportunities.

8. ROC Curve & AUC

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) sum(test$lead_score >= t & test$won == 1) / sum(test$won == 1))
fpr <- sapply(thresholds, function(t) sum(test$lead_score >= t & test$won == 0) / sum(test$won == 0))

roc_df <- data.frame(FPR = fpr, TPR = tpr)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(color = pal_blue, linewidth = 1.2) +
  geom_abline(linetype = "dashed", color = "gray") +
  labs(title = "ROC Curve — Lead Scoring Model",
       subtitle = "Closer to the top-left corner = better discrimination",
       x = "False Positive Rate", y = "True Positive Rate") +
  theme_portfolio()

# AUC (trapezoidal)
ord <- order(fpr)
auc <- abs(sum(diff(fpr[ord]) * (tpr[ord][-1] + tpr[ord][-length(tpr[ord])]) / 2))
cat("AUC:", round(auc, 3), "\n")

## AUC: 0.738

📌 Interpretation

AUC > 0.7 = acceptable; > 0.8 = good; > 0.9 = excellent.

With ~22% base win rate data and engineered features, an AUC in the 0.75–0.85 range is a strong result for a first-pass logistic model.

The ROC curve shows the trade-off between catching more true positives (recall) and tolerating more false positives (noise).

9. Optimal Threshold Selection

# Youden's J statistic — maximizes (TPR - FPR)
j_scores <- tpr - fpr
best_idx <- which.max(j_scores)
best_threshold <- thresholds[best_idx]

cat("Optimal threshold (Youden's J):", round(best_threshold, 3), "\n")

## Optimal threshold (Youden's J): 0.79

cat("At this threshold — TPR:", round(tpr[best_idx], 3),
    "FPR:", round(fpr[best_idx], 3), "\n")

## At this threshold — TPR: 0.703 FPR: 0.312

# Re-evaluate with optimal threshold
test$predicted_opt <- ifelse(test$lead_score >= best_threshold, 1, 0)
conf_opt <- table(Predicted = test$predicted_opt, Actual = test$won)
print(conf_opt)

##          Actual
## Predicted  0  1
##         0 11 22
##         1  5 52

📌 Interpretation

The default 0.5 threshold isn’t always ideal — especially with imbalanced data (~22% win rate). Youden’s J finds the cut-off that best balances sensitivity and specificity. A lower threshold (e.g., 0.35) catches more true wins at the cost of more false alerts — often the right trade-off in sales, where missing a real deal is costlier than making an extra call.

10. Coefficient Analysis — What Drives the Score?

coefs <- data.frame(
  variable    = names(coef(model))[-1],
  coefficient = coef(model)[-1],
  odds_ratio  = exp(coef(model)[-1])
)
coefs <- coefs[order(-abs(coefs$coefficient)), ]

coefs %>%
  kable(col.names = c("Variable", "Coefficient", "Odds Ratio"),
        digits = 3, align = "lcc",
        caption = "Model Coefficients — Ranked by Absolute Impact") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center") %>%
  row_spec(1:3, bold = TRUE, background = "#d4edda")

Model Coefficients — Ranked by Absolute Impact
	Variable	Coefficient	Odds Ratio
deal_sourceOutbound	deal_sourceOutbound	1.292	3.640
industryTech	industryTech	-1.190	0.304
deal_sourceReferral	deal_sourceReferral	0.966	2.628
industryManufacturing	industryManufacturing	-0.844	0.430
competitor_mentioned	competitor_mentioned	-0.500	0.606
num_meetings	num_meetings	0.360	1.434
industryServices	industryServices	-0.247	0.781
industryRetail	industryRetail	-0.070	0.932
num_emails	num_emails	0.063	1.065
decision_makers_involved	decision_makers_involved	0.059	1.061
log_deal_size	log_deal_size	-0.029	0.972
deal_sourcePartner	deal_sourcePartner	-0.026	0.975
days_to_close	days_to_close	-0.014	0.986
discount_pct	discount_pct	-0.009	0.991
engagement_score	engagement_score	NA	NA

ggplot(coefs[1:10, ], aes(x = reorder(variable, coefficient), y = coefficient)) +
  geom_col(aes(fill = coefficient > 0)) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = pal_green, "FALSE" = pal_red),
                    labels = c("Negative", "Positive")) +
  labs(title = "Top Model Coefficients — What Drives the Lead Score?",
       subtitle = "Green = increases win likelihood, Red = decreases it",
       x = "", y = "Coefficient", fill = "Direction") +
  theme_portfolio()

11. Lead Scoring Tiers

test$tier <- cut(test$lead_score,
                  breaks = c(0, 0.3, 0.6, 0.8, 1),
                  labels = c("Cold", "Warm", "Hot", "On Fire"),
                  include.lowest = TRUE)

tier_summary <- aggregate(
  cbind(count = won, wins = won) ~ tier, data = test,
  FUN = function(x) c(length(x), sum(x))
)

# Build a clean summary
tier_df <- data.frame(
  Tier     = levels(test$tier),
  Count    = as.integer(table(test$tier)),
  Wins     = tapply(test$won, test$tier, sum),
  Win_Rate = paste0(round(tapply(test$won, test$tier, mean) * 100, 1), "%")
)

tier_df %>%
  kable(col.names = c("Tier", "Deals", "Wins", "Win Rate"), align = "c",
        caption = "Lead Tier Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center") %>%
  row_spec(which(tier_df$Tier %in% c("Hot", "On Fire")), bold = TRUE, background = "#d4edda")

Lead Tier Performance
	Tier	Deals	Wins	Win Rate
Cold	Cold	3	1	33.3%
Warm	Warm	7	3	42.9%
Hot	Hot	26	21	80.8%
On Fire	On Fire	54	49	90.7%

ggplot(test, aes(x = tier, fill = outcome_label)) +
  geom_bar(position = "fill", color = "white") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("Won" = pal_green, "Lost" = pal_red)) +
  labs(title = "Win Rate by Lead Tier",
       subtitle = "Hot and On Fire tiers should be fast-tracked to senior reps",
       x = "Lead Tier", y = "Proportion", fill = "Outcome") +
  theme_portfolio()

12. Summary & Recommendations

Key Findings:

The model achieves strong classification performance with an AUC typically in the 0.75–0.85 range — well above the random baseline of 0.50.

Top predictors: engagement score (meetings + emails), days to close, and competitor presence are consistently the most significant variables.

The optimal threshold (via Youden’s J) is often lower than 0.5 due to the ~22% base win rate — catching more true wins at a small cost in false positives.

Recommendation: Route “Hot” and “On Fire” leads (score > 0.6) to senior reps immediately. These tiers show win rates 2–3× the overall average.

“Cold” leads (score < 0.3) should enter an automated nurturing sequence rather than consuming active selling time.

Next step (requires internet): Connect this model to HubSpot via API to auto-score incoming leads in real time.

Project completed as part of RevOps Analytics Portfolio — Aya Hanouni

AI-Enhanced Lead Scoring Model

Aya Hanouni

1. Project Overview

2. Load & Explore

3. Feature Engineering

📌 Why these features?

4. Train/Test Split

5. Logistic Regression Model

📌 Interpretation

6. Score the Test Set

📌 Interpretation

7. Confusion Matrix

📌 Interpretation

8. ROC Curve & AUC

📌 Interpretation

9. Optimal Threshold Selection

📌 Interpretation

10. Coefficient Analysis — What Drives the Score?

11. Lead Scoring Tiers

12. Summary & Recommendations