Objective: Identify which deal attributes most influence whether a deal is won or lost. This helps RevOps teams prioritize the right levers — should reps focus on more meetings? Bigger deals? Fewer discounts?
Skills Demonstrated: Data wrangling, correlation analysis, Random Forest feature importance, logistic regression, data visualization.
Dataset: deals_data.csv — 300 CRM deals
with attributes like source, industry, meetings, discount %, and
outcome.
deals <- read.csv("deals_data.csv", stringsAsFactors = FALSE)
cat("Dimensions:", dim(deals), "\n")
## Dimensions: 300 11
head(deals, 10) %>%
kable("html", caption = "First 10 Rows") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13)
| deal_id | deal_source | industry | deal_size_usd | num_meetings | num_emails | days_to_close | discount_pct | competitor_mentioned | decision_makers_involved | outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Outbound | SaaS | 16896 | 6 | 11 | 60 | 0 | 0 | 1 | Closed Lost |
| 2 | Event | Healthcare | 14817 | 2 | 12 | 86 | 20 | 0 | 1 | Closed Lost |
| 3 | Referral | Healthcare | 20878 | 4 | 12 | 61 | 0 | 0 | 1 | Closed Lost |
| 4 | Referral | Healthcare | 13126 | 4 | 19 | 62 | 10 | 1 | 1 | Closed Lost |
| 5 | Inbound | Retail | 16700 | 5 | 12 | 54 | 20 | 0 | 1 | Closed Lost |
| 6 | Inbound | Manufacturing | 12101 | 3 | 13 | 77 | 10 | 1 | 2 | Closed Lost |
| 7 | Inbound | Healthcare | 58827 | 7 | 10 | 74 | 0 | 0 | 1 | Closed Lost |
| 8 | Partner | Finance | 22418 | 4 | 15 | 43 | 5 | 0 | 2 | Closed Won |
| 9 | Referral | Retail | 15523 | 5 | 12 | 31 | 0 | 0 | 1 | Closed Lost |
| 10 | Referral | Finance | 24513 | 2 | 13 | 51 | 10 | 0 | 1 | Closed Lost |
summary(deals) %>% kable("html") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
full_width = FALSE, font_size = 13)
| deal_id | deal_source | industry | deal_size_usd | num_meetings | num_emails | days_to_close | discount_pct | competitor_mentioned | decision_makers_involved | outcome | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | Length:300 | Length:300 | Min. : 5719 | Min. : 0.000 | Min. : 3.00 | Min. : 10.00 | Min. : 0.000 | Min. :0.00 | Min. :1.00 | Length:300 | |
| 1st Qu.: 75.75 | Class :character | Class :character | 1st Qu.: 15018 | 1st Qu.: 3.000 | 1st Qu.:10.00 | 1st Qu.: 43.00 | 1st Qu.: 0.000 | 1st Qu.:0.00 | 1st Qu.:1.00 | Class :character | |
| Median :150.50 | Mode :character | Mode :character | Median : 20932 | Median : 4.000 | Median :12.00 | Median : 54.00 | Median : 5.000 | Median :0.00 | Median :2.00 | Mode :character | |
| Mean :150.50 | NA | NA | Mean : 24115 | Mean : 3.993 | Mean :11.97 | Mean : 55.06 | Mean : 9.583 | Mean :0.35 | Mean :2.25 | NA | |
| 3rd Qu.:225.25 | NA | NA | 3rd Qu.: 29388 | 3rd Qu.: 5.000 | 3rd Qu.:14.00 | 3rd Qu.: 66.00 | 3rd Qu.:20.000 | 3rd Qu.:1.00 | 3rd Qu.:3.00 | NA | |
| Max. :300.00 | NA | NA | Max. :102686 | Max. :10.000 | Max. :24.00 | Max. :103.00 | Max. :25.000 | Max. :1.00 | Max. :4.00 | NA |
deals$won <- as.integer(deals$outcome == "Closed Won")
win_rate <- mean(deals$won) * 100
cat("Total deals:", nrow(deals), "\n")
## Total deals: 300
cat("Win rate: ", round(win_rate, 1), "%\n")
## Win rate: 24.7 %
Context: With ~300 deals and a ~22% win rate, this is a realistic B2B dataset. The challenge is figuring out which attributes separate the ~22% that close from the ~78% that don’t.
numeric_cols <- c("won", "deal_size_usd", "num_meetings", "num_emails",
"days_to_close", "discount_pct", "competitor_mentioned",
"decision_makers_involved")
cor_matrix <- cor(deals[numeric_cols], use = "complete.obs")
cor_with_win <- cor_matrix[, "won"]
cor_with_win <- cor_with_win[names(cor_with_win) != "won"]
cor_with_win <- sort(cor_with_win, decreasing = TRUE)
cor_df <- data.frame(
Variable = names(cor_with_win),
Correlation = round(cor_with_win, 3)
)
cor_df %>%
kable("html", caption = "Correlation with Win Outcome", row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13) %>%
row_spec(which(abs(cor_df$Correlation) > 0.15),
bold = TRUE, background = "#d5f5e3")
| Variable | Correlation |
|---|---|
| num_meetings | 0.160 |
| decision_makers_involved | 0.156 |
| num_emails | 0.078 |
| deal_size_usd | -0.019 |
| competitor_mentioned | -0.047 |
| days_to_close | -0.126 |
| discount_pct | -0.175 |
melted <- melt(cor_matrix)
ggplot(melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = round(value, 2)), size = 3.5) +
scale_fill_gradient2(low = pal_red, mid = "white", high = pal_blue,
midpoint = 0, limits = c(-1, 1)) +
labs(title = "Correlation Heatmap — Deal Attributes", fill = "r") +
theme_portfolio() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_blank())
Interpretation: Variables with the strongest positive correlations to winning (e.g.,
num_meetings,decision_makers_involved) are the ones reps should double down on. Negative correlations (e.g.,days_to_close) confirm that longer sales cycles hurt — speed matters. Weak or near-zero correlations (e.g.,deal_size_usd) suggest that deal size alone doesn’t determine the outcome.
wr_source <- deals %>%
group_by(deal_source) %>%
summarise(win_pct = round(mean(won) * 100, 1), .groups = "drop") %>%
arrange(win_pct)
wr_source$deal_source <- factor(wr_source$deal_source,
levels = wr_source$deal_source)
ggplot(wr_source, aes(x = win_pct, y = deal_source)) +
geom_col(fill = pal_blue, alpha = 0.85, width = 0.6) +
geom_text(aes(label = paste0(win_pct, "%")),
hjust = -0.15, size = 3.8) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "Win Rate by Deal Source", x = "Win Rate (%)", y = NULL) +
theme_portfolio()
wr_ind <- deals %>%
group_by(industry) %>%
summarise(win_pct = round(mean(won) * 100, 1), .groups = "drop") %>%
arrange(win_pct)
wr_ind$industry <- factor(wr_ind$industry, levels = wr_ind$industry)
ggplot(wr_ind, aes(x = win_pct, y = industry)) +
geom_col(fill = pal_orange, alpha = 0.85, width = 0.6) +
geom_text(aes(label = paste0(win_pct, "%")),
hjust = -0.15, size = 3.8) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "Win Rate by Industry", x = "Win Rate (%)", y = NULL) +
theme_portfolio()
Interpretation: The best-performing deal source is where marketing and sales should focus pipeline generation. The top industry vertical may warrant a dedicated playbook or specialized reps. Low-performing sources or industries aren’t necessarily bad — they may need different enablement or longer nurture sequences.
feature_cols <- c("deal_size_usd", "num_meetings", "num_emails",
"days_to_close", "discount_pct", "competitor_mentioned",
"decision_makers_involved")
# Encode categoricals
deals$source_enc <- as.integer(as.factor(deals$deal_source))
deals$industry_enc <- as.integer(as.factor(deals$industry))
all_features <- c(feature_cols, "source_enc", "industry_enc")
feature_labels <- c(feature_cols, "deal_source", "industry")
set.seed(42)
rf <- randomForest(
x = deals[all_features],
y = as.factor(deals$won),
ntree = 500,
importance = TRUE
)
imp <- data.frame(
Variable = feature_labels,
Importance = round(importance(rf, type = 2)[, 1], 4)
) %>% arrange(desc(Importance))
imp %>%
kable("html", caption = "Feature Importance — Ranked", row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13) %>%
row_spec(1:3, bold = TRUE, background = "#d5f5e3")
| Variable | Importance |
|---|---|
| deal_size_usd | 22.8768 |
| days_to_close | 18.9162 |
| num_emails | 15.2692 |
| num_meetings | 13.0570 |
| discount_pct | 9.8072 |
| industry | 9.7192 |
| deal_source | 8.5826 |
| decision_makers_involved | 8.5149 |
| competitor_mentioned | 3.4231 |
imp_plot <- imp %>% arrange(Importance)
imp_plot$Variable <- factor(imp_plot$Variable, levels = imp_plot$Variable)
ggplot(imp_plot, aes(x = Importance, y = Variable)) +
geom_col(fill = pal_green, alpha = 0.85, width = 0.6) +
labs(title = "Feature Importance — Mean Decrease Gini",
x = "Importance", y = NULL) +
theme_portfolio()
Interpretation: The Random Forest ranks every feature by how much it helps separate won deals from lost ones. The top 3 drivers (highlighted in green above) are where the sales team should focus. Typically, engagement metrics (meetings, emails) and deal velocity (days to close) dominate — meaning how you sell matters more than what you sell or who you sell to.
top_var <- imp$Variable[1]
cat("Top driver:", top_var, "\n")
## Top driver: deal_size_usd
if (top_var %in% names(deals) && is.numeric(deals[[top_var]])) {
plot_df <- data.frame(
outcome = factor(ifelse(deals$won == 1, "Won", "Lost"),
levels = c("Lost", "Won")),
value = deals[[top_var]]
)
ggplot(plot_df, aes(x = outcome, y = value, fill = outcome)) +
geom_boxplot(alpha = 0.7, width = 0.5, outlier.alpha = 0.4) +
scale_fill_manual(values = c("Lost" = pal_red, "Won" = pal_green)) +
labs(title = paste0("Distribution of ", top_var, " — Won vs Lost"),
y = top_var, x = NULL) +
theme_portfolio() +
theme(legend.position = "none")
}
Interpretation: A clear separation in the boxplots (e.g., won deals having noticeably higher meeting counts) confirms the Random Forest finding. If the distributions overlap heavily, the variable is important but not sufficient on its own — it works in combination with other factors.
log_model <- glm(won ~ deal_size_usd + num_meetings + num_emails +
days_to_close + discount_pct + competitor_mentioned +
decision_makers_involved,
data = deals, family = binomial)
summary(log_model)
##
## Call:
## glm(formula = won ~ deal_size_usd + num_meetings + num_emails +
## days_to_close + discount_pct + competitor_mentioned + decision_makers_involved,
## family = binomial, data = deals)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.019e+00 8.546e-01 -2.362 0.01817 *
## deal_size_usd -5.885e-06 1.091e-05 -0.540 0.58949
## num_meetings 2.022e-01 7.405e-02 2.731 0.00632 **
## num_emails 6.755e-02 4.088e-02 1.653 0.09843 .
## days_to_close -1.645e-02 8.744e-03 -1.881 0.05992 .
## discount_pct -5.298e-02 1.649e-02 -3.213 0.00131 **
## competitor_mentioned -2.624e-01 3.058e-01 -0.858 0.39093
## decision_makers_involved 3.424e-01 1.321e-01 2.593 0.00951 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 335.19 on 299 degrees of freedom
## Residual deviance: 302.32 on 292 degrees of freedom
## AIC: 318.32
##
## Number of Fisher Scoring iterations: 4
coefs <- summary(log_model)$coefficients[-1, ] # drop intercept
or_df <- data.frame(
Variable = rownames(coefs),
Odds_Ratio = round(exp(coefs[, "Estimate"]), 3),
P_Value = round(coefs[, "Pr(>|z|)"], 4)
) %>% arrange(P_Value)
rownames(or_df) <- NULL
or_df %>%
kable("html", caption = "Logistic Regression — Odds Ratios", row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13) %>%
row_spec(which(or_df$P_Value < 0.05),
bold = TRUE, background = "#d5f5e3")
| Variable | Odds_Ratio | P_Value |
|---|---|---|
| discount_pct | 0.948 | 0.0013 |
| num_meetings | 1.224 | 0.0063 |
| decision_makers_involved | 1.408 | 0.0095 |
| days_to_close | 0.984 | 0.0599 |
| num_emails | 1.070 | 0.0984 |
| competitor_mentioned | 0.769 | 0.3909 |
| deal_size_usd | 1.000 | 0.5895 |
Interpretation: The odds ratio tells you the practical effect of each variable. For example, an odds ratio of 1.25 for
num_meetingsmeans each additional meeting increases the odds of winning by 25%. Variables with p < 0.05 (highlighted in green) are statistically significant. This complements the Random Forest by quantifying how much each lever moves the needle.
top3 <- imp$Variable[1:3]
sig_vars <- or_df$Variable[or_df$P_Value < 0.05]
best_src <- wr_source %>% filter(win_pct == max(win_pct))
best_ind <- wr_ind %>% filter(win_pct == max(win_pct))
findings <- data.frame(`Key Findings` = c(
paste0("Top 3 drivers: ", paste(top3, collapse = ", ")),
paste0("Statistically significant (p<0.05): ",
ifelse(length(sig_vars) > 0, paste(sig_vars, collapse = ", "), "None")),
paste0("Overall win rate: ", round(win_rate, 1), "%"),
paste0("Best deal source: ", best_src$deal_source[1],
" (", best_src$win_pct[1], "%)"),
paste0("Best industry: ", best_ind$industry[1],
" (", best_ind$win_pct[1], "%)")
), check.names = FALSE)
findings %>%
kable("html", caption = "Revenue Drivers — Summary", row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE, font_size = 13)
| Key Findings |
|---|
| Top 3 drivers: deal_size_usd, days_to_close, num_emails |
| Statistically significant (p<0.05): discount_pct, num_meetings, decision_makers_involved |
| Overall win rate: 24.7% |
| Best deal source: Partner (36.2%) |
| Best industry: Retail (34%) |
Takeaways:
- Engagement is king — the number of meetings and multi-threading (decision-makers involved) consistently rank as the top revenue drivers.
- Speed wins — longer sales cycles correlate with lower win rates. Deals that stall should be flagged and fast-tracked or disqualified.
- Discounts don’t save deals — discount percentage shows weak or negative correlation with winning, meaning price concessions alone don’t close business.
- Recommendation: Prioritize rep coaching on discovery meetings and stakeholder mapping. Set pipeline velocity alerts for deals exceeding the median cycle length.
- Recommendation: Double down on the top-performing deal source and industry vertical — allocate more SDR capacity there.
Project completed as part of RevOps Analytics Portfolio — Aya Hanouni