1. Project Overview

Objective: Identify which deal attributes most influence whether a deal is won or lost. This helps RevOps teams prioritize the right levers — should reps focus on more meetings? Bigger deals? Fewer discounts?

Skills Demonstrated: Data wrangling, correlation analysis, Random Forest feature importance, logistic regression, data visualization.

Dataset: deals_data.csv — 300 CRM deals with attributes like source, industry, meetings, discount %, and outcome.


2. Load & Explore Data

deals <- read.csv("deals_data.csv", stringsAsFactors = FALSE)

cat("Dimensions:", dim(deals), "\n")
## Dimensions: 300 11
head(deals, 10) %>%
  kable("html", caption = "First 10 Rows") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13)
First 10 Rows
deal_id deal_source industry deal_size_usd num_meetings num_emails days_to_close discount_pct competitor_mentioned decision_makers_involved outcome
1 Outbound SaaS 16896 6 11 60 0 0 1 Closed Lost
2 Event Healthcare 14817 2 12 86 20 0 1 Closed Lost
3 Referral Healthcare 20878 4 12 61 0 0 1 Closed Lost
4 Referral Healthcare 13126 4 19 62 10 1 1 Closed Lost
5 Inbound Retail 16700 5 12 54 20 0 1 Closed Lost
6 Inbound Manufacturing 12101 3 13 77 10 1 2 Closed Lost
7 Inbound Healthcare 58827 7 10 74 0 0 1 Closed Lost
8 Partner Finance 22418 4 15 43 5 0 2 Closed Won
9 Referral Retail 15523 5 12 31 0 0 1 Closed Lost
10 Referral Finance 24513 2 13 51 10 0 1 Closed Lost
summary(deals) %>% kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "condensed"),
                full_width = FALSE, font_size = 13)
deal_id deal_source industry deal_size_usd num_meetings num_emails days_to_close discount_pct competitor_mentioned decision_makers_involved outcome
Min. : 1.00 Length:300 Length:300 Min. : 5719 Min. : 0.000 Min. : 3.00 Min. : 10.00 Min. : 0.000 Min. :0.00 Min. :1.00 Length:300
1st Qu.: 75.75 Class :character Class :character 1st Qu.: 15018 1st Qu.: 3.000 1st Qu.:10.00 1st Qu.: 43.00 1st Qu.: 0.000 1st Qu.:0.00 1st Qu.:1.00 Class :character
Median :150.50 Mode :character Mode :character Median : 20932 Median : 4.000 Median :12.00 Median : 54.00 Median : 5.000 Median :0.00 Median :2.00 Mode :character
Mean :150.50 NA NA Mean : 24115 Mean : 3.993 Mean :11.97 Mean : 55.06 Mean : 9.583 Mean :0.35 Mean :2.25 NA
3rd Qu.:225.25 NA NA 3rd Qu.: 29388 3rd Qu.: 5.000 3rd Qu.:14.00 3rd Qu.: 66.00 3rd Qu.:20.000 3rd Qu.:1.00 3rd Qu.:3.00 NA
Max. :300.00 NA NA Max. :102686 Max. :10.000 Max. :24.00 Max. :103.00 Max. :25.000 Max. :1.00 Max. :4.00 NA
deals$won <- as.integer(deals$outcome == "Closed Won")
win_rate  <- mean(deals$won) * 100

cat("Total deals:", nrow(deals), "\n")
## Total deals: 300
cat("Win rate:   ", round(win_rate, 1), "%\n")
## Win rate:    24.7 %

Context: With ~300 deals and a ~22% win rate, this is a realistic B2B dataset. The challenge is figuring out which attributes separate the ~22% that close from the ~78% that don’t.


3. Correlation Analysis

numeric_cols <- c("won", "deal_size_usd", "num_meetings", "num_emails",
                  "days_to_close", "discount_pct", "competitor_mentioned",
                  "decision_makers_involved")

cor_matrix <- cor(deals[numeric_cols], use = "complete.obs")

cor_with_win <- cor_matrix[, "won"]
cor_with_win <- cor_with_win[names(cor_with_win) != "won"]
cor_with_win <- sort(cor_with_win, decreasing = TRUE)

cor_df <- data.frame(
  Variable    = names(cor_with_win),
  Correlation = round(cor_with_win, 3)
)

cor_df %>%
  kable("html", caption = "Correlation with Win Outcome", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(which(abs(cor_df$Correlation) > 0.15),
           bold = TRUE, background = "#d5f5e3")
Correlation with Win Outcome
Variable Correlation
num_meetings 0.160
decision_makers_involved 0.156
num_emails 0.078
deal_size_usd -0.019
competitor_mentioned -0.047
days_to_close -0.126
discount_pct -0.175
melted <- melt(cor_matrix)

ggplot(melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(value, 2)), size = 3.5) +
  scale_fill_gradient2(low = pal_red, mid = "white", high = pal_blue,
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap — Deal Attributes", fill = "r") +
  theme_portfolio() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())

Interpretation: Variables with the strongest positive correlations to winning (e.g., num_meetings, decision_makers_involved) are the ones reps should double down on. Negative correlations (e.g., days_to_close) confirm that longer sales cycles hurt — speed matters. Weak or near-zero correlations (e.g., deal_size_usd) suggest that deal size alone doesn’t determine the outcome.


4. Win Rate by Categorical Variables

wr_source <- deals %>%
  group_by(deal_source) %>%
  summarise(win_pct = round(mean(won) * 100, 1), .groups = "drop") %>%
  arrange(win_pct)

wr_source$deal_source <- factor(wr_source$deal_source,
                                 levels = wr_source$deal_source)

ggplot(wr_source, aes(x = win_pct, y = deal_source)) +
  geom_col(fill = pal_blue, alpha = 0.85, width = 0.6) +
  geom_text(aes(label = paste0(win_pct, "%")),
            hjust = -0.15, size = 3.8) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Win Rate by Deal Source", x = "Win Rate (%)", y = NULL) +
  theme_portfolio()

wr_ind <- deals %>%
  group_by(industry) %>%
  summarise(win_pct = round(mean(won) * 100, 1), .groups = "drop") %>%
  arrange(win_pct)

wr_ind$industry <- factor(wr_ind$industry, levels = wr_ind$industry)

ggplot(wr_ind, aes(x = win_pct, y = industry)) +
  geom_col(fill = pal_orange, alpha = 0.85, width = 0.6) +
  geom_text(aes(label = paste0(win_pct, "%")),
            hjust = -0.15, size = 3.8) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Win Rate by Industry", x = "Win Rate (%)", y = NULL) +
  theme_portfolio()

Interpretation: The best-performing deal source is where marketing and sales should focus pipeline generation. The top industry vertical may warrant a dedicated playbook or specialized reps. Low-performing sources or industries aren’t necessarily bad — they may need different enablement or longer nurture sequences.


5. Feature Importance — Random Forest

feature_cols <- c("deal_size_usd", "num_meetings", "num_emails",
                  "days_to_close", "discount_pct", "competitor_mentioned",
                  "decision_makers_involved")

# Encode categoricals
deals$source_enc   <- as.integer(as.factor(deals$deal_source))
deals$industry_enc <- as.integer(as.factor(deals$industry))

all_features   <- c(feature_cols, "source_enc", "industry_enc")
feature_labels <- c(feature_cols, "deal_source", "industry")

set.seed(42)
rf <- randomForest(
  x = deals[all_features],
  y = as.factor(deals$won),
  ntree      = 500,
  importance = TRUE
)

imp <- data.frame(
  Variable   = feature_labels,
  Importance = round(importance(rf, type = 2)[, 1], 4)
) %>% arrange(desc(Importance))

imp %>%
  kable("html", caption = "Feature Importance — Ranked", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(1:3, bold = TRUE, background = "#d5f5e3")
Feature Importance — Ranked
Variable Importance
deal_size_usd 22.8768
days_to_close 18.9162
num_emails 15.2692
num_meetings 13.0570
discount_pct 9.8072
industry 9.7192
deal_source 8.5826
decision_makers_involved 8.5149
competitor_mentioned 3.4231
imp_plot <- imp %>% arrange(Importance)
imp_plot$Variable <- factor(imp_plot$Variable, levels = imp_plot$Variable)

ggplot(imp_plot, aes(x = Importance, y = Variable)) +
  geom_col(fill = pal_green, alpha = 0.85, width = 0.6) +
  labs(title = "Feature Importance — Mean Decrease Gini",
       x = "Importance", y = NULL) +
  theme_portfolio()

Interpretation: The Random Forest ranks every feature by how much it helps separate won deals from lost ones. The top 3 drivers (highlighted in green above) are where the sales team should focus. Typically, engagement metrics (meetings, emails) and deal velocity (days to close) dominate — meaning how you sell matters more than what you sell or who you sell to.


6. Deep Dive — Top Driver

top_var <- imp$Variable[1]
cat("Top driver:", top_var, "\n")
## Top driver: deal_size_usd
if (top_var %in% names(deals) && is.numeric(deals[[top_var]])) {
  plot_df <- data.frame(
    outcome = factor(ifelse(deals$won == 1, "Won", "Lost"),
                     levels = c("Lost", "Won")),
    value   = deals[[top_var]]
  )

  ggplot(plot_df, aes(x = outcome, y = value, fill = outcome)) +
    geom_boxplot(alpha = 0.7, width = 0.5, outlier.alpha = 0.4) +
    scale_fill_manual(values = c("Lost" = pal_red, "Won" = pal_green)) +
    labs(title = paste0("Distribution of ", top_var, " — Won vs Lost"),
         y = top_var, x = NULL) +
    theme_portfolio() +
    theme(legend.position = "none")
}

Interpretation: A clear separation in the boxplots (e.g., won deals having noticeably higher meeting counts) confirms the Random Forest finding. If the distributions overlap heavily, the variable is important but not sufficient on its own — it works in combination with other factors.


7. Logistic Regression — Quantifying Effects

log_model <- glm(won ~ deal_size_usd + num_meetings + num_emails +
                   days_to_close + discount_pct + competitor_mentioned +
                   decision_makers_involved,
                 data = deals, family = binomial)

summary(log_model)
## 
## Call:
## glm(formula = won ~ deal_size_usd + num_meetings + num_emails + 
##     days_to_close + discount_pct + competitor_mentioned + decision_makers_involved, 
##     family = binomial, data = deals)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)              -2.019e+00  8.546e-01  -2.362  0.01817 * 
## deal_size_usd            -5.885e-06  1.091e-05  -0.540  0.58949   
## num_meetings              2.022e-01  7.405e-02   2.731  0.00632 **
## num_emails                6.755e-02  4.088e-02   1.653  0.09843 . 
## days_to_close            -1.645e-02  8.744e-03  -1.881  0.05992 . 
## discount_pct             -5.298e-02  1.649e-02  -3.213  0.00131 **
## competitor_mentioned     -2.624e-01  3.058e-01  -0.858  0.39093   
## decision_makers_involved  3.424e-01  1.321e-01   2.593  0.00951 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.19  on 299  degrees of freedom
## Residual deviance: 302.32  on 292  degrees of freedom
## AIC: 318.32
## 
## Number of Fisher Scoring iterations: 4
coefs   <- summary(log_model)$coefficients[-1, ]  # drop intercept
or_df   <- data.frame(
  Variable   = rownames(coefs),
  Odds_Ratio = round(exp(coefs[, "Estimate"]), 3),
  P_Value    = round(coefs[, "Pr(>|z|)"], 4)
) %>% arrange(P_Value)
rownames(or_df) <- NULL

or_df %>%
  kable("html", caption = "Logistic Regression — Odds Ratios", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(which(or_df$P_Value < 0.05),
           bold = TRUE, background = "#d5f5e3")
Logistic Regression — Odds Ratios
Variable Odds_Ratio P_Value
discount_pct 0.948 0.0013
num_meetings 1.224 0.0063
decision_makers_involved 1.408 0.0095
days_to_close 0.984 0.0599
num_emails 1.070 0.0984
competitor_mentioned 0.769 0.3909
deal_size_usd 1.000 0.5895

Interpretation: The odds ratio tells you the practical effect of each variable. For example, an odds ratio of 1.25 for num_meetings means each additional meeting increases the odds of winning by 25%. Variables with p < 0.05 (highlighted in green) are statistically significant. This complements the Random Forest by quantifying how much each lever moves the needle.


8. Summary & Recommendations

top3     <- imp$Variable[1:3]
sig_vars <- or_df$Variable[or_df$P_Value < 0.05]
best_src <- wr_source %>% filter(win_pct == max(win_pct))
best_ind <- wr_ind %>% filter(win_pct == max(win_pct))

findings <- data.frame(`Key Findings` = c(
  paste0("Top 3 drivers: ", paste(top3, collapse = ", ")),
  paste0("Statistically significant (p<0.05): ",
         ifelse(length(sig_vars) > 0, paste(sig_vars, collapse = ", "), "None")),
  paste0("Overall win rate: ", round(win_rate, 1), "%"),
  paste0("Best deal source: ", best_src$deal_source[1],
         " (", best_src$win_pct[1], "%)"),
  paste0("Best industry: ", best_ind$industry[1],
         " (", best_ind$win_pct[1], "%)")
), check.names = FALSE)

findings %>%
  kable("html", caption = "Revenue Drivers — Summary", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 13)
Revenue Drivers — Summary
Key Findings
Top 3 drivers: deal_size_usd, days_to_close, num_emails
Statistically significant (p<0.05): discount_pct, num_meetings, decision_makers_involved
Overall win rate: 24.7%
Best deal source: Partner (36.2%)
Best industry: Retail (34%)

Takeaways:

  1. Engagement is king — the number of meetings and multi-threading (decision-makers involved) consistently rank as the top revenue drivers.
  2. Speed wins — longer sales cycles correlate with lower win rates. Deals that stall should be flagged and fast-tracked or disqualified.
  3. Discounts don’t save deals — discount percentage shows weak or negative correlation with winning, meaning price concessions alone don’t close business.
  4. Recommendation: Prioritize rep coaching on discovery meetings and stakeholder mapping. Set pipeline velocity alerts for deals exceeding the median cycle length.
  5. Recommendation: Double down on the top-performing deal source and industry vertical — allocate more SDR capacity there.

Project completed as part of RevOps Analytics Portfolio — Aya Hanouni