1. Project Overview

Objective: Identify which deal attributes most influence whether a deal is won or lost. This helps RevOps teams prioritize the right levers — should reps focus on more meetings? Bigger deals? Fewer discounts?

Skills Demonstrated: Data wrangling, correlation analysis, Random Forest feature importance, logistic regression, data visualization.

Dataset: deals_data.csv — 300 CRM deals with attributes like source, industry, meetings, discount %, and outcome.

2. Load & Explore Data

deals <- read.csv("deals_data.csv", stringsAsFactors = FALSE)

cat("Dimensions:", dim(deals), "\n")

## Dimensions: 300 11

head(deals, 10) %>%
  kable("html", caption = "First 10 Rows") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13)

First 10 Rows
deal_id	deal_source	industry	deal_size_usd	num_meetings	num_emails	days_to_close	discount_pct	competitor_mentioned	decision_makers_involved	outcome
1	Outbound	SaaS	16896	6	11	60	0	0	1	Closed Lost
2	Event	Healthcare	14817	2	12	86	20	0	1	Closed Lost
3	Referral	Healthcare	20878	4	12	61	0	0	1	Closed Lost
4	Referral	Healthcare	13126	4	19	62	10	1	1	Closed Lost
5	Inbound	Retail	16700	5	12	54	20	0	1	Closed Lost
6	Inbound	Manufacturing	12101	3	13	77	10	1	2	Closed Lost
7	Inbound	Healthcare	58827	7	10	74	0	0	1	Closed Lost
8	Partner	Finance	22418	4	15	43	5	0	2	Closed Won
9	Referral	Retail	15523	5	12	31	0	0	1	Closed Lost
10	Referral	Finance	24513	2	13	51	10	0	1	Closed Lost

summary(deals) %>% kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "condensed"),
                full_width = FALSE, font_size = 13)

deal_id	deal_source	industry	deal_size_usd	num_meetings	num_emails	days_to_close	discount_pct	competitor_mentioned	decision_makers_involved	outcome
Min. : 1.00	Length:300	Length:300	Min. : 5719	Min. : 0.000	Min. : 3.00	Min. : 10.00	Min. : 0.000	Min. :0.00	Min. :1.00	Length:300
1st Qu.: 75.75	Class :character	Class :character	1st Qu.: 15018	1st Qu.: 3.000	1st Qu.:10.00	1st Qu.: 43.00	1st Qu.: 0.000	1st Qu.:0.00	1st Qu.:1.00	Class :character
Median :150.50	Mode :character	Mode :character	Median : 20932	Median : 4.000	Median :12.00	Median : 54.00	Median : 5.000	Median :0.00	Median :2.00	Mode :character
Mean :150.50	NA	NA	Mean : 24115	Mean : 3.993	Mean :11.97	Mean : 55.06	Mean : 9.583	Mean :0.35	Mean :2.25	NA
3rd Qu.:225.25	NA	NA	3rd Qu.: 29388	3rd Qu.: 5.000	3rd Qu.:14.00	3rd Qu.: 66.00	3rd Qu.:20.000	3rd Qu.:1.00	3rd Qu.:3.00	NA
Max. :300.00	NA	NA	Max. :102686	Max. :10.000	Max. :24.00	Max. :103.00	Max. :25.000	Max. :1.00	Max. :4.00	NA

deals$won <- as.integer(deals$outcome == "Closed Won")
win_rate  <- mean(deals$won) * 100

cat("Total deals:", nrow(deals), "\n")

## Total deals: 300

cat("Win rate:   ", round(win_rate, 1), "%\n")

## Win rate:    24.7 %

Context: With ~300 deals and a ~22% win rate, this is a realistic B2B dataset. The challenge is figuring out which attributes separate the ~22% that close from the ~78% that don’t.

3. Correlation Analysis

numeric_cols <- c("won", "deal_size_usd", "num_meetings", "num_emails",
                  "days_to_close", "discount_pct", "competitor_mentioned",
                  "decision_makers_involved")

cor_matrix <- cor(deals[numeric_cols], use = "complete.obs")

cor_with_win <- cor_matrix[, "won"]
cor_with_win <- cor_with_win[names(cor_with_win) != "won"]
cor_with_win <- sort(cor_with_win, decreasing = TRUE)

cor_df <- data.frame(
  Variable    = names(cor_with_win),
  Correlation = round(cor_with_win, 3)
)

cor_df %>%
  kable("html", caption = "Correlation with Win Outcome", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(which(abs(cor_df$Correlation) > 0.15),
           bold = TRUE, background = "#d5f5e3")

Correlation with Win Outcome
Variable	Correlation
num_meetings	0.160
decision_makers_involved	0.156
num_emails	0.078
deal_size_usd	-0.019
competitor_mentioned	-0.047
days_to_close	-0.126
discount_pct	-0.175

melted <- melt(cor_matrix)

ggplot(melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(value, 2)), size = 3.5) +
  scale_fill_gradient2(low = pal_red, mid = "white", high = pal_blue,
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap — Deal Attributes", fill = "r") +
  theme_portfolio() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())

Interpretation: Variables with the strongest positive correlations to winning (e.g., num_meetings, decision_makers_involved) are the ones reps should double down on. Negative correlations (e.g., days_to_close) confirm that longer sales cycles hurt — speed matters. Weak or near-zero correlations (e.g., deal_size_usd) suggest that deal size alone doesn’t determine the outcome.

4. Win Rate by Categorical Variables

wr_source <- deals %>%
  group_by(deal_source) %>%
  summarise(win_pct = round(mean(won) * 100, 1), .groups = "drop") %>%
  arrange(win_pct)

wr_source$deal_source <- factor(wr_source$deal_source,
                                 levels = wr_source$deal_source)

ggplot(wr_source, aes(x = win_pct, y = deal_source)) +
  geom_col(fill = pal_blue, alpha = 0.85, width = 0.6) +
  geom_text(aes(label = paste0(win_pct, "%")),
            hjust = -0.15, size = 3.8) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Win Rate by Deal Source", x = "Win Rate (%)", y = NULL) +
  theme_portfolio()

wr_ind <- deals %>%
  group_by(industry) %>%
  summarise(win_pct = round(mean(won) * 100, 1), .groups = "drop") %>%
  arrange(win_pct)

wr_ind$industry <- factor(wr_ind$industry, levels = wr_ind$industry)

ggplot(wr_ind, aes(x = win_pct, y = industry)) +
  geom_col(fill = pal_orange, alpha = 0.85, width = 0.6) +
  geom_text(aes(label = paste0(win_pct, "%")),
            hjust = -0.15, size = 3.8) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Win Rate by Industry", x = "Win Rate (%)", y = NULL) +
  theme_portfolio()

Interpretation: The best-performing deal source is where marketing and sales should focus pipeline generation. The top industry vertical may warrant a dedicated playbook or specialized reps. Low-performing sources or industries aren’t necessarily bad — they may need different enablement or longer nurture sequences.

5. Feature Importance — Random Forest

feature_cols <- c("deal_size_usd", "num_meetings", "num_emails",
                  "days_to_close", "discount_pct", "competitor_mentioned",
                  "decision_makers_involved")

# Encode categoricals
deals$source_enc   <- as.integer(as.factor(deals$deal_source))
deals$industry_enc <- as.integer(as.factor(deals$industry))

all_features   <- c(feature_cols, "source_enc", "industry_enc")
feature_labels <- c(feature_cols, "deal_source", "industry")

set.seed(42)
rf <- randomForest(
  x = deals[all_features],
  y = as.factor(deals$won),
  ntree      = 500,
  importance = TRUE
)

imp <- data.frame(
  Variable   = feature_labels,
  Importance = round(importance(rf, type = 2)[, 1], 4)
) %>% arrange(desc(Importance))

imp %>%
  kable("html", caption = "Feature Importance — Ranked", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(1:3, bold = TRUE, background = "#d5f5e3")

Feature Importance — Ranked
Variable	Importance
deal_size_usd	22.8768
days_to_close	18.9162
num_emails	15.2692
num_meetings	13.0570
discount_pct	9.8072
industry	9.7192
deal_source	8.5826
decision_makers_involved	8.5149
competitor_mentioned	3.4231

imp_plot <- imp %>% arrange(Importance)
imp_plot$Variable <- factor(imp_plot$Variable, levels = imp_plot$Variable)

ggplot(imp_plot, aes(x = Importance, y = Variable)) +
  geom_col(fill = pal_green, alpha = 0.85, width = 0.6) +
  labs(title = "Feature Importance — Mean Decrease Gini",
       x = "Importance", y = NULL) +
  theme_portfolio()

Interpretation: The Random Forest ranks every feature by how much it helps separate won deals from lost ones. The top 3 drivers (highlighted in green above) are where the sales team should focus. Typically, engagement metrics (meetings, emails) and deal velocity (days to close) dominate — meaning how you sell matters more than what you sell or who you sell to.

6. Deep Dive — Top Driver

top_var <- imp$Variable[1]
cat("Top driver:", top_var, "\n")

## Top driver: deal_size_usd

if (top_var %in% names(deals) && is.numeric(deals[[top_var]])) {
  plot_df <- data.frame(
    outcome = factor(ifelse(deals$won == 1, "Won", "Lost"),
                     levels = c("Lost", "Won")),
    value   = deals[[top_var]]
  )

  ggplot(plot_df, aes(x = outcome, y = value, fill = outcome)) +
    geom_boxplot(alpha = 0.7, width = 0.5, outlier.alpha = 0.4) +
    scale_fill_manual(values = c("Lost" = pal_red, "Won" = pal_green)) +
    labs(title = paste0("Distribution of ", top_var, " — Won vs Lost"),
         y = top_var, x = NULL) +
    theme_portfolio() +
    theme(legend.position = "none")
}

Interpretation: A clear separation in the boxplots (e.g., won deals having noticeably higher meeting counts) confirms the Random Forest finding. If the distributions overlap heavily, the variable is important but not sufficient on its own — it works in combination with other factors.

7. Logistic Regression — Quantifying Effects

log_model <- glm(won ~ deal_size_usd + num_meetings + num_emails +
                   days_to_close + discount_pct + competitor_mentioned +
                   decision_makers_involved,
                 data = deals, family = binomial)

summary(log_model)

## 
## Call:
## glm(formula = won ~ deal_size_usd + num_meetings + num_emails + 
##     days_to_close + discount_pct + competitor_mentioned + decision_makers_involved, 
##     family = binomial, data = deals)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)              -2.019e+00  8.546e-01  -2.362  0.01817 * 
## deal_size_usd            -5.885e-06  1.091e-05  -0.540  0.58949   
## num_meetings              2.022e-01  7.405e-02   2.731  0.00632 **
## num_emails                6.755e-02  4.088e-02   1.653  0.09843 . 
## days_to_close            -1.645e-02  8.744e-03  -1.881  0.05992 . 
## discount_pct             -5.298e-02  1.649e-02  -3.213  0.00131 **
## competitor_mentioned     -2.624e-01  3.058e-01  -0.858  0.39093   
## decision_makers_involved  3.424e-01  1.321e-01   2.593  0.00951 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.19  on 299  degrees of freedom
## Residual deviance: 302.32  on 292  degrees of freedom
## AIC: 318.32
## 
## Number of Fisher Scoring iterations: 4

coefs   <- summary(log_model)$coefficients[-1, ]  # drop intercept
or_df   <- data.frame(
  Variable   = rownames(coefs),
  Odds_Ratio = round(exp(coefs[, "Estimate"]), 3),
  P_Value    = round(coefs[, "Pr(>|z|)"], 4)
) %>% arrange(P_Value)
rownames(or_df) <- NULL

or_df %>%
  kable("html", caption = "Logistic Regression — Odds Ratios", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(which(or_df$P_Value < 0.05),
           bold = TRUE, background = "#d5f5e3")

Logistic Regression — Odds Ratios
Variable	Odds_Ratio	P_Value
discount_pct	0.948	0.0013
num_meetings	1.224	0.0063
decision_makers_involved	1.408	0.0095
days_to_close	0.984	0.0599
num_emails	1.070	0.0984
competitor_mentioned	0.769	0.3909
deal_size_usd	1.000	0.5895

Interpretation: The odds ratio tells you the practical effect of each variable. For example, an odds ratio of 1.25 for num_meetings means each additional meeting increases the odds of winning by 25%. Variables with p < 0.05 (highlighted in green) are statistically significant. This complements the Random Forest by quantifying how much each lever moves the needle.

8. Summary & Recommendations

top3     <- imp$Variable[1:3]
sig_vars <- or_df$Variable[or_df$P_Value < 0.05]
best_src <- wr_source %>% filter(win_pct == max(win_pct))
best_ind <- wr_ind %>% filter(win_pct == max(win_pct))

findings <- data.frame(`Key Findings` = c(
  paste0("Top 3 drivers: ", paste(top3, collapse = ", ")),
  paste0("Statistically significant (p<0.05): ",
         ifelse(length(sig_vars) > 0, paste(sig_vars, collapse = ", "), "None")),
  paste0("Overall win rate: ", round(win_rate, 1), "%"),
  paste0("Best deal source: ", best_src$deal_source[1],
         " (", best_src$win_pct[1], "%)"),
  paste0("Best industry: ", best_ind$industry[1],
         " (", best_ind$win_pct[1], "%)")
), check.names = FALSE)

findings %>%
  kable("html", caption = "Revenue Drivers — Summary", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 13)

Revenue Drivers — Summary
Key Findings
Top 3 drivers: deal_size_usd, days_to_close, num_emails
Statistically significant (p<0.05): discount_pct, num_meetings, decision_makers_involved
Overall win rate: 24.7%
Best deal source: Partner (36.2%)
Best industry: Retail (34%)

Takeaways:

Engagement is king — the number of meetings and multi-threading (decision-makers involved) consistently rank as the top revenue drivers.

Speed wins — longer sales cycles correlate with lower win rates. Deals that stall should be flagged and fast-tracked or disqualified.

Discounts don’t save deals — discount percentage shows weak or negative correlation with winning, meaning price concessions alone don’t close business.

Recommendation: Prioritize rep coaching on discovery meetings and stakeholder mapping. Set pipeline velocity alerts for deals exceeding the median cycle length.

Recommendation: Double down on the top-performing deal source and industry vertical — allocate more SDR capacity there.

Project completed as part of RevOps Analytics Portfolio — Aya Hanouni

Revenue Drivers Analysis

Aya Hanouni