In our journey so far, we have become proficient at understanding and describing a single variable. We’ve learned to capture its central tendency with the mean and median, measure its variability with standard deviation, and visualize its distribution with histograms. This is the world of univariate analysis.
But the real world is interconnected. Business success rarely depends on a single factor. Answering the most critical business questions requires us to look at the relationships between variables.
To answer these questions, we must step into the world of Bivariate Descriptive Analysis. This is our focus for today: learning the tools and concepts to analyze two variables simultaneously.
Think of a univariate analysis as a detective finding a single clue at a crime scene—say, a footprint. That clue is useful. We can measure it, describe it, and learn something from it. But it doesn’t solve the case.
Bivariate analysis is like finding a second clue—security camera footage showing a specific person near the scene at the time of the crime. Now we can analyze the relationship between the two clues: does the footprint match the person in the video? By looking at them together, we start to build a real story and get closer to solving the mystery.
Today, we become data detectives. We will learn to find and interpret the stories hidden in the relationships between our variables.
## 📈 Welcome to the world of Bivariate Analysis!
## Today we will analyze a dataset of 1000 loan applicants to uncover hidden relationships.
## Our mission: To understand how different factors relate to credit risk and loan terms.
Let’s begin with the case where we have two categorical (or qualitative) variables. For example, we might want to analyze the relationship between a customer’s housing status (own, rent, for free) and their credit risk score (good, bad).
Our primary tool for this is the crosstabulation (or crosstab).
A crosstab is a table that shows the number of observations for every combination of the categories of two variables. Let’s break down its components using the notation from your PDF notes.
Let’s say we have two variables, X (with K categories) and Y (with J categories). The crosstab looks like this:
\[ \begin{array}{c|cccc|c} X \backslash Y & y_1 & y_2 & \dots & y_J & \text{TOTAL} \\ \hline x_1 & f_{11} & f_{12} & \dots & f_{1J} & R_1 \\ x_2 & f_{21} & f_{22} & \dots & f_{2J} & R_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ x_K & f_{K1} & f_{K2} & \dots & f_{KJ} & R_K \\ \hline \text{TOTAL} & C_1 & C_2 & \dots & C_J & n \\ \end{array} \]
Let’s work through Case A from your Lectures 8/9 notes.
Business Question: Does a company’s solvency rating depend on its industry?
First, we have the raw counts (joint absolute frequencies):
| Industry \ Solvency | Low | Average | High | Row TOTAL |
|---|---|---|---|---|
| Manufacturing | 36 | 124 | 80 | 240 |
| Financial | 12 | 64 | 84 | 160 |
| Column TOTAL | 48 | 188 | 164 | 400 |
Looking at raw counts can be misleading because the group sizes are different (240 vs. 160). To make a fair comparison, we must calculate conditional frequencies.
A conditional frequency tells us the distribution of one variable given a specific category of the other variable.
The formula for the conditional frequency of Y given X is: \[F(Y=y_j | X=x_k) = \frac{f_{kj}}{R_k}\]
Let’s calculate the conditional distribution of Solvency given Industry = “Manufacturing”: 36/240 = 15.0% (Low), 124/240 = 51.7% (Average), 80/240 = 33.3% (High).

Now for Industry = “Financial”: 12/160 = 7.5% (Low), 64/160 = 40.0% (Average), 84/160 = 52.5% (High).
This leads to the most important concept here: Statistical Independence.

* Independence: If the conditional distributions are the same for all categories, the variables are independent. Knowing one gives you no information about the other.
* Dependence: If the conditional distributions are different, the variables are dependent.
Conclusion for our example: The distributions (15.0%, 51.7%, 33.3%) and (7.5%, 40.0%, 52.5%) are clearly different. Therefore, solvency rating is dependent on the industry.
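We can verify these conditional distributions in base R using only the counts from the table (a minimal sketch; the object names here are ours, not from the lecture script):

```r
# Joint absolute frequencies from the Case A table
solvency <- matrix(c(36, 124, 80,
                     12,  64, 84),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Industry = c("Manufacturing", "Financial"),
                                   Solvency = c("Low", "Average", "High")))

# Row-conditional frequencies: distribution of Solvency given Industry
round(prop.table(solvency, margin = 1), 3)
# Manufacturing: 0.150 0.517 0.333
# Financial:     0.075 0.400 0.525
```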
Let’s use our simulated credit_risk dataset to answer: Does a customer’s credit risk (score) depend on their housing situation?
We use the distr.table.xy function from your UBStats library. We will look at the conditional frequencies of the score, given the housing type (y|x).
# Using the function from your class script
# freq.type = "y|x" means we are conditioning Y (score) on X (housing)
# freq = "prop" gives us the proportions (relative frequencies)
distr.table.xy(x = housing, y = score, data = credit_risk,
               freq.type = "y|x", freq = "prop")

## y|x: Proportions
## score
## housing bad good TOTAL
## for free 0.28 0.72 1.00
## own 0.29 0.71 1.00
## rent 0.33 0.67 1.00
Interpretation: The conditional distributions are not identical. The proportion of “bad” scores is highest for renters (0.33, or 33%) and somewhat lower for owners (0.29) and for those living for free (0.28). This indicates a (mild) dependence between credit score and housing.
Visualizing this makes the conclusion even clearer. We use distr.plot.xy with plot.type = "bars".
# The stacked bar chart is excellent for comparing proportions
distr.plot.xy(x = housing, y = score, data = credit_risk,
              freq.type = "y|x", freq = "prop",
              plot.type = "bars", bar.type = "stacked")

[Figure: Conditional distribution of Credit Score by Housing Type]
The visual evidence supports the table: the bar segments representing “bad” scores have different heights across the housing categories, confirming that credit score depends (modestly) on housing status.
What if we want to compare a numerical variable (like loan duration) across different categories of a qualitative variable (like credit score)?
Our main tools are:

1. Conditional Summary Measures: Calculate statistics like the mean, median, and standard deviation for the numerical variable, conditioned on each category.
2. Conditional Boxplots: Create a separate boxplot for the numerical variable for each category and place them side-by-side.
Let’s analyze the relationship between duration_months (quantitative) and score (qualitative) from our credit_risk dataset.
We use the distr.summary.x function with the by1 argument, as shown in your Video 5 script.
## 📊 Conditional Summary Statistics for Loan Duration by Credit Score:
# This function calculates the specified stats for the variable 'x'
# grouped by the categories in 'by1'.
summary_by_score <- distr.summary.x(x = duration_months,
                                    by1 = score,
                                    data = credit_risk,
                                    stats = "summary") # "summary" gives a full report

## $`Summary measures`
##   score   n n.a min q1 median  mean q3 max   sd   var
## 1   bad 293   0  11 22     25 25.22 28  43 4.93 24.32
## 2  good 707   0   7 17     20 20.13 23  36 4.40 19.38
Interpretation:

* Central Tendency: The mean and median duration for “bad” risk loans (about 25 months) are notably higher than for “good” risk loans (about 20 months).
* Dispersion: The sd (standard deviation) and var (variance) are also larger for the “bad” category, indicating that the loan durations for bad risks are not only longer on average but also more spread out.
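If you want to cross-check these conditional summaries without UBStats, base R can reproduce the key numbers (a minimal sketch, assuming the credit_risk data frame from the lecture is loaded):

```r
# Mean and standard deviation of loan duration, conditioned on credit score
aggregate(duration_months ~ score, data = credit_risk,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```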
A boxplot provides a powerful five-number summary (min, Q1, median, Q3, max) and helps visualize the differences we saw in the table.
## 📈 Conditional Boxplots for Loan Duration by Credit Score:
# Using the distr.plot.xy function with plot.type = "boxplot"
# Note that for boxplots, the categorical variable is 'x' and the numerical is 'y'
distr.plot.xy(x = score, y = duration_months, data = credit_risk,
              plot.type = "boxplot")

[Figure: Side-by-side boxplots of Loan Duration by Credit Score]
Interpretation:

* The entire box for the “bad” category is shifted upwards, visually confirming the higher median and quartiles.
* The box for “bad” is taller (larger interquartile range) and has longer whiskers, confirming greater variability.
* We can see several outliers (the small circles) for “good” risk loans, representing customers who have good credit but took out unusually long-term loans.
Now we arrive at the case of analyzing two numerical variables, like credit_amount and age.
The first, most crucial step is to create a scatterplot. It plots each pair of (x, y) values as a point and helps us see the form, direction, and strength of the relationship.
Covariance is a numerical measure of the direction of the linear relationship between two variables.
Formulas from your notes:

* Population Covariance: \(\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)\)
* Sample Covariance: \(s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
Interpretation of the Sign:

* \(s_{xy} > 0\): Positive linear relationship (as x goes up, y tends to go up).
* \(s_{xy} < 0\): Negative linear relationship (as x goes up, y tends to go down).
* \(s_{xy} = 0\): No linear relationship.
Let’s take a tiny dataset to see how the formula works. Data:

* X (Study Hours): 2, 4, 5, 7
* Y (Test Score): 65, 75, 80, 90
Step 1: Calculate the means \(\bar{x} = (2+4+5+7) / 4 = 4.5\) \(\bar{y} = (65+75+80+90) / 4 = 77.5\)
Step 2: Calculate deviations and their products
| \(x_i\) | \(y_i\) | \((x_i - \bar{x})\) | \((y_i - \bar{y})\) | \((x_i - \bar{x})(y_i - \bar{y})\) |
|---|---|---|---|---|
| 2 | 65 | -2.5 | -12.5 | 31.25 |
| 4 | 75 | -0.5 | -2.5 | 1.25 |
| 5 | 80 | 0.5 | 2.5 | 1.25 |
| 7 | 90 | 2.5 | 12.5 | 31.25 |
| **Sum** |  |  |  | **65.0** |
Step 3: Calculate the covariance \(s_{xy} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{65.0}{4-1} = \frac{65.0}{3} \approx 21.67\)
The positive result indicates a positive linear relationship.
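We can confirm the hand calculation with base R’s cov(), which uses the same n − 1 (sample) denominator; the vector names here are ours:

```r
x <- c(2, 4, 5, 7)      # study hours
y <- c(65, 75, 80, 90)  # test scores

cov(x, y)  # 65 / 3 = 21.66667, matching the hand calculation
```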
Limitation of Covariance: The value 21.67 is hard to interpret on its own. Is it strong? Weak? Its magnitude depends on the units of X and Y, so it cannot be used to judge the strength of a relationship.
To solve the problem of covariance, we standardize it to get the linear correlation coefficient (r), also known as Pearson’s r.
Formulas from your notes:

* Population Correlation: \(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\)
* Sample Correlation: \(r_{xy} = \frac{s_{xy}}{s_x s_y}\)
Properties of Correlation (r):

* It is a relative measure with no units.
* It always ranges between -1 and +1.
* \(r = +1\): Perfect positive linear relationship.
* \(r = -1\): Perfect negative linear relationship.
* \(r = 0\): No linear relationship.
* The absolute value, \(|r|\), measures the strength of the linear relationship.
Rule of Thumb for Strength (from notes):

* \(|r| < 0.3\): Weak linear correlation
* \(0.3 \le |r| \le 0.7\): Moderate linear correlation
* \(|r| > 0.7\): Strong linear correlation
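Applying this to our tiny study-hours example (reusing the x and y vectors defined above): \(s_x \approx 2.08\) and \(s_y \approx 10.41\), so \(r = 21.67 / (2.08 \times 10.41) = 1\). That is no accident: the four points lie exactly on the line y = 5x + 55, so the linear relationship is perfect.

```r
sd(x)      # 2.081666
sd(y)      # 10.40833
cor(x, y)  # exactly 1: the points fall on the line y = 5x + 55
```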
A bank analyst wants to predict duration_months. They want to know which variable is a better linear predictor: age or credit_amount?

**duration_months vs. age**

## 📊 Analysis of Loan Duration vs. Age
# Visualize with a scatterplot first
distr.plot.xy(x = age, y = duration_months, data = credit_risk,
              plot.type = "scatter")

[Figure: Scatterplot of Loan Duration vs. Age]
# Calculate the correlation using the cor() function
cor_age_dur <- cor(credit_risk$age, credit_risk$duration_months)
cat("\nCorrelation between Age and Loan Duration:", round(cor_age_dur, 4), "\n")##
## Correlation between Age and Loan Duration: 0.0103
Interpretation: The scatterplot shows a diffuse cloud of points with no clear trend. The correlation coefficient r is very close to zero (0.0103), confirming there is virtually no linear relationship. Age is a poor linear predictor of loan duration.
**duration_months vs. credit_amount**

## 📊 Analysis of Loan Duration vs. Credit Amount
# Scatterplot
distr.plot.xy(x = credit_amount, y = duration_months, data = credit_risk,
              plot.type = "scatter")

[Figure: Scatterplot of Loan Duration vs. Credit Amount]
# Correlation
cor_amt_dur <- cor(credit_risk$credit_amount, credit_risk$duration_months)
cat("\nCorrelation between Credit Amount and Loan Duration:", round(cor_amt_dur, 4), "\n")##
## Correlation between Credit Amount and Loan Duration: 0.1667
Interpretation: The scatterplot shows a mild positive trend: as the credit amount increases, the duration of the loan tends to increase. The correlation coefficient r is 0.1667, which by our rule of thumb indicates a weak (though clearly positive) linear relationship.
## 🎯 Final Conclusion:
## Comparing the strength of the relationships:
## |cor(duration, age)| = 0.0103 (Very Weak)
## |cor(duration, credit_amount)| = 0.1667 (Weak)
cat("\nBecause the absolute correlation is much higher for `credit_amount`, it would be a **better linear predictor** of `duration_months`.\n")##
## Because the absolute correlation is much higher for `credit_amount`, it would be a **better linear predictor** of `duration_months`.
Understanding the math is only half the battle. A good analyst must also be aware of the common traps and advanced scenarios.
This is the most important rule in statistics. Just because two variables are strongly correlated does not prove that one causes the other. A strong correlation could be due to:

1. Causation: X causes Y (or Y causes X).
2. Confounding Variable: A third, hidden variable Z is causing both X and Y to change. This leads to Spurious Correlation.
3. Coincidence: The relationship is purely random, especially in small datasets.
Let’s examine the Furniture Advertising use case from your Lectures 10 notes.
Business Question: Does spending on ATL advertising have a strong positive impact on company revenues?
# Recreate the data from the PDF for this example
year <- 2001:2022
revenues <- c(327, 456, 468, 497, 506, 573, 661, 741, 809, 717, 827, 996, 968, 997, 1006, 1073, 1161, 1241, 1309, 1217, 1327, 1496)
atl_advertising <- c(29.8, 30.1, 30.5, 30.6, 31.5, 31.7, 32.6, 33.1, 32.7, 32.8, 33.8, 34.1, 35.2, 34.6, 35.5, 35.7, 36.6, 37.1, 36.7, 36.8, 37.8, 38.1)
revenues_adv <- data.frame(year, revenues, atl_advertising)
revenues_adv$delta_revenues <- c(NA, diff(revenues_adv$revenues))
revenues_adv$delta_atl_adv <- c(NA, diff(revenues_adv$atl_advertising))

# Scatterplot of the original variables
distr.plot.xy(x = revenues, y = atl_advertising, data = revenues_adv,
              plot.type = "scatter")

[Figure: Revenues vs. ATL Advertising Spend (2001-2022)]
# Calculate the correlation
cor_rev_adv <- cor(revenues_adv$revenues, revenues_adv$atl_advertising)
cat("🚀 Correlation between Revenues and ATL Advertising:", round(cor_rev_adv, 4), "\n")## 🚀 Correlation between Revenues and ATL Advertising: 0.9856
The correlation is \(r = 0.9856\), which is almost perfect! It’s tempting to conclude that every dollar spent on advertising causes a massive increase in revenue. But this is wrong.
Both variables have a strong increasing trend over time. The economy grew, the company grew, and prices increased. This “lurking variable” of time is causing both to increase together, creating a spurious correlation.
To see the true relationship, we can analyze the correlation between the year-on-year changes (delta_revenues and delta_atl_adv). This tells us whether an increase in ad spending in a given year is associated with an increase in revenues in that same year.
# Analyze the delta variables, removing the first NA row
cor_delta <- cor(revenues_adv$delta_revenues[-1], revenues_adv$delta_atl_adv[-1])
distr.plot.xy(x = delta_revenues, y = delta_atl_adv, data = revenues_adv,
              plot.type = "scatter")

[Figure: Change in Revenues vs. Change in ATL Advertising]

cat("📉 Correlation between DELTA Revenues and DELTA ATL Advertising:", round(cor_delta, 4), "\n")
## 📉 Correlation between DELTA Revenues and DELTA ATL Advertising: 0.0951
Conclusion: The correlation is now \(r = 0.0951\), which is effectively zero. After removing the common trend, there is no meaningful linear relationship between the yearly change in ad spend and the yearly change in revenue. The initial high correlation was spurious.
This is a fascinating statistical phenomenon where a trend that appears in different groups of data disappears or reverses when those groups are combined. It’s caused by an ignored confounding factor.
Let’s walk through the loan approval example from Lectures 10.
# Data from the aggregated table in the PDF
agg_data <- data.frame(
  CustomerType = c("External customers", "Internal customers"),
  Approved = c(200, 160),
  Denied = c(200, 240)
)
agg_data$Total <- agg_data$Approved + agg_data$Denied
agg_data$ApprovalRate <- paste0(round(agg_data$Approved / agg_data$Total * 100), "%")
cat("📊 Aggregated Loan Approval Data:\n")## 📊 Aggregated Loan Approval Data:
| CustomerType | Approved | Denied | Total | ApprovalRate |
|---|---|---|---|---|
| External customers | 200 | 200 | 400 | 50% |
| Internal customers | 160 | 240 | 400 | 40% |
Initial Conclusion: External customers have a 50% approval rate, while internal customers only have a 40% rate. It seems the bank discriminates against its own customers!
But what if age is a factor? The bank is naturally less likely to approve loans for younger, riskier applicants.
# Data for Younger Clients (<=35)
young_data <- data.frame(
  CustomerType = c("External customers", "Internal customers"),
  Approved = c(20, 90),
  Denied = c(80, 210)
)
young_data$Total <- young_data$Approved + young_data$Denied
young_data$ApprovalRate <- paste0(round(young_data$Approved / young_data$Total * 100), "%")
# Data for Older Clients (>35)
old_data <- data.frame(
  CustomerType = c("External customers", "Internal customers"),
  Approved = c(180, 70),
  Denied = c(120, 30)
)
old_data$Total <- old_data$Approved + old_data$Denied
old_data$ApprovalRate <- paste0(round(old_data$Approved / old_data$Total * 100), "%")
cat("Age Class <= 35 (Younger Clients):\n")## Age Class <= 35 (Younger Clients):
| CustomerType | Approved | Denied | Total | ApprovalRate |
|---|---|---|---|---|
| External customers | 20 | 80 | 100 | 20% |
| Internal customers | 90 | 210 | 300 | 30% |
##
## For younger clients, Internal customers have a HIGHER approval rate (30% vs 20%)!
##
## Age Class > 35 (Older Clients):
| CustomerType | Approved | Denied | Total | ApprovalRate |
|---|---|---|---|---|
| External customers | 180 | 120 | 300 | 60% |
| Internal customers | 70 | 30 | 100 | 70% |
##
## For older clients, Internal customers ALSO have a HIGHER approval rate (70% vs 60%)!
The Paradox Explained: The trend completely reverses! In both age groups, internal customers are treated better. The paradox occurred because the groups were unbalanced. Most internal applicants were young and risky (300/400), while most external applicants were older and safer (300/400). The overall average was skewed by this confounding factor of age.
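The arithmetic behind the reversal is just a weighted average of the group approval rates, weighted by each group’s share of applicants; a quick check in R (the variable names are ours):

```r
# Internal customers: 300 of 400 are young (30% approval), 100 are older (70%)
internal_overall <- (300/400) * 0.30 + (100/400) * 0.70  # = 0.40, i.e. 40%

# External customers: 100 of 400 are young (20% approval), 300 are older (60%)
external_overall <- (100/400) * 0.20 + (300/400) * 0.60  # = 0.50, i.e. 50%
```

Each overall rate is pulled toward the age group that dominates that customer type, which is exactly how the reversal arises.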
| Variable Types | Graphical Analysis | Numerical Analysis | Key Question |
|---|---|---|---|
| Two Qualitative | Stacked/Side-by-Side Bar Charts | Crosstabs, Conditional Frequencies | Are the conditional distributions different? (Dependence) |
| One Qual, One Quant | Conditional Boxplots | Conditional Summary Statistics (means, medians) | Do the summary stats differ across categories? |
| Two Quantitative | Scatterplot | Covariance, Correlation Coefficient (r) | Is there a linear relationship? How strong and in what direction? |
You are now equipped with a powerful set of tools to move beyond simple descriptions and start uncovering the rich, complex relationships that exist in your data. This is the heart of data analysis and the foundation for building predictive models and making informed business decisions.
Keep practicing, stay critical of your results, and always ask: “What else could be influencing this relationship?”
🎓 End of Lecture 4 - Fantastic work! You’re ready to tackle real-world data problems.
## 📋 Session Information:
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] UBStats_0.2.2
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
## [13] compiler_4.5.1 tools_4.5.1 evaluate_1.0.4 bslib_0.9.0
## [17] yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0