In our journey so far, we have become proficient at understanding and describing a single variable. We’ve learned to capture its central tendency with the mean and median, measure its variability with standard deviation, and visualize its distribution with histograms. This is the world of univariate analysis.
But the real world is interconnected. Business success rarely depends on a single factor. Answering the most critical business questions requires us to look at the relationships between variables.
To answer these questions, we must step into the world of Bivariate Descriptive Analysis. This is our focus for today: learning the tools and concepts to analyze two variables simultaneously.
Think of a univariate analysis as a detective finding a single clue at a crime scene—say, a footprint. That clue is useful. We can measure it, describe it, and learn something from it. But it doesn’t solve the case.
Bivariate analysis is like finding a second clue—security camera footage showing a specific person near the scene at the time of the crime. Now we can analyze the relationship between the two clues: does the footprint match the person in the video? By looking at them together, we start to build a real story and get closer to solving the mystery.
Today, we become data detectives. We will learn to find and interpret the stories hidden in the relationships between our variables.
## 📈 Welcome to the world of Bivariate Analysis!
## Today we will analyze a dataset of 1000 loan applicants to uncover hidden relationships.
## Our mission: To understand how different factors relate to credit risk and loan terms.
Let’s begin with the case where we have two categorical (or qualitative) variables. For example, we might want to analyze the relationship between a customer’s housing status (own, rent, for free) and their credit risk score (good, bad).
Our primary tool for this is the crosstabulation (or crosstab).
A crosstab is a table that shows the number of observations for every combination of the categories of two variables. Let’s break down its components using the notation from your PDF notes.
Let’s say we have two variables, X (with K categories) and Y (with J categories). The crosstab looks like this:
\[ \begin{array}{c|cccc|c} X \backslash Y & y_1 & y_2 & \dots & y_J & \text{TOTAL} \\ \hline x_1 & f_{11} & f_{12} & \dots & f_{1J} & R_1 \\ x_2 & f_{21} & f_{22} & \dots & f_{2J} & R_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ x_K & f_{K1} & f_{K2} & \dots & f_{KJ} & R_K \\ \hline \text{TOTAL} & C_1 & C_2 & \dots & C_J & n \\ \end{array} \]
Let’s work through Case A from your Lectures 8/9 notes.
Business Question: Does a company’s solvency rating depend on its industry?
First, we have the raw counts (joint absolute frequencies):
| Industry \ Solvency | Low | Average | High | Row TOTAL |
|---|---|---|---|---|
| Manufacturing | 36 | 124 | 80 | 240 |
| Financial | 12 | 64 | 84 | 160 |
| Column TOTAL | 48 | 188 | 164 | 400 |
Looking at raw counts can be misleading because the group sizes are different (240 vs. 160). To make a fair comparison, we must calculate conditional frequencies.
A conditional frequency tells us the distribution of one variable given a specific category of the other variable.
The formula for the conditional frequency of Y given X is: \[F(Y=y_j | X=x_k) = \frac{f_{kj}}{R_k}\]
Let’s calculate the conditional distribution of Solvency given Industry = “Manufacturing”: 36/240 = 15.0% (Low), 124/240 = 51.7% (Average), 80/240 = 33.3% (High).

Now for Industry = “Financial”: 12/160 = 7.5% (Low), 64/160 = 40.0% (Average), 84/160 = 52.5% (High).
This leads to the most important concept here: Statistical Independence.

* Independence: If the conditional distributions are the same for all categories, the variables are independent. Knowing one gives you no information about the other.
* Dependence: If the conditional distributions are different, the variables are dependent.
Conclusion for our example: The distributions (15.0%, 51.7%, 33.3%) and (7.5%, 40.0%, 52.5%) are clearly different. Therefore, solvency rating is dependent on the industry.
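We can verify these conditional distributions in base R using only the counts from the table (a minimal sketch; the object names here are ours, not from the lecture script):

```r
# Joint absolute frequencies from the Case A table
solvency <- matrix(c(36, 124, 80,
                     12,  64, 84),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Industry = c("Manufacturing", "Financial"),
                                   Solvency = c("Low", "Average", "High")))

# Row-conditional frequencies: distribution of Solvency given Industry
round(prop.table(solvency, margin = 1), 3)
# Manufacturing: 0.150 0.517 0.333
# Financial:     0.075 0.400 0.525
```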
Let’s use our simulated credit_risk dataset to answer: Does a customer’s credit risk (score) depend on their housing situation?
We use the distr.table.xy function from your UBStats library. We will look at the conditional frequencies of the score, given the housing type (y|x).
# Using the function from your class script
# freq.type = "y|x" means we are conditioning Y (score) on X (housing)
# freq = "prop" gives us the proportions (relative frequencies)
distr.table.xy(x = housing, y = score, data = credit_risk,
               freq.type = "y|x", freq = "prop")

## y|x: Proportions
## score
## housing bad good TOTAL
## for free 0.28 0.72 1.00
## own 0.29 0.71 1.00
## rent 0.33 0.67 1.00
Interpretation: The conditional distributions are not identical. The proportion of “bad” scores is highest for renters (0.33, or 33%) and somewhat lower for owners (0.29) and for those living for free (0.28). This indicates a (mild) dependence between credit score and housing.
Visualizing this makes the conclusion even clearer. We use distr.plot.xy with plot.type = "bars".
# The stacked bar chart is excellent for comparing proportions
distr.plot.xy(x = housing, y = score, data = credit_risk,
              freq.type = "y|x", freq = "prop",
              plot.type = "bars", bar.type = "stacked")

[Figure: Conditional distribution of Credit Score by Housing Type]
The visual evidence supports the table: the bar segments representing “bad” scores have different heights across the housing categories, confirming that credit score depends (modestly) on housing status.
What if we want to compare a numerical variable (like loan duration) across different categories of a qualitative variable (like credit score)?
Our main tools are:

1. Conditional Summary Measures: Calculate statistics like the mean, median, and standard deviation for the numerical variable, conditioned on each category.
2. Conditional Boxplots: Create a separate boxplot for the numerical variable for each category and place them side-by-side.
Let’s analyze the relationship between duration_months (quantitative) and score (qualitative) from our credit_risk dataset.
We use the distr.summary.x function with the by1 argument, as shown in your Video 5 script.
## 📊 Conditional Summary Statistics for Loan Duration by Credit Score:
# This function calculates the specified stats for the variable 'x'
# grouped by the categories in 'by1'.
summary_by_score <- distr.summary.x(x = duration_months,
                                    by1 = score,
                                    data = credit_risk,
                                    stats = "summary") # "summary" gives a full report

## $`Summary measures`
##   score   n n.a min q1 median  mean q3 max   sd   var
## 1   bad 293   0  11 22     25 25.22 28  43 4.93 24.32
## 2  good 707   0   7 17     20 20.13 23  36 4.40 19.38
Interpretation:

* Central Tendency: The mean and median duration for “bad” risk loans (about 25 months) are notably higher than for “good” risk loans (about 20 months).
* Dispersion: The sd (standard deviation) and var (variance) are also larger for the “bad” category, indicating that the loan durations for bad risks are not only longer on average but also more spread out.
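If you want to cross-check these conditional summaries without UBStats, base R can reproduce the key numbers (a minimal sketch, assuming the credit_risk data frame from the lecture is loaded):

```r
# Mean and standard deviation of loan duration, conditioned on credit score
aggregate(duration_months ~ score, data = credit_risk,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```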
A boxplot provides a powerful five-number summary (min, Q1, median, Q3, max) and helps visualize the differences we saw in the table.
## 📈 Conditional Boxplots for Loan Duration by Credit Score:
# Using the distr.plot.xy function with plot.type = "boxplot"
# Note that for boxplots, the categorical variable is 'x' and the numerical is 'y'
distr.plot.xy(x = score, y = duration_months, data = credit_risk,
              plot.type = "boxplot")

[Figure: Side-by-side boxplots of Loan Duration by Credit Score]
Interpretation:

* The entire box for the “bad” category is shifted upwards, visually confirming the higher median and quartiles.
* The box for “bad” is taller (larger interquartile range) and has longer whiskers, confirming greater variability.
* We can see several outliers (the small circles) for “good” risk loans, representing customers who have good credit but took out unusually long-term loans.
Now we arrive at the case of analyzing two numerical variables, like credit_amount and age.
The first, most crucial step is to create a scatterplot. It plots each pair of (x, y) values as a point and helps us see the form, direction, and strength of the relationship.
Covariance is a numerical measure of the direction of the linear relationship between two variables.
Formulas from your notes:

* Population Covariance: \(\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)\)
* Sample Covariance: \(s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
Interpretation of the Sign:

* \(s_{xy} > 0\): Positive linear relationship (as x goes up, y tends to go up).
* \(s_{xy} < 0\): Negative linear relationship (as x goes up, y tends to go down).
* \(s_{xy} = 0\): No linear relationship.
Let’s take a tiny dataset to see how the formula works. Data:

* X (Study Hours): 2, 4, 5, 7
* Y (Test Score): 65, 75, 80, 90
Step 1: Calculate the means \(\bar{x} = (2+4+5+7) / 4 = 4.5\) \(\bar{y} = (65+75+80+90) / 4 = 77.5\)
Step 2: Calculate deviations and their products
| \(x_i\) | \(y_i\) | \((x_i - \bar{x})\) | \((y_i - \bar{y})\) | \((x_i - \bar{x})(y_i - \bar{y})\) |
|---|---|---|---|---|
| 2 | 65 | -2.5 | -12.5 | 31.25 |
| 4 | 75 | -0.5 | -2.5 | 1.25 |
| 5 | 80 | 0.5 | 2.5 | 1.25 |
| 7 | 90 | 2.5 | 12.5 | 31.25 |
| **Sum** |  |  |  | **65.0** |
Step 3: Calculate the covariance \(s_{xy} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{65.0}{4-1} = \frac{65.0}{3} \approx 21.67\)
The positive result indicates a positive linear relationship.
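We can confirm the hand calculation with base R’s cov(), which uses the same n − 1 (sample) denominator; the vector names here are ours:

```r
x <- c(2, 4, 5, 7)      # study hours
y <- c(65, 75, 80, 90)  # test scores

cov(x, y)  # 65 / 3 = 21.66667, matching the hand calculation
```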
Limitation of Covariance: The value 21.67 is hard to interpret on its own. Is it strong? Weak? Its magnitude depends on the units of X and Y, so it cannot be used to judge the strength of a relationship.
To solve the problem of covariance, we standardize it to get the linear correlation coefficient (r), also known as Pearson’s r.
Formulas from your notes:

* Population Correlation: \(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\)
* Sample Correlation: \(r_{xy} = \frac{s_{xy}}{s_x s_y}\)
Properties of Correlation (r):

* It is a relative measure with no units.
* It always ranges between -1 and +1.
* \(r = +1\): Perfect positive linear relationship.
* \(r = -1\): Perfect negative linear relationship.
* \(r = 0\): No linear relationship.
* The absolute value, \(|r|\), measures the strength of the linear relationship.
Rule of Thumb for Strength (from notes):

* \(|r| < 0.3\): Weak linear correlation
* \(0.3 \le |r| \le 0.7\): Moderate linear correlation
* \(|r| > 0.7\): Strong linear correlation
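Applying this to our tiny study-hours example (reusing the x and y vectors defined above): \(s_x \approx 2.08\) and \(s_y \approx 10.41\), so \(r = 21.67 / (2.08 \times 10.41) = 1\). That is no accident: the four points lie exactly on the line y = 5x + 55, so the linear relationship is perfect.

```r
sd(x)      # 2.081666
sd(y)      # 10.40833
cor(x, y)  # exactly 1: the points fall on the line y = 5x + 55
```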
A bank analyst wants to predict duration_months. They want to know which variable is a better linear predictor: age or credit_amount?

**duration_months vs. age**

## 📊 Analysis of Loan Duration vs. Age
# Visualize with a scatterplot first
distr.plot.xy(x = age, y = duration_months, data = credit_risk,
              plot.type = "scatter")

[Figure: Scatterplot of Loan Duration vs. Age]
# Calculate the correlation using the cor() function
cor_age_dur <- cor(credit_risk$age, credit_risk$duration_months)
cat("\nCorrelation between Age and Loan Duration:", round(cor_age_dur, 4), "\n")##
## Correlation between Age and Loan Duration: 0.0103
Interpretation: The scatterplot shows a diffuse cloud of points with no clear trend. The correlation coefficient r is very close to zero (0.0103), confirming there is virtually no linear relationship. Age is a poor linear predictor of loan duration.
**duration_months vs. credit_amount**

## 📊 Analysis of Loan Duration vs. Credit Amount
# Scatterplot
distr.plot.xy(x = credit_amount, y = duration_months, data = credit_risk,
              plot.type = "scatter")

[Figure: Scatterplot of Loan Duration vs. Credit Amount]
# Correlation
cor_amt_dur <- cor(credit_risk$credit_amount, credit_risk$duration_months)
cat("\nCorrelation between Credit Amount and Loan Duration:", round(cor_amt_dur, 4), "\n")##
## Correlation between Credit Amount and Loan Duration: 0.1667
Interpretation: The scatterplot shows a mild positive trend: as the credit amount increases, the duration of the loan tends to increase. The correlation coefficient r is 0.1667, which by our rule of thumb indicates a weak (though clearly positive) linear relationship.
## 🎯 Final Conclusion:
## Comparing the strength of the relationships:
## |cor(duration, age)| = 0.0103 (Very Weak)
## |cor(duration, credit_amount)| = 0.1667 (Weak)
cat("\nBecause the absolute correlation is much higher for `credit_amount`, it would be a **better linear predictor** of `duration_months`.\n")##
## Because the absolute correlation is much higher for `credit_amount`, it would be a **better linear predictor** of `duration_months`.
Understanding the math is only half the battle. A good analyst must also be aware of the common traps and advanced scenarios.
This is the most important rule in statistics. Just because two variables are strongly correlated does not prove that one causes the other. A strong correlation could be due to:

1. Causation: X causes Y (or Y causes X).
2. Confounding Variable: A third, hidden variable Z is causing both X and Y to change. This leads to Spurious Correlation.
3. Coincidence: The relationship is purely random, especially in small datasets.
Let’s examine the Furniture Advertising use case from your Lectures 10 notes.
Business Question: Does spending on ATL advertising have a strong positive impact on company revenues?
# Recreate the data from the PDF for this example
year <- 2001:2022
revenues <- c(327, 456, 468, 497, 506, 573, 661, 741, 809, 717, 827, 996, 968, 997, 1006, 1073, 1161, 1241, 1309, 1217, 1327, 1496)
atl_advertising <- c(29.8, 30.1, 30.5, 30.6, 31.5, 31.7, 32.6, 33.1, 32.7, 32.8, 33.8, 34.1, 35.2, 34.6, 35.5, 35.7, 36.6, 37.1, 36.7, 36.8, 37.8, 38.1)
revenues_adv <- data.frame(year, revenues, atl_advertising)
revenues_adv$delta_revenues <- c(NA, diff(revenues_adv$revenues))
revenues_adv$delta_atl_adv <- c(NA, diff(revenues_adv$atl_advertising))

# Scatterplot of the original variables
distr.plot.xy(x = revenues, y = atl_advertising, data = revenues_adv,
              plot.type = "scatter")

[Figure: Revenues vs. ATL Advertising Spend (2001-2022)]
# Calculate the correlation
cor_rev_adv <- cor(revenues_adv$revenues, revenues_adv$atl_advertising)
cat("🚀 Correlation between Revenues and ATL Advertising:", round(cor_rev_adv, 4), "\n")## 🚀 Correlation between Revenues and ATL Advertising: 0.9856
The correlation is \(r = 0.9856\), which is almost perfect! It’s tempting to conclude that every dollar spent on advertising causes a massive increase in revenue. But this is wrong.
Both variables have a strong increasing trend over time. The economy grew, the company grew, and prices increased. This “lurking variable” of time is causing both to increase together, creating a spurious correlation.
To see the true relationship, we can analyze the correlation between the year-on-year changes (delta_revenues and delta_atl_adv). This tells us whether an increase in ad spending in a given year is associated with an increase in revenues in that same year.
# Analyze the delta variables, removing the first NA row
cor_delta <- cor(revenues_adv$delta_revenues[-1], revenues_adv$delta_atl_adv[-1])
distr.plot.xy(x = delta_revenues, y = delta_atl_adv, data = revenues_adv,
              plot.type = "scatter")

[Figure: Change in Revenues vs. Change in ATL Advertising]

cat("📉 Correlation between DELTA Revenues and DELTA ATL Advertising:", round(cor_delta, 4), "\n")
## 📉 Correlation between DELTA Revenues and DELTA ATL Advertising: 0.0951
Conclusion: The correlation is now \(r = 0.0951\), which is effectively zero. After removing the common trend, there is no meaningful linear relationship between the yearly change in ad spend and the yearly change in revenue. The initial high correlation was spurious.
This is a fascinating statistical phenomenon where a trend that appears in different groups of data disappears or reverses when those groups are combined. It’s caused by an ignored confounding factor.
Let’s walk through the loan approval example from Lectures 10.
# Data from the aggregated table in the PDF
agg_data <- data.frame(
  CustomerType = c("External customers", "Internal customers"),
  Approved = c(200, 160),
  Denied = c(200, 240)
)
agg_data$Total <- agg_data$Approved + agg_data$Denied
agg_data$ApprovalRate <- paste0(round(agg_data$Approved / agg_data$Total * 100), "%")
cat("📊 Aggregated Loan Approval Data:\n")## 📊 Aggregated Loan Approval Data:
| CustomerType | Approved | Denied | Total | ApprovalRate |
|---|---|---|---|---|
| External customers | 200 | 200 | 400 | 50% |
| Internal customers | 160 | 240 | 400 | 40% |
Initial Conclusion: External customers have a 50% approval rate, while internal customers only have a 40% rate. It seems the bank discriminates against its own customers!
But what if age is a factor? The bank is naturally less likely to approve loans for younger, riskier applicants.
# Data for Younger Clients (<=35)
young_data <- data.frame(
  CustomerType = c("External customers", "Internal customers"),
  Approved = c(20, 90),
  Denied = c(80, 210)
)
young_data$Total <- young_data$Approved + young_data$Denied
young_data$ApprovalRate <- paste0(round(young_data$Approved / young_data$Total * 100), "%")
# Data for Older Clients (>35)
old_data <- data.frame(
  CustomerType = c("External customers", "Internal customers"),
  Approved = c(180, 70),
  Denied = c(120, 30)
)
old_data$Total <- old_data$Approved + old_data$Denied
old_data$ApprovalRate <- paste0(round(old_data$Approved / old_data$Total * 100), "%")
cat("Age Class <= 35 (Younger Clients):\n")## Age Class <= 35 (Younger Clients):
| CustomerType | Approved | Denied | Total | ApprovalRate |
|---|---|---|---|---|
| External customers | 20 | 80 | 100 | 20% |
| Internal customers | 90 | 210 | 300 | 30% |
##
## For younger clients, Internal customers have a HIGHER approval rate (30% vs 20%)!
##
## Age Class > 35 (Older Clients):
| CustomerType | Approved | Denied | Total | ApprovalRate |
|---|---|---|---|---|
| External customers | 180 | 120 | 300 | 60% |
| Internal customers | 70 | 30 | 100 | 70% |
##
## For older clients, Internal customers ALSO have a HIGHER approval rate (70% vs 60%)!
The Paradox Explained: The trend completely reverses! In both age groups, internal customers are treated better. The paradox occurred because the groups were unbalanced. Most internal applicants were young and risky (300/400), while most external applicants were older and safer (300/400). The overall average was skewed by this confounding factor of age.
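The arithmetic behind the reversal is just a weighted average of the group approval rates, weighted by each group’s share of applicants; a quick check in R (the variable names are ours):

```r
# Internal customers: 300 of 400 are young (30% approval), 100 are older (70%)
internal_overall <- (300/400) * 0.30 + (100/400) * 0.70  # = 0.40, i.e. 40%

# External customers: 100 of 400 are young (20% approval), 300 are older (60%)
external_overall <- (100/400) * 0.20 + (300/400) * 0.60  # = 0.50, i.e. 50%
```

Each overall rate is pulled toward the age group that dominates that customer type, which is exactly how the reversal arises.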
| Variable Types | Graphical Analysis | Numerical Analysis | Key Question |
|---|---|---|---|
| Two Qualitative | Stacked/Side-by-Side Bar Charts | Crosstabs, Conditional Frequencies | Are the conditional distributions different? (Dependence) |
| One Qual, One Quant | Conditional Boxplots | Conditional Summary Statistics (means, medians) | Do the summary stats differ across categories? |
| Two Quantitative | Scatterplot | Covariance, Correlation Coefficient (r) | Is there a linear relationship? How strong and in what direction? |
You are now equipped with a powerful set of tools to move beyond simple descriptions and start uncovering the rich, complex relationships that exist in your data. This is the heart of data analysis and the foundation for building predictive models and making informed business decisions.
Keep practicing, stay critical of your results, and always ask: “What else could be influencing this relationship?”
🎓 End of Lecture 4 - Fantastic work! You’re ready to tackle real-world data problems.
## 📋 Session Information:
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] UBStats_0.2.2
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
## [13] compiler_4.5.1 tools_4.5.1 evaluate_1.0.4 bslib_0.9.0
## [17] yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0