Chapter 1: Introduction - Completing the Statistical Picture

Welcome Back!

In our previous lectures, we learned how to find the “typical” value in our data using measures of central tendency (mean, median, mode). But that's only half the story! Today we complete the picture by understanding:

  1. How spread out our data is (measures of variability)
  2. How two variables relate to each other (bivariate analysis)

Think of it this way: if central tendency tells us the “average student grade,” variability tells us whether all students performed similarly or if there's a huge range from failing to excellent grades.

The Restaurant Analogy - Why Variability Matters

Imagine two pizza delivery restaurants:

  • Restaurant A: Delivery times: 28, 29, 30, 31, 32 minutes (Average: 30 minutes)
  • Restaurant B: Delivery times: 10, 20, 30, 40, 50 minutes (Average: 30 minutes)

Both have the same average, but which would you prefer? Restaurant A is predictable and reliable, while Restaurant B is unpredictable. This is why we need measures of variability!
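A minimal R sketch of this analogy (restA and restB are just the delivery times listed above; sd() is the sample standard deviation we define formally in Chapter 5):

# Same average, very different spread
restA <- c(28, 29, 30, 31, 32)   # Restaurant A delivery times (minutes)
restB <- c(10, 20, 30, 40, 50)   # Restaurant B delivery times (minutes)
mean(restA); mean(restB)         # both 30 minutes
sd(restA)                        # about 1.6 minutes - very consistent
sd(restB)                        # about 15.8 minutes - highly unpredictable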

# Load required libraries
library(UBStats)

# Use built-in mtcars dataset and modify it for our examples
cars <- mtcars

# Add some columns to match our examples
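# Note: the columns below are simulated with random draws (sample, rnorm, rpois),
# so without a set.seed() call the exact numbers in the output further down
# will differ slightly from run to run.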
cars$country <- sample(c("Germany", "Japan", "France", "Italy", "United States"), 
                      nrow(cars), replace=TRUE, 
                      prob=c(0.3, 0.25, 0.15, 0.15, 0.15))

cars$price_num <- cars$hp * 100 + rnorm(nrow(cars), 15000, 5000)
cars$price_num <- pmax(cars$price_num, 10000)  # Minimum price $10,000

cars$price_classes <- cut(cars$price_num, 
                         breaks=c(0, 20000, 35000, Inf),
                         labels=c("low", "mid", "high"))

cars$maxspeed <- cars$hp * 2.5 + rnorm(nrow(cars), 50, 15)
cars$acceleration <- 20 - (cars$hp/20) + rnorm(nrow(cars), 0, 2)
cars$weight <- cars$wt * 500
cars$sales <- rpois(nrow(cars), 8000)

# Let's start our journey!
cat("πŸš— Today we'll analyze", nrow(cars), "car models\n")
## πŸš— Today we'll analyze 32 car models
cat("πŸ“Š We'll explore both variability and relationships between variables\n")
## πŸ“Š We'll explore both variability and relationships between variables

Chapter 2: Range and Interquartile Range - Simple but Important

2.1 Range - The Distance Between Extremes

The range is the simplest measure of spread:

\[\text{Range} = x_{\text{max}} - x_{\text{min}}\]
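In R the range is simply max(x) - min(x); a quick sketch (note that the base function range() returns the pair c(min, max), not their difference):

x <- c(1, 2, 3, 4, 5, 6, 7)
max(x) - min(x)    # 6
diff(range(x))     # also 6: range() gives c(min, max), diff() takes the difference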

Example 1 from Your Notes: Two Samples

Let's work through Example 1 from your PDF exactly as written:

Sample 1: 1, 2, 3, 4, 5, 6, 7
Sample 2: 1, 2, 3, 4, 5, 6, 100

# Example 1 from PDF - exactly as in your notes
sample1 <- c(1, 2, 3, 4, 5, 6, 7)
sample2 <- c(1, 2, 3, 4, 5, 6, 100)

cat("πŸ“‹ Example 1 from Your Notes:\n")
## πŸ“‹ Example 1 from Your Notes:
cat("Sample 1:", paste(sample1, collapse=", "), "\n")
## Sample 1: 1, 2, 3, 4, 5, 6, 7
cat("Sample 2:", paste(sample2, collapse=", "), "\n\n")
## Sample 2: 1, 2, 3, 4, 5, 6, 100
# Calculate ranges
range1 <- max(sample1) - min(sample1)
range2 <- max(sample2) - min(sample2)

cat("πŸ”’ Range Calculations:\n")
## πŸ”’ Range Calculations:
cat("Sample 1 Range: 7 - 1 =", range1, "\n")
## Sample 1 Range: 7 - 1 = 6
cat("Sample 2 Range: 100 - 1 =", range2, "\n\n")
## Sample 2 Range: 100 - 1 = 99
cat("πŸ€” What happened? One outlier (100) made the range jump from 6 to 99!\n")
## πŸ€” What happened? One outlier (100) made the range jump from 6 to 99!

2.2 Interquartile Range (IQR) - The Robust Alternative

The IQR measures the spread of the middle 50% of data:

\[\text{IQR} = Q_3 - Q_1\]

# Calculate IQR for both samples
cat("πŸ“Š IQR Calculations:\n")
## πŸ“Š IQR Calculations:
# For Sample 1
q1_s1 <- quantile(sample1, 0.25)  # Q1
q3_s1 <- quantile(sample1, 0.75)  # Q3
iqr1 <- q3_s1 - q1_s1

cat("Sample 1: Q1 =", q1_s1, ", Q3 =", q3_s1, ", IQR =", iqr1, "\n")
## Sample 1: Q1 = 2.5 , Q3 = 5.5 , IQR = 3
# For Sample 2
q1_s2 <- quantile(sample2, 0.25)  # Q1
q3_s2 <- quantile(sample2, 0.75)  # Q3
iqr2 <- q3_s2 - q1_s2

cat("Sample 2: Q1 =", q1_s2, ", Q3 =", q3_s2, ", IQR =", iqr2, "\n\n")
## Sample 2: Q1 = 2.5 , Q3 = 5.5 , IQR = 3
cat("✨ Amazing! The IQR stayed the same because it focuses on the middle 50%\n")
## ✨ Amazing! The IQR stayed the same because it focuses on the middle 50%
cat("   and ignores extreme outliers!\n")
##    and ignores extreme outliers!
# Visualize the difference
par(mfrow=c(1,2))
boxplot(sample1, main="Sample 1: No Outliers", ylab="Values", col="lightblue")
boxplot(sample2, main="Sample 2: With Outlier", ylab="Values", col="lightcoral")

par(mfrow=c(1,1))

Key Learning: IQR is “robust” - it's not affected by extreme outliers, making it more reliable for describing typical variability.


Chapter 3: Variance - The Mathematical Foundation

Now we get to the heart of variability - variance. This is where all the formulas from your PDF come into play!

3.1 Understanding Variance Conceptually

Variance measures the average squared distance from the mean. Think of it as asking: “On average, how far away are my data points from the center?”

Why Do We Square the Deviations?

Let me show you why we can't just average the deviations:

# Simple example: 2, 4, 6 (mean = 4)
simple_data <- c(2, 4, 6)
mean_val <- mean(simple_data)
deviations <- simple_data - mean_val

cat("πŸ“ˆ Why We Need to Square Deviations:\n")
## πŸ“ˆ Why We Need to Square Deviations:
cat("Data:", paste(simple_data, collapse=", "), "\n")
## Data: 2, 4, 6
cat("Mean:", mean_val, "\n")
## Mean: 4
cat("Deviations:", paste(deviations, collapse=", "), "\n")
## Deviations: -2, 0, 2
cat("Sum of deviations:", sum(deviations), "\n\n")
## Sum of deviations: 0
cat("😱 The deviations always sum to zero! Positive and negative cancel out.\n")
## 😱 The deviations always sum to zero! Positive and negative cancel out.
cat("πŸ’‘ Solution: Square them to make all values positive!\n")
## πŸ’‘ Solution: Square them to make all values positive!
squared_devs <- deviations^2
cat("Squared deviations:", paste(squared_devs, collapse=", "), "\n")
## Squared deviations: 4, 0, 4
cat("Now we can average them:", mean(squared_devs), "\n")
## Now we can average them: 2.666667

3.2 Population Variance - All Formulas from Your PDF

Your PDF shows multiple formulas for population variance. Let me explain every single one:

Original Formula

\[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\]

Shortcut Formula

\[\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2\]
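Why are the two formulas equal? Expanding the square and using \(\mu = \frac{1}{N}\sum_{i=1}^{N} x_i\):

\[\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2\mu x_i + \mu^2\right) = \frac{\sum_{i=1}^{N} x_i^2}{N} - 2\mu^2 + \mu^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2\]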

Let's work through Example 2 from your PDF step by step:

Example 2: Airline Ticket Prices

Data: 80, 120, 150, 90 (dollars)
Assignment: Compute the variance

# Example 2 from your PDF - Airline tickets
prices <- c(80, 120, 150, 90)
N <- length(prices)

cat("✈️ Example 2: Airline Ticket Variance Calculation\n")
## ✈️ Example 2: Airline Ticket Variance Calculation
cat("Data: $", paste(prices, collapse=", $"), "\n")
## Data: $ 80, $120, $150, $90
cat("Population size (N):", N, "\n\n")
## Population size (N): 4
# Step 1: Calculate population mean (ΞΌ)
mu <- sum(prices) / N
cat("πŸ“Š Step 1 - Population Mean:\n")
## πŸ“Š Step 1 - Population Mean:
cat("ΞΌ = (", paste(prices, collapse=" + "), ") Γ·", N, "\n")
## ΞΌ = ( 80 + 120 + 150 + 90 ) Γ· 4
cat("ΞΌ = ", sum(prices), " Γ·", N, " = $", mu, "\n\n")
## ΞΌ =  440  Γ· 4  = $ 110
# Step 2: Calculate deviations (xi - ΞΌ)
deviations <- prices - mu
cat("πŸ“ Step 2 - Deviations from Mean:\n")
## πŸ“ Step 2 - Deviations from Mean:
for(i in 1:N) {
  cat("x", i, "- ΞΌ = $", prices[i], " - $", mu, " = $", deviations[i], "\n")
}
## x 1 - ΞΌ = $ 80  - $ 110  = $ -30 
## x 2 - ΞΌ = $ 120  - $ 110  = $ 10 
## x 3 - ΞΌ = $ 150  - $ 110  = $ 40 
## x 4 - ΞΌ = $ 90  - $ 110  = $ -20
# Step 3: Square the deviations
squared_devs <- deviations^2
cat("\nπŸ”’ Step 3 - Squared Deviations:\n")
## 
## πŸ”’ Step 3 - Squared Deviations:
for(i in 1:N) {
  cat("(x", i, "- ΞΌ)Β² = ($", deviations[i], ")Β² = ", squared_devs[i], "\n")
}
## (x 1 - ΞΌ)Β² = ($ -30 )Β² =  900 
## (x 2 - ΞΌ)Β² = ($ 10 )Β² =  100 
## (x 3 - ΞΌ)Β² = ($ 40 )Β² =  1600 
## (x 4 - ΞΌ)Β² = ($ -20 )Β² =  400
# Step 4: Sum squared deviations
sum_squared <- sum(squared_devs)
cat("\nβž• Step 4 - Sum of Squared Deviations:\n")
## 
## βž• Step 4 - Sum of Squared Deviations:
cat("Ξ£(xi - ΞΌ)Β² = ", paste(squared_devs, collapse=" + "), " = ", sum_squared, "\n")
## Ξ£(xi - ΞΌ)Β² =  900 + 100 + 1600 + 400  =  3000
# Step 5: Calculate variance
variance <- sum_squared / N
cat("\n🎯 Step 5 - Population Variance:\n")
## 
## 🎯 Step 5 - Population Variance:
cat("σ² = Ξ£(xi - ΞΌ)Β² Γ· N = ", sum_squared, " Γ·", N, " = ", variance, "\n")
## σ² = Ξ£(xi - ΞΌ)Β² Γ· N =  3000  Γ· 4  =  750

Now let's verify using the shortcut formula:

cat("πŸš€ Shortcut Formula Verification:\n")
## πŸš€ Shortcut Formula Verification:
cat("σ² = (Ξ£xiΒ² Γ· N) - ΞΌΒ²\n\n")
## σ² = (Ξ£xiΒ² Γ· N) - ΞΌΒ²
# Calculate Ξ£xiΒ²
sum_x_squared <- sum(prices^2)
cat("πŸ“Š Sum of squares:\n")
## πŸ“Š Sum of squares:
for(i in 1:N) {
  cat("x", i, "Β² = ", prices[i], "Β² = ", prices[i]^2, "\n")
}
## x 1 Β² =  80 Β² =  6400 
## x 2 Β² =  120 Β² =  14400 
## x 3 Β² =  150 Β² =  22500 
## x 4 Β² =  90 Β² =  8100
cat("Ξ£xiΒ² = ", paste(prices^2, collapse=" + "), " = ", sum_x_squared, "\n\n")
## Ξ£xiΒ² =  6400 + 14400 + 22500 + 8100  =  51400
# Apply shortcut formula
shortcut_variance <- (sum_x_squared / N) - mu^2
cat("πŸ”„ Shortcut calculation:\n")
## πŸ”„ Shortcut calculation:
cat("σ² = (", sum_x_squared, " Γ·", N, ") - (", mu, ")Β²\n")
## σ² = ( 51400  Γ· 4 ) - ( 110 )Β²
cat("σ² = ", sum_x_squared/N, " - ", mu^2, " = ", shortcut_variance, "\n\n")
## σ² =  12850  -  12100  =  750
cat("βœ… Both methods give the same result: σ² =", variance, "\n")
## βœ… Both methods give the same result: σ² = 750

3.3 Sample Variance - The n-1 Mystery Solved

When we have a sample (not the whole population), we use:

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\]

Why n-1 instead of n? Because we used the data itself to estimate the mean, we “lose” one degree of freedom: the deviations from the sample mean must sum to zero, so only n-1 of them are free to vary. Dividing by n would systematically underestimate the population variance; dividing by n-1 corrects this bias and makes s² an unbiased estimator.
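A quick simulation (a sketch, not from the original notes) makes the point visible: drawing many small samples from a population whose variance we know, dividing by n underestimates that variance on average, while dividing by n-1 does not.

# Sketch: divide-by-n vs divide-by-(n-1) on many small samples from N(0, sd = 2)
set.seed(1)
n <- 5
divide_by_n   <- replicate(10000, { x <- rnorm(n, sd = 2); sum((x - mean(x))^2) / n })
divide_by_nm1 <- replicate(10000, { x <- rnorm(n, sd = 2); var(x) })
mean(divide_by_n)     # systematically below the true variance of 4 (around 3.2)
mean(divide_by_nm1)   # close to 4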

cat("πŸ”¬ Sample Variance (treating our airline data as a sample):\n")
## πŸ”¬ Sample Variance (treating our airline data as a sample):
n <- length(prices)
x_bar <- mean(prices)  # Sample mean

# Calculate sample variance manually
sample_deviations <- prices - x_bar
sample_variance_manual <- sum(sample_deviations^2) / (n - 1)

cat("Sample mean (xΜ„):", x_bar, "\n")
## Sample mean (xΜ„): 110
cat("Manual calculation: sΒ² =", round(sample_variance_manual, 2), "\n")
## Manual calculation: sΒ² = 1000
# Verify with R function
sample_variance_R <- var(prices)
cat("R var() function:", round(sample_variance_R, 2), "\n")
## R var() function: 1000
cat("βœ… Perfect match!\n")
## βœ… Perfect match!

Chapter 4: Frequency Distribution Formulas - All From Your PDF

When data is organized in frequency tables, we need special formulas. Your PDF shows these clearly:

4.1 Frequency Data Formulas

Population Variance for Frequency Data

Original Formula: \[\sigma^2 = \frac{1}{N}\sum_{k=1}^{K}(x_k - \mu)^2 f_k = \sum_{k=1}^{K}(x_k - \mu)^2 p_k\]

Shortcut Formula: \[\sigma^2 = \frac{\sum_{k=1}^{K} x_k^2 f_k}{N} - \mu^2 = \sum_{k=1}^{K} x_k^2 p_k - \mu^2\]
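A compact sketch of both formulas as plain R functions (these helpers are illustrations only, not part of UBStats); applied to the car-ownership table of Example 3 below, both return the same population variance:

# Sketch: population variance from a frequency table, original and shortcut form
var_freq_original <- function(x, f) {
  p  <- f / sum(f)            # proportions pk
  mu <- sum(x * p)            # weighted mean
  sum((x - mu)^2 * p)         # sum of (xk - mu)^2 * pk
}
var_freq_shortcut <- function(x, f) {
  p <- f / sum(f)
  sum(x^2 * p) - sum(x * p)^2 # sum of xk^2 * pk, minus mu^2
}
var_freq_original(c(1, 2, 3, 5), c(32, 48, 16, 4))   # 0.8384
var_freq_shortcut(c(1, 2, 3, 5), c(32, 48, 16, 4))   # 0.8384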

Example 3 from Your PDF: Car Ownership

Let's work through Example 3 exactly as shown in your notes:

Variable: Number of cars owned (sample of 100 families)

Number of cars (xk)    Frequency (fk)    Proportion (pk)
1                      32                0.32
2                      48                0.48
3                      16                0.16
5                       4                0.04

# Example 3 from PDF - Car ownership frequency data
cars_owned <- c(1, 2, 3, 5)
frequencies <- c(32, 48, 16, 4)
N_total <- sum(frequencies)
proportions <- frequencies / N_total

cat("πŸš— Example 3: Car Ownership Frequency Analysis\n")
## πŸš— Example 3: Car Ownership Frequency Analysis
cat("Values (xk):", paste(cars_owned, collapse=", "), "\n")
## Values (xk): 1, 2, 3, 5
cat("Frequencies (fk):", paste(frequencies, collapse=", "), "\n")
## Frequencies (fk): 32, 48, 16, 4
cat("Proportions (pk):", paste(round(proportions, 2), collapse=", "), "\n")
## Proportions (pk): 0.32, 0.48, 0.16, 0.04
cat("Total families (N):", N_total, "\n\n")
## Total families (N): 100
# Step 1: Calculate weighted mean
weighted_mean <- sum(cars_owned * proportions)
cat("πŸ“Š Step 1 - Weighted Mean:\n")
## πŸ“Š Step 1 - Weighted Mean:
cat("ΞΌ = Ξ£(xk Γ— pk)\n")
## ΞΌ = Ξ£(xk Γ— pk)
for(i in 1:length(cars_owned)) {
  cat("   ", cars_owned[i], "Γ—", proportions[i], "=", 
      round(cars_owned[i] * proportions[i], 3), "\n")
}
##     1 Γ— 0.32 = 0.32 
##     2 Γ— 0.48 = 0.96 
##     3 Γ— 0.16 = 0.48 
##     5 Γ— 0.04 = 0.2
cat("ΞΌ = ", round(weighted_mean, 3), " cars per family\n\n")
## ΞΌ =  1.96  cars per family
# Step 2: Variance using original formula
cat("πŸ“ˆ Step 2 - Variance (Original Formula):\n")
## πŸ“ˆ Step 2 - Variance (Original Formula):
cat("σ² = Ξ£(xk - ΞΌ)Β² Γ— pk\n")
## σ² = Ξ£(xk - ΞΌ)Β² Γ— pk
deviations_freq <- cars_owned - weighted_mean
squared_dev_freq <- deviations_freq^2
variance_components <- squared_dev_freq * proportions

for(i in 1:length(cars_owned)) {
  cat("(", cars_owned[i], "-", round(weighted_mean, 3), ")Β² Γ— ", 
      proportions[i], " = ", round(squared_dev_freq[i], 3), " Γ— ", 
      proportions[i], " = ", round(variance_components[i], 4), "\n")
}
## ( 1 - 1.96 )Β² Γ—  0.32  =  0.922  Γ—  0.32  =  0.2949 
## ( 2 - 1.96 )Β² Γ—  0.48  =  0.002  Γ—  0.48  =  8e-04 
## ( 3 - 1.96 )Β² Γ—  0.16  =  1.082  Γ—  0.16  =  0.1731 
## ( 5 - 1.96 )Β² Γ—  0.04  =  9.242  Γ—  0.04  =  0.3697
variance_freq <- sum(variance_components)
cat("σ² = ", paste(round(variance_components, 4), collapse=" + "), 
    " = ", round(variance_freq, 3), "\n\n")
## σ² =  0.2949 + 8e-04 + 0.1731 + 0.3697  =  0.838
# Step 3: Verify with shortcut formula
cat("πŸš€ Step 3 - Shortcut Formula Verification:\n")
## πŸš€ Step 3 - Shortcut Formula Verification:
cat("σ² = Ξ£(xkΒ² Γ— pk) - ΞΌΒ²\n")
## σ² = Ξ£(xkΒ² Γ— pk) - ΞΌΒ²
x_squared_times_p <- cars_owned^2 * proportions
for(i in 1:length(cars_owned)) {
  cat(cars_owned[i], "Β² Γ— ", proportions[i], " = ", 
      cars_owned[i]^2, " Γ— ", proportions[i], " = ", 
      round(x_squared_times_p[i], 3), "\n")
}
## 1 Β² Γ—  0.32  =  1  Γ—  0.32  =  0.32 
## 2 Β² Γ—  0.48  =  4  Γ—  0.48  =  1.92 
## 3 Β² Γ—  0.16  =  9  Γ—  0.16  =  1.44 
## 5 Β² Γ—  0.04  =  25  Γ—  0.04  =  1
shortcut_variance <- sum(x_squared_times_p) - weighted_mean^2
cat("σ² = ", round(sum(x_squared_times_p), 3), " - ", 
    round(weighted_mean^2, 3), " = ", round(shortcut_variance, 3), "\n")
## σ² =  4.68  -  3.842  =  0.838
cat("βœ… Both formulas match!\n")
## βœ… Both formulas match!

Sample Variance for Frequency Data

Sample Formula: \[s^2 = \frac{n}{n-1}\sum_{k=1}^{K}(x_k - \bar{x})^2 p_k\]

Sample Shortcut Formula: \[s^2 = \frac{n}{n-1}\left(\sum_{k=1}^{K} x_k^2 p_k - \bar{x}^2\right)\]

# Calculate sample variance for frequency data
sample_var_freq <- (N_total/(N_total-1)) * variance_freq

cat("πŸ”¬ Sample Variance for Frequency Data:\n")
## πŸ”¬ Sample Variance for Frequency Data:
cat("sΒ² = (n/(n-1)) Γ— σ²\n")
## sΒ² = (n/(n-1)) Γ— σ²
cat("sΒ² = (", N_total, "/(", N_total, "-1)) Γ— ", round(variance_freq, 3), "\n")
## sΒ² = ( 100 /( 100 -1)) Γ—  0.838
cat("sΒ² = ", round(sample_var_freq, 3), "\n")
## sΒ² =  0.847

Chapter 5: Grouped Data and Standard Deviation

5.1 Grouped Data (Interval Classes) - Approximate Variance

Your PDF shows that for grouped data (continuous variables in interval classes), we get approximate variance:

Population Variance: \[\sigma^2 = \frac{1}{N}\sum_{k=1}^{K}(m_k - \mu)^2 f_k\]

Sample Variance: \[s^2 = \frac{n}{n-1}\left[\frac{\sum_{k=1}^{K} m_k^2 f_k}{n} - \bar{x}^2\right]\]

Where \(m_k\) = midpoint of each interval class.

# Example of grouped data - car speeds
cat("🏎️ Grouped Data Example: Car Speeds\n")
## 🏎️ Grouped Data Example: Car Speeds
cat("Note: This gives APPROXIMATE variance because we use midpoints\n\n")
## Note: This gives APPROXIMATE variance because we use midpoints
# Speed intervals and their midpoints
intervals <- c("[0,30)", "[30,50)", "[50,100)")
midpoints <- c(15, 40, 75)  # mk values
frequencies_speed <- c(4, 12, 8)
n_speed <- sum(frequencies_speed)

cat("Speed Intervals:", paste(intervals, collapse=", "), "\n")
## Speed Intervals: [0,30), [30,50), [50,100)
cat("Midpoints (mk):", paste(midpoints, collapse=", "), "\n")
## Midpoints (mk): 15, 40, 75
cat("Frequencies (fk):", paste(frequencies_speed, collapse=", "), "\n\n")
## Frequencies (fk): 4, 12, 8
# Calculate approximate mean
approx_mean <- sum(midpoints * frequencies_speed) / n_speed
cat("Approximate mean: xΜ„ =", round(approx_mean, 1), "km/h\n")
## Approximate mean: xΜ„ = 47.5 km/h
# Calculate approximate sample variance
mk_squared_fk <- midpoints^2 * frequencies_speed
approx_variance <- (n_speed/(n_speed-1)) * 
                   (sum(mk_squared_fk)/n_speed - approx_mean^2)

cat("Approximate sample variance: sΒ² =", round(approx_variance, 1), "\n")
## Approximate sample variance: sΒ² = 476.1
cat("⚠️ Remember: This is approximate because we used interval midpoints!\n")
## ⚠️ Remember: This is approximate because we used interval midpoints!

5.2 Standard Deviation - Making Variance Interpretable

The Problem with Variance Units

Variance has squared units - if we measure height in cm, variance is in cm². What does “25 cm²” of height variability mean? It's confusing!

Solution: Take the square root to get back to original units!

\[\text{Standard Deviation} = \sqrt{\text{Variance}}\]

Population Standard Deviation

\[\sigma = \sqrt{\sigma^2}\]

Sample Standard Deviation

\[s = \sqrt{s^2}\]

# Using our airline ticket example
cat("✈️ Standard Deviation from Airline Ticket Example:\n")
## ✈️ Standard Deviation from Airline Ticket Example:
cat("Variance: σ² =", variance, "dollarsΒ²\n")
## Variance: σ² = 750 dollarsΒ²
std_dev <- sqrt(variance)
cat("Standard deviation: Οƒ = √", variance, " = $", round(std_dev, 2), "\n\n")
## Standard deviation: Οƒ = √ 750  = $ 27.39
cat("πŸ’‘ Interpretation: On average, ticket prices are about $", 
    round(std_dev, 2), " away from the mean of $", mu, "\n")
## πŸ’‘ Interpretation: On average, ticket prices are about $ 27.39  away from the mean of $ 110
# For the car ownership example
std_dev_cars <- sqrt(variance_freq)
cat("\nπŸš— Car Ownership Standard Deviation:\n")
## 
## πŸš— Car Ownership Standard Deviation:
cat("Οƒ = √", round(variance_freq, 3), " = ", round(std_dev_cars, 3), " cars\n")
## Οƒ = √ 0.838  =  0.916  cars

5.3 The Empirical Rule (68-95-99.7 Rule)

For normally distributed data (checked numerically in the sketch below):

  • 68% of data falls within 1 standard deviation of the mean
  • 95% of data falls within 2 standard deviations
  • 99.7% of data falls within 3 standard deviations
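As a quick numerical check (a sketch on simulated standard-normal data, not part of the original notes):

# Sketch: empirical-rule coverage on simulated normal data
set.seed(2)
z <- rnorm(100000)        # standard normal draws
mean(abs(z) <= 1)         # about 0.68
mean(abs(z) <= 2)         # about 0.95
mean(abs(z) <= 3)         # about 0.997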

# Visualize the empirical rule
x <- seq(-4, 4, 0.01)
y <- dnorm(x)

plot(x, y, type="l", lwd=2, col="blue", 
     main="The Empirical Rule (68-95-99.7)",
     xlab="Standard deviations from mean", 
     ylab="Probability density")

# Shade areas
polygon(c(-1, seq(-1, 1, 0.01), 1), c(0, dnorm(seq(-1, 1, 0.01)), 0), 
        col=rgb(0,0,1,0.3), border=NA)
polygon(c(-2, seq(-2, 2, 0.01), 2), c(0, dnorm(seq(-2, 2, 0.01)), 0), 
        col=rgb(0,1,0,0.2), border=NA)
polygon(c(-3, seq(-3, 3, 0.01), 3), c(0, dnorm(seq(-3, 3, 0.01)), 0), 
        col=rgb(1,0,0,0.1), border=NA)

# Add labels
text(0, 0.2, "68%", cex=1.2, font=2)
text(0, 0.05, "95%", cex=1.2, font=2)
text(0, 0.02, "99.7%", cex=1.2, font=2)

# Add vertical lines
abline(v=c(-3,-2,-1,0,1,2,3), lty=2, col="gray")

# Apply empirical rule to car prices
price_mean <- mean(cars$price_num)
price_sd <- sd(cars$price_num)

cat("πŸš— Empirical Rule Applied to Car Prices:\n")
## πŸš— Empirical Rule Applied to Car Prices:
cat("Mean price: $", round(price_mean, 0), "\n")
## Mean price: $ 30620
cat("Standard deviation: $", round(price_sd, 0), "\n\n")
## Standard deviation: $ 9531
cat("πŸ“Š Expected ranges:\n")
## πŸ“Š Expected ranges:
cat("68% of cars priced between: $", 
    round(price_mean - price_sd, 0), " - $", 
    round(price_mean + price_sd, 0), "\n")
## 68% of cars priced between: $ 21089  - $ 40151
cat("95% of cars priced between: $", 
    round(price_mean - 2*price_sd, 0), " - $", 
    round(price_mean + 2*price_sd, 0), "\n")
## 95% of cars priced between: $ 11557  - $ 49682
cat("99.7% of cars priced between: $", 
    round(price_mean - 3*price_sd, 0), " - $", 
    round(price_mean + 3*price_sd, 0), "\n")
## 99.7% of cars priced between: $ 2026  - $ 59214

Chapter 6: UBStats Functions and Coefficient of Variation

6.1 Using UBStats Functions - Making Life Easier

Now let's use the functions from your class scripts to calculate all these measures automatically:

# Using UBStats functions for car price analysis
cat("πŸ”§ Using UBStats Functions for Car Price Analysis:\n\n")
## πŸ”§ Using UBStats Functions for Car Price Analysis:
# All dispersion measures at once
cat("πŸ“Š All Variability Measures:\n")
## πŸ“Š All Variability Measures:
price_dispersion <- distr.summary.x(cars$price_num, stats="dispersion")
##   n n.a    range  IQrange      sd      var   cv
##  32   0 34554.64 13107.09 9531.25 90844686 0.31
print(price_dispersion)
## $`Measures of dispersion`
##    n n.a    range  IQrange       sd      var        cv
## 1 32   0 34554.64 13107.09 9531.248 90844686 0.3112753
cat("\nπŸ“ˆ Complete Summary (Central Tendency + Variability):\n")
## 
## πŸ“ˆ Complete Summary (Central Tendency + Variability):
price_summary <- distr.summary.x(cars$price_num, stats="summary")
##   n n.a      min       q1   median     mean       q3      max      sd      var
##  32   0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.25 90844686
print(price_summary)
## $`Summary measures`
##    n n.a      min       q1   median     mean       q3      max       sd
## 1 32   0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.248
##        var
## 1 90844686

6.2 Coefficient of Variation - Comparing Different Variables

The coefficient of variation (CV) lets us compare variability across different units:

\[CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%\]

cat("πŸ“Š Coefficient of Variation Analysis:\n")
## πŸ“Š Coefficient of Variation Analysis:
cat("Which car characteristic is most variable?\n\n")
## Which car characteristic is most variable?
# Calculate CV for different variables
price_cv <- (sd(cars$price_num) / mean(cars$price_num)) * 100
speed_cv <- (sd(cars$maxspeed) / mean(cars$maxspeed)) * 100
weight_cv <- (sd(cars$weight) / mean(cars$weight)) * 100
accel_cv <- (sd(cars$acceleration) / mean(cars$acceleration)) * 100

cat("Variable         Mean        SD        CV\n")
## Variable         Mean        SD        CV
cat("Price ($)      ", sprintf("%8.0f", mean(cars$price_num)), 
    "   ", sprintf("%8.0f", sd(cars$price_num)), 
    "   ", sprintf("%5.1f%%", price_cv), "\n")
## Price ($)          30620         9531      31.1%
cat("Max Speed      ", sprintf("%8.1f", mean(cars$maxspeed)), 
    "   ", sprintf("%8.1f", sd(cars$maxspeed)), 
    "   ", sprintf("%5.1f%%", speed_cv), "\n")
## Max Speed          420.6        169.1      40.2%
cat("Weight (kg)    ", sprintf("%8.0f", mean(cars$weight)), 
    "   ", sprintf("%8.0f", sd(cars$weight)), 
    "   ", sprintf("%5.1f%%", weight_cv), "\n")
## Weight (kg)         1609          489      30.4%
cat("Acceleration   ", sprintf("%8.2f", mean(cars$acceleration)), 
    "   ", sprintf("%8.2f", sd(cars$acceleration)), 
    "   ", sprintf("%5.1f%%", accel_cv), "\n\n")
## Acceleration       12.51         4.27      34.2%
# Interpretation
cat("πŸ’‘ Interpretation:\n")
## πŸ’‘ Interpretation:
cv_values <- c(Price = price_cv, Speed = speed_cv, Weight = weight_cv, Acceleration = accel_cv)
cv_sorted <- sort(cv_values, decreasing = TRUE)

for(i in 1:length(cv_sorted)) {
  cat(i, ". ", names(cv_sorted)[i], ": ", sprintf("%.1f%%", cv_sorted[i]), 
      " (", ifelse(cv_sorted[i] > 30, "High", ifelse(cv_sorted[i] > 15, "Moderate", "Low")), " variability)\n")
}
## 1 .  Speed :  40.2%  ( High  variability)
## 2 .  Acceleration :  34.2%  ( High  variability)
## 3 .  Price :  31.1%  ( High  variability)
## 4 .  Weight :  30.4%  ( High  variability)
cat("\nMost variable:", names(cv_sorted)[1], "- indicates diverse market segments\n")
## 
## Most variable: Speed - indicates diverse market segments
cat("Least variable:", names(cv_sorted)[4], "- suggests more standardized characteristics\n")
## Least variable: Weight - suggests more standardized characteristics

Chapter 7: Advanced Examples from Your PDF Notes

7.1 Example 4: Five Variables with Same Mean and Median

Let's work through Example 4 from your variability PDF - the five variables comparison:

cat("πŸ“‹ Example 4: Five Variables Analysis\n")
## πŸ“‹ Example 4: Five Variables Analysis
cat("Goal: Show that variables can have same mean/median but different variability\n\n")
## Goal: Show that variables can have same mean/median but different variability
# Data from your PDF (N=7 for each variable)
var_A <- c(1, 1, 4, 4, 4, 7, 7)
var_B <- c(1, 2, 3, 4, 5, 6, 7)  
var_C <- c(1, 2, 4, 4, 4, 6, 7)
var_D <- c(1, 4, 4, 4, 4, 4, 7)
var_E <- c(4, 4, 4, 4, 4, 4, 4)

variables <- list(A=var_A, B=var_B, C=var_C, D=var_D, E=var_E)

cat("πŸ“Š Data for each variable:\n")
## πŸ“Š Data for each variable:
for(i in 1:5) {
  cat("Variable", names(variables)[i], ":", 
      paste(variables[[i]], collapse=", "), "\n")
}
## Variable A : 1, 1, 4, 4, 4, 7, 7 
## Variable B : 1, 2, 3, 4, 5, 6, 7 
## Variable C : 1, 2, 4, 4, 4, 6, 7 
## Variable D : 1, 4, 4, 4, 4, 4, 7 
## Variable E : 4, 4, 4, 4, 4, 4, 4
cat("\nπŸ“ˆ Complete Analysis of All Five Variables:\n\n")
## 
## πŸ“ˆ Complete Analysis of All Five Variables:
# Calculate all measures exactly as shown in your PDF
results <- data.frame(
  Variable = names(variables),
  Min = sapply(variables, min),
  Q1 = sapply(variables, function(x) quantile(x, 0.25)),
  Median = sapply(variables, median),
  Mean = sapply(variables, mean),
  Q3 = sapply(variables, function(x) quantile(x, 0.75)),
  Max = sapply(variables, max),
  Range = sapply(variables, function(x) max(x) - min(x)),
  IQR = sapply(variables, IQR),
  Variance = sapply(variables, var),
  Std_Dev = sapply(variables, sd)
)

# Print only the numeric columns with proper rounding
results_numeric <- results[, -1]  # Remove the Variable column for rounding
results_rounded <- round(results_numeric, 2)
results_final <- cbind(Variable = results$Variable, results_rounded)

print(results_final)
##   Variable Min  Q1 Median Mean  Q3 Max Range IQR Variance Std_Dev
## A        A   1 2.5      4    4 5.5   7     6   3     6.00    2.45
## B        B   1 2.5      4    4 5.5   7     6   3     4.67    2.16
## C        C   1 3.0      4    4 5.0   7     6   2     4.33    2.08
## D        D   1 4.0      4    4 4.0   7     6   0     3.00    1.73
## E        E   4 4.0      4    4 4.0   4     0   0     0.00    0.00
cat("\nπŸ” Key Observations from Your PDF:\n")
## 
## πŸ” Key Observations from Your PDF:
cat("- ALL variables have same mean (4) and median (4)\n")
## - ALL variables have same mean (4) and median (4)
cat("- But they have very different variability patterns!\n")
## - But they have very different variability patterns!
cat("- Variable E has zero variability (all values = 4)\n")
## - Variable E has zero variability (all values = 4)
cat("- Variable A has highest variability\n")
## - Variable A has highest variability
cat("- Only variance and standard deviation capture these differences!\n")
## - Only variance and standard deviation capture these differences!
# Visualize all five variables
par(mfrow=c(2,3))

for(i in 1:5) {
  boxplot(variables[[i]], 
          main=paste("Variable", names(variables)[i]),
          ylab="Values", col=rainbow(5)[i])
  abline(h=4, col="red", lty=2)  # Mean line
}

# Add histogram for comparison
hist(var_A, main="Variable A Distribution", 
     xlab="Values", col="lightblue", breaks=0:8)

par(mfrow=c(1,1))

7.2 Example 5: Business Scenario - Japanese vs European Cars

Now let's tackle Example 5 - the business scenario from your PDF:

cat("πŸš— Example 5: Car Manufacturer Price Analysis\n")
## πŸš— Example 5: Car Manufacturer Price Analysis
cat("Business Question: Are Japanese car prices too differentiated?\n\n")
## Business Question: Are Japanese car prices too differentiated?
# Question 1: Are car models more dispersed in terms of sales or price?
cat("πŸ“Š Question 1: Sales vs Price Variability\n")
## πŸ“Š Question 1: Sales vs Price Variability
sales_stats <- distr.summary.x(cars$sales, stats=c("mean","sd","cv"))
##   n n.a    mean    sd   cv
##  32   0 8012.88 81.21 0.01
price_stats <- distr.summary.x(cars$price_num, stats=c("mean","sd","cv"))
##   n n.a     mean      sd   cv
##  32   0 30619.99 9531.25 0.31
print(sales_stats)
## $`Requested statistics`
##    n n.a     mean       sd         cv
## 1 32   0 8012.875 81.21407 0.01013545
print(price_stats)
## $`Requested statistics`
##    n n.a     mean       sd        cv
## 1 32   0 30619.99 9531.248 0.3112753
cat("\nπŸ’‘ Answer: Sales are more dispersed!\n")
## 
## πŸ’‘ Answer: Sales are more dispersed!
cat("Sales CV =", round((sd(cars$sales)/mean(cars$sales))*100, 1), "%\n")
## Sales CV = 1 %
cat("Price CV =", round((sd(cars$price_num)/mean(cars$price_num))*100, 1), "%\n")
## Price CV = 31.1 %
cat("\nπŸ“Š Question 2: Price Variability by Country\n")
## 
## πŸ“Š Question 2: Price Variability by Country
# Compare price variability among different countries
price_by_country <- distr.summary.x(cars$price_num, 
                                   stats=c("mean","sd","cv"), 
                                   by1=cars$country)
##   cars$country  n n.a     mean       sd   cv
##         France  4   0 25927.03  6804.56 0.26
##        Germany  6   0 29644.77 10216.58 0.34
##          Italy  6   0 30332.72 11953.73 0.39
##          Japan 12   0 31136.82  9580.72 0.31
##  United States  4   0 35656.21  8799.23 0.25
print(price_by_country)
## $`Requested statistics`
##    cars$country  n n.a     mean        sd        cv
## 1        France  4   0 25927.03  6804.561 0.2624504
## 2       Germany  6   0 29644.77 10216.579 0.3446334
## 3         Italy  6   0 30332.72 11953.735 0.3940871
## 4         Japan 12   0 31136.82  9580.721 0.3076975
## 5 United States  4   0 35656.21  8799.226 0.2467796
cat("\n🎯 Manager's Concern Analysis:\n")
## 
## 🎯 Manager's Concern Analysis:
# Check if Japan and Germany exist in the data
available_countries <- unique(cars$country)
cat("Available countries:", paste(available_countries, collapse=", "), "\n")
## Available countries: Germany, Italy, Japan, United States, France
# Find Japan and Germany data more safely
if("Japan" %in% available_countries && "Germany" %in% available_countries) {
  # Calculate CV manually for Japan and Germany
  japan_cars <- cars[cars$country == "Japan", ]
  germany_cars <- cars[cars$country == "Germany", ]
  
  japan_cv <- (sd(japan_cars$price_num, na.rm=TRUE) / mean(japan_cars$price_num, na.rm=TRUE)) * 100
  germany_cv <- (sd(germany_cars$price_num, na.rm=TRUE) / mean(germany_cars$price_num, na.rm=TRUE)) * 100
  
  cat("Japanese cars CV:", round(japan_cv, 1), "%\n")
  cat("German cars CV:", round(germany_cv, 1), "%\n")
  
  if(japan_cv > germany_cv) {
    cat("⚠️  Manager's concern is VALID - Japanese cars are more variable\n")
  } else {
    cat("βœ… Manager's concern is UNFOUNDED - Japanese cars are not more variable\n")
  }
} else {
  cat("Note: Japan or Germany not found in dataset\n")
  cat("Comparing available countries instead:\n")
  
  # Get CV for all countries
  country_cvs <- c()
  for(country in available_countries) {
    country_cars <- cars[cars$country == country, ]
    if(nrow(country_cars) > 1) {
      cv <- (sd(country_cars$price_num, na.rm=TRUE) / mean(country_cars$price_num, na.rm=TRUE)) * 100
      country_cvs[country] <- cv
      cat(country, "CV:", round(cv, 1), "%\n")
    }
  }
  
  # Find most and least variable
  if(length(country_cvs) > 0) {
    most_variable <- names(which.max(country_cvs))
    least_variable <- names(which.min(country_cvs))
    cat("Most variable country:", most_variable, "(", round(max(country_cvs), 1), "%)\n")
    cat("Least variable country:", least_variable, "(", round(min(country_cvs), 1), "%)\n")
  }
}
## Japanese cars CV: 30.8 %
## German cars CV: 34.5 %
## βœ… Manager's concern is UNFOUNDED - Japanese cars are not more variable

Chapter 8: Bivariate Analysis - Studying Relationships

Now we move to Part 2 of today's lecture: How do we study relationships between TWO variables? This is called bivariate analysis.

8.1 Types of Bivariate Relationships

Your PDF notes show three main cases:

  1. Two Qualitative Variables → Crosstabs and bar charts
  2. Two Quantitative Variables → Scatterplots and correlation
  3. One Qualitative + One Quantitative → Conditional analysis
cat("πŸ” Bivariate Analysis Overview:\n")
## πŸ” Bivariate Analysis Overview:
cat("We'll explore relationships between pairs of variables\n")
## We'll explore relationships between pairs of variables
cat("Goal: Understand how one variable relates to another\n\n")
## Goal: Understand how one variable relates to another
# Check our available variables
cat("Available variables in cars dataset:\n")
## Available variables in cars dataset:
str(cars[, c("country", "price_classes", "price_num", "maxspeed", "acceleration")])
## 'data.frame':    32 obs. of  5 variables:
##  $ country      : chr  "Germany" "Italy" "Japan" "United States" ...
##  $ price_classes: Factor w/ 3 levels "low","mid","high": 2 1 2 3 3 3 3 2 2 2 ...
##  $ price_num    : num  24634 19354 23888 39901 39129 ...
##  $ maxspeed     : num  314 345 270 314 470 ...
##  $ acceleration : num  15.5 15.6 15.7 12.3 12.4 ...

8.2 Two Qualitative Variables - Crosstabs

When both variables are categorical, we use crosstabs (contingency tables).

Example from Bivariate PDF: Industry vs Solvency Rating

Let's recreate Case A from your bivariate analysis notes:

# Example from your bivariate PDF - Industry vs Solvency Rating
cat("πŸ“‹ Case A: Industry vs Solvency Rating\n")
## πŸ“‹ Case A: Industry vs Solvency Rating
cat("Question: Does company rating depend on industry?\n\n")
## Question: Does company rating depend on industry?
# Create the data from your PDF
industry <- c(rep("Manufacturing", 240), rep("Financial", 160))
solvency <- c(rep("Low", 36), rep("Average", 124), rep("High", 80),     # Manufacturing
              rep("Low", 12), rep("Average", 64), rep("High", 84))      # Financial

# Create crosstab
crosstab <- table(industry, solvency)
print(crosstab)
##                solvency
## industry        Average High Low
##   Financial          64   84  12
##   Manufacturing     124   80  36
cat("\nπŸ“Š Step 1: Calculate conditional frequencies (Y|X)\n")
## 
## πŸ“Š Step 1: Calculate conditional frequencies (Y|X)
# Calculate conditional frequencies by row
conditional_freq <- prop.table(crosstab, margin=1) * 100
print(round(conditional_freq, 1))
##                solvency
## industry        Average High  Low
##   Financial        40.0 52.5  7.5
##   Manufacturing    51.7 33.3 15.0
# Step 2: Graphical presentation - Stacked bar chart
barplot(conditional_freq, beside=FALSE, 
        main="Solvency Rating by Industry (Conditional %)",
        xlab="Solvency Rating", ylab="Percentage",
        col=c("lightblue", "lightcoral"),
        legend.text=rownames(conditional_freq))

cat("\n🎯 Step 3: Conclusion\n")
## 
## 🎯 Step 3: Conclusion
cat("Manufacturing: Only 33.3% have high rating\n")
## Manufacturing: Only 33.3% have high rating
cat("Financial: 52.5% have high rating\n")
## Financial: 52.5% have high rating
cat("β†’ The distributions are DIFFERENT, so rating depends on industry!\n")
## β†’ The distributions are DIFFERENT, so rating depends on industry!

Using UBStats Functions for Crosstabs

# Let's analyze country vs price class in our cars dataset
cat("πŸš— Analyzing Country vs Price Class:\n")
## πŸš— Analyzing Country vs Price Class:
# Create frequency table
country_price_table <- distr.table.xy(cars$country, cars$price_classes, 
                                      freq.type="y|x", freq="prop")
## y|x: Proportions
##                cars$price_classes
## cars$country     low  mid high TOTAL
##   France        0.25 0.50 0.25  1.00
##   Germany       0.17 0.67 0.17  1.00
##   Italy         0.33 0.17 0.50  1.00
##   Japan         0.08 0.67 0.25  1.00
##   United States 0.00 0.50 0.50  1.00
print(country_price_table)
## $`y|x: Proportions`
##                      low       mid      high TOTAL
## France        0.25000000 0.5000000 0.2500000     1
## Germany       0.16666667 0.6666667 0.1666667     1
## Italy         0.33333333 0.1666667 0.5000000     1
## Japan         0.08333333 0.6666667 0.2500000     1
## United States 0.00000000 0.5000000 0.5000000     1
# Create visualization
distr.plot.xy(cars$country, cars$price_classes, 
              freq.type="y|x", freq="prop",
              plot.type="bars", bar.type="stacked")

8.3 Two Quantitative Variables - Correlation and Scatterplots

When both variables are numerical, we use scatterplots and correlation.

Correlation Coefficient

The correlation coefficient (r) measures the strength and direction of a linear relationship:

\(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}\)

  • r = +1: Perfect positive linear relationship
  • r = 0: No linear relationship
  • r = -1: Perfect negative linear relationship
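To see that r really is just this ratio, here is a sketch that computes it directly from the formula (using the simulated price and maxspeed columns built in Chapter 1) and compares it with R's built-in cor():

# Sketch: correlation coefficient computed from its definition
px <- cars$price_num
py <- cars$maxspeed
r_manual <- sum((px - mean(px)) * (py - mean(py))) /
  sqrt(sum((px - mean(px))^2) * sum((py - mean(py))^2))
r_manual       # from the formula...
cor(px, py)    # ...matches the built-in function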
cat("πŸ“ˆ Correlation Analysis: Car Price vs Performance\n\n")
## πŸ“ˆ Correlation Analysis: Car Price vs Performance
# Calculate correlations
price_speed_cor <- cor(cars$price_num, cars$maxspeed)
price_accel_cor <- cor(cars$price_num, cars$acceleration)

cat("Price vs Max Speed correlation:", round(price_speed_cor, 3), "\n")
## Price vs Max Speed correlation: 0.794
cat("Price vs Acceleration correlation:", round(price_accel_cor, 3), "\n\n")
## Price vs Acceleration correlation: -0.827
# Interpretation
cat("πŸ’‘ Interpretation:\n")
## πŸ’‘ Interpretation:
if(price_speed_cor > 0.5) {
  cat("- Strong positive correlation between price and speed\n")
} else if(price_speed_cor > 0.3) {
  cat("- Moderate positive correlation between price and speed\n")
} else {
  cat("- Weak correlation between price and speed\n")
}
## - Strong positive correlation between price and speed
if(price_accel_cor < -0.3) {
  cat("- Negative correlation: expensive cars accelerate faster (lower seconds)\n")
}
## - Negative correlation: expensive cars accelerate faster (lower seconds)
# Create scatterplots
par(mfrow=c(1,2))

# Price vs Speed
plot(cars$price_num, cars$maxspeed, 
     main="Price vs Max Speed", 
     xlab="Price ($)", ylab="Max Speed (km/h)",
     pch=19, col=rgb(0, 0, 1, 0.6))  # semi-transparent blue; "alpha" is not a base plot() argument
abline(lm(maxspeed ~ price_num, data=cars), col="red", lwd=2)

# Price vs Acceleration  
plot(cars$price_num, cars$acceleration,
     main="Price vs Acceleration",
     xlab="Price ($)", ylab="Acceleration (0-100 km/h seconds)",
     pch=19, col=rgb(0, 0.6, 0, 0.6))  # semi-transparent green
abline(lm(acceleration ~ price_num, data=cars), col="red", lwd=2)

par(mfrow=c(1,1))

Using UBStats for Scatterplots

# Using UBStats functions
distr.plot.xy(cars$price_num, cars$maxspeed, plot.type="scatter")

cat("πŸ“Š Using UBStats for correlation analysis:\n")
## πŸ“Š Using UBStats for correlation analysis:
cat("Correlation between price and speed:", 
    round(cor(cars$price_num, cars$maxspeed), 3), "\n")
## Correlation between price and speed: 0.794

8.4 Mixed Types: One Qualitative + One Quantitative

When we have one categorical and one numerical variable, we use conditional analysis.

cat("πŸ”€ Mixed Analysis: Country (Categorical) vs Price (Numerical)\n\n")
## πŸ”€ Mixed Analysis: Country (Categorical) vs Price (Numerical)
# Conditional summary statistics
price_by_country <- distr.summary.x(cars$price_num, by1=cars$country, 
                                   stats=c("central", "dispersion"))
##   cars$country  n n.a     mode n.modes  mode%   median     mean
##         France  4   0 26792.50       4 0.2500 24430.20 25927.03
##        Germany  6   0 24634.45       6 0.1667 30333.46 29644.77
##          Italy  6   0 19354.12       6 0.1667 28582.25 30332.72
##          Japan 12   0 23887.56      12 0.0833 27685.31 31136.82
##  United States  4   0 39900.58       4 0.2500 34581.44 35656.21
##   cars$country  n n.a    range  IQrange       sd       var   cv
##         France  4   0 15413.19  7396.75  6804.56  46302047 0.26
##        Germany  6   0 30362.11  8410.38 10216.58 104378488 0.34
##          Italy  6   0 27358.69 18614.83 11953.73 142891774 0.39
##          Japan 12   0 31365.75  9538.84  9580.72  91790206 0.31
##  United States  4   0 18497.14 12603.00  8799.23  77426383 0.25
print(price_by_country)
## $`Central tendency measures`
##    cars$country  n n.a     mode n.modes      mode%   median     mean
## 1        France  4   0 26792.50       4 0.25000000 24430.20 25927.03
## 2       Germany  6   0 24634.45       6 0.16666667 30333.46 29644.77
## 3         Italy  6   0 19354.12       6 0.16666667 28582.25 30332.72
## 4         Japan 12   0 23887.56      12 0.08333333 27685.31 31136.82
## 5 United States  4   0 39900.58       4 0.25000000 34581.44 35656.21
## 
## $`Measures of dispersion`
##    cars$country  n n.a    range   IQrange        sd       var        cv
## 1        France  4   0 15413.19  7396.747  6804.561  46302047 0.2624504
## 2       Germany  6   0 30362.11  8410.382 10216.579 104378488 0.3446334
## 3         Italy  6   0 27358.69 18614.825 11953.735 142891774 0.3940871
## 4         Japan 12   0 31365.75  9538.835  9580.721  91790206 0.3076975
## 5 United States  4   0 18497.14 12602.998  8799.226  77426383 0.2467796
# Side-by-side boxplots
distr.plot.xy(cars$country, cars$price_num, plot.type="boxplot")


Chapter 9: Working with R Scripts Functions

Let's use the exact functions from your class R scripts:

9.1 From Video 3 Script - Univariate Distributions

cat("πŸ”§ Using Functions from Video 3 Script:\n\n")
## πŸ”§ Using Functions from Video 3 Script:
# Frequency tables for categorical variables
cat("πŸ“Š Frequency Table for Country:\n")
## πŸ“Š Frequency Table for Country:
country_table <- distr.table.x(x=cars$country, freq=c("counts","prop","perc"))
##   cars$country Count Prop Percent
##         France     4 0.12      12
##        Germany     6 0.19      19
##          Italy     6 0.19      19
##          Japan    12 0.38      38
##  United States     4 0.12      12
##          TOTAL    32 1.00     100
print(country_table)
##      cars$country Count   Prop Percent
## 1          France     4 0.1250   12.50
## 2         Germany     6 0.1875   18.75
## 3           Italy     6 0.1875   18.75
## 4           Japan    12 0.3750   37.50
## 5   United States     4 0.1250   12.50
## Sum         TOTAL    32 1.0000  100.00
# Numerical variables with many values - use breaks
cat("\nπŸ“ˆ Price Distribution with Custom Breaks:\n")
## 
## πŸ“ˆ Price Distribution with Custom Breaks:
price_breaks <- c(10000, 20000, 30000, 50000, 100000)
price_table <- distr.table.x(x=cars$price_num, 
                            freq=c("counts","prop","dens"),
                            breaks=price_breaks)
##  cars$price_num Count Prop Density
##   [10000,20000)     5 0.16   2e-05
##   [20000,30000)    14 0.44   4e-05
##   [30000,50000)    13 0.41   2e-05
##  [50000,100000]     0 0.00       0
##           TOTAL    32 1.00
print(price_table)
##     cars$price_num Count    Prop     Density
## 1    [10000,20000)     5 0.15625 1.56250e-05
## 2    [20000,30000)    14 0.43750 4.37500e-05
## 3    [30000,50000)    13 0.40625 2.03125e-05
## 4   [50000,100000]     0 0.00000 0.00000e+00
## Sum          TOTAL    32 1.00000          NA
# Plots from Video 3 script
par(mfrow=c(1,2))

# Pie chart for categorical data
distr.plot.x(x=cars$country, freq="proportions", plot.type="pie")

# Histogram for numerical data
distr.plot.x(x=cars$price_num, plot.type="hist", breaks=8)

par(mfrow=c(1,1))

9.2 From Video 4 Script - Summary Measures

cat("πŸ”§ Using Functions from Video 4 Script:\n\n")
## πŸ”§ Using Functions from Video 4 Script:
# Summary measures for central tendency
cat("πŸ“Š Central Tendency Measures:\n")
## πŸ“Š Central Tendency Measures:
central_measures <- distr.summary.x(x=cars$price_num, stats="central")
##   n n.a     mode n.modes  mode%   median     mean
##  32   0 24634.45      32 0.0312 27685.31 30619.99
print(central_measures)
## $`Central tendency measures`
##    n n.a     mode n.modes   mode%   median     mean
## 1 32   0 24634.45      32 0.03125 27685.31 30619.99
# Quartiles and percentiles
cat("\nπŸ“ˆ Quartiles and Percentiles:\n")
## 
## πŸ“ˆ Quartiles and Percentiles:
quartile_measures <- distr.summary.x(x=cars$price_num, stats="quartiles")
##   n n.a      min      p25      p50      p75      max
##  32   0 14123.39 24104.94 27685.31 37212.03 48678.02
print(quartile_measures)
## $Quartiles
##    n n.a      min      p25      p50      p75      max
## 1 32   0 14123.39 24104.94 27685.31 37212.03 48678.02
# All dispersion measures
cat("\nπŸ“ All Dispersion Measures:\n")
## 
## πŸ“ All Dispersion Measures:
dispersion_measures <- distr.summary.x(x=cars$price_num, stats="dispersion")
##   n n.a    range  IQrange      sd      var   cv
##  32   0 34554.64 13107.09 9531.25 90844686 0.31
print(dispersion_measures)
## $`Measures of dispersion`
##    n n.a    range  IQrange       sd      var        cv
## 1 32   0 34554.64 13107.09 9531.248 90844686 0.3112753
# Complete summary
cat("\nπŸ“‹ Complete Summary:\n")
## 
## πŸ“‹ Complete Summary:
complete_summary <- distr.summary.x(x=cars$price_num, stats="summary")
##   n n.a      min       q1   median     mean       q3      max      sd      var
##  32   0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.25 90844686
print(complete_summary)
## $`Summary measures`
##    n n.a      min       q1   median     mean       q3      max       sd
## 1 32   0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.248
##        var
## 1 90844686

9.3 From Video 5 Script - Bivariate Analysis

cat("πŸ”§ Using Functions from Video 5 Script:\n\n")
## πŸ”§ Using Functions from Video 5 Script:
# Joint frequency distributions
cat("πŸ“Š Joint Frequency Table:\n")
## πŸ“Š Joint Frequency Table:
joint_table <- distr.table.xy(x=cars$country, y=cars$price_classes, 
                             freq.type="joint", freq="count")
## Joint counts
##                cars$price_classes
## cars$country    low mid high TOTAL
##   France          1   2    1     4
##   Germany         1   4    1     6
##   Italy           2   1    3     6
##   Japan           1   8    3    12
##   United States   0   2    2     4
##   TOTAL           5  17   10    32
print(joint_table)
## $`Joint counts`
##               low mid high TOTAL
## France          1   2    1     4
## Germany         1   4    1     6
## Italy           2   1    3     6
## Japan           1   8    3    12
## United States   0   2    2     4
## TOTAL           5  17   10    32
# Conditional frequencies
cat("\nπŸ“ˆ Conditional Frequencies (Price Class | Country):\n")
## 
## πŸ“ˆ Conditional Frequencies (Price Class | Country):
conditional_table <- distr.table.xy(x=cars$country, y=cars$price_classes,
                                   freq.type="y|x", freq="prop")
## y|x: Proportions
##                cars$price_classes
## cars$country     low  mid high TOTAL
##   France        0.25 0.50 0.25  1.00
##   Germany       0.17 0.67 0.17  1.00
##   Italy         0.33 0.17 0.50  1.00
##   Japan         0.08 0.67 0.25  1.00
##   United States 0.00 0.50 0.50  1.00
print(conditional_table)
## $`y|x: Proportions`
##                      low       mid      high TOTAL
## France        0.25000000 0.5000000 0.2500000     1
## Germany       0.16666667 0.6666667 0.1666667     1
## Italy         0.33333333 0.1666667 0.5000000     1
## Japan         0.08333333 0.6666667 0.2500000     1
## United States 0.00000000 0.5000000 0.5000000     1
# Bivariate plots from Video 5 script
par(mfrow=c(1,2))

# Side-by-side bar chart
distr.plot.xy(x=cars$country, y=cars$price_classes,
              freq.type="y|x", freq="prop",
              plot.type="bars", bar.type="beside")

# Scatter plot for two numerical variables
distr.plot.xy(x=cars$price_num, y=cars$maxspeed, 
              plot.type="scatter")

par(mfrow=c(1,1))

# Correlation analysis
cat("πŸ“ˆ Correlation Analysis:\n")
## πŸ“ˆ Correlation Analysis:
price_speed_cor <- cor(cars$price_num, cars$maxspeed)
cat("Price vs Speed correlation:", round(price_speed_cor, 3), "\n")
## Price vs Speed correlation: 0.794

Chapter 10: Real-World Business Applications

10.1 Quality Control Example

cat("🏭 Quality Control Application:\n")
## 🏭 Quality Control Application:
cat("A factory produces bolts that should be 10cm long\n\n")
## A factory produces bolts that should be 10cm long
# Simulate two production scenarios
set.seed(123)  # For reproducible results

# Scenario A: Good quality control
bolts_A <- rnorm(100, mean=10.0, sd=0.1)
# Scenario B: Poor quality control  
bolts_B <- rnorm(100, mean=10.0, sd=0.5)

cat("πŸ“Š Quality Control Comparison:\n")
## πŸ“Š Quality Control Comparison:
cat("Scenario A: Mean =", round(mean(bolts_A), 3), 
    "cm, SD =", round(sd(bolts_A), 3), "cm\n")
## Scenario A: Mean = 10.009 cm, SD = 0.091 cm
cat("Scenario B: Mean =", round(mean(bolts_B), 3), 
    "cm, SD =", round(sd(bolts_B), 3), "cm\n\n")
## Scenario B: Mean = 9.946 cm, SD = 0.483 cm
# Calculate coefficient of variation
cv_A <- (sd(bolts_A) / mean(bolts_A)) * 100
cv_B <- (sd(bolts_B) / mean(bolts_B)) * 100

cat("πŸ’‘ Quality Assessment:\n")
## πŸ’‘ Quality Assessment:
cat("Scenario A CV:", round(cv_A, 2), "% - EXCELLENT quality control\n")
## Scenario A CV: 0.91 % - EXCELLENT quality control
cat("Scenario B CV:", round(cv_B, 2), "% - POOR quality control\n")
## Scenario B CV: 4.86 % - POOR quality control
# Visualize quality control differences
par(mfrow=c(1,2))

hist(bolts_A, main="Scenario A: Good Quality Control", 
     xlab="Bolt Length (cm)", col="lightgreen", 
     xlim=c(8, 12), breaks=20)
abline(v=10, col="red", lwd=2, lty=2)

hist(bolts_B, main="Scenario B: Poor Quality Control", 
     xlab="Bolt Length (cm)", col="lightcoral", 
     xlim=c(8, 12), breaks=20)
abline(v=10, col="red", lwd=2, lty=2)

par(mfrow=c(1,1))

cat("🎯 Business Impact:\n")
## 🎯 Business Impact:
cat("- Scenario A: Consistent, predictable production\n")
## - Scenario A: Consistent, predictable production
cat("- Scenario B: High variability = quality problems\n")
## - Scenario B: High variability = quality problems

10.2 Investment Risk Analysis

cat("πŸ’° Investment Portfolio Risk Analysis:\n\n")
## πŸ’° Investment Portfolio Risk Analysis:
# Two investment portfolios with same mean return
portfolio_A <- c(8, 9, 10, 11, 12)  # Conservative
portfolio_B <- c(0, 5, 10, 15, 20)  # Risky

cat("Portfolio A returns:", paste(portfolio_A, "%", collapse=", "), "\n")
## Portfolio A returns: 8 %, 9 %, 10 %, 11 %, 12 %
cat("Portfolio B returns:", paste(portfolio_B, "%", collapse=", "), "\n\n")
## Portfolio B returns: 0 %, 5 %, 10 %, 15 %, 20 %
# Calculate risk measures
mean_A <- mean(portfolio_A)
mean_B <- mean(portfolio_B)
sd_A <- sd(portfolio_A)
sd_B <- sd(portfolio_B)

cat("πŸ“Š Risk Analysis:\n")
## πŸ“Š Risk Analysis:
cat("Portfolio A: Mean =", mean_A, "%, SD =", round(sd_A, 2), "%\n")
## Portfolio A: Mean = 10 %, SD = 1.58 %
cat("Portfolio B: Mean =", mean_B, "%, SD =", round(sd_B, 2), "%\n\n")
## Portfolio B: Mean = 10 %, SD = 7.91 %
cat("πŸ’‘ Investment Advice:\n")
## πŸ’‘ Investment Advice:
cat("- Both portfolios have same expected return (", mean_A, "%)\n")
## - Both portfolios have same expected return ( 10 %)
cat("- Portfolio A is much less risky (lower standard deviation)\n")
## - Portfolio A is much less risky (lower standard deviation)
cat("- Portfolio B has higher volatility = higher risk\n")
## - Portfolio B has higher volatility = higher risk

Chapter 11: Practice Problems and Exercises

11.1 Exercise 1: Manual Variance Calculation

Your turn! Calculate variance manually for this small dataset:

Test scores: 85, 90, 78, 92, 88

cat("🎯 Exercise 1: Manual Variance Calculation\n")
## 🎯 Exercise 1: Manual Variance Calculation
cat("Test scores: 85, 90, 78, 92, 88\n")
## Test scores: 85, 90, 78, 92, 88
cat("Calculate sample variance step by step!\n\n")
## Calculate sample variance step by step!
scores <- c(85, 90, 78, 92, 88)

# Step 1: Calculate mean
cat("Step 1: Calculate the mean\n")
## Step 1: Calculate the mean
cat("Your turn! What is (85 + 90 + 78 + 92 + 88) Γ· 5 = ?\n\n")
## Your turn! What is (85 + 90 + 78 + 92 + 88) Γ· 5 = ?
# Provide solution
mean_scores <- mean(scores)
cat("βœ… Solution: Mean =", mean_scores, "\n\n")
## βœ… Solution: Mean = 86.6
# Step 2: Calculate deviations
cat("Step 2: Calculate deviations from mean\n")
## Step 2: Calculate deviations from mean
deviations <- scores - mean_scores
for(i in 1:length(scores)) {
  cat("Score", i, ":", scores[i], "- ", mean_scores, "=", deviations[i], "\n")
}
## Score 1 : 85 -  86.6 = -1.6 
## Score 2 : 90 -  86.6 = 3.4 
## Score 3 : 78 -  86.6 = -8.6 
## Score 4 : 92 -  86.6 = 5.4 
## Score 5 : 88 -  86.6 = 1.4
# Step 3: Square deviations and calculate variance
cat("\nStep 3: Square deviations and calculate sample variance\n")
## 
## Step 3: Square deviations and calculate sample variance
squared_devs <- deviations^2
sample_var <- sum(squared_devs) / (length(scores) - 1)

cat("Sum of squared deviations:", sum(squared_devs), "\n")
## Sum of squared deviations: 119.2
cat("Sample variance: sΒ² =", round(sample_var, 2), "\n")
## Sample variance: sΒ² = 29.8
cat("Standard deviation: s =", round(sqrt(sample_var), 2), "\n")
## Standard deviation: s = 5.46

11.2 Exercise 2: Coefficient of Variation Challenge

cat("🎯 Exercise 2: CV Comparison Challenge\n")
## 🎯 Exercise 2: CV Comparison Challenge
cat("Which is more variable: car weights or car prices?\n\n")
## Which is more variable: car weights or car prices?
# Calculate CV for both
weight_cv <- (sd(cars$weight) / mean(cars$weight)) * 100
price_cv <- (sd(cars$price_num) / mean(cars$price_num)) * 100

cat("πŸ’ͺ Weight Statistics:\n")
## πŸ’ͺ Weight Statistics:
cat("Mean:", round(mean(cars$weight), 0), "kg\n")
## Mean: 1609 kg
cat("SD:", round(sd(cars$weight), 0), "kg\n")
## SD: 489 kg
cat("CV:", round(weight_cv, 1), "%\n\n")
## CV: 30.4 %
cat("πŸ’° Price Statistics:\n")
## πŸ’° Price Statistics:
cat("Mean: $", round(mean(cars$price_num), 0), "\n")
## Mean: $ 30620
cat("SD: $", round(sd(cars$price_num), 0), "\n")
## SD: $ 9531
cat("CV:", round(price_cv, 1), "%\n\n")
## CV: 31.1 %
cat("πŸ† Winner: ", ifelse(weight_cv > price_cv, "Weight", "Price"), 
    " is more variable (higher CV)\n")
## πŸ† Winner:  Price  is more variable (higher CV)

11.3 Exercise 3: Correlation Interpretation

cat("🎯 Exercise 3: Correlation Detective\n")
## 🎯 Exercise 3: Correlation Detective
cat("Interpret these correlation coefficients:\n\n")
## Interpret these correlation coefficients:
# Different correlation examples
correlations <- c(0.85, -0.92, 0.12, -0.45, 0.0)
variables <- c("Study hours vs Exam scores",
              "Car age vs Market value", 
              "Height vs Happiness",
              "Temperature vs Heating costs",
              "Shoe size vs Intelligence")

for (i in seq_along(correlations)) {
  cat("📊", variables[i], ": r =", correlations[i], "\n")
  
  # Interpretation: sign gives the direction, |r| gives the strength
  r <- abs(correlations[i])
  direction <- ifelse(correlations[i] > 0, "positive", 
                     ifelse(correlations[i] < 0, "negative", "no"))
  
  strength <- ifelse(r > 0.8, "very strong",
                    ifelse(r > 0.6, "strong",
                          ifelse(r > 0.4, "moderate",
                                ifelse(r > 0.2, "weak", "very weak"))))
  
  if (direction == "no") {
    cat("   Interpretation: no linear relationship\n\n")
  } else {
    cat("   Interpretation:", strength, direction, "relationship\n\n")
  }
}
## 📊 Study hours vs Exam scores : r = 0.85 
##    Interpretation: very strong positive relationship
## 
## 📊 Car age vs Market value : r = -0.92 
##    Interpretation: very strong negative relationship
## 
## 📊 Height vs Happiness : r = 0.12 
##    Interpretation: very weak positive relationship
## 
## 📊 Temperature vs Heating costs : r = -0.45 
##    Interpretation: moderate negative relationship
## 
## 📊 Shoe size vs Intelligence : r = 0 
##    Interpretation: no linear relationship
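To practice on data rather than ready-made coefficients, try the same interpretation rules on a correlation from our cars dataset. The exact value depends on the random noise used when the data were simulated, but it should come out strongly positive because maxspeed was built from hp:

# Correlation between horsepower and (simulated) top speed
cor(cars$hp, cars$maxspeed)
# Apply the same rules: the sign gives the direction, |r| gives the strength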

Chapter 12: Summary and Key Takeaways

12.1 Complete Formula Reference

Here’s your complete reference for all formulas covered today:

Measures of Variability

Range: \(\text{Range} = x_{\max} - x_{\min}\)

IQR: \(\text{IQR} = Q_3 - Q_1\)

Population Variance (Raw Data): \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2\)

Sample Variance (Raw Data): \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{n}{n-1}\left(\frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2\right)\)

Frequency Data Variance (with relative frequencies \(p_k\)): \(\sigma^2 = \sum_{k=1}^{K}(x_k - \mu)^2 p_k = \sum_{k=1}^{K} x_k^2 p_k - \mu^2\)

Standard Deviation: \(s = \sqrt{s^2}\)

Coefficient of Variation: \(CV = \frac{s}{\bar{x}} \times 100\%\)
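The equivalent forms of the variance formulas above are easy to verify numerically. Here is a small sketch (reusing the exam scores from Exercise 1, plus made-up values for the frequency-data case); each pair of expressions should agree:

# Sample variance: definitional form vs computational shortcut
x <- c(85, 90, 78, 92, 88)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)                # definitional form
(n / (n - 1)) * (sum(x^2) / n - mean(x)^2)    # shortcut form
var(x)                                        # built-in check

# Frequency-data variance with relative frequencies p_k (invented example values)
xk <- c(1, 2, 3)
pk <- c(0.2, 0.5, 0.3)
mu <- sum(xk * pk)
sum((xk - mu)^2 * pk)      # definitional form
sum(xk^2 * pk) - mu^2      # shortcut form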

Bivariate Analysis

Correlation Coefficient: \(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}\)

12.2 When to Use Each Measure

cat("πŸ“‹ Quick Reference Guide:\n\n")
## πŸ“‹ Quick Reference Guide:
measures <- data.frame(
  Measure = c("Range", "IQR", "Variance", "Std Dev", "CV"),
  Best_Used_When = c(
    "Quick estimate, no outliers",
    "Data has outliers or skewed", 
    "Mathematical calculations",
    "Describing spread in original units",
    "Comparing different variables"
  ),
  Sensitive_to_Outliers = c("Yes", "No", "Yes", "Yes", "Depends"),
  stringsAsFactors = FALSE
)

print(measures)
##    Measure                      Best_Used_When Sensitive_to_Outliers
## 1    Range         Quick estimate, no outliers                   Yes
## 2      IQR      Data has outliers or is skewed                    No
## 3 Variance           Mathematical calculations                   Yes
## 4  Std Dev Describing spread in original units                   Yes
## 5       CV       Comparing different variables               Depends

12.3 R Functions Summary

cat("πŸ”§ Essential R Functions Summary:\n\n")
## πŸ”§ Essential R Functions Summary:
cat("πŸ“Š Basic Functions:\n")
## πŸ“Š Basic Functions:
cat("range(x)          # Max - Min\n")
## range(x)          # Max - Min
cat("IQR(x)            # Q3 - Q1\n") 
## IQR(x)            # Q3 - Q1
cat("var(x)            # Sample variance\n")
## var(x)            # Sample variance
cat("sd(x)             # Sample standard deviation\n")
## sd(x)             # Sample standard deviation
cat("cor(x, y)         # Correlation coefficient\n\n")
## cor(x, y)         # Correlation coefficient
cat("πŸ“ˆ UBStats Functions:\n")
## πŸ“ˆ UBStats Functions:
cat("distr.summary.x(x, stats='dispersion')    # All variability measures\n")
## distr.summary.x(x, stats='dispersion')    # All variability measures
cat("distr.table.xy(x, y, freq.type='joint')   # Crosstabs\n")
## distr.table.xy(x, y, freq.type='joint')   # Crosstabs
cat("distr.plot.xy(x, y, plot.type='scatter')  # Scatterplots\n")
## distr.plot.xy(x, y, plot.type='scatter')  # Scatterplots
cat("distr.plot.x(x, plot.type='boxplot')      # Boxplots\n")
## distr.plot.x(x, plot.type='boxplot')      # Boxplots
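To see these functions side by side, here is a quick sketch on one of our simulated variables (the exact numbers will differ from run to run because the dataset was generated with random noise):

# Spread of the simulated car prices
diff(range(cars$price_num))    # overall spread (max - min)
IQR(cars$price_num)            # spread of the middle 50%
var(cars$price_num)            # sample variance
sd(cars$price_num)             # sample standard deviation
cor(cars$hp, cars$price_num)   # correlation between two numeric variables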

12.4 Business Applications Recap

cat("πŸ’Ό Key Business Applications:\n\n")
## πŸ’Ό Key Business Applications:
cat("🏭 Quality Control:\n")
## 🏭 Quality Control:
cat("- Use standard deviation to monitor production consistency\n")
## - Use standard deviation to monitor production consistency
cat("- Lower SD = better quality control\n")
## - Lower SD = better quality control
cat("- CV helps compare quality across different products\n\n")
## - CV helps compare quality across different products
cat("πŸ’° Investment Analysis:\n") 
## πŸ’° Investment Analysis:
cat("- Standard deviation measures investment risk\n")
## - Standard deviation measures investment risk
cat("- Higher SD = higher risk and volatility\n")
## - Higher SD = higher risk and volatility
cat("- Use correlation to diversify portfolios\n\n")
## - Use correlation to diversify portfolios
cat("πŸ“Š Market Research:\n")
## πŸ“Š Market Research:
cat("- Use crosstabs to find customer segments\n")
## - Use crosstabs to find customer segments
cat("- Correlation identifies relationships between variables\n")
## - Correlation identifies relationships between variables
cat("- CV compares variability across different metrics\n")
## - CV compares variability across different metrics
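As a toy illustration of the investment idea, compare two made-up return series (the numbers below are invented purely for illustration): the series with the larger standard deviation is the riskier asset, and the correlation between them indicates how much diversification would help.

# Invented monthly returns for two hypothetical assets
asset_a <- c(0.02, 0.01, 0.03, 0.02, 0.01)
asset_b <- c(0.10, -0.05, 0.08, -0.02, 0.04)

sd(asset_a)            # smaller SD: steadier, lower-risk returns
sd(asset_b)            # larger SD: more volatile, higher-risk returns
cor(asset_a, asset_b)  # the lower the correlation, the greater the diversification benefit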

Chapter 13: Next Steps and Further Practice

13.1 What’s Coming Next

In our next lecture, we’ll explore:

  • Probability Distributions - Normal, binomial, and other key distributions
  • Confidence Intervals - Estimating population parameters
  • Hypothesis Testing - Making statistical decisions
  • Regression Analysis - Predicting one variable from another

13.2 Practice Recommendations

cat("🎯 Recommended Practice:\n\n")
## 🎯 Recommended Practice:
cat("1. πŸ“ Manual Calculations:\n")
## 1. πŸ“ Manual Calculations:
cat("   - Practice variance calculations by hand\n")
##    - Practice variance calculations by hand
cat("   - Work through correlation examples\n")
##    - Work through correlation examples
cat("   - Create your own crosstabs\n\n")
##    - Create your own crosstabs
cat("2. πŸ”§ R Programming:\n")
## 2. πŸ”§ R Programming:
cat("   - Use the cars dataset for more analyses\n")
##    - Use the cars dataset for more analyses
cat("   - Try different variable combinations\n")
##    - Try different variable combinations
cat("   - Create your own visualizations\n\n")
##    - Create your own visualizations
cat("3. πŸ’Ό Real Applications:\n")
## 3. πŸ’Ό Real Applications:
cat("   - Find your own datasets online\n")
##    - Find your own datasets online
cat("   - Apply these concepts to your interests\n")
##    - Apply these concepts to your interests
cat("   - Think about business problems you could solve\n")
##    - Think about business problems you could solve

13.3 Final Motivation

cat("🌟 Congratulations! 🌟\n\n")
## 🌟 Congratulations! 🌟
cat("You've just mastered some of the most important concepts in statistics:\n")
## You've just mastered some of the most important concepts in statistics:
cat("βœ… How to measure variability in data\n")
## βœ… How to measure variability in data
cat("βœ… How to compare different types of variables\n") 
## βœ… How to compare different types of variables
cat("βœ… How to study relationships between variables\n")
## βœ… How to study relationships between variables
cat("βœ… How to apply these concepts to real business problems\n\n")
## βœ… How to apply these concepts to real business problems
cat("πŸš€ You're now equipped with powerful tools for:\n")
## πŸš€ You're now equipped with powerful tools for:
cat("- Making data-driven business decisions\n")
## - Making data-driven business decisions
cat("- Understanding risk and uncertainty\n")
## - Understanding risk and uncertainty
cat("- Finding patterns and relationships in data\n")
## - Finding patterns and relationships in data
cat("- Communicating statistical findings effectively\n\n")
## - Communicating statistical findings effectively
cat("Keep practicing, stay curious, and remember:\n")
## Keep practicing, stay curious, and remember:
cat("Statistics is not just about numbers - it's about understanding the world!\n")
## Statistics is not just about numbers - it's about understanding the world!

🎓 End of Lecture 3 - Great job making it through 120 minutes of intensive learning!

Remember to save your work and practice with the R code examples. See you next time for more statistical adventures!

# Session information for reproducibility
cat("πŸ“‹ Session Information:\n")
## πŸ“‹ Session Information:
sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Europe/Budapest
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] UBStats_0.2.2
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] cachem_1.1.0      knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.29   
##  [9] lifecycle_1.0.4   cli_3.6.3         sass_0.4.9        jquerylib_0.1.4  
## [13] compiler_4.4.2    rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.1   
## [17] bslib_0.8.0       yaml_2.3.10       rlang_1.1.4       jsonlite_1.8.9