In our previous lectures, we learned how to find the βtypicalβ value in our data using measures of central tendency (mean, median, mode). But thatβs only half the story! Today we complete the picture by understanding:
Think of it this way: if central tendency tells us the βaverage student grade,β variability tells us whether all students performed similarly or if thereβs a huge range from failing to excellent grades.
Imagine two pizza delivery restaurants:
Both have the same average, but which would you prefer? Restaurant A is predictable and reliable, while Restaurant B is unpredictable. This is why we need measures of variability!
# Load required libraries
library(UBStats)
# Use built-in mtcars dataset and modify it for our examples
cars <- mtcars
# Add some columns to match our examples
cars$country <- sample(c("Germany", "Japan", "France", "Italy", "United States"),
nrow(cars), replace=TRUE,
prob=c(0.3, 0.25, 0.15, 0.15, 0.15))
cars$price_num <- cars$hp * 100 + rnorm(nrow(cars), 15000, 5000)
cars$price_num <- pmax(cars$price_num, 10000) # Minimum price $10,000
cars$price_classes <- cut(cars$price_num,
breaks=c(0, 20000, 35000, Inf),
labels=c("low", "mid", "high"))
cars$maxspeed <- cars$hp * 2.5 + rnorm(nrow(cars), 50, 15)
cars$acceleration <- 20 - (cars$hp/20) + rnorm(nrow(cars), 0, 2)
cars$weight <- cars$wt * 500
cars$sales <- rpois(nrow(cars), 8000)
# Let's start our journey!
cat("π Today we'll analyze", nrow(cars), "car models\n")
## π Today we'll analyze 32 car models
cat("π We'll explore both variability and relationships between variables\n")
## π We'll explore both variability and relationships between variables
# Let's start our journey!
cat("π Today we'll analyze", nrow(cars), "car models\n")
## π Today we'll analyze 32 car models
cat("π We'll explore both variability and relationships between variables\n")
## π We'll explore both variability and relationships between variables
The range is the simplest measure of spread:
\[\text{Range} = x_{\text{max}} - x_{\text{min}}\]
Letβs work through Example 1 from your PDF exactly as written:
Sample 1: 1, 2, 3, 4, 5, 6, 7
Sample 2: 1, 2, 3, 4, 5, 6, 100
# Example 1 from PDF - exactly as in your notes
sample1 <- c(1, 2, 3, 4, 5, 6, 7)
sample2 <- c(1, 2, 3, 4, 5, 6, 100)
cat("π Example 1 from Your Notes:\n")
## π Example 1 from Your Notes:
cat("Sample 1:", paste(sample1, collapse=", "), "\n")
## Sample 1: 1, 2, 3, 4, 5, 6, 7
cat("Sample 2:", paste(sample2, collapse=", "), "\n\n")
## Sample 2: 1, 2, 3, 4, 5, 6, 100
# Calculate ranges
range1 <- max(sample1) - min(sample1)
range2 <- max(sample2) - min(sample2)
cat("π’ Range Calculations:\n")
## π’ Range Calculations:
cat("Sample 1 Range: 7 - 1 =", range1, "\n")
## Sample 1 Range: 7 - 1 = 6
cat("Sample 2 Range: 100 - 1 =", range2, "\n\n")
## Sample 2 Range: 100 - 1 = 99
cat("π€ What happened? One outlier (100) made the range jump from 6 to 99!\n")
## π€ What happened? One outlier (100) made the range jump from 6 to 99!
The IQR measures the spread of the middle 50% of data:
\[\text{IQR} = Q_3 - Q_1\]
# Calculate IQR for both samples
cat("π IQR Calculations:\n")
## π IQR Calculations:
# For Sample 1
q1_s1 <- quantile(sample1, 0.25) # Q1
q3_s1 <- quantile(sample1, 0.75) # Q3
iqr1 <- q3_s1 - q1_s1
cat("Sample 1: Q1 =", q1_s1, ", Q3 =", q3_s1, ", IQR =", iqr1, "\n")
## Sample 1: Q1 = 2.5 , Q3 = 5.5 , IQR = 3
# For Sample 2
q1_s2 <- quantile(sample2, 0.25) # Q1
q3_s2 <- quantile(sample2, 0.75) # Q3
iqr2 <- q3_s2 - q1_s2
cat("Sample 2: Q1 =", q1_s2, ", Q3 =", q3_s2, ", IQR =", iqr2, "\n\n")
## Sample 2: Q1 = 2.5 , Q3 = 5.5 , IQR = 3
cat("β¨ Amazing! The IQR stayed the same because it focuses on the middle 50%\n")
## β¨ Amazing! The IQR stayed the same because it focuses on the middle 50%
cat(" and ignores extreme outliers!\n")
## and ignores extreme outliers!
# Visualize the difference
par(mfrow=c(1,2))
boxplot(sample1, main="Sample 1: No Outliers", ylab="Values", col="lightblue")
boxplot(sample2, main="Sample 2: With Outlier", ylab="Values", col="lightcoral")
par(mfrow=c(1,1))
Key Learning: IQR is βrobustβ - itβs not affected by extreme outliers, making it more reliable for describing typical variability.
Now we get to the heart of variability - variance. This is where all the formulas from your PDF come into play!
Variance measures the average squared distance from the mean. Think of it as asking: βOn average, how far away are my data points from the center?β
Let me show you why we canβt just average the deviations:
# Simple example: 2, 4, 6 (mean = 4)
simple_data <- c(2, 4, 6)
mean_val <- mean(simple_data)
deviations <- simple_data - mean_val
cat("π Why We Need to Square Deviations:\n")
## π Why We Need to Square Deviations:
cat("Data:", paste(simple_data, collapse=", "), "\n")
## Data: 2, 4, 6
cat("Mean:", mean_val, "\n")
## Mean: 4
cat("Deviations:", paste(deviations, collapse=", "), "\n")
## Deviations: -2, 0, 2
cat("Sum of deviations:", sum(deviations), "\n\n")
## Sum of deviations: 0
cat("π± The deviations always sum to zero! Positive and negative cancel out.\n")
## π± The deviations always sum to zero! Positive and negative cancel out.
cat("π‘ Solution: Square them to make all values positive!\n")
## π‘ Solution: Square them to make all values positive!
squared_devs <- deviations^2
cat("Squared deviations:", paste(squared_devs, collapse=", "), "\n")
## Squared deviations: 4, 0, 4
cat("Now we can average them:", mean(squared_devs), "\n")
## Now we can average them: 2.666667
Your PDF shows multiple formulas for population variance. Let me explain every single one:
\[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\]
\[\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2\]
Letβs work through Example 2 from your PDF step by step:
Data: 80, 120, 150, 90 (dollars)
Assignment: Compute the variance
# Example 2 from your PDF - Airline tickets
prices <- c(80, 120, 150, 90)
N <- length(prices)
cat("βοΈ Example 2: Airline Ticket Variance Calculation\n")
## βοΈ Example 2: Airline Ticket Variance Calculation
cat("Data: $", paste(prices, collapse=", $"), "\n")
## Data: $ 80, $120, $150, $90
cat("Population size (N):", N, "\n\n")
## Population size (N): 4
# Step 1: Calculate population mean (ΞΌ)
mu <- sum(prices) / N
cat("π Step 1 - Population Mean:\n")
## π Step 1 - Population Mean:
cat("ΞΌ = (", paste(prices, collapse=" + "), ") Γ·", N, "\n")
## ΞΌ = ( 80 + 120 + 150 + 90 ) Γ· 4
cat("ΞΌ = ", sum(prices), " Γ·", N, " = $", mu, "\n\n")
## ΞΌ = 440 Γ· 4 = $ 110
# Step 2: Calculate deviations (xi - ΞΌ)
deviations <- prices - mu
cat("π Step 2 - Deviations from Mean:\n")
## π Step 2 - Deviations from Mean:
for(i in 1:N) {
cat("x", i, "- ΞΌ = $", prices[i], " - $", mu, " = $", deviations[i], "\n")
}
## x 1 - ΞΌ = $ 80 - $ 110 = $ -30
## x 2 - ΞΌ = $ 120 - $ 110 = $ 10
## x 3 - ΞΌ = $ 150 - $ 110 = $ 40
## x 4 - ΞΌ = $ 90 - $ 110 = $ -20
# Step 3: Square the deviations
squared_devs <- deviations^2
cat("\nπ’ Step 3 - Squared Deviations:\n")
##
## π’ Step 3 - Squared Deviations:
for(i in 1:N) {
cat("(x", i, "- ΞΌ)Β² = ($", deviations[i], ")Β² = ", squared_devs[i], "\n")
}
## (x 1 - ΞΌ)Β² = ($ -30 )Β² = 900
## (x 2 - ΞΌ)Β² = ($ 10 )Β² = 100
## (x 3 - ΞΌ)Β² = ($ 40 )Β² = 1600
## (x 4 - ΞΌ)Β² = ($ -20 )Β² = 400
# Step 4: Sum squared deviations
sum_squared <- sum(squared_devs)
cat("\nβ Step 4 - Sum of Squared Deviations:\n")
##
## β Step 4 - Sum of Squared Deviations:
cat("Ξ£(xi - ΞΌ)Β² = ", paste(squared_devs, collapse=" + "), " = ", sum_squared, "\n")
## Ξ£(xi - ΞΌ)Β² = 900 + 100 + 1600 + 400 = 3000
# Step 5: Calculate variance
variance <- sum_squared / N
cat("\nπ― Step 5 - Population Variance:\n")
##
## π― Step 5 - Population Variance:
cat("ΟΒ² = Ξ£(xi - ΞΌ)Β² Γ· N = ", sum_squared, " Γ·", N, " = ", variance, "\n")
## ΟΒ² = Ξ£(xi - ΞΌ)Β² Γ· N = 3000 Γ· 4 = 750
Now letβs verify using the shortcut formula:
cat("π Shortcut Formula Verification:\n")
## π Shortcut Formula Verification:
cat("ΟΒ² = (Ξ£xiΒ² Γ· N) - ΞΌΒ²\n\n")
## ΟΒ² = (Ξ£xiΒ² Γ· N) - ΞΌΒ²
# Calculate Ξ£xiΒ²
sum_x_squared <- sum(prices^2)
cat("π Sum of squares:\n")
## π Sum of squares:
for(i in 1:N) {
cat("x", i, "Β² = ", prices[i], "Β² = ", prices[i]^2, "\n")
}
## x 1 Β² = 80 Β² = 6400
## x 2 Β² = 120 Β² = 14400
## x 3 Β² = 150 Β² = 22500
## x 4 Β² = 90 Β² = 8100
cat("Ξ£xiΒ² = ", paste(prices^2, collapse=" + "), " = ", sum_x_squared, "\n\n")
## Ξ£xiΒ² = 6400 + 14400 + 22500 + 8100 = 51400
# Apply shortcut formula
shortcut_variance <- (sum_x_squared / N) - mu^2
cat("π Shortcut calculation:\n")
## π Shortcut calculation:
cat("ΟΒ² = (", sum_x_squared, " Γ·", N, ") - (", mu, ")Β²\n")
## ΟΒ² = ( 51400 Γ· 4 ) - ( 110 )Β²
cat("ΟΒ² = ", sum_x_squared/N, " - ", mu^2, " = ", shortcut_variance, "\n\n")
## ΟΒ² = 12850 - 12100 = 750
cat("β
Both methods give the same result: ΟΒ² =", variance, "\n")
## β
Both methods give the same result: ΟΒ² = 750
When we have a sample (not the whole population), we use:
\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\]
Why n-1 instead of n? Because we used our data to estimate the mean, we βloseβ one degree of freedom. This gives us a better estimate of the true population variance.
cat("π¬ Sample Variance (treating our airline data as a sample):\n")
## π¬ Sample Variance (treating our airline data as a sample):
n <- length(prices)
x_bar <- mean(prices) # Sample mean
# Calculate sample variance manually
sample_deviations <- prices - x_bar
sample_variance_manual <- sum(sample_deviations^2) / (n - 1)
cat("Sample mean (xΜ):", x_bar, "\n")
## Sample mean (xΜ): 110
cat("Manual calculation: sΒ² =", round(sample_variance_manual, 2), "\n")
## Manual calculation: sΒ² = 1000
# Verify with R function
sample_variance_R <- var(prices)
cat("R var() function:", round(sample_variance_R, 2), "\n")
## R var() function: 1000
cat("β
Perfect match!\n")
## β
Perfect match!
When data is organized in frequency tables, we need special formulas. Your PDF shows these clearly:
Original Formula: \[\sigma^2 = \frac{1}{N}\sum_{k=1}^{K}(x_k - \mu)^2 f_k = \sum_{k=1}^{K}(x_k - \mu)^2 p_k\]
Shortcut Formula: \[\sigma^2 = \frac{\sum_{k=1}^{K} x_k^2 f_k}{N} - \mu^2 = \sum_{k=1}^{K} x_k^2 p_k - \mu^2\]
Letβs work through Example 3 exactly as shown in your notes:
Variable: Number of cars owned (sample of 100 families)
| Number of cars (xk) | Frequency (fk) | Proportion (pk) |
|---|---|---|
| 1 | 32 | 0.32 |
| 2 | 48 | 0.48 |
| 3 | 16 | 0.16 |
| 5 | 4 | 0.04 |
# Example 3 from PDF - Car ownership frequency data
cars_owned <- c(1, 2, 3, 5)
frequencies <- c(32, 48, 16, 4)
N_total <- sum(frequencies)
proportions <- frequencies / N_total
cat("π Example 3: Car Ownership Frequency Analysis\n")
## π Example 3: Car Ownership Frequency Analysis
cat("Values (xk):", paste(cars_owned, collapse=", "), "\n")
## Values (xk): 1, 2, 3, 5
cat("Frequencies (fk):", paste(frequencies, collapse=", "), "\n")
## Frequencies (fk): 32, 48, 16, 4
cat("Proportions (pk):", paste(round(proportions, 2), collapse=", "), "\n")
## Proportions (pk): 0.32, 0.48, 0.16, 0.04
cat("Total families (N):", N_total, "\n\n")
## Total families (N): 100
# Step 1: Calculate weighted mean
weighted_mean <- sum(cars_owned * proportions)
cat("π Step 1 - Weighted Mean:\n")
## π Step 1 - Weighted Mean:
cat("ΞΌ = Ξ£(xk Γ pk)\n")
## ΞΌ = Ξ£(xk Γ pk)
for(i in 1:length(cars_owned)) {
cat(" ", cars_owned[i], "Γ", proportions[i], "=",
round(cars_owned[i] * proportions[i], 3), "\n")
}
## 1 Γ 0.32 = 0.32
## 2 Γ 0.48 = 0.96
## 3 Γ 0.16 = 0.48
## 5 Γ 0.04 = 0.2
cat("ΞΌ = ", round(weighted_mean, 3), " cars per family\n\n")
## ΞΌ = 1.96 cars per family
# Step 2: Variance using original formula
cat("π Step 2 - Variance (Original Formula):\n")
## π Step 2 - Variance (Original Formula):
cat("ΟΒ² = Ξ£(xk - ΞΌ)Β² Γ pk\n")
## ΟΒ² = Ξ£(xk - ΞΌ)Β² Γ pk
deviations_freq <- cars_owned - weighted_mean
squared_dev_freq <- deviations_freq^2
variance_components <- squared_dev_freq * proportions
for(i in 1:length(cars_owned)) {
cat("(", cars_owned[i], "-", round(weighted_mean, 3), ")Β² Γ ",
proportions[i], " = ", round(squared_dev_freq[i], 3), " Γ ",
proportions[i], " = ", round(variance_components[i], 4), "\n")
}
## ( 1 - 1.96 )Β² Γ 0.32 = 0.922 Γ 0.32 = 0.2949
## ( 2 - 1.96 )Β² Γ 0.48 = 0.002 Γ 0.48 = 8e-04
## ( 3 - 1.96 )Β² Γ 0.16 = 1.082 Γ 0.16 = 0.1731
## ( 5 - 1.96 )Β² Γ 0.04 = 9.242 Γ 0.04 = 0.3697
variance_freq <- sum(variance_components)
cat("ΟΒ² = ", paste(round(variance_components, 4), collapse=" + "),
" = ", round(variance_freq, 3), "\n\n")
## ΟΒ² = 0.2949 + 8e-04 + 0.1731 + 0.3697 = 0.838
# Step 3: Verify with shortcut formula
cat("π Step 3 - Shortcut Formula Verification:\n")
## π Step 3 - Shortcut Formula Verification:
cat("ΟΒ² = Ξ£(xkΒ² Γ pk) - ΞΌΒ²\n")
## ΟΒ² = Ξ£(xkΒ² Γ pk) - ΞΌΒ²
x_squared_times_p <- cars_owned^2 * proportions
for(i in 1:length(cars_owned)) {
cat(cars_owned[i], "Β² Γ ", proportions[i], " = ",
cars_owned[i]^2, " Γ ", proportions[i], " = ",
round(x_squared_times_p[i], 3), "\n")
}
## 1 Β² Γ 0.32 = 1 Γ 0.32 = 0.32
## 2 Β² Γ 0.48 = 4 Γ 0.48 = 1.92
## 3 Β² Γ 0.16 = 9 Γ 0.16 = 1.44
## 5 Β² Γ 0.04 = 25 Γ 0.04 = 1
shortcut_variance <- sum(x_squared_times_p) - weighted_mean^2
cat("ΟΒ² = ", round(sum(x_squared_times_p), 3), " - ",
round(weighted_mean^2, 3), " = ", round(shortcut_variance, 3), "\n")
## ΟΒ² = 4.68 - 3.842 = 0.838
cat("β
Both formulas match!\n")
## β
Both formulas match!
Sample Formula: \[s^2 = \frac{n}{n-1}\sum_{k=1}^{K}(x_k - \bar{x})^2 p_k\]
Sample Shortcut Formula: \[s^2 = \frac{n}{n-1}\left(\sum_{k=1}^{K} x_k^2 p_k - \bar{x}^2\right)\]
# Calculate sample variance for frequency data
sample_var_freq <- (N_total/(N_total-1)) * variance_freq
cat("π¬ Sample Variance for Frequency Data:\n")
## π¬ Sample Variance for Frequency Data:
cat("sΒ² = (n/(n-1)) Γ ΟΒ²\n")
## sΒ² = (n/(n-1)) Γ ΟΒ²
cat("sΒ² = (", N_total, "/(", N_total, "-1)) Γ ", round(variance_freq, 3), "\n")
## sΒ² = ( 100 /( 100 -1)) Γ 0.838
cat("sΒ² = ", round(sample_var_freq, 3), "\n")
## sΒ² = 0.847
Your PDF shows that for grouped data (continuous variables in interval classes), we get approximate variance:
Population Variance: \[\sigma^2 = \frac{1}{N}\sum_{k=1}^{K}(m_k - \mu)^2 f_k\]
Sample Variance: \[s^2 = \frac{n}{n-1}\left[\frac{\sum_{k=1}^{K} m_k^2 f_k}{n} - \bar{x}^2\right]\]
Where \(m_k\) = midpoint of each interval class.
# Example of grouped data - car speeds
cat("ποΈ Grouped Data Example: Car Speeds\n")
## ποΈ Grouped Data Example: Car Speeds
cat("Note: This gives APPROXIMATE variance because we use midpoints\n\n")
## Note: This gives APPROXIMATE variance because we use midpoints
# Speed intervals and their midpoints
intervals <- c("[0,30)", "[30,50)", "[50,100)")
midpoints <- c(15, 40, 75) # mk values
frequencies_speed <- c(4, 12, 8)
n_speed <- sum(frequencies_speed)
cat("Speed Intervals:", paste(intervals, collapse=", "), "\n")
## Speed Intervals: [0,30), [30,50), [50,100)
cat("Midpoints (mk):", paste(midpoints, collapse=", "), "\n")
## Midpoints (mk): 15, 40, 75
cat("Frequencies (fk):", paste(frequencies_speed, collapse=", "), "\n\n")
## Frequencies (fk): 4, 12, 8
# Calculate approximate mean
approx_mean <- sum(midpoints * frequencies_speed) / n_speed
cat("Approximate mean: xΜ =", round(approx_mean, 1), "km/h\n")
## Approximate mean: xΜ = 47.5 km/h
# Calculate approximate sample variance
mk_squared_fk <- midpoints^2 * frequencies_speed
approx_variance <- (n_speed/(n_speed-1)) *
(sum(mk_squared_fk)/n_speed - approx_mean^2)
cat("Approximate sample variance: sΒ² =", round(approx_variance, 1), "\n")
## Approximate sample variance: sΒ² = 476.1
cat("β οΈ Remember: This is approximate because we used interval midpoints!\n")
## β οΈ Remember: This is approximate because we used interval midpoints!
Variance has squared units - if we measure height in cm, variance is in cmΒ². What does β25 cmΒ²β of height variability mean? Itβs confusing!
Solution: Take the square root to get back to original units!
\[\text{Standard Deviation} = \sqrt{\text{Variance}}\]
\[\sigma = \sqrt{\sigma^2}\]
\[s = \sqrt{s^2}\]
# Using our airline ticket example
cat("βοΈ Standard Deviation from Airline Ticket Example:\n")
## βοΈ Standard Deviation from Airline Ticket Example:
cat("Variance: ΟΒ² =", variance, "dollarsΒ²\n")
## Variance: ΟΒ² = 750 dollarsΒ²
std_dev <- sqrt(variance)
cat("Standard deviation: Ο = β", variance, " = $", round(std_dev, 2), "\n\n")
## Standard deviation: Ο = β 750 = $ 27.39
cat("π‘ Interpretation: On average, ticket prices are about $",
round(std_dev, 2), " away from the mean of $", mu, "\n")
## π‘ Interpretation: On average, ticket prices are about $ 27.39 away from the mean of $ 110
# For the car ownership example
std_dev_cars <- sqrt(variance_freq)
cat("\nπ Car Ownership Standard Deviation:\n")
##
## π Car Ownership Standard Deviation:
cat("Ο = β", round(variance_freq, 3), " = ", round(std_dev_cars, 3), " cars\n")
## Ο = β 0.838 = 0.916 cars
For normally distributed data: -
68% of data falls within 1 standard deviation of the
mean - 95% of data falls within 2 standard
deviations
- 99.7% of data falls within 3 standard deviations
# Visualize the empirical rule
x <- seq(-4, 4, 0.01)
y <- dnorm(x)
plot(x, y, type="l", lwd=2, col="blue",
main="The Empirical Rule (68-95-99.7)",
xlab="Standard deviations from mean",
ylab="Probability density")
# Shade areas
polygon(c(-1, seq(-1, 1, 0.01), 1), c(0, dnorm(seq(-1, 1, 0.01)), 0),
col=rgb(0,0,1,0.3), border=NA)
polygon(c(-2, seq(-2, 2, 0.01), 2), c(0, dnorm(seq(-2, 2, 0.01)), 0),
col=rgb(0,1,0,0.2), border=NA)
polygon(c(-3, seq(-3, 3, 0.01), 3), c(0, dnorm(seq(-3, 3, 0.01)), 0),
col=rgb(1,0,0,0.1), border=NA)
# Add labels
text(0, 0.2, "68%", cex=1.2, font=2)
text(0, 0.05, "95%", cex=1.2, font=2)
text(0, 0.02, "99.7%", cex=1.2, font=2)
# Add vertical lines
abline(v=c(-3,-2,-1,0,1,2,3), lty=2, col="gray")
# Apply empirical rule to car prices
price_mean <- mean(cars$price_num)
price_sd <- sd(cars$price_num)
cat("π Empirical Rule Applied to Car Prices:\n")
## π Empirical Rule Applied to Car Prices:
cat("Mean price: $", round(price_mean, 0), "\n")
## Mean price: $ 30620
cat("Standard deviation: $", round(price_sd, 0), "\n\n")
## Standard deviation: $ 9531
cat("π Expected ranges:\n")
## π Expected ranges:
cat("68% of cars priced between: $",
round(price_mean - price_sd, 0), " - $",
round(price_mean + price_sd, 0), "\n")
## 68% of cars priced between: $ 21089 - $ 40151
cat("95% of cars priced between: $",
round(price_mean - 2*price_sd, 0), " - $",
round(price_mean + 2*price_sd, 0), "\n")
## 95% of cars priced between: $ 11557 - $ 49682
cat("99.7% of cars priced between: $",
round(price_mean - 3*price_sd, 0), " - $",
round(price_mean + 3*price_sd, 0), "\n")
## 99.7% of cars priced between: $ 2026 - $ 59214
Now letβs use the functions from your class scripts to calculate all these measures automatically:
# Using UBStats functions for car price analysis
cat("π§ Using UBStats Functions for Car Price Analysis:\n\n")
## π§ Using UBStats Functions for Car Price Analysis:
# All dispersion measures at once
cat("π All Variability Measures:\n")
## π All Variability Measures:
price_dispersion <- distr.summary.x(cars$price_num, stats="dispersion")
## n n.a range IQrange sd var cv
## 32 0 34554.64 13107.09 9531.25 90844686 0.31
print(price_dispersion)
## $`Measures of dispersion`
## n n.a range IQrange sd var cv
## 1 32 0 34554.64 13107.09 9531.248 90844686 0.3112753
cat("\nπ Complete Summary (Central Tendency + Variability):\n")
##
## π Complete Summary (Central Tendency + Variability):
price_summary <- distr.summary.x(cars$price_num, stats="summary")
## n n.a min q1 median mean q3 max sd var
## 32 0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.25 90844686
print(price_summary)
## $`Summary measures`
## n n.a min q1 median mean q3 max sd
## 1 32 0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.248
## var
## 1 90844686
The coefficient of variation (CV) lets us compare variability across different units:
\[CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%\]
cat("π Coefficient of Variation Analysis:\n")
## π Coefficient of Variation Analysis:
cat("Which car characteristic is most variable?\n\n")
## Which car characteristic is most variable?
# Calculate CV for different variables
price_cv <- (sd(cars$price_num) / mean(cars$price_num)) * 100
speed_cv <- (sd(cars$maxspeed) / mean(cars$maxspeed)) * 100
weight_cv <- (sd(cars$weight) / mean(cars$weight)) * 100
accel_cv <- (sd(cars$acceleration) / mean(cars$acceleration)) * 100
cat("Variable Mean SD CV\n")
## Variable Mean SD CV
cat("Price ($) ", sprintf("%8.0f", mean(cars$price_num)),
" ", sprintf("%8.0f", sd(cars$price_num)),
" ", sprintf("%5.1f%%", price_cv), "\n")
## Price ($) 30620 9531 31.1%
cat("Max Speed ", sprintf("%8.1f", mean(cars$maxspeed)),
" ", sprintf("%8.1f", sd(cars$maxspeed)),
" ", sprintf("%5.1f%%", speed_cv), "\n")
## Max Speed 420.6 169.1 40.2%
cat("Weight (kg) ", sprintf("%8.0f", mean(cars$weight)),
" ", sprintf("%8.0f", sd(cars$weight)),
" ", sprintf("%5.1f%%", weight_cv), "\n")
## Weight (kg) 1609 489 30.4%
cat("Acceleration ", sprintf("%8.2f", mean(cars$acceleration)),
" ", sprintf("%8.2f", sd(cars$acceleration)),
" ", sprintf("%5.1f%%", accel_cv), "\n\n")
## Acceleration 12.51 4.27 34.2%
# Interpretation
cat("π‘ Interpretation:\n")
## π‘ Interpretation:
cv_values <- c(Price = price_cv, Speed = speed_cv, Weight = weight_cv, Acceleration = accel_cv)
cv_sorted <- sort(cv_values, decreasing = TRUE)
for(i in 1:length(cv_sorted)) {
cat(i, ". ", names(cv_sorted)[i], ": ", sprintf("%.1f%%", cv_sorted[i]),
" (", ifelse(cv_sorted[i] > 30, "High", ifelse(cv_sorted[i] > 15, "Moderate", "Low")), " variability)\n")
}
## 1 . Speed : 40.2% ( High variability)
## 2 . Acceleration : 34.2% ( High variability)
## 3 . Price : 31.1% ( High variability)
## 4 . Weight : 30.4% ( High variability)
cat("\nMost variable:", names(cv_sorted)[1], "- indicates diverse market segments\n")
##
## Most variable: Speed - indicates diverse market segments
cat("Least variable:", names(cv_sorted)[4], "- suggests more standardized characteristics\n")
## Least variable: Weight - suggests more standardized characteristics
Letβs work through Example 4 from your variability PDF - the five variables comparison:
cat("π Example 4: Five Variables Analysis\n")
## π Example 4: Five Variables Analysis
cat("Goal: Show that variables can have same mean/median but different variability\n\n")
## Goal: Show that variables can have same mean/median but different variability
# Data from your PDF (N=7 for each variable)
var_A <- c(1, 1, 4, 4, 4, 7, 7)
var_B <- c(1, 2, 3, 4, 5, 6, 7)
var_C <- c(1, 2, 4, 4, 4, 6, 7)
var_D <- c(1, 4, 4, 4, 4, 4, 7)
var_E <- c(4, 4, 4, 4, 4, 4, 4)
variables <- list(A=var_A, B=var_B, C=var_C, D=var_D, E=var_E)
cat("π Data for each variable:\n")
## π Data for each variable:
for(i in 1:5) {
cat("Variable", names(variables)[i], ":",
paste(variables[[i]], collapse=", "), "\n")
}
## Variable A : 1, 1, 4, 4, 4, 7, 7
## Variable B : 1, 2, 3, 4, 5, 6, 7
## Variable C : 1, 2, 4, 4, 4, 6, 7
## Variable D : 1, 4, 4, 4, 4, 4, 7
## Variable E : 4, 4, 4, 4, 4, 4, 4
cat("\nπ Complete Analysis of All Five Variables:\n\n")
##
## π Complete Analysis of All Five Variables:
# Calculate all measures exactly as shown in your PDF
results <- data.frame(
Variable = names(variables),
Min = sapply(variables, min),
Q1 = sapply(variables, function(x) quantile(x, 0.25)),
Median = sapply(variables, median),
Mean = sapply(variables, mean),
Q3 = sapply(variables, function(x) quantile(x, 0.75)),
Max = sapply(variables, max),
Range = sapply(variables, function(x) max(x) - min(x)),
IQR = sapply(variables, IQR),
Variance = sapply(variables, var),
Std_Dev = sapply(variables, sd)
)
# Print only the numeric columns with proper rounding
results_numeric <- results[, -1] # Remove the Variable column for rounding
results_rounded <- round(results_numeric, 2)
results_final <- cbind(Variable = results$Variable, results_rounded)
print(results_final)
## Variable Min Q1 Median Mean Q3 Max Range IQR Variance Std_Dev
## A A 1 2.5 4 4 5.5 7 6 3 6.00 2.45
## B B 1 2.5 4 4 5.5 7 6 3 4.67 2.16
## C C 1 3.0 4 4 5.0 7 6 2 4.33 2.08
## D D 1 4.0 4 4 4.0 7 6 0 3.00 1.73
## E E 4 4.0 4 4 4.0 4 0 0 0.00 0.00
cat("\nπ Key Observations from Your PDF:\n")
##
## π Key Observations from Your PDF:
cat("- ALL variables have same mean (4) and median (4)\n")
## - ALL variables have same mean (4) and median (4)
cat("- But they have very different variability patterns!\n")
## - But they have very different variability patterns!
cat("- Variable E has zero variability (all values = 4)\n")
## - Variable E has zero variability (all values = 4)
cat("- Variable A has highest variability\n")
## - Variable A has highest variability
cat("- Only variance and standard deviation capture these differences!\n")
## - Only variance and standard deviation capture these differences!
# Visualize all five variables
par(mfrow=c(2,3))
for(i in 1:5) {
boxplot(variables[[i]],
main=paste("Variable", names(variables)[i]),
ylab="Values", col=rainbow(5)[i])
abline(h=4, col="red", lty=2) # Mean line
}
# Add histogram for comparison
hist(var_A, main="Variable A Distribution",
xlab="Values", col="lightblue", breaks=0:8)
par(mfrow=c(1,1))
Now letβs tackle Example 5 - the business scenario from your PDF:
cat("π Example 5: Car Manufacturer Price Analysis\n")
## π Example 5: Car Manufacturer Price Analysis
cat("Business Question: Are Japanese car prices too differentiated?\n\n")
## Business Question: Are Japanese car prices too differentiated?
# Question 1: Are car models more dispersed in terms of sales or price?
cat("π Question 1: Sales vs Price Variability\n")
## π Question 1: Sales vs Price Variability
sales_stats <- distr.summary.x(cars$sales, stats=c("mean","sd","cv"))
## n n.a mean sd cv
## 32 0 8012.88 81.21 0.01
price_stats <- distr.summary.x(cars$price_num, stats=c("mean","sd","cv"))
## n n.a mean sd cv
## 32 0 30619.99 9531.25 0.31
print(sales_stats)
## $`Requested statistics`
## n n.a mean sd cv
## 1 32 0 8012.875 81.21407 0.01013545
print(price_stats)
## $`Requested statistics`
## n n.a mean sd cv
## 1 32 0 30619.99 9531.248 0.3112753
cat("\nπ‘ Answer: Sales are more dispersed!\n")
##
## π‘ Answer: Sales are more dispersed!
cat("Sales CV =", round((sd(cars$sales)/mean(cars$sales))*100, 1), "%\n")
## Sales CV = 1 %
cat("Price CV =", round((sd(cars$price_num)/mean(cars$price_num))*100, 1), "%\n")
## Price CV = 31.1 %
cat("\nπ Question 2: Price Variability by Country\n")
##
## π Question 2: Price Variability by Country
# Compare price variability among different countries
price_by_country <- distr.summary.x(cars$price_num,
stats=c("mean","sd","cv"),
by1=cars$country)
## cars$country n n.a mean sd cv
## France 4 0 25927.03 6804.56 0.26
## Germany 6 0 29644.77 10216.58 0.34
## Italy 6 0 30332.72 11953.73 0.39
## Japan 12 0 31136.82 9580.72 0.31
## United States 4 0 35656.21 8799.23 0.25
print(price_by_country)
## $`Requested statistics`
## cars$country n n.a mean sd cv
## 1 France 4 0 25927.03 6804.561 0.2624504
## 2 Germany 6 0 29644.77 10216.579 0.3446334
## 3 Italy 6 0 30332.72 11953.735 0.3940871
## 4 Japan 12 0 31136.82 9580.721 0.3076975
## 5 United States 4 0 35656.21 8799.226 0.2467796
cat("\nπ― Manager's Concern Analysis:\n")
##
## π― Manager's Concern Analysis:
# Check if Japan and Germany exist in the data
available_countries <- unique(cars$country)
cat("Available countries:", paste(available_countries, collapse=", "), "\n")
## Available countries: Germany, Italy, Japan, United States, France
# Find Japan and Germany data more safely
if("Japan" %in% available_countries && "Germany" %in% available_countries) {
# Calculate CV manually for Japan and Germany
japan_cars <- cars[cars$country == "Japan", ]
germany_cars <- cars[cars$country == "Germany", ]
japan_cv <- (sd(japan_cars$price_num, na.rm=TRUE) / mean(japan_cars$price_num, na.rm=TRUE)) * 100
germany_cv <- (sd(germany_cars$price_num, na.rm=TRUE) / mean(germany_cars$price_num, na.rm=TRUE)) * 100
cat("Japanese cars CV:", round(japan_cv, 1), "%\n")
cat("German cars CV:", round(germany_cv, 1), "%\n")
if(japan_cv > germany_cv) {
cat("β οΈ Manager's concern is VALID - Japanese cars are more variable\n")
} else {
cat("β
Manager's concern is UNFOUNDED - Japanese cars are not more variable\n")
}
} else {
cat("Note: Japan or Germany not found in dataset\n")
cat("Comparing available countries instead:\n")
# Get CV for all countries
country_cvs <- c()
for(country in available_countries) {
country_cars <- cars[cars$country == country, ]
if(nrow(country_cars) > 1) {
cv <- (sd(country_cars$price_num, na.rm=TRUE) / mean(country_cars$price_num, na.rm=TRUE)) * 100
country_cvs[country] <- cv
cat(country, "CV:", round(cv, 1), "%\n")
}
}
# Find most and least variable
if(length(country_cvs) > 0) {
most_variable <- names(which.max(country_cvs))
least_variable <- names(which.min(country_cvs))
cat("Most variable country:", most_variable, "(", round(max(country_cvs), 1), "%)\n")
cat("Least variable country:", least_variable, "(", round(min(country_cvs), 1), "%)\n")
}
}
## Japanese cars CV: 30.8 %
## German cars CV: 34.5 %
## β
Manager's concern is UNFOUNDED - Japanese cars are not more variable
Now we move to Part 2 of todayβs lecture: How do we study relationships between TWO variables? This is called bivariate analysis.
Your PDF notes show three main cases:
cat("π Bivariate Analysis Overview:\n")
## π Bivariate Analysis Overview:
cat("We'll explore relationships between pairs of variables\n")
## We'll explore relationships between pairs of variables
cat("Goal: Understand how one variable relates to another\n\n")
## Goal: Understand how one variable relates to another
# Check our available variables
cat("Available variables in cars dataset:\n")
## Available variables in cars dataset:
str(cars[, c("country", "price_classes", "price_num", "maxspeed", "acceleration")])
## 'data.frame': 32 obs. of 5 variables:
## $ country : chr "Germany" "Italy" "Japan" "United States" ...
## $ price_classes: Factor w/ 3 levels "low","mid","high": 2 1 2 3 3 3 3 2 2 2 ...
## $ price_num : num 24634 19354 23888 39901 39129 ...
## $ maxspeed : num 314 345 270 314 470 ...
## $ acceleration : num 15.5 15.6 15.7 12.3 12.4 ...
When both variables are categorical, we use crosstabs (contingency tables).
Letβs recreate Case A from your bivariate analysis notes:
# Example from your bivariate PDF - Industry vs Solvency Rating
cat("π Case A: Industry vs Solvency Rating\n")
## π Case A: Industry vs Solvency Rating
cat("Question: Does company rating depend on industry?\n\n")
## Question: Does company rating depend on industry?
# Create the data from your PDF
industry <- c(rep("Manufacturing", 240), rep("Financial", 160))
solvency <- c(rep("Low", 36), rep("Average", 124), rep("High", 80), # Manufacturing
rep("Low", 12), rep("Average", 64), rep("High", 84)) # Financial
# Create crosstab
crosstab <- table(industry, solvency)
print(crosstab)
## solvency
## industry Average High Low
## Financial 64 84 12
## Manufacturing 124 80 36
cat("\nπ Step 1: Calculate conditional frequencies (Y|X)\n")
##
## π Step 1: Calculate conditional frequencies (Y|X)
# Calculate conditional frequencies by row
conditional_freq <- prop.table(crosstab, margin=1) * 100
print(round(conditional_freq, 1))
## solvency
## industry Average High Low
## Financial 40.0 52.5 7.5
## Manufacturing 51.7 33.3 15.0
# Step 2: Graphical presentation - Stacked bar chart
barplot(conditional_freq, beside=FALSE,
main="Solvency Rating by Industry (Conditional %)",
xlab="Solvency Rating", ylab="Percentage",
col=c("lightblue", "lightcoral"),
legend.text=rownames(conditional_freq))
cat("\nπ― Step 3: Conclusion\n")
##
## π― Step 3: Conclusion
cat("Manufacturing: Only 33.3% have high rating\n")
## Manufacturing: Only 33.3% have high rating
cat("Financial: 52.5% have high rating\n")
## Financial: 52.5% have high rating
cat("β The distributions are DIFFERENT, so rating depends on industry!\n")
## β The distributions are DIFFERENT, so rating depends on industry!
# Let's analyze country vs price class in our cars dataset
cat("π Analyzing Country vs Price Class:\n")
## π Analyzing Country vs Price Class:
# Create frequency table
country_price_table <- distr.table.xy(cars$country, cars$price_classes,
freq.type="y|x", freq="prop")
## y|x: Proportions
## cars$price_classes
## cars$country low mid high TOTAL
## France 0.25 0.50 0.25 1.00
## Germany 0.17 0.67 0.17 1.00
## Italy 0.33 0.17 0.50 1.00
## Japan 0.08 0.67 0.25 1.00
## United States 0.00 0.50 0.50 1.00
print(country_price_table)
## $`y|x: Proportions`
## low mid high TOTAL
## France 0.25000000 0.5000000 0.2500000 1
## Germany 0.16666667 0.6666667 0.1666667 1
## Italy 0.33333333 0.1666667 0.5000000 1
## Japan 0.08333333 0.6666667 0.2500000 1
## United States 0.00000000 0.5000000 0.5000000 1
# Create visualization
distr.plot.xy(cars$country, cars$price_classes,
freq.type="y|x", freq="prop",
plot.type="bars", bar.type="stacked")
When both variables are numerical, we use scatterplots and correlation.
The correlation coefficient (r) measures the strength of linear relationship:
\(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}\)
cat("π Correlation Analysis: Car Price vs Performance\n\n")
## π Correlation Analysis: Car Price vs Performance
# Calculate correlations
price_speed_cor <- cor(cars$price_num, cars$maxspeed)
price_accel_cor <- cor(cars$price_num, cars$acceleration)
cat("Price vs Max Speed correlation:", round(price_speed_cor, 3), "\n")
## Price vs Max Speed correlation: 0.794
cat("Price vs Acceleration correlation:", round(price_accel_cor, 3), "\n\n")
## Price vs Acceleration correlation: -0.827
# Interpretation
cat("π‘ Interpretation:\n")
## π‘ Interpretation:
if(price_speed_cor > 0.5) {
cat("- Strong positive correlation between price and speed\n")
} else if(price_speed_cor > 0.3) {
cat("- Moderate positive correlation between price and speed\n")
} else {
cat("- Weak correlation between price and speed\n")
}
## - Strong positive correlation between price and speed
if(price_accel_cor < -0.3) {
cat("- Negative correlation: expensive cars accelerate faster (lower seconds)\n")
}
## - Negative correlation: expensive cars accelerate faster (lower seconds)
# Create scatterplots
par(mfrow=c(1,2))
# Price vs Speed
plot(cars$price_num, cars$maxspeed,
main="Price vs Max Speed",
xlab="Price ($)", ylab="Max Speed (km/h)",
pch=19, col="blue", alpha=0.6)
## Warning in plot.window(...): "alpha" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "alpha" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "alpha" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "alpha" is not a
## graphical parameter
## Warning in box(...): "alpha" is not a graphical parameter
## Warning in title(...): "alpha" is not a graphical parameter
abline(lm(maxspeed ~ price_num, data=cars), col="red", lwd=2)
# Price vs Acceleration
plot(cars$price_num, cars$acceleration,
main="Price vs Acceleration",
xlab="Price ($)", ylab="Acceleration (0-100 km/h seconds)",
pch=19, col="green", alpha=0.6)
## Warning in plot.window(...): "alpha" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "alpha" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "alpha" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "alpha" is not a
## graphical parameter
## Warning in box(...): "alpha" is not a graphical parameter
## Warning in title(...): "alpha" is not a graphical parameter
abline(lm(acceleration ~ price_num, data=cars), col="red", lwd=2)
par(mfrow=c(1,1))
# Using UBStats functions
distr.plot.xy(cars$price_num, cars$maxspeed, plot.type="scatter")
cat("π Using UBStats for correlation analysis:\n")
## π Using UBStats for correlation analysis:
cat("Correlation between price and speed:",
round(cor(cars$price_num, cars$maxspeed), 3), "\n")
## Correlation between price and speed: 0.794
When we have one categorical and one numerical variable, we use conditional analysis.
cat("π Mixed Analysis: Country (Categorical) vs Price (Numerical)\n\n")
## π Mixed Analysis: Country (Categorical) vs Price (Numerical)
# Conditional summary statistics
price_by_country <- distr.summary.x(cars$price_num, by1=cars$country,
stats=c("central", "dispersion"))
## cars$country n n.a mode n.modes mode% median mean
## France 4 0 26792.50 4 0.2500 24430.20 25927.03
## Germany 6 0 24634.45 6 0.1667 30333.46 29644.77
## Italy 6 0 19354.12 6 0.1667 28582.25 30332.72
## Japan 12 0 23887.56 12 0.0833 27685.31 31136.82
## United States 4 0 39900.58 4 0.2500 34581.44 35656.21
## cars$country n n.a range IQrange sd var cv
## France 4 0 15413.19 7396.75 6804.56 46302047 0.26
## Germany 6 0 30362.11 8410.38 10216.58 104378488 0.34
## Italy 6 0 27358.69 18614.83 11953.73 142891774 0.39
## Japan 12 0 31365.75 9538.84 9580.72 91790206 0.31
## United States 4 0 18497.14 12603.00 8799.23 77426383 0.25
print(price_by_country)
## $`Central tendency measures`
## cars$country n n.a mode n.modes mode% median mean
## 1 France 4 0 26792.50 4 0.25000000 24430.20 25927.03
## 2 Germany 6 0 24634.45 6 0.16666667 30333.46 29644.77
## 3 Italy 6 0 19354.12 6 0.16666667 28582.25 30332.72
## 4 Japan 12 0 23887.56 12 0.08333333 27685.31 31136.82
## 5 United States 4 0 39900.58 4 0.25000000 34581.44 35656.21
##
## $`Measures of dispersion`
## cars$country n n.a range IQrange sd var cv
## 1 France 4 0 15413.19 7396.747 6804.561 46302047 0.2624504
## 2 Germany 6 0 30362.11 8410.382 10216.579 104378488 0.3446334
## 3 Italy 6 0 27358.69 18614.825 11953.735 142891774 0.3940871
## 4 Japan 12 0 31365.75 9538.835 9580.721 91790206 0.3076975
## 5 United States 4 0 18497.14 12602.998 8799.226 77426383 0.2467796
# Side-by-side boxplots
distr.plot.xy(cars$country, cars$price_num, plot.type="boxplot")
Letβs use the exact functions from your class R scripts:
cat("π§ Using Functions from Video 3 Script:\n\n")
## π§ Using Functions from Video 3 Script:
# Frequency tables for categorical variables
cat("π Frequency Table for Country:\n")
## π Frequency Table for Country:
country_table <- distr.table.x(x=cars$country, freq=c("counts","prop","perc"))
## cars$country Count Prop Percent
## France 4 0.12 12
## Germany 6 0.19 19
## Italy 6 0.19 19
## Japan 12 0.38 38
## United States 4 0.12 12
## TOTAL 32 1.00 100
print(country_table)
## cars$country Count Prop Percent
## 1 France 4 0.1250 12.50
## 2 Germany 6 0.1875 18.75
## 3 Italy 6 0.1875 18.75
## 4 Japan 12 0.3750 37.50
## 5 United States 4 0.1250 12.50
## Sum TOTAL 32 1.0000 100.00
# Numerical variables with many values - use breaks
cat("\nπ Price Distribution with Custom Breaks:\n")
##
## π Price Distribution with Custom Breaks:
price_breaks <- c(10000, 20000, 30000, 50000, 100000)
price_table <- distr.table.x(x=cars$price_num,
freq=c("counts","prop","dens"),
breaks=price_breaks)
## cars$price_num Count Prop Density
## [10000,20000) 5 0.16 2e-05
## [20000,30000) 14 0.44 4e-05
## [30000,50000) 13 0.41 2e-05
## [50000,100000] 0 0.00 0
## TOTAL 32 1.00
print(price_table)
## cars$price_num Count Prop Density
## 1 [10000,20000) 5 0.15625 1.56250e-05
## 2 [20000,30000) 14 0.43750 4.37500e-05
## 3 [30000,50000) 13 0.40625 2.03125e-05
## 4 [50000,100000] 0 0.00000 0.00000e+00
## Sum TOTAL 32 1.00000 NA
# Plots from Video 3 script
par(mfrow=c(1,2))
# Pie chart for categorical data
distr.plot.x(x=cars$country, freq="proportions", plot.type="pie")
# Histogram for numerical data
distr.plot.x(x=cars$price_num, plot.type="hist", breaks=8)
par(mfrow=c(1,1))
cat("π§ Using Functions from Video 4 Script:\n\n")
## π§ Using Functions from Video 4 Script:
# Summary measures for central tendency
cat("π Central Tendency Measures:\n")
## π Central Tendency Measures:
central_measures <- distr.summary.x(x=cars$price_num, stats="central")
## n n.a mode n.modes mode% median mean
## 32 0 24634.45 32 0.0312 27685.31 30619.99
print(central_measures)
## $`Central tendency measures`
## n n.a mode n.modes mode% median mean
## 1 32 0 24634.45 32 0.03125 27685.31 30619.99
# Quartiles and percentiles
cat("\nπ Quartiles and Percentiles:\n")
##
## π Quartiles and Percentiles:
quartile_measures <- distr.summary.x(x=cars$price_num, stats="quartiles")
## n n.a min p25 p50 p75 max
## 32 0 14123.39 24104.94 27685.31 37212.03 48678.02
print(quartile_measures)
## $Quartiles
## n n.a min p25 p50 p75 max
## 1 32 0 14123.39 24104.94 27685.31 37212.03 48678.02
# All dispersion measures
cat("\nπ All Dispersion Measures:\n")
##
## π All Dispersion Measures:
dispersion_measures <- distr.summary.x(x=cars$price_num, stats="dispersion")
## n n.a range IQrange sd var cv
## 32 0 34554.64 13107.09 9531.25 90844686 0.31
print(dispersion_measures)
## $`Measures of dispersion`
## n n.a range IQrange sd var cv
## 1 32 0 34554.64 13107.09 9531.248 90844686 0.3112753
# Complete summary
cat("\nπ Complete Summary:\n")
##
## π Complete Summary:
complete_summary <- distr.summary.x(x=cars$price_num, stats="summary")
## n n.a min q1 median mean q3 max sd var
## 32 0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.25 90844686
print(complete_summary)
## $`Summary measures`
## n n.a min q1 median mean q3 max sd
## 1 32 0 14123.39 24104.94 27685.31 30619.99 37212.03 48678.02 9531.248
## var
## 1 90844686
cat("π§ Using Functions from Video 5 Script:\n\n")
## π§ Using Functions from Video 5 Script:
# Joint frequency distributions
cat("π Joint Frequency Table:\n")
## π Joint Frequency Table:
joint_table <- distr.table.xy(x=cars$country, y=cars$price_classes,
freq.type="joint", freq="count")
## Joint counts
## cars$price_classes
## cars$country low mid high TOTAL
## France 1 2 1 4
## Germany 1 4 1 6
## Italy 2 1 3 6
## Japan 1 8 3 12
## United States 0 2 2 4
## TOTAL 5 17 10 32
print(joint_table)
## $`Joint counts`
## low mid high TOTAL
## France 1 2 1 4
## Germany 1 4 1 6
## Italy 2 1 3 6
## Japan 1 8 3 12
## United States 0 2 2 4
## TOTAL 5 17 10 32
# Conditional frequencies
cat("\nπ Conditional Frequencies (Price Class | Country):\n")
##
## π Conditional Frequencies (Price Class | Country):
conditional_table <- distr.table.xy(x=cars$country, y=cars$price_classes,
freq.type="y|x", freq="prop")
## y|x: Proportions
## cars$price_classes
## cars$country low mid high TOTAL
## France 0.25 0.50 0.25 1.00
## Germany 0.17 0.67 0.17 1.00
## Italy 0.33 0.17 0.50 1.00
## Japan 0.08 0.67 0.25 1.00
## United States 0.00 0.50 0.50 1.00
print(conditional_table)
## $`y|x: Proportions`
## low mid high TOTAL
## France 0.25000000 0.5000000 0.2500000 1
## Germany 0.16666667 0.6666667 0.1666667 1
## Italy 0.33333333 0.1666667 0.5000000 1
## Japan 0.08333333 0.6666667 0.2500000 1
## United States 0.00000000 0.5000000 0.5000000 1
# Bivariate plots from Video 5 script
par(mfrow=c(1,2))
# Side-by-side bar chart
distr.plot.xy(x=cars$country, y=cars$price_classes,
freq.type="y|x", freq="prop",
plot.type="bars", bar.type="beside")
# Scatter plot for two numerical variables
distr.plot.xy(x=cars$price_num, y=cars$maxspeed,
plot.type="scatter")
par(mfrow=c(1,1))
# Correlation analysis
cat("π Correlation Analysis:\n")
## π Correlation Analysis:
price_speed_cor <- cor(cars$price_num, cars$maxspeed)
cat("Price vs Speed correlation:", round(price_speed_cor, 3), "\n")
## Price vs Speed correlation: 0.794
cat("π Quality Control Application:\n")
## π Quality Control Application:
cat("A factory produces bolts that should be 10cm long\n\n")
## A factory produces bolts that should be 10cm long
# Simulate two production scenarios
set.seed(123) # For reproducible results
# Scenario A: Good quality control
bolts_A <- rnorm(100, mean=10.0, sd=0.1)
# Scenario B: Poor quality control
bolts_B <- rnorm(100, mean=10.0, sd=0.5)
cat("π Quality Control Comparison:\n")
## π Quality Control Comparison:
cat("Scenario A: Mean =", round(mean(bolts_A), 3),
"cm, SD =", round(sd(bolts_A), 3), "cm\n")
## Scenario A: Mean = 10.009 cm, SD = 0.091 cm
cat("Scenario B: Mean =", round(mean(bolts_B), 3),
"cm, SD =", round(sd(bolts_B), 3), "cm\n\n")
## Scenario B: Mean = 9.946 cm, SD = 0.483 cm
# Calculate coefficient of variation
cv_A <- (sd(bolts_A) / mean(bolts_A)) * 100
cv_B <- (sd(bolts_B) / mean(bolts_B)) * 100
cat("π‘ Quality Assessment:\n")
## π‘ Quality Assessment:
cat("Scenario A CV:", round(cv_A, 2), "% - EXCELLENT quality control\n")
## Scenario A CV: 0.91 % - EXCELLENT quality control
cat("Scenario B CV:", round(cv_B, 2), "% - POOR quality control\n")
## Scenario B CV: 4.86 % - POOR quality control
# Visualize quality control differences
par(mfrow=c(1,2))
hist(bolts_A, main="Scenario A: Good Quality Control",
xlab="Bolt Length (cm)", col="lightgreen",
xlim=c(8, 12), breaks=20)
abline(v=10, col="red", lwd=2, lty=2)
hist(bolts_B, main="Scenario B: Poor Quality Control",
xlab="Bolt Length (cm)", col="lightcoral",
xlim=c(8, 12), breaks=20)
abline(v=10, col="red", lwd=2, lty=2)
par(mfrow=c(1,1))
cat("π― Business Impact:\n")
## π― Business Impact:
cat("- Scenario A: Consistent, predictable production\n")
## - Scenario A: Consistent, predictable production
cat("- Scenario B: High variability = quality problems\n")
## - Scenario B: High variability = quality problems
cat("π° Investment Portfolio Risk Analysis:\n\n")
## π° Investment Portfolio Risk Analysis:
# Two investment portfolios with same mean return
portfolio_A <- c(8, 9, 10, 11, 12) # Conservative
portfolio_B <- c(0, 5, 10, 15, 20) # Risky
cat("Portfolio A returns:", paste(portfolio_A, "%", collapse=", "), "\n")
## Portfolio A returns: 8 %, 9 %, 10 %, 11 %, 12 %
cat("Portfolio B returns:", paste(portfolio_B, "%", collapse=", "), "\n\n")
## Portfolio B returns: 0 %, 5 %, 10 %, 15 %, 20 %
# Calculate risk measures
mean_A <- mean(portfolio_A)
mean_B <- mean(portfolio_B)
sd_A <- sd(portfolio_A)
sd_B <- sd(portfolio_B)
cat("π Risk Analysis:\n")
## π Risk Analysis:
cat("Portfolio A: Mean =", mean_A, "%, SD =", round(sd_A, 2), "%\n")
## Portfolio A: Mean = 10 %, SD = 1.58 %
cat("Portfolio B: Mean =", mean_B, "%, SD =", round(sd_B, 2), "%\n\n")
## Portfolio B: Mean = 10 %, SD = 7.91 %
cat("π‘ Investment Advice:\n")
## π‘ Investment Advice:
cat("- Both portfolios have same expected return (", mean_A, "%)\n")
## - Both portfolios have same expected return ( 10 %)
cat("- Portfolio A is much less risky (lower standard deviation)\n")
## - Portfolio A is much less risky (lower standard deviation)
cat("- Portfolio B has higher volatility = higher risk\n")
## - Portfolio B has higher volatility = higher risk
Your turn! Calculate variance manually for this small dataset:
Test scores: 85, 90, 78, 92, 88
cat("π― Exercise 1: Manual Variance Calculation\n")
## π― Exercise 1: Manual Variance Calculation
cat("Test scores: 85, 90, 78, 92, 88\n")
## Test scores: 85, 90, 78, 92, 88
cat("Calculate sample variance step by step!\n\n")
## Calculate sample variance step by step!
scores <- c(85, 90, 78, 92, 88)
# Step 1: Calculate mean
cat("Step 1: Calculate the mean\n")
## Step 1: Calculate the mean
cat("Your turn! What is (85 + 90 + 78 + 92 + 88) Γ· 5 = ?\n\n")
## Your turn! What is (85 + 90 + 78 + 92 + 88) Γ· 5 = ?
# Provide solution
mean_scores <- mean(scores)
cat("β
Solution: Mean =", mean_scores, "\n\n")
## β
Solution: Mean = 86.6
# Step 2: Calculate deviations
cat("Step 2: Calculate deviations from mean\n")
## Step 2: Calculate deviations from mean
deviations <- scores - mean_scores
for(i in 1:length(scores)) {
cat("Score", i, ":", scores[i], "- ", mean_scores, "=", deviations[i], "\n")
}
## Score 1 : 85 - 86.6 = -1.6
## Score 2 : 90 - 86.6 = 3.4
## Score 3 : 78 - 86.6 = -8.6
## Score 4 : 92 - 86.6 = 5.4
## Score 5 : 88 - 86.6 = 1.4
# Step 3: Square deviations and calculate variance
cat("\nStep 3: Square deviations and calculate sample variance\n")
##
## Step 3: Square deviations and calculate sample variance
squared_devs <- deviations^2
sample_var <- sum(squared_devs) / (length(scores) - 1)
cat("Sum of squared deviations:", sum(squared_devs), "\n")
## Sum of squared deviations: 119.2
cat("Sample variance: sΒ² =", round(sample_var, 2), "\n")
## Sample variance: sΒ² = 29.8
cat("Standard deviation: s =", round(sqrt(sample_var), 2), "\n")
## Standard deviation: s = 5.46
cat("π― Exercise 2: CV Comparison Challenge\n")
## π― Exercise 2: CV Comparison Challenge
cat("Which is more variable: car weights or car prices?\n\n")
## Which is more variable: car weights or car prices?
# Calculate CV for both
weight_cv <- (sd(cars$weight) / mean(cars$weight)) * 100
price_cv <- (sd(cars$price_num) / mean(cars$price_num)) * 100
cat("πͺ Weight Statistics:\n")
## πͺ Weight Statistics:
cat("Mean:", round(mean(cars$weight), 0), "kg\n")
## Mean: 1609 kg
cat("SD:", round(sd(cars$weight), 0), "kg\n")
## SD: 489 kg
cat("CV:", round(weight_cv, 1), "%\n\n")
## CV: 30.4 %
cat("π° Price Statistics:\n")
## π° Price Statistics:
cat("Mean: $", round(mean(cars$price_num), 0), "\n")
## Mean: $ 30620
cat("SD: $", round(sd(cars$price_num), 0), "\n")
## SD: $ 9531
cat("CV:", round(price_cv, 1), "%\n\n")
## CV: 31.1 %
cat("π Winner: ", ifelse(weight_cv > price_cv, "Weight", "Price"),
" is more variable (higher CV)\n")
## π Winner: Price is more variable (higher CV)
cat("π― Exercise 3: Correlation Detective\n")
## π― Exercise 3: Correlation Detective
cat("Interpret these correlation coefficients:\n\n")
## Interpret these correlation coefficients:
# Different correlation examples
correlations <- c(0.85, -0.92, 0.12, -0.45, 0.0)
variables <- c("Study hours vs Exam scores",
"Car age vs Market value",
"Height vs Happiness",
"Temperature vs Heating costs",
"Shoe size vs Intelligence")
for(i in 1:length(correlations)) {
cat("π", variables[i], ": r =", correlations[i], "\n")
# Interpretation
r <- abs(correlations[i])
direction <- ifelse(correlations[i] > 0, "positive",
ifelse(correlations[i] < 0, "negative", "no"))
strength <- ifelse(r > 0.8, "very strong",
ifelse(r > 0.6, "strong",
ifelse(r > 0.4, "moderate",
ifelse(r > 0.2, "weak", "very weak"))))
cat(" Interpretation:", strength, direction, "relationship\n\n")
}
## π Study hours vs Exam scores : r = 0.85
## Interpretation: very strong positive relationship
##
## π Car age vs Market value : r = -0.92
## Interpretation: very strong negative relationship
##
## π Height vs Happiness : r = 0.12
## Interpretation: very weak positive relationship
##
## π Temperature vs Heating costs : r = -0.45
## Interpretation: moderate negative relationship
##
## π Shoe size vs Intelligence : r = 0
## Interpretation: very weak no relationship
Hereβs your complete reference for all formulas covered today:
Range: \(\text{Range} = x_{\max} - x_{\min}\)
IQR: \(\text{IQR} = Q_3 - Q_1\)
Population Variance (Raw Data): \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2\)
Sample Variance (Raw Data): \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{n}{n-1}\left(\frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2\right)\)
Frequency Data Variance: \(\sigma^2 = \sum_{k=1}^{K}(x_k - \mu)^2 p_k = \sum_{k=1}^{K} x_k^2 p_k - \mu^2\)
Standard Deviation: \(s = \sqrt{s^2}\)
Coefficient of Variation: \(CV = \frac{s}{\bar{x}} \times 100\%\)
Correlation Coefficient: \(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}\)
cat("π Quick Reference Guide:\n\n")
## π Quick Reference Guide:
measures <- data.frame(
Measure = c("Range", "IQR", "Variance", "Std Dev", "CV"),
Best_Used_When = c(
"Quick estimate, no outliers",
"Data has outliers or skewed",
"Mathematical calculations",
"Describing spread in original units",
"Comparing different variables"
),
Sensitive_to_Outliers = c("Yes", "No", "Yes", "Yes", "Depends"),
stringsAsFactors = FALSE
)
print(measures)
## Measure Best_Used_When Sensitive_to_Outliers
## 1 Range Quick estimate, no outliers Yes
## 2 IQR Data has outliers or skewed No
## 3 Variance Mathematical calculations Yes
## 4 Std Dev Describing spread in original units Yes
## 5 CV Comparing different variables Depends
cat("π§ Essential R Functions Summary:\n\n")
## π§ Essential R Functions Summary:
cat("π Basic Functions:\n")
## π Basic Functions:
cat("range(x) # Max - Min\n")
## range(x) # Max - Min
cat("IQR(x) # Q3 - Q1\n")
## IQR(x) # Q3 - Q1
cat("var(x) # Sample variance\n")
## var(x) # Sample variance
cat("sd(x) # Sample standard deviation\n")
## sd(x) # Sample standard deviation
cat("cor(x, y) # Correlation coefficient\n\n")
## cor(x, y) # Correlation coefficient
cat("π UBStats Functions:\n")
## π UBStats Functions:
cat("distr.summary.x(x, stats='dispersion') # All variability measures\n")
## distr.summary.x(x, stats='dispersion') # All variability measures
cat("distr.table.xy(x, y, freq.type='joint') # Crosstabs\n")
## distr.table.xy(x, y, freq.type='joint') # Crosstabs
cat("distr.plot.xy(x, y, plot.type='scatter') # Scatterplots\n")
## distr.plot.xy(x, y, plot.type='scatter') # Scatterplots
cat("distr.plot.x(x, plot.type='boxplot') # Boxplots\n")
## distr.plot.x(x, plot.type='boxplot') # Boxplots
cat("πΌ Key Business Applications:\n\n")
## πΌ Key Business Applications:
cat("π Quality Control:\n")
## π Quality Control:
cat("- Use standard deviation to monitor production consistency\n")
## - Use standard deviation to monitor production consistency
cat("- Lower SD = better quality control\n")
## - Lower SD = better quality control
cat("- CV helps compare quality across different products\n\n")
## - CV helps compare quality across different products
cat("π° Investment Analysis:\n")
## π° Investment Analysis:
cat("- Standard deviation measures investment risk\n")
## - Standard deviation measures investment risk
cat("- Higher SD = higher risk and volatility\n")
## - Higher SD = higher risk and volatility
cat("- Use correlation to diversify portfolios\n\n")
## - Use correlation to diversify portfolios
cat("π Market Research:\n")
## π Market Research:
cat("- Use crosstabs to find customer segments\n")
## - Use crosstabs to find customer segments
cat("- Correlation identifies relationships between variables\n")
## - Correlation identifies relationships between variables
cat("- CV compares variability across different metrics\n")
## - CV compares variability across different metrics
In our next lecture, weβll explore:
cat("π― Recommended Practice:\n\n")
## π― Recommended Practice:
cat("1. π Manual Calculations:\n")
## 1. π Manual Calculations:
cat(" - Practice variance calculations by hand\n")
## - Practice variance calculations by hand
cat(" - Work through correlation examples\n")
## - Work through correlation examples
cat(" - Create your own crosstabs\n\n")
## - Create your own crosstabs
cat("2. π§ R Programming:\n")
## 2. π§ R Programming:
cat(" - Use the cars dataset for more analyses\n")
## - Use the cars dataset for more analyses
cat(" - Try different variable combinations\n")
## - Try different variable combinations
cat(" - Create your own visualizations\n\n")
## - Create your own visualizations
cat("3. πΌ Real Applications:\n")
## 3. πΌ Real Applications:
cat(" - Find your own datasets online\n")
## - Find your own datasets online
cat(" - Apply these concepts to your interests\n")
## - Apply these concepts to your interests
cat(" - Think about business problems you could solve\n")
## - Think about business problems you could solve
cat("π Congratulations! π\n\n")
## π Congratulations! π
cat("You've just mastered some of the most important concepts in statistics:\n")
## You've just mastered some of the most important concepts in statistics:
cat("β
How to measure variability in data\n")
## β
How to measure variability in data
cat("β
How to compare different types of variables\n")
## β
How to compare different types of variables
cat("β
How to study relationships between variables\n")
## β
How to study relationships between variables
cat("β
How to apply these concepts to real business problems\n\n")
## β
How to apply these concepts to real business problems
cat("π You're now equipped with powerful tools for:\n")
## π You're now equipped with powerful tools for:
cat("- Making data-driven business decisions\n")
## - Making data-driven business decisions
cat("- Understanding risk and uncertainty\n")
## - Understanding risk and uncertainty
cat("- Finding patterns and relationships in data\n")
## - Finding patterns and relationships in data
cat("- Communicating statistical findings effectively\n\n")
## - Communicating statistical findings effectively
cat("Keep practicing, stay curious, and remember:\n")
## Keep practicing, stay curious, and remember:
cat("Statistics is not just about numbers - it's about understanding the world!\n")
## Statistics is not just about numbers - it's about understanding the world!
π End of Lecture 3 - Great job making it through 120 minutes of intensive learning!
Remember to save your work and practice with the R code examples. See you next time for more statistical adventures!
# Session information for reproducibility
cat("π Session Information:\n")
## π Session Information:
sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Europe/Budapest
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] UBStats_0.2.2
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
## [5] cachem_1.1.0 knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.3 sass_0.4.9 jquerylib_0.1.4
## [13] compiler_4.4.2 rstudioapi_0.17.1 tools_4.4.2 evaluate_1.0.1
## [17] bslib_0.8.0 yaml_2.3.10 rlang_1.1.4 jsonlite_1.8.9