Welcome to the fascinating world of statistical data analysis! In this comprehensive lecture, we will explore how to analyze and visualize data using R, focusing on understanding the characteristics of individual variables (univariate analysis).
Think of statistics as a powerful lens that helps us see patterns hidden in data. Just as a microscope reveals the invisible world of cells, statistical analysis reveals the invisible patterns in numbers that tell compelling stories about our world.
By the end of this lecture, you will master:
Before we can analyze data, we must understand what type of data we’re working with. This is like a chef understanding ingredients before cooking - the technique depends entirely on what you’re working with.

    Statistical Variables
    ├── Qualitative (Categorical)
    │   ├── Nominal (No natural order)
    │   │   └── Examples: Color, Brand, Country
    │   └── Ordinal (Natural order exists)
    │       └── Examples: Grade (A,B,C), Size (S,M,L)
    └── Quantitative (Numerical)
        ├── Discrete (Countable)
        │   └── Examples: Number of children, Cars owned
        └── Continuous (Measurable)
            └── Examples: Height, Weight, Temperature

Let’s begin our journey by preparing our tools:
```{r setup}
# Load the UBStats package - our Swiss Army knife for statistical analysis
library(UBStats)

# Load the datasets file and inspect the structure of the cars data
load("stat_datasets_cl17.Rdata")
str(cars)
```
```{r data_exploration}
# Get a feel for our dataset
head(cars, 10)  # First 10 rows
dim(cars)       # Dimensions: rows and columns
names(cars)     # Variable names
```
Imagine you’re a business analyst for a major European car manufacturer. Your dataset contains information about 190 different car models, and each variable tells a different part of the story:
Our Variables Explained:

- `model`: Car model name (Qualitative Nominal)
- `sales`: Annual sales in units (Quantitative Continuous)
- `bestselling`: Best-selling status (1=Yes, 0=No) (Qualitative Nominal)
- `price_num`: Price in dollars (Quantitative Continuous)
- `price_classes`: Price category (low/mid/high) (Qualitative Ordinal)
- `maxspeed`: Maximum speed in km/h (Quantitative Continuous)
- `acceleration`: Time to reach 100 km/h (Quantitative Continuous)
- `urban_fuelcons`: Urban fuel consumption (Quantitative Continuous)
- `fueltank`: Fuel tank capacity in liters (Quantitative Continuous)
- `weight`: Weight in kilograms (Quantitative Continuous)
- `n_doors_min`: Number of doors (Quantitative Discrete)
- `country`: Manufacturing country (Qualitative Nominal)

Let’s start our analysis by exploring where these cars are manufactured. This is our first detective work - uncovering geographical patterns in car production.
```{r country_analysis}
# Create a frequency distribution table
country_freq <- distr.table.x(cars$country)
print(country_freq)
```

💡 What This Table Tells Us:

- Count: How many car models each country produces
- Prop: The proportion (decimal form) of total production
- Each row: Represents a manufacturing region’s contribution
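To make the Count and Prop columns concrete, here is a minimal base-R sketch of the same idea that does not depend on UBStats. The `country` vector and the `freq_table` name are made up for illustration. They are not the `cars` data or the output format of `distr.table.x()`.

```r
# Hypothetical country labels (illustration only, not the cars data)
country <- c("Germany", "Italy", "Germany", "France", "Japan",
             "Germany", "Japan", "Italy", "Germany", "USA")

counts <- table(country)          # absolute frequencies
props  <- prop.table(counts)      # relative frequencies (sum to 1)
percs  <- round(100 * props, 1)   # the same shares as percentages

freq_table <- data.frame(
  country = names(counts),
  count   = as.vector(counts),
  prop    = as.vector(props),
  perc    = as.vector(percs)
)
print(freq_table)
```

Each proportion is simply the count divided by the total number of observations, which is why the Prop column always sums to 1.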
```{r country_detailed}
# Add percentage for easier interpretation
country_detailed <- distr.table.x(cars$country, freq = c("counts", "prop", "perc"))
print(country_detailed)
```

```{r country_pie}
# Pie chart - showing the "whole pie" of car production
distr.plot.x(cars$country, plot.type = "pie")
```

When to Use Pie Charts:

- When showing parts of a whole
- When proportions are the main message
- Maximum 5-7 categories for clarity

```{r country_bars}
# Bar chart - better for comparing specific values
distr.plot.x(cars$country, plot.type = "bars")
```

When to Use Bar Charts:

- When comparing frequencies between categories
- When you have many categories
- When exact values matter more than proportions
```{r european_analysis}
# Calculate percentage of European car production
# European countries: Europe - others, France, Germany, Italy
european_countries <- c("Europe - others", "France", "Germany", "Italy")

# Percentages read from the frequency table, in the order listed above
european_percentage <- 14 + 15 + 26 + 11
cat("🇪🇺 European Car Models:", european_percentage, "%\n")
cat("🥇 Top Manufacturing Country: Germany (26%)\n")
cat("🌏 Non-European Production:", 100 - european_percentage, "%\n")
```
## 3.2 Ordinal Variables: The Price Class Hierarchy
Now let's examine price categories - here, order matters! Low < Mid < High.
```{r price_classes_setup}
# Convert to factor with correct logical order
cars$price_classes <- factor(cars$price_classes,
                             levels = c("low", "mid", "high"))
# Verify the ordering worked
levels(cars$price_classes)
```
```{r price_classes_analysis}
# Frequency distribution
price_class_freq <- distr.table.x(cars$price_classes)
print(price_class_freq)

# Bar chart of the ordered price classes
distr.plot.x(cars$price_classes, plot.type = "bars")
```
### 📊 Key Business Question: Market Accessibility
```{r price_accessibility}
# What percentage of cars are priced at or below 'mid' class?
# This tells us about market accessibility
affordable_percentage <- 27 + 55 # low + mid
cat("💰 Affordable Cars (Low + Mid price):", affordable_percentage, "%\n")
cat("💎 Luxury Cars (High price):", 100 - affordable_percentage, "%\n")
```

Business Interpretation:

- 82% of car models are priced in low or mid categories
- Only 18% are in the luxury segment
- This suggests a market focus on accessibility rather than exclusivity
Discrete variables are like counting objects - you can have 2 doors or 3 doors, but not 2.5 doors.
```{r doors_analysis}
# Frequency distribution for number of doors
doors_freq <- distr.table.x(cars$n_doors_min)
print(doors_freq)
```

```{r doors_spike}
# Spike plot - perfect for discrete data
distr.plot.x(cars$n_doors_min, plot.type = "spike", freq = "prop")
```

Why Spike Plots for Discrete Data?

- Each spike represents an exact value
- Height shows frequency/proportion
- No artificial binning needed
- Clear visual separation between values

```{r doors_cumulative}
# Cumulative frequency plot
distr.plot.x(cars$n_doors_min, plot.type = "cumulative", freq = "prop")
```

Reading Cumulative Plots:

- X-axis: Number of doors
- Y-axis: Cumulative proportion (running total)
- Shows “what percentage have X doors or fewer”
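The running total on the Y-axis can be reproduced by hand. The sketch below uses base R and a made-up `doors` vector (an assumption for illustration, not the `cars` data) to show how cumulative proportions answer “X doors or fewer” questions, and how the complement answers “more than X”.

```r
# Hypothetical door counts (illustration only, not the cars data)
doors <- c(2, 3, 3, 4, 4, 4, 5, 5, 5, 5)

props     <- prop.table(table(doors))  # proportion for each distinct value
cum_props <- cumsum(props)             # running total of the proportions
print(round(cum_props, 2))

# Share of models with 4 doors or fewer
share_le_4 <- unname(cum_props["4"])

# Share with more than 4 doors is the complement
share_gt_4 <- 1 - share_le_4
cat("4 doors or fewer:", share_le_4, " more than 4 doors:", share_gt_4, "\n")
```

The last cumulative value is always 1, because every observation is at or below the largest value.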
```{r doors_business_question}
# How many car models have more than 4 doors?
# From our frequency table: 5-door (134) + 7-door (1) = 135
cars_more_than_4_doors <- 134 + 1
cat("🚪 Car models with more than 4 doors:", cars_more_than_4_doors, "models\n")
cat("📊 This represents:", round(135/190*100, 1), "% of all models\n")
```
Continuous data requires binning - grouping similar values together to reveal patterns. This is like organizing a library: individual books (data points) are grouped into sections (bins) to see the overall collection structure.
```{r fuel_tank_histogram}
# Create histogram with 5 equal-width classes
distr.plot.x(cars$fueltank, plot.type = "hist", breaks = 5)
```

```{r fuel_tank_table}
# Corresponding frequency table
fuel_table <- distr.table.x(cars$fueltank, breaks = 5)
print(fuel_table)
```

Interval Notation Explained:

- [24.9,40): Includes 24.9, excludes 40
- [40,55): Includes 40, excludes 55
- [85,100]: Includes both 85 and 100 (last interval)

Distribution Characteristics:

- Peak: Most cars (43%) have fuel tanks between 55-70 liters
- Shape: Slightly right-skewed (tail extends right)
- Range: From ~25 liters to 100 liters
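The bracket notation can be checked with base R’s `cut()`, which assigns each value to an interval. This is only a sketch: the `tank` values are made up to sit exactly on and next to the boundaries, and it is an assumption that `distr.table.x()` bins values the same way.

```r
# Hypothetical fuel tank sizes in liters, chosen to hit the boundaries
tank <- c(24.9, 40, 54.9, 55, 70, 84.9, 85, 100)

# Break points matching the intervals discussed above
breaks <- c(24.9, 40, 55, 70, 85, 100)

# right = FALSE gives left-closed intervals [a, b);
# include.lowest = TRUE closes the last interval on the right: [85, 100]
bins <- cut(tank, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(data.frame(tank, bins))
print(table(bins))
```

A value of exactly 40 lands in [40,55), not in [24.9,40), while 100 is still counted because the last interval is closed on both sides.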
```{r fuel_tank_business}
# What percentage of cars have fuel tanks between 40 and 85 liters?
# [40,55) + [55,70) + [70,85) = 32% + 43% + 13% = 88%
fuel_40_85_percent <- 32 + 43 + 13
cat("⛽ Cars with fuel tank 40-85 liters:", fuel_40_85_percent, "%\n")
cat("💡 This covers the vast majority of the market!\n")
```
Sometimes equal-width bins don’t tell the full story. Let’s see why:
```{r sales_equal_bins}
# Sales distribution with 8 equal-width classes
distr.plot.x(cars$sales, plot.type = "hist", breaks = 8)

# Corresponding frequency table
distr.table.x(cars$sales, breaks = 8)
```
**❗ The Problem with Equal-Width Bins:**
- 90% of cars fall in the first bin [44.2, 19400)
- Remaining bins have very few observations
- We lose detail about the majority of data
- The visualization is not informative
### 💡 The Solution: Custom Bin Widths
```{r sales_custom_bins}
# Use custom breaks that make business sense
custom_breaks <- c(0, 2000, 5000, 20000, 160000)
distr.plot.x(cars$sales, plot.type = "hist", breaks = custom_breaks)
# Much more informative frequency table
sales_custom <- distr.table.x(cars$sales, breaks = custom_breaks,
                              freq = c("count", "prop", "dens"))
print(sales_custom)
```
🎯 Business Insights from Custom Bins:

- Low volume [0-2000): 28% of models
- Medium volume [2000-5000): 30% of models
- High volume [5000-20000): 32% of models
- Very high volume [20000+): 9% of models
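When bins have different widths, comparing raw percentages can mislead, because a wide bin collects more observations simply by being wide. Dividing each proportion by its bin width gives a density, which is the idea behind the `dens` column requested above. The sketch below applies that arithmetic to the rounded percentages quoted in the list. The exact values printed by `distr.table.x()` may differ slightly because of rounding.

```r
# Bin edges and rounded percentages quoted in the text above
breaks <- c(0, 2000, 5000, 20000, 160000)
perc   <- c(28, 30, 32, 9)

width   <- diff(breaks)            # bin widths: 2000, 3000, 15000, 140000
density <- (perc / 100) / width    # proportion per unit of sales

print(data.frame(
  lower = head(breaks, -1),
  upper = tail(breaks, -1),
  perc, width, density
))

# Area of each bar (width x density) recovers the bin's proportion
stopifnot(isTRUE(all.equal(width * density, perc / 100)))
```

Although the [5000-20000) bin holds the largest share of models, its density is far lower than that of the first bin, so models are packed most tightly at low sales volumes.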
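```{r sales_interpolation}
# Approximate percentage of cars with sales between 1000 and 3000 units
# This requires linear interpolation within bins:
#   [0, 2000): 28% * 1000/2000 = 14%;  [2000, 5000): 30% * 1000/3000 = 10%
est_1000_3000 <- 28 * 1000 / 2000 + 30 * 1000 / 3000
cat("📈 Estimated cars with sales 1000-3000 units: ~", est_1000_3000, "%\n", sep = "")
cat("🔍 This is an approximation using linear interpolation\n")
```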
---
# Chapter 6: Understanding Distribution Shapes - The Three Personalities
Every dataset has a personality revealed through its shape. Learning to read these personalities is crucial for proper analysis.
## 6.1 The Three Distribution Personalities
### 🎯 Symmetric Distribution: The Balanced Personality

        /\
       /  \
      /    \
     /      \

- Data balanced around center
- Mean ≈ Median ≈ Mode
- Bell-shaped appearance
- Most observations near center
### ➡️ Right-Skewed (Positively Skewed): The Long Right Tail

      /\
     /  \_
    /     \____

- Tail extends to the right
- Mean > Median > Mode
- Most values concentrated on left
- Few extreme high values
### ⬅️ Left-Skewed (Negatively Skewed): The Long Left Tail

            /\
        ___/  \
    ___/       \

- Tail extends to the left
- Mode > Median > Mean
- Most values concentrated on right
- Few extreme low values
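The mean and median relationships listed for the three shapes can be checked on tiny samples. The vectors below are made up for illustration, and the rule itself is a useful rule of thumb, not a guarantee for every dataset.

```r
# Made-up samples chosen to show each shape (illustration only)
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)
right_skewed <- c(1, 2, 2, 3, 3, 4, 20)       # one extreme high value
left_skewed  <- c(1, 17, 18, 18, 19, 19, 20)  # one extreme low value

# Helper returning the two measures side by side
compare <- function(x) c(mean = mean(x), median = median(x))

shapes <- rbind(
  symmetric    = compare(symmetric),
  right_skewed = compare(right_skewed),
  left_skewed  = compare(left_skewed)
)
print(round(shapes, 2))
```

A single extreme value drags the mean toward the tail while the median barely moves, which is why the gap between the two hints at the direction of skew.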
## 6.2 Real Examples from Our Data
```{r distribution_examples}
# Example 1: Fuel tank (close to symmetric, with a slight right skew)
distr.plot.x(cars$fueltank, plot.type = "hist", breaks = 6)
# Example 2: Sales (heavily right-skewed)
distr.plot.x(cars$sales, plot.type = "hist", breaks = custom_breaks)
# Example 3: Acceleration (slightly right-skewed)
distr.plot.x(cars$acceleration, plot.type = "hist", breaks = 6)
```
Cumulative distributions answer “what percentage of observations are at or below a certain value?”
```{r price_ogive}
# Create ogive (cumulative frequency curve) for price
distr.plot.x(cars$price_num, plot.type = "cumulative", breaks = 10, freq = "prop")
```

Question: “Is the minimum price of the top 20% most expensive car models greater than $40,000?”

How to Read the Ogive:

1. Top 20% most expensive = 80th percentile
2. Find 0.8 on Y-axis (80% cumulative)
3. Draw horizontal line to curve
4. Drop vertical line to X-axis
5. Read the price value
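The graphical steps above amount to linear interpolation between the plotted points of the ogive. The sketch below does the same reading numerically with base R’s `approx()`. The class boundaries and cumulative proportions are hypothetical values chosen for illustration. They are not read from the `cars` data.

```r
# Hypothetical class boundaries (dollars) and the cumulative proportion
# reached at each boundary (illustration only)
price_breaks <- c(10000, 20000, 30000, 40000, 50000, 60000)
cum_prop     <- c(0.00, 0.40, 0.65, 0.78, 0.92, 1.00)

# Find 0.8 on the Y-axis, move across to the curve, read off the X value.
# approx() interpolates linearly between neighbouring points.
p80 <- approx(x = cum_prop, y = price_breaks, xout = 0.8)$y
cat("Estimated 80th percentile: $", round(p80), "\n")
```

With these made-up numbers, 0.8 falls between the boundaries at $40,000 and $50,000, so the estimate lies a little above $40,000.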
```{r ogive_interpretation}
cat("🔍 Ogive Reading Exercise:\n")
cat("📊 At 80% cumulative frequency (80th percentile):\n")
cat("💰 Price is approximately $40,000\n")
cat("✅ Therefore, the statement is approximately CORRECT\n")
cat("📈 The minimum price for top 20% expensive cars ≈ $40,000\n")
```
Numbers tell stories too. Let’s learn to calculate and interpret the key statistics that describe our data.
```{r central_tendency}
# Calculate central tendency measures for price
price_central <- distr.summary.x(cars$price_num, stats = "central")
print(price_central)
```

Interpreting the Results:

- Mean: $24,837 (arithmetic average - sum ÷ count)
- Median: $19,714 (middle value when sorted)
- Mode: $16,951 (most frequent value, appears 14 times)

```{r skewness_analysis}
# Analyze the relationship between mean and median
mean_price <- 24837.48
median_price <- 19713.5

cat("📊 Distribution Shape Analysis:\n")
cat("💰 Mean Price: $", round(mean_price, 2), "\n")
cat("🎯 Median Price: $", round(median_price, 2), "\n")
cat("📈 Mean - Median = $", round(mean_price - median_price, 2), "\n")
cat("🔍 Since Mean > Median: RIGHT-SKEWED distribution\n")
cat("💡 A few very expensive cars pull the average up!\n")
```
## 8.2 Percentiles and Quartiles: Dividing the Data
Quartiles divide data into four equal parts, like cutting a pizza into quarters.
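As a small illustration of that pizza-cutting idea, the sketch below computes quartiles with base R’s `quantile()` and then assigns each value to its quarter. The `prices` vector is made up. Note also that software packages use slightly different quantile rules, so it is an assumption that `distr.summary.x()` would return exactly the same cut points.

```r
# Hypothetical prices in dollars (illustration only, not the cars data)
prices <- c(12000, 14000, 15000, 17000, 19000, 21000, 24000, 30000, 42000)

# Quartiles: the 25th, 50th and 75th percentiles
q <- quantile(prices, probs = c(0.25, 0.50, 0.75))
print(q)

# Assign each price to its quarter of the distribution
quarter <- cut(prices,
               breaks = c(-Inf, q, Inf),
               labels = c("Q1 (bottom 25%)", "Q2", "Q3", "Q4 (top 25%)"))
print(table(quarter))
```

With only nine values the four groups cannot be exactly equal in size, but each holds roughly a quarter of the observations.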
```{r quartiles_analysis}
# Get quartiles for price analysis
price_quartiles <- distr.summary.x(cars$price_num, stats = "quartiles")
print(price_quartiles)
```

```{r quartile_interpretation}
cat("🎯 Price Market Segmentation:\n")
cat("💎 Luxury Segment (Top 25%): Above $", round(price_quartiles$p75, 0), "\n")
cat("🔶 Premium Segment (50-75%): $", round(price_quartiles$p50, 0), " - $", round(price_quartiles$p75, 0), "\n")
cat("🔸 Mid-market (25-50%): $", round(price_quartiles$p25, 0), " - $", round(price_quartiles$p50, 0), "\n")
cat("💚 Budget Segment (Bottom 25%): Below $", round(price_quartiles$p25, 0), "\n")
```
```{r performance_percentiles}
# 90th percentiles of maximum speed and acceleration time
speed_p90 <- distr.summary.x(cars$maxspeed, stats = "p90")
accel_p90 <- distr.summary.x(cars$acceleration, stats = "p90")

cat("🏎️ PERFORMANCE THRESHOLDS (90th PERCENTILES):\n")
cat("⚡ Minimum speed for top 10%:", speed_p90$p90, "km/h\n")
# Acceleration is a time, so lower is better: the 90th percentile marks
# the slowest 10%, not the fastest
cat("🚀 Acceleration time reached by 90% of models:", accel_p90$p90, "seconds or less\n")
```
## 8.3 The Five-Number Summary: Complete Picture
```{r five_number_summary}
# Five-number summary for acceleration
accel_summary <- distr.summary.x(cars$acceleration, stats = "fivenumber")
print(accel_summary)
cat("\n🏁 ACCELERATION PERFORMANCE BREAKDOWN:\n")
cat("🥇 Fastest car: ", accel_summary$min, " seconds (0-100 km/h)\n")
cat("📊 Q1 (25th percentile): ", accel_summary$q1, " seconds\n")
cat("🎯 Median (50th percentile): ", accel_summary$median, " seconds\n")
cat("📊 Q3 (75th percentile): ", accel_summary$q3, " seconds\n")
cat("🐌 Slowest car: ", accel_summary$max, " seconds\n")
```
Boxplots pack an incredible amount of information into a simple graphic. They’re like a data summary in visual form.
```{r boxplot_maxspeed}
# Create boxplot for maximum speed
distr.plot.x(cars$maxspeed, plot.type = "boxplot")
```

    outlier     •
                |
    whisker     |---- Maximum within 1.5×IQR of Q3
                |
    Q3       ┌────┐
             │    │  ← IQR (Interquartile Range)
    Median   ├────┤  ← Dark line inside box
             │    │
    Q1       └────┘
                |
    whisker     |---- Minimum within 1.5×IQR of Q1
                |
    outlier     •

```{r outlier_analysis}
# Check for outliers in maximum speed
cat("🚨 OUTLIER ANALYSIS FOR MAXIMUM SPEED:\n")
cat("📊 Any points beyond the whiskers are outliers\n")
cat("⚡ These represent cars with unusually high speeds\n")
cat("🏎️ Could be supercars or sports cars\n")
cat("🔍 Outliers require special attention in analysis\n")
```
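The 1.5×IQR rule behind the whiskers can also be applied by hand to list the outliers. The sketch below uses base R on a made-up `speed` vector (an assumption for illustration). Quartile conventions differ slightly between implementations, so the fences drawn by a particular boxplot function may not match these exactly.

```r
# Hypothetical maximum speeds in km/h (illustration only)
speed <- c(150, 160, 165, 170, 175, 180, 185, 190, 200, 210, 320)

q1  <- quantile(speed, 0.25, names = FALSE)
q3  <- quantile(speed, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: observations beyond these limits are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- speed[speed < lower_fence | speed > upper_fence]
cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Outliers:", outliers, "\n")
```

Only the 320 km/h value lies beyond the upper fence here, so it is the one point that would be drawn separately above the whisker.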
```{r multiple_boxplots}
# Compare different variables
par(mfrow = c(2, 2)) # 2x2 grid of plots
distr.plot.x(cars$acceleration, plot.type = "boxplot", main = "Acceleration")
distr.plot.x(cars$price_num, plot.type = "boxplot", main = "Price")
distr.plot.x(cars$maxspeed, plot.type = "boxplot", main = "Max Speed")
distr.plot.x(cars$weight, plot.type = "boxplot", main = "Weight")
par(mfrow = c(1, 1)) # Reset to single plot
```

Symmetric Distribution:

- Median line centered in box
- Equal whisker lengths
- No skewness visible

Right-Skewed Distribution:

- Median closer to Q1 (left side of box)
- Right whisker longer than left
- Outliers on right side

Left-Skewed Distribution:

- Median closer to Q3 (right side of box)
- Left whisker longer than right
- Outliers on left side
Let’s put everything together in a comprehensive business analysis.
```{r executive_summary}
cat("🚗 AUTOMOTIVE MARKET ANALYSIS REPORT\n")
cat("=" , rep("=", 45), "\n", sep="")

cat("📊 Dataset:", nrow(cars), "car models analyzed\n\n")

cat("🌍 GEOGRAPHIC DISTRIBUTION:\n")
cat("🇩🇪 Germany leads with 26% of models\n")
cat("🇪🇺 European brands dominate: 66% of models\n")
cat("🌏 Asia (Japan + others): 25% of models\n")
cat("🇺🇸 US brands: 9% of models\n\n")

price_stats <- distr.summary.x(cars$price_num, stats = "central")
cat("💰 PRICE ANALYSIS:\n")
cat("📈 Average price: $", round(price_stats$mean, 0), "\n")
cat("🎯 Median price: $", round(price_stats$median, 0), "\n")
cat("📊 Distribution: Right-skewed (luxury cars drive average up)\n")
cat("💚 Budget-friendly focus: 82% priced low-to-mid range\n\n")

cat("🏎️ PERFORMANCE INSIGHTS:\n")
cat("⚡ Top 10% speed threshold: 226 km/h\n")
cat("🚀 90% of models reach 100 km/h in 15.02 seconds or less\n")
cat("🚨 Speed outliers detected (supercars)\n")
```
## 10.2 Detailed Statistical Profile
```{r detailed_profile}
# Create comprehensive statistical summary
variables_to_analyze <- c("price_num", "maxspeed", "acceleration", "weight", "fueltank")
cat("\n📋 DETAILED STATISTICAL PROFILES:\n")
cat("=" , rep("=", 50), "\n", sep="")
for(var in variables_to_analyze) {
  cat("\n📊", toupper(gsub("_", " ", var)), ":\n")
  summary_stats <- distr.summary.x(cars[[var]], stats="fivenumber")
  central_stats <- distr.summary.x(cars[[var]], stats="central")
  cat(" Range: ", summary_stats$min, " to ", summary_stats$max, "\n")
  cat(" Mean: ", round(central_stats$mean, 2), "\n")
  cat(" Median: ", round(central_stats$median, 2), "\n")
  cat(" Q1-Q3: ", summary_stats$q1, " to ", summary_stats$q3, "\n")
  # Determine skewness
  if(central_stats$mean > central_stats$median) {
    cat(" Shape: Right-skewed\n")
  } else if(central_stats$mean < central_stats$median) {
    cat(" Shape: Left-skewed\n")
  } else {
    cat(" Shape: Approximately symmetric\n")
  }
}
```
```{r exercise_1}
cat("🔍 EXERCISE 1: DISTRIBUTION DETECTIVE\n")
cat("Analyze the fuel consumption distribution:\n\n")

distr.plot.x(cars$urban_fuelcons, plot.type = "hist", breaks = 6)

fuel_stats <- distr.summary.x(cars$urban_fuelcons, stats = "central")
cat("Mean fuel consumption: ", round(fuel_stats$mean, 2), " L/100km\n")
cat("Median fuel consumption: ", round(fuel_stats$median, 2), " L/100km\n")

if(fuel_stats$mean > fuel_stats$median) {
  cat("✅ ANSWER: Right-skewed distribution\n")
  cat("💡 Interpretation: Most cars are fuel-efficient, but some gas-guzzlers pull the average up\n")
}
```
### 🎯 Exercise 2: Percentile Mastery
```{r exercise_2}
cat("\n🎯 EXERCISE 2: PERCENTILE MASTERY\n")
cat("Find the weight thresholds for different car categories:\n\n")
weight_summary <- distr.summary.x(cars$weight, stats="quartiles")
print(weight_summary)
cat("\n🚗 CAR WEIGHT CATEGORIES:\n")
cat("🪶 Lightweight (bottom 25%): Under ", weight_summary$p25, " kg\n")
cat("⚖️ Standard weight (25-75%): ", weight_summary$p25, "-", weight_summary$p75, " kg\n")
cat("🏋️ Heavyweight (top 25%): Over ", weight_summary$p75, " kg\n")
```
```{r exercise_3}
cat("🚨 EXERCISE 3: OUTLIER INVESTIGATION\n")

distr.plot.x(cars$acceleration, plot.type = "boxplot")

cat("Investigation questions:\n")
cat("1. Are there any acceleration outliers?\n")
cat("2. What might cause extremely slow acceleration?\n")
cat("3. How would outliers affect the mean vs median?\n")

accel_stats <- distr.summary.x(cars$acceleration, stats = "fivenumber")
cat("\n📊 Acceleration Analysis:\n")
cat("🚀 Fastest: ", accel_stats$min, " seconds\n")
cat("🐌 Slowest: ", accel_stats$max, " seconds\n")
cat("📈 Range: ", accel_stats$max - accel_stats$min, " seconds\n")
```
## 11.2 Real-World Application Scenarios
### 🏢 Scenario 1: Product Development Strategy
```{r scenario_1}
cat("🏢 BUSINESS SCENARIO 1: PRODUCT DEVELOPMENT\n")
cat("=" , rep("=", 50), "\n", sep="")
cat("You're developing a new car model. Use data to inform decisions:\n\n")
# Market positioning analysis
doors_popular <- distr.table.x(cars$n_doors_min)
cat("🚪 DOOR CONFIGURATION STRATEGY:\n")
print(doors_popular)
cat("💡 Recommendation: Focus on 5-door models (71% market preference)\n\n")
# Price positioning
price_classes_dist <- distr.table.x(cars$price_classes)
cat("💰 PRICE POSITIONING STRATEGY:\n")
print(price_classes_dist)
cat("💡 Recommendation: Target mid-price segment (55% of market)\n")
```
```{r scenario_2}
cat("🎯 BUSINESS SCENARIO 2: COMPETITIVE ANALYSIS\n")
cat("=" , rep("=", 50), "\n", sep="")

speed_benchmark <- distr.summary.x(cars$maxspeed, stats = "quartiles")
cat("⚡ SPEED BENCHMARKING:\n")
cat("🥉 Entry level (25th percentile): ", speed_benchmark$p25, " km/h\n")
cat("🥈 Competitive (50th percentile): ", speed_benchmark$p50, " km/h\n")
cat("🥇 Premium (75th percentile): ", speed_benchmark$p75, " km/h\n")
cat("💡 To be competitive, aim for at least ", speed_benchmark$p50, " km/h\n")
```
```{r executive_dashboard}
cat("📊 EXECUTIVE DASHBOARD: KEY METRICS\n")
cat("=" , rep("=", 60), "\n", sep="")

total_models <- nrow(cars)
german_models <- sum(cars$country == "Germany")
luxury_models <- sum(cars$price_classes == "high", na.rm = TRUE)
high_performance <- sum(cars$maxspeed > 200, na.rm = TRUE)

cat("🎯 MARKET OVERVIEW:\n")
cat(" Total Models Analyzed: ", total_models, "\n")
cat(" German Share of Models: ", round(german_models/total_models*100, 1), "%\n")
cat(" Luxury Segment: ", round(luxury_models/total_models*100, 1), "%\n")
cat(" High-Performance Cars (>200 km/h): ", round(high_performance/total_models*100, 1), "%\n")

price_q <- distr.summary.x(cars$price_num, stats = "quartiles")

cat("💰 PRICE INTELLIGENCE:\n")
cat(" Market Entry Price: $", round(price_q$min, 0), "\n")
cat(" Budget Threshold (25%): $", round(price_q$p25, 0), "