By the end of this lecture, you will master:
Before we can analyze data, we must understand what type of data we're working with. This is like a chef understanding ingredients before cooking - the technique depends entirely on what you're working with.
Statistical Variables
├── Qualitative (Categorical)
│   ├── Nominal (No natural order)
│   │   └── Examples: Color, Brand, Country
│   └── Ordinal (Natural order exists)
│       └── Examples: Grade (A,B,C), Size (S,M,L)
└── Quantitative (Numerical)
    ├── Discrete (Countable)
    │   └── Examples: Number of children, Cars owned
    └── Continuous (Measurable)
        └── Examples: Height, Weight, Temperature
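In R, this classification maps roughly onto storage types. The sketch below uses made-up toy vectors (not the lecture's dataset) to show one common way of storing each kind of variable; other encodings are possible.

```r
# Toy vectors (hypothetical values) showing one way to store each variable type
country <- factor(c("Italy", "Japan", "Italy"))                # nominal: unordered factor
size    <- factor(c("S", "L", "M"),
                  levels = c("S", "M", "L"), ordered = TRUE)    # ordinal: ordered factor
doors   <- c(2L, 5L, 5L)                                        # discrete: integer counts
height  <- c(1.62, 1.80, 1.75)                                  # continuous: double

is.ordered(country)   # the levels of a nominal factor carry no ranking
is.ordered(size)      # the levels of an ordered factor do: S < M < L
size < "L"            # order comparisons only make sense for ordered factors
```

Storing an ordinal variable as an ordered factor is what lets tables and plots respect the natural order instead of sorting alphabetically.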
Let's begin our journey by preparing our tools:
# Load the UBStats package - our Swiss Army knife for statistical analysis
library(UBStats)
## Package UBStats (0.2.2) loaded.
## To cite, type citation("UBStats")
## Please report improvements and bugs to: https://github.com/raffaellapiccarreta/UBStats/issues
# Create a simulated cars dataset
set.seed(123) # For reproducible results
n <- 190
# Fix the sales generation to ensure exactly n values
low_sales <- sample(500:3000, round(n*0.6), replace = TRUE)
mid_sales <- sample(3000:8000, round(n*0.25), replace = TRUE)
high_sales <- sample(8000:50000, n - length(low_sales) - length(mid_sales), replace = TRUE)
all_sales <- c(low_sales, mid_sales, high_sales)
cars <- data.frame(
  model = paste("Model", 1:n),
  sales = sample(all_sales), # Shuffle the sales values
  bestselling = sample(0:1, n, replace = TRUE, prob = c(0.9, 0.1)),
  price_num = round(rnorm(n, 25000, 15000)),
  price_classes = sample(c("low", "mid", "high"), n, replace = TRUE, prob = c(0.27, 0.55, 0.18)),
  maxspeed = round(rnorm(n, 180, 30)),
  acceleration = round(rnorm(n, 11, 3), 1),
  urban_fuelcons = round(rnorm(n, 8, 2), 1),
  fueltank = round(rnorm(n, 60, 15)),
  weight = round(rnorm(n, 1400, 300)),
  n_doors_min = sample(c(2,3,4,5,7), n, replace = TRUE, prob = c(0.09, 0.14, 0.05, 0.71, 0.01)),
  country = sample(c("Germany", "Japan", "France", "Italy", "United States", "Europe - others", "Asia - others"),
                   n, replace = TRUE, prob = c(0.26, 0.19, 0.15, 0.11, 0.09, 0.14, 0.06))
)
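# Clean up unrealistic values by shifting them upward
# Note: a fixed shift does not guarantee a realistic result. A simulated price
# below -5000 stays under 5000 even after adding 10000, which is why a negative
# minimum price (-1860) still appears in the summaries later in this lecture.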
cars$price_num[cars$price_num < 5000] <- cars$price_num[cars$price_num < 5000] + 10000
cars$maxspeed[cars$maxspeed < 100] <- cars$maxspeed[cars$maxspeed < 100] + 50
cars$acceleration[cars$acceleration < 3] <- abs(cars$acceleration[cars$acceleration < 3]) + 5
# Check the data
str(cars)
## 'data.frame': 190 obs. of 12 variables:
## $ model : chr "Model 1" "Model 2" "Model 3" "Model 4" ...
## $ sales : int 12712 873 1528 23023 2956 1346 3711 2726 37652 6123 ...
## $ bestselling : int 0 0 0 0 0 0 0 0 0 0 ...
## $ price_num : num 32656 42939 8741 7823 27328 ...
## $ price_classes : chr "low" "mid" "low" "mid" ...
## $ maxspeed : num 169 137 153 214 158 194 171 205 177 159 ...
## $ acceleration : num 4.7 11 18.2 9.8 15 12.1 11.8 10.7 9 10.8 ...
## $ urban_fuelcons: num 6.2 4.7 5.4 7.4 9.1 11.6 6 7 9.9 7.5 ...
## $ fueltank : num 65 46 87 39 85 62 60 40 72 79 ...
## $ weight : num 1445 1228 1499 1321 1585 ...
## $ n_doors_min : num 5 5 5 4 5 2 5 2 5 5 ...
## $ country : chr "Japan" "Italy" "Italy" "Japan" ...
head(cars)
# Save the dataset (optional)
save(cars, file = "stat_datasets_cl17.Rdata")
# Get a feel for our dataset
head(cars, 10) # First 10 rows
dim(cars) # Dimensions: rows and columns
## [1] 190 12
names(cars) # Variable names
## [1] "model" "sales" "bestselling" "price_num"
## [5] "price_classes" "maxspeed" "acceleration" "urban_fuelcons"
## [9] "fueltank" "weight" "n_doors_min" "country"
Imagine you're a business analyst for a major European car manufacturer. Your dataset contains information about 190 different car models, and each variable tells a different part of the story:
Our Variables Explained:
- model: Car model name (Qualitative Nominal)
- sales: Annual sales in units (Quantitative Continuous)
- bestselling: Best-selling status (1=Yes, 0=No) (Qualitative Nominal)
- price_num: Price in dollars (Quantitative Continuous)
- price_classes: Price category (low/mid/high) (Qualitative Ordinal)
- maxspeed: Maximum speed in km/h (Quantitative Continuous)
- acceleration: Time to reach 100 km/h (Quantitative Continuous)
- urban_fuelcons: Urban fuel consumption (Quantitative Continuous)
- fueltank: Fuel tank capacity in liters (Quantitative Continuous)
- weight: Weight in kilograms (Quantitative Continuous)
- n_doors_min: Number of doors (Quantitative Discrete)
- country: Manufacturing country (Qualitative Nominal)
Let's start our analysis by exploring where these cars are manufactured. This is our first detective work - uncovering geographical patterns in car production.
# Create a frequency distribution table
country_freq <- distr.table.x(cars$country)
## cars$country Count Prop
## Asia - others 14 0.07
## Europe - others 38 0.20
## France 33 0.17
## Germany 40 0.21
## Italy 19 0.10
## Japan 33 0.17
## United States 13 0.07
## TOTAL 190 1.00
print(country_freq)
## cars$country Count Prop
## 1 Asia - others 14 0.07368421
## 2 Europe - others 38 0.20000000
## 3 France 33 0.17368421
## 4 Germany 40 0.21052632
## 5 Italy 19 0.10000000
## 6 Japan 33 0.17368421
## 7 United States 13 0.06842105
## Sum TOTAL 190 1.00000000
What This Table Tells Us:
- Count: How many car models each country produces
- Prop: The proportion (decimal form) of total production
- Each row: Represents a manufacturing region's contribution
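For readers without UBStats installed, the same counts and proportions can be obtained with base R. This is a minimal sketch on a made-up vector; `origin` is a hypothetical stand-in for `cars$country`.

```r
# Hypothetical category vector standing in for cars$country
origin <- c("Italy", "Japan", "Italy", "France", "Italy")

counts <- table(origin)        # absolute frequencies, categories sorted alphabetically
props  <- prop.table(counts)   # relative frequencies, summing to 1

freq_table <- data.frame(
  Category = names(counts),
  Count    = as.vector(counts),
  Prop     = as.vector(props)
)
print(freq_table)
```

The layout mirrors the Count and Prop columns shown above, minus the TOTAL row.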
# Add percentage for easier interpretation
country_detailed <- distr.table.x(cars$country, freq=c("counts","prop", "perc"))
## cars$country Count Prop Percent
## Asia - others 14 0.07 7
## Europe - others 38 0.20 20
## France 33 0.17 17
## Germany 40 0.21 21
## Italy 19 0.10 10
## Japan 33 0.17 17
## United States 13 0.07 7
## TOTAL 190 1.00 100
print(country_detailed)
## cars$country Count Prop Percent
## 1 Asia - others 14 0.07368421 7.368421
## 2 Europe - others 38 0.20000000 20.000000
## 3 France 33 0.17368421 17.368421
## 4 Germany 40 0.21052632 21.052632
## 5 Italy 19 0.10000000 10.000000
## 6 Japan 33 0.17368421 17.368421
## 7 United States 13 0.06842105 6.842105
## Sum TOTAL 190 1.00000000 100.000000
# Pie chart - showing the "whole pie" of car production
distr.plot.x(cars$country, plot.type = "pie")

When to Use Pie Charts:
- When showing parts of a whole
- When proportions are the main message
- Maximum 5-7 categories for clarity
# Bar chart - better for comparing specific values
distr.plot.x(cars$country, plot.type = "bars")

When to Use Bar Charts:
- When comparing frequencies between categories
- When you have many categories
- When exact values matter more than proportions
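Comparisons between categories are usually easier when the bars are sorted by height. The sketch below uses base R rather than UBStats, with the counts copied from the country frequency table earlier in this section; the styling arguments are just one reasonable choice.

```r
# Counts copied from the country frequency table
country_counts <- c("Asia - others" = 14, "Europe - others" = 38, "France" = 33,
                    "Germany" = 40, "Italy" = 19, "Japan" = 33, "United States" = 13)

# Sort from most to least frequent so the ranking is visible at a glance
sorted_counts <- sort(country_counts, decreasing = TRUE)

barplot(sorted_counts, las = 2, cex.names = 0.7,
        ylab = "Number of models", main = "Car models by country")
```

Sorting is appropriate for nominal variables only; ordinal categories should keep their natural order.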
# Calculate the percentage of European car models
# European countries: Europe - others, France, Germany, Italy
european_countries <- c("Europe - others", "France", "Germany", "Italy")
# From our table: 38 + 33 + 40 + 19 = 130 of 190 models (about 68.4%)
european_percentage <- round(mean(cars$country %in% european_countries) * 100, 1)
cat("European Car Models:", european_percentage, "%\n")
## European Car Models: 68.4 %
cat("Top Manufacturing Country: Germany (21.1%)\n")
## Top Manufacturing Country: Germany (21.1%)
cat("Non-European Production:", 100 - european_percentage, "%\n")
## Non-European Production: 31.6 %
Now let's examine price categories - here, order matters! Low < Mid < High.
# Convert to factor with correct logical order
cars$price_classes <- factor(cars$price_classes,
levels=c("low","mid","high"))
# Verify the ordering worked
levels(cars$price_classes)
## [1] "low" "mid" "high"
# Frequency distribution
price_class_freq <- distr.table.x(cars$price_classes)
## cars$price_classes Count Prop
## low 50 0.26
## mid 111 0.58
## high 29 0.15
## TOTAL 190 1.00
print(price_class_freq)
## cars$price_classes Count Prop
## 1 low 50 0.2631579
## 2 mid 111 0.5842105
## 3 high 29 0.1526316
## Sum TOTAL 190 1.0000000
# Visual representation
distr.plot.x(cars$price_classes, plot.type = "bars")

# What percentage of cars are priced at or below 'mid' class?
# This tells us about market accessibility
# From our table: low (50) + mid (111) = 161 of 190 models
affordable_percentage <- round(mean(cars$price_classes %in% c("low", "mid")) * 100, 1)
cat("Affordable Cars (Low + Mid price):", affordable_percentage, "%\n")
## Affordable Cars (Low + Mid price): 84.7 %
cat("Luxury Cars (High price):", 100 - affordable_percentage, "%\n")
## Luxury Cars (High price): 15.3 %
Business Interpretation:
- About 85% of car models are priced in the low or mid categories
- Only about 15% are in the luxury segment
- This suggests a market focus on accessibility rather than exclusivity
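Because price classes are ordered, "at or below mid" is a cumulative question, and cumulative proportions answer it for every level at once. A small sketch, using the counts from the price-class frequency table:

```r
# Counts copied from the price-class frequency table, in their natural order
price_counts <- c(low = 50, mid = 111, high = 29)

# Running total of counts divided by the grand total
cum_prop <- cumsum(price_counts) / sum(price_counts)
round(cum_prop, 3)
```

The entry for `mid` is the share of models priced at or below the mid class. Cumulative proportions are meaningful for ordinal and quantitative variables, but not for nominal ones such as country.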
Discrete variables are like counting objects - you can have 2 doors or 3 doors, but not 2.5 doors.
# Frequency distribution for number of doors
doors_freq <- distr.table.x(cars$n_doors_min)
## cars$n_doors_min Count Prop
## 2 17 0.09
## 3 22 0.12
## 4 9 0.05
## 5 141 0.74
## 7 1 0.01
## TOTAL 190 1.00
print(doors_freq)
## cars$n_doors_min Count Prop
## 1 2 17 0.089473684
## 2 3 22 0.115789474
## 3 4 9 0.047368421
## 4 5 141 0.742105263
## 5 7 1 0.005263158
## Sum TOTAL 190 1.000000000
# Spike plot - perfect for discrete data
distr.plot.x(cars$n_doors_min, plot.type="spike", freq="prop")

Why Spike Plots for Discrete Data?
- Each spike represents an exact value
- Height shows frequency/proportion
- No artificial binning needed
- Clear visual separation between values
# Cumulative frequency plot
distr.plot.x(cars$n_doors_min, plot.type="cumulative", freq="prop")

Reading Cumulative Plots:
- X-axis: Number of doors
- Y-axis: Cumulative proportion (running total)
- Shows "what percentage have X doors or fewer"
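The cumulative plot is a picture of the empirical cumulative distribution function, which base R exposes through `ecdf()`. The sketch below rebuilds the door variable from the counts in the frequency table, so it runs without the dataset.

```r
# Rebuild the door counts from the frequency table: 17 two-door models, 22 three-door, ...
doors <- rep(c(2, 3, 4, 5, 7), times = c(17, 22, 9, 141, 1))

F_doors <- ecdf(doors)   # step function: F(x) = share of models with x doors or fewer

F_doors(4)       # share of models with at most 4 doors
1 - F_doors(4)   # share of models with more than 4 doors
```

Reading a value off the plot and calling `F_doors()` are the same operation; the function just removes the guesswork.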
# How many car models have more than 4 doors?
# From our frequency table: 5-door (141) + 7-door (1) = 142
cars_more_than_4_doors <- sum(cars$n_doors_min > 4)
cat("Car models with more than 4 doors:", cars_more_than_4_doors, "models\n")
## Car models with more than 4 doors: 142 models
cat("This represents:", round(cars_more_than_4_doors / nrow(cars) * 100, 1), "% of all models\n")
## This represents: 74.7 % of all models
Continuous data requires binning - grouping similar values together to reveal patterns. This is like organizing a library: individual books (data points) are grouped into sections (bins) to see the overall collection structure.
# Create histogram with 5 equal-width classes
distr.plot.x(cars$fueltank, plot.type = "hist", breaks = 5)

# Corresponding frequency table
fuel_table <- distr.table.x(cars$fueltank, breaks = 5)
## cars$fueltank Count Prop
## [25,40) 14 0.07
## [40,55) 49 0.26
## [55,70) 68 0.36
## [70,85) 49 0.26
## [85,100] 10 0.05
## TOTAL 190 1.00
print(fuel_table)
## cars$fueltank Count Prop
## 1 [25,40) 14 0.07368421
## 2 [40,55) 49 0.25789474
## 3 [55,70) 68 0.35789474
## 4 [70,85) 49 0.25789474
## 5 [85,100] 10 0.05263158
## Sum TOTAL 190 1.00000000
Interval Notation Explained:
- [25,40): Includes 25, excludes 40
- [40,55): Includes 40, excludes 55
- [85,100]: Includes both 85 and 100 (last interval)
Distribution Characteristics:
- Peak: The largest group of cars (36%) has fuel tanks between 55 and 70 liters
- Shape: Approximately symmetric (the classes on either side of the peak each hold 26%)
- Range: From 25 liters to 100 liters
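The interval notation above corresponds to left-closed classes, which base R's `cut()` can reproduce. In this sketch the tank sizes are hypothetical and deliberately placed on class boundaries to show where each one lands.

```r
# Hypothetical tank sizes chosen to sit on (or next to) class boundaries
tank <- c(25, 40, 54.9, 55, 85, 100)

classes <- cut(tank,
               breaks = c(25, 40, 55, 70, 85, 100),
               right = FALSE,           # classes are closed on the left: [a, b)
               include.lowest = TRUE)   # the last class is also closed on the right

data.frame(tank, class = classes)
```

A value of exactly 40 falls in the second class, not the first, and 100 is kept only because the last class is closed on both ends.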
# What percentage of cars have fuel tanks from 40 up to (but not including) 85 liters?
# [40,55) + [55,70) + [70,85) = 49 + 68 + 49 = 166 of 190 models
fuel_40_85_percent <- round((49 + 68 + 49) / 190 * 100, 1)
cat("Cars with fuel tank 40-85 liters:", fuel_40_85_percent, "%\n")
## Cars with fuel tank 40-85 liters: 87.4 %
cat("This covers the vast majority of the market!\n")
## This covers the vast majority of the market!
Sometimes equal-width bins don't tell the full story. Let's see why:
# Sales distribution with 8 equal-width classes
distr.plot.x(cars$sales, plot.type = "hist", breaks=8)

# The corresponding table reveals the problem
distr.table.x(cars$sales, breaks=8)
## cars$sales Count Prop
## [512,6610) 151 0.79
## [6610,12708) 15 0.08
## [12708,18806) 8 0.04
## [18806,24904) 1 0.01
## [24904,31002) 4 0.02
## [31002,37100) 5 0.03
## [37100,43198) 4 0.02
## [43198,49296] 2 0.01
## TOTAL 190 1.00
The Problem with Equal-Width Bins:
- 79% of cars fall in the first bin [512,6610)
- Remaining bins have very few observations
- We lose detail about the majority of data
- The visualization is not informative
# Use custom breaks that make business sense
custom_breaks <- c(0, 2000, 5000, 20000, 160000)
distr.plot.x(cars$sales, plot.type = "hist", breaks = custom_breaks)

# Much more informative frequency table
sales_custom <- distr.table.x(cars$sales, breaks = custom_breaks, freq=c("count","prop","dens"))
## cars$sales Count Prop Density
## [0,2000) 68 0.36 0.0001789
## [2000,5000) 61 0.32 0.0001070
## [5000,20000) 45 0.24 0.0000158
## [20000,160000] 16 0.08 0.0000006
## TOTAL 190 1.00
print(sales_custom)
## cars$sales Count Prop Density
## 1 [0,2000) 68 0.35789474 1.789474e-04
## 2 [2000,5000) 61 0.32105263 1.070175e-04
## 3 [5000,20000) 45 0.23684211 1.578947e-05
## 4 [20000,160000] 16 0.08421053 6.015038e-07
## Sum TOTAL 190 1.00000000 NA
Business Insights from Custom Bins:
- Low volume [0,2000): 36% of models
- Medium volume [2000,5000): 32% of models
- High volume [5000,20000): 24% of models
- Very high volume [20000,160000]: 8% of models
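With unequal bin widths, bar heights must be densities rather than proportions, otherwise wide bins look more important than they are. The Density column in the table above is the proportion divided by the bin width, which this sketch reproduces from the printed counts:

```r
# Breaks and counts copied from the custom-bin table
breaks <- c(0, 2000, 5000, 20000, 160000)
counts <- c(68, 61, 45, 16)

prop    <- counts / sum(counts)   # share of models in each bin
width   <- diff(breaks)           # bin widths: 2000, 3000, 15000, 140000
density <- prop / width           # height of each histogram bar

data.frame(lower = head(breaks, -1), upper = tail(breaks, -1),
           width, prop = round(prop, 4), density = signif(density, 4))

sum(density * width)   # bar areas (height x width) add up to 1
```

Area, not height, carries the proportion: the last bin is 70 times wider than the first, so its bar is very low even though it still holds 16 models.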
# Approximate percentage of cars with sales between 1000 and 3000 units
# This requires interpolation within bins
# [0,2000) contains 68 cars (36%)
# We need roughly from 1000 to 2000 (half the bin) = 18%
# [2000,5000) contains 61 cars (32%)
# We need roughly from 2000 to 3000 (1/3 of bin) = 11%
# Total approximation: 18% + 11% = 29%
cat("Estimated cars with sales 1000-3000 units: ~29%\n")
## Estimated cars with sales 1000-3000 units: ~29%
cat("This is an approximation using linear interpolation\n")
## This is an approximation using linear interpolation
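The interpolation can be written as a small helper so the same reasoning applies to any range. This is a sketch: `share_between` is a hypothetical function name, and it assumes observations are spread evenly inside each bin, which is exactly the linear-interpolation assumption used above.

```r
# Estimate the share of observations in [from, to] from a binned table,
# assuming values are spread evenly within each bin
share_between <- function(from, to, breaks, prop) {
  lower <- head(breaks, -1)
  upper <- tail(breaks, -1)
  # Length of the overlap between [from, to] and each bin (0 if they do not meet)
  overlap <- pmax(0, pmin(to, upper) - pmax(from, lower))
  sum(prop * overlap / (upper - lower))
}

# Breaks and counts copied from the custom-bin table
breaks <- c(0, 2000, 5000, 20000, 160000)
prop   <- c(68, 61, 45, 16) / 190

share_between(1000, 3000, breaks, prop)
```

The estimate is only as good as the even-spread assumption; with the raw data available, `mean(cars$sales >= 1000 & cars$sales < 3000)` would give the exact share.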
Every dataset has a personality revealed through its shape. Learning to read these personalities is crucial for proper analysis.
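- Symmetric: one central peak, with tails of similar length on both sides
- Right-skewed: peak on the left, with a long tail stretching toward high values
- Left-skewed: peak on the right, with a long tail stretching toward low values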
# Example 1: Fuel tank (approximately symmetric)
distr.plot.x(cars$fueltank, plot.type = "hist", breaks = 6)

# Example 2: Sales (heavily right-skewed)
distr.plot.x(cars$sales, plot.type = "hist", breaks = custom_breaks)

# Example 3: Acceleration (approximately symmetric; mean 10.84 vs. median 10.90)
distr.plot.x(cars$acceleration, plot.type = "hist", breaks = 6)

Cumulative distributions answer "what percentage of observations are at or below a certain value?"
# Create ogive (cumulative frequency curve) for price
distr.plot.x(cars$price_num, plot.type="cumulative", breaks = 10, freq = "prop")

Question: "Is the minimum price of the top 20% most expensive car models greater than $40,000?"
How to Read the Ogive:
1. Top 20% most expensive = 80th percentile
2. Find 0.8 on Y-axis (80% cumulative)
3. Draw horizontal line to curve
4. Drop vertical line to X-axis
5. Read the price value
cat("Ogive Reading Exercise:\n")
## Ogive Reading Exercise:
cat("At 80% cumulative frequency (80th percentile):\n")
## At 80% cumulative frequency (80th percentile):
cat("Price is approximately $40,000\n")
## Price is approximately $40,000
cat("Therefore, the statement is approximately CORRECT\n")
## Therefore, the statement is approximately CORRECT
cat("The minimum price for top 20% expensive cars ≈ $40,000\n")
## The minimum price for top 20% expensive cars ≈ $40,000
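A value read off a plot is always approximate, so it is worth confirming numerically. The quartile output later in this lecture puts Q3 at about $33,195, so the 80th percentile must lie above that; whether it actually reaches $40,000 is what `quantile()` settles. The sketch below uses simulated prices as a stand-in; with the real data, replace `prices` with `cars$price_num`.

```r
# Hypothetical prices standing in for cars$price_num
set.seed(1)
prices <- round(rnorm(190, mean = 25000, sd = 15000))

p80 <- quantile(prices, probs = 0.80, names = FALSE)   # 80th percentile

p80 > 40000           # TRUE only if the statement in the question holds
mean(prices >= p80)   # share of models at or above the threshold, close to 0.20
```

If the comparison returns FALSE for the real data, the plot-based reading above should be revised accordingly.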
Numbers tell stories too. Let's learn to calculate and interpret the key statistics that describe our data.
# Calculate central tendency measures for price
price_central <- distr.summary.x(cars$price_num, stats="central")
## n n.a mode n.modes mode% median mean
## 190 0 26809 1 0.0105 24177 24267.47
print(price_central)
## $`Central tendency measures`
## n n.a mode n.modes mode% median mean
## 1 190 0 26809 1 0.01052632 24177 24267.47
Interpreting the Results:
- Mean: $24,267 (arithmetic average: sum ÷ count)
- Median: $24,177 (middle value when sorted)
- Mode: $26,809 (most frequent value, but it appears only twice, so it says little about a near-continuous variable)
# Analyze the relationship between mean and median
mean_price <- mean(cars$price_num)
median_price <- median(cars$price_num)
cat("Distribution Shape Analysis:\n")
## Distribution Shape Analysis:
cat("Mean Price: $", round(mean_price, 2), "\n")
## Mean Price: $ 24267.47
cat("Median Price: $", round(median_price, 2), "\n")
## Median Price: $ 24177
cat("Mean - Median = $", round(mean_price - median_price, 2), "\n")
## Mean - Median = $ 90.47
cat("Mean is only slightly above Median: at most a very mild right skew\n")
## Mean is only slightly above Median: at most a very mild right skew
cat("A gap of about $90 is tiny at this price scale, so the distribution is close to symmetric\n")
## A gap of about $90 is tiny at this price scale, so the distribution is close to symmetric
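To see why the gap between mean and median signals skewness, it helps to look at a case where the gap is large. The prices below are hypothetical and chosen so that a single extreme value dominates.

```r
# Hypothetical prices (in $1,000s); the last value is an extreme luxury model
prices <- c(18, 20, 21, 22, 24, 150)

mean(prices)     # pulled toward the extreme value
median(prices)   # depends only on the middle of the sorted data

mean(prices) - median(prices)   # a large positive gap points to right skew
```

Replacing 150 with 1,500 would raise the mean further and leave the median untouched, which is why the median is the safer summary for skewed data.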
Quartiles divide data into four equal parts, like cutting a pizza into quarters.
# Get quartiles for price analysis
price_quartiles <- distr.summary.x(cars$price_num, stats="quartiles")
## n n.a min p25 p50 p75 max
## 190 0 -1860 13574.25 24177 33195.25 61570
print(price_quartiles)
## $Quartiles
## n n.a min p25 p50 p75 max
## 1 190 0 -1860 13574.25 24177 33195.25 61570
# Quartiles and percentiles with base R
price_quartiles <- quantile(cars$price_num,
                            probs = c(0, 0.25, 0.5, 0.75, 1),
                            na.rm = TRUE)
# The quantile() function returns a named vector, access by index or name
cat("\n=== PRICE QUARTILES ===\n")
##
## === PRICE QUARTILES ===
cat("Minimum: $", round(price_quartiles[1], 0), "\n")
## Minimum: $ -1860
cat("Q1 (25th percentile): $", round(price_quartiles[2], 0), "\n")
## Q1 (25th percentile): $ 13574
cat("Median (50th percentile): $", round(price_quartiles[3], 0), "\n")
## Median (50th percentile): $ 24177
cat("Q3 (75th percentile): $", round(price_quartiles[4], 0), "\n")
## Q3 (75th percentile): $ 33195
cat("Maximum: $", round(price_quartiles[5], 0), "\n")
## Maximum: $ 61570
# Market segmentation, indexing the named quantile vector by position
cat("\nPrice Market Segmentation:\n")
##
## Price Market Segmentation:
cat("Luxury Segment (Top 25%): Above $", round(price_quartiles[4], 0), "\n")
## Luxury Segment (Top 25%): Above $ 33195
cat("Premium Segment (50-75%): $", round(price_quartiles[3], 0), " - $", round(price_quartiles[4], 0), "\n")
## Premium Segment (50-75%): $ 24177 - $ 33195
cat("Mid-market (25-50%): $", round(price_quartiles[2], 0), " - $", round(price_quartiles[3], 0), "\n")
## Mid-market (25-50%): $ 13574 - $ 24177
cat("Budget Segment (Bottom 25%): Below $", round(price_quartiles[2], 0), "\n")
## Budget Segment (Bottom 25%): Below $ 13574
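Once the quartiles are known, each model can be assigned to its segment with `cut()`. The sketch below uses a small set of hypothetical prices; the segment labels follow the ones printed above.

```r
# Hypothetical prices standing in for cars$price_num
prices <- c(9000, 12000, 15000, 21000, 24000, 27000, 33000, 41000)

q <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))

# Use the quartiles as class boundaries; include.lowest keeps the minimum in "Budget"
segment <- cut(prices, breaks = q, include.lowest = TRUE,
               labels = c("Budget", "Mid-market", "Premium", "Luxury"))

table(segment)
q[["75%"]]   # quantile() names its results, so values can also be read by name
```

By construction each segment holds about a quarter of the models, which makes quartile-based segments a description of rank rather than of absolute price.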
# 90th percentiles of top speed and acceleration time
speed_p90 <- distr.summary.x(cars$maxspeed, stats="p90")
## n n.a p90
## 190 0 218.5
accel_p90 <- distr.summary.x(cars$acceleration, stats="p90")
## n n.a p90
## 190 0 14.41
# distr.summary.x() returns a list that holds the results table, so take the
# first list element before reading a column with $
cat("90TH PERCENTILE THRESHOLDS:\n")
## 90TH PERCENTILE THRESHOLDS:
cat("Minimum speed for top 10%: ", speed_p90[[1]]$p90, " km/h\n")
## Minimum speed for top 10%: 218.5 km/h
# A lower acceleration time is better, so the 90th percentile marks the slowest 10%
cat("Acceleration time exceeded only by the slowest 10%: ", accel_p90[[1]]$p90, " seconds\n")
## Acceleration time exceeded only by the slowest 10%: 14.41 seconds
# Five-number summary for acceleration
accel_summary <- distr.summary.x(cars$acceleration, stats="fivenumber")
## n n.a min q1 median q3 max
## 190 0 3.2 8.83 10.9 12.97 18.2
print(accel_summary)
## $`Five number summary`
## n n.a min q1 median q3 max
## 1 190 0 3.2 8.825 10.9 12.975 18.2
accel_five <- accel_summary[[1]] # the results table inside the list
cat("\nACCELERATION PERFORMANCE BREAKDOWN:\n")
##
## ACCELERATION PERFORMANCE BREAKDOWN:
cat("Fastest car: ", accel_five$min, " seconds (0-100 km/h)\n")
## Fastest car: 3.2 seconds (0-100 km/h)
cat("Q1 (25th percentile): ", accel_five$q1, " seconds\n")
## Q1 (25th percentile): 8.825 seconds
cat("Median (50th percentile): ", accel_five$median, " seconds\n")
## Median (50th percentile): 10.9 seconds
cat("Q3 (75th percentile): ", accel_five$q3, " seconds\n")
## Q3 (75th percentile): 12.975 seconds
cat("Slowest car: ", accel_five$max, " seconds\n")
## Slowest car: 18.2 seconds
Boxplots pack an incredible amount of information into a simple graphic. They're like a data summary in visual form.
# Create boxplot for maximum speed
distr.plot.x(cars$maxspeed, plot.type = "boxplot")
outlier    •
           |
whisker    |---- Maximum within 1.5×IQR of Q3
           |
Q3      ┌─────┐
        │     │ ← IQR (Interquartile Range)
Median  ├─────┤ ← Dark line inside box
        │     │
Q1      └─────┘
           |
whisker    |---- Minimum within 1.5×IQR of Q1
           |
outlier    •
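The 1.5×IQR rule in the diagram can be applied by hand. The speeds below are hypothetical, with one value playing the role of a supercar. Note that `boxplot.stats()` computes the box edges from hinges rather than from `quantile()`, so the two approaches can differ slightly on small samples.

```r
# Hypothetical top speeds (km/h); 279 plays the role of a supercar
speed <- c(150, 160, 165, 170, 178, 185, 190, 200, 205, 279)

q1  <- quantile(speed, 0.25, names = FALSE)
q3  <- quantile(speed, 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr   # whiskers never extend below this value
upper_fence <- q3 + 1.5 * iqr   # whiskers never extend above this value

speed[speed < lower_fence | speed > upper_fence]   # points drawn beyond the whiskers
boxplot.stats(speed)$out                           # R's built-in version of the same check
```

Both approaches flag the same observation here; an outlier is a prompt to investigate, not a value to delete automatically.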
# Check for outliers in maximum speed
cat("OUTLIER ANALYSIS FOR MAXIMUM SPEED:\n")
## OUTLIER ANALYSIS FOR MAXIMUM SPEED:
cat("Any points beyond the whiskers are outliers\n")
## Any points beyond the whiskers are outliers
cat("These represent cars with unusually high speeds\n")
## These represent cars with unusually high speeds
cat("Could be supercars or sports cars\n")
## Could be supercars or sports cars
cat("Outliers require special attention in analysis\n")
## Outliers require special attention in analysis
# Compare different variables
par(mfrow=c(2,2)) # 2x2 grid of plots
distr.plot.x(cars$acceleration, plot.type = "boxplot", main="Acceleration")
distr.plot.x(cars$price_num, plot.type = "boxplot", main="Price")
distr.plot.x(cars$maxspeed, plot.type = "boxplot", main="Max Speed")
distr.plot.x(cars$weight, plot.type = "boxplot", main="Weight")
par(mfrow=c(1,1)) # Reset to single plot
Symmetric Distribution:
- Median line centered in box
- Equal whisker lengths
- No skewness visible
Right-Skewed Distribution:
- Median closer to Q1 (left side of box)
- Right whisker longer than left
- Outliers on right side
Left-Skewed Distribution:
- Median closer to Q3 (right side of box)
- Left whisker longer than right
- Outliers on left side
Let's put everything together in a comprehensive business analysis.
cat("AUTOMOTIVE MARKET ANALYSIS REPORT\n")
## AUTOMOTIVE MARKET ANALYSIS REPORT
cat("=", rep("=", 45), "\n", sep="")
## ==============================================
# Basic dataset info
cat("Dataset: ", nrow(cars), " car models analyzed\n")
## Dataset: 190 car models analyzed
# Geographic distribution - Calculate actual percentages from our data
country_table <- table(cars$country)
country_percent <- round(prop.table(country_table) * 100, 1)
cat("\nGEOGRAPHIC DISTRIBUTION:\n")
##
## GEOGRAPHIC DISTRIBUTION:
germany_pct <- country_percent["Germany"]
europe_countries <- c("Europe - others", "France", "Germany", "Italy")
european_pct <- sum(country_percent[names(country_percent) %in% europe_countries])
asia_countries <- c("Japan", "Asia - others")
asia_pct <- sum(country_percent[names(country_percent) %in% asia_countries])
us_pct <- country_percent["United States"]
cat("Germany leads with ", germany_pct, "% market share\n")
## Germany leads with 21.1 % market share
cat("European brands dominate: ", round(european_pct, 1), "% of models\n")
## European brands dominate: 68.5 % of models
cat("Asia (Japan + others): ", round(asia_pct, 1), "% of models\n")
## Asia (Japan + others): 24.8 % of models
cat("US brands: ", us_pct, "% of models\n")
## US brands: 6.8 % of models
# Price analysis - Using BASE R functions only
price_mean <- mean(cars$price_num, na.rm = TRUE)
price_median <- median(cars$price_num, na.rm = TRUE)
cat("\nPRICE ANALYSIS:\n")
##
## PRICE ANALYSIS:
cat("Average price: $", round(price_mean, 0), "\n")
## Average price: $ 24267
cat("Median price: $", round(price_median, 0), "\n")
## Median price: $ 24177
cat("Distribution: Close to symmetric (mean and median differ by only about $90)\n")
## Distribution: Close to symmetric (mean and median differ by only about $90)
# Calculate budget-friendly percentage from price classes
price_class_table <- table(cars$price_classes)
price_class_percent <- round(prop.table(price_class_table) * 100, 1)
budget_friendly <- sum(price_class_percent[c("low", "mid")])
cat("Budget-friendly focus: ", round(budget_friendly, 0), "% priced low-to-mid range\n")
## Budget-friendly focus: 85 % priced low-to-mid range
# Performance insights - Using BASE R
speed_90 <- quantile(cars$maxspeed, 0.9, na.rm = TRUE)
accel_90 <- quantile(cars$acceleration, 0.9, na.rm = TRUE)
cat("\nPERFORMANCE INSIGHTS:\n")
##
## PERFORMANCE INSIGHTS:
cat("Top 10% speed threshold: ", round(speed_90, 0), " km/h\n")
## Top 10% speed threshold: 218 km/h
# Lower acceleration times are better, so the 90th percentile bounds the fastest 90%
cat("90% of models reach 100 km/h within ", round(accel_90, 2), " seconds\n")
## 90% of models reach 100 km/h within 14.41 seconds
# Check for outliers
speed_outliers <- boxplot.stats(cars$maxspeed)$out
if(length(speed_outliers) > 0) {
  cat("Speed outliers detected (", length(speed_outliers), " supercars)\n")
} else {
  cat("No significant speed outliers detected\n")
}
## Speed outliers detected ( 1 supercars)
# Alternative version using summary() function
variables_to_analyze <- c("price_num", "maxspeed", "acceleration", "weight", "fueltank")
cat("\nDETAILED STATISTICAL PROFILES:\n")
##
## DETAILED STATISTICAL PROFILES:
cat("=", rep("=", 50), "\n", sep="")
## ===================================================
for(var in variables_to_analyze) {
  cat("\n", toupper(gsub("_", " ", var)), " :\n", sep = "")
  # Get summary statistics
  var_summary <- summary(cars[[var]])
  var_mean <- mean(cars[[var]], na.rm = TRUE)
  var_median <- median(cars[[var]], na.rm = TRUE)
  cat(" Range: ", var_summary[1], " to ", var_summary[6], "\n")
  cat(" Mean: ", round(var_mean, 2), "\n")
  cat(" Median: ", round(var_median, 2), "\n")
  cat(" Q1-Q3: ", var_summary[2], " to ", var_summary[5], "\n")
  # Determine the direction of skewness from the sign of (mean - median)
  if(var_mean > var_median) {
    cat(" Shape: Right-skewed\n")
  } else if(var_mean < var_median) {
    cat(" Shape: Left-skewed\n")
  } else {
    cat(" Shape: Approximately symmetric\n")
  }
  # Optional: Show full summary
  cat(" Full summary:\n")
  print(var_summary)
}
##
## PRICE NUM :
## Range: -1860 to 61570
## Mean: 24267.47
## Median: 24177
## Q1-Q3: 13574.25 to 33195.25
## Shape: Right-skewed
## Full summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1860 13574 24177 24267 33195 61570
##
## MAXSPEED :
## Range: 103 to 279
## Mean: 180.63
## Median: 178
## Q1-Q3: 159.25 to 203
## Shape: Right-skewed
## Full summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 103.0 159.2 178.0 180.6 203.0 279.0
##
## ACCELERATION :
## Range: 3.2 to 18.2
## Mean: 10.84
## Median: 10.9
## Q1-Q3: 8.825 to 12.975
## Shape: Left-skewed
## Full summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 8.825 10.900 10.837 12.975 18.200
##
## WEIGHT :
## Range: 623 to 2072
## Mean: 1373.85
## Median: 1369.5
## Q1-Q3: 1167.75 to 1566
## Shape: Right-skewed
## Full summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 623 1168 1370 1374 1566 2072
##
## FUELTANK :
## Range: 25 to 100
## Mean: 60.92
## Median: 60
## Q1-Q3: 50 to 72
## Shape: Right-skewed
## Full summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 50.00 60.00 60.92 72.00 100.00
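The loop above labels a variable as skewed whenever the mean and median differ at all, so even a $90 gap on prices is reported as "Right-skewed". One way to make the label less sensitive is to scale the gap by the standard deviation, as in Pearson's second skewness coefficient, 3 × (mean − median) / sd. In the sketch below, the function names and the 0.2 threshold are illustrative assumptions, not a standard.

```r
# Pearson's second skewness coefficient: 3 * (mean - median) / sd
pearson_skew <- function(x) {
  x <- x[!is.na(x)]
  3 * (mean(x) - median(x)) / sd(x)
}

# Label the shape; the threshold is an arbitrary illustrative cut-off
classify_shape <- function(x, threshold = 0.2) {
  s <- pearson_skew(x)
  if (s > threshold) {
    "Right-skewed"
  } else if (s < -threshold) {
    "Left-skewed"
  } else {
    "Approximately symmetric"
  }
}

classify_shape(c(1, 2, 3, 4, 5))        # mean equals median
classify_shape(c(1, 1, 2, 2, 3, 40))    # one large value stretches the right tail
```

Applied to the variables above, which were simulated from normal distributions, a scaled measure like this would be expected to report most of them as approximately symmetric.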
cat("EXERCISE 1: DISTRIBUTION DETECTIVE\n")
## EXERCISE 1: DISTRIBUTION DETECTIVE
cat("Analyze the fuel consumption distribution:\n\n")
## Analyze the fuel consumption distribution:
# Create histogram using base R
hist(cars$urban_fuelcons,
     breaks = 6,
     main = "Urban Fuel Consumption Distribution",
     xlab = "Fuel Consumption (L/100km)",
     ylab = "Frequency",
     col = "lightblue")
# Calculate statistics using base R
fuel_mean <- mean(cars$urban_fuelcons, na.rm = TRUE)
fuel_median <- median(cars$urban_fuelcons, na.rm = TRUE)
cat("Mean fuel consumption: ", round(fuel_mean, 2), " L/100km\n")
## Mean fuel consumption: 7.93 L/100km
cat("Median fuel consumption: ", round(fuel_median, 2), " L/100km\n")
## Median fuel consumption: 7.85 L/100km
# Student task: Determine the shape
if(fuel_mean > fuel_median) {
  cat("ANSWER: Right-skewed distribution\n")
  cat("Interpretation: Most cars are fuel-efficient, but some gas-guzzlers pull the average up\n")
} else if(fuel_mean < fuel_median) {
  cat("ANSWER: Left-skewed distribution\n")
  cat("Interpretation: Most cars consume more fuel, with some very efficient models\n")
} else {
  cat("ANSWER: Approximately symmetric distribution\n")
  cat("Interpretation: Fuel consumption is evenly distributed around the center\n")
}
## ANSWER: Right-skewed distribution
## Interpretation: Most cars are fuel-efficient, but some gas-guzzlers pull the average up
cat("\nEXERCISE 2: PERCENTILE MASTERY\n")
##
## EXERCISE 2: PERCENTILE MASTERY
cat("Find the weight thresholds for different car categories:\n\n")
## Find the weight thresholds for different car categories:
# Calculate quartiles using base R
weight_quartiles <- quantile(cars$weight, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
print(weight_quartiles)
## 0% 25% 50% 75% 100%
## 623.00 1167.75 1369.50 1566.00 2072.00
cat("\nCAR WEIGHT CATEGORIES:\n")
##
## CAR WEIGHT CATEGORIES:
cat("Lightweight (bottom 25%): Under ", round(weight_quartiles[2], 0), " kg\n")
## Lightweight (bottom 25%): Under 1168 kg
cat("Standard weight (25-75%): ", round(weight_quartiles[2], 0), "-", round(weight_quartiles[4], 0), " kg\n")
## Standard weight (25-75%): 1168 - 1566 kg
cat("Heavyweight (top 25%): Over ", round(weight_quartiles[4], 0), " kg\n")
## Heavyweight (top 25%): Over 1566 kg
# Additional insights
cat("\nWeight Statistics:\n")
##
## Weight Statistics:
cat(" Lightest car: ", round(weight_quartiles[1], 0), " kg\n")
## Lightest car: 623 kg
cat(" Heaviest car: ", round(weight_quartiles[5], 0), " kg\n")
## Heaviest car: 2072 kg
cat(" Median weight: ", round(weight_quartiles[3], 0), " kg\n")
## Median weight: 1370 kg
cat("\nEXERCISE 3: OUTLIER INVESTIGATION\n")
##
## EXERCISE 3: OUTLIER INVESTIGATION
# Create boxplot using base R
boxplot(cars$acceleration,
main = "Acceleration Distribution",
ylab = "Acceleration (0-100 km/h in seconds)",
col = "orange")
cat("Investigation questions:\n")
## Investigation questions:
cat("1. Are there any acceleration outliers?\n")
## 1. Are there any acceleration outliers?
cat("2. What might cause extremely slow acceleration?\n")
## 2. What might cause extremely slow acceleration?
cat("3. How would outliers affect the mean vs median?\n")
## 3. How would outliers affect the mean vs median?
# Calculate statistics using base R
accel_min <- min(cars$acceleration, na.rm = TRUE)
accel_max <- max(cars$acceleration, na.rm = TRUE)
accel_range <- accel_max - accel_min
# Check for outliers
accel_outliers <- boxplot.stats(cars$acceleration)$out
cat("\nAcceleration Analysis:\n")
##
## Acceleration Analysis:
cat("Fastest: ", accel_min, " seconds\n")
## Fastest: 3.2 seconds
cat("Slowest: ", accel_max, " seconds\n")
## Slowest: 18.2 seconds
cat("Range: ", round(accel_range, 1), " seconds\n")
## Range: 15 seconds
cat("Number of outliers: ", length(accel_outliers), "\n")
## Number of outliers: 0
if(length(accel_outliers) > 0) {
  cat("Outlier values: ", paste(round(accel_outliers, 1), collapse = ", "), " seconds\n")
}
cat("BUSINESS SCENARIO 1: PRODUCT DEVELOPMENT\n")
## BUSINESS SCENARIO 1: PRODUCT DEVELOPMENT
cat("=", rep("=", 50), "\n", sep="")
## ===================================================
cat("You're developing a new car model. Use data to inform decisions:\n\n")
## You're developing a new car model. Use data to inform decisions:
# Market positioning analysis using base R
doors_table <- table(cars$n_doors_min)
doors_percent <- round(prop.table(doors_table) * 100, 1)
cat("DOOR CONFIGURATION STRATEGY:\n")
## DOOR CONFIGURATION STRATEGY:
doors_summary <- data.frame(
Doors = names(doors_table),
Count = as.numeric(doors_table),
Percentage = as.numeric(doors_percent)
)
print(doors_summary)
## Doors Count Percentage
## 1 2 17 8.9
## 2 3 22 11.6
## 3 4 9 4.7
## 4 5 141 74.2
## 5 7 1 0.5
# Find most popular configuration
most_popular_doors <- names(doors_percent)[which.max(doors_percent)]
max_percentage <- max(doors_percent)
# sep = "" keeps cat() from inserting spaces before "-door" and "%"
cat("Recommendation: Focus on ", most_popular_doors, "-door models (", max_percentage, "% market preference)\n\n", sep = "")
## Recommendation: Focus on 5-door models (74.2% market preference)
# Price positioning using base R
price_table <- table(cars$price_classes)
price_percent <- round(prop.table(price_table) * 100, 1)
cat("PRICE POSITIONING STRATEGY:\n")
## PRICE POSITIONING STRATEGY:
price_summary <- data.frame(
Price_Class = names(price_table),
Count = as.numeric(price_table),
Percentage = as.numeric(price_percent)
)
print(price_summary)
## Price_Class Count Percentage
## 1 low 50 26.3
## 2 mid 111 58.4
## 3 high 29 15.3
# Find largest segment
largest_segment <- names(price_percent)[which.max(price_percent)]
largest_percentage <- max(price_percent)
# sep = "" keeps cat() from inserting spaces before "-price" and "%"
cat("Recommendation: Target ", largest_segment, "-price segment (", largest_percentage, "% of market)\n", sep = "")
## Recommendation: Target mid-price segment (58.4% of market)
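The door and price analyses above repeat the same steps: tabulate, convert to percentages, assemble a data frame, and pick the largest category. As a sketch of how that pattern could be factored out, the helper below (`freq_summary` is a hypothetical name, and `doors_demo` is made-up data, not the `cars` dataset) wraps those steps in one function.

```r
# Hypothetical helper: builds the Count / Percentage table used in both analyses
freq_summary <- function(x, label = "Category") {
  counts <- table(x)
  out <- data.frame(
    names(counts),
    as.numeric(counts),
    round(as.numeric(prop.table(counts)) * 100, 1),
    stringsAsFactors = FALSE
  )
  names(out) <- c(label, "Count", "Percentage")
  out
}

# Made-up door counts for ten models
doors_demo <- c(5, 5, 3, 5, 2, 5, 4, 5, 3, 5)
doors_freq <- freq_summary(doors_demo, "Doors")
print(doors_freq)

# The row with the highest count is the most common configuration
top <- doors_freq[which.max(doors_freq$Count), ]
cat("Most common: ", top$Doors, "-door (", top$Percentage, "%)\n", sep = "")
```

The same call would work for a character column such as a price class, since `table()` handles both numeric and character input.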
cat("\nBUSINESS SCENARIO 2: COMPETITIVE ANALYSIS\n")
##
## BUSINESS SCENARIO 2: COMPETITIVE ANALYSIS
cat("=", rep("=", 50), "\n", sep="")
## ===================================================
# Performance benchmarking using base R
speed_quartiles <- quantile(cars$maxspeed, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
cat("SPEED BENCHMARKING:\n")
## SPEED BENCHMARKING:
cat("Entry level (25th percentile):", round(speed_quartiles[1], 0), "km/h\n")
## Entry level (25th percentile): 159 km/h
cat("Competitive (50th percentile):", round(speed_quartiles[2], 0), "km/h\n")
## Competitive (50th percentile): 178 km/h
cat("Premium (75th percentile):", round(speed_quartiles[3], 0), "km/h\n")
## Premium (75th percentile): 203 km/h
# The median is used as the "competitive" bar: half of the models are at or below it
cat("To be competitive, aim for at least", round(speed_quartiles[2], 0), "km/h\n")
## To be competitive, aim for at least 178 km/h
# Additional competitive insights
cat("\nACCELERATION BENCHMARKING:\n")
##
## ACCELERATION BENCHMARKING:
accel_quartiles <- quantile(cars$acceleration, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
# Lower times mean faster cars, so the 25th percentile marks the quick end of the range
cat("Sports performance (25th percentile): Under", round(accel_quartiles[1], 1), "seconds\n")
## Sports performance (25th percentile): Under 8.8 seconds
cat("Standard performance (50th percentile):", round(accel_quartiles[2], 1), "seconds\n")
## Standard performance (50th percentile): 10.9 seconds
cat("Economy performance (75th percentile):", round(accel_quartiles[3], 1), "seconds\n")
## Economy performance (75th percentile): 13 seconds
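The quartile benchmarks above describe the market as a whole. To place one specific candidate model against them, a sketch along the following lines may help. It assumes simulated speeds (`speed_demo`, generated here only for illustration) and a hypothetical candidate top speed of 190 km/h; the tier labels are borrowed from the benchmarking output, with an extra "Below entry level" tier added as an assumption.

```r
# Simulated top speeds for illustration only (not the lecture's cars dataset)
set.seed(42)
speed_demo <- round(rnorm(190, mean = 180, sd = 30))

# Quartile benchmarks, as in the speed benchmarking step
speed_q <- quantile(speed_demo, probs = c(0.25, 0.5, 0.75))

# Hypothetical candidate model
candidate_speed <- 190

# Share of models whose top speed is at or below the candidate's
share_at_or_below <- ecdf(speed_demo)(candidate_speed)

# Assign a tier: intervals are (-Inf, Q1], (Q1, median], (median, Q3], (Q3, Inf)
tier <- cut(candidate_speed,
            breaks = c(-Inf, speed_q, Inf),
            labels = c("Below entry level", "Entry level", "Competitive", "Premium"))

cat("Candidate speed:", candidate_speed, "km/h\n")
cat("Share of models at or below it:", round(share_at_or_below * 100, 1), "%\n")
cat("Tier:", as.character(tier), "\n")
```

Using `ecdf()` gives the candidate's exact percentile rank rather than only the quartile band it falls into.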
cat("EXECUTIVE DASHBOARD: KEY METRICS\n")
## EXECUTIVE DASHBOARD: KEY METRICS
cat("=", rep("=", 60), "\n", sep="")
## =============================================================
# Key Performance Indicators (KPIs) using base R
total_models <- nrow(cars)
german_models <- sum(cars$country == "Germany", na.rm = TRUE)
luxury_models <- sum(cars$price_classes == "high", na.rm = TRUE)
high_performance <- sum(cars$maxspeed > 200, na.rm = TRUE)
cat("MARKET OVERVIEW:\n")
## MARKET OVERVIEW:
cat(" Total Models Analyzed: ", total_models, "\n")
## Total Models Analyzed: 190
# sep = "" avoids the stray space that cat() would otherwise put before "%"
cat(" German Market Share: ", round(german_models/total_models*100, 1), "%\n", sep = "")
## German Market Share: 21.1%
cat(" Luxury Segment: ", round(luxury_models/total_models*100, 1), "%\n", sep = "")
## Luxury Segment: 15.3%
cat(" High-Performance Cars (>200 km/h): ", round(high_performance/total_models*100, 1), "%\n", sep = "")
## High-Performance Cars (>200 km/h): 28.4%
# Price insights using base R
price_min <- min(cars$price_num, na.rm = TRUE)
price_q1 <- quantile(cars$price_num, 0.25, na.rm = TRUE)
price_median <- median(cars$price_num, na.rm = TRUE)
price_q3 <- quantile(cars$price_num, 0.75, na.rm = TRUE)
price_max <- max(cars$price_num, na.rm = TRUE)
cat("\nPRICE INTELLIGENCE:\n")
##
## PRICE INTELLIGENCE:
# Note: price_num is simulated with rnorm(), so the minimum can come out negative.
# Real price data would need to be validated before reporting an entry price.
cat(" Market Entry Price: $", round(price_min, 0), "\n", sep = "")
## Market Entry Price: $-1860
cat(" Budget Threshold (25%): $", round(price_q1, 0), "\n", sep = "")
## Budget Threshold (25%): $13574
cat(" Market Median: $", round(price_median, 0), "\n", sep = "")
## Market Median: $24177
cat(" Premium Threshold (75%): $", round(price_q3, 0), "\n", sep = "")
## Premium Threshold (75%): $33195
cat(" Luxury Maximum: $", round(price_max, 0), "\n", sep = "")
## Luxury Maximum: $61570
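The negative "entry price" in the output above is a side effect of drawing prices from a normal distribution with a large standard deviation. A minimal sketch of a sanity check follows, assuming prices simulated the same way (`price_demo` is generated here for illustration and will not reproduce the lecture's exact values); it flags non-positive prices and summarises only the valid ones.

```r
# Simulated prices for illustration only, drawn like price_num in the setup code
set.seed(2024)
price_demo <- round(rnorm(190, mean = 25000, sd = 15000))

# Flag prices that cannot be real: zero or negative
invalid <- !is.na(price_demo) & price_demo <= 0
cat("Non-positive prices found:", sum(invalid), "\n")

# Keep only plausible prices before computing the dashboard figures
price_valid <- price_demo[!invalid]
cat("Models kept:", length(price_valid), "of", length(price_demo), "\n")
cat("Entry price after cleaning: $", min(price_valid), "\n", sep = "")
cat("Median after cleaning: $", median(price_valid), "\n", sep = "")
```

Whether to drop, cap, or re-simulate such values is a judgment call; dropping them is only the simplest option shown here.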
# Market segments analysis
cat("\nMARKET SEGMENTATION:\n")
##
## MARKET SEGMENTATION:
budget_cars <- sum(cars$price_num <= price_q1, na.rm = TRUE)
mid_cars <- sum(cars$price_num > price_q1 & cars$price_num <= price_q3, na.rm = TRUE)
premium_cars <- sum(cars$price_num > price_q3, na.rm = TRUE)
# Labels match the comparisons above: budget is <= Q1, mid is above Q1 up to Q3, premium is > Q3
cat(" Budget Market (<= $", round(price_q1, 0), "): ", round(budget_cars/total_models*100, 1), "%\n", sep = "")
## Budget Market (<= $13574): 25.3%
cat(" Mid Market ($", round(price_q1, 0), "-$", round(price_q3, 0), "): ", round(mid_cars/total_models*100, 1), "%\n", sep = "")
## Mid Market ($13574-$33195): 49.5%
cat(" Premium Market (>$", round(price_q3, 0), "): ", round(premium_cars/total_models*100, 1), "%\n", sep = "")
## Premium Market (>$33195): 25.3%
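The three segment counts above are built from hand-written comparisons against the quartiles. The same segmentation can be sketched with `cut()`, which assigns every price to exactly one interval and keeps the boundary rules in one place. The example assumes simulated prices (`price_demo`, for illustration only) and hypothetical segment labels.

```r
# Simulated prices for illustration only (not the lecture's cars dataset)
set.seed(7)
price_demo <- round(rnorm(190, mean = 25000, sd = 15000))

q1 <- quantile(price_demo, 0.25)
q3 <- quantile(price_demo, 0.75)

# right = TRUE gives (-Inf, Q1], (Q1, Q3], (Q3, Inf), matching <= Q1, > Q1 & <= Q3, > Q3
segment <- cut(price_demo,
               breaks = c(-Inf, q1, q3, Inf),
               labels = c("Budget", "Mid", "Premium"),
               right = TRUE)

segment_counts <- table(segment)
segment_share <- round(prop.table(segment_counts) * 100, 1)
print(segment_share)
```

Because each value falls into exactly one interval, the counts always add up to the number of models, which is harder to guarantee with separate hand-written conditions.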