Chapter 1: Introduction to Statistical Data Analysis

What You Will Learn Today

By the end of this lecture, you will master:

Data Type Recognition: How to identify and classify different types of variables
Visual Storytelling: Creating appropriate graphs that reveal data patterns
Distribution Analysis: Understanding the shape and characteristics of data distributions
Summary Statistics: Calculating and interpreting measures that describe data
Outlier Detection: Identifying unusual observations that require attention
Professional Reporting: Presenting statistical findings clearly and accurately

Chapter 2: Understanding Data Types - The Foundation of Analysis

Before we can analyze data, we must understand what type of data we’re working with. This is like a chef understanding ingredients before cooking - the technique depends entirely on what you’re working with.

The Data Type Hierarchy

Statistical Variables
├── Qualitative (Categorical)
│   ├── Nominal (No natural order)
│   │   └── Examples: Color, Brand, Country
│   └── Ordinal (Natural order exists)
│       └── Examples: Grade (A,B,C), Size (S,M,L)
└── Quantitative (Numerical)
    ├── Discrete (Countable)
    │   └── Examples: Number of children, Cars owned
    └── Continuous (Measurable)
        └── Examples: Height, Weight, Temperature
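As a rough sketch of how this hierarchy maps onto R's own types: nominal variables are usually stored as factors, ordinal ones as ordered factors, discrete counts as integers, and continuous measurements as doubles. The vectors below are made-up illustrations, not part of the cars dataset.

```r
# Hypothetical example values, one vector per branch of the hierarchy
color    <- factor(c("red", "blue", "red"))                     # nominal: no order
size     <- factor(c("S", "L", "M"),
                   levels = c("S", "M", "L"), ordered = TRUE)   # ordinal: S < M < L
children <- c(0L, 2L, 1L)                                       # discrete count
height   <- c(172.5, 181.0, 165.3)                              # continuous measurement

is.ordered(color)     # nominal categories cannot be ranked
is.ordered(size)      # ordinal categories can
size[1] < size[2]     # comparisons are meaningful for ordered factors
is.integer(children)
is.double(height)
```

The storage type is only a hint: a variable coded 0/1, such as bestselling below, is numeric in R but qualitative in meaning.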

Setting Up Our Analytical Environment

Let’s begin our journey by preparing our tools:

# Load the UBStats package - our Swiss Army knife for statistical analysis
library(UBStats)

## Package UBStats (0.2.2) loaded.
## To cite, type citation("UBStats")

## Please report improvements and bugs to: https://github.com/raffaellapiccarreta/UBStats/issues

# Create a simulated cars dataset
set.seed(123)  # For reproducible results
n <- 190

# Generate sales in three tiers; the last tier takes the remainder so the total is exactly n
low_sales <- sample(500:3000, round(n*0.6), replace = TRUE)
mid_sales <- sample(3000:8000, round(n*0.25), replace = TRUE) 
high_sales <- sample(8000:50000, n - length(low_sales) - length(mid_sales), replace = TRUE)
all_sales <- c(low_sales, mid_sales, high_sales)

cars <- data.frame(
  model = paste("Model", 1:n),
  sales = sample(all_sales),  # Shuffle the sales values
  bestselling = sample(0:1, n, replace = TRUE, prob = c(0.9, 0.1)),
  price_num = round(rnorm(n, 25000, 15000)),
  price_classes = sample(c("low", "mid", "high"), n, replace = TRUE, prob = c(0.27, 0.55, 0.18)),
  maxspeed = round(rnorm(n, 180, 30)),
  acceleration = round(rnorm(n, 11, 3), 1),
  urban_fuelcons = round(rnorm(n, 8, 2), 1),
  fueltank = round(rnorm(n, 60, 15)),
  weight = round(rnorm(n, 1400, 300)),
  n_doors_min = sample(c(2,3,4,5,7), n, replace = TRUE, prob = c(0.09, 0.14, 0.05, 0.71, 0.01)),
  country = sample(c("Germany", "Japan", "France", "Italy", "United States", "Europe - others", "Asia - others"), 
                   n, replace = TRUE, prob = c(0.26, 0.19, 0.15, 0.11, 0.09, 0.14, 0.06))
)

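# Shift implausibly low values upward
# Note: adding a constant does not guarantee a realistic result; a price drawn
# below -10000 stays negative (the minimum price reported later is -1860)
cars$price_num[cars$price_num < 5000] <- cars$price_num[cars$price_num < 5000] + 10000
cars$maxspeed[cars$maxspeed < 100] <- cars$maxspeed[cars$maxspeed < 100] + 50
cars$acceleration[cars$acceleration < 3] <- abs(cars$acceleration[cars$acceleration < 3]) + 5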

# Check the data
str(cars)

## 'data.frame':    190 obs. of  12 variables:
##  $ model         : chr  "Model 1" "Model 2" "Model 3" "Model 4" ...
##  $ sales         : int  12712 873 1528 23023 2956 1346 3711 2726 37652 6123 ...
##  $ bestselling   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ price_num     : num  32656 42939 8741 7823 27328 ...
##  $ price_classes : chr  "low" "mid" "low" "mid" ...
##  $ maxspeed      : num  169 137 153 214 158 194 171 205 177 159 ...
##  $ acceleration  : num  4.7 11 18.2 9.8 15 12.1 11.8 10.7 9 10.8 ...
##  $ urban_fuelcons: num  6.2 4.7 5.4 7.4 9.1 11.6 6 7 9.9 7.5 ...
##  $ fueltank      : num  65 46 87 39 85 62 60 40 72 79 ...
##  $ weight        : num  1445 1228 1499 1321 1585 ...
##  $ n_doors_min   : num  5 5 5 4 5 2 5 2 5 5 ...
##  $ country       : chr  "Japan" "Italy" "Italy" "Japan" ...

head(cars)

# Save the dataset (optional)
save(cars, file = "stat_datasets_cl17.Rdata")

# Get a feel for our dataset
head(cars, 10)  # First 10 rows

dim(cars)       # Dimensions: rows and columns

## [1] 190  12

names(cars)     # Variable names

##  [1] "model"          "sales"          "bestselling"    "price_num"     
##  [5] "price_classes"  "maxspeed"       "acceleration"   "urban_fuelcons"
##  [9] "fueltank"       "weight"         "n_doors_min"    "country"

Understanding Our Car Dataset

Imagine you’re a business analyst for a major European car manufacturer. Your dataset contains information about 190 different car models, and each variable tells a different part of the story:

Our Variables Explained:

model: Car model name (Qualitative Nominal)
sales: Annual sales in units (Quantitative Discrete; with counts this large it is usually binned and analyzed like a continuous variable)
bestselling: Best-selling status (1=Yes, 0=No) (Qualitative Nominal)
price_num: Price in dollars (Quantitative Continuous)
price_classes: Price category (low/mid/high) (Qualitative Ordinal)
maxspeed: Maximum speed in km/h (Quantitative Continuous)
acceleration: Time to reach 100 km/h (Quantitative Continuous)
urban_fuelcons: Urban fuel consumption (Quantitative Continuous)
fueltank: Fuel tank capacity in liters (Quantitative Continuous)
weight: Weight in kilograms (Quantitative Continuous)
n_doors_min: Number of doors (Quantitative Discrete)
country: Manufacturing country (Qualitative Nominal)

Chapter 3: Analyzing Qualitative Data - Finding Patterns in Categories

3.1 Nominal Variables: The Country Analysis Story

Let’s start our analysis by exploring where these cars are manufactured. This is our first detective work - uncovering geographical patterns in car production.

# Create a frequency distribution table
country_freq <- distr.table.x(cars$country)

##     cars$country Count Prop
##    Asia - others    14 0.07
##  Europe - others    38 0.20
##           France    33 0.17
##          Germany    40 0.21
##            Italy    19 0.10
##            Japan    33 0.17
##    United States    13 0.07
##            TOTAL   190 1.00

print(country_freq)

##        cars$country Count       Prop
## 1     Asia - others    14 0.07368421
## 2   Europe - others    38 0.20000000
## 3            France    33 0.17368421
## 4           Germany    40 0.21052632
## 5             Italy    19 0.10000000
## 6             Japan    33 0.17368421
## 7     United States    13 0.06842105
## Sum           TOTAL   190 1.00000000

💡 What This Table Tells Us:
- Count: How many car models come from each country or region
- Prop: Each category's share of all 190 models, as a decimal
- Each row: One manufacturing country or region

# Add percentage for easier interpretation
country_detailed <- distr.table.x(cars$country, freq=c("counts","prop", "perc"))

##     cars$country Count Prop Percent
##    Asia - others    14 0.07       7
##  Europe - others    38 0.20      20
##           France    33 0.17      17
##          Germany    40 0.21      21
##            Italy    19 0.10      10
##            Japan    33 0.17      17
##    United States    13 0.07       7
##            TOTAL   190 1.00     100

print(country_detailed)

##        cars$country Count       Prop    Percent
## 1     Asia - others    14 0.07368421   7.368421
## 2   Europe - others    38 0.20000000  20.000000
## 3            France    33 0.17368421  17.368421
## 4           Germany    40 0.21052632  21.052632
## 5             Italy    19 0.10000000  10.000000
## 6             Japan    33 0.17368421  17.368421
## 7     United States    13 0.06842105   6.842105
## Sum           TOTAL   190 1.00000000 100.000000
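For readers who want to cross-check these tables without UBStats, the same counts, proportions, and percentages can be sketched in base R. The counts are typed in from the table above so the snippet runs on its own; `country_counts` and `country_base` are names introduced here for illustration.

```r
# Counts copied from the frequency table above
country_counts <- c("Asia - others" = 14, "Europe - others" = 38,
                    "France" = 33, "Germany" = 40, "Italy" = 19,
                    "Japan" = 33, "United States" = 13)

country_prop <- country_counts / sum(country_counts)   # relative frequencies
country_base <- data.frame(Count   = country_counts,
                           Prop    = round(country_prop, 2),
                           Percent = round(country_prop * 100, 1))
print(country_base)
```

With the raw data available, `table(cars$country)` and `prop.table()` should give the same counts and proportions directly.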

Visual Storytelling: Pie Charts vs Bar Charts

# Pie chart - showing the "whole pie" of car production
distr.plot.x(cars$country, plot.type = "pie")

When to Use Pie Charts:
- When showing parts of a whole
- When proportions are the main message
- Maximum 5-7 categories for clarity

# Bar chart - better for comparing specific values
distr.plot.x(cars$country, plot.type = "bars")

When to Use Bar Charts:
- When comparing frequencies between categories
- When you have many categories
- When exact values matter more than proportions

🔍 Business Insight Discovery

# Calculate the share of European car models from the data
# European categories: Europe - others, France, Germany, Italy
european_countries <- c("Europe - others", "France", "Germany", "Italy")

# From our table: 38 + 33 + 40 + 19 = 130 of 190 models
european_percentage <- round(mean(cars$country %in% european_countries) * 100, 1)
cat("🇪🇺 European Car Models:", european_percentage, "%\n")

## 🇪🇺 European Car Models: 68.4 %

cat("🥇 Top Manufacturing Country: Germany (21%)\n")

## 🥇 Top Manufacturing Country: Germany (21%)

cat("🌏 Non-European Production:", 100 - european_percentage, "%\n")

## 🌏 Non-European Production: 31.6 %

3.2 Ordinal Variables: The Price Class Hierarchy

Now let’s examine price categories - here, order matters! Low < Mid < High.

# Convert to factor with correct logical order
cars$price_classes <- factor(cars$price_classes, 
                            levels=c("low","mid","high"))

# Verify the ordering worked
levels(cars$price_classes)

## [1] "low"  "mid"  "high"

# Frequency distribution
price_class_freq <- distr.table.x(cars$price_classes)

##  cars$price_classes Count Prop
##                 low    50 0.26
##                 mid   111 0.58
##                high    29 0.15
##               TOTAL   190 1.00

print(price_class_freq)

##     cars$price_classes Count      Prop
## 1                  low    50 0.2631579
## 2                  mid   111 0.5842105
## 3                 high    29 0.1526316
## Sum              TOTAL   190 1.0000000

# Visual representation
distr.plot.x(cars$price_classes, plot.type = "bars")

📊 Key Business Question: Market Accessibility

# What percentage of cars are priced at or below 'mid' class?
# This tells us about market accessibility
# From our table: low (50) + mid (111) = 161 of 190 models
affordable_percentage <- round(mean(cars$price_classes %in% c("low", "mid")) * 100, 1)
cat("💰 Affordable Cars (Low + Mid price):", affordable_percentage, "%\n")

## 💰 Affordable Cars (Low + Mid price): 84.7 %

cat("💎 Luxury Cars (High price):", 100 - affordable_percentage, "%\n")

## 💎 Luxury Cars (High price): 15.3 %

Business Interpretation:
- About 85% of car models are priced in the low or mid categories
- Only about 15% are in the luxury segment
- This suggests a market focus on accessibility rather than exclusivity
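Because the levels are ordered, "at or below mid" is a cumulative proportion. A minimal base R sketch, with the counts copied from the price class frequency table above (`price_counts` and `price_cum` are illustrative names):

```r
# Counts in level order: low < mid < high
price_counts <- c(low = 50, mid = 111, high = 29)

# Running share of models at or below each price class
price_cum <- cumsum(price_counts) / sum(price_counts)
round(price_cum * 100, 1)

# price_cum["mid"] is the share of models priced at or below 'mid'
```

Cumulative shares only make sense for ordinal or quantitative variables; for a nominal variable such as country, the order of the categories is arbitrary.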

Chapter 4: Analyzing Discrete Quantitative Data - Counting What Matters

Discrete variables are like counting objects - you can have 2 doors or 3 doors, but not 2.5 doors.

4.1 The Door Configuration Analysis

# Frequency distribution for number of doors
doors_freq <- distr.table.x(cars$n_doors_min)

##  cars$n_doors_min Count Prop
##                 2    17 0.09
##                 3    22 0.12
##                 4     9 0.05
##                 5   141 0.74
##                 7     1 0.01
##             TOTAL   190 1.00

print(doors_freq)

##     cars$n_doors_min Count        Prop
## 1                  2    17 0.089473684
## 2                  3    22 0.115789474
## 3                  4     9 0.047368421
## 4                  5   141 0.742105263
## 5                  7     1 0.005263158
## Sum            TOTAL   190 1.000000000

Visualizing Discrete Data: The Spike Plot

# Spike plot - perfect for discrete data
distr.plot.x(cars$n_doors_min, plot.type="spike", freq="prop")

Why Spike Plots for Discrete Data?
- Each spike represents an exact value
- Height shows frequency/proportion
- No artificial binning needed
- Clear visual separation between values

Cumulative Analysis: The Running Total Story

# Cumulative frequency plot
distr.plot.x(cars$n_doors_min, plot.type="cumulative", freq="prop")

Reading Cumulative Plots:
- X-axis: Number of doors
- Y-axis: Cumulative proportion (running total)
- Shows “what percentage have X doors or fewer”
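The values behind the cumulative plot can be sketched in base R from the doors frequency table above. The complement rule then answers "more than X doors" questions. The names `doors_counts` and `doors_cum` are introduced here for illustration.

```r
# Counts copied from the doors frequency table
doors_counts <- c("2" = 17, "3" = 22, "4" = 9, "5" = 141, "7" = 1)

# F(x) = proportion of models with x doors or fewer
doors_cum <- cumsum(doors_counts) / sum(doors_counts)
round(doors_cum, 3)

# Complement rule: share with more than 4 doors = 1 - F(4)
share_more_than_4 <- 1 - doors_cum[["4"]]
round(share_more_than_4 * 100, 1)
```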

🚗 Practical Business Question

# How many car models have more than 4 doors?
# From our frequency table: 5-door (141) + 7-door (1) = 142
cars_more_than_4_doors <- sum(cars$n_doors_min > 4)
cat("🚪 Car models with more than 4 doors:", cars_more_than_4_doors, "models\n")

## 🚪 Car models with more than 4 doors: 142 models

cat("📊 This represents:", round(cars_more_than_4_doors/nrow(cars)*100, 1), "% of all models\n")

## 📊 This represents: 74.7 % of all models

Chapter 5: Analyzing Continuous Data - The Art of Histograms

Continuous data requires binning - grouping similar values together to reveal patterns. This is like organizing a library: individual books (data points) are grouped into sections (bins) to see the overall collection structure.

5.1 Fuel Tank Capacity: A Distribution Story

# Create histogram with 5 equal-width classes
distr.plot.x(cars$fueltank, plot.type = "hist", breaks = 5)

# Corresponding frequency table
fuel_table <- distr.table.x(cars$fueltank, breaks = 5)

##  cars$fueltank Count Prop
##        [25,40)    14 0.07
##        [40,55)    49 0.26
##        [55,70)    68 0.36
##        [70,85)    49 0.26
##       [85,100]    10 0.05
##          TOTAL   190 1.00

print(fuel_table)

##     cars$fueltank Count       Prop
## 1         [25,40)    14 0.07368421
## 2         [40,55)    49 0.25789474
## 3         [55,70)    68 0.35789474
## 4         [70,85)    49 0.25789474
## 5        [85,100]    10 0.05263158
## Sum         TOTAL   190 1.00000000

🔍 Understanding Histogram Components

Interval Notation Explained:
- [25,40): Includes 25, excludes 40
- [40,55): Includes 40, excludes 55
- [85,100]: Includes both 85 and 100 (last interval)

Distribution Characteristics:
- Peak: The modal class is 55-70 liters, which holds 36% of cars
- Shape: Approximately symmetric (49 cars in each class on either side of the peak)
- Range: From 25 liters to 100 liters
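The left-closed, right-open convention can be reproduced with base R's `cut()`. This is a sketch of the notation only, using a few hypothetical boundary values rather than the fuel tank data; whether UBStats calls `cut()` internally is not something the lecture states.

```r
tank_breaks <- c(25, 40, 55, 70, 85, 100)
tank_values <- c(25, 39.9, 40, 84.9, 85, 100)   # hypothetical boundary cases

# right = FALSE gives [a, b) intervals; include.lowest = TRUE closes the last one
tank_class <- cut(tank_values, breaks = tank_breaks,
                  right = FALSE, include.lowest = TRUE)
data.frame(value = tank_values, class = tank_class)
```

Note how 40 lands in [40,55) rather than [25,40), while 100 is still captured because the final interval is closed on both sides.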

📈 Practical Business Calculation

# What percentage of cars have fuel tanks between 40 and 85 liters?
# [40,55) + [55,70) + [70,85) = 49 + 68 + 49 = 166 of 190 cars
fuel_40_85_percent <- round(mean(cars$fueltank >= 40 & cars$fueltank < 85) * 100, 1)
cat("⛽ Cars with fuel tank 40-85 liters:", fuel_40_85_percent, "%\n")

## ⛽ Cars with fuel tank 40-85 liters: 87.4 %

cat("💡 This covers the vast majority of the market!\n")

## 💡 This covers the vast majority of the market!

5.2 The Sales Distribution Challenge

Sometimes equal-width bins don’t tell the full story. Let’s see why:

# Sales distribution with 8 equal-width classes
distr.plot.x(cars$sales, plot.type = "hist", breaks=8)

# The corresponding table reveals the problem
distr.table.x(cars$sales, breaks=8)

##     cars$sales Count Prop
##     [512,6610)   151 0.79
##   [6610,12708)    15 0.08
##  [12708,18806)     8 0.04
##  [18806,24904)     1 0.01
##  [24904,31002)     4 0.02
##  [31002,37100)     5 0.03
##  [37100,43198)     4 0.02
##  [43198,49296]     2 0.01
##          TOTAL   190 1.00

❗ The Problem with Equal-Width Bins:
- 79% of cars fall in the first bin [512, 6610)
- The remaining seven bins share only 39 observations
- We lose detail about the majority of the data
- The visualization is not informative

💡 The Solution: Custom Bin Widths

# Use custom breaks that make business sense
custom_breaks <- c(0, 2000, 5000, 20000, 160000)
distr.plot.x(cars$sales, plot.type = "hist", breaks = custom_breaks)

# Much more informative frequency table
sales_custom <- distr.table.x(cars$sales, breaks = custom_breaks, freq=c("count","prop","dens"))

##      cars$sales Count Prop   Density
##        [0,2000)    68 0.36 0.0001789
##     [2000,5000)    61 0.32 0.0001070
##    [5000,20000)    45 0.24 0.0000158
##  [20000,160000]    16 0.08 0.0000006
##           TOTAL   190 1.00

print(sales_custom)

##         cars$sales Count       Prop      Density
## 1         [0,2000)    68 0.35789474 1.789474e-04
## 2      [2000,5000)    61 0.32105263 1.070175e-04
## 3     [5000,20000)    45 0.23684211 1.578947e-05
## 4   [20000,160000]    16 0.08421053 6.015038e-07
## Sum          TOTAL   190 1.00000000           NA
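The Density column is each bin's proportion divided by its width. With unequal bins, a histogram draws density as bar height so that bar area, not height, represents the share of observations. A minimal sketch that recomputes the column from the counts and breaks shown above (variable names are illustrative):

```r
sales_breaks <- c(0, 2000, 5000, 20000, 160000)
sales_counts <- c(68, 61, 45, 16)        # counts copied from the table above

bin_width <- diff(sales_breaks)                  # 2000, 3000, 15000, 140000
bin_prop  <- sales_counts / sum(sales_counts)    # relative frequencies
bin_dens  <- bin_prop / bin_width                # bar height in a density histogram

signif(bin_dens, 7)
sum(bin_dens * bin_width)                        # bar areas add up to 1
```

This is why the last bin, although it holds 8% of the models, is almost invisible in the plot: its share is spread over a width of 140,000 units.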

🎯 Business Insights from Custom Bins:
- Low volume [0, 2000): 36% of models
- Medium volume [2000, 5000): 32% of models
- High volume [5000, 20000): 24% of models
- Very high volume [20000, 160000]: 8% of models

📊 Advanced Question: Interpolation

# Approximate percentage of cars with sales between 1000 and 3000 units
# This requires interpolation within bins

# [0,2000) contains 68 cars (36%)
# We need 1000 to 2000 (half the bin) = about 18%
# [2000,5000) contains 61 cars (32%)
# We need 2000 to 3000 (1/3 of the bin) = about 11%
# Total approximation: 18% + 11% = 29%

cat("📈 Estimated cars with sales 1000-3000 units: ~29%\n")

## 📈 Estimated cars with sales 1000-3000 units: ~29%

cat("🔍 This is an approximation using linear interpolation\n")

## 🔍 This is an approximation using linear interpolation
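The interpolation can be written as a small helper that, for each bin, takes the fraction of the bin's width overlapping the requested range. This is a sketch under the assumption that values are spread evenly within each bin; `interpolate_share` is a hypothetical name, not a UBStats function.

```r
# Share of observations in [lower, upper], assuming a uniform spread inside each bin
interpolate_share <- function(lower, upper, breaks, counts) {
  left  <- breaks[-length(breaks)]    # left edge of each bin
  right <- breaks[-1]                 # right edge of each bin
  overlap <- pmax(0, pmin(upper, right) - pmax(lower, left))
  sum(counts / sum(counts) * overlap / (right - left))
}

sales_breaks <- c(0, 2000, 5000, 20000, 160000)
sales_counts <- c(68, 61, 45, 16)     # counts copied from the custom-bin table

share_1000_3000 <- interpolate_share(1000, 3000, sales_breaks, sales_counts)
round(share_1000_3000 * 100, 1)
```

When the raw data are available, `mean(cars$sales >= 1000 & cars$sales <= 3000)` gives the exact share, so the interpolated figure is mainly useful when only a binned table is published.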

Chapter 6: Understanding Distribution Shapes - The Three Personalities

Every dataset has a personality revealed through its shape. Learning to read these personalities is crucial for proper analysis.

6.1 The Three Distribution Personalities

🎯 Symmetric Distribution: The Balanced Personality

Data balanced around center
Mean ≈ Median ≈ Mode
Bell-shaped appearance
Most observations near center

➡️ Right-Skewed (Positively Skewed): The Long Right Tail

  /\
 /  \___
/      \___
          \___

Tail extends to the right
Mean > Median > Mode
Most values concentrated on left
Few extreme high values

⬅️ Left-Skewed (Negatively Skewed): The Long Left Tail

        /\
    ___/  \
___/       \
           \

Tail extends to the left
Mode > Median > Mean
Most values concentrated on right
Few extreme low values
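A quick numerical illustration of the mean and median orderings above, using a small made-up vector with a long right tail and its mirror image. These are rules of thumb that hold for typical unimodal data; they are not guaranteed for every dataset.

```r
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 10, 25)   # hypothetical values, long right tail
left_skewed  <- 26 - right_skewed                # mirror image, long left tail

c(mean = mean(right_skewed), median = median(right_skewed))   # mean pulled above median
c(mean = mean(left_skewed),  median = median(left_skewed))    # mean pulled below median
```

The two extreme values (10 and 25) move the mean noticeably but leave the median untouched, which is why the median is preferred as the "typical" value for skewed data.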

6.2 Real Examples from Our Data

# Example 1: Fuel tank (approximately symmetric)
distr.plot.x(cars$fueltank, plot.type = "hist", breaks = 6)

# Example 2: Sales (heavily right-skewed)
distr.plot.x(cars$sales, plot.type = "hist", breaks = custom_breaks)

# Example 3: Acceleration (approximately symmetric: mean 10.84 vs median 10.9)
distr.plot.x(cars$acceleration, plot.type = "hist", breaks = 6)

Chapter 7: Cumulative Distributions - The Running Total Story

Cumulative distributions answer “what percentage of observations are at or below a certain value?”

7.1 The Price Ogive Analysis

# Create ogive (cumulative frequency curve) for price
distr.plot.x(cars$price_num, plot.type="cumulative", breaks = 10, freq = "prop")

🎯 Reading the Ogive: A Critical Business Question

Question: “Is the minimum price of the top 20% most expensive car models greater than $40,000?”

How to Read the Ogive:
1. Top 20% most expensive = 80th percentile
2. Find 0.8 on Y-axis (80% cumulative)
3. Draw horizontal line to curve
4. Drop vertical line to X-axis
5. Read the price value

cat("🔍 Ogive Reading Exercise:\n")

## 🔍 Ogive Reading Exercise:

cat("📊 At 80% cumulative frequency (80th percentile):\n")

## 📊 At 80% cumulative frequency (80th percentile):

cat("💰 Price is approximately $40,000\n")

## 💰 Price is approximately $40,000

cat("✅ Therefore, the statement is approximately CORRECT\n")

## ✅ Therefore, the statement is approximately CORRECT

cat("📈 The minimum price for top 20% expensive cars ≈ $40,000\n")

## 📈 The minimum price for top 20% expensive cars ≈ $40,000
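A graphical reading is only approximate, so it is worth confirming numerically. The sketch below shows the two equivalent checks on a small hypothetical price vector (not the lecture data): the 80th percentile, and the share of models above the $40,000 threshold.

```r
# Hypothetical prices, used only to illustrate the check
prices <- c(12000, 15000, 18000, 22000, 25000,
            28000, 33000, 38000, 42000, 55000)

p80 <- quantile(prices, probs = 0.8, names = FALSE)   # price at 0.8 on the Y-axis
share_above_40k <- mean(prices > 40000)               # share of models above $40,000

p80
share_above_40k
```

With the lecture data, replacing `prices` by `cars$price_num` gives the exact 80th percentile to compare against the value read from the ogive.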

Chapter 8: Descriptive Statistics - Summarizing Data with Numbers

Numbers tell stories too. Let’s learn to calculate and interpret the key statistics that describe our data.

8.1 Central Tendency: Finding the “Typical” Value

🎯 Mean, Median, and Mode for Price

# Calculate central tendency measures for price
price_central <- distr.summary.x(cars$price_num, stats="central")

##    n n.a  mode n.modes  mode% median     mean
##  190   0 26809       1 0.0105  24177 24267.47

print(price_central)

## $`Central tendency measures`
##     n n.a  mode n.modes      mode% median     mean
## 1 190   0 26809       1 0.01052632  24177 24267.47

Interpreting the Results:
- Mean: $24,267 (arithmetic average: sum ÷ count)
- Median: $24,177 (middle value when sorted)
- Mode: $26,809 (most frequent value; a mode% of 0.0105 means it appears only 2 times in 190, so the mode says little for a continuous variable)

# Analyze the relationship between mean and median
mean_price <- mean(cars$price_num)
median_price <- median(cars$price_num)

cat("📊 Distribution Shape Analysis:\n")

## 📊 Distribution Shape Analysis:

cat("💰 Mean Price: $", round(mean_price, 2), "\n")

## 💰 Mean Price: $ 24267.47

cat("🎯 Median Price: $", round(median_price, 2), "\n")

## 🎯 Median Price: $ 24177

cat("📈 Mean - Median = $", round(mean_price - median_price, 2), "\n")

## 📈 Mean - Median = $ 90.47

cat("🔍 Mean and median differ by under 0.5%: APPROXIMATELY SYMMETRIC distribution\n")

## 🔍 Mean and median differ by under 0.5%: APPROXIMATELY SYMMETRIC distribution

cat("💡 A gap this small gives no evidence of skewness\n")

## 💡 A gap this small gives no evidence of skewness
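Base R has no built-in function for the statistical mode (`mode()` reports the storage type of an object), so a small helper is a common workaround. The function name `stat_mode` is an illustration, not part of UBStats. The example rebuilds the doors variable from its frequency table, since the mode is most informative for discrete data.

```r
# Returns every most-frequent value, its count, and its share of observations
stat_mode <- function(x) {
  freq <- table(x)
  modes <- names(freq)[freq == max(freq)]   # keep ties: there may be several modes
  list(mode = modes, count = max(freq), share = max(freq) / length(x))
}

# Doors variable rebuilt from the frequency table in Chapter 4
doors <- rep(c(2, 3, 4, 5, 7), times = c(17, 22, 9, 141, 1))
stat_mode(doors)
```

For a continuous variable such as price, almost every value is unique, so the modal class of a histogram is usually a better summary than a single modal value.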

8.2 Percentiles and Quartiles: Dividing the Data

Quartiles divide data into four equal parts, like cutting a pizza into quarters.

# Get quartiles for price analysis
price_quartiles <- distr.summary.x(cars$price_num, stats="quartiles")

##    n n.a   min      p25   p50      p75   max
##  190   0 -1860 13574.25 24177 33195.25 61570

print(price_quartiles)

## $Quartiles
##     n n.a   min      p25   p50      p75   max
## 1 190   0 -1860 13574.25 24177 33195.25 61570

🏆 Business Intelligence from Quartiles

# Quartiles and percentiles with base R
price_quartiles <- quantile(cars$price_num, 
                           probs = c(0, 0.25, 0.5, 0.75, 1), 
                           na.rm = TRUE)

# The quantile() function returns a named vector, access by index or name
cat("\n=== PRICE QUARTILES ===\n")

## 
## === PRICE QUARTILES ===

cat("Minimum: $", round(price_quartiles[1], 0), "\n")

## Minimum: $ -1860

cat("Q1 (25th percentile): $", round(price_quartiles[2], 0), "\n")

## Q1 (25th percentile): $ 13574

cat("Median (50th percentile): $", round(price_quartiles[3], 0), "\n")

## Median (50th percentile): $ 24177

cat("Q3 (75th percentile): $", round(price_quartiles[4], 0), "\n")

## Q3 (75th percentile): $ 33195

cat("Maximum: $", round(price_quartiles[5], 0), "\n")

## Maximum: $ 61570

# Market segmentation based on the quartiles
cat("\n🎯 Price Market Segmentation:\n")

## 
## 🎯 Price Market Segmentation:

cat("💎 Luxury Segment (Top 25%): Above $", round(price_quartiles[4], 0), "\n")  # element 4 is the 75th percentile

## 💎 Luxury Segment (Top 25%): Above $ 33195

cat("🔶 Premium Segment (50-75%): $", round(price_quartiles[3], 0), " - $", round(price_quartiles[4], 0), "\n")

## 🔶 Premium Segment (50-75%): $ 24177  - $ 33195

cat("🔸 Mid-market (25-50%): $", round(price_quartiles[2], 0), " - $", round(price_quartiles[3], 0), "\n")

## 🔸 Mid-market (25-50%): $ 13574  - $ 24177

cat("💚 Budget Segment (Bottom 25%): Below $", round(price_quartiles[2], 0), "\n")

## 💚 Budget Segment (Bottom 25%): Below $ 13574
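Once the quartiles are known, each model can be assigned to its segment with `cut()`. The sketch below uses a short hypothetical price vector so that it runs on its own; the segment labels follow the ones used above.

```r
prices <- c(9000, 12000, 15000, 18000, 22000, 26000, 30000, 35000)  # hypothetical

# Quartile boundaries, then one label per quarter of the data
q <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))
segment <- cut(prices, breaks = q, include.lowest = TRUE,
               labels = c("Budget", "Mid-market", "Premium", "Luxury"))

data.frame(price = prices, segment = segment)
table(segment)
```

`include.lowest = TRUE` matters here: without it the cheapest model would fall outside the first interval and be returned as NA. If two quartiles coincide, `cut()` stops with an error because the breaks are not unique.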

🏁 Performance Analysis: The Need for Speed

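# 90th percentiles of maximum speed and acceleration time
speed_p90 <- distr.summary.x(cars$maxspeed, stats="p90")

##    n n.a   p90
##  190   0 218.5

accel_p90 <- distr.summary.x(cars$acceleration, stats="p90")

##    n n.a   p90
##  190   0 14.41

# distr.summary.x() returns a list of tables, so speed_p90$p90 is NULL;
# take the values with quantile() instead
speed_p90_value <- quantile(cars$maxspeed, 0.9, names = FALSE)
accel_p90_value <- quantile(cars$acceleration, 0.9, names = FALSE)

cat("🏎️ 90TH PERCENTILE PERFORMANCE THRESHOLDS:\n")

## 🏎️ 90TH PERCENTILE PERFORMANCE THRESHOLDS:

cat("⚡ Minimum speed for top 10%: ", speed_p90_value, " km/h\n")

## ⚡ Minimum speed for top 10%:  218.5  km/h

# A lower acceleration time is better, so the 90th percentile marks the slowest 10%
cat("🚀 Acceleration time exceeded only by the slowest 10%: ", accel_p90_value, " seconds\n")

## 🚀 Acceleration time exceeded only by the slowest 10%:  14.41  seconds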

8.3 The Five-Number Summary: Complete Picture

# Five-number summary for acceleration
accel_summary <- distr.summary.x(cars$acceleration, stats="fivenumber")

##    n n.a min   q1 median    q3  max
##  190   0 3.2 8.83   10.9 12.97 18.2

print(accel_summary)

## $`Five number summary`
##     n n.a min    q1 median     q3  max
## 1 190   0 3.2 8.825   10.9 12.975 18.2

# distr.summary.x() returns a list, so the values are read from its
# "Five number summary" table rather than with accel_summary$min
accel_five <- accel_summary[["Five number summary"]]

cat("\n🏁 ACCELERATION PERFORMANCE BREAKDOWN:\n")

## 
## 🏁 ACCELERATION PERFORMANCE BREAKDOWN:

cat("🥇 Fastest car: ", accel_five$min, " seconds (0-100 km/h)\n")

## 🥇 Fastest car:  3.2  seconds (0-100 km/h)

cat("📊 Q1 (25th percentile): ", accel_five$q1, " seconds\n")

## 📊 Q1 (25th percentile):  8.825  seconds

cat("🎯 Median (50th percentile): ", accel_five$median, " seconds\n")

## 🎯 Median (50th percentile):  10.9  seconds

cat("📊 Q3 (75th percentile): ", accel_five$q3, " seconds\n")

## 📊 Q3 (75th percentile):  12.975  seconds

cat("🐌 Slowest car: ", accel_five$max, " seconds\n")

## 🐌 Slowest car:  18.2  seconds

Chapter 9: Boxplots - The Swiss Army Knife of Data Visualization

Boxplots pack an incredible amount of information into a simple graphic. They’re like a data summary in visual form.

9.1 Anatomy of a Boxplot

# Create boxplot for maximum speed
distr.plot.x(cars$maxspeed, plot.type = "boxplot")

📦 Boxplot Components Explained

    outlier  •
             |
    whisker  |---- Maximum within 1.5×IQR of Q3
             |
       Q3    ┌────┐
             │    │  ← IQR (Interquartile Range)
   Median    ├────┤  ← Dark line inside box
             │    │
       Q1    └────┘
             |
    whisker  |---- Minimum within 1.5×IQR of Q1
             |
    outlier  •

🔍 Outlier Detection

# Check for outliers in maximum speed
cat("🚨 OUTLIER ANALYSIS FOR MAXIMUM SPEED:\n")

## 🚨 OUTLIER ANALYSIS FOR MAXIMUM SPEED:

cat("📊 Any points beyond the whiskers are outliers\n")

## 📊 Any points beyond the whiskers are outliers

cat("⚡ These represent cars with unusually high speeds\n")

## ⚡ These represent cars with unusually high speeds

cat("🏎️ Could be supercars or sports cars\n")

## 🏎️ Could be supercars or sports cars

cat("🔍 Outliers require special attention in analysis\n")

## 🔍 Outliers require special attention in analysis

9.2 Reading Distribution Shape from Boxplots

# Compare different variables
par(mfrow=c(2,2))  # 2x2 grid of plots
distr.plot.x(cars$acceleration, plot.type = "boxplot", main="Acceleration")

distr.plot.x(cars$price_num, plot.type = "boxplot", main="Price")

distr.plot.x(cars$maxspeed, plot.type = "boxplot", main="Max Speed")

distr.plot.x(cars$weight, plot.type = "boxplot", main="Weight")

par(mfrow=c(1,1))  # Reset to single plot

📊 Shape Recognition Guide

Symmetric Distribution:
- Median line centered in box
- Equal whisker lengths
- No skewness visible

Right-Skewed Distribution:
- Median closer to Q1 (left side of box)
- Right whisker longer than left
- Outliers on right side

Left-Skewed Distribution:
- Median closer to Q3 (right side of box)
- Left whisker longer than right
- Outliers on left side
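The same checks can be made numerically from a five-number summary. A sketch using the acceleration values from Section 8.3 (typed in here so the snippet is self-contained):

```r
# Acceleration five-number summary from Section 8.3
five <- c(min = 3.2, q1 = 8.825, median = 10.9, q3 = 12.975, max = 18.2)

lower_box  <- five[["median"]] - five[["q1"]]   # median to lower edge of box
upper_box  <- five[["q3"]] - five[["median"]]   # median to upper edge of box
lower_tail <- five[["q1"]] - five[["min"]]      # lower edge of box to minimum
upper_tail <- five[["max"]] - five[["q3"]]      # upper edge of box to maximum

round(c(lower_box = lower_box, upper_box = upper_box,
        lower_tail = lower_tail, upper_tail = upper_tail), 3)
```

The two halves of the box are equal and the tails are of similar length, which points to an approximately symmetric distribution. The tail lengths equal the whisker lengths here only because both extremes lie within 1.5×IQR of the box.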

Chapter 10: Case Study - Complete Market Analysis

Let’s put everything together in a comprehensive business analysis.

10.1 Executive Summary Generation

cat("🚗 AUTOMOTIVE MARKET ANALYSIS REPORT\n")

## 🚗 AUTOMOTIVE MARKET ANALYSIS REPORT

cat("=", rep("=", 45), "\n", sep="")

## ==============================================

# Basic dataset info
cat("📊 Dataset: ", nrow(cars), " car models analyzed\n")

## 📊 Dataset:  190  car models analyzed

# Geographic distribution - Calculate actual percentages from our data
country_table <- table(cars$country)
country_percent <- round(prop.table(country_table) * 100, 1)

cat("\n🌍 GEOGRAPHIC DISTRIBUTION:\n")

## 
## 🌍 GEOGRAPHIC DISTRIBUTION:

germany_pct <- country_percent["Germany"]
europe_countries <- c("Europe - others", "France", "Germany", "Italy")
# Sum the unrounded shares and round once when printing, to avoid accumulating rounding error
country_share <- prop.table(country_table) * 100
european_pct <- sum(country_share[names(country_share) %in% europe_countries])
asia_countries <- c("Japan", "Asia - others")
asia_pct <- sum(country_share[names(country_share) %in% asia_countries])
us_pct <- country_percent["United States"]

cat("🇩🇪 Germany leads with ", germany_pct, "% market share\n")

## 🇩🇪 Germany leads with  21.1 % market share

cat("🇪🇺 European brands dominate: ", round(european_pct, 1), "% of models\n")

## 🇪🇺 European brands dominate:  68.4 % of models

cat("🌏 Asia (Japan + others): ", round(asia_pct, 1), "% of models\n")

## 🌏 Asia (Japan + others):  24.7 % of models

cat("🇺🇸 US brands: ", us_pct, "% of models\n")

## 🇺🇸 US brands:  6.8 % of models

# Price analysis - Using BASE R functions only
price_mean <- mean(cars$price_num, na.rm = TRUE)
price_median <- median(cars$price_num, na.rm = TRUE)

cat("\n💰 PRICE ANALYSIS:\n")

## 
## 💰 PRICE ANALYSIS:

cat("📈 Average price: $", round(price_mean, 0), "\n")

## 📈 Average price: $ 24267

cat("🎯 Median price: $", round(price_median, 0), "\n")

## 🎯 Median price: $ 24177

cat("📊 Distribution: Approximately symmetric (mean and median nearly equal)\n")

## 📊 Distribution: Approximately symmetric (mean and median nearly equal)

# Calculate budget-friendly percentage from price classes
price_class_table <- table(cars$price_classes)
price_class_percent <- round(prop.table(price_class_table) * 100, 1)
budget_friendly <- sum(price_class_percent[c("low", "mid")])
cat("💚 Budget-friendly focus: ", round(budget_friendly, 0), "% priced low-to-mid range\n")

## 💚 Budget-friendly focus:  85 % priced low-to-mid range

# Performance insights - Using BASE R
speed_90 <- quantile(cars$maxspeed, 0.9, na.rm = TRUE)
accel_90 <- quantile(cars$acceleration, 0.9, na.rm = TRUE)

cat("\n🏎️ PERFORMANCE INSIGHTS:\n")

## 
## 🏎️ PERFORMANCE INSIGHTS:

cat("⚡ Top 10% speed threshold: ", round(speed_90, 0), " km/h\n")

## ⚡ Top 10% speed threshold:  218  km/h

cat("🚀 90% of models reach 100 km/h in under ", round(accel_90, 2), " seconds\n")

## 🚀 90% of models reach 100 km/h in under  14.41  seconds

# Check for outliers
speed_outliers <- boxplot.stats(cars$maxspeed)$out
if(length(speed_outliers) > 0) {
  cat("🚨 Speed outliers detected (", length(speed_outliers), " supercars)\n")
} else {
  cat("🚨 No significant speed outliers detected\n")
}

## 🚨 Speed outliers detected ( 1  supercars)

10.2 Detailed Statistical Profile

# Alternative version using summary() function
variables_to_analyze <- c("price_num", "maxspeed", "acceleration", "weight", "fueltank")

cat("\n📋 DETAILED STATISTICAL PROFILES:\n")

## 
## 📋 DETAILED STATISTICAL PROFILES:

cat("=", rep("=", 50), "\n", sep="")

## ===================================================

for(var in variables_to_analyze) {
  cat("\n📊", toupper(gsub("_", " ", var)), ":\n")
  
  # Get summary statistics
  var_summary <- summary(cars[[var]])
  var_mean <- mean(cars[[var]], na.rm = TRUE)
  var_median <- median(cars[[var]], na.rm = TRUE)
  
  cat("   Range: ", var_summary[1], " to ", var_summary[6], "\n")
  cat("   Mean: ", round(var_mean, 2), "\n")
  cat("   Median: ", round(var_median, 2), "\n")
  cat("   Q1-Q3: ", var_summary[2], " to ", var_summary[5], "\n")
  
  # Determine skewness: a strict mean > median test would flag any tiny gap,
  # so gaps below 10% of the IQR count as symmetric (a rule-of-thumb tolerance, an assumption)
  skew_gap <- var_mean - var_median
  tolerance <- 0.1 * IQR(cars[[var]], na.rm = TRUE)
  if(skew_gap > tolerance) {
    cat("   Shape: Right-skewed\n")
  } else if(skew_gap < -tolerance) {
    cat("   Shape: Left-skewed\n")
  } else {
    cat("   Shape: Approximately symmetric\n")
  }
  
  # Optional: Show full summary
  cat("   Full summary:\n")
  print(var_summary)
}

## 
## 📊 PRICE NUM :
##    Range:  -1860  to  61570 
##    Mean:  24267.47 
##    Median:  24177 
##    Q1-Q3:  13574.25  to  33195.25 
##    Shape: Approximately symmetric
##    Full summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1860   13574   24177   24267   33195   61570 
## 
## 📊 MAXSPEED :
##    Range:  103  to  279 
##    Mean:  180.63 
##    Median:  178 
##    Q1-Q3:  159.25  to  203 
##    Shape: Approximately symmetric
##    Full summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   103.0   159.2   178.0   180.6   203.0   279.0 
## 
## 📊 ACCELERATION :
##    Range:  3.2  to  18.2 
##    Mean:  10.84 
##    Median:  10.9 
##    Q1-Q3:  8.825  to  12.975 
##    Shape: Approximately symmetric
##    Full summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   8.825  10.900  10.837  12.975  18.200 
## 
## 📊 WEIGHT :
##    Range:  623  to  2072 
##    Mean:  1373.85 
##    Median:  1369.5 
##    Q1-Q3:  1167.75  to  1566 
##    Shape: Approximately symmetric
##    Full summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     623    1168    1370    1374    1566    2072 
## 
## 📊 FUELTANK :
##    Range:  25  to  100 
##    Mean:  60.92 
##    Median:  60 
##    Q1-Q3:  50  to  72 
##    Shape: Approximately symmetric
##    Full summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   50.00   60.00   60.92   72.00  100.00
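
The mean-versus-median comparison above labels almost every variable as skewed, because the two values rarely coincide exactly. A minimal sketch of a more tolerant check follows. It assumes a hypothetical helper `classify_shape()` and an illustrative cutoff of 0.5 on the moment-based skewness coefficient; neither comes from the lecture, and simulated vectors stand in for the cars data.

```r
# Hypothetical helper (not part of the lecture code): classify the shape
# from the moment-based skewness coefficient g1 = m3 / m2^(3/2).
# The 0.5 cutoff is an illustrative assumption, not a universal rule.
classify_shape <- function(x, cutoff = 0.5) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  g1 <- m3 / m2^(3 / 2)
  shape <- if (g1 > cutoff) {
    "Right-skewed"
  } else if (g1 < -cutoff) {
    "Left-skewed"
  } else {
    "Approximately symmetric"
  }
  list(skewness = g1, shape = shape)
}

# Simulated stand-ins for a symmetric and a right-skewed variable
set.seed(42)
sym_data <- rnorm(500, mean = 180, sd = 30)
right_data <- rexp(500, rate = 1 / 20000)

sym_result <- classify_shape(sym_data)
right_result <- classify_shape(right_data)
cat("Normal sample:      ", sym_result$shape, "\n")
cat("Exponential sample: ", right_result$shape, "\n")
```

With a tolerance band, the symmetric branch can actually be reached, which the exact-equality test in the loop almost never allows.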

Chapter 11: Advanced Exercises and Practice

11.1 Guided Practice Problems

🎯 Exercise 1: Distribution Detective

cat("🔍 EXERCISE 1: DISTRIBUTION DETECTIVE\n")

## 🔍 EXERCISE 1: DISTRIBUTION DETECTIVE

cat("Analyze the fuel consumption distribution:\n\n")

## Analyze the fuel consumption distribution:

# Create histogram using base R
hist(cars$urban_fuelcons, 
     breaks = 6,
     main = "Urban Fuel Consumption Distribution",
     xlab = "Fuel Consumption (L/100km)",
     ylab = "Frequency",
     col = "lightblue")

# Calculate statistics using base R
fuel_mean <- mean(cars$urban_fuelcons, na.rm = TRUE)
fuel_median <- median(cars$urban_fuelcons, na.rm = TRUE)

cat("Mean fuel consumption: ", round(fuel_mean, 2), " L/100km\n")

## Mean fuel consumption:  7.93  L/100km

cat("Median fuel consumption: ", round(fuel_median, 2), " L/100km\n")

## Median fuel consumption:  7.85  L/100km

# Student task: Determine the shape
if(fuel_mean > fuel_median) {
  cat("✅ ANSWER: Right-skewed distribution\n")
  cat("💡 Interpretation: Most cars are fuel-efficient, but some gas-guzzlers pull the average up\n")
} else if(fuel_mean < fuel_median) {
  cat("✅ ANSWER: Left-skewed distribution\n")
  cat("💡 Interpretation: Most cars consume more fuel, but a few very efficient models pull the average down\n")
} else {
  cat("✅ ANSWER: Approximately symmetric distribution\n")
  cat("💡 Interpretation: Fuel consumption is balanced around the center, so mean and median agree\n")
}

## ✅ ANSWER: Right-skewed distribution
## 💡 Interpretation: Most cars are fuel-efficient, but some gas-guzzlers pull the average up
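
The histogram above asks for `breaks = 6`, but in base R a single number is only a suggestion: `hist()` hands it to `pretty()`, which picks round break points. The sketch below contrasts this with an explicit vector of breaks, using a simulated stand-in for the fuel data (the gamma parameters are an assumption chosen only to produce plausible values).

```r
# Simulated stand-in for urban fuel consumption (L/100km); not the lecture's data
set.seed(10)
fuel <- round(rgamma(190, shape = 16, rate = 2), 1)

# A single number is only a suggestion: hist() passes it to pretty()
h_suggested <- hist(fuel, breaks = 6, plot = FALSE)

# An explicit vector of break points is used exactly as given
explicit_breaks <- seq(floor(min(fuel)), ceiling(max(fuel)), by = 1)
h_exact <- hist(fuel, breaks = explicit_breaks, plot = FALSE)

cat("Bins with breaks = 6:      ", length(h_suggested$counts), "\n")
cat("Bins with explicit breaks: ", length(h_exact$counts), "\n")
```

When the number of bars matters for a report, passing the break points explicitly avoids surprises.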

🎯 Exercise 2: Percentile Mastery

cat("\n🎯 EXERCISE 2: PERCENTILE MASTERY\n")

## 
## 🎯 EXERCISE 2: PERCENTILE MASTERY

cat("Find the weight thresholds for different car categories:\n\n")

## Find the weight thresholds for different car categories:

# Calculate quartiles using base R
weight_quartiles <- quantile(cars$weight, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
print(weight_quartiles)

##      0%     25%     50%     75%    100% 
##  623.00 1167.75 1369.50 1566.00 2072.00

cat("\n🚗 CAR WEIGHT CATEGORIES:\n")

## 
## 🚗 CAR WEIGHT CATEGORIES:

cat("🪶 Lightweight (bottom 25%): Under ", round(weight_quartiles[2], 0), " kg\n")

## 🪶 Lightweight (bottom 25%): Under  1168  kg

cat("⚖️ Standard weight (25-75%): ", round(weight_quartiles[2], 0), "-", round(weight_quartiles[4], 0), " kg\n")

## ⚖️ Standard weight (25-75%):  1168 - 1566  kg

cat("🏋️ Heavyweight (top 25%): Over ", round(weight_quartiles[4], 0), " kg\n")

## 🏋️ Heavyweight (top 25%): Over  1566  kg

# Additional insights
cat("\n📊 Weight Statistics:\n")

## 
## 📊 Weight Statistics:

cat("   Lightest car: ", round(weight_quartiles[1], 0), " kg\n")

##    Lightest car:  623  kg

cat("   Heaviest car: ", round(weight_quartiles[5], 0), " kg\n")

##    Heaviest car:  2072  kg

cat("   Median weight: ", round(weight_quartiles[3], 0), " kg\n")

##    Median weight:  1370  kg
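
The thresholds above can also be turned into a label for every car. A minimal sketch with `cut()`, assuming simulated weights in place of `cars$weight` and the same three category names used in the exercise:

```r
# Simulated stand-in for car weights in kg; not the lecture's data
set.seed(1)
weight <- round(rnorm(200, mean = 1370, sd = 280))

# Break points at the minimum, Q1, Q3 and the maximum
q <- quantile(weight, probs = c(0, 0.25, 0.75, 1), na.rm = TRUE)

# cut() uses right-closed intervals (a, b], so a car exactly at Q1 is
# "Lightweight"; include.lowest = TRUE keeps the lightest car in the first bin
weight_class <- cut(weight,
                    breaks = q,
                    labels = c("Lightweight", "Standard", "Heavyweight"),
                    include.lowest = TRUE)

print(table(weight_class))
```

The resulting factor can be tabulated or plotted like any other categorical variable.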

🎯 Exercise 3: Outlier Investigation

cat("\n🚨 EXERCISE 3: OUTLIER INVESTIGATION\n")

## 
## 🚨 EXERCISE 3: OUTLIER INVESTIGATION

# Create boxplot using base R
boxplot(cars$acceleration,
        main = "Acceleration Distribution",
        ylab = "Acceleration (0-100 km/h in seconds)",
        col = "orange")

cat("Investigation questions:\n")

## Investigation questions:

cat("1. Are there any acceleration outliers?\n")

## 1. Are there any acceleration outliers?

cat("2. What might cause extremely slow acceleration?\n")

## 2. What might cause extremely slow acceleration?

cat("3. How would outliers affect the mean vs median?\n")

## 3. How would outliers affect the mean vs median?

# Calculate statistics using base R
accel_min <- min(cars$acceleration, na.rm = TRUE)
accel_max <- max(cars$acceleration, na.rm = TRUE)
accel_range <- accel_max - accel_min

# Check for outliers
accel_outliers <- boxplot.stats(cars$acceleration)$out

cat("\n📊 Acceleration Analysis:\n")

## 
## 📊 Acceleration Analysis:

cat("🚀 Fastest: ", accel_min, " seconds\n")

## 🚀 Fastest:  3.2  seconds

cat("🐌 Slowest: ", accel_max, " seconds\n")

## 🐌 Slowest:  18.2  seconds

cat("📈 Range: ", round(accel_range, 1), " seconds\n")

## 📈 Range:  15  seconds

cat("🚨 Number of outliers: ", length(accel_outliers), "\n")

## 🚨 Number of outliers:  0

if(length(accel_outliers) > 0) {
  cat("🔍 Outlier values: ", paste(round(accel_outliers, 1), collapse = ", "), " seconds\n")
}
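
To see what `boxplot.stats()` is doing, the 1.5 × IQR rule can be reproduced by hand. The sketch below uses simulated acceleration times with one deliberately slow car appended (both are assumptions for illustration). Note that `boxplot.stats()` works with Tukey's hinges from `fivenum()`, which can differ slightly from the default `quantile()` quartiles.

```r
# Simulated stand-in for acceleration times (seconds), plus one very slow car
set.seed(7)
accel <- c(round(rnorm(100, mean = 10.8, sd = 2.5), 1), 25.0)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
hinges <- fivenum(accel)
iqr_h <- hinges[4] - hinges[2]

# Fences at 1.5 times the hinge spread beyond each hinge
lower_fence <- hinges[2] - 1.5 * iqr_h
upper_fence <- hinges[4] + 1.5 * iqr_h

manual_outliers <- accel[accel < lower_fence | accel > upper_fence]

cat("Fences: ", lower_fence, " to ", upper_fence, "\n")
cat("Outliers found by hand: ", length(manual_outliers), "\n")
cat("Outliers from boxplot.stats(): ", length(boxplot.stats(accel)$out), "\n")
```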

11.2 Real-World Application Scenarios

🏢 Scenario 1: Product Development Strategy

cat("🏢 BUSINESS SCENARIO 1: PRODUCT DEVELOPMENT\n")

## 🏢 BUSINESS SCENARIO 1: PRODUCT DEVELOPMENT

cat("=", rep("=", 50), "\n", sep="")

## ===================================================

cat("You're developing a new car model. Use data to inform decisions:\n\n")

## You're developing a new car model. Use data to inform decisions:

# Market positioning analysis using base R
doors_table <- table(cars$n_doors_min)
doors_percent <- round(prop.table(doors_table) * 100, 1)

cat("🚪 DOOR CONFIGURATION STRATEGY:\n")

## 🚪 DOOR CONFIGURATION STRATEGY:

doors_summary <- data.frame(
  Doors = names(doors_table),
  Count = as.numeric(doors_table),
  Percentage = as.numeric(doors_percent)
)
print(doors_summary)

##   Doors Count Percentage
## 1     2    17        8.9
## 2     3    22       11.6
## 3     4     9        4.7
## 4     5   141       74.2
## 5     7     1        0.5

# Find most popular configuration
most_popular_doors <- names(doors_percent)[which.max(doors_percent)]
max_percentage <- max(doors_percent)
cat("💡 Recommendation: Focus on ", most_popular_doors, "-door models (", max_percentage, "% of models in the dataset)\n\n")

## 💡 Recommendation: Focus on  5 -door models ( 74.2 % of models in the dataset)

# Price positioning using base R
price_table <- table(cars$price_classes)
price_percent <- round(prop.table(price_table) * 100, 1)

cat("💰 PRICE POSITIONING STRATEGY:\n")

## 💰 PRICE POSITIONING STRATEGY:

price_summary <- data.frame(
  Price_Class = names(price_table),
  Count = as.numeric(price_table),
  Percentage = as.numeric(price_percent)
)
print(price_summary)

##   Price_Class Count Percentage
## 1         low    50       26.3
## 2         mid   111       58.4
## 3        high    29       15.3

# Find largest segment
largest_segment <- names(price_percent)[which.max(price_percent)]
largest_percentage <- max(price_percent)
cat("💡 Recommendation: Target ", largest_segment, "-price segment (", largest_percentage, "% of market)\n")

## 💡 Recommendation: Target  mid -price segment ( 58.4 % of market)
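
The door and price summaries above repeat the same three steps: tabulate, convert to percentages, and assemble a data frame. One possible refactoring is a small helper, sketched here under the assumption of a hypothetical function name `freq_summary()` and a toy vector of door counts.

```r
# Hypothetical helper (not part of the lecture code): frequency table
# with percentages, sorted from most to least common category
freq_summary <- function(x, label = "Category") {
  counts <- table(x, useNA = "no")
  out <- data.frame(
    Category = names(counts),
    Count = as.integer(counts),
    Percentage = round(100 * as.numeric(counts) / sum(counts), 1)
  )
  names(out)[1] <- label
  out[order(-out$Count), ]
}

# Toy data for illustration only
doors <- c(5, 5, 3, 5, 2, 5, 4, 5, 3, 5)
doors_summary <- freq_summary(doors, label = "Doors")
print(doors_summary)
```

Because the result is sorted, the most common category is always in the first row, so no separate `which.max()` call is needed.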

🎯 Scenario 2: Competitive Analysis

cat("\n🎯 BUSINESS SCENARIO 2: COMPETITIVE ANALYSIS\n")

## 
## 🎯 BUSINESS SCENARIO 2: COMPETITIVE ANALYSIS

cat("=", rep("=", 50), "\n", sep="")

## ===================================================

# Performance benchmarking using base R
speed_quartiles <- quantile(cars$maxspeed, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

cat("⚡ SPEED BENCHMARKING:\n")

## ⚡ SPEED BENCHMARKING:

cat("🥉 Entry level (25th percentile): ", round(speed_quartiles[1], 0), " km/h\n")

## 🥉 Entry level (25th percentile):  159  km/h

cat("🥈 Competitive (50th percentile): ", round(speed_quartiles[2], 0), " km/h\n")

## 🥈 Competitive (50th percentile):  178  km/h

cat("🥇 Premium (75th percentile): ", round(speed_quartiles[3], 0), " km/h\n")

## 🥇 Premium (75th percentile):  203  km/h

cat("💡 To be competitive, aim for at least ", round(speed_quartiles[2], 0), " km/h\n")

## 💡 To be competitive, aim for at least  178  km/h

# Additional competitive insights
cat("\n🏁 ACCELERATION BENCHMARKING:\n")

## 
## 🏁 ACCELERATION BENCHMARKING:

accel_quartiles <- quantile(cars$acceleration, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
cat("🚀 Sports performance (25th percentile): Under ", round(accel_quartiles[1], 1), " seconds\n")

## 🚀 Sports performance (25th percentile): Under  8.8  seconds

cat("⚖️ Standard performance (50th percentile): ", round(accel_quartiles[2], 1), " seconds\n")

## ⚖️ Standard performance (50th percentile):  10.9  seconds

cat("🐌 Economy performance (75th percentile): ", round(accel_quartiles[3], 1), " seconds\n")

## 🐌 Economy performance (75th percentile):  13  seconds
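
Quartiles answer "what speed marks the 75th percentile?". The reverse question, "where does a given speed rank?", can be sketched with `ecdf()`. The speeds below are simulated and the 190 km/h target is a hypothetical design value, not a figure from the lecture.

```r
# Simulated stand-in for maximum speeds (km/h); not the lecture's data
set.seed(99)
maxspeed <- round(rnorm(190, mean = 180, sd = 30))

# ecdf() returns a function giving the share of observations <= its argument
speed_ecdf <- ecdf(maxspeed)

proposed_speed <- 190   # hypothetical top speed of the new model
share_at_or_below <- speed_ecdf(proposed_speed)

cat("Share of models at or below ", proposed_speed, " km/h: ",
    round(100 * share_at_or_below, 1), "%\n")
```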

Chapter 12: Professional Reporting and Communication

12.1 Creating Executive Dashboards

cat("📊 EXECUTIVE DASHBOARD: KEY METRICS\n")

## 📊 EXECUTIVE DASHBOARD: KEY METRICS

cat("=", rep("=", 60), "\n", sep="")

## =============================================================

# Key Performance Indicators (KPIs) using base R
total_models <- nrow(cars)
german_models <- sum(cars$country == "Germany", na.rm = TRUE)
luxury_models <- sum(cars$price_classes == "high", na.rm = TRUE)
high_performance <- sum(cars$maxspeed > 200, na.rm = TRUE)

cat("🎯 MARKET OVERVIEW:\n")

## 🎯 MARKET OVERVIEW:

cat("   Total Models Analyzed: ", total_models, "\n")

##    Total Models Analyzed:  190

cat("   German Market Share: ", round(german_models/total_models*100, 1), "%\n")

##    German Market Share:  21.1 %

cat("   Luxury Segment: ", round(luxury_models/total_models*100, 1), "%\n")

##    Luxury Segment:  15.3 %

cat("   High-Performance Cars (>200 km/h): ", round(high_performance/total_models*100, 1), "%\n")

##    High-Performance Cars (>200 km/h):  28.4 %

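# Price insights using base R
# Note: the minimum of price_num is negative (-1860). A negative price is not
# meaningful and likely stems from how the sample data were generated, so
# treat the "Market Entry Price" line as illustrative only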
price_min <- min(cars$price_num, na.rm = TRUE)
price_q1 <- quantile(cars$price_num, 0.25, na.rm = TRUE)
price_median <- median(cars$price_num, na.rm = TRUE)
price_q3 <- quantile(cars$price_num, 0.75, na.rm = TRUE)
price_max <- max(cars$price_num, na.rm = TRUE)

cat("\n💰 PRICE INTELLIGENCE:\n")

## 
## 💰 PRICE INTELLIGENCE:

cat("   Market Entry Price: $", round(price_min, 0), "\n")

##    Market Entry Price: $ -1860

cat("   Budget Threshold (25%): $", round(price_q1, 0), "\n")

##    Budget Threshold (25%): $ 13574

cat("   Market Median: $", round(price_median, 0), "\n")

##    Market Median: $ 24177

cat("   Premium Threshold (75%): $", round(price_q3, 0), "\n")

##    Premium Threshold (75%): $ 33195

cat("   Luxury Maximum: $", round(price_max, 0), "\n")

##    Luxury Maximum: $ 61570

# Market segments analysis
cat("\n🎯 MARKET SEGMENTATION:\n")

## 
## 🎯 MARKET SEGMENTATION:

budget_cars <- sum(cars$price_num <= price_q1, na.rm = TRUE)
mid_cars <- sum(cars$price_num > price_q1 & cars$price_num <= price_q3, na.rm = TRUE)
premium_cars <- sum(cars$price_num > price_q3, na.rm = TRUE)

cat("   Budget Market (<=$", round(price_q1, 0), "): ", round(budget_cars/total_models*100, 1), "%\n")

##    Budget Market (<=$ 13574 ):  25.3 %

cat("   Mid Market ($", round(price_q1, 0), "-$", round(price_q3, 0), "): ", round(mid_cars/total_models*100, 1), "%\n")

##    Mid Market ($ 13574 -$ 33195 ):  49.5 %

cat("   Premium Market (>$", round(price_q3, 0), "): ", round(premium_cars/total_models*100, 1), "%\n")

##    Premium Market (>$ 33195 ):  25.3 %
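
The dashboard lines above show stray spaces such as `$ 13574` because `cat()` separates its arguments with a space by default. For a report, formatting the numbers first gives cleaner output. A minimal sketch, assuming two hypothetical helpers `fmt_usd()` and `fmt_pct()`:

```r
# Hypothetical helpers (not part of the lecture code) for report-ready numbers
fmt_usd <- function(x) {
  paste0("$", format(round(x), big.mark = ",", scientific = FALSE, trim = TRUE))
}
fmt_pct <- function(x) sprintf("%.1f%%", x)

# Example values taken from the dashboard output above
cat(sprintf("   Budget Threshold (25%%): %s\n", fmt_usd(13574.25)))
cat(sprintf("   Premium Threshold (75%%): %s\n", fmt_usd(33195.25)))
cat(sprintf("   Mid Market share: %s\n", fmt_pct(49.5)))
```

Inside `sprintf()`, a literal percent sign is written as `%%`.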

Lecture 1: Univariate Data Visualization and Descriptive Statistics

Endri Raço