1. Introduction to Descriptive Statistics

Descriptive statistics involve methods for organizing, picturing, and summarizing information from samples or populations. Unlike inferential statistics, which try to make predictions about a population based on a sample, descriptive statistics focus on describing the features of the data you actually have.

1.1 Types of Data

Before calculating statistics, we must identify the data type:

  1. Qualitative (Categorical): Non-numerical data (e.g., hair color, car brands).
  2. Quantitative (Numerical): Numerical measurements.
     • Discrete: Countable (e.g., number of children).
     • Continuous: Measurable on a scale (e.g., height, temperature).
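In R, the data type usually determines which class a vector should have. The sketch below uses made-up values (the vector names hair_color, n_children, and height_cm are hypothetical) to show one reasonable mapping: factors for qualitative data, integers for discrete counts, and doubles for continuous measurements.

```r
# Hypothetical example vectors, one per data type
hair_color <- factor(c("brown", "black", "blonde", "brown"))  # qualitative
n_children <- c(0L, 2L, 1L, 3L)                               # quantitative, discrete
height_cm  <- c(172.5, 160.2, 181.0, 168.4)                   # quantitative, continuous

class(hair_color)   # "factor"
class(n_children)   # "integer"
class(height_cm)    # "numeric"

# For qualitative data, counting categories is the natural summary
table(hair_color)
```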


2. Measures of Central Tendency

These measures indicate the “center” or “typical value” of a dataset.

2.1 Arithmetic Mean

The sum of all values divided by the number of values.

Mathematical Equation: \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

  • Real-Life Example: Calculating the average battery life of a batch of 500 smartphones to determine if they meet quality standards.
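As a quick check of the formula, the following sketch computes the mean by hand and compares it with R's built-in mean(). The battery-life values are invented for illustration and are not real measurements.

```r
# Hypothetical battery-life measurements in hours (illustrative values only)
battery_hours <- c(10.5, 11.2, 9.8, 10.9, 11.6)

# Mean from the definition: sum of the values divided by the number of values
manual_mean <- sum(battery_hours) / length(battery_hours)

# Built-in equivalent
builtin_mean <- mean(battery_hours)

manual_mean                           # 10.8
all.equal(manual_mean, builtin_mean)  # TRUE
```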

2.2 Median

The middle value when the data is arranged in ascending order. If \(n\) is even, it is the average of the two middle values.

  • Real-Life Example: Real Estate Prices. The median is preferred over the mean for housing prices because a single $10,000,000 mansion (outlier) would artificially inflate the mean, while the median remains representative of the “typical” home.
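The effect of a single outlier can be sketched with a few made-up house prices (the figures below are assumptions chosen for easy arithmetic). Adding one mansion moves the mean by more than a million dollars, while the median shifts only slightly.

```r
# Hypothetical home prices in dollars (illustrative values only)
prices <- c(250000, 300000, 320000, 350000, 400000)
prices_with_mansion <- c(prices, 10000000)

mean(prices)                  # 324000
median(prices)                # 320000

# With n = 6 (even), the median averages the two middle values
mean(prices_with_mansion)     # about 1936667
median(prices_with_mansion)   # 335000
```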

2.3 Mode

The value that appears most frequently in a dataset.

  • Real-Life Example: Inventory Management. A shoe store owner needs to know the mode of shoe sizes sold to ensure the most popular sizes are always in stock.
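Base R has no built-in function for the statistical mode (the base function mode() reports an object's storage mode instead). One possible helper is sketched below; the name stat_mode and the shoe-size data are hypothetical, and the function returns every tied value when there is more than one mode.

```r
# Hypothetical helper: returns the most frequent value(s) in x
stat_mode <- function(x) {
  values <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, values))     # how often each distinct value occurs
  values[counts == max(counts)]            # keep all values tied for the top count
}

# Hypothetical shoe sizes sold in one day
shoe_sizes <- c(8, 9, 9, 10, 9, 8, 11, 9, 10)
stat_mode(shoe_sizes)   # 9
```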

3. Measures of Dispersion (Variability)

Central tendency alone isn’t enough: two datasets can share the same mean yet differ greatly in how “spread out” their values are.

3.1 Range

The difference between the maximum and minimum values.

Mathematical Equation: \[\text{Range} = x_{\max} - x_{\min}\]
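Note that R's range() returns the minimum and maximum as a pair rather than their difference. A minimal sketch on an arbitrary example vector:

```r
# Arbitrary example values
x <- c(4, 8, 15, 16, 23, 42)

# Range as defined above: maximum minus minimum
data_range <- max(x) - min(x)
data_range        # 38

# range() gives c(min, max); diff() turns that pair into a single number
range(x)          # 4 42
diff(range(x))    # 38
```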

3.2 Variance (\(s^2\))

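The average of the squared deviations from the mean. For a sample, the sum of squared deviations is divided by \(n-1\) rather than \(n\) (Bessel’s correction), which makes \(s^2\) an unbiased estimate of the population variance.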

Mathematical Equation (Sample Variance): \[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
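The sample variance formula can be verified against R's var(), which also divides by n - 1. The data vector below is an arbitrary example picked so the arithmetic is easy to follow (its mean is 5 and the squared deviations sum to 32).

```r
# Arbitrary example values with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance from the definition: squared deviations summed, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var                      # 32 / 7, about 4.571

# var() uses the same n - 1 denominator
all.equal(manual_var, var(x))   # TRUE
```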

3.3 Standard Deviation (\(s\))

The square root of the variance. It expresses the spread in the same units as the data.

Mathematical Equation: \[s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\]

  • Real-Life Example: Investment Risk. In finance, standard deviation is used as a measure of volatility. A stock with a high \(s\) is considered “riskier” because its returns fluctuate wildly from the mean.
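To illustrate the volatility idea, the sketch below compares two invented series of monthly returns (in percent) that share the same mean but differ in spread. The numbers are assumptions for demonstration only, not market data.

```r
# Hypothetical monthly returns in percent; both series have a mean of 1
stable_returns   <- c(0.9, 1.1, 1.0, 1.2, 0.8)
volatile_returns <- c(-4.0, 6.5, 1.0, -2.5, 4.0)

mean(stable_returns)      # 1
mean(volatile_returns)    # 1

# sd() is the square root of the sample variance
sd(stable_returns)        # about 0.16
sd(volatile_returns)      # about 4.37
all.equal(sd(volatile_returns), sqrt(var(volatile_returns)))  # TRUE
```

The means alone cannot tell the two series apart; the standard deviation can.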

4. Measures of Relative Standing

4.1 Percentiles and Quartiles

  • Q1 (25th Percentile): The value below which 25% of the data falls.
  • Q3 (75th Percentile): The value below which 75% of the data falls.
  • Interquartile Range (IQR): \(IQR = Q3 - Q1\). This measures the spread of the middle 50% of the data.
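Quartiles and the IQR can be computed with quantile() and IQR(). The sketch below uses the integers 1 to 9 as an arbitrary example. Keep in mind that quantile() supports several interpolation rules through its type argument (the default is type 7), so other software or textbooks may report slightly different quartiles for the same data.

```r
# Arbitrary example values
x <- 1:9

# First and third quartiles using R's default method (type = 7)
q <- quantile(x, probs = c(0.25, 0.75))
q              # 25% = 3, 75% = 7

# IQR is Q3 minus Q1
iqr_manual <- unname(q[2] - q[1])
iqr_manual     # 4
IQR(x)         # 4
```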

4.2 Z-Score

Indicates how many standard deviations a value lies above (positive \(z\)) or below (negative \(z\)) the mean.

Mathematical Equation: \[z = \frac{x - \bar{x}}{s}\]
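A short sketch of the z-score calculation on an arbitrary example vector follows. It also compares the result with scale(), which by default centers on the mean and divides by the sample standard deviation.

```r
# Arbitrary example values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score from the definition, using the sample mean and sample sd
z <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the extra attributes
z_scaled <- as.numeric(scale(x))
all.equal(z, z_scaled)   # TRUE

# Standardized values always have mean 0 and standard deviation 1
round(mean(z), 10)       # 0
sd(z)                    # 1
```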


5. R Implementation and Visualization

Let’s use the built-in mtcars dataset to calculate these statistics.

5.1 Summary Statistics

# Load data
data(mtcars)

# Basic summary for Miles Per Gallon (mpg)
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

# Specific calculations
mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)

print(paste("The mean MPG is:", round(mean_mpg, 2)))
## [1] "The mean MPG is: 20.09"
print(paste("The Standard Deviation of MPG is:", round(sd_mpg, 2)))
## [1] "The Standard Deviation of MPG is: 6.03"

5.2 Visualizing Distribution with Boxplots

Boxplots are excellent for visualizing the median, quartiles, and potential outliers.

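# ggplot2 is not part of base R; install it once with install.packages("ggplot2")
library(ggplot2)

# Single boxplot of mpg: the box spans Q1 to Q3, the inner line marks the median
ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(title = "Boxplot of Miles Per Gallon (MPG)",
       y = "Miles Per Gallon") +
  theme_minimal()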

5.3 Visualizing with Histograms

Histograms show the frequency distribution and “shape” of the data, such as its skewness.

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "salmon", color = "white") +
  # linewidth replaces the deprecated size argument for lines in ggplot2 3.4.0 and later
  geom_vline(aes(xintercept = mean(mpg)), color = "blue", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of MPG",
       subtitle = "Blue line indicates the Mean",
       x = "MPG", y = "Frequency") +
  theme_minimal()


6. Summary Table

Statistic          | Purpose       | Best Used When…
-------------------|---------------|------------------------------------------
Mean               | Average       | Data is symmetric with no outliers.
Median             | Center point  | Data is skewed or has outliers.
Standard Deviation | Spread/Risk   | Comparing consistency between two groups.
IQR                | Middle spread | You want to ignore extreme outliers.

Exercises

  1. Using the iris dataset, calculate the mean and standard deviation for Sepal.Length.
  2. Create a boxplot for Sepal.Width grouped by Species.
  3. Calculate the Z-score for a car in mtcars that gets 30 MPG. Is it an outlier?