Descriptive statistics involve methods for organizing, picturing, and summarizing information from samples or populations. Unlike inferential statistics, which try to make predictions about a population based on a sample, descriptive statistics focus on describing the features of the data you actually have.
Before calculating statistics, we must identify the data type:
1. Qualitative (Categorical): Non-numerical data (e.g., hair color, car brands).
2. Quantitative (Numerical): Numerical measurements.
   * Discrete: Countable (e.g., number of children).
   * Continuous: Measurable on a scale (e.g., height, temperature).
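As a rough illustration of how these data types might be represented in R, the sketch below uses made-up values (the variable names and numbers are assumptions for illustration only, not data from this tutorial).

```r
# Illustrative values only; not taken from any dataset in this document.
hair_color <- factor(c("brown", "black", "blonde", "brown"))  # qualitative
n_children <- c(0L, 2L, 1L, 3L)                               # quantitative, discrete
height_cm  <- c(162.5, 180.1, 175.0, 168.3)                   # quantitative, continuous

class(hair_color)  # "factor"
class(n_children)  # "integer"
class(height_cm)   # "numeric"

# Categorical data is usually summarized with counts rather than averages
table(hair_color)
```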
These measures indicate the “center” or “typical value” of a dataset.
The mean is the sum of all values divided by the number of values.
Mathematical Equation: \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
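The formula can be checked by hand in R. This is a minimal sketch on a hypothetical sample vector `x` (the values are invented for illustration), compared against the built-in `mean()`.

```r
x <- c(4, 8, 15, 16, 23, 42)  # hypothetical sample

n <- length(x)
manual_mean <- sum(x) / n     # sum of all values divided by their count
manual_mean                   # 18

# The built-in function gives the same result
all.equal(manual_mean, mean(x))  # TRUE
```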
The median is the middle value when the data are arranged in ascending order. If \(n\) is even, it is the average of the two middle values.
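To see the odd and even cases side by side, here is a small sketch with two hypothetical vectors (the values are made up for illustration).

```r
odd_x  <- c(7, 1, 5)     # sorted: 1 5 7
even_x <- c(7, 1, 5, 3)  # sorted: 1 3 5 7

median(odd_x)   # 5, the single middle value
median(even_x)  # 4, the average of 3 and 5

# Manual version of the even case
sorted <- sort(even_x)
n <- length(sorted)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
manual_median   # 4
```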
The mode is the value that appears most frequently in a dataset. A dataset can have more than one mode.
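Base R has no built-in function for the statistical mode (the existing `mode()` reports an object's storage mode instead). One possible sketch uses `table()` to count occurrences; `stat_mode` is a hypothetical helper name introduced here, and the inputs are invented.

```r
# stat_mode is a hypothetical helper, not part of base R.
# It returns every value tied for the highest count, as character strings.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(c("red", "blue", "red", "green"))  # "red"
stat_mode(c(1, 2, 2, 3, 3))                  # "2" "3" (two modes)
```

Returning all tied values is a design choice; picking only the first one would hide the fact that the data are bimodal.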
Central tendency isn’t enough. We need to know how “spread out” the data is.
The range is the difference between the maximum and minimum values. \[\text{Range} = x_{\max} - x_{\min}\]
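One thing to watch for in R: `range()` returns the minimum and maximum themselves, not their difference. A short sketch on a hypothetical vector:

```r
x <- c(12, 7, 3, 18, 9)  # hypothetical sample

range(x)          # 3 18 (the two endpoints)
diff(range(x))    # 15   (the range as a single number)
max(x) - min(x)   # 15   (same thing, written out)
```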
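The variance is the average of the squared deviations from the mean. For a sample, the sum of squared deviations is divided by \(n - 1\) rather than \(n\).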
Mathematical Equation (Sample Variance): \[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
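The sketch below works through the sample variance step by step on a hypothetical vector and compares it with R's `var()`, which uses the \(n - 1\) denominator. The population version (dividing by \(n\)) is shown only for contrast.

```r
x <- c(2, 4, 4, 6, 9)      # hypothetical sample, mean is 5
n <- length(x)

sq_dev <- (x - mean(x))^2  # squared deviations: 9 1 1 1 16

sample_var <- sum(sq_dev) / (n - 1)  # 28 / 4 = 7
pop_var    <- sum(sq_dev) / n        # 28 / 5 = 5.6

all.equal(sample_var, var(x))        # TRUE: var() divides by n - 1
```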
The standard deviation is the square root of the variance. It expresses the spread in the same units as the data.
Mathematical Equation: \[s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\]
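As a quick check of the formula, this sketch computes the sample standard deviation by hand for a hypothetical vector (imagine the values are lengths in cm) and compares it with `sd()`.

```r
x <- c(2, 4, 4, 6, 9)  # hypothetical measurements, e.g. in cm

manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
manual_sd                      # about 2.65, in the same units as x

all.equal(manual_sd, sd(x))    # TRUE
```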
The z-score indicates how many standard deviations a value is from the mean.
Mathematical Equation: \[z = \frac{x - \bar{x}}{s}\]
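A minimal sketch of the z-score calculation on a hypothetical vector follows. Base R's `scale()` performs the same centering and scaling by default, so it is used here as a cross-check.

```r
x <- c(2, 4, 4, 6, 9)  # hypothetical sample

z <- (x - mean(x)) / sd(x)
round(z, 2)            # -1.13 -0.38 -0.38  0.38  1.51

# scale() returns a one-column matrix, so convert it before comparing
all.equal(z, as.numeric(scale(x)))  # TRUE
```

After standardizing, the z-scores have mean 0 and standard deviation 1, which makes values from different scales comparable.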
Let’s use the built-in mtcars dataset to calculate these statistics.
# Five-number summary plus the mean
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.40   15.43   19.20   20.09   22.80   33.90
# Specific calculations
mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)
print(paste("The mean MPG is:", round(mean_mpg, 2)))
## [1] "The mean MPG is: 20.09"
print(paste("The Standard Deviation of MPG is:", round(sd_mpg, 2)))
## [1] "The Standard Deviation of MPG is: 6.03"
Boxplots are excellent for visualizing the median, quartiles, and potential outliers.
library(ggplot2)  # provides ggplot() and the geoms used below

ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(title = "Boxplot of Miles Per Gallon (MPG)",
       y = "Miles Per Gallon") +
  theme_minimal()

Histograms show the frequency distribution and “shape” of the data (skewness).
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "salmon", color = "white") +
  # linewidth replaces size for line geoms in ggplot2 3.4.0 and later
  geom_vline(aes(xintercept = mean(mpg)), color = "blue", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of MPG",
       subtitle = "Blue line indicates the Mean",
       x = "MPG", y = "Frequency") +
  theme_minimal()

| Statistic | Purpose | Best Used When… |
|---|---|---|
| Mean | Average | Data is symmetric with no outliers. |
| Median | Center point | Data is skewed or has outliers. |
| Standard Deviation | Spread/Risk | Comparing consistency between two groups. |
| IQR | Middle spread | You want to ignore extreme outliers. |
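To make the mean/median and IQR rows concrete, here is a small sketch on a hypothetical vector containing one extreme value. It flags outliers with the common 1.5 × IQR convention; that threshold is an assumption brought in for illustration and is not defined elsewhere in this tutorial.

```r
x <- c(10, 12, 13, 14, 15, 16, 18, 40)  # hypothetical data with one extreme value

mean(x)    # 17.25, pulled upward by the value 40
median(x)  # 14.5, barely affected by it

q <- quantile(x, c(0.25, 0.75))  # first and third quartiles: 12.75 and 16.5
iqr <- IQR(x)                    # 3.75, the spread of the middle 50%

# Fences at 1.5 * IQR beyond the quartiles (a common convention)
lower <- q[[1]] - 1.5 * iqr      # 7.125
upper <- q[[2]] + 1.5 * iqr      # 22.125

outliers <- x[x < lower | x > upper]
outliers   # 40
```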
1. Using the iris dataset, calculate the mean and standard deviation for Sepal.Length.
2. Summarize Sepal.Width grouped by Species.
3. Consider a car in mtcars that gets 30 MPG. Is it an outlier?