EC3133
Think of random variables like loot boxes in games - you know what you might get, but not exactly what you’ll get.
Probability is like your win rate in games.
Standard deviation is like how much your gaming scores typically swing up or down.
Regression is like finding patterns in what makes your content go viral.
Your latest TikTok dance video usually gets between 100 and 1000 views. Each time you post, it’s like rolling a die - you never know exactly how many views you’ll get. Suppose you post the same dance video 5 times and get the following view counts:
Video 1: 300 views
Video 2: 750 views
Video 3: 200 views
Video 4: 500 views
Video 5: 450 views
Average views:
\[ E[\text{Views}] = \frac{300 + 750 + 200 + 500 + 450}{5} = 440 \text{ views} \]
Variance (how scattered the views are):
\[ \begin{aligned} Var(\text{Views}) &= \frac{1}{5}\sum(\text{views} - 440)^2 \\ &= \frac{(-140)^2 + (310)^2 + (-240)^2 + (60)^2 + (10)^2}{5} \\ &= \frac{177,000}{5} = 35,400 \end{aligned} \]
Your videos typically get around 440 views, but they can swing up or down by about \(\sqrt{35,400} \approx 188\) views.
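The arithmetic above can be checked in a few lines of R. This is a minimal sketch; the names `views`, `avg`, `pop_var` and `samp_var` are illustrative choices, not part of the original slides. One thing to watch: R's built-in `var()` divides by \(n - 1\) (the sample variance), so the divide-by-\(n\) version is computed by hand.

```r
# View counts for the five posts
views <- c(300, 750, 200, 500, 450)

avg      <- mean(views)              # average views
pop_var  <- mean((views - avg)^2)    # variance dividing by n
samp_var <- var(views)               # R's var() divides by n - 1

avg
pop_var
sqrt(pop_var)
samp_var
sqrt(samp_var)
```

Dividing by \(n\) gives 35,400 (a typical swing of about 188 views), while dividing by \(n - 1\) gives 44,250 (about 210 views). With only five videos the two conventions differ noticeably, so it is worth stating which one you use.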
You’re playing your favorite mobile game. Your high scores tend to follow what we call a “normal distribution” (like a bell curve).
Your average score is 1000 points, with a standard deviation of 100 points.
Let’s use the properties of the normal distribution:
\[ P(X > 1200) = P\left( Z > \frac{1200-1000}{100} \right) = P(Z > 2) = 1 - P(Z \le 2) = 1 - 0.9772 = 0.0228 \]
You’ll score above 1200 about 2.3% of the time - that’s like getting a rare item in a loot box!
\[ P(900 < X < 1100) = P(-1 < Z < 1) = 0.6826 \]
About 68% of your games will score in this range - these are your typical matches.
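Both probabilities can be reproduced with `pnorm()`, R's normal cumulative distribution function. A short sketch, where the variable names `mu`, `sigma`, `p_above_1200` and `p_typical` are illustrative:

```r
mu    <- 1000   # average score
sigma <- 100    # standard deviation of scores

# P(X > 1200): upper-tail probability
p_above_1200 <- pnorm(1200, mean = mu, sd = sigma, lower.tail = FALSE)

# P(900 < X < 1100): difference of two cumulative probabilities
p_typical <- pnorm(1100, mean = mu, sd = sigma) - pnorm(900, mean = mu, sd = sigma)

round(p_above_1200, 3)
round(p_typical, 3)
```

The results come out at about 0.023 and 0.683, matching the table-based answers up to rounding.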
You’re a music producer trying to understand what makes a song popular. You collect data on \(x\) = Song length (in minutes) and \(y\) = Number of streams (in thousands).
For 5 songs, you have:
Song 1: (2.5 min, 100k streams)
Song 2: (3.0 min, 150k streams)
Song 3: (3.5 min, 180k streams)
Song 4: (4.0 min, 160k streams)
Song 5: (4.5 min, 140k streams)
Find the best-fit line to predict streams from song length.
Let’s find the line \(y = bx + m\) using the least squares method:
\[ b = \frac{ \sum\limits_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y}) }{ \sum\limits_{i=1}^n (x_i-\bar{x}) (x_i-\bar{x}) } \]
\[ b = \frac{ \left( \frac{1}{n}\sum x_iy_i \right) - \bar{x} \bar{y} }{ \left( \frac{1}{n}\sum x_ix_i \right) - \bar{x} \bar{x}} \]
\[ m = \bar{y} - b\bar{x} \]
Plugging in the numbers:
\[ \begin{aligned} b &= \frac{\frac{1}{5}(2.5\cdot100 + 3.0\cdot150 + 3.5\cdot180 + 4.0\cdot160 + 4.5\cdot140) - (3.5)(146)}{\frac{1}{5}(2.5^2 + 3.0^2 + 3.5^2 + 4.0^2 + 4.5^2) - (3.5)^2} \\ &= 18 \end{aligned} \]
\[ m = 146 - 18 \cdot 3.5 = 83 \]
Your prediction formula is:
\[ \text{Expected Streams (thousands)} = 18(\text{song length in minutes}) + 83 \]
Notice how the streams go up until about 3.5 minutes, then start dropping? This might mean there’s a “sweet spot” for song length - not too short and not too long.
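As a check on the hand calculation, here is a small sketch that computes the slope and intercept from the least squares formulas and then compares them with R's `lm()`. The names `length_min`, `streams_k`, `b`, `m` and `fit` are illustrative.

```r
# Song length (minutes) and streams (thousands) for the five songs
length_min <- c(2.5, 3.0, 3.5, 4.0, 4.5)
streams_k  <- c(100, 150, 180, 160, 140)

# Slope and intercept from the least squares formulas
x_dev <- length_min - mean(length_min)
y_dev <- streams_k - mean(streams_k)
b <- sum(x_dev * y_dev) / sum(x_dev^2)
m <- mean(streams_k) - b * mean(length_min)

# Same fit using lm(); coef() returns intercept first, then slope
fit <- lm(streams_k ~ length_min)
coef(fit)

# Residuals: actual streams minus predicted streams
round(resid(fit), 1)
```

Both approaches give a slope of 18 and an intercept of 83. The residuals are negative for the shortest and longest songs and positive in the middle, which is the "sweet spot" pattern a single straight line cannot capture.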
You’re wondering which platform gets more engagement. You post the same content on both:
Instagram likes: 105, 98, 120, 95, 102
TikTok likes: 115, 108, 125, 112, 110
Is TikTok really giving you more likes, or is it just random chance?
Use a 5% significance level.
Let’s use a paired t-test:
Find the differences (TikTok - Instagram): 10, 10, 5, 17, 8
Calculate mean difference (\(\bar{d}\)) and standard deviation (\(s_d\)):
\[ \bar{d} = 10 \]
\[ s_d = 4.42 \]
\[ t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{10}{4.42/\sqrt{5}} \approx 5.06 \]
Compare this with the t-critical values for 4 degrees of freedom (\(n - 1 = 5 - 1\)) at common significance levels:
For two-tailed tests:
At 95% confidence level (\(\alpha = 0.05\)): ±2.776
At 99% confidence level (\(\alpha = 0.01\)): ±4.604
For one-tailed tests (right-tailed; flip the sign for a left-tailed test):
At 95% confidence level (\(\alpha = 0.05\)): 2.132
At 99% confidence level (\(\alpha = 0.01\)): 3.747
Since 5.06 > 2.776, we reject the hypothesis of no difference at the 5% significance level. A gap this large is very unlikely to be just random luck, i.e. TikTok is your engagement winner.
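The same test can be run in R. The sketch below computes the statistic by hand and then with the built-in `t.test()`; the variable names (`instagram`, `tiktok`, `d`, `t_stat`, `t_crit`, `result`) are illustrative.

```r
instagram <- c(105, 98, 120, 95, 102)
tiktok    <- c(115, 108, 125, 112, 110)

# Paired differences (TikTok - Instagram)
d <- tiktok - instagram

# t statistic by hand: mean difference over its standard error
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))

# Two-tailed critical value at alpha = 0.05 with n - 1 degrees of freedom
t_crit <- qt(0.975, df = length(d) - 1)

# Built-in paired t-test (two-sided by default)
result <- t.test(tiktok, instagram, paired = TRUE)

t_stat
t_crit
result$p.value
```

The hand-computed statistic is about 5.06 and agrees with `result$statistic`. Because the p-value is below 0.05, the conclusion is the same as comparing the statistic with the critical value.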
A two-tailed test examines the possibility of a relationship in both directions.
\[\begin{aligned}H_0: \mu &= \mu_0 \\H_1: \mu &\neq \mu_0\end{aligned}\]
For degrees of freedom = 4 and α = 0.05:
# Calculate critical value for two-tailed test
qt(0.975, df = 4) # 0.975 because we want the upper tail probability of 0.025
## [1] 2.776445
A one-tailed test examines the possibility of a relationship in only one direction.
Right-tailed: \[\begin{aligned}H_0: \mu &\leq \mu_0 \\H_1: \mu &> \mu_0\end{aligned}\]
Left-tailed: \[\begin{aligned}H_0: \mu &\geq \mu_0 \\H_1: \mu &< \mu_0\end{aligned}\]
For degrees of freedom = 4 and α = 0.05:
# Calculate critical value for one-tailed test
qt(0.95, df = 4) # 0.95 because we want the upper tail probability of 0.05
## [1] 2.131847
library(ggplot2)
x <- seq(-4, 4, length.out = 1000)
df <- data.frame(
x = x,
y = dt(x, df = 4)
)
ggplot(df, aes(x = x, y = y)) +
geom_line() +
geom_vline(xintercept = c(-2.776, 2.776), color = "red", linetype = "dashed") +
geom_vline(xintercept = 2.132, color = "blue", linetype = "dashed") +
annotate("text", x = 3.2, y = 0.3, label = "Two-tailed\nα = 0.05", color = "red") +
annotate("text", x = 2.5, y = 0.2, label = "One-tailed\nα = 0.05", color = "blue") +
labs(title = "T-Distribution (df = 4)",
x = "t-value",
y = "Density") +
theme_minimal()
Data with a natural order, but no fixed distance between levels
Examples:
Movie ratings: Bad, Okay, Good.
Game rankings: Bronze, Silver, Gold.
Survey responses: Strongly Disagree to Strongly Agree.
You want to predict an ordered outcome (e.g., movie ratings)
Linear regression assumes equal spacing between levels, which is not true for ordinal data
Ordinal logistic regression models the probability of being in or below a certain category
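In symbols, one common way to write this is the proportional odds form below, which is the parameterization used by `polr()` in the MASS package. The notation is an assumption added here for illustration: \(\zeta_j\) stands for the threshold between category \(j\) and the next one up, and \(\beta_1, \beta_2\) are the slopes for the two predictors \(X_1\) and \(X_2\).
\[ \log \frac{P(Y \le j)}{P(Y > j)} = \zeta_j - (\beta_1 X_1 + \beta_2 X_2) \]
The slopes are shared across all thresholds, so a positive \(\beta\) shifts probability toward the higher categories.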
X1: Age (in years)
X2: Number of similar movies watched
# Simulate movie rating data
set.seed(123)
age <- sample(18:40, 100, replace = TRUE)
movies_watched <- sample(1:20, 100, replace = TRUE)
rating <- cut(0.1 * age + 0.3 * movies_watched + rnorm(100),
breaks = c(-Inf, 5, 10, Inf),
labels = c("Bad", "Okay", "Good"),
ordered_result = TRUE)
movie_data <- data.frame(age, movies_watched, rating)
head(movie_data)
##   age movies_watched rating
## 1  32             14   Okay
## 2  36             17   Okay
## 3  31             14   Okay
## 4  20              3    Bad
## 5  27              8    Bad
## 6  35             14   Okay
library(MASS)
model <- polr(rating ~ age + movies_watched, data = movie_data, Hess = TRUE)
summary(model)
## Call:
## polr(formula = rating ~ age + movies_watched, data = movie_data,
##     Hess = TRUE)
##
## Coefficients:
##                 Value Std. Error t value
## age            0.1478    0.05036   2.934
## movies_watched 0.5943    0.12155   4.890
##
## Intercepts:
##             Value Std. Error t value
## Bad|Okay   8.8388     2.0480  4.3159
## Okay|Good 17.6026     3.3207  5.3009
##
## Residual Deviance: 77.44229
## AIC: 85.44229
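Age: For each additional year, the odds of a higher rating are multiplied by \(e^{\beta_1}\) (here \(e^{0.1478} \approx 1.16\)), holding movies watched fixed.
Movies Watched: For each additional movie, the odds of a higher rating are multiplied by \(e^{\beta_2}\) (here \(e^{0.5943} \approx 1.81\)), holding age fixed.
Thresholds (intercepts): The cut points that separate adjacent categories (e.g., “Bad” from “Okay”).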
\[ Odds = e^{-2 + 0.05 \cdot 25 + 0.2 \cdot 10} \]
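To see how the fitted coefficients and thresholds turn into category probabilities, here is a rough sketch in base R. It copies the estimates from the `summary(model)` output and assumes the `polr()` convention \(P(Y \le j) = \text{logistic}(\zeta_j - \eta)\). The viewer (25 years old, 10 similar movies watched) and the names `beta`, `zeta`, `viewer`, `eta`, `cum_p` and `pred_probs` are hypothetical, chosen only for illustration.

```r
# Estimates copied from the summary(model) output
beta <- c(age = 0.1478, movies_watched = 0.5943)
zeta <- c("Bad|Okay" = 8.8388, "Okay|Good" = 17.6026)

# Hypothetical viewer: 25 years old, 10 similar movies watched
viewer <- c(age = 25, movies_watched = 10)
eta <- sum(beta * viewer)   # linear predictor

# Odds ratios for a one-unit increase in each predictor
odds_ratios <- exp(beta)

# Cumulative probabilities P(rating <= Bad) and P(rating <= Okay)
cum_p <- plogis(zeta - eta)

# Differences of cumulative probabilities give the category probabilities
pred_probs <- c(Bad  = cum_p[[1]],
                Okay = cum_p[[2]] - cum_p[[1]],
                Good = 1 - cum_p[[2]])

round(odds_ratios, 2)
round(pred_probs, 3)
```

For this hypothetical viewer the sketch gives roughly 31% “Bad”, 69% “Okay”, and well under 1% “Good”. With the fitted model object available, `predict(model, newdata, type = "probs")` should give the same kind of output directly.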
library(ggplot2)
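probs <- data.frame(
# Fix the factor levels so the bars appear in rating order rather than alphabetically
Rating = factor(c("Bad", "Okay", "Good"), levels = c("Bad", "Okay", "Good")),
Probability = c(0.2, 0.5, 0.3)
)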
ggplot(probs, aes(x = Rating, y = Probability, fill = Rating)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Predicted Probabilities for Movie Ratings",
x = "Rating",
y = "Probability")
# Simulate TikTok data
set.seed(123)
video_length <- rnorm(100, mean = 30, sd = 10) # seconds
views <- 1000 + 50 * video_length + rnorm(100, mean = 0, sd = 500)
# Analysis
model_tiktok <- lm(views ~ video_length)
plot(video_length, views, main = 'TikTok Video Length vs Views',
xlab = 'Video Length (seconds)', ylab = 'Views',
pch = 19, col = 'purple')
abline(model_tiktok, col = 'red', lwd = 2)
# Simulate gaming data
set.seed(123)
hours_played <- runif(100, min = 1, max = 50)
win_rate <- 0.4 + 0.01 * hours_played + rnorm(100, mean = 0, sd = 0.05)
# Plot
plot(hours_played, win_rate, main = 'Gaming Hours vs Win Rate',
xlab = 'Hours Played', ylab = 'Win Rate',
pch = 19, col = 'orange')
model_gaming <- lm(win_rate ~ hours_played)
abline(model_gaming, col = 'blue', lwd = 2)
# Simulate Spotify data
set.seed(123)
danceability <- runif(100, min = 0.3, max = 0.9)
popularity <- 20 + 70 * danceability + rnorm(100, mean = 0, sd = 10)
# Create categories
genre <- sample(c('Pop', 'Hip-Hop', 'Rock'), 100, replace = TRUE)
colors <- ifelse(genre == 'Pop', 'pink',
ifelse(genre == 'Hip-Hop', 'purple', 'blue'))
# Plot
plot(danceability, popularity, col = colors, pch = 19,
main = 'Song Danceability vs Popularity',
xlab = 'Danceability Score', ylab = 'Popularity Score')
# List genres in the same order as the colors so the legend matches the points
legend('topleft', legend = c('Pop', 'Hip-Hop', 'Rock'), col = c('pink', 'purple', 'blue'),
pch = 19)
# Simulate Instagram engagement data
set.seed(123)
hour <- 0:23
engagement <- 100 + 50 * sin((hour - 12) * pi/12) + rnorm(24, mean = 0, sd = 10)
# Plot
barplot(engagement, names.arg = hour,
main = 'Instagram Engagement by Hour',
xlab = 'Hour of Day', ylab = 'Engagement Score',
col = rainbow(24))