Some Insights into Statistical Concepts

Think of random variables like loot boxes in games - you know what you might get, but not exactly what you’ll get.
Probability is like your win rate in games.
Standard deviation is like how much your gaming scores typically swing up or down.
Regression is like finding patterns in what makes your content go viral.

Exercise 1: Understanding Random Variables with TikTok Views

Your latest TikTok dance video usually gets between 100 and 1000 views. Each time you post, it’s like rolling a die - you never know exactly how many views you’ll get. Suppose you post the same dance video 5 times, and you get the following number of views:

Video 1: 300 views
Video 2: 750 views
Video 3: 200 views
Video 4: 500 views
Video 5: 450 views

What are your average (expected) views?

Average views:

\[ E[\text{Views}] = \frac{300 + 750 + 200 + 500 + 450}{5} = 440 \text{ views} \]

How spread out are your views? (Calculate the variance)

Variance (how scattered the views are):

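\[ \begin{aligned} Var(\text{Views}) &= \frac{1}{5}\sum(\text{views} - 440)^2 \\ &= \frac{(-140)^2 + (310)^2 + (-240)^2 + (60)^2 + (10)^2}{5} \\ &= \frac{177,000}{5} = 35,400 \end{aligned} \]

Your videos typically get around 440 views, but they can swing up or down by about \(\sqrt{35,400} \approx 188\) views. (If you divide by \(n - 1 = 4\) instead of 5, as is usual for a sample, the variance is 44,250 and the swing is about 210 views.)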
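A quick way to check this arithmetic is to let R do it. The sketch below is only a check and the variable names are illustrative. Note that R's built-in `var()` divides by \(n - 1\) rather than \(n\), so the divide-by-\(n\) version is computed by hand.

```r
# View counts for the five posts (from the exercise)
views <- c(300, 750, 200, 500, 450)

mean_views <- mean(views)                  # average views
pop_var    <- mean((views - mean_views)^2) # variance, dividing by n
samp_var   <- var(views)                   # variance, dividing by n - 1

cat("Mean:", mean_views, "\n")
cat("Variance (divide by n):    ", pop_var,  " SD:", round(sqrt(pop_var)),  "\n")
cat("Variance (divide by n - 1):", samp_var, " SD:", round(sqrt(samp_var)), "\n")
```

With only five videos the two versions differ noticeably; with hundreds of videos the gap would be negligible.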

Exercise 2: Probability Distributions with Gaming Scores

You’re playing your favorite mobile game. Your high scores tend to follow what we call a “normal distribution” (like a bell curve).

Your average score is 1000 points, with a standard deviation of 100 points.

What is the probability of getting above 1200 points?

Let’s use the properties of the normal distribution:

\[ P(X > 1200) = P\left( Z > \frac{1200-1000}{100} \right) = P(Z > 2) = 1 - P(Z \le 2) = 1 - 0.9772 = 0.0228 \]

You’ll score above 1200 about 2.3% of the time - that’s like getting a rare item in a loot box!

What is the probability of getting scores between 900 and 1100?

\[ P(900 < X < 1100) = P(-1 < Z < 1) = 0.6826 \]

About 68% of your games will score in this range - these are your typical matches.
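Both probabilities can be reproduced with R's normal distribution function `pnorm()`. This is a minimal sketch using the mean and standard deviation stated in the exercise; the variable names are illustrative.

```r
mu    <- 1000  # average score
sigma <- 100   # standard deviation

# P(X > 1200): upper tail of the normal distribution
p_above <- pnorm(1200, mean = mu, sd = sigma, lower.tail = FALSE)

# P(900 < X < 1100): difference of two cumulative probabilities
p_between <- pnorm(1100, mean = mu, sd = sigma) - pnorm(900, mean = mu, sd = sigma)

print(round(c(above_1200 = p_above, between_900_1100 = p_between), 4))
```

Software results can differ from printed Z-table results in the last decimal place, because tables round the intermediate values.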

Exercise 3: Regression with Spotify Streaming

You’re a music producer trying to understand what makes a song popular. You collect data on \(x\) = Song length (in minutes) and \(y\) = Number of streams (in thousands).

For 5 songs, you have:

Song 1: (2.5 min, 100k streams)
Song 2: (3.0 min, 150k streams)
Song 3: (3.5 min, 180k streams)
Song 4: (4.0 min, 160k streams)
Song 5: (4.5 min, 140k streams)

Find the best-fit line to predict streams from song length.

Let’s find the line \(y = bx + m\) using the least squares method:

\[ b = \frac{ \sum\limits_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y}) }{ \sum\limits_{i=1}^n (x_i-\bar{x}) (x_i-\bar{x}) } \]

\[ b = \frac{ \left( \frac{1}{n}\sum x_iy_i \right) - \bar{x} \bar{y} }{ \left( \frac{1}{n}\sum x_ix_i \right) - \bar{x} \bar{x}} \]

\[ m = \bar{y} - b\bar{x} \]

Plugging in the numbers:

\[ \begin{aligned} b &= \frac{\frac{1}{5}(2.5\cdot100 + 3.0\cdot150 + 3.5\cdot180 + 4.0\cdot160 + 4.5\cdot140) - (3.5)(146)}{\frac{1}{5}(2.5^2 + 3.0^2 + 3.5^2 + 4.0^2 + 4.5^2) - (3.5)^2} \\ &= 18 \end{aligned} \]

\[ m = 146 - 18 \cdot 3.5 = 83 \]

Your prediction formula is:

\[ \text{Expected Streams (in thousands)} = 18(\text{song length}) + 83 \]

Notice how the streams go up until about 3.5 minutes, then start dropping? This might mean there’s a “sweet spot” for song length - not too short and not too long.
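The least squares formulas can be checked against R's built-in `lm()`. The sketch below uses the five songs from the exercise; the variable names `song_length` and `streams` are illustrative.

```r
song_length <- c(2.5, 3.0, 3.5, 4.0, 4.5)  # minutes
streams     <- c(100, 150, 180, 160, 140)  # thousands of streams

# Slope and intercept from the least squares formulas
b <- sum((song_length - mean(song_length)) * (streams - mean(streams))) /
  sum((song_length - mean(song_length))^2)
m <- mean(streams) - b * mean(song_length)
print(c(slope = b, intercept = m))

# Same line fitted with lm(); coef() returns intercept first, then slope
fit <- lm(streams ~ song_length)
print(coef(fit))
```

A straight line can only go up or down, so it cannot capture a rise followed by a drop; plotting the points alongside the fitted line is a good way to see how much the line misses.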

Exercise 4: Hypothesis Testing with Social Media Engagement

You’re wondering which platform gets more engagement. You post the same content on both:

Instagram likes: 105, 98, 120, 95, 102

TikTok likes: 115, 108, 125, 112, 110

Is TikTok really giving you more likes, or is it just random chance?

Use a 5% significance level.

Let’s use a paired t-test:

Find the differences (TikTok - Instagram): 10, 10, 5, 17, 8
Calculate mean difference (\(\bar{d}\)) and standard deviation (\(s_d\)):

\[ \bar{d} = 10 \]

\[ s_d = 4.42 \]

Calculate t-statistic:

\[ t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{10}{4.42/\sqrt{5}} \approx 5.06 \]

Look up the t-critical values for 4 degrees of freedom at common significance levels:

For two-tailed tests:

At 95% confidence level (\(\alpha = 0.05\)): ±2.776
At 99% confidence level (\(\alpha = 0.01\)): ±4.604

For one-tailed tests:

At 95% confidence level (\(\alpha = 0.05\)): 2.132 (or -2.132 for a left-tailed test)
At 99% confidence level (\(\alpha = 0.01\)): 3.747 (or -3.747 for a left-tailed test)

Compare with \(t_{critical}\) (2.776, the two-tailed value for 4 degrees of freedom at \(\alpha = 0.05\))

Since 5.06 > 2.776, we reject the idea that there is no difference at the 5% significance level. TikTok really does seem to get more likes, and it’s unlikely to be just random luck. In other words, TikTok is your engagement winner.
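The same paired t-test can be run in R with `t.test()`. This is a minimal sketch using the likes from the exercise; it computes the statistic by hand first and then lets R report the full test, including the p-value.

```r
instagram <- c(105, 98, 120, 95, 102)
tiktok    <- c(115, 108, 125, 112, 110)

# Differences, their mean and standard deviation, and the t-statistic by hand
d        <- tiktok - instagram
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
cat("mean difference:", mean(d), " sd:", round(sd(d), 2),
    " t:", round(t_manual, 2), "\n")

# Built-in paired t-test (two-sided by default)
res <- t.test(tiktok, instagram, paired = TRUE)
print(res)
```

A p-value below 0.05 in the output leads to the same decision as comparing the t-statistic with the two-tailed critical value.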

A digression/review: Two-Tailed Test

A two-tailed test examines the possibility of a relationship in both directions.

Key Characteristics:

Tests for differences in both directions (positive and negative)
Uses both tails of the distribution
The significance level (\(\alpha\)) is split between both tails

Hypotheses:

\[\begin{aligned}H_0: \mu &= \mu_0 \\H_1: \mu &\neq \mu_0\end{aligned}\]

Critical Values

For degrees of freedom = 4 and α = 0.05:

# Calculate critical value for two-tailed test
qt(0.975, df = 4)  # 0.975 because we want the upper tail probability of 0.025

## [1] 2.776445

One-Tailed Test

A one-tailed test examines the possibility of a relationship in only one direction.

Key Characteristics:

Tests for differences in one direction only
Uses one tail of the distribution
The entire significance level (α) is in one tail

Hypotheses:

Right-tailed: \[\begin{aligned}H_0: \mu &\leq \mu_0 \\H_1: \mu &> \mu_0\end{aligned}\]

Left-tailed: \[\begin{aligned}H_0: \mu &\geq \mu_0 \\H_1: \mu &< \mu_0\end{aligned}\]

Critical Values

For degrees of freedom = 4 and α = 0.05:

# Calculate critical value for one-tailed test
qt(0.95, df = 4)  # 0.95 because we want the upper tail probability of 0.05

## [1] 2.131847

When to Use Which Test?

Use One-Tailed Test When:

You have a specific directional hypothesis
You only care about differences in one direction
Example: Testing if a new drug increases performance

Use Two-Tailed Test When:

You don’t have a directional hypothesis
You care about differences in both directions
Example: Testing if a new drug affects performance (increase or decrease)
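In R, the choice between the two is made with the `alternative` argument of `t.test()`. The sketch below reuses the likes data from Exercise 4 purely as an illustration and compares the two-sided test with the right-tailed one ("TikTok gets more likes").

```r
instagram <- c(105, 98, 120, 95, 102)
tiktok    <- c(115, 108, 125, 112, 110)

# Two-tailed: is there a difference in either direction?
two_sided <- t.test(tiktok, instagram, paired = TRUE, alternative = "two.sided")

# One-tailed: is the mean difference (TikTok - Instagram) greater than zero?
one_sided <- t.test(tiktok, instagram, paired = TRUE, alternative = "greater")

print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
```

When the observed effect points in the hypothesized direction, the one-tailed p-value is half the two-tailed one, which is why the direction should be chosen before looking at the data.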

Visual Comparison of Critical Values

library(ggplot2)

x <- seq(-4, 4, length.out = 1000)
df <- data.frame(
  x = x,
  y = dt(x, df = 4)
)

ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  geom_vline(xintercept = c(-2.776, 2.776), color = "red", linetype = "dashed") +
  geom_vline(xintercept = 2.132, color = "blue", linetype = "dashed") +
  annotate("text", x = 3.2, y = 0.3, label = "Two-tailed\nα = 0.05", color = "red") +
  annotate("text", x = 2.5, y = 0.2, label = "One-tailed\nα = 0.05", color = "blue") +
  labs(title = "T-Distribution (df = 4)",
       x = "t-value",
       y = "Density") +
  theme_minimal()

Summary

Two-tailed tests (α = 0.05): Critical value = ±2.776
One-tailed tests (α = 0.05): Critical value = 2.132 (or -2.132 for a left-tailed test)
Two-tailed tests are more conservative and commonly used in research
Choose based on your research hypothesis and what differences are meaningful for your study

More Insights: Understanding Ordinal Data

What is Ordinal Data?

Data with a natural order, but no fixed distance between levels
Examples:
- Movie ratings: Bad, Okay, Good.
- Game rankings: Bronze, Silver, Gold.
- Survey responses: Strongly Disagree to Strongly Agree.
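In R, ordinal data is usually stored as an ordered factor. The short sketch below uses a made-up vector of four ratings (hypothetical data, only for illustration) to show that the order is kept while the distances between levels carry no meaning.

```r
# Hypothetical ratings, stored with an explicit order Bad < Okay < Good
ratings <- factor(c("Good", "Bad", "Okay", "Good"),
                  levels = c("Bad", "Okay", "Good"),
                  ordered = TRUE)

print(ratings)
print(ratings[1] > ratings[2])  # comparisons respect the order
print(as.integer(ratings))      # rank codes only, not measured distances
print(table(ratings))           # counts are reported in level order
```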

Why Use Ordinal Logistic Regression?

The Problem

You want to predict an ordered outcome (e.g., movie ratings)
Linear regression assumes equal spacing between levels, which is not true for ordinal data
Ordinal logistic regression models the probability of being in or below a certain category

Movie Ratings Example

The Scenario

You ask 100 friends to rate a movie as Bad, Okay, or Good.
You also collect data on:
- X1: Age (in years)
- X2: Number of similar movies watched

The Data

# Simulate movie rating data
set.seed(123)
age <- sample(18:40, 100, replace = TRUE)
movies_watched <- sample(1:20, 100, replace = TRUE)
rating <- cut(0.1 * age + 0.3 * movies_watched + rnorm(100),
              breaks = c(-Inf, 5, 10, Inf),
              labels = c("Bad", "Okay", "Good"),
              ordered_result = TRUE)

movie_data <- data.frame(age, movies_watched, rating)
head(movie_data)

##   age movies_watched rating
## 1  32             14   Okay
## 2  36             17   Okay
## 3  31             14   Okay
## 4  20              3    Bad
## 5  27              8    Bad
## 6  35             14   Okay

Fitting the Ordinal Logistic Model

library(MASS)
model <- polr(rating ~ age + movies_watched, data = movie_data, Hess = TRUE)
summary(model)

## Call:
## polr(formula = rating ~ age + movies_watched, data = movie_data, 
##     Hess = TRUE)
## 
## Coefficients:
##                 Value Std. Error t value
## age            0.1478    0.05036   2.934
## movies_watched 0.5943    0.12155   4.890
## 
## Intercepts:
##           Value   Std. Error t value
## Bad|Okay   8.8388  2.0480     4.3159
## Okay|Good 17.6026  3.3207     5.3009
## 
## Residual Deviance: 77.44229 
## AIC: 85.44229

Interpreting the Results

Coefficients

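Age: For each additional year, the odds of a higher rating are multiplied by \(e^{\beta_1}\) (here \(e^{0.1478} \approx 1.16\), about a 16% increase)
Movies Watched: For each additional movie, the odds of a higher rating are multiplied by \(e^{\beta_2}\) (here \(e^{0.5943} \approx 1.81\), about an 81% increase)
Thresholds (intercepts): The cut points that separate adjacent categories (e.g., “Bad” from “Okay”)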
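The odds ratios can be computed directly from the coefficient table. The sketch below copies the values and t values from the model summary above rather than refitting the model. The p-values use a large-sample normal approximation, which is a common convention and an assumption here, since the summary itself does not report p-values.

```r
# Coefficients and t values copied from the model summary
coefs  <- c(age = 0.1478, movies_watched = 0.5943)
t_vals <- c(age = 2.934,  movies_watched = 4.890)

# Odds ratio: multiplicative change in the odds of a higher rating per unit increase
odds_ratios <- exp(coefs)

# Approximate two-sided p-values (normal approximation, an assumption)
p_approx <- 2 * pnorm(abs(t_vals), lower.tail = FALSE)

print(round(odds_ratios, 3))
print(signif(p_approx, 2))
```

With a fitted model object available, `exp(coef(model))` gives the same odds ratios without retyping the numbers.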

Simplified Example

The Data

Age: 25 years
Movies Watched: 10

The Model

\(\log(\text{Odds}) = -2 + 0.05 \cdot \text{Age} + 0.2 \cdot \text{Movies Watched}\)

Question

What is the probability of each rating (“Bad”, “Okay”, “Good”)?

Step-by-Step Solution

Step 1: Calculate the Odds

\[ \text{Odds} = e^{-2 + 0.05 \cdot 25 + 0.2 \cdot 10} = e^{1.25} \approx 3.49 \]

Step 2: Convert Odds to Probabilities

Use the thresholds to calculate the cumulative probabilities (the probability of being in or below each category)
Subtract consecutive cumulative probabilities to get the probability of each category
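Step 2 can be sketched in R as follows. The text does not give any threshold values, so the two cut points below are hypothetical and chosen only for illustration. The sketch also assumes the cumulative logit form used by `polr()`, where the log-odds of being in or below category \(k\) is the threshold minus the linear predictor.

```r
# Log-odds from the model in the text for Age = 25, Movies Watched = 10
eta <- -2 + 0.05 * 25 + 0.2 * 10

# HYPOTHETICAL cut points (not given in the text, for illustration only)
zeta <- c("Bad|Okay" = 0, "Okay|Good" = 2)

# Cumulative probabilities P(rating <= k) = logistic(zeta_k - eta)
cum_probs <- plogis(zeta - eta)

# Subtract consecutive cumulative probabilities to get each category
probs <- diff(c(0, cum_probs, 1))
names(probs) <- c("Bad", "Okay", "Good")

print(round(probs, 3))
```

Different cut points would give different probabilities, but the three values always add up to 1.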

Visualizing the Results

library(ggplot2)
probs <- data.frame(
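  # Factor with explicit levels so the bars are drawn as Bad, Okay, Good (not alphabetically)
  Rating = factor(c("Bad", "Okay", "Good"), levels = c("Bad", "Okay", "Good")),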
  Probability = c(0.2, 0.5, 0.3)
)

ggplot(probs, aes(x = Rating, y = Probability, fill = Rating)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Predicted Probabilities for Movie Ratings",
       x = "Rating",
       y = "Probability")

Practical Example: What Makes a TikTok Video Go Viral?

# Simulate TikTok data
set.seed(123)
video_length <- rnorm(100, mean = 30, sd = 10)  # seconds
views <- 1000 + 50 * video_length + rnorm(100, mean = 0, sd = 500)

# Analysis
model_tiktok <- lm(views ~ video_length)
plot(video_length, views, main = 'TikTok Video Length vs Views',
     xlab = 'Video Length (seconds)', ylab = 'Views',
     pch = 19, col = 'purple')
abline(model_tiktok, col = 'red', lwd = 2)

Practical Example: Gaming Stats

Question: Does Time Played Predict Win Rate?

# Simulate gaming data
set.seed(123)
hours_played <- runif(100, min = 1, max = 50)
win_rate <- 0.4 + 0.01 * hours_played + rnorm(100, mean = 0, sd = 0.05)

# Plot
plot(hours_played, win_rate, main = 'Gaming Hours vs Win Rate',
     xlab = 'Hours Played', ylab = 'Win Rate',
     pch = 19, col = 'orange')
model_gaming <- lm(win_rate ~ hours_played)
abline(model_gaming, col = 'blue', lwd = 2)

Practical Example: Spotify Trends

Question: What Makes a Song Popular?

# Simulate Spotify data
set.seed(123)
danceability <- runif(100, min = 0.3, max = 0.9)
popularity <- 20 + 70 * danceability + rnorm(100, mean = 0, sd = 10)

# Create categories
genre <- sample(c('Pop', 'Hip-Hop', 'Rock'), 100, replace = TRUE)
colors <- ifelse(genre == 'Pop', 'pink',
                ifelse(genre == 'Hip-Hop', 'purple', 'blue'))

# Plot
plot(danceability, popularity, col = colors, pch = 19,
     main = 'Song Danceability vs Popularity',
     xlab = 'Danceability Score', ylab = 'Popularity Score')
# List genres in a fixed order so each label lines up with its color
legend('topleft', legend = c('Pop', 'Hip-Hop', 'Rock'),
       col = c('pink', 'purple', 'blue'), pch = 19)

Practical Example: Instagram Analytics 📱

Question: When is the Best Time to Post?

# Simulate Instagram engagement data
set.seed(123)
hour <- 0:23
engagement <- 100 + 50 * sin((hour - 12) * pi/12) + rnorm(24, mean = 0, sd = 10)

# Plot
barplot(engagement, names.arg = hour,
        main = 'Instagram Engagement by Hour',
        xlab = 'Hour of Day', ylab = 'Engagement Score',
        col = rainbow(24))

Econometric Insights (Part II)
