Econometric Insights (Part II)

EC3133

Some Insights into Statistical Concepts

Exercise 1: Understanding Random Variables with TikTok Views

Your latest TikTok dance video usually gets between 100 and 1000 views. Each time you post, it’s like rolling a die - you never know exactly how many views you’ll get. Suppose you post the same dance video 5 times and get the following numbers of views: 300, 750, 200, 500, and 450.

  1. What’s your average (expected) number of views?

Average views:

\[ E[\text{Views}] = \frac{300 + 750 + 200 + 500 + 450}{5} = 440 \text{ views} \]

  2. How spread out are your views? (Calculate the variance)

Sample variance (how scattered the views are), using \(n-1 = 4\) in the denominator:

\[ \begin{aligned} Var(\text{Views}) &= \frac{1}{n-1}\sum(\text{views} - 440)^2 \\ &= \frac{(-140)^2 + (310)^2 + (-240)^2 + (60)^2 + (10)^2}{4} \\ &= 44,250 \end{aligned} \]

Your videos typically get around 440 views, but they can swing up or down by about \(\sqrt{44,250} \approx 210\) views (the standard deviation).
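As a quick check, here is a minimal R sketch (assuming the five view counts above); note that R’s var() and sd() use the sample (n − 1) versions:

# Views from the five posts
views <- c(300, 750, 200, 500, 450)

mean(views)  # expected views: 440
var(views)   # sample variance: 44250
sd(views)    # standard deviation: about 210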

Exercise 2: Probability Distributions with Gaming Scores

You’re playing your favorite mobile game. Your high scores tend to follow what we call a “normal distribution” (like a bell curve).

Your average score is 1000 points, with a standard deviation of 100 points.

  1. What is the probability of getting above 1200 points?

Let’s use the properties of the normal distribution:

\[ P(X > 1200) = P\left( Z > \frac{1200-1000}{100} \right) = P(Z > 2) = 1 - P(Z \leq 2) = 0.0228 \]

You’ll score above 1200 about 2.3% of the time - that’s like getting a rare item in a loot box!

  2. What is the probability of getting scores between 900 and 1100?

\[ P(900 < X < 1100) = P(-1 < Z < 1) = 0.6826 \]

About 68% of your games will score in this range - these are your typical matches.
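You can verify both normal-distribution probabilities in R with pnorm(), using the mean of 1000 and standard deviation of 100 from above (a minimal sketch):

# P(X > 1200)
1 - pnorm(1200, mean = 1000, sd = 100)  # about 0.0228

# P(900 < X < 1100)
pnorm(1100, mean = 1000, sd = 100) - pnorm(900, mean = 1000, sd = 100)  # about 0.6827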

Exercise 3: Regression with Spotify Streaming

You’re a music producer trying to understand what makes a song popular. You collect data on \(x\) = Song length (in minutes) and \(y\) = Number of streams (in thousands).

For 5 songs, you have:

Song length \(x\) (minutes): 2.5, 3.0, 3.5, 4.0, 4.5

Streams \(y\) (thousands): 100, 150, 180, 160, 140

Find the best-fit line to predict streams from song length.

Let’s find the line \(y = bx + m\) (slope \(b\), intercept \(m\)) using the least squares method:

\[ b = \frac{ \sum\limits_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y}) }{ \sum\limits_{i=1}^n (x_i-\bar{x}) (x_i-\bar{x}) } \]

\[ b = \frac{ \left( \frac{1}{n}\sum x_iy_i \right) - \bar{x} \bar{y} }{ \left( \frac{1}{n}\sum x_ix_i \right) - \bar{x} \bar{x}} \]

\[ m = \bar{y} - b\bar{x} \]

Plugging in the numbers:

\[ \begin{aligned} b &= \frac{\frac{1}{5}(2.5\cdot100 + 3.0\cdot150 + 3.5\cdot180 + 4.0\cdot160 + 4.5\cdot140) - (3.5)(146)}{\frac{1}{5}(2.5^2 + 3.0^2 + 3.5^2 + 4.0^2 + 4.5^2) - (3.5)^2} \\ &= 18 \end{aligned} \]

\[ m = 146 - 18 \cdot 3.5 = 83 \]

Your prediction formula is:

\[ \text{Expected Streams} = 18(\text{song length}) + 83 \]

Notice how the streams go up until about 3.5 minutes, then start dropping? This might mean there’s a “sweet spot” for song length - not too short and not too long. A straight line can’t capture that curvature, so the slope of 18 only describes the average trend across these five songs.
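If you want to check the hand calculation, a minimal R sketch (assuming the five songs above) recovers the same slope and intercept with lm():

# Song lengths (minutes) and streams (thousands)
length_min <- c(2.5, 3.0, 3.5, 4.0, 4.5)
streams <- c(100, 150, 180, 160, 140)

coef(lm(streams ~ length_min))  # intercept 83, slope 18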

Exercise 4: Hypothesis Testing with Social Media Engagement

You’re wondering which platform gets more engagement. You post the same content on both:

Instagram likes: 105, 98, 120, 95, 102

TikTok likes: 115, 108, 125, 112, 110

Is TikTok really giving you more likes, or is it just random chance?

Use a 5% significance level.

Let’s use a paired t-test:

  1. Find the differences (TikTok - Instagram): 10, 10, 5, 17, 8

  2. Calculate mean difference (\(\bar{d}\)) and standard deviation (\(s_d\)):

\[ \bar{d} = 10 \]

\[ s_d = 4.42 \]

  3. Calculate t-statistic:

\[ t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{10}{4.42/\sqrt{5}} \approx 5.06 \]

Next, find the t-critical values for 4 degrees of freedom at common significance levels:

For two-tailed tests:

At 95% confidence level (\(\alpha = 0.05\)): ±2.776

At 99% confidence level (\(\alpha = 0.01\)): ±4.604

For one-tailed tests:

At 95% confidence level (\(\alpha = 0.05\)): 2.132

At 99% confidence level (\(\alpha = 0.01\)): 3.747

  4. Compare with \(t_{critical}\) (±2.776 for a two-tailed test with 4 degrees of freedom)

Since 5.06 > 2.776, we reject the null hypothesis at the 5% significance level: TikTok really does seem to get more likes, and it’s not just random luck, i.e. TikTok is your engagement winner.
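The same paired t-test can be run in R (a minimal sketch using the like counts above); t.test() with paired = TRUE reproduces the t-statistic:

instagram <- c(105, 98, 120, 95, 102)
tiktok <- c(115, 108, 125, 112, 110)

# Paired t-test on the TikTok - Instagram differences
t.test(tiktok, instagram, paired = TRUE)  # t is about 5.06 on 4 degrees of freedom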

A digression/review: Two-Tailed Test

A two-tailed test examines the possibility of a relationship in both directions.

Key Characteristics:

Hypotheses:

\[\begin{aligned}H_0: \mu &= \mu_0 \\H_1: \mu &\neq \mu_0\end{aligned}\]

Critical Values

For degrees of freedom = 4 and α = 0.05:

# Calculate critical value for two-tailed test
qt(0.975, df = 4)  # 0.975 because we want the upper tail probability of 0.025
## [1] 2.776445

One-Tailed Test

A one-tailed test examines the possibility of a relationship in only one direction.

Key Characteristics:

Hypotheses:

Right-tailed: \[\begin{aligned}H_0: \mu &\leq \mu_0 \\H_1: \mu &> \mu_0\end{aligned}\]

Left-tailed: \[\begin{aligned}H_0: \mu &\geq \mu_0 \\H_1: \mu &< \mu_0\end{aligned}\]

Critical Values

For degrees of freedom = 4 and α = 0.05:

# Calculate critical value for one-tailed test
qt(0.95, df = 4)  # 0.95 because we want the upper tail probability of 0.05
## [1] 2.131847

When to Use Which Test?

Use One-Tailed Test When: you have a clear directional hypothesis before seeing the data (for example, you only care whether TikTok gets more likes, not fewer).

Use Two-Tailed Test When: a difference in either direction would matter, or you have no prior reason to expect a particular direction. When in doubt, the two-tailed test is the more conservative choice.

Visual Comparison of Critical Values

library(ggplot2)

x <- seq(-4, 4, length.out = 1000)
df <- data.frame(
  x = x,
  y = dt(x, df = 4)
)

ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  geom_vline(xintercept = c(-2.776, 2.776), color = "red", linetype = "dashed") +
  geom_vline(xintercept = 2.132, color = "blue", linetype = "dashed") +
  annotate("text", x = 3.2, y = 0.3, label = "Two-tailed\nα = 0.05", color = "red") +
  annotate("text", x = 2.5, y = 0.2, label = "One-tailed\nα = 0.05", color = "blue") +
  labs(title = "T-Distribution (df = 4)",
       x = "t-value",
       y = "Density") +
  theme_minimal()

Summary

A two-tailed test splits \(\alpha\) across both tails, so its critical value (±2.776 at \(\alpha = 0.05\) with df = 4) is larger in magnitude than the one-tailed value (2.132). Choose the test that matches the question you posed before collecting the data.

More Insights: Understanding Ordinal Data

What is Ordinal Data?

Ordinal data are categorical responses with a natural ordering but no fixed distance between categories - for example, movie ratings of Bad < Okay < Good.

Why Use Ordinal Logistic Regression?

The Problem

Treating ordered categories as plain numbers (1, 2, 3) assumes the gaps between them are equal, while treating them as unordered categories throws away the ordering. Ordinal logistic regression (the proportional odds model) uses the ordering without assuming equal spacing.

Movie Ratings Example

The Scenario

You want to predict whether a viewer rates a movie Bad, Okay, or Good from their age and the number of movies they have watched.

The Data

We simulate ratings for 100 viewers:

# Simulate movie rating data
set.seed(123)
age <- sample(18:40, 100, replace = TRUE)
movies_watched <- sample(1:20, 100, replace = TRUE)
rating <- cut(0.1 * age + 0.3 * movies_watched + rnorm(100),
              breaks = c(-Inf, 5, 10, Inf),
              labels = c("Bad", "Okay", "Good"),
              ordered_result = TRUE)

movie_data <- data.frame(age, movies_watched, rating)
head(movie_data)
##   age movies_watched rating
## 1  32             14   Okay
## 2  36             17   Okay
## 3  31             14   Okay
## 4  20              3    Bad
## 5  27              8    Bad
## 6  35             14   Okay

Fitting the Ordinal Logistic Model

library(MASS)
model <- polr(rating ~ age + movies_watched, data = movie_data, Hess = TRUE)
summary(model)
## Call:
## polr(formula = rating ~ age + movies_watched, data = movie_data, 
##     Hess = TRUE)
## 
## Coefficients:
##                 Value Std. Error t value
## age            0.1478    0.05036   2.934
## movies_watched 0.5943    0.12155   4.890
## 
## Intercepts:
##           Value   Std. Error t value
## Bad|Okay   8.8388  2.0480     4.3159
## Okay|Good 17.6026  3.3207     5.3009
## 
## Residual Deviance: 77.44229 
## AIC: 85.44229

Interpreting the Results

Coefficients

Both coefficients are positive: older viewers and viewers who have watched more movies tend to give higher ratings. Each additional movie watched raises the log-odds of being in a higher rating category by about 0.59 (an odds ratio of roughly \(e^{0.59} \approx 1.8\)), holding age constant, and both effects are statistically significant (t values of 2.93 and 4.89).
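A minimal sketch of how to turn these estimates into odds ratios and predicted probabilities, assuming the model object fitted above (the age of 25 and 10 movies watched are just illustrative values):

# Odds ratios for a one-unit increase in each predictor
exp(coef(model))

# Predicted rating probabilities for a viewer aged 25 who has watched 10 movies
predict(model, newdata = data.frame(age = 25, movies_watched = 10), type = "probs")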

Simplified Example

The Data

Suppose a viewer is 25 years old and has watched 10 movies.

The Model

A simplified (made-up) model for the log-odds of a higher rating:

\[ \log(\text{Odds}) = -2 + 0.05 \cdot \text{Age} + 0.2 \cdot \text{MoviesWatched} \]

Question

What odds does the model imply for this viewer, and what probability do they correspond to?

Step-by-Step Solution

Step 1: Calculate the Odds

\[ \text{Odds} = e^{-2 + 0.05 \cdot 25 + 0.2 \cdot 10} = e^{1.25} \approx 3.49 \]

Step 2: Convert Odds to Probabilities

\[ P = \frac{\text{Odds}}{1 + \text{Odds}} = \frac{3.49}{1 + 3.49} \approx 0.78 \]
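The same two steps in R (a minimal sketch of the arithmetic above):

log_odds <- -2 + 0.05 * 25 + 0.2 * 10  # 1.25
odds <- exp(log_odds)                  # about 3.49
odds / (1 + odds)                      # probability, about 0.78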

Visualizing the Results

library(ggplot2)
probs <- data.frame(
  # Keep the ratings in their natural (ordinal) order on the x-axis
  Rating = factor(c("Bad", "Okay", "Good"), levels = c("Bad", "Okay", "Good")),
  Probability = c(0.2, 0.5, 0.3)  # illustrative predicted probabilities
)

ggplot(probs, aes(x = Rating, y = Probability, fill = Rating)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Predicted Probabilities for Movie Ratings",
       x = "Rating",
       y = "Probability")

Practical Example: What Makes a TikTok Video Go Viral?

# Simulate TikTok data
set.seed(123)
video_length <- rnorm(100, mean = 30, sd = 10)  # seconds
views <- 1000 + 50 * video_length + rnorm(100, mean = 0, sd = 500)

# Analysis
model_tiktok <- lm(views ~ video_length)
plot(video_length, views, main = 'TikTok Video Length vs Views',
     xlab = 'Video Length (seconds)', ylab = 'Views',
     pch = 19, col = 'purple')
abline(model_tiktok, col = 'red', lwd = 2)

Practical Example: Gaming Stats

Question: Does Time Played Predict Win Rate?

# Simulate gaming data
set.seed(123)
hours_played <- runif(100, min = 1, max = 50)
win_rate <- 0.4 + 0.01 * hours_played + rnorm(100, mean = 0, sd = 0.05)

# Plot
plot(hours_played, win_rate, main = 'Gaming Hours vs Win Rate',
     xlab = 'Hours Played', ylab = 'Win Rate',
     pch = 19, col = 'orange')
model_gaming <- lm(win_rate ~ hours_played)
abline(model_gaming, col = 'blue', lwd = 2)

Practical Example: Instagram Analytics 📱

Question: When is the Best Time to Post?

# Simulate Instagram engagement data
set.seed(123)
hour <- 0:23
engagement <- 100 + 50 * sin((hour - 12) * pi/12) + rnorm(24, mean = 0, sd = 10)

# Plot
barplot(engagement, names.arg = hour,
        main = 'Instagram Engagement by Hour',
        xlab = 'Hour of Day', ylab = 'Engagement Score',
        col = rainbow(24))