Rationale

Bivariate regression is a way to test whether two continuous variables move together in a predictable way. One is the independent variable (what might influence something else), and the other is the dependent variable (the outcome). The regression tells us whether changes in the IV are linked to changes in the DV, and if so, how strong and in what direction that relationship is.

In this case, the IV is the average weekly hours participants spent watching television (video), and the DV is their estimate of how many people in the U.S. work in law enforcement, medicine, or emergency response (pct). The idea is that if TV really does shape how people see the world, then the more someone watches, the higher their estimates might be. Regression is a good fit here because it doesn’t just show us whether there’s a relationship. It also shows us how much viewing actually explains the way people think about these occupations.

Hypothesis

The more hours participants watch TV each week, the higher their estimates of people working in law enforcement, medicine, or emergency response will be. In other words, as video goes up, pct should also go up

Variables & method

Independent variable (IV): video - average weekly hours of TV watched. (Continuous)

Dependent variable (DV): pct - participants’ estimate of the percentage of the U.S. workforce in law enforcement, medicine, or emergency response. (Continuous)

Method: I ran a bivariate regression. Graphs such as histograms were used to see the distributions, a scatterplot with a regression line to visualize the relationship, and checked leverage values to spot any influential outliers.

Results & discussion

The histograms showed that both variables had a decent spread. The scatterplot made the relationship visible; the regression line tilted upward, which means that as weekly TV hours increased, so did participants’ estimates.

The regression confirmed this: the slope for video was positive, and the p-value suggested the relationship wasn’t just random chance. At the same time, the r-squared was low, which means TV hours only explain a small part of why people guessed the way they did. Most of the differences in their estimates may come from other factors.

Outlier checks showed a few cases with higher leverage, but nothing that really changed the overall pattern. Overall, the results supported my hypothesis: the heavier TV watchers were more likely to think that a larger share of the U.S. workforce is in law enforcement, medicine, or emergency response. TV still isn’t the whole story, but it plays a role. However, there are clearly other influences on how people form these perceptions.

Leverage estimates for 10 largest outliers
Row # Leverage
164 0.0305
360 0.0207
359 0.0194
371 0.0174
72 0.0162
201 0.0159
265 0.0159
392 0.0148
44 0.0144
97 0.0144
Regression Analysis Results
Coefficient Estimates
Term Estimate Std. Error t p-value
(Intercept) 23.2076 2.2026 10.5363 0.0000
IV 0.8440 0.0630 13.4056 0.0000
Model Fit Statistics
Overall Regression Performance
R-squared Adj. R-squared F-statistic df (model) df (residual) Residual Std. Error
0.3111 0.3093 179.7107 1.0000 398.0000 9.7373

Code:

##################################################
# 1. Install and load required packages
##################################################
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")

library(tidyverse)
library(gt)
library(gtExtras)


##################################################
# 2. Read in the dataset
##################################################
# Replace "YOURFILENAME.csv" with the actual filename
mydata <- read.csv("Cultivation.csv")


# ################################################
# # (Optional) 2b. Remove specific cases by row number
# ################################################
# # Example: remove rows 10 and 25
# rows_to_remove <- c(10, 25) # Edit and uncomment this line
# mydata <- mydata[-rows_to_remove, ] # Uncomment this line


##################################################
# 3. Define dependent variable (DV) and independent variable (IV)
##################################################
# Replace YOURDVNAME and YOURIVNAME with actual column names
mydata$DV <- mydata$pct
mydata$IV <- mydata$video


##################################################
# 4. Explore distributions of DV and IV
##################################################
# Make a histogram for DV
DVGraph <- ggplot(mydata, aes(x = DV)) + 
  geom_histogram(color = "black", fill = "#1f78b4")

# Make a histogram for IV
IVGraph <- ggplot(mydata, aes(x = IV)) + 
  geom_histogram(color = "black", fill = "#1f78b4")


##################################################
# 5. Fit and summarize initial regression model
##################################################
# Suppress scientific notation
options(scipen = 999)

# Fit model
myreg <- lm(DV ~ IV, data = mydata)

# Model summary
summary(myreg)


##################################################
# 6. Visualize regression and check for bivariate outliers
##################################################
# Create scatterplot with regression line as a ggplot object
RegressionPlot <- ggplot(mydata, aes(x = IV, y = DV)) +
  geom_point(color = "#1f78b4") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Scatterplot of DV vs IV with Regression Line",
    x = "Independent Variable (IV)",
    y = "Dependent Variable (DV)"
  ) +
  theme_minimal()


##################################################
# 7. Check for potential outliers (high leverage points)
##################################################
# Calculate leverage values
hat_vals <- hatvalues(myreg)

# Rule of thumb: leverage > 2 * (number of predictors + 1) / n may be influential
threshold <- 2 * (length(coef(myreg)) / nrow(mydata))

# Create table showing 10 largest leverage values
outliers <- data.frame(
  Obs = 1:nrow(mydata),
  Leverage = hatvalues(myreg)
) %>%
  arrange(desc(Leverage)) %>%
  slice_head(n = 10)

# Format as a gt table
outliers_table <- outliers %>%
  gt() %>%
  tab_header(
    title = "Leverage estimates for 10 largest outliers"
  ) %>%
  cols_label(
    Obs = "Row #",
    Leverage = "Leverage"
  ) %>%
  fmt_number(
    columns = Leverage,
    decimals = 4
  )


##################################################
# 8. Create nicely formatted regression results tables
##################################################
# --- Coefficient-level results ---
reg_results <- as.data.frame(coef(summary(myreg))) %>%
  tibble::rownames_to_column("Term") %>%
  rename(
    Estimate = Estimate,
    `Std. Error` = `Std. Error`,
    t = `t value`,
    `p-value` = `Pr(>|t|)`
  )

reg_table <- reg_results %>%
  gt() %>%
  tab_header(
    title = "Regression Analysis Results",
    subtitle = "Coefficient Estimates"
  ) %>%
  fmt_number(
    columns = c(Estimate, `Std. Error`, t, `p-value`),
    decimals = 4
  )


# --- Model fit statistics ---
reg_summary <- summary(myreg)

fit_stats <- tibble::tibble(
  `R-squared` = reg_summary$r.squared,
  `Adj. R-squared` = reg_summary$adj.r.squared,
  `F-statistic` = reg_summary$fstatistic[1],
  `df (model)` = reg_summary$fstatistic[2],
  `df (residual)` = reg_summary$fstatistic[3],
  `Residual Std. Error` = reg_summary$sigma
)

fit_table <- fit_stats %>%
  gt() %>%
  tab_header(
    title = "Model Fit Statistics",
    subtitle = "Overall Regression Performance"
  ) %>%
  fmt_number(
    columns = everything(),
    decimals = 4
  )


##################################################
# 9. Final print of key graphics and tables
##################################################
DVGraph
IVGraph
RegressionPlot
outliers_table
reg_table
fit_table