Regression Analysis

This made-up data represents the results of a not real study that tracked the weekly hours spent watching television content for 400 volunteer participants that were recruited from a random sample of all U.S. adults. The file name used was Cultivation.cvs. The dependent variable is pct and the independent variable is video. See the Example dataset codebook for details about both variables.

This made-up study examined the relationship between video exposure (IV) which is the average weekly hours each research participant spent watching TV during the six-month analysis and the percentage (DV) of the U.S. population estimated by each research participant to be employed full time in either law enforcement/criminal justice (police, investigators, prosecutors, criminal defense lawyers, security guards, etc.) or medicine (doctors, nurses, hospital personnel, medical interns, etc.) or emergency response services (firefighters, paramedics, search-and-rescue personnel, etc.).

The script’s opening lines of code install (if not installed already) and activate the tidyverse and ggextras packages, then read the data from the Cultivate.csv file. The code assumes the data file has been stored in the same subdirectory as the script.

# 1. Install and load required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")

library(tidyverse)
library(gt)
library(gtExtras)

For the reasons described above, the DV is specified as pct, while the IV is specified as video. Next, each variable’s distribution is examined for outliers, data entry errors and other anomalies.

# 2. Read in the dataset
# Replace "Cultivation.csv" with your actual CSV filename
mydata <- read.csv("Cultivation.csv")

# 3. Define dependent variable (DV) and independent variable (IV)
mydata$DV <- mydata$pct
mydata$IV <- mydata$video

# 4. Explore distributions of DV and IV
DVGraph <- ggplot(mydata, aes(x = DV)) + 
  geom_histogram(color = "black", fill = "#1f78b4") +
  labs(title = "Distribution of DV (pct)")

IVGraph <- ggplot(mydata, aes(x = IV)) + 
  geom_histogram(color = "black", fill = "#1f78b4") +
  labs(title = "Distribution of IV (video)")

print(DVGraph)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

print(IVGraph)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graphs show there are potential outliers in the DV and IV variables. In the DV graph, there are some values below 20 that occurs occassionally, but infrequently. Those values may indicate low-end outliers. In the IV graph, there are both very high and very low values, which may suggest outliers.

Because linear regression is being used, these outliers could lead to incorrect conclusions about the relationship between the dependent and independent variables. Creating the regression model and plotting its results will yield further insights.

Let’s begin with the regression results, which this code will produce and display:

# 5. Fit and summarize regression model
options(scipen = 999)
myreg <- lm(DV ~ IV, data = mydata)
print(summary(myreg))

## 
## Call:
## lm(formula = DV ~ IV, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5917  -6.1607   0.6423   6.7433  23.8483 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 23.20761    2.20264   10.54 <0.0000000000000002 ***
## IV           0.84400    0.06296   13.41 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.737 on 398 degrees of freedom
## Multiple R-squared:  0.3111, Adjusted R-squared:  0.3093 
## F-statistic: 179.7 on 1 and 398 DF,  p-value: < 0.00000000000000022

The information suggests a strong, positive relationship between IV and DV. The IV helps explain about 1/3 of the variability in DV. The p-value for the F-statistic is < 0.00000000000000022, which is well below the 0.05 threshold is for statistical significance. The model’s R-squared value is 0.3111, which shows the IV explains about 31 percent of the variation in DV. Other factors could be at play though.

# 6. Visualize regression
RegressionPlot <- ggplot(mydata, aes(x = IV, y = DV)) +
  geom_point(color = "#1f78b4") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Scatterplot of DV (pct) vs IV (video) with Regression Line",
    x = "Video",
    y = "Pct"
  ) +
  theme_minimal()
print(RegressionPlot)

## `geom_smooth()` using formula = 'y ~ x'

The plot shows a positive relationship between DV and IV. As video (IV) increases, pct (DV) also increases. The coefficient, according to charts farther down in the code is 0.844.

# 7. Outlier check
hat_vals <- hatvalues(myreg)
threshold <- 2 * (length(coef(myreg)) / nrow(mydata))

outliers <- data.frame(
  Obs = 1:nrow(mydata),
  Leverage = hatvalues(myreg)
) %>%
  arrange(desc(Leverage)) %>%
  slice_head(n = 10)

outliers_table <- outliers %>%
  gt() %>%
  tab_header(title = "Leverage estimates for 10 largest outliers") %>%
  cols_label(Obs = "Row #", Leverage = "Leverage") %>%
  fmt_number(columns = Leverage, decimals = 4)

outliers_table

Row #	Leverage
Leverage estimates for 10 largest outliers
164	0.0305
360	0.0207
359	0.0194
371	0.0174
72	0.0162
201	0.0159
265	0.0159
392	0.0148
44	0.0144
97	0.0144

The values in the outlier table are over 0.01. With values that high, we need to take a closer look at them to see if they are influential. Row 164 is very high at 0.0305.

# 8. Regression results tables
reg_results <- as.data.frame(coef(summary(myreg))) %>%
  tibble::rownames_to_column("Term") %>%
  rename(
    Estimate = Estimate,
    `Std. Error` = `Std. Error`,
    t = `t value`,
    `p-value` = `Pr(>|t|)`
  )

Based on the results of the t-value and p-value, which are close to 0, shows the results aren’t just because of chance.

reg_table <- reg_results %>%
  gt() %>%
  tab_header(title = "Regression Analysis Results", subtitle = "Coefficient Estimates") %>%
  fmt_number(columns = c(Estimate, `Std. Error`, t, `p-value`), decimals = 4)

reg_summary <- summary(myreg)

fit_stats <- tibble::tibble(
  `R-squared` = reg_summary$r.squared,
  `Adj. R-squared` = reg_summary$adj.r.squared,
  `F-statistic` = reg_summary$fstatistic[1],
  `df (model)` = reg_summary$fstatistic[2],
  `df (residual)` = reg_summary$fstatistic[3],
  `Residual Std. Error` = reg_summary$sigma
)

fit_table <- fit_stats %>%
  gt() %>%
  tab_header(title = "Model Fit Statistics", subtitle = "Overall Regression Performance") %>%
  fmt_number(columns = everything(), decimals = 4)

reg_table

Term	Estimate	Std. Error	t	p-value
Regression Analysis Results
Coefficient Estimates
(Intercept)	23.2076	2.2026	10.5363	0.0000
IV	0.8440	0.0630	13.4056	0.0000

fit_table

R-squared	Adj. R-squared	F-statistic	df (model)	df (residual)	Residual Std. Error
Model Fit Statistics
Overall Regression Performance
0.3111	0.3093	179.7107	1.0000	398.0000	9.7373

Overall, the overall analysis shows a significant positive relationship between IV and DV. The IV in this data shows a strong predictor of DV, at nearly 1/3 of the variation.

Regression Analysis

DeAnn Hays

2025-09-23