This made-up data represents the results of a not real study that tracked the weekly hours spent watching television content for 400 volunteer participants that were recruited from a random sample of all U.S. adults. The file name used was Cultivation.cvs. The dependent variable is pct and the independent variable is video. See the Example dataset codebook for details about both variables.
This made-up study examined the relationship between video exposure (IV) which is the average weekly hours each research participant spent watching TV during the six-month analysis and the percentage (DV) of the U.S. population estimated by each research participant to be employed full time in either law enforcement/criminal justice (police, investigators, prosecutors, criminal defense lawyers, security guards, etc.) or medicine (doctors, nurses, hospital personnel, medical interns, etc.) or emergency response services (firefighters, paramedics, search-and-rescue personnel, etc.).
The script’s opening lines of code install (if not installed already) and activate the tidyverse and ggextras packages, then read the data from the Cultivate.csv file. The code assumes the data file has been stored in the same subdirectory as the script.
# 1. Install and load required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")
library(tidyverse)
library(gt)
library(gtExtras)
For the reasons described above, the DV is specified as pct, while the IV is specified as video. Next, each variable’s distribution is examined for outliers, data entry errors and other anomalies.
# 2. Read in the dataset
# Replace "Cultivation.csv" with your actual CSV filename
mydata <- read.csv("Cultivation.csv")
# 3. Define dependent variable (DV) and independent variable (IV)
mydata$DV <- mydata$pct
mydata$IV <- mydata$video
# 4. Explore distributions of DV and IV
DVGraph <- ggplot(mydata, aes(x = DV)) +
geom_histogram(color = "black", fill = "#1f78b4") +
labs(title = "Distribution of DV (pct)")
IVGraph <- ggplot(mydata, aes(x = IV)) +
geom_histogram(color = "black", fill = "#1f78b4") +
labs(title = "Distribution of IV (video)")
print(DVGraph)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
print(IVGraph)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graphs show there are potential outliers in the DV and IV variables. In the DV graph, there are some values below 20 that occurs occassionally, but infrequently. Those values may indicate low-end outliers. In the IV graph, there are both very high and very low values, which may suggest outliers.
Because linear regression is being used, these outliers could lead to incorrect conclusions about the relationship between the dependent and independent variables. Creating the regression model and plotting its results will yield further insights.
Let’s begin with the regression results, which this code will produce and display:
# 5. Fit and summarize regression model
options(scipen = 999)
myreg <- lm(DV ~ IV, data = mydata)
print(summary(myreg))
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.5917 -6.1607 0.6423 6.7433 23.8483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.20761 2.20264 10.54 <0.0000000000000002 ***
## IV 0.84400 0.06296 13.41 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.737 on 398 degrees of freedom
## Multiple R-squared: 0.3111, Adjusted R-squared: 0.3093
## F-statistic: 179.7 on 1 and 398 DF, p-value: < 0.00000000000000022
The information suggests a strong, positive relationship between IV and DV. The IV helps explain about 1/3 of the variability in DV. The p-value for the F-statistic is < 0.00000000000000022, which is well below the 0.05 threshold is for statistical significance. The model’s R-squared value is 0.3111, which shows the IV explains about 31 percent of the variation in DV. Other factors could be at play though.
# 6. Visualize regression
RegressionPlot <- ggplot(mydata, aes(x = IV, y = DV)) +
geom_point(color = "#1f78b4") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Scatterplot of DV (pct) vs IV (video) with Regression Line",
x = "Video",
y = "Pct"
) +
theme_minimal()
print(RegressionPlot)
## `geom_smooth()` using formula = 'y ~ x'
The plot shows a positive relationship between DV and IV. As video (IV) increases, pct (DV) also increases. The coefficient, according to charts farther down in the code is 0.844.
# 7. Outlier check
hat_vals <- hatvalues(myreg)
threshold <- 2 * (length(coef(myreg)) / nrow(mydata))
outliers <- data.frame(
Obs = 1:nrow(mydata),
Leverage = hatvalues(myreg)
) %>%
arrange(desc(Leverage)) %>%
slice_head(n = 10)
outliers_table <- outliers %>%
gt() %>%
tab_header(title = "Leverage estimates for 10 largest outliers") %>%
cols_label(Obs = "Row #", Leverage = "Leverage") %>%
fmt_number(columns = Leverage, decimals = 4)
outliers_table
Leverage estimates for 10 largest outliers | |
Row # | Leverage |
---|---|
164 | 0.0305 |
360 | 0.0207 |
359 | 0.0194 |
371 | 0.0174 |
72 | 0.0162 |
201 | 0.0159 |
265 | 0.0159 |
392 | 0.0148 |
44 | 0.0144 |
97 | 0.0144 |
The values in the outlier table are over 0.01. With values that high, we need to take a closer look at them to see if they are influential. Row 164 is very high at 0.0305.
# 8. Regression results tables
reg_results <- as.data.frame(coef(summary(myreg))) %>%
tibble::rownames_to_column("Term") %>%
rename(
Estimate = Estimate,
`Std. Error` = `Std. Error`,
t = `t value`,
`p-value` = `Pr(>|t|)`
)
Based on the results of the t-value and p-value, which are close to 0, shows the results aren’t just because of chance.
reg_table <- reg_results %>%
gt() %>%
tab_header(title = "Regression Analysis Results", subtitle = "Coefficient Estimates") %>%
fmt_number(columns = c(Estimate, `Std. Error`, t, `p-value`), decimals = 4)
reg_summary <- summary(myreg)
fit_stats <- tibble::tibble(
`R-squared` = reg_summary$r.squared,
`Adj. R-squared` = reg_summary$adj.r.squared,
`F-statistic` = reg_summary$fstatistic[1],
`df (model)` = reg_summary$fstatistic[2],
`df (residual)` = reg_summary$fstatistic[3],
`Residual Std. Error` = reg_summary$sigma
)
fit_table <- fit_stats %>%
gt() %>%
tab_header(title = "Model Fit Statistics", subtitle = "Overall Regression Performance") %>%
fmt_number(columns = everything(), decimals = 4)
reg_table
Regression Analysis Results | ||||
Coefficient Estimates | ||||
Term | Estimate | Std. Error | t | p-value |
---|---|---|---|---|
(Intercept) | 23.2076 | 2.2026 | 10.5363 | 0.0000 |
IV | 0.8440 | 0.0630 | 13.4056 | 0.0000 |
fit_table
Model Fit Statistics | |||||
Overall Regression Performance | |||||
R-squared | Adj. R-squared | F-statistic | df (model) | df (residual) | Residual Std. Error |
---|---|---|---|---|---|
0.3111 | 0.3093 | 179.7107 | 1.0000 | 398.0000 | 9.7373 |
Overall, the overall analysis shows a significant positive relationship between IV and DV. The IV in this data shows a strong predictor of DV, at nearly 1/3 of the variation.