Rationale

Cultivation theory, developed by George Gerbner, proposes that consistent and heavy exposure to media shapes how individuals perceive reality. The theory argues that the more time people spend with media, the more their understanding of the world reflects the portrayals they encounter rather than actual conditions. Because media often highlight extremes such as conflict, glamour, or negativity, heavy consumers are more likely to adopt distorted views, believing those portrayals are common in everyday life. This effect is not immediate but accumulates gradually, cultivating perceptions and attitudes over time.

Applying cultivation theory to social media use, we predict that greater weekly screen time will correspond with higher scores on an unhappiness scale. If social media functions like television in Gerbner’s model, then heavy users will absorb more idealized images, negative comparisons, and curated content, which may cultivate feelings of inadequacy or dissatisfaction. In contrast, lighter users who are less exposed to this constant flow of content are expected to report lower unhappiness scores, suggesting a more positive or balanced perception of their own lives.

Hypothesis

Participants who spend more hours weekly on social media will report higher levels of unhappiness, while participants who spend fewer hours on social media will report lower levels of unhappiness.

Variables & method

The independent variable in this analysis is weekly social media usage, measured in hours from participants’ screen-time reports; because hours can take any non-negative numeric value, the variable is treated as continuous. The dependent variable is unhappiness, measured with a 50-point self-report scale on which higher scores indicate greater unhappiness; it is likewise treated as continuous.


The analysis uses simple linear regression to test the relationship between social media usage and unhappiness. Weekly social media hours enter the model as the predictor (independent variable), and unhappiness scores enter as the outcome (dependent variable). The regression estimates whether increases in weekly social media use are associated with higher unhappiness scores and quantifies the strength and direction of that relationship.
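In R, this model reduces to a single lm() call. A minimal sketch (assuming the data frame and column names used in the full script below, mydata with Hours and Unhappiness):

myreg <- lm(Unhappiness ~ Hours, data = mydata)
summary(myreg)  # reports the slope, intercept, t-tests, and R-squared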

Results & discussion

The regression analysis found a significant positive relationship between social media use and unhappiness (b = 0.42, t(398) = 13.70, p < .001). The regression line slopes upward, showing that predicted unhappiness rises as weekly social media use increases. The R-squared value was 0.3203, indicating that about 32 percent of the variance in unhappiness scores was explained by social media hours. Removing high-leverage outliers did not meaningfully change the results. Overall, the findings support the hypothesis that heavier social media use is associated with higher unhappiness among women aged 18–34.
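From the coefficient table below, the fitted equation is: predicted unhappiness = 3.9888 + 0.4206 × weekly hours. As a worked example, a participant reporting 20 hours of weekly use has a predicted score of about 3.99 + (0.42 × 20) ≈ 12.4 on the 50-point scale, while a participant reporting 5 hours is predicted at about 6.1.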

Leverage estimates for 10 largest outliers

Row #   Leverage
164     0.0303
360     0.0199
359     0.0190
371     0.0178
72      0.0170
265     0.0164
201     0.0152
392     0.0151
97      0.0151
44      0.0149

With n = 400 and one predictor, the rule-of-thumb cutoff used in the code is 2(1 + 1)/400 = 0.01, so every row listed here exceeds it, with row 164 at roughly three times the threshold.
Regression Analysis Results
Coefficient Estimates

Term          Estimate   Std. Error   t         p-value
(Intercept)   3.9888     1.5531       2.5683    0.0106
IV            0.4206     0.0307       13.6953   0.0000

Model Fit Statistics
Overall Regression Performance

R-squared   Adj. R-squared   F-statistic   df (model)   df (residual)   Residual Std. Error
0.3203      0.3186           187.5609     1            398             4.1619

Code:

##################################################
# 1. Install and load required packages
##################################################
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")

library(tidyverse)
library(gt)
library(gtExtras)


##################################################
# 2. Read in the dataset
##################################################
# RegressionData.csv is created by the "Data Set" snippet at the end of this
# document; run that snippet once before running this script
mydata <- read.csv("RegressionData.csv")
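
# Optional quick check (an addition for illustration, not part of the original
# script): confirm the Hours and Unhappiness columns were read in as numeric
head(mydata)
str(mydata)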


# ################################################
# # (Optional) 2b. Remove specific cases by row number
# ################################################
# # Example: remove rows 10 and 25
# rows_to_remove <- c(10, 25) # Edit and uncomment this line
# mydata <- mydata[-rows_to_remove, ] # Uncomment this line


##################################################
# 3. Define dependent variable (DV) and independent variable (IV)
##################################################
# Copy the dataset's columns into the generic DV/IV names used in the rest of the script
mydata$DV <- mydata$Unhappiness
mydata$IV <- mydata$Hours


##################################################
# 4. Explore distributions of DV and IV
##################################################
# Make a histogram for DV
DVGraph <- ggplot(mydata, aes(x = DV)) + 
  geom_histogram(color = "black", fill = "#1f78b4")

# Make a histogram for IV
IVGraph <- ggplot(mydata, aes(x = IV)) + 
  geom_histogram(color = "black", fill = "#1f78b4")
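
# Note: geom_histogram() defaults to bins = 30 and prints a message saying so;
# set binwidth (e.g., binwidth = 1) for one bar per hour or per scale point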


##################################################
# 5. Fit and summarize initial regression model
##################################################
# Suppress scientific notation
options(scipen = 999)

# Fit model
myreg <- lm(DV ~ IV, data = mydata)

# Model summary
summary(myreg)
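
# Optional (an addition for illustration): 95% confidence intervals for the
# intercept and slope
confint(myreg)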


##################################################
# 6. Visualize regression and check for bivariate outliers
##################################################
# Create scatterplot with regression line as a ggplot object
RegressionPlot <- ggplot(mydata, aes(x = IV, y = DV)) +
  geom_point(color = "#1f78b4") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Scatterplot of DV vs IV with Regression Line",
    x = "Independent Variable (IV)",
    y = "Dependent Variable (DV)"
  ) +
  theme_minimal()
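
# Optional (an addition for illustration): base-R diagnostic plots --
# residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(2, 2))
plot(myreg)
par(mfrow = c(1, 1))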


##################################################
# 7. Check for potential outliers (high leverage points)
##################################################
# Calculate leverage values
hat_vals <- hatvalues(myreg)

# Rule of thumb: leverage > 2 * (number of predictors + 1) / n may be influential;
# length(coef(myreg)) equals the number of predictors plus the intercept
threshold <- 2 * length(coef(myreg)) / nrow(mydata)

# Create table showing the 10 largest leverage values, reusing hat_vals from above
# (the names of hat_vals preserve original row numbers even if step 2b removed cases)
outliers <- data.frame(
  Obs = as.integer(names(hat_vals)),
  Leverage = hat_vals
) %>%
  arrange(desc(Leverage)) %>%
  slice_head(n = 10)
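
# Optional (an addition for illustration): list the cases whose leverage
# exceeds the rule-of-thumb threshold computed above
high_leverage <- which(hat_vals > threshold)
print(high_leverage)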

# Format as a gt table
outliers_table <- outliers %>%
  gt() %>%
  tab_header(
    title = "Leverage estimates for 10 largest outliers"
  ) %>%
  cols_label(
    Obs = "Row #",
    Leverage = "Leverage"
  ) %>%
  fmt_number(
    columns = Leverage,
    decimals = 4
  )
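
# Optional (an addition for illustration): Cook's distance is a complementary
# influence diagnostic; cases above 4/n are commonly flagged for a closer look
cooks <- cooks.distance(myreg)
which(cooks > 4 / nrow(mydata))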


##################################################
# 8. Create nicely formatted regression results tables
##################################################
# --- Coefficient-level results ---
reg_results <- as.data.frame(coef(summary(myreg))) %>%
  tibble::rownames_to_column("Term") %>%
  rename(
    # "Estimate" and "Std. Error" already carry their final names;
    # only the t and p columns need renaming
    t = `t value`,
    `p-value` = `Pr(>|t|)`
  )

reg_table <- reg_results %>%
  gt() %>%
  tab_header(
    title = "Regression Analysis Results",
    subtitle = "Coefficient Estimates"
  ) %>%
  fmt_number(
    columns = c(Estimate, `Std. Error`, t, `p-value`),
    decimals = 4
  )


# --- Model fit statistics ---
reg_summary <- summary(myreg)

fit_stats <- tibble::tibble(
  `R-squared` = reg_summary$r.squared,
  `Adj. R-squared` = reg_summary$adj.r.squared,
  `F-statistic` = reg_summary$fstatistic[1],
  `df (model)` = reg_summary$fstatistic[2],
  `df (residual)` = reg_summary$fstatistic[3],
  `Residual Std. Error` = reg_summary$sigma
)

fit_table <- fit_stats %>%
  gt() %>%
  tab_header(
    title = "Model Fit Statistics",
    subtitle = "Overall Regression Performance"
  ) %>%
  fmt_number(
    columns = c(`R-squared`, `Adj. R-squared`, `F-statistic`, `Residual Std. Error`),
    decimals = 4
  ) %>%
  fmt_number(
    columns = c(`df (model)`, `df (residual)`),
    decimals = 0  # degrees of freedom are whole numbers
  )


##################################################
# 9. Final print of key graphics and tables
##################################################
DVGraph
IVGraph
RegressionPlot
outliers_table
reg_table
fit_table
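
# Optional (an addition for illustration): export any gt table to a file,
# e.g., as a self-contained HTML page
gtsave(reg_table, "reg_table.html")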

Data Set

# Read the data from the web
FetchedData <- read.csv("https://github.com/drkblake/Data/raw/refs/heads/main/Cultivation2.csv")
# Save the data on your computer as RegressionData.csv (run this snippet once,
# before the main script above)
write.csv(FetchedData, "RegressionData.csv", row.names = FALSE)
# Remove the fetched copy from the environment
rm(FetchedData)