2024-10-31

Slide 1: What is Simple Linear Regression?

  • Simple linear regression is a statistical method for summarizing the relationship between two continuous quantitative variables. It uses one variable, the independent variable (x), to predict the other, the dependent variable (y). The line is fitted by minimizing the sum of squared residuals, where a residual is the difference between an observed value of y and the value the line predicts.

Slide 2:

Equation of the Regression Line: \[ Y = \beta_0 + \beta_1 X + \epsilon \]

  • \(Y\): Dependent variable (what we want to predict)
  • \(X\): Independent variable (predictor)
  • \(\beta_0\): Intercept of the line (value of \(Y\) when \(X = 0\))
  • \(\beta_1\): Slope of the line (change in \(Y\) for a one-unit change in \(X\))
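  • \(\epsilon\): Error term (the part of \(Y\) the line does not explain; in fitted data it is estimated by the residual, the difference between the observed and predicted values)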
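
As a sketch of how the two coefficients are usually estimated: ordinary least squares picks the values that minimize the sum of squared residuals, which for a single predictor has a closed form. The hats, the bars (sample means), the index \(i\), and the sample size \(n\) are notation assumed here, not taken from the slides.

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]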

Slide 3: The previous equation in the simplified y = mx + b form, where m corresponds to \(\beta_1\) and b to \(\beta_0\) (the error term is dropped)

\[ y = mx + b \]

  • \(y\) = Dependent variable (response)
  • \(x\) = Independent variable (predictor)
  • \(m\) = Slope of the line (change in \(y\) for a unit change in \(x\))
  • \(b\) = Y-intercept (the value of \(y\) when \(x = 0\))

Slide 4: We will explore what this looks like with a data set that predicts test scores out of 100 (dependent variable) from the number of hours spent reading (independent variable).

Slide 5: Fitting the Model

To fit a simple linear regression model to our data, we first load the data into a data frame and then call R’s lm() function.

# Load data
data <- data.frame(
  TestScores = c(60, 65, 70, 75, 80, 85, 88, 90, 92, 95, 96, 97, 98, 99, 100),
  HoursReading = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
)

# Fit the linear model
model <- lm(TestScores ~ HoursReading, data = data)
summary(model)
## 
## Call:
## lm(formula = TestScores ~ HoursReading, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -6.4   -2.9    0.2    3.3    4.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63.600      2.146   29.64 2.54e-13 ***
## HoursReading    2.800      0.236   11.86 2.40e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.95 on 13 degrees of freedom
## Multiple R-squared:  0.9154, Adjusted R-squared:  0.9089 
## F-statistic: 140.7 on 1 and 13 DF,  p-value: 2.399e-08
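
The coefficient estimates in the output above can be reproduced by hand. The following is a minimal sketch, assuming the same data frame as in the slide; the names `slope_by_hand` and `intercept_by_hand` are introduced here for illustration only. For a single predictor, the least-squares slope equals the sample covariance of the two variables divided by the sample variance of the predictor, and the intercept follows from the two sample means.

```r
# Same data as in the slide, repeated so this block runs on its own
data <- data.frame(
  TestScores = c(60, 65, 70, 75, 80, 85, 88, 90, 92, 95, 96, 97, 98, 99, 100),
  HoursReading = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
)

model <- lm(TestScores ~ HoursReading, data = data)

# Closed-form least-squares estimates (illustrative names, not from the slides)
slope_by_hand <- cov(data$HoursReading, data$TestScores) / var(data$HoursReading)
intercept_by_hand <- mean(data$TestScores) - slope_by_hand * mean(data$HoursReading)

# Compare the hand computation with the estimates stored in the fitted model
print(coef(model))
print(c(intercept = intercept_by_hand, slope = slope_by_hand))
```

Both lines should print the same pair of values as the Estimate column of the summary, an intercept of 63.6 and a slope of 2.8.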

Slide 6: Model Summary:

Call: lm(formula = TestScores ~ HoursReading, data = data)

Residuals:
   Min     1Q Median     3Q    Max
  -6.4   -2.9    0.2    3.3    4.8

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    63.600      2.146   29.64 2.54e-13 ***
HoursReading    2.800      0.236   11.86 2.40e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.95 on 13 degrees of freedom
Multiple R-squared: 0.9154, Adjusted R-squared: 0.9089
F-statistic: 140.7 on 1 and 13 DF, p-value: 2.399e-08
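
The goodness-of-fit figures in the summary can also be recomputed from the residuals. This is a rough sketch under the assumption that the data frame is the one from Slide 5; the names `sse`, `sst`, `r_squared`, and `rse` are illustrative and do not appear in the slides.

```r
# Same data as in the slides, repeated so this block runs on its own
data <- data.frame(
  TestScores = c(60, 65, 70, 75, 80, 85, 88, 90, 92, 95, 96, 97, 98, 99, 100),
  HoursReading = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
)

model <- lm(TestScores ~ HoursReading, data = data)

# Residuals: observed test scores minus the scores predicted by the line
res <- residuals(model)

# Residual and total sums of squares
sse <- sum(res^2)
sst <- sum((data$TestScores - mean(data$TestScores))^2)

# R-squared: share of the variation in test scores explained by the line
r_squared <- 1 - sse / sst

# Residual standard error: typical size of a residual, using n - 2 degrees of freedom
rse <- sqrt(sse / df.residual(model))

print(round(c(r_squared = r_squared, rse = rse), 4))
```

The two values should agree with the "Multiple R-squared" and "Residual standard error" entries of the summary after rounding.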

Slide 7:

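  • The fitted line shows a positive linear relationship: as the hours spent reading increase, the test scores increase as well. The equation of the line is y = 2.800x + 63.600, so each additional hour of reading is associated with a predicted increase of 2.8 points, and we can use the equation to predict the score for other numbers of hours.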
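
Using the equation for prediction can be sketched with R's predict() function. The example assumes the data frame from Slide 5; the data frame `new_hours` and the values 5.5 and 10.5 are hypothetical inputs chosen for illustration.

```r
# Same data as in the slides, repeated so this block runs on its own
data <- data.frame(
  TestScores = c(60, 65, 70, 75, 80, 85, 88, 90, 92, 95, 96, 97, 98, 99, 100),
  HoursReading = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
)

model <- lm(TestScores ~ HoursReading, data = data)

# Hypothetical reading times; the column name must match the predictor in the model
new_hours <- data.frame(HoursReading = c(5.5, 10.5))

# Predicted test scores from the fitted line, 63.6 + 2.8 * HoursReading
predicted <- predict(model, newdata = new_hours)
print(predicted)
```

Predictions are most trustworthy inside the observed range of 1 to 15 hours. Note that the line itself does not know scores are capped at 100: at 14 and 15 hours it already gives 102.8 and 105.6.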

Slide 8: Scatter plot with the fitted regression line

library(ggplot2)


ggplot(data, aes(x = HoursReading, y = TestScores)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "lightblue") +
  labs(title = "Test Scores vs Hours Spent Reading",
       x = "Hours Spent Reading",
       y = "Test Scores (out of 100)") +
  xlim(0, 15)
## `geom_smooth()` using formula = 'y ~ x'

Slide 9: A histogram to visualize the distribution of the test scores

# Create a histogram of test scores
ggplot(data, aes(x = TestScores)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Test Scores", 
       x = "Test Scores (out of 100)", 
       y = "Frequency") +
  theme_minimal()

Slide 10: Interactive version of the scatter plot

library(plotly)
# Create an interactive plot using plotly
plot_ly(data, x = ~HoursReading, y = ~TestScores, type = 'scatter', mode = 'markers') %>%
  add_lines(x = ~HoursReading, y = fitted(model), name = 'Linear Fit', line = list(color = 'lightblue')) %>%
  layout(title = "Test Scores vs. Hours Reading",
         xaxis = list(title = "Hours Reading"),
         yaxis = list(title = "Test Scores"))

Slide 11: In Conclusion

  • Linear regression is a widely used statistical tool in many different fields. Once a line has been fitted to a data set, it can be used to predict the dependent variable for values of the independent variable that were not observed.