2023-03-13

Introduction

  • Simple linear regression is a statistical method that allows us to study the relationship between two continuous variables.
  • It is a useful tool in many fields, including economics, social sciences, and engineering.
  • In this presentation, we will explore the assumptions of simple linear regression, learn how to fit a model, interpret its coefficients and R-squared value, and use it to make predictions.

Ensuring valid results

Before using simple linear regression, we must check that the following assumptions are met:

  • Linearity: There is a linear relationship between the two variables.
  • Independence: The residuals (the difference between the actual and predicted values) are independent of each other.
  • Homoscedasticity: The variance of the residuals is constant across all levels of the predictor variable.
  • Normality: The residuals are normally distributed.

Violations of these assumptions can lead to invalid results and incorrect conclusions. Therefore, it is important to check them before using the model.
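
As a rough sketch of how some of these checks might look in R, the code below uses the study-hours dataset from the example later in this presentation. The object names (res, sw) and the choice of a Shapiro-Wilk test are illustrative assumptions, not the only way to check the assumptions. Independence mostly depends on how the data were collected and cannot be confirmed from these plots alone.

```r
# Example dataset (same values as in the example slides)
study_data <- data.frame(
  Hours = c(5, 3, 8, 4, 6, 7, 2, 9, 1, 10, 5, 4, 2, 8, 6, 7, 3, 9,
            10, 1, 5, 7, 6, 8, 3, 4, 2, 9, 10, 1),
  Score = c(75, 60, 90, 65, 80, 85, 55, 95, 50, 100, 70, 65,
            60, 90, 85, 80, 62, 92, 96, 48, 73, 83, 87, 88,
            55, 68, 59, 90, 94, 46)
)

model <- lm(Score ~ Hours, data = study_data)
res <- residuals(model)

# Linearity and homoscedasticity: look for curves or funnel shapes
plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points in the Q-Q plot should fall close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
sw$p.value
```

Graphical checks are usually read together with the test result rather than replaced by it, especially for a sample of only 30 observations.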

Example of Simple Linear Regression

In this example, we have a dataset of 30 students’ exam scores and hours of study.

We will use simple linear regression to explore the relationship between the two variables.

Next is a scatter plot of the data created with ggplot:

Example of Simple Linear Regression R code

library(ggplot2)
# Create the example dataset
study_data <- data.frame(
  Hours = c(5, 3, 8, 4, 6, 7, 2, 9, 1, 10, 5, 4, 2, 8, 6, 7, 3, 9, 
            10, 1, 5, 7, 6, 8, 3, 4, 2, 9, 10, 1),
  Score = c(75, 60, 90, 65, 80, 85, 55, 95, 50, 100, 70, 65, 
            60, 90, 85, 80, 62, 92, 96, 48, 73, 83, 87, 88, 
            55, 68, 59, 90, 94, 46)
)
# Create the scatter plot
ggplot(data = study_data, aes(x = Hours, y = Score)) +
  geom_point() +
  xlab("Hours of Study") +
  ylab("Exam Score") +
  ggtitle("Scatter Plot of Study Hours and Exam Scores")

Example of Simple Linear Regression - 1

Example of Simple Linear Regression - 2

The formula for the simple linear regression model is:

\[\text{Score} = \beta_0 + \beta_1 \, \text{Hours} + \epsilon\]

We want to estimate the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared errors.
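
Minimizing the sum of squared errors has a closed-form solution. As a sketch, using notation that is an assumption of this note rather than part of the original slides (\(x_i\) for the hours and \(y_i\) for the score of student \(i\), \(\bar{x}\) and \(\bar{y}\) for the sample means, and \(n = 30\)):

\[\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\]

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

The second expression implies that the fitted line always passes through the point \((\bar{x}, \bar{y})\).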

Using Plotly to visualize the data and fitted model

  • We can use a scatter plot with a regression line to visualize the relationship between the predictor variable and the response variable.
  • Plotly is a powerful visualization package that allows us to create interactive and dynamic plots in R.
  • Here’s an example of how to create a scatter plot with a regression line using Plotly:

Plotly plot - 1

library(plotly)
# Fit a linear regression model to the data
model <- lm(Score ~ Hours, data = study_data)
# Extract the intercept and slope coefficients from the model
a <- coef(model)[1]
b <- coef(model)[2]
# Create a scatter plot of the data (named fig so it does not mask base R's plot())
fig <- plot_ly(data = study_data, x = ~Hours, y = ~Score, type = "scatter",
               mode = "markers", marker = list(size = 10), name = "Observed")
# Add the fitted regression line to the plot
fig <- fig %>% add_trace(x = study_data$Hours, y = a + b*study_data$Hours,
                         mode = "lines", line = list(color = "red"),
                         name = "Fitted line")
# Add axis labels and title
fig <- fig %>% layout(xaxis = list(title = "Hours of Study"),
                      yaxis = list(title = "Exam Score"),
                      title = "Scatter Plot of Study Hours and Exam Scores with Regression Line")
# Display the plot
fig

Plotly plot - 2

Estimating the coefficients - 1

  • To fit the simple linear regression model, we use the lm function in R.
  • Here is a scatter plot of the data with a regression line added.

Estimating the coefficients - 2

## `geom_smooth()` using formula = 'y ~ x'
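
The message above is what ggplot2 prints when geom_smooth() is called without an explicit formula. The code behind this plot is not shown in the slides, so the following is only a guess at how a plot like it could be produced. The object name p, the labels, and the se = FALSE setting are assumptions.

```r
library(ggplot2)

# Example dataset (same values as in the example slides)
study_data <- data.frame(
  Hours = c(5, 3, 8, 4, 6, 7, 2, 9, 1, 10, 5, 4, 2, 8, 6, 7, 3, 9,
            10, 1, 5, 7, 6, 8, 3, 4, 2, 9, 10, 1),
  Score = c(75, 60, 90, 65, 80, 85, 55, 95, 50, 100, 70, 65,
            60, 90, 85, 80, 62, 92, 96, 48, 73, 83, 87, 88,
            55, 68, 59, 90, 94, 46)
)

# Scatter plot with a least-squares line drawn by geom_smooth()
p <- ggplot(study_data, aes(x = Hours, y = Score)) +
  geom_point() +
  # method = "lm" fits a straight line; giving the formula explicitly
  # avoids the informational message; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlab("Hours of Study") +
  ylab("Exam Score")

p
```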

Estimating the coefficients - 3

\[\hat{\beta}_0 \approx 45.16\] \[\hat{\beta}_1 \approx 5.40\] These values represent the intercept and slope of the regression line fitted to the dataset shown earlier, respectively.

The intercept represents the expected value of the response variable (exam score) when the predictor variable (study hours) is equal to zero. Because the observed study hours range from 1 to 10, zero lies outside the data, so the intercept is an extrapolation and must be interpreted with caution.

The slope represents the change in the expected value of the response variable for a one-unit increase in the predictor variable. In this case, it represents the change in exam score for each additional hour of study.
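
One way to see these two interpretations in code is to compare the coefficients with predictions from the fitted model. This is a minimal sketch. The names at_zero and one_hour_gain are illustrative, and the comparison points of 5 and 6 hours are arbitrary choices.

```r
# Example dataset (same values as in the example slides)
study_data <- data.frame(
  Hours = c(5, 3, 8, 4, 6, 7, 2, 9, 1, 10, 5, 4, 2, 8, 6, 7, 3, 9,
            10, 1, 5, 7, 6, 8, 3, 4, 2, 9, 10, 1),
  Score = c(75, 60, 90, 65, 80, 85, 55, 95, 50, 100, 70, 65,
            60, 90, 85, 80, 62, 92, 96, 48, 73, 83, 87, 88,
            55, 68, 59, 90, 94, 46)
)

model <- lm(Score ~ Hours, data = study_data)

# Named vector with elements "(Intercept)" and "Hours"
b <- coef(model)
b

# Intercept: the fitted score at Hours = 0 (outside the observed 1 to 10 range)
at_zero <- predict(model, newdata = data.frame(Hours = 0))

# Slope: the difference between fitted scores one hour apart
one_hour_gain <- diff(predict(model, newdata = data.frame(Hours = c(5, 6))))

c(at_zero = unname(at_zero), one_hour_gain = unname(one_hour_gain))
```

Because the model is a straight line, the one-hour difference is the same wherever it is measured, not only between 5 and 6 hours.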

Assessing the goodness of fit

  • The coefficient of determination, or \(R^2\), measures how well the predictor variable (study hours) explains the variation in the response variable (exam score).
  • \(R^2\) ranges from 0 to 1, with 1 indicating perfect fit and 0 indicating no fit.
  • The residual plot shows the difference between the actual and predicted values of the response variable (exam score) for each observation.
  • In a well-fitted model, the residuals should be randomly distributed around zero, with no discernible pattern.
  • A pattern in the residuals suggests that the model may not be appropriate for the data.
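
To make the definition of \(R^2\) concrete, the sketch below computes it three ways: from the residual and total sums of squares, from summary(), and as the squared correlation (an identity that holds for simple linear regression with an intercept). The variable names are illustrative.

```r
# Example dataset (same values as in the example slides)
study_data <- data.frame(
  Hours = c(5, 3, 8, 4, 6, 7, 2, 9, 1, 10, 5, 4, 2, 8, 6, 7, 3, 9,
            10, 1, 5, 7, 6, 8, 3, 4, 2, 9, 10, 1),
  Score = c(75, 60, 90, 65, 80, 85, 55, 95, 50, 100, 70, 65,
            60, 90, 85, 80, 62, 92, 96, 48, 73, 83, 87, 88,
            55, 68, 59, 90, 94, 46)
)

model <- lm(Score ~ Hours, data = study_data)

# Variation left unexplained by the model
ss_res <- sum(residuals(model)^2)
# Total variation of the scores around their mean
ss_tot <- sum((study_data$Score - mean(study_data$Score))^2)

r2_manual <- 1 - ss_res / ss_tot
r2_lm <- summary(model)$r.squared
r2_cor <- cor(study_data$Hours, study_data$Score)^2

c(manual = r2_manual, from_summary = r2_lm, squared_correlation = r2_cor)
```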

R code for Assessing the goodness of fit

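# Passing the lm object works because ggplot2 converts it with fortify(),
# which adds the .fitted and .resid columns used below
ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  # Reference line at zero makes patterns in the residuals easier to spot
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted Values") +
  ylab("Residuals") +
  ggtitle("Residual Plot")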

R code for Assessing the goodness of fit (Result)

Making predictions with the model

  • One of the main purposes of a regression model is to make predictions about new data.

  • Given a new value of the predictor variable (study hours), we can use the estimated coefficients from the model to predict the value of the response variable (exam score).

  • Here’s an example of how to make a prediction using the simple linear regression model:

Example of how to make a prediction using the simple linear regression model

new_hours <- 5.5
new_score <- predict(model, 
                     newdata = data.frame(Hours = new_hours))
paste0("For ", new_hours, 
       " hours of study, the predicted exam score is ", new_score)
## [1] "For 5.5 hours of study, the predicted exam score is 74.8666666666666"

Example of how to make a prediction using the simple linear regression model (explained)

  • This code creates a new variable new_hours with a value of 5.5, which represents the number of hours of study for which we want to make a prediction.
  • The predict function is called here with two arguments: the model object (model) and the new data (newdata). The newdata argument must be a data frame whose column names match the predictor variables in the model formula (here, Hours).
  • The output of the predict function is the predicted value of the response variable (exam score) for the new value of the predictor variable (5.5 hours of study).
  • In this example, the predicted exam score for 5.5 hours of study is 74.87.
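
The same prediction can be reproduced by hand from the coefficients, and predict() can also report the uncertainty around it. The sketch below is an illustration only. The 95% level and the names by_hand and pred are assumptions, not part of the original example.

```r
# Example dataset (same values as in the example slides)
study_data <- data.frame(
  Hours = c(5, 3, 8, 4, 6, 7, 2, 9, 1, 10, 5, 4, 2, 8, 6, 7, 3, 9,
            10, 1, 5, 7, 6, 8, 3, 4, 2, 9, 10, 1),
  Score = c(75, 60, 90, 65, 80, 85, 55, 95, 50, 100, 70, 65,
            60, 90, 85, 80, 62, 92, 96, 48, 73, 83, 87, 88,
            55, 68, 59, 90, 94, 46)
)

model <- lm(Score ~ Hours, data = study_data)
new_data <- data.frame(Hours = 5.5)

# Prediction by hand: intercept + slope * new value
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["Hours"]] * new_data$Hours

# Same prediction with a 95% prediction interval for a single new student;
# the result is a matrix with columns fit, lwr, and upr
pred <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)
pred

round(by_hand, 2)
```

The prediction interval is wider than a confidence interval for the mean score would be, because it also accounts for the scatter of individual students around the line.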

Summarizing the findings - 1

In this presentation, we explored simple linear regression, which is a statistical method used to model the relationship between a response variable and a single predictor variable.

We used a dataset of 30 students’ exam scores and hours of study to demonstrate how to fit a simple linear regression model, estimate the coefficients, assess the goodness of fit, and make predictions using the model.

We found that the number of hours of study was strongly associated with exam scores, as indicated by the high coefficient of determination (\(R^2 \approx 0.95\)).

Summarizing the findings - 2

Our regression model predicted that each additional hour of study is associated with an average increase of about 5.40 points in the exam score.

We also checked the residual plot to ensure that the model was appropriate for the data, and found no discernible pattern in the residuals.

Overall, our findings suggest that the number of hours of study is an important factor in predicting exam scores, and that a simple linear regression model can be used to make accurate predictions about new data.