SimpleLR

Simple Linear Regression

Import Libraries

library(ggplot2)
library(dplyr)       # for data wrangling
library(tidyr)       # to convert data into tidy format
library(moderndive)  # for dataset
library(skimr)       # for statistical summary

Load Data

We will create a subset of the evals dataset from the moderndive package.

eval_ch5 <- moderndive::evals |> 
  # Only keep the required columns
  select(ID, score, age, bty_avg)

head(eval_ch5)
# A tibble: 6 × 4
     ID score   age bty_avg
  <int> <dbl> <int>   <dbl>
1     1   4.7    36       5
2     2   4.1    36       5
3     3   3.9    36       5
4     4   4.8    36       5
5     5   4.6    59       3
6     6   4.3    59       3
glimpse(eval_ch5)
Rows: 463
Columns: 4
$ ID      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ score   <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.…
$ age     <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…
$ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333,…

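What do these variables mean?

  1. ID: An identification variable used to distinguish between the courses, numbered 1 through 463, in the dataset.

  2. score: A numerical variable of the instructor’s average teaching score for the course, where 1 is lowest and 5 is highest.

  3. age: A numerical variable of the instructor’s age.

  4. bty_avg: A numerical variable of the instructor’s average "beauty" rating, where 1 is lowest and 10 is highest.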

Data Summary

summary(eval_ch5[, -1])
     score            age           bty_avg     
 Min.   :2.300   Min.   :29.00   Min.   :1.667  
 1st Qu.:3.800   1st Qu.:42.00   1st Qu.:3.167  
 Median :4.300   Median :48.00   Median :4.333  
 Mean   :4.175   Mean   :48.37   Mean   :4.418  
 3rd Qu.:4.600   3rd Qu.:57.00   3rd Qu.:5.500  
 Max.   :5.000   Max.   :73.00   Max.   :8.167  

Statistical summary

eval_ch5 |> 
  # check only outcome & explanatory variables
  select(score, bty_avg) |> 
  skim()
Data summary
Name               select(eval_ch5, score, b…
Number of rows     463
Number of columns  2
_______________________
Column type frequency:
  numeric          2
________________________
Group variables    None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd    p0   p25   p50  p75  p100  hist
score                  0              1  4.17  0.54  2.30  3.80  4.30  4.6  5.00  ▁▁▅▇▇
bty_avg                0              1  4.42  1.53  1.67  3.17  4.33  5.5  8.17  ▃▇▇▃▂
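
The mean and standard deviation reported by skim() can be cross-checked with a plain dplyr summary. The following is a minimal sketch, assuming eval_ch5 is built as shown earlier; the summary column names (mean_score and so on) are illustrative and not part of the original analysis.

```r
library(dplyr)
library(moderndive)

# Rebuild the subset so this block runs on its own
eval_ch5 <- moderndive::evals |>
  select(ID, score, age, bty_avg)

# Mean and standard deviation of the outcome and the explanatory variable
# (the summary column names below are illustrative)
eval_summary <- eval_ch5 |>
  summarize(
    mean_score   = mean(score),
    sd_score     = sd(score),
    mean_bty_avg = mean(bty_avg),
    sd_bty_avg   = sd(bty_avg)
  )

eval_summary
```

The values should agree with the mean and sd columns of the skim() output once rounded to two decimals.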

Data Visualisation

Univariate Analysis

Analyzing each variable individually

For continuous variables

ggplot(eval_ch5, aes(x = age))+
  geom_histogram(bins = 30)

Distribution of the beauty score (bty_avg)

ggplot(eval_ch5, aes(x = bty_avg))+
  geom_histogram(bins = 10, color = "white")

The observed bty_avg values fall on a scale of roughly 1 to 8 (minimum 1.667, maximum 8.167).

ggplot(eval_ch5, aes(x = score))+
  geom_histogram(bins = 30, color = "white")

# Scatterplot showing the relationship between score and bty_avg
ggplot(eval_ch5, aes(x = bty_avg, y = score))+
  geom_point()

Many data points overlap, so the plain scatterplot hides how many observations sit at each position and makes any relationship between the two variables hard to see. We will redraw it with jittered points.

ggplot(eval_ch5, aes(x = bty_avg, y = score))+
  geom_jitter()
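
Overplotting can be handled in more than one way. The sketch below shows two options under stated assumptions: semi-transparent points, and jitter with explicit offsets and a fixed seed so the plot is reproducible. The alpha, width, height, and seed values are illustrative choices, not values from the original analysis.

```r
library(ggplot2)
library(dplyr)
library(moderndive)

# Rebuild the subset so this block runs on its own
eval_ch5 <- moderndive::evals |>
  select(ID, score, age, bty_avg)

# Option 1: semi-transparent points, darker areas mark overlapping observations
p_alpha <- ggplot(eval_ch5, aes(x = bty_avg, y = score)) +
  geom_point(alpha = 0.2)

# Option 2: jitter with small explicit offsets and a fixed seed
# (width, height, and seed are illustrative values)
p_jitter <- ggplot(eval_ch5, aes(x = bty_avg, y = score)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.05, seed = 42))

p_alpha
p_jitter
```

Jitter only changes where points are drawn; the underlying data, and anything computed from it such as the correlation or the regression fit, are unaffected.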

ggplot(eval_ch5, aes(x = bty_avg, y = score))+
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)

eval_ch5 |> 
  get_correlation(formula = score ~ bty_avg)
# A tibble: 1 × 1
    cor
  <dbl>
1 0.187

There is a weak but positive relationship between score and bty_avg (correlation coefficient of 0.187).
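
The same coefficient can be computed with base R as a cross-check. This is a minimal sketch, assuming eval_ch5 is built as shown earlier; cor() returns the Pearson correlation by default.

```r
library(dplyr)
library(moderndive)

# Rebuild the subset so this block runs on its own
eval_ch5 <- moderndive::evals |>
  select(ID, score, age, bty_avg)

# Pearson correlation between the outcome and the explanatory variable
r <- cor(eval_ch5$score, eval_ch5$bty_avg)
round(r, 3)

# Correlation is symmetric: swapping the variables gives the same value
all.equal(r, cor(eval_ch5$bty_avg, eval_ch5$score))
```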

# Fit regression model
score_mdl <- lm(score ~ bty_avg, data = eval_ch5)
# Get regression table
get_regression_table(score_mdl)
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    3.88      0.076     51.0        0    3.73     4.03 
2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
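
Read as an equation, the table gives the fitted line: predicted score = 3.88 + 0.067 × bty_avg. The sketch below, assuming the same eval_ch5 and score_mdl as above, extracts the coefficients and predicts the score for two hypothetical beauty ratings one unit apart; the object names new_obs and pred are illustrative.

```r
library(dplyr)
library(moderndive)

# Rebuild the subset and the model so this block runs on its own
eval_ch5 <- moderndive::evals |>
  select(ID, score, age, bty_avg)
score_mdl <- lm(score ~ bty_avg, data = eval_ch5)

# Intercept and slope of the fitted line
coef(score_mdl)

# Predicted scores for two hypothetical beauty ratings one unit apart
new_obs <- data.frame(bty_avg = c(4, 5))
pred <- predict(score_mdl, newdata = new_obs)
pred

# The difference between the two predictions equals the slope
diff(pred)
```

In other words, a one-unit increase in bty_avg is associated with an average increase of about 0.067 in score. The p_value of 0 in the table is a rounded figure, since get_regression_table() rounds to three decimals by default; it indicates a value below 0.001, not exactly zero.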