Simple Linear Regression
Import Library
library(ggplot2) # for plotting
library(dplyr) # for data wrangling
library(tidyr) # to convert data into tidy format
library(moderndive) # for the evals dataset and regression helpers
library(skimr) # for statistical summary
Load Data
We will create a subset of the evals dataset from the moderndive package.
eval_ch5 <- moderndive::evals |>
  # Only keep the required columns
  select(ID, score, age, bty_avg)
head(eval_ch5)

# A tibble: 6 × 4
     ID score   age bty_avg
  <int> <dbl> <int>   <dbl>
1     1   4.7    36       5
2     2   4.1    36       5
3     3   3.9    36       5
4     4   4.8    36       5
5     5   4.6    59       3
6     6   4.3    59       3
glimpse(eval_ch5)

Rows: 463
Columns: 4
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.…
$ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…
$ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333,…
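What do these variables mean?
ID: An identification variable that distinguishes the 463 courses in the dataset.
score: A numerical variable of the instructor's average teaching evaluation score, on a scale from 1 (lowest) to 5 (highest).
age: A numerical variable of the instructor's age.
bty_avg: A numerical variable of the instructor's average "beauty" rating, on a scale from 1 (lowest) to 10 (highest).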
Data Summary
summary(eval_ch5[, -1])

     score            age           bty_avg
 Min.   :2.300   Min.   :29.00   Min.   :1.667
 1st Qu.:3.800   1st Qu.:42.00   1st Qu.:3.167
 Median :4.300   Median :48.00   Median :4.333
 Mean   :4.175   Mean   :48.37   Mean   :4.418
 3rd Qu.:4.600   3rd Qu.:57.00   3rd Qu.:5.500
 Max.   :5.000   Max.   :73.00   Max.   :8.167
Statistical summary
eval_ch5 |>
  # check only outcome & explanatory variables
  select(score, bty_avg) |>
  skim()

| Name | select(eval_ch5, score, b… |
| Number of rows | 463 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0 | 1 | 4.17 | 0.54 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 | ▁▁▅▇▇ |
| bty_avg | 0 | 1 | 4.42 | 1.53 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 | ▃▇▇▃▂ |
Data Visualisation
Univariate Analysis
Analyzing each variable individually
For continuous variables
ggplot(eval_ch5, aes(x = age)) +
  geom_histogram(bins = 30)

Score distribution

ggplot(eval_ch5, aes(x = bty_avg)) +
  geom_histogram(bins = 10, color = "white")

The observed bty_avg values range from roughly 1.7 to 8.2.

ggplot(eval_ch5, aes(x = score)) +
  geom_histogram(bins = 30, color = "white")

# Scatterplot showing the relationship between score and bty_avg
ggplot(eval_ch5, aes(x = bty_avg, y = score)) +
  geom_point()

Because many data points overlap, the plain scatterplot hides how many observations share each position and makes the relationship between the two variables hard to see. Adding jitter spreads the overlapping points apart.

ggplot(eval_ch5, aes(x = bty_avg, y = score)) +
  geom_jitter()

ggplot(eval_ch5, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE)

eval_ch5 |>
  get_correlation(formula = score ~ bty_avg)

# A tibble: 1 × 1
    cor
  <dbl>
1 0.187
There is a weak but positive relationship between score and bty_avg (a correlation of about 0.187).
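As a cross-check on the correlation reported above, the same figure can be reproduced with base R. This is only a minimal sketch: it assumes the moderndive package is installed and reads the columns straight from `moderndive::evals` rather than from `eval_ch5`, and the names `r` and `r_manual` are introduced here purely for illustration.

```r
library(moderndive) # provides the evals dataset used in this document

# Pearson correlation computed directly with base R
r <- cor(evals$score, evals$bty_avg)
round(r, 3)

# The same quantity from its definition:
# covariance divided by the product of the two standard deviations
r_manual <- cov(evals$score, evals$bty_avg) /
  (sd(evals$score) * sd(evals$bty_avg))

# Both routes should agree up to floating-point tolerance
all.equal(r, r_manual)
```

Writing the coefficient out this way makes it clear that the correlation is a unit-free measure: rescaling either variable changes the covariance and the standard deviations by the same factor.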
# Fit regression model
score_mdl <- lm(score ~ bty_avg, data = eval_ch5)
# Get regression table
get_regression_table(score_mdl)

# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    3.88      0.076     51.0        0    3.73     4.03
2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
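Reading the estimates from the regression table, the fitted line is approximately score = 3.88 + 0.067 × bty_avg. In other words, each additional point of bty_avg is associated with an average increase of about 0.067 in score. This describes an association, not a causal effect. The sketch below shows one way to turn the fitted model into a prediction. It assumes moderndive is installed and refits the model on `moderndive::evals` so that it runs on its own. The instructor with `bty_avg = 5` and the names `new_instructor`, `pred`, and `by_hand` are hypothetical, chosen only for illustration.

```r
library(moderndive) # provides the evals dataset used in this document

# Refit the same simple linear regression so this example is self-contained
score_mdl <- lm(score ~ bty_avg, data = evals)

# Hypothetical instructor with an average beauty rating of 5
new_instructor <- data.frame(bty_avg = 5)

# Prediction from the fitted model
pred <- as.numeric(predict(score_mdl, newdata = new_instructor))

# The same prediction by hand: intercept + slope * bty_avg
b <- coef(score_mdl)
by_hand <- unname(b["(Intercept)"] + b["bty_avg"] * 5)

c(predicted = pred, by_hand = by_hand)
```

Both values should come out at roughly 4.2, which matches plugging 5 into the rounded equation above.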