Erdene Enkh, 28 January 2024

IMB; Multivariate Analysis;

Homework 3

Ivy League Online Courses Regression Analysis

# Read the dataset
mydata <- read.csv("~/Bootcamp/my_dataset_HW3.csv", header = TRUE)

# Display the first few rows of the dataset
head(mydata, 10)
##    Institution Course.Number Launch.Date
## 1         MITx        6.002x  09/05/2012
## 2         MITx         6.00x  09/26/2012
## 3         MITx        3.091x  10/09/2012
## 4     HarvardX         CS50x  10/15/2012
## 5     HarvardX        PH207x  10/15/2012
## 6         MITx         6.00x  02/04/2013
## 7         MITx        3.091x  02/05/2013
## 8         MITx        14.73x  02/12/2013
## 9         MITx         8.02x  02/18/2013
## 10    HarvardX         ER22x  03/02/2013
##                                                                      Course.Title
## 1                                                        Circuits and Electronics
## 2                                Introduction to Computer Science and Programming
## 3                                           Introduction to Solid State Chemistry
## 4                                                Introduction to Computer Science
## 5  Health in Numbers: Quantitative Methods in Clinical and Public Health Research
## 6                                Introduction to Computer Science and Programming
## 7                                           Introduction to Solid State Chemistry
## 8                                                The Challenges of Global Poverty
## 9                                                       Electricity and Magnetism
## 10                                                                        Justice
##                                                                                                  Instructors
## 1                                                                                             Khurram Afridi
## 2                                                                    Eric Grimson, John Guttag, Chris Terman
## 3                                                                                               Michael Cima
## 4                                      David Malan, Nate Hardison, Rob Bowden, Tommy MacWilliam, Zamyla Chan
## 5                                                                         Earl Francis Cook, Marcello Pagano
## 6                                                                                              Larry Rudolph
## 7                                                                                               Michael Cima
## 8                                                                             Esther Duflo, Abhijit Banerjee
## 9  Walter Lewin, John Belcher, Peter Dourmashkin, Ricardo Abbate, Saif Rayyan, George Stephans, Isaac Chuang
## 10                                                                                            Michael Sandel
##                                          Course.Subject Year Honor.Code.Certificates
## 1     Science, Technology, Engineering, and Mathematics    1                       1
## 2                                      Computer Science    1                       1
## 3     Science, Technology, Engineering, and Mathematics    1                       1
## 4                                      Computer Science    1                       1
## 5                Government, Health, and Social Science    1                       1
## 6                                      Computer Science    1                       1
## 7     Science, Technology, Engineering, and Mathematics    1                       1
## 8                Government, Health, and Social Science    1                       1
## 9     Science, Technology, Engineering, and Mathematics    1                       1
## 10 Humanities, History, Design, Religion, and Education    1                       1
##    Participants..Course.Content.Accessed. Audited....50..Course.Content.Accessed. Certified
## 1                                   36105                                    5431      3003
## 2                                   62709                                    8949      5783
## 3                                   16663                                    2855      2082
## 4                                  129400                                   12888      1439
## 5                                   52521                                   10729      5058
## 6                                   65380                                    6473      3313
## 7                                    8270                                     838       547
## 8                                   29044                                    6510      4607
## 9                                   39178                                    3543      1722
## 10                                  58779                                    9425      5438
##    X..Audited X..Certified X..Certified.of...50..Course.Content.Accessed X..Played.Video
## 1       15.04         8.32                                         54.98            83.2
## 2       14.27         9.22                                         64.05           89.14
## 3       17.13        12.49                                         72.85           87.49
## 4        9.96         1.11                                         11.11               0
## 5       20.44         9.64                                         47.12           77.45
## 6        9.90         5.07                                         51.17           82.43
## 7       10.13         6.61                                         65.16           80.25
## 8       22.41        15.86                                         70.60           83.24
## 9        9.04         4.40                                         48.49            85.3
## 10      16.05         9.26                                         51.07             ---
##    X..Posted.in.Forum X..Grade.Higher.Than.Zero Total.Course.Hours..Thousands.
## 1                8.17                     28.97                         418.94
## 2               14.38                     39.50                         884.04
## 3               14.42                     34.89                         227.55
## 4                0.00                      1.11                         220.90
## 5               15.98                     32.52                         804.41
## 6               10.30                     28.90                         639.40
## 7               10.22                     23.49                          68.11
## 8               13.89                     39.38                         279.22
## 9                5.86                     16.04                         380.35
## 10              21.86                     20.98                         186.61
##    Median.Hours.for.Certification Median.Age X..Male X..Female X..Bachelor.s.Degree.or.Higher
## 1                           64.45         26   88.28     11.72                          60.68
## 2                           78.53         28   83.50     16.50                          63.04
## 3                           61.28         27   70.32     29.68                          58.76
## 4                            0.00         28   80.02     19.98                          58.78
## 5                           76.10         32   56.78     43.22                          88.33
## 6                           84.14         27   83.99     16.01                          60.90
## 7                           59.29         27   73.30     26.70                          58.99
## 8                           40.30         30   53.76     46.24                          81.94
## 9                          107.88         26   85.42     14.58                          56.97
## 10                          13.67         30   60.42     39.58                          69.78

Research Question: How is the percentage of certified participants in an online course affected by the institution, the percentage of participants who posted in the forum, the percentage who earned a grade higher than zero, and the median age of participants?

Unit of observation: An online course

Number of units: 290

There are two institutions, MITx and HarvardX, so Institution is a categorical variable. Participants who actively post in the forums communicate with each other and may therefore be more likely to earn a certificate. Participants with a grade higher than zero have earned at least some points, so they may also be more likely to earn a certificate.

While doing this analysis I was struck by how many people start an online course and then give up.

The source of the dataset: https://www.kaggle.com/datasets/edx/course-study

Description of variables in the dataset:

Institution: The educational institution offering the online course (MITx, HarvardX).

Course Number: The unique identifier for the course.

Launch Date: The date when the course was launched.

Course Title: The title or name of the online course.

Instructors: Names of instructors or educators involved in teaching the course.

Course Subject: The subject category to which the course belongs (e.g., Science, Technology, Engineering, and Mathematics).

Year: The year in which the course was conducted, coded as an index (1, 2, 3, 4) rather than as a calendar year.

Honor Code Certificates: Binary indicator (1 or 0) denoting whether honor code certificates were offered.

Participants (Course Content Accessed): The total number of participants who accessed the course content.

Audited (> 50% Course Content Accessed): The number of participants who audited more than 50% of the course content.

Certified: The number of participants who successfully completed and earned certification.

% Audited: Percentage of participants who audited the course.

% Certified: Percentage of participants who earned certification.

% Certified of > 50% Course Content Accessed: Percentage of participants who earned certification among those who audited more than 50% of the course content.

% Played Video: Percentage of participants who played course videos.

% Posted in Forum: Percentage of participants who posted in the course forum.

% Grade Higher Than Zero: Percentage of participants who achieved a grade higher than zero.

Total Course Hours (Thousands): The total number of course hours, expressed in thousands.

Median Hours for Certification: The median number of hours taken by participants to achieve certification.

Median Age: The median age of course participants.

% Male: Percentage of male participants.

% Female: Percentage of female participants.

% Bachelor’s Degree or Higher: Percentage of participants with a bachelor’s degree or higher.

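# Add an ID column so that units can still be identified after rows are removed
mydata$ID <- 1:nrow(mydata)

# Factorize Institution (MITx - 1, HarvardX - 0)
# Note: lm() treats the first level (MITx, labelled 1) as the reference category
mydata$Institution <- factor(mydata$Institution, levels = c("MITx", "HarvardX"), labels = c(1, 0))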

# Display the head of the dataset to check changes
head(mydata)
##   Institution Course.Number Launch.Date
## 1           1        6.002x  09/05/2012
## 2           1         6.00x  09/26/2012
## 3           1        3.091x  10/09/2012
## 4           0         CS50x  10/15/2012
## 5           0        PH207x  10/15/2012
## 6           1         6.00x  02/04/2013
##                                                                     Course.Title
## 1                                                       Circuits and Electronics
## 2                               Introduction to Computer Science and Programming
## 3                                          Introduction to Solid State Chemistry
## 4                                               Introduction to Computer Science
## 5 Health in Numbers: Quantitative Methods in Clinical and Public Health Research
## 6                               Introduction to Computer Science and Programming
##                                                             Instructors
## 1                                                        Khurram Afridi
## 2                               Eric Grimson, John Guttag, Chris Terman
## 3                                                          Michael Cima
## 4 David Malan, Nate Hardison, Rob Bowden, Tommy MacWilliam, Zamyla Chan
## 5                                    Earl Francis Cook, Marcello Pagano
## 6                                                         Larry Rudolph
##                                      Course.Subject Year Honor.Code.Certificates
## 1 Science, Technology, Engineering, and Mathematics    1                       1
## 2                                  Computer Science    1                       1
## 3 Science, Technology, Engineering, and Mathematics    1                       1
## 4                                  Computer Science    1                       1
## 5            Government, Health, and Social Science    1                       1
## 6                                  Computer Science    1                       1
##   Participants..Course.Content.Accessed. Audited....50..Course.Content.Accessed. Certified
## 1                                  36105                                    5431      3003
## 2                                  62709                                    8949      5783
## 3                                  16663                                    2855      2082
## 4                                 129400                                   12888      1439
## 5                                  52521                                   10729      5058
## 6                                  65380                                    6473      3313
##   X..Audited X..Certified X..Certified.of...50..Course.Content.Accessed X..Played.Video
## 1      15.04         8.32                                         54.98            83.2
## 2      14.27         9.22                                         64.05           89.14
## 3      17.13        12.49                                         72.85           87.49
## 4       9.96         1.11                                         11.11               0
## 5      20.44         9.64                                         47.12           77.45
## 6       9.90         5.07                                         51.17           82.43
##   X..Posted.in.Forum X..Grade.Higher.Than.Zero Total.Course.Hours..Thousands.
## 1               8.17                     28.97                         418.94
## 2              14.38                     39.50                         884.04
## 3              14.42                     34.89                         227.55
## 4               0.00                      1.11                         220.90
## 5              15.98                     32.52                         804.41
## 6              10.30                     28.90                         639.40
##   Median.Hours.for.Certification Median.Age X..Male X..Female X..Bachelor.s.Degree.or.Higher ID
## 1                          64.45         26   88.28     11.72                          60.68  1
## 2                          78.53         28   83.50     16.50                          63.04  2
## 3                          61.28         27   70.32     29.68                          58.76  3
## 4                           0.00         28   80.02     19.98                          58.78  4
## 5                          76.10         32   56.78     43.22                          88.33  5
## 6                          84.14         27   83.99     16.01                          60.90  6
# Descriptive statistics (stat.desc() comes from the pastecs package)
library(pastecs)
round(stat.desc(mydata[c("Institution", "X..Posted.in.Forum", "X..Grade.Higher.Than.Zero", "Median.Age")], basic = FALSE), 2)
##          Institution X..Posted.in.Forum X..Grade.Higher.Than.Zero Median.Age
## median            NA               7.24                     19.60      29.00
## mean              NA               9.35                     21.21      29.30
## SE.mean           NA               0.44                      0.79       0.24
## CI.mean           NA               0.87                      1.55       0.47
## var               NA              56.51                    179.87      16.39
## std.dev           NA               7.52                     13.41       4.05
## coef.var          NA               0.80                      0.63       0.14
summary(mydata[c("Institution", "X..Posted.in.Forum", "X..Grade.Higher.Than.Zero", "Median.Age")])
##  Institution X..Posted.in.Forum X..Grade.Higher.Than.Zero   Median.Age  
##  1:161       Min.   : 0.000     Min.   : 0.00             Min.   :22.0  
##  0:129       1st Qu.: 3.993     1st Qu.:10.59             1st Qu.:26.0  
##              Median : 7.245     Median :19.61             Median :29.0  
##              Mean   : 9.348     Mean   :21.21             Mean   :29.3  
##              3rd Qu.:14.107     3rd Qu.:30.90             3rd Qu.:31.0  
##              Max.   :35.280     Max.   :52.35             Max.   :53.0

The descriptive statistics show the following:

On average, 9.35% of a course's participants post in the forum. The distribution is right-skewed (the mean of 9.35 exceeds the median of 7.24) and ranges from 0% to 35.28%.

On average, only 21.21% of participants earn a grade higher than zero, so most participants do not earn any points. The values range from 0% to 52.35%.

The mean of the courses' median ages is 29.3 years, which suggests that many participants are adults, possibly with some work experience, who study online to improve their skills. The median age per course ranges from 22 to 53 years, and the middle half of the courses have a median age between 26 and 31 years.

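# Scatterplot Matrix (scatterplotMatrix() comes from the car package)
# The dependent variable X..Certified is included so that its relationship
# with each numeric explanatory variable can be read from the plot
mydata_numer <- mydata[, c("X..Certified", "X..Posted.in.Forum", "X..Grade.Higher.Than.Zero", "Median.Age")]

scatterplotMatrix(mydata_numer, 
                  smooth = FALSE)

The scatterplot matrix suggests a positive relationship between the percentage of certified participants and each numeric explanatory variable: the percentage who posted in the forum, the percentage with a grade higher than zero, and the median age.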

# VIF statistics
fit_mydata <- lm(X..Certified ~ Institution + X..Posted.in.Forum +
                   X..Grade.Higher.Than.Zero + Median.Age,
                 data = mydata)

vif(fit_mydata)
##               Institution        X..Posted.in.Forum X..Grade.Higher.Than.Zero 
##                  1.527417                  1.596425                  1.367748 
##                Median.Age 
##                  1.544345

All VIF values lie between 1.37 and 1.60, well below the commonly used threshold of 5. Multicollinearity is therefore not a concern for this regression.
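
As a rough illustration of what `vif()` reports: the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below uses simulated data with hypothetical variable names (`forum`, `grade`, `age`) and a hypothetical helper `vif_by_hand()`; it is not the course dataset.

```r
# Simulated predictors (hypothetical names, not the course data)
set.seed(1)
n <- 200
forum <- rnorm(n)
grade <- 0.5 * forum + rnorm(n)   # moderately correlated with forum
age   <- rnorm(n)

# VIF for one predictor: regress it on the remaining predictors
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_forum <- vif_by_hand(forum, data.frame(grade, age))
vif_grade <- vif_by_hand(grade, data.frame(forum, age))
vif_age   <- vif_by_hand(age,   data.frame(forum, grade))

print(round(c(forum = vif_forum, grade = vif_grade, age = vif_age), 3))
```

A VIF of 1 means the predictor is uncorrelated with the others; the correlated pair (`forum`, `grade`) should show larger values than `age`.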

mydata$StdResid <- round(rstandard(fit_mydata), 3)

mydata$CooksD <- round(cooks.distance(fit_mydata), 3)

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Some standardized residuals lie outside the (-3, 3) range, so the corresponding units have to be removed.

head(mydata[order(-mydata$StdResid),], 10)
##     Institution Course.Number Launch.Date
## 188           0    GOV1368.3x  10/01/2015
## 187           0    GOV1368.2x  10/01/2015
## 175           0       HUM1.7x  09/21/2015
## 99            0       SW12.9x  11/20/2014
## 90            0       SW12.8x  10/09/2014
## 57            0       SW12.5x  04/24/2014
## 88            0       HUM2.3x  10/08/2014
## 189           0    GOV1368.4x  10/01/2015
## 75            0       SW12.7x  09/04/2014
## 40            0       SW12.3x  02/13/2014
##                                                                                                        Course.Title
## 188         Saving Schools: History, Politics, and Policy of U.S. Education – Accountability and National Standards
## 187                              Saving Schools: History, Politics, and Policy of U.S. Education – Teacher Policies
## 175 History of the Book: Monasteries, Schools, and Notaries, Part 2: Introduction to the Transitional Gothic Script
## 99                                                                                            Communist Liberations
## 90                                                                            Creating China: The Birth of a Nation
## 57                                                                             From Global Empire to Global Economy
## 88                                                 The Ancient Greek Hero in 24 Hours (Hours 12-15): Cult of Heroes
## 189                                 Saving Schools: History, Politics, and Policy of U.S. Education – School Choice
## 75                                                             Invasions, Rebellions, and the end of Imperial China
## 40                                                                          Cosmopolitan Tang: Aristocratic Culture
##               Instructors                                       Course.Subject Year
## 188         Paul Peterson Humanities, History, Design, Religion, and Education    4
## 187         Paul Peterson Humanities, History, Design, Religion, and Education    4
## 175       Beverly Kienzle Humanities, History, Design, Religion, and Education    4
## 99  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    3
## 90  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    3
## 57  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    2
## 88           Gregory Nagy Humanities, History, Design, Religion, and Education    3
## 189         Paul Peterson Humanities, History, Design, Religion, and Education    4
## 75  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    3
## 40  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    2
##     Honor.Code.Certificates Participants..Course.Content.Accessed.
## 188                       1                                    492
## 187                       1                                    702
## 175                       1                                    670
## 99                        1                                   4248
## 90                        1                                   4515
## 57                        1                                   5256
## 88                        1                                   1559
## 189                       1                                    511
## 75                        1                                   4662
## 40                        1                                   7422
##     Audited....50..Course.Content.Accessed. Certified X..Audited X..Certified
## 188                                     246       127      50.00        25.81
## 187                                     348       180      49.57        25.64
## 175                                     364       191      54.33        28.51
## 99                                     1835      1442      43.24        33.98
## 90                                     2081      1528      46.13        33.87
## 57                                     2649      1686      50.44        32.10
## 88                                      697       417      44.85        26.83
## 189                                     212       113      41.49        22.11
## 75                                     2148      1505      46.10        32.30
## 40                                     3221      2226      43.43        30.01
##     X..Certified.of...50..Course.Content.Accessed X..Played.Video X..Posted.in.Forum
## 188                                         51.22           54.07               8.54
## 187                                         51.72           55.98              11.54
## 175                                         51.37           64.93               7.01
## 99                                          65.12           79.76              30.14
## 90                                          62.85           78.36              31.46
## 57                                          62.21           75.32              28.69
## 88                                          59.54           43.18               2.90
## 189                                         53.30           49.71               5.68
## 75                                          61.45           80.08              33.98
## 40                                          63.18           77.16              29.14
##     X..Grade.Higher.Than.Zero Total.Course.Hours..Thousands. Median.Hours.for.Certification
## 188                     25.81                           1.18                           3.67
## 187                     25.64                           1.85                           4.65
## 175                     28.51                           3.49                           8.25
## 99                      50.49                          23.03                           9.76
## 90                      51.05                          23.92                           9.54
## 57                      48.95                          24.76                           9.09
## 88                      33.72                           3.20                           1.77
## 189                     22.11                           1.08                           4.44
## 75                      52.26                          22.93                           8.68
## 40                      50.29                          33.76                           8.99
##     Median.Age X..Male X..Female X..Bachelor.s.Degree.or.Higher  ID StdResid CooksD
## 188         30   49.09     50.91                          78.86 188    3.013  0.020
## 187         31   48.98     51.02                          80.42 187    2.937  0.015
## 175         39   44.76     55.24                          77.52 175    2.873  0.055
## 99          37   67.11     32.89                          82.98  99    2.832  0.060
## 90          37   64.33     35.67                          82.68  90    2.780  0.062
## 57          35   64.35     35.65                          82.04  57    2.637  0.043
## 88          34   50.35     49.65                          66.91  88    2.447  0.035
## 189         31   48.48     51.52                          74.39 189    2.386  0.014
## 75          38   63.87     36.13                          82.25  75    2.314  0.052
## 40          34   58.97     41.03                          79.40  40    2.159  0.030

After consideration, I decided to remove units 168 and 169 (identified by ID).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# IDs of the units to remove (ID is numeric, so no quotes are needed)
removal <- c(168, 169)

mydata <- mydata %>%
  filter(!(ID %in% removal))
hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

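The StdResid column was computed before the removal and has not been recalculated, so this histogram still reflects the original fit. Standardized residuals below -3 remain, and the listing below shows that unit 188 (standardized residual 3.013) is still above +3. Units on both sides therefore still need to be filtered out.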

mydata_filtered <- mydata[mydata$StdResid >= -3, ]
head(mydata[order(-mydata$StdResid),], 5)
##     Institution Course.Number Launch.Date
## 186           0    GOV1368.3x  10/01/2015
## 185           0    GOV1368.2x  10/01/2015
## 173           0       HUM1.7x  09/21/2015
## 99            0       SW12.9x  11/20/2014
## 90            0       SW12.8x  10/09/2014
##                                                                                                        Course.Title
## 186         Saving Schools: History, Politics, and Policy of U.S. Education – Accountability and National Standards
## 185                              Saving Schools: History, Politics, and Policy of U.S. Education – Teacher Policies
## 173 History of the Book: Monasteries, Schools, and Notaries, Part 2: Introduction to the Transitional Gothic Script
## 99                                                                                            Communist Liberations
## 90                                                                            Creating China: The Birth of a Nation
##               Instructors                                       Course.Subject Year
## 186         Paul Peterson Humanities, History, Design, Religion, and Education    4
## 185         Paul Peterson Humanities, History, Design, Religion, and Education    4
## 173       Beverly Kienzle Humanities, History, Design, Religion, and Education    4
## 99  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    3
## 90  Peter Bol, Bill Kirby Humanities, History, Design, Religion, and Education    3
##     Honor.Code.Certificates Participants..Course.Content.Accessed.
## 186                       1                                    492
## 185                       1                                    702
## 173                       1                                    670
## 99                        1                                   4248
## 90                        1                                   4515
##     Audited....50..Course.Content.Accessed. Certified X..Audited X..Certified
## 186                                     246       127      50.00        25.81
## 185                                     348       180      49.57        25.64
## 173                                     364       191      54.33        28.51
## 99                                     1835      1442      43.24        33.98
## 90                                     2081      1528      46.13        33.87
##     X..Certified.of...50..Course.Content.Accessed X..Played.Video X..Posted.in.Forum
## 186                                         51.22           54.07               8.54
## 185                                         51.72           55.98              11.54
## 173                                         51.37           64.93               7.01
## 99                                          65.12           79.76              30.14
## 90                                          62.85           78.36              31.46
##     X..Grade.Higher.Than.Zero Total.Course.Hours..Thousands. Median.Hours.for.Certification
## 186                     25.81                           1.18                           3.67
## 185                     25.64                           1.85                           4.65
## 173                     28.51                           3.49                           8.25
## 99                      50.49                          23.03                           9.76
## 90                      51.05                          23.92                           9.54
##     Median.Age X..Male X..Female X..Bachelor.s.Degree.or.Higher  ID StdResid CooksD
## 186         30   49.09     50.91                          78.86 188    3.013  0.020
## 185         31   48.98     51.02                          80.42 187    2.937  0.015
## 173         39   44.76     55.24                          77.52 175    2.873  0.055
## 99          37   67.11     32.89                          82.98  99    2.832  0.060
## 90          37   64.33     35.67                          82.68  90    2.780  0.062
# Filter out units with standardized residuals below -3 and above 3
mydata_filtered <- mydata[mydata$StdResid >= -3 & mydata$StdResid <= 3, ]

# Create a new histogram after filtering
hist(mydata_filtered$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals (filtered)")

Now all standardized residuals in mydata_filtered lie within [-3, 3], so no outliers remain by this criterion.
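
One point worth keeping in mind: a stored residual column is not updated when rows are dropped, so the model has to be refitted and the residuals recomputed on the reduced data. A minimal sketch of that workflow, using simulated data with one planted outlier (all names here are hypothetical, not the course data):

```r
# Simulated data with one planted outlier (hypothetical, not the course data)
set.seed(42)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)
d$y[10] <- d$y[10] + 15          # plant an extreme response value

fit_all <- lm(y ~ x, data = d)
d$StdResid <- rstandard(fit_all)

# Flag units whose standardized residual lies outside [-3, 3]
flagged <- which(abs(d$StdResid) > 3)

# Drop them, refit, and recompute the residuals from the new model
d_clean <- d[abs(d$StdResid) <= 3, ]
fit_clean <- lm(y ~ x, data = d_clean)
d_clean$StdResid <- rstandard(fit_clean)

print(flagged)
print(range(d_clean$StdResid))
```

Because the refit changes the residual variance, new units can cross the ±3 line after a removal, so the check may need to be repeated.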

Let’s do the Shapiro-Wilk normality test.

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.98157, p-value = 0.0009069

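The W statistic is 0.98, which is close to 1. The p-value (0.0009) is below 0.05, so the test formally rejects normality of the residuals. However, the sample is large (288 units after the removal), and in large samples the Shapiro-Wilk test detects even small departures from normality. By the Central Limit Theorem, the coefficient estimates are approximately normally distributed in a sample of this size, so I proceed despite the rejection.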
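
The Central Limit Theorem argument can be illustrated with a small simulation. The sketch below is an assumption-laden toy example (simulated predictor, exponential errors, a hypothetical true slope of 2), not an analysis of the course data: it draws many samples of size 288 with clearly skewed errors and collects the estimated slope from each.

```r
# Monte Carlo sketch: skewed errors, n = 288, distribution of the OLS slope
set.seed(123)
n    <- 288
reps <- 1000
true_slope <- 2

slopes <- replicate(reps, {
  x <- rnorm(n)
  e <- rexp(n) - 1               # right-skewed errors with mean 0
  y <- 1 + true_slope * x + e
  coef(lm(y ~ x))[["x"]]
})

print(mean(slopes))              # centre of the simulated sampling distribution
print(summary(slopes))
```

Plotting the result with `hist(slopes)` or `qqnorm(slopes)` shows how close the sampling distribution is to a normal curve even though the individual errors are far from normal.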

hist(mydata$CooksD, 
     xlab = "Cook's distances", 
     ylab = "Frequency", 
     main = "Histogram of Cook's distances")

The histogram of Cook’s distances shows gaps, which means a few units have a much larger influence on the model than the rest. These influential units need to be removed.

# Set a threshold for identifying outliers based on Cook's distance
cook_threshold <- 4 * mean(mydata$CooksD, na.rm = TRUE)

# Identify and remove observations with Cook's distance above the threshold
# (which() drops comparisons that evaluate to NA instead of returning NA rows)
mydata_cleaned <- mydata[which(mydata$CooksD <= cook_threshold), ]

# Create a histogram for the cleaned data
hist(mydata_cleaned$CooksD, 
     xlab = "Cook's distances", 
     ylab = "Frequency", 
     main = "Histogram of Cook's distances (Cleaned)",
     breaks = 50,  # Adjust the number of bins as needed
     col = "lightblue",  # Adjust the color
     probability = TRUE)  # Show the density

# Add a density plot
lines(density(mydata_cleaned$CooksD), col = "red", lwd = 2)
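The cut-off of four times the mean Cook's distance used above is one rule of thumb; another common one is 4/n. The sketch below compares the two on the built-in `mtcars` data, not on the homework dataset. The model `fit_cd` and its formula are hypothetical and serve only to produce Cook's distances.

```r
# Illustration on the built-in mtcars data (assumption: any small lm will do)
fit_cd <- lm(mpg ~ wt + hp, data = mtcars)  # hypothetical demo model
cd <- cooks.distance(fit_cd)                # one distance per observation
n_cd <- nrow(mtcars)

thr_mean <- 4 * mean(cd)  # rule used in this report
thr_n <- 4 / n_cd         # alternative rule of thumb

# Names of the observations flagged by each rule
flag_mean <- names(cd)[cd > thr_mean]
flag_n <- names(cd)[cd > thr_n]
print(flag_mean)
print(flag_n)
```

The two rules do not have to flag the same units, so it is worth stating which one was applied.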

Now let’s check the homoscedasticity.

# Load the car package
library(car)

# Create a scatterplot of residuals against fitted values
scatterplot(y = fit_mydata$residuals, x = fit_mydata$fitted.values, 
            ylab = "Residuals", xlab = "Fitted Values", main = "Residuals vs. Fitted Values")

In the scatterplot, the spread of the residuals is not constant across the fitted values, which points to heteroscedasticity.

library(olsrr)
ols_test_breusch_pagan(fit_mydata)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                   
##  ----------------------------------------
##  Response : X..Certified 
##  Variables: fitted values of X..Certified 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    97.99667 
##  Prob > Chi2   =    4.190871e-23

The Breusch-Pagan test rejects the null hypothesis of constant variance (Chi2 = 98.00, p < 0.001), which confirms that there is heteroscedasticity.
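To make the idea behind the test concrete, here is a minimal sketch of Koenker's studentized variant of the Breusch-Pagan test, computed by hand on the built-in `mtcars` data. This is an assumption-laden illustration: the model `fit_bp` is hypothetical, and because this is the studentized variant, its statistic need not match what `ols_test_breusch_pagan()` reports for the same model.

```r
# Illustration on the built-in mtcars data, not on the homework dataset
fit_bp <- lm(mpg ~ wt + hp, data = mtcars)  # hypothetical demo model
n_bp <- nrow(mtcars)

# Auxiliary regression: squared residuals on the fitted values
e2 <- residuals(fit_bp)^2
aux <- lm(e2 ~ fitted(fit_bp))

# Koenker's statistic is n * R^2 of the auxiliary regression,
# compared against a chi-squared distribution with 1 degree of freedom
bp_stat <- n_bp * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value means the squared residuals vary systematically with the fitted values, which is what non-constant variance looks like.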

Because we discovered heteroscedasticity, we need to obtain robust standard errors.

fit_mydata <- lm(X..Certified ~ Institution + X..Posted.in.Forum +
                   X..Grade.Higher.Than.Zero + Median.Age,
                 data = mydata)

library(estimatr)

fit_mydata <- lm_robust(X..Certified ~ Institution + X..Posted.in.Forum +
                          X..Grade.Higher.Than.Zero + Median.Age,
                        data = mydata,
                        se_type = "HC1")
summary(fit_mydata)
## 
## Call:
## lm_robust(formula = X..Certified ~ Institution + X..Posted.in.Forum + 
##     X..Grade.Higher.Than.Zero + Median.Age, data = mydata, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                           Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper  DF
## (Intercept)               -8.92816    2.98208 -2.9939 2.997e-03 -14.79803 -3.05828 283
## Institution0               4.34085    0.69126  6.2796 1.274e-09   2.98019  5.70151 283
## X..Posted.in.Forum        -0.02143    0.05572 -0.3846 7.008e-01  -0.13111  0.08825 283
## X..Grade.Higher.Than.Zero  0.31022    0.02606 11.9019 9.370e-27   0.25891  0.36153 283
## Median.Age                 0.28657    0.11279  2.5409 1.159e-02   0.06457  0.50858 283
## 
## Multiple R-squared:  0.5573 ,    Adjusted R-squared:  0.551 
## F-statistic: 51.74 on 4 and 283 DF,  p-value: < 2.2e-16

The Multiple R-squared value is 0.557, meaning that approximately 55.7% of the variability in the percentage of certified participants is explained by the model. The F-statistic tests the overall significance of the model and is highly significant (p-value < 2.2e-16).

In summary, the variables Institution, Grade higher than zero, and Median age are statistically significant predictors of the percentage of certified participants in our model. The variable Percentage posted in forum is not statistically significant: its p-value is high (p-value = 0.7008) and its 95% confidence interval (-0.131 to 0.088) includes zero. So, against my initial guess, posting in forums is not associated with a higher certification rate once the other variables are in the model.
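For readers who want to see what the HC1 option does, the sketch below computes HC1 standard errors by hand with the sandwich formula, using base R on the built-in `mtcars` data. The model `fit_hc` is hypothetical; the point is only to show the calculation, not to reproduce the numbers in this report.

```r
# Illustration on the built-in mtcars data, not on the homework dataset
fit_hc <- lm(mpg ~ wt + hp, data = mtcars)  # hypothetical demo model
X <- model.matrix(fit_hc)                   # design matrix, n x k
e <- residuals(fit_hc)
n_hc <- nrow(X)
k_hc <- ncol(X)

bread <- solve(crossprod(X))  # (X'X)^-1
meat <- crossprod(X * e)      # X' diag(e^2) X: each row of X is scaled by its residual

vcov_hc0 <- bread %*% meat %*% bread             # White's HC0 estimator
vcov_hc1 <- vcov_hc0 * n_hc / (n_hc - k_hc)      # HC1 adds a degrees-of-freedom correction

se_hc1 <- sqrt(diag(vcov_hc1))
se_ols <- sqrt(diag(vcov(fit_hc)))               # classical standard errors for comparison
print(cbind(OLS = se_ols, HC1 = se_hc1))
```

The coefficient estimates themselves are unchanged by this correction; only the standard errors, and therefore the t values, p-values, and confidence intervals, differ from the classical output.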