Please skim through Chapter 8 of the Open Statistics textbook, and pay attention to the Gauss-Markov assumptions part (Full Ideal Conditions of OLS / classical linear regression assumptions). Under these conditions OLS is BLUE. You can also refer to any standard econometrics textbook or online resources.
I found this
dataset from Kaggle. It is a dataset that represents students’ (N=480)
academic performance from different countries and of different ages. My
research question is, does the number of times a student views course
announcements (AnnouncementsView) and course content
resources (VisITedResources) predict the number of times a
student raises their hand (raisedhands)?
\[ Y_i = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon_i \]
In which \(Y_i\) =
raisedhands, \(\beta_0\) =
intercept (constant), \(X_1\) =
AnnouncementsView, \(X_2\) =
VisITedResources, \(\beta_1\) and \(\beta_2\) = their slope coefficients, and \(\epsilon_i\) = random error
setwd("/Users/jiwonban/ADEC7301/Week 7")
df <- read.csv("Academic_Perfromance_Dataset.csv")
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(df,
type = "text",
title = "Summary")
##
## Summary
## =============================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------
## raisedhands 480 46.775 30.779 0 100
## VisITedResources 480 54.798 33.080 0 99
## AnnouncementsView 480 37.919 26.611 0 98
## Discussion 480 43.283 27.638 1 99
## ---------------------------------------------
summary(
lm(df$raisedhands ~ df$AnnouncementsView + df$VisITedResources)
)
##
## Call:
## lm(formula = df$raisedhands ~ df$AnnouncementsView + df$VisITedResources)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.254 -11.179 0.393 12.918 62.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.63719 1.87488 3.540 0.000439 ***
## df$AnnouncementsView 0.41641 0.04358 9.554 < 2e-16 ***
## df$VisITedResources 0.44433 0.03506 12.673 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.41 on 477 degrees of freedom
## Multiple R-squared: 0.5621, Adjusted R-squared: 0.5602
## F-statistic: 306.1 on 2 and 477 DF, p-value: < 2.2e-16
\[ \hat{Y}_i = 6.64 + 0.42X_1 + 0.44X_2 \]
Together, students’ views of course announcements and visits to course
resources are significantly related to how often they raise their hand
(F(2,477) = 306.1, p < 0.001), and the model explains about 56% of the
variance in raisedhands (multiple R-squared = 0.5621). The
intercept of 6.64 indicates that, for a student with no
engagement via course announcements or course content materials (i.e.,
both \(X_1\) and \(X_2\) at 0), the model predicts that the student
raises their hand in the classroom 6.64 times. The coefficient of 0.42
indicates the expected difference in the number of times a student
raises their hand given a one-unit increase in viewing course
announcements, while holding the variable VisITedResources
constant. The second coefficient tells us that with every one-unit
increase in a student visiting course resources, while holding
AnnouncementsView constant, there is a predicted increase
of 0.44 in the number of times a student will raise their hand. The
signs are in the expected direction, i.e., positive, as one would predict
that a student would raise their hand more if they are better engaged
with the course materials (e.g., those who view course announcements and
resources more often).
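As a quick illustration of how the fitted coefficients combine, consider a hypothetical student (values chosen only for illustration, not taken from the dataset) with 10 announcement views and 20 resource visits. Using the rounded estimates from the regression output:

\[ \hat{Y} = 6.64 + 0.42(10) + 0.44(20) = 6.64 + 4.20 + 8.80 = 19.64 \]

so the model would predict roughly 20 hand raises for such a student.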
plot(lm(df$raisedhands ~ df$AnnouncementsView + df$VisITedResources),
col="gray")
You can refer to this video that will be helpful for both discussion and assignment. Alternatively if you prefer reading
Residuals vs Fitted - looks suitable for analysis, as the residuals are dispersed randomly around zero (rather than clustered) and there are no visible patterns, which supports linearity in the relationship
QQ plot - looks suitable for analysis, as the standardized residuals are mostly aligned with the dashed line, indicating that they are approximately normally distributed
Scale-Location - looks suitable for analysis, as the points are randomly dispersed around the line across the whole plot, indicating roughly constant variance (homoscedasticity)
Residuals vs Leverage - looks suitable for analysis, as all points fall within the Cook's distance bounds and there are no alarming outliers or influential observations
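The visual checks above can be complemented with a couple of numeric summaries. The sketch below is only illustrative: it uses simulated stand-in data (the column names mirror the dataset, but the values and the `sim` data frame are assumptions, not the Kaggle data), draws the four diagnostic plots in one panel, and then runs a Shapiro-Wilk test and computes Cook's distances using base R functions.

```r
# Simulated stand-in data (NOT the Kaggle dataset); names mirror the columns used above
set.seed(1)
n <- 480
sim <- data.frame(
  AnnouncementsView = runif(n, 0, 98),
  VisITedResources  = runif(n, 0, 99)
)
sim$raisedhands <- 6.6 + 0.42 * sim$AnnouncementsView +
  0.44 * sim$VisITedResources + rnorm(n, sd = 20)

fit <- lm(raisedhands ~ AnnouncementsView + VisITedResources, data = sim)

# Draw the four diagnostic plots in a single 2x2 panel
op <- par(mfrow = c(2, 2))
plot(fit, col = "gray")
par(op)

# Shapiro-Wilk test on the residuals: a small p-value suggests non-normality
sw <- shapiro.test(residuals(fit))
print(sw)

# Largest Cook's distance: unusually large values flag influential observations
max_cook <- max(cooks.distance(fit))
print(max_cook)
```

With the real data, `sim` would be replaced by `df`; the numeric results are a supplement to the plots, not a replacement for them.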
You will find different sources talking about a different number of assumptions (5-7). Explain in your own words (instead of copying/pasting from the web). These are the 4 OLS assumptions in Chapter 9.3 of the Open Statistics textbook, or you can refer to some simple resources on this too.
The four Gauss-Markov Assumptions for linear regressions:
| 1) Linearity - the relationship between the predictors and the outcome should be approximately linear |
| 2) Nearly normal residuals - the residuals should be approximately normally distributed |
| 3) Constant variability - the variance of the errors should be roughly constant across fitted values |
| 4) Independent observations - the observations (and therefore their errors) should not be correlated with one another |
The residual plots suggest that these four assumptions are reasonably met. Specifically, the Residuals vs Fitted plot supports linearity, the QQ plot indicates approximate normality, and the Scale-Location plot shows roughly constant variance. The absence of systematic patterns in the residual plots is also consistent with independent observations, although independence depends mainly on how the data were collected.
BLUE stands for Best Linear Unbiased Estimator. When the Gauss-Markov assumptions hold, the OLS estimator is BLUE: it is linear in the outcome, unbiased, and has the smallest variance among all linear unbiased estimators.
In this context, "best" refers to the minimum variance, i.e., the narrowest sampling distribution. More specifically, when the model satisfies the assumptions, the OLS coefficient estimates have the tightest sampling distribution of any linear unbiased estimator.
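A small simulation can make the "best" part concrete. The sketch below is an illustration under assumed values (the true parameters, error standard deviation, and the "endpoint" comparison estimator are all hypothetical choices, not from the textbook): it compares the OLS slope with another linear unbiased estimator, the slope through the first and last observations, across repeated samples.

```r
set.seed(42)
x <- seq(1, 20, length.out = 30)  # fixed regressor values
beta0 <- 2                        # hypothetical true intercept
beta1 <- 0.5                      # hypothetical true slope
n_sims <- 2000

ols_slope <- numeric(n_sims)
endpoint_slope <- numeric(n_sims)

for (s in seq_len(n_sims)) {
  # Errors are independent, mean zero, constant variance
  y <- beta0 + beta1 * x + rnorm(length(x), sd = 3)
  ols_slope[s] <- coef(lm(y ~ x))[2]
  # Alternative linear unbiased estimator: slope through first and last points
  endpoint_slope[s] <- (y[length(x)] - y[1]) / (x[length(x)] - x[1])
}

# Both estimators should average close to the true slope (unbiasedness)
print(c(mean_ols = mean(ols_slope), mean_endpoint = mean(endpoint_slope)))

# OLS should show the smaller sampling variance ("best")
print(c(var_ols = var(ols_slope), var_endpoint = var(endpoint_slope)))
```

Both estimators are centered on the true slope, but the OLS estimates are spread far less widely, which is what the Gauss-Markov theorem predicts under these conditions.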
There are many reasons, including easier interpretation or better fit, but please stick with a few only. Please share any relevant sources that you found useful, or believe might be helpful for others.
Taking the logarithm of variables is very useful for visualizing data that spans a very wide range of values and, more specifically, data that is skewed or clustered. As we saw with our residual plots in HW6, taking the log of both variables (dependent and independent) made the relationship more linear and the distribution of residuals closer to normal. Log transformations can also address issues concerning the assumptions for analysis (e.g., homoscedasticity); that is, when the data is skewed, we can log-transform our variables to make them more suitable for regression models.
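To illustrate the point about residuals, here is a minimal sketch on simulated data (the data-generating values are assumptions chosen so that the outcome is right-skewed, and only the outcome is logged here for simplicity; this is not the HW6 data). It fits the same model with and without a log-transformed outcome and compares the Shapiro-Wilk W statistic of the residuals, where values closer to 1 indicate residuals closer to normal.

```r
set.seed(7)
x <- runif(200, 1, 10)
# Multiplicative errors make y right-skewed with non-constant spread
y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.4))

fit_raw <- lm(y ~ x)       # outcome on its original scale
fit_log <- lm(log(y) ~ x)  # log-transformed outcome

# Shapiro-Wilk W statistic for each set of residuals (closer to 1 = more normal)
w_raw <- unname(shapiro.test(residuals(fit_raw))$statistic)
w_log <- unname(shapiro.test(residuals(fit_log))$statistic)
print(c(W_raw = w_raw, W_log = w_log))
```

Note that `log()` returns `-Inf` for zeros, so a variable that contains zeros would need a different treatment, for example `log1p()`, before it could be transformed this way.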