I.  You will find this discussion useful for understanding OLS assumptions, and for your last Assignment.

Please skim through chapter 8 of the Open Statistics textbook, and pay attention to the Gauss-Markov assumptions part (Full Ideal Conditions of OLS / classical linear regression assumptions).  Under these conditions OLS is BLUE.  You can also refer to any standard econometrics textbook or online resources.

1.  Find a dataset and run a multivariate regression in R (have at least 2 independent variables).  Make sure to type out the estimating equation with subscripts, provide summary statistics of the dataset, and present the final regression with the stargazer package (you can try and present a few different specifications).

1. Solution:

I found this dataset on Kaggle. It represents the academic performance of students (N=480) from different countries and of different ages. My research question is: do the number of times a student views course announcements (AnnouncementsView) and the number of times they visit course content resources (VisITedResources) predict the number of times the student raises their hand (raisedhands)?

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i \]

In which \(Y_i\) = raisedhands for student \(i\), \(\beta_0\) = intercept (constant), \(X_{1i}\) = AnnouncementsView with slope \(\beta_1\), \(X_{2i}\) = VisITedResources with slope \(\beta_2\), and \(\epsilon_i\) = random error

setwd("/Users/jiwonban/ADEC7301/Week 7")
df <- read.csv("Academic_Perfromance_Dataset.csv")

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(df, 
          type = "text", 
          title = "Summary")
## 
## Summary
## =============================================
## Statistic          N   Mean  St. Dev. Min Max
## ---------------------------------------------
## raisedhands       480 46.775  30.779   0  100
## VisITedResources  480 54.798  33.080   0  99 
## AnnouncementsView 480 37.919  26.611   0  98 
## Discussion        480 43.283  27.638   1  99 
## ---------------------------------------------
summary(
  lm(df$raisedhands ~ df$AnnouncementsView + df$VisITedResources)
  )
## 
## Call:
## lm(formula = df$raisedhands ~ df$AnnouncementsView + df$VisITedResources)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.254 -11.179   0.393  12.918  62.537 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.63719    1.87488   3.540 0.000439 ***
## df$AnnouncementsView  0.41641    0.04358   9.554  < 2e-16 ***
## df$VisITedResources   0.44433    0.03506  12.673  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.41 on 477 degrees of freedom
## Multiple R-squared:  0.5621, Adjusted R-squared:  0.5602 
## F-statistic: 306.1 on 2 and 477 DF,  p-value: < 2.2e-16

\[ \hat{Y}_i = 6.64 + 0.42X_{1i} + 0.44X_{2i} \]
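
The assignment asks for the regression to be presented with stargazer across a few specifications. One way that could look is sketched below. Because the Kaggle CSV is not bundled here, the block builds a simulated placeholder data frame (`df_demo`, hypothetical, with the same column names) so that it runs on its own; with the real data, `df_demo` would be replaced by the `df` read in earlier. The model names `m1` to `m3` and the simulation settings are illustrative assumptions.

```r
# Simulated placeholder with the same column names as the Kaggle data.
# These values are NOT the real data; swap df_demo for df in practice.
set.seed(1)
n <- 480
df_demo <- data.frame(
  AnnouncementsView = sample(0:98, n, replace = TRUE),
  VisITedResources  = sample(0:99, n, replace = TRUE)
)
signal <- 5 + 0.4 * df_demo$AnnouncementsView + 0.4 * df_demo$VisITedResources
df_demo$raisedhands <- round(pmin(100, pmax(0, signal + rnorm(n, sd = 20))))

# Passing data = keeps coefficient names free of the "df$" prefix.
m1 <- lm(raisedhands ~ AnnouncementsView, data = df_demo)
m2 <- lm(raisedhands ~ VisITedResources, data = df_demo)
m3 <- lm(raisedhands ~ AnnouncementsView + VisITedResources, data = df_demo)

# Print the three specifications side by side if stargazer is installed.
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(m1, m2, m3, type = "text", title = "Regression results")
}
```

Fitting with `data =` rather than `df$...` also makes the table rows read `AnnouncementsView` and `VisITedResources` instead of `df$AnnouncementsView`.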

2. Talk about what you find in a few lines i.e. interpret a few slopes.  Is the sign in the expected direction, and is the magnitude meaningful? What about the statistical significance?

Together, viewing course announcements and visiting course resources significantly predict how often students raise their hands (F(2, 477) = 306.1, p < 0.001), and the model explains about 56% of the variance in raisedhands (R-squared = 0.56). The intercept of 6.64 is the predicted number of hand raises for a student with no announcement views and no resource visits (i.e., both \(X_1\) and \(X_2\) at 0). The coefficient of 0.42 is the expected difference in the number of times a student raises their hand for a one-unit increase in announcement views, holding VisITedResources constant. Likewise, each additional resource visit, holding AnnouncementsView constant, is associated with a predicted increase of 0.44 hand raises. Both slopes are individually significant (t = 9.55 and t = 12.67, both p < 0.001). The signs are in the expected direction, i.e., positive: one would predict that students who engage more with the course (e.g., those who view announcements and resources more often) raise their hands more.
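
To get a feel for magnitude, the coefficients reported in the `summary()` output can be plugged in by hand. This is only a back-of-the-envelope sketch: the helper name `predict_hands` and the example input of 50 views each are illustrative choices, not part of the original analysis.

```r
# Coefficient estimates copied from the summary() output above.
b0 <- 6.63719   # intercept
b1 <- 0.41641   # AnnouncementsView
b2 <- 0.44433   # VisITedResources

# Predicted number of hand raises for given predictor values.
predict_hands <- function(announcements, resources) {
  b0 + b1 * announcements + b2 * resources
}

pred_mid <- predict_hands(50, 50)

# Predicted change when each predictor moves across its observed range
# (0 to 98 and 0 to 99 in the summary table), holding the other fixed.
range_effect_announcements <- b1 * (98 - 0)
range_effect_resources     <- b2 * (99 - 0)

print(c(pred_mid, range_effect_announcements, range_effect_resources))
```

Moving from the minimum to the maximum of either predictor corresponds to roughly 40 more predicted hand raises on a 0 to 100 scale, which suggests the slopes are practically meaningful as well as statistically significant.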

3.  More importantly, interpret the residuals. 

plot(lm(df$raisedhands ~ df$AnnouncementsView + df$VisITedResources),
     col="gray")

You can refer to this video that will be helpful for both discussion and assignment.  Alternatively if you prefer reading 

3. Solution:

  1. Residuals vs Fitted - looks suitable for analysis: the residuals scatter randomly around zero with no visible curve or clustering, which supports a linear relationship

  2. QQ plot - looks suitable for analysis: the standardized residuals fall mostly along the dashed reference line, which suggests they are approximately normally distributed

  3. Scale-Location - looks suitable for analysis: the points are spread evenly around the line across the whole plot, which suggests roughly constant variance (homoscedasticity)

  4. Residuals vs Leverage - looks suitable for analysis: all points are within the Cook's distance bounds, so there are no alarming influential outliers
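
The visual checks above can be complemented with formal tests. The following is a rough sketch, not the method used in this post: `check_residuals` is a hypothetical helper that runs a Shapiro-Wilk test for normality and a hand-computed studentized Breusch-Pagan statistic for constant variance. The built-in `mtcars` data stands in for the course dataset so that the block runs on its own.

```r
check_residuals <- function(model) {
  e <- residuals(model)
  X <- model.matrix(model)            # includes the intercept column
  k <- ncol(X) - 1                    # number of slope coefficients

  # Normality: Shapiro-Wilk test on the residuals.
  sw <- shapiro.test(e)

  # Constant variance: regress squared residuals on the predictors;
  # n * R^2 is approximately chi-squared with k df under homoscedasticity.
  aux <- lm(e^2 ~ X[, -1, drop = FALSE])
  bp_stat <- length(e) * summary(aux)$r.squared
  bp_p <- pchisq(bp_stat, df = k, lower.tail = FALSE)

  c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
}

# Stand-in model on built-in data; with the course data this would be
# lm(raisedhands ~ AnnouncementsView + VisITedResources, data = df).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
res_check <- check_residuals(demo_fit)
print(round(res_check, 4))
```

Small p-values would point toward non-normal residuals or non-constant variance; large ones are consistent with what the plots suggest.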

4.  What are the Gauss-Markov assumptions, and did they hold?

You will find different sources talking about different numbers of assumptions (5-7).  Explain in your own words (instead of copying/pasting from the web).  These are the 4 OLS assumptions in Chapter 9.3 of Open Statistics textbook, or you can refer to some simple resources on this too.  

4. Solution:

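The four OLS conditions for linear regression from the textbook:

1) Linearity - the relationship between each predictor and the outcome should be approximately linear
2) Nearly normal residuals - the residuals should be approximately normally distributed (this matters for t and F inference; strictly speaking it is not one of the Gauss-Markov assumptions and is not needed for OLS to be BLUE)
3) Constant variability - the variance of the errors should be roughly constant across fitted values
4) Independent observations - each observation should be independent of the others, i.e., the errors should not be correlated across observations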

The residual plots suggest that these four assumptions hold. Specifically, the Residuals vs Fitted plot supports linearity, the QQ plot indicates approximate normality, the Scale-Location plot shows roughly constant variance, and the absence of systematic patterns in any of the residual plots is consistent with independent observations, although independence ultimately depends on how the data were collected.
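
For reference, the conditions usually grouped under the Gauss-Markov label can be written compactly in matrix form. This is standard textbook notation rather than anything taken from the assigned chapter; the symbols \(X\) (the \(n \times (k+1)\) matrix of predictors with a column of ones), \(\beta\), \(\sigma^2\), and \(I_n\) are introduced here only for illustration:

\[ y = X\beta + \epsilon, \qquad E[\epsilon \mid X] = 0, \qquad \operatorname{Var}(\epsilon \mid X) = \sigma^2 I_n, \qquad \operatorname{rank}(X) = k + 1 \]

The first two expressions correspond to linearity, the third combines constant variability (equal diagonal entries) with uncorrelated errors (zero off-diagonal entries), and the rank condition rules out perfect collinearity among the predictors, which the four-item list does not mention.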

5.  What does OLS is BLUE mean?

5. Solution:

BLUE stands for Best Linear Unbiased Estimator. "Linear" means the estimator is a linear function of the observed outcomes, and "unbiased" means that, on average across repeated samples, it equals the true coefficient.

In this context, "best" refers to the minimum variance, or the narrowest sampling distribution. More specifically, when the model satisfies the Gauss-Markov assumptions, the OLS coefficient estimates have the tightest possible sampling distribution among all linear unbiased estimators.
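
A small simulation can make the "best" part concrete. The sketch below is an illustration under assumed settings (true slope 2, normal errors with standard deviation 3, 2000 repetitions), not part of the assignment. It compares the OLS slope with a simple grouping estimator, which is also linear in the outcomes and unbiased when the regressor is held fixed, and reports the mean and variance of each across repeated samples.

```r
set.seed(7301)
n <- 50
x <- seq(1, 10, length.out = n)     # regressor held fixed across samples
beta0 <- 1; beta1 <- 2              # true coefficients (chosen for the demo)
upper <- x > median(x)              # split used by the grouping estimator

one_draw <- function() {
  y <- beta0 + beta1 * x + rnorm(n, sd = 3)
  ols <- cov(x, y) / var(x)         # OLS slope for a single regressor
  # Grouping estimator: difference in group means of y over that of x.
  grouped <- (mean(y[upper]) - mean(y[!upper])) /
             (mean(x[upper]) - mean(x[!upper]))
  c(ols = ols, grouped = grouped)
}

draws <- replicate(2000, one_draw())   # 2 x 2000 matrix of slope estimates
slope_means <- rowMeans(draws)
slope_vars  <- apply(draws, 1, var)
print(rbind(mean = slope_means, variance = slope_vars))
```

Both estimators should center on the true slope, but the OLS estimates should vary less from sample to sample, which is what the Gauss-Markov theorem predicts.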

II.  Why should we take the log of a variable in your linear regression?

There are many reasons, including easy interpretation or better fit, but please stick with a few only. Please share any relevant sources that you found useful, or believe might be helpful for others.  

II. Solution:

Taking the log of a variable is very useful for visualizing and modeling data that span a wide range, and more specifically data that are skewed or clustered, because it compresses large values and spreads out small ones. As we saw with our residual plots in HW6, taking the log of both variables (dependent and independent) made the relationship more linear and the residuals closer to normal. Log transformations can also address issues with the assumptions for analysis (e.g., homoscedasticity); that is, when the data are skewed, we can log our variables to make them more suitable for regression models. Logs also make interpretation easier: in a log-log model, the slope is approximately the percent change in the outcome associated with a one percent change in the predictor.
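
The effect of logging a skewed variable can be sketched with simulated data. Everything in the block below is an assumption made for illustration (the lognormal predictor, the slope of 0.5 on the log scale, the noise level, and the helper `skewness`); it is not the HW6 data.

```r
set.seed(42)
n <- 2000
x <- rlnorm(n, meanlog = 3, sdlog = 1)             # right-skewed predictor
y <- exp(1 + 0.5 * log(x) + rnorm(n, sd = 0.3))    # slope on the log scale is 0.5

# Sample skewness (third standardized moment).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(y)
skew_log <- skewness(log(y))

# Log-log regression: the slope is read as a percent-change relationship.
fit_loglog <- lm(log(y) ~ log(x))
elasticity <- unname(coef(fit_loglog)[2])

print(c(skew_raw = skew_raw, skew_log = skew_log, elasticity = elasticity))
```

The raw outcome should show clear positive skew, while its log should be close to symmetric, and the fitted slope should land near the 0.5 used to generate the data.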