Simple terms: This means that if we increase an independent variable (like income) by a fixed amount, we expect the dependent variable (like spending) to change by a predictable, constant amount, following a straight line. It essentially helps us make sense of the relationship between the variables.
2) Zero Conditional Mean: The expected value of the error terms given any value of the independent variables is zero.
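\(E(\epsilon_i | X) = 0\)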
Simple terms: Imagine you’re trying to predict someone’s height based on age. If height were also affecting age somehow, our prediction wouldn’t be very accurate. This assumption ensures that the errors, the things we didn’t account for, are not interfering with the relationship we’re studying.
3) No Perfect Multicollinearity: There is no exact linear relationship among the predictors or \(X\) variables. The columns of the \(X\) matrix are linearly independent, meaning no column is a multiple or linear combination of the others.
Simple terms: The independent variables should be distinct. If you’re using two variables that are nearly identical (like temperature in Fahrenheit and Celsius), the model cannot separate their effects. This assumption helps prevent redundancy in the data.
4) Homoscedasticity: The variance of the error terms \(\epsilon_i\) (residuals) is constant across all values of the regressors.
\(Var(\epsilon_i) = \sigma^2\) for all \(i\)
Simple terms: For example, whether someone earns a lot or a little money, the “mistakes”, the differences between predictions and reality, should be about the same size. This assumption ensures that the errors aren’t growing much larger in one part of the data, so the errors behave consistently.
5) Exogeneity: The independent variables are fixed (non-random) and are not correlated with the error term.
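\(Cov(X_j, \epsilon_i) = 0\) for every regressor \(X_j\)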
Simple terms: Imagine conducting an experiment where the ingredients you use don’t change on their own; they stay the same each time. This ensures that the independent variables are reliable and don’t shift randomly in the process.
6) No Autocorrelation: The error terms of each observation are uncorrelated with one another.
\(Cov(\epsilon_i , \epsilon_j)=0\) for all \(i \neq j\)
Simple terms: Errors need to be independent. If we made a mistake in one observation (like guessing someone’s height), it shouldn’t automatically lead to another mistake in the next observation; the mistakes should be unrelated.
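To make these assumptions concrete, here is a minimal R sketch (not part of the original analysis) that simulates data satisfying them: the outcome is linear in the parameters, the errors have mean zero and constant variance, they are independent of each other and of the regressors, and the two predictors are not multiples of one another. The names x1, x2, and sim_fit are illustrative only.
set.seed(42)
n   <- 500
x1  <- rnorm(n, mean = 10, sd = 2)   # first predictor
x2  <- runif(n, min = 0, max = 5)    # second predictor, not a linear function of x1
eps <- rnorm(n, mean = 0, sd = 1)    # errors: mean zero, constant variance, independent
y   <- 3 + 1.5 * x1 - 2 * x2 + eps   # linear in the parameters
sim_fit <- lm(y ~ x1 + x2)
summary(sim_fit)                     # estimates should land close to 3, 1.5, and -2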
Part 2) Cross-sectional datasets
Bringing in the Massachusetts schools (MASchools) data
library("AER")
Warning: package 'AER' was built under R version 4.3.3
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("MASchools")school <- MASchools
plot(~ score4 + scratio + income + english, data = school, main = "MA Schools")
When the independent variables (scratio, income, and english) are all equal to 0, the predicted average test score is 687.95 (the intercept).
Scratio
For each one-unit increase in scratio (the student-computer ratio), the school’s average test score decreases by 0.15 points. It is good to add context to this statistic: while we see a negative relationship here, in 1998 there was not a large reliance on computers as part of students’ daily curriculum compared to today, where one might see a positive correlation.
This coefficient is not statistically significant at any conventional significance level.
Income
For every one-unit increase in average district income, average test scores increase by 1.49 points.
This coefficient is statistically significant at the 99% confidence level (\(\alpha = 0.01\)).
English
For every percentage-point increase in the share of English learners, average test scores decrease by 2.10 points.
This coefficient is statistically significant at the 99% confidence level (\(\alpha = 0.01\)).
Storing our regression in “my_reg”
my_reg <- lm(score4 ~ scratio + income + english, data = school)
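To display the estimated coefficients interpreted above, and to check the no-multicollinearity assumption from Part 1, the fitted model can be inspected as follows. This is a minimal sketch: it assumes the stargazer package (used again later in this document) is installed, while vif() comes from the car package that AER already loads.
library(stargazer)
stargazer(my_reg, type = "text")   # text table of the estimated coefficients
vif(my_reg)                        # variance inflation factors; values above ~10 are commonly taken to signal multicollinearity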
Part 3) 4 Linear Regression Plots
# Plotting my linear model
par(mfrow = c(2, 2))
plot(my_reg)
What each plot means:
Residuals vs Fitted:
Check for a random scatter of points around the horizontal line; homoscedasticity implies consistent variance of the residuals across the range of fitted values.
We want to make sure the residuals are spread out evenly around zero.
Q-Q: Evaluate if the residuals follow a straight line.
A roughly linear pattern suggests residuals are normally distributed. Deviations from the line indicate non-normality. Departures at the tails might signal outliers or heavy-tailed distributions.
Scale-Location:
Plots the square root of the standardized residuals against the fitted values; it checks whether the spread of the residuals stays constant across the fitted values.
Line should be roughly straight and horizontal, with points spread evenly.
Residuals vs Leverage:
Outlying points far from the center might indicate influential observations impacting the regression line, and high-leverage points can heavily influence the regression model’s coefficients. If there are points outside of the Cook’s distance lines, they might point to influential outliers.
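As a complement to the visual check in the Residuals vs Leverage plot, the underlying influence measures can be inspected directly with base R; a minimal sketch, using my_reg as fitted above:
cd  <- cooks.distance(my_reg)          # Cook's distance for each observation
lev <- hatvalues(my_reg)               # leverage (hat values)
head(sort(cd, decreasing = TRUE), 3)   # the three most influential observations
head(sort(lev, decreasing = TRUE), 3)  # the three highest-leverage observations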
Interpreting the linear model
plot(my_reg, which = 1)
Residuals vs Fitted: The nonlinear pattern in the residuals shows that linearity is violated. We can also note some outliers, observations 179 and 208.
plot(my_reg, which = 2)
Q-Q Residuals: Shows that the majority of the residuals follow the diagonal line across roughly five standard deviations, with a few small outliers on the x-axis around -3 standard deviations, but overall the distribution looks approximately normal.
plot(my_reg, which = 3)
Scale-Location: The overall line here is roughly horizontal but is nevertheless pulled by residuals further from the center and by outliers at the top right.
plot(my_reg, which = 4:5)
Residuals vs Leverage: This allows us to identify influential outliers. Based on the graph, we don’t have any residual points outside the Cook’s distance lines, but we do have 2-3 residuals that stick out.
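As a more formal complement to the visual homoscedasticity check in the Scale-Location plot, a Breusch-Pagan test can be run with the lmtest package that AER already loads; a minimal sketch, where a small p-value would suggest heteroscedasticity:
bptest(my_reg)   # Breusch-Pagan test for heteroscedasticity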
Adjusting the data
The graphs below show the distribution of the data before the transformation.
library(ggplot2)
ggplot(data = school, aes(x = scratio)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "navy") +
  geom_density(color = "red") +
  labs(title = "Distribution of scratio", x = "scratio", y = "Density")
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
ggplot(data = school, aes(x = income)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "darkgreen") +
  geom_density(color = "red") +
  labs(title = "Distribution of Income", x = "Income", y = "Density")
ggplot(data = school, aes(x = english)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "orange") +
  geom_density(color = "red") +
  labs(title = "Distribution of English", x = "English", y = "Density")
We are implementing a log transformation:
A summary is added along with the graphs to show the impact the transformation has had.
# Log transformation of the variables; + 1e-10 is added to avoid taking log(0)
school$log_income  <- log(school$income + 1e-10)
school$log_english <- log(school$english + 1e-10)
school$log_scratio <- log(school$scratio + 1e-10)
# Adjusting the linear model
my_reg2 <- lm(score4 ~ log_scratio + log_income + log_english, data = school)
# Summary of the new model
stargazer(my_reg2, type = "text")
ggplot(data = school, aes(x = log_income)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "darkgreen") +
  geom_density(color = "red") +
  labs(title = "Distribution of Log Income", x = "Income", y = "Density")
ggplot(data = school, aes(x = log_english)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "orange") +
  geom_density(color = "red") +
  labs(title = "Distribution of Log English", x = "English", y = "Density")
Post-transformation analysis
Statistics such as the R-squared and the F-statistic worsened after the transformation, but the distributions in the density graphs look somewhat better. The residuals are more evenly scattered and, while still affected by some outliers, the transformation has helped with linearity.
# Plotting the transformed linear model
par(mfrow = c(2, 2))
plot(my_reg2)
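To back the comparison of fit statistics mentioned above with numbers, the two models can be put side by side using base R summary() components; a minimal sketch, with the values omitted here:
# Compare R-squared and F-statistics of the original and log-transformed models
c(original = summary(my_reg)$r.squared,
  logged   = summary(my_reg2)$r.squared)
c(original = summary(my_reg)$fstatistic[1],
  logged   = summary(my_reg2)$fstatistic[1])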