Simple terms: This means that if we increase an independent variable (like income) by a fixed amount, we expect the dependent variable (like spending) to change by a predictable, constant amount, following a straight line. It essentially helps us make sense of the relationship between the variables.
2) Zero Conditional Mean: The expected value of the error terms given any value of the independent variables is zero.
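\(E(\epsilon_i | X) = 0\)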
Simple terms: Imagine you’re trying to predict someone’s height based on age. If height were also affecting age somehow, our prediction wouldn’t be very accurate. This assumption ensures that the errors, the things we didn’t account for, are not interfering with the relationship we’re studying.
3) No Perfect Multicollinearity: There is no exact linear relationship among the predictors or \(X\) variables. The columns of the \(X\) matrix are linearly independent, meaning no column is a multiple or linear combination of the others.
Simple terms: The independent variables should be distinct. If you’re using two variables that are nearly identical (like temperature in Fahrenheit and Celsius), the model cannot separate their effects. This assumption helps prevent redundancy in the data.
4) Homoscedasticity: The variance of the error terms \(\epsilon_i\) (residuals) is constant across all values of the regressors.
\(Var(\epsilon_i) = \sigma^2\) for all \(i\)
Simple terms: For example, whether someone earns a lot or a little money, the “mistakes”, the differences between predictions and reality, should be about the same size. This assumption ensures that the errors aren’t growing much larger in one part of the data, so the errors behave consistently.
5) Exogeneity: The independent variables are fixed (non-random) and are not correlated with the error term.
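\(Cov(X_j, \epsilon_i) = 0\) for every regressor \(X_j\)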
Simple terms: Imagine conducting an experiment where the ingredients you use don’t change on their own; they stay the same each time. This ensures that the independent variables are reliable and don’t shift randomly in the process.
6) No Autocorrelation: The error terms of each observation are uncorrelated with one another.
\(Cov(\epsilon_i , \epsilon_j)=0\) for all \(i \neq j\)
Simple terms: Errors need to be independent. If we made a mistake in one observation (like guessing someone’s height), it shouldn’t automatically lead to another mistake in the next observation; the mistakes should be unrelated.
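To make these assumptions concrete, here is a minimal R sketch (not part of the original analysis) that simulates data satisfying them: the outcome is linear in the parameters, the errors have mean zero and constant variance, they are independent of each other and of the regressors, and the two predictors are not multiples of one another. The names x1, x2, and sim_fit are illustrative only.
set.seed(42)
n   <- 500
x1  <- rnorm(n, mean = 10, sd = 2)   # first predictor
x2  <- runif(n, min = 0, max = 5)    # second predictor, not a linear function of x1
eps <- rnorm(n, mean = 0, sd = 1)    # errors: mean zero, constant variance, independent
y   <- 3 + 1.5 * x1 - 2 * x2 + eps   # linear in the parameters
sim_fit <- lm(y ~ x1 + x2)
summary(sim_fit)                     # estimates should land close to 3, 1.5, and -2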
Part 2) Cross-sectional datasets
Bringing in the Massachusetts schools (MASchools) data
library("AER")
Warning: package 'AER' was built under R version 4.3.3
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("MASchools")school <- MASchools
plot(~ score4 + scratio + income + english, data = school, main = "MA Schools")
When the independent variables (scratio, income, and english) are all equal to 0, the predicted average test score is 687.95 (the intercept).
Scratio
For each one-unit increase in scratio (the student-computer ratio), the school’s average test score decreases by 0.15 points. It is good to add context to this statistic: while we see a negative relationship here, in 1998 there was not a large reliance on computers as part of students’ daily curriculum compared to today, where one might see a positive correlation.
This coefficient is not statistically significant at any conventional significance level.
Income
For every one-unit increase in average district income, average test scores increase by 1.49 points.
This coefficient is statistically significant at the 99% confidence level (\(\alpha = 0.01\)).
English
For every percentage-point increase in the share of English learners, average test scores decrease by 2.10 points.
This coefficient is statistically significant at the 99% confidence level (\(\alpha = 0.01\)).
Storing our regression in “my_reg”
my_reg <- lm(score4 ~ scratio + income + english, data = school)
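To display the estimated coefficients interpreted above, and to check the no-multicollinearity assumption from Part 1, the fitted model can be inspected as follows. This is a minimal sketch: it assumes the stargazer package (used again later in this document) is installed, while vif() comes from the car package that AER already loads.
library(stargazer)
stargazer(my_reg, type = "text")   # text table of the estimated coefficients
vif(my_reg)                        # variance inflation factors; values above ~10 are commonly taken to signal multicollinearity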
Part 3) 4 Linear Regression Plots
# Plotting my linear model
par(mfrow = c(2, 2))
plot(my_reg)
What each plot means:
Residuals vs Fitted:
Check for a random scatter of points around the horizontal line; homoscedasticity implies consistent variance of the residuals across the range of fitted values.
We want to make sure the residuals are spread out evenly around zero.
Q-Q: Evaluate if the residuals follow a straight line.
A roughly linear pattern suggests residuals are normally distributed. Deviations from the line indicate non-normality. Departures at the tails might signal outliers or heavy-tailed distributions.
Scale-Location:
Plots the square root of the standardized residuals against the fitted values; it checks whether the spread of the residuals stays constant across the fitted values.
Line should be roughly straight and horizontal, with points spread evenly.
Residuals vs Leverage:
Outlying points far from the center might indicate influential observations impacting the regression line, and high-leverage points can heavily influence the regression model’s coefficients. If there are points outside of the Cook’s distance lines, they might point to influential outliers.
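As a complement to the visual check in the Residuals vs Leverage plot, the underlying influence measures can be inspected directly with base R; a minimal sketch, using my_reg as fitted above:
cd  <- cooks.distance(my_reg)          # Cook's distance for each observation
lev <- hatvalues(my_reg)               # leverage (hat values)
head(sort(cd, decreasing = TRUE), 3)   # the three most influential observations
head(sort(lev, decreasing = TRUE), 3)  # the three highest-leverage observations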
Interpreting the linear model
plot(my_reg, which = 1)
Residuals vs Fitted: The nonlinear pattern in the residuals shows that linearity is violated. We can also note some outliers, observations 179 and 208.
plot(my_reg, which = 2)
Q-Q Residuals: Shows that the majority of the residuals follow the diagonal line across roughly five standard deviations, with a few small outliers on the x-axis around -3 standard deviations, but overall the distribution looks approximately normal.
plot(my_reg, which = 3)
Scale-Location: The overall line here is roughly horizontal but is nevertheless pulled by residuals further from the center and by outliers at the top right.
plot(my_reg, which = 4:5)
Residuals vs Leverage: This allows us to identify influential outliers. Based on the graph, we don’t have any residual points outside the Cook’s distance lines, but we do have 2-3 residuals that stick out.
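As a more formal complement to the visual homoscedasticity check in the Scale-Location plot, a Breusch-Pagan test can be run with the lmtest package that AER already loads; a minimal sketch, where a small p-value would suggest heteroscedasticity:
bptest(my_reg)   # Breusch-Pagan test for heteroscedasticity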
Adjusting the data
The graphs below show the distribution of the data before the transformation.
library(ggplot2)
ggplot(data = school, aes(x = scratio)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "navy") +
  geom_density(color = "red") +
  labs(title = "Distribution of scratio", x = "scratio", y = "Density")
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
ggplot(data = school, aes(x = income)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "darkgreen") +
  geom_density(color = "red") +
  labs(title = "Distribution of Income", x = "Income", y = "Density")
ggplot(data = school, aes(x = english)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "orange") +
  geom_density(color = "red") +
  labs(title = "Distribution of English", x = "English", y = "Density")
We are implementing a log transformation:
A summary is added along with the graphs to show the impact the transformation has had.
# Log transformation of the variables; + 1e-10 is added to avoid taking log(0)
school$log_income  <- log(school$income + 1e-10)
school$log_english <- log(school$english + 1e-10)
school$log_scratio <- log(school$scratio + 1e-10)
# Adjusting the linear model
my_reg2 <- lm(score4 ~ log_scratio + log_income + log_english, data = school)
# Summary of the new model
stargazer(my_reg2, type = "text")
ggplot(data = school, aes(x = log_income)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "darkgreen") +
  geom_density(color = "red") +
  labs(title = "Distribution of Log Income", x = "Income", y = "Density")
ggplot(data = school, aes(x = log_english)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "orange") +
  geom_density(color = "red") +
  labs(title = "Distribution of Log English", x = "English", y = "Density")
Post-transformation analysis
Statistics such as the R-squared and the F-statistic worsened after the transformation, but the distributions in the density graphs look somewhat better. The residuals are more evenly scattered and, while still affected by some outliers, the transformation has helped with linearity.
# Plotting the transformed linear model
par(mfrow = c(2, 2))
plot(my_reg2)
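To back the comparison of fit statistics mentioned above with numbers, the two models can be put side by side using base R summary() components; a minimal sketch, with the values omitted here:
# Compare R-squared and F-statistics of the original and log-transformed models
c(original = summary(my_reg)$r.squared,
  logged   = summary(my_reg2)$r.squared)
c(original = summary(my_reg)$fstatistic[1],
  logged   = summary(my_reg2)$fstatistic[1])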