Part I

  1. The Gauss-Markov assumptions underpin ordinary least squares (OLS) regression analysis. When the assumptions are met in a linear regression model, the OLS estimator of the coefficients is the best linear unbiased estimator (BLUE). The assumptions serve as a benchmark for the validity of the model: they guarantee an OLS model that is linear and unbiased and whose estimator has the smallest variance among all linear unbiased estimators. The conditions include linearity, non-collinearity, homoscedasticity and nonautocorrelation, exogeneity, normality, and randomness.

  2. In non-technical terms:

    Linearity - The relationship between the independent and dependent variable can be drawn on a graph as a straight line, because a change in the independent variable has a proportional effect on the dependent variable. Each time the independent variable increases or decreases by one unit, the dependent variable changes by the same amount.

    Non-collinearity - In a model, the independent variables are not strongly related to one another. If the independent variables move closely together (collinearity is present), the model becomes less reliable at separating the effect of each independent variable on the dependent variable, and the estimated relationships are no longer unique.

    Homoscedasticity & Nonautocorrelation - The spread of the data points around the fitted line is consistent across all values of the independent variable. Spread can be explained as how much each data point deviates from what the model predicts. Homoscedasticity increases the reliability of the model by ensuring that the predictions are equally precise for all values of the independent variable. Nonautocorrelation means that the deviation of one data point is unrelated to the deviation of any other.

    Exogeneity - The independent variable is not related to the unmeasured factors that also affect the dependent variable; otherwise those outside factors would distort the estimated effect of the independent variable on the dependent variable.

    Normality - When plotted on a graph, the errors (differences between the actual and predicted values) form a bell-shaped curve centred on zero. Observing such a curve establishes that the errors behave in a predictable way and that the estimated effect can be trusted.

    Randomness - The data used in the model are collected randomly, so that each observation has the same chance of being included in the sample. This keeps the model unbiased and representative of what is being modeled/observed.

    In the real world, we often communicate with people from other fields who may not understand the technical terms, so it is important to know how to explain these concepts without jargon.

  3. Linearity - The independent variable and dependent variable share a linear relationship. The OLS equation reflects this relationship: for every unit increase in X, Y changes by the constant amount \(\beta_1\).

    Non-collinearity - When there is more than one independent variable, say \(X_1\) and \(X_2\), they should not have an exact (or near-exact) linear relationship, as collinearity inflates the variance of the coefficient estimates and makes them unreliable.

    Homoscedasticity & Nonautocorrelation - The variance of the error terms is constant for all values of the independent variable, which can be represented mathematically as \(Var(\epsilon)=\sigma^2\). Nonautocorrelation requires that the error terms be uncorrelated with each other, i.e. \(Cov(\epsilon_i,\epsilon_j)=0\) for \(i \neq j\).

    Exogeneity - The expected value of the error term, conditional on the independent variables, is zero. This asserts that the independent variables are not related to factors outside of the model that influence the dependent variable. Mathematically, it can be shown as \(E(\mu|X)=0\).

    Normality - The error terms are normally distributed with a mean of zero.

    Randomness - The data are randomly sampled and every observation has an equal probability of being included. When this assumption is met, findings from the model can be generalised to the wider population.

    Being able to explain concepts in a technical manner is also crucial, as it allows us to provide a more in-depth explanation when it comes to problem-solving; the short simulation sketch below illustrates a few of these conditions.
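As a small illustration of the conditions above, the following sketch simulates data that satisfies the assumptions (a linear relationship, randomly sampled X, and normal, homoscedastic errors with zero conditional mean) and fits an OLS model. All names and parameter values here are hypothetical and chosen purely for illustration; they are not part of the assignment's analysis.

# hypothetical simulation of data that meets the Gauss-Markov conditions
set.seed(123)
n <- 200
x <- runif(n, 0, 10)             # randomly sampled independent variable
e <- rnorm(n, mean = 0, sd = 2)  # normal errors, mean zero, constant variance
y <- 3 + 1.5 * x + e             # linear relationship with beta0 = 3, beta1 = 1.5

sim_model <- lm(y ~ x)
coef(sim_model)                  # estimates should land close to 3 and 1.5
var(resid(sim_model))            # residual variance should be roughly sigma^2 = 4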

Part II

Dataset

The dataset being used in the model is a cross-sectional dataset called “Salaries” (available in the carData package) which includes 397 observations and 6 variables.

# attach the carData package, which provides the Salaries dataset
library(carData)

# load data as df1
data("Salaries")
df1 <- Salaries
summary(df1)
##         rank     discipline yrs.since.phd    yrs.service        sex     
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545
# count the missing values in each column
sapply(df1, function(x) sum(is.na(x)))
##          rank    discipline yrs.since.phd   yrs.service           sex 
##             0             0             0             0             0 
##        salary 
##             0
# build a scatterplot of the independent and dependent variables
plot(df1$yrs.since.phd, df1$salary,
     xlab = "Number of Years Since PhD", ylab = "Salary",
     main = "Scatterplot of Salaries Dataset")

Simple Linear Regression

Estimating Equation of the Model

\[ salary = \beta_0 + \beta_1 \cdot yrs.since.phd + \mu \]

Independent Variable: number of years since PhD

Dependent Variable: nine-month academic salary in dollars

# build a simple linear regression model
my_reg <- lm(salary ~ yrs.since.phd, data = df1)
summary(my_reg)
## 
## Call:
## lm(formula = salary ~ yrs.since.phd, data = df1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84171 -19432  -2858  16086 102383 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    91718.7     2765.8  33.162   <2e-16 ***
## yrs.since.phd    985.3      107.4   9.177   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27530 on 395 degrees of freedom
## Multiple R-squared:  0.1758, Adjusted R-squared:  0.1737 
## F-statistic: 84.23 on 1 and 395 DF,  p-value: < 2.2e-16

Based on the results from the regression model, the fitted regression can be written as:

\[ salary = 91718.7 + 985.3 \cdot yrs.since.phd \]

Interpretation of model:

If a person’s number of years since their PhD increases by one, their nine-month salary is predicted to increase by 985.3 dollars on average. Based on the intercept, someone with zero years since their PhD would have a predicted nine-month salary of 91718.7 dollars.

Since the p-value for the slope coefficient is less than 0.05, the null hypothesis that \(\beta_1=0\) can be rejected, which indicates that the coefficient is statistically significant.
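To illustrate the interpretation above, the short sketch below uses the fitted model to predict the nine-month salary of a hypothetical faculty member with 10 years since the PhD (the value 10 is chosen purely for illustration) and to report confidence intervals for the coefficients.

# predicted nine-month salary at a hypothetical 10 years since the PhD,
# with a 95% confidence interval for the mean prediction
predict(my_reg,
        newdata = data.frame(yrs.since.phd = 10),
        interval = "confidence", level = 0.95)

# 95% confidence intervals for the intercept and the slope
confint(my_reg, level = 0.95)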

Part III.

Questions 1 and 2 of Part III will be answered under each plot.

# plot residual vs fitted
plot(my_reg, which=1)

The residuals vs fitted plot is used to assess linearity, equal error variances, and outliers.

Rule of thumb for the assumptions to be met:

  • Linearity: The residuals scatter randomly around zero (the horizontal dotted line).

  • Equal Error Variances: The residuals form a roughly horizontal band of constant width around the zero line.

  • No outliers: No residual lies far away from the pattern of the other residuals.

Interpretation:

A curved shape can be observed, and a proportion of the residuals are far away from zero. This could indicate that the model does not sufficiently account for the variability in the data. Therefore, the Gauss-Markov assumption of linearity is violated.
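A formal specification test could back up this visual impression. The sketch below runs a Ramsey RESET test on the fitted model; it assumes the lmtest package is installed and was not part of the original analysis.

# Ramsey RESET test as a formal check of functional form / linearity
# (assumes the lmtest package is installed)
library(lmtest)
resettest(my_reg, power = 2:3, type = "fitted")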

# plot Q-Q residuals
plot(my_reg, which = 2)

The Q-Q residuals plot, also known as the normal probability plot of the residuals, is used to determine whether the error terms are normally distributed.

Rule of thumb:

Normality: The residuals in the plot fall approximately on a straight line, close to the dotted reference line.

Interpretation:

The large majority of the residuals lie close to the dotted line and follow it approximately linearly. It can therefore be assumed that the error terms are normally distributed and that the condition for normality is met.
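A numerical check could complement the visual inspection, for example a Shapiro-Wilk test on the residuals (a base-R sketch, not part of the original analysis).

# Shapiro-Wilk test of normality on the residuals
# (null hypothesis: the residuals come from a normal distribution)
shapiro.test(residuals(my_reg))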

# plot scale-location
plot(my_reg, which = 3)

The scale-location plot is used to measure homoscedasticity (constant variance).

Rule of thumb for assumption to be met:

Homoscedasticity: The residuals are equally spread and the red line is approximately horizontal.

Interpretation:

Based on the plot, the assumption of homoscedasticity is violated, as the red line shows a slight upward trend. The spread around the red line also does not appear to be equal: for fitted values above 130000, the residuals are noticeably more scattered.
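A formal heteroscedasticity test could be run alongside the plot, for instance a Breusch-Pagan test; the sketch below assumes the lmtest package is installed and is not part of the original analysis.

# Breusch-Pagan test for heteroscedasticity
# (null hypothesis: the error variance is constant)
library(lmtest)
bptest(my_reg)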

# plot cook's distance 
plot(my_reg, which = 4)

# plot residual vs leverage
plot(my_reg, which = 5)

The residuals vs leverage plot, together with Cook’s distance, can be used to check for outliers and high-leverage points (data points that have an extreme influence on the regression coefficients).

Rule of thumb for assumptions:

No outliers: The standardized residuals are not far from zero, typically within three standard deviations.

Absence of high-leverage points: None of the observations fall beyond the Cook’s distance contour lines.

Interpretation:

There are no high-leverage points in the data, as none of the observations fall beyond the Cook’s distance lines. There are also no outliers, as none of the standardized residuals are more than 3 standard deviations from zero. These assumptions are met.
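The influence measures behind the plot could also be inspected directly. The sketch below uses base R; the 4/n cutoff is only a common rule of thumb and is not part of the original analysis.

# Cook's distance for every observation
cooks_d <- cooks.distance(my_reg)

# flag observations above the common 4/n rule-of-thumb cutoff
n_obs <- nrow(df1)
which(cooks_d > 4 / n_obs)

# range of the standardized residuals, to check the +/- 3 rule
summary(rstandard(my_reg))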

Conclusion: Based on the individual interpretations of the plots above, two Gauss-Markov assumptions were violated: linearity and homoscedasticity.

# transform the dependent variable (salary) with a natural log
lm_model2 <- lm(log(salary) ~ yrs.since.phd, data = df1)
summary(lm_model2)
## 
## Call:
## lm(formula = log(salary) ~ yrs.since.phd, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88900 -0.16833  0.00347  0.16163  0.61047 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.142e+01  2.368e-02 482.055   <2e-16 ***
## yrs.since.phd 8.591e-03  9.193e-04   9.345   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2357 on 395 degrees of freedom
## Multiple R-squared:  0.1811, Adjusted R-squared:  0.179 
## F-statistic: 87.33 on 1 and 395 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm_model2)

  1. Rebuilding the model with a log-transformed salary did not correct the problems that existed in the original regression model. Non-linearity can still be seen in the residuals vs fitted plot, as the red line is still not approximately horizontal. Heteroscedasticity is also still present, as the scale-location plot shows essentially the same pattern as before.
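If desired, the formal checks sketched earlier could be repeated on the transformed model to confirm that the violations persist (again assuming the lmtest package is installed; this is not part of the original analysis).

# repeat the formal checks on the log-transformed model
library(lmtest)
resettest(lm_model2, power = 2:3, type = "fitted")  # functional form / linearity
bptest(lm_model2)                                   # constant error variance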