# Clear the environment
rm(list = ls())
# Load the necessary libraries (suppress AER's startup messages)
suppressMessages(library(AER))
library(psych)
Linearity:
2.Linearity means we assume that when we change one thing, the outcome changes in a straightforward, predictable way. Imagine you're selling lemonade; if you increase the amount of lemons you use, you expect your sales to go up consistently. If using more lemons leads to varying increases in sales (sometimes a lot, sometimes not much), then our simple model won't capture that well. This straightforward relationship helps us make clear predictions. If we assume linearity when it's not true, we might be surprised when our sales don't match our expectations. So, it's like saying that if you press the gas pedal in your car, you should expect to go faster at a steady rate.
3.Linearity refers to the requirement that the relationship between the independent variables (predictors) and the dependent variable (outcome) can be described by a linear equation. This means that the effect of a change in an independent variable on the dependent variable is constant, regardless of the level of the independent variable. Linear models are simpler to understand and interpret. For example, if we increase our advertising budget by $1,000 and see an increase in sales of $2,000, that relationship is linear because it remains constant at all levels of spending. If the relationship were non-linear (e.g., diminishing returns), a linear model would not accurately capture that complexity. If the true relationship is non-linear and we use a linear model, our predictions and inferences might be misleading.
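To make this concrete, here is a minimal R sketch using simulated data (not the assignment data) of one common way to probe linearity: add a squared term and test whether it improves on the straight-line fit.
# Sketch (simulated data): compare a straight-line fit with one that adds a
# squared term; a clearly significant squared term would cast doubt on linearity.
set.seed(1)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200)          # the true relationship here is linear
fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ x + I(x^2))
anova(fit_linear, fit_quadratic)     # F-test: does the squared term help?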
Random Sampling:
2.Random sampling is like picking names from a hat to make sure everyone gets a fair chance. If you want to know how people feel about a new park in your city, you wouldn't just ask your friends; that would give you biased results because your friends might share similar opinions. Instead, you'd want to ask a variety of people from different neighborhoods and backgrounds to get a true sense of what everyone thinks. This way, the results reflect the opinions of the entire community, not just a specific group. If you only talk to people in one area, you might miss out on what others feel, leading to conclusions that aren't accurate.
3.Random sampling is crucial because it ensures that every member of the population has an equal chance of being included in the sample. This process helps to avoid biases in the selection of data points that could lead to erroneous conclusions. When data is randomly sampled, it is more likely to reflect the true characteristics of the population. For instance, if we are studying the average height of adults in a city, a random sample would include a mix of different age groups, genders, and ethnic backgrounds, resulting in a more accurate estimate. If the sample is not random (e.g., only sampling from a specific neighborhood), the results may not generalize to the entire population, leading to inaccurate conclusions. For example, if we only sample from a wealthy area, we might overestimate the average income of the city.
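As a small illustration of the idea, here is a sketch of drawing a simple random sample in R; the "population" is simulated heights, not real data.
# Sketch: simple random sampling gives every unit the same chance of selection,
# so the sample mean should track the population mean.
set.seed(1)
population <- rnorm(10000, mean = 170, sd = 10)   # simulated adult heights (cm)
random_sample <- sample(population, size = 100)
c(sample_mean = mean(random_sample), population_mean = mean(population))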
Non-collinearity:
2.Non-collinearity means that the things we're looking at to predict an outcome shouldn't be too similar to each other. Think of it like this: if you're trying to figure out what makes someone a good driver, you wouldn't want to ask both about their years of driving experience and their number of driving lessons—those two things are closely related. If they are, it becomes hard to know which one actually matters more when we look at how well someone drives. If we mix up similar factors, it can lead to confusion, and we may end up with results that don't really tell us anything useful.
3.Non-collinearity means that the independent variables should not be perfectly correlated with one another. When two or more predictors are highly correlated, it becomes difficult to determine their individual effects on the dependent variable. High correlation among independent variables can lead to multicollinearity, which inflates the variances of the coefficient estimates. This inflation makes it challenging to assess the individual contribution of each variable, leading to less reliable statistical inferences. For instance, if both “years of education” and “income level” are included in the model, and they are highly correlated, it can be difficult to disentangle their effects on a dependent variable like “job performance.” As a result, the coefficients might be unstable and could change dramatically with small changes in the data.
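A minimal sketch of how one might screen for this in R with the HousePrices data used later in this report; the choice of predictors is illustrative, and vif() comes from the car package, which is attached as a dependency of AER.
# Sketch: pairwise correlations and variance inflation factors (VIFs) for a
# multi-predictor model; VIFs far above ~10 are a common warning sign.
data("HousePrices", package = "AER")
cor(HousePrices[, c("lotsize", "bedrooms", "bathrooms")])
multi_reg <- lm(price ~ lotsize + bedrooms + bathrooms, data = HousePrices)
car::vif(multi_reg)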
Exogeneity:
2.Exogeneity means that the things we're measuring to predict an outcome shouldn't be affected by hidden factors that also influence the outcome. Let's say you're studying whether studying more leads to better grades. If you don't consider that students who study more might also be naturally more organized or have better resources, your conclusion about studying may be misleading. In other words, if studying and grades are connected by something else—like being a naturally good student—then your findings might suggest that studying alone is responsible for good grades, which isn't fair. It's important to ensure that the factors we are examining don't have their own hidden influences that could mislead us.
3.Exogeneity refers to the idea that the independent variables should not be correlated with the error term in the regression model. This means that any unobserved factors that affect the dependent variable should not also influence the independent variables. If the independent variables are correlated with the error term, it indicates that there are omitted variables influencing both the predictors and the outcome. This correlation can lead to biased and inconsistent estimates of the regression coefficients. For example, if we’re studying the effect of education on income but fail to control for innate ability (which affects both education and income), then our estimate of the effect of education may be biased. We might incorrectly conclude that education has a stronger impact on income than it actually does because we haven't accounted for this shared influence.
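The bias from an omitted variable can be seen in a small simulation; this sketch uses made-up variables (ability, education, income) purely for illustration.
# Sketch: omitted variable bias. "ability" raises both education and income,
# so leaving it out inflates the estimated return to education.
set.seed(1)
ability   <- rnorm(1000)
education <- 12 + 2 * ability + rnorm(1000)
income    <- 20000 + 3000 * education + 10000 * ability + rnorm(1000, sd = 5000)
coef(lm(income ~ education))            # biased upward: absorbs ability's effect
coef(lm(income ~ education + ability))  # close to the true coefficient of 3000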
Homoscedasticity:
2.Homoscedasticity means that no matter the situation, the errors or mistakes we make in predictions stay the same. Imagine you're throwing darts at a dartboard. If you always miss by about the same amount no matter where you aim on the board, that's like homoscedasticity. However, if you start missing by a lot more in some areas than in others, it becomes harder to improve your accuracy. In simple terms, when we try to predict something, the mistakes should be small and consistent across the board. If the mistakes start getting bigger in some situations, we might be missing something important, and our predictions won't be as reliable.
3.Homoscedasticity is the assumption that the variance of the error term (or residuals) remains constant across all levels of the independent variables. In other words, the errors should be spread out to the same degree whether the model is predicting small or large values, so the model performs consistently over the entire dataset. If the spread of the errors grows or shrinks with the independent variables, that is heteroscedasticity; the OLS estimates are then inefficient, and confidence intervals and hypothesis tests become less reliable. Homoscedasticity ensures that the model's predictions are consistently precise across the dataset.
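A quick simulated illustration of the difference; the data here are made up, and the two residual plots simply contrast a constant spread with a fanning spread.
# Sketch (simulated): homoscedastic vs. heteroscedastic errors.
set.seed(1)
x <- runif(300, 1, 10)
y_homo   <- 2 + 3 * x + rnorm(300, sd = 2)        # constant error variance
y_hetero <- 2 + 3 * x + rnorm(300, sd = 2 * x)    # variance grows with x
par(mfrow = c(1, 2))
plot(x, resid(lm(y_homo ~ x)),   main = "Roughly constant spread")
plot(x, resid(lm(y_hetero ~ x)), main = "Spread grows with x")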
data("HousePrices")
df <- HousePrices
describe(df)
## vars n mean sd median trimmed mad min max
## price 1 546 68121.60 26702.67 62000 65171.36 22239.00 25000 190000
## lotsize 2 546 5150.27 2168.16 4600 4908.62 2057.11 1650 16200
## bedrooms 3 546 2.97 0.74 3 2.93 0.00 1 6
## bathrooms 4 546 1.29 0.50 1 1.21 0.00 1 4
## stories 5 546 1.81 0.87 2 1.67 1.48 1 4
## driveway* 6 546 1.86 0.35 2 1.95 0.00 1 2
## recreation* 7 546 1.18 0.38 1 1.10 0.00 1 2
## fullbase* 8 546 1.35 0.48 1 1.31 0.00 1 2
## gasheat* 9 546 1.05 0.21 1 1.00 0.00 1 2
## aircon* 10 546 1.32 0.47 1 1.27 0.00 1 2
## garage 11 546 0.69 0.86 0 0.59 0.00 0 3
## prefer* 12 546 1.23 0.42 1 1.17 0.00 1 2
## range skew kurtosis se
## price 165000 1.20 1.91 1142.77
## lotsize 14550 1.32 2.71 92.79
## bedrooms 5 0.49 0.70 0.03
## bathrooms 3 1.58 2.13 0.02
## stories 3 1.07 0.63 0.04
## driveway* 1 -2.06 2.24 0.01
## recreation* 1 1.68 0.83 0.02
## fullbase* 1 0.63 -1.61 0.02
## gasheat* 1 4.33 16.82 0.01
## aircon* 1 0.79 -1.39 0.02
## garage 3 0.84 -0.58 0.04
## prefer* 1 1.25 -0.44 0.02
1&2.Linear Regression
my_reg <- lm(price ~ lotsize, df)
summary(my_reg)
##
## Call:
## lm(formula = price ~ lotsize, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69551 -14626 -2858 9752 106901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.414e+04 2.491e+03 13.7 <2e-16 ***
## lotsize 6.599e+00 4.458e-01 14.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22570 on 544 degrees of freedom
## Multiple R-squared: 0.2871, Adjusted R-squared: 0.2858
## F-statistic: 219.1 on 1 and 544 DF, p-value: < 2.2e-16
Interpretation
Intercept: The expected price of a house with a lot size of 0 square feet is approximately 34,140; since no house has a zero lot size, this is an extrapolation rather than a practically meaningful price. The p-value is < 2e-16, which is highly significant (***), indicating strong evidence that the intercept is not equal to zero.
Slope: For each additional square foot of lot size, the sale price of a house is expected to increase by approximately 6.60 currency units. The p-value is < 2e-16, which is also highly significant (***), indicating that the effect of lot size on price is statistically significant.
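To put the estimates to work, a short sketch computing the predicted price for a hypothetical lot size; the 5,000 square feet is an arbitrary example value.
# Sketch: predicted price for a hypothetical 5,000 sq ft lot
predict(my_reg, newdata = data.frame(lotsize = 5000))
# Equivalently, by hand: intercept + slope * lotsize
coef(my_reg)[1] + coef(my_reg)[2] * 5000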
# Set up the plotting region
par(mfrow = c(2, 2))
# Create a residual plot
plot(my_reg)
1.The first plot (Residuals vs Fitted) checks for non-linearity and for constant variance of the residuals (homoscedasticity); the points are scattered fairly evenly around zero, so these assumptions look reasonable.
The second plot (Normal Q-Q) checks whether the residuals follow a normal distribution, which matters for certain hypothesis tests and confidence intervals; this is fairly well satisfied.
The third plot (Scale-Location) probes the homoscedasticity assumption further by plotting the square root of the standardized residuals against the fitted values; here the spread appears roughly constant.
The fourth plot (Residuals vs Leverage) identifies influential points or outliers that have a disproportionate impact on the model. A couple of points have relatively large Cook's distance; investigating or removing them could improve the fit (one way to flag them is sketched below).
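A brief sketch of how those influential observations could be flagged; the 4/n cutoff is a common rule of thumb rather than a hard rule.
# Sketch: flag observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(my_reg)
which(cd > 4 / nrow(df))   # row indices of potentially influential points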
2.
Linearity: the model is linear in its parameters, which is what the OLS method requires.
Random sampling: the data are taken to be a random sample of house sales.
Non-collinearity: the regressors are not perfectly correlated with one another; with a single regressor (lotsize), this is satisfied trivially.
Exogeneity: the regressor is assumed to be uncorrelated with the error term.
Homoscedasticity: the diagnostic plots suggest this is satisfied as well; a formal check is sketched below.
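As a formal complement to the visual check of homoscedasticity, a sketch using the Breusch-Pagan test; bptest() comes from the lmtest package, which is attached with AER, and a small p-value would point to heteroscedasticity.
# Sketch: Breusch-Pagan test for heteroscedasticity on the fitted model
lmtest::bptest(my_reg)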
my_reg_log <- lm(log(price) ~ lotsize, df)
summary(my_reg_log)
##
## Call:
## lm(formula = log(price) ~ lotsize, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96513 -0.20552 0.00396 0.19758 0.88531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.058e+01 3.451e-02 306.51 <2e-16 ***
## lotsize 9.315e-05 6.177e-06 15.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3127 on 544 degrees of freedom
## Multiple R-squared: 0.2947, Adjusted R-squared: 0.2935
## F-statistic: 227.4 on 1 and 544 DF, p-value: < 2.2e-16
# Set up the plotting region
par(mfrow = c(2, 2))
plot(my_reg_log)
After the log transformation the residual points are more evenly spread out, but the overall fit and conclusions do not change much; the R-squared is similar to that of the level model. The slope is interpreted differently, though, as sketched below.
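In this log-level model the slope is approximately the proportional change in price per additional square foot of lot size; the sketch below works that out, with the 1,000 square foot increase chosen purely for illustration.
# Sketch: interpreting the log-level slope. A one-unit increase in lotsize
# raises log(price) by the slope, i.e. roughly a 100 * slope percent increase.
slope <- coef(my_reg_log)["lotsize"]
100 * slope                      # ~0.0093% higher price per extra square foot
100 * (exp(1000 * slope) - 1)    # ~9.8% higher price for an extra 1,000 sq ft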