A real estate economist collects data on 1,000 house sales from two similar neighborhoods: one, called “University Town,” borders a large state university; the other lies about three miles from the university. He specifies the following regression equation:
y = β0 + β1x1 + δ2D2 + γ(D2 ∗ x1) + β3 x3 + δ4D4 + δ5D5 + ε
where
- y = house price (in $1,000s)
- x1 = living area (in hundreds of square feet)
- D2 = 1 if the house is near the university, 0 otherwise
- x3 = age of the house (in years)
- D4 = 1 if the house has a pool, 0 otherwise
- D5 = 1 if a fireplace is present, 0 otherwise
Based on the given regression equation, let's analyze each coefficient:
β0: The intercept represents the estimated price of a baseline house: one away from the university (D2 = 0) with zero living area, zero age, no pool, and no fireplace. It anchors the model rather than describing a realistic house.
β1: The coefficient on x1 (living area, in hundreds of square feet) represents the estimated change in price for a one-unit (100 square foot) increase in living area, for houses away from the university (D2 = 0), holding all other variables constant. A positive β1 suggests that larger living areas are associated with higher house prices.
δ2: The coefficient on D2 (house near the university) represents the difference in the intercept between houses near the university and those farther away, controlling for the other variables. A positive δ2 suggests that, at a given size, houses near the university tend to command higher prices.
γ: The coefficient on (D2 * x1) captures the interaction between proximity to the university and living area: it measures whether the effect of living area on price differs depending on whether the house is near the university. A positive γ indicates that the positive relationship between living area and price is stronger for houses near the university (see the derivation after this list).
β3: The coefficient for x3 (age of the house) represents the estimated change in house prices for a one-year increase in age, holding all other variables constant. A negative β3 suggests that older houses tend to have lower prices compared to newer houses.
δ4: The coefficient for D4 (house has a pool) represents the difference in house prices between houses with a pool and those without a pool, controlling for other variables. A positive δ4 suggests that houses with a pool tend to have higher prices compared to houses without a pool.
δ5: The coefficient for D5 (fireplace presence) represents the difference in house prices between houses with a fireplace and those without a fireplace, controlling for other variables. A positive δ5 suggests that houses with a fireplace tend to have higher prices compared to houses without a fireplace.
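To see how δ2 and γ work together, write out the conditional expectation of price separately for the two neighborhoods:

E[y | D2 = 0] = β0 + β1 x1 + β3 x3 + δ4 D4 + δ5 D5
E[y | D2 = 1] = (β0 + δ2) + (β1 + γ) x1 + β3 x3 + δ4 D4 + δ5 D5

So δ2 shifts the intercept for University Town houses, γ changes the slope on living area, and the expected price gap between the two neighborhoods at a given size is δ2 + γ x1.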
Now let's fit this regression model to the data in R and comment on the results:
Data <- read.table("./Data/housing.txt", header = TRUE)
head(Data)
## P S A Ut Pol Fp
## 1 205.452 23.46 6 0 0 1
## 2 185.328 20.03 5 0 0 1
## 3 248.422 27.77 6 0 0 0
## 4 154.690 20.17 1 0 0 0
## 5 221.801 26.45 0 0 0 1
## 6 199.119 21.56 6 0 0 1
fitF <- lm(P~S+Ut+I(S*Ut)+A+Pol+Fp, data = Data)
summary(fitF)
##
## Call:
## lm(formula = P ~ S + Ut + I(S * Ut) + A + Pol + Fp, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.289 -10.141 0.148 10.565 44.783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5000 6.1917 3.957 8.13e-05 ***
## S 7.6122 0.2452 31.048 < 2e-16 ***
## Ut 27.4530 8.4226 3.259 0.001154 **
## I(S * Ut) 1.2994 0.3321 3.913 9.72e-05 ***
## A -0.1901 0.0512 -3.712 0.000217 ***
## Pol 4.3772 1.1967 3.658 0.000268 ***
## Fp 1.6492 0.9720 1.697 0.090056 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.23 on 993 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.8698
## F-statistic: 1113 on 6 and 993 DF, p-value: < 2.2e-16
The fitted regression model suggests the following:
The goodness-of-fit statistic R^2 (R-squared) measures how well the regression model fits the data. Here R^2 is 0.8706, so approximately 87.06% of the variation in the dependent variable (house prices) is explained by the independent variables in the model, indicating a good fit. The F-statistic is highly significant, supporting the overall significance of the model.
n <- nrow(Data)                   # number of observations
p <- length(coefficients(fitF))   # number of estimated coefficients, including the intercept
R2_adj <- 1 - (1 - summary(fitF)$r.squared) * ((n - 1) / (n - p))
print(R2_adj)
## [1] 0.8697879
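As a quick consistency check, R stores the same quantity in the model summary; reading it back should reproduce the value above.

summary(fitF)$adj.r.squared   # adjusted R-squared as computed by lm()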
SSR <- sum(fitF$residuals^2)            # residual sum of squares
SST <- sum((Data$P - mean(Data$P))^2)   # total sum of squares
df_model <- p - 1                       # degrees of freedom for the model
df_error <- n - p                       # degrees of freedom for the error
F_stat <- ((SST - SSR) / df_model) / (SSR / df_error)
print(F_stat)
## [1] 1113.183
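Similarly, the hand-computed F-statistic can be checked against the one stored in the model summary, which also records the numerator and denominator degrees of freedom:

summary(fitF)$fstatistic   # value, numdf, dendf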
anova(fitF)
## Analysis of Variance Table
##
## Response: P
## Df Sum Sq Mean Sq F value Pr(>F)
## S 1 628934 628934 2713.179 < 2.2e-16 ***
## Ut 1 909292 909292 3922.626 < 2.2e-16 ***
## I(S * Ut) 1 3458 3458 14.918 0.0001196 ***
## A 1 2928 2928 12.631 0.0003972 ***
## Pol 1 2982 2982 12.864 0.0003514 ***
## Fp 1 667 667 2.879 0.0900558 .
## Residuals 993 230184 232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this specific ANOVA table, the “Residuals” row shows that there are 993 degrees of freedom associated with the residuals. The sum of squares (SS) for the residuals is 230,184, and the mean square (MS) is 232.
In summary, the ANOVA results indicate that the variables S, Ut, I(S * Ut), A, and Pol are significant predictors of house prices, while Fp is only marginally significant (p ≈ 0.090). Note that anova() reports sequential (Type I) sums of squares, so each row tests a term after the terms listed above it; because Fp enters last, its p-value matches the t-test in the coefficient table.
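A more direct way to assess Fp is a partial F-test comparing the full model against a reduced model without the fireplace dummy. A minimal sketch (fitR is a name introduced here for the reduced fit):

fitR <- lm(P ~ S + Ut + I(S * Ut) + A + Pol, data = Data)   # model without Fp
anova(fitR, fitF)   # partial F-test for Fp; should give a p-value of about 0.09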
coef(fitF) # Coefficients
## (Intercept) S Ut I(S * Ut) A Pol
## 24.4999849 7.6121766 27.4529522 1.2994049 -0.1900864 4.3771633
## Fp
## 1.6491756
coef(summary(fitF))[, 2] # Standard errors
## (Intercept) S Ut I(S * Ut) A Pol
## 6.19172142 0.24517647 8.42258236 0.33204775 0.05120461 1.19669164
## Fp
## 0.97195682
Fitted regression model is ŷ = 24.499 + 7.612 x1 + 27.453 D2 + 1.299 (D2 ∗ x1) − 0.190 x3 + 4.377 D4 + 1.649 D5
Based on the regression results, we estimate that:
- A location near the university raises the price of a house by $27,453 (27.453 × $1,000) at x1 = 0; with the interaction, this premium grows with living area.
- The expected price rises by $76.12 per additional square foot of living area (i.e., $7,612 per 100 square feet) for houses away from the university, and by $89.12 per square foot ($76.12 + $12.99) for houses near it, holding the other variables constant.
- Houses depreciate by $190.10 per year of age.
- A pool increases the value of a home by $4,377.20.
- A fireplace increases the value of a home by $1,649.20.
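These dollar figures can be reproduced from the stored coefficients. A small sketch (b is a helper name introduced here; prices are in $1,000s and S is in hundreds of square feet, hence the scaling):

b <- coef(fitF)
(b["S"] * 1000) / 100                      # per extra square foot, away from the university (~$76.12)
((b["S"] + b["I(S * Ut)"]) * 1000) / 100   # per extra square foot, near the university (~$89.12)
b["Ut"] * 1000                             # university-location premium at S = 0 (~$27,453)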
# Obtain the residuals from the model
residuals <- resid(fitF)
#print(residuals)
# Calculate the sum of squares of residuals
SS_residuals <- sum(residuals^2)
print("Sum of squares of residuals")
## [1] "Sum of squares of residuals"
print(SS_residuals)
## [1] 230184.4
# Calculate the total sum of squares
y_mean <- mean(Data$P)
print("Mean of the Data")
## [1] "Mean of the Data"
print(y_mean)
## [1] 247.6557
SS_total <- sum((Data$P - y_mean)^2)
print("Total sum of squares")
## [1] "Total sum of squares"
print(SS_total)
## [1] 1778446
# Calculate the sum of squares due to regression
SS_regression <- SS_total - SS_residuals
print("Sum of squares due to regression")
## [1] "Sum of squares due to regression"
print(SS_regression)
## [1] 1548262
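As a sanity check, the ratio of the regression sum of squares to the total sum of squares should reproduce the multiple R-squared of 0.8706 reported by summary(fitF):

SS_regression / SS_total   # multiple R-squared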
# Calculate the degrees of freedom for the regression
df_regression <- length(fitF$coefficients) - 1
print("Degrees of freedom for the regression")
## [1] "Degrees of freedom for the regression"
print(df_regression)
## [1] 6
# Calculate the degrees of freedom for the residuals
df_residuals <- length(residuals) - df_regression - 1
print("Degrees of freedom for the residuals")
## [1] "Degrees of freedom for the residuals"
print(df_residuals)
## [1] 993
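R also stores the residual degrees of freedom in the fitted object, which should agree with the value just computed:

df.residual(fitF)   # residual degrees of freedom of the lm fit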
# Calculate the mean square due to regression
MS_regression <- SS_regression / df_regression
print("Mean square due to regression")
## [1] "Mean square due to regression"
print(MS_regression)
## [1] 258043.6
# Calculate the mean square of residuals
MS_residuals <- SS_residuals / df_residuals
print("mean square of residuals")
## [1] "mean square of residuals"
print(MS_residuals)
## [1] 231.8071
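Taking the square root of this mean square should recover the residual standard error of 15.23 reported by summary(fitF):

sqrt(MS_residuals)   # residual standard error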
# Calculate the F-value/ F-Statistic
F_value <- MS_regression / MS_residuals
print("F-value/ F-Statistic")
## [1] "F-value/ F-Statistic"
print(F_value)
## [1] 1113.183
# Calculate the p-value (note: 1 - pf() underflows to 0 for extremely small p-values)
p_value <- 1 - pf(F_value, df_regression, df_residuals)
print("P-value")
## [1] "P-value"
print(p_value)
## [1] 0
After carrying out the calculation manually, the printed p-value (0) appears to differ from the "< 2.2e-16" reported by summary(fitF). The discrepancy is purely a matter of floating-point display: the true p-value is so small that 1 - pf(...) underflows to exactly 0 in double precision, while summary() caps its printed p-values at the machine epsilon (about 2.2e-16). Both results say the same thing: the overall model is highly significant.
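If an actual magnitude is wanted, the underflow can be sidestepped by asking pf() for the upper tail on the log scale; a minimal sketch:

# log of the p-value; finite even when the p-value itself underflows to zero
pf(F_value, df_regression, df_residuals, lower.tail = FALSE, log.p = TRUE)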