A real estate economist collects data on 1,000 house sales from two similar neighborhoods: one, called “University Town,” borders a large state university; the other lies about three miles from the university. He specifies the following regression equation:
y = β0 + β1x1 + δ2D2 + γ(D2 ∗ x1) + β3 x3 + δ4D4 + δ5D5 + ε
where
- y = house price (in $1,000s)
- x1 = living area (in hundreds of square feet)
- D2 = 1 if the house is near the university, 0 otherwise
- x3 = age of the house (in years)
- D4 = 1 if the house has a pool, 0 otherwise
- D5 = 1 if a fireplace is present, 0 otherwise
Based on the given regression equation, let's analyze each coefficient:
β0: The intercept represents the estimated price of a baseline house: one away from the university (D2 = 0) with zero living area, zero age, no pool, and no fireplace. It anchors the model rather than describing a realistic house.
β1: The coefficient on x1 (living area, in hundreds of square feet) represents the estimated change in price for a one-unit (100 square foot) increase in living area, for houses away from the university (D2 = 0), holding all other variables constant. A positive β1 suggests that larger living areas are associated with higher house prices.
δ2: The coefficient on D2 (house near the university) represents the difference in the intercept between houses near the university and those farther away, controlling for the other variables. A positive δ2 suggests that, at a given size, houses near the university tend to command higher prices.
γ: The coefficient on (D2 * x1) captures the interaction between proximity to the university and living area: it measures whether the effect of living area on price differs depending on whether the house is near the university. A positive γ indicates that the positive relationship between living area and price is stronger for houses near the university (see the derivation after this list).
β3: The coefficient for x3 (age of the house) represents the estimated change in house prices for a one-year increase in age, holding all other variables constant. A negative β3 suggests that older houses tend to have lower prices compared to newer houses.
δ4: The coefficient for D4 (house has a pool) represents the difference in house prices between houses with a pool and those without a pool, controlling for other variables. A positive δ4 suggests that houses with a pool tend to have higher prices compared to houses without a pool.
δ5: The coefficient for D5 (fireplace presence) represents the difference in house prices between houses with a fireplace and those without a fireplace, controlling for other variables. A positive δ5 suggests that houses with a fireplace tend to have higher prices compared to houses without a fireplace.
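To see how δ2 and γ work together, write out the conditional expectation of price separately for the two neighborhoods:

E[y | D2 = 0] = β0 + β1 x1 + β3 x3 + δ4 D4 + δ5 D5
E[y | D2 = 1] = (β0 + δ2) + (β1 + γ) x1 + β3 x3 + δ4 D4 + δ5 D5

So δ2 shifts the intercept for University Town houses, γ changes the slope on living area, and the expected price gap between the two neighborhoods at a given size is δ2 + γ x1.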
Now let's fit this regression model to the data in R and comment on the results:
Data <- read.table("./Data/housing.txt", header = TRUE)
head(Data)
## P S A Ut Pol Fp
## 1 205.452 23.46 6 0 0 1
## 2 185.328 20.03 5 0 0 1
## 3 248.422 27.77 6 0 0 0
## 4 154.690 20.17 1 0 0 0
## 5 221.801 26.45 0 0 0 1
## 6 199.119 21.56 6 0 0 1
fitF <- lm(P~S+Ut+I(S*Ut)+A+Pol+Fp, data = Data)
summary(fitF)
##
## Call:
## lm(formula = P ~ S + Ut + I(S * Ut) + A + Pol + Fp, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.289 -10.141 0.148 10.565 44.783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5000 6.1917 3.957 8.13e-05 ***
## S 7.6122 0.2452 31.048 < 2e-16 ***
## Ut 27.4530 8.4226 3.259 0.001154 **
## I(S * Ut) 1.2994 0.3321 3.913 9.72e-05 ***
## A -0.1901 0.0512 -3.712 0.000217 ***
## Pol 4.3772 1.1967 3.658 0.000268 ***
## Fp 1.6492 0.9720 1.697 0.090056 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.23 on 993 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.8698
## F-statistic: 1113 on 6 and 993 DF, p-value: < 2.2e-16
The fitted regression model suggests the following:
The goodness-of-fit statistic R^2 (R-squared) measures how well the regression model fits the data. Here R^2 is 0.8706, so approximately 87.06% of the variation in the dependent variable (house prices) is explained by the independent variables in the model, indicating a good fit. The F-statistic is highly significant, supporting the overall significance of the model.
n <- nrow(Data)                   # number of observations
p <- length(coefficients(fitF))   # number of estimated coefficients, including the intercept
R2_adj <- 1 - (1 - summary(fitF)$r.squared) * ((n - 1) / (n - p))
print(R2_adj)
## [1] 0.8697879
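As a quick consistency check, R stores the same quantity in the model summary; reading it back should reproduce the value above.

summary(fitF)$adj.r.squared   # adjusted R-squared as computed by lm()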
SSR <- sum(fitF$residuals^2)            # residual sum of squares
SST <- sum((Data$P - mean(Data$P))^2)   # total sum of squares
df_model <- p - 1                       # degrees of freedom for the model
df_error <- n - p                       # degrees of freedom for the error
F_stat <- ((SST - SSR) / df_model) / (SSR / df_error)
print(F_stat)
## [1] 1113.183
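Similarly, the hand-computed F-statistic can be checked against the one stored in the model summary, which also records the numerator and denominator degrees of freedom:

summary(fitF)$fstatistic   # value, numdf, dendf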
anova(fitF)
## Analysis of Variance Table
##
## Response: P
## Df Sum Sq Mean Sq F value Pr(>F)
## S 1 628934 628934 2713.179 < 2.2e-16 ***
## Ut 1 909292 909292 3922.626 < 2.2e-16 ***
## I(S * Ut) 1 3458 3458 14.918 0.0001196 ***
## A 1 2928 2928 12.631 0.0003972 ***
## Pol 1 2982 2982 12.864 0.0003514 ***
## Fp 1 667 667 2.879 0.0900558 .
## Residuals 993 230184 232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this specific ANOVA table, the “Residuals” row shows that there are 993 degrees of freedom associated with the residuals. The sum of squares (SS) for the residuals is 230,184, and the mean square (MS) is 232.
In summary, the ANOVA results indicate that the variables S, Ut, I(S * Ut), A, and Pol are significant predictors of house prices, while Fp is only marginally significant (p ≈ 0.090). Note that anova() reports sequential (Type I) sums of squares, so each row tests a term after the terms listed above it; because Fp enters last, its p-value matches the t-test in the coefficient table.
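A more direct way to assess Fp is a partial F-test comparing the full model against a reduced model without the fireplace dummy. A minimal sketch (fitR is a name introduced here for the reduced fit):

fitR <- lm(P ~ S + Ut + I(S * Ut) + A + Pol, data = Data)   # model without Fp
anova(fitR, fitF)   # partial F-test for Fp; should give a p-value of about 0.09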
coef(fitF) # Coefficients
## (Intercept) S Ut I(S * Ut) A Pol
## 24.4999849 7.6121766 27.4529522 1.2994049 -0.1900864 4.3771633
## Fp
## 1.6491756
coef(summary(fitF))[, 2] # Standard errors
## (Intercept) S Ut I(S * Ut) A Pol
## 6.19172142 0.24517647 8.42258236 0.33204775 0.05120461 1.19669164
## Fp
## 0.97195682
Fitted regression model is ŷ = 24.499 + 7.612 x1 + 27.453 D2 + 1.299 (D2 ∗ x1) − 0.190 x3 + 4.377 D4 + 1.649 D5
Based on the regression results, we estimate that:
- A location near the university raises the price of a house by $27,453 (27.453 × $1,000) at x1 = 0; with the interaction, this premium grows with living area.
- The expected price rises by $76.12 per additional square foot of living area (i.e., $7,612 per 100 square feet) for houses away from the university, and by $89.12 per square foot ($76.12 + $12.99) for houses near it, holding the other variables constant.
- Houses depreciate by $190.10 per year of age.
- A pool increases the value of a home by $4,377.20.
- A fireplace increases the value of a home by $1,649.20.
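These dollar figures can be reproduced from the stored coefficients. A small sketch (b is a helper name introduced here; prices are in $1,000s and S is in hundreds of square feet, hence the scaling):

b <- coef(fitF)
(b["S"] * 1000) / 100                      # per extra square foot, away from the university (~$76.12)
((b["S"] + b["I(S * Ut)"]) * 1000) / 100   # per extra square foot, near the university (~$89.12)
b["Ut"] * 1000                             # university-location premium at S = 0 (~$27,453)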
# Obtain the residuals from the model
residuals <- resid(fitF)
#print(residuals)
# Calculate the sum of squares of residuals
SS_residuals <- sum(residuals^2)
print("Sum of squares of residuals")
## [1] "Sum of squares of residuals"
print(SS_residuals)
## [1] 230184.4
# Calculate the total sum of squares
y_mean <- mean(Data$P)
print("Mean of the Data")
## [1] "Mean of the Data"
print(y_mean)
## [1] 247.6557
SS_total <- sum((Data$P - y_mean)^2)
print("Total sum of squares")
## [1] "Total sum of squares"
print(SS_total)
## [1] 1778446
# Calculate the sum of squares due to regression
SS_regression <- SS_total - SS_residuals
print("Sum of squares due to regression")
## [1] "Sum of squares due to regression"
print(SS_regression)
## [1] 1548262
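As a sanity check, the ratio of the regression sum of squares to the total sum of squares should reproduce the multiple R-squared of 0.8706 reported by summary(fitF):

SS_regression / SS_total   # multiple R-squared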
# Calculate the degrees of freedom for the regression
df_regression <- length(fitF$coefficients) - 1
print("Degrees of freedom for the regression")
## [1] "Degrees of freedom for the regression"
print(df_regression)
## [1] 6
# Calculate the degrees of freedom for the residuals
df_residuals <- length(residuals) - df_regression - 1
print("Degrees of freedom for the residuals")
## [1] "Degrees of freedom for the residuals"
print(df_residuals)
## [1] 993
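R also stores the residual degrees of freedom in the fitted object, which should agree with the value just computed:

df.residual(fitF)   # residual degrees of freedom of the lm fit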
# Calculate the mean square due to regression
MS_regression <- SS_regression / df_regression
print("Mean square due to regression")
## [1] "Mean square due to regression"
print(MS_regression)
## [1] 258043.6
# Calculate the mean square of residuals
MS_residuals <- SS_residuals / df_residuals
print("mean square of residuals")
## [1] "mean square of residuals"
print(MS_residuals)
## [1] 231.8071
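Taking the square root of this mean square should recover the residual standard error of 15.23 reported by summary(fitF):

sqrt(MS_residuals)   # residual standard error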
# Calculate the F-value/ F-Statistic
F_value <- MS_regression / MS_residuals
print("F-value/ F-Statistic")
## [1] "F-value/ F-Statistic"
print(F_value)
## [1] 1113.183
# Calculate the p-value (note: 1 - pf() underflows to 0 for extremely small p-values)
p_value <- 1 - pf(F_value, df_regression, df_residuals)
print("P-value")
## [1] "P-value"
print(p_value)
## [1] 0
After carrying out the calculation manually, the printed p-value (0) appears to differ from the "< 2.2e-16" reported by summary(fitF). The discrepancy is purely a matter of floating-point display: the true p-value is so small that 1 - pf(...) underflows to exactly 0 in double precision, while summary() caps its printed p-values at the machine epsilon (about 2.2e-16). Both results say the same thing: the overall model is highly significant.
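If an actual magnitude is wanted, the underflow can be sidestepped by asking pf() for the upper tail on the log scale; a minimal sketch:

# log of the p-value; finite even when the p-value itself underflows to zero
pf(F_value, df_regression, df_residuals, lower.tail = FALSE, log.p = TRUE)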