Data <- read.table("./Data/housing.txt", header = TRUE)
head(Data)
##         P     S A Ut Pol Fp
## 1 205.452 23.46 6  0   0  1
## 2 185.328 20.03 5  0   0  1
## 3 248.422 27.77 6  0   0  0
## 4 154.690 20.17 1  0   0  0
## 5 221.801 26.45 0  0   0  1
## 6 199.119 21.56 6  0   0  1
hist(Data$P)

The histogram is roughly bell shaped, which suggests that housing price is approximately normally distributed.

pairs(Data)

cor(Data)
##               P            S           A           Ut          Pol           Fp
## P    1.00000000  0.594678480 -0.07985190  0.728744476  0.051889626  0.064752140
## S    0.59467848  1.000000000 -0.02718335  0.023370608 -0.004175866  0.098473362
## A   -0.07985190 -0.027183352  1.00000000 -0.031958507  0.027663447  0.033123525
## Ut   0.72874448  0.023370608 -0.03195851  1.000000000  0.020482871 -0.007378108
## Pol  0.05188963 -0.004175866  0.02766345  0.020482871  1.000000000 -0.043068451
## Fp   0.06475214  0.098473362  0.03312353 -0.007378108 -0.043068451  1.000000000

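From the pairs plot and the correlation matrix, housing price is linearly correlated with living area (r = 0.59) and with location near the university (r = 0.73). Its correlation with the age of the house is weak and negative (r = -0.08).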
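
As a rough check on which of these correlations differ from zero, the sketch below converts each correlation with P into a t statistic and a two-sided p-value. The helper name `cor_significance` is hypothetical, and n = 1000 is an assumption inferred from the residual degrees of freedom reported in the regression output (992 + 8 coefficients).

```r
# t statistic and two-sided p-value for H0: rho = 0, given a sample
# correlation r computed from n observations.
cor_significance <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  p_val  <- 2 * pt(-abs(t_stat), df = n - 2)
  c(t = t_stat, p = p_val)
}

# Correlations with P copied from the first row of cor(Data).
# n = 1000 is assumed, not read from the data file.
r_with_price <- c(S = 0.59467848, A = -0.07985190, Ut = 0.728744476,
                  Pol = 0.051889626, Fp = 0.064752140)

# One row per variable, columns t and p.
sig_table <- t(sapply(r_with_price, cor_significance, n = 1000))
print(round(sig_table, 4))
```

With the data loaded, `cor.test(Data$P, Data$S)` performs the same test for a single pair directly.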

fitF <- lm(P~S+Ut+I(S*Ut)+A+Pol+Fp+I(Pol*Fp), data = Data)
summary(fitF)
## 
## Call:
## lm(formula = P ~ S + Ut + I(S * Ut) + A + Pol + Fp + I(Pol * 
##     Fp), data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.410 -10.193   0.112  10.577  44.933 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.24138    6.20934   3.904 0.000101 ***
## S            7.61632    0.24536  31.042  < 2e-16 ***
## Ut          27.46998    8.42541   3.260 0.001151 ** 
## I(S * Ut)    1.29851    0.33216   3.909 9.89e-05 ***
## A           -0.18941    0.05123  -3.697 0.000230 ***
## Pol          5.06230    1.67022   3.031 0.002501 ** 
## Fp           1.93414    1.08628   1.781 0.075298 .  
## I(Pol * Fp) -1.40931    2.39584  -0.588 0.556511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.23 on 992 degrees of freedom
## Multiple R-squared:  0.8706, Adjusted R-squared:  0.8697 
## F-statistic: 953.6 on 7 and 992 DF,  p-value: < 2.2e-16

The large value of the F statistic (953.6), together with its small p-value (< 2.2e-16), implies that we reject the null hypothesis H0: β1 = δ2 = γ1 = β3 = δ4 = δ5 = γ2 = 0 at the 5% significance level. In other words, the regression model provides a better fit than a model that contains no independent variables.
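
The overall F statistic can also be recovered from R-squared alone, which is a useful sanity check on the printed summary. The sketch below is a minimal illustration; the function name `overall_f` is hypothetical, and the inputs (R-squared = 0.8706, 7 slope coefficients, n = 1000) are read off the output above. Because R-squared is rounded to four digits, the result should land close to, but not exactly on, the reported 953.6.

```r
# Overall F test of H0: all slope coefficients are zero, computed from
# R-squared, the number of slopes k and the sample size n.
overall_f <- function(r2, k, n) {
  df2    <- n - k - 1
  f_stat <- (r2 / k) / ((1 - r2) / df2)
  p_val  <- pf(f_stat, df1 = k, df2 = df2, lower.tail = FALSE)
  c(F = f_stat, df1 = k, df2 = df2, p = p_val)
}

# Values taken from summary(fitF); n = 1000 follows from 992 residual
# degrees of freedom plus 8 estimated coefficients.
print(overall_f(r2 = 0.8706, k = 7, n = 1000))
```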

The goodness-of-fit statistic is \(R^2_{adj} = 0.8697\). It indicates that the model fits the data well: about 87% of the variation in price can be explained by the variables in the model (living area, location near the university, age, pool, and fireplace).

plot(fitF)

The residual plots indicate no obvious violations of the model assumptions.
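
The visual impression can be backed up with a few numeric screens. The sketch below is one possible set, not a substitute for the plots: the helper `residual_checks` is hypothetical, the Shapiro-Wilk test is used as a normality screen, and the correlation between absolute residuals and fitted values is used as a crude check for non-constant variance.

```r
# Rough numeric companions to plot(fit). Small p-values would point to
# non-normal residuals (shapiro_p) or to residual spread that changes with
# the fitted value (abs_resid_trend_p).
residual_checks <- function(fit) {
  e    <- resid(fit)
  yhat <- fitted(fit)
  list(
    mean_residual     = mean(e),
    shapiro_p         = shapiro.test(e)$p.value,
    abs_resid_trend_p = cor.test(abs(e), yhat)$p.value
  )
}

# Runs only if the full model from this report is in the workspace.
if (exists("fitF")) print(residual_checks(fitF))
```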

# Obtain the residuals from the model
residuals <- resid(fitF)
#print(residuals)
plot(residuals)

# Calculate the sum of squares of residuals
SS_residuals <- sum(residuals^2)
print("Sum of squares of residuals: ")
## [1] "Sum of squares of residuals: "
print(SS_residuals)
## [1] 230104.2

Reduced model

fitF2 <- lm(P~S+A, data = Data)
summary(fitF2)
## 
## Call:
## lm(formula = P ~ S + A, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.017 -30.271   3.693  30.292  72.867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.2309     9.4052    3.64 0.000287 ***
## S             8.5723     0.3671   23.35  < 2e-16 ***
## A            -0.2853     0.1136   -2.51 0.012228 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.85 on 997 degrees of freedom
## Multiple R-squared:  0.3577, Adjusted R-squared:  0.3564 
## F-statistic: 277.6 on 2 and 997 DF,  p-value: < 2.2e-16
# Obtain the residuals from the model
residuals2 <- resid(fitF2)
#print(residuals)
plot(residuals2)

# Calculate the sum of squares of residuals
SS_residuals2 <- sum(residuals2^2)
print("Sum of squares of reduced model residuals: ")
## [1] "Sum of squares of reduced model residuals: "
print(SS_residuals2)
## [1] 1142293

We reject the null hypothesis H0 that the coefficients on the indicator variables are all zero at the 5% level of significance, and conclude that the indicator variables are jointly significant.
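
The joint test compares the residual sums of squares of a restricted and an unrestricted model. The sketch below shows one way to compute it from two nested `lm` fits; the helper name `partial_f` is hypothetical, and the choice of `fitF` as the unrestricted model in the usage lines is an assumption made for illustration.

```r
# Partial F test for nested linear models:
#   F = ((SSE_R - SSE_U) / J) / (SSE_U / df_U)
# where J is the number of restrictions and df_U the residual degrees of
# freedom of the unrestricted model.
partial_f <- function(restricted, unrestricted) {
  sse_r  <- deviance(restricted)    # residual sum of squares for lm fits
  sse_u  <- deviance(unrestricted)
  df_u   <- df.residual(unrestricted)
  j      <- df.residual(restricted) - df_u
  f_stat <- ((sse_r - sse_u) / j) / (sse_u / df_u)
  c(F = f_stat, J = j, df2 = df_u,
    crit_5pct = qf(0.95, j, df_u),
    p = pf(f_stat, j, df_u, lower.tail = FALSE))
}

# Runs only if the models from this report are in the workspace.
if (exists("fitF2") && exists("fitF")) {
  print(partial_f(fitF2, fitF))
  print(anova(fitF2, fitF))   # built-in equivalent
}
```

`anova()` on two nested fits reports the same F statistic, so the helper is mainly useful for seeing each ingredient of the formula.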

Final model

fitF3 <- lm(P~S+Ut+I(S*Ut)+A+Pol+Fp, data = Data)
summary(fitF3)
## 
## Call:
## lm(formula = P ~ S + Ut + I(S * Ut) + A + Pol + Fp, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.289 -10.141   0.148  10.565  44.783 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.5000     6.1917   3.957 8.13e-05 ***
## S             7.6122     0.2452  31.048  < 2e-16 ***
## Ut           27.4530     8.4226   3.259 0.001154 ** 
## I(S * Ut)     1.2994     0.3321   3.913 9.72e-05 ***
## A            -0.1901     0.0512  -3.712 0.000217 ***
## Pol           4.3772     1.1967   3.658 0.000268 ***
## Fp            1.6492     0.9720   1.697 0.090056 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.23 on 993 degrees of freedom
## Multiple R-squared:  0.8706, Adjusted R-squared:  0.8698 
## F-statistic:  1113 on 6 and 993 DF,  p-value: < 2.2e-16

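The goodness-of-fit statistic is \(R^2_{adj} = 0.8698\), indicating that the model fits the data well.

To test whether the indicator variables are jointly significant, compare the reduced model (SSE = 1142293) with the final model (SSE = 230184.2). The final model has 7 coefficients and the reduced model has 3, so there are J = 4 restrictions:

\[ F = \frac{(1142293 - 230184.2)/4}{230184.2/(1000 - 7)} \approx 983.7 \]

At the 5% level of significance the critical value is \(F_{0.05;4,993} \approx 2.38\), so the indicator variables are jointly significant at the 5% level. Individually, every differential effect is significant at the 5% level except the fireplace indicator (p = 0.090), which is significant only at the 10% level.

The fitted regression model is

\[ \hat{y} = 24.50 + 7.612\,x_1 + 27.453\,D_2 + 1.299\,(D_2 x_1) - 0.190\,x_3 + 4.377\,D_4 + 1.649\,D_5 \]

where \(x_1\) = S, \(D_2\) = Ut, \(x_3\) = A, \(D_4\) = Pol and \(D_5\) = Fp. Based on the regression results, we estimate that:

- A location near the university increases the house price by $27,453 (27.453 × 1000).
- The change in expected price per additional square foot is $89.12 ($76.12 + $12.99, up to rounding) for houses near the university and $76.12 for houses in other areas, holding the other variables constant. Equivalently, each additional 100 square feet adds $7,612 for houses away from the university.
- Houses depreciate by $190.10 per year.
- A pool increases the value of a home by $4,377.20.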
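
The per-square-foot figures follow from the coefficient on S and the interaction term. The sketch below reproduces that arithmetic; it assumes, as the interpretation above does, that P is measured in thousands of dollars and S in hundreds of square feet. The names `b` and `per_sqft` are hypothetical, and the coefficients are copied from `summary(fitF3)` rather than read from the fitted object.

```r
# Coefficients copied from summary(fitF3).
b <- c(S = 7.6122, S_Ut = 1.2994)

# Assumed units: P in $1000s, S in hundreds of square feet, so a slope of
# 1 corresponds to $1000 per 100 sq ft, i.e. $10 per square foot.
per_sqft <- function(slope) slope * 1000 / 100

slope_other <- per_sqft(b[["S"]])                 # houses away from the university
slope_univ  <- per_sqft(b[["S"]] + b[["S_Ut"]])   # houses near the university

print(round(c(other = slope_other, university = slope_univ), 2))
# other = 76.12, university = 89.12
```

With the model in the workspace, `coef(fitF3)[["S"]]` and `coef(fitF3)[["I(S * Ut)"]]` give the unrounded inputs.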

fitF4 <- lm(P~S+Ut+I(S*Ut)+A+Pol, data = Data)
summary(fitF4)
## 
## Call:
## lm(formula = P ~ S + Ut + I(S * Ut) + A + Pol, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.352 -10.241   0.269  10.397  45.400 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.13775    6.19389   3.897 0.000104 ***
## S            7.66031    0.24376  31.426  < 2e-16 ***
## Ut          28.37492    8.41298   3.373 0.000773 ***
## I(S * Ut)    1.26232    0.33164   3.806 0.000150 ***
## A           -0.18697    0.05122  -3.650 0.000276 ***
## Pol          4.28786    1.19666   3.583 0.000356 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.24 on 994 degrees of freedom
## Multiple R-squared:  0.8702, Adjusted R-squared:  0.8695 
## F-statistic:  1333 on 5 and 994 DF,  p-value: < 2.2e-16
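
Dropping the fireplace indicator barely changes the fit: adjusted R-squared moves from 0.8698 to 0.8695, and Fp was not significant at the 5% level in the previous model (p = 0.090). The sketch below puts such comparisons side by side; the helper `compare_fits` is hypothetical, and using AIC as an additional criterion is a choice made here, not something the analysis above relies on.

```r
# Side-by-side summary of several lm fits: number of coefficients,
# adjusted R-squared and AIC (lower AIC is preferred).
compare_fits <- function(...) {
  fits <- list(...)
  data.frame(
    model  = names(fits),
    k      = sapply(fits, function(f) length(coef(f))),
    adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
    aic    = sapply(fits, AIC),
    row.names = NULL
  )
}

# Runs only if the models from this report are in the workspace.
if (exists("fitF3") && exists("fitF4")) {
  print(compare_fits(with_Fp = fitF3, without_Fp = fitF4))
}
```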