library(readxl)
library(ggplot2)
library(car)
## Loading required package: carData
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- read_excel("Apartments.xlsx")
head(mydata)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: the binary variables Parking and Balcony are recoded as labelled factors:
mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No Parking", "Parking"))
mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0, 1),
                          labels = c("No Balcony", "Balcony"))
To test whether the arithmetic mean of the apartment price differs from 1900, a one-sample t-test will be used. Its main assumption is that the variable is normally distributed. To test normality, the Shapiro-Wilk test is used:
# Testing normality of price variable
shapiro.test(mydata$Price)
##
## Shapiro-Wilk normality test
##
## data: mydata$Price
## W = 0.94017, p-value = 0.0006513
From the result above, the null hypothesis of normality can be rejected (p < 0.05). As the normality condition is not met, the Wilcoxon signed-rank test is used instead:
wilcox.test(mydata$Price,
            mu = 1900,
            correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
From the result, the null hypothesis cannot be rejected (p = 0.028 > 0.001, the significance level used in this course). Therefore, there is no evidence that the location of the price distribution differs from 1900.
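For comparison, had the normality assumption held, the corresponding one-sample t-test would have been (a minimal sketch; its output is not part of the original analysis):
# One-sample t-test against the same hypothesized value (comparison only)
t.test(mydata$Price, mu = 1900)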
fit1 <- lm(Price ~ Age, data=mydata)
coer_coef <- sqrt(summary(fit1)$r.squared)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
coer_coef
## [1] 0.230255
Regression coefficient: if the age of an apartment increases by one year, the price per m2 decreases on average by 8.975 (p < 0.05). Note: the result is not statistically significant at the threshold used in this course, as p should be below 0.001.
Coefficient of correlation: the square root of R-squared gives 0.23; since the slope is negative, the Pearson correlation coefficient is -0.23. Therefore, there is a weak negative relationship between the age of an apartment and the price per m2.
Coefficient of determination: the coefficient of determination is 0.05302. Therefore, the proportion of the total variability of price that can be explained by the linear effect of age is 5.3%.
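As a cross-check (a sketch, not part of the original output), the correlation coefficient can also be obtained directly with cor(), and its sign recovered from the slope:
# r equals sign(slope) * sqrt(R^2); cor() gives it directly, with sign
r_manual <- sign(coef(fit1)["Age"]) * sqrt(summary(fit1)$r.squared)
r_direct <- cor(mydata$Age, mydata$Price)
c(manual = unname(r_manual), direct = r_direct)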
scatterplotMatrix(mydata[c("Price", "Age", "Distance")], smooth=FALSE )
As there is no strong relationship between the explanatory variables Age and Distance, there should be no issue with multicollinearity.
fit2 <- lm(Price ~ Age + Distance, data=mydata)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
From the results, all variables have VIF statistics below 5, and the average VIF is very close to 1. Therefore, there is no issue of strong multicollinearity.
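To illustrate what the VIF measures (a minimal sketch), the value for Age can be reproduced by regressing Age on the remaining predictor and applying VIF = 1 / (1 - R^2):
# VIF for Age by hand: R^2 from regressing Age on Distance
r2_age <- summary(lm(Age ~ Distance, data = mydata))$r.squared
1 / (1 - r2_age)  # matches vif(fit2)["Age"]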
mydata$StdResid <- round(rstandard(fit2), 3)
hist( mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
On the histogram above, since all standardized residuals lie between -3 and +3, there are no problems with outliers.
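The same check can be done numerically rather than visually (a sketch; this output is not shown in the original):
# List any observations with |standardized residual| above 3 (none expected)
mydata[abs(mydata$StdResid) > 3, ]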
mydata$CookD <- round(cooks.distance(fit2), 3)
hist( mydata$CookD,
xlab = "Cook distance",
ylab = "Frequency",
main = "Histogram of Cook distance")
From the histogram of Cook's distance, there is a clear gap between 0.15 and 0.30. Therefore, we can find the unit with high influence using the head() function and remove it:
#Adding ID variable
mydata$ID <- seq(1, nrow(mydata))
head(mydata[order(-mydata$CookD), c("ID", "CookD")], 6)
## # A tibble: 6 × 2
## ID CookD
## <int> <dbl>
## 1 38 0.32
## 2 55 0.104
## 3 33 0.069
## 4 53 0.066
## 5 22 0.061
## 6 39 0.038
mydata <- mydata %>%
filter(!ID %in% c(38, 55, 33, 53, 22))
After removing the unit with ID 38, it was clear that four more units were creating gaps in Cook's distance, so those units were removed as well. If we now plot the histogram of Cook's distance again, we can see that there are no units with high influence:
hist( mydata$CookD,
xlab = "Cook distance",
ylab = "Frequency",
main = "Histogram of Cook distance")
fit2 <- lm(Price ~ Age + Distance, data=mydata)
mydata$StdFittedValues <- scale(fit2$fitted.values)
scatterplot( y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplot = FALSE,
regLine = FALSE,
smooth = FALSE)
From the graph, there seems to be a hint of heteroskedasticity, as the distribution of standardized residuals is narrower for standardized fitted values between -3 and -2 and wider as x increases. To be confident, the Breusch-Pagan test was used:
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
From the result above, the p-value is high (0.187). Therefore, the null hypothesis of constant variance cannot be rejected (p > 0.05), and there is no evidence of strong heteroskedasticity.
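As a cross-check (a sketch, assuming the lmtest package is available; its bptest() runs the studentized variant, so the statistic may differ slightly from olsrr's):
library(lmtest)
bptest(fit2)  # studentized Breusch-Pagan test; same null of constant variance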
hist( mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the histogram, the standardized residuals appear slightly right-skewed.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.93418, p-value = 0.0004761
From the Shapiro-Wilk test, the p-value is lower than 0.05, so the null hypothesis can be rejected: the distribution of the standardized residuals is not normal (p < 0.05). However, this is not an issue in this case, as the sample size is large enough for the regression inference to be robust to non-normality.
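A complementary visual check (a sketch, not part of the original output) is a normal Q-Q plot of the standardized residuals:
# Points should lie close to the line under normality
qqnorm(mydata$StdResid)
qqline(mydata$StdResid)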
fit2 <- lm(Price ~ Age + Distance, data=mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Coefficient for age:
Assuming the distance variable remains unchanged, for each additional year of age the price per m2 decreases on average by 8.674 (p < 0.01).
Coefficient for distance:
Assuming the age variable remains unchanged, for each additional km from the city centre the price per m2 decreases on average by 24.063 (p < 0.001).
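To quantify the uncertainty of these estimates (a minimal sketch), 95% confidence intervals for the coefficients can be obtained with confint():
# 95% confidence intervals for the coefficients of fit2
confint(fit2)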
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data=mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
As the p-value is 0.113, we cannot reject the null hypothesis (p > 0.05). Therefore, model fit3 does not fit the data significantly better than fit2.
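The partial F-statistic can be reproduced by hand from the ANOVA table above (a sketch of the arithmetic):
# F = ((RSS1 - RSS2) / df_difference) / (RSS2 / df2)
F_stat <- ((5077362 - 4791128) / 2) / (4791128 / 75)
F_stat                                  # 2.2403, as reported by anova()
pf(F_stat, 2, 75, lower.tail = FALSE)   # p-value, approx. 0.113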
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFParking 128.700 60.801 2.117 0.0376 *
## BalconyFBalcony 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Explanation for parking:
Assuming all other variables remain unchanged, apartments with a parking space are on average 128.7 more expensive per m2 than apartments without one (p < 0.05).
Explanation for balcony:
The balcony coefficient is not statistically significant, as the p-value is 0.917.
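To see the parking coefficient in action (a sketch with hypothetical apartment values), predicted prices can be compared for two otherwise identical units:
# Two hypothetical apartments differing only in parking
new_apts <- data.frame(Age = 10, Distance = 5,
                       ParkingF = factor(c("Parking", "No Parking"),
                                         levels = c("No Parking", "Parking")),
                       BalconyF = factor("Balcony",
                                         levels = c("No Balcony", "Balcony")))
predict(fit3, newdata = new_apts)
# The difference between the two predictions equals the ParkingF
# coefficient, 128.7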
The hypotheses for the F-statistic of the model are H0: all regression coefficients are zero, and H1: at least one coefficient differs from zero. Since the p-value is 7.76e-13 < 0.001, H0 is rejected and the model as a whole is statistically significant.
# Residual for the apartment with ID 2: observed price minus fitted price
mydata$fittedValues3 <- fit3$fitted.values
ID2 <- mydata[mydata$ID == 2, ]
fitted_value <- ID2$fittedValues3
observed_value <- ID2$Price
residual <- observed_value - fitted_value
residual
## 2
## 443.4026
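The same residual can be read directly from the model object as a cross-check (a sketch):
# residuals(fit3) is aligned with the rows of mydata used in the fit
residuals(fit3)[mydata$ID == 2]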