library(readxl)
mydata<- read_xlsx("./Apartments.xlsx")
head(mydata)
## # A tibble: 6 Γ 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata$Parking <- factor(mydata$Parking,
levels = c(0, 1), labels = c("No", "Yes"))
mydata$Balcony <- factor(mydata$Balcony,
levels = c(0, 1), labels = c("No", "Yes"))
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Assumptions: Variable is numeric, standardised residuals of the population is normally distributed.
The t-test checks if the arithmetic mean of the sample is equal with the given price.
The p value (<0.05) indicates the mean of the sample is not equal to 1900. H0 hypothesis is rejected, the mean of the sample equals $2018.94. The price significantly differs from 1900, the t-test shows that the price deviation cannot be described by random error.
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[c("Price", "Age")]))
## Price Age
## Price 1.00 -0.23
## Age -0.23 1.00
##
## n= 85
##
##
## P
## Price Age
## Price 0.034
## Age 0.034
Price= 2185.455 + Age* -8.98
Estimate of regression coefficient: Given all other variables, the results of the fit1 model show that the price of an apartment per m2 decreases by EUR 8.98 on average with each additional year (p < 0.05).
Coefficient of correlation: Age and Price are negatively correlated (-0.23) with a weak effect
Coefficient of determination: % proportion of price that can be explained by the linear effect of explanatory variables (age) in the regression model. Age only explains 5 % of the price (R^2 = 0.05302)
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[c("Price", "Age", "Distance")],
smooth = FALSE)
No variables are highly correlated there are probably no problem with multicolinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Price= 2460.101 + Age* -7.934+ Distance*-20.667
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
This function checks the degree of multicollinearity. All variables with a VIF statistic above 5 should be removed. Ideally the average VIF statistic should be close to 1 as possible. In our example, all VIF statistics are below 5. The average VIF is approximately at 1.00.
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The histogram shows some gaps between 0.15 and 0.30
head(mydata[order(-mydata$CooksD), c("Price", "CooksD")], 10)
## # A tibble: 10 Γ 2
## Price CooksD
## <dbl> <dbl>
## 1 2180 0.32
## 2 1740 0.104
## 3 2790 0.069
## 4 1760 0.066
## 5 2540 0.061
## 6 2400 0.038
## 7 2820 0.037
## 8 2300 0.034
## 9 2810 0.032
## 10 2800 0.03
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!CooksD %in% c(0.320, 0.104))
Removing the outlier effect sizes 0.320 & 0.104. (I have chosen the effect, I canβt select price since there might be multiple apartments with the same price. A better solution would probably be if I add an ID for each apartment)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
Since p is above 0.05. H0 can not be rejected. Therefore we assume homoskedasticity for the standardised residuals.
# Reassign the function, since we removed two observations with high impact
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95952, p-value = 0.01044
The histogram shows that the the s. residuals could be normally distributed. Shapiro-Wilk normality test, however suggests that they are not normally distributed (pvalue<0.05). Thus, the null hypothesis that the standardized residuals are normally distributed is rejected.
mydata<- read_xlsx("./Apartments.xlsx")
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Price= 2460.101+ -7.93* Age- 20.67
Given the Distance from the city center in km, the results of the fit2 model show that the price of an apartment per m2 decreases by EUR 7.93 Euro on average with each additional year of an apartment (p < 0.05)
Given the Age of the apartment in years, the results of the fit2 model show that the price of an apartment per m2 decreases by EUR 20.67 Euro on average with each km the apartment is from the city center (p < 0.05)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova suggests that fit3 fits data better than model fit2, with an (p-value=0.01)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking 196.168 62.868 3.120 0.00251 **
## Balcony 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
The fit3 model show that the price of an apartment per m2 is on average EUR 196.17 Euro more expensive if the apartment has parking in comparison to apartments without parking (p < 0.05).
The fit3 model show that there is on average no significant differences in the price of an apartment per m2 between apartment with and without balcony (p=0.97).
Possible explanation for these results could be that the value of parking is higher than balcony. People often rent a parking slot if the apartment doesnβt include one, while balcony is a nice-to-have.
The f-statistics tests the significance of the regression model. In other words, if one or more of the coefficient are significant, the regression model is significant.
H0 = R^2=0 π½_Age = π½_Distance = π½_Parking = π½_Balcony = 0
HA= R^2 > 0 if at least one π½ does not equal zero
R^2β¦.R squared, gives % impact of the variables on the price π½β¦..Coefficient in regression model
mydata$FittedValues <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)
head(mydata[ , colnames(mydata) %in% c("Price", "FittedValues",
"Residuals")])
## # A tibble: 6 Γ 3
## Price FittedValues Residuals
## <dbl> <dbl> <dbl>
## 1 1640 1751. -111.
## 2 2800 2357. 443.
## 3 1660 1749. -88.8
## 4 1850 1590. 260.
## 5 1640 2053. -413.
## 6 1770 1897. -127.
The results show that apartment ID2, costs 2800 Euro. Based on our model this apartment would be estimated 2357.41 Euro. Therefore the residuals of ID2 equals 442.59 Euro.