library(readxl)
data <- read_xlsx("./Apartments.xlsx")
head(data)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
data$ParkingF <- factor(data$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
data$BalconyF <- factor(data$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(data)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingF BalconyF
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
summary(data$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1400 1710 1950 2019 2290 2820
first of all, I suggest to make basic descriptive statistics of the variable that we will study. So the price is a numerical variable. as we can see the minimum and maximum values are 1400 and 2820 respectively. the mean in the study of which we are interested is 2019. the median is 1950, that is, half or less of the apartments have a price of 1950 EUR per square meter
to investigate this question we have to conduct a t-test. he has two assumptions. the first - the variable must be numerical. this is true in our case the second is normality: the price must be normally distributed. Let’s check the second
library(ggplot2)
ggplot(data, aes(x = Price)) +
geom_histogram(binwidth = 150, colour = "black") +
ylab("Frequency") +
xlab("Price")
based on the graph we can suggest that price is not normally distributed as histogram is a bit right-skewed and maybe bimodal because we can see the first mode somewhere in 1700 in the second one somewhere in 2250
.
to be more sure let’s run Shapiro-Wilk test to check normality
shapiro.test(data$Price)
##
## Shapiro-Wilk normality test
##
## data: data$Price
## W = 0.94017, p-value = 0.0006513
Hypothesis for Shapiro-Wilk test:
H0: price is normally distributed
H1: price is not normally distributed
We reject H0 at p<0.001
Hence, Price variable is not normally distributed
.
This means we have to do non-parametric test which is wilcoxon signed rang test and here we will check medians but not means anymore
median(data$Price)
## [1] 1950
the median is 1950, that is, half or less of the apartments have a price of 1950 dollars per square meter
wilcox.test(data$Price,
mu = 1900,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: data$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900
Set hypotesis for the wilcoxon signed rang test:
H0: Mu_Price = 1900
H1: Mu_Price =/ 1900
We reject H0 at p=0.03 Hence, the median price per sq meter is NOT equal to 1900 eur.
.
fit1 <- lm(Price ~ Age, ## y = Price, x = Age
data = data)
summary(fit1) ## display summary
##
## Call:
## lm(formula = Price ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor_coef <- sqrt(summary(fit1)$r.squared) ##calculate coefficient of correlation, which is root square of coefficient of determination
cor_coef ## display coefficient of correlation
## [1] 0.230255
the model will be the following
Price = 2185.455 -8.975*Age
Coefficient of determination: Adjusted R-squared: 0.04161 the coefficient of determination means that with the help of this model we explained 4.16% of the reasons for the change in the price of apartments
coefficient of correlation = 0.23 this tells us that the correlation is negative and according to Pearson it is weak
library(car)
## Loading required package: carData
scatterplotMatrix(data [ , c(1,2,3)], ## this function to construct a matrix of scatterplots. we choose columns number 1 2 and 3 as a vector
smooth = FALSE)
we can analyze the slope of the regression line on scatterplots
as we can see from the scatterplot between age and distance, the line is almost perfectly horizontal, meaning there is no correlation. that is, for any change in the distance, the age remains unchanged on average. and vice versa, for any change in age, the distance remains unchanged on average
and if we talk about age and price, then here we can trace a negative weak correlation which can be seen from the way the points are located around the regression line
if we analyze the distance and the price, the correlation is negative and I would say of medium strength, which means that as the distance increases, the price will decrease and this effect has a moderate strength, as can be seen from the way the points are located around the regression line
.
fit2 <- lm(Price ~ Age + Distance,
data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Model will be the following
Price = 2460.101 -7.934 * Age -20.667 * Distance
Coefficient interpretation:
.
vif(fit2)
## Age Distance
## 1.001845 1.001845
he VIF (Variance Inflation Factor) statistic is used to check the strengths of multicollinearity
if the vif value is greater than five, it means that there is a very strong multicollinearity between the independent variables and one of them should be excluded
also the average vif should be close to one
as we can see in our case the vif statistic is very close to 1 for both independent variables
Therefore, multicollinearity is absent
.
data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
hist(data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
we can detect outliers by calculating standardized residuals
values of standardized residuals should be in the ragne -/+ 3. if there are any units with the value less than -3 or more than +3 this units should be considered as outliers and be removed
in our case as we can see the histogram is a bit right-skewed however it looks moral more or less normal
And, what is important in our case, there are no units outside with the -/+ 3 range
Hence, no outliers
.
data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
we can calculate Cooks distances in order to detect units with high impact
when we calculate Cooks distances and build the histogram, we can check if there are any gaps in the histogram. if there are some gaps then there are units with a high impact.
as we can see in our case there is such a gap so we have to remove this unit
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create a new column named 'ID' containing the serial numbers
data <- data %>%
mutate(ID = 1:n())
head(data)
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 7 28 1640 0 1 No Yes -0.665 0.007 1
## 2 18 1 2800 1 0 Yes No 1.78 0.03 2
## 3 7 28 1660 0 0 No No -0.594 0.006 3
## 4 28 29 1850 0 1 No Yes 0.754 0.008 4
## 5 18 18 1640 1 1 Yes Yes -1.07 0.005 5
## 6 28 12 1770 0 1 No Yes -0.778 0.005 6
here I created a new column called ID, it will contain the serial number of each unit and will help us identify and remove high impact units
head(data[order(data$StdResid),], 3) #Three units with lowest value of stand. residuals
## # A tibble: 3 × 10
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 7 2 1760 0 1 No Yes -2.15 0.066 53
## 2 12 14 1650 0 1 No Yes -1.50 0.013 13
## 3 12 14 1650 0 0 No No -1.50 0.013 72
head(data[order(-data$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32 38
## 2 43 37 1740 0 0 No No 1.44 0.104 55
## 3 2 11 2790 1 0 Yes No 2.05 0.069 33
## 4 7 2 1760 0 1 No Yes -2.15 0.066 53
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061 22
## 6 40 2 2400 0 1 No Yes 1.09 0.038 39
here we can see that the unit with the lowesr std resid has the serial number 53, std resid = 2.152, which is still within a -/+3 range
the unit with the hisgest Cooks distances is number 38, it must be removed
library(dplyr)
data <- data %>%
filter(!ID == "38") #Removing unit number 38 from the dataset
fit2 <- lm(Price ~ Age + Distance,
data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
fiting new model, without that unit with high impact
however, now we have to do analysis once again.
So,
data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
hist(data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
so, standardized residuals are still within a range of -/+ 3
data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
however in the Cook’s distances histogram there is still a unit with the gap, this is, the unit with high impact, that has to be removed in the same way and with the same logic
head(data[order(-data$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 43 37 1740 0 0 No No 1.60 0.129 55
## 2 2 11 2790 1 0 Yes No 2.22 0.084 33
## 3 7 2 1760 0 1 No Yes -2.24 0.072 53
## 4 37 3 2540 1 1 Yes Yes 1.47 0.056 22
## 5 8 26 2300 1 1 Yes Yes 1.82 0.052 25
## 6 8 2 2820 1 0 Yes No 1.70 0.039 58
The unit to be removed is a unit with the ID number 55
library(dplyr)
data <- data %>%
filter(!ID == "55") #Removing unit number 55 from the dataset
again fitting the very new modal
fit2 <- lm(Price ~ Age + Distance,
data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
again checking residuals
data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
hist(data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
they are still ok, within a -/+3 range
data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
as we can see here again there is a gap so again we have to repeat the whole same procedure with removing the high impact units, again building the regression model and again checking when the histogram no longer has gaps
with your permission I won’t describe it all again and again. here is the general logic. I will simply remove problematic units until all are gone
head(data[order(-data$CooksD),], 6)
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <int>
## 1 2 11 2790 1 0 Yes No 2.17 0.084 33
## 2 7 2 1760 0 1 No Yes -2.35 0.084 53
## 3 37 3 2540 1 1 Yes Yes 1.57 0.065 22
## 4 8 26 2300 1 1 Yes Yes 1.85 0.054 25
## 5 45 21 1910 0 1 No Yes 1.07 0.05 31
## 6 40 2 2400 0 1 No Yes 1.04 0.038 39
data <- data %>%
filter(!ID == "33")
data <- data %>%
filter(!ID == "53")
here I will delete two units at once because their Cooks distances are absolutely identical, so they both cause problems
fit2 <- lm(Price ~ Age + Distance,
data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -408.72 -217.93 -38.86 220.60 506.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2492.715 75.544 32.997 < 2e-16 ***
## Age -7.496 3.169 -2.365 0.0205 *
## Distance -24.574 2.700 -9.100 6.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.5204
## F-statistic: 44.4 on 2 and 78 DF, p-value: 1.335e-13
data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
hist(data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
finally the histogram of the cook distances does not contain any gaps hence there are no units with high impact anymore
.
data$StdFitted <- scale(fit2$fitted.values) ## create new column with fitted values
library(car)
scatterplot(y = data$StdResid, x = data$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
for me, it looks as if there is heteroscedasticity here, because in the left part of the plot, the unit dispersion is less than in the right part. the scatterplot looks as if it opens to the right, like a fan, which indicates heteroscedasticity. If there was homoscedasticity, then the data were would be distributed evenly and there would be no significant difference in dispersion on the right and left sides
data$StdFitted <- scale(fit2$fitted.values) ## create new column with fitted values
library(car)
scatterplot(y = data$StdResid, x = data$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
for me, it looks as if there is heteroscedasticity here, because in the left part of the plot, the unit dispersion is less than in the right part. the scatterplot looks as if it opens to the right, like a fan, which indicates heteroscedasticity. If there was homoscedasticity, then the data were would be distributed evenly and there would be no significant difference in dispersion on the right and left sides
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.64762
## Prob > Chi2 = 0.1992832
now the formal test to check homoskedasticity for the regression model
lets set the hypothesis H0: the variance is constant measn there is a homoskedasticity H1: the variance is not constant means no homoskedasticity >> heteroskedasticity
we cannot reject H0 (p = 0.2, means p > 0.05)
Although the scatterplot analysis is more reminiscent of heteroskedasticity, the formal test did not confirm this. In addition, our sample is quite large (80). It’s not like it’s very big. but still we can assume that there is homoscedasticity here
.
fit2 <- lm(Price ~ Age + Distance,
data = data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -408.72 -217.93 -38.86 220.60 506.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2492.715 75.544 32.997 < 2e-16 ***
## Age -7.496 3.169 -2.365 0.0205 *
## Distance -24.574 2.700 -9.100 6.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.5204
## F-statistic: 44.4 on 2 and 78 DF, p-value: 1.335e-13
Model will be the following
Price = 2492.715 -7.496 * Age -24.574 * Distance
Coefficient interpretation:
Coefficient of determination = 0.5204, which means that the model 2 explains 52.04% of the variation in prices
.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = data)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403.16 -204.08 -41.24 239.52 528.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2372.693 93.344 25.419 < 2e-16 ***
## Age -6.906 3.118 -2.215 0.0298 *
## Distance -22.254 2.839 -7.837 2.25e-11 ***
## ParkingFYes 137.479 60.854 2.259 0.0267 *
## BalconyFYes 17.234 57.099 0.302 0.7636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared: 0.5621, Adjusted R-squared: 0.539
## F-statistic: 24.38 on 4 and 76 DF, p-value: 5.29e-13
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 5248925
## 2 76 4915944 2 332981 2.5739 0.08287 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lest set hypothesis
H0: delta ro square = 0 >> means both models are equally good
H1: delta ro square > 0 >> means the second model is better
we connot reject H0 (p>0.05), means with the help of the third model we were able to explain no more than with the help of the second
these are the results of the formal anova test to compare how good the models are
however, if we look at the R^2 of both models, we will see that for the third model it slightly increased. for the 2nd model it was 0.5204. for the 3rd one it became 0.539. this is only 1%, but still this identifies that with the help of the third model we explained a little more variability in the price
but again, the formal anova test showed that there is no difference in the quality of the models
.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403.16 -204.08 -41.24 239.52 528.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2372.693 93.344 25.419 < 2e-16 ***
## Age -6.906 3.118 -2.215 0.0298 *
## Distance -22.254 2.839 -7.837 2.25e-11 ***
## ParkingFYes 137.479 60.854 2.259 0.0267 *
## BalconyFYes 17.234 57.099 0.302 0.7636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared: 0.5621, Adjusted R-squared: 0.539
## F-statistic: 24.38 on 4 and 76 DF, p-value: 5.29e-13
Model will be the following
Price = 2372.693 -6.906 * Age -22.254 * Distance + 137.479 * ParkingFYes + 17.234 * BalconyFYes
Coefficient interpretation:
. F-statictics is a test of significance of the model as a whole
Hypothesis to be tested:
H0: ro^2 = 0 H1: ro^2 > 0
“ro^2” -is the coeffition of determination but for population
we reject H0 at p<0.001, which means that we were able to explaine something with out model
data$Fitted <- round(fit3$fitted.values, 2) ##to save fitted values
data$Residual <- round(data$Price - data$Fitted, 2) ## to calculate residuals
head(data[ 2 , c("ID", "Price", "Fitted", "Residual")]) ## to call for the second raw (apartment with ID 2), and vector of necessary columns "ID", "Price", "Fitted", "Residual"
## # A tibble: 1 × 4
## ID Price Fitted Residual
## <int> <dbl> <dbl> <dbl>
## 1 2 2800 2364. 436.