library(readxl)
Apartments <- read_excel("Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Parking and Balcony are 0/1 indicator variables; below they are recoded as labelled factors:
Apartments$ParkingF <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$BalconyF <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(Apartments)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingF BalconyF
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
library(pastecs)
round(stat.desc(Apartments[c("Age", "Distance", "Price", "Parking", "Balcony")]), 2)
## Age Distance Price Parking Balcony
## nbr.val 85.00 85.00 85.00 85.00 85.00
## nbr.null 0.00 0.00 0.00 42.00 48.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 1.00 1.00 1400.00 0.00 0.00
## max 45.00 45.00 2820.00 1.00 1.00
## range 44.00 44.00 1420.00 1.00 1.00
## sum 1577.00 1209.00 171610.00 43.00 37.00
## median 18.00 12.00 1950.00 1.00 0.00
## mean 18.55 14.22 2018.94 0.51 0.44
## SE.mean 1.05 1.23 40.98 0.05 0.05
## CI.mean.0.95 2.09 2.45 81.50 0.11 0.11
## var 93.96 129.44 142764.34 0.25 0.25
## std.dev 9.69 11.38 377.84 0.50 0.50
## coef.var 0.52 0.80 0.19 0.99 1.15
The table above displays descriptive statistics of the variables used in the analysis.
t.test(Apartments$Price, mu=1900)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothesis, since the p-value (0.0047) is below 0.05: the average price of apartments is not equal to 1900 EUR, but rather 2018.94 EUR. With 95% confidence, the average price of apartments lies between 1937.44 EUR and 2100.44 EUR.
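As a quick sketch (an illustration using the same t.test() call as above), the reported quantities can also be extracted from the returned htest object; the name price_test is ours:
price_test <- t.test(Apartments$Price, mu = 1900)
price_test$p.value   # 0.004731, below 0.05
price_test$conf.int  # 95% CI: 1937.44 to 2100.44 EUR
price_test$estimate  # sample mean: 2018.94 EUR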
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
corr_coef <- cor(Apartments$Age, Apartments$Price)
print(corr_coef)
## [1] -0.230255
The estimated regression function is: Price = 2185.455 − 8.975 × Age. It tells us that if Age is zero, the estimated average price of an apartment is 2185.455 EUR, and that each additional year of Age lowers the estimated average price by 8.975 EUR.
The multiple R-squared statistic shows that Age explains 5.302% of the variability in Price.
Because the p-value for the Age coefficient (0.034) is below 0.05, we can reject the null hypothesis that the coefficient equals zero.
The correlation between Price and Age is negative, meaning that Age negatively influences the average price: the higher the age, the lower the price. However, at −0.23 the correlation is weak.
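To verify that this correlation differs significantly from zero, a sketch with cor.test() (the standard t-based test for a correlation coefficient):
cor.test(Apartments$Age, Apartments$Price)
# in simple regression its p-value matches the Age slope's p-value (0.034)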
library(car)
## Loading required package: carData
vars <- Apartments[, c("Price", "Age", "Distance")]
scatterplotMatrix(vars,
                  smooth = FALSE)
The scatterplot matrix lets us inspect the pairwise relationships, including
possible multicollinearity between the explanatory variables. We look at the
panels in the first row: the top-middle panel shows that higher Age goes with
lower Price (the older the apartment, the lower the price), and the top-right
panel shows the effect of Distance from the city centre on Price, where a
greater distance means lower apartment prices.
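As a numeric complement to the scatterplot matrix, a sketch of the pairwise correlation matrix:
round(cor(vars), 2)  # pairwise correlations of Price, Age and Distance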
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
The p-value of the F-statistic shows that we can reject H0, as it is below 0.05: at least one of the explanatory variables (judging from the output above, certainly Distance) has a significant effect on Price. Age and Distance together explain 43.96% of the variability in apartment prices.
vif(fit2)
## Age Distance
## 1.001845 1.001845
The vif() function is used to evaluate the strength of the correlation between the explanatory variables: the higher the VIF statistic, the more strongly a variable is related to the other explanatory variables.
Because the VIF statistic is close to 1 for both explanatory variables, Age and Distance, there are no multicollinearity concerns. A problem would occur if the VIF exceeded 5; only then would we have to leave a variable out of the model.
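A sketch of what vif() computes: VIF_j = 1 / (1 − R²_j), where R²_j is the R-squared from regressing predictor j on the remaining predictors (here only one other predictor):
r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared
1 / (1 - r2_age)  # reproduces vif(fit2) for Age, about 1.0018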
Apartments$std_resid <- round(rstandard(fit2),3)
hist(Apartments$std_resid,
xlim = c(-3,3),
ylim = c(0,20),
main = "Histogram of standardized residuals",
xlab = "standardized residuals",
col = "lightgreen")
Based on the distribution of the standardized residuals, we try to infer the
distribution of the errors in the population. The histogram shows that the
residuals are slightly skewed to the right. We could also test this formally
with a Shapiro-Wilk test.
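A minimal sketch of that test (H0: the residuals come from a normal distribution; we do not report its output here):
shapiro.test(Apartments$std_resid)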
Apartments$CooksD <- round(cooks.distance(fit2), 3)  # store Cook's distance for each apartment
head(Apartments[order(-Apartments$std_resid), c("Distance", "Price", "Age", "CooksD", "std_resid")])
## # A tibble: 6 × 5
## Distance Price Age CooksD std_resid
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 45 2180 5 0.32 2.58
## 2 11 2790 2 0.069 2.05
## 3 1 2800 18 0.03 1.78
## 4 1 2800 18 0.03 1.78
## 5 2 2820 8 0.037 1.66
## 6 1 2810 10 0.032 1.60
hist(Apartments$CooksD,
     main = "Histogram of Cook's Distances",
     xlab = "Cook's Distances",
     col = "yellow")
Using Cook's distance, we can spot units with a high impact on the estimated
regression function and remove them (similarly to outliers). Cook's distance
is a non-negative value, where a higher number means a larger impact.
From the histogram and the head() output, we can see that one apartment stands out with a Cook's distance of 0.32, i.e. too high an influence. This is apartment no. 38, and it has to be removed; we can remove it with the dplyr function filter().
head(Apartments[order(-Apartments$CooksD), c("Distance", "Price", "Age", "CooksD", "std_resid")])
## # A tibble: 6 × 5
## Distance Price Age CooksD std_resid
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 45 2180 5 0.32 2.58
## 2 37 1740 43 0.104 1.44
## 3 11 2790 2 0.069 2.05
## 4 2 1760 7 0.066 -2.15
## 5 3 2540 37 0.061 1.58
## 6 2 2400 40 0.038 1.09
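Before removing anything, a common rule of thumb (our assumption here, not used elsewhere in this analysis) flags units whose Cook's distance exceeds 4/n; a sketch:
cutoff <- 4 / nrow(Apartments)  # about 0.047 for n = 85
Apartments[Apartments$CooksD > cutoff, c("Distance", "Price", "Age", "CooksD")]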
Problematic units like this one can be removed with dplyr's filter(); we perform the actual removal below, after the remaining residual diagnostics:
library(dplyr)
# Apartments <- Apartments %>% filter(CooksD != 0.320)
Apartments$stdFitted <- as.numeric(scale(fitted(fit2)))  # standardized fitted values
library(car)
scatterplot(x = Apartments$stdFitted, y = Apartments$std_resid,
            xlab = "Standardized Fitted Values",
            ylab = "Standardized Residuals",
            main = "Heteroskedasticity Check",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
The points should be randomly scattered in a horizontal band of constant
variability. Heteroskedasticity occurs if the variability changes, and it
affects the reliability of the estimated standard errors. We also test for it
with the Breusch-Pagan test below.
From the plot, we can see a random scatter within a horizontal band, so we assume homoskedasticity is not violated (the variance of the errors is constant). The Breusch-Pagan test confirms this assumption: we cannot reject H0 (the variance is constant), because its p-value (0.325) is above 0.05.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
hist(Apartments$std_resid,
     xlim = c(-3, 3),
     main = "Histogram of standardized residuals",
     xlab = "standardized residuals",
     ylab = "Density",
     prob = TRUE,  # density scale, so the normal curve below is comparable
     col = "lightgreen")
curve(dnorm(x, mean = mean(Apartments$std_resid), sd = sd(Apartments$std_resid)),
      col = "red",
      lwd = 2,
      add = TRUE)
Apartments <- Apartments[Apartments$CooksD != 0.320, ]  # remove the influential unit
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
As announced a few steps above, we now exclude the unit with Cook's distance 0.320, refit the model as fit2, and summarize the results.
From the output, we can read the new regression function: Price = 2456.076 − 6.464 × Age − 22.955 × Distance. Age and Distance both impact the price negatively: the higher the Age and the bigger the Distance to the city centre, the cheaper the apartment is on average. If Age and Distance are both 0, the estimated price is 2456.076 EUR. Holding everything else constant, each additional unit of Age lowers the price by 6.464 EUR; likewise, each additional unit of Distance lowers the price by 22.955 EUR.
The null hypothesis we were testing for each coefficient (H0: the coefficient equals 0) can be rejected, as both p-values are below 0.05.
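As a sketch, interval estimates for these coefficients are available via confint():
confint(fit2, level = 0.95)  # 95% confidence intervals for the refitted coefficients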
R-squared tells us that 48.38% of the variability in Price is explained by Age and Distance together.
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
The expression above gives the coefficient of multiple correlation. The value 0.6956 indicates a fairly strong association between Price and the two explanatory variables taken together (note that it is non-negative by construction, even though both regression coefficients are negative).
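Equivalently (a sketch of the same identity), the multiple correlation coefficient is the correlation between the observed and the fitted prices:
cor(Apartments$Price, fitted(fit2))  # equals sqrt(R-squared), about 0.6956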
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFYes 167.531 62.864 2.665 0.00933 **
## BalconyFYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the null hypothesis (which says that the two added coefficients are zero, i.e. that the smaller model fit2 is adequate), as the p-value (0.031) is below 0.05. We can conclude that the extended model fit3 describes the data significantly better.
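As a sketch of an alternative comparison, information criteria point the same way (lower AIC indicates the preferred model):
AIC(fit2, fit3)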
The regression coefficient for Parking shows that an apartment with parking, everything else unchanged, is on average 167.53 EUR more expensive than one without parking. The regression coefficient for Balcony shows that an apartment with a balcony, everything else unchanged, is on average 15.21 EUR cheaper, but this effect is not statistically significant (p = 0.798).
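A sketch that makes the Parking effect concrete: predictions for two hypothetical apartments that differ only in Parking (the Age, Distance and Balcony values below are illustrative assumptions):
new_apts <- data.frame(Age = 10, Distance = 10,
                       ParkingF = factor(c("No", "Yes"), levels = c("No", "Yes")),
                       BalconyF = factor("No", levels = c("No", "Yes")))
predict(fit3, newdata = new_apts)  # the two predictions differ by 167.53 EUR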
The F-statistic tests whether the regression model as a whole explains a significant portion of the variation in apartment prices. H0: β1 = β2 = β3 = β4 = 0 (none of the four explanatory variables has any effect on Price); H1: at least one βi ≠ 0 (at least one explanatory variable explains a significant portion of the variation in Price). The result shows that the model is highly significant, as the p-value is far below 0.05 (indeed below 0.001).
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
round(Apartments[2, "Residuals"], 3)
## # A tibble: 1 × 1
## Residuals
## <dbl>
## 1 428.
The residual for apartment 2 tells us that the actual price of this apartment is 427.8 EUR higher than the value estimated by the regression.
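As a sketch, the same number can be recovered directly from the definition residual = actual − fitted:
Apartments$Price[2] - Apartments$Fitted[2]  # about 427.8 EUR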