library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
head(mydata)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: Parking (0/1) and Balcony (0/1) are recoded as factor variables labelled "No"/"Yes".
mydata$ParkingFactor <- factor(mydata$Parking, levels = c("0","1"), labels = c("No","Yes"))
mydata$BalconyFactor <- factor(mydata$Balcony, levels = c("0","1"), labels = c("No","Yes"))
head(mydata)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(mydata$Price, mu = 1900)
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0: Mu_Price = 1900 EUR. H1: Mu_Price ≠ 1900 EUR.
We reject the null hypothesis (H0: Mu_Price = 1900 EUR) at p = 0.005. Therefore, we can state that the average price differs from 1900 EUR; the 95% confidence interval (1937.4, 2100.4) indicates that it is higher.
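As a cross-check, the t statistic can be reproduced by hand from the sample mean, standard deviation and sample size. This is a minimal sketch, not part of the original analysis, and it should agree with the t.test output above.
# Manual one-sample t statistic: t = (x_bar - mu0) / (s / sqrt(n))
x <- mydata$Price
n <- length(x)
t_stat <- (mean(x) - 1900) / (sd(x) / sqrt(n))
p_val <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
c(t = t_stat, df = n - 1, p = p_val)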
fit1 <- lm(Price ~ Age, data = mydata )
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
H0: B1 = 0 -> No correlation or association between Price and Age of a building.
H1: B1 ≠ 0 -> Some correlation or association between Price and Age of a building.
We reject the null hypothesis at p = 0.034. Hence, there is a statistically significant association between price and age.
If age increases by one year, the price, on average, goes down by 8.98 EUR per m2.
Nevertheless, the model explains only around 5.3% (R2 = 0.053) of the variance in price, suggesting that other variables also have a substantial impact on apartment prices.
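Since fit1 has a single predictor, R2 is simply the squared Pearson correlation between Price and Age, which can be verified directly (a quick check added here for illustration):
# In simple linear regression, R-squared equals the squared correlation
cor(mydata$Price, mydata$Age)^2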
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[,c(3,1,2)])
The scatterplot matrix shows no apparent strong linear relationship between any pair of variables; in particular, the lack of a pattern between Age and Distance suggests that multicollinearity is not a potential problem.
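For a numeric counterpart to the scatterplot matrix, the pairwise correlation matrix of the same three columns can be inspected (a small sketch, not part of the original output):
# Pairwise correlations for the variables shown in the scatterplot matrix
round(cor(mydata[, c("Price", "Age", "Distance")]), 3)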
fit2 <- lm(Price ~ Age + Distance, data = mydata)
vif(fit2)
## Age Distance
## 1.001845 1.001845
A Variance Inflation Factor (VIF) close to 1 for both variables indicates no multicollinearity between the predictors, which supports the validity of the regression analysis. This confirms the impression obtained from the scatterplot matrix.
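The VIF can also be reconstructed by hand as 1 / (1 - R2) from an auxiliary regression of one predictor on the other; the sketch below should reproduce the value reported by vif() for Age (and, with only two predictors, for Distance as well):
# VIF for Age: 1 / (1 - R^2) from the auxiliary regression Age ~ Distance
r2_age <- summary(lm(Age ~ Distance, data = mydata))$r.squared
1 / (1 - r2_age)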
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Std. Residuals")
head(mydata[order(mydata$StdResid),],3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15 0.066
## 2 12 14 1650 0 1 No Yes -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
hist(mydata$CooksD,
     xlab = "Cook's Distance",
     ylab = "Frequency",
     main = "Histogram of Cook's Distances")
head(mydata[order(-mydata$CooksD),],3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(CooksD != 0.320)
fit2 <- lm(Price ~ Age + Distance, data = mydata)
The histogram and values of the standardized residuals show that all of them range from -2.2 to 2.6. Therefore, there are no obvious outliers, since no value lies above +3 or below -3.
Checking the Cook's distances, a value of 0.320 stands out as especially high relative to the rest of the dataset, indicating a highly influential observation. This observation is removed to ensure that it does not bias the model.
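As a cross-check on this judgement, a common rule of thumb (not used in the original analysis) flags observations whose Cook's distance exceeds 4/n; with n = 85 before the removal, that cutoff is roughly 0.047, which the removed value of 0.320 exceeds by a wide margin:
# Rule-of-thumb cutoff for Cook's distance: 4/n (n = 85 before the removal)
4 / 85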
mydata$StdFitted <- scale(fit2$fitted.values)
mydata$CooksD <- round(cooks.distance(fit2), 3)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted Values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The scatterplot shows a random dispersion of points close to the horizontal zero line, with no evident pattern indicating a change in variance. In addition, the majority of values are concentrated between -1 and 1 on the standardized fitted axis, consistent with constant variance. These observations suggest no evidence of heteroskedasticity, supporting the assumption of homoskedasticity in the model.
However, the noticeable difference in the number of observations between the ranges (-3, -1) and (-1, 2) makes a Breusch-Pagan test advisable, in order to confirm the absence of a pattern and further assess the homoskedasticity assumption.
## extra: formal test of homoskedasticity (Breusch-Pagan)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
H0: Variance is constant.
H1: Variance is not constant.
We cannot reject the null hypothesis at p = 0.087; therefore, we assume homoskedasticity.
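For reference, the mechanics of the original Breusch-Pagan statistic can be sketched by regressing the scaled squared residuals on the fitted values and taking half of the explained sum of squares; depending on which variant olsrr uses, the result should be close to (or match) the Chi2 reported above:
# Breusch-Pagan statistic by hand: ESS/2 from regressing scaled squared
# residuals on the fitted values (1 auxiliary regressor -> 1 df)
e2  <- residuals(fit2)^2
g   <- e2 / mean(e2)
aux <- lm(g ~ fitted(fit2))
bp  <- sum((fitted(aux) - mean(g))^2) / 2
c(Chi2 = bp, p = pchisq(bp, df = 1, lower.tail = FALSE))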
hist(mydata$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals")
The histogram of standardized residuals suggests that the values are not normally distributed, as their distribution does not appear to follow the typical bell shape of a normal distribution. A Shapiro-Wilk test is therefore recommended to confirm this impression.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.94879, p-value = 0.002187
H0: Values are normally distributed.
H1: Values are not normally distributed.
We reject the null hypothesis at p = 0.002. Therefore, the standardized residuals cannot be assumed to be normally distributed.
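A normal Q-Q plot is a useful visual companion to the Shapiro-Wilk test, making the departure from normality visible (a small additional check using base graphics):
# Q-Q plot of the standardized residuals against normal quantiles
qqnorm(mydata$StdResid, main = "Normal Q-Q Plot of Std. Residuals")
qqline(mydata$StdResid)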
# fit2 was already refitted on the cleaned data (influential observation removed)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
H0: B1 = 0 -> No correlation or association.
H1: B1 ≠ 0 -> Some correlation or association.
Price ~ Age: We reject the null hypothesis at p = 0.044. Hence, there is a statistically significant association between price and age. If the age of the apartment increases by one year, the price, on average, goes down by 6.46 EUR per m2 (p = 0.044), assuming all other variables remain unchanged. Age therefore influences Price negatively (coefficient = -6.464).
Price ~ Distance: We reject the null hypothesis at p < 0.001. Hence, there is a statistically significant association between price and distance. If the distance from the city center increases by one kilometer, the price, on average, goes down by 22.96 EUR per m2 (p < 0.001), assuming all other variables remain unchanged. Distance therefore influences Price negatively (coefficient = -22.955).
This model explains approximately 48.4% (R2 = 0.484) of the variance in price, suggesting that while age and distance are two important factors, other variables also impact the price of apartments.
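Confidence intervals for the slopes express the same conclusions in interval form; confint() gives 95% intervals consistent with the coefficient table above (added here as a supplementary check):
# 95% confidence intervals for the fit2 coefficients
confint(fit2, level = 0.95)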
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor, data = mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Δρ2 = 0 -> Both models are equally good (the change in the coefficient of determination is zero).
H1: Δρ2 > 0 -> The second model is better (the change in the coefficient of determination is positive).
We reject the null hypothesis at p = 0.031. Hence, the second model, which includes the two categorical variables, yields a statistically significant improvement over the first one.
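The partial F statistic in the ANOVA table can be reconstructed from the two residual sums of squares, which makes explicit what the model comparison tests (a sketch that should reproduce the F and p-value reported above):
# Partial F test by hand: F = ((RSS_restricted - RSS_full) / q) / (RSS_full / df_full)
rss2 <- sum(residuals(fit2)^2)   # restricted model: Age + Distance
rss3 <- sum(residuals(fit3)^2)   # full model: + ParkingFactor + BalconyFactor
q    <- 2                        # number of added parameters
f    <- ((rss2 - rss3) / q) / (rss3 / df.residual(fit3))
c(F = f, p = pf(f, q, df.residual(fit3), lower.tail = FALSE))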
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorYes 167.531 62.864 2.665 0.00933 **
## BalconyFactorYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
H0: B1 = 0 -> No correlation or association.
H1: B1 ≠ 0 -> Some correlation or association.
Price ~ ParkingFactorYes: We reject the null hypothesis at p = 0.009. If the apartment has a parking space, the price, on average, goes up by 167.53 EUR per m2, assuming all other variables remain unchanged.
Price ~ BalconyFactorYes: The effect of this variable is not statistically significant (p = 0.798). This suggests that balconies do not substantially impact the apartment's price once the other factors are considered.
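To see how the estimated coefficients combine in practice, predict() can be applied to a hypothetical apartment; the values below are invented purely for illustration:
# Predicted price for a hypothetical apartment (illustrative values only)
new_apt <- data.frame(Age = 10, Distance = 5,
                      ParkingFactor = factor("Yes", levels = c("No", "Yes")),
                      BalconyFactor = factor("No",  levels = c("No", "Yes")))
predict(fit3, newdata = new_apt, interval = "prediction")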
F-STATISTIC:
The F-statistic tests a joint hypothesis about the significance of the whole model.
H0: All coefficients of the variables are 0 and therefore none of them affects the price (ρ2 = 0).
H1: At least one of the variables affects the price of apartments (ρ2 > 0).
We reject the null hypothesis at p < 0.001 and conclude that one or more variables have a significant effect on apartment prices.
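The overall F statistic is itself a function of R2 and the degrees of freedom, which makes the link to the joint hypothesis explicit (a sketch using the quantities from the summary above):
# Overall F statistic from R^2: F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r2 <- summary(fit3)$r.squared
k  <- 4            # number of estimated slopes in fit3
n  <- nrow(mydata)
(r2 / k) / ((1 - r2) / (n - k - 1))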
mydata$Fitted <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)
mydata[2,c("Fitted", "Residuals")]
## # A tibble: 1 × 2
## Fitted Residuals
## <dbl> <dbl>
## 1 2372. 428.
The fitted value for this apartment suggests an estimated price of 2372.197 EUR. However, the residual of 427.803 indicates a substantial difference between the estimate and the actual price of 2800 EUR.
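As a quick consistency check, the fitted value and the residual should add back up to the observed price for this row:
# Observed price = fitted value + residual
mydata$Fitted[2] + mydata$Residuals[2]   # should equal mydata$Price[2], i.e. 2800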