Paulina Suvorov
library(readxl)
mydata <- read_xlsx("C:/Users/pauli/OneDrive/Desktop/BootcampR/R Take Home Exam/R Take Home Exam/Task 3/Apartments.xlsx")
head(mydata)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata$ParkingF <- factor(mydata$Parking,
levels = c(0,1),
labels = c("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
levels = c(0,1),
labels= c("No", "Yes"))
mydata1 <- mydata
library(pastecs)
round(stat.desc(mydata[ ,c(1:5)]),2)
## Age Distance Price Parking Balcony
## nbr.val 85.00 85.00 85.00 85.00 85.00
## nbr.null 0.00 0.00 0.00 42.00 48.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 1.00 1.00 1400.00 0.00 0.00
## max 45.00 45.00 2820.00 1.00 1.00
## range 44.00 44.00 1420.00 1.00 1.00
## sum 1577.00 1209.00 171610.00 43.00 37.00
## median 18.00 12.00 1950.00 1.00 0.00
## mean 18.55 14.22 2018.94 0.51 0.44
## SE.mean 1.05 1.23 40.98 0.05 0.05
## CI.mean.0.95 2.09 2.45 81.50 0.11 0.11
## var 93.96 129.44 142764.34 0.25 0.25
## std.dev 9.69 11.38 377.84 0.50 0.50
## coef.var 0.52 0.80 0.19 0.99 1.15
The average price per m2 is 2018.94 EUR, and the difference between the highest and the lowest price per m2 is 1420 EUR. 42 of the apartments do not have parking and 48 do not have a balcony.
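The parking and balcony counts can be cross-checked with table() on the factor versions created above:
table(mydata$ParkingF)   # expected: 42 No, 43 Yes
table(mydata$BalconyF)   # expected: 48 No, 37 Yes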
#H0: mu_price = 1900
#H1: mu_price ≠ 1900
t.test(mydata$Price,
mu=1900,
alternative="two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Explanation: Since p = 0.0047 < 0.05, we reject H0 and conclude that the average price per m2 is not equal to 1900 EUR; the sample mean is 2018.94 EUR.
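The t statistic can also be reproduced by hand from the sample mean and its standard error:
# t = (sample mean - 1900) / SE of the mean; should give roughly 2.90, as in the t.test output
(mean(mydata$Price) - 1900) / (sd(mydata$Price) / sqrt(length(mydata$Price)))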
fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanation: If the age of the apartment increases by 1 year, the price per m2 decreases on average by 8.975 EUR (p < 0.05).
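As an illustration (the age of 10 years is chosen only for this example), the fitted line predicts a price per m2 of about 2185.455 - 8.975 * 10 ≈ 2096 for a 10-year-old apartment:
# predicted price per m2 for a hypothetical 10-year-old apartment
predict(fit1, newdata = data.frame(Age = 10))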
library(car)
scatterplotMatrix(mydata[,c(3,2,1)], smooth = FALSE)
Explanation: The scatter plot matrix of Price, Age and Distance shows no sign of multicollinearity: the relationship between Age and Distance is very weak (the fitted line between them is nearly flat).
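The same impression can be checked numerically with the correlation matrix; if there is no multicollinearity, the correlation between Age and Distance should be close to zero:
# pairwise correlations between the variables shown in the scatter plot matrix
round(cor(mydata[, c("Price", "Age", "Distance")]), 3)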
fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Based on the VIF statistics there is no multicollinearity, since the VIF of every explanatory variable is below 5.
mean(vif(fit2))
## [1] 1.001845
The mean VIF is also close to 1, so we can conclude that there is no multicollinearity.
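Since VIF is defined as 1 / (1 - R²) of the regression of one explanatory variable on the other, the value can also be verified from first principles:
# VIF for Age computed directly from the auxiliary regression Age ~ Distance; about 1.0018
1 / (1 - summary(lm(Age ~ Distance, data = mydata))$r.squared)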
#outliers/standardised residuals
mydata$StdResid <- round(rstandard(fit2),3)
hist(mydata$StdResid,
main = "Histogram of standardised residuals",
xlab = "Standardised residuals",
ylab = "Frequency",
breaks = seq(from= -3, to= 3, by=0.5))
We can notice one mild outlier on the left-hand side of the histogram.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95303, p-value = 0.003645
H0: the standardised residuals are normally distributed. H1: the standardised residuals are not normally distributed. Since p < 0.05, we reject H0 and conclude that the standardised residuals are not normally distributed.
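A normal Q-Q plot is a common graphical complement to the Shapiro-Wilk test (an additional check, not required by the task):
# Q-Q plot of the standardised residuals against the normal distribution
qqnorm(mydata$StdResid)
qqline(mydata$StdResid)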
#units with big impact/cooks distances
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
main = "Histogram of Cook's distances",
xlab = "Cook's distance",
ylab = "Frequency")
The observation whose Cook's distance is clearly higher than the rest (above 0.3, at the right end of the histogram) is problematic.
head(mydata[order(mydata$StdResid),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15 0.066
## 2 12 14 1650 0 1 No Yes -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
## 4 13 8 1800 0 0 No No -1.38 0.012
## 5 14 16 1660 0 1 No Yes -1.26 0.008
## 6 24 5 1830 1 0 Yes No -1.19 0.012
We need to delete the observation in row 53.
head(mydata[order(-mydata$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
We also need to delete the observation in row 38, which has the largest Cook's distance.
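The two row numbers can also be found programmatically rather than read off the sorted tables (assuming each extreme value is unique):
# row with the most negative standardised residual and row with the largest Cook's distance
which.min(mydata$StdResid)   # should return 53
which.max(mydata$CooksD)     # should return 38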
#we remove both observations
mydata1 <- mydata[c(-53, -38),]
hist(mydata1$StdResid,
main = "Histogram of standardised residuals",
xlab = "Standardised residuals",
ylab = "Frequency",
breaks = seq(from= -3, to= 3, by=0.5))
hist(mydata1$CooksD,
main = "Histogram of Cook's distances",
xlab = "Cook's distance",
ylab = "Frequency")
fit2 <- lm(Price ~ Age + Distance, data = mydata1)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.51 -223.89 -62.78 202.78 575.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2489.617 73.524 33.861 < 2e-16 ***
## Age -7.350 3.103 -2.368 0.0203 *
## Distance -23.636 2.731 -8.654 4.21e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.1 on 80 degrees of freedom
## Multiple R-squared: 0.513, Adjusted R-squared: 0.5008
## F-statistic: 42.13 on 2 and 80 DF, p-value: 3.177e-13
#standardisation
mydata1$StdResid <- round(rstandard(fit2), 3) # recompute the residuals for the re-fitted model
mydata1$StdfittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata1$StdResid, x = mydata1$StdfittedValues,
ylab = "Standardised residuals",
xlab = "Standardised fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Explanation: The standardised residuals show no clear pattern or funnel shape against the standardised fitted values, so there is no sign of heteroskedasticity.
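A formal complement to the visual check is the Breusch-Pagan test from the lmtest package (not otherwise used in this document, so it would have to be installed separately):
# Breusch-Pagan test; H0: constant error variance (no heteroskedasticity)
library(lmtest)
bptest(fit2)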
mydata1$StdResid <- round(rstandard(fit2),3)
hist(mydata1$StdResid,
main = "Histogram of Standardised Residuals",
xlab = "Standardised Residuals",
ylab = "Frequency",
breaks = seq(from = -3, to = 3, by = 0.5))
H0: the standardised residuals are normally distributed. H1: the standardised residuals are not normally distributed.
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.94142, p-value = 0.0008992
Explanation: Since the p-value < 0.05, we reject H0 and conclude that the standardised residuals are not normally distributed.
fit2 <- lm(Price ~ Age+Distance, data=mydata1)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.51 -223.89 -62.78 202.78 575.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2489.617 73.524 33.861 < 2e-16 ***
## Age -7.350 3.103 -2.368 0.0203 *
## Distance -23.636 2.731 -8.654 4.21e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.1 on 80 degrees of freedom
## Multiple R-squared: 0.513, Adjusted R-squared: 0.5008
## F-statistic: 42.13 on 2 and 80 DF, p-value: 3.177e-13
Explanation: If the age of the apartment increases by 1 year, with everything else unchanged, the price per m2 decreases on average by 7.350 EUR (p < 0.05). If the distance from the city centre increases by 1 km, with everything else unchanged, the price per m2 decreases on average by 23.636 EUR (p < 0.001).
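As an illustration of the fitted equation (Age = 10 and Distance = 5 are chosen only for this example), the model predicts a price per m2 of about 2489.617 - 7.350 * 10 - 23.636 * 5 ≈ 2298:
# predicted price per m2 for a hypothetical apartment, 10 years old and 5 km from the centre
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))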
fit3 <- lm(Price ~ Age+Distance+ParkingF+BalconyF, data=mydata1)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.58 -198.68 -44.44 229.33 529.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2367.282 93.781 25.243 < 2e-16 ***
## Age -6.605 3.054 -2.162 0.0337 *
## Distance -21.140 2.878 -7.345 1.73e-10 ***
## ParkingFYes 147.508 62.799 2.349 0.0214 *
## BalconyFYes -3.122 58.635 -0.053 0.9577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.4 on 78 degrees of freedom
## Multiple R-squared: 0.5453, Adjusted R-squared: 0.522
## F-statistic: 23.39 on 4 and 78 DF, p-value: 9.972e-13
Explanation: If the age of the apartment increases by 1 year, with everything else unchanged, the price per m2 decreases on average by 6.605 EUR (p < 0.05). If the distance from the city centre increases by 1 km, with everything else unchanged, the price per m2 decreases on average by 21.140 EUR (p < 0.001). Apartments with a parking space have, on average, a price per m2 that is 147.508 EUR higher, with everything else unchanged (p < 0.05). We cannot conclude that having a balcony affects the average price per m2 (p > 0.05).
#### With the function anova, check whether model fit3 fits the data better than model fit2.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5795128
## 2 78 5410469 2 384659 2.7727 0.06866 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: fit2 is the more appropriate model (the added variables do not improve the fit). H1: fit3 is the more appropriate model. Since the p-value (0.069) is greater than 0.05, we cannot reject H0 at the 5% level, so there is not enough evidence that fit3 fits the data better than fit2.
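The F statistic in the ANOVA table can be reproduced from the two residual sums of squares, F = ((RSS_fit2 - RSS_fit3) / 2) / (RSS_fit3 / 78):
# partial F statistic computed from the RSS values reported above; about 2.77
((5795128 - 5410469) / 2) / (5410469 / 78)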
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.58 -198.68 -44.44 229.33 529.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2367.282 93.781 25.243 < 2e-16 ***
## Age -6.605 3.054 -2.162 0.0337 *
## Distance -21.140 2.878 -7.345 1.73e-10 ***
## ParkingFYes 147.508 62.799 2.349 0.0214 *
## BalconyFYes -3.122 58.635 -0.053 0.9577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.4 on 78 degrees of freedom
## Multiple R-squared: 0.5453, Adjusted R-squared: 0.522
## F-statistic: 23.39 on 4 and 78 DF, p-value: 9.972e-13
F-statistic: H0: ρ² = 0, H1: ρ² > 0. Since the p-value < 0.05 we reject H0 and conclude that at least one of the explanatory variables has a coefficient different from 0, i.e. has an effect on the dependent variable.
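The overall F statistic can likewise be reproduced from R², F = (R² / k) / ((1 - R²) / (n - k - 1)) with k = 4 explanatory variables and n = 83 observations:
# overall F statistic computed from R-squared; about 23.4, matching summary(fit3)
(0.5453 / 4) / ((1 - 0.5453) / 78)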
mydata1$FittedValues <- fitted.values(fit3) # use the final model for both fitted values and residuals
mydata1$Residuals <- residuals(fit3)
head(mydata1[, colnames(mydata1) %in% c("Apartment", "Residuals")])
## # A tibble: 6 × 1
## Residuals
## <dbl>
## 1 -86.0
## 2 425.
## 3 -69.1
## 4 284.
## 5 -372.
## 6 -156.
The apartment in row 2 has a residual of 425.23, i.e. its actual price per m2 is 425.23 EUR higher than the value predicted by the model.
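The same value can be extracted directly:
# residual of the apartment in row 2 of mydata1
mydata1$Residuals[2]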