library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
Description:
mydata$ParkingF <- factor(mydata$Parking,
levels = c("0","1"),
labels = c("No","Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
levels = c("0","1"),
labels = c("No","Yes"))
head(mydata)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingF BalconyF
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
mean(mydata$Price)
## [1] 2018.941
sd(mydata$Price)
## [1] 377.8417
library(ggplot2)
ggplot(mydata, aes(x = Price)) +
geom_histogram(binwidth = 100, colour = "black") +
ylab("Frequency") +
xlab ("Price")
shapiro.test(mydata$Price)
##
## Shapiro-Wilk normality test
##
## data: mydata$Price
## W = 0.94017, p-value = 0.0006513
median(mydata$Price)
## [1] 1950
wilcox.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## Wilcoxon signed rank test with continuity correction
##
## data: mydata$Price
## V = 2328, p-value = 0.02844
## alternative hypothesis: true location is not equal to 1900
We reject H0 with p = 0.02828
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0 is rejected at p = 0.004731
Based off the following tests we can conclude that the mean price does not = 1900.
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
corcoef <- sqrt(summary(fit1)$r.squared)
corcoef
## [1] 0.230255
Regression Coefficient: If age increases by 1 year then on average Price decreases by 8.975 assuming all else eq
Coefficient of Correlation: The coefficient of 0.230255 can be described as a weak positive relationship between age of apartment and price.
Coefficient of Determination: The goodness of fit is 0.05302 which means that 5.3% of variance can be explained by the dependent variable in the model.
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[ , c(-4,-5,-6,-7)],
smooth = FALSE)
Based off of this graph there appears to be no multicolinearity which means that the variables do not have any/strong relationship with each other.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
None of the VIF’s are greater than 5 and the mean VIF about = 1 so there is no multicolinearity.
mydata$StdResiduals <- round(rstandard(fit2), 3)
head(mydata[order(mydata$StdResiduals),],3)
## # A tibble: 3 × 8
## Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15
## 2 12 14 1650 0 1 No Yes -1.50
## 3 12 14 1650 0 0 No No -1.50
head(mydata[order(-mydata$StdResiduals),],3)
## # A tibble: 3 × 8
## Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58
## 2 2 11 2790 1 0 Yes No 2.05
## 3 18 1 2800 1 0 Yes No 1.78
In the above analysis there are no values greater or less than 3 so we have no reason to assume outliers.
mydata$CooksD <- round(cooks.distance(fit2), 3)
library(ggplot2)
ggplot(mydata, aes(x = CooksD)) +
geom_histogram(bins= 50, colour = "black") +
xlim(NA, 0.4) +
ylab("Frequency") +
xlab ("Cooks Distance")
## Warning: Removed 1 row containing missing values or values outside the scale range (`geom_bar()`).
head(mydata[order(-mydata$CooksD),])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
We can see a gap between 0.15 and 0.30. We will want to remove this value of hugh influence using the below function.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(CooksD < 0.06)
ggplot(mydata, aes(x = CooksD)) +
geom_histogram(bins= 50, colour = "black") +
xlim(NA, 0.4) +
ylab("Frequency") +
xlab ("Cooks Distance")
## Warning: Removed 1 row containing missing values or values outside the scale range (`geom_bar()`).
new histogram with removed outlier.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
mydata$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResiduals, x = mydata$StdFitted,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
We can see homoskedasticity by looking at this scatterplot for the most
part. There seems to be a slight narrowing between -3 and -2. To test
this hypothesis further we use the below test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
The p value 0.188 is high (p >0.05) and therefore we can not reject the null hypothesis that variance of errors is constant and there is no apparent heteroskedasticity.
hist(mydata$StdResiduals,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Standardized Residuals Histogram")
The standardized residuals look slightly skewed to the right side.
shapiro.test(mydata$StdResiduals)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResiduals
## W = 0.93418, p-value = 0.0004761
By using the above test we get a p value < 0.001 thus we reject the null hypothesis that the distribution of standardized residuals is normal.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Age Coefficient: Assuming all else equal, on average when age increases by 1 year, price decreases by 7.934.
Distance Coefficient: Assuming all else equal, on average when distance increases by 1 km, price decreases by 20,667.
these coefficients are both significant at a = 0.05.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = mydata)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFYes 128.700 60.801 2.117 0.0376 *
## BalconyFYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
The p value 0.01007 is < 0.05 so we reject the null that the difference between the coefficient of determination in the two models is 0 and can conclude that adding more variables improves the goodness of fit significantly. (fit3 is better)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFYes 128.700 60.801 2.117 0.0376 *
## BalconyFYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
ParkingF: Holding all else equal when an apartment has a parking space, on average price increases 196.168 at significance level of 0.01.
BalconyF: This variable is not significant and therefore it can be concluded that it does not have an effect on price. (p = 0.97436)
H0: p^2 = 0 ( all partial regression coeffs are 0 )
H1: p^2 > 0 (at least one partial regression coefficients of all the independent variables differs from 0 )
mydata$FittedValues3 <- fitted.values(fit3)
mydata[2,]
## # A tibble: 1 × 11
## Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals CooksD StdFitted[,1] FittedValues3
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 18 1 2800 1 0 Yes No 1.78 0.03 1.16 2357.
mydata$Residuals <- residuals(fit3)
mydata[2,]
## # A tibble: 1 × 12
## Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals CooksD StdFitted[,1] FittedValues3 Residuals
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18 1 2800 1 0 Yes No 1.78 0.03 1.16 2357. 443.