library(readxl)
mydataAP <- read_excel("C:/Users/Veronika/Å OLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/R PROGRAM - BOOTCAMP/TAKE HOME EXAM/Task 3/Apartments.xlsx")
View(mydataAP)
mydataAP <- as.data.frame(mydataAP)
head(mydataAP)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydataAP$Parking <- factor(mydataAP$Parking,
levels = c("0","1"),
labels = c("No","Yes"))
mydataAP$Balcony <- factor(mydataAP$Balcony,
levels = c("0","1"),
labels = c("No","Yes"))
t.test(mydataAP$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataAP$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
SPRMENI!!!: Based on the p-value (p = 0.004731), which is smaller that 0.05, we can reject the null hypothesis and say that the true mean is not equal to 1900. Based on the 95 percent confidence interval, we can stipulate that the true mean is somewhere between 1937.443 and 2100.440.
fit1 <- lm(Price ~ Age,
data = mydataAP)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydataAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanations: a) Estimate of regression coefficient (-8.975): If the age increases by 1 year, the price decreases by 8,975€ thousand per m^2. b) Coefficient of correlation: For the observed correlation, we set two hypotheses to complete the t- test. H0: beta1 = 0 and H1: beta1 =/= (not equal) 0), we get p=0.034 (p < 0.05), so we reject H0, accept H1, so the age has some effect on price. (SPREMENI!!!) c) Coefficient of determination: 5,3% of the variable Price is explained with linear effect of the Age.
library(car)
## Loading required package: carData
scatterplotMatrix(mydataAP[ , c(1, 2, 3)],
smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydataAP[ , c(1,2,3)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
Based on what we can spot in the matrix, and what we calculated with correlation matrix above, where we calculated a Pearson correlation coefficients,there is a strong correlation between Price and Distance and weak correlation between Age and Price and Age and Distance.
fit2 <- lm(Price ~ Age + Distance,
data = mydataAP)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Chris - mail
mydataAP$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydataAP$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(mydataAP$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataAP$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP$StdResid
## W = 0.95303, p-value = 0.003645
hist(mydataAP$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
We spot that we have some outliers. We will test it (v nadaljevanju!!).
head(mydataAP[order(mydataAP$StdResid), ], 3)
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
head(mydataAP[order(-mydataAP$CooksD), ], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 39 40 2 2400 No Yes 1.091 0.038
As we can see, there is a gap between aparment no. 53 and 13 and between no. 38, 55 and 33 (according to Cooks Distance). But, let’s remove aparment no. 38 and see if it betters the analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydataAP <- mydataAP %>%
filter(!CooksD == 0.320)
fit3 <- lm(Price ~ Age + Distance,
data = mydataAP)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
mydataAP$StdResid <- round(rstandard(fit3), 3) #Standardized residuals
mydataAP$CooksD <- round(cooks.distance(fit3), 3) #Cooks distances
hist(mydataAP$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataAP$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP$StdResid
## W = 0.95649, p-value = 0.006355
hist(mydataAP$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Because we still have some gaps, I decided to adjust my data a bit more:
library(dplyr)
mydataAP <- mydataAP %>%
filter(!CooksD == 0.129)
fit4 <- lm(Price ~ Age + Distance,
data = mydataAP)
summary(fit4)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
mydataAP$StdResid <- round(rstandard(fit4), 3) #Standardized residuals
mydataAP$CooksD <- round(cooks.distance(fit4), 3) #Cooks distances
hist(mydataAP$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataAP$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP$StdResid
## W = 0.95952, p-value = 0.01044
hist(mydataAP$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
With the process above we eliminated the outliers in stadardised residuals and we do not have any gaps anymore in that graph, which is better because the graph is nicely connected. Although, we still have a gap in the Histogram Cooks distances, but I decided to not eliminate this one.
mydataAP$StdFitted <- scale(fit4$fitted.values)
library(car)
scatterplot(y = mydataAP$StdResid, x = mydataAP$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Based on the observations in scatterplot above, we see that we have a case of heteroscedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit4)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
POGLEJ!!!: The null hypothesis that the variance of errors is constant cannot be rejected (p-value is above 0.05). We therefore assume homoskedasticity.