library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
mydata <- as.data.frame(mydata)
Description:
We are interested in how the price per square meter (in euros) of an apartment is affected by its age (in years), its distance from the city center (in km), the presence or absence of a balcony (No/Yes), and the presence or absence of parking (No/Yes).
# Recode the 0/1 indicators as labelled factors
mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
library(ggplot2)
ggplot(mydata, aes(x = Price)) +
  geom_histogram(binwidth = 500, colour = "black") +
  ylab("Frequency") +
  xlab("Price per square meter in euros")
shapiro.test(mydata$Price)
##
## Shapiro-Wilk normality test
##
## data: mydata$Price
## W = 0.94017, p-value = 0.0006513
Hypotheses of the Shapiro-Wilk test above:
H0: The price per square meter is normally distributed.
H1: The price per square meter is not normally distributed.
We reject the null hypothesis (H0) at p < 0.001, so we conclude that the price per square meter is not normally distributed.
Because the normality assumption is violated, the Wilcoxon signed-rank test would be the appropriate choice. For the purposes of this home assignment only, we nevertheless assume normality and continue with the one-sample t-test for the arithmetic mean.
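For reference, the non-parametric alternative mentioned above could be run roughly as follows. This is only a sketch: the `price` vector is simulated (hypothetical values, not the Apartments data), and the test assumes the distribution is symmetric around its center.

```r
# Hypothetical, simulated prices: NOT the assignment data
set.seed(1)
price <- rlnorm(85, meanlog = log(2000), sdlog = 0.2)

# Wilcoxon signed-rank test of H0: location = 1900 (two-sided)
res <- wilcox.test(price, mu = 1900, alternative = "two.sided")
res$p.value
```

With the real data, `price` would simply be replaced by `mydata$Price`.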
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Hypotheses for the one-sample t-test of the arithmetic mean:
H0: μ = 1900
H1: μ ≠ 1900
We reject the null hypothesis (H0) at p = 0.005. The population arithmetic mean of the price per square meter therefore differs from 1900 euros.
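As a check on the output above, the p-value and the confidence interval can be reproduced from the reported summary numbers alone. The sketch below uses only values copied from the `t.test()` output; the variable names are chosen here for illustration.

```r
# Values copied from the t.test() output above
xbar <- 2018.941   # sample mean
mu0  <- 1900       # hypothesized mean
tval <- 2.9022     # reported t statistic
df   <- 84         # degrees of freedom (n - 1)

se <- (xbar - mu0) / tval                    # standard error implied by t
p  <- 2 * pt(-abs(tval), df)                 # two-sided p-value
ci <- xbar + c(-1, 1) * qt(0.975, df) * se   # 95% confidence interval

round(p, 4)
round(ci, 1)
```

The results should agree with the reported p-value (0.004731) and interval (1937.443 to 2100.440) up to rounding.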
library(car)
## Loading required package: carData
scatterplot(mydata$Price ~ mydata$Age,
smooth = FALSE,
boxplots = FALSE,
ylab = "Price per square meter in euros",
xlab = "Age in years")
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Price = 2185.455 - 8.975 * Age
Hypotheses for the t-test of the regression coefficient:
H0: β1 = 0
H1: β1 ≠ 0
We reject the null hypothesis (H0) at p = 0.034, meaning that the regression coefficient differs from 0 in the population. So, the age of an apartment significantly affects its price.
Coefficient of determination: R^2 = 0.053, so about 5.3% of the variability in price per square meter is explained by the linear effect of age.
Hypotheses for the test of significance of regression:
H0: ρ^2 = 0
H1: ρ^2 > 0
We reject the null hypothesis (H0) at p = 0.034. So, the coefficient of determination in the population is greater than 0, meaning that the explanatory variable (age) explains part of the differences in price per square meter.
cor(mydata$Price, mydata$Age)
## [1] -0.230255
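In a simple regression with one explanatory variable, the coefficient of determination equals the squared Pearson correlation, and the F statistic equals the squared t statistic of the slope. A minimal check using only the numbers printed above:

```r
r    <- -0.230255  # cor(Price, Age) from the output above
tval <- -2.156     # t value of Age in summary(fit1)

r^2      # compare with Multiple R-squared: 0.05302
tval^2   # compare with F-statistic: 4.647 (small gap due to rounding of t)
```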
scatterplotMatrix(mydata[ , c(-4, -5, -6, -7)],
smooth = FALSE)
##install.packages("Hmisc")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-4,-5,-6,-7)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
Hypotheses for the test of each correlation coefficient:
H0: ρ = 0
H1: ρ ≠ 0
Price and Age: we reject the null hypothesis (H0) at p = 0.034, meaning that there is a weak negative linear relationship between the price of an apartment and its age (r = -0.23).
Price and Distance: we reject the null hypothesis (H0) at p < 0.001, meaning that there is a stronger negative linear relationship between the price of an apartment and its distance from the city center (r = -0.63).
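The p-values reported by `rcorr()` come from a t-test on the correlation coefficient with n - 2 degrees of freedom. As a sketch, the p-value for Price and Age can be reproduced from r and n alone (values copied from the output above):

```r
r <- -0.230255   # correlation between Price and Age
n <- 85          # number of apartments

tval <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic
p    <- 2 * pt(-abs(tval), df = n - 2)    # two-sided p-value

round(tval, 3)
round(p, 3)
```

The statistic matches the t value of Age in the simple regression above, which is expected because both tests are equivalent when there is a single explanatory variable.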
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Price = 2460.101 - 7.934 * Age - 20.667 * Distance
If an apartment is new (Age = 0) and located in the city center (Distance = 0), its expected price is 2460.101 euros per square meter, on average (p < 0.001).
When Age increases by 1 year, the price of the apartment decreases by 7.934 euros per square meter on average, assuming the other explanatory variable remains unchanged (p = 0.016).
Hypotheses for the test of the partial regression coefficient:
H0: β1 = 0
H1: β1 ≠ 0
We reject the null hypothesis (H0) at p = 0.016, meaning that the regression coefficient differs from 0 in the population. So, the age of an apartment significantly affects its price.
When the distance of an apartment from the city center increases by 1 km, the price of the apartment decreases by 20.667 euros per square meter on average, assuming the other explanatory variable remains unchanged (p < 0.001).
Hypotheses for the test of the partial regression coefficient:
H0: β2 = 0
H1: β2 ≠ 0
We reject the null hypothesis (H0) at p < 0.001, meaning that the regression coefficient differs from 0 in the population. So, the distance of an apartment from the city center significantly affects its price.
Coefficient of determination: R^2 = 0.4396, so about 44% of the variability in price per square meter is explained by the linear effect of age and distance.
Test of significance of regression:
H0: ρ^2 = 0
H1: ρ^2 > 0
We reject the null hypothesis (H0) at p < 0.001. So, the coefficient of determination in the population is greater than 0, meaning that at least one explanatory variable explains the differences in price per square meter.
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
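VIF values this close to 1 suggest that Age and Distance are almost uncorrelated, so multicollinearity is not a concern here. To illustrate what `vif()` measures, the sketch below computes it by hand on simulated predictors (hypothetical data, not the Apartments data): the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors.

```r
# Hypothetical, simulated predictors: NOT the assignment data
set.seed(42)
n <- 85
age      <- runif(n, 1, 45)
distance <- runif(n, 1, 45)

# VIF of age from the auxiliary regression of age on distance
r2_age  <- summary(lm(age ~ distance))$r.squared
vif_age <- 1 / (1 - r2_age)

# With only two predictors this equals 1 / (1 - r^2),
# where r is the correlation between the two predictors
vif_cor <- 1 / (1 - cor(age, distance)^2)

c(vif_age, vif_cor)
```

This also explains why both predictors have the same VIF in the output above: with two explanatory variables there is only one correlation between them.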
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distance")
mydata$ID <- seq(1, nrow(mydata))
head(mydata[order(mydata$StdResid),], 3)
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## 53 7 2 1760 0 1 No Yes -2.152 0.066 53
## 13 12 14 1650 0 1 No Yes -1.499 0.013 13
## 72 12 14 1650 0 0 No No -1.499 0.013 72
head(mydata[order(-mydata$CooksD),], 6)
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## 38 5 45 2180 1 1 Yes Yes 2.577 0.320 38
## 55 43 37 1740 0 0 No No 1.445 0.104 55
## 33 2 11 2790 1 0 Yes No 2.051 0.069 33
## 53 7 2 1760 0 1 No Yes -2.152 0.066 53
## 22 37 3 2540 1 1 Yes Yes 1.576 0.061 22
## 39 40 2 2400 0 1 No Yes 1.091 0.038 39
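Cook's distance combines the size of the standardized residual with the leverage of the observation, which is why apartment 38 (large residual and an unusually large distance) stands out in the listing above. The sketch below verifies this relationship on simulated data (hypothetical values, not the Apartments data), using base R's `hatvalues()`, `rstandard()` and `cooks.distance()`.

```r
# Hypothetical, simulated data: NOT the assignment data
set.seed(7)
n <- 85
age      <- runif(n, 1, 45)
distance <- runif(n, 1, 45)
price    <- 2400 - 8 * age - 20 * distance + rnorm(n, sd = 280)
fit <- lm(price ~ age + distance)

h <- hatvalues(fit)       # leverage of each observation
r <- rstandard(fit)       # standardized residuals
p <- length(coef(fit))    # number of estimated coefficients (here 3)

# Cook's distance = (r^2 / p) * h / (1 - h)
cooks_manual <- (r^2 / p) * h / (1 - h)
all.equal(unname(cooks_manual), unname(cooks.distance(fit)))
```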
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Remove apartment 38, which has by far the largest Cook's distance (0.320)
mydata <- mydata %>%
  filter(ID != 38)
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
mydata$StdFitted <- scale(fit2$fitted.values)
mydata$StdResid <- round(rstandard(fit2), 3)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95649, p-value = 0.006355
Hypotheses of the Shapiro-Wilk test above:
H0: The errors are normally distributed.
H1: The errors are not normally distributed.
We reject the null hypothesis (H0) at p = 0.006, so we conclude that the errors are not normally distributed.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Price = 2456.076 - 6.464 * Age - 22.955 * Distance
If an apartment is new (Age = 0) and located in the city center (Distance = 0), its expected price is 2456.076 euros per square meter, on average (p < 0.001).
When Age increases by 1 year, the price of the apartment decreases by 6.464 euros per square meter on average, assuming the other explanatory variable remains unchanged (p = 0.044).
Test of the partial regression coefficient for Age:
H0: β1 = 0
H1: β1 ≠ 0
We reject the null hypothesis (H0) at p = 0.044, meaning that the regression coefficient differs from 0 in the population. So, the age of an apartment significantly affects its price.
When the distance of an apartment from the city center increases by 1 km, the price of the apartment decreases by 22.955 euros per square meter on average, assuming the other explanatory variable remains unchanged (p < 0.001).
Test of the partial regression coefficient for Distance:
H0: β2 = 0
H1: β2 ≠ 0
We reject the null hypothesis (H0) at p < 0.001, meaning that the regression coefficient differs from 0 in the population. So, the distance of an apartment from the city center significantly affects its price.
Coefficient of determination: R^2 = 0.4838, so about 48% of the variability in price per square meter is explained by the linear effect of age and distance.
Test of significance of regression:
H0: ρ^2 = 0
H1: ρ^2 > 0
We reject the null hypothesis (H0) at p < 0.001. So, the coefficient of determination in the population is greater than 0, meaning that at least one explanatory variable explains the differences in price per square meter.
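The adjusted R-squared and the F statistic of the refitted model follow directly from R^2, the sample size, and the number of explanatory variables. A short check using the values from the output above (n = 84 after removing one apartment, k = 2 explanatory variables):

```r
r2 <- 0.4838   # Multiple R-squared from summary(fit2)
n  <- 84       # apartments remaining after removing ID 38
k  <- 2        # explanatory variables (Age, Distance)

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
fstat  <- (r2 / k) / ((1 - r2) / (n - k - 1))
pval   <- pf(fstat, k, n - k - 1, lower.tail = FALSE)

round(adj_r2, 4)   # compare with Adjusted R-squared: 0.4711
round(fstat, 2)    # compare with F-statistic: 37.96
pval               # compare with p-value: 2.339e-12
```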
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = mydata)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFYes 167.531 62.864 2.665 0.00933 **
## BalconyFYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Regression function:
Price = 2329.724 - 5.821 * Age - 20.279 * Distance + 167.531 * ParkingF - 15.207 * BalconyF
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypotheses for the test:
H0: Δρ^2 = 0
H1: Δρ^2 > 0
We reject the null hypothesis (H0) at p = 0.031. So, Δρ^2 is greater than 0: adding ParkingF and BalconyF significantly increases the coefficient of determination in the population. The more complex model (fit3) therefore fits the data significantly better than the simpler model (fit2).
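The F statistic in the ANOVA table compares the reduction in the residual sum of squares per added parameter with the residual variance of the larger model. It can be reproduced from the table values alone:

```r
# Values copied from the anova(fit2, fit3) table above
rss1 <- 6176767; df1 <- 81   # Model 1: Price ~ Age + Distance
rss2 <- 5654480; df2 <- 79   # Model 2: adds ParkingF and BalconyF

fstat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
pval  <- pf(fstat, df1 - df2, df2, lower.tail = FALSE)

round(fstat, 4)   # compare with F = 3.6485
round(pval, 5)    # compare with Pr(>F) = 0.03051
```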
If an apartment is new (Age = 0), is located in the city center (Distance = 0), and has neither parking nor a balcony, its expected price is 2329.724 euros per square meter, on average (p < 0.001).
When the distance of an apartment from the city center increases by 1 km, its price decreases by 20.279 euros per square meter on average, assuming all the other variables remain unchanged (p < 0.001).
Given the values of the other explanatory variables, apartments with parking are on average 167.531 euros per square meter more expensive than apartments without parking (p = 0.009).
Test of the partial regression coefficient for Age:
H0: β1 = 0
H1: β1 ≠ 0
We cannot reject the null hypothesis (H0) at the 5% level (p = 0.062), so we cannot say that the regression coefficient differs from 0 in the population. We do not have enough evidence to say that the age of an apartment significantly affects its price.
Test of the partial regression coefficient for Distance:
H0: β2 = 0
H1: β2 ≠ 0
We reject the null hypothesis (H0) at p < 0.001, meaning that the regression coefficient differs from 0 in the population. So, the distance of an apartment from the city center significantly affects its price.
Test of the partial regression coefficient for ParkingF:
H0: β3 = 0
H1: β3 ≠ 0
We reject the null hypothesis (H0) at p = 0.009, meaning that the regression coefficient differs from 0 in the population. So, the price per square meter differs significantly between apartments with and without parking.
Test of the partial regression coefficient for BalconyF:
H0: β4 = 0
H1: β4 ≠ 0
We cannot reject the null hypothesis (H0) (p = 0.798), so we cannot say that the regression coefficient differs from 0 in the population. We cannot say that the price per square meter differs significantly between apartments with and without a balcony.
Test of significance of regression:
H0: ρ^2 = 0
H1: ρ^2 > 0
We reject the null hypothesis (H0) at p < 0.001. So, the coefficient of determination in the population is greater than 0, meaning that at least one explanatory variable explains the differences in price per square meter.
mydata$Fitted_fit3 <- fitted.values(fit3)
mydata$Residuals_fit3 <- residuals(fit3)
mydata[mydata$ID == 2, c("ID", "Fitted_fit3", "Residuals_fit3")]
## ID Fitted_fit3 Residuals_fit3
## 2 2 2372.197 427.8029
mydata[mydata$ID == 2, ]
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID StdFitted Fitted_fit3 Residuals_fit3
## 2 18 1 2800 1 0 Yes No 1.775 0.031 2 1.134972 2372.197 427.8029
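We could also calculate the fitted value and the residual for the apartment with ID 2 manually:
Price = 2329.724 - 5.821 * Age - 20.279 * Distance + 167.531 * ParkingF - 15.207 * BalconyF
Fitted value = 2329.724 - 5.821 * 18 - 20.279 * 1 + 167.531 * 1 - 15.207 * 0 = 2372.198
Residual = Actual value - Fitted value = 2800 - 2372.198 = 427.802
The small differences from the R output (2372.197 and 427.8029) come from rounding the coefficients to three decimals.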
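The same manual calculation can be written as a product of the coefficient vector and the apartment's values. The sketch below uses the rounded coefficients from the summary output; the object names are chosen here for illustration.

```r
# Rounded coefficients from summary(fit3)
b <- c(Intercept = 2329.724, Age = -5.821, Distance = -20.279,
       ParkingFYes = 167.531, BalconyFYes = -15.207)

# Apartment ID 2: intercept term, Age = 18, Distance = 1, parking = Yes, balcony = No
x <- c(1, 18, 1, 1, 0)

fitted_id2 <- sum(b * x)          # fitted price per square meter
resid_id2  <- 2800 - fitted_id2   # actual price minus fitted price

round(fitted_id2, 3)
round(resid_id2, 3)
```

With the fitted model available, `predict(fit3, newdata = ...)` would give the same fitted value without rounding the coefficients first.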