qqplot <- resid(model)
qqnorm(qqplot)
qqline(qqplot, col=2)
## (a)(ii) Studentized residuals vs Index
s <- rstudent(model)
plot(s, xlab = "index", ylab = "Studentized residuals", main = "Studentized residuals vs Index")
abline(h = c(-3,3), col = "blue", lty = 2)
yhat<-fitted(model)
plot(qqplot ~ yhat, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted values")
leverage <- hatvalues(model)
plot(leverage, xlab = 'index', ylab = 'leverage', main = "Leverage vs. Index")
# Find the index number of the observation with the largest leverage
max_leverage_index <- which.max(leverage)
print(max_leverage_index)
## 201
## 201
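# Studentized residual of the highest-leverage observation (no hard-coded index)
max_leverage_studentized_res <- s[max_leverage_index]
print(max_leverage_studentized_res)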
## 201
## -0.5514178
max_studentized_res <- which.max(s)
print(max_studentized_res)
## 55
## 55
The observation with the largest leverage is index 78 in the dataset, which is row 201 in the Excel file. It has the largest leverage because of its unusual combination of size (132) and number of rooms (2); units of this size normally have more than 2 rooms. No, this observation does not have a large studentized residual: its value is -0.5514178, well inside the ±3 reference lines.
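As a rough cross-check of the leverage reading above, the sketch below flags high-leverage points with the common 2p/n rule of thumb and then looks up the studentized residual of the top point. It runs on R's built-in `mtcars` data as a stand-in, since the Munich rent file is not reproduced here; `fit`, `lev`, `cutoff`, `top`, and `high` are illustrative names, and the cutoff is a convention rather than a formal test.

```r
# Illustrative sketch on built-in data (mtcars), not the Munich rent data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(fit)        # leverage of each observation
p   <- length(coef(fit))     # number of estimated coefficients, incl. intercept
n   <- nobs(fit)             # number of observations used in the fit
cutoff <- 2 * p / n          # rule-of-thumb threshold (an assumption, not a test)

top  <- which.max(lev)       # named integer: name = row label, value = position
high <- lev[lev > cutoff]    # all observations above the threshold

print(names(top))            # row label of the highest-leverage point
print(unname(top))           # its position in the fitted data
print(rstudent(fit)[top])    # studentized residual of that point
print(high)
```

A point can have high leverage and still a small studentized residual, as is the case for the observation discussed above: leverage depends only on the predictors, while the residual also depends on the response.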
# Find the indices of the two largest studentized residuals
top_s_indices <- order(s, decreasing = TRUE)[1:2]
# Print the indices of the two largest studentized residuals
print(top_s_indices)
## [1] 55 96
The index numbers of the two outliers are 365 and 519, respectively. These observations have rentpersqm of 16.93 and 16.8099, respectively, for only 1 room, which makes them stand out: rentpersqm for 1-room units typically ranges from about 10 to 12.
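When matching console output to rows in the spreadsheet, it helps to keep positions and row labels apart: `order()` returns positions in the fitted data, while the names of the `rstudent()` vector carry the data frame's row labels. The sketch below shows both on R's built-in `mtcars` data as a stand-in; `s_demo`, `pos`, `labels`, and `flagged` are illustrative names, not part of the assignment code.

```r
# Illustrative sketch on built-in data (mtcars), not the Munich rent data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s_demo <- rstudent(fit)

# Positions (1..n) of the two largest studentized residuals; order() drops names.
pos <- order(s_demo, decreasing = TRUE)[1:2]

# Row labels of the same observations, as they appear in the data frame.
labels <- names(s_demo)[pos]

# Observations beyond the +/-3 reference lines, if any (may be empty).
flagged <- s_demo[abs(s_demo) > 3]

print(pos)
print(labels)
print(flagged)
```

Note that sorting with `decreasing = TRUE` picks the largest positive residuals only; ordering by `abs(s_demo)` instead would also catch large negative ones.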
library(tidyverse)
data2 <- read.csv("C:/Users/Rishabh/Desktop/STAT 371/Assignments/Assignment 4/munichrent_student_outliers_removed.csv")
model2 <- lm(sqrt(rentpersqm) ~ age + bestneighborhood_ + centralheating_ + extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_
+ size + tiledbath_ + warmwater_, data = data2)
summary(model2)
##
## Call:
## lm(formula = sqrt(rentpersqm) ~ age + bestneighborhood_ + centralheating_ +
## extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_ +
## size + tiledbath_ + warmwater_, data = data2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.90448 -0.23010  0.01551  0.24277  0.74651
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)           2.378828   0.157517  15.102  < 2e-16 ***
## age                  -0.001757   0.001087  -1.616   0.1079
## bestneighborhood_yes  0.304588   0.154416   1.973   0.0500 .
## centralheating_yes    0.252573   0.103813   2.433   0.0159 *
## extrabath_yes         0.109340   0.094123   1.162   0.2469
## goodneighborhood_yes  0.115770   0.052088   2.223   0.0274 *
## numrooms             -0.096824   0.049768  -1.946   0.0532 .
## premiumkitchen_yes    0.240330   0.108054   2.224   0.0273 *
## size                 -0.001375   0.002167  -0.635   0.5264
## tiledbath_yes         0.095371   0.060674   1.572   0.1177
## warmwater_yes         0.562139   0.128804   4.364 2.11e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3354 on 187 degrees of freedom
## Multiple R-squared: 0.4191, Adjusted R-squared: 0.3881
## F-statistic: 13.49 on 10 and 187 DF, p-value: < 2.2e-16
qqplot2 <- resid(model2)
qqnorm(qqplot2)
qqline(qqplot2, col=2)
## (ii) Studentized residuals vs Index
s2 <- rstudent(model2)
plot(s2, xlab = "index", ylab = "Studentized residuals", main = "Studentized residuals vs Index")
abline(h = c(-3,3), col = "blue", lty = 2)
yhat2<-fitted(model2)
plot(qqplot2 ~ yhat2, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted values")
leverage2 <- hatvalues(model2)
plot(leverage2, xlab = 'index', ylab = 'leverage', main = "Leverage vs. Index")
The coefficient for premium kitchen is 1.3865 in the model with untransformed rentpersqm as the response. Holding the other covariates fixed, a unit with a premium kitchen is estimated to rent for about $1.39 more per sqm than a unit without one. The p-value for premium kitchen is 0.022859, which is significant at the 5% level: an estimate this large would be unlikely if premium kitchens had no association with rent.
library(sandwich)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
robust_model1 <- lm(rentpersqm ~ age + bestneighborhood_ + centralheating_ + extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_
+ size + tiledbath_ + warmwater_, data = data2)
# Coefficient tests using HC1 heteroskedasticity-robust standard errors
coeftest(robust_model1, vcov = vcovHC(robust_model1, type = "HC1"))
##
## t test of coefficients:
##
##                        Estimate Std. Error t value  Pr(>|t|)
## (Intercept)           6.3346062  0.8340306  7.5952 1.426e-12 ***
## age                  -0.0100973  0.0057417 -1.7586 0.0802838 .
## bestneighborhood_yes  1.7680408  0.5344408  3.3082 0.0011262 **
## centralheating_yes    1.2033627  0.6897347  1.7447 0.0826844 .
## extrabath_yes         0.5970832  0.4221513  1.4144 0.1589126
## goodneighborhood_yes  0.7152106  0.3120068  2.2923 0.0230024 *
## numrooms             -0.5396445  0.2634013 -2.0488 0.0418832 *
## premiumkitchen_yes    1.3865267  0.5704376  2.4306 0.0160162 *
## size                 -0.0095748  0.0122565 -0.7812 0.4356704
## tiledbath_yes         0.6401543  0.3602275  1.7771 0.0771801 .
## warmwater_yes         2.6631037  0.7749915  3.4363 0.0007266 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient for premium kitchen is 1.3865. Holding the other covariates fixed, a unit with a premium kitchen is estimated to rent for about $1.3865 more per sqm than a unit without one. With HC1 robust standard errors, the p-value for premium kitchen is 0.0160162, which is significant at the 5% level, so the observed result is unlikely to have occurred by chance. Note that the coefficient is the same as in the non-robust fit: robust standard errors change only the standard errors, t values, and p-values, not the point estimates.
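To illustrate why the point estimate is unchanged, the sketch below computes HC1 standard errors by hand in base R and places them next to the classical ones. It uses R's built-in `mtcars` data as a stand-in for the Munich rent data, and all object names (`fit`, `bread`, `meat`, `vcov_hc1`, `comparison`) are illustrative. The result should agree with `vcovHC(fit, type = "HC1")` from the sandwich package, though that comparison is not run here.

```r
# Illustrative sketch on built-in data (mtcars), not the Munich rent data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)   # design matrix, including the intercept column
e <- resid(fit)          # ordinary residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))    # (X'X)^{-1}
meat  <- crossprod(X * e)       # X' diag(e^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread   # HC1 small-sample scaling

comparison <- cbind(
  estimate   = coef(fit),                 # same under both approaches
  se_classic = sqrt(diag(vcov(fit))),     # classical OLS standard errors
  se_hc1     = sqrt(diag(vcov_hc1))       # heteroskedasticity-robust (HC1)
)
print(round(comparison, 4))
```

Only the variance estimate is swapped out; the least-squares coefficients come from the same `lm()` fit in both columns, which is why the estimate column is shared.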