# Normal Q-Q plot of the residuals
# (named `res` so the vector does not share a name with the base function qqplot())
res <- resid(model)
qqnorm(res)
qqline(res, col = 2)
## (a)(ii) Studentized residuals vs Index
s <- rstudent(model)
plot(s, xlab = "index", ylab = "Studentized residuals", main = "Studentized residuals vs. Index")
abline(h = c(-3, 3), col = "blue", lty = 2)
# Residuals vs. fitted values
yhat <- fitted(model)
plot(res ~ yhat, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted values")
# Leverage vs. index
leverage <- hatvalues(model)
plot(leverage, xlab = "index", ylab = "leverage", main = "Leverage vs. Index")
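A leverage plot is easier to read with a reference line. The sketch below is an illustration only: it uses the built-in `mtcars` data rather than the rent data, and the cutoff of twice the average leverage (2p/n) is a common rule of thumb assumed here, not something the assignment specifies. The names `toy_fit`, `lev`, and `cutoff` are hypothetical.

```r
# Illustration on the built-in mtcars data, not the rent data used above.
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(toy_fit)      # leverage of each observation
p <- length(coef(toy_fit))     # number of estimated coefficients (incl. intercept)
n <- nobs(toy_fit)             # number of observations used in the fit

# Rule-of-thumb threshold (an assumption): twice the average leverage p/n
cutoff <- 2 * p / n
high_leverage <- which(lev > cutoff)
print(high_leverage)

plot(lev, xlab = "index", ylab = "leverage", main = "Leverage vs. Index")
abline(h = cutoff, col = "red", lty = 2)
```

Points above the dashed line are candidates for a closer look; whether they are actually influential also depends on their residuals.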
# Find the index number of the observation with the largest leverage
max_leverage_index <- which.max(leverage)
print(max_leverage_index)
## 201
## 201
# Studentized residual of the observation with the largest leverage (row 201)
index_78_studentized_res <- s[max_leverage_index]
print(index_78_studentized_res)
## 201
## -0.5514178
# Row position of the largest (most positive) studentized residual
max_studentized_res <- which.max(s)
print(max_studentized_res)
## 55
## 55
The observation with the largest leverage is in row 201 of the data, which has index 78 in the dataset. No, this observation does not have a large studentized residual: its value is -0.5514, well inside the ±3 band.
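The row position returned by `which.max()` and the value stored in the dataset's `index` column are two different things, which is easy to mix up. The sketch below shows the lookup on a small simulated data frame; `toy`, `toy_fit`, and `row_pos` are hypothetical names, and the simulated values have nothing to do with the rent data.

```r
# Simulated data with an `index` column that differs from the row position
set.seed(1)
toy <- data.frame(index = sample(1000, 30), x = rnorm(30))
toy$y <- 2 + 3 * toy$x + rnorm(30)
toy_fit <- lm(y ~ x, data = toy)

# Row position of the observation with the largest leverage
row_pos <- which.max(hatvalues(toy_fit))

# Value of the dataset's own index column for that row
toy$index[row_pos]

# TRUE only if that observation's studentized residual lies outside the +/-3 band
abs(rstudent(toy_fit)[row_pos]) > 3
```

Comparing the absolute value against 3 also catches large negative residuals, which a plain maximum would miss.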
# Find the row positions of the two largest studentized residuals
top_s_indices <- order(s, decreasing = TRUE)[1:2]
# Print the row positions of the two largest studentized residuals
print(top_s_indices)
## [1] 55 96
The index numbers of the two outliers are 365 and 519, respectively (rows 55 and 96 of the data). These observations have rentpersqm of 16.93 and 16.8099, respectively, for only 1 room. This makes them stand out, because rentpersqm for 1-room units typically ranges from about 10 to 12.
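The comparison against other 1-room units can be checked directly in code. The sketch below does this on simulated data with one deliberately planted expensive 1-room unit; the data frame `toy`, the planted value, and ranking by absolute studentized residual are all assumptions made for illustration, not the assignment's data or required method.

```r
# Simulated rents: 20 units each with 1, 2, and 3 rooms
set.seed(2)
toy <- data.frame(numrooms = rep(1:3, each = 20))
toy$rentpersqm <- 12 - 1.5 * toy$numrooms + rnorm(60)
toy$rentpersqm[5] <- 17   # plant one unusually expensive 1-room unit

toy_fit <- lm(rentpersqm ~ numrooms, data = toy)
s_toy <- rstudent(toy_fit)

# Rank by absolute value so large negative residuals are also considered
top2 <- order(abs(s_toy), decreasing = TRUE)[1:2]
toy[top2, ]

# Typical rent per sqm among 1-room units, for comparison with the flagged rows
summary(toy$rentpersqm[toy$numrooms == 1])
```

Printing the flagged rows next to a summary of comparable units makes the "stands out" argument concrete.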
library(tidyverse)
data2 <- read.csv("C:/Users/Rishabh/Desktop/STAT 371/Assignments/Assignment 4/munichrent_student_outliers_removed.csv")
head(data2)
##   age bestneighborhood_ centralheating_ extrabath_ goodneighborhood_ index
## 1  92                no             yes         no               yes    26
## 2  44                no             yes         no                no   697
## 3  43                no             yes         no                no  1360
## 4  44                no             yes         no                no  1932
## 5  53                no             yes         no                no  1580
## 6  44                no             yes         no                no   203
##   numrooms premiumkitchen_ rentpersqm size tiledbath_ warmwater_
## 1        2              no       6.72   65        yes        yes
## 2        3              no       7.71   71        yes        yes
## 3        4              no       6.26   74        yes        yes
## 4        4              no       7.82   98        yes        yes
## 5        3              no      10.23   85        yes        yes
## 6        5              no       7.96  140        yes        yes
model2 <- lm(rentpersqm ~ age + bestneighborhood_ + centralheating_ + extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_
+ size + tiledbath_ + warmwater_, data = data2)
summary(model2)
##
## Call:
## lm(formula = rentpersqm ~ age + bestneighborhood_ + centralheating_ +
##     extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_ +
##     size + tiledbath_ + warmwater_, data = data2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.2713 -1.3327 -0.0013  1.3334  4.1393
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)           6.334606   0.880804   7.192 1.49e-11 ***
## age                  -0.010097   0.006081  -1.661 0.098483 .
## bestneighborhood_yes  1.768041   0.863469   2.048 0.041997 *
## centralheating_yes    1.203363   0.580506   2.073 0.039548 *
## extrabath_yes         0.597083   0.526321   1.134 0.258059
## goodneighborhood_yes  0.715211   0.291269   2.456 0.014983 *
## numrooms             -0.539645   0.278294  -1.939 0.053992 .
## premiumkitchen_yes    1.386527   0.604220   2.295 0.022859 *
## size                 -0.009575   0.012116  -0.790 0.430365
## tiledbath_yes         0.640154   0.339278   1.887 0.060735 .
## warmwater_yes         2.663104   0.720249   3.697 0.000286 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.876 on 187 degrees of freedom
## Multiple R-squared: 0.3943, Adjusted R-squared: 0.3619
## F-statistic: 12.17 on 10 and 187 DF, p-value: 4.026e-16
# Normal Q-Q plot of the residuals
res2 <- resid(model2)
qqnorm(res2)
qqline(res2, col = 2)
## (ii) Studentized residuals vs Index
s2 <- rstudent(model2)
plot(s2, xlab = "index", ylab = "Studentized residuals", main = "Studentized residuals vs. Index")
abline(h = c(-3, 3), col = "blue", lty = 2)
# Residuals vs. fitted values
yhat2 <- fitted(model2)
plot(res2 ~ yhat2, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted values")
# Leverage vs. index
leverage2 <- hatvalues(model2)
plot(leverage2, xlab = "index", ylab = "leverage", main = "Leverage vs. Index")
The coefficient for premium kitchen is 1.3865. This means that, holding the other covariates fixed, a unit with a premium kitchen is estimated to rent for about $1.39 more per sqm than a unit without one. The p-value for premium kitchen is 0.022859, which is significant at the 5% level and indicates that the observed result is unlikely to have occurred by chance alone.
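A confidence interval is a useful companion to the point estimate and p-value when interpreting a dummy coefficient like this one. The sketch below shows how to pull both out of a fitted `lm` object; it runs on simulated data (the data frame `toy` and its coefficients are made up for illustration), so the numbers it prints are not those of `model2`.

```r
# Simulated data with a yes/no factor, mimicking the structure of the rent data
set.seed(3)
toy <- data.frame(
  premiumkitchen_ = factor(rep(c("no", "yes"), times = c(80, 20))),
  size = runif(100, 30, 120)
)
toy$rentpersqm <- 8 + 1.4 * (toy$premiumkitchen_ == "yes") -
  0.01 * toy$size + rnorm(100, sd = 1.9)

toy_fit <- lm(rentpersqm ~ premiumkitchen_ + size, data = toy)

# Estimate, standard error, t value and p-value for the dummy coefficient
coef(summary(toy_fit))["premiumkitchen_yes", ]

# 95% confidence interval for the same coefficient
ci <- confint(toy_fit, "premiumkitchen_yes", level = 0.95)
print(ci)
```

If the interval excludes zero, that matches a two-sided p-value below 0.05 for the same coefficient.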
# rlm() is provided by the MASS package
library(MASS)
robust_model <- rlm(rentpersqm ~ age + bestneighborhood_ + centralheating_ + extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_
+ size + tiledbath_ + warmwater_, data = data2)
summary(robust_model)
##
## Call: rlm(formula = rentpersqm ~ age + bestneighborhood_ + centralheating_ +
## extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_ +
## size + tiledbath_ + warmwater_, data = data2)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -4.47075 -1.30554 -0.09328  1.33041  4.54853
##
## Coefficients:
##                      Value   Std. Error t value
## (Intercept)           6.0469  0.9119     6.6311
## age                  -0.0113  0.0063    -1.7897
## bestneighborhood_yes  1.8055  0.8940     2.0196
## centralheating_yes    1.4710  0.6010     2.4476
## extrabath_yes         0.5062  0.5449     0.9289
## goodneighborhood_yes  0.9477  0.3016     3.1429
## numrooms             -0.5344  0.2881    -1.8547
## premiumkitchen_yes    1.4578  0.6255     2.3304
## size                 -0.0080  0.0125    -0.6415
## tiledbath_yes         0.6740  0.3513     1.9189
## warmwater_yes         2.5622  0.7457     3.4361
##
## Residual standard error: 1.947 on 187 degrees of freedom
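The coefficient for premium kitchen is 1.4578. This means that, holding the other covariates fixed, a unit with a premium kitchen is estimated to rent for about $1.46 more per sqm than a unit without one. The rlm summary does not report a p-value; 2.3304 is the t value. Its absolute value exceeds the two-sided 5% critical value of about 1.97 for 187 degrees of freedom, which corresponds to an approximate p-value near 0.02. The effect is therefore significant at the 5% level and unlikely to have occurred by chance, in agreement with the least-squares fit.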
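Because `summary()` on an `rlm` fit prints only t values, an approximate p-value has to be computed by hand. The sketch below does this for the premium kitchen coefficient using the t value and residual degrees of freedom from the output above. Treating the robust t value as following a t distribution with 187 degrees of freedom is an approximation assumed here, not a result the rlm output provides.

```r
# Values copied from the rlm summary above
t_value <- 2.3304    # t value for premiumkitchen_yes
df_resid <- 187      # residual degrees of freedom

# Approximate two-sided p-value (assumes a t reference distribution)
p_approx <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)
print(p_approx)

# Two-sided 5% critical value for comparison with |t|
crit <- qt(0.975, df = df_resid)
print(crit)

# TRUE when the coefficient is significant at the 5% level
abs(t_value) > crit
```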