Ques 1:

(a)(i) QQ Plot

qqplot <- resid(model)

qqnorm(qqplot)

qqline(qqplot, col=2)

(a)(ii) Studentized residuals vs Index

s <- rstudent(model)

plot(s, xlab = "index", ylab = "Studentized residuals", main = "Studentized residuals vs Index")

abline(h = c(-3,3), col = "blue", lty = 2)

(a)(iii) Residual vs. Fitted values

yhat<-fitted(model)

plot(qqplot ~ yhat, xlab ="Fitted values", ylab="Residuals", main = "Residual vs. Fitted values")

(a)(iv) Leverage vs. Index

leverage <- hatvalues(model)
plot(leverage, xlab = 'index', ylab = 'leverage', main = "Leverage vs. Index")

Conclusions for each of the plots above:

(b)

# Find the index number of the observation with the largest leverage
max_leverage_index <- which.max(leverage)
print(max_leverage_index)
## 201 
## 201
index_78_studentized_res <- s[201]
print(index_78_studentized_res)
##        201 
## -0.5514178
max_studentized_res <- which.max(s)
print(max_studentized_res)
## 55 
## 55

The observation with the largest leverage has index 78 in the dataset, which is row 201 in the Excel file. It has the largest leverage because of its unusual combination of size (132) and number of rooms (2): units of this size normally have more than 2 rooms. No, this observation does not have a large studentized residual: its value is -0.5514, well inside the range from -3 to 3.
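Leverage depends only on the predictor values, which is why an unusual size and room-count combination can produce it regardless of the rent. The sketch below illustrates this on R's built-in mtcars data rather than the Munich rent data; the model formula and the "twice the average leverage" cutoff are assumptions chosen for illustration, not something the assignment specifies.

```r
# Illustration on the built-in mtcars data (not the Munich rent data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)       # leverage of each observation
p <- length(coef(fit))    # number of estimated coefficients, intercept included
n <- nobs(fit)            # number of observations

# Leverages sum to p, so the average leverage is p / n.
avg_leverage <- p / n

# Rule of thumb (an assumption here): flag leverage above twice the average.
high_leverage <- which(h > 2 * avg_leverage)

# Observation with the largest leverage and the predictor values behind it
top <- which.max(h)
print(mtcars[top, c("wt", "hp")])
print(high_leverage)
```

Printing the predictor values of the top observation is the same check made in the answer above: it shows which covariates sit far from the bulk of the data.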

(c)

# Find the indices of the two largest studentized residuals
top_s_indices <- order(s, decreasing = TRUE)[1:2]

# Print the indices of the two largest studentized residuals
print(top_s_indices)
## [1] 55 96

The index numbers of the two outliers are 365 and 519, respectively. These observations have rentpersqm of 16.93 and 16.8099, respectively, for only 1 room, which makes them stand out because the rentpersqm of 1-room units is typically around 10 to 12.
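Note that order() returns positions within the residual vector, which need not match the index labels stored with the data, and that sorting the raw residuals only finds large positive values. A minimal sketch of a variant that ranks by absolute value and reports both position and row label, again on the built-in mtcars data with an illustrative model (both are assumptions, not the assignment's data or model):

```r
# Illustration on the built-in mtcars data (not the Munich rent data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- rstudent(fit)

# Rank by absolute value so large negative residuals are not missed.
top2 <- order(abs(s), decreasing = TRUE)[1:2]

# order() gives positions; names(s) carries the original row labels.
outliers <- data.frame(
  position    = top2,
  row_label   = names(s)[top2],
  studentized = unname(s[top2])
)
print(outliers)

# Observations outside the -3 to 3 band drawn on the index plot
beyond_band <- which(abs(s) > 3)
print(beyond_band)
```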

Dataset with largest leverage and two outliers removed:

(d)(i)

library(tidyverse)
data2 <- read.csv("C:/Users/Rishabh/Desktop/STAT 371/Assignments/Assignment 4/munichrent_student_outliers_removed.csv")


model2 <- lm(sqrt(rentpersqm) ~ age + bestneighborhood_  + centralheating_ + extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_
 + size + tiledbath_ + warmwater_, data = data2) 

summary(model2)
## 
## Call:
## lm(formula = sqrt(rentpersqm) ~ age + bestneighborhood_ + centralheating_ + 
##     extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_ + 
##     size + tiledbath_ + warmwater_, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90448 -0.23010  0.01551  0.24277  0.74651 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.378828   0.157517  15.102  < 2e-16 ***
## age                  -0.001757   0.001087  -1.616   0.1079    
## bestneighborhood_yes  0.304588   0.154416   1.973   0.0500 .  
## centralheating_yes    0.252573   0.103813   2.433   0.0159 *  
## extrabath_yes         0.109340   0.094123   1.162   0.2469    
## goodneighborhood_yes  0.115770   0.052088   2.223   0.0274 *  
## numrooms             -0.096824   0.049768  -1.946   0.0532 .  
## premiumkitchen_yes    0.240330   0.108054   2.224   0.0273 *  
## size                 -0.001375   0.002167  -0.635   0.5264    
## tiledbath_yes         0.095371   0.060674   1.572   0.1177    
## warmwater_yes         0.562139   0.128804   4.364 2.11e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3354 on 187 degrees of freedom
## Multiple R-squared:  0.4191, Adjusted R-squared:  0.3881 
## F-statistic: 13.49 on 10 and 187 DF,  p-value: < 2.2e-16

(i) QQ Plot

qqplot2 <- resid(model2)

qqnorm(qqplot2)

qqline(qqplot2, col=2)

(ii) Studentized residuals vs Index

s2 <- rstudent(model2)

plot(s2, xlab = "index", ylab = "Studentized residuals", main = "Studentized residuals vs Index")

abline(h = c(-3,3), col = "blue", lty = 2)

(iii) Residual vs. Fitted values

yhat2<-fitted(model2)

plot(qqplot2 ~ yhat2, xlab ="Fitted values", ylab="Residuals", main = "Residual vs. Fitted values")

(iv) Leverage vs. Index

leverage2 <- hatvalues(model2)
plot(leverage2, xlab = 'index', ylab = 'leverage', main = "Leverage vs. Index")

(d)(ii)

The coefficient for premium kitchen is 1.3865. This means that, with the other covariates held fixed, a unit with a premium kitchen has an estimated rent about $1.39 per sqm higher than a unit without one. Also, the p-value for premium kitchen is 0.022859, which is significant at the 5% level and indicates that the observed result is unlikely to have occurred by chance alone.
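For a coefficient read from the sqrt(rentpersqm) fit in (d)(i), the effect on rent itself is not constant: it depends on the baseline rent, because the estimate is added on the square-root scale and then squared. A rough sketch, using the premiumkitchen_yes estimate from summary(model2) and hypothetical baseline rents (the baseline values are assumptions for illustration, and fitted values are treated as simple point predictions):

```r
b <- 0.240330                   # premiumkitchen_yes estimate from summary(model2)
baseline_rent <- c(8, 10, 12)   # hypothetical rents per sqm without a premium kitchen

# On the square-root scale the kitchen adds b; squaring returns to the rent scale.
rent_with_kitchen <- (sqrt(baseline_rent) + b)^2
increase <- rent_with_kitchen - baseline_rent

print(data.frame(baseline_rent, rent_with_kitchen, increase))
```

Algebraically the increase is 2 * b * sqrt(baseline_rent) + b^2, so it grows with the baseline rent.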

(e)

library(sandwich)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
robust_model1 <- lm(rentpersqm ~ age + bestneighborhood_  + centralheating_ + extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_
 + size + tiledbath_ + warmwater_, data = data2) 
coeftest(robust_model1, vcov = vcovHC(robust_model1, "HC1"))
## 
## t test of coefficients:
## 
##                        Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)           6.3346062  0.8340306  7.5952 1.426e-12 ***
## age                  -0.0100973  0.0057417 -1.7586 0.0802838 .  
## bestneighborhood_yes  1.7680408  0.5344408  3.3082 0.0011262 ** 
## centralheating_yes    1.2033627  0.6897347  1.7447 0.0826844 .  
## extrabath_yes         0.5970832  0.4221513  1.4144 0.1589126    
## goodneighborhood_yes  0.7152106  0.3120068  2.2923 0.0230024 *  
## numrooms             -0.5396445  0.2634013 -2.0488 0.0418832 *  
## premiumkitchen_yes    1.3865267  0.5704376  2.4306 0.0160162 *  
## size                 -0.0095748  0.0122565 -0.7812 0.4356704    
## tiledbath_yes         0.6401543  0.3602275  1.7771 0.0771801 .  
## warmwater_yes         2.6631037  0.7749915  3.4363 0.0007266 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(e)(ii)

The coefficient for premium kitchen is 1.3865. This means that, with the other covariates held fixed, a unit with a premium kitchen has an estimated rent $1.3865 per sqm higher than a unit without one. Also, the p-value for premium kitchen is 0.0160162, which is significant at the 5% level and indicates that the observed result is not likely to have occurred by chance. Note that the coefficient estimate is the same as in the non-robust fit; using robust standard errors changes only the standard error, t value, and p-value.
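The reason the estimate does not change is that a heteroskedasticity-robust analysis replaces only the covariance matrix of the coefficients. A minimal base-R sketch of the HC1 sandwich estimator, on the built-in mtcars data with an illustrative model (both are assumptions, not the assignment's data); the result is expected to agree with vcovHC(fit, "HC1") from the sandwich package, though that comparison is not run here:

```r
# Illustration on the built-in mtcars data (not the Munich rent data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)   # design matrix, intercept included
e <- resid(fit)          # ordinary residuals
n <- nrow(X)
p <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X
vcov_hc1 <- n / (n - p) * bread %*% meat %*% bread

classical_se <- sqrt(diag(vcov(fit)))
robust_se    <- sqrt(diag(vcov_hc1))

# The estimate column is the plain OLS fit; only the standard errors differ.
print(cbind(estimate = coef(fit), classical_se, robust_se))
```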