Ques 1:

(a)(i) QQ Plot

qqplot <- resid(model)

qqnorm(qqplot)

qqline(qqplot, col=2)

(a)(ii) Studentized residuals vs. Index

s <- rstudent(model)

plot(s, xlab = "Index", ylab = "Studentized residuals", main = "Studentized residuals vs. Index")

abline(h = c(-3,3), col = "blue", lty = 2)

(a)(iii) Residual vs. Fitted values

yhat<-fitted(model)

plot(qqplot ~ yhat, xlab ="Fitted values", ylab="Residuals", main = "Residual vs. Fitted values")

(a)(iv) Leverage vs. Index

leverage <- hatvalues(model)
plot(leverage, xlab = 'index', ylab = 'leverage', main = "Leverage vs. Index")
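
A common rule of thumb flags leverages above twice the average leverage, 2(p + 1)/n, where p + 1 is the number of estimated coefficients and n is the number of observations. The sketch below illustrates that cutoff on the built-in mtcars data, which is only a stand-in because the code that builds model is not shown here; fit, h, threshold, and high_leverage are hypothetical names.

```r
# Stand-in model on built-in data (assumption: this is NOT the assignment's model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Leverage values: the diagonal of the hat matrix
h <- hatvalues(fit)

# Rule-of-thumb cutoff: twice the average leverage, 2 * (p + 1) / n
threshold <- 2 * length(coef(fit)) / nobs(fit)

# Observations whose leverage exceeds the cutoff
high_leverage <- which(h > threshold)
print(high_leverage)
```

On a leverage plot, the same cutoff could be drawn with abline(h = threshold, lty = 2).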

Conclusions for each of the plots above:

(b)

# Find the index number of the observation with the largest leverage
max_leverage_index <- which.max(leverage)
print(max_leverage_index)

## 201 
## 201

max_leverage_studentized_res <- s[max_leverage_index]
print(max_leverage_studentized_res)

##        201 
## -0.5514178

max_studentized_res <- which.max(s)
print(max_studentized_res)

## 55 
## 55

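The observation with the largest leverage is row 201 of the data, which has index 78 in the dataset. No, this observation does not have a large studentized residual: its value is -0.5514, well inside the ±3 bands drawn on the plot in (a)(ii), and the largest studentized residual belongs to row 55 instead.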

(c)

# Find the indices of the two largest studentized residuals
top_s_indices <- order(s, decreasing = TRUE)[1:2]

# Print the indices of the two largest studentized residuals
print(top_s_indices)

## [1] 55 96

The two outliers are rows 55 and 96 of the data, with index numbers 365 and 519 in the dataset, respectively. These units have rentpersqm of 16.93 and 16.8099 respectively despite having only 1 room, which makes them stand out: the rentpersqm of other 1-room units is around 10 to 12.
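
The ranking in (c) sorts the signed residuals, so a large negative studentized residual would not be picked up. A minimal sketch of ranking by absolute value instead is shown below, using the built-in mtcars data as a stand-in because the assignment's data is not loaded in this section; fit, r, and top2 are hypothetical names.

```r
# Stand-in model on built-in data (assumption: this is NOT the assignment's model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Externally studentized residuals
r <- rstudent(fit)

# Positions of the two residuals that are largest in absolute value
top2 <- order(abs(r), decreasing = TRUE)[1:2]
print(r[top2])
```

If the two approaches return the same rows, the largest residuals are all positive and the signed ranking loses nothing.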

Dataset with largest leverage and two outliers removed:

(d)(i)

library(tidyverse)
data2 <- read.csv("C:/Users/Rishabh/Desktop/STAT 371/Assignments/Assignment 4/munichrent_student_outliers_removed.csv")

head(data2)

##   age bestneighborhood_ centralheating_ extrabath_ goodneighborhood_ index
## 1  92                no             yes         no               yes    26
## 2  44                no             yes         no                no   697
## 3  43                no             yes         no                no  1360
## 4  44                no             yes         no                no  1932
## 5  53                no             yes         no                no  1580
## 6  44                no             yes         no                no   203
##   numrooms premiumkitchen_ rentpersqm size tiledbath_ warmwater_
## 1        2              no       6.72   65        yes        yes
## 2        3              no       7.71   71        yes        yes
## 3        4              no       6.26   74        yes        yes
## 4        4              no       7.82   98        yes        yes
## 5        3              no      10.23   85        yes        yes
## 6        5              no       7.96  140        yes        yes

model2 <- lm(rentpersqm ~ age + bestneighborhood_ + centralheating_ + extrabath_ +
               goodneighborhood_ + numrooms + premiumkitchen_ + size + tiledbath_ +
               warmwater_, data = data2)

summary(model2)

## 
## Call:
## lm(formula = rentpersqm ~ age + bestneighborhood_ + centralheating_ + 
##     extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_ + 
##     size + tiledbath_ + warmwater_, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2713 -1.3327 -0.0013  1.3334  4.1393 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.334606   0.880804   7.192 1.49e-11 ***
## age                  -0.010097   0.006081  -1.661 0.098483 .  
## bestneighborhood_yes  1.768041   0.863469   2.048 0.041997 *  
## centralheating_yes    1.203363   0.580506   2.073 0.039548 *  
## extrabath_yes         0.597083   0.526321   1.134 0.258059    
## goodneighborhood_yes  0.715211   0.291269   2.456 0.014983 *  
## numrooms             -0.539645   0.278294  -1.939 0.053992 .  
## premiumkitchen_yes    1.386527   0.604220   2.295 0.022859 *  
## size                 -0.009575   0.012116  -0.790 0.430365    
## tiledbath_yes         0.640154   0.339278   1.887 0.060735 .  
## warmwater_yes         2.663104   0.720249   3.697 0.000286 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.876 on 187 degrees of freedom
## Multiple R-squared:  0.3943, Adjusted R-squared:  0.3619 
## F-statistic: 12.17 on 10 and 187 DF,  p-value: 4.026e-16

(i) QQ Plot

qqplot2 <- resid(model2)

qqnorm(qqplot2)

qqline(qqplot2, col=2)

(ii) Studentized residuals vs. Index

s2 <- rstudent(model2)

plot(s2, xlab = "Index", ylab = "Studentized residuals", main = "Studentized residuals vs. Index")

abline(h = c(-3,3), col = "blue", lty = 2)

(iii) Residual vs. Fitted values

yhat2<-fitted(model2)

plot(qqplot2 ~ yhat2, xlab ="Fitted values", ylab="Residuals", main = "Residual vs. Fitted values")

(iv) Leverage vs. Index

leverage2 <- hatvalues(model2)
plot(leverage2, xlab = 'index', ylab = 'leverage', main = "Leverage vs. Index")

(d)(ii)

The coefficient for premium kitchen is 1.3865. This means that, holding the other predictors fixed, a unit with a premium kitchen is estimated to rent for about $1.39 per sqm more than a unit without one. The p-value for premium kitchen is 0.022859, which is below 0.05, so the effect is significant at the 5% level: a difference this large would be unlikely to arise by chance alone if a premium kitchen had no effect on rent.
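
A 95% confidence interval gives a sense of how precisely the premium kitchen effect is estimated. The sketch below rebuilds it from the estimate, standard error, and residual degrees of freedom printed in summary(model2); the variable names are illustrative, and confint(model2) would be the direct route when the fitted object is available.

```r
# Values copied from the premiumkitchen_yes row of summary(model2)
estimate  <- 1.386527
std_error <- 0.604220
df_resid  <- 187

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- critical value * standard error
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 3))
```

The interval excludes zero, which agrees with the p-value being below 0.05.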

(e)

library(MASS)  # provides rlm()

robust_model <- rlm(rentpersqm ~ age + bestneighborhood_ + centralheating_ + extrabath_ +
                      goodneighborhood_ + numrooms + premiumkitchen_ + size + tiledbath_ +
                      warmwater_, data = data2)

summary(robust_model)

## 
## Call: rlm(formula = rentpersqm ~ age + bestneighborhood_ + centralheating_ + 
##     extrabath_ + goodneighborhood_ + numrooms + premiumkitchen_ + 
##     size + tiledbath_ + warmwater_, data = data2)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4.47075 -1.30554 -0.09328  1.33041  4.54853 
## 
## Coefficients:
##                      Value   Std. Error t value
## (Intercept)           6.0469  0.9119     6.6311
## age                  -0.0113  0.0063    -1.7897
## bestneighborhood_yes  1.8055  0.8940     2.0196
## centralheating_yes    1.4710  0.6010     2.4476
## extrabath_yes         0.5062  0.5449     0.9289
## goodneighborhood_yes  0.9477  0.3016     3.1429
## numrooms             -0.5344  0.2881    -1.8547
## premiumkitchen_yes    1.4578  0.6255     2.3304
## size                 -0.0080  0.0125    -0.6415
## tiledbath_yes         0.6740  0.3513     1.9189
## warmwater_yes         2.5622  0.7457     3.4361
## 
## Residual standard error: 1.947 on 187 degrees of freedom

(e)(ii)

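The coefficient for premium kitchen is 1.4578. This means that, holding the other predictors fixed, a unit with a premium kitchen is estimated to rent for about $1.46 per sqm more than a unit without one. The rlm summary reports no p-value; the 2.3304 shown for premium kitchen is its t value. Because 2.3304 exceeds the approximate 5% critical value of 1.97 (t distribution with 187 degrees of freedom), the effect is still significant at the 5% level, in line with the least-squares fit in (d).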
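
Since summary() of an rlm fit lists t values but no p-values, an approximate two-sided p-value for premium kitchen can be computed from the t value 2.3304 and the 187 residual degrees of freedom printed above. This is a minimal sketch that assumes the usual t reference distribution is an adequate approximation for the robust fit; the variable names are illustrative.

```r
# Values copied from the premiumkitchen_yes row of summary(robust_model)
t_value  <- 2.3304
df_resid <- 187

# Approximate two-sided p-value (assumption: t reference distribution)
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(p_value)

# 5% two-sided critical value, for comparison with the t value
t_crit <- qt(0.975, df = df_resid)
print(t_crit)
```

A p-value below 0.05 here corresponds to the t value lying beyond the critical value.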

Assignment

Rishabh Ballkooram

2023-11-09