HW 3

NAME SURNAME

1. Import the dataset Apartments.xlsx

library(readxl)
mydata<-read_xlsx("Apartments.xlsx")
mydata<-as.data.frame(mydata)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2.What could be a possible research question given the data you analyze? (1 p)

A possible research question can be whether the availability of parking affects the price per m2 of the house price to be different (Hypo testing)

Another possible research question can be whether the price per m2 is affected linearly by the distance from city centre in km and the age of an apartment in years (Linear regression)

3. Change categorical variables into factors. (0.5 p)

mydata$Parking <- factor(mydata$Parking, 
       levels = c(0,1),
       labels = c("No","Yes"))
mydata$Balcony <- factor(mydata$Balcony, 
       levels = c(0,1), 
       labels = c("No","Yes"))

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

shapiro.test(mydata$Price)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Price
## W = 0.94017, p-value = 0.0006513

Since the the Shapiro wilk test gives us a p-value < 0.001, we reject the null hypothesis that the data of price is normally distributed.
Hence, we use the non parametric mean method instead. So we are testing whether median is 1900 eur instead.

median(mydata$Price)

## [1] 1950

wilcox.test(mydata$Price, 
            mu = 1900, 
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  mydata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

library(effectsize)
effectsize(wilcox.test(mydata$Price, mu = 1900, correct = FALSE))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.27              | [0.04, 0.48]
## 
## - Deviation from a difference of 1900.

interpret_rank_biserial(0.27, rules = "funder2019")

## [1] "medium"
## (Rules: funder2019)

Based on the sample data, we found that the median price per m2 is not equal to 1900 as the p-value is 0.029. So we reject the null hypothesis that the median is equal to 1900. And the change is medium, r = 0.27.
The median is instead 1950 and is more than 1900.

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, 
           data = mydata)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(mydata$Price,mydata$Age)

## [1] -0.230255

The regression coefficient:
It shows that as the age of the apartment increases by 1 year, the price per m2 of the apartment on average decreases by 8.975 (p= 0.034).
Coefficient of correlation
It shows that there is a linear, weak and negative correlation between Age of an apartment in years and the price per m2 of the apartment
Coefficient of determination
It shows that only 5.31% of the variability in price per m2 is explained by the variable Age of an apartment in years in this case.

6. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(mydata[c(-4,-5)], smooth=FALSE)

Based on the scatter plots, there does not seem to be a problem with multicolinearity as the 2 other explanatory variables Distance and Age do not have a show a strong relationship across.
Like there is no diagonal line with all the points being along the line.

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

mydata_2 <- mydata
fit2 <- lm(Price ~ Age + Distance, 
           data = mydata_2)

8. Chech the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Based on the VIF statistics, we have to be cautious of it being more than 5 then it suggests that there is the multicolinearity problem.
And if it is like 3 then maybe we would have to consider it as well. But since the numbers are low at 1 (which is what we want, closer to 1 the better), then multicolinearity will not be a problem. Hence, we can carry on with the other parts.

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

mydata_2$StdResid <- round(rstandard(fit2), 3)
mydata_2$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata_2$StdResid, 
     xlab = "standardised residuals", 
     ylab = "Frequency", 
     main = "Histogram of Standardised residuals")

hist(mydata_2$CooksD, 
     xlab ="Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distance")

#head(mydata[order(-mydata$StdResid),],6)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata_2 <-mydata_2 %>%
  filter(CooksD <= 0.103)
head(mydata_2[order(-mydata_2$CooksD),],7)

##    Age Distance Price Parking Balcony StdResid CooksD
## 33   2       11  2790     Yes      No    2.051  0.069
## 52   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 38  40        2  2400      No     Yes    1.091  0.038
## 56   8        2  2820     Yes      No    1.655  0.037
## 25   8       26  2300     Yes     Yes    1.571  0.034
## 55  10        1  2810      No      No    1.601  0.032

Based on the first histogram, I believe that there is no outliers to be removed as both the ends, the data are within the 3 range, so we do not have to remove any outliers.

But it is not the same case for the Cooks distance as there are 2 data point which jumps from the rest in the histogram. So, we will have to remove those points before we carry on.

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance, data = mydata_2)
mydata_2$StdFitted <- scale(fit2$fitted.values)
scatterplot(y=mydata_2$StdResid, x=mydata_2$StdFitted, 
            ylab="Standardised residuals", 
            xlab="Standardised fitted values", 
            boxplots = FALSE, 
            regLine= FALSE, 
            smooth = FALSE)

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

Based on the scatter plot, I think the data passes as they seem not to be heteroskedasticity.
The points are random and they do not seem to be opening up too much. They are opening up a bit towards the bottom so to confirm I did the Breusch Pagan test.

Based on the test, we test that null hypothesis is that the variance is constant and the H1 is that the variance is not constant.
Since the P-value is 0.053, we can accept the null hypothesis and move on since I can deduce that there is no potential for heteroskedasticity.

11. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

hist(mydata_2$StdResid, 
     xlab = "Standardised Residuals", 
     ylab = "Frequncy", 
     main = "Histogram of the Standardised Residuals")

shapiro.test(mydata_2$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_2$StdResid
## W = 0.94963, p-value = 0.002636

Based on the Shapiro test, the null Hypothesis is that the errors are normally distributed and H1 is that the errors are not normally distributed.
Since, the P-value is 0.0027, we can reject the null hypothesis and understand that the errors are not normally distributed.
But since the number of samples tested is 83 which is large, we can assume that they are indeed normally distributed due to the central limit theorem. So we do not have to worry about this assumption and we can move on.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

sqrt(summary(fit2)$r.squared)

## [1] 0.6629887

Based on the model, as the age of the apartment increases by 1 year, the price per m2 on average decreases by 7.934 given the other variables are constant (p = 0.016).

And, as the distance from the city centre increases by 1km, the price per m2 on average decreases by 20.667 given that the other variable is constant. (p <0.001).

And 43.96% of the variability in price per m2 can be explained by both the age of the apartment and the distance from the city centre for the apartment.

Based on the r value, we can deduce that there is a linear and semi strong relationship between the price per m2 and the Age of the apartment and the distance from the city centre, r = 0.663.

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data=mydata)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the Multiple R-square, it seems that fit 3 is a better model.
But to confirm I first compared the Adjusted R-square and that was higher for fit 3 compared to fit 2 (it is given below). Hence, fit 3 is a better model.
Secondly, I did the Anova test and the null hypothesis was that both the models were the same and the H1 was that the models were different. Since P-value = 0.01, we can reject H0, and believe that the models are different. Which further confirms that Fit 3 is a better model.

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

For parkingYes, given the values of other explainatory variables, the group of apartments which have parking on average has a higher price per m2 by around 196.17 euros (p-value = 0.00251) compared to the apartments without parking.

For balconyYes, given the values of other explainatory variables, the group of apartments with have balcony on average has a higher price per m2 by around 1.94 euros compared to the apartments without Balcony. But this is not statistically siginificant as p-value is high at 0.98.

The F-test at the bottom tests where the coefficient of determination is 0 or not. It tells us whether the model is a good model or a bad model. The null hypothesis is that coefficient of determination is 0 and the alternative hypothesis is that it is more than 0. Since, the p-value < 0.001, we can reject H0 and understand that the coeffficient of determination is more than 0. Hence, I think this is a good model to use.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

mydata$fitted<-fitted.values(fit3)
mydata$residuals <-residuals(fit3)
print(mydata[2,])

##   Age Distance Price Parking Balcony   fitted residuals
## 2  18        1  2800     Yes      No 2357.411  442.5889