Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydataAP <- read_excel("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/R PROGRAM - BOOTCAMP/TAKE HOME EXAM/Task 3/Apartments.xlsx")
View(mydataAP)

mydataAP <- as.data.frame(mydataAP)

head(mydataAP)

##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydataAP$Parking <- factor(mydataAP$Parking,
                                   levels = c("0","1"),
                                   labels = c("No","Yes"))

mydataAP$Balcony <- factor(mydataAP$Balcony,
                                   levels = c("0","1"),
                                   labels = c("No","Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydataAP$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydataAP$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

SPRMENI!!!: Based on the p-value (p = 0.004731), which is smaller that 0.05, we can reject the null hypothesis and say that the true mean is not equal to 1900. Based on the 95 percent confidence interval, we can stipulate that the true mean is somewhere between 1937.443 and 2100.440.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = mydataAP)

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydataAP)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanations: a) Estimate of regression coefficient (-8.975): If the age increases by 1 year, the price decreases by 8,975€ thousand per m^2. b) Coefficient of correlation: For the observed correlation, we set two hypotheses to complete the t- test. H0: beta1 = 0 and H1: beta1 =/= (not equal) 0), we get p=0.034 (p < 0.05), so we reject H0, accept H1, so the age has some effect on price. (SPREMENI!!!) c) Coefficient of determination: 5,3% of the variable Price is explained with linear effect of the Age.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

scatterplotMatrix(mydataAP[ , c(1, 2, 3)], 
                  smooth = FALSE)

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydataAP[ , c(1,2,3)]))

##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

Based on what we can spot in the matrix, and what we calculated with correlation matrix above, where we calculated a Pearson correlation coefficients,there is a strong correlation between Price and Distance and weak correlation between Age and Price and Age and Distance.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydataAP)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

Chris - mail

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydataAP$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydataAP$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(mydataAP$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataAP$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP$StdResid
## W = 0.95303, p-value = 0.003645

hist(mydataAP$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

We spot that we have some outliers. We will test it (v nadaljevanju!!).

head(mydataAP[order(mydataAP$StdResid), ], 3)

##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013

head(mydataAP[order(-mydataAP$CooksD), ], 6)

##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038

As we can see, there is a gap between aparment no. 53 and 13 and between no. 38, 55 and 33 (according to Cooks Distance). But, let’s remove aparment no. 38 and see if it betters the analysis.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydataAP <- mydataAP %>%
  filter(!CooksD == 0.320)

fit3 <- lm(Price ~ Age + Distance,
           data = mydataAP)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

mydataAP$StdResid <- round(rstandard(fit3), 3) #Standardized residuals
mydataAP$CooksD <- round(cooks.distance(fit3), 3) #Cooks distances

hist(mydataAP$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataAP$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP$StdResid
## W = 0.95649, p-value = 0.006355

hist(mydataAP$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Because we still have some gaps, I decided to adjust my data a bit more:

library(dplyr)
mydataAP <- mydataAP %>%
  filter(!CooksD == 0.129)

fit4 <- lm(Price ~ Age + Distance,
           data = mydataAP)

summary(fit4)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

mydataAP$StdResid <- round(rstandard(fit4), 3) #Standardized residuals
mydataAP$CooksD <- round(cooks.distance(fit4), 3) #Cooks distances

hist(mydataAP$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataAP$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP$StdResid
## W = 0.95952, p-value = 0.01044

hist(mydataAP$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

With the process above we eliminated the outliers in stadardised residuals and we do not have any gaps anymore in that graph, which is better because the graph is nicely connected. Although, we still have a gap in the Histogram Cooks distances, but I decided to not eliminate this one.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

mydataAP$StdFitted <- scale(fit4$fitted.values)

library(car)
scatterplot(y = mydataAP$StdResid, x = mydataAP$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Based on the observations in scatterplot above, we see that we have a case of heteroscedasticity.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit4)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

POGLEJ!!!: The null hypothesis that the variance of errors is constant cannot be rejected (p-value is above 0.05). We therefore assume homoskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

With function anova check if model fit3 fits data better than model fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?