Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description: the dataset contains 85 apartments with five variables: Age, Distance (to the city centre), Price (in EUR), and the 0/1 indicators Parking and Balcony.

Change categorical variables into factors.

Apartments$ParkingF <- factor(Apartments$Parking,
                              levels = c(0, 1),
                              labels = c("No", "Yes"))

Apartments$BalconyF <- factor(Apartments$Balcony,
                              levels = c(0, 1),
                              labels = c("No", "Yes"))
head(Apartments)
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingF BalconyF
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 No       Yes     
## 2    18        1  2800       1       0 Yes      No      
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 No       Yes     
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 No       Yes
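
As a quick sanity check (an addition; output not shown), the recoding can be verified by cross-tabulating the original 0/1 variables against the new factors:

table(Apartments$Parking, Apartments$ParkingF)
table(Apartments$Balcony, Apartments$BalconyF)
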
library(pastecs)
round(stat.desc(Apartments[c("Age", "Distance", "Price", "Parking", "Balcony")]), 2)
##                  Age Distance     Price Parking Balcony
## nbr.val        85.00    85.00     85.00   85.00   85.00
## nbr.null        0.00     0.00      0.00   42.00   48.00
## nbr.na          0.00     0.00      0.00    0.00    0.00
## min             1.00     1.00   1400.00    0.00    0.00
## max            45.00    45.00   2820.00    1.00    1.00
## range          44.00    44.00   1420.00    1.00    1.00
## sum          1577.00  1209.00 171610.00   43.00   37.00
## median         18.00    12.00   1950.00    1.00    0.00
## mean           18.55    14.22   2018.94    0.51    0.44
## SE.mean         1.05     1.23     40.98    0.05    0.05
## CI.mean.0.95    2.09     2.45     81.50    0.11    0.11
## var            93.96   129.44 142764.34    0.25    0.25
## std.dev         9.69    11.38    377.84    0.50    0.50
## coef.var        0.52     0.80      0.19    0.99    1.15

The output above displays descriptive statistics of the variables used in the analysis.

Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

t.test(Apartments$Price, mu=1900)
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject H0 at the 5% significance level (p = 0.0047): the sample mean price is 2018.94 EUR, not 1900 EUR. With 95% confidence, the average price of apartments lies between 1937.44 EUR and 2100.44 EUR.
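
As a small added illustration, the t statistic can be reproduced by hand from the sample mean, the hypothesized value and the standard error of the mean:

x <- Apartments$Price
(mean(x) - 1900) / (sd(x) / sqrt(length(x)))   # approx. 2.902, matching t.test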

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
corr_coef <- cor(Apartments$Age, Apartments$Price)
print(corr_coef)
## [1] -0.230255

Regression function: Price = 2185.455 - 8.975 x Age. If Age is zero, the estimated average price of an apartment is 2185.455 EUR. Each additional year of age lowers the estimated average price by 8.975 EUR.

The multiple R-squared statistic shows that Age explains only 5.302% of the variability in apartment prices.

Both p-values are below 0.05, so at the 5% significance level we can reject the null hypotheses that the regression coefficients equal zero (H0: β = 0).

The correlation between Price and Age is negative (r = -0.23): the older the apartment, the lower its average price. The correlation is, however, weak.
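
In simple regression the coefficient of determination is simply the squared correlation coefficient, which we can verify directly:

corr_coef^2   # approx. 0.053, matching Multiple R-squared in the summary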

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

library(car)
## Loading required package: carData
vars <- Apartments[, c("Price", "Age", "Distance")]
scatterplotMatrix(vars,
                  smooth = FALSE)

The scatterplot matrix shows the pairwise relationships among the variables. The panels in the first row plot Price against each explanatory variable: the upper middle panel shows that Price falls with Age (the older the apartment, the lower the price), and the upper right panel shows that Price falls with Distance from the city centre. Multicollinearity concerns the relationship between the explanatory variables themselves; the Age-Distance panel shows no clear pattern, so the matrix gives no visual indication of a multicollinearity problem.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

The p-value of the F-statistic is below 0.05, so we can reject H0: at least one of the explanatory variables (judging by the t-tests above, certainly Distance) has a significant effect on Price. Age and Distance together explain 43.96% of the variability in apartment prices.

Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The vif function evaluates the strength of the linear relationship between each explanatory variable and the remaining explanatory variables. The higher the VIF statistic, the more strongly that variable is related to the other explanatory variables.

Because the VIF statistic is close to 1 for both explanatory variables, Age and Distance, there are no multicollinearity concerns. A problem would arise if a VIF exceeded 5; only then would we consider excluding a variable from the model.
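
As an added illustration, the VIF of a predictor equals 1 / (1 - R-squared) from regressing that predictor on the remaining predictors, which can be verified by hand:

# Reproduce vif(fit2) for Age
r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared
1 / (1 - r2_age)   # approx. 1.0018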

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Standardized Residuals

Apartments$std_resid <- round(rstandard(fit2),3)
hist(Apartments$std_resid,
     xlim = c(-3,3),
     ylim = c(0,20),
     main = "Histogram of standardized residuals",
     xlab = "standardized residuals",
     col = "lightgreen")

Based on the distribution of the standardized residuals, we try to infer the distribution of the errors in the population.

From the histogram we can see that the residuals are slightly right-skewed. This could also be tested formally with the Shapiro-Wilk test (see the normality section below).

cooks_d <- cooks.distance(fit2)
Apartments$CooksD <- round(cooks_d, 3)  # store rounded Cook's distances for later inspection

head(Apartments[order(-Apartments$std_resid), c("Distance", "Price", "Age", "CooksD", "std_resid")])
## # A tibble: 6 × 5
##   Distance Price   Age CooksD std_resid
##      <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1       45  2180     5  0.32       2.58
## 2       11  2790     2  0.069      2.05
## 3        1  2800    18  0.03       1.78
## 4        1  2800    18  0.03       1.78
## 5        2  2820     8  0.037      1.66
## 6        1  2810    10  0.032      1.60

Cook's Distances

# CooksD was already computed above; plot its distribution
hist(Apartments$CooksD,
     main = "Histogram of Cooks Distances",
     xlab = "Cooks Distances",
     col = "yellow")

Using Cook's distance, we can spot units with a high influence on the estimated regression function and remove them (similarly to outliers). Cook's distance is non-negative, and a higher value means a larger influence.
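
A common rule of thumb (an added aside, not part of the original analysis) flags units whose Cook's distance exceeds 4/n:

threshold <- 4 / nrow(Apartments)   # 4/85, approx. 0.047
which(cooks_d > threshold)          # indices of potentially influential units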

From the histogram and the head output, we can see one apartment that stands out with a Cook's distance of 0.32, which indicates excessive influence. This is apartment no. 38 and it should be removed; this can be done with the dplyr function filter(), as shown below.

head(Apartments[order(-Apartments$CooksD), c("Distance", "Price", "Age", "CooksD", "std_resid")])
## # A tibble: 6 × 5
##   Distance Price   Age CooksD std_resid
##      <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1       45  2180     5  0.32       2.58
## 2       37  1740    43  0.104      1.44
## 3       11  2790     2  0.069      2.05
## 4        2  1760     7  0.066     -2.15
## 5        3  2540    37  0.061      1.58
## 6        2  2400    40  0.038      1.09

Problematic units can be removed, for example, with dplyr (the actual exclusion is carried out further below with base R):

library(dplyr)
Apartments <- Apartments %>% filter(CooksD != 0.320)

Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings.

Apartments$stdFitted <- as.numeric(scale(fitted(fit2)))  # standardized fitted values
library(car)
scatterplot(x = Apartments$stdFitted, y = Apartments$std_resid,
            xlab = "Standardized Fitted Values",
            ylab = "Standardized Residuals",
            main = "Heteroskedasticity Check",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The points should be randomly scattered within a horizontal band of constant width. Heteroskedasticity occurs if the variability of the residuals changes with the fitted values, and it undermines the reliability of the estimated standard errors. We also test for it with the Breusch-Pagan test below.

The plot shows a random scatter within a horizontal band, so we may assume that homoskedasticity is not violated (the variance of the errors is constant). The Breusch-Pagan test confirms this: since the p-value (0.325) is above 0.05, we cannot reject H0 that the variance is constant.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    1 
##  Chi2          =    0.968106 
##  Prob > Chi2   =    0.325153

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments$std_resid,
     xlim = c(-3,3),
     main = "Histogram of standardized residuals",
     xlab = "standardized residuals",
     ylab = "Density",
     prob = TRUE, # density scale on the y-axis so the normal curve overlays correctly
     col = "lightgreen")

curve(dnorm(x, mean = mean(Apartments$std_resid), sd = sd(Apartments$std_resid)),
      col = "red", 
      lwd = 2, 
      add = TRUE)
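
The question also asks for a formal test. A minimal sketch with the Shapiro-Wilk test (output not shown here), where H0 states that the standardized residuals come from a normal distribution:

shapiro.test(Apartments$std_resid)
# A p-value above 0.05 would mean we cannot reject normality of the residuals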

Estimate fit2 again without the problematic units identified above and show the summary of the model. Explain all coefficients.

Apartments <- Apartments[Apartments$CooksD != 0.320, ]  # exclude the high-influence unit
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

As determined a few steps above, we exclude the unit with Cook's distance 0.320. After that we estimate fit2 again and summarize the results.

From the output we can read the new regression function: Price = 2456.076 - 6.464 x Age - 22.955 x Distance. Both Age and Distance affect the price negatively: the older the apartment and the farther it is from the city centre, the lower its average price. For Age = 0 and Distance = 0, the estimated average price is 2456.076 EUR. Holding Distance constant, each additional year of age lowers the average price by 6.464 EUR; holding Age constant, each additional unit of distance lowers it by 22.955 EUR.
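
As a hypothetical worked example (the input values are chosen purely for illustration), the regression function can be evaluated with predict():

# Estimated average price of a 10-year-old apartment at distance 5 from the centre
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# 2456.076 - 6.464*10 - 22.955*5 = approx. 2276.7 EUR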

The null hypotheses we were testing (H0: the coefficients equal zero) can be rejected, as both p-values are below 0.05.

R-squared tells us that Age and Distance together explain 48.38% of the variability in Price.

sqrt(summary(fit2)$r.squared)
## [1] 0.6955609

The function above gives the coefficient of multiple correlation, i.e., the correlation between the observed prices and the values fitted by the model. The value 0.6956 indicates a fairly strong combined linear association between the predictors and Price. Note that it is non-negative by construction, so it says nothing about the direction of the individual effects (both slopes are negative).
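
As a quick added check, the same number is obtained as the ordinary correlation between the observed and fitted values:

cor(Apartments$Price, fitted(fit2))   # approx. 0.6956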

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, 
           data = Apartments)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12
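
A side note on the dummy coding (added for illustration): R encodes each factor as a 0/1 dummy variable and takes the first level ("No") as the reference category, which is why only ParkingFYes and BalconyFYes appear among the coefficients. The design matrix can be inspected with:

head(model.matrix(fit3))   # output not shown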

With the anova function, check whether model fit3 fits the data better than model fit2.

anova(fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis states that the two added coefficients (ParkingFYes and BalconyFYes) are both zero, i.e., that the simpler model fit2 is sufficient. Because the p-value (0.031) is below 0.05, we reject H0 and conclude that fit3 fits the data significantly better.
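
The partial F statistic can be reproduced by hand from the RSS values in the table above (a small added check):

# ((RSS_fit2 - RSS_fit3) / number of added coefficients) / (RSS_fit3 / its residual df)
((6176767 - 5654480) / 2) / (5654480 / 79)   # approx. 3.65, matching the anova output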

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis that is being tested with the F-statistic, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

The regression coefficient for ParkingFYes shows that an apartment with parking, everything else unchanged, is on average 167.53 EUR more expensive than one without parking. The regression coefficient for BalconyFYes shows that an apartment with a balcony is on average 15.21 EUR cheaper, but this effect is far from statistically significant (p = 0.798).
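
A hypothetical comparison (the Age and Distance values below are chosen purely for illustration) makes the dummy coefficients concrete: two otherwise identical apartments differ in predicted price by exactly the ParkingFYes estimate.

new_apts <- data.frame(Age = c(10, 10), Distance = c(5, 5),
                       ParkingF = c("No", "Yes"),
                       BalconyF = c("No", "No"))
predict(fit3, newdata = new_apts)   # prices differ by 167.531 EUR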

The F-statistic tests whether the regression model as a whole explains a significant portion of the variation in apartment prices.

H0: β1 = β2 = β3 = β4 = 0 (none of the four explanatory variables has any effect on Price)
H1: at least one βi ≠ 0 (at least one explanatory variable explains a significant portion of the variation in Price)

The result shows that the model is highly significant, as the p-value is well below 0.05 (indeed below 0.001).

Save the fitted values and calculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)

round(Apartments[2, "Residuals"], 3)
## # A tibble: 1 × 1
##   Residuals
##       <dbl>
## 1      428.

The residual for apartment ID2 tells us that the actual price of this apartment is 427.8 EUR higher than the value estimated by the regression.
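
As a quick check (added here), the residual is simply the actual price minus the fitted value:

Apartments$Price[2] - Apartments$Fitted[2]   # approx. 427.8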