Task 3

Paulina Suvorov

Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("C:/Users/pauli/OneDrive/Desktop/BootcampR/R Take Home Exam/R Take Home Exam/Task 3/Apartments.xlsx")
head(mydata)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors

mydata$ParkingF <- factor(mydata$Parking,
                          levels = c(0,1),
                          labels = c("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c(0,1),
                          labels= c("No", "Yes"))
mydata1 <- mydata

library(pastecs)
round(stat.desc(mydata[ ,c(1:5)]),2)

##                  Age Distance     Price Parking Balcony
## nbr.val        85.00    85.00     85.00   85.00   85.00
## nbr.null        0.00     0.00      0.00   42.00   48.00
## nbr.na          0.00     0.00      0.00    0.00    0.00
## min             1.00     1.00   1400.00    0.00    0.00
## max            45.00    45.00   2820.00    1.00    1.00
## range          44.00    44.00   1420.00    1.00    1.00
## sum          1577.00  1209.00 171610.00   43.00   37.00
## median         18.00    12.00   1950.00    1.00    0.00
## mean           18.55    14.22   2018.94    0.51    0.44
## SE.mean         1.05     1.23     40.98    0.05    0.05
## CI.mean.0.95    2.09     2.45     81.50    0.11    0.11
## var            93.96   129.44 142764.34    0.25    0.25
## std.dev         9.69    11.38    377.84    0.50    0.50
## coef.var        0.52     0.80      0.19    0.99    1.15

The average price per m2 is 2018.94. The difference between the highest and the lowest price per m2 is 1420. 42 of the apartments do not have a parking and 48 of the apartments do not have a balcony.

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

#H0: mu_price=1900
#H1: mu_ price≠1900
t.test(mydata$Price,
       mu=1900,
       alternative="two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Explanation: It is extremely unlikely that the average price per m2 is equal to 1900eur. We reject H0 at p<0.05 and accept H1, and conclude that the average price per m2 is not equal to 1900eur.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanation: If the age of the apartment increases by 1 year, on average the price per m2 of the apartment decreases by 8.975 % points, p<0.05.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(mydata[,c(3,2,1)], smooth = FALSE)

Explanation: Scatter plot matrix between Price, Age and Distance shows no signs of multicolinearity, because the slope between Age and Distance is just slightly sloped.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Based on the VIF Statistics, there is no multicolinearity present, since the VIF factor for all of the explanatory variables is less than 5.

mean(vif(fit2))

## [1] 1.001845

We can conclude that there is no multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

#outliers/standardised residuals
mydata$StdResid <- round(rstandard(fit2),3)

hist(mydata$StdResid,
     main = "Histogram of standardised residuals",
     xlab = "Standardised residuals",
     ylab = "Frequency",
     breaks = seq(from= -3, to= 3, by=0.5))

We can notice an outlier on the left, that stands out a little.

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95303, p-value = 0.003645

H0: variables are normally distributed H1: variables are not normally distributed We can conclude, that standard residuals are not normally distributed, p<0.05.

#units with big impact/cooks distances
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$CooksD,
     main = "Histogram of cooks distances",
     xlab = "Cooks distance",
     ylab = "Frequency" )

The one that has a higher distance (the one on the left, that is more than 0.3) compared to others is problematic.

head(mydata[order(mydata$StdResid),], 6)

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     7        2  1760       0       1 No       Yes         -2.15  0.066
## 2    12       14  1650       0       1 No       Yes         -1.50  0.013
## 3    12       14  1650       0       0 No       No          -1.50  0.013
## 4    13        8  1800       0       0 No       No          -1.38  0.012
## 5    14       16  1660       0       1 No       Yes         -1.26  0.008
## 6    24        5  1830       1       0 Yes      No          -1.19  0.012

We need delete the one in row 53.

head(mydata[order(-mydata$CooksD),], 6)

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32 
## 2    43       37  1740       0       0 No       No           1.44  0.104
## 3     2       11  2790       1       0 Yes      No           2.05  0.069
## 4     7        2  1760       0       1 No       Yes         -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes          1.58  0.061
## 6    40        2  2400       0       1 No       Yes          1.09  0.038

We need to delete the one in row 38.

#we remove it
mydata1 <- mydata[c(-53, -38),]

hist(mydata1$StdResid,
     main = "Histogram of standardised residuals",
     xlab = "Standardised residuals",
     ylab = "Frequency",
     breaks = seq(from= -3, to= 3, by=0.5))

hist(mydata1$CooksD,
     main = "Histogram of cooks distances",
     xlab = "Cooks distance",
     ylab = "Frequency" )

fit2 <- lm(Price ~ Age + Distance, data = mydata1)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.51 -223.89  -62.78  202.78  575.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2489.617     73.524  33.861  < 2e-16 ***
## Age           -7.350      3.103  -2.368   0.0203 *  
## Distance     -23.636      2.731  -8.654 4.21e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.1 on 80 degrees of freedom
## Multiple R-squared:  0.513,  Adjusted R-squared:  0.5008 
## F-statistic: 42.13 on 2 and 80 DF,  p-value: 3.177e-13

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

#standardisation
mydata1$StdfittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata1$StdResid, x= mydata1$StdfittedValues,
            ylab = "Standarized residuals",
            xlab = "Standarized fitted values",
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

Explanation: Based on the scatter plot we can see there is no heteroskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

mydata1$StdResid <- round(rstandard(fit2),3)
hist(mydata1$StdResid, xlab = "Standardised Residuals", ylab="Frequency", main="Histogram of Standardised Residuals", breaks = seq(from= -3, to= 3, by=0.5))

H0:variables are normally distributed H1:variables are not normally distributed

shapiro.test(mydata1$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.94142, p-value = 0.0008992

Explanation:Based on the p-value<0.005, we reject H0 and conclude that standard residuals are not normally distributed.

Estimate the fit2 again without potentially excluded cases and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age+Distance, data=mydata1)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.51 -223.89  -62.78  202.78  575.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2489.617     73.524  33.861  < 2e-16 ***
## Age           -7.350      3.103  -2.368   0.0203 *  
## Distance     -23.636      2.731  -8.654 4.21e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.1 on 80 degrees of freedom
## Multiple R-squared:  0.513,  Adjusted R-squared:  0.5008 
## F-statistic: 42.13 on 2 and 80 DF,  p-value: 3.177e-13

Explanation: If the age of the apartment increases by 1 year, on average the price per m2 of the apartment price per m2 decreases by 7.350% points,with p<0.05 and everything else remains unchanged. If the distance from the city center increases by 1km, on average the apartment price per m2 decreases by 23.636 % points, with p<0.00 and everything else remains unchanged.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age+Distance+ParkingF+BalconyF, data=mydata1)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.58 -198.68  -44.44  229.33  529.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2367.282     93.781  25.243  < 2e-16 ***
## Age           -6.605      3.054  -2.162   0.0337 *  
## Distance     -21.140      2.878  -7.345 1.73e-10 ***
## ParkingFYes  147.508     62.799   2.349   0.0214 *  
## BalconyFYes   -3.122     58.635  -0.053   0.9577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263.4 on 78 degrees of freedom
## Multiple R-squared:  0.5453, Adjusted R-squared:  0.522 
## F-statistic: 23.39 on 4 and 78 DF,  p-value: 9.972e-13

Explanation: If the age of the apartment increases by 1 year, on average the price per m2 of the apartment price per m2 decreases by 6.605% points,with p<0.05 and everything else remains unchanged. If the distance from the city center increases by 1km, on average the apartment price per m2 decreases by 21.140 % points, with p-value<0.00 and everything else remains unchanged. Given the parking space, the average price per m2 of the apartment is higher for 147.508 % points, with p-value<0.05 and everything else remains unchanged. We can not conclude, that if the apartment has a balcony it affects the average price per m2, p-value>0.05. #### With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5795128                              
## 2     78 5410469  2    384659 2.7727 0.06866 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: fit2 is more appropriate H1: fit3 is more appropriate p-value>0.05, so we have weaker evidence to reject H0, so we conclude the fit3 is more appropriate model.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -420.58 -198.68  -44.44  229.33  529.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2367.282     93.781  25.243  < 2e-16 ***
## Age           -6.605      3.054  -2.162   0.0337 *  
## Distance     -21.140      2.878  -7.345 1.73e-10 ***
## ParkingFYes  147.508     62.799   2.349   0.0214 *  
## BalconyFYes   -3.122     58.635  -0.053   0.9577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263.4 on 78 degrees of freedom
## Multiple R-squared:  0.5453, Adjusted R-squared:  0.522 
## F-statistic: 23.39 on 4 and 78 DF,  p-value: 9.972e-13

F-statistics: H0: ρ squared = 0 H1: ρ squared > 0 p-value<0.05 therefore we reject. We conclude, that at least one of the expalatory variables is different from 0/ has an effect on the dependent variable.

Save fitted values and claculate the residual for apartment ID2.

mydata1$FittedValues <- fitted.values(fit2)
mydata1$Residuals <- residuals(fit3)

head(mydata1[, colnames(mydata1) %in% c("Apartment", "Residuals")])

## # A tibble: 6 × 1
##   Residuals
##       <dbl>
## 1     -86.0
## 2     425. 
## 3     -69.1
## 4     284. 
## 5    -372. 
## 6    -156.

Apartment with ID 2 has the residual 425.23070.