HW 3

VENETIA POLYZOU

1. Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")

mydata <- as.data.frame(mydata)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

We are interested in how the price per square meter (in euros) of an apartment is affected by its age (in years), its distance from the city center (in km), the presence of a balcony (No/Yes) and the presence of parking (No/Yes).

3. Change categorical variables into factors. (0.5 p)

mydata$ParkingF <- factor(mydata$Parking,
                          levels = c (0, 1), 
                          labels = c ("No", "Yes"))

mydata$BalconyF <- factor(mydata$Balcony,
                          levels = c (0, 1), 
                          labels = c ("No", "Yes"))

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

library(ggplot2)
ggplot(mydata, aes(x = Price)) +
  geom_histogram(binwidth = 500, colour = "black") +
  ylab("Frequency") +
  xlab("Price per square meter in euros")

shapiro.test(mydata$Price)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Price
## W = 0.94017, p-value = 0.0006513

The hypotheses of the Shapiro-Wilk test are:

H0: Price per square meter is normally distributed.
H1: Price per square meter is not normally distributed.

We reject the null hypothesis (H0) at p < 0.001, so we conclude that the price per square meter is not normally distributed.

Although the normality assumption is violated and the Wilcoxon signed rank test would be the appropriate choice, for the purposes of this home assignment we assume normality and continue with the one-sample t-test for the arithmetic mean.
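For reference, the nonparametric alternative mentioned above could be run roughly as follows. This is only a sketch on simulated, hypothetical prices (the vector price_sim is an assumption and is not the Apartments data). Note that the Wilcoxon signed rank test is a test of location (pseudo-median) rather than of the arithmetic mean, and it assumes a distribution that is symmetric around that location.

```r
# Hypothetical, simulated prices (NOT the Apartments data), right-skewed on purpose
set.seed(1)
price_sim <- 1500 + rexp(85, rate = 1 / 500)

# Wilcoxon signed rank test of H0: location = 1900
wt <- wilcox.test(price_sim, mu = 1900, alternative = "two.sided")

wt$statistic
wt$p.value
```

With the real data, price_sim would be replaced by mydata$Price.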

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

For the one-sample t-test the hypotheses are:

H0: μ = 1900
H1: μ ≠ 1900

We reject the null hypothesis (H0) at p = 0.005. The arithmetic mean of the price per square meter in the population is therefore different from 1900 euros: the sample mean is 2018.94 euros and the 95% confidence interval (1937.44 to 2100.44) lies entirely above 1900.
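The test statistic can be reconstructed from the printed output alone. The sketch below uses only numbers copied from the t.test() output; the variable names are illustrative. The standard error is recovered from the half-width of the confidence interval, so the result matches the reported values only up to rounding.

```r
# Values copied from the t.test() output above
x_bar <- 2018.941
ci    <- c(1937.443, 2100.440)
df    <- 84
mu0   <- 1900

# Standard error recovered from the half-width of the 95% confidence interval
se <- (ci[2] - ci[1]) / 2 / qt(0.975, df)

# t = (sample mean - hypothesized mean) / standard error
t_stat  <- (x_bar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df)

round(t_stat, 2)    # close to the reported t = 2.9022
round(p_value, 4)   # close to the reported p-value = 0.004731
```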

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

library(car)

## Loading required package: carData

scatterplot(mydata$Price ~ mydata$Age,
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Price per square meter in euros", 
            xlab = "Age in years")

The scatterplot suggests a negative linear relationship between the two variables.

fit1 <- lm(Price ~ Age,
           data = mydata)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Price = 2185.455-8.975*Age

If the age of an apartment is 0, meaning it is new, its expected price is 2185.455 euros per square meter (p < 0.001).
When age increases by 1 year, the price of the apartment decreases by 8.975 euros per square meter on average (p = 0.034).

Hypotheses for the t-test of the regression coefficient:

H0: β1 = 0
H1: β1 ≠ 0

We reject the null hypothesis (H0) at p = 0.034, meaning the regression coefficient is different from 0 in the population. So, the age of an apartment has a statistically significant effect on its price.

Coefficient of determination:

5.302% of the variability of the price per square meter is explained by the linear effect of age.

Hypotheses for the test of significance of regression:

H0: ρ^2 = 0
H1: ρ^2 > 0

We reject the null hypothesis (H0) at p = 0.034. The coefficient of determination in the population is greater than 0, meaning that at least one explanatory variable (here, age) explains the differences in price per square meter.

The multiple correlation coefficient is (0.05302)^0.5 = 0.230, so the linear relationship between Price and Age is weak. Its direction is negative, as shown by the sign of the regression coefficient.

cor(mydata$Price, mydata$Age)

## [1] -0.230255

The correlation coefficient confirms a weak, negative linear relationship between the price of an apartment and its age.
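The link between the correlation coefficient and the coefficient of determination can be checked numerically. The sketch below uses simulated, hypothetical data (the variables age and price are assumptions, not the Apartments data) and shows that in simple regression R^2 equals the squared Pearson correlation, while the sign of r follows the sign of the slope.

```r
# Hypothetical, simulated data (NOT the Apartments data)
set.seed(42)
age   <- runif(85, 1, 45)
price <- 2200 - 9 * age + rnorm(85, sd = 370)

fit <- lm(price ~ age)
r2  <- summary(fit)$r.squared
r   <- cor(price, age)

# In simple regression R^2 is the squared Pearson correlation
all.equal(r^2, r2)

# The sign of r is the sign of the estimated slope
sign(r) == sign(coef(fit)[["age"]])
```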

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

scatterplotMatrix(mydata[ , c(-4, -5, -6, -7)],
                  smooth = FALSE)

The panel for Age and Distance shows that the correlation between the two explanatory variables is low, so there is no serious evidence of multicollinearity.

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

scatterplotMatrix(mydata[ , c(-4, -5, -6, -7)],
                  smooth = FALSE)

We do not observe any obvious nonlinear relationships.

##install.packages("Hmisc")
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ , c(-4,-5,-6,-7)]))

##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

We observe a weak, negative linear relationship between the price of an apartment and its age (r = -0.23).

Hypotheses of the test:

H0: ρ = 0
H1: ρ ≠ 0

We reject the null hypothesis (H0) at p = 0.034, meaning that we found a negative linear relationship between the price of an apartment and its age.

We also observe a moderately strong, negative linear relationship between the price of an apartment and its distance from the city center (r = -0.63).

Hypotheses of the test:

H0: ρ = 0
H1: ρ ≠ 0

We reject the null hypothesis (H0) at p < 0.001, meaning that we found a negative linear relationship between the price of an apartment and its distance from the city center.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Price = 2460.101 - 7.934 * Age - 20.667 * Distance

If the age of an apartment is 0, meaning it is new, and its distance from the city center is 0, its expected price is 2460.101 euros per square meter (p < 0.001).
When age increases by 1 year, the price of the apartment decreases by 7.934 euros per square meter on average, assuming all other explanatory variables remain unchanged (p = 0.016).

Hypotheses for the t-test of the partial regression coefficient:

H0: β1 = 0
H1: β1 ≠ 0

We reject the null hypothesis (H0) at p = 0.016, meaning the regression coefficient is different from 0 in the population. So, the age of an apartment has a statistically significant effect on its price.

When the distance of an apartment from the city center increases by 1 km, the price of the apartment decreases by 20.667 euros per square meter on average, assuming all other explanatory variables remain unchanged (p < 0.001).

Hypotheses for the t-test of the partial regression coefficient:

H0: β2 = 0
H1: β2 ≠ 0

We reject the null hypothesis (H0) at p < 0.001, meaning the regression coefficient is different from 0 in the population. So, the distance of an apartment from the city center has a statistically significant effect on its price.

Coefficient of determination:

43.96% of the variability of the price per square meter is explained by the linear effect of age and distance from the city center.

Hypotheses for the test of significance of regression:

H0: ρ^2 = 0
H1: ρ^2 > 0

We reject the null hypothesis (H0) at p < 0.001. The coefficient of determination in the population is greater than 0, meaning that at least one explanatory variable explains the differences in price per square meter.

The multiple correlation coefficient is (0.4396)^0.5 = 0.663, so the linear relationship between the price (dependent variable) and the explanatory variables taken together (age and distance) is moderately strong. The multiple correlation coefficient is non-negative by definition; the negative direction of each effect is shown by the signs of the partial regression coefficients.
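The multiple correlation coefficient can also be read as the correlation between the observed and the fitted values, which is why it cannot be negative. A minimal sketch on simulated, hypothetical data (the variables below are assumptions, not the Apartments data):

```r
# Hypothetical, simulated data (NOT the Apartments data)
set.seed(7)
n        <- 85
age      <- runif(n, 1, 45)
distance <- runif(n, 1, 45)
price    <- 2450 - 8 * age - 20 * distance + rnorm(n, sd = 280)

fit <- lm(price ~ age + distance)

# Multiple correlation coefficient as the square root of R^2
multiple_r <- sqrt(summary(fit)$r.squared)

# The same value is the correlation between observed and fitted prices
all.equal(multiple_r, cor(price, fitted(fit)))
```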

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

There is no need to drop either variable: both VIF values are very close to 1, far below the common rule-of-thumb thresholds of 5 or 10, so Age and Distance are practically uncorrelated with each other.
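The VIF reported above can be reproduced without the car package: for each explanatory variable it is 1 / (1 - R^2) of the auxiliary regression of that variable on the other explanatory variables. The sketch below uses simulated, hypothetical predictors (not the Apartments data) and shows why the two VIFs are identical when the model has only two predictors.

```r
# Hypothetical, simulated predictors (NOT the Apartments data)
set.seed(3)
n        <- 85
age      <- runif(n, 1, 45)
distance <- runif(n, 1, 45)

# Auxiliary regression of one predictor on the other
r2_age  <- summary(lm(age ~ distance))$r.squared
vif_age <- 1 / (1 - r2_age)

# With only two predictors, both VIFs equal 1 / (1 - r^2),
# where r is the correlation between the two predictors
vif_from_cor <- 1 / (1 - cor(age, distance)^2)

all.equal(vif_age, vif_from_cor)
```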

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distance")

mydata$ID <- seq(1, nrow(mydata))

head(mydata[order(mydata$StdResid),], 3)

##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## 53   7        2  1760       0       1       No      Yes   -2.152  0.066 53
## 13  12       14  1650       0       1       No      Yes   -1.499  0.013 13
## 72  12       14  1650       0       0       No       No   -1.499  0.013 72

No standardized residual is below -3, so there are no outliers in the lower tail; values above 3 would have to be checked in the same way.
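Both tails can be checked in one step by looking at the absolute value of the standardized residuals. The sketch below uses simulated, hypothetical data (not the Apartments data); with the real data, fit would be replaced by fit2.

```r
# Hypothetical, simulated data (NOT the Apartments data)
set.seed(11)
n        <- 85
age      <- runif(n, 1, 45)
distance <- runif(n, 1, 45)
price    <- 2450 - 8 * age - 20 * distance + rnorm(n, sd = 280)

fit       <- lm(price ~ age + distance)
std_resid <- rstandard(fit)

# Smallest and largest standardized residual
range(std_resid)

# Row numbers of units outside the (-3, 3) range, in either tail
outlier_ids <- which(abs(std_resid) > 3)
outlier_ids
```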

head(mydata[order(-mydata$CooksD),], 6)

##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID
## 38   5       45  2180       1       1      Yes      Yes    2.577  0.320 38
## 55  43       37  1740       0       0       No       No    1.445  0.104 55
## 33   2       11  2790       1       0      Yes       No    2.051  0.069 33
## 53   7        2  1760       0       1       No      Yes   -2.152  0.066 53
## 22  37        3  2540       1       1      Yes      Yes    1.576  0.061 22
## 39  40        2  2400       0       1       No      Yes    1.091  0.038 39

The histogram of Cook's distances shows a clear gap between the apartment with ID 38 (Cook's distance 0.320) and all other units, so we remove this apartment as a unit with high influence.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  filter(ID != 38)

10. Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

fit2 <- lm(Price ~ Age + Distance,
          data = mydata)

mydata$StdFitted <- scale(fit2$fitted.values)
mydata$StdResid <- round(rstandard(fit2), 3)

library(car)

scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The spread of the standardized residuals is roughly constant across the fitted values (the point cloud does not fan out), so there is no visible evidence of heteroskedasticity.
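The visual check could be complemented by a formal test. The sketch below computes the studentized (Koenker) version of the Breusch-Pagan test by hand in base R, on simulated, hypothetical data (not the Apartments data). This is an addition for illustration and not part of the assignment; H0 is constant error variance.

```r
# Hypothetical, simulated data (NOT the Apartments data)
set.seed(5)
n        <- 84
age      <- runif(n, 1, 45)
distance <- runif(n, 1, 45)
price    <- 2450 - 6 * age - 23 * distance + rnorm(n, sd = 280)

fit <- lm(price ~ age + distance)

# Auxiliary regression of the squared residuals on the explanatory variables
sq_resid <- residuals(fit)^2
aux      <- lm(sq_resid ~ age + distance)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of explanatory variables)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_p)
```

A large p-value would be consistent with the conclusion drawn from the scatterplot.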

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

Based on the histogram, the errors do not seem to be normally distributed: the distribution is not symmetric and appears slightly right-skewed (positively skewed).

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95649, p-value = 0.006355

The hypotheses of the Shapiro-Wilk test are:

H0: Errors are normally distributed.
H1: Errors are not normally distributed.

We reject the null hypothesis (H0) at p = 0.006, so we conclude that the errors are not normally distributed.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance,
           data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

Price = 2456.076 - 6.464 * Age - 22.955 * Distance

If the age of an apartment is 0, meaning it is new, and its distance from the city center is 0, its expected price is 2456.076 euros per square meter (p < 0.001).
When age increases by 1 year, the price of the apartment decreases by 6.464 euros per square meter on average, assuming all other explanatory variables remain unchanged (p = 0.044).

Hypotheses for the t-test of the partial regression coefficient:

H0: β1 = 0
H1: β1 ≠ 0

We reject the null hypothesis (H0) at p = 0.044, meaning the regression coefficient is different from 0 in the population. So, the age of an apartment has a statistically significant effect on its price.

When the distance of an apartment from the city center increases by 1 km, the price of the apartment decreases by 22.955 euros per square meter on average, assuming all other explanatory variables remain unchanged (p < 0.001).

Hypotheses for the t-test of the partial regression coefficient:

H0: β2 = 0
H1: β2 ≠ 0

We reject the null hypothesis (H0) at p < 0.001, meaning the regression coefficient is different from 0 in the population. So, the distance of an apartment from the city center has a statistically significant effect on its price.

Coefficient of determination:

48.38% of the variability of the price of an apartment is explained by the linear effect of its age and its distance from the city center.

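Hypotheses for the test of significance of regression:

H0: ρ^2 = 0
H1: ρ^2 > 0

We reject the null hypothesis (H0) at p < 0.001, so at least one explanatory variable explains the differences in price per square meter.

The multiple correlation coefficient is (0.4838)^0.5 = 0.696, so the linear relationship between the price (dependent variable) and the explanatory variables taken together (age and distance) is moderately strong. The negative direction of each effect is shown by the signs of the partial regression coefficients.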

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
          data = mydata)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Regression function:

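Price = 2329.724 - 5.821 * Age - 20.279 * Distance + 167.531 * ParkingFYes - 15.207 * BalconyFYes

Here ParkingFYes and BalconyFYes are dummy variables equal to 1 for "Yes" and 0 for "No".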

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypotheses for the test:

H0: Δρ^2 = 0
H1: Δρ^2 > 0

We reject the null hypothesis (H0) at p = 0.031. Adding Parking and Balcony significantly increases the coefficient of determination in the population, so the more complex model (fit3) fits the data significantly better than the simpler model (fit2).
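The F statistic of this comparison can be reproduced from the residual sums of squares in the table. The sketch below uses only numbers copied from the anova() output; the variable names are illustrative.

```r
# Values copied from the anova(fit2, fit3) table above
rss_fit2 <- 6176767
df_fit2  <- 81
rss_fit3 <- 5654480
df_fit3  <- 79

# Number of additional parameters in the larger model
q <- df_fit2 - df_fit3

# F = (reduction in RSS per added parameter) / (residual variance of the larger model)
f_stat  <- ((rss_fit2 - rss_fit3) / q) / (rss_fit3 / df_fit3)
p_value <- pf(f_stat, q, df_fit3, lower.tail = FALSE)

round(f_stat, 4)    # close to the reported F = 3.6485
round(p_value, 5)   # close to the reported Pr(>F) = 0.03051
```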

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

If the age of an apartment is 0, meaning it is new, its distance from the city center is 0 and it has neither parking nor a balcony, its expected price is 2329.724 euros per square meter (p < 0.001).
When the distance of an apartment from the city center increases by 1 km, its price decreases by 20.279 euros per square meter on average, assuming all other explanatory variables remain unchanged (p < 0.001).
Given the values of the other explanatory variables, apartments with parking are on average 167.531 euros per square meter more expensive than apartments without parking (p = 0.009).
Given the values of the other explanatory variables, apartments with a balcony are on average 15.207 euros per square meter cheaper than apartments without a balcony, but this difference is not statistically significant (p = 0.798).

Hypotheses for the t-test of the partial regression coefficient of Age:

H0: β1 = 0
H1: β1 ≠ 0

We cannot reject the null hypothesis (H0) at the 5% significance level (p = 0.062), meaning we cannot say that the regression coefficient is different from 0 in the population. So, we do not have enough evidence to say that the age of an apartment affects its price.

Hypotheses for the t-test of the partial regression coefficient of Distance:

H0: β2 = 0
H1: β2 ≠ 0

We reject the null hypothesis (H0) at p < 0.001, meaning the regression coefficient is different from 0 in the population. So, the distance of an apartment from the city center has a statistically significant effect on its price.

Hypotheses for the t-test of the partial regression coefficient of Parking:

H0: β3 = 0
H1: β3 ≠ 0

We reject the null hypothesis (H0) at p = 0.009, meaning the regression coefficient is different from 0 in the population. So, the price per square meter differs significantly between apartments with and without parking.

Hypotheses for the t-test of the partial regression coefficient of Balcony:

H0: β4 = 0
H1: β4 ≠ 0

We cannot reject the null hypothesis (H0) (p = 0.798), meaning we cannot say that the regression coefficient is different from 0 in the population. So, we cannot say that the price per square meter differs between apartments with and without a balcony.

52.75% of the variability of the price per square meter is explained by the linear effect of age, distance from the city center, the presence or absence of parking and the presence or absence of a balcony.

Hypotheses tested with the F-statistic (test of significance of regression):

H0: ρ^2 = 0 (equivalently, β1 = β2 = β3 = β4 = 0)
H1: ρ^2 > 0 (at least one βj ≠ 0)

We reject the null hypothesis (H0) at p < 0.001 (F = 22.04 on 4 and 79 degrees of freedom), so at least one explanatory variable explains the differences in price per square meter.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

mydata$Fitted_fit3 <- fitted.values(fit3)
mydata$Residuals_fit3 <- residuals(fit3)

mydata[mydata$ID == 2, c("ID",  "Fitted_fit3", "Residuals_fit3")]

##   ID Fitted_fit3 Residuals_fit3
## 2  2    2372.197       427.8029

mydata[mydata$ID == 2, ]

##   Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD ID StdFitted Fitted_fit3 Residuals_fit3
## 2  18        1  2800       1       0      Yes       No    1.775  0.031  2  1.134972    2372.197       427.8029

The residual for the apartment with ID 2 can also be calculated manually:

Price = 2329.724 - 5.821 * Age - 20.279 * Distance + 167.531 * ParkingFYes - 15.207 * BalconyFYes

We first calculate the fitted value by substituting the values of the explanatory variables for apartment ID 2 (Age = 18, Distance = 1, Parking = Yes, Balcony = No) into the regression function.

Fitted value = 2329.724 - 5.821 * 18 - 20.279 * 1 + 167.531 * 1 - 15.207 * 0 = 2372.198

Residual = Actual value - Fitted value = 2800 - 2372.198 = 427.802

The small differences from the R output (2372.197 and 427.8029) come from rounding the coefficients to three decimals.
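The same manual calculation can be written in R. The sketch below uses the rounded coefficients from the summary(fit3) output and the values of apartment ID 2; the object names are illustrative.

```r
# Rounded coefficients copied from summary(fit3)
coefs <- c(Intercept   = 2329.724,
           Age         = -5.821,
           Distance    = -20.279,
           ParkingFYes = 167.531,
           BalconyFYes = -15.207)

# Apartment ID 2: intercept term, Age = 18, Distance = 1, Parking = Yes (1), Balcony = No (0)
x_id2 <- c(1, 18, 1, 1, 0)

# Fitted value is the sum of coefficient * value; residual is actual minus fitted
fitted_id2   <- sum(coefs * x_id2)
residual_id2 <- 2800 - fitted_id2

fitted_id2
residual_id2
```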