Acadia Grenier HW 3

Acadia Grenier

1. Import the dataset Apartments.xlsx

library(readxl)
apartments <- read_excel("~/Desktop/SpringFiles.2025/5708 Applied DA w R/Applied Data Analysis in R - HW3 (data)/Apartments.xlsx")
View(apartments)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2.What could be a possible research question given the data you analyze? (1 p)

A possible research question could be: “Does the age of an apartment (years) impact the price more or less than Distance in km from the city center?”

3. Change categorical variables into factors. (0.5 p)

apartments$ParkingF <- factor(apartments$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
apartments$BalconyF <- factor(apartments$Balcony,
                             levels = c(0,1),
                             labels = c("No","Yes"))

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t.test(apartments$Price, mu = 1900)

## 
##  One Sample t-test
## 
## data:  apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

You would reject the null hypothesis since the p < 0.05, implying that the evidence is statistically significant that the average apartment price is not equal to 1900 EUR. In other words the confidence interval for the true mean does not include 1900.

Now I am checking to see if the normality assumption is violated to determine whether I should use the Wilcoxon Signed-Rank Test

library(ggplot2)
ggplot(apartments, aes(x = Price)) +
         geom_histogram(binwidth = 50, colour = "gray") +
         ylab("Frequency") +
         xlab("Price per m2")

Since the Histogram is relatively normally distributed, we can assume normality. Therefore the Wilcoxon Signed-Rank test is not needed (There is minor asymmetry to the right, but since there’s not major outliers, I think the t-test suffices).

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Estimate of regression coefficient: Price = 2185.455 + -8.975 * Age “If the Age of the apartment increases by 1 year, Price per m2 on average decreases by -8.975 EUR (holding all other explanatory variables constant/unchanged).”
Intercept: 2185.455 EUR “The expected price per m2 for a brand new apartment (Age = 0 years) is 2185.455 EUR, holding all explanatory variables constant/unchanged.”
R^2 (Coefficient of Determination): 0.053 “5.3% of variability of Price is explained by the linear effect of Age.” Despite being statistically significant, Age has a very low R^2, meaning it has a weak effect on the Price per m2 of an apartment
Coefficient of Correlation (R): sq-rt of R^2 = 0.23 “The linear relationship between Price and Age is negative, with a correlation between Price and Age of 0.23 which indicates a weak negative linear relationship.”

6. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(apartments[, c(-4, -5, -6, -7)],
                  smooth = FALSE)

Based on the scatterplot matrix, there doesn’t appear to be a huge problem of multicolinearity. For Age against Distance, the relationship shows low correlation (slight positive trend, spread-out). Additionally for Price against Distance and Price against Age, there is a negative relationship. As Age increases, Price decreases; as Distance increases, Price also decreases. Based on these relationships and on the low correlation between Age and Distance, there is no problem of multicollinearity present.

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Note: I wasn’t sure if I needed to explain the coefficients for this model like we were instructed to on the last one

Estimate of regression coefficient: Price = 2460.1 + -7.93 * Age + -20.67 * Distance “If the Age of the apartment increases by 1 year, Price per m2 on average decreases by -7.93 EUR holding all other explanatory variables constant/unchanged.” “If the Distance from city center of the apartment increases by 1 km, Price per m2 on average decreases by -20.67 EUR holding all other explanatory variables constant/unchanged.”
Intercept: 2460.1 EUR “The expected price per m2 for a brand new apartment in city center (Age = 0 years, Distance from City center = 0km) is 2460.10 EUR, holding all explanatory variables constant/unchanged.”
R^2 (Coefficient of Determination): 0.4396 “43.96% of variability of Price is explained by the linear effect of Age.” Compared to fit1, Distance and Age have a much higher R^2, meaning it has a stronger effect on the Price per m2 of an apartment, but is not very strong.
Coefficient of Correlation (R): sq-rt of R^2 = 0.66 “The linear relationship between Price, Age and Distance is negative, with a correlation between Price and Age and Distance of 0.66 which indicates a weak negative linear relationship.”

8. Chech the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The two explanatory variables have an equal VIF of 1.001845 which indicates nearly no multicollinearity between the predictor variables given the range of VIF (1-5). Since the variables are extremely close to the minimum of 1, there is nearly no correlation between predictor variables meaning they are completely independent of each other. In other words, the predictors share .18% of their variance so they contribute mostly unique information to the regression with very little overlap.

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)

hist(apartments$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of Standardized Residuals")

Given the -/+3 threshold range, no outliers would be dropped as all standardized residuals fall within this range.

shapiro.test(apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  apartments$StdResid
## W = 0.95303, p-value = 0.003645

I also used the Shapiro-Wilk test to see whether my standardized residuals are normally distributed or not. Since the p < 0.05 I would reject the null hypothesis that my residuals are normally distributed despite the w = 0.95 being close to 1. This means the residuals aren’t normally distributed.

hist(apartments$CooksD,
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances")

There are a few observations with Cook’s distances higher ( 0.1, 0.15, and 0.3) than the majority which are below 0.05, indicating some influential points.

head(apartments[order(-apartments$CooksD),], 6)

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32 
## 2    43       37  1740       0       0 No       No           1.44  0.104
## 3     2       11  2790       1       0 Yes      No           2.05  0.069
## 4     7        2  1760       0       1 No       Yes         -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes          1.58  0.061
## 6    40        2  2400       0       1 No       Yes          1.09  0.038

Since the top 2 apartment’s Cooks Distances stand out as higher than the rest, I will remove them using their age as they are the only apartments with the ages 5 and 43.

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

apartments <- apartments %>%
  filter(!Age == "5")

library(dplyr)
apartments <- apartments %>%
  filter(!Age == "43")

fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Here I redefined fit2 to account for the removal of the units of high impact.

10. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (0.5 p)

apartments$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = apartments$StdResid, x = apartments$StdFitted,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted Values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Based on the shape of the Scatterplot, there is apparent heteroskendasticity as the spread along the y-axis gets broader around 1 on the x-axis.

#install.packages("olsrr")
library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

Based on the Breusch Pagan test, this model fails to reject the null hypothesis of homoskendasticity because the standard alpha is above 0.05 (it is 0.052). This means that the variance might not be perfectly constant across all fitted values. Because of how close the standard alpha is to the threshold, further testing is needed.

11. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)

hist(apartments$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of Standardized Residuals")

Based on the shape of the histogram, the distribution appears to be bimodel meaning it doesn’t follow the normal bell-shaped distribution. It also varies from the original fit2 which did not exclude the two units of high impact.

shapiro.test(apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  apartments$StdResid
## W = 0.95952, p-value = 0.01044

The Shapiro-Wilk test provides a p-value of 0.01044 which is less than the ideal significance level of 0.05, therefore we reject the null hypothesis that the standardized residuals are normally distributed.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Since I already removed the units of high impact in Question 9, I will just explain all coefficients here:

Price = 2490.11 + -7.85 * Age + -23.95 * Distance

Intercept: 2490.11 EUR “The expected price per m2 for a brand new apartment in city center (Age = 0 years, Distance from City center = 0km) is 2490.11 EUR, holding all explanatory variables constant/unchanged.” Additionally, the p-value is extremely small, indicating the intercept as statistically significant, and the base value as reliable
Correlation Coefficients Explanations: “If the Age of the apartment increases by 1 year, Price per m2 on average decreases by -7.85 EUR (p<0.05) holding all other explanatory variables constant/unchanged.” “If the Distance from city center of the apartment increases by 1 km, Price per m2 on average decreases by -23.95 EUR (p<0.01) holding all other explanatory variables constant/unchanged.” (Both are statistically significant, with Distance having the smallest p-value)
R^2 (Coefficient of Determination): 0.4968 “49.68% of variability of Price is explained by the linear effect of Age.” Compared to the original fit2, Distance and Age have a slightly higher R^2, meaning it has a stronger effect on the Price per m2 of an apartment, but is not very strong.
Coefficient of Correlation (R): sq-rt of R^2 = 0.70 “The linear relationship between Price, Age and Distance is negative, with a correlation between Price and Age and Distance of 0.70 which indicates a strong negative linear relationship.”

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFYes  168.921     62.166   2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (0.02813) is less than the significance level of 0.05, we reject the null hypothesis that Model 1 (fit2) fits the data as well as Model 2. This indicates that Model 2 (fit3), which includes the dummy variables: ParkingF and BalconyF, fits the data far better than Model 1 (fit2). Therefore, the inclusion of the parking and balcony factors improves the model’s fit significantly.

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFYes  168.921     62.166   2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

ParkingFYes: 168.92 EUR “Given the values of other explanatory variables, then apartments with parking have a higher price by 168.92 EUR on average compared to aparments without parking (p<0.01).”
BalconyFYes: -6.99 “Given the values of other explanatory variables, then apartments with Balconies have a lower price by 6.99 EUR on average compared to aparments without Balconies.” The p-value is very high ( p-value = 0.91) indicating this is not a statistically significant effect.

Test of Significance of Regression: H0: p^2 = 0 H1: p^2 > 0

Since the p-value of the the F-statistic is very small (1.449e-12) and less than 0.05. We reject the null hypothesis at reject p < 0.001. By rejecting the null we know that the model is statistically significant, and at least one of the predictor variables has a statistically significant relationship with Price.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

apartments <- apartments %>%
  mutate(ID = row_number())

apartments$Fitted <- fitted.values(fit3)
apartments$Residuals <- residuals(fit3)
head(apartments[colnames(apartments) %in% c("ID", "Price","Fitted","Residuals")])

## # A tibble: 6 × 4
##   Price    ID Fitted Residuals
##   <dbl> <int>  <dbl>     <dbl>
## 1  1640     1  1707.     -66.8
## 2  2800     2  2377.     423. 
## 3  1660     3  1714.     -53.8
## 4  1850     4  1534.     316. 
## 5  1640     5  2009.    -369. 
## 6  1770     6  1896.    -126.

ID2 is the second on the list and shows that the actual price of ID2 is 422.96 EUR higher than the price that my model predicted in the fit3 model.