HW 3

GRACE KELLEY

1. Import the dataset Apartments.xlsx

library(readxl)
apartments <- read_excel("Apartments.xlsx")
head(apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

What is the relationship between the age of an apartment and its price per m2?

3. Change categorical variables into factors. (0.5 p)

apartments$Balcony <- as.numeric(apartments$Balcony)
apartments$Parking <- as.numeric(apartments$Parking)

apartments$Balcony <- factor(apartments$Balcony,
                             levels=c(0,1),
                             labels = c("No", "Yes"))

apartments$Parking <- factor(apartments$Parking,
                             levels=c(0,1),
                             labels = c("No", "Yes"))
str(apartments)

## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
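
To illustrate what this conversion means for the later regressions, the sketch below uses a small made-up vector (not the apartment data) and shows how R expands a two-level factor into a single 0/1 dummy column. The object names toy and X are hypothetical and chosen only for this example.

```r
# Hypothetical toy data, not taken from Apartments.xlsx
toy <- data.frame(Parking = c(0, 1, 1, 0))

# Same conversion as above: 0 -> "No", 1 -> "Yes"
toy$Parking <- factor(toy$Parking, levels = c(0, 1), labels = c("No", "Yes"))

# The first level ("No") is the reference category
levels(toy$Parking)

# Design matrix that lm() would build: one dummy column named ParkingYes
X <- model.matrix(~ Parking, data = toy)
colnames(X)
```

This is why a regression on a factor reports a coefficient called ParkingYes: it is the estimated difference between "Yes" and the reference level "No".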

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t_test <- t.test(apartments$Price, mu = 1900)
print(t_test)

## 
##  One Sample t-test
## 
## data:  apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Based on the two-sided one-sample t-test, the following can be concluded:

The test statistic is t = 2.9022 with 84 degrees of freedom, giving a p-value of 0.004731, which is below the conventional significance level of 0.05.
- Therefore, we reject the null hypothesis H0: μ_Price = 1900 EUR.
The sample mean price is 2018.941 EUR, which is higher than the hypothesized value of 1900 EUR.
The 95% confidence interval for the true mean price is (1937.443, 2100.440) EUR, which does not include 1900 EUR.
- This is statistically significant evidence that the population mean apartment price differs from 1900 EUR.
- Because the whole confidence interval lies above 1900 EUR, the mean price is significantly higher than 1900 EUR.
- In practical terms, the sample mean is about 119 EUR above the hypothesized value of 1900 EUR.
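
The t statistic and p-value reported by t.test() can be reproduced by hand. The sketch below does this on simulated prices, because the raw data are not reproduced here; the vector price and its parameters are assumptions for illustration only.

```r
# Simulated prices (assumption: illustrative values only, not the real data)
set.seed(1)
price <- rnorm(85, mean = 2000, sd = 380)
mu0 <- 1900  # hypothesized mean from H0

# t statistic: distance of the sample mean from mu0 in standard-error units
n <- length(price)
t_manual <- (mean(price) - mu0) / (sd(price) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# Both values agree with the built-in test
res <- t.test(price, mu = mu0)
c(t_manual = t_manual, t_builtin = unname(res$statistic))
c(p_manual = p_manual, p_builtin = res$p.value)
```

By default t.test() uses a two-sided alternative; a one-sided test would require setting the alternative argument to "greater" or "less".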

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

correlation <- cor(apartments$Age, apartments$Price)
print(paste("Correlation coefficient:", correlation))

## [1] "Correlation coefficient: -0.230255017823335"

Regression Coefficient:
- The coefficient for Age is -8.975, which is statistically significant (p = 0.034 < 0.05).
- This means that for each additional year of apartment age, the price decreases by approximately 8.98 EUR on average; in other words, older apartments tend to be less expensive.
Coefficient of Correlation:
- The correlation coefficient is -0.2303.
- The negative sign confirms the inverse relationship between Age and Price - as apartments get older, their prices tend to decrease.
  - The magnitude of 0.23 indicates a relatively weak correlation between these variables.
Coefficient of Determination:
- The Multiple R-squared value is 0.05302, meaning that only around 5.3% of the variability in apartment prices can be explained by differences in age.
- This is a low value, suggesting that while age has a statistically significant effect on price, it’s not a strong predictor on its own.
- The adjusted R-squared (0.04161) is slightly lower because it applies a penalty for the number of explanatory variables in the model.
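
In a simple regression the three quantities are linked: R-squared is the squared correlation coefficient, and the slope equals r multiplied by sd(y) / sd(x). The sketch below first checks the reported numbers and then verifies both identities on simulated data; the simulated age and price vectors are assumptions for illustration only.

```r
# Correlation coefficient reported above
r <- -0.230255017823335

# In a simple regression, R-squared equals the squared correlation
round(r^2, 5)

# Check the identities on simulated data (assumption: illustrative values only)
set.seed(2)
age <- runif(85, min = 1, max = 45)
price <- 2200 - 9 * age + rnorm(85, sd = 370)
fit <- lm(price ~ age)

r_sim <- cor(age, price)
c(r_squared = summary(fit)$r.squared, cor_squared = r_sim^2)

# The slope equals r * sd(y) / sd(x)
c(slope = unname(coef(fit)["age"]), from_r = r_sim * sd(price) / sd(age))
```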

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

#install.packages("GGally")
library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# Create scatterplot matrix for Price, Age, and Distance
pairs(apartments[, c("Price", "Age", "Distance")], 
      main = "Scatterplot Matrix of Price, Age, and Distance")

ggpairs(apartments, columns = c("Price", "Age", "Distance"))

The matrices show no strong linear relationship between the two explanatory variables, Age and Distance, which suggests that multicollinearity is not a problem.
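
A numeric correlation matrix can back up the visual impression from the plots. The sketch below uses a simulated stand-in for the apartment data; the data frame sim and all of its values are assumptions for illustration only.

```r
# Simulated stand-in for the apartment data (assumption: illustrative only)
set.seed(3)
sim <- data.frame(
  Age      = runif(85, min = 1, max = 45),
  Distance = runif(85, min = 1, max = 45)
)
sim$Price <- 2450 - 7 * sim$Age - 20 * sim$Distance + rnorm(85, sd = 280)

# Pairwise Pearson correlations for the three variables
cor_matrix <- cor(sim[, c("Price", "Age", "Distance")])
round(cor_matrix, 3)

# The entry relevant for multicollinearity is the one between the two predictors
cor_matrix["Age", "Distance"]
```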

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apartments)

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

library(car)

## Loading required package: carData

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated by correlation between that explanatory variable and the other explanatory variables:
- A VIF of 1 indicates no correlation with the other explanatory variables
- A VIF between 1 and 5 indicates moderate correlation
- A VIF above 5 indicates strong correlation, which signals multicollinearity and suggests that a variable may need to be dropped
Both VIF values are approximately 1.0, which confirms that Age and Distance are essentially uncorrelated and that multicollinearity is not a problem.
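
With only two explanatory variables, the VIF has a simple closed form, VIF = 1 / (1 - r^2), where r is the correlation between the two predictors. The sketch below inverts this formula for the reported VIF and then recomputes a VIF by hand on simulated predictors; the simulated vectors are assumptions for illustration only.

```r
# VIF reported above for both predictors
vif_reported <- 1.001845

# With two predictors, VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r2_implied <- 1 - 1 / vif_reported
r_implied <- sqrt(r2_implied)  # absolute correlation between Age and Distance
round(c(r2 = r2_implied, abs_r = r_implied), 4)

# Same formula on simulated predictors (assumption: illustrative values only)
set.seed(4)
age <- runif(85, min = 1, max = 45)
distance <- runif(85, min = 1, max = 45)
vif_manual <- 1 / (1 - summary(lm(age ~ distance))$r.squared)
vif_manual
```

The implied absolute correlation between Age and Distance is only about 0.04, which is consistent with the conclusion that the two predictors are nearly uncorrelated.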

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

# Calculate standardized residuals
standardized_residuals <- rstandard(fit2)

# Calculate Cook's distances
cooks_distances <- cooks.distance(fit2)

# Cook's distance: Generally, values above 0.5 or 1 indicate high influence (threshold = 1)
# Standardized residuals: Values exceeding ±2 or ±3 are considered outliers (threshold = 2)

# Create a logical vector to identify problematic units
problematic_units <- abs(standardized_residuals) > 2 | cooks_distances > 1

# Remove problematic units from the dataset
cleaned_data <- apartments[!problematic_units, ]

# Fit the model again with the cleaned dataset
fit2_cleaned <- lm(Price ~ Age + Distance, data = cleaned_data)

# View the summary of the new regression model
summary(fit2_cleaned)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = cleaned_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -404.0 -230.9  -51.4  190.6  504.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2455.768     73.296  33.505  < 2e-16 ***
## Age           -6.011      3.086  -1.948    0.055 .  
## Distance     -23.543      2.665  -8.834 2.05e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared:  0.5179, Adjusted R-squared:  0.5057 
## F-statistic: 42.44 on 2 and 79 DF,  p-value: 3.042e-13
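
Before removing units it can help to list which rows were flagged and why. The sketch below applies the same rule to simulated data with one deliberately planted outlier; the data frame sim, the planted row and all values are assumptions for illustration only.

```r
# Simulated data with one planted outlier (assumption: illustrative only)
set.seed(5)
sim <- data.frame(Age = runif(60, 1, 45), Distance = runif(60, 1, 45))
sim$Price <- 2450 - 6 * sim$Age - 23 * sim$Distance + rnorm(60, sd = 250)
sim$Price[10] <- sim$Price[10] + 3000  # make row 10 clearly unusual

fit <- lm(Price ~ Age + Distance, data = sim)

# Same rule as above: |standardized residual| > 2 or Cook's distance > 1
flagged <- abs(rstandard(fit)) > 2 | cooks.distance(fit) > 1

# Row numbers and diagnostics of the flagged units
which(flagged)
data.frame(std_resid = rstandard(fit), cooks_d = cooks.distance(fit))[flagged, ]

# Number of units kept after removal
sum(!flagged)
```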

10. Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

# Calculate standardized residuals
standardized_residuals <- rstandard(fit2)

# Calculate standardized fitted values (fitted values scaled to mean 0 and sd 1)
standardized_fitted_values <- as.numeric(scale(fitted(fit2)))

# Create scatterplot
plot(standardized_fitted_values, standardized_residuals,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Scatterplot to Check for Heteroskedasticity")
abline(h = 0, col = "plum", lty = 2) # Adds horizontal line at residual = 0
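
A formal test can complement the visual check. The sketch below computes the studentized (Koenker) Breusch-Pagan statistic by hand in base R on simulated data whose error spread grows with the predictor; the variables x and y are assumptions for illustration only, and this test is an addition to the scatterplot, not part of the analysis above.

```r
# Simulated data whose error spread grows with x (assumption: illustrative only)
set.seed(6)
x <- runif(200, min = 1, max = 30)
y <- 2000 - 20 * x + rnorm(200, sd = 10 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the explanatory variable
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression
bp_stat <- length(x) * summary(aux)$r.squared
bp_df <- 1  # number of explanatory variables in the auxiliary regression
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would point to heteroskedasticity. The lmtest package offers bptest() for the same purpose if installing an extra package is acceptable.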

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

# Get standardized residuals (taken from fit1, the simple regression Price ~ Age)
std_residuals <- rstandard(fit1)

# Visual check using histogram
hist(std_residuals, 
     breaks = 20, 
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     freq = FALSE)
curve(dnorm(x), add = TRUE, col = "plum", lwd = 2)

# Q-Q plot
qqnorm(std_residuals)
qqline(std_residuals, col = "hotpink")

# Formal test for normality using Shapiro-Wilk test
shapiro_test <- shapiro.test(std_residuals)
print(shapiro_test)

## 
##  Shapiro-Wilk normality test
## 
## data:  std_residuals
## W = 0.94844, p-value = 0.001935

Visual Results:

Histogram of Standardized Residuals:
- The histogram shows some departure from normality
- Some irregularities in the distribution with multiple peaks
Q-Q Plot:
- Points should follow the diagonal reference line if the residuals are normally distributed
- There are noticeable deviations at both tails (below -1.5 and above +1.5)
- The pattern shows a slight S-shape, with points deviating from the line at both extremes
- This suggests heavier tails than would be expected in a normal distribution

Formal Test (Shapiro-Wilk):

W = 0.94844, p-value = 0.001935
Since the p-value (0.001935) is less than 0.05, we reject the null hypothesis of normality.
This supports the visual impression that the residuals are not normally distributed.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

# Re-estimate fit2 (Price ~ Age + Distance) with the cleaned dataset
fit2_cleaned <- lm(Price ~ Age + Distance, data = cleaned_data)

# Show summary of the new model
summary(fit2_cleaned)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = cleaned_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -404.0 -230.9  -51.4  190.6  504.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2455.768     73.296  33.505  < 2e-16 ***
## Age           -6.011      3.086  -1.948    0.055 .  
## Distance     -23.543      2.665  -8.834 2.05e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared:  0.5179, Adjusted R-squared:  0.5057 
## F-statistic: 42.44 on 2 and 79 DF,  p-value: 3.042e-13

Intercept (2455.768):
- This is the estimated price when both Age and Distance are zero.
  - According to this model, a brand-new apartment in the city center is predicted to cost about 2455.77 EUR.
- The t value (33.505) and Pr(>|t|) (< 2e-16) indicate that the intercept is statistically significant (p < 0.001), meaning that it is reliably different from zero.
Age (-6.011):
- This is the estimated change in price for each one-year increase in the apartment’s age, holding Distance constant.
  - The price tends to decrease by approximately 6.01 EUR for each additional year of age.
- The t value (-1.948) and Pr(>|t|) (0.055) indicate that the coefficient is not statistically significant at the 0.05 level, although it is significant at the 0.1 level.
  - This suggests a tendency for older apartments to be cheaper, but with less certainty.
Distance (-23.543):
- This is the estimated change in price for each additional kilometer from the city center, holding Age constant.
  - The price tends to decrease by approximately 23.54 EUR for each additional kilometer.
- The t value (-8.834) and Pr(>|t|) (2.05e-13) indicate that the coefficient is highly statistically significant (p < 0.001).
Residual standard error (262.6):
- This value represents the typical amount by which the observed apartment prices deviate from the prices predicted by the model. The closer it is to 0, the more accurate the model’s predictions are.
  - A value of 262.6 EUR means that predictions still miss the observed prices by a noticeable margin.
Multiple R-squared (0.5179):
- This value is the proportion of the variance in apartment prices that is explained by Age and Distance together.
- About 51.8% of the variability in price is explained by the model.
- The adjusted R-squared, which adjusts for the number of explanatory variables in the model, is slightly lower at 0.5057.
F-statistic (42.44) and p-value (3.042e-13):
- The F-statistic tests the overall significance of the regression model.
  - The associated p-value (3.042e-13) is far below 0.05, so the model as a whole is statistically significant.
    - We reject the null hypothesis that the coefficients of Age and Distance are both zero.

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova_result <- anova(fit2, fit3)
print(anova_result)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual Degrees of Freedom:
- Model 1 has 82 degrees of freedom, and Model 2 has 80.
  - The decrease in degrees of freedom (82 - 80 = 2) corresponds to the two additional predictors (Parking and Balcony) in Model 2.
Residual Sum of Squares (RSS):
- Model 1 has an RSS of 6720983, and Model 2 has an RSS of 5991088. The RSS is lower for Model 2, indicating that it explains more of the variation in “Price.”
F-statistic:
- The F-statistic is 4.8732 on 2 and 80 degrees of freedom. It tests whether the coefficients of the two added variables (Parking and Balcony) are jointly equal to zero.
P-value:
- The p-value (0.01007) is less than 0.05, which means that the improvement in the model’s fit when adding “Parking” and “Balcony” is statistically significant.
  - Therefore, Model 2 (Price ~ Age + Distance + Parking + Balcony) fits the data significantly better than Model 1 (Price ~ Age + Distance).
    - This means that Parking and Balcony jointly add statistically significant information to the model when predicting price.
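
The F-statistic and p-value in the ANOVA table can be recomputed from the residual sums of squares and degrees of freedom shown there. The sketch below does this with the values copied from the table; the variable names are chosen only for this example.

```r
# Values copied from the ANOVA table above
rss_fit2 <- 6720983  # residual sum of squares, Price ~ Age + Distance
rss_fit3 <- 5991088  # residual sum of squares, model with Parking and Balcony
df_fit2 <- 82
df_fit3 <- 80

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
q <- df_fit2 - df_fit3
f_stat <- ((rss_fit2 - rss_fit3) / q) / (rss_fit3 / df_fit3)

# Upper-tail probability of the F distribution with q and df_fit3 degrees of freedom
p_value <- pf(f_stat, df1 = q, df2 = df_fit3, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```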

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Explanation of Coefficients:

Intercept (2301.667):
- The estimated price of an apartment when Age and Distance are zero, and Parking and Balcony are at their base level, “No”, is 2301.67 EUR.
Age (-6.799):
- For each additional year of age, the price of the apartment is estimated to decrease by 6.799 EUR, holding all other variables constant.
Distance (-18.045):
- For each additional kilometer of distance from the city center, the price of the apartment is estimated to decrease by 18.045 EUR, holding all other variables constant.
ParkingYes (196.168):
- Apartments with parking (“Yes”) are estimated to be 196.168 EUR more expensive than apartments without parking (“No”), holding all other variables constant. This difference is statistically significant (p = 0.00251).
BalconyYes (1.935):
- Apartments with a balcony (“Yes”) are estimated to be 1.935 EUR more expensive than apartments without a balcony (“No”), holding all other variables constant. This difference is not statistically significant (p = 0.97436), so there is no evidence that a balcony affects the price.

Hypothesis Tested by the F-statistic:

The F-statistic at the bottom of the output (20.03 on 4 and 80 degrees of freedom) tests the overall significance of the regression model.
Null Hypothesis (H0):
- All regression coefficients (Age, Distance, ParkingYes, BalconyYes) are equal to zero. This means that none of the predictors has a linear relationship with the price of an apartment.
Alternative Hypothesis (H1):
- At least one of the regression coefficients is not equal to zero. This means that at least one predictor has a linear relationship with the price of an apartment.
P-value (1.849e-11):
- The p-value associated with the F-statistic is extremely small, much less than 0.05.
  - This indicates strong evidence against the null hypothesis.
  - Therefore, we reject the null hypothesis and conclude that at least one of the predictors is significantly related to Price.
In short, the F-test tells us whether the model, as a whole, is statistically significant.
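
The overall F-statistic can also be recovered from the Multiple R-squared in the summary(fit3) output. The sketch below uses the rounded values printed there (R-squared 0.5004, 4 predictors, 80 residual degrees of freedom), so the result matches the reported statistic only up to rounding; the variable names are chosen only for this example.

```r
# Values copied from the summary(fit3) output
r2 <- 0.5004    # Multiple R-squared
k <- 4          # number of slope coefficients
df_resid <- 80  # residual degrees of freedom

# Overall F: explained variance per predictor over unexplained variance per df
f_overall <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution with k and df_resid degrees of freedom
p_overall <- pf(f_overall, df1 = k, df2 = df_resid, lower.tail = FALSE)

c(F = round(f_overall, 2), p = signif(p_overall, 3))
```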

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

fitted_values <- fitted(fit3)
residual_id2 <- apartments$Price[2] - fitted_values[2]
print(paste("Residual for Apartment ID 2:", residual_id2))

## [1] "Residual for Apartment ID 2: 442.58892757391"
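
The same residual can be checked by hand from the rounded fit3 coefficients and the values for apartment ID 2 shown in the head(apartments) output (Age 18, Distance 1, Price 2800, Parking yes, Balcony no). The names b and x_id2 are chosen only for this sketch, and the result differs from the value above in the third decimal because the coefficients are rounded.

```r
# Coefficients copied from the summary(fit3) output
b <- c(intercept = 2301.667, age = -6.799, distance = -18.045,
       parking_yes = 196.168, balcony_yes = 1.935)

# Apartment ID 2: intercept term, Age 18, Distance 1, Parking yes (1), Balcony no (0)
x_id2 <- c(1, 18, 1, 1, 0)
price_id2 <- 2800

# Fitted value is the sum of coefficient * predictor value
fitted_id2 <- sum(b * x_id2)

# Residual = observed price - fitted price
residual_manual <- price_id2 - fitted_id2
c(fitted = fitted_id2, residual = residual_manual)
```

Calling residuals(fit3)[2] on the fitted model returns the residual directly.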