Task 2

First, we import the file Business School.xlsx with the library readxl and display the first five units in the dataset.

library(readxl)
task2data <- read_xlsx(path = "Business School.xlsx")
head(task2data, 5)
## # A tibble: 5 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade` `Work Experience` `Employability (Before)`
##          <dbl> <chr>                          <dbl>       <dbl> <chr>                                <dbl>
## 1            1 Business                        68.4        90.2 No                                     252
## 2            2 Computer Science                70.2        68.7 Yes                                    101
## 3            3 Finance                         76.4        83.3 No                                     401
## 4            4 Business                        82.6        88.7 No                                     287
## 5            5 Finance                         76.9        75.4 No                                     275
## # ℹ 3 more variables: `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
summary(task2data)
##    Student ID     Undergrad Degree   Undergrad Grade    MBA Grade     Work Experience    Employability (Before)
##  Min.   :  1.00   Length:100         Min.   : 61.20   Min.   :58.14   Length:100         Min.   :101.0         
##  1st Qu.: 25.75   Class :character   1st Qu.: 71.47   1st Qu.:71.14   Class :character   1st Qu.:245.8         
##  Median : 50.50   Mode  :character   Median : 76.65   Median :76.38   Mode  :character   Median :256.8         
##  Mean   : 50.50                      Mean   : 76.90   Mean   :76.04                      Mean   :257.9         
##  3rd Qu.: 75.25                      3rd Qu.: 81.70   3rd Qu.:82.15                      3rd Qu.:261.0         
##  Max.   :100.00                      Max.   :100.00   Max.   :95.00                      Max.   :421.0         
##  Employability (After)    Status          Annual Salary   
##  Min.   :119.0         Length:100         Min.   : 20000  
##  1st Qu.:312.0         Class :character   1st Qu.: 87125  
##  Median :435.6         Mode  :character   Median :103500  
##  Mean   :422.7                            Mean   :109058  
##  3rd Qu.:529.0                            3rd Qu.:124000  
##  Max.   :631.0                            Max.   :340000

Description: we first look at the distribution of undergraduate degrees with a bar chart.

library(ggplot2)

# bar chart of undergraduate degrees, ordered by frequency (descending)
ggplot(task2data, aes(x = reorder(`Undergrad Degree`, -table(`Undergrad Degree`)[`Undergrad Degree`]),
                      fill = `Undergrad Degree`)) +
  geom_bar(colour = "black") +
  xlab("Degree Type") +
  ylab("Number of Students") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set1")

The most common undergrad degree is Business.
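
For reference, the exact counts behind the bar chart can be checked with a frequency table (a quick base R check):

sort(table(task2data$`Undergrad Degree`), decreasing = TRUE)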

ggplot(task2data, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 15000, fill = "steelblue", colour = "black") +
  xlab("Annual Salary ($)") +
  ylab("Count of Students") +
  ggtitle("Annual Salaries")

The distribution of annual salaries is right-skewed (asymmetrical to the right), with a few outliers at the high end of the salary range.
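
A quick numerical check of the skew: the mean exceeds the median (109058 > 103500, see the summary above), and the boxplot rule flags the high salaries as outliers:

mean(task2data$`Annual Salary`) > median(task2data$`Annual Salary`)  # TRUE: mean above median suggests right skew
boxplot.stats(task2data$`Annual Salary`)$out                         # salaries flagged as outliers (1.5 * IQR rule)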

Null hypothesis: the average MBA grade equals 74

To test the hypothesis \(H_0: \mu_{MBA\ grade} = 74\) we run a two-sided one-sample t-test.

t.test(task2data$`MBA Grade`, 
       mu = 74, 
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  task2data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

The results tell us that we can reject the null hypothesis at the 5% significance level: the p-value is \(0.00915 < 0.05\), and the 95% confidence interval of \([74.517, 77.563]\) does not include the value 74. The sample mean of 76.04 indicates that the average MBA grade is higher than 74.
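
As a sanity check, the t statistic can be reproduced by hand from its definition \(t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\):

x <- task2data$`MBA Grade`
(mean(x) - 74) / (sd(x) / sqrt(length(x)))  # should match the reported t = 2.6587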

Task 3

First, we import the file Apartments.xlsx into our notebook with the library readxl and display the first five units in the dataset.

library(readxl)
task3data <- read_xlsx(path = "Apartments.xlsx")
head(task3data, 5)
## # A tibble: 5 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1

Description of the dataset: each unit is an apartment, described by its Age (years), Distance from the city centre (km), Price (€/m²), and the indicator variables Parking and Balcony (0 = No, 1 = Yes).

We will change the categorical variables of Parking and Balcony into factors.

task3data$ParkingF <- factor(task3data$Parking, levels = c(0,1), labels = c("No", "Yes"))
task3data$BalconyF <- factor(task3data$Balcony, levels = c(0,1), labels = c("No", "Yes"))

Next, we test the hypothesis \(H_0: \mu_{price} = 1900€\).

t.test(task3data$Price, mu = 1900, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  task3data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

With a p-value of \(0.004731 < 0.05\) we reject the null hypothesis at the 5% significance level. The 95% confidence interval of \([1937, 2100]\) does not include the price of 1900€. The average price is above 1900€.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(task3data$Price ~ task3data$Age)
summary(fit1)
## 
## Call:
## lm(formula = task3data$Price ~ task3data$Age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2185.455     87.043  25.108   <2e-16 ***
## task3data$Age   -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The intercept of the regression function Price = f(Age) is 2185.5, meaning that a new apartment (Age = 0) on average costs 2185.5€/m². The regression coefficient of -8.975 tells us that the price of an apartment falls on average by 8.975€/m² for every year of age. The coefficient of determination \(R^2 = 0.053\) shows that Age alone explains only about 5.3% of the variation in price. Since this is a simple regression, the coefficient of correlation is \(r = -\sqrt{0.053} \approx -0.230\) (negative because the slope is negative), indicating a weak negative linear relationship.
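
In a simple regression the correlation coefficient can be recovered as the square root of \(R^2\) with the sign of the slope; a quick check:

cor(task3data$Price, task3data$Age)    # should be close to -0.230
-sqrt(summary(fit1)$r.squared)         # -sqrt(R^2), same value since the slope is negative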

Scatterplot matrix between Price, Age and Distance.

pairs(task3data[, c("Price", "Age", "Distance")],
      main = "Scatterplot matrix: Price, Age, Distance",
      labels = c("Price", "Age", "Distance"),
      pch = 19, col = "steelblue")

Based on the scatterplot matrix, we do not see any strong linear relationship (and hence no sign of multicollinearity) between Age and Distance.
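
The visual impression can be complemented with the pairwise correlation coefficients (a quick numerical check):

round(cor(task3data[, c("Price", "Age", "Distance")]), 3)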

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(task3data$Price ~ task3data$Age + task3data$Distance)
summary(fit2)
## 
## Call:
## lm(formula = task3data$Price ~ task3data$Age + task3data$Distance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2460.101     76.632   32.10  < 2e-16 ***
## task3data$Age        -7.934      3.225   -2.46    0.016 *  
## task3data$Distance  -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
#plot(fit2, which=1)

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
## Loading required package: carData
vif(fit2)
##      task3data$Age task3data$Distance 
##           1.001845           1.001845

With VIF values close to 1 (1.001845) for both Age and Distance, we can conclude that there is no multicollinearity between the two independent variables.
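
With only two explanatory variables the VIFs are necessarily equal, since \(VIF_j = 1/(1 - R_j^2)\) and both auxiliary regressions share the same \(R^2\) (the squared correlation between Age and Distance). A minimal check by hand:

r2_aux <- summary(lm(Age ~ Distance, data = task3data))$r.squared
1 / (1 - r2_aux)  # should reproduce the reported VIF of 1.001845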

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

task3data$rstd <- round(rstandard(fit2), 3)
#shapiro.test(task3data$rstd)

hist(task3data$rstd, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col  = "steelblue")

From the histogram of standardized residuals we can see that most units lie in the range of \([-2, 2]\), with only a few going beyond that. That should be acceptable.
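
How many units actually fall outside the \([-2, 2]\) band can be counted directly:

sum(abs(task3data$rstd) > 2)  # number of units with |standardized residual| > 2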

task3data$cooksd <- round(cooks.distance(fit2), 3)

hist(task3data$cooksd, 
     xlab = "Cook's distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances",
     col  = "steelblue")

head(task3data[order(-task3data$cooksd),], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF  rstd cooksd
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>    <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes       2.58  0.32 
## 2    43       37  1740       0       0 No       No        1.44  0.104
## 3     2       11  2790       1       0 Yes      No        2.05  0.069
## 4     7        2  1760       0       1 No       Yes      -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes       1.58  0.061
## 6    40        2  2400       0       1 No       Yes       1.09  0.038

From the histogram of Cook’s distances we can see that most units have a Cook’s distance of at most \(0.15\), with one isolated unit above \(0.30\). We will remove units with a Cook’s distance greater than \(4/n\), where \(n = 85\) (a cutoff of \(4/85 \approx 0.047\)).

task3data_clean <- task3data[-which(task3data$cooksd > 4/85), ]
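
As a sanity check, we can count how many units remain; the 77 residual degrees of freedom reported in the fit2_clean summary below (with three estimated coefficients) imply 80 units, so 5 of the original 85 were removed:

nrow(task3data_clean)  # 80 units remain of the original 85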

fit2_clean <- lm(Price ~ Age + Distance, data = task3data_clean)

task3data_clean$cooksd <- round(cooks.distance(fit2_clean), 3)

hist(task3data_clean$cooksd, 
     xlab = "Cook's distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances without influential units",
     col  = "steelblue")

After removing the influential units and outliers and refitting the model, the histogram of Cook's distances improves markedly: no unit stands out with a disproportionately large influence any more.

Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values.

# standardized fitted values
fitted_std <- scale(fitted(fit2_clean), center = TRUE, scale = TRUE)

# standardized residuals of the refitted model (the rstd column carried over
# from task3data still holds the fit2 values, so we recompute it here)
task3data_clean$rstd <- round(rstandard(fit2_clean), 3)

# plot
plot(fitted_std, task3data_clean$rstd,
     xlab = "Standardized fitted values",
     ylab = "Standardized residuals",
     main = "Standardized residuals vs Standardized fitted values",
     pch = 19, col = "steelblue")
abline(h = 0, lty = 2)

# formal test
#install.packages("olsrr") # uncomment if needed
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2_clean)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

From the Breusch-Pagan test for heteroskedasticity we read a \(\chi^2\) statistic of \(1.74\) with a p-value of \(0.187\). Since \(0.187 > 0.05\), we fail to reject the null hypothesis of constant variance at the 5% significance level. There is no strong evidence of heteroskedasticity.
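
As an optional cross-check, the same test is available as bptest() in the lmtest package; it studentizes the test by default, so the statistic may differ slightly from the olsrr version, but the conclusion should be the same:

#install.packages("lmtest") # uncomment if needed
library(lmtest)
bptest(fit2_clean)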

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

task3data_clean$rstd <- round(rstandard(fit2_clean), 3)

hist(task3data_clean$rstd, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals without outliers",
     col  = "steelblue")

shapiro.test(task3data_clean$rstd)
## 
##  Shapiro-Wilk normality test
## 
## data:  task3data_clean$rstd
## W = 0.94154, p-value = 0.001166

After removing outliers and the most influential units, the histogram of standardized residuals shows that they are not normally distributed: the distribution is right-skewed (asymmetrical to the right). The Shapiro-Wilk test gives a p-value of 0.001166; as this is smaller than 0.05, we reject the null hypothesis that the standardized residuals are normally distributed.
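
A normal Q-Q plot makes the departure from normality easier to judge than the histogram; points bending away from the reference line in the right tail would reflect the skewness:

qqnorm(task3data_clean$rstd, main = "Normal Q-Q plot of standardized residuals",
       pch = 19, col = "steelblue")
qqline(task3data_clean$rstd, lty = 2)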

Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients.

summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = task3data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
#plot(fit2_clean, which=1)

The estimated price of a new apartment (Age = 0) in the city centre (Distance = 0) is 2502€/m², and the price falls linearly with both the age of the apartment and the distance from the city centre. Distance being equal, an apartment on average loses 8.674€/m² of value for every year of age. Age being equal, an apartment on average loses 24.063€/m² for every additional kilometre from the city centre.
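
As a worked example with hypothetical values (not a unit from the data), the predicted price of a 10-year-old apartment 5 km from the centre is \(2502.467 - 8.674 \cdot 10 - 24.063 \cdot 5 \approx 2295\)€/m²:

# hypothetical apartment: Age = 10 years, Distance = 5 km
predict(fit2_clean, newdata = data.frame(Age = 10, Distance = 5))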

Estimate the linear regression function Price = f(Age, Distance, Parking, Balcony). Be careful to correctly include the categorical variables. Save it in an object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = task3data_clean)

With the function anova(), check whether model fit3 fits the data better than model fit2.

anova(fit2_clean, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

As the p-value equals \(0.1135 > 0.05\), we cannot reject the null hypothesis that the coefficients of Parking and Balcony are both zero. Adding these variables does not significantly improve the fit: fit3 does not explain significantly more of the variation in price per square metre than fit2_clean.
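
The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares, \(F = \frac{(RSS_1 - RSS_2)/2}{RSS_2/75}\):

((5077362 - 4791128) / 2) / (4791128 / 75)  # 2.2403, as reported by anova()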

Show the results of fit3 and explain the regression coefficients of both categorical variables. Can you write down the hypothesis that is being tested with the F-statistic, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = task3data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Keeping age, distance and balcony constant, having a parking space raises the price per square meter by 128.7€ on average (p = 0.0376). Keeping age, distance and parking constant, having a balcony raises the price per square meter by only 6€ on average, an effect that is not statistically significant (p = 0.9165). In fit3, age and distance affect the average price per square meter slightly less than in fit2_clean.

The F-statistic tests the null hypothesis that all regression coefficients (other than the intercept) are equal to zero, \(H_0: \beta_{Age} = \beta_{Distance} = \beta_{Parking} = \beta_{Balcony} = 0\), i.e. that the model has no explanatory power. With a p-value approaching 0, we reject this hypothesis: the regressors jointly explain a significant part of the variation in apartment prices.
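
The reported F statistic follows directly from the \(R^2\) of the model, \(F = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}\), with \(k = 4\) regressors and \(n = 80\) units:

R2 <- 0.5623
(R2 / 4) / ((1 - R2) / 75)  # approximately 24.08, matching the summary output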

Save the fitted values and calculate the residual for the apartment with ID 2.

task3data_clean$Fitted <- fitted(fit3)
task3data_clean$Residuals <- residuals(fit3)

task3data_clean[2, c("Fitted", "Residuals")]
## # A tibble: 1 × 2
##   Fitted Residuals
##    <dbl>     <dbl>
## 1  2357.      443.
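
As a quick check, the residual is simply the observed price minus the fitted value:

task3data_clean$Price[2] - task3data_clean$Fitted[2]  # should equal the residual of about 443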