First, we import the file Business School.xlsx and
display the first few units in the dataset.
library(readxl)
task2data <-read_xlsx(path= "Business School.xlsx")
head(task2data, 5)
## # A tibble: 5 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade` `Work Experience` `Employability (Before)`
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1 Business 68.4 90.2 No 252
## 2 2 Computer Science 70.2 68.7 Yes 101
## 3 3 Finance 76.4 83.3 No 401
## 4 4 Business 82.6 88.7 No 287
## 5 5 Finance 76.9 75.4 No 275
## # ℹ 3 more variables: `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
summary(task2data)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before)
## Min. : 1.00 Length:100 Min. : 61.20 Min. :58.14 Length:100 Min. :101.0
## 1st Qu.: 25.75 Class :character 1st Qu.: 71.47 1st Qu.:71.14 Class :character 1st Qu.:245.8
## Median : 50.50 Mode :character Median : 76.65 Median :76.38 Mode :character Median :256.8
## Mean : 50.50 Mean : 76.90 Mean :76.04 Mean :257.9
## 3rd Qu.: 75.25 3rd Qu.: 81.70 3rd Qu.:82.15 3rd Qu.:261.0
## Max. :100.00 Max. :100.00 Max. :95.00 Max. :421.0
## Employability (After) Status Annual Salary
## Min. :119.0 Length:100 Min. : 20000
## 1st Qu.:312.0 Class :character 1st Qu.: 87125
## Median :435.6 Mode :character Median :103500
## Mean :422.7 Mean :109058
## 3rd Qu.:529.0 3rd Qu.:124000
## Max. :631.0 Max. :340000
Description:
library(ggplot2)
ggplot(task2data, aes(x = reorder(`Undergrad Degree`, -table(`Undergrad Degree`)[`Undergrad Degree`]),
fill = `Undergrad Degree`)) +
geom_bar(colour = "black") +
xlab("Degree Type") +
ylab("Number of Students") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set1")
The most common undergrad degree is Business.
library(ggplot2)
ggplot(task2data, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 15000, fill = "steelblue", colour = "black") +
xlab("Annual Salary ($)") +
ylab("Count of Students") +
ggtitle("Annual Salaries")
The distribution of annual salaries is asymmetrical to the right. It has a few outliers on the high side of the salary range.
To test the hypothesis \(H_0: \mu_{MBA\ grade} = 74\) we run the two sided T test.
t.test(task2data$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: task2data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
The results tell us that we can reject the null hypothesis, as the 95% confidence interval of \([74.517,77.563]\) does not include the value 74. The average MBA grade is higher than 74.
First, we import the file Apartments.xlsx into our
notebook with the library readxl and display the first five
units in the dataset.
library(readxl)
task3data <- read_xlsx(path = "Apartments.xlsx")
head(task3data, 5)
## # A tibble: 5 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
Description of the dataset:
We will change the categorical variables of Parking and
Balcony into factors.
task3data$ParkingF <- factor(task3data$Parking, levels = c(0,1), labels = c("No", "Yes"))
task3data$BalconyF <- factor(task3data$Balcony, levels = c(0,1), labels = c("No", "Yes"))
Next, we test the hypothesis \(H_0: \mu_{price} = 1900€\).
t.test(task3data$Price, mu = 1900, alternative = "two.sided")
##
## One Sample t-test
##
## data: task3data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
With a p-value of \(0.004731 \lt 0.05\) we reject the null hypothesis. The confidence interval of \([1937,2100]\) does not include the price of 1900€. The average price is above 1900€.
fit1 = lm(task3data$Price ~ task3data$Age)
summary(fit1)
##
## Call:
## lm(formula = task3data$Price ~ task3data$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## task3data$Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The intercept of the regression function Price = f(Age) is at 2185.5 meaning that a new apartment (age 0) on average costs 2185.5€/m². The regression coefficient of -8.795 tells us that the price of an apartment lowers on average by 8.795€/m² for every year of age.
pairs(task3data[, c("Price", "Age", "Distance")],
main = "Scatterplot matrix: Price, Age, Distance",
labels = c("Price", "Age", "Distance"),
pch = 19, col = "steelblue")
Based on the scatterplot matrix we do not see any multicolinearity
between Age and Distance.
fit2 <- lm(task3data$Price ~ task3data$Age + task3data$Distance)
summary(fit2)
##
## Call:
## lm(formula = task3data$Price ~ task3data$Age + task3data$Distance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## task3data$Age -7.934 3.225 -2.46 0.016 *
## task3data$Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
#plot(fit2, which=1)
library(car)
## Loading required package: carData
vif(fit2)
## task3data$Age task3data$Distance
## 1.001845 1.001845
With a VIF value close to 1 (1.001845) between Distance
and Age we can conclude that there is no multicolinearity
between the two independent variables.
task3data$rstd <- round(rstandard(fit2), 3)
#shapiro.test(task3data$rstd)
hist(task3data$rstd,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "steelblue")
From the histogram of standardized residuals we can see that most units lie in the range of \([-2, 2]\), with only a few going beyond that. That should be acceptable.
task3data$cooksd <- round(cooks.distance(fit2), 3)
hist(task3data$cooksd,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cooks distances",
col = "steelblue")
head(task3data[order(-task3data$cooksd),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF rstd cooksd
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
From the histogram of Cook’s distances we can see most units have a Cook’s distance less than or equal to \(0.15\), with a few discontinuous units above a distance of \(0.30\). We will remove units with Cook’s distance of more than \(4/n\) where \(n=85\).
task3data_clean <- task3data[-which(task3data$cooksd > 4/85), ]
fit2_clean <- lm(Price ~ Age + Distance, data = task3data_clean)
task3data_clean$cooksd <- round(cooks.distance(fit2_clean), 3)
hist(task3data_clean$cooksd,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cooks distances without influential units",
col = "steelblue")
After removing the influential units and outliers, we can see a drastic improvement in the two histograms.
# standardized fitted values
fitted_std <- scale(fitted(fit2_clean), center = TRUE, scale = TRUE)
# plot
plot(fitted_std, task3data_clean$rstd,
xlab = "Standardized fitted values",
ylab = "Standardized residuals",
main = "Standardized residuals vs Standardized fitted values",
pch = 19, col = "steelblue")
abline(h = 0, lty = 2)
# formal test
#install.packages("lmtest") # uncomment if needed
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2_clean)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
From the Breusch-Pagan test for heteroskedasticity we can read that the \(Chi^2\) value is equal to \(0.187\) which is greater than \(0.05\). As such we fail to reject the null hypothesis at a 5% significance level. There is no strong evidence of heteroskedasticity.
task3data_clean$rstd <- round(rstandard(fit2_clean), 3)
hist(task3data_clean$rstd,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals without outliers",
col = "steelblue")
shapiro.test(task3data_clean$rstd)
##
## Shapiro-Wilk normality test
##
## data: task3data_clean$rstd
## W = 0.94154, p-value = 0.001166
After removing outliers and most influential units the histogram of standardized residuals shows the standardized residuals are not normally distributed. The histogram is asymmetrical to the right. The Shapiro-Wilk test gives a p-value of 0.001166. As the value is smaller than 0.05 we reject the null hypothesis that the standardized residuals are normally distributed.
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = task3data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
#plot(fit2_clean, which=1)
The average price of an apartment is 2502€/m² and the price linearly falls both with distance from the city centre and age of the apartment. The distance being equal, the apartment on average loses 8.674€/m² of value for every year of age. The age being equal, the apartment on average loses 24.063€/m² for every additional kilometre from the city centre.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = task3data_clean)
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
As the p-value is equal to \(0.1135 >
0.05\) we cannot reject the null hypothesis that
fit3 is better than fit2_clean. The variables
Parking and Balcony do not explain more
variation of the price per square meter of apartments than
fit2 without them.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = task3data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFYes 128.700 60.801 2.117 0.0376 *
## BalconyFYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Keeping age, distance and balcony variables constant, the addition of
parking raises the price per square meter by 128.7€ on average. Keeping
age, distance and parking constant, the addition of a balcony raises the
price per square meter by 6€ per square meter on average. In
fit3 the age and distance impact the average price per
square meter of the apartment a bit less than in
fit2_clean.
The hypothesis of F-statistics is that the observed variable
Price can be explained with all regression coefficients
equal to zero. The p-value approaching 0 means that we can reject that
hypothesis and that the regression coefficients do explain some of the
price variation of apartments.
task3data_clean$Fitted <- fitted(fit3)
task3data_clean$Residuals <- residuals(fit3)
task3data_clean[2, c("Fitted", "Residuals")]
## # A tibble: 1 × 2
## Fitted Residuals
## <dbl> <dbl>
## 1 2357. 443.