Take home exam

Explaining the data set - task 1

data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Explanation: Showing the structure of the data set.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Explanation: The average lenght of a sepal of an iris flower was 5.843 cm, while the minimum width of a petal of an iris flower was 0.100 cm.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

These are the variables that we have in our data.

dim(iris)
## [1] 150   5

In this data we have 150 rows and 5 columns.

Data manipulation

iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width
colnames(iris)[1] <- "Sepal_Length"
iris_filtered <- subset(iris, Petal.Length > 1.5)

head(iris)
##   Sepal_Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1          5.1         3.5          1.4         0.2  setosa       0.28
## 2          4.9         3.0          1.4         0.2  setosa       0.28
## 3          4.7         3.2          1.3         0.2  setosa       0.26
## 4          4.6         3.1          1.5         0.2  setosa       0.30
## 5          5.0         3.6          1.4         0.2  setosa       0.28
## 6          5.4         3.9          1.7         0.4  setosa       0.68

Descriptive satistics

summary(iris)
##   Sepal_Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species     Petal.Area    
##  setosa    :50   Min.   : 0.110  
##  versicolor:50   1st Qu.: 0.420  
##  virginica :50   Median : 5.615  
##                  Mean   : 5.794  
##                  3rd Qu.: 9.690  
##                  Max.   :15.870
mean(iris$Sepal_Length)
## [1] 5.843333
median(iris$Sepal_Length)
## [1] 5.8
sd(iris$Sepal_Length)
## [1] 0.8280661

Explanation: On average, the sepal lenght across 150 iris flowers from the data is 5.84 cm. The median is 5.80 cm which means that half of the flowers have a sepal smaller than 5.80 cm, while the other half have a larger sepal lenght.The sepal lenght of iris flowers vary on average +-0.83 cm from the mean.

Vizualise the data

hist(iris$Sepal_Length, main="Histogram of Sepal Length", xlab="Sepal Length",
     col = "violet",
     border = "deeppink4")

Explanation: From the histogram we can see that the most frequent were iris flowers with sepal leghts between 6.0 and 6.5 cm, and that the least frequent sepal lenght is between 4.0 and 4.5 cm.

plot(iris$Sepal_Length, iris$Petal.Length, main="Sepal vs Petal Length", 
     xlab="Sepal Length", ylab="Petal Length", col=iris$Species)

boxplot(Sepal_Length ~ Species, data=iris, main="Sepal Length by Species",
        xlab="Species", ylab="Sepal Length")

Task 2

library(readxl)

file_path <- "./Business School.xlsx" 
mba_data <- read_excel("./Business School.xlsx")
head(mba_data)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
colnames("mba_data")
## NULL
library(ggplot2)

ggplot(mba_data, aes(x=`Undergrad Degree`)) +
  geom_bar(colour = "yellowgreen", fill = "skyblue") +
  ylab("Frequency")

Explanation: The most common degree is “Business”.

# Summary statistics for Annual Salary
summary(mba_data$"Annual Salary")
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
mean(mba_data$"Annual Salary")
## [1] 109058
median(mba_data$"Annual Salary")
## [1] 103500
sd(mba_data$"Annual Salary")
## [1] 41501.49
ggplot(mba_data, aes(x = `Annual Salary` / 1000)) +
  geom_histogram(binwidth = 5, fill = "darkgoldenrod2", color = "forestgreen") +
  labs(title = "Distribution of Annual Salary (in Thousands)", 
       x = "Annual Salary (in Thousands)", 
       y = "Count") +
  scale_x_continuous(labels = scales::comma)

Explanation: From the graph we can see that the most common annual salary for MBA students is 105,000 EUR.

t_test_result <- t.test(mba_data$`MBA Grade`, mu = 74)
print(t_test_result)
## 
##  One Sample t-test
## 
## data:  mba_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Task 3

Import the dataset Apartments.xlsx

library(readxl)


file_path <- "./Apartments.xlsx"
apartment_data <- read_excel("./Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

apartment_data$Parking <- as.factor(apartment_data$Parking)
apartment_data$Balcony <- as.factor(apartment_data$Balcony)

str(apartment_data)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 1 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t_test_result <- t.test(apartment_data$Price, mu = 1900)

print(t_test_result)
## 
##  One Sample t-test
## 
## data:  apartment_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Conclusion: We reject the null hypothesis because the p value is less than 0.05. The mean price of the apartments is significantly different from 1900 EUR. The t-value being positive suggests that the mean price of the apartments is higher than 1900 EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = apartment_data)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apartment_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(apartment_data$Age, apartment_data$Price)
## [1] -0.230255

From the results we can read that the age of apartments effects it’s price negatively, the price goes down (Estimate of Age: -8.975). As the age increases, the price decreases. We reject the null hypothesis that age of apartments has no effect on the price (p-value: 0.034). Only 5.3% of the variation in price can be explained by the age of the apartment. Age alone is not a very strong predictor of price.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("GGally")
## 
## The downloaded binary packages are in
##  /var/folders/0b/8vj65c4j39v9_bn7pds5v6sr0000gn/T//RtmpVHwTp2/downloaded_packages
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(apartment_data[, c("Price", "Age", "Distance")])

Multicolinearity is not a potential problem because the it is between 0.7 and -0.7.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apartment_data)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartment_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

install.packages("car")
## 
## The downloaded binary packages are in
##  /var/folders/0b/8vj65c4j39v9_bn7pds5v6sr0000gn/T//RtmpVHwTp2/downloaded_packages
library(car)
## Loading required package: carData
vif_values <- vif(fit2)
print(vif_values)
##      Age Distance 
## 1.001845 1.001845

There is no serious multicollinearity problem.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

standardized_residuals <- rstandard(fit2)
cooks_distances <- cooks.distance(fit2)

diagnostics <- data.frame(
  Standardized_Residuals = standardized_residuals,
  Cooks_Distance = cooks_distances
)

print(diagnostics)
##    Standardized_Residuals Cooks_Distance
## 1             -0.66534868   7.386569e-03
## 2              1.78328759   3.036543e-02
## 3             -0.59376287   5.882612e-03
## 4              0.75437937   8.299153e-03
## 5             -1.07339872   5.112584e-03
## 6             -0.77751899   4.900891e-03
## 7             -0.30197426   5.481629e-04
## 8             -0.78702618   3.782998e-03
## 9              0.45519754   1.288366e-03
## 10             1.24081605   1.671333e-02
## 11             1.55056437   2.012202e-02
## 12            -0.35348493   1.052317e-03
## 13            -1.49938655   1.311979e-02
## 14            -0.46873367   1.596111e-03
## 15             0.29161240   8.077040e-04
## 16            -0.99974519   6.450802e-03
## 17            -0.64518999   4.466658e-03
## 18             0.57708256   3.352797e-03
## 19            -0.81224591   1.016915e-02
## 20            -1.38103755   1.226489e-02
## 21            -0.37096388   1.514554e-03
## 22             1.57598216   6.086868e-02
## 23            -0.28302175   5.473966e-04
## 24            -0.44000857   1.410045e-03
## 25             1.57098451   3.412790e-02
## 26             0.62752775   3.800650e-03
## 27             1.55037406   2.341203e-02
## 28            -1.07148455   5.406313e-03
## 29            -0.51613952   2.634792e-03
## 30             0.25601392   1.325713e-03
## 31             0.88866941   3.026431e-02
## 32             0.45424508   1.726921e-03
## 33             2.05058591   6.913379e-02
## 34            -0.66346701   5.158092e-03
## 35            -1.26104001   7.942022e-03
## 36            -1.18908643   1.150838e-02
## 37            -1.00495060   7.355297e-03
## 38             2.57677232   3.197306e-01
## 39             1.09117576   3.750987e-02
## 40             0.27869313   2.343477e-03
## 41            -1.05183308   4.572751e-03
## 42            -0.10301522   1.181142e-04
## 43             0.25167568   1.363230e-03
## 44             0.11668553   1.908988e-04
## 45            -0.32874383   2.843472e-03
## 46             1.47782998   1.908942e-02
## 47            -0.05286709   3.387028e-05
## 48            -1.00856647   2.169110e-02
## 49             0.74691175   4.469820e-03
## 50            -0.68260290   2.798905e-03
## 51             0.39348196   3.794891e-03
## 52            -0.86117571   3.384716e-03
## 53            -2.15178743   6.625775e-02
## 54            -1.10136340   1.199531e-02
## 55             1.44476770   1.042044e-01
## 56            -0.32780310   4.879320e-04
## 57             1.60066793   3.199717e-02
## 58             1.65534294   3.654413e-02
## 59            -0.55038426   1.237107e-03
## 60             0.33337375   1.203554e-03
## 61             1.78328759   3.036543e-02
## 62            -0.59376287   5.882612e-03
## 63             0.75437937   8.299153e-03
## 64            -1.07339872   5.112584e-03
## 65            -0.77751899   4.900891e-03
## 66            -0.30197426   5.481629e-04
## 67            -0.78702618   3.782998e-03
## 68             0.45519754   1.288366e-03
## 69             1.24081605   1.671333e-02
## 70             1.55056437   2.012202e-02
## 71            -0.35348493   1.052317e-03
## 72            -1.49938655   1.311979e-02
## 73            -0.46873367   1.596111e-03
## 74             0.29161240   8.077040e-04
## 75            -0.99974519   6.450802e-03
## 76            -0.64518999   4.466658e-03
## 77            -0.32874383   2.843472e-03
## 78             1.47782998   1.908942e-02
## 79            -0.05286709   3.387028e-05
## 80            -1.00856647   2.169110e-02
## 81             0.74691175   4.469820e-03
## 82            -0.68260290   2.798905e-03
## 83             0.39348196   3.794891e-03
## 84            -0.86117571   3.384716e-03
## 85            -0.05286709   3.387028e-05
n <- nrow(apartment_data)
outliers <- abs(standardized_residuals) > 3
influential <- cooks_distances > (4 / n)
diagnostics$Outlier <- outliers
diagnostics$Influential <- influential

problematic_units <- diagnostics[diagnostics$Outlier | diagnostics$Influential, ]
print(problematic_units)
##    Standardized_Residuals Cooks_Distance Outlier Influential
## 22               1.575982     0.06086868   FALSE        TRUE
## 33               2.050586     0.06913379   FALSE        TRUE
## 38               2.576772     0.31973058   FALSE        TRUE
## 53              -2.151787     0.06625775   FALSE        TRUE
## 55               1.444768     0.10420445   FALSE        TRUE

There are no problemati units.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

standardized_fitted_values <- scale(fitted(fit2))

library(ggplot2)

ggplot(data = data.frame(standardized_fitted_values, standardized_residuals), 
       aes(x = standardized_fitted_values, y = standardized_residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Standardized Residuals vs. Standardized Fitted Values",
       x = "Standardized Fitted Values",
       y = "Standardized Residuals") +
  theme_minimal()

The scatterplot indicates potential heteroskedasticity - variance of the errors changes with the level of predictor variables.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

ggplot(data = data.frame(standardized_residuals), aes(sample = standardized_residuals)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot of Standardized Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()

shapiro_test_result <- shapiro.test(standardized_residuals)
print(shapiro_test_result)
## 
##  Shapiro-Wilk normality test
## 
## data:  standardized_residuals
## W = 0.95306, p-value = 0.00366

The assumption of normality for the residuals has been violated.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartment_data)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

With function anova check if model fit3 fits data better than model fit2.

anova_result <- anova(fit2, fit3)
print(anova_result)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis that Parking and Balcony do not improve the model. When parking and Balcony are included in the model, it is statistically significantly improved than the model with only Age and Distance.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Age and distance negatively impact the price, while parking positively impact the price. Balcony variable does not appear to have significant impact. The null hypothesis shows that all of the regression coefficients in the model are equal to zero. None of the predict variables have a significant effect on price (dependent variable). Alternative null hypothesis indicated that at least one variable has a significant effect on the price.

Save fitted values and claculate the residual for apartment ID2.