Task 1

data(package = .packages(all.available = TRUE))
library(carData)
ProfessorsSalaries <- force(Salaries) 
head(ProfessorsSalaries)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

1. Data set explanation

Among the data sets provided, I chose Salaries of Professors in the carData package.

This data set analyzes the academic and financial landscape for professors, considering 6 variables such as salary, career advancement, and other patterns within academia.

More precisely, my chosen data set contains 397 observations of these 6 variables:

  • Rank: a categorical variable that indicates the academic rank of a professor. The ranks may include “Assistant Professor,” “Associate Professor,” or “Full Professor”;
  • Discipline: a categorical variable that represents the professor’s academic field or discipline. According to my internet research, data sets likely use codes or abbreviations to represent different fields, such as “A” for theoretical fields like Arts and Humanities, and “B” for applied fields like Business or Engineering;
  • yrs.since.phd: this numeric variable shows the number of years since the professor received their PhD;
  • yrs.service: again a numeric variable that represents the number of years the professor has worked at their current institution;
  • Sex: a categorical variable indicating the professor’s gender;
  • Salary: a numeric variable representing the professor’s annual salary (I suppose in dollars). It is the primary dependent variable in my analysis to explore relationships between salary and other factors like experience, rank, and gender.

2. Data manipulations

Renaming columns to more user-friendly names

colnames(ProfessorsSalaries)[2] <- "academic field"
colnames(ProfessorsSalaries)[5] <- "gender"

I then wished to modify the academic field variable’s categories A and B. Thus, I looked up how to accomplish it online and discovered the mutate {dplyr} function in R Help, through which columns can be modified together with the ifelse function explained in the PDF on the professor’s website.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ProfessorsSalaries <- ProfessorsSalaries %>%
  mutate(`academic field` = ifelse(`academic field` == "A", "Humanistic", 
                             ifelse(`academic field` == "B", "Scientific", `academic field`)))
head(ProfessorsSalaries)
##        rank academic field yrs.since.phd yrs.service gender salary
## 1      Prof     Scientific            19          18   Male 139750
## 2      Prof     Scientific            20          16   Male 173200
## 3  AsstProf     Scientific             4           3   Male  79750
## 4      Prof     Scientific            45          39   Male 115000
## 5      Prof     Scientific            40          41   Male 141500
## 6 AssocProf     Scientific             6           6   Male  97000

Creation of a new variable: salary per year of service

ProfessorsSalaries$SalaryperYrs.service <- ProfessorsSalaries$salary / ProfessorsSalaries$yrs.service

Removing units with missing values

library(tidyr)
ProfessorsSalaries <- ProfessorsSalaries %>% drop_na()
# Creation of a new data.frame about male humanistic associate professors' salaries
AssocProfMaleHum_Salaries <- data.frame(
  "rank" = ProfessorsSalaries$rank[ProfessorsSalaries$rank == "AssocProf" & 
                                   ProfessorsSalaries$gender == "Male" & 
                                   ProfessorsSalaries$`academic field` == "Humanistic"], 
  "academic_field" = ProfessorsSalaries$`academic field`[ProfessorsSalaries$rank == "AssocProf" & 
                                                         ProfessorsSalaries$gender == "Male" & 
                                                         ProfessorsSalaries$`academic field` == "Humanistic"],
  "gender" = ProfessorsSalaries$gender[ProfessorsSalaries$rank == "AssocProf" & 
                                       ProfessorsSalaries$gender == "Male" & 
                                       ProfessorsSalaries$`academic field` == "Humanistic"],
  "salary" = ProfessorsSalaries$salary[ProfessorsSalaries$rank == "AssocProf" & 
                                       ProfessorsSalaries$gender == "Male" & 
                                       ProfessorsSalaries$`academic field` == "Humanistic"]
)
print(AssocProfMaleHum_Salaries)
##         rank academic_field gender salary
## 1  AssocProf     Humanistic   Male  83850
## 2  AssocProf     Humanistic   Male  82099
## 3  AssocProf     Humanistic   Male  82600
## 4  AssocProf     Humanistic   Male  81500
## 5  AssocProf     Humanistic   Male  82100
## 6  AssocProf     Humanistic   Male  83001
## 7  AssocProf     Humanistic   Male  73877
## 8  AssocProf     Humanistic   Male 100102
## 9  AssocProf     Humanistic   Male  81500
## 10 AssocProf     Humanistic   Male  70000
## 11 AssocProf     Humanistic   Male  83000
## 12 AssocProf     Humanistic   Male  74000
## 13 AssocProf     Humanistic   Male  88600
## 14 AssocProf     Humanistic   Male  88650
## 15 AssocProf     Humanistic   Male  81800
## 16 AssocProf     Humanistic   Male 104800
## 17 AssocProf     Humanistic   Male  70700
## 18 AssocProf     Humanistic   Male  81285
## 19 AssocProf     Humanistic   Male 108413
## 20 AssocProf     Humanistic   Male  78182
## 21 AssocProf     Humanistic   Male 104121
## 22 AssocProf     Humanistic   Male  86895

In this new data.frame I isolated the 22 male, humanistic, Associate Professors and I could conduct further research about their descriptive statistics, such as their salary mean.

mean(AssocProfMaleHum_Salaries$salary)
## [1] 85048.86

Explanation: a male humanistic associate professor at the university taken into consideration in the analysis earns, on average, $85049 annually.

3. Descriptive Statistics

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
options(width = 100)
round(stat.desc(ProfessorsSalaries), 2)
##          rank academic field yrs.since.phd yrs.service gender       salary SalaryperYrs.service
## nbr.val    NA             NA        397.00      397.00     NA       397.00               397.00
## nbr.null   NA             NA          0.00       11.00     NA         0.00                 0.00
## nbr.na     NA             NA          0.00        0.00     NA         0.00                 0.00
## min        NA             NA          1.00        0.00     NA     57800.00              1133.33
## max        NA             NA         56.00       60.00     NA    231545.00                  Inf
## range      NA             NA         55.00       60.00     NA    173745.00                  Inf
## sum        NA             NA       8859.00     6993.00     NA  45141464.00                  Inf
## median     NA             NA         21.00       16.00     NA    107300.00              8015.91
## mean       NA             NA         22.31       17.61     NA    113706.46                  Inf
## SE.mean    NA             NA          0.65        0.65     NA      1520.16                  NaN
## CI.mean    NA             NA          1.27        1.28     NA      2988.60                  NaN
## var        NA             NA        166.07      169.16     NA 917425865.05                  NaN
## std.dev    NA             NA         12.89       13.01     NA     30289.04                  NaN
## coef.var   NA             NA          0.58        0.74     NA         0.27                  NaN

For categorical variables, it is useless to provide a descriptive statistics summary, thus I will re-set the function to display only the relevant findings for this study.

library(pastecs)
round(stat.desc(ProfessorsSalaries[, c("yrs.since.phd", "yrs.service", "salary", "SalaryperYrs.service")]), 2)
##              yrs.since.phd yrs.service       salary SalaryperYrs.service
## nbr.val             397.00      397.00       397.00               397.00
## nbr.null              0.00       11.00         0.00                 0.00
## nbr.na                0.00        0.00         0.00                 0.00
## min                   1.00        0.00     57800.00              1133.33
## max                  56.00       60.00    231545.00                  Inf
## range                55.00       60.00    173745.00                  Inf
## sum                8859.00     6993.00  45141464.00                  Inf
## median               21.00       16.00    107300.00              8015.91
## mean                 22.31       17.61    113706.46                  Inf
## SE.mean               0.65        0.65      1520.16                  NaN
## CI.mean.0.95          1.27        1.28      2988.60                  NaN
## var                 166.07      169.16 917425865.05                  NaN
## std.dev              12.89       13.01     30289.04                  NaN
## coef.var              0.58        0.74         0.27                  NaN

Since the function drop_na was insufficient for the new variable I had previously set (SalaryperYrs.service), I looked up how to remove the inf and NaN rows from my data set again online. I found this function on this forum: https://stackoverflow.com/questions/36590230/how-to-remove-rows-with-inf-from-a-dataframe-in-r, and I modified it to fit my needs.

ProfessorsSalaries_new <- ProfessorsSalaries[is.finite(ProfessorsSalaries$SalaryperYrs.service), ]
head(ProfessorsSalaries_new)
##        rank academic field yrs.since.phd yrs.service gender salary SalaryperYrs.service
## 1      Prof     Scientific            19          18   Male 139750             7763.889
## 2      Prof     Scientific            20          16   Male 173200            10825.000
## 3  AsstProf     Scientific             4           3   Male  79750            26583.333
## 4      Prof     Scientific            45          39   Male 115000             2948.718
## 5      Prof     Scientific            40          41   Male 141500             3451.220
## 6 AssocProf     Scientific             6           6   Male  97000            16166.667

There are now 386 observations in my new data collection, which is 11 fewer than the old one. Here I will now display the descriptive statistical results summary for this variable exclusively as a last check.

library(pastecs)
round(stat.desc(ProfessorsSalaries_new[, c("SalaryperYrs.service")]), 2)
##      nbr.val     nbr.null       nbr.na          min          max        range          sum 
##       386.00         0.00         0.00      1133.33    118700.00    117566.67   5226619.27 
##       median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
##      7934.09     13540.46       852.84      1676.82 280754731.35     16755.74         1.24

I came to the conclusion that it would be best to stop looking into this new variable after the initial sample size (all of the professors working at the analysed university) was lost. After taking into account the three earlier ones, I will now analyze some sample statistics.

library(pastecs)
round(stat.desc(ProfessorsSalaries[, c("yrs.since.phd", "yrs.service", "salary")]), 2)
##              yrs.since.phd yrs.service       salary
## nbr.val             397.00      397.00       397.00
## nbr.null              0.00       11.00         0.00
## nbr.na                0.00        0.00         0.00
## min                   1.00        0.00     57800.00
## max                  56.00       60.00    231545.00
## range                55.00       60.00    173745.00
## sum                8859.00     6993.00  45141464.00
## median               21.00       16.00    107300.00
## mean                 22.31       17.61    113706.46
## SE.mean               0.65        0.65      1520.16
## CI.mean.0.95          1.27        1.28      2988.60
## var                 166.07      169.16 917425865.05
## std.dev              12.89       13.01     30289.04
## coef.var              0.58        0.74         0.27

Explanation and analysis

  • Mean: at this university, professors make, on average, $113,706.46 a year, regardless of their position; the mean number of years after a PhD is 22,31, meaning that, on average, professors have been in the field for more than two decades;
  • Median: at the university of my sample, half of the professors have worked there for less than 16 years, demonstrating a distribution that includes both novice and more experienced faculty members;
  • CV: professors’ salaries have less relative variability than their mean (they are distributed around the mean by 27%), according to a salary CV of about 27%. Although there are variations in pay, the range is smaller than the average salary, indicating that salaries in this data set are more consistent than in years served and years since PhD;
  • Range: the range of 60 years indicates that professors’ tenure at this university varies, ranging from 0 years for those who are just starting out to 60 years for those who have devoted their whole careers to teaching.

4. Graphical analysis

This histogram shows the distribution of professors’ salaries. The shape will indicate if the distribution is skewed or normal.

library(ggplot2)
ggplot(ProfessorsSalaries, aes(x = salary)) +
  geom_histogram(binwidth = 5000, fill = "chocolate", color = "black") +
  labs(title = "Salary distribution among professors", x = "Salary", y = "Frequency") +
  theme_minimal()

The histogram of professors’ salaries indicates a right-skewed distribution, suggesting that most salaries cluster at the “lower” end while a small number of higher salaries significantly influence the mean.

Now, my goal is to compare the gender-specific wage distributions of professors, highlighting any potential disparities in pay between male and female faculty members using separate histograms.

library(ggplot2)
ggplot(ProfessorsSalaries, aes(x = salary, fill = gender)) +
  geom_histogram(binwidth = 5000, color = "black", position = "identity") +
  labs(title = "Professors' Salaries by Gender", x = "Salary", y = "Frequency") +
  theme_minimal() +
  facet_wrap(~ gender)

The following function illustrates the considerable numerical disparity between male and female academics in the sample, which makes it difficult to do a gender-based salary comparison.

table(ProfessorsSalaries$gender)
## 
## Female   Male 
##     39    358

Despite this difference, the right skewness of the histograms for both genders indicates a comparable overall salary distribution. It is interesting to notice, though, that the female histogram’s skewness ends at a lower wage level, indicating that, within their respective ranges, female professors typically make less money annually than their male colleagues.

Now, this scatterplot will show the relationship between experience and salary. A positive trend would indicate that higher experience tends to result in higher salary.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplot(salary ~ yrs.service, 
             data = ProfessorsSalaries,
             smooth = FALSE,
             boxplots = FALSE,
             ylab = "Salary", 
             xlab = "Years of Service", 
             main = "Professors' salary vs. years of service relationship")

cor(ProfessorsSalaries$yrs.service, ProfessorsSalaries$salary)
## [1] 0.3347447

The scatterplot reveals a moderate positive correlation coefficient of approximately 0.33 between Salary and Years of Experience, indicating that, generally, as years of experience increase, salaries tend to rise as well. However, the correlation is not strong, suggesting that other factors may also significantly influence salary beyond just years of experience.

This boxplot shows how salaries vary between different academic ranks, such as Assistant, Associate, and Full Professors.

library(ggplot2)
ggplot(ProfessorsSalaries, aes(x = rank, y = salary, fill = rank)) + 
  geom_boxplot() +
  scale_fill_brewer(palette = "Spectral") +
  labs(title = "Salary by Academic Rank", 
       x = "Academic Rank", 
       y = "Salary") +
  theme_minimal()

The boxplots clearly show how academic rank has a substantial impact on salary, with higher ranks being correlated with greater median wages, wider interquartile gaps, and bigger ranges. Because full professors make up the majority of the university’s compensation structure, their boxplot is notably elevated. While assistant and associate professors show no outliers, the professor category contains a few. These are some causes for the presence of outliers that may not necessarily require removal. In conclusion, the association between academic rank and salary is successfully depicted by the graph, reaffirming the expected relationship between these two variables.

Task 2

library(readxl)
Business_School <- read_excel("~/Desktop/R/IMB 2024/Bootcamp/R Take Home Exam 2024/Task 2/Business School.xlsx")
Business_School <- as.data.frame(Business_School)
head(Business_School)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before)
## 1          1         Business            68.4      90.2              No                    252
## 2          2 Computer Science            70.2      68.7             Yes                    101
## 3          3          Finance            76.4      83.3              No                    401
## 4          4         Business            82.6      88.7              No                    287
## 5          5          Finance            76.9      75.4              No                    275
## 6          6 Computer Science            83.3      82.1              No                    254
##   Employability (After) Status Annual Salary
## 1                   276 Placed        111000
## 2                   119 Placed        107000
## 3                   462 Placed        109000
## 4                   342 Placed        148000
## 5                   347 Placed        255500
## 6                   313 Placed        103500

1. Undergraduate degrees distribution

library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) + 
  geom_bar() + 
  ylab("Frequency")

The histogram reveals that the most frequently held Undergraduate Degree among students is Business by a significant margin. Following that, Finance is the second most common degree, with Computer Science also being notably represented, though to a lesser extent.

2. Descriptive statistics - Annual salary

library(pastecs)
round(stat.desc(Business_School$`Annual Salary`))
##      nbr.val     nbr.null       nbr.na          min          max        range          sum 
##          100            0            0        20000       340000       320000     10905800 
##       median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
##       103500       109058         4150         8235   1722373475        41501            0
library(ggplot2)
options(scipen = 999)
ggplot(Business_School, aes(x=`Annual Salary`)) + 
  geom_histogram(binwidth = 5000, fill = "chocolate", color = "black") +
  ylab("Frequency") + 
  xlab("Annual Salary") +
  ggtitle("Distribution of Annual Salary for MBA Students") +
  theme_minimal()

The descriptive statistics indicate that the annual salary for MBA students ranges from a minimum of 20,000 to a maximum of 340,000. The average annual salary is 109,858, and the median salary is 103,580, meaning that 50% of the students earn more than this amount. Upon examining the histogram, we observe that the salary distribution is skewed to the right, suggesting that a few students have significantly higher salaries compared to the rest. This skewness may be caused by outliers at the higher end of the salary range. To further investigate whether the salary data follows a normal distribution, it would be appropriate to conduct a Shapiro-Wilk test. Additionally, to improve the readability of the histogram, with some colleagues we used the options(scipen = 999) function, which prevents scientific notation from being used on the x-axis. This method was sourced from this link: https://stackoverflow.com/questions/5352099/how-can-i-disable-scientific-notation).

3. Hypothesis testing

mean(Business_School$`MBA Grade`)
## [1] 76.04055
sd(Business_School$`MBA Grade`)
## [1] 7.675114
t.test(Business_School$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

I conducted a one-sample t-test to determine whether the current MBA students’ mean grade differs from last year’s average of 74.

Hypotheses - Null (H₀): The mean MBA grade is 74 (𝜇 = 74). - Alternative (H₁): The mean MBA grade is not 74 (𝜇 ≠ 74).

Interpretation: Since the p-value (0.00915) is less than 0.05, we reject the null hypothesis. This indicates that the mean MBA grade is statistically different from 74. The confidence interval (74.52, 77.56) also does not include 74, further supporting this conclusion.

Effect Size: The sample mean of 76.04 is slightly higher than 74, with a difference of about 2 points. While statistically significant, this small difference may not be practically meaningful, depending on how the school interprets grade changes.

In summary, the test shows the current MBA grades are significantly higher than last year’s average, though the difference is relatively modest.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/Desktop/R/IMB 2024/Bootcamp/R Take Home Exam 2024/Task 3/Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description

  • Age: the age of the apartment in years;
  • Distance: distance from the city center in kilometers (km);
  • Price: price per square meter (m²);
  • Parking: availability of parking (0 = No, 1 = Yes);
  • Balcony: presence of a balcony (0 = No, 1 = Yes).

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking,
                                           levels = c (0, 1),
                                           labels = c ("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
                                           levels = c (0, 1),
                                           labels = c ("No", "Yes"))
head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No
## 4  28       29  1850      No     Yes
## 5  18       18  1640     Yes     Yes
## 6  28       12  1770      No     Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

mean(Apartments$Price)
## [1] 2018.941
sd(Apartments$Price)
## [1] 377.8417
t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Based on the one-sample t-test results, the p-value (0.0047) is less than the 0.05 significance level. Therefore, we reject the null hypothesis and conclude that the true mean apartment price is significantly different from 1900. The 95% confidence interval indicates that the true mean price likely falls between 1937.44 and 2100.44. This further supports the conclusion that the average price per square meter is not 1900, with the sample mean estimated at 2018.94.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
          data = Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanation

Estimate of the regression coefficient: the regression coefficient for Age is -8.975, meaning that for each additional year in the age of an apartment, the price per square meter decreases by approximately 8.98 units. This indicates a negative relationship between apartment age and price.

Coefficient of correlation: to assess the significance of the relationship between Age and Price, we conducted a t-test on the slope (null hypothesis: β₁ = 0). The p-value for the age variable is 0.034, which is less than 0.05, indicating that the effect of Age on Price is statistically significant. Thus, we reject the null hypothesis and conclude that age does have an impact on price.

Coefficient of determination (R-squared): the R-squared value is 0.053, meaning that 5.3% of the variability in apartment prices can be explained by the linear effect of age. While the relationship is statistically significant, the low R-squared suggests that age alone is not a strong predictor of apartment prices, and other factors likely play a larger role.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
                  smooth = FALSE) 

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(Apartments[ , c(1, 2, 3)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

There is a weak correlation between Age and Distance (r = 0.04) and between Age and Price (r = -0.23), suggesting that changes in apartment age are not strongly related to either the distance from the city center or the price. However, there is a strong negative correlation between Price and Distance (r = -0.63), meaning that as the distance from the city center increases, the price per square meter tends to decrease significantly. Based on these results, there doesn’t appear to be a multicollinearity issue between the variables Age, Distance, and Price, as none of the correlations between the independent variables (Age and Distance) are strong. The only notable relationship is between Price and Distance, but this doesn’t indicate multicollinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
          data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The VIF results for both Age and Distance are 1.001845, indicating no significant multicollinearity between the predictors. Since these values are well below the thresholds of 5, it suggests that both variables contribute uniquely to the model. Consequently, the regression model remains stable, and the estimates for Price based on Age and Distance are reliable.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

head(Apartments[, c("StdResid", "CooksD")])
##   StdResid CooksD
## 1   -0.665  0.007
## 2    1.783  0.030
## 3   -0.594  0.006
## 4    0.754  0.008
## 5   -1.073  0.005
## 6   -0.778  0.005
head(Apartments[order(Apartments$StdResid),], 5) #Five units with lowest value of stand. residuals
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
## 20  13        8  1800      No      No   -1.381  0.012
## 35  14       16  1660      No     Yes   -1.261  0.008
head(Apartments[order(-Apartments$CooksD),], 5) #Five units with highest value of Cooks distance
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061

There is a noticeable difference in the Cook’s distances; Apartment 38 has a Cook’s distance of 0.320, which is substantially greater than the next highest apartment (Apartment 55, 0.104). This suggests that Apartment 38 is having a significant impact on the model. Apartment 38 looks to be both a significant point and a possible outlier because it also has a comparatively large standardized residual (2.577). In order to determine whether removing Apartment 38 will enhance the model’s fit and stability, I shall do so.

library(dplyr)
Apartments_new <- Apartments %>%
  filter(!CooksD %in% c(0.320))

Check

fit2_new <- lm(Price ~ Age + Distance,
          data = Apartments_new)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896
summary(fit2_new)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221 < 0.0000000000000002 ***
## Age           -6.464      3.159  -2.046                0.044 *  
## Distance     -22.955      2.786  -8.240     0.00000000000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 0.000000000002339

After removing the influential point ( Apartment 38 ), the updated regression model shows some improvements. The R-squared increased from 0.4396 to 0.4838, indicating that the model now explains a slightly larger portion of the variance in apartment prices. Additionally, the residual standard error decreased from 286.3 to 276.1, suggesting a better overall model fit. However, despite these improvements, the model remains relatively weak, as the R-squared is still below 0.5. This implies that there are other factors not accounted for by Age and Distance that influence apartment prices. To further refine the model, a deeper analysis is needed. I should examine graphical diagnostics, such as the standardized residuals and Cook’s distances, to identify and remove other potential outliers or influential points. Iterating through this process will help improve the model’s accuracy and stability until it reaches an optimal point for analysis.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Apartments_new$StdFitted <- scale(fit2_new$fitted.values)

library(car)
scatterplot(y = Apartments_new$StdResid, x = Apartments_new$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Upon examining the scatterplot of standardized residuals against standardized fitted values, the distribution of the residuals does not reveal a clear pattern. However, there seems to be a slight indication of heteroskedasticity, as the spread of residuals appears to widen at certain fitted value ranges. To assess this potential issue more precisely, I do a Breusch-Pagan test for heteroskedasticity.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2_new)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

The results of the Breusch-Pagan test showed a chi-square value of 2.93 and a p-value of 0.0871. Since the p-value is greater than the conventional significance level of 0.05, we fail to reject the null hypothesis, suggesting that there is no strong evidence of heteroskedasticity in the model. This implies that the residuals likely exhibit constant variance.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Apartments_new$StdResid <- round(rstandard(fit2_new), 3) #Standardized residuals
Apartments_new$CooksD <- round(cooks.distance(fit2_new), 3) #Cooks distances

library(ggplot2)
ggplot(Apartments_new, aes(x = StdResid)) +
  geom_histogram(binwidth = 0.2, fill = "chocolate", color = "black") +
  labs(x = "Standardized Residuals", 
       y = "Frequency", 
       title = "Histogram of Standardized Residuals") +
  theme_minimal()

shapiro.test(Apartments_new$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments_new$StdResid
## W = 0.95649, p-value = 0.006355
  • Null hypothesis: the data is normally distributed;
  • Alternative hypothesis: the data is not normally distributed.

Given that the p-value (0.006355) is significantly less than the common significance level of 0.05, we reject the null hypothesis. This indicates that there is strong evidence to conclude that the standardized residuals are not normally distributed.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance,
          data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896
  • Age estimate: for each additional year in the age of an apartment, the price per square meter decreases by approximately €7.93 units; the p-value is 0.016, which is less than 0.05, indicating that the relationship between Age and Price is statistically significant;
  • Distance estimate: for each additional unit of distance, the price per square meter decreases by approximately €20.67; the p-value is 0.0000000000618, also indicating a statistically significant effect on Price;
  • The R-squared value is 0.4396, which means that approximately 43.96% of the variability in apartment prices can be explained by the linear effects of Age and Distance combined.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
          data = Apartments)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova results indicate that the model fit3 fits the data significantly better than the model fit2 (p-value = 0.01007, we can reject null hypothesis), suggesting that adding these variables improves the model’s explanatory power.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415 < 0.0000000000000002 ***
## Age           -6.799      3.110  -2.186              0.03172 *  
## Distance     -18.045      2.758  -6.543        0.00000000528 ***
## ParkingYes   196.168     62.868   3.120              0.00251 ** 
## BalconyYes     1.935     60.014   0.032              0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 0.00000000001849
  • ParkingYes coefficient: apartments with parking facilities are estimated to have a price per square meter that is €196.17 higher than those without parking;
  • BalconyYes coefficient: the p-value of 0.97436 suggests that this effect is not statistically significant, implying that the presence of a balcony does not have a meaningful impact on the apartment price in this model.

F-statistic - Null Hypothesis: none of the independent variables have a significant effect on the dependent variable (Price); - Alternative Hypothesis: at least one of the independent variables significantly affects the dependent variable.

Given p-value < 0.001, we reject the null hypothesis, concluding that the model fit3 provides a significantly better explanation of the variation in apartment prices compared to a model with no variables.

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3) 
Apartments$Residuals <- residuals(fit3) 
head(Apartments[colnames(Apartments) %in% c("Fitted", "Residuals")])
##     Fitted  Residuals
## 1 1750.741 -110.74150
## 2 2357.411  442.58893
## 3 1748.807  -88.80674
## 4 1589.921  260.07897
## 5 2052.576 -412.57579
## 6 1896.691 -126.69107
Apartments[2, c("Fitted", "Residuals")]
##     Fitted Residuals
## 2 2357.411  442.5889

The residual for Apartment ID2 is 422.5889

Take home homework - finished