data(package = .packages(all.available = TRUE))
library(carData)
ProfessorsSalaries <- force(Salaries)
head(ProfessorsSalaries)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Among the data sets provided, I chose Salaries of Professors in the carData package.
This data set analyzes the academic and financial landscape for professors, considering 6 variables such as salary, career advancement, and other patterns within academia.
More precisely, my chosen data set contains 397 observations of these 6 variables:
Renaming columns to more user-friendly names
colnames(ProfessorsSalaries)[2] <- "academic field"
colnames(ProfessorsSalaries)[5] <- "gender"
I then wished to modify the academic field variable’s categories A and B. Thus, I looked up how to accomplish it online and discovered the mutate {dplyr} function in R Help, through which columns can be modified together with the ifelse function explained in the PDF on the professor’s website.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ProfessorsSalaries <- ProfessorsSalaries %>%
mutate(`academic field` = ifelse(`academic field` == "A", "Humanistic",
ifelse(`academic field` == "B", "Scientific", `academic field`)))
head(ProfessorsSalaries)
## rank academic field yrs.since.phd yrs.service gender salary
## 1 Prof Scientific 19 18 Male 139750
## 2 Prof Scientific 20 16 Male 173200
## 3 AsstProf Scientific 4 3 Male 79750
## 4 Prof Scientific 45 39 Male 115000
## 5 Prof Scientific 40 41 Male 141500
## 6 AssocProf Scientific 6 6 Male 97000
Creation of a new variable: salary per year of service
ProfessorsSalaries$SalaryperYrs.service <- ProfessorsSalaries$salary / ProfessorsSalaries$yrs.service
Removing units with missing values
library(tidyr)
ProfessorsSalaries <- ProfessorsSalaries %>% drop_na()
# Creation of a new data.frame about male humanistic associate professors' salaries
AssocProfMaleHum_Salaries <- data.frame(
"rank" = ProfessorsSalaries$rank[ProfessorsSalaries$rank == "AssocProf" &
ProfessorsSalaries$gender == "Male" &
ProfessorsSalaries$`academic field` == "Humanistic"],
"academic_field" = ProfessorsSalaries$`academic field`[ProfessorsSalaries$rank == "AssocProf" &
ProfessorsSalaries$gender == "Male" &
ProfessorsSalaries$`academic field` == "Humanistic"],
"gender" = ProfessorsSalaries$gender[ProfessorsSalaries$rank == "AssocProf" &
ProfessorsSalaries$gender == "Male" &
ProfessorsSalaries$`academic field` == "Humanistic"],
"salary" = ProfessorsSalaries$salary[ProfessorsSalaries$rank == "AssocProf" &
ProfessorsSalaries$gender == "Male" &
ProfessorsSalaries$`academic field` == "Humanistic"]
)
print(AssocProfMaleHum_Salaries)
## rank academic_field gender salary
## 1 AssocProf Humanistic Male 83850
## 2 AssocProf Humanistic Male 82099
## 3 AssocProf Humanistic Male 82600
## 4 AssocProf Humanistic Male 81500
## 5 AssocProf Humanistic Male 82100
## 6 AssocProf Humanistic Male 83001
## 7 AssocProf Humanistic Male 73877
## 8 AssocProf Humanistic Male 100102
## 9 AssocProf Humanistic Male 81500
## 10 AssocProf Humanistic Male 70000
## 11 AssocProf Humanistic Male 83000
## 12 AssocProf Humanistic Male 74000
## 13 AssocProf Humanistic Male 88600
## 14 AssocProf Humanistic Male 88650
## 15 AssocProf Humanistic Male 81800
## 16 AssocProf Humanistic Male 104800
## 17 AssocProf Humanistic Male 70700
## 18 AssocProf Humanistic Male 81285
## 19 AssocProf Humanistic Male 108413
## 20 AssocProf Humanistic Male 78182
## 21 AssocProf Humanistic Male 104121
## 22 AssocProf Humanistic Male 86895
In this new data.frame I isolated the 22 male, humanistic, Associate Professors and I could conduct further research about their descriptive statistics, such as their salary mean.
mean(AssocProfMaleHum_Salaries$salary)
## [1] 85048.86
Explanation: a male humanistic associate professor at the university taken into consideration in the analysis earns, on average, $85049 annually.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
options(width = 100)
round(stat.desc(ProfessorsSalaries), 2)
## rank academic field yrs.since.phd yrs.service gender salary SalaryperYrs.service
## nbr.val NA NA 397.00 397.00 NA 397.00 397.00
## nbr.null NA NA 0.00 11.00 NA 0.00 0.00
## nbr.na NA NA 0.00 0.00 NA 0.00 0.00
## min NA NA 1.00 0.00 NA 57800.00 1133.33
## max NA NA 56.00 60.00 NA 231545.00 Inf
## range NA NA 55.00 60.00 NA 173745.00 Inf
## sum NA NA 8859.00 6993.00 NA 45141464.00 Inf
## median NA NA 21.00 16.00 NA 107300.00 8015.91
## mean NA NA 22.31 17.61 NA 113706.46 Inf
## SE.mean NA NA 0.65 0.65 NA 1520.16 NaN
## CI.mean NA NA 1.27 1.28 NA 2988.60 NaN
## var NA NA 166.07 169.16 NA 917425865.05 NaN
## std.dev NA NA 12.89 13.01 NA 30289.04 NaN
## coef.var NA NA 0.58 0.74 NA 0.27 NaN
For categorical variables, it is useless to provide a descriptive statistics summary, thus I will re-set the function to display only the relevant findings for this study.
library(pastecs)
round(stat.desc(ProfessorsSalaries[, c("yrs.since.phd", "yrs.service", "salary", "SalaryperYrs.service")]), 2)
## yrs.since.phd yrs.service salary SalaryperYrs.service
## nbr.val 397.00 397.00 397.00 397.00
## nbr.null 0.00 11.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 1.00 0.00 57800.00 1133.33
## max 56.00 60.00 231545.00 Inf
## range 55.00 60.00 173745.00 Inf
## sum 8859.00 6993.00 45141464.00 Inf
## median 21.00 16.00 107300.00 8015.91
## mean 22.31 17.61 113706.46 Inf
## SE.mean 0.65 0.65 1520.16 NaN
## CI.mean.0.95 1.27 1.28 2988.60 NaN
## var 166.07 169.16 917425865.05 NaN
## std.dev 12.89 13.01 30289.04 NaN
## coef.var 0.58 0.74 0.27 NaN
Since the function drop_na was insufficient for the new variable I had previously set (SalaryperYrs.service), I looked up how to remove the inf and NaN rows from my data set again online. I found this function on this forum: https://stackoverflow.com/questions/36590230/how-to-remove-rows-with-inf-from-a-dataframe-in-r, and I modified it to fit my needs.
ProfessorsSalaries_new <- ProfessorsSalaries[is.finite(ProfessorsSalaries$SalaryperYrs.service), ]
head(ProfessorsSalaries_new)
## rank academic field yrs.since.phd yrs.service gender salary SalaryperYrs.service
## 1 Prof Scientific 19 18 Male 139750 7763.889
## 2 Prof Scientific 20 16 Male 173200 10825.000
## 3 AsstProf Scientific 4 3 Male 79750 26583.333
## 4 Prof Scientific 45 39 Male 115000 2948.718
## 5 Prof Scientific 40 41 Male 141500 3451.220
## 6 AssocProf Scientific 6 6 Male 97000 16166.667
There are now 386 observations in my new data collection, which is 11 fewer than the old one. Here I will now display the descriptive statistical results summary for this variable exclusively as a last check.
library(pastecs)
round(stat.desc(ProfessorsSalaries_new[, c("SalaryperYrs.service")]), 2)
## nbr.val nbr.null nbr.na min max range sum
## 386.00 0.00 0.00 1133.33 118700.00 117566.67 5226619.27
## median mean SE.mean CI.mean.0.95 var std.dev coef.var
## 7934.09 13540.46 852.84 1676.82 280754731.35 16755.74 1.24
I came to the conclusion that it would be best to stop looking into this new variable after the initial sample size (all of the professors working at the analysed university) was lost. After taking into account the three earlier ones, I will now analyze some sample statistics.
library(pastecs)
round(stat.desc(ProfessorsSalaries[, c("yrs.since.phd", "yrs.service", "salary")]), 2)
## yrs.since.phd yrs.service salary
## nbr.val 397.00 397.00 397.00
## nbr.null 0.00 11.00 0.00
## nbr.na 0.00 0.00 0.00
## min 1.00 0.00 57800.00
## max 56.00 60.00 231545.00
## range 55.00 60.00 173745.00
## sum 8859.00 6993.00 45141464.00
## median 21.00 16.00 107300.00
## mean 22.31 17.61 113706.46
## SE.mean 0.65 0.65 1520.16
## CI.mean.0.95 1.27 1.28 2988.60
## var 166.07 169.16 917425865.05
## std.dev 12.89 13.01 30289.04
## coef.var 0.58 0.74 0.27
Explanation and analysis
This histogram shows the distribution of professors’ salaries. The shape will indicate if the distribution is skewed or normal.
library(ggplot2)
ggplot(ProfessorsSalaries, aes(x = salary)) +
geom_histogram(binwidth = 5000, fill = "chocolate", color = "black") +
labs(title = "Salary distribution among professors", x = "Salary", y = "Frequency") +
theme_minimal()
The histogram of professors’ salaries indicates a right-skewed distribution, suggesting that most salaries cluster at the “lower” end while a small number of higher salaries significantly influence the mean.
Now, my goal is to compare the gender-specific wage distributions of professors, highlighting any potential disparities in pay between male and female faculty members using separate histograms.
library(ggplot2)
ggplot(ProfessorsSalaries, aes(x = salary, fill = gender)) +
geom_histogram(binwidth = 5000, color = "black", position = "identity") +
labs(title = "Professors' Salaries by Gender", x = "Salary", y = "Frequency") +
theme_minimal() +
facet_wrap(~ gender)
The following function illustrates the considerable numerical disparity between male and female academics in the sample, which makes it difficult to do a gender-based salary comparison.
table(ProfessorsSalaries$gender)
##
## Female Male
## 39 358
Despite this difference, the right skewness of the histograms for both genders indicates a comparable overall salary distribution. It is interesting to notice, though, that the female histogram’s skewness ends at a lower wage level, indicating that, within their respective ranges, female professors typically make less money annually than their male colleagues.
Now, this scatterplot will show the relationship between experience and salary. A positive trend would indicate that higher experience tends to result in higher salary.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplot(salary ~ yrs.service,
data = ProfessorsSalaries,
smooth = FALSE,
boxplots = FALSE,
ylab = "Salary",
xlab = "Years of Service",
main = "Professors' salary vs. years of service relationship")
cor(ProfessorsSalaries$yrs.service, ProfessorsSalaries$salary)
## [1] 0.3347447
The scatterplot reveals a moderate positive correlation coefficient of approximately 0.33 between Salary and Years of Experience, indicating that, generally, as years of experience increase, salaries tend to rise as well. However, the correlation is not strong, suggesting that other factors may also significantly influence salary beyond just years of experience.
This boxplot shows how salaries vary between different academic ranks, such as Assistant, Associate, and Full Professors.
library(ggplot2)
ggplot(ProfessorsSalaries, aes(x = rank, y = salary, fill = rank)) +
geom_boxplot() +
scale_fill_brewer(palette = "Spectral") +
labs(title = "Salary by Academic Rank",
x = "Academic Rank",
y = "Salary") +
theme_minimal()
The boxplots clearly show how academic rank has a substantial impact on salary, with higher ranks being correlated with greater median wages, wider interquartile gaps, and bigger ranges. Because full professors make up the majority of the university’s compensation structure, their boxplot is notably elevated. While assistant and associate professors show no outliers, the professor category contains a few. These are some causes for the presence of outliers that may not necessarily require removal. In conclusion, the association between academic rank and salary is successfully depicted by the graph, reaffirming the expected relationship between these two variables.
library(readxl)
Business_School <- read_excel("~/Desktop/R/IMB 2024/Bootcamp/R Take Home Exam 2024/Task 2/Business School.xlsx")
Business_School <- as.data.frame(Business_School)
head(Business_School)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before)
## 1 1 Business 68.4 90.2 No 252
## 2 2 Computer Science 70.2 68.7 Yes 101
## 3 3 Finance 76.4 83.3 No 401
## 4 4 Business 82.6 88.7 No 287
## 5 5 Finance 76.9 75.4 No 275
## 6 6 Computer Science 83.3 82.1 No 254
## Employability (After) Status Annual Salary
## 1 276 Placed 111000
## 2 119 Placed 107000
## 3 462 Placed 109000
## 4 342 Placed 148000
## 5 347 Placed 255500
## 6 313 Placed 103500
library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) +
geom_bar() +
ylab("Frequency")
The histogram reveals that the most frequently held Undergraduate Degree among students is Business by a significant margin. Following that, Finance is the second most common degree, with Computer Science also being notably represented, though to a lesser extent.
library(pastecs)
round(stat.desc(Business_School$`Annual Salary`))
## nbr.val nbr.null nbr.na min max range sum
## 100 0 0 20000 340000 320000 10905800
## median mean SE.mean CI.mean.0.95 var std.dev coef.var
## 103500 109058 4150 8235 1722373475 41501 0
library(ggplot2)
options(scipen = 999)
ggplot(Business_School, aes(x=`Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "chocolate", color = "black") +
ylab("Frequency") +
xlab("Annual Salary") +
ggtitle("Distribution of Annual Salary for MBA Students") +
theme_minimal()
The descriptive statistics indicate that the annual salary for MBA students ranges from a minimum of 20,000 to a maximum of 340,000. The average annual salary is 109,858, and the median salary is 103,580, meaning that 50% of the students earn more than this amount. Upon examining the histogram, we observe that the salary distribution is skewed to the right, suggesting that a few students have significantly higher salaries compared to the rest. This skewness may be caused by outliers at the higher end of the salary range. To further investigate whether the salary data follows a normal distribution, it would be appropriate to conduct a Shapiro-Wilk test. Additionally, to improve the readability of the histogram, with some colleagues we used the options(scipen = 999) function, which prevents scientific notation from being used on the x-axis. This method was sourced from this link: https://stackoverflow.com/questions/5352099/how-can-i-disable-scientific-notation).
mean(Business_School$`MBA Grade`)
## [1] 76.04055
sd(Business_School$`MBA Grade`)
## [1] 7.675114
t.test(Business_School$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
I conducted a one-sample t-test to determine whether the current MBA students’ mean grade differs from last year’s average of 74.
Hypotheses - Null (H₀): The mean MBA grade is 74 (𝜇 = 74). - Alternative (H₁): The mean MBA grade is not 74 (𝜇 ≠ 74).
Interpretation: Since the p-value (0.00915) is less than 0.05, we reject the null hypothesis. This indicates that the mean MBA grade is statistically different from 74. The confidence interval (74.52, 77.56) also does not include 74, further supporting this conclusion.
Effect Size: The sample mean of 76.04 is slightly higher than 74, with a difference of about 2 points. While statistically significant, this small difference may not be practically meaningful, depending on how the school interprets grade changes.
In summary, the test shows the current MBA grades are significantly higher than last year’s average, though the difference is relatively modest.
library(readxl)
Apartments <- read_excel("~/Desktop/R/IMB 2024/Bootcamp/R Take Home Exam 2024/Task 3/Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description
Apartments$Parking <- factor(Apartments$Parking,
levels = c (0, 1),
labels = c ("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c (0, 1),
labels = c ("No", "Yes"))
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
mean(Apartments$Price)
## [1] 2018.941
sd(Apartments$Price)
## [1] 377.8417
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on the one-sample t-test results, the p-value (0.0047) is less than the 0.05 significance level. Therefore, we reject the null hypothesis and conclude that the true mean apartment price is significantly different from 1900. The 95% confidence interval indicates that the true mean price likely falls between 1937.44 and 2100.44. This further supports the conclusion that the average price per square meter is not 1900, with the sample mean estimated at 2018.94.
fit1 <- lm(Price ~ Age,
data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanation
Estimate of the regression coefficient: the regression coefficient for Age is -8.975, meaning that for each additional year in the age of an apartment, the price per square meter decreases by approximately 8.98 units. This indicates a negative relationship between apartment age and price.
Coefficient of correlation: to assess the significance of the relationship between Age and Price, we conducted a t-test on the slope (null hypothesis: β₁ = 0). The p-value for the age variable is 0.034, which is less than 0.05, indicating that the effect of Age on Price is statistically significant. Thus, we reject the null hypothesis and conclude that age does have an impact on price.
Coefficient of determination (R-squared): the R-squared value is 0.053, meaning that 5.3% of the variability in apartment prices can be explained by the linear effect of age. While the relationship is statistically significant, the low R-squared suggests that age alone is not a strong predictor of apartment prices, and other factors likely play a larger role.
library(car)
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(Apartments[ , c(1, 2, 3)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
There is a weak correlation between Age and Distance (r = 0.04) and between Age and Price (r = -0.23), suggesting that changes in apartment age are not strongly related to either the distance from the city center or the price. However, there is a strong negative correlation between Price and Distance (r = -0.63), meaning that as the distance from the city center increases, the price per square meter tends to decrease significantly. Based on these results, there doesn’t appear to be a multicollinearity issue between the variables Age, Distance, and Price, as none of the correlations between the independent variables (Age and Distance) are strong. The only notable relationship is between Price and Distance, but this doesn’t indicate multicollinearity.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF results for both Age and Distance are 1.001845, indicating no significant multicollinearity between the predictors. Since these values are well below the thresholds of 5, it suggests that both variables contribute uniquely to the model. Consequently, the regression model remains stable, and the estimates for Price based on Age and Distance are reliable.
Apartments$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
head(Apartments[, c("StdResid", "CooksD")])
## StdResid CooksD
## 1 -0.665 0.007
## 2 1.783 0.030
## 3 -0.594 0.006
## 4 0.754 0.008
## 5 -1.073 0.005
## 6 -0.778 0.005
head(Apartments[order(Apartments$StdResid),], 5) #Five units with lowest value of stand. residuals
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
## 20 13 8 1800 No No -1.381 0.012
## 35 14 16 1660 No Yes -1.261 0.008
head(Apartments[order(-Apartments$CooksD),], 5) #Five units with highest value of Cooks distance
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
There is a noticeable difference in the Cook’s distances; Apartment 38 has a Cook’s distance of 0.320, which is substantially greater than the next highest apartment (Apartment 55, 0.104). This suggests that Apartment 38 is having a significant impact on the model. Apartment 38 looks to be both a significant point and a possible outlier because it also has a comparatively large standardized residual (2.577). In order to determine whether removing Apartment 38 will enhance the model’s fit and stability, I shall do so.
library(dplyr)
Apartments_new <- Apartments %>%
filter(!CooksD %in% c(0.320))
Check
fit2_new <- lm(Price ~ Age + Distance,
data = Apartments_new)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
summary(fit2_new)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 0.0000000000000002 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 0.00000000000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 0.000000000002339
After removing the influential point ( Apartment 38 ), the updated regression model shows some improvements. The R-squared increased from 0.4396 to 0.4838, indicating that the model now explains a slightly larger portion of the variance in apartment prices. Additionally, the residual standard error decreased from 286.3 to 276.1, suggesting a better overall model fit. However, despite these improvements, the model remains relatively weak, as the R-squared is still below 0.5. This implies that there are other factors not accounted for by Age and Distance that influence apartment prices. To further refine the model, a deeper analysis is needed. I should examine graphical diagnostics, such as the standardized residuals and Cook’s distances, to identify and remove other potential outliers or influential points. Iterating through this process will help improve the model’s accuracy and stability until it reaches an optimal point for analysis.
Apartments_new$StdFitted <- scale(fit2_new$fitted.values)
library(car)
scatterplot(y = Apartments_new$StdResid, x = Apartments_new$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Upon examining the scatterplot of standardized residuals against standardized fitted values, the distribution of the residuals does not reveal a clear pattern. However, there seems to be a slight indication of heteroskedasticity, as the spread of residuals appears to widen at certain fitted value ranges. To assess this potential issue more precisely, I do a Breusch-Pagan test for heteroskedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2_new)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
The results of the Breusch-Pagan test showed a chi-square value of 2.93 and a p-value of 0.0871. Since the p-value is greater than the conventional significance level of 0.05, we fail to reject the null hypothesis, suggesting that there is no strong evidence of heteroskedasticity in the model. This implies that the residuals likely exhibit constant variance.
Apartments_new$StdResid <- round(rstandard(fit2_new), 3) #Standardized residuals
Apartments_new$CooksD <- round(cooks.distance(fit2_new), 3) #Cooks distances
library(ggplot2)
ggplot(Apartments_new, aes(x = StdResid)) +
geom_histogram(binwidth = 0.2, fill = "chocolate", color = "black") +
labs(x = "Standardized Residuals",
y = "Frequency",
title = "Histogram of Standardized Residuals") +
theme_minimal()
shapiro.test(Apartments_new$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments_new$StdResid
## W = 0.95649, p-value = 0.006355
Given that the p-value (0.006355) is significantly less than the common significance level of 0.05, we reject the null hypothesis. This indicates that there is strong evidence to conclude that the standardized residuals are not normally distributed.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova results indicate that the model fit3 fits the data significantly better than the model fit2 (p-value = 0.01007, we can reject null hypothesis), suggesting that adding these variables improves the model’s explanatory power.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 0.0000000000000002 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 0.00000000528 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 0.00000000001849
F-statistic - Null Hypothesis: none of the independent variables have a significant effect on the dependent variable (Price); - Alternative Hypothesis: at least one of the independent variables significantly affects the dependent variable.
Given p-value < 0.001, we reject the null hypothesis, concluding that the model fit3 provides a significantly better explanation of the variation in apartment prices compared to a model with no variables.
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[colnames(Apartments) %in% c("Fitted", "Residuals")])
## Fitted Residuals
## 1 1750.741 -110.74150
## 2 2357.411 442.58893
## 3 1748.807 -88.80674
## 4 1589.921 260.07897
## 5 2052.576 -412.57579
## 6 1896.691 -126.69107
Apartments[2, c("Fitted", "Residuals")]
## Fitted Residuals
## 2 2357.411 442.5889
The residual for Apartment ID2 is 422.5889