# Import the data (header = TRUE reads the first row as column names)
mydata1 <- read.csv("student_performance.csv",
                    header = TRUE)
The dataset contains 6 variables for a sample of students. Most variables are numeric and continuous, such as self-study hours, attendance, and scores. The grade variable is categorical. This structure allows us to perform both descriptive and graphical analysis, as well as to construct derived variables and filtered datasets.
set.seed(2025)
# Draw a reproducible random sample of at most 50 students
mydata1 <- mydata1[sample(nrow(mydata1), min(50, nrow(mydata1))), , drop = FALSE]
# Showing the first 6 rows of the dataset
head(mydata1) #A quick look at the first rows to check that we imported the right data.
## student_id weekly_self_study_hours attendance_percentage
## 817612 817612 10.4 93.5
## 229274 229274 7.2 74.9
## 33047 33047 23.6 82.1
## 860347 860347 6.8 100.0
## 627149 627149 12.7 77.2
## 62412 62412 14.3 82.3
## class_participation total_score grade
## 817612 7.2 82.6 B
## 229274 8.9 67.7 C
## 33047 6.4 100.0 A
## 860347 7.7 50.6 D
## 627149 9.5 77.7 B
## 62412 7.2 63.1 C
Comment: The data checks out, so we continue.
# Descriptive statistics for all variables
summary(mydata1)
## student_id weekly_self_study_hours attendance_percentage
## Min. : 13052 Min. : 0.00 Min. : 66.30
## 1st Qu.:231991 1st Qu.:10.40 1st Qu.: 77.83
## Median :479707 Median :16.35 Median : 83.10
## Mean :447598 Mean :15.39 Mean : 84.27
## 3rd Qu.:639711 3rd Qu.:20.02 3rd Qu.: 90.08
## Max. :977693 Max. :33.60 Max. :100.00
## class_participation total_score grade
## Min. : 1.400 Min. : 40.60 Length:50
## 1st Qu.: 5.225 1st Qu.: 76.12 Class :character
## Median : 6.300 Median : 88.25 Mode :character
## Mean : 6.358 Mean : 84.03
## 3rd Qu.: 7.575 3rd Qu.: 99.67
## Max. :10.000 Max. :100.00
Description of the variables (student_performance.csv):
student_id: Identification number for each student (numeric, used only for identification).
weekly_self_study_hours: Number of hours a student spends on self-study per week (numeric variable, measured in hours).
attendance_percentage: Attendance rate of the student (numeric variable, expressed in %).
class_participation: Level of class participation (numeric variable, scale depends on scoring system).
total_score: Total performance score across all activities/exams (numeric variable).
grade: Final grade assigned to the student (categorical variable, e.g., A, B, C, D, F).
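As a quick sanity check, we can confirm that the stored types match these descriptions (a minimal sketch; output omitted):
# Check the class of each variable
sapply(mydata1, class)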
#Renaming the variables in the dataset
colnames(mydata1) <- c("Student ID",
"Weekly Self Study Hours",
"Attendance Percentage",
"Class Participation",
"Total Score",
"Grade")
#Checking if the variables were renamed correctly
head(mydata1) #The column names now include spaces, so they must be wrapped in backticks when referenced.
## Student ID Weekly Self Study Hours Attendance Percentage
## 817612 817612 10.4 93.5
## 229274 229274 7.2 74.9
## 33047 33047 23.6 82.1
## 860347 860347 6.8 100.0
## 627149 627149 12.7 77.2
## 62412 62412 14.3 82.3
## Class Participation Total Score Grade
## 817612 7.2 82.6 B
## 229274 8.9 67.7 C
## 33047 6.4 100.0 A
## 860347 7.7 50.6 D
## 627149 9.5 77.7 B
## 62412 7.2 63.1 C
# New derived variable: Study Efficiency = Total Score per hour of self-study
# (how many total points a student earns per hour of self-study)
mydata1$Study_Efficiency <- mydata1$`Total Score` / mydata1$`Weekly Self Study Hours`
#Quick check of the new variable
round(summary(mydata1$Study_Efficiency), 2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.98 4.52 5.44 Inf 7.51 Inf
If any student has 0 weekly hours, the division produces Inf. We'll handle that in the next step, when we clean the data.
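Before cleaning, we can count how many such problematic values exist (a quick sketch; output omitted):
# Count non-finite efficiency values (Inf or NaN caused by 0 study hours)
sum(!is.finite(mydata1$Study_Efficiency))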
Before analysis, we clean the data to ensure validity. Units with missing values or logical inconsistencies (e.g., 0 self-study hours) are excluded. The new variable Study Efficiency computed above, which represents how many points a student earns per hour of self-study, will help us later analyze performance efficiency.
#Cleaning: remove units with any missing values and impossible Attendance %
#We also remove rows with 0 Weekly hours to avoid Inf in Study Efficiency
library(dplyr)
library(tidyr)
# We clean the dataset by removing missing values and extreme/illogical values (e.g. attendance > 100%, 0 self-study).
mydata1 <- mydata1 %>%
drop_na() %>% #removing rows with any NA values
filter(`Attendance Percentage` >= 0,
`Attendance Percentage` <= 100,
`Weekly Self Study Hours` > 0) #avoid division by zero
#Check the dimensions after cleaning
nrow(mydata1); head(mydata1)
## [1] 49
## Student ID Weekly Self Study Hours Attendance Percentage Class Participation
## 1 817612 10.4 93.5 7.2
## 2 229274 7.2 74.9 8.9
## 3 33047 23.6 82.1 6.4
## 4 860347 6.8 100.0 7.7
## 5 627149 12.7 77.2 9.5
## 6 62412 14.3 82.3 7.2
## Total Score Grade Study_Efficiency
## 1 82.6 B 7.942308
## 2 67.7 C 9.402778
## 3 100.0 A 4.237288
## 4 50.6 D 7.441176
## 5 77.7 B 6.118110
## 6 63.1 C 4.412587
#Creating a new data.frame: students with strong engagement
#Criteria: Attendance >= 80% and Class Participation >= its sample median
med_part <- median(mydata1$`Class Participation`)
mydata_high <- mydata1 %>%
filter(`Attendance Percentage` >= 80,
`Class Participation` >= med_part)
#Show first rows and how many students qualify
head(mydata_high)
## Student ID Weekly Self Study Hours Attendance Percentage Class Participation
## 1 817612 10.4 93.5 7.2
## 2 33047 23.6 82.1 6.4
## 3 860347 6.8 100.0 7.7
## 4 62412 14.3 82.3 7.2
## 5 286103 15.6 87.6 8.1
## 6 309024 24.8 86.7 6.7
## Total Score Grade Study_Efficiency
## 1 82.6 B 7.942308
## 2 100.0 A 4.237288
## 3 50.6 D 7.441176
## 4 63.1 C 4.412587
## 5 86.1 A 5.519231
## 6 100.0 A 4.032258
nrow(mydata_high)
## [1] 20
The descriptive statistics help us understand the central tendency and variability in student performance. The mean represents the average score, the median is the midpoint of the distribution, and the standard deviation quantifies how much individual scores deviate from the mean. These statistics are useful for comparing subgroups or analyzing performance trends.
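For instance, the same three statistics can be computed within each grade, using the dplyr verbs loaded earlier (a sketch; output omitted):
# Mean, median and standard deviation of Total Score per grade
mydata1 %>%
group_by(Grade) %>%
summarise(mean_score = mean(`Total Score`),
median_score = median(`Total Score`),
sd_score = sd(`Total Score`))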
#Select variables to summarize
vars_sel <- c("Weekly Self Study Hours",
"Attendance Percentage",
"Class Participation",
"Total Score")
#1) Compact overview (min, 1st Qu., median, mean, 3rd Qu., max)
summary(mydata1[ , vars_sel]) #overview of key descriptive statistics
## Weekly Self Study Hours Attendance Percentage Class Participation
## Min. : 1.7 Min. : 66.3 Min. : 1.40
## 1st Qu.:10.4 1st Qu.: 77.9 1st Qu.: 5.20
## Median :16.5 Median : 83.2 Median : 6.40
## Mean :15.7 Mean : 84.5 Mean : 6.38
## 3rd Qu.:20.1 3rd Qu.: 90.1 3rd Qu.: 7.60
## Max. :33.6 Max. :100.0 Max. :10.00
## Total Score
## Min. : 40.60
## 1st Qu.: 76.80
## Median : 88.60
## Mean : 84.33
## 3rd Qu.:100.00
## Max. :100.00
#2) Three key stats for one focal variable (Total Score)
mean(mydata1$`Total Score`, na.rm = TRUE) #average (mean)
## [1] 84.32857
median(mydata1$`Total Score`, na.rm = TRUE) #middle value (median)
## [1] 88.6
sd(mydata1$`Total Score`, na.rm = TRUE) #typical spread (sd)
## [1] 15.57392
We summarize the dataset with summary(). For Total Score, the mean is the average score, the median is the middle score when ordered, and the standard deviation shows typical variation around the mean. We use these three statistics to describe central tendency and variability.
library(ggplot2) #for ggplot graphics
# Create a filtered version of the data just for plotting (to avoid warnings)
plot_data <- mydata1 %>%
filter(!is.na(`Total Score`) & `Total Score` >= 50)
# 1) Histogram: distribution of Total Score (drawn from the filtered plotting data)
ggplot(plot_data, aes(x = `Total Score`)) +
geom_histogram(binwidth = 5, colour = "black") + #bars every 5 points
xlab("Total Score (points)") +
ylab("Frequency") +
ggtitle("Distribution of Total Score")
Note: To improve the histogram, we filtered out students with missing or extremely low scores (below 50). This helps us focus on the meaningful part of the score distribution.
The histogram shows the distribution of total scores among students. The distribution is left-skewed (negatively skewed): most students achieved relatively high scores, with the peak frequency between 85 and 90 points and a thinner tail of lower scores.
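A simple numerical check supports this reading: in a left-skewed distribution the mean lies below the median (output omitted):
# TRUE confirms the left (negative) skew: mean below median
mean(mydata1$`Total Score`) < median(mydata1$`Total Score`)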
# 2) Scatterplot with linear regression line
ggplot(plot_data, aes(x = `Weekly Self Study Hours`, y = `Total Score`)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + #adds linear trend line (no CI band)
xlab("Weekly Self Study Hours (hours)") +
ylab("Total Score (points)") +
ggtitle("Study Hours and Total Score")
The scatterplot reveals a positive association between weekly self-study hours and total score, suggesting that more study time generally leads to better performance.
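To quantify the strength of this association, we can compute the Pearson correlation coefficient (a sketch; output omitted):
# Correlation between study hours and total score in the plotting data
cor(plot_data$`Weekly Self Study Hours`, plot_data$`Total Score`)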
# 3) Boxplot: Total Score by Grade (Grade treated as categorical)
if(!is.factor(plot_data$Grade)) plot_data$Grade <- as.factor(plot_data$Grade) #convert in the plotting data, which the plot actually uses
ggplot(plot_data, aes(x = Grade, y = `Total Score`)) +
geom_boxplot() +
xlab("Grade") +
ylab("Total Score (points)") +
ggtitle("Total Score by Grade")
The boxplot compares scores across grade categories.
We observe that students with grade A tend to have the highest median scores, while lower grades show wider score variability and more outliers.
# Load the dataset
library(readxl)
mba_data <- read_excel("Business School.xlsx", sheet = "Sheet1")
# Preview the dataset
head(mba_data)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
Comment: The data checks out, so we continue.
library(ggplot2)
ggplot(mba_data, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "steelblue", color = "black") +
xlab("Undergrad Degree") +
ylab("Number of Students") +
ggtitle("Distribution of Undergraduate Degrees") +
theme_minimal()
Interpretation: The bar chart displays the distribution of undergraduate degrees among 100 MBA students. The most common undergraduate degree is Business, with over 30 students, followed by Finance and Computer Science, each with around 24–25 students.
Degrees such as Engineering and Art are less represented, with fewer than 10 students each. This indicates a strong dominance of business-related academic backgrounds among the current MBA cohort, which may reflect the program’s industry orientation or admission preferences.
# Basic summary statistics
summary(mba_data$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
# Additional useful measures
mean(mba_data$`Annual Salary`, na.rm = TRUE)
## [1] 109058
median(mba_data$`Annual Salary`, na.rm = TRUE)
## [1] 103500
sd(mba_data$`Annual Salary`, na.rm = TRUE)
## [1] 41501.49
The annual salary of MBA students ranges widely, from $20,000 to $340,000, indicating a highly diverse income distribution.
The mean salary is $109,058, while the median is $103,500. Since the mean is higher than the median, the distribution is right-skewed — this suggests that a few individuals with very high salaries are pulling the average upward.
The standard deviation is approximately $41,501, indicating a large variability in salaries. This spread reinforces the observation that the group includes both low and high earners.
Overall, while many students earn between $87,000 and $124,000 (the interquartile range), the presence of outliers above $200,000 significantly influences the distribution.
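The conventional 1.5 × IQR rule makes these high earners explicit (a sketch; output omitted):
# Upper outlier fence: Q3 + 1.5 * IQR
q <- quantile(mba_data$`Annual Salary`, c(0.25, 0.75), na.rm = TRUE)
fence <- q[2] + 1.5 * (q[2] - q[1])
mba_data$`Annual Salary`[which(mba_data$`Annual Salary` > fence)] # salaries flagged as outliers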
# One-sample t-test: is the mean MBA grade different from 74?
t.test(mba_data$`MBA Grade`, mu = 74)
##
## One Sample t-test
##
## data: mba_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
# Calculate Cohen's d
mean_grade <- mean(mba_data$`MBA Grade`, na.rm = TRUE)
sd_grade <- sd(mba_data$`MBA Grade`, na.rm = TRUE)
cohen_d <- (mean_grade - 74) / sd_grade
cohen_d
## [1] 0.2658658
A one-sample t-test was conducted to assess whether the average MBA grade differs from 74. The test result was statistically significant (p = 0.009), so we reject the null hypothesis. The average MBA grade is significantly higher than 74, with a sample mean of 76.04. The effect size is Cohen's d = 0.27, which represents a small practical difference.
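As a cross-check, the t statistic can be reproduced by hand from its definition t = (x̄ − μ₀) / (s / √n), reusing mean_grade and sd_grade from above (output omitted):
# Manual t statistic; should match t = 2.6587 from t.test()
n_students <- sum(!is.na(mba_data$`MBA Grade`))
(mean_grade - 74) / (sd_grade / sqrt(n_students))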
library(readxl)
apartments <- read_excel("Apartments.xlsx")
# Show the first 10 rows of the dataset
head(apartments, 10) # Showing first 10 units
## # A tibble: 10 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
## 7 14 20 1850 0 1
## 8 18 6 1970 1 1
## 9 22 7 2270 1 0
## 10 25 2 2570 1 0
apartments$Parking <- as.factor(apartments$Parking)
apartments$Balcony <- as.factor(apartments$Balcony)
# Show the first 10 rows to confirm the change
knitr::kable(head(apartments, 10), caption = "First 10 rows after converting Parking and Balcony to factors")
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | 0 | 1 |
| 18 | 1 | 2800 | 1 | 0 |
| 7 | 28 | 1660 | 0 | 0 |
| 28 | 29 | 1850 | 0 | 1 |
| 18 | 18 | 1640 | 1 | 1 |
| 28 | 12 | 1770 | 0 | 1 |
| 14 | 20 | 1850 | 0 | 1 |
| 18 | 6 | 1970 | 1 | 1 |
| 22 | 7 | 2270 | 1 | 0 |
| 25 | 2 | 2570 | 1 | 0 |
H₀: The true average apartment price is €1900.
H₁: The true average apartment price is not €1900.
t.test(apartments$Price, mu = 1900)
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The one-sample t-test indicates that the average apartment price differs significantly from €1900 (p = 0.0047). Since the p-value is below 0.05, we reject the null hypothesis. The 95% confidence interval for the mean price, €1937.44 to €2100.44, lies entirely above €1900, suggesting that the true mean is higher than €1900.
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
We estimated a simple linear regression to evaluate the effect of apartment age on price. The results show that with each additional year of age, the apartment price decreases by approx. €8.98 per m². The effect is statistically significant (p = 0.034), indicating a real negative relationship.
cor(apartments$Price, apartments$Age)
## [1] -0.230255
However, the model only explains about 5.3% of price variation, and the correlation between age and price is weak (r = -0.23). This suggests that while Age matters, other factors (e.g., distance, parking, balcony) likely play a bigger role in determining price.
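As a consistency check, in simple linear regression the R-squared equals the squared correlation coefficient (output omitted):
# r^2 should reproduce the Multiple R-squared of fit1 (about 0.053)
cor(apartments$Price, apartments$Age)^2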
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(apartments[ , c("Price", "Age", "Distance")],
smooth = FALSE)
The scatterplot matrix illustrates the pairwise relationships between the variables Price, Age, and Distance.
A negative linear relationship is visible between Price and Age, indicating that older apartments tend to have lower prices.
A stronger negative relationship is observed between Price and Distance, suggesting that apartments located farther from the city center are generally cheaper.
There is no apparent relationship between Age and Distance. The points are widely dispersed without a clear pattern.
This lack of a strong relationship between Age and Distance implies that there is no multicollinearity problem between these two predictors, meaning they can be used together in a multiple regression model without concern.
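This visual impression can be confirmed numerically (a sketch; output omitted):
# Correlation between the two predictors; a value near 0 supports the plot
cor(apartments$Age, apartments$Distance)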
fit2 <- lm(Price ~ Age + Distance, data = apartments)
# Show the results
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Intercept (2460.10): Represents the estimated price when both age and distance are zero. While not practically meaningful, it serves as a mathematical baseline.
Age (-7.93): Each additional year in apartment age is associated with a €7.93 decrease in price per m², holding distance constant.
Distance (-20.67): Each additional kilometer from the city center corresponds to a €20.67 decrease in price per m², holding age constant.
Both coefficients are statistically significant at the 5% level, with Distance showing a particularly strong effect (p < 0.001).
R² = 0.44: The model explains approximately 44% of the variation in apartment prices, which is considered moderate explanatory power in social sciences.
Adjusted R² = 0.43: The adjusted value remains close, indicating a good balance between model complexity and explanatory strength.
F-statistic (p < 0.001): The overall model is statistically significant, meaning that at least one of the predictors significantly explains price variation.
Conclusion: The analysis shows that both age and distance negatively influence apartment prices. Distance has a stronger marginal impact, while age also contributes meaningfully. The model provides a solid foundation for predicting price based on these characteristics.
vif(fit2)
## Age Distance
## 1.001845 1.001845
Since both VIF values for Age and Distance are well below the commonly used threshold of 5, we conclude that there is no evidence of multicollinearity between the predictors.
This means that both Age and Distance can be safely included in the regression model without inflating the standard errors or distorting the coefficient estimates.
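For reference, a predictor's VIF equals 1/(1 − R²) from regressing it on the other predictors; with only two predictors this reduces to 1/(1 − r²), which we can verify by hand (output omitted):
# Manual VIF for the two-predictor case; should match vif(fit2) of about 1.0018
r2_predictors <- cor(apartments$Age, apartments$Distance)^2
1 / (1 - r2_predictors)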
# Calculate standardized residuals
standard_resid <- rstandard(fit2)
# Calculate Cook's distance
cook_d <- cooks.distance(fit2)
# Combine with apartment IDs (row numbers) for review
diagnostics <- data.frame(
ID = 1:nrow(apartments),
Standardized_Residual = standard_resid,
Cooks_Distance = cook_d
)
# Calculate threshold for Cook's Distance
n <- nrow(apartments)
threshold_cook <- 4 / n # Rule: Cook's D > 4/n → possible influential point
# Subset potential problematic observations
subset(diagnostics, abs(Standardized_Residual) > 2 | Cooks_Distance > threshold_cook)
## ID Standardized_Residual Cooks_Distance
## 22 22 1.575982 0.06086868
## 33 33 2.050586 0.06913379
## 38 38 2.576772 0.31973058
## 53 53 -2.151787 0.06625775
## 55 55 1.444768 0.10420445
# Add standardized residuals and Cook's distance to the apartments data
apartments$StdResid <- round(rstandard(fit2), 3)
apartments$CooksD <- round(cooks.distance(fit2), 3)
# Histogram of standardized residuals
hist(apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "gray",
border = "black")
# Histogram of Cook's distances
hist(apartments$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances",
col = "gray",
border = "black")
Based on the standardized residuals and Cook’s distances, we identify several observations that may present problems for our regression model:
Observation 38 stands out as a potential outlier and influential unit. Its standardized residual exceeds the conventional threshold of ±2, and its Cook's Distance is above the common threshold of 4/n, indicating high influence on the model's estimates.
Observations 33 and 53 have standardized residuals slightly above 2 in absolute value, suggesting they may be mild outliers, although their Cook's distances remain relatively low. Observations 22 and 55 exceed the Cook's distance threshold but have residuals well within ±2, so we do not treat them as problematic.
Following standard diagnostic guidelines:
We will exclude observation 38 from further analysis due to both its high residual and influence.
Observations 33 and 53 are only mildly influential, but we remove them as well so that the re-estimated model rests solely on well-behaved observations.
This step improves the robustness of our regression analysis and ensures that the model is not disproportionately affected by extreme or unusual observations.
# Calculate standardized fitted values
standard_fitted <- scale(fitted(fit2))
# Scatterplot of standardized residuals vs. standardized fitted values
plot(standard_fitted, standard_resid,
main = "Standardized Residuals vs Fitted Values",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "steelblue")
abline(h = 0, col = "red", lty = 2)
Instead of being randomly scattered with constant spread, the residuals fan out as the fitted values increase, indicating possible heteroskedasticity. This violates the constant variance assumption, meaning standard errors and p-values may be unreliable.
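A formal complement to the visual check is the Breusch–Pagan test (a sketch assuming the lmtest package is available; output omitted):
# Breusch-Pagan test; H0: constant error variance (homoskedasticity)
library(lmtest)
bptest(fit2)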
# Histogram of standardized residuals
hist(standard_resid,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
col = "lightblue",
breaks = 10)
# Shapiro-Wilk test for normality
shapiro.test(standard_resid)
##
## Shapiro-Wilk normality test
##
## data: standard_resid
## W = 0.95306, p-value = 0.00366
There is evidence of deviation from normality in the residuals. While minor deviations may not severely affect the regression model (due to robustness of OLS under large sample sizes), the non-normality should be kept in mind — especially if using inference-based methods like confidence intervals or hypothesis testing on coefficients.
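A Q-Q plot offers a visual complement to the Shapiro-Wilk test (a sketch; plot not shown):
# Q-Q plot of the standardized residuals against the normal distribution
qqnorm(standard_resid, main = "Q-Q Plot of Standardized Residuals")
qqline(standard_resid, col = "red")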
# Remove problematic observations (by row number / ID)
clean_apartments <- apartments[-c(33, 38, 53), ]
# Re-estimate the model without those units
fit2_clean <- lm(Price ~ Age + Distance, data = clean_apartments)
# View the summary of the new model
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = clean_apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -404.0 -230.9 -51.4 190.6 504.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2455.768 73.296 33.505 < 2e-16 ***
## Age -6.011 3.086 -1.948 0.055 .
## Distance -23.543 2.665 -8.834 2.05e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared: 0.5179, Adjusted R-squared: 0.5057
## F-statistic: 42.44 on 2 and 79 DF, p-value: 3.042e-13
After removing the outlying and influential observations, the model fit improved noticeably. The effect of Age is still negative but only marginally significant (p = 0.055), suggesting it might not be a strong predictor on its own. The effect of Distance remains clearly negative and highly significant. Overall, the model now explains more than 50% of the variability in apartment prices, which is quite solid for cross-sectional data.
# Estimate model with all predictors
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = clean_apartments)
# Show summary
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = clean_apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -410.55 -190.37 -35.23 205.41 545.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2339.781 92.756 25.225 < 2e-16 ***
## Age -5.413 3.044 -1.778 0.0793 .
## Distance -21.224 2.818 -7.531 8.09e-11 ***
## Parking1 136.806 61.698 2.217 0.0295 *
## Balcony1 8.546 57.679 0.148 0.8826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257.9 on 77 degrees of freedom
## Multiple R-squared: 0.5469, Adjusted R-squared: 0.5233
## F-statistic: 23.23 on 4 and 77 DF, p-value: 1.282e-12
The model suggests that distance from city center and presence of parking are significant predictors of apartment prices. Age shows a weak negative effect, and balconies do not appear to influence price significantly in this sample.
# Compare nested models: fit2_clean (Age + Distance) vs fit3 (Age + Distance + Parking + Balcony)
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 5447075
## 2 77 5119927 2 327148 2.46 0.09212 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We used an ANOVA to test whether the extended model (fit3), which includes the categorical variables Parking and Balcony, improves the model fit compared to the simpler model (fit2_clean) that only includes Age and Distance.
The null hypothesis is that the additional variables (Parking and Balcony) do not improve the model.
The p-value is 0.092, which is above the standard 0.05 threshold, but below 0.10, suggesting weak evidence in favor of the extended model.
Conclusion: Since p = 0.092, we do not reject the null hypothesis at the 5% level. However, at a 10% significance level, there is marginal evidence that adding Parking and Balcony may improve model fit.
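An alternative way to compare the two models is AIC, which penalizes model complexity; the lower value indicates the preferable trade-off between fit and parsimony (a sketch; output omitted):
# AIC comparison of the nested models
AIC(fit2_clean, fit3)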
# Save fitted values from fit3
fitted_values <- fitted(fit3)
# Save residuals
residuals_fit3 <- resid(fit3)
# Get fitted value and residual for apartment ID 2 (row 2)
fitted_values[2]
## 2
## 2357.925
residuals_fit3[2]
## 2
## 442.075
From the model fit3, which includes Age, Distance, Parking, and Balcony, we retrieve the following for Apartment ID 2 (i.e., the second row in the dataset): a predicted (fitted) price of €2,357.93 per m² and a residual of €442.08.
This means the model underestimated the price of Apartment 2 by €442.08: the actual price was €2800, while the model predicted only €2357.93. The apartment therefore performs well above expectations given the characteristics in the model, possibly due to unobserved features such as interior quality, renovation, floor level, or neighborhood prestige.
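The same fitted value is also available through predict(), which can additionally score a new apartment; the characteristics below are purely illustrative, not taken from the data (a sketch; output omitted):
# Reproduce the fitted price for apartment 2
predict(fit3, newdata = clean_apartments[2, ])
# Hypothetical unit: 10 years old, 5 km from the center, parking, no balcony
new_apt <- data.frame(Age = 10, Distance = 5,
Parking = factor("1", levels = levels(clean_apartments$Parking)),
Balcony = factor("0", levels = levels(clean_apartments$Balcony)))
predict(fit3, newdata = new_apt)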