Ksenija Ristić

Task 1

mydata<-read.csv("student_exam_scores.csv", header=TRUE, sep=",", dec=".")
head(mydata)
##   student_id hours_studied sleep_hours attendance_percent previous_scores
## 1       S001           8.0         8.8               72.1              45
## 2       S002           1.3         8.6               60.7              55
## 3       S003           4.0         8.2               73.7              86
## 4       S004           3.5         4.8               95.1              66
## 5       S005           9.1         6.4               89.8              71
## 6       S006           8.4         5.1               58.5              75
##   exam_score
## 1       30.2
## 2       25.0
## 3       35.8
## 4       34.0
## 5       40.3
## 6       35.7

The dataset contains 200 students with the following variables:

  • student_id: Unique ID of each student.
  • hours_studied: Number of hours studied before the exam.
  • sleep_hours: Average hours of sleep.
  • attendance_percent: Class attendance in percentage.
  • previous_scores: Previous test results (renamed to past_scores in the data manipulation step below).
  • exam_score: Final exam score.

Data manipulation

# Create a new variable: study/sleep ratio
mydata$study_sleep_ratio <- mydata$hours_studied / mydata$sleep_hours

# Create a categorical variable for attendance
mydata$attendance_cat <- ifelse(mydata$attendance_percent >= 75, "High", "Low")

# Rename variable
colnames(mydata)[colnames(mydata) == "previous_scores"] <- "past_scores"

# Create a subset with students who studied more than 5 hours
mydata2 <- subset(mydata, hours_studied > 5)

# Remove students with attendance < 60%
mydata2 <- mydata2[mydata2$attendance_percent >= 60, ]

head(mydata2)
##    student_id hours_studied sleep_hours attendance_percent past_scores
## 1        S001           8.0         8.8               72.1          45
## 5        S005           9.1         6.4               89.8          71
## 9        S009           5.6         5.9               81.6          84
## 12       S012           6.6         7.9               87.6          85
## 15       S015           8.1         8.8               60.0          90
## 18       S018           7.5         7.6               73.8          58
##    exam_score study_sleep_ratio attendance_cat
## 1        30.2         0.9090909            Low
## 5        40.3         1.4218750           High
## 9        34.7         0.9491525           High
## 12       35.1         0.8354430           High
## 15       41.1         0.9204545            Low
## 18       36.3         0.9868421            Low

Descriptive statistics

# Summary statistics
summary(mydata[, c("hours_studied", "sleep_hours", "exam_score")])
##  hours_studied     sleep_hours      exam_score   
##  Min.   : 1.000   Min.   :4.000   Min.   :17.10  
##  1st Qu.: 3.500   1st Qu.:5.300   1st Qu.:29.50  
##  Median : 6.150   Median :6.700   Median :34.05  
##  Mean   : 6.325   Mean   :6.622   Mean   :33.95  
##  3rd Qu.: 9.000   3rd Qu.:8.025   3rd Qu.:38.75  
##  Max.   :12.000   Max.   :9.000   Max.   :51.30
# Mean, median and standard deviation for exam scores
mean(mydata$exam_score)
## [1] 33.955
median(mydata$exam_score)
## [1] 34.05
sd(mydata$exam_score)
## [1] 6.789548

Explanation of sample statistics:

  • Mean: The average exam score is 33.96, so students scored around 34 points on average.
  • Median: The middle exam score is 34.05; half of the students scored below this value and half above it.
  • Standard deviation: Exam scores deviate from the mean by about 6.79 points on average, so the scores are relatively clustered around the mean; a larger value would indicate more variability (a quick check of this interpretation is sketched below).
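
A quick way to check this interpretation (assuming mydata is loaded as above) is the share of students whose score lies within one standard deviation of the mean; for a roughly bell-shaped distribution this should be close to 68%.

# Share of exam scores within one standard deviation of the mean
m <- mean(mydata$exam_score)
s <- sd(mydata$exam_score)
mean(abs(mydata$exam_score - m) <= s)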

Graphical representation

library(ggplot2)

# Histogram of exam scores
ggplot(mydata, aes(x = exam_score)) +
  geom_histogram(binwidth = 5, fill="skyblue", color="black") +
  labs(title="Distribution of Exam Scores", x="Exam Score", y="Frequency")

  • The histogram shows the distribution of students’ exam scores.
  • The scores range from a minimum of 17.1 to a maximum of 51.3.
  • Most students scored between 30 and 40 points, which corresponds to the highest bars in the middle of the graph.
  • The distribution is approximately normal, with only a slight asymmetry: very high scores (above 45) are uncommon, and the mean (33.96) and median (34.05) are almost identical (a quick skewness check is sketched after this list).
  • Very low scores (below 25) are rare, with only a few students in that range.
  • Overall, the majority of students performed around the average score of about 34, while extreme high and low scores were less common. The bell-like shape indicates a distribution close to normal.
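A simple moment-based skewness coefficient can back up this visual impression (a sketch, assuming mydata as loaded above; values near 0 indicate rough symmetry).

# Moment-based skewness of exam scores
x <- mydata$exam_score
mean((x - mean(x))^3) / sd(x)^3
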
# Scatterplot: hours studied vs exam score
ggplot(mydata, aes(x = hours_studied, y = exam_score)) +
  geom_point(color="blue") +
  geom_smooth(method="lm", se=FALSE, color="red") +
  labs(title="Hours Studied vs Exam Score", x="Hours Studied", y="Exam Score")
## `geom_smooth()` using formula = 'y ~ x'

  • The scatterplot shows the relationship between hours studied and exam scores. Each point represents an individual student.
  • There is a positive trend: as the number of hours studied increases, the exam score tends to increase as well. The fitted regression line (in red) highlights this approximately linear relationship between study time and exam performance.
  • The points are scattered around the line, indicating that factors other than study time also influence exam outcomes; the strength of the linear association can be quantified with the correlation coefficient sketched below.
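The correlation coefficient quantifies the linear trend seen in the scatterplot (assuming mydata as loaded above).

# Pearson correlation between hours studied and exam score
cor(mydata$hours_studied, mydata$exam_score)
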
# Boxplot: exam scores by attendance category
ggplot(mydata, aes(x = attendance_cat, y = exam_score, fill=attendance_cat)) +
  geom_boxplot() +
  labs(title="Exam Scores by Attendance Category", x="Attendance Category", y="Exam Score")

Task 2

Load dataset (Business School.xlsx)

library(readxl)
mydata3 <- read_xlsx("Business School.xlsx")
mydata3 <- as.data.frame(mydata3)
head(mydata3)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500
library(ggplot2)

1.

ggplot(mydata3, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
  geom_bar() +
  labs(title = "Distribution of Undergraduate Degrees",
       x = "Undergraduate Degree",
       y = "Count") +
  theme_minimal()

  • The bar chart shows the distribution of undergraduate degrees among the MBA students. The tallest bar corresponds to the most common degree, which according to the chart is Business, so a business undergraduate background dominates the MBA program (the exact counts are sketched below).
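
The exact counts behind the bars can be obtained with table() (assuming mydata3 as loaded above).

# Counts of undergraduate degrees, sorted from most to least common
sort(table(mydata3$`Undergrad Degree`), decreasing = TRUE)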

2.

Descriptive statistics for Annual Salary

summary(mydata3$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
mean(mydata3$`Annual Salary`)
## [1] 109058
median(mydata3$`Annual Salary`)
## [1] 103500
sd(mydata3$`Annual Salary`)
## [1] 41501.49
  • Minimum salary is 20,000, which represents the lowest annual income among the MBA students.
  • The first quartile (87,125) shows that 25% of the students earn less than this value.
  • The median salary is 103,500, meaning that half of the students earn below 103,500 and half earn above it.
  • The mean salary is 109,058, slightly higher than the median, which indicates a right-skewed distribution caused by some students with very high salaries.
  • The third quartile (124,000) shows that 75% of the students earn less than this value.
  • The maximum salary is 340,000, which is an extreme outlier compared to the rest of the students.
  • The standard deviation is 41,501, which shows that salaries vary widely, with many students earning amounts far from the average (an IQR-based outlier check is sketched after this list).
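
The 1.5 × IQR rule used by boxplots can flag how many of the high salaries count as outliers (a sketch, assuming mydata3 as loaded above).

# Count salaries above Q3 + 1.5 * IQR
q <- quantile(mydata3$`Annual Salary`, c(0.25, 0.75))
upper <- q[2] + 1.5 * (q[2] - q[1])
sum(mydata3$`Annual Salary` > upper)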

Histogram of Annual Salary distribution

ggplot(mydata3, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 20000, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Annual Salary", 
       x = "Annual Salary", 
       y = "Frequency") +
  theme_minimal()

  • The descriptive statistics above give the minimum, quartiles, median, mean, maximum, and standard deviation of annual salaries; the histogram visualizes the same distribution.
  • The distribution is right-skewed: the middle half of the salaries lies between about 87,000 and 124,000, while a few much higher salaries (up to 340,000) stretch the right tail and pull the mean (109,058) above the median (103,500).

3.

Assign MBA grades to a variable

grades <- mydata3$`MBA Grade`

One-sample t-test (H0: mean = 74)

t_test_result <- t.test(grades, mu = 74)
print(t_test_result)
## 
##  One Sample t-test
## 
## data:  grades
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
  • The null hypothesis H0 assumes that the mean MBA grade is equal to 74.
  • Since p-value = 0.00915 < 0.05, H0 is rejected, meaning the average grade of the current generation is significantly different from 74.

Calculate Cohen’s d (effect size)

cohen_d <- (mean(grades) - 74) / sd(grades)
print(cohen_d)
## [1] 0.2658658
  • Cohen’s d ≈ 0.27, which by the usual benchmarks (0.2 small, 0.5 medium, 0.8 large) is a small effect: the mean grade differs from 74 statistically, but the practical difference is modest (a cross-check via the t statistic is sketched below).
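
The same effect size can be recovered from the t statistic, since for a one-sample t-test d = t / sqrt(n); here t = 2.6587 and n = 100 are taken from the output above.

# Cross-check of Cohen's d from the t statistic
2.6587 / sqrt(100)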

Task 3

Import the dataset Apartments.xlsx

library(readxl)

mydata4 <- read_xlsx("C:/Users/Korisnik/Desktop/domaci/Apartments.xlsx")
mydata4 <- as.data.frame(mydata4)

head(mydata4)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata4$Parking <- as.factor(mydata4$Parking)
mydata4$Balcony <- as.factor(mydata4$Balcony)

str(mydata4)
## 'data.frame':    85 obs. of  5 variables:
##  $ Age     : num  7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num  28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num  1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 1 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t_test_result <- t.test(mydata4$Price, mu = 1900)
t_test_result
## 
##  One Sample t-test
## 
## data:  mydata4$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
mean(mydata4$Price)
## [1] 2018.941
  • The average apartment price per m² in the sample is 2,018.9 €, while the hypothesized value was 1,900 €. The one-sample t-test gives t = 2.90 and p = 0.0047.
  • Since the p-value is lower than 0.05, we reject the null hypothesis and conclude that the average price is significantly different from 1,900 €. The 95% confidence interval (1,937.4 to 2,100.4) does not contain 1,900, which supports the same conclusion.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = mydata4)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
# Correlation
cor(mydata4$Price, mydata4$Age)
## [1] -0.230255
  • The estimated regression coefficient for Age is -8.98: the price decreases on average by about 8.98 €/m² for each additional year of age, while the intercept suggests that a brand-new apartment (Age = 0) would cost around 2,185 €/m². The coefficient for Age is statistically significant (p = 0.034), meaning that age has a significant but weak negative effect on price.
  • The coefficient of determination is R² = 0.053: the model explains only 5.3% of the variance in prices, indicating that age alone is not a strong predictor of apartment price. The residual standard error of 369.9 shows that the deviations between observed and predicted prices are relatively large.
  • The coefficient of correlation is -0.230, confirming a weak negative relationship between age and price: as apartments get older, prices tend to decrease, but the effect is not strong (in simple regression R² equals the squared correlation, as checked in the sketch below).
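
In simple linear regression the coefficient of determination equals the squared correlation coefficient, which reproduces the Multiple R-squared reported above.

# R-squared as the squared correlation between Price and Age
cor(mydata4$Price, mydata4$Age)^2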

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

pairs(~ Price + Age + Distance, data = mydata4, main = "Scatterplot Matrix")

  • The scatterplot matrix shows a clear negative relationship between Price and Distance (apartments farther from the city center tend to be cheaper) and a weaker negative relationship between Price and Age. For multicollinearity the relevant panel is Age versus Distance, and it shows no obvious pattern, so multicollinearity does not appear to be a major issue here (the numeric correlations are sketched below).
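
The visual impression can be backed up with the numeric correlation matrix of the three variables (assuming mydata4 as prepared above).

# Pairwise correlations between the variables in the scatterplot matrix
cor(mydata4[, c("Price", "Age", "Distance")])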

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata4)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
  • The regression output shows: Age coefficient = -7.93: holding Distance constant, each additional year of age lowers the price by about 7.9 €/m². Distance coefficient = -20.67: holding Age constant, each additional kilometer from the city center lowers the price by about 20.7 €/m². Both coefficients are statistically significant. R² = 0.44: together, Age and Distance explain about 44% of the variation in apartment prices.

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
## Loading required package: carData
vif(fit2)
##      Age Distance 
## 1.001845 1.001845
  • Both VIF values are approximately 1.00, well below the common thresholds of 5 or 10. This indicates that there is no evidence of multicollinearity among the explanatory variables in the model. In other words, the predictors are essentially uncorrelated with each other, so the estimated coefficients can be interpreted reliably (the identity behind the VIF is sketched below).
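
The reported VIF can be reproduced by hand, since VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors.

# VIF for Age from the auxiliary regression of Age on Distance
1 / (1 - summary(lm(Age ~ Distance, data = mydata4))$r.squared)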

Calculate standardized residuals and Cook’s distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

# Standardized residuals
std_resid <- rstandard(fit2)

# Cook's distance
cooks_d <- cooks.distance(fit2)

# Identify potential outliers
which(abs(std_resid) > 2)
## 33 38 53 
## 33 38 53
which(cooks_d > 4/length(cooks_d))
## 22 33 38 53 55 
## 22 33 38 53 55
  • Three observations (rows 33, 38 and 53) have standardized residuals beyond ±2, and five observations (rows 22, 33, 38, 53 and 55) exceed the conventional Cook’s distance cut-off of 4/n. These units are potentially influential and may affect the regression estimates, so they are inspected (see the sketch below) and removed before re-estimating the model.
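
Before excluding anything, the flagged rows can be collected and inspected (a sketch using the objects defined above).

# Combine and inspect all flagged observations
flagged <- sort(unique(c(which(abs(std_resid) > 2),
                         which(cooks_d > 4 / length(cooks_d)))))
mydata4[flagged, ]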

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

std_fit <- scale(fitted(fit2))

plot(std_fit, std_resid,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

  • The residuals vs. fitted values plot shows a slight funnel shape, suggesting that the variance of residuals increases for higher fitted prices. This indicates a potential problem of heteroskedasticity (a formal test is sketched below).
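
The visual impression can be complemented with a formal Breusch-Pagan test, assuming the lmtest package is installed; a small p-value would confirm heteroskedasticity.

# Breusch-Pagan test of H0: constant error variance
library(lmtest)
bptest(fit2)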

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

# Histogram
hist(std_resid, main = "Histogram of Standardized Residuals", xlab = "Std Residuals")

  • The histogram suggests an approximately normal distribution of residuals.
# Q-Q plot
qqnorm(std_resid)
qqline(std_resid, col = "red")

  • The Q-Q plot also suggests an approximately normal distribution of residuals.
# Shapiro-Wilk test
shapiro.test(std_resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  std_resid
## W = 0.95306, p-value = 0.00366
  • The Shapiro–Wilk test gives W = 0.953 with a p-value of 0.0037. Since the p-value is below 0.05, we reject the null hypothesis of normality. This indicates that the standardized residuals significantly deviate from a normal distribution, so the normality assumption of the regression model is not fully satisfied.

Estimate fit2 again without the potentially problematic units identified above and show the summary of the model. Explain all coefficients.

# Remove problematic units (if any found earlier)
clean_data <- mydata4[abs(std_resid) <= 2 & cooks_d <= 4/length(cooks_d), ]

fit2_clean <- lm(Price ~ Age + Distance, data = clean_data)
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
  • After removing the five flagged observations, the coefficients keep their signs and similar magnitudes (Age: -8.67, Distance: -24.06), so the model is relatively robust and not overly sensitive to these outliers.
  • The intercept (2,502.5) is the expected price per m² of a new apartment (Age = 0) in the city center (Distance = 0). Each additional year of age lowers the expected price by about 8.7 €/m² (p = 0.009), and each additional kilometer from the center lowers it by about 24.1 €/m² (p < 0.001), holding the other variable constant.
  • The fit also improves: R² rises from 0.44 to 0.54 and the residual standard error falls from 286.3 to 256.8.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata4)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11
  • Apartments with parking are, on average, about 196 €/m² more expensive than comparable apartments without parking (p = 0.0025).
  • The coefficient for Balcony is only 1.9 €/m² with p = 0.974, so there is no statistically significant price difference between apartments with and without a balcony.

R² = 0.50: the extended model explains somewhat more variation in prices than fit2 (R² = 0.44).

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The ANOVA comparison gives F = 4.87 with p = 0.010, so fit3 fits the data significantly better than fit2 at the 5% level.
  • Adding Parking and Balcony jointly improves the explanatory power of the model, even though only Parking is individually significant.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11
  • Parking1 = 196.2: holding Age and Distance constant, apartments with a parking space are on average about 196 €/m² more expensive than apartments without one (p = 0.0025). Balcony1 = 1.9: apartments with a balcony are estimated to be only about 2 €/m² more expensive, a difference that is not statistically significant (p = 0.974).
  • The F-test at the bottom of the regression output tests the null hypothesis that all regression coefficients except the intercept are equal to zero (H₀: β_Age = β_Distance = β_Parking = β_Balcony = 0), against the alternative that at least one of them differs from zero. In other words, it tests whether the model has any predictive power.
  • Since the p-value (1.85e-11) is very small, we reject H₀ and conclude that the model significantly explains variation in apartment prices.

Save fitted values and calculate the residual for apartment ID2.

fitted_values <- fitted(fit3)
residuals_fit3 <- residuals(fit3)

fitted_values[2]
##        2 
## 2357.411
residuals_fit3[2]
##        2 
## 442.5889
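
As a consistency check, the residual for apartment ID2 equals its observed price minus the fitted value: 2800 - 2357.411 ≈ 442.59.

# Residual = observed price - fitted value for apartment ID2
mydata4$Price[2] - fitted_values[2]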