mydata <- read.table("~/Documents/Šola/IMB/Bootcamp/R Take Home Exam 2024/Sleep_health_and_lifestyle_dataset.csv",
header = TRUE,
sep = ",",
dec = ".") #Importing the dataset
head(mydata, 8) #Showing first 8 units
## Person.ID Gender Age Occupation Sleep.Duration Quality.of.Sleep
## 1 1 Male 27 Software Engineer 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Representative 5.9 4
## 5 5 Male 28 Sales Representative 5.9 4
## 6 6 Male 28 Software Engineer 5.9 4
## 7 7 Male 29 Teacher 6.3 6
## 8 8 Male 29 Doctor 7.8 7
## Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1 42 6 Overweight 126/83 77
## 2 60 8 Normal 125/80 75
## 3 60 8 Normal 125/80 75
## 4 30 8 Obese 140/90 85
## 5 30 8 Obese 140/90 85
## 6 30 8 Obese 140/90 85
## 7 40 7 Obese 140/90 82
## 8 75 6 Normal 120/80 70
## Daily.Steps Sleep.Disorder
## 1 4200 None
## 2 10000 None
## 3 10000 None
## 4 3000 Sleep Apnea
## 5 3000 Sleep Apnea
## 6 3000 Insomnia
## 7 3500 Insomnia
## 8 8000 None
Dataset description:
names(mydata)[11] <- "Heart_beat_per_min" #Renaming variable
I renamed variable Heart.Rate to Heart_beat_per_min.
library(dplyr) #Showing only people with insomnia as a sleep disorder
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
insomnia_only <- mydata %>%
filter(Sleep.Disorder == "Insomnia")
I made a separate data frame called insomnia_only that contains only people with insomnia as a sleep disorder.
library(dplyr) #Creating a new data frame with only people that have stress levels above 6 and quality of sleep rating under 6
high_stress_and_low_quality_sleep <- mydata %>%
filter(Stress.Level > 6 & Quality.of.Sleep < 6)
I also made a new data frame called high_stress_and_low_quality_sleep that has people with stress levels above 6 and quality of sleep rating under 6.
anyNA(mydata) #Checking if there is any missing data in my data frame
## [1] FALSE
I checked whether there is any missing data in my data frame. Since anyNA() returns FALSE, there are no missing values.
mydata$Activity_to_Sleep_ratio <- mydata$Physical.Activity.Level / mydata$Sleep.Duration #Making a new variable called Activity_to_Sleep_ratio and rounding it to 2 decimals
mydata$Activity_to_Sleep_ratio <- round(mydata$Activity_to_Sleep_ratio, 2)
I made a new variable that calculates the ratio from Physical Activity Level and Sleep Duration. The new variable is called Activity_to_Sleep_ratio.
mydata <- mydata %>%
mutate(
Gender = as.factor(Gender),
Occupation = as.factor(Occupation),
BMI.Category = as.factor(BMI.Category),
Blood.Pressure = as.factor(Blood.Pressure),
Sleep.Disorder = as.factor(Sleep.Disorder)
) #Some of my data was not in factors, therefore I transformed them to factors
summary(mydata[ , -1]) #Summary of my data, without the ID variable
## Gender Age Occupation Sleep.Duration Quality.of.Sleep
## Female:185 Min. :27.00 Nurse :73 Min. :5.800 Min. :4.000
## Male :189 1st Qu.:35.25 Doctor :71 1st Qu.:6.400 1st Qu.:6.000
## Median :43.00 Engineer :63 Median :7.200 Median :7.000
## Mean :42.18 Lawyer :47 Mean :7.132 Mean :7.313
## 3rd Qu.:50.00 Teacher :40 3rd Qu.:7.800 3rd Qu.:8.000
## Max. :59.00 Accountant:37 Max. :8.500 Max. :9.000
## (Other) :43
## Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure
## Min. :30.00 Min. :3.000 Normal :195 130/85 :99
## 1st Qu.:45.00 1st Qu.:4.000 Normal Weight: 21 125/80 :65
## Median :60.00 Median :5.000 Obese : 10 140/95 :65
## Mean :59.17 Mean :5.385 Overweight :148 120/80 :45
## 3rd Qu.:75.00 3rd Qu.:7.000 115/75 :32
## Max. :90.00 Max. :8.000 135/90 :27
## (Other):41
## Heart_beat_per_min Daily.Steps Sleep.Disorder Activity_to_Sleep_ratio
## Min. :65.00 Min. : 3000 Insomnia : 77 Min. : 3.530
## 1st Qu.:68.00 1st Qu.: 5600 None :219 1st Qu.: 6.820
## Median :70.00 Median : 7000 Sleep Apnea: 78 Median : 8.330
## Mean :70.17 Mean : 6817 Mean : 8.325
## 3rd Qu.:72.00 3rd Qu.: 8000 3rd Qu.: 9.620
## Max. :86.00 Max. :10000 Max. :15.250
##
mean(mydata$Age) #Calculating the mean for variable Age
## [1] 42.18449
The average age of individuals in the data frame is 42.18.
median(mydata$Daily.Steps) #Calculating the median for variable Daily Steps
## [1] 7000
The median number of daily steps is 7000, meaning that half of the individuals in the data frame take at most 7000 steps per day, while the other half take at least 7000.
min(mydata$Sleep.Duration) #Calculating the minimum for variable Sleep Duration
## [1] 5.8
The minimum sleep duration in the data frame is 5.8 hours, meaning that the person who slept the least in this dataset reported sleeping 5.8 hours per day.
library(ggplot2) #Histogram of Sleep Duration
ggplot(mydata, aes(x = Sleep.Duration)) +
geom_histogram(binwidth = 0.2, fill = "plum", color = "plum4") +
labs(title = "Histogram of Sleep Duration", x = "Sleep Duration (hours)", y = "Frequency")
This shows us the histogram, with frequency of Sleep Duration in hours. This histogram shows us that we have two peaks, one at 6 hours of sleep and one at between 7 and 7.5 hours of sleep. Otherwise the distribution of sleep is quite evenly spread out.
library(ggplot2) #Scatterplot of Daily Steps vs. Heart Rate per min
ggplot(mydata, aes(x = Daily.Steps, y = Heart_beat_per_min)) +
geom_point(color = "hotpink") +
labs(title = "Scatterplot of Daily Steps vs. Heart Rate per min",
x = "Daily Steps",
y = "Heart Rate (beats per minute)")
This shows us the scatterplot between Daily Steps and Heart Rate (beats per min). This scatterplot indicates a negative correlation between the number of steps and Heart Rate. On the graph we can also see some outliers, which are values that are unusually high or low compared to others.
library(ggplot2) #Boxplot of Sleep Duration (x is the continuous Daily Steps, so a single box is drawn)
ggplot(mydata, aes(x = Daily.Steps, y = Sleep.Duration)) +
geom_boxplot(fill = "maroon2", color = "maroon") +
labs(title = "Boxplot of Sleep Duration by Daily Steps",
x = "Daily Steps",
y = "Sleep Duration (hours)") +
coord_cartesian(xlim = c(300, 12000))
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
This boxplot shows the distribution of Sleep Duration. Because Daily Steps is a continuous variable and no grouping was specified (see the warning above), ggplot2 draws a single box for all units rather than one box per group. Half of all values lie inside the pink box. The median is the line inside the box, at a bit over 7 hours, so half of the people in this dataset sleep more than this and half sleep less. The interquartile range tells us that the middle 50% of people sleep between almost 6 and a half and almost 8 hours. There are no outliers in this boxplot, so the whiskers reach the minimum and maximum values. A single box cannot show how Sleep Duration changes with Daily Steps, so no conclusion about a connection between the two variables can be drawn from this plot alone.
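One way to get one box per group is to bin the continuous variable before plotting. The sketch below uses made-up toy data; the data frame toy, the break points and the group labels are assumptions for illustration and are not taken from the exam dataset.

```r
# Hypothetical toy data standing in for mydata (values are made up)
toy <- data.frame(
  Daily.Steps    = c(3000, 3500, 4200, 5600, 7000, 7000, 8000, 8000, 10000, 10000),
  Sleep.Duration = c(5.9, 6.3, 6.1, 6.5, 7.2, 7.4, 7.8, 7.7, 6.2, 8.1)
)

# Bin the continuous variable into three ordered groups (break points are assumptions)
toy$Steps.Group <- cut(toy$Daily.Steps,
                       breaks = c(0, 5000, 8000, Inf),
                       labels = c("Low", "Medium", "High"))

# One box per group; plot = FALSE returns the five-number summaries without drawing
bp <- boxplot(Sleep.Duration ~ Steps.Group, data = toy, plot = FALSE)

# Row 3 of bp$stats holds the median of each group
group_medians <- setNames(bp$stats[3, ], bp$names)
group_medians
```

The binned factor could equally be passed as the x aesthetic to geom_boxplot(), which would then draw one box per group instead of a single box.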
library(readxl) #Importing excel data
mydata2 <- read_xlsx("~/Documents/Šola/IMB/Bootcamp/R Take Home Exam 2024/Task 2/Business School.xlsx")
mydata2 <- as.data.frame(mydata2)
head(mydata2) #Showing first 6 units
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status Annual Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
library(ggplot2) #Showing the distribution of Undergrad Degrees
ggplot(mydata2, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "orchid2", color = "orchid4") +
labs(title = "Distribution of Undergrad Degrees", x = "Undergrad Degree", y = "Frequency")
The bar chart above shows that Business is the most common undergraduate degree in our dataset.
summary(mydata2$`Annual Salary`) #Showing descriptive statistics of Annual Salary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
library(ggplot2) #Showing the histogram of Annual Salary
ggplot(mydata2, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, fill = "sienna2", color = "sienna") +
labs(title = "Histogram of Annual Salary", x = "Annual Salary", y = "Frequency")
This histogram shows the frequency distribution of Annual Salary. The distribution is skewed to the right: few people in our dataset have very high salaries, and those who do appear as outliers on the right side of the graph. There is one very high peak at around 100000, so this is a unimodal distribution, and salaries around this value are the most common in our dataset.
t.test(mydata2$`MBA Grade`,
mu = 74,
alternative = "two.sided") #T-test for MBA Grade
##
## One Sample t-test
##
## data: mydata2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Above is a one-sample t-test with the hypotheses:
- H0: μ(MBA Grade) = 74
- H1: μ(MBA Grade) ≠ 74
The p-value is 0.00915 (p<0.01), therefore we can reject the null hypothesis. This tells us that the mean MBA Grade is different from 74.
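As a cross-check, the two-sided p-value can be recomputed from the t statistic and the degrees of freedom printed by t.test(). This is only a sketch that copies the rounded values from the output, so the result should agree with the reported p-value up to rounding.

```r
# Values copied from the t.test() output for MBA Grade
t_value <- 2.6587
df <- 99

# Two-sided p-value: probability of a t statistic at least this extreme in either tail
p_value <- 2 * pt(-abs(t_value), df)
round(p_value, 5)
```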
library(effectsize) #Calculating Cohen's d
effectsize::cohens_d(mydata2$`MBA Grade`,
mu = 74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
We can see that Cohen’s d is 0.27 (95% CI [0.07, 0.46]), which indicates a small effect size. This suggests that there is a modest difference between this year’s generation and last year’s generation.
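For a one-sample test, Cohen's d is the difference between the sample mean and the hypothesised mean divided by the sample standard deviation, which equals t divided by the square root of n. A small sketch using the rounded values from the t-test output (n is inferred as df + 1):

```r
# Values copied from the one-sample t-test output for MBA Grade
t_value <- 2.6587
n <- 99 + 1          # sample size inferred from df = n - 1

# One-sample Cohen's d expressed through the t statistic: d = t / sqrt(n)
d <- t_value / sqrt(n)
round(d, 2)
```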
library(readxl) #Importing excel data
mydata3 <- read_xlsx("~/Documents/Šola/IMB/Bootcamp/R Take Home Exam 2024/Task 3/Apartments.xlsx")
mydata3 <- as.data.frame(mydata3)
head(mydata3) #Showing first 6 units
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata3$ParkingFactor <- factor(mydata3$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata3$BalconyFactor <- factor(mydata3$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata3, 5) #I changed categorical variables Parking and Balcony into factors, then I showed first 5 units
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
t.test(mydata3$Price,
mu = 1900,
alternative = "two.sided") #T-test for Price
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Above is a one-sample t-test with the hypotheses:
- H0: μ(Price) = 1900
- H1: μ(Price) ≠ 1900
The p-value is 0.004731 (p<0.01), therefore we can reject the null hypothesis. This tells us that the mean Price is different from 1900.
fit1 <- lm(Price ~ Age,
data = mydata3)
summary(fit1) #Simple regression function
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata3$Price, mydata3$Age) #Calculating coefficient of correlation
## [1] -0.230255
Estimate of regression coefficient = -8.975
Therefore if the Age of the apartment goes up by 1 year, the price of the apartment per m2 goes down by 8.975 euros on average (p<0.05)
Coefficient of correlation = -0.23
Therefore there is a weak negative linear correlation between price per m2 and age of an apartment.
Coefficient of determination = 0.05302
Therefore 5.30% of the variability of the price per m2 is explained by the linear effect of age of an apartment (in years). This tells us that age of an apartment doesn’t play that big of a role in the variability of price.
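In a simple regression with one explanatory variable, the coefficient of determination is the square of the correlation coefficient. The sketch below checks this with the rounded values printed above; it does not refit the model.

```r
# Values copied from the cor() and summary(fit1) output
r <- -0.230255
r_squared_reported <- 0.05302

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2
round(r_squared, 5)
```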
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata3[c("Price", "Age", "Distance")],
smooth = FALSE) #Scatterplot matrix between Price, Age and Distance
As seen from the scatterplot matrix above, there seem to be no multicollinearity problems. None of the panels exhibits a strong linear trend; on the contrary, the regression lines are quite flat or only weakly sloped. Multicollinearity concerns the explanatory variables, so the relevant conclusion is that Age and Distance do not appear to be strongly related to each other.
fit2 <- lm(Price ~ Age + Distance,
data = mydata3)
summary(fit2) #Multiple regression function
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Since both VIFs for Age and Distance are around 1, there is no problem with multicollinearity. If any of the values were above 5, then we would have problems with multicollinearity.
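With exactly two explanatory variables, both VIFs equal 1 / (1 - r²), where r is the correlation between the two variables. The sketch below inverts this formula for the VIF printed above to show how weak the implied correlation between Age and Distance is.

```r
# VIF copied from the vif(fit2) output (identical for Age and Distance)
vif_value <- 1.001845

# With two predictors: VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r_squared_predictors <- 1 - 1 / vif_value
abs_r <- sqrt(r_squared_predictors)   # absolute correlation between Age and Distance

round(c(r_squared = r_squared_predictors, abs_r = abs_r), 4)
```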
mydata3$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydata3$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals") #Histogram of standardized residuals
head(mydata3[order(mydata3$StdResid), "StdResid"], 3) #Checking for outliers below -3
## [1] -2.152 -1.499 -1.499
head(mydata3[order(-mydata3$StdResid),"StdResid"], 3) #Checking for outliers above 3
## [1] 2.577 2.051 1.783
From the graph we can see that no value goes above 3 or below -3, therefore we have no outliers.
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances") #Histogram of Cooks distances
From the Cooks distance histogram we can see that we have some units with potential high influence (there is a gap between 0.15 and 0.30), so we should look into this further.
head(mydata3[order(-mydata3$CooksD), "CooksD"], 12) #Showing 12 highest Cooks distances values
## [1] 0.320 0.104 0.069 0.066 0.061 0.038 0.037 0.034 0.032 0.030 0.030 0.030
As we can see there is a big gap between the first value and the second. However there is also a gap between the first 2 and the next three (0.069, 0.066, 0.061). Therefore I decided to delete all 5 biggest values.
library(dplyr)
mydata3 <- mydata3 %>%
filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061)) #Deleting units with high influence
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances") #Checking the corrected Cooks distances
After deleting the 5 units with potentially high influence, I plotted the Cooks distances of the remaining units (these are the values from the original fit, not recomputed ones). There are no gaps anymore, which supports the decision to remove those 5 units.
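Matching rounded Cooks distances with %in% would also drop any other unit that happens to share one of those rounded values. A possibly more robust alternative is to remove units by rank and then recompute the diagnostics on the refitted model. The sketch below uses the built-in mtcars data purely as a stand-in; the model formula and k = 5 are assumptions for illustration.

```r
# mtcars is only a stand-in dataset for this sketch
fit_old <- lm(mpg ~ wt + hp, data = mtcars)
cooks_old <- cooks.distance(fit_old)

# Row positions of the k largest Cooks distances
k <- 5
drop_rows <- order(cooks_old, decreasing = TRUE)[seq_len(k)]

# Remove those rows by position, then refit on the reduced data
reduced <- mtcars[-drop_rows, ]
fit_new <- lm(mpg ~ wt + hp, data = reduced)

# Recompute the diagnostics so they belong to the refitted model
reduced$StdResid <- rstandard(fit_new)
reduced$CooksD <- cooks.distance(fit_new)
nrow(reduced)
```

Recomputing the standardized residuals and Cooks distances after the refit keeps later diagnostic plots and tests tied to the model that is actually being interpreted.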
fit2 <- lm(Price ~ Age + Distance, data = mydata3) #Refitting the model after removing 5 units in the last task; StdResid and CooksD in mydata3 still come from the original fit
mydata3$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE) #Scatterplot between standardized residuals and standardized fitted values
In the scatterplot the data seem to be randomly distributed in a horizontal band of constant variability, with no curvature, so there shouldn’t be any heteroskedasticity. To be sure, we can check with the Breusch-Pagan test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2) #Breusch-Pagan test
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
Above is the Breusch-Pagan test with the hypotheses:
- H0: The variance is constant (homoskedasticity).
- H1: The variance is not constant (heteroskedasticity).
Since p>0.05 (p=0.187) we cannot reject the null hypothesis, so we can assume homoskedasticity.
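The p-value of the test is the upper tail of a chi-squared distribution at the reported test statistic. A short sketch that recomputes it from the values in the test summary:

```r
# Values copied from the Breusch-Pagan test summary
chi2 <- 1.738591
df <- 1

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
round(p_value, 4)
```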
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals") #Histogram of standardized residuals
The histogram of standardized residuals is slightly right skewed. All values are below 3 and above -3, so there are no outliers. Normality can be checked with the Shapiro-Wilk test.
shapiro.test(mydata3$StdResid) #Doing the shapiro test
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.93418, p-value = 0.0004761
Above is the Shapiro-Wilk test with the hypotheses:
- H0: Errors are normally distributed.
- H1: Errors are not normally distributed.
Since p<0.001 we can reject the null hypothesis and conclude that the standardized residuals are not normally distributed. However, since our sample size is bigger than 30 units, this departure from normality shouldn’t be a serious problem.
summary(fit2) #Estimating fit2
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
Explanation of coefficients:
The coefficient of determination is 0.5361. Therefore 53.61% of the variability of the price per m2 is explained by the linear effect of age of the apartment and distance from city centre.
Next we check the test of the regression. Its null hypothesis states that the population coefficient of determination is equal to 0, which would mean that all partial regression coefficients are equal to 0.
- H0: ρ² = 0
- H1: ρ² > 0
Here F = 44.49 with p<0.001, so we can reject the null hypothesis. The coefficient of determination of the population is greater than 0, which means that at least part of the variability of the dependent variable is explained by the linear influence of the explanatory variables.
If the age of the apartment is increased by 1 year, the price per m2 decreases on average by 8.674 EUR (p<0.01), assuming that distance from the city centre remains unchanged.
If the distance of the apartment from the city centre is increased by 1 km, the price per m2 decreases on average by 24.063 EUR (p<0.001), assuming that age of the apartment remains unchanged.
The multiple correlation coefficient is obtained by calculating the square root of the multiple coefficient of determination. The linear relationship between price and the combination of age and distance of the apartment is strong, since the multiple correlation coefficient is 0.73.
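Both the multiple correlation coefficient and the overall F statistic can be reproduced from the coefficient of determination. The sketch below uses the rounded R-squared and the degrees of freedom from summary(fit2), so small rounding differences from the printed output are expected.

```r
# Values copied from the summary(fit2) output after removing 5 units
r_squared <- 0.5361
k <- 2               # number of explanatory variables (Age, Distance)
df_resid <- 77       # residual degrees of freedom

# Multiple correlation coefficient
multiple_r <- sqrt(r_squared)

# Overall F statistic: explained variance per predictor over unexplained variance per df
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

round(c(multiple_r = multiple_r, f_stat = f_stat), 3)
```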
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor, data = mydata3) #Estimating the new linear function
anova(fit2, fit3) #anova with fit3 and fit 2
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
Above is the anova function comparing fit3 and fit2 with the hypotheses:
- H0: ∆ρ² = 0
- H1: ∆ρ² > 0
Since p>0.05 (p=0.1135) we cannot reject the null hypothesis. We cannot claim that the larger model fits the data better, therefore we should work with the simpler one.
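The F statistic in the anova() table can be reproduced from the residual sums of squares of the two nested models. A short sketch using the values printed in the table:

```r
# Values copied from the anova(fit2, fit3) table
rss_small <- 5077362   # Price ~ Age + Distance
rss_large <- 4791128   # Price ~ Age + Distance + ParkingFactor + BalconyFactor
df_diff <- 2           # number of added parameters
df_large <- 75         # residual degrees of freedom of the larger model

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / df_diff) / (rss_large / df_large)
p_value <- pf(f_stat, df_diff, df_large, lower.tail = FALSE)

round(c(f_stat = f_stat, p_value = p_value), 4)
```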
summary(fit3) #Results of fit3
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFactorYes 128.700 60.801 2.117 0.0376 *
## BalconyFactorYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Explanation of the categorical variables and F-statistics:
Apartments with parking have on average a 128.7 EUR higher price per m2 than apartments without parking (p<0.05), assuming that the other explanatory variables remain unchanged.
We cannot say that having a balcony has an effect on the price per m2 of the apartment, since p>0.05 (p=0.9165).
Next we check the test of the regression. Its null hypothesis states that the population coefficient of determination is equal to 0, which would mean that all partial regression coefficients are equal to 0.
- H0: ρ² = 0
- H1: ρ² > 0
Here F = 24.08 with p<0.001, so we can reject the null hypothesis. The coefficient of determination of the population is greater than 0, which means that at least part of the variability of the dependent variable is explained by the linear influence of the explanatory variables.
mydata3$StdFittedValues <- fitted.values(fit3)
mydata3$StdResid <- residuals(fit3)
head(mydata3[ , colnames(mydata3) %in% c("ID", "Price", "StdFittedValues",
"StdResid")]) #Saving fitted values and residuals of fit3 (unstandardized, despite the column names); row 2 is the apartment with ID 2
## Price StdResid StdFittedValues
## 1 1640 -88.64095 1728.641
## 2 2800 443.40256 2356.597
## 3 1660 -62.60903 1722.609
## 4 1850 310.68782 1539.312
## 5 1640 -349.28625 1989.286
## 6 1770 -142.65528 1912.655
Based on the estimated regression function, we would expect the price per m2 for the apartment with ID 2 to be 2356.60 EUR. The actual price was 2800 EUR, so the residual is 2800 - 2356.60 = 443.40 EUR.
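The fitted value for the second apartment (Age 18, Distance 1, parking yes, balcony no) can be reproduced by hand from the fit3 coefficients. The sketch uses the rounded coefficients from summary(fit3), so it matches the stored fitted value only up to rounding.

```r
# Coefficients copied (rounded) from summary(fit3)
b_intercept <- 2393.316
b_age <- -7.970
b_distance <- -21.961
b_parking_yes <- 128.700
b_balcony_yes <- 6.032

# Apartment in row 2: Age 18, Distance 1, Parking = 1 (Yes), Balcony = 0 (No)
fitted_price <- b_intercept + b_age * 18 + b_distance * 1 +
  b_parking_yes * 1 + b_balcony_yes * 0

# Residual = actual price - fitted price
residual <- 2800 - fitted_price

round(c(fitted_price = fitted_price, residual = residual), 2)
```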