TASK 1

a) Explain the dataset

mydata <- read.table("~/Documents/Šola/IMB/Bootcamp/R Take Home Exam 2024/Sleep_health_and_lifestyle_dataset.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".") #Importing the dataset
head(mydata, 8) #Showing first 8 units
##   Person.ID Gender Age           Occupation Sleep.Duration Quality.of.Sleep
## 1         1   Male  27    Software Engineer            6.1                6
## 2         2   Male  28               Doctor            6.2                6
## 3         3   Male  28               Doctor            6.2                6
## 4         4   Male  28 Sales Representative            5.9                4
## 5         5   Male  28 Sales Representative            5.9                4
## 6         6   Male  28    Software Engineer            5.9                4
## 7         7   Male  29              Teacher            6.3                6
## 8         8   Male  29               Doctor            7.8                7
##   Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1                      42            6   Overweight         126/83         77
## 2                      60            8       Normal         125/80         75
## 3                      60            8       Normal         125/80         75
## 4                      30            8        Obese         140/90         85
## 5                      30            8        Obese         140/90         85
## 6                      30            8        Obese         140/90         85
## 7                      40            7        Obese         140/90         82
## 8                      75            6       Normal         120/80         70
##   Daily.Steps Sleep.Disorder
## 1        4200           None
## 2       10000           None
## 3       10000           None
## 4        3000    Sleep Apnea
## 5        3000    Sleep Apnea
## 6        3000       Insomnia
## 7        3500       Insomnia
## 8        8000           None

Dataset description:

  • ID of a person
  • Gender (Female, Male)
  • Age of a person in years
  • Occupation: the profession of a person (Software Engineer, Doctor, Sales Representative, Teacher, Nurse, Accountant, Scientist, Engineer, Lawyer, Salesperson, Manager)
  • Sleep.Duration: the number of hours a person sleeps per day
  • Quality.of.Sleep: a subjective rating of the quality of sleep (scale 1 to 10)
  • Physical.Activity.Level: the number of minutes a person engages in physical activity daily
  • Stress.Level: a subjective rating of the stress level experienced by a person (scale 1 to 10)
  • BMI.Category: the BMI category of a person (Normal, Normal Weight, Overweight, Obese)
  • Blood.Pressure: the blood pressure measurement of a person (systolic, diastolic)
  • Heart.Rate: the number of heartbeats per minute at rest (bpm)
  • Daily.Steps: number of steps a person makes per day
  • Sleep.Disorder: the presence or absence of a sleep disorder in a person (None, Insomnia, Sleep Apnea)

b) Making data manipulations

names(mydata)[11] <- "Heart_beat_per_min" #Renaming variable

I renamed variable Heart.Rate to Heart_beat_per_min.

library(dplyr) #Showing only people with insomnia as a sleep disorder
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
insomnia_only <- mydata %>%
  filter(Sleep.Disorder == "Insomnia") 

I made a separate data frame called insomnia_only that contains only people with insomnia as a sleep disorder.

library(dplyr) #Creating a new data frame with only people that have stress levels above 6 and quality of sleep rating under 6

high_stress_and_low_quality_sleep <- mydata %>%
  filter(Stress.Level > 6 & Quality.of.Sleep < 6)

I also made a new data frame called high_stress_and_low_quality_sleep that has people with stress levels above 6 and quality of sleep rating under 6.

anyNA(mydata) #Checking if there is any missing data in my data frame
## [1] FALSE

I checked whether there is any missing data in my data frame. There is none, since the result is FALSE.

mydata$Activity_to_Sleep_ratio <- mydata$Physical.Activity.Level / mydata$Sleep.Duration #Making a new variable called Activity_to_Sleep_ratio and rounding it to 2 decimals

mydata$Activity_to_Sleep_ratio <- round(mydata$Activity_to_Sleep_ratio, 2)

I made a new variable that calculates the ratio from Physical Activity Level and Sleep Duration. The new variable is called Activity_to_Sleep_ratio.

c) Descriptive statistics

mydata <- mydata %>%
  mutate(
    Gender = as.factor(Gender),
    Occupation = as.factor(Occupation),
    BMI.Category = as.factor(BMI.Category),
    Blood.Pressure = as.factor(Blood.Pressure),
    Sleep.Disorder = as.factor(Sleep.Disorder)
  ) #Some of my data was not in factors, therefore I transformed them to factors
summary(mydata[ , -1]) #Summary of my data, without the ID variable
##     Gender         Age             Occupation Sleep.Duration  Quality.of.Sleep
##  Female:185   Min.   :27.00   Nurse     :73   Min.   :5.800   Min.   :4.000   
##  Male  :189   1st Qu.:35.25   Doctor    :71   1st Qu.:6.400   1st Qu.:6.000   
##               Median :43.00   Engineer  :63   Median :7.200   Median :7.000   
##               Mean   :42.18   Lawyer    :47   Mean   :7.132   Mean   :7.313   
##               3rd Qu.:50.00   Teacher   :40   3rd Qu.:7.800   3rd Qu.:8.000   
##               Max.   :59.00   Accountant:37   Max.   :8.500   Max.   :9.000   
##                               (Other)   :43                                   
##  Physical.Activity.Level  Stress.Level          BMI.Category Blood.Pressure
##  Min.   :30.00           Min.   :3.000   Normal       :195   130/85 :99    
##  1st Qu.:45.00           1st Qu.:4.000   Normal Weight: 21   125/80 :65    
##  Median :60.00           Median :5.000   Obese        : 10   140/95 :65    
##  Mean   :59.17           Mean   :5.385   Overweight   :148   120/80 :45    
##  3rd Qu.:75.00           3rd Qu.:7.000                       115/75 :32    
##  Max.   :90.00           Max.   :8.000                       135/90 :27    
##                                                              (Other):41    
##  Heart_beat_per_min  Daily.Steps        Sleep.Disorder Activity_to_Sleep_ratio
##  Min.   :65.00      Min.   : 3000   Insomnia   : 77    Min.   : 3.530         
##  1st Qu.:68.00      1st Qu.: 5600   None       :219    1st Qu.: 6.820         
##  Median :70.00      Median : 7000   Sleep Apnea: 78    Median : 8.330         
##  Mean   :70.17      Mean   : 6817                      Mean   : 8.325         
##  3rd Qu.:72.00      3rd Qu.: 8000                      3rd Qu.: 9.620         
##  Max.   :86.00      Max.   :10000                      Max.   :15.250         
## 
mean(mydata$Age) #Calculating the mean for variable Age
## [1] 42.18449

The average age of individuals in the data frame is 42.18.

median(mydata$Daily.Steps) #Calculating the median for variable Daily Steps
## [1] 7000

The median number of daily steps is 7000, meaning that half of the individuals in the data frame take 7000 steps or fewer per day, while the other half take 7000 steps or more.

min(mydata$Sleep.Duration) #Calculating the minimum for variable Sleep Duration
## [1] 5.8

The minimum sleep duration in the data frame is 5.8 hours, meaning that the person who slept the least in this dataset reported sleeping 5.8 hours per day.

d) Graphs (histogram, scatterplot, boxplot)

library(ggplot2) #Histogram of Sleep Duration

ggplot(mydata, aes(x = Sleep.Duration)) +
  geom_histogram(binwidth = 0.2, fill = "plum", color = "plum4") +
  labs(title = "Histogram of Sleep Duration", x = "Sleep Duration (hours)", y = "Frequency")

This histogram shows the frequency of Sleep Duration in hours. It has two peaks, one at about 6 hours of sleep and one between 7 and 7.5 hours of sleep. Otherwise the distribution of sleep duration is quite evenly spread out.

library(ggplot2) #Scatterplot of Daily Steps vs. Heart Rate per min

ggplot(mydata, aes(x = Daily.Steps, y = Heart_beat_per_min)) +
  geom_point(color = "hotpink") +
  labs(title = "Scatterplot of Daily Steps vs. Heart Rate per min", 
       x = "Daily Steps", 
       y = "Heart Rate (beats per minute)") 

This shows us the scatterplot between Daily Steps and Heart Rate (beats per min). This scatterplot indicates a negative correlation between the number of steps and Heart Rate. On the graph we can also see some outliers, which are values that are unusually high or low compared to others.

library(ggplot2) #Boxplot of Sleep Duration (Daily.Steps is continuous, so a single box is drawn)

ggplot(mydata, aes(x = Daily.Steps, y = Sleep.Duration)) +
  geom_boxplot(fill = "maroon2", color = "maroon") +
  labs(title = "Boxplot of Sleep Duration by Daily Steps", 
       x = "Daily Steps", 
       y = "Sleep Duration (hours)") +
  coord_cartesian(xlim = c(300, 12000))
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

This boxplot was meant to relate Sleep Duration to Daily Steps. Because Daily Steps is a continuous variable, ggplot2 draws one single box (hence the warning above), so the plot describes the distribution of Sleep Duration only. The pink box contains the middle half of all values. The median is the line inside the box, a bit over 7 hours, which indicates that half of the people in this dataset sleep more than this and half of them less. The interquartile range tells us that 50% of people in this dataset sleep between almost 6 and a half and almost 8 hours. The whiskers extend to the minimum and maximum values, since no outliers are present. A single box cannot show whether Sleep Duration changes with Daily Steps, so no connection between the two variables can be read from this plot.
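To compare sleep duration between groups, a boxplot needs a categorical variable on the x axis. Below is a minimal base R sketch of that idea. The data frame toy is invented for illustration (it is not the exam dataset) and only borrows the column names Sleep.Duration and BMI.Category.

```r
# Toy data (invented for illustration; not the exam dataset)
toy <- data.frame(
  BMI.Category   = factor(rep(c("Normal", "Overweight", "Obese"), each = 4)),
  Sleep.Duration = c(7.2, 7.8, 8.0, 7.5,
                     6.5, 6.4, 6.8, 6.1,
                     5.9, 6.0, 6.3, 5.8)
)

# One box per BMI category; the returned object holds the box statistics
bp <- boxplot(Sleep.Duration ~ BMI.Category, data = toy,
              xlab = "BMI Category", ylab = "Sleep Duration (hours)",
              main = "Boxplot of Sleep Duration by BMI Category")

# Median sleep duration per group (third row of bp$stats)
group_medians <- setNames(bp$stats[3, ], bp$names)
group_medians
```

With one box per category, differences in median and spread between the groups can be read directly from the plot.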

TASK 2

a) Graphing the distribution of undergrad degrees

library(readxl) #Importing excel data

mydata2 <- read_xlsx("~/Documents/Šola/IMB/Bootcamp/R Take Home Exam 2024/Task 2/Business School.xlsx")

mydata2 <- as.data.frame(mydata2)
head(mydata2) #Showing first 6 units
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500
library(ggplot2) #Showing the distribution of Undergrad Degrees

ggplot(mydata2, aes(x = `Undergrad Degree`)) + 
  geom_bar(fill = "orchid2", color = "orchid4") +
  labs(title = "Distribution of Undergrad Degrees", x = "Undergrad Degree", y = "Frequency")

From the graph above we can see that Business is the most common undergraduate degree in our dataset.

b) Descriptive statistics of Annual Salary with histogram

summary(mydata2$`Annual Salary`) #Showing descriptive statistics of Annual Salary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
library(ggplot2) #Showing the histogram of Annual Salary

ggplot(mydata2, aes(x = `Annual Salary`)) + 
  geom_histogram(binwidth = 10000, fill = "sienna2", color = "sienna") +
  labs(title = "Histogram of Annual Salary", x = "Annual Salary", y = "Frequency")

This histogram shows the frequency of Annual Salary. The distribution is skewed to the right, which tells us that few people in our dataset have very high salaries. Those with very high salaries are the outliers we see on the right side of the graph. There is one very high peak at around 100000, so this is a unimodal distribution, and salaries near this value are the most common in our dataset.

c) Testing the hypothesis

t.test(mydata2$`MBA Grade`,
       mu = 74,
       alternative = "two.sided") #T-test for MBA Grade
## 
##  One Sample t-test
## 
## data:  mydata2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Above there is a one-sample t-test with the hypotheses:

  • H0: μMBA Grade = 74
  • H1: μMBA Grade ≠ 74

The p-value is 0.00915 (p < 0.01), therefore we can reject the null hypothesis. This tells us that the mean MBA Grade is different from 74.
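The reported p-value and confidence interval can be cross-checked from the other numbers in the output. The sketch below uses only values printed by t.test above. The variable names are made up for illustration.

```r
# Values reported in the t-test output above
x_bar <- 76.04055   # sample mean of MBA Grade
mu0   <- 74         # hypothesised mean
t_val <- 2.6587     # reported t statistic
df    <- 99         # degrees of freedom, so n = df + 1 = 100

# Standard error implied by t = (x_bar - mu0) / se
se <- (x_bar - mu0) / t_val

# Two-sided p-value and 95% confidence interval
p_val <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)
ci    <- x_bar + c(-1, 1) * qt(0.975, df = df) * se

round(p_val, 5)
round(ci, 2)
```

The interval excludes 74, which is the same conclusion the p-value gives at the 5% level.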

library(effectsize) #Calculating Cohen's d

effectsize::cohens_d(mydata2$`MBA Grade`,
                     mu = 74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

Cohen’s d is 0.27 (95% CI [0.07, 0.46]), which indicates a small effect size. This suggests that there is a modest difference between this year’s generation and last year’s generation.
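For a one-sample test, Cohen’s d measures how far the sample mean lies from the hypothesised value, in standard-deviation units. It can therefore be recovered from the reported t statistic and the sample size, n = df + 1 = 100:

```latex
d = \frac{\bar{x} - \mu_0}{s} = \frac{t}{\sqrt{n}} = \frac{2.6587}{\sqrt{100}} \approx 0.27
```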

TASK 3

a) Import the dataset Apartments.xlsx

library(readxl) #Importing excel data

mydata3 <- read_xlsx("~/Documents/Šola/IMB/Bootcamp/R Take Home Exam 2024/Task 3/Apartments.xlsx")

mydata3 <- as.data.frame(mydata3)

head(mydata3) #Showing first 6 units
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

b) Change categorical variables into factors

mydata3$ParkingFactor <- factor(mydata3$Parking,
                                levels = c(0, 1),
                                labels = c("No", "Yes"))

mydata3$BalconyFactor <- factor(mydata3$Balcony,
                                levels = c(0, 1),
                                labels = c("No", "Yes"))

head(mydata3, 5) #I changed categorical variables Parking and Balcony into factors, then I showed first 5 units
##   Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 1   7       28  1640       0       1            No           Yes
## 2  18        1  2800       1       0           Yes            No
## 3   7       28  1660       0       0            No            No
## 4  28       29  1850       0       1            No           Yes
## 5  18       18  1640       1       1           Yes           Yes

c) Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata3$Price,
       mu = 1900,
       alternative = "two.sided") #T-test for Price
## 
##  One Sample t-test
## 
## data:  mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Above there is a one-sample t-test with the hypotheses:

  • H0: μPrice = 1900
  • H1: μPrice ≠ 1900

The p-value is 0.004731 (p < 0.01), therefore we can reject the null hypothesis. This tells us that the mean price per m2 is different from 1900 EUR.

d) Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = mydata3)

summary(fit1) #Simple regression function
## 
## Call:
## lm(formula = Price ~ Age, data = mydata3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata3$Price, mydata3$Age) #Calculating coefficient of correlation
## [1] -0.230255

Estimate of regression coefficient = -8.975

Therefore if the Age of the apartment goes up by 1 year, the price of the apartment per m2 goes down by 8.975 euros on average (p<0.05)

Coefficient of correlation = -0.23

Therefore there is a weak negative linear correlation between price per m2 and age of an apartment.

Coefficient of determination = 0.05302

Therefore 5.30% of the variability of the price per m2 is explained by the linear effect of age of an apartment (in years). This tells us that age of an apartment doesn’t play that big of a role in the variability of price.
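In a simple regression the coefficient of determination is the square of the correlation coefficient, which is why (-0.230255)² ≈ 0.05302 here. A short sketch of this identity follows. It uses the built-in mtcars data as a stand-in, not the apartments data.

```r
# Stand-in data: built-in mtcars (not the apartments data)
fit <- lm(mpg ~ wt, data = mtcars)
r   <- cor(mtcars$mpg, mtcars$wt)

r^2                      # squared correlation coefficient
summary(fit)$r.squared   # coefficient of determination, same value

# The slope has the same sign as r: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(mtcars$mpg) / sd(mtcars$wt)
c(coef(fit)["wt"], slope_from_r)
```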

e) Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicollinearity.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata3[c("Price", "Age", "Distance")],
                  smooth = FALSE) #Scatterplot matrix between Price, Age and Distance

As seen from the scatterplot matrix above, there seem to be no multicollinearity problems. None of the panels shows a strong linear trend; the regression lines are flat or only weakly sloped. In particular, there is no strong linear relationship between the two explanatory variables, Age and Distance, so we can conclude that multicollinearity is not a problem.

f) Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)

summary(fit2) #Multiple regression function
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

g) Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Since the VIFs for both Age and Distance are around 1, there is no problem with multicollinearity. If any of the values were above 5, we would have a problem with multicollinearity.
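The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors both VIFs are equal, as in the output above, and a value of 1.001845 implies a squared correlation between Age and Distance of only about 0.002. A base R sketch of the calculation, on the built-in mtcars data as a stand-in:

```r
# Stand-in data: built-in mtcars (not the apartments data)
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# VIF for one predictor = 1 / (1 - R^2) from regressing it on the others
r2_wt  <- summary(lm(wt ~ qsec, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# With two predictors this equals 1 / (1 - r^2), r being their correlation
vif_from_cor <- 1 / (1 - cor(mtcars$wt, mtcars$qsec)^2)
c(vif_wt, vif_from_cor)
```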

h) Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata3$StdResid <- round(rstandard(fit2), 3) #Standard residuals
mydata3$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(mydata3$StdResid,
     xlab = "Standardized residuals",
     ylab = "frequency",
     main = "Histogram of standardized residuals") #Histogram of standardized residuals

head(mydata3[order(mydata3$StdResid), "StdResid"], 3) #Checking for outliers below -3
## [1] -2.152 -1.499 -1.499
head(mydata3[order(-mydata3$StdResid),"StdResid"], 3) #Checking for outliers above 3
## [1] 2.577 2.051 1.783

From the graph we can see that no value goes above 3 or below -3, therefore we have no outliers.

hist(mydata3$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances") #Histogram of Cooks distances

From the Cooks distance histogram we can see that we have some units with potential high influence (there is a gap between 0.15 and 0.30), so we should look into this further.

head(mydata3[order(-mydata3$CooksD), "CooksD"], 12) #Showing 12 highest Cooks distances values
##  [1] 0.320 0.104 0.069 0.066 0.061 0.038 0.037 0.034 0.032 0.030 0.030 0.030

As we can see, there is a big gap between the first value and the second. There is also a gap between the first two values and the next three (0.069, 0.066, 0.061). Therefore I decided to remove the units with the 5 largest Cook's distances.

library(dplyr)

mydata3 <- mydata3 %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061)) #Deleting units with high influence
hist(mydata3$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances") #Checking the corrected Cooks distances

After removing the 5 units with potentially high influence, I checked the remaining Cook's distances and there are no gaps anymore, therefore it was a good choice to remove those 5 units.
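Filtering on rounded Cook's distances would also drop any other unit that happens to share one of those rounded values. A sketch of an alternative, which removes units by row position instead, is given below. It uses the built-in mtcars data as a stand-in, and the choice of 5 units simply mirrors the decision above.

```r
# Stand-in data: built-in mtcars (not the apartments data)
fit <- lm(mpg ~ wt + qsec, data = mtcars)
cd  <- cooks.distance(fit)

# Row positions of the 5 largest Cook's distances (5 mirrors the text above)
drop_rows <- order(cd, decreasing = TRUE)[1:5]

# Remove exactly those rows, then refit on the reduced data
reduced <- mtcars[-drop_rows, ]
refit   <- lm(mpg ~ wt + qsec, data = reduced)

nrow(mtcars) - nrow(reduced)
```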

i) Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance, data = mydata3) #Refitting the model after removing 5 units in the last task
mydata3$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE) #Scatterplot between standardized residuals and standardized fitted values

In the scatterplot the points appear randomly distributed in a horizontal band of constant variability, with no curvature, therefore there shouldn’t be any heteroskedasticity. To be sure, we can check with the Breusch-Pagan test.
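Note that the StdResid column was computed from the model fitted before the units were removed, while StdFitted comes from the refitted model. A sketch of taking both from the same refitted model is given below. It uses the built-in mtcars data as a stand-in, and the cutoff of 0.1 is only illustrative.

```r
# Stand-in data: built-in mtcars (not the apartments data)
fit     <- lm(mpg ~ wt + qsec, data = mtcars)
keep    <- cooks.distance(fit) < 0.1     # illustrative cutoff, not a rule
reduced <- mtcars[keep, ]

# Refit, then take residuals and fitted values from the refitted model
refit <- lm(mpg ~ wt + qsec, data = reduced)
reduced$StdResid  <- rstandard(refit)
reduced$StdFitted <- as.numeric(scale(fitted(refit)))

plot(reduced$StdFitted, reduced$StdResid,
     xlab = "Standardized fitted values",
     ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```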

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2) #Breusch-Pagan test
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

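Above there is a Breusch-Pagan test with the hypotheses:

  • H0: The variance of the errors is constant (homoskedasticity)
  • H1: The variance of the errors is not constant (heteroskedasticity)

Since p > 0.05 (p = 0.187), we cannot reject the null hypothesis, so we can assume homoskedasticity.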
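The idea behind the test can be sketched in base R: regress the squared residuals on the fitted values and check whether that auxiliary regression explains anything. The version below is the studentized (Koenker) variant, n · R², so it is an assumption-laden illustration and may not reproduce the olsrr statistic exactly. It runs on the built-in mtcars data as a stand-in. The last line only confirms that the reported p-value follows from the reported Chi2 with 1 degree of freedom.

```r
# Stand-in data: built-in mtcars (not the apartments data)
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# Auxiliary regression: squared residuals on the fitted values
e2  <- residuals(fit)^2
aux <- lm(e2 ~ fitted(fit))

# Studentized (Koenker) statistic: n * R^2, chi-squared with 1 df
n     <- length(e2)
stat  <- n * summary(aux)$r.squared
p_val <- pchisq(stat, df = 1, lower.tail = FALSE)
c(statistic = stat, p.value = p_val)

# p-value implied by the Chi2 value reported in the output above
p_reported <- pchisq(1.738591, df = 1, lower.tail = FALSE)
p_reported
```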

j) Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata3$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals") #Histogram of standardized residuals

The histogram of standardized residuals is slightly right-skewed. All values are between -3 and 3, therefore there are no outliers. We can test for normality with the Shapiro-Wilk test.

shapiro.test(mydata3$StdResid) #Doing the shapiro test
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata3$StdResid
## W = 0.93418, p-value = 0.0004761

Above there is a Shapiro-Wilk test with the hypotheses:

  • H0: Errors are distributed normally
  • H1: Errors are not distributed normally

Since p < 0.001, we can reject the null hypothesis and conclude that the standardized residuals are not distributed normally. However, since our sample size is bigger than 30 units, the fact that standardized residuals are not distributed normally shouldn’t be a problem.

k) Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2) #Estimating fit2
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187

Explanation of coefficients:

The coefficient of determination is 0.5361. Therefore 53.61% of the variability of the price per m2 is explained by the linear effect of age of the apartment and distance from city centre.

Next we check the F-test of the regression, whose null hypothesis states that the population coefficient of determination is equal to 0, which would mean that all partial regression coefficients are equal to 0.

  • H0: ρ² = 0
  • H1: ρ² > 0

Here F = 44.49 with p < 0.001, so we can reject the null hypothesis. The coefficient of determination of the population is greater than 0, which means that at least part of the variability of the dependent variable is explained by the linear influence of the explanatory variables.
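The F statistic can be recovered from the coefficient of determination, with k = 2 explanatory variables and n = 80 units (77 residual degrees of freedom):

```latex
F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} = \frac{0.5361 / 2}{0.4639 / 77} \approx 44.49
```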

If the age of the apartment is increased by 1 year, the price per m2 decreases on average by 8.674 EUR (p<0.01), assuming that distance from the city centre remains unchanged.

If the distance of the apartment from the city centre is increased by 1 km, the price per m2 decreases on average by 24.063 EUR (p<0.001), assuming that age of the apartment remains unchanged.

The multiple correlation coefficient is obtained as the square root of the multiple coefficient of determination. At 0.73, it indicates a strong linear relationship between price and the combination of age and distance of the apartment.
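The multiple correlation coefficient can also be read as the ordinary correlation between the observed values and the values fitted by the model. A sketch of that equivalence, on the built-in mtcars data as a stand-in:

```r
# Stand-in data: built-in mtcars (not the apartments data)
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# Square root of the coefficient of determination
multiple_r <- sqrt(summary(fit)$r.squared)

# Correlation between observed and fitted values gives the same number
cor_obs_fit <- cor(mtcars$mpg, fitted(fit))

c(multiple_r, cor_obs_fit)
```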

l) Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor, data = mydata3) #Estimating the new linear function

m) With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3) #anova with fit3 and fit 2
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Above there is the anova function comparing fit2 and fit3 with the hypotheses:

  • H0: ∆ρ² = 0
  • H1: ∆ρ² > 0

Since p > 0.05 (p = 0.1135), we cannot reject the null hypothesis. Adding Parking and Balcony does not significantly improve the fit, therefore we should work with the simpler model, fit2.
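The F statistic and p-value in the anova table follow from the two residual sums of squares. The sketch below recomputes them from the values printed in the table. The variable names are made up for illustration.

```r
# Values copied from the anova(fit2, fit3) table above
rss_small <- 5077362   # residual sum of squares, fit2
rss_large <- 4791128   # residual sum of squares, fit3
df_diff   <- 2         # extra parameters in fit3
df_resid  <- 75        # residual degrees of freedom, fit3

# Drop in RSS per added parameter, relative to the residual variance of fit3
f_val <- ((rss_small - rss_large) / df_diff) / (rss_large / df_resid)
p_val <- pf(f_val, df_diff, df_resid, lower.tail = FALSE)

round(c(F = f_val, p = p_val), 4)
```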

n) Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3) #Results of fit3
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2393.316     93.930  25.480  < 2e-16 ***
## Age                -7.970      3.191  -2.498   0.0147 *  
## Distance          -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFactorYes  128.700     60.801   2.117   0.0376 *  
## BalconyFactorYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Explanation of the categorical variables and F-statistics:

Apartments with parking have, on average, a price per m2 that is 128.7 EUR higher than apartments without parking (p<0.05), assuming that the other explanatory variables remain unchanged.

We cannot say that having a balcony has an effect on the price per m2 of the apartment, since p>0.05 (p = 0.9165).
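The coefficient of a two-level factor is the expected difference between the two groups when all other explanatory variables are held equal. The sketch below illustrates this on simulated data. The data frame toy and its coefficients are invented for illustration and are not the apartments data.

```r
# Toy data (invented for illustration; not the apartments data)
set.seed(1)
toy <- data.frame(
  Age      = round(runif(40, 1, 40)),
  Distance = round(runif(40, 1, 30)),
  Parking  = factor(sample(c("No", "Yes"), 40, replace = TRUE),
                    levels = c("No", "Yes"))
)
toy$Price <- 2400 - 8 * toy$Age - 22 * toy$Distance +
  130 * (toy$Parking == "Yes") + rnorm(40, sd = 50)

fit <- lm(Price ~ Age + Distance + Parking, data = toy)

# Two apartments identical in Age and Distance, differing only in Parking
new  <- data.frame(Age = 10, Distance = 5,
                   Parking = factor(c("No", "Yes"), levels = c("No", "Yes")))
pred <- predict(fit, newdata = new)

# The gap between the predictions equals the ParkingYes coefficient
c(diff = unname(pred[2] - pred[1]), coef = unname(coef(fit)["ParkingYes"]))
```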

Next we check the F-test of the regression, whose null hypothesis states that the population coefficient of determination is equal to 0, which would mean that all partial regression coefficients are equal to 0.

  • H0: ρ² = 0
  • H1: ρ² > 0

Here F = 24.08 with p < 0.001, so we can reject the null hypothesis. The coefficient of determination of the population is greater than 0, which means that at least part of the variability of the dependent variable is explained by the linear influence of the explanatory variables.

o) Save fitted values and calculate the residual for apartment ID2.

mydata3$StdFittedValues <- fitted.values(fit3)
mydata3$StdResid <- residuals(fit3)
head(mydata3[ , colnames(mydata3) %in% c("ID", "Price", "StdFittedValues",
"StdResid")]) #Saving fitted values and calculating the residual for ID2
##   Price   StdResid StdFittedValues
## 1  1640  -88.64095        1728.641
## 2  2800  443.40256        2356.597
## 3  1660  -62.60903        1722.609
## 4  1850  310.68782        1539.312
## 5  1640 -349.28625        1989.286
## 6  1770 -142.65528        1912.655

Based on the estimated regression function, we would expect the price per m2 for this apartment to be 2356.60 EUR. The actual price was 2800 EUR, so the residual is 443.40 EUR.
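The residual is the observed value minus the fitted value. For the second apartment, with both values in EUR per m2 as in the table above:

```latex
e_2 = y_2 - \hat{y}_2 = 2800 - 2356.597 = 443.403
```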