Task 1

Task 1.1

library(readxl)
mydata <- read_xlsx("./ObesityDataSet.xlsx")

mydata <- as.data.frame(mydata)

head(mydata)
##   Age Gender Height Weight    Alcohol CaloricFood MealsDay Smoke Activity        ObesityLevel
## 1  21      2   1.62   64.0         no          no        3    no        0       Normal_Weight
## 2  21      2   1.52   56.0  Sometimes          no        3   yes        3       Normal_Weight
## 3  23      1   1.80   77.0 Frequently          no        3    no        2       Normal_Weight
## 4  27      1   1.80   87.0 Frequently          no        3    no        2  Overweight_Level_I
## 5  22      1   1.78   89.8  Sometimes          no        1    no        0 Overweight_Level_II
## 6  29      1   1.62   53.0  Sometimes         yes        3    no        0       Normal_Weight

This dataset includes data for the estimation of obesity level in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.

Variables:

-Age: age of participant

-Gender: gender of participant (1=Male, 2=Female)

-Height: height of participant in meters

-Weight: weight of participant in kgs

-Alcohol: frequency of consuming alcohol (no, Sometimes, Frequently, Always)

-CaloricFood: whether they eat caloric food every day (yes, no)

-MealsDay: number of meals in a typical day

-Smoke: whether they smoke (yes, no)

-Activity: number of days they have a physical activity per typical week

-ObesityLevel: obesity level (Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III )

Task 1.2

# Creating factor for variable Gender 

mydata$GenderFactor <- factor(mydata$Gender, 
                             levels = c(1, 2), 
                             labels = c("Male", "Female"))

str (mydata)
## 'data.frame':    498 obs. of  11 variables:
##  $ Age         : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Gender      : num  2 2 1 1 1 1 2 1 1 1 ...
##  $ Height      : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight      : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ Alcohol     : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ CaloricFood : chr  "no" "no" "no" "no" ...
##  $ MealsDay    : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ Smoke       : chr  "no" "yes" "no" "no" ...
##  $ Activity    : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ ObesityLevel: chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
##  $ GenderFactor: Factor w/ 2 levels "Male","Female": 2 2 1 1 1 1 2 1 1 1 ...
# Including only first four columns in a new data frame

mydata2 <- mydata[ , c(1,2,3,4)]
head(mydata2)
##   Age Gender Height Weight
## 1  21      2   1.62   64.0
## 2  21      2   1.52   56.0
## 3  23      1   1.80   77.0
## 4  27      1   1.80   87.0
## 5  22      1   1.78   89.8
## 6  29      1   1.62   53.0
# Renaming variable ObesityLevel into ObesityLvl, and showing first 6 rows of last 3 columns

names(mydata)[10] <- "ObesityLvl"

head(mydata[, c(9, 10, 11)])        
##   Activity          ObesityLvl GenderFactor
## 1        0       Normal_Weight       Female
## 2        3       Normal_Weight       Female
## 3        2       Normal_Weight         Male
## 4        2  Overweight_Level_I         Male
## 5        0 Overweight_Level_II         Male
## 6        0       Normal_Weight         Male
# Ordering data by age in ascending order

head(mydata[order(mydata$Age), ], 10)
##     Age Gender Height Weight   Alcohol CaloricFood MealsDay Smoke Activity          ObesityLvl
## 416  14      1   1.71     72        no         yes        3    no        2       Normal_Weight
## 117  15      2   1.65     86        no         yes        3    no        3      Obesity_Type_I
## 277  16      1   1.67     50        no         yes        1    no        1 Insufficient_Weight
## 291  16      1   1.82     71 Sometimes         yes        3    no        2       Normal_Weight
## 296  16      2   1.66     58        no          no        1    no        0       Normal_Weight
## 303  16      1   1.84     45 Sometimes         yes        3    no        3 Insufficient_Weight
## 304  16      2   1.57     49 Sometimes         yes        4    no        0       Normal_Weight
## 310  16      2   1.66     58        no          no        1    no        0       Normal_Weight
## 313  16      2   1.60     57        no         yes        3    no        3       Normal_Weight
## 386  16      2   1.55     45        no         yes        3    no        1       Normal_Weight
##     GenderFactor
## 416         Male
## 117       Female
## 277         Male
## 291         Male
## 296       Female
## 303         Male
## 304       Female
## 310       Female
## 313       Female
## 386       Female

Since I did not have any missing values in my data set, I did not use functions drop_na and replace_with_na.

Task 1.3

# Descriptive statistics for Age, Height, Weight

summary(mydata[ , c(1,3,4)])
##       Age            Height          Weight      
##  Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  1st Qu.:19.00   1st Qu.:1.613   1st Qu.: 58.00  
##  Median :21.00   Median :1.680   Median : 67.00  
##  Mean   :23.15   Mean   :1.686   Mean   : 69.57  
##  3rd Qu.:24.00   3rd Qu.:1.750   3rd Qu.: 80.00  
##  Max.   :61.00   Max.   :1.980   Max.   :173.00

-MEAN: The average age of participants is 23.15, the average height of participants is 1.686m, the average weight of participants is 69.57kg.

-MEDIAN: Half of the participants are up to 21 years old, the others are older. Half of the participants are up to 1.680m tall, the others are taller. Half of the participants weigh up to 67kg, the others weigh more.

-MINIMUM: The youngest participant is 14 years old. The shortest participant is 1.450m tall. The lightest participant weighs 39kg.

-Q1: 25% of participants are up to 19 years old, the others are older. 25% of the participants are up to 1.613m tall, the others are higher. 25% of the participants weigh up to 58kg, the others weigh more.

Task 1.4

library(ggplot2)
hist(mydata$Age, 
 ylab = "Frequency",
 xlab = "Age",
 main = "Distribution of age",
 col = "skyblue",
 breaks = seq(from = 10, to = 70, by = 10))

Most participants are between 20 and 30 years old.

library(car)
## Loading required package: carData
scatterplot(y = mydata$MealsDay, 
            x = mydata$Activity, 
            ylab = "Meals per day", 
            xlab = "Active days per week", 
            smooth = FALSE)

Relationship between variables is positive, meaning that, on average, the more active the participant is, the more meals per day he/she eats.

library(ggplot2) 
ggplot(mydata, aes(y = Height, x = GenderFactor)) + 
 geom_boxplot()

On average, males are taller than females.

Half of the males are up to 1.75m tall, the others are taller.

75% of females are up to 1.65m tall, the others are taller.

I would not characterize the dots as outliers, because it is possible that a male is almost 2m tall and a female 1.8m tall.

Task 2

library(readxl)
task2 <- read_xlsx("C:/Users/Tamara/Desktop/R data/R Take Home Exam 2024/Task 2/BusinessSchool.xlsx")

task2 <- as.data.frame(task2)

head(task2)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before)
## 1          1         Business            68.4      90.2              No                    252
## 2          2 Computer Science            70.2      68.7             Yes                    101
## 3          3          Finance            76.4      83.3              No                    401
## 4          4         Business            82.6      88.7              No                    287
## 5          5          Finance            76.9      75.4              No                    275
## 6          6 Computer Science            83.3      82.1              No                    254
##   Employability (After) Status Annual Salary
## 1                   276 Placed        111000
## 2                   119 Placed        107000
## 3                   462 Placed        109000
## 4                   342 Placed        148000
## 5                   347 Placed        255500
## 6                   313 Placed        103500

Task 2.1

library(ggplot2) 
ggplot(task2, aes(x = `Undergrad Degree`)) +
  geom_bar(colour = "deeppink2", fill="thistle1") +
  labs(title = "Distribution of Undergraduate Degrees", 
       x = "Undergraduate Degrees", 
       y = "Frequency")

The most common undergraduate degree is business. There is approximately 35 business graduates.

Task 2.2

library(psych) 
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(task2$`Annual Salary`) 
##    vars   n   mean       sd median  trimmed     mad   min    max  range skew kurtosis      se
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22     9.41 4150.15
library(ggplot2) 
ggplot(task2, aes(x = `Annual Salary` )) +
 geom_histogram(colour="seagreen", fill="lawngreen")+ 
 labs(y="Frequency", x="Annual Salary", title="Annual Salary Distribution")+
  scale_x_continuous(breaks = seq(20000, 340000, 64000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is right-skewed, meaning that the lower salaries are more frequent compared to the higher ones; it is also unimodal.

Task 2.3

t.test(task2$`MBA Grade`,
       mu=74,
       alternative="two.sided")
## 
##  One Sample t-test
## 
## data:  task2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

We can reject null hypothesis at p=0.009. We can conclude that the average grade is different from 74. We can be 95% confident that the true average grade is between 74.52 and 77.56.

# Effect size
effectsize::cohens_d(task2$`MBA Grade`, mu=74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.
effectsize::interpret_cohens_d(0.27, rules="sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)

The effect size is small.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
task3 <- read_xlsx("R Take Home Exam 2024/Task 3/Apartments.xlsx")

task3 <- as.data.frame(task3)

head(task3)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

task3$Parking <- factor(task3$Parking, 
                           levels = c(0, 1), 
                           labels = c("No", "Yes"))

task3$Balcony <- factor(task3$Balcony, 
                           levels = c(0, 1), 
                           labels = c("No", "Yes"))

head(task3)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No
## 4  28       29  1850      No     Yes
## 5  18       18  1640     Yes     Yes
## 6  28       12  1770      No     Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(task3$Price,
       mu=1900,
       alternative="two.sided")
## 
##  One Sample t-test
## 
## data:  task3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject null hypothesis at p=0.005. We can conclude that the average price is different from 1900. We can be 95% confident that the true average price is between 1937.443 and 2100.440.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, 
 data = task3)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = task3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

-Regression coefficient: If age of the apartment increases by 1 year, the price per m2 decreases by 8.975 eur on average (p=0.034)

-Coefficient of determination: 5.30% of variability in price per m2 is affected by linear effect of age of the apartment.

-Intercept: If age of the apartment is 0, you would expect the price of 2185.45 eur per m2.

cor(task3$Price, task3$Age)
## [1] -0.230255

-Pearson correlation coefficient: Since the value is between 0.1 and 0.3 in absolute terms, the linear relationship between price per m2 and age of the apartment is weak. Since the value is negative, the relationship is negative.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(task3[c("Price", "Age", "Distance")], 
 smooth = FALSE) 

Based on the matrix, I believe that there is not a significant problem with multicolinearity, since the values are dispersed and do not tightly follow the line in the graphs showing relationship between age and distance (row 2, column 3). The line is also closer to horizontal than to steep, again indicating that correlation is not strong.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
 data = task3)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = task3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Since the VIF statistics’ value is lower than 5 and its average is close to 1, the multicolinearity is not an issue and we should not remove any of the variables.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

task3$StdResid <- round(rstandard(fit2), 3)
task3$CooksD <- round(cooks.distance(fit2), 3) 
hist(task3$StdResid, 
 xlab = "Standardized residuals", 
 ylab = "Frequency", 
 main = "Histogram of standardized residuals")

head(task3[order(-task3$StdResid),],5)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 33   2       11  2790     Yes      No    2.051  0.069
## 2   18        1  2800     Yes      No    1.783  0.030
## 61  18        1  2800     Yes     Yes    1.783  0.030
## 58   8        2  2820     Yes      No    1.655  0.037
head(task3[order(task3$StdResid),],5)
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
## 20  13        8  1800      No      No   -1.381  0.012
## 35  14       16  1660      No     Yes   -1.261  0.008

There aren’t any outliers (no units with standardized residuals bigger than 3 in absolute value).

hist(task3$CooksD, 
 xlab = "Cooks distance", 
 ylab = "Frequency", 
 main = "Histogram of Cooks distance")

There is a large gap between 0.15 and 0.30, indicating that the units appearing after this gap are possibly units with high impact.

head(task3[order(-task3$CooksD),])
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038
summary(task3[ ,c(1,2,3)])
##       Age           Distance         Price     
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400  
##  1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:1710  
##  Median :18.00   Median :12.00   Median :1950  
##  Mean   :18.55   Mean   :14.22   Mean   :2019  
##  3rd Qu.:24.00   3rd Qu.:20.00   3rd Qu.:2290  
##  Max.   :45.00   Max.   :45.00   Max.   :2820

Since the value of the row 38 has a large value of Cooks distance, it might be a high impact value. Moreover, this unit has the combination of the largest possible distance from the city centre (45 km) and higher than average price, which is also strange. Therefore, I will consider that this unit is with high impact and remove it.

task3NEW <- task3[-38, ]
# Repeating to check

hist(task3NEW$CooksD, 
 xlab = "Cooks distance", 
 ylab = "Frequency", 
 main = "Histogram of Cooks distance")

There is still a relatively large gap between 0.07 and 0.10.

head(task3NEW[order(-task3NEW$CooksD),])
##    Age Distance Price Parking Balcony StdResid CooksD
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038
## 58   8        2  2820     Yes      No    1.655  0.037
summary(task3NEW[ ,c(1,2,3)])
##       Age           Distance         Price     
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400  
##  1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:1710  
##  Median :18.00   Median :12.00   Median :1930  
##  Mean   :18.71   Mean   :13.86   Mean   :2017  
##  3rd Qu.:24.25   3rd Qu.:19.25   3rd Qu.:2290  
##  Max.   :45.00   Max.   :40.00   Max.   :2820

The value of the row 55 now has the largest value of Cooks distance. However, I do not believe it has a strange combination of values: it is almost the oldest apartment, it is relatively far away from the city centre compared to others, and its price is a little bit above the minimum (maybe due to its size). Thus, I do not consider it as a high impact unit and I will not remove it.

Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance, data  = task3NEW)

task3NEW$StdResid <- round(rstandard(fit2), 3)
task3NEW$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = task3NEW$StdResid, x = task3NEW$StdFitted,
 ylab = "Standardized residuals",
 xlab = "Standardized fitted values",
 boxplots = FALSE,
 regLine = FALSE,
 smooth = FALSE)

As I do not see any curves evidently, I would conclude that there is no problem with non-linearity, but there might be a problem with heteroskedasticity because variance seems to be a bit higher on the right side of the graph. I will check for heteroskedasticity in the following chunk.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

We can’t reject the null hypothesis (p=0.08 which is >0.05). Therefore, we can assume homoskedasticity.

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

task3NEW$StdResid <- round(rstandard(fit2), 3)
task3NEW$CooksD <- round(cooks.distance(fit2), 3) 
hist(task3NEW$StdResid, 
 xlab = "Standardized residuals", 
 ylab = "Frequency", 
 main = "Histogram of standardized residuals")

shapiro.test(task3NEW$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  task3NEW$StdResid
## W = 0.95649, p-value = 0.006355

The graph seems to be slightly right-skewed. Also, based on Shapiro-Wilk test, we can reject the null hypothesis (p=0.006) and conclude that the standardized residuals are not distributed normally. However, our sample size is large (>30), so this should not be a problem.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance,
 data = task3NEW)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = task3NEW)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

-Regression coefficient:

  1. If age of the apartment increases by 1 year, the price per m2 decreases by 6.46 eur on average (p=0.044), assuming distance is constant.

  2. If distance from the city centre increases by 1km, the price per m2 decreases by 22.96 eur on average (p<0.001), assuming age is constant.

-Coefficient of determination: 48.38% of variability in price per m2 is affected by linear effect of age and distance.

-Intercept: If age and distance of the apartment from the city centre are both 0, you would expect the price of 2456.076 eur per m2.

sqrt(summary(fit2)$r.squared)
## [1] 0.6955609

-Multiple correlation coefficient: Since the value is between 0.3 and 0.7 in absolute terms, the linear relationship of price with age and distance is semi-strong.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(task3NEW[ ,c(1,2,3)]))
##            Age Distance Price
## Age       1.00     0.09 -0.23
## Distance  0.09     1.00 -0.68
## Price    -0.23    -0.68  1.00
## 
## n= 84 
## 
## 
## P
##          Age    Distance Price 
## Age             0.3963   0.0387
## Distance 0.3963          0.0000
## Price    0.0387 0.0000

-Pearson correlation coefficient:

There is a weak negative linear relationship between price and age.

There is a semi-strong negative linear relationship between price and distance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, 
 data = task3NEW)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3) 
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis (p=0.03). We can conclude, with 95% confidence, that fit3 fits data better than fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = task3NEW)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Given the values of the other independent variables, the apartments with parking are on average by 167.53 eur per m2 more expensive in comparison to the apartments without parking (p=0.009).

Since the p-value for the variable Balcony is not significant, it does not make sense to interpret it. There is not enough evidence to conclude that the presence of balcony has a meaningful effect on price.

-F-statistic:

H0:βAge=βDistance=βParkingYes=βBalconyYes=0

H1:At least one βj is different from 0.

We can conclude that at least one of the independent variables significantly affects price per m2 (p<0.001).

Save fitted values and calculate the residual for apartment ID2.

task3NEW$FittedValues <- fitted(fit2)
task3NEW$Residuals <- residuals(fit2)
residual_ID2 <- task3NEW$Residuals[2]
print(residual_ID2)
## [1] 483.2291