Task 1

data(package = .packages(all.available = TRUE))
Air_Quality<-force(airquality)

1. Explain the dataset

head(Air_Quality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

I am using the dataset containing the details of daily air quality measurements in New York, May to September 1973.

This dataset contains 153 observations on 6 variables:

  • Ozone: Ozone concentration (ppb)
  • Solar.R: Solar Radiation (lang)
  • Wind: Wind speed (mph)
  • Temp: Temperature (degrees F)
  • Month: Month (5 = May, 6 = June, 7 = July, 8 = August, 9 = September)
  • Day: Day of month (1-31)

2. Perform some data manipulations (create new variable, delete some units due to missing data, rename variables, create new data.frame based on conditions, etc.).

Air_Quality$Month <- factor(Air_Quality$Month,
                         levels = c (5,6,7,8,9),
                         labels = c ("May","Jun","Jul","Aug","Sep"))

Air_Quality$Day <- factor(Air_Quality$Day,
                         levels = c (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31),
                         labels = c (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31))
                         
              
                         
head(Air_Quality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67   May   1
## 2    36     118  8.0   72   May   2
## 3    12     149 12.6   74   May   3
## 4    18     313 11.5   62   May   4
## 5    NA      NA 14.3   56   May   5
## 6    28      NA 14.9   66   May   6
# removing units with missing data

library(tidyr)
Air_Quality_clean <- drop_na(Air_Quality)

# renaming variable

colnames(Air_Quality_clean)[4]<- "Temp_in_F"

# create new variable

Air_Quality_clean$Solar_Temp_Ratio <- Air_Quality_clean$Solar.R / Air_Quality_clean$Temp_in_F

# create new data.frame

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Air_Quality_clean_July <- Air_Quality_clean %>%
  filter(Month == "Jul")
head(Air_Quality_clean_July)
##   Ozone Solar.R Wind Temp_in_F Month Day Solar_Temp_Ratio
## 1   135     269  4.1        84   Jul   1         3.202381
## 2    49     248  9.2        85   Jul   2         2.917647
## 3    32     236  9.2        81   Jul   3         2.913580
## 4    64     175  4.6        83   Jul   5         2.108434
## 5    40     314 10.9        83   Jul   6         3.783133
## 6    77     276  5.1        88   Jul   7         3.136364

3. Present the descriptive statistics for the selected variables and explain at least 3 sample statistics

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
round(stat.desc(Air_Quality_clean[ , c (-5, -6) ]), 2)
##                Ozone  Solar.R    Wind Temp_in_F Solar_Temp_Ratio
## nbr.val       111.00   111.00  111.00    111.00           111.00
## nbr.null        0.00     0.00    0.00      0.00             0.00
## nbr.na          0.00     0.00    0.00      0.00             0.00
## min             1.00     7.00    2.30     57.00             0.09
## max           168.00   334.00   20.70     97.00             5.22
## range         167.00   327.00   18.40     40.00             5.12
## sum          4673.00 20513.00 1103.30   8635.00           262.68
## median         31.00   207.00    9.70     79.00             2.48
## mean           42.10   184.80    9.94     77.79             2.37
## SE.mean         3.16     8.65    0.34      0.90             0.12
## CI.mean.0.95    6.26    17.15    0.67      1.79             0.23
## var          1107.29  8308.74   12.66     90.82             1.47
## std.dev        33.28    91.15    3.56      9.53             1.21
## coef.var        0.79     0.49    0.36      0.12             0.51
mean(Air_Quality_clean$Wind)
## [1] 9.93964

The average wind speed measured in May, June, July, August and September in New York in 1973 was 9.94 mph.

median(Air_Quality_clean$Temp_in_F)
## [1] 79

The median temperature was 79 degrees F - 50% of the days were warmer than 79 degrees F and 50% were colder.

min(Air_Quality_clean_July$Temp_in_F)
## [1] 73
max(Air_Quality_clean_July$Temp_in_F)
## [1] 92

The lowest temperature measured in July was 73 degrees F. The highest was 92 degrees F.

library(ggplot2)
ggplot(Air_Quality_clean, aes(x = Month)) +
  geom_bar() +
  ylab("Frequency") +
  xlab("Month") 

After removing the days that had some missing data, I have decided to check how many days had complete data and were included in the clean data.The graph shows that the least represented month in the clean data is June. Most values are included from September.

hist(Air_Quality_clean$Ozone, main="Histogram of Ozone Levels", xlab="Ozone (ppb)")

In above histogram I wanted to see how the ozone was distributed in the measurements. We can see that the ozone level distribution is skewed to the right. On most days the ozone levels were up to 50 ppb.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplot(y = Air_Quality_clean$Solar.R, 
            x = Air_Quality_clean$Temp_in_F, 
            ylab = "Solar radiation", 
            xlab = "Temperature", 
            smooth = FALSE,
            boxplots = FALSE)

With this scatter plot I wanted to check if there is any correlation between the measured temperature and the solar radiation. I assumed that on the hotter days the solar radiation would also be higher (essentially, that the temperature would be higher on sunny days). From the graph, I can see that there seems to be a positive correlation between both variables like I expected.

ggplot(Air_Quality_clean, aes(x = Month, y = Temp_in_F)) +
  geom_boxplot() +
  labs(title = "Temperature by Month", 
       x = "Month", 
       y = "Temperature in F")

With the box plots I tried to check how the temperature varied across different months. As expected, the temperatures were highest in the summer months (June, July and August). We can also see that in July, the temperature were most constant.

Task 2

1. Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?

library(readxl)
MBA <- read_xlsx("./Business School.xlsx")

MBA <- as.data.frame(MBA)
head(MBA)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500
colnames(MBA)[2]<- "Undergrad_Degree" 
colnames(MBA)[3]<- "Undergrad_Grade" 
colnames(MBA)[4]<- "MBA_Grade" 
colnames(MBA)[5]<- "Work_Experience" 
colnames(MBA)[9]<- "Annual_Salary"
MBA$Undergrad_Degree <- factor(MBA$Undergrad_Degree,
                                          levels = c("Art", "Business", "Computer Science", "Engineering", "Finance"),
                                          labels = c("Art", "Business", "Computer Science", "Engineering", "Finance"))
library(ggplot2)
ggplot(MBA, aes(x=Undergrad_Degree)) + 
  geom_bar() + 
  ylab("Frequency") +
xlab("Undergraduate degree")

The most common undergraduate degree is a Business degree.

2. Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution.

library(pastecs)
round(stat.desc(MBA$Annual_Salary),0)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##          100            0            0        20000       340000       320000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     10905800       103500       109058         4150         8235   1722373475 
##      std.dev     coef.var 
##        41501            0
options(scipen = 999)
library(ggplot2)
ggplot(MBA, aes(x=Annual_Salary)) + 
  geom_histogram(col="white", binwidth = 10000) + 
  ylab("Frequency") + 
  xlab("Annual Salary")

The distribution seems to be skewed to the right. To check the distribution we could do the Shapiro-Wilks test. There also seem to be some outliers in our distribution - we need to check if they can be removed, which could make the distribution more normal.

3. Test the following hypothesis: 𝐻0:𝜇MBA Grade=74. Explain the result and interpret the effect size.

mean(MBA$MBA_Grade)
## [1] 76.04055
t.test(MBA$MBA_Grade,
       mu = 74,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  MBA$MBA_Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

H0: μ equals 74

H1: μ does not equal 74

Based on the p value (p=0.009 < 0.05) we reject the null hypothesis, the true mean of MBA Grade does not equal 74. Based on the 95% confidence interval, we can also see that the true mean is most likely between 74.5 and 77.6.

library(effectsize)
cohens_d(MBA$MBA_Grade, mu = 74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

The effect size is 0.27 and according to theory this is a small effect size. This means that the difference between the sample mean and tested mean (74) is 0.27 standard deviations.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_xlsx("./Apartments.xlsx")

Apartments <- as.data.frame(Apartments)
head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

We have categorical values Parking and Balcony:

Apartments$Balcony <- factor(Apartments$Balcony,
                         levels = c (0,1),
                         labels = c ("No","Yes"))

Apartments$Parking <- factor(Apartments$Parking,
                         levels = c (0,1),
                         labels = c ("No", "Yes"))
                         
              
                         
head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No
## 4  28       29  1850      No     Yes
## 5  18       18  1640     Yes     Yes
## 6  28       12  1770      No     Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

calculating sample mean:

mean(Apartments$Price)
## [1] 2018.941

H0: Mu_Price = 1900 eur

H1: Mu_Price does not equal 1900 eur

t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject the null hypothesis at p=0,004731 - the true mean does not equal 1900 EUR per m2. As per the 95% confidence interval, the true mean is somewhere between 1937 and 2101 EUR per m2.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255

Regression coefficient: If the age of the apartment increases by 1 year, the price of the apartment falls on average by 8.975 EUR per m2 (p = 0.034).

Coefficient of determination: 5.3% of the variability of price is explained by the linear effect of the age of the apartment.

Coefficient of correlation: Coefficient of correlation can be calculated as the square root of R-squared. In this care we have a weak correlation. Since this is not a multiple correlation coefficient (because we are checking the correlation between only two variables), we can also see if the correlation is positive or negative.

cor(Apartments$Price, Apartments$Age)
## [1] -0.230255

So in this case we can observe a weak negative correlation between the price and the age of the apartment.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
                  smooth = FALSE) 

From the graphs it seems that there might a strong connection between distance and price. Since price is a dependent variable, this does not point to a problem of multicolinearity. There does not seem to be a strong correlation between the two independent variables (Distance and Age), so I am assuming that multicolinearity will not be a problem in this model. We could check this more closely with VIF statistic.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Check the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

To say that we do not have a problem with multicolinearity, the VIF statistic for each variable should be equal or below 5 and the average VIF should be around one. The calculated results confirm that multicolinearity is not too strong in our case.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3) # Calculating standardized residuals

Apartments$CooksD <- round(cooks.distance(fit2), 3) #Calculating cooks distances
hist(Apartments$StdResid,
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

We can see that there are no standardized errors that have a value larger than -3 or smaller than 3. So there seems to be no need to remove any of the data.

We can do the Shapiro-Wilk test to check if the standardized residuals are distributed normally.

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

Based on the results, we reject the H0 (p = 0.003645), which means that we assume the standardized errors are not distributed normally. From the theory we know, that one of the assumptions for regression analysis is that the errors are distributed normally, which is not true in our case. However, since we have a large sample (n>30) we can assume t distribution.

hist(Apartments$CooksD,
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

I can see that we have at least one value with high impact that I will need to remove. I’d like to see how the cooks distances look in my data so I will ask R to display the 6 values with the highest Cooks distances.

head(Apartments[order(-Apartments$CooksD),], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038

I can see that at least the largest two values should be removed:

library(dplyr)
Apartments <- Apartments %>%
  filter(!row_number() %in% c(38, 55))

head(Apartments[order(-Apartments$CooksD), ], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 33   2       11  2790     Yes      No    2.051  0.069
## 52   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 38  40        2  2400      No     Yes    1.091  0.038
## 56   8        2  2820     Yes      No    1.655  0.037
## 25   8       26  2300     Yes     Yes    1.571  0.034

Let’s check the model now with the values removed:

fit2_E <- lm(Price ~ Age + Distance,
           data = Apartments)

summary(fit2_E)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684 < 0.0000000000000002 ***
## Age           -7.850      3.244  -2.420               0.0178 *  
## Distance     -23.945      2.826  -8.473    0.000000000000953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 0.000000000001173

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

I will check this for my new model, that already has values of high impact removed (fit2_E).

Apartments$StdFitted <- scale(fit2_E$fitted.values)

scatterplot(y = Apartments$StdResid, x = Apartments$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From the scatterplot it seems that we might have a problem with heteroskedasitcity. I will check this with the Breusch Pagan test.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2_E)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

Based on the p value (p>0.05) we cannot reject H0, meaning we assume homoskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

I will also check this for my new model:

Apartments$StdResid_f2 <- round(rstandard(fit2_E), 3) 
Apartments$CooksD_f2 <- round(cooks.distance(fit2_E), 3) 

hist(Apartments$StdResid_f2, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid_f2)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid_f2
## W = 0.95952, p-value = 0.01044

For formal testing I used the Skapiro Wilk test.

H0: errors are distributed normally.

H1: errors are not distributed normally.

Based on the p value (p = 0.01) we reject the null hypothesis. However, because we have a large sample (n >> 30) we can assume t distribution of errors.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

I have already estimated this above with the model fit2_E. A summary of my new model:

summary(fit2_E)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684 < 0.0000000000000002 ***
## Age           -7.850      3.244  -2.420               0.0178 *  
## Distance     -23.945      2.826  -8.473    0.000000000000953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 0.000000000001173

The regression coefficients:

  • If the age of the apartment increases by one year, the price of the apartment decreases on average by 7.850 EUR per m2 (p = 0.0178).
  • If the distance of the apartment increases by 1 km, the price of the apartment decreases on average by 23.915 EUR per m2 (p<0.001).

Multiple R-squared = coefficient of determination:

49.68% of variability of the apartment price can be explained by the linear effect of distance and age.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = Apartments)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185 < 0.0000000000000002 ***
## Age           -7.197      3.148  -2.286              0.02499 *  
## Distance     -21.241      2.911  -7.296       0.000000000214 ***
## ParkingYes   168.921     62.166   2.717              0.00811 ** 
## BalconyYes    -6.985     58.745  -0.119              0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 0.000000000001449

With function anova check if model fit3 fits data better than model fit2.

anova(fit2_E, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: ∆ro^2=0

H1: ∆ro^2>0

Based on the Pr(>F) = 0.028 (<0.05) we can reject the null hypothesis. We therefore say that the second model (fit3) fits the data statistically better.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185 < 0.0000000000000002 ***
## Age           -7.197      3.148  -2.286              0.02499 *  
## Distance     -21.241      2.911  -7.296       0.000000000214 ***
## ParkingYes   168.921     62.166   2.717              0.00811 ** 
## BalconyYes    -6.985     58.745  -0.119              0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 0.000000000001449

Given the Age, Distance and presence of Balcony in the apartment, the apartments with a parking space have on average higher price by 168.921 EUR per m2 than the apartments without the parking space (p = 0.008).

As the coefficient for balcony is not statistically significant, I will not interpret it. We cannot say that the balcony has an effect on the price of the apartment.

F statistics:

H0: ro^2 = 0

H1: ro^2 > 0

We can reject the null hypothesis (p<0.001). The coefficient of determination of the population is above 0. At least some of the variability of the dependent variable can be explained by the liner influence f the explanatory variables.

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted_3 <- fitted.values(fit3)
Apartments$Residual_3 <- residuals(fit3)
Apartments[2, "Residual_3"]
## [1] 422.9572