data()

NikaBoh

Task1

Source:

“Carl Hoffstedt. This differs from the dataset Highway in the alr4 package only by addition of transformation of some of the columns.”

References:

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

Weisberg, S. (2014) Applied Linear Regression, Fourth Edition, Wiley, Section 7.2.

#install.packages("carData")
library(carData)
data("Highway1")              #Show the table and explain the data set
head(Highway1)
##   rate   len adt trks      sigs1 slim shld lane acpt  itg lwid htype
## 1 4.58  4.99  69    8 0.20040080   55   10    8  4.6 1.20   12   FAI
## 2 2.86 16.11  73    8 0.06207325   60   10    4  4.4 1.43   12   FAI
## 3 3.02  9.75  49   10 0.10256410   60   10    4  4.7 1.54   12   FAI
## 4 2.29 10.65  61   13 0.09389671   65   10    6  3.8 0.94   12   FAI
## 5 1.61 20.01  28   12 0.04997501   70   10    4  2.2 0.65   12   FAI
## 6 6.87  5.97  30    6 2.00750419   55   10    4 24.8 0.34   12    PA

Explanation of data set:

This is a data set from an unpublished master’s paper by Carl Hoffstedt. The data is about the automobile accident rate, measured in accidents per million vehicle miles and with 11 other variables. There are 39 units of observation (39 sections of large highways in the state of Minnesota, in 1973). - Rate: Number of automobile accidents per million vehicle miles. - Len: The length of the Highway1, in miles. - Adt: Means the average daily traffic count, in thousands. - Trks: It is a truck volume as a percent of the total volume. - Sigs1: It is a number of signals per mile of a roadway (adjusted to have non-zero values) - Slim: Shows the speed limit in the year 1973. - Shld: The width in feet, of outer shoulder on the roadway. - Lane: The total number of lanes of traffic. - Acpt: Number of access points per mile. - Itg: Number of freeway-type interchanges per mile - Iwid: The lane width, in feet. - Htype: It tells the type of roadway or the source of funding for the road (either MC, FAI, PA, or MA). - it is a categorical variable

mydata <- data.frame(Highway1)  
colnames(Highway1) <- c("Rate", "Length", "AvgTrafficCount", "TruckVolume%", "NrSignals", "SpeedLimit", "RoadWidth", "NrLanes", "AccessPoints", "NrInterchanges", "LaneWidth", "RoadType")

head(Highway1)    #Change of names of the variables
##   Rate Length AvgTrafficCount TruckVolume%  NrSignals SpeedLimit RoadWidth
## 1 4.58   4.99              69            8 0.20040080         55        10
## 2 2.86  16.11              73            8 0.06207325         60        10
## 3 3.02   9.75              49           10 0.10256410         60        10
## 4 2.29  10.65              61           13 0.09389671         65        10
## 5 1.61  20.01              28           12 0.04997501         70        10
## 6 6.87   5.97              30            6 2.00750419         55        10
##   NrLanes AccessPoints NrInterchanges LaneWidth RoadType
## 1       8          4.6           1.20        12      FAI
## 2       4          4.4           1.43        12      FAI
## 3       4          4.7           1.54        12      FAI
## 4       6          3.8           0.94        12      FAI
## 5       4          2.2           0.65        12      FAI
## 6       4         24.8           0.34        12       PA
Highway1$Speed <- seq(50, 80, length.out = 39)  #The new variable is created with the data made up.
head(Highway1)
##   Rate Length AvgTrafficCount TruckVolume%  NrSignals SpeedLimit RoadWidth
## 1 4.58   4.99              69            8 0.20040080         55        10
## 2 2.86  16.11              73            8 0.06207325         60        10
## 3 3.02   9.75              49           10 0.10256410         60        10
## 4 2.29  10.65              61           13 0.09389671         65        10
## 5 1.61  20.01              28           12 0.04997501         70        10
## 6 6.87   5.97              30            6 2.00750419         55        10
##   NrLanes AccessPoints NrInterchanges LaneWidth RoadType    Speed
## 1       8          4.6           1.20        12      FAI 50.00000
## 2       4          4.4           1.43        12      FAI 50.78947
## 3       4          4.7           1.54        12      FAI 51.57895
## 4       6          3.8           0.94        12      FAI 52.36842
## 5       4          2.2           0.65        12      FAI 53.15789
## 6       4         24.8           0.34        12       PA 53.94737
any_missing <- any(is.na(Highway1))   #There is no missing data as it shows "FALSE"
summary(Highway1[ , c(-5, -7, -9, -10, -11, -12)])  #Deleted some of the variables
##       Rate           Length       AvgTrafficCount  TruckVolume%      SpeedLimit
##  Min.   :1.610   Min.   : 2.960   Min.   : 1.00   Min.   : 6.000   Min.   :40  
##  1st Qu.:2.630   1st Qu.: 7.995   1st Qu.: 5.00   1st Qu.: 8.000   1st Qu.:50  
##  Median :3.050   Median :11.390   Median :13.00   Median : 9.000   Median :55  
##  Mean   :3.933   Mean   :12.884   Mean   :19.62   Mean   : 9.333   Mean   :55  
##  3rd Qu.:4.595   3rd Qu.:17.800   3rd Qu.:24.00   3rd Qu.:11.000   3rd Qu.:60  
##  Max.   :9.230   Max.   :40.090   Max.   :73.00   Max.   :15.000   Max.   :70  
##     NrLanes          Speed     
##  Min.   :2.000   Min.   :50.0  
##  1st Qu.:2.000   1st Qu.:57.5  
##  Median :2.000   Median :65.0  
##  Mean   :3.128   Mean   :65.0  
##  3rd Qu.:4.000   3rd Qu.:72.5  
##  Max.   :8.000   Max.   :80.0
mean(Highway1$Rate)
## [1] 3.933333
round(mean(Highway1$Rate), 2)                       #Round up the mean to 2 decimal numbers
## [1] 3.93

Presentation of descriptive statistics: - The average rate of automobile accidents (in the sample) is 3.93 accidents per million vehicle miles. - On average, the highest number of average traffic counts (measured in thousands) is 73. - 75% of all automobile accidents observed in the sample, had up to 4 lanes on the highway. - The lowest (minimal) length of the highway was 2.960 miles. - The half of all automobile accidents (observed in the sample) had the speed of less than 65 miles per hour and the other half had it above the 65 miles per hour.

library(ggplot2)
ggplot(Highway1, aes(x = Speed)) +
  geom_histogram(binwidth = 5, fill = "yellow", color = "black") +
  labs(x = "Speed", y = "Rate", title = "Distribution of accident rates around the speed")

Explanation of the histogram: - It shows a normal distribution (unimodal), without any extreme values or outliers.Mode is the most frequent value and here it represents the speed of 65km per hour.

library(car)

scatterplot(Highway1$Rate ~ Highway1$Speed,
            smooth = FALSE,
            boxplots = FALSE,
            ylab =  "Rate",
            xlab = "Speed")

Description: - It is a scatterplot. A horizontal line can mean that the rate of accidents remains constant regardless of changes in the speed variable (it looks a bit positive as well). The rate may not be affected by the variations in the speed. There is also the possibility of strong bias in data collection as I randomly put the values for variable speed in the table. We can also see the outliers that can be deleted from the data set if we do other tests to prove it, such as vif.

Task2

library(readr)
mydata1 <- as.data.frame(mydata)

data <- read.csv("~/Bootcamp_working/Task 2/Body mass.csv", sep = ";", dec=",", header = TRUE,col.names = c("ID", "Mass"))


head(data)                    #View of the first 6 rows of the data set
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
library(pastecs)
round(stat.desc(data[ , -1]), 2)   #Rounded up by two decimal points 
##      nbr.val     nbr.null       nbr.na          min          max        range 
##        50.00         0.00         0.00        49.70        83.20        33.50 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##      3143.80        62.80        62.88         0.85         1.71        36.14 
##      std.dev     coef.var 
##         6.01         0.10
hist(data$Mass,
  main = "Distribution of body mass",
  xlab = "Weight",
  ylab = "Frequency",
  col = "yellow",
  border = "black",
  breaks = seq(from = 0, to = 100, by = 10))

- The histogram shows the frequency of nine-graders’ weight measures.The graph is slightly asymmetrical to the right (positively skewed).

H0:𝜇 = 59.5kg H1:𝜇 =/= 59.5kg (We want to see the difference)

t.test(data$Mass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876
library(effectsize)
cohens_d(data$Mass, mu = 59.5)                       #Calculating the effect size
## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

The value is 0.56. We still need the interpretation from the source we used at the lecture - Sawilowsky form 2009.

interpret_cohens_d(0.56, rules = "sawilowsky2009")  
## [1] "medium"
## (Rules: sawilowsky2009)

The code shows us that the effect size (the increase of weight) is perceived as medium.

Task3

library(readxl)                                 #Importing the data set
Apartments <- read_excel("~/Bootcamp_working/Task 3/Apartments.xlsx")
View(Apartments)
head(Apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description: - Age: Age of an apartment in years - Distance: The distance from city center in km - Price: Price per m2 - Parking: 0-No, 1-Yes - Balcony: 0-No, 1-Yes

Apartments$ParkingF <- factor(Apartments$Parking,      #Change categorical variables into factors
                              levels = c(0, 1),
                              labels = c("No","Yes"))

Apartments$BalconyF <- factor(Apartments$Balcony,
                              levels = c(0, 1),
                         labels = c("No", "Yes"))

Apartments <- as.data.frame(Apartments)                #To number each unit
head(Apartments)
##   Age Distance Price Parking Balcony ParkingF BalconyF
## 1   7       28  1640       0       1       No      Yes
## 2  18        1  2800       1       0      Yes       No
## 3   7       28  1660       0       0       No       No
## 4  28       29  1850       0       1       No      Yes
## 5  18       18  1640       1       1      Yes      Yes
## 6  28       12  1770       0       1       No      Yes
t.test(Apartments$Price,              # t-test for H0 hypothesis
       mu = 1900,
       alternative= "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
library(car)
scatterplot(Apartments$Price ~ Apartments$Age,
            smooth = FALSE,
            boxplot = FALSE,
            ylab = "Price per m2 in EUR",
            xlab = "Age in years")

- The scatterplot shows a negative linear correlation between price of apartment per m2 and age of the apartment measured in years.

fit1 <- lm(Price ~ Age,           #A simple regression function
           data = Apartments)
summary (fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanation:

H0: B1 = 0 H1: B1 =/= 0

p < 0.05, so we can reject the H0 and accept the alternative one. It mean that the coefficient of Age (B1) has a statistically significant effect on the price per m2. The hypothesis we accepted (H1) states that the slope of B1 differs from 0.

cor(Apartments$Price, Apartments$Age)        #Coefficient of correlation
## [1] -0.230255
library(car)                            #A scatterplot matrix between Price, Age and Distance
scatterplotMatrix(Apartments [ , c(-4, -5, -6, -7)],
                  smooth = FALSE)

Explanation: - On the diagonal there is a distribution for all three variables. The distribution for Age is slightly asymmetrical to the right but it still looks like a normal distribution. The explanation would be that there is a lot of apartments that are of younger age. Distance is quite asymmetrical to the right but is looks like a bimodal distribution, which may be explained by apartments being located either in the city center or in the suburbs. Price is also asymmetric to the right and it looks a bit bimodal as well.

library(Hmisc)                             #To check the multicolinearity
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(Apartments [ , c(-4, -5, -6, -7)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000
fit2 <- lm(Price ~ Age + Distance,            #Estimate a multiple regression function
           data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Explanation: - We have the following function: (average) Price per m2 = 2460.1 - 7.9Age - 20.7Distance - b0 is 2460.1, which would be the avg price per m2 in a new apartment that is located in the city center. - If we increase the age by 1 year, the avg price of apartment per m2 would decrease by 7.9eur, if other variable remains constant. - If we increase the distance by 1km, the avg price of apartment per m2 would decrease by 20.7eur, if other variable remains constant.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))                                 #Check the multicolinearity with VIF
## [1] 1.001845
Apartments$StdResid <-round(rstandard(fit2), 3)  #For standardized residuals
Apartments$Cooksd <- round(cooks.distance(fit2), 3)   #For Cook's distance

hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 1))

Explanation: - At first we may think we have an outlier but since the values are in between the interval from -3 to +3, there are no outliers to remove. But we can still test it.

shapiro.test(Apartments$StdResid)     #To test the distribution of standardized residuals
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645
hist(Apartments$Cooksd,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "yellow",
     border = "black",
     main = "Histogram of Cook's distance")

Explanation: - It is okay that the values are below 1, that means that none of the variables have a significantly big impact on the data set. - However, we see an outlier, which has a much bigger impact than the others (the one between 0.30 and 0.35).

head(Apartments[order(-Apartments$Cooksd), ], 3)     #Finding the outlier
##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 38   5       45  2180       1       1      Yes      Yes    2.577  0.320
## 55  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069

Explanation: - The apartment number (ID) 38 has the biggest Cook’s distance of 0.320 and which we can see above in the histogram as an outlier. We can remove it.

Apartments <- Apartments [-38, ]     #We delete the ID 38
hist(Apartments$Cooksd,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "yellow",
     border = "black",
     main = "Histogram of Cook's distance 2")

There are still two units with a larger impact on the data set than the others.

head(Apartments[order(-Apartments$Cooksd), ], 10) 
##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 55  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069
## 53   7        2  1760       0       1       No      Yes   -2.152  0.066
## 22  37        3  2540       1       1      Yes      Yes    1.576  0.061
## 39  40        2  2400       0       1       No      Yes    1.091  0.038
## 58   8        2  2820       1       0      Yes       No    1.655  0.037
## 25   8       26  2300       1       1      Yes      Yes    1.571  0.034
## 57  10        1  2810       0       0       No       No    1.601  0.032
## 2   18        1  2800       1       0      Yes       No    1.783  0.030
## 31  45       21  1910       0       1       No      Yes    0.889  0.030
Apartments <- Apartments [-54, ]   #Delete the ID 54 as well
fit2 <-lm(Price ~ Age + Distance,
          data = Apartments)

Apartments$StdFitted <- scale (fit2$fitted.values)
library(ggplot2)                                   #Heteroskedasticity?
ggplot(Apartments, aes(y=StdResid, x=StdFitted)) + geom_point() +
  ylab ("Standardized residuals") + 
  xlab("Standardized fitted values") +
  theme_minimal()

We are not quite sure about the heteroskedacity from what we see here. In such case, we have to do the Breusch Pagan test.

#install.packages("olsrr")
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)  # The test between the st.residuals and fitted values(fit1 !)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.5851484 
##  Prob > Chi2   =    0.4443014

Explanation: - We have H0 and H1 given above. - The p-value is 0.44, which is lower than 0.05. It means that we have to reject the null and accept the alternative, which states that the variance is not constant and therefore, the heteroskedacity is present (we can also assume it based on the graph above).

hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 1))

hist(Apartments$StdResid,     
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 0.5))

                   #If we take a smaller number (0.5) as "by" it shows the distances better!

Explanation: - The histogram is asymmetrical to the left and it does not look like a normal distribution. However, the values are still within the interval from -3 to 3, so there is no need to remove any of the values. - Note that we have more than 30 units in the data set, so the not normal distribution should not be a concern.

shapiro.test(Apartments$StdResid)     #To confirm there is no need to remove any of the variables
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.94963, p-value = 0.002636

We have the hypotheses: H0: The variables are normally distributed. H1: The variables are not normally distributed.

The p-value (0.002) is lower than 0.05 and therefore we can reject the null and again accept the alternative one. We can conclude that the variables are not normally distributed.

summary(fit2)                         #Estimate the fit2 again. Explain the coefficients
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Explanation: - The formula: Price of apartment per m2 = 2456.076 -6.464Age - 22.955Distance the formula has changed a little bit: The b0 was before 2460.1, b1 was before -7.934 and b2 was -20.667.

p-value (0.044) is lower than 0.05 so we reject the null and accept the alternative one, which states that If on average the apartment ages by one year, the price per m2 of the apartment will decrease by 6.464 eur, assuming the distance stays the same.

p-value (< 0.01) is lower than 0.05. We reject the null and accept the alternative one. It means that if a distance of the apartment from the city center increases by 1km, the price of apartment per m2 would decrease by 22.955 eur, assuming the age stays the same.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = Apartments)
                          #Estimate the linear function (with categorical variables)
anova (fit2, fit3)        #Does fit3 fits data better than fit2 with anova function
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value (0.031) is lower than 0.05, so we reject the null and accept the alternative one, which states that fit 3 fits the data set better than fit2.

summary(fit3)             #The results of fit 3, explanation of coefficients.
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFYes  168.921     62.166   2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

Explanation: - Formula: Price of apartment per m2 = 2329.724 -5.821Age -20.279Distance +167.531Parking -15.207Balcony - We supposed the multiple R squared value would increase as we chose the better option to show the data, and so it did, now it is 0.5035 (50.35%).

H0: Beta 3 = 0 H1: Beta 3 =/= 0 p-value (0.00933) < 0.05. We reject the null and accept the alternative hypothesis.

H0: Beta 4 = 0 H1: Beta 4 =/= 0 p-value (0.79795) > 0.05. We do not reject the null hypothesis.

H0: Ro2 = 0 (it can never be less than 0!!) H1: Ro2 > 0

F= 22.04, p-value < 0.001 Since the p-value is lower than 0.05 we reject the null hypothesis and accept the alternative one. It states that the coefficient is more than 0, meaning that there is a linear relationship between price and other variables explained in the model.

Apartments$Fitted <- fitted.values(fit3 [2])   #Save fitted values & calculate the residual for ID2
Apartments$Residuals <- residuals (fit3 [2])

head(Apartments$Residuals [2])
## [1] 422.9572

The residual for ID 2 is 427.8029 eur per m2 (the difference) based on the fitted value.