data()
Source:
“Carl Hoffstedt. This differs from the dataset Highway in the alr4 package only by addition of transformation of some of the columns.”
References:
Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
Weisberg, S. (2014) Applied Linear Regression, Fourth Edition, Wiley, Section 7.2.
#install.packages("carData")
library(carData)
data("Highway1") #Show the table and explain the data set
head(Highway1)
## rate len adt trks sigs1 slim shld lane acpt itg lwid htype
## 1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI
## 2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI
## 3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI
## 4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI
## 5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI
## 6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA
Explanation of data set:
This data set comes from an unpublished master's paper by Carl Hoffstedt. It records the automobile accident rate, measured in accidents per million vehicle miles, together with 11 other variables. There are 39 units of observation (39 sections of large highways in the state of Minnesota in 1973).
- Rate: Number of automobile accidents per million vehicle miles.
- Len: The length of the highway section, in miles.
- Adt: The average daily traffic count, in thousands.
- Trks: Truck volume as a percent of the total volume.
- Sigs1: Number of signals per mile of roadway (adjusted to have non-zero values).
- Slim: The speed limit in 1973.
- Shld: The width, in feet, of the outer shoulder on the roadway.
- Lane: The total number of lanes of traffic.
- Acpt: Number of access points per mile.
- Itg: Number of freeway-type interchanges per mile.
- Lwid: The lane width, in feet.
- Htype: The type of roadway or the source of funding for the road (either MC, FAI, PA, or MA). It is a categorical variable.
mydata <- data.frame(Highway1)
colnames(Highway1) <- c("Rate", "Length", "AvgTrafficCount", "TruckVolume%", "NrSignals", "SpeedLimit", "RoadWidth", "NrLanes", "AccessPoints", "NrInterchanges", "LaneWidth", "RoadType")
head(Highway1) #Change of names of the variables
## Rate Length AvgTrafficCount TruckVolume% NrSignals SpeedLimit RoadWidth
## 1 4.58 4.99 69 8 0.20040080 55 10
## 2 2.86 16.11 73 8 0.06207325 60 10
## 3 3.02 9.75 49 10 0.10256410 60 10
## 4 2.29 10.65 61 13 0.09389671 65 10
## 5 1.61 20.01 28 12 0.04997501 70 10
## 6 6.87 5.97 30 6 2.00750419 55 10
## NrLanes AccessPoints NrInterchanges LaneWidth RoadType
## 1 8 4.6 1.20 12 FAI
## 2 4 4.4 1.43 12 FAI
## 3 4 4.7 1.54 12 FAI
## 4 6 3.8 0.94 12 FAI
## 5 4 2.2 0.65 12 FAI
## 6 4 24.8 0.34 12 PA
Highway1$Speed <- seq(50, 80, length.out = 39) #The new variable is created with the data made up.
head(Highway1)
## Rate Length AvgTrafficCount TruckVolume% NrSignals SpeedLimit RoadWidth
## 1 4.58 4.99 69 8 0.20040080 55 10
## 2 2.86 16.11 73 8 0.06207325 60 10
## 3 3.02 9.75 49 10 0.10256410 60 10
## 4 2.29 10.65 61 13 0.09389671 65 10
## 5 1.61 20.01 28 12 0.04997501 70 10
## 6 6.87 5.97 30 6 2.00750419 55 10
## NrLanes AccessPoints NrInterchanges LaneWidth RoadType Speed
## 1 8 4.6 1.20 12 FAI 50.00000
## 2 4 4.4 1.43 12 FAI 50.78947
## 3 4 4.7 1.54 12 FAI 51.57895
## 4 6 3.8 0.94 12 FAI 52.36842
## 5 4 2.2 0.65 12 FAI 53.15789
## 6 4 24.8 0.34 12 PA 53.94737
any_missing <- any(is.na(Highway1)) #There is no missing data as it shows "FALSE"
summary(Highway1[ , c(-5, -7, -9, -10, -11, -12)]) #Summary without some of the variables
## Rate Length AvgTrafficCount TruckVolume% SpeedLimit
## Min. :1.610 Min. : 2.960 Min. : 1.00 Min. : 6.000 Min. :40
## 1st Qu.:2.630 1st Qu.: 7.995 1st Qu.: 5.00 1st Qu.: 8.000 1st Qu.:50
## Median :3.050 Median :11.390 Median :13.00 Median : 9.000 Median :55
## Mean :3.933 Mean :12.884 Mean :19.62 Mean : 9.333 Mean :55
## 3rd Qu.:4.595 3rd Qu.:17.800 3rd Qu.:24.00 3rd Qu.:11.000 3rd Qu.:60
## Max. :9.230 Max. :40.090 Max. :73.00 Max. :15.000 Max. :70
## NrLanes Speed
## Min. :2.000 Min. :50.0
## 1st Qu.:2.000 1st Qu.:57.5
## Median :2.000 Median :65.0
## Mean :3.128 Mean :65.0
## 3rd Qu.:4.000 3rd Qu.:72.5
## Max. :8.000 Max. :80.0
mean(Highway1$Rate)
## [1] 3.933333
round(mean(Highway1$Rate), 2) #Round the mean to 2 decimal places
## [1] 3.93
Presentation of descriptive statistics:
- The average accident rate in the sample is 3.93 accidents per million vehicle miles.
- The highest average daily traffic count among the highway sections is 73 thousand vehicles.
- 75% of the highway sections in the sample have 4 lanes or fewer.
- The shortest highway section is 2.96 miles long.
- Half of the highway sections have a (made-up) Speed value of 65 or less, and the other half have a value of 65 or more.
library(ggplot2)
ggplot(Highway1, aes(x = Speed)) +
geom_histogram(binwidth = 5, fill = "yellow", color = "black") +
labs(x = "Speed", y = "Frequency", title = "Distribution of speed")
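Explanation of the histogram: - Speed was generated as an evenly spaced
sequence from 50 to 80, so the histogram is roughly flat (close to a
uniform distribution) rather than bell-shaped, without any extreme values
or outliers. Every value occurs exactly once, so there is no meaningful
mode; the mean and the median are both 65.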
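The shape of this histogram follows from how Speed was built. A minimal base-R sketch shows that the values are evenly spaced, so the mean and median coincide. The name `speed` and the explicit bin edges are assumptions made for illustration, and the sketch does not use ggplot2.

```r
# Recreate the made-up Speed variable: 39 evenly spaced values from 50 to 80
speed <- seq(50, 80, length.out = 39)

# Every gap between neighbouring values is the same (30 / 38)
gaps <- diff(speed)
range(gaps)

# Mean and median coincide for an evenly spaced sequence
mean(speed)
median(speed)

# Bin counts with 5-unit bins (bin edges chosen here for illustration)
counts <- hist(speed, breaks = seq(47.5, 82.5, by = 5), plot = FALSE)$counts
counts
```

The interior bins hold similar counts and only the two edge bins are smaller, which is what a flat distribution looks like with these bin edges.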
library(car)
scatterplot(Highway1$Rate ~ Highway1$Speed,
smooth = FALSE,
boxplots = FALSE,
ylab = "Rate",
xlab = "Speed")
Description: - This is a scatterplot of the accident rate against Speed.
The fitted line is nearly horizontal (with a slight positive slope),
which suggests that the accident rate stays roughly constant regardless
of Speed. This is expected, because the Speed values are made up: they
are an evenly spaced sequence assigned by row order, not measured data,
so no real conclusion can be drawn from the plot. A few points lie far
from the line; whether they are influential outliers should be checked
with diagnostics such as standardized residuals or Cook's distance (VIF
measures multicollinearity, not outliers).
library(readr)
mydata1 <- as.data.frame(mydata)
data <- read.csv("~/Bootcamp_working/Task 2/Body mass.csv", sep = ";", dec=",", header = TRUE,col.names = c("ID", "Mass"))
head(data) #View of the first 6 rows of the data set
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(pastecs)
round(stat.desc(data[ , -1]), 2) #Rounded to two decimal places
## nbr.val nbr.null nbr.na min max range
## 50.00 0.00 0.00 49.70 83.20 33.50
## sum median mean SE.mean CI.mean.0.95 var
## 3143.80 62.80 62.88 0.85 1.71 36.14
## std.dev coef.var
## 6.01 0.10
hist(data$Mass,
main = "Distribution of body mass",
xlab = "Weight",
ylab = "Frequency",
col = "yellow",
border = "black",
breaks = seq(from = 0, to = 100, by = 10))
- The histogram shows the frequency of ninth-graders' weight measures. The
graph is slightly asymmetrical to the right (positively skewed).
H0: μ = 59.5 kg H1: μ =/= 59.5 kg (two-sided: we test whether the mean differs from 59.5 kg)
t.test(data$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
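The test statistic above can be reproduced by hand from the descriptive statistics reported by `stat.desc`. This is a rough check, assuming the rounded variance (36.14) is accurate enough; the variable names are made up for illustration.

```r
n     <- 50       # nbr.val from stat.desc
x_bar <- 62.876   # sample mean from the t-test output
s2    <- 36.14    # variance from stat.desc (rounded)
mu0   <- 59.5     # hypothesized mean under H0

se     <- sqrt(s2 / n)                       # standard error of the mean
t_stat <- (x_bar - mu0) / se                 # one-sample t statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

round(t_stat, 2)
p_val < 0.05   # TRUE means H0 is rejected at the 5% level
```

Because the p-value is far below 0.05, H0: μ = 59.5 kg is rejected; the sample mean of 62.88 kg differs significantly from 59.5 kg.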
library(effectsize)
cohens_d(data$Mass, mu = 59.5) #Calculating the effect size
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
The value is 0.56. To interpret it we use the rules from the source used at the lecture, Sawilowsky (2009).
interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
The sample mean (62.88 kg) is above the hypothesized 59.5 kg, and by these rules the size of this difference is classified as medium.
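For a one-sample test, Cohen's d is the mean difference divided by the standard deviation, which equals t / sqrt(n). A quick sketch using the rounded values printed above (variable names are illustrative) gives the same effect size:

```r
t_stat <- 3.9711   # t statistic from the t-test output
n      <- 50       # sample size (df + 1)

d <- t_stat / sqrt(n)   # one-sample Cohen's d
round(d, 2)
```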
library(readxl) #Importing the data set
Apartments <- read_excel("~/Bootcamp_working/Task 3/Apartments.xlsx")
View(Apartments)
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
- Age: Age of an apartment in years
- Distance: The distance from the city center in km
- Price: Price per m2
- Parking: 0-No, 1-Yes
- Balcony: 0-No, 1-Yes
Apartments$ParkingF <- factor(Apartments$Parking, #Change categorical variables into factors
levels = c(0, 1),
labels = c("No","Yes"))
Apartments$BalconyF <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments <- as.data.frame(Apartments) #To number each unit
head(Apartments)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(Apartments$Price, # t-test for H0 hypothesis
mu = 1900,
alternative= "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
library(car)
scatterplot(Apartments$Price ~ Apartments$Age,
smooth = FALSE,
boxplots = FALSE,
ylab = "Price per m2 in EUR",
xlab = "Age in years")
- The scatterplot shows a weak negative linear relationship between the
price of an apartment per m2 and its age measured in years.
fit1 <- lm(Price ~ Age, #A simple regression function
data = Apartments)
summary (fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanation:
The average price per m2 = 2185.5 - 9*Age
The number 2185.5 is b0 (the intercept): the estimated price per m2 (2185.5 EUR) of a brand new apartment (0 years of age).
The coefficient -9 is b1 (the slope): if the age of an apartment increases by 1 year, the price per m2 decreases on average by about 9 EUR.
R squared is the coefficient of determination. Its value of 0.05302 means that 5.30% of the variability in prices per m2 is explained by the variability in age.
We also see the p-value of 0.03401
H0: B1 = 0 H1: B1 =/= 0
p < 0.05, so we reject H0 in favor of H1. It means that Age has a statistically significant effect on the price per m2: the slope B1 differs from 0.
cor(Apartments$Price, Apartments$Age) #Coefficient of correlation
## [1] -0.230255
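In a simple regression with one explanatory variable, R squared is the square of the correlation coefficient. A small check using the value printed above (variable names are illustrative):

```r
r <- -0.230255     # correlation between Price and Age (from cor())
r_squared <- r^2   # coefficient of determination for Price ~ Age

round(r_squared, 5)         # compare with Multiple R-squared in summary(fit1)
round(100 * r_squared, 1)   # the same value as a percentage
```

The result agrees with the Multiple R-squared of 0.05302, that is, about 5.3% of explained variability.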
library(car) #A scatterplot matrix between Price, Age and Distance
scatterplotMatrix(Apartments [ , c(-4, -5, -6, -7)],
smooth = FALSE)
Explanation: - On the diagonal there is the distribution of each of the
three variables. The distribution of Age is slightly asymmetrical to the
right but still looks roughly normal; a possible explanation is that
there are many younger apartments. Distance is quite asymmetrical to the
right and looks bimodal, which may be explained by apartments being
located either in the city center or in the suburbs. Price is also
asymmetrical to the right and looks slightly bimodal as well.
library(Hmisc) #To check the multicollinearity
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(Apartments [ , c(-4, -5, -6, -7)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
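The P matrix printed by `rcorr` contains two-sided p-values for H0: correlation = 0. As a sketch, the p-value for Price and Age can be recomputed from r and n with the usual t transformation (base R only; variable names are illustrative):

```r
r <- -0.230255   # correlation between Price and Age
n <- 85          # number of apartments

t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)    # t statistic with n - 2 df
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

round(t_stat, 3)
round(p_val, 3)
```

Up to rounding, the t statistic is the same as the t value for Age in `summary(fit1)`, because a simple regression slope test and a correlation test are equivalent.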
fit2 <- lm(Price ~ Age + Distance, #Estimate a multiple regression function
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Explanation:
- We have the following function: (average) Price per m2 = 2460.1 - 7.9*Age - 20.7*Distance
- b0 is 2460.1, the estimated average price per m2 of a new apartment (Age = 0) located in the city center (Distance = 0).
- If the age increases by 1 year, the average price per m2 decreases by 7.9 EUR, holding distance constant.
- If the distance increases by 1 km, the average price per m2 decreases by 20.7 EUR, holding age constant.
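To make the coefficients concrete, the fitted equation can be wrapped in a small function. This is only a sketch: `predict_price` is a hypothetical helper, and the example apartment (10 years old, 5 km from the center) is made up.

```r
# Coefficients copied from summary(fit2) above
b0     <- 2460.101
b_age  <- -7.934
b_dist <- -20.667

# Hypothetical helper: fitted price per m2 for a given age and distance
predict_price <- function(age, distance) {
  b0 + b_age * age + b_dist * distance
}

price_example <- predict_price(age = 10, distance = 5)
price_example

# One extra year of age, distance held constant, changes the prediction by b_age
predict_price(11, 5) - predict_price(10, 5)
```

With the model object available, `predict(fit2, newdata = data.frame(Age = 10, Distance = 5))` should give the same number up to rounding of the coefficients.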
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2)) #Check the multicollinearity with VIF
## [1] 1.001845
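With only two explanatory variables, both share the same VIF, 1 / (1 - r^2), where r is the correlation between Age and Distance. A sketch that backs out r from the printed VIF (the threshold of 5 is a common rule of thumb and an assumption here, not something stated in the source):

```r
vif_value <- 1.001845                  # VIF of Age and Distance from vif(fit2)
r_implied <- sqrt(1 - 1 / vif_value)   # absolute correlation between the two predictors

round(r_implied, 2)   # compare with the Age-Distance entry printed by rcorr
vif_value < 5         # rule-of-thumb threshold (assumption)
```

A VIF this close to 1 indicates practically no multicollinearity between Age and Distance.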
Apartments$StdResid <-round(rstandard(fit2), 3) #For standardized residuals
Apartments$Cooksd <- round(cooks.distance(fit2), 3) #For Cook's distance
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "yellow",
border = "black",
breaks = seq(from = -3, to = 3, by = 1))
Explanation: - All standardized residuals lie within the interval from
-3 to +3, so no unit is flagged as an outlier and none needs to be
removed. We can still test the normality of the residuals formally.
shapiro.test(Apartments$StdResid) #To test the distribution of standardized residuals
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95303, p-value = 0.003645
We have the hypotheses: H0: The standardized residuals are normally distributed. H1: The standardized residuals are not normally distributed.
The p-value (0.003645) is lower than 0.05, so we reject the null hypothesis. We conclude that the standardized residuals are not normally distributed.
Because the sample has more than 30 units, this departure from normality is not a serious concern.
hist(Apartments$Cooksd,
xlab = "Cook's distance",
ylab = "Frequency",
col = "yellow",
border = "black",
main = "Histogram of Cook's distance")
Explanation: - All values are below 1, which means that none of the
units has an excessively large influence on the regression estimates. -
However, we see one unit with a much bigger influence than the others
(the one between 0.30 and 0.35).
head(Apartments[order(-Apartments$Cooksd), ], 3) #Finding the outlier
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 38 5 45 2180 1 1 Yes Yes 2.577 0.320
## 55 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
Explanation: - The apartment with ID 38 has the largest Cook's distance (0.320); it is the isolated bar in the histogram above. We can remove it.
Apartments <- Apartments [-38, ] #We delete the ID 38
hist(Apartments$Cooksd,
xlab = "Cook's distance",
ylab = "Frequency",
col = "yellow",
border = "black",
main = "Histogram of Cook's distance 2")
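There are still two units with a larger influence than the others. Note
that Cooksd was not recomputed after the removal, so this histogram
still shows the values from the original fit.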
head(Apartments[order(-Apartments$Cooksd), ], 10)
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 55 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
## 53 7 2 1760 0 1 No Yes -2.152 0.066
## 22 37 3 2540 1 1 Yes Yes 1.576 0.061
## 39 40 2 2400 0 1 No Yes 1.091 0.038
## 58 8 2 2820 1 0 Yes No 1.655 0.037
## 25 8 26 2300 1 1 Yes Yes 1.571 0.034
## 57 10 1 2810 0 0 No No 1.601 0.032
## 2 18 1 2800 1 0 Yes No 1.783 0.030
## 31 45 21 1910 0 1 No Yes 0.889 0.030
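After a row is dropped, positional indices and row names no longer agree: a negative index such as `-54` refers to the 54th remaining row, not to the unit labelled 54. A toy illustration of the effect and of dropping by row name instead (the data frame `toy` and its values are made up):

```r
# Toy data frame standing in for Apartments (values are made up)
toy <- data.frame(value = (1:6) * 10)   # row names "1" to "6"

# Drop the third row; the remaining row names are 1, 2, 4, 5, 6
toy <- toy[-3, , drop = FALSE]

# Position 4 now holds the row named "5", not "4"
name_at_pos4 <- rownames(toy)[4]
name_at_pos4

# Dropping by row name avoids the off-by-one
toy_by_name <- toy[!(rownames(toy) %in% "5"), , drop = FALSE]
rownames(toy_by_name)
```

Selecting by row name keeps the code tied to the ID shown in the printed table, whatever rows were removed earlier.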
Apartments <- Apartments [-54, ] #Delete ID 55 as well (it is at position 54 after removing ID 38)
fit2 <-lm(Price ~ Age + Distance,
data = Apartments)
Apartments$StdFitted <- scale (fit2$fitted.values)
library(ggplot2) #Heteroskedasticity?
ggplot(Apartments, aes(y=StdResid, x=StdFitted)) + geom_point() +
ylab ("Standardized residuals") +
xlab("Standardized fitted values") +
theme_minimal()
We cannot tell from this plot alone whether heteroskedasticity is present.
In such a case, we have to do the Breusch-Pagan test.
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1) # Breusch-Pagan test against the fitted values (note: run on fit1, not fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.5851484
## Prob > Chi2 = 0.4443014
Explanation: - We have H0 and H1 given above. - The p-value is 0.44, which is higher than 0.05. We therefore cannot reject the null hypothesis of constant variance, so the test gives no evidence of heteroskedasticity.
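The p-value in the test summary is the upper tail of a chi-squared distribution. A short sketch recomputes it from the printed statistic and compares it with the 5% level (variable names are illustrative):

```r
chi2 <- 0.5851484   # Chi2 from the Breusch-Pagan output
dof  <- 1           # DF from the output

p_val <- pchisq(chi2, df = dof, lower.tail = FALSE)   # upper-tail probability
round(p_val, 4)

p_val > 0.05   # TRUE means H0 (constant variance) is not rejected at the 5% level
```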
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "yellow",
border = "black",
breaks = seq(from = -3, to = 3, by = 1))
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "yellow",
border = "black",
breaks = seq(from = -3, to = 3, by = 0.5))
#If we take a smaller number (0.5) as "by" it shows the distances better!
Explanation: - The histogram is asymmetrical to the left and does not look like a normal distribution. However, all values still lie within the interval from -3 to 3, so there is no need to remove any unit. - Note that we have more than 30 units in the data set, so the non-normal distribution should not be a concern.
shapiro.test(Apartments$StdResid) #Test the normality of the standardized residuals
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.94963, p-value = 0.002636
We have the hypotheses: H0: The standardized residuals are normally distributed. H1: The standardized residuals are not normally distributed.
The p-value (0.003) is lower than 0.05 and therefore we again reject the null hypothesis. We conclude that the standardized residuals are not normally distributed.
summary(fit2) #Estimate the fit2 again. Explain the coefficients
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
Explanation: - The formula: Price of apartment per m2 = 2490.112 - 7.850Age - 23.945Distance. The formula has changed a little: before, b0 was 2460.101, b1 was -7.934 and b2 was -20.667.
The p-value for Age (0.018) is lower than 0.05, so we reject the null hypothesis that its coefficient is 0. If the apartment ages by one year, the price per m2 decreases on average by 7.85 EUR, assuming the distance stays the same.
The p-value for Distance (< 0.001) is lower than 0.05, so we reject the null hypothesis that its coefficient is 0. If the distance of the apartment from the city center increases by 1 km, the price per m2 decreases on average by 23.95 EUR, assuming the age stays the same.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = Apartments)
#Estimate the linear function (with categorical variables)
anova (fit2, fit3) #Does fit3 fits data better than fit2 with anova function
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value (0.028) is lower than 0.05, so we reject the null hypothesis and conclude that fit3 fits the data better than fit2.
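The F statistic in the ANOVA table compares the drop in the residual sum of squares with the residual variance of the larger model. A sketch that recomputes it from the printed table (variable names are illustrative):

```r
rss_small <- 5982100   # RSS of fit2 (Model 1)
rss_big   <- 5458696   # RSS of fit3 (Model 2)
df_extra  <- 2         # number of added coefficients (ParkingF, BalconyF)
df_resid  <- 78        # residual degrees of freedom of fit3

# F = (reduction in RSS per added coefficient) / (residual mean square of fit3)
f_stat <- ((rss_small - rss_big) / df_extra) / (rss_big / df_resid)
p_val  <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)

round(f_stat, 4)
round(p_val, 5)
```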
summary(fit3) #The results of fit 3, explanation of coefficients.
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
Explanation:
- Formula: Price of apartment per m2 = 2358.900 - 7.197Age - 21.241Distance + 168.921Parking - 6.985Balcony
- We expected the multiple R squared to increase because the model now includes more explanatory variables, and so it did: it is now 0.5408 (54.08%).
H0: Beta 3 = 0 H1: Beta 3 =/= 0 p-value (0.00811) < 0.05. We reject the null hypothesis: given age and distance, apartments with parking are on average 168.92 EUR per m2 more expensive than those without.
H0: Beta 4 = 0 H1: Beta 4 =/= 0 p-value (0.90566) > 0.05. We do not reject the null hypothesis: we cannot claim that a balcony affects the price per m2.
H0: Ro2 = 0 (it can never be less than 0) H1: Ro2 > 0
F = 22.97, p-value < 0.001. Since the p-value is lower than 0.05 we reject the null hypothesis. The coefficient of determination is greater than 0, meaning that at least one of the explanatory variables in the model is linearly related to price.
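The overall F statistic and the adjusted R squared can both be derived from the multiple R squared, the number of units and the number of explanatory variables. A sketch using the rounded values from `summary(fit3)` (so the results agree only up to rounding; variable names are illustrative):

```r
r2 <- 0.5408   # Multiple R-squared of fit3
n  <- 83       # units left after removing two apartments (78 residual df + 5 coefficients)
k  <- 4        # explanatory variables: Age, Distance, ParkingF, BalconyF

# Adjusted R squared penalizes the number of explanatory variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F statistic for H0: all slope coefficients are 0
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val  <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

round(adj_r2, 3)
round(f_stat, 1)
p_val < 0.001
```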
Apartments$Fitted <- fitted.values(fit3) #Save fitted values & calculate the residual for ID2
Apartments$Residuals <- residuals(fit3)
Apartments$Residuals[2]
## [1] 422.9572
The residual for ID 2 is 422.96 EUR per m2: the observed price is that much higher than the fitted value.
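A residual is the observed value minus the fitted value. As a check, the fitted price for ID 2 (Age 18, Distance 1, parking, no balcony, price 2800) can be rebuilt from the rounded coefficients in `summary(fit3)`; the variable names are illustrative and the result matches the printed residual only up to rounding.

```r
# Coefficients copied from summary(fit3)
b0 <- 2358.900; b_age <- -7.197; b_dist <- -21.241
b_parking <- 168.921; b_balcony <- -6.985

# Apartment ID 2 from head(Apartments): dummies are 1 for "Yes", 0 for "No"
age <- 18; distance <- 1; parking <- 1; balcony <- 0; price <- 2800

fitted_id2 <- b0 + b_age * age + b_dist * distance +
  b_parking * parking + b_balcony * balcony
resid_id2 <- price - fitted_id2   # observed minus fitted

round(fitted_id2, 2)
round(resid_id2, 2)
```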