data()
Source:
“Carl Hoffstedt. This differs from the dataset Highway in the alr4 package only by addition of transformation of some of the columns.”
References:
Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
Weisberg, S. (2014) Applied Linear Regression, Fourth Edition, Wiley, Section 7.2.
#install.packages("carData")
library(carData)
data("Highway1") #Show the table and explain the data set
head(Highway1)
## rate len adt trks sigs1 slim shld lane acpt itg lwid htype
## 1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI
## 2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI
## 3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI
## 4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI
## 5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI
## 6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA
Explanation of data set:
This data set comes from an unpublished master’s paper by Carl Hoffstedt. It describes the automobile accident rate, measured in accidents per million vehicle miles, together with 11 other variables. There are 39 units of observation (39 sections of large highways in the state of Minnesota in 1973).
Rate: Number of automobile accidents per million vehicle miles.
Len: Length of the highway section, in miles.
Adt: Average daily traffic count, in thousands.
Trks: Truck volume as a percentage of the total volume.
Sigs1: Number of signals per mile of roadway (adjusted to have non-zero values).
Slim: Speed limit in 1973.
Shld: Width of the outer shoulder of the roadway, in feet.
Lane: Total number of lanes of traffic.
Acpt: Number of access points per mile.
Itg: Number of freeway-type interchanges per mile.
Lwid: Lane width, in feet.
Htype: Type of roadway or the source of funding for the road (MC, FAI, PA, or MA); a categorical variable.
mydata <- data.frame(Highway1)
colnames(Highway1) <- c("Rate", "Length", "AvgTrafficCount", "TruckVolume%", "NrSignals", "SpeedLimit", "RoadWidth", "NrLanes", "AccessPoints", "NrInterchanges", "LaneWidth", "RoadType")
head(Highway1) #Change of names of the variables
## Rate Length AvgTrafficCount TruckVolume% NrSignals SpeedLimit RoadWidth
## 1 4.58 4.99 69 8 0.20040080 55 10
## 2 2.86 16.11 73 8 0.06207325 60 10
## 3 3.02 9.75 49 10 0.10256410 60 10
## 4 2.29 10.65 61 13 0.09389671 65 10
## 5 1.61 20.01 28 12 0.04997501 70 10
## 6 6.87 5.97 30 6 2.00750419 55 10
## NrLanes AccessPoints NrInterchanges LaneWidth RoadType
## 1 8 4.6 1.20 12 FAI
## 2 4 4.4 1.43 12 FAI
## 3 4 4.7 1.54 12 FAI
## 4 6 3.8 0.94 12 FAI
## 5 4 2.2 0.65 12 FAI
## 6 4 24.8 0.34 12 PA
Highway1$Speed <- seq(50, 80, length.out = 39) #The new variable is created with the data made up.
head(Highway1)
## Rate Length AvgTrafficCount TruckVolume% NrSignals SpeedLimit RoadWidth
## 1 4.58 4.99 69 8 0.20040080 55 10
## 2 2.86 16.11 73 8 0.06207325 60 10
## 3 3.02 9.75 49 10 0.10256410 60 10
## 4 2.29 10.65 61 13 0.09389671 65 10
## 5 1.61 20.01 28 12 0.04997501 70 10
## 6 6.87 5.97 30 6 2.00750419 55 10
## NrLanes AccessPoints NrInterchanges LaneWidth RoadType Speed
## 1 8 4.6 1.20 12 FAI 50.00000
## 2 4 4.4 1.43 12 FAI 50.78947
## 3 4 4.7 1.54 12 FAI 51.57895
## 4 6 3.8 0.94 12 FAI 52.36842
## 5 4 2.2 0.65 12 FAI 53.15789
## 6 4 24.8 0.34 12 PA 53.94737
any_missing <- any(is.na(Highway1)) #There is no missing data as it shows "FALSE"
summary(Highway1[ , c(-5, -7, -9, -10, -11, -12)]) #Deleted some of the variables
## Rate Length AvgTrafficCount TruckVolume% SpeedLimit
## Min. :1.610 Min. : 2.960 Min. : 1.00 Min. : 6.000 Min. :40
## 1st Qu.:2.630 1st Qu.: 7.995 1st Qu.: 5.00 1st Qu.: 8.000 1st Qu.:50
## Median :3.050 Median :11.390 Median :13.00 Median : 9.000 Median :55
## Mean :3.933 Mean :12.884 Mean :19.62 Mean : 9.333 Mean :55
## 3rd Qu.:4.595 3rd Qu.:17.800 3rd Qu.:24.00 3rd Qu.:11.000 3rd Qu.:60
## Max. :9.230 Max. :40.090 Max. :73.00 Max. :15.000 Max. :70
## NrLanes Speed
## Min. :2.000 Min. :50.0
## 1st Qu.:2.000 1st Qu.:57.5
## Median :2.000 Median :65.0
## Mean :3.128 Mean :65.0
## 3rd Qu.:4.000 3rd Qu.:72.5
## Max. :8.000 Max. :80.0
mean(Highway1$Rate)
## [1] 3.933333
round(mean(Highway1$Rate), 2) #Round the mean to 2 decimal places
## [1] 3.93
Presentation of descriptive statistics:
The average rate of automobile accidents (in the sample) is 3.93 accidents per million vehicle miles.
The highest average daily traffic count in the sample is 73 thousand vehicles.
75% of the highway sections in the sample have 4 lanes or fewer.
The shortest highway section in the sample is 2.96 miles long.
Half of the highway sections in the sample have a (made-up) speed of 65 miles per hour or less, and the other half have a speed above 65 miles per hour.
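The statistics of the made-up Speed variable can be checked without the data set, because the variable is just an equally spaced sequence. The following is a minimal sketch in base R; it assumes only the seq() call shown earlier, and the object names (speed, step and so on) are illustrative.

```r
# Rebuild the made-up Speed variable: 39 equally spaced values from 50 to 80
speed <- seq(50, 80, length.out = 39)

# Distance between two neighbouring values (30 / 38)
step <- speed[2] - speed[1]

# Median and quartiles, the same quantities that summary() reports
speed_median    <- median(speed)
speed_quartiles <- unname(quantile(speed, probs = c(0.25, 0.75)))

print(round(step, 5))
print(speed_median)
print(speed_quartiles)
```

Because the values are equally spaced, the mean and the median coincide, which is why summary() reports the same number for both.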
library(ggplot2)
ggplot(Highway1, aes(x = Speed)) +
geom_histogram(binwidth = 5, fill = "yellow", color = "black") +
labs(x = "Speed", y = "Frequency", title = "Distribution of speed")
Explanation of the histogram:
library(car)
scatterplot(Highway1$Rate ~ Highway1$Speed,
smooth = FALSE,
boxplots = FALSE,
ylab = "Rate",
xlab = "Speed")
Description:
library(readr)
mydata1 <- as.data.frame(mydata)
data <- read.csv("~/Bootcamp_working/Task 2/Body mass.csv", sep = ";", dec=",", header = TRUE,col.names = c("ID", "Mass"))
head(data) #View of the first 6 rows of the data set
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(pastecs)
round(stat.desc(data[ , -1]), 2) #Rounded to two decimal places
## nbr.val nbr.null nbr.na min max range
## 50.00 0.00 0.00 49.70 83.20 33.50
## sum median mean SE.mean CI.mean.0.95 var
## 3143.80 62.80 62.88 0.85 1.71 36.14
## std.dev coef.var
## 6.01 0.10
Instead of summary(), I used stat.desc() because it reports more statistics.
I also excluded the ID column because it is only an identifier, not a measurement.
The mean (average) body mass is 62.88 kg, and the values range from 49.70 kg to 83.20 kg.
The median (middle value) is 62.80 kg, which means that 50% of the people in the sample weigh up to 62.80 kg and the other 50% weigh more.
The output also gives the half-width of the confidence interval (at alpha = 5%): we are 95% confident that the true arithmetic mean lies between 61.17 kg (62.88 - 1.71) and 64.59 kg (62.88 + 1.71).
The sum of all body masses in the sample is 3143.80 kg.
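The value CI.mean.0.95 is the half-width of the confidence interval, that is, the t quantile multiplied by the standard error of the mean. A minimal sketch of that calculation, using only the rounded numbers from the stat.desc() output (the object names are illustrative):

```r
# Rounded values copied from the stat.desc() output
n <- 50       # number of observations (nbr.val)
m <- 62.88    # sample mean in kg
s <- 6.01     # sample standard deviation in kg (std.dev)

# Standard error of the mean and half-width of the 95% confidence interval
se         <- s / sqrt(n)
half_width <- qt(0.975, df = n - 1) * se

# Lower and upper bound of the interval
ci_lower <- m - half_width
ci_upper <- m + half_width

print(round(c(se = se, half_width = half_width, lower = ci_lower, upper = ci_upper), 2))
```

Small differences in the last decimal place are expected, because the inputs are already rounded.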
hist(data$Mass,
main = "Distribution of body mass",
xlab = "Weight",
ylab = "Frequency",
col = "yellow",
border = "black",
breaks = seq(from = 0, to = 100, by = 10))
The histogram shows the frequency distribution of ninth-graders’ body mass. The distribution is slightly asymmetrical to the right (positively skewed).
We have the following hypotheses. Based on data from the 2018/2019 school year, the average weight was 59.5 kg.
H0: μ = 59.5 kg
H1: μ =/= 59.5 kg (we want to test for a difference in either direction)
t.test(data$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
The t-value of 3.97 is large in absolute value, which points towards a statistically significant result.
The p-value is 0.000234, which is lower than 0.001 and therefore also lower than 0.05 (our significance level of 5%). We reject the null hypothesis (H0) and accept the alternative one: the average body mass of ninth-graders in 2021/2022 differs from the 2018/2019 average of 59.5 kg.
The 95% confidence interval does not include the hypothesized mean (59.5 kg), which is another reason to reject H0.
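The t-value and the p-value from the output can be reproduced by hand. This is a minimal sketch that assumes the rounded summary statistics reported earlier (mean 62.876, standard deviation 6.01, n = 50); the object names are illustrative.

```r
# Rounded summary statistics of the sample
n  <- 50        # sample size
m  <- 62.876    # sample mean in kg
s  <- 6.01      # sample standard deviation in kg
mu <- 59.5      # hypothesized mean under H0

# One-sample t-statistic: distance of the sample mean from mu in standard errors
t_value <- (m - mu) / (s / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = n - 1)

print(round(t_value, 2))
print(signif(p_value, 2))
```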
library(effectsize)
cohens_d(data$Mass, mu = 59.5) #Calculating the effect size
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
The value of Cohen's d is 0.56.
We still need to interpret it using the source from the lecture, Sawilowsky (2009).
interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
The output tells us that the effect size (the increase in weight) is considered medium.
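For a one-sample comparison, Cohen's d is, as far as I understand it, the difference between the sample mean and the hypothesized mean divided by the sample standard deviation. The sketch below recomputes it in two ways from the rounded values reported earlier; it is a hand check, not the internal code of cohens_d().

```r
# Rounded values from the earlier outputs
n       <- 50        # sample size
m       <- 62.876    # sample mean in kg
s       <- 6.01      # sample standard deviation in kg
mu      <- 59.5      # hypothesized mean
t_value <- 3.9711    # t-statistic from the one-sample t-test

# Cohen's d from the mean difference and the standard deviation
d_from_sd <- (m - mu) / s

# The same quantity expressed through the t-statistic: d = t / sqrt(n)
d_from_t <- t_value / sqrt(n)

print(round(c(d_from_sd = d_from_sd, d_from_t = d_from_t), 2))
```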
library(readxl) #Importing the data set
Apartments <- read_excel("~/Bootcamp_working/Task 3/Apartments.xlsx")
View(Apartments)
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes
Apartments$ParkingF <- factor(Apartments$Parking, #Change categorical variables into factors
levels = c(0, 1),
labels = c("No","Yes"))
Apartments$BalconyF <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments <- as.data.frame(Apartments) #To number each unit
head(Apartments)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(Apartments$Price, # t-test for H0 hypothesis
mu = 1900,
alternative= "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We get a p-value of 0.004731, which is lower than 0.05. We reject the null hypothesis (H0: μ = 1900) and accept the alternative one, which states that, on average, the price per m2 of apartments in the Ljubljana region differs from 1900 EUR.
The sample average (2018.94 EUR per m2) is higher than 1900 EUR per m2.
Another reason to reject the null hypothesis is the 95% confidence interval, which again does not include the hypothesized mean of 1900.
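The confidence interval and the t-statistic carry the same information. As a rough sketch, the standard error can be recovered from the printed interval and then used to recompute t; the inputs are the rounded values from the t.test() output and the object names are illustrative.

```r
# Rounded values copied from the t.test() output
ci_lower <- 1937.443
ci_upper <- 2100.440
m        <- 2018.941   # sample mean
mu       <- 1900       # hypothesized mean
df       <- 84         # degrees of freedom (n - 1)

# Half-width of the 95% interval = t quantile * standard error
half_width <- (ci_upper - ci_lower) / 2
se         <- half_width / qt(0.975, df = df)

# Recompute the t-statistic and the two-sided p-value
t_value <- (m - mu) / se
p_value <- 2 * pt(-abs(t_value), df = df)

print(round(c(se = se, t = t_value, p = p_value), 4))
```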
library(car)
scatterplot(Apartments$Price ~ Apartments$Age,
smooth = FALSE,
boxplots = FALSE,
ylab = "Price per m2 in EUR",
xlab = "Age in years")
- The scatterplot shows a weak negative linear relationship between the price of an apartment per m2 and the age of the apartment in years.
fit1 <- lm(Price ~ Age, #A simple regression function
data = Apartments)
summary (fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanation:
The estimated regression function: average price per m2 = 2185.455 - 8.975 * Age
The number 2185.455 is the intercept (b0): it is the expected price per m2 (2185.455 EUR) of a brand-new apartment (0 years of age).
The number -8.975 is the slope (b1): if the age of an apartment increases by 1 year, the price per m2 decreases on average by 8.975 EUR.
R squared is the coefficient of determination. Its value of 0.05302 means that 5.30% of the variability in the price per m2 is explained by variability in age.
We also see the p-value of 0.03401 for the following hypotheses:
H0: B1 = 0 H1: B1 =/= 0
p < 0.05, so we reject H0 and accept the alternative hypothesis (H1), which states that the slope B1 differs from 0. This means that Age has a statistically significant effect on the price per m2.
cor(Apartments$Price, Apartments$Age) #Coefficient of correlation
## [1] -0.230255
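In a simple regression with one explanatory variable, R squared equals the squared correlation coefficient and the F-statistic equals the squared t-value of the slope. The sketch below checks both relationships with the rounded numbers from the outputs above; the object names are illustrative.

```r
# Rounded values copied from the summary(fit1) and cor() outputs
r        <- -0.230255   # correlation between Price and Age
slope    <- -8.975      # estimated coefficient of Age
slope_se <- 4.164       # standard error of the coefficient
df_resid <- 83          # residual degrees of freedom

r_squared <- r^2                  # compare with Multiple R-squared
t_slope   <- slope / slope_se     # compare with the t value of Age
f_stat    <- t_slope^2            # compare with the F-statistic

# Two-sided p-value of the slope
p_slope <- 2 * pt(-abs(t_slope), df = df_resid)

print(round(c(r_squared = r_squared, t = t_slope, F = f_stat, p = p_slope), 4))
```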
library(car) #A scatterplot matrix between Price, Age and Distance
scatterplotMatrix(Apartments [ , c(-4, -5, -6, -7)],
smooth = FALSE)
Explanation:
On the diagonal there is the distribution of each of the three variables. The distribution of Age is slightly asymmetrical to the right, but it still looks roughly normal; a possible explanation is that many of the apartments are relatively young. Distance is quite asymmetrical to the right and looks bimodal, which may be explained by apartments being located either in the city center or in the suburbs. Price is also asymmetrical to the right and looks slightly bimodal as well.
To check for multicollinearity, we have to look at the relationship between the explanatory variables (Distance and Age). In the matrix they appear to be very weakly positively correlated.
library(Hmisc) #To check the multicollinearity
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(Apartments [ , c(-4, -5, -6, -7)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
Age and Distance have a very weak positive correlation (0.04, which lies between 0.0 and 0.1), and it is not statistically significant (p = 0.6966). Multicollinearity does not appear to be a problem here.
There is a weak negative correlation between Price and Age (-0.23) and a considerably stronger negative correlation between Price and Distance (-0.63).
fit2 <- lm(Price ~ Age + Distance, #Estimate a multiple regression function
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Explanation:
We have the following function: (average) price per m2 = 2460.1 - 7.9 * Age - 20.7 * Distance
b0 is 2460.1, which would be the average price per m2 of a new apartment (Age = 0) located in the city center (Distance = 0).
If the age increases by 1 year, the average price per m2 decreases by 7.9 EUR, if the other variable (distance) remains constant.
If the distance increases by 1 km, the average price per m2 decreases by 20.7 EUR, if the other variable (age) remains constant.
Both p-values are lower than 0.05 (the p-value for Age is 0.016 and the p-value for Distance is lower than 0.001). This suggests that both variables have a statistically significant effect on the price per m2.
The multiple R-squared of 0.4396 tells us that 43.96% of the variability in the price per m2 can be explained by variability in age and distance from the city center.
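The estimated function can be used to predict a price, and the adjusted R-squared can be derived from the multiple R-squared. The sketch below uses the rounded coefficients from summary(fit2); predict_price() is a hypothetical helper written for this illustration, not part of any package.

```r
# Rounded coefficients copied from the summary(fit2) output
b0         <- 2460.101
b_age      <- -7.934
b_distance <- -20.667

# Hypothetical helper: expected price per m2 for a given age and distance
predict_price <- function(age, distance) {
  b0 + b_age * age + b_distance * distance
}

# Example: an 18-year-old apartment 1 km from the city center
price_example <- predict_price(age = 18, distance = 1)

# Adjusted R-squared from R-squared, n = 85 units and k = 2 explanatory variables
r_squared     <- 0.4396
n             <- 85
k             <- 2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(price_example, 2))
print(round(adj_r_squared, 4))
```

With the fitted model object available, predict(fit2, newdata = data.frame(Age = 18, Distance = 1)) should give the same prediction up to rounding.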
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2)) #Check the multicollinearity with VIF
## [1] 1.001845
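With only two explanatory variables, the VIF of each one is 1 / (1 - r^2), where r is the correlation between the two. A minimal sketch of this link, using the VIF printed above and the rounded correlation of 0.04 reported earlier (object names are illustrative):

```r
# VIF reported by vif(fit2)
vif_value <- 1.001845

# Invert VIF = 1 / (1 - r^2) to recover the absolute correlation
r_squared <- 1 - 1 / vif_value
r_abs     <- sqrt(r_squared)

# Going the other way, starting from the rounded correlation 0.04
vif_from_r <- 1 / (1 - 0.04^2)

print(round(r_abs, 3))
print(round(vif_from_r, 4))
```

A VIF this close to 1 means that Age and Distance are almost uncorrelated, so the standard errors of their coefficients are practically not inflated.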
Apartments$StdResid <-round(rstandard(fit2), 3) #For standardized residuals
Apartments$Cooksd <- round(cooks.distance(fit2), 3) #For Cook's distance
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "yellow",
border = "black",
breaks = seq(from = -3, to = 3, by = 1))
Explanation:
shapiro.test(Apartments$StdResid) #To test the distribution of standardized residuals
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95303, p-value = 0.003645
H0: The standardized residuals are normally distributed.
H1: The standardized residuals are not normally distributed.
The p-value (0.003645) is lower than 0.05, so we reject the null and accept the alternative hypothesis: the standardized residuals are not normally distributed.
Because our sample size is larger than 30, the deviation from normality is not a major concern.
hist(Apartments$Cooksd,
xlab = "Cook's distance",
ylab = "Frequency",
col = "yellow",
border = "black",
main = "Histogram of Cook's distance")
Explanation:
All values are below 1, which means that none of the units has an excessively large influence on the estimated regression function.
However, we see one unit with a much bigger influence than the others (the one between 0.30 and 0.35).
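Besides the cutoff of 1, a stricter rule of thumb that is often quoted is 4 / n. That rule is not used in this report and is included here only as an assumption for comparison. The sketch applies both cutoffs to the three largest Cook's distances (0.320, 0.104 and 0.069, for units 38, 55 and 33) with n = 85.

```r
# Three largest Cook's distances, named by unit ID
cooksd <- c("38" = 0.320, "55" = 0.104, "33" = 0.069)
n      <- 85   # number of units used to estimate fit2

# Cutoff used in the text and a stricter rule of thumb (an assumption, see above)
cutoff_one   <- 1
cutoff_four_n <- 4 / n

# Which units exceed each cutoff?
above_one    <- names(cooksd)[cooksd > cutoff_one]
above_four_n <- names(cooksd)[cooksd > cutoff_four_n]

print(round(cutoff_four_n, 3))
print(above_one)
print(above_four_n)
```

Under the stricter rule all three units would be flagged, which shows how much the decision depends on the chosen cutoff.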
head(Apartments[order(-Apartments$Cooksd), ], 3) #Finding the outlier
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 38 5 45 2180 1 1 Yes Yes 2.577 0.320
## 55 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
Explanation:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Deleting the unit with ID 38 (the only unit with Distance = 45)
Apartments <- Apartments %>%
filter(Distance != 45)
hist(Apartments$Cooksd,
xlab = "Cook's distance",
ylab = "Frequency",
col = "yellow",
border = "black",
main = "Histogram of Cook's distance 2")
There are still two units with a larger impact on the estimated regression function than the others.
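Removing a unit by filtering on one of its values works only as long as no other unit shares that value. The sketch below illustrates the difference on a small, entirely hypothetical data frame (toy); dropping the row by position is one possible alternative, not the approach taken in this report.

```r
# Hypothetical toy data: two apartments share the same distance of 45 km
toy <- data.frame(
  Distance = c(45, 12, 45, 3),
  Cooksd   = c(0.320, 0.010, 0.020, 0.005)
)

# Filtering by value drops every row with Distance equal to 45 (two rows here)
by_value <- toy[toy$Distance != 45, ]

# Dropping by position removes only the single most influential row
by_index <- toy[-which.max(toy$Cooksd), ]

print(nrow(by_value))
print(nrow(by_index))
```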
head(Apartments[order(-Apartments$Cooksd), ], 5)
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 54 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
## 52 7 2 1760 0 1 No Yes -2.152 0.066
## 22 37 3 2540 1 1 Yes Yes 1.576 0.061
## 38 40 2 2400 0 1 No Yes 1.091 0.038
Apartments <- Apartments %>% #Deleting the unit with ID 54
filter(Age != 43)
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
Apartments$StdFitted <- scale(fit2$fitted.values)
library(ggplot2) #Heteroskedasticity?
ggplot(Apartments, aes(y=StdResid, x=StdFitted)) + geom_point() +
ylab ("Standardized residuals") +
xlab("Standardized fitted values") +
theme_minimal()
We cannot clearly judge heteroskedasticity from what we see here. In such a case, we have to do the Breusch-Pagan test.
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1) # The test between the st.residuals and fitted values(fit1 !)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.5851484
## Prob > Chi2 = 0.4443014
Explanation:
We have H0 and H1 given above.
The p-value is 0.44, which is higher than 0.05. It means that we cannot reject the null hypothesis: we have no evidence that the variance is not constant, so the test does not indicate heteroskedasticity (the graph above does not show a clear pattern either).
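The p-value in the test summary follows from the chi-square statistic with 1 degree of freedom. A minimal sketch of that calculation and of the comparison with the 5% significance level, using only the numbers printed in the output (object names are illustrative):

```r
# Values copied from the Breusch Pagan test summary
chi2  <- 0.5851484
df    <- 1
alpha <- 0.05

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

# H0 (constant variance) is rejected only if the p-value is below alpha
reject_h0 <- p_value < alpha

print(round(p_value, 4))
print(reject_h0)
```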
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "yellow",
border = "black",
breaks = seq(from = -3, to = 3, by = 1))
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "yellow",
border = "black",
breaks = seq(from = -3, to = 3, by = 0.5))
#A smaller bin width (by = 0.5) shows the shape of the distribution better!
Explanation:
The histogram is asymmetrical to the left and does not look like a normal distribution. However, all values are still within the interval from -3 to 3, so there is no need to remove any of the units.
Note that we have more than 30 units in the data set, so the non-normal distribution should not be a concern.
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.94963, p-value = 0.002636
We have the hypotheses: H0: The standardized residuals are normally distributed. H1: The standardized residuals are not normally distributed.
The p-value (0.003) is lower than 0.05, so we reject the null and again accept the alternative hypothesis. We conclude that the standardized residuals are not normally distributed.
summary(fit2) #Estimate the fit2 again. Explain the coefficients
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
Explanation:
The formula: price of apartment per m2 = 2490.112 - 7.850 * Age - 23.945 * Distance
H0: Beta 1 = 0 H1: Beta 1 =/= 0
The p-value (0.0178) is lower than 0.05, so we reject the null and accept the alternative hypothesis: Age has a statistically significant effect. If the age of the apartment increases by one year, the price per m2 decreases on average by 7.850 EUR, assuming the distance stays the same.
H0: Beta 2 = 0 H1: Beta 2 =/= 0
The p-value (< 0.001) is lower than 0.05, so we reject the null and accept the alternative hypothesis: Distance has a statistically significant effect. If the distance from the city center increases by 1 km, the price per m2 decreases on average by 23.945 EUR, assuming the age stays the same.
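Each t value in the coefficient table is the estimate divided by its standard error, and the p-value comes from the t distribution with the residual degrees of freedom. The sketch below recomputes them from the rounded numbers in the summary(fit2) output; the object names are illustrative.

```r
# Rounded values copied from the coefficient table of the re-estimated fit2
estimates <- c(Age = -7.850, Distance = -23.945)
std_errs  <- c(Age = 3.244,  Distance = 2.826)
df_resid  <- 80   # residual degrees of freedom

# t value = estimate / standard error
t_values <- estimates / std_errs

# Two-sided p-values
p_values <- 2 * pt(-abs(t_values), df = df_resid)

print(round(t_values, 3))
print(signif(p_values, 3))
```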
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = Apartments)
#Estimate the linear function (with categorical variables)
anova(fit2, fit3) #Does fit3 fit the data better than fit2? Check with the anova function
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: The additional variables do not improve the fit (fit2 is sufficient).
H1: fit3 fits the data better than fit2.
The p-value (0.02813) is lower than 0.05, so we reject the null and accept the alternative hypothesis, which states that fit3 fits the data better than fit2.
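The F-statistic in the analysis of variance table compares the drop in the residual sum of squares with the remaining residual variance. A minimal sketch of that calculation, using only the numbers printed in the table (object names are illustrative):

```r
# Values copied from the analysis of variance table
rss_small <- 5982100   # residual sum of squares of fit2
rss_large <- 5458696   # residual sum of squares of fit3
df_small  <- 80        # residual degrees of freedom of fit2
df_large  <- 78        # residual degrees of freedom of fit3

# Number of added coefficients (ParkingF and BalconyF)
df_diff <- df_small - df_large

# F = (reduction in RSS per added coefficient) / (residual variance of the larger model)
f_stat  <- ((rss_small - rss_large) / df_diff) / (rss_large / df_large)
p_value <- pf(f_stat, df1 = df_diff, df2 = df_large, lower.tail = FALSE)

print(round(f_stat, 4))
print(round(p_value, 5))
```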
summary(fit3) #The results of fit 3, explanation of coefficients.
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
Explanation:
Formula: price of apartment per m2 = 2358.900 - 7.197 * Age - 21.241 * Distance + 168.921 * Parking - 6.985 * Balcony
We expected the multiple R squared to increase because the model with more explanatory variables fits the data better, and so it did: it is now 0.5408 (54.08%).
An apartment with a parking place has, on average, a price per m2 that is 168.921 EUR higher than an apartment without one, assuming all other variables remain the same.
H0: Beta 3 = 0 H1: Beta 3 =/= 0
p-value (0.00811) < 0.05. We reject the null and accept the alternative hypothesis: parking has a statistically significant effect on the price per m2.
H0: Beta 4 = 0 H1: Beta 4 =/= 0
p-value (0.90566) > 0.05. We do not reject the null hypothesis: we cannot claim that a balcony affects the price per m2.
H0: Rho2 = 0 (it can never be less than 0!) H1: Rho2 > 0
F = 22.97, p-value < 0.001
Since the p-value is lower than 0.05, we reject the null hypothesis and accept the alternative one. It states that the population coefficient of determination is greater than 0, meaning that there is a linear relationship between price and at least one of the explanatory variables in the model.
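The overall F-statistic and the adjusted R-squared can both be derived from the multiple R-squared. The sketch below does this with the rounded values from summary(fit3), assuming n = 83 units (78 residual degrees of freedom plus 5 estimated coefficients); the object names are illustrative.

```r
# Rounded values copied from the summary(fit3) output
r_squared <- 0.5408
n         <- 83   # units left after removing two of them
k         <- 4    # explanatory variables in fit3

# Overall F-statistic: explained variability per variable over unexplained per residual df
f_stat  <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

# Adjusted R-squared penalizes the number of explanatory variables
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(f_stat, 2))
print(p_value < 0.001)
print(round(adj_r_squared, 4))
```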
Apartments$Fitted <- fitted.values(fit3) #Save fitted values & calculate the residual for ID2
Apartments$Residuals <- residuals(fit3)
Apartments$Residuals[2]
## [1] 422.9572
The residual for ID 2 is 422.9572 EUR per m2: the actual price of this apartment is that much higher than its fitted value.
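The residual can be checked by hand as the actual price minus the fitted price. The sketch below uses the rounded coefficients from summary(fit3) and the values of unit 2 shown in head(Apartments) (Age 18, Distance 1, Price 2800, parking yes, balcony no); the object names are illustrative.

```r
# Rounded coefficients copied from the summary(fit3) output
b0         <- 2358.900
b_age      <- -7.197
b_distance <- -21.241
b_parking  <- 168.921   # added when ParkingF is "Yes"
b_balcony  <- -6.985    # added when BalconyF is "Yes"

# Unit 2: dummy variables are 1 for "Yes" and 0 for "No"
age <- 18; distance <- 1; price <- 2800
parking <- 1; balcony <- 0

# Fitted value and residual (actual minus fitted)
fitted_price <- b0 + b_age * age + b_distance * distance +
  b_parking * parking + b_balcony * balcony
residual <- price - fitted_price

print(round(fitted_price, 2))
print(round(residual, 2))
```

The small difference from 422.9572 comes from using rounded coefficients.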