Homework

data()

Nika Boh

Task 1

Source:

“Carl Hoffstedt. This differs from the dataset Highway in the alr4 package only by addition of transformation of some of the columns.”

References:

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

Weisberg, S. (2014) Applied Linear Regression, Fourth Edition, Wiley, Section 7.2.

#install.packages("carData")
library(carData)

data("Highway1")              #Show the table and explain the data set
head(Highway1)

##   rate   len adt trks      sigs1 slim shld lane acpt  itg lwid htype
## 1 4.58  4.99  69    8 0.20040080   55   10    8  4.6 1.20   12   FAI
## 2 2.86 16.11  73    8 0.06207325   60   10    4  4.4 1.43   12   FAI
## 3 3.02  9.75  49   10 0.10256410   60   10    4  4.7 1.54   12   FAI
## 4 2.29 10.65  61   13 0.09389671   65   10    6  3.8 0.94   12   FAI
## 5 1.61 20.01  28   12 0.04997501   70   10    4  2.2 0.65   12   FAI
## 6 6.87  5.97  30    6 2.00750419   55   10    4 24.8 0.34   12    PA

Explanation of data set:

This is a data set from an unpublished master’s paper by Carl Hoffstedt. The data is about the automobile accident rate, measured in accidents per million vehicle miles and with 11 other variables. There are 39 units of observation (39 sections of large highways in the state of Minnesota, in 1973). - Rate: Number of automobile accidents per million vehicle miles.

Len: The length of the Highway1, in miles.
Adt: Means the average daily traffic count, in thousands.
Trks: It is a truck volume as a percent of the total volume.
Sigs1: It is a number of signals per mile of a roadway (adjusted to have non-zero values)
Slim: Shows the speed limit in the year 1973.
Shld: The width in feet, of outer shoulder on the roadway.
Lane: The total number of lanes of traffic.
Acpt: Number of access points per mile.
Itg: Number of freeway-type interchanges per mile.
Iwid: The lane width, in feet.
Htype: It tells the type of roadway or the source of funding for the road (either MC, FAI, PA, or MA). - it is a categorical variable

mydata <- data.frame(Highway1)

colnames(Highway1) <- c("Rate", "Length", "AvgTrafficCount", "TruckVolume%", "NrSignals", "SpeedLimit", "RoadWidth", "NrLanes", "AccessPoints", "NrInterchanges", "LaneWidth", "RoadType")

head(Highway1)    #Change of names of the variables

##   Rate Length AvgTrafficCount TruckVolume%  NrSignals SpeedLimit RoadWidth
## 1 4.58   4.99              69            8 0.20040080         55        10
## 2 2.86  16.11              73            8 0.06207325         60        10
## 3 3.02   9.75              49           10 0.10256410         60        10
## 4 2.29  10.65              61           13 0.09389671         65        10
## 5 1.61  20.01              28           12 0.04997501         70        10
## 6 6.87   5.97              30            6 2.00750419         55        10
##   NrLanes AccessPoints NrInterchanges LaneWidth RoadType
## 1       8          4.6           1.20        12      FAI
## 2       4          4.4           1.43        12      FAI
## 3       4          4.7           1.54        12      FAI
## 4       6          3.8           0.94        12      FAI
## 5       4          2.2           0.65        12      FAI
## 6       4         24.8           0.34        12       PA

Highway1$Speed <- seq(50, 80, length.out = 39)  #The new variable is created with the data made up.
head(Highway1)

##   Rate Length AvgTrafficCount TruckVolume%  NrSignals SpeedLimit RoadWidth
## 1 4.58   4.99              69            8 0.20040080         55        10
## 2 2.86  16.11              73            8 0.06207325         60        10
## 3 3.02   9.75              49           10 0.10256410         60        10
## 4 2.29  10.65              61           13 0.09389671         65        10
## 5 1.61  20.01              28           12 0.04997501         70        10
## 6 6.87   5.97              30            6 2.00750419         55        10
##   NrLanes AccessPoints NrInterchanges LaneWidth RoadType    Speed
## 1       8          4.6           1.20        12      FAI 50.00000
## 2       4          4.4           1.43        12      FAI 50.78947
## 3       4          4.7           1.54        12      FAI 51.57895
## 4       6          3.8           0.94        12      FAI 52.36842
## 5       4          2.2           0.65        12      FAI 53.15789
## 6       4         24.8           0.34        12       PA 53.94737

any_missing <- any(is.na(Highway1))   #There is no missing data as it shows "FALSE"

summary(Highway1[ , c(-5, -7, -9, -10, -11, -12)])  #Deleted some of the variables

##       Rate           Length       AvgTrafficCount  TruckVolume%      SpeedLimit
##  Min.   :1.610   Min.   : 2.960   Min.   : 1.00   Min.   : 6.000   Min.   :40  
##  1st Qu.:2.630   1st Qu.: 7.995   1st Qu.: 5.00   1st Qu.: 8.000   1st Qu.:50  
##  Median :3.050   Median :11.390   Median :13.00   Median : 9.000   Median :55  
##  Mean   :3.933   Mean   :12.884   Mean   :19.62   Mean   : 9.333   Mean   :55  
##  3rd Qu.:4.595   3rd Qu.:17.800   3rd Qu.:24.00   3rd Qu.:11.000   3rd Qu.:60  
##  Max.   :9.230   Max.   :40.090   Max.   :73.00   Max.   :15.000   Max.   :70  
##     NrLanes          Speed     
##  Min.   :2.000   Min.   :50.0  
##  1st Qu.:2.000   1st Qu.:57.5  
##  Median :2.000   Median :65.0  
##  Mean   :3.128   Mean   :65.0  
##  3rd Qu.:4.000   3rd Qu.:72.5  
##  Max.   :8.000   Max.   :80.0

mean(Highway1$Rate)

## [1] 3.933333

round(mean(Highway1$Rate), 2)                       #Round up the mean to 2 decimal numbers

## [1] 3.93

Presentation of descriptive statistics:

The average rate of automobile accidents (in the sample) is 3.93 accidents per million vehicle miles.
On average, the highest number of average traffic counts (measured in thousands) is 73.
75% of all automobile accidents observed in the sample, had up to 4 lanes on the highway.
The lowest (minimal) length of the highway was 2.960 miles.
The half of all automobile accidents (observed in the sample) had the speed of less than 65 miles per hour and the other half had it above the 65 miles per hour.

library(ggplot2)
ggplot(Highway1, aes(x = Speed)) +
  geom_histogram(binwidth = 5, fill = "yellow", color = "black") +
  labs(x = "Speed", y = "Rate", title = "Distribution of accident rates around the speed")

Explanation of the histogram:

It shows a normal distribution (unimodal), without any extreme values or outliers.Mode is the most frequent value and here it represents the speed of 65km per hour.

library(car)

scatterplot(Highway1$Rate ~ Highway1$Speed,
            smooth = FALSE,
            boxplots = FALSE,
            ylab =  "Rate",
            xlab = "Speed")

Description:

It is a scatterplot. A horizontal line can mean that the rate of accidents remains constant regardless of changes in the speed variable (it looks a bit positive as well). The rate may not be affected by the variations in the speed. There is also the possibility of strong bias in data collection as I randomly put the values for variable speed in the table. We can also see the outliers that can be deleted from the data set if we do other tests to prove it, such as vif.

Task 2

library(readr)
mydata1 <- as.data.frame(mydata)

data <- read.csv("~/Bootcamp_working/Task 2/Body mass.csv", sep = ";", dec=",", header = TRUE,col.names = c("ID", "Mass"))


head(data)                    #View of the first 6 rows of the data set

##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2

library(pastecs)
round(stat.desc(data[ , -1]), 2)   #Rounded up by two decimal points

##      nbr.val     nbr.null       nbr.na          min          max        range 
##        50.00         0.00         0.00        49.70        83.20        33.50 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##      3143.80        62.80        62.88         0.85         1.71        36.14 
##      std.dev     coef.var 
##         6.01         0.10

Instead of the code summary, I used stat.desc to show more functions.
I also excluded the ID column as it is not relevant (categorical data).
We can see that the mean (average) is 62.88kg and it ranges from 49.70kg to 83.20kg.
The median or middle value is 62.80kg, which means that 50% of people in the sample have up to 62.80kg, and the others weigh more.
We also see there is a confidence interval (at alpha = 5%): We are 95% confident that the true arithmetic mean lies between 61.17 (62.88-1-71) and 64.59 (62.88+1.71).
Sum of all body masses in the sample is 3143.80kg.

hist(data$Mass,
  main = "Distribution of body mass",
  xlab = "Weight",
  ylab = "Frequency",
  col = "yellow",
  border = "black",
  breaks = seq(from = 0, to = 100, by = 10))

The histogram shows the frequency of nine-graders’ weight measures.The graph is slightly asymmetrical to the right (positively skewed).
We have the following hypotheses: Based on data from the 2018/2019 school year, the average weight was 59.5 kg

H0:𝜇 = 59.5kg H1:𝜇 =/= 59.5kg (We want to see the difference)

t.test(data$Mass,
       mu = 59.5,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

A high t-value may show statistically significant results.
We look at the p-value = 0.000234, which is lower than 0.001 (p-value < 0.001) and therefore also lower than 0.05 (our significance level of 5%). As p < 0.001 => p < 0.05, we reject the null hypothesis (H0) and accept the alternative one, which tells us that on average there is a difference of average body mass of nine-graders in comparison with the year 2021/2022.
We can also see the 95% confidence interval, which does not include our mean (59.5kg), so here is another reason to reject the H0.

library(effectsize)
cohens_d(data$Mass, mu = 59.5)                       #Calculating the effect size

## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

The value is 0.56.

We still need the interpretation from the source we used at the lecture - Sawilowsky form 2009.

interpret_cohens_d(0.56, rules = "sawilowsky2009")

## [1] "medium"
## (Rules: sawilowsky2009)

The code shows us that the effect size (the increase of weight) is perceived as medium.

Task 3

library(readxl)                                 #Importing the data set
Apartments <- read_excel("~/Bootcamp_working/Task 3/Apartments.xlsx")
View(Apartments)
head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Apartments$ParkingF <- factor(Apartments$Parking,      #Change categorical variables into factors
                              levels = c(0, 1),
                              labels = c("No","Yes"))

Apartments$BalconyF <- factor(Apartments$Balcony,
                              levels = c(0, 1),
                         labels = c("No", "Yes"))

Apartments <- as.data.frame(Apartments)                #To number each unit
head(Apartments)

##   Age Distance Price Parking Balcony ParkingF BalconyF
## 1   7       28  1640       0       1       No      Yes
## 2  18        1  2800       1       0      Yes       No
## 3   7       28  1660       0       0       No       No
## 4  28       29  1850       0       1       No      Yes
## 5  18       18  1640       1       1      Yes      Yes
## 6  28       12  1770       0       1       No      Yes

Then we test the following hypothesis: H0: Mu_Price = 1900 eur then H1: Mu_Price =/= 1900 eur

t.test(Apartments$Price,              # t-test for H0 hypothesis
       mu = 1900,
       alternative= "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We get the p-value of 0.004731, which is lower than 0.05. We can reject the null hypothesis and accept the alternative one, which states that on average, the price per m2 of apartments in Ljubljana region differs from 1900 eur.
The average price is higher than 1900 eur per m2.
Another reason to reject the null hypothesis stands in the 95% confidence interval which, again, does not include the mean of 1900.

library(car)
scatterplot(Apartments$Price ~ Apartments$Age,
            smooth = FALSE,
            boxplot = FALSE,
            ylab = "Price per m2 in EUR",
            xlab = "Age in years")

- The scatterplot shows a negative linear correlation between price of apartment per m2 and age of the apartment measured in years.

fit1 <- lm(Price ~ Age,           #A simple regression function
           data = Apartments)
summary (fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanation:

The average price per m2 = 2185.455 - 8.975*Age
Number 2185.455 is like a b0 from a basic formula, which tells the price (2185.455 eur) per m2, if the apartment was brand new (0 years of age).
The coefficient -8.975 is like a coefficient b1, it tells us that if the age of an apartment increases by 1 year, on average the price of apartment per m2 decreases for 8.975 eur.
R squared is the coefficient of determination. It means that (0.05302) 53.02% of the variability in prices per m2 of apartments is explained by variability in its age.
We also see the p-value of 0.03401

H0: B1 = 0 H1: B1 =/= 0

p < 0.05, so we can reject the H0 and accept the alternative one. It mean that the coefficient of Age (B1) has a statistically significant effect on the price per m2. The hypothesis we accepted (H1) states that the slope of B1 differs from 0.

cor(Apartments$Price, Apartments$Age)        #Coefficient of correlation

## [1] -0.230255

The coefficient of correlation shows a negative and weak (as is lies between 0.1 and 0.3) linear correlation between price of apartment per m2 and its age, measured in years.

library(car)                            #A scatterplot matrix between Price, Age and Distance
scatterplotMatrix(Apartments [ , c(-4, -5, -6, -7)],
                  smooth = FALSE)

Explanation:

On the diagonal there is a distribution for all three variables. The distribution for Age is slightly asymmetrical to the right but it still looks like a normal distribution. The explanation would be that there is a lot of apartments that are of younger age. Distance is quite asymmetrical to the right but is looks like a bimodal distribution, which may be explained by apartments being located either in the city center or in the suburbs. Price is also asymmetric to the right and it looks a bit bimodal as well.
To determine the problem with multicolinearity, we have to look at the relationship between the explanatory variables (so distance and age). In the matrix they appear to be very weakly positively correlated.

library(Hmisc)                             #To check the multicolinearity

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(Apartments [ , c(-4, -5, -6, -7)]))

##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

We look at Age and Distance and they have a very weak positive correlation (0.04, as it lies between 0.0 and 0.1). It looks like multicolinearity is not a problem here.
There is a weak negative correlation between price and age, but also very weak negative correlation between price and distance. there is a very weak positive correlation between age and distance.

fit2 <- lm(Price ~ Age + Distance,            #Estimate a multiple regression function
           data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Explanation:

We have the following function: (average) Price per m2 = 2460.1 - 7.9Age - 20.7Distance
b0 is 2460.1, which would be the avg price per m2 in a new apartment that is located in the city center.
If we increase the age by 1 year, the avg price of apartment per m2 would decrease by 7.9eur, if other variable remains constant.
If we increase the distance by 1km, the avg price of apartment per m2 would decrease by 20.7eur, if other variable remains constant.
both p-values are lower than 0.05 (p-value for Age is 0.016 and p-value for Distance is lower than 0.01). This suggests that both variables have a statistically significant effect on the price of apartment per m2.
Multiple R-squared of 0.4396 tells that with 43.96% of the variability of price of apartment in m2 can be explained by variability of the age and distance from the city center.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))                                 #Check the multicolinearity with VIF

## [1] 1.001845

VIF (variance inflation factor) test has confirmed that there is no multicolinearity as the number is below 5 as it should be. If the result was above 5, there would be multicolinearity present.

Apartments$StdResid <-round(rstandard(fit2), 3)  #For standardized residuals
Apartments$Cooksd <- round(cooks.distance(fit2), 3)   #For Cook's distance

hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 1))

Explanation:

At first we may think we have an outlier but since the values are in between the interval from -3 to +3, there are no outliers to remove. But we can still test it.

shapiro.test(Apartments$StdResid)     #To test the distribution of standardized residuals

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645

We have the hypotheses:

H0: All variables are normally distributed.

H1: All variables are not normally distributed.

p-value (0.003645) is lower than 0.05, so we can reject the null and accept the alternative hypothesis. So the variables are not normally distributed.
Because our sample size is bigger than 30, we don’t have to be concerned about the data not being normally distributed.

hist(Apartments$Cooksd,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "yellow",
     border = "black",
     main = "Histogram of Cook's distance")

Explanation:

It is okay that the values are below 1, that means that none of the variables have a significantly big impact on the data set.
However, we see an outlier, which has a much bigger impact than the others (the one between 0.30 and 0.35).

head(Apartments[order(-Apartments$Cooksd), ], 3)     #Finding the outlier

##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 38   5       45  2180       1       1      Yes      Yes    2.577  0.320
## 55  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069

Explanation:

The apartment number (ID) 38 has the biggest Cook’s distance of 0.320 and which we can see above in the histogram as an outlier. We can remove it.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

                                       #Deleting the unit with ID 38
Apartments <- Apartments %>%
  filter(!Distance == "45")

hist(Apartments$Cooksd,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "yellow",
     border = "black",
     main = "Histogram of Cook's distance 2")

There are still two units with a larger impact on the data set than the others.

head(Apartments[order(-Apartments$Cooksd), ], 5)

##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 54  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069
## 52   7        2  1760       0       1       No      Yes   -2.152  0.066
## 22  37        3  2540       1       1      Yes      Yes    1.576  0.061
## 38  40        2  2400       0       1       No      Yes    1.091  0.038

Apartments <- Apartments %>%           #Deleting the unit wit ID 54
  filter(!Age == "43")

fit2 <-lm(Price ~ Age + Distance,
          data = Apartments)

Apartments$StdFitted <- scale (fit2$fitted.values)

library(ggplot2)                                   #Heteroskedasticity?
ggplot(Apartments, aes(y=StdResid, x=StdFitted)) + geom_point() +
  ylab ("Standardized residuals") + 
  xlab("Standardized fitted values") +
  theme_minimal()

We are not quite sure about the heteroskedacity from what we see here. In such case, we have to do the Breusch Pagan test.

#install.packages("olsrr")
library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit1)  # The test between the st.residuals and fitted values(fit1 !)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.5851484 
##  Prob > Chi2   =    0.4443014

Explanation:

We have H0 and H1 given above.
The p-value is 0.44, which is lower than 0.05. It means that we have to reject the null and accept the alternative, which states that the variance is not constant and therefore, the heteroskedacity is present (we can also assume it based on the graph above).

hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 1))

hist(Apartments$StdResid,     
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 0.5))

                   #If we take a smaller number (0.5) as "by" it shows the distances better!

Explanation:

The histogram is asymmetrical to the left and it does not look like a normal distribution. However, the values are still within the interval from -3 to 3, so there is no need to remove any of the values.
Note that we have more than 30 units in the data set, so the not normal distribution should not be a concern.

shapiro.test(Apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.94963, p-value = 0.002636

We have the hypotheses: H0: The variables are normally distributed. H1: The variables are not normally distributed.

The p-value (0.002) is lower than 0.05 and therefore we can reject the null and again accept the alternative one. We can conclude that the variables are not normally distributed

summary(fit2)                         #Estimate the fit2 again. Explain the coefficients

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Explanation:

The formula: Price of apartment per m2 = 2490.112 -7.850Age -23.945 Distance
H0: Beta 1 = 0 H1: Beta 1 =/= 0

p-value (0.0178) is lower than 0.05 so we reject the null and accept the alternative one, which states that If on average the apartment ages by one year, the price per m2 of the apartment will decrease by 7.850 eur, assuming the distance stays the same.

H0: Beta 2 = 0 H1: Beta 2 =/= 0

p-value (< 0.001) is lower than 0.05. We reject the null and accept the alternative one. It means that if a distance of the apartment from the city center increases by 1km, the price of apartment per m2 would decrease by 23.945 eur, assuming the age stays the same.

Multiple R-squared of 0.4968 means that 49.68% of all the variability can be explained by the effect of Age and Distance of the apartment.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = Apartments)
                          #Estimate the linear function (with categorical variables)

anova (fit2, fit3)        #Does fit3 fits data better than fit2 with anova function

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypotheses:

H0: Fit 2 is better.

H1: Fit 3 is better.

p-value (0.02813) is lower than 0.05, so we reject the null and accept the alternative one, which states that fit 3 fits the data set better than fit2.

summary(fit3)             #The results of fit 3, explanation of coefficients.

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFYes  168.921     62.166   2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

Explanation:

Formula: Price of apartment per m2 = 2358.900 - 7.197Age - 21.241Distance + 168.921Parking - 6.985 Balcony
We supposed the multiple R squared value would increase as we chose the better option to show the data, and so it did, now it is 0.5408 (54.08%).
If on average, an apartment has a parking place, the price of apartment per m2 increases by 168.921 eur, assuming all other variables remain the same.

H0: Beta 3 = 0 H1: Beta 3 =/= 0

p-value (0.00811) < 0.05. We reject the null and accept the alternative hypothesis.

If on average, an apartment has a balcony, the price of apartment per m2 decreases by 6.985 eur, assuming all other variables remain the same.

H0: Beta 4 = 0 H1: Beta 4 =/= 0

p-value (0.90566) > 0.05. We do not reject the null hypothesis.

F statistic: It is a test of significance of the regression.

H0: Ro2 = 0 (it can never be less than 0!!) H1: Ro2 > 0

F= 22.97, p-value < 0.001

Since the p-value is lower than 0.05 we reject the null hypothesis and accept the alternative one. It states that the coefficient is more than 0, meaning that there is a linear relationship between price and other variables explained in the model.

Apartments$Fitted <- fitted.values(fit3 [2])   #Save fitted values & calculate the residual for ID2
Apartments$Residuals <- residuals (fit3 [2])

head(Apartments$Residuals [2])

## [1] 422.9572

The residual for ID 2 is 422.9572 eur per m2 (the difference) based on the fitted value.

Homework

2023-09-20

Nika Boh

Task 1

Task 2

Task 3