data()

Nika Boh

Task 1

Source:

“Carl Hoffstedt. This differs from the dataset Highway in the alr4 package only by addition of transformation of some of the columns.”

References:

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

Weisberg, S. (2014) Applied Linear Regression, Fourth Edition, Wiley, Section 7.2.

#install.packages("carData")
library(carData)
data("Highway1")              #Show the table and explain the data set
head(Highway1)
##   rate   len adt trks      sigs1 slim shld lane acpt  itg lwid htype
## 1 4.58  4.99  69    8 0.20040080   55   10    8  4.6 1.20   12   FAI
## 2 2.86 16.11  73    8 0.06207325   60   10    4  4.4 1.43   12   FAI
## 3 3.02  9.75  49   10 0.10256410   60   10    4  4.7 1.54   12   FAI
## 4 2.29 10.65  61   13 0.09389671   65   10    6  3.8 0.94   12   FAI
## 5 1.61 20.01  28   12 0.04997501   70   10    4  2.2 0.65   12   FAI
## 6 6.87  5.97  30    6 2.00750419   55   10    4 24.8 0.34   12    PA

Explanation of data set:

This is a data set from an unpublished master’s paper by Carl Hoffstedt. The data is about the automobile accident rate, measured in accidents per million vehicle miles and with 11 other variables. There are 39 units of observation (39 sections of large highways in the state of Minnesota, in 1973). - Rate: Number of automobile accidents per million vehicle miles.

  • Len: The length of the Highway1, in miles.

  • Adt: Means the average daily traffic count, in thousands.

  • Trks: It is a truck volume as a percent of the total volume.

  • Sigs1: It is a number of signals per mile of a roadway (adjusted to have non-zero values)

  • Slim: Shows the speed limit in the year 1973.

  • Shld: The width in feet, of outer shoulder on the roadway.

  • Lane: The total number of lanes of traffic.

  • Acpt: Number of access points per mile.

  • Itg: Number of freeway-type interchanges per mile.

  • Iwid: The lane width, in feet.

  • Htype: It tells the type of roadway or the source of funding for the road (either MC, FAI, PA, or MA). - it is a categorical variable

mydata <- data.frame(Highway1)  
colnames(Highway1) <- c("Rate", "Length", "AvgTrafficCount", "TruckVolume%", "NrSignals", "SpeedLimit", "RoadWidth", "NrLanes", "AccessPoints", "NrInterchanges", "LaneWidth", "RoadType")

head(Highway1)    #Change of names of the variables
##   Rate Length AvgTrafficCount TruckVolume%  NrSignals SpeedLimit RoadWidth
## 1 4.58   4.99              69            8 0.20040080         55        10
## 2 2.86  16.11              73            8 0.06207325         60        10
## 3 3.02   9.75              49           10 0.10256410         60        10
## 4 2.29  10.65              61           13 0.09389671         65        10
## 5 1.61  20.01              28           12 0.04997501         70        10
## 6 6.87   5.97              30            6 2.00750419         55        10
##   NrLanes AccessPoints NrInterchanges LaneWidth RoadType
## 1       8          4.6           1.20        12      FAI
## 2       4          4.4           1.43        12      FAI
## 3       4          4.7           1.54        12      FAI
## 4       6          3.8           0.94        12      FAI
## 5       4          2.2           0.65        12      FAI
## 6       4         24.8           0.34        12       PA
Highway1$Speed <- seq(50, 80, length.out = 39)  #The new variable is created with the data made up.
head(Highway1)
##   Rate Length AvgTrafficCount TruckVolume%  NrSignals SpeedLimit RoadWidth
## 1 4.58   4.99              69            8 0.20040080         55        10
## 2 2.86  16.11              73            8 0.06207325         60        10
## 3 3.02   9.75              49           10 0.10256410         60        10
## 4 2.29  10.65              61           13 0.09389671         65        10
## 5 1.61  20.01              28           12 0.04997501         70        10
## 6 6.87   5.97              30            6 2.00750419         55        10
##   NrLanes AccessPoints NrInterchanges LaneWidth RoadType    Speed
## 1       8          4.6           1.20        12      FAI 50.00000
## 2       4          4.4           1.43        12      FAI 50.78947
## 3       4          4.7           1.54        12      FAI 51.57895
## 4       6          3.8           0.94        12      FAI 52.36842
## 5       4          2.2           0.65        12      FAI 53.15789
## 6       4         24.8           0.34        12       PA 53.94737
any_missing <- any(is.na(Highway1))   #There is no missing data as it shows "FALSE"
summary(Highway1[ , c(-5, -7, -9, -10, -11, -12)])  #Deleted some of the variables
##       Rate           Length       AvgTrafficCount  TruckVolume%      SpeedLimit
##  Min.   :1.610   Min.   : 2.960   Min.   : 1.00   Min.   : 6.000   Min.   :40  
##  1st Qu.:2.630   1st Qu.: 7.995   1st Qu.: 5.00   1st Qu.: 8.000   1st Qu.:50  
##  Median :3.050   Median :11.390   Median :13.00   Median : 9.000   Median :55  
##  Mean   :3.933   Mean   :12.884   Mean   :19.62   Mean   : 9.333   Mean   :55  
##  3rd Qu.:4.595   3rd Qu.:17.800   3rd Qu.:24.00   3rd Qu.:11.000   3rd Qu.:60  
##  Max.   :9.230   Max.   :40.090   Max.   :73.00   Max.   :15.000   Max.   :70  
##     NrLanes          Speed     
##  Min.   :2.000   Min.   :50.0  
##  1st Qu.:2.000   1st Qu.:57.5  
##  Median :2.000   Median :65.0  
##  Mean   :3.128   Mean   :65.0  
##  3rd Qu.:4.000   3rd Qu.:72.5  
##  Max.   :8.000   Max.   :80.0
mean(Highway1$Rate)
## [1] 3.933333
round(mean(Highway1$Rate), 2)                       #Round up the mean to 2 decimal numbers
## [1] 3.93

Presentation of descriptive statistics:

  • The average rate of automobile accidents (in the sample) is 3.93 accidents per million vehicle miles.

  • On average, the highest number of average traffic counts (measured in thousands) is 73.

  • 75% of all automobile accidents observed in the sample, had up to 4 lanes on the highway.

  • The lowest (minimal) length of the highway was 2.960 miles.

  • The half of all automobile accidents (observed in the sample) had the speed of less than 65 miles per hour and the other half had it above the 65 miles per hour.

library(ggplot2)
ggplot(Highway1, aes(x = Speed)) +
  geom_histogram(binwidth = 5, fill = "yellow", color = "black") +
  labs(x = "Speed", y = "Rate", title = "Distribution of accident rates around the speed")

Explanation of the histogram:

  • It shows a normal distribution (unimodal), without any extreme values or outliers.Mode is the most frequent value and here it represents the speed of 65km per hour.
library(car)

scatterplot(Highway1$Rate ~ Highway1$Speed,
            smooth = FALSE,
            boxplots = FALSE,
            ylab =  "Rate",
            xlab = "Speed")

Description:

  • It is a scatterplot. A horizontal line can mean that the rate of accidents remains constant regardless of changes in the speed variable (it looks a bit positive as well). The rate may not be affected by the variations in the speed. There is also the possibility of strong bias in data collection as I randomly put the values for variable speed in the table. We can also see the outliers that can be deleted from the data set if we do other tests to prove it, such as vif.

Task 2

library(readr)
mydata1 <- as.data.frame(mydata)

data <- read.csv("~/Bootcamp_working/Task 2/Body mass.csv", sep = ";", dec=",", header = TRUE,col.names = c("ID", "Mass"))


head(data)                    #View of the first 6 rows of the data set
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
library(pastecs)
round(stat.desc(data[ , -1]), 2)   #Rounded up by two decimal points 
##      nbr.val     nbr.null       nbr.na          min          max        range 
##        50.00         0.00         0.00        49.70        83.20        33.50 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##      3143.80        62.80        62.88         0.85         1.71        36.14 
##      std.dev     coef.var 
##         6.01         0.10
hist(data$Mass,
  main = "Distribution of body mass",
  xlab = "Weight",
  ylab = "Frequency",
  col = "yellow",
  border = "black",
  breaks = seq(from = 0, to = 100, by = 10))

H0:𝜇 = 59.5kg H1:𝜇 =/= 59.5kg (We want to see the difference)

t.test(data$Mass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876
library(effectsize)
cohens_d(data$Mass, mu = 59.5)                       #Calculating the effect size
## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

The value is 0.56.

We still need the interpretation from the source we used at the lecture - Sawilowsky form 2009.

interpret_cohens_d(0.56, rules = "sawilowsky2009")  
## [1] "medium"
## (Rules: sawilowsky2009)

The code shows us that the effect size (the increase of weight) is perceived as medium.

Task 3

library(readxl)                                 #Importing the data set
Apartments <- read_excel("~/Bootcamp_working/Task 3/Apartments.xlsx")
View(Apartments)
head(Apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Apartments$ParkingF <- factor(Apartments$Parking,      #Change categorical variables into factors
                              levels = c(0, 1),
                              labels = c("No","Yes"))

Apartments$BalconyF <- factor(Apartments$Balcony,
                              levels = c(0, 1),
                         labels = c("No", "Yes"))

Apartments <- as.data.frame(Apartments)                #To number each unit
head(Apartments)
##   Age Distance Price Parking Balcony ParkingF BalconyF
## 1   7       28  1640       0       1       No      Yes
## 2  18        1  2800       1       0      Yes       No
## 3   7       28  1660       0       0       No       No
## 4  28       29  1850       0       1       No      Yes
## 5  18       18  1640       1       1      Yes      Yes
## 6  28       12  1770       0       1       No      Yes
t.test(Apartments$Price,              # t-test for H0 hypothesis
       mu = 1900,
       alternative= "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
library(car)
scatterplot(Apartments$Price ~ Apartments$Age,
            smooth = FALSE,
            boxplot = FALSE,
            ylab = "Price per m2 in EUR",
            xlab = "Age in years")

- The scatterplot shows a negative linear correlation between price of apartment per m2 and age of the apartment measured in years.

fit1 <- lm(Price ~ Age,           #A simple regression function
           data = Apartments)
summary (fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanation:

H0: B1 = 0 H1: B1 =/= 0

p < 0.05, so we can reject the H0 and accept the alternative one. It mean that the coefficient of Age (B1) has a statistically significant effect on the price per m2. The hypothesis we accepted (H1) states that the slope of B1 differs from 0.

cor(Apartments$Price, Apartments$Age)        #Coefficient of correlation
## [1] -0.230255
library(car)                            #A scatterplot matrix between Price, Age and Distance
scatterplotMatrix(Apartments [ , c(-4, -5, -6, -7)],
                  smooth = FALSE)

Explanation:

library(Hmisc)                             #To check the multicolinearity
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(Apartments [ , c(-4, -5, -6, -7)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000
fit2 <- lm(Price ~ Age + Distance,            #Estimate a multiple regression function
           data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Explanation:

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))                                 #Check the multicolinearity with VIF
## [1] 1.001845
Apartments$StdResid <-round(rstandard(fit2), 3)  #For standardized residuals
Apartments$Cooksd <- round(cooks.distance(fit2), 3)   #For Cook's distance

hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 1))

Explanation:

shapiro.test(Apartments$StdResid)     #To test the distribution of standardized residuals
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645

H0: All variables are normally distributed.

H1: All variables are not normally distributed.

hist(Apartments$Cooksd,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "yellow",
     border = "black",
     main = "Histogram of Cook's distance")

Explanation:

head(Apartments[order(-Apartments$Cooksd), ], 3)     #Finding the outlier
##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 38   5       45  2180       1       1      Yes      Yes    2.577  0.320
## 55  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069

Explanation:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
                                       #Deleting the unit with ID 38
Apartments <- Apartments %>%
  filter(!Distance == "45")  
hist(Apartments$Cooksd,
     xlab = "Cook's distance",
     ylab = "Frequency",
     col = "yellow",
     border = "black",
     main = "Histogram of Cook's distance 2")

There are still two units with a larger impact on the data set than the others.

head(Apartments[order(-Apartments$Cooksd), ], 5) 
##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid Cooksd
## 54  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069
## 52   7        2  1760       0       1       No      Yes   -2.152  0.066
## 22  37        3  2540       1       1      Yes      Yes    1.576  0.061
## 38  40        2  2400       0       1       No      Yes    1.091  0.038
Apartments <- Apartments %>%           #Deleting the unit wit ID 54
  filter(!Age == "43")   
fit2 <-lm(Price ~ Age + Distance,
          data = Apartments)

Apartments$StdFitted <- scale (fit2$fitted.values)
library(ggplot2)                                   #Heteroskedasticity?
ggplot(Apartments, aes(y=StdResid, x=StdFitted)) + geom_point() +
  ylab ("Standardized residuals") + 
  xlab("Standardized fitted values") +
  theme_minimal()

We are not quite sure about the heteroskedacity from what we see here. In such case, we have to do the Breusch Pagan test.

#install.packages("olsrr")
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)  # The test between the st.residuals and fitted values(fit1 !)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.5851484 
##  Prob > Chi2   =    0.4443014

Explanation:

hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 1))

hist(Apartments$StdResid,     
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "yellow",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 0.5))

                   #If we take a smaller number (0.5) as "by" it shows the distances better!

Explanation:

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.94963, p-value = 0.002636

We have the hypotheses: H0: The variables are normally distributed. H1: The variables are not normally distributed.

The p-value (0.002) is lower than 0.05 and therefore we can reject the null and again accept the alternative one. We can conclude that the variables are not normally distributed

summary(fit2)                         #Estimate the fit2 again. Explain the coefficients
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Explanation:

p-value (0.0178) is lower than 0.05 so we reject the null and accept the alternative one, which states that If on average the apartment ages by one year, the price per m2 of the apartment will decrease by 7.850 eur, assuming the distance stays the same.

p-value (< 0.001) is lower than 0.05. We reject the null and accept the alternative one. It means that if a distance of the apartment from the city center increases by 1km, the price of apartment per m2 would decrease by 23.945 eur, assuming the age stays the same.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = Apartments)
                          #Estimate the linear function (with categorical variables)
anova (fit2, fit3)        #Does fit3 fits data better than fit2 with anova function
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Fit 2 is better.

H1: Fit 3 is better.

p-value (0.02813) is lower than 0.05, so we reject the null and accept the alternative one, which states that fit 3 fits the data set better than fit2.

summary(fit3)             #The results of fit 3, explanation of coefficients.
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFYes  168.921     62.166   2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

Explanation:

H0: Beta 3 = 0 H1: Beta 3 =/= 0

p-value (0.00811) < 0.05. We reject the null and accept the alternative hypothesis.

H0: Beta 4 = 0 H1: Beta 4 =/= 0

p-value (0.90566) > 0.05. We do not reject the null hypothesis.

H0: Ro2 = 0 (it can never be less than 0!!) H1: Ro2 > 0

F= 22.97, p-value < 0.001

Since the p-value is lower than 0.05 we reject the null hypothesis and accept the alternative one. It states that the coefficient is more than 0, meaning that there is a linear relationship between price and other variables explained in the model.

Apartments$Fitted <- fitted.values(fit3 [2])   #Save fitted values & calculate the residual for ID2
Apartments$Residuals <- residuals (fit3 [2])

head(Apartments$Residuals [2])
## [1] 422.9572

The residual for ID 2 is 422.9572 eur per m2 (the difference) based on the fitted value.