library(knitr)
library(psych)
library(dplyr)
library(tidyr)
library(ggplot2)
library(Rmisc)
library(pastecs)
library(Hmisc)
library(car)
library(reshape2)
library(olsrr)
library(lm.beta)
library(naniar)
library(estimatr)

Homework 3 - TINE DOBNIK

##1: Import and cleaning

set.seed(1) 
mydata1 <- read.table("~/Downloads/archive/audi.csv", 
                     header=TRUE, 
                     sep=",", 
                     dec=",") %>% sample_n(400) 
#I set the seed so that the 400-observation sample is reproducible; I sample because the full data set has over 10k rows

mydata1 <- mydata1[ , c(-1, -2, -4,-7) ] 
#I threw out the variables that will not be needed

colnames(mydata1) <- c("Price", "Mileage", "FuelType", "MilesPerGallon","EngineSize") 
#I renamed the variables for easier examination

mydata2 <- transform(mydata1, MilesPerGallon=as.numeric(MilesPerGallon),EngineSize=as.numeric(EngineSize)) 
#Because dec="," does not match the file's "." decimal separator, MilesPerGallon and EngineSize were read in as character; I convert them to numeric here so the analysis can be carried out properly

head(mydata2)
##   Price Mileage FuelType MilesPerGallon EngineSize
## 1 11495   33397   Petrol           67.3        1.0
## 2 17336   62118   Diesel           56.5        2.0
## 3 19490   23800   Petrol           49.6        2.0
## 4 20000    8500   Petrol           47.9        2.0
## 5 14000   56441   Petrol           36.7        2.0
## 6 13250   40000   Petrol           58.9        1.4

Unit of observation: one used Audi car
Sample size: 400

##2: Source

The data set contains prices and characteristics of used Audi cars offered for sale in the UK.

##3: Description of used variables

Price: price of the car in £
Mileage: distance already travelled by the car, in miles
FuelType: type of fuel the engine uses
MilesPerGallon: how far the car can travel on every gallon of fuel used
EngineSize: size of the engine in litres
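To double-check that the conversion above produced the intended column types before describing the data, a quick structural check can be run (a minimal sketch; output omitted here):

str(mydata2)            #shows the class of each column; MilesPerGallon and EngineSize should now be numeric
table(mydata2$FuelType) #frequency of each fuel type, the only categorical variable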

round(stat.desc(mydata2[c(-3)])) #I used stat.desc to get an overview of the minimums, maximums and other descriptive statistics
##                  Price   Mileage MilesPerGallon EngineSize
## nbr.val            400       400            400        400
## nbr.null             0         0              0          1
## nbr.na               0         0              0          0
## min               3495        10             22          0
## max             102544    138649            118          4
## range            99049    138639             96          4
## sum            9333928  10009081          19967        781
## median           20565     19748             50          2
## mean             23335     25023             50          2
## SE.mean            609      1209              1          0
## CI.mean.0.95      1198      2377              1          0
## var          148439589 584693014            163          0
## std.dev          12184     24180             13          1
## coef.var             1         1              0          0

##4: Main goal of data analysis -> research question

In what way do Mileage, Miles per Gallon, Engine Size and Fuel Type affect the price of used Audi cars in the UK?

##5: Defining the multiple regression model

I believe that all the variables will affect Price in some way.

####Mileage
Mileage will lower the price of a used Audi car. This may well be because higher-mileage cars are more likely to break down or have more frequent problems with their systems.

####Miles per Gallon
This variable will raise the price, because a more economical car is more attractive to customers, who will spend less money on fuel.

####Engine Size
The bigger the engine size, the higher the price, because of the complications involved in producing bigger engines. Bigger cars are usually also more expensive, and bigger cars need bigger engines.

####Fuel Type
Cars that run on diesel tend to be more expensive than petrol and hybrid cars, due to the higher cost of diesel fuel and the more expensive engines and fuel systems required to run on diesel.

##6: Graphical representation of relationships between the analyzed variables with scatterplots

scatterplotMatrix(mydata2[ , c(-3)], 
                  smooth = FALSE) 

From the first row we can observe that Mileage and Miles per Gallon are negatively correlated with Price, while Engine Size is positively correlated with Price.

To go beyond the visual assessment of the correlations, we also use the rcorr() function, which reports the correlation coefficients together with their p-values.

rcorr(as.matrix(mydata2[ , c(-3)])) 
##                Price Mileage MilesPerGallon EngineSize
## Price           1.00   -0.55          -0.57       0.58
## Mileage        -0.55    1.00           0.30       0.07
## MilesPerGallon -0.57    0.30           1.00      -0.43
## EngineSize      0.58    0.07          -0.43       1.00
## 
## n= 400 
## 
## 
## P
##                Price  Mileage MilesPerGallon EngineSize
## Price                 0.0000  0.0000         0.0000    
## Mileage        0.0000         0.0000         0.1373    
## MilesPerGallon 0.0000 0.0000                 0.0000    
## EngineSize     0.0000 0.1373  0.0000

##7: Estimation of regression model + diagnostics

fit1 <- lm(Price ~ Mileage + MilesPerGallon + EngineSize + FuelType, 
           data = mydata2)

We used the linear model function to model the relationship between Price and the explanatory variables.

vif(fit1) 
##                    GVIF Df GVIF^(1/(2*Df))
## Mileage        1.260729  1        1.122822
## MilesPerGallon 2.557104  1        1.599095
## EngineSize     1.972578  1        1.404485
## FuelType       2.043018  2        1.195551
mean(vif(fit1))
## [1] 1.512949

Here we wanted to see whether there is multicollinearity in our multiple regression model. Since none of the GVIF values are above 5, we do not have a problem with multicollinearity. The average VIF value is also approximately 1.5, which points to very little multicollinearity.

But because FuelType has more than two categories (so its Df is greater than 1), we compare the last column, GVIF^(1/(2*Df)), with the square root of 5.

sqrt(5)
## [1] 2.236068

We can see that all the values in the last column are smaller than the square root of 5, so we can confirm that there is no multicollinearity problem.
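The same comparison can be written explicitly in code (a small sketch assuming fit1 from above; for models containing factors, the third column returned by vif() is GVIF^(1/(2*Df))):

gvif_adj <- vif(fit1)[, "GVIF^(1/(2*Df))"] #adjusted GVIF values for every predictor
all(gvif_adj < sqrt(5))                    #TRUE means none of them exceeds the sqrt(5) threshold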

mydata2$StdResid <- round(rstandard(fit1), 4) 
mydata2$CooksD <- round(cooks.distance(fit1), 4) 

The first function computes the standardized residuals, which we use to locate outliers. The second computes the Cook's distances, which we use to locate high-impact observations.

Let's check normality.

hist(mydata2$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata2$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$StdResid
## W = 0.87839, p-value < 2.2e-16
hist(mydata2$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

We can see that we have some outliers (visible in the histogram of standardized residuals) and some high-impact units (visible in the histogram of Cook's distances).

On top of that, even though I have more than 100 observations, I used the Shapiro-Wilk normality test. I will assess two hypotheses:

H0: The errors are normally distributed
H1: The errors are not normally distributed

Given the results of the normality test, we reject the null hypothesis and conclude that the errors of the car-price model are not normally distributed (p < 0.001). This also means that the regression model violates the normality assumption.

This is why we use the head() function on the sorted data to find the outliers and high-impact units within mydata2.

head(mydata2[order(mydata2$StdResid),], 6)
##     Price Mileage FuelType MilesPerGallon EngineSize StdResid CooksD
## 347 23888   68000   Petrol           26.4        4.2  -2.5170 0.0667
## 330 25950   66000   Petrol           26.9        4.2  -2.2214 0.0514
## 380 12950   72000   Petrol           34.4        3.0  -1.9415 0.0185
## 334  8995   95000   Diesel           37.6        3.0  -1.9359 0.0249
## 333 22490   31987   Petrol           35.8        3.0  -1.9059 0.0107
## 70  18998    3601   Petrol           41.5        2.0  -1.8702 0.0040
head(mydata2[order(-mydata2$CooksD),], 6) 
##      Price Mileage FuelType MilesPerGallon EngineSize StdResid CooksD
## 166 102544    2000   Petrol           22.1        4.0   8.0849 0.5276
## 93   32788    1500   Diesel           47.1        0.0   3.1916 0.1359
## 104   9500  117740   Petrol           57.6        1.4   3.0306 0.0820
## 365  77888    3011   Diesel           32.8        3.0   5.5008 0.0786
## 347  23888   68000   Petrol           26.4        4.2  -2.5170 0.0667
## 330  25950   66000   Petrol           26.9        4.2  -2.2214 0.0514
mydata2 <- mydata2[c(-93,-166,-365,-248,-104), ] 

#eliminating units with high impact (93, 166) and outliers (365, 248, 104)
#rule of thumb for standardized residuals: values beyond +/-3 indicate outliers
#rule of thumb for Cook's distance: values clearly larger than the rest indicate high-impact units
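As a side note, the same screening can be expressed as an explicit filter instead of reading the row numbers off head() (a sketch using the common rules of thumb of |standardized residual| > 3 and Cook's distance > 4/n; it is not run here, so the rows removed above stay exactly as listed):

n <- nrow(mydata2)                                            #number of remaining observations
flagged <- abs(mydata2$StdResid) > 3 | mydata2$CooksD > 4 / n #TRUE for candidate outliers / high-impact units
which(flagged)                                                #row positions of the flagged observations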

After we have thrown out the units with high impact and outliers, we estimate the model again.

##8: Estimating the regression again

fit1 <- lm(Price ~ Mileage + MilesPerGallon + EngineSize + FuelType, 
           data = mydata2)
mydata2$StdResid <- round(rstandard(fit1), 4) 

mydata2$CooksD <- round(cooks.distance(fit1), 4)

head(mydata2[order(mydata2$StdResid),], 6)
##     Price Mileage FuelType MilesPerGallon EngineSize StdResid CooksD
## 347 23888   68000   Petrol           26.4        4.2  -2.6143 0.0787
## 330 25950   66000   Petrol           26.9        4.2  -2.2570 0.0581
## 205 15880   19000   Diesel           47.9        2.0  -2.1497 0.0056
## 70  18998    3601   Petrol           41.5        2.0  -2.1439 0.0053
## 334  8995   95000   Diesel           37.6        3.0  -2.1096 0.0305
## 380 12950   72000   Petrol           34.4        3.0  -2.1024 0.0230

The graphical representation follows.

hist(mydata2$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

hist(mydata2$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

I should also remove the two bins beyond 3, but when I did so it distorted my scale of fitted values, so I left them in. I had initially thought that once the units are removed, they are not corrected again. I am sorry for not being able to sort this out.
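For completeness, this is roughly how the remaining extreme residuals could have been screened out in a second pass (a sketch using the same +/-3 cut-off; it is not applied here, which is why the results below still include those units):

#hypothetical second screening pass - not applied to the output shown below
remaining <- which(abs(mydata2$StdResid) > 3)
if (length(remaining) > 0) {
  mydata2 <- mydata2[-remaining, ]
  fit1 <- lm(Price ~ Mileage + MilesPerGallon + EngineSize + FuelType,
             data = mydata2)
}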

shapiro.test(mydata2$StdResid) 
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$StdResid
## W = 0.97244, p-value = 8.254e-07

Given the results of the normality test, we can once again reject the null hypothesis and conclude that the errors are not normally distributed (p < 0.001). This means that the regression model still violates the normality assumption.

##9: Checking for homoskedasticity

After we have checked the distribution, we also need to check for homoskedasticity.

mydata2$StdFittedValues <- scale(fit1$fitted.values) 

We used scale() here because we need standardized fitted values. We then check homoskedasticity visually: the y-axis shows the standardized residuals and the x-axis the standardized fitted values.

scatterplot(y = mydata2$StdResid, x = mydata2$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE)

From the scatterplot we can say that the linearity assumption appears to be satisfied.

To check for heteroskedasticity we will use the Breusch-Pagan test.

ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    60.4982 
##  Prob > Chi2   =    7.364572e-15

We check this with two hypotheses:
H0: homoskedasticity (the variance is constant)
H1: heteroskedasticity (the variance is not constant)

Based on the Breusch-Pagan test we reject the null hypothesis, which means that the variance is not constant (p < 0.001).
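As an optional cross-check (not part of the original output), the same null hypothesis of constant error variance can be tested with the ncvTest() function from the car package:

ncvTest(fit1) #non-constant variance score test; a small p-value again points to heteroskedasticity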

We therefore use the lm_robust() function, which gives us heteroskedasticity-robust standard errors.

##10: Using robust standard errors for heteroskedasticity and interpretation of fit2

fit2 <- lm_robust(Price ~ Mileage + MilesPerGallon + EngineSize + FuelType,
                  se_type = "HC1",
                  data = mydata2)

summary(fit2)
## 
## Call:
## lm_robust(formula = Price ~ Mileage + MilesPerGallon + EngineSize + 
##     FuelType, data = mydata2, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                  Estimate Std. Error t value  Pr(>|t|)   CI Lower   CI Upper
## (Intercept)    29574.1619  2.937e+03  10.069 2.404e-21 23799.4127 35348.9112
## Mileage           -0.2445  1.407e-02 -17.376 1.827e-50    -0.2721    -0.2168
## MilesPerGallon  -306.2605  3.004e+01 -10.194 8.704e-22  -365.3255  -247.1956
## EngineSize      8168.2962  8.771e+02   9.313 9.326e-19  6443.8736  9892.7188
## FuelTypeHybrid 36114.5927  2.448e+03  14.752 1.980e-39 31301.2741 40927.9113
## FuelTypePetrol -2633.7649  7.112e+02  -3.703 2.434e-04 -4031.9757 -1235.5540
##                 DF
## (Intercept)    389
## Mileage        389
## MilesPerGallon 389
## EngineSize     389
## FuelTypeHybrid 389
## FuelTypePetrol 389
## 
## Multiple R-squared:  0.7959 ,    Adjusted R-squared:  0.7933 
## F-statistic: 241.6 on 5 and 389 DF,  p-value: < 2.2e-16

###Interpretation of variables:

####Mileage
If the mileage of a used Audi car increases by 1 mile, the price of the car on average decreases by 0.245 £, assuming all other variables remain unchanged.

####MilesPerGallon
If fuel economy increases by 1 mile per gallon, the price of the car on average decreases by 306.26 £, assuming all other variables remain unchanged.

####EngineSize
If engine size increases by 1 litre, the price of the car on average increases by 8168.30 £, assuming all other variables remain unchanged.

####FuelTypeHybrid
If the fuel type changes from Diesel to Hybrid, the price on average increases by 36114.59 £, assuming all other variables remain unchanged.

####FuelTypePetrol
If the fuel type changes from Diesel to Petrol, the price on average decreases by 2633.76 £, assuming all other variables remain unchanged.
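To make these interpretations concrete, the robust point estimates can be pulled out of fit2 and used for a small what-if calculation (a sketch; the 10,000-mile and 0.5-litre figures are purely illustrative):

b <- coef(fit2)       #robust point estimates from lm_robust
b["Mileage"] * 10000  #expected price change for 10,000 additional miles, ceteris paribus (about -2445 £)
b["EngineSize"] * 0.5 #expected price change for a 0.5 litre larger engine (about +4084 £)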

###Multiple R-squared
In my case it is 0.796, which means that approximately 79.6 percent of the variability of Price is explained by the variability of the four explanatory variables.

###F-statistic
H0: the population R-squared is 0
H1: the population R-squared is greater than 0

Looking at the p-value, we can reject the null hypothesis (p < 0.001). This means we can safely say that there is a relationship between the dependent variable and at least one independent variable.

##11: Interpretation and computation of the multiple correlation coefficient

sqrt(summary(fit2)$r.squared) 
## [1] 0.8921601

###Multiple correlation coefficient
We can see that the multiple correlation coefficient equals 0.89, which indicates that the linear relationship between the price of Audi cars and the four explanatory variables is strong.