importdata <- read.table("~/Multivirate Analyses/MVA_2022_2023/Homework/Homework 3 R/CarPrice_Assignment.csv", header=TRUE, sep=",", dec=".")
customdata <- importdata %>% dplyr::select(1, 26, 4, 10, 11, 12, 13, 14, 17, 19, 20, 21, 22)
head(customdata)
## car_ID price fueltype wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio
## 1 1 13495 gas 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0
## 2 2 16500 gas 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0
## 3 3 16500 gas 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0
## 4 4 13950 gas 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0
## 5 5 17450 gas 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0
## 6 6 15250 gas 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5
## horsepower
## 1 111
## 2 111
## 3 154
## 4 102
## 5 115
## 6 110
Description of data:
Source of data: Car Price Prediction Multiple Linear Regression https://www.kaggle.com/datasets/hellbuoy/car-price-prediction?select=CarPrice_Assignment.csv
Unit of Observation: Cars Sample Size: 205 observations
Variables: 12
Dependent variable - price: Price of car in Dollars
Independent variables - fueltype: Car Fuel Type (gas or diesel)
wheelbase: Length of the wheelbase in inches (distance between front and rear wheels)
carlength: Length of car in inches
carwidth: Width of car in inches
carheight: Height of car inches
curbweight: Weight of car in pounds
enginesize: Size of engine in cubic inches
boreratio: Car Bore-Stroke Ratio is the ratio between the dimensions of the engine cylinder bore diameter to its piston stroke-length
stroke: Car strokes (numerical value in strokes)
compressionratio: Car compression ratio (ratio between the volume of the cylinder with the piston in the bottom position, Vbottom (largest volume), and in the top position, Vtop (smallest volume)
horsepower: Car horsepower
Purpose: Looking at the data we have several variables that are likely describing different dimensions of the same thing. For example, Car Length, Car Height, Car Width, Curb Weight may all be describing the size of a car.
Therefore, it would be best to reduce the number of variables whilst keeping a relatively close approximation of each of the original variables contribution.
We will achieve this dimesion reduction through principle component analysis.
#Changed any missing data to NA and removed any NA's from the data.
mydata1 <- customdata %>%
na_if ('') %>%
drop_na()
#Changed fueltype to a factor.
mydata1$fuelTypeF <- factor(mydata1$fueltype,
levels = c("gas","diesel"),
labels = c("Gas", "Diesel"))
#Initial model
stat1 <- lm(price ~ fuelTypeF + wheelbase + curbweight + carlength + carwidth + carheight + enginesize + boreratio + stroke + compressionratio + horsepower,
data = mydata1)
Estimate of the potential regression model
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
vif(stat1)
## fuelTypeF wheelbase curbweight carlength carwidth carheight enginesize
## 48.708926 6.980363 13.876185 8.662738 5.575933 2.170805 5.372246
## boreratio stroke compressionratio horsepower
## 1.982688 1.367405 45.161054 5.578574
mean(vif(stat1))
## [1] 13.22154
We can see that we have have an issue with Multicollinearity as our average Variance inflation factor is over 13, PCA would be appropriate to help us peform regression in the future on this dataset.
mydata1_PCA <- mydata1[, c("wheelbase", "curbweight", "carlength", "carwidth", "carheight", "enginesize", "boreratio", "stroke", "compressionratio", "horsepower")]
round(stat.desc(mydata1_PCA, basic = FALSE), 2)
## wheelbase curbweight carlength carwidth carheight enginesize boreratio stroke compressionratio horsepower
## median 97.00 2414.00 173.20 65.50 54.10 120.00 3.31 3.29 9.00 95.00
## mean 98.76 2555.57 174.05 65.91 53.72 126.91 3.33 3.26 10.14 104.12
## SE.mean 0.42 36.37 0.86 0.15 0.17 2.91 0.02 0.02 0.28 2.76
## CI.mean.0.95 0.83 71.70 1.70 0.30 0.34 5.73 0.04 0.04 0.55 5.45
## var 36.26 271107.87 152.21 4.60 5.97 1734.11 0.07 0.10 15.78 1563.74
## std.dev 6.02 520.68 12.34 2.15 2.44 41.64 0.27 0.31 3.97 39.54
## coef.var 0.06 0.20 0.07 0.03 0.05 0.33 0.08 0.10 0.39 0.38
We can see that the Variances of the variables are not equal, therefore we need to standardize variables
SD <- cor(mydata1_PCA)
round(SD, 3)
## wheelbase curbweight carlength carwidth carheight enginesize boreratio stroke compressionratio
## wheelbase 1.000 0.776 0.875 0.795 0.589 0.569 0.489 0.161 0.250
## curbweight 0.776 1.000 0.878 0.867 0.296 0.851 0.648 0.169 0.151
## carlength 0.875 0.878 1.000 0.841 0.491 0.683 0.606 0.130 0.158
## carwidth 0.795 0.867 0.841 1.000 0.279 0.735 0.559 0.183 0.181
## carheight 0.589 0.296 0.491 0.279 1.000 0.067 0.171 -0.055 0.261
## enginesize 0.569 0.851 0.683 0.735 0.067 1.000 0.584 0.203 0.029
## boreratio 0.489 0.648 0.606 0.559 0.171 0.584 1.000 -0.056 0.005
## stroke 0.161 0.169 0.130 0.183 -0.055 0.203 -0.056 1.000 0.186
## compressionratio 0.250 0.151 0.158 0.181 0.261 0.029 0.005 0.186 1.000
## horsepower 0.353 0.751 0.553 0.641 -0.109 0.810 0.574 0.081 -0.204
## horsepower
## wheelbase 0.353
## curbweight 0.751
## carlength 0.553
## carwidth 0.641
## carheight -0.109
## enginesize 0.810
## boreratio 0.574
## stroke 0.081
## compressionratio -0.204
## horsepower 1.000
Variables Standardised
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
cortest.bartlett(SD, n = nrow(mydata1_PCA))
## $chisq
## [1] 1835.363
##
## $p.value
## [1] 0
##
## $df
## [1] 45
We check to see if the data is suitable for PCA using Bartlets test of sphericity
H0: Correlations between variables are equal to 0 H1: Correlations between variables are not equal to 0
We reject the null hypothesis with p-value < 0.0001, the correlations are high enough, the data is suitable for PCA.
KMO(SD)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = SD)
## Overall MSA = 0.86
## MSA for each item =
## wheelbase curbweight carlength carwidth carheight enginesize boreratio
## 0.84 0.87 0.90 0.91 0.75 0.91 0.95
## stroke compressionratio horsepower
## 0.62 0.46 0.78
We check if the absolute average correlation is high enough for PCA
Overall MSA should be above .5, in this case it is .86 which is Meritorious.
PComponents <- PCA(mydata1_PCA,
scale.unit = TRUE,
graph = FALSE)
get_eigenvalue(PComponents)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.32371235 53.2371235 53.23712
## Dim.2 1.65121648 16.5121648 69.74929
## Dim.3 1.14385156 11.4385156 81.18780
## Dim.4 0.69111794 6.9111794 88.09898
## Dim.5 0.45084788 4.5084788 92.60746
## Dim.6 0.30353739 3.0353739 95.64284
## Dim.7 0.16466212 1.6466212 97.28946
## Dim.8 0.12253599 1.2253599 98.51482
## Dim.9 0.08715785 0.8715785 99.38640
## Dim.10 0.06136045 0.6136045 100.00000
The Kaiser-Guttman criterion is that any principle component with variance less than 1 contains less information than one of the original variables and so is not worth retaining, thus we only take PCA with eigenvalue > 1.
In this case we have 3
Using 3 components we should be able to explain 81% of the variance.
fviz_eig(PComponents,
choice = "eigenvalue",
main = "Scree plot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
We can see a slight elbow break at PC 4, so we should retain the 3 PC’s before the break.
fa.parallel(mydata1_PCA,
sim = FALSE,
fa = "pc")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, : The estimated weights for the factor
## scores are probably incorrect. Try a different factor score estimation method.
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
Parallel analysis indicates that we should use 2 components, however, both the Scree Plot and the Kaiser-Guttman criterion suggest we can use 3 components and so we will use 3.
PComponents <- PCA(mydata1_PCA,
scale.unit = TRUE,
graph = FALSE,
ncp = 3)
PComponents
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 205 individuals, described by 10 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
print(PComponents$var$cor)
## Dim.1 Dim.2 Dim.3
## wheelbase 0.8492962 0.38097916 -0.07084381
## curbweight 0.9647616 -0.06123861 0.03463443
## carlength 0.9281465 0.17276036 -0.09464495
## carwidth 0.9147703 0.02382500 0.06713919
## carheight 0.3764917 0.72928178 -0.36011982
## enginesize 0.8529240 -0.31566997 0.14673350
## boreratio 0.7142067 -0.20400951 -0.25030963
## stroke 0.1823040 0.09463025 0.88446105
## compressionratio 0.1577357 0.66827311 0.35617704
## horsepower 0.7320866 -0.58587052 0.03409522
Weight of each PC is how much information was retained of the variable in first three components.
For example Car Width = .912+.0.022+0.06^2 = 0.8321x100 = 83%
With PC1, PC2 and PC3 we retained 83% of information of the variable Curbweight, and lost 17%.
print(PComponents$var$contrib)
## Dim.1 Dim.2 Dim.3
## wheelbase 13.5488920 8.79019347 0.4387672
## curbweight 17.4833808 0.22711540 0.1048689
## carlength 16.1814897 1.80752443 0.7831144
## carwidth 15.7184437 0.03437651 0.3940783
## carheight 2.6625411 32.20970270 11.3376846
## enginesize 13.6648891 6.03479498 1.8823002
## boreratio 9.5814951 2.52055862 5.4775387
## stroke 0.6242777 0.54232038 68.3892367
## compressionratio 0.4673535 27.04605664 11.0907821
## horsepower 10.0672373 20.78735688 0.1016289
Percentage information contributions of each variable to each Component
For example: the Variable Curbweight contributed 17.4% of information to dimension 1
fviz_pca_var(PComponents,
repel = TRUE)
We can see the coordinates of each variable with their weighting contribution
fviz_pca_biplot(PComponents)
Combination of coordinates of each unit in the sample and their weight
For example we can see that unit 130 is likely to be a powerful car.
head(PComponents$ind$coord)
## Dim.1 Dim.2 Dim.3
## 1 -1.3861096 -2.1859208 -0.9463190
## 2 -1.3861096 -2.1859208 -0.9463190
## 3 -0.3883140 -1.0203233 1.4124103
## 4 -0.2384437 0.4636708 0.3160968
## 5 0.4764623 -0.1714430 0.2883191
## 6 0.1382168 -0.2718164 0.4599246
Coordinates of each unit in the sample.
#instering components into dataframe
mydata1$PC1 <- PComponents$ind$coord[ , 1]
mydata1$PC2 <- PComponents$ind$coord[ , 2]
mydata1$PC3 <- PComponents$ind$coord[ , 3]
head(mydata1, 3)
## car_ID price fueltype wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio
## 1 1 13495 gas 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9
## 2 2 16500 gas 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9
## 3 3 16500 gas 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9
## horsepower fuelTypeF PC1 PC2 PC3
## 1 111 Gas -1.386110 -2.185921 -0.946319
## 2 111 Gas -1.386110 -2.185921 -0.946319
## 3 154 Gas -0.388314 -1.020323 1.412410
#checking the correlation between components, must be 0
cor(x=mydata1$PC1, y=mydata1$PC2)
## [1] 6.837024e-16
The correlation is 0
the Components are orthogonal.
Conclusion
Principal components method was performed on 10 standardized variables (n= 205). The KMO measure confirms the appropriateness of the variables, KMO = 0.86
The MSA statistics for the individual variables are above 0.50 for three variables, and for the Compression Ratio intelligence variable it is only 0.46, but we left it in the analysis as it is very close to .5. Based on the Scree Plot and Kaiser-Guttman criterion, it makes the most sense to retain the first three principal components, which together summarize 81% of the information in the original 10 variables.
Based on the weighting of the principal components, we conclude that PC1 (𝜆1 = 5.32371235) represents the overall size of a car, while PC2 (𝜆2 = 1.65121648) represents power output and PC3 (𝜆2 = 1.14385156) represents the qualities of a passenger vehicle.