The project centered around analyzing a real-world dataset as part of a competition. Its aim was to develop skills in addressing specific questions using the provided dataset and its metadata.
# Importing 'validate' package for data quality analysis
library(validate)
# Importing 'ggplot2' package for visualizations
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:validate':
##
## expr
# Importing 'dplyr' package for data manipulation
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:validate':
##
## expr
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# To extract my personalized dataset from the larger provided dataset, I used my unique identifier, which corresponded to my competition registration number. For privacy reasons, the real identifier is replaced here with an arbitrary number.
SID <- 5
# SID mod 50 + 1
SIDoffset <- (SID %% 50) + 1
# Loading the data set
load("car-analysis-data.Rda")
# Now subset the car data set
# Pick every 50th observation starting from the offset
# Put into data frame named mydf
mydf <- cars.analysis[seq(from = SIDoffset, to = nrow(cars.analysis), by = 50), ]
Systematic data quality assessment across all variables is essential.
# Developing the validator function named 'data_checker' using validate package
data_checker <- validator(
NullValues = !is.na(mydf), #Checking for missing values
YearDigits = field_length(mydf$year, 4) , #Checking that the 'year' is four-digits long.
Non_NegYear = mydf$year >=1, #Verifying that 'year' is bigger than zero
Non_NegEngSize = mydf$engine_size > 0, #Verifying that 'engine_size' is bigger than zero
MaxGreaterMin = mydf$max_mpg >= mydf$min_mpg, #Verifying that 'max_mpg' is at least 'min_mpg'.
# Checking that 'mileage', 'min_mpg', 'max_mpg', and 'price' contain non-negative values.
Non_NegMileage = mydf$mileage >=0,
Non_NegMinmpg = mydf$min_mpg >= 0,
Non_NegMaxmpg = mydf$max_mpg >= 0,
Non_NegPrice = mydf$price >= 0,
# Checking that the 'damaged', 'first_owner', 'navigation_system', 'bluetooth', 'third_row_seating', 'automatic_transmission', and 'heated_seats' variables contain only 0 and 1 values.
HasAuto = is.element(mydf$automatic_transmission,c(0,1)),
HasDamaged = is.element(mydf$damaged, c(0,1)),
FirstOwner = is.element(mydf$first_owner, c(0,1)),
HasNavigation = is.element(mydf$navigation_system, c(0,1)),
HasBluetooth = is.element(mydf$bluetooth, c(0,1)),
Has3Row = is.element(mydf$third_row_seating, c(0,1)),
HasHeated = is.element(mydf$heated_seats, c(0,1)),
# Verifying the character variables
CharaBrand = is.character(mydf$brand),
CharaFuel = is.character(mydf$fuel),
CharaDrive = is.character(mydf$drivetrain),
#Verifying the factors variables
FAuto = is.factor(mydf$automatic_transmission),
FDamaged = is.factor(mydf$damaged),
FFirstOwner = is.factor(mydf$first_owner),
FNavigation = is.factor(mydf$navigation_system),
FBluetooth = is.factor(mydf$bluetooth),
F3Row = is.factor(mydf$third_row_seating),
FHeated = is.factor(mydf$heated_seats),
#Verifying the numerical variables
NumYear = is.numeric(mydf$year),
NumMileage = is.numeric(mydf$mileage),
NumEngSize = is.numeric(mydf$engine_size),
NumMinmpg = is.numeric(mydf$min_mpg),
NumMaxmpg = is.numeric(mydf$max_mpg),
NumPrice = is.numeric(mydf$price)
)
# Confronting 'mydf' and 'data_checker' and saving output in 'out'
out <- confront(mydf, data_checker)
# Generating summary of the output
summary(out)
## name items passes fails nNA error warning
## 1 NullValues 6560 6413 147 0 FALSE FALSE
## 2 YearDigits 410 410 0 0 FALSE FALSE
## 3 Non_NegYear 410 410 0 0 FALSE FALSE
## 4 Non_NegEngSize 410 380 1 29 FALSE FALSE
## 5 MaxGreaterMin 410 355 2 53 FALSE FALSE
## 6 Non_NegMileage 410 410 0 0 FALSE FALSE
## 7 Non_NegMinmpg 410 357 0 53 FALSE FALSE
## 8 Non_NegMaxmpg 410 355 2 53 FALSE FALSE
## 9 Non_NegPrice 410 410 0 0 FALSE FALSE
## 10 HasAuto 410 410 0 0 FALSE FALSE
## 11 HasDamaged 410 406 4 0 FALSE FALSE
## 12 FirstOwner 410 402 8 0 FALSE FALSE
## 13 HasNavigation 410 410 0 0 FALSE FALSE
## 14 HasBluetooth 410 410 0 0 FALSE FALSE
## 15 Has3Row 410 410 0 0 FALSE FALSE
## 16 HasHeated 410 410 0 0 FALSE FALSE
## 17 CharaBrand 1 1 0 0 FALSE FALSE
## 18 CharaFuel 1 1 0 0 FALSE FALSE
## 19 CharaDrive 1 1 0 0 FALSE FALSE
## 20 FAuto 1 0 1 0 FALSE FALSE
## 21 FDamaged 1 0 1 0 FALSE FALSE
## 22 FFirstOwner 1 0 1 0 FALSE FALSE
## 23 FNavigation 1 0 1 0 FALSE FALSE
## 24 FBluetooth 1 0 1 0 FALSE FALSE
## 25 F3Row 1 0 1 0 FALSE FALSE
## 26 FHeated 1 0 1 0 FALSE FALSE
## 27 NumYear 1 1 0 0 FALSE FALSE
## 28 NumMileage 1 1 0 0 FALSE FALSE
## 29 NumEngSize 1 1 0 0 FALSE FALSE
## 30 NumMinmpg 1 1 0 0 FALSE FALSE
## 31 NumMaxmpg 1 1 0 0 FALSE FALSE
## 32 NumPrice 1 1 0 0 FALSE FALSE
## expression
## 1 !is.na(mydf)
## 2 field_length(mydf[["year"]], 4)
## 3 mydf[["year"]] >= 1
## 4 mydf[["engine_size"]] > 0
## 5 mydf[["max_mpg"]] >= mydf[["min_mpg"]]
## 6 mydf[["mileage"]] >= 0
## 7 mydf[["min_mpg"]] >= 0
## 8 mydf[["max_mpg"]] >= 0
## 9 mydf[["price"]] >= 0
## 10 is.element(mydf[["automatic_transmission"]], c(0, 1))
## 11 is.element(mydf[["damaged"]], c(0, 1))
## 12 is.element(mydf[["first_owner"]], c(0, 1))
## 13 is.element(mydf[["navigation_system"]], c(0, 1))
## 14 is.element(mydf[["bluetooth"]], c(0, 1))
## 15 is.element(mydf[["third_row_seating"]], c(0, 1))
## 16 is.element(mydf[["heated_seats"]], c(0, 1))
## 17 is.character(mydf[["brand"]])
## 18 is.character(mydf[["fuel"]])
## 19 is.character(mydf[["drivetrain"]])
## 20 is.factor(mydf[["automatic_transmission"]])
## 21 is.factor(mydf[["damaged"]])
## 22 is.factor(mydf[["first_owner"]])
## 23 is.factor(mydf[["navigation_system"]])
## 24 is.factor(mydf[["bluetooth"]])
## 25 is.factor(mydf[["third_row_seating"]])
## 26 is.factor(mydf[["heated_seats"]])
## 27 is.numeric(mydf[["year"]])
## 28 is.numeric(mydf[["mileage"]])
## 29 is.numeric(mydf[["engine_size"]])
## 30 is.numeric(mydf[["min_mpg"]])
## 31 is.numeric(mydf[["max_mpg"]])
## 32 is.numeric(mydf[["price"]])
Issues with Data Quality Noted from the Summary:
.Found 147 missing values.
.1 value in the ‘engine_size’ variable is not greater than zero.
.2 values in the ‘max_mpg’ variable are less than the corresponding ‘min_mpg’.
.2 values in the ‘max_mpg’ variable are negative.
.4 values in the ‘damaged’ variable are neither 0 nor 1.
.8 values in the ‘first_owner’ variable are neither 0 nor 1.
.None of the variables expected to be factors are stored as factors.
Addressing issues:
.We’ll use colSums(is.na()) to count the missing values in each variable. Missing values will be replaced with the mean for numerical variables and with the mode (most frequent value) for categorical variables. These are simple, common choices for each variable type, although mean imputation can understate the spread of a variable.
.The two cases in which ‘max_mpg’ is lower than the corresponding ‘min_mpg’ are probably attributable to the negative values in ‘max_mpg’. We’ll fix this by replacing each negative value with its positive counterpart.
.Values other than 0 or 1 in the ‘damaged’ and ‘first_owner’ variables will be replaced with the mode, since both variables are binary.
.The ‘as.factor’ function will be used to convert the variables that are expected to be factors.
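The imputation plan above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the `toy` data and the `mode_value()` helper are hypothetical and are not part of the analysis itself.

```r
# Toy data standing in for mydf; the column names mirror the real ones,
# but the values are made up for illustration.
toy <- data.frame(
  engine_size = c(2.0, NA, 3.5, 2.5),
  damaged     = c(0, 1, NA, 0)
)

# Hypothetical helper: most frequent non-missing value of a vector.
# Ties are resolved in favour of the first value in table() order.
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  # table() stores the values as character names, so convert back for numeric input
  value <- names(counts)[which.max(counts)]
  if (is.numeric(x)) as.numeric(value) else value
}

# Mean imputation for a numerical column (rounded to one decimal place)
toy$engine_size[is.na(toy$engine_size)] <- round(mean(toy$engine_size, na.rm = TRUE), 1)

# Mode imputation for a binary column
toy$damaged[is.na(toy$damaged)] <- mode_value(toy$damaged)

toy
```

Computing the mode with a helper avoids reading it off a table and typing the value by hand, which matters if the data set is re-sampled with a different identifier.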
# Counting the missing values per variables
colSums(is.na(mydf))
## brand year mileage
## 0 0 0
## engine_size automatic_transmission fuel
## 29 0 0
## drivetrain min_mpg max_mpg
## 0 53 53
## damaged first_owner navigation_system
## 4 8 0
## bluetooth third_row_seating heated_seats
## 0 0 0
## price
## 0
# Mean imputations for numerical variables
mydf$engine_size[is.na(mydf$engine_size)] <- round(mean(mydf$engine_size, na.rm = TRUE), 1)
mydf$min_mpg[is.na(mydf$min_mpg)] <- round(mean(mydf$min_mpg, na.rm = TRUE), 1)
mydf$max_mpg[is.na(mydf$max_mpg)] <- round(mean(mydf$max_mpg, na.rm = TRUE), 1)
# Running the table function to find the most frequent values in 'damaged' and 'first_owner' variables
table(mydf$damaged)
##
## 0 1
## 304 102
table(mydf$first_owner)
##
## 0 1
## 216 186
# Mode imputations for categorical variables
mydf$damaged[is.na(mydf$damaged)] <- 0
mydf$first_owner[is.na(mydf$first_owner)] <- 0
# Cleaning the negative values in 'max_mpg' variable
# Starting by checking the values
mydf$max_mpg[mydf$max_mpg < 0]
## [1] -30 -30
# Both are -30, so they are replaced by 30
mydf$max_mpg[mydf$max_mpg < 0] <- 30
# Changing to factors
mydf$automatic_transmission <- as.factor(mydf$automatic_transmission)
mydf$damaged <- as.factor(mydf$damaged)
mydf$first_owner <- as.factor(mydf$first_owner)
mydf$navigation_system <- as.factor(mydf$navigation_system)
mydf$bluetooth <- as.factor(mydf$bluetooth)
mydf$third_row_seating <- as.factor(mydf$third_row_seating)
mydf$heated_seats <- as.factor(mydf$heated_seats)
# Character variables turned into factors.
mydf$brand <- as.factor(mydf$brand)
mydf$fuel <- as.factor(mydf$fuel)
mydf$drivetrain <- as.factor(mydf$drivetrain)
To wrap up this section, the cleaning and re-evaluation confirmed that:
.The cases of ‘max_mpg’ being lower than ‘min_mpg’ stemmed from the negative values in the ‘max_mpg’ variable.
.The values other than 0 or 1 identified in the ‘damaged’ and ‘first_owner’ variables were missing values.
.Furthermore, to simplify EDA, all character variables have been converted to factors.
The strategy:
Data overview: The summary() function will be used to get a high-level overview of the data, and the str() function will be used to confirm its structure.
Assessment of numerical variables: Histograms will be used to show the distribution of each numerical variable. Associations between numerical variables will be measured with the cor() function and shown with scatter plots, with an emphasis on the relationship between price (the subject of the research question) and the other numerical variables.
Assessment of categorical variables: Bar graphs and frequency tables will be used to examine the distribution of the categorical variables.
Joint assessment of numerical and categorical variables: Box plots will be used to show the relationships between numerical and categorical variables in connection with the research question.
# Running summary
summary(mydf)
## brand year mileage engine_size
## Volkswagen: 28 Min. :1972 Min. : 0 Min. :0.00
## Ford : 26 1st Qu.:2015 1st Qu.: 24304 1st Qu.:2.00
## Mazda : 23 Median :2018 Median : 47021 Median :2.50
## Nissan : 22 Mean :2017 Mean : 54013 Mean :2.78
## Audi : 21 3rd Qu.:2020 3rd Qu.: 73918 3rd Qu.:3.50
## FIAT : 20 Max. :2023 Max. :299999 Max. :6.60
## (Other) :270
## automatic_transmission fuel drivetrain min_mpg
## 0: 28 Diesel : 4 Four-wheel Drive :205 Min. : 0.00
## 1:382 Electric: 10 Front-wheel Drive:148 1st Qu.:18.00
## GPL : 6 Rear-wheel Drive : 55 Median :21.10
## Hybrid : 16 Unknown : 2 Mean :21.14
## Pertol : 2 3rd Qu.:23.00
## Petrol :369 Max. :72.00
## Unknown : 3
## max_mpg damaged first_owner navigation_system bluetooth
## Min. : 0.00 0:308 0:224 0:230 0: 58
## 1st Qu.:25.00 1:102 1:186 1:180 1:352
## Median :28.10
## Mean :28.38
## 3rd Qu.:31.00
## Max. :80.00
##
## third_row_seating heated_seats price
## 0:341 0:209 Min. : 4500
## 1: 69 1:201 1st Qu.:17996
## Median :27207
## Mean :27932
## 3rd Qu.:36297
## Max. :54392
##
From the summary above,
.This dataset includes a wide range of car brands. Volkswagen is the most common brand, followed by Ford and Mazda. Cars in this dataset were manufactured between 1972 and 2023, with a concentration on the years 2015 to 2020, indicating a prevalence of newer models.
.Mileage runs from 0 to 299,999, and the mean (54,013) is greater than the median (47,021), indicating that the distribution is skewed to the right. This implies that a few cars have extremely high mileage, whereas the bulk of cars have lower or moderate mileage. Engine sizes range from 0 to 6.6, and, as with mileage, the mean (2.78) is greater than the median (2.50), implying that a small percentage of cars have larger engines.
.Most cars have automatic transmissions and run on petrol, and four-wheel drive is the most common drivetrain. Minimum and maximum fuel economy range from 0 to 72 and from 0 to 80, respectively. In both cases the mean and median are close, which suggests a roughly symmetric distribution.
.The majority of cars have no visible damage, are not first-owner vehicles, and lack a navigation system, but do have Bluetooth connectivity. Most cars also lack third-row seating, and slightly more than half lack heated seats. Prices range from 4,500 to 54,392, and the mean (27,932) is slightly higher than the median (27,207), which suggests a mildly right-skewed distribution, with a few high-priced cars among mostly cheaper and moderately priced ones.
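The mean-versus-median reasoning used above can be illustrated with a short sketch. The values and the `skew_direction()` helper are made up for illustration; comparing the mean with the median is only a rough heuristic for the direction of skew, not a formal test.

```r
# Made-up right-skewed values standing in for a column such as mileage
x <- c(10, 12, 15, 18, 20, 22, 25, 200)

mean(x)    # pulled upward by the single large value
median(x)  # does not depend on how extreme the largest value is

# Hypothetical helper: sign of (mean - median) as a rough skew indicator
skew_direction <- function(v) {
  v <- v[!is.na(v)]
  d <- mean(v) - median(v)
  if (d > 0) "right" else if (d < 0) "left" else "none"
}

skew_direction(x)
```

A histogram remains the more reliable check, since the mean and median can be close even when a distribution is not symmetric.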
# Confirming the structure
str(mydf)
## 'data.frame': 410 obs. of 16 variables:
## $ brand : Factor w/ 24 levels "Alfa","Audi",..: 22 8 9 9 7 10 19 20 16 20 ...
## $ year : num 2022 2012 2021 2020 2007 ...
## $ mileage : num 52250 150842 33267 48816 14268 ...
## $ engine_size : num 2.5 3.5 2 2 4.6 2 3 1.8 2.5 1.6 ...
## $ automatic_transmission: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
## $ fuel : Factor w/ 7 levels "Diesel","Electric",..: 4 6 6 6 6 6 6 6 6 6 ...
## $ drivetrain : Factor w/ 4 levels "Four-wheel Drive",..: 1 2 2 2 3 1 2 2 1 1 ...
## $ min_mpg : num 35 18 21 21.1 17 20 24 31 27 21.1 ...
## $ max_mpg : num 35 25 26 28.1 25 28 30 36 32 28.1 ...
## $ damaged : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 1 1 1 ...
## $ first_owner : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 1 1 1 1 ...
## $ navigation_system : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ bluetooth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ third_row_seating : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 2 1 ...
## $ heated_seats : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 1 2 1 ...
## $ price : num 35895 11999 19500 26951 26000 ...
.The data is confirmed to be a data frame with 410 observations and 16 variables, and every variable has the expected class.
# Histogram for year
ggplot(mydf, aes(x = year)) +
  geom_histogram(bins = 30) +
  ggtitle("Year Distribution") +
  xlab("Year") +
  ylab("Frequency")
# Histogram for mileage
ggplot(mydf, aes(x = mileage)) +
  geom_histogram(bins = 30) +
  ggtitle("Mileage Distribution") +
  xlab("Mileage") +
  ylab("Frequency")
# Histogram for engine size
ggplot(mydf, aes(x = engine_size)) +
  geom_histogram(bins = 30) +
  ggtitle("Engine Size Distribution") +
  xlab("Engine size") +
  ylab("Frequency")
# Histogram for min_mpg
ggplot(mydf, aes(x = min_mpg)) +
  geom_histogram(bins = 30) +
  ggtitle("Minimum fuel efficiency of the car Distribution") +
  xlab("Min_mpg") +
  ylab("Frequency")
# Histogram for max_mpg
ggplot(mydf, aes(x = max_mpg)) +
  geom_histogram(bins = 30) +
  ggtitle("Maximum fuel efficiency of the car Distribution") +
  xlab("Max_mpg") +
  ylab("Frequency")
# Histogram for price
ggplot(mydf, aes(x = price)) +
  geom_histogram(bins = 30) +
  ggtitle("Car Price Distribution") +
  xlab("Price") +
  ylab("Frequency")
Looking at the histograms:
.Few cars in the dataset were produced before 2010. The majority were produced from 2015 onward, with the largest number produced between 2015 and 2020.
.The mileage distribution is skewed to the right, indicating that the majority of cars have less than 80,000 miles. The most common mileage is between 25,000 and 50,000.
.Many cars have an engine size of 2.
.Most cars have a minimum fuel efficiency of about 20 and a maximum of roughly 27. Both distributions appear roughly symmetric, which corresponds to what we saw in the summary statistics.
.The most common car price is roughly $12,000.
# Correlation analysis
# Selecting only numerical variables
numerical_columns <- mydf %>% select_if(is.numeric)
# For readability and clarity, the correlation values are rounded to two decimal places.
round(cor(numerical_columns), 2)
## year mileage engine_size min_mpg max_mpg price
## year 1.00 -0.45 -0.15 -0.04 -0.04 0.48
## mileage -0.45 1.00 0.21 -0.07 -0.07 -0.60
## engine_size -0.15 0.21 1.00 -0.27 -0.27 0.24
## min_mpg -0.04 -0.07 -0.27 1.00 0.92 -0.22
## max_mpg -0.04 -0.07 -0.27 0.92 1.00 -0.22
## price 0.48 -0.60 0.24 -0.22 -0.22 1.00
Examining the correlation matrix:
.The correlations are a mix of positive and negative values, with magnitudes both below and above 0.5.
.The negative correlation between year of production and mileage (-0.45) implies that newer cars generally have lower mileage.
.The positive correlation between price and year of production (0.48) suggests that more recent cars generally command higher prices.
.Mileage is also associated with price. The negative correlation (-0.60) indicates that price tends to decrease as mileage increases.
.Engine size, min_mpg and max_mpg are only weakly correlated with price (0.24, -0.22 and -0.22). In addition, min_mpg and max_mpg are strongly correlated with each other (0.92), so they carry largely the same information, and one of them would be enough if either were used as a predictor of car prices.
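Redundant pairs such as min_mpg and max_mpg can also be picked out of a correlation matrix programmatically rather than by eye. The following is a minimal sketch on simulated data; the `toy` data frame and the 0.8 cutoff are assumptions chosen for illustration.

```r
# Small made-up data set: b is nearly a copy of a, c is unrelated noise
set.seed(1)
a <- rnorm(50)
toy <- data.frame(a = a, b = a + rnorm(50, sd = 0.1), c = rnorm(50))

cm <- cor(toy)

# Look only at the upper triangle so that each pair is reported once;
# 0.8 is an assumed cutoff, not a fixed rule
idx <- which(upper.tri(cm) & abs(cm) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 2)
)
high_pairs
```

Each row of `high_pairs` names two variables that carry largely the same information, which is the situation described for the two fuel-efficiency variables.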
# Scatterplot analysis
# Year versus Price
ggplot(mydf, aes(x = year, y = price)) +
  geom_point() +
  labs(title = "Year Versus Price", x = "Year", y = "Price")
# Mileage versus Price
ggplot(mydf, aes(x = mileage, y = price)) +
  geom_point() +
  labs(title = "Mileage Versus Price", x = "Mileage", y = "Price")
# Engine Size versus Price
ggplot(mydf, aes(x = engine_size, y = price)) +
  geom_point() +
  labs(title = "Engine Size Versus Price", x = "Engine Size", y = "Price")
# Min_mpg versus Price
ggplot(mydf, aes(x = min_mpg, y = price)) +
  geom_point() +
  labs(title = "Min_mpg Versus Price", x = "Min_mpg", y = "Price")
# Max_mpg versus Price
ggplot(mydf, aes(x = max_mpg, y = price)) +
  geom_point() +
  labs(title = "Max_mpg Versus Price", x = "Max_mpg", y = "Price")
.The scatter plots visually support the observations made from the correlation matrix.
# Table analysis
# To simplify analysis, the brand, fuel, and drivetrain tables are sorted in descending order.
# Brands
sort(table(mydf$brand), decreasing = TRUE)
##
## Volkswagen Ford Mazda Nissan Audi
## 28 26 23 22 21
## FIAT MINI Jeep Chevrolet Honda
## 20 20 19 18 18
## Mercedes-Benz Alfa Hyundai Kia Cadillac
## 18 17 17 17 16
## Jaguar Lexus Mitsubishi Volvo Maserati
## 16 15 15 15 14
## Land Toyota BMW Porsche
## 12 12 7 4
# Fuel
sort(table(mydf$fuel), decreasing = TRUE)
##
## Petrol Hybrid Electric GPL Diesel Unknown Pertol
## 369 16 10 6 4 3 2
#Replacing "Pertol" with "Petrol"
mydf$fuel[mydf$fuel=="Pertol"] <- "Petrol"
# Dropping the empty level on the 'fuel'
mydf$fuel <- droplevels(mydf$fuel)
# Drivetrain
sort(table(mydf$drivetrain), decreasing = TRUE)
##
## Four-wheel Drive Front-wheel Drive Rear-wheel Drive Unknown
## 205 148 55 2
# Tables for the automatic_transmission, damaged, first_owner, navigation_system, bluetooth, third_row_seating and heated_seats variables
table(mydf$automatic_transmission)
##
## 0 1
## 28 382
table(mydf$damaged)
##
## 0 1
## 308 102
table(mydf$first_owner)
##
## 0 1
## 224 186
table(mydf$navigation_system)
##
## 0 1
## 230 180
table(mydf$bluetooth)
##
## 0 1
## 58 352
table(mydf$third_row_seating)
##
## 0 1
## 341 69
table(mydf$heated_seats)
##
## 0 1
## 209 201
Looking at the table analysis:
.Volkswagen is the most common brand, followed by Ford and Mazda. Porsche has the fewest cars in the dataset, followed by BMW, and then Land and Toyota.
.Most cars run on petrol, and only four cars use Diesel.
.Four-wheel Drive cars are the most numerous, followed by Front-wheel Drive and Rear-wheel Drive.
.The majority of cars have no visible damage, are not first-owner vehicles, and lack a navigation system, but do have Bluetooth connectivity. Furthermore, most cars lack third-row seating, and slightly more than half lack heated seats.
# bar graph
# for brand
ggplot(mydf, aes(x = brand)) +
  geom_bar() +
  labs(title = "brand bar graph", x = "brand", y = "Frequency")
# for fuel
ggplot(mydf, aes(x = fuel)) +
  geom_bar() +
  labs(title = "Fuel bar graph", x = "Type of fuel", y = "Frequency")
# for drive train
ggplot(mydf, aes(x = drivetrain)) +
  geom_bar() +
  labs(title = "drivetrain bar graph", x = "drivetrain", y = "Frequency")
# for automatic_transmission
ggplot(mydf, aes(x = automatic_transmission)) +
  geom_bar() +
  labs(title = "automatic_transmission bar graph", x = "automatic_transmission", y = "Frequency")
# for damaged
ggplot(mydf, aes(x = damaged)) +
  geom_bar() +
  labs(title = "damaged bar graph", x = "damaged", y = "Frequency")
# for first_owner
ggplot(mydf, aes(x = first_owner)) +
  geom_bar() +
  labs(title = "first_owner bar graph", x = "first_owner", y = "Frequency")
# for navigation_system
ggplot(mydf, aes(x = navigation_system)) +
  geom_bar() +
  labs(title = "navigation_system bar graph", x = "navigation_system", y = "Frequency")
# for bluetooth
ggplot(mydf, aes(x = bluetooth)) +
  geom_bar() +
  labs(title = "bluetooth bar graph", x = "bluetooth", y = "Frequency")
# for third_row_seating
ggplot(mydf, aes(x = third_row_seating)) +
  geom_bar() +
  labs(title = "third_row_seating bar graph", x = "third_row_seating", y = "Frequency")
# for heated_seats
ggplot(mydf, aes(x = heated_seats)) +
  geom_bar() +
  labs(title = "heated_seats bar graph", x = "heated_seats", y = "Frequency")
Looking at the bar graphs:
.The visual representation supports the observations from the summary statistics and table analysis: Volkswagen is the most common brand, and most cars use petrol, have no visible damage, are not first-owner vehicles, and lack a navigation system, but do have Bluetooth connectivity. Four-wheel drive is the most common drivetrain. Furthermore, many cars lack third-row seating or heated seats.
# box plot analysis
# Brand Versus Price
ggplot(mydf, aes(x = brand, y = price)) +
  geom_boxplot() +
  labs(title = "Brand Versus Price", x = "Brand", y = "Price")
# Fuel Versus Price
ggplot(mydf, aes(x = fuel, y = price)) +
  geom_boxplot() +
  labs(title = "Fuel Versus Price", x = "Fuel type", y = "Price")
# Drivetrain Versus Price
ggplot(mydf, aes(x = drivetrain, y = price)) +
  geom_boxplot() +
  labs(title = "drivetrain Versus Price", x = "drivetrain", y = "Price")
# automatic_transmission Versus Price
ggplot(mydf, aes(x = automatic_transmission, y = price)) +
  geom_boxplot() +
  labs(title = "automatic_transmission Versus Price", x = "automatic_transmission", y = "Price")
# damaged Versus Price
ggplot(mydf, aes(x = damaged, y = price)) +
  geom_boxplot() +
  labs(title = "damaged Versus Price", x = "damaged", y = "Price")
# first_owner Versus Price
ggplot(mydf, aes(x = first_owner, y = price)) +
  geom_boxplot() +
  labs(title = "first_owner Versus Price", x = "first_owner", y = "Price")
# navigation_system Versus Price
ggplot(mydf, aes(x = navigation_system, y = price)) +
  geom_boxplot() +
  labs(title = "navigation_system Versus Price", x = "navigation_system", y = "Price")
# bluetooth Versus Price
ggplot(mydf, aes(x = bluetooth, y = price)) +
  geom_boxplot() +
  labs(title = "bluetooth Versus Price", x = "bluetooth", y = "Price")
# third_row_seating Versus Price
ggplot(mydf, aes(x = third_row_seating, y = price)) +
  geom_boxplot() +
  labs(title = "third_row_seating Versus Price", x = "third_row_seating", y = "Price")
# heated_seats Versus Price
ggplot(mydf, aes(x = heated_seats, y = price)) +
  geom_boxplot() +
  labs(title = "heated_seats Versus Price", x = "heated_seats", y = "Price")
Looking at the box plots:
.The median price of hybrid-powered cars is the highest of all fuel types.
.Cars with four-wheel drive have the highest median price.
.The median price of cars with automatic transmissions is higher than that of cars without.
.Damaged cars have a lower median price than undamaged ones.
.Cars sold by their first owners have a higher median price than those that aren’t.
.Cars featuring navigation, Bluetooth, third-row seating, and heated seats have higher median prices.
.The majority of box plots do not exhibit extreme outliers.
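Medians read off a box plot can be cross-checked numerically. The sketch below uses made-up prices; only the column names `fuel` and `price` mirror the data set, and everything else is an assumption for illustration.

```r
# Made-up prices; 'fuel' and 'price' mirror column names used in the text
toy <- data.frame(
  fuel  = factor(c("Petrol", "Petrol", "Hybrid", "Hybrid", "Diesel", "Diesel")),
  price = c(20000, 24000, 30000, 34000, 18000, 26000)
)

# Median price per fuel type, sorted from highest to lowest
med <- sort(tapply(toy$price, toy$fuel, median), decreasing = TRUE)
med

# Number of observations behind each median, to flag small groups
table(toy$fuel)
```

Printing the group sizes next to the medians is useful here because some categories are very small (the fuel table shows only four Diesel cars), so their medians should be read with caution.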
Overall, the price of a car appears to be influenced by its brand, fuel type, drivetrain, transmission type, damage status, first-owner status, navigation system, Bluetooth, third-row seating, and heated seats.
Brand Distribution:
.Within this dataset, Volkswagen, Ford, and Mazda are the most common brands. Porsche, BMW, Land, and Toyota, in contrast, have the fewest entries.
Temporal Trends:
.Cars span the years 1972 to 2023, with most manufactured between 2015 and 2020, suggesting a preponderance of more recent models.
Mileage and Engine Size:
.The mileage of the cars varies widely, from 0 to 299,999. The right-skewed distribution indicates that the majority of the cars have relatively low mileage. Similarly, engine sizes range from 0 to 6.6, and only a limited number of cars are equipped with larger engines.
Transmission and Fuel Type:
.Automatic transmissions and petrol fuel are prevalent among the majority of cars, and four-wheel drive is the most common drivetrain. Diesel-powered cars are rare.
Efficiency Metrics:
.The majority of cars have a minimum fuel economy of about 20 and a maximum of around 27. Both distributions appear to be roughly symmetric.
Features and Pricing:
.While cars typically do not have observable damage, navigation systems, or heated seats, Bluetooth connectivity is a prevalent feature. The price distribution is right-skewed, with the majority of cars priced at or below the middle of the range.
Correlations:
.The correlations suggest that newer cars generally have lower mileage and higher prices. Price decreases as mileage increases, whereas engine size and fuel efficiency are only weakly correlated with price.
Visual representations:
.Histograms, bar graphs, scatter plots, and box plots support these observations. Box plot analysis also reveals that the median price of four-wheel drive and hybrid-powered cars is higher. Higher median prices are also associated with automatic transmission, absence of damage, first-owner status, and features including Bluetooth and navigation.
.The ‘Unknown’ entries in the ‘fuel’ and ‘drivetrain’ variables could be replaced with the most frequent value of each variable. This substitution would make the data more consistent for analysis.
.Some luxury brands, such as Porsche and BMW, have fewer entries than mainstream brands such as Volkswagen and Ford.
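The substitution of ‘Unknown’ entries suggested above could look roughly like the following. This is a sketch on a made-up factor, not the code used in the analysis; the `drivetrain` vector here is a hypothetical stand-in for the real column.

```r
# Made-up factor standing in for mydf$drivetrain
drivetrain <- factor(c("Four-wheel Drive", "Front-wheel Drive", "Unknown",
                       "Four-wheel Drive", "Rear-wheel Drive", "Unknown"))

# Most frequent level among the known values only, so that "Unknown"
# itself can never be selected as the replacement
known <- droplevels(drivetrain[drivetrain != "Unknown"])
modal_level <- names(which.max(table(known)))

# Overwrite the "Unknown" entries, then drop the now-empty level
drivetrain[drivetrain == "Unknown"] <- modal_level
drivetrain <- droplevels(drivetrain)

table(drivetrain)
```

An alternative, if the missing information might itself be informative, would be to keep ‘Unknown’ as its own level rather than merging it into the largest group.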
.The objective is to see how well the price of a car is explained by the other variables. Because of the large number of independent variables, interactions will be omitted from the modelling process. Because the dependent variable, ‘price’, is continuous, a linear regression model will be applied. The model will be developed by backward selection, starting with the maximal model that includes all candidate variables and progressively reducing its complexity. The functions ‘step()’ and ‘update()’ will be used to simplify the model. Furthermore, in accordance with the findings of the EDA, the variables ‘max_mpg’, ‘min_mpg’, and ‘engine_size’ will be excluded because they were only weakly correlated with car prices.
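The backward-selection workflow described above can be sketched on R's built-in mtcars data, used here purely as a stand-in for the car data set. The model terms are assumptions chosen for illustration. The last lines also show how a squared term is specified in a formula, since `I(x)^2` and `I(x^2)` behave differently.

```r
# Built-in mtcars data used as a stand-in for the car data set
full <- lm(mpg ~ wt + hp + qsec + factor(am), data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)

# update() removes (or adds) a single term by hand
no_hp <- update(full, . ~ . - hp)

# A squared term must be wrapped entirely in I(). In a formula,
# I(wt)^2 is read as I(wt), which duplicates wt and is dropped as singular.
quad_ok  <- lm(mpg ~ wt + I(wt^2), data = mtcars)
quad_bad <- lm(mpg ~ wt + I(wt)^2, data = mtcars)
anyNA(coef(quad_bad))  # TRUE: the duplicated term has no estimate
```

Passing `data =` to `lm()` keeps the model tied to one data frame, which avoids relying on `attach()` and makes later calls to `step()` and `update()` easier to follow.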
# Eliminating the variables 'max_mpg,''min_mpg,' and 'engine_size'
mydf_new <- mydf[, -c(4,8,9)]
# Attach mydf_new to facilitate variables access.
attach(mydf_new)
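# Constructing the linear regression model
# Note: in an R formula, I(year)^2 is read as I(year), not as a squared term, so it duplicates 'year'
# (likewise for mileage) and is dropped as a singularity; I(year^2) would fit a quadratic term.
model <- lm(price ~ brand+year+mileage+automatic_transmission+fuel+drivetrain+damaged+first_owner+navigation_system+bluetooth+third_row_seating+heated_seats+ I(year)^2+I(mileage)^2)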
# Assessment of the model's summary
summary(model)
##
## Call:
## lm(formula = price ~ brand + year + mileage + automatic_transmission +
## fuel + drivetrain + damaged + first_owner + navigation_system +
## bluetooth + third_row_seating + heated_seats + I(year)^2 +
## I(mileage)^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14030 -4204 -364 3452 20540
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.991e+06 2.060e+05 -9.665 < 2e-16 ***
## brandAudi 4.365e+03 2.011e+03 2.171 0.030595 *
## brandBMW 5.075e+03 2.702e+03 1.878 0.061124 .
## brandCadillac 7.313e+03 2.172e+03 3.367 0.000839 ***
## brandChevrolet 6.525e+03 2.061e+03 3.166 0.001673 **
## brandFIAT -4.883e+03 2.104e+03 -2.321 0.020804 *
## brandFord 6.482e+03 1.914e+03 3.387 0.000784 ***
## brandHonda 1.245e+03 2.131e+03 0.584 0.559456
## brandHyundai 2.358e+02 2.219e+03 0.106 0.915437
## brandJaguar 6.626e+03 2.132e+03 3.107 0.002035 **
## brandJeep 4.773e+03 2.045e+03 2.333 0.020171 *
## brandKia -2.413e+02 2.176e+03 -0.111 0.911777
## brandLand 6.692e+03 2.294e+03 2.918 0.003742 **
## brandLexus 7.049e+03 2.151e+03 3.277 0.001149 **
## brandMaserati 6.166e+03 2.245e+03 2.746 0.006322 **
## brandMazda -2.499e+03 2.014e+03 -1.241 0.215498
## brandMercedes-Benz 7.928e+03 2.002e+03 3.960 8.99e-05 ***
## brandMINI 1.657e+02 2.114e+03 0.078 0.937552
## brandMitsubishi -4.999e+03 2.199e+03 -2.273 0.023576 *
## brandNissan 2.878e+02 1.993e+03 0.144 0.885289
## brandPorsche 8.212e+03 3.354e+03 2.449 0.014804 *
## brandToyota 4.681e+03 2.404e+03 1.948 0.052225 .
## brandVolkswagen 5.632e+02 1.925e+03 0.293 0.769994
## brandVolvo 5.292e+03 2.169e+03 2.440 0.015146 *
## year 1.006e+03 1.022e+02 9.849 < 2e-16 ***
## mileage -1.045e-01 1.015e-02 -10.299 < 2e-16 ***
## automatic_transmission1 -6.292e+02 1.276e+03 -0.493 0.622223
## fuelElectric -6.655e+03 3.788e+03 -1.757 0.079744 .
## fuelGPL -1.203e+04 3.930e+03 -3.061 0.002364 **
## fuelHybrid -7.496e+03 3.535e+03 -2.120 0.034662 *
## fuelPetrol -7.285e+03 3.150e+03 -2.313 0.021284 *
## fuelUnknown -7.125e+03 4.833e+03 -1.474 0.141265
## drivetrainFront-wheel Drive -5.403e+03 7.952e+02 -6.794 4.38e-11 ***
## drivetrainRear-wheel Drive 2.454e+03 1.014e+03 2.420 0.015999 *
## drivetrainUnknown 5.213e+04 6.131e+03 8.503 4.65e-16 ***
## damaged1 -1.389e+03 7.131e+02 -1.948 0.052173 .
## first_owner1 1.806e+03 7.000e+02 2.579 0.010287 *
## navigation_system1 3.728e+03 7.216e+02 5.166 3.93e-07 ***
## bluetooth1 -1.177e+03 1.072e+03 -1.098 0.272911
## third_row_seating1 3.983e+03 8.981e+02 4.436 1.21e-05 ***
## heated_seats1 3.872e+02 6.600e+02 0.587 0.557797
## I(year) NA NA NA NA
## I(mileage) NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5830 on 369 degrees of freedom
## Multiple R-squared: 0.7936, Adjusted R-squared: 0.7713
## F-statistic: 35.48 on 40 and 369 DF, p-value: < 2.2e-16
Asterisks mark the significant terms: some independent variables are significant, while others are insignificant or only marginally significant. Terms that do not improve the model's AIC will be eliminated using the step() function.
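The two NA rows in the summary above (‘I(year)’ and ‘I(mileage)’) come from how R reads ‘^’ inside a formula. The sketch below reproduces the effect on simulated data; the variables ‘x’ and ‘y’ are made up for illustration.

```r
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x - 0.5 * x^2 + rnorm(50)

# As written in the model above: ^ is formula crossing, so I(x)^2 is
# just I(x), which duplicates x and cannot be estimated.
fit_crossed <- lm(y ~ x + I(x)^2)

# A genuine quadratic term: the power is evaluated inside I().
fit_quadratic <- lm(y ~ x + I(x^2))

coef(fit_crossed)    # the I(x) coefficient is NA
coef(fit_quadratic)  # three estimated coefficients
```

If squared terms for year and mileage were intended, they would need the ‘I(year^2)’ form; the results reported here are for the model without them.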
# Backward selection by AIC to eliminate unhelpful variables
modelB <- step(model)
## Start: AIC=7148.88
## price ~ brand + year + mileage + automatic_transmission + fuel +
## drivetrain + damaged + first_owner + navigation_system +
## bluetooth + third_row_seating + heated_seats + I(year)^2 +
## I(mileage)^2
##
##
## Step: AIC=7148.88
## price ~ brand + year + mileage + automatic_transmission + fuel +
## drivetrain + damaged + first_owner + navigation_system +
## bluetooth + third_row_seating + heated_seats + I(year)
##
##
## Step: AIC=7148.88
## price ~ brand + year + mileage + automatic_transmission + fuel +
## drivetrain + damaged + first_owner + navigation_system +
## bluetooth + third_row_seating + heated_seats
##
## Df Sum of Sq RSS AIC
## - automatic_transmission 1 8265835 1.2552e+10 7147.2
## - heated_seats 1 11698876 1.2555e+10 7147.3
## - bluetooth 1 40983367 1.2584e+10 7148.2
## <none> 1.2543e+10 7148.9
## - fuel 5 323314624 1.2867e+10 7149.3
## - damaged 1 128991798 1.2672e+10 7151.1
## - first_owner 1 226144980 1.2769e+10 7154.2
## - third_row_seating 1 668766737 1.3212e+10 7168.2
## - navigation_system 1 907092711 1.3450e+10 7175.5
## - brand 23 4694517765 1.7238e+10 7233.2
## - year 1 3297212141 1.5840e+10 7242.6
## - mileage 1 3605304935 1.6149e+10 7250.5
## - drivetrain 3 5010210007 1.7553e+10 7280.7
##
## Step: AIC=7147.15
## price ~ brand + year + mileage + fuel + drivetrain + damaged +
## first_owner + navigation_system + bluetooth + third_row_seating +
## heated_seats
##
## Df Sum of Sq RSS AIC
## - heated_seats 1 11860461 1.2563e+10 7145.5
## - bluetooth 1 43024656 1.2595e+10 7146.6
## <none> 1.2552e+10 7147.2
## - fuel 5 321686514 1.2873e+10 7147.5
## - damaged 1 127517290 1.2679e+10 7149.3
## - first_owner 1 233519221 1.2785e+10 7152.7
## - third_row_seating 1 661281195 1.3213e+10 7166.2
## - navigation_system 1 904183901 1.3456e+10 7173.7
## - brand 23 4699430821 1.7251e+10 7231.5
## - year 1 3291288699 1.5843e+10 7240.6
## - mileage 1 3598029012 1.6150e+10 7248.5
## - drivetrain 3 5106802992 1.7658e+10 7281.1
##
## Step: AIC=7145.54
## price ~ brand + year + mileage + fuel + drivetrain + damaged +
## first_owner + navigation_system + bluetooth + third_row_seating
##
## Df Sum of Sq RSS AIC
## - bluetooth 1 42237773 1.2606e+10 7144.9
## <none> 1.2563e+10 7145.5
## - fuel 5 324306622 1.2888e+10 7146.0
## - damaged 1 127863377 1.2691e+10 7147.7
## - first_owner 1 238293982 1.2802e+10 7151.2
## - third_row_seating 1 693820754 1.3257e+10 7165.6
## - navigation_system 1 965220850 1.3529e+10 7173.9
## - brand 23 4692654495 1.7256e+10 7229.7
## - year 1 3363311469 1.5927e+10 7240.8
## - mileage 1 3602378050 1.6166e+10 7246.9
## - drivetrain 3 5214853771 1.7778e+10 7281.9
##
## Step: AIC=7144.91
## price ~ brand + year + mileage + fuel + drivetrain + damaged +
## first_owner + navigation_system + third_row_seating
##
## Df Sum of Sq RSS AIC
## <none> 1.2606e+10 7144.9
## - fuel 5 345731555 1.2951e+10 7146.0
## - damaged 1 146008516 1.2752e+10 7147.6
## - first_owner 1 264284458 1.2870e+10 7151.4
## - third_row_seating 1 676941929 1.3283e+10 7164.4
## - navigation_system 1 925261490 1.3531e+10 7172.0
## - brand 23 4655069662 1.7261e+10 7227.8
## - mileage 1 3694527431 1.6300e+10 7248.3
## - year 1 3823213886 1.6429e+10 7251.5
## - drivetrain 3 5180609827 1.7786e+10 7280.1
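The AIC values printed by ‘step()’ for a linear model can be checked by hand. The sketch below uses the figures printed above (410 cars, 41 estimated coefficients, RSS of 1.2543e+10) and then confirms the same formula against ‘extractAIC()’ on a built-in dataset, which serves only as a stand-in.

```r
# Figures taken from the output above: 369 residual df plus 41 estimated
# coefficients gives 410 cars; the RSS is rounded in the printout.
n   <- 410
k   <- 41
rss <- 1.2543e10
aic_step <- n * log(rss / n) + 2 * k
round(aic_step, 1)  # close to the reported 7148.88

# The same formula checked against extractAIC() on a built-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n_demo <- nrow(mtcars)
manual <- n_demo * log(sum(resid(fit)^2) / n_demo) + 2 * length(coef(fit))
all.equal(manual, as.numeric(extractAIC(fit)[2]))
```

Because each extra coefficient adds 2 to this quantity, a term is dropped only when the resulting increase in RSS is small enough to be offset.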
# Checking modelB summary
summary(modelB)
##
## Call:
## lm(formula = price ~ brand + year + mileage + fuel + drivetrain +
## damaged + first_owner + navigation_system + third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14157.9 -4178.8 -246.1 3508.6 20722.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.893e+06 1.820e+05 -10.399 < 2e-16 ***
## brandAudi 4.710e+03 1.974e+03 2.386 0.017527 *
## brandBMW 5.234e+03 2.693e+03 1.944 0.052662 .
## brandCadillac 7.488e+03 2.156e+03 3.473 0.000576 ***
## brandChevrolet 6.616e+03 2.056e+03 3.218 0.001404 **
## brandFIAT -4.668e+03 2.086e+03 -2.238 0.025818 *
## brandFord 6.716e+03 1.901e+03 3.533 0.000462 ***
## brandHonda 1.438e+03 2.106e+03 0.683 0.495245
## brandHyundai 9.086e+02 2.149e+03 0.423 0.672742
## brandJaguar 6.679e+03 2.123e+03 3.146 0.001790 **
## brandJeep 5.071e+03 2.030e+03 2.498 0.012918 *
## brandKia -9.358e+01 2.169e+03 -0.043 0.965603
## brandLand 6.729e+03 2.288e+03 2.942 0.003470 **
## brandLexus 7.265e+03 2.140e+03 3.394 0.000762 ***
## brandMaserati 6.194e+03 2.224e+03 2.784 0.005637 **
## brandMazda -1.896e+03 1.954e+03 -0.970 0.332509
## brandMercedes-Benz 7.917e+03 1.998e+03 3.963 8.86e-05 ***
## brandMINI 7.272e+02 2.065e+03 0.352 0.724921
## brandMitsubishi -4.861e+03 2.184e+03 -2.226 0.026621 *
## brandNissan 5.985e+02 1.974e+03 0.303 0.761951
## brandPorsche 8.511e+03 3.331e+03 2.555 0.011006 *
## brandToyota 5.000e+03 2.381e+03 2.100 0.036429 *
## brandVolkswagen 8.594e+02 1.885e+03 0.456 0.648761
## brandVolvo 5.587e+03 2.152e+03 2.597 0.009782 **
## year 9.568e+02 9.007e+01 10.622 < 2e-16 ***
## mileage -1.054e-01 1.009e-02 -10.442 < 2e-16 ***
## fuelElectric -6.813e+03 3.756e+03 -1.814 0.070515 .
## fuelGPL -1.238e+04 3.910e+03 -3.165 0.001677 **
## fuelHybrid -7.800e+03 3.510e+03 -2.223 0.026848 *
## fuelPetrol -7.501e+03 3.124e+03 -2.401 0.016833 *
## fuelUnknown -7.210e+03 4.820e+03 -1.496 0.135591
## drivetrainFront-wheel Drive -5.563e+03 7.811e+02 -7.121 5.56e-12 ***
## drivetrainRear-wheel Drive 2.529e+03 1.006e+03 2.513 0.012387 *
## drivetrainUnknown 5.128e+04 5.907e+03 8.681 < 2e-16 ***
## damaged1 -1.469e+03 7.076e+02 -2.076 0.038602 *
## first_owner1 1.933e+03 6.923e+02 2.793 0.005497 **
## navigation_system1 3.664e+03 7.011e+02 5.225 2.90e-07 ***
## third_row_seating1 3.959e+03 8.858e+02 4.470 1.04e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5821 on 372 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.772
## F-statistic: 38.42 on 37 and 372 DF, p-value: < 2.2e-16
Several insignificant terms remain in the model. Before dealing with them, I consider dropping the brand variable. The main motivation is model simplicity: brand contributes 23 dummy terms, of which 8 (BMW, Honda, Hyundai, Kia, Mazda, MINI, Nissan, and Volkswagen) are not significant at the 5% level. Note that step() retained brand on AIC grounds, so this removal trades some explanatory power for a simpler model. I use the update() function for this.
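A factor with many levels is usually judged as a whole rather than level by level. The sketch below shows one way to do that with an F-test; it uses ‘mtcars’ with ‘cyl’ as a stand-in for a multi-level factor such as brand, so the numbers are not the report's.

```r
# Sketch on a built-in dataset: 'cyl' plays the role of a factor like 'brand'.
cars_demo <- transform(mtcars, cyl = factor(cyl))
fit_full <- lm(mpg ~ wt + hp + cyl, data = cars_demo)

# F-test for dropping each term as a whole (all dummy columns together).
drop_table <- drop1(fit_full, test = "F")
drop_table

# Equivalent explicit comparison for the factor alone.
fit_no_cyl <- update(fit_full, . ~ . - cyl)
comparison <- anova(fit_no_cyl, fit_full)
comparison
```

Applied to the report's models, the same idea would be ‘anova(modelC, modelB)’, which tests all 23 brand terms jointly.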
# Eliminating brand variable
modelC <- update(modelB,~.-brand)
# Checking modelC summary
summary(modelC)
##
## Call:
## lm(formula = price ~ year + mileage + fuel + drivetrain + damaged +
## first_owner + navigation_system + third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15735 -4346 -807 3971 20798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.688e+06 1.902e+05 -8.875 < 2e-16 ***
## year 8.575e+02 9.412e+01 9.111 < 2e-16 ***
## mileage -1.057e-01 1.103e-02 -9.586 < 2e-16 ***
## fuelElectric -1.232e+04 4.127e+03 -2.986 0.00300 **
## fuelGPL -1.095e+04 4.372e+03 -2.504 0.01267 *
## fuelHybrid -8.169e+03 3.850e+03 -2.121 0.03451 *
## fuelPetrol -8.954e+03 3.463e+03 -2.585 0.01008 *
## fuelUnknown -8.890e+03 5.430e+03 -1.637 0.10238
## drivetrainFront-wheel Drive -8.312e+03 7.598e+02 -10.939 < 2e-16 ***
## drivetrainRear-wheel Drive 2.633e+03 1.068e+03 2.466 0.01410 *
## drivetrainUnknown 4.606e+04 6.423e+03 7.171 3.70e-12 ***
## damaged1 -1.309e+03 7.795e+02 -1.680 0.09376 .
## first_owner1 1.785e+03 7.602e+02 2.348 0.01937 *
## navigation_system1 5.109e+03 7.076e+02 7.220 2.69e-12 ***
## third_row_seating1 2.912e+03 9.184e+02 3.171 0.00164 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6610 on 395 degrees of freedom
## Multiple R-squared: 0.716, Adjusted R-squared: 0.7059
## F-statistic: 71.14 on 14 and 395 DF, p-value: < 2.2e-16
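Next I consider removing the fuel and first_owner variables, whose terms are comparatively weak in modelC: fuelUnknown is not significant (p = 0.10), while first_owner and most of the other fuel levels are significant only at the 5% level.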
# Eliminating fuel and first_owner variables
modelD <- update(modelC,~.-fuel-first_owner)
# Checking modelD summary
summary(modelD)
##
## Call:
## lm(formula = price ~ year + mileage + drivetrain + damaged +
## navigation_system + third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16833.1 -4565.5 -639.9 3930.2 24957.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.799e+06 1.884e+05 -9.551 < 2e-16 ***
## year 9.087e+02 9.322e+01 9.747 < 2e-16 ***
## mileage -1.026e-01 1.053e-02 -9.750 < 2e-16 ***
## drivetrainFront-wheel Drive -8.359e+03 7.618e+02 -10.972 < 2e-16 ***
## drivetrainRear-wheel Drive 2.403e+03 1.072e+03 2.241 0.025563 *
## drivetrainUnknown 4.741e+04 6.207e+03 7.638 1.63e-13 ***
## damaged1 -1.553e+03 7.823e+02 -1.985 0.047876 *
## navigation_system1 5.138e+03 7.072e+02 7.265 1.96e-12 ***
## third_row_seating1 3.384e+03 9.122e+02 3.710 0.000237 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6701 on 401 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6978
## F-statistic: 119.1 on 8 and 401 DF, p-value: < 2.2e-16
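Although the drivetrain levels are significant in modelD, I remove the variable to simplify the model further. As the next summary shows, this lowers the adjusted R-squared from 0.698 to 0.531.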
# Eliminating Drivetrain variable
modelE <- update(modelD,~.-drivetrain)
# Checking the modelE summary
summary(modelE)
##
## Call:
## lm(formula = price ~ year + mileage + damaged + navigation_system +
## third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26798 -5445 -758 4977 45457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.677e+05 1.780e+05 -4.876 1.56e-06 ***
## year 4.462e+02 8.813e+01 5.063 6.28e-07 ***
## mileage -1.425e-01 1.222e-02 -11.661 < 2e-16 ***
## damaged1 -2.636e+03 9.702e+02 -2.717 0.006878 **
## navigation_system1 7.437e+03 8.538e+02 8.711 < 2e-16 ***
## third_row_seating1 4.292e+03 1.122e+03 3.825 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8353 on 404 degrees of freedom
## Multiple R-squared: 0.5362, Adjusted R-squared: 0.5305
## F-statistic: 93.42 on 5 and 404 DF, p-value: < 2.2e-16
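The automatic_transmission and bluetooth variables were already dropped by step(), so the update() call below leaves the model unchanged: modelF is identical to modelE.
# Removing automatic_transmission and bluetooth (no effect: neither term is in modelE)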
modelF <- update(modelE,~.-automatic_transmission-bluetooth)
# Checking modelF summary
summary(modelF)
##
## Call:
## lm(formula = price ~ year + mileage + damaged + navigation_system +
## third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26798 -5445 -758 4977 45457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.677e+05 1.780e+05 -4.876 1.56e-06 ***
## year 4.462e+02 8.813e+01 5.063 6.28e-07 ***
## mileage -1.425e-01 1.222e-02 -11.661 < 2e-16 ***
## damaged1 -2.636e+03 9.702e+02 -2.717 0.006878 **
## navigation_system1 7.437e+03 8.538e+02 8.711 < 2e-16 ***
## third_row_seating1 4.292e+03 1.122e+03 3.825 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8353 on 404 degrees of freedom
## Multiple R-squared: 0.5362, Adjusted R-squared: 0.5305
## F-statistic: 93.42 on 5 and 404 DF, p-value: < 2.2e-16
All remaining terms are significant, so the final model is
\[ price = -867700+446.2\times{year}-0.1425\times{mileage}-2636\times{damaged}+7437\times{navigationsystem}+4292\times{thirdrowseating} \]
Some variables influence car prices positively in this model, whereas others have a negative impact.
The intercept of -867,700 is the predicted price when all predictors are zero. Because that would mean a production year of 0, it has no practical interpretation on its own.
The year of production has a positive effect on the price: each additional year is associated with a price about 446.2 higher, holding the other variables fixed.
Mileage has a negative effect on pricing: each additional unit of mileage reduces the predicted price by about 0.1425.
Damage reduces the predicted price by about 2,636 relative to an undamaged car, which is the reference level.
A navigation system or third-row seating raises the predicted price by about 7,437 or 4,292 respectively.
The R-squared of about 54% (adjusted 53%) indicates how well the model explains car prices in light of these factors. The model therefore explains roughly half of the variation in car prices, so there is still room for improvement.
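To make the coefficients concrete, the sketch below evaluates the fitted equation for one car. The coefficients are the rounded values from summary(modelF) above; the car itself (a 2018 model with 30,000 units of mileage, undamaged, with navigation and without third-row seating) is hypothetical.

```r
# Coefficients copied (rounded) from summary(modelF) above.
b <- c(intercept = -867700, year = 446.2, mileage = -0.1425,
       damaged = -2636, navigation_system = 7437, third_row_seating = 4292)

# Hypothetical car, chosen only for illustration.
car <- c(intercept = 1, year = 2018, mileage = 30000,
         damaged = 0, navigation_system = 1, third_row_seating = 0)

# Linear predictor: sum of coefficient times value.
predicted_price <- sum(b * car[names(b)])
predicted_price
```

The result is roughly 35,900. Because the year coefficient is multiplied by a value near 2,000, rounding in the printed coefficients shifts the answer noticeably; with the fitted object available, ‘predict(modelF, newdata = ...)’ avoids that rounding.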
# Model plotting for diagnostic purposes
plot(modelF)
Plot 1 (residuals versus fitted values) shows no substantial heteroscedasticity: the spread of the points is roughly consistent. While not ideal, it is acceptable.
Plot 2 (normal Q-Q) shows nearly all points following the straight reference line, indicating approximately normally distributed residuals. However, some outliers exist.
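Visual diagnostics can be complemented with a few numeric checks. The sketch below fits a stand-in model on ‘mtcars’; in the report the same calls would take modelF instead.

```r
# Sketch on a built-in dataset; the fitted model here is only a stand-in.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs fitted and normal Q-Q side by side.
op <- par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(op)

# Numeric complements to the plots.
std_res <- rstandard(fit)               # standardized residuals
n_large <- sum(abs(std_res) > 2)        # candidates for outliers
sw <- shapiro.test(residuals(fit))      # Shapiro-Wilk test of normality
n_large
sw$p.value
```

Counting standardized residuals beyond 2 gives a rough, checkable version of the "some outliers exist" reading of the Q-Q plot.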
Since the dependent variable is continuous, I try square root and logarithmic transformations of price to improve model performance. Transforming the response may bring its relationship with the predictors closer to linear, improving the model’s predictive power.
# Refitting modelF with sqrt(price) as the response
modelK <- lm(sqrt(price)~ year + mileage + damaged + navigation_system + third_row_seating)
# Checking modelK summary
summary(modelK)
##
## Call:
## lm(formula = sqrt(price) ~ year + mileage + damaged + navigation_system +
## third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.460 -16.480 -1.007 15.725 148.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.083e+03 5.403e+02 -5.705 2.25e-08 ***
## year 1.616e+00 2.675e-01 6.041 3.49e-09 ***
## mileage -4.533e-04 3.711e-05 -12.216 < 2e-16 ***
## damaged1 -7.489e+00 2.945e+00 -2.543 0.011375 *
## navigation_system1 2.228e+01 2.592e+00 8.597 < 2e-16 ***
## third_row_seating1 1.311e+01 3.406e+00 3.847 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.36 on 404 degrees of freedom
## Multiple R-squared: 0.5622, Adjusted R-squared: 0.5568
## F-statistic: 103.8 on 5 and 404 DF, p-value: < 2.2e-16
# Refitting modelF with log(price) as the response
modelM <- lm(log(price)~ year + mileage + damaged + navigation_system + third_row_seating)
# Checking modelM summary
summary(modelM)
##
## Call:
## lm(formula = log(price) ~ year + mileage + damaged + navigation_system +
## third_row_seating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21451 -0.19948 0.01378 0.20272 2.04492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.973e+01 7.160e+00 -5.549 5.22e-08 ***
## year 2.481e-02 3.546e-03 6.998 1.08e-11 ***
## mileage -6.004e-06 4.918e-07 -12.209 < 2e-16 ***
## damaged1 -8.620e-02 3.903e-02 -2.208 0.027788 *
## navigation_system1 2.782e-01 3.435e-02 8.100 6.56e-15 ***
## third_row_seating1 1.634e-01 4.514e-02 3.620 0.000333 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3361 on 404 degrees of freedom
## Multiple R-squared: 0.5696, Adjusted R-squared: 0.5643
## F-statistic: 106.9 on 5 and 404 DF, p-value: < 2.2e-16
# Plotting modelK
plot(modelK)
# Plotting modelM
plot(modelM)
The diagnostic plots of both models (modelK and modelM) show the same consistent, well-behaved patterns as the previous model (modelF). Both the R-squared and F-statistic values are higher than for modelF (R-squared of 0.5622 for modelK and 0.5696 for modelM, against 0.5362), which generally signifies a better fit, although values computed on differently transformed responses are not strictly comparable. On the basis of these metrics, modelM currently performs best. The representation of modelM in equation form is:
\[ \log(price) = -39.73+0.02481\times{year}-0.000006004\times{mileage}-0.08620\times{damaged1}+0.2782\times{navigationsystem1}+0.1634\times{thirdrowseating1} \]
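Coefficients of a model for log(price) are easier to read as percentage changes in price. The sketch below converts the values printed in summary(modelM) above; only the conversion is added here.

```r
# Coefficients copied from summary(modelM) above.
log_coefs <- c(year = 0.02481, mileage = -6.004e-06, damaged1 = -0.08620,
               navigation_system1 = 0.2782, third_row_seating1 = 0.1634)

# Percentage change in price for a one-unit increase in each predictor.
pct_change <- 100 * (exp(log_coefs) - 1)
round(pct_change, 2)
```

Read this way, a navigation system corresponds to a price about 32% higher, each additional production year to about 2.5% more, and damage to about 8% less, holding the other variables fixed.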
The second question concerns how the remaining variables explain the probability that a car is being sold by its first owner.
Given the binary nature of the dependent variable, ‘first_owner’, and the mix of categorical and numeric independent variables, logistic regression is used for this analysis. Because enough independent variables are already available to predict the probability that a car is sold by its first owner, interactions are not included in the modelling process.
Beginning with the maximal model, which includes every variable, I use backward selection to build the model, reducing its complexity step by step until no further simplification is justified. The functions “step()” and “update()” are used to simplify the model.
Finally, the “coef()” and “exp()” functions are used to compute the odds ratios. Moreover, in light of the strong correlation identified between ‘min_mpg’ and ‘max_mpg’ during EDA, the variable ‘max_mpg’ is excluded from this procedure.
# Detach the mydf_new data frame that was attached for the preceding model.
detach(mydf_new)
# Eliminating 'max_mpg' on account of multicollinearity
mydf_logistic <- mydf[, names(mydf) != "max_mpg"]
# For easier variable access, attach the mydf_logistic data frame
attach(mydf_logistic)
# Constructing the logistic regression
logistic_model <- glm(first_owner~brand+year+mileage+engine_size+automatic_transmission+fuel+drivetrain+min_mpg+damaged+navigation_system+bluetooth+third_row_seating+heated_seats+price, family = binomial)
# Checking the summary of the fitted logistic regression model
summary(logistic_model)
##
## Call:
## glm(formula = first_owner ~ brand + year + mileage + engine_size +
## automatic_transmission + fuel + drivetrain + min_mpg + damaged +
## navigation_system + bluetooth + third_row_seating + heated_seats +
## price, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4987 -0.8015 -0.2971 0.7701 2.1649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.683e+02 9.791e+02 -0.274 0.7841
## brandAudi -8.669e-01 8.173e-01 -1.061 0.2889
## brandBMW -7.480e-01 1.122e+00 -0.667 0.5049
## brandCadillac 7.779e-01 9.586e-01 0.812 0.4171
## brandChevrolet -7.561e-01 9.390e-01 -0.805 0.4207
## brandFIAT 4.798e-01 8.612e-01 0.557 0.5774
## brandFord -4.060e-01 7.771e-01 -0.522 0.6014
## brandHonda 1.442e+00 8.957e-01 1.610 0.1075
## brandHyundai -4.995e-01 9.153e-01 -0.546 0.5852
## brandJaguar -6.271e-01 8.325e-01 -0.753 0.4513
## brandJeep 8.742e-02 8.947e-01 0.098 0.9222
## brandKia 1.034e+00 8.706e-01 1.187 0.2351
## brandLand -1.296e+00 1.004e+00 -1.291 0.1968
## brandLexus 4.676e-01 9.541e-01 0.490 0.6241
## brandMaserati -1.846e+00 1.050e+00 -1.759 0.0786 .
## brandMazda 7.898e-01 8.176e-01 0.966 0.3341
## brandMercedes-Benz -6.615e-01 8.151e-01 -0.812 0.4170
## brandMINI 8.340e-01 8.313e-01 1.003 0.3157
## brandMitsubishi 1.371e+00 8.687e-01 1.578 0.1145
## brandNissan 2.192e-01 8.413e-01 0.261 0.7944
## brandPorsche -6.165e-01 1.323e+00 -0.466 0.6412
## brandToyota 1.692e+00 1.041e+00 1.625 0.1042
## brandVolkswagen 3.200e-03 7.635e-01 0.004 0.9967
## brandVolvo 4.953e-01 9.207e-01 0.538 0.5906
## year 1.267e-01 5.737e-02 2.208 0.0273 *
## mileage -1.329e-05 5.792e-06 -2.295 0.0217 *
## engine_size -3.151e-02 1.907e-01 -0.165 0.8688
## automatic_transmission1 -8.706e-01 5.171e-01 -1.684 0.0923 .
## fuelElectric 1.250e+01 9.723e+02 0.013 0.9897
## fuelGPL 1.410e+01 9.723e+02 0.015 0.9884
## fuelHybrid 1.613e+01 9.723e+02 0.017 0.9868
## fuelPetrol 1.444e+01 9.723e+02 0.015 0.9882
## fuelUnknown 1.339e+01 9.723e+02 0.014 0.9890
## drivetrainFront-wheel Drive 1.049e-01 3.711e-01 0.283 0.7774
## drivetrainRear-wheel Drive -7.144e-01 4.395e-01 -1.626 0.1040
## drivetrainUnknown -1.347e+01 1.425e+03 -0.009 0.9925
## min_mpg -4.888e-02 3.258e-02 -1.500 0.1335
## damaged1 -6.316e-01 3.079e-01 -2.051 0.0403 *
## navigation_system1 2.517e-01 3.316e-01 0.759 0.4479
## bluetooth1 -1.012e+00 5.353e-01 -1.890 0.0587 .
## third_row_seating1 5.527e-01 4.129e-01 1.339 0.1806
## heated_seats1 8.098e-02 2.841e-01 0.285 0.7756
## price 5.434e-05 2.575e-05 2.110 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.85 on 409 degrees of freedom
## Residual deviance: 399.52 on 367 degrees of freedom
## AIC: 485.52
##
## Number of Fisher Scoring iterations: 15
Let us now simplify the model using the step() function.
# Applying step() function to logistic_model
logistic_modelB <- step(logistic_model)
## Start: AIC=485.52
## first_owner ~ brand + year + mileage + engine_size + automatic_transmission +
## fuel + drivetrain + min_mpg + damaged + navigation_system +
## bluetooth + third_row_seating + heated_seats + price
##
## Df Deviance AIC
## - brand 23 426.62 466.62
## - drivetrain 3 402.63 482.63
## - engine_size 1 399.55 483.55
## - heated_seats 1 399.60 483.60
## - navigation_system 1 400.10 484.10
## - third_row_seating 1 401.34 485.34
## <none> 399.52 485.52
## - min_mpg 1 401.93 485.93
## - automatic_transmission 1 402.30 486.30
## - bluetooth 1 403.27 487.27
## - damaged 1 403.83 487.83
## - price 1 404.12 488.12
## - fuel 5 412.15 488.15
## - year 1 404.91 488.91
## - mileage 1 404.94 488.94
##
## Step: AIC=466.62
## first_owner ~ year + mileage + engine_size + automatic_transmission +
## fuel + drivetrain + min_mpg + damaged + navigation_system +
## bluetooth + third_row_seating + heated_seats + price
##
## Df Deviance AIC
## - engine_size 1 426.62 464.62
## - min_mpg 1 426.62 464.62
## - navigation_system 1 426.78 464.78
## - heated_seats 1 426.79 464.79
## <none> 426.62 466.62
## - automatic_transmission 1 428.79 466.79
## - price 1 429.08 467.08
## - damaged 1 430.17 468.17
## - drivetrain 3 434.38 468.38
## - bluetooth 1 432.90 470.90
## - mileage 1 433.01 471.01
## - fuel 5 441.66 471.66
## - third_row_seating 1 435.04 473.04
## - year 1 436.58 474.58
##
## Step: AIC=464.62
## first_owner ~ year + mileage + automatic_transmission + fuel +
## drivetrain + min_mpg + damaged + navigation_system + bluetooth +
## third_row_seating + heated_seats + price
##
## Df Deviance AIC
## - min_mpg 1 426.62 462.62
## - navigation_system 1 426.78 462.78
## - heated_seats 1 426.79 462.79
## <none> 426.62 464.62
## - automatic_transmission 1 428.79 464.79
## - price 1 429.73 465.73
## - damaged 1 430.18 466.18
## - drivetrain 3 434.49 466.49
## - bluetooth 1 432.90 468.90
## - mileage 1 433.74 469.74
## - fuel 5 441.79 469.79
## - third_row_seating 1 435.71 471.71
## - year 1 437.66 473.66
##
## Step: AIC=462.62
## first_owner ~ year + mileage + automatic_transmission + fuel +
## drivetrain + damaged + navigation_system + bluetooth + third_row_seating +
## heated_seats + price
##
## Df Deviance AIC
## - navigation_system 1 426.78 460.78
## - heated_seats 1 426.79 460.79
## <none> 426.62 462.62
## - automatic_transmission 1 428.80 462.80
## - price 1 429.93 463.93
## - damaged 1 430.18 464.18
## - drivetrain 3 434.51 464.51
## - bluetooth 1 432.91 466.91
## - fuel 5 441.80 467.80
## - mileage 1 433.93 467.93
## - third_row_seating 1 435.72 469.72
## - year 1 437.72 471.72
##
## Step: AIC=460.78
## first_owner ~ year + mileage + automatic_transmission + fuel +
## drivetrain + damaged + bluetooth + third_row_seating + heated_seats +
## price
##
## Df Deviance AIC
## - heated_seats 1 427.01 459.01
## <none> 426.78 460.78
## - automatic_transmission 1 428.92 460.92
## - damaged 1 430.31 462.31
## - drivetrain 3 434.62 462.62
## - price 1 431.26 463.26
## - bluetooth 1 432.91 464.91
## - fuel 5 441.80 465.80
## - mileage 1 433.93 465.93
## - third_row_seating 1 435.79 467.79
## - year 1 437.77 469.77
##
## Step: AIC=459.01
## first_owner ~ year + mileage + automatic_transmission + fuel +
## drivetrain + damaged + bluetooth + third_row_seating + price
##
## Df Deviance AIC
## <none> 427.01 459.01
## - automatic_transmission 1 429.19 459.19
## - damaged 1 430.55 460.55
## - drivetrain 3 434.79 460.79
## - price 1 431.69 461.69
## - bluetooth 1 433.06 463.06
## - fuel 5 441.84 463.84
## - mileage 1 434.18 464.18
## - third_row_seating 1 436.81 466.81
## - year 1 438.26 468.26
# Checking the summary of the newly developed model
summary(logistic_modelB)
##
## Call:
## glm(formula = first_owner ~ year + mileage + automatic_transmission +
## fuel + drivetrain + damaged + bluetooth + third_row_seating +
## price, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3089 -0.8709 -0.3404 0.9059 2.6759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.119e+02 9.865e+02 -0.316 0.75191
## year 1.479e-01 4.832e-02 3.060 0.00221 **
## mileage -1.314e-05 4.992e-06 -2.631 0.00850 **
## automatic_transmission1 -7.383e-01 4.942e-01 -1.494 0.13522
## fuelElectric 1.269e+01 9.817e+02 0.013 0.98969
## fuelGPL 1.375e+01 9.817e+02 0.014 0.98883
## fuelHybrid 1.623e+01 9.817e+02 0.017 0.98681
## fuelPetrol 1.458e+01 9.817e+02 0.015 0.98815
## fuelUnknown 1.375e+01 9.817e+02 0.014 0.98882
## drivetrainFront-wheel Drive 5.258e-01 3.204e-01 1.641 0.10076
## drivetrainRear-wheel Drive -7.211e-01 3.909e-01 -1.845 0.06506 .
## drivetrainUnknown -1.221e+01 1.519e+03 -0.008 0.99359
## damaged1 -5.362e-01 2.877e-01 -1.864 0.06235 .
## bluetooth1 -1.159e+00 4.838e-01 -2.395 0.01662 *
## third_row_seating1 1.047e+00 3.433e-01 3.049 0.00230 **
## price 3.713e-05 1.729e-05 2.147 0.03176 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.85 on 409 degrees of freedom
## Residual deviance: 427.01 on 394 degrees of freedom
## AIC: 459.01
##
## Number of Fisher Scoring iterations: 15
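Let us now remove insignificant terms using the update() function. None of the fuel levels is individually significant, and their very large standard errors (about 980) suggest that some fuel categories are too sparse to estimate reliably.
# Applying update() function to remove the insignificant term 'fuel'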
logistic_modelC <- update(logistic_modelB,~.-fuel)
# Checking the model summary
summary(logistic_modelC)
##
## Call:
## glm(formula = first_owner ~ year + mileage + automatic_transmission +
## drivetrain + damaged + bluetooth + third_row_seating + price,
## family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2167 -0.8830 -0.3815 0.9244 2.7031
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.086e+02 9.623e+01 -3.207 0.00134 **
## year 1.534e-01 4.782e-02 3.207 0.00134 **
## mileage -1.145e-05 4.808e-06 -2.382 0.01720 *
## automatic_transmission1 -7.598e-01 4.902e-01 -1.550 0.12114
## drivetrainFront-wheel Drive 5.305e-01 3.130e-01 1.695 0.09012 .
## drivetrainRear-wheel Drive -6.919e-01 3.854e-01 -1.795 0.07261 .
## drivetrainUnknown -1.007e+01 5.901e+02 -0.017 0.98639
## damaged1 -4.951e-01 2.803e-01 -1.766 0.07734 .
## bluetooth1 -1.077e+00 4.630e-01 -2.327 0.01996 *
## third_row_seating1 1.028e+00 3.340e-01 3.079 0.00208 **
## price 3.823e-05 1.664e-05 2.298 0.02156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.85 on 409 degrees of freedom
## Residual deviance: 441.84 on 399 degrees of freedom
## AIC: 463.84
##
## Number of Fisher Scoring iterations: 13
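One way to check the removal of ‘fuel’ as a whole is a likelihood-ratio test on the change in deviance. The sketch below uses only the deviances and degrees of freedom printed in the two summaries above; choosing this test is an addition, not part of the original analysis.

```r
# Deviances and residual df copied from the two summaries above.
dev_with_fuel    <- 427.01; df_with_fuel    <- 394   # logistic_modelB
dev_without_fuel <- 441.84; df_without_fuel <- 399   # logistic_modelC

lr_stat <- dev_without_fuel - dev_with_fuel   # increase in deviance
lr_df   <- df_without_fuel - df_with_fuel     # one df per fuel dummy
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```

The deviance rises by 14.83 on 5 degrees of freedom, a p-value of about 0.01, which agrees with step() keeping ‘fuel’ on AIC grounds. Dropping it is therefore a simplification choice rather than one the data demand. With the fitted objects available, ‘anova(logistic_modelC, logistic_modelB, test = "Chisq")’ performs the same comparison.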
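The logistic_modelC is taken as the minimum adequate model here, although a few of its terms, such as automatic_transmission, are not significant at the 5% level. In equation form:
\[ \begin{aligned} \log\left(\frac{p}{1-p}\right) = {} & -308.6 + 0.1534\times{year} - 0.00001145\times{mileage} - 0.7598\times{automatictransmission1} \\ & + 0.5305\times{drivetrainFrontWheel} - 0.6919\times{drivetrainRearWheel} - 10.07\times{drivetrainUnknown} \\ & - 0.4951\times{damaged1} - 1.077\times{bluetooth1} + 1.028\times{thirdrowseating1} + 0.00003823\times{price} \end{aligned} \]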
# Extract the coefficients from logistic_modelC and exponentiate them to obtain odds ratios
exp(coef(logistic_modelC))
## (Intercept) year
## 9.601309e-135 1.165763e+00
## mileage automatic_transmission1
## 9.999885e-01 4.677772e-01
## drivetrainFront-wheel Drive drivetrainRear-wheel Drive
## 1.699733e+00 5.006388e-01
## drivetrainUnknown damaged1
## 4.250915e-05 6.094908e-01
## bluetooth1 third_row_seating1
## 3.404474e-01 2.796655e+00
## price
## 1.000038e+00
The odds ratios show how each variable changes the odds that a car is being sold by its first owner.
Variables such as ‘mileage’ (about 0.99999 per unit) and ‘bluetooth’ (0.34), with odds ratios below 1, are associated with lower odds that a car is sold by its first owner. Put simply, higher mileage or the presence of Bluetooth goes with a reduced probability of a first-owner sale.
Conversely, the ‘year’ variable, with an odds ratio of about 1.17, is associated with higher odds: each additional production year multiplies the odds of a first-owner sale by roughly 1.17.
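Odds ratios very close to 1, such as those for mileage and price, are easier to read after rescaling to a more meaningful step. The sketch below uses coefficients from summary(logistic_modelC) above; the step of 10,000 units is an arbitrary choice made for illustration.

```r
# Log-odds coefficients copied from summary(logistic_modelC) above.
b_year    <- 0.1534
b_mileage <- -1.145e-05
b_price   <- 3.823e-05

or_year        <- exp(b_year)             # per additional production year
or_mileage_10k <- exp(b_mileage * 10000)  # per additional 10,000 units of mileage
or_price_10k   <- exp(b_price * 10000)    # per additional 10,000 units of price

round(c(year = or_year, mileage_10k = or_mileage_10k,
        price_10k = or_price_10k), 3)
```

On this scale, every additional 10,000 units of mileage multiplies the odds of a first-owner sale by about 0.89, which is a more visible effect than the per-unit value of 0.99999 suggests.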