1 Description

This project analyzed a real-world dataset as part of a competition. Its aim was to develop skills in answering specific questions using the provided dataset and its metadata.

2 Packages

# Importing 'validate' package for data quality analysis
library(validate)

# Importing 'ggplot2' package for visualizations
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:validate':
## 
##     expr
# Importing 'dplyr' package for data manipulation
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:validate':
## 
##     expr
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

3 1. Organise and clean the data

3.1 1.1 Subset the data into the specific dataset allocated

# To extract my personalized dataset from the larger provided dataset, I used my unique identifier, which was my registration number for the competition. For privacy reasons, the real identifier is replaced here with an arbitrary number.
SID <- 5

# SID mod 50 + 1
SIDoffset <- (SID %% 50) + 1

# Loading the data set
load("car-analysis-data.Rda")

# Now subset the car data set
# Pick every 50th observation starting from the offset
# Put into data frame named mydf
mydf <- cars.analysis[seq(from=SIDoffset,to=nrow(cars.analysis),by=50),]
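
The subsetting step can be illustrated on a small stand-in table. This is only a sketch: `toy` and its 200 rows are hypothetical, since the real `cars.analysis` object is not reproduced here.

```r
# Hypothetical stand-in for cars.analysis: 200 rows with a running id
toy <- data.frame(id = 1:200)

SID <- 5
SIDoffset <- (SID %% 50) + 1          # 5 %% 50 is 5, so the offset is 6

# Row positions: start at the offset, then take every 50th row
rows <- seq(from = SIDoffset, to = nrow(toy), by = 50)
toy_subset <- toy[rows, , drop = FALSE]

toy_subset$id                          # 6 56 106 156
```

Because the offset always lies between 1 and 50, every participant receives a different, non-overlapping slice of the full table.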

3.2 1.2 Data quality analysis plan

Systematic data quality assessment across all variables is essential.

  1. Null Values: Check all variables for missing values.
  2. Numeric Checks:
    - Check that ‘year’, ‘mileage’, ‘engine_size’, ‘min_mpg’, ‘max_mpg’, and ‘price’ are numerical variables.
    - Check that the value of ‘year’ is four digits long and greater than zero.
    - Check that ‘engine_size’ is greater than zero.
    - Check that the values of ‘mileage’, ‘min_mpg’, ‘max_mpg’, and ‘price’ are all non-negative.
    - Verify that ‘max_mpg’ values are greater than or equal to ‘min_mpg’.
  3. Categorical Variables:
    - Verify that the ‘brand’, ‘fuel’, and ‘drivetrain’ columns contain character data.
    - Check that the ‘damaged’, ‘first_owner’, ‘navigation_system’, ‘bluetooth’, ‘third_row_seating’, ‘automatic_transmission’, and ‘heated_seats’ variables are factors and contain only 0 and 1 values.
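
As a minimal sketch of how two of these checks translate into rules, the example below uses the validate package on a hypothetical three-row table (`toy` and the rule names are made up for illustration). Unlike the rules in the next section, which refer to `mydf$column`, this sketch refers to columns by bare name, which `validator()` also accepts.

```r
library(validate)

# Hypothetical three-row table containing the kinds of problems the plan targets
toy <- data.frame(
  engine_size = c(2.0, 0, NA),
  min_mpg     = c(20, 25, 18),
  max_mpg     = c(28, -30, 25)
)

# Rules refer to columns by name; confront() evaluates them against the data
rules <- validator(
  EngPositive = engine_size > 0,
  MaxGeMin    = max_mpg >= min_mpg
)

res <- summary(confront(toy, rules))
res[, c("name", "passes", "fails", "nNA")]
```

A comparison on a missing value is counted under `nNA` rather than `fails`, so the two columns should be read together.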

3.3 1.3 Data quality analysis findings

# Developing the validator function named 'data_checker' using validate package
data_checker <- validator(
        NullValues = !is.na(mydf), #Checking for missing values
        YearDigits = field_length(mydf$year, 4) , #Checking that the 'year' is four-digits long.
        Non_NegYear = mydf$year >=1, #Verifying that 'year' is bigger than zero
        Non_NegEngSize = mydf$engine_size > 0, #Verifying that 'engine_size' is bigger than zero
        MaxGreaterMin = mydf$max_mpg >= mydf$min_mpg, #Verifying 'max_mpg' is greater than or equal to 'min_mpg'.
        
        # Checking that 'mileage', 'min_mpg', 'max_mpg', and 'price' contain only non-negative values.
        Non_NegMileage = mydf$mileage >=0, 
        Non_NegMinmpg = mydf$min_mpg >= 0, 
        Non_NegMaxmpg = mydf$max_mpg >= 0,
        Non_NegPrice = mydf$price >= 0,
        
        # Checking that the 'damaged', 'first_owner', 'navigation_system', 'bluetooth', 'third_row_seating', 'automatic_transmission', and 'heated_seats' variables contain only 0 and 1 values.
        HasAuto = is.element(mydf$automatic_transmission,c(0,1)), 
        HasDamaged = is.element(mydf$damaged, c(0,1)), 
        FirstOwner = is.element(mydf$first_owner, c(0,1)),
        HasNavigation = is.element(mydf$navigation_system, c(0,1)),
        HasBluetooth = is.element(mydf$bluetooth, c(0,1)),
        Has3Row = is.element(mydf$third_row_seating, c(0,1)),
        HasHeated = is.element(mydf$heated_seats, c(0,1)),
        
        # Verifying the character variables
        CharaBrand = is.character(mydf$brand),
        CharaFuel = is.character(mydf$fuel),
        CharaDrive = is.character(mydf$drivetrain),
        
        #Verifying the factor variables
        FAuto = is.factor(mydf$automatic_transmission),
        FDamaged = is.factor(mydf$damaged),
        FFirstOwner = is.factor(mydf$first_owner),
        FNavigation = is.factor(mydf$navigation_system),
        FBluetooth = is.factor(mydf$bluetooth),
        F3Row = is.factor(mydf$third_row_seating),
        FHeated = is.factor(mydf$heated_seats),
        
        #Verifying the numerical variables
        NumYear = is.numeric(mydf$year),
        NumMileage = is.numeric(mydf$mileage),
        NumEngSize = is.numeric(mydf$engine_size),
        NumMinmpg = is.numeric(mydf$min_mpg),
        NumMaxmpg = is.numeric(mydf$max_mpg),
        NumPrice = is.numeric(mydf$price)
                          )

# Confronting 'mydf' and 'data_checker' and saving output in 'out'
out <- confront(mydf, data_checker)

# Generating summary of the output
summary(out)
##              name items passes fails nNA error warning
## 1      NullValues  6560   6413   147   0 FALSE   FALSE
## 2      YearDigits   410    410     0   0 FALSE   FALSE
## 3     Non_NegYear   410    410     0   0 FALSE   FALSE
## 4  Non_NegEngSize   410    380     1  29 FALSE   FALSE
## 5   MaxGreaterMin   410    355     2  53 FALSE   FALSE
## 6  Non_NegMileage   410    410     0   0 FALSE   FALSE
## 7   Non_NegMinmpg   410    357     0  53 FALSE   FALSE
## 8   Non_NegMaxmpg   410    355     2  53 FALSE   FALSE
## 9    Non_NegPrice   410    410     0   0 FALSE   FALSE
## 10        HasAuto   410    410     0   0 FALSE   FALSE
## 11     HasDamaged   410    406     4   0 FALSE   FALSE
## 12     FirstOwner   410    402     8   0 FALSE   FALSE
## 13  HasNavigation   410    410     0   0 FALSE   FALSE
## 14   HasBluetooth   410    410     0   0 FALSE   FALSE
## 15        Has3Row   410    410     0   0 FALSE   FALSE
## 16      HasHeated   410    410     0   0 FALSE   FALSE
## 17     CharaBrand     1      1     0   0 FALSE   FALSE
## 18      CharaFuel     1      1     0   0 FALSE   FALSE
## 19     CharaDrive     1      1     0   0 FALSE   FALSE
## 20          FAuto     1      0     1   0 FALSE   FALSE
## 21       FDamaged     1      0     1   0 FALSE   FALSE
## 22    FFirstOwner     1      0     1   0 FALSE   FALSE
## 23    FNavigation     1      0     1   0 FALSE   FALSE
## 24     FBluetooth     1      0     1   0 FALSE   FALSE
## 25          F3Row     1      0     1   0 FALSE   FALSE
## 26        FHeated     1      0     1   0 FALSE   FALSE
## 27        NumYear     1      1     0   0 FALSE   FALSE
## 28     NumMileage     1      1     0   0 FALSE   FALSE
## 29     NumEngSize     1      1     0   0 FALSE   FALSE
## 30      NumMinmpg     1      1     0   0 FALSE   FALSE
## 31      NumMaxmpg     1      1     0   0 FALSE   FALSE
## 32       NumPrice     1      1     0   0 FALSE   FALSE
##                                               expression
## 1                                           !is.na(mydf)
## 2                        field_length(mydf[["year"]], 4)
## 3                                    mydf[["year"]] >= 1
## 4                              mydf[["engine_size"]] > 0
## 5                 mydf[["max_mpg"]] >= mydf[["min_mpg"]]
## 6                                 mydf[["mileage"]] >= 0
## 7                                 mydf[["min_mpg"]] >= 0
## 8                                 mydf[["max_mpg"]] >= 0
## 9                                   mydf[["price"]] >= 0
## 10 is.element(mydf[["automatic_transmission"]], c(0, 1))
## 11                is.element(mydf[["damaged"]], c(0, 1))
## 12            is.element(mydf[["first_owner"]], c(0, 1))
## 13      is.element(mydf[["navigation_system"]], c(0, 1))
## 14              is.element(mydf[["bluetooth"]], c(0, 1))
## 15      is.element(mydf[["third_row_seating"]], c(0, 1))
## 16           is.element(mydf[["heated_seats"]], c(0, 1))
## 17                         is.character(mydf[["brand"]])
## 18                          is.character(mydf[["fuel"]])
## 19                    is.character(mydf[["drivetrain"]])
## 20           is.factor(mydf[["automatic_transmission"]])
## 21                          is.factor(mydf[["damaged"]])
## 22                      is.factor(mydf[["first_owner"]])
## 23                is.factor(mydf[["navigation_system"]])
## 24                        is.factor(mydf[["bluetooth"]])
## 25                is.factor(mydf[["third_row_seating"]])
## 26                     is.factor(mydf[["heated_seats"]])
## 27                            is.numeric(mydf[["year"]])
## 28                         is.numeric(mydf[["mileage"]])
## 29                     is.numeric(mydf[["engine_size"]])
## 30                         is.numeric(mydf[["min_mpg"]])
## 31                         is.numeric(mydf[["max_mpg"]])
## 32                           is.numeric(mydf[["price"]])

Issues with Data Quality Noted from the Summary:
.Found 147 missing values.
.1 value in the ‘engine_size’ variable is not greater than zero.
.2 values in the ‘max_mpg’ variable are less than the corresponding ‘min_mpg’.
.2 values in the ‘max_mpg’ variable are negative.
.4 values in the ‘damaged’ variable are neither 0 nor 1.
.8 values in the ‘first_owner’ variable are neither 0 nor 1.
.None of the variables expected to be factors is stored as a factor.

3.4 1.4 Data cleaning

Issues with Data Quality as noted from 1.3:
.Found 147 missing values.
.1 value in the ‘engine_size’ variable is not greater than zero (left unchanged in the cleaning below).
.2 values in the ‘max_mpg’ variable are less than the corresponding ‘min_mpg’.
.2 values in the ‘max_mpg’ variable are negative.
.4 values in the ‘damaged’ variable are neither 0 nor 1.
.8 values in the ‘first_owner’ variable are neither 0 nor 1.
.None of the variables expected to be factors is stored as a factor.

Addressing issues:
.We’ll use colSums(is.na()) to locate missing values, then replace them with the mean for numerical variables and the mode for categorical variables. These are simple, commonly used imputations that keep every row in the dataset.
.The two ‘max_mpg’ values that fall below their corresponding ‘min_mpg’ values are probably the same two negative ‘max_mpg’ values. We’ll fix this by replacing the negative values with their absolute values.
.Values other than 0 or 1 in the ‘damaged’ and ‘first_owner’ variables will be replaced with the mode (most frequent value), which is a natural choice for binary variables.
.The as.factor() function will convert the variables that are expected to be factors.
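
The imputation rules above can be sketched as two small helpers. The function names `impute_mean` and `impute_mode` and the toy vectors are hypothetical and are not part of the analysis code that follows, which applies the same idea column by column.

```r
# Replace NAs in a numeric vector with the rounded mean of the observed values
impute_mean <- function(x, digits = 1) {
  x[is.na(x)] <- round(mean(x, na.rm = TRUE), digits)
  x
}

# Replace NAs with the most frequent observed value (first one wins on ties)
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  values <- unique(observed)
  mode_value <- values[which.max(tabulate(match(observed, values)))]
  x[is.na(x)] <- mode_value
  x
}

# Hypothetical toy vectors
engine  <- c(2.0, NA, 3.0, 2.5)
damaged <- c(0, 1, 0, NA, 0, 1)

impute_mean(engine)    # the NA becomes 2.5
impute_mode(damaged)   # the NA becomes 0
abs(c(-30, 28))        # 30 28, the sign fix described for max_mpg
```

Mean imputation leaves the column mean unchanged but shrinks its variance, which is worth keeping in mind when reading the later summary statistics.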

# Counting the missing values per variables
colSums(is.na(mydf))
##                  brand                   year                mileage 
##                      0                      0                      0 
##            engine_size automatic_transmission                   fuel 
##                     29                      0                      0 
##             drivetrain                min_mpg                max_mpg 
##                      0                     53                     53 
##                damaged            first_owner      navigation_system 
##                      4                      8                      0 
##              bluetooth      third_row_seating           heated_seats 
##                      0                      0                      0 
##                  price 
##                      0
# Mean imputations for numerical variables
mydf$engine_size[is.na(mydf$engine_size)] <- round(mean(mydf$engine_size, na.rm = TRUE), 1)
mydf$min_mpg[is.na(mydf$min_mpg)] <- round(mean(mydf$min_mpg, na.rm = TRUE), 1)
mydf$max_mpg[is.na(mydf$max_mpg)] <- round(mean(mydf$max_mpg, na.rm = TRUE), 1)

# Running the table function to find the most frequent values in 'damaged' and 'first_owner' variables
table(mydf$damaged)
## 
##   0   1 
## 304 102
table(mydf$first_owner)
## 
##   0   1 
## 216 186
# Mode imputations for categorical variables
mydf$damaged[is.na(mydf$damaged)] <- 0
mydf$first_owner[is.na(mydf$first_owner)] <- 0

# Cleaning the negative values in 'max_mpg' variable
## Starting by checking the values
mydf$max_mpg[mydf$max_mpg < 0]
## [1] -30 -30
## Both are -30, now they are going to be replaced by 30
mydf$max_mpg[mydf$max_mpg < 0] <- 30

# Changing to factors
mydf$automatic_transmission <- as.factor(mydf$automatic_transmission)
mydf$damaged <- as.factor(mydf$damaged)
mydf$first_owner <- as.factor(mydf$first_owner)
mydf$navigation_system <- as.factor(mydf$navigation_system)
mydf$bluetooth <- as.factor(mydf$bluetooth)
mydf$third_row_seating <- as.factor(mydf$third_row_seating)
mydf$heated_seats <- as.factor(mydf$heated_seats)

# Character variables turned into factors.
mydf$brand <- as.factor(mydf$brand)
mydf$fuel <- as.factor(mydf$fuel)
mydf$drivetrain <- as.factor(mydf$drivetrain)

To wrap up this section, cleaning and re-evaluation confirmed that:
.The ‘max_mpg’ values that were lower than their ‘min_mpg’ values were caused by negative values in the ‘max_mpg’ variable.
.The values other than 0 or 1 flagged in the ‘damaged’ and ‘first_owner’ variables were missing values.
.Furthermore, to simplify EDA, all character variables have been converted to factors.
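
The second point follows from how `is.element()` treats `NA`. In the short sketch below, `flag` is a hypothetical binary column with one missing value: the membership test returns `FALSE` for the `NA`, so it is counted as a failure, whereas a comparison such as `>=` returns `NA`.

```r
flag <- c(0, 1, NA, 1)         # hypothetical binary column with one missing value

is.element(flag, c(0, 1))      # TRUE TRUE FALSE TRUE: the NA counts as a failure
flag >= 0                      # TRUE TRUE NA TRUE: the NA stays NA
```

This matches the validation summary in 1.3, where the 0/1 checks report failures with `nNA` equal to 0, while the mpg comparisons report 53 under `nNA`.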

4 2. Exploratory Data Analysis (EDA)

4.1 2.1 EDA plan

The strategy:

  1. Data overview: The summary() function will be used to get a high-level overview of the data, and the str() function will be used to confirm its structure.

  2. Assessment of numerical variables: Histograms will show the distribution of each numerical variable. Correlations between numerical variables will be computed with the cor() function and visualised with scatter plots, with an emphasis on the relationship between price (the variable of interest in the research questions) and the other numerical variables.

  3. Assessment of categorical variables: Using bar graphs and table analysis, I will examine the distribution of categorical variables.

  4. Assessment of both numerical and categorical variables at the same time: Box plots will be used to show the relationships between numerical and categorical variables in connection with the research questions.

4.2 2.2 EDA execution

4.2.1 Data overview

# Running summary
summary(mydf)
##         brand          year         mileage        engine_size  
##  Volkswagen: 28   Min.   :1972   Min.   :     0   Min.   :0.00  
##  Ford      : 26   1st Qu.:2015   1st Qu.: 24304   1st Qu.:2.00  
##  Mazda     : 23   Median :2018   Median : 47021   Median :2.50  
##  Nissan    : 22   Mean   :2017   Mean   : 54013   Mean   :2.78  
##  Audi      : 21   3rd Qu.:2020   3rd Qu.: 73918   3rd Qu.:3.50  
##  FIAT      : 20   Max.   :2023   Max.   :299999   Max.   :6.60  
##  (Other)   :270                                                 
##  automatic_transmission       fuel                 drivetrain     min_mpg     
##  0: 28                  Diesel  :  4   Four-wheel Drive :205   Min.   : 0.00  
##  1:382                  Electric: 10   Front-wheel Drive:148   1st Qu.:18.00  
##                         GPL     :  6   Rear-wheel Drive : 55   Median :21.10  
##                         Hybrid  : 16   Unknown          :  2   Mean   :21.14  
##                         Pertol  :  2                           3rd Qu.:23.00  
##                         Petrol  :369                           Max.   :72.00  
##                         Unknown :  3                                          
##     max_mpg      damaged first_owner navigation_system bluetooth
##  Min.   : 0.00   0:308   0:224       0:230             0: 58    
##  1st Qu.:25.00   1:102   1:186       1:180             1:352    
##  Median :28.10                                                  
##  Mean   :28.38                                                  
##  3rd Qu.:31.00                                                  
##  Max.   :80.00                                                  
##                                                                 
##  third_row_seating heated_seats     price      
##  0:341             0:209        Min.   : 4500  
##  1: 69             1:201        1st Qu.:17996  
##                                 Median :27207  
##                                 Mean   :27932  
##                                 3rd Qu.:36297  
##                                 Max.   :54392  
## 

From the summary above:
.This dataset includes a wide range of car brands. Volkswagen is the most common brand (28 cars), followed by Ford (26) and Mazda (23). Cars in this dataset were manufactured between 1972 and 2023, with the middle half falling between 2015 and 2020, indicating a prevalence of newer models.

.Mileage ranges from 0 to 299,999, and the mean (54,013) is greater than the median (47,021), indicating that the distribution is skewed to the right. This implies that a few cars have extremely high mileage, whereas the bulk of cars have low or moderate mileage. Engine sizes range from 0 to 6.6, and, as with mileage, the mean (2.78) exceeds the median (2.50), implying that a small share of cars have larger engines.

.Most cars have automatic transmissions and run on petrol, and half have four-wheel drive. Minimum and maximum fuel economy range from 0 to 72 and 0 to 80, respectively. For each, the mean and median are close (21.14 vs 21.10 and 28.38 vs 28.10), which suggests a roughly symmetric distribution.

.The majority of cars have no visible damage, are not first-owner vehicles, lack a navigation system, but do have Bluetooth connectivity. Most cars lack third-row seating, and slightly more than half lack heated seats. Prices range from 4,500 to 54,392, and the mean (27,932) is slightly higher than the median (27,207), which points to a mildly right-skewed distribution with a few high-priced cars among mostly cheaper and moderately priced ones.
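
The mean-versus-median reasoning used above can be checked on a toy vector. The values below are hypothetical and chosen so that one very large reading pulls the mean above the median.

```r
# Hypothetical mileage values with one very large reading
mileage <- c(1000, 20000, 30000, 40000, 250000)

mean(mileage)                      # 68200
median(mileage)                    # 30000
mean(mileage) > median(mileage)    # TRUE, consistent with a right skew
```

A mean above the median is a quick indication of right skew rather than a formal test, so the histograms later in this section remain the main evidence.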

# Confirming the structure
str(mydf)
## 'data.frame':    410 obs. of  16 variables:
##  $ brand                 : Factor w/ 24 levels "Alfa","Audi",..: 22 8 9 9 7 10 19 20 16 20 ...
##  $ year                  : num  2022 2012 2021 2020 2007 ...
##  $ mileage               : num  52250 150842 33267 48816 14268 ...
##  $ engine_size           : num  2.5 3.5 2 2 4.6 2 3 1.8 2.5 1.6 ...
##  $ automatic_transmission: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
##  $ fuel                  : Factor w/ 7 levels "Diesel","Electric",..: 4 6 6 6 6 6 6 6 6 6 ...
##  $ drivetrain            : Factor w/ 4 levels "Four-wheel Drive",..: 1 2 2 2 3 1 2 2 1 1 ...
##  $ min_mpg               : num  35 18 21 21.1 17 20 24 31 27 21.1 ...
##  $ max_mpg               : num  35 25 26 28.1 25 28 30 36 32 28.1 ...
##  $ damaged               : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 1 1 1 ...
##  $ first_owner           : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 1 1 1 1 ...
##  $ navigation_system     : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ bluetooth             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ third_row_seating     : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 2 1 ...
##  $ heated_seats          : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 1 2 1 ...
##  $ price                 : num  35895 11999 19500 26951 26000 ...

.The data is confirmed to be a data frame with 410 observations and 16 variables, and every variable has the expected class.

4.2.2 Assessment of numerical variables.

# Histogram for year
ggplot(mydf, aes(x = year)) + 
  geom_histogram(bins = 30) +
  ggtitle("Year Distribution") +
  xlab("Year") +
  ylab("Frequency")

# Histogram for mileage
ggplot(mydf, aes(x = mileage)) + 
  geom_histogram(bins = 30) +
  ggtitle("Mileage Distribution") +
  xlab("Mileage") +
  ylab("Frequency")

# Histogram for engine size
ggplot(mydf, aes(x = engine_size)) + 
  geom_histogram(bins = 30) +
  ggtitle("Engine Size Distribution") +
  xlab("Engine size") +
  ylab("Frequency")

# Histogram for min_mpg
ggplot(mydf, aes(x = min_mpg)) + 
  geom_histogram(bins = 30) +
  ggtitle("Minimum fuel efficiency of the car Distribution") +
  xlab("Min_mpg") +
  ylab("Frequency")

# Histogram for max_mpg
ggplot(mydf, aes(x = max_mpg)) + 
  geom_histogram(bins = 30) +
  ggtitle("Maximum fuel efficiency of the car Distribution") +
  xlab("Max_mpg") +
  ylab("Frequency")

# Histogram for price
ggplot(mydf, aes(x = price)) + 
  geom_histogram(bins = 30) +
  ggtitle("Car Price Distribution") +
  xlab("Price") +
  ylab("Frequency")

Looking at the histograms:
.Few cars in the dataset were produced before 2010. The majority were produced from 2015 onward, with the largest number produced between 2015 and 2020.
.The mileage distribution is skewed to the right, indicating that the majority of cars have less than 80,000 miles. The most common mileages are between 25,000 and 50,000.
.Many cars have an engine size of 2.
.Most cars have a minimum fuel efficiency of about 20 and a maximum of roughly 27. Each of these distributions looks roughly symmetric, which corresponds to what we saw in the summary statistics.
.The most common price is roughly $12,000.

# Correlation analysis

# Selecting only numerical variables
numerical_columns <- mydf %>% select_if(is.numeric)

# For readability and clarity, the correlation values are rounded to two decimal places.
round(cor(numerical_columns), 2)
##              year mileage engine_size min_mpg max_mpg price
## year         1.00   -0.45       -0.15   -0.04   -0.04  0.48
## mileage     -0.45    1.00        0.21   -0.07   -0.07 -0.60
## engine_size -0.15    0.21        1.00   -0.27   -0.27  0.24
## min_mpg     -0.04   -0.07       -0.27    1.00    0.92 -0.22
## max_mpg     -0.04   -0.07       -0.27    0.92    1.00 -0.22
## price        0.48   -0.60        0.24   -0.22   -0.22  1.00

Examining the correlation matrix:
.The correlations are a mix of positive and negative values. Only two exceed 0.5 in absolute value: mileage with price, and min_mpg with max_mpg.
.The negative correlation between year of production and mileage (-0.45) implies that newer cars generally have lower mileage.
.The positive correlation between price and year of production (0.48) suggests that more recent cars generally command higher prices.
.Mileage is also associated with price. The negative correlation (-0.60) indicates that price tends to decrease as mileage increases.
.Engine size (0.24), min_mpg (-0.22) and max_mpg (-0.22) are only weakly correlated with price. However, min_mpg and max_mpg are strongly correlated with each other (0.92), so they carry largely the same information and only one of them is needed as a predictor of car prices.
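
The strongest pairs can also be pulled out of the matrix programmatically. The sketch below re-enters the rounded correlation values printed above as a matrix (the object names `cm`, `idx` and `strong`, and the 0.5 cut-off, are illustrative choices) and lists every pair whose absolute correlation exceeds the cut-off.

```r
vars <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")

# Rounded correlations copied from the printed matrix
cm <- matrix(c(
   1.00, -0.45, -0.15, -0.04, -0.04,  0.48,
  -0.45,  1.00,  0.21, -0.07, -0.07, -0.60,
  -0.15,  0.21,  1.00, -0.27, -0.27,  0.24,
  -0.04, -0.07, -0.27,  1.00,  0.92, -0.22,
  -0.04, -0.07, -0.27,  0.92,  1.00, -0.22,
   0.48, -0.60,  0.24, -0.22, -0.22,  1.00
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and only where |r| exceeds 0.5
idx <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
strong   # min_mpg/max_mpg (0.92) and mileage/price (-0.60)
```

With the real data, `cm` would simply be `round(cor(numerical_columns), 2)`.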

# Scatterplot analysis

# Year versus Price
ggplot(mydf, aes(x = year, y = price)) +
  geom_point() +
  labs(title = "Year Versus Price", x = "Year", y = "Price")

# Mileage versus Price
ggplot(mydf, aes(x = mileage, y = price)) +
  geom_point() +
  labs(title = "Mileage Versus Price", x = "Mileage", y = "Price")

# Engine Size versus Price
ggplot(mydf, aes(x = engine_size, y = price)) +
  geom_point() +
  labs(title = "Engine Size Versus Price", x = "Engine Size", y = "Price")

# Min_mpg versus Price
ggplot(mydf, aes(x = min_mpg, y = price)) +
  geom_point() +
  labs(title = "Min_mpg Versus Price", x = "Min_mpg", y = "Price")

# Max_mpg versus Price
ggplot(mydf, aes(x = max_mpg, y = price)) +
  geom_point() +
  labs(title = "Max_mpg Versus Price", x = "Max_mpg", y = "Price")

.The scatter plots visually confirm the observations from the correlation matrix.

4.2.3 Assessment of categorical variables.

# Table analysis
# To simplify the analysis, the brand, fuel, and drivetrain tables are sorted in descending order.

# Brands
sort(table(mydf$brand), decreasing = TRUE)
## 
##    Volkswagen          Ford         Mazda        Nissan          Audi 
##            28            26            23            22            21 
##          FIAT          MINI          Jeep     Chevrolet         Honda 
##            20            20            19            18            18 
## Mercedes-Benz          Alfa       Hyundai           Kia      Cadillac 
##            18            17            17            17            16 
##        Jaguar         Lexus    Mitsubishi         Volvo      Maserati 
##            16            15            15            15            14 
##          Land        Toyota           BMW       Porsche 
##            12            12             7             4
# Fuel
sort(table(mydf$fuel), decreasing = TRUE)
## 
##   Petrol   Hybrid Electric      GPL   Diesel  Unknown   Pertol 
##      369       16       10        6        4        3        2
#Replacing "Pertol" with "Petrol"
mydf$fuel[mydf$fuel=="Pertol"] <- "Petrol"

# Dropping the empty level on the 'fuel'
mydf$fuel <- droplevels(mydf$fuel)

# Drivetrain
sort(table(mydf$drivetrain), decreasing = TRUE)
## 
##  Four-wheel Drive Front-wheel Drive  Rear-wheel Drive           Unknown 
##               205               148                55                 2
# Tables for the automatic_transmission, damaged, first_owner, navigation_system, bluetooth, third_row_seating and heated_seats variables
table(mydf$automatic_transmission)
## 
##   0   1 
##  28 382
table(mydf$damaged)
## 
##   0   1 
## 308 102
table(mydf$first_owner)
## 
##   0   1 
## 224 186
table(mydf$navigation_system)
## 
##   0   1 
## 230 180
table(mydf$bluetooth)
## 
##   0   1 
##  58 352
table(mydf$third_row_seating)
## 
##   0   1 
## 341  69
table(mydf$heated_seats)
## 
##   0   1 
## 209 201

Looking at the table analysis:
.Volkswagen is the most common brand (28 cars), followed by Ford (26) and Mazda (23). Porsche has the fewest cars (4), followed by BMW (7), then Land and Toyota (12 each).
.Petrol cars dominate (371 once the ‘Pertol’ typo is corrected), while only four cars use Diesel.
.Four-wheel Drive cars are the most common (205), followed by Front-wheel Drive (148) and Rear-wheel Drive (55).
.The majority of cars have no visible damage, are not first-owner vehicles, lack a navigation system, but do have Bluetooth connectivity. Most cars lack third-row seating, and slightly more than half lack heated seats.
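
The ‘Pertol’ correction made in the fuel table can also be done by renaming the factor level, which merges it into the existing level and removes the need for droplevels(). This is an alternative sketch on hypothetical values, not the approach used above.

```r
fuel <- factor(c("Petrol", "Pertol", "Diesel", "Petrol"))   # hypothetical values

# Renaming a level to an existing one merges the two levels
levels(fuel)[levels(fuel) == "Pertol"] <- "Petrol"

levels(fuel)   # "Diesel" "Petrol"
table(fuel)    # Diesel 1, Petrol 3
```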

# bar graph

# for brand
ggplot(mydf, aes(x = brand)) +
  geom_bar() +
  labs(title = "brand bar graph", x = "brand", y = "Frequency")

# for fuel
ggplot(mydf, aes(x = fuel)) +
  geom_bar() +
  labs(title = "Fuel bar graph", x = "Type of fuel", y = "Frequency")

# for drive train
ggplot(mydf, aes(x = drivetrain)) +
  geom_bar() +
  labs(title = "drivetrain bar graph", x = "drivetrain", y = "Frequency")

# for automatic_transmission
ggplot(mydf, aes(x = automatic_transmission)) +
  geom_bar() +
  labs(title = "automatic_transmission bar graph", x = "automatic_transmission", y = "Frequency")

# for damaged
ggplot(mydf, aes(x = damaged)) +
  geom_bar() +
  labs(title = "damaged bar graph", x = "damaged", y = "Frequency")

# for first_owner
ggplot(mydf, aes(x = first_owner)) +
  geom_bar() +
  labs(title = "first_owner bar graph", x = "first_owner", y = "Frequency")

# for navigation_system
ggplot(mydf, aes(x = navigation_system)) +
  geom_bar() +
  labs(title = "navigation_system bar graph", x = "navigation_system", y = "Frequency")

# for bluetooth
ggplot(mydf, aes(x = bluetooth)) +
  geom_bar() +
  labs(title = "bluetooth bar graph", x = "bluetooth", y = "Frequency")

# for third_row_seating
ggplot(mydf, aes(x = third_row_seating)) +
  geom_bar() +
  labs(title = "third_row_seating bar graph", x = "third_row_seating", y = "Frequency")

# for heated_seats
ggplot(mydf, aes(x = heated_seats)) +
  geom_bar() +
  labs(title = "heated_seats bar graph", x = "heated_seats", y = "Frequency")

Looking at the bar graphs:
.The visual representation confirms the observations from the summary statistics and table analysis: Volkswagen is the most common brand, and most cars use petrol, have no visible damage, are not first-owner vehicles, lack a navigation system, but do have Bluetooth connectivity. Four-wheel drive is the most common drivetrain. Most cars lack third-row seating, and slightly more than half lack heated seats.

4.2.4 Assessment of both numerical and categorical variables at the same time.

# box plot analysis

# Brand Versus Price
ggplot(mydf, aes(x=mydf$brand, y=mydf$price)) + 
  geom_boxplot() +
  labs(title = "Brand Versus Price", x ="Brand", y="Price")
## Warning: Use of `mydf$brand` is discouraged.
## ℹ Use `brand` instead.
## Warning: Use of `mydf$price` is discouraged.
## ℹ Use `price` instead.

# Fuel Versus Price
ggplot(mydf, aes(x=mydf$fuel, y=mydf$price)) + 
  geom_boxplot() +
  labs(title = "Fuel Versus Price", x ="Fuel type", y="Price")
## Warning: Use of `mydf$fuel` is discouraged.
## ℹ Use `fuel` instead.
## Use of `mydf$price` is discouraged.
## ℹ Use `price` instead.

# Drivetrain Versus Price
ggplot(mydf, aes(x=mydf$drivetrain, y=mydf$price)) + 
  geom_boxplot() +
  labs(title = "drivetrain Versus Price", x ="drivetrain", y="Price")
## Warning: Use of `mydf$drivetrain` is discouraged.
## ℹ Use `drivetrain` instead.
## Use of `mydf$price` is discouraged.
## ℹ Use `price` instead.

# automatic_transmission Versus Price
ggplot(mydf, aes(x=mydf$automatic_transmission, y=mydf$price)) + 
  geom_boxplot() +
  labs(title = "automatic_transmission Versus Price", x ="automatic_transmission", y="Price")
## Warning: Use of `mydf$automatic_transmission` is discouraged.
## ℹ Use `automatic_transmission` instead.
## Use of `mydf$price` is discouraged.
## ℹ Use `price` instead.

# damaged Versus Price
ggplot(mydf, aes(x=mydf$damaged, y=mydf$price)) + 
  geom_boxplot() +
  labs(title = "damaged Versus Price", x ="damaged", y="Price")
## Warning: Use of `mydf$damaged` is discouraged.
## ℹ Use `damaged` instead.
## Use of `mydf$price` is discouraged.
## ℹ Use `price` instead.

# first_owner Versus Price
ggplot(mydf, aes(x = first_owner, y = price)) +
  geom_boxplot() +
  labs(title = "first_owner Versus Price", x = "first_owner", y = "Price")

# navigation_system Versus Price
ggplot(mydf, aes(x = navigation_system, y = price)) +
  geom_boxplot() +
  labs(title = "navigation_system Versus Price", x = "navigation_system", y = "Price")

# bluetooth Versus Price
ggplot(mydf, aes(x = bluetooth, y = price)) +
  geom_boxplot() +
  labs(title = "bluetooth Versus Price", x = "bluetooth", y = "Price")

# third_row_seating Versus Price
ggplot(mydf, aes(x = third_row_seating, y = price)) +
  geom_boxplot() +
  labs(title = "third_row_seating Versus Price", x = "third_row_seating", y = "Price")

# heated_seats Versus Price
ggplot(mydf, aes(x = heated_seats, y = price)) +
  geom_boxplot() +
  labs(title = "heated_seats Versus Price", x = "heated_seats", y = "Price")

Looking at the box plots:
.The median price of hybrid-powered cars is the highest of all fuel types.
.Cars with four-wheel drive have the highest median price.
.The median price of cars with automatic transmissions is typically higher than that of those without.
.Damaged cars typically have a lower median price than undamaged ones.
.Cars sold by their first owners have a higher median price than those that are not.
.Cars featuring navigation, Bluetooth, third-row seating, and heated seats typically have higher median prices.
.The majority of box plots do not exhibit extreme outliers.

Overall, the box plots suggest that the price of a car varies noticeably with brand, fuel type, drivetrain, automatic transmission, damage status, first-owner status, navigation system, Bluetooth, third-row seating, and heated seats.
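
The medians read off the box plots can be cross-checked numerically. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions, not the competition data. On the real data, `mydf` would take the place of `toy`.

```r
# Toy data standing in for mydf (values are invented for illustration)
toy <- data.frame(
  fuel  = c("Petrol", "Petrol", "Hybrid", "Hybrid", "Diesel", "Diesel"),
  price = c(10000, 14000, 30000, 34000, 18000, 22000)
)

# Median price per fuel type, sorted from highest to lowest
median_by_fuel <- sort(tapply(toy$price, toy$fuel, median), decreasing = TRUE)
print(median_by_fuel)
```

The same call can be repeated for any other categorical variable by swapping the grouping column.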

4.3 2.3 EDA summary of results

Brand Distribution:
.Within this dataset, Alfa, MINI, and Chevrolet are noticeably present. Suzuki, Porsche, Mercedes-Benz, and Honda, in contrast, have a reduced number of entries.

Temporal Trends:
.Cars span the years 1984 to 2023, with an emphasis on manufacturing that occurred between 2015 and 2020, suggesting a preponderance of more recent models.

Mileage and Engine Size:
.The mileage of the cars varies widely, from 0 to 194,000. The right-skewed distribution indicates that the majority of the cars have relatively low mileage. Similarly, engine sizes range from 1.2 to 6.6, with only a limited number of cars equipped with larger engines.

Transmission and Fuel Type:
.Automatic transmissions, petrol fuel, and four-wheel drive are prevalent among the cars. Diesel-powered cars are rare.

Efficiency Metrics:
.The majority of cars have a minimum fuel economy of 20 and a maximum efficiency of around 27. Both of their distributions appear to be symmetric.

Features and Pricing:
.While cars typically do not have observable damage, navigation systems, or heated seating, Bluetooth connectivity is a prevalent feature. The price distribution is right-skewed, with the majority of cars priced below or at the middle range.

Correlations:
.The correlations suggest that newer cars generally have lower mileage and are priced higher. Price is negatively correlated with mileage, whereas engine size and fuel efficiency show little correlation with price.

Visual representations:
.Histograms, bar graphs, scatter plots, and box plots support these observations. Furthermore, the box plot analysis reveals that the median price of four-wheel-drive and hybrid-powered cars is higher. Higher median prices are also associated with an automatic transmission, absence of damage, first-owner status, and specific features including Bluetooth and navigation.

4.4 2.4 Additional insights and issues

.The ‘unknown’ elements contained within the ‘fuel’ and ‘drivetrain’ variables may be substituted with the values that occur most frequently within those particular variables. This substitution strategy contributes to the production of more consistent and trustworthy data for analysis.
.Luxury brands, such as Mercedes-Benz and Porsche, have fewer entries than mainstream brands.
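
The substitution of 'Unknown' values by the most frequent value could be sketched as follows. This is only an illustration under stated assumptions: `replace_with_mode` is a hypothetical helper that is not part of the original analysis, and the `fuel` vector below is invented.

```r
# Replace a placeholder level with the most frequent remaining level.
# 'replace_with_mode' is a hypothetical helper written for this sketch.
replace_with_mode <- function(x, placeholder = "Unknown") {
  x <- as.character(x)
  known <- x[!is.na(x) & x != placeholder]
  mode_value <- names(which.max(table(known)))
  x[!is.na(x) & x == placeholder] <- mode_value
  factor(x)
}

# Invented example values
fuel <- factor(c("Petrol", "Petrol", "Diesel", "Unknown", "Petrol", "Unknown"))
fuel_clean <- replace_with_mode(fuel)
print(table(fuel_clean))
```

Mode imputation keeps every row but can overstate the share of the dominant category, so it is worth comparing results with and without the substitution.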

5 3. Modelling

5.1 3.1 Explain your analysis plan

.The objective is to see how the prices of the cars are explained by the other variables. Because of the considerable number of independent variables, interactions will be omitted from the modelling process. As the dependent variable, ‘price’, is continuous, a linear regression model will be applied. The model will be developed by backward selection, starting from the maximal model that incorporates all variables and progressively reducing its complexity. The functions ‘step()’ and ‘update()’ will be used to simplify the model. Furthermore, in accordance with the findings of the EDA, the variables ‘max_mpg’, ‘min_mpg’, and ‘engine_size’ will be excluded because they showed little to no association with the price of cars.

5.2 3.2 Build a model for car price

# Eliminating the variables 'max_mpg,''min_mpg,' and 'engine_size'
mydf_new <- mydf[, !(names(mydf) %in% c("engine_size", "min_mpg", "max_mpg"))]

# Attach mydf_new to facilitate variables access.
attach(mydf_new)

# Constructing the linear regression model
model <- lm(price ~ brand+year+mileage+automatic_transmission+fuel+drivetrain+damaged+first_owner+navigation_system+bluetooth+third_row_seating+heated_seats+ I(year)^2+I(mileage)^2)

# Assessment of the model's summary
summary(model)
## 
## Call:
## lm(formula = price ~ brand + year + mileage + automatic_transmission + 
##     fuel + drivetrain + damaged + first_owner + navigation_system + 
##     bluetooth + third_row_seating + heated_seats + I(year)^2 + 
##     I(mileage)^2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14030  -4204   -364   3452  20540 
## 
## Coefficients: (2 not defined because of singularities)
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.991e+06  2.060e+05  -9.665  < 2e-16 ***
## brandAudi                    4.365e+03  2.011e+03   2.171 0.030595 *  
## brandBMW                     5.075e+03  2.702e+03   1.878 0.061124 .  
## brandCadillac                7.313e+03  2.172e+03   3.367 0.000839 ***
## brandChevrolet               6.525e+03  2.061e+03   3.166 0.001673 ** 
## brandFIAT                   -4.883e+03  2.104e+03  -2.321 0.020804 *  
## brandFord                    6.482e+03  1.914e+03   3.387 0.000784 ***
## brandHonda                   1.245e+03  2.131e+03   0.584 0.559456    
## brandHyundai                 2.358e+02  2.219e+03   0.106 0.915437    
## brandJaguar                  6.626e+03  2.132e+03   3.107 0.002035 ** 
## brandJeep                    4.773e+03  2.045e+03   2.333 0.020171 *  
## brandKia                    -2.413e+02  2.176e+03  -0.111 0.911777    
## brandLand                    6.692e+03  2.294e+03   2.918 0.003742 ** 
## brandLexus                   7.049e+03  2.151e+03   3.277 0.001149 ** 
## brandMaserati                6.166e+03  2.245e+03   2.746 0.006322 ** 
## brandMazda                  -2.499e+03  2.014e+03  -1.241 0.215498    
## brandMercedes-Benz           7.928e+03  2.002e+03   3.960 8.99e-05 ***
## brandMINI                    1.657e+02  2.114e+03   0.078 0.937552    
## brandMitsubishi             -4.999e+03  2.199e+03  -2.273 0.023576 *  
## brandNissan                  2.878e+02  1.993e+03   0.144 0.885289    
## brandPorsche                 8.212e+03  3.354e+03   2.449 0.014804 *  
## brandToyota                  4.681e+03  2.404e+03   1.948 0.052225 .  
## brandVolkswagen              5.632e+02  1.925e+03   0.293 0.769994    
## brandVolvo                   5.292e+03  2.169e+03   2.440 0.015146 *  
## year                         1.006e+03  1.022e+02   9.849  < 2e-16 ***
## mileage                     -1.045e-01  1.015e-02 -10.299  < 2e-16 ***
## automatic_transmission1     -6.292e+02  1.276e+03  -0.493 0.622223    
## fuelElectric                -6.655e+03  3.788e+03  -1.757 0.079744 .  
## fuelGPL                     -1.203e+04  3.930e+03  -3.061 0.002364 ** 
## fuelHybrid                  -7.496e+03  3.535e+03  -2.120 0.034662 *  
## fuelPetrol                  -7.285e+03  3.150e+03  -2.313 0.021284 *  
## fuelUnknown                 -7.125e+03  4.833e+03  -1.474 0.141265    
## drivetrainFront-wheel Drive -5.403e+03  7.952e+02  -6.794 4.38e-11 ***
## drivetrainRear-wheel Drive   2.454e+03  1.014e+03   2.420 0.015999 *  
## drivetrainUnknown            5.213e+04  6.131e+03   8.503 4.65e-16 ***
## damaged1                    -1.389e+03  7.131e+02  -1.948 0.052173 .  
## first_owner1                 1.806e+03  7.000e+02   2.579 0.010287 *  
## navigation_system1           3.728e+03  7.216e+02   5.166 3.93e-07 ***
## bluetooth1                  -1.177e+03  1.072e+03  -1.098 0.272911    
## third_row_seating1           3.983e+03  8.981e+02   4.436 1.21e-05 ***
## heated_seats1                3.872e+02  6.600e+02   0.587 0.557797    
## I(year)                             NA         NA      NA       NA    
## I(mileage)                          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5830 on 369 degrees of freedom
## Multiple R-squared:  0.7936, Adjusted R-squared:  0.7713 
## F-statistic: 35.48 on 40 and 369 DF,  p-value: < 2.2e-16

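.The asterisks indicate that some independent variables are significant, while others are insignificant or only weakly significant. The two coefficients reported as NA arise because ‘I(year)^2’ and ‘I(mileage)^2’ are parsed as formula powers and reduce to ‘I(year)’ and ‘I(mileage)’, which duplicate ‘year’ and ‘mileage’; genuine quadratic terms would be written ‘I(year^2)’ and ‘I(mileage^2)’. Insignificant variables will be eliminated using the step function.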
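
The difference between ‘I(year)^2’ and ‘I(year^2)’ can be illustrated with a small sketch. The data here are simulated (an assumption made purely for illustration), and centring the year before squaring is one possible way to limit collinearity, not something taken from the original analysis.

```r
# Simulated data (values are invented purely for illustration)
set.seed(1)
toy <- data.frame(year = rep(2000:2019, each = 3))
toy$price <- 5000 + 40 * (toy$year - 2000)^2 + rnorm(nrow(toy), sd = 100)

# I(year)^2 is read as a formula power, so it collapses to I(year),
# which duplicates 'year' and yields an NA coefficient
fit_wrong <- lm(price ~ year + I(year)^2, data = toy)

# I(year_c^2) squares the values; centring reduces collinearity with the linear term
toy$year_c <- toy$year - mean(toy$year)
fit_quad <- lm(price ~ year_c + I(year_c^2), data = toy)

print(coef(fit_wrong))
print(coef(fit_quad))
```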

# Eliminating insignificant variables
modelB <- step(model)
## Start:  AIC=7148.88
## price ~ brand + year + mileage + automatic_transmission + fuel + 
##     drivetrain + damaged + first_owner + navigation_system + 
##     bluetooth + third_row_seating + heated_seats + I(year)^2 + 
##     I(mileage)^2
## 
## 
## Step:  AIC=7148.88
## price ~ brand + year + mileage + automatic_transmission + fuel + 
##     drivetrain + damaged + first_owner + navigation_system + 
##     bluetooth + third_row_seating + heated_seats + I(year)
## 
## 
## Step:  AIC=7148.88
## price ~ brand + year + mileage + automatic_transmission + fuel + 
##     drivetrain + damaged + first_owner + navigation_system + 
##     bluetooth + third_row_seating + heated_seats
## 
##                          Df  Sum of Sq        RSS    AIC
## - automatic_transmission  1    8265835 1.2552e+10 7147.2
## - heated_seats            1   11698876 1.2555e+10 7147.3
## - bluetooth               1   40983367 1.2584e+10 7148.2
## <none>                                 1.2543e+10 7148.9
## - fuel                    5  323314624 1.2867e+10 7149.3
## - damaged                 1  128991798 1.2672e+10 7151.1
## - first_owner             1  226144980 1.2769e+10 7154.2
## - third_row_seating       1  668766737 1.3212e+10 7168.2
## - navigation_system       1  907092711 1.3450e+10 7175.5
## - brand                  23 4694517765 1.7238e+10 7233.2
## - year                    1 3297212141 1.5840e+10 7242.6
## - mileage                 1 3605304935 1.6149e+10 7250.5
## - drivetrain              3 5010210007 1.7553e+10 7280.7
## 
## Step:  AIC=7147.15
## price ~ brand + year + mileage + fuel + drivetrain + damaged + 
##     first_owner + navigation_system + bluetooth + third_row_seating + 
##     heated_seats
## 
##                     Df  Sum of Sq        RSS    AIC
## - heated_seats       1   11860461 1.2563e+10 7145.5
## - bluetooth          1   43024656 1.2595e+10 7146.6
## <none>                            1.2552e+10 7147.2
## - fuel               5  321686514 1.2873e+10 7147.5
## - damaged            1  127517290 1.2679e+10 7149.3
## - first_owner        1  233519221 1.2785e+10 7152.7
## - third_row_seating  1  661281195 1.3213e+10 7166.2
## - navigation_system  1  904183901 1.3456e+10 7173.7
## - brand             23 4699430821 1.7251e+10 7231.5
## - year               1 3291288699 1.5843e+10 7240.6
## - mileage            1 3598029012 1.6150e+10 7248.5
## - drivetrain         3 5106802992 1.7658e+10 7281.1
## 
## Step:  AIC=7145.54
## price ~ brand + year + mileage + fuel + drivetrain + damaged + 
##     first_owner + navigation_system + bluetooth + third_row_seating
## 
##                     Df  Sum of Sq        RSS    AIC
## - bluetooth          1   42237773 1.2606e+10 7144.9
## <none>                            1.2563e+10 7145.5
## - fuel               5  324306622 1.2888e+10 7146.0
## - damaged            1  127863377 1.2691e+10 7147.7
## - first_owner        1  238293982 1.2802e+10 7151.2
## - third_row_seating  1  693820754 1.3257e+10 7165.6
## - navigation_system  1  965220850 1.3529e+10 7173.9
## - brand             23 4692654495 1.7256e+10 7229.7
## - year               1 3363311469 1.5927e+10 7240.8
## - mileage            1 3602378050 1.6166e+10 7246.9
## - drivetrain         3 5214853771 1.7778e+10 7281.9
## 
## Step:  AIC=7144.91
## price ~ brand + year + mileage + fuel + drivetrain + damaged + 
##     first_owner + navigation_system + third_row_seating
## 
##                     Df  Sum of Sq        RSS    AIC
## <none>                            1.2606e+10 7144.9
## - fuel               5  345731555 1.2951e+10 7146.0
## - damaged            1  146008516 1.2752e+10 7147.6
## - first_owner        1  264284458 1.2870e+10 7151.4
## - third_row_seating  1  676941929 1.3283e+10 7164.4
## - navigation_system  1  925261490 1.3531e+10 7172.0
## - brand             23 4655069662 1.7261e+10 7227.8
## - mileage            1 3694527431 1.6300e+10 7248.3
## - year               1 3823213886 1.6429e+10 7251.5
## - drivetrain         3 5180609827 1.7786e+10 7280.1
# Checking modelB summary
summary(modelB)
## 
## Call:
## lm(formula = price ~ brand + year + mileage + fuel + drivetrain + 
##     damaged + first_owner + navigation_system + third_row_seating)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14157.9  -4178.8   -246.1   3508.6  20722.0 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.893e+06  1.820e+05 -10.399  < 2e-16 ***
## brandAudi                    4.710e+03  1.974e+03   2.386 0.017527 *  
## brandBMW                     5.234e+03  2.693e+03   1.944 0.052662 .  
## brandCadillac                7.488e+03  2.156e+03   3.473 0.000576 ***
## brandChevrolet               6.616e+03  2.056e+03   3.218 0.001404 ** 
## brandFIAT                   -4.668e+03  2.086e+03  -2.238 0.025818 *  
## brandFord                    6.716e+03  1.901e+03   3.533 0.000462 ***
## brandHonda                   1.438e+03  2.106e+03   0.683 0.495245    
## brandHyundai                 9.086e+02  2.149e+03   0.423 0.672742    
## brandJaguar                  6.679e+03  2.123e+03   3.146 0.001790 ** 
## brandJeep                    5.071e+03  2.030e+03   2.498 0.012918 *  
## brandKia                    -9.358e+01  2.169e+03  -0.043 0.965603    
## brandLand                    6.729e+03  2.288e+03   2.942 0.003470 ** 
## brandLexus                   7.265e+03  2.140e+03   3.394 0.000762 ***
## brandMaserati                6.194e+03  2.224e+03   2.784 0.005637 ** 
## brandMazda                  -1.896e+03  1.954e+03  -0.970 0.332509    
## brandMercedes-Benz           7.917e+03  1.998e+03   3.963 8.86e-05 ***
## brandMINI                    7.272e+02  2.065e+03   0.352 0.724921    
## brandMitsubishi             -4.861e+03  2.184e+03  -2.226 0.026621 *  
## brandNissan                  5.985e+02  1.974e+03   0.303 0.761951    
## brandPorsche                 8.511e+03  3.331e+03   2.555 0.011006 *  
## brandToyota                  5.000e+03  2.381e+03   2.100 0.036429 *  
## brandVolkswagen              8.594e+02  1.885e+03   0.456 0.648761    
## brandVolvo                   5.587e+03  2.152e+03   2.597 0.009782 ** 
## year                         9.568e+02  9.007e+01  10.622  < 2e-16 ***
## mileage                     -1.054e-01  1.009e-02 -10.442  < 2e-16 ***
## fuelElectric                -6.813e+03  3.756e+03  -1.814 0.070515 .  
## fuelGPL                     -1.238e+04  3.910e+03  -3.165 0.001677 ** 
## fuelHybrid                  -7.800e+03  3.510e+03  -2.223 0.026848 *  
## fuelPetrol                  -7.501e+03  3.124e+03  -2.401 0.016833 *  
## fuelUnknown                 -7.210e+03  4.820e+03  -1.496 0.135591    
## drivetrainFront-wheel Drive -5.563e+03  7.811e+02  -7.121 5.56e-12 ***
## drivetrainRear-wheel Drive   2.529e+03  1.006e+03   2.513 0.012387 *  
## drivetrainUnknown            5.128e+04  5.907e+03   8.681  < 2e-16 ***
## damaged1                    -1.469e+03  7.076e+02  -2.076 0.038602 *  
## first_owner1                 1.933e+03  6.923e+02   2.793 0.005497 ** 
## navigation_system1           3.664e+03  7.011e+02   5.225 2.90e-07 ***
## third_row_seating1           3.959e+03  8.858e+02   4.470 1.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5821 on 372 degrees of freedom
## Multiple R-squared:  0.7926, Adjusted R-squared:  0.772 
## F-statistic: 38.42 on 37 and 372 DF,  p-value: < 2.2e-16

.Several insignificant terms are still included in the model. Before dealing with them, I will consider removing the brand variable to simplify the model: it contributes 23 dummy terms, of which eight (BMW, Honda, Hyundai, Kia, Mazda, MINI, Nissan, and Volkswagen) are not significant at the 5% level, although the remaining brand terms are significant. I’ll use the update() function for this.
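
The p-values of individual dummy terms only compare each level with the reference level, so a factor as a whole is usually judged by comparing nested models. The following is a minimal sketch of that idea using the built-in `mtcars` data as a stand-in (an assumption; `cyl` plays the role of a multi-level factor such as brand).

```r
# Built-in mtcars stands in for the competition data (illustrative assumption)
cars_toy <- transform(mtcars, cyl = factor(cyl))

full    <- lm(mpg ~ wt + hp + cyl, data = cars_toy)
reduced <- update(full, . ~ . - cyl)

# F-test for dropping the whole factor (all of its dummy terms at once)
factor_test <- anova(reduced, full)
print(factor_test)

# drop1() runs the same kind of test for every term in the model
print(drop1(full, test = "F"))
```

A small p-value in the `anova()` comparison would suggest that the factor improves the fit even if some of its individual levels are insignificant.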

# Eliminating brand variable
modelC <- update(modelB,~.-brand)

# Checking modelC summary
summary(modelC)
## 
## Call:
## lm(formula = price ~ year + mileage + fuel + drivetrain + damaged + 
##     first_owner + navigation_system + third_row_seating)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15735  -4346   -807   3971  20798 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.688e+06  1.902e+05  -8.875  < 2e-16 ***
## year                         8.575e+02  9.412e+01   9.111  < 2e-16 ***
## mileage                     -1.057e-01  1.103e-02  -9.586  < 2e-16 ***
## fuelElectric                -1.232e+04  4.127e+03  -2.986  0.00300 ** 
## fuelGPL                     -1.095e+04  4.372e+03  -2.504  0.01267 *  
## fuelHybrid                  -8.169e+03  3.850e+03  -2.121  0.03451 *  
## fuelPetrol                  -8.954e+03  3.463e+03  -2.585  0.01008 *  
## fuelUnknown                 -8.890e+03  5.430e+03  -1.637  0.10238    
## drivetrainFront-wheel Drive -8.312e+03  7.598e+02 -10.939  < 2e-16 ***
## drivetrainRear-wheel Drive   2.633e+03  1.068e+03   2.466  0.01410 *  
## drivetrainUnknown            4.606e+04  6.423e+03   7.171 3.70e-12 ***
## damaged1                    -1.309e+03  7.795e+02  -1.680  0.09376 .  
## first_owner1                 1.785e+03  7.602e+02   2.348  0.01937 *  
## navigation_system1           5.109e+03  7.076e+02   7.220 2.69e-12 ***
## third_row_seating1           2.912e+03  9.184e+02   3.171  0.00164 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6610 on 395 degrees of freedom
## Multiple R-squared:  0.716,  Adjusted R-squared:  0.7059 
## F-statistic: 71.14 on 14 and 395 DF,  p-value: < 2.2e-16

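. Next, I consider removing the fuel and first_owner variables, whose terms are at most weakly to moderately significant in modelC (fuelUnknown is not significant at the 5% level).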

# Eliminating fuel and first_owner  variables
modelD <- update(modelC,~.-fuel-first_owner)

# Checking modelD summary
summary(modelD)
## 
## Call:
## lm(formula = price ~ year + mileage + drivetrain + damaged + 
##     navigation_system + third_row_seating)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16833.1  -4565.5   -639.9   3930.2  24957.9 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.799e+06  1.884e+05  -9.551  < 2e-16 ***
## year                         9.087e+02  9.322e+01   9.747  < 2e-16 ***
## mileage                     -1.026e-01  1.053e-02  -9.750  < 2e-16 ***
## drivetrainFront-wheel Drive -8.359e+03  7.618e+02 -10.972  < 2e-16 ***
## drivetrainRear-wheel Drive   2.403e+03  1.072e+03   2.241 0.025563 *  
## drivetrainUnknown            4.741e+04  6.207e+03   7.638 1.63e-13 ***
## damaged1                    -1.553e+03  7.823e+02  -1.985 0.047876 *  
## navigation_system1           5.138e+03  7.072e+02   7.265 1.96e-12 ***
## third_row_seating1           3.384e+03  9.122e+02   3.710 0.000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6701 on 401 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6978 
## F-statistic: 119.1 on 8 and 401 DF,  p-value: < 2.2e-16

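. All three drivetrain terms are significant in modelD, but I remove the variable to simplify the model further. This comes at a cost: the adjusted R-squared falls from 0.6978 (modelD) to 0.5305 (modelE).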

# Eliminating Drivetrain variable
modelE <- update(modelD,~.-drivetrain)

# Checking the modelE summary
summary(modelE)
## 
## Call:
## lm(formula = price ~ year + mileage + damaged + navigation_system + 
##     third_row_seating)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26798  -5445   -758   4977  45457 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -8.677e+05  1.780e+05  -4.876 1.56e-06 ***
## year                4.462e+02  8.813e+01   5.063 6.28e-07 ***
## mileage            -1.425e-01  1.222e-02 -11.661  < 2e-16 ***
## damaged1           -2.636e+03  9.702e+02  -2.717 0.006878 ** 
## navigation_system1  7.437e+03  8.538e+02   8.711  < 2e-16 ***
## third_row_seating1  4.292e+03  1.122e+03   3.825 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8353 on 404 degrees of freedom
## Multiple R-squared:  0.5362, Adjusted R-squared:  0.5305 
## F-statistic: 93.42 on 5 and 404 DF,  p-value: < 2.2e-16

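. Attempting to eliminate any remaining insignificant terms. Note that automatic_transmission and bluetooth were already removed by step(), so this update leaves the model unchanged: modelF is identical to modelE.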

# Eliminating automatic_transmission and bluetooth  variables
modelF <- update(modelE,~.-automatic_transmission-bluetooth)

# Checking modelF summary
summary(modelF)
## 
## Call:
## lm(formula = price ~ year + mileage + damaged + navigation_system + 
##     third_row_seating)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26798  -5445   -758   4977  45457 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -8.677e+05  1.780e+05  -4.876 1.56e-06 ***
## year                4.462e+02  8.813e+01   5.063 6.28e-07 ***
## mileage            -1.425e-01  1.222e-02 -11.661  < 2e-16 ***
## damaged1           -2.636e+03  9.702e+02  -2.717 0.006878 ** 
## navigation_system1  7.437e+03  8.538e+02   8.711  < 2e-16 ***
## third_row_seating1  4.292e+03  1.122e+03   3.825 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8353 on 404 degrees of freedom
## Multiple R-squared:  0.5362, Adjusted R-squared:  0.5305 
## F-statistic: 93.42 on 5 and 404 DF,  p-value: < 2.2e-16

. All remaining terms are significant, so the final model, with coefficients taken from the modelF summary, is

\[ price = -867700+446.2\times{year}-0.1425\times{mileage}-2636\times{damaged}+7437\times{navigationsystem}+4292\times{thirdrowseating} \]

5.3 3.3 Critique model using relevant diagnostics

.Some variables influence car prices positively in the final model, whereas others have a negative impact.
.The intercept of -867,700 is the predicted price when all predictors are zero (including year 0), so it has no practical interpretation on its own.
.The year of production has a positive effect on the price: each additional year is associated with a price increase of 446.2, holding the other variables fixed.
.Mileage has a negative effect: each additional unit of mileage reduces the predicted price by 0.1425.
.Damage reduces the predicted price by 2,636 relative to an otherwise identical undamaged car.
.A navigation system or third-row seating raises the predicted price by 7,437 or 4,292 respectively.
.The R-squared of about 54% (adjusted 53%) is acceptable and indicates how well the model explains car prices in light of these factors. On the basis of these independent variables, the model explains roughly half of the variation in car prices, which leaves room for improvement.
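
To make the interpretation of the coefficients concrete, the sketch below computes a prediction by hand from the rounded coefficients printed in the modelF summary output. The car described is hypothetical, and because the intercept is rounded to four significant figures the result is only approximate; calling `predict()` on the fitted model with a `newdata` data frame would avoid the rounding.

```r
# Rounded coefficients copied from the modelF summary output
b <- c(intercept = -8.677e5, year = 446.2, mileage = -0.1425,
       damaged = -2636, navigation_system = 7437, third_row_seating = 4292)

# Hypothetical car: 2018 model, 30,000 mileage units, undamaged,
# with a navigation system, without third-row seating
x <- c(intercept = 1, year = 2018, mileage = 30000,
       damaged = 0, navigation_system = 1, third_row_seating = 0)

# Linear predictor: sum of coefficient times value
predicted_price <- sum(b * x)
print(predicted_price)
```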

# Model plotting for diagnostic purposes
plot(modelF)

. Plot 1 (residuals versus fitted values) shows no substantial heteroscedasticity: the spread of points is roughly constant. While not ideal, it is acceptable.
.Plot 2 (normal Q-Q) shows that nearly all points lie along the straight reference line, indicating approximately normally distributed residuals. However, some outliers exist.
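
Visual diagnostics can be complemented by numeric checks. The following is a rough sketch using the built-in `mtcars` data as a stand-in (an assumption); on the real analysis the fitted object would be modelF. The 4/n cut-off for Cook's distance is a common rule of thumb rather than a strict criterion.

```r
# Built-in mtcars stands in for the competition data (illustrative assumption)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test on the residuals: a small p-value suggests non-normality
normality <- shapiro.test(residuals(fit))
print(normality$p.value)

# Cook's distance flags observations with a large influence on the fit
cooks <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]
print(influential)
```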

5.4 3.4 Suggest and implement improvements to your model

.Since the dependent variable is continuous and its distribution is right-skewed, I will try square-root and logarithmic transformations of price to improve model performance. Such transformations may bring the relationship between the predictors and the response closer to linear, improving the model’s predictive power.

# Applying sqrt() function on the modelF 
modelK <- lm(sqrt(price)~ year + mileage + damaged + navigation_system + third_row_seating)

#Checking modelK summary
summary(modelK)
## 
## Call:
## lm(formula = sqrt(price) ~ year + mileage + damaged + navigation_system + 
##     third_row_seating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.460 -16.480  -1.007  15.725 148.058 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.083e+03  5.403e+02  -5.705 2.25e-08 ***
## year                1.616e+00  2.675e-01   6.041 3.49e-09 ***
## mileage            -4.533e-04  3.711e-05 -12.216  < 2e-16 ***
## damaged1           -7.489e+00  2.945e+00  -2.543 0.011375 *  
## navigation_system1  2.228e+01  2.592e+00   8.597  < 2e-16 ***
## third_row_seating1  1.311e+01  3.406e+00   3.847 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.36 on 404 degrees of freedom
## Multiple R-squared:  0.5622, Adjusted R-squared:  0.5568 
## F-statistic: 103.8 on 5 and 404 DF,  p-value: < 2.2e-16
# Applying log() function on the modelF
modelM <- lm(log(price)~ year + mileage + damaged + navigation_system + third_row_seating)

#Checking modelM summary
summary(modelM)
## 
## Call:
## lm(formula = log(price) ~ year + mileage + damaged + navigation_system + 
##     third_row_seating)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21451 -0.19948  0.01378  0.20272  2.04492 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.973e+01  7.160e+00  -5.549 5.22e-08 ***
## year                2.481e-02  3.546e-03   6.998 1.08e-11 ***
## mileage            -6.004e-06  4.918e-07 -12.209  < 2e-16 ***
## damaged1           -8.620e-02  3.903e-02  -2.208 0.027788 *  
## navigation_system1  2.782e-01  3.435e-02   8.100 6.56e-15 ***
## third_row_seating1  1.634e-01  4.514e-02   3.620 0.000333 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3361 on 404 degrees of freedom
## Multiple R-squared:  0.5696, Adjusted R-squared:  0.5643 
## F-statistic: 106.9 on 5 and 404 DF,  p-value: < 2.2e-16
# Plotting modelK
plot(modelK)

# Plotting modelM
plot(modelM)

.The diagnostic plots of both models (modelK and modelM) show the same consistent and well-behaved patterns as those of modelF. Meanwhile, both the R-squared and F-statistic values increase: R-squared rises from 0.5362 (modelF) to 0.5622 (modelK) and 0.5696 (modelM), and the F-statistic from 93.42 to 103.8 and 106.9. Higher values of these metrics generally indicate a better fit, although values computed on different response scales are not strictly comparable. On the basis of these metrics, modelM performs best. The representation of modelM in equation form is:

\[ \log(price) = -39.73+0.02481\times{year}-0.000006004\times{mileage}-0.08620\times{damaged1}+0.2782\times{navigationsystem1}+0.1634\times{thirdrowseating1} \]
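
Because the R-squared values for price, sqrt(price), and log(price) are measured on different response scales, a complementary check is to back-transform the fitted values and compare the errors on the original scale. The sketch below illustrates this with the built-in `mtcars` data as a stand-in (an assumption); `rmse` is a small helper defined here, and the naive back-transformation ignores retransformation bias.

```r
# Built-in mtcars stands in for the competition data (illustrative assumption)
fit_raw  <- lm(mpg ~ wt + hp, data = mtcars)
fit_sqrt <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)
fit_log  <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Errors compared on the original response scale
rmse_original_scale <- c(
  raw  = rmse(mtcars$mpg, fitted(fit_raw)),
  sqrt = rmse(mtcars$mpg, fitted(fit_sqrt)^2),   # undo sqrt()
  log  = rmse(mtcars$mpg, exp(fitted(fit_log)))  # undo log()
)
print(rmse_original_scale)
```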

6 4. Modelling another dependent variable

6.1 4.1 Model the likelihood of a car being sold by the first owner (using the first_owner variable provided).

.The objective is to examine how the remaining variables explain the probability that a car is sold by its first owner.
.Given the binary nature of the dependent variable, ‘first_owner’, and the mix of categorical and numeric independent variables, logistic regression will be used. Because a sufficient number of independent variables is already available to predict the probability that a car will be sold by its first owner, I have chosen not to include interactions in the modelling process.
.Beginning with the maximal model, which includes every variable, I will use backward selection to construct the model, incrementally reducing its complexity until no further simplification is justified. I intend to use the functions “step()” and “update()” to simplify the model.
.I shall conclude by using the “coef()” and “exp()” functions to compute the odds ratios. Moreover, in light of the strong correlation identified between ‘min_mpg’ and ‘max_mpg’ during the EDA, the variable ‘max_mpg’ will be excluded from this model.
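
The odds-ratio step of this plan can be sketched as follows. The built-in `mtcars` data are used as a stand-in (an assumption), with the binary variable `am` playing the role of ‘first_owner’; the Wald intervals from `confint.default()` are one simple choice among several.

```r
# Built-in mtcars stands in for the competition data (illustrative assumption);
# 'am' (0/1) plays the role of first_owner
logit_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Exponentiating the log-odds coefficients gives odds ratios
odds_ratios <- exp(coef(logit_fit))
print(odds_ratios)

# Wald confidence intervals, also moved to the odds-ratio scale
or_intervals <- exp(confint.default(logit_fit))
print(or_intervals)
```

An odds ratio above 1 means the odds of the outcome increase with the predictor, and a value below 1 means they decrease, holding the other predictors fixed.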

# Detaching the mydf_new data frame that was attached during the development of the preceding model.
detach(mydf_new)

# Eliminating 'max_mpg' on account of multicollinearity
mydf_logistic <- mydf[, names(mydf) != "max_mpg"]

# For easier access, attach the mydf_logistic data frame
attach(mydf_logistic)

# Constructing the logistic regression
logistic_model <- glm(first_owner~brand+year+mileage+engine_size+automatic_transmission+fuel+drivetrain+min_mpg+damaged+navigation_system+bluetooth+third_row_seating+heated_seats+price, family = binomial)

#Verifying the developed logistic regression model's summary
summary(logistic_model)
## 
## Call:
## glm(formula = first_owner ~ brand + year + mileage + engine_size + 
##     automatic_transmission + fuel + drivetrain + min_mpg + damaged + 
##     navigation_system + bluetooth + third_row_seating + heated_seats + 
##     price, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4987  -0.8015  -0.2971   0.7701   2.1649  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                 -2.683e+02  9.791e+02  -0.274   0.7841  
## brandAudi                   -8.669e-01  8.173e-01  -1.061   0.2889  
## brandBMW                    -7.480e-01  1.122e+00  -0.667   0.5049  
## brandCadillac                7.779e-01  9.586e-01   0.812   0.4171  
## brandChevrolet              -7.561e-01  9.390e-01  -0.805   0.4207  
## brandFIAT                    4.798e-01  8.612e-01   0.557   0.5774  
## brandFord                   -4.060e-01  7.771e-01  -0.522   0.6014  
## brandHonda                   1.442e+00  8.957e-01   1.610   0.1075  
## brandHyundai                -4.995e-01  9.153e-01  -0.546   0.5852  
## brandJaguar                 -6.271e-01  8.325e-01  -0.753   0.4513  
## brandJeep                    8.742e-02  8.947e-01   0.098   0.9222  
## brandKia                     1.034e+00  8.706e-01   1.187   0.2351  
## brandLand                   -1.296e+00  1.004e+00  -1.291   0.1968  
## brandLexus                   4.676e-01  9.541e-01   0.490   0.6241  
## brandMaserati               -1.846e+00  1.050e+00  -1.759   0.0786 .
## brandMazda                   7.898e-01  8.176e-01   0.966   0.3341  
## brandMercedes-Benz          -6.615e-01  8.151e-01  -0.812   0.4170  
## brandMINI                    8.340e-01  8.313e-01   1.003   0.3157  
## brandMitsubishi              1.371e+00  8.687e-01   1.578   0.1145  
## brandNissan                  2.192e-01  8.413e-01   0.261   0.7944  
## brandPorsche                -6.165e-01  1.323e+00  -0.466   0.6412  
## brandToyota                  1.692e+00  1.041e+00   1.625   0.1042  
## brandVolkswagen              3.200e-03  7.635e-01   0.004   0.9967  
## brandVolvo                   4.953e-01  9.207e-01   0.538   0.5906  
## year                         1.267e-01  5.737e-02   2.208   0.0273 *
## mileage                     -1.329e-05  5.792e-06  -2.295   0.0217 *
## engine_size                 -3.151e-02  1.907e-01  -0.165   0.8688  
## automatic_transmission1     -8.706e-01  5.171e-01  -1.684   0.0923 .
## fuelElectric                 1.250e+01  9.723e+02   0.013   0.9897  
## fuelGPL                      1.410e+01  9.723e+02   0.015   0.9884  
## fuelHybrid                   1.613e+01  9.723e+02   0.017   0.9868  
## fuelPetrol                   1.444e+01  9.723e+02   0.015   0.9882  
## fuelUnknown                  1.339e+01  9.723e+02   0.014   0.9890  
## drivetrainFront-wheel Drive  1.049e-01  3.711e-01   0.283   0.7774  
## drivetrainRear-wheel Drive  -7.144e-01  4.395e-01  -1.626   0.1040  
## drivetrainUnknown           -1.347e+01  1.425e+03  -0.009   0.9925  
## min_mpg                     -4.888e-02  3.258e-02  -1.500   0.1335  
## damaged1                    -6.316e-01  3.079e-01  -2.051   0.0403 *
## navigation_system1           2.517e-01  3.316e-01   0.759   0.4479  
## bluetooth1                  -1.012e+00  5.353e-01  -1.890   0.0587 .
## third_row_seating1           5.527e-01  4.129e-01   1.339   0.1806  
## heated_seats1                8.098e-02  2.841e-01   0.285   0.7756  
## price                        5.434e-05  2.575e-05   2.110   0.0349 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.85  on 409  degrees of freedom
## Residual deviance: 399.52  on 367  degrees of freedom
## AIC: 485.52
## 
## Number of Fisher Scoring iterations: 15
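The 'fuel' coefficients above share a standard error of about 972, and the intercept's is about 979. One common cause of this pattern is a factor level in which the outcome never (or always) occurs, although the output alone cannot confirm that this is what happened here. The sketch below uses simulated data, not the car data, and the names `toy`, `g` and `y` are hypothetical; it only illustrates how an all-zero reference level inflates the standard errors.

```r
# Simulated data (NOT the car data): the reference level "a" has no successes
set.seed(1)
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), times = c(5, 40, 40))),
  y = c(rep(0, 5), rbinom(80, size = 1, prob = 0.5))
)

# glm() warns that fitted probabilities of 0 or 1 occurred;
# the warning is suppressed here only to keep the output short
fit <- suppressWarnings(glm(y ~ g, family = binomial, data = toy))

# Cross-tabulating the factor against the outcome exposes the empty cell
table(toy$g, toy$y)

# Standard errors of the intercept and of every level of g are inflated
se <- coef(summary(fit))[, "Std. Error"]
se
```

On the real data, the analogous check would be a cross-tabulation of 'fuel' (and 'drivetrain') against 'first_owner'.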

Let us now simplify the model using the step() function, which drops terms one at a time for as long as doing so lowers the AIC.

#Applying step() function to logistic_model
logistic_modelB <- step(logistic_model)
## Start:  AIC=485.52
## first_owner ~ brand + year + mileage + engine_size + automatic_transmission + 
##     fuel + drivetrain + min_mpg + damaged + navigation_system + 
##     bluetooth + third_row_seating + heated_seats + price
## 
##                          Df Deviance    AIC
## - brand                  23   426.62 466.62
## - drivetrain              3   402.63 482.63
## - engine_size             1   399.55 483.55
## - heated_seats            1   399.60 483.60
## - navigation_system       1   400.10 484.10
## - third_row_seating       1   401.34 485.34
## <none>                        399.52 485.52
## - min_mpg                 1   401.93 485.93
## - automatic_transmission  1   402.30 486.30
## - bluetooth               1   403.27 487.27
## - damaged                 1   403.83 487.83
## - price                   1   404.12 488.12
## - fuel                    5   412.15 488.15
## - year                    1   404.91 488.91
## - mileage                 1   404.94 488.94
## 
## Step:  AIC=466.62
## first_owner ~ year + mileage + engine_size + automatic_transmission + 
##     fuel + drivetrain + min_mpg + damaged + navigation_system + 
##     bluetooth + third_row_seating + heated_seats + price
## 
##                          Df Deviance    AIC
## - engine_size             1   426.62 464.62
## - min_mpg                 1   426.62 464.62
## - navigation_system       1   426.78 464.78
## - heated_seats            1   426.79 464.79
## <none>                        426.62 466.62
## - automatic_transmission  1   428.79 466.79
## - price                   1   429.08 467.08
## - damaged                 1   430.17 468.17
## - drivetrain              3   434.38 468.38
## - bluetooth               1   432.90 470.90
## - mileage                 1   433.01 471.01
## - fuel                    5   441.66 471.66
## - third_row_seating       1   435.04 473.04
## - year                    1   436.58 474.58
## 
## Step:  AIC=464.62
## first_owner ~ year + mileage + automatic_transmission + fuel + 
##     drivetrain + min_mpg + damaged + navigation_system + bluetooth + 
##     third_row_seating + heated_seats + price
## 
##                          Df Deviance    AIC
## - min_mpg                 1   426.62 462.62
## - navigation_system       1   426.78 462.78
## - heated_seats            1   426.79 462.79
## <none>                        426.62 464.62
## - automatic_transmission  1   428.79 464.79
## - price                   1   429.73 465.73
## - damaged                 1   430.18 466.18
## - drivetrain              3   434.49 466.49
## - bluetooth               1   432.90 468.90
## - mileage                 1   433.74 469.74
## - fuel                    5   441.79 469.79
## - third_row_seating       1   435.71 471.71
## - year                    1   437.66 473.66
## 
## Step:  AIC=462.62
## first_owner ~ year + mileage + automatic_transmission + fuel + 
##     drivetrain + damaged + navigation_system + bluetooth + third_row_seating + 
##     heated_seats + price
## 
##                          Df Deviance    AIC
## - navigation_system       1   426.78 460.78
## - heated_seats            1   426.79 460.79
## <none>                        426.62 462.62
## - automatic_transmission  1   428.80 462.80
## - price                   1   429.93 463.93
## - damaged                 1   430.18 464.18
## - drivetrain              3   434.51 464.51
## - bluetooth               1   432.91 466.91
## - fuel                    5   441.80 467.80
## - mileage                 1   433.93 467.93
## - third_row_seating       1   435.72 469.72
## - year                    1   437.72 471.72
## 
## Step:  AIC=460.78
## first_owner ~ year + mileage + automatic_transmission + fuel + 
##     drivetrain + damaged + bluetooth + third_row_seating + heated_seats + 
##     price
## 
##                          Df Deviance    AIC
## - heated_seats            1   427.01 459.01
## <none>                        426.78 460.78
## - automatic_transmission  1   428.92 460.92
## - damaged                 1   430.31 462.31
## - drivetrain              3   434.62 462.62
## - price                   1   431.26 463.26
## - bluetooth               1   432.91 464.91
## - fuel                    5   441.80 465.80
## - mileage                 1   433.93 465.93
## - third_row_seating       1   435.79 467.79
## - year                    1   437.77 469.77
## 
## Step:  AIC=459.01
## first_owner ~ year + mileage + automatic_transmission + fuel + 
##     drivetrain + damaged + bluetooth + third_row_seating + price
## 
##                          Df Deviance    AIC
## <none>                        427.01 459.01
## - automatic_transmission  1   429.19 459.19
## - damaged                 1   430.55 460.55
## - drivetrain              3   434.79 460.79
## - price                   1   431.69 461.69
## - bluetooth               1   433.06 463.06
## - fuel                    5   441.84 463.84
## - mileage                 1   434.18 464.18
## - third_row_seating       1   436.81 466.81
## - year                    1   438.26 468.26
# Checking the summary of the model selected by step()
summary(logistic_modelB)
## 
## Call:
## glm(formula = first_owner ~ year + mileage + automatic_transmission + 
##     fuel + drivetrain + damaged + bluetooth + third_row_seating + 
##     price, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3089  -0.8709  -0.3404   0.9059   2.6759  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                 -3.119e+02  9.865e+02  -0.316  0.75191   
## year                         1.479e-01  4.832e-02   3.060  0.00221 **
## mileage                     -1.314e-05  4.992e-06  -2.631  0.00850 **
## automatic_transmission1     -7.383e-01  4.942e-01  -1.494  0.13522   
## fuelElectric                 1.269e+01  9.817e+02   0.013  0.98969   
## fuelGPL                      1.375e+01  9.817e+02   0.014  0.98883   
## fuelHybrid                   1.623e+01  9.817e+02   0.017  0.98681   
## fuelPetrol                   1.458e+01  9.817e+02   0.015  0.98815   
## fuelUnknown                  1.375e+01  9.817e+02   0.014  0.98882   
## drivetrainFront-wheel Drive  5.258e-01  3.204e-01   1.641  0.10076   
## drivetrainRear-wheel Drive  -7.211e-01  3.909e-01  -1.845  0.06506 . 
## drivetrainUnknown           -1.221e+01  1.519e+03  -0.008  0.99359   
## damaged1                    -5.362e-01  2.877e-01  -1.864  0.06235 . 
## bluetooth1                  -1.159e+00  4.838e-01  -2.395  0.01662 * 
## third_row_seating1           1.047e+00  3.433e-01   3.049  0.00230 **
## price                        3.713e-05  1.729e-05   2.147  0.03176 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.85  on 409  degrees of freedom
## Residual deviance: 427.01  on 394  degrees of freedom
## AIC: 459.01
## 
## Number of Fisher Scoring iterations: 15
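The backward elimination performed by step() above can be illustrated on a small simulated data set. This is only a sketch: the data frame `sim` and the variables `x1`, `x2`, `x3` are hypothetical, and only `x1` truly affects the outcome.

```r
# Simulated data: y depends on x1 only; x2 and x3 are pure noise
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1))

# Full model with all three predictors
full <- glm(y ~ x1 + x2 + x3, family = binomial, data = sim)

# Backward selection by AIC; trace = 0 hides the step-by-step table
reduced <- step(full, trace = 0)

# Compare the selected formula and the AIC of both models
formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Because step() only accepts a deletion when it lowers the AIC, the selected model never has a higher AIC than the starting model.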

Let us now remove insignificant terms using the update() function.

#Applying update() function to remove insignificant term 'fuel'
logistic_modelC <- update(logistic_modelB,~.-fuel)

#Checking the model summary
summary(logistic_modelC)
## 
## Call:
## glm(formula = first_owner ~ year + mileage + automatic_transmission + 
##     drivetrain + damaged + bluetooth + third_row_seating + price, 
##     family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2167  -0.8830  -0.3815   0.9244   2.7031  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                 -3.086e+02  9.623e+01  -3.207  0.00134 **
## year                         1.534e-01  4.782e-02   3.207  0.00134 **
## mileage                     -1.145e-05  4.808e-06  -2.382  0.01720 * 
## automatic_transmission1     -7.598e-01  4.902e-01  -1.550  0.12114   
## drivetrainFront-wheel Drive  5.305e-01  3.130e-01   1.695  0.09012 . 
## drivetrainRear-wheel Drive  -6.919e-01  3.854e-01  -1.795  0.07261 . 
## drivetrainUnknown           -1.007e+01  5.901e+02  -0.017  0.98639   
## damaged1                    -4.951e-01  2.803e-01  -1.766  0.07734 . 
## bluetooth1                  -1.077e+00  4.630e-01  -2.327  0.01996 * 
## third_row_seating1           1.028e+00  3.340e-01   3.079  0.00208 **
## price                        3.823e-05  1.664e-05   2.298  0.02156 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.85  on 409  degrees of freedom
## Residual deviance: 441.84  on 399  degrees of freedom
## AIC: 463.84
## 
## Number of Fisher Scoring iterations: 13
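Whether dropping 'fuel' is justified can also be checked with a likelihood-ratio (drop-in-deviance) test. The sketch below reuses only the residual deviances and degrees of freedom printed in the two summaries above; the variable names are illustrative. With the fitted objects available, `anova(logistic_modelC, logistic_modelB, test = "Chisq")` performs the same comparison.

```r
# Residual deviance and degrees of freedom copied from the summaries above
dev_B <- 427.01; df_B <- 394   # logistic_modelB (with fuel)
dev_C <- 441.84; df_C <- 399   # logistic_modelC (without fuel)

# Test statistic: increase in deviance caused by removing fuel
lr_stat <- dev_C - dev_B
lr_df   <- df_C - df_B

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```

A p-value below 0.05 would suggest that 'fuel', taken as a whole, still improves the fit even though none of its individual Wald tests is significant. That reading would be consistent with the AIC rising from 459.01 to 463.84 once the term was dropped.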

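logistic_modelC is currently our minimum adequate model. Using the coefficients from its summary, and writing p for the probability that a car is sold by its first owner, it can be written as

\[\begin{aligned}
\log\left(\frac{p}{1-p}\right) = {} & -308.6 + 0.1534\times{year} - 0.00001145\times{mileage} - 0.7598\times{automatic\_transmission} \\
& + 0.5305\times{drivetrain}_{front} - 0.6919\times{drivetrain}_{rear} - 10.07\times{drivetrain}_{unknown} \\
& - 0.4951\times{damaged} - 1.077\times{bluetooth} + 1.028\times{third\_row\_seating} + 0.00003823\times{price}
\end{aligned}\]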
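To see how the fitted model turns a set of car attributes into a probability, the sketch below plugs the coefficients printed in the summary of logistic_modelC into the inverse logit. The car described by `x` is hypothetical and chosen only for illustration. Because the printed coefficients are rounded and 'year' is around 2018, the result is approximate; with the fitted object available, `predict(logistic_modelC, newdata = ..., type = "response")` is the more reliable route.

```r
# Coefficients copied (rounded) from summary(logistic_modelC)
beta <- c(
  "(Intercept)"                 = -3.086e+02,
  year                          =  1.534e-01,
  mileage                       = -1.145e-05,
  automatic_transmission1       = -7.598e-01,
  "drivetrainFront-wheel Drive" =  5.305e-01,
  "drivetrainRear-wheel Drive"  = -6.919e-01,
  drivetrainUnknown             = -1.007e+01,
  damaged1                      = -4.951e-01,
  bluetooth1                    = -1.077e+00,
  third_row_seating1            =  1.028e+00,
  price                         =  3.823e-05
)

# Hypothetical car: built in 2018, 30,000 miles, automatic, front-wheel
# drive, undamaged, with Bluetooth, no third row, priced at 20,000
x <- c(1, 2018, 30000, 1, 1, 0, 0, 0, 1, 0, 20000)
names(x) <- names(beta)

# Linear predictor (log odds) and its inverse-logit transform
log_odds <- sum(beta * x)
prob <- plogis(log_odds)
c(log_odds = log_odds, probability = prob)
```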

6.1.1 Odds ratios and odds

# Extract the coefficients from logistic_modelC and exponentiate them to obtain odds ratios
exp(coef(logistic_modelC))
##                 (Intercept)                        year 
##               9.601309e-135                1.165763e+00 
##                     mileage     automatic_transmission1 
##                9.999885e-01                4.677772e-01 
## drivetrainFront-wheel Drive  drivetrainRear-wheel Drive 
##                1.699733e+00                5.006388e-01 
##           drivetrainUnknown                    damaged1 
##                4.250915e-05                6.094908e-01 
##                  bluetooth1          third_row_seating1 
##                3.404474e-01                2.796655e+00 
##                       price 
##                1.000038e+00

The odds ratios show how each variable changes the odds of a car being sold by its first owner, holding the other variables in the model constant.

Variables such as ‘mileage’ and ‘bluetooth’ have odds ratios below 1, so they are associated with lower odds of a car being sold by its first owner. For example, the odds ratio of 0.34 for ‘bluetooth’ means that cars with Bluetooth have roughly 66% lower odds than cars without it, and each additional unit of ‘mileage’ multiplies the odds by 0.9999885.

Conversely, the ‘year’ variable has an odds ratio greater than 1 (about 1.17), so each additional production year is associated with roughly 17% higher odds of the car being sold by its first owner.
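An odds ratio very close to 1, such as the one for ‘mileage’, can look negligible only because it refers to a single unit of the predictor. The sketch below rescales it to a change of 10,000 units and converts odds ratios into percentage changes in the odds. The coefficients are copied (rounded) from the summary of logistic_modelC, and the choice of 10,000 as the step size is an assumption made for illustration.

```r
# Coefficients copied (rounded) from summary(logistic_modelC)
beta_mileage   <- -1.145e-05
beta_year      <-  1.534e-01
beta_bluetooth <- -1.077e+00

# Odds ratio for one extra unit of mileage versus 10,000 extra units
or_per_unit      <- exp(beta_mileage)
or_per_10k_units <- exp(beta_mileage * 10000)

# Percentage change in the odds implied by an odds ratio: (OR - 1) * 100
pct_change_year      <- (exp(beta_year) - 1) * 100
pct_change_bluetooth <- (exp(beta_bluetooth) - 1) * 100

round(c(or_per_unit = or_per_unit,
        or_per_10k_units = or_per_10k_units,
        pct_change_year = pct_change_year,
        pct_change_bluetooth = pct_change_bluetooth), 4)
```

Rescaling does not change the model; it only expresses the same coefficient on a scale that is easier to interpret.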