Exploratory Analysis

Data Loading

Both datasets were found on Kaggle. The first dataset contains attributes relating to apple quality and can be accessed here. The second dataset records information and pricing for vehicles and can be accessed here.

apple_quality <- read.csv("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_622/data/apple_quality.csv")

car_price <- read.csv("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_622/data/car_price_prediction.csv")

Apple Quality

The apple quality dataset contains about 4,000 rows and is 379 KB. The variables are as follows:

Variable Description
A_id Unique identifier for each fruit
Size Size of the fruit
Weight Weight of the fruit
Sweetness Degree of sweetness of the fruit
Crunchiness Texture indicating the crunchiness of the fruit
Juiciness Level of juiciness of the fruit
Ripeness Stage of ripeness of the fruit
Acidity Acidity level of the fruit
Quality Overall quality of the fruit
glimpse(apple_quality)
## Rows: 4,000
## Columns: 9
## $ A_id        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Size        <dbl> -3.97004852, -1.19521719, -0.29202386, -0.65719577, 1.3642…
## $ Weight      <dbl> -2.5123364, -2.8392565, -1.3512820, -2.2716266, -1.2966119…
## $ Sweetness   <dbl> 5.3463296, 3.6640588, -1.7384292, 1.3248738, -0.3846582, -…
## $ Crunchiness <dbl> -1.01200871, 1.58823231, -0.34261593, -0.09787472, -0.5530…
## $ Juiciness   <dbl> 1.8449004, 0.8532858, 2.8386355, 3.6379705, 3.0308744, -3.…
## $ Ripeness    <dbl> 0.32983980, 0.86753008, -0.03803333, -3.41376134, -1.30384…
## $ Acidity     <dbl> -0.4915905, -0.7228094, 2.6216365, 0.7907232, 0.5019840, -…
## $ Quality     <chr> "good", "good", "bad", "good", "good", "bad", "good", "goo…

There are 4,001 rows and 9 columns. The first column is a unique identifier for each apple recorded and will not be relevant for modeling. All of the columns are numeric except for Quality, which is our target variable.

We can visualize the distributions of our variables.

apple_quality |>
  select(-A_id) |>  # exclude apple ID
  pivot_longer(cols=c(Size:Acidity),
               names_to = 'Feature') |>
  ggplot(aes(x = value)) +
  geom_histogram(bins=30, fill='#6495ED') +
  facet_wrap(~Feature) +
  labs(title = "Distributions of Predictor Variables", x = "Value", y = "Count")

All of the features seem normally distributed.

apple_quality |>
  ggplot(aes(x = Quality)) +
  geom_bar(aes(fill = Quality)) +
  labs(title = "Distribution of Apple Quality", y = "Count") +
  geom_text(stat='count', aes(label=..count..), vjust=2, color="white", fontface='bold')
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The breakdown of “good” and “bad” apples is about equal, with 1,996 apples qualifying as “bad” and 2,004 apples qualifying as “good”.

We can also check the distribution of each predictive feature based on the response variable.

apple_quality |>
  select(-A_id) |>  # exclude apple ID
  pivot_longer(cols=c(Size:Acidity),
               names_to = 'Feature') |>
  ggplot(aes(x = value, y=Quality)) +
  geom_boxplot(aes(fill=Quality)) +
  facet_wrap(~Feature) +
  labs(title = "Distributions of Predictor Variables by Quality", x = "Value", y = "Count")

“Good” apples are relatively juicier, bigger, sweeter, and less ripe. They also vary more in weight and crunchiness.

Let’s check for correlations between predictor variables.

apple_quality |>
  select(-A_id, -Quality) |>
  cor() |>
  corrplot(method="color",
           diag=FALSE,
           type="lower",
           addCoef.col = "black",
           number.cex=0.70)

Some of the features are slightly correlated with each other but none are significantly correlated.

This data is labelled, with each apple getting a Quality rating of “good” or “bad”. For this data, I would use binary logistic regression to predict the quality of an apple. In order to do this, the Quality column needs to be binary encoded to 0s and 1s. For this example, we will use “good” = 1 and “bad” = 0.

apple_features <- apple_quality |>
  select(-A_id) |>
  mutate(Quality = ifelse(Quality == "good", 1, 0))

Let’s check if any of our variables are missing values.

colSums(is.na(apple_features))
##        Size      Weight   Sweetness Crunchiness   Juiciness    Ripeness 
##           0           0           0           0           0           0 
##     Acidity     Quality 
##           0           0

There are no missing values.

The distributions are all pretty much normal so we could now create a binary logistic model from the data.

apple_mod <- glm(Quality~., data=apple_features, family="binomial")
apple_mod
## 
## Call:  glm(formula = Quality ~ ., family = "binomial", data = apple_features)
## 
## Coefficients:
## (Intercept)         Size       Weight    Sweetness  Crunchiness    Juiciness  
##     0.64622      0.64947      0.26673      0.56879      0.03458      0.44224  
##    Ripeness      Acidity  
##    -0.12674     -0.29412  
## 
## Degrees of Freedom: 3999 Total (i.e. Null);  3992 Residual
## Null Deviance:       5545 
## Residual Deviance: 4114  AIC: 4130

Car Prices

The car price dataset consists of about 19,000 rows and is 2,113 KB. The variables are as follows:

Variable Description
ID Unique ID for the vechile
Price Price of the vehicle
Levy Lien against the vehicle
Manufacturer Vehicle manufacturer
Model Vehicle model
ProdYear Year vehicle was produced
Category Type of vehicle (Sedan, Minivan, etc.)
LeatherInterior Whether the interior is leather (Yes, No)
FuelType Type of fuel (Petrol, Diesel, Hybrid, etc.)
EngineVolume Engine displacement in liters
Mileage Car mileage in kilometers
Cylinders Number of cylinders in engine
GearBoxType Gear box of the transmission (Manual, Automatic, etc.)
DriveWheels Type of drive (Front, 4x4, Rear, etc.)
Doors Number of doors
Wheel Steering wheel placement
Color Exterior color of vehicle
Airbags Number of airbags
glimpse(car_price)
## Rows: 19,237
## Columns: 18
## $ ID              <int> 45654403, 44731507, 45774419, 45769185, 45809263, 4580…
## $ Price           <int> 13328, 16621, 8467, 3607, 11726, 39493, 1803, 549, 109…
## $ Levy            <chr> "1399", "1018", "-", "862", "446", "891", "761", "751"…
## $ Manufacturer    <chr> "LEXUS", "CHEVROLET", "HONDA", "FORD", "HONDA", "HYUND…
## $ Model           <chr> "RX 450", "Equinox", "FIT", "Escape", "FIT", "Santa FE…
## $ ProdYear        <int> 2010, 2011, 2006, 2011, 2014, 2016, 2010, 2013, 2014, …
## $ Category        <chr> "Jeep", "Jeep", "Hatchback", "Jeep", "Hatchback", "Jee…
## $ LeatherInterior <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ FuelType        <chr> "Hybrid", "Petrol", "Petrol", "Hybrid", "Petrol", "Die…
## $ EngineVolume    <chr> "3.5", "3", "1.3", "2.5", "1.3", "2", "1.8", "2.4", "2…
## $ Mileage         <chr> "186005 km", "192000 km", "200000 km", "168966 km", "9…
## $ Cylinders       <int> 6, 6, 4, 4, 4, 4, 4, 4, 4, 6, 6, 8, 4, 6, 4, 4, 4, 4, …
## $ GearBoxType     <chr> "Automatic", "Tiptronic", "Variator", "Automatic", "Au…
## $ DriveWheels     <chr> "4x4", "4x4", "Front", "4x4", "Front", "Front", "Front…
## $ Doors           <chr> "4-May", "4-May", "4-May", "4-May", "4-May", "4-May", …
## $ Wheel           <chr> "Left wheel", "Left wheel", "Right-hand drive", "Left …
## $ Color           <chr> "Silver", "Black", "Black", "White", "Silver", "White"…
## $ Airbags         <int> 12, 8, 2, 0, 4, 4, 12, 12, 12, 12, 12, 0, 4, 12, 4, 12…

There are 19,237 rows and 18 columns in the dataset. Some of the variables are incorrectly coded.

Let’s fix these variables.

Not all cars have a Levy. However, instead of null values, the missing levies are filled with a “-”. Let’s replace with a “0” for cars with no levy recorded.

# change levy to integer type
car_price <- car_price |>
  mutate(Levy = ifelse(Levy == '-',"0", Levy),
         Levy = as.integer(Levy))

In general, a levy for a vehicle is not higher than the purchase price. Let’s filter out any values where the Levy may be higher than the Price.

car_price <- car_price |>
  filter(Levy < Price)

Some of the values in EngineVolume also note whether the engine is turbocharged with the word “Turbo” after the engine displacement value. We can extract the “Turbo” into a separate dummy column to note whether or not an engine is turbocharged.

# change engine volume to double
car_price <- car_price |>
  separate(EngineVolume, into=c("EngineVolume", "TurboEngine"), sep=" ") |>
  mutate(TurboEngine = ifelse(!is.na(TurboEngine), "Yes", "No"),
         EngineVolume = as.double(EngineVolume))

The car mileage has “km” in each. We know that this is mileage in kilometers. We do not need the “km” notation.

car_price <- car_price |>
  mutate(Mileage = as.integer(str_remove(Mileage, " km")))

From the glimpse of the data, it seems wheel is noted as “Left wheel” or “Right-hand drive”. Let’s make sure this is true.

car_price |>
  count(Wheel)
##              Wheel     n
## 1       Left wheel 15828
## 2 Right-hand drive  1474

Since this is the placement of the wheel in the car, we can just use “Left” or “Right”.

car_price <- car_price |>
  mutate(Wheel = ifelse(Wheel == "Left wheel", "Left", "Right"))

We can do something similar with Doors.

car_price |>
  count(Doors)
##   Doors     n
## 1 2-Mar   753
## 2 4-May 16421
## 3    >5   128
car_price <- car_price |>
  mutate(Doors = str_remove(Doors, "-.*"))

The data was updated two years ago. Let’s create an Age column by subtracting the ProdYear from 2022.

car_price <- car_price |>
  mutate("Age" = 2022 - ProdYear)

Let’s check if there are any missing values.

colSums(is.na(car_price))
##              ID           Price            Levy    Manufacturer           Model 
##               0               0               0               0               0 
##        ProdYear        Category LeatherInterior        FuelType    EngineVolume 
##               0               0               0               0               0 
##     TurboEngine         Mileage       Cylinders     GearBoxType     DriveWheels 
##               0               0               0               0               0 
##           Doors           Wheel           Color         Airbags             Age 
##               0               0               0               0               0

There are no missing values.

Now let’s take a look at the distributions of some of our variables.

cat_features <- c("Category", "LeatherInterior", "FuelType", "TurboEngine", "GearBoxType", "DriveWheels", "Doors", "Color") 

for (feature in cat_features) {
  print(
  ggplot(data = car_price, aes(y = .data[[feature]])) +
    geom_bar(fill='plum4') +
    labs(title = paste("Barplot of", feature)))
}

  • Most of the entries are sedans, jeeps, and hatchbacks.
  • More than half the cars have leather interiors.
  • Most cars take petrol, diesel, or are hybrid.
  • Most of the cars do not have a turbo engine.
  • Most cars are automatic.
  • Most cars have front driving wheels.
  • Most cars have 4 doors.
  • Most cars are white, silver, grey, or black.

For the purposes of this analysis, let’s only use cars that are either petrol, diesel, or hybrid, as those account for most of the vehicles.

car_price_filtered <- car_price |>
  filter(FuelType %in% c("Diesel", "Petrol", "Hybrid"))
cont_features <- c("Price", "Mileage", "Levy", "EngineVolume", "Age")

for (feature in cont_features) {
  print(
  ggplot(data = car_price_filtered, aes(x = .data[[feature]])) +
    geom_histogram(bins=40, fill='lightblue') +
    labs(title = paste("Histogram of", feature), x=feature)
  )
  print(
  ggplot(data = car_price_filtered, aes(y = .data[[feature]])) +
    geom_boxplot(fill='lightblue') +
    labs(title = paste("Boxplot of", feature), x=feature)
  )
}

There are many outliers which do not allow us to see the distributions for Price and Mileage clearly. Let’s see if we can transform the data to see these distributions more clearly.

for (feature in c("Price", "Mileage")) {
  print(
  ggplot(data = car_price_filtered, aes(x = log(.data[[feature]]))) +
    geom_histogram(bins=40, fill='lightblue') +
    labs(title = paste("Histogram of Transformed", feature), x=feature)
  )
  print(
  ggplot(data = car_price_filtered, aes(y = log(.data[[feature]]))) +
    geom_boxplot(fill='lightblue') +
    labs(title = paste("Boxplot of Transformed", feature), x=feature)
  )
}

We can see that there are a lot of outliers but the distributions look a little more normal.

How much of the data is outliers?

count_outliers <- function(x) {
  iqr <- IQR(x)
  lower <- quantile(x, .25) - 1.5*iqr
  upper <- quantile(x, .75) + 1.5*iqr
  outliers <- x[x < lower | x > upper]
  return(length(outliers)) 
}

car_price_numeric <- car_price_filtered |>
  select(-ID, -ProdYear) |>
  select_if(is.numeric) 

outlier_counts <- sapply(car_price_numeric, count_outliers)

data.frame(
  outlier_count = outlier_counts
) |>
  mutate(percent_outliers = (outlier_count/nrow(car_price_numeric))*100)
##              outlier_count percent_outliers
## Price                  905         5.715549
## Levy                    86         0.543135
## EngineVolume           759         4.793482
## Mileage                400         2.526209
## Cylinders             3606        22.773778
## Airbags                  0         0.000000
## Age                   1147         7.243906

Almost 6% of the prices are outliers.

Let’s remove any prices that are considered “outliers”.

outlier_indices <- function(x) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  outlier_indices <- which(x < lower_bound | x > upper_bound)
  return(outlier_indices)
}

outliers_removed <- car_price_filtered[-(outlier_indices(car_price_filtered$Price)),] 

nrow(outliers_removed)
## [1] 14929

We still have 14,929 rows.

How are the features correlated?

outliers_removed |>
  select(-ID, -ProdYear) |>
  select_if(is.numeric) |>
  cor() |>
  corrplot(method='color',
           diag=F,
           type='lower',
           addCoef.col = "black",
           number.cex=0.7,
           title='Outliers Removed',
           mar=c(0,0,1,0))

There is a slight negative correlation between Price and Levy and Price and EngineVolume. There is a slight negative correlation between Price and Age. Cylinder and EngineVolume are highly correlated. Age and Levy are slightly correlated, as well as Age and Airbags.

Let’s check the breakdown of Price over different categories.

vars <- c("FuelType", "TurboEngine", "GearBoxType", "DriveWheels", "Wheel", "Color")

for (feature in vars) {
  print(
  ggplot(data = outliers_removed, aes(y = .data[[feature]], x=Price)) +
    geom_boxplot(fill="plum") +
    labs(title = paste("Barplot of", feature)))
}

Diesel cars are priced higher the highest. Cars with Turbo engines are also more expensive. Cars with steering wheels on the left are more expensive but also have more varied prices.

Based on the distributions of the data and the fact that none of the variables have strong linear correlations, I would use either a Random Forest or kNN model to predict on this data.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# split into training and testing
set.seed(42)  

outliers_removed_sub <- outliers_removed |>
  select(-ID, -ProdYear)

train_indices <- sample(1:nrow(outliers_removed_sub), 0.7 * nrow(outliers_removed_sub))  # 70% for training
train_data <- outliers_removed_sub[train_indices, ]
test_data <- outliers_removed_sub[-train_indices, ]

# train model
rf_model <- randomForest(Price ~ ., data = train_data)

print(rf_model)
## 
## Call:
##  randomForest(formula = Price ~ ., data = train_data) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 31861984
##                     % Var explained: 76.53