Both datasets were found on Kaggle. The first dataset contains attributes relating to apple quality and can be accessed here. The second dataset records information and pricing for vehicles and can be accessed here.
apple_quality <- read.csv("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_622/data/apple_quality.csv")
car_price <- read.csv("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_622/data/car_price_prediction.csv")
The apple quality dataset contains about 4,000 rows and is 379 KB. The variables are as follows:
| Variable | Description |
|---|---|
A_id |
Unique identifier for each fruit |
Size |
Size of the fruit |
Weight |
Weight of the fruit |
Sweetness |
Degree of sweetness of the fruit |
Crunchiness |
Texture indicating the crunchiness of the fruit |
Juiciness |
Level of juiciness of the fruit |
Ripeness |
Stage of ripeness of the fruit |
Acidity |
Acidity level of the fruit |
Quality |
Overall quality of the fruit |
glimpse(apple_quality)
## Rows: 4,000
## Columns: 9
## $ A_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Size <dbl> -3.97004852, -1.19521719, -0.29202386, -0.65719577, 1.3642…
## $ Weight <dbl> -2.5123364, -2.8392565, -1.3512820, -2.2716266, -1.2966119…
## $ Sweetness <dbl> 5.3463296, 3.6640588, -1.7384292, 1.3248738, -0.3846582, -…
## $ Crunchiness <dbl> -1.01200871, 1.58823231, -0.34261593, -0.09787472, -0.5530…
## $ Juiciness <dbl> 1.8449004, 0.8532858, 2.8386355, 3.6379705, 3.0308744, -3.…
## $ Ripeness <dbl> 0.32983980, 0.86753008, -0.03803333, -3.41376134, -1.30384…
## $ Acidity <dbl> -0.4915905, -0.7228094, 2.6216365, 0.7907232, 0.5019840, -…
## $ Quality <chr> "good", "good", "bad", "good", "good", "bad", "good", "goo…
There are 4,001 rows and 9 columns. The first column is a unique
identifier for each apple recorded and will not be relevant for
modeling. All of the columns are numeric except for
Quality, which is our target variable.
We can visualize the distributions of our variables.
apple_quality |>
select(-A_id) |> # exclude apple ID
pivot_longer(cols=c(Size:Acidity),
names_to = 'Feature') |>
ggplot(aes(x = value)) +
geom_histogram(bins=30, fill='#6495ED') +
facet_wrap(~Feature) +
labs(title = "Distributions of Predictor Variables", x = "Value", y = "Count")
All of the features seem normally distributed.
apple_quality |>
ggplot(aes(x = Quality)) +
geom_bar(aes(fill = Quality)) +
labs(title = "Distribution of Apple Quality", y = "Count") +
geom_text(stat='count', aes(label=..count..), vjust=2, color="white", fontface='bold')
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The breakdown of “good” and “bad” apples is about equal, with 1,996 apples qualifying as “bad” and 2,004 apples qualifying as “good”.
We can also check the distribution of each predictive feature based on the response variable.
apple_quality |>
select(-A_id) |> # exclude apple ID
pivot_longer(cols=c(Size:Acidity),
names_to = 'Feature') |>
ggplot(aes(x = value, y=Quality)) +
geom_boxplot(aes(fill=Quality)) +
facet_wrap(~Feature) +
labs(title = "Distributions of Predictor Variables by Quality", x = "Value", y = "Count")
“Good” apples are relatively juicier, bigger, sweeter, and less ripe. They also vary more in weight and crunchiness.
Let’s check for correlations between predictor variables.
apple_quality |>
select(-A_id, -Quality) |>
cor() |>
corrplot(method="color",
diag=FALSE,
type="lower",
addCoef.col = "black",
number.cex=0.70)
Some of the features are slightly correlated with each other but none are significantly correlated.
This data is labelled, with each apple getting a Quality
rating of “good” or “bad”. For this data, I would use binary logistic
regression to predict the quality of an apple. In order to do this, the
Quality column needs to be binary encoded to 0s and 1s. For
this example, we will use “good” = 1 and “bad” = 0.
apple_features <- apple_quality |>
select(-A_id) |>
mutate(Quality = ifelse(Quality == "good", 1, 0))
Let’s check if any of our variables are missing values.
colSums(is.na(apple_features))
## Size Weight Sweetness Crunchiness Juiciness Ripeness
## 0 0 0 0 0 0
## Acidity Quality
## 0 0
There are no missing values.
The distributions are all pretty much normal so we could now create a binary logistic model from the data.
apple_mod <- glm(Quality~., data=apple_features, family="binomial")
apple_mod
##
## Call: glm(formula = Quality ~ ., family = "binomial", data = apple_features)
##
## Coefficients:
## (Intercept) Size Weight Sweetness Crunchiness Juiciness
## 0.64622 0.64947 0.26673 0.56879 0.03458 0.44224
## Ripeness Acidity
## -0.12674 -0.29412
##
## Degrees of Freedom: 3999 Total (i.e. Null); 3992 Residual
## Null Deviance: 5545
## Residual Deviance: 4114 AIC: 4130
The car price dataset consists of about 19,000 rows and is 2,113 KB. The variables are as follows:
| Variable | Description |
|---|---|
ID |
Unique ID for the vechile |
Price |
Price of the vehicle |
Levy |
Lien against the vehicle |
Manufacturer |
Vehicle manufacturer |
Model |
Vehicle model |
ProdYear |
Year vehicle was produced |
Category |
Type of vehicle (Sedan, Minivan, etc.) |
LeatherInterior |
Whether the interior is leather (Yes, No) |
FuelType |
Type of fuel (Petrol, Diesel, Hybrid, etc.) |
EngineVolume |
Engine displacement in liters |
Mileage |
Car mileage in kilometers |
Cylinders |
Number of cylinders in engine |
GearBoxType |
Gear box of the transmission (Manual, Automatic, etc.) |
DriveWheels |
Type of drive (Front, 4x4, Rear, etc.) |
Doors |
Number of doors |
Wheel |
Steering wheel placement |
Color |
Exterior color of vehicle |
Airbags |
Number of airbags |
glimpse(car_price)
## Rows: 19,237
## Columns: 18
## $ ID <int> 45654403, 44731507, 45774419, 45769185, 45809263, 4580…
## $ Price <int> 13328, 16621, 8467, 3607, 11726, 39493, 1803, 549, 109…
## $ Levy <chr> "1399", "1018", "-", "862", "446", "891", "761", "751"…
## $ Manufacturer <chr> "LEXUS", "CHEVROLET", "HONDA", "FORD", "HONDA", "HYUND…
## $ Model <chr> "RX 450", "Equinox", "FIT", "Escape", "FIT", "Santa FE…
## $ ProdYear <int> 2010, 2011, 2006, 2011, 2014, 2016, 2010, 2013, 2014, …
## $ Category <chr> "Jeep", "Jeep", "Hatchback", "Jeep", "Hatchback", "Jee…
## $ LeatherInterior <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ FuelType <chr> "Hybrid", "Petrol", "Petrol", "Hybrid", "Petrol", "Die…
## $ EngineVolume <chr> "3.5", "3", "1.3", "2.5", "1.3", "2", "1.8", "2.4", "2…
## $ Mileage <chr> "186005 km", "192000 km", "200000 km", "168966 km", "9…
## $ Cylinders <int> 6, 6, 4, 4, 4, 4, 4, 4, 4, 6, 6, 8, 4, 6, 4, 4, 4, 4, …
## $ GearBoxType <chr> "Automatic", "Tiptronic", "Variator", "Automatic", "Au…
## $ DriveWheels <chr> "4x4", "4x4", "Front", "4x4", "Front", "Front", "Front…
## $ Doors <chr> "4-May", "4-May", "4-May", "4-May", "4-May", "4-May", …
## $ Wheel <chr> "Left wheel", "Left wheel", "Right-hand drive", "Left …
## $ Color <chr> "Silver", "Black", "Black", "White", "Silver", "White"…
## $ Airbags <int> 12, 8, 2, 0, 4, 4, 12, 12, 12, 12, 12, 0, 4, 12, 4, 12…
There are 19,237 rows and 18 columns in the dataset. Some of the variables are incorrectly coded.
Let’s fix these variables.
Not all cars have a Levy. However, instead of null
values, the missing levies are filled with a “-”. Let’s replace with a
“0” for cars with no levy recorded.
# change levy to integer type
car_price <- car_price |>
mutate(Levy = ifelse(Levy == '-',"0", Levy),
Levy = as.integer(Levy))
In general, a levy for a vehicle is not higher than the purchase
price. Let’s filter out any values where the Levy may be
higher than the Price.
car_price <- car_price |>
filter(Levy < Price)
Some of the values in EngineVolume also note whether the
engine is turbocharged with the word “Turbo” after the engine
displacement value. We can extract the “Turbo” into a separate dummy
column to note whether or not an engine is turbocharged.
# change engine volume to double
car_price <- car_price |>
separate(EngineVolume, into=c("EngineVolume", "TurboEngine"), sep=" ") |>
mutate(TurboEngine = ifelse(!is.na(TurboEngine), "Yes", "No"),
EngineVolume = as.double(EngineVolume))
The car mileage has “km” in each. We know that this is mileage in kilometers. We do not need the “km” notation.
car_price <- car_price |>
mutate(Mileage = as.integer(str_remove(Mileage, " km")))
From the glimpse of the data, it seems wheel is noted as “Left wheel” or “Right-hand drive”. Let’s make sure this is true.
car_price |>
count(Wheel)
## Wheel n
## 1 Left wheel 15828
## 2 Right-hand drive 1474
Since this is the placement of the wheel in the car, we can just use “Left” or “Right”.
car_price <- car_price |>
mutate(Wheel = ifelse(Wheel == "Left wheel", "Left", "Right"))
We can do something similar with Doors.
car_price |>
count(Doors)
## Doors n
## 1 2-Mar 753
## 2 4-May 16421
## 3 >5 128
car_price <- car_price |>
mutate(Doors = str_remove(Doors, "-.*"))
The data was updated two years ago. Let’s create an Age
column by subtracting the ProdYear from 2022.
car_price <- car_price |>
mutate("Age" = 2022 - ProdYear)
Let’s check if there are any missing values.
colSums(is.na(car_price))
## ID Price Levy Manufacturer Model
## 0 0 0 0 0
## ProdYear Category LeatherInterior FuelType EngineVolume
## 0 0 0 0 0
## TurboEngine Mileage Cylinders GearBoxType DriveWheels
## 0 0 0 0 0
## Doors Wheel Color Airbags Age
## 0 0 0 0 0
There are no missing values.
Now let’s take a look at the distributions of some of our variables.
cat_features <- c("Category", "LeatherInterior", "FuelType", "TurboEngine", "GearBoxType", "DriveWheels", "Doors", "Color")
for (feature in cat_features) {
print(
ggplot(data = car_price, aes(y = .data[[feature]])) +
geom_bar(fill='plum4') +
labs(title = paste("Barplot of", feature)))
}
For the purposes of this analysis, let’s only use cars that are either petrol, diesel, or hybrid, as those account for most of the vehicles.
car_price_filtered <- car_price |>
filter(FuelType %in% c("Diesel", "Petrol", "Hybrid"))
cont_features <- c("Price", "Mileage", "Levy", "EngineVolume", "Age")
for (feature in cont_features) {
print(
ggplot(data = car_price_filtered, aes(x = .data[[feature]])) +
geom_histogram(bins=40, fill='lightblue') +
labs(title = paste("Histogram of", feature), x=feature)
)
print(
ggplot(data = car_price_filtered, aes(y = .data[[feature]])) +
geom_boxplot(fill='lightblue') +
labs(title = paste("Boxplot of", feature), x=feature)
)
}
There are many outliers which do not allow us to see the
distributions for Price and Mileage clearly.
Let’s see if we can transform the data to see these distributions more
clearly.
for (feature in c("Price", "Mileage")) {
print(
ggplot(data = car_price_filtered, aes(x = log(.data[[feature]]))) +
geom_histogram(bins=40, fill='lightblue') +
labs(title = paste("Histogram of Transformed", feature), x=feature)
)
print(
ggplot(data = car_price_filtered, aes(y = log(.data[[feature]]))) +
geom_boxplot(fill='lightblue') +
labs(title = paste("Boxplot of Transformed", feature), x=feature)
)
}
We can see that there are a lot of outliers but the distributions look a little more normal.
How much of the data is outliers?
count_outliers <- function(x) {
iqr <- IQR(x)
lower <- quantile(x, .25) - 1.5*iqr
upper <- quantile(x, .75) + 1.5*iqr
outliers <- x[x < lower | x > upper]
return(length(outliers))
}
car_price_numeric <- car_price_filtered |>
select(-ID, -ProdYear) |>
select_if(is.numeric)
outlier_counts <- sapply(car_price_numeric, count_outliers)
data.frame(
outlier_count = outlier_counts
) |>
mutate(percent_outliers = (outlier_count/nrow(car_price_numeric))*100)
## outlier_count percent_outliers
## Price 905 5.715549
## Levy 86 0.543135
## EngineVolume 759 4.793482
## Mileage 400 2.526209
## Cylinders 3606 22.773778
## Airbags 0 0.000000
## Age 1147 7.243906
Almost 6% of the prices are outliers.
Let’s remove any prices that are considered “outliers”.
outlier_indices <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
outlier_indices <- which(x < lower_bound | x > upper_bound)
return(outlier_indices)
}
outliers_removed <- car_price_filtered[-(outlier_indices(car_price_filtered$Price)),]
nrow(outliers_removed)
## [1] 14929
We still have 14,929 rows.
How are the features correlated?
outliers_removed |>
select(-ID, -ProdYear) |>
select_if(is.numeric) |>
cor() |>
corrplot(method='color',
diag=F,
type='lower',
addCoef.col = "black",
number.cex=0.7,
title='Outliers Removed',
mar=c(0,0,1,0))
There is a slight negative correlation between Price and
Levy and Price and EngineVolume.
There is a slight negative correlation between Price and
Age. Cylinder and EngineVolume
are highly correlated. Age and Levy are
slightly correlated, as well as Age and
Airbags.
Let’s check the breakdown of Price over different
categories.
vars <- c("FuelType", "TurboEngine", "GearBoxType", "DriveWheels", "Wheel", "Color")
for (feature in vars) {
print(
ggplot(data = outliers_removed, aes(y = .data[[feature]], x=Price)) +
geom_boxplot(fill="plum") +
labs(title = paste("Barplot of", feature)))
}
Diesel cars are priced higher the highest. Cars with Turbo engines are also more expensive. Cars with steering wheels on the left are more expensive but also have more varied prices.
Based on the distributions of the data and the fact that none of the variables have strong linear correlations, I would use either a Random Forest or kNN model to predict on this data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# split into training and testing
set.seed(42)
outliers_removed_sub <- outliers_removed |>
select(-ID, -ProdYear)
train_indices <- sample(1:nrow(outliers_removed_sub), 0.7 * nrow(outliers_removed_sub)) # 70% for training
train_data <- outliers_removed_sub[train_indices, ]
test_data <- outliers_removed_sub[-train_indices, ]
# train model
rf_model <- randomForest(Price ~ ., data = train_data)
print(rf_model)
##
## Call:
## randomForest(formula = Price ~ ., data = train_data)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 31861984
## % Var explained: 76.53