First, do the data cleaning step of handling missing values: replace missing values in the numeric columns with the column means and missing values in the categorical columns with the column modes.
car_data[car_data == ''] <- NA
# Helper for the statistical mode (most frequent non-NA value); base R's mode() returns the storage mode, not this
get_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}
# Imputing numeric columns with the mean
numeric_cols <- sapply(car_data, is.numeric)
car_data[numeric_cols] <- lapply(car_data[numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
# Imputing categorical (factor and character) columns with the mode
factor_cols <- sapply(car_data, is.factor)
car_data[factor_cols] <- lapply(car_data[factor_cols], function(x) replace(x, is.na(x), get_mode(x)))
c_cols <- sapply(car_data, is.character)
car_data[c_cols] <- lapply(car_data[c_cols], function(x) replace(x, is.na(x), get_mode(x)))
I was considering using kNN to fill in the missing values, but writing that code proved difficult, so I chose the more convenient method rather than the more accurate one. The debugging process cost me a lot of time. If I had the chance to redo the project, I would spend more time on data cleaning first and use the kNN method to handle the missing values, which I believe would noticeably improve the accuracy.
library(VIM)  # kNN() imputation comes from the VIM package
car_data_imputed <- kNN(car_data, k = 5)
sco_imputed <- kNN(sco, k = 5)
# categorical_vars holds the names of the categorical columns defined earlier
car_data_imputed[categorical_vars] <- lapply(car_data_imputed[categorical_vars], as.factor)
sco_imputed[categorical_vars] <- lapply(sco_imputed[categorical_vars], as.factor)
Then, for feature selection, I first plot the correlation matrix to eliminate the factors with an absolute correlation above 0.5, and then use a random forest model to find the factors that matter most for the target "price".
library(ggcorrplot)
numeric_predictors <- car_data[, sapply(car_data, is.numeric)]
# Correlation matrix for numeric predictors
corr_matrix <- cor(numeric_predictors, use = "complete.obs")
ggcorrplot(corr_matrix, type = 'lower', show.diag = FALSE, colors = c('red', 'white', 'darkgreen'))
Based on the first diagram, we can eliminate the factors with an absolute correlation above 0.5, which leaves the selected factors ("fuel_tank_volume_gallons", "back_legroom_inches", "front_legroom_inches", "width_inches", "daysonmarket", "maximum_seating", "mileage", "owner_count", "seller_rating").
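The same 0.5 cutoff can also be applied programmatically. Below is a minimal sketch using caret::findCorrelation on the corr_matrix computed above; it only illustrates the rule I applied by reading the plot, not the exact code I ran.
library(caret)
# findCorrelation() suggests one column to drop from each pair whose absolute correlation exceeds the cutoff
high_corr <- findCorrelation(corr_matrix, cutoff = 0.5, names = TRUE)
selected_numeric <- setdiff(colnames(corr_matrix), high_corr)
selected_numeric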
Then I use these selected factors to draw a correlation matrix again, iterating the process to refine the selection.
selected_features <- c("fuel_tank_volume_gallons", "back_legroom_inches",
                       "front_legroom_inches", "width_inches", "daysonmarket",
                       "maximum_seating", "mileage", "owner_count", "seller_rating")
plot_corr <- function(data, features) {
  corr_matrix <- cor(data[, features], use = "complete.obs")
  ggcorrplot(corr_matrix, type = 'lower', lab = TRUE, show.diag = FALSE,
             colors = c('red', 'white', 'darkgreen'), title = "Correlation matrix")
}
plot_corr(car_data, selected_features)
Based on the diagram, some factors such as "width_inches" and "fuel_tank_volume_gallons" still have very high correlation. After analyzing the graph, I finally settled on the selected numerical features ("fuel_tank_volume_gallons", "back_legroom_inches", "front_legroom_inches", "daysonmarket", "mileage", "seller_rating", "owner_count").
I tried using the chi-squared statistic for feature selection on the categorical features, but the results did not turn out well. So I turned to random forest importance for feature selection: use the importance scores to select the top 10 most important features, and then eliminate the factors that had high correlation in the earlier correlation matrix of the numerical features. A hedged sketch of the chi-squared screening is shown below.
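As a rough illustration of that attempt (not the exact code I ran), the sketch below bins price into quartiles and runs chisq.test against each categorical column; the quartile binning is my own assumption.
# Bin the target into quartiles so it can be cross-tabulated with the categorical features
price_bins <- cut(car_data$price,
                  breaks = quantile(car_data$price, probs = seq(0, 1, 0.25), na.rm = TRUE),
                  include.lowest = TRUE)
cat_cols <- names(car_data)[sapply(car_data, function(x) is.factor(x) || is.character(x))]
chi_pvalues <- sapply(cat_cols, function(col) chisq.test(table(car_data[[col]], price_bins))$p.value)
sort(chi_pvalues)
The random forest importance selection itself follows.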
library(randomForest)
library(dplyr)
target <- car_data$price
features <- car_data %>% select(-price)
rf_model <- randomForest(x = features, y = target, mtry = 4, importance = TRUE, ntree = 200)
# Rank features by %IncMSE (permutation importance)
variable_importance <- rf_model$importance
sorted_features <- sort(variable_importance[, "%IncMSE"], decreasing = TRUE)
top_features <- names(sorted_features)[1:10]
top_features
selected_predictors <- features[, top_features]
Using random forest importance for feature selection, I got the top 10 features that matter for predicting price. I compared these with the correlation matrix, removed the ones with high correlation, and ended up with the following features.
“mileage”, “trim_name”, “back_legroom_inches”, “model_name”, “major_options”, “front_legroom_inches”, “horsepower”,“fuel_type”
Then, I used these features to build a model to predict the price. First I used the random forest model, splitting the data into training and test sets with p = 0.7.
features <- c("mileage", "trim_name", "back_legroom_inches",
"model_name", "major_options", "front_legroom_inches",
"price", "horsepower","fuel_type")
car_data<-car_data[features]
predictors <- car_data[, names(car_data) != "price"]
target <- car_data$price
library(caret)  # createDataPartition() comes from caret
training_indices <- createDataPartition(car_data$price, p = 0.7, list = FALSE)
train_data <- car_data[training_indices, ]
test_data <- car_data[-training_indices, ]
In order to get a better random forest model, it is necessary to find good values for its parameters. I chose to search for the best mtry and min.node.size for prediction.
control <- trainControl(method = "cv", number = 5, search = "grid")
# caret's "ranger" method tunes mtry, splitrule and min.node.size (the plain "rf" method only tunes mtry)
tuneGrid <- expand.grid(mtry = 3:10,
                        splitrule = "variance",
                        min.node.size = c(5, 10, 20))  # candidate node sizes are illustrative
train_predictors <- train_data[, names(train_data) != "price"]
train_target <- train_data$price
rf_model <- train(
  x = train_predictors,
  y = train_target,
  method = "ranger",
  num.trees = 200,
  trControl = control,
  tuneGrid = tuneGrid,
  metric = "RMSE"
)
library(ranger)
best_params <- rf_model$bestTune
best_mtry <- best_params$mtry
best_min_node_size <- best_params$min.node.size
# Refit with ranger on the training data using the tuned parameters
forest_ranger <- ranger(
  formula = price ~ .,
  data = train_data,
  num.trees = 200,
  mtry = best_mtry,
  min.node.size = best_min_node_size,
  splitrule = "variance"
)
print(forest_ranger)
Based on the output, the best mtry for the model is 4 and the best min.node.size is 5.
Build a random forest model with the best mtry and best min.node.size, using an ntree of 500. Then run the prediction on the test data and calculate the RMSE, to compare against later modifications and the predictions from the other models below.
# randomForest calls this parameter nodesize (the equivalent of ranger's min.node.size)
rf_model <- randomForest(price ~ ., data = train_data, mtry = best_mtry, nodesize = best_min_node_size, ntree = 500)
predictions <- predict(rf_model, newdata = test_data)
# Calculate RMSE (Root Mean Squared Error) for evaluation
rmse <- sqrt(mean((predictions - test_data$price)^2))
print(paste("Random Forest RMSE:", rmse))
## [1] "Random Forest RMSE: 6701.17746083239"
Then I tried an XGBoost model to see whether it gives a lower RMSE.
library(xgboost)
# xgboost needs a numeric matrix, so one-hot encode the categorical columns with model.matrix first
dtrain <- xgb.DMatrix(data = model.matrix(price ~ . - 1, data = train_data), label = train_data$price)
dtest <- xgb.DMatrix(data = model.matrix(price ~ . - 1, data = test_data), label = test_data$price)
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 6,
  subsample = 0.7,
  colsample_bytree = 0.7
)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100, watchlist = list(eval = dtest, train = dtrain), print_every_n = 10)
predictions <- predict(xgb_model, newdata = dtest)
rmse <- sqrt(mean((predictions - test_data$price)^2))
print(paste("XGBoost RMSE:", rmse))
It turns out that the random forest model gives the lower RMSE for this prediction, so I chose the random forest model for the final predictions.
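For completeness, the linear-model baseline mentioned in the reflection at the end can be scored the same way. This is only a hedged sketch on the same split and metric, not the exact code I ran, and unseen factor levels in the test set may need extra handling before predict() succeeds.
lm_model <- lm(price ~ ., data = train_data)
lm_predictions <- predict(lm_model, newdata = test_data)
lm_rmse <- sqrt(mean((lm_predictions - test_data$price)^2))
print(paste("Linear model RMSE:", lm_rmse))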
Use rf_model to produce the predictions for the sco file, applying the same data cleaning process to it.
features <- c("mileage", "trim_name", "back_legroom_inches", "model_name", "major_options", "front_legroom_inches", "horsepower","fuel_type")
sco <- sco [,features]
numeric_cols <- sapply(sco, is.numeric)
sco[numeric_cols] <- lapply(sco[numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
factor_cols <- sapply(sco, is.factor)
sco[factor_cols] <- lapply(sco[factor_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
sco[] <- lapply(sco, function(x) as.numeric(as.character(x)))
predictions <- predict(rf_model, sco)
Then write out the submission file with the predictions.
submissionFile <- data.frame(id = sco_ids, price = predictions)
write.csv(submissionFile, 'submission.csv', row.names = FALSE, quote = FALSE)
For the entire project, there are things that I did right and things that I did wrong. I would do a few things differently if I had the chance to do the project again.
Data cleaning of missing value
I think data cleaning is an important part of the modeling. I computed column means (and modes) to replace all NA values.
Correlation matrix for feature selection
I used a correlation matrix to eliminate the factors with an absolute correlation above 0.5 and iterated the process to further select the numerical features.
Used random forest importance for feature selection
I used more than one feature selection method in order to select better features. Together with the correlation matrix, I reduced the number of factors to 8, which reduced over-fitting.
Find parameter for random forest model
I found the best mtry and min.node.size values for the random forest model, which increased the accuracy of the model.
Compare with other models
I built and analyzed other models, such as a linear model and an XGBoost model, to find the lowest RMSE, and it turned out that the random forest model was the best in this case.
Overfitting of the categorical features
I only built the correlation matrix for the numerical features and did not consider over-fitting from the categorical features. This could cause serious problems in feature selection.
ntree of random forest model
I chose an ntree of 500 for the random forest model, which makes the model take a long time to run.
KNN for missing value
I cleaned the missing values with a simple mean calculation. A more advanced method such as kNN would handle the missing values better.
Data cleaning
There is more that could be done with data cleaning. One important factor, power, could be processed into a numerical value and its missing values filled in a more correct way. I would spend more time on data cleaning for factors like "power"; a hedged sketch of that cleanup follows.
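As an illustration only, the sketch below assumes power is stored as a string such as "300 hp @ 6,000 RPM"; the exact format in the raw file may differ, so the pattern is an assumption rather than code I ran.
# Strip thousands separators, then pull the horsepower figure in front of "hp";
# strings that do not match the pattern become NA (with a coercion warning)
power_clean <- gsub(",", "", car_data$power)
car_data$power_hp <- as.numeric(sub("^\\s*(\\d+)\\s*hp.*$", "\\1", power_clean, ignore.case = TRUE))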
Chi-square test for feature selection of the categorical features
I need to reduce the over-fitting from the categorical features using a method such as the chi-square test.
Other feature selection methods
Feature selection is really important when there are a great number of factors. I think I should add more feature selection steps, such as using the leaps package to iterate the process and find better predictors; a sketch follows below.
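A minimal sketch of what that could look like, assuming the leaps package and running best-subset selection over the selected numeric features plus price; this is an illustration, not code I ran.
library(leaps)
# Best-subset search over the numeric candidates, scored by adjusted R^2
subset_fit <- regsubsets(price ~ ., data = car_data[, c(selected_features, "price")], nvmax = length(selected_features))
summary(subset_fit)$adjr2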
Try other models
I only tried 3 models (linear, random forest, XGBoost). There could be other models that perform better for this prediction, so I should also try other models.