The purpose of this project is to analyze a dataset of car attributes and predict car prices using multiple machine learning models. This analysis is part of the final assignment for MSCI 4230 – Business Analytics in Practice. The dataset includes variables such as engine size, horsepower, fuel type, and curb weight, all of which may influence the price of a car. The primary goals of this project are to perform data cleaning and transformation, visualize key relationships in the data, and apply predictive models (regression and classification).
Several visualizations were created to better understand the structure and relationships in the data:
-Histograms helped visualize the distribution of features like horsepower, engine size, and price.
-Boxplots identified outliers and compared price distributions across variables like number of doors and car body type.
-Scatter plots revealed strong linear relationships between price and variables like horsepower, curb weight, and engine size.
-A correlation heatmap showed that curbweight and enginesize are highly correlated with price, confirming their importance in predictive modeling.
These visualizations provided both intuitive and statistical insights into which features are most likely to influence car pricing.
This histogram displays the distribution of standardized car prices based on fuel type. It helps identify whether gas or diesel vehicles dominate certain price ranges. From this, you can infer which type is generally more affordable or if there’s more variation in one category.
This correlation matrix provides a comprehensive overview of how all numeric variables are interrelated. Strong positive or negative correlations (darker shades) suggest potential multicollinearity, useful for model selection and understanding variable influence on price.
These boxplots help visualize the spread, central tendency, and outliers for each numerical variable. They’re especially helpful for identifying variables with significant skew or extreme values, which can impact regression models.
This plot compares car prices between 2-door and 4-door vehicles. It reveals differences in median price and price variability across door types, offering insight into how car design impacts market value.
These scatter plots (with regression lines) explore the relationship between car price and major predictors like horsepower, curbweight, enginesize, and carwidth. They help visualize linear trends and the strength of association with price.
This set of scatter plots explores the relationship between car price and four key predictors: horsepower, curbweight, enginesize, and carwidth. Each plot includes a linear regression line to help identify trends. These variables were chosen for their logical connection to vehicle performance and value. The positive slopes across all four plots suggest that price rises as each attribute increases, which makes them strong candidates for predictive modeling. Because each plot pairs a single predictor with price, the plots show how linear each relationship is; multicollinearity among the predictors themselves is better judged from the correlation heatmap.
This heatmap visualizes the pairwise Pearson correlation coefficients between all numeric variables in the dataset. Darker colors indicate stronger correlations — either positive (closer to +1) or negative (closer to –1). The numerical values displayed help quantify these relationships. This matrix is useful for detecting multicollinearity and for identifying which features may serve as strong predictors for car price. It also highlights potential redundancies in the dataset that should be considered when building regression models.
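To make the numbers in the heatmap concrete, the sketch below computes pairwise Pearson coefficients in plain Python. It is an illustration only: the report's output comes from R, and the function names (`pearson`, `correlation_matrix`) and the toy values are assumptions, not the report's code or data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustrative values, NOT rows from the car dataset.
toy = {
    "enginesize": [1.0, 2.0, 3.0, 4.0],
    "price": [2.0, 4.1, 5.9, 8.0],
    "citympg": [8.0, 6.0, 4.0, 2.0],
}
corr = correlation_matrix(toy)
```

In this toy data `corr["enginesize"]["price"]` is close to +1 and `corr["enginesize"]["citympg"]` is close to -1, the two extremes that the darkest heatmap cells represent.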
This scatter plot emphasizes the strong linear relationship between a car’s curbweight and engine size. As vehicles become heavier, they typically require larger engines — a trend clearly supported by the upward regression line. The high correlation shown here aligns with automotive engineering principles and was also reflected in the overall correlation matrix. This visualization reinforces why both variables are valuable in predicting car price and vehicle class.
This scatter plot investigates the relationship between a car’s wheelbase (distance between front and rear axles) and its width. The positive linear trend suggests that vehicles with longer wheelbases also tend to be wider — a reflection of proportional design choices in car manufacturing. The regression line provides a clearer sense of this correlation, which, while not necessarily tied to price, can influence handling, stability, and vehicle class categorization.
This violin plot shows the density and spread of prices across different car body types. It’s useful for identifying which car body types (e.g., hatchback, sedan) tend to have greater price variability or higher medians.
This plot illustrates how engine size influences horsepower, typically a strong linear relationship. It’s useful for understanding performance scaling across the car dataset.
This bar chart shows how many cars use each type of fuel. It is useful for understanding the dataset's composition and identifying any imbalance between fuel types; in the encoded data the mean of fueltype is 0.902, so roughly 90% of the cars share one fuel type.
This faceted scatter plot splits the relationship between horsepower and price across different drivewheel configurations (FWD, RWD, etc.). It reveals how drivetrain affects performance-to-price patterns.
## fueltype aspiration doornumber carbody
## Min. :0.000 Min. :0.0000 Min. :2.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.000 Median :0.0000 Median :4.000 Median :3.000
## Mean :0.902 Mean :0.1814 Mean :3.118 Mean :2.618
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :1.000 Max. :1.0000 Max. :4.000 Max. :4.000
## drivewheel enginelocation wheelbase carlength
## Min. :0.000 Min. :0.00000 Min. :-2.0264 Min. :-2.68960
## 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:-0.7122 1st Qu.:-0.60714
## Median :1.000 Median :0.00000 Median :-0.2963 Median :-0.07584
## Mean :1.328 Mean :0.01471 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.:2.000 3rd Qu.:0.00000 3rd Qu.: 0.6020 3rd Qu.: 0.73842
## Max. :2.000 Max. :1.00000 Max. : 3.6795 Max. : 2.76591
## carwidth carheight curbweight enginetype
## Min. :-2.6252 Min. :-2.4408 Min. :-2.0624 Min. :0.000
## 1st Qu.:-0.8496 1st Qu.:-0.7151 1st Qu.:-0.7619 1st Qu.:3.000
## Median :-0.1954 Median : 0.1478 Median :-0.2725 Median :3.000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean :3.015
## 3rd Qu.: 0.4588 3rd Qu.: 0.7231 3rd Qu.: 0.7337 3rd Qu.:3.000
## Max. : 2.9820 Max. : 2.4900 Max. : 2.9045 Max. :6.000
## cylindernumber enginesize fuelsystem boreratio
## Min. : 2.000 Min. :-1.5901 Min. :0.000 Min. :-2.9352
## 1st Qu.: 4.000 1st Qu.:-0.7059 1st Qu.:1.000 1st Qu.:-0.6731
## Median : 4.000 Median :-0.1705 Median :5.000 Median :-0.0798
## Mean : 4.382 Mean : 0.0000 Mean :3.265 Mean : 0.0000
## 3rd Qu.: 4.000 3rd Qu.: 0.3588 3rd Qu.:5.000 3rd Qu.: 0.9307
## Max. :12.000 Max. : 4.7859 Max. :7.000 Max. : 2.2564
## stroke compressionratio horsepower peakrpm
## Min. :-3.7805 Min. :-0.7921 Min. :-1.4265 Min. :-2.0436
## 1st Qu.:-0.4641 1st Qu.:-0.3956 1st Qu.:-0.8690 1st Qu.:-0.6788
## Median : 0.1099 Median :-0.2886 Median :-0.2355 Median : 0.1611
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4926 3rd Qu.:-0.1879 3rd Qu.: 0.2966 3rd Qu.: 0.7910
## Max. : 2.9161 Max. : 3.2364 Max. : 4.6552 Max. : 3.1007
## citympg highwaympg price
## Min. :-1.8671 Min. :-2.1428 Min. :-1.0276
## 1st Qu.:-0.9482 1st Qu.:-0.8323 1st Qu.:-0.6917
## Median :-0.1824 Median :-0.1042 Median :-0.3751
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7365 3rd Qu.: 0.4782 3rd Qu.: 0.4007
## Max. : 3.6463 Max. : 3.3904 Max. : 4.0244
## fueltype aspiration doornumber carbody
## 0.2980992 0.3862745 0.9954984 0.8601079
## drivewheel enginelocation wheelbase carlength
## 0.5570643 0.1206690 1.0024600 1.0024600
## carwidth carheight curbweight enginetype
## 1.0024600 1.0024600 1.0024600 1.0573596
## cylindernumber enginesize fuelsystem boreratio
## 1.0831819 1.0024600 2.0119176 1.0024600
## stroke compressionratio horsepower peakrpm
## 1.0024600 1.0024600 1.0024600 1.0024600
## citympg highwaympg price
## 1.0024600 1.0024600 1.0024600
The dataset was first cleaned by removing duplicate entries and making the category labels consistent. All categorical features were then encoded numerically, and the continuous features were standardized to a mean of 0 and a standard deviation of approximately 1 (the per-variable standard deviations listed above show 1.0025 for every standardized column), so that techniques like KNN and regression work with a uniform scale.
The target variable, price, was kept as a continuous variable for regression and was also discretized into three categories (Low, Medium, High) using quantiles for classification. The summary output above confirms the transformation: categorical variables appear as integer codes (for example, fueltype ranges from 0 to 1 and doornumber from 2 to 4), while every continuous variable has a mean of 0. Standardization keeps any single variable from dominating a distance-based model like KNN simply because of its units, and it makes regression coefficients comparable across predictors.
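A reported standard deviation of 1.00246 rather than exactly 1 is what one would expect if the columns were scaled with the population formula (dividing by n) and then summarized with the sample formula (dividing by n - 1), because sqrt(204/203) is about 1.00246. That reading is an inference from the printed numbers, not something the report states. The Python sketch below illustrates it with made-up values; `standardize` is a hypothetical helper, not the report's R code.

```python
import math
import statistics


def standardize(values):
    """Z-score each value using the population SD (divide by n).

    Assumption: population scaling is inferred from the reported SD of
    1.00246; the report does not say which formula was used.
    """
    mean = statistics.fmean(values)
    pop_sd = statistics.pstdev(values)
    return [(v - mean) / pop_sd for v in values]


# Made-up values; only the row count (204) mirrors the report.
n = 204
raw = [float(i % 17) + 0.5 * (i % 5) for i in range(n)]
z = standardize(raw)

sample_sd = statistics.stdev(z)    # sample formula, divides by n - 1
expected = math.sqrt(n / (n - 1))  # about 1.00246 when n = 204
```

Here `sample_sd` and `expected` agree to within floating-point error, whatever the raw values are.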
Now we’ll split the dataset into training (70%) and
testing (30%) sets using caret.
## Training set size: 144
## Testing set size: 60
This chunk partitions the dataset into a training set (used to train
models) and a testing set (used to evaluate performance). The partition
is stratified on the price variable to maintain its
distribution across both sets.
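The idea of stratifying a split on a continuous target can be sketched as follows: sort the rows by price, cut them into quantile groups, and sample 70% from each group. This Python sketch is an illustration under stated assumptions and is not caret's implementation; the function name, the choice of five groups, and the seed are all hypothetical.

```python
import random


def stratified_split(y, train_frac=0.7, groups=5, seed=42):
    """Return (train_idx, test_idx), sampling train_frac from each quantile group of y."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    train = []
    for g in range(groups):
        # A contiguous slice of the sorted indices forms one quantile group.
        start = g * len(order) // groups
        stop = (g + 1) * len(order) // groups
        block = order[start:stop]
        k = round(train_frac * len(block))
        train.extend(rng.sample(block, k))
    train_set = set(train)
    test = [i for i in range(len(y)) if i not in train_set]
    return sorted(train), test


# Made-up standardized prices; only the row count (204) mirrors the report.
gen = random.Random(0)
prices = [gen.gauss(0, 1) for _ in range(204)]
train_idx, test_idx = stratified_split(prices)
```

Because every group contributes the same share of rows, the low, middle, and high ends of the price range are all represented in both sets.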
##
## Call:
## lm(formula = price ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89425 -0.17516 -0.00015 0.19214 1.74160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.961440 0.985284 0.976 0.33111
## fueltype -0.052488 1.007509 -0.052 0.95854
## aspiration 0.026815 0.136456 0.197 0.84454
## doornumber 0.001333 0.049086 0.027 0.97838
## carbody -0.075232 0.062342 -1.207 0.22988
## drivewheel 0.166521 0.095624 1.741 0.08415 .
## enginelocation 1.205474 0.447216 2.696 0.00803 **
## wheelbase 0.103041 0.097647 1.055 0.29342
## carlength 0.038797 0.107078 0.362 0.71774
## carwidth 0.153983 0.086664 1.777 0.07812 .
## carheight 0.045349 0.056662 0.800 0.42509
## curbweight 0.174836 0.133316 1.311 0.19219
## enginetype 0.034483 0.036027 0.957 0.34040
## cylindernumber -0.217505 0.132240 -1.645 0.10261
## enginesize 0.704072 0.209705 3.357 0.00105 **
## fuelsystem -0.024251 0.024194 -1.002 0.31817
## boreratio -0.180392 0.086929 -2.075 0.04009 *
## stroke -0.142163 0.058857 -2.415 0.01721 *
## compressionratio 0.055228 0.284070 0.194 0.84617
## horsepower 0.196504 0.109278 1.798 0.07464 .
## peakrpm 0.124027 0.051884 2.390 0.01837 *
## citympg -0.100653 0.188518 -0.534 0.59438
## highwaympg 0.151090 0.168344 0.898 0.37123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3819 on 121 degrees of freedom
## Multiple R-squared: 0.8721, Adjusted R-squared: 0.8489
## F-statistic: 37.52 on 22 and 121 DF, p-value: < 2.2e-16
## Linear Regression RMSE: 0.3756115
This linear model predicts standardized price using all
other variables. The RMSE (Root Mean Squared Error) gives an idea of
average prediction error. Multiple Linear Regression was used to predict
the continuous price variable based on multiple independent features
such as horsepower, curbweight, enginesize, and others. This model
assumes a linear relationship between the predictors and the target
variable. Standardized predictors were used to ensure comparability and
reduce the impact of scale differences across features.
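The MLR model yielded a test RMSE of 0.376 on the standardized price scale, meaning typical prediction errors of roughly 0.38 standard deviations of price. This is close to the residual standard error of 0.382 on the training data, which suggests little overfitting, and the adjusted R-squared of 0.849 indicates that the predictors explain most of the variation in price. The residuals were centered near zero (median -0.00015), suggesting no systematic bias in over- or under-prediction, although the largest positive residual (1.74) is about twice the size of the largest negative one (-0.89), so a few cars were under-predicted by a wide margin. Where relationships are non-linear (e.g., price vs. engine size), the model may have underfit the data.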
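For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal Python sketch with made-up numbers (the `rmse` helper and the values are illustrative, not the report's code or data):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up standardized prices and predictions; every error is 0.5 in magnitude.
actual = [0.0, 1.0, -1.0, 2.0]
predicted = [0.5, 0.5, -0.5, 1.5]
error = rmse(actual, predicted)
```

Because squaring happens before averaging, a single large miss raises RMSE more than several small ones of the same total size.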
##
## Low Medium High
## 67 67 70
## # weights: 72 (46 variable)
## initial value 157.101557
## iter 10 value 57.841887
## iter 20 value 36.476167
## iter 30 value 25.432954
## iter 40 value 24.364784
## iter 50 value 23.929575
## iter 60 value 23.011136
## iter 70 value 22.980240
## iter 80 value 22.903438
## iter 90 value 22.900879
## final value 22.900873
## converged
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 19 7 0
## Medium 1 13 8
## High 0 0 13
##
## Overall Statistics
##
## Accuracy : 0.7377
## 95% CI : (0.6093, 0.842)
## No Information Rate : 0.3443
## P-Value [Acc > NIR] : 4.175e-10
##
## Kappa : 0.6077
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.9500 0.6500 0.6190
## Specificity 0.8293 0.7805 1.0000
## Pos Pred Value 0.7308 0.5909 1.0000
## Neg Pred Value 0.9714 0.8205 0.8333
## Prevalence 0.3279 0.3279 0.3443
## Detection Rate 0.3115 0.2131 0.2131
## Detection Prevalence 0.4262 0.3607 0.2131
## Balanced Accuracy 0.8896 0.7152 0.8095
Logistic regression is a classification model used here to predict price category (Low, Medium, High) based on the car’s features. The confusion matrix gives a clear view of how well the model performs in classifying price ranges.
To evaluate price from a categorical perspective, price was binned into three classes: Low, Medium, and High using quantile-based thresholds. Multinomial Logistic Regression was then used to model the probability of a car falling into one of these categories, based on the same set of predictors.
This model generalizes binary logistic regression to multiclass settings and estimates separate coefficients for each class relative to a baseline. The resulting probability estimates are used to assign the most likely class to each observation.
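One advantage of logistic regression over Naïve Bayes is that it does not assume the predictors are conditionally independent given the class. Its coefficients are estimated jointly, so correlated features such as curbweight and enginesize are represented more realistically, although strong multicollinearity can still make individual coefficients unstable.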
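The step from coefficients to class probabilities can be sketched as a softmax with the baseline class fixed at a score of zero. The 46 variable weights in the fitting output are consistent with this structure: two non-baseline classes, each with an intercept and 22 slopes. The Python sketch below is illustrative only; the coefficient values, the choice of Low as baseline, and the function name are assumptions.

```python
import math


def class_probabilities(features, coefs, baseline="Low"):
    """Softmax over class scores; the baseline class has a fixed score of 0.

    coefs maps each non-baseline class to (intercept, list_of_weights).
    """
    scores = {baseline: 0.0}
    for label, (intercept, weights) in coefs.items():
        scores[label] = intercept + sum(w * x for w, x in zip(weights, features))
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical coefficients for two standardized predictors (not fitted values).
coefs = {"Medium": (0.2, [1.0, 0.5]), "High": (-0.5, [2.0, 1.5])}
probs = class_probabilities([1.0, 1.0], coefs)
predicted = max(probs, key=probs.get)  # class with the highest probability
```

With these made-up coefficients the scores are 0 for Low, 1.7 for Medium, and 3.0 for High, so the car is assigned to High.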
## KNN RMSE: 0.6300444
KNN is a distance-based method that predicts a car's price by averaging the prices of its k nearest training observations. RMSE on the test set is used to assess accuracy; at 0.630 it is well above the 0.376 obtained by linear regression.
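A minimal Python sketch of KNN regression, assuming Euclidean distance and k = 3 (the report does not state the value of k or the distance metric; the data and function name are made up):

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training rows."""
    # Pair each training row's Euclidean distance to the query with its target.
    distances = sorted((math.dist(row, query), y) for row, y in zip(train_X, train_y))
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Made-up standardized features and prices, NOT the car data.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [1.0, 2.0, 3.0, 10.0, 12.0]
prediction = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
```

The query sits next to the first three rows, so the prediction is the mean of their targets. Since every feature enters the distance on its raw scale, any column left unstandardized carries more weight than the others.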
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 14 2 0
## Medium 6 17 5
## High 0 1 16
##
## Overall Statistics
##
## Accuracy : 0.7705
## 95% CI : (0.645, 0.8685)
## No Information Rate : 0.3443
## P-Value [Acc > NIR] : 1.234e-11
##
## Kappa : 0.6562
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.7000 0.8500 0.7619
## Specificity 0.9512 0.7317 0.9750
## Pos Pred Value 0.8750 0.6071 0.9412
## Neg Pred Value 0.8667 0.9091 0.8864
## Prevalence 0.3279 0.3279 0.3443
## Detection Rate 0.2295 0.2787 0.2623
## Detection Prevalence 0.2623 0.4590 0.2787
## Balanced Accuracy 0.8256 0.7909 0.8685
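Naïve Bayes classifies cars into price categories by applying Bayes' rule under the assumption that predictors are conditionally independent within each class. The confusion matrix shows how the Low, Medium, and High priced vehicles in the test set were classified; every misclassification involves the Medium class, either as the predicted or the actual category.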
## Logistic Regression Accuracy: 73.77 %
## Naïve Bayes Accuracy: 65.57 %
This section visually and numerically compares the performance of Logistic Regression and Naïve Bayes in classifying cars into three price categories: Low, Medium, and High.
The confusion matrix heatmap for the Logistic Regression model offers a clear visual representation of prediction accuracy across the categories. Darker shades indicate higher agreement between predicted and actual classes, while off-diagonal values highlight areas of misclassification (e.g., actual Medium cars predicted as Low).
Additionally, a side-by-side accuracy comparison quantifies the performance of both models: 73.77% for Logistic Regression and 65.57% for Naïve Bayes. The latter figure differs from the 77.05% implied by the Naïve Bayes confusion matrix printed earlier (47 of 61 correct), so the two Naïve Bayes results should be reconciled before ranking the models. Depending on the structure of the data, one model may outperform the other due to its underlying assumptions.
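The accuracy and sensitivity figures can be recomputed directly from the counts in the two printed confusion matrices. The Python sketch below does this; the helper names are hypothetical, and the matrices are copied from the output above with rows as predictions and columns as reference classes (Low, Medium, High).

```python
def accuracy(matrix):
    """Share of observations on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def sensitivity(matrix, j):
    """Recall for class j: correct predictions divided by the reference (column) total."""
    column_total = sum(row[j] for row in matrix)
    return matrix[j][j] / column_total


# Counts copied from the report's two confusion matrices.
logistic = [[19, 7, 0], [1, 13, 8], [0, 0, 13]]
second_matrix = [[14, 2, 0], [6, 17, 5], [0, 1, 16]]  # printed before the Naïve Bayes text

logistic_acc = accuracy(logistic)     # 45 of 61
second_acc = accuracy(second_matrix)  # 47 of 61
```

Rounded to four decimals these give 0.7377 and 0.7705, matching the "Accuracy" lines in the respective outputs.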
Model comparison:
Regression Models: Multiple Linear Regression served as a baseline. It assumes a linear relationship between the predictors and the target variable. The model performed reasonably well, with a test RMSE of 0.376 on the standardized price scale and an adjusted R-squared of 0.849 on the training data. However, because of its simplicity, it may have underfit some of the non-linear relationships in the dataset.
K-Nearest Neighbors (KNN) regression offered a non-parametric approach by making predictions based on the closest training instances. KNN can capture local patterns that a linear model misses, but its performance is sensitive to the value of k. Here its RMSE (0.630) was substantially higher than that of linear regression (0.376), indicating weaker performance. A likely reason is that distances become less informative across 22 predictors, and the integer-coded categorical columns were not rescaled (fuelsystem, for example, has a standard deviation of 2.01), so they can weigh more heavily in the distance than the standardized columns.
Classification Models: To allow for classification modeling, the continuous price variable was converted into three categories: Low, Medium, and High, based on quantile breaks.
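Quantile-based binning into three classes can be sketched as follows. This is a Python illustration with made-up prices; the helper name, the use of tertile cut points, and the handling of values equal to a cut point are assumptions, since the report does not show how its breaks were defined.

```python
import statistics


def bin_by_tertiles(prices):
    """Label each price Low, Medium, or High using the 1/3 and 2/3 quantiles."""
    low_cut, high_cut = statistics.quantiles(prices, n=3, method="inclusive")
    labels = []
    for p in prices:
        if p <= low_cut:
            labels.append("Low")
        elif p <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels


# Made-up prices 1..9, NOT the car data.
prices = [float(v) for v in range(1, 10)]
labels = bin_by_tertiles(prices)
```

With nine distinct values the three classes receive three cars each; with tied prices near a cut point the class sizes can differ slightly.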
Naïve Bayes Classification assumes conditional independence between predictors within each class. Although this assumption rarely holds in real-world data, the simplicity and speed of Naïve Bayes make it a useful baseline classifier. In the confusion matrix printed for this model, the Low and High predictions were mostly correct, but the Medium predictions were the least reliable: 11 of the 28 cars predicted as Medium actually belonged to Low or High (positive predictive value 0.61), which is consistent with overlapping feature distributions in the middle of the price range.
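The independence assumption is easiest to see in code: each feature contributes its own log-density, and the contributions are simply added. Below is a minimal Gaussian Naïve Bayes sketch in Python. The Gaussian likelihood, the variance floor, the function names, and the data are all assumptions for illustration, not the report's model.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            # Small floor (assumed) keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one observation."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            # Independence assumption: per-feature log-densities are added.
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Made-up standardized features, NOT the car data.
X = [[-1.0, -0.8], [-0.9, -1.1], [0.0, 0.1], [0.1, -0.1], [1.0, 0.9], [1.2, 1.1]]
y = ["Low", "Low", "Medium", "Medium", "High", "High"]
model = fit_gaussian_nb(X, y)
guess = predict(model, [1.1, 1.0])
```

When two features are strongly correlated, as curbweight and enginesize are here, adding their log-densities counts the same evidence twice, which is one way the independence assumption can distort the predicted class.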
Multinomial Logistic Regression was employed to model the same classification task with greater flexibility. Unlike Naïve Bayes, it estimates its coefficients jointly rather than treating predictors as conditionally independent, and it delivers probability-based outputs. In the accuracy summary it reached 73.77% against 65.57% for Naïve Bayes, although the Naïve Bayes confusion matrix printed earlier implies 77.05%. Logistic regression's own errors fall only between adjacent classes: it never confused Low with High, and its weakest class was High, with a sensitivity of 0.62.