R Markdown

The purpose of this project is to analyze a dataset containing various car attributes and predict car prices using multiple machine learning models. This analysis is part of the final assignment for MSCI 4230 – Business Analytics in Practice. The dataset includes variables such as engine size, horsepower, fuel type, curb weight, and more, all of which may influence the price of a car. The primary goals of this project are to perform data cleaning and transformation, to visualize key relationships in the data, and to apply predictive models (regression and classification).

Several visualizations were created to better understand the structure and relationships in the data:

- Histograms helped visualize the distribution of features like horsepower, engine size, and price.

- Boxplots identified outliers and compared price distributions across variables like number of doors and car body type.

- Scatter plots revealed strong linear relationships between price and variables like horsepower, curb weight, and engine size.

- A correlation heatmap showed that curbweight and enginesize are highly correlated with price, confirming their importance in predictive modeling.

These visualizations provided both intuitive and statistical insights into which features are most likely to influence car pricing.

Visualization

This histogram displays the distribution of standardized car prices based on fuel type. It helps identify whether gas or diesel vehicles dominate certain price ranges. From this, you can infer which type is generally more affordable or if there’s more variation in one category.

This correlation matrix provides a comprehensive overview of how all numeric variables are interrelated. Strong positive or negative correlations (darker shades) suggest potential multicollinearity, useful for model selection and understanding variable influence on price.

These boxplots help visualize the spread, central tendency, and outliers for each numerical variable. They’re especially helpful for identifying variables with significant skew or extreme values, which can impact regression models.

This plot compares car prices between 2-door and 4-door vehicles. It reveals differences in median price and price variability across door types, offering insight into how car design impacts market value.

This set of scatter plots explores the relationship between car price and four key predictors: horsepower, curbweight, enginesize, and carwidth. Each plot includes a linear regression line to show the trend and the strength of association with price. These variables were chosen due to their logical connection to vehicle performance and value. The positive slopes across these plots suggest that price rises as each attribute increases, which makes them strong candidates for predictive modeling. Because each panel plots a single predictor against price, these plots do not show how the predictors relate to one another; the correlation heatmap is the better tool for judging multicollinearity.

This heatmap visualizes the pairwise Pearson correlation coefficients between all numeric variables in the dataset. Darker colors indicate stronger correlations — either positive (closer to +1) or negative (closer to –1). The numerical values displayed help quantify these relationships. This matrix is useful for detecting multicollinearity and for identifying which features may serve as strong predictors for car price. It also highlights potential redundancies in the dataset that should be considered when building regression models.
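The correlation matrix behind a heatmap like this can be sketched in a few lines of base R. This is a minimal illustration on the built-in `mtcars` data, which stands in for the car-price dataset; the 0.8 cutoff for flagging strongly correlated pairs is an arbitrary choice made here, not a value from the report.

```r
# Stand-in data: built-in mtcars, NOT the car-price data used in this report.
num_vars <- mtcars[, c("mpg", "hp", "wt", "disp")]

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(num_vars, method = "pearson")
print(round(cor_mat, 2))

# List each pair once (upper triangle) whose absolute correlation exceeds
# an illustrative threshold; such pairs are candidates for multicollinearity.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(flagged)
```

Pairs that appear in `flagged` are the ones worth a second look before they enter the same regression.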

This scatter plot emphasizes the strong linear relationship between a car’s curbweight and engine size. As vehicles become heavier, they typically require larger engines, a trend clearly supported by the upward regression line. The high correlation shown here aligns with automotive engineering principles and was also reflected in the overall correlation matrix. Both variables are strong predictors of car price, but their high mutual correlation means they carry overlapping information, which should be kept in mind when both enter the same regression model.

This scatter plot investigates the relationship between a car’s wheelbase (distance between front and rear axles) and its width. The positive linear trend suggests that vehicles with longer wheelbases also tend to be wider — a reflection of proportional design choices in car manufacturing. The regression line provides a clearer sense of this correlation, which, while not necessarily tied to price, can influence handling, stability, and vehicle class categorization.

This violin plot shows the density and spread of prices across different car body types. It’s useful for identifying which car body types (e.g., hatchback, sedan) tend to have greater price variability or higher medians.

This plot illustrates how engine size influences horsepower, typically a strong linear relationship. It’s useful for understanding performance scaling across the car dataset.

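This bar chart shows how many cars use each type of fuel. It is useful for understanding the dataset’s composition and identifying any imbalance between fuel types. The imbalance here is substantial: the encoded fueltype variable has a mean of 0.902, so about 90% of cars fall into a single fuel category.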

This faceted scatter plot splits the relationship between horsepower and price across different drivewheel configurations (FWD, RWD, etc.). It reveals how drivetrain affects performance-to-price patterns.

##     fueltype       aspiration       doornumber       carbody     
##  Min.   :0.000   Min.   :0.0000   Min.   :2.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :1.000   Median :0.0000   Median :4.000   Median :3.000  
##  Mean   :0.902   Mean   :0.1814   Mean   :3.118   Mean   :2.618  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :4.000   Max.   :4.000  
##    drivewheel    enginelocation      wheelbase         carlength       
##  Min.   :0.000   Min.   :0.00000   Min.   :-2.0264   Min.   :-2.68960  
##  1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:-0.7122   1st Qu.:-0.60714  
##  Median :1.000   Median :0.00000   Median :-0.2963   Median :-0.07584  
##  Mean   :1.328   Mean   :0.01471   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.:2.000   3rd Qu.:0.00000   3rd Qu.: 0.6020   3rd Qu.: 0.73842  
##  Max.   :2.000   Max.   :1.00000   Max.   : 3.6795   Max.   : 2.76591  
##     carwidth         carheight         curbweight        enginetype   
##  Min.   :-2.6252   Min.   :-2.4408   Min.   :-2.0624   Min.   :0.000  
##  1st Qu.:-0.8496   1st Qu.:-0.7151   1st Qu.:-0.7619   1st Qu.:3.000  
##  Median :-0.1954   Median : 0.1478   Median :-0.2725   Median :3.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   :3.015  
##  3rd Qu.: 0.4588   3rd Qu.: 0.7231   3rd Qu.: 0.7337   3rd Qu.:3.000  
##  Max.   : 2.9820   Max.   : 2.4900   Max.   : 2.9045   Max.   :6.000  
##  cylindernumber     enginesize        fuelsystem      boreratio      
##  Min.   : 2.000   Min.   :-1.5901   Min.   :0.000   Min.   :-2.9352  
##  1st Qu.: 4.000   1st Qu.:-0.7059   1st Qu.:1.000   1st Qu.:-0.6731  
##  Median : 4.000   Median :-0.1705   Median :5.000   Median :-0.0798  
##  Mean   : 4.382   Mean   : 0.0000   Mean   :3.265   Mean   : 0.0000  
##  3rd Qu.: 4.000   3rd Qu.: 0.3588   3rd Qu.:5.000   3rd Qu.: 0.9307  
##  Max.   :12.000   Max.   : 4.7859   Max.   :7.000   Max.   : 2.2564  
##      stroke        compressionratio    horsepower         peakrpm       
##  Min.   :-3.7805   Min.   :-0.7921   Min.   :-1.4265   Min.   :-2.0436  
##  1st Qu.:-0.4641   1st Qu.:-0.3956   1st Qu.:-0.8690   1st Qu.:-0.6788  
##  Median : 0.1099   Median :-0.2886   Median :-0.2355   Median : 0.1611  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4926   3rd Qu.:-0.1879   3rd Qu.: 0.2966   3rd Qu.: 0.7910  
##  Max.   : 2.9161   Max.   : 3.2364   Max.   : 4.6552   Max.   : 3.1007  
##     citympg          highwaympg          price        
##  Min.   :-1.8671   Min.   :-2.1428   Min.   :-1.0276  
##  1st Qu.:-0.9482   1st Qu.:-0.8323   1st Qu.:-0.6917  
##  Median :-0.1824   Median :-0.1042   Median :-0.3751  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7365   3rd Qu.: 0.4782   3rd Qu.: 0.4007  
##  Max.   : 3.6463   Max.   : 3.3904   Max.   : 4.0244
##         fueltype       aspiration       doornumber          carbody 
##        0.2980992        0.3862745        0.9954984        0.8601079 
##       drivewheel   enginelocation        wheelbase        carlength 
##        0.5570643        0.1206690        1.0024600        1.0024600 
##         carwidth        carheight       curbweight       enginetype 
##        1.0024600        1.0024600        1.0024600        1.0573596 
##   cylindernumber       enginesize       fuelsystem        boreratio 
##        1.0831819        1.0024600        2.0119176        1.0024600 
##           stroke compressionratio       horsepower          peakrpm 
##        1.0024600        1.0024600        1.0024600        1.0024600 
##          citympg       highwaympg            price 
##        1.0024600        1.0024600        1.0024600

The dataset was first cleaned by removing duplicate entries and standardizing categorical variables. All categorical features were encoded numerically, and numerical features were standardized (mean = 0, standard deviation = 1) to ensure a uniform scale for modeling techniques like KNN and regression.

The target variable, price, was kept as a continuous variable (for regression) and was also discretized into three categories (Low, Medium, High) using quantiles (for classification). The summary output above confirms the transformation: categorical variables are encoded numerically, and all continuous variables have been standardized. Standardization puts every variable on a common scale, which matters for distance-based models like KNN and makes regression coefficients easier to compare.
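The two transformations described above can be sketched as follows. The price values are hypothetical toy numbers invented for this example, and the variable names `price_z` and `price_cat` are assumptions; the report's actual code is not shown.

```r
# Hypothetical toy prices; the report's real data are not reproduced here.
price <- c(5200, 6800, 7400, 8900, 9600, 11200,
           13500, 16400, 21000, 35000, 41000, 45400)

# z-score standardization: subtract the mean, divide by the standard deviation.
price_z <- as.numeric(scale(price))

# Tertile cut points, then three ordered categories for classification.
breaks <- quantile(price_z, probs = c(0, 1/3, 2/3, 1))
price_cat <- cut(price_z, breaks = breaks,
                 labels = c("Low", "Medium", "High"),
                 include.lowest = TRUE)
print(table(price_cat))
```

Quantile-based breaks give classes of roughly equal size, which is why the class counts reported later in this document are nearly balanced.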

Now we’ll split the dataset into training (70%) and testing (30%) sets using caret.

## Training set size: 144
## Testing set size: 60

This chunk partitions the dataset into a training set (used to train models) and a testing set (used to evaluate performance). The partition is stratified on the price variable to maintain its distribution across both sets.
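A stratified split on a numeric outcome can be sketched in base R by binning the outcome into quantile groups and sampling within each group. The report uses caret for this step; the version below is a hand-written stand-in for illustration, using the built-in `mtcars` data with `mpg` playing the role of price, an arbitrary seed, and quartile groups chosen for this example.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Stand-in data: built-in mtcars, with mpg playing the role of price.
dat <- mtcars

# Bin the numeric outcome into quartile groups.
grp <- cut(dat$mpg,
           breaks = quantile(dat$mpg, probs = seq(0, 1, 0.25)),
           include.lowest = TRUE)

# Sample about 70% of the row indices within each group.
train_idx <- unlist(lapply(split(seq_len(nrow(dat)), grp), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.7 * length(rows)))]
}))

train_data <- dat[train_idx, ]
test_data  <- dat[-train_idx, ]
cat("Training set size:", nrow(train_data), "\n")
cat("Testing set size:", nrow(test_data), "\n")
```

Sampling within groups keeps low, middle, and high outcome values represented in both sets, which a purely random split does not guarantee on a small dataset.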

## 
## Call:
## lm(formula = price ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89425 -0.17516 -0.00015  0.19214  1.74160 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.961440   0.985284   0.976  0.33111   
## fueltype         -0.052488   1.007509  -0.052  0.95854   
## aspiration        0.026815   0.136456   0.197  0.84454   
## doornumber        0.001333   0.049086   0.027  0.97838   
## carbody          -0.075232   0.062342  -1.207  0.22988   
## drivewheel        0.166521   0.095624   1.741  0.08415 . 
## enginelocation    1.205474   0.447216   2.696  0.00803 **
## wheelbase         0.103041   0.097647   1.055  0.29342   
## carlength         0.038797   0.107078   0.362  0.71774   
## carwidth          0.153983   0.086664   1.777  0.07812 . 
## carheight         0.045349   0.056662   0.800  0.42509   
## curbweight        0.174836   0.133316   1.311  0.19219   
## enginetype        0.034483   0.036027   0.957  0.34040   
## cylindernumber   -0.217505   0.132240  -1.645  0.10261   
## enginesize        0.704072   0.209705   3.357  0.00105 **
## fuelsystem       -0.024251   0.024194  -1.002  0.31817   
## boreratio        -0.180392   0.086929  -2.075  0.04009 * 
## stroke           -0.142163   0.058857  -2.415  0.01721 * 
## compressionratio  0.055228   0.284070   0.194  0.84617   
## horsepower        0.196504   0.109278   1.798  0.07464 . 
## peakrpm           0.124027   0.051884   2.390  0.01837 * 
## citympg          -0.100653   0.188518  -0.534  0.59438   
## highwaympg        0.151090   0.168344   0.898  0.37123   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3819 on 121 degrees of freedom
## Multiple R-squared:  0.8721, Adjusted R-squared:  0.8489 
## F-statistic: 37.52 on 22 and 121 DF,  p-value: < 2.2e-16
## Linear Regression RMSE: 0.3756115

This linear model predicts standardized price using all other variables. The RMSE (Root Mean Squared Error) gives an idea of average prediction error. Multiple Linear Regression was used to predict the continuous price variable based on multiple independent features such as horsepower, curbweight, enginesize, and others. This model assumes a linear relationship between the predictors and the target variable. Standardized predictors were used to ensure comparability and reduce the impact of scale differences across features.

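The MLR model yielded an RMSE of 0.376. Because price is standardized, this means predictions typically miss by a little under 0.4 standard deviations of price. The model explains about 87% of the variance in the training prices (R-squared = 0.872, adjusted 0.849). The residuals are centered near zero (median of -0.00015), suggesting no systematic over- or under-prediction, although the largest positive residual (1.74) is about twice the size of the largest negative one (-0.89), so a few cars are under-predicted by a wide margin. Due to the presence of potentially non-linear relationships in the data (e.g., price vs. engine size), the model may have underfit the complexity in certain areas.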
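The fit-then-score workflow behind these numbers can be sketched as below. This is a minimal example on standardized columns of the built-in `mtcars` data, not the report's dataset; the seed, the column choice, and the 70/30 split are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed

# Stand-in data: standardized numeric columns from built-in mtcars.
dat <- as.data.frame(scale(mtcars[, c("mpg", "hp", "wt", "disp")]))
train_idx  <- sample.int(nrow(dat), size = round(0.7 * nrow(dat)))
train_data <- dat[train_idx, ]
test_data  <- dat[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(mpg ~ ., data = train_data)
pred <- predict(fit, newdata = test_data)

# RMSE: square root of the mean squared prediction error.
rmse <- sqrt(mean((test_data$mpg - pred)^2))
cat("Linear Regression RMSE:", rmse, "\n")
```

Because the outcome is standardized, the RMSE reads directly in standard deviations of the outcome, which is how the 0.376 figure above should be interpreted.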

## 
##    Low Medium   High 
##     67     67     70
## # weights:  72 (46 variable)
## initial  value 157.101557 
## iter  10 value 57.841887
## iter  20 value 36.476167
## iter  30 value 25.432954
## iter  40 value 24.364784
## iter  50 value 23.929575
## iter  60 value 23.011136
## iter  70 value 22.980240
## iter  80 value 22.903438
## iter  90 value 22.900879
## final  value 22.900873 
## converged
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low     19      7    0
##     Medium   1     13    8
##     High     0      0   13
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7377         
##                  95% CI : (0.6093, 0.842)
##     No Information Rate : 0.3443         
##     P-Value [Acc > NIR] : 4.175e-10      
##                                          
##                   Kappa : 0.6077         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity              0.9500        0.6500      0.6190
## Specificity              0.8293        0.7805      1.0000
## Pos Pred Value           0.7308        0.5909      1.0000
## Neg Pred Value           0.9714        0.8205      0.8333
## Prevalence               0.3279        0.3279      0.3443
## Detection Rate           0.3115        0.2131      0.2131
## Detection Prevalence     0.4262        0.3607      0.2131
## Balanced Accuracy        0.8896        0.7152      0.8095

Logistic regression is a classification model used here to predict price category (Low, Medium, High) based on the car’s features. The confusion matrix gives a clear view of how well the model performs in classifying price ranges.

To evaluate price from a categorical perspective, price was binned into three classes: Low, Medium, and High using quantile-based thresholds. Multinomial Logistic Regression was then used to model the probability of a car falling into one of these categories, based on the same set of predictors.

This model generalizes binary logistic regression to multiclass settings and estimates separate coefficients for each class relative to a baseline. The resulting probability estimates are used to assign the most likely class to each observation.

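One of the main advantages of logistic regression is that, unlike Naïve Bayes, it does not assume the predictors are conditionally independent given the class. Correlated features such as curbweight and enginesize are therefore modeled jointly, allowing for a more realistic representation of feature relationships.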
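The iteration log above is the kind of output printed by `multinom()` from the nnet package, so the sketch below uses that function. It runs on the built-in `iris` data with `Species` as a three-level outcome, standing in for the Low/Medium/High price category; the seed and split size are assumptions, and nnet is assumed to be installed (it is usually bundled with R).

```r
library(nnet)  # provides multinom()

set.seed(1)  # arbitrary seed
# Stand-in data: built-in iris, with Species as the three-level outcome.
train_idx  <- sample.int(nrow(iris), size = 105)
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# One set of coefficients per non-baseline class (baseline = first level).
fit <- multinom(Species ~ ., data = train_data, trace = FALSE)

# Class probabilities per row; the predicted class is the most probable one.
probs <- predict(fit, newdata = test_data, type = "probs")
pred  <- predict(fit, newdata = test_data, type = "class")

conf_mat <- table(Prediction = pred, Reference = test_data$Species)
print(conf_mat)
cat("Accuracy:", mean(pred == test_data$Species), "\n")
```

Each row of `probs` sums to 1, and `pred` is simply the column with the largest probability in that row.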

## KNN RMSE: 0.6300444

KNN is a distance-based method that predicts price by averaging the k-nearest training observations. We use RMSE to assess accuracy on the test set.
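The averaging rule described here is short enough to write out by hand. The function `knn_regress` below is a hypothetical helper written for this illustration, not the function used in the report; it uses Euclidean distance and the built-in `mtcars` data as a stand-in, with k = 5 chosen arbitrarily.

```r
# Predict a numeric outcome for each row of test_x as the mean outcome
# of its k nearest training rows (Euclidean distance).
knn_regress <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    # Distance from this test row to every training row.
    d <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

set.seed(7)  # arbitrary seed
# Stand-in data: standardized columns from built-in mtcars.
dat <- as.data.frame(scale(mtcars[, c("mpg", "hp", "wt", "disp")]))
train_idx <- sample.int(nrow(dat), size = round(0.7 * nrow(dat)))
x_cols <- c("hp", "wt", "disp")

pred <- knn_regress(dat[train_idx, x_cols], dat$mpg[train_idx],
                    dat[-train_idx, x_cols], k = 5)
rmse <- sqrt(mean((dat$mpg[-train_idx] - pred)^2))
cat("KNN RMSE:", rmse, "\n")
```

Since every feature contributes to the distance, standardizing first keeps large-scale variables from dominating the neighbor search.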

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low     14      2    0
##     Medium   6     17    5
##     High     0      1   16
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7705         
##                  95% CI : (0.645, 0.8685)
##     No Information Rate : 0.3443         
##     P-Value [Acc > NIR] : 1.234e-11      
##                                          
##                   Kappa : 0.6562         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity              0.7000        0.8500      0.7619
## Specificity              0.9512        0.7317      0.9750
## Pos Pred Value           0.8750        0.6071      0.9412
## Neg Pred Value           0.8667        0.9091      0.8864
## Prevalence               0.3279        0.3279      0.3443
## Detection Rate           0.2295        0.2787      0.2623
## Detection Prevalence     0.2623        0.4590      0.2787
## Balanced Accuracy        0.8256        0.7909      0.8685

Naïve Bayes classifies cars into price categories using probabilistic reasoning. The confusion matrix shows classification accuracy for Low, Medium, and High priced vehicles.
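To make the probabilistic reasoning concrete, here is a from-scratch Gaussian Naïve Bayes sketch. The report does not show which function fit its model, so the helpers `gnb_fit` and `gnb_predict` are hypothetical names written for this illustration, and the Gaussian likelihood is an assumption. The built-in `iris` data stands in for the car data.

```r
# Fit: class priors, plus per-class mean and standard deviation of each feature.
gnb_fit <- function(x, y) {
  x <- as.matrix(x)
  y <- factor(y)
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),
    mu     = sapply(levels(y), function(cl) colMeans(x[y == cl, , drop = FALSE])),
    sigma  = sapply(levels(y), function(cl) apply(x[y == cl, , drop = FALSE], 2, sd))
  )
}

# Predict: add the log prior to the sum of per-feature log densities
# (the conditional independence assumption), then take the best class.
gnb_predict <- function(model, newx) {
  newx <- as.matrix(newx)
  scores <- sapply(seq_along(model$levels), function(j) {
    log_dens <- matrix(
      dnorm(t(newx), mean = model$mu[, j], sd = model$sigma[, j], log = TRUE),
      nrow = ncol(newx)
    )
    log(model$prior[j]) + colSums(log_dens)
  })
  scores <- matrix(scores, nrow = nrow(newx))
  factor(model$levels[max.col(scores, ties.method = "first")],
         levels = model$levels)
}

set.seed(11)  # arbitrary seed
train_idx <- sample.int(nrow(iris), size = 105)
model <- gnb_fit(iris[train_idx, 1:4], iris$Species[train_idx])
pred  <- gnb_predict(model, iris[-train_idx, 1:4])
print(table(Prediction = pred, Reference = iris$Species[-train_idx]))
```

Summing the per-feature log densities is exactly where the independence assumption enters: each feature contributes separately, with no term for how features move together.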

## Logistic Regression Accuracy: 73.77 %
## Naïve Bayes Accuracy: 65.57 %

This section visually and numerically compares the performance of Logistic Regression and Naïve Bayes in classifying cars into three price categories: Low, Medium, and High.

The confusion matrix heatmap for the Logistic Regression model offers a clear visual representation of prediction accuracy across the categories. Darker shades on the diagonal indicate higher agreement between predicted and actual classes, while off-diagonal values highlight areas of misclassification (e.g., 7 Medium cars predicted as Low and 8 High cars predicted as Medium).

Additionally, a side-by-side accuracy comparison quantifies the performance of both models. This allows us to evaluate which algorithm provides more reliable predictions for this classification task. Depending on the structure of the data, one model may outperform the other due to its underlying assumptions.
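Accuracy and the per-class statistics can be recomputed directly from the counts in a confusion matrix, which is a quick consistency check on any summary line. The sketch below copies the counts from the two confusion matrices printed earlier in this report; the helper names (`accuracy`, `sensitivity`, `kappa_stat`) are assumptions introduced for this example.

```r
# Counts copied from the two confusion matrices printed in this report
# (rows = predicted class, columns = reference class).
classes <- c("Low", "Medium", "High")
dims <- list(Prediction = classes, Reference = classes)

# First matrix: shown with the logistic regression output.
logit_cm <- matrix(c(19,  7,  0,
                      1, 13,  8,
                      0,  0, 13), nrow = 3, byrow = TRUE, dimnames = dims)
# Second matrix: shown with the Naive Bayes description.
nb_cm <- matrix(c(14,  2,  0,
                   6, 17,  5,
                   0,  1, 16), nrow = 3, byrow = TRUE, dimnames = dims)

# Share of all cases on the diagonal.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Per class: correct predictions divided by the number of true members.
sensitivity <- function(cm) diag(cm) / colSums(cm)

# Cohen's kappa: agreement beyond what the marginals would give by chance.
kappa_stat <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2
  (po - pe) / (1 - pe)
}

cat("First matrix accuracy: ", round(accuracy(logit_cm), 4), "\n")
cat("Second matrix accuracy:", round(accuracy(nb_cm), 4), "\n")
print(round(rbind(first = sensitivity(logit_cm), second = sensitivity(nb_cm)), 4))
```

The results match the Accuracy, Sensitivity, and Kappa rows printed under each matrix above.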

Model comparison:

Regression Models: Multiple Linear Regression served as a baseline. It assumes a linear relationship between the predictors and the target variable. The model performed reasonably well, with an RMSE of 0.376 on the standardized price scale and an adjusted R-squared of 0.849. However, due to its simplicity, it may have underfit some of the non-linear relationships in the dataset.

K-Nearest Neighbors (KNN) regression offered a non-parametric approach by making predictions based on the closest training instances. While KNN can capture local patterns that linear regression misses, its performance is sensitive to the value of k. Here KNN’s RMSE (0.630) was well above that of linear regression (0.376), indicating clearly weaker performance in this case, likely due to the standardized but still high-dimensional feature space.

Classification Models: To allow for classification modeling, the continuous price variable was converted into three categories: Low, Medium, and High, based on quantile breaks.

Naïve Bayes Classification assumes conditional independence between predictors. Despite this assumption rarely holding true in real-world data, the model performed surprisingly well. The simplicity and speed of Naïve Bayes made it a strong baseline classifier. Its confusion matrix shows that it handled the “Low” and “High” categories decently but over-predicted the “Medium” class: 28 of 61 cars were labeled Medium, including 6 Low and 5 High cars, giving Medium the lowest positive predictive value (0.61). This is likely due to overlapping feature distributions.

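Multinomial Logistic Regression was employed to model the same classification task with greater flexibility. Unlike Naïve Bayes, logistic regression accounts for feature relationships and delivers probability-based outputs. The accuracy summary reports 73.77% for logistic regression against 65.57% for Naïve Bayes. However, the Naïve Bayes confusion matrix printed above corresponds to 77.05% accuracy (47 of 61 correct), so the two Naïve Bayes figures disagree and should be reconciled before the models are ranked. In the logistic regression matrix, “Medium” remains the hardest class (sensitivity 0.65), while every “High” prediction is correct (positive predictive value 1.00).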