MSCI 4320 - Individual Final Project Report

I. Executive Summary

In today’s data-driven environment, auction platforms play a crucial role in facilitating transactions across a wide variety of products — from artwork to collectibles and beyond. The growing volume of auction data offers an opportunity to derive actionable insights into market behavior, buyer interest, and product value using analytical techniques. This project explores such opportunities through the analysis of a large auction dataset consisting of 5,785 transactions.

The primary objective of this report is to conduct a thorough exploratory and predictive analysis to uncover patterns in auction dynamics and to forecast relevant outcomes such as sold price and item type classification. Key focus areas include understanding how factors like bidder count, item age, advertising media, and product category (e.g., painting vs. non-painting) influence auction success.

To achieve this, the report is structured into three core sections:

Exploratory Data Analysis (EDA): Cleaning the dataset, generating visualizations, and interpreting patterns and relationships across variables.
Predictive Modeling: Applying and comparing statistical and machine learning models (Linear Regression, K-Nearest Neighbors, and Naïve Bayes) to predict sold prices and classify auction items.
Business Insights and Reporting: Drawing conclusions from the analysis and translating findings into strategic recommendations for auction platforms and sellers.

II. Exploratory Data Analysis

i. Data cleaning

To prepare the dataset for analysis, a series of preprocessing steps were carried out to ensure accuracy and consistency across variables. The steps included data import, validation, cleaning, and basic structure reporting.

Cleaning Process:

Rows with invalid values in the Age column (i.e., negative ages) were filtered out. These values were likely due to entry errors and not realistic in the auction context.

The IsPainting column, which originally contained “Yes” or “No”, was converted to a binary numeric variable.

A check for missing values was performed, and the results confirmed that there were no missing values in any of the columns.
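The cleaning steps above can be sketched as follows. This is a minimal Python illustration rather than the R code used for the report: the helper names `clean_auction_rows` and `count_missing` are hypothetical, the first two sample rows reuse values from the data preview, and the third row is an invented invalid record.

```python
# Minimal sketch of the cleaning steps; helper names and the third sample row
# are hypothetical, only the column names come from the report.

def clean_auction_rows(rows):
    """Drop rows with negative Age and recode IsPainting from Yes/No to 1/0."""
    cleaned = []
    for row in rows:
        if row["Age"] < 0:  # negative ages are treated as entry errors
            continue
        recoded = dict(row)
        recoded["IsPainting"] = 1 if row["IsPainting"] == "Yes" else 0
        cleaned.append(recoded)
    return cleaned


def count_missing(rows):
    """Count None values per column, mirroring a missing-value check."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


raw_rows = [
    {"Soldprice": 4379, "Age": 202, "Bidders": 5356, "IsPainting": "No"},
    {"Soldprice": 2943, "Age": 200, "Bidders": 2244, "IsPainting": "Yes"},
    {"Soldprice": 3100, "Age": -5, "Bidders": 2500, "IsPainting": "Yes"},  # invented invalid row
]

cleaned_rows = clean_auction_rows(raw_rows)
missing_counts = count_missing(cleaned_rows)
print(len(cleaned_rows), missing_counts)
```

Filtering before recoding keeps the invalid record out of every later summary, which matches the order of the steps described above.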

Dimensions of the Cleaned Auction Dataset
Metric	Value
Number of Rows	5676
Number of Columns	8

This confirms that the cleaned dataset contains 5,676 complete rows and 8 variables, 109 fewer rows than the 5,785 transactions in the original data.

Summary statistics of cleaned auction data
Soldprice	FirstDayQueries	Age	Bidders	SecondDayQueries	Location	Media	IsPainting
Min. : 1050	Min. : 997	Min. : 0.00	Min. : 821	Min. : 504	Min. :1.000	Min. :0.0000	Min. :0.0000
1st Qu.: 3236	1st Qu.: 4767	1st Qu.: 46.75	1st Qu.:2341	1st Qu.:1306	1st Qu.:1.000	1st Qu.:0.0000	1st Qu.:0.0000
Median : 3737	Median : 5658	Median : 65.00	Median :2686	Median :1538	Median :1.000	Median :1.0000	Median :1.0000
Mean : 3893	Mean : 6272	Mean : 64.56	Mean :2901	Mean :1641	Mean :1.016	Mean :0.7364	Mean :0.6929
3rd Qu.: 4332	3rd Qu.: 7004	3rd Qu.: 80.00	3rd Qu.:3216	3rd Qu.:1856	3rd Qu.:1.000	3rd Qu.:1.0000	3rd Qu.:1.0000
Max. :12178	Max. :46411	Max. :2000.00	Max. :8154	Max. :5289	Max. :2.000	Max. :4.0000	Max. :1.0000

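A statistical summary was produced for all variables, showing the range, mean, median, and quartiles of each feature. Note that Age still has a maximum of 2,000 years, far above its third quartile of 80, so at least one implausible value remains after the negative-age filter.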

First 6 rows of the cleaned auction dataset
Soldprice	FirstDayQueries	Age	Bidders	SecondDayQueries	Location	Media	IsPainting
5692	7000	2000	4732	2641	1	1	0
4379	8861	202	5356	2953	1	0	0
2943	6119	200	2244	1154	1	1	1
4741	9030	196	3541	2060	1	0	0
3832	5000	190	2852	1834	1	1	1
4174	5000	190	3116	1990	1	0	1

A snapshot of the first six rows is displayed to give a concrete view of the data structure and types after cleaning.

ii. Visualization

Interpretation:

The histogram shows that most auction items are sold within the 3,000–5,000 price range, with the peak in the 3,000–4,000 bracket (2,749 items).

Sales drop sharply at higher prices. Notably, the 10,000–11,000 range has 0 items, and only 1 item appears in the 12,000–13,000 range.

Key Takeaways:

3,000–5,000 is the market’s sweet spot.

High-priced items are rare and require targeted strategies.

Setting price expectations within this range aligns with where most sales actually occur.
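The bracket counts quoted above come from tallying sold prices into 1,000-wide bins. A minimal Python sketch of that tally is shown here. It assumes half-open brackets (lower bound included, upper bound excluded), which may differ from the binning used in the report's histogram. The sample reuses the six sold prices from the data preview plus the reported minimum and maximum, and is not the full dataset.

```python
from collections import Counter


def bracket_counts(prices, width=1000):
    """Tally prices into half-open brackets [lower, lower + width)."""
    counts = Counter((int(p) // width) * width for p in prices)
    return dict(sorted(counts.items()))


# Sold prices from the first rows shown in the report, plus the reported
# minimum (1050) and maximum (12178). Not the full dataset.
sample_prices = [5692, 4379, 2943, 4741, 3832, 4174, 1050, 12178]

counts = bracket_counts(sample_prices)
for lower, n in counts.items():
    print(f"{lower:>6}-{lower + 1000:<6} {n}")
```

Empty brackets simply do not appear in the result, which is how a range such as 10,000–11,000 ends up with a count of zero.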

Interpretation:

The histogram illustrates the distribution of auction item ages (filtered to show items aged under 200 years).

Key Observations:

The distribution is bell-shaped and slightly right-skewed.

Most items fall between 40 and 90 years old, with peaks around 50–60 and 70–80.

Very few items are older than 120 years, and almost none beyond 150.

Insights:

Mid-century items (40–90 years old) are the most common, indicating strong supply or demand in that age range.

Newer items (<30 years) are rare, possibly due to auction curation focusing on vintage or collectible products.

Very old items (>120 years) are rare and may be more valuable, niche, or harder to source.

Interpretation:

The histogram shows that most auction items attract between 2,000 and 3,500 bidders, with a sharp peak around 2,500–2,800. The distribution is right-skewed: a few listings receive very high bidder counts (up to about 8,150), but these are outliers.

Key Takeaways:

Typical auctions attract roughly 2,300–3,200 bidders (the interquartile range in the summary statistics).

High bidder counts are rare but suggest strong item appeal or effective marketing.

This distribution can help define benchmark engagement levels for evaluating auction performance.

Interpretation:

This scatter plot compares the number of queries items receive on the first day versus the second day, with points colored by whether the item is a painting (Yes = yellow) or not (No = red).

Key Observations:

There is a positive correlation between first-day and second-day queries (0.43 in the correlation heatmap): items with more interest on Day 1 tend to keep attracting queries on Day 2.

Paintings (yellow dots) are concentrated in the lower-left, indicating modest but consistent attention.

Non-paintings (red dots) are more widely spread, including some that attracted very high query volumes, suggesting they are more likely to go viral or trend.

Insights:

Non-painting items have a broader range of visibility, potentially due to varied media exposure or appeal.

Paintings tend to receive steady but moderate attention, possibly from a more niche or loyal buyer base.

Interpretation:

This scatter plot compares Day 1 and Day 2 user interest for auction items, segmented by location type: Small City (pink) vs. Big City (black).

Key Observations:

The majority of listings are from Small Cities, forming a dense cluster in the range of 2,000–10,000 first-day queries and 1,000–3,500 second-day queries.

Listings from Big Cities, although fewer in number, are more dispersed and include higher query counts, particularly on the second day.

Both locations show a positive correlation between Day 1 and Day 2 engagement.

Insights:

Small cities dominate in volume, but big city items may attract stronger attention per listing.

Urban-based auctions could benefit from enhanced exposure or targeted campaigns due to their viral potential.

Interpretation:

This scatter plot visualizes the relationship between the number of bidders and the sold price, colored by whether the item is a painting (orange) or non-painting (blue).

Key Observations:

There is a strong positive correlation: items with more bidders tend to have higher sold prices.

Non-paintings (blue) dominate the upper price ranges, showing greater variability and achieving higher price ceilings.

Paintings (orange) tend to cluster in the mid-range for both bidder count and price, suggesting more consistent but lower auction outcomes.

Insights:

Non-painting items may benefit more from competitive bidding, potentially due to broader appeal or higher perceived value.

Paintings, while more stable in performance, may need targeted promotion to reach top price tiers.

The number of bidders is a strong predictor of final auction price, especially for non-paintings.

Interpretation:

This scatter plot shows how advertising exposure (Media count, shown here as 1 to 4; the summary statistics list values from 0 to 4) relates to the relationship between First Day Queries and Sold Price.

Key Observations:

Most items used only 1 advertisement channel (grey), and their data points cluster in the lower-to-mid range of both queries and prices.

Items with 2–4 advertisements (green, red, purple) tend to appear higher in sold price and first day visibility.

A few of the highest-priced items were promoted using 3 or 4 media channels, indicating a possible uplift from multi-channel exposure.

Insights:

More media coverage is associated with higher visibility and selling price.

Using multiple advertisement platforms (2+) may be a valuable strategy for sellers aiming to increase item visibility and auction success.

Interpretation:

Figure 1: Sold Price by Painting Status
Non-paintings not only have higher median prices but also a greater spread of outliers, with some above 10,000, showing potential for high-value sales.
Paintings show fewer and lower outliers, indicating a tighter value range.

Figure 2: Sold Price by Location
Small and Big Cities have similar median prices, and high-price outliers appear in both, suggesting location may not be a key driver of high-price auctions.

Figure 3: Sold Price by Media
There is a clear upward trend in median sold price as media exposure increases. Items with 3–4 advertisements show more extreme outliers, reinforcing the importance of advertising.

Figure 4: Number of Bidders by Painting Status
Non-paintings attract more bidders and a wider spread of extreme values, some exceeding 8,000.
Paintings are more consistent with fewer outliers.

Figure 5: First Day Queries by Painting Status
Non-paintings receive noticeably more first-day queries, with outliers reaching above 40,000.
Paintings maintain a stable pattern with limited spikes.

Figure 6: Second Day Queries by Painting Status
The trend continues: non-paintings dominate in volume and variability, again with a high number of outliers.
Paintings remain more consistent and niche.

Summary Insight:
Outliers are more prevalent among non-paintings, especially with higher media counts.
More advertising not only increases average performance but also the probability of exceptional results.
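Boxplots conventionally flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. Assuming the figures above use that default rule, the overall fences can be sketched in Python from the quartiles in the summary statistics. The per-group fences in each figure will differ, because quartiles are computed within each group.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Overall quartiles taken from the summary statistics table.
sold_price_fences = tukey_fences(3236, 4332)
bidder_fences = tukey_fences(2341, 3216)

print("Soldprice fences:", sold_price_fences)
print("Bidders fences:  ", bidder_fences)
```

Under this assumption, any sold price above the upper Soldprice fence would be drawn as an individual point rather than inside the whisker.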

Interpretation:

This heatmap displays Pearson correlation coefficients between key numeric variables in the auction dataset. Darker red tiles represent stronger positive relationships, while lighter tiles indicate weaker or minimal correlation.

Key Findings:

Soldprice is most strongly correlated with:
- SecondDayQueries (0.83): items with more queries on Day 2 tend to sell for more.
- Bidders (0.79): more bidders are associated with higher final prices.
- FirstDayQueries (0.56): a moderate correlation, suggesting early interest also relates to better pricing.

Bidders and SecondDayQueries have a very strong correlation (0.90), indicating that higher visibility on Day 2 goes hand in hand with more bidder engagement.

Age shows only weak correlation with Soldprice (0.18), Bidders (0.24), and SecondDayQueries (0.20), meaning item age has limited association with auction outcomes.

FirstDayQueries is moderately correlated with SecondDayQueries (0.43) and Bidders (0.45), suggesting early attention can carry through the auction duration.

Insight:

Bidders and queries are the strongest predictors of sold price.

Age is not a strong influencing factor.

Focused promotion and visibility (especially generating queries) are essential for auction success.
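Each tile in the heatmap is a Pearson correlation coefficient. As a minimal Python sketch of how one such coefficient is computed, the example below uses only the six rows from the data preview, so its values will not match the full-sample coefficients reported above. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First six rows from the data preview (not the full dataset).
bidders = [4732, 5356, 2244, 3541, 2852, 3116]
second_day_queries = [2641, 2953, 1154, 2060, 1834, 1990]
sold_price = [5692, 4379, 2943, 4741, 3832, 4174]

r_bidders_queries = pearson_r(bidders, second_day_queries)
r_bidders_price = pearson_r(bidders, sold_price)
print(round(r_bidders_queries, 3), round(r_bidders_price, 3))
```

A coefficient near 1 means the two variables rise together almost linearly. It does not by itself show that one causes the other.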

iii. Data Normalization + Data Partition

First 6 rows of the standardized, balanced dataset
Soldprice	FirstDayQueries	Age	Bidders	SecondDayQueries	Location	Media	IsPainting	Set
1.0101427	-0.4653947	0.9082779	1.2937754	1.0955991	1	0.4786239	1	Train
-0.8174855	-0.0488422	-0.3734437	-0.9766221	-1.0855608	1	-1.3128357	1	Train
-0.0634439	0.6313791	-0.6041536	1.2645026	2.6294965	1	2.2700835	1	Train
0.0145968	-0.2549027	0.9082779	0.4998973	0.4942960	1	0.4786239	1	Train
-0.0919182	-0.4192342	0.0623417	-0.7412689	-0.4344683	1	0.4786239	1	Train
-0.7900659	-0.4764732	0.2674171	-0.2904680	-0.9017229	1	-1.3128357	1	Train

Interpretation:

To prepare the dataset for modeling, we performed data normalization, partitioning, and class rebalancing using the following steps:

The cleaned dataset was first split into training (70%) and validation (30%) sets using stratified sampling based on the IsPainting variable. This ensures that both sets reflect the original class proportions.
The numeric predictors were then standardized (i.e., mean-centered and scaled to unit variance) using the caret::preProcess() function. Key considerations include:
- The scaling parameters (means and standard deviations) were estimated from the training set only, to prevent data leakage.
- The same transformation, using those training parameters, was then applied to the validation set.
The following numeric columns were normalized:
- Soldprice, FirstDayQueries, Age, Bidders, SecondDayQueries, and Media.
The training set was rebalanced using the ROSE technique, creating a synthetic 50:50 class distribution for the IsPainting variable to reduce prediction bias.
A new column, Set, was added to indicate whether each observation belongs to the Train or Validation subset.
The two sets were then combined and exported as standardized_balanced_partitioned_auction_data.xlsx for use in classification model training and evaluation.

This process improves class balance during model learning, ensures consistent feature scaling, and preserves generalization to unseen data.
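The split-then-scale order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the caret code used for the report: the function names are hypothetical, the data is synthetic, and the ROSE rebalancing step is not reproduced.

```python
import random
import statistics


def stratified_split(rows, label_key, train_fraction=0.7, seed=1):
    """Split rows so each class contributes the same fraction to training."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, validation = [], []
    for members in by_class.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return train, validation


def fit_scaler(rows, columns):
    """Learn mean and standard deviation per column (training rows only)."""
    params = {}
    for column in columns:
        values = [row[column] for row in rows]
        params[column] = (statistics.mean(values), statistics.stdev(values))
    return params


def apply_scaler(rows, params):
    """Z-score the given rows using previously learned parameters."""
    scaled = []
    for row in rows:
        new_row = dict(row)
        for column, (mean, sd) in params.items():
            new_row[column] = (row[column] - mean) / sd
        scaled.append(new_row)
    return scaled


# Synthetic stand-in data: 70 paintings and 30 non-paintings.
rng = random.Random(0)
rows = [{"Bidders": rng.uniform(800, 8200), "IsPainting": int(i < 70)} for i in range(100)]

train, validation = stratified_split(rows, "IsPainting")
scaler = fit_scaler(train, ["Bidders"])           # parameters come from training only
train_scaled = apply_scaler(train, scaler)
validation_scaled = apply_scaler(validation, scaler)  # reuses training parameters
print(len(train_scaled), len(validation_scaled))
```

Because the validation rows are scaled with training means and standard deviations, their scaled mean will generally not be exactly zero. That is expected and is what keeps information from the validation set out of model fitting.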

Number of Rows for IsPainting = 1 and 0 by Set
Set	IsPainting	Count
Train	0	1976
Train	1	1999
Validation	0	522
Validation	1	1179

Interpretation: Class Distribution Summary

The training set was rebalanced using ROSE, giving nearly equal numbers of paintings (1,999) and non-paintings (1,976) to reduce model bias.
The validation set retains the original class distribution (1,179 paintings and 522 non-paintings, about 69% paintings), providing a realistic benchmark for evaluating model performance.
This setup improves fairness during training while ensuring reliable evaluation on unseen data.

iv. Predictive Analysis

1. KNN Classifier

Choose K

## k-Nearest Neighbors 
## 
## 3975 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3577, 3577, 3578, 3577, 3577, 3578, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7024703  0.4050831
##    7  0.7007130  0.4014401
##    9  0.6849516  0.3698498
##   11  0.6836010  0.3670181
##   13  0.6802429  0.3602197
##   15  0.6757175  0.3511261
##   17  0.6711046  0.3418486
##   19  0.6732016  0.3459793
##   21  0.6756384  0.3508369
##   23  0.6740442  0.3476205
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

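The accuracy plot shows the highest cross-validated accuracy at k = 5 (0.702), after which accuracy generally declines. Because 5 is also the smallest value in the tuning grid, smaller values of k were not evaluated.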
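The idea behind the tuning table, choosing k by cross-validated accuracy, can be sketched in a few lines of Python. This is a simplified illustration on synthetic two-class data: it uses plain 5-fold cross-validation instead of the repeated 10-fold scheme in the output above, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=5):
    """Plain k-fold cross-validated accuracy (no repeats)."""
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(xs)) if i % folds == fold]
        train_idx = [i for i in range(len(xs)) if i % folds != fold]
        fold_x = [xs[i] for i in train_idx]
        fold_y = [ys[i] for i in train_idx]
        correct += sum(knn_predict(fold_x, fold_y, xs[i], k) == ys[i] for i in test_idx)
    return correct / len(xs)


# Synthetic data: two Gaussian clusters standing in for the two classes.
rng = random.Random(0)
points = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(60)]
points += [([rng.gauss(3, 1), rng.gauss(3, 1)], 1) for _ in range(60)]
rng.shuffle(points)
xs = [features for features, _ in points]
ys = [label for _, label in points]

k_grid = range(5, 25, 2)  # same odd values of k as in the tuning table
results = {k: cv_accuracy(xs, ys, k) for k in k_grid}
best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 3))
```

Odd values of k avoid tied votes in a two-class problem, which is one reason tuning grids like the one above step by two.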

KNN Classifier Result

KNN Classifier Result (Sample of Validation Set)
Soldprice	FirstDayQueries	Age	Bidders	SecondDayQueries	Location	Media	IsPainting	Set	Prediction
0.8973001	1.0117418	3.369184	0.7563270	0.8083524	1	-1.3128357	0	Validation	0
0.1316578	0.7547200	2.959033	0.2540059	0.3353529	1	-1.3128357	1	Validation	0
-0.4093539	-1.1153347	2.574516	-0.4824974	-0.6566057	1	-1.3128357	1	Validation	0
4.2709773	2.1827417	2.446344	3.8650971	2.7252454	2	2.2700835	0	Validation	0
-0.7510455	-0.1057120	2.190000	0.3980280	0.3142881	1	-1.3128357	1	Validation	1
-0.8670519	0.5519830	1.933655	-1.0035531	-1.1142855	1	-1.3128357	1	Validation	0
-0.4336098	-0.0373944	1.933655	-0.0281838	0.0098066	1	-1.3128357	0	Validation	0
-0.3070574	0.5560452	1.933655	0.2223913	0.3257780	1	-1.3128357	0	Validation	1
1.0850195	0.9821990	1.805483	0.2540059	0.4483366	1	-1.3128357	0	Validation	0
1.4172197	1.8367225	1.805483	1.8862566	2.2560758	1	0.4786239	0	Validation	0

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 313 435
##          1 209 744
##                                           
##                Accuracy : 0.6214          
##                  95% CI : (0.5979, 0.6445)
##     No Information Rate : 0.6931          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2058          
##                                           
##                                           
##             Sensitivity : 0.6310          
##             Specificity : 0.5996          
##          Pos Pred Value : 0.7807          
##          Neg Pred Value : 0.4184          
##              Prevalence : 0.6931          
##          Detection Rate : 0.4374          
##    Detection Prevalence : 0.5603          
##       Balanced Accuracy : 0.6153          
##                                           
##        'Positive' Class : 1               
##

Interpretation:

The KNN model's performance on the validation set is modest. Sensitivity (63%) and specificity (60%) are roughly balanced, but overall accuracy (62.1%) falls below the no-information rate of 69.3%, so the model does not outperform simply predicting every item to be a painting. Precision for paintings is 78%, only moderately above the 69% share of paintings in the validation set.

The negative predictive value is low (42%): most items predicted as non-paintings are in fact paintings. Cross-validated accuracy on the balanced training data was about 70%, noticeably higher than the 62% achieved on validation, and the Kappa of about 0.21 indicates only slight-to-fair agreement beyond chance. The model captures some signal but should be used with caution.
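The statistics in the confusion matrix output can be recomputed from its four cell counts. The Python sketch below does this for the KNN matrix, treating class 1 (painting) as the positive class as in the output. The function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive common classification statistics from the four cell counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall for the positive class
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "accuracy": accuracy,
        "no_information_rate": max(tp + fn, tn + fp) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Cell counts from the KNN confusion matrix (positive class = 1).
knn = confusion_metrics(tp=744, fp=209, fn=435, tn=313)
for name, value in knn.items():
    print(f"{name:>20}: {value:.4f}")
```

Comparing `accuracy` with `no_information_rate` is the quickest check of whether a classifier beats the strategy of always predicting the majority class.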

2. Naive Bayes Classifier

Subset Summary

Classification Result on Auction Validation Set
Soldprice	FirstDayQueries	Age	Bidders	SecondDayQueries	Location	Media	IsPainting	Set	X0	X1
0.8973001	1.0117418	3.369184	0.7563270	0.8083524	1	-1.3128357	0	Validation	0.9985696	0.0014304
0.1316578	0.7547200	2.959033	0.2540059	0.3353529	1	-1.3128357	1	Validation	0.9684050	0.0315950
-0.4093539	-1.1153347	2.574516	-0.4824974	-0.6566057	1	-1.3128357	1	Validation	0.7136155	0.2863845
4.2709773	2.1827417	2.446344	3.8650971	2.7252454	2	2.2700835	0	Validation	1.0000000	0.0000000
-0.7510455	-0.1057120	2.190000	0.3980280	0.3142881	1	-1.3128357	1	Validation	0.7946413	0.2053587
-0.8670519	0.5519830	1.933655	-1.0035531	-1.1142855	1	-1.3128357	1	Validation	0.5478309	0.4521691
-0.4336098	-0.0373944	1.933655	-0.0281838	0.0098066	1	-1.3128357	0	Validation	0.5002153	0.4997847
-0.3070574	0.5560452	1.933655	0.2223913	0.3257780	1	-1.3128357	0	Validation	0.7432732	0.2567268
1.0850195	0.9821990	1.805483	0.2540059	0.4483366	1	-1.3128357	0	Validation	0.8308143	0.1691857
1.4172197	1.8367225	1.805483	1.8862566	2.2560758	1	0.4786239	0	Validation	0.9999984	0.0000016

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  237  154
##          1  285 1025
##                                           
##                Accuracy : 0.7419          
##                  95% CI : (0.7204, 0.7626)
##     No Information Rate : 0.6931          
##     P-Value [Acc > NIR] : 5.311e-06       
##                                           
##                   Kappa : 0.3477          
##                                           
##  Mcnemar's Test P-Value : 5.485e-10       
##                                           
##             Sensitivity : 0.8694          
##             Specificity : 0.4540          
##          Pos Pred Value : 0.7824          
##          Neg Pred Value : 0.6061          
##              Prevalence : 0.6931          
##          Detection Rate : 0.6026          
##    Detection Prevalence : 0.7701          
##       Balanced Accuracy : 0.6617          
##                                           
##        'Positive' Class : 1               
##

Interpretation: Naive Bayes Classifier on Auction Dataset

The Naive Bayes model was trained using five standardized predictors:
FirstDayQueries, Age, Bidders, SecondDayQueries, and Media.
It was evaluated on the validation set, producing predicted class probabilities and a confusion matrix.

Key Performance Highlights:
- Accuracy: 74.2%, a modest but statistically significant improvement over the no-information rate of 69.3% (p = 5.3e-06).
- Precision (PPV): about 78%. When the model predicts a painting, it is usually correct.
- Recall (Sensitivity): about 87%. Most true paintings are correctly identified.
- Specificity: about 45%. More than half of the non-paintings are misclassified as paintings.
- Kappa: about 0.35, indicating fair agreement beyond chance.
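The X0 and X1 columns in the classification table are posterior class probabilities. The Python sketch below shows how a naive Bayes model produces them, assuming Gaussian likelihoods for the numeric predictors. The report does not state which implementation was used, so this is an assumption, and the toy data and function names are hypothetical.

```python
import math
import statistics


def fit_gaussian_nb(xs, ys):
    """Per class: prior probability plus (mean, sd) for each feature."""
    model = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        stats = [(statistics.mean(col), statistics.stdev(col)) for col in zip(*members)]
        model[label] = (len(members) / len(xs), stats)
    return model


def normal_log_pdf(value, mean, sd):
    """Log density of a normal distribution at value."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (value - mean) ** 2 / (2 * sd ** 2)


def posterior(model, x):
    """Posterior class probabilities, treating features as independent given the class."""
    log_scores = {
        label: math.log(prior) + sum(normal_log_pdf(v, m, s) for v, (m, s) in zip(x, stats))
        for label, (prior, stats) in model.items()
    }
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Hypothetical standardized values for two predictors; labels: 1 = painting.
xs = [[-1.0, -0.8], [-0.6, -1.1], [-0.9, -0.4], [-0.3, -0.7],
      [1.2, 0.9], [0.8, 1.4], [1.5, 0.6], [0.9, 1.1]]
ys = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_gaussian_nb(xs, ys)
probs = posterior(model, [1.0, 1.0])
print({label: round(p, 4) for label, p in probs.items()})
```

The two probabilities always sum to one, and the predicted class is the one with the larger posterior, which is how rows with X0 close to 0.5 in the table end up as borderline cases.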

3. Linear Regression Model

##                Metric       Value
## 1         Residual DF 3969.000000
## 2                  R2    0.774458
## 3         Adjusted R2    0.774174
## 4 Std. Error Estimate    0.546476
## 5                 RSS 1185.284454

##                         Predictor  Criteria Included
## (Intercept)           (Intercept)  6.317575     TRUE
## FirstDayQueries   FirstDayQueries 25.923142     TRUE
## Age                           Age  2.762722     TRUE
## Bidders                   Bidders  7.255605     TRUE
## SecondDayQueries SecondDayQueries 30.369581     TRUE
## Media                       Media 14.855057     TRUE

## 
## Tolerance for entering the model: 0.15655

##                         Predictor Estimate
## (Intercept)           (Intercept) 0.055409
## FirstDayQueries   FirstDayQueries 0.233596
## Age                           Age 0.040562
## Bidders                   Bidders 0.138306
## SecondDayQueries SecondDayQueries 0.570408
## Media                       Media 0.136147

Interpretation:

The linear regression model fits the training data well, with an R² of 0.774 (adjusted R² 0.774), meaning the predictors explain about 77% of the variation in Soldprice.

All five predictors were retained in the model. Because the predictors are standardized, their coefficient estimates are directly comparable: SecondDayQueries has the largest positive coefficient (0.570), followed by FirstDayQueries (0.234), Bidders (0.138), Media (0.136), and Age (0.041). This suggests that items that gain more attention on the second day tend to sell for more.

Overall, the model is interpretable and performs well. Because Bidders and SecondDayQueries are highly correlated (0.90 in the correlation heatmap), their individual coefficients should be read with some caution.
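The summary figures in the regression output are linked by simple formulas, and the coefficient table can be applied directly to a row of standardized predictors. The Python sketch below checks both. It assumes the model was fitted on the standardized training data, so the prediction is in standardized Soldprice units. Converting back to the original price scale would need the training mean and standard deviation, which the report does not list.

```python
import math

# Values copied from the regression output.
r_squared = 0.774458
residual_df = 3969
rss = 1185.284454
n_predictors = 5
n_obs = residual_df + n_predictors + 1  # 3975 training rows

adjusted_r2 = 1 - (1 - r_squared) * (n_obs - 1) / residual_df
residual_se = math.sqrt(rss / residual_df)

coefficients = {
    "(Intercept)": 0.055409,
    "FirstDayQueries": 0.233596,
    "Age": 0.040562,
    "Bidders": 0.138306,
    "SecondDayQueries": 0.570408,
    "Media": 0.136147,
}


def predict(row):
    """Linear prediction: intercept plus coefficient times standardized value."""
    return coefficients["(Intercept)"] + sum(
        coefficients[name] * value for name, value in row.items()
    )


# First row of the standardized dataset shown earlier (predictors only).
first_row = {
    "FirstDayQueries": -0.4653947,
    "Age": 0.9082779,
    "Bidders": 1.2937754,
    "SecondDayQueries": 1.0955991,
    "Media": 0.4786239,
}
predicted = predict(first_row)
print(round(adjusted_r2, 6), round(residual_se, 6), round(predicted, 4))
```

For comparison, the standardized Soldprice recorded for that first row is 1.0101, so the gap between it and the prediction is one example of the residuals summarized by the standard error.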

Quynh Nguyen Thi Nhu

Apr 16, 2025
