In today’s data-driven environment, auction platforms play a crucial role in facilitating transactions across a wide variety of products — from artwork to collectibles and beyond. The growing volume of auction data offers an opportunity to derive actionable insights into market behavior, buyer interest, and product value using analytical techniques. This project explores such opportunities through the analysis of a large auction dataset consisting of 5,785 transactions.
The primary objective of this report is to conduct a thorough exploratory and predictive analysis to uncover patterns in auction dynamics and to forecast relevant outcomes such as sold price and item type classification. Key focus areas include understanding how factors like bidder count, item age, advertising media, and product category (e.g., painting vs. non-painting) influence auction success.
To achieve this, the report is structured into three core sections:
Exploratory Data Analysis (EDA): Cleaning the dataset, generating visualizations, and interpreting patterns and relationships across variables.
Predictive Modeling: Applying and comparing statistical and machine learning models (Linear Regression, K-Nearest Neighbors, and Naïve Bayes) to predict sold prices and classify auction items.
Business Insights and Reporting: Drawing conclusions from the analysis and translating findings into strategic recommendations for auction platforms and sellers.
To prepare the dataset for analysis, a series of preprocessing steps was carried out to ensure accuracy and consistency across variables. The steps included data import, validation, cleaning, and basic structure reporting.
Cleaning Process:
Rows with invalid values in the Age column (i.e., negative ages) were filtered out. These values were likely due to entry errors and not realistic in the auction context.
The IsPainting column, which originally contained “Yes” or “No”, was converted to a binary numeric variable.
A check for missing values was performed, and the results confirmed that there were no missing values in any of the columns.
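The cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the code used to produce the report; the rows are hypothetical, and only the column names (Soldprice, Age, IsPainting) come from the dataset.

```python
# Hypothetical raw rows; column names follow the report, values are invented.
raw_rows = [
    {"Soldprice": 3500, "Age": 65, "IsPainting": "Yes"},
    {"Soldprice": 4100, "Age": -3, "IsPainting": "No"},   # invalid (negative) age
    {"Soldprice": 2900, "Age": 48, "IsPainting": "No"},
]

def clean(rows):
    """Drop rows with a negative Age and recode IsPainting as 1/0."""
    cleaned = []
    for row in rows:
        if row["Age"] < 0:
            continue  # treated as a data-entry error
        recoded = dict(row)
        recoded["IsPainting"] = 1 if row["IsPainting"] == "Yes" else 0
        cleaned.append(recoded)
    return cleaned

def count_missing(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts

cleaned_rows = clean(raw_rows)
missing = count_missing(cleaned_rows)
print(len(cleaned_rows), missing)
```

Filtering before recoding keeps the missing-value check focused on the rows that actually enter the analysis.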
| Metric | Value |
|---|---|
| Number of Rows | 5676 |
| Number of Columns | 8 |
This confirms that the cleaned dataset contains 5,676 complete rows and 8 variables.
| Soldprice | FirstDayQueries | Age | Bidders | SecondDayQueries | Location | Media | IsPainting |
|---|---|---|---|---|---|---|---|
| Min. : 1050 | Min. : 997 | Min. : 0.00 | Min. : 821 | Min. : 504 | Min. :1.000 | Min. :0.0000 | Min. :0.0000 |
| 1st Qu.: 3236 | 1st Qu.: 4767 | 1st Qu.: 46.75 | 1st Qu.:2341 | 1st Qu.:1306 | 1st Qu.:1.000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 |
| Median : 3737 | Median : 5658 | Median : 65.00 | Median :2686 | Median :1538 | Median :1.000 | Median :1.0000 | Median :1.0000 |
| Mean : 3893 | Mean : 6272 | Mean : 64.56 | Mean :2901 | Mean :1641 | Mean :1.016 | Mean :0.7364 | Mean :0.6929 |
| 3rd Qu.: 4332 | 3rd Qu.: 7004 | 3rd Qu.: 80.00 | 3rd Qu.:3216 | 3rd Qu.:1856 | 3rd Qu.:1.000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 |
| Max. :12178 | Max. :46411 | Max. :2000.00 | Max. :8154 | Max. :5289 | Max. :2.000 | Max. :4.0000 | Max. :1.0000 |
A statistical summary was produced for all variables, showing the range, mean, median, and distribution of each feature.
| Soldprice | FirstDayQueries | Age | Bidders | SecondDayQueries | Location | Media | IsPainting |
|---|---|---|---|---|---|---|---|
| 5692 | 7000 | 2000 | 4732 | 2641 | 1 | 1 | 0 |
| 4379 | 8861 | 202 | 5356 | 2953 | 1 | 0 | 0 |
| 2943 | 6119 | 200 | 2244 | 1154 | 1 | 1 | 1 |
| 4741 | 9030 | 196 | 3541 | 2060 | 1 | 0 | 0 |
| 3832 | 5000 | 190 | 2852 | 1834 | 1 | 1 | 1 |
| 4174 | 5000 | 190 | 3116 | 1990 | 1 | 0 | 1 |
A snapshot of the first six rows is displayed to give a concrete view of the data structure and types after cleaning.

Interpretation:
The histogram shows that most auction items are sold within the 3,000–5,000 price range, with the peak in the 3,000–4,000 bracket (2,749 items).
Sales drop sharply at higher prices. Notably, the 10,000–11,000 range has 0 items, and only 1 item appears in the 12,000–13,000 range.
Key Takeaways:
3,000–5,000 is the market’s sweet spot.
High-priced items are rare and require targeted strategies.
Pricing within this range is consistent with where most sales actually close.
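The bracket counts quoted above come from grouping sold prices into 1,000-wide bins. A minimal Python sketch of that binning is shown here; the function name `bin_counts` is hypothetical, and the handful of prices are illustrative values taken from the summary and snapshot tables, not the full dataset.

```python
from collections import Counter

def bin_counts(values, width=1000):
    """Count values per [k*width, (k+1)*width) bracket, keyed by the lower edge."""
    return Counter((int(v) // width) * width for v in values)

# Illustrative prices only; the real histogram uses all 5,676 cleaned rows.
sample_prices = [1050, 3236, 3737, 4332, 4741, 5692, 12178]
counts = bin_counts(sample_prices)
for lower in sorted(counts):
    print(f"{lower}-{lower + 1000}: {counts[lower]}")
```

Brackets with no sales, such as 10,000–11,000, simply do not appear as keys, and looking them up in the counter returns 0.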

Interpretation:
The histogram illustrates the distribution of auction item ages (filtered to show items aged under 200 years).
Key Observations:
The distribution is bell-shaped and slightly right-skewed.
Most items fall between 40 and 90 years old, with peaks around 50–60 and 70–80.
Very few items are older than 120 years, and almost none beyond 150.
Insights:
Items aged 40–90 years are the most common, indicating strong supply or demand in that age range.
Newer items (<30 years) are rare, possibly due to auction curation focusing on vintage or collectible products.
Very old items (>120 years) are rare and may be more valuable, niche, or harder to source.

Interpretation:
The histogram shows that most auction items attract between 2,000 and 3,500 bidders, with a sharp peak around 2,500–2,800. The distribution is right-skewed: a few listings receive very high bidder counts (up to about 8,000), but these are outliers.
Key Takeaways:
Typical auctions attract 2,000–3,000 bidders.
High bidder counts are rare but suggest strong item appeal or effective marketing.
This distribution can help define benchmark engagement levels for evaluating auction performance.

Interpretation:
This scatter plot compares the number of queries items receive on the first day versus the second day, with points colored by whether the item is a painting (Yes = yellow) or not (No = red).
Key Observations:
There is a positive correlation between First Day and Second Day Queries — items with more interest on Day 1 tend to maintain visibility on Day 2.
Paintings (yellow dots) are concentrated in the lower-left, indicating modest but consistent attention.
Non-paintings (red dots) are more widely spread, including some that attracted very high query volumes, suggesting they are more likely to go viral or trend.
Insights:
Non-painting items have a broader range of visibility, potentially due to varied media exposure or appeal.
Paintings tend to receive steady but moderate attention, possibly from a more niche or loyal buyer base.

Interpretation:
This scatter plot compares Day 1 and Day 2 user interest for auction items, segmented by location type: Small City (pink) vs. Big City (black).
Key Observations:
The majority of listings are from Small Cities, forming a dense cluster in the range of 2,000–10,000 first-day queries and 1,000–3,500 second-day queries.
Listings from Big Cities, although fewer in number, are more dispersed and include higher query counts, particularly on the second day.
Both locations show a positive correlation between Day 1 and Day 2 engagement.
Insights:
Small cities dominate in volume, but big city items may attract stronger attention per listing.
Urban-based auctions could benefit from enhanced exposure or targeted campaigns due to their viral potential.

Interpretation:
This scatter plot visualizes the relationship between the number of bidders and the sold price, colored by whether the item is a painting (orange) or non-painting (blue).
Key Observations:
There is a strong positive correlation: items with more bidders tend to have higher sold prices.
Non-paintings (blue) dominate the upper price ranges, showing greater variability and achieving higher price ceilings.
Paintings (orange) tend to cluster in the mid-range for both bidder count and price, suggesting more consistent but lower auction outcomes.
Insights:
Non-painting items may benefit more from competitive bidding, potentially due to broader appeal or higher perceived value.
Paintings, while more stable in performance, may need targeted promotion to reach top price tiers.
The number of bidders is a strong predictor of final auction price, especially for non-paintings.

Interpretation:
This scatter plot shows how advertising exposure (Media Count: 1 to 4) influences the relationship between First Day Queries and Sold Price.
Key Observations:
Most items used only 1 advertisement channel (grey), and their data points cluster in the lower-to-mid range of both queries and prices.
Items with 2–4 advertisements (green, red, purple) tend to appear higher in sold price and first day visibility.
A few of the highest-priced items were promoted using 3 or 4 media channels, indicating a possible uplift from multi-channel exposure.
Insights:
More media coverage is associated with higher visibility and selling price.
Using multiple advertisement platforms (2+) may be a valuable strategy for sellers aiming to increase item visibility and auction success.

Interpretation:
Figure 1: Sold Price by Painting Status
Non-paintings not only have higher median prices but also a greater spread of outliers, with some above 10,000, showing potential for high-value sales.
Paintings show fewer and lower outliers, indicating a tighter value range.
Figure 2: Sold Price by Location
Both Small and Big Cities have similar median prices. However, outliers are present in both, suggesting location may not be a key driver for high-price auctions.
Figure 3: Sold Price by Media
There is a clear upward trend in median sold price as media exposure increases. Items with 3–4 advertisements show more extreme outliers, reinforcing the importance of advertising.
Figure 4: Number of Bidders by Painting Status
Non-paintings attract more bidders and a wider spread of extreme values, some exceeding 8,000.
Paintings are more consistent with fewer outliers.
Figure 5: First Day Queries by Painting Status
Non-paintings receive significantly more views, with outliers reaching above 40,000.
Paintings maintain a stable pattern with limited spikes.
Figure 6: Second Day Queries by Painting Status
The trend continues: non-paintings dominate in volume and variability, again with a high number of outliers.
Paintings remain more consistent and niche.
Summary Insight:
Outliers are more prevalent among non-paintings, especially with higher media counts.
More advertising not only increases average performance but also the probability of exceptional results.

Interpretation:
This heatmap displays Pearson correlation coefficients between key numeric variables in the auction dataset. Darker red tiles represent stronger positive relationships, while lighter tiles indicate weaker or minimal correlation.
Key Findings:
Soldprice is most strongly correlated with SecondDayQueries and Bidders, and moderately with FirstDayQueries:
SecondDayQueries (0.83): items with more queries on Day 2 tend to sell for more.
Bidders (0.79): items with more bidders tend to reach higher final prices.
FirstDayQueries (0.56): early interest is associated with better pricing, though less strongly.
Bidders and SecondDayQueries have a very strong correlation (0.90), indicating that higher visibility on Day 2 brings more bidder engagement.
Age shows weak or negligible correlation with: Soldprice (0.18), Bidders (0.24), and SecondDayQueries (0.20), meaning item age has minimal impact on auction outcomes.
FirstDayQueries is moderately correlated with: SecondDayQueries (0.43) and Bidders (0.45) – suggesting early attention can carry through the auction duration.
Insight:
Bidders and queries are the strongest predictors of sold price.
Age is not a strong influencing factor.
Focused promotion and visibility (especially generating queries) are essential for auction success.
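For reference, each tile in the heatmap is a Pearson coefficient: the covariance of two variables divided by the product of their standard deviations. The sketch below is a plain-Python illustration with a hypothetical helper name, `pearson_r`. It is applied only to the six snapshot rows shown earlier, so its result will not match the full-dataset value of 0.90.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Bidders and SecondDayQueries from the six-row snapshot of the cleaned data.
bidders = [4732, 5356, 2244, 3541, 2852, 3116]
second_day = [2641, 2953, 1154, 2060, 1834, 1990]
print(round(pearson_r(bidders, second_day), 2))
```

Because the coefficient is scale-free, it is the same whether it is computed on raw or standardized values.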
| Soldprice | FirstDayQueries | Age | Bidders | SecondDayQueries | Location | Media | IsPainting | Set |
|---|---|---|---|---|---|---|---|---|
| 1.0101427 | -0.4653947 | 0.9082779 | 1.2937754 | 1.0955991 | 1 | 0.4786239 | 1 | Train |
| -0.8174855 | -0.0488422 | -0.3734437 | -0.9766221 | -1.0855608 | 1 | -1.3128357 | 1 | Train |
| -0.0634439 | 0.6313791 | -0.6041536 | 1.2645026 | 2.6294965 | 1 | 2.2700835 | 1 | Train |
| 0.0145968 | -0.2549027 | 0.9082779 | 0.4998973 | 0.4942960 | 1 | 0.4786239 | 1 | Train |
| -0.0919182 | -0.4192342 | 0.0623417 | -0.7412689 | -0.4344683 | 1 | 0.4786239 | 1 | Train |
| -0.7900659 | -0.4764732 | 0.2674171 | -0.2904680 | -0.9017229 | 1 | -1.3128357 | 1 | Train |
Interpretation:
To prepare the dataset for modeling, we normalized, partitioned, and rebalanced the data in the following steps:
The cleaned dataset was first split into training (70%) and validation (30%) sets using stratified sampling based on the IsPainting variable. This ensures that both sets reflect the original class proportions.
The numeric variables were then standardized (i.e., mean-centered and scaled to unit variance) using the caret::preProcess() function. The following numeric columns were normalized: Soldprice, FirstDayQueries, Age, Bidders, SecondDayQueries, and Media.
The training set was rebalanced using the ROSE technique, creating a synthetic, approximately 50:50 class distribution for the IsPainting variable to reduce prediction bias.
A new column, Set, was added to indicate whether each observation belongs to the Train or Validation subset.
The two sets were then combined and exported as standardized_balanced_partitioned_auction_data.xlsx for use in classification model training and evaluation.
This process improves class balance during model learning, ensures consistent feature scaling, and keeps the validation set representative of unseen data.
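The split-then-standardize part of this pipeline can be sketched in plain Python. This is an illustration under stated assumptions, not the caret code used in the report: the helper names and the toy data are hypothetical, the scaler is fitted on the training rows only (a common practice that the report does not confirm), and the ROSE rebalancing step is omitted.

```python
import random
import statistics

def stratified_split(rows, label, train_frac=0.7, seed=42):
    """Split rows so each class keeps roughly train_frac of its members in train."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label], []).append(row)
    train, valid = [], []
    for members in by_class.values():
        members = members[:]              # copy before shuffling
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

def fit_scaler(rows, columns):
    """Learn the mean and sample standard deviation of each column."""
    return {c: (statistics.mean(r[c] for r in rows),
                statistics.stdev(r[c] for r in rows)) for c in columns}

def apply_scaler(rows, scaler):
    """Return copies of rows with each scaled column centered and divided by its sd."""
    return [{**r, **{c: (r[c] - m) / s for c, (m, s) in scaler.items()}}
            for r in rows]

# Hypothetical toy data: 20 paintings and 10 non-paintings.
data_rng = random.Random(0)
rows = [{"Bidders": data_rng.randint(800, 8000), "IsPainting": int(i < 20)}
        for i in range(30)]
train, valid = stratified_split(rows, "IsPainting")
scaler = fit_scaler(train, ["Bidders"])   # statistics come from train only
train_z = apply_scaler(train, scaler)
valid_z = apply_scaler(valid, scaler)
print(len(train), len(valid))
```

Fitting the scaler on the training rows and reusing it for validation avoids leaking validation statistics into the model inputs.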
| Set | IsPainting | Count |
|---|---|---|
| Train | 0 | 1976 |
| Train | 1 | 1999 |
| Validation | 0 | 522 |
| Validation | 1 | 1179 |
Interpretation: Class Distribution Summary
The training set was rebalanced using ROSE, giving nearly equal numbers of paintings (1; 1,999 rows) and non-paintings (0; 1,976 rows) to reduce model bias.
The validation set retains the original class distribution (1,179 paintings and 522 non-paintings), providing a realistic benchmark for evaluating model performance.
This setup reduces class bias during training while keeping evaluation on unseen data realistic.
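As a quick check, the class shares implied by the table can be computed directly. The snippet below is a small Python illustration (the helper name `painting_share` is hypothetical); the counts are copied from the class-distribution table above.

```python
# Counts copied from the class-distribution table above.
counts = {
    ("Train", 0): 1976, ("Train", 1): 1999,
    ("Validation", 0): 522, ("Validation", 1): 1179,
}

def painting_share(counts, subset):
    """Fraction of rows in `subset` with IsPainting == 1."""
    total = counts[(subset, 0)] + counts[(subset, 1)]
    return counts[(subset, 1)] / total

for subset in ("Train", "Validation"):
    print(subset, round(painting_share(counts, subset), 4))
```

The training share is about 50.3% and the validation share about 69.3%. The latter is the baseline accuracy a classifier would reach by always predicting "painting".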
## k-Nearest Neighbors
##
## 3975 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3577, 3577, 3578, 3577, 3577, 3578, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7024703 0.4050831
## 7 0.7007130 0.4014401
## 9 0.6849516 0.3698498
## 11 0.6836010 0.3670181
## 13 0.6802429 0.3602197
## 15 0.6757175 0.3511261
## 17 0.6711046 0.3418486
## 19 0.6732016 0.3459793
## 21 0.6756384 0.3508369
## 23 0.6740442 0.3476205
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
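For readers unfamiliar with the method, a k-nearest-neighbors classifier labels a new item by majority vote among the k closest training items. The sketch below is a minimal Python illustration with k = 5 and Euclidean distance; the function name and the two-feature toy data are hypothetical, and it is not the caret model summarized above.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical standardized points: (Bidders, SecondDayQueries).
train_X = [(-1.0, -0.9), (-0.8, -1.1), (-0.5, -0.4), (-0.3, -0.6),
           (0.9, 1.1), (1.2, 0.8), (1.5, 1.6), (2.0, 1.9)]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = painting, 0 = non-painting

low_interest = knn_predict(train_X, train_y, (-0.7, -0.8), k=5)
high_interest = knn_predict(train_X, train_y, (1.4, 1.3), k=5)
print(low_interest, high_interest)
```

Because the vote depends on distances, the predictors need to be on a common scale, which is why standardization precedes this step.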

| Soldprice | FirstDayQueries | Age | Bidders | SecondDayQueries | Location | Media | IsPainting | Set | Prediction |
|---|---|---|---|---|---|---|---|---|---|
| 0.8973001 | 1.0117418 | 3.369184 | 0.7563270 | 0.8083524 | 1 | -1.3128357 | 0 | Validation | 0 |
| 0.1316578 | 0.7547200 | 2.959033 | 0.2540059 | 0.3353529 | 1 | -1.3128357 | 1 | Validation | 0 |
| -0.4093539 | -1.1153347 | 2.574516 | -0.4824974 | -0.6566057 | 1 | -1.3128357 | 1 | Validation | 0 |
| 4.2709773 | 2.1827417 | 2.446344 | 3.8650971 | 2.7252454 | 2 | 2.2700835 | 0 | Validation | 0 |
| -0.7510455 | -0.1057120 | 2.190000 | 0.3980280 | 0.3142881 | 1 | -1.3128357 | 1 | Validation | 1 |
| -0.8670519 | 0.5519830 | 1.933655 | -1.0035531 | -1.1142855 | 1 | -1.3128357 | 1 | Validation | 0 |
| -0.4336098 | -0.0373944 | 1.933655 | -0.0281838 | 0.0098066 | 1 | -1.3128357 | 0 | Validation | 0 |
| -0.3070574 | 0.5560452 | 1.933655 | 0.2223913 | 0.3257780 | 1 | -1.3128357 | 0 | Validation | 1 |
| 1.0850195 | 0.9821990 | 1.805483 | 0.2540059 | 0.4483366 | 1 | -1.3128357 | 0 | Validation | 0 |
| 1.4172197 | 1.8367225 | 1.805483 | 1.8862566 | 2.2560758 | 1 | 0.4786239 | 0 | Validation | 0 |
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 313 435
## 1 209 744
##
## Accuracy : 0.6214
## 95% CI : (0.5979, 0.6445)
## No Information Rate : 0.6931
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2058
##
##
## Sensitivity : 0.6310
## Specificity : 0.5996
## Pos Pred Value : 0.7807
## Neg Pred Value : 0.4184
## Prevalence : 0.6931
## Detection Rate : 0.4374
## Detection Prevalence : 0.5603
## Balanced Accuracy : 0.6153
##
## 'Positive' Class : 1
##
Interpretation:
The KNN model shows modest performance on the validation set. Sensitivity (63%) and specificity (60%) are similar, so the model identifies paintings and non-paintings at comparable rates, but overall accuracy (62.1%) is below the no-information rate (69.3%), meaning it does not beat always predicting the majority class on raw accuracy. Precision is 78%: when the model predicts a painting it is usually correct, which is valuable in applications like targeted marketing, although paintings already make up 69% of the validation set.
The negative predictive value is lower (42%), so predictions of non-painting are wrong more often than right. Cross-validated accuracy on the balanced training set (70.2% at k = 5) is higher than validation accuracy, and the Kappa score (~0.21) indicates only limited agreement beyond chance.
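The statistics in the KNN output follow directly from the four cells of its confusion matrix. The Python sketch below uses a hypothetical helper, `binary_metrics`, to recompute them, treating 1 (painting) as the positive class; the counts are copied from the KNN confusion matrix above.

```python
def binary_metrics(tp, fp, fn, tn):
    """Summary statistics for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column margins.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Rows of the printed matrix are predictions, columns are reference labels.
knn = binary_metrics(tp=744, fp=209, fn=435, tn=313)
for name, value in knn.items():
    print(f"{name}: {value:.4f}")
```

Kappa rescales accuracy against chance agreement, which is why it stays low here even though precision looks high.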
| Soldprice | FirstDayQueries | Age | Bidders | SecondDayQueries | Location | Media | IsPainting | Set | X0 | X1 | Predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.8973001 | 1.0117418 | 3.369184 | 0.7563270 | 0.8083524 | 1 | -1.3128357 | 0 | Validation | 0.9985696 | 0.0014304 | 0 |
| 0.1316578 | 0.7547200 | 2.959033 | 0.2540059 | 0.3353529 | 1 | -1.3128357 | 1 | Validation | 0.9684050 | 0.0315950 | 0 |
| -0.4093539 | -1.1153347 | 2.574516 | -0.4824974 | -0.6566057 | 1 | -1.3128357 | 1 | Validation | 0.7136155 | 0.2863845 | 0 |
| 4.2709773 | 2.1827417 | 2.446344 | 3.8650971 | 2.7252454 | 2 | 2.2700835 | 0 | Validation | 1.0000000 | 0.0000000 | 0 |
| -0.7510455 | -0.1057120 | 2.190000 | 0.3980280 | 0.3142881 | 1 | -1.3128357 | 1 | Validation | 0.7946413 | 0.2053587 | 0 |
| -0.8670519 | 0.5519830 | 1.933655 | -1.0035531 | -1.1142855 | 1 | -1.3128357 | 1 | Validation | 0.5478309 | 0.4521691 | 0 |
| -0.4336098 | -0.0373944 | 1.933655 | -0.0281838 | 0.0098066 | 1 | -1.3128357 | 0 | Validation | 0.5002153 | 0.4997847 | 0 |
| -0.3070574 | 0.5560452 | 1.933655 | 0.2223913 | 0.3257780 | 1 | -1.3128357 | 0 | Validation | 0.7432732 | 0.2567268 | 0 |
| 1.0850195 | 0.9821990 | 1.805483 | 0.2540059 | 0.4483366 | 1 | -1.3128357 | 0 | Validation | 0.8308143 | 0.1691857 | 0 |
| 1.4172197 | 1.8367225 | 1.805483 | 1.8862566 | 2.2560758 | 1 | 0.4786239 | 0 | Validation | 0.9999984 | 0.0000016 | 0 |
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 237 154
## 1 285 1025
##
## Accuracy : 0.7419
## 95% CI : (0.7204, 0.7626)
## No Information Rate : 0.6931
## P-Value [Acc > NIR] : 5.311e-06
##
## Kappa : 0.3477
##
## Mcnemar's Test P-Value : 5.485e-10
##
## Sensitivity : 0.8694
## Specificity : 0.4540
## Pos Pred Value : 0.7824
## Neg Pred Value : 0.6061
## Prevalence : 0.6931
## Detection Rate : 0.6026
## Detection Prevalence : 0.7701
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : 1
##
Interpretation: Naive Bayes Classifier on Auction Dataset
The Naive Bayes model was trained using five standardized predictors: FirstDayQueries, Age, Bidders, SecondDayQueries, and Media.
It was evaluated on the validation set, producing predicted class probabilities and a confusion matrix.
Key Performance Highlights:
Accuracy: 74.2%, significantly above the no-information rate of 69.3%.
Precision (PPV): 78.2%. When the model predicts a painting, it is usually correct.
Recall (Sensitivity): 86.9%. Most true paintings are correctly identified.
Specificity: 45.4%. More than half of the non-paintings (285 of 522) are misclassified as paintings.
Kappa: 0.35, indicating fair agreement beyond chance.
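The X0 and X1 columns in the prediction table appear to be class posterior probabilities. The sketch below shows how a Naive Bayes classifier produces such values, assuming Gaussian class-conditional densities for numeric predictors (a common default; the report does not state which implementation was used). The function names and training points are hypothetical.

```python
import math
import statistics

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature (mean, sd) for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        stats = [(statistics.mean(col), statistics.stdev(col)) for col in zip(*rows)]
        model[label] = (len(rows) / len(y), stats)
    return model

def posterior(model, x):
    """Return normalized posterior probabilities P(class | x)."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        for value, (mean, sd) in zip(x, stats):
            # Log of the normal density, summed across features (independence).
            log_p += -0.5 * math.log(2 * math.pi * sd**2) - (value - mean) ** 2 / (2 * sd**2)
        log_scores[label] = log_p
    top = max(log_scores.values())        # subtract the max for numerical stability
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical standardized training points: (Bidders, Media).
X = [(-0.9, -1.3), (-0.4, 0.5), (-0.6, 0.5), (-0.2, -1.3),
     (1.1, 0.5), (1.8, 2.3), (0.9, -1.3), (2.4, 2.3)]
y = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = painting, 0 = non-painting

model = fit_gaussian_nb(X, y)
probs = posterior(model, (2.0, 2.3))
predicted = max(probs, key=probs.get)
print(probs, predicted)
```

The predicted class is the one with the larger posterior, which is how a row with X0 near 1 ends up labeled 0 in the table.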
## Metric Value
## 1 Residual DF 3969.000000
## 2 R2 0.774458
## 3 Adjusted R2 0.774174
## 4 Std. Error Estimate 0.546476
## 5 RSS 1185.284454
## Predictor Criteria Included
## (Intercept) (Intercept) 6.317575 TRUE
## FirstDayQueries FirstDayQueries 25.923142 TRUE
## Age Age 2.762722 TRUE
## Bidders Bidders 7.255605 TRUE
## SecondDayQueries SecondDayQueries 30.369581 TRUE
## Media Media 14.855057 TRUE
##
## Tolerance for entering the model: 0.15655
## Predictor Estimate
## (Intercept) (Intercept) 0.055409
## FirstDayQueries FirstDayQueries 0.233596
## Age Age 0.040562
## Bidders Bidders 0.138306
## SecondDayQueries SecondDayQueries 0.570408
## Media Media 0.136147
Interpretation:
The linear regression model achieves a strong fit, with an R² of 0.774 (adjusted R² 0.774), meaning the predictors explain roughly 77% of the variation in Soldprice.
Key variables such as SecondDayQueries, FirstDayQueries, and Media are statistically significant. The coefficient plot below highlights that SecondDayQueries has the strongest positive influence on price (standardized estimate 0.570), suggesting that items that gain more attention on the second day tend to sell for more.
Overall, the model is interpretable and performs well, making it a useful predictor of auction outcomes. Although no serious multicollinearity issues were flagged, Bidders and SecondDayQueries are highly correlated (0.90 in the correlation heatmap), so their individual coefficients should be interpreted with some caution.
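The fit statistics in the regression output are linked by simple formulas: the residual standard error is the square root of RSS divided by the residual degrees of freedom, and adjusted R² penalizes R² for the number of predictors. The Python sketch below uses a hypothetical helper, `fit_statistics`, to recompute both from the reported values.

```python
import math

def fit_statistics(rss, r2, residual_df, n_coefficients):
    """Derive n, adjusted R^2, and residual standard error from summary values."""
    n = residual_df + n_coefficients      # observations used in the fit
    p = n_coefficients - 1                # predictors, excluding the intercept
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    residual_se = math.sqrt(rss / residual_df)
    return n, adjusted_r2, residual_se

# Values copied from the regression output above (intercept plus five predictors).
n, adj_r2, se = fit_statistics(rss=1185.284454, r2=0.774458,
                               residual_df=3969, n_coefficients=6)
print(n, round(adj_r2, 6), round(se, 6))
```

The implied sample size is 3,975, which matches the size of the training set, and the recomputed values agree with the reported adjusted R² and standard error to rounding.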