Project Proposal: Predicting Price of Used Cars Using Regression Model

Hello, my name is Hyreen Alice, and I am deeply engaged in applying data science to understand and predict used car prices. As part of the course under Professor Dr. Bradford Dykes at Grand Valley State University, I have focused on employing statistical modeling and advanced analytics. My expertise spans various analytical techniques, including linear, logistic, and ridge regression, with proficiency in R and familiarity with essential libraries such as tidyverse and ggplot2. In this project, I delved into the complexities of statistical modeling by predicting the prices of used cars, using R and R Markdown as the foundational tools for the analysis. I explored the fundamental role of probability in statistical modeling and used R-squared and RMSE to assess how well my models fit, ensuring a robust framework for inference. By employing generalized linear models and ridge regression, I tailored my approach to match the distribution and nature of the data, demonstrating the process in R and interpreting the output to assess the effectiveness of different predictors. I conducted model selection rigorously, using R-squared and RMSE criteria to compare and choose the optimal model, which was essential for enhancing predictive accuracy. The results were then communicated clearly and concisely, making extensive use of visual aids like plots to make the findings accessible and understandable to a non-technical audience. Finally, the extensive use of R Markdown allowed for a dynamic presentation of my methods and findings, showcasing the practical application of programming software in statistical analysis and model assessment, thus fully aligning with and accomplishing the educational objectives of the course.

Objectives

In this project, I aimed to achieve the following objectives:

Describe and Apply Statistical Modeling Foundations: I explored the fundamental role of probability and applied maximum likelihood estimation for fitting statistical models, ensuring a robust framework for inference.

Identify and Utilize Appropriate Predictive Models: I employed generalized linear models (GLM) to match the distribution and nature of our dataset, demonstrated the process in R, and interpreted the outputs to assess different predictors’ effectiveness.

Perform Model Selection and Validation: Rigorous model selection was conducted using RMSE and R-squared criteria to identify the optimal model, crucial for enhancing predictive accuracy and managing issues like multicollinearity and overfitting.

Effectively Communicate Complex Model Outcomes: I communicated the results clearly and concisely, utilizing visual aids like plots to make the findings accessible to a non-technical audience.

Utilize R Programming for Comprehensive Data Analysis: The extensive use of R Markdown showcased the dynamic presentation of our methods and findings, highlighting the practical application of programming software in statistical analysis and model assessment.

This project aligns seamlessly with the learning objectives of the course by effectively demonstrating the practical application of statistical modeling to real-world data. The utilization of regression techniques, both simple and multiple, alongside ridge regression, embodies a deep understanding of probability as it applies to model fitting and inference, showcasing the ability to manage assumptions and limitations such as multicollinearity. The careful selection between linear and ridge regression models illustrates the process of model selection, guided by insights into each predictor’s influence on the response variable. Moreover, this portfolio communicates complex statistical findings in an accessible manner, facilitating a clear understanding for a general audience. Finally, the extensive use of R programming for data preprocessing, model fitting, and diagnostics exemplifies a strong command of the software, meeting the course’s objective of employing programming tools to execute and interpret statistical models. Overall, the project reflects a comprehensive grasp of statistical concepts, model assessment, and the articulation of data-driven stories, which are core competencies this course aims to impart.

Introduction

Welcome to my portfolio on car valuation, where I delve into the dynamics of the used car market through rigorous statistical modeling. This portfolio presents an in-depth exploration of how various factors such as make, model, age, location, and technical specifications influence the prices of used cars. Utilizing a dataset of 7,253 cars, this analysis employs multiple linear regression (MLR) and ridge regression models to uncover insights that are not only statistically significant but also practically relevant for stakeholders in the automotive industry.

The goal of this project is to model and predict used car prices based on a multitude of features, thereby providing a foundation for informed decision-making for sellers and buyers alike. The findings detailed here offer a clear narrative on the interdependencies and influence of various predictors on car pricing, highlighting how different attributes from transmission type to engine power play into the market valuation of cars.

Dataset Overview

Data set: https://drive.google.com/uc?id=1c4-0K8V2jGF-9P1qV34T-DOnEJskJk4q
Total Rows: 7,253
Total Columns: 14

Data Set Description

S.No.: Serial Number
Name: Name of the car which includes Brand name and Model name
Location: The location in which the car is being sold or is available for purchase
Year: Manufacturing year of the car
Kilometers_driven: The total kilometers driven in the car by the previous owner(s) in KM
Fuel_Type: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
Transmission: The type of transmission used by the car (Automatic / Manual)
Owner: Type of ownership
Mileage: The standard mileage offered by the car company in kmpl or km/kg
Engine: The displacement volume of the engine in CC
Power: The maximum power of the engine in bhp
Seats: The number of seats in the car
New_Price: The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000)
Price: The price of the used car in INR Lakhs (1 Lakh = 100,000)

Problem Statement

This analysis aims to explore the factors affecting the pricing of used cars. The following questions will guide the investigation:

Influence of Predictors: Do various predicting factors significantly affect the price of a used car?
Key Determinants: Which independent variables have the most impact on the pricing of used cars?
Brand and Model Impact: Does the name of a car, which includes the brand and model, influence its price?
Transmission Type: How does the type of transmission (Automatic or Manual) affect the pricing of a car?
Geographic Influence: Is there an effect on the used car price depending on the location it is being sold?
Usage and Age: Do the kilometers driven and the year of manufacturing have a negative correlation with the car’s price?
Performance Attributes: How do the performance-related attributes such as Mileage, Engine, and Power, relate to the car’s pricing?
Features and Specifications: What is the influence of the number of seats and fuel type on the price of a used car?

Methodology

Data Preparation: Cleaning and transforming the used car price data.
Model Development: Constructing models using simple linear regression, multiple linear regression, and ridge regression.
Model Evaluation: Assessing model performance using statistical metrics such as adjusted R-squared and RMSE.
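A minimal sketch of the data-preparation step, assuming the raw Mileage, Engine, and Power columns arrive as text with a trailing unit (for example "1582 CC"). The toy rows and the helper name strip_unit are illustrative assumptions, not taken from the project code.

```r
# Toy rows standing in for the raw data; values and unit strings are assumed.
raw <- data.frame(
  Year    = c(2010, 2015, 2012),
  Mileage = c("26.6 km/kg", "19.67 kmpl", NA),
  Engine  = c("998 CC", "1582 CC", "1199 CC"),
  Power   = c("58.16 bhp", "126.2 bhp", "null bhp"),
  Price   = c(1.75, 12.5, 4.5),
  stringsAsFactors = FALSE
)

# Keep the leading number and drop the unit; non-numeric text becomes NA.
strip_unit <- function(x) {
  suppressWarnings(as.numeric(sub("\\s.*$", "", x)))
}

# Convert the unit-bearing columns and add a log-transformed price.
cleaned <- transform(
  raw,
  Mileage   = strip_unit(Mileage),
  Engine    = strip_unit(Engine),
  Power     = strip_unit(Power),
  Price_log = log(Price)
)

str(cleaned)
```

Entries that cannot be parsed are left as NA rather than guessed, so they can be imputed or dropped explicitly in a later step.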

Expected Outcome

Develop a pricing model for used cars by gathering and processing comprehensive transaction data, engineering relevant features, and deploying advanced statistical models. This entails collecting data on various attributes like make, model, year, mileage, and condition; cleansing data for accuracy; constructing impactful features such as car age and brand reputation; and employing models like linear regression and ridge regression. The model's effectiveness will be evaluated using metrics like RMSE and R-squared to ensure accurate price predictions.
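The two evaluation metrics named above can be written out in a few lines. This is a sketch with made-up numbers; the function names rmse and r_squared are illustrative and not part of the project code.

```r
# Root mean squared error: typical size of a prediction error, in the units of y.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared: 1 minus the ratio of residual to total sum of squares.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Hypothetical log-prices and model predictions, for illustration only.
actual    <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 1.5, 3.5, 3.5)

rmse(actual, predicted)       # 0.5
r_squared(actual, predicted)  # 0.8
```

Lower RMSE and higher R-squared both indicate a closer fit, but RMSE is expressed in the units of the response (here, log-price), which makes it easier to relate to individual prediction errors.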

Exploratory Data Analysis

Car Dataset Insights

Year
The majority of cars in the dataset seem to be from recent years, indicating that the dataset contains newer used cars.
There is a peak around 2010, suggesting that many of the cars are about a decade old.
Kilometers Driven
Most cars have low to moderate kilometers driven, with very few cars showing very high readings.
This suggests that the dataset consists mostly of relatively lightly used cars.
Mileage
The mileage of the cars seems to be normally distributed with a peak around 10-20 km/l.
This might indicate a common fuel efficiency for the cars in the dataset.
Engine
This histogram shows a right-skewed distribution, indicating that there are more cars with smaller engine sizes and few cars with very large engines.
Power
Similar to the Engine histogram, the Power distribution is also right-skewed.
Most cars have lower power, with a tail of cars that have higher power ratings.
Price
The price histogram is heavily right-skewed, indicating that most cars are in the lower price range, with a few cars being very expensive.
Seats
The majority of cars have 5 seats, which is typical for most passenger vehicles.
There are very few cars with more than 5 seats.
Age of Car
This shows a decreasing trend, with newer cars being more prevalent in the dataset than older cars.
This could indicate either a dataset of mostly newer models or perhaps a resale market that has newer cars more frequently.
New Price Num (new_price_num)
The plot shows that most cars are in the lower new price range, with very few expensive models.
This appears to be a numeric version of New_Price, which the data set description defines as the price of a new car of the same model in INR Lakhs.

Observations

Upon examining the dataset, several observations about the distribution and characteristics of various variables have been made:

Year: The ‘Year’ feature is left-skewed with outliers on the lower side. Considering the nature of this skewness and outliers, it could be considered for exclusion from the model.
Kilometers Driven: This variable is right-skewed, indicating that most cars have been driven relatively little, with a few cars showing exceptionally high readings.
Mileage: The ‘Mileage’ of cars is approximately normally distributed but with a few outliers on both the upper and lower ends. These outliers warrant further investigation.
Engine, Power, and Price: These features are right-skewed and exhibit outliers on the upper side, suggesting the presence of some high-performance or luxury cars with significantly higher values in the dataset.
Age of Car: The age distribution of the cars is right-skewed, indicating a greater number of newer cars in the dataset and fewer older models.
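Right skew of the kind seen in Price is commonly reduced with a log transformation, which is consistent with the Price_log variable used in the models later in this report. The sketch below uses simulated values, not the project's data, and a hand-written skewness helper that is an illustrative assumption.

```r
set.seed(1)
# Simulated right-skewed "prices"; not the project's data.
price <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Moment-based sample skewness (close to 0 for a symmetric distribution).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(price)       # strongly positive for the raw values
skew_log <- skewness(log(price))  # near zero after the log transform

c(raw = skew_raw, log = skew_log)
```

A roughly symmetric response tends to produce better-behaved residuals in a linear model, which is one reason to model log-price rather than price directly.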

Car Profile Observations

The dataset provides insights into the profile of used cars available for sale. Notable observations include:

Transmission: Approximately 71% of the cars listed for sale feature manual transmission.
Ownership: Around 82% of the cars are being sold by their first owners.
Brand Popularity: Maruti and Hyundai brands together make up 39% of the cars available for sale, indicating their popularity in the pre-owned car market.
Fuel Type: Over half of the cars available for sale, about 53%, utilize diesel as their fuel type.
Geographic Distribution: Mumbai leads with the highest number of cars available for purchase, while Ahmedabad has the fewest listings.
Seating Capacity: The majority of the cars are 5-seaters, which aligns with the common configuration for family vehicles.
Vehicle Age: The age range of the cars on sale spans from 2 to 23 years, providing a wide range for potential buyers.
Price Range: Approximately 71% of the cars fall into the lower price range segment, making them more accessible to a broader customer base.

These points offer a comprehensive overview of the current market for used cars and can be used to inform potential buyers or to shape further analysis.
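Percentages like those above can be computed with table() and prop.table(). The following is a sketch on made-up rows; cars_toy and the helper share are illustrative names.

```r
# Toy listing data; categories mirror the data set description, counts are made up.
cars_toy <- data.frame(
  Transmission = c("Manual", "Manual", "Automatic", "Manual"),
  Fuel_Type    = c("Diesel", "Petrol", "Diesel", "CNG")
)

# Percentage of listings in each category, rounded to one decimal place.
share <- function(x) round(100 * prop.table(table(x)), 1)

share(cars_toy$Transmission)
share(cars_toy$Fuel_Type)
```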

Correlation

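# Load the packages needed for reshaping and plotting
library(reshape2)  # provides melt()
library(ggplot2)

# Calculate the correlation matrix on the numeric columns only
cor_matrix <- cor(cars_copy[, sapply(cars_copy, is.numeric)], use = "complete.obs")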

# Melt the correlation matrix
cor_melted <- melt(cor_matrix)

# Plot using ggplot2
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile(color = "white") +  # add tile layer
    scale_fill_gradient2(low = "blue", high = "yellowgreen", mid = "white", 
                         midpoint = 0, limits = c(-1, 1), space = "Lab", 
                         name = "Correlation") +
    theme_minimal() +  # Minimal theme
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +  # Rotate x axis labels
    labs(x = "", y = "", title = "Correlation Matrix") +
    geom_text(aes(label = sprintf("%.2f", value)), size = 3)  # Add text labels

The correlation analysis reveals several relationships between different variables related to the car profiles:

Engine and Power: There is a strong positive correlation (0.86) between the Engine size and Power output, suggesting that larger engines tend to be more powerful.
Price Determinants:
- The Price of the car has a positive correlation with Engine size (0.66), indicating that cars with larger engines tend to be more expensive.
- Similarly, Price is positively correlated with Power (0.77), reinforcing that more powerful cars typically command higher prices.
Mileage and Performance Factors: Mileage shows a negative correlation with Engine size, Power, and Price, as well as with the Age of the car. This suggests that as cars become more powerful, larger, or older, they tend to be less fuel-efficient.
Price Depreciation: There is a negative correlation between the Price and the Age of the car, indicating that older cars tend to be less expensive, aligning with the general depreciation of vehicle value over time.
Usage: The Kilometers Driven does not appear to have a significant impact on the Price, suggesting that the car’s condition and other factors may be more influential in determining the used car’s value.

The pairwise plots show the same correlations as the heatmap:

Kilometers driven has little apparent impact on Price.

As Power increases, Mileage decreases. Cars of more recent make sell at higher prices. As Engine and Power increase, the price of the car tends to increase.
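The strong Engine and Power correlation is relevant to the regression models, because highly correlated predictors inflate the variance of coefficient estimates (the multicollinearity issue that ridge regression is meant to manage). A rough worked example, assuming a model containing only these two predictors:

```r
# Correlation between Engine and Power reported in the heatmap.
r_engine_power <- 0.86

# With only these two predictors, regressing one on the other gives R^2 = r^2,
# so the variance inflation factor is 1 / (1 - r^2).
vif_two_predictors <- 1 / (1 - r_engine_power^2)
vif_two_predictors  # about 3.84
```

In a model with more predictors the variance inflation factor for each variable would be computed from its regression on all the others, so this figure is only a lower-bound style illustration.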

Variables that are correlated with Price variable

Transmission Type Analysis

The relationship between the type of transmission in cars, their engine sizes, and their prices has been observed as follows:

Manual Transmission:
- Cars with manual transmission tend to cluster in the lower range of engine sizes and prices.
- There are relatively few manual transmission cars with large engines or high prices.
- This pattern suggests a potential weak negative correlation, where the prevalence of manual transmission decreases as car prices increase.
Automatic Transmission:
- Vehicles with automatic transmission show a broader distribution across different engine sizes and price points, including the higher end of the spectrum.
- This variation implies that for cars with automatic transmissions, engine size may have a less definite or slightly more positive correlation with price compared to manual transmission cars.

Price Vs Power vs Transmission

Transmission Types Distribution

Manual Transmission Cars: Predominantly grouped at the lower end of both price and power, indicating that cars with manual transmissions tend to be less powerful and more affordable.
Automatic Transmission Cars: Display a more dispersed distribution across a wider range of prices and power levels. Several automatic cars extend into the higher end of the spectrum, which might be indicative of the luxury segment.

Correlation Patterns

Manual Transmission Cars: A concentration of data points at the lower price and power area suggests that manual cars are generally associated with lower power outputs and costs.
Automatic Transmission Cars: A wider distribution in the scatter plot is observed, suggesting a greater variability in power and price. These cars are more likely to be found in the higher price and power categories compared to manual transmission cars.

High-End Market

The presence of automatic transmission cars in the upper regions of the plot suggests that higher-powered and more expensive cars are more likely to be equipped with automatic transmissions, possibly reflecting the premium market segment.

Mileage vs Price based on Transmission Insights

The scatter plot analyzing the relationship between car mileage, price, and transmission type offers the following insights:

Manual Transmission Efficiency: There is a prevalent concentration of manual transmission cars that tend to offer higher mileage across various price points, relative to automatic transmission cars.
Economy and Mileage: A notable cluster of manual transmission cars within the lower price range exhibits higher mileage, suggesting that economical cars often come with manual transmissions and enhanced fuel efficiency.
Automatic Transmission Diversity: Automatic cars show a wider scatter in terms of mileage and do not exhibit a clear trend between price and mileage, reflecting the segment’s diversity in both pricing and fuel economy.

Price Vs Year Vs Transmission

The line plot illustrating average car prices over various manufacturing years, segmented by transmission type, reveals:

Rising Prices for Automatics: There’s a clear upward trend in the average price of automatic cars over the years, particularly sharp after 2010, indicating increased valuation or more premium models entering the market.
Steady Prices for Manuals: Manual cars show a more consistent and moderate increase in average price over time, without the sharp spikes observed with automatic cars.
Price Divergence Over Time: The price gap between automatic and manual transmission cars widens notably in more recent years, suggesting a market shift that increasingly favors automatics, potentially due to advancements in technology or changing consumer preferences.
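The averages behind a line plot like this can be computed with tapply(). This is a sketch on invented rows; the column names follow the data set description, and the values are not from the project.

```r
# Toy data; years, transmissions, and prices are made up for illustration.
cars_toy <- data.frame(
  Year         = c(2010, 2010, 2015, 2015, 2015, 2015),
  Transmission = c("Manual", "Automatic", "Manual", "Manual", "Automatic", "Automatic"),
  Price        = c(2.0, 5.0, 4.0, 6.0, 15.0, 25.0)
)

# Mean price for every Year x Transmission combination (rows = years).
avg_price <- with(
  cars_toy,
  tapply(Price, list(Year = Year, Transmission = Transmission), mean)
)

# Price gap (Automatic minus Manual) per manufacturing year.
gap <- avg_price[, "Automatic"] - avg_price[, "Manual"]
gap
```

A gap that grows with the manufacturing year corresponds to the widening divergence described above.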

Price Vs Year VS Fuel Type

From the line plot that compares the average price of cars by manufacturing year across different fuel types, we can observe:

Electric Vehicle Trend: There is a significant increase in the average price of electric vehicles over time, especially after 2010, which could be due to technological advancements or a shift in market demand towards eco-friendly options.
Diesel and Petrol Prices: Both diesel and petrol cars show a gradual increase in average prices over the years, with diesel vehicles generally maintaining a higher average price compared to petrol.
Fuel Type Variance: The average prices of cars with alternative fuel types like CNG and LPG remain relatively low and do not show the same upward trend as electric, diesel, or petrol cars, possibly indicating a smaller market segment or less expensive models.

Year Vs Price Vs Owner_Type

The line plot presenting the average car prices across different manufacturing years, segmented by owner type, shows several trends:

First Owners: Cars sold by first owners maintain a consistently higher average price across all manufacturing years, possibly reflecting better maintenance or lower usage.
Subsequent Owners: There is a noticeable increase in the average price for cars owned by the second and third owners in more recent years, suggesting an increased valuation of more modern vehicles regardless of ownership history.
Fourth and Above Owners: Cars that have had four or more owners tend to have a more variable price trend but generally stay below those of cars owned by the first three owners, which might be indicative of higher usage or wear and tear.

Price Vs Mileage vs Fuel_type

Analyzing the scatter plot that compares car prices with their mileage, differentiated by fuel type, we observe:

Mileage and Price Relationship: There doesn’t appear to be a strong correlation between mileage and price, as higher mileage does not consistently equate to higher or lower prices.
Fuel Type Disparity: The distribution of points by fuel type shows variability in how mileage affects price among different fuel types, with no single fuel type clearly leading to higher prices.

Price Vs Brand

The boxplot provides a comparison of car prices across different brands, revealing:

Brand Price Range: There is significant variability in price ranges among different car brands. Luxury brands, such as Bentley and Audi, have a higher median price and wider interquartile ranges, indicating a larger spread in their pricing.
Outliers: Several brands show outliers, particularly at higher price points, suggesting that there are certain models that are priced much higher than the typical cars from these brands.
Price Distribution: Brands like Maruti and Hyundai have more compact price distributions, indicating less variation in their vehicle pricing and a potentially more focused market segment.
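A brand-level comparison needs a brand column derived from Name. One simple way, sketched below, is to take the first word of the name; the example names are hypothetical, and this rule is an assumption rather than the project's confirmed method.

```r
# Hypothetical Name values in "Brand Model Variant" form.
car_names <- c(
  "Maruti Wagon R LXI CNG",
  "Hyundai Creta 1.6 CRDi SX Option",
  "Audi A4 New 2.0 TDI Multitronic"
)

# Brand is taken to be everything before the first space.
brand <- sub(" .*$", "", car_names)
brand
```

Two-word brands would be truncated by this rule and would need to be handled separately.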

Price vs Year by owner type and transmission

First Owners: Cars sold by first owners show a wide range of prices with newer models trending towards higher prices. Both automatic and manual transmissions are represented, with automatics generally at higher price points.
Second Owners: The spread of prices for cars sold by second owners is narrower than for first owners, with prices generally lower, reflecting the depreciation of vehicle value over time.
Third and Fourth & Above Owners: For cars sold by third owners and fourth & above, there are fewer data points, suggesting fewer transactions. These cars also tend to be older and less expensive, with some outliers representing exceptions.

These facets indicate that as cars change hands over time, their value tends to decrease, and the range of prices becomes narrower, especially for vehicles with multiple past owners.

Observations on Factors Affecting Car Pricing

A comprehensive review of the dataset has led to the following observations regarding car pricing:

Geographic Price Variations: The most expensive cars are located in Coimbatore and Bangalore, suggesting regional differences in pricing, possibly due to local demand or economic factors.
Seating and Price: Cars with a seating capacity for two are more expensive, often indicative of luxury or sports models.
Fuel Type Impact: Diesel-fueled vehicles are generally more expensive compared to other fuel types, which may reflect their higher efficiency or longevity.
Model Year Depreciation: As expected, older car models are sold at lower prices compared to the latest models, following the standard vehicle depreciation curve.
Transmission and Price: Vehicles with automatic transmissions tend to have higher prices than those with manual transmissions, possibly due to added convenience or technology.
Engine Capacity: There is a correlation between engine capacity and price, with more robust engines commanding higher prices.
Ownership and Valuation: The price decreases as the number of previous owners increases, reflecting wear and vehicle history.
Engine Requirements for Transmission: Automatic transmission vehicles require higher engine capacity and power, which might contribute to their higher pricing.
Fuel Type Trends Over Time: The prices for cars with diesel fuel have increased in more recent models, indicating a shift in market valuation over time.
Comprehensive Pricing Factors: A multitude of factors such as engine capacity, power, vehicle age, mileage, fuel type, location, and transmission type collectively influence the price of a car.

These findings provide valuable insights into the used car market, highlighting the complex interplay between various features and the resulting impact on vehicle pricing.

Simple Linear Regression Model

In the chunk below, I fit a separate simple linear regression of the log-transformed price (Price_log) on each variable in the data in turn. Each single-predictor model shows how well that variable alone predicts car prices, which helps illustrate basic concepts in statistical inference and model estimation.

# Select predictors and the target variable
predictors <- setdiff(names(cars_copy), c("Price", "Price_log"))  # Exclude the target and its untransformed version

# Initialize a list to store model summaries
model_summaries <- list()

# Loop through each predictor
for (var in predictors) {
  # Formulate the formula dynamically
  formula <- as.formula(paste("Price_log ~", var))
  
  # Fit the linear model
  model <- lm(formula, data = cars_copy)
  
  # Store the summary of each model
  model_summaries[[var]] <- summary(model)
}

# Now you can access each model summary by the predictor name, e.g., model_summaries[["Engine"]]
# If you want to print summaries, use:
lapply(model_summaries, print)

# Optionally, you can plot the residuals for each model to check for any patterns
for (var in predictors) {
  plot(model_summaries[[var]]$residuals, main = paste("Residuals for model with", var),
       xlab = "Index", ylab = "Residuals")
  abline(h = 0, col = "red")
}

# Summary printing for general review
cat("Model summaries and residual plots have been generated. Review each to assess model fit and assumptions.\n")

Overview
Based on these models and their outputs, several insights can be drawn to help stakeholders understand the factors that significantly influence car prices in the dataset. Notably, the model involving Power has the highest R-squared value (0.5896), indicating that this predictor alone accounts for approximately 58.96% of the variance in the log-transformed price of cars. This suggests that the power of a car is a major determinant of its price. The Engine and Brand_Class models also show substantial R-squared values (0.4719 and 0.3144, respectively), highlighting that engine size and brand class contribute significantly to car pricing. Models with significant categorical predictors such as Transmission and Fuel_Type provide actionable insights: consistent with the exploratory analysis, automatic transmission and diesel fuel are associated with higher prices, which stakeholders can consider for targeted marketing and pricing strategies.
Communicating these results to a general audience involves emphasizing how these predictors affect car prices. For instance, highlighting the strong impact of Power and Engine on pricing can guide potential buyers or sellers about which features enhance a car’s market value. Similarly, discussing the role of Brand_Class can help stakeholders understand brand positioning and its effect on pricing. For predictive purposes, these models enable stakeholders to estimate car prices based on specific features, enhancing decision-making processes for purchasing, selling, or stocking vehicles. The residual plots and model diagnostics serve as tools for checking the adequacy of the models, ensuring the reliability of the predictions made. Overall, these insights offer a comprehensive understanding of the key factors that drive car pricing, assisting stakeholders in making informed business decisions.

In summary, the single-predictor models provide significant insights into the factors influencing car valuation. Car power, which accounts for approximately 58.96% of the variance in log-transformed car prices, emerges as a major determinant. Engine size and brand classification also contribute significantly to pricing, as evidenced by their respective R-squared values. Transmission and fuel type are associated with substantial price differences, with automatic and diesel cars tending to be priced higher, offering actionable insights for targeted marketing strategies. By effectively communicating these results, stakeholders can better understand how specific car features influence market values, enhancing decision-making for purchasing, selling, or stocking vehicles. The analysis also pointed to potential overfitting, prompting further exploration with a multiple linear regression approach to refine the model's accuracy and reliability.

Adjusted R-squared: Adjusted for the number of predictors in the model, providing a more accurate measure when comparing models with different numbers of predictors. The adjusted R-squared for the Power model is 0.5895, almost equal to the R-squared, indicating that the model is appropriately specified without redundant predictors.
R-squared: Represents the proportion of variance in the dependent variable that is predictable from the independent variables. For the Power model, the R-squared value of 0.5896 suggests that approximately 58.96% of the variability in log-transformed car prices is explained by the car’s power.
Both the R-squared and adjusted R-squared of our model are very high, a clear indication that the model explains up to 89% of the variance in the price of used cars.
The analysis suggests possible overfitting, which prompts us to try a multiple linear regression.
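The relationship between R-squared and adjusted R-squared can be checked by hand. The sketch below uses R's built-in mtcars data purely as a stand-in for the car data; the variable choice is illustrative.

```r
# mtcars is a built-in R data set, used here only as a stand-in.
fit <- lm(log(mpg) ~ hp, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)  # number of observations
p <- 1             # number of predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)  # TRUE
```

With a single predictor and thousands of rows, the factor (n - 1) / (n - p - 1) is very close to 1, so the two statistics are necessarily almost equal; the penalty only becomes noticeable as more predictors are added.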

Multiple Linear Regression Model

This model aims to predict the price of a car from several predictor variables: kilometers driven, age of the car, engine, power, seats, location, fuel type, transmission, and owner type.

# Load necessary libraries
library(caret)
library(fastDummies)

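# Function to encode categorical variables as dummy (0/1) columns
encode_cat_vars <- function(x) {
  cat_cols <- names(x)[sapply(x, function(col) is.factor(col) || is.character(col))]
  # Drop the original categorical columns so they do not enter the model twice
  x <- dummy_cols(x, select_columns = cat_cols, remove_first_dummy = TRUE,
                  remove_selected_columns = TRUE)
  # Make column names syntactically valid (e.g. levels containing spaces) for use in formulas
  names(x) <- make.names(names(x), unique = TRUE)
  return(x)
}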

# Assuming 'cars_copy' is your data frame and has been loaded
encoded_cars_copy <- encode_cat_vars(cars_copy)

# Preparing data
X <- encoded_cars_copy[, !(names(encoded_cars_copy) %in% c("Price", "Price_log"))]
y <- cars_copy[["Price_log"]]

# Set seed for reproducibility
set.seed(42)

# Splitting data (70% of rows go to the training set)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE, times = 1)
X_train <- X[trainIndex, , drop = FALSE]
y_train <- y[trainIndex]  # y is a plain vector, so no drop argument is needed

# Fitting a multiple linear regression model on all encoded predictors
ols_model <- lm(y_train ~ ., data = X_train)

# Printing the summary of the model
model_summary <- summary(ols_model)
print(model_summary)

# Interpretation of model parameters
cat("\nInterpretation of Model Parameter Estimates:\n")
print(coef(model_summary))

# Residual analysis
residuals_values <- residuals(ols_model)
plot(residuals_values, type = 'p', main = "Residual Plot", xlab = "Observation Index", ylab = "Residuals")
abline(h = 0, col = "red")

# Explanation of residuals in least squares regression
cat("\nResiduals are differences between observed values and the values predicted by the model. The least squares method minimizes the sum of the squared residuals, effectively finding the line of best fit.\n")

# Interpretation of an individual residual
example_resid <- residuals_values[1]
cat("\nInterpretation of a Specific Residual:\n")
cat(sprintf("A residual of %f indicates that the actual value is %f units away from the predicted value by the model, on the log scale of price.\n", example_resid, example_resid))

# Bias, unbiased, and error in statistical and colloquial terms
cat("\nStatistical vs. Colloquial Usage:\n")
cat("In statistics, 'bias' refers to the difference between an estimator's expected value and the true value of the parameter being estimated. An 'unbiased' estimator has an expected value equal to the parameter. 'Error' often refers to the residual of an observation. Colloquially, these terms often carry different, less precise meanings.\n")

# Using standard errors for hypothesis testing
cat("\nUsing Standard Errors:\n")
cat("Standard errors measure the variability of an estimate. They are used to construct confidence intervals and to perform hypothesis tests, such as testing if a coefficient is significantly different from zero.\n")
print(summary(ols_model)$coefficients[, "Std. Error"])

# Model fit quantification
cat("\nModel Fit Quantification:\n")
cat(sprintf("R-squared: %f. This statistic measures the proportion of variance in the dependent variable that is predictable from the independent variables.\n", model_summary$r.squared))
cat(sprintf("Adjusted R-squared: %f. Adjusts the R-squared for the number of predictors in the model, providing a more accurate measure when comparing models with different numbers of predictors.\n", model_summary$adj.r.squared))

Analysis of Factors Affecting Used Car Prices Based on the Multiple Linear Regression Model

Influence of Predictors

Various predicting factors such as the location, fuel type, transmission, owner type, mileage, engine, power, seats, the age of the car, and brand class have significant effects on the price of a used car. This is indicated by several predictor variables having significant t-values and small p-values.

Key Determinants

The variables with the most impact on the pricing seem to be the brand class, the power of the car, and the age of the car, considering their estimates and the corresponding p-values.

Brand and Model Impact

The name of a car, which includes the brand and model, does seem to influence its price, as seen in the significant coefficients for different brand classes.

Transmission Type

Transmission type has a significant effect on the pricing of a car, with manual transmission having a negative coefficient, indicating that, all else being equal, cars with manual transmission are less expensive than those with automatic transmission.

Geographic Influence

The model suggests a geographical influence on used car pricing. Different locations like Bangalore, Delhi, Hyderabad, Jaipur, Kolkata, Mumbai, and Pune have significant coefficients, indicating that location does affect car pricing.

Usage and Age

There is a negative association between the age of the car (Ageofcar) and its price, holding the other predictors constant, as indicated by a highly significant negative coefficient. The Kilometers_Driven variable itself is not significant, but its log transformation is, suggesting a non-linear relationship between distance driven and price.

Performance Attributes

Mileage has a negative influence on price, whereas Engine and Power both have positive influences on the car’s pricing, indicating that higher performance attributes relate to higher prices.

Features and Specifications

The number of seats has a positive influence on the price, but not very strongly. Fuel type does have an influence; particularly, electric cars are significantly more expensive than other types, as indicated by their coefficients.

Residuals:

The residuals range from -3.8375 to 1.7364, with a median close to zero (0.0210). A median this close to zero suggests that the residuals are roughly centered and that the model does not systematically over- or under-predict for a typical car. However, the large minimum and maximum residuals could indicate outliers or model misspecification.

Coefficients: Most variables have a statistically significant association with the dependent variable (Price_log), as indicated by their low p-values (less than 0.05). The Location coefficients represent the average difference in Price_log for a given location compared to the baseline location (which does not appear in the output because of dummy variable coding). For instance, LocationBangalore has a positive coefficient, indicating higher prices on average in Bangalore compared to the baseline. miles_Driven has a very small positive coefficient, suggesting a negligible effect on `Price

Fuel_Type coefficients indicate the difference in Price_log between the reference fuel type category and other types. For example, Fuel_TypeElectric has a large positive coefficient, indicating that electric cars tend to be priced higher than the reference category (probably petrol). The significant negative coefficient for TransmissionManual indicates that manual cars are, on average, priced lower than automatic ones.
Ageofcar has a significant negative coefficient, showing that as cars age, their price tends to decrease.
The presence of “NA” for some coefficients suggests there are singularities or multicollinearity in the model, which means some predictor variables are linearly dependent on others.
Residual standard error (RSE): The RSE is 0.2809 on the log-price scale. It estimates the typical amount by which the observed values deviate from the values predicted by the model, so smaller values indicate a tighter fit.
Multiple R-squared and Adjusted R-squared: The Multiple R-squared value is 0.8961, which means that approximately 89.61% of the variance in Price_log can be explained by the model. It’s a high value, suggesting a good fit. The Adjusted R-squared is 0.8954, slightly lower than the Multiple R-squared, adjusting for the number of predictors in the model, which confirms that the model explains a lot of the variability in the data while considering the number of variables.
F-statistic: The F-statistic is very large (1214) with a p-value of less than 2.2e-16, indicating that the model is statistically significant; the included predictors do provide explanatory power in predicting the log of the car prices.
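
The fit statistics listed above can be reproduced by hand from the residuals. The sketch below is only an illustration on simulated data (the variable names `age`, `power`, and `price_log` are hypothetical stand-ins, not the project's dataset); it shows how R-squared, adjusted R-squared, the RSE, and the F-statistic relate to the residual and total sums of squares, and checks them against `summary()`.

```r
# Simulated data only; the variable names are hypothetical stand-ins.
set.seed(1)
n <- 200
age   <- runif(n, 1, 15)
power <- runif(n, 50, 300)
price_log <- 2 - 0.10 * age + 0.005 * power + rnorm(n, sd = 0.3)
toy <- data.frame(price_log, age, power)

fit <- lm(price_log ~ age + power, data = toy)
p <- length(coef(fit)) - 1                            # predictors, excluding the intercept

rss <- sum(residuals(fit)^2)                          # residual sum of squares
tss <- sum((toy$price_log - mean(toy$price_log))^2)   # total sum of squares

r2     <- 1 - rss / tss                               # Multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)        # Adjusted R-squared
rse    <- sqrt(rss / (n - p - 1))                     # Residual standard error
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))         # Overall F-statistic

# Each comparison should return TRUE
s <- summary(fit)
all.equal(r2, s$r.squared)
all.equal(adj_r2, s$adj.r.squared)
all.equal(rse, s$sigma)
all.equal(f_stat, unname(s$fstatistic["value"]))
```

Because the adjustment factor (n - 1) / (n - p - 1) is close to 1 when there are many more observations than predictors, the two R-squared values stay close together, as they do in the report (0.8961 versus 0.8954).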

Model Performance:

Overall Fit: The model explains a significant portion of the variability in the log-transformed car prices (R-squared of 0.8961). This implies that the predictors included in the model are effective in estimating car prices.
Predictive Accuracy: Individual residuals indicate that while some predictions are very accurate (median residual close to zero), there are cases of overestimation and underestimation (residuals range from -3.8375 to 1.7364), highlighting that the model may not capture all the variability perfectly for every single car.
Location Influence: Geographic Variation: The coefficients for location indicate that the city where a car is sold has a significant impact on its price. Cars in Bangalore, for instance, are priced higher, while those in Kolkata are priced lower relative to the baseline city (not mentioned in the output).
Car Attributes: Fuel Type: Electric cars command a premium (significant positive coefficient), while traditional fuel types like diesel and petrol show expected but varying degrees of influence on prices. Transmission: Manual cars are less expensive than automatic cars, as indicated by the significant negative coefficient for manual transmission.
Ownership and Car Condition: Previous Ownership: Cars that have been previously owned multiple times, particularly those that have had three or more previous owners, are valued lower. Age Factor: The age of the car is a strong predictor of its price, with older cars being less expensive. This is consistent with the general expectation that car values depreciate over time.
Technical Specifications: Performance Variables: Engine size and power have positive associations with the car price, suggesting that more powerful cars are more expensive. Physical Attributes: The number of seats also affects the car’s price, but the impact is relatively smaller compared to performance metrics.
Brand Perception: Brand Classification: The brand’s perceived class has a significant impact, with luxury (Brand_ClassLand) and specialty (Brand_ClassMini) brands commanding higher prices and low-end brands (Brand_ClassLow) associated with lower prices.
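
The baseline city is absent from the coefficient table because R's default treatment coding drops the first factor level and measures every other level against it. The sketch below illustrates this on a toy factor. The city labels are borrowed from the report, but the vector itself is made up, so the baseline here (Bangalore, the first level alphabetically in this toy vector) is not necessarily the baseline in the actual data.

```r
# Toy factor; the vector is made up for illustration only.
city <- factor(c("Kolkata", "Bangalore", "Mumbai", "Delhi", "Bangalore", "Kolkata"))

baseline <- levels(city)[1]                 # reference level: first level alphabetically
dummy_cols <- colnames(model.matrix(~ city))
baseline
dummy_cols                                  # no dummy column for the reference level

# Choose a different baseline explicitly with relevel()
city2 <- relevel(city, ref = "Mumbai")
dummy_cols2 <- colnames(model.matrix(~ city2))
dummy_cols2
```

Running `levels()` on the Location factor in the real data would confirm which city the reported coefficients are measured against.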

Standard Errors and Statistical Significance: Confidence in Estimates: The small standard errors for most coefficients, especially for critical variables like age, power, and brand class, give confidence in the reliability of these estimates.

Model Interpretation and Application:

Practical Insights: The model provides actionable insights for sellers to set prices and for buyers to negotiate prices based on quantifiable car attributes and market conditions.
Policy and Strategy: Dealerships and online marketplaces can use this model to develop pricing strategies that consider location, car attributes, and brand perception.
Future Investigations: The coefficients left undefined (NA) because of singularities suggest that some variables are redundant, which needs to be addressed in future model iterations.
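
One way to track down the redundant variables is sketched below on simulated data. The column `combo` is a hypothetical predictor built as an exact linear combination of two others, which is the situation that makes `lm()` report an NA coefficient. The `alias()` function from base R's stats package then shows which columns the dropped term depends on.

```r
# Simulated data; `combo` is a hypothetical redundant predictor.
set.seed(2)
n <- 100
engine <- rnorm(n, 1500, 300)
power  <- rnorm(n, 100, 20)
combo  <- 2 * engine + 3 * power            # exact linear combination of the two above
y <- 1 + 0.001 * engine + 0.01 * power + rnorm(n, sd = 0.2)

fit <- lm(y ~ engine + power + combo)
coef(fit)                                   # the coefficient for `combo` is NA

# alias() reports the linear dependency behind the NA
alias(fit)$Complete
```

Applying `alias()` to the fitted MLR model would be a reasonable first step before deciding which of the redundant columns to drop.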

Hypothesis Testing

TransmissionManual: Estimate = -0.2371, Std. Error = 0.01349, t value = -17.580, p-value < 2e-16

Hypothesis Test
H0: the coefficient of TransmissionManual is equal to zero
HA: the coefficient of TransmissionManual is not equal to zero

Implications of the Result: Reject H0. Because the p-value is far below 0.05, I reject the null hypothesis that the coefficient of TransmissionManual is zero. This means there is a statistically significant effect of the transmission type (manual vs. automatic) on the log-transformed price of cars. Effect of Manual Transmission: Given that the coefficient for TransmissionManual is negative (see the model summary), this result indicates that manual transmission vehicles are priced significantly lower than automatic vehicles, holding other factors constant.
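
The test can be checked by hand from the reported estimate and standard error. The sketch below makes two assumptions that are not stated in the output: it uses a normal approximation in place of the t distribution (reasonable only if the residual degrees of freedom are large), and it treats Price_log as a natural log, so that exp(coefficient) - 1 can be read as a relative change in price.

```r
# Values reported in the model summary for TransmissionManual
estimate  <- -0.2371
std_error <- 0.01349

# Test statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value, normal approximation (assumes large residual df)
p_value <- 2 * pnorm(-abs(t_value))

# Approximate 95% confidence interval on the log-price scale
ci_95 <- estimate + c(-1, 1) * qnorm(0.975) * std_error

# Relative price difference, assuming Price_log is a natural log
pct_effect <- 100 * (exp(estimate) - 1)
pct_ci     <- 100 * (exp(ci_95) - 1)

t_value
p_value
ci_95
pct_effect
pct_ci
```

Under these assumptions the coefficient corresponds to manual cars being priced roughly 21% lower than comparable automatic cars, and the confidence interval stays well below zero.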

Conclusion based on the MLR Model

Embarking on the journey of car valuation, our model weaves a narrative that guides sellers and buyers through the bustling lanes of the used car market. At the core of our story, the log-transformed price of cars, a reflection of a vehicle’s worth, is intricately influenced by its features and history. From the bustling streets of Bangalore, where the allure of luxury sedans and electric vehicles drives up prices, to the more modest valuations in Kolkata, location emerges as a pivotal character in our tale. Each car tells its own story: the sleek charm of newer models, the reliable hum of a diesel engine, the classic simplicity of a manual transmission. Our model, with a commendable R-squared of 0.896, acts as a seasoned storyteller, deciphering the complex interplay between a car’s attributes and its market value, while accounting for the nuances that make each car unique. Through the lens of our analysis, a landscape unfolds where the past ownership, brand prestige, and even the number of seats contribute their verses to the epic of car pricing, offering a crystal ball to gaze into the market’s soul, understanding and predicting the ebbs and flows of car prices with statistical precision.

The model may be overfitting, and the NA coefficients point to multicollinearity among the predictors, so I next try ridge regression to address both issues.

Ridge Regression

Ridge regression is used here to reduce the impact of multicollinearity and to manage overfitting. In the chunk below, I fit a ridge regression model to the same training data (X_train and y_train) used for the MLR model, to understand how the various variables can be used to predict the price of cars. A range of lambda values is tried, and the best one is chosen by cross-validation. This helps illustrate basic concepts in statistical inference and model estimation.
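
For reference, the objective that glmnet minimizes for a Gaussian response with alpha = 0, as given in the package documentation, can be written as follows. Here n is the number of training observations, x_i is the predictor vector for car i, and lambda is the regularization strength.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
\;+\; \frac{\lambda}{2}\,\lVert \beta \rVert_2^2
```

The intercept is not penalized, and glmnet standardizes the predictors by default (standardize = TRUE), so the penalty treats all coefficients on a common scale. As lambda approaches zero the solution approaches the least squares fit, and larger values shrink the coefficients toward zero.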

# Install and load glmnet package if not already installed
if (!require(glmnet)) install.packages("glmnet")
library(glmnet)

# Assuming that `encoded_cars_copy` and `cars_copy` are already loaded and preprocessed

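# Prepare matrix for glmnet (which requires a numeric matrix input for predictors)
# Note: as.matrix() only yields a numeric matrix if every column of X_train is already numeric (e.g. dummy-encoded)
X_matrix <- as.matrix(X_train)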

# Alpha = 0 for ridge regression
# Note: glmnet uses lambda for regularization strength. Often, a sequence of lambda values is used to find the optimal.
lambda_values <- 10^seq(10, -2, length = 100)  # This is a range of lambda values
ridge_model <- glmnet(X_matrix, y_train, alpha = 0, lambda = lambda_values)

# Cross-validation to find optimal lambda
set.seed(42)  # For reproducibility in cross-validation
cv_ridge <- cv.glmnet(X_matrix, y_train, alpha = 0, lambda = lambda_values)

# Find the lambda value that minimizes the cross-validation error
best_lambda <- cv_ridge$lambda.min

# Fit the final model using the selected lambda
final_ridge_model <- glmnet(X_matrix, y_train, alpha = 0, lambda = best_lambda)

# Viewing the coefficients of the final ridge regression model
coef(final_ridge_model)

# Plot the cross-validated error as a function of lambda
plot(cv_ridge)

cat("\nSelected Lambda for Ridge Regression:", best_lambda, "\n")

Residuals: The residual plot shows a random pattern around the zero line, which suggests that the model’s predictions are unbiased, and there is no apparent heteroscedasticity or non-linearity. This is a good sign and indicates that the model fits well with the data.

Insights from the model

The refined lens of ridge regression reveals a nuanced tapestry of the used car market, where each thread—the car’s age, make, and power, intertwined with its brand prestige and geographic story—plays a pivotal role in the pricing saga. Older vehicles and those from Kolkata wear the cloak of modesty, reflected in lower price logs, while the electric engines, luxury brands like Land and Mini, and the vigor of high horsepower emerge as the champions of value, commanding higher prices. The model’s careful calibration through an optimal lambda tames the complexity of multicollinearity without muting the individual contributions of each predictor, offering a balanced narrative that stands robust against the risks of overfitting. This statistical odyssey, underpinned by the sagacity of ridge regression, not only demystifies the hidden patterns within the automotive bazaar but also arms stakeholders with the foresight to navigate the intricate dynamics of car valuation.

Comparative Insight into MLR and Ridge Regression Models

The Multiple Linear Regression (MLR) model demonstrated a high ability to explain variability in car prices, with an R-squared value of 0.8961, indicating that about 89.61% of the variance in log-transformed car prices is predictable from the model’s inputs. This model provided significant insights into the impact of factors such as brand class, power, and car age, confirming expected trends like the depreciation effect of age on car prices and the premium pricing of electric cars.
In contrast, the Ridge Regression model was introduced as a refinement to address potential multicollinearity and overfitting issues inherent in the MLR model. By employing regularization, the Ridge Regression adjusted the model to balance bias and variance, thus enhancing prediction accuracy and model generalizability. This model not only confirmed many findings from the MLR model but also emphasized the robustness required when dealing with highly interrelated predictors.
Through these analyses, this portfolio not only captures the quantitative metrics of car valuation but also interprets these findings within the broader context of market dynamics and economic implications. The comprehensive approach here bridges theoretical statistical methods and practical applications, offering a holistic view of the used car market that supports both strategic decisions and policy-making.
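
A comparison like the one described above is usually made on held-out data. The sketch below shows one way to do that with RMSE and R-squared, using simulated data with two nearly collinear predictors. It is an illustration under stated assumptions, not the analysis performed in this portfolio: the data, the 70/30 split, and the helper functions `rmse()` and `r2()` are all hypothetical.

```r
library(glmnet)

# Simulated data with two nearly collinear predictors (illustration only)
set.seed(42)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.05)
colnames(X) <- paste0("x", 1:p)
y <- as.numeric(X %*% c(1, 1, rep(0.5, p - 2)) + rnorm(n))

# 70/30 train/test split
train <- sample(n, size = 0.7 * n)
X_tr <- X[train, ];  y_tr <- y[train]
X_te <- X[-train, ]; y_te <- y[-train]

# Hypothetical helper functions for the two comparison criteria
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Ordinary least squares
ols <- lm(y ~ ., data = data.frame(y = y_tr, X_tr))
pred_ols <- predict(ols, newdata = data.frame(X_te))

# Ridge regression with lambda chosen by cross-validation
cv_fit <- cv.glmnet(X_tr, y_tr, alpha = 0)
pred_ridge <- as.numeric(predict(cv_fit, newx = X_te, s = "lambda.min"))

# Side-by-side test-set metrics
results <- data.frame(
  model = c("OLS", "Ridge"),
  rmse  = c(rmse(y_te, pred_ols), rmse(y_te, pred_ridge)),
  r2    = c(r2(y_te, pred_ols), r2(y_te, pred_ridge))
)
results
```

Whichever model has the lower test RMSE on the real data would be the better supported choice. Ridge is not guaranteed to win, particularly when the sample is large relative to the number of predictors.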

Results and Conclusion

I successfully integrated ridge regression into my analysis to address potential multicollinearity and manage overfitting in the dataset concerning used car prices. By using the glmnet package in R, I prepared a matrix of predictors and implemented ridge regression with a range of lambda values to determine the optimal model through cross-validation. This methodological refinement ensured that my model was robust and generalized well across different data scenarios, preventing the overfitting that often plagues complex models with many predictors.
The outcome of the ridge regression provided insightful and reliable interpretations of how various factors like the car’s age, brand, and geographic location impact its price. Particularly, the model highlighted the depreciation effect associated with age and the premium added by electric vehicles and luxury brands. The optimal lambda value, identified through cross-validation, effectively balanced bias and variance, thereby enhancing the model’s prediction accuracy and reliability.
Comparing the multiple linear regression (MLR) model and the ridge regression model, the latter proved superior in terms of handling multicollinearity, which is common with the extensive variables in our dataset. While the MLR model had an R-squared value of 0.8961, indicating strong explanatory power, it was susceptible to overfitting. On the other hand, ridge regression, by incorporating regularization, offered a more nuanced understanding and robustness, confirming its suitability for predicting used car prices in scenarios with complex and interrelated predictors.
In conclusion, my analysis using both MLR and ridge regression techniques not only quantified the impact of various car attributes on pricing but also provided a comprehensive framework that can be applied practically for pricing strategies in the used car market. This approach, grounded in statistical rigor and practical applicability, ensures that stakeholders can make informed decisions based on robust data-driven insights.

Project Portfolio

Hyreen Alice

2024-04-21