Introduction & Motivation

The Automobile Dataset from the UCI Machine Learning Repository consists of 205 instances with 26 attributes, providing detailed information about various cars, including their specifications and features. We want to study how the price of a car varies with its features. This matters because understanding the factors that influence car prices supports better-informed decisions. For consumers, insights into how features like make, fuel type, horsepower, engine size, and highway MPG affect price can guide purchasing decisions and help them choose vehicles that fit their needs and budgets. By applying statistical analysis and visualization techniques to the Automobile Dataset, we aim to uncover relationships and patterns between car prices and their associated features.

Dataset & Wrangling

The data used in this report were imported from the UCI Automobile Dataset.

Dropping observations with missing values without further investigation can discard valuable data, so it should be done with caution. In some cases, however, removing such observations is necessary to carry out the statistical analysis. Here, removing the observations with missing values reduces the dataset by 5 data points. Because missing values complicate the calculation of correlation coefficients and the fitting of regression models, and given the nature of the dataset and our research goals, we decided to drop the observations with missing data.
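The row-dropping step described above can be sketched as follows. This is a minimal illustration, not the code used to produce this report: the three sample rows are made up for the example, the column names are hypothetical, and the `?` missing-value marker is an assumption based on the convention of the raw UCI file.

```python
# Hypothetical rows for illustration only; they are NOT taken from the dataset.
# "?" is assumed to mark a missing value, as in the raw UCI file.
MISSING = "?"

rows = [
    {"price": "10000", "horsepower": "100", "engine_size": "120",
     "highway_mpg": "30", "drive_wheels": "fwd"},
    {"price": "?", "horsepower": "110", "engine_size": "130",
     "highway_mpg": "28", "drive_wheels": "rwd"},
    {"price": "20000", "horsepower": "?", "engine_size": "150",
     "highway_mpg": "25", "drive_wheels": "rwd"},
]

def drop_incomplete(records, columns):
    """Keep only the records that have a value in every listed column."""
    return [r for r in records if all(r[c] != MISSING for c in columns)]

# Only the variables used in the analysis decide whether a row is kept.
model_columns = ["price", "horsepower", "engine_size",
                 "highway_mpg", "drive_wheels"]
complete = drop_incomplete(rows, model_columns)
print(f"dropped {len(rows) - len(complete)} of {len(rows)} rows")
```

Restricting the check to the model's own columns avoids discarding rows that are missing only in attributes the analysis never uses.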

Variables in the Analysis

  • Response Variable:
    • Price
  • Predictor Variables:
    • Categorical Variables:
      • Drive Wheel Type
    • Quantitative Variables:
      • Horsepower
      • Engine Size
      • Highway MPG

Exploratory Data Analysis (EDA)

Univariate EDA

Response Variable: Price

The price distribution has a mean of $13,207 and a median of $10,295, indicating that most prices are below the average, with a few higher values pulling the mean up. The middle 50% of car prices in the dataset fall between $7,775 and $16,500. The distribution is unimodal and right-skewed, meaning there are more lower-priced cars, with a few outliers at the higher end that are significantly above the rest of the data. These outliers contribute to the skewness and to the mean being higher than the median.

##   min   Q1 median    Q3   max  mean     sd   n missing
##  5118 7775  10295 16500 45400 13207 7947.1 201       0
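The claim about high-end outliers can be checked against the summary above with the common 1.5 × IQR boxplot rule. The sketch below assumes that rule (the exact fence calculation in the plotting software may differ slightly), and the function name `tukey_fences` is our own.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR boxplot rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Values copied from the price summary above.
price = {"min": 5118, "q1": 7775, "q3": 16500, "max": 45400}

lower, upper = tukey_fences(price["q1"], price["q3"])
has_low_outliers = price["min"] < lower    # any point below the lower fence?
has_high_outliers = price["max"] > upper   # any point above the upper fence?

print(f"fences: ({lower}, {upper})")
print(f"low outliers: {has_low_outliers}, high outliers: {has_high_outliers}")
```

Because the maximum lies above the upper fence while the minimum lies above the lower fence, the rule flags outliers only on the expensive side, consistent with the right skew described above.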

Categorical Predictor: Drive Wheels

The distribution of drive wheel types shows that front-wheel drive (fwd) cars are the most common, with 120 vehicles. Rear-wheel drive (rwd) cars follow with 76 vehicles. Four-wheel drive (4wd) cars are the least frequent, with only 9 vehicles. Front-wheel drive is therefore the dominant category and four-wheel drive is rare, so the three groups are very unequal in size.

## 
## 4wd fwd rwd 
##   9 120  76
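To express the counts above as shares of the dataset, a short sketch (the variable names are our own):

```python
# Counts copied from the frequency table above.
counts = {"4wd": 9, "fwd": 120, "rwd": 76}

total = sum(counts.values())
shares = {drive: n / total for drive, n in counts.items()}

# Print the categories from most to least common.
for drive, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{drive}: {counts[drive]} cars ({share:.1%})")
```

The small 4wd share is worth keeping in mind later, since any estimate for that group rests on very few cars.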

Quantitative Predictors

  1. Horsepower

The horsepower variable has a unimodal distribution, with the middle 50% of values between 70 and 116 hp. The mean is 104.26 hp and the median is 95 hp, indicating a slight right skew. The distribution has a minimum of 48 hp and a maximum of 288 hp, with a standard deviation of 39.7 hp, reflecting moderate variability. Four high-horsepower outliers pull the distribution to the right, as shown in the box-and-whisker plot.

##  min Q1 median  Q3 max   mean     sd   n missing
##   48 70     95 116 288 104.26 39.714 203       2

  2. Engine Size

The distribution of the engine size variable is right-skewed and unimodal. Based on the box plot, the center (median) is 120 cubic inches and the spread (IQR) is 44 cubic inches. The distribution has a minimum of 61 cubic inches and a maximum of 326 cubic inches. There are also 6 outliers (unusual points) on the upper end of the box plot.

##  min Q1 median  Q3 max   mean     sd   n missing
##   61 97    120 141 326 126.91 41.643 205       0

  3. Highway MPG

The highway MPG distribution appears to be unimodal and very slightly right-skewed. The median is 30 mpg. The middle 50% of values lie between 25 mpg and 34 mpg, with a minimum of 16 mpg and a maximum of 54 mpg. Based on the boxplot, there are 3 outliers in the upper tail of the distribution.

##  min Q1 median Q3 max   mean     sd   n missing
##   16 25     30 34  54 30.751 6.8864 205       0
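The skew descriptions for the three predictors rest on comparing each mean with its median. One way to put those comparisons on a common scale is Pearson's second skewness coefficient, 3 × (mean − median) / sd. This is a rough summary we add for illustration, not a statistic reported by the original analysis.

```python
def median_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd

# (mean, median, sd) copied from the three summaries above.
summaries = {
    "horsepower": (104.26, 95, 39.714),
    "engine_size": (126.91, 120, 41.643),
    "highway_mpg": (30.751, 30, 6.8864),
}

skew = {name: median_skewness(*stats) for name, stats in summaries.items()}

for name, value in skew.items():
    direction = "right" if value > 0 else "left" if value < 0 else "none"
    print(f"{name}: {value:.2f} ({direction} skew)")
```

All three coefficients are positive, and highway MPG has the smallest, which agrees with describing it as only very slightly right-skewed.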

Multi-variable EDA

Quantitative

  • Horsepower vs. Price

The scatterplot shows a positive association between a car's horsepower and its price: cars with higher horsepower tend to have higher prices. The scatterplot also distinguishes cars by the categorical variable drive_wheel, which records the car's drive wheel type (4wd, fwd, rwd). The fwd and 4wd cars tend to have lower horsepower than rwd cars and, correspondingly, tend to be cheaper.

  • Engine Size vs. Price

The scatterplot shows a positive association between engine size and price: as engine size increases, price also tends to increase. The scatterplot also distinguishes cars by the categorical variable drive_wheel, which records the car's drive wheel type (4wd, fwd, rwd). Cars with rwd tend to have larger engines and higher prices, while cars with fwd generally have smaller engines and lower prices.

  • Highway MPG vs. Price

Highway miles per gallon (MPG) measures how many miles a vehicle can travel on a gallon of gas while driving on the highway. Based on the scatterplot, there is a negative relationship between log(price) and highway MPG: cars with a higher log(price) tend to have lower highway MPG, and cars with a lower log(price) tend to have higher highway MPG. Broken down by drive wheel type, front-wheel drive (fwd) cars tend to have higher highway MPG and lower log(price), while rear-wheel drive (rwd) cars tend to have lower MPG and higher log(price). Four-wheel drive (4wd) cars fall in the middle, with intermediate prices and highway MPG.

Categorical

  1. Drive Wheels Type

Based on the boxplot and numerical summary of price by drive wheel type, we observe distinct patterns. Four-wheel drive (4wd) cars have the narrowest price range, from $7,603 to $17,450, with the middle 50% between about $7,984 and $11,368 and a median of about $9,005. Only 8 cars fall in this group, so its summary is the least reliable. Front-wheel drive (fwd) cars span a wider range, from $5,118 to $23,875, with a median of $8,192, the lowest of the three groups. Rear-wheel drive (rwd) cars have the broadest price variation, ranging from $6,785 to $45,400, with a median of $16,900. Rwd vehicles show several high-end outliers exceeding $40,000. Overall, rear-wheel drive cars tend to be more expensive and more variable in price, while front-wheel and four-wheel drive cars are generally lower-priced and more tightly clustered around their medians.

##   drive_wheels  min      Q1  median    Q3   max    mean     sd   n missing
## 1          4wd 7603  7984.2  9005.5 11368 17450 10241.0 3288.2   8       0
## 2          fwd 5118  6950.8  8192.0 10332 23875  9244.8 3345.9 118       0
## 3          rwd 6785 13455.0 16900.0 22548 45400 19757.6 9082.6  75       0
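The comparison of price variability across groups can be made concrete by computing the range and IQR of each group from the table above. The sketch below only restates numbers already in the table, and the dictionary layout is our own.

```python
# min, Q1, Q3, max copied from the grouped price summary above.
groups = {
    "4wd": {"min": 7603, "q1": 7984.2, "q3": 11368, "max": 17450},
    "fwd": {"min": 5118, "q1": 6950.8, "q3": 10332, "max": 23875},
    "rwd": {"min": 6785, "q1": 13455.0, "q3": 22548, "max": 45400},
}

# Range uses the extremes; IQR uses only the middle 50% of each group.
spread = {
    name: {"range": g["max"] - g["min"], "iqr": g["q3"] - g["q1"]}
    for name, g in groups.items()
}

for name, s in spread.items():
    print(f"{name}: range = {s['range']:.0f}, IQR = {s['iqr']:.1f}")
```

By IQR, the 4wd and fwd groups are almost identical, so the wider overall range of fwd prices comes mainly from its tails, while rwd is more spread out on both measures.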

Correlation Coefficients

##                price horsepower engine_size highway_mpg
## price        1.00000    0.81053     0.87234    -0.70469
## horsepower   0.81053    1.00000     0.82271    -0.80460
## engine_size  0.87234    0.82271     1.00000    -0.67957
## highway_mpg -0.70469   -0.80460    -0.67957     1.00000

Horsepower (0.81053): There is a strong positive correlation between price and horsepower, indicated by a correlation coefficient of 0.81. This suggests that as horsepower increases, the price of the cars tends to increase as well. In practical terms, cars with more powerful engines (higher horsepower) are generally more expensive.

Engine Size (0.87234): The correlation coefficient of 0.87 indicates an even stronger positive correlation between price and engine size. This means that larger engines are associated with higher prices. This relationship may stem from consumer preferences for cars with larger engines, which are often seen as more powerful or capable, thus driving up their market value.

Highway MPG (-0.70469): The correlation between price and highway miles per gallon (MPG) is -0.70, indicating a strong negative correlation. This suggests that as highway MPG increases (i.e., cars become more fuel-efficient), the price tends to decrease. This could imply that cars with better fuel efficiency are often lower-priced, possibly because they are smaller or less powerful vehicles that prioritize economy over performance.

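Multicollinearity is a potential concern because some predictors are strongly correlated with each other, notably horsepower with engine size (0.82) and horsepower with highway MPG (-0.80). Correlated predictors can inflate the standard errors of the regression coefficients and make individual coefficients harder to interpret once the model is fitted.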
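One way to quantify this concern is the variance inflation factor (VIF). For standardized predictors, the VIF of each predictor equals the matching diagonal entry of the inverse of the predictor correlation matrix. The sketch below applies that identity to the three quantitative predictors from the correlation table above. It is an illustration we add, not part of the original analysis, and it ignores the categorical drive wheel predictor; the helper names `invert` and `vifs` are our own.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(corr):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]

names = ["horsepower", "engine_size", "highway_mpg"]
# Predictor-only block of the correlation table (price row and column removed).
predictor_corr = [
    [1.00000, 0.82271, -0.80460],
    [0.82271, 1.00000, -0.67957],
    [-0.80460, -0.67957, 1.00000],
]

vif = dict(zip(names, vifs(predictor_corr)))
for name, value in vif.items():
    print(f"{name}: VIF = {value:.2f}")
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity, so the printed values give a first indication of how serious the overlap among these predictors is.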

Evaluation of MLR

In our analysis of car pricing, we are moving forward with multiple linear regression (MLR) because of the promising results of our exploratory data analysis (EDA). The EDA revealed strong relationships between the response variable, price, and several predictor variables: horsepower, engine size, highway miles per gallon (MPG), and drive wheel type.

Specifically, horsepower and engine size both show a strong positive correlation with price, indicating that vehicles with more power and larger engines tend to command higher prices. Highway MPG has an inverse relationship with price: more fuel-efficient vehicles tend to be priced lower, likely because they are smaller or less powerful cars that prioritize economy over performance. The categorical variable drive wheel type adds depth to our model. Rear-wheel drive (RWD) vehicles typically command the highest prices, which may reflect their use in sports and luxury cars where driving dynamics are a priority, while front-wheel drive (FWD) and four-wheel drive (4WD) vehicles generally have lower average prices.

Including these variables in our MLR model is expected to improve predictive accuracy and provide insight into the factors influencing car pricing. With MLR, we aim to capture the interplay between these predictors and their collective effect on vehicle prices, supporting better decision-making for stakeholders in the automotive market. We therefore expect MLR to provide a useful framework for understanding and forecasting car prices from key attributes, provided the correlations among predictors noted earlier do not destabilize the coefficient estimates.
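Before a model with a categorical predictor can be fitted, drive wheel type has to be turned into indicator (dummy) columns. The sketch below shows one possible encoding of a single car as a design-matrix row. The choice of 4wd as the reference level is an assumption (it is the first level alphabetically), and the function name `design_row` is hypothetical; no coefficients are estimated here.

```python
LEVELS = ("4wd", "fwd", "rwd")
BASELINE = "4wd"  # assumption: the first level alphabetically is the reference

def design_row(horsepower, engine_size, highway_mpg, drive_wheels):
    """Return [intercept, horsepower, engine_size, highway_mpg, is_fwd, is_rwd]."""
    if drive_wheels not in LEVELS:
        raise ValueError(f"unknown drive wheel type: {drive_wheels!r}")
    # One 0/1 column per non-baseline level; the baseline is all zeros.
    dummies = [1.0 if drive_wheels == level else 0.0
               for level in LEVELS if level != BASELINE]
    return [1.0, float(horsepower), float(engine_size),
            float(highway_mpg), *dummies]

# A car at the median of each quantitative predictor, once per drive type.
for drive in LEVELS:
    print(drive, design_row(95, 120, 30, drive))
```

With this encoding, the coefficients on the two indicator columns would be read as the expected price difference of fwd and rwd cars relative to 4wd cars, holding the quantitative predictors fixed.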
