US Car Price Data

I took this dataset from https://vincentarelbundock.github.io/Rdatasets/datasets.html. I wanted to analyze price vs gpm100 (the number of gallons required to travel 100 miles) and price vs mileage. My assumption is that the higher the mileage, the higher the price, and I also want to find out whether there is a relationship between price and mileage. The dataset has the fields below.

Type:-Type of car, e.g. Sporty, Van, Compact

Min.Price:-Price for a basic model

Price:-Price for a mid-range model

Max.Price:-Price for a ‘premium’ model

Range.Price:-Difference between Max.Price and Min.Price

RoughRange:-Range.Price plus some N(0,.0001) noise

gpm100:-The number of gallons required to travel 100 miles

MPG.city:-Average number of miles per gallon for city driving

MPG.highway:-Average number of miles per gallon for highway driving
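
The two derived fields above can be checked with a little arithmetic. The sketch below uses the first six rows copied from the head(car_price) output in this document to confirm that Range.Price is Max.Price minus Min.Price, and shows the generic conversion from miles per gallon to gallons per 100 miles. The helper name mpg_to_gpm100 is hypothetical, and the source does not say which MPG figure gpm100 was computed from, so the conversion is illustrated on made-up values only.

```r
# First six rows, copied from the head(car_price) output in this document.
cars6 <- data.frame(
  Min.Price   = c(14.2, 19.9, 22.6, 26.3, 33.0, 37.5),
  Max.Price   = c(17.3, 21.7, 24.9, 26.3, 36.3, 42.7),
  Range.Price = c(3.1, 1.8, 2.3, 0.0, 3.3, 5.2)
)

# Range.Price should equal Max.Price - Min.Price (up to floating-point error).
range_check <- cars6$Max.Price - cars6$Min.Price
range_ok <- isTRUE(all.equal(range_check, cars6$Range.Price))
print(range_ok)

# Hypothetical helper: gallons needed to travel 100 miles at a given MPG.
mpg_to_gpm100 <- function(mpg) 100 / mpg

# Illustrative MPG values, not taken from the dataset.
print(mpg_to_gpm100(c(20, 25, 40)))
```

Because gpm100 is the reciprocal of mileage, a car with higher MPG has a lower gpm100, so the two measures move in opposite directions.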

Read Data

library(knitr)  # provides kable() for printing tables

url <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv"

# Read the CSV directly from GitHub; keep Type as character rather than factor
car_price <- read.csv(file = url, header = TRUE, stringsAsFactors = FALSE)

kable(head(car_price))
 X  Type     Min.Price  Price  Max.Price  Range.Price  RoughRange  gpm100  MPG.city  MPG.highway
 6  Midsize       14.2   15.7       17.3          3.1        3.09     3.8        22           31
 7  Large         19.9   20.8       21.7          1.8        1.79     4.2        19           28
 8  Large         22.6   23.7       24.9          2.3        2.31     4.9        16           25
 9  Midsize       26.3   26.3       26.3          0.0       -0.01     4.3        19           27
10  Large         33.0   34.7       36.3          3.3        3.30     4.9        16           25
11  Midsize       37.5   40.1       42.7          5.2        5.18     4.9        16           25
summary(car_price)
##        X             Type             Min.Price         Price      
##  Min.   : 6.00   Length:48          Min.   : 6.90   Min.   : 7.40  
##  1st Qu.:17.75   Class :character   1st Qu.:11.40   1st Qu.:13.47  
##  Median :29.50   Mode  :character   Median :14.50   Median :16.30  
##  Mean   :36.54                      Mean   :16.54   Mean   :18.57  
##  3rd Qu.:60.25                      3rd Qu.:19.43   3rd Qu.:20.73  
##  Max.   :79.00                      Max.   :37.50   Max.   :40.10  
##    Max.Price      Range.Price       RoughRange         gpm100     
##  Min.   : 7.90   Min.   : 0.000   Min.   :-0.020   Min.   :2.800  
##  1st Qu.:14.97   1st Qu.: 1.700   1st Qu.: 1.705   1st Qu.:3.800  
##  Median :18.40   Median : 3.300   Median : 3.305   Median :4.200  
##  Mean   :20.63   Mean   : 4.092   Mean   : 4.089   Mean   :4.167  
##  3rd Qu.:24.50   3rd Qu.: 5.850   3rd Qu.: 5.853   3rd Qu.:4.550  
##  Max.   :42.70   Max.   :14.600   Max.   :14.600   Max.   :5.700  
##     MPG.city      MPG.highway   
##  Min.   :15.00   Min.   :20.00  
##  1st Qu.:18.00   1st Qu.:26.00  
##  Median :20.00   Median :28.00  
##  Mean   :20.96   Mean   :28.15  
##  3rd Qu.:23.00   3rd Qu.:30.00  
##  Max.   :31.00   Max.   :41.00

Visualization

# price vs gpm100
price_lm <- lm(car_price$gpm100 ~ car_price$Price )

# Price goes on the x-axis to match the model (gpm100 ~ Price), so abline() draws the fitted line
plot(car_price$Price, car_price$gpm100)
abline(price_lm)

summary(price_lm)
## 
## Call:
## lm(formula = car_price$gpm100 ~ car_price$Price)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88981 -0.33553 -0.07511  0.19858  1.63161 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.241501   0.196492    16.5  < 2e-16 ***
## car_price$Price 0.049813   0.009766     5.1 6.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5234 on 46 degrees of freedom
## Multiple R-squared:  0.3612, Adjusted R-squared:  0.3474 
## F-statistic: 26.01 on 1 and 46 DF,  p-value: 6.271e-06
gpm_res <- price_lm$residuals

hist(gpm_res, breaks = 11)

qqnorm(price_lm$residuals)
qqline(price_lm$residuals)
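
To make the fitted line concrete, the sketch below plugs a few prices into the coefficients reported in the summary(price_lm) output above. The three prices are hypothetical values chosen for illustration, expressed in the same units as the Price column.

```r
# Coefficients copied from the summary(price_lm) output above.
intercept <- 3.241501
slope     <- 0.049813

# Hypothetical prices, in the same units as the Price column.
example_price <- c(10, 20, 30)

# Predicted gallons per 100 miles: intercept + slope * Price.
predicted_gpm100 <- intercept + slope * example_price
print(round(predicted_gpm100, 2))
```

Each additional unit of Price is associated with about 0.05 more gallons per 100 miles, so the fitted line rises only gently across the price range in the data.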

# price vs mpg
mpg_lm <- lm(car_price$MPG.highway ~ car_price$Price )

# Price goes on the x-axis to match the model (MPG.highway ~ Price)
plot(car_price$Price, car_price$MPG.highway, col='Blue')
abline(mpg_lm, col='Red')

summary(mpg_lm)
## 
## Call:
## lm(formula = car_price$MPG.highway ~ car_price$Price)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.689 -1.791  0.279  2.094 10.217 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     33.26140    1.34695  24.694  < 2e-16 ***
## car_price$Price -0.27543    0.06695  -4.114 0.000159 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.588 on 46 degrees of freedom
## Multiple R-squared:  0.269,  Adjusted R-squared:  0.2531 
## F-statistic: 16.93 on 1 and 46 DF,  p-value: 0.0001591
mpg_res <- mpg_lm$residuals

hist(mpg_res, breaks = 11)

qqnorm(mpg_lm$residuals)
qqline(mpg_lm$residuals)
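
The histogram and Q-Q plot are visual checks. As a numerical complement, one option is the Shapiro-Wilk test from base R (shapiro.test), which the original analysis does not use. The sketch below is self-contained, so it fits the model on only the six rows shown in the head(car_price) output. With so few rows the result says little, and on the full data the same call would be shapiro.test(residuals(mpg_lm)).

```r
# Six rows copied from the head(car_price) output; a stand-in for the full data.
cars6 <- data.frame(
  Price       = c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  MPG.highway = c(31, 28, 25, 27, 25, 25)
)

# Same model form as mpg_lm, written with a formula and a data argument.
fit6 <- lm(MPG.highway ~ Price, data = cars6)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(residuals(fit6))
print(sw)
```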

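Summary: The residual histograms and Q-Q plots for both models appear roughly normal. Both slopes are statistically significant, but R-squared is low (0.36 for gpm100 and 0.27 for MPG.highway), so price alone explains only part of the variation. The MPG.highway slope is negative (-0.28), so in this data higher-priced cars tend to have lower mileage, which does not support my initial assumption. Other combinations of attributes, such as the type of car together with the price range, still need to be analyzed.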
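
As a starting point for that follow-up, the type of car can be added as a second predictor. This is a minimal sketch and not part of the original analysis. It uses only the six rows from the head(car_price) output so that it runs on its own, and the name type_lm is hypothetical. On the full data the call would be lm(gpm100 ~ Price + Type, data = car_price).

```r
# Six rows copied from the head(car_price) output; a stand-in for the full data.
cars6 <- data.frame(
  Type   = c("Midsize", "Large", "Large", "Midsize", "Large", "Midsize"),
  Price  = c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  gpm100 = c(3.8, 4.2, 4.9, 4.3, 4.9, 4.9)
)

# lm() treats the character column Type as a factor; "Large" is the baseline level.
type_lm <- lm(gpm100 ~ Price + Type, data = cars6)
print(coef(type_lm))
```

The TypeMidsize coefficient is the estimated difference in gpm100 between Midsize and Large cars at the same price. With six rows it is only illustrative.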