I took this dataset from https://vincentarelbundock.github.io/Rdatasets/datasets.html. I was interested in analyzing price vs. gpm100 (the number of gallons required to travel 100 miles) and price vs. mileage. The dataset has the fields shown below. My assumption is that the higher the mileage, the higher the price will be; I also want to find out whether there is a relationship between price and mileage at all.
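Because gpm100 measures fuel consumption while MPG measures fuel economy, the two move in opposite directions. A minimal sketch of the conversion, assuming the usual definition (gallons per 100 miles = 100 / miles per gallon). The helper name `mpg_to_gpm100` is hypothetical, and the dataset's own gpm100 column may be derived differently, for example from a mix of city and highway figures.

```r
# Hypothetical helper: convert miles per gallon to gallons per 100 miles.
mpg_to_gpm100 <- function(mpg) {
  100 / mpg
}

# Higher MPG means fewer gallons per 100 miles:
# 20, 25 and 40 mpg give 5.0, 4.0 and 2.5 gallons per 100 miles.
mpg_to_gpm100(c(20, 25, 40))
```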
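library(knitr)  # for kable(); the XML package is not needed to read a CSV
# Read the carprice data (DAAG) from the Rdatasets repository
url <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv"
car_price <- read.csv(file = url, header = TRUE, stringsAsFactors = FALSE)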
kable(head(car_price))
X | Type | Min.Price | Price | Max.Price | Range.Price | RoughRange | gpm100 | MPG.city | MPG.highway |
---|---|---|---|---|---|---|---|---|---|
6 | Midsize | 14.2 | 15.7 | 17.3 | 3.1 | 3.09 | 3.8 | 22 | 31 |
7 | Large | 19.9 | 20.8 | 21.7 | 1.8 | 1.79 | 4.2 | 19 | 28 |
8 | Large | 22.6 | 23.7 | 24.9 | 2.3 | 2.31 | 4.9 | 16 | 25 |
9 | Midsize | 26.3 | 26.3 | 26.3 | 0.0 | -0.01 | 4.3 | 19 | 27 |
10 | Large | 33.0 | 34.7 | 36.3 | 3.3 | 3.30 | 4.9 | 16 | 25 |
11 | Midsize | 37.5 | 40.1 | 42.7 | 5.2 | 5.18 | 4.9 | 16 | 25 |
summary(car_price)
## X Type Min.Price Price
## Min. : 6.00 Length:48 Min. : 6.90 Min. : 7.40
## 1st Qu.:17.75 Class :character 1st Qu.:11.40 1st Qu.:13.47
## Median :29.50 Mode :character Median :14.50 Median :16.30
## Mean :36.54 Mean :16.54 Mean :18.57
## 3rd Qu.:60.25 3rd Qu.:19.43 3rd Qu.:20.73
## Max. :79.00 Max. :37.50 Max. :40.10
## Max.Price Range.Price RoughRange gpm100
## Min. : 7.90 Min. : 0.000 Min. :-0.020 Min. :2.800
## 1st Qu.:14.97 1st Qu.: 1.700 1st Qu.: 1.705 1st Qu.:3.800
## Median :18.40 Median : 3.300 Median : 3.305 Median :4.200
## Mean :20.63 Mean : 4.092 Mean : 4.089 Mean :4.167
## 3rd Qu.:24.50 3rd Qu.: 5.850 3rd Qu.: 5.853 3rd Qu.:4.550
## Max. :42.70 Max. :14.600 Max. :14.600 Max. :5.700
## MPG.city MPG.highway
## Min. :15.00 Min. :20.00
## 1st Qu.:18.00 1st Qu.:26.00
## Median :20.00 Median :28.00
## Mean :20.96 Mean :28.15
## 3rd Qu.:23.00 3rd Qu.:30.00
## Max. :31.00 Max. :41.00
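# price vs gpm100: gpm100 is the response, Price is the predictor
price_lm <- lm(car_price$gpm100 ~ car_price$Price)
# Put the predictor on x and the response on y so abline() matches the model
plot(car_price$Price, car_price$gpm100)
abline(price_lm)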
summary(price_lm)
##
## Call:
## lm(formula = car_price$gpm100 ~ car_price$Price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88981 -0.33553 -0.07511 0.19858 1.63161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.241501 0.196492 16.5 < 2e-16 ***
## car_price$Price 0.049813 0.009766 5.1 6.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5234 on 46 degrees of freedom
## Multiple R-squared: 0.3612, Adjusted R-squared: 0.3474
## F-statistic: 26.01 on 1 and 46 DF, p-value: 6.271e-06
gpm_res <- price_lm$residuals
hist(gpm_res, breaks = 11)
qqnorm(price_lm$residuals)
qqline(price_lm$residuals)
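To make the fitted coefficients concrete, the regression line can be evaluated by hand. This sketch only plugs the estimates printed by `summary(price_lm)` into the line equation. The price value of 20 is an arbitrary illustration, in whatever units the Price column uses.

```r
# Estimates copied from the summary(price_lm) output
intercept <- 3.241501
slope     <- 0.049813

# Arbitrary example price, chosen only for illustration
example_price <- 20

# Fitted gpm100 = intercept + slope * price
fitted_gpm100 <- intercept + slope * example_price
fitted_gpm100
```

With the model object in memory, `fitted(price_lm)` returns the same kind of values for the observed prices.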
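# price vs mpg: MPG.highway is the response, Price is the predictor
mpg_lm <- lm(car_price$MPG.highway ~ car_price$Price)
# Put the predictor on x and the response on y so abline() matches the model
plot(car_price$Price, car_price$MPG.highway, col = 'Blue')
abline(mpg_lm, col = 'Red')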
summary(mpg_lm)
##
## Call:
## lm(formula = car_price$MPG.highway ~ car_price$Price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.689 -1.791 0.279 2.094 10.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.26140 1.34695 24.694 < 2e-16 ***
## car_price$Price -0.27543 0.06695 -4.114 0.000159 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.588 on 46 degrees of freedom
## Multiple R-squared: 0.269, Adjusted R-squared: 0.2531
## F-statistic: 16.93 on 1 and 46 DF, p-value: 0.0001591
mpg_res <- mpg_lm$residuals
hist(mpg_res, breaks = 11)
qqnorm(mpg_lm$residuals)
qqline(mpg_lm$residuals)
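The histogram and Q-Q plot are visual checks. One possible complement is a Shapiro-Wilk test on the residuals. The sketch below is not part of the original analysis. It uses only the six rows printed by `head(car_price)`, typed in by hand so that it runs without the download, and the names `sample_cars` and `sample_lm` are hypothetical. Six points are far too few for a meaningful test, so this only shows the mechanics.

```r
# The six rows shown by head(car_price), entered manually
sample_cars <- data.frame(
  Price       = c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  MPG.highway = c(31, 28, 25, 27, 25, 25)
)

# Same model form as mpg_lm, fitted on the small sample
sample_lm <- lm(MPG.highway ~ Price, data = sample_cars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(resid(sample_lm))
sw$p.value
```

On the full data, the same call would be `shapiro.test(resid(mpg_lm))`.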
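Summary: In both models the histogram and Q-Q plot of the residuals suggest that they are approximately normally distributed. Both slopes are statistically significant (p < 0.001), but R-squared is low: price explains only about 36% of the variation in gpm100 and about 27% of the variation in highway MPG. The signs also run against my initial assumption: gpm100 rises with price and highway MPG falls with price, so the more expensive cars in this sample are the less fuel-efficient ones. Other combinations of attributes, such as the type of car together with the price range, still need to be analyzed.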
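As a starting point for that follow-up, car type can be added as a categorical predictor next to price. This is a sketch under stated assumptions rather than a result. It uses only the six rows printed by `head(car_price)`, entered by hand so it runs offline, and the names `sample_cars` and `type_lm` are hypothetical.

```r
# The six rows shown by head(car_price), entered manually
sample_cars <- data.frame(
  Type   = c("Midsize", "Large", "Large", "Midsize", "Large", "Midsize"),
  Price  = c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  gpm100 = c(3.8, 4.2, 4.9, 4.3, 4.9, 4.9)
)

# Add car type as a categorical predictor alongside price;
# lm() treats the character column as a factor with "Large" as the baseline
type_lm <- lm(gpm100 ~ Price + Type, data = sample_cars)
coef(type_lm)
```

On the full data the equivalent call would be `lm(gpm100 ~ Price + Type, data = car_price)`, which yields one coefficient per car type other than the baseline.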