Introduction

As the population increases, the need for transportation increases.
Congestion makes it take longer to travel from one place to another in the city, resulting in increased carbon emissions from cars.
Demand for car use is inelastic with respect to fuel price (Borger & Rouwendal, 2014).
Raising the tax on fuel does not help much, because cars with higher fuel efficiency are more expensive than cars with lower fuel efficiency.
To reduce carbon emissions, people need to use cars with better fuel efficiency.

Problem Statement

However, this is hard to achieve because people perceive fuel-efficient cars as more expensive (Alberini, Bareit, & Filippini, 2016).
The aim of this paper is to find out whether there really is a relationship between the fuel efficiency and the price of a car, using regression analysis.
To find out how well the two variables predict each other, the Automobile data set from the UCI Machine Learning Repository is used (Schlimmer, 1987).

Data

The data contains 26 different attributes of automobiles.
However, we are only interested in two attributes: price and city fuel efficiency.
Car price ranges from $5,118 to $45,400 (continuous variable).
City fuel efficiency ranges from 13 to 49 miles per gallon (continuous variable).
The data set contains 205 cars in total.
The data is also independent, i.e. each entry is unique and does not affect other entries.

Descriptive Statistics and Visualisation

Descriptive statistics
Rows with NA values are dropped, leaving 201 complete rows.

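library(dplyr)      # provides the %>% pipe
library(ggplot2)    # plotting
library(gridExtra)  # grid.arrange() for side-by-side plots

# '?' marks missing values; read.csv() converts '-' in column names to '.'
cars.all <- read.csv('cars.csv', header=FALSE, 
                 dec=".", 
                 na.strings='?', 
                 col.names = c("symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors", "body-styles","drive-wheels","engine-location","wheel-base","length","width", "height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"))

# Keep only the two columns of interest and drop rows with missing values
cars <- cars.all[,c("city.mpg", "price")]
cars <- cars %>% na.omit()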

stats <- c( min(cars$price,na.rm = TRUE), quantile(cars$price,probs = .25,na.rm = TRUE), median(cars$price, na.rm = TRUE), quantile(cars$price,probs = .75,na.rm = TRUE), max(cars$price,na.rm = TRUE), mean(cars$price, na.rm = TRUE), sd(cars$price, na.rm = TRUE), length(cars$price), sum(is.na(cars$price)), min(cars$city.mpg,na.rm = TRUE), quantile(cars$city.mpg,probs = .25,na.rm = TRUE), median(cars$city.mpg, na.rm = TRUE), quantile(cars$city.mpg,probs = .75,na.rm = TRUE), max(cars$city.mpg,na.rm = TRUE), mean(cars$city.mpg, na.rm = TRUE), sd(cars$city.mpg, na.rm = TRUE), length(cars$city.mpg), sum(is.na(cars$city.mpg))
)
summary <- matrix(stats, ncol = 9, byrow = TRUE)
colnames(summary) <- c("Min", "Q1", "Median",   "Q3",   "Max",  "Mean", "SD",   "n",    "Missing")
rownames(summary) <- c("Price", "City MPG")

knitr::kable(summary)

	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Price	5118	7775	10295	16500	45400	13207.1294	7947.066342	201	0
City MPG	13	19	24	30	49	25.1791	6.423221	201	0

Descriptive Statistics and Visualisation cont.

ggplot(cars, aes(x=price, y=city.mpg)) + geom_point()

City MPG and price have a negative, non-linear relationship.
We can explore the distributions of both columns to understand why.

Descriptive Statistics and Visualisation cont.

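# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
a <- ggplot(cars, aes(x=price)) + geom_histogram(aes(y=after_stat(density))) + geom_density()
b <- ggplot(cars, aes(x=city.mpg)) + geom_histogram(aes(y=after_stat(density))) + geom_density()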
grid.arrange(a, b, nrow = 1)

Both distributions are right-skewed and are not normally distributed.

Descriptive Statistics and Visualisation cont.

After performing logarithmic transforms on both columns:

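# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
a <- ggplot(cars, aes(x=log(price))) + geom_histogram(aes(y=after_stat(density))) + geom_density()
b <- ggplot(cars, aes(x=log(city.mpg))) + geom_histogram(aes(y=after_stat(density))) + geom_density()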
grid.arrange(a, b, nrow = 1)

Both are now somewhat more normally distributed.
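
One way to put a number on "more normally distributed" is sample skewness. The sketch below is illustrative only: it uses the five quartile values of price from the summary table as a stand-in for the full column, and `skewness()` is a hypothetical helper defined here, not a function from the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Stand-in data: Min, Q1, Median, Q3 and Max of price from the summary table.
price_summary <- c(5118, 7775, 10295, 16500, 45400)

skew_raw <- skewness(price_summary)       # skewness on the dollar scale
skew_log <- skewness(log(price_summary))  # skewness after the log transform

print(round(c(raw = skew_raw, log = skew_log), 2))
```

On these five values the log transform lowers the skewness; applying the same helper to `cars$price` and `cars$city.mpg` would give the figures for the real data.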

Descriptive Statistics and Visualisation cont.

ggplot(cars, aes(x=log(price), y=log(city.mpg))) + geom_point()

After the transformation, the relationship is negative and approximately linear.

Descriptive Statistics and Visualisation cont.

model1 <- lm(log(city.mpg) ~ log(price), data = cars)

model1 %>% summary()

## 
## Call:
## lm(formula = log(city.mpg) ~ log(price), data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38139 -0.09863 -0.01045  0.06525  0.46221 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.02840    0.19350   36.32   <2e-16 ***
## log(price)  -0.41006    0.02067  -19.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1468 on 199 degrees of freedom
## Multiple R-squared:  0.6643, Adjusted R-squared:  0.6626 
## F-statistic: 393.7 on 1 and 199 DF,  p-value: < 2.2e-16

We apply the linear regression model and test assumptions after summarising the model.

Descriptive Statistics and Visualisation cont.

Visualising line of best fit over the transformed scatter plot:

alpha <- model1$coefficients[["(Intercept)"]]
beta <- model1$coefficients[["log(price)"]]
model.function <- function(x) alpha + beta*x

ggplot(cars, aes(x=log(price), y=log(city.mpg))) + geom_point() + stat_function(fun = model.function) + xlim(8, 11)

Line of best fit: $\log(\text{city mpg}) = 7.03 - 0.41 \times \log(\text{price})$
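
Because the model is fitted on the log scale, a prediction has to be back-transformed before it can be read in miles per gallon. A minimal sketch, using the coefficients from the model summary; `predict_city_mpg()` is a hypothetical helper name. If the residuals are roughly normal, exponentiating the fitted log value estimates the median rather than the mean city mpg at that price.

```r
# Coefficients copied from the summary of model1.
alpha <- 7.02840
beta  <- -0.41006

# Hypothetical helper: predicted city mpg for a given price in dollars.
predict_city_mpg <- function(price) exp(alpha + beta * log(price))

# Example: the median price in the data set, $10,295.
mpg_at_median_price <- predict_city_mpg(10295)
print(round(mpg_at_median_price, 1))
```

For the median-priced car this gives about 25.5 mpg, close to the observed median city mpg of 24.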

Hypothesis Testing

Hypotheses for the overall linear regression model
- $H_0$: The data does not fit the linear regression model
- $H_A$: The data fits the linear regression model

model1 %>% summary

## 
## Call:
## lm(formula = log(city.mpg) ~ log(price), data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38139 -0.09863 -0.01045  0.06525  0.46221 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.02840    0.19350   36.32   <2e-16 ***
## log(price)  -0.41006    0.02067  -19.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1468 on 199 degrees of freedom
## Multiple R-squared:  0.6643, Adjusted R-squared:  0.6626 
## F-statistic: 393.7 on 1 and 199 DF,  p-value: < 2.2e-16

The $p$-value for the F-test is very small, $F(1,199) = 393.7, p<.001$. Log price explained 66.43% (Multiple R-squared) of the variability in log city fuel economy. The model is statistically significant, so we can now interpret the model coefficients.
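
As a consistency check on the reported output (a worked calculation, not part of the original analysis), the F-statistic of a simple linear regression can be recovered from $R^2$ and the residual degrees of freedom, and it equals the square of the slope's $t$ value:

$$
F = \frac{R^2 / 1}{(1 - R^2) / 199} = \frac{0.6643}{0.3357} \times 199 \approx 393.8, \qquad t^2 = (-19.84)^2 \approx 393.6
$$

Both agree with the reported $F(1,199) = 393.7$ up to rounding.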

Hypothesis Testing Cont.

Hypotheses for linear regression model parameters:
- Intercept:
  - $H_0: \alpha= 0$
  - $H_A: \alpha \neq0$
- Slope:
  - $H_0: \beta= 0$
  - $H_A: \beta \neq0$

model1 %>% summary() %>% coef()

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  7.0283979 0.19350366  36.32178 9.413514e-90
## log(price)  -0.4100568 0.02066568 -19.84240 4.753214e-49

model1 %>% confint()

##                  2.5 %    97.5 %
## (Intercept)  6.6468170  7.409979
## log(price)  -0.4508086 -0.369305

For each parameter, $H_0$: parameter $= 0$ and $H_A$: parameter $\neq 0$.
- The hypothesised value $\alpha = 0$ is not captured by the 95% CI $[6.65,7.41]$, $p < .001$. The test is statistically significant.
- The hypothesised value $\beta = 0$ is not captured by the 95% CI $[-0.45,-0.37]$, $p < .001$. The test is statistically significant.
We reject $H_0$ for both model parameters.
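
The interval reported by `confint()` can be reproduced by hand from the estimate and its standard error, which makes the link between the t-test and the interval explicit. A minimal sketch, assuming the usual t-based interval with 199 residual degrees of freedom; the numbers are copied from the coefficient table.

```r
# Slope estimate and standard error from the coefficient table.
estimate  <- -0.4100568
std_error <- 0.02066568
df_resid  <- 199

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(slope_ci, 4))
```

The interval excludes 0, which is the same conclusion as the t-test on the slope.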

Testing Assumptions - Normality of Residuals

cars$predicted <- predict(model1)
cars$residuals <- residuals(model1)

# Fitted line, observed points coloured by residual, and segments to the fitted values
a <- ggplot(cars, aes(x = log(price), y = log(city.mpg))) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = log(price), yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +  
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +  
  guides(color = "none") +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
# Residuals vs fitted values with a smoothed trend and a zero reference line
b <- ggplot(cars, aes(x = predicted, y = residuals)) + geom_point() +  geom_smooth(colour = "red", se = FALSE) + 
 geom_hline(yintercept=0) + theme(legend.position = "none")

grid.arrange(a, b, nrow = 1)

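In the residuals vs fitted plot, the residuals are scattered around 0 with no clear trend or change in spread.
This supports linearity and homoscedasticity; normality of the residuals is checked on the next slide.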

Testing Assumptions - Normality of Residuals, Homoscedasticity and Influential Cases

grid.arrange(a, b, c, nrow = 1)

- No obvious departures from normality for most residuals.
- No influential cases.
- Variance in the square root of the residuals is mostly consistent (no pattern).
- Assume homoscedasticity.
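
The objects passed to `grid.arrange()` on this slide are not defined in this section, so the following is only a sketch of how comparable diagnostics could be produced with base R's `plot()` method for `lm` objects. It uses the built-in `mtcars` data as a stand-in (an assumption, since the automobile CSV is not bundled here), with `wt` playing the role of the predictor.

```r
# Stand-in model on built-in data; swap in model1 for the real analysis.
fit <- lm(log(mpg) ~ log(wt), data = mtcars)

pdf(NULL)                 # discard graphics output; drop this line when working interactively
op <- par(mfrow = c(1, 3))
plot(fit, which = 2)      # normal Q-Q plot of standardised residuals
plot(fit, which = 3)      # scale-location: sqrt(|standardised residuals|) vs fitted
plot(fit, which = 5)      # residuals vs leverage with Cook's distance contours
par(op)
invisible(dev.off())

# Numeric companion to the leverage plot: largest Cook's distance.
max_cook <- max(cooks.distance(fit))
print(max_cook)
```

A common rule of thumb treats Cook's distances well below 1 as a sign that no single case dominates the fit.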

Correlation coefficient

r.squared <- model1 %>% summary()
r.squared <- r.squared$r.squared

#Correlation Coefficient r
r <- cor(log(cars$city.mpg), log(cars$price), use = "complete.obs")

#r 95%CI
library(psychometric)
n <- cars$city.mpg %>% length()
print(CIr(r, n = n, level = .95))

## [1] -0.8567765 -0.7626498

detach("package:psychometric", unload = TRUE)

$R^2$
- $R^2 = 0.6643$
- Log price explained 66.43% of the variability in log city fuel economy.
$r$
- $r = -0.82$
- Indicates a strong negative linear relationship.
- The 95% CI [-0.86, -0.76] does not capture $r = 0$. There is statistically significant evidence of a negative correlation between the two variables.
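
The interval for $r$ can be cross-checked by hand. The sketch below assumes the Fisher z-transformation, a standard method for correlation intervals; it is offered as a check on the reported numbers rather than as a description of the package's internals. `r` is taken as $-\sqrt{R^2}$ because the slope is negative.

```r
r_squared <- 0.6643   # Multiple R-squared from the model summary
n <- 201              # rows remaining after dropping NA values

r    <- -sqrt(r_squared)      # sign follows the negative slope
z    <- atanh(r)              # Fisher z-transformation
se_z <- 1 / sqrt(n - 3)       # approximate standard error of z
r_ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)  # back-transform to the r scale

print(round(c(r = r, lower = r_ci[1], upper = r_ci[2]), 2))
```

The result, roughly [-0.86, -0.76], matches the printed interval and excludes 0.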

Results

Simple linear regression summary:
- Linearity was assumed, normality of residuals OK, homoscedasticity OK, no influential cases.
- $r$ = -0.82, $R^2$ = 0.66.
- Model ANOVA, $F(1,199) = 393.7, p < .001$.
- $\alpha = 7.03, p < .001$, 95% CI $[6.65, 7.41]$
- $\beta = -0.41, p < .001$, 95% CI $[-0.45, -0.37]$
Decision:
- Overall model: Reject H0.
- Intercept: Reject H0.
- Slope: Reject H0.
- $\log(\text{city mpg}) = 7.03 - 0.41 \times \log(\text{price})$
Conclusion:
- There was a statistically significant negative linear relationship between the log of a car’s city fuel economy and the log of its price.
- For every one-unit increase in log(price), log(city mpg) was estimated to decrease on average by 0.41.
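
Since both variables are logged, the slope can be read as an elasticity: a given percentage change in price corresponds to an approximately constant percentage change in city mpg. A minimal sketch, using the slope from the model summary; `pct_change_mpg()` is a hypothetical helper introduced for illustration.

```r
beta <- -0.41006  # slope from the model summary

# Hypothetical helper: estimated % change in city mpg for a % change in price.
pct_change_mpg <- function(pct_price) ((1 + pct_price / 100)^beta - 1) * 100

# Estimated effect of a 1%, 10% and 100% (doubling) increase in price.
print(round(pct_change_mpg(c(1, 10, 100)), 2))
```

Under this model, a 1% higher price goes with roughly 0.4% lower city mpg, and doubling the price goes with city mpg falling by about a quarter.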

Discussion

The results contradict the findings of Borger and Rouwendal (2014).
There is statistically significant evidence of a negative linear relationship: car prices increase as fuel efficiency decreases.
This suggests that the earlier finding that fuel-efficient cars are more expensive does not hold for this data set.
If so, being eco-friendly and driving a fuel-efficient car is actually cheaper, as people save on both car and fuel costs.
However, the result may be due to confounding variables such as the curb weight of the car.
Curb weight may be positively related to price, meaning heavier cars are both more expensive and less fuel efficient.
Therefore, curb weight needs to be controlled for in future research to determine whether there is a real negative relationship between fuel efficiency and price.
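
How a confounder can produce such a slope can be illustrated with simulated data. Everything in this sketch is hypothetical: the variables are generated so that weight drives both price and fuel economy while price has no direct effect on fuel economy. It is not the automobile data and says nothing about the true relationship there.

```r
set.seed(1)
n <- 2000

# Simulated (not real) data: weight raises price and lowers fuel economy.
log_weight <- rnorm(n)
log_price  <- 1.0 * log_weight + rnorm(n, sd = 0.5)
log_mpg    <- -0.8 * log_weight + rnorm(n, sd = 0.5)  # no direct price effect

# Slope on price without and with the confounder in the model.
slope_simple   <- coef(lm(log_mpg ~ log_price))[["log_price"]]
slope_adjusted <- coef(lm(log_mpg ~ log_price + log_weight))[["log_price"]]

print(round(c(simple = slope_simple, adjusted = slope_adjusted), 2))
```

In the simulation the unadjusted slope is clearly negative while the adjusted slope is close to zero. Fitting `lm(log(city.mpg) ~ log(price) + log(curb.weight), data = cars.all)` would show whether something similar happens in the automobile data.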

References

Schlimmer, J. (1987). Automobile data set. Retrieved from https://archive.ics.uci.edu/ml/datasets/automobile
Borger, B. D., & Rouwendal, J. (2014). Car user taxes, quality characteristics, and fuel efficiency: Household behaviour and market adjustment. Journal of Transport Economics and Policy, 48(3), 345-366.
Alberini, A., Bareit, M., & Filippini, M. (2016). What is the effect of fuel efficiency information on car prices? Evidence from Switzerland. The Energy Journal, 37(3).

Is Being Eco Friendly Really Expensive?

Using Linear Regression Model To Predict Fuel Economy using Price
