Burtsev_Project

Introduction:

My research question is how fuel economy changes over time. As a car owner, I want to save money on fuel. US standards for passenger vehicle fuel economy and greenhouse gas emissions are slated to tighten steeply. The goals of the standards are to reduce greenhouse gas emissions, improve energy security, and reduce consumers’ fuel costs.

Data

The data come from the fueleconomy package, one of the data packages Hadley Wickham released to CRAN. It contains fuel economy data for all cars sold in the US from 1984 to 2015: 33,442 rows and 12 variables (source: Environmental Protection Agency; URL: https://github.com/hadley/fueleconomy). I will study two variables: hwy (highway fuel economy, in mpg) and year (model year). “year” is the explanatory variable and “hwy” is the response variable. This is an observational study.
The population of interest is Large Cars class vehicles with front-wheel drive, regular fuel, six cylinders, and a 3.5-litre engine displacement. The findings from this analysis can be generalized to that population, since every make and model in the subset shares the same specification. No obvious source of bias was identified that would prevent this generalization. Because the study is observational, however, these data cannot be used to establish causal links between the variables of interest.

if(!require(devtools)) install.packages("devtools")

## Loading required package: devtools

## Loading required package: usethis

devtools::install_github("hadley/fueleconomy")

## Skipping install of 'fueleconomy' from a github remote, the SHA1 (d590bcf6) has not changed since last install.
##   Use `force = TRUE` to force installation

library(fueleconomy)
library(psych)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

v <- fueleconomy::vehicles
# Build a working data frame with explicit column types
df_full <- data.frame(
  vehicle_id = as.numeric(v$id), make = v$make, model = v$model,
  year = as.numeric(v$year), class = v$class, trans = v$trans,
  drive = v$drive, cyl = as.numeric(v$cyl), displ = as.character(v$displ),
  fuel = v$fuel, hwy = as.numeric(v$hwy), cty = as.numeric(v$cty)
)
# Keep large front-wheel-drive cars: regular fuel, 6 cylinders, 3.5 L
df <- subset(df_full, drive == "Front-Wheel Drive" & fuel == "Regular" &
               class == "Large Cars" & cyl == 6 & displ == "3.5")

Exploratory data analysis

cat("Data frame row number is",nrow(df),", column number is",ncol(df), "\n")

## Data frame row number is 81 , column number is 12

summary(df)

##    vehicle_id           make           model         year     
##  Min.   :10113   Chrysler :21   Intrepid  :15   Min.   :1993  
##  1st Qu.:13627   Dodge    :15   Taurus FWD:10   1st Qu.:1997  
##  Median :19839   Ford     :10   Avalon    : 8   Median :2004  
##  Mean   :20309   Chevrolet: 8   Vision    : 7   Mean   :2003  
##  3rd Qu.:26004   Toyota   : 8   Concorde  : 6   3rd Qu.:2009  
##  Max.   :33682   Eagle    : 7   300 M     : 5   Max.   :2014  
##                  (Other)  :12   (Other)   :30                 
##                           class                trans   
##  Large Cars                  :81   Automatic 4-spd:45  
##  Compact Cars                : 0   Automatic (S6) :11  
##  Midsize-Large Station Wagons: 0   Automatic 5-spd: 8  
##  Midsize Cars                : 0   Automatic 6-spd: 8  
##  Midsize Station Wagons      : 0   Automatic (S4) : 6  
##  Minicompact Cars            : 0   Automatic (S5) : 3  
##  (Other)                     : 0   (Other)        : 0  
##                         drive         cyl        displ   
##  2-Wheel Drive             : 0   Min.   :6   3.5    :81  
##  4-Wheel Drive             : 0   1st Qu.:6   0      : 0  
##  4-Wheel or All-Wheel Drive: 0   Median :6   1      : 0  
##  All-Wheel Drive           : 0   Mean   :6   1.1    : 0  
##  Front-Wheel Drive         :81   3rd Qu.:6   1.2    : 0  
##  Part-time 4-Wheel Drive   : 0   Max.   :6   1.3    : 0  
##  Rear-Wheel Drive          : 0               (Other): 0  
##                       fuel         hwy             cty       
##  Regular                :81   Min.   :23.00   Min.   :15.00  
##  CNG                    : 0   1st Qu.:24.00   1st Qu.:16.00  
##  Diesel                 : 0   Median :25.00   Median :16.00  
##  Electricity            : 0   Mean   :25.83   Mean   :17.11  
##  Gasoline or E85        : 0   3rd Qu.:28.00   3rd Qu.:18.00  
##  Gasoline or natural gas: 0   Max.   :30.00   Max.   :20.00  
##  (Other)                : 0

describe(df$year)

##    vars  n    mean   sd median trimmed  mad  min  max range  skew kurtosis   se
## X1    1 81 2003.31 6.16   2004 2003.38 7.41 1993 2014    21 -0.12    -1.32 0.68

describe(df$hwy)

##    vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 81 25.83 2.15     25   25.68 1.48  23  30     7 0.42    -1.53 0.24

ggplot(df, aes(x = year, y = hwy)) + geom_point()

## The relationship looks linear and positive.

cat("The correlation coefficient is", cor(df$year, df$hwy))

## The correlation coefficient is 0.7977006

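The correlation coefficient, 0.7977, is close to 1, which indicates a strong positive linear relationship between model year and highway fuel economy.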
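For a simple linear regression, the least-squares slope and intercept can be recovered from summary statistics alone: the slope is the correlation times the ratio of standard deviations, and the line passes through the point of means. The sketch below applies that to the rounded values printed by describe() and cor() above. The variable names are illustrative, and because the inputs are rounded the result only approximates a fit on the raw data.

```r
# Summary statistics copied (rounded) from the describe() and cor() output above
r      <- 0.7977006   # correlation between year and hwy
mean_x <- 2003.31     # mean of year
sd_x   <- 6.16        # standard deviation of year
mean_y <- 25.83       # mean of hwy
sd_y   <- 2.15        # standard deviation of hwy

# Least-squares estimates for the line hwy = b0 + b1 * year
b1 <- r * sd_y / sd_x        # slope, in mpg per model year
b0 <- mean_y - b1 * mean_x   # intercept: the line passes through the means

cat("slope:", round(b1, 3), " intercept:", round(b0, 1), "\n")
```

The slope comes out near 0.28 mpg per model year, which is the size of trend the scatterplot suggests.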

DATA606::plot_ss(x = df$year, y = df$hwy, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -532.1659       0.2785  
## 
## Sum of Squares:  134.407

m1 <- lm(hwy ~ year, data = df)
summary(m1)

## 
## Call:
## lm(formula = hwy ~ year, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6909 -0.9056  0.2086  0.8661  2.0305 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -532.16594   47.46062  -11.21   <2e-16 ***
## year           0.27854    0.02369   11.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.304 on 79 degrees of freedom
## Multiple R-squared:  0.6363, Adjusted R-squared:  0.6317 
## F-statistic: 138.2 on 1 and 79 DF,  p-value: < 2.2e-16
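The quantities in this summary are tied together, and a quick consistency check may help when reading it. The sketch below starts from the printed slope estimate and standard error only and rebuilds the t value, the F-statistic, and R-squared; it relies on the standard identities for a model with a single predictor (F equals t squared, and R-squared equals F / (F + residual df)). The variable names are illustrative.

```r
# Values copied from the summary(m1) output above
est      <- 0.27854   # slope estimate for year
se       <- 0.02369   # standard error of the slope
df_resid <- 79        # residual degrees of freedom

t_val <- est / se                    # t statistic for the slope
f_val <- t_val^2                     # with one predictor, F equals t squared
r2    <- f_val / (f_val + df_resid)  # R-squared recovered from F
r     <- sqrt(r2)                    # magnitude of the correlation

round(c(t = t_val, F = f_val, R2 = r2, r = r), 4)
```

Up to rounding, the results agree with the printed t value (11.76), F-statistic (138.2), R-squared (0.6363), and the correlation coefficient reported earlier (0.7977). In other words, about 64% of the variability in highway mpg is explained by model year.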

Inference

Conditions

Linearity: The scatterplot shows an approximately linear relationship between year and hwy. A plot of the residuals against year provides a further check.

plot(m1$residuals ~ df$year)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

## The residuals show no apparent pattern, which supports a linear relationship.

Nearly normal residuals: To check this condition, we can look at a histogram

hist(m1$residuals)

or a normal probability plot of the residuals.

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

## The nearly normal residuals condition is reasonably met, although the histogram is slightly left-skewed rather than symmetrical.

Constant variability: Based on the residual plot, the constant variability condition appears to be met.

Interpretation: For each additional model year, we expect highway fuel economy to improve by 0.28 mpg on average.

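The intercept (-532.17) has no meaningful interpretation: it is the predicted highway mpg for model year 0, which lies far outside the observed range of 1993 to 2014.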
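To make the slope and intercept concrete, the sketch below evaluates the fitted line at the two ends of the observed range of model years. The coefficients are copied from the summary(m1) output; predict_hwy is a hypothetical helper introduced here for illustration, not part of the original analysis.

```r
# Coefficients copied from the summary(m1) output
b0 <- -532.16594   # intercept
b1 <- 0.27854      # slope, in mpg per model year

# Hypothetical helper: fitted highway mpg for a given model year
predict_hwy <- function(year) b0 + b1 * year

# Fitted values at the ends of the observed range (1993 and 2014)
fitted_ends <- predict_hwy(c(1993, 2014))
round(fitted_ends, 2)

# Total change implied by the line across the 21 observed model years
round(b1 * (2014 - 1993), 2)
```

The fitted values are roughly 23 mpg for 1993 and 28.8 mpg for 2014, close to the observed minimum and maximum of hwy. Only within that range is the line supported by data, which is why the large negative intercept should not be read as a prediction.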

Null hypothesis (H0): fuel economy does not change over the years, that is, the slope of hwy on year is 0.

Alternative hypothesis (HA): fuel economy improves over the years, that is, the slope is greater than 0.

As a general rule we reject H0 when the p-value is less than 0.05, i.e. we use a significance level of 0.05, α = 0.05.

Our p-value (< 2e-16) is less than 0.05, so we reject the null hypothesis. The data provide strong evidence that highway fuel economy improved steadily over the model years for this class of vehicles.
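The test can be reproduced from the regression output alone. The sketch below recomputes the p-value for the slope from its estimate, standard error, and residual degrees of freedom, and adds a 95% confidence interval for the slope. The interval is not in the original output and is included only as an illustration. Note that summary() reports a two-sided p-value, while the alternative hypothesis here is one-sided; variable names are illustrative.

```r
# Values copied from the summary(m1) output
est      <- 0.27854   # slope estimate for year
se       <- 0.02369   # standard error of the slope
df_resid <- 79        # residual degrees of freedom
alpha    <- 0.05      # significance level

t_val <- est / se

# Two-sided p-value, as reported by summary()
p_two_sided <- 2 * pt(-abs(t_val), df = df_resid)

# One-sided p-value for the alternative "slope > 0"
p_one_sided <- pt(t_val, df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(1 - alpha / 2, df = df_resid) * se

p_one_sided < alpha   # TRUE means H0 is rejected at the 0.05 level
round(ci, 3)
```

The interval runs from roughly 0.23 to 0.33 mpg per model year and excludes 0, which agrees with rejecting the null hypothesis.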

Fuel-efficient vehicles require less gas to go a given distance. When we burn less gas, we produce less pollution and spend far less on fuel. Our dependence on oil makes us vulnerable to oil market manipulation and price shocks. Improving the fuel efficiency of US vehicles is the single biggest step we can take to cut America’s oil consumption. Oil is a non-renewable resource, and we cannot sustain our current rate of use indefinitely. Using it wisely now allows us time to find alternative technologies and fuels that will be more sustainable.