
INTRODUCTION

We decided to analyze the EPA fuel economy dataset (big_epa_cars) to determine whether the ranges and charge times of electric vehicles have been increasing over time as technology improves. We would also like to determine whether Tesla vehicles have longer ranges than other makes of electric cars, and whether refocusing the analysis on either all-wheel drive or two-wheel drive vehicles would have any effect. Because Tesla makes up the largest proportion of our data points, we would also like to see if there is significant variance among its own models of electric cars.

To solve these problems we will use scatter plots and histograms to plot the ranges of the different electric vehicles against make, Tesla model, year, and other variables. We will also create new variables to categorize our data effectively.

We will also draw several scatter plots with color-coded points to help us compare three variables at once. We will also conduct hypothesis tests comparing the means of different samples within the EV space - for example, comparing the mean charge time or range of Tesla vehicles to the mean of all other makes at a chosen confidence level.

Our analysis will help the consumer determine which electric cars have the longest overall, city, and highway ranges, and whether this depends on make or class of car. It will also help them determine whether a new car is needed to maximize their range, and what, if anything, may be sacrificed in order to achieve maximum range. Electric cars with long ranges and fast charge times are extremely desirable to consumers, as it can be a major inconvenience if they unexpectedly run out of charge or are forced to reroute in order to find a charging station.

PACKAGES

library(tidyverse)
library(data.table)
library(knitr)

We require the data.table and tidyverse packages for our analyses. From the dplyr package within the tidyverse we use the select and mutate functions to remove columns that are not relevant and to create new variables that capture relationships between existing variables, as well as filter, group_by, summarise, arrange, and others to build basic summary statistics for our data. Alongside this, we utilize many plotting functions from the ggplot2 package within the tidyverse. We use the fread function from the data.table package to read in our csv data from the url. We use the kable function from the knitr package to format some tables.

DATA PREP

1. Intro & Import

Our dataset (big_epa_cars) comes from the EPA and the U.S. Department of Energy’s Office of Energy Efficiency & Renewable Energy. It was last updated November 15th, 2019. This dataset includes vehicle, emission, and fuel price data, and we believe it may have been collected to raise awareness about air pollution caused by cars and to provide insights into how certain vehicles are better than others at using clean energy and consuming less fuel. At a time when global warming and our carbon footprints are popular topics of discussion and frequently mentioned on the news, this dataset can be used as a tool for people who are concerned with emissions and their carbon footprints to decide which types of cars they may be open to buying.

Our first step in the data prep is to load the data with the fread function, using a url received from a GitHub user’s page (the original source is FuelEconomy.gov):

##loading data into RStudio and checking for success
url <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv'
cars <- fread(url)
head(cars)

2. First Glance

When first looking at the loaded data, we want to see how big the dataset is and get a basic understanding of its format and structure:

#how big is the dataset? introductory questions about data
names(cars)
summary(cars)
str(cars)
dim(cars)
## [1] 41804    83

This dataset is massive. We have over 40,000 observations and 83 variables. This is rather unmanageable in its current form. We hope to trim this to only relevant data prior to further data cleaning steps.

3. Filtering for Electric Vehicles

Because our analysis focuses on electric vehicles (EVs), we first trim the dataset to include only electric vehicle observations. After taking an initial look at the cars dataset and consulting the data dictionary, we determined that an easy way to identify electric vehicle observations was through the atvType column:

kable(table(cars$atvType), col.names = c('Vehicle Type','Frequency'))
Vehicle Type Frequency
Bifuel (CNG) 20
Bifuel (LPG) 8
CNG 50
Diesel 1125
EV 209
FFV 1458
Hybrid 636
Plug-in Hybrid 151
##Filter on EVs = only showing where atvType = 'EV' to limit our set
EVcars <- cars[cars$atvType == "EV",]
head(EVcars)
dim(EVcars)
## [1] 209  83

We filter out all non-electric cars and are left with an EVcars dataset that contains 209 observations with all 83 original columns.

4. Trimming Variables with Missing Values

We know we need to continue organizing the data to only include relevant columns. We start by looking at columns with all or many missing values inside the variable. We also imagine that there are certain columns that are irrelevant to EVs, and we want to target those first as well.

missingvalues <- colnames(EVcars)[colSums(is.na(EVcars)) > 0]
nullTable <- select(EVcars, all_of(missingvalues))
kable(head(nullTable))
cylinders displ drive eng_dscr trany guzzler trans_dscr tCharger sCharger fuelType2 rangeA evMotor mfrCode c240Dscr c240bDscr
NA NA NA NA NA NA NA NA NA NA NA 62 KW AC Induction NA NA NA
NA NA 2-Wheel Drive NA NA NA NA NA NA NA NA 50 KW DC NA NA NA
NA NA 2-Wheel Drive NA NA NA NA NA NA NA NA 50 KW DC NA NA NA
NA NA NA NA NA NA NA NA NA NA NA 27 KW AC Induction NA NA NA
NA NA 2-Wheel Drive NA NA NA NA NA NA NA NA 67 KW AC Induction NA NA NA
NA NA NA NA NA NA NA NA NA NA NA 24 KW AC Synchronous NA NA NA
dim(nullTable)
## [1] 209  15
nullPropBarplot <- barplot(colSums(is.na(nullTable))/nrow(nullTable), las = 2, ylim = c(0,1))

nullPropBarplot

Using the select function to target missing values, we create a barplot that shows the proportions of missing values for all of the variables that have at least one missing value. It is evident that there are several variables that have no values, which we can immediately mark for elimination.
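The missing-value proportions can also be inspected numerically before plotting. The following is a minimal sketch on a small toy data frame (the values are hypothetical and stand in for EVcars); it uses colMeans on is.na as a shortcut for dividing colSums by nrow.

```r
# Toy data frame (hypothetical values) standing in for EVcars
toy <- data.frame(
  make      = c("Nissan", "Toyota", "Ford", "Tesla"),
  cylinders = c(NA, NA, NA, NA),
  drive     = c(NA, "2-Wheel Drive", NA, "All-Wheel Drive"),
  range     = c(90, 88, 29, 265)
)

# Share of missing values in every column
na_prop <- colMeans(is.na(toy))

# Keep only columns with at least one missing value, largest share first
na_prop <- sort(na_prop[na_prop > 0], decreasing = TRUE)
print(na_prop)

# Columns that are entirely missing are immediate candidates for removal
all_missing <- names(na_prop)[na_prop == 1]
print(all_missing)
```

Sorting the proportions makes the fully empty columns easy to separate from the partially filled ones that need a closer look.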

#view the different values and their frequencies for the variables we are unsure of whether to keep or not

table(EVcars$c240bDscr)
table(EVcars$mfrCode)
table(EVcars$displ)
table(EVcars$trany)
table(EVcars$evMotor)
kable(table(EVcars$eng_dscr), col.names = c('Engine Type', 'Frequency'))
Engine Type Frequency
Lead Acid 4
NiMH 4
SIDI 1
kable(table(EVcars$c240Dscr), col.names = c('Charger Type', 'Frequency'))
Charger Type Frequency
3.6 kW charger 4
6.6 kW charger 2
7.2 kW charger 2
single charger 3
standard charger 73
##Deselect the variables we do not want to include based on their proportions of missing values shown in the barplot
EVcars_trim1 <- select(.data = EVcars, -c(cylinders, displ, eng_dscr, guzzler, trans_dscr, tCharger, sCharger, fuelType2, rangeA, mfrCode))
head(EVcars_trim1)
dim(EVcars_trim1)
## [1] 209  73

For the remaining 7 variables we use the table function to get an idea of how many values they have and what the frequencies are. We decide to keep the c240Dscr and c240bDscr variables despite their missing values, as they still have enough data to be useful to us if we decide to include electric vehicle charging in our analysis. We also keep the variables drive, trany, and evMotor because they still have values for a majority of the observations and may be useful in our investigation into ranges of electric vehicles. We cut mfrCode because we already have the variable make and do not feel this will add incremental value. We also cut eng_dscr and displ because they have missing values for such a large proportion of their observations without adding much important insight with the values that are included.

We use the select function again to deselect all the variables we decided to remove, generating a new dataset called EVcars_trim1. We now have 209 observations in 73 columns.

5. Trimming Irrelevant Variables

##First pass on variable stripping - taking out non-relevant fields 
relevantvars <- c('atvType', 'c240Dscr', 'c240bDscr',
                  'charge120', 'charge240', 
                  'cityE', 'cityUF', 'combE', 
                  'combinedUF', 'drive', 
                  'evMotor', 'engId', 'feScore', 
                  'fuelCost08', 'fuelCostA08', 'fuelType',
                  'fuelType1', 'ghgScore', 
                  'ghgScoreA', 'highway08', 'highway08U', 
                  'highwayA08U', 'highwayE', 'id', 
                  'make', 'model', 'mpgData', 'range', 'rangeCity', 
                  'rangeHwy', 'trany', 'UCity','UHighway', 
                  'VClass', 'youSaveSpend', 'charge240b', 'year')
EVcars_trim2 <- select(.data = EVcars_trim1, all_of(relevantvars))
head(EVcars_trim2)
dim(EVcars_trim2)
## [1] 209  37

Once we’ve filtered out the variables that have too many missing values to be useful to our analyses, we look into the data dictionary to interpret the descriptions of each of the remaining variables and determine which will be relevant to our investigation into electric vehicles. We are able to remove 36 variables that pertain to gasoline, engines, and other things that are only applicable to non-electric vehicles, or are otherwise irrelevant to our analysis, leaving us with 209 observations across 37 variables.

6. Trimming Variables with no Variance

#getting rid of variables with only 1 value
EVcars_trim3 <- Filter(function(x)(length(unique(x))>1), EVcars_trim2)
head(EVcars_trim3)
dim(EVcars_trim3)
## [1] 209  28

Now that we have only variables that we believe to be relevant to our analyses and that are not missing values for a substantial number of observations, we use the base R Filter function (not to be confused with dplyr's filter) to remove variables that have only one value across all observations. If there is no variance in a variable then it is not helpful to us in our analysis. This leaves us with 209 observations across 28 variables.
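To make the constant-column check concrete, here is a small sketch on a toy data frame (hypothetical values). It applies the same test, more than one unique value, with vapply and then subsets the columns; the names toy and has_variation are illustrative only.

```r
# Toy data frame (hypothetical values); atvType is constant after filtering to EVs
toy <- data.frame(
  atvType = c("EV", "EV", "EV"),
  make    = c("Nissan", "Toyota", "Ford"),
  range   = c(90, 88, 29)
)

# TRUE for columns that take more than one distinct value
has_variation <- vapply(toy, function(x) length(unique(x)) > 1, logical(1))

# Keep only the columns that vary
toy_trimmed <- toy[, has_variation, drop = FALSE]
print(names(toy_trimmed))
```

Note that unique() counts NA as a value, so a column holding one value plus some NAs is treated as varying and is kept.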

7. Recoding of Similar Variables

#recoding similar values to simplify and streamline analysis related to class of car
EVcars_trim4 <- EVcars_trim3
kable(table(EVcars_trim4$VClass), col.names = c('Class', 'Frequency'))
Class Frequency
Compact Cars 15
Large Cars 47
Midsize Cars 33
Midsize Station Wagons 1
Minicompact Cars 9
Minivan - 2WD 2
Small Pickup Trucks 2WD 2
Small Sport Utility Vehicle 2WD 11
Small Sport Utility Vehicle 4WD 2
Small Station Wagons 12
Special Purpose Vehicle 2WD 2
Sport Utility Vehicle - 2WD 8
Standard Pickup Trucks 2WD 5
Standard Sport Utility Vehicle 4WD 20
Subcompact Cars 20
Two Seaters 20
EVcars_trim4$VClass[EVcars_trim4$VClass == "Small Station Wagons"] <- "Station Wagons"
EVcars_trim4$VClass[EVcars_trim4$VClass == "Midsize Station Wagons"] <- "Station Wagons"
EVcars_trim4$VClass[EVcars_trim4$VClass == "Standard Pickup Trucks 2WD"] <- "Pickup Trucks"
EVcars_trim4$VClass[EVcars_trim4$VClass == "Small Pickup Trucks 2WD"] <- "Pickup Trucks"

Our final step in our data preparation and cleaning processes is to combine data values that are similar to one another to streamline our analyses. It will be much easier to compare values between different classes of cars if we have them consolidated into a few major classes instead of many smaller subclasses, especially when many of them have little to no differences from one another.
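As the number of recoded classes grows, a lookup vector can be easier to maintain than one assignment per value. The sketch below shows one way this might look; recode_map and the sample vclass vector are illustrative names, and the mappings are the four used above.

```r
# Sample class labels (the first four are the ones recoded in this section)
vclass <- c("Small Station Wagons", "Midsize Station Wagons",
            "Standard Pickup Trucks 2WD", "Small Pickup Trucks 2WD",
            "Large Cars")

# Old label -> consolidated label
recode_map <- c(
  "Small Station Wagons"       = "Station Wagons",
  "Midsize Station Wagons"     = "Station Wagons",
  "Standard Pickup Trucks 2WD" = "Pickup Trucks",
  "Small Pickup Trucks 2WD"    = "Pickup Trucks"
)

# Replace only the labels that appear in the lookup; leave the rest untouched
hit <- vclass %in% names(recode_map)
vclass[hit] <- unname(recode_map[vclass[hit]])
print(vclass)
```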

8. Overview of Final Dataset

#get summary information about the final dataset
str(EVcars_trim4)
## Classes 'data.table' and 'data.frame':   209 obs. of  28 variables:
##  $ c240Dscr    : chr  NA NA NA NA ...
##  $ c240bDscr   : chr  NA NA NA NA ...
##  $ charge240   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityE       : num  41 41 41 46 75 40 39 75 39 54 ...
##  $ combE       : num  40 47 47 52 87 45 43 87 43 58 ...
##  $ drive       : chr  NA "2-Wheel Drive" "2-Wheel Drive" NA ...
##  $ evMotor     : chr  "62 KW AC Induction" "50 KW DC" "50 KW DC" "27 KW AC Induction" ...
##  $ engId       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ feScore     : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ fuelCost08  : int  800 900 900 1000 1700 900 850 1700 850 1150 ...
##  $ ghgScore    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ highway08   : int  91 64 64 58 33 66 69 33 69 54 ...
##  $ highway08U  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayE    : num  37 53 53 59 102 51 49 102 49 63 ...
##  $ id          : int  16423 16424 17328 17329 17330 17331 18290 18291 19296 30965 ...
##  $ make        : chr  "Nissan" "Toyota" "Toyota" "Ford" ...
##  $ model       : chr  "Altra EV" "RAV4 EV" "RAV4 EV" "Th!nk" ...
##  $ mpgData     : chr  "N" "N" "N" "N" ...
##  $ range       : int  90 88 88 29 38 33 95 38 95 50 ...
##  $ rangeCity   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rangeHwy    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trany       : chr  NA NA NA NA ...
##  $ UCity       : num  116.2 116.2 116.2 105.3 62.4 ...
##  $ UHighway    : num  129.6 91.1 91.1 82.2 46.8 ...
##  $ VClass      : chr  "Station Wagons" "Sport Utility Vehicle - 2WD" "Sport Utility Vehicle - 2WD" "Two Seaters" ...
##  $ youSaveSpend: int  3250 2750 2750 2250 -1250 2750 3000 -1250 3000 1500 ...
##  $ charge240b  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ year        : int  2000 2000 2001 2001 2001 2001 2002 2002 2003 2001 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#using EVcars label for our final working dataset
EVcarsTrim <- EVcars_trim4
kable(head(EVcarsTrim))
c240Dscr c240bDscr charge240 cityE combE drive evMotor engId feScore fuelCost08 ghgScore highway08 highway08U highwayE id make model mpgData range rangeCity rangeHwy trany UCity UHighway VClass youSaveSpend charge240b year
NA NA 0 41 40 NA 62 KW AC Induction 0 -1 800 -1 91 0 37 16423 Nissan Altra EV N 90 0 0 NA 116.2069 129.6154 Station Wagons 3250 0 2000
NA NA 0 41 47 2-Wheel Drive 50 KW DC 0 -1 900 -1 64 0 53 16424 Toyota RAV4 EV N 88 0 0 NA 116.2069 91.0811 Sport Utility Vehicle - 2WD 2750 0 2000
NA NA 0 41 47 2-Wheel Drive 50 KW DC 0 -1 900 -1 64 0 53 17328 Toyota RAV4 EV N 88 0 0 NA 116.2069 91.0811 Sport Utility Vehicle - 2WD 2750 0 2001
NA NA 0 46 52 NA 27 KW AC Induction 0 -1 1000 -1 58 0 59 17329 Ford Th!nk N 29 0 0 NA 105.3125 82.1951 Two Seaters 2250 0 2001
NA NA 0 75 87 2-Wheel Drive 67 KW AC Induction 0 -1 1700 -1 33 0 102 17330 Ford Explorer USPS Electric N 38 0 0 NA 62.4074 46.8056 Sport Utility Vehicle - 2WD -1250 0 2001
NA NA 0 40 45 NA 24 KW AC Synchronous 0 -1 900 -1 66 0 51 17331 Nissan Hyper-Mini N 33 0 0 NA 120.3571 93.6111 Two Seaters 2750 0 2001

We use the str function to show summary information about our data, including the number of variables and observations, the class types, variable names, and the first few values for each of the variables. Then we use the kable and head functions to display a portion of our data in a cleaned and condensed format.

Our final dataset for use moving forward has 209 observations and 28 variables. Some variables give descriptive information about each vehicle, such as year for the model year, make and model for the maker and the model name, VClass for the vehicle class, drive for the type of drivetrain, and evMotor for special aspects of some of the motors used in the EVs. Most of the remaining variables are performance metrics. Variables of note here include cityE and highwayE, which denote city and highway electricity consumption in kWh/100 miles respectively; range, denoting the EPA range of the vehicle; and charge240, which is the time to charge an electric vehicle in hours at 240 V. While these are our primary variables of interest, the other variables may prove interesting to explore as we make our way through the data analysis, and thus we are keeping them in.

EXPLORATORY DATA ANALYSIS

Range and Charge Time by Year

Our primary interest lies within both the range and charge240 variables. We’ll start by looking at range and charge240 over time; we are interested first in seeing if there have been significant technological improvements over time as the prevalence of EV research and development has grown in recent years.

Record-high range has certainly improved over the past 20 years - however, there are still many models with ranges that are similar to older models. There is a serious gap in data between the years 2005 and 2010. Given the history of the EV industry and that true prototyping and feasible models did not gain traction until 2010, we may elect to only take into account records with year > 2010 in future analysis.

We also can see that there are fewer recorded values for charge time, which may play a role in future hypothesis tests. Values for charge time vary in recent years between 3 and 13 hours.
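The plots discussed above are not reproduced in the text. A minimal sketch of how the range plot might be drawn is shown below; it uses only the six rows printed by head(EVcarsTrim) earlier, so it illustrates the approach rather than the full 209-row figure, and the object names are illustrative.

```r
library(ggplot2)

# The six rows printed by head(EVcarsTrim); the full data has 209 rows
ev_head <- data.frame(
  year  = c(2000, 2000, 2001, 2001, 2001, 2001),
  range = c(90, 88, 88, 29, 38, 33)
)

# Scatter plot of range against model year
p_range <- ggplot(ev_head, aes(x = year, y = range)) +
  geom_point() +
  labs(title = "Range by year", x = "Model year", y = "Range (miles)")

# print(p_range) renders the plot
```

Swapping y = range for y = charge240 on the full dataset would give the corresponding charge time plot.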

Vehicle Class

We’d like to explore the VClass variable more; to start, we simply want to see a breakdown of each vehicle class. We can use a standard barplot to display this:

Vehicle Class Frequency
Compact Cars 15
Large Cars 47
Midsize Cars 33
Minicompact Cars 9
Minivan - 2WD 2
Pickup Trucks 7
Small Sport Utility Vehicle 2WD 11
Small Sport Utility Vehicle 4WD 2
Special Purpose Vehicle 2WD 2
Sport Utility Vehicle - 2WD 8
Standard Sport Utility Vehicle 4WD 20
Station Wagons 13
Subcompact Cars 20
Two Seaters 20

Our two largest samples are the large and midsize car classes. We may want to limit between-class comparisons to those two categories, as they are the only two with more than 30 observations.
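The 30-observation cutoff can be checked directly from the frequency table. The sketch below copies the class counts from the table above into a named vector (class_counts is an illustrative name) and lists the classes that clear the threshold.

```r
# Class counts copied from the frequency table above
class_counts <- c(
  "Compact Cars" = 15, "Large Cars" = 47, "Midsize Cars" = 33,
  "Minicompact Cars" = 9, "Minivan - 2WD" = 2, "Pickup Trucks" = 7,
  "Small Sport Utility Vehicle 2WD" = 11, "Small Sport Utility Vehicle 4WD" = 2,
  "Special Purpose Vehicle 2WD" = 2, "Sport Utility Vehicle - 2WD" = 8,
  "Standard Sport Utility Vehicle 4WD" = 20, "Station Wagons" = 13,
  "Subcompact Cars" = 20, "Two Seaters" = 20
)

# Sanity check: the counts should add up to the 209 EV observations
total_obs <- sum(class_counts)
print(total_obs)

# Classes with more than 30 observations
large_enough <- names(class_counts)[class_counts > 30]
print(large_enough)
```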

Make

We can also look at the breakdown of make to see which specific car makers we can compare to each other:

Tesla observations far outnumber those of any other make. Based on this, we will limit most of our make comparisons to Tesla vs. non-Tesla analyses; however, Nissan, Ford, and smart all have numerous observations, so we can highlight those as well.

Initial Tesla comparisons on range and charge time

We can begin an initial analysis by adding make to our first visualizations. We add a column that color-codes the notable makes specified earlier - all other makes will be shown in black.

#Adding color classifications for specific makes
EVcarsTrim$color = 'black'
EVcarsTrim$color[EVcarsTrim$make == 'Tesla']='red'
EVcarsTrim$color[EVcarsTrim$make == 'Nissan']='blue'
EVcarsTrim$color[EVcarsTrim$make == 'Ford']='green'
EVcarsTrim$color[EVcarsTrim$make == 'smart']='orange'

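#plotting based on the color column (the identity scale draws each point in its assigned color)
EVcarsTrim %>% 
  ggplot(aes(x = year, y = range, color = color)) +
  geom_point() + 
  scale_color_identity(name = "make", guide = "legend",
                       breaks = c("red", "blue", "green", "orange", "black"),
                       labels = c("Tesla", "Nissan", "Ford", "smart", "Other")) +
  ggtitle("Range by year, by make")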

Tesla can easily be seen to have a higher range than other vehicles in recent years. Is there a tradeoff with charge time, though?

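#charge time by year by make, using the color column
EVcarsTrim %>% 
  ggplot(aes(x = year, y = charge240, color = color)) +
  geom_point() + 
  scale_color_identity(name = "make", guide = "legend",
                       breaks = c("red", "blue", "green", "orange", "black"),
                       labels = c("Tesla", "Nissan", "Ford", "smart", "Other")) +
  ggtitle("Charge Time by year, by make")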

This looks to be the case. We would like to investigate this more, and we may elect to build a metric to standardize this relationship or potentially to model it with a linear regression.

Basic summary statistics

Now that we have begun to visualize the data, we can investigate some of the variables more deeply. Using dplyr functions and our insights from previous steps in EDA, we will begin to summarise certain variables.

#summarizing mean range by make
EVcarsTrim %>% 
  filter(year > 2010) %>% 
  group_by(make) %>% 
  summarise(meanRange = mean(range, na.rm = TRUE), n = n()) %>% 
  arrange(desc(meanRange)) %>% 
  kable()
make meanRange n
Tesla 268.25974 77
Jaguar 234.00000 2
Audi 204.00000 1
Hyundai 177.60000 5
BYD 163.87500 8
Chevrolet 160.00000 6
Kia 140.42857 7
Nissan 119.08333 12
Volkswagen 110.66667 6
BMW 105.90000 10
Toyota 103.00000 3
CODA Automotive 88.00000 2
Ford 87.14286 7
Mercedes-Benz 87.00000 4
Honda 86.20000 5
Fiat 85.28571 7
smart 63.43750 16
Mitsubishi 61.40000 5
Azure Dynamics 56.00000 2
Scion 38.00000 1

There is very clearly a pecking order for average range, with Tesla at the top. What is the difference between Tesla’s average and the average of all of the rest combined, since the n values show such a stark contrast? We can use a mutate function to easily find out by making an indicator column for Teslas:

#summarizing mean range by Tesla vs. non-Tesla
EVcarsTrim %>% 
  filter(year > 2010) %>% 
  mutate(teslaBin = if_else(make =='Tesla', 'Tesla', 'non-Tesla')) %>% 
  group_by(teslaBin) %>% 
  summarise(meanRange = mean(range, na.rm = TRUE), n = n()) %>% 
  arrange(desc(meanRange)) %>% 
  kable()
teslaBin meanRange n
Tesla 268.2597 77
non-Tesla 109.2569 109

We want to run a hypothesis test on this difference in means to determine whether it is significant across these two populations. We will do so in the next section.

Another summarization we can make is the range based on drivetrain. The results are below:

#summarizing mean range by drive type
EVcarsTrim %>% 
  filter(year > 2010) %>% 
  group_by(drive) %>% 
  summarise(meanRange = mean(range, na.rm = TRUE), n = n()) %>% 
  arrange(desc(meanRange)) %>% 
  kable()
drive meanRange n
All-Wheel Drive 277.0755 53
4-Wheel Drive 236.6667 3
Rear-Wheel Drive 151.1636 55
Front-Wheel Drive 118.0800 75

We will want to test the difference between these means for significance as well.

STATISTICAL ANALYSIS

Tesla vs Non-Tesla Ranges

We decide to do a “difference between two population means” hypothesis test to determine if there is a significant difference between the mean ranges of Tesla and non-Tesla electric vehicles. The population variances are unknown, and for this type of test we will assume they are equal. We use several functions within the dplyr package to select only vehicles built after 2010, and classify our vehicles based on whether or not they were manufactured by Tesla. We make a datatable to summarize the primary summary statistics for both Teslas and non-Teslas that we will need for further analysis. Our null hypothesis will be that there is no significant difference between the population means for ranges of Tesla and non-Tesla vehicles; our alternative hypothesis is that there is a significant difference between those two population means.

\[H_0: \mu_1-\mu_2 = 0\] \[H_A: \mu_1-\mu_2 \neq 0\]

teslaBin Mean_Range SD_Range Sample_Size
non-Tesla 109.2569 54.31141 109
Tesla 268.2597 40.50874 77
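The code that builds this summary table is not shown in the report. The following is a sketch of how such a table might be produced with dplyr; the toy rows are hypothetical, and the name stattable is borrowed from the confidence interval code later in this section.

```r
library(dplyr)

# Hypothetical toy rows; in the report this would be the EVcarsTrim data
toy <- data.frame(
  make  = c("Tesla", "Tesla", "Tesla", "Nissan", "Ford", "smart", "Toyota"),
  year  = c(2012, 2014, 2016, 2013, 2015, 2017, 2001),
  range = c(265, 208, 270, 73, 84, 62, 88)
)

# Keep post-2010 vehicles, flag Teslas, and summarise range per group
stattable <- toy %>%
  filter(year > 2010) %>%
  mutate(teslaBin = if_else(make == "Tesla", "Tesla", "non-Tesla")) %>%
  group_by(teslaBin) %>%
  summarise(
    Mean_Range  = mean(range, na.rm = TRUE),
    SD_Range    = sd(range, na.rm = TRUE),
    Sample_Size = n()
  )
print(stattable)
```

The row order of the grouped result depends on how the group labels sort, so it is safer to look rows up by teslaBin than by position.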

We calculate our pooled variance estimate using the equation below, and use the square root of this value as our estimated standard deviation.

\[s^2_p = \frac{(n_1 - 1) s^2_1 + (n_2 - 1) s^2_2}{n_1 + n_2 - 2}\]

\[s_p = 49.083\]
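As a worked check, substituting the values from the summary table (non-Tesla as sample 1, Tesla as sample 2; the result does not depend on this labeling) gives:

\[s^2_p = \frac{(109 - 1)(54.31141)^2 + (77 - 1)(40.50874)^2}{109 + 77 - 2} \approx 2409.15, \qquad s_p = \sqrt{2409.15} \approx 49.083\]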

We use this calculated estimate in our calculations for the t-statistic, as shown in the below equation. Our t-statistic is 21.761.

\[t = \frac{(\overline{x_1}-\overline{x_2}) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[ t = 21.761\]

We complete our hypothesis test using the critical value approach. We test at a 95% confidence level (a 5% significance level, two-tailed, with n_1 + n_2 - 2 = 184 degrees of freedom), and find that our critical values are 1.973 and -1.973. We will reject the null hypothesis if the t-statistic is less than or equal to -1.973, or greater than or equal to 1.973.

\[ critical\,values = \pm 1.973\]

Our t-statistic is greater than our larger critical value, so we can reject our null hypothesis. There is sufficient evidence to conclude that the mean range of Tesla cars is not equal to the mean range of non-Tesla cars. From our visualizations it is clear that the Tesla vehicles have a higher mean range than the non-Tesla vehicles.

\[21.761>1.973\]
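The test can be reproduced from the summary statistics alone. The sketch below is one way to do so in base R; the variable names are illustrative, Tesla is taken as sample 1 so the t-statistic is positive, and the p-value line is an addition that the report itself does not compute.

```r
# Summary statistics copied from the table above
n1 <- 77;  xbar1 <- 268.2597; s1 <- 40.50874   # Tesla
n2 <- 109; xbar2 <- 109.2569; s2 <- 54.31141   # non-Tesla

# Pooled standard deviation and standard error of the difference
df <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)
se <- sp * sqrt(1 / n1 + 1 / n2)

# t-statistic under H0: mu1 - mu2 = 0
t_stat <- (xbar1 - xbar2) / se

# Two-tailed critical value at the 5% significance level
t_crit <- qt(1 - 0.05 / 2, df)

# Decision and two-tailed p-value
reject_null <- abs(t_stat) >= t_crit
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(sp = sp, t = t_stat, critical = t_crit), 3))
print(reject_null)
```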

We then decide to estimate the confidence interval for the difference between the two means using the equation below. We are 95% confident that the difference between the population means in ranges of Tesla and non-Tesla electric vehicles is between 144.68 and 173.32 miles.

\[\overline{x_1}-\overline{x_2} \pm z*s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

#point estimate
pt_est <- stattable$Mean_Range[2] -  stattable$Mean_Range[1]

z <- qnorm(1-(alpha/2))

margin <- z * s * sqrt((1 / stattable$Sample_Size[1]) + (1 / stattable$Sample_Size[2]))

#interval estimate
int_est <- c(pt_est - margin, pt_est + margin)

\[Confidence\,Interval: (144.682, 173.324)\]
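The interval code above relies on the objects stattable, alpha, and s, which are created in chunks not shown here. A self-contained sketch, assuming stattable holds the summary table from this section and alpha is 0.05, might look as follows. It also computes a t-based interval for comparison, since a t critical value would be consistent with the pooled t-test; that comparison is an addition, not part of the original analysis.

```r
# Assumed reconstruction of the summary table used by the report's code
stattable <- data.frame(
  teslaBin    = c("non-Tesla", "Tesla"),
  Mean_Range  = c(109.2569, 268.2597),
  SD_Range    = c(54.31141, 40.50874),
  Sample_Size = c(109, 77)
)
alpha <- 0.05   # assumed significance level

# Pooled standard deviation and standard error of the difference
df <- sum(stattable$Sample_Size) - 2
s  <- sqrt(sum((stattable$Sample_Size - 1) * stattable$SD_Range^2) / df)
se <- s * sqrt(sum(1 / stattable$Sample_Size))

# Point estimate: Tesla mean minus non-Tesla mean
pt_est <- stattable$Mean_Range[2] - stattable$Mean_Range[1]

# Interval with the normal critical value, as in the report
int_z <- pt_est + c(-1, 1) * qnorm(1 - alpha / 2) * se

# Interval with the t critical value on 184 degrees of freedom
int_t <- pt_est + c(-1, 1) * qt(1 - alpha / 2, df) * se

print(round(int_z, 3))
print(round(int_t, 3))
```

With 184 degrees of freedom the two critical values are close, so the t-based interval is only slightly wider.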

AWD vs 2WD Ranges

Next we decide to do a “difference between two population means” hypothesis test for all-wheel drive and two-wheel drive electric vehicles. The population variances are unknown, and for this type of test we will assume they are equal. We use several functions within the dplyr package to select only vehicles built after 2010, and classify our vehicles based on their type of drivetrain. We make a datatable to summarize the primary statistics of interest for both all-wheel drive and two-wheel drive vehicles. Our null hypothesis is that there is no significant difference between the population means for ranges of all-wheel drive and two-wheel drive electric vehicles; our alternative hypothesis is that there is a significant difference between those two population means.

\[H_0: \mu_1-\mu_2 = 0\]

\[H_A: \mu_1-\mu_2 \neq 0\]

driveBin Mean_Range SD_Range Sample_Size
2-wheel drive 132.0769 73.77239 130
4-wheel drive 274.9107 38.96865 56
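The sample sizes above (56 and 130) match the earlier drivetrain table if All-Wheel Drive and 4-Wheel Drive (53 + 3) are grouped together and Rear-Wheel Drive and Front-Wheel Drive (55 + 75) form the other group. A sketch of that binning, under this assumption and on hypothetical toy rows, is:

```r
library(dplyr)

# Hypothetical toy rows; drive labels match those in the drivetrain table
toy <- data.frame(
  drive = c("All-Wheel Drive", "4-Wheel Drive", "Rear-Wheel Drive",
            "Front-Wheel Drive", "Front-Wheel Drive"),
  range = c(270, 240, 150, 110, 120)
)

# Collapse the four drivetrain labels into two bins and count each bin
drive_counts <- toy %>%
  mutate(driveBin = if_else(drive %in% c("All-Wheel Drive", "4-Wheel Drive"),
                            "4-wheel drive", "2-wheel drive")) %>%
  count(driveBin)
print(drive_counts)
```

Rows with a missing drive value would need separate handling, because %in% returns FALSE for NA and would place them in the 2-wheel drive bin.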

We calculate our pooled variance estimate using the equation below, and use the square root of this value as our estimated standard deviation.

\[s^2_p = \frac{(n_1 - 1) s^2_1 + (n_2 - 1) s^2_2}{n_1 + n_2 - 2}\]

\[s_p = 65.341\]

We use this calculated estimate in our calculations for the t-statistic, as shown in the below equation. Our t-statistic is -13.676.

\[t = \frac{(\overline{x_1}-\overline{x_2}) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[ t = -13.676\]

We complete our hypothesis test using the critical value approach. We test at a 95% confidence level (a 5% significance level, two-tailed, with n_1 + n_2 - 2 = 184 degrees of freedom), and find that our critical values are 1.973 and -1.973. We will reject the null hypothesis if the t-statistic is less than or equal to -1.973, or greater than or equal to 1.973.

\[ critical\,values = \pm 1.973\]

Our t-statistic is less than our smaller critical value, so we can reject our null hypothesis. There is sufficient evidence to conclude that the mean range of all-wheel drive electric cars is not equal to the mean range of two-wheel drive electric cars. From our visualizations it is clear that the all-wheel drive electric vehicles have a higher mean range than the two-wheel drive vehicles.

\[-13.676<-1.973\]
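When the raw observations are available, the same pooled test can be run with the built-in t.test function by setting var.equal = TRUE. The sketch below uses hypothetical toy vectors to show that the built-in statistic matches the manual formula. It also runs the Welch version, which is t.test's default and does not assume equal variances; that may be worth considering here since the two sample standard deviations in the table above differ noticeably.

```r
# Hypothetical toy ranges for the two drivetrain groups
twd <- c(150, 110, 120, 84, 238, 73)
awd <- c(270, 240, 295, 310, 259)

# Pooled-variance (equal variances assumed) two-sample t-test
pooled <- t.test(twd, awd, var.equal = TRUE)

# Welch test (t.test default), which drops the equal-variance assumption
welch <- t.test(twd, awd)

# Manual pooled t-statistic for comparison
n1 <- length(twd); n2 <- length(awd)
sp <- sqrt(((n1 - 1) * var(twd) + (n2 - 1) * var(awd)) / (n1 + n2 - 2))
manual_t <- (mean(twd) - mean(awd)) / (sp * sqrt(1 / n1 + 1 / n2))

print(c(builtin = unname(pooled$statistic), manual = manual_t))
print(c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter)))
```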

We then decide to estimate the confidence interval for the difference between the two means using the equation below. We are 95% confident that the difference between the population means in ranges of all wheel drive and two wheel drive electric vehicles is between 122.36 and 163.30 miles.

\[\overline{x_1}-\overline{x_2} \pm z*s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

#point estimate
pt_est <- stattable$Mean_Range[2] -  stattable$Mean_Range[1]

z <- qnorm(1-(alpha/2))

margin <- z * s * sqrt((1 / stattable$Sample_Size[1]) + (1 / stattable$Sample_Size[2]))

#interval estimate
int_est <- c(pt_est - margin, pt_est + margin)

\[Confidence\,Interval: (122.363, 163.304)\]

SUMMARY

As mentioned in our introduction, we analyzed the EPA fuel economy dataset, and narrowed it down to only electric vehicles. As concerns regarding climate change and talk of our carbon footprint become more prominent, more American consumers choose to drive electric cars to limit their contributions to pollution. Instead of being continually refueled with gasoline, electric cars have a battery which is charged. These cars burn no fossil fuels and are powered by clean energy, but one of their greatest limitations is the range that they are able to drive on a single charge. This is more of an issue for electric vehicles because it can be much harder to find a charging station than a gas station, depending on where you are. This being the case, potential buyers of electric vehicles are extremely concerned with the range that their future vehicle will be able to drive on a single charge, for both convenience and practical reasons. The time required to fully charge the electric vehicle is another important variable that can affect its practicality.

We decided to analyze certain attributes of electric cars and determine which other variables seem to be correlated with the range that these vehicles are able to travel on one charge, as well as with the charge time required. We decided car make, year of production, and drivetrain were three variables that could be significant, so we isolated these variables and plotted them against one another using scatter plots. We also used histograms to determine the frequencies of the different makes and classes of cars. We found that both range and charge time increased as we moved to more recent years of production, and that Tesla vehicles were extremely prominent, representing over one third of the observations. Knowing how prominent Tesla vehicles are, we decided to compare their ranges and charge times to those of the other makes of vehicles. By changing the colors of the points on the scatter plot, we found that Teslas seem to be on the high end for both range and charge time when compared to the other electric vehicle makes. Putting these comparisons into tables confirms our primary findings: the Teslas and all-wheel drive cars in our sample have higher mean ranges than their counterparts.

We used hypothesis tests to determine whether these findings are statistically significant given the sample sizes and variances of the observations in our dataset. In both cases, we reject our null hypothesis and conclude that there is significant evidence that the mean ranges of the two categories are different.

The implications to a potential electric car buyer are clear: if you would like your vehicle to have maximum range, you should buy a Tesla vehicle, an all wheel drive vehicle, or both. Teslas consistently have the best ranges and are superior in this aspect, and all wheel drive electric cars tend to have longer ranges than two wheel drive cars.

Our analysis is limited by the lack of observations for several of the other makes of electric cars that have competitive ranges. Jaguar, Audi, and Hyundai, the next three makes after Tesla when it comes to average range, all have five or fewer observations, compared to Tesla’s 77. Having so few observations could allow an outlier to have a drastic effect on the mean, pulling it further below Tesla’s than it should be. We could address this by locating additional data for the electric cars of these manufacturers and pulling it into our analysis. Another factor that we realized could be included in our analysis is the relationship between range and charge time. We could create a new variable that is range divided by charge time and see if Teslas and all-wheel drive vehicles are more or less efficient than their counterparts when we factor their charge times into their ranges on a single charge. Other future work could include further research into the cost implications of Teslas vs. other makes, as well as charge times as previously mentioned.
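The proposed range-per-charge-time variable could be sketched as below. The toy rows are hypothetical and rangePerChargeHour is an illustrative name. Because the str output earlier shows charge240 recorded as 0 for older records, the sketch treats a zero charge time as missing rather than dividing by it.

```r
library(dplyr)

# Hypothetical toy rows; charge240 is hours to charge at 240 V
toy <- data.frame(
  make      = c("Tesla", "Nissan", "smart", "Toyota"),
  range     = c(270, 150, 60, 88),
  charge240 = c(10, 7.5, 3, 0)
)

# Miles of range gained per hour of charging; NA where charge time is 0
toy_eff <- toy %>%
  mutate(rangePerChargeHour = if_else(charge240 > 0,
                                      range / charge240,
                                      NA_real_))
print(toy_eff)
```

Grouping this new column by make or drivetrain with group_by and summarise would then allow the same Tesla vs. non-Tesla and drivetrain comparisons made for range.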