Electric Vehicle Product Analysis: Data Wrangling Final Project

INTRODUCTION

We decided to analyze the mtcars dataset to determine whether the ranges and charge times of electric vehicles have been increasing over time as technology improves. We would also like to determine whether Tesla vehicles have longer ranges than other makes of electric cars, and whether restricting the analysis to either all-wheel drive or two-wheel drive vehicles would have any effect. Because Tesla makes up the largest proportion of our data points, we would also like to see if there is significant variance among its different models of electric cars.

To solve these problems we will use scatter plots and histograms to plot the ranges of the different electric vehicles against make, Tesla model, year, and other variables. We will also create new variables to categorize our data effectively.

We will also plot several scatter plots with color-coded points to help us compare three different variables at once. We will also conduct hypothesis tests comparing means of different samples within the EV space; for example, comparing the mean charge time or range of Tesla vehicles to the mean of all other makes at a chosen confidence level.

Our analysis will help the consumer determine which electric cars have the longest overall, city, and highway ranges, and whether this depends on make or class of car. It will also help them determine whether a new car is needed to maximize their range, and what, if anything, may be sacrificed in order to achieve maximum range. Electric cars with long ranges and fast charge times are extremely desirable to consumers, as it can be a major inconvenience to them if they unexpectedly run out of charge or are forced to reroute in order to find a charging station.

PACKAGES

library(tidyverse)
library(data.table)
library(knitr)

We require the data.table, tidyverse, and knitr packages for our analyses. From the dplyr package within the tidyverse we use the select and mutate functions to remove columns that are not relevant and to create new variables from existing ones, as well as filter, group_by, summarise, arrange, and others to build basic summary statistics for our data. Alongside this, we utilize many plotting functions from the ggplot2 package within the tidyverse. We use the fread function from the data.table package to read in our csv data from the url. We use the kable function from the knitr package to format some tables.

DATA PREP

1. Intro & Import

The mtcars dataset comes from the EPA and the U.S. Department of Energy’s Office of Energy Efficiency & Renewable Energy. It was last updated November 15th, 2019. This dataset includes vehicle, emission, and fuel price data and we believe it may have been collected to raise awareness about air pollution caused by cars, and provide insights into how certain vehicles are better than others at using clean energy and consuming less fuel. In a time where global warming and our carbon footprints are popular topics of discussion and frequently mentioned on the news, this dataset can be used as a tool for people who are concerned with emissions and their carbon footprints to decide which types of cars they may be open to buying.

Our first step in the data prep is to load the data with the fread function, using a url received from a GitHub user’s page (the original source is FuelEconomy.gov):

##loading data into RStudio and checking for success
url <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv'
cars <- fread(url)
head(cars)

2. First Glance

After loading the data, we want to see how big the dataset is and get a basic understanding of its format and structure:

#how big is the dataset? introductory questions about data
names(cars)
summary(cars)
str(cars)

dim(cars)

## [1] 41804    83

This dataset is massive. We have over 40,000 observations and 83 variables. This is rather unmanageable in its current form. We hope to trim this to only relevant data prior to further data cleaning steps.

3. Filtering for Electric Vehicles

Since our analysis focuses on electric vehicles (EVs), we first trim the dataset to include only electric vehicle observations. After taking an initial look at the cars dataset and consulting the data dictionary, we determined that an easy way to identify electric vehicle observations was through the atvType column:

kable(table(cars$atvType), col.names = c('Vehicle Type','Frequency'))

Vehicle Type	Frequency
Bifuel (CNG)	20
Bifuel (LPG)	8
CNG	50
Diesel	1125
EV	209
FFV	1458
Hybrid	636
Plug-in Hybrid	151

##Filter on EVs = only showing where atvType = 'EV' to limit our set
EVcars <- cars[cars$atvType == "EV",]
head(EVcars)

dim(EVcars)

## [1] 209  83

We filter out all non-electric cars and are left with an EVcars dataset that contains 209 observations with all 83 original columns.

4. Trimming Variables with Missing Values

We know we need to continue organizing the data to only include relevant columns. We start by looking at columns with all or many missing values inside the variable. We also imagine that there are certain columns that are irrelevant to EVs, and we want to target those first as well.

missingvalues <- colnames(EVcars)[colSums(is.na(EVcars)) > 0]
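#all_of() makes it explicit that missingvalues is an external character vector of column names
nullTable <- select(EVcars, all_of(missingvalues))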
kable(head(nullTable))

cylinders	displ	drive	eng_dscr	trany	guzzler	trans_dscr	tCharger	sCharger	fuelType2	rangeA	evMotor	mfrCode	c240Dscr	c240bDscr
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	62 KW AC Induction	NA	NA	NA
NA	NA	2-Wheel Drive	NA	NA	NA	NA	NA	NA	NA	NA	50 KW DC	NA	NA	NA
NA	NA	2-Wheel Drive	NA	NA	NA	NA	NA	NA	NA	NA	50 KW DC	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	27 KW AC Induction	NA	NA	NA
NA	NA	2-Wheel Drive	NA	NA	NA	NA	NA	NA	NA	NA	67 KW AC Induction	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	24 KW AC Synchronous	NA	NA	NA

dim(nullTable)

## [1] 209  15

nullPropBarplot <- barplot(colSums(is.na(nullTable))/nrow(nullTable), las = 2, ylim = c(0,1))

nullPropBarplot

Using the select function to target missing values, we create a barplot that shows the proportions of missing values for all of the variables that have at least one missing value. It is evident that there are several variables that have no values, which we can immediately mark for elimination.
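As a minimal sketch of the computation behind the barplot, the snippet below uses a small made-up data frame (toy, which is not the EPA data) to compute the share of missing values per column and to pick out the columns that are entirely missing. It uses base R only, so it can be run without loading any packages.

```r
# Toy data frame standing in for EVcars (all values are made up for illustration)
toy <- data.frame(
  make      = c("A", "B", "C", "D"),
  cylinders = c(NA, NA, NA, NA),
  drive     = c(NA, "2-Wheel Drive", "2-Wheel Drive", NA),
  range     = c(90, 88, 29, 38),
  stringsAsFactors = FALSE
)

# Proportion of missing values in every column
na_prop <- colMeans(is.na(toy))

# Keep only the columns that have at least one missing value
na_prop_missing <- na_prop[na_prop > 0]

# Columns that are entirely missing are the first candidates for removal
all_missing <- names(na_prop)[na_prop == 1]

print(na_prop_missing)
print(all_missing)
```

A proportion of 1 corresponds to a bar that reaches the top of the plot, which is how we spot the variables with no values at all.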

#view the different values and their frequencies for the variables we are unsure of whether to keep or not

table(EVcars$c240bDscr)
table(EVcars$mfrCode)
table(EVcars$displ)
table(EVcars$trany)
table(EVcars$evMotor)

kable(table(EVcars$eng_dscr), col.names = c('Engine Type', 'Frequency'))

Engine Type	Frequency
Lead Acid	4
NiMH	4
SIDI	1

kable(table(EVcars$c240Dscr), col.names = c('Charger Type', 'Frequency'))

Charger Type	Frequency
3.6 kW charger	4
6.6 kW charger	2
7.2 kW charger	2
single charger	3
standard charger	73

##Deselect the variables we do not want to include based on their proportions of missing values shown in the barplot
EVcars_trim1 <- select(.data = EVcars, -c(cylinders, displ, eng_dscr, guzzler, trans_dscr, tCharger, sCharger, fuelType2, rangeA, mfrCode))
head(EVcars_trim1)

dim(EVcars_trim1)

## [1] 209  73

For the remaining 7 variables we use the table function to get an idea of how many values they have and what the frequencies are. We decide to keep the c240Dscr and c240bDscr variables despite their missing values, as they still have enough data to be useful to us if we decide to include electric vehicle charging in our analysis. We also keep the variables drive, trany, and evMotor because they still have values for a majority of the observations and may be useful in our investigation into ranges of electric vehicles. We cut mfrCode because we already have the variable make and do not feel this will add incremental value. We also cut eng_dscr and displ because they have missing values for such a large proportion of their observations without adding much important insight with the values that are included.

We use the select function again to deselect all the variables we decided to remove, generating a new dataset called EVcars_trim1. We now have 209 observations in 73 columns.

5. Trimming Irrelevant Variables

##First pass on variable stripping - taking out non-relevant fields 
relevantvars <- c('atvType', 'c240Dscr', 'c240bDscr',
                  'charge120', 'charge240', 
                  'cityE', 'cityUF', 'combE', 
                  'combinedUF', 'drive', 
                  'evMotor', 'engId', 'feScore', 
                  'fuelCost08', 'fuelCostA08', 'fuelType',
                  'fuelType1', 'ghgScore', 
                  'ghgScoreA', 'highway08', 'highway08U', 
                  'highwayA08U', 'highwayE', 'id', 
                  'make', 'model', 'mpgData', 'range', 'rangeCity', 
                  'rangeHwy', 'trany', 'UCity','UHighway', 
                  'VClass', 'youSaveSpend', 'charge240b', 'year')
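#all_of() makes it explicit that relevantvars is an external character vector of column names
EVcars_trim2 <- select(.data = EVcars, all_of(relevantvars))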
head(EVcars_trim2)

dim(EVcars_trim2)

## [1] 209  37

Once we’ve filtered out the variables that have too many missing values to be useful to our analyses, we look into the data dictionary to interpret the descriptions of each of the remaining variables and determine which will be relevant to our investigation into electric vehicles. We are able to remove 36 variables that pertain to gasoline, engines, and other things that are only applicable to non-electric vehicles, or are otherwise irrelevant to our analysis, leaving us with 209 observations across 37 variables.

6. Trimming Variables with no Variance

#getting rid of variables with only 1 value
EVcars_trim3 <- Filter(function(x)(length(unique(x))>1), EVcars_trim2)
head(EVcars_trim3)

dim(EVcars_trim3)

## [1] 209  28

Now that we have only variables that we believe to be relevant to our analyses and are not missing values for a substantial number of observations, we decide to use base R's Filter function (not to be confused with dplyr's filter) to remove variables that only have one value across all observations. If there is no variance in the variable then it is not helpful to us in our analysis. This leaves us with 209 observations across 28 variables.
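To illustrate how this step behaves, here is a minimal sketch on a made-up data frame (toy is hypothetical, as is the helper name has_variance). One detail worth noting is that unique() counts NA as its own value, so a column with a single real value plus some NAs is kept, while a column that is entirely NA is dropped.

```r
# Toy data frame (made-up values) with one constant column and one all-NA column
toy <- data.frame(
  atvType = c("EV", "EV", "EV"),
  rangeA  = c(NA, NA, NA),
  trany   = c(NA, "Automatic (A1)", NA),
  range   = c(90, 88, 29),
  stringsAsFactors = FALSE
)

# Predicate: TRUE when a column holds more than one distinct value (NA counts as a value)
has_variance <- function(x) length(unique(x)) > 1

# Filter() treats the data frame as a list of columns and keeps those where the predicate is TRUE
kept <- Filter(has_variance, toy)

print(names(kept))
```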

7. Recoding of Similar Variables

#recoding similar values to simplify and streamline analysis related to class of car
EVcars_trim4 <- EVcars_trim3
kable(table(EVcars_trim4$VClass), col.names = c('Class', 'Frequency'))

Class	Frequency
Compact Cars	15
Large Cars	47
Midsize Cars	33
Midsize Station Wagons	1
Minicompact Cars	9
Minivan - 2WD	2
Small Pickup Trucks 2WD	2
Small Sport Utility Vehicle 2WD	11
Small Sport Utility Vehicle 4WD	2
Small Station Wagons	12
Special Purpose Vehicle 2WD	2
Sport Utility Vehicle - 2WD	8
Standard Pickup Trucks 2WD	5
Standard Sport Utility Vehicle 4WD	20
Subcompact Cars	20
Two Seaters	20

EVcars_trim4$VClass[EVcars_trim4$VClass == "Small Station Wagons"] <- "Station Wagons"
EVcars_trim4$VClass[EVcars_trim4$VClass == "Midsize Station Wagons"] <- "Station Wagons"
EVcars_trim4$VClass[EVcars_trim4$VClass == "Standard Pickup Trucks 2WD"] <- "Pickup Trucks"
EVcars_trim4$VClass[EVcars_trim4$VClass == "Small Pickup Trucks 2WD"] <- "Pickup Trucks"

Our final step in our data preparation and cleaning processes is to combine data values that are similar to one another to streamline our analyses. It will be much easier to compare values between different classes of cars if we have them consolidated into a few major classes instead of many smaller subclasses, especially when many of them have little to no differences from one another.
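The four assignments above could also be written as a single lookup, which scales better if more classes need to be merged later. The sketch below is one possible way to do this in base R; the vector vclass and the name recode_map are made up for illustration and stand in for EVcars_trim4$VClass.

```r
# Made-up vector of vehicle classes standing in for EVcars_trim4$VClass
vclass <- c("Small Station Wagons", "Large Cars", "Midsize Station Wagons",
            "Small Pickup Trucks 2WD", "Standard Pickup Trucks 2WD")

# Named lookup: old label -> consolidated label
recode_map <- c(
  "Small Station Wagons"       = "Station Wagons",
  "Midsize Station Wagons"     = "Station Wagons",
  "Small Pickup Trucks 2WD"    = "Pickup Trucks",
  "Standard Pickup Trucks 2WD" = "Pickup Trucks"
)

# Replace only the labels that appear in the lookup; leave the rest unchanged
hit <- vclass %in% names(recode_map)
vclass[hit] <- unname(recode_map[vclass[hit]])

print(table(vclass))
```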

8. Overview of Final Dataset

#get summary information about the final dataset
str(EVcars_trim4)

## Classes 'data.table' and 'data.frame':   209 obs. of  28 variables:
##  $ c240Dscr    : chr  NA NA NA NA ...
##  $ c240bDscr   : chr  NA NA NA NA ...
##  $ charge240   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityE       : num  41 41 41 46 75 40 39 75 39 54 ...
##  $ combE       : num  40 47 47 52 87 45 43 87 43 58 ...
##  $ drive       : chr  NA "2-Wheel Drive" "2-Wheel Drive" NA ...
##  $ evMotor     : chr  "62 KW AC Induction" "50 KW DC" "50 KW DC" "27 KW AC Induction" ...
##  $ engId       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ feScore     : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ fuelCost08  : int  800 900 900 1000 1700 900 850 1700 850 1150 ...
##  $ ghgScore    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ highway08   : int  91 64 64 58 33 66 69 33 69 54 ...
##  $ highway08U  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayE    : num  37 53 53 59 102 51 49 102 49 63 ...
##  $ id          : int  16423 16424 17328 17329 17330 17331 18290 18291 19296 30965 ...
##  $ make        : chr  "Nissan" "Toyota" "Toyota" "Ford" ...
##  $ model       : chr  "Altra EV" "RAV4 EV" "RAV4 EV" "Th!nk" ...
##  $ mpgData     : chr  "N" "N" "N" "N" ...
##  $ range       : int  90 88 88 29 38 33 95 38 95 50 ...
##  $ rangeCity   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rangeHwy    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trany       : chr  NA NA NA NA ...
##  $ UCity       : num  116.2 116.2 116.2 105.3 62.4 ...
##  $ UHighway    : num  129.6 91.1 91.1 82.2 46.8 ...
##  $ VClass      : chr  "Station Wagons" "Sport Utility Vehicle - 2WD" "Sport Utility Vehicle - 2WD" "Two Seaters" ...
##  $ youSaveSpend: int  3250 2750 2750 2250 -1250 2750 3000 -1250 3000 1500 ...
##  $ charge240b  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ year        : int  2000 2000 2001 2001 2001 2001 2002 2002 2003 2001 ...
##  - attr(*, ".internal.selfref")=<externalptr>

#using EVcars label for our final working dataset
EVcarsTrim <- EVcars_trim4

kable(head(EVcarsTrim))

c240Dscr	c240bDscr	cityE	combE	drive	evMotor	feScore	fuelCost08	ghgScore	highway08	highwayE	id	make	model	mpgData	range	trany	UCity	UHighway	VClass	youSaveSpend	year
NA	NA	41	40	NA	62 KW AC Induction	-1	800	-1	91	37	16423	Nissan	Altra EV	N	90	NA	116.2069	129.6154	Station Wagons	3250	2000
NA	NA	41	47	2-Wheel Drive	50 KW DC	-1	900	-1	64	53	16424	Toyota	RAV4 EV	N	88	NA	116.2069	91.0811	Sport Utility Vehicle - 2WD	2750	2000
NA	NA	41	47	2-Wheel Drive	50 KW DC	-1	900	-1	64	53	17328	Toyota	RAV4 EV	N	88	NA	116.2069	91.0811	Sport Utility Vehicle - 2WD	2750	2001
NA	NA	46	52	NA	27 KW AC Induction	-1	1000	-1	58	59	17329	Ford	Th!nk	N	29	NA	105.3125	82.1951	Two Seaters	2250	2001
NA	NA	75	87	2-Wheel Drive	67 KW AC Induction	-1	1700	-1	33	102	17330	Ford	Explorer USPS Electric	N	38	NA	62.4074	46.8056	Sport Utility Vehicle - 2WD	-1250	2001
NA	NA	40	45	NA	24 KW AC Synchronous	-1	900	-1	66	51	17331	Nissan	Hyper-Mini	N	33	NA	120.3571	93.6111	Two Seaters	2750	2001

We use the str function to show summary information about our data, including the number of variables and observations, the class types, variable names, and the first few values for each of the variables. Then we use the kable and head functions to display a portion of our data in a cleaned and condensed format.

Our final dataset moving forward has 209 observations and 28 variables. Some variables give descriptive information about each vehicle, such as year for the model year, make and model for the maker and the model name, VClass for the vehicle class, drive for the type of drivetrain, and evMotor for details of the motors used in the EVs. Most of the remaining variables are performance metrics. Variables of note here include cityE and highwayE, which denote city and highway electricity consumption in kWh per 100 miles, respectively; range, denoting the EPA range of the vehicle; and charge240, which is the time to charge an electric vehicle in hours at 240 V. While these are our primary variables of interest, the other variables may prove interesting to explore as we make our way through data analysis, and thus we are keeping them in.

EXPLORATORY DATA ANALYSIS

Range and Charge Time by Year

Our primary interest lies within both the range and charge240 variables. We’ll start by looking at range and charge240 over time; we are interested first in seeing if there have been significant technological improvements over time as the prevalence of EV research and development has grown in recent years.

Record-high range has certainly improved over the past 20 years - however, there are still many models with ranges that are similar to older models. There is a serious gap in data between the years 2005 and 2010. Given the history of the EV industry and that true prototyping and feasible models did not gain traction until 2010, we may elect to only take into account records with year > 2010 in future analysis.

We also can see that there are fewer recorded values for charge time, which may play a role in future hypothesis tests. Values for charge time vary in recent years between 3 and 13 hours.

Vehicle Class

We’d like to explore the VClass variable more; to start, we simply want to see a breakdown of each vehicle class. We can use a standard barplot to display this:

Vehicle Class	Frequency
Compact Cars	15
Large Cars	47
Midsize Cars	33
Minicompact Cars	9
Minivan - 2WD	2
Pickup Trucks	7
Small Sport Utility Vehicle 2WD	11
Small Sport Utility Vehicle 4WD	2
Special Purpose Vehicle 2WD	2
Sport Utility Vehicle - 2WD	8
Standard Sport Utility Vehicle 4WD	20
Station Wagons	13
Subcompact Cars	20
Two Seaters	20

Our two largest samples by vehicle class are large and midsize cars. We may want to limit class comparisons to those two categories, as they appear to be the only two with more than 30 observations.

Make

We can also look at the breakdown of make to see which specific car makers we can compare to each other:

Tesla observations far outnumber those of any other make. Based on this, we will limit most of our make comparisons to Tesla vs. non-Tesla analyses; however, Nissan, Ford, and smart all have numerous observations, so we can highlight those as well.

Initial Tesla comparisons on range and charge time

We can extend our first visualizations by including make. We add a column that color codes the notable makes we specified earlier; all other makes will be shown as black.

#Adding color classifications for specific makes
EVcarsTrim$color = 'black'
EVcarsTrim$color[EVcarsTrim$make == 'Tesla']='red'
EVcarsTrim$color[EVcarsTrim$make == 'Nissan']='blue'
EVcarsTrim$color[EVcarsTrim$make == 'Ford']='green'
EVcarsTrim$color[EVcarsTrim$make == 'smart']='orange'

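#plotting based on the color column (all other makes are shown in black)
EVcarsTrim %>% 
  ggplot(aes(x = year, y = range, color = color)) +
  geom_point() + 
  scale_color_identity(name = "make", guide = "legend",
                       breaks = c('red', 'blue', 'green', 'orange', 'black'),
                       labels = c('Tesla', 'Nissan', 'Ford', 'smart', 'Other')) +
  ggtitle("Range by year, by make")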

Tesla can easily be seen to have a higher range than other vehicles in recent years. Is there a tradeoff with charge time, though?

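#charge time by year, using the color column (all other makes are shown in black)
EVcarsTrim %>% 
  ggplot(aes(x = year, y = charge240, color = color)) +
  geom_point() + 
  scale_color_identity(name = "make", guide = "legend",
                       breaks = c('red', 'blue', 'green', 'orange', 'black'),
                       labels = c('Tesla', 'Nissan', 'Ford', 'smart', 'Other')) +
  ggtitle("Charge Time by year, by make")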

This looks to be the case. We would like to investigate this more, and we may elect to build a metric to standardize this relationship or potentially to model it with a linear regression.

Basic summary statistics

Now that we have begun to visualize the data, we can investigate some of the variables more deeply. Using dplyr functions and our insights from previous steps in EDA, we will begin to summarise certain variables.

#summarizing mean range by make
EVcarsTrim %>% 
  filter(year > 2010) %>% 
  group_by(make) %>% 
  summarise(meanRange = mean(range, na.rm = TRUE), n = n()) %>% 
  arrange(desc(meanRange)) %>% 
  kable()

make	meanRange	n
Tesla	268.25974	77
Jaguar	234.00000	2
Audi	204.00000	1
Hyundai	177.60000	5
BYD	163.87500	8
Chevrolet	160.00000	6
Kia	140.42857	7
Nissan	119.08333	12
Volkswagen	110.66667	6
BMW	105.90000	10
Toyota	103.00000	3
CODA Automotive	88.00000	2
Ford	87.14286	7
Mercedes-Benz	87.00000	4
Honda	86.20000	5
Fiat	85.28571	7
smart	63.43750	16
Mitsubishi	61.40000	5
Azure Dynamics	56.00000	2
Scion	38.00000	1

There is very clearly a pecking order for average range, with Tesla at the top. What is the difference between Tesla’s average and the average of all of the rest combined, since the n values show such a stark contrast? We can use a mutate function to easily find out by making an indicator column for Teslas:

#summarizing mean range by Tesla vs. non-Tesla
EVcarsTrim %>% 
  filter(year > 2010) %>% 
  mutate(teslaBin = if_else(make =='Tesla', 'Tesla', 'non-Tesla')) %>% 
  group_by(teslaBin) %>% 
  summarise(meanRange = mean(range, na.rm = TRUE), n = n()) %>% 
  arrange(desc(meanRange)) %>% 
  kable()

teslaBin	meanRange	n
Tesla	268.2597	77
non-Tesla	109.2569	109

We want to run a hypothesis test on this difference in means to determine whether it is significant across these two populations. We will do so in the next section.

Another summarization we can make is the range based on drivetrain. The results are below:

#summarizing mean range by drive type
EVcarsTrim %>% 
  filter(year > 2010) %>% 
  group_by(drive) %>% 
  summarise(meanRange = mean(range, na.rm = TRUE), n = n()) %>% 
  arrange(desc(meanRange)) %>% 
  kable()

drive	meanRange	n
All-Wheel Drive	277.0755	53
4-Wheel Drive	236.6667	3
Rear-Wheel Drive	151.1636	55
Front-Wheel Drive	118.0800	75

We will want to test the difference between these means at a chosen significance level as well.

STATISTICAL ANALYSIS

Tesla vs Non-Tesla Ranges

We decide to do a “difference between two population means” hypothesis test to determine if there is a significant difference between the mean ranges of Tesla and non-Tesla electric vehicles. The population variances are unknown, and for this type of test we will assume they are equal. We use several functions within the dplyr package to select only vehicles built after 2010, and classify our vehicles based on whether or not they were manufactured by Tesla. We make a table of the summary statistics for both Teslas and non-Teslas that we will need for further analysis. Our null hypothesis is that there is no difference between the population mean ranges of Tesla and non-Tesla vehicles; our alternative hypothesis is that the two population means differ.

\[H_0: \mu_1-\mu_2 = 0\]

\[H_A: \mu_1-\mu_2 \neq 0\]

teslaBin	Mean_Range	SD_Range	Sample_Size
non-Tesla	109.2569	54.31141	109
Tesla	268.2597	40.50874	77
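The code that builds this summary table is not shown in the report. The sketch below is one way such a table could be produced in base R; the data frame toy and its values are made up, and only the column names (teslaBin, Mean_Range, SD_Range, Sample_Size) and the object name stattable are taken from the output and code in this section.

```r
# Made-up observations; the real analysis uses EVcarsTrim
toy <- data.frame(
  make  = c("Tesla", "Tesla", "Tesla", "Nissan", "Ford", "smart", "Toyota"),
  range = c(265, 310, 240, 107, 76, 58, 88),
  year  = c(2014, 2018, 2016, 2016, 2013, 2017, 2001),
  stringsAsFactors = FALSE
)

# Keep only model years after 2010, as in the rest of the analysis
toy <- toy[toy$year > 2010, ]

# Indicator for Tesla vs. everything else; explicit levels fix the row order
toy$teslaBin <- factor(ifelse(toy$make == "Tesla", "Tesla", "non-Tesla"),
                       levels = c("non-Tesla", "Tesla"))

# One row per group: mean, standard deviation, and sample size of range
groups <- split(toy$range, toy$teslaBin)
stattable <- data.frame(
  teslaBin    = names(groups),
  Mean_Range  = sapply(groups, mean),
  SD_Range    = sapply(groups, sd),
  Sample_Size = sapply(groups, length),
  row.names   = NULL,
  stringsAsFactors = FALSE
)

print(stattable)
```

Fixing the factor levels keeps non-Tesla in row 1 and Tesla in row 2, which matters later because the point estimate is computed by row position.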

We calculate our pooled variance estimate using the equation below, and use the square root of this value as our estimated standard deviation.

\[s^2_p = \frac{(n_1 - 1) s^2_1 + (n_2 - 1) s^2_2}{n_1 + n_2 - 2}\]

\[s = 49.083\]
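As a quick check, the pooled estimate can be reproduced directly from the summary statistics in the table. This is a minimal sketch with the values typed in by hand rather than read from stattable; the variable names are our own.

```r
# Summary statistics copied from the Tesla vs. non-Tesla table
n1 <- 109; s1 <- 54.31141   # non-Tesla
n2 <- 77;  s2 <- 40.50874   # Tesla

# Pooled variance: average of the two sample variances, weighted by degrees of freedom
pooled_var <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)

# Pooled standard deviation used in the t-statistic
s <- sqrt(pooled_var)

print(round(s, 3))
```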

We use this calculated estimate in our calculations for the t-statistic, as shown in the below equation. Our t-statistic is 21.761.

\[t = \frac{(\overline{x_1}-\overline{x_2}) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[ t = 21.761\]

We complete our hypothesis test using the critical value approach. We decide to test at a 95% confidence level (a 5% significance level), and find that the critical values for a t distribution with 184 degrees of freedom are 1.973 and -1.973. We will reject the null hypothesis if the t-statistic is less than or equal to -1.973, or greater than or equal to 1.973.

\[ critical\,values = \pm 1.973\]

Our t-statistic is greater than our larger critical value, so we can reject our null hypothesis. There is sufficient evidence to conclude that the mean range of Tesla cars is not equal to the mean range of non-Tesla cars. From our visualizations it is clear that the Tesla vehicles have a higher mean range than the non-Tesla vehicles.

\[21.761>1.973\]
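The test statistic and critical value can be reproduced from the reported summary statistics. The sketch below assumes Tesla is group 1, which matches the positive t-statistic; the p-value line is an addition of ours and is not part of the original analysis.

```r
# Reported summary statistics
x1 <- 268.2597; n1 <- 77    # Tesla
x2 <- 109.2569; n2 <- 109   # non-Tesla
s  <- 49.083                # pooled standard deviation reported in this section
alpha <- 0.05

# Standard error of the difference in means under the equal-variance assumption
se <- s * sqrt(1 / n1 + 1 / n2)

# t-statistic under H0: mu1 - mu2 = 0
t_stat <- (x1 - x2) / se

# Two-sided critical value with n1 + n2 - 2 degrees of freedom
df     <- n1 + n2 - 2
t_crit <- qt(1 - alpha / 2, df)

# Two-sided p-value (our addition, as a cross-check on the critical value test)
p_val <- 2 * pt(-abs(t_stat), df)

reject_null <- abs(t_stat) >= t_crit
print(c(t_stat = t_stat, t_crit = t_crit))
```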

We then decide to estimate the confidence interval for the difference between the two means using the equation below. We are 95% confident that the difference between the population means in ranges of Tesla and non-Tesla electric vehicles is between 144.68 and 173.32 miles.

\[\overline{x_1}-\overline{x_2} \pm z*s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

#point estimate
pt_est <- stattable$Mean_Range[2] -  stattable$Mean_Range[1]

z <- qnorm(1-(alpha/2))

margin <- z * s * sqrt((1 / stattable$Sample_Size[1]) + (1 / stattable$Sample_Size[2]))

#interval estimate
int_est <- c(pt_est - margin, pt_est + margin)

\[Confidence\,Interval: (144.682, 173.324)\]
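The interval above uses a normal quantile, while the test itself uses a t distribution. As a cross-check, the sketch below computes both versions from the reported summary statistics; the t-based interval is our own addition and is slightly wider, although with 184 degrees of freedom the difference is small.

```r
# Reported summary statistics
diff_means <- 268.2597 - 109.2569   # Tesla minus non-Tesla
s  <- 49.083                        # pooled standard deviation
n1 <- 109; n2 <- 77
alpha <- 0.05

# Standard error of the difference in means
se <- s * sqrt(1 / n1 + 1 / n2)

# Interval using the normal quantile, as in the report
z_int <- diff_means + c(-1, 1) * qnorm(1 - alpha / 2) * se

# Interval using the t quantile with n1 + n2 - 2 degrees of freedom
t_int <- diff_means + c(-1, 1) * qt(1 - alpha / 2, n1 + n2 - 2) * se

print(round(z_int, 3))
print(round(t_int, 3))
```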

AWD vs 2WD Ranges

Next we decide to do a “difference between two population means” hypothesis test for all wheel drive and two wheel drive electric vehicles. The population variances are unknown, and for this type of test we will assume they are equal. We use several functions within the dplyr package to select only vehicles built after 2010, and classify our vehicles based on their type of drivetrain. We make a table of the primary statistics of interest for both all wheel drive and two wheel drive vehicles. Our null hypothesis is that there is no difference between the population mean ranges of all wheel drive and two wheel drive electric vehicles; our alternative hypothesis is that the two population means differ.

\[H_0: \mu_1-\mu_2 = 0\]

\[H_A: \mu_1-\mu_2 \neq 0\]

driveBin	Mean_Range	SD_Range	Sample_Size
2-wheel drive	132.0769	73.77239	130
4-wheel drive	274.9107	38.96865	56
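The sample sizes in this table are consistent with the drivetrain summary in the exploratory section: 53 all-wheel plus 3 four-wheel drive vehicles give 56, and 55 rear-wheel plus 75 front-wheel drive vehicles give 130. The binning code itself is not shown, so the sketch below is only one plausible way to build driveBin; the input vector is made up, and any label not listed as four-wheel (including a missing value) would fall into the two-wheel bin.

```r
# Made-up drive labels standing in for EVcarsTrim$drive
drive <- c("All-Wheel Drive", "4-Wheel Drive", "Rear-Wheel Drive",
           "Front-Wheel Drive", "Front-Wheel Drive")

# Labels treated as four-wheel drive (an assumption based on the group sizes)
four_wheel <- c("All-Wheel Drive", "4-Wheel Drive")

# Collapse the drivetrain labels into two bins
driveBin <- ifelse(drive %in% four_wheel, "4-wheel drive", "2-wheel drive")

print(table(driveBin))
```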

We calculate our pooled variance estimate using the equation below, and use the square root of this value as our estimated standard deviation.

\[s^2_p = \frac{(n_1 - 1) s^2_1 + (n_2 - 1) s^2_2}{n_1 + n_2 - 2}\]

\[s = 65.341\]

We use this calculated estimate in our calculations for the t-statistic, as shown in the below equation. Our t-statistic is -13.676.

\[t = \frac{(\overline{x_1}-\overline{x_2}) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[ t = -13.676\]

\[ critical\,values = \pm 1.973\]

Our t-statistic is less than our smaller critical value, so we can reject our null hypothesis. There is sufficient evidence to conclude that the mean range of all wheel drive electric cars is not equal to the mean range of two wheel drive electric cars. From our visualizations it is clear that the all wheel drive electric vehicles have a higher mean range than the two wheel drive vehicles.

\[-13.676<-1.973\]
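The two sample standard deviations here differ noticeably (about 73.8 vs. 39.0), so the equal-variance assumption is worth a sanity check. The sketch below is our own addition, not part of the original analysis: it computes Welch's unequal-variance t-statistic and its approximate degrees of freedom from the reported summary statistics, to see whether the conclusion depends on pooling.

```r
# Reported summary statistics
x1 <- 132.0769; s1 <- 73.77239; n1 <- 130   # 2-wheel drive
x2 <- 274.9107; s2 <- 38.96865; n2 <- 56    # 4-wheel drive

# Per-group variance of the sample mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch standard error and t-statistic (no pooled variance)
se_welch <- sqrt(v1 + v2)
t_welch  <- (x1 - x2) / se_welch

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided critical value at the 5% significance level
t_crit <- qt(0.975, df_welch)

reject_null <- abs(t_welch) >= t_crit
print(c(t_welch = t_welch, df_welch = df_welch, t_crit = t_crit))
```

If reject_null is TRUE here as well, the conclusion of the pooled test does not hinge on the equal-variance assumption.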

We then decide to estimate the confidence interval for the difference between the two means using the equation below. We are 95% confident that the difference between the population means in ranges of all wheel drive and two wheel drive electric vehicles is between 122.36 and 163.30 miles.

\[\overline{x_1}-\overline{x_2} \pm z*s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

#point estimate
pt_est <- stattable$Mean_Range[2] -  stattable$Mean_Range[1]

#two-sided critical value for a 95% interval
z <- qnorm(1-(alpha/2))

margin <- z * s * sqrt((1 / stattable$Sample_Size[1]) + (1 / stattable$Sample_Size[2]))

#interval estimate
int_est <- c(pt_est - margin, pt_est + margin)

\[Confidence\,Interval: (122.363, 163.304)\]

SUMMARY

As mentioned in our introduction, we analyzed the mtcars dataset, and narrowed it down to only electric vehicles. As concerns regarding climate change and talks of our carbon footprint become more prominent, more American consumers choose to drive electric cars to limit their contributions to pollution. Instead of being continually refueled with gasoline, electric cars have a battery which is charged. These cars burn no fossil fuels and are powered by clean energy, but one of their greatest limitations is the range that they are able to drive on a single charge. This is more of an issue for electric vehicles because it can be much harder to find a charging station than a gas station, depending on where you are. This being the case, potential buyers of electric vehicles are extremely concerned with the range that their future vehicle will be able to drive on a single charge, for both convenience and practical reasons. The charge time required to fully charge the electric vehicle is another important variable that can affect the practicality of an electric vehicle.

We decided to analyze certain attributes of electric cars and determine which other variables seem to be correlated with the range that these vehicles are able to travel on one charge, as well as the charge time required. We decided car make, year of production, and drivetrain were three variables that could be significant, so we isolated these variables and plotted them against one another using scatter plots. We also used histograms to determine the frequencies of the different makes and classes of cars. We found that both range and charge time increased as we moved to more recent years of production, and that Tesla vehicles were extremely prominent, representing over one third of the observations. Knowing how prominent Tesla vehicles are, we decided to compare their ranges and charge times to those of the other makes of vehicles. By changing the colors of the points on the scatter plot, we found that Teslas seem to be on the high end for both range and charge time when compared to the other electric vehicle makes. Putting these comparisons into tables confirms our primary findings: the Teslas and all wheel drive cars in our sample have higher mean ranges than their counterparts.

We used hypothesis tests to determine whether these differences are statistically significant given the sample sizes and variances of the observations in our dataset. In both cases, we reject our null hypothesis and conclude that there is significant evidence that the mean ranges of the two categories are different.

The implications to a potential electric car buyer are clear: if you would like your vehicle to have maximum range, you should buy a Tesla vehicle, an all wheel drive vehicle, or both. Teslas consistently have the best ranges and are superior in this aspect, and all wheel drive electric cars tend to have longer ranges than two wheel drive cars.

Our analysis is limited by the lack of observations for several of the other makes of electric cars that have competitive ranges. Jaguar, Audi, and Hyundai, the next three makes after Tesla when it comes to average range, all have five or fewer observations, compared to Tesla’s 77. With so few observations, a single outlier could have a drastic effect on the mean, pulling it further below Tesla than it should be. We could address this by locating additional data for the electric cars of these manufacturers and pulling it into our analysis. Another factor that we realized could be included in our analysis is the relationship between range and charge time. We could create a new variable that is range divided by charge time and see if Teslas and all wheel drive vehicles are more or less efficient than their counterparts when charge time is weighed against range per charge. Other future work could include further research into the cost implications of Teslas vs. other makes.
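A minimal sketch of the proposed range-per-charge-time variable is shown below. The data frame toy and the column name milesPerChargeHour are made up for illustration, and treating a charge240 value of 0 as "not recorded" is an assumption on our part rather than something confirmed by the data dictionary.

```r
# Made-up rows; charge240 is hours to charge at 240 V
toy <- data.frame(
  make      = c("Tesla", "Tesla", "Nissan", "Ford", "smart"),
  range     = c(265, 310, 107, 76, 58),
  charge240 = c(12, 10, 6, 0, 3),
  stringsAsFactors = FALSE
)

# Assumption: a charge time of 0 means the value was not recorded
toy$charge240[toy$charge240 == 0] <- NA

# Miles of range gained per hour of charging at 240 V
toy$milesPerChargeHour <- toy$range / toy$charge240

# Compare the metric for Tesla vs. non-Tesla, ignoring missing charge times
toy$teslaBin <- factor(ifelse(toy$make == "Tesla", "Tesla", "non-Tesla"),
                       levels = c("non-Tesla", "Tesla"))
metric_by_bin <- tapply(toy$milesPerChargeHour, toy$teslaBin, mean, na.rm = TRUE)

print(metric_by_bin)
```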

Laith Barakat & Matt Lekowski

December 2, 2019