Investigating Super Car Data

Investigating “Super Car” Data - Observing and making sense of outliers

df.car_spec_data <- read.csv(url("https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2015/01/auto-snout_car-specifications_COMBINED.txt"))
df.car_spec_data$year <- as.character(df.car_spec_data$year)

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Creating the plot

#-------------------------
# Horsepower vs. Top Speed
#-------------------------

ggplot(data=df.car_spec_data, aes(x=horsepower_bhp, y=top_speed_mph)) +
  geom_point(alpha=.2, size=2, color="#880011") +
  ggtitle("Horsepower vs. Top Speed") +
  labs(x="Horsepower, bhp", y="Top Speed,\n mph")

## Warning: Removed 10 rows containing missing values (geom_point).

There is nothing surprising here: most people would expect more horsepower to give a higher top speed.

The interesting feature of the graph is the horizontal line of points near the 150 mph mark.

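NOTE: The warning above is caused by rows with missing (NA) values in horsepower_bhp or top_speed_mph, not by the axis ranges, since no axis limits are set in this plot.
It can be silenced by filtering out those rows before plotting, or by passing na.rm = TRUE to geom_point().
Setting limits with scale_y_continuous(limits = c(lower, upper)) or ylim() would not remove it; any points outside the limits would be dropped with the same kind of warning.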
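To check where the dropped rows come from, one can count the missing values in the two plotted columns. The following is a minimal sketch on a small made-up data frame (`toy_cars` and its values are illustrative assumptions, not rows from the real data set); only the column names match those used in the plot above.

```r
# Toy data standing in for df.car_spec_data (values are made up for illustration)
toy_cars <- data.frame(
  horsepower_bhp = c(120, 300, NA, 450, 210),
  top_speed_mph  = c(115, 155, 140, NA, 130)
)

plotted_cols <- c("horsepower_bhp", "top_speed_mph")

# Count missing values in each plotted column
na_counts <- colSums(is.na(toy_cars[, plotted_cols]))
print(na_counts)

# Keep only rows where both plotted columns are present;
# a scatter plot of this subset has no rows to drop
toy_cars_complete <- toy_cars[complete.cases(toy_cars[, plotted_cols]), ]
nrow(toy_cars_complete)
```

On the real data, the same two steps would be applied to `df.car_spec_data` in place of `toy_cars`.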

#------------------------
# Histogram of Top Speed
#------------------------

ggplot(data=df.car_spec_data, aes(x=top_speed_mph)) +
  geom_histogram(fill="#880011") +  
  ggtitle("Histogram of Top Speed") +
  labs(x="Top Speed, mph", y="Count\nof Records")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 10 rows containing non-finite values (stat_bin).
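The `stat_bin()` message appears because no bin width was specified, so ggplot2 falls back to 30 bins. A minimal sketch of setting the width explicitly, using made-up speeds (`toy_speeds` is an illustrative assumption) and an arbitrarily chosen 5 mph bin width:

```r
library(ggplot2)

# Made-up top speeds for illustration; not taken from the real data set
toy_speeds <- data.frame(
  top_speed_mph = c(98, 112, 124, 131, 140, 155, 155, 155, 168, 186)
)

# binwidth = 5 makes each bar cover a 5 mph range, so stat_bin()
# no longer needs to fall back to its default of 30 bins
p <- ggplot(data = toy_speeds, aes(x = top_speed_mph)) +
  geom_histogram(binwidth = 5, fill = "#880011") +
  ggtitle("Histogram of Top Speed") +
  labs(x = "Top Speed, mph", y = "Count\nof Records")

print(p)
```

A narrower width makes a spike at a single speed easier to see, at the cost of a noisier histogram, so the value is worth experimenting with.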

#----------------------------------
# ZOOM IN ON SPEED CONTROLLED CARS
#
# What is the 'limited' speed?
#  (create bar chart)
#----------------------------------

df.car_spec_data %>%
  filter(top_speed_mph > 149 & top_speed_mph < 159) %>%
  ggplot(aes(x= as.factor(top_speed_mph))) +
    geom_bar(fill="#880011") +
    labs(x="Top Speed, mph")

Top speeds clustered just above 150 mph (155 mph in the table further down) can be due to speed limiters (governors) fitted to the vehicles.
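To read the limited speed as a number rather than off the bar chart, the same filter can feed a count table. This is a sketch on made-up speeds (`toy_cars` and the column name `n_records` are illustrative assumptions); the filter condition is the one used for the chart.

```r
library(dplyr)

# Made-up top speeds for illustration; not taken from the real data set
toy_cars <- data.frame(
  top_speed_mph = c(149, 150, 152, 155, 155, 155, 155, 158, 160, NA)
)

# Same range filter as the bar chart, then one row per distinct speed,
# sorted so the most common speed comes first
speed_counts <- toy_cars %>%
  filter(top_speed_mph > 149 & top_speed_mph < 159) %>%
  count(top_speed_mph, name = "n_records") %>%
  arrange(desc(n_records))

print(speed_counts)
```

The first row of such a table gives the most frequent speed in the window, which is the natural candidate for the limiter setting.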

When did this start happening? The spike begins sometime in the 1990s.

#------------------------
# Histogram of Top Speed
#  By DECADE
#------------------------

ggplot(data=df.car_spec_data, aes(x=top_speed_mph)) +
  geom_histogram(fill="#880011") +
  ggtitle("Histogram of Top Speed\nby decade") +
  labs(x="Top Speed, mph", y="Count\nof Records") +
  facet_wrap(~decade)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 10 rows containing non-finite values (stat_bin).
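The faceted plot relies on a `decade` column. If the data as loaded does not already contain one, it could be derived from `year` along the following lines. This is a sketch under that assumption; `decade_label` is a hypothetical helper name, and the commented-out `mutate()` call only shows how it might be applied.

```r
# Hypothetical helper: map a year (character or numeric) to a label such as "1990s"
decade_label <- function(year) {
  year_num <- as.numeric(as.character(year))
  # Round down to the start of the decade; keep missing years as NA
  ifelse(is.na(year_num),
         NA_character_,
         paste0(floor(year_num / 10) * 10, "s"))
}

decade_label(c("1987", "1994", "2003"))

# Possible application to the real data (assumes dplyr is loaded):
# df.car_spec_data <- mutate(df.car_spec_data,
#                            decade = as.factor(decade_label(year)))
```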

Which manufacturers are limiting speed?

#-------------------------------
# TABLE OF CAR COMPANIES WITH 
#  CARS AT MAX SPEED = 155
#-------------------------------
df.car_spec_data %>%
  filter(top_speed_mph == 155 & as.numeric(year) >= 1990) %>%  # year was stored as character above
  group_by(make_nm) %>% 
  summarize(count_speed_controlled = n()) %>%
  arrange(desc(count_speed_controlled))

## # A tibble: 37 x 2
##    make_nm        count_speed_controlled
##    <fct>                           <int>
##  1 BMW                                53
##  2 Audi                               51
##  3 Mercedes                           41
##  4 Jaguar                             14
##  5 Nissan                              9
##  6 Subaru                              7
##  7 Volkswagen(VW)                      7
##  8 Volvo                               7
##  9 Ford                                5
## 10 Mitsubishi                          5
## # ... with 27 more rows
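One detail of the `year` filter above is worth a note: `year` was converted to character when the data was loaded, and comparing a character value with a number in R compares them as text unless the value is converted back. For four-digit years the two comparisons agree, but the small illustration below (with made-up values) shows where they can differ.

```r
# Comparing text with a number coerces the number to text first
text_recent <- "1995" >= 1990               # compared as "1995" >= "1990"
text_short  <- "995" >= 1990                # compared as text: "9" sorts after "1"
num_short   <- as.numeric("995") >= 1990    # compared as numbers

c(text_recent = text_recent, text_short = text_short, num_short = num_short)
```

Converting with `as.numeric()` before comparing avoids relying on every year having the same number of digits.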

Source:

“Tutorial: How to Explore a Dataset in R (Using ggplot2 and Dplyr).” Sharp Sight, 4 Apr. 2018, www.sharpsightlabs.com/blog/data-analysis-example-r-supercars-part2/

Russell Chebahtah

10/27/2018