Investigating “Super Car” Data - Observing and making sense of outliers
df.car_spec_data <- read.csv(url("https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2015/01/auto-snout_car-specifications_COMBINED.txt"))
df.car_spec_data$year <- as.character(df.car_spec_data$year)
library(ggplot2)
library(dplyr)
#-------------------------
# Horsepower vs. Top Speed
#-------------------------
ggplot(data=df.car_spec_data, aes(x=horsepower_bhp, y=top_speed_mph)) +
  geom_point(alpha=.2, size=2, color="#880011") +
  ggtitle("Horsepower vs. Top Speed") +
  labs(x="Horsepower, bhp", y="Top Speed,\n mph")
## Warning: Removed 10 rows containing missing values (geom_point).
There is nothing surprising here: most people would expect more horsepower to give a higher top speed.
The interesting feature of the graph is the horizontal line of points near the 150 mph mark.
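A quick way to check that the horizontal band is one repeated value rather than a visual artifact is to tabulate the most frequent top speeds. The sketch below uses a small made-up data frame (`toy_cars`, hypothetical values) in place of `df.car_spec_data`; only the column name `top_speed_mph` comes from the data set above.

```r
library(dplyr)

# Hypothetical stand-in for df.car_spec_data; values are invented for illustration
toy_cars <- data.frame(
  top_speed_mph = c(120, 155, 155, 155, 130, 155, 186, NA, 149, 155)
)

# Count how often each top speed occurs, most frequent first
speed_counts <- toy_cars %>%
  filter(!is.na(top_speed_mph)) %>%
  count(top_speed_mph, sort = TRUE)

head(speed_counts, 3)
```

On the real data, replacing `toy_cars` with `df.car_spec_data` should show whether a single speed has a count far above its neighbors, which is what a limiter would produce.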
#------------------------
# Histogram of Top Speed
#------------------------
ggplot(data=df.car_spec_data, aes(x=top_speed_mph)) +
  geom_histogram(fill="#880011") +
  ggtitle("Histogram of Top Speed") +
  labs(x="Top Speed, mph", y="Count\nof Records")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (stat_bin).
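The `stat_bin()` message above appears because no bin width was supplied. A minimal sketch of setting one explicitly follows; the 5 mph width is an assumption chosen for illustration, and `toy_cars` is a made-up stand-in for `df.car_spec_data`.

```r
library(ggplot2)

# Hypothetical stand-in for df.car_spec_data; values are invented for illustration
toy_cars <- data.frame(
  top_speed_mph = c(112, 124, 130, 149, 155, 155, 155, 168, 186, 201)
)

# binwidth = 5 groups speeds into 5 mph bins (an assumed, not tuned, width)
p <- ggplot(data = toy_cars, aes(x = top_speed_mph)) +
  geom_histogram(binwidth = 5, fill = "#880011") +
  ggtitle("Histogram of Top Speed") +
  labs(x = "Top Speed, mph", y = "Count\nof Records")

print(p)
```

A narrower width makes a spike at one value easier to see, at the cost of a noisier histogram elsewhere.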
#----------------------------------
# ZOOM IN ON SPEED CONTROLLED CARS
#
# What is the 'limited' speed?
# (create bar chart)
#----------------------------------
df.car_spec_data %>%
  filter(top_speed_mph > 149 & top_speed_mph < 159) %>%
  ggplot(aes(x=as.factor(top_speed_mph))) +
    geom_bar(fill="#880011") +
    labs(x="Top Speed, mph")
Top speeds clustered just above 150 mph (the table further down uses 155 mph) can be due to limiters/governors fitted to vehicles.
When did this start happening? The spike begins sometime in the 1990s.
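One way to put a number on when the spike appears is to compute, per decade, the share of cars listed at the limited speed. This is only a sketch: `toy_cars` holds invented values, the decade is derived from `year` by rounding down to a multiple of ten (an assumption about how `decade` is defined), and 155 mph is taken from the table heading later in this section.

```r
library(dplyr)

# Hypothetical stand-in for df.car_spec_data; values are invented for illustration
toy_cars <- data.frame(
  year          = c("1985", "1988", "1992", "1995", "1997", "2003", "2006", "2008"),
  top_speed_mph = c(140, 155, 155, 155, 130, 155, 155, NA),
  stringsAsFactors = FALSE
)

limited_speed <- 155  # assumed limiter value, from the table heading in this section

share_by_decade <- toy_cars %>%
  filter(!is.na(top_speed_mph)) %>%                          # drop cars with no recorded top speed
  mutate(decade = floor(as.numeric(year) / 10) * 10) %>%     # e.g. "1995" -> 1990
  group_by(decade) %>%
  summarize(
    cars     = n(),                                          # cars with a known top speed
    at_limit = sum(top_speed_mph == limited_speed),          # cars exactly at the assumed limit
    share    = at_limit / cars
  ) %>%
  arrange(decade)

share_by_decade
```

A share that jumps from one decade to the next would support the reading of the faceted histograms, since raw counts alone also rise when more cars are recorded for a decade.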
#------------------------
# Histogram of Top Speed
# By DECADE
#------------------------
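df.car_spec_data %>%
  # `decade` is not created anywhere in this excerpt; derive it from the four-digit year
  mutate(decade = paste0(substr(year, 1, 3), "0s")) %>%
  ggplot(aes(x=top_speed_mph)) +
    geom_histogram(fill="#880011") +
    ggtitle("Histogram of Top Speed\nby decade") +
    labs(x="Top Speed, mph", y="Count\nof Records") +
    facet_wrap(~decade)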
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (stat_bin).
#-------------------------------
# TABLE OF CAR COMPANIES WITH
# CARS AT MAX SPEED = 155
#-------------------------------
df.car_spec_data %>%
  # year was converted to character above, so compare it numerically
  filter(top_speed_mph == 155 & as.numeric(year) >= 1990) %>%
  group_by(make_nm) %>%
  summarize(count_speed_controlled = n()) %>%
  arrange(desc(count_speed_controlled))
## # A tibble: 37 x 2
##    make_nm        count_speed_controlled
##    <fct>                           <int>
##  1 BMW                                53
##  2 Audi                               51
##  3 Mercedes                           41
##  4 Jaguar                             14
##  5 Nissan                              9
##  6 Subaru                              7
##  7 Volkswagen(VW)                      7
##  8 Volvo                               7
##  9 Ford                                5
## 10 Mitsubishi                          5
## # ... with 27 more rows
Source: