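This notebook explores the mpg dataset available in the ggplot2 package: the structure of the data, the distribution of each variable, the relationship between engine displacement and highway mileage, and a comparison of city mileage between automatic and manual transmissions.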
library(ggplot2)
This dataset provides fuel economy data from 1999 and 2008 for 38 popular models of cars. The dataset is shipped with the ggplot2 package.
Variable | Type | Description | Details |
---|---|---|---|
manufacturer | string | car manufacturer | 15 manufacturers |
model | string | model name | 38 models |
displ | numeric | engine displacement in liters | 1.6 - 7.0, median: 3.3 |
year | integer | year of manufacturing | 1999, 2008 |
cyl | integer | number of cylinders | 4, 5, 6, 8 |
trans | string | type of transmission | automatic or manual (many subtypes) |
drv | string | drive type | f = front-wheel, r = rear-wheel, 4 = four-wheel |
cty | integer | city mileage | miles per gallon |
hwy | integer | highway mileage | miles per gallon |
fl | string | fuel type | 5 types coded c, d, e, p, r (d = diesel, p = premium, r = regular) |
class | string | vehicle class | 7 types (compact, SUV, minivan etc.) |
Description of mpg dataset
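The counts in the Details column can be recomputed directly from the data. The following is a minimal sketch; the helper name `n_distinct_values` is introduced here for illustration only and is not part of the original notebook.

```r
library(ggplot2)  # provides the mpg dataset

# Hypothetical helper: number of distinct values in a vector
n_distinct_values <- function(x) length(unique(x))

# Distinct values for the categorical columns described in the table
detail_counts <- sapply(mpg[c("manufacturer", "model", "fl", "class")],
                        n_distinct_values)
print(detail_counts)

# Range and median of engine displacement
displ_stats <- c(min = min(mpg$displ),
                 median = median(mpg$displ),
                 max = max(mpg$displ))
print(displ_stats)
```

Recomputing these figures keeps the table honest if a different version of the dataset is loaded.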
str(mpg)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 11 variables:
$ manufacturer: chr "audi" "audi" "audi" "audi" ...
$ model : chr "a4" "a4" "a4" "a4" ...
$ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr "f" "f" "f" "f" ...
$ cty : int 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr "p" "p" "p" "p" ...
$ class : chr "compact" "compact" "compact" "compact" ...
Number of rows or observations or records
nrow(mpg)
[1] 234
Number of columns or variables
ncol(mpg)
[1] 11
Names of columns
colnames(mpg)
[1] "manufacturer" "model" "displ" "year"
[5] "cyl" "trans" "drv" "cty"
[9] "hwy" "fl" "class"
table(mpg$manufacturer)
audi chevrolet dodge ford honda hyundai jeep land rover
18 19 37 25 9 14 8 4
lincoln mercury nissan pontiac subaru toyota volkswagen
3 4 13 5 14 34 27
qplot(manufacturer, data=mpg, geom="bar", fill=manufacturer)
table(mpg$year)
1999 2008
117 117
qplot(factor(year), data=mpg, geom="bar", fill=factor(year))
summary(mpg$displ)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.600 2.400 3.300 3.472 4.600 7.000
qplot(displ, data=mpg, geom="histogram", bins=30)
boxplot(mpg$displ)
table(mpg$cyl)
4 5 6 8
81 4 79 70
qplot(cyl, data=mpg, geom="bar", fill=factor(cyl))
table(mpg$trans)
auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5) auto(s6)
5 2 83 39 6 3 3 16
manual(m5) manual(m6)
58 19
qplot(trans, data=mpg, geom="bar", fill=factor(trans))
table(mpg$drv)
4 f r
103 106 25
qplot(drv, data=mpg, geom="bar", fill=drv)
summary(mpg$cty)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.00 14.00 17.00 16.86 19.00 35.00
qplot(cty, data=mpg, geom="histogram", bins=20)
qplot(cty, data=mpg, geom="histogram", bins=30)
qplot(cty, data=mpg, geom="histogram", bins=40)
summary(mpg$hwy)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.00 18.00 24.00 23.44 27.00 44.00
qplot(hwy, data=mpg, geom="histogram", bins=20)
qplot(hwy, data=mpg, geom="histogram", bins=30)
qplot(hwy, data=mpg, geom="histogram", bins=40)
table(mpg$fl)
c d e p r
1 5 8 52 168
qplot(fl, data=mpg, geom="bar", fill=fl)
table(mpg$class)
2seater compact midsize minivan pickup subcompact suv
5 47 41 11 33 35 62
qplot(class, data=mpg, geom="bar", fill=class)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color='blue')
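The same graph can be built using qplot too. Note that passing color='red' directly makes qplot treat it as an aesthetic mapping rather than a fixed color, so the points take a default palette color and a legend appears: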
qplot(displ, hwy, data=mpg, geom="point", color='red')
Looking at this data separately for each class
qplot(displ, hwy, data=mpg, geom="point", color=class)
Fixing the color for all points requires wrapping the value in I():
qplot(displ, hwy, data=mpg, geom="point", color=I("blue"))
The same graphic through ggplot
ggplot(data=mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
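The difference between mapping a constant inside aes() and setting it outside aes() can be inspected without looking at the plot. The sketch below uses `layer_data()` to read back the colors that ggplot2 actually computes; the object names `p_mapped` and `p_fixed` are illustrative.

```r
library(ggplot2)

# Constant inside aes(): "red" is treated as data and mapped through the colour scale
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "red"))

# Constant outside aes(): every point is drawn in the literal colour
p_fixed <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "red")

# layer_data() returns the computed aesthetics for a layer
mapped_colours <- unique(layer_data(p_mapped, 1)$colour)
fixed_colours  <- unique(layer_data(p_fixed, 1)$colour)

print(mapped_colours)  # a default palette colour, not "red"
print(fixed_colours)
```

In qplot, wrapping the value in I() plays the same role as moving the color outside aes().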
Separate graphs for each vehicle class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color=class)) +
facet_wrap(~ class, nrow = 2)
Creating facets on the basis of two variables: number of cylinders and type of drive
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color=drv)) +
facet_grid(drv ~ cyl)
There are no cars for some combinations of drive and cylinders.
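The empty panels can be confirmed with a cross-tabulation. This is a minimal sketch using base R's `table()`; the names `drv_by_cyl` and `empty_cells` are illustrative.

```r
library(ggplot2)

# Cross-tabulate drive type (rows) against cylinder count (columns)
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
print(drv_by_cyl)

# Cells with a zero count correspond to empty facet panels
empty_cells <- which(drv_by_cyl == 0, arr.ind = TRUE)
print(empty_cells)
```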
Creating facets vertically on drive:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color=drv)) +
facet_grid(drv ~ .)
Creating facets horizontally on cylinders:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color=cyl)) +
facet_grid(. ~ cyl)
Estimating a smooth curve for the relationship between displacement and highway mileage:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Separate curve for each type of drive:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv))
Grouping data:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
Grouping data using specific color:
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv)
)
Hiding the legend:
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)
Overlaying a smooth curve on top of a scatter plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping=aes(color=class)) +
geom_smooth()
Using a color-based grouping for the scatter plot, with a common curve overlaid on top of it:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
Filtering data for a specific geom:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)
The se=FALSE setting removes the confidence interval around the estimated curve.
Grouping data by drive and then drawing a scatter plot with an estimated curve for each group:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
Visualizing the combinations of drives and cylinders available in the dataset:
ggplot(data = mpg) +
geom_point(mapping = aes(y = drv, x = cyl, color=factor(cyl)), size=4)
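Because many cars share the same combination of drive and cylinders, the points in this plot are drawn on top of each other and the plot only shows which combinations exist. One possible variation, sketched here as an assumption about what might be useful rather than as part of the original analysis, is `geom_count()`, which sizes each point by the number of observations at that position.

```r
library(ggplot2)

# Point size reflects how many cars share each drive/cylinder combination
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = factor(cyl), y = drv))
print(p_count)
```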
We will create a copy of our data frame:
mpg2 <- mpg
We will introduce new columns indicating whether the car is automatic or manual:
mpg2$is.automatic <- startsWith(mpg2$trans, 'auto')
mpg2$transmission <- ifelse(mpg2$is.automatic, 'auto', 'man')
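The `startsWith()` approach assumes that every trans value begins with either "auto" or "manual". A minimal sketch that checks this assumption by stripping the parenthesised subtype is given below; the variable names `prefix` and `is_automatic` are illustrative.

```r
library(ggplot2)

# Drop the "(...)" subtype suffix, leaving only "auto" or "manual"
prefix <- sub("\\(.*$", "", mpg$trans)
print(table(prefix))

# The prefix-based flag should agree with the startsWith() flag
is_automatic <- startsWith(mpg$trans, "auto")
print(all(prefix %in% c("auto", "manual")))
print(identical(is_automatic, prefix == "auto"))
```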
Let’s verify the counts:
table(mpg2$trans)
auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4)
5 2 83 39 6 3
auto(s5) auto(s6) manual(m5) manual(m6)
3 16 58 19
table(mpg2$is.automatic)
FALSE TRUE
77 157
Let’s compare the box plots of city mileage for the two types:
qplot(transmission, cty, data=mpg2, geom='boxplot', fill=transmission)
This graphic suggests that cars with a manual transmission tend to have higher city mileage than cars with an automatic one.
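The visual impression can be put into numbers by summarising city mileage per group. The following sketch recomputes the grouping from the original mpg data so that it is self-contained; the names `transmission`, `group_means` and `group_medians` are illustrative.

```r
library(ggplot2)

# Same auto/man labels as in mpg2$transmission
transmission <- ifelse(startsWith(mpg$trans, "auto"), "auto", "man")

# Mean and median city mileage for each transmission type
group_means   <- tapply(mpg$cty, transmission, mean)
group_medians <- tapply(mpg$cty, transmission, median)
print(rbind(mean = group_means, median = group_medians))
```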
A t-test can check whether this difference in means is statistically significant:
manual.cty <- mpg2$cty[!mpg2$is.automatic]
auto.cty <- mpg2$cty[mpg2$is.automatic]
t.test(manual.cty, auto.cty, alternative = "two.sided", var.equal = FALSE)
Welch Two Sample t-test
data: manual.cty and auto.cty
t = 4.5375, df = 132.32, p-value = 1.263e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.527033 3.887311
sample estimates:
mean of x mean of y
18.67532 15.96815
The p-value is very low, which is strong evidence against the null hypothesis of equal means. We reject the null hypothesis and conclude that mean city mileage differs between manual and automatic cars in this dataset.
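To see where the numbers in the output come from, the Welch statistic, its degrees of freedom (Welch-Satterthwaite approximation) and the two-sided p-value can be computed by hand. This is a sketch for illustration; the variable names are not from the original notebook.

```r
library(ggplot2)

is_automatic <- startsWith(mpg$trans, "auto")
manual_cty <- mpg$cty[!is_automatic]
auto_cty   <- mpg$cty[is_automatic]

# Squared standard error of each sample mean
se2_manual <- var(manual_cty) / length(manual_cty)
se2_auto   <- var(auto_cty) / length(auto_cty)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(manual_cty) - mean(auto_cty)) / sqrt(se2_manual + se2_auto)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_manual + se2_auto)^2 /
  (se2_manual^2 / (length(manual_cty) - 1) +
     se2_auto^2 / (length(auto_cty) - 1))

# Two-sided p-value from the t distribution
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_two_sided))
```

The values should match the t, df and p-value lines reported by t.test() with var.equal = FALSE.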
Let’s also test the hypothesis that the mean city mileage of manual cars is greater than that of automatic cars.
t.test(manual.cty, auto.cty, alternative = "greater", var.equal = FALSE)
Welch Two Sample t-test
data: manual.cty and auto.cty
t = 4.5375, df = 132.32, p-value = 6.317e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
1.718907 Inf
sample estimates:
mean of x mean of y
18.67532 15.96815
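Again the p-value is very small, so the data strongly support the alternative hypothesis that the mean city mileage of manual cars is greater than that of automatic cars. A small p-value does not prove the alternative; it only shows that the observed difference would be very unlikely if the null hypothesis were true. For completeness, the opposite one-sided alternative can be tested as well: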
t.test(manual.cty, auto.cty, alternative = "less", var.equal = FALSE)
Welch Two Sample t-test
data: manual.cty and auto.cty
t = 4.5375, df = 132.32, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 3.695437
sample estimates:
mean of x mean of y
18.67532 15.96815
Here the p-value is 1, so there is no evidence that the mean city mileage of manual cars is less than that of automatic cars.
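The three p-values are not independent results: they all derive from the same t statistic and degrees of freedom. The sketch below, with illustrative variable names, runs the three tests and checks the relationships that hold when the t statistic is positive.

```r
library(ggplot2)

is_automatic <- startsWith(mpg$trans, "auto")
manual_cty <- mpg$cty[!is_automatic]
auto_cty   <- mpg$cty[is_automatic]

# p-value of the Welch test under each alternative hypothesis
p_values <- sapply(c("two.sided", "greater", "less"), function(alt) {
  t.test(manual_cty, auto_cty, alternative = alt, var.equal = FALSE)$p.value
})
print(p_values)

# For a positive t statistic, the two-sided p is twice the "greater" p,
# and the "less" p is the complement of the "greater" p
print(all.equal(p_values[["two.sided"]], 2 * p_values[["greater"]]))
print(all.equal(p_values[["less"]], 1 - p_values[["greater"]]))
```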