#install.packages("ggplot2")
#dataset "mpg" - Fuel economy data from 1999 and 2008 for 38 popular models of car.
require(ggplot2)
## Loading required package: ggplot2
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
Identify which variables in your data set are numeric and which are categorical (factors).
- Numeric variables: displ, year, cyl, cty, hwy
- Categorical (factor) variables: manufacturer, model, trans, drv, fl, class
“year” could also be treated as a categorical factor. The data cover only two model years (1999 and 2008), and the cars sold in each year reflect the customer preferences of that time, which shift with fuel economy, environmental concerns, and so on.
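The classification above can also be checked programmatically. The following is a minimal sketch, assuming ggplot2 is installed; the names `is_num`, `numeric_vars`, `categorical_vars`, and `mpg_yr` are made up for illustration. It relies only on `is.numeric()`, so it should behave the same whether the text columns are stored as factors or as character vectors.

```r
library(ggplot2)  # provides the mpg data set

# Flag each column as numeric or not, then split the column names accordingly.
is_num <- sapply(mpg, is.numeric)
numeric_vars     <- names(mpg)[is_num]
categorical_vars <- names(mpg)[!is_num]

numeric_vars
categorical_vars

# Treat year as categorical on a copy, leaving mpg itself unchanged.
mpg_yr <- mpg
mpg_yr$year <- factor(mpg_yr$year)
levels(mpg_yr$year)
```

Working on a copy keeps the later `table()` and plotting calls, which use `mpg` directly, unaffected.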
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(mpg)
## manufacturer model displ year
## dodge :37 caravan 2wd : 11 Min. :1.600 Min. :1999
## toyota :34 ram 1500 pickup 4wd: 10 1st Qu.:2.400 1st Qu.:1999
## volkswagen:27 civic : 9 Median :3.300 Median :2004
## ford :25 dakota pickup 4wd : 9 Mean :3.472 Mean :2004
## chevrolet :19 jetta : 9 3rd Qu.:4.600 3rd Qu.:2008
## audi :18 mustang : 9 Max. :7.000 Max. :2008
## (Other) :74 (Other) :177
## cyl trans drv cty hwy
## Min. :4.000 auto(l4) :83 4:103 Min. : 9.00 Min. :12.00
## 1st Qu.:4.000 manual(m5):58 f:106 1st Qu.:14.00 1st Qu.:18.00
## Median :6.000 auto(l5) :39 r: 25 Median :17.00 Median :24.00
## Mean :5.889 manual(m6):19 Mean :16.86 Mean :23.44
## 3rd Qu.:8.000 auto(s6) :16 3rd Qu.:19.00 3rd Qu.:27.00
## Max. :8.000 auto(l6) : 6 Max. :35.00 Max. :44.00
## (Other) :13
## fl class
## c: 1 2seater : 5
## d: 5 compact :47
## e: 8 midsize :41
## p: 52 minivan :11
## r:168 pickup :33
## subcompact:35
## suv :62
table(mpg$fl)
##
## c d e p r
## 1 5 8 52 168
table(mpg$year)
##
## 1999 2008
## 117 117
# Run the table() function against two categorical variables.
table(mpg$class, mpg$fl)
##
## c d e p r
## 2seater 0 0 0 5 0
## compact 0 1 0 21 25
## midsize 0 0 0 15 26
## minivan 0 0 1 0 10
## pickup 0 0 3 0 30
## subcompact 1 2 0 3 29
## suv 0 2 4 8 48
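Raw counts in a two-way table are easier to compare once totals or proportions are attached. The sketch below is one possible way to do that with base R's `addmargins()` and `prop.table()`; the object name `class_by_fl` is an illustrative choice and the rounding to two decimals is arbitrary.

```r
library(ggplot2)  # provides the mpg data set

class_by_fl <- table(mpg$class, mpg$fl)

# Add row and column totals to the counts.
addmargins(class_by_fl)

# Share of each fuel type within a vehicle class (each row sums to 1).
round(prop.table(class_by_fl, margin = 1), 2)
```

Using `margin = 2` instead would give the share of each class within a fuel type.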
table(mpg$manufacturer, mpg$cty)
##
## 9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 33 35
## audi 0 0 0 0 0 3 3 3 4 1 2 2 0 0 0 0 0 0 0 0 0
## chevrolet 0 3 1 1 4 3 2 1 2 1 0 0 1 0 0 0 0 0 0 0 0
## dodge 4 8 2 8 4 3 4 3 1 0 0 0 0 0 0 0 0 0 0 0 0
## ford 0 3 1 7 5 5 1 1 2 0 0 0 0 0 0 0 0 0 0 0 0
## honda 0 0 0 0 0 0 0 0 0 0 0 1 0 1 3 2 1 1 0 0 0
## hyundai 0 0 0 0 0 0 1 2 4 3 2 2 0 0 0 0 0 0 0 0 0
## jeep 1 1 0 1 2 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## land rover 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## lincoln 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## mercury 0 0 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## nissan 0 0 1 0 2 1 0 0 1 5 0 1 0 2 0 0 0 0 0 0 0
## pontiac 0 0 0 0 0 0 2 1 2 0 0 0 0 0 0 0 0 0 0 0 0
## subaru 0 0 0 0 0 0 0 0 3 5 5 1 0 0 0 0 0 0 0 0 0
## toyota 0 1 0 1 1 7 4 1 5 1 0 7 1 0 2 0 2 1 0 0 0
## volkswagen 0 0 0 0 0 0 2 3 2 4 2 9 2 0 0 0 0 0 1 1 1
table(mpg$year,mpg$cty)
##
## 9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 33 35
## 1999 0 16 0 7 9 16 10 5 21 11 0 11 0 1 4 1 1 1 1 1 1
## 2008 5 4 8 14 10 8 9 11 5 9 11 12 4 2 1 1 2 1 0 0 0
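Because cty is numeric, crossing it with a categorical variable produces one column per distinct mileage value, which makes for a wide and sparse table. A grouped summary may be easier to read. This is a sketch using base R's `tapply()`; the choice of mean and median, and the object names, are assumptions made for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Mean city mileage for each model year.
mean_cty_by_year <- tapply(mpg$cty, mpg$year, mean)
mean_cty_by_year

# Median city mileage for each manufacturer, sorted from highest to lowest.
median_cty_by_make <- sort(tapply(mpg$cty, mpg$manufacturer, median),
                           decreasing = TRUE)
median_cty_by_make
```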
First, in base R.
boxplot(mpg$cty)
hist(mpg$hwy)
Look at the same single numeric variable, in ggplot2.
# ggplot2:
qplot(cty, data = mpg) + labs(title = "Counts of models in MPG", x = "cty MPG", y = "Count of models")
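Recent ggplot2 releases (3.4.0 onward) mark `qplot()` as deprecated, so the same histogram can also be written with `ggplot()` and an explicit geom. This is a sketch rather than a drop-in replacement: `binwidth = 1` and the name `p_hist` are assumptions, chosen because cty takes whole-number values.

```r
library(ggplot2)

# Histogram of city mileage, written with ggplot() instead of qplot().
# binwidth = 1 is an illustrative choice, not taken from the original call.
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Counts of models in MPG", x = "cty MPG", y = "Count of models")
p_hist
```

Setting the bin width explicitly also avoids the message ggplot2 prints when it falls back to its default of 30 bins.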
First, in base R:
plot(mpg$cty ~ mpg$displ)
Look at the same scatterplot in ggplot2. The order of cty and displ is reversed: the base R formula is written y ~ x, while qplot() takes x first and then y.
qplot(displ, cty, data=mpg)
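Naming the aesthetics explicitly removes any doubt about which variable goes on which axis. A minimal sketch of the same scatterplot with `ggplot()` follows; the axis labels and the name `p_scatter` are illustrative additions.

```r
library(ggplot2)

# Scatterplot of city mileage against engine displacement.
# x and y are named, so the argument order no longer matters.
p_scatter <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "City MPG")
p_scatter
```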