Load Dataset, Identify Variable Types
- Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2
data(mpg)
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
#Identify which variables in the set are numeric, and which are categorical (factors)
str(mpg)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
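The same classification can be read off programmatically instead of by eye. The following is a minimal sketch, not part of the original answer; the helper names `is_num`, `is_cat`, `numeric_vars`, and `categorical_vars` are made up for illustration. It assumes that newer ggplot2 releases may store the text columns of `mpg` as character rather than factor, so both are counted as categorical.

```r
library(ggplot2)  # provides the mpg dataset

# Flag each column as numeric or categorical.
# Assumption: character columns are treated like factors, because newer
# ggplot2 releases ship mpg with character columns instead of factors.
is_num <- vapply(mpg, is.numeric, logical(1))
is_cat <- vapply(mpg, function(x) is.factor(x) || is.character(x), logical(1))

numeric_vars <- names(mpg)[is_num]
categorical_vars <- names(mpg)[is_cat]

numeric_vars      # displ, year, cyl, cty, hwy
categorical_vars  # manufacturer, model, trans, drv, fl, class
```

Note that `year` and `cyl` are stored as integers, so this check reports them as numeric even though they take only a few distinct values and could reasonably be treated as categories.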
Generate Descriptive Statistics
- Generate summary-level descriptive statistics: Show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in your data set.
#Generate summary level descriptive statistics
summary(mpg)
##      manufacturer                 model         displ            year
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008
##  (Other)   :74    (Other)            :177
##       cyl               trans    drv          cty             hwy
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00
##                  (Other)   :13
##  fl             class
##  c:  1   2seater   : 5
##  d:  5   compact   :47
##  e:  8   midsize   :41
##  p: 52   minivan   :11
##  r:168   pickup    :33
##          subcompact:35
##          suv       :62
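`summary()` mixes counts for the factors with quantiles for the numeric columns. If only the six requested statistics for the numeric variables are wanted, one possible approach is sketched below. The names `num_cols` and `stats` are illustrative and do not come from the original answer.

```r
library(ggplot2)  # provides the mpg dataset

# Keep only the numeric columns; these statistics do not apply to factors.
num_cols <- mpg[vapply(mpg, is.numeric, logical(1))]

# Compute the six statistics for each variable (one column per variable).
stats <- sapply(num_cols, function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(min = min(x), q25 = q[1], median = median(x),
    mean = mean(x), q75 = q[2], max = max(x))
})

# Transpose so that each row is a variable, then round for display.
round(t(stats), 2)
```

The values should agree with the `summary()` output above, for example a median of 24 for `hwy`.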
Determine Frequencies
- Determine the frequency for one of the categorical variables.
#Determine frequency for one variable
table(mpg$manufacturer)
##
##       audi  chevrolet      dodge       ford      honda    hyundai
##         18         19         37         25          9         14
##       jeep land rover    lincoln    mercury     nissan    pontiac
##          8          4          3          4         13          5
##     subaru     toyota volkswagen
##         14         34         27
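The raw counts are easier to compare when sorted and expressed as shares of the total. A short sketch of that follow-up step, with `freq` and `pct` as illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

# Count cars per manufacturer and sort from most to least frequent.
freq <- sort(table(mpg$manufacturer), decreasing = TRUE)
freq

# Convert the counts to percentages of all 234 cars.
pct <- round(100 * prop.table(freq), 1)
pct
```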
- Determine the frequency for one of the categorical variables, by a different categorical variable.
#Determine frequency for one variable by another
table(mpg$manufacturer, mpg$class)
##
##              2seater compact midsize minivan pickup subcompact suv
##   audi             0      15       3       0      0          0   0
##   chevrolet        5       0       5       0      0          0   9
##   dodge            0       0       0      11     19          0   7
##   ford             0       0       0       0      7          9   9
##   honda            0       0       0       0      0          9   0
##   hyundai          0       0       7       0      0          7   0
##   jeep             0       0       0       0      0          0   8
##   land rover       0       0       0       0      0          0   4
##   lincoln          0       0       0       0      0          0   3
##   mercury          0       0       0       0      0          0   4
##   nissan           0       2       7       0      0          0   4
##   pontiac          0       0       5       0      0          0   0
##   subaru           0       4       0       0      0          4   6
##   toyota           0      12       7       0      7          0   8
##   volkswagen       0      14       7       0      0          6   0
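A two-way table like this one is often followed by row and column totals, or by row proportions that show how each manufacturer's cars are spread across classes. The sketch below uses base R's `addmargins()` and `prop.table()`; the names `tab` and `row_share` are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate manufacturer (rows) by vehicle class (columns).
tab <- table(mpg$manufacturer, mpg$class)

# Append row and column totals.
addmargins(tab)

# Share of each manufacturer's cars that fall in each class (rows sum to 1).
row_share <- prop.table(tab, margin = 1)
round(row_share, 2)
```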
Graph a Single Variable
- Create a graph for a single numeric variable.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
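The plotting code for this step is not shown, only the message it produced. The `stat_bin` message suggests a histogram drawn without an explicit bin width. A minimal sketch of such a plot follows; the choice of `hwy`, the bin width of 2, and the name `p_hist` are assumptions, not the original code.

```r
library(ggplot2)

# Histogram of a single numeric variable.
# Assumption: hwy is used here; the variable in the original plot is unknown.
# Setting binwidth explicitly avoids the "binwidth defaulted" message.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(title = "Highway fuel economy",
       x = "Highway miles per gallon",
       y = "Number of cars")

print(p_hist)
```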

Graph Two Variables
- Create a scatterplot of two numeric variables.
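No code or output survives for this step. One way it could be done is sketched here, under the assumption that engine displacement (`displ`) and highway mileage (`hwy`) are the two numeric variables; `p_scatter` is an illustrative name.

```r
library(ggplot2)

# Scatterplot of two numeric variables.
# Assumption: displ and hwy are chosen for illustration only.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +  # partial transparency reveals overlapping points
  labs(title = "Engine displacement vs. highway fuel economy",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

print(p_scatter)
```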
