1. Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
The mpg data set has been selected for exploratory analysis.
| Variable | Description |
|---|---|
| manufacturer | vehicle manufacturer |
| model | vehicle model |
| displ | engine displacement (in litres) |
| year | year (YYYY) |
| cyl | number of cylinders |
| trans | type of transmission |
| drv | drive type (f = front-wheel drive, r = rear-wheel drive, 4 = 4wd) |
| cty | city miles per gallon |
| hwy | highway miles per gallon |
| fl | fuel type (e = ethanol, d = diesel, r = regular, p = premium, c = CNG) |
| class | vehicle class |
The data set has the following characteristics:
library(ggplot2)
# work on a copy so the original mpg data set stays untouched
mpg2 <- mpg
# rename the columns with descriptive labels
colnames(mpg2) <- c('manufacturer',
                    'model',
                    'engine displacement',
                    'year',
                    'number of cylinders',
                    'type of transmission',
                    'drive type',
                    'city miles per gallon',
                    'highway miles per gallon',
                    'fuel type',
                    'class')
# show the characteristics of the original data set
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
We can see that there are 6 categorical variables (manufacturer, model, trans, drv, fl, and class) and 5 numeric variables (displ, year, cyl, cty, and hwy).
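The same classification can be derived programmatically instead of being read off the `str()` output. The following is a minimal sketch, assuming the `mpg` data set that ships with ggplot2; the names `isNumeric`, `numericVars`, and `categoricalVars` are illustrative and not part of the original analysis. Recent ggplot2 releases store the categorical columns as character vectors rather than factors, so the sketch tests for non-numeric columns instead of calling `is.factor()`.

```r
library(ggplot2)

# flag each column as numeric (TRUE) or not (FALSE)
isNumeric <- vapply(mpg, is.numeric, logical(1))

# split the column names into the two groups
numericVars     <- names(mpg)[isNumeric]
categoricalVars <- names(mpg)[!isNumeric]

numericVars
categoricalVars
```

Whether `year` and `cyl` are better treated as numeric or as categorical depends on the question being asked; the sketch only reports how they are stored.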
2. Generate summary level descriptive statistics: Show the mean, median, 25th and 75th quartiles, min, and max for each of the applicable variables in your data set.
summary(mpg)
## manufacturer model displ year
## dodge :37 caravan 2wd : 11 Min. :1.600 Min. :1999
## toyota :34 ram 1500 pickup 4wd: 10 1st Qu.:2.400 1st Qu.:1999
## volkswagen:27 civic : 9 Median :3.300 Median :2004
## ford :25 dakota pickup 4wd : 9 Mean :3.472 Mean :2004
## chevrolet :19 jetta : 9 3rd Qu.:4.600 3rd Qu.:2008
## audi :18 mustang : 9 Max. :7.000 Max. :2008
## (Other) :74 (Other) :177
## cyl trans drv cty hwy
## Min. :4.000 auto(l4) :83 4:103 Min. : 9.00 Min. :12.00
## 1st Qu.:4.000 manual(m5):58 f:106 1st Qu.:14.00 1st Qu.:18.00
## Median :6.000 auto(l5) :39 r: 25 Median :17.00 Median :24.00
## Mean :5.889 manual(m6):19 Mean :16.86 Mean :23.44
## 3rd Qu.:8.000 auto(s6) :16 3rd Qu.:19.00 3rd Qu.:27.00
## Max. :8.000 auto(l6) : 6 Max. :35.00 Max. :44.00
## (Other) :13
## fl class
## c: 1 2seater : 5
## d: 5 compact :47
## e: 8 midsize :41
## p: 52 minivan :11
## r:168 pickup :33
## subcompact:35
## suv :62
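`summary()` mixes counts for the categorical columns with quantiles for the numeric ones. If only the six requested statistics are wanted for the numeric variables, they can be collected into a single table. This is a sketch under the assumption that `mpg` comes from ggplot2; `sixNumbers`, `numericCols`, and `numericSummary` are hypothetical names introduced here.

```r
library(ggplot2)

# the six statistics requested in the question, for one numeric vector
sixNumbers <- function(x) {
  c(min    = min(x),
    q25    = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q75    = unname(quantile(x, 0.75)),
    max    = max(x))
}

# apply to every numeric column; the result has one column per variable
numericCols    <- vapply(mpg, is.numeric, logical(1))
numericSummary <- sapply(mpg[numericCols], sixNumbers)

round(numericSummary, 2)
```

The result is a matrix with one row per statistic, which is easier to pass to a table formatter than the printed `summary()` output.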
3. Determine the frequency for one of the categorical variables.
We can tabulate the frequency of the ‘manufacturer’ categorical variable as follows:
# tabulate the frequency
manufacturerFrequency <- table(mpg2$manufacturer)
# convert the table to a data.frame
manufacturerFrequencyTable <- data.frame(manufacturerFrequency)
# relabel the columns
colnames(manufacturerFrequencyTable) <- c('manufacturer', 'frequency')
| manufacturer | frequency |
|---|---|
| audi | 18 |
| chevrolet | 19 |
| dodge | 37 |
| ford | 25 |
| honda | 9 |
| hyundai | 14 |
| jeep | 8 |
| land rover | 4 |
| lincoln | 3 |
| mercury | 4 |
| nissan | 13 |
| pontiac | 5 |
| subaru | 14 |
| toyota | 34 |
| volkswagen | 27 |
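Raw counts are often easier to compare once they are sorted and expressed as proportions. A possible extension, using base R's `sort()` and `prop.table()` on the original `mpg` column, is sketched here; the variable names `manufacturerCounts` and `manufacturerShare` are assumptions for illustration.

```r
library(ggplot2)

# counts per manufacturer, largest first
manufacturerCounts <- sort(table(mpg$manufacturer), decreasing = TRUE)

# the same counts as a percentage of all vehicles in the data set
manufacturerShare <- round(100 * prop.table(manufacturerCounts), 1)

data.frame(manufacturer = names(manufacturerCounts),
           frequency    = as.vector(manufacturerCounts),
           percent      = as.vector(manufacturerShare))
```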
4. Determine the frequency for one of the categorical variables, by a different categorical variable.
We can cross-tabulate the ‘manufacturer’ categorical variable against the ‘number of cylinders’ variable as follows:
# cross-tabulate manufacturer (rows) by number of cylinders (columns)
manufacturerByCylinderFrequency <- table(mpg2$manufacturer,
                                         mpg2$`number of cylinders`)
| manufacturer | 4 | 5 | 6 | 8 |
|---|---|---|---|---|
| audi | 8 | 0 | 9 | 1 |
| chevrolet | 2 | 0 | 3 | 14 |
| dodge | 1 | 0 | 15 | 21 |
| ford | 0 | 0 | 10 | 15 |
| honda | 9 | 0 | 0 | 0 |
| hyundai | 8 | 0 | 6 | 0 |
| jeep | 0 | 0 | 3 | 5 |
| land rover | 0 | 0 | 0 | 4 |
| lincoln | 0 | 0 | 0 | 3 |
| mercury | 0 | 0 | 2 | 2 |
| nissan | 4 | 0 | 8 | 1 |
| pontiac | 0 | 0 | 4 | 1 |
| subaru | 14 | 0 | 0 | 0 |
| toyota | 18 | 0 | 13 | 3 |
| volkswagen | 17 | 4 | 6 | 0 |
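Row and column totals make a two-way table easier to sanity-check, and row proportions show how each manufacturer's models split across cylinder counts. The following sketch uses base R's `addmargins()` and `prop.table()` on the original `mpg` columns; the name `manufacturerByCyl` is introduced here purely for illustration.

```r
library(ggplot2)

# two-way table of manufacturer (rows) by number of cylinders (columns)
manufacturerByCyl <- table(manufacturer = mpg$manufacturer,
                           cylinders    = mpg$cyl)

# append a "Sum" row and a "Sum" column
addmargins(manufacturerByCyl)

# share of each manufacturer's models in each cylinder class (rows sum to 1)
round(prop.table(manufacturerByCyl, margin = 1), 2)
```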
5. Create a graph for a single numeric variable.
We can see that the distribution of cty is related to the number of cylinders: cars with fewer cylinders tend to get better city fuel economy (i.e., a higher cty).
# create a histogram of cty, stacked by number of cylinders
# (cty is recorded in whole miles per gallon, so a bin width of 1 gives one bar per value)
dplot <- ggplot(mpg, aes(cty, fill = factor(cyl)))
dplot + geom_histogram(binwidth = 1, position = "stack")
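The relationship between cylinder count and city mileage that the histogram suggests can also be checked numerically. This is a small sketch, assuming the ggplot2 `mpg` data; `ctyMeanByCyl` and `ctyMedianByCyl` are hypothetical names, and the grouping relies only on base R's `tapply()`.

```r
library(ggplot2)

# mean and median city mileage for each cylinder count
ctyMeanByCyl   <- tapply(mpg$cty, mpg$cyl, mean)
ctyMedianByCyl <- tapply(mpg$cty, mpg$cyl, median)

# one column per cylinder count, one row per statistic
round(rbind(mean = ctyMeanByCyl, median = ctyMedianByCyl), 2)
```

Only four of the 234 vehicles have five cylinders, so the figures for that group should be read with caution.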
6. Create a scatterplot of two numeric variables.
We can also see that the hwy and cty variables are strongly and positively related.
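# create a scatter plot, coloured by number of cylinders
# (ggplot() + geom_point() replaces qplot(), which is deprecated in recent ggplot2 releases)
ggplot(mpg, aes(cty, hwy, colour = factor(cyl))) + geom_point()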
Again, we see that a smaller number of cylinders is associated with higher cty and hwy values.
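The strength of the relationship between cty and hwy can be quantified rather than judged by eye. The sketch below, which assumes the ggplot2 `mpg` data and introduces the illustrative names `ctyHwyCor` and `ctyHwyFit`, computes the Pearson correlation and fits a simple least-squares line with base R's `cor()` and `lm()`.

```r
library(ggplot2)

# Pearson correlation between city and highway mileage
ctyHwyCor <- cor(mpg$cty, mpg$hwy)
ctyHwyCor

# least-squares line: hwy = intercept + slope * cty
ctyHwyFit <- lm(hwy ~ cty, data = mpg)
coef(ctyHwyFit)
```

A correlation close to 1 and a positive slope would support reading the scatterplot as a near-linear relationship.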