Honey Berk - IS607 - Week 6 Assignment

Load dataset and identify variable types

  1. Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
# Inspect the structure of mpg to see which variables are numeric and which are categorical (factors).
require(ggplot2)
## Loading required package: ggplot2
str(mpg)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
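
Reading the str() output above, the numeric variables are displ, year, cyl, cty and hwy, and the categorical variables (factors) are manufacturer, model, trans, drv, fl and class. The same split can be derived programmatically; the sketch below is one way to do it. It assumes that a character column should also count as categorical, since newer ggplot2 releases may ship mpg with character columns instead of factors. The names is_num, is_cat, numeric_vars and categorical_vars are made up for this example.

```r
library(ggplot2)  # provides the mpg dataset

# Assumption: a column is categorical if it is a factor or a character vector.
is_num <- vapply(mpg, is.numeric, logical(1))
is_cat <- vapply(mpg, function(x) is.factor(x) || is.character(x), logical(1))

numeric_vars     <- names(mpg)[is_num]
categorical_vars <- names(mpg)[is_cat]

print(numeric_vars)
print(categorical_vars)
```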

Generate descriptive statistics

  1. Generate summary level descriptive statistics: Show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in your data set.
summary(mpg)
##      manufacturer                 model         displ            year     
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999  
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999  
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004  
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004  
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008  
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008  
##  (Other)   :74    (Other)            :177                                 
##       cyl               trans    drv          cty             hwy       
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00  
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00  
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00  
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44  
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00  
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00  
##                  (Other)   :13                                          
##  fl             class   
##  c:  1   2seater   : 5  
##  d:  5   compact   :47  
##  e:  8   midsize   :41  
##  p: 52   minivan   :11  
##  r:168   pickup    :33  
##          subcompact:35  
##          suv       :62
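
summary() mixes the numeric statistics with factor counts. If only the "applicable" (numeric) variables are wanted, a compact table can be built by hand. This is a minimal sketch, not part of the original assignment; num_cols and stats are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# Keep only the numeric columns of mpg.
num_cols <- mpg[vapply(mpg, is.numeric, logical(1))]

# One row per statistic, one column per numeric variable.
stats <- sapply(num_cols, function(x) c(
  min    = min(x),
  q25    = unname(quantile(x, 0.25)),
  median = median(x),
  mean   = mean(x),
  q75    = unname(quantile(x, 0.75)),
  max    = max(x)
))

print(round(stats, 2))
```

The values should agree with the summary(mpg) output above, for example a median of 17 and a maximum of 35 for cty.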

Determine frequencies

  1. Determine the frequency for one of the categorical variables.
table(mpg$model)
## 
##            4runner 4wd                     a4             a4 quattro 
##                      6                      7                      8 
##             a6 quattro                 altima     c1500 suburban 2wd 
##                      3                      6                      5 
##                  camry           camry solara            caravan 2wd 
##                      7                      7                     11 
##                  civic                corolla               corvette 
##                      9                      5                      5 
##      dakota pickup 4wd            durango 4wd         expedition 2wd 
##                      9                      7                      3 
##           explorer 4wd        f150 pickup 4wd           forester awd 
##                      6                      7                      6 
##     grand cherokee 4wd             grand prix                    gti 
##                      8                      5                      5 
##            impreza awd                  jetta        k1500 tahoe 4wd 
##                      8                      9                      4 
## land cruiser wagon 4wd                 malibu                 maxima 
##                      2                      5                      3 
##        mountaineer 4wd                mustang          navigator 2wd 
##                      4                      9                      3 
##             new beetle                 passat         pathfinder 4wd 
##                      6                      7                      4 
##    ram 1500 pickup 4wd            range rover                 sonata 
##                     10                      4                      7 
##                tiburon      toyota tacoma 4wd 
##                      7                      7
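
With 38 models, the alphabetical table is hard to scan. Sorting by count and adding relative frequencies may make it easier to read. The sketch below uses only base R functions (sort, table, prop.table); model_counts is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# Frequencies of each model, most common first.
model_counts <- sort(table(mpg$model), decreasing = TRUE)
print(head(model_counts, 5))

# The same counts as percentages of all rows.
print(round(100 * prop.table(model_counts), 1)[1:5])
```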
  2. Determine the frequency for one of the categorical variables, by a different categorical variable.
table(mpg$model, mpg$class)
##                         
##                          2seater compact midsize minivan pickup subcompact
##   4runner 4wd                  0       0       0       0      0          0
##   a4                           0       7       0       0      0          0
##   a4 quattro                   0       8       0       0      0          0
##   a6 quattro                   0       0       3       0      0          0
##   altima                       0       2       4       0      0          0
##   c1500 suburban 2wd           0       0       0       0      0          0
##   camry                        0       0       7       0      0          0
##   camry solara                 0       7       0       0      0          0
##   caravan 2wd                  0       0       0      11      0          0
##   civic                        0       0       0       0      0          9
##   corolla                      0       5       0       0      0          0
##   corvette                     5       0       0       0      0          0
##   dakota pickup 4wd            0       0       0       0      9          0
##   durango 4wd                  0       0       0       0      0          0
##   expedition 2wd               0       0       0       0      0          0
##   explorer 4wd                 0       0       0       0      0          0
##   f150 pickup 4wd              0       0       0       0      7          0
##   forester awd                 0       0       0       0      0          0
##   grand cherokee 4wd           0       0       0       0      0          0
##   grand prix                   0       0       5       0      0          0
##   gti                          0       5       0       0      0          0
##   impreza awd                  0       4       0       0      0          4
##   jetta                        0       9       0       0      0          0
##   k1500 tahoe 4wd              0       0       0       0      0          0
##   land cruiser wagon 4wd       0       0       0       0      0          0
##   malibu                       0       0       5       0      0          0
##   maxima                       0       0       3       0      0          0
##   mountaineer 4wd              0       0       0       0      0          0
##   mustang                      0       0       0       0      0          9
##   navigator 2wd                0       0       0       0      0          0
##   new beetle                   0       0       0       0      0          6
##   passat                       0       0       7       0      0          0
##   pathfinder 4wd               0       0       0       0      0          0
##   ram 1500 pickup 4wd          0       0       0       0     10          0
##   range rover                  0       0       0       0      0          0
##   sonata                       0       0       7       0      0          0
##   tiburon                      0       0       0       0      0          7
##   toyota tacoma 4wd            0       0       0       0      7          0
##                         
##                          suv
##   4runner 4wd              6
##   a4                       0
##   a4 quattro               0
##   a6 quattro               0
##   altima                   0
##   c1500 suburban 2wd       5
##   camry                    0
##   camry solara             0
##   caravan 2wd              0
##   civic                    0
##   corolla                  0
##   corvette                 0
##   dakota pickup 4wd        0
##   durango 4wd              7
##   expedition 2wd           3
##   explorer 4wd             6
##   f150 pickup 4wd          0
##   forester awd             6
##   grand cherokee 4wd       8
##   grand prix               0
##   gti                      0
##   impreza awd              0
##   jetta                    0
##   k1500 tahoe 4wd          4
##   land cruiser wagon 4wd   2
##   malibu                   0
##   maxima                   0
##   mountaineer 4wd          4
##   mustang                  0
##   navigator 2wd            3
##   new beetle               0
##   passat                   0
##   pathfinder 4wd           4
##   ram 1500 pickup 4wd      0
##   range rover              4
##   sonata                   0
##   tiburon                  0
##   toyota tacoma 4wd        0
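
The model by class table is wide enough that the console wraps the suv column into a second block. For a more readable cross-tabulation, two variables with fewer levels can be used, and addmargins() adds row and column totals. The choice of drv and class here is only an illustration, and drv_by_class is a made-up name.

```r
library(ggplot2)  # provides the mpg dataset

# Drive type by vehicle class, with dimension names for labelled output.
drv_by_class <- table(drv = mpg$drv, class = mpg$class)

# Append a "Sum" row and column holding the marginal totals.
print(addmargins(drv_by_class))
```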

R Graphics

  1. Create a graph for a single numeric variable.

Plot the distribution of city mpg as a histogram, in base R.

hist(mpg$cty)

Look at the same single numeric variable, in ggplot2.

qplot(cty, data = mpg)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
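
The message above asks for an explicit bin width. Since cty holds whole numbers, a bin width of 1 gives one bar per mpg value; that choice is an assumption made for this sketch, not something the assignment specifies. p_hist is an illustrative name.

```r
library(ggplot2)

# binwidth = 1 is an illustrative choice: one bar per integer mpg value.
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1) +
  labs(x = "City miles per gallon", y = "Count")

print(p_hist)
```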

  2. Create a scatterplot of two numeric variables.

First, in base R:

plot(mpg$cyl, mpg$cty, xlab = "# Cylinders", ylab = "City Miles Per Gallon", pch = 19)

Look at the same scatterplot in ggplot2:

ggplot(mpg, aes(cyl, cty)) + geom_point()
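
Because cyl takes only a few integer values, many points in this scatterplot land on top of each other. One common way to reveal the overlap is horizontal jitter plus transparency, sketched below. The width and alpha values are arbitrary choices for illustration, and p_scatter is a made-up name.

```r
library(ggplot2)

# Jitter points left/right only (height = 0) so the mpg values stay exact.
p_scatter <- ggplot(mpg, aes(x = cyl, y = cty)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  labs(x = "# Cylinders", y = "City miles per gallon")

print(p_scatter)
```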

Extra plots

# Distribution of city mpg for each manufacturer (horizontal box plots)
ggplot(mpg, aes(x = manufacturer, y = cty)) + geom_boxplot() + coord_flip()

# Counts of city mpg values, stacked by manufacturer
ggplot(mpg, aes(x = cty, fill = manufacturer)) + geom_bar()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

# City mpg by year, grouped by manufacturer
ggplot(mpg, aes(x = year, y = cty, fill = manufacturer)) + geom_bar(stat="identity", position=position_dodge())
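
In the last plot, each manufacturer has several rows per year, so with stat = "identity" and dodging the bars for the same manufacturer and year are likely drawn on top of one another. One possible alternative is to aggregate first and plot one bar per manufacturer and year. The sketch below uses the mean as the summary, which is an assumption about what the plot is meant to show; cty_means and p_bar are illustrative names.

```r
library(ggplot2)

# Mean city mpg for each manufacturer/year pair (base R aggregate).
cty_means <- aggregate(cty ~ manufacturer + year, data = mpg, FUN = mean)

# Treat year as a category so the two model years sit side by side.
p_bar <- ggplot(cty_means, aes(x = factor(year), y = cty, fill = manufacturer)) +
  geom_col(position = position_dodge()) +
  labs(x = "Year", y = "Mean city mpg")

print(p_bar)
```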