WEEK 6 ASSIGNMENT- Jamey Etherton

R Markdown file

  1. Choose and load an R dataset that has at least two numeric variables and at least two categorical variables.
#install.packages("ggplot2")
#dataset "mpg" - Fuel economy data from 1999 and 2008 for 38 popular models of car.

require(ggplot2)
## Loading required package: ggplot2
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

Identify which variables in your data set are numeric, and which are categorical (factors).

- Numeric variables: displ, year, cyl, cty, hwy
- Categorical (factor) variables: manufacturer, model, trans, drv, fl, class

“year” can also be treated as a categorical factor: the data cover only two model years (1999 and 2008), and cars reflect customer preferences, which change over time with fuel economy, environmental concerns, and so on.

str(mpg)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
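The split above can also be derived from the column types instead of reading the str() output by eye. This is a minimal sketch, assuming the ggplot2 package is installed; the names is_num, numeric_vars, categorical_vars and year_f are made up for illustration. Newer ggplot2 versions may store the text columns as character rather than factor, so the sketch treats every non-numeric column as categorical.

```r
library(ggplot2)  # provides the mpg dataset

# Columns stored as numeric or integer count as numeric variables.
is_num <- sapply(mpg, is.numeric)
numeric_vars <- names(mpg)[is_num]

# Everything else (factor or character, depending on the ggplot2 version)
# is treated as categorical.
categorical_vars <- names(mpg)[!is_num]

print(numeric_vars)
print(categorical_vars)

# year is stored as an integer; a factor copy lets it act as a category.
year_f <- factor(mpg$year)
print(levels(year_f))
```

Keeping year_f as a separate vector leaves the original integer column untouched for numeric summaries.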

Generate Descriptive Statistics

  1. Generate summary-level descriptive statistics: show the mean, median, 25th and 75th percentiles (1st and 3rd quartiles), min, and max for each of the applicable variables in your data set.
summary(mpg)
##      manufacturer                 model         displ            year     
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999  
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999  
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004  
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004  
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008  
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008  
##  (Other)   :74    (Other)            :177                                 
##       cyl               trans    drv          cty             hwy       
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00  
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00  
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00  
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44  
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00  
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00  
##                  (Other)   :13                                          
##  fl             class   
##  c:  1   2seater   : 5  
##  d:  5   compact   :47  
##  e:  8   midsize   :41  
##  p: 52   minivan   :11  
##  r:168   pickup    :33  
##          subcompact:35  
##          suv       :62
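summary() mixes counts for the categorical columns with statistics for the numeric ones. If only the numeric statistics are wanted, something like the following could be used. It is a sketch, not part of the assignment; the choice of the three columns and the names num_cols and stats are assumptions made here for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Measurement columns to summarise (year and cyl are left out on purpose).
num_cols <- c("displ", "cty", "hwy")

# One column of statistics per variable.
stats <- sapply(mpg[num_cols], function(x) {
  c(min    = min(x),
    q25    = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q75    = unname(quantile(x, 0.75)),
    max    = max(x))
})

print(round(stats, 2))
```

The result is a plain numeric matrix, so a single value can be picked out by name, for example stats["median", "cty"].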
  1. Determine the frequency for one of the categorical variables. Run the table() function against a single categorical variable. See also: http://www.statmethods.net/stats/frequencies.html
table(mpg$fl)
## 
##   c   d   e   p   r 
##   1   5   8  52 168
table(mpg$year)
## 
## 1999 2008 
##  117  117
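Raw counts are often easier to compare as shares. A small sketch, assuming ggplot2 is installed, that passes the same table() result to prop.table(); the names fl_counts and fl_share are illustrative only.

```r
library(ggplot2)  # provides the mpg dataset

# Frequency of each fuel type, as in the table() call above.
fl_counts <- table(mpg$fl)

# Convert the counts to proportions that sum to 1.
fl_share <- prop.table(fl_counts)

# Show the shares as percentages with one decimal place.
print(round(100 * fl_share, 1))
```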
  1. Determine the frequency for one of the categorical variables, by a different categorical variable. “cty” is not a factor, but it takes a limited set of integer values, so it can be tabulated like a categorical variable against another categorical variable.
# Run the table() function against two categorical variables.
table(mpg$class, mpg$fl)
##             
##               c  d  e  p  r
##   2seater     0  0  0  5  0
##   compact     0  1  0 21 25
##   midsize     0  0  0 15 26
##   minivan     0  0  1  0 10
##   pickup      0  0  3  0 30
##   subcompact  1  2  0  3 29
##   suv         0  2  4  8 48
table(mpg$manufacturer, mpg$cty)
##             
##              9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 33 35
##   audi       0  0  0  0  0  3  3  3  4  1  2  2  0  0  0  0  0  0  0  0  0
##   chevrolet  0  3  1  1  4  3  2  1  2  1  0  0  1  0  0  0  0  0  0  0  0
##   dodge      4  8  2  8  4  3  4  3  1  0  0  0  0  0  0  0  0  0  0  0  0
##   ford       0  3  1  7  5  5  1  1  2  0  0  0  0  0  0  0  0  0  0  0  0
##   honda      0  0  0  0  0  0  0  0  0  0  0  1  0  1  3  2  1  1  0  0  0
##   hyundai    0  0  0  0  0  0  1  2  4  3  2  2  0  0  0  0  0  0  0  0  0
##   jeep       1  1  0  1  2  2  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
##   land rover 0  2  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   lincoln    0  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   mercury    0  0  0  3  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   nissan     0  0  1  0  2  1  0  0  1  5  0  1  0  2  0  0  0  0  0  0  0
##   pontiac    0  0  0  0  0  0  2  1  2  0  0  0  0  0  0  0  0  0  0  0  0
##   subaru     0  0  0  0  0  0  0  0  3  5  5  1  0  0  0  0  0  0  0  0  0
##   toyota     0  1  0  1  1  7  4  1  5  1  0  7  1  0  2  0  2  1  0  0  0
##   volkswagen 0  0  0  0  0  0  2  3  2  4  2  9  2  0  0  0  0  0  1  1  1
table(mpg$year,mpg$cty)
##       
##         9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 33 35
##   1999  0 16  0  7  9 16 10  5 21 11  0 11  0  1  4  1  1  1  1  1  1
##   2008  5  4  8 14 10  8  9 11  5  9 11 12  4  2  1  1  2  1  0  0  0
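Tabulating every distinct cty value gives very wide tables. One possible alternative is to bin cty into a few bands with cut() before cross-tabulating. This is only a sketch: the cut points (14 and 19) and the band labels are arbitrary choices made here, not part of the assignment.

```r
library(ggplot2)  # provides the mpg dataset

# Bin city mileage into three bands. cut() uses right-closed intervals
# by default, so the bands are (0,14], (14,19] and (19,Inf].
cty_band <- cut(mpg$cty,
                breaks = c(0, 14, 19, Inf),
                labels = c("low (<=14)", "mid (15-19)", "high (>=20)"))

# Cross-tabulate model year against the mileage band.
band_by_year <- table(mpg$year, cty_band)
print(band_by_year)
```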

R Graphics

  1. Create a graph for a single numeric variable.

First, in base R.

boxplot(mpg$cty)

hist(mpg$hwy)

Look at the same single numeric variable, in ggplot2.

# ggplot2:

qplot(cty, data=mpg)+labs(title="Counts of models in MPG", x="cty MPG", y="Count of models")
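Recent ggplot2 releases mark qplot() as deprecated (since version 3.4.0, to my knowledge), so the same histogram may be worth writing with ggplot() as well. A rough equivalent is sketched below; binwidth = 1 is an assumption chosen because cty is integer-valued, and the name p_hist is illustrative.

```r
library(ggplot2)

# Histogram of city mileage with one bar per whole mpg value.
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Counts of models in MPG",
       x = "cty MPG",
       y = "Count of models")

print(p_hist)
```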

  1. Create a scatterplot of two numeric variables.

First, in base R:

plot(mpg$cty ~ mpg$displ)

Look at the same scatterplot in ggplot2. The order of cty and displ is reversed: the base R formula is written y ~ x, while qplot() takes x first and then y.

qplot(displ, cty, data=mpg)
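To make the axis assignment explicit in both systems, the two plots could also be written as follows. This is a sketch, assuming ggplot2 is installed; the axis labels and the name p_scatter are added here for illustration.

```r
library(ggplot2)

# Base R: the formula reads "y ~ x", so cty is on the y axis.
plot(cty ~ displ, data = mpg,
     xlab = "Engine displacement (litres)", ylab = "City MPG")

# ggplot2: naming x and y in aes() removes any doubt about the order.
p_scatter <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "City MPG")

print(p_scatter)
```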