1. Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).

Use the mpg dataset from ggplot2 and review the data.

library(ggplot2)
data(mpg)
names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"        
##  [5] "cyl"          "trans"        "drv"          "cty"         
##  [9] "hwy"          "fl"           "class"
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

Let's examine the structure of the mpg dataset.

str(mpg)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

From the output above, there are 6 categorical variables (factors): manufacturer, model, trans, drv, fl, and class.
There are 5 numeric variables: displ, year, cyl, cty, and hwy.
(Note: displ is stored as a double; the other four are integers.)
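The same classification can be done programmatically rather than by reading the `str()` output. The following is a minimal sketch, not part of the original answer. It assumes that character columns should count as categorical, because newer ggplot2 releases store the text columns of mpg as character vectors rather than factors; the variable names `is_num`, `is_cat`, `numeric_vars`, and `categorical_vars` are made up for this example.

```r
library(ggplot2)  # provides the mpg dataset

# Flag numeric columns (integer and double both count as numeric).
is_num <- vapply(mpg, is.numeric, logical(1))

# Flag categorical columns.
# Assumption: character columns are treated as categorical alongside factors,
# so the result is the same whichever way the installed ggplot2 stores them.
is_cat <- vapply(mpg, function(col) is.factor(col) || is.character(col), logical(1))

numeric_vars     <- names(mpg)[is_num]
categorical_vars <- names(mpg)[is_cat]

numeric_vars
categorical_vars
```

If the text columns come back as character on your installation, `str(mpg)` will show `chr` instead of `Factor`, but the split into numeric and categorical variables is unchanged.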

2. Generate summary level descriptive statistics: Show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in your data set.

summary(mpg)
##      manufacturer                 model         displ            year     
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999  
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999  
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004  
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004  
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008  
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008  
##  (Other)   :74    (Other)            :177                                 
##       cyl               trans    drv          cty             hwy       
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00  
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00  
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00  
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44  
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00  
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00  
##                  (Other)   :13                                          
##  fl             class   
##  c:  1   2seater   : 5  
##  d:  5   compact   :47  
##  e:  8   midsize   :41  
##  p: 52   minivan   :11  
##  r:168   pickup    :33  
##          subcompact:35  
##          suv       :62
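`summary()` mixes counts for the categorical variables with statistics for the numeric ones. If only the numeric statistics are wanted in a single table, something like the sketch below may be easier to read. It is an illustration only; the names `num_cols` and `stats` are hypothetical, and `quantile()` is used with its default method, which is the one `summary()` uses.

```r
library(ggplot2)  # provides the mpg dataset

# Keep only the numeric columns of mpg.
num_cols <- mpg[vapply(mpg, is.numeric, logical(1))]

# One column per variable, one row per statistic.
stats <- sapply(num_cols, function(x) c(
  min    = min(x),
  q25    = unname(quantile(x, 0.25)),
  median = median(x),
  mean   = mean(x),
  q75    = unname(quantile(x, 0.75)),
  max    = max(x)
))

round(stats, 2)
```

The values should agree with the `summary(mpg)` output above for displ, year, cyl, cty, and hwy.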

3. Determine the frequency for one of the categorical variables.

table(mpg$manufacturer)
## 
##       audi  chevrolet      dodge       ford      honda    hyundai 
##         18         19         37         25          9         14 
##       jeep land rover    lincoln    mercury     nissan    pontiac 
##          8          4          3          4         13          5 
##     subaru     toyota volkswagen 
##         14         34         27
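A frequency table is often easier to read when sorted and expressed as shares of the total. The sketch below shows one way to do that with base R; `manufacturer_counts` and `manufacturer_pct` are names chosen for this example.

```r
library(ggplot2)  # provides the mpg dataset

# Counts per manufacturer, largest first.
manufacturer_counts <- sort(table(mpg$manufacturer), decreasing = TRUE)
manufacturer_counts

# The same counts as a percentage of all 234 rows.
manufacturer_pct <- round(prop.table(manufacturer_counts) * 100, 1)
manufacturer_pct
```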

4. Determine the frequency for one of the categorical variables, by a different categorical variable.

table(mpg$manufacturer, mpg$drv)
##             
##               4  f  r
##   audi       11  7  0
##   chevrolet   4  5 10
##   dodge      26 11  0
##   ford       13  0 12
##   honda       0  9  0
##   hyundai     0 14  0
##   jeep        8  0  0
##   land rover  4  0  0
##   lincoln     0  0  3
##   mercury     4  0  0
##   nissan      4  9  0
##   pontiac     0  5  0
##   subaru     14  0  0
##   toyota     15 19  0
##   volkswagen  0 27  0
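Raw counts make it hard to compare manufacturers of different sizes. As a possible extension, the two-way table can be given totals with `addmargins()` and converted to row proportions with `prop.table()`. This is a sketch; `counts` and `row_share` are hypothetical names.

```r
library(ggplot2)  # provides the mpg dataset

# Two-way table with labelled dimensions.
counts <- table(manufacturer = mpg$manufacturer, drv = mpg$drv)

# Append row and column totals.
addmargins(counts)

# Share of each drive type within a manufacturer (each row sums to 1).
row_share <- prop.table(counts, margin = 1)
round(row_share, 2)
```

Using `margin = 1` normalises within rows; `margin = 2` would instead give the share of each manufacturer within a drive type.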

5. Create a graph for a single numeric variable.

In base R

boxplot(mpg$displ, main="Distribution of engine displacement in litres")

hist(mpg$displ, xlab="Displ", main="engine displacement in litres - frequencies")

# Overlay a kernel density estimate on the histogram using lines().
hist(mpg$displ, freq=FALSE, xlab="Displ")
lines(density(mpg$displ))

# Histogram with a fitted normal density curve, drawn with curve().
hist(mpg$displ, freq=FALSE, xlab="Displ", col="lightgreen")
curve(dnorm(x, mean=mean(mpg$displ), sd=sd(mpg$displ)), add=TRUE, col="darkblue", lwd=2)

Using ggplot2

qplot(displ, data = mpg)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
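The message above asks for an explicit bin width, and `qplot()` is deprecated in recent ggplot2 releases. A roughly equivalent histogram can be written with `ggplot()` directly, as sketched below. The bin width of 0.5 litres, the colours, and the name `p_hist` are assumptions made for illustration, not choices from the original analysis.

```r
library(ggplot2)

# Histogram of engine displacement with an explicit bin width.
# Assumption: 0.5 litres is a reasonable bin width for values from 1.6 to 7.
p_hist <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.5, fill = "lightgreen", colour = "black") +
  labs(
    x = "Engine displacement (litres)",
    y = "Count",
    title = "Distribution of engine displacement"
  )

print(p_hist)
```

Setting `binwidth` removes the default-binning message and makes the plot reproducible across ggplot2 versions.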

6. Create a scatterplot of two numeric variables

In base R

plot(hwy ~ displ, data = mpg)

In ggplot2

# Engine displacement in litres (displ) vs. average highway miles per gallon (hwy). Points colored by number of cylinders.
qplot(displ, hwy, data = mpg, color = factor(cyl)) + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
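For reference, a similar plot can be built with `ggplot()` and an explicit smoothing method, which avoids the message above. This is a sketch under stated assumptions rather than a drop-in replacement: colour is mapped only inside `geom_point()`, so a single loess curve is fitted to all points, whereas the `qplot()` call fits one curve per cylinder group. The name `p_scatter` is hypothetical.

```r
library(ggplot2)

# Scatterplot of highway mileage against engine displacement.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Colour is set here only, so it does not split the smoother by group.
  geom_point(aes(colour = factor(cyl))) +
  # Explicit method and formula; one curve across all points.
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    colour = "Cylinders"
  )

print(p_scatter)

# Pearson correlation between the two variables, as a numeric companion to the plot.
cor(mpg$displ, mpg$hwy)
```

Fitting one overall curve is a design choice: the 5-cylinder group has very few cars, so a per-group loess fit is unstable for it.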