STAT 528 HW 1

1. List five functions that you could use to get more information about the mpg dataset.

    1. str
    2. head
    3. View
    4. summary
    5. dim
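As a quick illustration of what these functions report, here is a minimal sketch on a small made-up data frame. The `toy` object and its values are hypothetical stand-ins for mpg, not real rows from it.

```r
# Toy stand-in for mpg; the values are made up for illustration only.
toy <- data.frame(
  manufacturer = c("a", "a", "b"),
  cty = c(18L, 21L, 15L),
  hwy = c(26L, 29L, 20L),
  stringsAsFactors = FALSE
)

str(toy)             # column types and a preview of each column
head(toy, 2)         # first two rows
summary(toy)         # per-column summaries (quantiles for numeric columns)
toy_dim <- dim(toy)  # c(number of rows, number of columns)
toy_dim
# View(toy) would open a spreadsheet-style viewer in an interactive session
```

Swapping `toy` for `mpg` (after loading ggplot2) should give the corresponding overview of the real dataset.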

2. Using the ggplot2 package create a scatterplot of hwy and cty and describe the relationship. Why are there so few points visible? Use a geom that makes all points visible in the scatterplot.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

ggplot(mpg, aes(hwy,cty, color = class)) + geom_point()

scatterplot with overlapping points

ggplot(mpg, aes(hwy,cty, color = class)) + geom_jitter() #shows us all the data points by adding a bit of random noise to each data point to reduce overplotting

scatterplot with jittered points

Many of the observations share the same hwy and cty values, because both are recorded as whole numbers. This causes a number of the points to overlap. All of the points are technically on the scatterplot; however, due to the overlap, many of them are stacked on top of each other.
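One way to check how much overplotting there is would be to compare the number of rows with the number of distinct (hwy, cty) positions. The sketch below does this on a made-up data frame; `toy` and its values are assumptions for illustration, not rows from mpg.

```r
# Toy stand-in for mpg; values are made up.
toy <- data.frame(
  hwy = c(26L, 26L, 29L, 26L, 20L),
  cty = c(18L, 18L, 21L, 18L, 15L)
)

# Number of rows versus number of distinct (hwy, cty) positions
n_rows <- nrow(toy)
n_positions <- nrow(unique(toy[, c("hwy", "cty")]))

# How many rows sit at each occupied position
pair_counts <- as.data.frame(table(hwy = toy$hwy, cty = toy$cty))
pair_counts <- pair_counts[pair_counts$Freq > 0, ]

n_rows
n_positions
pair_counts
```

When `n_positions` is much smaller than `n_rows`, geom_point() will hide many observations. Besides geom_jitter(), ggplot2's geom_count() is another option, since it sizes each point by the number of rows at that position.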

3. Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance traveled with fixed amount of fuel). How could you convert cty and hwy into the European standard of liter/100 km?

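To convert mpg to L/100 km, divide 235.21 by the mpg value. The constant comes from the equation L/100km = 100/((mpg * 1.609)/3.785), where 1.609 is km per mile and 3.785 is liters per US gallon. For 1 mpg this evaluates to roughly 235.2 (235.21 with unrounded conversion factors).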

#the formula to convert from mpg to L/100 km is 235.21/mpg >> the European measure is the number of litres used per 100 km
#written out, L/100km = 100/((mpg * 1.609)/3.785), so 1 mpg corresponds to roughly 235.21 L/100 km
mpg2 <- mpg
mpg2$cty_non_US <- 235.21/mpg2$cty
mpg2$hwy_non_US <- 235.21/mpg2$hwy
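The constant can also be derived in code rather than typed in. The following is a minimal sketch, assuming the exact definitions 1 mile = 1.609344 km and 1 US gallon = 3.785411784 L; the helper name `mpg_to_l100km` is hypothetical and not part of the original solution.

```r
# Hypothetical helper: convert US miles per gallon to litres per 100 km.
mpg_to_l100km <- function(mpg) {
  km_per_mile <- 1.609344           # exact by definition
  litres_per_gallon <- 3.785411784  # US liquid gallon, exact by definition
  100 * litres_per_gallon / (km_per_mile * mpg)
}

mpg_to_l100km(1)         # the conversion constant, about 235.21
mpg_to_l100km(c(9, 35))  # higher mpg gives lower L/100 km
```

Because the relationship is a reciprocal, the ordering flips: the car with the largest mpg has the smallest L/100 km value.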

4. Which model is the most economic based on cty? Which model consumes most fuel using the European standard for cty?

manu <- unlist(mpg2[which.max(mpg2$cty), 1])
mod <- unlist(mpg2[which.max(mpg2$cty), 2])
eff <- unlist(mpg2[which.max(mpg2$cty), 8])
manu
## manufacturer 
## "volkswagen"
mod
##        model 
## "new beetle"
eff
## cty 
##  35

The most economical model is the Volkswagen New Beetle, which has a cty of 35 mpg.

eu.manu <- unlist(mpg2[which.max(mpg2$cty_non_US), 1])
eu.mod <- unlist(mpg2[which.max(mpg2$cty_non_US), 2])
eu.eff <- unlist(mpg2[which.max(mpg2$cty_non_US), 12])
eu.manu
## manufacturer 
##      "dodge"
eu.mod
##               model 
## "dakota pickup 4wd"
eu.eff
## cty_non_US 
##   26.13444

The model that consumes the most fuel based on the European standard is the Dodge Dakota Pickup 4wd, at 26.13 L/100 km.
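Note that which.max() returns only the first row that attains the maximum, so any tied rows are not reported. A minimal sketch of how ties could be listed, using a made-up data frame (`toy`, its model names, and its values are hypothetical):

```r
# Toy stand-in; values are made up.
toy <- data.frame(
  model = c("m1", "m2", "m3", "m4"),
  cty = c(9, 14, 9, 35),
  stringsAsFactors = FALSE
)
toy$cty_l100km <- 235.21 / toy$cty

# which.max() stops at the first maximum
first_worst <- toy$model[which.max(toy$cty_l100km)]

# Comparing against max() keeps every tied row
all_worst <- toy$model[toy$cty_l100km == max(toy$cty_l100km)]

first_worst
all_worst
```

Applying the same comparison to `mpg2$cty_non_US` would show whether other models share the highest consumption.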

5. Which manufacturer has the most models in this dataset? Which model has the most variations? (table and apply functions can be used to solve this problem)

manu2 <- mpg2 %>% group_by(manufacturer) %>% tally(sort = TRUE) #the examples under the help pages point to this structure
manu2
## # A tibble: 15 x 2
##    manufacturer     n
##           <chr> <int>
##  1        dodge    37
##  2       toyota    34
##  3   volkswagen    27
##  4         ford    25
##  5    chevrolet    19
##  6         audi    18
##  7      hyundai    14
##  8       subaru    14
##  9       nissan    13
## 10        honda     9
## 11         jeep     8
## 12      pontiac     5
## 13   land rover     4
## 14      mercury     4
## 15      lincoln     3
mpg2$manu_mod <- paste(mpg2$manufacturer,mpg2$model)
unique(mpg2$manu_mod)
##  [1] "audi a4"                       "audi a4 quattro"              
##  [3] "audi a6 quattro"               "chevrolet c1500 suburban 2wd" 
##  [5] "chevrolet corvette"            "chevrolet k1500 tahoe 4wd"    
##  [7] "chevrolet malibu"              "dodge caravan 2wd"            
##  [9] "dodge dakota pickup 4wd"       "dodge durango 4wd"            
## [11] "dodge ram 1500 pickup 4wd"     "ford expedition 2wd"          
## [13] "ford explorer 4wd"             "ford f150 pickup 4wd"         
## [15] "ford mustang"                  "honda civic"                  
## [17] "hyundai sonata"                "hyundai tiburon"              
## [19] "jeep grand cherokee 4wd"       "land rover range rover"       
## [21] "lincoln navigator 2wd"         "mercury mountaineer 4wd"      
## [23] "nissan altima"                 "nissan maxima"                
## [25] "nissan pathfinder 4wd"         "pontiac grand prix"           
## [27] "subaru forester awd"           "subaru impreza awd"           
## [29] "toyota 4runner 4wd"            "toyota camry"                 
## [31] "toyota camry solara"           "toyota corolla"               
## [33] "toyota land cruiser wagon 4wd" "toyota toyota tacoma 4wd"     
## [35] "volkswagen gti"                "volkswagen jetta"             
## [37] "volkswagen new beetle"         "volkswagen passat"

Dodge appears most often in this dataset, with 37 rows, while Toyota has the most distinct models, with 6 in the unique() output above.
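Since the question hints at table and the apply family, the three counts involved (rows per manufacturer, distinct models per manufacturer, and rows per model) could also be computed in base R. This is a sketch on a made-up data frame; `toy` and its contents are assumptions for illustration.

```r
# Toy stand-in; values are made up.
toy <- data.frame(
  manufacturer = c("a", "a", "a", "b", "b", "b", "b"),
  model = c("x", "x", "y", "p", "q", "r", "r"),
  stringsAsFactors = FALSE
)

# Rows per manufacturer (the quantity tally() counts)
rows_per_manu <- sort(table(toy$manufacturer), decreasing = TRUE)

# Distinct models per manufacturer
models_per_manu <- sort(
  sapply(split(toy$model, toy$manufacturer), function(m) length(unique(m))),
  decreasing = TRUE
)

# Rows (variations) per manufacturer-model combination
rows_per_model <- sort(
  table(paste(toy$manufacturer, toy$model)),
  decreasing = TRUE
)

rows_per_manu
models_per_manu
rows_per_model
```

The first result answers "which manufacturer appears most often", the second "which manufacturer has the most models", and the third "which model has the most variations".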

6. Using the ggplot2 package create side-by-side boxplots of cty by class. Describe the relationship in 2-3 sentences. Change the label for the y-axis to "city miles per gallon" (see ?ylab). Change the order of the categories in the class variable such that the boxplots are ordered from least efficient to most efficient as measured by cty.

ggplot(mpg, aes(class,cty, color = class)) + geom_boxplot() + ylab("city miles per gallon")

The boxplots allow for an easy comparison across vehicle classes. The subcompact class reaches the greatest cty mpg, but its range is quite large, so there is considerable variation within the class. The compact class offers a solid, relatively consistent cty mpg and has the highest median cty mpg.

ggplot(mpg, aes(reorder(class, cty, FUN = median), cty, color = class)) + geom_boxplot() + xlab("class") + ylab("city miles per gallon") #reorder() sorts the class levels by median cty

cty mpg by class ordered from least to greatest
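To see how an ordering by efficiency can be produced, the sketch below uses reorder() from base R's stats package to sort factor levels by a per-group summary. Ordering by the median is an assumption here (the mean would also be defensible), and `toy` and its values are made up.

```r
# Toy stand-in; values are made up.
toy <- data.frame(
  class = c("suv", "suv", "compact", "compact", "pickup", "pickup"),
  cty = c(14, 15, 20, 22, 12, 13),
  stringsAsFactors = FALSE
)

# Reorder the class levels by the median cty within each class
toy$class_ordered <- reorder(toy$class, toy$cty, FUN = median)
class_levels <- levels(toy$class_ordered)
class_levels  # lowest median first
```

ggplot2 draws discrete axes in factor-level order, so mapping the reordered factor to x places the least efficient class on the left.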