Objectives

The objective of this problem set is to gain experience working with the ggplot2 package for data visualization. To do this, I have provided a series of graphics, all created using the ggplot2 package. Your objective for this assignment is to write the code necessary to exactly recreate the provided graphics.

When completed, submit a link to your file on rpubs.com. Be sure to include echo = TRUE for each graphic so that I can see both the visualization and the code required to create it.
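As a hedged aside, echo = TRUE can also be set once as a default rather than repeated on every chunk. The sketch below assumes the document is knitted with the knitr package (the engine R Markdown normally uses) and that this line sits in a setup chunk near the top of the file; individual chunks can still override it.

```r
# Assumption: the knitr package is installed and the file is rendered with it.
# Set echo = TRUE as the default for every chunk that follows.
knitr::opts_chunk$set(echo = TRUE)

# Confirm the default that later chunks will inherit
print(knitr::opts_chunk$get("echo"))
```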

library(ggplot2)

Vis 1

This graphic is a traditional stacked bar chart. It uses the mpg dataset, which is built into the ggplot2 package, so you can access it simply with ggplot(mpg, ....). There is one modification from the defaults in this graphic: I renamed the legend for clarity.

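# Stacked bar chart: number of cars per class, filled by transmission type.
# geom_bar() counts rows by default, so no stat override is needed.
ggplot(mpg, aes(class, fill = trans)) +
  geom_bar() +
  scale_fill_discrete(name = "Transmission")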

The graph above is a stacked bar chart that displays the number of cars in each vehicle class, with each bar subdivided by transmission type (trans). Transmission types are represented by different colors, as shown in the legend. From the chart, we can see that the SUV class has the most cars, and within it the largest share comes from the auto(l4) transmission type.
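The reading of the chart can be checked numerically. The following is a minimal sketch, not part of the original assignment, that cross-tabulates class against trans with base R; the variable names (class_trans, class_totals, largest_class, top_trans) are my own.

```r
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate vehicle class against transmission type
class_trans <- table(mpg$class, mpg$trans)

# Total number of cars per class (the full bar heights)
class_totals <- rowSums(class_trans)
largest_class <- names(which.max(class_totals))

# Most common transmission within that class (the tallest segment of its bar)
top_trans <- names(which.max(class_trans[largest_class, ]))

print(class_totals)
print(c(class = largest_class, transmission = top_trans))
```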

Vis 2

This boxplot is also built using the mpg dataset. Notice the changes in axis labels and the altered theme_XXXX.

ggplot(mpg, aes(manufacturer, hwy)) + 
  geom_boxplot() + 
  labs(x='Vehicle Manufacturer', y='Highway Fuel Efficiency (miles/gallon)') + 
  coord_flip() + 
  theme_classic()

The boxplot above describes the distribution of highway mpg for each vehicle manufacturer. Each box shows the median, the first quartile, and the third quartile of the distribution. We can see that Honda has the highest median highway mpg, and its entire interquartile range lies above those of the other manufacturers, which suggests that it has the best highway fuel economy.
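The quartiles drawn in the boxes can also be computed directly. This is a rough sketch using base R's quantile() with its default method, which should match what geom_boxplot() draws for the box edges; the names by_maker and hwy_quartiles are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Split highway mpg values by manufacturer
by_maker <- split(mpg$hwy, mpg$manufacturer)

# First quartile, median, and third quartile per manufacturer (one row each)
hwy_quartiles <- t(sapply(by_maker, quantile, probs = c(0.25, 0.5, 0.75)))

# Sort so the manufacturer with the highest median comes first
hwy_quartiles <- hwy_quartiles[order(hwy_quartiles[, "50%"], decreasing = TRUE), ]
print(hwy_quartiles)
```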

Vis 3

This graphic is built with another dataset, diamonds, which is also built into the ggplot2 package. For this one I used an additional package, loaded with library(ggthemes); check it out to reproduce this view.

library(ggthemes)
ggplot(diamonds, aes(price, colour=cut, fill=cut)) + 
  geom_density(alpha=0.3) + 
  labs(x='Diamond Price (USD)', y='Density') +
  ggtitle('Diamond Price Density') +
  theme_economist()

The plot above displays the density of diamond prices for each type of cut. We can see that the Ideal cut has the highest and narrowest density peak at low prices, falling off steeply after about $2,500. The Fair cut spans the broadest range and has the widest price density of them all.
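To complement the visual impression, the price distribution per cut can be summarised numerically. The sketch below is only a rough check: it uses the median as a stand-in for where each density is centred and the interquartile range as a stand-in for its width, which is an assumption about what best mirrors the plot.

```r
library(ggplot2)  # provides the diamonds dataset

# Median price per cut: a rough indicator of where each density is centred
price_median <- tapply(diamonds$price, diamonds$cut, median)
print(price_median)

# Interquartile range of price per cut: a rough indicator of spread
price_iqr <- tapply(diamonds$price, diamonds$cut, IQR)
print(price_iqr)
```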

Vis 4

For this plot we are changing vis idioms to a scatter plot framework. Additionally, I am using the ggplot2 package to fit a linear model to the data, all within the plot framework. There are edited labels and theme modifications as well.

ggplot(iris, aes(x=Sepal.Length, y=Petal.Length)) + 
  geom_point() +
  geom_smooth(method=lm) +
  theme_minimal() +
  labs(x='Iris Sepal Length', y='Iris Petal Length', title='Relationship between Petal and Sepal Length')

The plot above displays the relationship between sepal length and petal length in the iris dataset. The line that runs through the points is the regression line, and the shaded area is the confidence interval around the line. From the plot, we can see that the two lengths have a positive linear relationship: petal length increases as sepal length increases.
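The line and band drawn by geom_smooth(method=lm) can be approximated outside the plot with lm() and predict(). This is a minimal sketch assuming the default formula y ~ x and a 95% confidence level; the three sepal lengths in new_x are arbitrary illustration points.

```r
# iris ships with base R, so no extra packages are needed
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted regression line
print(coef(fit))

# Fitted values with a 95% confidence band at a few example sepal lengths
new_x <- data.frame(Sepal.Length = c(5, 6, 7))
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
print(cbind(new_x, band))
```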

Vis 5

Finally, in this vis I extend the last example by plotting the same data but using an additional channel to communicate species-level differences. Again I fit a linear model to the data, but this time one for each species, and add additional theme and labeling modifications.

ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, colour=Species)) + 
  geom_point() +
  geom_smooth(method=lm, se=FALSE) +
  theme(panel.background = element_blank(), legend.position="bottom") +
  labs(x='Iris Sepal Length', y='Iris Petal Length', title='Relationship between Petal and Sepal Length')

The plot above displays the relationship between sepal length and petal length for the different species of flowers in the iris dataset. The line that runs through each species' points is its regression line. We can see that the relationship is steepest for the ‘virginica’ species, followed by ‘versicolor’. The relationship is flattest for ‘setosa’, and all three relationships are linear and positive.
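The ordering of the slopes can be confirmed by fitting one linear model per species, mirroring what geom_smooth does when colour is mapped to Species. This is a sketch with an illustrative variable name (slopes), again assuming the default formula y ~ x.

```r
# Fit Petal.Length ~ Sepal.Length separately for each species
# and keep only the slope of each fitted line
slopes <- sapply(split(iris, iris$Species), function(d) {
  coef(lm(Petal.Length ~ Sepal.Length, data = d))[["Sepal.Length"]]
})

# Slopes per species, rounded for readability
print(round(slopes, 3))
```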

Problem Set 3

Swaraj Rimal

December 13, 2017