This assignment is fun and very colorful. The task is to pick any of the practice data sets in the R package datasets, create five plots with the package ggplot2, and publish these plots on RPubs.
Each plot should come with an explanation of why the chosen geometry suits the data, and a summary of the “message” to be derived from the plot.
- Load all necessary packages
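A minimal sketch of this step, assuming the only packages needed are the two named in this post (ggplot2 for the plots, ggthemes for the calc theme). The datasets package is attached by default in a standard R session, so it needs no explicit call.

```r
# Assumption: ggplot2 and ggthemes are already installed,
# e.g. via install.packages(c("ggplot2", "ggthemes")).
library(ggplot2)   # plotting
library(ggthemes)  # extra themes such as theme_calc()
```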
Iris Data Frame
iris
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Petal Length, Petal Width & Species
The 5 variables are named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. Four of these variables are numeric, while one (Species) is a factor.
For the first plot, I want to compare the length and width of the petals of all three iris species. To communicate this data effectively, I will create a scatterplot.
- First, I will change the numeric variables to factors, because R thinks a number is automatically continuous.
- I want the shape and color of each point to vary by species, so I will weave those specifications into my code.
- For added clarity, I will use the calc theme from ggthemes.
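The original code chunk is not shown here, so the following is only a sketch of how these steps might fit together. It assumes the petal measurements stay numeric so that both axes remain continuous, and that Species is the only variable treated as a factor (the `factor()` call is a no-op on the stock data and is included only to make that explicit). The names `iris_df` and `petal_plot` are my own.

```r
library(ggplot2)
library(ggthemes)

# Work on a copy; Species is already a factor in the stock data set.
iris_df <- iris
iris_df$Species <- factor(iris_df$Species)

# Map species to both color and shape so the groups stay
# distinguishable even in a grayscale print.
petal_plot <- ggplot(iris_df,
                     aes(x = Petal.Length, y = Petal.Width,
                         color = Species, shape = Species)) +
  geom_point(size = 2) +
  labs(title = "Petal length vs. petal width by species",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_calc()

petal_plot
```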
Let’s see what this looks like…
- While this is great, it might be easier for the reader to compare the petal length and width of the different species if these were three separate plots side-by-side.
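One way to get three side-by-side panels is `facet_wrap()`. This is a sketch under the same assumptions as the scatterplot (numeric petal measurements, Species mapped to color and shape); `facet_plot` is a name I chose for illustration.

```r
library(ggplot2)
library(ggthemes)

# Same scatterplot, split into one panel per species on a single row.
facet_plot <- ggplot(iris,
                     aes(x = Petal.Length, y = Petal.Width,
                         color = Species, shape = Species)) +
  geom_point(size = 2) +
  facet_wrap(~ Species, nrow = 1) +
  labs(x = "Petal length (cm)", y = "Petal width (cm)") +
  theme_calc()

facet_plot
```

By default the panels share the same x and y scales, which is what makes a visual comparison across species meaningful.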
The plot thickens…
beaver1
For this second plot, I’m using the data set beaver1.
Reynolds (1994) describes a small part of a study of the long-term temperature dynamics of beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females, but only data from one period of less than a day for each of two animals is used here.
temp & time
The beaver1 data frame has 114 rows and 4 columns on body temperature measurements at 10 minute intervals.
- time - Time of observation, in the form 0330 for 3:30am
- temp - Measured body temperature in degrees Celsius.
I want to plot the body temperature at different time intervals. I’m going to try this first with the column geometry, geom_col, and see how the data plots. I will use the classic theme, theme_classic(), which ships with ggplot2 itself rather than ggthemes.
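A rough sketch of this first attempt. The recording runs over two calendar days, so instead of plotting the raw `time` column I convert `day` and `time` into minutes since the first reading; that conversion, and the names `beaver` and `elapsed_min`, are my own assumptions rather than part of the data set.

```r
library(ggplot2)

beaver <- beaver1

# time is stored as HHMM (e.g. 840 = 8:40am); turn day + time into
# minutes since the first observation so the x axis is continuous.
minutes <- (beaver$day - min(beaver$day)) * 24 * 60 +
  (beaver$time %/% 100) * 60 + beaver$time %% 100
beaver$elapsed_min <- minutes - min(minutes)

col_plot <- ggplot(beaver, aes(x = elapsed_min, y = temp)) +
  geom_col() +
  labs(x = "Minutes since first reading",
       y = "Body temperature (°C)") +
  theme_classic()

col_plot
```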
The time intervals are very short and the heights of the bars represent values in the data, so this is not an effective way to communicate the information.
I’m going to try to plot this data in a different way.
One variable is discrete, while the other is continuous. The boxplot compactly displays the distribution of a continuous variable. It visualizes five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.
I’m going to use jitter for all “outlying” points.
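A possible version of this boxplot. The text does not say which column serves as the discrete variable, so I assume it is `activ` (the activity indicator in beaver1), converted to a factor. The sketch hides the boxplot's own outlier markers and overlays every reading with `geom_jitter()`, so that outlying points are not drawn twice; jittering only the outliers would need extra filtering that is not shown here.

```r
library(ggplot2)

beaver <- beaver1
# Assumption: activ (0/1) is the discrete grouping variable.
beaver$activ <- factor(beaver$activ)

box_plot <- ggplot(beaver, aes(x = activ, y = temp)) +
  geom_boxplot(outlier.shape = NA) +               # suppress default outlier dots
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +  # spread points sideways only
  labs(x = "Activity outside the retreat (0 = no, 1 = yes)",
       y = "Body temperature (°C)") +
  theme_classic()

box_plot
```

Setting `height = 0` keeps the jitter horizontal, so the plotted temperatures stay at their measured values.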
sleep
Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.
group, extra & ID
A data frame with 20 observations on 3 variables.
- [, 1] extra numeric increase in hours of sleep
- [, 2] group factor drug given
- [, 3] ID factor patient ID
I want to plot the increase in sleep duration for each of the 10 patients caused by the two individual soporific drugs.
Bar charts are automatically stacked when multiple bars are placed at the same location. The order of the fill is designed to match the legend.
In this case, I’m going to plot the information for all 10 patients on the same two locations designated by the drugs administered to aid sleep.
- I have to create two different data subsets for the two different drugs administered.
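A sketch of how the two subsets and the stacked bars could be built. The names `drug1`, `drug2`, and `sleep_plot` are hypothetical, and I assume the bars are filled by patient ID so that each patient's contribution shows up as its own segment.

```r
library(ggplot2)

# One subset per drug (group is a factor with levels "1" and "2").
drug1 <- subset(sleep, group == 1)
drug2 <- subset(sleep, group == 2)

# Each layer draws the 10 patients at the same x location, so ggplot2
# stacks them; the fill order matches the legend.
sleep_plot <- ggplot() +
  geom_col(data = drug1, aes(x = group, y = extra, fill = ID)) +
  geom_col(data = drug2, aes(x = group, y = extra, fill = ID)) +
  labs(x = "Drug given",
       y = "Increase in hours of sleep",
       fill = "Patient ID")

sleep_plot
```

Some patients slept less than the control, so a few values of `extra` are negative; ggplot2 stacks those segments below zero rather than subtracting them from the positive stack.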
USArrests
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Murder, Rape & UrbanPop
A data frame with 50 observations on 4 variables.
- [,1] Murder numeric Murder arrests (per 100,000)
- [,2] Assault numeric Assault arrests (per 100,000)
- [,3] UrbanPop numeric Percent urban population
- [,4] Rape numeric Rape arrests (per 100,000)
For the crimes data frame, I want to plot the arrests per 100,000 residents for both murder and rape across the 50 US states in 1973 against the percent urban population.
I will use the geometry geom_step to do this. geom_step creates a stairstep plot, highlighting exactly when changes occur. Drawing each crime as its own step line will also help distinguish between the rape and murder arrest rates.
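A sketch of the step plot, assuming `crimes` is simply a copy of USArrests (the post does not show how it was created) and that each crime gets its own `geom_step()` layer with a fixed color label.

```r
library(ggplot2)

# Assumption: crimes is a plain copy of the built-in USArrests data.
crimes <- USArrests

step_plot <- ggplot(crimes, aes(x = UrbanPop)) +
  geom_step(aes(y = Murder, color = "Murder")) +
  geom_step(aes(y = Rape, color = "Rape")) +
  labs(x = "Percent urban population",
       y = "Arrests per 100,000 residents",
       color = "Crime")

step_plot
```

Several states share the same UrbanPop value, so the step lines can show vertical jumps at a single x position.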