Module 3: Making Graphs

Learning Objectives:

Become familiar with making boxplots, densityplots, histograms, barplots, and scatterplots.
Understand how to choose the right type of plot.

The Plots

First, read the Galton data.

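# Use the Galton dataset that is built into the mosaic package
library(mosaic)   # makes the Galton data available
library(ggplot2)  # provides ggplot() and the geom functions used below
#Galton = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdFlOcmt5bzY4VlFKRmtDdFJRMldTeEE&output=csv")
module4_data = subset(Galton, select=c(3,4,5))  # keep columns 3 to 5: mother, sex, height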

All plots follow the same general format, with only minor modifications. Each plot requires the use of the ggplot() function, which is used in the following way:

ggplot(data_name, aes(independent_variable_name, dependent_variable_name))
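
As a small illustration of this template, the sketch below uses a made-up data frame called toy (a hypothetical stand-in, not the Galton data) to show that the first two arguments of aes() are read as the x and y variables, so naming them explicitly should describe the same plot. A geom_point() layer is added only so that there is something to draw and compare.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  sex    = c("F", "F", "F", "M", "M", "M"),
  height = c(61, 63, 66, 66, 69, 72)
)

# Positional form: the first variable goes on the x-axis, the second on the y-axis
p_positional <- ggplot(toy, aes(sex, height)) + geom_point()

# Named form: the same mapping written out explicitly
p_named <- ggplot(toy, aes(x = sex, y = height)) + geom_point()

# Compare the data that ggplot2 computes for the point layer of each plot
print(all.equal(layer_data(p_positional), layer_data(p_named)))
```

The positional form is shorter, which is why the module uses it; the named form can be easier to read when you come back to old code.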

Boxplots graph groups of numerical data by their quartiles. They are extremely useful for providing visual descriptions of the data. The bottom and top of the box are always the first and third quartiles. The band inside the box is the second quartile, or the median. By default, each whisker drawn by geom_boxplot() extends to the most extreme observation that lies within 1.5 times the interquartile range (the height of the box) of the box edge. Observations beyond the whiskers are drawn as individual points and are treated as outliers. Boxplots are useful because they let you compare the spread of the data across the levels of a categorical variable. The disadvantage of boxplots is that you cannot see the mean or the shape of the distribution, which for now does not matter, but later hinders our ability to check for normality. Boxplots are made using the geom_boxplot() function.
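
To make the pieces of a boxplot concrete, here is a minimal base R sketch that computes them by hand. The heights vector is made up for illustration, and the 1.5 times IQR rule is the default whisker rule in geom_boxplot(); the whiskers themselves are drawn to the furthest observations that still fall inside the two limits computed here.

```r
# Hypothetical heights in inches, made up purely for illustration
heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 79)

# First quartile, median, and third quartile: the box and the band inside it
q <- quantile(heights, probs = c(0.25, 0.5, 0.75))

# Interquartile range: the height of the box
iqr <- q[["75%"]] - q[["25%"]]

# Limits for the whiskers under the 1.5 * IQR rule
lower_limit <- q[["25%"]] - 1.5 * iqr
upper_limit <- q[["75%"]] + 1.5 * iqr

# Observations beyond the limits would be drawn as individual outlier points
outliers <- heights[heights < lower_limit | heights > upper_limit]

print(q)
print(c(lower_limit, upper_limit))
print(outliers)
```

In this made-up sample only the largest value falls outside the limits, so it is the only point a boxplot would flag as an outlier.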

ggplot(module4_data, aes(sex, height)) + geom_boxplot() #don't forget the parentheses after geom_boxplot!

Let's walk through this code. We have explained the ggplot() function above, but this time we have added the geom_boxplot() function, which tells R to visually depict the data as a boxplot.

Suppose we want to add labels and a title to the graph. We can simply add more commands to the above code. Let's make the x-axis title 'sex', the y-axis title 'height', and the title of the graph 'Boxplot of height by sex'.

ggplot(module4_data, aes(sex, height)) + geom_boxplot() + labs(title = "Boxplot of height by sex", x = "sex", y = "height")

Notice that the labs() function contains three arguments: title (to create a title for the plot), x (to create an x-axis label), and y (to create a y-axis label). Each of these must be set equal to a character string within quotation marks.

Once you know how these three functions interact with each other, you'll be able to create all sorts of graphs in R with just minor code modifications! The remainder of this module explains those specific modifications.
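
One way to see how the three functions fit together is to store the ggplot() call in a variable and then add different layers to it. The sketch below does this with a made-up data frame; the names toy, base_plot, box_plot, and dot_plot are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  sex    = c("F", "F", "F", "F", "M", "M", "M", "M"),
  height = c(61, 63, 64, 66, 66, 68, 70, 72)
)

# Step 1: the data and the variable mapping, stored for reuse
base_plot <- ggplot(toy, aes(sex, height))

# Steps 2 and 3: choose a geom, then label the result
box_plot <- base_plot + geom_boxplot() +
  labs(title = "Boxplot of height by sex", x = "sex", y = "height")

# Swapping only the geom gives a different kind of plot from the same base
dot_plot <- base_plot + geom_point() +
  labs(title = "Points of height by sex", x = "sex", y = "height")

print(box_plot)
print(dot_plot)
```

Only the geom and the title change between the two plots, which is the sense in which new graphs need just minor code modifications.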

Densityplots show the distribution of a quantitative variable. Densityplots and boxplots are in the same vein, but a densityplot lets you see the shape and spread of the data and estimate the mean. Note that there is only one variable of interest, so we only input one variable name within the aes() function. Since the plot shows how common the different heights are, we set the x-axis label to 'height' and the y-axis label to 'frequency'. (Strictly speaking, the y-axis of a densityplot is a density, scaled so that the area under the curve equals 1, rather than a raw count.)

ggplot(module4_data, aes(height)) + geom_density() + labs(title = "Densityplot of height", x = "height", y = "frequency")
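
The y-axis of a densityplot is easier to interpret once you see that the curve is scaled so that the area beneath it is about 1. The base R sketch below checks this for a made-up heights vector, using density(), which computes a kernel density estimate of the same kind that geom_density() draws. The variable names are hypothetical.

```r
# Hypothetical heights in inches, made up purely for illustration
heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72)

# Kernel density estimate evaluated on an evenly spaced grid of x values
d <- density(heights)

# Approximate the area under the curve: sum of curve heights times grid spacing
step <- diff(d$x)[1]
area <- sum(d$y) * step
print(area)

# For roughly symmetric data, the mean sits near the center of the curve
print(mean(heights))
```

Because the area is fixed at roughly 1, a taller curve means the data are more concentrated, not that there are more observations.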

You can add the 'group' argument inside the aes() statement to draw a separate curve for each level of a categorical variable. In the following example we separate the data by sex.

ggplot(module4_data, aes(height, group = sex)) + geom_density() + labs(title = "Densityplot of height", x = "height", y = "frequency")

But how can you tell which curve represents which sex? After the 'group = sex' argument, add 'fill = sex', which fills each curve with a different color and automatically creates a legend, as in the code below.

ggplot(module4_data, aes(height, group = sex, fill = sex)) + geom_density() + labs(title = "Densityplot of height", x = "height", y = "frequency")
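
When filled curves overlap, one curve can hide part of the other. One common option, not used in this module itself, is the alpha argument of geom_density(), which makes the fill partly transparent. The sketch below uses a made-up data frame called toy, and the value 0.5 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  sex    = c("F", "F", "F", "F", "M", "M", "M", "M"),
  height = c(61, 63, 64, 66, 66, 68, 70, 72)
)

# alpha = 0.5 makes each filled curve half transparent so both stay visible
p <- ggplot(toy, aes(height, group = sex, fill = sex)) +
  geom_density(alpha = 0.5) +
  labs(title = "Densityplot of height", x = "height", y = "frequency")

print(p)
```

Note that alpha is set outside aes() because it is a fixed value rather than a variable in the data.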

Histograms are graphical representations of the distribution of a quantitative variable. They are very similar to densityplots, but the graphs are shown as adjacent rectangles. Each rectangle covers a range of values (a bin), and its height is equal to the number of observations that fall in that bin. Again, the only changes to make from the previous graphs are using the geom_histogram() function and changing the axis labels. Since histograms show the number of cases in each bin (as opposed to a density), we set the y-axis label to 'Count'. Also note that aes() only contains the variable height.

ggplot(module4_data, aes(height)) + geom_histogram() + labs(title = "Histogram of height", x = "Height", y = "Count")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Warning: position_stack requires constant width: output may be incorrect
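
The message above suggests setting the bin width yourself. A minimal sketch of how that might look is shown below, using a made-up data frame called toy; the bin width of 2 is an arbitrary choice for this example, and a sensible value depends on the units and spread of your own data.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  height = c(60, 61, 63, 64, 64, 65, 66, 66, 67, 68, 70, 72)
)

# binwidth = 2 makes every rectangle cover a range of 2 units on the x-axis
p <- ggplot(toy, aes(height)) +
  geom_histogram(binwidth = 2) +
  labs(title = "Histogram of height", x = "Height", y = "Count")

print(p)
```

Smaller bin widths show more detail but produce a noisier plot; larger ones smooth the shape but can hide features of the data.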

To differentiate between the sexes on a histogram, add the arguments group = sex and fill = sex within the aes() function.

ggplot(module4_data, aes(height, group = sex, fill = sex)) + geom_histogram() + labs(title = "Histogram of height", x = "Height", y = "Count")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Warning: position_stack requires constant width: output may be incorrect

Barplots are graphical representations of the frequency of a categorical variable. The graphs are shown as bars, with the height of each bar representing the number of observations in that category. Barplots are similar to histograms, but barplots are for categorical variables while histograms are for quantitative variables.

ggplot(module4_data, aes(sex)) + geom_bar() + labs(title = "Barplot of sex", x = "Sex", y = "Count")
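
The bar heights drawn by geom_bar() are simply the counts of each category, which you can check with table() in base R. The sketch below does this for a made-up data frame called toy (a hypothetical stand-in, not the Galton data).

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  sex = c("F", "F", "F", "M", "M", "M", "M", "M")
)

# Count the observations in each category by hand
counts <- table(toy$sex)
print(counts)

# geom_bar() counts the rows in each category and draws one bar per category
p <- ggplot(toy, aes(sex)) +
  geom_bar() +
  labs(title = "Barplot of sex", x = "Sex", y = "Count")

print(p)
```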

Scatterplots show the relationship between two quantitative (continuous) variables. Individual points are plotted to help determine whether a relationship exists between the two variables, and they are the usual starting point when you want to find a best-fit line. Later, you will use these plots to graph standard curves, but for now we will just focus on creating these plots.

ggplot(module4_data, aes(mother, height)) + geom_point() + labs(title = "Scatterplot of height by mother", x = "Mother", y = "Height")

But what if we want to differentiate between the sexes on this plot? We only have to add one argument. Look at the ggplot() function, specifically at the new input within the aes() function: colour = sex. This argument tells R to differentiate the sexes by colour. (Note that 'colour' is spelled the British way here; ggplot2 also accepts the American spelling 'color'.)

ggplot(module4_data, aes(mother, height, colour = sex)) + geom_point() + labs(title = "Scatterplot of height by mother", x = "Mother", y = "Height")
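
This section mentions best-fit lines. Although the module does not cover them yet, one possible way to overlay a straight best-fit line is geom_smooth() with method = "lm", sketched below on a made-up data frame called toy. The call to lm() is included only to show the intercept and slope of that same line as numbers.

```r
library(ggplot2)

# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  mother = c(60, 62, 63, 64, 65, 66, 67, 69),
  height = c(63, 64, 66, 65, 67, 68, 68, 71)
)

# Points plus a least-squares line; se = FALSE hides the shaded error band
p <- ggplot(toy, aes(mother, height)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of height by mother", x = "Mother", y = "Height")

# The same line fitted directly, to read off the intercept and slope
fit <- lm(height ~ mother, data = toy)
print(coef(fit))

print(p)
```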

Recap

All plots use the same general format. There are three steps to creating proper plots.

Use the standard ggplot() function: ggplot(data_name, aes(independent_variable, dependent_variable))
Specify the type of plot - you will have to use one of the 'geom_specifyHere()' functions presented in this module
Label your graphs with the labs() function, using the three arguments shown in this module: 'title', 'x', and 'y'.

Problems to Submit

Using the life expectancy data, you will practice making:

Boxplots
Densityplots
Barplots
Scatterplots

The data contains information on countries' male and female life expectancies, GNPs, birth and death rates, and global regions. The data is loaded for you below and is referred to as 'lifeData'.

# fetchGoogle() needs the RCurl package; if it is missing, run install.packages("RCurl") first
lifeData = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdEtHdXJzSlZFbVYzeUg0NnNuTGJ6SVE&output=csv")

head(lifeData)

Insert your code in R Markdown chunks below.

Create a single-variable boxplot of female life expectancies.
Notice that this data frame contains one categorical variable: Region. Keeping this information in mind, create a boxplot of female life expectancies by region.

# Insert your boxplot code here

Create a densityplot of female life expectancies by region.

# Insert your densityplot code here

Based on the above densityplot, approximate the average East.Eur female life expectancy (you do not need any R code to answer this question).
Create a barplot showing the number of cases per region.

# Insert your barplot code here

Create a scatterplot showing the relationship between female life expectancies and countries' GNPs.

# Insert your scatterplot code here

Create a histogram showing female life expectancies by region.

# Insert your histogram code here