First, read the Galton data.
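# Use the Galton dataset that is built into the mosaic package
library(mosaic)   # provides the Galton data
library(ggplot2)  # provides ggplot(), the geom functions, and labs()
#Galton = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdFlOcmt5bzY4VlFKRmtDdFJRMldTeEE&output=csv")
# Keep columns 3 to 5 of Galton
module4_data = subset(Galton, select=c(3,4,5))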
All plots follow the same general format, with only minor modifications. Each plot requires the use of the ggplot() function, which is used in the following way:
ggplot(data_name, aes(independent_variable_name, dependent_variable_name))
Boxplots summarize groups of numerical data by their quartiles, giving a compact visual description of the data. The bottom and top of each box are always the first and third quartiles. The band inside the box is the second quartile, or median. In geom_boxplot(), each whisker extends from the box to the most extreme observation that lies within 1.5 times the interquartile range (IQR) of the box edge; the whiskers are not a confidence interval. Observations beyond the whiskers are drawn as individual points and treated as outliers. Boxplots are useful because they let you compare the center and variability of the data across the levels of a categorical variable. The disadvantage of boxplots is that you cannot see the mean, which does not matter for now but later hinders our ability to check for normality. Boxplots are made using the geom_boxplot() function.
ggplot(module4_data, aes(sex, height)) + geom_boxplot() #don't forget the parentheses after geom_boxplot!
Let's walk through this code. We have explained the ggplot() function above, but this time we have added the geom_boxplot() function, which tells R to visually depict the data as a boxplot.
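To make the parts of a boxplot concrete, the sketch below recomputes them by hand for the female heights. It follows the rule documented for geom_boxplot() (whiskers reach the most extreme points within 1.5 times the IQR of the box). It assumes the mosaicData package, which supplies Galton to mosaic, is installed; the variable names are only illustrative.

```r
library(mosaicData)  # assumption: this package supplies the Galton data

# Same three columns as module4_data: mother, sex, height
module4_data <- subset(Galton, select = c(mother, sex, height))

# Heights for one group, to keep the arithmetic easy to follow
female_height <- module4_data$height[module4_data$sex == "F"]

# Quartiles (box bottom, median band, box top) and the interquartile range
q <- quantile(female_height, c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]

# Fences: whiskers stop at the most extreme observations inside these limits
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
inside <- female_height[female_height >= lower_fence & female_height <= upper_fence]
whisker_ends <- range(inside)

# Observations outside the fences are the points drawn as outliers
outliers <- female_height[female_height < lower_fence | female_height > upper_fence]

print(q)
print(whisker_ends)
print(length(outliers))
```

Comparing these numbers with the plot for the "F" group is a quick way to check your reading of the box, band, and whiskers.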
Suppose we want to add labels and a title to the graph. We can simply add more commands to the above code. Let's make the x-axis title 'sex', the y-axis title 'height', and the title of the graph 'Boxplot of height by sex'.
ggplot(module4_data, aes(sex, height)) + geom_boxplot() + labs(title = "Boxplot of height by sex", x = "sex", y = "height")
Notice that the labs() function contains three arguments: title (to create a title for the plot), x (to create an x-axis label), and y (to create a y-axis label). Each of these must be set equal to a character string within quotation marks.
Once you know how these three functions interact with each other, you'll be able to create all sorts of graphs in R with just minor code modifications! The remainder of this module explains those specific modifications.
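One way to see how the three functions fit together is to build the plot one piece at a time and store each stage in a variable. This is only a sketch: the names base_plot, box_plot, and final_plot are made up for illustration, and it assumes the ggplot2 and mosaicData packages are installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# Step 1: data and variable mapping only; nothing is drawn yet
base_plot <- ggplot(module4_data, aes(sex, height))

# Step 2: add a geom to choose how the data are drawn
box_plot <- base_plot + geom_boxplot()

# Step 3: add the title and axis labels
final_plot <- box_plot +
  labs(title = "Boxplot of height by sex", x = "sex", y = "height")

print(final_plot)
```

Swapping geom_boxplot() for another geom in step 2 is the kind of minor modification the rest of the module relies on.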
Densityplots show the distribution of a quantitative variable. Densityplots and boxplots are in the same vein, but a densityplot lets you see the shape and spread of the data and estimate the mean. Note that there is only one variable of interest, so we only input one variable name within the aes() function. Since the plot shows how frequently different heights occur, we label the x-axis 'height' and the y-axis 'frequency'. Strictly speaking, the y-axis is a density (the area under the curve equals 1) rather than a raw count.
ggplot(module4_data, aes(height)) + geom_density() + labs(title = "Densityplot of height", x = "height", y = "frequency")
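If you want to check a visual estimate of the mean, you can compute it and mark it on the densityplot. The sketch below adds a dashed vertical line with geom_vline(); the variable name mean_height is illustrative, and the ggplot2 and mosaicData packages are assumed to be installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# Compute the mean height so it can be compared with a visual estimate
mean_height <- mean(module4_data$height)

# Same densityplot as before, with a dashed vertical line at the mean
density_plot <- ggplot(module4_data, aes(height)) +
  geom_density() +
  geom_vline(xintercept = mean_height, linetype = "dashed") +
  labs(title = "Densityplot of height", x = "height", y = "frequency")

print(density_plot)
```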
You can add the “group” argument inside the aes() statement to draw a separate curve for each level of a categorical variable. In the following example we separate the data by sex.
ggplot(module4_data, aes(height, group = sex)) + geom_density() + labs(title = "Densityplot of height", x = "height", y = "frequency")
But how can you tell which curve represents which sex? After the 'group = sex' argument, you can add 'fill = sex', which colors each curve by sex and automatically creates a legend. See the plot below.
ggplot(module4_data, aes(height, group = sex, fill = sex)) + geom_density() + labs(title = "Densityplot of height", x = "height", y = "frequency")
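When filled curves overlap, one can hide part of the other. A possible remedy, sketched here, is to make the fill partly transparent with the alpha argument of geom_density(). The value 0.5 is an arbitrary choice, and the ggplot2 and mosaicData packages are assumed to be installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# alpha = 0.5 makes each filled curve half transparent,
# so the overlapping region of the two sexes stays visible
overlap_plot <- ggplot(module4_data, aes(height, group = sex, fill = sex)) +
  geom_density(alpha = 0.5) +
  labs(title = "Densityplot of height", x = "height", y = "frequency")

print(overlap_plot)
```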
Histograms are graphical representations of the distribution of a quantitative variable. They are very similar to densityplots, but the data are grouped into bins and drawn as adjacent rectangles. The height of each rectangle equals the number of observations that fall in that bin. Again, the only changes to make from the previous graphs are using the geom_histogram() function and relabeling the axes. Since histograms show the number of cases in each bin (as opposed to the density), we label the y-axis 'Count'. Also note that aes() only contains the variable height.
ggplot(module4_data, aes(height)) + geom_histogram() + labs(title = "Histogram of height", x = "Height", y = "Count")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
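The message above suggests setting the bin width yourself. As a sketch, the code below uses binwidth = 1, so each bar covers one unit of height; the value 1 is an illustrative choice, not a recommendation from the original module. It assumes the ggplot2 and mosaicData packages are installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# binwidth = 1 replaces the default of 30 bins with bins one unit wide
hist_plot <- ggplot(module4_data, aes(height)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Histogram of height", x = "Height", y = "Count")

print(hist_plot)

# Inspect the bins ggplot2 computed: one row per bar
bins <- ggplot_build(hist_plot)$data[[1]]
print(head(bins[, c("xmin", "xmax", "count")]))
```

Smaller bin widths show more detail but make the plot noisier, so it is worth trying a few values.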
If we want to differentiate between the sexes on a histogram, we add the arguments group = sex and fill = sex within the aes() function.
ggplot(module4_data, aes(height, group = sex, fill = sex)) + geom_histogram() + labs(title = "Histogram of height", x = "Height", y = "Count")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
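By default, geom_histogram() stacks the bars of the two groups on top of each other. If you would rather see the two distributions overlaid, one option is position = "identity" together with a transparent fill, as in this sketch. The binwidth and alpha values are arbitrary choices, and the ggplot2 and mosaicData packages are assumed to be installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# position = "identity" draws each group's bars from zero instead of stacking;
# alpha = 0.5 keeps both groups visible where they overlap
overlay_hist <- ggplot(module4_data, aes(height, group = sex, fill = sex)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
  labs(title = "Histogram of height", x = "Height", y = "Count")

print(overlay_hist)

# One set of bins is computed per sex
overlay_bins <- ggplot_build(overlay_hist)$data[[1]]
```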
Barplots are graphical representations of the frequency of a categorical variable. Each category is drawn as a bar whose height is the number of cases in that category. Barplots are similar to histograms, but barplots are for categorical variables while histograms are for quantitative variables.
ggplot(module4_data, aes(sex)) + geom_bar() + labs(title = "Barplot of sex", x = "Sex", y = "Count")
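The bar heights are simply the counts of each category, which you can confirm with base R's table() function. The sketch below compares the table with the values ggplot2 computes for the bars; it assumes the ggplot2 and mosaicData packages are installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# Count the cases in each category directly
sex_counts <- table(module4_data$sex)
print(sex_counts)

# Build the barplot and pull out the bar heights ggplot2 computed
bar_plot <- ggplot(module4_data, aes(sex)) +
  geom_bar() +
  labs(title = "Barplot of sex", x = "Sex", y = "Count")
bar_data <- ggplot_build(bar_plot)$data[[1]]

print(bar_plot)
print(bar_data$count)
```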
Scatterplots are used to examine the relationship between two quantitative (continuous) variables, and they are the usual starting point for finding a best-fit line. Each observation is plotted as an individual point so that you can judge whether a relationship exists between the two variables. Later, you will use these plots to graph standard curves, but for now we will just focus on creating these plots.
ggplot(module4_data, aes(mother, height)) + geom_point() + labs(title = "Scatterplot of height by mother", x = "Mother", y = "Height")
But what if we want to differentiate between the sexes on this plot? We only have to add one argument. Look at the new input to the ggplot() function below, within aes(): colour = sex. This argument tells R to differentiate the sexes by colour (note that 'colour' is spelled the British way here; ggplot2 also accepts 'color').
ggplot(module4_data, aes(mother, height, colour = sex)) + geom_point() + labs(title = "Scatterplot of height by mother", x = "Mother", y = "Height")
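As a preview of the best-fit line mentioned earlier, the sketch below overlays a straight line on the scatterplot with geom_smooth(method = "lm") and fits the same line with lm() to see its coefficients. This goes slightly beyond what the module requires at this point, and it assumes the ggplot2 and mosaicData packages are installed.

```r
library(ggplot2)
library(mosaicData)  # assumption: this package supplies the Galton data

module4_data <- subset(Galton, select = c(mother, sex, height))

# Scatterplot with a least-squares line; se = FALSE hides the shaded band
fit_plot <- ggplot(module4_data, aes(mother, height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Scatterplot of height by mother", x = "Mother", y = "Height")

print(fit_plot)

# The same line fitted directly: intercept and slope
fit <- lm(height ~ mother, data = module4_data)
print(coef(fit))
```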
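All plots use the same general format. There are three steps to creating proper plots: call ggplot() with the data and the aes() variable mapping, add a geom function (such as geom_boxplot()) to choose the plot type, and add labs() to set the title and axis labels.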
Using the life expectancy data, you will practice making boxplots, densityplots, barplots, scatterplots, and histograms.
The data contains information on countries' male and female life expectancies, GNPs, birth and death rates, and global regions. The data is loaded for you below and referred to as 'lifeData'.
# fetchGoogle() needs the RCurl package; if it is missing, run install.packages("RCurl") first
lifeData = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdEtHdXJzSlZFbVYzeUg0NnNuTGJ6SVE&output=csv")
head(lifeData)
Insert your code in R Markdown chunks below.
Create a single-variable boxplot of female life expectancies.
Notice that this data frame contains one categorical variable: Region. Keeping this information in mind, create a boxplot of female life expectancies by region.
# Insert your boxplot code here
# Insert your densityplots code here
Based on the above densityplot, approximate the average East.Eur female life expectancy (you do not need any R code to answer this question).
Create a barplot showing the number of cases per region.
# Insert your barplot code here
# Insert your scatterplot code here
# Insert your histogram code here