Last lab, we saw more of what R can do. This week we will extend those skills by learning how to make simple graphs. To create graphs in the easiest and most efficient way, we will use the ggplot() function from the ggplot2 package. Let’s begin by loading ggplot2 into R.
library(ggplot2)
If R reports an error when you try to load ggplot2, the most likely cause is that the package is not installed yet. In that case, install it with the install.packages() function like so:
install.packages("ggplot2")
## Warning: package 'ggplot2' is in use and will not be installed
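The warning above appears when install.packages() is called for a package that is already loaded. One way to avoid it is to check for the package first. The sketch below uses requireNamespace(), a base R function; whether you want this check in your own scripts is a matter of preference.

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```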
A histogram represents the frequency distribution of a numerical variable (discrete or continuous) in a sample. We are going to make a histogram of the column “age” in the Titanic data set we used last lab. Let’s first read the Titanic data set into R as we did last class.
titanicData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\titanic.csv", stringsAsFactors = TRUE)
Again, the file path passed to read.csv() depends on where the file is stored on your computer, so you need to copy the path from your own file browser (for example, Finder on a Mac). We set stringsAsFactors = TRUE so that text columns are read in as factors, because we are going to work with the categorical columns in the next types of graphs.
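To see what stringsAsFactors changes, here is a small sketch that reads a made-up three-row CSV supplied as text rather than from a file. The data and the names csvText, asFactors and asStrings are hypothetical and only serve the illustration.

```r
# Hypothetical three-row CSV supplied as text instead of a file on disk.
csvText <- "sex,age
male,22
female,NA
female,35"

asFactors <- read.csv(text = csvText, stringsAsFactors = TRUE)
asStrings <- read.csv(text = csvText, stringsAsFactors = FALSE)

class(asFactors$sex)   # "factor"
class(asStrings$sex)   # "character"
levels(asFactors$sex)  # "female" "male"
```

With factors, R knows the column holds a fixed set of categories, which is what the bar graph and boxplot later in this lab rely on.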
Here is the code to make a simple histogram:
ggplot(titanicData, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).
Here, we call the ggplot() function with the data set as the first argument, followed by a comma and the aes() function, which maps our numerical variable to the x axis. Lastly, we add the geom_histogram() function, which actually draws the histogram for age. The message printed with the plot says that R used the default of 30 bins, and the warning says that 680 rows were dropped because they have no usable age value (for example, NA).
As we generate the other types of graphs, you will notice that they all follow the same pattern: the ggplot() function plus a geom_ function named for the type of graph.
Based on the histogram alone, we can see that the distribution is slightly right-skewed: the peak sits closer to the left side of the graph, with a small tail at the right end. In a right-skewed distribution, the mean (average) of the numerical variable is typically larger than the median (middle value), because the extreme values at the right end pull the mean slightly to the right. From the graph alone, we can estimate that the mean age of the passengers was probably around 30 years old.
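The message about `bins = 30` suggests choosing a bin width yourself. Here is a minimal sketch, using a small made-up data frame (toyAges is hypothetical, not the Titanic data) and an arbitrary width of 10 years:

```r
library(ggplot2)

# Made-up ages for illustration only; NA marks a missing value.
toyAges <- data.frame(
  age = c(2, 18, 21, 22, 24, 25, 27, 29, 30, 33, 36, 45, 58, 71, NA)
)

# binwidth sets the width of each bar in the units of x (years here);
# na.rm = TRUE drops missing ages without printing a warning.
agePlot <- ggplot(toyAges, aes(x = age)) +
  geom_histogram(binwidth = 10, na.rm = TRUE)
agePlot
```

Trying a few different values of binwidth is a reasonable way to check whether the shape you see depends on the binning.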
Let’s check using the mean() function we learned last lab. We will include the argument na.rm = TRUE so that NA values are removed before the mean is calculated.
mean(titanicData$age) # Getting mean without na.rm=TRUE. Returns NA because of NA included in calculations.
## [1] NA
mean(titanicData$age, na.rm = TRUE) # Getting mean with na.rm=TRUE
## [1] 31.19418
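The claim that a right tail pulls the mean above the median can be checked the same way. The sketch below uses a short made-up vector of ages (toyAges is hypothetical); with the lab data you would pass titanicData$age instead.

```r
# Made-up ages; the few large values form the right tail.
toyAges <- c(2, 18, 21, 22, 24, 25, 27, 29, 30, 33, 36, 45, 58, 71, NA)

mean(toyAges, na.rm = TRUE)    # 31.5
median(toyAges, na.rm = TRUE)  # 28
```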
A bar graph plots the frequency of a categorical variable.
Here is the code for plotting a simple bar graph of the column “sex” in the Titanic data set, in order to see the count of passengers of each sex:
ggplot(titanicData, aes(x = sex)) + geom_bar(stat = "count")
Again, we call the ggplot() function with the data set as the first argument, followed by a comma and the aes() function, now with our categorical variable (“sex”) mapped to the x axis. Lastly, we add the geom_bar() function, which actually draws the bars for the sex column. Inside geom_bar() we include the argument stat = “count” to tell R to plot the frequency of each category; this is also the default for geom_bar(), so it can be omitted.
Here, we can see that there were more male passengers than female passengers: approximately 850 passengers were male and 475 were female, so there were nearly twice as many males as females.
A boxplot is a convenient way of showing the frequency distribution of a numerical variable in multiple groups. Recall that a boxplot depicts the five-number summary of the variable: the minimum, quartile 1, quartile 2 (the median), quartile 3, and the maximum.
Here’s the code to draw a boxplot for “age” in the Titanic data set, separately for each sex:
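Rather than reading the counts off the bars, you can compute them exactly with table(). This is a sketch on a made-up vector (toySex is hypothetical); with the lab data you would use titanicData$sex.

```r
# Made-up vector of sexes for illustration only.
toySex <- factor(c("male", "female", "male", "male", "female", "male"))

sexCounts <- table(toySex)
sexCounts  # counts per category

# Ratio of males to females in the toy data: 4 / 2
sexCounts[["male"]] / sexCounts[["female"]]  # 2
```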
ggplot(titanicData, aes(x = sex, y = age)) + geom_boxplot()
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Again, we used the same base function to create our boxplot. This time, however, we included y = age in the aes() function and told R we wanted a boxplot by adding geom_boxplot().
A boxplot can tell us a lot. One notable feature is the median, shown as the dark, thick line, which tells us where the center of the data lies. In this case, the median age of the males is greater than that of the females. The box itself shows the spread of the data from the 1st quartile to the 3rd quartile, which is where the middle 50% of the data falls (i.e. the middle 50% of the ages for females range from approximately 20 to 40 years old, and the middle 50% of the ages for males range from approximately 22 to 42 years old). Skewness can also be inferred from the boxplot: if the median line is closer to either the top or the bottom of the box, it suggests that the data are skewed.
The last thing we want to focus on is the outliers. Outliers are the points that fall outside the whiskers, indicating values that are very high or very low compared to the rest of the data. Whether to keep or remove outliers is up to the analyst, and the decision should take into account the context of the variable and the data. In our example, there are a couple of outliers for males at the high end of the age distribution, at approximately 70 years old. Although we might consider excluding them, people do live to 70 and beyond, so these outliers are not necessarily data entry errors.
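To make the idea of "outside the whiskers" concrete, the sketch below computes the quartiles and the usual 1.5 × IQR fences by hand on a made-up vector (toyAges is hypothetical). By default, geom_boxplot() also uses a multiplier of 1.5, though the hand calculation here is only meant as an illustration of the rule.

```r
# Made-up ages for illustration only.
toyAges <- c(2, 18, 21, 22, 24, 25, 27, 29, 30, 33, 36, 45, 58, 71)

# Quartiles as computed by quantile() with its default settings.
q <- quantile(toyAges, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: points more than 1.5 * IQR beyond the box count as outliers.
lowerFence <- q[["25%"]] - 1.5 * iqr
upperFence <- q[["75%"]] + 1.5 * iqr

outliers <- toyAges[toyAges < lowerFence | toyAges > upperFence]
outliers  # 2 58 71
```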
The last graphical style that we will cover here is the scatter plot, which shows the relationship between two numerical variables.
Notice that the Titanic data set does not have two numerical variables (columns), so we will read in a different data set called “chap02e3bGuppyFatherSonAttractiveness” from the data set folder.
guppyFatherSonData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\chap02e3bGuppyFatherSonAttractiveness.csv")
Our goal is to see whether there is a relationship between the ornamentation of father guppies and the sexual attractiveness of their sons and, if so, what kind of relationship it is. Here is the code to generate a scatterplot of the two numerical variables “fatherOrnamentation” and “sonAttractiveness”:
ggplot(guppyFatherSonData, aes(x = fatherOrnamentation, y = sonAttractiveness)) + geom_point()
Here we built the scatterplot on the same ggplot() base. Inside aes(x = fatherOrnamentation, y = sonAttractiveness), we set the x axis to fatherOrnamentation, our independent variable, and the y axis to sonAttractiveness, our dependent variable. This time we use geom_point() to tell R to create a scatterplot by plotting each data point.
Based on the scatterplot, there is an association between father ornamentation and son attractiveness. More specifically, it is a positive association: as father ornamentation increases, son attractiveness tends to increase as well. However, the points do not fall along an obvious straight line, so the association is not a strong one. In conclusion, we can see a positive association of moderate strength between the ornamentation of father guppies and the attractiveness of their sons.
We can make better-looking graphs by adding more pieces to the ggplot() call. There are many resources on the internet to help you tailor graphs to your liking. Let us try a few improvements, using the scatterplot we just generated.
First, let us improve the graph by setting the axis titles manually. We will change “sonAttractiveness” and “fatherOrnamentation” to “Son’s Attractiveness” and “Father’s Ornamentation,” respectively, by adding the xlab() and ylab() functions with the desired axis titles inside the parentheses, enclosed in quotation marks:
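If you want a number to go with the visual impression, one common option is the correlation coefficient from the base R function cor(). This is only a sketch on made-up values (toyFather and toySon are hypothetical, not the guppy data); with the lab data you would pass the two columns of guppyFatherSonData.

```r
# Made-up values for illustration only.
toyFather <- c(0.2, 0.4, 0.5, 0.7, 0.9, 1.1)
toySon    <- c(0.1, 0.5, 0.3, 0.8, 0.6, 1.0)

# cor() returns a value between -1 and 1; values above 0 indicate a
# positive association, and values closer to 1 a tighter linear pattern.
r <- cor(toyFather, toySon)
r
```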
ggplot(guppyFatherSonData, aes(x = fatherOrnamentation, y = sonAttractiveness)) + geom_point() + xlab("Father's Ornamentation") + ylab("Son's Attractiveness")
To make our graph look even better, we can add a main title and remove the background grid like so:
ggplot(guppyFatherSonData, aes(x = fatherOrnamentation, y = sonAttractiveness)) + geom_point() + xlab("Father's Ornamentation") + ylab("Son's Attractiveness") + labs(title = "Father Ornamentation vs. Son Attractiveness Scatterplot") + theme_classic()
Notice that we added the function labs(title = “Father Ornamentation vs. Son Attractiveness Scatterplot”) to give the graph a main title, and theme_classic() to make the graph look cleaner for presentations and posters.
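As an alternative to separate xlab(), ylab() and title calls, labs() accepts the axis titles and the main title together. The sketch below uses a made-up data frame (toyGuppies is hypothetical) so that it runs without the lab files, and splits the call across lines for readability.

```r
library(ggplot2)

# Made-up values standing in for the guppy data.
toyGuppies <- data.frame(
  fatherOrnamentation = c(0.2, 0.4, 0.5, 0.7, 0.9, 1.1),
  sonAttractiveness   = c(0.1, 0.5, 0.3, 0.8, 0.6, 1.0)
)

labelledPlot <- ggplot(toyGuppies,
                       aes(x = fatherOrnamentation, y = sonAttractiveness)) +
  geom_point() +
  labs(
    x = "Father's Ornamentation",
    y = "Son's Attractiveness",
    title = "Father Ornamentation vs. Son Attractiveness"
  ) +
  theme_classic()
labelledPlot
```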
The help pages in R are the main source of help, but the amount of detail might be off-putting for beginners. For example, to explore the options for ggplot(), enter the following into the R Console.
help(ggplot)
## starting httpd help server ... done
This will cause the contents of the manual page for this function to appear in the Help window in RStudio. These manual pages are often frustratingly technical. What many of us do instead is simply search online for the name of the function, since there are a great number of resources about R on the web. There are many ways to make your graphs more appealing, such as changing the colors or sizes of the plotted elements.
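As one example of changing appearance, geom_point() accepts color, size and alpha (transparency) arguments. The values below are arbitrary choices, and toyGuppies is a made-up data frame used only so the sketch runs on its own.

```r
library(ggplot2)

# Made-up values standing in for the guppy data.
toyGuppies <- data.frame(
  fatherOrnamentation = c(0.2, 0.4, 0.5, 0.7, 0.9, 1.1),
  sonAttractiveness   = c(0.1, 0.5, 0.3, 0.8, 0.6, 1.0)
)

# color, size and alpha are set outside aes() because they are fixed
# values here, not columns of the data.
styledPlot <- ggplot(toyGuppies,
                     aes(x = fatherOrnamentation, y = sonAttractiveness)) +
  geom_point(color = "steelblue", size = 3, alpha = 0.7) +
  theme_classic()
styledPlot
```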