If the data is in .csv format (comma separated values), we can easily read it into R using read.csv(). For example, if we had a csv file called socio.csv in the directory /Users/andrewchallis/Documents/Personal/Plots for rach and sharna/, we would simply run the code below.
Note that we must put the path to the file in either single quotes '' or double quotes "". This makes what is called a character string, which just tells the computer that it is text and not a number, a function or something else. Spaces inside the quotes are fine as they are; putting a backslash in front of them makes R report an unrecognized escape.
read.csv('/Users/andrewchallis/Documents/Personal/Plots for rach and sharna/socio.csv')
| country | ID | gnp | lexp | lexpf | lexpm | adulit | hdi | fertr | birthr | pop | popgrwth | childmf | childmm | infmor | urbanpop | energcpc | pppgnp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bolivia | 1 | 630 | 60 | 62 | 58 | 77.5 | 0.398 | 4.8 | 36 | 7171000 | 2.4 | 109 | 127 | 92 | 51 | 259 | 1572 |
| Argentina | 1 | 3270 | 71 | 75 | 68 | 95.3 | 0.832 | 2.8 | 20 | 32322000 | 1.2 | 30 | 40 | 29 | 86 | 1309 | 4295 |
| Australia | 2 | 16720 | 77 | 80 | 74 | 99.0 | 0.972 | 1.9 | 15 | 17065010 | 1.5 | 8 | 10 | 8 | 86 | 5161 | 16051 |
| Austria | 2 | 19060 | 76 | 80 | 73 | 99.0 | 0.952 | 1.5 | 12 | 7712000 | 1.2 | 9 | 13 | 7 | 58 | 3289 | 16504 |
| Belgium | 2 | 17610 | 76 | 80 | 73 | 99.0 | 0.952 | 1.6 | 13 | 9967000 | 0.3 | 10 | 12 | 8 | 96 | 4841 | 16381 |
| Benin | 5 | 360 | 50 | 52 | 49 | 23.4 | 0.113 | 6.3 | 46 | 4740000 | 3.1 | 155 | 173 | 113 | 38 | 23 | 1043 |
Now that we know how to read a csv file, let’s cache this in memory. Caching here simply means giving the table a name so that the computer knows which table we want to do things with. It is very simple to do this in R. We call this assignment; it sounds fancy, but it’s just giving the table a name. We can use either <- or = to assign a name to a table.
socio <- read.csv('/Users/andrewchallis/Documents/Personal/Plots for rach and sharna/socio.csv')
OR
socio = read.csv('/Users/andrewchallis/Documents/Personal/Plots for rach and sharna/socio.csv')
Now we are ready to start using the data in the table! That last step was easy, and the steps after won’t be much harder than this.
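Here is a minimal sketch of the read-then-assign step that can be run without the original file. It writes two rows copied from the table above to a temporary file, reads them back with read.csv() and stores the result under a name. The temporary path and the name demo are only for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The two data rows are copied from the first rows of socio.csv shown above.
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,ID,gnp,lexp",
             "Bolivia,1,630,60",
             "Argentina,1,3270,71"),
           tmp)

# A path is just text, so it goes in quotes. file.exists() is a quick
# way to catch a mistyped path before trying to read it.
stopifnot(file.exists(tmp))

# Read the file and give the resulting table a name in one step.
demo <- read.csv(tmp)

nrow(demo)   # number of rows read
names(demo)  # column names taken from the header line
```

If file.exists() returns FALSE for your own path, the usual culprits are a typo or a missing folder in the path.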
In R, when we import data from a csv file and assign it a name (what we did in the previous step), the table is referred to as a dataframe. This is just a fancy way of saying a table. Since the data is nice and tidy in the format of a dataframe (or table) we can extract information from it really easily! For example, if we wanted to look at just one column, say the life expectancy (which has a column name of lexp), we can find the number of the column (in this case it is the 4th column) and run the following:
socio[,4]
## [1] 60 71 77 76 76 50 49 67 73 48 47 50 57 77 49 47 72 69 53 75 55 75 67
## [24] 66 60 64 48 76 77 53 76 55 77 63 43 54 65 78 71 59 62 63 63 74 76 77
## [47] 73 79 67 59 71 51 52 70 48 47 70 70 52 77 75 65 45 52 77 66 56 73 55
## [70] 67 63 64 71 75 70 48 64 47 42 74 48 62 76 71 78 78 66 48 66 54 71 67
## [93] 67 47 72 76 76 73 70 50 61
This means take the dataframe called socio and give me all the rows and only the 4th column. In general the format for this is dataframe[row number, column number]. Since we wanted all the rows, we left the row number blank, and since we only wanted the 4th column we put 4 after the comma.
An easier way to do this takes advantage of the fact that the data is in the format of a dataframe. This is the way I always use as it is much easier to read and understand. We want the column lexp from the dataframe socio, and the code for this is:
socio$lexp
## [1] 60 71 77 76 76 50 49 67 73 48 47 50 57 77 49 47 72 69 53 75 55 75 67
## [24] 66 60 64 48 76 77 53 76 55 77 63 43 54 65 78 71 59 62 63 63 74 76 77
## [47] 73 79 67 59 71 51 52 70 48 47 70 70 52 77 75 65 45 52 77 66 56 73 55
## [70] 67 63 64 71 75 70 48 64 47 42 74 48 62 76 71 78 78 66 48 66 54 71 67
## [93] 67 47 72 76 76 73 70 50 61
So in general, if we want a column from a dataframe we just need to run the code dataframe$column_name.
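To check that the two ways of picking a column really give the same thing, here is a small sketch. It uses a three-row stand-in called demo (a made-up name) built from the first rows of the table above, so it runs without socio.csv.

```r
# A small stand-in for socio, built from the first rows of the table above.
demo <- data.frame(
  country = c("Bolivia", "Argentina", "Australia"),
  ID      = c(1, 1, 2),
  gnp     = c(630, 3270, 16720),
  lexp    = c(60, 71, 77)
)

by_number <- demo[, 4]        # all rows, 4th column
by_dollar <- demo$lexp        # column picked by name with $
by_name   <- demo[, "lexp"]   # column picked by name inside [ , ]

# All three give the same vector of life expectancies.
identical(by_number, by_dollar)
identical(by_dollar, by_name)

# Rows work the same way: row 2, all columns.
demo[2, ]
```

Picking by name has the extra advantage that the code keeps working if the columns are ever reordered.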
There are a few types of data I will be using in this tutorial, namely integers (int), numbers with decimals (num), factors (Factor) and character strings (chr).
It is very easy to see the types of data we have in our data frame (socio). This can be done by using str(), which asks: what is the structure of this data frame?
str(socio)
## 'data.frame': 101 obs. of 18 variables:
## $ country : Factor w/ 101 levels "Argentina","Australia",..: 7 1 2 3 4 5 6 8 9 10 ...
## $ ID : int 1 1 2 2 2 5 4 5 2 5 ...
## $ gnp : int 630 3270 16720 19060 17610 360 200 2230 2320 270 ...
## $ lexp : int 60 71 77 76 76 50 49 67 73 48 ...
## $ lexpf : int 62 75 80 80 80 52 47 69 76 49 ...
## $ lexpm : int 58 68 74 73 73 49 50 65 70 46 ...
## $ adulit : num 77.5 95.3 99 99 99 23.4 38.4 73.6 93 18.2 ...
## $ hdi : num 0.398 0.832 0.972 0.952 0.952 0.113 0.15 0.552 0.854 0.08 ...
## $ fertr : num 4.8 2.8 1.9 1.5 1.6 6.3 5.5 4.7 1.9 6.5 ...
## $ birthr : int 36 20 15 12 13 46 39 35 13 47 ...
## $ pop : int 7171000 32322000 17065010 7712000 9967000 4740000 1433000 1277000 8636000 9016000 ...
## $ popgrwth: num 2.4 1.2 1.5 1.2 0.3 3.1 2.1 3.3 -4 2.8 ...
## $ childmf : int 109 30 8 9 10 155 183 41 14 190 ...
## $ childmm : int 127 40 10 13 12 173 179 53 19 210 ...
## $ infmor : int 92 29 8 7 8 113 122 38 14 134 ...
## $ urbanpop: int 51 86 86 58 96 38 5 25 68 15 ...
## $ energcpc: int 259 1309 5161 3289 4841 23 13 417 3143 17 ...
## $ pppgnp : int 1572 4295 16051 16504 16381 1043 800 3419 4700 618 ...
There seem to be two columns which we may want to change, namely country, which is a factor, and ID, which is an integer. The country column would be better as a character string, and the ID column would be better as a factor. (From R 4.0.0 onwards read.csv() reads text columns as character strings by default, so country may already show as chr.)
Firstly, let’s change the country column from a factor to a character string.
socio$country = as.character(socio$country)
Notice how we have overwritten what was in the country column of the data frame. This is called reassignment; we should be careful to only do this if we are confident we won’t be losing any data.
Now, let’s replace the numbers in the ID column with what they represent. In the documentation of the data we see that:
1 = Latin America
2 = OECD
3 = East Asia
4 = Other Asia
5 = Africa
6 = Gulf
The command to convert these numbers (1,2,3,4,5 or 6) to factors with the labels (“Latin America”, “OECD”, “East Asia”, “Other Asia”, “Africa”, “Gulf”) is as follows:
socio$ID = factor(socio$ID,
labels = c("Latin America", "OECD", "East Asia", "Other Asia", "Africa", "Gulf"))
To check this has worked, we can run str(socio) again. Notice that we now have all the data in our data frame in the format we will need it in.
str(socio)
## 'data.frame': 101 obs. of 18 variables:
## $ country : chr "Bolivia" "Argentina" "Australia" "Austria" ...
## $ ID : Factor w/ 6 levels "Latin America",..: 1 1 2 2 2 5 4 5 2 5 ...
## $ gnp : int 630 3270 16720 19060 17610 360 200 2230 2320 270 ...
## $ lexp : int 60 71 77 76 76 50 49 67 73 48 ...
## $ lexpf : int 62 75 80 80 80 52 47 69 76 49 ...
## $ lexpm : int 58 68 74 73 73 49 50 65 70 46 ...
## $ adulit : num 77.5 95.3 99 99 99 23.4 38.4 73.6 93 18.2 ...
## $ hdi : num 0.398 0.832 0.972 0.952 0.952 0.113 0.15 0.552 0.854 0.08 ...
## $ fertr : num 4.8 2.8 1.9 1.5 1.6 6.3 5.5 4.7 1.9 6.5 ...
## $ birthr : int 36 20 15 12 13 46 39 35 13 47 ...
## $ pop : int 7171000 32322000 17065010 7712000 9967000 4740000 1433000 1277000 8636000 9016000 ...
## $ popgrwth: num 2.4 1.2 1.5 1.2 0.3 3.1 2.1 3.3 -4 2.8 ...
## $ childmf : int 109 30 8 9 10 155 183 41 14 190 ...
## $ childmm : int 127 40 10 13 12 173 179 53 19 210 ...
## $ infmor : int 92 29 8 7 8 113 122 38 14 134 ...
## $ urbanpop: int 51 86 86 58 96 38 5 25 68 15 ...
## $ energcpc: int 259 1309 5161 3289 4841 23 13 417 3143 17 ...
## $ pppgnp : int 1572 4295 16051 16504 16381 1043 800 3419 4700 618 ...
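One thing to watch with the factor() call used for the ID column: when only labels is given, the number of labels has to match the number of distinct codes that actually occur in the data, so the call relies on all six codes being present. A slightly more defensive sketch, using a made-up vector of codes rather than the real column, lists the levels explicitly:

```r
region_labels <- c("Latin America", "OECD", "East Asia",
                   "Other Asia", "Africa", "Gulf")

# Hypothetical codes in which 3, 4 and 6 never occur.
codes <- c(1, 1, 2, 2, 2, 5)

# Listing the levels explicitly pins each code to its label,
# even when some codes are absent from the data.
region <- factor(codes, levels = 1:6, labels = region_labels)

as.character(region)  # the label for each observation
table(region)         # absent regions appear with a count of 0
```

For socio itself the shorter call works because every region appears at least once.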
There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot a scatterplot of hdi (x axis) against life expectancy (y axis), all we have to do is:
plot(socio$hdi, socio$lexp)
To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis labels and changing the color of the points, to name a few. Here is an example where we have changed some of the looks.
plot(socio$hdi, socio$lexp,
     col = "blue",
     xlab = "HDI",
     ylab = "Life expectancy",
     main = "Plot of HDI against Life expectancy")
These plots are quick and easy to produce, but they look terrible! Definitely not publishable! Let’s look at some other options we have to display our graphs. We will go through the following packages:
ggplot2
plotly
ggvis
There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.
Now we are ready to introduce a package called ggplot2, so-called because it is built using the ‘grammar of graphics’. The way this package is intended to be used is by layering different options. There are 5 components that define a layer: the data (which must be an R data frame), the aesthetic mappings, the geom, the stat and the position adjustment.
Before we get overly complicated by what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.
To install this package and load it into our R script simply run the following:
install.packages("ggplot2")
library(ggplot2)
First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio.
ggplot(data = socio)
Notice how this is a completely blank plot. This is because we haven’t told the computer what we want to plot. So let’s look at the second component, the aesthetic mappings. This is given by aes(); we want to plot hdi on the x axis, and lexp on the y axis.
ggplot(data = socio, aes(x=hdi, y=lexp))
This is looking better! We now have a pair of axes. Notice how the limits of the axes are automatically set to the maximum and minimum values we have in our data! The next component in the grammar of graphics is the geom. Let’s make a scatter plot just like before and see if it looks better.
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point()
In my opinion, this already looks better than the basic plot! Now let’s make it look even better by changing the color of the points, the axis labels and the title:
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(color = "blue") +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy")
This looks much neater! Now let’s look at some other ways we can use the aesthetic layers, such as size, shape, alpha and color. Note that in the previous plot we did change the color, but we didn’t make an aesthetic mapping using the data from the data frame (socio). To illustrate this, let’s look at the difference when we use an aesthetic mapping of the color.
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy")
Neat! Each point has a different color depending on which region its country belongs to! Now we know the difference between changing the color of all the points and adding an aesthetic mapping to have the color defined by another variable in the data frame. What do the other options do, I hear you say… Let’s try them out! Maybe we could have the size of the points dependent on the population of the country.
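To see the difference between setting a color and mapping a color without looking at a picture, here is a small sketch. It assumes ggplot2 is installed (the code is skipped otherwise) and uses a made-up three-row data frame called demo, with values taken from the table at the top, rather than socio. layer_data() is used only to peek at the colors ggplot2 computed for the points.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  demo <- data.frame(
    hdi  = c(0.398, 0.972, 0.113),
    lexp = c(60, 77, 50),
    ID   = factor(c("Latin America", "OECD", "Africa"))
  )

  # Setting: color is a fixed value, written outside aes().
  p_set <- ggplot(demo, aes(x = hdi, y = lexp)) +
    geom_point(color = "blue")

  # Mapping: color is driven by a column, written inside aes().
  p_map <- ggplot(demo, aes(x = hdi, y = lexp)) +
    geom_point(aes(color = ID))

  # The colors actually used for the points in each plot.
  set_colours <- layer_data(p_set)$colour
  map_colours <- layer_data(p_map)$colour

  print(unique(set_colours))          # a single color for every point
  print(length(unique(map_colours)))  # one color per region
}
```

The rule of thumb is that anything inside aes() is looked up in the data frame, while anything outside it is taken literally.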
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy")
More mappings!? Alright, let’s see if changing the shape of the points makes this graph look clearer.
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop, shape = ID)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy")
This looks very confusing! Maybe we could change the alpha level rather than the shape of the points. What is this mysterious alpha? Alpha actually comes from the planet Mars, just kidding, alpha is a ghost! Joking again, kind of. Alpha is how transparent the points are, sometimes this can make plots much easier to understand. Let’s set the alpha level based upon the adult literacy, so if the adult literacy is low, then the point will be more transparent.
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop, alpha = adulit)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy")
Wow! These aesthetics are pretty bad ass! Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. The way we do this is to use the command theme(), and there are also pre-made themes. My favourite theme is called theme_minimal(), let’s see what this looks like:
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop, alpha = adulit)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy") +
theme_minimal()
Great! The only change I want to make to the layout is the title position, let’s align it in the center. To change this we need to use theme() again, after the line theme_minimal().
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop, alpha = adulit)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
On second thought, I would also like to change the names on the scales of the legends. This is the most complicated it will get and will be our final graph using ggplot.
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop, alpha = adulit)) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_discrete(name = "Region") +
scale_alpha_continuous(name = "Adult Literacy") +
scale_size_continuous(name = "Population")
The Plotly package has a great way to add interactivity to the plots we make using ggplot. Even better than that, it’s incredibly easy to use. The only issue is that the result doesn’t always look like the plot we made in ggplot, so we may have to simplify some of the extras we changed.
To publish plots to plotly’s online service we need to sign up for a free plotly account and take note of the username and api key; ggplotly() itself runs locally without them. Put these into the code below so they are stored for the rest of your R session.
install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")
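A quick way to confirm the two variables were picked up is to read them back with Sys.getenv(). This is only a sketch; "USERNAME" and "API_KEY" are the same placeholders as in the code above, not real credentials.

```r
# Set the two variables for the current R session only.
# "USERNAME" and "API_KEY" are placeholders, as in the code above.
Sys.setenv("plotly_username" = "USERNAME")
Sys.setenv("plotly_api_key"  = "API_KEY")

# Read the username back to confirm it was set.
Sys.getenv("plotly_username")

# nzchar() is TRUE when the variable holds a non-empty value,
# which avoids printing the key itself.
nzchar(Sys.getenv("plotly_api_key"))
```

Values set with Sys.setenv() disappear when the R session ends. To keep them between sessions, the same name=value pairs can be placed in a .Renviron file, which R reads at startup.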
Now we have the package installed and the library loaded in our script, let’s see what this can do! Let’s use the last plot we made with ggplot but remove the legends (we have to remove them for this plot as it doesn’t look good with the legends). To do this, we simply create our ggplot as we did before, then run the command ggplotly() and it will take the last plot we created and transform it into a magical interactive plot!
Note we also added text = paste("Country:", country) into the geom_point() function. ggplot2 itself does not use this (it only warns about an unknown aesthetic), but plotly will see it, and when we hover over a point it will show the country!
ggplot(data = socio, aes(x=hdi, y=lexp)) +
geom_point(aes(color = ID, size = pop, alpha = adulit, text = paste("Country:", country))) +
ylab("Life Expectancy") +
xlab("Human Development Index") +
ggtitle("Plot of HDI against Life expectancy") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")
ggplotly()
How great is that! Plotly is an incredibly powerful tool, there are ways to make this plot even better by building the plot only using the plotly package. However, this is very advanced. We will cover this in another tutorial as there is no need for it right now.
Another package we can use is ggvis, which is similar to ggplot2. However, it can be made to have customisable options when running locally or on a shiny server, such as picking what color the points should be, or what type of regression line you want to fit. The negative side is that ggvis has not been developed as much as ggplot2 and hence has a few bugs. For now though, it’s worth looking at as an alternative.
Let’s first download, install and load the package.
install.packages("ggvis")
library(ggvis)
To use the code below, we must first define the function add_title() since there is no way to add a title in the package at the moment. This is an issue that will hopefully be sorted soon. Simply copy and paste this code and run it; you don’t need to worry about what it is doing.
add_title <- function(vis, ..., properties=NULL, title = "Plot Title")
{
# recursively merge lists by name
# http://stackoverflow.com/a/13811666/1135316
merge.lists <- function(a, b) {
a.names <- names(a)
b.names <- names(b)
m.names <- sort(unique(c(a.names, b.names)))
sapply(m.names, function(i) {
if (is.list(a[[i]]) && is.list(b[[i]])) merge.lists(a[[i]], b[[i]])
else if (i %in% b.names) b[[i]]
else a[[i]]
}, simplify = FALSE)
}
# default properties make title 'axis' invisible
default.props <- axis_props(
ticks = list(strokeWidth=0),
axis = list(strokeWidth=0),
labels = list(fontSize = 0),
grid = list(strokeWidth=0)
)
# merge the default properties with user-supplied props.
axis.props <- do.call(axis_props, merge.lists(default.props, properties))
# don't step on existing scales.
vis <- scale_numeric(vis, "title", domain = c(0,1), range = 'width')
axis <- ggvis:::create_axis('x', 'title', orient = "top", title = title, properties = axis.props, ...)
ggvis:::append_ggvis(vis, "axes", axis)
}
Now we are ready to have a quick plot! Note that we can build this up in a very similar way to using ggplot2. The core part of the code below is ggvis(data = socio, x = ~hdi, y = ~lexp, fill = ~ID, size = ~pop, opacity = ~adulit), which defines all the aesthetics we wish to plot. Another important part is %>%, which is a pipe operator. It sounds complicated, but all it does is pass something through the steps required. For example, in maths we sometimes have f(g(x)), which means put x through the function g, then put that through f. A particular example of this case would be sum(sqrt(x)); using the pipe operator %>%, we could write this as x %>% sqrt() %>% sum().
We tell R that we want to plot it as a scatter plot by using layer_points() which is very similar to ggplot2 when we use geom_point().
All the other parts of the code are just to make it look nicer, by adding a title, changing the legend titles and adding axis labels.
ggvis(data = socio, x = ~hdi, y = ~lexp, fill = ~ID, size = ~pop, opacity = ~adulit) %>%
add_axis("x",title = "Human Development Index") %>%
add_axis("y",title = "Life Expectancy")%>%
layer_points() %>%
add_legend("size", properties = legend_props(legend = list(y = 120)),
title = "Population") %>%
add_legend("fill", properties = legend_props(legend = list(y = 0)),
title = "Region") %>%
add_title(title = "Plot of HDI against Life expectancy",
properties = axis_props(title=list(fontSize=20)))
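The pipe idea used throughout the ggvis code can be tried on its own with a few numbers. The sketch below uses R's built-in pipe |> instead of %>%, so it needs no extra package; this assumes R 4.1.0 or later, and for a simple chain like this one the two pipes behave the same way.

```r
# Assumes R 4.1.0 or later, where the native pipe |> is available.
x <- c(1, 4, 9, 16)

# Nested call: read inside-out, sqrt first, then sum.
nested <- sum(sqrt(x))

# Piped call: read left-to-right, same steps in the same order.
piped <- x |> sqrt() |> sum()

nested  # 10
piped   # 10
```

Both lines compute 1 + 2 + 3 + 4; the piped version simply lists the steps in the order they happen.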
Name: Andy Challis
Email: andrewchallis@hotmail.co.uk
Linkedin: http://uk.linkedin.com/in/achallis
There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot a boxplot of life expectancy, all we have to do is:
boxplot(socio$lexp)
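If you are curious which numbers a box plot is built from, base R can report them directly. The sketch below uses only the first ten life expectancies printed by str(socio) earlier, stored under the made-up name lexp10, so the values will differ from those for the full column.

```r
# First ten life expectancies from socio$lexp, as printed by str(socio).
lexp10 <- c(60, 71, 77, 76, 76, 50, 49, 67, 73, 48)

# The five numbers a box plot draws: lower whisker, lower hinge,
# median, upper hinge, upper whisker.
stats <- boxplot.stats(lexp10)
stats$stats

# Points beyond the whiskers, which boxplot() draws individually.
stats$out
```

The middle value of stats$stats is the median, which is the thick line inside the box.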
To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis label and changing the color of the box, to name a few. Here is an example where we have changed some of the looks.
boxplot(socio$lexp, col = "blue",
main = "Boxplot of Life Expectancy",
ylab = "Life Expectancy")
These plots are quick and easy to produce, but they look terrible! Definitely not publishable! Let’s look at some other options we have to display our graphs. We will go through the following packages:
ggplot2
plotly
There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.
Now we are ready to introduce a package called ggplot2, so-called because it is built using the ‘grammar of graphics’. The way this package is intended to be used is by layering different options. There are 5 components that define a layer: the data (which must be an R data frame), the aesthetic mappings, the geom, the stat and the position adjustment.
Before we get overly complicated by what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.
To install this package and load it into our R script simply run the following:
install.packages("ggplot2")
library(ggplot2)
First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio.
ggplot(data = socio)
Notice how this is a completely blank plot. This is because we haven’t told the computer what we want to plot. So let’s look at the second component, the aesthetic mappings. This is given by aes(); we want a box plot of the life expectancy for each region.
ggplot(data = socio, aes(ID, lexp))
This is looking better! We now have a pair of axes. Notice how the limits of the y axis are automatically set to the maximum and minimum values we have in our data! The x axis is set to all the regions! The next component in the grammar of graphics is the geom. Let’s make a box plot just like before and see if it looks better.
ggplot(data = socio, aes(ID, lexp)) +
geom_boxplot()
In my opinion, this already looks better than the basic plot! Now let’s make it look even better by changing the fill color of the boxes, the title and the axis labels:
ggplot(socio, aes(ID,lexp)) +
geom_boxplot(aes(fill = ID)) +
ggtitle("Boxplots of Life Expectancy for each Region") +
xlab("Region") +
ylab("Life expectancy")
This looks much neater! Now let’s look at some other ways we can use the aesthetic layers, such as adding the data points, and setting an alpha level.
ggplot(socio, aes(ID,lexp)) +
geom_boxplot(aes(fill = ID),alpha = 0.7) +
geom_point(aes(color = ID),alpha = 1, position = position_jitter(width = 0.05)) +
ggtitle("Boxplots of Life Expectancy for each Region") +
xlab("Region") +
ylab("Life expectancy")
Neat! Each boxplot has all the data points shown in a different color depending on which region they are representing!
Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. The way we do this is to use the command theme(), and there are also pre-made themes. My favourite theme is called theme_minimal(), let’s see what this looks like:
ggplot(socio, aes(ID,lexp)) +
geom_boxplot(aes(fill = ID),alpha = 0.7) +
geom_point(aes(color = ID),alpha = 1, position = position_jitter(width = 0.05)) +
ggtitle("Boxplots of Life Expectancy for each Region") +
xlab("Region") +
ylab("Life expectancy") +
theme_minimal()
Great! The only change I want to make to the layout is the title position, let’s align it in the center. To change this we need to use theme() again, after the line theme_minimal(). There also seems to be no use for the legend, since we can see what region the boxplots represent from the x axis. Let’s remove it.
ggplot(socio, aes(ID,lexp)) +
geom_boxplot(aes(fill = ID),alpha = 0.7, outlier.alpha = 1) +
geom_point(aes(color = ID,
text = paste("Country:", country)), position = position_jitter(width = 0.05)) +
ggtitle("Boxplots of Life Expectancy for each Region") +
xlab("Region") +
ylab("Life expectancy") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")
Hang on, what would it look like if we had the boxplots in a different order? Say from highest median life expectancy to lowest, would that look better? Let’s see. Don’t worry about this code if you don’t understand it. It makes a new data frame called cdata and finds the median lexp value for each of the IDs. We then reorder the factor levels so that they go in the order of decreasing median values.
library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))
socio$ID = factor(socio$ID,levels(socio$ID)[order(-cdata$median)])
ggplot(socio, aes(ID,lexp)) +
geom_boxplot(aes(fill = ID),alpha = 0.7, outlier.alpha = 1) +
geom_point(aes(color = ID,
text = paste("Country:", country)), position = position_jitter(width = 0.05)) +
ggtitle("Boxplots of Life Expectancy for each Region") +
xlab("Region") +
ylab("Life expectancy") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")
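As an aside, base R's reorder() can do the same reordering of factor levels without plyr. The sketch below is an alternative to the ddply() approach, not the code used in this tutorial, and it runs on a six-row stand-in called demo (a made-up name) built from the table at the top.

```r
# Stand-in data: six countries from the table at the top of the tutorial.
demo <- data.frame(
  ID   = factor(c("Latin America", "Latin America", "OECD",
                  "OECD", "OECD", "Africa")),
  lexp = c(60, 71, 77, 76, 76, 50)
)

levels(demo$ID)  # default order is alphabetical

# reorder() sorts the levels by a summary of another variable.
# Negating lexp turns "increasing median" into "decreasing median".
demo$ID <- reorder(demo$ID, -demo$lexp, FUN = median)

levels(demo$ID)  # now ordered from highest to lowest median lexp
```

Dropping the minus sign gives the opposite order, from lowest median to highest.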
We can also swap the axes so the plot is drawn the other way round. This may make it easier for the audience to understand. This is simply done by adding coord_flip().
library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))
socio$ID = factor(socio$ID,levels(socio$ID)[order(-cdata$median)])
ggplot(socio, aes(ID,lexp)) +
geom_boxplot(aes(fill = ID),alpha = 0.7, outlier.alpha = 1) +
geom_point(aes(color = ID,
text = paste("Country:", country)), position = position_jitter(height = 0.0005)) +
ggtitle("Boxplots of Life Expectancy for each Region") +
xlab("Region") +
ylab("Life expectancy") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none") + coord_flip()
The Plotly package has a great way to add interactivity to the plots we make using ggplot. Even better than that, it’s incredibly easy to use. The only issue is that the result doesn’t always look like the plot we made in ggplot, so we may have to simplify some of the extras we changed.
To publish plots to plotly’s online service we need to sign up for a free plotly account and take note of the username and api key; ggplotly() itself runs locally without them. Put these into the code below so they are stored for the rest of your R session.
install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")
Now we have the package installed and the library loaded in our script, let’s see what this can do! Let’s use the last plot we made with ggplot but remove the legends (we have to remove them for this plot as it doesn’t look good with the legends). To do this, we simply create our ggplot as we did before, then run the command ggplotly() and it will take the last plot we created and transform it into a magical interactive plot!
Note we also added text = paste("Country:", country) into the geom_point() function. ggplot2 itself does not use this (it only warns about an unknown aesthetic), but plotly will see it, and when we hover over a point it will show the country!
ggplotly()
How great is that! Plotly is an incredibly powerful tool, there are ways to make this plot even better by building the plot only using the plotly package. However, this is very advanced. We will cover this in another tutorial as there is no need for it right now.
There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot a histogram of life expectancy, all we have to do is:
hist(socio$lexp)
To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis label, changing the fill color of the bars and changing the number of bins, to name a few. Here is an example where we have changed some of the looks.
hist(socio$lexp,
     col = "blue",
     main = "Histogram of Life Expectancy",
     xlab = "Life Expectancy",
     breaks = 30)
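One detail about breaks: when it is given as a single number, hist() treats it as a suggestion and picks tidy bin edges near that count, so the number of bars may not be exactly 30. The sketch below shows how to check what R actually chose. It uses the first 23 life expectancies from the socio$lexp output shown earlier, so the numbers will differ from those for the full column.

```r
# First 23 life expectancies from the socio$lexp output shown earlier.
lexp <- c(60, 71, 77, 76, 76, 50, 49, 67, 73, 48, 47, 50,
          57, 77, 49, 47, 72, 69, 53, 75, 55, 75, 67)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(lexp, breaks = 30, plot = FALSE)

length(h$counts)  # number of bins R actually used
h$breaks          # the bin edges R chose
sum(h$counts)     # every observation falls in exactly one bin
```

To force exact bin edges, breaks can instead be given as a vector of cut points, for example seq(40, 80, by = 2).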
These plots are quick and easy to produce, but they look terrible! Definitely not publishable, just look at those axes! Let’s look at some other options we have to display our graphs. We will go through the following packages:
ggplot2
plotly
There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.
Now we are ready to introduce a package called ggplot2, so-called because it is built using the ‘grammar of graphics’. The way this package is intended to be used is by layering different options. There are 5 components that define a layer: the data (which must be an R data frame), the aesthetic mappings, the geom, the stat and the position adjustment.
Before we get overly complicated by what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.
To install this package and load it into our R script simply run the following:
install.packages("ggplot2")
library(ggplot2)
First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio.
ggplot(data = socio)
Notice how this is a completely blank plot. This is because we haven’t told the computer what we want to plot. So let’s look at the second component, the aesthetic mappings. This is given by aes(); we want a histogram of the life expectancy.
ggplot(data = socio, aes(lexp))
This is looking better! We now have an axis. Notice how the limits of the x axis are set to the maximum and minimum values for life expectancy! The next component in the grammar of graphics is the geom. Let’s make a histogram just like before and see if it looks better.
ggplot(data = socio, aes(lexp)) +
geom_histogram()
In my opinion, this already looks better than the basic plot! Now let’s make it look even better by changing the fill and outline colors of the bars, the title and the axis labels:
ggplot(socio, aes(lexp)) +
geom_histogram(fill = "#FFB347", color = "#FF6961") +
ggtitle("Histogram of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Count")
This looks much neater! Now let’s look at some other things we can do, such as faceting by the region, and setting an alpha level.
ggplot(socio, aes(lexp)) +
geom_histogram(fill = "#FFB347", color = "#FF6961",
bins = 10, alpha = 0.7) +
ggtitle("Histograms of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Count") +
facet_grid(ID~.)
Neat! Each region has its own histogram! Now we can directly compare the distributions of the life expectancies across regions.
Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. The way we do this is to use the command theme(), and there are also pre-made themes. My favourite theme is called theme_minimal(), let’s see what this looks like:
ggplot(socio, aes(lexp)) +
geom_histogram(fill = "#FFB347", color = "#FF6961",
bins = 10, alpha = 0.7) +
ggtitle("Histograms of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Count") +
facet_grid(ID~.) +
theme_minimal()
Great! The only change I want to make to the layout is to see what it would look like if we had different colors for each of the histograms, so that each region has its own color. Note we also don’t need a legend for this, so let’s remove it using theme(legend.position = "none"), and let’s also make the title center aligned by adding plot.title = element_text(hjust = 0.5) into the theme function.
ggplot(socio, aes(lexp)) +
geom_histogram(aes(fill = ID, color = ID),
bins = 10, alpha = 0.7) +
ggtitle("Histograms of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Count") +
facet_grid(ID~.) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
Hang on, what would it look like if we had the histograms in a different order? Say from lowest median life expectancy to highest, would that look better? Let’s see. Don’t worry about this code if you don’t understand it. It makes a new data frame called cdata and finds the median lexp value for each of the IDs. We then reorder the factor levels so that they go in the order of increasing median values.
library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))
socio$ID = factor(socio$ID,levels(socio$ID)[order(cdata$median)])
ggplot(socio, aes(lexp)) +
geom_histogram(aes(fill = ID, color = ID),
bins = 10, alpha = 0.7) +
ggtitle("Histograms of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Count") +
facet_grid(ID~.) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
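To see what the reordering step is doing, here is a minimal sketch on a made-up data frame (the name toy and its values are invented for illustration and are not part of the socio data). It uses base R’s tapply() in place of ddply(), which is a swapped-in way of getting one median per region.

```r
# Toy data (made up for illustration): three regions with different medians
toy <- data.frame(
  ID   = c("A", "A", "B", "B", "C", "C"),
  lexp = c(70, 72, 50, 54, 60, 62),
  stringsAsFactors = FALSE
)

# Median life expectancy per region, as a named vector
med <- tapply(toy$lexp, toy$ID, median)

# Reorder the factor levels from lowest to highest median
toy$ID <- factor(toy$ID, levels = names(sort(med)))

# The levels now run from the lowest median to the highest
levels(toy$ID)
```

Because facet_grid() lays out the panels in the order of the factor levels, changing the levels is what changes the order of the histograms.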
The plotly package has a great way to add interactivity to the plots we make using ggplot. Even better than that, it’s incredibly easy to use. The only issue is that the result doesn’t always look exactly like the plot we made in ggplot, so we may have to simplify some of the extras we changed.
The ggplotly() function itself runs locally, so the username and API key below are only needed if you want to publish plots to a Plotly account. To do that, sign up for a free Plotly account and take note of your username and API key. Put these into the code below so they are stored for the rest of your R session.
install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")
Now that we have the package installed and the library loaded in our script, let’s see what it can do! Let’s use the last plot we made with ggplot. To do this, we simply create our ggplot as we did before, then run the command ggplotly(), which takes the last plot we created and transforms it into a magical interactive plot!
ggplotly()
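Relying on the ‘last plot’ can be fragile in a longer script, so one option is to store the plot in a variable and pass it to ggplotly() directly. This is only a sketch: the data frame toy and the names p and ip are made up for illustration, and it assumes the ggplot2 and plotly packages are installed.

```r
library(ggplot2)
library(plotly)

# Made-up data, for illustration only
toy <- data.frame(
  lexp = c(50, 55, 61, 64, 70, 72, 75, 77),
  ID   = rep(c("A", "B"), each = 4)
)

# Store the ggplot in a variable instead of relying on the last plot drawn
p <- ggplot(toy, aes(lexp)) +
  geom_histogram(aes(fill = ID, color = ID), bins = 5, alpha = 0.7) +
  facet_grid(ID ~ .) +
  theme_minimal() +
  theme(legend.position = "none")

# Convert that specific plot into an interactive one
ip <- ggplotly(p)
```

Typing ip at the console then displays the interactive version.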
How great is that! Plotly is an incredibly powerful tool, and there are ways to make this plot even better by building it using only the plotly package. However, this is very advanced, so we will cover it in another tutorial as there is no need for it right now.
Name: Andy Challis
Email: andrewchallis@hotmail.co.uk
Linkedin: http://uk.linkedin.com/in/achallis
There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot the density of life expectancy, all we have to do is:
plot(density(socio$lexp))
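It can help to know what density() hands over to plot(). A small sketch with made-up numbers (the vector x below is invented for illustration and is not the socio data):

```r
# Made-up life expectancy values, for illustration only
x <- c(48, 52, 55, 60, 63, 67, 70, 72, 74, 76, 77, 79)

# density() returns the smoothed curve as a grid of x positions and heights y
d <- density(x)
length(d$x)   # number of grid points (512 by default)
d$bw          # the bandwidth (amount of smoothing), chosen automatically

# The area under a density curve is approximately 1
area <- sum(d$y) * diff(d$x[1:2])
area
```

So plot(density(...)) is simply drawing the d$x and d$y values as a line.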
To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis label and changing the color of the line, to name a few. Here is an example where we have changed some of the looks.
plot(density(socio$lexp),
col = "blue",
main = "Density of Life Expectancy",
xlab = "Life Expectancy")
These plots are quick and easy to produce, but they look terrible! Definitely not publishable, just look at those axes! Let’s look at some other options we have to display our graphs. We will go through the following packages:
- ggplot2
- plotly
There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.
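Now we are ready to introduce a package called ggplot2, so-called because it is built using the ‘grammar of graphics’. The way this package is intended to be used is by layering different options. There are 5 components that define a layer, they are:
- the data, which must be an R data frame
- the aesthetic mappings, given by aes()
- the geom (the geometric object that is drawn)
- the stat (the statistical transformation applied to the data)
- the position adjustment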
Before we get overly complicated about what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.
To install this package and load it into our R script simply run the following:
install.packages("ggplot2")
library(ggplot2)
First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio.
ggplot(data = socio)
Notice how this is a completely blank plot. This is because we haven’t told the computer what we want to plot. So let’s look at the second component, the aesthetic mappings. This is given by aes(), and we want a density plot of the life expectancy.
ggplot(data = socio, aes(lexp))
This is looking better! We now have an axis. Notice how the limits of the x axis are set to the minimum and maximum values of life expectancy! The next component in the grammar of graphics is the geom. Let’s make a density plot just like before and see if it looks better.
ggplot(data = socio, aes(lexp)) +
geom_density()
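Behind the scenes, geom_density() also brings a statistical transformation with it: ggplot2 estimates the density before it draws the line. Here is a rough sketch of how to peek at that computed data with layer_data(), using a made-up data frame toy (not the socio data) and hypothetical names p and computed:

```r
library(ggplot2)

# Made-up life expectancy values, for illustration only
toy <- data.frame(lexp = c(48, 52, 55, 60, 63, 67, 70, 72, 74, 76, 77, 79))

# Data + aesthetic mapping + geom, built up layer by layer
p <- ggplot(data = toy, aes(lexp)) +
  geom_density()

# layer_data() returns the values ggplot2 computed for the first layer
computed <- layer_data(p, 1)
head(computed[, c("x", "density")])
```

The x and density columns are the points of the curve that ends up on the plot.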
In my opinion, this already looks better than the basic plot! Now let’s make it look even better by first changing the line color, the title and the axis labels:
ggplot(socio, aes(lexp)) +
geom_density(color = "#FF6961") +
ggtitle("Density of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Density")
This looks much neater! Now let’s look at some other ways we can use the aesthetic layers, such as faceting by the region, adding a fill color and setting an alpha level.
ggplot(socio, aes(lexp)) +
geom_density(fill = "#FFB347", color = "#FF6961", alpha = 0.5) +
ggtitle("Density of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Density") +
facet_grid(ID~.)
Neat! Each region has its own density plot! Now we can directly compare the distributions of life expectancy across regions.
Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. We do this with the command theme(), and there are also pre-made themes. My favourite theme is called theme_minimal(), so let’s see what this looks like:
ggplot(socio, aes(lexp)) +
geom_density(fill = "#FFB347", color = "#FF6961", alpha = 0.5) +
ggtitle("Density of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Density") +
facet_grid(ID~.) +
theme_minimal()
Great! The only change I want to make is to see what it would look like if we had a different color for each of the density plots, so that each region has its own color. Note we also don’t need a legend for this, so let’s remove it using theme(legend.position = "none"), and let’s also center the title with plot.title = element_text(hjust = 0.5).
ggplot(socio, aes(lexp)) +
geom_density(aes(fill = ID, color = ID), alpha = 0.5) +
ggtitle("Density of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Density") +
facet_grid(ID~.) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
Hang on, what would it look like if we had the density plots in a different order? Say from lowest median life expectancy to highest, would that look better? Let’s see. Don’t worry about this code if you don’t understand it. It makes a new data frame called cdata that holds the median lexp value for each ID. We then reorder the factor levels so that they go in order of increasing median.
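library(plyr)
# Median life expectancy for each region
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))
# Reorder the regions from lowest to highest median (works whether ID is text or a factor)
socio$ID <- factor(socio$ID, levels = as.character(cdata$ID)[order(cdata$median)])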
ggplot(socio, aes(lexp)) +
geom_density(aes(fill = ID, color = ID), alpha = 0.5) +
ggtitle("Density of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Density") +
facet_grid(ID~.) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
The plotly package has a great way to add interactivity to the plots we make using ggplot. Even better than that, it’s incredibly easy to use. The only issue is that the result doesn’t always look exactly like the plot we made in ggplot, so we may have to simplify some of the extras we changed.
The ggplotly() function itself runs locally, so the username and API key below are only needed if you want to publish plots to a Plotly account. To do that, sign up for a free Plotly account and take note of your username and API key. Put these into the code below so they are stored for the rest of your R session.
install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")
Now that we have the package installed and the library loaded in our script, let’s see what it can do! Let’s use the last plot we made with ggplot. To do this, we simply create our ggplot as we did before, then run the command ggplotly(), which takes the last plot we created and transforms it into a magical interactive plot!
ggplotly()
How great is that! Plotly is an incredibly powerful tool, and there are ways to make this plot even better by building it using only the plotly package. However, this is very advanced, so we will cover it in another tutorial as there is no need for it right now.
These base plots are normally quick and easy to produce, but they look terrible! In the case of mapping it isn’t worth the hassle to learn to use the base plotting. Instead, let’s look at some other options we have to display our graphs. We will go through the following packages:
- ggplot2
- plotly
- ggmaps
There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.
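Now we are ready to introduce a package called ggplot2, so-called because it is built using the ‘grammar of graphics’. The way this package is intended to be used is by layering different options. There are 5 components that define a layer, they are:
- the data, which must be an R data frame
- the aesthetic mappings, given by aes()
- the geom (the geometric object that is drawn)
- the stat (the statistical transformation applied to the data)
- the position adjustment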
Before we get overly complicated about what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.
To install this package and load it into our R script simply run the following:
install.packages("ggplot2")
library(ggplot2)
First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio. We can also get the shapes of the countries we have in our data frame (and the ones we don’t have). To do this we just need to install a few packages, so just run the following:
install.packages(c("devtools", "dplyr", "stringr", "maps", "mapdata"))
library(devtools)
library(dplyr)
library(stringr)
library(maps)
library(mapdata)
Now we should have everything ready to make a plot of the world! Let’s add the data to make up the first component of the ggplot method. Remember we must have an R data frame. So let’s merge two data frames together: the countries from the socio data frame and the country shapes from the package we just loaded. Let’s also create another data frame of the countries we don’t have information for, so we can set their color to grey.
# Country outlines from the maps package; rename column 5 ("region") to match socio
w2hr <- map_data("world")
names(w2hr)[5] <- "country"
# Work with plain text so that new country names can be assigned
socio$country <- as.character(socio$country)
# Rename countries in socio to the names used by the map data
socio$country[socio$country == "United States"] <- "USA"
socio$country[socio$country == "United Kingdom"] <- "UK"
socio$country[socio$country == "Korea Republic of"] <- "South Korea"
socio$country[socio$country == "Congo"] <- "Republic of Congo"
socio$country[socio$country == "Iran Islamic Republic of"] <- "Iran"
socio$country[socio$country == "Syrian Arab Republic"] <- "Syria"
socio$country[socio$country == "Egypt Arab Republic of"] <- "Egypt"
socio$country[socio$country == "Central African Rep."] <- "Central African Republic"
socio$country[socio$country == "Cote d'Ivoire"] <- "Ivory Coast"
# Tidy the map data: label missing subregions and fix split or nested countries
w2hr$subregion[is.na(w2hr$subregion)] <- "Unknown"
w2hr$country[w2hr$subregion == "Hong Kong"] <- "Hong Kong"
w2hr$country[w2hr$country %in% c("Tobago", "Trinidad")] <- "Trinidad and Tobago"
# Outlines of countries we have data for, and outlines of those we don't
full.df <- inner_join(w2hr, socio, by = "country")
empty.df <- w2hr[!(w2hr$country %in% socio$country), ]
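If you are wondering how to find out which country names need renaming in the first place, here is a small sketch of the idea on made-up data frames. The names shapes, stats, unmatched, full and empty and all of the values are invented for illustration, and anti_join() is a swapped-in dplyr alternative for picking out the rows with no match.

```r
library(dplyr)

# Made-up stand-ins for the map outlines and the socio table
shapes <- data.frame(
  country = c("USA", "USA", "UK", "Aruba"),
  long    = c(-100, -99, 0, -70),
  lat     = c(40, 41, 52, 12),
  stringsAsFactors = FALSE
)
stats <- data.frame(
  country = c("United States", "UK"),
  score   = c(1, 2),
  stringsAsFactors = FALSE
)

# Which names in our table have no match in the map data?
unmatched <- setdiff(stats$country, shapes$country)
unmatched

# Rename so that both tables agree, then join
stats$country[stats$country == "United States"] <- "USA"
full  <- inner_join(shapes, stats, by = "country")  # outlines we have data for
empty <- anti_join(shapes, stats, by = "country")   # outlines with no data
```

Running setdiff() on the real socio and map data in the same way would list the names that still need a matching rename.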
Let’s take a quick peek at the top of the two data frames, empty.df and full.df.
head(empty.df)
| long | lat | group | order | country | subregion |
|---|---|---|---|---|---|
| -69.89912 | 12.45200 | 1 | 1 | Aruba | Unknown |
| -69.89571 | 12.42300 | 1 | 2 | Aruba | Unknown |
| -69.94219 | 12.43853 | 1 | 3 | Aruba | Unknown |
| -70.00415 | 12.50049 | 1 | 4 | Aruba | Unknown |
| -70.06612 | 12.54697 | 1 | 5 | Aruba | Unknown |
| -70.05088 | 12.59707 | 1 | 6 | Aruba | Unknown |
head(full.df)
| long | lat | group | order | country | subregion | ID | gnp | lexp | lexpf | lexpm | adulit | hdi | fertr | birthr | pop | popgrwth | childmf | childmm | infmor | urbanpop | energcpc | pppgnp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20.61133 | 60.04068 | 7 | 884 | Finland | Aland Islands | OECD | 24110 | 76 | 79 | 73 | 99 | 0.954 | 1.8 | 13 | 4986000 | 0.5 | 7 | 9 | 6 | 60 | 5707 | 16446 |
| 20.60342 | 60.01694 | 7 | 885 | Finland | Aland Islands | OECD | 24110 | 76 | 79 | 73 | 99 | 0.954 | 1.8 | 13 | 4986000 | 0.5 | 7 | 9 | 6 | 60 | 5707 | 16446 |
| 20.52178 | 60.01167 | 7 | 886 | Finland | Aland Islands | OECD | 24110 | 76 | 79 | 73 | 99 | 0.954 | 1.8 | 13 | 4986000 | 0.5 | 7 | 9 | 6 | 60 | 5707 | 16446 |
| 20.48750 | 60.03276 | 7 | 887 | Finland | Aland Islands | OECD | 24110 | 76 | 79 | 73 | 99 | 0.954 | 1.8 | 13 | 4986000 | 0.5 | 7 | 9 | 6 | 60 | 5707 | 16446 |
| 20.41123 | 60.03013 | 7 | 888 | Finland | Aland Islands | OECD | 24110 | 76 | 79 | 73 | 99 | 0.954 | 1.8 | 13 | 4986000 | 0.5 | 7 | 9 | 6 | 60 | 5707 | 16446 |
| 20.39795 | 60.04068 | 7 | 889 | Finland | Aland Islands | OECD | 24110 | 76 | 79 | 73 | 99 | 0.954 | 1.8 | 13 | 4986000 | 0.5 | 7 | 9 | 6 | 60 | 5707 | 16446 |
Great, we have all the data prepared in two data frames. Let’s build a map! We first have to add the data we want to show. In this case we will use full.df since it will be used most, then when we want to use empty.df we just have to override the data.
ggplot(data = full.df)
Notice how this is a completely blank plot. This is because we haven’t told the computer what we want to plot. So let’s look at the second component, the aesthetic mappings. This is given by aes(). We want the shape of a map, so let’s use longitude as the x value and latitude as the y value. We also have to use the group column for the grouping.
ggplot(data = full.df, aes(x=long, y = lat, group = group))
This is looking good! We now have axes with the longitude and latitude. Notice how the limits of the x and y axes are set to the minimum and maximum values of long and lat! The next component in the grammar of graphics is the geom. Let’s make a polygon plot and see what it can do.
ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
geom_polygon()
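The group aesthetic is what tells geom_polygon() which points belong to the same shape. A minimal sketch with two made-up squares (the data frame squares and the names p and drawn are invented for illustration):

```r
library(ggplot2)

# Two unit squares side by side; each row is one corner of an outline
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c(1, 2), each = 4)
)

# With group set, each set of four corners is drawn as its own polygon
p <- ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon()

# The data behind the layer keeps track of the two separate groups
drawn <- layer_data(p, 1)
length(unique(drawn$group))
```

Without the group mapping, all eight corners would be joined into one single shape, which is why the map needs the group column.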
Nice! As we can see, we have plotted all the countries we have data for. Let’s color these blue, then add the ones we don’t have and color those grey.
ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
geom_polygon(fill = "blue") +
geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey")
This looks neat! Now let’s look at some other ways we can use the aesthetic layers, such as filling the countries by life expectancy. Let’s also describe the plot by adding a title using ggtitle().
ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
geom_polygon(aes(fill = lexp)) +
geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey") +
ggtitle("Life Expectancy across the world")
Awesome! Each country is now colored by its life expectancy! Now we can directly compare life expectancy across countries.
Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. We do this with the command theme(). Let’s remove all the axis titles, tick marks, grid lines and so on, as well as the background.
ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
geom_polygon(aes(fill = lexp)) +
geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey") +
ggtitle("Life Expectancy across the world") +
theme_minimal() + theme(
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5)
)
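Since the same theme settings are reused in every map from here on, one option is to bundle them into a single object and add that instead. This is just a sketch: the names map_theme, squares and p are hypothetical, and the two squares are made-up shapes standing in for the real map data.

```r
library(ggplot2)

# Hypothetical name: bundle the map styling into one reusable object
map_theme <- theme_minimal() +
  theme(
    axis.text    = element_blank(),
    axis.line    = element_blank(),
    axis.ticks   = element_blank(),
    panel.border = element_blank(),
    panel.grid   = element_blank(),
    axis.title   = element_blank(),
    plot.title   = element_text(hjust = 0.5)
  )

# Made-up shapes, for illustration only
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c(1, 2), each = 4)
)

# Adding the stored theme has the same effect as writing it out in full
p <- ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey") +
  ggtitle("Two squares") +
  map_theme
```

This keeps each plot call shorter and means a styling change only has to be made in one place.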
Great! The only change I want to make to the layout is to see what it would look like if we had thicker border colors between each of the countries. We can do this using both color and size.
ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
geom_polygon(aes(fill = lexp, text = paste("Country:", country)), color = "white", size = 0.1) +
geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey", color = "white", size = 0.1) +
ggtitle("Life Expectancy across the world") +
theme_minimal() + theme(
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5)
)
The final touches will be to change the legend title/color gradient and to add a text aesthetic ready for using the plotly package.
ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
geom_polygon(aes(fill = lexp, text = paste("Country:", country, "<br>", "Region", ID)), color = "white", size = 0.1) +
geom_polygon(data = empty.df, aes(x=long, y = lat, group = group, text = paste("Country:", country)), fill = "grey", color = "white", size = 0.1) +
ggtitle("Life Expectancy across the world") +
scale_fill_distiller(palette = "Spectral", name = "Life Expectancy", direction = 1) +
theme_minimal() + theme(
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5)
)
The plotly package has a great way to add interactivity to the plots we make using ggplot. Even better than that, it’s incredibly easy to use. The only issue is that the result doesn’t always look exactly like the plot we made in ggplot, so we may have to simplify some of the extras we changed.
The ggplotly() function itself runs locally, so the username and API key below are only needed if you want to publish plots to a Plotly account. To do that, sign up for a free Plotly account and take note of your username and API key. Put these into the code below so they are stored for the rest of your R session.
install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")
Now that we have the package installed and the library loaded in our script, let’s see what it can do! Let’s use the last plot we made with ggplot. To do this, we simply create our ggplot as we did before, then run the command ggplotly(), which takes the last plot we created and transforms it into a magical interactive plot!
ggplotly()
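The text aesthetic we added earlier is there for the hover labels. As a rough sketch, ggplotly() has a tooltip argument that can be set to "text" so that only that label is shown. The data frame squares and the names p and ip below are made up for illustration, and ggplot2 may warn that it does not know the text aesthetic, which is expected here.

```r
library(ggplot2)
library(plotly)

# Made-up shapes with a label column, for illustration only
squares <- data.frame(
  long    = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat     = c(0, 0, 1, 1,  0, 0, 1, 1),
  group   = rep(c(1, 2), each = 4),
  country = rep(c("Left", "Right"), each = 4)
)

# The text aesthetic is ignored by ggplot2 itself but picked up by plotly
p <- ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(text = paste("Country:", country)),
               fill = "grey", color = "white")

# Show only the text aesthetic when hovering over a shape
ip <- ggplotly(p, tooltip = "text")
```

Typing ip at the console displays the interactive version, with the country label appearing on hover.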
How great is that! Plotly is an incredibly powerful tool, and there are ways to make this plot even better by building it using only the plotly package. However, this is very advanced, so we will cover it in another tutorial as there is no need for it right now.