How do I get the data into R?

If the data is in .csv format (comma-separated values), we can easily read it into R using read.csv(). For example, if we had a csv file called socio.csv in the directory /Users/andrewchallis/Documents/Personal/Plots for rach and sharna/, we would simply run the code below.

Note that we must put the path to the file in either single quotes '' or double quotes "". This makes what is called a character string, which just tells the computer that it is text and not a number, a function or something else. Inside the quotes, spaces in the path are written as they are: a shell-style backslash before a space (\ ) is not a valid escape in an R string and causes an error.

read.csv('/Users/andrewchallis/Documents/Personal/Plots for rach and sharna/socio.csv')

country	ID	gnp	lexp	lexpf	lexpm	adulit	hdi	fertr	birthr	pop	popgrwth	childmf	childmm	infmor	urbanpop	energcpc	pppgnp
Bolivia	1	630	60	62	58	77.5	0.398	4.8	36	7171000	2.4	109	127	92	51	259	1572
Argentina	1	3270	71	75	68	95.3	0.832	2.8	20	32322000	1.2	30	40	29	86	1309	4295
Australia	2	16720	77	80	74	99.0	0.972	1.9	15	17065010	1.5	8	10	8	86	5161	16051
Austria	2	19060	76	80	73	99.0	0.952	1.5	12	7712000	1.2	9	13	7	58	3289	16504
Belgium	2	17610	76	80	73	99.0	0.952	1.6	13	9967000	0.3	10	12	8	96	4841	16381
Benin	5	360	50	52	49	23.4	0.113	6.3	46	4740000	3.1	155	173	113	38	23	1043
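
If you don’t have socio.csv to hand, here is a minimal sketch of the same idea that runs anywhere. It writes a tiny csv into a temporary folder and reads it back with read.csv(). The folder name "my data", the file name demo.csv and the three rows are made up for illustration; they are not part of the socio data.

```r
# Write a tiny csv into a temporary folder whose name contains a space.
# The folder name and the three rows are made up for this illustration.
demo_dir <- file.path(tempdir(), "my data")
dir.create(demo_dir, showWarnings = FALSE)
demo_path <- file.path(demo_dir, "demo.csv")
writeLines(c("country,lexp",
             "Bolivia,60",
             "Argentina,71",
             "Australia,77"),
           demo_path)

# Read it back: the path is an ordinary quoted string, spaces included.
demo <- read.csv(demo_path)
print(demo)

# file.exists() is a quick check when read.csv() cannot find a file.
print(file.exists(demo_path))
```

The space in the folder name needs no special treatment because the whole path sits inside a character string.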

Now that we know how to read a csv file, let’s store the table in memory. All this means is giving the table a name, so that the computer knows which table we want to work with. In R this is called assignment; it sounds fancy, but it’s just giving the table a name. We can use either <- or = to assign a name to a table.

socio <- read.csv('/Users/andrewchallis/Documents/Personal/Plots for rach and sharna/socio.csv')

socio = read.csv('/Users/andrewchallis/Documents/Personal/Plots for rach and sharna/socio.csv')

Now we are ready to start using the data in the table! That last step was easy, and the steps after won’t be much harder than this.

Getting data from the table

In R, when we import data from a csv file and assign it a name (as we did in the previous step), the table is referred to as a dataframe. This is just a fancy way of saying a table. Since the data is nice and tidy in the format of a dataframe (or table), we can extract information from it really easily! For example, if we wanted to look at just one column, say the life expectancy (which has the column name lexp), one way is to find the number of the column (in this case it is the 4th column) and run the following:

socio[,4]

##   [1] 60 71 77 76 76 50 49 67 73 48 47 50 57 77 49 47 72 69 53 75 55 75 67
##  [24] 66 60 64 48 76 77 53 76 55 77 63 43 54 65 78 71 59 62 63 63 74 76 77
##  [47] 73 79 67 59 71 51 52 70 48 47 70 70 52 77 75 65 45 52 77 66 56 73 55
##  [70] 67 63 64 71 75 70 48 64 47 42 74 48 62 76 71 78 78 66 48 66 54 71 67
##  [93] 67 47 72 76 76 73 70 50 61

This means: take the dataframe called socio and give me all the rows and only the 4th column. In general, the format is dataframe[row number, column number]. Since we wanted all the rows we left the row number blank, and since we only wanted the 4th column we put 4 after the comma.

An easier way to do this takes advantage of the fact that the columns of a dataframe have names. This is the way I always use, as it is much easier to read and understand. We want the column lexp from the dataframe socio, and the code for this is:

socio$lexp

##   [1] 60 71 77 76 76 50 49 67 73 48 47 50 57 77 49 47 72 69 53 75 55 75 67
##  [24] 66 60 64 48 76 77 53 76 55 77 63 43 54 65 78 71 59 62 63 63 74 76 77
##  [47] 73 79 67 59 71 51 52 70 48 47 70 70 52 77 75 65 45 52 77 66 56 73 55
##  [70] 67 63 64 71 75 70 48 64 47 42 74 48 62 76 71 78 78 66 48 66 54 71 67
##  [93] 67 47 72 76 76 73 70 50 61

So in general, if we want a column from a dataframe we just need to run the code dataframe$column_name.
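
To see that these ways of picking a column really give the same thing, here is a small sketch on a made-up three-row data frame (called demo here so it doesn’t clash with socio). The [[ ]] form is an extra base R option that takes the column name as a string.

```r
# A three-row stand-in for socio; the values are made up for illustration.
demo <- data.frame(country = c("A", "B", "C"),
                   gnp     = c(630, 3270, 16720),
                   lexp    = c(60, 71, 77))

by_position <- demo[, 3]        # all rows, 3rd column
by_dollar   <- demo$lexp        # column by name
by_name     <- demo[["lexp"]]   # column by name, given as a string

# All three return the same plain vector.
print(identical(by_position, by_dollar))
print(identical(by_dollar, by_name))

# Rows and columns can be picked together: rows 1 and 2, two named columns.
print(demo[1:2, c("country", "lexp")])
```

Using names rather than positions also keeps working if the columns are later reordered.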

Types of data

There are a few types of data I will be using in this tutorial, namely:

Character String (for example, “Hi my name is Andy”)
Numeric (for example, 3.14 or 1.01)
Integer (for example 1 or 3 or 100)
Factor (for example, Male or Female)
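
Here is a quick sketch of the four types above using class(), which reports the type of a single value or column. The variable names are made up for illustration.

```r
greeting <- "Hi my name is Andy"                 # character string
ratio    <- 3.14                                 # numeric
count    <- 100L                                 # integer: the L suffix marks a whole number
sex      <- factor(c("Male", "Female", "Male"))  # factor

print(class(greeting))
print(class(ratio))
print(class(count))
print(class(sex))

# A factor stores a fixed set of categories, called levels.
print(levels(sex))
```

Note that a whole number typed without the L suffix, such as 100, is stored as numeric rather than integer.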

It is very easy to see the types of data we have in our data frame (socio). This can be done by using str(), which asks: what is the structure of this data frame?

str(socio)

## 'data.frame':    101 obs. of  18 variables:
##  $ country : Factor w/ 101 levels "Argentina","Australia",..: 7 1 2 3 4 5 6 8 9 10 ...
##  $ ID      : int  1 1 2 2 2 5 4 5 2 5 ...
##  $ gnp     : int  630 3270 16720 19060 17610 360 200 2230 2320 270 ...
##  $ lexp    : int  60 71 77 76 76 50 49 67 73 48 ...
##  $ lexpf   : int  62 75 80 80 80 52 47 69 76 49 ...
##  $ lexpm   : int  58 68 74 73 73 49 50 65 70 46 ...
##  $ adulit  : num  77.5 95.3 99 99 99 23.4 38.4 73.6 93 18.2 ...
##  $ hdi     : num  0.398 0.832 0.972 0.952 0.952 0.113 0.15 0.552 0.854 0.08 ...
##  $ fertr   : num  4.8 2.8 1.9 1.5 1.6 6.3 5.5 4.7 1.9 6.5 ...
##  $ birthr  : int  36 20 15 12 13 46 39 35 13 47 ...
##  $ pop     : int  7171000 32322000 17065010 7712000 9967000 4740000 1433000 1277000 8636000 9016000 ...
##  $ popgrwth: num  2.4 1.2 1.5 1.2 0.3 3.1 2.1 3.3 -4 2.8 ...
##  $ childmf : int  109 30 8 9 10 155 183 41 14 190 ...
##  $ childmm : int  127 40 10 13 12 173 179 53 19 210 ...
##  $ infmor  : int  92 29 8 7 8 113 122 38 14 134 ...
##  $ urbanpop: int  51 86 86 58 96 38 5 25 68 15 ...
##  $ energcpc: int  259 1309 5161 3289 4841 23 13 417 3143 17 ...
##  $ pppgnp  : int  1572 4295 16051 16504 16381 1043 800 3419 4700 618 ...

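There seem to be two columns which we may want to change, namely country, which is a factor, and ID, which is an integer. The country column would be better as a character string, and the ID column would be better as a factor. If you are using R 4.0 or later, read.csv() leaves text columns as character strings by default, so country may already show up as chr and the next step changes nothing.

Firstly, let’s change the country column from a factor to a character string.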

socio$country = as.character(socio$country)

Notice how we have overwritten what was in the country column of the data frame. This is called reassignment. We should be careful to only do this if we are confident we won’t be losing any data.
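
Whether a text column arrives as a factor or as a character string depends on the stringsAsFactors argument of read.csv(). The sketch below sets it explicitly both ways on two made-up rows, so the result does not depend on which version of R is installed.

```r
# Two rows of csv text, made up for illustration; text = avoids needing a file.
csv_text <- "country,lexp\nBolivia,60\nArgentina,71"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(class(as_factor$country))
print(class(as_text$country))

# Converting the factor column gives back the original labels as text.
converted <- as.character(as_factor$country)
print(identical(converted, as_text$country))
```

Since as.character() on a factor returns its labels, no information is lost by the conversion.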

Now, let’s replace the numbers in the ID column to what they represent. In the documentation of the data we see that:

1 = Latin America

2 = OECD

3 = East Asia

4 = Other Asia

5 = Africa

6 = Gulf

The command to convert these numbers (1,2,3,4,5 or 6) to factors with the labels (“Latin America”, “OECD”, “East Asia”, “Other Asia”, “Africa”, “Gulf”) is as follows:

socio$ID = factor(socio$ID,
       labels = c("Latin America", "OECD", "East Asia", "Other Asia", "Africa", "Gulf"))
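
One thing to be aware of: the call above relies on all six codes appearing in the data. If one were missing, R would have six labels for five distinct values and would stop with an error. A slightly more defensive sketch, on made-up codes rather than the socio data, spells out the levels as well:

```r
# Made-up ID codes for illustration; note that code 3 does not occur.
id_codes <- c(1, 1, 2, 5, 4, 6)

region_names <- c("Latin America", "OECD", "East Asia",
                  "Other Asia", "Africa", "Gulf")

# levels = 1:6 pairs code 1 with the first label, code 2 with the second, etc.
region <- factor(id_codes, levels = 1:6, labels = region_names)

print(region)
print(table(region))  # East Asia is listed with a count of 0
```

With levels given explicitly, each code is always matched to the intended label, whichever codes happen to be present.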

To check this has worked, we can run str(socio) again. Notice that we now have all the data in our data frame in the format we will need it in.

str(socio)

## 'data.frame':    101 obs. of  18 variables:
##  $ country : chr  "Bolivia" "Argentina" "Australia" "Austria" ...
##  $ ID      : Factor w/ 6 levels "Latin America",..: 1 1 2 2 2 5 4 5 2 5 ...
##  $ gnp     : int  630 3270 16720 19060 17610 360 200 2230 2320 270 ...
##  $ lexp    : int  60 71 77 76 76 50 49 67 73 48 ...
##  $ lexpf   : int  62 75 80 80 80 52 47 69 76 49 ...
##  $ lexpm   : int  58 68 74 73 73 49 50 65 70 46 ...
##  $ adulit  : num  77.5 95.3 99 99 99 23.4 38.4 73.6 93 18.2 ...
##  $ hdi     : num  0.398 0.832 0.972 0.952 0.952 0.113 0.15 0.552 0.854 0.08 ...
##  $ fertr   : num  4.8 2.8 1.9 1.5 1.6 6.3 5.5 4.7 1.9 6.5 ...
##  $ birthr  : int  36 20 15 12 13 46 39 35 13 47 ...
##  $ pop     : int  7171000 32322000 17065010 7712000 9967000 4740000 1433000 1277000 8636000 9016000 ...
##  $ popgrwth: num  2.4 1.2 1.5 1.2 0.3 3.1 2.1 3.3 -4 2.8 ...
##  $ childmf : int  109 30 8 9 10 155 183 41 14 190 ...
##  $ childmm : int  127 40 10 13 12 173 179 53 19 210 ...
##  $ infmor  : int  92 29 8 7 8 113 122 38 14 134 ...
##  $ urbanpop: int  51 86 86 58 96 38 5 25 68 15 ...
##  $ energcpc: int  259 1309 5161 3289 4841 23 13 417 3143 17 ...
##  $ pppgnp  : int  1572 4295 16051 16504 16381 1043 800 3419 4700 618 ...

Plotting graphs in R

Scatter plots

Base R plots

There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot a scatterplot of hdi (x axis) against life expectancy (y axis), all we have to do is:

plot(socio$hdi, socio$lexp)

To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis labels and changing the color of the points, to name a few. Here is an example where we have changed some of the looks.

plot(socio$hdi, socio$lexp, 
     col = "blue", 
     xlab = "HDI",
     ylab = "Life expectancy",
     main = "Plot of HDI against Life expectancy")

These plots are quick and easy to produce, but they look terrible! Definitely not publishable! Let’s look at some other options we have to display our graphs. We will go through the following packages:

ggplot2
plotly
ggvis

There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.

ggplot2

Now we are ready to introduce a package called ggplot2, so called because it is built using the ‘grammar of graphics’. The package is intended to be used by layering different options. There are 5 components that define a layer:

The data, which must be an R data frame, and can be changed after the plot is created.
A set of aesthetic mappings, which describe how variables in the data are mapped to aesthetic properties of the layer.
The geom, which describes the geometric used to draw the layer. The geom defines the set of available aesthetic properties.
(Optional) The stat, which takes the raw data and transforms it in some useful way. The stat returns a data frame with new variables that can also be mapped to aesthetics with a special syntax.
(Optional) The position adjustment, which adjusts elements to avoid overplotting.

Before we get bogged down in what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.

To install this package and load it into our R script, simply run the following:

install.packages("ggplot2")
library(ggplot2)

First let’s look at the first component that ggplot2 has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio.

ggplot(data = socio)

Notice how this is a completely blank plot. This is because we haven’t told the computer what we want to plot. So let’s look at the second component, the aesthetic mappings. These are given by aes(); we want to plot hdi on the x axis and lexp on the y axis.

ggplot(data = socio, aes(x=hdi, y=lexp))

This is looking better! We now have a pair of axes. Notice how the limits of the axes are automatically set to cover the minimum and maximum values we have in our data! The next component in the grammar of graphics is the geom. Let’s make a scatter plot just like before and see if it looks better.

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point()

In my opinion, this already looks better than the basic plot! Now let’s make it look even better by first changing the:

Color of the points.
Labels of the x and y axis.
Title

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(color = "blue") +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy")

This looks much neater! Now let’s look at some other ways we can use the aesthetic mappings, such as size, shape, alpha and color. Note that in the previous plot we did change the color, but we didn’t make an aesthetic mapping using the data from the data frame (socio). To illustrate this, let’s look at the difference when we use an aesthetic mapping of the color.

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy")
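
The difference between setting a color and mapping a color can also be checked in numbers. The sketch below builds both kinds of plot on six made-up points (not the socio data) and uses ggplot2’s layer_data() to look at the colors each point ends up with. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Six made-up points standing in for socio; only the structure matters.
demo <- data.frame(hdi  = c(0.40, 0.83, 0.97, 0.95, 0.11, 0.55),
                   lexp = c(60, 71, 77, 76, 50, 67),
                   ID   = factor(c("Latin America", "Latin America", "OECD",
                                   "OECD", "Africa", "Africa")))

# Setting: color sits outside aes(), so every point gets the same fixed color.
p_set <- ggplot(demo, aes(x = hdi, y = lexp)) +
  geom_point(color = "blue")

# Mapping: color sits inside aes(), so it is looked up from the ID column.
p_map <- ggplot(demo, aes(x = hdi, y = lexp)) +
  geom_point(aes(color = ID))

# layer_data() returns the values ggplot2 computed for a layer.
print(unique(layer_data(p_set)$colour))          # one color for all points
print(length(unique(layer_data(p_map)$colour)))  # one color per region
```

A handy rule of thumb: anything inside aes() refers to a column of the data, anything outside is a fixed value.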

Neat! Each point now has a color that depends on the region its country belongs to! Now we know the difference between changing the color of all the points and adding an aesthetic mapping to have the color defined by another variable in the data frame. What do the other options do, I hear you say? Let’s try them out! Maybe we could have the size of the points dependent on the population of the country.

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy")

More mappings!? Alright, let’s see if changing the shape of the points makes this graph look clearer.

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop, shape = ID)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy")

This looks very confusing! Maybe we could change the alpha level rather than the shape of the points. What is this mysterious alpha? Alpha actually comes from the planet Mars, just kidding, alpha is a ghost! Joking again, kind of. Alpha is how transparent the points are, sometimes this can make plots much easier to understand. Let’s set the alpha level based upon the adult literacy, so if the adult literacy is low, then the point will be more transparent.

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop, alpha = adulit)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy")

Wow! These aesthetics are pretty bad ass! Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. The way we do this is to use the command theme(), and there are also pre-made themes. My favourite theme is called theme_minimal(), so let’s see what this looks like:

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop, alpha = adulit)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy") +
  theme_minimal()

Great! The only change I want to make to the layout is the title position, let’s align it in the center. To change this we need to use theme() again, after the line theme_minimal().

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop, alpha = adulit)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

On second thought, I would also like to change the names on the scales of the legends. This is the most complicated it will get and will be our final graph using ggplot2.

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop, alpha = adulit)) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "Region") +
  scale_alpha_continuous(name = "Adult Literacy") +
  scale_size_continuous(name = "Population")

plotly

The plotly package has a great way to add interactivity to the plots we make using ggplot2. Even better than that, it’s incredibly easy to use. The only issue is that the result doesn’t always look like the plot we made in ggplot2, so we may have to simplify some of the extras we changed.

Plotly also offers a hosted service for publishing plots online. If you want to use it, sign up for a free plotly account, take note of your username and API key, and put them into the code below so they are stored for the current R session. For interactive plots that are only viewed locally, as in this tutorial, recent versions of the package should work without an account.

install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")

Now that we have the package installed and the library loaded in our script, let’s see what this can do! Let’s use the last plot we made with ggplot2 but remove the legends (we have to remove them for this plot as it doesn’t look good with the legends). To do this, we simply create our ggplot as we did before, then run the command ggplotly(), and it will take the last plot we created and transform it into a magical interactive plot!

Note we also added text = paste("Country:", country) into the geom_point() function. This is ignored by ggplot2 (it may warn about an unknown aesthetic), but plotly will see this and when we hover over a point it will show the country!

ggplot(data = socio, aes(x=hdi, y=lexp)) +
  geom_point(aes(color = ID, size = pop, alpha = adulit, text = paste("Country:", country))) +
  ylab("Life Expectancy") +
  xlab("Human Development Index") +
  ggtitle("Plot of HDI against Life expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none")

ggplotly()

How great is that! Plotly is an incredibly powerful tool, and there are ways to make this plot even better by building the plot using only the plotly package. However, this is very advanced. We will cover it in another tutorial as there is no need for it right now.

ggvis

Another package we can use is ggvis, which is similar to ggplot2. However, it can be given customisable options when running locally or on a shiny server, such as picking what color the points should be or what type of regression line you want to fit. The downside is that ggvis has not been developed as much as ggplot2 and hence has a few bugs. For now though, it’s worth looking at as an alternative.

Let’s first download, install and load the package.

install.packages("ggvis")
library(ggvis)

To use the code below, we must first define the function add_title(), since there is no way to add a title in the package at the moment. This is an issue that will hopefully be sorted soon. Simply copy and paste this code and run it; you don’t need to worry about what it is doing.

add_title <- function(vis, ..., properties=NULL, title = "Plot Title") 
{
  # recursively merge lists by name
  # http://stackoverflow.com/a/13811666/1135316
  merge.lists <- function(a, b) {
    a.names <- names(a)
    b.names <- names(b)
    m.names <- sort(unique(c(a.names, b.names)))
    sapply(m.names, function(i) {
      if (is.list(a[[i]]) && is.list(b[[i]])) merge.lists(a[[i]], b[[i]])
      else if (i %in% b.names) b[[i]]
      else a[[i]]
    }, simplify = FALSE)
  }
  
  # default properties make title 'axis' invisible
  default.props <- axis_props(
    ticks = list(strokeWidth=0),
    axis = list(strokeWidth=0),
    labels = list(fontSize = 0),
    grid = list(strokeWidth=0)
  )
  # merge the default properties with user-supplied props.
  axis.props <- do.call(axis_props, merge.lists(default.props, properties))
  
  # don't step on existing scales.
  vis <- scale_numeric(vis, "title", domain = c(0,1), range = 'width')
  axis <- ggvis:::create_axis('x', 'title', orient = "top",  title = title, properties = axis.props, ...)
  ggvis:::append_ggvis(vis, "axes", axis)
}

Now we are ready to have a quick plot! Note that we can build this up in a very similar way to using ggplot2. The core part of the code below is ggvis(data = socio, x = ~hdi, y = ~lexp, fill = ~ID, size = ~pop, opacity = ~adulit), which defines all the aesthetics we wish to plot. Another important part is %>%, which is a pipe operator. It sounds complicated, but all it does is pass something through the steps required. For example, in maths we sometimes have f(g(x)), which means put x through the function g, then put the result through f. A particular example of this would be sum(sqrt(x)); using the pipe operator %>%, we could write this as x %>% sqrt() %>% sum().
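
Here is the sum-of-square-roots example as a small runnable sketch on a made-up vector. The %>% operator comes from the magrittr package, so the piped version is only run if magrittr is installed; the nested version is plain base R.

```r
# Four made-up numbers with easy square roots.
x <- c(1, 4, 9, 16)

# Nested form: read from the inside out.
nested <- sum(sqrt(x))
print(nested)

# Piped form: read from left to right. Only run if magrittr is available.
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  piped <- x %>% sqrt() %>% sum()
  print(identical(nested, piped))
}
```

Both forms compute the same thing; the pipe just lets us write the steps in the order they happen.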

We tell R that we want to plot it as a scatter plot by using layer_points() which is very similar to ggplot2 when we use geom_point().

All the other parts of the code are just to make it look nicer, by adding a title, changing the legend titles and adding axis labels.

ggvis(data = socio, x = ~hdi, y = ~lexp, fill = ~ID, size = ~pop, opacity = ~adulit) %>%
  add_axis("x",title = "Human Development Index") %>%
  add_axis("y",title = "Life Expectancy")%>%
  layer_points() %>%
  add_legend("size", properties = legend_props(legend = list(y = 120)),
             title = "Population") %>%
  add_legend("fill", properties = legend_props(legend = list(y = 0)),
             title = "Region") %>%
  add_title(title = "Plot of HDI against Life expectancy",
            properties = axis_props(title=list(fontSize=20)))

Contact details:

Name: Andy Challis
Email: andrewchallis@hotmail.co.uk
Linkedin: http://uk.linkedin.com/in/achallis

Box plots

Base R plots

There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot a boxplot of life expectancy, all we have to do is:

boxplot(socio$lexp)

To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis label and changing the fill color of the box, to name a few. Here is an example where we have changed some of the looks.

boxplot(socio$lexp, col = "blue", 
        main = "Boxplot of Life Expectancy", 
        ylab = "Life Expectancy")
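
It can help to see the numbers a box plot is drawn from. The sketch below uses base R’s boxplot.stats() on ten made-up values (not the socio data), one of which is deliberately far from the rest.

```r
# Ten made-up values; 50 is deliberately far from the rest.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# out: points more than 1.5 box lengths beyond the box, drawn individually
print(bs$out)
```

The box runs between the two hinges, the line inside it is the median, and anything listed in out is drawn as a separate point.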

As with the scatter plots, we will also go through the following packages:

ggplot2
plotly

There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.

ggplot2

ggplot2 and its five layer components were introduced in the scatter plot section, so here we only need to make sure the package is loaded:

library(ggplot2)

As before, the first component is the data, which must be an R data frame. Luckily we already have a data frame called socio.

ggplot(data = socio)

This is a completely blank plot, because we haven’t told the computer what we want to plot. The second component is the aesthetic mappings, given by aes(). This time we want the region (ID) on the x axis and lexp on the y axis.

ggplot(data = socio, aes(ID, lexp))

This is looking better! We now have a pair of axes. Notice how the limits of the y axis are automatically set to cover the minimum and maximum values we have in our data! The x axis is set to all the regions! The next component in the grammar of graphics is the geom. Let’s make a box plot just like before and see if it looks better.

ggplot(data = socio, aes(ID, lexp)) +
  geom_boxplot()

In my opinion, this already looks better than the basic plot! Now let’s make it look even better by first changing the:

Fill color of the boxplots.
Labels of the x and y axis.
Title

ggplot(socio, aes(ID,lexp)) + 
  geom_boxplot(aes(fill = ID)) +
  ggtitle("Boxplots of Life Expectancy for each Region") +
  xlab("Region") +
  ylab("Life expectancy")

This looks much neater! Now let’s look at some other things we can add, such as the data points and an alpha level.

ggplot(socio, aes(ID,lexp)) + 
  geom_boxplot(aes(fill = ID),alpha = 0.7) +
  geom_point(aes(color = ID),alpha = 1, position = position_jitter(width = 0.05)) +
  ggtitle("Boxplots of Life Expectancy for each Region") +
  xlab("Region") +
  ylab("Life expectancy")

Neat! Each boxplot has all the data points shown in a different color depending on which region they are representing!

Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. The way we do this is to use the command theme(), and there are also pre-made themes. My favourite theme is called theme_minimal(), so let’s see what this looks like:

ggplot(socio, aes(ID,lexp)) + 
  geom_boxplot(aes(fill = ID),alpha = 0.7) +
  geom_point(aes(color = ID),alpha = 1, position = position_jitter(width = 0.05)) +
  ggtitle("Boxplots of Life Expectancy for each Region") +
  xlab("Region") +
  ylab("Life expectancy") +
  theme_minimal()

Great! The only change I want to make to the layout is the title position, let’s align it in the center. To change this we need to use theme() again, after the line theme_minimal(). There also seems to be no use for the legend, since we can see what region the boxplots represent from the x axis. Let’s remove it.

ggplot(socio, aes(ID,lexp)) + 
  geom_boxplot(aes(fill = ID),alpha = 0.7, outlier.alpha = 1) +
  geom_point(aes(color = ID,
             text = paste("Country:", country)), position = position_jitter(width = 0.05)) +
  ggtitle("Boxplots of Life Expectancy for each Region") +
  xlab("Region") +
  ylab("Life expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none")

Hang on, what would it look like if we had the boxplots in a different order? Say from highest median life expectancy to lowest, would that look better? Let’s see. Don’t worry about this code if you don’t understand it. It makes a new data frame called cdata and finds the median lexp value for each of the IDs. We then reorder the factor levels so that they go in order of decreasing median values.

library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))

socio$ID = factor(socio$ID,levels(socio$ID)[order(-cdata$median)])

ggplot(socio, aes(ID,lexp)) + 
  geom_boxplot(aes(fill = ID),alpha = 0.7, outlier.alpha = 1) +
  geom_point(aes(color = ID,
             text = paste("Country:", country)), position = position_jitter(width = 0.05)) +
  ggtitle("Boxplots of Life Expectancy for each Region") +
  xlab("Region") +
  ylab("Life expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none")
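
If you would rather not load plyr, the reordering step can be sketched in base R. This is an alternative to the ddply() code above, not the same method, and it is shown on six made-up rows rather than the socio data: tapply() finds the median for each region and sort() puts the region names in the order we want.

```r
# Made-up life expectancies for three regions, for illustration only.
demo <- data.frame(
  ID   = factor(c("Africa", "Africa", "Gulf", "Gulf", "OECD", "OECD")),
  lexp = c(48, 52, 70, 72, 76, 78)
)

# Median life expectancy per region, as a named vector.
med <- tapply(demo$lexp, demo$ID, median)

# Rebuild the factor with its levels sorted from highest to lowest median.
demo$ID <- factor(demo$ID, levels = names(sort(med, decreasing = TRUE)))

print(med)
print(levels(demo$ID))
```

Only the order of the levels changes; every row still belongs to the same region as before, so the plots show the same boxes in a new order.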

We can also flip the axes so the boxplots are drawn horizontally. This may make it easier for the audience to understand. This is simply done by adding coord_flip().

library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))

socio$ID = factor(socio$ID,levels(socio$ID)[order(-cdata$median)])

ggplot(socio, aes(ID,lexp)) + 
  geom_boxplot(aes(fill = ID),alpha = 0.7, outlier.alpha = 1) +
  geom_point(aes(color = ID,
             text = paste("Country:", country)), position = position_jitter(height = 0.0005)) +
  ggtitle("Boxplots of Life Expectancy for each Region") +
  xlab("Region") +
  ylab("Life expectancy") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none") + coord_flip()

plotly

install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")

Note we also added text = paste("Country:", country) into the geom_point() function. This is ignored by ggplot2, but plotly will see this and when we hover over a point it will show the country!

ggplotly()

Histograms

Base R plots

There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot a histogram of life expectancy, all we have to do is:

hist(socio$lexp)

To make the plot look more aesthetically pleasing we can do things such as adding a title, changing the axis label, changing the fill color of the bars and changing the number of bins, to name a few. Here is an example where we have changed some of the looks.

hist(socio$lexp,
     col = "blue",
     main = "Histogram of Life Expectancy",
     xlab = "Life Expectancy",
     breaks = 30)
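
To see what breaks actually does, we can ask hist() for the numbers instead of the picture by adding plot = FALSE. The sketch below uses sixteen made-up values rather than the socio data. Giving breaks as a vector fixes the bin edges exactly, while a single number such as 30 is only treated as a suggestion.

```r
# Made-up life expectancies for illustration.
x <- c(42, 47, 48, 50, 55, 60, 63, 66, 67, 70, 71, 73, 75, 76, 77, 78)

# plot = FALSE returns the histogram as numbers instead of drawing it.
h <- hist(x, breaks = seq(40, 80, by = 10), plot = FALSE)

print(h$breaks)  # the bin edges actually used
print(h$counts)  # how many values fall in each bin

# A single number is treated as a suggestion for the number of bins.
h30 <- hist(x, breaks = 30, plot = FALSE)
print(length(h30$counts))
```

If the number of bars in a plot is not what you asked for, checking h$breaks like this shows which edges R settled on.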

These plots are quick and easy to produce, but they look terrible! Definitely not publishable, just look at those axes! Let’s look at some other options we have to display our graphs. We will go through the following packages:

ggplot2
plotly

There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.

ggplot2

As in the previous sections, we make sure ggplot2 is loaded and start with the first component, the data:

library(ggplot2)

ggplot(data = socio)

Next come the aesthetic mappings. A histogram only needs one variable, so we put lexp on the x axis and leave the y axis for ggplot2 to fill in with the counts.

ggplot(data = socio, aes(lexp))

This is looking better! We now have an axis. Notice how the limits of the x axis are set to cover the minimum and maximum values of life expectancy! The next component in the grammar of graphics is the geom. Let’s make a histogram just like before and see if it looks better.

ggplot(data = socio, aes(lexp)) +
  geom_histogram()

In my opinion, this already looks better than the basic plot! Now let’s make it look even better by first changing the:

Fill color of the bars to a pastel orange (googled the hex code which is #FFB347).
Border color of the bars to a pastel red (googled the hex code which is #FF6961).
Labels of the x and y axis.
Title

ggplot(socio, aes(lexp)) + 
  geom_histogram(fill = "#FFB347", color = "#FF6961") +
  ggtitle("Histogram of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count")

This looks much neater! Now let’s look at some other things we can do, such as faceting by the region and setting an alpha level.

ggplot(socio, aes(lexp)) + 
  geom_histogram(fill = "#FFB347", color = "#FF6961",
                 bins = 10, alpha = 0.7) +
  ggtitle("Histograms of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count") +
  facet_grid(ID~.)

Neat! Each region has its own histogram! Now we can directly compare the distributions of the life expectancies across regions.

ggplot(socio, aes(lexp)) + 
  geom_histogram(fill = "#FFB347", color = "#FF6961",
                 bins = 10, alpha = 0.7) +
  ggtitle("Histograms of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count") +
  facet_grid(ID~.) +
  theme_minimal()

Great! The only change I want to make to the layout is to see what it would look like if we had different colors for each of the histograms, so that each region has its own color. Note we also don’t need a legend for this, so let’s remove it using theme(legend.position = "none"), and let’s also make the title center aligned by adding plot.title = element_text(hjust = 0.5) into the theme function.

ggplot(socio, aes(lexp)) + 
  geom_histogram(aes(fill = ID, color = ID),
                 bins = 10, alpha = 0.7) +
  ggtitle("Histograms of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count") +
  facet_grid(ID~.) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

Hang on, what would it look like if we had the histograms in a different order? Say from lowest median life expectancy to highest, would that look better? Let’s see. Don’t worry about this code if you don’t understand it. It makes a new data frame called cdata and finds the median lexp value for each of the IDs. We then reorder the factor levels so that they go in order of increasing median values.

library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))

socio$ID = factor(socio$ID, levels = levels(socio$ID)[order(cdata$median)])

ggplot(socio, aes(lexp)) + 
  geom_histogram(aes(fill = ID, color = ID),
                 bins = 10, alpha = 0.7) +
  ggtitle("Histograms of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count") +
  facet_grid(ID~.) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
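The reordering step can also be written with base R only. This is an alternative sketch, not the plyr approach used above: reorder() from the stats package sorts the levels of a factor by a summary of another variable. The toy data frame is invented for illustration.

```r
toy <- data.frame(
  ID   = factor(c("A", "A", "B", "B", "C", "C")),
  lexp = c(70, 72, 50, 52, 60, 62)
)

# Median life expectancy per group: A = 71, B = 51, C = 61
tapply(toy$lexp, toy$ID, median)

# Reorder the factor levels by increasing group median
toy$ID <- reorder(toy$ID, toy$lexp, FUN = median)

levels(toy$ID)
# "B" "C" "A"
```

Because facet_grid() lays panels out in the order of the factor levels, changing the levels is all that is needed to change the order of the panels.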

plotly

install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")

Now we have the package installed and the library loaded in our script, let’s see what this can do! Let’s use the last plot we made with ggplot. To do this, we simply create our ggplot as we did before, then run the command ggplotly() and it will take the last plot we created and transform it into a magical interactive plot!

ggplotly()
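Calling ggplotly() with no arguments relies on whichever plot was drawn last. A slightly more explicit sketch is to store the plot in a variable and pass it in; the toy data and the names p and interactive_p are illustrative. As far as I know, the Sys.setenv() credentials are only needed for publishing to the plotly web service, not for viewing the plot locally.

```r
library(ggplot2)
library(plotly)

toy <- data.frame(lexp = c(50, 52, 55, 60, 70, 72, 75, 77))

p <- ggplot(toy, aes(lexp)) +
  geom_histogram(bins = 4, fill = "#FFB347", color = "#FF6961")

# Convert this specific plot rather than "the last plot drawn"
interactive_p <- ggplotly(p)

# Typing interactive_p at the console displays the interactive version
```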

Contact details:

Name: Andy Challis
Email: andrewchallis@hotmail.co.uk
Linkedin: http://uk.linkedin.com/in/achallis

Density plots

Base R plots

There are many ways to plot graphs in R. The most basic way is to use the default plotting functions, which are very intuitive. To plot the density of life expectancy, all we have to do is:

plot(density(socio$lexp))

plot(density(socio$lexp),
     col = "blue",
     main = "Density of Life Expectancy",
     xlab = "Life Expectancy")
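It can help to see what density() returns before plotting it. Here is a minimal sketch on made-up numbers (x stands in for socio$lexp): the result holds a grid of x values, the estimated density y at each of them, and the bandwidth bw that controls how smooth the curve is. The area under the curve should come out close to 1.

```r
set.seed(42)
x <- rnorm(100, mean = 65, sd = 10)  # stand-in for socio$lexp

d <- density(x)

length(d$x)   # number of grid points (512 by default)
d$bw          # bandwidth chosen automatically

# Approximate the area under the curve with a simple Riemann sum
area <- sum(d$y) * diff(d$x[1:2])

# adjust multiplies the bandwidth: larger values give a smoother curve
d_smooth <- density(x, adjust = 2)
```

Passing d_smooth to plot() instead of d is a quick way to see the effect of the bandwidth on the shape.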

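Let’s look at some other options we have to display our graphs. We will go through the following packages:

ggplot2
plotly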

There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.

ggplot2

The data, which must be an R data frame, and can be changed after the plot is created.
A set of aesthetic mappings, which describe how variables in the data are mapped to aesthetic properties of the layer.
The geom, which describes the geometric object used to draw the layer. The geom defines the set of available aesthetic properties.
(Optional) The stat, which takes the raw data and transforms it in some useful way. The stat returns a data frame with new variables that can also be mapped to aesthetics with a special syntax.
(Optional) The position adjustment, which adjusts elements to avoid overplotting.
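As an optional aside, the five components can be spelled out one by one with ggplot2’s lower-level layer() function. This is only a sketch on an invented data frame; in everyday use a shortcut such as geom_density() fills in the stat and the position for you, so feel free to skip this.

```r
library(ggplot2)

toy <- data.frame(lexp = c(50, 52, 55, 60, 70, 72, 75, 77))

p <- ggplot() +
  layer(
    data     = toy,            # 1. the data
    mapping  = aes(x = lexp),  # 2. the aesthetic mappings
    geom     = "density",      # 3. the geom
    stat     = "density",      # 4. the stat
    position = "identity"      # 5. the position adjustment
  )

# The stat has turned 8 raw values into a smooth curve of many points
curve_points <- layer_data(p)
```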

Before we get overly complicated by what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.

To install this package and load it into our R script simply run the following:

install.packages("ggplot2")
library(ggplot2)

First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio.

ggplot(data = socio)

ggplot(data = socio, aes(lexp))
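A quick aside on the second call: aes(lexp) is shorthand for aes(x = lexp), because the first unnamed argument of aes() is taken as x and the second as y. That mapping is what gives the plot an x axis. A tiny sketch to check this (toy is a made-up data frame):

```r
library(ggplot2)

toy <- data.frame(lexp = c(50, 60, 70), gnp = c(360, 630, 3270))

names(aes(lexp))        # "x"
names(aes(lexp, gnp))   # "x" "y"

# The x axis limits come from the range of the mapped column
range(toy$lexp)         # 50 70
```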

This is looking better! We now have an axis. Notice how the limits of the x axis are set to the minimum and maximum values of life expectancy! The next component in the grammar of graphics is the geom. Let’s make a density plot just like before and see if it looks better.

ggplot(data = socio, aes(lexp)) +
  geom_density()

In my opinion, this already looks better than the basic plot! Now let’s make it look even better by changing the:

Line color to a pastel red (googled the hex code which is #FF6961).
Labels of the x and y axis.
Title

ggplot(socio, aes(lexp)) + 
  geom_density(color = "#FF6961") +
  ggtitle("Density of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Density")

This looks much neater! Now let’s look at some other ways we can use the aesthetic layers, such as faceting by region, adding a fill color and setting an alpha level.

ggplot(socio, aes(lexp)) + 
  geom_density(fill = "#FFB347", color = "#FF6961", alpha = 0.5) +
  ggtitle("Density of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Density") +
  facet_grid(ID~.)

Neat! Each region has its own density plot! Now we can directly compare the distributions of life expectancy across regions.
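The formula given to facet_grid() reads as rows ~ columns, with . meaning "nothing on this side". So ID ~ . stacks one panel per region vertically, which lines the x axes up for easy comparison, while . ~ ID would place the panels side by side. A sketch on an invented data frame:

```r
library(ggplot2)

toy <- data.frame(
  lexp = c(50, 52, 55, 60, 62, 65, 70, 72, 75),
  ID   = factor(rep(c("A", "B", "C"), each = 3))
)

base <- ggplot(toy, aes(lexp)) +
  geom_density(fill = "#FFB347", color = "#FF6961", alpha = 0.5)

p_rows <- base + facet_grid(ID ~ .)  # one row per region
p_cols <- base + facet_grid(. ~ ID)  # one column per region

# Each region ends up in its own panel
panels <- unique(layer_data(p_rows)$PANEL)
```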

ggplot(socio, aes(lexp)) + 
  geom_density(fill = "#FFB347", color = "#FF6961", alpha = 0.5) +
  ggtitle("Density of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Density") +
  facet_grid(ID~.) +
  theme_minimal()

Great! The only change I want to make to the layout is to see what it would look like if each of the density plots, and so each region, had its own color. Note we also don’t need a legend for this, so let’s remove it using theme(legend.position = "none"). We also center the title by adding plot.title = element_text(hjust = 0.5) to the theme function.

ggplot(socio, aes(lexp)) + 
  geom_density(aes(fill = ID, color = ID), alpha = 0.5) +
  ggtitle("Density of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Density") +
  facet_grid(ID~.) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

Hang on, what would it look like if we had the density plots in a different order? Say from lowest median life expectancy to highest, would that look better? Let’s see. Don’t worry about this code if you don’t understand it. It makes a new data frame called cdata and finds the median lexp value for each ID. We then reorder the factor levels so that they go in order of increasing median.

library(plyr)
cdata <- ddply(socio, c("ID"), summarise, median = median(lexp))

socio$ID = factor(socio$ID, levels = levels(socio$ID)[order(cdata$median)])

ggplot(socio, aes(lexp)) + 
  geom_density(aes(fill = ID, color = ID), alpha = 0.5) +
  ggtitle("Density of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Density") +
  facet_grid(ID~.) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

plotly

install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")

ggplotly()


Maps

Base R plots

These base plots are normally quick and easy to produce, but they look terrible! In the case of mapping it isn’t worth the hassle of learning the base plotting functions. Instead, let’s look at some other options we have to display our graphs. We will go through the following packages:

ggplot2
plotly
ggmap

There are many other data visualisation tools, but for the kind of data we have in this tutorial these will suffice.

ggplot2

The data, which must be an R data frame, and can be changed after the plot is created.
A set of aesthetic mappings, which describe how variables in the data are mapped to aesthetic properties of the layer.
The geom, which describes the geometric object used to draw the layer. The geom defines the set of available aesthetic properties.
(Optional) The stat, which takes the raw data and transforms it in some useful way. The stat returns a data frame with new variables that can also be mapped to aesthetics with a special syntax.
(Optional) The position adjustment, which adjusts elements to avoid overplotting.

Before we get overly complicated by what this means, let’s look at an example to illustrate how we can build up a pretty looking plot.

To install this package and load it into our R script simply run the following:

install.packages("ggplot2")
library(ggplot2)

First let’s look at the first component that ggplot has told us to include, the data. As it says above, the data must be an R data frame. Luckily we already have a data frame called socio. We can also get the shapes of the countries we have in our data frame (and the ones we don’t have). To do this we just need to install a few packages, so just run the following:

install.packages(c("devtools", "dplyr", "stringr", "maps", "mapdata"))
library(devtools)
library(dplyr)
library(stringr)
library(maps)
library(mapdata)

Now we should have everything ready to make a plot of the world! Let’s add the data to make up the first component of the ggplot method. Remember we must have an R data frame. So let’s merge two data frames together: the countries from the socio data frame and the country shapes from the package we just loaded. Let’s also create another data frame of the countries we don’t have information for so we can set their color to grey.

# Country outlines for the whole world; column 5 is "region", renamed to match socio
w2hr <- map_data("world")
names(w2hr)[5] = "country"

# Work with plain text so that new names can be assigned
# (assigning an unknown name to a factor column would give NA)
socio$country = as.character(socio$country)

# Rename countries in socio to the spellings used by the map data
socio[socio$country == "United States",]$country = "USA"
socio[socio$country == "United Kingdom",]$country = "UK"
socio[socio$country == "Korea Republic of",]$country = "South Korea"
socio[socio$country == "Congo",]$country = "Republic of Congo"
socio[socio$country == "Iran Islamic Republic of",]$country = "Iran"
socio[socio$country == "Syrian Arab Republic",]$country = "Syria"
socio[socio$country == "Egypt Arab Republic of",]$country = "Egypt"
socio[socio$country == "Central African Rep.",]$country = "Central African Republic"
socio[socio$country == "Cote d'Ivoire",]$country = "Ivory Coast"

w2hr[is.na(w2hr$subregion),]$subregion = "Unknown"
w2hr[w2hr$subregion == "Hong Kong",]$country = "Hong Kong"


w2hr[w2hr$country == "Tobago" | w2hr$country == "Trinidad",]$country = "Trinidad and Tobago"

full.df = inner_join(w2hr, socio, by = "country")
empty.df = w2hr[is.na(match(w2hr$country, as.character(socio$country))),]
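Country names that do not match exactly are silently dropped by inner_join(), so it is worth checking for mismatches before and after renaming. The sketch below uses two tiny made-up data frames (map_toy and socio_toy are illustrative stand-ins for w2hr and socio, and the numbers in them are invented). setdiff() lists names present in one table but not the other, and dplyr’s anti_join() is an alternative way to build the "no data" table.

```r
library(dplyr)

map_toy <- data.frame(
  country = c("USA", "USA", "UK", "Aruba"),
  long    = c(-100, -99, -1, -70),
  lat     = c(40, 41, 52, 12),
  stringsAsFactors = FALSE
)
socio_toy <- data.frame(
  country = c("United States", "UK"),
  lexp    = c(1, 2),  # placeholder values, not real data
  stringsAsFactors = FALSE
)

# Names in the socio data that the map does not know about
setdiff(socio_toy$country, map_toy$country)
# "United States"

# After fixing the name, nothing is left unmatched
socio_toy$country[socio_toy$country == "United States"] <- "USA"
unmatched <- setdiff(socio_toy$country, map_toy$country)

full_toy  <- inner_join(map_toy, socio_toy, by = "country")  # rows with data
empty_toy <- anti_join(map_toy, socio_toy, by = "country")   # rows without
```

Running setdiff() on the real tables is a quick way to find any country whose spelling still needs fixing.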

Let’s take a quick peek at the top of the two data frames, empty.df and full.df.

head(empty.df)

long	lat	group	order	country	subregion
-69.89912	12.45200	1	1	Aruba	Unknown
-69.89571	12.42300	1	2	Aruba	Unknown
-69.94219	12.43853	1	3	Aruba	Unknown
-70.00415	12.50049	1	4	Aruba	Unknown
-70.06612	12.54697	1	5	Aruba	Unknown
-70.05088	12.59707	1	6	Aruba	Unknown

head(full.df)

long	lat	group	order	country	subregion	ID	gnp	lexp	lexpf	lexpm	adulit	hdi	fertr	birthr	pop	popgrwth	childmf	childmm	infmor	urbanpop	energcpc	pppgnp
20.61133	60.04068	7	884	Finland	Aland Islands	OECD	24110	76	79	73	99	0.954	1.8	13	4986000	0.5	7	9	6	60	5707	16446
20.60342	60.01694	7	885	Finland	Aland Islands	OECD	24110	76	79	73	99	0.954	1.8	13	4986000	0.5	7	9	6	60	5707	16446
20.52178	60.01167	7	886	Finland	Aland Islands	OECD	24110	76	79	73	99	0.954	1.8	13	4986000	0.5	7	9	6	60	5707	16446
20.48750	60.03276	7	887	Finland	Aland Islands	OECD	24110	76	79	73	99	0.954	1.8	13	4986000	0.5	7	9	6	60	5707	16446
20.41123	60.03013	7	888	Finland	Aland Islands	OECD	24110	76	79	73	99	0.954	1.8	13	4986000	0.5	7	9	6	60	5707	16446
20.39795	60.04068	7	889	Finland	Aland Islands	OECD	24110	76	79	73	99	0.954	1.8	13	4986000	0.5	7	9	6	60	5707	16446

Great, we have all the data prepared in two data frames. Let’s build a map! We first have to add the data we want to show. In this case we will use full.df since it will be used most, then when we want to use empty.df we just have to override the data.

ggplot(data = full.df)

ggplot(data = full.df, aes(x=long, y = lat, group = group))

This is looking good! We now have axes with the longitude and latitude. Notice how the limits of the x and y axes are set to the minimum and maximum values of long and lat! The next component in the grammar of graphics is the geom. Let’s make a polygon plot and see what it can do.

ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
    geom_polygon()

Nice! As we can see, we have plotted all the countries we have data for. Let’s color these blue, then add the ones we don’t have and color those grey.

ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
    geom_polygon(fill = "blue") +
    geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey")
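Overriding the data works because each layer uses the data frame given to ggplot() unless it is handed its own data argument. Here is a sketch with two invented squares (have and no_data are hypothetical stand-ins for full.df and empty.df). Since the aesthetic mappings are inherited as well, the second layer does not strictly need to repeat aes().

```r
library(ggplot2)

have    <- data.frame(long = c(0, 1, 1, 0), lat = c(0, 0, 1, 1), group = 1)
no_data <- data.frame(long = c(2, 3, 3, 2), lat = c(0, 0, 1, 1), group = 2)

p <- ggplot(have, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "blue") +                 # uses `have`
  geom_polygon(data = no_data, fill = "grey")   # uses `no_data`, same aes()

# Check which data each layer actually drew
layer1 <- layer_data(p, 1)
layer2 <- layer_data(p, 2)
```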

This looks neat! Now let’s look at some other ways we can use the aesthetic layers, such as filling the countries by life expectancy. Let’s also describe the plot by adding a title using ggtitle().

ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
    geom_polygon(aes(fill = lexp)) +
    geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey") +
    ggtitle("Life Expectancy across the world")

Awesome! Each country now has its own color for life expectancy! Now we can directly compare life expectancy across countries.

Let’s put the final touches on our plot. I think it would look better with a different background; the grey isn’t adding much in my opinion. We normally do these ‘cherry on the cake’ parts when we know we want to publish our plot. The way we do this is to use the command theme(). Let’s remove all the axis titles, tick marks, grid lines etc. and also the background.

ggplot(data = full.df, aes(x=long, y = lat, group = group)) +
  geom_polygon(aes(fill = lexp)) +
  geom_polygon(data = empty.df, aes(x=long, y = lat, group = group), fill = "grey") +
  ggtitle("Life Expectancy across the world") +
  theme_minimal() + theme(
    axis.text = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )
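The same theme() block is repeated in each of the map plots that follow. One way to avoid the repetition, sketched here with a hypothetical name map_theme and an invented square in place of the map data, is to store the theme in a variable and add it to a plot like any other component.

```r
library(ggplot2)

# Hypothetical helper: a blank canvas with a centered title
map_theme <- theme_minimal() +
  theme(
    axis.text    = element_blank(),
    axis.line    = element_blank(),
    axis.ticks   = element_blank(),
    panel.border = element_blank(),
    panel.grid   = element_blank(),
    axis.title   = element_blank(),
    plot.title   = element_text(hjust = 0.5)
  )

square <- data.frame(long = c(0, 1, 1, 0), lat = c(0, 0, 1, 1), group = 1)

p <- ggplot(square, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey") +
  ggtitle("Life Expectancy across the world") +
  map_theme
```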

Great! The only change I want to make to the layout is to see what it would look like if we had visible borders between the countries. We can do this using both color (the border color) and size (the border thickness).

ggplot(data = full.df, aes(x=long, y = lat, group = group)) + 
  geom_polygon(aes(fill = lexp, text = paste("Country:", country)),
               color = "white", size = 0.1) +
  geom_polygon(data = empty.df, aes(x=long, y = lat, group = group),
               fill = "grey", color = "white", size = 0.1) +
  ggtitle("Life Expectancy across the world") +
  theme_minimal() + theme(
    axis.text = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

The final touches will be to change the legend title/color gradient and to add a text aesthetic ready for using the plotly package.

ggplot(data = full.df, aes(x=long, y = lat, group = group)) + 
  geom_polygon(aes(fill = lexp, text = paste("Country:", country, "<br>", "Region", ID)),
               color = "white", size = 0.1) +
  geom_polygon(data = empty.df,
               aes(x=long, y = lat, group = group, text = paste("Country:", country)),
               fill = "grey", color = "white", size = 0.1) +
  ggtitle("Life Expectancy across the world") +
  scale_fill_distiller(palette = "Spectral", name = "Life Expectancy", direction = 1) +
  theme_minimal() + theme(
    axis.text = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )
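scale_fill_distiller() stretches a ColorBrewer palette over a continuous variable, and direction decides which end of the palette goes to the low values. The sketch below uses three invented tiles rather than the map data and compares direction = 1 with direction = -1, which is the documented default for this scale.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = 1, lexp = c(50, 65, 80))

base <- ggplot(toy, aes(x, y, fill = lexp)) + geom_tile()

p_fwd <- base + scale_fill_distiller(palette = "Spectral", direction = 1)
p_rev <- base + scale_fill_distiller(palette = "Spectral", direction = -1)

# Hex colors assigned to lexp = 50, 65 and 80 under each direction
fill_fwd <- layer_data(p_fwd)$fill
fill_rev <- layer_data(p_rev)$fill
```

Printing fill_fwd and fill_rev side by side shows that the colors at the low and high ends differ between the two directions, so it is a matter of taste which end of the palette should stand for a long life expectancy.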

plotly

install.packages("plotly")
library(plotly)
Sys.setenv("plotly_username"="USERNAME")
Sys.setenv("plotly_api_key"="API_KEY")

ggplotly()
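The text aesthetic added to the map is not drawn by ggplot2 itself (it warns about an unknown aesthetic). It is there so that ggplotly() can show it when hovering over a country. Here is a sketch using ggplotly()’s tooltip argument on two invented squares, under the assumption that only the custom text should appear in the tooltip; all names and values are made up.

```r
library(ggplot2)
library(plotly)

squares <- data.frame(
  long    = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat     = c(0, 0, 1, 1, 0, 0, 1, 1),
  group   = rep(c(1, 2), each = 4),
  country = rep(c("Toyland", "Examplia"), each = 4),
  lexp    = rep(c(60, 70), each = 4)
)

# ggplot2 warns that `text` is an unknown aesthetic; plotly picks it up
p <- suppressWarnings(
  ggplot(squares, aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = lexp, text = paste("Country:", country)))
)

# Show only the custom text when hovering
interactive_map <- ggplotly(p, tooltip = "text")
```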


Introduction to R - Basics

Andy Challis | Consultant Data Scientist

Last updated on the 28 January, 2017
