MCD Workshop 3

Ibrahim Inal

Data presentation

We have already seen some data and graphical tools to present it. The standard format for data is:

  • each row corresponds to a single observation
  • each column corresponds to a single variable of interest

We will continue to learn more about data presentation.

Numerical summaries

A big component of data presentation is to extract useful information.

The following is a list of some useful things to keep in mind:

  • Column metadata: str, summary

  • Typical entries: head, tail

  • Table structure: names, dim, nrow, ncol

  • Summary statistics: mean, median, sd, var
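As a quick illustration of the functions listed above, here is a minimal sketch that applies them to the built-in mtcars data. The choice of mtcars and of the mpg column is only an assumption for the example; any data frame works the same way.

```r
# Built-in example data; any data frame would work the same way.
data(mtcars)

# Column metadata
str(mtcars)        # column types and a preview of the values
summary(mtcars)    # per-column minimum, quartiles, mean and maximum

# Typical entries
head(mtcars, 3)    # first three rows
tail(mtcars, 3)    # last three rows

# Table structure
names(mtcars)      # column names
dim(mtcars)        # number of rows, then number of columns
nrow(mtcars)
ncol(mtcars)

# Summary statistics for a single column
mean(mtcars$mpg)
median(mtcars$mpg)
sd(mtcars$mpg)
var(mtcars$mpg)    # the variance is the squared standard deviation
```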

Graphical summaries

Recall the ingredients of a plot:

  • Data: R works with the so-called long format, i.e., each row corresponds to an observation
  • Specification of column representation, e.g., coordinate, color, line style… ggplot() takes these through the aes() function
  • Type of plot (scatterplot, line chart, histogram…), chosen with the geom_ prefix, e.g., geom_bar()
  • Coordinate system (Cartesian, flipped Cartesian, world map…), chosen with the coord_ prefix; recall coord_flip()
  • Additional customization options

Examples of aes usage

library(ggplot2)
# cyl is numeric, so color is mapped to a continuous gradient
ggplot(data = mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 2) +
  ggtitle("Miles per gallon vs vehicle weight")

# as.factor(cyl) is categorical, so each cyl value gets its own color
ggplot(data = mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point(size = 2) +
  ggtitle("Miles per gallon vs vehicle weight")

Aesthetics examples

Layers

We can have more than one layer in a graphic.

Each layer = 1 data set + 1 geometric object + aesthetic mappings

Layer syntax

data(mtcars)
library(ggplot2)
ggplot() +
  geom_boxplot(data = mtcars, mapping = aes(x=as.factor(cyl), y = mpg)) +
  geom_point(data = mtcars, mapping = aes(x = as.factor(cyl), y = mpg), 
             position = "jitter")

  • When layers share attributes, we only have to type them once:

ggplot(data = mtcars, mapping = aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot() +
  geom_point(position = "jitter")

ggplot() code

You have seen several ways to structure a ggplot() call. Generally speaking:

  • We can drop data= if the data is the first argument we pass to the ggplot() function

  • We can drop mapping= if

    • it is the second argument we pass to the ggplot() function
    • it is the first argument we pass to a geom_ function
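The rules above can be sketched with three calls that should produce the same scatterplot. The object names p_named, p_short and p_geom are hypothetical and only used for this illustration.

```r
library(ggplot2)

# Fully named arguments
p_named <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# data= and mapping= dropped: they are arguments 1 and 2 of ggplot();
# inside aes(), the first two unnamed arguments are x and y
p_short <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# mapping= dropped inside the geom: it is argument 1 of geom_point()
p_geom <- ggplot(mtcars) +
  geom_point(aes(wt, mpg))
```

Dropping the names saves typing, but keeping them can make longer calls easier to read, so this is largely a matter of taste.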

ggplot() code

Also, note that ggplot() code can be simplified in various ways thanks to its flexible structure. One way is to store a partial plot in an object with <- (the assignment operator) and then add layers to that object with the + operator from the ggplot2 syntax.

base_gr <- ggplot(data = mtcars, mapping = aes(y = mpg, x = wt))
scatter_gr <- base_gr + geom_point(aes(col = as.factor(cyl) ), size = 3)

Scales

aes() tells ggplot which column represents which aesthetic, e.g., x = mpg or color = cyl. However, the details of a mapping, e.g., which color is used for which value of cyl, or the range of the x axis, cannot be set with aes(). Scales are used for this purpose. Note, however, that the default values are usually pretty good.

Scales examples

scatter_gr + scale_color_manual(values=c("gold2","darkorange","firebrick"))

Scales examples

scatter_gr + scale_x_continuous(limits = c(0, 6)) +
    scale_y_continuous(limits = c(0, 35))

Facets

Facets allow plotting subsets of the same data on separate panels. This can give clearer graphs, especially when the data is very cluttered.

Facets can be arranged in rows and/or columns.

scatter_gr + facet_wrap(~cyl)
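To illustrate the row/column arrangement, here is a minimal sketch that contrasts facet_wrap() with facet_grid(). It rebuilds scatter_gr so that it runs on its own; the choice of am for the rows is an assumption made only for this example.

```r
library(ggplot2)

# Rebuild the scatterplot from the earlier slide
scatter_gr <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)), size = 3)

# facet_wrap(): one panel per value of cyl, wrapped into a grid
p_wrap <- scatter_gr + facet_wrap(~cyl)

# facet_grid(): am defines the rows and cyl defines the columns
p_grid <- scatter_gr + facet_grid(rows = vars(am), cols = vars(cyl))

p_wrap
p_grid
```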

Themes

We have already seen themes. A theme generally controls all non-data aspects of a plot.
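As a minimal sketch of how themes are applied, the code below adds different built-in themes to the same plot. The object names and the plot title are hypothetical and only used for this illustration.

```r
library(ggplot2)

base_gr <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  ggtitle("Miles per gallon vs vehicle weight")

p_default <- base_gr                    # theme_gray() is used when no theme is given
p_minimal <- base_gr + theme_minimal()
p_classic <- base_gr + theme_classic()

# Individual non-data elements can be adjusted with theme()
p_custom <- base_gr +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

p_custom
```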

ggplot2’s default theme

Minimal theme

More built-in themes

Classic theme

Positions

We have also seen positions earlier. Usually the choice of position is a matter of preference, but sometimes it makes a real difference.

Only 9 data points??

Much better
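A minimal sketch of a case where the position matters, assuming the mtcars data and the cyl and gear columns (this is not necessarily the data behind the slides above). Both columns take only a few distinct values, so many observations are drawn on exactly the same spot unless they are jittered.

```r
library(ggplot2)

# Count the distinct (cyl, gear) combinations: this is the number of
# visible points when nothing is jittered
n_spots <- nrow(unique(mtcars[, c("cyl", "gear")]))
n_spots < nrow(mtcars)   # TRUE: fewer visible points than observations

# Overlapping points
p_overlap <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_point()

# Jittered points: a small random shift makes every observation visible
p_jitter <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2))

p_overlap
p_jitter
```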

Annotation

In some cases, you may want to annotate text in your graphs.

# Annotate text by giving the coordinates with some options
ggplot(mtcars) +
  geom_point(aes(wt, mpg)) +
  annotate("text", x = 2, y = 30, label = "My Text",
           angle = 30, color = "orange3", fontface = "italic", size = 5)
# "text" can be replaced with "rect", "segment", "pointrange" as well.

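Alternatively, geom_text() with fixed values gives a similar picture. Note that it draws the label once for every row of the data, so annotate() is usually the cleaner choice for a single label.

ggplot(mtcars) +
  geom_point(aes(wt, mpg)) +
  geom_text(label = "My Text", x = 2, y = 30,
            angle = 30, color = "orange3", fontface = "italic", size = 5)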
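Besides text, annotate() can also draw shapes. The sketch below uses "rect" and "segment"; all coordinates and colors are arbitrary values chosen only for this illustration.

```r
library(ggplot2)

p_annot <- ggplot(mtcars) +
  geom_point(aes(wt, mpg)) +
  # Shaded rectangle: needs the four corner coordinates
  annotate("rect", xmin = 1.5, xmax = 2.5, ymin = 25, ymax = 35,
           alpha = 0.2, fill = "orange3") +
  # Line segment: needs a start point (x, y) and an end point (xend, yend)
  annotate("segment", x = 4, xend = 5, y = 30, yend = 20,
           color = "firebrick")

p_annot
```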

Smoothing

Smoothing helps reveal the pattern in the data, which is especially useful when there is overplotting. Note that I passed aes() to ggplot() rather than to geom_point(); otherwise, I would have had to pass the same aes() to both geom_point() and stat_smooth(). geom_smooth() is an alternative to stat_smooth().

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  stat_smooth()
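As a variation on the code above, here is a minimal sketch that uses geom_smooth() with a straight-line fit instead of the default smoother. The choice of method = "lm" and se = FALSE is an assumption made only to show the available options.

```r
library(ggplot2)

p_lm <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Linear regression line of mpg on wt, without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```

Because aes() is given in ggplot(), both layers inherit the same x and y mapping.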

Further source

We have only covered the basics. The R Graph Gallery contains many code snippets and beautiful graphs. Feel free to use it for inspiration.

Practice

NHS Data

Let us apply some of what we have learned to different data sets. We will start with NHS data on A&E.

Step 1: Load the data into R.

Step 2: Have a look at the data.

Step 3: Plot the relation between dcubicles and dwait.

Step 4: Try to visualize this relation in a single graph that differentiates by staff_increase.

Step 5: Repeat step 4, but with two graphs. (Hint: recall facets.)

Step 6: How are dwait values distributed?

Step 7: Draw a bar plot of the number of attendances by site.

Step 8: Draw a boxplot of the relation between dwait and staff_increase.

Tibbles

ggplot2 is part of the tidyverse, a collection of packages that uses tibbles. The tibble documentation describes a tibble as a modern reimagining of the data frame. I personally use tibbles because they work well with the tidyverse packages, including ggplot2. People generally get tibbles simply by loading the tidyverse, but there is also a separate package called tibble. Some properties of tibbles:

  • A tibble can be created with the tibble() function.

  • Unlike data frames, tibbles do not show the entire data set when printed; by default only the first 10 rows and the columns that fit on screen are shown.

  • Tibbles do not match partial column names with $, whereas data frames do. A tibble column is returned only when you provide its full name.

  • When you select a single column of a tibble with single brackets, the result is still a tibble. The same operation on a data frame returns a vector.

  • Subsetting with [[ ]] and $ otherwise works the same for tibbles and data frames: both return the column as a vector.

  • read.csv() returns a data frame, while read_csv() from the readr package (part of the tidyverse) returns a tibble.
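The differences listed above can be checked on a small example. This is a minimal sketch; the objects df and tb and their columns are hypothetical and only used for illustration.

```r
library(tibble)

# The same two columns as a data frame and as a tibble
df <- data.frame(weight = c(2.6, 3.2), mpg = c(21, 19))
tb <- tibble(weight = c(2.6, 3.2), mpg = c(21, 19))

# Partial column names
df$wei                # matched to the weight column
tb$wei                # NULL, with a warning about the unknown column

# Selecting one column with single brackets
class(df[, "mpg"])    # "numeric": the data frame is dropped to a vector
is_tibble(tb[, "mpg"])  # TRUE: the result is still a tibble

# [[ ]] and $ with the full name return a plain vector in both cases
df[["mpg"]]
tb[["mpg"]]
tb$mpg
```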

WB Data Revisited

We do the same set-up, but this time, we use the read_csv function instead of read.csv.

library(tidyverse)

wb_data_tib <- read_csv("mydata/wb_data_tidy.csv")

head(wb_data_tib)
# A tibble: 6 × 11
  cty_name    cty_code  year elecAccess gdpPerCap compEduc educPri educTer
  <chr>       <chr>    <dbl>      <dbl>     <dbl>    <dbl>   <dbl>   <dbl>
1 Afghanistan AFG       2009       48.3     1575.        9      NA      NA
2 Afghanistan AFG       2010       42.7     1771.        9      NA      NA
3 Afghanistan AFG       2011       43.2     1750.        9      NA      NA
4 Afghanistan AFG       2012       69.1     1958.        9      NA      NA
5 Afghanistan AFG       2013       68.0     2062.        9      NA      NA
6 Afghanistan AFG       2014       89.5     2111.        9      NA      NA
# ℹ 3 more variables: govEducExp <dbl>, popYoung <dbl>, pop <dbl>

Previously we have seen

Box Plot

Scatter Plot

Try to combine these two now.

Can you guess how to get this graph?

Can you write code to get this graph?

Can you write code to get this graph?

For the rest of the session, we will concentrate on 2009, 2019 and 2020. We must create a vector with values TRUE at those indices where the corresponding row has one of these years, and FALSE everywhere else. We accomplish this using the %in% operator, which checks if a value on its left is contained in the vector on its right.

wb_data_tib$year %in% c(2009, 2019, 2020)
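To see how %in% behaves before applying it to the World Bank data, here is a minimal sketch on a toy vector and a toy data frame. The objects years, keep and toy are hypothetical and only used for illustration.

```r
# A small vector of years
years <- c(2008, 2009, 2015, 2019, 2020)

# TRUE where the year is one of 2009, 2019 or 2020
keep <- years %in% c(2009, 2019, 2020)
keep           # FALSE  TRUE FALSE  TRUE  TRUE
years[keep]    # 2009 2019 2020

# A logical vector can be used as a row index: only TRUE rows are kept
toy <- data.frame(year = years, value = 1:5)
toy[toy$year %in% c(2009, 2019, 2020), ]
```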

Now try to create your new data set with the expression above: use the resulting logical vector as a row index to subset wb_data_tib.

With your new data, try to produce the following graphs: