library(ggplot2)
ggplot(data = mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 2) +
  ggtitle("Miles per gallon vs vehicle weight")
We have already seen some data and graphical tools to present it. The standard format for data
We will continue to learn more about data presentation.
A big component of data presentation is extracting useful information.
The following functions are useful to keep in mind:
Column metadata: str, summary
Typical entries: head, tail
Table structure: names, dim, nrow, ncol
Summary statistics: mean, median, sd, var
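As a quick illustration, the functions listed above can be tried on the built-in mtcars data. This is only a sketch; the variable names `shape` and `mpg_stats` are made up for this example.

```r
# Column metadata
str(mtcars)
summary(mtcars)

# Typical entries
head(mtcars, 3)
tail(mtcars, 3)

# Table structure
names(mtcars)
shape <- dim(mtcars)   # number of rows, then number of columns
nrow(mtcars)
ncol(mtcars)

# Summary statistics for a single column
mpg_stats <- c(
  mean   = mean(mtcars$mpg),
  median = median(mtcars$mpg),
  sd     = sd(mtcars$mpg),
  var    = var(mtcars$mpg)
)
mpg_stats
```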
Recall the ingredients of a plot:
ggplot() takes these with the aes() function
geom_ prefix, e.g., geom_bar()
coord_ prefix: recall coord_flip()
aes usage
We can have more than one layer in a graphic.
Each layer = 1 data set + 1 geometric object + aesthetic mappings
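A minimal sketch of a two-layer graphic, assuming the built-in mtcars data. The second layer uses its own data set (a subset chosen only for illustration) and its own geometric object, while inheriting the aesthetic mappings from ggplot().

```r
library(ggplot2)

# Hypothetical subset used only to show a layer with its own data
heavy_cars <- subset(mtcars, wt > 5)

p_layers <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                           # layer 1: all cars
  geom_point(data = heavy_cars, color = "red", size = 4)   # layer 2: heavy cars only

p_layers
```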
ggplot() code
You have seen various ways of structuring ggplot() code. Generally speaking:
We could drop data= if the data is the first argument we pass on to the ggplot() function
We could drop mapping= if
ggplot() function
geom_prefix() function
ggplot() code
Also, note that ggplot() code can be simplified in various ways thanks to its flexible structure. One way to simplify it is to combine <- (the assignment operator) with the + operator from ggplot2 syntax.
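The sketch below shows both ideas on mtcars: dropping the argument names (ggplot() takes data first and mapping second), and building a plot step by step with <- and +. The object names are made up for this example.

```r
library(ggplot2)

# Fully named arguments
p_named <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Same plot: data comes first and the mapping second, so both names can be dropped
p_short <- ggplot(mtcars, aes(x = wt, y = mpg))

# Store the base plot with <- and extend it with +
p <- p_short + geom_point(size = 2)
p <- p + ggtitle("Miles per gallon vs vehicle weight")
p
```

Storing the base plot in an object makes it easy to try several geoms or titles without retyping the ggplot() call.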
aes() is useful for many things, e.g., it tells ggplot which column represents which aesthetic, such as x = mpg or color = cyl. However, the details of a mapping, e.g., which color goes with which cyl value or the range of x, cannot be set with aes(). Scales are used for this purpose. Note, however, that the default values are pretty good.
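A small sketch of scales in action, assuming mtcars. The limits and colors are arbitrary choices for illustration, and cyl is converted to a factor so that a manual (discrete) color scale applies.

```r
library(ggplot2)

p_scales <- ggplot(mtcars, aes(x = mpg, y = wt, color = factor(cyl))) +
  geom_point() +
  # Range of the x axis
  scale_x_continuous(limits = c(10, 35)) +
  # Which color goes with which cyl value
  scale_color_manual(
    name   = "cyl",
    values = c("4" = "steelblue", "6" = "orange", "8" = "firebrick")
  )

p_scales
```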
Facets allow plotting subsets of the same data on separate panels. This can give clearer graphs, especially when the data is very cluttered.
Facets can be done using rows/columns
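A minimal sketch of both styles of faceting, assuming mtcars: facet_wrap() splits by one variable, while facet_grid() lays panels out by rows and columns.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One panel per value of cyl
p_wrap <- base_plot + facet_wrap(~ cyl)

# Rows by am (transmission), columns by cyl
p_grid <- base_plot + facet_grid(rows = vars(am), cols = vars(cyl))

p_wrap
p_grid
```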
We have already seen themes. A theme generally controls all non-data aspects of a plot.
ggplot2’s default theme
Minimal theme
Classic theme
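The themes named above can be compared on a single plot. This sketch assumes mtcars; the original figures may have used different data.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

p_default <- base_plot + theme_gray()      # ggplot2's default theme
p_minimal <- base_plot + theme_minimal()   # minimal theme
p_classic <- base_plot + theme_classic()   # classic theme

p_default
p_minimal
p_classic
```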
We have also seen position adjustments earlier. Their usage is usually a matter of preference, but sometimes it makes a real difference.
Only 9 data points??
Much better
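The contrast between the two captions above can be reproduced roughly as follows. The data and code behind the original figures are not shown here, so this sketch assumes mtcars with two variables that take only a few distinct values; the jitter widths and the seed are arbitrary.

```r
library(ggplot2)

# Many rows share the same (cyl, gear) pair, so the points overlap
n_distinct_points <- nrow(unique(mtcars[c("cyl", "gear")]))
n_distinct_points   # far fewer visible points than rows

p_overlap <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_point()

# Jittering adds a little random noise so every row becomes visible
p_jitter <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2, seed = 1))

p_overlap
p_jitter
```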
In some cases, you may want to annotate text in your graphs.
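One way to do this, sketched on mtcars, is annotate(), which places a single piece of text at chosen coordinates. The position and the label text are illustrative choices.

```r
library(ggplot2)

p_annotate <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # A single label placed at fixed x/y coordinates
  annotate("text", x = 4.5, y = 30, label = "Heavier cars use more fuel")

p_annotate
```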
Alternatively,
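one option, assuming the labels come from the data itself, is geom_text(), which draws one label per row. In this sketch the `model` column and the mpg > 30 cut-off are made up for illustration.

```r
library(ggplot2)

cars <- mtcars
cars$model <- rownames(cars)          # hypothetical label column
labelled <- subset(cars, mpg > 30)    # label only the most economical cars

p_text <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text(data = labelled, aes(label = model), vjust = -0.8, size = 3)

p_text
```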
Smoothing helps to see the pattern. This is generally useful in the presence of overplotting. Note that I passed the aes() into ggplot() instead of geom_point(). Otherwise, I would have needed to pass the aes() to both geom_point() and stat_smooth(). An alternative to stat_smooth() is geom_smooth().
We have only covered the basics. The R Graph Gallery contains many code snippets and beautiful graphs. Feel free to get some inspiration.
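A minimal sketch of the pattern described above, assuming mtcars. The aes() lives in ggplot(), so both layers inherit it. The smoothing methods are illustrative choices and are spelled out to avoid the message ggplot2 prints when it picks a method by itself.

```r
library(ggplot2)

# aes() in ggplot(): shared by geom_point() and stat_smooth()
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "loess", formula = y ~ x)

# The geom_smooth() alternative, here with a straight-line fit
p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_smooth
p_lm
```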
So let us use some of what we learned by using different data sets. We will start with NHS Data on A&E.
Step 1: Upload the data in R.
Step 2: Have a look at the data
Step 3: Plot the relation between dcubicles and dwait.
Step 4: Try to visualize in a single graph by differentiating staff_increase.
Step 5: Try step 4 but two graphs. (Hint: Recall Facet)
Step 6: How are dwait values distributed?
Step 7: Bar plot the number of attendances by site.
Step 8: Boxplot dwait and staff_increase relation.
ggplot2 is part of the tidyverse, which uses tibbles. Officially, a tibble is called a modern reimagining of the data frame. I personally use tibbles because they work well with nice packages such as the tidyverse and ggplot2. People generally get tibbles just by loading the tidyverse package, but there is also a separate package called tibble. Some properties of tibbles:
A tibble can be created with the tibble() function.
Unlike data frames, tibbles don’t show the entire dataset when you print them.
Tibbles cannot access a column when you provide only part of its name, but data frames can. With a tibble, it works only when you provide the entire column name.
When you select only one column of a tibble with [ ], it keeps the tibble structure. But when you select one column of a data frame the same way, it becomes a vector.
Extracting a column with [[ ]] or $ works the same for tibbles and data frames (apart from the partial matching noted above): both return a vector.
The read.csv() function outputs data frames, while read_csv() from the readr package (part of the tidyverse) outputs tibbles.
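The differences listed above can be checked on a toy example. This is a sketch that assumes the tibble package is installed; the data and the column names are made up.

```r
library(tibble)

df <- data.frame(country = c("A", "B"), population = c(1, 2))
tb <- tibble(country = c("A", "B"), population = c(1, 2))

# Partial column names: data frames match them, tibbles do not
partial_df <- df$pop                     # matches "population"
partial_tb <- suppressWarnings(tb$pop)   # NULL (tibble also warns)

# Selecting a single column with [ ]
df_col <- df[, "population"]   # becomes a plain vector
tb_col <- tb[, "population"]   # stays a tibble

# [[ ]] with the full name gives a vector in both cases
df_vec <- df[["population"]]
tb_vec <- tb[["population"]]
```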
We do the same set-up, but this time, we use the read_csv function instead of read.csv.
# A tibble: 6 × 11
  cty_name    cty_code  year elecAccess gdpPerCap compEduc educPri educTer
  <chr>       <chr>    <dbl>      <dbl>     <dbl>    <dbl>   <dbl>   <dbl>
1 Afghanistan AFG       2009       48.3     1575.        9      NA      NA
2 Afghanistan AFG       2010       42.7     1771.        9      NA      NA
3 Afghanistan AFG       2011       43.2     1750.        9      NA      NA
4 Afghanistan AFG       2012       69.1     1958.        9      NA      NA
5 Afghanistan AFG       2013       68.0     2062.        9      NA      NA
6 Afghanistan AFG       2014       89.5     2111.        9      NA      NA
# ℹ 3 more variables: govEducExp <dbl>, popYoung <dbl>, pop <dbl>
Previously we have seen:
Box Plot
Scatter Plot
Try to combine these two now.
Can you guess how to get?
Can you write a code to get?
Can you write a code to get?
For the rest of the session, we will concentrate on 2009, 2019 and 2020. We must create a vector with values TRUE at those indices where the corresponding row has one of these years, and FALSE everywhere else. We accomplish this using the %in% operator, which checks if a value on its left is contained in the vector on its right.
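How %in% builds such a logical vector can be seen on a toy example. The data frame `toy` and its values are made up for illustration and are not the course data.

```r
# Hypothetical toy data, not the course data set
toy <- data.frame(
  year  = c(2009, 2010, 2019, 2020, 2021),
  value = c(10, 20, 30, 40, 50)
)

# TRUE where the year is one of the three we want, FALSE everywhere else
keep <- toy$year %in% c(2009, 2019, 2020)
keep
# [1]  TRUE FALSE  TRUE  TRUE FALSE

# Use the logical vector to index the rows
selected <- toy[keep, ]
selected
```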
Now try to create your data using the operator above. Consider subsetting/indexing with the logical vector it returns.
With your new data, try to produce the following graphs: