Errors: When the red text is a legitimate error, it will be prefaced with “Error in.” and try to explain what went wrong.
Warnings: When the red text is a warning, it will be prefaced with “Warning:” and try to explain why there’s a warning. Generally your code will still work, but with some caveats.
Messages: When the red text doesn’t start with either “Error” or “Warning”, it’s just a friendly message.
Option 1:
>instal.packages("ggplot2")
>instal.packages("dplyr")
>instal.packages("nycflights13")
>instal.packages("knitr")
Option 2:
Click o Packages, Instal, Name of Package. If multiple, separate with a comma.
This is done/loaded everytime you open an r session or when you want use a package;
>library(dplyr)
>library(knitr)
>library(nycflights13)
>library(ggplot2)
Otherwise you will get the following error: Error: could not find function
>flights
- (part of nycflights13 package)
View()
function built for use in RStudio.glimpse()
function, which is included in the dplyr
package.kable()
function, which is included in the knitr
package.$
operator to view a single variable in a data frame.The packages we need;
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(nycflights13)
library(fivethirtyeight)
Spreadsheet data is often saved in one of the following formats:
.csv
file. You can think of a .csv file as a bare-bones spreadsheet where:.xlsx
file. As opposed to a bare-bones .csv
files, .xlsx
Excel files contain a lot of meta-data, or put more simply, data about the data.First, let’s import a Comma Separated Values .csv file of data directly off the internet. The .csv file dem_score.csv accessible at https://moderndive.com/data/dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992. Let’s use the read_csv() function from the readr package to read it off the web, import it into R, and save it in a data frame called dem_score
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
dem_score
## # A tibble: 96 x 10
## country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania -9 -9 -9 -9 -9 -9 -9 -9 5
## 2 Argentina -9 -1 -1 -9 -9 -9 -8 8 7
## 3 Armenia -9 -7 -7 -7 -7 -7 -7 -7 7
## 4 Australia 10 10 10 10 10 10 10 10 10
## 5 Austria 10 10 10 10 10 10 10 10 10
## 6 Azerbaij~ -9 -7 -7 -7 -7 -7 -7 -7 1
## 7 Belarus -9 -7 -7 -7 -7 -7 -7 -7 7
## 8 Belgium 10 10 10 10 10 10 10 10 10
## 9 Bhutan -10 -10 -10 -10 -10 -10 -10 -10 -10
## 10 Bolivia -4 -3 -3 -4 -7 -7 8 9 9
## # ... with 86 more rows
dem_score
data frame, the minimum value of -10
corresponds to a highly autocratic nation whereas a value of 10
corresponds to a highly democratic nation.read_csv()
function included in the readr
package is different than the read.csv()
function that comes installed with R by default. The read_csv()
function is in our opinion easier to use since it can more easily read data off the web and generally imports data at a much faster speed.Note in the bottom right of the above image there exists a “Code Preview”: you can copy and paste this code to reload your data again later automatically instead of repeating the above manual point-and-click process.
Let’s consider the drinks data frame included in the fivethirtyeight data that is from a survey of the average number of servings of beer, spirits, and wine consumed for 193 countries. This data was originally reported on the data journalism website FiveThirtyEight.com in Mona Chalabi’s article “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”
drinks_smaller <- drinks %>%
filter(country %in% c("China", "Italy", "Saudi Arabia", "USA")) %>%
select(-total_litres_of_pure_alcohol) %>%
rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
## # A tibble: 4 x 4
## country beer spirit wine
## <chr> <int> <int> <int>
## 1 China 79 192 8
## 2 Italy 85 42 237
## 3 Saudi Arabia 0 5 0
## 4 USA 249 158 84
We use the gather()
function to convert wide/untidy data to tidy
data:
drinks_smaller_tidy <- drinks_smaller %>%
gather(key = type, value = servings, -country)
drinks_smaller_tidy
## # A tibble: 12 x 3
## country type servings
## <chr> <chr> <int>
## 1 China beer 79
## 2 Italy beer 85
## 3 Saudi Arabia beer 0
## 4 USA beer 249
## 5 China spirit 192
## 6 Italy spirit 42
## 7 Saudi Arabia spirit 5
## 8 USA spirit 158
## 9 China wine 8
## 10 Italy wine 237
## 11 Saudi Arabia wine 0
## 12 USA wine 84
We set the arguments to gather() as follows:
key
is the name of the column/variable in the new “tidy” frame that contains the column names of the original data frame that you want to tidy. Observe how we set key = type
and in the resulting drinks_smaller_tidy
the column type contains the three types of alcohol
, beer
, spirit
, and wine
.value
is the name of the column/variable in the “tidy” frame that contains the rows and columns of values in the original data frame you want to tidy. Observe how we set value = servings
and in the resulting drinks_smaller_tidy
the column value contains the 4 × 3 = 12 numerical values.-country
indicating that we don’t want to tidy the country
variable in drinks_smaller
and rather only beer
, spirit
, and wine
.Note the code below is very similar, but now the third argument species which columns we’d want to tidy c(beer, spirit, wine)
, instead of the columns we don’t want to tidy -country
. Note the use of c()
to create a vector of the columns in drinks_smaller
that we’d like to tidy. If you run the code below, you’ll see that the resulting drinks_smaller_tidy
is the same.
drinks_smaller_tidy <- drinks_smaller %>%
gather(key = type, value = servings, c(beer, spirit, wine))
drinks_smaller_tidy
## # A tibble: 12 x 3
## country type servings
## <chr> <chr> <int>
## 1 China beer 79
## 2 Italy beer 85
## 3 Saudi Arabia beer 0
## 4 USA beer 249
## 5 China spirit 192
## 6 Italy spirit 42
## 7 Saudi Arabia spirit 5
## 8 USA spirit 158
## 9 China wine 8
## 10 Italy wine 237
## 11 Saudi Arabia wine 0
## 12 USA wine 84
With our drinks_smaller_tidy
“tidy” format data frame, we can now produce a side-by-side AKA dodged barplot using geom_col()
and not geom_bar()
, since we would like to map the servings variable to the y-aesthetic of the bars.
ggplot(drinks_smaller_tidy, aes(x=country, y=servings, fill=type)) +
geom_col(position = "dodge") +
labs(x = "country", y = "servings")
airline_safety
data frame in the fivethirtyeight
package. Let’s ignore the incl_reg_subsidiaries
and avail_seat_km_per_week
variables for simplicity:airline_safety
## # A tibble: 56 x 9
## airline incl_reg_subsid~ avail_seat_km_p~ incidents_85_99
## <chr> <lgl> <dbl> <int>
## 1 Aer Li~ FALSE 320906734 2
## 2 Aerofl~ TRUE 1197672318 76
## 3 Aeroli~ FALSE 385803648 6
## 4 Aerome~ TRUE 596871813 3
## 5 Air Ca~ FALSE 1865253802 2
## 6 Air Fr~ FALSE 3004002661 14
## 7 Air In~ TRUE 869253552 2
## 8 Air Ne~ TRUE 710174817 3
## 9 Alaska~ TRUE 965346773 5
## 10 Alital~ FALSE 698012498 7
## # ... with 46 more rows, and 5 more variables:
## # fatal_accidents_85_99 <int>, fatalities_85_99 <int>,
## # incidents_00_14 <int>, fatal_accidents_00_14 <int>,
## # fatalities_00_14 <int>
airline_safety_smaller <- airline_safety %>%
select(-c(incl_reg_subsidiaries , avail_seat_km_per_week ))
airline_safety_smaller
## # A tibble: 56 x 7
## airline incidents_85_99 fatal_accidents~ fatalities_85_99
## <chr> <int> <int> <int>
## 1 Aer Li~ 2 0 0
## 2 Aerofl~ 76 14 128
## 3 Aeroli~ 6 0 0
## 4 Aerome~ 3 1 64
## 5 Air Ca~ 2 0 0
## 6 Air Fr~ 14 4 79
## 7 Air In~ 2 1 329
## 8 Air Ne~ 3 0 0
## 9 Alaska~ 5 0 0
## 10 Alital~ 7 2 50
## # ... with 46 more rows, and 3 more variables: incidents_00_14 <int>,
## # fatal_accidents_00_14 <int>, fatalities_00_14 <int>
airline_safety_tidy <- airline_safety_smaller %>%
gather(key = Incident_type, value = count, c(incidents_85_99, incidents_00_14)) %>%
gather(key = fatal_accidents, value = count, c(fatal_accidents_85_99, fatal_accidents_00_14)) %>%
gather(key = fatalities, value = count, c(fatalities_85_99, fatalities_00_14))
airline_safety_tidy
## # A tibble: 448 x 5
## airline Incident_type fatal_accidents fatalities count
## <chr> <chr> <chr> <chr> <int>
## 1 Aer Lingus incidents_85_99 fatal_accidents_85~ fatalities_8~ 0
## 2 Aeroflot incidents_85_99 fatal_accidents_85~ fatalities_8~ 128
## 3 Aerolineas Arge~ incidents_85_99 fatal_accidents_85~ fatalities_8~ 0
## 4 Aeromexico incidents_85_99 fatal_accidents_85~ fatalities_8~ 64
## 5 Air Canada incidents_85_99 fatal_accidents_85~ fatalities_8~ 0
## 6 Air France incidents_85_99 fatal_accidents_85~ fatalities_8~ 79
## 7 Air India incidents_85_99 fatal_accidents_85~ fatalities_8~ 329
## 8 Air New Zealand incidents_85_99 fatal_accidents_85~ fatalities_8~ 0
## 9 Alaska Airlines incidents_85_99 fatal_accidents_85~ fatalities_8~ 0
## 10 Alitalia incidents_85_99 fatal_accidents_85~ fatalities_8~ 50
## # ... with 438 more rows
We need to load the above packages since they are for data visualizations.
In short, the grammar tells us that:
A statistical graphic is a mapping
of data variables to aes
thetic attributes of geom
etric objects.
Specifically, we can break a graphic into the following three essential components:
data
: the data set composed of variables that we map.geom
: the geometric object in question. This refers to the type of object we can observe in plot. For example: points, lines, and bars.aes
: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set.Shows relationship between 2 numerical variables (bivariate plots).
* geom_point()
we will visualize the relationship between dep_delay
and arr_delay
variables in the flights data frame included in the nycflights13 package. This requires filter
ing down the data from all 336,776 flights that left NYC in 2013, to only the 714 Alaska Airlines (AS) flights that left NYC in 2013.
library(dplyr)
library(ggplot2)
library(nycflights13)
alaska_flights <- flights %>%
filter(carrier == "AS")
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
This corresponds to values being plotted on top of each other over and over again. We address this by;
Method 1: Changing the transparency
We use the alpha
argument in geom_point()
. By default, this value is set to 1. We can change this to any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque.
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha =0.3 )
## Warning: Removed 5 rows containing missing values (geom_point).
Method 2: Jittering the points
This is shaking the points around a bit on the plot.
To create a jittered scatterplot, instead of using geom_point()
, we use geom_jitter()
. To specify how much jitter to add, we adjust the width
and height
arguments. This corresponds to how hard you’d like to shake the plot in units corresponding to those for both the horizontal and vertical variables (in this case minutes).
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30)
## Warning: Removed 5 rows containing missing values (geom_point).
geom_line
is the keywordWe can see that there is a variable called temp of hourly temperature recordings in Fahrenheit at weather stations near all three airports in New York City: Newark (origin code EWR), JFK, and La Guardia (LGA). Instead of considering hourly temperatures for all days in 2013 for all three airports however, for simplicity let’s only consider hourly temperatures at only Newark airport for the first 15 days in January.
early_january_weather <- weather %>%
filter(origin == "EWR" & month == 1 & day <= 15)
ggplot(early_january_weather, aes(x = time_hour, y=temp)) +
geom_line()
geom_histogram
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 40, color = "white", fill = "pink")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 10, color = "red", fill = "purple")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Color
- separates the binsy
- populates itself ie the counts of each binfill
- colour of the bin filling argument. Run colors() to see all 657 possible colors!Faceting is used when we’d like to split a particular visualization of variables by another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.
Note the use of the tilde ~
before month
in facet_wrap(). The tilde is required and you’ll receive the error Error in as.quoted(facets) : object 'month' not found
if you don’t include it before month here.
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
nrow
and ncol
arguments inside of facet_wrap()
. For example, say we would like our faceted plot to have 4 rows instead of 3. Add the nrow = 4
argument to facet_wrap(~ month)
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month, nrow = 4)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
geom_boxplot()
We want to split the box plots by month, hence we put this on the x axis.
ggplot(data = weather, mapping = aes(x = month, y = temp)) +
geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
factor()
function to convert to categorical variables “1”, “2”, “3”, ….. as below;ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
geom_bar()
when counts are not pre-counted, geom_col()
when counts are pre-counted
flights
data frame in the nycflights13
package and visualize the distribution of the categorical variable carrier
. In the flights
data frame, each row corresponds to a flight, hence not pre-counted this means we use geom_bar
.Note that there is only one variable in the aes()
aesthetic mapping: the variable carrier
ggplot(data = flights, mapping = aes(x = carrier)) +
geom_bar(fill="blue")
If the number of flights per carrier were precounted and the variable called number
flights_counted <- flights %>%
group_by(carrier) %>%
summarize(number=n())
flights_counted
## # A tibble: 16 x 2
## carrier number
## <chr> <int>
## 1 9E 18460
## 2 AA 32729
## 3 AS 714
## 4 B6 54635
## 5 DL 48110
## 6 EV 54173
## 7 F9 685
## 8 FL 3260
## 9 HA 342
## 10 MQ 26397
## 11 OO 32
## 12 UA 58665
## 13 US 20536
## 14 VX 5162
## 15 WN 12275
## 16 YV 601
So now, lets plot the precounted categorical variable (number of flights per carrier);
ggplot(flights_counted, aes(x=carrier, y=number)) +
geom_col(fill="purple")
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
geom_bar()
The above is an example of a stacked barplot. While simple to make, in certain aspects it is not ideal. For example, it is difficult to compare the heights of the different colors between the bars, corresponding to comparing the number of flights from each origin airport between the carriers.
Note that fill
is another aesthetic mapping much like x
-position; thus it must be included within the parentheses of the aes()
mapping.
Second, the fill
aesthetic corresponds to the color used to fill the bars, while the color
aesthetic corresponds to the color of the outline of the bars. Observe in Figure 3.30 that mapping origin
to color
and not fill
yields grey bars with different colored outlines.
ggplot(data = flights, mapping = aes(x = carrier, color = origin)) +
geom_bar()
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
geom_bar(position = "dodge")
ggplot(flights, aes(x=carrier)) +
geom_bar(fill="black") +
facet_wrap(~ origin, ncol=1)