The link to this demonstration is here, if you’d like to share it with your peers.
This exercise combines Danielle’s three modules (R Markdown, ggplot, dplyr) and shows how all three will help you when you compile your verification report.
For a cheat-sheet on R Markdown, click here.
EXERCISE: Compare this code with the output this R Markdown file produces when you click “Knit” at the top of your screen.
Here, I load the packages that are needed for this analysis. If you get the following error:
Error in library(tidyverse) : there is no package called ‘tidyverse’
it means the tidyverse package has not been
installed in your R library yet. In RStudio, go to Packages > Install,
type tidyverse, then click “Install”.
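The same install can be done from the console. This is a minimal sketch, assuming you have an internet connection and a CRAN mirror configured; it installs tidyverse only when it is missing and then attaches it:

```r
# Install tidyverse only if it is not already in your library, then attach it.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(tidyverse)
```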
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
In class, I showed you how to use data() to list the existing data sets in R that you can get creative/wrangle with.
For the purpose of this demonstration, I loaded the
starwars table, which comes with dplyr.
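As a rough sketch of that step from the console (the original chunk is not shown here, so the exact calls are an assumption): data() lists the available data sets, and starwars becomes available once dplyr is attached.

```r
library(dplyr)

# List the data sets bundled with the dplyr package by name.
dplyr_sets <- data(package = "dplyr")$results[, "Item"]
print(dplyr_sets)

# starwars is lazy-loaded with dplyr, so it can be used directly;
# data() additionally places a copy in your workspace.
data("starwars", package = "dplyr")
dim(starwars)
```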
I want to have a look at what’s in this data sheet, so I’m going to
use the glimpse() and head() functions to see
what I have.
head(starwars) # shows the first six ROWS of the data table
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Org… 150 49 brown light brown 19 fema… femin…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
glimpse(starwars) # lists each COLUMN of the table, with its data type and first few values
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
For the purpose of this demonstration, consult the table you’ve pre-loaded and have a look at what you want to do with the data. What are your objectives? What are you curious to find out?
This step is INCREDIBLY important, because it’s easy to get “lost” in code and have NO IDEA what you’re doing halfway through!
For a cheat-sheet on dplyr, click here.
Here’s my objective:
* I want to see what the average birth year is for people with different eye colours, and I want that data divided by sex.
* I want to visualise this data in a bar graph.
Here, I’ll use the select() function to pick the columns
I want from the main data table, and the mutate() function
(with replace_na()) to fill in the missing values in sex and birth_year.
starwars_2 <- starwars %>%
  select(name, eye_color, birth_year, sex) %>%     # keep only the columns I need
  mutate(sex = replace_na(sex, "none")) %>%        # missing sex becomes "none"
  mutate(birth_year = replace_na(birth_year, 0))   # missing birth year becomes 0
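One thing to keep in mind with this approach: replacing a missing number with 0 feeds that 0 into any later average. A small sketch on a made-up vector (the values are hypothetical, not taken from starwars) compares it with the alternative of skipping missing values via na.rm = TRUE:

```r
library(tidyr)

# Hypothetical birth years with one missing value.
birth_year <- c(19, NA, 41)

# Replacing NA with 0 treats the unknown value as a real 0.
mean_with_zero <- mean(replace_na(birth_year, 0))   # (19 + 0 + 41) / 3

# na.rm = TRUE drops the missing value before averaging.
mean_without_na <- mean(birth_year, na.rm = TRUE)   # (19 + 41) / 2

mean_with_zero   # 20
mean_without_na  # 30
```

Which of the two is appropriate depends on what a missing value means in your data, so it is worth deciding this when you set your objectives.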
I then used the group_by() function to group the rows by
eye colour and sex, and the summarise() function to
compute the mean birth year within each group.
Once I had computed the summary statistics, I used the
filter() function to apply an exclusion criterion, keeping
ONLY the groups whose mean birth year is greater than 0 and less than 100.
Note that this is just to give you an example of the filter function - you need to be more deliberate and thoughtful about your exclusion criteria in the real world (and when to apply them)!
NOTE: The “summarise” function creates the summary statistic you want, but removes all other columns that have NOT been included in the group_by() function. If you want to keep your summary statistics separated by a certain variable, make sure you include it in group_by().
starwars_3 <- starwars_2 %>%
  group_by(eye_color, sex) %>% # group the rows by eye colour and sex
  summarise(mean_birthyear = mean(birth_year)) %>% # mean birth year within each group
  filter(mean_birthyear > 0 & mean_birthyear < 100) # keep groups whose mean is above 0 and below 100
## `summarise()` has grouped output by 'eye_color'. You can override using the
## `.groups` argument.
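The message above is informational: after summarise(), the result is still grouped by eye_color. A small sketch on a made-up table (toy is hypothetical) illustrates both points from the NOTE: only the group_by() columns and the new summary column survive, and .groups = "drop" returns an ungrouped table without the message.

```r
library(dplyr)

# Hypothetical miniature of the starwars_2 table.
toy <- tibble(
  name       = c("A", "B", "C", "D"),
  eye_color  = c("blue", "blue", "brown", "brown"),
  sex        = c("male", "male", "female", "male"),
  birth_year = c(20, 40, 30, 50)
)

toy_summary <- toy %>%
  group_by(eye_color, sex) %>%
  summarise(mean_birthyear = mean(birth_year), .groups = "drop")

names(toy_summary)          # "name" is gone; the grouping columns are kept
is_grouped_df(toy_summary)  # FALSE, because of .groups = "drop"
```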
For a cheat-sheet on ggplot, click here.
Finally, plotting time! I want to plot the mean birth year for each
eye colour, so I use ggplot() (with aes()
specified in the plot) and add a geom_bar() on top.
ggplot(data = starwars_3, mapping = aes(x = eye_color, y = mean_birthyear)) +
  geom_bar(stat = "identity")
# NOTE: geom_bar() initially didn't produce a plot because I didn't include stat = "identity". By default it uses stat = "count", which counts the rows in each category and does not accept a y aesthetic. stat = "identity" tells it to use the y values as the bar heights.
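For reference, ggplot2 also provides geom_col(), which takes the bar heights straight from the y values and so needs no stat argument. A minimal sketch on a made-up summary table (toy_means is hypothetical):

```r
library(ggplot2)

# Hypothetical summary table with one mean per eye colour.
toy_means <- data.frame(
  eye_color      = c("blue", "brown", "yellow"),
  mean_birthyear = c(35, 42, 60)
)

# geom_col() is equivalent to geom_bar(stat = "identity").
p <- ggplot(data = toy_means, mapping = aes(x = eye_color, y = mean_birthyear)) +
  geom_col()
p
```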
I want a separate panel for each value of sex, so I’m going
to use the facet_wrap() function.
The code is the same as above, with
+ facet_wrap() added at the end!
ggplot(data = starwars_3, mapping = aes(x = eye_color, y = mean_birthyear)) +
  geom_bar(stat = "identity") +
  facet_wrap(vars(sex)) # dividing the graph by the category "sex"
There are some blank spots in the bar graph - why do you think that is? Refer back to the data table that I used to plot the graph!
Look at what data exists using data(). Load a data
table, then think about what information you want to get out of it.
As per the exercise above, what you normally need to do with data in the real world is first wrangle it into a presentable/analysable format (dplyr), then analyse it (using t-tests, ANOVAs, etc.), and then display it (ggplot).
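To make the three stages concrete, here is one possible sketch using starwars. The question (does height differ between masculine and feminine characters?) and the choice of a Welch t-test are assumptions made for illustration, not part of the exercise.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# 1. Wrangle: keep the two gender groups and drop rows with missing height.
heights <- starwars %>%
  select(name, gender, height) %>%
  filter(gender %in% c("masculine", "feminine")) %>%
  drop_na(height)

# 2. Analyse: Welch two-sample t-test of height by gender.
height_test <- t.test(height ~ gender, data = heights)
height_test

# 3. Display: distribution of height in each group.
height_plot <- ggplot(data = heights, mapping = aes(x = gender, y = height)) +
  geom_boxplot()
height_plot
```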
And before you start coding, always have an objective in mind:
geom_bar(), geom_point(), etc.
starwars_3 <- starwars %>%
  select(name, hair_color, height, gender) %>%
  mutate(gender = replace_na(gender, "none")) %>%
  mutate(height = replace_na(height, 0))
starwars_4 <- starwars_3 %>%
  group_by(hair_color, gender) %>%
  summarise(mean_height = mean(height))
## `summarise()` has grouped output by 'hair_color'. You can override using the
## `.groups` argument.
ggplot(data = starwars_4, mapping = aes(x = hair_color, y = mean_height)) +
  geom_bar(stat = "identity") +
  facet_wrap(vars(gender))