These are my notes for GRD 610A: Data Visualization II, Winter 2022, at the College for Creative Studies. They follow my work through the book Data Visualization by Kieran Healy (Princeton University Press, 2019).
Objects in R are created and referred to by their names. Certain names are not allowed because they are reserved words, such as TRUE, if, and NA. Other names, such as mean, are allowed but should be avoided because they already belong to commonly used functions. Names also cannot start with a number or contain spaces. There are several common naming conventions.
Snake Case
my_data
this_is_snake_case
Camel Case
myData
thisIsCamelCase
Pascal Case
MyData
ThisIsPascalCase
Pick one naming convention and stick with it. Be consistent; don't switch between conventions. I recommend snake case.
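A quick way to check the naming rules above is to lean on base R's `make.names()`, which rewrites anything that is not a valid name. The helper `is_valid_name()` below is my own hypothetical function, not something from the book; it is only a sketch.

```r
# Hypothetical helper: TRUE when `name` is already a valid R name.
# make.names() changes invalid names, so a valid name comes back unchanged.
is_valid_name <- function(name) {
  name == make.names(name)
}

is_valid_name("my_data")  # snake case: valid
is_valid_name("myData")   # camel case: valid
is_valid_name("2nd_try")  # starts with a number: not valid
is_valid_name("my data")  # contains a space: not valid
is_valid_name("TRUE")     # reserved word: not valid
```

Calling `make.names("my data")` directly shows what R would turn the name into.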
# This is a comment (it starts with #)
my_data <- c(1, 2, 3, 4) # Assign using <- ; in RStudio, type it with ALT + - (Windows) or OPTION + - (Mac)
My_Data
## Error in eval(expr, envir, enclos): object 'My_Data' not found
# Cannot be found because we called it my_data (lowercase)
# Now we can see it
my_data
## [1] 1 2 3 4
Think of functions like a recipe. The arguments of the function are the ingredients and what happens within the function is the sequence of cooking steps.
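To make the recipe idea concrete, here is a small function written from scratch. `rounded_mean()` is a made-up name for illustration: the arguments `x` and `digits` are the ingredients, and the lines inside the braces are the cooking steps.

```r
# Hypothetical function for illustration (not part of base R)
rounded_mean <- function(x, digits = 1) {
  total <- sum(x)               # step 1: add everything up
  average <- total / length(x)  # step 2: divide by how many values there are
  round(average, digits)        # step 3: round; the last value is what comes back
}

rounded_mean(c(1, 2, 3, 1, 3, 5, 25))                  # 5.7
rounded_mean(x = c(1, 2, 3, 1, 3, 5, 25), digits = 3)  # 5.714
```

Because `digits` has a default value, it can be left out, just like an optional ingredient.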
c(1, 2, 3, 1, 3, 5, 25) # c() is the combine function; it puts things together into a vector
## [1] 1 2 3 1 3 5 25
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)
my_numbers
## [1] 1 2 3 1 3 5 25
mean(x = my_numbers)
## [1] 5.714286
mean(my_numbers) # You don't have to name the arguments, but if you leave the names out, their order matters
## [1] 5.714286
mean(x = your_numbers)
## [1] 19.71429
my_summary <- summary(my_numbers)
my_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.500   3.000   5.714   4.000  25.000
table(my_numbers)
## my_numbers
##  1  2  3  5 25
##  2  1  2  1  1
sd(my_numbers)
## [1] 8.616153
my_numbers * 5
## [1] 5 10 15 5 15 25 125
my_numbers + 1
## [1] 2 3 4 2 4 6 26
my_numbers + my_numbers # How is this different than the last line?
## [1] 2 4 6 2 6 10 50
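One way to think about the question above: R does arithmetic on vectors element by element. When one side is shorter, as with the single number 1, R repeats ("recycles") it to match the longer side. A minimal sketch using the same numbers:

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# Adding 1 is the same as adding a vector of seven 1s
plus_one <- my_numbers + 1
plus_ones <- my_numbers + rep(1, length(my_numbers))
identical(plus_one, plus_ones)  # TRUE

# Adding a vector to itself pairs up each element with itself
doubled <- my_numbers + my_numbers
identical(doubled, my_numbers * 2)  # TRUE

# Recycling also works with shorter vectors of more than one value
c(1, 2, 3, 4) + c(10, 20)  # 11 22 13 24
```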
# If you're not sure what an object is, ask for its class or type
class(my_numbers)
## [1] "numeric"
class(my_summary)
## [1] "summaryDefault" "table"
class(summary)
## [1] "function"
my_new_vector <- c(my_numbers, "Apple") # What happens if we combine a word with numbers?
my_new_vector
## [1] "1" "2" "3" "1" "3" "5" "25" "Apple"
class(my_new_vector)
## [1] "character"
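A vector can only hold one type of thing, so R quietly converts everything to the most flexible type, here text. A short sketch of what happens when going back the other way (`mixed` and `numbers_again` are names I made up for this example):

```r
mixed <- c(1, 2, "Apple")
class(mixed)  # "character"

# Converting back to numbers: "Apple" has no numeric value, so it becomes NA.
# R also gives a warning here; suppressWarnings() hides it for this demo.
numbers_again <- suppressWarnings(as.numeric(mixed))
numbers_again         # 1 2 NA
is.na(numbers_again)  # FALSE FALSE TRUE
```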
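# Load the packages used in the rest of these notes
library(tidyverse) # as_tibble(), read_csv(), glimpse(), ggplot()
library(socviz)    # titanic
library(gapminder) # gapminder
# Let's look at a new dataset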
titanic
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6
class(titanic)
## [1] "data.frame"
# titanic is a data frame, which is like a table
# The $ operator can be used to access a column of a data frame by name
titanic$percent
## [1] 62.0 5.7 16.7 15.6
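Here is a small sketch of other ways to pull out a column. `titanic_copy` is a hand-typed stand-in built from the values printed above, so it runs without the original dataset; its text columns are plain character vectors rather than factors.

```r
# Stand-in data frame typed from the printed output (an assumption for this demo)
titanic_copy <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

titanic_copy$percent       # by name with $
titanic_copy[["percent"]]  # same column with double brackets
titanic_copy[, "percent"]  # same column with [row, column] indexing

# A column is just a vector, so the usual functions work on it
sum(titanic_copy$n)                              # 2201
titanic_copy$n[titanic_copy$fate == "survived"]  # 367 344
```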
# Tibbles are slightly different from data frames. They are also data tables, but they show more information when printed.
titanic_tb <- as_tibble(titanic)
titanic_tb # How does this compare to titanic above?
## # A tibble: 4 x 4
##   fate     sex        n percent
##   <fct>    <fct>  <dbl>   <dbl>
## 1 perished male    1364    62
## 2 perished female   126     5.7
## 3 survived male     367    16.7
## 4 survived female   344    15.6
# To see inside an object, ask for its structure
str(my_numbers)
## num [1:7] 1 2 3 1 3 5 25
str(my_summary)
## 'summaryDefault' Named num [1:6] 1 1.5 3 5.71 4 ...
## - attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
Programming in R can be challenging, and it takes time to get used to. Be patient and take a break if you get stuck. Make sure every parenthesis you open is also closed. Complete your commands: a + at the start of the console line means R is still waiting for the rest of the command. Take your time and look out for typos.
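As a small illustration of the unclosed-parenthesis problem, the sketch below asks R whether a piece of code is complete without running it. `parses_ok()` is a hypothetical helper I wrote for this note; it uses base R's `parse()` and `tryCatch()`.

```r
complete   <- "mean(c(1, 2, 3))"
incomplete <- "mean(c(1, 2, 3)"  # missing the final closing parenthesis

# Hypothetical helper: TRUE if R can read the code, FALSE if it hits an error
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses_ok(complete)    # TRUE
parses_ok(incomplete)  # FALSE
```

Typed directly into the console, the incomplete line would show the + prompt instead of an error.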
In this section, we will get data from a URL and make a quick figure.
# Data source
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
# Read the CSV from the URL
organs <- read_csv(file = url)
## Rows: 238 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): country, world, opt, consent.law, consent.practice, consistent, ccode
## dbl (14): year, donors, pop, pop.dens, gdp, gdp.lag, health, health.lag, pub...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Take a quick look at the data
glimpse(organs)
## Rows: 238
## Columns: 21
## $ country <chr> "Australia", "Australia", "Australia", "Australia", "~
## $ year <dbl> NA, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1~
## $ donors <dbl> NA, 12.09, 12.35, 12.51, 10.25, 10.18, 10.59, 10.26, ~
## $ pop <dbl> 17065, 17284, 17495, 17667, 17855, 18072, 18311, 1851~
## $ pop.dens <dbl> 0.2204433, 0.2232723, 0.2259980, 0.2282198, 0.2306484~
## $ gdp <dbl> 16774, 17171, 17914, 18883, 19849, 21079, 21923, 2296~
## $ gdp.lag <dbl> 16591, 16774, 17171, 17914, 18883, 19849, 21079, 2192~
## $ health <dbl> 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948, 2077,~
## $ health.lag <dbl> 1224, 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948,~
## $ pubhealth <dbl> 4.8, 5.4, 5.4, 5.4, 5.4, 5.5, 5.6, 5.7, 5.9, 6.1, 6.2~
## $ roads <dbl> 136.59537, 122.25179, 112.83224, 110.54508, 107.98096~
## $ cerebvas <dbl> 682, 647, 630, 611, 631, 592, 576, 525, 516, 493, 474~
## $ assault <dbl> 21, 19, 17, 18, 17, 16, 17, 17, 16, 15, 16, 15, 14, N~
## $ external <dbl> 444, 425, 406, 376, 387, 371, 395, 385, 410, 409, 393~
## $ txp.pop <dbl> 0.9375916, 0.9257116, 0.9145470, 0.9056433, 0.8961075~
## $ world <chr> "Liberal", "Liberal", "Liberal", "Liberal", "Liberal"~
## $ opt <chr> "In", "In", "In", "In", "In", "In", "In", "In", "In",~
## $ consent.law <chr> "Informed", "Informed", "Informed", "Informed", "Info~
## $ consent.practice <chr> "Informed", "Informed", "Informed", "Informed", "Info~
## $ consistent <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes~
## $ ccode <chr> "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz",~
# View(organs) # Run in RStudio
# Another way to view data
gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
# Make a plot object
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
# Create a scatterplot
p + geom_point()
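A possible next step is to keep adding layers to the same plot object with +. The sketch below assumes the ggplot2 and gapminder packages are installed. The name `p_log`, the transparency value, the straight-line smoother, and the axis labels are my own choices for illustration.

```r
library(ggplot2)
library(gapminder) # provides the gapminder data used above

# Same plot object as before: data plus the x and y mappings
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))

# Each + adds one layer or setting on top of the previous ones
p_log <- p +
  geom_point(alpha = 0.3) +     # semi-transparent points reduce overplotting
  geom_smooth(method = "lm") +  # straight-line fit through the points
  scale_x_log10() +             # put GDP per capita on a log scale
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy (years)")

p_log # printing the object draws the plot
```

Storing the plot in an object first makes it easy to try different layers without retyping the data and mappings.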