Visualization with R and RStudio

Author

James L. Adams

Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable: you can change its appearance and general layout without much fuss. For instance, to rearrange the order of panes, open the options dialog and choose Pane Layout:

  • RStudio > Preferences (Mac)
  • Tools > Global Options (Windows)

Many of RStudio’s other features are bound to keyboard shortcuts. You can see the full list by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int
[1] 4

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int) 
[1] -0.6536436
cos(4)
[1] -0.6536436
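
As a small sketch of the point that variables can hold whole data structures and not just single values, the example below stores a vector and a data frame. The names prices and listings_toy, and the numbers in them, are made up for illustration.

```r
# Hypothetical nightly prices, stored as a numeric vector
prices <- c(120, 139, 171)

# Functions accept the variable just as they would the values themselves
mean(prices)

# A variable can also hold an entire data frame
listings_toy <- data.frame(
  id = c(1, 2, 3),
  price = prices
)
nrow(listings_toy)
```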

Working Environments

You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions make up our working environment: the data that R is holding in active memory. This environment isn’t necessarily persistent, so don’t count on it lasting between R sessions.
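
The working environment can also be inspected from the console. The sketch below uses only base R functions (ls(), exists(), rm()) and re-creates the new_int variable from earlier so that it runs on its own.

```r
# Create a variable, then inspect the working environment
new_int <- 4
ls()               # names of the objects currently held in memory
exists("new_int")  # TRUE while the variable is defined

# Remove the variable; it is gone until it is created again
rm(new_int)
exists("new_int")
```

To keep an object between sessions it has to be written to disk, for example with saveRDS() and read back with readRDS().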

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R. Installation only needs to happen once per machine, while library() has to be called in every new session:

install.packages(c("tidyverse", "leaflet"))
library(tidyverse)

Read in Data

We’ll be using data from Inside Airbnb, specifically listings in the city of Boston. The original data can be found here.

df <- read_csv("./data/original/listings.csv") %>%
  filter(price < 9999)

detail <- read_csv("./data/original/listings_detail.csv") %>%
  # Strip "$" and thousands separators so values such as "$1,250.00" convert cleanly
  mutate(price = as.numeric(str_remove_all(price, "[$,]")))

head(df)
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
3781 Rental unit in Boston · ★4.96 · 1 bedroom · 1 bed · 1 bath 4804 Frank NA East Boston 42.36413 -71.02991 Entire home/apt 120 32 24 2022-09-05 0.25 1 224 1 NA
5506 Guest suite in Boston · ★4.79 · 1 bedroom · 1 bed · 1 bath 8229 Terry NA Roxbury 42.32844 -71.09581 Entire home/apt 139 3 118 2022-12-05 0.68 10 79 8 Approved by the government
6695 Condo in Boston · ★4.80 · Studio · 2 beds · 1 bath 8229 Terry NA Roxbury 42.32802 -71.09387 Entire home/apt 171 3 124 2023-03-26 0.73 10 71 8 STR446650
8789 Rental unit in Boston · ★4.65 · 1 bedroom · 1 bed · 1 bath 26988 Anne NA Beacon Hill 42.35867 -71.06307 Entire home/apt 93 91 26 2023-05-12 0.24 8 186 1 NA
10813 Rental unit in Boston · ★5.0 · Studio · 1 bed · 1 bath 38997 Michelle NA Back Bay 42.35061 -71.08787 Entire home/apt 133 29 5 2020-12-02 0.06 11 365 0 NA
10986 Condo in Boston · Studio · 1 bed · 1 bath 38997 Michelle NA North End 42.36377 -71.05206 Entire home/apt 139 33 2 2016-05-23 0.02 11 365 0 NA
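
When prices arrive as text, the conversion step is worth checking on a few values before trusting it. The sketch below assumes price strings formatted like "$1,250.00" (an assumption about the detail file, not something shown above) and uses made-up values to compare removing only the dollar sign with removing the thousands separator as well.

```r
library(stringr)

# Hypothetical price strings; the real file may be formatted differently
raw_price <- c("$120.00", "$1,250.00")

# Removing only "$" leaves the comma in place, so the second value cannot be parsed
dollar_only <- suppressWarnings(as.numeric(str_remove(raw_price, "\\$")))
dollar_only

# Removing both "$" and "," converts every value
clean_price <- as.numeric(str_remove_all(raw_price, "[$,]"))
clean_price
```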

Grouping / Summarizing

One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean price for each neighborhood, we could do so like this:

df_avg <- df %>%
  group_by(neighbourhood) %>%
  summarize(mean_price = mean(price))

df_avg
neighbourhood mean_price
Allston 141.0735
Back Bay 344.7915
Bay Village 286.5273
Beacon Hill 228.3977
Brighton 147.1414
Charlestown 278.5063
Chinatown 298.3535
Dorchester 155.9565
Downtown 343.8106
East Boston 174.5811
Fenway 284.2260
Hyde Park 117.7742
Jamaica Plain 202.7200
Leather District 244.8571
Longwood Medical Area 99.0000
Mattapan 158.6000
Mission Hill 149.1000
North End 323.6129
Roslindale 187.6486
Roxbury 150.9072
South Boston 239.2228
South Boston Waterfront 318.8689
South End 265.6985
West End 256.4783
West Roxbury 226.8431
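
To see exactly what group_by() and summarize() are doing, it can help to run them on a table small enough to check by hand. The sketch below uses a made-up tibble called toy (not the Airbnb data) and adds two common extras: n() to count rows per group, and na.rm = TRUE so that a missing price does not turn the whole group mean into NA.

```r
library(dplyr)

# Hypothetical listings; neighbourhood names and prices are made up
toy <- tibble(
  neighbourhood = c("A", "A", "B", "B", "B"),
  price = c(100, 200, 50, NA, 150)
)

toy_avg <- toy %>%
  group_by(neighbourhood) %>%
  summarize(
    n_listings = n(),                       # rows per group
    mean_price = mean(price, na.rm = TRUE)  # ignore missing prices
  )
toy_avg
```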

You can create multiple columns this way, too:

df_summed <- df %>%
  group_by(neighbourhood) %>%
  summarize(
    mean_price = mean(price),
    max_price = max(price),
    median_price = median(price)
  )

df_summed
neighbourhood mean_price max_price median_price
Allston 141.0735 2701 96.0
Back Bay 344.7915 1500 275.0
Bay Village 286.5273 1200 192.0
Beacon Hill 228.3977 950 170.0
Brighton 147.1414 1547 111.0
Charlestown 278.5063 950 235.0
Chinatown 298.3535 673 278.0
Dorchester 155.9565 2793 99.0
Downtown 343.8106 3999 269.0
East Boston 174.5811 800 152.0
Fenway 284.2260 1461 237.0
Hyde Park 117.7742 292 90.5
Jamaica Plain 202.7200 1746 150.0
Leather District 244.8571 400 245.0
Longwood Medical Area 99.0000 123 114.0
Mattapan 158.6000 514 103.0
Mission Hill 149.1000 359 149.5
North End 323.6129 2200 196.5
Roslindale 187.6486 1000 150.5
Roxbury 150.9072 1270 105.0
South Boston 239.2228 1109 196.0
South Boston Waterfront 318.8689 999 275.0
South End 265.6985 1200 213.0
West End 256.4783 885 242.0
West Roxbury 226.8431 1600 113.0

You can even group by multiple variables:

df_summed <- df %>%
  group_by(neighbourhood, room_type) %>%
  summarize(
    mean_price = mean(price),
    max_price = max(price),
    median_price = median(price)
  )

head(df_summed)
neighbourhood room_type mean_price max_price median_price
Allston Entire home/apt 244.77922 2701 172
Allston Private room 78.57937 250 76
Allston Shared room 30.00000 30 30
Back Bay Entire home/apt 346.80952 1500 261
Back Bay Hotel room 285.12500 348 326
Back Bay Private room 343.43478 530 378
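
One detail to be aware of when grouping by several variables: by default summarize() drops only the last grouping variable, so the result is still grouped by the remaining ones. The sketch below, on a made-up tibble, uses group_vars() to show this and the .groups argument to return an ungrouped table instead.

```r
library(dplyr)

# Hypothetical listings; values are made up
toy <- tibble(
  neighbourhood = c("A", "A", "A", "B"),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room", "Private room"),
  price = c(100, 200, 60, 80)
)

# Default: room_type is dropped, the result stays grouped by neighbourhood
by_both <- toy %>%
  group_by(neighbourhood, room_type) %>%
  summarize(mean_price = mean(price))
group_vars(by_both)

# .groups = "drop" returns a plain, ungrouped tibble
ungrouped <- toy %>%
  group_by(neighbourhood, room_type) %>%
  summarize(mean_price = mean(price), .groups = "drop")
group_vars(ungrouped)
```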

Visualizing Data

Try running this code:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system initially described by Leland Wilkinson (see endnotes). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:

  • data
  • a coordinate system
  • geometry to display the data

We can see this in action:

ggplot(df)

With only the data, ggplot() draws a blank panel, so it doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(df, aes(x = price, y = number_of_reviews))

Fantastic! The data is now mapped onto a 2D plane, which is the most ggplot2 supports (it has no “traditional” 3D visualizations). Every row of our dataset corresponds to some point on this plane, but we still need to decide how those points will be rendered. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot with geom_point() to show where the data falls on this plane.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point()
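
Because each component is added with +, a plot can also be stored in a variable and built up step by step. The sketch below uses a small made-up data frame in place of df; the names toy, base_plot, and scatter are illustrative only.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(80, 120, 200, 350),
  number_of_reviews = c(40, 25, 10, 3)
)

# Data plus coordinate system: an empty set of axes
base_plot <- ggplot(toy, aes(x = price, y = number_of_reviews))

# Adding a geometry returns a new plot; base_plot itself is unchanged
scatter <- base_plot + geom_point()

length(base_plot$layers)  # layers in the empty plot
length(scatter$layers)    # layers after adding the points
scatter                   # printing the object draws the plot
```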

The points are just one way of rendering this. We could do it another way if we wanted to.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_text(aes(label = host_name))

Or we could try this:

ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_boxplot()

Or this:

ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_col()

ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_col(aes(color = room_type))
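
Two things are worth knowing when reading these bar charts. First, geom_col() draws one segment per row and stacks them, so each bar's height is the sum of the prices in that neighbourhood, not an average. Second, for bars the color aesthetic only controls the outline, while fill controls the interior. The sketch below illustrates both on a made-up tibble.

```r
library(ggplot2)
library(dplyr)

# Hypothetical listings; values are made up
toy <- tibble(
  neighbourhood = c("A", "A", "B"),
  room_type = c("Entire home/apt", "Private room", "Private room"),
  price = c(100, 60, 80)
)

# The bar heights drawn by geom_col() match these per-neighbourhood sums
totals <- toy %>%
  group_by(neighbourhood) %>%
  summarize(total_price = sum(price))
totals

# fill colors the inside of each stacked segment
filled_bars <- ggplot(toy, aes(x = neighbourhood, y = price)) +
  geom_col(aes(fill = room_type))
filled_bars
```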

Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, a technique called “small multiples” (“facets” in ggplot) allows us to show additional dimensions of the data. For instance, we can separate the points by type of room:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(room_type~.)

Activity

Using what you know so far, create a plot that lays out price and number of reviews by neighborhood.

Solution

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(neighbourhood~.)

We could also write this as:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_wrap(~neighbourhood)
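
The two faceting functions differ mainly in layout. As a sketch on made-up data, facet_grid() can place one variable on the rows and another on the columns, while facet_wrap() flows the panels of a single variable into however many columns you ask for with ncol.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(80, 120, 200, 350, 90, 150),
  number_of_reviews = c(40, 25, 10, 3, 30, 12),
  room_type = c("Private room", "Entire home/apt", "Entire home/apt",
                "Entire home/apt", "Private room", "Private room"),
  neighbourhood = c("A", "A", "B", "B", "B", "A")
)

# Rows are split by room_type, columns by neighbourhood
grid_plot <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(room_type ~ neighbourhood)
grid_plot

# Panels for one variable, arranged in a single column
wrap_plot <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_wrap(~ neighbourhood, ncol = 1)
wrap_plot
```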

Adding color

We can also encode variables as color, shape, or size. Let’s try coloring the points of df by type of room.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type))

What if I want to make all the points blue?

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = "blue"))

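That doesn’t work: the points come out in a default color, with a legend entry labeled “blue”. This happens because we set the color inside the aes() function. We use aes() when we’re referring to something inside the data, such as price, neighborhood, or type of room, so ggplot treats "blue" as a data value rather than as a color name. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().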

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(color = "blue")
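
The difference between mapping and setting a color can be checked without looking at the image. The sketch below, on a made-up data frame, uses ggplot2's layer_data() to pull out the values each layer actually draws.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(80, 120, 200),
  number_of_reviews = c(40, 25, 10)
)

# "blue" inside aes(): treated as a data value and passed through the color scale
mapped <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes(): used directly as the color of every point
fixed <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point(color = "blue")

unique(layer_data(mapped)$colour)  # a color chosen by the default palette
unique(layer_data(fixed)$colour)   # the literal value that was set
```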

Activity

Try creating a bar plot that shows the median price by neighborhood, using the grouping/summarizing logic we used before. Try coloring the bars in Harvard crimson (#A41034).

Solution

df %>%
  group_by(neighbourhood) %>%
  summarize(median_price = median(price, na.rm = TRUE)) %>%
  ggplot(aes(x = median_price, y = reorder(neighbourhood, median_price))) +
  geom_col(fill = "#A41034")

Scales

Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.

We can use a logarithmic X axis to make this image clearer:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  scale_x_log10()
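
Scale functions also accept arguments that control where the axis breaks fall and how they are labeled. The sketch below is one possible variation, on made-up data: it picks explicit breaks and formats them as dollar amounts with label_dollar() from the scales package (installed alongside ggplot2). The break values are arbitrary choices for this toy data. Note that a price of 0 has no finite logarithm, so such rows cannot be placed on a log axis.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(25, 80, 250, 900, 3000),
  number_of_reviews = c(5, 40, 25, 10, 2)
)

log_plot <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  scale_x_log10(
    breaks = c(30, 100, 300, 1000, 3000),  # arbitrary, roughly even on a log scale
    labels = scales::label_dollar()        # show breaks as dollar amounts
  )
log_plot

# Why zero prices are a problem on a log axis
log10(0)
```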

Adding more geometries

We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to show a regression line on the plot:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10()

Encodings can be inherited from the top-line ggplot() function, so if we wanted the geom_smooth() layer to have the same color as the points, we could do so like this:

ggplot(df, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point(aes(size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10()
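
Where a mapping is placed also decides how many regression lines are drawn, because a discrete variable mapped to color splits the data into groups. The sketch below, on made-up data, counts the groups that the geom_smooth() layer ends up with in each case, using layer_data() on the second layer.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(80, 120, 200, 350, 90, 150, 60, 110),
  number_of_reviews = c(40, 25, 10, 3, 30, 12, 50, 20),
  room_type = rep(c("Entire home/apt", "Private room"), each = 4)
)

# color mapped in ggplot(): inherited by both layers, one line per room type
global_map <- ggplot(toy, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped only in geom_point(): the smooth layer sees no grouping
layer_map <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of separate lines fitted by the smooth layer (layer 2) in each plot
n_lines_global <- length(unique(layer_data(global_map, 2)$group))
n_lines_layer <- length(unique(layer_data(layer_map, 2)$group))
c(n_lines_global, n_lines_layer)
```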

Labels

We can use labs() to set the labels for just about anything.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  )

Themes

ggplot has some built-in themes that can improve your charts:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  ) +
  theme_bw()
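
Built-in themes can be adjusted further with theme(). As a sketch on made-up data, the example below moves the legend under the plot. The tweak is added after theme_bw() because a complete theme replaces any theme settings that came before it.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(80, 120, 200, 350),
  number_of_reviews = c(40, 25, 10, 3),
  room_type = c("Private room", "Entire home/apt", "Entire home/apt", "Private room")
)

themed <- ggplot(toy, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point() +
  theme_bw() +                        # complete built-in theme first
  theme(legend.position = "bottom")   # individual adjustment afterwards
themed
```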

Saving an Image

We can use the ggsave() function to save images that we’ve created as local files.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  ) +
  theme_bw()

ggsave("images/airbnb.png")
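
Called with only a file name, ggsave() saves the most recently displayed plot at the size of the current graphics device. For reproducible output it can help to pass the plot and its dimensions explicitly. The sketch below does this on made-up data and writes to a temporary folder (an assumption made so the example runs anywhere); in a project you would point it at a folder such as images/, which should exist before saving.

```r
library(ggplot2)

# Hypothetical listings; values are made up
toy <- data.frame(
  price = c(80, 120, 200, 350),
  number_of_reviews = c(40, 25, 10, 3)
)
p <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point()

# Make sure the output folder exists before saving
out_dir <- file.path(tempdir(), "images")
dir.create(out_dir, showWarnings = FALSE)

# Save a specific plot at a fixed size and resolution
out_file <- file.path(out_dir, "airbnb_toy.png")
ggsave(out_file, plot = p, width = 8, height = 6, units = "in", dpi = 300)
file.exists(out_file)
```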

References

Chang, Winston. n.d. R Graphics Cookbook, 2nd Edition. Accessed August 21, 2021. https://r-graphics.org.
“Get the Data.” n.d. Inside Airbnb. Accessed October 10, 2023. http://insideairbnb.com/get-the-data/.
R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
———. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://dx.doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol, CA: O’Reilly Media.