Visualization with R and RStudio

Author

James L. Adams

Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, open the options dialog from the menu below and look for the Pane Layout section:

RStudio > Preferences (Mac)
Tools > Options (Windows)

There are a lot of other features to take advantage of in RStudio, and many of them are bound to shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int

[1] 4

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int)

[1] -0.6536436

cos(4)

[1] -0.6536436
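Variables are not limited to single numbers. As a minimal sketch (the names prices and listing are made up for illustration), the same assignment operator can store a vector or a small data frame:

```r
# A numeric vector: several values stored under one name
prices <- c(120, 139, 171)
mean(prices)      # average of the three values
prices * 2        # arithmetic applies to every element

# A data frame: a table whose columns are vectors of equal length
listing <- data.frame(
  host_name = c("Frank", "Terry", "Terry"),
  price     = prices
)
nrow(listing)     # number of rows in the table
```

Functions such as mean() and nrow() treat these variables exactly as they would treat the underlying values.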

Working Environments

You may have noticed a few new things in the “Environment” pane in RStudio. These variables and functions make up our working environment: the objects R is holding in active memory. This environment isn’t necessarily persistent, so don’t count on it lasting between R sessions.
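The Environment pane has console equivalents. The sketch below uses only base R functions and re-creates new_int so that it runs on its own:

```r
new_int <- 4

# List the names of the objects in the current environment
ls()

# Check whether a particular name is defined
exists("new_int")

# Remove an object when it is no longer needed
rm(new_int)
exists("new_int")
```

Because the environment may not survive a restart, it is safer to keep the code that creates each object in a script than to rely on the objects still being there.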

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:

install.packages(c("tidyverse", "leaflet"))

library(tidyverse)
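Packages only need to be installed once, while library() has to be called in every new session. One possible way to avoid reinstalling on every run is to check what is missing first. This is a sketch, not part of the original workshop code; the variable names wanted, installed, and to_install are hypothetical, and the install call is left commented out so nothing is downloaded by accident:

```r
# Packages this workshop uses (taken from the install.packages() call above)
wanted <- c("tidyverse", "leaflet")

# requireNamespace() returns TRUE when a package is installed and loadable
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
to_install <- wanted[!installed]

# Install only what is missing; uncomment to actually run the download
# if (length(to_install) > 0) install.packages(to_install)
to_install
```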

Read in Data

We’ll be using data from Inside Airbnb, specifically listings in the city of Boston. The original data can be found here.

df <- read_csv("./data/original/listings.csv") %>%
  filter(price < 9999)

# parse_number() strips the "$" and any thousands separators before converting
detail <- read_csv("./data/original/listings_detail.csv") %>%
  mutate(price = parse_number(price))

head(df)

id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	number_of_reviews_ltm	license
3781	Rental unit in Boston · ★4.96 · 1 bedroom · 1 bed · 1 bath	4804	Frank	NA	East Boston	42.36413	-71.02991	Entire home/apt	120	32	24	2022-09-05	0.25	1	224	1	NA
5506	Guest suite in Boston · ★4.79 · 1 bedroom · 1 bed · 1 bath	8229	Terry	NA	Roxbury	42.32844	-71.09581	Entire home/apt	139	3	118	2022-12-05	0.68	10	79	8	Approved by the government
6695	Condo in Boston · ★4.80 · Studio · 2 beds · 1 bath	8229	Terry	NA	Roxbury	42.32802	-71.09387	Entire home/apt	171	3	124	2023-03-26	0.73	10	71	8	STR446650
8789	Rental unit in Boston · ★4.65 · 1 bedroom · 1 bed · 1 bath	26988	Anne	NA	Beacon Hill	42.35867	-71.06307	Entire home/apt	93	91	26	2023-05-12	0.24	8	186	1	NA
10813	Rental unit in Boston · ★5.0 · Studio · 1 bed · 1 bath	38997	Michelle	NA	Back Bay	42.35061	-71.08787	Entire home/apt	133	29	5	2020-12-02	0.06	11	365	0	NA
10986	Condo in Boston · Studio · 1 bed · 1 bath	38997	Michelle	NA	North End	42.36377	-71.05206	Entire home/apt	139	33	2	2016-05-23	0.02	11	365	0	NA
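When a price column arrives as text, it has to be converted before it can be filtered or averaged. The sketch below assumes prices formatted like "$1,250.00" (an assumed format; check your own file) and compares two ways of cleaning them. The vector raw_price is made up for illustration:

```r
library(readr)

# Example price strings in the "$1,250.00" style (assumed format)
raw_price <- c("$120.00", "$1,250.00")

# Removing only the dollar sign leaves the comma behind, so the second
# value cannot be converted and becomes NA
suppressWarnings(as.numeric(sub("\\$", "", raw_price)))

# parse_number() drops the currency symbol and the grouping comma
parse_number(raw_price)
```

After reading any file, it is worth checking the column types (for example with glimpse(df)) to confirm that numeric columns were not read as text.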

Grouping / Summarizing

One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean price for each neighborhood, we could do so like this:

df_avg <- df %>%
  group_by(neighbourhood) %>%
  summarize(mean_price = mean(price))

df_avg

neighbourhood	mean_price
Allston	141.0735
Back Bay	344.7915
Bay Village	286.5273
Beacon Hill	228.3977
Brighton	147.1414
Charlestown	278.5063
Chinatown	298.3535
Dorchester	155.9565
Downtown	343.8106
East Boston	174.5811
Fenway	284.2260
Hyde Park	117.7742
Jamaica Plain	202.7200
Leather District	244.8571
Longwood Medical Area	99.0000
Mattapan	158.6000
Mission Hill	149.1000
North End	323.6129
Roslindale	187.6486
Roxbury	150.9072
South Boston	239.2228
South Boston Waterfront	318.8689
South End	265.6985
West End	256.4783
West Roxbury	226.8431
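A mean is easier to interpret next to the number of listings behind it. The following sketch runs on a small sample built from the six rows printed by head(df) above (sample_listings and sample_avg are hypothetical names), adds a count with n(), and sorts the result:

```r
library(dplyr)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  neighbourhood = c("East Boston", "Roxbury", "Roxbury",
                    "Beacon Hill", "Back Bay", "North End"),
  price = c(120, 139, 171, 93, 133, 139)
)

sample_avg <- sample_listings %>%
  group_by(neighbourhood) %>%
  summarize(
    mean_price = mean(price),  # average price per neighbourhood
    listings   = n()           # number of rows in each group
  ) %>%
  arrange(desc(mean_price))    # most expensive neighbourhoods first

sample_avg
```

The same pipeline should work on the full df, where the count helps flag neighbourhoods whose average rests on only a handful of listings.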

You can create multiple columns this way, too:

df_summed <- df %>%
  group_by(neighbourhood) %>%
  summarize(
    mean_price = mean(price),
    max_price = max(price),
    median_price = median(price)
  )

df_summed

neighbourhood	mean_price	max_price	median_price
Allston	141.0735	2701	96.0
Back Bay	344.7915	1500	275.0
Bay Village	286.5273	1200	192.0
Beacon Hill	228.3977	950	170.0
Brighton	147.1414	1547	111.0
Charlestown	278.5063	950	235.0
Chinatown	298.3535	673	278.0
Dorchester	155.9565	2793	99.0
Downtown	343.8106	3999	269.0
East Boston	174.5811	800	152.0
Fenway	284.2260	1461	237.0
Hyde Park	117.7742	292	90.5
Jamaica Plain	202.7200	1746	150.0
Leather District	244.8571	400	245.0
Longwood Medical Area	99.0000	123	114.0
Mattapan	158.6000	514	103.0
Mission Hill	149.1000	359	149.5
North End	323.6129	2200	196.5
Roslindale	187.6486	1000	150.5
Roxbury	150.9072	1270	105.0
South Boston	239.2228	1109	196.0
South Boston Waterfront	318.8689	999	275.0
South End	265.6985	1200	213.0
West End	256.4783	885	242.0
West Roxbury	226.8431	1600	113.0

You can even group by multiple variables:

df_summed <- df %>%
  group_by(neighbourhood, room_type) %>%
  summarize(
    mean_price = mean(price),
    max_price = max(price),
    median_price = median(price)
  )

head(df_summed)

neighbourhood	room_type	mean_price	max_price	median_price
Allston	Entire home/apt	244.77922	2701	172
Allston	Private room	78.57937	250	76
Allston	Shared room	30.00000	30	30
Back Bay	Entire home/apt	346.80952	1500	261
Back Bay	Hotel room	285.12500	348	326
Back Bay	Private room	343.43478	530	378
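When grouping by more than one variable, summarize() by default drops only the last grouping level, so the result above is still grouped by neighbourhood and dplyr prints a message saying so. If an ungrouped result is wanted, the .groups argument can request it. A sketch on the same six sample rows (by_both is a hypothetical name):

```r
library(dplyr)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  neighbourhood = c("East Boston", "Roxbury", "Roxbury",
                    "Beacon Hill", "Back Bay", "North End"),
  room_type = rep("Entire home/apt", 6),
  price = c(120, 139, 171, 93, 133, 139)
)

by_both <- sample_listings %>%
  group_by(neighbourhood, room_type) %>%
  summarize(
    mean_price = mean(price),
    .groups = "drop"           # return a plain, ungrouped table
  )

is_grouped_df(by_both)         # check whether any grouping remains
```

Dropping the groups avoids surprises later, since verbs such as mutate() and arrange() behave differently on grouped data.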

Visualizing Data

Try running this code:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system initially described by Leland Wilkinson (see endnotes). There are a lot of nuances to explore in Wilkinson’s book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:

data
a coordinate system
geometry to display the data

We can see this in action, starting with the data alone:

ggplot(df)

It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(df, aes(x = price, y = number_of_reviews))

Fantastic! The data is now mapped onto a 2D plane (the most we can really have in ggplot2, which has no “traditional” 3D visualizations). Every row of our dataset corresponds to some position on this plane, but we still need to decide how those positions will be rendered. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot with geom_point() to show where the data falls on this plane.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point()

The points are just one way of rendering this. We could do it another way if we wanted to.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_text(aes(label = host_name))

Or we could try this:

ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_boxplot()

Or this:

ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_col()

ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_col(aes(color = room_type))
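One thing to keep in mind with geom_col() on raw rows: it draws one bar segment per listing and stacks them, so the bar height is the total of all prices in a neighbourhood rather than a typical price. A sketch of the difference, using the six rows printed by head(df) above (p_total and p_mean are hypothetical names):

```r
library(dplyr)
library(ggplot2)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  neighbourhood = c("East Boston", "Roxbury", "Roxbury",
                    "Beacon Hill", "Back Bay", "North End"),
  price = c(120, 139, 171, 93, 133, 139)
)

# Option 1: raw rows are stacked, so each bar shows the summed price
p_total <- ggplot(sample_listings, aes(x = neighbourhood, y = price)) +
  geom_col()

# Option 2: summarize first so each bar shows one value per neighbourhood
p_mean <- sample_listings %>%
  group_by(neighbourhood) %>%
  summarize(mean_price = mean(price)) %>%
  ggplot(aes(x = neighbourhood, y = mean_price)) +
  geom_col()

p_total
p_mean
```

For bars, color sets the outline while fill sets the interior, so fill = room_type is usually the more visible choice when encoding a variable.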

Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, a technique called “small multiples” (“facets” in ggplot2) allows us to show this data along other dimensions. For instance, we can separate things out by type of room:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(room_type~.)

Activity

Using what you know so far, create a plot that lays out price and number of reviews by neighborhood.

Solution

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(neighbourhood~.)

We could also use facet_wrap(), which wraps the panels into a grid instead of stacking them in a single column:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_wrap(~neighbourhood)
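facet_wrap() takes a few arguments that help when there are many panels. The sketch below, on the six rows printed by head(df) above, sets the number of columns with ncol and lets each panel have its own Y axis with scales = "free_y". The specific values are illustrative choices, not recommendations:

```r
library(ggplot2)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  neighbourhood = c("East Boston", "Roxbury", "Roxbury",
                    "Beacon Hill", "Back Bay", "North End"),
  price = c(120, 139, 171, 93, 133, 139),
  number_of_reviews = c(24, 118, 124, 26, 5, 2)
)

p <- ggplot(sample_listings, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_wrap(
    ~neighbourhood,
    ncol = 3,            # lay the panels out in three columns
    scales = "free_y"    # each panel gets its own Y axis range
  )

p
```

Free scales make small panels easier to read, at the cost of making panels harder to compare with one another.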

Adding color

We can also encode variables as color, shape, or size. Let’s try coloring the points of df by type of room.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type))

What if I want to make all the points blue?

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = "blue"))

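That doesn’t work: the points come out in ggplot2’s default color, with a legend entry labeled “blue”. This happens because we set the color inside of the aes() function. We use aes() when we’re referring to something inside the data, such as price, neighborhood, or type of room, so ggplot2 treats "blue" as a data value rather than as a color. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().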

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(color = "blue")
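To see what ggplot2 actually does with each version, we can inspect the values it computes for drawing. This sketch uses layer_data() from ggplot2 on the six rows printed by head(df) above; the names mapped and fixed are hypothetical:

```r
library(ggplot2)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  price = c(120, 139, 171, 93, 133, 139),
  number_of_reviews = c(24, 118, 124, 26, 5, 2)
)

# Inside aes(): "blue" is treated as a data value and run through a color scale
mapped <- ggplot(sample_listings, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = "blue"))

# Outside aes(): "blue" is used literally as the point color
fixed <- ggplot(sample_listings, aes(x = price, y = number_of_reviews)) +
  geom_point(color = "blue")

# layer_data() returns the values ggplot2 computed for drawing a layer
unique(layer_data(mapped)$colour)
unique(layer_data(fixed)$colour)
```

The first result is a color picked by the default palette, while the second is the literal string "blue".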

Activity

Try creating a bar plot that shows the median price by neighborhood, using the grouping/summarizing logic we used before. Try coloring the bars in Harvard crimson (#A41034).

Solution

df %>%
  group_by(neighbourhood) %>%
  summarize(median_price = median(price, na.rm = TRUE)) %>%
  ggplot(aes(x = median_price, y = reorder(neighbourhood, median_price))) +
  geom_col(fill = "#A41034")

Scales

Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.

We can use a logarithmic X axis to make this image clearer:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  scale_x_log10()
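A log scale spaces values by ratio rather than by difference, which spreads out the crowded low-price listings. The sketch below shows the underlying arithmetic and then adds dollar-formatted axis labels with scales::label_dollar(), an optional extra that is not in the original code. It runs on the six rows printed by head(df) above:

```r
library(ggplot2)

# Each tenfold increase in price moves the same distance along a log axis
log10(c(10, 100, 1000))

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  price = c(120, 139, 171, 93, 133, 139),
  number_of_reviews = c(24, 118, 124, 26, 5, 2)
)

p <- ggplot(sample_listings, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  scale_x_log10(labels = scales::label_dollar())  # ticks shown as dollar amounts

p
```

Keep in mind that a price of 0 has no position on a log axis, so any such rows would be dropped from the plot with a warning.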

Adding more geometries

We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to show a regression line on the plot:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10()

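Encodings can be inherited from the top-line ggplot() function, so if we wanted the geom_smooth() layer to have the same color as the points, we could do so like this. Note that because color now applies to every layer, geom_smooth() fits a separate line for each room type: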

ggplot(df, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point(aes(size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10()
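Inheritance can also be overridden for a single layer. If we want colored points but one overall trend line, one option is to set group = 1 and a fixed color on the smoothing layer. This is a sketch on the six rows printed by head(df) above; because that sample contains only one room type, neighbourhood stands in as the color variable here:

```r
library(ggplot2)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  neighbourhood = c("East Boston", "Roxbury", "Roxbury",
                    "Beacon Hill", "Back Bay", "North End"),
  price = c(120, 139, 171, 93, 133, 139),
  number_of_reviews = c(24, 118, 124, 26, 5, 2)
)

p <- ggplot(sample_listings,
            aes(x = price, y = number_of_reviews, color = neighbourhood)) +
  geom_point() +
  geom_smooth(
    aes(group = 1),      # treat all rows as one group for the fit
    method = "lm",
    color = "black",     # fixed color overrides the inherited mapping
    se = FALSE           # hide the confidence band
  ) +
  scale_x_log10()

p
```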

Labels

We can use labs() to set the labels for just about anything.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  )

Themes

ggplot2 has some built-in themes that can improve your charts:

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  ) +
  theme_bw()
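As the plot code grows, it can help to store the plot in a variable and add to it, rather than repeating every line. The sketch below (base_plot is a hypothetical name) tries two built-in themes on the same plot and adjusts one detail with theme(), using the six rows printed by head(df) above:

```r
library(ggplot2)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  neighbourhood = c("East Boston", "Roxbury", "Roxbury",
                    "Beacon Hill", "Back Bay", "North End"),
  price = c(120, 139, 171, 93, 133, 139),
  number_of_reviews = c(24, 118, 124, 26, 5, 2)
)

# Build the plot once and keep it in a variable
base_plot <- ggplot(sample_listings, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = neighbourhood))

# Try different built-in themes without retyping the plot
base_plot + theme_bw()

# theme() adjusts individual elements on top of a complete theme
themed <- base_plot +
  theme_minimal() +
  theme(legend.position = "bottom")

themed
```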

Saving an Image

We can use the ggsave() function to save images that we’ve created as local files.

ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  ) +
  theme_bw()

ggsave("images/airbnb.png")
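By default ggsave() saves the last plot that was displayed, at the size of the current graphics device. For reproducible output it can be safer to pass the plot and its dimensions explicitly. In this sketch the file name airbnb_sample.png and the dimensions are illustrative choices, and the data is the six rows printed by head(df) above:

```r
library(ggplot2)

# Six rows copied from the head(df) output earlier in this document
sample_listings <- data.frame(
  price = c(120, 139, 171, 93, 133, 139),
  number_of_reviews = c(24, 118, 124, 26, 5, 2)
)

p <- ggplot(sample_listings, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  theme_bw()

# Make sure the output folder exists before saving into it
dir.create("images", showWarnings = FALSE)

ggsave(
  "images/airbnb_sample.png",
  plot = p,       # save this plot rather than whatever was drawn last
  width = 8,      # inches by default
  height = 6,
  dpi = 300       # resolution for raster formats such as PNG
)
```

The file format is taken from the extension, so changing the name to end in .pdf or .svg should produce a vector file instead, provided the required graphics device is available.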

References

Chang, Winston. n.d. R Graphics Cookbook, 2nd Edition. Accessed August 21, 2021. https://r-graphics.org.

“Get the Data.” n.d. Inside Airbnb. Accessed October 10, 2023. http://insideairbnb.com/get-the-data/.

R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://dx.doi.org/10.18637/jss.v059.i10.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol, CA: O’Reilly Media.