RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.
RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, you can do the following:
RStudio > Preferences (Mac)
Tools > Options (Windows)
There are a lot of other features to take advantage of in RStudio, and many of them are bound to keyboard shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K. You can also find a handy cheat sheet here.
You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.
new_int <- 4
new_int
[1] 4
Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).
cos(new_int)
[1] -0.6536436
cos(4)
[1] -0.6536436
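As a small illustration of the point above, the sketch below stores a value, uses it in a calculation, and then reassigns it. The names `radius` and `area` are made up for this example and are not part of the workshop data.

```r
# Hypothetical names, used only for illustration
radius <- 2
area <- pi * radius^2      # uses the current value of radius

# Reassigning radius does not update area automatically
radius <- 3
area_is_stale <- isTRUE(all.equal(area, pi * 2^2))

# Recompute to pick up the new value of radius
area <- pi * radius^2
```

A variable holds the value it was given at assignment time, so anything derived from it has to be recomputed after the variable changes.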
You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions comprise our working environment, data that R has held in our active memory. This environment isn’t necessarily persistent, so it doesn’t last between R sessions.
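One way to keep an object beyond the current session is to write it to disk yourself. The sketch below is only an illustration: it uses base R's `saveRDS()` and `readRDS()` with a temporary file, and the names `rds_path` and `restored_int` are assumptions made for the example.

```r
new_int <- 4

# ls() lists the names of the objects in the current environment
has_new_int <- "new_int" %in% ls()

# Save one object to disk so it can outlive the session
# (a temporary file is used here; choose a real path in practice)
rds_path <- tempfile(fileext = ".rds")
saveRDS(new_int, rds_path)

# rm() removes the object from the environment...
rm(new_int)

# ...and readRDS() reads the saved value back in
restored_int <- readRDS(rds_path)
```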
People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:
install.packages(c("tidyverse", "leaflet"))
library(tidyverse)
We’ll be using data from Inside Airbnb, specifically listings in the city of Boston. The original data can be found here.
df <- read_csv("./data/original/listings.csv") %>%
  filter(price < 9999)
detail <- read_csv("./data/original/listings_detail.csv") %>%
  mutate(price = as.numeric(str_remove(price, "\\$")))
head(df)
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3781 | Rental unit in Boston · ★4.96 · 1 bedroom · 1 bed · 1 bath | 4804 | Frank | NA | East Boston | 42.36413 | -71.02991 | Entire home/apt | 120 | 32 | 24 | 2022-09-05 | 0.25 | 1 | 224 | 1 | NA |
| 5506 | Guest suite in Boston · ★4.79 · 1 bedroom · 1 bed · 1 bath | 8229 | Terry | NA | Roxbury | 42.32844 | -71.09581 | Entire home/apt | 139 | 3 | 118 | 2022-12-05 | 0.68 | 10 | 79 | 8 | Approved by the government |
| 6695 | Condo in Boston · ★4.80 · Studio · 2 beds · 1 bath | 8229 | Terry | NA | Roxbury | 42.32802 | -71.09387 | Entire home/apt | 171 | 3 | 124 | 2023-03-26 | 0.73 | 10 | 71 | 8 | STR446650 |
| 8789 | Rental unit in Boston · ★4.65 · 1 bedroom · 1 bed · 1 bath | 26988 | Anne | NA | Beacon Hill | 42.35867 | -71.06307 | Entire home/apt | 93 | 91 | 26 | 2023-05-12 | 0.24 | 8 | 186 | 1 | NA |
| 10813 | Rental unit in Boston · ★5.0 · Studio · 1 bed · 1 bath | 38997 | Michelle | NA | Back Bay | 42.35061 | -71.08787 | Entire home/apt | 133 | 29 | 5 | 2020-12-02 | 0.06 | 11 | 365 | 0 | NA |
| 10986 | Condo in Boston · Studio · 1 bed · 1 bath | 38997 | Michelle | NA | North End | 42.36377 | -71.05206 | Entire home/apt | 139 | 33 | 2 | 2016-05-23 | 0.02 | 11 | 365 | 0 | NA |
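The `mutate()` step used on `detail` above can be checked on a toy column before running it on the real file. The sketch below assumes prices are stored as text such as "$1,200.00"; that is an assumption about the file, not something shown in the output above, and the toy values are invented. It compares removing only the dollar sign with `readr::parse_number()`, which is swapped in here as an alternative rather than taken from the workshop code.

```r
library(dplyr)
library(stringr)
library(readr)

# Toy data; the values are invented for illustration
toy <- data.frame(
  price = c("$85.00", "$1,200.00"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(
    # Removing only the dollar sign leaves the comma, so as.numeric() gives NA
    price_simple = suppressWarnings(as.numeric(str_remove(price, "\\$"))),
    # parse_number() drops the dollar sign and the grouping comma
    price_parsed = parse_number(price)
  )
```

If the real column contains values of 1,000 or more written with a comma, the first approach would silently turn them into `NA`, so it may be worth counting missing values after the conversion.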
One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean price for each neighborhood, we could do so like this:
df_avg <- df %>%
  group_by(neighbourhood) %>%
  summarize(mean_price = mean(price))
df_avg
| neighbourhood | mean_price |
|---|---|
| Allston | 141.0735 |
| Back Bay | 344.7915 |
| Bay Village | 286.5273 |
| Beacon Hill | 228.3977 |
| Brighton | 147.1414 |
| Charlestown | 278.5063 |
| Chinatown | 298.3535 |
| Dorchester | 155.9565 |
| Downtown | 343.8106 |
| East Boston | 174.5811 |
| Fenway | 284.2260 |
| Hyde Park | 117.7742 |
| Jamaica Plain | 202.7200 |
| Leather District | 244.8571 |
| Longwood Medical Area | 99.0000 |
| Mattapan | 158.6000 |
| Mission Hill | 149.1000 |
| North End | 323.6129 |
| Roslindale | 187.6486 |
| Roxbury | 150.9072 |
| South Boston | 239.2228 |
| South Boston Waterfront | 318.8689 |
| South End | 265.6985 |
| West End | 256.4783 |
| West Roxbury | 226.8431 |
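To see what `group_by()` and `summarize()` are doing, it can help to run them on a table small enough to check by hand. The sketch below uses an invented four-row data frame (the neighbourhood names "A" and "B" and the prices are made up) and also shows how a missing price affects `mean()`.

```r
library(dplyr)

# Toy data; the values are invented for illustration
toy <- data.frame(
  neighbourhood = c("A", "A", "B", "B"),
  price = c(100, 200, 50, NA),
  stringsAsFactors = FALSE
)

toy_avg <- toy %>%
  group_by(neighbourhood) %>%
  summarize(
    # mean() returns NA if any value in the group is missing
    mean_price = mean(price),
    # na.rm = TRUE drops missing values before averaging
    mean_price_no_na = mean(price, na.rm = TRUE)
  )
```

Each group collapses to one row: group "A" averages 100 and 200, while group "B" only gets a usable mean once the missing value is removed.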
You can create multiple columns this way, too:
df_summed <- df %>%
  group_by(neighbourhood) %>%
  summarize(
    mean_price = mean(price),
    max_price = max(price),
    median_price = median(price)
  )
df_summed
| neighbourhood | mean_price | max_price | median_price |
|---|---|---|---|
| Allston | 141.0735 | 2701 | 96.0 |
| Back Bay | 344.7915 | 1500 | 275.0 |
| Bay Village | 286.5273 | 1200 | 192.0 |
| Beacon Hill | 228.3977 | 950 | 170.0 |
| Brighton | 147.1414 | 1547 | 111.0 |
| Charlestown | 278.5063 | 950 | 235.0 |
| Chinatown | 298.3535 | 673 | 278.0 |
| Dorchester | 155.9565 | 2793 | 99.0 |
| Downtown | 343.8106 | 3999 | 269.0 |
| East Boston | 174.5811 | 800 | 152.0 |
| Fenway | 284.2260 | 1461 | 237.0 |
| Hyde Park | 117.7742 | 292 | 90.5 |
| Jamaica Plain | 202.7200 | 1746 | 150.0 |
| Leather District | 244.8571 | 400 | 245.0 |
| Longwood Medical Area | 99.0000 | 123 | 114.0 |
| Mattapan | 158.6000 | 514 | 103.0 |
| Mission Hill | 149.1000 | 359 | 149.5 |
| North End | 323.6129 | 2200 | 196.5 |
| Roslindale | 187.6486 | 1000 | 150.5 |
| Roxbury | 150.9072 | 1270 | 105.0 |
| South Boston | 239.2228 | 1109 | 196.0 |
| South Boston Waterfront | 318.8689 | 999 | 275.0 |
| South End | 265.6985 | 1200 | 213.0 |
| West End | 256.4783 | 885 | 242.0 |
| West Roxbury | 226.8431 | 1600 | 113.0 |
You can even group by multiple variables:
df_summed <- df %>%
  group_by(neighbourhood, room_type) %>%
  summarize(
    mean_price = mean(price),
    max_price = max(price),
    median_price = median(price)
  )
head(df_summed)
| neighbourhood | room_type | mean_price | max_price | median_price |
|---|---|---|---|---|
| Allston | Entire home/apt | 244.77922 | 2701 | 172 |
| Allston | Private room | 78.57937 | 250 | 76 |
| Allston | Shared room | 30.00000 | 30 | 30 |
| Back Bay | Entire home/apt | 346.80952 | 1500 | 261 |
| Back Bay | Hotel room | 285.12500 | 348 | 326 |
| Back Bay | Private room | 343.43478 | 530 | 378 |
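When grouping by more than one variable, `summarize()` by default removes only the last grouping level, so the result is still grouped by the first variable. The sketch below illustrates this on invented toy data and uses the `.groups = "drop"` argument to return an ungrouped table; the object names are assumptions made for the example.

```r
library(dplyr)

# Toy data; the values are invented for illustration
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B"),
  room_type = c("Private room", "Private room", "Entire home/apt", "Private room"),
  price = c(60, 80, 200, 50),
  stringsAsFactors = FALSE
)

# Default behaviour: the result stays grouped by neighbourhood
still_grouped <- toy %>%
  group_by(neighbourhood, room_type) %>%
  summarize(mean_price = mean(price))

# .groups = "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(neighbourhood, room_type) %>%
  summarize(mean_price = mean(price), .groups = "drop")

# group_vars() reports which grouping variables remain
remaining_groups <- group_vars(still_grouped)
```

Dropping the groups explicitly can avoid surprises when the summarized table is passed on to later `mutate()` or `summarize()` calls.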
Try running this code:
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point()
ggplot2 is built on the idea of the Grammar of Graphics, a system initially described by Leland Wilkinson (see endnotes). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components: the data, a coordinate system, and the geometries (geoms) that render the data.
We can see this in action:
ggplot(df)
It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.
ggplot(df, aes(x = price, y = number_of_reviews))
Fantastic! The data is now mapped onto a 2D plane (the most we can really have in ggplot2 - there are no “traditional” 3D visualizations for it). For every row of our dataset, there is a point on this plane where its price and its number of reviews intersect. We still need to decide how those points will be rendered, though. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot containing geom_point() to show where the data exists on this plane.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point()
The points are just one way of rendering this. We could do it another way if we wanted to.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_text(aes(label = host_name))
Or we could try this:
ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_boxplot()
Or this:
ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_col()
ggplot(df, aes(x = neighbourhood, y = price)) +
  geom_col(aes(color = room_type))
Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, there’s a technique called “small multiples” (“facets” in ggplot2) that allows us to show this data in other dimensions. For instance, we can separate things out by type of room.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(room_type ~ .)
Using what you know so far, create a plot that lays out price and number of reviews by neighborhood.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_grid(neighbourhood ~ .)
We could also write this as:
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  facet_wrap(~neighbourhood)
We can also encode variables as color, shape, or size. Let’s try coloring the points of df by type of room.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type))
What if I want to make all the points blue?
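ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = "blue"))
That doesn’t work: the points are not drawn in blue, and a legend labeled “blue” appears. It’s because we set the color inside of the aes() function, so ggplot2 treats "blue" as a data value to be mapped onto its default color scale rather than as the name of a color. We use aes() when we’re referring to something inside the data, such as price, neighborhood, or type of room. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().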
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(color = "blue")
Try creating a bar plot that shows the median price by neighborhood, using the grouping/summarizing logic we used before. Try coloring the bars in Harvard crimson (#A41034).
df %>%
  group_by(neighbourhood) %>%
  summarize(median_price = median(price, na.rm = TRUE)) %>%
  ggplot(aes(x = median_price, y = reorder(neighbourhood, median_price))) +
  geom_col(fill = "#A41034")
Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.
We can use a logarithmic X axis to make this image clearer:
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  scale_x_log10()
We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to this to show a regression line on the plot:
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10()
Encodings can be inherited from the top-line ggplot() function, so if we wanted the geom_smooth() layer to have the same color as the points, we could do so like this:
ggplot(df, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point(aes(size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10()
We can use labs() to set the labels for just about anything.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  )
ggplot2 has some built-in themes that can improve your charts.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  ) +
  theme_bw()
We can use the ggsave() function to save images that we’ve created as local files.
ggplot(df, aes(x = price, y = number_of_reviews)) +
  geom_point(aes(color = room_type, size = calculated_host_listings_count)) +
  geom_smooth(method = lm) +
  scale_x_log10() +
  facet_wrap(~neighbourhood) +
  labs(
    title = "Price by Number of Reviews on Airbnb",
    subtitle = "Boston, MA",
    x = "Price (USD)",
    y = "Number of Reviews",
    color = "Type of Room",
    size = "Number of Listings by\nSame Host",
    caption = "Data from insideairbnb.com"
  ) +
  theme_bw()
ggsave("images/airbnb.png")
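By default `ggsave()` saves the last plot that was displayed, at the size of the current graphics device. A somewhat more explicit variant is sketched below, assuming you would rather store the plot in a variable and set the output size yourself. The toy data, the object names (`p`, `out_dir`, `out_file`) and the chosen dimensions are all assumptions made for this example.

```r
library(ggplot2)

# Toy data; the values are invented for illustration
toy <- data.frame(
  price = c(50, 100, 200, 400),
  number_of_reviews = c(40, 25, 10, 3)
)

# Store the plot in a variable instead of relying on the last plot drawn
p <- ggplot(toy, aes(x = price, y = number_of_reviews)) +
  geom_point() +
  scale_x_log10() +
  theme_bw()

# Make sure the output folder exists before saving
# (a temporary folder is used here; use your project folder in practice)
out_dir <- file.path(tempdir(), "images")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "airbnb_toy.png")

# Pass the plot explicitly and fix the size and resolution of the image
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```

Setting `width`, `height`, and `dpi` explicitly makes the saved image reproducible regardless of how the plot pane happens to be sized in RStudio.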