Data Visualization and Geospatial Analysis With R

Data Visualization in R (2021)


Statistical computing is essential for scientific inquiry, discovery, and storytelling. With R, there are endless possibilities for assembling, transforming, querying, analyzing, and ultimately visualizing data. In this workshop, we will give you the tools to get you started.


Course level: Moderate. Prior experience in R required.

  1. The code was tested on R version 4.1.1 (Kick Things) and version 4.0.3 (Bunny-Wunnies Freak Out). To update R, see the supplementary section.

  2. Mac users: some of the packages in this workshop require an application called ‘XQuartz’ on your laptop. Download it from: https://www.xquartz.org/


CHAPTER 1. Operators and data types

1.1. Basic operators

In this section, we will learn about some basic R operators that are used to perform operations on values and variables. Some of the most commonly used operators are shown below.

# Sum
2+4+7  
## [1] 13
# Order of operations
1/2*3+4-5
## [1] 0.5
1/2*(3+4-5)
## [1] 1
1/(2*(3+4-5))
## [1] 0.25
1/(2*3+4-5) 
## [1] 0.2
# Notice how output changes with the placement of operators

# Other operators:
2^3                            # Raised to the power of ...
## [1] 8
log(10)                        # Natural logarithm (base e)
## [1] 2.302585
sqrt(4)                        # Square root function
## [1] 2
pi                             # pi or 3.14
## [1] 3.141593
# Clear the Environment
rm(list = ls())
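
A few more operators are worth trying alongside the ones above. The short sketch below is an addition to the workshop code (the variable names are only for illustration). It shows that exponentiation is evaluated before a leading minus sign, how integer division and the remainder work, and how to take logarithms in bases other than e.

```r
# Exponentiation binds tighter than the unary minus
neg_pow = -2^2                 # Evaluated as -(2^2)
neg_pow
## [1] -4
pos_pow = (-2)^2               # Parentheses change the result
pos_pow
## [1] 4

# Integer division and remainder (modulo)
quotient  = 7 %/% 2
quotient
## [1] 3
remainder = 7 %% 2
remainder
## [1] 1

# log() is the natural logarithm; use "base" (or log10, log2) for other bases
log_b10 = log(100, base = 10)
log10(100)
## [1] 2
```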

1.2. Basic data operations

In this section, we will create some vector data and apply built-in operations to examine the properties of a dataset.

# The assignment operator in R is "<-" or "=" 

# Generate sample data
data <- c(1, 4, 2, 3, 9)

# rbind combines data by rows, and hence "r"-bind
# cbind combines data by columns, and hence "c"-bind

# Checking the properties of a dataset. 
# Note: the na.rm argument ignores NA values in the dataset.
data = rbind(1, 4, 2, 3, 9) 
dim(data)                       # [5,1]: 5 rows, 1 column
## [1] 5 1
data[2,1]                       # Show the value in row 2, column 1
## [1] 4
data[c(2:5), 1]                 # Show a range of values in column 1
## [1] 4 2 3 9
mean(data, na.rm = T)           # Mean
## [1] 3.8
max(data)                       # Maximum
## [1] 9
min(data)                       # Minimum
## [1] 1
sd(data)                        # Standard deviation
## [1] 3.114482
var(data)                       # Variance
##      [,1]
## [1,]  9.7
summary(data) 
##        V1     
##  Min.   :1.0  
##  1st Qu.:2.0  
##  Median :3.0  
##  Mean   :3.8  
##  3rd Qu.:4.0  
##  Max.   :9.0
str(data)                       # Prints structure of data
##  num [1:5, 1] 1 4 2 3 9
head(data, 6)                   # Returns the first 6 items in the object
##      [,1]
## [1,]    1
## [2,]    4
## [3,]    2
## [4,]    3
## [5,]    9
head(data, 2)                   # Print first 2
##      [,1]
## [1,]    1
## [2,]    4
tail(data, 2)                   # Print last 2
##      [,1]
## [4,]    3
## [5,]    9
# Do the same, but with "c()" instead of "rbind"
data = c(1, 4, 2, 3, 9) 
dim(data)                       # Note: dim is NULL
## NULL
length(data)                    # Length of a vector is its number of elements
## [1] 5
data[2]                         # This should give you 4 
## [1] 4
# Other operators work in the same way
mean(data)                      # Mean
## [1] 3.8
max(data)                       # Maximum
## [1] 9
min(data)                       # Minimum
## [1] 1
sd(data)                        # Standard deviation
## [1] 3.114482
var(data)                       # Variance
## [1] 9.7
# Text data
data=c("TAMU", "GEOS", "BAEN", "WMHS") 
data                            # View
## [1] "TAMU" "GEOS" "BAEN" "WMHS"
data[1]
## [1] "TAMU"
# Mixed data
data=c(1, "GEOS", 10, "WMHS")   # All data is treated as text if one value is text
data[3]                         # Note how output is in quotes i.e. "10"
## [1] "10"
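
The na.rm argument was mentioned above, but our sample data had no missing values. The sketch below is an addition with made-up values. It shows what happens when an NA is present, and how mixed data that was coerced to text can be converted back to numbers.

```r
# A vector with one missing value
x = c(1, 4, NA, 3, 9)

mean_with_na = mean(x)                      # The NA propagates to the result
mean_with_na
## [1] NA
mean_without_na = mean(x, na.rm = TRUE)     # (1 + 4 + 3 + 9) / 4
mean_without_na
## [1] 4.25
n_missing = sum(is.na(x))                   # Count the missing values
n_missing
## [1] 1

# Mixed data is stored as text; as.numeric() converts what it can
mixed = c(1, "GEOS", 10, "WMHS")
nums = suppressWarnings(as.numeric(mixed))  # Non-numeric text becomes NA
nums
## [1]  1 NA 10 NA
```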

1.3. Data types

In R, data can be stored as an “array” with one, two, or more dimensions. A 1-D array is called a “vector” and a 2-D array is a “matrix”. A table in R is called a “data frame”, and a “list” is a container that can hold a variety of data types. In this section, we will learn how to create matrices, lists, and data frames in R.

# Let's make a small matrix
test_mat = matrix( c(2, 4, 3, 1, 5, 7),    # The data elements 
  nrow = 2,                                # Number of rows 
  ncol = 3,                                # Number of columns 
  byrow = TRUE)                            # Fill matrix by rows 

test_mat = matrix(c(2, 4, 3, 1, 5, 7), 
                  nrow = 2, ncol = 3, 
                  byrow = TRUE)            # Same result 
test_mat
##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7
test_mat[ ,2]                              # Display all rows, and second column
## [1] 4 5
test_mat[2, ]                              # Display second row, all columns
## [1] 1 5 7
# Types of datasets
out = as.matrix(test_mat)
out                                        # This is a matrix
##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7
out = as.array(test_mat)
out                                        # This is also a matrix
##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7
out = as.vector(test_mat)
out                                        # This is just a vector
## [1] 2 1 4 5 3 7
# Data frame and list
data1 = runif(50, 20, 30)                   # Create 50 random numbers between 20 and 30  
data2 = runif(50, 0, 10)                    # Create 50 random numbers between 0 and 10  

# Lists
out = list()                                # Create an empty list
out[[1]] = data1                            # Notice the brackets "[[ ]]" instead of "[ ]"
out[[2]] = data2
out[[1]]                                    # Contains data1 at this location
##  [1] 20.36217 22.06699 29.43265 26.01856 24.25079 29.49264 23.32752 24.02397
##  [9] 25.69344 28.09800 25.07687 20.47548 26.19136 22.17593 21.10640 21.38896
## [17] 22.44622 28.94234 22.66411 29.69079 26.61345 26.60872 25.57228 21.99966
## [25] 25.82554 27.81743 22.76661 24.27930 29.22639 27.97143 28.31561 24.92463
## [33] 20.91709 28.94026 23.48239 26.09111 24.23019 27.26506 20.74990 21.91903
## [41] 26.70382 24.65537 27.42185 29.61954 20.90820 27.64590 29.21621 27.23648
## [49] 21.73714 24.40926
# Data frame
out = data.frame(x = data1, y = data2)

# Let's see how it looks!
plot(out$x, out$y)

plot(out[ ,1])
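
To make the difference between a list and a data frame more concrete, here is a small sketch. It is an addition; the names out_list and out_df are only for illustration. A list can hold elements of different types and lengths, while every column of a data frame must have the same number of rows.

```r
set.seed(1)                                 # Make the random numbers reproducible
data1 = runif(50, 20, 30)
data2 = runif(50, 0, 10)

# A named list can mix numeric vectors and text
out_list = list(x = data1, y = data2, label = "sample")
names(out_list)
## [1] "x"     "y"     "label"
class(out_list["x"])                        # Single brackets return a (smaller) list
## [1] "list"
class(out_list[["x"]])                      # Double brackets return the element itself
## [1] "numeric"

# A data frame is a table: equal-length columns, one row per observation
out_df = data.frame(x = data1, y = data2)
dim(out_df)                                 # 50 rows, 2 columns
## [1] 50  2
```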


CHAPTER 2. Plotting with base R


If you need to quickly visualize your data, base R has some functions that will help you do this in a pinch. In this section we’ll look at some basics of visualizing univariate and multivariate data.

2.1. Overview

# Create 50 random numbers between 0 and 100  
data = runif(50, 0, 100) 

# Over plotting means adding layers to a plot.
# The "plot" function initializes the plot.
# The "type" argument changes the plot type. 
plot(data)          

plot(data, type = "l")                       # "l" calls up a line plot

plot(data, type = "b")                       # "b" draws both points and lines

# Try options type = "o" and type = "c" as well.

# We can also quickly visualize box plots, histograms, and density plots using the same procedure
boxplot(data)                                # Box-and-whisker plot

hist(data)                                   # Histogram

plot(density(data))                          # Kernel density estimate 

2.2. Plotting univariate data

Let’s dig deeper into the plot function. Here, we will look at how to adjust the colors, shapes, and sizes for markers, axis labels and titles, and the plot title.

# Part 2.2.1. Line plots
plot(data, 
     type= "o", 
     col= "red",
     xlab = "x-axis title", 
     ylab = "y-axis title", 
     main = "My plot",                       # Name of axis labels and title
     cex.axis = 2, 
     cex.main = 2, 
     cex.lab = 2,                            # Size of axes, title and label
     pch= 23,                                # Change marker style
     bg= "red",                              # Change color of markers
     lty= 5,                                 # Change line style
     lwd= 2)                                 # Selecting line width 

# Adding legend
legend(1, 100, 
       legend = c("Data 1"),
       col= c("red"), 
       lty = 5,                                # Match the line style used in the plot
       cex = 1.2)

# Part 2.2.2. Histograms
hist(data,col = "red",
     xlab = "Number", 
     ylab = "Value", 
     main = "My plot",                       # Name of axis labels and title
     border = "blue")

# Try adjusting the parameters:
# hist(data,col="red",
#      xlab="Number",
#      ylab ="Value", 
#      main="My plot", 
#      cex.axis=2, 
#      cex.main=2,
#      cex.lab=2,            
#      border="blue", 
#      xlim=c(0,100),                        # Control the limits of the x-axis
#      las=0,                                # Try different values of las: 0,1,2,3 to rotate labels
#      breaks=5)                             # Try using 5, 20, 50, 100
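
When you experiment with breaks, it helps to look at the bins R actually chose. The sketch below is an addition. It calls hist with plot = FALSE so that only the computed bins are returned. A single number for breaks is treated as a suggestion, so the number of bins may differ from what you asked for; a vector of break points is used exactly.

```r
set.seed(1)
data = runif(50, 0, 100)

# A single number is only a suggestion for the number of bins
h5  = hist(data, breaks = 5,  plot = FALSE)
h20 = hist(data, breaks = 20, plot = FALSE)
length(h5$counts)                            # Number of bins actually used
length(h20$counts)
sum(h5$counts)                               # Every value falls in some bin
## [1] 50

# A vector of break points gives exact control over the bin edges
h_exact = hist(data, breaks = seq(0, 100, by = 25), plot = FALSE)
h_exact$breaks
## [1]   0  25  50  75 100
```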

2.3. Plotting multivariate data

Here, we introduce data frames, the equivalent of tables in R. A data frame is a table with a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

plot_data = data.frame(x = runif(50, 0, 10), 
                       y = runif(50, 20, 30), 
                       z = runif(50, 30, 40)) 

plot(plot_data$x, plot_data$y)              # Scatter plot of x and y data

# Mandatory beautification
plot(plot_data$x,plot_data$y, 
     xlab = "Data X", 
     ylab = "Data Y", 
     main  = "X vs Y plot",
     col = "darkred", 
     pch = 20, 
     cex = 1.5)                             # Scatter plot of x and y data

# Multiple lines on one axis
matplot(plot_data, 
        type = c("b"), 
        pch = 16, 
        col = 1:4) 

matplot(plot_data, 
        type = c("b", "l", "o"), 
        pch = 16, 
        col = 1:4)                          # Try this now. Any difference? 

legend("topleft", 
       legend = 1:3, 
       col = 1:3, 
       pch = 16)                            # Add legend at the top left (one entry per column of plot_data)

legend("top", 
       legend = 1:3, 
       col = 1:3, 
       pch = 16)                            # Add legend at the top center

legend("bottomright", 
       legend = 1:3, 
       col = 1:3, 
       pch = 16)                            # Add legend at the bottom right

2.4. Time series data

Working with time series data can be tricky at first, but here’s a quick look at how to generate a time series using the as.Date function.

date = seq(as.Date('2011-01-01'), 
           as.Date('2011-01-31'), 
           by = 1)                   # Generate a sequence of 31 days

data = runif(31, 0, 10)              # Generate 31 random values between 0 and 10

df = data.frame(Date = date, 
                Value = data)        # Combine the data in a data frame
plot(df, type= "o")
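
Date sequences do not have to be daily. The sketch below is an addition, with illustrative variable names. It builds a monthly sequence, and then uses format to label each day with its month so that the values can be averaged by month with tapply.

```r
# Twelve month starts, beginning in January 2011
monthly = seq(as.Date("2011-01-01"), by = "month", length.out = 12)
monthly[12]
## [1] "2011-12-01"

# Daily dates for the first three months of 2011
daily = seq(as.Date("2011-01-01"), as.Date("2011-03-31"), by = "day")
length(daily)                                # 31 + 28 + 31 days
## [1] 90

# Label each day with its year and month, then average by month
set.seed(1)
df = data.frame(Date = daily, Value = runif(length(daily), 0, 10))
month_id = format(df$Date, "%Y-%m")
monthly_mean = tapply(df$Value, month_id, mean)
names(monthly_mean)
## [1] "2011-01" "2011-02" "2011-03"
```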

2.5. Combining plots

You can build plots that contain subplots. Using base R, we start by calling the “par” function and then plot as we saw before.

par(mfrow = c(2, 2))                       # Call a plot with 4 quadrants

# Plot 1
matplot(plot_data, 
        type = c("b"), 
        pch = 16, 
        col = 1:4) 

# Plot 2
plot(plot_data$x, plot_data$y) 

# Plot 3
hist(data,col = "red",
     xlab = "Number", 
     ylab = "Value", 
     main = "My plot", 
     border = "blue") 

# Plot 4
plot(data,type = "o", 
     col = "red",
     xlab= "Number", 
     ylab = "Value", 
     main = "My plot",
     cex.axis= 2, 
     cex.main= 2, 
     cex.lab = 2, 
     pch= 23,   
     bg= "red", 
     lty= 5, 
     lwd= 2) 

# Alternatively, we can call up a plot using a matrix
# Plot 1 is plotted in the first two spots followed by plot 2 and 3 
matrix(c(1, 1, 2, 3), 
       2, 
       2, 
       byrow = TRUE)            
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    3
layout(matrix(c(1, 1, 2, 3), 
              2, 
              2, 
              byrow = TRUE))     # Fix plot layout 

# Plot 1
matplot(plot_data, 
        type = c("b"), 
        pch = 16, 
        col = 1:4)

# Plot 2
plot(plot_data$x, plot_data$y) 

# Plot 3
hist(data,col = "red",
     xlab = "Number", 
     ylab = "Value", 
     main = "My plot",
     border = "blue")
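
The layout function can also control the relative size of the rows and columns. The sketch below is an addition. It makes the top row twice as tall as the bottom row using the heights argument, and resets the layout afterwards so that later plots fill the whole window again. The pdf(NULL) line only opens an off-screen device so that the sketch runs anywhere; you can leave it and the final dev.off() out when working interactively.

```r
pdf(NULL)                                    # Off-screen device (no file is written)

# layout() returns the number of panels it created
n_panels = layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE),
                  heights = c(2, 1))         # Top row twice as tall as the bottom row
n_panels
## [1] 3

set.seed(1)
plot(runif(50, 0, 100), type = "l")          # Panel 1: spans the top row
hist(runif(50, 0, 100))                      # Panel 2: bottom left
boxplot(runif(50, 0, 100))                   # Panel 3: bottom right

layout(1)                                    # Reset to a single panel
invisible(dev.off())                         # Close the off-screen device
```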

2.6. Saving figures to disk

Plots can be saved as image files or a PDF. This is done by specifying the output file type, its size and resolution, then calling the plot.

# Tell R to write the next plot to a PNG file with the given size and resolution
png("awesome_plot.png", 
    width = 4, 
    height = 4, 
    units = "in", 
    res = 400) 

matplot(plot_data, 
        type = c("b", "l", "o"), 
        pch = 16, 
        col = 1:3)                           # One color per column of plot_data

legend("topleft", 
       legend = 1:3, 
       col = 1:3, 
       pch = 16)

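# dev.off() closes the graphics device and writes the image to disk.
# If later plots do not show up, a device is still open. Keep running dev.off()
# until you get the following: 
# Error in dev.off() : cannot shut down device 1 (the null device) 
# This ensures that we are no longer plotting to a file.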
dev.off()
## png 
##   2
# It looks like everything we just plotted was squeezed together too tightly. Let's change the size.

# Tell R to write the next plot to a PNG file with the given size and resolution
png("awesome_plot.png", 
    width = 6, 
    height = 4, 
    units = "in", 
    res = 400)                  # Note the change in dimensions
                              
matplot(plot_data, 
        type = c("b", "l", "o"), 
        pch = 16, 
        col = 1:3)  

legend("topleft", 
       legend = 1:3, 
       col = 1:3, 
       pch = 16)

dev.off() 
## png 
##   2
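
Saving to PDF works the same way, using the pdf function in place of png. The sketch below is an addition. It writes to R's temporary folder under a file name chosen only for illustration. Because a PDF is a vector format, the size is given in inches and no resolution is needed.

```r
set.seed(1)
plot_data = data.frame(x = runif(50, 0, 10), 
                       y = runif(50, 20, 30), 
                       z = runif(50, 30, 40))

# Illustrative file name in the temporary folder; use your own path instead
out_file = file.path(tempdir(), "awesome_plot.pdf")

pdf(out_file, width = 6, height = 4)         # Width and height are in inches
matplot(plot_data, 
        type = "b", 
        pch = 16, 
        col = 1:3)
legend("topleft", 
       legend = names(plot_data), 
       col = 1:3, 
       pch = 16)
invisible(dev.off())                         # Close the device to finish the file

file.exists(out_file)
## [1] TRUE
```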



Some useful resources

If you want to plot something a certain way and don’t know how to do it, the chances are that someone has asked that question before. Try a Google search for what you are trying to do and check out some of the forums. There are tons of material online. Here are some additional resources:

  1. The R Graph Gallery: https://www.r-graph-gallery.com/
  2. Graphical parameters: https://www.statmethods.net/advgraphs/parameters.html
  3. Plotting in R: https://www.harding.edu/fmccown/r/
  4. Histogram: https://www.r-bloggers.com/how-to-make-a-histogram-with-basic-r/
  5. Line plots: https://www.statmethods.net/graphs/line.html

CHAPTER 3. Plotting with ggplot2


3.1. Import libraries and create sample dataset

For this section, we will use the ggplot2, gridExtra, cowplot, utils, and tidyr packages. gridExtra and cowplot are used to combine ggplot objects into one plot, and utils and tidyr are useful for manipulating and reshaping the data. We will also install some packages here that are required for later sections. You will find more information in the sections to follow.

# Names of the packages used in this workshop
lib_names=c("ggplot2", "gridExtra", "utils", "tidyr", "cowplot", "plot3D", "leaflet", "maps", 
            "pdftools", "tm", "SnowballC", "wordcloud", "RColorBrewer", 
            "wordcloud2", "webshot", "htmlwidgets", "officer", "flextable", "rvg")

Install all necessary packages (Run once).

If you see the prompt “Do you want to restart R prior to installing?”, select No.

# install.packages() accepts a character vector of package names.
# "utils" ships with R, so it does not need to be installed.
install.packages(setdiff(lib_names, "utils"), 
                 repos = "https://cran.r-project.org")
# Load necessary packages
invisible(suppressMessages
         (suppressWarnings
           (lapply
             (lib_names, library, character.only = T))))
# Generate a dataset containing random numbers within specified ranges
Year = seq(1913, 2001, 1)
Jan = runif(89, -18.4, -3.2)
Feb = runif(89, -19.4, -1.2)
Mar = runif(89, -14, -1.8)
January = runif(89, 1, 86)
dat = data.frame(Year, Jan, Feb, Mar, January)

3.2. Basics of ggplot

Whereas base R has an “ink on paper” plotting paradigm, ggplot has a “grammar of graphics” paradigm that packages together a variety of plotting functions. With ggplot, you assign the result of a function to an object name and then modify it by adding additional functions. Think of it as adding layers using pre-designed functions rather than having to build those functions yourself, as you would have to do with base R.

# Note: a fixed color such as "blue" belongs outside aes(). Inside aes() it is
# treated as a data value and mapped to the default palette.
l1 = ggplot(data = dat, 
            aes(x = Year, y = Jan)) +                   # Tell which data to plot
            geom_line() +                               # Add a line
            geom_point() +                              # Add points
            xlab("Year") +                              # Add labels to the axes
            ylab("Value")

# Fixed aesthetics can be specified for any individual geometry
l1 + geom_line(linetype = "solid", color = "blue")      # Add a solid blue line

l1 + geom_line(aes(x = Year, y = January))              # Add a different data set

# There are tons of other built-in color scales and themes
# EXAMPLE: scale_color_grey(), 
#          scale_color_brewer(), 
#          theme_classic(), 
#          theme_minimal(),                    
#          theme_dark()

# OR, CREATE YOUR OWN THEME! You can group themes together in one list
theme1 = theme(
  legend.position = "none",
  panel.background = element_blank(),
  plot.title = element_text(hjust = 0.5),
  axis.line = element_line(color = "black"),
  axis.text.y = element_text(size = 11),
  axis.text.x = element_text(size = 11),
  axis.title.y = element_text(size = 11),
  axis.title.x  = element_text(size = 11),
  panel.border = element_rect(colour = "black", fill = NA, size = 0.5))
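
A custom theme is applied by adding it to a plot like any other layer. The sketch below is an addition. It uses a small made-up data frame and a shortened theme (dat_demo and theme_demo are illustrative names), and it sets the line and point color outside aes so that the color really is blue.

```r
library(ggplot2)

set.seed(1)
dat_demo = data.frame(Year = 1913:1922, 
                      Jan = runif(10, -18.4, -3.2))     # Made-up sample data

# A shortened version of a custom theme
theme_demo = theme(
  legend.position = "none",
  panel.background = element_blank(),
  axis.line = element_line(color = "black"))

p = ggplot(dat_demo, aes(x = Year, y = Jan)) +
      geom_line(color = "blue") +                       # Fixed color, set outside aes()
      geom_point(color = "blue") +
      theme_demo                                        # Apply the custom theme

# layer_data() shows the values ggplot will draw for a given layer
unique(layer_data(p, 1)$colour)
## [1] "blue"
```

Typing p at the console draws the plot.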

3.3. Multivariate plots

For multivariate data, ggplot takes the data in the form of groups. This means that each data row should be identifiable to a group. To get the most out of ggplot, we will need to reshape our dataset. There are generally two data formats: wide (horizontal) and long (vertical).

In the horizontal format, every column represents a category of the data.

In the vertical format, every row represents an observation for a particular category (think of each row as a data point).

Both formats have their comparative advantages. We will now convert the data frame we randomly generated in the previous section to the long format. Here are several ways to do this:

library(tidyr)

# Using the gather function (superseded by pivot_longer in current versions of tidyr)
dat2 = dat %>% gather(Month, Value, -Year, -January)

# Using pivot_longer and selecting all of the columns we want. This function is the best!
dat2 = dat %>% pivot_longer(cols = c(Jan, Feb, Mar), 
                            names_to = "Month", 
                            values_to = "Value") 

# Or we can choose to exclude the columns we don't want
dat2 = dat %>% pivot_longer(cols = -c(Year,January), 
                            names_to = "Month", 
                            values_to = "Value") 

head(dat2) # The data is now shaped in the long format
## # A tibble: 6 x 4
##    Year January Month  Value
##   <dbl>   <dbl> <chr>  <dbl>
## 1  1913    43.3 Jan    -7.43
## 2  1913    43.3 Feb   -19.3 
## 3  1913    43.3 Mar    -3.13
## 4  1914    79.1 Jan   -12.1 
## 5  1914    79.1 Feb   -16.7 
## 6  1914    79.1 Mar   -13.6
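
To see exactly what the reshaping does, it can help to try it on a table small enough to read at a glance. The sketch below is an addition with a made-up two-row table. It converts the table to the long format and then back to the wide format with pivot_wider, the counterpart of pivot_longer in tidyr.

```r
library(tidyr)

# A tiny wide table: one row per year, one column per month (made-up values)
wide = data.frame(Year = c(1913, 1914), 
                  Jan = c(-7.4, -12.1), 
                  Feb = c(-19.3, -16.7), 
                  Mar = c(-3.1, -13.6))

# Wide to long: 2 years x 3 months = 6 rows
long = pivot_longer(wide, 
                    cols = c(Jan, Feb, Mar), 
                    names_to = "Month", 
                    values_to = "Value")
dim(long)
## [1] 6 3

# Long back to wide: one column per value of Month
back = pivot_wider(long, 
                   names_from = Month, 
                   values_from = Value)
names(back)
## [1] "Year" "Jan"  "Feb"  "Mar"
```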

Line plot

# Line plot
l = ggplot(dat2, aes(x = Year, y = Value, group = Month)) +
       geom_line(aes(color = Month)) +
       geom_point(aes(color = Month))

l

Density plot

# Density plot
d = ggplot(dat2, aes(x = Value))
d = d + geom_density(aes(color = Month, fill = Month), alpha = 0.4) # Alpha specifies transparency
d

Histogram

# Histogram
h = ggplot(dat2, aes(x = Value))
h = h + geom_histogram(aes(fill = Month), alpha = 0.4,
                 color = "white",                                   # Fixed outline color for all bars
                 position = "dodge")
h

Grid plotting and saving files to disk

There are multiple ways to arrange multiple plots and save images. One method is using grid.arrange() which is found in the gridExtra package. You can then save the file using ggsave, which comes with the ggplot2 library.

# The plots can be displayed together on one image using 
# grid.arrange from the gridExtra package
img = gridExtra::grid.arrange(l, d, h, nrow = 3)

# Finally, plots created using ggplot can be saved using ggsave
ggsave("grid_plot_1.png", 
       plot = img, 
       device = "png", 
       width = 6, 
       height = 4, 
       units = c("in"), 
       dpi = 600)

Another approach is to use the plot_grid function, which is in the cowplot library. Notice how the axes are now beautifully aligned.

img2 = cowplot::plot_grid(l, d, h, nrow = 3, 
                          align = "v")    # "v" aligns vertical axes and "h" aligns horizontal axes

ggsave("grid_plot_2.png", 
       plot = img2, 
       device = "png", 
       width = 6, 
       height = 4, 
       units = c("in"), 
       dpi = 600)


Some useful resources

The links below offer a treasure trove of examples and sample code to get you started.

  1. The R Graph Gallery: https://www.r-graph-gallery.com/
  2. Line plots in ggplot2: http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization
  3. Top 50 visualizations with ggplot2: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
  4. Practical guide in ggplot2: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization

CHAPTER 4. Advanced plotting with R


4.1. 3-D plots in R

In this section, we will use the plot3D package in R to make some 3-dimensional plots. More information on 3D plotting can be accessed from: https://cran.r-project.org/web/packages/plot3D/plot3D.pdf

We can make plots with three axes (x, y and z) and we can also add a fourth dimension by color.

There are also several other packages in R for 3D plotting, such as scatterplot3d, rgl, and plotly, so do check them out while you’re making 3D plots for your data.

# Load libraries
library(plot3D)

# Generate data from a normal distribution
set.seed(123)
x = rnorm(n = 1000, mean = 50, sd = 10)
y = rnorm(n = 1000, mean = 0,  sd = 10)
z = rnorm(n = 1000, mean = 0,  sd = 10) 
val = rnorm(n = 1000, mean = 0,  sd = 10)
par(mar = c(2, 2, 2, 2))

# Create a basic 3D scatter plot
scatter3D(x, y, z, colvar = val)

# Full box
scatter3D(x, y, z, 
          bty = "f",               # Add a box and grid lines to the plot using the 'bty' argument
          colvar = NULL,           # Use "colvar = NULL" to avoid coloring by z variable
          col = "blue", 
          pch = 19, 
          cex = 0.5) 

# Gray background with white grid lines
scatter3D(x, y, z, 
          bty = "g", 
          colvar = NULL, 
          col = "blue", 
          pch = 19, 
          cex = 0.5)  

# Change the view direction
scatter3D(x, y, z, 
          phi = 0,                  # phi is the co-latitude
          theta = 45,               # theta specifies the azimuth rotation 
          bty ="g" ,
          ticktype = "detailed",    # ticktype "detailed" draws normal ticks and labels and "simple" draws arrows
          colvar = val,
          col = gg.col(100),        # Apply colors similar to ggplot
          pch = 18)  

# Create a 3D histogram
hist3D(z = matrix(x[1:25], 
                  nrow = 5, 
                  ncol = 5),        # Matrix containing values to be plotted
                  x = 1:5,          # Vector of length = nrow(z)
                  y = 1:5,          # Vector of length = ncol(z)
                  scale = FALSE,
                  expand = 0.1, 
                  bty = "g", 
                  phi = 20,
                  col = "lightblue", 
                  border = "black", 
                  shade = 0.2, 
                  space = 0.3, 
                  ltheta = 90,
                  ticktype = "detailed")

4.2. Interactive maps with Leaflet

Leaflet is a JavaScript library which is popular among many news websites to present geo-spatial data in an interactive window.

We will use the leaflet package to make some maps in R. These maps can be saved in HTML format and can be used to present your data in a more interactive way.

You can add points, shape files, or raster data to maps but we’ll just work with tabular data for this workshop.

More information is available at: https://rstudio.github.io/leaflet/

# Load libraries
library(leaflet)
library(maps)

Basemap

Leaflet draws basemaps from map tiles. OpenStreetMap tiles are used by default and are added to the map using addTiles().

We’ll center the default view on Scoates Hall.

leaflet() %>%
setView(lng = -96.338322, 
        lat = 30.618496, 
        zoom = 17)%>%    # Zoom level 17 is close enough to show individual buildings
addTiles()      # Other basemaps can be added using addProviderTiles(). Try addProviderTiles(providers$Esri.NatGeoWorldMap).


Adding labels

We’ll add some point locations of USGS stream gages on Buffalo Bayou to a data frame. textConnection() can be used to read data that is stored in an R source file rather than an external file.

gages <- read.csv(textConnection(
                        "Station,Lat,Long
                         Buffalo Bayou at Houston,29.76023,-95.40855
                         Buffalo Bayou at Piney Point,29.74690,-95.52355
                         Buffalo Bayou at W Belt Dr at Houston,29.76217,-95.55772
                         Buffalo Bayou nr Addicks,29.76190,-95.60578
                         Buffalo Bayou nr Katy,29.74329,-95.80689"),
                  strip.white = TRUE)        # Trim the indentation from the station names

# Now we can plot them on open street map as markers using "addMarkers"
leaflet(gages) %>%
addTiles() %>%
addMarkers(~Long, 
           ~Lat, 
           label = ~Station)


Marker Clusters

We can combine data in a cluster based on location. Let’s cluster the US cities dataset in the “maps” package as an example.

head(maps::us.cities)
##         name country.etc    pop   lat    long capital
## 1 Abilene TX          TX 113888 32.45  -99.74       0
## 2   Akron OH          OH 206634 41.08  -81.52       0
## 3 Alameda CA          CA  70069 37.77 -122.26       0
## 4  Albany GA          GA  75510 31.58  -84.18       0
## 5  Albany NY          NY  93576 42.67  -73.80       2
## 6  Albany OR          OR  45535 44.62 -123.09       0
leaflet(maps::us.cities) %>%
addTiles() %>%
addMarkers(~long, 
           ~lat, 
           clusterOptions = markerClusterOptions()) # clusterOptions is used to cluster large number of markers


Add circles

To plot circles sized by city population, we’ll first create the data frame using a text connection directly in the R source file.

texas_pop <- read.csv(textConnection(
                        "City, Lat, Long, Pop
                         Austin, 30.2672, -97.7431, 950715
                         Houston, 29.7604, -95.3698, 2313000
                         College Station, 30.6280, -96.3344, 113564
                         Dallas, 32.7767, -96.7970, 1341000
                         San Antonio, 29.4241, -98.4936, 1493000"))

leaflet(texas_pop) %>%
addTiles() %>%
addCircles(lng = ~Long, 
           lat = ~Lat, 
           weight = 1,
           radius = ~sqrt(Pop) * 30, 
           popup = ~as.character(Pop))
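
As mentioned at the start of this section, a leaflet map can be saved as an HTML file. One way to do this is with the saveWidget function from the htmlwidgets package, sketched below. This is an addition; the file name and the two-city data frame are only for illustration. With selfcontained = FALSE, the supporting JavaScript files are written to a folder next to the HTML file.

```r
library(leaflet)
library(htmlwidgets)

# Two of the cities from the table above
cities = data.frame(City = c("Austin", "Houston"), 
                    Lat  = c(30.2672, 29.7604), 
                    Long = c(-97.7431, -95.3698))

m = leaflet(cities) %>%
    addTiles() %>%
    addMarkers(lng = ~Long, 
               lat = ~Lat, 
               label = ~City)

# Illustrative file name in the temporary folder; use your own path instead
out_html = file.path(tempdir(), "texas_map.html")
saveWidget(m, file = out_html, selfcontained = FALSE)

file.exists(out_html)
## [1] TRUE
```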

CHAPTER 5. OfficeR


There may be instances where it is useful to present your findings in R as a PowerPoint presentation at conferences or meetings. The following exercises showcase the abilities of the officer package to develop PowerPoint presentations while including any data analysis or visualizations performed in R. More information on using R to develop a PowerPoint can be accessed from: https://ardata-fr.github.io/officeverse/officer-for-powerpoint.html

# Load libraries
# library(ggplot2)
library(officer)

5.1. Add slides

In this section, we will use the officer package to make some PowerPoint slides from R.

# Import a previously made PowerPoint named "name of file.pptx" to R
# doc <- read_pptx("name of file.pptx")

# Start a new presentation by calling "read_pptx" with no arguments. 
#  It will be named and saved when we export it in Section 5.2
presentation <- read_pptx()

# Adding slides
# add_slide takes 3 arguments: the rpptx object, the slide layout name, 
#  and the master layout name
presentation <- add_slide(presentation, 
                          layout = "Title and Content", 
                          master = "Office Theme")

# Layout names and master names can be listed with the layout_summary function
layout_summary(presentation)
##              layout       master
## 1       Title Slide Office Theme
## 2 Title and Content Office Theme
## 3    Section Header Office Theme
## 4       Two Content Office Theme
## 5        Comparison Office Theme
## 6        Title Only Office Theme
## 7             Blank Office Theme
# We can make our first slide as an introduction slide including a title, 
#  a footer, the current date, and a slide number. 
# When adding content into a slide, 3 arguments are involved. 
# These are the rpptx object, object to be printed, and the location to 
#  define the placeholder where the shape will be created.

# Let's add the title
presentation <- ph_with(presentation, 
                        value = "Introduction", 
                        location = ph_location_type(type = "title"))

# Let's add the footer 
presentation <- ph_with(presentation, 
                        value = "Rstudio Visualization Workshop", 
                        location = ph_location_type(type = "ftr"))

# Let's add the date
presentation <- ph_with(presentation, 
                        value = format(Sys.Date()), 
                        location = ph_location_type(type = "dt"))

# Let's add the slide number
presentation <- ph_with(presentation, 
                        value = "slide 1", 
                        location = ph_location_type(type = "sldNum"))

# Let's use the histogram from a previous example and upload it to a 
#  second slide in PowerPoint with a title

# Add a new slide
presentation <- add_slide(presentation, 
                          layout = "Title and Content", 
                          master = "Office Theme")

# Load the histogram example to the slide
presentation <- ph_with(x = presentation, 
                        value = h, 
                        location = ph_location_fullsize())

# Add a title using the "location" argument with type = "title"
presentation <- ph_with(x = presentation, 
                        value = "Histogram Example", 
                        location = ph_location_type(
                        type = "title"))

5.2. Export and save PowerPoint

In this section, we will export and save the slides we created to a PowerPoint file on your local system using R.

# Write and save the PowerPoint file to your computer
print(presentation, 
      target = "first_example.pptx")         # Saved in the current working directory
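
Slides can hold tables as well as text and plots. The sketch below is an addition. It passes a small data frame to ph_with, which places it in the body of the slide as a table, and saves the result to R's temporary folder under an illustrative file name. The two cities are taken from the population table in Chapter 4.

```r
library(officer)

doc = read_pptx()                            # New, empty presentation
doc = add_slide(doc, 
                layout = "Title and Content", 
                master = "Office Theme")

# Slide title
doc = ph_with(doc, 
              value = "City populations", 
              location = ph_location_type(type = "title"))

# A data frame is added to the slide as a table
tbl = data.frame(City = c("Austin", "Houston"), 
                 Pop = c(950715, 2313000))
doc = ph_with(doc, 
              value = tbl, 
              location = ph_location_type(type = "body"))

# Illustrative file name in the temporary folder; use your own path instead
out_pptx = file.path(tempdir(), "table_example.pptx")
print(doc, target = out_pptx)

length(doc)                                  # Number of slides
## [1] 1
```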

CHAPTER 6. Text mining with R


6.1. Creating a Word Cloud

Import libraries and sample dataset

For this section, we will create a Word Cloud from a set of PDF documents. These files can be downloaded manually from GitHub by accessing https://github.com/Vinit-Sehgal/Data-Part1

Or, you can run the following code to download them from the GitHub repository.

# Import sample data from GitHub repository
download.file(url = "https://github.com/Vinit-Sehgal/Data-Part1/archive/main.zip",
              destfile = "Data-Part1-main.zip")              # Download ".Zip"

# Unzip the downloaded .zip file
unzip(zipfile = "Data-Part1-main.zip")
getwd()                                                      # Working directory for this workshop
## [1] "G:/My Drive/TAMU/Teaching/DataVisWorkshop/DataVisWorkshop2021/PrepFolderDataVizard2021/Part1_2021"
list.files("./Data-Part1-main")                              # List contents of the folder
##  [1] "desktop.ini"                      "grid_plot_1.png"                 
##  [3] "grid_plot_2.png"                  "I've Been to the Mountaintop.pdf"
##  [5] "Martin Luther King Speech.pdf"    "Matin Luther King Speech.pdf"    
##  [7] "OurGodIsMarchingOn.pdf"           "README.md"                       
##  [9] "updated_words_df.csv"             "Workbook_DVGAR-Part1.html"       
## [11] "Workbook_DVGAR-Part2.html"
# Load required libraries
library("webshot")
library("htmlwidgets")

Create a vector of PDF file names using the list.files function. The pattern argument is used to grab only the files ending with “pdf”.

Note: ensure that your working directory is set to the folder containing the PDF files for this workshop.

files <- list.files(path="./Data-Part1-main/",
                    pattern = "pdf$",
                    full.names = TRUE)
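
The pattern argument is a regular expression, so "pdf$" matches any file name that ends in the letters pdf, with or without a dot in front. The sketch below illustrates the difference using throwaway files in a temporary folder; the folder and file names are invented for this illustration and are not part of the workshop data.

```r
# Create a few empty files in a temporary folder (names are hypothetical)
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir,
                                c("speech.pdf", "notes_pdf", "README.md"))))

# "pdf$" matches every name ending in "pdf", including "notes_pdf"
loose <- list.files(demo_dir, pattern = "pdf$")

# Escaping the dot restricts the match to the ".pdf" extension
strict <- list.files(demo_dir, pattern = "\\.pdf$")

loose
strict
```

For the workshop folder both patterns should return the same four files, so the stricter form only matters when other file names happen to end in "pdf".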

Create a list with one element per document

# lapply to apply pdf_text function
# The pdf_text function in pdftools is used to extract text
Speeches <- lapply(files, pdf_text)  
length(Speeches)
## [1] 4
lapply(Speeches, length) 
## [[1]]
## [1] 7
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] 1
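
Each element of Speeches is a character vector with one string per page, which is why lapply(Speeches, length) reports page counts. The sketch below mimics that structure with placeholder strings (not the real speech text), using the page counts shown above. It illustrates two base R conveniences: lengths() for a compact vector of page counts, and paste() for collapsing the pages of each document into a single string.

```r
# Placeholder list with the same shape as Speeches: 7, 2, 2, and 1 pages
pages_demo <- list(rep("page text", 7),
                   rep("page text", 2),
                   rep("page text", 2),
                   "page text")

# lengths() returns the page counts as an integer vector instead of a list
page_counts <- lengths(pages_demo)
page_counts

# Collapse the pages of each document into one string per document
full_text <- vapply(pages_demo, paste, character(1), collapse = " ")
length(full_text)
```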

Create a corpus or a database of text

Corpora are collections of documents containing (natural language) text.

corp <- tm::Corpus(URISource(files),
               readerControl = list(reader = readPDF))

Create a TermDocumentMatrix (TDM)

A term-document matrix is a table containing the frequency of each word. Row names are terms (words) and column names are documents. The TermDocumentMatrix() function from the tm text mining package can be used as follows:

# A TDM stores counts of terms for each document
speeches.tdm <- tm::TermDocumentMatrix(corp, 
                                      control = list(removePunctuation = TRUE, 
                                                     stopwords = TRUE, 
                                                     tolower = TRUE, 
                                                     stemming = TRUE, 
                                                     removeNumbers = TRUE)) 
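
To make the layout of a term-document matrix concrete, here is a minimal base R sketch that builds one by hand from three invented one-line documents. It is not how tm builds its matrix internally (tm stores a sparse matrix and applies the control options above); it only shows the terms-by-documents arrangement of counts.

```r
# Three toy documents (invented for illustration)
docs <- c(doc1 = "let freedom ring",
          doc2 = "freedom now",
          doc3 = "let us march now")

# Split each document into lower-case tokens
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary: every distinct term, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

# Count each term in each document: rows are terms, columns are documents
tdm_demo <- vapply(tokens,
                   function(tok) as.integer(table(factor(tok, levels = terms))),
                   integer(length(terms)))
rownames(tdm_demo) <- terms
colnames(tdm_demo) <- names(docs)

tdm_demo
rowSums(tdm_demo)   # total frequency of each term across all documents
```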

View the TDM

inspect(speeches.tdm[1:30, ])
## <<TermDocumentMatrix (terms: 30, documents: 4)>>
## Non-/sparse entries: 32/88
## Sparsity           : 73%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##            Docs
## Terms       I've Been to the Mountaintop.pdf Martin Luther King Speech.pdf
##   ’tis                                     0                             1
##   “and                                     5                             0
##   “bull                                    1                             0
##   “dear                                    1                             0
##   “for                                     0                             1
##   “free                                    0                             1
##   abernathi                                2                             0
##   abl                                      0                             5
##   act                                      2                             0
##   afternoon                                1                             0
##            Docs
## Terms       Matin Luther King Speech.pdf OurGodIsMarchingOn.pdf
##   ’tis                                 0                      0
##   “and                                 0                      0
##   “bull                                0                      0
##   “dear                                0                      0
##   “for                                 0                      0
##   “free                                0                      0
##   abernathi                            0                      0
##   abl                                  6                      0
##   act                                  0                      0
##   afternoon                            0                      1

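Some words are still preceded by curly quotation marks and dashes even though we specified removePunctuation = TRUE. To take care of this, remove punctuation from the corpus with ucp = TRUE, which uses Unicode character properties to identify punctuation: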

corp <- tm::tm_map(corp, 
                   removePunctuation, 
                   ucp = TRUE)
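
The difference between ASCII-only and Unicode-aware punctuation removal can be sketched in base R. The example below is an illustration under stated assumptions, not tm's implementation: the first pattern is a hand-picked subset of ASCII punctuation, and the second uses the Perl-style class \p{P}, which matches characters in the Unicode punctuation category.

```r
# Terms like those in the first TDM, written with Unicode escapes:
# \u2019 is a right single quote, \u201c is a left double quote
terms_demo <- c("\u2019tis", "\u201cand", "free!")

# An ASCII-only character class (illustrative subset) leaves curly quotes in place
ascii_clean <- gsub("[!\"',.;:?-]+", "", terms_demo, perl = TRUE)

# \p{P} matches any character in the Unicode punctuation category
ucp_clean <- gsub("\\p{P}+", "", terms_demo, perl = TRUE)

ascii_clean
ucp_clean
```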

Recreate the TDM. The removePunctuation argument is omitted because punctuation has already been removed from the corpus.

speeches.tdm <- tm::TermDocumentMatrix(corp, 
                                   control = 
                                     list(stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE))

View the TDM again

inspect(speeches.tdm[1:30, ])
## <<TermDocumentMatrix (terms: 30, documents: 4)>>
## Non-/sparse entries: 40/80
## Sparsity           : 67%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##            Docs
## Terms       I've Been to the Mountaintop.pdf Martin Luther King Speech.pdf
##   abernathi                                2                             0
##   abl                                      0                             5
##   act                                      2                             0
##   afternoon                                1                             0
##   agenda                                   3                             0
##   ago                                      1                             1
##   ahead                                    1                             1
##   aint                                     1                             0
##   alabama                                  3                             2
##   allow                                    7                             2
##            Docs
## Terms       Matin Luther King Speech.pdf OurGodIsMarchingOn.pdf
##   abernathi                            0                      0
##   abl                                  6                      0
##   act                                  0                      0
##   afternoon                            0                      1
##   agenda                               0                      0
##   ago                                  0                      0
##   ahead                                0                      0
##   aint                                 0                      1
##   alabama                              2                      2
##   allow                                0                      0

Fix Misspelled Words

# Before creating the word cloud, we will edit the words manually by exporting them to a .csv file
words = as.matrix(sort(apply(speeches.tdm, 1, sum), 
                       decreasing = TRUE))

words_df = data.frame(Word=as.character(rownames(words)),
                      count= as.numeric(words)) 

head(words_df)
##      Word count
## 1    will    49
## 2     let    34
## 3     day    32
## 4 freedom    30
## 5     now    29
## 6     say    26
write.csv(words_df,
          "words_df.csv", 
          row.names = FALSE)

# Examine words_df.csv for possible incomplete words or errors and then let's 
#  load the updated .csv file back into R 

words_df <- read.csv("./Data-Part1-main/updated_words_df.csv")
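
Manual editing in a spreadsheet is the approach used in this workshop. As an alternative sketch, a few corrections could also be applied in R with a named lookup vector, followed by merging any duplicate words the corrections create. The data frame and the lookup vector below are small hypothetical examples (the stems and counts are taken from the TDM output above; the corrected spellings are assumptions), not the contents of updated_words_df.csv.

```r
# Small example table of stemmed words and their total counts
words_demo <- data.frame(Word  = c("abl", "abernathi", "alabama", "allow"),
                         count = c(11, 2, 9, 9),
                         stringsAsFactors = FALSE)

# Hypothetical lookup: names are stems, values are the corrected spellings
fixes <- c(abl = "able", abernathi = "abernathy")

# Replace only the words that have an entry in the lookup
hit <- words_demo$Word %in% names(fixes)
words_demo$Word[hit] <- unname(fixes[words_demo$Word[hit]])

# Merge duplicates (if any) and sort by decreasing frequency
words_fixed <- aggregate(count ~ Word, data = words_demo, FUN = sum)
words_fixed <- words_fixed[order(-words_fixed$count), ]
words_fixed
```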

Generate the Word Cloud

The importance of words can be illustrated as a word cloud as follows:

set.seed(1234)
wordcloud(words = words_df$Word, 
          freq = words_df$count, 
          min.freq = 1,
          max.words = 200, 
          random.order = FALSE, 
          rot.per = 0.35, 
          colors = brewer.pal(8, "Dark2"))
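
The min.freq and max.words arguments decide which words are drawn: words below min.freq are skipped, and at most max.words of the most frequent words are kept. This selection can be sketched in base R with the six counts printed by head(words_df) above. Note that show_words is a hypothetical helper written for this illustration, not a function from the wordcloud package.

```r
# The six most frequent words, as printed by head(words_df)
top_words <- data.frame(Word  = c("will", "let", "day", "freedom", "now", "say"),
                        count = c(49, 34, 32, 30, 29, 26),
                        stringsAsFactors = FALSE)

# Hypothetical helper mimicking the min.freq / max.words selection
show_words <- function(df, min.freq, max.words) {
  kept <- df[df$count >= min.freq, ]   # drop words below min.freq
  kept <- kept[order(-kept$count), ]   # most frequent first
  head(kept, max.words)                # cap the number of words
}

shown <- show_words(top_words, min.freq = 30, max.words = 3)
shown$Word
```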

The word cloud above shows that “will”, “now”, “say”, “freedom”, and “together” are some of the most frequently used words in these speeches by Martin Luther King.

Another way to visually design the word cloud

wordcloud2(words_df, 
           size = 0.5)
wc = wordcloud2(words_df, 
                size = 0.5,
                shape = 'rectangle',
                backgroundColor = 'gray98',
                color = "random-light", 
                fontFamily = "Segoe UI")

#webshot::install_phantomjs()
install_phantomjs(version = "2.1.1",
  baseURL = "https://github.com/wch/webshot/releases/download/v0.3.1/",
  force = TRUE)
## phantomjs has been installed to C:\Users\vinit\AppData\Roaming\PhantomJS
saveWidget(wc, 
           "wordcloud.html", 
           selfcontained = FALSE)

webshot::webshot("wordcloud.html", 
                 "wordcloud1.png",
                 vwidth = 500, 
                 vheight = 500, 
                 delay = 5)

Yet another way to visually design the word cloud

wc2=wordcloud2(words_df, 
               size =0.6, 
               minRotation = -pi/8, 
               maxRotation = -pi/8, 
               rotateRatio = 1,
               shape = 'circle')
           
saveWidget(wc2, 
           "wordcloud2.html", 
           selfcontained = FALSE)

webshot::webshot("wordcloud2.html", 
                 "wordcloud2.png",
                 vwidth = 500, 
                 vheight = 500, 
                 delay = 5)



Try out some fancier word clouds

wordcloud2(words_df, color = "random-light", backgroundColor = "grey")

wordcloud2(words_df, color = "random-light", backgroundColor = "black")


CHAPTER 7. Supplementary material


7.1. DIY: OfficeR and Wordcloud

7.1.1. OfficeR

In this section, you will have the opportunity to try out more functions for working with PowerPoint using R.

# Load required libraries
library(flextable)
library(rvg)
library(officer)
library(tm)
library(ggplot2)   # Needed for the ggplot() call on the eighth slide

# Start a new, empty presentation by calling "read_pptx" with no arguments
new_pres <- read_pptx()

##############################
# MAKE FIRST SLIDE

# Use the 3 main parameters in the "add_slide" function
new_pres <- add_slide(new_pres, 
                      layout = "Title and Content", 
                      master = "Office Theme")

# Let's use some data from a previous example
Year = seq(1913, 2001, 1)
Jan = runif(89, -18.4, -3.2)
Feb = runif(89, -19.4, -1.2)
Mar = runif(89, -14, -1.8)
January = runif(89, 1, 86)
dat = data.frame(Year, Jan, Feb, Mar, January)

# It is often visually appealing to insert a table into a PowerPoint slide when presenting data.
# Let's add a table to the slide and create labels for the categories of data.
new_pres <- ph_with(new_pres, 
                    head(dat),
                    location = ph_location_label(ph_label = "Content Placeholder 2"))

#############################
# ADD SECOND SLIDE
new_pres <- add_slide(new_pres, 
                      layout = "Two Content")

# You can design a two-column layout in PowerPoint by specifying the location with 
#  the "ph_location_..." functions. 
# For example, if you wish to make two lists showcasing the pros and cons of living 
#  in an apartment rather than a house, you can use these functions to do so. 

new_pres <- ph_with(x = new_pres, 
                    "Apartment Living vs. House Living", 
                    location = ph_location_type(type = "title"))

new_pres <- ph_with(new_pres, 
                    "Amenities Included",
                    location = ph_location_left())

new_pres <- ph_with(new_pres, 
                    "Private Amenities",
                    location = ph_location_right())

#############################
# ADD THIRD SLIDE
new_pres <- add_slide(new_pres, 
                      layout = "Title Slide")

# Create the content you wish to put in the slide
paragraph <- fpar(ftext("Welcome to Data Visualization in RStudio!", 
                        fp_text(color = "white", 
                                font.size = 40)))

# Make a "free location" or design the specific area/shape/color/direction to place the content
free_loc <- ph_location(
               left = 2.0, 
               top = 1.0, 
               width = 4.0, 
               height = 4.0, 
               rotation = 45, 
               bg = "blue")

# Place the content in the location created in the slide
new_pres <- ph_with(new_pres, 
                    paragraph,
                    location = free_loc)
 
#############################   
# ADD FOURTH SLIDE
new_pres <- add_slide(new_pres, 
                      layout = "Two Content")

# Design another free location
free_loc_1 <- ph_location_template(top = 4, 
                                   type = "body", 
                                   id = 1, 
                                   width = 4, 
                                   height = 4)

# Add content to the free location
new_pres <- ph_with(new_pres, 
                    "This is a message for YOU to have a WONDERFUL day!", 
                    location = free_loc_1 )

#############################
# ADD FIFTH SLIDE
new_pres <- add_slide(new_pres)

# Import image
img.file <- file.path(R.home("doc"), 
                       "html", 
                       "logo.jpg" )

# Add image to the slide and include properties
new_pres <- ph_with(x = new_pres, 
                    external_img(img.file, width = 2, height = 2),
                    location = ph_location_type(type = "body"), 
                    use_loc_size = FALSE )

#############################
# ADD SIXTH SLIDE
new_pres <- add_slide(new_pres)

# Attach the image to the slide without including any properties
new_pres <- ph_with(x = new_pres, 
                    external_img(img.file),
                    location = ph_location_type(type = "body"), 
                    use_loc_size = TRUE)

#############################
# ADD SEVENTH SLIDE
new_pres <- add_slide(new_pres)

# Design the flextable
ft <- flextable(head(dat))
ft <- autofit(ft)

# Add the flextable to the slide
new_pres <- ph_with(x = new_pres, 
                    ft,
                    location = ph_location_type(type = "body"))

#############################
# ADD EIGHTH SLIDE
new_pres <- add_slide(new_pres)

# Load new libraries
# library(rvg)

# Create editable graphics (rvg). Editable graphs are ... 
gg_plot <- ggplot(data = dat ) + 
           geom_point(mapping = aes(Year, Jan), size = 3) + 
           theme_minimal()

editable_graph <- dml(ggobj = gg_plot)

# Add editable graphics to the slide 
new_pres <- ph_with(x = new_pres, 
                    editable_graph,
                    location = ph_location_type(type = "body"))

#############################
# ADD NINTH SLIDE
new_pres <- add_slide(new_pres)

# Import multiple paragraphs. Why would we need multiple paragraphs??? 

# define blocklist 
fp_t1 <- fp_text(bold = TRUE, 
                 font.size = 30)

fp_t2 <- fp_text(bold = TRUE, 
                 font.size = 30, 
                 color = "green")

fp_t3 <- fp_text(font.size = 30, 
                 color = "#006699")

par1 <- fpar(ftext("Welcome", fp_t2), 
             ftext("Coders", fp_t3), 
             fp_p = fp_par(text.align = "left"))

par2 <- fpar(ftext("Let's", fp_t1), 
             ftext("Code", fp_t3), 
             fp_p = fp_par(text.align = "center"))

bl <- block_list(
         par1, 
         par1, 
         par1,
         par2, 
         par2)

new_pres <- ph_with(x = new_pres, 
                    value = bl,
                    level_list = seq_along(bl),
                    location = ph_location_type(type = "body"))

#############################
# ADD TENTH SLIDE
new_pres <- add_slide(new_pres)

new_pres <- ph_with(x = new_pres, 
                    value = bl, 
                    location = ph_location(label = "Introduction", 
                                           left = 5, 
                                           top = 3, 
                                           width = 4, 
                                           height = 4, 
                                           bg = "wheat", 
                                           rotation = 90))

print(new_pres, target = "second_example.pptx")   # Written to the current working directory

7.1.2. Text Mining - Analyzing term usage

In this section, you will examine term frequencies and term associations in the speeches from Chapter 6.

Explore frequent terms and their associations

# Load required libraries
library(tm)

fta <- findFreqTerms(speeches.tdm, 
                     lowfreq = 4)
head(fta)
## [1] "abl"      "alabama"  "allow"    "alway"    "america"  "american"

You can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function. The R code below identifies which words are associated with “freedom” in the speeches:

z <- findAssocs(speeches.tdm, 
                terms = "freedom", 
                corlimit = 0.3)

Correlations of the terms most strongly associated with “freedom”

head(z$freedom)
##        true       dream    mountain mississippi   allegheni       becom 
##        0.99        0.98        0.98        0.97        0.94        0.94
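
findAssocs() reports correlations between the per-document counts of the chosen term and those of other terms. As a minimal sketch, the same kind of value can be computed with base R's cor(), using counts copied from the matrix printed later in this section (one value per document, in the column order shown there). This is meant to illustrate the idea rather than reproduce tm's code.

```r
# Per-document counts for three terms, copied from the TDM subset printed below
freedom <- c(5, 13, 12, 0)
dream   <- c(2, 10, 8, 0)
now     <- c(28, 0, 0, 1)

# Pearson correlation between the count vectors of two terms
r_dream <- round(cor(freedom, dream), 2)
r_dream                 # 0.98, in line with the value reported for "dream" above

# "now" is frequent in the document where "freedom" is rare, so the correlation is negative
r_now <- cor(freedom, now)
r_now < 0
```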

Find Terms with Specific Frequency

To see the counts of words with a specified frequency, we can save the terms returned by findFreqTerms and use them to subset the TDM. Note that as.matrix is needed to print the subsetted TDM.

ft <- findFreqTerms(speeches.tdm, 
                    lowfreq = 20, 
                    highfreq = Inf)

as.matrix(speeches.tdm[ft, ]) 
##          Docs
## Terms     I've Been to the Mountaintop.pdf Martin Luther King Speech.pdf
##   around                                19                             0
##   day                                   14                             8
##   dream                                  2                            10
##   freedom                                5                            13
##   god                                   14                             3
##   let                                   12                            11
##   long                                   5                             6
##   now                                   28                             0
##   one                                    7                            10
##   ring                                   0                            11
##   say                                   21                             1
##   stop                                  20                             0
##   will                                  14                            13
##          Docs
## Terms     Matin Luther King Speech.pdf OurGodIsMarchingOn.pdf
##   around                             0                      1
##   day                               10                      0
##   dream                              8                      0
##   freedom                           12                      0
##   god                                1                      2
##   let                               10                      1
##   long                               0                     13
##   now                                0                      1
##   one                                8                      0
##   ring                              10                      0
##   say                                0                      4
##   stop                               0                      1
##   will                              15                      7
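
The selection made by findFreqTerms() can be sketched in base R as a filter on the total count of each term across all documents. The small matrix below copies five rows from the TDM outputs shown in this workshop; the column names doc1 to doc4 are shortened placeholders for the four PDF file names.

```r
# Five terms and their counts in the four documents (copied from the outputs above)
counts <- rbind(abernathi = c(2, 0, 0, 0),
                alabama   = c(3, 2, 2, 2),
                day       = c(14, 8, 10, 0),
                freedom   = c(5, 13, 12, 0),
                will      = c(14, 13, 15, 7))
colnames(counts) <- paste0("doc", 1:4)

# Total frequency of each term across all documents
totals <- rowSums(counts)

# Keep the terms whose total lies between lowfreq = 20 and highfreq = Inf
frequent <- names(totals)[totals >= 20 & totals <= Inf]
frequent

# Subset the matrix to the frequent terms
counts[frequent, ]
```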

Plot word frequencies

The frequencies of the 10 most frequent words are plotted:

barplot(words_df$count[1:10], 
        las = 2, 
        names.arg = words_df$Word[1:10],
        col = "lightblue", 
        main = "Most frequent words",
        ylab = "Word frequencies")

7.2. Updating R using RStudio

The operations in this tutorial were tested on R version 4.1.1 (Kick Things) and version 4.0.3 (Bunny-Wunnies Freak Out). If necessary, update R from RStudio using the updateR function from the installr package, which is intended for Windows.

install.packages("installr", repos = "https://cran.r-project.org")
library(installr)
updateR(keep_install_file=TRUE)
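
Before updating, it may help to check which R version is currently running. A minimal sketch using base R only, with the older of the two tested versions (4.0.3) as the threshold:

```r
# Version of the running R session
current_version <- getRversion()
current_version

# TRUE if the session is older than the oldest version the workshop was tested on
needs_update <- current_version < "4.0.3"
needs_update
```

If needs_update is TRUE, updating R before running the workshop code is advisable.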

Follow Part 2 for more on large-scale geospatial analysis in R


Data Visualization and Geospatial Analysis With R (2021)
Correspondence
Vinit Sehgal, (https://orcid.org/0000-0002-8837-5864)
Leah Kocian, (https://orcid.org/0000-0001-7007-1582)
Shubham Jain, (https://orcid.org/0000-0003-4283-2043)
Alan Lewis, (https://orcid.org/0000-0002-9164-8683)