Data Visualization & Geospatial Analysis with R
PART 1/2: Data Visualization in R
Statistical computing is essential for scientific inquiry, discovery, and storytelling. With R, there are endless possibilities for assembling, transforming, querying, analyzing, and ultimately visualizing data. In this workshop, we will give you the tools to get you started.
Hands-on workshop on Data Visualization & Geospatial Analysis with R, organized by the Texas A&M Institute of Data Science (TAMIDS) & TAMU High Performance Research Computing (HPRC).
Course level: Moderate. Prior experience in R required.
The code was tested on R version 4.0.3 (Bunny-Wunnies Freak Out). To update, see the supplementary section.
CHAPTER 1. Operators and data types
1.1. Basic operators
In this section, we will learn about some basic R operators that are used to perform operations on values and variables. Some of the most commonly used operators are demonstrated below.
## [1] 13
## [1] 0.5
## [1] 1
## [1] 0.25
## [1] 0.2
## [1] 8
## [1] 2.302585
## [1] 2
## [1] 3.141593
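The expressions that produced the output above are not shown. As a minimal sketch, the hypothetical expressions below illustrate common arithmetic operators and built-in functions; they were chosen for illustration and are not necessarily the workshop's originals.

```r
# Hypothetical examples of basic R operators (not the workshop's original code)
add_res  <- 5 + 8      # Addition
div_res  <- 1 / 2      # Division
mod_res  <- 5 %% 2     # Modulus: remainder of 5 divided by 2
int_res  <- 5 %/% 2    # Integer division
pow_res  <- 2^3        # Exponentiation
log_res  <- log(10)    # Natural logarithm
sqrt_res <- sqrt(4)    # Square root

print(c(add_res, div_res, mod_res, int_res, pow_res))
print(log_res)
print(sqrt_res)
print(pi)              # Built-in constant
```

Typing an expression at the console prints its value directly; the assignments here only make the results easy to reuse.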
1.2. Basic data operations
In this section, we will create some vector data and apply built-in operations to examine the properties of a dataset.
# The assignment operator in R is "<-" or "="
# Generate sample data
data<-c(1,4,2,3,9)
# rbind combines data by rows, and hence "r"bind
# cbind combines data by columns, and hence "c"bind
# Checking the properties of a dataset. Note: the na.rm argument ignores NA values in the dataset.
data=rbind(1,4,2,3,9)
dim(data) # [5,1]: 5 rows, 1 column
## [1] 5 1
## [1] 4
## [1] 4 2 3 9
## [1] 3.8
## [1] 9
## [1] 1
## [1] 3.114482
## [,1]
## [1,] 9.7
## V1
## Min. :1.0
## 1st Qu.:2.0
## Median :3.0
## Mean :3.8
## 3rd Qu.:4.0
## Max. :9.0
## num [1:5, 1] 1 4 2 3 9
## [,1]
## [1,] 1
## [2,] 4
## [3,] 2
## [4,] 3
## [5,] 9
## [,1]
## [1,] 1
## [2,] 4
## [,1]
## [4,] 3
## [5,] 9
## NULL
## [1] 5
## [1] 4
## [1] 3.8
## [1] 9
## [1] 1
## [1] 3.114482
## [1] 9.7
## [1] "TAMU" "GEOS" "BAEN" "WMHS"
## [1] "TAMU"
# Mixed data
data=c(1,"GEOS",10,"WMHS") # All data is treated as text if one value is text
data[3] # Note how output is in quotes i.e. "10"
## [1] "10"
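The calls behind the printed output earlier in this section are not shown. The sketch below is a hypothetical reconstruction using base R functions on the same five values, together with a small example of the na.rm argument mentioned in the comments; the variable names are assumptions made for illustration.

```r
# Hypothetical reconstruction: summary functions applied to a small numeric vector
vals <- c(1, 4, 2, 3, 9)

length(vals)             # Number of elements
vals[2]                  # Second element
vals[-1]                 # All elements except the first
mean(vals)               # Arithmetic mean
max(vals); min(vals)     # Largest and smallest values
sd(vals)                 # Sample standard deviation
var(vals)                # Sample variance
summary(vals)            # Quartiles, extremes, and the mean
head(vals, 2)            # First two values
tail(vals, 2)            # Last two values

# na.rm = TRUE drops missing values before computing the statistic
vals_na <- c(vals, NA)
mean(vals_na)                 # NA, because one value is missing
mean(vals_na, na.rm = TRUE)   # Mean of the non-missing values
```

The same functions also accept the one-column matrix created with rbind, although var() then returns a 1 x 1 matrix rather than a single number.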
1.3. Data types
In R, data is stored as an “array”, which can have one, two, or more dimensions. A 1-D array is called a “vector” and a 2-D array is a “matrix”. A table in R is called a “data frame” and a “list” is a container to hold a variety of data types. In this section, we will learn how to create matrices, lists and data frames in R.
# Let's make a sample matrix
test_mat = matrix( c(2, 4, 3, 1, 5, 7), # The data elements
nrow=2, # Number of rows
ncol=3, # Number of columns
byrow = TRUE) # Fill matrix by rows
test_mat = matrix( c(2, 4, 3, 1, 5, 7),nrow=2,ncol=3,byrow = TRUE) # Same result
test_mat
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
## [1] 4 5
## [1] 1 5 7
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
## [1] 2 1 4 5 3 7
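The indexing calls that produced the output above are not shown. As a hedged sketch, the following hypothetical calls show the usual ways to pull rows, columns, and single elements out of a matrix; the result names are assumptions.

```r
# Hypothetical reconstruction of common matrix operations
test_mat <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)

col2  <- test_mat[, 2]        # Second column
row2  <- test_mat[2, ]        # Second row
elem  <- test_mat[1, 3]       # Single element: row 1, column 3
flat  <- as.vector(test_mat)  # Flatten the matrix column by column
t_mat <- t(test_mat)          # Transpose: 3 rows, 2 columns

print(col2)
print(row2)
print(elem)
print(flat)
print(dim(t_mat))
```

Note that the index before the comma selects rows and the index after it selects columns; leaving one side empty selects everything along that dimension.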
# Data frame and list
data1=runif(50,20,30) # Create 50 random numbers between 20 and 30
data2=runif(50,0,10) # Create 50 random numbers between 0 and 10
# Lists
out = list() # Create an empty list
out[[1]] = data1 # Notice the brackets "[[ ]]" instead of "[ ]"
out[[2]] = data2
out[[1]] # Contains data1 at this location
## [1] 29.18933 27.95485 28.31264 28.70377 29.22321 22.58681 28.10250 22.13927
## [9] 22.49101 21.35438 20.69045 22.90640 29.35435 27.68587 25.63316 21.47947
## [17] 21.45240 28.76167 25.18714 28.12519 24.01581 22.71113 24.63520 25.08795
## [25] 28.70108 20.71663 22.90016 21.97386 20.97670 23.00555 29.66211 28.22074
## [33] 25.64117 27.93509 22.72155 29.46658 29.12766 25.46548 23.09826 28.17638
## [41] 24.39947 25.89014 20.58469 28.67705 26.89515 28.00919 27.89364 29.55466
## [49] 22.75411 20.31576
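The comment above mentions data frames, but only the list is shown. A minimal sketch of building a data frame from two vectors of the same length follows; the seed, the object name df_out, and the column names x and y are assumptions made for illustration.

```r
set.seed(1)                      # Seed chosen arbitrarily for reproducibility
data1 <- runif(50, 20, 30)       # 50 random numbers between 20 and 30
data2 <- runif(50, 0, 10)        # 50 random numbers between 0 and 10

# Combine the two vectors as named columns of a data frame
df_out <- data.frame(x = data1, y = data2)

dim(df_out)         # 50 rows, 2 columns
head(df_out, 3)     # First three rows
df_out$x[1:3]       # Access a column by name with "$"
df_out[["y"]][1:3]  # Or with double brackets, as for a list
```

A data frame is itself a list of equal-length columns, which is why the double-bracket syntax works for both.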
CHAPTER 2. Plotting with base R
If you need to quickly visualize your data, base R has some functions that will help you do this in a pinch. In this section we’ll look at some basics of visualizing univariate and multivariate data.
2.1. Overview
# Create 50 random numbers between 0 and 100
data=runif(50, 0, 100)
# Overplotting means adding layers to a plot.
plot(data) # The "plot" function initializes the plot.
# Try options type = "o" and type = "c" as well.
# We can also quickly visualize boxplots, histograms, and density plots using the same procedure
boxplot(data) # Box-and-whisker plot
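Only the box plot is shown above. As a sketch of the other quick views mentioned in the comments, and of what overplotting looks like in practice, the following uses only base R graphics; the seed and variable names are assumptions.

```r
set.seed(42)                       # Arbitrary seed so the figure is reproducible
vals <- runif(50, 0, 100)

hist(vals)                         # Histogram
dens <- density(vals)              # Kernel density estimate
plot(dens)                         # Density plot

# Overplotting: start a plot, then add layers on top of it
plot(vals, type = "l")             # Initialize the plot with a line
points(vals, pch = 16)             # Add points as a second layer
abline(h = mean(vals), lty = 2)    # Add a dashed horizontal line at the mean
```

Functions such as points(), lines(), and abline() draw on the current plot, so they must be called after plot() has opened one.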
2.2. Plotting univariate data
Let’s dig deeper into the plot function. Here, we will look at how to adjust the colors, shapes, and sizes for markers, axis labels and titles, and the plot title.
# Part 2.2.1. Line plots
plot(data,type="o", col="red",
xlab="x-axis title",ylab ="y-axis title", main="My plot", # Name of axis labels and title
cex.axis=2, cex.main=2,cex.lab=2, # Size of axes, title and label
pch=23, # Change marker style
bg="red", # Change color of markers
lty=5, # Change line style
lwd=2 # Selecting line width
)
# Adding legend
legend(1, 100, legend=c("Data 1"),
col=c("red"), lty=5, cex=1.2) # Use the same line style as the plot (lty=5)
# Part 2.2.2. Histograms
hist(data,col="red",
xlab="Value",ylab ="Frequency", main="My plot", # Name of axis labels and title
border="blue"
)
# Try adjusting the parameters:
# hist(data,col="red",
# xlab="Value",ylab ="Frequency", main="My plot", # Name of axis labels and title
# cex.axis=2, cex.main=2,cex.lab=2, # Size of axes, title and label
# border="blue",
# xlim=c(0,100), # Control the limits of the x-axis
# las=0, # Try different values of las: 0,1,2,3 to rotate labels
# breaks=5 # Try using 5,20,50, 100
# ) # Using more options and controls
2.3. Plotting multivariate data
Here, we introduce you to data frames, the equivalent of tables in R. A data frame is a two-dimensional, table-like structure in which each column contains values of one variable and each row contains one set of values from each column.
plot_data=data.frame(x=runif(50,0,10), y=runif(50,20,30), z=runif(50,30,40))
plot(plot_data$x, plot_data$y) # Scatter plot of x and y data
# Mandatory beautification
plot(plot_data$x,plot_data$y, xlab="Data X", ylab="Data Y", main="X vs Y plot",
col="darkred",pch=20,cex=1.5) # Scatter plot of x and y data
matplot(plot_data, type = c("b","l","o"),pch=16,col = 1:3) # Try this now. Any difference?
legend("topleft", legend = 1:3, col=1:3, pch=16) # Add legend at the top left
legend("top", legend = 1:3, col=1:3, pch=16) # Add legend at the top center
legend("bottomright", legend = 1:3, col=1:3, pch=16) # Add legend at the bottom right
2.4. Time series data
Working with time series data can be tricky at first, but here’s a quick look at how to generate a daily time series using the as.Date function.
date=seq(as.Date('2011-01-01'),as.Date('2011-01-31'),by = 1) # Generate a sequence of 31 days
data=runif(31,0,10) # Generate 31 random values between 0 and 10
df=data.frame(Date=date,Value=data) # Combine the data in a data frame
plot(df,type="o")
2.5. Combining plots
You can build plots that contain subplots. Using base R, we start by calling the “par” function and then plot as we saw before.
par(mfrow=c(2,2)) # Call a plot with 4 quadrants
# Plot 1
matplot(plot_data, type = c("b"),pch=16,col = 1:4)
# Plot 2
plot(plot_data$x,plot_data$y)
# Plot 3
hist(data,col="red",
xlab="Value",ylab ="Frequency", main="My plot",
border="blue")
# Plot 4
plot(data,type="o", col="red",
xlab="Number",ylab ="Value", main="My plot",
cex.axis=2, cex.main=2,cex.lab=2,
pch=23,
bg="red",
lty=5,
lwd=2
)
# Alternatively, we can call up a plot using a matrix
matrix(c(1,1,2,3), 2, 2, byrow = TRUE) # Plot 1 is plotted for first two spots, followed by plot 2 and 3
## [,1] [,2]
## [1,] 1 1
## [2,] 2 3
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE)) # Fixes a layout of the plots we want to make
# Plot 1
matplot(plot_data, type = c("b"),pch=16,col = 1:4)
# Plot 2
plot(plot_data$x,plot_data$y)
# Plot 3
hist(data,col="red",
xlab="Value",ylab ="Frequency", main="My plot",
border="blue"
)
2.6. Saving figures to disk
Plots can be saved as image files or a PDF. This is done by specifying the output file type, its size and resolution, then calling the plot.
png("awesome_plot.png", width=4, height=4, units="in", res=400)
# Tells R to draw the next plot into a PNG file with the given specification
matplot(plot_data, type = c("b","l","o"),pch=16,col = 1:3)
legend("topleft", legend = 1:3, col=1:3, pch=16)
dev.off() # Very important: this closes the device and writes the image to disk
## png
## 2
# If several devices are open, keep calling dev.off() until you get the following:
# Error in dev.off() : cannot shut down device 1 (the null device)
# This ensures that we are no longer plotting.
# It looks like everything we just plotted was squeezed together too tightly. Let's change the size.
png("awesome_plot.png", width=6, height=4, units="in", res=400) #note change in dimension
#Tells R we will plot image in png of given specification
matplot(plot_data, type = c("b","l","o"),pch=16,col = 1:3)
legend("topleft", legend = 1:3, col=1:3, pch=16)
dev.off()
## png
## 2
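The text above mentions that plots can also be saved as a PDF. A minimal sketch of the same open-plot-close pattern with the pdf() device follows; the output path in a temporary folder and the sample data are assumptions made so the example runs anywhere.

```r
out_file <- file.path(tempdir(), "awesome_plot.pdf")  # Hypothetical output path

pdf(out_file, width = 6, height = 4)   # Size in inches; PDF is vector-based, so there is no "res"
plot(runif(50, 0, 100), type = "o", col = "red",
     xlab = "Index", ylab = "Value", main = "Saved to PDF")
dev.off()                              # Close the device to write the file

file.exists(out_file)                  # Confirm that the file was created
```

Because a PDF stores vector graphics, it can be scaled without losing sharpness, which is often convenient for publications.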
Some useful resources
If you want to plot something a certain way and don’t know how to do it, the chances are that someone has asked that question before. Try a Google search for what you are trying to do and check out some of the forums. There are tons of material online. Here are some additional resources:
The R Graph Gallery: https://www.r-graph-gallery.com/
Graphical parameters: https://www.statmethods.net/advgraphs/parameters.html
Plotting in R: https://www.harding.edu/fmccown/r/
Histogram: https://www.r-bloggers.com/how-to-make-a-histogram-with-basic-r/
Line plots: https://www.statmethods.net/graphs/line.html
CHAPTER 3: Plotting with ggplot2
3.1. Import libraries and create sample dataset
For this section, we will use the ggplot2, gridExtra, utils, and tidyr packages. gridExtra and cowplot are used to combine ggplot objects into one plot and utils and tidyr are useful for manipulating and reshaping the data. We will also install some packages here that will be required for the later sections. You will find more information in the sections to follow.
###############################################################
#~~~ Load required libraries
lib_names=c("ggplot2","gridExtra","utils","tidyr","cowplot","plot3D", "leaflet", "pdftools", "tm", "SnowballC", "wordcloud", "RColorBrewer", "wordcloud2")
# If you see a prompt: Do you want to restart R prior to installing: Select **No**.
# Install all necessary packages (Run once)
# invisible(suppressMessages
# (suppressWarnings
# (lapply
# (lib_names,install.packages,repos="https://cran.r-project.org"))))
# Load necessary packages
invisible(suppressMessages
(suppressWarnings
(lapply
(lib_names,library, character.only = T))))
###############################################################
#~~~ Generate a dataset containing random numbers within specified ranges
Year = seq(1913,2001,1)
Jan = runif(89, -18.4, -3.2)
Feb = runif(89, -19.4, -1.2)
Mar = runif(89, -14, -1.8)
January = runif(89, 1, 86)
dat = data.frame(Year, Jan, Feb, Mar, January)
3.2. Basics of ggplot
Whereas base R has an “ink on paper” plotting paradigm, ggplot has a “grammar of graphics” paradigm that packages together a variety of plotting functions. With ggplot, you assign the result of a function to an object name and then modify it by adding additional functions. Think of it as adding layers using pre-designed functions rather than having to build those functions yourself, as you would have to do with base R.
l1 = ggplot(data=dat, aes(x = Year, y = Jan)) + # Tell which data to plot; aes() maps columns to aesthetics
geom_line() + # Add a line
geom_point() + # Add points
xlab("Year") + # Add labels to the axes
ylab("Value")
l1
# Constant properties such as color go outside aes(); inside aes(), "blue" would be treated as a data label, not a color
# They can be specified for any individual geometry
l1 + geom_line(linetype = "solid", color="blue") # Add a solid blue line
# There are tons of other built-in color scales and themes, such as scale_color_grey(), scale_color_brewer(), theme_classic(), theme_minimal(), and theme_dark()
# OR, CREATE YOUR OWN THEME! You can group themes together in one list
theme1 = theme(
legend.position = "none",
panel.background = element_blank(),
plot.title = element_text(hjust = 0.5),
axis.line = element_line(color = "black"),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 11),
axis.title.y = element_text(size = 11),
axis.title.x = element_text(size = 11),
panel.border = element_rect(
colour = "black",
fill = NA,
size = 0.5
)
)
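A stored theme is applied by adding it to a plot like any other layer. The sketch below assumes a small made-up data frame (demo) and a reduced theme (demo_theme) so that it runs on its own; with the objects defined above, the equivalent call would be along the lines of l1 + theme1.

```r
library(ggplot2)

# Small hypothetical dataset, similar in shape to "dat"
demo <- data.frame(Year = 1913:1922, Jan = runif(10, -18.4, -3.2))

# A reduced custom theme stored in an object
demo_theme <- theme(
  legend.position = "none",
  panel.background = element_blank(),
  axis.line = element_line(color = "black")
)

p <- ggplot(demo, aes(x = Year, y = Jan)) +
  geom_line(color = "blue") +
  geom_point(color = "blue") +
  ggtitle("January values") +
  demo_theme                        # Add the stored theme like any other layer

p_classic <- p + theme_classic()    # A complete built-in theme added later replaces earlier settings
print(p)
```

Keeping the theme in one object makes it easy to give every figure in a report the same look.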
3.3. Multivariate plots
For multivariate data, ggplot takes the data in the form of groups. This means that each data row should be identifiable to a group. To get the most out of ggplot, we will need to reshape our dataset.
library(tidyr)
# There are generally two data formats: wide (horizontal) and long (vertical). In the wide format, every column represents a category of the data. In the long format, every row represents an observation for a particular category (think of each row as a data point). Both formats have their comparative advantages. We will now convert the data frame we randomly generated in the previous section to the long format. Here are several ways to do this:
# Using the gather function
dat2 = dat %>% gather(Month, Value, -Year)
# Using pivot_longer and selecting all of the columns we want. This function is the best!
dat2 = dat %>% pivot_longer(cols = c(Jan, Feb, Mar), names_to = "Month", values_to = "Value")
# Or we can choose to exclude the columns we don't want
dat2 = dat %>% pivot_longer(cols = -c(Year,January), names_to = "Month", values_to = "Value")
head(dat2) # The data is now shaped in the long format
## # A tibble: 6 x 4
## Year January Month Value
## <dbl> <dbl> <chr> <dbl>
## 1 1913 74.9 Jan -9.42
## 2 1913 74.9 Feb -16.4
## 3 1913 74.9 Mar -6.99
## 4 1914 81.3 Jan -15.0
## 5 1914 81.3 Feb -10.8
## 6 1914 81.3 Mar -12.8
Line plot
# LINE PLOT
l = ggplot(dat2, aes(x = Year, y = Value, group = Month)) +
geom_line(aes(color = Month)) +
geom_point(aes(color = Month))
l
Density plot
# DENSITY PLOT
d = ggplot(dat2, aes(x = Value))
d = d + geom_density(aes(color = Month, fill = Month), alpha=0.4) # Alpha specifies transparency
d
Histogram
# HISTOGRAM
h = ggplot(dat2, aes(x = Value))
h = h + geom_histogram(aes(color = Month), alpha=0.4,
fill = "white", # A constant fill goes outside aes(); it would override a mapped fill
position = "dodge")
h
Grid plotting and saving files to disk
There are multiple ways to arrange multiple plots and save images. One method is using grid.arrange(), which is found in the gridExtra package. You can then save the file using ggsave, which comes with the ggplot2 library.
# The plots can be displayed together on one image using
# grid.arrange from the gridExtra package
img = grid.arrange(l, d, h, nrow=3)
# Finally, plots created using ggplot can be saved using ggsave
ggsave("grid_plot_1.png",
plot = img,
device = "png",
width = 6,
height = 4,
units = c("in"),
dpi = 600)
Another approach is to use the plot_grid function, which is in the cowplot library. Notice how the axes are now beautifully aligned.
img2=cowplot::plot_grid(l, d, h, nrow = 3, align = "v") # "v" aligns vertical axes and "h" aligns horizontal axes
ggsave("grid_plot_2.png",
plot = img2,
device = "png",
width = 6,
height = 4,
units = c("in"),
dpi = 600)
Some useful resources
The links below offer a treasure trove of examples and sample code to get you started.
The R Graph Gallery: https://www.r-graph-gallery.com/
Line plots in ggplot2: http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization
Top 50 visualizations with ggplot2: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
Practical guide in ggplot2: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization
CHAPTER 4: Advanced plotting with R
4.1. 3-D plots in R
In this section, we will use the plot3D package in R to make some 3-dimensional plots. More information on 3D plotting can be accessed from: https://cran.r-project.org/web/packages/plot3D/plot3D.pdf
# Load libraries
library(plot3D)
# Generate data from a normal distribution
set.seed(123)
x = rnorm(n = 1000, mean = 50, sd = 10)
y = rnorm(1000, 0, 10)
z = rnorm(1000, 0, 10)
par(mar = c(2, 2, 2, 2))
# Create a basic 3D scatter plot
scatter3D(x, y, z)
# Full box
scatter3D(x, y, z,
bty = "f", # The 'bty' argument sets the box type; "f" draws the full box
colvar = NULL, # Use "colvar = NULL" to avoid coloring by z variable
col = "blue",
pch = 19,
cex = .5)
# Gray background with white grid lines
scatter3D(x, y, z, bty = "g", colvar = NULL, col = "blue", pch = 19, cex = .5)
# Change the view direction
scatter3D(x, y, z,
phi = 0, # phi is the co-latitude
theta = 45, # theta specifies the azimuth rotation
bty ="g" ,
ticktype = "detailed", # ticktype "detailed" draws normal ticks and labels and "simple" draws arrows
col = gg.col(100), # Apply colors similar to ggplot
pch = 18)
# Create a 3D histogram
hist3D(z = matrix(x[1:25], nrow = 5, ncol = 5), # Matrix containing values to be plotted
x = 1:5, # Vector of length = nrow(z)
y = 1:5, # Vector of length = ncol(z)
scale = FALSE,expand = 0.1, bty = "g", phi = 20,
col = "lightblue", border = "black", shade = 0.2,space = 0.3, ltheta = 90,
ticktype = "detailed")
4.2. Interactive maps with Leaflet
In this section, we will use the leaflet package in R to make some interactive maps. These maps can be saved in HTML format. More information is available at: https://rstudio.github.io/leaflet/
Basemap
Leaflet uses map tiles as basemaps. OpenStreetMap tiles are the default and are added with addTiles().
Other basemaps can be added using addProviderTiles().
Try addProviderTiles(providers$Esri.NatGeoWorldMap).
We’ll also set the default view and zoom to Scoates Hall.
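The basemap code itself is not shown above. A minimal sketch follows, assuming the leaflet package is installed. The coordinates are the College Station values used in the cities table later in this chapter and serve only as a stand-in; replace them with the exact location you want to center on. The zoom level of 14 is likewise an assumption.

```r
library(leaflet)

# Stand-in coordinates for College Station, TX (illustrative only)
center_lng <- -96.3344
center_lat <- 30.6280

# Default OpenStreetMap basemap, centered and zoomed with setView()
m <- leaflet() %>%
  addTiles() %>%
  setView(lng = center_lng, lat = center_lat, zoom = 14)

# Same view with a different basemap from the "providers" list
m2 <- leaflet() %>%
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  setView(lng = center_lng, lat = center_lat, zoom = 14)

# Typing m or m2 at the console renders the interactive map in the viewer
```

Higher zoom values show a smaller area in more detail, so increase the value to focus on a single building.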
Adding labels
We’ll add some point locations of USGS stream gages on Buffalo Bayou to a data frame. textConnection() can be used to read data that is stored in an R source file rather than an external file.
df <- read.csv(textConnection(
"Station,Lat,Long
Buffalo Bayou at Houston,29.76023,-95.40855
Buffalo Bayou at Piney Point,29.74690,-95.52355
Buffalo Bayou at W Belt Dr at Houston,29.76217,-95.55772
Buffalo Bayou nr Addicks,29.76190,-95.60578
Buffalo Bayou nr Katy,29.74329,-95.80689"))
df
##                                 Station      Lat      Long
## 1              Buffalo Bayou at Houston 29.76023 -95.40855
## 2          Buffalo Bayou at Piney Point 29.74690 -95.52355
## 3 Buffalo Bayou at W Belt Dr at Houston 29.76217 -95.55772
## 4              Buffalo Bayou nr Addicks 29.76190 -95.60578
## 5                 Buffalo Bayou nr Katy 29.74329 -95.80689
# Now we can plot them on open street map as markers using "addMarkers"
leaflet(df) %>%
addTiles() %>%
addMarkers(~Long, ~Lat, label = ~Station)
Marker Clusters
We can also combine the events into clusters based on location. We will use the quakes dataset that is preloaded in R and gives the locations of 1000 seismic events of MB > 4.0 near Fiji since 1964.
data(quakes)
leaflet(quakes) %>%
addTiles() %>%
addMarkers(~long, ~lat, clusterOptions = markerClusterOptions())
Add circles
To plot circles scaled by city population, we’ll first create the data frame using a text connection directly in the R source file.
cities <- read.csv(textConnection(
"City,Lat,Long,Pop
Austin,30.2672,-97.7431,950715
Houston,29.7604,-95.3698,2313000
College Station,30.6280,-96.3344,113564
Dallas,32.7767,-96.7970,1341000
San Antonio,29.4241,-98.4936,1493000"))
leaflet(cities) %>%
addTiles() %>%
addCircles(lng = ~Long, lat = ~Lat, weight = 1,
radius = ~sqrt(Pop) * 30, popup = ~as.character(Pop))
CHAPTER 5: Text mining with R
5.1. Creating a Word Cloud
Import libraries and sample dataset
For this section, we will create a word cloud from a set of PDF documents. These files can be downloaded manually from GitHub at https://github.com/Vinit-Sehgal/Data-Part1
Or, you can run the following code to download them from the GitHub repository.
###############################################################
#~~~ Import sample data from GitHub repository
download.file(url = "https://github.com/Vinit-Sehgal/Data-Part1/archive/main.zip",
destfile = "Data-Part1-main.zip") # Download ".Zip"
# Unzip the downloaded .zip file
unzip(zipfile = "Data-Part1-main.zip")
getwd() # Working directory for this workshop
## [1] "G:/My Drive/TAMU/Teaching/DataVisWorkshop/DataVisWorkshop2020/revision2"
## [1] "grid_plot_1.png" "grid_plot_2.png"
## [3] "I've Been to the Mountaintop.pdf" "Martin Luther King Speech.pdf"
## [5] "Matin Luther King Speech.pdf" "OurGodIsMarchingOn.pdf"
## [7] "README.md" "updated_words_df.csv"
## [9] "Workbook_DVGAR-Part1.html" "Workbook_DVGAR-Part2.html"
###############################################################
#~~~ Load required libraries
lib_names=c("pdftools","tm","SnowballC","wordcloud","wordcloud2",
"RColorBrewer","webshot","htmlwidgets")
Install all necessary packages (run once). If you see the prompt "Do you want to restart R prior to installing?", select No.
invisible(suppressMessages
(suppressWarnings
(lapply
(lib_names,install.packages,repos="https://cran.r-project.org"))))
# Load necessary packages
invisible(suppressMessages
(suppressWarnings
(lapply
(lib_names,library, character.only = T))))
Create a vector of PDF file names using the list.files function. The pattern argument is used to grab only files ending with “pdf”.
Note: ensure that your working directory is set to the folder containing the PDF files for this workshop.
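The call that builds the vector of file names is not shown. The sketch below demonstrates the pattern argument on a temporary folder with made-up file names, so it runs without the workshop data; the commented lines show what the call for the workshop files might look like, depending on where the PDFs are stored.

```r
# Self-contained demonstration in a temporary folder (hypothetical file names)
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("speech_a.pdf", "speech_b.pdf", "README.md")))

# "pdf$" is a regular expression: "$" anchors the match to the end of the name
demo_files <- list.files(path = demo_dir, pattern = "pdf$")
demo_files

# For the workshop data, the equivalent call would be along the lines of:
# files <- list.files(pattern = "pdf$")
# or, if the PDFs are still inside the unzipped folder:
# files <- list.files(path = "Data-Part1-main", pattern = "pdf$", full.names = TRUE)
```

Using full.names = TRUE returns paths rather than bare names, which is needed when the files are not in the working directory.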
Create a list object with a number, x, of elements, one for each document
Speeches <- lapply(files, pdf_text)
#lapply to apply pdf_text function
#The pdf_text function in pdftools is used to extract text
length(Speeches)
## [1] 4
## [[1]]
## [1] 7
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 1
Create a corpus or a database of text
Corpora are collections of documents containing (natural language) text.
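The code that creates the corpus object corp is not shown. As a hedged sketch, the example below builds a small corpus from two made-up sentences with the tm package; the commented line shows one way a corpus could be read directly from the PDF files, which is an assumption about how corp was created here.

```r
library(tm)

# Two short, made-up documents standing in for the speeches
toy_text <- c("Let freedom ring from every hill.",
              "I have a dream today.")

toy_corp <- VCorpus(VectorSource(toy_text))  # Build a corpus from a character vector
n_docs <- length(toy_corp)                   # Number of documents in the corpus
first_doc <- as.character(toy_corp[[1]])     # Text of the first document
n_docs
first_doc

# For the workshop PDFs, a corpus can be read directly from the files, e.g.:
# corp <- Corpus(URISource(files), readerControl = list(reader = readPDF))
```

Reading from URISource keeps the file names as document identifiers, which matches the document names seen in the matrices printed later in this section.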
Create a TermDocumentMatrix (TDM)
A term-document matrix is a table containing the frequency of the words. Row names are terms (words) and column names are documents. The function TermDocumentMatrix() from the tm text mining package can be used as follows:
speeches.tdm <- tm::TermDocumentMatrix(corp,
control =
list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
#A TDM stores counts of terms for each document
View the (TDM)
## <<TermDocumentMatrix (terms: 30, documents: 4)>>
## Non-/sparse entries: 32/88
## Sparsity : 73%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms I've Been to the Mountaintop.pdf Martin Luther King Speech.pdf
## ’tis 0 1
## “and 5 0
## “bull 1 0
## “dear 1 0
## “for 0 1
## “free 0 1
## abernathi 2 0
## abl 0 5
## act 2 0
## afternoon 1 0
## Docs
## Terms Matin Luther King Speech.pdf OurGodIsMarchingOn.pdf
## ’tis 0 0
## “and 0 0
## “bull 0 0
## “dear 0 0
## “for 0 0
## “free 0 0
## abernathi 0 0
## abl 6 0
## act 0 0
## afternoon 0 1
Some words are still preceded by double quotes and dashes even though we specified removePunctuation = TRUE. To take care of this:
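The cleaning step is not shown. One way to remove curly quotes and dashes is to apply removePunctuation to the corpus itself with ucp = TRUE, which matches Unicode punctuation as well as ASCII punctuation. The sketch below uses a made-up sentence; the commented line shows how the same step might be applied to corp, and whether the workshop used exactly this call is an assumption.

```r
library(tm)

# Made-up text containing curly quotes and a dash, written as Unicode escapes
toy_corp <- VCorpus(VectorSource("\u201cfree\u201d at last \u2014 free at last!"))

# ucp = TRUE uses Unicode character properties, so non-ASCII punctuation is removed too
toy_clean <- tm_map(toy_corp, removePunctuation, ucp = TRUE)
res <- as.character(toy_clean[[1]])
res

# Applied to the workshop corpus, the step would be:
# corp <- tm_map(corp, removePunctuation, ucp = TRUE)
```

Once punctuation has been removed from the corpus, the removePunctuation option is no longer needed when the matrix is rebuilt.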
Recreate the TDM without the removePunctuation = TRUE argument
speeches.tdm <- tm::TermDocumentMatrix(corp,
control =
list(stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
View the (TDM) again
## <<TermDocumentMatrix (terms: 30, documents: 4)>>
## Non-/sparse entries: 40/80
## Sparsity : 67%
## Maximal term length: 11
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms I've Been to the Mountaintop.pdf Martin Luther King Speech.pdf
## abernathi 2 0
## abl 0 5
## act 2 0
## afternoon 1 0
## agenda 3 0
## ago 1 1
## ahead 1 1
## aint 1 0
## alabama 3 2
## allow 7 2
## Docs
## Terms Matin Luther King Speech.pdf OurGodIsMarchingOn.pdf
## abernathi 0 0
## abl 6 0
## act 0 0
## afternoon 0 1
## agenda 0 0
## ago 0 0
## ahead 0 0
## aint 0 1
## alabama 2 2
## allow 0 0
Fix Misspelled Words
#Before creating the wordcloud, we will edit the words manually by creating a .csv file
words=as.matrix(sort(apply(speeches.tdm, 1, sum), decreasing = TRUE))
words_df=data.frame(Word=as.character(rownames(words)),count= as.numeric(words))
head(words_df)
## Word count
## 1 will 49
## 2 let 34
## 3 day 32
## 4 freedom 30
## 5 now 29
## 6 say 26
write.csv(words_df,"words_df.csv", row.names = FALSE)
# Examine words_df.csv for possible incomplete words or errors and then let's load the updated .csv file back into R
words_df <- read.csv("./Data-Part1-main/updated_words_df.csv")
Generate the Word Cloud
The importance of words can be illustrated as a word cloud as follows:
set.seed(1234)
wordcloud(words = words_df$Word, freq = words_df$count, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The above word cloud shows that “will”, “now”, “say”, “freedom” and “together” are some of the most used words in the speeches by Martin Luther King.
Another way to visually design the word cloud
wc = wordcloud2(words_df,size=0.5,
shape='rectangle',
backgroundColor='gray98',
color = "random-light",fontFamily="Segoe UI")
webshot::install_phantomjs()
## It seems that the version of `phantomjs` installed is greater than or equal to the requested version.To install the requested version or downgrade to another version, use `force = TRUE`.
saveWidget(wc,"wordcloud.html",selfcontained = F)
webshot::webshot("wordcloud.html","wordcloud1.png",
vwidth = 500, vheight =500, delay = 5)
Another way to visually design the wordcloud
wc2=wordcloud2(words_df, size =0.6, minRotation = -pi/8,
maxRotation = -pi/8, rotateRatio = 1,
shape = 'circle')
saveWidget(wc2,"wordcloud2.html",selfcontained = F)
webshot::webshot("wordcloud2.html","wordcloud2.png",
vwidth = 500, vheight = 500, delay = 5)
Try out some fancier word clouds
wordcloud2(words_df, color = "random-light", backgroundColor = "grey")
wordcloud2(words_df, color = "random-light", backgroundColor = "black")
5.2. Analyzing term usage
Explore frequent terms and their associations
## [1] "abl" "alabama" "allow" "alway" "america" "american"
You can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function. The R code below identifies which words are associated with “freedom” in the speeches:
The frequency table of words
## true dream mountain mississippi allegheni becom
## 0.99 0.98 0.98 0.97 0.94 0.94
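The calls behind the two outputs above are not shown. The sketch below demonstrates findFreqTerms() and findAssocs() from the tm package on a tiny made-up corpus; the thresholds in the commented lines are placeholders, not the values used in the workshop.

```r
library(tm)

# Three short, made-up documents
toy_corp <- VCorpus(VectorSource(c("freedom ring freedom ring",
                                   "freedom dream",
                                   "march long")))
toy_tdm <- TermDocumentMatrix(toy_corp)

# Terms that occur at least twice across all documents
freq_terms <- findFreqTerms(toy_tdm, lowfreq = 2)
freq_terms

# Terms whose per-document counts correlate with "freedom" at 0.5 or higher
assoc <- findAssocs(toy_tdm, terms = "freedom", corlimit = 0.5)
assoc

# On the workshop TDM the calls would look like this (thresholds are placeholders):
# findFreqTerms(speeches.tdm, lowfreq = 10)
# findAssocs(speeches.tdm, terms = "freedom", corlimit = 0.9)
```

With only a handful of documents, correlations can be very high by chance, so associations from a small corpus should be read with caution.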
Find Terms with Specific Frequency
To see the counts of words with a specified frequency, we could save the words using findFreqTerms and use it to subset the TDM. Notice we have to use as.matrix to see the printout of the subsetted TDM.
## Docs
## Terms I've Been to the Mountaintop.pdf Martin Luther King Speech.pdf
## around 19 0
## day 14 8
## dream 2 10
## freedom 5 13
## god 14 3
## let 12 11
## long 5 6
## now 28 0
## one 7 10
## ring 0 11
## say 21 1
## stop 20 0
## will 14 13
## Docs
## Terms Matin Luther King Speech.pdf OurGodIsMarchingOn.pdf
## around 0 1
## day 10 0
## dream 8 0
## freedom 12 0
## god 1 2
## let 10 1
## long 0 13
## now 0 1
## one 8 0
## ring 10 0
## say 0 4
## stop 0 1
## will 15 7
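The subsetting code for the table above is not shown. As a sketch under the same assumptions as before (a tiny made-up corpus and an arbitrary threshold), the frequent terms returned by findFreqTerms() can be used as row names to subset the matrix:

```r
library(tm)

# Three short, made-up documents
toy_corp <- VCorpus(VectorSource(c("freedom ring freedom ring",
                                   "freedom dream",
                                   "march long")))
toy_tdm <- TermDocumentMatrix(toy_corp)

keep <- findFreqTerms(toy_tdm, lowfreq = 2)   # Names of the frequent terms
sub_counts <- as.matrix(toy_tdm[keep, ])      # Subset rows by term name, then convert for printing
sub_counts
```

The TDM is stored in a sparse format, which is why as.matrix() is needed before the counts can be printed as an ordinary table.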
Plot word frequencies
The frequencies of the 10 most frequent words are plotted:
barplot(words_df$count[1:10], las = 2, names.arg = words_df$Word[1:10],
col ="lightblue", main ="Most frequent words",
ylab = "Word frequencies")
Supplementary material
Updating R using RStudio
The operations in this tutorial are based on R version 4.0.3 (Bunny-Wunnies Freak Out). If necessary, update R from RStudio using the updateR function from the installr package (Windows only).
install.packages("installr")
library(installr)
updateR(keep_install_file=TRUE)