Data Visualization

Histogram
Box Plot
Scatter Plot
Interactive Scatter Plot using plotly
Scatterplot Matrix
Barplot Based on Individual Data
Barplot Based on Summary Data
Time Series Plot
Heatmap
Tree Maps
Visualizing Geographic Data
Network Graph (Example: Random graph for illustration)
Arrange Plots in a Matrix

Histogram

We use the data frame mtcars, which ships with R, to demonstrate a few commonly used graphs. Its 32 rows are displayed below:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
Merc 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3	3
Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3	3
Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3	3
Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4
Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3	4
Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4
Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3	1
Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

The data can also be displayed with the code below using the DT package:

library(DT)
datatable(mtcars)

Now, we create a histogram for the mpg column. Histograms and box plots apply only to numeric variables.

library(ggplot2)

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 5, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of MPG",
       x = "Miles Per Gallon (MPG)",
       y = "Frequency")

Explanation of the above code:

ggplot(mtcars, aes(x = mpg)): Specifies the data frame (mtcars) and the aesthetics mapping. The x-axis (x) represents Miles Per Gallon (mpg).
geom_histogram(bins = 5, fill = "blue", color = "black", alpha = 0.7): Adds a histogram layer to the plot. bins sets the number of bins, fill sets the fill color of the bars, color sets the border color of the bars, and alpha controls the transparency of the bars.
labs(title = "Histogram of MPG", x = "Miles Per Gallon (MPG)", y = "Frequency"): Sets the title and axis labels for the plot.

Interpretation of results:

The histogram provides a visual representation of the distribution of MPG in the mtcars dataset.
The x-axis represents the range of MPG values, divided into five bins (as specified by bins = 5).
The y-axis represents the frequency of observations within each bin.
The blue bars indicate the count (frequency) of observations in each MPG range, and the black borders add visual separation between the bars.
The bars are drawn at 70% opacity (alpha = 0.7), so the background grid lines show faintly through them.
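The counts behind a histogram can also be tabulated directly. The following is a minimal base R sketch, assuming five equal-width bins produced by cut(). ggplot2 positions its bin edges differently, so the counts may not match the plot above bar for bar. The names mpg_bins and bin_counts are chosen here for illustration.

```r
# Split mpg into 5 equal-width intervals and count the cars in each one
mpg_bins <- cut(mtcars$mpg, breaks = 5)
bin_counts <- table(mpg_bins)
print(bin_counts)

# Every one of the 32 cars falls into exactly one bin
sum(bin_counts)
```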

Box Plot

ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot of MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon (MPG)")

Interpretation of results:

The x-axis shows the number of cylinders, treated as a categorical variable. Each box plot corresponds to a different number of cylinders (e.g., 4, 6, 8).
The y-axis represents Miles Per Gallon (MPG).
The boxes in the plot provide a visual summary of the distribution of MPG for each category of the number of cylinders.
The box spans the interquartile range (IQR), with the line inside the box marking the median. Each whisker extends to the most extreme observation that lies within 1.5 times the IQR of the box edge.
Observations beyond the whiskers are drawn as individual points and are treated as outliers.
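The numbers that a box plot summarizes can be computed in base R. This is a sketch rather than a reproduction of ggplot2's internals: fivenum() and boxplot.stats() use hinges, which can differ slightly from the quartiles ggplot2 computes. The object names are illustrative.

```r
# Five-number summary of mpg for each cylinder group
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
five_num <- sapply(mpg_by_cyl, fivenum)
rownames(five_num) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five_num)

# Observations beyond the whiskers (1.5 * IQR rule) in each group
outliers <- lapply(mpg_by_cyl, function(x) boxplot.stats(x)$out)
print(outliers)
```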

Scatter Plot

ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  geom_text(aes(label = 1:32), vjust = -0.5, hjust = -0.5) +
  labs(title = "Scatter Plot of MPG vs. Displacement",
       x = "Miles Per Gallon (MPG)",
       y = "Displacement")

Explanation of the above code:

ggplot(mtcars, aes(x = mpg, y = disp)): Specifies the data frame (mtcars) and the aesthetics mapping (x = mpg, y = disp) for the scatter plot.
geom_point(): Adds points to the scatter plot based on the specified aesthetics.
geom_text(aes(label = 1:32), vjust = -0.5, hjust = -0.5): Adds text labels to each point with the labels being the row numbers (1 to 32). vjust and hjust parameters control the vertical and horizontal alignment of the labels.
labs(title = "Scatter Plot of MPG vs. Displacement", x = "Miles Per Gallon (MPG)", y = "Displacement"): Sets the title and axis labels for the plot.

Explanation of results:

The scatter plot displays points representing each observation in the mtcars dataset, where the x-coordinate is the Miles Per Gallon (mpg) and the y-coordinate is the Displacement (disp).
The geom_text layer adds labels to each point with the corresponding row numbers (1 to 32). This can be useful for identifying individual observations.
The title and axis labels provide context and information about the variables being plotted.
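To identify the car behind a numbered point, the row numbers can be matched to the row names of mtcars. A small sketch, where label_key is a hypothetical helper table introduced only for this illustration:

```r
# Lookup table: the row number printed on the plot and the car it refers to
label_key <- data.frame(label = seq_len(nrow(mtcars)),
                        car = rownames(mtcars),
                        mpg = mtcars$mpg,
                        disp = mtcars$disp)
head(label_key)

# Example: which car is the point labelled 20?
label_key[label_key$label == 20, ]
```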

Interactive Scatter Plot using plotly

library(plotly)

# We store a scatterplot in an object called p
p <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Displacement",
       x = "Miles Per Gallon (MPG)",
       y = "Displacement")

# Make the plot interactive
ggplotly(p)

Explanation of the above code:

This code uses ggplot to create a scatter plot (geom_point()) using the mtcars dataset.
The aesthetics (aes) mapping specifies that the x-axis (x) should represent “mpg” (Miles Per Gallon), and the y-axis (y) should represent “disp” (engine displacement).
Additional elements are added to the plot, such as a title and axis labels using the labs() function.
The ggplotly() function is applied to the ggplot object p. This function is part of the plotly library and converts the static ggplot object into an interactive plot.
The resulting plot supports zooming, panning, and other interactive features when viewed in a web-based environment.

Scatterplot Matrix

pairs(mtcars[c("mpg", "disp", "hp", "wt")])

Barplot Based on Individual Data

ggplot(mtcars, aes(x = cyl)) +
  geom_bar() +
  labs(title = "Distribution of the Number of Cylinders",
       x = "Number of Cylinders") + 
  theme(plot.title = element_text(hjust = 0.5))

Explanation of the above code:

ggplot(mtcars, aes(x = cyl)): Initiates the creation of a ggplot object using the mtcars dataset. The aesthetic mapping (aes) specifies that the x-axis variable is “cyl” (number of cylinders).
geom_bar(): Adds a bar geometry to create a barplot. This function uses the default settings for creating a barplot based on the specified aesthetic mappings.
labs(title = "Distribution of the Number of Cylinders", x = "Number of Cylinders"): Sets the plot title and the x-axis label.
theme(plot.title = element_text(hjust = 0.5)): Customizes the theme of the plot. Specifically, it adjusts the horizontal justification (hjust) of the plot title to 0.5, which centers the title horizontally. The element_text() function is used to specify text-related properties.
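By default, geom_bar() counts the rows for each value of x, so the bar heights are the frequencies of each cylinder count. The same counts can be obtained in base R. This is a minimal sketch, and cyl_counts is an illustrative name.

```r
# Count the cars for each number of cylinders; these counts are the bar heights
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```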

Barplot Based on Summary Data

# Calculate average mpg for each number of cylinders
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Barplot
ggplot(avg_mpg, aes(x = cyl, y = mpg)) +
  geom_col() +
  labs(title = "Average MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Average Miles Per Gallon (MPG)") + 
  theme(plot.title = element_text(hjust = 0.5))

Time Series Plot

Suppose the stock price of 3M Company has a mean of 106 and a standard deviation of 0.27. The standard deviation measures how spread out the prices or returns of an asset are around their mean. It is the most widely used risk indicator in investing and finance.

We generate 100 monthly observations, starting in February 1998, from a normal distribution with mean 106 and standard deviation 0.27.

stock = ts(rnorm(100, 106, 0.27), frequency = 12, start = c(1998, 2))
stock

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998          105.6254 105.3020 106.0901 106.3761 105.5460 105.7923 106.1339
## 1999 105.8000 105.9569 106.1227 106.1920 106.3724 105.7462 105.5297 105.8498
## 2000 106.4730 105.7630 105.6875 105.7641 106.0937 105.7381 105.4277 106.0798
## 2001 106.0806 105.9288 105.4425 106.0464 105.7975 106.3408 105.6732 105.9791
## 2002 105.9559 105.8260 106.0372 106.0449 105.8435 105.8834 105.6386 105.8515
## 2003 105.8820 106.2117 106.0813 106.0965 105.6684 105.8549 105.9249 105.8704
## 2004 105.7583 106.3315 106.3935 106.0003 106.2597 106.0432 106.4831 106.1453
## 2005 105.8015 106.0931 105.7109 106.5101 106.4558 105.9197 105.5926 106.1594
## 2006 106.2911 105.7376 105.7855 106.3559 105.8911                           
##           Sep      Oct      Nov      Dec
## 1998 105.9876 105.7674 105.8960 105.9745
## 1999 106.2392 105.8667 106.1186 105.8361
## 2000 106.5948 105.7216 105.9366 106.2797
## 2001 106.3516 105.5821 105.8648 106.0571
## 2002 106.1627 105.8432 105.9771 105.9157
## 2003 105.8601 106.0661 106.2748 106.1002
## 2004 106.1806 106.3307 106.1268 106.1987
## 2005 105.9013 105.7073 105.4566 106.0048
## 2006
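Because rnorm() draws fresh random numbers on every run, rerunning the code will not reproduce the values printed above. Below is a minimal sketch of a reproducible version. It assumes an arbitrary seed of 42 (not the one behind the output above) and a hypothetical object name stock_seeded.

```r
# Fix the random seed so the simulated series is the same on every run
set.seed(42)  # arbitrary seed, chosen only for illustration
stock_seeded <- ts(rnorm(100, mean = 106, sd = 0.27),
                   frequency = 12, start = c(1998, 2))

# The series covers 100 months, from February 1998 to May 2006
start(stock_seeded)
end(stock_seeded)

# Sample mean and standard deviation should be close to 106 and 0.27
mean(stock_seeded)
sd(stock_seeded)
```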

Plot the time series:

library(forecast)
autoplot(stock)

Explanation of the above code:

The autoplot function draws a time series plot of the data stored in the stock object.
autoplot is a generic function: the forecast package supplies the method for ts objects, which draws the series as a line against time.
Because the result is a ggplot object, it can be customized further with layers such as labs().

To plot the monthly means, we do

monthly_stock_means = tapply(stock, cycle(stock), mean)
plot(monthly_stock_means, type = "l", xlab = "Month", ylab = "Mean Stock Price", main = "Monthly Means of 3M Stock Price")

Explanation of the above code:

The tapply function applies mean to the values of stock, grouped by cycle(stock). For a monthly series (frequency = 12), cycle() returns each observation's position in the year, from 1 for January to 12 for December. The result, monthly_stock_means, holds the mean stock price for each calendar month.
The plot function then draws a line plot of monthly_stock_means. type = "l" specifies a line plot. xlab and ylab set the labels for the x-axis and y-axis, respectively. main sets the title of the plot.
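The grouping done by cycle() is easier to see on a small series. The following sketch uses a toy series of the values 1 to 24 over two years. It is illustrative only and unrelated to the stock data.

```r
# Toy monthly series: the values 1..24 over two full years (illustrative only)
toy <- ts(1:24, frequency = 12, start = c(2000, 1))

# cycle() gives the position within the year: 1..12, then 1..12 again
cycle(toy)

# Mean per calendar month: January averages 1 and 13, February 2 and 14, ...
toy_means <- tapply(toy, cycle(toy), mean)
print(toy_means)
```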

Another way:

Month = cycle(stock)
monthly_stock_means = aggregate(stock~Month, FUN=mean)
monthly_stock_means$Month = factor(monthly_stock_means$Month, labels = month.abb)

ggplot(monthly_stock_means, aes(x = Month, y = stock)) +
  geom_line(group = 1)

Explanation of the above code:

Month = cycle(stock): cycle() returns the position of each observation within its yearly cycle. Because stock has frequency = 12, Month holds the month number (1 to 12) of each observation.
aggregate(stock~Month, FUN=mean): Calculates the mean of stock for each value of Month. The result is a data frame named monthly_stock_means with two columns: Month and stock, where stock is the mean stock price for that month.
factor(monthly_stock_means$Month, labels = month.abb): Converts Month to a factor labelled with the abbreviated month names (Jan, Feb, and so on), so that the x-axis of the plot shows month names.
ggplot(...) + geom_line(group = 1): Draws a line plot with Month on the x-axis and stock on the y-axis. Because Month is a factor, ggplot2 would otherwise treat each month as its own group and draw no line. group = 1 puts all points in one group so that they are connected by a single line.

Heatmap

correlation_matrix <- cor(mtcars)
heatmap(correlation_matrix, col = heat.colors(20), main = "Correlation Heatmap")

Explanation of the above code:

The cor() function computes the pairwise correlation coefficients between variables.
heatmap(): This function creates a heatmap visualizing the correlation matrix.
col = heat.colors(20): Sets the color palette for the heatmap. In this case, it uses a gradient of 20 colors from the heat.colors palette.
main = “Correlation Heatmap”: Specifies the main title of the heatmap as “Correlation Heatmap”.

Explanation of the colors in the heatmap:

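In a heatmap, color encodes the magnitude of each value. In a correlation heatmap, the colors indicate the strength and direction of the correlation between pairs of variables. Which color means what depends on the palette. With heat.colors(20), as used above, the lowest values are drawn in red and the highest in pale yellow to white. Note also that heatmap() centers and scales each row by default (scale = "row"), so the colors show row-standardized values; pass scale = "none" to color the raw correlation coefficients. Here is a general guide to interpreting the colors in a correlation heatmap:

Positive Correlation (High Values):

Color: One end of the color scale, for example blue or green in a diverging palette; pale yellow to white with heat.colors. Interpretation: High positive correlation. As one variable increases, the other variable tends to increase as well.

Negative Correlation (Low Values):

Color: The opposite end of the color scale, for example red or orange in a diverging palette; red with heat.colors. Interpretation: High negative correlation. As one variable increases, the other variable tends to decrease.

No Correlation (Near Zero Values):

Color: The middle of the color scale, which is neutral or close to white in a diverging palette and orange with heat.colors. Interpretation: Little to no linear correlation. Changes in one variable do not systematically predict changes in the other.

Perfect Correlation (Correlation Coefficient = 1 or -1):

Color: The two extreme ends of the color scale. Interpretation: Perfect positive correlation (at the high end of the scale) or perfect negative correlation (at the low end of the scale).

Intensity of Color:

In a diverging palette centered at zero, more saturated colors indicate stronger correlation and paler colors indicate weaker correlation. This rule does not hold for a sequential palette such as heat.colors, where the palest colors mark the highest values.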
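Colors give only a rough impression, so it often helps to read the correlation coefficients themselves. This is a minimal sketch restricted to four variables to keep the printout small. The names vars and cm are illustrative.

```r
# Correlations among four of the mtcars variables, rounded for readability
vars <- c("mpg", "disp", "hp", "wt")
cm <- cor(mtcars[, vars])
print(round(cm, 2))

# Variable with the strongest negative correlation with mpg
names(which.min(cm["mpg", ]))
```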

The heatmap can be also used to show missing values in a data frame.

data <- data.frame(x = c(2, 8, 9, NA, 3, NA, NA, 9),
                   y = c(6, NA, 9, 5, NA, 3, 3, NA),
                   z = c(6, NA, 5, 5, 8, NA, 9, 0))
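# Create a heatmap of the 0/1 missingness indicator.
# scale = "none" keeps the raw 0/1 values instead of standardizing each row.
heatmap(is.na(data)*1, scale = "none", col = c("white", "red"), main = "Missing Values Heatmap")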

Explanation of the above code:

The above code creates a data frame named data with three variables: x, y, and z. Each variable has a sequence of numeric values and some missing values denoted by NA.
is.na(data) returns a matrix of the same dimensions as data, where each element is TRUE if the corresponding element in data is NA, and FALSE otherwise.
*1 (multiplied by 1) is used to convert TRUE and FALSE values to numeric values (1 for TRUE, 0 for FALSE).
The col argument specifies the color palette for the heatmap. In this case, missing values (1 in the matrix) are colored red, and non-missing values (0 in the matrix) are white.
main sets the title of the heatmap to “Missing Values Heatmap”.
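The indicator matrix passed to heatmap() can be inspected on its own, and its row and column sums give the number of missing values per observation and per variable. A short sketch, where na_matrix is an illustrative name:

```r
data <- data.frame(x = c(2, 8, 9, NA, 3, NA, NA, 9),
                   y = c(6, NA, 9, 5, NA, 3, 3, NA),
                   z = c(6, NA, 5, 5, 8, NA, 9, 0))

# 0/1 indicator matrix: 1 marks a missing value
na_matrix <- is.na(data) * 1
print(na_matrix)

# Number of missing values per variable and per observation
colSums(na_matrix)
rowSums(na_matrix)
```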

Explanation of the results:

The dendrogram on the left side of the heatmap represents the clustering of observations (rows) based on the similarity of their missing value patterns across the variables (x, y, z). Observations with more similar patterns are grouped together in branches of the dendrogram. The height at which branches join indicates the level of dissimilarity among observations.
The dendrogram at the top of the heatmap represents the clustering of variables (x, y, z) based on the similarity of their missing value patterns. Variables with more similar patterns are grouped together in branches of the dendrogram. As in the left dendrogram, the height at which branches join in the top dendrogram indicates the level of dissimilarity among variables.

Tree Maps

The following code creates a tree map visualization of the GNI (Gross National Income) per capita data for the year 2014.

library(treemap)
data(GNI2014)

treemap(GNI2014,
        index=c("continent", "iso3"),  # You can try index=c("iso3") to see the difference
        vSize="population",
        vColor="GNI",
        type="manual", 
        palette = "RdYlGn"
       )

Let's break down the code:

data(GNI2014): Loads the GNI2014 dataset that ships with the treemap package. It has one row per country, with the columns iso3, country, continent, population, and GNI (Gross National Income per capita).
treemap(GNI2014, index = c("continent", "iso3"), vSize = "population", vColor = "GNI", type = "manual", palette = "RdYlGn"):
GNI2014: The dataset to be visualized.
index = c("continent", "iso3"): Specifies the hierarchical structure of the treemap. The treemap will be divided first by continent and then by ISO3 code within each continent.
vSize = "population": The variable used to determine the size of each rectangle in the treemap. In this case, it's the "population" variable.
vColor = "GNI": The variable used to determine the color of each rectangle in the treemap. In this case, it's the "GNI" variable.
type = "manual": Specifies how vColor is turned into colors. With "manual", the numeric GNI values are mapped directly onto the palette, by default linearly from the smallest to the largest value.
palette = "RdYlGn": Specifies the color palette to be used for coloring the rectangles in the treemap. Here, it uses the Red-Yellow-Green color palette.

Interpretation of results:

The treemap visualizes the Gross National Income (GNI) per capita data for the year 2014.
Each rectangle in the treemap represents a country or region.
The rectangles are divided first by continent and then by ISO3 code within each continent.
The size of each rectangle corresponds to the population of the country or region.
The color of each rectangle corresponds to the GNI per capita, with a color gradient from red to yellow to green.

This treemap provides a hierarchical and visually appealing representation of GNI per capita data, allowing for quick comparisons and insights into the distribution of GNI per capita and population across continents and countries.
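The link between vSize and rectangle area can be sketched with a few made-up numbers. The data below are invented for illustration and are not taken from GNI2014. The names toy_countries and continent_share are hypothetical.

```r
# Made-up toy data (NOT from GNI2014): three countries on two continents
toy_countries <- data.frame(continent = c("A", "A", "B"),
                            iso3 = c("AAA", "AAB", "BBA"),
                            population = c(50, 30, 20))

# Share of the total treemap area given to each country (vSize = population)
toy_countries$area_share <- toy_countries$population / sum(toy_countries$population)

# Share of the area given to each continent: the sum over its countries
continent_share <- tapply(toy_countries$area_share, toy_countries$continent, sum)
print(toy_countries)
print(continent_share)
```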

Visualizing Geographic Data

There are many R packages, such as plotly and leaflet, for visualizing geographic data. We introduce the plot_ly() function from the plotly package.

library(plotly)

# Sample data: Store locations and sales
store_data <- data.frame(Store = c("Store A", "Store B", "Store C"),
                         Latitude = c(37.7749, 34.0522, 40.7128),
                         Longitude = c(-122.4194, -118.2437, -74.0060),
                         Sales = c(12000, 15000, 18000))

# Create an interactive map using Plotly
map <- plot_ly(data = store_data, type = "scattergeo", mode = "markers",
               lat = ~Latitude, lon = ~Longitude, text = ~Store,
               marker = list(size = ~Sales / 1000, color = ~Sales, colorscale = "Viridis"))

# Customize map layout and annotation
map %>% layout(
  geo = list(showland = TRUE),
  title = "Interactive Map of Store Locations and Sales",
  annotations = list(
    list(x = 0.5, y = -0.1, text = "Marker size indicates sales amount", showarrow = FALSE),
    list(x = 0.5, y = -0.15, text = "Hover over markers to view store details", showarrow = FALSE)
  )
)

Network Graph (Example: Random graph for illustration)

This is for fun.

library(igraph)

set.seed(123)
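# Random graph with 10 vertices; each possible edge is included with probability 0.3.
# sample_gnp() replaces the deprecated erdos.renyi.game().
graph <- sample_gnp(10, p = 0.3)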
plot(graph, layout = layout_with_fr)

Arrange Plots in a Matrix

We will use the R package “gridExtra”.

library(gridExtra)

# Create individual plots
plot1 <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Displacement")

plot2 <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Horsepower")

plot3 <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Weight")

plot4 <- ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Quarter Mile Time")

# Arrange the plots in a 2x2 matrix
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)

Explanation of the above code:

The above code creates four scatter plots using the ggplot2 library, each comparing the “mpg” (Miles Per Gallon) variable with a different variable from the mtcars dataset (disp, hp, wt, qsec).
The grid.arrange function is then used to arrange these four plots in a 2x2 matrix.
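With four plots and ncol = 2, the grid has two rows, and by default the plots are placed row by row. The placement can be pictured with a small character matrix. This is only an illustration of the layout, not a call to gridExtra, and the object names are hypothetical.

```r
# Positions of the four plots in the 2x2 grid, filled row by row
plot_names <- c("plot1", "plot2", "plot3", "plot4")
grid_positions <- matrix(plot_names, ncol = 2, byrow = TRUE)
print(grid_positions)
```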

Data Visualization

SZhang

1/17/2024
