We use the “mtcars” data frame, which is built into R, to demonstrate how to create a few commonly used graphs. The data are displayed below:
Model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
The data can also be displayed with the code below using the DT package:
library(DT)
datatable(mtcars)
Now, we create a histogram for the mpg column. Note that a histogram or boxplot is only for numeric data.
library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 5, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of MPG",
       x = "Miles Per Gallon (MPG)",
       y = "Frequency")
Explanation of the above code:
ggplot(mtcars, aes(x = mpg)): Specifies the data frame (mtcars) and the aesthetics mapping. The x-axis (x) represents Miles Per Gallon (mpg).
geom_histogram(bins = 5, fill = "blue", color = "black", alpha = 0.7): Adds a histogram layer to the plot. bins sets the number of bins, fill sets the fill color of the bars, color sets the border color of the bars, and alpha controls the transparency of the bars.
labs(title = "Histogram of MPG", x = "Miles Per Gallon (MPG)", y = "Frequency"): Sets the title and axis labels for the plot.
Interpretation of results:
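As a starting point for interpretation, the numbers behind the histogram can be inspected directly. The sketch below uses only base R: it confirms that the columns are numeric (the requirement noted earlier) and prints a numeric summary of mpg. The variable names numeric_cols, mpg_summary, and mpg_range are illustrative and not part of the original code.

```r
# Check which columns of mtcars are numeric
# (histograms and boxplots need numeric data)
numeric_cols <- sapply(mtcars, is.numeric)
print(numeric_cols)

# Numeric summary of mpg: minimum, quartiles, mean, maximum
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Range of mpg, which the 5 histogram bins have to cover
mpg_range <- range(mtcars$mpg)
print(mpg_range)
```

Every column of mtcars is stored as numeric, including coded variables such as cyl and am, which is why the box plot that follows converts cyl with as.factor().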
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot of MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon (MPG)")
Interpretation of results:
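The values drawn in the box plot can be checked numerically. This is a minimal base R sketch, not part of the original code; the names mpg_by_cyl and mpg_medians are illustrative.

```r
# Summary statistics of mpg within each cylinder group
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, summary)
print(mpg_by_cyl)

# Medians only: these correspond to the thick line inside each box
mpg_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(mpg_medians)
```

The median mpg falls as the number of cylinders rises (26.0 for 4 cylinders, 19.7 for 6, and 15.2 for 8), which matches the downward shift of the boxes from left to right.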
ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  geom_text(aes(label = 1:32), vjust = -0.5, hjust = -0.5) +
  labs(title = "Scatter Plot of MPG vs. Displacement",
       x = "Miles Per Gallon (MPG)",
       y = "Displacement")
Explanation of the above code:
ggplot(mtcars, aes(x = mpg, y = disp)): Specifies the data frame (mtcars) and the aesthetics mapping (x = mpg, y = disp) for the scatter plot.
geom_point(): Adds points to the scatter plot based on the specified aesthetics.
geom_text(aes(label = 1:32), vjust = -0.5, hjust = -0.5): Adds text labels to each point with the labels being the row numbers (1 to 32). vjust and hjust parameters control the vertical and horizontal alignment of the labels.
labs(title = "Scatter Plot of MPG vs. Displacement", x = "Miles Per Gallon (MPG)", y = "Displacement"): Sets the title and axis labels for the plot.
Explanation of results:
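Two quick checks help in reading the scatter plot: which car each numeric label refers to, and how strong the linear association is. The following is a base R sketch; label_key and mpg_disp_cor are illustrative names that do not appear in the original code.

```r
# Lookup table from the numeric labels on the plot to car models
label_key <- data.frame(label = seq_len(nrow(mtcars)),
                        model = rownames(mtcars))
print(head(label_key))

# Pearson correlation between mpg and displacement
mpg_disp_cor <- cor(mtcars$mpg, mtcars$disp)
print(mpg_disp_cor)
```

The correlation is strongly negative (about -0.85), consistent with the downward trend in the plot: cars with larger displacement tend to have lower mpg.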
library(plotly)
# We store a scatterplot in an object called p
p <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Displacement",
       x = "Miles Per Gallon (MPG)",
       y = "Displacement")
# Make the plot interactive
ggplotly(p)
Explanation of the above code:
pairs(mtcars[c("mpg", "disp", "hp", "wt")])
ggplot(mtcars, aes(x = cyl)) +
  geom_bar() +
  labs(title = "Distribution of the Number of Cylinders",
       x = "Number of Cylinders") +
  theme(plot.title = element_text(hjust = 0.5))
Explanation of the above code:
ggplot(mtcars, aes(x = cyl)): Initiates the creation of a ggplot object using the mtcars dataset. The aesthetic mapping (aes) specifies that the x-axis variable is “cyl” (number of cylinders).
geom_bar(): Adds a bar geometry to create a barplot. This function uses the default settings for creating a barplot based on the specified aesthetic mappings.
labs(title = "Distribution of the Number of Cylinders", x = "Number of Cylinders"): Customizes plot labels and title. Sets the title of the plot to "Distribution of the Number of Cylinders" and labels the x-axis as "Number of Cylinders".
theme(plot.title = element_text(hjust = 0.5)): Customizes the theme of the plot. Specifically, it adjusts the horizontal justification (hjust) of the plot title to 0.5, which centers the title horizontally. The element_text() function is used to specify text-related properties.
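By default geom_bar() counts the rows for each x value, so the bar heights can be reproduced with table(). This is a small base R sketch; cyl_counts and cyl_props are illustrative names.

```r
# Count the cars in each cylinder group; these are the bar heights
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# The same counts expressed as proportions of all 32 cars
cyl_props <- prop.table(cyl_counts)
print(round(cyl_props, 2))
```

The counts are 11, 7, and 14 cars for 4, 6, and 8 cylinders respectively.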
# Calculate average mpg for each number of cylinders
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
# Barplot
ggplot(avg_mpg, aes(x = cyl, y = mpg)) +
  geom_col() +
  labs(title = "Average MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Average Miles Per Gallon (MPG)") +
  theme(plot.title = element_text(hjust = 0.5))
Suppose the stock price of 3M Company has a mean of 106 and a standard deviation of 1.26. The standard deviation measures how spread out the prices or returns of an asset are around their mean, and it is one of the most widely used risk indicators in investing and finance.
We generate 100 monthly observations, starting in February 1998, from a normal distribution with mean 106. Note that the code below uses a standard deviation of 0.27 rather than 1.26.
stock = ts(rnorm(100, 106, 0.27), frequency = 12, start = c(1998, 2))
stock
## Jan Feb Mar Apr May Jun Jul Aug
## 1998 105.6254 105.3020 106.0901 106.3761 105.5460 105.7923 106.1339
## 1999 105.8000 105.9569 106.1227 106.1920 106.3724 105.7462 105.5297 105.8498
## 2000 106.4730 105.7630 105.6875 105.7641 106.0937 105.7381 105.4277 106.0798
## 2001 106.0806 105.9288 105.4425 106.0464 105.7975 106.3408 105.6732 105.9791
## 2002 105.9559 105.8260 106.0372 106.0449 105.8435 105.8834 105.6386 105.8515
## 2003 105.8820 106.2117 106.0813 106.0965 105.6684 105.8549 105.9249 105.8704
## 2004 105.7583 106.3315 106.3935 106.0003 106.2597 106.0432 106.4831 106.1453
## 2005 105.8015 106.0931 105.7109 106.5101 106.4558 105.9197 105.5926 106.1594
## 2006 106.2911 105.7376 105.7855 106.3559 105.8911
## Sep Oct Nov Dec
## 1998 105.9876 105.7674 105.8960 105.9745
## 1999 106.2392 105.8667 106.1186 105.8361
## 2000 106.5948 105.7216 105.9366 106.2797
## 2001 106.3516 105.5821 105.8648 106.0571
## 2002 106.1627 105.8432 105.9771 105.9157
## 2003 105.8601 106.0661 106.2748 106.1002
## 2004 106.1806 106.3307 106.1268 106.1987
## 2005 105.9013 105.7073 105.4566 106.0048
## 2006
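The frequency and start arguments of ts() determine how the 100 values are mapped to calendar months. The sketch below checks this with base R accessor functions. It builds its own series, stock_demo, with a seed chosen only for reproducibility (both are assumptions for illustration), so its values differ from the output printed above.

```r
set.seed(1)  # assumption: seed chosen only to make this sketch reproducible
stock_demo <- ts(rnorm(100, mean = 106, sd = 0.27),
                 frequency = 12, start = c(1998, 2))

print(start(stock_demo))      # first period: year 1998, month 2
print(end(stock_demo))        # last period: year 2006, month 5
print(frequency(stock_demo))  # 12 observations per year

# Extract a sub-period, here calendar year 1999
stock_1999 <- window(stock_demo, start = c(1999, 1), end = c(1999, 12))
print(stock_1999)
```

Because the series starts in February 1998 and has 100 monthly values, it ends in May 2006, which is why the first and last rows of the printed table are incomplete.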
Plot the time series:
library(forecast)
autoplot(stock)
Explanation of the above code:
To plot the monthly means, we do
monthly_stock_means = tapply(stock, cycle(stock), mean)
plot(monthly_stock_means, type = "l", xlab = "Month", ylab = "Mean Stock Price", main = "Monthly Means of 3M Stock Price")
Explanation of the above code:
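In the code above, cycle() gives each observation's position within the year, tapply() averages the series within each position, and plot() draws the 12 resulting means as a line. The sketch below traces those steps on a separately simulated series; stock_demo, month_index, obs_per_month, and monthly_means are illustrative names, and the seed is an assumption made for reproducibility.

```r
set.seed(1)  # assumption: seed chosen only to make this sketch reproducible
stock_demo <- ts(rnorm(100, mean = 106, sd = 0.27),
                 frequency = 12, start = c(1998, 2))

# cycle() returns the position within the year (1 = Jan, ..., 12 = Dec);
# as.vector() drops the time-series attributes
month_index <- as.vector(cycle(stock_demo))
print(month_index[1:12])  # the series starts in February, so this begins at 2

# Number of observations that contribute to each monthly mean
obs_per_month <- table(month_index)
print(obs_per_month)

# Mean of the series for each calendar month
monthly_means <- tapply(as.vector(stock_demo), month_index, mean)
names(monthly_means) <- month.abb
print(round(monthly_means, 3))
```

February through May each have 9 observations and the other months have 8, because the series runs from February 1998 to May 2006.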
Another way:
Month = cycle(stock)
monthly_stock_means = aggregate(stock~Month, FUN=mean)
monthly_stock_means$Month = factor(monthly_stock_means$Month, labels = month.abb)
ggplot(monthly_stock_means, aes(x = Month, y = stock)) +
  geom_line(group = 1)
Explanation of the above code:
correlation_matrix <- cor(mtcars)
heatmap(correlation_matrix, col = heat.colors(20), main = "Correlation Heatmap")
Explanation of the above code:
The cor() function computes the pairwise correlation coefficients between variables.
heatmap(): This function creates a heatmap visualizing the correlation matrix.
col = heat.colors(20): Sets the color palette for the heatmap. In this case, it uses a gradient of 20 colors from the heat.colors palette.
main = "Correlation Heatmap": Specifies the main title of the heatmap as "Correlation Heatmap".
Explanation of the colors in the heatmap:
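In a heatmap, color encodes the magnitude of each cell. In a correlation heatmap, the colors indicate the strength and direction of the correlation between pairs of variables. How to read them depends on the palette and the scaling: heat.colors(20), used above, is a sequential palette running from red (lowest values) through orange and yellow to near white (highest values), and heatmap() standardizes each row by default (scale = "row") unless scale = "none" is set. The list below is a general guide for a typical diverging palette rather than a description of the plot above: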
Color: Usually a shade of blue or green. Interpretation: High positive correlation. As one variable increases, the other variable tends to increase as well.
Color: Usually a shade of red or orange. Interpretation: High negative correlation. As one variable increases, the other variable tends to decrease.
Color: Neutral or close to white. Interpretation: Little to no correlation. Changes in one variable do not systematically predict changes in the other.
Color: The most intense shades of color (brightest). Interpretation: Perfect positive correlation (if the color is in the positive range) or perfect negative correlation (if the color is in the negative range).
Dark Colors: Stronger correlation. Light Colors: Weaker correlation.
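One way to make the plot follow the color guide above is to use a diverging palette with a fixed color range and to switch off row scaling. The following is a sketch using base R only; the palette colors and the name diverging_palette are illustrative choices, not part of the original code.

```r
correlation_matrix <- cor(mtcars)

# Diverging palette: red for negative, white near zero, blue for positive
diverging_palette <- colorRampPalette(c("red", "white", "blue"))(20)

heatmap(correlation_matrix,
        col = diverging_palette,
        scale = "none",    # plot the correlations as they are, no row scaling
        symm = TRUE,       # treat the matrix as symmetric
        zlim = c(-1, 1),   # passed on to image(): fix the color range to [-1, 1]
        main = "Correlation Heatmap (diverging palette)")
```

With the range fixed at [-1, 1], white corresponds to a correlation near zero, and the diagonal (each variable with itself) takes the most intense blue.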
A heatmap can also be used to show missing values in a data frame.
data <- data.frame(x = c(2, 8, 9, NA, 3, NA, NA, 9),
y = c(6, NA, 9, 5, NA, 3, 3, NA),
z = c(6, NA, 5, 5, 8, NA, 9, 0))
# Create a heatmap with missing values
heatmap(is.na(data)*1, scale = "none", col = c("white", "red"), main = "Missing Values Heatmap")
Explanation of the above code:
The above code creates a data frame named data with three variables: x, y, and z. Each variable has a sequence of numeric values and some missing values denoted by NA.
is.na(data) returns a matrix of the same dimensions as data, where each element is TRUE if the corresponding element in data is NA, and FALSE otherwise.
*1 (multiplied by 1) is used to convert TRUE and FALSE values to numeric values (1 for TRUE, 0 for FALSE).
The col argument specifies the color palette for the heatmap. In this case, missing values (1 in the matrix) are colored red, and non-missing values (0 in the matrix) are white.
main sets the title of the heatmap to “Missing Values Heatmap”.
Explanation of the results:
The vertical dendrogram on the left side of the heatmap represents the clustering of observations (rows) based on their similarities in terms of missing values across variables (x, y, z). Observations that are more similar in terms of their missing value patterns are grouped together in branches of the dendrogram. The height of the dendrogram indicates the level of dissimilarity among observations.
The horizontal dendrogram at the top of the heatmap represents the clustering of variables (x, y, z) based on their similarities in terms of missing values. Variables that are more similar in terms of missing values are grouped together in branches of the dendrogram. As with the left dendrogram, the height of the branches in the top dendrogram indicates the level of dissimilarity among variables.
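A numeric summary of the missing values complements the heatmap. The sketch below uses base R on the same data frame; na_per_column, na_per_row, and n_complete are illustrative names.

```r
data <- data.frame(x = c(2, 8, 9, NA, 3, NA, NA, 9),
                   y = c(6, NA, 9, 5, NA, 3, 3, NA),
                   z = c(6, NA, 5, 5, 8, NA, 9, 0))

# Number of missing values in each variable
na_per_column <- colSums(is.na(data))
print(na_per_column)  # x: 3, y: 3, z: 2

# Number of missing values in each observation (row)
na_per_row <- rowSums(is.na(data))
print(na_per_row)

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(data))
print(n_complete)     # 2
```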
The following code creates a tree map visualization of the GNI (Gross National Income) per capita data for the year 2014.
library(treemap)
data(GNI2014)
treemap(GNI2014,
        index = c("continent", "iso3"), # You can try index = c("iso3") to see the difference
        vSize = "population",
        vColor = "GNI",
        type = "manual",
        palette = "RdYlGn"
)
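The code uses the treemap() function from the treemap package. Its arguments break down as follows: index = c("continent", "iso3") defines the hierarchy, with countries (identified by their three-letter ISO codes) nested within continents; vSize = "population" sets the area of each rectangle; vColor = "GNI" names the variable that determines the color; type = "manual" maps the vColor values directly onto the palette; and palette = "RdYlGn" selects a red-yellow-green palette.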
Interpretation of results:
This treemap provides a hierarchical and visually appealing representation of GNI per capita data, allowing for quick comparisons and insights into the distribution of GNI per capita and population across continents and countries.
There are many R packages, such as plotly and leaflet, for visualizing geographic data. We introduce the plot_ly() function from the plotly package.
library(plotly)
# Sample data: Store locations and sales
store_data <- data.frame(Store = c("Store A", "Store B", "Store C"),
                         Latitude = c(37.7749, 34.0522, 40.7128),
                         Longitude = c(-122.4194, -118.2437, -74.0060),
                         Sales = c(12000, 15000, 18000))
# Create an interactive map using Plotly
map <- plot_ly(data = store_data, type = "scattergeo", mode = "markers",
               lat = ~Latitude, lon = ~Longitude, text = ~Store,
               marker = list(size = ~Sales / 1000, color = ~Sales, colorscale = "Viridis"))
# Customize map layout and annotation
map %>% layout(
  geo = list(showland = TRUE),
  title = "Interactive Map of Store Locations and Sales",
  annotations = list(
    list(x = 0.5, y = -0.1, text = "Marker size indicates sales amount", showarrow = FALSE),
    list(x = 0.5, y = -0.15, text = "Hover over markers to view store details", showarrow = FALSE)
  )
)
This is for fun.
library(igraph)
set.seed(123)
# sample_gnp() is the current igraph name for the Erdos-Renyi G(n, p) generator
graph <- sample_gnp(10, p = 0.3)
plot(graph, layout = layout_with_fr)
We will use the R package “gridExtra”.
library(gridExtra)
# Create individual plots
plot1 <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Displacement")
plot2 <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Horsepower")
plot3 <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Weight")
plot4 <- ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs. Quarter Mile Time")
# Arrange the plots in a 2x2 matrix
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
Explanation of the above code:
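grid.arrange() places the plots it receives into a grid, filling it row by row; with ncol = 2, four plots form a 2x2 layout. When the plots differ only in one variable, they can also be built in a loop and passed as a list through the grobs argument. The sketch below is one possible way to do this, assuming ggplot2 and gridExtra are installed; the names y_vars and plots are illustrative.

```r
library(ggplot2)
library(gridExtra)

# Variables to plot against mpg, with readable names for the titles
y_vars <- c(disp = "Displacement", hp = "Horsepower",
            wt = "Weight", qsec = "Quarter Mile Time")

# Build one scatter plot per variable;
# .data[[v]] looks up the column whose name is stored in v
plots <- lapply(names(y_vars), function(v) {
  ggplot(mtcars, aes(x = mpg, y = .data[[v]])) +
    geom_point() +
    labs(title = paste("Scatter Plot of MPG vs.", y_vars[[v]]), y = v)
})

# grobs takes the list of plots; ncol = 2 gives a 2x2 grid for four plots
grid.arrange(grobs = plots, ncol = 2)
```

The loop avoids repeating the same plotting code four times, at the cost of slightly less control over each individual plot.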