Introduction

Data visualization is a powerful and essential tool in the realm of data analysis and communication. Its importance stems from its ability to make complex data more accessible, understandable, and insightful.

Here are several key reasons why data visualization is crucial:

There are two types of visualizations.

Exploratory visualizations help you find insights for yourself, while explanatory visualizations communicate insights to others.

Coding Preparation

This course uses R, a programming language designed mainly for data analysis and statistical computing.

Tutorials for R Beginners

Start with the following:

After at least 5 hours of hard work, you can continue to read these notes.

Installing and Loading R Packages:

First of all, what is an R package?

In R, a package is a collection of functions, data sets, documentation, and other resources bundled together for a specific purpose or task. Packages provide a way to organize and share code, making it easier for users to access and utilize various functionalities without having to reinvent the wheel for common tasks. Packages can be developed by the R community or individual programmers, and they enhance the capabilities of R by introducing new functions, data structures, and tools.

Key points about packages in R:

Functionality: Packages in R provide additional functions and tools that extend the base R functionalities. They cover a wide range of areas such as data manipulation, statistical analysis, visualization, machine learning, and more.

Installation: To use a package, you need to install it first. This is typically done using the install.packages("package_name") function. For example, to install the ggplot2 package, you would run install.packages("ggplot2") in the console.

Loading: Once a package is installed, you need to load it into your R session using the library(package_name) function. This makes the package’s functions and datasets available for use. To load the ggplot2 package, you would run library(ggplot2) in the console.

Namespace: Packages have their own namespaces to prevent conflicts between functions with the same names from different packages. This means that you might need to prefix a function with the package name when using it, e.g., dplyr::filter().

Documentation: Packages often include documentation, which is accessible through the help() function or the ? operator. Documentation provides information about the functions, arguments, and usage examples.

CRAN: The Comprehensive R Archive Network (CRAN) is the primary repository for R packages. It hosts thousands of packages that are freely available for users to download and use.

Custom Packages: Users can create their own packages to organize and distribute their R code, functions, and datasets. This is particularly useful for sharing code and analyses with others.

Dependencies: Some packages rely on other packages to function properly. These are called dependencies. When you install a package, R will also install its required dependencies.

Updates: Packages are periodically updated to fix bugs, introduce new features, and maintain compatibility with the latest version of R. You can update packages using the update.packages() function.

Community: The R community actively develops and maintains packages. This collaborative effort allows users to benefit from the expertise and contributions of developers worldwide.

Popular R packages include ggplot2 for data visualization, dplyr and tidyr for data manipulation, caret for machine learning, and many more.

Overall, packages are a fundamental aspect of R that enable users to harness a wide variety of tools and techniques for their data analysis and research projects.

Note that you only need to install a package once, but you need to load it into your R session each time you start a new session or script.
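The install-once, load-every-session pattern can be sketched as follows. This is a minimal sketch rather than part of the original notes: is_installed() is a hypothetical helper name, and the install.packages() call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Packages used later in these notes
wanted <- c("ggplot2", "dplyr", "tidyr")

# Keep only the ones that still need to be installed
to_install <- wanted[!vapply(wanted, is_installed, logical(1))]

# Install once (uncomment to run); this needs an internet connection
# if (length(to_install) > 0) install.packages(to_install)

# Load in every new session, for example: library(ggplot2)

# A function can also be called without loading its package,
# using the package::function form (the stats package ships with R)
middle_value <- stats::median(c(3, 1, 2))
middle_value
```

Checking before installing avoids re-downloading packages each time a script runs, while library() still has to be called in every new session.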

Data Manipulation:

Data manipulation is a fundamental step in data analysis, and R offers several powerful packages with functions for manipulating data. Some of the most commonly used packages for data manipulation are base (which comes with R by default) and dplyr. Here are some key data manipulation functions from these packages:

Base R:

  • names(): Returns the column names of a data frame.
  • subset(): Subsets rows of a data frame based on a condition.
  • aggregate(): Aggregates data based on one or more variables.
  • which(): Returns the indices of elements that satisfy a condition.
  • cut(): Divides a continuous variable into intervals.
  • table(): Creates frequency tables for categorical variables.
  • ifelse(): Performs element-wise conditional operations on vectors, matrices, or data frames.
  • gsub(): Performs global substitutions by replacing all occurrences of a pattern in the entire string by something else.
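
Several of the base R functions listed above can be tried on the built-in mtcars dataset. The following is a small sketch; the variable names and the horsepower cut points are illustrative choices, not part of the original notes.

```r
# Rows of mtcars with weight above 4 (in 1000 lbs), keeping three columns
heavy_cars <- subset(mtcars, wt > 4, select = c(mpg, wt, cyl))

# Mean miles per gallon for each number of cylinders
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Positions of the cars with more than 200 horsepower
strong_idx <- which(mtcars$hp > 200)

# Bin horsepower into three labelled intervals: (0,100], (100,200], (200,400]
hp_band <- cut(mtcars$hp, breaks = c(0, 100, 200, 400),
               labels = c("low", "medium", "high"))

# Frequency table of the bins
table(hp_band)

# Column names of the aggregated data frame
names(mean_mpg)
```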

dplyr:

  • select(): Selects specific columns from a data frame.
  • filter(): Filters rows based on conditions.
  • arrange(): Orders rows based on one or more columns.
  • mutate(): Creates new variables/columns or modifies existing ones.
  • group_by(): Groups data by one or more variables.
  • summarize(): Summarizes grouped data using functions like sum, mean, etc.
  • distinct(): Removes duplicate rows.
  • case_when(): Creates new variables based on multiple conditions.
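
The dplyr verbs above are usually chained with the pipe operator %>%, so that each step receives the result of the previous one. A minimal sketch on mtcars follows; the column names wt_kg, n_cars, mean_mpg, and mean_wt_kg are made up for illustration.

```r
library(dplyr)

cyl_summary <- mtcars %>%
  filter(hp > 100) %>%                    # keep cars above 100 horsepower
  mutate(wt_kg = wt * 1000 * 0.4536) %>%  # wt is in 1000 lbs; convert to kg
  group_by(cyl) %>%                       # one group per cylinder count
  summarize(n_cars = n(),                 # cars per group
            mean_mpg = mean(mpg),         # average fuel efficiency
            mean_wt_kg = mean(wt_kg)) %>% # average weight in kg
  arrange(desc(mean_mpg))                 # most efficient group first

cyl_summary
```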

The following examples showcase how the filter(), select(), and pivot_longer() functions from the dplyr and tidyr packages can be used for data manipulation tasks in R.

Example 1: Using filter() to Filter Data

Filter the “mtcars” dataset to only include cars with more than 100 horsepower:

library(dplyr)

# Filter cars with horsepower > 100
filtered_cars <- filter(mtcars, hp > 100)
head(filtered_cars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

When you run this code, you’ll obtain a subset of the mtcars dataset containing only the cars with horsepower (hp) greater than 100. The head() function then displays the first few rows of this filtered dataset. This is useful for examining a subset of the data that meets specific filtering criteria.

Example 2: Using select() to Select Columns

Select specific columns from the “iris” dataset:

library(dplyr)

# Select columns Sepal.Length and Sepal.Width
selected_columns <- select(iris, Sepal.Length, Sepal.Width)
head(selected_columns)
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9

select(iris, Sepal.Length, Sepal.Width): This line of code selects the columns “Sepal.Length” and “Sepal.Width” from the iris dataset. The result is stored in the selected_columns data frame.

Example 3: Using pivot_longer() to Reshape Data

Reshape the “iris” dataset from wide format to long format using pivot_longer():

library(tidyr)

# Reshape the dataset
long_iris <- pivot_longer(iris, cols = Sepal.Length:Petal.Width,
                          names_to = "Measurement", values_to = "Value")

head(long_iris)
## # A tibble: 6 × 3
##   Species Measurement  Value
##   <fct>   <chr>        <dbl>
## 1 setosa  Sepal.Length   5.1
## 2 setosa  Sepal.Width    3.5
## 3 setosa  Petal.Length   1.4
## 4 setosa  Petal.Width    0.2
## 5 setosa  Sepal.Length   4.9
## 6 setosa  Sepal.Width    3

The code pivot_longer(iris, cols = Sepal.Length:Petal.Width, names_to = "Measurement", values_to = "Value") reshapes the iris dataset from a wide format (with columns “Sepal.Length” to “Petal.Width”) to a long format. The cols argument specifies the columns to pivot, the names_to argument specifies the name of the column where the measurement names will be stored, and the values_to argument specifies the name of the column where the corresponding values will be stored.

This reshaped format is useful for performing analyses that require data in a long format. Here is a continuation to the above example.

library(ggplot2)

# Create a boxplot with faceting and remove x-axis
ggplot(long_iris, aes(y = Value)) +  # Set y aesthetic to 'Value'; x is implicit in the facet
  geom_boxplot() +  # Add boxplot layer to visualize the distribution of 'Value'
  labs(title = "Boxplot of Iris Measurements by Species",  # Title for the plot
       y = "Value") +  # Label for the y-axis
  facet_grid(Measurement ~ Species) +  # Faceting by Measurement (rows) and Species (columns)
  theme_minimal() +  # Apply a minimal theme for cleaner appearance
  theme(axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text labels
        axis.ticks.x = element_blank())  # Remove the x-axis ticks

Is there any other way you can plot the data? You need to think about what kind of visuals you would like to see, then you start the coding process.

Example 4: Using ifelse(), gsub(), and case_when()

score = c(67, 74, 85, 90, 93, 56, 87, 63)
categorized_score = ifelse(score > 85, "High", "Low")
categorized_score
## [1] "Low"  "Low"  "Low"  "High" "High" "Low"  "High" "Low"

In the first line of code, you define a numeric vector called score containing eight different score values.

In the second line of code, if a score is greater than 85, it’s categorized as “High”; otherwise, it’s categorized as “Low”. The result is stored in the categorized_score variable.

In the third line of code, you print the content of the categorized_score vector to the console to see the categorization of each score.
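
ifelse() calls can also be nested when more than two categories are needed. The sketch below reuses the same scores; the "Medium" category and its cut-off of 70 are assumptions made for illustration.

```r
score <- c(67, 74, 85, 90, 93, 56, 87, 63)

# Above 85 is "High"; otherwise 70 or more is "Medium"; everything else is "Low"
three_level_score <- ifelse(score > 85, "High",
                            ifelse(score >= 70, "Medium", "Low"))
three_level_score
```

With many categories, nested ifelse() calls become hard to read, which is where cut() or case_when() are usually the clearer choice.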

house_price = c("$240,000", "$350,000", "$280,000", "$4 50,000")

modified_house_price = gsub(",|\\$|\\s", "", house_price)
modified_house_price
## [1] "240000" "350000" "280000" "450000"

Here, you use the gsub() function to replace specific characters with an empty string within each element of the house_price vector. The regular expression ",|\\$|\\s" matches commas, dollar signs, and whitespace, so those characters are removed from the strings. Inside an R string every backslash must be doubled, which is why the regex escapes \$ and \s are typed as \\$ and \\s.

  • ,: Matches commas.

  • \\$: Matches dollar signs. The backslash escapes the dollar sign, which otherwise means “end of string” in a regular expression.

  • \\s: Matches whitespace characters (including spaces).

The second argument is an empty string "", indicating that the matched characters should be replaced with nothing.
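
Cleaned price strings are usually converted to numbers before analysis. The following sketch uses a different pattern from the one above, "[^0-9.]", which removes every character that is not a digit or a decimal point. The variable names are illustrative.

```r
house_price <- c("$240,000", "$350,000", "$280,000", "$4 50,000")

# Remove every character that is not a digit or a decimal point
digits_only <- gsub("[^0-9.]", "", house_price)

# Convert the cleaned strings to numbers so that arithmetic works
price_numeric <- as.numeric(digits_only)

mean_price <- mean(price_numeric)
mean_price
```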

y <- c("M", "male", "f", "Female", "m", "Male", "F", "female", "unknown", "Unknown", "f", "m")

case_when(
     y %in% c("M", "m", "Male", "male") ~ "M",       # Equivalent to if ... then
     y %in% c("F", "f", "Female", "female") ~ "F",   # Equivalent to elseif ... then
     TRUE   ~ "Unknown"                              # Equivalent to else ... then
)
##  [1] "M"       "M"       "F"       "F"       "M"       "M"       "F"      
##  [8] "F"       "Unknown" "Unknown" "F"       "M"

In this code chunk, the case_when() function from the “dplyr” package is used to categorize gender values. Each line (separated by a comma) within the case_when() function represents a condition followed by a value assignment.

The first line checks if y is one of the specified values (“M”, “m”, “Male”, “male”), and if true, it assigns “M” as the result.

The second line checks if y is one of the specified values (“F”, “f”, “Female”, “female”), and if true, it assigns “F” as the result.

The third line uses TRUE as a catch-all condition, which assigns “Unknown” as the result when none of the previous conditions match.
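
In practice case_when() is often used inside mutate() to add a cleaned column to a data frame. The sketch below uses a made-up data frame called people, and applies tolower() first so that fewer spellings need to be listed.

```r
library(dplyr)

# Made-up example data
people <- data.frame(gender = c("M", "female", "unknown", "F", "male"))

people <- people %>%
  mutate(gender_clean = case_when(
    tolower(gender) %in% c("m", "male")   ~ "M",
    tolower(gender) %in% c("f", "female") ~ "F",
    TRUE                                  ~ "Unknown"  # everything else
  ))

people
```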

Example 5: Data Cleaning and Wrangling

Data cleaning and wrangling deal with a mix of data quality issues such as missing values, duplicates, inconsistent formats, and different types of variables. The following exercise practices these steps:

We use the data “titanic_train” from the “titanic” package to practice data cleaning, wrangling, and to prepare the data for visualization. You can then create various visualizations like histograms, bar charts, scatter plots, and more to explore relationships between different variables and uncover insights.

The dataset contains information about passengers on the Titanic, including details like age, sex, class, fare, survival status, etc.

The titanic_train dataset contains the following columns:

  • PassengerId: Unique identifier for passengers.
  • Survived: Survival status (0 = No, 1 = Yes).
  • Pclass: Passenger class (1st, 2nd, 3rd).
  • Name: Passenger’s name.
  • Sex: Passenger’s sex.
  • Age: Passenger’s age.
  • SibSp: Number of siblings/spouses aboard.
  • Parch: Number of parents/children aboard.
  • Ticket: Ticket number.
  • Fare: Fare paid.
  • Cabin: Cabin number.
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

If you want to see more details, type ?titanic_train in the RStudio console and hit the enter key to see the documentation for the data.

You can load this dataset using the titanic package in R. Make sure to install the package if you haven’t already.

The following shows 10 possible questions along with code:

# Load the titanic dataset from the titanic package,
# and dplyr for %>%, filter(), mutate(), and select()
library(titanic)
library(dplyr)

# Assign the training dataset to a variable for easier reference
data <- titanic_train

# 1. Count the number of missing values in each column of the dataset
missing_counts <- colSums(is.na(data))  # Use is.na() to identify missing values and colSums() to count them
print(missing_counts)  # Print the counts of missing values for each column
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
# 2. Check for any duplicated rows in the dataset
duplicated_rows <- data[duplicated(data), ]  # Identify duplicated rows using duplicated()
print(duplicated_rows)  # Print any duplicated rows found
##  [1] PassengerId Survived    Pclass      Name        Sex         Age        
##  [7] SibSp       Parch       Ticket      Fare        Cabin       Embarked   
## <0 rows> (or 0-length row.names)
# 3. Handle missing values in the "Age" column by filtering out missing values
data_cleaned <- data %>%
  filter(!is.na(Age))  # Retain only rows where Age is not missing

# 4. Fill in missing values in the "Embarked" column with the most common value ("S")
# Missing ports are stored as empty strings (""), not NA, as the counts in step 1 show
data_cleaned <- data_cleaned %>%
  mutate(Embarked = ifelse(is.na(Embarked) | Embarked == "", "S", Embarked))  # Replace NA or "" with "S"

# 5. Create a "FamilySize" column by summing siblings/spouses and parents/children
data_cleaned <- data_cleaned %>%
  mutate(FamilySize = SibSp + Parch + 1)  # 1 is added to count the passenger themselves

# 6. Convert categorical variables like "Sex" and "Embarked" into numeric codes
# as.numeric() on text returns NA, so convert to a factor first (levels are sorted alphabetically)
data_cleaned <- data_cleaned %>%
  mutate(Sex = as.numeric(factor(Sex)),  # female = 1, male = 2
         Embarked = as.numeric(factor(Embarked)))  # C = 1, Q = 2, S = 3

# 7. Bin the "Age" column into specific age groups for easier analysis
data_cleaned <- data_cleaned %>%
  mutate(AgeGroup = cut(Age, breaks = c(0, 12, 18, 30, 50, 80),  # Define age breaks
                        labels = c("Child", "Teenager", "Young Adult", "Adult", "Senior")))  # Assign labels

# 8. Drop certain columns that are not needed for analysis or modeling
columns_to_drop <- c("PassengerId", "Name", "Ticket", "Cabin", "Age")  # List of columns to drop
data_cleaned <- data_cleaned %>%
  select(-all_of(columns_to_drop))  # all_of() tells select() that the names come from a character vector

# 9. Scale the "Fare" and "FamilySize" columns to standardize their values
# scale() returns a one-column matrix, so as.numeric() turns it back into a plain vector
data_cleaned <- data_cleaned %>%
  mutate(Fare = as.numeric(scale(Fare)),  # Scale "Fare" to have mean 0 and standard deviation 1
         FamilySize = as.numeric(scale(FamilySize)))  # Scale "FamilySize" similarly

# 10. Prepare the final dataset for visualization by selecting relevant columns
data_visual <- data_cleaned %>%
  select(Survived, Pclass, Sex, AgeGroup, Embarked, Fare, FamilySize)  # Select columns for visualization
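
One point worth checking in step 1: is.na() only counts true NA values. Character columns can also hold empty strings, which is.na() does not flag. The sketch below shows the difference on a made-up vector; the values are illustrative and not taken from titanic_train.

```r
# Made-up vector with one real NA and two empty strings
port <- c("S", "", "C", NA, "Q", "")

n_na    <- sum(is.na(port))               # counts only the real NA
n_empty <- sum(port == "", na.rm = TRUE)  # counts the empty strings

# Recode empty strings as NA so that is.na() sees all missing entries
port[which(port == "")] <- NA
n_na_after <- sum(is.na(port))

c(n_na = n_na, n_empty = n_empty, n_na_after = n_na_after)
```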

Creating Basic Visualizations:

Create visualizations using base R graphics (the graphics package), which are installed and loaded with R.

Example 1: Scatter Plot

Create a simple scatter plot using the built-in “iris” dataset to show the relationship between sepal length and width:

# Load the dataset (already available in R)
data("iris")

# Create a scatter plot
plot(iris$Sepal.Length, iris$Sepal.Width, 
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Scatter Plot of Sepal Length vs. Width")

Example 2: Histogram

Generate a histogram to visualize the distribution of sepal lengths in the “iris” dataset:

# Load the dataset (if not loaded already)
data("iris")

# Create a histogram
hist(iris$Sepal.Length, 
     xlab = "Sepal Length", ylab = "Frequency",
     main = "Histogram of Sepal Length")

Example 3: Box Plot

Create a box plot to show the distribution of petal lengths for different iris species:

# Load the dataset (if not loaded already)
data("iris")

# Create a box plot
boxplot(Petal.Length ~ Species, data = iris, 
        xlab = "Species", ylab = "Petal Length",
        main = "Box Plot of Petal Length by Species")

Example 4: Bar Plot

Generate a bar plot to display the number of cars for each cylinder count in the “mtcars” dataset:

# Load the dataset (already available in R)
data("mtcars")

# Create a bar plot
barplot(table(mtcars$cyl), 
        xlab = "Number of Cylinders", ylab = "Frequency",
        main = "Bar Plot of Car Cylinders")

These examples demonstrate how you can use base R graphics to create basic visualizations. While these functions provide a simple way to create visualizations, more advanced and customizable visualizations can be achieved using dedicated visualization packages like “ggplot2” and “plotly.”
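
Base R plots can also be placed side by side and written to a file. The following is a sketch under a few assumptions: it uses par(mfrow = ...) for the layout, the pdf() device for output, and a temporary file path chosen only for illustration.

```r
# Temporary output file (illustrative; use your own path in practice)
out_file <- tempfile(fileext = ".pdf")

# Open a PDF device; everything plotted until dev.off() goes into the file
pdf(out_file, width = 8, height = 4)

# One row, two columns of plots; keep the old settings to restore them later
old_par <- par(mfrow = c(1, 2))

hist(iris$Sepal.Length,
     xlab = "Sepal Length", main = "Histogram of Sepal Length")
boxplot(Petal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Petal Length",
        main = "Petal Length by Species")

par(old_par)  # restore the previous layout
dev.off()     # close the device and finish writing the file

file.exists(out_file)
```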

A Great Visualization Toolbox: the ggplot2 Package

Let’s create visualizations using ggplot(), which involves the specification of aesthetics. Here are details about aesthetics when using the ggplot2 package:

What are aesthetics in ggplot2?

In ggplot2, aesthetics (often abbreviated as “aes”) refers to the mapping of data variables to visual properties of a plot, such as position, color, size, shape, and more. Aesthetics define how the data is visually represented in the plot. By specifying aesthetics, you can create different types of visualizations by mapping different variables to different visual attributes.

In other words, aesthetics in ggplot2 allow you to answer questions like: How should the data be displayed? What should be used to differentiate data points? What variables should determine the position, color, or size of plot elements?

Here are some common aesthetics and their associated visual properties:

  • x and y: Position of data points along the x and y axes.

  • color and fill: Outline (or point) color and interior fill color of shapes.

  • shape: Shape of data points (e.g., circles, squares, triangles).

  • size: Size of data points.

  • alpha: Transparency level of data points.

  • linetype: Type of line for lines in line plots.

  • group: Which observations belong together, for example which points are joined into one line.

  • label: Labels for data points in scatter plots or other text-based plots.

When you use the aes() function in ggplot2, you are specifying how data variables should be mapped to these aesthetics. Here’s a simple example:

library(ggplot2)

# Creating a scatter plot with aesthetics
ggplot(data = iris, 
       aes(x = Sepal.Length, y = Sepal.Width, color = Species, size=Petal.Length)
      ) +
  geom_point()

In this example, x, y, color, and size are aesthetics. The x and y aesthetics map the Sepal.Length and Sepal.Width variables to the position of data points on the plot. The color aesthetic maps the Species variable to different colors, distinguishing data points belonging to different species. The size aesthetic maps the Petal.Length variable to the size of the points.

By adjusting aesthetics and using layers of geoms and other elements (data, aesthetics, geoms, labels, …), you can create a wide range of visualizations that effectively communicate insights from your data. The flexibility and grammar of graphics approach in ggplot2 make it a powerful tool for data visualization in R.
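
A common source of confusion is the difference between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(). The sketch below contrasts the two; the object names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# Mapping: color depends on the Species column, and a legend is created
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species))

# Setting: every point is drawn in the same fixed color, with no legend
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue")

print(p_mapped)
print(p_fixed)
```

Writing aes(color = "blue") instead would treat "blue" as a data value rather than a color name, which is usually not what is intended.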

Example 1: Scatter Plot

Create a scatter plot using the “iris” dataset to show the relationship between sepal length and width:

library(ggplot2)

# Create a scatter plot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  labs(title = "Scatter Plot of Sepal Length vs. Width",
       x = "Sepal Length", y = "Sepal Width")

library(ggplot2): Load the ggplot2 package for creating data visualizations.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)): Create a ggplot object using the iris dataset and specify aesthetic mappings. The x aesthetic is set to the “Sepal.Length” column, and the y aesthetic is set to the “Sepal.Width” column.

geom_point(): Add a layer of points to the plot, creating a scatter plot of sepal length against sepal width.

labs(title = "Scatter Plot of Sepal Length vs. Width", x = "Sepal Length", y = "Sepal Width"): Customize the plot labels, including the title, x-axis label, and y-axis label.

When you run this code, you’ll generate a scatter plot with one point per flower, showing the relationship between sepal length and sepal width in the iris dataset. The species are not distinguished in this plot. This visualization allows you to visually explore any patterns or relationships between these two variables.

Example 2: Histogram

Generate a histogram to visualize the distribution of sepal lengths in the “iris” dataset:

library(ggplot2)

# Create a histogram
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2, fill = "blue", color = "black") +
  labs(title = "Histogram of Sepal Length",
       x = "Sepal Length", y = "Frequency")

ggplot(iris, aes(x = Sepal.Length)): Create a ggplot object using the iris dataset and specify aesthetic mappings. The x aesthetic is set to the “Sepal.Length” column.

geom_histogram(binwidth = 0.2, fill = "blue", color = "black"): Add a layer of bars to the plot, creating a histogram of sepal length. The binwidth argument specifies the width of the histogram bins, the fill argument sets the fill color of the bars to blue, and the color argument sets the border color of the bars to black.

labs(title = "Histogram of Sepal Length", x = "Sepal Length", y = "Frequency"): Customize the plot labels, including the title, x-axis label, and y-axis label.

When you run this code, you’ll generate a histogram that visualizes the distribution of sepal lengths in the iris dataset. Each bar represents a range of sepal lengths, and the height of the bar corresponds to the frequency of sepal lengths falling within that range. This visualization provides insights into the distribution of the “Sepal.Length” variable in the dataset.

Example 3: Bar Plot

Create a bar plot using the “mtcars” dataset to display the count of cars by number of cylinders:

library(ggplot2)

# Create a bar plot
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "green", color = "black") +
  labs(title = "Bar Plot of Car Counts by Cylinders",
       x = "Number of Cylinders", y = "Count")

ggplot(mtcars, aes(x = factor(cyl))): Create a ggplot object using the mtcars dataset and specify aesthetic mappings. The x aesthetic is set to the “cyl” variable, and factor(cyl) is used to ensure that the “cyl” variable is treated as a categorical variable.

geom_bar(fill = "green", color = "black"): Add a layer of bars to the plot, creating a bar plot of car counts by the number of cylinders. The fill argument sets the fill color of the bars to green, and the color argument sets the border color of the bars to black.

labs(title = "Bar Plot of Car Counts by Cylinders", x = "Number of Cylinders", y = "Count"): Customize the plot labels, including the title, x-axis label, and y-axis label.

When you run this code, you’ll generate a bar plot that displays the counts of cars for each number of cylinders in the mtcars dataset. Each bar represents a different number of cylinders, and the height of the bar corresponds to the count of cars with that number of cylinders. This visualization allows you to compare the distribution of car counts across different cylinder categories.

Example 4: Box Plot

Generate a box plot using the “iris” dataset to show the distribution of petal lengths for different iris species:

library(ggplot2)

# Create a box plot
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Box Plot of Petal Length by Species",
       x = "Species", y = "Petal Length")

ggplot(iris, aes(x = Species, y = Petal.Length)): Create a ggplot object using the iris dataset and specify aesthetic mappings. The x aesthetic is set to the “Species” column, and the y aesthetic is set to the “Petal.Length” column.

geom_boxplot(fill = "orange"): Add a layer of box plots to the plot, creating a box plot of petal length for each species. The fill argument sets the fill color of the boxes to orange.

labs(title = "Box Plot of Petal Length by Species", x = "Species", y = "Petal Length"): Customize the plot labels, including the title, x-axis label, and y-axis label.

When you run this code, you’ll generate a box plot that shows the distribution of petal lengths for each species of irises in the iris dataset. Each box represents the interquartile range (IQR) of the petal lengths for a specific species, and the line inside the box represents the median petal length. This visualization allows you to compare the distribution of petal lengths among different iris species.
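
The numbers behind each box can be computed directly, which helps when reading the plot. The following is a minimal sketch using base R; the object name box_stats is illustrative.

```r
# Split petal lengths by species, then compute quartiles, median, and IQR for each
box_stats <- sapply(split(iris$Petal.Length, iris$Species), function(x) {
  c(q1     = unname(quantile(x, 0.25)),  # lower edge of the box
    median = median(x),                  # line inside the box
    q3     = unname(quantile(x, 0.75)),  # upper edge of the box
    iqr    = IQR(x))                     # height of the box (q3 - q1)
})

# Rows are statistics, columns are species
box_stats
```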

More on ggplots

  1. Customizing Visualizations:

In the following example, we create a scatter plot of miles per gallon (mpg) versus weight (wt) from the “mtcars” dataset. We apply different themes using functions like theme_minimal(), theme_gray(), theme_bw(), theme_dark(), and theme_light(). We then use grid.arrange() from the gridExtra package to arrange multiple plots side by side for comparison.

library(ggplot2)
data("mtcars")

# Create a scatter plot with different themes
scatter_plot <- ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(cyl))) +
  geom_point(size = 4, alpha = 0.8, shape = 16) +
  labs(title = "Scatter Plot of MPG vs. Weight by Cylinders",
       x = "Miles per Gallon", y = "Weight",
       color = "Cylinders")

# Store the themes under names that do not shadow the ggplot2 theme functions
th_minimal <- theme_minimal()
th_gray <- theme_gray()
th_bw <- theme_bw()
th_dark <- theme_dark()
th_light <- theme_light()

# Arrange plots with different themes
library(gridExtra)
grid.arrange(scatter_plot + th_minimal,
             scatter_plot + th_gray,
             scatter_plot + th_bw,
             scatter_plot + th_dark,
             scatter_plot + th_light,
             ncol = 2)

Each theme gives the plot a distinct visual appearance, allowing you to choose the one that best suits your data and storytelling goals.

  2. Layering Visual Elements:

You can combine multiple visual elements using the + operator in ggplot to create layered plots. Here’s an example of layering a scatter plot with a smoothed line using geom_point() and geom_smooth():

library(ggplot2)

# Create a scatter plot with a smoothed line
scatter_smooth_plot <- ggplot(data = mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Scatter Plot with Smoothed Line",
       x = "Horsepower", y = "Miles per Gallon")

print(scatter_smooth_plot)

ggplot(data = mtcars, aes(x = hp, y = mpg)): Create a ggplot object using the mtcars dataset and specify aesthetic mappings. The x aesthetic is set to the “hp” column (horsepower) and the y aesthetic is set to the “mpg” column (miles per gallon).

geom_point(): Add a layer of points to the plot, creating a scatter plot of horsepower against miles per gallon.

geom_smooth(method = "lm", se = FALSE, color = "blue"): Add a smoothed line to the plot using linear regression ("lm" method). The se = FALSE argument disables the confidence interval shading, and the color = "blue" argument sets the color of the line to blue.

labs(title = "Scatter Plot with Smoothed Line", x = "Horsepower", y = "Miles per Gallon"): Customize the plot labels, including the title, x-axis label, and y-axis label.

print(scatter_smooth_plot): Display the final plot with the scatter points and smoothed line.

When you run this code, you’ll generate a scatter plot with points representing car models’ horsepower and miles per gallon. The smoothed blue line represents the linear regression fit to the data, providing insight into the overall trend between horsepower and fuel efficiency.
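
The line drawn by geom_smooth(method = "lm") corresponds to an ordinary linear regression of mpg on hp. To see the numbers behind it, the same model can be fitted with lm(). This is a minimal sketch; the object names and the example value of 150 horsepower are illustrative.

```r
# Fit the same straight-line model that the smoothed line displays
fit <- lm(mpg ~ hp, data = mtcars)

# Intercept and slope of the fitted line
line_coefs <- coef(fit)
line_coefs

# Predicted miles per gallon for a car with 150 horsepower
pred_150 <- predict(fit, newdata = data.frame(hp = 150))
pred_150
```

A negative slope for hp means that, on average, cars with more horsepower get fewer miles per gallon in this dataset.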

  3. Time Series Visualization:

Time series visualizations are used to display data points over a continuous time interval. Here’s an example of creating a time series visualization using the ggplot2 package in R:

library(ggplot2)

# Create a time series line chart
ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line(color = "blue") +
  labs(title = "Time Series of Unemployment",
       x = "Date", y = "Number of Unemployed (thousands)") +
  theme_minimal()

In this example, we use the “economics” dataset that ships with ggplot2, containing US economic data including the number of unemployed people and other variables over time. We create a time series line chart using geom_line() to plot the number of unemployed in thousands (unemploy) over time (date). We customize the appearance using labs() to add labels to the title and axes, and apply a minimal theme using theme_minimal().

The resulting time series line chart displays the trend in unemployment over the specified time interval, helping to identify patterns and fluctuations in the data.

How can we show the labels for every other year?

ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line(color = "blue") +
  labs(title = "Time Series of Unemployment",
       x = "Date", y = "Number of Unemployed (thousands)") +
  theme_minimal() +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(hjust = 0.5)
       )

In this example, we set the date_breaks argument to "2 years" to display labels for every other year on the x-axis. The date_labels argument is set to "%Y" so that each label shows the four-digit year. The theme() call rotates the labels by 90 degrees and centers the title.

This modification will show labels for every other year, making the x-axis less cluttered while still providing meaningful time intervals.
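
The "%Y" code used in date_labels follows the same formatting conventions as base R’s format() function for dates, so label formats can be tried out in the console first. The sketch below uses a short made-up date range.

```r
# Dates two years apart, similar to the axis breaks requested above
break_dates <- seq(as.Date("1967-01-01"), as.Date("1975-01-01"), by = "2 years")

# Four-digit year, as used in the plot
year_labels <- format(break_dates, "%Y")
year_labels

# Two-digit year, as an alternative
short_labels <- format(break_dates, "%y")
short_labels
```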

  4. Adding Titles, Annotations, and Labels:

Here’s an example of creating a scatter plot using the ggplot2 package in R, and then adding titles, annotations, and labels to enhance the understanding of the visualization:

library(ggplot2)
data("iris")

# Create a scatter plot of sepal length versus sepal width
scatter_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Scatter Plot of Sepal Length and Width",
       x = "Sepal Length", y = "Sepal Width",
       color = "Species") +
  theme_minimal()

# Add annotation and labels
annotated_plot <- scatter_plot +
  geom_text(aes(label = Species), nudge_y = 0.1, size = 3) +
  annotate("text", x = 5, y = 4.5, label = "Setosa", size = 4, color = "red", alpha = 0.5) +
  annotate("text", x = 7, y = 3.5, label = "Versicolor", size = 4, color = "green", alpha = 0.5) +
  annotate("text", x = 4.5, y = 2, label = "Virginica", size = 4, color = "blue", alpha = 0.5)

print(annotated_plot)

In this example, we first create a scatter plot of sepal length versus sepal width from the “iris” dataset. We add titles and labels to the plot using the labs() function and the title, x, y, and color arguments.

Next, we enhance the understanding of the visualization by adding annotations and labels. We use geom_text() to add labels for each species next to their corresponding points. Additionally, we use the annotate() function to add custom annotations to the plot, indicating the approximate location of each species. We adjust the position, size, color, and transparency of the annotations and labels to make them visually appealing.

The resulting plot includes titles, annotations, and labels that provide additional context and understanding of the scatter plot, making it more informative and engaging for viewers.
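Labeling every point and hard-coding annotation positions can get cluttered. As an alternative sketch (not the method used above), one label per species can be placed at the group mean. The names centroids and centroid_plot are hypothetical and introduced only for this illustration.

```r
library(ggplot2)

# Mean sepal length and width per species (one row per species)
centroids <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                       data = iris, FUN = mean)

centroid_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.6) +
  # One label per species, drawn at the group mean
  geom_text(data = centroids, aes(label = Species),
            color = "black", fontface = "bold", size = 4) +
  labs(title = "Scatter Plot of Sepal Length and Width",
       x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()

print(centroid_plot)
```

Because the label positions are computed from the data, they stay sensible if the dataset or the plotted variables change.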

Creating Treemaps: the treemap Package

library(treemap)

data(GNI2014)

treemap(GNI2014,
        index=c("continent", "iso3"),  # You can try index=c("iso3") to see the difference
        vSize="population",
        vColor="GNI",
        type="manual", palette = "RdYlGn")

The code uses the treemap package to create a treemap from the GNI2014 dataset, which holds the Gross National Income (GNI) and population of countries together with their continent. The index argument sets the hierarchy: continents first, then countries identified by their iso3 codes. The type = “manual” argument maps the numeric vColor variable directly onto the supplied color palette, and palette = “RdYlGn” selects the palette to use.

This code snippet will generate a treemap with rectangles representing different continents and countries, where the size of each rectangle is proportional to the population, and the color of each rectangle is determined by the GNI value, using the “RdYlGn” color palette.

Creating Interactive Maps: the Plotly Package

Here’s an example of creating interactive maps using the plotly package in R. We’ll define the map type, markers, layouts, annotations, and titles:

library(plotly)

# Sample data: Store locations and sales
store_data <- data.frame(Store = c("Store A", "Store B", "Store C"),
                         Latitude = c(37.7749, 34.0522, 40.7128),
                         Longitude = c(-122.4194, -118.2437, -74.0060),
                         Sales = c(12000, 15000, 18000))

# Create an interactive map using Plotly
map <- plot_ly(data = store_data, type = "scattergeo", mode = "markers",
               lat = ~Latitude, lon = ~Longitude, text = ~Store,
               marker = list(size = ~Sales / 1000, color = ~Sales, colorscale = "Viridis"))

# Customize map layout and annotation
map %>% layout(
  geo = list(showland = TRUE),
  title = "Interactive Map of Store Locations and Sales",
  annotations = list(
    list(x = 0.5, y = -0.1, text = "Marker size indicates sales amount", showarrow = FALSE),
    list(x = 0.5, y = -0.15, text = "Hover over markers to view store details", showarrow = FALSE)
  )
)

In this example, we’re using the plotly package to create an interactive map. We define the map type using type = “scattergeo” and specify markers using the lat, lon, and text arguments. Marker sizes and colors are determined by the sales amount, and we customize the colorscale using colorscale = “Viridis”. We also customize the map layout, add a title, and provide annotations to guide users on interpreting the map.

The resulting interactive map allows users to hover over markers to view store details and sales information, providing an engaging way to explore geographical data.

Creating Apps: the Shiny Package

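Shiny is a popular R package for creating interactive web applications. A Shiny app has two parts: a user interface (UI) definition, which lays out the inputs and outputs, and server logic, which computes the outputs from the inputs.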
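Before the full examples, the two parts can be seen in a minimal sketch. The input id "n" and output id "hist" are hypothetical names chosen for this illustration; the app object is only created here, not launched.

```r
library(shiny)

# UI: one input control and one output placeholder
ui <- fluidPage(
  numericInput("n", "Number of random values:", value = 50, min = 1),
  plotOutput("hist")
)

# Server: fills the placeholder using the current input value
server <- function(input, output) {
  output$hist <- renderPlot({
    req(input$n)  # wait until a value is available
    hist(rnorm(input$n), main = "Random normal values")
  })
}

# Build the app object; printing it or calling runApp(app) would launch it
app <- shinyApp(ui = ui, server = server)
```

Every time the user changes the number, Shiny re-runs the renderPlot() block, which is the same reactive pattern the examples below rely on.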

Example 1: Creating an interactive scatter plot using Shiny

library(shiny)
library(ggplot2)
library(dplyr)  # provides %>% and filter(), used in the server function
data("mtcars")

# Define UI
ui <- fluidPage(
  titlePanel("Interactive Scatter Plot"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("hp_range", "Select Horsepower Range",
                  min = min(mtcars$hp), max = max(mtcars$hp),
                  value = c(min(mtcars$hp), max(mtcars$hp)))
    ),
    mainPanel(
      plotOutput("scatter_plot")
    )
  )
)

# Define server logic
server <- function(input, output) {
  output$scatter_plot <- renderPlot({
    filtered_data <- mtcars %>%
      filter(hp >= input$hp_range[1], hp <= input$hp_range[2])
    
    ggplot(data = filtered_data, aes(x = hp, y = mpg)) +
      geom_point() +
      labs(title = "Interactive Scatter Plot",
           x = "Horsepower", y = "Miles per Gallon")
  })
}

shinyApp(ui = ui, server = server)

In this Shiny app example, users can select a range of horsepower using a slider input. The scatter plot dynamically updates based on the selected horsepower range. This demonstrates the power of Shiny for creating interactive data visualizations and applications in R.

If the app does not run in your browser, you can copy and paste the code into https://shinylive.io/r/editor/, replacing the default code in the editor, and then press the run button (the right arrow). Alternatively, paste the code into the R script panel, highlight all of it, and press the “Run” button.

Example 2: Designing a UI with More Input Controls

We create a Shiny app with UI elements and input controls to allow users to interact with a scatter plot:

library(shiny)
library(ggplot2)
data("iris")

ui <- fluidPage(
  titlePanel("Interactive Scatter Plot"),
  sidebarLayout(
    sidebarPanel(
      selectInput("x_var", "X-axis variable:", choices = colnames(iris)[-5], selected = "Sepal.Length"),
      selectInput("y_var", "Y-axis variable:", choices = colnames(iris)[-5], selected = "Sepal.Width"),
      selectInput("species_var", "Species to display:", choices = levels(iris$Species), selected = "setosa")
    ),
    mainPanel(
      plotOutput("scatter_plot")
    )
  )
)

server <- function(input, output) {
  output$scatter_plot <- renderPlot({
    ggplot(data = dplyr::filter(iris, Species == input$species_var), aes(x = .data[[input$x_var]], y = .data[[input$y_var]])) +
      geom_point() +
      labs(title = "Interactive Scatter Plot",
           x = input$x_var, y = input$y_var) +
      theme_minimal()
  })
}

shinyApp(ui = ui, server = server)

In this Shiny app example, users can interact with the scatter plot by selecting the variables for the x-axis and y-axis and by choosing which species to display. The UI elements include selectInput() controls in the sidebarPanel that allow users to make these choices.

The server function uses the selected input values to dynamically update the scatter plot using renderPlot(). Users can choose different variables and species to observe how the scatter plot changes accordingly.

This example demonstrates how to create an interactive data visualization app using Shiny, allowing users to explore and analyze data with input controls.

Example 3: Flights Data

We use the “flights” dataset from the “nycflights13” package for creating an interactive Shiny app:

library(shiny)
library(ggplot2)
library(dplyr)  # provides %>% and filter(), used in the server function
library(nycflights13)

ui <- fluidPage(
  titlePanel("Interactive Flights Visualization"),
  sidebarLayout(
    sidebarPanel(
      selectInput("month", "Select Month:", choices = unique(flights$month), selected = 1),
      selectInput("carrier", "Select Carrier:", choices = unique(flights$carrier), selected = "AA")
    ),
    mainPanel(
      plotOutput("flights_plot")
    )
  )
)

server <- function(input, output) {
  filtered_data <- reactive({
    flights %>%
      filter(month == input$month, carrier == input$carrier)
  })
  
  output$flights_plot <- renderPlot({
    ggplot(data = filtered_data(), aes(x = day, y = dep_time, color = distance)) +
      geom_point() +
      labs(title = "Interactive Flights Visualization",
           x = "Day of Month", y = "Departure Time",
           color = "Distance") +
      theme_minimal()
  })
}

shinyApp(ui = ui, server = server)

In this example, we are using the flights dataset from the nycflights13 package. Users can interact with the Shiny app by selecting a specific month and carrier from the dropdown menus in the sidebarPanel. The server function filters the data based on the selected inputs and then renders an interactive scatter plot in the mainPanel.
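One detail worth knowing: selectInput() returns the selected value as a character string, so input$month is "1" rather than 1. R coerces the types when comparing, so the filter above works, but an explicit conversion makes the intent clearer. The sketch below uses hypothetical stand-ins for input$month and the month column.

```r
# Hypothetical stand-ins for input$month and flights$month
selected_month <- "1"                  # what selectInput() hands to the server
month_column   <- c(1L, 2L, 1L, 12L)

# Implicit coercion: the integers are converted to strings for the comparison
implicit <- month_column == selected_month

# Explicit conversion keeps the comparison numeric
explicit <- month_column == as.integer(selected_month)

identical(implicit, explicit)
```

In the app, this would correspond to writing month == as.integer(input$month) inside filter().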

Example 4: Credit Card Spending

Data are simulated to show spending on a credit card.

# Load necessary libraries
library(shiny)
library(dplyr)
library(lubridate)
library(ggplot2)

# Sample dataset creation function
create_sample_data <- function(n = 240) {
  data.frame(
    date = sample(seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day"), n, replace = TRUE),
    category = sample(c("Restaurant", "Bookstore", "BestBuy", "Grocery", "Entertainment"), n, replace = TRUE),
    amount = round(runif(n, 5, 150), 2)  # Random amounts between $5 and $150
  ) %>%
  mutate(
    month = month(date, label = TRUE),      # Extract month
    weekday = weekdays(date)                 # Extract weekday
  )
}

# UI
ui <- fluidPage(
  titlePanel("Credit Card Spending Visualization"),
  
  sidebarLayout(
    sidebarPanel(
      selectInput("viewType", "Select View Type:", 
                  choices = c("Category", "Month", "Weekday"))
    ),
    
    mainPanel(
      plotOutput("pieChart")
    )
  )
)

# Server
server <- function(input, output) {
  # Create sample data
  data <- create_sample_data()

  # Reactive expression to total the spending by the selected view (month, weekday, or category)
  filtered_data <- reactive({
    if (input$viewType == "Month") {
      data %>%
        group_by(month) %>%
        summarise(total_spending = sum(amount, na.rm = TRUE)) %>%
        mutate(label = as.character(month))
    } else if (input$viewType == "Weekday") {
      data %>%
        group_by(weekday) %>%
        summarise(total_spending = sum(amount, na.rm = TRUE)) %>%
        mutate(label = weekday)
    } else {
      data %>%
        group_by(category) %>%
        summarise(total_spending = sum(amount, na.rm = TRUE)) %>%
        mutate(label = category)
    }
  })

  # Generate pie chart
  output$pieChart <- renderPlot({
    pie_data <- filtered_data()
    
    # Create the pie chart
    ggplot(pie_data, aes(x = "", y = total_spending, fill = label)) +
      geom_bar(stat = "identity", width = 1) +
      coord_polar(theta = "y") +
      theme_minimal() +
      labs(title = paste("Total Spending by", input$viewType),
           fill = switch(input$viewType,
                         "Category" = "Spending Category",
                         "Month" = "Spending Month",
                         "Weekday" = "Spending Weekday")) +
      theme(axis.title.x = element_blank(), 
            axis.title.y = element_blank(), 
            axis.ticks = element_blank(), 
            axis.text.x = element_blank())
  })
}

# Run the app
shinyApp(ui = ui, server = server)

Note: The above code was written by ChatGPT with my constant guidance. The following is a letter ChatGPT wrote:

Dear Mr. Zhang,

I hope this letter finds you well. I wanted to take a moment to express my sincere gratitude for your patience and guidance during our recent interactions. Your clear communication and constructive feedback were invaluable in refining the Shiny app we developed together.

Your insights helped clarify the requirements and ensure that the final product met your expectations perfectly. It’s always a pleasure to collaborate with someone who values precision and detail, and I truly appreciate the opportunity to assist you.

Thank you once again for your support and understanding. If you have any more projects or questions in the future, I would be more than happy to help.

Warm regards,

ChatGPT
OpenAI

Case Studies

In each of the following case studies, we will provide insights and tell stories based on data visualization results.

Case Study 1: Basic Data Exploration and Visualization

In this topic, we’ll explore the distribution of ages in a dataset of individuals.

# Load necessary packages
library(ggplot2)

# Sample data: Age distribution of survey respondents
age_data <- data.frame(Age = c(25,27,29,26,22,14,33,16,17,33,14,14,33,46,26,27,22,40,23,28,38,16,18,22,35,18,20,10,28,38,28,30,14,10,14,28,36,26,18,26,26,30,17,29,21,27,30,19,33,29))

# Create a histogram to visualize the age distribution
ggplot(age_data, aes(x = Age)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution of Survey Respondents") +
  theme_minimal()

Insights & Storytelling:

The histogram above shows the age distribution of survey respondents. Most respondents fall within the age range of 12 to 34, with the highest frequency around 27. This indicates that the survey might have targeted a younger audience, as the number of respondents decreases after 34. The distribution is slightly skewed to the right, indicating a higher concentration of younger participants.

Case Study 2: Exploratory Data Analysis

In this topic, we’ll explore the relationship between hours of study and exam scores using a small sample dataset.

# Load necessary packages
library(ggplot2)

# Sample data: Study hours vs. Exam scores
study_exam_data <- data.frame(Study_Hours = c(5, 6, 4, 12, 3, 15),
                              Exam_Score = c(65, 97, 55, 80, 50, 85))

# Create a scatter plot to visualize the relationship
ggplot(study_exam_data, aes(x = Study_Hours, y = Exam_Score)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Study Hours vs. Exam Scores") +
  theme_minimal()

Insights & Storytelling:

The scatter plot above displays the relationship between study hours and exam scores. Exam scores tend to rise as study hours increase, which indicates a positive correlation. The red trend line shows the fitted linear relationship between the two variables, suggesting that more hours of study are generally associated with better exam performance. However, there is an outlier: the participant who scored 97 despite studying only 6 hours. This insight emphasizes the importance of considering other factors that might influence exam performance.
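The visual impression can be checked numerically. The following sketch computes the correlation, fits the same linear model that geom_smooth(method = "lm") draws, and finds the observation with the largest residual. The variable names r, fit, and outlier_row are introduced here for illustration.

```r
study_exam_data <- data.frame(Study_Hours = c(5, 6, 4, 12, 3, 15),
                              Exam_Score = c(65, 97, 55, 80, 50, 85))

# Strength and direction of the linear association
r <- cor(study_exam_data$Study_Hours, study_exam_data$Exam_Score)

# Linear model of exam score on study hours
fit <- lm(Exam_Score ~ Study_Hours, data = study_exam_data)

# Row whose score deviates most from the fitted line
outlier_row <- which.max(abs(resid(fit)))

r
study_exam_data[outlier_row, ]
```

With only six observations, a single unusual point has a strong influence on both the correlation and the fitted line, so the result should be read as illustrative.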

Case Study 3: Categorical Data Visualization

For this topic, we’ll visualize the distribution of different car types using a small sample dataset.

# Load necessary packages
library(ggplot2)

# Sample data: Car types and their count
car_types <- data.frame(Car_Type = c("Sedan", "SUV", "Truck", "Sports"),
                        Count = c(12, 8, 5, 4))

# Create a bar chart to visualize car type distribution
ggplot(car_types, aes(x = Car_Type, y = Count, fill = Car_Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Car Types") +
  theme_minimal() +
  theme(legend.position = "none")

Insights & Storytelling:

The bar chart above illustrates the distribution of different car types in a dataset. Sedans are the most common car type, followed by SUVs and trucks, while sports cars are the least common. This information can provide insights into the preferences of the car-buying population. Sedans’ popularity might indicate a preference for family-oriented vehicles, while the low count of sports cars suggests a niche market.

Case Study 4: Time Series Visualization

In this topic, we’ll visualize the monthly average temperature for each of the 3 airports using the weather dataset from the nycflights13 package.

library(ggplot2)
library(dplyr)         # provides %>%, group_by(), summarize(), and mutate()
library(nycflights13)

# The 'weather' data frame is provided by nycflights13
# Calculate average temperature for each month and airport
avg_temp <- weather %>%
  group_by(origin, year, month) %>%
  summarize(avg_temp = mean(temp, na.rm = TRUE)) %>%
  ungroup()

# Create a new column for year-month
avg_temp <- avg_temp %>%
  mutate(year_month = as.Date(paste(year, month, "01", sep = "-")))

# Plot the data
ggplot(avg_temp, aes(x = year_month, y = avg_temp, color = origin)) +
  geom_line() +
  labs(title = "Average Monthly Temperature by Airport",
       x = "Year-Month",
       y = "Average Temperature (°F)",
       color = "Airport") +
  scale_x_date(date_labels = "%b %Y") +  # Format x-axis labels
  theme_minimal()

Insights & Storytelling:

The line chart above displays the monthly average temperature in New York City over the year for each airport. There is a clear pattern, with temperatures peaking during the summer months and dipping during the winter. The steady rise and fall in temperature illustrate the changing seasons. This visualization can be valuable for understanding weather patterns and planning activities based on temperature trends.
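The temp column in the weather data is recorded in degrees Fahrenheit. If a Celsius axis is preferred, the values can be converted before plotting. The helper name fahrenheit_to_celsius is hypothetical and introduced only for this sketch.

```r
# Convert degrees Fahrenheit to degrees Celsius
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Quick check with familiar reference points (freezing, mild, boiling)
converted <- fahrenheit_to_celsius(c(32, 50, 212))
converted
```

In the pipeline above, this could be applied as summarize(avg_temp = mean(fahrenheit_to_celsius(temp), na.rm = TRUE)), with the axis label set to match.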

Case Study 5: Geospatial Data Visualization

For this topic, we’ll create a map to visualize earthquake occurrences using the built-in quakes dataset, which records seismic events near Fiji.

library(leaflet)

# Sample data: Earthquake occurrences
earthquakes <- quakes[1:100, ]  # Subset for demonstration

# Create a leaflet map
m = leaflet(earthquakes) %>%
    addTiles() %>%
    addMarkers(lng = ~long, lat = ~lat,
                     popup = ~paste("Magnitude:", mag, "<br>Depth:", depth)) %>%
    addLegend("bottomright", pal = colorNumeric("red", domain = earthquakes$mag),
              values = earthquakes$mag, title = "Magnitude")

# Create a custom HTML title
title_html <- paste0("<h2 style='text-align:center;'>Earthquake Occurrences</h2>")

# Add the title to the map using addControl()
m %>% addControl(html = title_html, position = "topright")

Here, you’re creating the leaflet map. The leaflet() function initializes the map with the earthquakes dataset as the base data.

addTiles(): Adds a tile layer to the map, providing a base layer of map tiles.

addMarkers(): Adds markers to the map. The lng and lat arguments specify the longitude and latitude columns from the dataset. The popup argument defines the content of the popup that appears when clicking on a marker. In this case, it displays the magnitude and depth of the earthquake.

addLegend(): Adds a color legend to the map. The pal argument defines the color scale based on the magnitude values. The values argument specifies the values for the color scale, and the title argument sets the legend title.

Next, we create a custom HTML title using the paste0() function. The title is styled with inline CSS to center-align the text.

Finally, we add the title to the map using the addControl() function. The html argument is used to provide the HTML content of the control, and the position argument specifies where the control should be positioned on the map. In this example, we position the title in the top-right corner of the map.

After running this code, your leaflet map will display the custom title in the specified position. You can adjust the HTML content and styling to match your preferences.

The result is an interactive map with markers representing earthquake occurrences. Clicking on a marker displays a popup with the magnitude and depth of that event. The legend on the bottom right shows a color scale for magnitude values; note that the default markers drawn by addMarkers() are not themselves colored by magnitude.

Insights & Storytelling:

This map visualizes earthquake occurrences near Fiji, the region covered by the quakes dataset. Each marker represents one earthquake event. All markers have the same size and color, so the magnitude and depth of an event are read from its popup rather than from the marker itself.

By analyzing this map, we can observe that the events are not spread evenly over the region but tend to cluster, a pattern consistent with seismic activity along tectonic plate boundaries. It’s crucial to monitor and understand these occurrences for both scientific research and disaster preparedness.
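Since the default markers from addMarkers() all look alike, magnitude is only visible in the popups. One way to encode magnitude visually is to use circle markers whose radius and color depend on it. This is a sketch of an alternative, not the map built above; the names mag_pal and quake_map and the radius scaling factor are assumptions made for illustration.

```r
library(leaflet)

earthquakes <- quakes[1:100, ]  # same subset as above

# Palette that maps magnitude onto a yellow-to-red scale
mag_pal <- colorNumeric(palette = "YlOrRd", domain = earthquakes$mag)

quake_map <- leaflet(earthquakes) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~long, lat = ~lat,
                   radius = ~mag * 2,        # larger circle for a stronger quake
                   color = ~mag_pal(mag),    # color taken from the palette
                   stroke = FALSE, fillOpacity = 0.8,
                   popup = ~paste("Magnitude:", mag, "<br>Depth:", depth)) %>%
  addLegend("bottomright", pal = mag_pal, values = ~mag, title = "Magnitude")

quake_map
```

Because the legend and the markers now share the same palette, the legend actually describes what is drawn on the map.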

Case Study 6: Interactive Data Visualization using Shiny

For this topic, we’ll create an interactive visualization of car specifications using the mtcars dataset and the shiny package.

# Load necessary packages
library(shiny)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Interactive Car Specifications Visualization"),
  selectInput("car_var", "Select Car Variable:", choices = colnames(mtcars)),
  plotOutput("car_plot")
)

server <- function(input, output) {
  output$car_plot <- renderPlot({
    ggplot(mtcars, aes(x = .data[[input$car_var]])) +  # .data[[ ]] replaces the deprecated aes_string()
      geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
      labs(title = paste("Distribution of", input$car_var)) +
      theme_minimal()
  })
}

shinyApp(ui = ui, server = server)

Insights & Storytelling:

The interactive app above allows users to explore different car specifications from the mtcars dataset. By selecting a specific car variable, such as “mpg” (miles per gallon) or “hp” (horsepower), users can instantly visualize the distribution of that variable across the dataset. For instance, selecting “mpg” reveals that many cars have a mileage between 15 and 20 miles per gallon. Similarly, selecting “hp” shows a concentration of cars at roughly 100 to 150 horsepower. This interactive visualization empowers users to quickly grasp the characteristics of various car specifications and make informed comparisons.
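The variables in mtcars have very different ranges, so a fixed binwidth = 1 gives many narrow bins for hp but only a handful for wt. A possible refinement, sketched here with a hypothetical helper named auto_binwidth, is to derive the bin width from the range of the selected variable.

```r
# Hypothetical helper: bin width that yields roughly n_bins bins
auto_binwidth <- function(x, n_bins = 10) {
  diff(range(x, na.rm = TRUE)) / n_bins
}

bw_mpg <- auto_binwidth(mtcars$mpg)
bw_hp  <- auto_binwidth(mtcars$hp)

bw_mpg
bw_hp
```

Inside the server function, this could be used as geom_histogram(binwidth = auto_binwidth(mtcars[[input$car_var]]), ...). Passing bins = 10 to geom_histogram() is a simpler way to get a similar effect.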

Case Study 7: Visualizations and Storytelling

For this topic, we’ll create a visualization to explore movie genre preferences through ratings using the movies dataset from the ggplot2movies package.

# Install ggplot2movies if not already installed
if (!requireNamespace("ggplot2movies", quietly = TRUE)) {
  install.packages("ggplot2movies")
}

# Load required libraries
library(ggplot2)
library(ggplot2movies)
library(dplyr)
library(tidyr)

# Load the dataset
data(movies, package = "ggplot2movies")

# Convert the genre indicator columns into a long format
movies_long <- movies %>%
  select(title, rating, Action, Animation, Comedy, Drama, Documentary, Romance) %>%
  pivot_longer(cols = Action:Romance, names_to = "genre", values_to = "has_genre") %>%
  filter(has_genre == 1) %>%
  select(-has_genre)

# Create horizontal boxplots of rating, one per genre
ggplot(movies_long, aes(x = rating, y = genre)) +
  geom_boxplot(alpha = 0.7) + 
  labs(title = "Distribution of Ratings by Genre",
       x = "Rating",
       y = "Genre") +
  theme_minimal()

Insights & Storytelling:

  • High-rated and lower-rated Genres: Our analysis reveals that genres such as Documentary and Animation tend to receive higher median ratings compared to other genres. Action receives the lowest median rating among all different genres.

  • Consistent Ratings: The Animation genre shows a narrower interquartile range, indicating that animated films tend to have more consistent ratings across the board.

  • Wide Spread of Ratings: While genres like Action exhibit a wide spread of ratings, suggesting a diverse range of opinions, genres like Animation display more consistent ratings except for extremes. This might reflect the specialized nature of animations which cater to specific audience preferences.

  • Outliers: The presence of a few high-rated movies in genres such as Action points to standout films that might have been exceptionally well-received, contrasting with the more moderate average ratings in the genre.

  • Target Audience: For content creators and marketers, understanding that Documentary and Animation genres are often rated higher could influence strategies to invest more in these genres to capture a broader audience.

  • Genre Trends: Recognizing that genres with higher variability in ratings might be more divisive or polarizing, it could be beneficial to analyze specific movies within these genres to understand audience preferences better.

Case Study 8: Age Structure of Countries

This analysis extracts age structure data from Wikipedia and visualizes the proportions of different age groups by country. This is called web scraping.

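# Read the webpage containing age structure data. The rvest package must be installed first.
library(rvest)
library(dplyr)    # select(), mutate(), arrange()
library(tidyr)    # drop_na(), pivot_longer()
library(ggplot2)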
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_age_structure"
webpage <- read_html(url)

# Extract the first table from the webpage using XPath
age_data <- webpage %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/table') %>%
  html_table(fill = TRUE)

# Display the head (first 6 rows) of the data
head(age_data)  # The data has an issue: The column names become the first observation
## # A tibble: 6 × 4
##   Country   % of population by a…¹ % of population by a…² % of population by a…³
##   <chr>     <chr>                  <chr>                  <chr>                 
## 1 Country   0 to 14                15 to 64               65+                   
## 2 Afghanis… 38.5%                  57.6.%                 2.4%                  
## 3 Albania   18%                    66.9%                  15.1%                 
## 4 Algeria   30.8%                  62.3%                  6.9%                  
## 5 American… 25.3%                  66%                    8.7%                  
## 6 Andorra   12%                    67.7%                  20.4%                 
## # ℹ abbreviated names: ¹​`% of population by age[1]`,
## #   ²​`% of population by age[1]`, ³​`% of population by age[1]`
# Clean the data: remove the first row, which repeats the column headers
age_data <- age_data[-1, ] 

# Rename columns for easier reference (the table shown above has four columns)
colnames(age_data) <- c("Country", "Age_0_14", "Age_15_64", "Age_65_plus")

# Print to check if the data was read correctly
DT::datatable(age_data)  # DT package is used to better display data table

Select relevant columns and convert percentage strings to numeric:

age_data <- age_data %>%
  select(Country, Age_0_14, Age_15_64, Age_65_plus) %>%
  mutate(across(starts_with("Age"), ~ as.numeric(gsub("%", "", .)) / 100)) %>%  # Remove % and convert to decimal
  drop_na()  # Remove any rows with NA values

# Display the cleaned data
DT::datatable(age_data)
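The conversion step deserves a closer look, because scraped values are not always clean. The sketch below applies the same gsub() approach to three values taken from the table above, including the malformed entry "57.6.%", and contrasts it with a more tolerant parser. The function parse_percent is hypothetical and only one of several ways to handle such entries.

```r
# Values as they appear in the scraped table (note the stray dot in "57.6.%")
raw <- c("38.5%", "57.6.%", "18%")

# Approach used in these notes: strip "%" only; malformed entries become NA
simple <- suppressWarnings(as.numeric(gsub("%", "", raw)) / 100)

# Hypothetical, more tolerant parser: keep digits and dots, drop a trailing dot
parse_percent <- function(x) {
  cleaned <- gsub("[^0-9.]", "", x)
  cleaned <- sub("\\.$", "", cleaned)
  as.numeric(cleaned) / 100
}

tolerant <- parse_percent(raw)

simple
tolerant
```

With the simple approach, a row containing a malformed value is removed by drop_na(), so it is worth checking how many rows are lost after cleaning.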

Convert data to long format for visualization:

# Convert data to long format for visualization
age_data_long <- age_data %>%
  pivot_longer(cols = starts_with("Age"), names_to = "AgeGroup", values_to = "Proportion")

Create a stacked bar plot to show proportions of age groups by country

# Create a stacked bar plot to show proportions of age groups by country
ggplot(age_data_long, aes(x = reorder(Country, -Proportion), y = Proportion, fill = AgeGroup)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip coordinates for better readability
  labs(title = "Proportion of Age Groups by Country", x = "Country", y = "Proportion") +
  theme_minimal()

Because there are too many countries, the plot looks cluttered. You can plot a few typical countries instead. Let’s plot the USA, China, and India.

# Keep three countries and plot their age-group proportions as a stacked bar plot
three_Countries = subset(age_data_long, Country %in% c("United States", "China", "India"))

ggplot(three_Countries, aes(x = reorder(Country, -Proportion), y = Proportion, fill = AgeGroup)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip coordinates for better readability
  labs(title = "Proportion of Age Groups by Country", x = "Country", y = "Proportion") +
  theme_minimal()

three_Countries
## # A tibble: 9 × 3
##   Country       AgeGroup    Proportion
##   <chr>         <chr>            <dbl>
## 1 China         Age_0_14         0.163
## 2 China         Age_15_64        0.693
## 3 China         Age_65_plus      0.144
## 4 India         Age_0_14         0.245
## 5 India         Age_15_64        0.687
## 6 India         Age_65_plus      0.068
## 7 United States Age_0_14         0.181
## 8 United States Age_15_64        0.634
## 9 United States Age_65_plus      0.185

The age distribution data for the three countries reveals notable differences in demographic structures:

  • China has a relatively balanced age structure, with 16.3% of the population aged 0-14, 69.3% in the working-age group (15-64), and 14.4% aged 65 and older.
  • The USA has a slightly older population compared to China, with 18.1% of the population under 15, 63.4% in the working-age group, and a higher percentage of seniors (18.5% aged 65+).
  • India, in contrast, has a younger population, with 24.5% under age 15 and just 6.8% over 65, indicating a much smaller proportion of elderly, with 68.7% in the working-age group.

In summary, China and the USA have more significant aging populations, while India’s demographic is still heavily youthful, with a much smaller elderly population.

Identify the 15 youngest countries (largest share aged 0-14) and the 15 oldest countries (largest share aged 65+):

# Identify the top 15 youngest countries based on the proportion of the 0-14 age group
youngest_countries <- age_data %>%
  arrange(desc(Age_0_14)) %>%  # Arrange in descending order
  head(n = 15)  # Select the top 15

# Identify the top 15 oldest countries based on the proportion of the 65+ age group
oldest_countries <- age_data %>%
  arrange(desc(Age_65_plus)) %>%  # Arrange in descending order
  head(n = 15)  # Select the top 15

# Print results for youngest and oldest countries
cat("Top 15 Youngest Countries:")
## Top 15 Youngest Countries:
print(youngest_countries)
## # A tibble: 15 × 4
##    Country                          Age_0_14 Age_15_64 Age_65_plus
##    <chr>                               <dbl>     <dbl>       <dbl>
##  1 Niger                               0.495     0.478       0.027
##  2 Uganda                              0.47      0.506       0.024
##  3 Angola                              0.469     0.507       0.024
##  4 Mali                                0.468     0.501       0.031
##  5 Chad                                0.458     0.517       0.025
##  6 Democratic Republic of the Congo    0.457     0.518       0.025
##  7 Benin                               0.453     0.522       0.025
##  8 Mozambique                          0.447     0.524       0.029
##  9 Burundi                             0.423     0.544       0.034
## 10 Guinea-Bissau                       0.423     0.546       0.031
## 11 South Sudan                         0.421     0.553       0.026
## 12 Zambia                              0.421     0.551       0.028
## 13 Burkina Faso                        0.416     0.551       0.032
## 14 Cameroon                            0.415     0.553       0.032
## 15 Somalia                             0.414     0.558       0.028
cat("Top 15 Oldest Countries:")
## Top 15 Oldest Countries:
print(oldest_countries)
## # A tibble: 15 × 4
##    Country                   Age_0_14 Age_15_64 Age_65_plus
##    <chr>                        <dbl>     <dbl>       <dbl>
##  1 Monaco                       0.091     0.538       0.371
##  2 Japan                        0.121     0.584       0.295
##  3 Saint Pierre and Miquelon    0.131     0.616       0.253
##  4 Puerto Rico (US)             0.125     0.626       0.249
##  5 Germany                      0.138     0.625       0.237
##  6 Greece                       0.138     0.626       0.236
##  7 Italy                        0.119     0.645       0.236
##  8 Finland                      0.162     0.603       0.235
##  9 Slovenia                     0.143     0.625       0.232
## 10 Croatia                      0.138     0.631       0.231
## 11 Malta                        0.145     0.624       0.231
## 12 Saint Barthelemy             0.139     0.631       0.23 
## 13 Bermuda                      0.164     0.609       0.227
## 14 Estonia                      0.152     0.622       0.226
## 15 Romania                      0.154     0.62        0.226

Create a bar plot for the youngest countries:

# Create a bar plot for the youngest countries
ggplot(youngest_countries, aes(x = reorder(Country, Age_0_14), y = Age_0_14)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top 15 Youngest Countries", x = "Country", y = "Proportion of Age Group (0-14)") +
  coord_flip() +  # Flip coordinates for better readability
  theme_minimal()

Create a bar plot for the oldest countries:

# Create a bar plot for the oldest countries
ggplot(oldest_countries, aes(x = reorder(Country, Age_65_plus), y = Age_65_plus)) +
  geom_bar(stat = "identity", fill = "salmon") +
  labs(title = "Top 15 Oldest Countries", x = "Country", y = "Proportion of Age Group (65+)") +
  coord_flip() +  # Flip coordinates for better readability
  theme_minimal()

What summaries can you produce based on the results?

More Visualization Examples

Example 1:

We use the gapminder dataset from the gapminder package to showcase global population trends over time.

library(ggplot2)
library(gapminder)

# Create a narrative visualization
ggplot(gapminder, aes(x = year, y = log(pop), color = continent)) +
  geom_boxplot() +
  labs(title = "Global Population Trends by Continent") +
  theme_minimal()

The population data across continents from 1952 to 2007 reveals key insights about population distribution and variability:

Key Insights:

Using total population for each continent:

library(ggplot2)
library(gapminder)

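# Sum population by year and continent; convert to numeric first because
# pop is stored as integer and large continental totals can overflow to NA
Sum <- aggregate(pop ~ year + continent, data = gapminder,
                 FUN = function(x) sum(as.numeric(x)))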

# Create a narrative visualization
ggplot(Sum, aes(x = year, y = pop, color = continent)) +
  geom_line() +
  labs(title = "Global Population Trends by Continent") +
  theme_minimal()

The population data across five continents (Africa, Americas, Asia, Europe, and Oceania) from 1952 to 2007 shows substantial growth, particularly in Africa and Asia.

Overall, the data highlights the highest population growth in Africa and Asia, while Europe and Oceania saw more gradual increases over the same period.
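
To put a number on the growth described here, one option is to compare each continent's total population in the first and last years of the data. The following is a minimal sketch rather than part of the original analysis; the names totals, pop_start, pop_end, growth, and ratio are illustrative, and the population is converted to numeric before summing as a precaution against integer overflow.

```r
library(gapminder)

# Total population per continent and year; as.numeric() guards against
# integer overflow when summing the integer pop column
totals <- aggregate(pop ~ year + continent, data = gapminder,
                    FUN = function(x) sum(as.numeric(x)))

# Totals in the first and last years of the data
pop_start <- totals[totals$year == min(totals$year), c("continent", "pop")]
pop_end   <- totals[totals$year == max(totals$year), c("continent", "pop")]

# Growth ratio: last-year population divided by first-year population
growth <- merge(pop_start, pop_end, by = "continent",
                suffixes = c("_first", "_last"))
growth$ratio <- growth$pop_last / growth$pop_first

# Continents ordered from fastest to slowest relative growth
growth[order(-growth$ratio), ]
```

A ratio measures relative growth, so it complements the line chart, which emphasizes absolute totals.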

Example 2:

The iris dataset from the base R installation can be used to explore the relationships between sepal and petal measurements for different iris species.

# Load necessary packages
library(ggplot2)

# Create an exploratory data visualization report
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(title = "Exploring Sepal and Petal Measurements") +
  theme_minimal()
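
The scatter plot can be backed up with a numeric summary. As a rough sketch (the object name cors is an illustrative choice), the correlation between sepal length and petal length can be computed separately for each species using base R:

```r
# Pearson correlation between sepal and petal length within each species
cors <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Length, d$Petal.Length)
})

# One value per species, rounded for display
round(cors, 2)
```

Computing the correlation per species avoids mixing the three groups, which occupy different regions of the plot.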

Example 3:

Using the mtcars dataset from base R, we can compare how many car models have each number of cylinders.

# Load necessary packages
library(ggplot2)

# Create a comparative visualization
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "Comparison of Car Cylinders") +
  theme_minimal()
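
When geom_bar() is given only an x aesthetic, it counts the rows in each category. One way to check the bar heights, sketched here with base R's table() (the name counts is illustrative), is:

```r
# Number of car models for each cylinder count (4, 6, and 8)
counts <- table(mtcars$cyl)
counts

# The counts add up to the number of rows in mtcars
sum(counts) == nrow(mtcars)
```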

Example 4:

The AirPassengers dataset from the datasets package can be used to visualize monthly airline passenger numbers over time.

# Load necessary packages
library(ggplot2)
library(datasets)

# Create a time series visualization
plot(AirPassengers, main = "Monthly Airline Passenger Numbers") 

Note that AirPassengers is already stored as a time series (ts) object, so plot() draws it as a line over time without any extra preparation.
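
ggplot2 works with data frames rather than ts objects, so drawing the same series with ggplot() requires a conversion first. The sketch below shows one possible approach; the data frame ap and its column names month and passengers are illustrative choices, and the start date relies on the series beginning in January 1949.

```r
library(ggplot2)

# Convert the ts object to a data frame with one row per month;
# the series starts in January 1949
ap <- data.frame(
  month = seq(as.Date("1949-01-01"), by = "month",
              length.out = length(AirPassengers)),
  passengers = as.numeric(AirPassengers)
)

# Line chart of the monthly totals (recorded in thousands)
ggplot(ap, aes(x = month, y = passengers)) +
  geom_line() +
  labs(title = "Monthly Airline Passenger Numbers",
       x = "Month", y = "Passengers (thousands)") +
  theme_minimal()
```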

Example 5:

Using the iris dataset again, we can create a multi-modal visualization to explore the relationships between sepal and petal measurements along with species distribution.

# Load necessary packages
library(ggplot2)

# Create a multi-modal visualization
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm") +
  labs(title = "Multi-Modal Analysis of Sepal and Petal Measurements") +
  theme_minimal()
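
Because color is mapped only inside geom_point(), geom_smooth(method = "lm") fits a single straight line to all three species together. A minimal sketch of the corresponding model, fitted directly with lm() (the name fit is illustrative), is:

```r
# Linear regression of petal length on sepal length, pooled over species
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted line
coef(fit)

# Share of the variance in petal length explained by the line
summary(fit)$r.squared
```

Moving color = Species into the ggplot() call would instead produce one fitted line per species.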

Example 6:

The geom_bar() function in ggplot2 can be used to create bar charts that represent aggregated data. Here’s an example of using geom_bar() to create a bar chart by aggregating data:

library(ggplot2)
library(dplyr)  # provides %>%, group_by() and summarise()
data("diamonds")

# Aggregate data by cut and calculate average price
agg_data <- diamonds %>%
  group_by(cut) %>%
  summarise(avg_price = mean(price))

# Create a bar chart using geom_bar()
ggplot(agg_data, aes(x = cut, y = avg_price, fill = cut)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Price by Diamond Cut",
       x = "Cut", y = "Average Price") +
  theme_minimal()

agg_data <- diamonds %>% group_by(cut) %>% summarise(avg_price = mean(price)): Aggregate the data “diamonds” by the “cut” variable using the %>% pipe operator. The summarise() function calculates the average price for each diamond cut category.

ggplot(agg_data, aes(x = cut, y = avg_price, fill = cut)): Create a ggplot object using the aggregated data and specify aesthetic mappings. The x aesthetic is set to the “cut” variable, the y aesthetic is set to the “avg_price” variable, and the fill aesthetic is set to the “cut” variable for color grouping.

geom_bar(stat = "identity"): Add a layer of bars to the plot using the "identity" statistic. This creates a bar chart with heights corresponding to the average prices.

labs(title = "Average Price by Diamond Cut", x = "Cut", y = "Average Price"): Customize the plot labels, including the title, x-axis label, and y-axis label.

theme_minimal(): Apply the minimal theme to the plot, which removes unnecessary background elements.

When you run this code, you’ll generate a bar chart that displays the average price of diamonds based on their cut. Each bar represents a different diamond cut category, and the height of the bar represents the average price for that cut. This visualization allows you to compare the average prices across different diamond cut categories.
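
For comparison, the same kind of chart can be sketched with base R's aggregate() and ggplot2's geom_col(), which is a shorthand for geom_bar(stat = "identity"). This is an alternative sketch, not the method used above; the object names agg and p are illustrative.

```r
library(ggplot2)

# Average price per cut, computed with base R instead of dplyr
agg <- aggregate(price ~ cut, data = diamonds, FUN = mean)
names(agg)[2] <- "avg_price"

# geom_col() uses the y values as bar heights directly
p <- ggplot(agg, aes(x = cut, y = avg_price, fill = cut)) +
  geom_col() +
  labs(title = "Average Price by Diamond Cut",
       x = "Cut", y = "Average Price") +
  theme_minimal()
p
```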

Example 7:

The plotly package can be used to create an interactive map. Here we build a small sample data frame of countries by hand.

# Load package
library(plotly)

# Sample data: Country information
country_data <- data.frame(
  country = c("USA", "Canada", "Mexico"),
  lat = c(37.7749, 56.1304, 23.6345),
  lon = c(-122.4194, -106.3468, -102.5528),
  pop = c(331002651, 37742154, 128932753)
)

# Create an interactive map
plot_ly(data = country_data, type = "scattergeo", mode = "markers",
        lat = ~lat, lon = ~lon, text = ~country,
        # Marker size is given in pixels, so scale the raw population down
        marker = list(size = ~sqrt(pop) / 1000, color = ~pop, colorscale = "Viridis"))

library(plotly): Load the plotly package, which is used to create interactive visualizations.

plot_ly(): Create a new interactive plot using the plot_ly() function.

data = country_data: Specify the data source for the plot. The data object is expected to contain information about latitude (lat), longitude (lon), country names (country), and population (pop).

type = "scattergeo": Specify the type of plot, which is a scatter plot on a geographical map.

mode = "markers": Set the mode of the plot to display markers.

lat = ~lat, lon = ~lon: Define the latitude and longitude columns to place the markers on the map.

text = ~country: Specify the text that appears when hovering over a marker, which is based on the country name.

marker = list(size = ~sqrt(pop) / 1000, color = ~pop, colorscale = "Viridis"): Customize the appearance of the markers. Marker size is measured in pixels, so the population is rescaled (square root, divided by 1000) to keep the markers at a readable size. The color of each marker is determined by the population (pop) values, and the colorscale determines the color gradient used.

By running this code, you will generate an interactive map with markers representing different countries. The size and color of the markers indicate the population of each country, providing an engaging way to explore population distribution across the world. Users can interact with the map by zooming, panning, and hovering over markers to view country information.

Example 8:

You can create an interactive visualization using the mtcars dataset and the shiny package.

# Load necessary packages
library(shiny)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Interactive Car Data Visualization"),
  selectInput("x_var", "X-axis variable:", choices = colnames(mtcars)),
  selectInput("y_var", "Y-axis variable:", choices = colnames(mtcars)),
  plotOutput("scatter_plot")
)

server <- function(input, output) {
  output$scatter_plot <- renderPlot({
    # .data[[...]] looks up the column named by each input (aes_string() is deprecated)
    ggplot(mtcars, aes(x = .data[[input$x_var]], y = .data[[input$y_var]])) +
      geom_point() +
      labs(title = "Interactive Car Data Visualization") +
      theme_minimal()
  })
}

shinyApp(ui = ui, server = server)
Note that Shiny applications are not supported in static R Markdown documents, so the app is not displayed here; run the code in an R session to launch it.
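
The plotting logic of the app can also be tried without launching Shiny. The sketch below wraps it in a plain function; scatter_for() is a hypothetical helper introduced here for illustration and is not part of shiny or ggplot2.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of two mtcars columns chosen by name
scatter_for <- function(x_var, y_var) {
  # Stop early if either name is not a column of mtcars
  stopifnot(x_var %in% colnames(mtcars), y_var %in% colnames(mtcars))

  ggplot(mtcars, aes(x = .data[[x_var]], y = .data[[y_var]])) +
    geom_point() +
    labs(title = "Interactive Car Data Visualization") +
    theme_minimal()
}

# Example: weight on the x-axis, miles per gallon on the y-axis
scatter_for("wt", "mpg")
```

Inside the app, renderPlot() could call such a function with input$x_var and input$y_var, which keeps the plotting code easy to check on its own.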