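This RMarkdown document aims to provide a comprehensive understanding of data visualization using the ggplot2 package. We will cover bar charts, scatterplots, boxplots, histograms, and density plots, and finish with a worked example on reading and writing scores.
First, let’s load the libraries we will use.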
# install.packages("ggplot2") # Run this once if ggplot2 is not installed yet
library(ggplot2) # Load the ggplot2 package for data visualization
library(tidyverse) # For data processing
library(dplyr) # For data processing
ggplot2 is a data visualization package for R. It provides an implementation of the Grammar of Graphics, a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.
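To make the idea of semantic components concrete, here is a minimal sketch that builds one plot piece by piece. It assumes the built-in `mtcars` dataset, which is not part of this tutorial's data and is used only for illustration.

```r
library(ggplot2)

# Illustrative only: mtcars ships with R and is not one of this tutorial's datasets
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +    # data + aesthetic mappings
  geom_point() +                               # layer: the geometric object to draw
  scale_x_continuous(breaks = seq(1, 6, 1)) +  # scale: how values map onto the axis
  labs(x = "Weight (1000 lbs)",                # labels for the axes
       y = "Miles per gallon")

# layer_data() returns the data that a layer actually draws
head(layer_data(p, 1))
```

Each `+` adds one component, which is the pattern every plot in this document follows.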
Our Dataset
# Set random seed for reproducibility
set.seed(123)
# Generate grades for 100 students
data_grade <- data.frame(
Grade = sample(c("A", "B", "C", "D", "F"), size = 100, replace = TRUE), # Randomly sample grades A-F
Gender = sample(c("M", "F"), size = 100, replace = TRUE) # Randomly sample gender M or F
) # Creating the 'data_grade' data frame with random values
head(data_grade)
## Grade Gender
## 1 C M
## 2 C M
## 3 B M
## 4 B M
## 5 C F
## 6 F F
Step 1: Lay the Groundwork
# Initialize ggplot object and map Grade variable
base_plot <- ggplot(data_grade, aes(x = Grade)) # Create ggplot with 'data_grade' dataset, and map 'Grade' to x-axis
base_plot # Show the graph
Step 2: Choose the Right Kind of Plot
# Add bar layer to count frequency of each grade
base_plot_with_bars <- base_plot + geom_bar() # Adding bar geom to the base plot
# Display the completed plot
base_plot_with_bars
Step 3: Make it Nice
# Add title and labels to the plot
final_plot <- base_plot_with_bars +
labs(title = "Grade Distribution", # Set the title of the plot
x = "Grades", # Set the label for the x-axis
y = "Frequency") # Set the label for the y-axis
# Display the final bar chart
final_plot
A complete code chunk
bar_plot <- ggplot(data_grade, aes(x = Grade)) + # Setup the base ggplot
geom_bar() + # Add a bar geom to visualize frequencies
labs(title = "Grade Distribution", # Set the title of the plot
x = "Grades", # Label for the x-axis
y = "Frequency") # Label for the y-axis
bar_plot
Step 1: Lay the Groundwork
# Initialize ggplot object and map Grade and Gender variables
base_plot <- ggplot(data_grade, aes(x = Grade, fill = Gender)) # Base plot setup with 'fill' aesthetic mapped to 'Gender'
base_plot
Step 2: Choose the Right Kind of Plot
# Add bar layer and separate bars by Gender
base_plot_with_grouped_bars <- base_plot +
geom_bar(position = "dodge") # Adding bar geom with bars positioned side-by-side ("dodged")
base_plot_with_grouped_bars
Note: If position isn’t set to ‘dodge’, the default is ‘stack’. This stacks the bars, showing the distribution of a second variable such as ‘Gender’ within each ‘Grade’ category. For example, if your x variable is “Grade” and your fill variable is “Gender”, you would see how many males and females received each grade, all in a single stacked bar.
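The sketch below compares the position settings side by side. The data frame `grades` is a hypothetical stand-in generated the same way as `data_grade`; the `"fill"` position is included as one more option that the text above does not cover.

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in for data_grade, generated the same way as in the text
grades <- data.frame(
  Grade  = sample(c("A", "B", "C", "D", "F"), size = 100, replace = TRUE),
  Gender = sample(c("M", "F"), size = 100, replace = TRUE)
)

base <- ggplot(grades, aes(x = Grade, fill = Gender))

p_stack <- base + geom_bar()                    # default: position = "stack"
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, rescaled to proportions

# The underlying counts are the same in all three; only the bar placement changes
table(grades$Grade, grades$Gender)
```

Printing `p_stack`, `p_dodge`, and `p_fill` in turn shows the same counts arranged three different ways.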
Step 3: Make it Nice
# Add title and labels to the plot
final_plot <- base_plot_with_grouped_bars +
labs(title = "Grade Distribution by Gender", # Plot title
x = "Grades", # X-axis label
y = "Frequency") # Y-axis label
# Display the final bar chart
final_plot
A complete code chunk
bar_plot_by_gender <- ggplot(data_grade, aes(x = Grade, fill = Gender)) + # Initialize ggplot and map Grade and Gender variables
geom_bar(position = "dodge") + # Add bar geom and set bars to be side-by-side
labs(title = "Grade Distribution by Gender", # Plot title
x = "Grades", # X-axis label
y = "Frequency") # Y-axis label
bar_plot_by_gender
First, let’s generate some data. We create a dataset with 100 students, each with a unique ID, a level of education in years, and a vocabulary score.
# Set random seed for reproducibility
set.seed(123)
# Generate student IDs and education years for 100 students, and make sure they are integers
data_student <- data.frame(
Student_ID = 1:100, # Student IDs from 1 to 100
Gender = sample(c("M", "F"),
size = 100,
replace = TRUE), # Gender: randomly assigns "M" or "F" to each row
Education = round(runif(100, 0, 20)) # Random years of education between 0 and 20, rounded to integers
)
# Derive 'Education_level' from 'Education' (years of education)
data_student <- data_student %>%
mutate(Education_level = case_when(
Education < 9 ~ "High School", # Less than 9 years
Education < 16 ~ "Undergraduate", # 9 to 15 years
TRUE ~ "Postgraduate" # 16 years and above
)) %>%
# Specifying the order of levels helps in ordering the factor levels in a meaningful way
mutate(Education_level = factor(Education_level, levels = c("High School",
"Undergraduate",
"Postgraduate"))) %>%
# Generate Vocabulary_Score as a function of Education, with some random noise, and make them integers
mutate(Vocabulary_Score = round(20 + 2 * Education + rnorm(100, 0, 5))) # Vocabulary_Score = 20 + 2 * Education + random noise, rounded to integers
head(data_student, n = 10) # Look at the first 10 rows
## Student_ID Gender Education Education_level Vocabulary_Score
## 1 1 M 12 Undergraduate 40
## 2 2 M 7 High School 35
## 3 3 M 10 Undergraduate 39
## 4 4 F 19 Postgraduate 56
## 5 5 M 10 Undergraduate 35
## 6 6 F 18 Postgraduate 56
## 7 7 F 18 Postgraduate 52
## 8 8 F 12 Undergraduate 36
## 9 9 M 8 High School 34
## 10 10 M 3 High School 31
Step 1: Lay the Groundwork
Before we can visualize anything, we need to tell ggplot what data we’re using and what variables are of interest. We want to look at how Education and Vocabulary_Score relate to each other, so we’ll put them on the x and y axes, respectively.
# Initialize ggplot object and map Education and Vocabulary_Score variables
base_plot <- ggplot(data_student, aes(x = Education, y = Vocabulary_Score)) # x mapped to Education, y mapped to Vocabulary_Score
base_plot
Step 2: Choose the Right Kind of Plot
Now, we choose the type of plot that best illustrates the data. In this case, a scatter plot is appropriate for showing the relationship between two continuous variables.
# Add scatter layer to visualize the relationship
base_plot_with_scatter <- base_plot +
geom_point() # Adding point geom to visualize individual data points
base_plot_with_scatter
Step 3: Make it Nice
In this step, we add more layers to the plot to make it more informative and visually appealing. We’ll add titles and axis labels, as well as specify axis limits and breaks. Finally, a linear regression line will be added to better visualize the trend in the data.
# Add title, labels, axis limits, breaks, and a linear regression line to the plot
final_plot <- base_plot_with_scatter +
labs(title = "Scatterplot of Education vs Vocabulary Score", # Setting the title
x = "Years of Education", # Labeling x-axis
y = "Vocabulary Score") + # Labeling y-axis
scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) + # Setting x-axis limits and breaks
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + # Setting y-axis limits and breaks
geom_smooth(method = lm, se = TRUE, level = 0.95) # Adding linear regression line and error bands
# Display the final scatter plot
final_plot
## `geom_smooth()` using formula = 'y ~ x'
Note: for geom_smooth():
method = lm: This tells ggplot2 to use linear modeling to fit a line to the data.
se = TRUE: This adds a shaded region around the line, representing the confidence interval around the fitted line.
level = 0.95: This sets the confidence level of that interval. A level of 0.95 corresponds to a 95% confidence interval.
A complete code chunk
Finally, here is a complete code chunk that combines all the steps. This is a more concise version that can be useful for future reference.
# Combine all steps for the scatterplot
final_scatter <- ggplot(data_student,
aes(x = Education,
y = Vocabulary_Score)) + # Map variables
geom_point() + # Add point geom
labs(title = "Scatterplot of Education vs Vocabulary Score", # Set title
x = "Years of Education", # Set x-axis label
y = "Vocabulary Score") + # Set y-axis label
scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) + # Setting x-axis limits and breaks
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + # Setting y-axis limits and breaks
geom_smooth(method = lm, se = TRUE, level = 0.95) # Adding linear regression line and error bands
final_scatter
## `geom_smooth()` using formula = 'y ~ x'
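To see what the regression line and shaded band in the scatterplot correspond to numerically, the sketch below fits the same kind of model directly with `lm()`. It uses simulated data in the spirit of `data_student`; the names `education`, `vocab`, `fit`, and `band` are illustrative, and the match with `geom_smooth()` is assumed to hold for the default `formula = 'y ~ x'`.

```r
set.seed(123)
# Simulated data in the spirit of data_student (names are illustrative)
education <- round(runif(100, 0, 20))
vocab     <- round(20 + 2 * education + rnorm(100, 0, 5))

fit <- lm(vocab ~ education)                 # same model form as formula = 'y ~ x'
new <- data.frame(education = c(5, 10, 15))  # x values at which to read off the line

# fit = point on the regression line; lwr/upr = 95% confidence interval for the mean
band <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
band
```

The `fit` column traces the line, and the `lwr` and `upr` columns trace the edges of the band at the chosen x values.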
Now, we’ll extend the ggplot object to include the Gender variable as a color aesthetic and show how the relationship between years of education and vocabulary scores differs between female and male participants.
Step 1: Lay the Groundwork
# Initialize ggplot object with Gender as a color aesthetic
base_plot <- ggplot(data_student, aes(x = Education, y = Vocabulary_Score, color = Gender))
base_plot
Step 2: Choose the Right Kind of Plot
A scatter plot is still the appropriate visualization. We’ll add the points, colored by gender.
# Add scatter layer to visualize the relationship
base_plot_with_scatter <- base_plot +
geom_point() # Adding point geom
base_plot_with_scatter
Step 3: Make it Nice
In this final step, we’ll polish the plot by adding title, labels, and axis limits. Additionally, we’ll include separate regression lines for each gender.
# Add title, labels, axis limits, and regression lines for each gender to the plot
final_plot <- base_plot_with_scatter +
labs(title = "Scatterplot of Education vs Vocabulary Score by Gender", # Setting the title
x = "Years of Education", # Labeling x-axis
y = "Vocabulary Score") + # Labeling y-axis
scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20, 4)) + # Setting x-axis limits and breaks
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + # Setting y-axis limits and breaks
geom_smooth(method = lm, se = TRUE, level = 0.95) # Adding linear regression line and error bands
# Display the final scatter plot
final_plot
## `geom_smooth()` using formula = 'y ~ x'
A complete code chunk
Here is the complete code chunk combining all steps for your convenience:
# Combine all steps for the scatterplot
final_scatter <- ggplot(data_student,
aes(x = Education, y = Vocabulary_Score, color = Gender)) +
geom_point() +
labs(title = "Scatterplot of Education vs Vocabulary Score by Gender",
x = "Years of Education",
y = "Vocabulary Score") +
scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20, 4)) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
geom_smooth(method = lm, se = TRUE, level = 0.95)
final_scatter
## `geom_smooth()` using formula = 'y ~ x'
This should produce a scatterplot where the points are colored by gender, and regression lines are plotted for each gender group.
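The per-group regression lines can be thought of as one linear model fitted to each gender separately. The sketch below shows that idea with base R on a simulated stand-in for `data_student`; the object names `students` and `fits` are illustrative.

```r
set.seed(123)
# Simulated stand-in for data_student (illustrative names)
students <- data.frame(
  Gender    = sample(c("M", "F"), size = 100, replace = TRUE),
  Education = round(runif(100, 0, 20))
)
students$Vocabulary_Score <- round(20 + 2 * students$Education + rnorm(100, 0, 5))

# One lm() per gender, mirroring the separate lines drawn when color = Gender
fits <- lapply(split(students, students$Gender),
               function(d) coef(lm(Vocabulary_Score ~ Education, data = d)))
fits
```

Each element of `fits` holds the intercept and slope for one gender, which is what the two colored lines display.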
Step 1: Lay the Groundwork
First, we specify only the y variable because we are dealing with a single continuous variable.
# Initialize ggplot object
base_plot1 <- ggplot(data_student, aes(y = Vocabulary_Score)) # Map Vocabulary_Score to y
base_plot1 # Display the graph
Step 2: Choose the Right Kind of Plot
We choose a boxplot for this situation.
# Add boxplot layer
base_plot1_with_box <- base_plot1 + geom_boxplot() # Add boxplot
base_plot1_with_box # Display the graph
Interpretation of a boxplot
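As a rough guide to reading a boxplot, the sketch below computes the pieces by hand and compares them with what `geom_boxplot()` computes. It assumes the default settings (whiskers reach at most 1.5 times the interquartile range) and uses simulated scores, so the object names and values are illustrative.

```r
library(ggplot2)

set.seed(123)
scores <- data.frame(score = round(rnorm(100, mean = 40, sd = 12)))  # illustrative data

q   <- quantile(scores$score, c(0.25, 0.50, 0.75))  # box: Q1, median, Q3
iqr <- unname(q[3] - q[1])                          # interquartile range
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences;
# observations beyond the fences are drawn as individual points
inside   <- scores$score[scores$score >= lower_fence & scores$score <= upper_fence]
whiskers <- range(inside)
outliers <- scores$score[scores$score < lower_fence | scores$score > upper_fence]

# Compare with the statistics computed by geom_boxplot()
box <- layer_data(ggplot(scores, aes(y = score)) + geom_boxplot())
box[, c("ymin", "lower", "middle", "upper", "ymax")]
```

Here `lower`, `middle`, and `upper` are the edges and center line of the box, while `ymin` and `ymax` are the whisker ends.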
Step 3: Make it Nice
Here, we add a title and Y-axis label.
# Finalize the boxplot
final_plot1 <- base_plot1_with_box +
labs(title = "Boxplot of Vocabulary Scores", # Title
y = "Vocabulary Score") + # Y-axis label
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) # Remove info in x-axis
final_plot1 # Display the graph
A complete code chunk
final_plot1 <- ggplot(data_student, aes(y = Vocabulary_Score)) + # Map variables
geom_boxplot() + # Add boxplot
labs(title = "Boxplot of Vocabulary Scores", # Title
y = "Vocabulary Score") + # Y-axis label
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) # Remove info in x-axis
final_plot1
Step 1: Lay the Groundwork
# Initialize ggplot object for situation 2
base_plot2 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score)) # Map variables
base_plot2 # Display the graph
Step 2: Choose the Right Kind of Plot
# Add boxplot layer
base_plot2_with_box <- base_plot2 + geom_boxplot() # Add boxplot
base_plot2_with_box # Display the graph
Step 3: Make it Nice
# Finalize the boxplot
final_plot2 <- base_plot2_with_box +
labs(title = "Boxplot of Vocabulary Scores by Gender", # Title
x = "Gender", # X-axis label
y = "Vocabulary Score") # Y-axis label
final_plot2 # Display the graph
A complete code chunk
final_plot2 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score)) + # Map variables
geom_boxplot() + # Add boxplot
labs(title = "Boxplot of Vocabulary Scores by Gender", # Title
x = "Gender", # X-axis label
y = "Vocabulary Score") + # Y-axis label
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) # Modify Y-axis scales
final_plot2 # Display the graph
Step 1: Lay the Groundwork
# Initialize ggplot object for situation 3
base_plot3 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score, fill = Education_level)) # Map variables
base_plot3 # Display the graph
Step 2: Choose the Right Kind of Plot
# Add boxplot layer
base_plot3_with_box <- base_plot3 + geom_boxplot() # Add boxplot
base_plot3_with_box # Display the graph
Step 3: Make it Nice
# Finalize the boxplot
final_plot3 <- base_plot3_with_box +
labs(title = "Boxplot of Vocabulary Scores by Gender and Education Level", # Title
x = "Gender", # X-axis label
y = "Vocabulary Score") # Y-axis label
final_plot3 # Display the graph
A complete code chunk
final_plot3 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score, fill = Education_level)) + # Map variables
geom_boxplot() + # Add boxplot
labs(title = "Boxplot of Vocabulary Scores by Gender and Education Level", # Title
x = "Gender", # X-axis label
y = "Vocabulary Score") + # Y-axis label
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + # Modify Y-axis scales
scale_fill_brewer(palette = "Set2") # Modify color scale: apply a palette from the ColorBrewer project, which offers carefully crafted palettes, many of them designed to be colorblind-friendly and print-friendly.
final_plot3 # Display the graph
Step 1: Lay the Groundwork
# Initialize ggplot object for situation 4
base_plot4 <- ggplot(data_student, aes(x = Education_level, y = Vocabulary_Score)) # Map variables
base_plot4 # Display the graph
Step 2: Choose the Right Kind of Plot
# Add boxplot layer
base_plot4_with_box <- base_plot4 + geom_boxplot() # Add boxplot
base_plot4_with_box # Display the graph
Step 3: Make it Nice
# Finalize the boxplot
final_plot4 <- base_plot4_with_box +
labs(title = "Boxplot of Vocabulary Scores by Education Level, Faceted by Gender", # Title
x = "Education Level", # X-axis label
y = "Vocabulary Score") + # Y-axis label
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + # Modify Y-axis scales
facet_wrap(~Gender) # Add facets for Gender
final_plot4 # Display the graph
A complete code chunk
final_plot4 <- ggplot(data_student, aes(x = Education_level, y = Vocabulary_Score)) + # Map variables
geom_boxplot() + # Add boxplot
labs(title = "Boxplot of Vocabulary Scores by Education Level, Faceted by Gender", # Title
x = "Education Level", # X-axis label
y = "Vocabulary Score") + # Y-axis label
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) + # Modify Y-axis scales
facet_wrap(~Gender) # Add facets for Gender
final_plot4 # Display the graph
Data Preparation
Let’s prepare a sample dataset that contains ‘Income’, ‘Education’, and ‘Location’.
set.seed(123) # For reproducibility
# Generate dataset
data_sample <- data.frame(
Income = sample(30000:100000, size = 2000, replace = TRUE), # Randomly sample income
Education = sample(c("Secondary School", "Undergraduate", "Graduate"), size = 100, replace = TRUE), # Randomly sample 100 education levels; data.frame() recycles them to fill all 2000 rows
Location = sample(c("Los Angeles", "Lansing", "Houston"), size = 100, replace = TRUE) # Randomly sample 100 locations; also recycled to fill all 2000 rows
)
# Modifying dataset to reflect certain trends
data_sample <- data_sample %>%
mutate(Income = case_when(Education == "Undergraduate" ~ as.numeric(Income + 20000), # Adjust income for Undergraduates
Education == "Graduate" ~ as.numeric(Income + 40000), # Adjust income for Graduates
TRUE ~ as.numeric(Income))) %>% # Default: Keep the original income
mutate(Income = case_when(Location == "Los Angeles" ~ as.numeric(Income + 20000), # Increase income for those in Los Angeles
Location == "Lansing" ~ as.numeric(Income - 10000), # Decrease income for those in Lansing
TRUE ~ as.numeric(Income))) %>% # Default: Keep the original income
mutate(Education = factor(Education, levels = c("Secondary School", "Undergraduate", "Graduate"))) # Order Education factor levels
# Show the first few rows of the dataset
head(data_sample)
## Income Education Location
## 1 121662 Graduate Houston
## 2 87869 Secondary School Houston
## 3 92985 Graduate Los Angeles
## 4 99924 Graduate Houston
## 5 118292 Undergraduate Houston
## 6 112554 Undergraduate Houston
Histograms are useful for understanding the distribution of single continuous variables. They partition the range of the variable into bins and show how many observations fall into each bin.
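The binning step can be reproduced by hand, which may help clarify what each bar stands for. The sketch below uses illustrative income values and base R's `cut()`; note that `geom_histogram()` chooses its own bin edges by default, so its bars need not match these counts one for one.

```r
set.seed(123)
income <- sample(30000:100000, size = 200, replace = TRUE)  # illustrative values

# Bin edges every 5000, covering the full range of the data
breaks <- seq(30000, 100000, by = 5000)
bins   <- cut(income, breaks = breaks, include.lowest = TRUE, dig.lab = 6)

# Each bar's height in a histogram is the number of observations in one bin
table(bins)
```

The table lists one count per bin, which is exactly the information a histogram draws as bar heights.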
One Continuous Variable: Income
Step 1: Lay the Groundwork
Before we get into plotting, let’s lay the groundwork. Here, we create a ggplot object and map our variable of interest, which is ‘Income’, to the x-axis.
base_histogram <- ggplot(data_sample, aes(x = Income)) # Initialize ggplot and map Income to x-axis
base_histogram # Display the plot
Step 2: Choose the Right Kind of Plot
Once we have the groundwork ready, the next step is to decide what kind of plot to use. In this case, we’ll add a histogram layer. We set the bin width to 5000 to make the income distribution clearer.
income_histogram <- base_histogram + geom_histogram(binwidth = 5000) # Add histogram layer with narrower binwidth of 5000
income_histogram # Display the plot
Step 3: Make it Nice
The final step is refining our plot. This involves adding meaningful labels and perhaps adjusting the scales for better understanding. Here, we set the x-axis limit and break points for better granularity.
nice_income_histogram <- income_histogram +
labs(title = "Income Distribution", # Set title
x = "Income", # Label for x-axis
y = "Frequency") + # Label for y-axis
scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) # Set x-axis limits and breaks
nice_income_histogram # Display the graph
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
One Continuous Variable by a categorical variable: Income by Location
Step 1: Lay the Groundwork
The first step is to initialize the ggplot object. Here we map the variable ‘Income’ to the x-axis.
# Initialize ggplot object and map Income variable to the x-axis
base_histogram_by_location <- ggplot(data_sample, aes(x = Income))
# Display the base plot to see how it looks
base_histogram_by_location
Step 2: Choose the Right Kind of Plot
Since our groundwork is laid, let’s add a histogram layer to it. We’ll continue using a bin width of 5000 to make the distribution clearer.
# Add histogram layer with a bin width of 5000
income_histogram_by_location <- base_histogram_by_location +
geom_histogram(binwidth = 5000)
# Display the plot to see how it looks
income_histogram_by_location
Step 3: Make it Nice
Here, we’ll add labels, adjust scales, and introduce facets to differentiate between different locations.
# Enhance the plot with appropriate titles, axis labels, and facet layer
nice_income_histogram_by_location <- income_histogram_by_location +
labs(title = "Income Distribution by Location", # Add a title
x = "Income", # Label the x-axis
y = "Frequency") + # Label the y-axis
scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) + # Set x-axis limits and breaks
facet_wrap(~ Location, ncol = 1) # Add the facet layer based on Location, arranged in a single column
# Display the graph
nice_income_histogram_by_location
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
General Interpretation:
This histogram now not only shows the distribution of income but also differentiates it by location. Each facet represents a different location, allowing for quick comparisons. For instance, you may notice that incomes are generally higher in Los Angeles compared to Lansing.
One Continuous Variable by Two Categorical Variables: Income by Location and Education
Step 1: Lay the Groundwork
The first thing to do is to initialize a ggplot object. We’ll map our variable of interest, ‘Income’, to the x-axis.
# Initialize ggplot object and map Income to x-axis
base_histogram_by_loc_edu <- ggplot(data_sample, aes(x = Income))
# Display the base plot
base_histogram_by_loc_edu
Step 2: Choose the Right Kind of Plot
Now we will add a histogram layer, this time also mapping the fill color to the “Education” levels.
# Add histogram layer with a bin width of 5000 and use aes() to set the fill color by 'Education'
income_histogram_by_loc_edu <- base_histogram_by_loc_edu +
geom_histogram(aes(fill = Education), binwidth = 5000, alpha = 0.4)
# Display the plot
income_histogram_by_loc_edu
Step 3: Make it Nice
Finally, we refine the plot. This involves adding labels, scales, and the facet layer that represents “Location”.
# Refine plot with labels, scales, and facets
nice_income_histogram_by_loc_edu <- income_histogram_by_loc_edu +
labs(title = "Income Distribution by Location and Education", # Title
x = "Income", # x-axis label
y = "Frequency") + # y-axis label
scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) + # x-axis limits and breaks
facet_wrap(~ Location, ncol = 1) + # Facet layer for 'Location'
scale_fill_brewer(palette="Dark2") # Use a pleasing color palette for 'Education'
# Display the refined plot
nice_income_histogram_by_loc_edu
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_bar()`).
General Interpretation:
This histogram now provides a more nuanced look at the income distribution by both location and education. The facets separate the locations, and the fill colors within each facet separate the education levels, which allows a more granular comparison. You might notice, for example, that people with a graduate education in Los Angeles tend to have higher incomes than their counterparts in Lansing or Houston.
Density plots visualize the distribution of a continuous variable by estimating the probability density function of the variable. They can be particularly useful when comparing distributions between multiple groups.
One Continuous Variable: Income
Step 1: Lay the Groundwork
We initiate by setting up a ggplot object, mapping ‘Income’ to the x-axis.
base_density <- ggplot(data_sample, aes(x = Income)) # Initialize ggplot and map Income to x-axis
base_density # Display the base plot
Step 2: Choose the Right Kind of Plot
Next, we’ll add a density layer to see the distribution of the ‘Income’ variable.
income_density <- base_density + geom_density(fill = "red", alpha = 0.4) # Add density layer with red fill and some transparency
income_density # Display the refined plot
Step 3: Make it Nice
Lastly, we refine our density plot by adding informative labels.
nice_income_density <- income_density +
scale_y_continuous(labels = scales::comma) + # Convert scientific notation to standard notation
labs(title = "Income Distribution", # Set title
x = "Income", # Label for x-axis
y = "Density") # Label for y-axis
nice_income_density # Display the graph
General Interpretation:
The y-axis in a density plot represents the estimated probability density at each value on the x-axis. The total area under the curve equals 1. The peak of the curve marks the mode of the distribution, and low or flat stretches (valleys) mark ranges where few observations fall.
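The claim about the area under the curve can be checked numerically. The sketch below is an approximation on illustrative data: it uses base R's `density()` with its default settings and a simple trapezoidal rule, so the result is close to 1 rather than exactly 1.

```r
set.seed(123)
income <- sample(30000:100000, size = 500, replace = TRUE)  # illustrative values

d <- density(income)  # kernel density estimate (Gaussian kernel by default)

# Trapezoidal rule: approximate the area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area

# x value at the highest point of the curve (the estimated mode)
d$x[which.max(d$y)]
```

Because incomes span tens of thousands of units, the density values themselves are very small, which is why the y-axis labels benefit from `scales::comma`.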
Bonus: Overlaying Density Plots
To compare the income distribution across different groups, say ‘Location’ or ‘Education’, you can overlay multiple density plots.
overlay_density <- ggplot(data_sample, aes(x = Income, fill = Location)) + # Map Income to x and use Location for fill color
geom_density(alpha = 0.4) + # Add density layer with some transparency
labs(title = "Income Distribution by Location", # Set title
x = "Income", # Label for x-axis
y = "Density") + # Label for y-axis
facet_wrap(~Education, ncol = 1) # Add the facet layer based on Education, arranged in a single column
overlay_density # Display the graph
This approach allows a clearer comparison of income distributions across the different locations, with each location’s density plot differentiated by color and each education level shown in its own facet.
Density plots can be particularly useful when overlaid on a histogram, as they provide a smooth estimate of the distribution, giving context to the bars of the histogram.
One Continuous Variable: Income
Step 1: Lay the Groundwork
As before, initiate by setting up a ggplot object, mapping ‘Income’ to the x-axis.
base_overlay <- ggplot(data_sample, aes(x = Income)) # Initialize ggplot and map Income to x-axis
base_overlay # Display the base plot
Step 2: Choose the Right Kind of Plot
This time, we will add both a histogram and density plot layer to visualize the distribution of the ‘Income’ variable.
overlay_histogram_density <- base_overlay +
geom_histogram(aes(y = after_stat(density)), binwidth = 5000, color = "black", fill = "white", alpha = 0.5) + # Add histogram layer with a bin width of 5000 and adjusted y aesthetic
geom_density(fill = "purple", alpha = 0.4) # Add density layer with purple fill and some transparency
overlay_histogram_density # Display the plot
Step 3: Make it Nice
Finalize the plot by adding appropriate labels and scales.
nice_overlay_histogram_density <- overlay_histogram_density +
labs(title = "Histogram and Density Plot of Income Distribution", # Set title
x = "Income", # Label for x-axis
y = "Density") + # Label for y-axis
scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) # Set x-axis limits and breaks
nice_overlay_histogram_density # Display the refined plot
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
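Remember, the area under the density curve equals 1, while in a default histogram the heights of the bars sum to the total number of observations. Mapping y to after_stat(density) rescales the bars so that their total area is also 1, which puts the histogram and the density curve on the same scale. The overlay makes it clear how the two visualizations relate and can be particularly enlightening for those new to data visualization.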
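The relationship between counts and density-scaled bars can be inspected directly. The sketch below uses illustrative data and `layer_data()` to pull out what the histogram layer computes; the object names `df`, `bars`, and `width` are assumptions for this example.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(Income = sample(30000:100000, size = 500, replace = TRUE))  # illustrative

p <- ggplot(df, aes(x = Income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000)

bars  <- layer_data(p)           # one row per bar
width <- bars$xmax - bars$xmin   # width of each bin

sum(bars$count)                  # bar counts add up to the number of rows
sum(bars$density * width)        # bar areas (height times width) add up to 1
```

This is why a density curve can be drawn on top of the rescaled histogram without a second axis.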
Loading the Data
First, let’s load the data and take a look at its structure:
rw <- read.csv("readwrite2.csv", header = TRUE) # Load the data from a CSV file
head(rw) # View the first few rows of the dataset
## id reading writing level
## 1 3070 18 21 3
## 2 1306 20 15 3
## 3 83 14 14 2
## 4 2486 16 20 4
## 5 1938 14 15 2
## 6 397 20 18 4
summary(rw) # Get a summary of the dataset
## id reading writing level
## Min. : 2 Min. : 2.00 Min. : 4.00 Min. :1.000
## 1st Qu.: 834 1st Qu.:16.00 1st Qu.:14.00 1st Qu.:2.000
## Median :1706 Median :18.00 Median :18.00 Median :3.000
## Mean :1686 Mean :18.34 Mean :17.04 Mean :3.082
## 3rd Qu.:2537 3rd Qu.:21.00 3rd Qu.:21.00 3rd Qu.:4.000
## Max. :3311 Max. :25.00 Max. :25.00 Max. :5.000
table(rw$level) # Count the number of participants in each group
##
## 1 2 3 4 5
## 226 342 524 853 52
Basic Visualization
Visualize any potential relationship between reading and writing scores:
ggplot(rw, aes(x = reading, y = writing)) +
geom_point() # Basic scatter plot
Addressing Overplotting
Overplotting can obscure data. Address it using position_jitter():
ggplot(rw, aes(x = reading, y = writing)) +
geom_point(position = position_jitter()) # Scatter plot with jittered points
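By default the jitter is random, so the plot changes slightly on every run. The sketch below shows one way to control the amount of jitter and make it reproducible with the `width`, `height`, and `seed` arguments of `position_jitter()`. The data frame `scores` is simulated for illustration and the value 0.3 is an arbitrary choice.

```r
library(ggplot2)

set.seed(123)
# Illustrative integer scores; many points share exactly the same coordinates
scores <- data.frame(reading = sample(10:20, 300, replace = TRUE),
                     writing = sample(10:20, 300, replace = TRUE))

p <- ggplot(scores, aes(x = reading, y = writing)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 42))

# Distance of each jittered point from its original integer value
jittered <- layer_data(p)
range(jittered$x - round(jittered$x))
```

Keeping the jitter below 0.5 means every point stays closer to its true integer score than to any neighboring score.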
Adding Regression Line
Visualize the linear relationship between reading and writing scores:
ggplot(rw, aes(x = reading, y = writing)) +
geom_point(position = position_jitter()) +
geom_smooth(method=lm, se=TRUE, level=0.95) # Scatter plot with regression line
## `geom_smooth()` using formula = 'y ~ x'
Introducing Colors by Proficiency Level
Colors can differentiate data groups. Let’s use colors to distinguish proficiency levels:
# Initial attempt with level as a numeric variable
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter())
Notice that continuous variables mapped to color will create a gradient of colors ranging from the minimum to the maximum value of the variable.
# Convert 'level' to a factor variable with meaningful labels
rw$level <- factor(rw$level, labels = c("Novice", "Low-IM", "High-IM", "Advanced", "Superior"))
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter()) # Scatter plot with factor levels colored
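The conversion step can be tried on a small vector first. The sketch below uses made-up level codes and also passes `levels = 1:5` explicitly, an extra safeguard (not in the code above) so each label is tied to a specific code even if some code is absent from the data.

```r
# Illustrative numeric codes, like the 'level' column before conversion
level_codes <- c(3, 3, 2, 4, 2, 4, 1, 5)

level_factor <- factor(level_codes,
                       levels = 1:5,  # state the code order explicitly
                       labels = c("Novice", "Low-IM", "High-IM", "Advanced", "Superior"))

is.numeric(level_codes)   # numeric input: ggplot2 uses a continuous color gradient
is.factor(level_factor)   # factor input: ggplot2 uses one discrete color per level
table(level_factor)
```

The table confirms which label each code received before the factor is used in a plot.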
Enhancing Visualization
Enhancements can improve clarity and interpretation. Here we adjust transparency, fit regression lines for each proficiency level, and ensure colors are perceptually uniform:
# Visual enhancement
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter(), alpha = 0.3) + # alpha: adjust the transparency level of your data points (0: transparent; 1: opaque)
stat_smooth(method = "lm") # this fits a line for each level because color = level in the global aes() also defines the groups
## `geom_smooth()` using formula = 'y ~ x'
# Adjusting aesthetics to fit an overall regression line while retaining group colors (move the color aesthetic to lower level function)
ggplot(rw, aes(x = reading, y = writing)) +
geom_point(aes(color = level), position = position_jitter(), alpha = .3) +
stat_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
IMPORTANT:
Notice Simpson’s paradox here: a trend that appears in several groups of data can disappear or reverse when the groups are combined.
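A small simulation can make the paradox easier to see. The data below are simulated and unrelated to `rw`: the slope is positive inside every group, but the group averages fall as x rises, so the pooled slope comes out negative. All names and numbers are assumptions chosen for illustration.

```r
set.seed(123)
# Simulated data, unrelated to rw: positive slope inside each group,
# but group averages that fall as x rises
g <- rep(1:3, each = 50)
x <- rnorm(150, mean = 5 * g, sd = 1)
y <- 30 - 8 * g + (x - 5 * g) + rnorm(150, sd = 0.5)
sim <- data.frame(x = x, y = y, g = factor(g))

# Slope from one model fitted to all the data
pooled_slope <- coef(lm(y ~ x, data = sim))[["x"]]

# Slopes from one model fitted per group
group_slopes <- sapply(split(sim, sim$g),
                       function(d) coef(lm(y ~ x, data = d))[["x"]])

pooled_slope   # negative: the combined trend points downward
group_slopes   # positive in every group
```

Comparing the overall line with the per-group lines in a plot is the visual version of comparing `pooled_slope` with `group_slopes`.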
# Using a viridis color scale
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter(), alpha = .3, size = 1) +
stat_smooth(method = "lm") +
scale_color_viridis_d()
## `geom_smooth()` using formula = 'y ~ x'
# Using a brewer color scale
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter(), alpha = .3, size = 1) +
stat_smooth(method = "lm") +
scale_color_brewer(palette = "Set1")
## `geom_smooth()` using formula = 'y ~ x'
Adjusting Axis Scales
Here we adjust axis scales to enhance the visualization:
# Adjusting x and y axis scales
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter(), alpha = .3, size = 2) +
stat_smooth(method = "lm", se = FALSE) +
scale_color_viridis_d() +
scale_x_continuous(limits = c(0, 30)) +
scale_y_continuous(limits = c(0, 30))
## `geom_smooth()` using formula = 'y ~ x'
# Use coordinate that applies 1:1 scale
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(position = position_jitter(), alpha = .3, size = 2) +
stat_smooth(method = "lm", se = FALSE) +
scale_color_viridis_d() +
scale_x_continuous(limits = c(0, 30)) +
scale_y_continuous(limits = c(0, 30)) +
coord_equal()
## `geom_smooth()` using formula = 'y ~ x'
Polishing the Final Visualization
The final touches make our visualization publication-ready:
ggplot(rw, aes(x = reading, y = writing, color = level)) +
geom_point(aes(color = level), position = position_jitter(), alpha = .3, size = 2) +
stat_smooth(method = "lm", se = FALSE) +
scale_color_viridis_d() +
scale_x_continuous(limits = c(0, 30)) +
scale_y_continuous(limits = c(0, 30)) +
coord_equal() +
labs(x = "\nReading score",
y = "Writing score\n",
title = "Relationship between reading and \nwriting scores by proficiency level",
color = "Proficiency level") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Creating Subplots
Subplots, or facets, allow for comparisons across groups:
ggplot(rw, aes(x = reading, y = writing)) +
geom_point(position = position_jitter(), alpha = .1) +
stat_smooth(method = "lm") +
coord_equal() +
facet_wrap(. ~ level, ncol = 5) + # Facet layer for proficiency level
labs(title = "Relationship between reading and writing scores by proficiency level",
x = "\nReading score",
y = "Writing score\n") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'