Introduction

This RMarkdown document aims to provide a comprehensive understanding of data visualization using the ggplot2 package. We will cover:

  • Bar Charts
  • Scatter Plots
  • Box Plots
  • Visualizing data distribution
  • A worked example of visualizing your data from scratch

First of all, let’s load the libraries that will be in use.

# install.packages("ggplot2")  # Uncomment and run once if ggplot2 is not installed
library(ggplot2)   # Load the ggplot2 package for data visualization
library(tidyverse) # Load the tidyverse (which also attaches ggplot2 and dplyr)
library(dplyr)     # For data processing (already attached by tidyverse)

ggplot2 is a data visualization package for R. It provides an implementation of the Grammar of Graphics, a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.
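
As a rough illustration of the layered idea, the sketch below builds a plot piece by piece and checks how many layers each step contributes. It uses R's built-in mtcars data, which is an assumption made only for this example and is not used elsewhere in this document.

```r
library(ggplot2)

# Data and aesthetic mappings only: no layers yet
p0 <- ggplot(mtcars, aes(x = wt, y = mpg))

# Each geom adds one layer on top of the same mappings
p1 <- p0 + geom_point()
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ x)

# Labels are not layers, so they leave the layer count unchanged
p3 <- p2 + labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Number of layers after each step
layer_counts <- c(length(p0$layers), length(p1$layers),
                  length(p2$layers), length(p3$layers))
layer_counts
```

The same pattern (mappings first, then a geom, then labels and scales) is what the three steps in each section below follow.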

Bar Charts

One Categorical Variable

Our Dataset

# Set random seed for reproducibility
set.seed(123)  

# Generate grades for 100 students
data_grade <- data.frame(
  Grade = sample(c("A", "B", "C", "D", "F"), size = 100, replace = TRUE),  # Randomly sample grades A-F
  Gender = sample(c("M", "F"), size = 100, replace = TRUE)  # Randomly sample gender M or F
)  # Creating the 'data_grade' data frame with random values

head(data_grade)
##   Grade Gender
## 1     C      M
## 2     C      M
## 3     B      M
## 4     B      M
## 5     C      F
## 6     F      F

Step 1: Lay the Groundwork

# Initialize ggplot object and map Grade variable
base_plot <- ggplot(data_grade, aes(x = Grade)) # Create ggplot with 'data_grade' dataset, and map 'Grade' to x-axis

base_plot # Show the graph

Step 2: Choose the Right Kind of Plot

# Add bar layer to count frequency of each grade
base_plot_with_bars <- base_plot + geom_bar()  # Adding bar geom to the base plot

# Display the completed plot
base_plot_with_bars

Step 3: Make it Nice

# Add title and labels to the plot
final_plot <- base_plot_with_bars +
  labs(title = "Grade Distribution",  # Set the title of the plot
       x = "Grades",                  # Set the label for the x-axis
       y = "Frequency")   # Set the label for the y-axis

# Display the final bar chart
final_plot

A complete code chunk

bar_plot <- ggplot(data_grade, aes(x = Grade)) +  # Setup the base ggplot
  geom_bar() +                                    # Add a bar geom to visualize frequencies
  labs(title = "Grade Distribution",              # Set the title of the plot
       x = "Grades",                              # Label for the x-axis
       y = "Frequency")                           # Label for the y-axis
  

bar_plot

Two Categorical Variables

Step 1: Lay the Groundwork

# Initialize ggplot object and map Grade and Gender variables
base_plot <- ggplot(data_grade, aes(x = Grade, fill = Gender)) # Base plot setup with 'fill' aesthetic mapped to 'Gender'
base_plot

Step 2: Choose the Right Kind of Plot

# Add bar layer and separate bars by Gender
base_plot_with_grouped_bars <- base_plot + 
  geom_bar(position = "dodge") # Adding bar geom with bars positioned side-by-side ("dodged")
base_plot_with_grouped_bars

Note: If position isn’t set to ‘dodge’, geom_bar() uses the default, ‘stack’. Stacking places the bars for the second variable (here ‘Gender’) on top of one another within each ‘Grade’ category. For example, with “Grade” on the x-axis and “Gender” as the fill variable, each grade gets a single bar whose colored segments show how many males and females received that grade.
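
To make the difference between positions concrete, the counts behind the bars can be tabulated directly. The sketch below uses a small made-up data frame (toy_grades is hypothetical, not the data_grade object above) and base R only. It shows which numbers each position would display, including ‘fill’, a third position option that is not used in this document.

```r
# Hypothetical toy data, for illustration only
toy_grades <- data.frame(
  Grade  = c("A", "A", "A", "B", "B", "C"),
  Gender = c("F", "M", "M", "F", "F", "M")
)

# One cell per Grade-Gender combination: the bar heights under "dodge"
counts <- table(toy_grades$Grade, toy_grades$Gender)
counts

# Row totals: the full height of each stacked bar under "stack"
stack_heights <- rowSums(counts)
stack_heights

# Row proportions: what position = "fill" would display (each bar sums to 1)
fill_shares <- prop.table(counts, margin = 1)
fill_shares
```

The same information is shown in all three cases. What changes is whether the eye compares group counts side by side, totals per grade, or shares within a grade.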

Step 3: Make it Nice

# Add title and labels to the plot
final_plot <- base_plot_with_grouped_bars +
  labs(title = "Grade Distribution by Gender",  # Plot title
       x = "Grades",                            # X-axis label
       y = "Frequency")                         # Y-axis label
  
# Display the final bar chart
final_plot

A complete code chunk

bar_plot_by_gender <- ggplot(data_grade, aes(x = Grade, fill = Gender)) +  # Initialize ggplot and map Grade and Gender variables
  geom_bar(position = "dodge") +  # Add bar geom and set bars to be side-by-side
  labs(title = "Grade Distribution by Gender",  # Plot title
       x = "Grades",                            # X-axis label
       y = "Frequency")                         # Y-axis label

bar_plot_by_gender

Scatter Plots

Two Continuous Variables

First, let’s generate some data. We create a dataset with 100 students, each with a unique ID, a level of education in years, and a vocabulary score.

# Set random seed for reproducibility
set.seed(123)

# Generate student IDs and education years for 100 students, and make sure they are integers
data_student <- data.frame(
  Student_ID = 1:100,                      # Student IDs from 1 to 100
  Gender = sample(c("M", "F"), 
                  size = 100, 
                  replace = TRUE),         # Gender: randomly assigns "M" or "F" to each row
  Education = round(runif(100, 0, 20))     # Random years of education between 0 and 20, rounded to integers
)

# Derive 'Education_level' from 'Education'
data_student <- data_student %>%
  mutate(Education_level = case_when(
    Education < 9 ~ "High School",  # Less than 9 years
    Education < 16 ~ "Undergraduate",  # 9 to 15 years
    TRUE ~ "Postgraduate"  # 16 years and above
  )) %>%
  # Specifying the order of levels helps in ordering the factor levels in a meaningful way
  mutate(Education_level = factor(Education_level, levels = c("High School",
                                                              "Undergraduate",
                                                              "Postgraduate"))) %>%
  # Generate Vocabulary_Score as a function of Education, with some random noise, and make them integers
  mutate(Vocabulary_Score = round(20 + 2 * Education + rnorm(100, 0, 5)))  # Create Vocabulary_Score from Education plus noise, rounded to integers

head(data_student, n = 10) # Look at the first 10 rows
##    Student_ID Gender Education Education_level Vocabulary_Score
## 1           1      M        12   Undergraduate               40
## 2           2      M         7     High School               35
## 3           3      M        10   Undergraduate               39
## 4           4      F        19    Postgraduate               56
## 5           5      M        10   Undergraduate               35
## 6           6      F        18    Postgraduate               56
## 7           7      F        18    Postgraduate               52
## 8           8      F        12   Undergraduate               36
## 9           9      M         8     High School               34
## 10         10      M         3     High School               31

Step 1: Lay the Groundwork

Before we can visualize anything, we need to tell ggplot what data we’re using and what variables are of interest. We want to look at how Education and Vocabulary_Score relate to each other, so we’ll put them on the x and y axes, respectively.

# Initialize ggplot object and map Education and Vocabulary_Score variables
base_plot <- ggplot(data_student, aes(x = Education, y = Vocabulary_Score)) # x mapped to Education, y mapped to Vocabulary_Score
base_plot

Step 2: Choose the Right Kind of Plot

Now, we choose the type of plot that best illustrates the data. In this case, a scatter plot is appropriate for showing the relationship between two continuous variables.

# Add scatter layer to visualize the relationship
base_plot_with_scatter <- base_plot + 
  geom_point()  # Adding point geom to visualize individual data points
base_plot_with_scatter

Step 3: Make it Nice

In this step, we add more layers to the plot to make it more informative and visually appealing. We’ll add titles and axis labels, as well as specify axis limits and breaks. Finally, a linear regression line will be added to better visualize the trend in the data.

# Add title, labels, axis limits, breaks, and a linear regression line to the plot
final_plot <- base_plot_with_scatter +
  labs(title = "Scatterplot of Education vs Vocabulary Score",  # Setting the title
       x = "Years of Education",                                # Labeling x-axis
       y = "Vocabulary Score") +                                # Labeling y-axis
  scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +  # Setting x-axis limits and breaks
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +  # Setting y-axis limits and breaks
  geom_smooth(method = lm, se = TRUE, level = 0.95)             # Adding linear regression line and error bands

# Display the final scatter plot
final_plot
## `geom_smooth()` using formula = 'y ~ x'

Note: for geom_smooth():

  • method = lm: This tells ggplot2 to use linear modeling to fit a line to the data.
  • se = TRUE: This adds a shaded band around the line, showing the confidence interval for the fitted line.
  • level = 0.95: This sets the confidence level of that band. A level of 0.95 corresponds to a 95% confidence interval.
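
To see where such a band comes from, the sketch below fits the same kind of linear model with base R and asks predict() for a confidence interval. To our understanding this is roughly what geom_smooth() computes for method = lm, but treat that as an assumption rather than a statement about its internals. The data are simulated for this example and are not the data_student object above.

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(50, 0, 20)
y <- 20 + 2 * x + rnorm(50, 0, 5)

fit <- lm(y ~ x)  # The straight line

# Confidence band for the fitted line at a few x values
grid <- data.frame(x = seq(0, 20, by = 5))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band  # Columns: fit (the line), lwr and upr (edges of the band)

# A higher confidence level gives a wider band
band99 <- predict(fit, newdata = grid, interval = "confidence", level = 0.99)
width95 <- band[, "upr"] - band[, "lwr"]
width99 <- band99[, "upr"] - band99[, "lwr"]
```

Comparing width95 and width99 shows the effect of the level argument: the line itself does not move, only the band around it.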

A complete code chunk

Finally, here is a complete code chunk that combines all the steps. This is a more concise version that can be useful for future reference.

# Combine all steps for the scatterplot
final_scatter <- ggplot(data_student, 
                        aes(x = Education, 
                            y = Vocabulary_Score)) +              # Map variables
  geom_point() +                                                  # Add point geom
  labs(title = "Scatterplot of Education vs Vocabulary Score",    # Set title
       x = "Years of Education",                                  # Set x-axis label
       y = "Vocabulary Score") +                                  # Set y-axis label
  scale_x_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +  # Setting x-axis limits and breaks
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +  # Setting y-axis limits and breaks
  geom_smooth(method = lm, se = TRUE, level = 0.95)             # Adding linear regression line and error bands

final_scatter
## `geom_smooth()` using formula = 'y ~ x'

Two Continuous Variables by a Factor

Now, we’ll extend the ggplot object to include the Gender variable as a color aesthetic and show how the relationship between years of education and vocabulary scores differs between female and male participants.

Step 1: Lay the Groundwork

# Initialize ggplot object with Gender as a color aesthetic
base_plot <- ggplot(data_student, aes(x = Education, y = Vocabulary_Score, color = Gender))  
base_plot

Step 2: Choose the Right Kind of Plot

A scatter plot is still the appropriate visualization. We’ll add the points, colored by gender.

# Add scatter layer to visualize the relationship
base_plot_with_scatter <- base_plot + 
  geom_point()  # Adding point geom
base_plot_with_scatter

Step 3: Make it Nice

In this final step, we’ll polish the plot by adding title, labels, and axis limits. Additionally, we’ll include separate regression lines for each gender.

# Add title, labels, axis limits, and regression lines for each gender to the plot
final_plot <- base_plot_with_scatter + 
  labs(title = "Scatterplot of Education vs Vocabulary Score by Gender",  # Setting the title
       x = "Years of Education",  # Labeling x-axis
       y = "Vocabulary Score") +  # Labeling y-axis
  scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20, 4)) +  # Setting x-axis limits and breaks
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +  # Setting y-axis limits and breaks
  geom_smooth(method = lm, se = TRUE, level = 0.95)  # Adding linear regression line and error bands

# Display the final scatter plot
final_plot
## `geom_smooth()` using formula = 'y ~ x'

A complete code chunk

Here is the complete code chunk combining all steps for your convenience:

# Combine all steps for the scatterplot
final_scatter <- ggplot(data_student, 
                        aes(x = Education, y = Vocabulary_Score, color = Gender)) +
  geom_point() +
  labs(title = "Scatterplot of Education vs Vocabulary Score by Gender",
       x = "Years of Education",
       y = "Vocabulary Score") +
  scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20, 4)) +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +
  geom_smooth(method = lm, se = TRUE, level = 0.95)
final_scatter
## `geom_smooth()` using formula = 'y ~ x'

This should produce a scatterplot where the points are colored by gender, and regression lines are plotted for each gender group.

Box Plots

One Continuous Variable

Step 1: Lay the Groundwork

First, we specify only the y variable because we are dealing with a single continuous variable.

# Initialize ggplot object 
base_plot1 <- ggplot(data_student, aes(y = Vocabulary_Score))  # Map Vocabulary_Score to y

base_plot1 # Display the graph

Step 2: Choose the Right Kind of Plot

We choose a boxplot for this situation.

# Add boxplot layer
base_plot1_with_box <- base_plot1 + geom_boxplot()  # Add boxplot

base_plot1_with_box # Display the graph

Interpretation of a boxplot

  • Box: Represents the Interquartile Range (IQR), containing the data between the 25th percentile (Q1) and the 75th percentile (Q3).
  • Midline: The line inside the box marks the median of the data, dividing it into two equal halves.
  • Whiskers: Extend from the box to the most extreme data points that lie within 1.5 * IQR of the box edges. The 1.5 multiplier is the default and can be customized.
  • Dots Beyond the Whiskers: These are often considered “outliers.” They are data points more than 1.5 * IQR away from either end of the box.
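
The pieces of a boxplot can also be computed by hand. The sketch below does so with base R on a handful of made-up scores (not the data_student object above). It assumes the default quantile method, so the numbers may differ slightly from software that defines quartiles differently.

```r
# Made-up scores, for illustration only
scores <- c(31, 34, 35, 35, 36, 39, 40, 52, 56, 90)

q <- quantile(scores, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- unname(q[3] - q[1])                         # Height of the box

# Fences: points beyond these are drawn as individual dots
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[3] + 1.5 * iqr)

# Whiskers stop at the most extreme observations inside the fences
inside <- scores[scores >= lower_fence & scores <= upper_fence]
whiskers <- range(inside)

# Everything outside the fences is flagged as a potential outlier
outliers <- scores[scores < lower_fence | scores > upper_fence]
```

Note that the whiskers end at actual data points (the values in whiskers), not at the fences themselves.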

Step 3: Make it Nice

Here, we add a title and a y-axis label, and remove the x-axis title, text, and ticks, which carry no information for a single boxplot.

# Finalize the boxplot
final_plot1 <- base_plot1_with_box +
  labs(title = "Boxplot of Vocabulary Scores",  # Title
       y = "Vocabulary Score") +  # Y-axis label
  theme(axis.title.x=element_blank(), 
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank()) # Remove info in x-axis 

final_plot1 # Display the graph

A complete code chunk

final_plot1 <- ggplot(data_student, aes(y = Vocabulary_Score)) +  # Map variables
  geom_boxplot() +  # Add boxplot
  labs(title = "Boxplot of Vocabulary Scores",  # Title
       y = "Vocabulary Score") +  # Y-axis label
  theme(axis.title.x=element_blank(), 
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank()) # Remove info in x-axis 

final_plot1

One Continuous Variable by One Factor (Gender)

Step 1: Lay the Groundwork

# Initialize ggplot object for situation 2
base_plot2 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score))  # Map variables

base_plot2 # Display the graph

Step 2: Choose the Right Kind of Plot

# Add boxplot layer
base_plot2_with_box <- base_plot2 + geom_boxplot()  # Add boxplot

base_plot2_with_box # Display the graph

Step 3: Make it Nice

# Finalize the boxplot
final_plot2 <- base_plot2_with_box +
  labs(title = "Boxplot of Vocabulary Scores by Gender",  # Title
       x = "Gender",  # X-axis label
       y = "Vocabulary Score")  # Y-axis label

final_plot2 # Display the graph

A complete code chunk

final_plot2 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score)) +  # Map variables
  geom_boxplot() +  # Add boxplot
  labs(title = "Boxplot of Vocabulary Scores by Gender",  # Title
       x = "Gender",  # X-axis label
       y = "Vocabulary Score") +  # Y-axis label
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10))  # Modify Y-axis scales

final_plot2 # Display the graph

One Continuous Variable by Two Factors (Color)

Step 1: Lay the Groundwork

# Initialize ggplot object for situation 3
base_plot3 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score, fill = Education_level))  # Map variables

base_plot3 # Display the graph

Step 2: Choose the Right Kind of Plot

# Add boxplot layer
base_plot3_with_box <- base_plot3 + geom_boxplot()  # Add boxplot

base_plot3_with_box # Display the graph

Step 3: Make it Nice

# Finalize the boxplot
final_plot3 <- base_plot3_with_box +
  labs(title = "Boxplot of Vocabulary Scores by Gender and Education Level",  # Title
       x = "Gender",  # X-axis label
       y = "Vocabulary Score")  # Y-axis label

final_plot3 # Display the graph

A complete code chunk

final_plot3 <- ggplot(data_student, aes(x = Gender, y = Vocabulary_Score, fill = Education_level)) +  # Map variables
  geom_boxplot() +  # Add boxplot
  labs(title = "Boxplot of Vocabulary Scores by Gender and Education Level",  # Title
       x = "Gender",  # X-axis label
       y = "Vocabulary Score") +  # Y-axis label
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +  # Modify Y-axis scales
  scale_fill_brewer(palette = "Set2")  # Modify fill colors using a ColorBrewer palette; several of these palettes are designed to be colorblind-friendly and print-friendly

final_plot3 # Display the graph

One Continuous Variable by Two Factors (Facets)

Step 1: Lay the Groundwork

# Initialize ggplot object for situation 4
base_plot4 <- ggplot(data_student, aes(x = Education_level, y = Vocabulary_Score))  # Map variables

base_plot4 # Display the graph

Step 2: Choose the Right Kind of Plot

# Add boxplot layer
base_plot4_with_box <- base_plot4 + geom_boxplot()  # Add boxplot

base_plot4_with_box # Display the graph

Step 3: Make it Nice

# Finalize the boxplot
final_plot4 <- base_plot4_with_box +
  labs(title = "Boxplot of Vocabulary Scores by Education Level, Faceted by Gender",  # Title
       x = "Education Level",  # X-axis label
       y = "Vocabulary Score") +  # Y-axis label
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +  # Modify Y-axis scales
  facet_wrap(~Gender)  # Add facets for Gender

final_plot4 # Display the graph

A complete code chunk

final_plot4 <- ggplot(data_student, aes(x = Education_level, y = Vocabulary_Score)) +  # Map variables
  geom_boxplot() +  # Add boxplot
  labs(title = "Boxplot of Vocabulary Scores by Education Level, Faceted by Gender",  # Title
       x = "Education Level",  # X-axis label
       y = "Vocabulary Score") +  # Y-axis label
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 10)) +  # Modify Y-axis scales
  facet_wrap(~Gender)  # Add facets for Gender

final_plot4 # Display the graph

Visualizing data distribution

Data Preparation

Let’s prepare a sample dataset that contains ‘Income’, ‘Education’, and ‘Location’.

set.seed(123)  # For reproducibility

# Generate dataset
data_sample <- data.frame(
  Income = sample(30000:100000, size = 2000, replace = TRUE),  # Randomly sample 2000 incomes
  Education = sample(c("Secondary School", "Undergraduate", "Graduate"), size = 100, replace = TRUE),  # Randomly sample 100 education levels; data.frame() recycles them to fill 2000 rows
  Location = sample(c("Los Angeles", "Lansing", "Houston"), size = 100, replace = TRUE)  # Randomly sample 100 locations; also recycled to fill 2000 rows
)

# Modifying dataset to reflect certain trends
data_sample <- data_sample %>%
  mutate(Income = case_when(Education == "Undergraduate" ~ as.numeric(Income + 20000), # Adjust income for Undergraduates
                            Education == "Graduate"      ~ as.numeric(Income + 40000), # Adjust income for Graduates
                            TRUE                         ~ as.numeric(Income))) %>%    # Default: Keep the original income
  mutate(Income = case_when(Location == "Los Angeles" ~ as.numeric(Income + 20000), # Increase income for those in Los Angeles
                            Location == "Lansing"     ~ as.numeric(Income - 10000), # Decrease income for those in Lansing
                            TRUE                      ~ as.numeric(Income))) %>%    # Default: Keep the original income
  mutate(Education = factor(Education, levels = c("Secondary School", "Undergraduate", "Graduate")))  # Order Education factor levels

# Show the first few rows of the dataset
head(data_sample)
##   Income        Education    Location
## 1 121662         Graduate     Houston
## 2  87869 Secondary School     Houston
## 3  92985         Graduate Los Angeles
## 4  99924         Graduate     Houston
## 5 118292    Undergraduate     Houston
## 6 112554    Undergraduate     Houston

Histograms

Histograms are useful for understanding the distribution of single continuous variables. They partition the range of the variable into bins and show how many observations fall into each bin.
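
The binning step can be reproduced by hand, which helps when reading a histogram. The sketch below uses base R and a few made-up incomes (not the data_sample object above). The bin edges are chosen by us for illustration, and geom_histogram() picks its own bin boundaries, so its counts can differ near the edges.

```r
# Made-up incomes, for illustration only
income <- c(31000, 34500, 36000, 41000, 42500, 43000, 47000, 52000)

# Bin edges 5000 apart, covering the range of the data
edges <- seq(30000, 55000, by = 5000)

# Assign each value to a bin [lower, upper), then count values per bin
bins <- cut(income, breaks = edges, right = FALSE, dig.lab = 6)
bin_counts <- table(bins)
bin_counts
```

Each entry of bin_counts is the height of one bar. A narrower bin width gives more, shorter bars, and a wider one gives fewer, taller bars.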

One Continuous Variable: Income

Step 1: Lay the Groundwork

Before we get into plotting, let’s lay the groundwork. Here, we create a ggplot object and map our variable of interest, which is ‘Income’, to the x-axis.

base_histogram <- ggplot(data_sample, aes(x = Income))  # Initialize ggplot and map Income to x-axis
base_histogram  # Display the plot

Step 2: Choose the Right Kind of Plot

Once we have the groundwork ready, the next step is to decide what kind of plot to use. In this case, we’ll add a histogram layer. We set the bin width to 5000 to make the income distribution clearer.

income_histogram <- base_histogram + geom_histogram(binwidth = 5000)  # Add histogram layer with narrower binwidth of 5000
income_histogram  # Display the plot

Step 3: Make it Nice

The final step is refining our plot. This involves adding meaningful labels and perhaps adjusting the scales for better understanding. Here, we set the x-axis limit and break points for better granularity.

nice_income_histogram <- income_histogram + 
  labs(title = "Income Distribution",  # Set title
       x = "Income",  # Label for x-axis
       y = "Frequency") +  # Label for y-axis
  scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) # Set x-axis limits and breaks  

nice_income_histogram # Display the graph
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

One Continuous Variable by a categorical variable: Income by Location

Step 1: Lay the Groundwork

The first step is to initialize the ggplot object. Here we map the variable ‘Income’ to the x-axis.

# Initialize ggplot object and map Income variable to the x-axis
base_histogram_by_location <- ggplot(data_sample, aes(x = Income))

# Display the base plot to see how it looks
base_histogram_by_location  

Step 2: Choose the Right Kind of Plot

Since our groundwork is laid, let’s add a histogram layer to it. We’ll continue using a bin width of 5000 to make the distribution clearer.

# Add histogram layer with a bin width of 5000
income_histogram_by_location <- base_histogram_by_location +
  geom_histogram(binwidth = 5000)  

# Display the plot to see how it looks
income_histogram_by_location  

Step 3: Make it Nice

Here, we’ll add labels, adjust scales, and introduce facets to differentiate between different locations.

# Enhance the plot with appropriate titles, axis labels, and facet layer
nice_income_histogram_by_location <- income_histogram_by_location + 
  labs(title = "Income Distribution by Location",  # Add a title
       x = "Income",  # Label the x-axis
       y = "Frequency") +  # Label the y-axis
  scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) +  # Set x-axis limits and breaks
  facet_wrap(~ Location, ncol = 1)  # Add the facet layer based on Location, arranged in a single column

# Display the graph
nice_income_histogram_by_location
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).

General Interpretation:

This histogram now not only shows the distribution of income but also differentiates it by location. Each facet represents a different location, allowing for quick comparisons. For instance, you may notice that incomes are generally higher in Los Angeles compared to Lansing.

One Continuous Variable by Two Categorical Variables: Income by Location and Education

Step 1: Lay the Groundwork

The first thing to do is to initialize a ggplot object. We’ll map our variable of interest, ‘Income’, to the x-axis.

# Initialize ggplot object and map Income to x-axis
base_histogram_by_loc_edu <- ggplot(data_sample, aes(x = Income))

# Display the base plot
base_histogram_by_loc_edu  

Step 2: Choose the Right Kind of Plot

Now we will add a histogram layer, this time also including color based on “Education” levels.

# Add histogram layer with a bin width of 5000 and use aes for color based on 'Education'
income_histogram_by_loc_edu <- base_histogram_by_loc_edu + 
  geom_histogram(aes(fill = Education), binwidth = 5000, alpha = 0.4)  

# Display the plot
income_histogram_by_loc_edu  

Step 3: Make it Nice

Finally, we refine the plot. This involves adding labels, scales, and the facet layer that represents “Location”.

# Refine plot with labels, scales, and facets
nice_income_histogram_by_loc_edu <- income_histogram_by_loc_edu + 
  labs(title = "Income Distribution by Location and Education",  # Title
       x = "Income",  # x-axis label
       y = "Frequency") +  # y-axis label
  scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000)) +  # x-axis limits and breaks
  facet_wrap(~ Location, ncol = 1) +  # Facet layer for 'Location'
  scale_fill_brewer(palette="Dark2")  # Use a pleasing color palette for 'Education'

# Display the refined plot
nice_income_histogram_by_loc_edu 
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_bar()`).

General Interpretation:

This histogram provides a more nuanced look at the income distribution by both location and education. Each facet shows one “Location”, and the fill color within a facet shows the “Education” level, which allows a more granular comparison. You might notice, for example, that people with a graduate education in Los Angeles tend to have higher incomes than their counterparts in Lansing or Houston.

Density plot

Density plots visualize the distribution of a continuous variable by estimating the probability density function of the variable. They can be particularly useful when comparing distributions between multiple groups.

One Continuous Variable: Income

Step 1: Lay the Groundwork

We begin by setting up a ggplot object, mapping ‘Income’ to the x-axis.

base_density <- ggplot(data_sample, aes(x = Income))  # Initialize ggplot and map Income to x-axis
base_density  # Display the base plot

Step 2: Choose the Right Kind of Plot

Next, we’ll add a density layer to see the distribution of the ‘Income’ variable.

income_density <- base_density + geom_density(fill = "red", alpha = 0.4)  # Add density layer with red fill and some transparency
  
income_density  # Display the refined plot

Step 3: Make it Nice

Lastly, we refine our density plot by adding informative labels.

nice_income_density <- income_density + 
  scale_y_continuous(labels = scales::comma) + # Convert scientific notation to standard notation
  labs(title = "Income Distribution",  # Set title
       x = "Income",  # Label for x-axis
       y = "Density")  # Label for y-axis 

nice_income_density  # Display the graph

General Interpretation:

The y-axis in a density plot represents the estimated probability density at each value on the x-axis. The total area under the curve is 1. The peak of the curve marks the mode of the distribution, and regions where the curve is low or flat (valleys) indicate values that are rare in the dataset.
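
The claim about the area can be checked numerically. The sketch below uses base R's density() on simulated incomes (not the data_sample object above) and approximates the area with the trapezoidal rule. Whether geom_density() uses exactly the same defaults is an assumption here.

```r
# Simulated incomes, for illustration only
set.seed(42)
income <- rnorm(500, mean = 80000, sd = 20000)

d <- density(income)  # Kernel density estimate on a grid of x values

# Trapezoidal rule: sum of (bin width) * (average height of the two ends)
n <- length(d$x)
area <- sum(diff(d$x) * (d$y[-1] + d$y[-n]) / 2)
area

# The heights themselves are tiny because the x-axis spans tens of thousands
peak_height <- max(d$y)
peak_height
```

This is why the y-axis of an income density plot shows very small numbers: the heights must be small for the area over such a wide x-range to stay close to 1.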

Bonus: Overlaying Density Plots

If students want to compare the income distribution across different groups, say ‘Location’ or ‘Education’, they can overlay multiple density plots.

overlay_density <- ggplot(data_sample, aes(x = Income, fill = Location)) +  # Map Income to x and use Location for fill color
  geom_density(alpha = 0.4) +  # Add density layer with some transparency
  labs(title = "Income Distribution by Location",  # Set title
       x = "Income",  # Label for x-axis
       y = "Density") +  # Label for y-axis
  facet_wrap(~Education, ncol = 1) # Add the facet layer based on Education, arranged in a single column

overlay_density  # Display the graph

This approach allows a clearer comparison of income distributions across the different locations: each location’s density curve has its own fill color, and each education level gets its own panel.

Overlaying the Density Plot to a Histogram

Density plots can be particularly useful when overlaid on a histogram, as they provide a smooth estimate of the distribution, giving context to the bars of the histogram.

One Continuous Variable: Income

Step 1: Lay the Groundwork

As before, begin by setting up a ggplot object, mapping ‘Income’ to the x-axis.

base_overlay <- ggplot(data_sample, aes(x = Income))  # Initialize ggplot and map Income to x-axis
base_overlay  # Display the base plot

Step 2: Choose the Right Kind of Plot

This time, we will add both a histogram and a density plot layer to visualize the distribution of the ‘Income’ variable.

overlay_histogram_density <- base_overlay +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000, color = "black", fill = "white", alpha = 0.5) + # Add histogram layer with a bin width of 5000 and adjusted y aesthetic
  geom_density(fill = "purple", alpha = 0.4)  # Add density layer with purple fill and some transparency
  
overlay_histogram_density  # Display the plot

Step 3: Make it Nice

Finalize the plot by adding appropriate labels and scales.

nice_overlay_histogram_density <- overlay_histogram_density +
  labs(title = "Histogram and Density Plot of Income Distribution",  # Set title
       x = "Income",  # Label for x-axis
       y = "Density") +  # Label for y-axis
  scale_x_continuous(limits = c(0, 200000), breaks = seq(0, 200000, 25000))  # Set x-axis limits and breaks

nice_overlay_histogram_density  # Display the refined plot
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Remember that the area under the density curve equals 1. In an ordinary histogram, the bar heights sum to the total number of observations, so the two would sit on very different scales. Mapping y to after_stat(density), as we did above, rescales the bars so that their total area is also 1, which lets the histogram and the density curve share one y-axis.
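
The two scales can be compared with base R's hist(), which reports both counts and densities for the same bins. This is a sketch on simulated incomes (not the data_sample object above). It uses hist() rather than geom_histogram(), so it illustrates the idea rather than ggplot2's own computation.

```r
# Simulated incomes, for illustration only
set.seed(42)
income <- rnorm(500, mean = 80000, sd = 20000)

h <- hist(income, plot = FALSE)  # Compute the bins without drawing

# Count scale: bar heights add up to the number of observations
total_count <- sum(h$counts)

# Density scale: bar areas (height * width) add up to 1
total_area <- sum(h$density * diff(h$breaks))

c(total_count = total_count, total_area = total_area)
```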

Fun time!

Loading the Data

First, let’s load the data and take a look at its structure:

rw <- read.csv("readwrite2.csv", header = TRUE)  # Load the data from a CSV file
head(rw)  # View the first few rows of the dataset
##     id reading writing level
## 1 3070      18      21     3
## 2 1306      20      15     3
## 3   83      14      14     2
## 4 2486      16      20     4
## 5 1938      14      15     2
## 6  397      20      18     4
summary(rw)  # Get a summary of the dataset
##        id          reading         writing          level      
##  Min.   :   2   Min.   : 2.00   Min.   : 4.00   Min.   :1.000  
##  1st Qu.: 834   1st Qu.:16.00   1st Qu.:14.00   1st Qu.:2.000  
##  Median :1706   Median :18.00   Median :18.00   Median :3.000  
##  Mean   :1686   Mean   :18.34   Mean   :17.04   Mean   :3.082  
##  3rd Qu.:2537   3rd Qu.:21.00   3rd Qu.:21.00   3rd Qu.:4.000  
##  Max.   :3311   Max.   :25.00   Max.   :25.00   Max.   :5.000
table(rw$level)  # Count the number of participants in each group
## 
##   1   2   3   4   5 
## 226 342 524 853  52

Basic Visualization

Visualize any potential relationship between reading and writing scores:

ggplot(rw, aes(x = reading, y = writing)) + 
  geom_point()  # Basic scatter plot

Addressing Overplotting

Overplotting can obscure data. Address it using position_jitter():

ggplot(rw, aes(x = reading, y = writing)) + 
  geom_point(position = position_jitter())  # Scatter plot with jittered points
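
One way to gauge how severe overplotting is: count how many points share exactly the same coordinates. The sketch below does this in base R for a few made-up integer scores (not the rw data above).

```r
# Made-up integer scores, for illustration only
reading <- c(18, 18, 20, 20, 20, 14, 18, 16)
writing <- c(21, 21, 15, 15, 15, 14, 21, 20)

# Points with the same (reading, writing) pair are drawn on top of each other
pairs <- paste(reading, writing)
n_points  <- length(pairs)          # Points in the data
n_visible <- length(unique(pairs))  # Distinct locations on an unjittered plot

# How many points are stacked at each location
stack_sizes <- table(pairs)
stack_sizes
```

When n_visible is much smaller than n_points, jittering is worth considering. position_jitter() also accepts width, height, and seed arguments, which control how far points move and make the random offsets reproducible.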

Adding Regression Line

Visualize the linear relationship between reading and writing scores:

ggplot(rw, aes(x = reading, y = writing)) + 
  geom_point(position = position_jitter()) + 
  geom_smooth(method=lm, se=TRUE, level=0.95)  # Scatter plot with regression line
## `geom_smooth()` using formula = 'y ~ x'

Introducing Colors by Proficiency Level

Colors can differentiate data groups. Let’s use colors to distinguish proficiency levels:

# Initial attempt with level as a numeric variable
ggplot(rw, aes(x = reading, y = writing, color = level)) +
  geom_point(position = position_jitter())  

Notice that continuous variables mapped to color will create a gradient of colors ranging from the minimum to the maximum value of the variable.

# Convert 'level' to a factor variable with meaningful labels
rw$level <- factor(rw$level, labels = c("Novice", "Low-IM", "High-IM", "Advanced", "Superior"))
ggplot(rw, aes(x = reading, y = writing, color = level)) +
  geom_point(position = position_jitter())  # Scatter plot with factor levels colored
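
When only labels is supplied, factor() matches the labels to the sorted unique values present in the data, which works here because all five codes occur. A slightly more defensive sketch, using a short made-up vector (codes is hypothetical, not taken from rw), names the levels as well so that each code is tied to its label even if a code is missing:

```r
# Hypothetical numeric codes (not the rw dataset); note that code 4 never occurs
codes <- c(3, 3, 2, 5, 2, 1)
level_names <- c("Novice", "Low-IM", "High-IM", "Advanced", "Superior")

# Giving both levels and labels fixes the code-to-label mapping explicitly
level_factor <- factor(codes, levels = 1:5, labels = level_names)

levels(level_factor)  # all five labels, in the order given
table(level_factor)   # the unused "Advanced" level is kept with a count of 0
```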

Enhancing Visualization

Enhancements can improve clarity and interpretation. Here we adjust transparency, fit regression lines for each proficiency level, and try alternative color scales (viridis, which is designed to be perceptually uniform, and ColorBrewer):

# Visual enhancement
ggplot(rw, aes(x = reading, y = writing, color = level)) + 
  geom_point(position = position_jitter(), alpha = 0.3) + # alpha: adjust the transparency level of your data points (0: transparent; 1: opaque)
  stat_smooth(method = "lm") # Fits one line per level, because color = level in the global aes() also groups the data
## `geom_smooth()` using formula = 'y ~ x'

# To fit a single overall regression line while keeping the group colors, move the color aesthetic from ggplot() into geom_point()
ggplot(rw, aes(x = reading, y = writing)) + 
  geom_point(aes(color = level), position = position_jitter(), alpha = .3) + 
  stat_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

IMPORTANT:

Notice Simpson’s paradox here: a trend that appears within several groups of data can disappear or reverse when the groups are combined. Compare the per-level lines in the first plot with the single overall line in the second.
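
To make the idea concrete, here is a minimal sketch on constructed data, not on rw. The data frame toy and its three groups are hypothetical and deliberately built so that y falls with x inside every group while the group centers rise together.

```r
library(ggplot2)

# Hypothetical data (not the rw dataset): three groups whose centers rise together
offsets <- seq(-2, 2, by = 0.5)
toy <- do.call(rbind, lapply(c(5, 10, 15), function(center) {
  data.frame(
    group = paste0("center_", center),
    x = center + offsets,
    y = center - 0.5 * offsets  # within a group, y falls as x rises
  )
}))

# Slope of y on x fitted separately within each group
within_slopes <- sapply(split(toy, toy$group), function(d) {
  coef(lm(y ~ x, data = d))[["x"]]
})

# Slope of y on x when the grouping is ignored
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]

within_slopes  # negative in every group
pooled_slope   # positive overall

# Colored lines: per-group fits; black line: pooled fit
ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = group)) +
  stat_smooth(aes(color = group), method = "lm", se = FALSE) +
  stat_smooth(method = "lm", se = FALSE, color = "black")
```

The per-group and pooled fits answer different questions, so neither is wrong; which one matters depends on whether the comparison is within or across groups.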

# Using a viridis color scale
ggplot(rw, aes(x = reading, y = writing, color = level)) + 
  geom_point(position = position_jitter(), alpha = .3, size = 1) + 
  stat_smooth(method = "lm") + 
  scale_color_viridis_d()
## `geom_smooth()` using formula = 'y ~ x'

# Using a brewer color scale
ggplot(rw, aes(x = reading, y = writing, color = level)) + 
  geom_point(position = position_jitter(), alpha = .3, size = 1) + 
  stat_smooth(method = "lm") + 
  scale_color_brewer(palette = "Set1")
## `geom_smooth()` using formula = 'y ~ x'
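
To see which colors these two scales assign to five groups, the underlying palette functions from the scales package can be called directly. This is a small sketch and assumes the scales package (installed alongside ggplot2) still exports viridis_pal() and brewer_pal(). Recent versions also offer them under the names pal_viridis() and pal_brewer().

```r
library(scales)

n_levels <- 5  # one color per proficiency level

# Hex codes from the default viridis palette for five groups
viridis_colors <- viridis_pal()(n_levels)

# Hex codes from the ColorBrewer "Set1" palette for five groups
set1_colors <- brewer_pal(palette = "Set1")(n_levels)

viridis_colors
set1_colors

# Optional: draw the swatches side by side for a visual comparison
show_col(c(viridis_colors, set1_colors), ncol = n_levels)
```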

Adjusting Axis Scales

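Reading and writing scores both top out at 25, so we give the two axes identical limits (0 to 30) and then apply coord_equal() so that one unit on the x-axis is drawn the same length as one unit on the y-axis: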

# Adjusting x and y axis scales
ggplot(rw, aes(x = reading, y = writing, color = level)) + 
  geom_point(position = position_jitter(), alpha = .3, size = 2) + 
  stat_smooth(method = "lm", se = FALSE) + 
  scale_color_viridis_d() + 
  scale_x_continuous(limits = c(0, 30)) + 
  scale_y_continuous(limits = c(0, 30)) 
## `geom_smooth()` using formula = 'y ~ x'

# Use a coordinate system with a 1:1 aspect ratio
ggplot(rw, aes(x = reading, y = writing, color = level)) + 
  geom_point(position = position_jitter(), alpha = .3, size = 2) + 
  stat_smooth(method = "lm", se = FALSE) + 
  scale_color_viridis_d() + 
  scale_x_continuous(limits = c(0, 30)) + 
  scale_y_continuous(limits = c(0, 30)) + 
  coord_equal()
## `geom_smooth()` using formula = 'y ~ x'
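
One detail worth knowing about scale limits: values outside limits are turned into missing values before any statistics are computed, so a fitted line would ignore them. coord_cartesian() only zooms the view instead. The sketch below contrasts the two on a tiny made-up data frame (toy_scores is hypothetical, with one point placed outside the 0 to 30 range on purpose).

```r
library(ggplot2)

# Hypothetical data (not the rw dataset): the last point lies outside the 0-30 range
toy_scores <- data.frame(
  reading = c(5, 10, 15, 20, 35),
  writing = c(6, 9, 16, 19, 40)
)

base <- ggplot(toy_scores, aes(x = reading, y = writing)) +
  geom_point()

# Scale limits: the out-of-range point is set to NA and left out when the plot is drawn
clipped <- base +
  scale_x_continuous(limits = c(0, 30)) +
  scale_y_continuous(limits = c(0, 30))

# Coordinate limits: every point is kept, only the visible window changes
zoomed <- base +
  coord_cartesian(xlim = c(0, 30), ylim = c(0, 30))

# Count the usable x values in each built plot
n_clipped <- sum(!is.na(layer_data(clipped)$x))
n_zoomed <- sum(!is.na(layer_data(zoomed)$x))

n_clipped
n_zoomed
```

In the plots above the limits are wider than the observed scores, so the two approaches should give the same picture there. The difference matters only when limits cut into the data.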

Polishing the Final Visualization

The final touches make our visualization publication-ready:

ggplot(rw, aes(x = reading, y = writing, color = level)) + 
  geom_point(position = position_jitter(), alpha = .3, size = 2) + 
  stat_smooth(method = "lm", se = FALSE) + 
  scale_color_viridis_d() + 
  scale_x_continuous(limits = c(0, 30)) + 
  scale_y_continuous(limits = c(0, 30)) + 
  coord_equal() +
  labs(x = "\nReading score", 
       y = "Writing score\n", 
       title = "Relationship between reading and \nwriting scores by proficiency level", 
       color = "Proficiency level") + 
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Creating Subplots

Subplots, or facets, allow for comparisons across groups:

ggplot(rw, aes(x = reading, y = writing)) + 
  geom_point(position = position_jitter(), alpha = .1) + 
  stat_smooth(method = "lm") + 
  coord_equal() + 
  facet_wrap(~ level, ncol = 5) + # One panel per proficiency level, all in a single row
  labs(title = "Relationship between reading and writing scores by proficiency level", 
       x = "\nReading score", 
       y = "Writing score\n") + 
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
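
As a small variation, the panel strips can show the variable name next to each value, which helps when a plot is read out of context. The sketch below uses a made-up data frame (toy and its three level values are hypothetical) together with ggplot2's label_both labeller.

```r
library(ggplot2)

# Hypothetical data (not the rw dataset): three groups of five points each
toy <- data.frame(
  level = factor(rep(c("Low", "Mid", "High"), each = 5),
                 levels = c("Low", "Mid", "High")),
  reading = rep(1:5, times = 3),
  writing = c(1:5, 2:6, 3:7)
)

# One panel per level in a single row; strip text reads "level: Low" and so on
faceted <- ggplot(toy, aes(x = reading, y = writing)) +
  geom_point() +
  facet_wrap(~ level, nrow = 1, labeller = label_both) +
  theme_bw()

faceted

# Number of panels in the built plot (one per factor level)
n_panels <- length(unique(layer_data(faceted)$PANEL))
n_panels
```

Panels follow the order of the factor levels, which is why level is created with an explicit levels argument here.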