library(datasetsICR)
library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
library(ggrepel)
options(scipen=999)
DAT-4313 - LAB 8
REFINE YOUR PLOTS
Plot 1:
library(titanic)
library(dplyr)
titanic <- bind_rows(titanic_train, titanic_test)
ggplot(titanic, aes(x=Age, y=Fare)) +
geom_point(aes(color = factor(Pclass)), size=3, alpha=0.7) +
labs(x = "Age", y="Fare",
title = "Scatter Plot of Age vs. Fare",
caption = "Source: Titanic dataset",
color = "Passenger Class") +
scale_color_manual(values=c("#1f78b4", "#33a02c", "#e31a1c"))
Refinement 1: Add Trend Lines and an Annotation
library(ggplot2)
# Create the scatter plot
p <- ggplot(titanic, aes(x = Age, y = Fare)) +
geom_point(aes(color = factor(Pclass)), size = 3, alpha = 0.7) +
geom_smooth(aes(color = factor(Pclass)), method = "lm", se = FALSE) +
labs(x = "Age", y = "Fare",
title = "Scatter Plot of Age vs. Fare with Trend Lines",
caption = "Source: Titanic dataset",
color = "Passenger Class") +
scale_color_manual(values = c("#1f78b4", "#33a02c", "#e31a1c"))
# Add a bold text annotation in the top-right corner of the plot
p + annotate("text", x = Inf, y = Inf, label = "First-class passengers\ntend to pay more", hjust = 1, vjust = 1, size = 4, color = "#1f78b4", fontface = "bold")
Observations
Intent:
The intent of this refinement is to make the scatter plot easier to interpret by adding a linear trend line for each passenger class, showing how fare relates to age within each class. An annotation also highlights that first-class passengers tend to pay more, particularly at higher ages.
Rationale:
Trend Lines: geom_smooth(method = "lm") fits a separate linear model of fare on age for each passenger class, so the general direction of the relationship within each class is visible at a glance. Setting se = FALSE suppresses the confidence bands and keeps the plot uncluttered.
Annotation: A text annotation in the top-right corner states the main takeaway, that first-class passengers tend to pay more, especially at higher ages. Viewers can grasp the key finding without studying every point.
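To read the trend lines as numbers rather than by eye, the per-class fits can be reproduced with lm() directly. The following is a minimal sketch in base R. The helper slope_by_class() and the toy data frame are hypothetical, and the toy values are invented for illustration only; they are not Titanic records.

```r
# Hypothetical helper: fit Fare ~ Age separately for each passenger class
# and return the Age slope of each fit. This mirrors the per-group fit that
# geom_smooth(method = "lm") draws, but it is not ggplot2's own code.
slope_by_class <- function(df) {
  # lm() drops rows with a missing Age or Fare by default (na.omit)
  fits <- lapply(split(df, df$Pclass), function(d) lm(Fare ~ Age, data = d))
  sapply(fits, function(fit) coef(fit)[["Age"]])
}

# Toy data (invented values, NOT real Titanic records)
toy <- data.frame(
  Pclass = c(1, 1, 1, 3, 3, 3),
  Age    = c(20, 40, 60, 20, 40, 60),
  Fare   = c(60, 80, 100, 10, 10, 10)
)

slopes <- slope_by_class(toy)
print(slopes)
```

With the lab's data the call would be slope_by_class(titanic). A positive slope means fares rise with age within that class.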
Refinement 2: Add Data labels and Scale Data
library(scales)
titanic_filtered <- titanic %>% filter(!is.na(Fare))
color_palette <- c("#1f78b4", "#33a02c", "#e31a1c")
ggplot(titanic_filtered, aes(x = Age, y = Fare)) +
geom_point(aes(color = factor(Pclass), shape = ifelse(Fare > 500, "High Fare", ifelse(Fare > 200, "Medium Fare", "Normal Fare"))), size = 3, alpha = 0.7) +
geom_text_repel(data = subset(titanic_filtered, Fare > 500), aes(label = PassengerId), size = 3) +
labs(x = "Age (years)", y = "Fare (GBP)",
title = "Scatter Plot of Age vs. Fare",
caption = "Source: Titanic dataset",
color = "Passenger Class", shape = "Fare Category") +
scale_color_manual(values = color_palette) +
scale_shape_manual(values = c("High Fare" = 17, "Medium Fare" = 15, "Normal Fare" = 19)) +
scale_y_log10()
Observations
Intent:
The intent of this refinement is to improve the readability and interpretability of the scatter plot by incorporating logarithmic scaling for the fare axis and providing a more detailed legend for fare categories. Additionally, text labels are added to highlight passengers with high fares, enhancing the viewer’s understanding of the data distribution.
Rationale:
Logarithmic Scaling: By applying logarithmic scaling to the fare axis using scale_y_log10(), the plot effectively handles the wide range of fare values present in the dataset. This transformation compresses the scale, making it easier to visualize variations in fare across different age groups while still retaining detail for higher fare values. As a result, viewers can better perceive patterns and differences in fare distribution without the dominance of extremely high fares.
Detailed Legend: The shape legend distinguishes three fare categories: “High Fare” (above 500), “Medium Fare” (above 200 and up to 500), and “Normal Fare” (200 or below). This shows viewers how fares are grouped by value, so they can tell the fare ranges apart and interpret them in the context of the plot.
Text Labels: Text labels are added to highlight passengers with high fares, displaying their corresponding Passenger IDs next to the data points. This addition helps draw attention to outliers or notable observations within the dataset, enabling viewers to identify specific instances of interest and potentially investigate them further.
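The log scaling and the fare categories described above can be checked in isolation with base R. The sketch below uses a few illustrative fare values (assumed for this example, not pulled from the dataset). It builds the three categories with cut() as one possible alternative to the nested ifelse(). The name fare_category is hypothetical.

```r
# Illustrative fare values (assumptions for this example)
fare <- c(0, 8, 80, 250, 512)

# log10 compresses large values: 8 -> ~0.9, 80 -> ~1.9, 512 -> ~2.7.
# A fare of exactly 0 becomes -Inf, which is why ggplot2 typically warns
# about infinite values when a log scale meets zero fares.
log_fare <- log10(fare)

# Same thresholds as the nested ifelse(): > 500 high, > 200 medium, else normal.
# cut() uses right-closed intervals by default, so 200 stays "Normal Fare"
# and 500 stays "Medium Fare".
fare_category <- cut(
  fare,
  breaks = c(-Inf, 200, 500, Inf),
  labels = c("Normal Fare", "Medium Fare", "High Fare")
)

print(data.frame(fare, log_fare, fare_category))
```

cut() returns a factor whose levels follow the order of the labels vector, so a legend built from it lists the categories from Normal to High instead of alphabetically.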
Plot 2:
data(customers)
p <- ggplot(data = customers, mapping = aes(x = Fresh, y = Grocery, color = factor(Channel)))
p + geom_point(size = 2) +
scale_x_log10(labels = scales::dollar) +
scale_y_log10(labels = scales::dollar) +
labs(x = "Fresh", y = "Grocery",
title = "Relationship between Fresh and Grocery Variables",
subtitle = "Higher spending on Fresh products tends to correlate with higher spending on Grocery products",
caption = "Source: Customers dataset from datasetsICR package",
color = "Channel") +
theme_minimal() +
theme(legend.position = "right")
Refinement 1: Enhance color usage
p1 <- p + geom_point(size = 2) +
scale_x_log10(labels = scales::dollar) +
scale_y_log10(labels = scales::dollar) +
labs(x = "Fresh", y = "Grocery",
title = "Relationship between Fresh and Grocery Variables",
subtitle = "Higher spending on Fresh products tends to correlate with higher spending on Grocery products",
caption = "Source: Customers dataset from datasetsICR package",
color = "Channel") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "top")
p1
Observations
Intent:
The intent of this refinement is to improve the visual aesthetics and clarity of the plot by adjusting the color palette and legend position. By utilizing a different color palette and relocating the legend to the top of the plot, viewers can better perceive the relationship between the Fresh and Grocery variables and interpret the data more effectively.
Rationale:
Color Palette Adjustment: The color palette is changed to "Set1" from the RColorBrewer package using scale_color_brewer(palette = "Set1"). This adjustment enhances the differentiation between data points representing different channels by providing a distinct and visually appealing set of colors. The new palette ensures that each channel is clearly identifiable, making it easier for viewers to understand the distribution of data points across the plot.
Legend Positioning: The legend is repositioned to the top of the plot using theme(legend.position = "top"). By relocating the legend, the plot layout is optimized to allocate more space for the main visualization area. This adjustment reduces clutter and improves the overall readability of the plot, allowing viewers to focus on the data without obstruction. Placing the legend at the top also ensures that it remains easily accessible and visible, enabling viewers to quickly reference the color coding for different channels while interpreting the plot.
Refinement 2: Use different colors and change theme features
p1 <- p + geom_point(size = 2) +
scale_x_log10(labels = scales::dollar) +
scale_y_log10(labels = scales::dollar) +
labs(x = "Fresh", y = "Grocery",
title = "Relationship between Fresh and Grocery Variables",
subtitle = "Higher spending on Fresh products tends to correlate with higher\nspending on Grocery products",
caption = "Source: Customers dataset from datasetsICR package",
color = "Channel") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "top",
plot.background = element_rect(fill = "lightgray"),
plot.title = element_text(color = "darkblue", face = "bold"),
axis.text = element_text(color = "darkblue"),
legend.title = element_text(color = "darkred"),
plot.subtitle = element_text(color = "darkred"),
plot.caption = element_text(color = "darkblue", face = "bold")) +
annotate("point", x = max(customers$Fresh), y = max(customers$Grocery), color = "black", size = 3)
p1
Observations
Intent:
The intent of this refinement is to make the plot more attractive and to use a color palette that is easy for viewers to read. The Set1 palette was used because it draws attention to the relationship between Grocery and Fresh spending. The background color helps the plotted data stand out, and styling the title, subtitle, legend, and caption with the same small set of colors keeps the plot readable without overwhelming viewers with too many colors.
Rationale:
Color Palette: Set1 was chosen for clear differentiation between customer segments represented by the “Channel” variable. This choice ensures each group stands out distinctly while maintaining visual harmony.
Background Color: Light gray creates contrast with data points, making them more prominent and easier to focus on, enhancing readability and drawing attention to variable relationships.
Title Emphasis: Making the plot title bold and using dark blue helps emphasize the main message, aiding quick comprehension and capturing viewer attention effectively.
Consistent Color Scheme: Consistency in color scheme across plot elements such as title, subtitle, legend, and caption ensures coherence and facilitates interpretation, enabling viewers to associate different parts of the plot with relevant information easily.
Legend Positioning: Placing the legend at the top allows easy reference without obstructing data points, ensuring viewers quickly identify color meanings without distraction.
Axis Text Color: Setting the axis tick labels (axis.text) to dark blue keeps them visible and legible against the light gray background, which helps viewers read values off the axes accurately.
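Because the consistent color scheme depends on repeating the same settings in every plot, one option is to collect them in a single theme object. This is a minimal sketch, assuming ggplot2 is installed. The object name lab8_theme is hypothetical, its settings are copied from the theme() call in Refinement 2, and mtcars is only a stand-in dataset for the demonstration.

```r
library(ggplot2)

# Hypothetical reusable theme: the same settings as the theme() call in
# Refinement 2, collected into one object so every plot shares them.
lab8_theme <- theme_minimal() +
  theme(
    legend.position = "top",
    plot.background = element_rect(fill = "lightgray"),
    plot.title      = element_text(color = "darkblue", face = "bold"),
    axis.text       = element_text(color = "darkblue"),
    legend.title    = element_text(color = "darkred"),
    plot.subtitle   = element_text(color = "darkred"),
    plot.caption    = element_text(color = "darkblue", face = "bold")
  )

# Usage on a built-in dataset (mtcars is only a stand-in for customers)
demo_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  labs(title = "Demo of the shared theme", color = "Cylinders") +
  lab8_theme
```

Adding lab8_theme to any plot applies the whole scheme at once, so a later change to a color only needs to be made in one place.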