L01 Visualization

Data Science 1 with R (STAT 301-1)

Author

Lena Parnassa

Published

September 26, 2023

Github Repo Link

https://github.com/stat301-1-2023-fall/L01-visualization-lena-parnassa

Load packages

Always load all necessary packages near the beginning of your document.

# Loading package(s)
library(tidyverse)

Datasets

This lab utilizes the mpg and diamonds datasets. Both come with ggplot2, and their documentation/codebooks can be accessed with ?mpg and ?diamonds, provided you have installed ggplot2 and loaded it into your current R session. The case study utilizes tinder_data.csv located in your data folder. The following lines of code read in the data.

data(mpg)
data(diamonds)

# read in data
tinder_data <- read_csv("data/tinder_data.csv")
codebook <- read_csv("data/tinder_data_codebook.csv")

Exercises

Exercise 1

There are 3 particularly important components to our template for building a graphic with ggplot2. They are <DATA>, <GEOM_FUNCTION>, and <MAPPINGS>. The importance of <DATA> is obvious. <GEOM_FUNCTION> is referring to the selection of a geom. <MAPPINGS>, specifically aes(<MAPPINGS>), is referring to the process of defining aesthetic mappings.

What is a geom?
What is an aesthetic mapping?

Solution

Geom: A geom (short for geometric object) is the graphical element a plot uses to represent data, e.g., points, lines, or bars. Each geom understands its own set of aesthetics, which determine how the data is drawn. Examples of geoms include geom_point() for scatter plots, geom_line() for line plots, geom_bar() for bar plots, and so on.

Aesthetic Mapping: An aesthetic mapping in ggplot2 associates variables from your dataset with visual properties of the geoms. Data variables, numerical or categorical, are mapped onto specific visual properties such as position, color, size, and shape. These mappings are specified within the aes() function. For example, you might map variables to the x-axis position, the y-axis position, the color, and the size of points in a scatter plot using aes(x = variable1, y = variable2, color = variable3, size = variable4). Because the appearance of each point is then driven by the data, a single plot can display several variables at once.
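As a rough illustration of how the three template components fit together, the sketch below fills them in with the mpg dataset. The choice of displ, hwy, and class is only an example and is not part of the exercise.

```r
library(ggplot2)

# <DATA> is mpg, <GEOM_FUNCTION> is geom_point(), <MAPPINGS> go inside aes()
p_template <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

p_template
```

Swapping geom_point() for another geom changes how the same mapped variables are drawn, while changing aes() changes which variables drive each visual property.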

Exercise 2

Construct a scatterplot of hwy versus cty using the mpg dataset. What is the problem with this plot? How could you improve it?

Solution

mpg |>
  ggplot(mapping = aes(x = cty, y = hwy)) +
  geom_point()

The plot definitely shows a relationship between the two variables, but the axes have no units or descriptive labels, so it is hard to understand what the variables being compared actually are. Adding informative axis labels with labs() would improve it.
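One possible improvement is sketched below. It adds axis labels with units (miles per gallon, per ?mpg). It also assumes a second issue is worth addressing: cty and hwy are stored as integers, so rows that share the same pair of values are drawn on top of each other. The jitter amounts and the alpha value are arbitrary choices for illustration.

```r
library(ggplot2)

# Count rows whose (cty, hwy) pair already appeared in an earlier row;
# these points are hidden behind other points in a plain scatterplot
n_overlapping <- sum(duplicated(mpg[, c("cty", "hwy")]))

# geom_jitter() adds a small random offset so overlapping points become visible
p_improved <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.6) +
  labs(x = "City fuel economy (miles per gallon)",
       y = "Highway fuel economy (miles per gallon)")

p_improved
```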

Exercise 3

Construct a scatterplot of hwy versus cty. Set the color of the points to drv.

Now construct a scatterplot of hwy versus cty. Set the color of the points to your favorite color (try typing colors() in the console) and facet by drv. Read ?facet_wrap and adjust the ncol and scales as necessary.

Solution

mpg |>
  ggplot(mapping = aes(x = cty, y = hwy, color = drv)) +
  geom_point() +
  labs(x = "City MPG", y = "Highway MPG", color = "Drive Type")

# a fixed color is set outside aes(); labs() must be joined to the plot with +
mpg |>
  ggplot(mapping = aes(x = cty, y = hwy)) +
  geom_point(color = "#462d86") +
  facet_wrap(~ drv, ncol = 2, scales = "free") +
  labs(x = "City MPG", y = "Highway MPG")

How do the aesthetics behave differently for categorical versus numerical variables? In other words, which variable types (numeric/categorical) are appropriate to match to which aesthetics (size/shape/color)?

Solution

Color is versatile and can be used with both categorical and numerical variables, but it behaves differently for each: a categorical variable gets a set of distinct colors, one per level, while a numerical variable gets a continuous color gradient. Shape is suitable only for categorical variables, preferably those with few levels, since only a limited number of shapes can be told apart; ggplot2 does not allow a continuous variable to be mapped to shape. Size is effective for representing magnitude, so it works best with numerical variables. To choose the appropriate aesthetics, you need to consider the nature of your data and the story you want to convey through your visualization.
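The difference in how color behaves can be checked with a small sketch. It compares mapping color to the categorical drv and to the numerical displ, then counts the distinct colors ggplot2 actually assigns. The variable names p_categorical and p_numeric are just illustrative.

```r
library(ggplot2)

# Categorical variable: one distinct color per level of drv
p_categorical <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point()

# Numerical variable: colors are taken from a continuous gradient
p_numeric <- ggplot(mpg, aes(x = cty, y = hwy, color = displ)) +
  geom_point()

# layer_data() returns the computed aesthetics, including the color of each point
n_colors_categorical <- length(unique(layer_data(p_categorical)$colour))
n_colors_numeric <- length(unique(layer_data(p_numeric)$colour))
```

Here n_colors_categorical equals the number of drv levels, while n_colors_numeric is much larger because nearby displ values receive slightly different shades.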

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Solution

Faceting allows you to separate different subsets of your data into distinct panels, making it easier to compare and analyze individual subsets of the data. This is useful when you have categorical variables that naturally break down your data into meaningful groups. It can improve the readability of complex or multivariate data, as each facet focuses on the relationships within one subset. However, if the faceting variable has a large number of levels, faceting produces many panels that are each too small to show detail.

Alternatively, using the color aesthetic allows you to convey multiple variables within a single plot, which can make it easier to compare the levels of a variable directly, especially when there are only a few distinct categories. It also saves space and reduces clutter. However, there are downsides to the color aesthetic. Certain color sets make it hard for viewers with color blindness to read the graph accurately. Similarly, with a large dataset or a variable with many levels, it can be challenging to find enough distinct colors to represent each category effectively, and overlapping points from different groups can hide one another. This may lead to confusion or misinterpretation.

Because of this, faceting is usually better for larger datasets, especially when you have multiple categorical variables or a large number of levels within a categorical variable. It allows for a structured and systematic exploration of the data in smaller, manageable subsets.

Exercise 4

Construct a scatterplot of hwy vs cty. Next, map a third numerical variable to color, then size, then shape.

Solution

mpg |>
  ggplot(mapping = aes(x = cty, y = hwy, color = displ)) +
  geom_point() +
  labs(x = "City MPG", y = "Highway MPG", color = "Displacement")
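The exercise also asks for size and shape. A minimal sketch of those two mappings, again using displ as the third numerical variable, might look like the following. The alpha value and the tryCatch() wrapper are illustrative additions; the wrapper only captures the error so that the document still renders.

```r
library(ggplot2)

# Size works with a numerical variable: larger displ gives larger points
p_size <- ggplot(mpg, aes(x = cty, y = hwy, size = displ)) +
  geom_point(alpha = 0.5) +
  labs(x = "City MPG", y = "Highway MPG", size = "Displacement")

p_size

# Shape does not: ggplot2 refuses to map a continuous variable to shape,
# and the error is raised when the plot is built
p_shape <- ggplot(mpg, aes(x = cty, y = hwy, shape = displ)) +
  geom_point()

shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)

shape_result
```

Printing shape_result shows the error message from ggplot2 rather than a plot, which suggests that shape should be reserved for categorical variables.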

Exercise 5

Construct a histogram of the carat variable in the diamonds dataset. Adjust the bins to an appropriate value. Add a title, remove the axis label that says count, and add a caption: “Source: ggplot2 package”.

Solution

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(x = "Carat",
    y = NULL,
    title = "Carat in Diamonds",
    caption = "Source: ggplot2 package")
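Since bins = 30 is also the default that geom_histogram() falls back on, an alternative is to set the bin width directly. The sketch below assumes a width of 0.1 carat is a reasonable choice; that value is an illustration, not a recommendation from the lab.

```r
library(ggplot2)

# binwidth sets the width of each bin in data units (carats);
# boundary = 0 aligns the bin edges to multiples of the bin width
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, boundary = 0,
                 fill = "blue", color = "black") +
  labs(x = "Carat",
       y = NULL,
       title = "Carat in Diamonds",
       caption = "Source: ggplot2 package")

p_hist

# Width of every bin as computed by ggplot2
bin_widths <- with(layer_data(p_hist), xmax - xmin)
```

Narrower bins tend to reveal more of the fine structure in the distribution of carat, at the cost of a noisier plot.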

Exercise 6

Construct 2 appropriate graphics to show the relationship between carat and cut.

Solution

ggplot(diamonds, aes(x = cut, y = carat, fill = cut)) +
  geom_boxplot() +
  labs(x = "Cut",
       y = "Carat",
       title = "Box Plot of Carat by Cut",
       fill = "Cut")

ggplot(diamonds, aes(x = cut, y = carat, fill = cut)) +
  geom_violin() +
  labs(x = "Cut",
       y = "Carat",
       title = "Violin Plot of Carat by Cut",
       fill = "Cut")

Exercise 7

Construct a barplot of cut. Add in aes(fill = carat > 0.7).

Solution

ggplot(diamonds, aes(x = cut, fill = carat > 0.7)) +
  geom_bar() +
  labs(x = "Cut",
       y = "Count",
       title = "Barplot of Cut with Fill based on Carat > 0.7",
       fill = "Carat > 0.7")

Exercise 8

When would you use facet_grid() versus facet_wrap()? When using facet_grid() it is suggested that you put the variable with more unique levels in the columns. Why do you think that this practice is suggested?

Solution

You would use facet_grid() to create a grid of plots based on two categorical variables: one variable defines the rows and the other defines the columns. It’s useful for comparing and visualizing interactions across subsets.

facet_wrap() creates one panel for each level of a single categorical variable and wraps that one-dimensional sequence of panels into rows and columns. It’s the better choice when you have one categorical variable with many levels, or when the levels don’t fit naturally into a grid layout.

In facet_grid(), placing the variable with more unique levels in the columns generally improves readability. Most screens and pages are wider than they are tall, so there is more room to spread many panels horizontally than to stack them vertically. The variable with fewer unique levels then goes in the rows, which keeps each panel tall enough to read and makes the panels easier to compare.
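The two functions can be compared on the mpg dataset. The sketch below is only an illustration; the choice of drv, cyl, and class as faceting variables is an assumption made for the example.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Two categorical variables: drv (fewer levels) in the rows,
# cyl (more levels) in the columns
p_grid <- base_plot + facet_grid(drv ~ cyl)

# One categorical variable with more levels: class, wrapped into 4 columns
p_wrap <- base_plot + facet_wrap(~ class, ncol = 4)

p_grid
p_wrap
```

In the grid version, combinations of drv and cyl that do not occur in the data still take up an empty panel, whereas facet_wrap() only draws panels for levels that exist.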

Case Study

Congratulations, you just landed your first job as a data analyst for Tinder! The dataset, tinder_data.csv, is stored in the data folder. A codebook, tinder_data_codebook.csv, provides a description of each of the variable names. Both have been read in for you at the top of the document. We will learn more about importing data later in the quarter.

Your first assignment is to determine if there is a relationship between messages sent and messages received and how this differs based on user gender. Your boss has asked for a one paragraph summary with graphics to support your conclusions. Your boss wants all graphics saved into a folder named “plots”. Hint: ggsave().

Since this is your first project as a data analyst you have been provided some tips and considerations for getting started:

When approaching a research question it is important to use univariate, bivariate, and multivariate analysis (depending on the problem) to get a better understanding of your data and also identify any potential problems.
How might the distribution of your variables impact your conclusions? Outliers? Weird values? Imbalanced classes?
How might coord_fixed() and geom_abline() improve a graphic?
Feel free to be creative! It is your job to answer this question and interpret conclusions in the most appropriate ways you see fit.

This dataset was provided by Swipestats.io.

Solution

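# create the output folder if it does not already exist
dir.create("plots", showWarnings = FALSE)

ggplot(tinder_data, aes(x = messages_sent, y = messages_received)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", linetype = "dashed") +
  labs(title = "Relationship between Messages Sent and Messages Received",
       x = "Messages Sent",
       y = "Messages Received")

# ggsave() saves the most recently displayed plot by default
ggsave("plots/messages_relationship.png")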

ggplot(tinder_data, aes(x = messages_sent, y = messages_received)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", linetype = "dashed") +
  facet_grid(. ~ user_gender) +
  labs(title = "Relationship between Messages Sent and Messages Received By Gender",
       x = "Messages Sent",
       y = "Messages Received")

ggsave("plots/messages_sent_vs_received_by_gender.png")

Based on the analysis of Tinder user data, we observed a positive relationship between messages sent and messages received. Both male and female users tend to receive more messages as they send more. However, there are outliers in both messages sent and messages received, indicating potential variation in user engagement. Further analysis is needed to explore this variation in detail and to derive actionable insights for enhancing the user experience on the platform.
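The tip about coord_fixed() and geom_abline() could be explored with a sketch like the one below. The helper name plot_sent_vs_received() is hypothetical and not part of the lab; it assumes the data frame has the columns messages_sent, messages_received, and user_gender used in the plots above. With equal axis scaling, the line y = x separates users who receive more messages than they send (above the line) from those who send more than they receive (below it).

```r
library(ggplot2)

# Hypothetical helper (not part of the original lab): scatterplot of
# messages received against messages sent, one panel per gender
plot_sent_vs_received <- function(data) {
  ggplot(data, aes(x = messages_sent, y = messages_received)) +
    geom_point(alpha = 0.5) +
    # reference line y = x: sent equals received
    geom_abline(slope = 1, intercept = 0,
                linetype = "dashed", color = "red") +
    # one unit on the x-axis has the same length as one unit on the y-axis
    coord_fixed() +
    facet_wrap(~ user_gender) +
    labs(x = "Messages Sent", y = "Messages Received")
}

# Usage, assuming tinder_data has been read in as at the top of the document:
# plot_sent_vs_received(tinder_data)
# ggsave("plots/messages_sent_vs_received_reference_line.png")
```

The output file name in the commented usage is also just an example. If the outliers compress most points into one corner, filtering or transforming the axes before plotting may be worth trying.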