Data visualization

The choice of your visualization tool is determined by the type (level of measurement) of variables you want to describe visually. Remember, we broadly divide variables into categorical and continuous.

  • for example, bar charts are only appropriate for categorical variables.

Data visualizations come in many forms, such as bar charts and scatter plots, but no matter how you present your data, your figures have to be appropriately labeled.

When visualizing our data, we sometimes plot a single variable (univariate plots), but we are often interested in plotting two (or three) variables at the same time (multivariate plots).

Another thing to be aware of before doing data visualization in R is that we often need to do some data preparation in advance of creating plots.

Level of measurement?

Before graphing your data, the first thing you need to know is the level of measurement of the variables you will plot.

There are many different data types (R speak for level of measurement) in R, but for simplicity, we will focus only on the four types below:

  • categorical variables with more than two levels should be stored as factors.
  • categorical variables with two values (dummy variables) are logicals.
  • continuous variables are of the type int (integers) or num (numeric).

You can check the level of measurement of a variable by using the class function.

  • let’s examine data types for gss variables age, marital status, and gender.
## [1] "integer"
## [1] "factor"
## [1] "integer"
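The check above can be sketched as follows. Because the gss data is not bundled with R, this sketch uses a tiny made-up stand-in called gss_demo (a hypothetical name, with invented values); the column names age, marital, and female follow the text.

```r
# Hypothetical stand-in for the gss data (values are made up for illustration)
gss_demo <- data.frame(
  age     = c(25L, 40L, 63L, 31L),
  marital = factor(c("married", "divorced", "widowed", "never married")),
  female  = c(1L, 0L, 1L, 0L)
)

# class() reports how each variable is stored
class(gss_demo$age)      # "integer"
class(gss_demo$marital)  # "factor"
class(gss_demo$female)   # "integer"
```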

If the variable is not the appropriate type, you can change its type with one of the as.<type> conversion functions, such as as.logical, as.factor, or as.numeric.

  • for example, let’s change the type for female from integer to logical.
## [1] "integer"

Now let’s make sure that the data type change worked:

## [1] "logical"

It worked!
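A minimal sketch of this conversion, assuming female is coded 1 for women and 0 for men (the coding is an assumption, and gss_demo is a made-up stand-in for gss):

```r
# Hypothetical stand-in: female stored as an integer dummy variable
gss_demo <- data.frame(female = c(1L, 0L, 1L, 0L))
class(gss_demo$female)  # "integer"

# as.logical() turns 1 into TRUE and 0 into FALSE
gss_demo$female <- as.logical(gss_demo$female)
class(gss_demo$female)  # "logical"
```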

Exercise

Use the mtcars data that comes preloaded with R. What is the level of measurement of the variables mpg and wt?

ggplot2 package

This is one of the most versatile and effective R packages for data visualization.

  • let’s install and load the package into R.
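Installing and loading typically looks like the sketch below. The install line is commented out because it only needs to run once per machine.

```r
# Install once per machine (uncomment if ggplot2 is not installed yet)
# install.packages("ggplot2")

# Load the package at the start of every new R session
library(ggplot2)
```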

Here is how ggplot2 works, in the most basic terms.

There are different chart types, depending on what kind of plot you want to make.

  • for example, to create a bar chart, you need to specify geom_bar().

Chart types start with the prefix geom_ and the part after the underscore indicates the type of plot you want to make (like bar for bar chart or histogram for histogram).
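As a rough illustration of this pattern, the sketch below uses diamonds, an example dataset that ships with ggplot2 (it is not part of this tutorial's data). You name the dataset, map a variable inside aes(), and then add a geom_ layer for the chart type.

```r
library(ggplot2)

# diamonds comes with ggplot2; cut is a categorical variable with 5 levels
p <- ggplot(data = diamonds, mapping = aes(x = cut)) +  # dataset + variable
  geom_bar()                                             # chart type: bar chart
print(p)

# Swapping the geom_ layer changes the chart type, for example:
# ggplot(diamonds, aes(x = price)) + geom_histogram()
```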

Bar chart

Categorical variables are typically visually represented with a bar chart (also known as a bar plot).

  • bar charts show the number (or proportion) of cases in each category of a categorical variable.

Single variable

Let’s use a bar chart to visualize the distribution of the variable that describes marital status (the name of the variable is marital).

  • first, we need to check how the variable is stored in R.
## [1] "factor"

Now that we know that the variable is the correct type (factor), we can use the ggplot2 package to create a simple bar chart for one variable.

But before that let’s examine how many people are in each category of the marital status variable.

## 
##      divorced       married never married     separated       widowed 
##           270           811           493            56           116

What is the modal category based on the output above? It is the category married. Remember, modal category is the one with the most cases.

Let’s now create a bar chart for the variable marital status. (In the line of code below, where do you see the dataset and the variable names?)
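Since gss itself is not available here, the sketch below rebuilds the marital variable from the counts printed above and stores it in a stand-in data frame called gss_demo (a hypothetical name). With the real data you would put gss in its place.

```r
library(ggplot2)

# Stand-in for gss: marital status rebuilt from the counts in the table above
gss_demo <- data.frame(
  marital = factor(rep(
    c("divorced", "married", "never married", "separated", "widowed"),
    times = c(270, 811, 493, 56, 116)
  ))
)

# The dataset goes first; the variable goes inside aes()
p <- ggplot(gss_demo, aes(x = marital)) + geom_bar()
print(p)
```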

You can see above that the height of the bars corresponds to the number of cases in each category. The bar is the highest for the category married because most people in this sample are married.

We can also change the y-axis (the vertical axis) to show the proportion of cases within each category instead of the count of cases. This is sometimes more informative. You can do this with the code below.
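One way to do this, sketched here on a stand-in data frame rebuilt from the marital counts shown earlier (gss_demo is a hypothetical name), is to map y to the computed proportion. The group = 1 part tells ggplot2 to compute proportions across all bars together rather than within each bar.

```r
library(ggplot2)

# Stand-in for gss, rebuilt from the marital status counts shown earlier
gss_demo <- data.frame(
  marital = factor(rep(
    c("divorced", "married", "never married", "separated", "widowed"),
    times = c(270, 811, 493, 56, 116)
  ))
)

# after_stat(prop) replaces the count with the share of all cases
p_prop <- ggplot(gss_demo, aes(x = marital)) +
  geom_bar(aes(y = after_stat(prop), group = 1)) +
  ylab("proportion")
print(p_prop)
```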

Exercise

Use the mtcars dataset and create a bar chart for the variable that describes the number of gears (gear variable). What is the modal category?

Two variables

Now, say we want to show how many people there are within each category of marital status who have ever been incarcerated (lockedup).

  • we will do that by filling in the bars for each category of marital status with different color depending on how many people have been incarcerated (or not).
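A sketch of a filled (stacked) bar chart follows. The data here is simulated, and lockedup is assumed to be stored as a factor with levels "no" and "yes", so the counts in this made-up sample will not match the figure discussed in the text.

```r
library(ggplot2)

# Simulated stand-in for gss (all values are made up for illustration)
set.seed(1)
gss_demo <- data.frame(
  marital = factor(sample(
    c("divorced", "married", "never married", "separated", "widowed"),
    size = 300, replace = TRUE
  )),
  lockedup = factor(sample(c("no", "yes"), size = 300, replace = TRUE,
                           prob = c(0.85, 0.15)))
)

# fill = lockedup colors each bar by incarceration status
p <- ggplot(gss_demo, aes(x = marital, fill = lockedup)) + geom_bar()
print(p)
```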

As you can see in the bar chart above, about 80 married people have been incarcerated compared to about 100 among those who have never been married. In terms of proportion (or percentage), what category has the most people who have been incarcerated?

Exercise

Create a chart that shows how many people have been incarcerated (lockedup) within each category of the arrest variable. Approximately how many people who have been arrested have ended up being incarcerated?

Side by side bar chart

We can also show the same bar chart as above by using the side-by-side layout.
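In ggplot2 the side-by-side layout is requested with position = "dodge". The sketch below again uses simulated stand-in data (gss_demo is a hypothetical name, and lockedup is assumed to be a factor).

```r
library(ggplot2)

# Simulated stand-in for gss (all values are made up for illustration)
set.seed(1)
gss_demo <- data.frame(
  marital = factor(sample(
    c("divorced", "married", "never married", "separated", "widowed"),
    size = 300, replace = TRUE
  )),
  lockedup = factor(sample(c("no", "yes"), size = 300, replace = TRUE,
                           prob = c(0.85, 0.15)))
)

# position = "dodge" places the bars next to each other instead of stacking
p_dodge <- ggplot(gss_demo, aes(x = marital, fill = lockedup)) +
  geom_bar(position = "dodge")
print(p_dodge)
```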

Box plot

Box plots are used to describe the spread of a continuous variable with only five numbers: minimum, first quartile, median, third quartile, and maximum.

  • here’s an example of a box plot showing the spread of the variable age in the gss dataset. The line in the middle of the box is the median. The lower and upper edges of the box are the first and third quartiles: the bottom quarter of the data lies below the box and the top quarter lies above it.
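A minimal sketch of a single-variable box plot, using simulated ages in a stand-in data frame (gss_demo and its values are made up):

```r
library(ggplot2)

# Simulated stand-in for the age variable in gss
set.seed(1)
gss_demo <- data.frame(age = sample(18:89, size = 200, replace = TRUE))

# The five numbers: minimum, first quartile, median, third quartile, maximum
quantile(gss_demo$age)

# Box plot of one continuous variable
p <- ggplot(gss_demo, aes(y = age)) + geom_boxplot()
print(p)
```

Note that geom_boxplot() draws the whiskers only as far as 1.5 times the interquartile range from the box and plots any cases beyond that as separate points, so the whisker ends match the minimum and maximum only when there are no such outlying cases.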

Two variables

Box plots are especially useful when describing the spread of a continuous variable by levels of a categorical variable.

  • let’s create a box plot to visualize the distribution of age by marital status in R.
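With a categorical variable on the x-axis, ggplot2 draws one box per category. The sketch below uses simulated stand-in data (gss_demo is hypothetical and the ages are made up).

```r
library(ggplot2)

# Simulated stand-in for gss: 40 made-up respondents per marital status
set.seed(1)
gss_demo <- data.frame(
  marital = factor(rep(
    c("divorced", "married", "never married", "separated", "widowed"),
    each = 40
  )),
  age = sample(18:89, size = 200, replace = TRUE)
)

# Categorical variable on x, continuous variable on y
p <- ggplot(gss_demo, aes(x = marital, y = age)) + geom_boxplot()
print(p)
```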

Exercise

Create a plot that shows how occupational prestige score (prestg10) is distributed across categories of the lockedup variable. On average, what group has a higher occupational prestige score?

Histograms

Histograms are used to visually describe the distribution of a continuous variable by showing the count or proportion of people in different intervals (or bins).

Instead of plotting how often each number shows up in the data, like you would in a bar chart, you group numbers and then plot how many cases are within each group or interval.

Here’s an example:

  • say you know the age of 19 jail inmates: 20, 21, 22, 25, 26, 27, 33, 34, 34, 40, 43, 44, 44, 45, 50, 58, 60, 61, 69.

  • you can visualize this variable with a histogram by first grouping the ages into the following five bins or intervals 20-29, 30-39, 40-49, 50-59, and 60-69.

  • by hand, count the number of inmates in each category and draw a bar whose height corresponds to the size of each bin.
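The hand count above can be checked in R. This sketch puts the 19 ages into a small data frame called inmates (a name chosen here for illustration), counts the cases per bin with cut() and table(), and then draws the same bins as a histogram.

```r
library(ggplot2)

inmates <- data.frame(
  age = c(20, 21, 22, 25, 26, 27, 33, 34, 34, 40,
          43, 44, 44, 45, 50, 58, 60, 61, 69)
)

# Bin edges at 20, 30, ..., 70; right = FALSE puts age 30 in 30-39, not 20-29
breaks <- seq(20, 70, by = 10)
bin_counts <- table(cut(inmates$age, breaks = breaks, right = FALSE))
bin_counts  # 6, 3, 5, 2, and 3 inmates in the five bins

# The same bins drawn as a histogram
p <- ggplot(inmates, aes(x = age)) +
  geom_histogram(breaks = breaks, closed = "left")
print(p)
```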

Now let’s use R to create a histogram of the occupational prestige score in gss.

  • first, we will check its data type, range and the mean.
## [1] "integer"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   16.00   35.00   45.00   45.11   55.00   80.00      90

Now, let’s create a histogram in R.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 90 rows containing non-finite values (stat_bin).
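The steps above can be sketched as follows, using simulated scores in a stand-in data frame (gss_demo is hypothetical, and its values, including the number of missing cases, are made up). As in the output above, ggplot2 reports the default of 30 bins and warns about the missing values it drops.

```r
library(ggplot2)

# Simulated stand-in for prestg10: 200 made-up scores plus 10 missing values
set.seed(1)
gss_demo <- data.frame(
  prestg10 = c(sample(16:80, size = 200, replace = TRUE), rep(NA, 10))
)

class(gss_demo$prestg10)    # data type
summary(gss_demo$prestg10)  # range, mean, and number of NA's

# Histogram with the default number of bins
p <- ggplot(gss_demo, aes(x = prestg10)) + geom_histogram()
print(p)
```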

We can also change the number of bins (the default is 30). Let’s change it to 15.

## Warning: Removed 90 rows containing non-finite values (stat_bin).
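There are two ways to control the bins in geom_histogram(): bins sets how many bins to use, and binwidth sets how wide each bin is. The sketch below shows both on simulated stand-in data (gss_demo is hypothetical); the values 15 and 5 are only examples.

```r
library(ggplot2)

# Simulated stand-in for prestg10: 200 made-up scores plus 10 missing values
set.seed(1)
gss_demo <- data.frame(
  prestg10 = c(sample(16:80, size = 200, replace = TRUE), rep(NA, 10))
)

# Option 1: fix the number of bins
p_bins <- ggplot(gss_demo, aes(x = prestg10)) + geom_histogram(bins = 15)
print(p_bins)

# Option 2: fix the width of each bin (here, 5 prestige points)
p_width <- ggplot(gss_demo, aes(x = prestg10)) + geom_histogram(binwidth = 5)
print(p_width)
```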

15 is clearly not a very useful number of bins. Try different bin widths if your histogram does not look informative.

Now let’s see how the occupational prestige score is distributed by gender.

We can compare the distribution of occupational prestige by gender if we create a histogram for men and women separately. In ggplot2, that’s called faceting.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 90 rows containing non-finite values (stat_bin).
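A sketch of faceting with facet_wrap(), again on simulated stand-in data (gss_demo is hypothetical). It assumes gender is stored in the logical variable female, as after the conversion earlier in this tutorial.

```r
library(ggplot2)

# Simulated stand-in for gss (all values are made up for illustration)
set.seed(1)
gss_demo <- data.frame(
  prestg10 = sample(16:80, size = 200, replace = TRUE),
  female   = rep(c(TRUE, FALSE), each = 100)
)

# facet_wrap(~ female) draws one histogram panel per value of female
p <- ggplot(gss_demo, aes(x = prestg10)) +
  geom_histogram() +
  facet_wrap(~ female)
print(p)
```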

Labeling plots

Plots have to be labeled so that the person who looks at them knows exactly what they describe. In particular, this means labeling the x and y axes and providing an informative title.
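In ggplot2, labels can be added with labs(). The sketch below labels a bar chart of marital status built from a stand-in data frame (gss_demo is hypothetical, rebuilt from the counts shown earlier); the label wording is only a suggestion.

```r
library(ggplot2)

# Stand-in for gss, rebuilt from the marital status counts shown earlier
gss_demo <- data.frame(
  marital = factor(rep(
    c("divorced", "married", "never married", "separated", "widowed"),
    times = c(270, 811, 493, 56, 116)
  ))
)

# labs() sets the title and the axis labels
p <- ggplot(gss_demo, aes(x = marital)) +
  geom_bar() +
  labs(
    title = "Marital status of respondents",
    x = "Marital status",
    y = "Number of respondents"
  )
print(p)
```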

Scatterplots

Scatterplots are used to describe the distribution of continuous variables. They are the most useful when we want to visualize a bivariate relationship, that is, when we want to show how two variables are related.

For this part, we will use the mtcars dataset that comes preloaded with R.

  • the data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

  • here, we will show how car weight measured in 1000 lbs (wt) is related to miles per gallon (mpg). What kind of relationship would you expect to see?
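A minimal sketch of this scatterplot, with weight on the x-axis and miles per gallon on the y-axis:

```r
library(ggplot2)

# mtcars comes preloaded with R; each point is one car
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p)
```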

You can see that cars cover fewer miles per gallon as their weight increases. This is an example of a negative correlation–as one variable goes up in value, the other goes down.

You can also change many of the characteristics of the plots, such as the color and the size of the points on the scatterplot.

  • let’s change the color into red and increase the size of the points:
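Color and size can be set as arguments to geom_point(). The size value of 3 below is just an example.

```r
library(ggplot2)

# color and size are set outside aes() because they apply to every point
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "red", size = 3)
print(p)
```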

Scatterplots can also show what the relationship between two continuous variables looks like within categories of a categorical variable.

  • for example, let’s see how the relationship between wt and mpg looks for cars with different numbers of cylinders.
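One possible way to do this is to color the points by the number of cylinders (cyl), as sketched below. Wrapping cyl in factor() is a choice made here so that ggplot2 treats it as a categorical variable with three groups rather than as a number.

```r
library(ggplot2)

# Points are colored by number of cylinders (4, 6, or 8)
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "Cylinders")
print(p)
```

An alternative is to keep one color and add facet_wrap(~ cyl), which draws a separate panel for each number of cylinders.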