The first step in any comprehensive data analysis is to explore each important variable in turn. Univariate graphs plot the distribution of data from a single variable. The variable can be categorical (e.g., race, sex, political affiliation) or quantitative (e.g., age, weight, income).

The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama (see Appendix A.5). We’ll explore the distribution of three variables from this dataset: the age and race of the wedding participants and the occupation of the wedding officials.

Categorical Graphs

The race of the participants and the occupation of the officials are both categorical variables. The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map or waffle chart.

Bar Chart

library(ggplot2)
data(Marriage, package = "mosaicData")

# plot the distribution of race
ggplot(Marriage, aes(x = race)) + 
  geom_bar() +
  labs(title = "Race of 98 Individuals From Marriage Records",
       subtitle = "Mobile County, Alabama",
       x = "Race",
       y = "# People")

What do you observe?

I observe that in Mobile County, Alabama, the race with the most marriage records is White.

How can we improve this graph?

We can improve this graph by replacing the bland default colors, making the bars narrower, and adding fill color to the bars.

You can modify the bar fill and border colors by adding options to the geom_bar function, and the plot labels and title with labs. In ggplot2, the fill parameter specifies the color of areas such as bars, rectangles, and polygons. The color parameter specifies the color of objects that technically do not have an area, such as points, lines, and borders.

# plot the distribution of race with modified colors and labels
ggplot(Marriage, aes(x=race)) + 
  geom_bar(fill = "cornflowerblue", 
           color="black") +
  labs(x = "Race", 
       y = "Frequency", 
       title = "Participants by race")

Suppose we want this chart to show percents instead of counts. How might we do this?

We would do this by specifying we want percentages through an argument.

Percents

Bars can represent percents rather than counts. For bar charts, the code aes(x=race) is actually a shortcut for aes(x = race, y = after_stat(count)), where count is a special variable representing the frequency within each category. You can use this to calculate percentages by specifying the y variable explicitly.

# plot the distribution as percentages
ggplot(Marriage, 
       aes(x = race, y = after_stat(count/sum(count)))) + 
  geom_bar() +
  labs(x = "Race", 
       y = "Percent", 
       title  = "Participants by race") +
  scale_y_continuous(labels = scales::percent)
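As a quick check on what count/sum(count) evaluates to, the sketch below repeats the arithmetic in base R. It assumes the four race counts reported for this dataset (1, 22, 1, and 74, which sum to 98); the names race_counts and race_props are illustrative and do not appear elsewhere in this document.

```r
# Counts per race category (assumed from the Marriage data, for illustration)
race_counts <- c("American Indian" = 1, "Black" = 22, "Hispanic" = 1, "White" = 74)

# Same arithmetic as after_stat(count / sum(count)): each count over the total
race_props <- race_counts / sum(race_counts)

# The proportions sum to 1; multiply by 100 to read them as percents
round(100 * race_props, 1)
```

The scales::percent labeller in the chart only changes how these proportions are printed on the axis; the bar heights are still values between 0 and 1.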

Reordering categories

It is often helpful to sort the bars by frequency. The frequencies are calculated explicitly in the code below.

# calculate number of participants in each race category
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

plotdata <- Marriage %>%
 count(race)
plotdata

##              race  n
## 1 American Indian  1
## 2           Black 22
## 3        Hispanic  1
## 4           White 74

This new dataset is then used to create the graph with the following modifications:

The reorder function is used to sort the categories by frequency.
The option stat=“identity” tells the plotting function not to calculate counts, because they are supplied directly.

# plot the bars in ascending order
ggplot(plotdata, 
       aes(x = reorder(race, n), y = n)) + 
  geom_bar(stat="identity") +
  labs(x = "Race", 
       y = "Frequency", 
       title  = "Participants by race")

The graph bars are sorted in ascending order. Use reorder(race, -n) to sort in descending order.

Try making this change to the above graph.

ggplot(plotdata, 
       aes(x = reorder(race, -n), y = n)) + 
  geom_bar(stat="identity") +
  labs(x = "Race", 
       y = "Frequency", 
       title  = "Participants by race")
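To see what reorder does behind the scenes, the sketch below applies it to a small table shaped like plotdata. The counts are assumed from the Marriage data and the name plotdata_demo is illustrative only.

```r
# Summary table in the same shape as plotdata (counts assumed for illustration)
plotdata_demo <- data.frame(
  race = factor(c("American Indian", "Black", "Hispanic", "White")),
  n    = c(1, 22, 1, 74)
)

# reorder() returns the same factor with its levels re-sorted by the second argument
asc_levels  <- levels(reorder(plotdata_demo$race, plotdata_demo$n))
desc_levels <- levels(reorder(plotdata_demo$race, -plotdata_demo$n))

asc_levels   # smallest count first, so "White" comes last
desc_levels  # largest count first, so "White" comes first
```

ggplot2 draws the bars in factor-level order, which is why changing the levels is enough to change the order of the bars.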

Labeling Bars

Finally, you may want to label each bar with its numerical value.

In the code below:

geom_text adds the labels, and
vjust controls vertical justification.

# plot the bars with numeric labels
ggplot(plotdata, 
       aes(x = race, y = n)) + 
  geom_bar(stat="identity") +
  geom_text(aes(label = n), vjust=-0.5) +
  labs(x = "Race", 
       y = "Frequency", 
       title  = "Participants by race")

Modify the above graph with the following:

bars sorted in descending order
bar outline of “black” and fill colors of “indianred3”

ggplot(plotdata, 
       aes(x = reorder(race, -n),
           y = n/sum(n))) +
  geom_bar(stat="identity",fill= "indianred3", color = "black") +
  geom_text(aes(label = paste(c(
    round(n/sum(n),2)*100),"%", sep="")),
            vjust=-0.2) +
  labs(x = "Race", 
       y = "Percent", 
       title  = "Participants by race") +
  scale_y_continuous(breaks = seq(0, 0.8, 0.2),
                     labels = scales::percent)

It’s a bit tricky, but we sort the bars AND relabel the vertical axis and bars as percent:

# plot the sorted bars with percent labels
ggplot(plotdata, 
       aes(x = reorder(race, -n),
           y = n/sum(n))) +
  geom_bar(stat="identity",fill= "indianred3", color = "black") +
  geom_text(aes(label = paste(c(
    round(n/sum(n),2)*100),"%", sep="")),
            vjust=-0.5) +
  labs(x = "Race", 
       y = "Percent", 
       title  = "Participants by race") +
  scale_y_continuous(limits = c(0,0.8),
                     labels = scales::percent)
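The label expression inside geom_text is the tricky part, so here is a minimal sketch of it on its own. It assumes the four race counts from the Marriage data; lab_paste and lab_sprintf are illustrative names, and the sprintf version is an alternative formulation rather than the method used in the chart.

```r
n <- c(1, 22, 1, 74)  # counts per category (assumed from the Marriage data)

# The idiom used in the chart: round the proportion, scale to 100, append "%"
lab_paste <- paste0(round(n / sum(n), 2) * 100, "%")

# A base R alternative with sprintf(); "%%" prints a literal percent sign
lab_sprintf <- sprintf("%.0f%%", 100 * n / sum(n))

lab_paste
lab_sprintf
```

For these counts both versions give the same four strings. Because each percent is rounded separately, the labels need not add up to exactly 100%.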

Overlapping labels

Consider the distribution of marriage officials.

What is problematic with the following?

The bar chart is going to have overlapping labels.

# Basic bar chart with overlapping labels
ggplot(Marriage, aes(x=officialTitle)) + 
  geom_bar() +
  labs(x = "Officiate",
       y = "Frequency",
       title = "Marriages by officiate")

What ideas do you have for fixing this?

Sorting the bars in descending order, adding color, and spacing out the labels so that they are not overlapping each other.

Here are three approaches. Identify the key features of each.

In the chart below the labels are organized better because they do not overlap, making the chart easier to read.

# horizontal bar chart
ggplot(Marriage, aes(x = officialTitle)) + 
  geom_bar() +
  labs(x = "",
       y = "Frequency",
       title = "Marriages by officiate") +
  coord_flip()

In the chart below the labels are slanted which prevents them from overlapping, making the chart easier to read.

# bar chart with rotated labels
ggplot(Marriage, aes(x=officialTitle)) + 
  geom_bar() +
  labs(x = "",
       y = "Frequency",
       title = "Marriages by officiate") +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1))

In the chart below the labels alternate between being positioned higher and lower, making the chart easier to read.

# bar chart with staggered labels
lbls <- paste0(c("","\n"), levels(Marriage$officialTitle))
ggplot(Marriage, 
       aes(x=factor(officialTitle, 
                    labels = lbls))) + 
  geom_bar() +
  labs(x = "",
       y = "Frequency",
       title = "Marriages by officiate")
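The staggering trick relies on R's vector recycling. A small sketch, using a few officiant titles purely for illustration (titles and lbls_demo are illustrative names):

```r
# A few officiant titles, used here only for illustration
titles <- c("BISHOP", "CATHOLIC PRIEST", "CHIEF CLERK", "ELDER")

# c("", "\n") is recycled along titles: odd positions get no prefix,
# even positions get a leading newline, which pushes them one line down
lbls_demo <- paste0(c("", "\n"), titles)

lbls_demo
```

Every second label now starts with a line break, so neighboring labels sit on different rows and no longer collide.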

Stacked Bar charts

Another variant of a bar chart is the stacked bar chart. This helps visualize the relative frequency of counts as part of a whole.

To make a stacked bar chart, we specify the following:

no variable on the x-axis
use the fill aesthetic for the categorical variable
specify the position argument of geom_bar() to “stack”

ggplot( data = Marriage,
        aes( x = "",
             fill = officialTitle)) +
  geom_bar( position = "stack" )

We can also specify the

ggplot( data = Marriage,
        aes( x = "",
             fill = officialTitle)) +
  geom_bar( position = "stack" )

Use whichever of the above graphs you prefer, then add an appropriate title and axis labels:

ggplot( data = Marriage,
        aes( x = "",
             fill = officialTitle)) +
  geom_bar( position = "stack" ) +
  labs( title = "Wedding Officiate Occupation",
        x = "",
        fill = "Official Title")

Choose another categorical variable from the marriage dataset to make a stacked bar chart.

ggplot( data = Marriage,
        aes( x = "",
             fill = race)) +
  geom_bar( position = "stack" ) +
  labs( title = "Wedding Participants by Race",
        x = "",
        fill = "Race")

Pie Chart

Pie charts are controversial in statistics.

Why do you think this is the case?

Because they are not as exact as other visualizations.

A pie chart is essentially a stacked bar chart in polar coordinates. So to make a pie chart in ggplot2, we simply add the layer coord_polar() to a stacked bar chart:

ggplot( data = Marriage,
        aes( x = "",
             fill = sign)) +
  geom_bar( position = "stack" ) + 
  coord_polar( theta = "y")

What is problematic here?

The problem is that you cannot tell the exact count for each category.

Make a bar chart of the sign variable. Which is more informative, this or the previous graph?

ggplot(data = Marriage, 
       aes(x = sign, 
           fill = sign)) + 
  geom_bar(position = "stack") + 
  coord_flip() +
  labs(x = "Sign", 
       y = "Count", 
       title = "Frequency of Wedding Signs")

If you aim to compare the frequency of categories, you are better off with bar charts (humans are better at judging the length of bars than the area of pie slices). If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants), and the number of categories is small, then pie charts may work.

Make a pie chart of the race variable, with appropriate labels. Compared to the bar chart of the race variable, which do you prefer?

ggplot(data = Marriage, 
       aes(x = "", 
           fill = race)) + 
  geom_bar(position = "stack") + 
  coord_polar(theta = "y") +
  labs(fill = "Race",
       title = "Participants by Race", 
       x = "", 
       y = "")

Tree Map

An alternative to a pie chart is a tree map. Unlike pie charts, it can handle categorical variables that have many levels.

library(treemapify)

# create a treemap of marriage officials
plotdata <- Marriage %>%
  count(officialTitle)

ggplot(plotdata, 
       aes(fill = officialTitle, area = n)) +
  geom_treemap() + 
  labs(title = "Marriages by officiate")

Here is a more useful version with labels.

# create a treemap with tile labels
ggplot(plotdata, 
       aes(fill = officialTitle, 
           area = n, 
           label = officialTitle)) +
  geom_treemap() + 
  geom_treemap_text(colour = "white", 
                    place = "centre") +
  labs(title = "Marriages by officiate") +
  theme(legend.position = "none")

Make a tree map for the sign variable.

# install.packages("treemapify")  # run once if the package is not yet installed

library(ggplot2)
library(treemapify)
library(dplyr)

plotdata <- Marriage %>%
  group_by(sign) %>%
  summarise(n = n(), .groups = 'drop')

ggplot(plotdata, 
       aes(fill = sign, 
           area = n, 
           label = sign)) +
  geom_treemap() + 
  geom_treemap_text(colour = "white", 
                    place = "centre", 
                    grow = TRUE) +
  labs(title = "Marriages by Sign") +
  theme(legend.position = "none")

Waffle Chart

A waffle chart, also known as a gridplot or square pie chart, represents observations as squares in a rectangular grid, where each cell represents a percentage of the whole. You can create a ggplot2 waffle chart using the geom_waffle function in the waffle package.

Let’s create a waffle chart for the professions of wedding officiates. As with treemaps, start by summarizing the data into groups and counts.

library(dplyr)
plotdata <- Marriage %>%
  count(officialTitle)
plotdata

##       officialTitle  n
## 1            BISHOP  2
## 2   CATHOLIC PRIEST  2
## 3       CHIEF CLERK  2
## 4    CIRCUIT JUDGE   2
## 5             ELDER  2
## 6 MARRIAGE OFFICIAL 44
## 7          MINISTER 20
## 8            PASTOR 22
## 9          REVEREND  2

Next, create the ggplot2 graph. Set the fill to the grouping variable and values to the counts. Don’t specify an x and y.

Install the waffle package if it is not already installed.

# create a basic waffle chart
library(waffle)
library(dplyr)

plotdata <- Marriage %>%
  group_by(officialTitle) %>%
  summarise(n = n()) %>%
  ungroup()


ggplot(plotdata, aes(fill = officialTitle, values=n)) +
  geom_waffle(na.rm=TRUE)

Next, we’ll customize the graph by

specifying the number of rows and cell sizes and setting borders around the cells to “white” (geom_waffle)
change the color scheme to “Spectral” (scale_fill_brewer)
assure that the cells are squares and not rectangles (coord_equal)
simplify the theme (the theme functions)
modify the title and add a caption with the scale (labs)

# Create a customized caption
cap <- paste0("1 square = ", ceiling(sum(plotdata$n)/100), 
              " case(s).")
library(waffle)
ggplot(plotdata, aes(fill = officialTitle, values=n)) +
  geom_waffle(na.rm=TRUE,
              n_rows = 10,
              size = .4,
              color = "white") + 
  scale_fill_brewer(palette = "Spectral") +
  coord_equal() +
  theme_minimal() + 
  theme_enhance_waffle() +
  theme(legend.title = element_blank()) +
  labs(title = "Proportion of Wedding Officials",
       caption = cap)

Make a waffle chart of the sign variable.

library(dplyr)

plotdata <- Marriage %>%
  group_by(sign) %>%
  summarise(n = n()) %>%
  ungroup()

cap <- paste0("1 square = ", ceiling(sum(plotdata$n)/100), 
              " case(s).")
library(waffle)
# "Set3" has 12 colors, enough for the 12 signs ("Spectral" stops at 11)
ggplot(plotdata, aes(fill = sign, values=n)) +
  geom_waffle(na.rm=TRUE,
              n_rows = 10,
              size = .4,
              color = "white") + 
  scale_fill_brewer(palette = "Set3") +
  coord_equal() +
  theme_minimal() + 
  theme_enhance_waffle() +
  theme(legend.title = element_blank()) +
  labs(title = "Proportion of Participants by Sign",
       caption = cap)

Quantitative Graphs

In the Marriage dataset, age is a quantitative variable. The distribution of a single quantitative variable is typically plotted with a histogram, kernel density plot, or dot plot.

Histogram

Histograms are the most common approach to visualizing a quantitative variable. In a histogram, the values of a variable are typically divided up into adjacent, equal-width ranges (called bins), and the number of observations in each bin is plotted with a vertical bar.

library(ggplot2)

# plot the age distribution using a histogram
ggplot(Marriage, aes(x = age)) +
  geom_histogram() + 
  labs(title = "Participants by age",
       x = "Age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most participants appear to be in their early 20s, with another group in their 40s and a much smaller group in their late 60s and early 70s. This would be a multimodal distribution.

Histogram colors can be modified using two options:

fill - fill color for the bars
color - border color around the bars

# plot the histogram with blue bars and white borders
ggplot(Marriage, aes(x = age)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white") + 
  labs(title="Participants by age",
       x = "Age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Bins and binwidths

One of the most important histogram options is bins, which controls the number of bins into which the numeric variable is divided (i.e., the number of bars in the plot). The default is 30, but it is helpful to try smaller and larger numbers to get a better impression of the shape of the distribution.

# plot the histogram with 20 bins
ggplot(Marriage, aes(x = age)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 bins = 20) + 
  labs(title="Participants by age", 
       subtitle = "number of bins = 20",
       x = "Age")

Alternatively, you can specify binwidth, the width of the bins represented by the bars.

# plot the histogram with a binwidth of 5
ggplot(Marriage, aes(x = age)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 binwidth = 5) + 
  labs(title="Participants by age", 
       subtitle = "binwidth = 5 years",
       x = "Age")
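The two options are two ways of fixing the same thing: either the number of bins or their width. The sketch below illustrates the difference in base R with cut and table. The ages are invented for illustration (they are not the Marriage data), and ggplot2 places its bin edges somewhat differently, so the counts here would not necessarily match geom_histogram exactly.

```r
# Hypothetical ages, invented for illustration (not the Marriage data)
ages <- c(18, 21, 22, 22, 24, 27, 31, 38, 41, 44, 45, 52, 67, 71)

# binwidth = 5: the width is fixed; the number of bins follows from the range
breaks_5 <- seq(15, 75, by = 5)
counts_5 <- table(cut(ages, breaks = breaks_5, right = FALSE))

# bins = 6: the number of bins is fixed; the width follows from the range
breaks_6 <- seq(min(ages), max(ages), length.out = 6 + 1)
counts_6 <- table(cut(ages, breaks = breaks_6, include.lowest = TRUE))

counts_5
counts_6
```

Both tables account for every observation; only the grouping changes, which is why the apparent shape of a histogram can shift as you vary these options.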

As with bar charts, the y-axis can represent counts or percent of the total.

# plot the histogram with percentages on the y-axis
library(scales)
ggplot(Marriage, 
       aes(x = age, y= after_stat(count/sum(count)))) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 binwidth = 5) + 
  labs(title="Participants by age", 
       y = "Percent",
       x = "Age") +
  scale_y_continuous(labels = scales::percent)

Make a histogram of the dayOfBirth variable. What is an appropriate number of bins to use?

# plot the day of birth with percentages on the y-axis
library(scales)
ggplot(Marriage, 
       aes(x = dayOfBirth, y = after_stat(count/sum(count)))) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 bins = 12) + 
  labs(title = "Participants by day of birth", 
       y = "Percent",
       x = "Day of birth") +
  scale_y_continuous(labels = scales::percent)

What if we realize that we really want to plot the month of each person’s date of birth? How could we do this?

Hint: you can extract the month from a value of the Date data type using the lubridate package.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# Extract month from dates
month(Marriage$dob)

##  [1]  4  8  2  5 12  2 10  1 12  7  2 11  9 10 10 11  1  5  2  5  4 11 10  1 11
## [26]  8 10 10  4  3  2  9  4  6  2 11  5  3  5  9  2  3  5  9  2  9  5  7 12  2
## [51]  4  3  5 12 11 12  9  3  7  4  4  4  2  6  3  4  4 11  6  9 10  5  3  2  9
## [76]  8  6 12 10 11  4  3  7  3  8  8 10  2  2  9  7  8  9  6  6  1  5  8

plotdata <- Marriage %>%
  mutate(month_of_birth = as.factor(month(dob)))

# count participants by birth month (labelled "Jan", "Feb", ...)
plotdata <- Marriage %>%
  mutate(month_of_birth = month(dob, label = TRUE)) %>%
  count(month_of_birth)

# plot the number of participants born in each month, sorted by frequency
ggplot(data = plotdata,
       aes(x = reorder(month_of_birth, n),
           y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Month of birth",
       y = "Frequency",
       title = "Participants by month of birth")
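The month extraction above relies on lubridate. If that package is unavailable, the same idea can be sketched in base R. The dates below are invented for illustration, and dob_demo, month_num, and month_lab are illustrative names.

```r
# Hypothetical birth dates, invented for illustration
dob_demo <- as.Date(c("1980-04-12", "1975-08-30", "1992-02-03", "1968-12-25"))

# format() with "%m" gives the two-digit month; convert it to an integer
month_num <- as.integer(format(dob_demo, "%m"))

# month.abb is a built-in vector of English month abbreviations;
# fixing the levels keeps the months in calendar order on an axis
month_lab <- factor(month.abb[month_num], levels = month.abb)

month_num
month_lab
```

Keeping the factor levels in calendar order is a design choice: passing month_lab straight to the x aesthetic would plot January through December, whereas reorder would sort the months by frequency instead.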

Kernel Density Plot

An alternative to a histogram is the kernel density plot. Technically, kernel density estimation is a nonparametric method for estimating the probability density function of a continuous random variable (what??). We are trying to draw a smoothed histogram, where the area under the curve equals one.

# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
  geom_density() + 
  labs(title = "Participants by age")

The graph shows the distribution of ages. For example, the proportion of cases between 20 and 40 years old would be represented by the area under the curve between 20 and 40 on the x-axis.
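To make the "area under the curve" idea concrete, here is a rough numerical sketch using the density function from base R's stats package. The ages are simulated for illustration only (they are not the Marriage data), and the areas are approximated with a simple Riemann sum.

```r
set.seed(1)
# Simulated ages, for illustration only (not the Marriage data)
ages_sim <- c(rnorm(60, mean = 25, sd = 4), rnorm(40, mean = 45, sd = 6))

# density() evaluates the kernel density estimate on an evenly spaced grid
d  <- density(ages_sim)
dx <- d$x[2] - d$x[1]

# Riemann-sum approximation of the area under the whole curve (close to 1)
total_area <- sum(d$y) * dx

# Area between 20 and 40: the estimated proportion of cases in that range
area_20_40 <- sum(d$y[d$x >= 20 & d$x <= 40]) * dx

total_area
area_20_40
```

The y-axis values themselves are densities, not proportions; only areas under the curve can be read as proportions.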

As with previous charts, we can use fill and color to specify the fill and border colors.

# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
  geom_density(fill = "indianred3") + 
  labs(title = "Participants by age")

Smoothing Parameter

The degree of smoothness is controlled by the bandwidth parameter bw. To find the default value for a particular variable, use the bw.nrd0 function. Larger values will result in more smoothing, while smaller values will produce less smoothing.

# default bandwidth for the age variable
bw.nrd0(Marriage$age)

## [1] 5.181946

# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
  geom_density(fill = "deepskyblue", 
               bw = 1) + 
  labs(title = "Participants by age",
       subtitle = "bandwidth = 1")

In this example, the default bandwidth for age is 5.18. Choosing a value of 1 resulted in less smoothing and more detail.
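For the curious, the default comes from a rule of thumb described in ?bw.nrd0: 0.9 times the smaller of the standard deviation and IQR/1.34, times the sample size raised to the power -1/5. The sketch below checks that formula against bw.nrd0 on simulated values (not the Marriage ages); bw_manual is an illustrative name.

```r
set.seed(1)
x <- rnorm(98, mean = 35, sd = 12)  # simulated values, not the Marriage ages

# Rule of thumb behind bw.nrd0(): 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
bw_manual <- 0.9 * min(sd(x), IQR(x) / 1.34) * length(x)^(-1/5)

bw_manual
bw.nrd0(x)
```

Because the bandwidth shrinks only slowly as the sample grows, the default tends to give fairly smooth curves, which is why trying a smaller value such as bw = 1 can reveal more detail.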

Kernel density plots allow you to easily see which values are most frequent and which are relatively rare. However, it can be difficult to explain the meaning of the y-axis to a non-statistician. (But it will make you look really smart at parties!)

Make two kernel density plots of the dayOfBirth variable: one with the default smoothing parameter, and one with a smoothing parameter of your choice.

# default bandwidth (chosen by bw.nrd0)
ggplot(Marriage, aes(x = dayOfBirth)) +
  geom_density(fill = "deepskyblue") + 
  labs(title = "Participants by day of birth",
       subtitle = "default bandwidth")

# user-chosen bandwidth of 10
ggplot(Marriage, aes(x = dayOfBirth)) +
  geom_density(fill = "deepskyblue", 
               bw = 10) + 
  labs(title = "Participants by day of birth",
       subtitle = "bandwidth = 10")

Dot Chart

Another alternative to the histogram is the dot chart. Again, the quantitative variable is divided into bins, but rather than summary bars, each observation is represented by a dot. By default, the width of a dot corresponds to the bin width, and dots are stacked, with each dot representing one observation. This works best when the number of observations is small (say, less than 150).

# plot the age distribution using a dot plot
ggplot(Marriage, aes(x = age)) +
  geom_dotplot() + 
  labs(title = "Participants by age",
       y = "Proportion",
       x = "Age")

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

The fill and color options can be used to specify the fill and border color of each dot respectively.

# Plot ages as a dot plot using 
# gold dots with black borders
ggplot(Marriage, aes(x = age)) +
  geom_dotplot(fill = "gold", 
               color="black") + 
  labs(title = "Participants by age",
       y = "Proportion",
       x = "Age")

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

There are many more options available. See ?geom_dotplot for details and examples.

Make a dot chart of delay:

To fit all dots on the chart, compact the vertical stacking using the dotsize and stackratio parameters.
Make x-axis labeling more descriptive using the scale scale_x_continuous().
Remove the y-axis using the scale scale_y_continuous(NULL, breaks = NULL)

ggplot(Marriage, aes(x = delay)) +
  geom_dotplot(dotsize = 0.8, stackratio = 0.5, fill = "deepskyblue") + 
  labs(title = "Participants by Delay in Marriage") +
  scale_x_continuous(name = "Delay (days)",
                     breaks = seq(0, max(Marriage$delay, na.rm = TRUE), by = 5)) +
  scale_y_continuous(NULL, breaks = NULL)

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Since dotplots represent one observation per dot, they lend themselves to fill colors that encode a categorical variable.

Let’s add race as fill to the above chart. What do you notice?

I notice that there are now two different colors in the dotplot.

ggplot( Marriage,
        aes( x = delay, fill = race)) +
  geom_dotplot(stackratio = 0.7, 
               dotsize = 0.85) +
    scale_y_continuous(NULL, breaks = NULL) + 
    scale_x_continuous(breaks = seq(0,30,2))

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

To fix this, we need the following arguments:

stackgroups = TRUE allows different fill groups to be stacked together
binpositions = "all" determines position of bins with all the data taken together; this is used for aligning dot stacks across multiple groups.

ggplot( Marriage,
        aes( x = delay, fill = race)) +
  geom_dotplot(stackgroups = TRUE, 
               binpositions = "all", 
               stackratio = 0.7, 
               dotsize = 0.85, 
               color = "black") +
   
    scale_x_continuous(breaks = seq(0,30,2))

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Make a dotplot of the age variable, colored by race.

ggplot( Marriage,
        aes( x = age, fill = race)) +
  geom_dotplot(stackgroups = TRUE, 
               binpositions = "all", 
               stackratio = 0.7, 
               dotsize = 0.7, 
               color = "black") +
  scale_y_continuous(NULL, breaks = NULL) +
  scale_x_continuous(breaks = seq(10, 80, 10))

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Practice

Create graphs to analyze the following variables of the loan50.csv data set separately.

loan50 <- read.csv("loan50.csv")

state

library(dplyr)

plotdata2 <- loan50 %>%
  count(state)

library(ggplot2)
ggplot(plotdata2,
       aes(x = reorder(state, -n), y = n)) +
  geom_bar(fill = "plum",
           stat = "identity") +
  labs(x = "state", y = "frequency")

emp_length

library(dplyr)
library(ggplot2)

ggplot(loan50,
       aes( x = emp_length)) +
      geom_histogram(fill = "deeppink1") +
  labs(title = "Years of Employment",
       x = "Length of Employment",
       y = "Count") +
  scale_x_continuous(breaks = seq(0,10,2)) +
  scale_y_continuous(breaks = seq(0,10,2))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

homeownership

library(dplyr)
library(ggplot2)

plotdata <- loan50 %>%
  count(homeownership)

ggplot(plotdata, aes(x = "", y = n, fill = homeownership)) +
  geom_bar(stat = "identity", position = "stack") +
  coord_polar(theta = "y") +
  labs(x = "", y = "", fill = "Homeownership") +
  geom_text(aes(label = n), 
            position = position_stack(vjust = 0.5))

debt_to_income

library(dplyr)
library(ggplot2)

# one dot per loan, so plot the raw data rather than a count summary
ggplot(loan50, aes(x = debt_to_income)) +
  geom_dotplot(fill = "darkmagenta") +
  labs(title = "Rate of Debt to Income",
       x = "Debt-to-income ratio") +
  scale_x_continuous(breaks = seq(0, 6, 0.5)) +
  scale_y_continuous(NULL, breaks = NULL)

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

annual_income

library(ggplot2)

ggplot(loan50, aes(x = annual_income)) +
  geom_histogram(fill = "lightpink",
                 color = "magenta3") +
  labs(title = "Participants' Annual Income",
       x = "Annual Income") +
  scale_x_continuous(breaks = seq(0, 325000, 50000),
                     labels = scales::comma)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

loan_purpose

library(dplyr)
library(ggplot2)

ggplot(loan50, aes(x = loan_purpose)) +
  geom_bar(fill = "rosybrown3") +
  labs(x = "Loan Purpose",
       y = "Frequency",
       title = "Reason for Loan") +
  coord_flip()

interest_rate

library(ggplot2)
ggplot(loan50, aes(x = interest_rate)) +
  geom_density(fill = "mediumorchid", bw = 0.5) +
  labs(title = "Interest Rate")

For each variable, choose the graph that provides the best understanding.

Please refer to the following for more description of the variables in this data set.
https://www.openintro.org/data/index.php?data=loan50

Assignment 5 - Univariate Graphs

Emma Fields

2024-09-28