Chapter 5 Exploratory Data Analysis

EDA is an iterative cycle where you:
1. Generate questions about your data
2. Search for answers by visualizing, transforming and modeling your data
3. Use what you learn to refine your questions and/or generate new questions

The two types of questions:
1. What type of variation occurs within each variable?
2. What type of covariation occurs between variables?

Definitions:
A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it.
An observation, or case, is a set of measurements made under similar conditions.
Tabular data is a set of values, each associated with a variable and an observation.
Variation is the tendency of the values of a variable to change from measurement to measurement.
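
To make the definitions concrete, here is a minimal sketch in base R. The table `toy` and its numbers are invented for illustration only; they are not taken from the diamonds data.

```r
# A toy table: each column is a variable, each row is an observation,
# and each cell is a value. All numbers are made up for illustration.
toy <- data.frame(
  id    = c(1, 2, 3),
  carat = c(0.23, 0.31, 1.20),
  cut   = factor(c("Ideal", "Good", "Ideal"))
)

n_observations <- nrow(toy)   # number of rows (observations)
n_variables    <- ncol(toy)   # number of columns (variables)
one_value      <- toy$carat[2] # the value of `carat` for the second observation
```

The `carat` column varies from row to row, which is the variation that the rest of the chapter sets out to visualize.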

Visualizing Distributions

library(tidyverse)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

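This is an example for a categorical variable (usually stored in R as a factor or a character vector). The height of each bar is the number of observations with that value, which you can also compute with count():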

diamonds %>%
  count(cut)

An example for a continuous variable (better to use a histogram):

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

You can compute the histogram counts by hand by combining dplyr::count() and ggplot2::cut_width():

diamonds %>%
  count(cut_width(carat, 0.5))
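
What the binning step does can be sketched in base R on a toy vector. The values in `carat_toy` are invented, and the break points are chosen by hand on the assumption that they mimic cut_width()'s default of centering bins on multiples of the width; this is a stand-in, not ggplot2's own implementation.

```r
# Invented carat values, for illustration only
carat_toy <- c(0.2, 0.3, 0.7, 0.9, 1.1, 1.6)

# Bin edges 0.5 wide, centered on 0, 0.5, 1, 1.5
# (assumed to mirror the default behavior of cut_width(x, 0.5))
breaks <- seq(-0.25, 1.75, by = 0.5)

# Assign each value to a bin, then count the values per bin
bins <- cut(carat_toy, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)
bin_counts
```

Each count is the height one histogram bar would have with binwidth = 0.5.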

If you want just the diamonds smaller than three carats, with a smaller binwidth:

smaller <- diamonds %>%
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

You can also use geom_freqpoly() to draw lines instead of bars, which makes it easier to overlay several distributions in one plot:

ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1)

Typical Values

In both bar charts and histograms, tall bars show the common values and shorter bars show the less common values. Places that have no bars reveal values that were not seen in your data.

ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)

Unusual Values

Sometimes the only evidence of outliers is the unusually wide limits on the y-axis

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

The bars for the unusual values may be so short that you can barely see them. To make them visible, zoom in on the small values of the y-axis with coord_cartesian():

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  arrange(y)
# display values
unusual

Missing Values

You have two options for dealing with unusual values:
1. Drop the entire row that contains the unusual value:

diamonds2 <- diamonds %>%
  filter(between(y, 3, 20))
diamonds2

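or:
2. Replace the unusual values with missing values. This is usually preferable, because one bad measurement does not mean the rest of the row is invalid. The easiest way is to use mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA: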

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
diamonds2
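
How ifelse() behaves here is easiest to see on a small vector. The sketch below uses invented numbers in `y_toy` (not the diamonds data) with the same condition as above.

```r
# Invented measurements: 0 and 58.9 fall outside the plausible range
y_toy <- c(0, 3.5, 5.2, 58.9, 4.1)

# Where the condition is TRUE the value becomes NA; otherwise it is kept
y_clean <- ifelse(y_toy < 3 | y_toy > 20, NA, y_toy)

n_missing <- sum(is.na(y_clean))       # how many values were replaced
mean_kept <- mean(y_clean, na.rm = TRUE) # summaries must now skip the NAs
```

The vector keeps its length, so every other column in the row stays aligned with it.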

When you plot data that contains missing values, ggplot2 warns that it removed the rows containing missing values:

library(tidyverse)
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point()

To suppress the warning, use na.rm=TRUE

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm=TRUE)

Say you want to compare the scheduled departure times of cancelled and non-cancelled flights. In nycflights13::flights, a missing dep_time indicates that the flight was cancelled.

nycflights13::flights %>%
  mutate(cancelled = is.na(dep_time), 
         sched_hour = sched_dep_time %/% 100,
         sched_min = sched_dep_time %% 100,
         sched_dep_time = sched_hour + sched_min / 60) %>%
  ggplot(mapping = aes(sched_dep_time)) +
  geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)

You can't really see any pattern in the TRUE line (cancelled flights, colored light blue) because there are far more non-cancelled flights than cancelled ones.
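
The time arithmetic in the flights pipeline above relies on sched_dep_time being stored as an integer in HHMM form. A minimal sketch of that conversion on three invented times:

```r
# Invented scheduled departure times in HHMM form (5:15, 13:45, 23:59)
sched_dep_time <- c(515, 1345, 2359)

sched_hour <- sched_dep_time %/% 100 # integer division keeps the hours
sched_min  <- sched_dep_time %% 100  # the remainder keeps the minutes

# Hours as a decimal number, so times sit on a continuous axis
decimal_hours <- sched_hour + sched_min / 60
```

With binwidth = 1/4 on this decimal scale, each bin of the frequency polygon covers 15 minutes.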

Covariation

Covariation describes the behavior between variables. It is the tendency for the values of two or more variables to vary together in a related way.

Sometimes the pattern is hard to see because the groups differ greatly in size, so the counts of the largest group dominate the plot. To make comparison easier, put density on the y-axis instead of count. Density is the count standardized so that the area under each frequency polygon is one.

ggplot(data=diamonds, mapping = aes(x =  price, y= ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
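
Why density makes groups of different sizes comparable can be checked by hand in base R. This is a sketch on invented data: `small_group` and `large_group` have the same shape but differ in size by a factor of 100, and `bin_density()` is a hypothetical helper written for this example.

```r
# Two invented groups with the same distribution but very different sizes
small_group <- c(1, 3, 3, 5)
large_group <- rep(small_group, times = 100)

binwidth <- 2
breaks <- seq(0, 6, by = binwidth)

# Hypothetical helper: counts per bin, standardized so the total area is one
bin_density <- function(x) {
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  counts / (sum(counts) * binwidth)
}

density_small <- bin_density(small_group)
density_large <- bin_density(large_group)

# Area under each set of bars: sum of height times width
area_small <- sum(density_small * binwidth)
area_large <- sum(density_large * binwidth)
```

The raw counts differ by a factor of 100, but the two density vectors are identical, which is what lets the frequency polygons be compared on one scale.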

Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot.

ggplot(data = diamonds , mapping = aes(x = cut, y = price)) +
  geom_boxplot()

Surprisingly, the better cuts are cheaper on average! Many categorical variables don't have an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the reorder() function.

To illustrate the point, here is the unordered plot:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

And here is the reordered plot:

ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes( x = reorder(class, hwy, FUN = median), y = hwy)
  )
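
What reorder() does to the factor levels can be seen without plotting. The sketch below uses invented mileage numbers, not the mpg data.

```r
# Invented classes and highway mileages, two cars per class
car_class <- c("compact", "compact", "pickup", "pickup", "suv", "suv")
hwy_toy   <- c(28, 30, 16, 18, 17, 21)

# Default factor levels are alphabetical
default_levels <- levels(factor(car_class))

# reorder() sorts the levels by the median of hwy_toy within each class
reordered <- reorder(car_class, hwy_toy, FUN = median)
reordered_levels <- levels(reordered)
```

The boxplot draws categories in level order, so sorting the levels by median sorts the boxes by median.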

You can also flip the boxplot 90 degrees with the coord_flip() function. This is helpful when the category names are long.

ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(
      x = reorder(class, hwy, FUN=median),
      y = hwy)
  ) +
  coord_flip()


Two Categorical Variables

To visualize the covariation between categorical variables, you need to count the number of observations for each combination of values. One way to do that is with geom_count():

ggplot(data = diamonds) +
  geom_count(mapping = aes(x =cut, y = color))

Another approach is to compute the count with dplyr

diamonds %>%
  count(color,cut)

You then visualize with geom_tile(). This looks like a heat map.

diamonds %>%
  count(color,cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
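
The counting step that geom_count() and geom_tile() rely on can be sketched in base R with table(). The two vectors below are invented for illustration, and table() is a stand-in for dplyr's count(), not the same function.

```r
# Invented cut and color labels for six diamonds
cut_quality <- c("Ideal", "Ideal", "Good", "Good", "Good", "Ideal")
color_grade <- c("E", "E", "E", "J", "J", "J")

# Cross-tabulate, then reshape to one row per combination
combo_counts <- as.data.frame(table(color = color_grade, cut = cut_quality))
names(combo_counts)[3] <- "n" # rename Freq to match the column name used above

# Count for one specific combination
n_E_ideal <- combo_counts$n[combo_counts$color == "E" & combo_counts$cut == "Ideal"]
```

One difference to keep in mind: table() keeps combinations with a count of zero, whereas count() with its default settings drops combinations that do not occur in the data.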

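If the categorical variables are unordered, the seriation package can reorder the rows and columns to reveal patterns more clearly. For larger plots, check out the d3heatmap or heatmaply packages, which create interactive plots.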

Two Continuous Variables

Use geom_point() to create scatterplots.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

As the number of data points grows, the points overplot and pile up into areas of uniform black. Use the alpha aesthetic to add transparency:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price),
             alpha= 1/100)

You can also bin in two dimensions (a 2D histogram) with geom_bin2d() and geom_hex():

ggplot(data = smaller) + 
  geom_bin2d(mapping = aes(x = carat, y = price))

# package hexbin required for stat_binhex
# you need to install hexbin package
library(hexbin)
ggplot(data = smaller) + 
  geom_hex(mapping = aes(x = carat, y = price))

Another option is to bin one continuous variable so that it acts like a categorical variable, using the cut_width() function:

ggplot(data = smaller, mapping =aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

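One weakness of boxplots is that by default they don't show how many observations each box summarizes. One fix is to make the width of each box proportional to the number of points with varwidth = TRUE. Another is to use cut_number() instead of cut_width(), so that each bin holds approximately the same number of points: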

ggplot(data = smaller, mapping =aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
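
The difference between equal-width and equal-count binning can be sketched in base R on a skewed toy vector. The values in `x_toy` are invented, and the quantile-based cut() below is a stand-in for the idea behind cut_number(), not ggplot2's implementation.

```r
# Invented, right-skewed values: most are small, a few are large
x_toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 10, 20, 40, 80)

# Equal-width bins (the idea behind cut_width): 4 bins, each 20 wide
width_bins <- cut(x_toy, breaks = seq(0, 80, by = 20))
width_counts <- as.vector(table(width_bins))

# Equal-count bins (the idea behind cut_number): edges at the quartiles
quartiles <- quantile(x_toy, probs = seq(0, 1, by = 0.25))
number_bins <- cut(x_toy, breaks = quartiles, include.lowest = TRUE)
number_counts <- as.vector(table(number_bins))
```

With equal-width bins, almost all of the points land in the first bin, so the boxes summarize very different amounts of data. With quantile-based bins, every box summarizes the same number of points, at the cost of bins with unequal widths.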

Patterns and Models

An example of a pattern, from the Old Faithful geyser data: longer waits between eruptions are associated with longer eruptions.

ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))

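Models help you remove the effect of one strong relationship so that you can examine the subtler relationships that remain. The residuals of a simple linear regression are the differences between the observed values of the dependent variable y and the fitted values ŷ. The code below fits a model that predicts price from carat and then computes the residuals, which give a view of price with the effect of carat removed.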

library(modelr)
mod <-  lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <-  diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))
ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid))
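
What exp(resid) means after a log-log fit can be checked on toy data with base R alone. This sketch uses invented prices and residuals() from base R in place of modelr's add_residuals(); the coefficients 4000 and 1.7 are arbitrary choices for the example.

```r
# Invented data: price follows a power law in carat, times a quality factor
carat_toy <- c(0.3, 0.5, 0.7, 1.0, 1.5, 2.0)
price_toy <- 4000 * carat_toy^1.7 * c(1.1, 0.9, 1.0, 1.2, 0.8, 1.0)

# Same model form as above, fitted on the toy data
mod_toy <- lm(log(price_toy) ~ log(carat_toy))

# Residuals on the log scale: observed log price minus fitted log price
resid_log <- residuals(mod_toy)

# Back on the original scale, exp(residual) is observed / predicted price
price_ratio  <- exp(resid_log)
fitted_price <- exp(fitted(mod_toy))
```

A ratio above 1 marks a diamond that costs more than its size alone predicts, and a ratio below 1 marks one that costs less.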


Once you've removed the strong relationship between carat and price, you can see what you would expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive. This can be seen in the graph below.

ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))

See also: Graphical Data Analysis with R by Antony Unwin.

