---
title: "R For Data Science Chapter 5"
output: html_notebook
---

<h1> Chapter 5 Exploratory Data Analysis </h1>
EDA is an iterative cycle in which you:

1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modeling your data.
3. Use what you learn to refine your questions and/or generate new questions.

Two types of questions guide the analysis:

1. What type of variation occurs within each variable?
2. What type of covariation occurs between variables?

Definitions:

- A variable is a quantity, quality, or property that you can measure.
- A value is the state of a variable when you measure it.
- An observation, or case, is a set of measurements made under similar conditions.
- Tabular data is a set of values, each associated with a variable and an observation.
- Variation is the tendency of the values of a variable to change from measurement to measurement.
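
As a minimal illustration of these terms, the sketch below builds a tiny table. The names plants, plant, and height_cm and all the numbers are made up for this example and do not come from any dataset used in these notes.

```{r}
library(tibble)

# Hypothetical measurements: each column is a variable, each row an observation
plants <- tibble(
  plant     = c("a", "b", "c"),
  height_cm = c(10.2, 11.5, 9.8)
)

ncol(plants)          # number of variables
nrow(plants)          # number of observations
plants$height_cm[2]   # one value: height_cm of the second observation
```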

<h2> Visualizing Distributions </h2>

```{r}
library(tidyverse)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```
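The bar chart above shows the distribution of a categorical variable (usually stored in R as a factor or a character vector). You can compute the bar heights by hand with dplyr::count():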
```{r}
diamonds %>%
  count(cut)
```

To examine the distribution of a continuous variable, use a histogram:

```{r}
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth =0.5)
```

You can compute the same bin counts by hand by combining dplyr::count() and ggplot2::cut_width():

```{r}
diamonds %>%
  count(cut_width(carat, 0.5))
```
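
To see what cut_width() does on its own, here is a small sketch with a made-up vector of carat values (the values are assumptions chosen for illustration). cut_width() returns a factor whose levels are intervals of the requested width, and tabulating that factor gives the counts per interval.

```{r}
library(ggplot2)

# Made-up carat values, for illustration only
carats <- c(0.2, 0.3, 0.4, 0.8, 1.1, 1.2)

bins <- cut_width(carats, width = 0.5)

levels(bins)   # the intervals that act as histogram bins
table(bins)    # how many values fall into each interval
```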

To zoom in on diamonds smaller than three carats, filter the data and choose a smaller binwidth:
```{r}
smaller <- diamonds %>%
  filter(carat < 3)

ggplot(data = smaller,mapping=aes(x=carat))+
  geom_histogram(binwidth = 0.1)
```

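To overlay several distributions in the same plot, use geom_freqpoly(), which performs the same binning as geom_histogram() but draws lines instead of bars: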

```{r}
ggplot(data = smaller,mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1)
```

<h2> Typical Values </h2>
In both bar charts and histograms, tall bars show the common values of a variable and shorter bars show the less common values. Places without bars reveal values that were not seen in your data.

```{r}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)
```


<h2> Unusual Values </h2>
Sometimes the only evidence of outliers is the unusually wide limits on the x-axis. In the histogram below, the diamonds variable y is plotted on the x-axis:

```{r}
ggplot(diamonds)+
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

```

The bins that hold the unusual values are so short that you can barely see them. To make them visible, zoom in on small values of the y-axis with coord_cartesian():

```{r}
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0,50))

```


```{r}
unusual <-  diamonds %>%
  filter(y < 3 | y > 20) %>%
  arrange(y)

# display values
unusual
```

<h2> Missing Values </h2>

You have two options for handling unusual values:

1. Drop the entire row that contains the unusual value:
```{r}

diamonds2 <- diamonds %>%
  filter(between(y,3,20))
diamonds2

```

2. Replace the unusual values with missing values. The easiest way is to use mutate() to replace the variable with a modified copy, using ifelse() to turn unusual values into NA:

```{r}
diamonds2 <-  diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

diamonds2
```
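
The effect of the ifelse() call is easy to check on a small made-up vector (the values below are hypothetical, not taken from diamonds): out-of-range entries become NA, everything else is kept, and the length of the vector does not change.

```{r}
# Hypothetical measurements; 0 and 58.9 fall outside the 3 to 20 range
y <- c(0, 3.5, 4.2, 58.9, 6.1)

y_clean <- ifelse(y < 3 | y > 20, NA, y)

y_clean                        # first and fourth entries are now NA
sum(is.na(y_clean))            # number of values that were replaced
length(y_clean) == length(y)   # no rows are lost
```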
ggplot2 leaves missing values out of the plot, but it warns you that it removed rows containing missing values:

```{r}
library(tidyverse)
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point()

```

To suppress the warning, set na.rm = TRUE:

```{r}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm=TRUE)
```
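Sometimes you want to understand what makes observations with missing values different from the rest. In nycflights13::flights, for example, a missing dep_time indicates that the flight was cancelled, so you can compare the scheduled departure times of cancelled and non-cancelled flights: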

```{r}
nycflights13::flights %>%
  mutate(cancelled = is.na(dep_time), 
         sched_hour = sched_dep_time %/% 100,
         sched_min = sched_dep_time %% 100,
         sched_dep_time = sched_hour + sched_min / 60) %>%
  
  ggplot(mapping = aes(sched_dep_time)) +
  geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)
```
You can't really see any pattern in the cancelled flights (TRUE, colored light blue) because there are far more non-cancelled flights than cancelled ones.
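
The mutate() call above relies on sched_dep_time being stored as a number in HHMM form, so integer division and the modulo operator split it into hours and minutes. A small sketch with a made-up time of 15:30:

```{r}
# Hypothetical scheduled departure time in HHMM form (15:30)
sched_dep_time <- 1530

sched_hour <- sched_dep_time %/% 100   # integer division gives the hour: 15
sched_min  <- sched_dep_time %% 100    # remainder gives the minutes: 30

# Decimal hours, convenient as a continuous x-axis
sched_hour + sched_min / 60            # 15.5
```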

<h2> Covariation </h2>
Covariation describes the behavior between variables. It is the tendency of the values of two or more variables to vary together in a related way.

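When you break a continuous variable down by a categorical variable with geom_freqpoly(), differences between the distributions can be hard to see if one group has far more observations than the others. To make the comparison easier, display density instead of count on the y-axis. Density is the count standardized so that the area under each frequency polygon is one.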

```{r}

# ..density.. still works but is deprecated as of ggplot2 3.4.0 in favor of after_stat(density)
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
```
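
To make the idea of density concrete, the sketch below uses base R's hist() on made-up prices. This is an illustration under assumed data, not a claim about how geom_freqpoly() computes things internally. Density is the count divided by the number of observations and by the bin width, so the bar areas sum to one regardless of group size.

```{r}
# Made-up prices, for illustration only
price <- c(300, 450, 700, 900, 1200, 1300, 1800, 2400)

h <- hist(price, breaks = seq(0, 2500, by = 500), plot = FALSE)

# density = count / (n * binwidth)
manual_density <- h$counts / (length(price) * 500)
all.equal(manual_density, h$density)

# Total area under the histogram
total_area <- sum(h$density * diff(h$breaks))
total_area
```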

Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot:

```{r}
ggplot(data = diamonds , mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```
Surprisingly, the better-quality cuts are cheaper on average!

cut is an ordered factor, but many categorical variables don't have an intrinsic order. You might want to reorder them to make a more informative display. One way to do that is with the reorder() function.

To illustrate the point, here is the unordered plot:

```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

```



And here is the reordered plot:

```{r}
ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes( x = reorder(class, hwy, FUN = median), y = hwy)
  )
```
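
What reorder() changes is the order of the factor levels, which in turn controls the order of the boxes. A minimal sketch with made-up data (car_class and hwy here are hypothetical vectors, not the mpg columns):

```{r}
# Made-up data: three classes with clearly different median hwy values
car_class <- c("a", "a", "b", "b", "c", "c")
hwy       <- c(30, 32, 10, 12, 20, 22)

default_levels   <- levels(factor(car_class))
reordered_levels <- levels(reorder(car_class, hwy, FUN = median))

default_levels     # alphabetical order
reordered_levels   # ordered by increasing median hwy
```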


You can also flip the boxplot 90 degrees with coord_flip(), which helps when the category labels are long:


```{r}
ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(
      x = reorder(class, hwy, FUN=median),
      y = hwy)
  ) +
  coord_flip()
  
```

<h2> Two Categorical Variables </h2>

To visualize the covariation between two categorical variables, you need to count the number of observations for each combination of levels. One way to do that is with geom_count(), which draws a circle for each combination and sizes it by the number of observations:

```{r}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x =cut, y = color))
```

Another approach is to compute the counts with dplyr:

```{r}
diamonds %>%
  count(color,cut)

```


Then visualize the counts with geom_tile() and the fill aesthetic, which produces a heat map:

```{r}
diamonds %>%
  count(color,cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))


```

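If the categorical variables are unordered, the seriation package can reorder the rows and columns to reveal patterns more clearly. For larger plots, try the d3heatmap or heatmaply packages, which create interactive heat maps.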


<h2> Two Continuous Variables </h2>
Use geom_point() to create scatterplots.

```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
```

As the dataset grows, the points overplot: they pile up and hide each other. Use the alpha aesthetic to add transparency:

```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price),
             alpha = 1/100)

```

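You can also bin in two dimensions with geom_bin2d() and geom_hex(). Both divide the coordinate plane into 2D bins and use a fill color to show how many points fall into each bin: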

```{r}
ggplot(data = smaller) + 
  geom_bin2d(mapping = aes(x = carat, y = price))

```


```{r}
# geom_hex() requires the hexbin package;
# install it first with install.packages("hexbin")
library(hexbin)
ggplot(data = smaller) + 
  geom_hex(mapping = aes(x = carat, y = price))
```

Another option is to bin one continuous variable so that it acts like a categorical variable, for example with cut_width():

```{r}

ggplot(data = smaller, mapping =aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```

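One weakness of boxplots is that, by default, they look roughly the same regardless of how many observations they summarize. One way to show the difference is to set varwidth = TRUE, which draws each box with a width proportional to the square root of the number of observations. Another is to use cut_number() instead of cut_width(), so that each bin holds approximately the same number of points: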

```{r}
ggplot(data = smaller, mapping =aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```
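
The difference between the two binning helpers shows up clearly on a small made-up vector (the values below are an assumption chosen to be right-skewed and free of ties): cut_width() fixes the width of each bin, while cut_number() fixes the number of observations per bin.

```{r}
library(ggplot2)

# Made-up, right-skewed values with no ties: many small, few large
x <- (1:100)^2 / 1000

by_width  <- table(cut_width(x, width = 2.5))   # equal-width bins, unequal counts
by_number <- table(cut_number(x, n = 4))        # unequal-width bins, equal counts

by_width
by_number
```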

<h2> Patterns and Models </h2>
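Patterns in your data provide clues about relationships. For example, a scatterplot of Old Faithful eruption lengths against the waiting time between eruptions shows a pattern: longer waits are associated with longer eruptions.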

```{r}
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))
```

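Models are a tool for extracting patterns from data. To examine subtler relationships, you can use a model to remove the effect of one strong relationship. The residuals of a simple linear regression are the differences between the observed values of the dependent variable y and the fitted values ŷ. The code below fits a model that predicts price from carat and then computes the residuals, which give a view of price once the effect of carat has been removed: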
```{r}
library(modelr)
# extract the price and carat relationship
mod <-  lm(log(price) ~ log(carat), data = diamonds)

# add the residuals, then exponentiate to undo the log transform
diamonds2 <-  diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))

ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid))
          
```
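
To see what these residuals represent, the sketch below recomputes them with base R's resid() and fitted() rather than modelr (a substitution made here for illustration). On the log scale the residual is observed minus fitted; after exp() it becomes the ratio of the observed price to the price the model predicts from carat alone.

```{r}
library(ggplot2)  # only needed for the diamonds dataset

mod <- lm(log(price) ~ log(carat), data = diamonds)

# Residual on the log scale: observed log(price) minus fitted log(price)
manual_resid <- log(diamonds$price) - fitted(mod)
same_resid <- all.equal(unname(manual_resid), unname(resid(mod)))
same_resid

# After exp(), the residual is observed price / predicted price
ratio <- diamonds$price / exp(fitted(mod))
same_ratio <- all.equal(unname(exp(resid(mod))), unname(ratio))
same_ratio
```

A value above 1 therefore means the diamond costs more than its size alone would suggest, and a value below 1 means it costs less.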

Once you've removed the strong relationship between carat and price, you can see what you would expect in the relationship between cut and price: relative to their size, better-quality diamonds are more expensive, as the boxplot below shows.

```{r}
ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
```

See also: Graphical Data Analysis with R by Antony Unwin.<br>
<img src="https://images.tandf.co.uk/common/jackets/amazon/978149871/9781498715232.jpg">

