library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

We’ve seen examples of data visualization problems caused by outlying values. When a display is forced to include an extreme value, the details in the main part of the data become difficult to see. For example, consider the 2010 county population data in the countyComplete dataset from the openintro package.

library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:datasets':
## 
##     cars
summary(countyComplete$pop2010)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      82   11100   25860   98230   66700 9819000

We can compute a few ratios to make the extreme right skewness in this data clear.

attach(countyComplete)
max(pop2010)/median(pop2010)
## [1] 379.7272
max(pop2010)/quantile(pop2010,.9)
##     90% 
## 49.7284

The largest value is about 380 times the median value and about 50 times the 90th percentile value. What happens when we look at the distribution of a variable like this?

ggplot(countyComplete) +
  geom_histogram(aes(x=pop2010))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(countyComplete) +
  geom_density(aes(x=pop2010))

Let’s repeat this with the base-10 logarithms of the raw data.

summary(log10(pop2010))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.914   4.045   4.413   4.459   4.824   6.992
ggplot(countyComplete) +
  geom_histogram(aes(x=log10(pop2010)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(countyComplete) +
  geom_density(aes(x=log10(pop2010)))

Looking at the data this way, the similarity to a normal distribution is striking. In fact, this pattern is common enough to have a name: data whose logarithms are approximately normally distributed are said to follow a lognormal distribution.
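
As an informal check on that impression, one option is to overlay a normal curve with the same mean and standard deviation as the log-transformed data on the density plot. This is just a visual sketch, not a formal test of normality.

# Overlay a normal curve matching the mean and sd of the log data
m = mean(log10(pop2010))
s = sd(log10(pop2010))
ggplot(countyComplete, aes(x=log10(pop2010))) +
  geom_density() +
  stat_function(fun = dnorm, args = list(mean = m, sd = s), linetype = "dashed")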

The data visualizations are better, but we need to understand the meaning of the transformed data. What is the base-10 logarithm of a number? It is the power to which 10 must be raised to yield that number. The following computations help make this concrete.

log10(.1)
## [1] -1
log10(0)
## [1] -Inf
log10(1)
## [1] 0
log10(10)
## [1] 1
log10(100)
## [1] 2
log10(1000)
## [1] 3
log10(10000)
## [1] 4
log10(100000)
## [1] 5
log10(1000000)
## [1] 6

You might say that log10 could be interpreted as answering “How many zeros?”
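
The same interpretation works for numbers that are not exact powers of 10; the log is then a fractional “number of zeros.” For instance, the median county population of 25,860 (from the summary above) sits a bit less than halfway between 10,000 and 100,000 on the log scale.

log10(25860)   # roughly 4.41, matching the median of the log data above
10^4.413       # inverting the transform recovers roughly the median population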

Looking at the log values, we can see that the log data is concentrated between 4 and 5. In terms of the raw data, this corresponds to values between 10,000 and 100,000.
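
One quick way to check that claim is to compute the share of counties with populations in that range:

# Proportion of counties with 2010 populations between 10,000 and 100,000
mean(pop2010 >= 1e4 & pop2010 <= 1e5)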

Now, what about the problematic quantitative variables x, y, z, table, and depth in the diamonds dataset? Let’s get that data back.

d = diamonds[diamonds$x > 0 &
             diamonds$y > 0 &
             diamonds$z > 0,]
d$ppc = d$price/d$carat
d$lx = log10(d$x)
d$ly = log10(d$y)
d$lz = log10(d$z)
d$ldepth = log10(d$depth)
d$ltable = log10(d$table)
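
Since the tidyverse is loaded, the same preparation could also be written with filter() and mutate(); this is just an equivalent sketch of the base R code above.

# Equivalent tidyverse-style version of the data preparation above
d = diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%
  mutate(ppc = price/carat,
         lx = log10(x),
         ly = log10(y),
         lz = log10(z),
         ldepth = log10(depth),
         ltable = log10(table))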

Compare the distribution of the raw variable z, and its relationship with ppc, to the analogous plots using the log-transformed variable lz.

ggplot(d,aes(x=z)) + 
  geom_histogram() + ggtitle("Raw x Values")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(d,aes(x=lz)) + 
  geom_histogram() + ggtitle("Log-transformed x Values")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(d,aes(x=z,y=ppc)) + 
  geom_smooth() + ggtitle("Raw x Values")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(d,aes(x=lz,y=ppc)) + 
  geom_smooth() + ggtitle("Log-transformed x Values")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

After looking at this, we see the same basic pattern as before. In doing data analysis, you will encounter many dead ends. These aren’t wastes of time; you just need to think differently about what you’re seeing. Thoughts?