This document was composed from Dr. Snopkowski’s ANTH 504 Week 4 lecture and from Introduction to Data Science: Data analysis and prediction algorithms with R by Rafael A. Irizarry.

Visualizing data distributions

Example: You want to describe height data to an extraterrestrial that has never seen humans.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dslabs)
data(heights)

?heights

Self-reported heights in inches for males and females. There are only two columns.

head(heights)
##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65

Describe this dataset.

What types of variables does it include?

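describe(), from the psych package, reports summary statistics for each of the two variables in the data set, including the number of observations, mean, sd, and median. The * next to sex marks a categorical variable that was converted to numeric codes, so its mean and sd are not meaningful.

library(psych)
describe(heights)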
##        vars    n  mean   sd median trimmed  mad min   max range  skew kurtosis
## sex*      1 1050  1.77 0.42    2.0    1.84 0.00   1  2.00  1.00 -1.30    -0.30
## height    2 1050 68.32 4.08   68.5   68.43 3.71  50 82.68 32.68 -0.44     1.58
##          se
## sex*   0.01
## height 0.13

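describeBy() takes the column to summarize (df$column) and a grouping variable (here, sex), and returns the describe() statistics separately for each group. Without a grouping variable, it gives a warning and returns the same results as describe().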

describeBy(heights$height, heights$sex)
## 
##  Descriptive statistics by group 
## group: Female
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 238 64.94 3.76  64.98   64.92 2.97  51  79    28 0.05     2.96 0.24
## ------------------------------------------------------------ 
## group: Male
##    vars   n  mean   sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 812 69.31 3.61     69   69.39 2.97  50 82.68 32.68 -0.54     2.89 0.13
describeBy(heights$height)
## Warning in describeBy(heights$height): no grouping variable requested
##    vars    n  mean   sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 1050 68.32 4.08   68.5   68.43 3.71  50 82.68 32.68 -0.44     1.58 0.13

Describing the data

How might you describe the “sex” column? By the number in each category and by the proportion in each category.

How do we do this in R?

table(heights$sex) #gives us the counts
## 
## Female   Male 
##    238    812

table() is very useful: it gives the count in each category. The dplyr function count() does the same thing but returns a data frame, which is easier to pipe into further steps.

heights %>% count(sex)
##      sex   n
## 1 Female 238
## 2   Male 812

Create a new column giving the proportion in each category.

heights %>% count(sex) %>% mutate(proportion = n/sum(n)) #gives us the proportion
##      sex   n proportion
## 1 Female 238  0.2266667
## 2   Male 812  0.7733333
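The same counts and proportions can also be obtained in base R. A minimal sketch, using a small made-up vector (not the heights data) so that it runs without any packages:

```r
# Made-up categorical data, for illustration only
sex <- c("Male", "Female", "Male", "Male", "Female", "Male", "Male", "Male")

counts <- table(sex)           # count in each category
props  <- prop.table(counts)   # each count divided by the total

counts
props
```

prop.table() divides every count by the sum of the counts, which is what mutate(proportion = n/sum(n)) does in the tidyverse version.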

What if we have more than 2 categories?

Scatter plots are great for 2 continuous variables. Bar charts summarize categorical data (stored as a factor) by drawing one bar per category.

We can use a bar chart:

data(murders)
murders %>% ggplot(aes(region)) + geom_bar() 

OR store a data frame, tab, that also holds the proportion of states in each region:

tab <- murders %>% 
  count(region) %>% 
  mutate(proportion = n/sum(n))
tab
##          region  n proportion
## 1     Northeast  9  0.1764706
## 2         South 17  0.3333333
## 3 North Central 12  0.2352941
## 4          West 13  0.2549020
tab %>% 
  ggplot(aes(region, proportion)) + 
  geom_bar(stat="identity")

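stat = "identity" tells geom_bar() to use the values in the data (in this case, proportion) as the bar heights, exactly as they are. By default, geom_bar() counts the rows in each category itself, so supplying a y aesthetic without stat = "identity" gives an error, as in the commented-out code below. geom_col() is a shortcut for geom_bar(stat = "identity").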

# tab %>% 
#  ggplot(aes(region, proportion)) + 
#  geom_bar()

With additional options: the function reorder() reorders the levels of a factor. Here we reorder the regions by their proportion, so the bars go from the smallest proportion to the largest.

tab %>% ggplot(aes(x=reorder(region, proportion), y=proportion)) + 
geom_bar(aes(fill=region), stat="identity")

Order from largest to smallest by negating the proportion:

tab %>% ggplot(aes(x=reorder(region, -proportion), y=proportion)) + 
geom_bar(aes(fill=region), stat="identity")
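To see what reorder() does without drawing a plot, we can look at the factor levels directly. A small sketch in base R, with the proportions rounded from tab and typed in by hand:

```r
region     <- factor(c("Northeast", "South", "North Central", "West"))
proportion <- c(0.18, 0.33, 0.24, 0.25)   # rounded values from tab

asc_levels  <- levels(reorder(region, proportion))    # smallest to largest
desc_levels <- levels(reorder(region, -proportion))   # largest to smallest

asc_levels
desc_levels
```

ggplot2 draws the bars in the order of the factor levels, which is why reordering the levels changes the order of the bars.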

Histogram

A histogram divides the data into non-overlapping bins of the same size. For each bin, we count the number of values that fall in that interval. In the code below, == is a logical comparison: it does not assign a value but asks whether sex is equal to "Male". pull(height) extracts the height column as a vector, so we plot only the heights of males.

qplot(heights %>% 
        filter(sex == "Male") %>% 
        pull(height))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

h <- (heights %>% 
        filter(sex == "Male") %>% 
        pull(height))
head(h)
## [1] 75 70 68 74 61 67

OR use ggplot

heights %>% 
  filter(sex=="Male") %>% 
  ggplot(aes(height)) + 
    geom_histogram(fill="gray", color="black") + 
    ggtitle("Male heights")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It defaults to 30 equal-width bins, so there are 30 bars. You can adjust this inside geom_histogram(), either with binwidth = x (the width of each bin) or with bins = # (the number of bins).

heights %>% 
  filter(sex=="Male") %>% 
  ggplot(aes(height)) + 
    geom_histogram(bins = 20, fill="gray", color="black") + 
    ggtitle("Male heights")
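The binning behind a histogram can be sketched by hand in base R. This example uses a few made-up heights and bin edges chosen for illustration (they are not the bins ggplot2 would pick):

```r
# Made-up heights in inches, for illustration only
h <- c(61, 64, 66, 67, 68, 68, 69, 70, 70, 71, 72, 74, 75)

# Three bins of equal width 5: [60,65], (65,70], (70,75]
edges <- seq(60, 75, by = 5)

# cut() assigns each value to a bin; table() counts the values per bin
bin_counts <- table(cut(h, breaks = edges, include.lowest = TRUE))
bin_counts

# hist() does the same counting; plot = FALSE returns the counts without drawing
hist_counts <- hist(h, breaks = edges, plot = FALSE)$counts
hist_counts
```

The bar heights of a histogram are exactly these counts, so changing the bin width or the number of bins changes the picture.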

Smoothed Density (on the y-axis)

Smooth density plots are more aesthetically appealing than histograms. They remove the sharp edges at the interval boundaries and generally remove “local” peaks. How valid is it to report smooth densities?

Smooth densities assume that our observed values are a subset of a much larger list of unobserved values.

We assume that if we had a much larger set of values and made the bins smaller and smaller, the height of consecutive bins would be similar and we wouldn’t have big jumps in heights of consecutive bins. To make the curve not depend on the size of the dataset (or hypothetical size), we compute the curve on frequencies (rather than counts).

Smooth Density of our height data

How “smooth” is smooth? We need to be careful with our degree of smoothness because it can change our interpretation of the data.

Instead of a histogram, use geom_density() and adjust = #:

heights %>% filter(sex == "Male") %>% ggplot(aes(height)) + 
  geom_density(fill='blue', adjust = 2)

heights %>% filter(sex == "Male") %>% ggplot(aes(height)) + 
  geom_density(fill='blue', adjust = 1)

heights %>% filter(sex == "Male") %>% ggplot(aes(height)) + 
  geom_density(fill='blue')

Use the “adjust” argument to control the smoothness: adjust > 1 makes the curve smoother and adjust < 1 makes it less smooth. The default is adjust = 1, so the last two plots above are identical.
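One way to see what adjust does is to look at the bandwidth of the kernel density estimate in base R. This sketch assumes simulated values (not the heights data) and uses density(), which takes the same kind of adjust argument:

```r
set.seed(1)
sim <- rnorm(200, mean = 69, sd = 3.6)   # simulated values, for illustration only

d_half <- density(sim, adjust = 0.5)   # half the default bandwidth: wigglier
d_one  <- density(sim)                 # default, adjust = 1
d_two  <- density(sim, adjust = 2)     # twice the default bandwidth: smoother

c(d_half$bw, d_one$bw, d_two$bw)
```

adjust multiplies the default bandwidth, and a wider bandwidth averages over more neighboring points, which gives a smoother curve.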

Interpreting the y-axis

A smooth density plot makes it harder to interpret the y-axis.

It is scaled so that the area under the curve adds to 1. So if you have a bin of width 1, the y-axis value tells you approximately the proportion of values in that bin.

For intervals of other widths, you can calculate the proportion of the total area contained in that interval.

Smooth densities make it easier to compare two distributions:

heights %>% ggplot(aes(height, fill=sex)) + 
geom_density(alpha = 0.4)

alpha = 0.4 makes the fill somewhat transparent (so you can see both curves)

fill = sex fills each curve with a color for that sex. You can use color = sex instead, which draws only the outline of each curve.

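What does the graph tell you? Density is a proportion per unit of height, not a count. Height is treated as continuous rather than discrete, so the proportion of values in any interval is the area under the curve over that interval. The female curve has a higher peak and is narrower, and it is centered at a lower height than the male curve.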
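The "area under the curve" idea can be checked numerically. A rough sketch in base R, assuming simulated values (not the heights data) and a simple sum of rectangle areas as the approximation:

```r
set.seed(1)
sim <- rnorm(500, mean = 69, sd = 3.6)   # simulated values, for illustration only

d  <- density(sim)     # smooth density evaluated on an equally spaced grid
dx <- diff(d$x)[1]     # spacing between grid points

total_area <- sum(d$y) * dx                  # should be close to 1

in_range   <- d$x >= 65 & d$x <= 70
area_65_70 <- sum(d$y[in_range]) * dx        # area under the curve from 65 to 70

observed_65_70 <- mean(sim >= 65 & sim <= 70)   # observed proportion in that range

c(total_area = total_area, area = area_65_70, observed = observed_65_70)
```

The area between 65 and 70 should be close to, but not exactly equal to, the observed proportion, because smoothing spreads each observation out a little.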

Check yourself: Interpreting graphs

Study the following boxplots showing population sizes by country:

Which continent has the country with the biggest population size? Asia

What continent has the largest median population size? Africa

What is the median population size for Africa to the nearest million? 10

What proportion of countries in Europe have a population below 14 million?
a. 0.99
b. 0.75
c. 0.50
d. 0.25

The normal distribution

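Why are the mean and standard deviation used so frequently? Don’t we need more than 2 numbers to represent a distribution?

The normal distribution, also called the bell curve or the Gaussian distribution, occurs in many situations.

The normal distribution is defined by the density $f(x) = \frac{1}{s\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-m}{s}\right)^2}$, where m is the mean and s is the standard deviation. These 2 numbers describe it completely.

What are the properties of a normal distribution?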
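One well-known property is that a fixed proportion of values falls within a given number of standard deviations of the mean, whatever the mean and standard deviation are. A short sketch using pnorm(), the normal cumulative distribution function in base R (within_sd is a helper name made up for this example):

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

within_sd(1)   # about 0.68
within_sd(2)   # about 0.95
within_sd(3)   # about 0.997
```

The curve is also symmetric around the mean, so the mean and the median are the same.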

How is standard deviation calculated?
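Written as formulas, with n the number of values, m the mean, and s the standard deviation (the same names used in the R code in this section), the sample version with n - 1 in the denominator is:

```latex
m = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - m)^2}
```

This is the version that sd() computes in R.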

Standard deviation tells us about the spread of the data. We can calculate it in R in 2 ways:

set.seed(123)
x <- sample(1:100, size = 20)
m <- sum(x) / length(x)
s <- sqrt(sum((x-m)^2) / (length(x)-1))

OR

s <- sd(x)

Let’s calculate the mean and standard deviation for the male heights. How would we get male heights into a vector?

x <- heights %>% 
  filter(sex=="Male") %>% 
    pull(height)
head(x) 
## [1] 75 70 68 74 61 67
length(x)
## [1] 812
table(heights$sex)
## 
## Female   Male 
##    238    812

This is the tidyverse way. Checking length(x) against the count from table() confirms that the filter pulled the values we expected. OR, using indices:

index <- heights$sex == "Male"
head(index)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
y <- heights$height[index]
head(y)
## [1] 75 70 68 74 61 67

Indexing with [] pulls out only the heights for which index is TRUE.
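A logical index is also handy for counting, because R treats TRUE as 1 and FALSE as 0. A small sketch with made-up values (not the heights data):

```r
sex    <- c("Male", "Male", "Female", "Male", "Female")
height <- c(75, 70, 65, 68, 62)   # made-up values, for illustration only

index <- sex == "Male"   # TRUE where sex is "Male"

n_male    <- sum(index)    # number of TRUEs = number of males
prop_male <- mean(index)   # proportion of TRUEs = proportion of males
male_height <- height[index]

n_male
prop_male
male_height
```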

# calculate the mean and standard deviation of male height
m <- mean(x)
s <- sd(x)
# create a density plot of male heights and overlay a normal curve
heights %>% filter(sex == "Male") %>%
  ggplot(aes(height)) +
  geom_density(fill="pink") +
  stat_function(fun=dnorm, args=list(m, s))

The male heights are approximately normally distributed.

How close is our height data for men to a normal curve with the same mean and standard deviation?

maleHeights <- heights$height[heights$sex=="Male"]
mean(maleHeights)
## [1] 69.31475
sd(maleHeights)
## [1] 3.611024
heights %>% filter(sex=="Male") %>% ggplot(aes(height)) + 
geom_density(fill="pink") + stat_function(fun = dnorm, 
args=list(mean=69.3, sd=3.6))
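Besides comparing the curves by eye, we can compare the proportion of values observed in an interval with the proportion a normal curve predicts. A sketch using simulated values as a stand-in for the male heights (the mean and sd are taken from the output above; the interval 68 to 72 is an arbitrary choice):

```r
set.seed(2)
sim <- rnorm(5000, mean = 69.3, sd = 3.6)   # simulated stand-in, not the real data

sim_m <- mean(sim)
sim_s <- sd(sim)

# Proportion observed in the interval vs. proportion predicted by the normal curve
observed  <- mean(sim > 68 & sim <= 72)
predicted <- pnorm(72, sim_m, sim_s) - pnorm(68, sim_m, sim_s)

c(observed = observed, predicted = predicted)
```

With the real data, replacing sim with the vector of male heights gives the same kind of check. Large differences between the two numbers would suggest the normal approximation is poor in that interval.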

Boxplots

Boxplots provide a 5-number summary composed of the range (minimum and maximum) and the quartiles (25th, 50th, and 75th percentiles). Boxplots typically ignore outliers when computing the range and instead plot them as independent points.

The interquartile range (IQR) is the distance between the 25th and 75th percentiles.

boxplot(heights$height)
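The numbers behind a boxplot can be computed directly. This sketch uses made-up values and assumes the common 1.5 * IQR rule for flagging outliers. Plotting functions may compute the quartiles slightly differently, so treat this as an illustration rather than an exact reproduction of the plot:

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up values with one extreme point

q   <- quantile(v, c(0.25, 0.5, 0.75))   # quartiles
iqr <- IQR(v)                            # 75th percentile minus 25th percentile

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr
outliers <- v[v < lower | v > upper]

q
iqr
outliers
```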

Stratification

Comparing across different groups. We often divide observations into groups based on the values of one or more variables associated with those observations.

For instance, it makes sense to group heights based on the sex variable.

heights %>% ggplot(aes(sex,height)) + geom_boxplot()

Switching the x and y axes:

heights %>% ggplot(aes(height, sex)) + geom_boxplot()
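Stratifying is not only for plots: the same grouping can be used to compute a summary within each group. A minimal base R sketch with made-up values (not the heights data):

```r
sex    <- c("Male", "Male", "Female", "Male", "Female", "Female")
height <- c(75, 70, 65, 68, 62, 66)   # made-up values, for illustration only

# Apply median() to height separately within each level of sex
group_medians <- tapply(height, sex, median)
group_medians
```

These group medians correspond to the middle lines of the boxes in a stratified boxplot.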

Quick Plots

We can make a quick histogram with qplot (note that qplot() was deprecated in ggplot2 3.4.0, so ggplot() is preferred in new code):

qplot(x)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(x, binwidth = 1)

qplot(x, bins = 15, color =I("black"), xlab = "Height (inches)")

To make a quick boxplot

qplot(sex, height, data = heights, geom="boxplot")

heights %>% qplot(sex, height, data=., geom = "boxplot")

To make a quick density plot

qplot(x, geom = "density")