This document was composed from Dr. Snopkowski’s ANTH 504 Week 4 lecture and from Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry.
Example: You want to describe height data to an extraterrestrial that has never seen humans.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dslabs)
data(heights)
?heights
Self-reported heights in inches for males and females. There are only two columns.
head(heights)
## sex height
## 1 Male 75
## 2 Male 70
## 3 Male 68
## 4 Male 74
## 5 Male 61
## 6 Female 65
What types of variables does it include?
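One way to answer this question is to inspect the structure of the data frame. The following is a minimal sketch, assuming the dslabs package is installed; it only uses base R functions on the heights data.

```r
library(dslabs)
data(heights)

# str() lists each column together with its type
str(heights)

# sex is categorical (stored as a factor); height is continuous (numeric)
class(heights$sex)
levels(heights$sex)
class(heights$height)
```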
describe(), from the psych package, reports summary statistics for each of the two variables in the data set, including the number of observations, mean, sd, median, and other measures of variation.
library(psych)
describe(heights)
## vars n mean sd median trimmed mad min max range skew kurtosis
## sex* 1 1050 1.77 0.42 2.0 1.84 0.00 1 2.00 1.00 -1.30 -0.30
## height 2 1050 68.32 4.08 68.5 68.43 3.71 50 82.68 32.68 -0.44 1.58
## se
## sex* 0.01
## height 0.13
With describeBy() I need to pass the column as df$column along with a grouping variable, here sex. Without a grouping variable, it will just give the describe() results.
describeBy(heights$height, heights$sex)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 238 64.94 3.76 64.98 64.92 2.97 51 79 28 0.05 2.96 0.24
## ------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 812 69.31 3.61 69 69.39 2.97 50 82.68 32.68 -0.54 2.89 0.13
describeBy(heights$height)
## Warning in describeBy(heights$height): no grouping variable requested
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1050 68.32 4.08 68.5 68.43 3.71 50 82.68 32.68 -0.44 1.58 0.13
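A similar grouped summary can be sketched in the tidyverse with group_by() and summarize(). This is only an illustration, assuming dplyr and dslabs are installed; the name by_sex is chosen here and is not from the lecture, and only a few of the statistics that describeBy() prints are computed.

```r
library(dplyr)
library(dslabs)
data(heights)

# Group by sex, then compute a few summary statistics per group
by_sex <- heights %>%
  group_by(sex) %>%
  summarize(n = n(),
            mean = mean(height),
            sd = sd(height),
            median = median(height))
by_sex
```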
How might you describe the “sex” column? By the number in each category and the proportion in each category.
How do we do this in R?
table(heights$sex) #gives us the counts
##
## Female Male
## 238 812
table() is very useful: it gives the count in each category. The function count() will do the same thing.
heights %>% count(sex)
## sex n
## 1 Female 238
## 2 Male 812
Create a new column to tell the proportion of each one.
heights %>% count(sex) %>% mutate(proportion = n/sum(n)) #gives us the proportion
## sex n proportion
## 1 Female 238 0.2266667
## 2 Male 812 0.7733333
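For comparison, base R can produce the same proportions with prop.table(). A minimal sketch, assuming dslabs is installed; the variable names counts and props are chosen here for illustration.

```r
library(dslabs)
data(heights)

counts <- table(heights$sex)   # counts per category
props  <- prop.table(counts)   # each count divided by the total
props
```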
Scatter plots are great for 2 continuous variables. Bar charts summarize categorical data, such as a factor, by group.
Can use a bar chart
data(murders)
murders %>% ggplot(aes(region)) + geom_bar()
OR store a data frame, tab, that adds the proportion of states in each region.
tab <- murders %>%
count(region) %>%
mutate(proportion = n/sum(n))
tab
## region n proportion
## 1 Northeast 9 0.1764706
## 2 South 17 0.3333333
## 3 North Central 12 0.2352941
## 4 West 13 0.2549020
tab %>%
ggplot(aes(region, proportion)) +
geom_bar(stat="identity")
stat="identity" tells R to use the values in the data (in this case proportion) as the bar heights, plotting them as they are. If you don’t specify it, geom_bar() tries to count the rows itself and gives an error because a y value was also supplied.
# tab %>%
# ggplot(aes(region, proportion)) +
# geom_bar()
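ggplot2 also provides geom_col(), which plots the y values as given and so behaves like geom_bar(stat = "identity"). The sketch below rebuilds tab by hand from the counts printed above so that it runs on its own; the object name p is only for illustration.

```r
library(ggplot2)

# Rebuild `tab` from the counts printed above so this block is self-contained
tab <- data.frame(
  region = factor(c("Northeast", "South", "North Central", "West"),
                  levels = c("Northeast", "South", "North Central", "West")),
  n = c(9, 17, 12, 13)
)
tab$proportion <- tab$n / sum(tab$n)

# geom_col() uses the proportion column directly as the bar height
p <- ggplot(tab, aes(region, proportion)) +
  geom_col()
p
```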
With additional options: The function reorder() will reorder the levels of region. We reorder them by proportion, which makes the bars go from the smallest proportion to the largest.
tab %>% ggplot(aes(x=reorder(region, proportion), y=proportion)) +
geom_bar(aes(fill=region), stat="identity")
Order from most to least by negating the proportion.
tab %>% ggplot(aes(x=reorder(region, -proportion), y=proportion)) +
geom_bar(aes(fill=region), stat="identity")
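To see what reorder() does without drawing a plot, you can look at the factor levels directly. This sketch uses base R only, with the region counts copied from the tab output above.

```r
# Region counts copied from the `tab` output above
region <- factor(c("Northeast", "South", "North Central", "West"),
                 levels = c("Northeast", "South", "North Central", "West"))
proportion <- c(9, 17, 12, 13) / 51

# Ascending: levels sorted from smallest to largest proportion
asc <- levels(reorder(region, proportion))
asc
# "Northeast" "North Central" "West" "South"

# Descending: negating the values flips the order
dsc <- levels(reorder(region, -proportion))
dsc
# "South" "West" "North Central" "Northeast"
```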
A histogram divides the data into non-overlapping bins of the same size. For each bin, we count the number of values that fall in that interval.
The == makes it a logical comparison: we are not assigning a value but asking a question. pull(height) extracts the height values as a vector, and because of the filter we only get the values for males.
qplot(heights %>%
filter(sex == "Male") %>%
pull(height))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
h <- (heights %>%
filter(sex == "Male") %>%
pull(height))
head(h)
## [1] 75 70 68 74 61 67
OR use ggplot
heights %>%
filter(sex=="Male") %>%
ggplot(aes(height)) +
geom_histogram(fill="gray", color="black") +
ggtitle("Male heights")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It defaults to 30 equal-width bins, so there are 30 bars. You can adjust this inside geom_histogram() with binwidth = x, or set the number of bins with bins = #.
heights %>%
filter(sex=="Male") %>%
ggplot(aes(height)) +
geom_histogram(bins = 20, fill="gray", color="black") +
ggtitle("Male heights")
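The counting behind a histogram can be sketched by hand with cut() and table(). This is an illustration in base R, assuming dslabs is installed; the 1-inch bin width and the names male_heights, breaks, bins and bin_counts are choices made here, and ggplot2 may place its bin edges differently.

```r
library(dslabs)
data(heights)

male_heights <- heights$height[heights$sex == "Male"]

# Cut the range into non-overlapping 1-inch intervals
breaks <- seq(floor(min(male_heights)), ceiling(max(male_heights)), by = 1)
bins <- cut(male_heights, breaks = breaks, include.lowest = TRUE)

# Count the number of values that fall in each interval
bin_counts <- table(bins)
head(bin_counts)
```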
Smooth density plots are more aesthetically appealing than histograms. They remove the sharp edges at the interval boundaries, and “local” peaks are generally removed. How valid is it to report smooth densities?
Smooth densities assume that our observed values are a subset of a much larger list of unobserved values.
We assume that if we had a much larger set of values and made the bins smaller and smaller, the height of consecutive bins would be similar and we wouldn’t have big jumps in heights of consecutive bins. To make the curve not depend on the size of the dataset (or hypothetical size), we compute the curve on frequencies (rather than counts).
How “smooth” is smooth? We need to be careful with our degree of smoothness because it can change our interpretation of the data.
Instead of a histogram, use geom_density() and adjust = #
heights %>% filter(sex == "Male") %>% ggplot(aes(height)) +
geom_density(fill='blue', adjust = 2)
heights %>% filter(sex == "Male") %>% ggplot(aes(height)) +
geom_density(fill='blue', adjust = 1)
heights %>% filter(sex == "Male") %>% ggplot(aes(height)) +
geom_density(fill='blue')
The “adjust” argument controls the smoothness: adjust > 1 makes the curve smoother, and adjust < 1 makes it less smooth.
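The effect of adjust can be seen in base R's density(), which takes an adjust argument that multiplies the bandwidth. The sketch below uses simulated values (not the real heights) and assumes geom_density() relies on the same kind of kernel density estimate.

```r
set.seed(1)
# Simulated values standing in for a height-like variable (not the real data)
x <- rnorm(500, mean = 69, sd = 3.6)

d_default <- density(x)                # adjust = 1
d_smooth  <- density(x, adjust = 2)    # bandwidth doubled: smoother curve
d_wiggly  <- density(x, adjust = 0.5)  # bandwidth halved: more local bumps

# The bandwidth actually used scales with adjust
c(default = d_default$bw, smooth = d_smooth$bw, wiggly = d_wiggly$bw)
```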
A smooth density plot makes it harder to interpret the y-axis.
It is scaled so that the area under the curve adds to 1, so if you have a bin of width 1, the y-axis value tells you the proportion of values in that bin.
For intervals other than 1, you can calculate the proportion of the total area contained in that interval.
Smooth densities make it easier to compare two distributions:
heights %>% ggplot(aes(height, fill=sex)) +
geom_density(alpha = 0.4)
alpha = 0.4 makes the fill somewhat transparent, so you can see both curves.
fill = sex colors the area under each curve by sex. You can use color = sex instead, and it will give you just the outline.
What does the graph tell you? Density is a proportional amount. It is continuous, not discrete, and the total area under each curve adds to 1. The female curve has a higher peak and is narrower.
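The area interpretation can be checked numerically. The sketch below uses simulated values (not the real heights), base R's density(), and a small trapezoid-rule helper written here for illustration; the interval from 68 to 72 is an arbitrary choice.

```r
set.seed(1)
# Simulated values standing in for heights (not the real data)
x <- rnorm(1000, mean = 69, sd = 3.6)

d <- density(x)

# Trapezoid rule over the grid of points returned by density()
trapezoid <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# The total area under the curve is approximately 1
total_area <- trapezoid(d$x, d$y)

# The area over an interval approximates the proportion of values inside it
inside <- d$x >= 68 & d$x <= 72
area_68_72 <- trapezoid(d$x[inside], d$y[inside])
prop_68_72 <- mean(x >= 68 & x <= 72)

c(total_area = total_area, area = area_68_72, proportion = prop_68_72)
```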
Study the following boxplots showing population sizes by country:
Which continent has the country with the biggest population size? Asia
Which continent has the largest median population size? Africa
What is the median population size for Africa to the nearest million? 10
What proportion of countries in Europe have a population below 14 million?
a. 0.99
b. 0.75
c. 0.50
d. 0.25
Why are the mean and standard deviation used so frequently? Don’t we need more than 2 numbers to represent a distribution?
The normal distribution, also called the bell curve or the Gaussian distribution, occurs in many situations.
The normal distribution with mean m and standard deviation s is defined by the density:
$$f(x) = \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{(x-m)^2}{2s^2}\right)$$
What are the properties of a normal distribution?
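For reference, the normal density with mean m and standard deviation s can be written out by hand and compared with R's built-in dnorm(). This is a minimal sketch; normal_density is a hypothetical helper name used only here. It also illustrates two well-known properties: roughly 68% of values lie within 1 SD of the mean and roughly 95% within 2 SDs.

```r
# Normal density with mean m and standard deviation s, written out by hand
normal_density <- function(x, m, s) {
  1 / (s * sqrt(2 * pi)) * exp(-(x - m)^2 / (2 * s^2))
}

x <- seq(-3, 3, by = 0.5)

# Should agree with dnorm() up to floating-point error
all.equal(normal_density(x, m = 0, s = 1), dnorm(x, mean = 0, sd = 1))

# Proportion of a normal distribution within 1 and 2 SDs of the mean
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)
c(within_1sd = within_1sd, within_2sd = within_2sd)
```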
Standard deviation tells us about the spread of the data. We can calculate it in R in 2 ways:
set.seed(123)
x <- sample(1:100, size = 20)
m <- sum(x) / length(x)
s <- sqrt(sum((x-m)^2) / (length(x)-1))
OR
s <- sd(x)
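The two approaches can be checked against each other. The sketch below repeats the calculation and confirms that the manual formula, which divides by n - 1, matches sd(); the names s_manual, s_builtin and s_pop are chosen here for illustration.

```r
set.seed(123)
x <- sample(1:100, size = 20)

m <- mean(x)
s_manual  <- sqrt(sum((x - m)^2) / (length(x) - 1))  # divides by n - 1
s_builtin <- sd(x)

# The two values should agree up to floating-point error
all.equal(s_manual, s_builtin)

# Dividing by n instead of n - 1 gives a slightly smaller value
s_pop <- sqrt(mean((x - m)^2))
s_pop < s_builtin
```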
Let’s calculate the mean and standard deviation for the male heights. How would we get male heights into a vector?
x <- heights %>%
filter(sex=="Male") %>%
pull(height)
head(x)
## [1] 75 70 68 74 61 67
length(x)
## [1] 812
table(heights$sex)
##
## Female Male
## 238 812
This is the tidyverse way. Checking the length against the table confirms that it pulled the values you asked for. OR, using indices
index <- heights$sex == "Male"
head(index)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE
y <- heights$height[index]
head(y)
## [1] 75 70 68 74 61 67
Indexing with [] pulls out the elements where index is TRUE.
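Logical indexing is easier to follow on a tiny example. The values below are made up purely for illustration.

```r
# Made-up values for illustration only
sex    <- c("Male", "Female", "Male", "Female", "Male")
height <- c(70, 64, 72, 66, 68)

index <- sex == "Male"   # logical vector: TRUE where the condition holds
index                    # TRUE FALSE TRUE FALSE TRUE
sum(index)               # TRUE counts as 1, so this is the number of males: 3
height[index]            # keeps only the elements where index is TRUE: 70 72 68
mean(height[index])      # 70
```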
# calculate the mean and standard deviation of male height
m <- mean(x)
s <- sd(x)
# create a density plot of male heights and overlay a normal curve
heights %>% filter(sex == "Male") %>%
ggplot(aes(height)) +
geom_density(fill="pink") +
stat_function(fun=dnorm, args=list(mean=m, sd=s))
The male heights are approximately normal.
maleHeights <- heights$height[heights$sex=="Male"]
mean(maleHeights)
## [1] 69.31475
sd(maleHeights)
## [1] 3.611024
heights %>% filter(sex=="Male") %>% ggplot(aes(height)) +
geom_density(fill="pink") + stat_function(fun = dnorm,
args=list(mean=69.3, sd=3.6))
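One rough way to check how well the normal curve fits is to compare an observed proportion with the proportion a normal distribution would predict. A minimal sketch, assuming dslabs is installed; the interval from 69 to 72 inches is an arbitrary choice for illustration.

```r
library(dslabs)
data(heights)

x <- heights$height[heights$sex == "Male"]
m <- mean(x)
s <- sd(x)

# Observed proportion of male heights between 69 and 72 inches
observed <- mean(x > 69 & x <= 72)

# Proportion predicted by a normal distribution with the same mean and SD
predicted <- pnorm(72, mean = m, sd = s) - pnorm(69, mean = m, sd = s)

c(observed = observed, predicted = predicted)
```

If the two numbers are close, the mean and standard deviation alone summarize the data reasonably well over that interval.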
Boxplots provide a 5-number summary composed of the range and quartiles. Boxplots typically ignore outliers when computing the range and instead plot them as independent points.
The interquartile range (IQR) is the distance between the 25th and 75th percentiles.
boxplot(heights$height)
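The numbers behind a boxplot can be computed directly. This sketch assumes dslabs is installed and uses the common convention that points more than 1.5 times the IQR beyond the quartiles are drawn as outliers; boxplot() uses a closely related definition of the quartiles (hinges), so its values may differ slightly.

```r
library(dslabs)
data(heights)

h <- heights$height

quantile(h)     # min, 25th, 50th (median), 75th percentiles, max
iqr <- IQR(h)   # 75th percentile minus 25th percentile

# Limits beyond which points are flagged as outliers
q <- quantile(h, c(0.25, 0.75))
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Number of flagged points
n_outliers <- sum(h < lower | h > upper)
n_outliers
```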
Comparing across different groups: we often divide observations into groups based on the values of one or more variables associated with those observations.
For instance, it makes sense to group heights based on a sex variable.
heights %>% ggplot(aes(sex,height)) + geom_boxplot()
Switching the x and y axes:
heights %>% ggplot(aes(height, sex)) + geom_boxplot()
We can make a quick histogram with qplot
qplot(x)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(x, binwidth = 1)
qplot(x, bins = 15, color = I("black"), xlab = "Male height (inches)")
To make a quick boxplot
qplot(sex, height, data = heights, geom="boxplot")
heights %>% qplot(sex, height, data=., geom = "boxplot")
To make a quick density plot
qplot(x, geom = "density")