Categorical data are variables defined by a small number of groups. Ordinal categorical data have an inherent order to the categories (mild/medium/hot, for example), while non-ordinal categorical data have no order to the categories. Numerical data take a variety of numeric values: continuous variables can take any value, while discrete variables are limited to sets of specific values.
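A quick sketch of how these types are typically represented in R (toy vectors, not from a real dataset):
# ordinal categorical: an ordered factor
spice <- factor(c("mild", "hot", "medium"),
levels = c("mild", "medium", "hot"), ordered = TRUE)
spice[1] < spice[2]
## [1] TRUE
# non-ordinal categorical: an unordered factor
region <- factor(c("West", "South", "West"))
# discrete numerical: limited to specific values (here, counts)
n_children <- c(0L, 2L, 1L)
# continuous numerical: can take any value
height_inches <- c(68.3, 71.25, 64.9)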
Use the unique and length functions to determine how many unique heights were reported:
x <- heights$height
length(unique(x))
## [1] 139
There are 139 different measurements of height in the heights dataset.
Use the table function to compute the frequencies of each unique height value. Because we are using the resulting frequency table in a later exercise, save the results into an object called tab.
x <- heights$height
tab <- table(x)
To see why treating the reported heights as ordinal values is not useful in practice, we note how many values are reported only once:
sum(tab==1)
## [1] 63
A frequency table is the simplest way to show a categorical distribution. Use prop.table() to convert a table of counts into a table of proportions:
table(heights$sex)
##
## Female Male
## 238 812
prop.table(table(heights$sex))
##
## Female Male
## 0.2266667 0.7733333
For continuous numerical data, reporting the frequency of each unique entry is not an effective summary as many or most values are unique.
The cumulative distribution function (CDF) reports the proportion of data below a value a for all values of a: F(a) = Pr(x <= a). The proportion of observations between any two values a and b can be computed from the CDF as F(b) - F(a).
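A minimal sketch of that identity, assuming a numeric vector my_data and two cutoffs a < b:
# proportion of observations with a < x <= b
mean(my_data > a & my_data <= b)
# the same quantity via the empirical CDF, F(b) - F(a)
mean(my_data <= b) - mean(my_data <= a)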
For datasets that are not normal, the CDF can be calculated empirically:
# define range of values spanning the dataset
a <- seq(min(my_data), max(my_data), length = 100)
# computes the proportion of values <= x
cdf_function <- function(x) {
mean(my_data <= x)
}
cdf_values <- sapply(a, cdf_function)
plot(a, cdf_values)
# define the interval from min to max, with 100 values
a <- seq(min(heights$height), max(heights$height), length = 100)
#create the function that returns probability for height < x
cdf_function <- function (x) {
mean(heights$height <= x)
}
# apply function to vector a (range of 100 heights from min to max)
cdf_values <- sapply(a, cdf_function)
#plot
plot(a, cdf_values, type="l")
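Base R's ecdf() builds the same empirical CDF as a reusable function, so the sapply() step is optional:
cdf <- ecdf(heights$height)
cdf(69.5)   # proportion of heights <= 69.5
plot(cdf)   # step-function version of the plot above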
Calculating the mean and standard deviation of heights:
average <- mean(heights$height)
SD <- sd(heights$height)
average
## [1] 68.32301
SD
## [1] 4.078617
Standard units describe how many standard deviations a value is away from the mean. The z-score of an observation x is its number of standard deviations from the mean μ: z = (x − μ) / σ
z <- function(x) {
(x-average)/SD
}
#standard units for the smallest, tallest and average:
z(min(x))
## [1] -4.492457
z(max(x))
## [1] 3.519368
z(average)
## [1] 0
Z-scores are useful to quickly evaluate whether an observation is average or extreme. Z-scores near 0 are average. Z-scores above 2 or below -2 are far above or below the mean, and z-scores above 3 or below -3 are extremely rare.
The scale() function converts a vector of approximately normally distributed values into z-scores.
x <- heights$height
z <- scale(x)
You can compute the proportion of observations that are within 2 standard deviations of the mean like this:
mean(abs(z) < 2)
## [1] 0.9466667
The normal distribution is associated with the 68-95-99.7 rule. This rule describes the probability of observing values within 1, 2, and 3 standard deviations of the mean.
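We can check the rule directly with pnorm() on the standard normal:
pnorm(1) - pnorm(-1)   # within 1 SD
## [1] 0.6826895
pnorm(2) - pnorm(-2)   # within 2 SDs
## [1] 0.9544997
pnorm(3) - pnorm(-3)   # within 3 SDs
## [1] 0.9973002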
We can estimate the proportion of heights above 70.5 inches with:
1 - pnorm(70.5, mean(x), sd(x))
## [1] 0.2967551
Plot the distribution of exact heights in the data:
plot(prop.table(table(x)), xlab = "a = Height in inches", ylab = "Pr(x = a)")
Load the height data set and create a vector x with just the male heights:
x <- heights$height[heights$sex == "Male"]
male_avg <- mean(x)
male_sd <- sd(x)
What proportion of the data is between 69 and 72 inches (taller than 69 but shorter or equal to 72)? A proportion is between 0 and 1.
mean(x>69 & x<=72)
## [1] 0.3337438
Suppose you only have the average and standard deviation below, but no access to x. Can you approximate the proportion of the data that is between 69 and 72 inches?
pnorm(72,male_avg,male_sd)-pnorm(69,male_avg,male_sd)
## [1] 0.3061779
Notice that the approximation calculated in the second question is very close to the exact calculation in the first. The normal distribution was a useful approximation in this case. However, the approximation is not always useful; it breaks down at the more extreme values, often called the “tails” of the distribution. Let’s look at an example: the proportion of heights between 79 and 81.
# Not accurate at the extremes, e.g. estimating the proportion of heights between 79 and 81
x <- heights$height[heights$sex == "Male"]
exact <- mean(x > 79 & x <= 81)
avg <- mean(x)
sd <- sd(x)
approx <- pnorm(81,avg,sd)-pnorm(79,avg,sd)
exact/approx
## [1] 1.614261
Quantiles are cutoff points that divide a dataset into intervals with set probabilities. The qth quantile is the value below which a proportion q of the observations fall.
Given a dataset data and desired quantile q, you can find the qth quantile of data with:
quantile(data,q)
Percentiles are the quantiles that divide a dataset into 100 intervals, each with 1% probability. You can determine all percentiles of a dataset data like this:
p <- seq(0.01, 0.99, 0.01)
quantile(data, p)
Quartiles divide a dataset into 4 parts, each with 25% of the observations. They are equal to the 25th, 50th and 75th percentiles. The 25th percentile is also known as the 1st quartile, the 50th percentile as the median, and the 75th percentile as the 3rd quartile.
The summary() function returns the minimum, quartiles and maximum of a vector.
summary(heights$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 66.00 68.50 68.32 71.00 82.68
Find the percentiles of heights$height:
p <- seq(0.01, 0.99, 0.01)
percentiles <- quantile(heights$height, p)
head(percentiles)
## 1% 2% 3% 4% 5% 6%
## 59 60 60 61 62 62
Confirm that the 25th and 75th percentiles match the 1st and 3rd quartiles. Note that quantile() returns a named vector. You can access the 25th and 75th percentiles like this (adapt the code for other percentile values):
percentiles[names(percentiles) == "25%"]
## 25%
## 66
percentiles[names(percentiles) == "75%"]
## 75%
## 71
qnorm() function: gives the theoretical value of the quantile with probability p of observing a value equal to or less than it, for a normal distribution with mean mu and standard deviation sigma:
qnorm(p, mu, sigma)
By default, mu=0 and sigma=1. Therefore, calling qnorm() with no arguments gives quantiles for the standard normal distribution.
qnorm(p)
Recall that quantiles are defined such that p is the probability of a random observation less than or equal to the quantile.
qnorm(0.025)
## [1] -1.959964
pnorm() function: gives the probability that a value from a standard normal distribution will be less than or equal to a z-score value z. Consider:
pnorm(-1.96)
## [1] 0.0249979
The result of pnorm() is the probability (~0.025) associated with the quantile -1.96. Note that qnorm() and pnorm() are inverse functions:
pnorm(qnorm(0.025))
## [1] 0.025
You can use qnorm() to determine the theoretical quantiles of a dataset: that is, the theoretical value of quantiles assuming that a dataset follows a normal distribution. Run the qnorm() function with the desired probabilities p, mean mu and standard deviation sigma.
Suppose male heights follow a normal distribution with a mean of 69 inches and standard deviation of 3 inches. The theoretical quantiles are:
p <- seq(0.01, 0.99, 0.01)
theoretical_quantiles <- qnorm(p, 69, 3)
head(theoretical_quantiles)
## [1] 62.02096 62.83875 63.35762 63.74794 64.06544 64.33568
QQ-plots are used to check whether distributions are well-approximated by a normal distribution.
Given a proportion p, the quantile q is the value such that the proportion of values in the data below q is p.
index <- heights$sex=="Male"
x <- heights$height[index]
z <- scale(x)
Proportion of data below 69.5
mean(x <= 69.5)
## [1] 0.5147783
Calculate observed and theoretical quantiles
p <- seq(0.05, 0.95, 0.05)
observed_quantiles <- quantile(x, p)
observed_quantiles
## 5% 10% 15% 20% 25% 30% 35% 40%
## 63.90079 65.00000 66.00000 67.00000 67.00000 68.00000 68.00000 68.62236
## 45% 50% 55% 60% 65% 70% 75% 80%
## 69.00000 69.00000 70.00000 70.00000 70.86614 71.00000 72.00000 72.00000
## 85% 90% 95%
## 72.44000 73.22751 75.00000
theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))
theoretical_quantiles
## [1] 63.37515 64.68704 65.57217 66.27564 66.87916 67.42113 67.92335 68.39991
## [9] 68.86099 69.31475 69.76852 70.22960 70.70616 71.20838 71.75035 72.35387
## [17] 73.05734 73.94247 75.25436
QQ-plot
plot(theoretical_quantiles, observed_quantiles)
abline(0,1)
In a boxplot, the box is defined by the 25th and 75th percentiles and the median is a horizontal line through the box. The whiskers show the range excluding outliers, and outliers are plotted separately as individual points. The interquartile range is the distance between the 25th and 75th percentiles.
Male <- heights$sex=="Male"
Female <- heights$sex=="Female"
M <- heights$height[Male]
F <- heights$height[Female] # note: F masks the built-in shorthand for FALSE
boxplot(M,F)
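The same comparison in ggplot2, mapping sex to the x-axis so the groups are labeled automatically (a sketch equivalent to the base R plot above):
heights %>%
ggplot(aes(sex, height)) +
geom_boxplot()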
# vectors with heights for each gender
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
# the 10th, 30th, 50th, 70th, and 90th percentiles for the heights of each sex
p <- seq(0.1,0.9,0.2)
male_percentiles <- quantile(male,p)
female_percentiles <- quantile(female,p)
# data frame with percentiles
df <- data.frame(male=male_percentiles,female=female_percentiles)
df
## male female
## 10% 65.00000 61.00000
## 30% 68.00000 63.00000
## 50% 69.00000 64.98031
## 70% 71.00000 66.46417
## 90% 73.22751 69.00000
In ggplot2 we create graphs by adding layers. Layers can define geometries, compute summary statistics, define what scales to use, or even change styles. To add layers, we use the symbol +. In general, a line of code will look like this:
DATA %>% ggplot() + LAYER 1 + LAYER 2 + … + LAYER N
Plots in ggplot2 consist of 3 main components: the data, the geometry (the type of plot), and the aesthetic mapping that connects variables to visual features.
There are additional components: scales, labels, a title, a legend, and the theme or style.
You can associate a dataset x with a ggplot object with any of the 3 commands:
ggplot(data = x)
ggplot(x)
x %>% ggplot()
Aesthetic mappings describe how properties of the data connect with features of the graph:
murders %>% ggplot(aes(x = population/10^6, y = total)) + geom_point()
We can store the plot in an object called graph:
graph <- murders %>% ggplot() + geom_point(aes(x=population/10^6,y=total))
Add text labels with label = within the aes argument:
graph <- graph + geom_text(aes(population/10^6, total, label = abb))
graph
Use size to change the point size and nudge_x to move labels horizontally:
p <- murders %>% ggplot()
p + geom_point(aes(population/10^6, total), size = 3) + geom_text(aes(population/10^6, total, label = abb),nudge_x=1.5)
Global aes: if we define a mapping in ggplot(), all the geometries that are added as layers will default to this mapping. We redefine p:
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
and then we can simply write the following code to produce the previous plot:
p + geom_point(size = 3) + geom_text(nudge_x = 1.5)
murders %>%
ggplot(aes(population/10^6, total, label = abb)) +
scale_x_log10() +
scale_y_log10() +
xlab("Population in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
We can redefine p to contain everything except the point geometry:
p <- murders %>%
ggplot(aes(population/10^6, total, label = abb)) +
geom_text(nudge_x = 0.075) +
scale_x_log10() +
scale_y_log10() +
xlab("Population in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
then we can tinker with geom_point separately:
p + geom_point(size=3, col="blue")
To color points by region, we add a mapping inside geom_point specifying that the categorical variable associated with colour (col) is region:
p + geom_point(aes(col = region), size = 3)
We want to add a line that represents the average murder rate for the entire country. Once we determine the per-million rate to be r, this line is defined by the formula y = rx, with y and x our axes: total murders and population in millions, respectively. In the log scale this line turns into:
log(y) = log(r) + log(x)
So in our plot it’s a line with slope 1 and intercept log(r). To compute this value, we use our dplyr skills:
r <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^6) %>%
pull(rate)
To add a line we use the geom_abline function. ggplot2 uses ab in the name to remind us we are supplying the intercept (a) and slope (b). The default line has slope 1 and intercept 0 so we only have to define the intercept:
p + geom_point(aes(col=region), size = 3) +
geom_abline(intercept = log10(r))
To change the line style and capitalise “Region” in the legend, add geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") and scale_color_discrete(name = "Region"):
p + geom_point(aes(col=region),size=3) + geom_abline(intercept = log10(r),lty=2,color="darkgrey") + scale_color_discrete(name = "Region")
Putting it all together from scratch:
library(ggrepel) # for geom_text_repel()
library(ggthemes) # for theme_economist()
#define intercept
r <- murders %>% summarise(rate=sum(total)/sum(population)*10^6) %>% pull(rate)
#make the plot combining all elements
murders %>% ggplot(aes(population/10^6,total,label=abb)) +
geom_point(aes(col=region),size=3) +
scale_x_log10() +
scale_y_log10() +
xlab ("Population in millions (log scale)") +
ylab ("Total number of murders (log scale)") +
ggtitle("US Murders in 2010") +
geom_abline(intercept=log10(r),lty=2,color="darkgrey") +
scale_color_discrete(name="Region") +
geom_text_repel() + # avoids overlap of labels
theme_economist() # changes the plot look and feel
p <- heights %>% filter(sex=="Male") %>% ggplot(aes(x=height))
p + geom_histogram(binwidth = 1,fill="blue",color="black") +
xlab("Male height in inches") +
ggtitle("Heights Histogram")
p <- heights %>% filter(sex=="Male") %>% ggplot(aes(x=height))
p + geom_density(fill="blue") +
xlab("Male height in inches") +
ggtitle("Heights Density")
heights %>% filter(sex=="Male") %>%
ggplot(aes(sample = height)) +
geom_qq()
By default, the sample variable is compared to a normal distribution with average 0 and standard deviation 1. To change this, we use the dparams argument based on the help file. Adding an identity line is as simple as adding another layer. For straight lines, we use the geom_abline function. The default line is the identity line (slope = 1, intercept = 0).
params <- heights %>% filter(sex=="Male") %>%
summarize(mean = mean(height), sd = sd(height))
params
## mean sd
## 1 69.31475 3.611024
heights %>% filter(sex=="Male") %>%
ggplot(aes(sample = height)) +
geom_qq(dparams = params) +
geom_abline()
p <- heights %>% filter(sex == "Male") %>% ggplot(aes(x = height))
p1 <- p + geom_histogram(binwidth = 1, fill = "blue", col = "black")
p2 <- p + geom_histogram(binwidth = 2, fill = "blue", col = "black")
p3 <- p + geom_histogram(binwidth = 3, fill = "blue", col = "black")
# arrange plots next to each other in 1 row, 3 columns (grid.arrange is from gridExtra)
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3)
dplyr
s <- heights %>%
filter(sex == "Female") %>%
summarize(average = mean(height), standard_deviation = sd(height))
s
## average standard_deviation
## 1 64.93942 3.760656
This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights. We get to choose the names of the columns of the resulting table. For example, above we decided to use average and standard_deviation, but we could have used other names just the same.
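For example, the same summary with different column names:
heights %>%
filter(sex == "Female") %>%
summarize(avg = mean(height), stdev = sd(height))
## avg stdev
## 1 64.93942 3.760656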
Because the resulting table stored in s is a data frame, we can access the components with the accessor $:
s$average
## [1] 64.93942
s$standard_deviation
## [1] 3.760656
# calculate US murder rate, generating a data frame
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
## rate
## 1 3.034555
# extract the numeric US murder rate with the dot operator
us_murder_rate %>% .$rate
## [1] 3.034555
# calculate and extract the murder rate with one pipe
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000) %>%
.$rate
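dplyr's pull() (already used in the murder-rate plot above) does the same job as the dot operator and reads more naturally in a pipe:
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000) %>%
pull(rate)
us_murder_rate
## [1] 3.034555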
There are several options to do the same.
Compute separate average and standard deviation for male/female heights:
heights %>%
group_by(sex) %>%
summarize(average = mean(height), standard_deviation = sd(height))
## # A tibble: 2 x 3
## sex average standard_deviation
## <fct> <dbl> <dbl>
## 1 Female 64.9 3.76
## 2 Male 69.3 3.61
Compute median murder rate in 4 regions of US
murders <- murders %>%
mutate(rate = total/population * 100000)
murders %>%
group_by(region) %>%
summarize(median_rate = median(rate))
## # A tibble: 4 x 2
## region median_rate
## <fct> <dbl>
## 1 Northeast 1.80
## 2 South 3.40
## 3 North Central 1.97
## 4 West 1.29
murders <- murders %>% mutate(rate=total/population*100000)
# arrange by murder rate, smallest to largest
murders %>% arrange(rate) %>% head()
## state abb region population total rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Hawaii HI West 1360301 7 0.5145920
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Iowa IA North Central 3046355 21 0.6893484
## 6 Idaho ID West 1567582 12 0.7655102
# arrange by murder rate in descending order
murders %>% arrange(desc(rate)) %>% head()
## state abb region population total rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Missouri MO North Central 5988927 321 5.359892
## 4 Maryland MD South 5773552 293 5.074866
## 5 South Carolina SC South 4625364 207 4.475323
## 6 Delaware DE South 897934 38 4.231937
# arrange by region alphabetically, then by murder rate within each region
murders %>% arrange(region, rate) %>% head()
## state abb region population total rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Maine ME Northeast 1328361 11 0.8280881
## 4 Rhode Island RI Northeast 1052567 16 1.5200933
## 5 Massachusetts MA Northeast 6547629 118 1.8021791
## 6 New York NY Northeast 19378102 517 2.6679599
# show the top 10 states with highest murder rate, ordered by rate
murders %>% arrange(desc(rate)) %>% top_n(10)
## Selecting by rate
## state abb region population total rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Missouri MO North Central 5988927 321 5.359892
## 4 Maryland MD South 5773552 293 5.074866
## 5 South Carolina SC South 4625364 207 4.475323
## 6 Delaware DE South 897934 38 4.231937
## 7 Michigan MI North Central 9883640 413 4.178622
## 8 Mississippi MS South 2967297 120 4.044085
## 9 Georgia GA South 9920000 376 3.790323
## 10 Arizona AZ West 6392017 232 3.629527
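top_n() has since been superseded; in dplyr 1.0+, slice_max() is the recommended equivalent and returns the rows already sorted by the ranking variable, so the explicit arrange() is optional:
murders %>% slice_max(rate, n = 10)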
library(NHANES)
head(NHANES)
## # A tibble: 6 x 76
## ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
## <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
## 1 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 2 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 3 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
## 5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some Col…
## 6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
## # … with 67 more variables: MaritalStatus <fct>, HHIncome <fct>,
## # HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
## # Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>,
## # BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
## # BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>,
## # BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
## # DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
## # UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
## # HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
## # LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
## # Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
## # PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
## # TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
## # Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
## # RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
## # SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
## # SexOrientation <fct>, PregnantNow <fct>
NHANES %>%
filter(Gender=="male" & AgeDecade==" 40-49") %>%
group_by(Race1) %>%
summarise(average=mean(BPSysAve,na.rm=TRUE),standard_deviation=sd(BPSysAve,na.rm=TRUE)) %>%
arrange(average)
## # A tibble: 5 x 3
## Race1 average standard_deviation
## <fct> <dbl> <dbl>
## 1 White 120. 13.4
## 2 Other 120. 16.2
## 3 Hispanic 122. 11.1
## 4 Mexican 122. 13.9
## 5 Black 126. 17.1
gapminder %>% as_tibble()
## # A tibble: 10,545 x 9
## country year infant_mortality life_expectancy fertility population gdp
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 1960 115. 62.9 6.19 1636054 NA
## 2 Algeria 1960 148. 47.5 7.65 11124892 1.38e10
## 3 Angola 1960 208 36.0 7.32 5270844 NA
## 4 Antigu… 1960 NA 63.0 4.43 54681 NA
## 5 Argent… 1960 59.9 65.4 3.11 20619075 1.08e11
## 6 Armenia 1960 NA 66.9 4.55 1867396 NA
## 7 Aruba 1960 NA 65.7 4.82 54208 NA
## 8 Austra… 1960 20.3 70.9 3.45 10292328 9.67e10
## 9 Austria 1960 37.3 68.8 2.7 7065525 5.24e10
## 10 Azerba… 1960 NA 61.3 5.57 3897889 NA
## # … with 10,535 more rows, and 2 more variables: continent <fct>, region <fct>
Comparing infant_mortality between Sri Lanka and Turkey
gapminder %>%
filter(year==2015 & country %in% c("Sri Lanka","Turkey")) %>%
select(country,infant_mortality)
## country infant_mortality
## 1 Sri Lanka 8.4
## 2 Turkey 11.6
gapminder %>%
filter(year==1962) %>%
ggplot(aes(fertility,life_expectancy,color=continent)) +
geom_point()
gapminder %>%
filter(year %in% c(1962,2002)) %>%
ggplot(aes(fertility,life_expectancy,color=continent)) +
geom_point() +
facet_grid(continent~year)
years <- c(1962,1980,1990,2000,2012)
continents <- c("Europe","Asia")
gapminder %>%
filter(year %in% years & continent %in% continents) %>%
ggplot(aes(fertility,life_expectancy,color=continent)) +
geom_point() +
facet_wrap(~year)
countries <- c("South Korea", "Germany")
labels <- data.frame(country=countries, x=c(1978,1965), y=c(60,68))
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year,life_expectancy,color=country)) +
geom_line() +
geom_text(data=labels, aes(x,y,label=country),size=5)
Use scale_x_continuous(trans = "log2") to log-transform the x-axis, as in D2 below:
gapminder <- gapminder %>%
mutate(dollars_per_day=gdp/population/365)
d1 <- gapminder %>%
filter(year == 1970 & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
ggtitle("D1 no scale transformation")
d2 <- gapminder %>%
filter(year == 1970 & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
ggtitle("D2 log2 scale")
grid.arrange(d1,d2,nrow=1)
gapminder %>%
filter(year==1970) %>%
ggplot(aes(region,dollars_per_day)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90,hjust = 1))
## Warning: Removed 72 rows containing non-finite values (stat_boxplot).
fac <- factor(c("Asia","Asia","West","West","West"))
levels(fac)
## [1] "Asia" "West"
Factor levels are ordered alphabetically by default.
value <- c(10,11,12,6,4)
If we want to order Asia and West based on the mean of values:
fac <- reorder(fac, value, FUN=mean)
levels(fac)
## [1] "West" "Asia"
We can reorder the previous boxplot. !is.na(gdp) is used to filter out any NAs, keeping only countries with actual values for gdp:
gapminder %>%
filter(year == 1970 & !is.na(gdp)) %>%
mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%
ggplot(aes(region, dollars_per_day, fill = continent)) +
scale_y_continuous(trans = "log2") +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("")
We define a vector with the regions we consider part of the “West”:
west <- c("Northern Europe","Southern Europe","Western Europe", "Northern America", "Australia and New Zealand")
We can use ifelse to group regions into West or Developing
gapminder %>%
filter(year==1975 & !is.na(gdp)) %>%
mutate(group=ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1,color="black") +
scale_x_continuous(trans = "log2") +
facet_wrap(.~group)
To compare by group AND year (e.g. 1975 vs 2010). Note we use %in%, not ==, to match multiple years; == recycles the comparison vector, keeping only alternating rows and emitting a warning:
gapminder %>%
filter(year %in% c(1975,2010) & !is.na(gdp)) %>%
mutate(group=ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1,color="black") +
scale_x_continuous(trans = "log2") +
facet_grid(year~group)
BUT, we know there are more countries with data available in 2010 than in 1975. We will redo it using only those countries that have data for both years.
# countries with non-missing gdp in 1975
country_list1 <- gapminder %>%
filter(year==1975 & !is.na(dollars_per_day)) %>%
.$country
# countries with non-missing gdp in 2010
country_list2 <- gapminder %>%
filter(year==2010 & !is.na(dollars_per_day)) %>%
.$country
# vector of countries to use in the plot
country_list <- intersect(country_list1,country_list2)
gapminder %>%
# filter to include only countries in country_list
filter(year %in% c(1975,2010) & country %in% country_list) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1,color="black") +
scale_x_continuous(trans = "log2") +
facet_grid(year~group)
gapminder %>%
filter(year %in% c(1975,2010) & country %in% country_list) %>%
mutate(region=reorder(region,dollars_per_day,FUN=median)) %>%
ggplot(aes(region,dollars_per_day,fill=continent)) +
geom_boxplot() +
scale_y_continuous(trans = "log2") +
theme(axis.text.x = element_text(angle=90,hjust=1)) +
facet_grid(year~.)
This plot is hard to compare across years, but we can use fill=factor(year):
gapminder %>%
filter(year %in% c(1975,2010) & country %in% country_list) %>%
mutate(region=reorder(region,dollars_per_day,FUN=median)) %>%
ggplot(aes(region,dollars_per_day,fill=factor(year))) +
geom_boxplot() +
scale_y_continuous(trans = "log2") +
theme(axis.text.x = element_text(angle=90,hjust=1))
When we overlay two densities, the default is to have the area represented by each distribution add up to 1, regardless of the size of each group:
gapminder %>%
filter(year %in% c(1975,2010) & country %in% country_list) %>%
mutate(group=ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day,fill=group)) +
geom_density(alpha=0.2) +
scale_x_continuous(trans = "log2") +
facet_grid(year~.)
This makes it appear as if there are the same number of countries in each group. To change this, we access the computed variables of geom_density by adding y=..count.. to the mapping:
gapminder %>%
filter(year %in% c(1975,2010) & country %in% country_list) %>%
mutate(group=ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day,y=..count..,fill=group)) +
geom_density(alpha=0.2,bw=0.65) +
scale_x_continuous(trans = "log2", limit= c(0.125,300)) +
facet_grid(year~.)
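Note that in ggplot2 3.4 and later the ..count.. notation is deprecated; after_stat(count) is the current way to access the computed variable:
gapminder %>%
filter(year %in% c(1975,2010) & country %in% country_list) %>%
mutate(group=ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day, y = after_stat(count), fill = group)) +
geom_density(alpha=0.2, bw=0.65) +
scale_x_continuous(trans = "log2", limit = c(0.125,300)) +
facet_grid(year~.)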
gapminder <- gapminder %>%
mutate(dollars_per_day=gdp/population/365)
# use case_when() to group regions at a higher level
gapminder <- gapminder %>%
mutate(group = case_when(
region %in% c("Western Europe", "Northern Europe","Southern Europe",
"Northern America",
"Australia and New Zealand") ~ "West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America",
"South America") ~ "Latin America",
continent == "Africa" &
region != "Northern Africa" ~ "Sub-Saharan",
TRUE ~ "Others"))
library(ggridges) # for geom_density_ridges()
gapminder %>%
filter(year %in% c(1975,2010) & !is.na(dollars_per_day)) %>%
ggplot(aes(dollars_per_day,group,fill=group)) +
geom_density_ridges(adjust=1.5,alpha=0.5) +
scale_x_continuous(trans = "log2") +
facet_grid(.~year)
## Warning: Ignoring unknown parameters: adjust
## Picking joint bandwidth of 0.684
## Picking joint bandwidth of 0.726
gapminder %>%
filter(year %in% c(1975,2010) & country %in% country_list) %>%
group_by(year) %>%
mutate(weight = population/sum(population)*2) %>%
ungroup() %>%
ggplot(aes(dollars_per_day, fill = group)) +
scale_x_continuous(trans = "log2", limit = c(0.125, 300)) +
geom_density(alpha = 0.2, bw = 0.75, position = "stack") +
facet_grid(year ~ .)
gapminder %>%
filter(year %in% c(1975, 2010) & country %in% country_list) %>%
group_by(year) %>%
mutate(weight=population/sum(population)*2) %>%
ungroup() %>%
ggplot(aes(dollars_per_day,fill=group,weight=weight)) +
geom_density(alpha = 0.2, bw = 0.75, position = "stack") +
scale_x_continuous(trans = "log2",limit=c(0.125,300)) +
facet_grid(year~.)
gapminder <- gapminder %>%
mutate(group = case_when(
.$region %in% west ~ "The West",
.$region %in% "Northern Africa" ~ "Northern Africa",
.$region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
.$region == "Southern Asia" ~ "Southern Asia",
.$region %in% c("Central America", "South America", "Caribbean") ~ "Latin America",
.$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
.$region %in% c("Melanesia", "Micronesia", "Polynesia") ~ "Pacific Islands"))
# define a data frame with group average income and average infant survival rate
surv_income <- gapminder %>%
filter(year == 2010 & !is.na(gdp) & !is.na(infant_mortality) & !is.na(group)) %>%
group_by(group) %>%
summarise(income=sum(gdp)/sum(population)/365,
infant_survival_rate = 1 - sum(infant_mortality/1000*population)/sum(population))
# sort by income
surv_income %>% arrange(income)
## # A tibble: 7 x 3
## group income infant_survival_rate
## <chr> <dbl> <dbl>
## 1 Sub-Saharan Africa 1.76 0.936
## 2 Southern Asia 2.07 0.952
## 3 Pacific Islands 2.70 0.956
## 4 Northern Africa 4.94 0.970
## 5 Latin America 13.2 0.983
## 6 East Asia 13.4 0.985
## 7 The West 77.1 0.995
# plot
surv_income %>% ggplot(aes(income, infant_survival_rate, label = group, color = group)) +
scale_x_continuous(trans = "log2", limit = c(0.25, 150)) +
scale_y_continuous(trans = "logit", limit = c(0.875, .9981),
breaks = c(.85, .90, .95, .99, .995, .998)) +
geom_label(size = 3, show.legend = FALSE)
Based on the plot above, do we conclude that a country with a low income is destined to have low survival rate? Do we conclude that survival rates in Sub-Saharan Africa are all lower than in Southern Asia, which in turn are lower than in the Pacific Islands, and so on?
Jumping to this conclusion based on a plot showing averages is referred to as the ecological fallacy. The almost perfect relationship between survival rates and income is only observed for the averages at the region level. Once we show all the data, we see a somewhat more complicated story:
GROUP <- surv_income %>% ggplot(aes(income, infant_survival_rate, label = group, color = group)) +
scale_x_continuous(trans = "log2", limit = c(0.25, 150)) +
scale_y_continuous(trans = "logit", limit = c(0.875, .9981),
breaks = c(.85, .90, .95, .99, .995, .998)) +
geom_label(size = 3, show.legend = FALSE)
COUNTRY <- gapminder %>%
filter(year == 2010 & !is.na(gdp) & !is.na(infant_mortality) & !is.na(group)) %>%
ggplot(aes(dollars_per_day,1-infant_mortality/1000,color=group)) +
geom_point(size=3,alpha=0.5,show.legend = FALSE) +
scale_x_continuous(trans = "log2", limit=c(.25,150)) +
scale_y_continuous(trans = "logit",limit=c(0.875,.9981), breaks = c(.85, .90, .95, .99, .995, .998))
grid.arrange(GROUP,COUNTRY,nrow=1)
We can conclude that the correlation holds when using averages at the region level, but not when examining the data at the country level.
Don’t use pie charts.
Don’t use 3D charts that add no value.
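When tempted by a pie chart, a bar plot of the same proportions is easier to read accurately, since lengths are judged better than angles. A minimal sketch with a hypothetical share data frame:
# hypothetical data, for illustration only
shares <- data.frame(category = c("A", "B", "C", "D"),
percent = c(45, 25, 20, 10))
shares %>%
ggplot(aes(category, percent)) +
geom_col()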
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
mutate(state = reorder(state, rate))
dat %>% filter(state == "California" & !is.na(rate)) %>%
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000") +
geom_vline(xintercept=1963, col = "blue")
library(RColorBrewer) # for brewer.pal()
dat %>% ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept=1963, col = "blue") +
theme_minimal() +
theme(panel.grid = element_blank(),
legend.position="bottom",
text = element_text(size = 8)) +
ggtitle(the_disease) +
ylab("") + xlab("")
To show the average for the US
# us rate for Measles (the_disease)
avg <- us_contagious_diseases %>%
filter(disease==the_disease) %>%
group_by(year) %>%
summarise(us_rate=sum(count,na.rm=TRUE)/sum(population,na.rm=TRUE)*10000)
# first geom_line adds rate by state
# second geom_line adds us_rate from avg, calculated before
dat %>%
filter(!is.na(rate)) %>%
ggplot() +
geom_line(aes(year,rate,group=state),color="grey50",size=1,alpha=0.1,show.legend = FALSE) +
scale_y_continuous(trans="sqrt",breaks=c(0,5, 25,125,300)) +
geom_line(mapping = aes(year, us_rate), data = avg, size = 1) +
ggtitle("Cases per 10,000 by state") +
xlab("") + ylab("") +
geom_text(data=data.frame(x=1955,y=60),mapping=aes(x,y,label="US average rate")) +
geom_vline(xintercept = 1963,color="blue")
library(titanic)
titanic <- titanic_train %>%
select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
mutate(Survived = factor(Survived),
Pclass = factor(Pclass),
Sex = factor(Sex))
# density
titanic %>%
ggplot(aes(x=Age,fill=Sex)) +
geom_density(alpha=0.6)
# density facet
titanic %>%
ggplot(aes(x=Age,fill=Sex)) +
geom_density(alpha=0.6) +
facet_grid(Sex~.)
# count
titanic %>%
ggplot(aes(x=Age,y=..count..,fill=Sex)) +
geom_density(alpha=0.6)
# density facet
titanic %>%
ggplot(aes(x=Age,y=..count..,fill=Sex)) +
geom_density(alpha=0.6) +
facet_grid(Sex~.)
# proportion of males aged 18-35 is higher than that of females
titanic %>%
ggplot(aes(x=Age,fill=Sex)) +
geom_density(alpha=0.6) +
geom_vline(xintercept = 18) +
geom_vline(xintercept = 35)
# proportion of males under 17 is higher than that of females
titanic %>%
ggplot(aes(x=Age,fill=Sex)) +
geom_density(alpha=0.6) +
geom_vline(xintercept = 17)
# QUESTION 3 QQ plot age distribution
params <- titanic %>%
filter(!is.na(Age)) %>%
summarize(mean = mean(Age), sd = sd(Age))
titanic %>%
ggplot(aes(sample = Age)) +
geom_qq(dparams = params) +
geom_abline()
# QUESTION 4 survival by sex
titanic %>%
ggplot(aes(Survived,y=..count..,fill=Sex)) +
geom_bar(position = position_dodge())
# QUESTION 5 survival by age
titanic %>%
ggplot(aes(x=Age,y=..count..,fill=Survived)) +
geom_density(alpha=0.4)
# QUESTION 6 survival by fare
# Filter the data to remove individuals who paid a fare of 0.
# Make a boxplot of fare grouped by survival status.
# Try a log2 transformation of fares.
# Add the data points with jitter and alpha blending.
names(titanic)
## [1] "Survived" "Pclass" "Sex" "Age" "SibSp" "Parch" "Fare"
titanic %>%
filter(Fare != 0) %>%
ggplot(aes(Survived,Fare,color=Survived)) +
geom_boxplot() +
scale_y_continuous(trans = "log2") +
geom_jitter(alpha=0.2,width = 0.1)
# QUESTION 7 survival by passenger class
titanic %>%
ggplot(aes(Pclass,y=..count..,fill=Survived)) +
geom_bar()
titanic %>%
ggplot(aes(Pclass,y=..count..,fill=Survived)) +
geom_bar(position = position_fill())
titanic %>%
ggplot(aes(Survived,y=..count..,fill=Pclass)) +
geom_bar(position = position_fill())
# QUESTION 8 survival by age, sex and passenger class
# Create a grid of density plots for age, filled by survival status,
# with count on the y-axis, faceted by sex and passenger class.
titanic %>%
filter(!is.na(Age)) %>%
ggplot(aes(Age,y=..count..,fill=Survived)) +
geom_density(alpha=0.5) +
facet_grid(Sex~Pclass)
titanic %>%
filter(!is.na(Age)) %>%
ggplot(aes(Age,y=..count..,fill=Pclass)) +
geom_density(alpha=0.5) +
facet_grid(.~Pclass)
titanic %>%
filter(!is.na(Age)) %>%
ggplot(aes(Age,y=..count..,fill=Sex)) +
geom_density(alpha=0.5) +
facet_grid(Sex~Pclass)
data(stars)
options(digits = 3) # report 3 significant digits
mean(stars$magnitude)
## [1] 4.26
sd(stars$magnitude)
## [1] 7.35
# distribution of magnitude
stars %>%
ggplot(aes(magnitude,y=..count..)) +
geom_density(fill="grey")
#distribution of temperature
stars %>%
ggplot(aes(temp)) +
geom_density(fill="grey")
# scatter plot of temperature (X) and magnitude (Y)
stars %>%
ggplot(aes(temp,magnitude)) +
geom_point()
# scatter plot of temperature (X) and magnitude (Y) with flipped axes
stars %>%
ggplot(aes(temp,magnitude)) +
geom_point() +
scale_y_reverse() +
scale_x_continuous(trans = "log10") +
scale_x_reverse() # note: this second x scale replaces the log10 scale above
# The least luminous star in the sample with a surface temperature over 5000K
stars %>%
ggplot(aes(temp,magnitude,label=star)) +
geom_point() +
geom_text_repel() +
scale_y_reverse() +
scale_x_continuous(trans = "log10")
# Color by star type
stars %>%
ggplot(aes(temp,magnitude,color=type,label=type)) +
geom_point() +
geom_text_repel() +
scale_y_reverse() +
scale_x_continuous(trans = "log10") +
scale_x_reverse() # note: this second x scale replaces the log10 scale above
# G type
stars %>%
filter(type=="G") %>%
ggplot(aes(temp,magnitude,color=type)) +
geom_point() +
scale_y_reverse() +
scale_x_continuous(trans = "log10")
data(temp_carbon)
data(greenhouse_gases)
data(historic_co2)
temp_carbon %>%
filter(!is.na(carbon_emissions)) %>%
pull(year) %>%
max()
## [1] 2014
temp_carbon %>%
filter(!is.na(carbon_emissions)) %>%
.$year %>%
max()
## [1] 2014
temp_carbon %>%
filter(!is.na(carbon_emissions)) %>%
select(year) %>%
max()
## [1] 2014
# First year with carbon emissions available
temp_carbon %>%
filter(!is.na(carbon_emissions)) %>%
select(year) %>% min()
## [1] 1751
# Last year with carbon emissions available
temp_carbon %>%
filter(!is.na(carbon_emissions)) %>%
select(year) %>% max()
## [1] 2014
temp_carbon$carbon_emissions[temp_carbon$year==2014]/temp_carbon$carbon_emissions[temp_carbon$year==1751]
## [1] 3285
# First and last year with temp anomaly
temp_carbon %>%
filter(!is.na(temp_anomaly)) %>%
.$year %>% min()
## [1] 1880
temp_carbon %>%
filter(!is.na(temp_anomaly)) %>%
.$year %>% max()
## [1] 2018
# note: these names shadow base min()/max(), though function calls below still resolve correctly
min <- temp_carbon$temp_anomaly[temp_carbon$year==1880]
max <- temp_carbon$temp_anomaly[temp_carbon$year==2018]
# 20th century mean temperature
TA20 <- temp_carbon %>%
filter(!is.na(temp_anomaly) & year %in% c(1901:2000))
colMeans(TA20)
## year temp_anomaly land_anomaly ocean_anomaly
## 1950.5000 0.0004 -0.0002 -0.0002
## carbon_emissions
## 2704.6900
# temperature with blue line for mean 20th century
p <- TA20 %>%
ggplot(aes(year,temp_anomaly)) +
geom_line()
p <- p + geom_hline(aes(yintercept = 0), col = "blue")
p
TA20 %>%
ggplot(aes(year,temp_anomaly)) +
geom_line() +
geom_hline(yintercept = 0, col="blue") +
ylab("Temperature anomaly (degrees C)") +
ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
geom_text(aes(x = 2000, y = 0.05, label = "20th century mean"), col = "blue") +
geom_vline(aes(xintercept=1939),col="red") +
geom_vline(aes(xintercept=1976),col="orange") +
geom_text(aes(x=1936,y=0.05,label="1939"),col="red") +
geom_text(aes(x=1978,y=0.05,label="1976"),col="orange")
TA20 %>%
filter(temp_anomaly>0) %>%
.$year %>% min()
## [1] 1939
TA20 %>%
filter(temp_anomaly<0) %>%
.$year %>% max()
## [1] 1976
TA20 %>%
filter(temp_anomaly>0.5) %>%
.$year %>% min()
## [1] 1997
#adding land and ocean
temp_carbon %>%
filter(year %in% c(1880:2020)) %>%
ggplot() +
geom_line(aes(year,temp_anomaly)) +
geom_line(aes(year,ocean_anomaly),col="blue") +
geom_line(aes(year,land_anomaly), col="brown") +
geom_hline(yintercept = 0, col="black",alpha=0.5) +
geom_vline(xintercept = 2018, col="black",alpha=0.5)
names(greenhouse_gases)
## [1] "year" "gas" "concentration"
head(greenhouse_gases)
## year gas concentration
## 1 20 CO2 278
## 2 40 CO2 278
## 3 60 CO2 277
## 4 80 CO2 277
## 5 100 CO2 278
## 6 120 CO2 278
greenhouse_gases %>%
ggplot(aes(year,concentration)) +
geom_line() +
facet_grid(gas~., scales = "free") +
geom_vline(xintercept = 1850) +
ylab("Concentration (ch4/n2o ppb, co2 ppm)") +
ggtitle("Atmospheric greenhouse gas concentration by year, 0-2000")
temp_carbon %>%
ggplot(aes(year,carbon_emissions)) +
geom_line()
#year, co2, source
co2_time <- historic_co2 %>%
ggplot(aes(year,co2,color=source)) +
geom_line()
co2_time_recent <- historic_co2 %>%
filter(year>1500) %>%
ggplot(aes(year,co2,color=source)) +
geom_line()
co2_time
co2_time_recent
grid.arrange(co2_time,co2_time_recent)
historic_co2 %>%
ggplot(aes(year,co2,color=source)) +
geom_line() +
xlim(-800000,-775000)
historic_co2 %>%
ggplot(aes(year,co2,color=source)) +
geom_line() +
xlim(-375000,-330000)
historic_co2 %>%
ggplot(aes(year,co2,color=source)) +
geom_line() +
xlim(-140000,-120000)
historic_co2 %>%
ggplot(aes(year,co2,color=source)) +
geom_line() +
xlim(1700,2018)