Section 12.8

1. Define variables containing the heights of males and females as shown below. How many measurements do we have for each?

library(dslabs)
data(heights)
male <- heights$height[heights$sex == "Male"]
female <- heights$height[heights$sex == "Female"]
# Count the measurements in each vector
length(male)
length(female)

The female vector has 238 measurements and the male vector has 812.

2. Suppose we can’t make a plot and want to compare the distributions side by side. We can’t just list all the numbers. Instead, we will look at the percentiles. Create a five-row table showing female_percentiles and male_percentiles with the 10th, 30th, 50th, 70th, and 90th percentiles for each sex. Then create a data frame with these two as columns.

library(data.table)
j<-c(.1, .3, .5, .7, .9)
man<-quantile(male, j)
woman<-quantile(female, j)
data.table(man, woman)
##         man    woman
##       <num>    <num>
## 1: 65.00000 61.00000
## 2: 68.00000 63.00000
## 3: 69.00000 64.98031
## 4: 71.00000 66.46417
## 5: 73.22751 69.00000
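
The exercise asks for a data frame with columns named female_percentiles and male_percentiles. A minimal base-R sketch of that variant follows. The helper name percentile_table and the toy input vectors are assumptions made for illustration; they are not part of the solution above.

```r
# Hypothetical helper: build the five-row percentile table as a data frame.
percentile_table <- function(male, female, p = c(0.1, 0.3, 0.5, 0.7, 0.9)) {
  data.frame(
    percentile = p,
    female_percentiles = unname(quantile(female, p)),
    male_percentiles = unname(quantile(male, p))
  )
}

# Made-up vectors so the sketch runs without the dslabs data.
toy_male <- c(60, 65, 70, 75, 80)
toy_female <- c(55, 60, 65, 70, 75)
tab <- percentile_table(toy_male, toy_female)
print(tab)
```

With the vectors defined in question 1, the call would be percentile_table(male, female).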

3. Study the following boxplots showing population sizes by country. Which continent has the country with the biggest population size? Asia.

4. What continent has the largest median population size? Africa.

5. What is median population size for Africa to the nearest million? 12 million.

6. What proportion of countries in Europe have populations below 14 million? B. .75

7. If we use a log transformation, which continent shown above has the largest interquartile range? B. Americas

8. Load the height data set and create a vector x with just the male heights. What proportion of the data is between 69 and 72 inches (taller than 69, but shorter than or equal to 72)? Hint: use a logical operator and mean.

library(dslabs)
library(tidyverse)
x<-heights$height[heights$sex=="Male"]
mean(x > 69 & x <= 72)
## [1] 0.3337438
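
The proportion in the interval (69, 72] can be written in two equivalent ways: a single combined logical condition, or a difference of two cumulative proportions. The sketch below checks this on a made-up vector (toy_x is hypothetical, not the heights data).

```r
toy_x <- c(66, 68, 69, 70, 71, 72, 73, 75)

# Combined condition: strictly greater than 69 and at most 72.
p_logical <- mean(toy_x > 69 & toy_x <= 72)

# Difference of cumulative proportions: P(x <= 72) - P(x <= 69).
p_cumulative <- mean(toy_x <= 72) - mean(toy_x <= 69)

print(c(p_logical, p_cumulative))
```

The second term must use <= 69. Subtracting mean(x > 69) instead gives a different quantity, not the proportion in the interval.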

9. Suppose all you know about the data is the average and the standard deviation. Use the normal approximation to estimate the proportion you just calculated. Hint: start by computing the average and standard deviation. Then use the pnorm function to predict the proportions.

x<-heights$height[heights$sex=="Male"]
mu<-mean(x)
s<-sd(x)
z2<-(72-mu)/s
z1<-(69-mu)/s
pnorm(z2)-pnorm(z1)
## [1] 0.3061779
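
pnorm also accepts mean and sd arguments, so standardizing by hand is optional. The sketch below checks that both forms agree. The names mu_demo and s_demo are illustrative round numbers chosen for this example, not the values computed from the data.

```r
# Illustrative (assumed) mean and standard deviation.
mu_demo <- 69.3
s_demo <- 3.6

# Form 1: convert the bounds to z-scores, then use the standard normal CDF.
via_z <- pnorm((72 - mu_demo) / s_demo) - pnorm((69 - mu_demo) / s_demo)

# Form 2: pass the mean and standard deviation to pnorm directly.
via_args <- pnorm(72, mean = mu_demo, sd = s_demo) -
  pnorm(69, mean = mu_demo, sd = s_demo)

print(all.equal(via_z, via_args))
```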

10. Notice that the approximation calculated in question 9 is very close to the exact calculation in question 8. Now perform the same task for more extreme values. Compare the exact calculation and the normal approximation for the interval (79,81]. How many times bigger is the actual proportion than the approximation?

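x <- heights$height[heights$sex == "Male"]
# Exact proportion of heights in (79, 81]
exact <- mean(x > 79 & x <= 81)
# Normal approximation over the same interval
mu <- mean(x)
s <- sd(x)
z2 <- (81 - mu) / s
z1 <- (79 - mu) / s
approx <- pnorm(z2) - pnorm(z1)
# How many times bigger the exact proportion is than the approximation
ratio <- exact / approx
ratio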
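
The comparison can be wrapped in a small function so it can be repeated for any interval (a, b]. This is only a sketch: compare_interval is a hypothetical name, and toy_heights is a made-up vector used so the code runs without the dslabs data.

```r
# Hypothetical helper: exact proportion, normal approximation, and their ratio
# for the interval (a, b].
compare_interval <- function(x, a, b) {
  exact <- mean(x > a & x <= b)
  normal <- pnorm(b, mean(x), sd(x)) - pnorm(a, mean(x), sd(x))
  c(exact = exact, normal = normal, ratio = exact / normal)
}

# Made-up data with two values in (79, 81].
toy_heights <- c(64, 66, 68, 69, 70, 71, 72, 74, 80, 81)
res <- compare_interval(toy_heights, 79, 81)
print(res)
```

A ratio above 1 means the data have more mass in that interval than the normal curve predicts.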

11. Approximate the distribution of heights of adult men in the world as normal with an average of 69 inches and a standard deviation of 3 inches. Using this approximation, estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers. Hint: use the pnorm function.

tallmen<-1-pnorm(7*12,mean=69, sd=3)

12. There are about 1 billion men between the ages of 18 and 40 in the world. Use your answer to the previous question to estimate how many of these men (18-40 year olds) are seven feet tall or taller in the world?

nummen<-round((10^9)*tallmen)

13. There are about 10 National Basketball Association (NBA) players that are 7 feet tall or taller. Using the answer to the previous two questions, what proportion of the world’s 18-to-40-year-old seven footers are in the NBA?

NBA<-10/nummen

14. Repeat the calculations performed in the previous question for LeBron James’ height: 6 feet 8 inches. There are about 150 players that are at least that tall.

lebron<-1-pnorm(6*12+8, mean=69, sd=3)
prop<-150/(round(lebron*10^9))
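
Questions 11 to 14 repeat one calculation with different inputs, so it can be sketched as a single function. The name nba_share and its argument names are assumptions for illustration; the mean of 69, standard deviation of 3, and population of 1 billion come from the exercises. The sketch uses lower.tail = FALSE, which is equivalent to 1 - pnorm(...).

```r
# Hypothetical helper: tail proportion, implied head count, and NBA share
# for men at or above a given height (in inches).
nba_share <- function(height_in, n_players, mu = 69, sigma = 3, n_men = 1e9) {
  tail_prop <- pnorm(height_in, mean = mu, sd = sigma, lower.tail = FALSE)
  n_tall <- round(n_men * tail_prop)
  c(tail_prop = tail_prop, n_tall = n_tall, share_in_nba = n_players / n_tall)
}

seven_footers <- nba_share(7 * 12, 10)       # 7 feet, about 10 NBA players
lebron_height <- nba_share(6 * 12 + 8, 150)  # 6 feet 8 inches, about 150 players
print(seven_footers)
print(lebron_height)
```

Under this model there are roughly 287 seven footers among the 1 billion men, so about 3.5% of them would be in the NBA.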

15. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player. What would be a fair critique of our calculations? C. As seen in question 10, the normal approximation tends to underestimate the extreme values: the actual proportion in (79,81] was larger than the approximation. It’s possible that there are more seven footers than we predicted.