# 1 Personal remark and setup

In this chapter, most of the end-of-chapter exercises are reflections rather than calculations. I have almost always used the original text for questions and answers. To mark these quotes, the quoted passages are set in italics with a bar on the left margin.

As there are only a few R calculations in this chapter, I have added an extra challenge: drawing a distribution of age at time of death. The data are taken from an HTML table on the web. This results in three different exercises:

1. Getting data via web scraping and cleaning the data frame
2. Getting data via Excel file and cleaning the data frame
3. Displaying the distribution with the ggplot2 package

## 1.1 Global options


### setting up working environment
### for details see: https://yihui.name/knitr/options/
knitr::opts_chunk$set(
echo = TRUE,
message = TRUE,
error = TRUE,
warning = TRUE,
comment = '##',
highlight = TRUE,
prompt = TRUE,
strip.white = TRUE,
tidy = TRUE
)

## 1.2 Installing and loading R packages

# ### accompanying R package: https://github.com/gitrman/itns
# if (!require("itns"))
# {remotes::install_github("gitrman/itns",
# build = TRUE, build_opts = c("--no-resave-data", "--no-manual"))
# library("itns")}

### https://www.tidyverse.org/
if (!require("tidyverse"))
{install.packages("tidyverse", repos = 'http://cran.wu.ac.at/')
library(tidyverse)}
Loading required package: tidyverse
Registered S3 method overwritten by 'dplyr':
method           from
print.rowwise_df
Registered S3 methods overwritten by 'ggplot2':
method         from
[.quosures     rlang
c.quosures     rlang
print.quosures rlang
Registered S3 method overwritten by 'rvest':
method            from
read_xml.response xml2
── Attaching packages ──────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1     ✔ purrr   0.3.2
✔ tibble  2.1.1     ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
### above command installed and loaded the core tidyverse packages:
# ggplot2: data visualisation
# tibble: a modern take on data frames
# tidyr: data tidying
# readr: data import (csv, tsv, fwf)
# purrr: functional R programming
# dplyr: data (frame) manipulation
# stringr: string manipulation
# forcats: working with categorical variables

# ### to calculate mode:
# if (!require("modeest"))
# {install.packages("modeest", repos = 'http://cran.wu.ac.at/')
# library(modeest)}
#
# # I am going to use the janitor package for calculating table totals
# if (!require("janitor"))
# {install.packages("janitor", repos = 'http://cran.wu.ac.at/')
# library(janitor)}

## 1.3 Theme adaption for the graphic display with ggplot2

my_theme <- theme_light() +
theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.5)) +
theme(plot.background = element_rect(color = NA, fill = NA)) +
theme(plot.margin = margin(1, 0, 0, 0, unit = 'cm'))

# 2 End-of-Chapter Exercises

## 2.1 Find z scores

For a standardized exam of statistics skill, scores are normally distributed: μ = 80, σ = 5. Find each student’s z score:

1. Student 1: X = 80
2. Student 2: X = 90
3. Student 3: X = 75
4. Student 4: X = 95

The formula is $$z = (X-μ)/σ$$.

(80-80)/5 # a.
[1] 0
(90-80)/5 # b.
[1] 2
(75-80)/5 # c.
[1] -1
(95-80)/5 # d.
[1] 3
1. $$z = 0$$
2. $$z = 2$$
3. $$z = -1$$
4. $$z = 3$$
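The four calculations can also be done in one step, because R arithmetic is vectorised. The following is a minimal sketch; the variable names `mu`, `sigma`, `x` and `z` are my own choice for illustration.

```r
# Exam parameters from the exercise
mu <- 80
sigma <- 5

# Scores of the four students (a to d)
x <- c(a = 80, b = 90, c = 75, d = 95)

# z = (X - mu) / sigma, applied to the whole vector at once
z <- (x - mu) / sigma
print(z)
```

Naming the vector elements keeps each z score attached to its student in the printed result.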

## 2.2 Find percentage of better students

For each student in Exercise 1, use R to find what percent of students did better. (Assume X is a continuous variable.)

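I am using the pnorm command. It returns the proportion of a normal distribution that lies below a given z score, so subtracting the result from 1 gives the proportion of students who did better. You can get an explanation by using the help command help(pnorm) or help(Normal):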

help(Normal)
(1 - pnorm(0)) * 100
[1] 50
(1 - pnorm(2)) * 100
[1] 2.275013
(1 - pnorm(-1)) * 100
[1] 84.13447
(1 - pnorm(3)) * 100
[1] 0.1349898

Percent better: a. 50%; b. 2.28%; c. 84.1%; d. 0.1%.
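As an alternative sketch, `pnorm` can work directly on the raw scores: its `mean` and `sd` arguments take the population parameters, and `lower.tail = FALSE` returns the upper tail, so no manual subtraction from 1 is needed. The variable names are my own.

```r
# Raw exam scores of the four students (a to d)
x <- c(a = 80, b = 90, c = 75, d = 95)

# Upper-tail probability P(X > x) for a normal distribution with
# mean 80 and standard deviation 5, expressed in percent
pct_better <- pnorm(x, mean = 80, sd = 5, lower.tail = FALSE) * 100
print(round(pct_better, 2))
```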

## 2.3 Calculation of SE

Gabriela and Sylvia are working as a team for their university’s residential life program. They are both tasked with surveying students about their satisfaction with the dormitories. Today, Gabriela has managed to survey 25 students; Sylvia has managed to survey 36 students. The satisfaction scale they are using has a range from 1 to 20 and is known from previous surveys to have σ = 5.

### 2.3.1 Estimation 1

No mathematics, just think: which sample will have the smaller SE: the one collected by Gabriela or the one collected by Sylvia?

Sylvia’s sample will have the smaller SE because she has collected a larger sample.

### 2.3.2 Estimation 2

When the two combine their data, will this shrink the SE or grow it?

Combining the two samples will yield a smaller SE.

### 2.3.3 Calculation

Now calculate the SE for Gabriela’s sample, for Sylvia’s sample, and for the two samples combined.

The formula is $$SE = σ / \sqrt{N}$$.

5 / sqrt(25)
[1] 1
5 / sqrt(36)
[1] 0.8333333
5 / sqrt(25+36)
[1] 0.6401844

For Gabriela, SE = 1; For Sylvia, SE = 0.83; Combined, SE = 0.64.
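The same calculation can be wrapped in a small helper function. This is only a sketch; `se_mean` is a name I made up for illustration and is not part of any package.

```r
# Standard error of the mean for a known population sigma
se_mean <- function(sigma, n) {
  sigma / sqrt(n)
}

# Sample sizes of the two surveyors and of the combined data
n <- c(Gabriela = 25, Sylvia = 36, Combined = 25 + 36)

se <- se_mean(sigma = 5, n = n)
print(round(se, 2))
```

Because the sample size sits under a square root, more than doubling Gabriela's sample (from 25 to 61) shrinks the SE only from 1 to about 0.64.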

### 2.3.4 Is the sample size sufficient?

How big a sample size is needed? Based on the combined SE you obtained, does it seem like the residential life program should send Gabriela and Sylvia out to collect more data? Why or why not? This is a judgment call, but you should be able to make relevant comments. Consider not only the SE but the range of the measurement.

What sample size is sufficient is a judgment call, which we’ll discuss further in Chapter 10. For now we can note that the combined data set provides SE = 0.64, meaning that many repeated samples would give sample mean satisfaction scores that would bounce around (i.e., form a mean heap) with standard deviation of 0.64. Given that satisfaction has a theoretical range from 1 to 20, this suggests that any one sample mean will provide a moderately precise estimate, reasonably close to the population mean. This analysis suggests we have sufficient data, although collecting more would of course most likely give us a better estimate.
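The idea of a mean heap can be checked with a small simulation. The sketch below rests on assumptions that are not part of the exercise: the population is treated as normal (which ignores the 1 to 20 limits of the scale), the population mean of 10 is hypothetical, and the seed is arbitrary.

```r
set.seed(1)      # arbitrary seed, only for reproducibility

n <- 25 + 36     # combined sample size
sigma <- 5       # known population standard deviation
mu <- 10         # hypothetical population mean (not given in the exercise)

# Draw 10,000 samples of size n and keep the mean of each one
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# The standard deviation of the mean heap should be close to sigma / sqrt(n)
sd_of_means <- sd(sample_means)
theoretical_se <- sigma / sqrt(n)
print(c(simulated = sd_of_means, theoretical = theoretical_se))
```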

## 2.4 Nursing home and random sampling

Rebecca works at a nursing home. She’d like to study emotional intelligence amongst the seniors at the facility (her population of interest is all the seniors living at the facility). Which of these would represent random sampling for her study?

1. Rebecca will wait in the lobby and approach any senior who randomly passes by.
2. Rebecca will wait in the lobby. As a senior passes by she will flip a coin. If the coin lands heads she will ask the senior to be in the study, otherwise she will not.
3. Rebecca will obtain a list of all the residents in the nursing home. She will randomly select 10% of the residents on this list; those selected will be asked to be part of the study.
4. Rebecca will obtain a list of all the residents in the nursing home. She will randomly select 1% of the residents on this list; those selected will be asked to be part of the study.

c and d represent random sampling because both give each member of the population an equal chance to be in the study, and members of the sample are selected independently.
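Selecting from a list, as in c and d, could be sketched in R with `sample()`. The resident list below is made up for illustration (200 hypothetical IDs), and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical list of all residents of the nursing home
residents <- sprintf("resident_%03d", 1:200)

# Select 10% of the list; sample() draws without replacement by default,
# so every resident has the same chance of being selected
n_select <- round(0.10 * length(residents))
selected <- sample(residents, size = n_select)
print(selected)
```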

## 2.5 Skewed distributions

Sampling distributions are not always normally distributed, especially when the variable measured is highly skewed. Below are some variables that tend to have strong skew.

a) In real estate, home prices tend to be skewed. In which direction? Why might this be?

b) Scores on easy tests tend to be skewed. In which direction? Why might this be?

c) Age of death tends to be skewed. In which direction? Why might this be?

d) Number of children in a family tends to be skewed. In which direction? Why might this be?

ad a) Home prices tend to be positively skewed (longer tail to the right), because there is a lower boundary of zero, but in effect no maximum—typically a few houses have extremely high prices. These form the long upper tail of the distribution.

ad b) Scores on an easy test tend to be negatively skewed (longer tail to the left). If the test is very easy, most scores will be piled up near the maximum, but there can still be a tail to the left representing a few students who scored poorly.

ad c) Age at time of death tends to be negatively skewed (longer tail to the left). Death can strike at any time (☹), leading to a long lower tail; however, many people (in wealthy countries) die at around 70–85 years old, and no one lives forever, so the distribution is truncated at the upper end.

Searching for “distribution of age at death”, or similar, will find you graphs showing this strong negative skew.

What follows are two examples of this negatively skewed distribution of age at time of death:

ad d) Number of children in a family tends to be positively skewed (longer tail to the right) because 0 is a firm minimum, and then scores extend upward from there, with many families having, say, 1–4 children and a small number of families having many children.

## 2.6 Warning sign for skewed variables

Based on the previous exercise, what is a caution or warning sign that a variable will be highly skewed?

Anything that limits, filters, selects, or caps scores on the high or low end can lead to skew. Selection is not the only thing that can produce skew, but any time your participants have been subjected to some type of selection process you should be alert to the possibility of skew in the variables used to make the selection (and in any related variables).

Also, if the mean and median differ by more than a small amount, most likely there is skew, with the mean being “pulled” in the direction of the longer tail.
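The mean-versus-median check can be tried on simulated data. This is a sketch with made-up values: an exponential distribution stands in for a positively skewed variable, and its mirror image stands in for a negatively skewed one.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Positively skewed data (long tail to the right)
right_skewed <- rexp(10000, rate = 1)

# Mirror image: negatively skewed data (long tail to the left)
left_skewed <- -right_skewed

# The mean is pulled towards the longer tail, the median much less so
print(c(mean = mean(right_skewed), median = median(right_skewed)))
print(c(mean = mean(left_skewed), median = median(left_skewed)))
```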

Both graphs above are calculated from data. The first one was drawn with Python from data on Wikipedia; the second one used data from a life table of the US Social Security Administration.

This opens up two questions for exercises:

1. How to get data from websites rather than from specially prepared Excel sheets?
2. How to draw the distributions shown above?

## 3.1 Getting data from a table on a web page

### 3.1.1 Get table data with web scraping

Getting data from web pages is called web scraping. Searching for “R web scraping” will turn up many articles, tutorials and videos on how to do it. Here I am going to use a blog post by Cory Nissen.

We are going to use the R package rvest to write an appropriate script.

Look at the US period life table from 2016: How to get the necessary data for an updated graph of figure 2?

# install/load package rvest
if (!require("rvest"))
{install.packages("rvest", repos = 'http://cran.wu.ac.at/')
library(rvest)}
Loading required package: rvest

Attaching package: ‘rvest’

The following object is masked from ‘package:purrr’:

pluck

guess_encoding
# store the web url of the page with the table you are interested in
url <- "https://www.ssa.gov/oact/STATS/table4c6.html"
life_table_2016 <- url %>%
read_html() %>% # from package xml2
## 1) open the web page in Google Chrome
## 2) place the cursor at the start of the desired table data
## 3) right-click and choose "Inspect element"
## 4) look for the appropriate <table …> line in the inspector (typically some lines above)
## 5) select this <table …> line in the inspector
## 6) right-click it and select "Copy -> Copy XPath"
## 7) paste the copied XPath into the next line of the R script
html_nodes(xpath='//*[@id="content"]/section[2]/div/div[3]/table') %>%
html_table(fill = TRUE)
life_table_2016 <- life_table_2016[[1]]
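To see how the XPath selects a table without depending on the live page, the same pipeline can be run on a small inline HTML snippet. This is a sketch under assumptions: the HTML, the table values and the name `toy_table` are made up for illustration, and the rvest package must already be installed.

```r
library(rvest)

# Made-up HTML page with a single table inside an element with id "content"
page <- read_html('
  <html><body>
    <div id="content">
      <table>
        <tr><th>age</th><th>survivors</th></tr>
        <tr><td>0</td><td>100000</td></tr>
        <tr><td>1</td><td>99400</td></tr>
      </table>
    </div>
  </body></html>')

# The XPath picks every <table> directly below the element with id "content";
# html_table() then returns a list with one data frame per matched table
tables <- page %>%
  html_nodes(xpath = '//*[@id="content"]/table') %>%
  html_table()

toy_table <- tables[[1]]
print(toy_table)
```

If the XPath matches nothing, `tables` is an empty list and `tables[[1]]` fails, so checking `length(tables)` first is a cheap safeguard when a page layout changes.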

To clarify how to get the correct XPath, compare the following screenshots: