Introduction

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as “all people living in a country” or “every atom composing a crystal”. Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.

Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation). Descriptive statistics are most often concerned with two sets of properties of a distribution (sample or population): central tendency (or location) seeks to characterize the distribution’s central or typical value, while dispersion (or variability) characterizes the extent to which members of the distribution depart from its center and each other. Inferences on mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena.

A standard statistical procedure involves the collection of data leading to a test of the relationship between two statistical data sets, or between a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (the null hypothesis is falsely rejected, giving a “false positive”) and Type II errors (the null hypothesis fails to be rejected and an actual relationship between populations is missed, giving a “false negative”). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.
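The two error types can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the significance level `alpha`, the group size `n_per_group`, the shift `effect`, and the choice of a two-sample t-test are illustrative and do not come from the text above.

```r
# Monte Carlo sketch of Type I and Type II error rates for a two-sample t-test.
# alpha, n_per_group, effect, and B are illustrative assumptions.
set.seed(1)
alpha = 0.05
n_per_group = 30
B = 2000

# Null hypothesis true: both samples come from the same distribution.
p_null = replicate(B, t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
type_1_rate = mean(p_null < alpha)    # share of false positives

# Null hypothesis false: the second group's mean is shifted by `effect`.
effect = 0.5
p_alt = replicate(B, t.test(rnorm(n_per_group),
                            rnorm(n_per_group, mean = effect))$p.value)
type_2_rate = mean(p_alt >= alpha)    # share of false negatives

c(type_1 = type_1_rate, type_2 = type_2_rate)
```

The Type I rate should land near `alpha`, while the Type II rate depends on the sample size and on how large the true effect is.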

Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also occur. The presence of missing data or censoring may result in biased estimates and specific techniques have been developed to address these problems.
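The difference between random and systematic error can be sketched with simulated measurements. All values here (`true_value`, `noise_sd`, `bias`) are hypothetical and chosen only for illustration.

```r
# Random error (noise) averages out over many measurements; systematic error (bias) does not.
set.seed(1)
true_value = 10
n = 1000
noise_sd = 0.5   # size of the random error (assumed)
bias = 0.3       # size of the systematic error (assumed)

noisy  = true_value + rnorm(n, sd = noise_sd)
biased = true_value + bias + rnorm(n, sd = noise_sd)

mean(noisy)  - true_value   # close to 0
mean(biased) - true_value   # close to the bias, however large n is
```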

The earliest writings on probability and statistics, in which statistical methods draw on probability theory, date back to Arab mathematicians and cryptographers, notably Al-Khalil (717–786) and Al-Kindi (801–873). In the 18th century, statistics also started to draw heavily from calculus. In more recent years statistics has relied more on statistical software.

Statistics is described either as a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data, or as a branch of mathematics. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty and with decision making in the face of uncertainty.

In applying statistics to a problem, it is common practice to start with a population or process to be studied. Populations can be diverse topics such as “all people living in a country” or “every atom composing a crystal”. Ideally, statisticians compile data about the entire population (an operation called census). This may be organized by governmental statistical institutes. Descriptive statistics can be used to summarize the population data. Numerical descriptors include mean and standard deviation for continuous data (like income), while frequency and percentage are more useful in terms of describing categorical data (like education).

When a census is not feasible, a chosen subset of the population called a sample is studied. Once a sample that is representative of the population is determined, data is collected for the sample members in an observational or experimental setting. Again, descriptive statistics can be used to summarize the sample data. However, drawing the sample contains an element of randomness; hence, the numerical descriptors from the sample are also prone to uncertainty. To draw meaningful conclusions about the entire population, inferential statistics is needed. It uses patterns in the sample data to draw inferences about the population represented while accounting for randomness. These inferences may take the form of answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation), and modeling relationships within the data (for example, using regression analysis). Inference can extend to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. It can include extrapolation and interpolation of time series or spatial data, and data mining.

Glossary

Symbols

Relational Symbols

Sample and Population symbols

\(\mu\) and \(\sigma\) can take subscripts to show what you are taking the mean or standard deviation of. For instance, \(\sigma_{\bar{X}}\) (“sigma sub x-bar”) is the standard deviation of sample means, or standard error of the mean.
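As a quick illustration of \(\sigma_{\bar{X}}\), the sketch below compares the standard deviation of many simulated sample means with the usual formula \(\sigma / \sqrt{n}\). The population (normal with `sigma = 2`) and the sample size `n = 25` are assumptions made for the example.

```r
# Standard error of the mean: empirical vs. sigma / sqrt(n).
set.seed(1)
sigma = 2    # population standard deviation (assumed)
n = 25       # sample size (assumed)

sample_means = replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

sd(sample_means)    # empirical standard deviation of the sample means
sigma / sqrt(n)     # theoretical standard error: 0.4
```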

Playground

Mean and standard deviation

vec = c(1, 4, 7, 9, 11)

mean = sum(vec) / length(vec)
mean
## [1] 6.4
mean(vec) # Correct
## [1] 6.4
variance = sum((vec - mean) ^ 2) / (length(vec) - 1)
variance
## [1] 15.8
var(vec) # Correct
## [1] 15.8
st_dv = sqrt(variance)
st_dv
## [1] 3.974921
sd(vec) # Correct
## [1] 3.974921
my_sd_func = function(vector){
  mean = mean(vector)
  variance = sum((vector - mean) ^ 2) / (length(vector) - 1)
  sqrt(variance)
}

my_sd_func(vec)
## [1] 3.974921
test_vectors = rbind(c(2, 2, 3, 3, 4), c(7, 7, 8, 9, 9), c(2, 3, 5, 7, 8), c(1, 4, 7, 9, 11))

apply(test_vectors, 1, my_sd_func)
## [1] 0.836660 1.000000 2.549510 3.974921
apply(test_vectors, 1, sd)
## [1] 0.836660 1.000000 2.549510 3.974921

Random Variable: expected value and standard deviation

Roulette table example with Red, Black and Green. Only Green wins. What is the probability that you get Green?

v = rep(c("Red", "Black", "Green"), c(18,18,2))
prop.table(table(v))
## v
##      Black      Green        Red 
## 0.47368421 0.05263158 0.47368421

This is the sampling model.

# One draw from the sampling model: win 17 with probability 2/38, lose 1 otherwise
x = sample(c(17,-1), size = 1, prob = c(2/38,36/38))

The expected value is calculated by multiplying each possible value by its likelihood and adding the results. For two outcomes a and b with probabilities p and 1-p, its formula is ap + b(1-p). The expected value for 1000 draws:

EV = 1000 * (17*2/38 + (-1*36/38))

The standard error is the standard deviation of a random variable, i.e., of its probability distribution. For two outcomes a and b its formula is |a - b| * sqrt(p * (1-p)). The standard error for 1000 draws:

SE = sqrt(1000) * abs(17 - (-1)) * sqrt(2/38 * 36/38)
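The two formulas can be wrapped in small helper functions so that they are reusable for any two-outcome draw. The function names `two_point_ev` and `two_point_se` are hypothetical and introduced only for this sketch.

```r
# Expected value and standard error of a draw that pays a with probability p
# and b with probability 1 - p (helper names are illustrative).
two_point_ev = function(a, b, p) a * p + b * (1 - p)
two_point_se = function(a, b, p) abs(a - b) * sqrt(p * (1 - p))

n_draws = 1000
p_green = 2 / 38

ev_sum = n_draws * two_point_ev(17, -1, p_green)        # about -52.6
se_sum = sqrt(n_draws) * two_point_se(17, -1, p_green)  # about 127.1
c(EV = ev_sum, SE = se_sum)
```

The sum of n independent draws has n times the expected value but only sqrt(n) times the standard error of a single draw.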

Random variable S storing the experimental values from sampling model.

set.seed(1)
S = sample(c(17,-1), size = 1000, replace = T, prob = c(2/38, 36/38))

sum(S)
## [1] -10

Create the experimental sampling distribution of the sample sum.

roulette_winnings = function(){
  S = sample(c(17,-1), size = 1000, replace = T, prob = c(2/38, 36/38))
  sum(S)
}

set.seed(1)
S = replicate(10000, roulette_winnings())
hist(S)

The mean or expected value of S?

mean(S)
## [1] -52.3324

The standard deviation or standard error of S?

sd(S)
## [1] 126.9762

The probability that we win?

mean(S > 0)
## [1] 0.3391
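# Exact check with the binomial distribution: with k Green outcomes in n spins,
# S = 17 * k - (n - k), so we win (S > 0) only when k > n / 18, i.e. k >= 56.
n = 1000
1 - pbinom(55, size = n, prob = 1/19)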
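The simulated winning probability of 0.3391 can also be cross-checked with a normal approximation, since the sum of 1000 independent draws is approximately normal. This is a sketch that assumes the central limit theorem applies well enough here; the variable names are illustrative.

```r
# Normal approximation for P(S > 0), where S is the sum of 1000 roulette draws.
n_spins = 1000
p_green = 2 / 38

ev_sum = n_spins * (17 * p_green + (-1) * (1 - p_green))
se_sum = sqrt(n_spins) * abs(17 - (-1)) * sqrt(p_green * (1 - p_green))

p_win_normal = 1 - pnorm(0, mean = ev_sum, sd = se_sum)
p_win_normal   # should be close to the simulated proportion mean(S > 0)
```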

Bank and loans example (EV + SE + ND)

# Number of loans
n = 10000

# Probability of default
p = 0.03

# Loss per single foreclosure
loss_per_forclosure = -200000

# Interest earned per loan (set to 0 for now)
x = 0

# Random variable S storing defaults = 1 and non defaults = 0
S = sample(c(0,1), prob = c(1-p, p), size = n, replace = T)
head(S, 100)
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Total loss for this one simulated set of 10000 loans
sum(S * loss_per_forclosure)
## [1] -63000000
# Monte-Carlo simulation amount 
B = 10000

# Simulation
losses = replicate(B, { 
  S = sample(c(0,1), prob = c(1-p, p), size = n, replace = T)
  sum(S * loss_per_forclosure)
})

# Losses distribution from simulation
hist(losses / 10^6)

# Expected value of S
EV = x*(1-p) + loss_per_forclosure*p
EV
## [1] -6000
# Standard error of S
SE = abs(x - loss_per_forclosure) * sqrt(p * (1-p))
SE
## [1] 34117.44
# Loan amount of 180000 (for reference only; x is reassigned below)
x = 180000

# Break-even interest per loan: solve loss_per_forclosure * p + x * (1 - p) = 0 for x
x = - (loss_per_forclosure * p / (1 - p))
x
## [1] 6185.567
# The bank needs 6185.57 dollars of interest per loan to break even on average.
EV = loss_per_forclosure * p + x * (1 - p) 
EV # Correct!
## [1] 0
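For the normal distribution part of this example, the total loss over all loans can be approximated by a normal distribution with mean n * EV and standard deviation sqrt(n) * SE. The sketch below compares that approximation with a Monte Carlo simulation for x = 0. Two assumptions are mine: `rbinom()` is used as a faster stand-in for the `sample()` call above, and the 65 million threshold is arbitrary.

```r
# Normal approximation vs. simulation for the total loss of n loans (interest x = 0).
set.seed(1)
n = 10000
p = 0.03
loss_per_forclosure = -200000
x = 0
B = 10000

EV_total = n * (x * (1 - p) + loss_per_forclosure * p)
SE_total = sqrt(n) * abs(x - loss_per_forclosure) * sqrt(p * (1 - p))

# rbinom() draws the number of defaults among n loans directly
losses = rbinom(B, size = n, prob = p) * loss_per_forclosure

# Probability that total losses exceed 65 million (threshold chosen for illustration)
threshold = -65e6
c(simulated = mean(losses < threshold),
  normal    = pnorm(threshold, mean = EV_total, sd = SE_total))
```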

Experiment: validity of the theoretical standard error

The sample statistics: a proportion of 0.45 “democrats” and a sample size of 100.

p = 0.45

n = 100

The theoretical standard error for p = 0.45 and sample size n = 100.

SE = sqrt(p * (1 - p)) / sqrt(n)
SE
## [1] 0.04974937

Now we can check this theoretical standard error experimentally by running a Monte Carlo simulation. We create a sampling distribution of 10000 sample proportions with p = 0.45 and n = 100, and then subtract p from each sample proportion to obtain the actual errors.

test_errors = replicate(10000, mean(sample(c(1,0), replace = T, size = n, prob = c(p, (1 - p)))))

# Distribution of errors
test_errors = test_errors - p

# The standard deviation of the errors
sd(test_errors)
## [1] 0.04944473
# Very close to the theoretical standard error
SE
## [1] 0.04974937

And the error distribution is approximately normal.

qqnorm(test_errors);qqline(test_errors)

Experiment: How large sample size for 0.01 SE?

For proportions, the SE is largest at p = 0.5. We therefore use this worst case to calculate the sample size required for a standard error of around 0.01. We can calculate a list of SEs for sample sizes from 100 to 5000.

p = 0.5
n = 100:5000

list_of_SEs = sqrt(p * (1 - p) / n)
head(list_of_SEs, 20)
##  [1] 0.05000000 0.04975186 0.04950738 0.04926646 0.04902903 0.04879500
##  [7] 0.04856429 0.04833682 0.04811252 0.04789131 0.04767313 0.04745790
## [13] 0.04724556 0.04703604 0.04682929 0.04662524 0.04642383 0.04622502
## [19] 0.04602873 0.04583492

Plotting the list of standard errors shows at what sample size we reach a standard error of around 0.01: at a sample size of around 2,500. Note that the x-axis shows the index into the list, so index 1 corresponds to n = 100.

plot(list_of_SEs, type = "l")
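The required sample size can also be obtained directly by solving SE = sqrt(p * (1 - p) / n) for n, which gives n = p * (1 - p) / SE^2. A minimal sketch, where `target_se` and `n_required` are names introduced for illustration:

```r
# Solve SE = sqrt(p * (1 - p) / n) for n at the worst case p = 0.5.
p = 0.5
target_se = 0.01

n_required = p * (1 - p) / target_se^2
n_required   # about 2500
```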

Experiment: does the sample confidence interval include the parameter?

library(tidyverse)  # provides between(), tibble(), and ggplot() used below

p = 0.45
n = 1000
B = 10000

correct_list = replicate(B, {
  x = sample(c(1,0), size = n, replace = T, prob = c(p, (1-p)))
  x_hat = mean(x)
  se_hat = sqrt(x_hat * (1 - x_hat) / n)
  between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
  
})

And how often did we get a sample together with its confidence interval that did in fact include the real population parameter p? Pretty close to the theoretical goal of 0.95!

mean(correct_list)
## [1] 0.9488

This is even easier to see in a ggplot graph of 100 such intervals.

lower_bounds = c()
upper_bounds = c()
x_hats = c()
true = c()

for (i in 1:100) {
  x = sample(c(1,0), size = n, replace = T, prob = c(p, (1-p)))
  x_hat = mean(x)
  x_hats = c(x_hats, x_hat)
  se_hat = sqrt(x_hat * (1 - x_hat) / n)
  lower_bounds = c(lower_bounds, (x_hat - 1.96 * se_hat))
  upper_bounds = c(upper_bounds, (x_hat + 1.96 * se_hat))
  true = c(true, between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat))
}

table = tibble(x_hats, lower_bounds, upper_bounds, true)

ggplot(table, aes(seq_along(x_hats), x_hats)) + 
  geom_pointrange(aes(ymin = lower_bounds, ymax = upper_bounds, color = true)) +
  geom_hline(yintercept = p)  # true population proportion

Experiment: real polling data

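library(dslabs)  # provides the polls_us_election_2016 data set
library(dplyr)   # provides %>%, filter(), mutate(), and select()
data("polls_us_election_2016")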

# Exclude observations that are too old.  
polls <- polls_us_election_2016 %>% 
  filter(enddate >= "2016-10-31" & state == "U.S.") 

head(polls, 5)
## # A tibble: 5 × 15
##   state startdate  enddate    pollster               grade samplesize population
##   <fct> <date>     <date>     <fct>                  <fct>      <int> <chr>     
## 1 U.S.  2016-11-03 2016-11-06 ABC News/Washington P… A+          2220 lv        
## 2 U.S.  2016-11-01 2016-11-07 Google Consumer Surve… B          26574 lv        
## 3 U.S.  2016-11-02 2016-11-06 Ipsos                  A-          2195 lv        
## 4 U.S.  2016-11-04 2016-11-07 YouGov                 B           3677 lv        
## 5 U.S.  2016-11-03 2016-11-06 Gravis Marketing       B-         16639 rv        
## # … with 8 more variables: rawpoll_clinton <dbl>, rawpoll_trump <dbl>,
## #   rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>,
## #   adjpoll_trump <dbl>, adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>

Create a confidence interval for the first poll.

n = polls$samplesize[1]
x_hat = polls$rawpoll_clinton[1]/100
se_hat = sqrt(x_hat * (1 - x_hat) / n)

# 95% confidence interval for the first poll
cf = c(x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
rm(x_hat, se_hat)

Create columns for x_hat, se_hat, lower and upper confidence bounds. Select only the relevant columns.

polls = polls %>% 
  mutate(x_hat = rawpoll_clinton/100, 
         se_hat = sqrt(x_hat * (1 - x_hat) / samplesize),
         lower = x_hat - 1.96 * se_hat, 
         upper = x_hat + 1.96 * se_hat) %>% 
  select(pollster, enddate, x_hat, se_hat, lower, upper)

Create a hit column indicating whether each confidence interval includes the true parameter, Clinton’s final share of the vote (48.2%).

polls = polls %>% 
  mutate(hit = lower < 0.482 & 0.482 < upper)

mean(polls$hit)
## [1] 0.3142857

Packages

wbstats

This package allows downloading data from the World Bank database.

library(wbstats)

wb_cachelist$indicators
## # A tibble: 16,649 × 8
##    indicator_id  indicator  unit  indicator_desc   source_org   topics source_id
##    <chr>         <chr>      <lgl> <chr>            <chr>        <list>     <dbl>
##  1 1.0.HCount.1… Poverty H… NA    The poverty hea… LAC Equity … <df […        37
##  2 1.0.HCount.2… Poverty H… NA    The poverty hea… LAC Equity … <df […        37
##  3 1.0.HCount.M… Middle Cl… NA    The poverty hea… LAC Equity … <df […        37
##  4 1.0.HCount.O… Official … NA    The poverty hea… LAC Equity … <df […        37
##  5 1.0.HCount.P… Poverty H… NA    The poverty hea… LAC Equity … <df […        37
##  6 1.0.HCount.V… Vulnerabl… NA    The poverty hea… LAC Equity … <df […        37
##  7 1.0.PGap.1.9… Poverty G… NA    The poverty gap… LAC Equity … <df […        37
##  8 1.0.PGap.2.5… Poverty G… NA    The poverty gap… LAC Equity … <df […        37
##  9 1.0.PGap.Poo… Poverty G… NA    The poverty gap… LAC Equity … <df […        37
## 10 1.0.PSev.1.9… Poverty S… NA    The poverty sev… LAC Equity … <df […        37
## # … with 16,639 more rows, and 1 more variable: source <chr>
wb_cachelist$topics
## # A tibble: 21 × 3
##    topic_id topic                           topic_desc                          
##       <dbl> <chr>                           <chr>                               
##  1        1 Agriculture & Rural Development "For the 70 percent of the world's …
##  2        2 Aid Effectiveness               "Aid effectiveness is the impact th…
##  3        3 Economy & Growth                "Economic growth is central to econ…
##  4        4 Education                       "Education is one of the most power…
##  5        5 Energy & Mining                 "The world economy needs ever-incre…
##  6        6 Environment                     "Natural and man-made environmental…
##  7        7 Financial Sector                "An economy's financial markets are…
##  8        8 Health                          "Improving health is central to the…
##  9        9 Infrastructure                  "Infrastructure helps determine the…
## 10       10 Social Protection & Labor       "The supply of labor available in a…
## # … with 11 more rows
# result = wb_search("")
# result$indicator_desc

# Takes a long time to download
# data = wb_data("SP.POP.TOTL", start_date = 1960, end_date = 2020)


# Example visualization
# library(tidyverse)
# data$country
# data %>% 
#   filter(country == "Germany") %>% 
#   ggplot(aes(date, SP.POP.TOTL/1000000)) +
#   geom_line()

Sampling

In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt to collect samples that are representative of the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population.

Each observation measures one or more properties (such as weight, location, colour) of observable bodies distinguished as independent objects or individuals. In survey sampling, weights can be applied to the data to adjust for the sample design, particularly in stratified sampling. Results from probability theory and statistical theory are employed to guide the practice. In business and medical research, sampling is widely used for gathering information about a population. 

Population

Successful statistical practice is based on focused problem definition. In sampling, this includes defining the “population” from which our sample is drawn. A population can be defined as including all people or items with the characteristic one wishes to understand. Because there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population.

Sometimes what defines a population is obvious. For example, a manufacturer needs to decide whether a batch of material from production is of high enough quality to be released to the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the batch is the population.

Although the population of interest often consists of physical objects, sometimes it is necessary to sample over time, space, or some combination of these dimensions. For instance, an investigation of supermarket staffing could examine checkout line length at various times, or a study on endangered penguins might aim to understand their usage of various hunting grounds over time. For the time dimension, the focus may be on periods or discrete occasions.

This situation often arises when seeking knowledge about the cause system of which the observed population is an outcome. In such cases, sampling theory may treat the observed population as a sample from a larger ‘superpopulation’. For example, a researcher might study the success rate of a new ‘quit smoking’ program on a test group of 100 patients, in order to predict the effects of the program if it were made available nationwide. Here the superpopulation is “everybody in the country, given access to this treatment” – a group which does not yet exist, since the program isn’t yet available to all.

The population from which the sample is drawn may not be the same as the population about which information is desired. Often there is large but not complete overlap between these two groups due to frame issues etc. (see below). Sometimes they may be entirely separate – for instance, one might study rats in order to get a better understanding of human health, or one might study records from people born in 2008 in order to make predictions about people born in 2009.

Time spent in making the sampled population and population of concern precise is often well spent, because it raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage.

Sampling frame

In the most straightforward case, such as the sampling of a batch of material from production (acceptance sampling by lots), it would be most desirable to identify and measure every single item in the population and to include any one of them in our sample. However, in the more general case this is not usually possible or practical. There is no way to identify all rats in the set of all rats. Where voting is not compulsory, there is no way to identify which people will vote at a forthcoming election (in advance of the election). These imprecise populations are not amenable to sampling in any of the ways below and to which we could apply statistical theory.

As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any in our sample. The most straightforward type of frame is a list of elements of the population (preferably the entire population) with appropriate contact information. For example, in an opinion poll, possible sampling frames include an electoral register and a telephone directory.

Probability sampling

A probability sample is a sample in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection.

Example: We want to estimate the total income of adults living in a given street. We visit each household in that street, identify all adults living there, and randomly select one adult from each household. (For example, we can allocate each person a random number, generated from a uniform distribution between 0 and 1, and select the person with the highest number in each household). We then interview the selected person and find their income.

People living on their own are certain to be selected, so we simply add their income to our estimate of the total. But a person living in a household of two adults has only a one-in-two chance of selection. To reflect this, when we come to such a household, we would count the selected person’s income twice towards the total. (The person who is selected from that household can be loosely viewed as also representing the person who isn’t selected.)

In the above example, not everybody has the same probability of selection; what makes it a probability sample is the fact that each person’s probability is known. When every element in the population does have the same probability of selection, this is known as an ‘equal probability of selection’ (EPS) design. Such designs are also referred to as ‘self-weighting’ because all sampled units are given the same weight.
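The street example can be sketched as a simulation: one adult is selected per household, and each selected income is weighted by the inverse of its selection probability, which is the household size. The street itself (200 households with 1 to 4 adults and log-normal incomes) is entirely hypothetical.

```r
# Inverse-probability weighting for "one adult per household" sampling.
set.seed(1)

# Hypothetical street: 200 households with 1 to 4 adults each
household_sizes = sample(1:4, size = 200, replace = TRUE)
household_id = rep(seq_along(household_sizes), times = household_sizes)
income = round(rlnorm(length(household_id), meanlog = 10, sdlog = 0.5))
true_total = sum(income)

estimate_total = function() {
  # Select one adult per household with equal probability within the household
  picked = as.vector(tapply(seq_along(income), household_id,
                            function(idx) idx[sample.int(length(idx), 1)]))
  # Weight = 1 / selection probability = number of adults in the household
  weights = household_sizes[household_id[picked]]
  sum(income[picked] * weights)
}

estimates = replicate(2000, estimate_total())
mean(estimates) / true_total   # close to 1: the weighted estimate is unbiased
```

Any single estimate can be off by a few percent, but the average over repeated samples matches the true total, which is what unbiasedness means here.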

Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These various ways of probability sampling have two things in common:

  1. Every element has a known nonzero probability of being sampled, and

  2. Selection involves randomness at some point.

Nonprobability sampling

Nonprobability sampling is any sampling method where some elements of the population have no chance of selection (these are sometimes referred to as ‘out of coverage’/‘undercovered’), or where the probability of selection can’t be accurately determined. It involves the selection of elements based on assumptions regarding the population of interest, which forms the criteria for selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing limits on how much information a sample can provide about the population. Information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population.

Example: We visit every household in a given street, and interview the first person to answer the door. In any household with more than one occupant, this is a nonprobability sample, because some people are more likely to answer the door (e.g. an unemployed person who spends most of their time at home is more likely to answer than an employed housemate who might be at work when the interviewer calls) and it’s not practical to calculate these probabilities.

Nonprobability sampling methods include convenience sampling, quota sampling and purposive sampling. In addition, nonresponse effects may turn any probability design into a nonprobability design if the characteristics of nonresponse are not well understood, since nonresponse effectively modifies each element’s probability of being sampled.

Sampling Methods

Within any of the types of frames identified above, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice between these designs include:

  • Nature and quality of the frame

  • Availability of auxiliary information about units on the frame

  • Accuracy requirements, and the need to measure accuracy

  • Whether detailed analysis of the sample is expected

  • Cost/operational concerns

Simple random sampling

In a simple random sample (SRS) of a given size, all subsets of a sampling frame have an equal probability of being selected. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimizes bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results.

Simple random sampling can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn’t reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques attempt to overcome this problem by “using information about the population” to choose a more “representative” sample.
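The claim about the ten-person sample can be checked numerically. The sketch assumes a population that is large and split 50/50, so that each selected person is a man or a woman with probability 0.5 independently of the others.

```r
# How often does a simple random sample of ten contain exactly five men and five women?
p_balanced = dbinom(5, size = 10, prob = 0.5)   # 252 / 1024
p_unbalanced = 1 - p_balanced

# The same quantity by simulation
set.seed(1)
sim_unbalanced = mean(replicate(10000, {
  s = sample(c("M", "F"), size = 10, replace = TRUE)
  sum(s == "M") != 5
}))

c(exact = p_unbalanced, simulated = sim_unbalanced)
```

Roughly three out of four such samples overrepresent one sex, even though the expected split is even.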

Also, simple random sampling can be cumbersome and tedious when sampling from a large target population. In some cases, investigators are interested in research questions specific to subgroups of the population. For example, researchers might be interested in examining whether cognitive ability as a predictor of job performance is equally applicable across racial groups. Simple random sampling cannot accommodate the needs of researchers in this situation, because it does not provide subsamples of the population, and other sampling strategies, such as stratified sampling, can be used instead.

Systematic sampling

Systematic sampling (also known as interval sampling) relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. In this case, k=(population size/sample size). It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list. A simple example would be to select every 10th name from the telephone directory (an ‘every 10th’ sample, also referred to as ‘sampling with a skip of 10’).
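A minimal sketch of this procedure, assuming a hypothetical frame of 1000 ordered elements and a desired sample of 100:

```r
# Systematic sampling: random start within the first k elements, then every kth element.
set.seed(1)
N = 1000     # population size (assumed)
n = 100      # sample size (assumed)
k = N / n    # sampling interval: 10

start = sample(1:k, 1)                                # random start between 1 and k
selected = seq(from = start, by = k, length.out = n)  # positions in the ordered list
head(selected)
```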

A visual representation of selecting a random sample using the systematic sampling technique

As long as the starting point is randomized, systematic sampling is a type of probability sampling. It is easy to implement and the stratification induced can make it efficient, if the variable by which the list is ordered is correlated with the variable of interest. ‘Every 10th’ sampling is especially useful for efficient sampling from databases.

For example, suppose we wish to sample people from a long street that starts in a poor area (house No. 1) and ends in an expensive district (house No. 1000). A simple random selection of addresses from this street could easily end up with too many from the high end and too few from the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th street number along the street ensures that the sample is spread evenly along the length of the street, representing all of these districts. (Note that if we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.)

However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of the interval used, the sample is especially likely to be unrepresentative of the overall population, making the scheme less accurate than simple random sampling.

For example, consider a street where the odd-numbered houses are all on the north (expensive) side of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling scheme given above, it is impossible to get a representative sample; either the houses sampled will all be from the odd-numbered, expensive side, or they will all be from the even-numbered, cheap side, unless the researcher has previous knowledge of this bias and avoids it by using a skip which ensures jumping between the two sides (any odd-numbered skip).
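The effect of periodicity can be sketched with made-up house values: odd-numbered houses are assigned 500000 and even-numbered houses 200000. These figures and the helper `systematic_mean` are hypothetical.

```r
# Periodicity: a skip of 10 stays on one side of the street, an odd skip alternates.
set.seed(1)
houses = 1:1000
value = ifelse(houses %% 2 == 1, 500000, 200000)   # odd = expensive, even = cheap

systematic_mean = function(k) {
  start = sample(1:k, 1)
  mean(value[seq(from = start, to = length(houses), by = k)])
}

mean(value)                                        # population mean: 350000
even_skip = replicate(1000, systematic_mean(10))
odd_skip  = replicate(1000, systematic_mean(9))

sort(unique(even_skip))   # only 200000 or 500000: never representative
range(odd_skip)           # always close to 350000
```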

Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of systematic sampling that are given above, much of the potential sampling error is due to variation between neighbouring houses – but because this method never selects two neighbouring houses, the sample will not give us any information on that variation.)

As described above, systematic sampling is an EPS method, because all elements have the same probability of selection (in the example given, one in ten). It is not ‘simple random sampling’ because different subsets of the same size have different selection probabilities – e.g. the set {4,14,24,…,994} has a one-in-ten probability of selection, but the set {4,13,24,34,…} has zero probability of selection.

Systematic sampling can also be adapted to a non-EPS approach; for an example, see discussion of PPS samples below.

Stratified sampling

When the population embraces a number of distinct categories, the frame can be organized by these categories into separate “strata.” Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. The ratio of the size of this random selection (or sample) to the size of the population is called a sampling fraction. There are several potential benefits to stratified sampling.
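A minimal sketch of stratified sampling with the same sampling fraction in every stratum. The frame with strata A, B and C and the fraction of 0.1 are assumptions made for the example.

```r
# Stratified sampling: draw a simple random sample separately within each stratum.
set.seed(1)
frame = data.frame(
  id = 1:1000,
  stratum = rep(c("A", "B", "C"), times = c(500, 300, 200))
)
sampling_fraction = 0.1

rows_by_stratum = split(seq_len(nrow(frame)), frame$stratum)
picked = unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), size = round(length(rows) * sampling_fraction))]
}))

strat_sample = frame[picked, ]
table(strat_sample$stratum)   # 50, 30 and 20 elements from A, B and C
```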

A visual representation of selecting a random sample using the stratified sampling technique

First, dividing the population into distinct, independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample.

Second, utilizing a stratified sampling method can lead to more efficient statistical estimates (provided that strata are selected based upon relevance to the criterion in question, instead of availability of the samples). Even if a stratified sampling approach does not lead to increased statistical efficiency, such a tactic will not result in less efficiency than would simple random sampling, provided that each stratum is proportional to the group’s size in the population.

Third, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata).

Finally, since each stratum is treated as an independent population, different sampling approaches can be applied to different strata, potentially enabling researchers to use the approach best suited (or most cost-effective) for each identified subgroup within the population.
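The basic procedure, sampling each stratum as an independent sub-population, can be sketched in base R. The sampling frame, the stratum labels and the sampling fraction of 0.1 are assumptions made up for this example, and the same fraction is used in every stratum (proportional allocation).

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical sampling frame: 1000 students in three strata
frame <- data.frame(
  id = 1:1000,
  stratum = rep(c("first-year", "second-year", "third-year"),
                times = c(500, 300, 200))
)

sampling_fraction <- 0.1  # assumed: same fraction in every stratum

# Draw a simple random sample of ids within each stratum
picked <- lapply(split(frame$id, frame$stratum), function(ids) {
  ids[sample.int(length(ids), size = round(length(ids) * sampling_fraction))]
})
stratified_sample <- frame[frame$id %in% unlist(picked), ]

table(stratified_sample$stratum)  # 50, 30 and 20 students
```

Different sampling fractions or different selection methods per stratum could be used by changing the function applied to each stratum.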

There are, however, some potential drawbacks to using stratified sampling. First, identifying strata and implementing such an approach can increase the cost and complexity of sample selection, as well as leading to increased complexity of population estimates. Second, when examining multiple criteria, stratifying variables may be related to some, but not to others, further complicating the design, and potentially reducing the utility of the strata. Finally, in some cases (such as designs with a large number of strata, or those with a specified minimum sample size per group), stratified sampling can potentially require a larger sample than would other methods (although in most cases, the required sample size would be no larger than would be required for simple random sampling).

A stratified sampling approach is most effective when three conditions are met:

  1. Variability within strata is minimized

  2. Variability between strata is maximized

  3. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.

Advantages over other sampling methods

  1. Focuses on important subpopulations and ignores irrelevant ones.

  2. Allows use of different sampling techniques for different subpopulations.

  3. Improves the accuracy/efficiency of estimation.

  4. Permits greater balancing of statistical power of tests of differences between strata by sampling equal numbers from strata varying widely in size.

Disadvantages

  1. Requires selection of relevant stratification variables which can be difficult.

  2. Is not useful when there are no homogeneous subgroups.

  3. Can be expensive to implement.

Poststratification

Stratification is sometimes introduced after the sampling phase in a process called “poststratification”. This approach is typically implemented due to a lack of prior knowledge of an appropriate stratifying variable or when the experimenter lacks the necessary information to create a stratifying variable during the sampling phase. Although the method is susceptible to the pitfalls of post hoc approaches, it can provide several benefits in the right situation. Implementation usually follows a simple random sample. In addition to allowing for stratification on an ancillary variable, poststratification can be used to implement weighting, which can improve the precision of a sample’s estimates.
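A minimal sketch of poststratification weighting in R follows. All numbers are invented for illustration: the population shares are assumed to be known from an external benchmark, and the sample deliberately over-represents one group. Each respondent is weighted by the population share of their stratum divided by its sample share.

```r
# Hypothetical known population shares (e.g. from a census)
population_share <- c(young = 0.30, middle = 0.45, old = 0.25)

# Hypothetical simple random sample in which "old" is over-represented
srs <- data.frame(
  age_group = rep(c("young", "middle", "old"), times = c(20, 40, 40)),
  y = c(rep(1, 20), rep(2, 40), rep(3, 40)),
  stringsAsFactors = FALSE
)

# Share of each stratum in the sample, as a plain named vector
sample_share <- c(table(srs$age_group)) / nrow(srs)

# Weight = population share / sample share of the respondent's stratum
groups <- as.character(srs$age_group)
srs$weight <- as.numeric(population_share[groups] / sample_share[groups])

mean(srs$y)                       # unweighted estimate: 2.2
weighted.mean(srs$y, srs$weight)  # poststratified estimate: 1.95
```

The weighted estimate equals the stratum means combined with the population shares (0.30 * 1 + 0.45 * 2 + 0.25 * 3).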

Oversampling

Choice-based sampling is one of the stratified sampling strategies. In choice-based sampling, the data are stratified on the target and a sample is taken from each stratum so that the rare target class will be more represented in the sample. The model is then built on this biased sample. The effects of the input variables on the target are often estimated with more precision with the choice-based sample even when a smaller overall sample size is taken, compared to a random sample. The results usually must be adjusted to correct for the oversampling.

Cluster sampling

Sometimes it is more cost-effective to select respondents in groups (‘clusters’). Sampling is often clustered by geography, or by time periods. (Nearly all samples are in some sense ‘clustered’ in time – although this is rarely taken into account in the analysis.) For instance, if surveying households within a city, we might choose to select 100 city blocks and then interview every household within the selected blocks.

A visual representation of selecting a random sample using the cluster sampling technique

Clustering can reduce travel and administrative costs. In the example above, an interviewer can make a single trip to visit several households in one block, rather than having to drive to a different block for each household.

It also means that one does not need a sampling frame listing all elements in the target population. Instead, clusters can be chosen from a cluster-level frame, with an element-level frame created only for the selected clusters. In the example above, the sample only requires a block-level city map for initial selections, and then a household-level map of the 100 selected blocks, rather than a household-level map of the whole city.
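The city-block example can be sketched in R as follows. The city is simulated: the number of blocks (2000) and the number of households per block (5 to 30) are assumptions made up for the illustration, only the 100 selected blocks come from the example above.

```r
set.seed(7)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical city: 2000 blocks with 5 to 30 households each
n_blocks <- 2000
households_per_block <- sample(5:30, n_blocks, replace = TRUE)
city <- data.frame(block = rep(seq_len(n_blocks), times = households_per_block))
city$household <- seq_len(nrow(city))

# Step 1: select 100 blocks from the block-level frame
selected_blocks <- sample(n_blocks, 100)

# Step 2: interview every household within the selected blocks
cluster_sample <- city[city$block %in% selected_blocks, ]

length(unique(cluster_sample$block))  # 100 blocks
nrow(cluster_sample)                  # number of households to interview
```

Only the block-level frame is needed for the first step; household-level information is used for the selected blocks alone.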

Cluster sampling (also known as clustered sampling) generally increases the variability of sample estimates above that of simple random sampling, depending on how the clusters differ between one another as compared to the within-cluster variation. For this reason, cluster sampling requires a larger sample than SRS to achieve the same level of accuracy – but cost savings from clustering might still make this a cheaper option.

Cluster sampling is commonly implemented as multistage sampling. This is a complex form of cluster sampling in which two or more levels of units are embedded one in the other. The first stage consists of constructing the clusters that will be used to sample from. In the second stage, a sample of primary units is randomly selected from each cluster (rather than using all units contained in all selected clusters). In following stages, in each of those selected clusters, additional samples of units are selected, and so on. All ultimate units (individuals, for instance) selected at the last step of this procedure are then surveyed. This technique, thus, is essentially the process of taking random subsamples of preceding random samples.

Multistage sampling can substantially reduce sampling costs in situations where the complete population list would otherwise need to be constructed (before other sampling methods could be applied). By eliminating the work involved in describing clusters that are not selected, multistage sampling can reduce the large costs associated with traditional cluster sampling.\[8\] However, each sample may not be fully representative of the whole population.

Errors and biases

Survey results are typically subject to some error. Total errors can be classified into sampling errors and non-sampling errors. The term “error” here includes systematic biases as well as random errors.

Sampling errors and biases

Sampling errors and biases are induced by the sample design. They include:

  1. Selection bias: When the true selection probabilities differ from those assumed in calculating the results.

  2. Random sampling error: Random variation in the results due to the elements in the sample being selected at random.

Non-sampling error

Non-sampling errors are other errors which can impact final survey estimates, caused by problems in data collection, processing, or sample design. Such errors may include:

  1. Over-coverage: inclusion of data from outside of the population

  2. Under-coverage: sampling frame does not include elements in the population.

  3. Measurement error: e.g. when respondents misunderstand a question, or find it difficult to answer

  4. Processing error: mistakes in data coding

  5. Non-response or Participation bias: failure to obtain complete data from all selected individuals

After sampling, a review should be held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis.

A particular problem involves non-response. Two major types of non-response exist:

  • unit nonresponse (lack of completion of any part of the survey)

  • item non-response (submission or participation in survey but failing to complete one or more components/questions of the survey)

In survey sampling, many of the individuals identified as part of the sample may be unwilling to participate, not have the time to participate (opportunity cost), or survey administrators may not have been able to contact them. In this case, there is a risk of differences between respondents and nonrespondents, leading to biased estimates of population parameters. This is often addressed by improving survey design, offering incentives, and conducting follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame. The effects can also be mitigated by weighting the data (when population benchmarks are available) or by imputing data based on answers to other questions. Nonresponse is particularly a problem in internet sampling. Reasons for this problem may include improperly designed surveys, over-surveying (or survey fatigue), and the fact that potential participants may have multiple e-mail addresses, some of which they no longer use or do not check regularly.

R: Sampling

The simple example out of the statistics book: you want to take a random sample of 4 students from a class of 60.

sample(60, 4)
## [1] 41 49 16  4

Take a sample of an object x with a specified size and don’t replace values.

sample(x = 1:10, size = 10)
##  [1]  1  9  6 10  3  7  5  4  8  2

Identical but replace parameter set to true, i.e. it returns duplicates.

sample(x = 1:10, size = 10, replace = TRUE)
##  [1]  4  5  1  6  8 10  3  3  9  2

If replace is turned off the sample size is limited to the size of the vector.

population <- c("One", "Zero")
# sample(x = population, size = 10, replace = FALSE) # Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

sample(x = population, size = 10, replace = TRUE)
##  [1] "One"  "One"  "Zero" "Zero" "One"  "Zero" "Zero" "One"  "One"  "Zero"

Change the probability for certain values/elements to be picked. The first element is picked with probability .7 and the second with probability .3, i.e. a lot more “One”s are being picked.

sample(population, 10, replace = TRUE, prob = c(.7,.3))
##  [1] "One"  "Zero" "One"  "Zero" "One"  "One"  "One"  "One"  "One"  "One"

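Suppose the population is a college of 5000 students and you want a sample of 50 students. First calculate the skip number, which is k = N / n (sampling frame or population size / sample size), here 5000 / 50 = 100. Then select a random number between 1 and k (i.e. 1-100) and use it as the first subject being sampled.

sample(100, 1)
## [1] 29

Starting from that subject, select every k-th subject until the sample is complete. No loop is needed, because seq() generates the whole sequence.

k <- 5000 / 50  # skip number: N / n = 100
seq(from = 29, by = k, length.out = 50)  # subjects 29, 129, 229, ..., 4929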

Descriptive Statistics

A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory and is frequently non-parametric. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in papers reporting on human subjects, typically a table is included giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related co-morbidities, etc.

Some measures that are commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.

Descriptive statistics provide simple summaries about the sample and about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may either form the basis of the initial description of the data as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a particular investigation.

For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. This number is the number of shots made divided by the number of shots taken. For example, a player who shoots 33% is making approximately one shot in every three. The percentage summarizes or describes multiple discrete events. Consider also the grade point average. This single number describes the general performance of a student across the range of their course experiences.

The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation of populations and of economic data was the first way the topic of statistics appeared. More recently, a collection of summarisation techniques has been formulated under the heading of exploratory data analysis: an example of such a technique is the box plot.

In the business world, descriptive statistics provides a useful summary of many types of data. For example, investors and brokers may use a historical account of return behaviour by performing empirical and analytical analyses on their investments in order to make better investing decisions in the future.

Univariate parameters

Univariate analysis involves describing the distribution of a single variable, including its central tendency (including the mean, median, and mode) and dispersion (including the range and quartiles of the data-set, and measures of spread such as the variance and standard deviation). The shape of the distribution may also be described via indices such as skewness and kurtosis. Characteristics of a variable’s distribution may also be depicted in graphical or tabular format, including histograms and stem-and-leaf display.

R: Centrality

Example

Here are two different distributions.

x <- c(2,3,3,4,4,4,4,5,5,6)
y <- c(2,3,3,4,4,4,4,5,5,22)

x is fairly normally distributed and y is highly skewed to the right:

Calculating the mean

The mean is highly sensitive to skewed distributions. The single value of 22 in y instead of 6 in x changes the mean by 1.6. Therefore the median makes more sense as a measure of centrality for y.

mean(x)
## [1] 4
mean(y)
## [1] 5.6

Calculating the median

The median is exactly the same for x and y. The median doesn’t take the value of 22 itself into account, but simply acknowledges the element position in an ordered array of y.

median(x)
## [1] 4
median(y)
## [1] 4

Calculating the mode
The mode is the same for both objects, because the value 4 appears most often. R’s built-in mode() function returns the storage mode of an object (e.g. “numeric”), not the statistical mode, but this alternative works fine enough. which.max(table(x)) returns 1.) the most frequent value, shown as the name of the result, and 2.) its index/location in the tabled array.

which.max(table(x))
## 4 
## 3
which.max(table(y))
## 4 
## 3

For both x and y the mode is 4 and its position in the tabled array is 3.
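Note that which.max() only reports the first maximum, so ties go unnoticed. A small helper can return every value that shares the highest count. The function name `stat_mode` is hypothetical, and the sketch assumes a numeric input vector.

```r
# Hypothetical helper: returns every value that shares the highest count
stat_mode <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  as.numeric(modes)  # assumes a numeric input vector
}

x <- c(2,3,3,4,4,4,4,5,5,6)
stat_mode(x)                 # 4
stat_mode(c(1, 1, 2, 2, 3))  # two modes: 1 and 2
```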

R: Variability

length()

Show the length of a vector.

x <- 1:10
length(x)
## [1] 10

Show the length of a matrix, i.e. the number of all cells.

x <- rbind(c(1:20), c(1:20))
length(x)
## [1] 40

The length of a data frame via length() returns only the number of columns, not the number of all cells.

x <- data.frame(c(1:10), rep("A", 10))
length(x)
## [1] 2

Range, IQR and SD

x <- c(5,4,3,6,7,3,5,7,4,4)
y <- c(0,0,4,8,7,9,9,7,3,1)
z <- c(5,4,3,6,7,3,5,7,4,4,1,5,6,7,3,8,8,3,40)

Quick overview with summary()

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    4.00    4.50    4.80    5.75    7.00

Range

range() does not show the difference between the biggest and smallest value in a vector. It returns the smallest and largest values themselves:

range(x)
## [1] 3 7
range(z)
## [1]  1 40

To get the actual range, subtract the minimum from the maximum (diff(range(x)) gives the same result).

max(x) - min(x)
## [1] 4
max(z) - min(z)
## [1] 39

Interquartile range

The IQR is less affected by outliers. It is the difference between the upper quartile (75%) and the lower quartile (25%), i.e. the spread of the middle 50% of the values.

The difference can be big even without big outliers:

max(x) - min(x)
## [1] 4
IQR(x)
## [1] 1.75

Note: IQR() is the same as:

quantile(x, .75) - quantile(x, .25)
##  75% 
## 1.75

Wow, what a difference with only one single big outlier:

max(z) - min(z)
## [1] 39
IQR(z)
## [1] 3.5
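The standard deviation, the third measure named in the heading, can be computed with sd(). The sketch below reuses the vectors x and z from above and adds a by-hand calculation to show that sd() is the sample standard deviation (dividing by n - 1); outputs are not listed here.

```r
x <- c(5,4,3,6,7,3,5,7,4,4)
z <- c(5,4,3,6,7,3,5,7,4,4,1,5,6,7,3,8,8,3,40)

sd(x)  # sample standard deviation of x
sd(z)  # much larger, inflated by the single value of 40

# sd() written out by hand: square root of the sample variance
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sd_by_hand
```

Like the range, the standard deviation reacts strongly to the outlier in z, whereas the IQR does not.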

Probability

The reason we study statistics and probability together is because when we collect data as part of a statistical study, we want to be able to use what we know about probability to say how likely it is that our results are reliable.

Probability is how likely it is that something will occur. You can write a probability as a fraction, decimal or percent, but all probabilities are numbers equal to or between 0 and 1.

The formula:

\[ P(event) = \frac{\text{outcomes that meet our criteria}}{\text{all possible outcomes (also: sample space)}} \]

Example:

How likely is it to draw a queen from a deck of playing cards? First, how many outcomes or cards meet our criteria? Four. How many possible outcomes are there? A card deck has 52 cards.

\[ P(queen) = \frac{4}{52} \approx 0.0769 \approx 7.7 \% \]

Experimental probability

For example, let’s say we flip a coin four times in a row. As you know, it’s completely possible that, just by chance, we end up with four heads in a row. Based on that result, we might say that the probability of getting heads is

\[ P(heads) = \frac{4}{4}= 1 = 100 \% \]

But how can this be true? Before we saw that the probability of getting heads on one flip was 50 %, but now, after four heads in four flips, we’re calculating the probability of getting heads at 100 %. What’s going on?

We’re looking at the difference between experimental and theoretical probability. Experimental probability (also called empirical probability) is the probability we find when we run experiments, basically the results of an experiment. If we flip the coin a fifth time and get tails this time, then the experimental probability of getting heads after 5 experiments is

\[ P(heads) = \frac{4}{5}= 80\% \]

In other words, the experimental probability of an event will be constantly changing as we run more and more experiments over time. If the experiment is a good one, the idea is that over time the experimental probability will get very close to the theoretical probability.

Theoretical probability

Theoretical probability (also called classical probability) is the probability that an event will occur if you could run an infinite number of experiments. Or, you can think about the theoretical probability as the one we get from the simple probability formula:

\[ P(event) = \frac{\text{outcomes that meet our criteria}}{\text{all possible outcomes}} \]

We know from using this formula that the probability of getting heads when we flip a coin is 50 % . Therefore, the theoretical probability is 50 % , which means that the more experiments we run, the closer our experimental probability should get to 50 % .

This is also called the law of large numbers. It says that, as we run more and more experiments, our experimental probability converges to our theoretical probability.
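The relationship between experimental and theoretical probability can be simulated in R. This is a minimal sketch assuming a fair coin and 10000 simulated flips; the seed and the names `flips` and `running_share` are arbitrary choices for the illustration.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Simulate 10000 flips of a fair coin
flips <- sample(c("heads", "tails"), size = 10000, replace = TRUE)

# Experimental probability of heads after each flip
running_share <- cumsum(flips == "heads") / seq_along(flips)

# Share of heads after 4, 5, 100 and 10000 flips
running_share[c(4, 5, 100, 10000)]
```

The early values can be far from 0.5, while the value after 10000 flips should be close to the theoretical probability.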

Bayes’ theorem (conditional probability)

In probability theory and statistics, Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if the risk of developing health problems is known to increase with age, Bayes’ theorem allows the risk to an individual of a known age to be assessed more accurately (by conditioning it on their age) than simply assuming that the individual is typical of the population as a whole.

Bayes’ theorem is also called Bayes’ Rule or Bayes’ Law and is the foundation of the field of Bayesian statistics.

Bayes’ theorem is stated mathematically as the following equation:

\[ P(A|B)=\frac{P(B|A)*P(A)}{P(B)} \]

Which tells us:

  • how often \(A\) happens given that \(B\) happens, written \(P(A|B)\).

When we know:

  • how often \(B\) happens given that \(A\) happens, written \(P(B|A)\).

  • how likely \(A\) is on its own, written \(P(A)\).

  • how likely \(B\) is on its own, written \(P(B)\).

Example

Let us say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:

  • P(Fire|Smoke) means how often there is fire when we can see smoke 

  • P(Smoke|Fire) means how often we can see smoke when there is fire

So the formula kind of tells us “forwards” P(Fire|Smoke) when we know “backwards” P(Smoke|Fire)

  • dangerous fires are rare (1%), \(P(Fire)\).

  • but smoke is fairly common (10%) due to barbecues, \(P(Smoke)\).

  • and 90% of dangerous fires make smoke, smoke given dangerous fire, \(P(Smoke|Fire)\).

We can then discover the probability of dangerous Fire when there is (given) Smoke:

\[ P(Fire|Smoke) = \frac{P(Smoke|Fire)*P(Fire)}{P(Smoke)}=\frac{90\%*1\%}{10\%}= 9\% \]

So it is still worth checking out any smoke to be sure.
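The same calculation in R, using the three probabilities from the example (the variable names are chosen for this sketch):

```r
p_fire <- 0.01              # P(Fire): dangerous fires are rare
p_smoke <- 0.10             # P(Smoke): smoke is fairly common
p_smoke_given_fire <- 0.90  # P(Smoke|Fire)

# Bayes' theorem: P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
p_fire_given_smoke <- p_smoke_given_fire * p_fire / p_smoke
p_fire_given_smoke  # 0.09, i.e. 9%
```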

Permutations

Permutations and Combinations are the number of ways you can arrange or combine a set of things. Permutations are used when the order does matter and Combinations when it does not. These methods can be used to answer many probability questions.

There are two types of permutations (remember the order does matter).

Permutations with repetition

Example: a digit lock (it could be 333). How many different ways are there to set a 3-digit lock with digits from 0 to 9?

The formula:

\[ n^r \]

where \(n\) is the number of distinct things to choose from and we choose \(r\) of them.

The answer:

\[ n^r=10^3=1000 \]

Permutations without repetition

Example 1): arrangement of 16 pool balls. You pick one ball and the next has one option less to choose from. How many different ways are there to arrange 16 distinct pool balls?

The formula:

\[ P(n,r)=\frac{n!}{(n-r)!} \]

Example 1) the answer:

\[ P(n,r)=\frac{n!}{(n-r)!} = P(16,16) = \frac{16!}{(16-16)!}=\frac{16!}{1}=20{,}922{,}789{,}888{,}000 \]

When we calculate Permutations without repetition, each pick leaves one option fewer to choose from, because we’ve already used one distinct value. I.e. \(5*4*3\) instead of \(5*5*5\). This process of repeatedly decrementing a value by 1 is mathematically notated by the factorial, e.g. \(5! = 5*4*3*2*1\). Because we only choose three (\(r\)) of the 5 (\(n\)) values, we have to divide by the permutations of the values left over, \((n-r)! = 2! = 2*1\).

Example 2) How many different three letter “words” can be formed by the letters abcde?

Example 2) the answer:

\[ P(5,3)=\frac{5!}{(5-3)!}=\frac{5!}{2!}= \frac{5*4*3*2*1}{2*1}= 5*4*3=60 \]

Same as:

abc,abd,abe,acb,acd,ace,adb,adc,ade,aeb,aec,aed (12 with a first)

bac,bad,bae,…. (so 12 times 5 = 60).
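Both kinds of permutations can be checked in R. The helper `n_permutations` is a hypothetical name introduced for this sketch; the cross-check lists all ordered triples of the letters abcde with expand.grid() and counts those without a repeated letter.

```r
# Permutations with repetition: n^r
10^3  # 3-digit lock: 1000

# Hypothetical helper for permutations without repetition: n! / (n - r)!
n_permutations <- function(n, r) factorial(n) / factorial(n - r)

n_permutations(16, 16)  # all 16 pool balls: 16!
n_permutations(5, 3)    # three-letter "words" from abcde: 60

# Cross-check by listing every ordered triple and dropping repeats
letters5 <- c("a", "b", "c", "d", "e")
grid <- expand.grid(first = letters5, second = letters5, third = letters5,
                    stringsAsFactors = FALSE)
distinct <- grid$first != grid$second &
  grid$first != grid$third &
  grid$second != grid$third
sum(distinct)  # 60
```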

Combinations

On the other hand, a Combination is the number of ways you can select things from a set when the order doesn’t matter.

There are also two types of combinations (remember the order does not matter now):

Combinations with repetitions

Example: coins in your pocket (5,5,5,10,10)

This is actually the most complicated and I haven’t had the need to use it so far.

Combinations without repetition

Example: lottery numbers (2,14,15,27). How many combinations are there in choosing 5 lottery numbers from 1 to 10?

The formula, also called the Binomial Coefficient:

\[ C(n,r)=\binom{n}{r}=\frac{n!}{r!(n-r)!} \]

The answer:

\[ C(10,5)=\frac{10!}{5!(10-5)!}=\frac{10!}{5!*5!}=252 \]

Same as:

1,2,3,4,5 (counted the same as 2,1,3,4,5 and 3,1,2,4,5 etc.) - 1,2,3,4,6 - …
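In R the binomial coefficient is available as choose(), and combn() lists the combinations themselves. A short check of the example above:

```r
choose(10, 5)                                  # binomial coefficient: 252
factorial(10) / (factorial(5) * factorial(5))  # same value from the formula

# combn() returns every combination as one column of a matrix
combos <- combn(10, 5)
ncol(combos)  # 252
combos[, 1]   # the first combination: 1 2 3 4 5
```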

Random Variables

A variable can take at least two different values. For a random sample or randomized experiment, each possible outcome has a probability that it occurs. The variable itself is sometimes then referred to as a random variable. This terminology emphasizes that the outcome varies from observation to observation according to random variation that can be summarized by probabilities.

Another way to think about the concept of random variables is as placeholders for otherwise long explanations of events. For example, the notation of the probability of getting two heads in an experiment of two coin flips could be written as \(P(\text{getting two heads in 2 flips})\). But if you use a variable X, usually written as a capital letter, that increases by \(1\) every time a flip shows a head you could write it as such: \(P(X=2)\). Much neater.

Discrete random variables

A variable is discrete if the possible outcomes are a set of separate values, such as a variable expressed as “the number of …” with possible values 0, 1, 2, …. In short: if you can count the outcomes, the variable is discrete.

Example: Let y denote the response to the question “What do you think is the ideal number of children for a family to have?” This is a discrete variable, taking the possible values 0, 1, 2, 3, and so forth. According to recent General Social Surveys, for a randomly chosen person in the United States the probability distribution of y is approximately as the table shows. The table displays the recorded y-values and their probabilities. For instance, P(4), the probability that y = 4 children is regarded as ideal, equals 0.12. Each probability in the table is between 0 and 1, and the sum of the probabilities equals 1.

A histogram can portray the probability distribution. The rectangular bar over a possible value of the variable has height equal to the probability of that value. The bar over the value 4 has height 0.12, the probability of the outcome 4.

Continuous random variables

A variable is continuous if the possible outcomes form an infinite continuum, such as all the real numbers between 0 and 1. In short: if you measure the outcomes rather than count them, the variable is continuous. A probability distribution describes the possible outcomes and their probabilities.

Example: Commuting time to work. A recent U.S. Census Bureau study about commuting time for workers in the United States who commute to work measured y = travel time, in minutes. The probability distribution of y provides probabilities such as P(y < 15), the probability that travel time is less than 15 minutes, or P(30 < y < 60), the probability that travel time is between 30 and 60 minutes.

This figure portrays the probability distribution of y. The shaded area in the figure refers to the region of values higher than 45. This area equals 15% of the total area under the curve, representing the probability of 0.15 that commuting time is more than 45 minutes. Those regions in which the curve has relatively high height have the values most likely to be observed.

Transforming random variables (shift or scale)

When transforming a random variable, the corresponding metrics change depending on whether the variable is being shifted, by adding or subtracting a constant (+k or -k), or scaled, by multiplying or dividing by a constant (*k or / k).

Shifting

The measures of centrality change by + or - k.

The measures of variability don’t change.

Scaling

The measures of centrality change by * or / k.

The measures of variability change by * or / k (the standard deviation, range and IQR by the absolute value of k, the variance by k²).
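The shifting and scaling rules can be verified on a small vector in R. The data and the constant k = 10 are arbitrary choices for this sketch.

```r
x <- c(5,4,3,6,7,3,5,7,4,4)
k <- 10  # arbitrary constant

shifted <- x + k
scaled  <- x * k

# Centrality: moves by + k when shifting and by * k when scaling
c(mean(x), mean(shifted), mean(scaled))
c(median(x), median(shifted), median(scaled))

# Variability: unchanged when shifting, multiplied by k when scaling
c(sd(x), sd(shifted), sd(scaled))
c(IQR(x), IQR(shifted), IQR(scaled))

# The variance is multiplied by k^2 when scaling
c(var(x), var(scaled))
```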

Probability distribution parameters

Like a population distribution, a probability distribution has parameters describing center and variability. The mean describes center and the standard deviation describes variability. The parameter values are the values these measures would assume, in the long run, if the randomized experiment or random sample repeatedly took observations on the variable y having that probability distribution.

Mean or expected value

For example, suppose we take observations from the distribution in child number preferences. Over the long run, we expect y = 0 to occur 1% of the time, y = 1 to occur 3% of the time, and so forth. In 100 observations, for instance, we expect about:

one 0, three 1’s, sixty 2’s, twenty-three 3’s, twelve 4’s, and one 5.

In that case, since the mean equals the total of the observations divided by the sample size, the mean equals:

\[ \frac{0(1) + 1(3) + 2(60) + 3(23) + 4(12) + 5(1)}{100}=\frac{245}{100}=2.45 \]

The calculation has the form:

\[ 0(0.01)+ 1(0.03) + 2(0.60) + 3(0.23) + 4(0.12) + 5(0.01) \]

The mean of a probability distribution is also called the expected value. The terminology reflects that E(y) represents what we expect for the average value of y in a long series of observations.
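The expected value is each outcome multiplied by its probability, summed over all outcomes. In R, using the probabilities from the calculation above (the vector names `y` and `p` are chosen for this sketch):

```r
y <- 0:5                                    # ideal number of children
p <- c(0.01, 0.03, 0.60, 0.23, 0.12, 0.01)  # probabilities from the example

sum(p)  # the probabilities add up to 1

expected_value <- sum(y * p)
expected_value  # 2.45
```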

Standard deviation

The standard deviation of a probability distribution, denoted by σ, measures its variability. The more spread out the distribution, the larger the value of σ. The Empirical Rule helps us to interpret σ. If a probability distribution is bell shaped, about 68% of the probability falls between μ−σ and μ + σ, about 95% falls between μ − 2σ and μ + 2σ, and all or nearly all falls between μ − 3σ and μ + 3σ.

The standard deviation σ is the square root of the variance σ² of the probability distribution. The variance measures the average squared deviation of an observation from the mean. That is, it is the expected value of (y − μ)² . In the discrete case, the formula is

\[ \sigma^2 = E\left[(y-\mu)^2\right] = \sum (y-\mu)^2 * P(y) \]
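Applied to the distribution of child number preferences used for the expected value, the discrete formula can be evaluated in R. The vector names are chosen for this sketch.

```r
y <- 0:5                                    # ideal number of children
p <- c(0.01, 0.03, 0.60, 0.23, 0.12, 0.01)  # their probabilities

mu <- sum(y * p)                 # expected value: 2.45
variance <- sum((y - mu)^2 * p)  # expected squared deviation from the mean
sigma <- sqrt(variance)          # standard deviation

variance  # 0.6675
sigma     # roughly 0.82
```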

Normal Distribution

Early statisticians noticed the same shape coming up over and over again in different distributions—so they named it the normal distribution. In probability theory, a normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a type of continuous probability distribution for a real-valued random variable. 

The parameter μ is the mean or expectation of the distribution (and also its median and mode), while the parameter σ is its standard deviation. The variance of the distribution is σ². A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.

Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.

Moreover, Gaussian distributions have some unique properties that are valuable in analytic studies. For instance, any linear combination of a fixed collection of normal deviates is a normal deviate. Many results and methods, such as propagation of uncertainty and least squares parameter fitting, can be derived analytically in explicit form when the relevant variables are normally distributed.

A normal distribution is sometimes informally called a bell curve. However, many other distributions are bell-shaped (such as the Cauchy, Student’s t, and logistic distributions).

Properties

Normal distributions have the following features:

  • symmetric bell shape

  • mean and median are equal, both located at the center of the distribution

  • ≈ 68% of the data falls within 1 standard deviation of the mean

  • ≈ 95% of the data falls within 2 standard deviations of the mean

  • ≈ 99.7% of the data falls within 3 standard deviations of the mean

The normal distribution is symmetric, bell shaped, and characterized by its mean μ and standard deviation σ. The probability within any particular number of standard deviations of μ is the same for all normal distributions. This probability (rounded off) equals 0.68 within 1 standard deviation, 0.95 within 2 standard deviations, and 0.997 within 3 standard deviations.

For example, heights of adult females in North America have approximately a normal distribution with μ = 65.0 inches and σ = 3.5 inches. The probability is nearly 1.0 that a randomly selected female has height between μ − 3σ = 65.0 − 3(3.5) = 54.5 inches and μ + 3σ = 65.0 + 3(3.5) = 75.5 inches. Adult male height has approximately a normal distribution with μ = 70.0 inches and σ = 4.0 inches. So, the probability is nearly 1.0 that a randomly selected male has height between μ − 3σ = 70.0 − 3(4.0) = 58.0 inches and μ + 3σ = 70.0 + 3(4.0) = 82.0 inches.
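The 68/95/99.7 figures can be checked numerically. The sketch below uses base R's pnorm() to compute the probability within 1, 2, and 3 standard deviations, first for the standard normal and then for the female-height parameters quoted above; the variable names are assumptions made for illustration.

```r
# Probability of falling within k standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)

# Same calculation with the female-height parameters (mu = 65.0, sigma = 3.5)
mu <- 65.0
sigma <- 3.5
within_heights <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)
round(within_heights, 4)
```

Both calls return the same three probabilities, which illustrates that the result does not depend on μ or σ.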

Drawing a normal distribution

Example: the trunk diameter of a certain variety of pine trees is normally distributed with a mean of μ = 150 cm and a standard deviation of σ = 30 cm.

Sketch a normal curve that describes this distribution.

Solution:

Step 1: Sketch a normal curve.

Step 2: The mean of 150 cm goes in the middle.

Step 3: Each standard deviation is a distance of 30 cm.

Finding percentages

Example: a certain variety of tree has a mean trunk diameter of μ = 150 cm and a standard deviation of σ = 30 cm.

Approximately what percent of these trees have a diameter greater than 210 cm?

Solution:

Step 1: Sketch a normal distribution with a mean of μ = 150 cm and a standard deviation of σ = 30 cm.

Step 2: The diameter of 210 cm is two standard deviations above the mean. Shade above that point.

Step 3: Add the percentages in the shaded area:

\[ 2.35\% + 0.15\% = 2.5\% \]

About 2.5% of these trees have a diameter greater than 210 cm.
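The percentages read off the sketch are rounded. As a rough cross-check, the same tail area can be computed with pnorm(); the variable names here are illustrative.

```r
# Exact upper-tail probability for a diameter above 210 cm
p_exact <- pnorm(210, mean = 150, sd = 30, lower.tail = FALSE)

# Value obtained by adding the rounded areas from the sketch
p_rule <- 0.0235 + 0.0015

c(exact = p_exact, rule_of_thumb = p_rule)
```

The exact value is about 2.3%, slightly below the 2.5% from the rounded areas.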

Finding whole counts

Example: a certain variety of pine tree has a mean trunk diameter of μ = 150 cm and a standard deviation of σ = 30 cm. A certain section of a forest has 500 of these trees.

Approximately how many of these trees have a diameter smaller than 120 cm?

Solution:

Step 1: Sketch a normal distribution with a mean of μ = 150 cm and a standard deviation of σ = 30 cm.

Step 2: The diameter of 120 cm is one standard deviation below the mean. Shade below that point.

Step 3: Add the percentage in the shaded area:

\[ 0.15+2.35+13.5=16\% \]

About 16 % of these trees have a diameter smaller than 120 cm.

Step 4: Find how many trees in the forest that percent represents.

We need to find how many trees 16 % of 500 is.

\[ 16\% \text{ of }500=0.16*500=80 \]

About 80 trees have a diameter smaller than 120 cm.
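The same count can be approximated directly in R. This is a sketch with illustrative variable names; it uses the exact normal tail area rather than the rounded 16%.

```r
# Expected number of trees (out of 500) with diameter below 120 cm
n_trees <- 500
p_below <- pnorm(120, mean = 150, sd = 30)   # left-tail probability
expected_count <- n_trees * p_below
round(expected_count)
```

The exact tail area gives a count just under 80, in line with the rounded answer.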

Standard score (z-scores)

In statistics, the standard score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.

It is calculated by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation. This process of converting a raw score into a standard score is called standardizing or normalizing (however, “normalizing” can refer to many types of ratios; see normalization for more).

Standard scores are most commonly called z-scores; the two terms may be used interchangeably, as they are in this article. Other terms include z-values, normal scores, and standardized variables.

Computing a z-score requires knowing the mean and standard deviation of the complete population to which a data point belongs; if one only has a sample of observations from the population, then the analogous computation with sample mean and sample standard deviation yields the t-statistic.

A z-score measures exactly how many standard deviations above or below the mean a data point is.

Here is the formula:

\[ z = \frac{\text{data point - mean}}{\text{standard deviation}} \]

Here is the same formula written with symbols:

\[ z = \frac{x-μ}{σ} \]

Here are some important facts about z-scores:

  1. A positive z-score says the data point is above average.
  2. A negative z-score says the data point is below average.
  3. A z-score close to 0 says the data point is close to average.
  4. A data point can be considered unusual if its z-score is above 3 or below − 3.

Example

The grades on a history midterm at Almond have a mean of μ = 85 and a standard deviation of σ = 2.

Michael scored 86 on the exam. Find the z-score for Michael’s exam grade.

\[ z = \frac{\text{his grade - mean grade}}{\text{standard deviation}} \]

\[ z = \frac{86-85}{2}=\frac{1}{2}=0.5 \]

Michael’s z-score is 0.5. His grade was half of a standard deviation above the mean.
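The z-score formula translates directly into a small R helper. The function name z_score() and the vector of extra grades are hypothetical, introduced only for illustration.

```r
# z-score: number of standard deviations a value lies from the mean
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

# The exam grade from the example (mean 85, standard deviation 2)
z_score(86, mu = 85, sigma = 2)

# The function is vectorised, so several (hypothetical) grades work at once
grades <- c(81, 85, 86, 92)
z_score(grades, mu = 85, sigma = 2)
```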

R: Calculate probability from z-value

If you have a z-value (a number of standard deviations from the mean) and want to find the corresponding probability, use the pnorm() function.

Example 1

The height of middle school students is normally distributed with a mean of 150 cm and a standard deviation of 20 cm. Let the random variable X denote the height of a randomly selected student. What is the probability that X is greater than 170 cm?

The pnorm() function in R accepts a z-value and returns the cumulative probability below that value. Here, 170 cm lies (170 − 150)/20 = 1 standard deviation above the mean, so z = 1. In our example we need the cumulative probability above that value (the right side of the curve) and therefore need to subtract the function value from 1.

P.X = 1-pnorm(1)
P.X
## [1] 0.1586553

Or even more elaborate using all parameters in the pnorm() function.

P.X = pnorm(q = 170, mean = 150, sd = 20, lower.tail = F)
P.X
## [1] 0.1586553

Answer: the probability that a randomly selected student is taller than 170 cm, i.e. more than one standard deviation above the mean, is about 15.9 %.

Example 2

A set of middle school student heights are normally distributed with a mean of 150 cm and a standard deviation of 20 centimeters. Let X = the height of a randomly selected student from this set.

Find out: \(P(140 < X < 154)\).

Step 1: Find the left-tail probability for X below 154 cm: 0.58.

Step 2: Find the left-tail probability for X below 140 cm: 0.31.

Step 3: To find the probability that X lies between 140 and 154, subtract the second probability from the first.

pnorm(q = 154, mean = 150, sd = 20, lower.tail = T) - pnorm(q = 140, mean = 150, sd = 20, lower.tail = T)
## [1] 0.2707222

Answer: the probability for a randomly selected student to have a height between 140 and 154 cm is about 27 %.
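The same interval probability can be obtained by first standardizing the two bounds to z-values. The sketch below, with illustrative variable names, shows that both routes agree.

```r
mu <- 150
sigma <- 20

# Standardize the bounds of the interval
z_low <- (140 - mu) / sigma
z_high <- (154 - mu) / sigma

# Probability from z-values (standard normal)
p_z <- pnorm(z_high) - pnorm(z_low)

# Probability from raw values, passing mean and sd to pnorm()
p_raw <- pnorm(154, mean = mu, sd = sigma) - pnorm(140, mean = mu, sd = sigma)

c(from_z = p_z, from_raw = p_raw)
```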

Find probabilities if you have a z-value
For left-tail (cumulative) probabilities, i.e. the probability of falling below mean + z × standard deviation in a normal distribution, pass the specific z-value to pnorm().

pnorm(2) # cumulative probability below mu + 2.0(sigma)
## [1] 0.9772499
pnorm(-1) # cumulative probability below mu - 1.0(sigma)
## [1] 0.1586553

To find the right-tail probabilities for a specific z value in a normal distribution subtract it from 1.

1-pnorm(2) # probability above mu + 2.0(sigma)
## [1] 0.02275013
1-pnorm(-1) # probability above mu - 1.0(sigma)
## [1] 0.8413447

The probability of falling within 2 standard deviations of the mean:

1 - 2 * pnorm(-2)
## [1] 0.9544997

Show that about 90% of a normal probability distribution falls between mean − 1.64 standard deviations and mean + 1.64 standard deviations.

1 - 2 * pnorm(-1.64)
## [1] 0.8989948

R: Calculate z-value from probability

If, on the other hand, you have the probability for an event and want to find the corresponding z-value, you can use the qnorm() function. qnorm() accepts a cumulative probability of a normal distribution and returns the z-value associated with that probability. Notice that qnorm() also works from the left side of the distribution. By symmetry, the z-value that cuts off the same probability in the right tail is its negative, so simply put a minus in front.

qnorm(0.1586)
## [1] -1.000228
-qnorm(0.1586)
## [1] 1.000228

Find z-value if you have a probability
To find the z-values corresponding to specific probability values, we can use qnorm().

qnorm(0.975) # q denotes "quantile"; .975 quantile = 97.5 percentile
## [1] 1.959964
# The z-value is 1.96, rounded to two decimals

More checks.

qnorm(.05)
## [1] -1.644854
qnorm(0.01)
## [1] -2.326348
# To get the z-value for a right-tail probability.
-qnorm(0.005)
## [1] 2.575829
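These quantiles are the critical values used for two-sided intervals. As a sketch with illustrative variable names, the lines below derive the critical z-value for several confidence levels and then use pnorm() to confirm that the central area matches.

```r
# Critical z-values for two-sided 90%, 95% and 99% levels
conf_level <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_level
z_crit <- qnorm(1 - alpha / 2)   # leaves alpha / 2 in each tail
round(z_crit, 3)

# pnorm() inverts qnorm(): the central area equals the confidence level
central_area <- pnorm(z_crit) - pnorm(-z_crit)
central_area
```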

Binomial Distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution remains a good approximation, and is widely used.

Mean/Expected value

If X ~ B(n, p), that is, X is a binomially distributed random variable, n being the total number of experiments and p the probability of each experiment yielding a successful result, then the expected value of X is:

\[ E(X)=np \]

This follows from the linearity of the expected value along with the fact that X is the sum of n identical Bernoulli random variables, each with expected value p. In other words, if \(X_1,...,X_n\) are identical (and independent) Bernoulli random variables with parameter p, then \(X=X_1+...+X_n\) and \(E(X)=E[X_1+...+X_n]=E(X_1)+...+E(X_n)=p+...+p=np\).

Variance

The variance is:

\[ Var(X)=np(1-p) \]

This similarly follows from the fact that the variance of a sum of independent random variables is the sum of the variances.
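Both formulas can be checked against the probability mass function. The sketch below uses dbinom() with n = 12 and p = 0.2, values chosen here purely for illustration.

```r
n <- 12
p <- 0.2
k <- 0:n                                  # possible numbers of successes
pmf <- dbinom(k, size = n, prob = p)      # P(X = k)

mean_x <- sum(k * pmf)                    # E(X) computed from the pmf
var_x <- sum((k - mean_x)^2 * pmf)        # Var(X) computed from the pmf

c(mean_from_pmf = mean_x, np = n * p,
  var_from_pmf = var_x, npq = n * p * (1 - p))
```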

R: Calculate probability

Example

Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

Answer

Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5 = 0.2. We can find the probability of having exactly 4 correct answers by random attempts as follows.

We use the dbinom() function for a single successful outcome, e.g. \(r=4\), and pbinom() for cumulative successful outcomes, e.g. \(r \le 4\).

dbinom(x = 4, size = 12, prob = 0.2)
## [1] 0.1328756

The probability of four or fewer successful outcomes:

pbinom(4, 12, 0.2)
## [1] 0.9274445
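The cumulative value is just the sum of the individual probabilities for 0 to 4 correct answers. A short sketch (variable names are illustrative) makes that relationship explicit and also gives the complementary probability of more than four correct answers.

```r
# P(X <= 4) as a sum of point probabilities
p_at_most_4 <- sum(dbinom(0:4, size = 12, prob = 0.2))

# The same value from the cumulative distribution function
p_cdf <- pbinom(4, size = 12, prob = 0.2)

# P(X > 4): the upper tail
p_more_than_4 <- pbinom(4, size = 12, prob = 0.2, lower.tail = FALSE)

c(sum_of_dbinom = p_at_most_4, pbinom = p_cdf, upper_tail = p_more_than_4)
```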

Bernoulli distribution

The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution). It is also a special case of the two-point distribution, for which the possible outcomes need not be 0 and 1.

In probability theory and statistics, the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value \(1\) with probability \(p\) and the value \(0\) with probability \(q = 1 −p\). Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are boolean-valued: a single bit whose value is success/yes/true/one with probability p and failure/no/false/zero with probability \(q\). It can be used to represent a (possibly biased) coin toss where 1 and 0 would represent “heads” and “tails” (or vice versa), respectively, and p would be the probability of the coin landing on heads or tails, respectively. In particular, unfair coins would have \(p ≠ 1/2\).

Properties

If \(X\) is a random variable with this distribution, then:

\[ Pr(X=1)=p=1-Pr(X=0)=1-q \]


The Bernoulli distribution is a special case of the binomial distribution with \(n=1\).

Mean

The expected value of a Bernoulli random variable X is

\[ E(X) = p \]

This is due to the fact that for a Bernoulli distributed random variable \(X\) with \(Pr(X=1)=p\) and \(Pr(X=0)=q\) we find

\[ E(X)=Pr(X=1)*1+Pr(X=0)*0=p*1+q*0=p \]

Variance

The variance of a Bernoulli distributed X is

\[ Var(X)=pq=p(1-p) \]

We first find

\[ E(X^2)=Pr(X=1)*1^2+Pr(X=0)*0^2=p*1^2+q*0^2=E(X) \]

From this follows

\[ Var(X)=E(X^2)-E(X)^2=E(X)-E(X)^2=p-p^2=p(1-p)=pq \]

With this result it is easy to prove that, for any Bernoulli distribution, its variance will have a value between \(0\) and \(1/4\).
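A quick numerical illustration of that bound, as a sketch on a grid of p values chosen for illustration: the variance p(1 − p) is largest at p = 1/2, where it equals 1/4.

```r
# Bernoulli variance p * (1 - p) on a grid of success probabilities
p <- seq(0, 1, by = 0.01)
v <- p * (1 - p)

max(v)            # largest variance on the grid
p[which.max(v)]   # value of p where it occurs
range(v)          # smallest and largest variance
```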

R: Calculate probability

The required package for Bernoulli operations in R is called Rlab.

It provides the density, distribution function, quantile function and random generation for the Bernoulli distribution with parameter prob.

Rlab::dbern(x = 3, prob = .5)
## [1] 0

The result is 0 because x = 3 lies outside the support {0, 1} of the Bernoulli distribution.
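Because a Bernoulli distribution is a binomial distribution with n = 1, the same quantities are available in base R without an extra package. This is an alternative sketch, not the Rlab approach; the success probability 0.3 and the seed are hypothetical.

```r
p <- 0.3   # hypothetical success probability

# P(X = 0) and P(X = 1) via the binomial pmf with a single trial
dbinom(c(0, 1), size = 1, prob = p)

# Values outside the support {0, 1} have probability 0
dbinom(3, size = 1, prob = 0.5)

# Ten simulated Bernoulli trials (seed fixed for reproducibility)
set.seed(1)
tosses <- rbinom(10, size = 1, prob = p)
tosses
```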

Geometric Distribution

In probability theory and statistics, the geometric distribution is either one of two discrete probability distributions:

  • The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, … }

  • The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, … }

Which of these one calls “the” geometric distribution is a matter of convention and convenience.

These two different geometric distributions should not be confused with each other. Often, the name shifted geometric distribution is adopted for the former one (distribution of the number X); however, to avoid ambiguity, it is considered wise to indicate which is intended, by mentioning the support explicitly.

The geometric distribution gives the probability that the first occurrence of success requires k independent trials, each with success probability p. If the probability of success on each trial is p, then the probability that the kth trial (out of k trials) is the first success is

\[ Pr(X=k)=(1-p)^{k-1}p \]

for k = 1, 2, 3, ….

The above form of the geometric distribution is used for modeling the number of trials up to and including the first success. By contrast, the following form of the geometric distribution is used for modeling the number of failures until the first success:

\[ Pr(Y=k)=Pr(X=k+1)=(1-p)^kp \]

for k = 0,1,2,3, …

In either case, the sequence of probabilities is a geometric sequence.

For example, suppose an ordinary die is thrown repeatedly until the first time a “1” appears. The probability distribution of the number of times it is thrown is supported on the infinite set { 1, 2, 3, … } and is a geometric distribution with p = 1/6.
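The die example can be used to compare the two formulas with R's dgeom(), which counts failures before the first success. The sketch below (illustrative variable names) evaluates the trials-based formula by hand and shows that successive probabilities have the constant ratio 1 − p.

```r
p <- 1 / 6   # probability of rolling a "1"
k <- 1:6     # number of throws needed

# Pr(X = k) = (1 - p)^(k - 1) * p, computed directly
pmf_trials <- (1 - p)^(k - 1) * p

# dgeom() takes the number of failures, i.e. k - 1
pmf_dgeom <- dgeom(k - 1, prob = p)

# Successive probabilities form a geometric sequence with ratio 1 - p
ratios <- pmf_trials[-1] / pmf_trials[-length(pmf_trials)]

round(cbind(k, pmf_trials, pmf_dgeom), 4)
```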

The geometric distribution is denoted by Geo(p) where 0 < p ≤ 1.

Consider a sequence of trials, where each trial has only two possible outcomes (designated failure and success). The probability of success is assumed to be the same for each trial. In such a sequence of trials, the geometric distribution is useful to model the number of failures before the first success. The distribution gives the probability that there are zero failures before the first success, one failure before the first success, two failures before the first success, and so on.

When is the geometric distribution an appropriate model?
The geometric distribution is an appropriate model if the following assumptions are true.

  • The phenomenon being modeled is a sequence of independent trials.

  • There are only two possible outcomes for each trial, often designated success or failure.

  • The probability of success, p, is the same for every trial.

If these conditions are true, then the geometric random variable Y is the count of the number of failures before the first success. The possible number of failures before the first success is 0, 1, 2, 3, and so on. In the graphs above, this formulation is shown on the right.

An alternative formulation is that the geometric random variable X is the total number of trials up to and including the first success, and the number of failures is X − 1. In the graphs above, this formulation is shown on the left.

R: Calculate probability

R function dgeom() calculates the probability of x failures prior to the first success.

Example: I pick cards from a standard deck until I get a king (I replace the cards if they are not a king). What is the probability that I need to pick 5 cards, i.e. I pick 4 unsuccessful cards (x = 4) and the fifth is a success?

dgeom(x = 4, prob = 1/13)
## [1] 0.05584808

The function pgeom() gives the cumulative probability of q or fewer failures prior to the first success.

Example 1: What is the probability that I pick fewer than 10 cards? Picking at most 9 cards means at most 8 failures before the first king.

pgeom(q = 8, prob = 1/13)
## [1] 0.5134348

Example 2: What is the probability that I need to pick more than 12 cards? Needing more than 12 cards means at least 12 failures, i.e. more than 11.

pgeom(q = 11, prob = 1/13, lower.tail = FALSE)
## [1] 0.3826967
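A closed form offers a cross-check on tail probabilities of this kind: the first n cards are all failures with probability (1 − p)^n, so the number of cards X needed satisfies P(X > n) = (1 − p)^n. The sketch below, with illustrative variable names, compares that formula with pgeom(), remembering that pgeom() counts failures rather than cards.

```r
p <- 1 / 13   # probability of drawing a king

# P(fewer than 10 cards) = P(X <= 9) = 1 - (1 - p)^9
p_at_most_9_cards <- 1 - (1 - p)^9

# P(more than 12 cards) = P(X > 12) = (1 - p)^12
p_more_than_12_cards <- (1 - p)^12

# The same quantities via pgeom(): at most 8 failures, more than 11 failures
via_pgeom <- c(pgeom(8, prob = p),
               pgeom(11, prob = p, lower.tail = FALSE))

rbind(closed_form = c(p_at_most_9_cards, p_more_than_12_cards),
      pgeom = via_pgeom)
```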

Inferential Statistics

With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what’s going on in our data.

Sampling distributions

A sampling distribution is the probability distribution of a statistic computed from random samples of a given population. Also known as a finite-sample distribution, it describes how much the statistic varies from one sample to the next for a specific population.

1. Sampling distribution of mean

You can calculate the mean of every sample group chosen from the population and plot out all these means. For sufficiently large samples the graph will be approximately normal, and its center will be the mean of the sampling distribution, which equals the mean of the entire population.

2. Sampling distribution of proportion

It gives you information about proportions in a population. You would select samples from the population and compute the sample proportion for each. The mean of all the sample proportions that you calculate from each sample group equals the proportion of the entire population.

3. T-distribution

The t-distribution is used when the sample size is small or the population standard deviation is unknown. It is used in estimating the mean of the population, constructing confidence intervals, testing statistical differences, and linear regression.

Central Limit Theorem

The Central Limit Theorem (CLT) tells us that when the number of draws, also called the sample size, is large, the probability distribution of the sum of the independent draws is approximately normal. Because sampling models are used for so many data generation processes, the CLT is considered one of the most important mathematical insights in history.
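A small simulation can make this concrete. The sketch below repeatedly draws 30 values from a strongly skewed exponential distribution and records the sample mean (the sum divided by the number of draws). The distribution, sample size, repetition count, and seed are all assumptions chosen for illustration.

```r
set.seed(42)        # fixed seed so the simulation is reproducible
n_draws <- 30       # sample size per repetition
n_reps <- 10000     # number of repeated samples

# Mean of n_draws exponential(1) draws, repeated n_reps times
sample_means <- replicate(n_reps, mean(rexp(n_draws, rate = 1)))

mean(sample_means)     # close to the population mean, 1
sd(sample_means)       # close to the theoretical value below
1 / sqrt(n_draws)      # population sd (1) divided by sqrt(n)

# hist(sample_means, breaks = 50)  # uncomment to see the bell-like shape
```

Although individual exponential draws are far from normal, the histogram of the sample means is roughly bell-shaped.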

How large is large enough for CLT?

The CLT works when the number of draws is large. But large is a relative term. In many circumstances as few as 30 draws is enough to make the CLT useful. In some specific instances, as few as 10 is enough. However, these should not be considered general rules. Note, for example, that when the probability of success is very small, we need much larger sample sizes.

By way of illustration, let’s consider the lottery. In the lottery, the chances of winning are less than 1 in a million. Thousands of people play so the number of draws is very large. Yet the number of winners, the sum of the draws, ranges between 0 and 4. This sum is certainly not well approximated by a normal distribution, so the normal approximation suggested by the CLT is not useful here, even with the very large sample size. This is generally true when the probability of a success is very low. In these cases, the Poisson distribution is more appropriate.
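To see how the approximations compare, the sketch below evaluates the exact binomial probabilities for 0 to 4 winners next to the Poisson and normal approximations. The number of tickets and the winning chance are hypothetical values picked so that about one winner is expected; they are not taken from the text.

```r
n <- 1e6    # hypothetical number of tickets sold
p <- 1e-6   # hypothetical chance of winning per ticket
k <- 0:4    # number of winners

exact <- dbinom(k, size = n, prob = p)            # exact binomial probabilities
pois_approx <- dpois(k, lambda = n * p)           # Poisson approximation
norm_approx <- dnorm(k, mean = n * p,
                     sd = sqrt(n * p * (1 - p)))  # normal density approximation

round(cbind(k, exact, pois_approx, norm_approx), 4)
```

Under these assumptions the Poisson column tracks the exact probabilities closely, while the normal column does not.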

Experimental simulation illustrating the CLT: https://onlinestatbook.com/stat_sim/sampling_dist/

Confidence intervals and standard errors

When you make an estimate in statistics, whether it is a summary statistic or a test statistic, there is always uncertainty around that estimate because the number is based on a sample of the population you are studying. 

The confidence interval is a range of values, computed from your sample, that is constructed so that it would contain the true population value a certain percentage of the time if you ran your experiment again or re-sampled the population in the same way.

The confidence level is the percentage of such intervals that you expect to contain the true value, and is set by the alpha value.

What exactly is the confidence interval?

A confidence interval is your point estimate plus and minus a margin that reflects the variation in that estimate. This is the range of values you expect to capture the true value if you redo your test, within a certain level of confidence.

Confidence, in statistics, is another way to describe probability. For example, if you construct confidence intervals with a 95% confidence level, you expect that about 95 out of 100 such intervals will contain the true value between their upper and lower bounds.

Your desired confidence level is usually one minus the alpha (α) value you used in your statistical test:

Confidence level = 1 − α

So if you use an alpha value of 0.05 for statistical significance (i.e. you require p < 0.05), then your confidence level would be 1 − 0.05 = 0.95, or 95%.

When do you use confidence intervals?

You can calculate confidence intervals for many kinds of statistical estimates, including:

  • Proportions

  • Population means

  • Differences between population means or proportions

  • Estimates of variation among groups

These are all point estimates, and don’t give any information about the variation around the number. Confidence intervals are useful for communicating the variation around a point estimate.

What is the difference between standard errors and confidence intervals?

Both standard errors and confidence intervals show basically the same thing: how much your estimate can vary.

The standard error is a measure of the statistical accuracy of an estimate. It is enough for a statistician to see the standard error to get a notion about the p-value and also about the confidence interval.

A confidence interval uses the estimated standard error to draw an interval with a certain level of confidence around your estimated effect. This level of confidence is expressed as the percentage of such intervals that contain the true population parameter.

A 95% confidence interval means that, if you repeated the sampling many times, about 95 out of 100 of the intervals constructed this way would contain the true value of the population parameter that you are looking for (the effect, the difference, …).

How do you calculate the 95% confidence interval?

\[ \text{lower boundary} = X - z \times \text{standard error} \]

\[ \text{upper boundary} = X + z \times \text{standard error} \]

X: is the point estimate (the effect that you observe, the difference, etc.)

z: is the critical value. This value depends on the level of confidence that you choose. If you want level of confidence of 95% (which is the most common one), then z = 1.96, if you want 90%, then z = 1.645.

This is also very logical: think of it as trading precision for confidence. You can be very confident that the true value is in the interval if you draw the interval really wide. Conversely, if the interval is very narrow, you cannot be as confident that the true value lies within it.
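The boundary formulas and the confidence/precision trade-off can be sketched in a few lines of R. The helper conf_int() and the survey counts (540 successes out of 1000) are hypothetical, and the standard error uses the usual large-sample formula for a proportion.

```r
# z-based confidence interval: estimate +/- z * standard error
conf_int <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # critical value for the chosen level
  c(lower = estimate - z * std_error,
    upper = estimate + z * std_error)
}

# Hypothetical survey: 540 successes out of 1000
n <- 1000
p_hat <- 540 / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of a sample proportion

ci90 <- conf_int(p_hat, se, level = 0.90)
ci95 <- conf_int(p_hat, se, level = 0.95)
ci99 <- conf_int(p_hat, se, level = 0.99)
rbind(ci90, ci95, ci99)
```

The interval widens as the confidence level rises from 90% to 99%.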

Hypothesis Testing

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by \(H_0\). An alternative hypothesis (denoted \(H_a\)), which is the opposite of what is stated in the null hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine whether or not \(H_0\) can be rejected. If \(H_0\) is rejected, the statistical conclusion is that the data support the alternative hypothesis \(H_a\).

P-Value

We can think of a P-value as a conditional probability: given that the null hypothesis is true, what’s the probability of obtaining, by random chance alone, a sample statistic at least as extreme as the one observed? In this type of test, we use the alternative hypothesis \(H_{\mathrm{a}}\) to decide if the P-value comes from the probability above the test statistic, below the test statistic, or from a two-sided probability.

z-Test

Conditions for proportion inference

Formula: z test statistic

We can calculate the test statistic corresponding to the sample result:

\[ z=\frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic}}=\frac{\hat{p}-p_{0}}{\sqrt{\frac{p_{0}\left(1-p_{0}\right)}{n}}} \]

(where \(\hat{p}\) is the sample proportion, \(p_{0}\) is the proportion from the null hypothesis, and \(n\) is the sample size).
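As a sketch, the formula can be wrapped in a small R function. The name z_prop() and the sample numbers (60 successes in 100 trials, tested against a null proportion of 0.5) are hypothetical.

```r
# One-sample z statistic for a proportion
z_prop <- function(successes, n, p0) {
  p_hat <- successes / n                   # sample proportion
  se0 <- sqrt(p0 * (1 - p0) / n)           # standard deviation under H0
  (p_hat - p0) / se0
}

# Hypothetical data: 60 successes out of 100, null proportion 0.5
z <- z_prop(successes = 60, n = 100, p0 = 0.5)
z
```

With these numbers the sample proportion is 0.6 and the standard deviation under the null is 0.05, so z is 2.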

R: P-value from z-score

R Function

To find the p-value associated with a z-score in R, we can use the pnorm() function, which uses the following syntax:

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)

where:

  • q: The z-score

  • mean: The mean of the normal distribution. Default is 0.

  • sd: The standard deviation of the normal distribution. Default is 1.

  • lower.tail: If TRUE, the probability to the left of q in the normal distribution is returned. If FALSE, the probability to the right is returned. Default is TRUE.

Example 1: one-sided

In 2011, \(51 \%\) of cell phone owners in a country reported that their cell phone was a smartphone. The following year, the researchers wanted to test \(H_{0}: p=0.51\) versus \(H_{\mathrm{a}}: p>0.51\), where \(p\) is the proportion of cell phone owners in that country who have a smartphone.

They surveyed a random sample of 934 cell phone owners in that country and found that 501 of them had a smartphone. The test statistic for these results was \(z \approx 1.61\).

What is the P-Value?

pnorm(1.61, lower.tail = F)
## [1] 0.05369893

P-Value Answer: 0.054

Example 2: two-sided

Amanda read a report saying that \(49 \%\) of teachers in the United States were members of a labor union. She wanted to test whether this was true in her state, so she took a random sample of 300 teachers from her state to test \(H_{0}: p=0.49\) versus \(H_{\mathrm{a}}: p \neq 0.49\), where \(p\) is the proportion of teachers in her state who are members of a labor union.

The sample results showed 134 teachers were members of a labor union, and the corresponding test statistic was \(z \approx-1.50\).

What is the P-Value?

2*pnorm(-1.5)
## [1] 0.1336144

P-Value Answer: 0.134
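The one-sided and two-sided cases can be combined in a single helper. The function name p_value_z() and its alternative argument are hypothetical conveniences, not part of base R; the sketch simply wraps the pnorm() calls used in the two examples.

```r
# P-value for a z statistic under a chosen alternative hypothesis
p_value_z <- function(z, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         greater = pnorm(z, lower.tail = FALSE),   # H_a: parameter is larger
         less = pnorm(z),                          # H_a: parameter is smaller
         two.sided = 2 * pnorm(-abs(z)))           # H_a: parameter differs
}

p_value_z(1.61, alternative = "greater")      # Example 1
p_value_z(-1.50, alternative = "two.sided")   # Example 2
```

Using -abs(z) in the two-sided branch means the function gives the same answer whether the test statistic is negative or positive.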

t-Tests

Conditions for mean inference

When we want to carry out inference (build a confidence interval or do a significance test) on a mean, the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it’s important to check whether or not these conditions have been met. Otherwise the calculations and conclusions that follow may not be correct.

The conditions we need for inference on a mean are:

  • Random: A random sample or randomized experiment should be used to obtain the data.

  • Normal: The sampling distribution of \(\overline{x}\) (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large \((n\geq 30)\).

  • Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn’t be more than \(10\%\) of the population.

Let’s look at each of these conditions a little more in-depth.

The random condition

Random samples give us unbiased data from a population. When we don’t use random selection, the resulting data usually has some form of bias, so using it to infer something about the population can be risky.

For example, suppose a university wants to report the average starting salary of their graduates. How do they obtain the data? They can’t access the salaries of all graduates, and they can’t realistically get salaries from a random sample of graduates. The university could rely on graduates who are willing to share their salaries to calculate the average, but using voluntary response will likely lead to a biased estimate of the true average. Graduates with higher starting salaries will probably be more willing to report their salaries than graduates with low salaries (or graduates without salaries). Also, graduates who participate may claim their salary is higher than it really is, but they’d be unlikely to say it’s lower than it really is.

The big idea is that data that came from a non-random sample may not be representative of its population.

More specifically, sample means are unbiased estimators of their population mean. For example, suppose we have a bag of ping pong balls individually numbered from 0 to 30, so the population mean of the bag is 15. We could take random samples of balls from the bag and calculate the mean from each sample. Some samples would have a mean higher than 15 and some would be lower. But on average, the mean of each sample will equal 15. We write this property as \(\mu_{\bar{x} } =\mu\), which holds true as long as we are taking random samples.

This won’t necessarily happen if we use a non-random sample. Biased samples can lead to inaccurate results, so they shouldn’t be used to create confidence intervals or carry out significance tests.
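The ping pong ball example can be simulated. In the sketch below, the sample size of 5, the 5000 repetitions, the seed, and the particular biased scheme (drawing only from balls numbered 15 or higher) are assumptions made for illustration.

```r
set.seed(123)      # fixed seed for reproducibility
balls <- 0:30      # the population of numbered balls
mean(balls)        # population mean

# Random samples of 5 balls: the sample means average out near 15
sample_means <- replicate(5000, mean(sample(balls, size = 5)))
mean(sample_means)

# A non-random scheme for contrast: only the higher-numbered balls can be drawn
biased_means <- replicate(5000, mean(sample(balls[balls >= 15], size = 5)))
mean(biased_means)
```

The random samples centre on the population mean, whereas the biased scheme systematically overestimates it.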

The normal condition

The sampling distribution of \(\bar{x}\) (a sample mean) is approximately normal in a few different cases. The shape of the sampling distribution of \(\bar{x}\) mostly depends on the shape of the parent population and the sample size n.

Case 1: Parent population is normally distributed

If the parent population is normally distributed, then the sampling distribution of \(\bar{x}\) is approximately normal regardless of sample size. So if we know that the parent population is normally distributed, we pass this condition even if the sample size is small. In practice, however, we usually don’t know if the parent population is normally distributed.

Case 2: Not normal or unknown parent population; sample size is large \((n \geq 30)\)

The sampling distribution of \(\bar{x}\) is approximately normal as long as the sample size is reasonably large. Because of the central limit theorem, when \(n \geq 30\), we can treat the sampling distribution of \(\bar{x}\) as approximately normal regardless of the shape of the parent population.

There are a few rare cases where the parent population has such an unusual shape that the sampling distribution of the sample mean \(\bar{x}\) isn’t quite normal for sample sizes near 30. These cases are rare, so in practice, we are usually safe to assume approximate normality in the sampling distribution when \(n \geq 30\).

Case 3: Not normal or unknown parent population; sample size is small \((n<30)\)

As long as the parent population doesn’t have outliers or strong skew, even smaller samples will produce a sampling distribution of \(\bar{x}\) that is approximately normal. In practice, we can’t usually see the shape of the parent population, but we can try to infer shape based on the distribution of data in the sample. If the data in the sample shows skew or outliers, we should doubt that the parent is approximately normal, and so the sampling distribution of \(\bar{x}\) may not be normal either. But if the sample data are roughly symmetric and don’t show outliers or strong skew, we can assume that the sampling distribution of \(\bar{x}\) will be approximately normal.

The big idea is that when \(n < 30\), we need to graph our sample data and then make a decision about the normal condition based on the appearance of the sample data.
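Graphing a small sample takes only a couple of lines in R. The data below are made up for illustration; a histogram and a boxplot are one reasonable pair of plots for spotting strong skew or outliers, not the only option.

```r
# Hypothetical sample of n = 12 measurements (made up for illustration)
x <- c(12, 15, 14, 10, 13, 16, 11, 14, 13, 15, 12, 14)

hist(x, main = "Sample data", xlab = "x")     # look for strong skew
boxplot(x, horizontal = TRUE)                 # points beyond the whiskers are flagged as outliers

# Values that boxplot() would draw as outliers (1.5 * IQR rule)
boxplot.stats(x)$out
```

For this made-up sample no values are flagged, so we would be comfortable treating the normal condition as met.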

The independence condition

To use the formula for standard deviation of \(\bar{x}\) we need individual observations to be independent. In an experiment, good design usually takes care of independence between subjects (control, different treatments, randomization).

In an observational study that involves sampling without replacement, individual observations aren’t technically independent since removing each observation changes the population. However, the 10% condition says that if we sample 10% or less of the population, we can treat individual observations as independent since removing each observation doesn’t change the population all that much as we sample. For instance, if our sample size is \(n=30\), there should be at least \(N = 300\) members in the population for the sample to meet the independence condition.
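The 10% condition is simple enough to check with a one-line helper. The function name `meets_ten_percent` is hypothetical (it is not part of base R), and the second call uses a made-up population size.

```r
# Hypothetical helper (not a built-in R function):
# TRUE when the sample is at most 10% of the population
meets_ten_percent <- function(n, N) {
  10 * n <= N
}

meets_ten_percent(n = 30, N = 300)   # TRUE: 30 is exactly 10% of 300
meets_ten_percent(n = 30, N = 250)   # FALSE: 30 is 12% of 250
```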

Assuming independence between observations allows us to use this formula for standard deviation of \(\bar{x}\) when we’re making confidence intervals or doing significance tests:

\[ \sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}} \]

We usually don’t know the population standard deviation \(\sigma\), so we substitute the sample standard deviation \(s_x\) as an estimate for \(\sigma\). When we do this, we call it the standard error of \(\bar{x}\) to distinguish it from the standard deviation.

So our formula for standard error of \(\bar{x}\) is:

\[ \sigma_{\bar{x}} \approx \frac{s_{x}}{\sqrt{n}} \]
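As a small worked example, the standard error can be computed directly from a sample. The numbers below are made up for illustration.

```r
# Hypothetical sample (values made up for illustration)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n   <- length(x)          # sample size
s_x <- sd(x)              # sample standard deviation (divides by n - 1)
se  <- s_x / sqrt(n)      # standard error of the sample mean
se
```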

Summary

If all three of these conditions are met, then we can feel good about using \(t\) distributions to make a confidence interval or do a significance test. Satisfying these conditions makes our calculations accurate and conclusions reliable.

The random condition is perhaps the most important. If we break the random condition, there is probably bias in the data. The only reliable way to correct for a biased sample is to recollect the data in an unbiased way.

The other two conditions are important, but if we don’t meet the normal or independence conditions, we may not need to start over. For example, there is a way to correct for the lack of independence when we sample more than 10% of a population, but it’s beyond the scope of what we’re learning right now.

The main idea is that it’s important to verify certain conditions are met before we make these confidence intervals or do these significance tests.

Formula: t test statistic

The test statistic gives us an idea of how far away our sample result is from our null hypothesis. For a one-sample t test for a mean, our test statistic is:

\[ \begin{aligned}t &=\frac{\text { statistic }-\text { parameter }}{\text { standard error of statistic }} \\&=\frac{\bar{x}-\mu_{0}}{\frac{s_{x}}{\sqrt{n}}}\end{aligned} \]

The statistic \(\bar{x}\) is the sample mean, and the parameter \(\mu_{0}\) is the mean from the null hypothesis. The standard error of the sample mean is \(s_{x}\) (the sample standard deviation) divided by the square root of \(n\) (the sample size).
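The formula can be applied by hand and compared against R's built-in `t.test()`. The sample and the null value `mu0` below are made up for illustration.

```r
# Hypothetical sample and null value (made up for illustration)
x   <- c(35, 31, 38, 36, 33, 37, 34, 36)
mu0 <- 33

# (statistic - parameter) / standard error of statistic
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_manual

# t.test() reports the same statistic, with n - 1 = 7 degrees of freedom
result <- t.test(x, mu = mu0)
result$statistic
result$parameter
```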

R: P-value from t statistic

Degrees of freedom: \(n - 1\)

The function in R: pt(), which returns tail probabilities of the t distribution.

pt(q = 2, # the t-statistic
   df = 5, # the degrees of freedom
   lower.tail = TRUE # TRUE (the default) gives P(T <= q); FALSE gives P(T > q)
   )
## [1] 0.9490303

Example 1: one-sided

Daisy was testing \(H_{0}: \mu=33\) versus \(H_{\mathrm{a}}: \mu>33\) with a sample of 11 observations. Her test statistic was \(t=1.368\). Assume that the conditions for inference were met.

What is the P-value?

pt(q = 1.368, df = 10, lower.tail = FALSE) # upper tail, because Ha is mu > 33
## [1] 0.100632

Example 2: two-sided

Jasper was testing \(H_{0}: \mu=36\) versus \(H_{\mathrm{a}}: \mu \neq 36\) with a sample of 16 observations. His test statistic was \(t=2.4\). Assume that the conditions for inference were met.

What is the P-value?

2*pt(q = 2.4, df = 15, lower.tail = FALSE) # double the upper tail for a two-sided test
## [1] 0.02982493

R: t-tests

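# Two independent samples of 10 standard normal draws; no seed is set, so results vary from run to run
x = rnorm(10)
y = rnorm(10)
# With two samples, t.test() defaults to the Welch two-sample test (unequal variances)
t.test(x,y)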
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.082458, df = 17.465, p-value = 0.9352
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.205725  1.114846
## sample estimates:
## mean of x mean of y 
## 0.2856339 0.3310732

t or z Statistic?

A z-test is appropriate when either:

  • We know the population variance, or

  • We do not know the population variance but our sample size is large (n ≥ 30 is a common rule of thumb).

If we have a sample size of less than 30 and do not know the population variance, then we should use a t-test.

The t-test gives up some power compared with the z-test, because the t distribution has heavier tails to account for the extra uncertainty from estimating the standard deviation, but it lets us proceed when we have less information available.
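One way to see how the two statistics differ is to compare critical values. The sketch below uses a two-sided 95% level and a handful of degrees of freedom, both chosen arbitrarily for illustration.

```r
# Two-sided 95% critical values: z versus t for several degrees of freedom
z_crit <- qnorm(0.975)
dfs    <- c(5, 15, 29, 100, 1000)
t_crit <- qt(0.975, df = dfs)

round(z_crit, 3)                                  # about 1.96
data.frame(df = dfs, t_crit = round(t_crit, 3))   # always larger than z, shrinking toward it
```

The t critical value is always larger than the z critical value, and the gap shrinks as the degrees of freedom grow, which is why the choice matters most for small samples.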

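Significance tests for proportions use z-scores rather than t-tests, because the hypothesized proportion \(p_0\) determines the standard deviation of the sample proportion, \(\sqrt{p_0(1-p_0)/n}\), so no population standard deviation has to be estimated from the sample.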
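A z statistic for a proportion can be computed from the null value alone. The counts below are made up for illustration, and the sketch assumes a two-sided alternative.

```r
# Hypothetical data: 60 successes in 100 trials, testing H0: p = 0.5 (two-sided)
successes <- 60
n         <- 100
p0        <- 0.5

p_hat <- successes / n
se0   <- sqrt(p0 * (1 - p0) / n)     # standard deviation of p-hat if H0 is true
z     <- (p_hat - p0) / se0          # z = 2
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_value                              # about 0.0455
```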

Regressions

Linear regression

Multiple regression

Causal Inference

The Rubin causal model is based on the idea of potential outcomes. For example, a person would have a particular income at age 40 if they had attended college, whereas they would have a different income at age 40 if they had not attended college. To measure the causal effect of going to college for this person, we need to compare the outcome for the same individual in both alternative futures. Since it is impossible to see both potential outcomes at once, one of the potential outcomes is always missing. This dilemma is the “fundamental problem of causal inference”.


Potential outcome framework

Thought experiment

  • Let’s assume we want to know if a particular intervention/treatment $D$ (like job loss) caused a particular outcome $Y$ (AFD vote propensity)

  • The variable job loss $D$ can take two values: 1 if the respondent did lose the job and 0 otherwise

  • The variable AFD vote propensity $Y$ is measured as a standard normal variable (that could be a continuous survey measure)

  • In this thought experiment, we also assume that we know the potential outcomes for each individual

  • Our experimental sample of interest consists of $N=100$ German citizens $u$ drawn from the population of all citizens that are eligible to vote

  • In an ideal experiment we would observe each citizen’s vote propensity both with and without job loss and compute the difference to get the causal effect

Let’s simulate this setup in R:

# library(repr) provides plot-size options for Jupyter notebooks; set them before plotting
library(repr)
options(repr.plot.width=4, repr.plot.height=4)

N = 100
u = seq_len(N)               # unit index 1..N
Y0 = rnorm(N)                # potential outcome without job loss
Y1 = rnorm(N) + 1            # potential outcome with job loss (shifted up by 1)
D = u %in% sample(u, N/2)    # randomly assign half of the units to treatment
yl = "Y(0) & Y(1)"

# Plot both potential outcomes for every unit
plot(u, Y0, ylim=c(-3, 4), xlim=c(1,N), xlab="u", ylab=yl)
points(u, Y1, col="red")
title("Y(1) and Y(0) for all units")

# Plot only the observed outcome for each unit, with group means as horizontal lines
plot(u[D==0], Y0[D==0], ylim=c(-3, 4), xlim=c(1,N), main = "Y(1) | D=1 and Y(0) | D=0", xlab="u", ylab=yl)
abline(h=mean(Y0[D==0]))
points(u[D==1], Y1[D==1], col="red")
abline(h=mean(Y1[D==1]), col="red")
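To connect the plots to the causal effect itself, the following sketch reruns the same kind of simulation and compares the true average effect, which is knowable only in the thought experiment, with the difference in group means a researcher could actually compute. The seed and the names `true_ate`, `Y`, and `diff_in_means` are assumptions introduced here for illustration.

```r
set.seed(123)                            # arbitrary seed, only for reproducibility
N  <- 100
Y0 <- rnorm(N)                           # potential outcome without job loss
Y1 <- rnorm(N) + 1                       # potential outcome with job loss
D  <- seq_len(N) %in% sample(N, N / 2)   # random assignment of half the units

# Known only in the thought experiment: individual effects and their average
true_ate <- mean(Y1 - Y0)

# What a researcher would actually observe: one potential outcome per unit
Y <- ifelse(D, Y1, Y0)
diff_in_means <- mean(Y[D]) - mean(Y[!D])

c(true_ate = true_ate, diff_in_means = diff_in_means)
```

Because assignment is random, the difference in observed group means should land near the true average effect, though the two will not match exactly in a sample of 100.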

Functions

str() and summary()

str() displays the internal structure of an object. For a data frame, this is the number of observations and variables, plus the data type and first few values of each variable.

x <- data.frame(c(1:5), rep("A",5), c(5:1))
str(x)
## 'data.frame':    5 obs. of  3 variables:
##  $ c.1.5.     : int  1 2 3 4 5
##  $ rep..A...5.: chr  "A" "A" "A" "A" ...
##  $ c.5.1.     : int  5 4 3 2 1

summary() displays a statistical summary of an object. For numeric variables, this is the minimum, quartiles, median, mean, and maximum; for character variables, only the length, class, and mode.

summary(x) # Note: not sum()
##      c.1.5.  rep..A...5.            c.5.1. 
##  Min.   :1   Length:5           Min.   :1  
##  1st Qu.:2   Class :character   1st Qu.:2  
##  Median :3   Mode  :character   Median :3  
##  Mean   :3                      Mean   :3  
##  3rd Qu.:4                      3rd Qu.:4  
##  Max.   :5                      Max.   :5
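The auto-generated column names in the output above (such as `c.1.5.`) come from passing unnamed vectors to `data.frame()`. A small variation, with column names chosen here purely for illustration, gives more readable output:

```r
# Same values as in the example, but with explicit column names (names are illustrative)
x <- data.frame(id = 1:5, group = rep("A", 5), score = 5:1)
names(x)     # "id" "group" "score"
str(x)
summary(x)
```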