Statistics Learning

This post summarizes the statistics I learned from stattrek.

Variable

Variable: classified as qualitative (aka categorical) or quantitative (aka numeric)

Quantitative variables: can be further classified as discrete or continuous

Univariate data (one variable) vs. bivariate data (two variables)

Population and sample

Population vs. sample: a measurable characteristic of a population, such as a mean or standard deviation, is called a parameter, but a measurable characteristic of a sample is called a statistic.

Random sampling: it allows researchers to use statistical methods to analyze sample results. Statistical analysis is not appropriate when non-random sampling methods are used.

Sampling with replacement or without replacement. When sampling with replacement, sample size can be greater than population size.

Basic statistics

The mean and median: statisticians refer to the mean and median as measures of central tendency.

Population mean: \(\mu = \sum X / N\)
Sample mean: \(\bar{x} = \sum x / n\)

The median may be a better indicator of the most typical value if a set of scores has an outlier. An outlier is an extreme value that differs greatly from other values.

However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.

Variability: The most common measures of variability are the range, the interquartile range (IQR), variance, and standard deviation.

range: the difference between the max and min
IQR: Q3 minus Q1
variance:

+ population variance: \(\sigma^2 = \sum (X_i - \mu)^2 / N\)
+ sample variance: \(s^2 = \sum (x_i - \bar{x})^2 / (n - 1)\)

standard deviation: the square root of the variance. Thus, the standard deviation of a population is: \(\sigma = \sqrt{\sum (X_i - \mu)^2 / N}\)
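As a rough illustration of the measures above, the following sketch uses Python's standard `statistics` module. The `scores` list is made-up data, chosen only so the numbers come out cleanly.

```python
import math
import statistics

# Hypothetical sample of scores, used only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)            # sum(x) / n
median = statistics.median(scores)        # middle value of the sorted data
value_range = max(scores) - min(scores)   # max minus min

# Population formulas divide by N; sample formulas divide by n - 1.
pop_variance = statistics.pvariance(scores)
sample_variance = statistics.variance(scores)
pop_std = math.sqrt(pop_variance)

print(mean, median, value_range)
print(pop_variance, sample_variance, pop_std)
```

Note that the sample variance (32/7) is larger than the population variance (4) because of the n - 1 divisor.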

Effect of Changing Units

Sometimes, researchers change units (minutes to hours, feet to meters, etc.). If you add a constant to every value, the distance between values does not change. As a result, all of the measures of variability (range, interquartile range, standard deviation, and variance) remain the same.

On the other hand, suppose you multiply every value by a constant. This has the effect of multiplying the range, interquartile range (IQR), and standard deviation by that constant. It has an even greater effect on the variance. It multiplies the variance by the square of the constant.
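A quick way to convince yourself of both claims is to try them on a small data set. This is only a sketch with made-up durations; the names `minutes`, `shifted`, and `hours` are illustrative.

```python
import statistics

# Hypothetical durations in minutes, for illustration only.
minutes = [30, 45, 60, 90]

shifted = [m + 10 for m in minutes]   # add a constant to every value
hours = [m / 60 for m in minutes]     # multiply every value by 1/60

sd = statistics.pstdev(minutes)
var = statistics.pvariance(minutes)

# Adding a constant leaves the spread unchanged.
print(statistics.pstdev(shifted), sd)

# Multiplying by c scales the standard deviation by c and the variance by c**2.
c = 1 / 60
print(statistics.pstdev(hours), c * sd)
print(statistics.pvariance(hours), c**2 * var)
```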

Position: the position of a value, relative to other values in a set of observations. The most common measures of position are percentiles, quartiles, and standard scores (aka, z-scores).

The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.

Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

Note the relationship between quartiles and percentiles. Q1 corresponds to P25, Q2 corresponds to P50, Q3 corresponds to P75. Q2 is the median value in the set.

Standard Scores (z-Scores)

A standard score (aka, a z-score) indicates how many standard deviations an element is from the mean. A standard score can be calculated from the following formula.

\(z = (X - \mu) / \sigma\)

where z is the z-score, X is the value of the element, \(\mu\) is the mean of the population, and \(\sigma\) is the standard deviation.
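The z-score formula can be sketched in a few lines. The `population` values below are invented for the example, and `z_score` is just a helper name.

```python
import statistics

# Hypothetical population of test scores, for illustration only.
population = [60, 70, 80, 90, 100]

mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

def z_score(x: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

print(z_score(100))  # positive: above the mean
print(z_score(80))   # 0.0: exactly at the mean
print(z_score(60))   # negative: below the mean
```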

Chart and plots

Difference between bar chart and histogram: with bar charts, each column represents a group defined by a categorical variable; with histograms, each column represents a group defined by a quantitative variable.

Stemplot: shows exact values of individual observations and their distribution across groups.
Boxplot: splits the quantitative data set into quartiles.

Residual

Examine the residuals and the residual plot to check whether a linear regression is appropriate.

If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
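To make the idea concrete, here is a minimal sketch that fits a least-squares line by hand and lists the residuals. The data are made up (they follow y = x squared), and `fit_line` is a helper written for this example, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical data that curve upward (y = x**2), so a line fits poorly.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive, then negative, then positive again. That U shape is not random scatter, which suggests a non-linear model for these data.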

Transformation to achieve linearity

When a residual plot reveals a data set to be nonlinear, it is often possible to “transform” the raw data to make it more linear. This allows us to use linear regression techniques more effectively with nonlinear data.

Linear transformation: add, multiply or divide by a constant - does not change correlation
Nonlinear transformation: e.g., take the square root of x - does change correlation

How to Perform a Transformation to Achieve Linearity: Transforming a data set to enhance linearity is a multi-step, trial-and-error process.

Conduct a standard regression analysis on the raw data.
Construct a residual plot. If the plot pattern is random, do not transform the data. If the plot pattern is not random, continue.
Compute the coefficient of determination (\(R^2\)).
Choose a transformation method (see the table below).
Transform the independent variable, dependent variable, or both.
Conduct a regression analysis, using the transformed variables.
Compute the coefficient of determination (\(R^2\)), based on the transformed variables.
If the transformed \(R^2\) is greater than the raw-score \(R^2\), the transformation was successful. Congratulations! If not, try a different transformation method.

The best transformation method (exponential model, quadratic model, reciprocal model, etc.) will depend on the nature of the original data. The only way to determine which method is best is to try each and compare the results (i.e., residual plots, correlation coefficients).

Regular Transformation methods

Method	Transformation(s)	Regression equation	Predicted value (y)
Linear regression	None	y = b0 + b1x	y = b0 + b1x
Exponential model	log(y)	log(y) = b0 + b1x	\(y = 10^{(b0 + b1x)}\)
Quadratic model	sqrt(y)	sqrt(y) = b0 + b1x	\(y = (b0 + b1x)^2\)
Reciprocal model	1/y	1/y = b0 + b1x	y = 1 / (b0 + b1x)
Logarithmic model	log(x)	y = b0 + b1log(x)	y = b0 + b1log(x)
Power model	log(y), log(x)	log(y) = b0 + b1log(x)	\(y = 10^{(b0 + b1log(x))}\)
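The trial-and-error procedure can be sketched for the exponential model row of the table. Everything here is an assumption for illustration: the data are invented to grow roughly exponentially, and `fit_line` and `r_squared` are small helpers written for this example.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1 * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    b0, b1 = fit_line(xs, ys)
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data that grow roughly exponentially.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 8.2, 15.8, 32.5]

raw_r2 = r_squared(xs, ys)

# Exponential model: regress log10(y) on x.
log_ys = [math.log10(y) for y in ys]
transformed_r2 = r_squared(xs, log_ys)
print(raw_r2, transformed_r2)

# Back-transform a prediction: y = 10 ** (b0 + b1 * x).
b0, b1 = fit_line(xs, log_ys)
predicted_at_6 = 10 ** (b0 + b1 * 6)
print(predicted_at_6)
```

For these data the transformed R2 is higher than the raw R2, so by the rule above the transformation would count as successful.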

Influential points

Outliers: Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.

It could have an extreme X value compared to other data points.
It could have an extreme Y value compared to other data points.
It could have extreme X and Y values.
It might be distant from the rest of the data, even without extreme X or Y values.

An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.

Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller.
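Testing influence by fitting the line twice can be sketched as follows. The data are made up: five points on the line y = x plus one point with an extreme X value, and `slope` is a helper written for this example.

```python
def slope(xs, ys):
    """Slope b1 of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: five points on y = x, then one point far from the pattern.
xs = [1, 2, 3, 4, 5, 20]
ys = [1, 2, 3, 4, 5, 2]

slope_with = slope(xs, ys)
slope_without = slope(xs[:-1], ys[:-1])
print(slope_with, slope_without)
```

Without the last point the slope is 1; with it the slope drops to roughly zero, so that single point is influential.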

If your data set includes an influential point, here are some things to consider.

An influential point may represent bad data, possibly the result of measurement error. If possible, check the validity of the data point.
Compare the decisions that would be made based on regression equations defined with and without the influential point. If the equations lead to contrary decisions, use caution.

Categorical Data

One-way table: the tabular equivalent of a bar chart. Like a bar chart, a one-way table displays categorical data in the form of frequency counts and/or relative frequencies (proportions or percentages).

Two-way table: can be displayed as frequency counts or as relative frequencies (just like a one-way table) for the whole table, for rows, or for columns. It can be displayed graphically as a segmented bar chart.

Probability

Law of large numbers (LLN): the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
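A small simulation shows the law of large numbers at work. This is a sketch using a simulated fair die; the helper name `average_of_rolls` and the fixed seed are choices made for the example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def average_of_rolls(n_trials: int) -> float:
    """Average of n_trials simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n_trials)) / n_trials

expected_value = 3.5  # (1 + 2 + 3 + 4 + 5 + 6) / 6

for n in (10, 1_000, 100_000):
    print(n, average_of_rolls(n))
```

The averages for larger n should sit closer to 3.5, although any single run can wobble.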

Rules of probability:

The probability that Event A occurs, given that Event B has occurred, is called a conditional probability. The conditional probability of Event A, given Event B, is denoted by the symbol P(A|B).
The probability that Events A and B both occur is the probability of the intersection of A and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.
The probability that Events A or B occur is the probability of the union of A and B. The probability of the union of Events A and B is denoted by P(A ∪ B).
If the occurrence of Event A changes the probability of Event B, then Events A and B are dependent. On the other hand, if the occurrence of Event A does not change the probability of Event B, then Events A and B are independent.
Rule of subtraction: P(A) = 1 - P(A’)
Rule of multiplication: P(A ∩ B) = P(A) P(B|A)
Rule of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B) or P(A ∪ B) = P(A) + P(B) - P(A)P(B|A)
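The three rules can be checked by counting outcomes in a standard 52-card deck. This is only a sketch: the events (A = "the card is a heart", B = "the card is a face card") and the helper names are chosen for the example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) pairs

def prob(event):
    """Probability that a randomly drawn card satisfies the predicate `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_heart(card):   # Event A
    return card[1] == "hearts"

def is_face(card):    # Event B
    return card[0] in {"J", "Q", "K"}

p_a = prob(is_heart)
p_b = prob(is_face)
p_a_and_b = prob(lambda c: is_heart(c) and is_face(c))
p_a_or_b = prob(lambda c: is_heart(c) or is_face(c))
p_b_given_a = p_a_and_b / p_a  # conditional probability P(B|A)

print(p_a == 1 - prob(lambda c: not is_heart(c)))   # rule of subtraction
print(p_a_and_b == p_a * p_b_given_a)               # rule of multiplication
print(p_a_or_b == p_a + p_b - p_a_and_b)            # rule of addition
```

Using `Fraction` keeps the arithmetic exact, so the equalities hold without rounding error.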

Random variables

When the value of a variable is determined by a chance event, that variable is called a random variable.

When comparing discrete and continuous variables, it is more correct to say that continuous variables can always take on an infinite number of values; whereas some discrete variables can take on an infinite number of values, but others cannot.

Discrete: Within a range of numbers, discrete variables can take on only certain values.
Continuous: Continuous variables, in contrast, can take on any value within a range of values.

Probability distribution

A probability distribution is a table or an equation that links each possible value that a random variable can assume with its probability of occurrence.

Note: Given a probability distribution, you can find cumulative probability. The probability distribution of a continuous random variable is represented by an equation, called the probability density function (pdf).

Mean and variance

Mean of a discrete random variable - also called expected value:

\(E(X) = \mu_x = \sum [ x_i \cdot P(x_i) ]\) where \(x_i\) is the value of the random variable for outcome i, \(\mu_x\) is the mean of random variable X, and \(P(x_i)\) is the probability that the random variable will be outcome i.

Variability of a discrete random variable (variance):

\(\sigma^2 = \sum \{ [ x_i - E(x) ]^2 \cdot P(x_i) \}\) where \(x_i\) is the value of the random variable for outcome i, \(P(x_i)\) is the probability that the random variable will be outcome i, and \(E(x)\) is the expected value of the discrete random variable x.
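As a worked example of both formulas, take the number of heads in two flips of a fair coin. The dictionary name `distribution` is just a choice for this sketch.

```python
# Number of heads in two flips of a fair coin: value -> probability.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}

# E(X) = sum of x * P(x) over all outcomes.
expected_value = sum(x * p for x, p in distribution.items())

# Variance = sum of (x - E(X))**2 * P(x) over all outcomes.
variance = sum((x - expected_value) ** 2 * p for x, p in distribution.items())

print(expected_value)  # 1.0
print(variance)        # 0.5
```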

Independent random variables

If two random variables, X and Y, are independent, they satisfy the following conditions.

P(x|y) = P(x) OR P(x ∩ y) = P(x) * P(y), for all values of X and Y. The above conditions are equivalent.

The correlation between X and Y is equal to zero.

Linear transformation

Does not affect correlation

Adding a constant: Y = X + b
Subtracting a constant: Y = X - b
Multiplying by a constant: Y = mX
Dividing by a constant: Y = X/m
Multiplying by a constant and adding a constant: Y = mX + b
Dividing by a constant and subtracting a constant: Y = X/m - b

For Y = mX + b, the mean and variance of Y are:

\(\bar{Y} = m\bar{X} + b\) and \(Var(Y) = m^2 \cdot Var(X)\)

where m and b are constants, \(\bar{Y}\) is the mean of Y, \(\bar{X}\) is the mean of X, Var(Y) is the variance of Y, and Var(X) is the variance of X.
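The mean and variance rules, and the claim about correlation, can be checked numerically. This is a sketch with invented paired data; `pearson_r` is a helper written for the example, and m is positive here.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ws = [2, 1, 4, 3, 5]

m, b = 3, 2                    # constants of the transformation (m > 0 here)
ys = [m * x + b for x in xs]   # Y = mX + b

print(statistics.mean(ys), m * statistics.mean(xs) + b)           # means agree
print(statistics.pvariance(ys), m**2 * statistics.pvariance(xs))  # variances agree
print(pearson_r(xs, ws), pearson_r(ys, ws))                       # same correlation
```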

Binomial distribution

Normal Distribution

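The normal distribution is defined by the normal equation:

\(Y = \frac{1}{\sigma \sqrt{2\pi}} e^{-(X - \mu)^2 / (2\sigma^2)}\)

where X is a normal random variable, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.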

The graph of the normal distribution depends on two factors - the mean and the standard deviation. The mean of the distribution determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. When the standard deviation is large, the curve is short and wide; when the standard deviation is small, the curve is tall and narrow.

Every normal curve (regardless of its mean or standard deviation) conforms to the following “rule”. Collectively, these points are known as the empirical rule or the 68-95-99.7 rule.

About 68% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
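The three percentages can be reproduced from the cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library (available since Python 3.8); `area_within` is a helper name chosen for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k: float) -> float:
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Because every normal curve can be rescaled to the standard one, the same areas apply for any mean and standard deviation.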

Standard normal distribution

The standard normal distribution is a special case of the normal distribution, with a mean of zero and a standard deviation of one. The normal random variable of a standard normal distribution is called a standard score or a z-score.

\(z = (X - \mu) / \sigma\)

where X is a normal random variable, \(\mu\) is the mean of X, and \(\sigma\) is the standard deviation of X.

t distribution

The t distribution (aka, Student’s t-distribution) is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown.

According to the central limit theorem, the sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is sufficiently large. Therefore, when we know the standard deviation of the population, we can compute a z-score, and use the normal distribution to evaluate probabilities with the sample mean.

But sample sizes are sometimes small, and often we do not know the standard deviation of the population. When either of these problems occurs, statisticians rely on the distribution of the t statistic (also known as the t score), whose values are given by:

\(t = [ \bar{x} - \mu ] / [ s / \sqrt{n} ]\)

where \(\bar{x}\) is the sample mean, \(\mu\) is the population mean, s is the standard deviation of the sample, and n is the sample size. The distribution of the t statistic is called the t distribution or the Student t distribution.
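Computing the t statistic is a one-liner once the sample mean and sample standard deviation are known. The sample and the hypothesized population mean below are made up for this sketch, and `t_statistic` is an illustrative helper.

```python
import math
import statistics

def t_statistic(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # divides by n - 1
    return (x_bar - mu) / (s / math.sqrt(n))

# Hypothetical sample and hypothesized population mean.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t = t_statistic(sample, mu=4)
print(t)                 # about 1.32
print(len(sample) - 1)   # degrees of freedom for a one-sample t statistic
```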

Properties of the t Distribution

The mean of the distribution is equal to 0 .
The variance is equal to v / ( v - 2 ), where v is the degrees of freedom (see last section) and v > 2.
The variance is always greater than 1, although it is close to 1 when there are many degrees of freedom. With infinite degrees of freedom, the t distribution is the same as the standard normal distribution.

T-Distribution vs. Normal Distribution

The t distribution and the normal distribution can both be used with statistics that have a bell-shaped distribution. Guidelines exist to help you make that choice. Some focus on the population standard deviation; others focus on the sample size.

If the population standard deviation is known, use the normal distribution. If the population standard deviation is unknown, use the t-distribution.
If the sample size is large, use the normal distribution. If the sample size is small, use the t-distribution.

Estimation in statistics

In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.

An estimate of a population parameter may be expressed in two ways:

Point estimate
Interval estimate - confidence intervals

Example: a population mean is not an example of a point estimate; a sample mean is an example of a point estimate.

Confidence interval

A confidence interval expresses the precision and uncertainty associated with a particular sampling method. A confidence interval consists of three parts:

A confidence level - describes the likelihood that a particular sampling method will produce a confidence interval that includes the true population parameter. A higher confidence level means the interval is more likely to include the population parameter.
A statistic
A margin of error

Standard error

The standard error is an estimate of the standard deviation of a statistic (for a sample mean, SE = s / sqrt( n )) and is a measure of variability. The equations for the standard error are identical to the equations for the standard deviation, except for one thing - the standard error equations use statistics where the standard deviation equations use parameters (which come from the population and are often unknown). Specifically, the standard error equations use p in place of P, and s in place of sigma.

Margin of error

The margin of error can be defined by either of the following equations:

Critical value x Standard deviation of the statistic
Critical value x Standard error of the statistic

The central limit theorem states that the sampling distribution of a statistic will be nearly normal, if the sample size is large enough. As a rough guide, many statisticians say that a sample size of 30 is large enough when the population distribution is bell-shaped. But if the original population is badly skewed, has multiple peaks, and/or has outliers, researchers like the sample size to be even larger.

When the sampling distribution is nearly normal, the critical value can be expressed as a t score or as a z score. When the sample size is smaller, the critical value should only be expressed as a t score.

To find the critical value, follow the steps:

Compute alpha: 1 - (confidence level / 100)
Compute the critical probability (p): 1 - alpha / 2
To express the critical value as a z score, find the z score having a cumulative probability equal to the critical probability (p).
To express the critical value as a t score:
- calculate the degrees of freedom (DF)
- find the t score having a cumulative probability equal to the critical probability (p), with degrees of freedom equal to DF.
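The z-score branch of these steps can be sketched with the standard library. `critical_z` is a helper name chosen for this example. The t-score branch needs the inverse CDF of the t distribution, which the standard library does not provide, so it is left out here (a t table or a library such as SciPy would be needed).

```python
from statistics import NormalDist

def critical_z(confidence_level: float) -> float:
    """Critical z score for a confidence level given in percent (e.g. 95)."""
    alpha = 1 - confidence_level / 100
    critical_probability = 1 - alpha / 2
    # z score whose cumulative probability equals the critical probability
    return NormalDist().inv_cdf(critical_probability)

print(round(critical_z(95), 2))   # 1.96
print(round(critical_z(99), 3))   # 2.576
```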

Proportion confidence interval

Prerequisites:

The sampling method is simple random sampling
The sample is sufficiently large. As a rule of thumb, a sample is considered “sufficiently large” if it includes at least 10 successes and 10 failures.

The standard deviation of the sample proportion sigma(P) is:

sigma(P) = sqrt[ P * ( 1 - P ) / n ] * sqrt[ ( N - n ) / ( N - 1 ) ]

When the population size is much larger (at least 20 times larger) than the sample size, the standard deviation can be approximated by:

sigma(P) = sqrt[ P * ( 1 - P ) / n ]

When the true population proportion P is unknown, use the standard error, which replaces P with the sample proportion p:

SE(p) = sqrt[ p * ( 1 - p ) / n ] * sqrt[ ( N - n ) / ( N - 1 ) ]

When the population size is much larger (at least 20 times larger) than the sample size, the standard error can be approximated by:

SE(p) = sqrt[ p * ( 1 - p ) / n ]
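Putting the pieces together, a confidence interval for a proportion is the sample proportion plus or minus critical value x standard error. The survey numbers below are invented, and `proportion_standard_error` is a helper written for this sketch.

```python
import math
from statistics import NormalDist

def proportion_standard_error(p, n, N=None):
    """Standard error of a sample proportion p from a sample of size n.

    If the population size N is given, apply the finite population
    correction sqrt((N - n) / (N - 1)); otherwise use the approximation.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Hypothetical survey: 160 successes in a simple random sample of 400.
n = 400
p = 160 / n
se = proportion_standard_error(p, n)

z = NormalDist().inv_cdf(0.975)   # critical value for a 95% confidence level
margin_of_error = z * se
interval = (p - margin_of_error, p + margin_of_error)
print(se, margin_of_error, interval)
```

With 160 successes and 240 failures the "at least 10 successes and 10 failures" rule of thumb is met, so the normal-based critical value is a reasonable choice here.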

Hypothesis testing

There are two types of statistical hypotheses:

Null hypothesis - denoted by H0
Alternative hypothesis - denoted by Ha

JH

February 11, 2016