This post summarizes the statistics I learned from stattrek.
Variable
Variables are classified as qualitative (aka categorical) or quantitative (aka numeric).
Quantitative variables can be further classified as discrete or continuous.
Univariate data (one variable) vs. bivariate data (two variables)
Population and sample
Population vs. sample: a measurable characteristic of a population, such as a mean or standard deviation, is called a parameter, but a measurable characteristic of a sample is called a statistic.
Random sampling: it allows researchers to use statistical methods to analyze sample results. Statistical analysis is not appropriate when non-random sampling methods are used.
Sampling with replacement or without replacement. When sampling with replacement, sample size can be greater than population size.
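To make the difference concrete, here is a minimal sketch using Python's standard `random` module. The three-element population and the sample size of 5 are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the draws are repeatable
population = ["a", "b", "c"]  # hypothetical three-element population

# With replacement: the same element can be drawn again,
# so the sample (5) can be larger than the population (3).
with_replacement = random.choices(population, k=5)

# Without replacement: each element can be drawn at most once,
# so asking for 5 of 3 elements raises ValueError.
try:
    random.sample(population, k=5)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

print(with_replacement, oversample_allowed)
```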
Basic statistics
The mean and median: statisticians refer to the mean and median as measures of central tendency.
Population mean: \( \mu = \sum X / N \)
Sample mean: \( \bar{x} = \sum x / n \)
The median may be a better indicator of the most typical value if a set of scores has an outlier. An outlier is an extreme value that differs greatly from other values.
However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.
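A small sketch of how one outlier pulls the mean but barely moves the median. The scores are invented for illustration, and the extreme value 300 is an assumption chosen to make the effect obvious.

```python
from statistics import mean, median

scores = [70, 72, 75, 78, 80]     # made-up scores, no outlier
with_outlier = scores + [300]     # same scores plus one extreme value

mean_before, median_before = mean(scores), median(scores)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

The mean jumps from 75 to 112.5, while the median only moves from 75 to 76.5.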
Variability: The most common measures of variability are the range, the interquartile range (IQR), variance, and standard deviation.
range: the difference between the max and min
IQR: Q3 minus Q1
variance:
+ population variance: \( \sigma^2 = \sum (X_i - \mu)^2 / N \)
+ sample variance: \( s^2 = \sum (x_i - \bar{x})^2 / (n - 1) \)
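As a quick check of the two variance formulas, the sketch below computes them by hand on a made-up data set and compares the results with `statistics.pvariance` and `statistics.variance` from the standard library.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

value_range = max(data) - min(data)

mean = sum(data) / len(data)
squared_deviations = sum((x - mean) ** 2 for x in data)
population_variance = squared_deviations / len(data)    # divide by N
sample_variance = squared_deviations / (len(data) - 1)  # divide by n - 1

print(value_range, population_variance, sample_variance)
print(pvariance(data), variance(data))  # library versions of the same formulas
```

The only difference between the two is the denominator, so the sample variance is always slightly larger for the same data.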
Effect of Changing Units
Sometimes, researchers change units (minutes to hours, feet to meters, etc.). If you add a constant to every value, the distance between values does not change. As a result, all of the measures of variability (range, interquartile range, standard deviation, and variance) remain the same.
On the other hand, suppose you multiply every value by a constant. This has the effect of multiplying the range, interquartile range (IQR), and standard deviation by that constant. It has an even greater effect on the variance. It multiplies the variance by the square of the constant.
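Both effects can be verified numerically. This is a sketch with invented durations; `spread` is a hypothetical helper name, and the constants (adding 10, multiplying by 1/60) are only examples.

```python
from statistics import pstdev, pvariance

minutes = [30, 45, 60, 90]            # made-up durations
shifted = [m + 10 for m in minutes]   # add a constant to every value
hours = [m / 60 for m in minutes]     # multiply every value by 1/60

def spread(values):
    """Return (range, standard deviation, variance) of the values."""
    return max(values) - min(values), pstdev(values), pvariance(values)

print(spread(minutes))  # baseline
print(spread(shifted))  # unchanged by the shift
print(spread(hours))    # range and sd scaled by 1/60, variance by (1/60)**2
```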
Position: the position of a value, relative to other values in a set of observations. The most common measures of position are percentiles, quartiles, and standard scores (aka, z-scores).
The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.
Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.
Note the relationship between quartiles and percentiles. Q1 corresponds to P25, Q2 corresponds to P50, Q3 corresponds to P75. Q2 is the median value in the set.
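The quartile and percentile correspondence can be checked with `statistics.quantiles`. This is only a sketch on made-up data. Textbooks and software use slightly different conventions for computing quartiles, so the exact cut points may differ from a hand calculation.

```python
from statistics import median, quantiles

data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25, 30]  # made-up, 11 values

q1, q2, q3 = quantiles(data, n=4)      # 3 cut points -> 4 equal parts
percentiles = quantiles(data, n=100)   # 99 cut points -> 100 equal parts
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(q1, q2, q3)
print(p25, p50, p75)
print(median(data))
```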
Standard Scores (z-Scores)
A standard score (aka, a z-score) indicates how many standard deviations an element is from the mean. A standard score can be calculated from the following formula: \( z = (X - \mu) / \sigma \), where z is the z-score, X is the value of the element, \( \mu \) is the mean of the population, and \( \sigma \) is the standard deviation.
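A minimal sketch of the z-score formula. The function name `z_score` and the test-score numbers (mean 100, standard deviation 15) are assumptions for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical scores from a population with mean 100 and standard deviation 15.
z_above = z_score(130, mu=100, sigma=15)  # two standard deviations above the mean
z_below = z_score(85, mu=100, sigma=15)   # one standard deviation below the mean

print(z_above, z_below)
```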
Chart and plots
Difference between a bar chart and a histogram: with bar charts, each column represents a group defined by a categorical variable; with histograms, each column represents a group defined by a quantitative variable.
stemplot: shows the exact values of individual observations and their distribution across groups
boxplot: splits the quantitative data set into quartiles
Residual
Examine the residuals and the residual plot to check whether a linear regression model is appropriate.
If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
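To see what a non-random residual pattern looks like, the sketch below fits a straight line to deliberately curved data and lists the residuals (observed minus fitted). `fit_line` is a hypothetical helper that implements ordinary least squares, and the data are invented.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]  # deliberately curved (quadratic) data

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals are positive at both ends and negative in the middle, a U shape rather than a random scatter, which is the kind of pattern that suggests a linear model is not appropriate.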
Transformation to achieve linearity
When a residual plot reveals a data set to be nonlinear, it is often possible to “transform” the raw data to make it more linear. This allows us to use linear regression techniques more effectively with nonlinear data.
Linear transformation: adding, multiplying, or dividing by a constant. It does not change the correlation between variables.
Nonlinear transformation: for example, taking the square root of x. It does change the correlation between variables.
How to Perform a Transformation to Achieve Linearity: Transforming a data set to enhance linearity is a multi-step, trial-and-error process.
+ Conduct a standard regression analysis on the raw data.
+ Construct a residual plot. If the plot pattern is random, do not transform the data. If the plot pattern is not random, continue.
+ Compute the coefficient of determination (\(R^2\)).
+ Choose a transformation method (see the table below).
+ Transform the independent variable, dependent variable, or both.
+ Conduct a regression analysis, using the transformed variables.
+ Compute the coefficient of determination (\(R^2\)), based on the transformed variables.
+ If the transformed \(R^2\) is greater than the raw-score \(R^2\), the transformation was successful. If not, try a different transformation method.
The best transformation method (exponential model, quadratic model, reciprocal model, etc.) will depend on the nature of the original data. The only way to determine which method is best is to try each and compare the results (i.e., residual plots, correlation coefficients).
Regular Transformation methods
| Method | Transformation(s) | Regression equation | Predicted value (ŷ) |
|---|---|---|---|
| Linear regression | None | y = b0 + b1x | \(\hat{y} = b_0 + b_1 x\) |
| Exponential model | log(y) | log(y) = b0 + b1x | \(\hat{y} = 10^{(b_0 + b_1 x)}\) |
| Quadratic model | sqrt(y) | sqrt(y) = b0 + b1x | \(\hat{y} = (b_0 + b_1 x)^2\) |
| Reciprocal model | 1/y | 1/y = b0 + b1x | \(\hat{y} = 1 / (b_0 + b_1 x)\) |
| Logarithmic model | log(x) | y = b0 + b1log(x) | \(\hat{y} = b_0 + b_1 \log(x)\) |
| Power model | log(y), log(x) | log(y) = b0 + b1log(x) | \(\hat{y} = 10^{(b_0 + b_1 \log(x))}\) |
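The exponential-model row can be sketched as follows: fit the raw data, fit log10(y) against x, compare the two coefficients of determination, then back-transform the predictions. The data are made up to grow exactly exponentially, and `fit_line` and `r_squared` are hypothetical helpers, not functions from any library.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def r_squared(xs, ys, b0, b1):
    """Coefficient of determination of the line b0 + b1 * x on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5, 6]
ys = [3 * 2 ** x for x in xs]  # made-up exponential growth

# Step 1: regression on the raw data.
raw_b0, raw_b1 = fit_line(xs, ys)
raw_r2 = r_squared(xs, ys, raw_b0, raw_b1)

# Step 2: exponential model, i.e. regress log10(y) on x.
log_ys = [math.log10(y) for y in ys]
b0, b1 = fit_line(xs, log_ys)
transformed_r2 = r_squared(xs, log_ys, b0, b1)

# Step 3: back-transform to predictions on the original scale.
predicted = [10 ** (b0 + b1 * x) for x in xs]

print(raw_r2, transformed_r2)
```

Because the invented data are exactly exponential, the transformed fit is essentially perfect. Real data would show a smaller improvement.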
Influential points
Outliers: Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.
An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.
Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller.
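The with-and-without comparison can be sketched like this. The five points lie exactly on y = 2x, and the extra point (20, 5) is an invented outlier. `slope` is a hypothetical helper implementing the least-squares slope.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # exactly y = 2x
outlier = (20, 5)       # made-up point far from the pattern

slope_without = slope(xs, ys)
slope_with = slope(xs + [outlier[0]], ys + [outlier[1]])

print(slope_without, slope_with)
```

A single point drags the slope from 2 to nearly 0, which is what makes it influential.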
If your data set includes an influential point, here are some things to consider.
Categorical Data
One-way table: the tabular equivalent of a bar chart. Like a bar chart, a one-way table displays categorical data in the form of frequency counts and/or relative frequencies (proportions or percentages).
Two-way table: its entries can be displayed as frequency counts or as relative frequencies (just like a one-way table), computed for the whole table, for rows, or for columns. A two-way table can be displayed graphically as a segmented bar chart.
Probability
Law of large numbers (LLN): the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
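A small simulation illustrates the idea, assuming a fair six-sided die (expected value 3.5). The seed and the trial counts are arbitrary choices.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def average_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

expected_value = 3.5
averages = {n: average_of_rolls(n) for n in (10, 1_000, 100_000)}

for n, avg in averages.items():
    print(n, avg, abs(avg - expected_value))
```

The gap to 3.5 tends to shrink as the number of rolls grows, although any single small run can land close by luck.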
Rules of probability:
The probability that Event A occurs, given that Event B has occurred, is called a conditional probability. The conditional probability of Event A, given Event B, is denoted by the symbol P(A|B).
The probability that Events A and B both occur is the probability of the intersection of A and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.
The probability that Events A or B occur is the probability of the union of A and B. The probability of the union of Events A and B is denoted by P(A ∪ B).
If the occurrence of Event A changes the probability of Event B, then Events A and B are dependent. On the other hand, if the occurrence of Event A does not change the probability of Event B, then Events A and B are independent.
Rule of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B) or P(A ∪ B) = P(A) + P(B) - P(A)P(B|A)
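Both forms of the addition rule can be checked by counting outcomes. The sketch assumes one roll of a fair die, and the events A and B are made up for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                  # event A: the roll is even
B = {4, 5, 6}                  # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_union = prob(A | B)                         # counted directly
p_intersection = prob(A & B)
p_b_given_a = Fraction(len(A & B), len(A))    # P(B|A)

addition_rule = prob(A) + prob(B) - p_intersection
addition_rule_conditional = prob(A) + prob(B) - prob(A) * p_b_given_a

print(p_union, addition_rule, addition_rule_conditional)
```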
Random variables
When the value of a variable is determined by a chance event, that variable is called a random variable.
When comparing discrete and continuous variables, it is more correct to say that continuous variables can always take on an infinite number of values; whereas some discrete variables can take on an infinite number of values, but others cannot.
Probability distribution
A probability distribution is a table or an equation that links each possible value that a random variable can assume with its probability of occurrence.
Note: Given a probability distribution, you can find cumulative probability. The probability distribution of a continuous random variable is represented by an equation, called the probability density function (pdf).
Mean and variance
\( E(X) = \mu_x = \sum [ x_i \cdot P(x_i) ] \) where \( x_i \) is the value of the random variable for outcome i, \( \mu_x \) is the mean of random variable X, and \( P(x_i) \) is the probability that the random variable will be outcome i.
\( \sigma^2 = \sum \{ [ x_i - E(X) ]^2 \cdot P(x_i) \} \) where \( x_i \) is the value of the random variable for outcome i, \( P(x_i) \) is the probability that the random variable will be outcome i, and \( E(X) \) is the expected value of the discrete random variable X.
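Applied to one roll of a fair die (an example chosen here for illustration), the two formulas look like this. `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

# Probability distribution of one roll of a fair die: value -> probability.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum of x_i * P(x_i)
expected = sum(x * p for x, p in distribution.items())

# sigma^2 = sum of (x_i - E(X))^2 * P(x_i)
variance = sum((x - expected) ** 2 * p for x, p in distribution.items())

print(expected, variance)
```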
Independent random variables
If two random variables, X and Y, are independent, they satisfy the following conditions.
P(x|y) = P(x) OR P(x ∩ y) = P(x) * P(y), for all values of X and Y. The above conditions are equivalent.
the correlation between X and Y is equal to zero.
Linear transformation
+ Adding a constant: Y = X + b
+ Subtracting a constant: Y = X - b
+ Multiplying by a constant: Y = mX
+ Dividing by a constant: Y = X/m
+ Multiplying by a constant and adding a constant: Y = mX + b
+ Dividing by a constant and subtracting a constant: Y = X/m - b
For Y = mX + b, the mean and variance of Y are \( \bar{Y} = m\bar{X} + b \) and \( \mathrm{Var}(Y) = m^2 \cdot \mathrm{Var}(X) \),
where m and b are constants, \( \bar{Y} \) is the mean of Y, \( \bar{X} \) is the mean of X, Var(Y) is the variance of Y, and Var(X) is the variance of X.
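A quick check of the mean and variance rules on a small made-up distribution. The constants m = 3 and b = 2 are arbitrary, and `mean_and_variance` is a hypothetical helper.

```python
from fractions import Fraction

def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as value -> probability."""
    mean = sum(x * p for x, p in dist.items())
    var = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, var

# Made-up distribution of X.
x_dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

m, b = 3, 2  # arbitrary constants for Y = mX + b
y_dist = {m * x + b: p for x, p in x_dist.items()}

mean_x, var_x = mean_and_variance(x_dist)
mean_y, var_y = mean_and_variance(y_dist)

print(mean_x, var_x)
print(mean_y, var_y)
```

Adding b shifts the mean but leaves the variance alone, and multiplying by m scales the variance by m squared.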
Binomial distribution
Normal Distribution
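The normal distribution is defined by the normal equation: \( f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)} \), where \( \mu \) is the mean and \( \sigma \) is the standard deviation.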
The graph of the normal distribution depends on two factors - the mean and the standard deviation. The mean of the distribution determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. When the standard deviation is large, the curve is short and wide; when the standard deviation is small, the curve is tall and narrow.
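Every normal curve (regardless of its mean or standard deviation) conforms to the following “rule”: about 68% of the area under the curve falls within 1 standard deviation of the mean, about 95% falls within 2 standard deviations, and about 99.7% falls within 3 standard deviations. Collectively, these points are known as the empirical rule or the 68-95-99.7 rule.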
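The 68-95-99.7 figures can be reproduced with `statistics.NormalDist` from the standard library. `share_within` is a hypothetical helper, and the second distribution (mean 50, standard deviation 10) is an arbitrary example to show the rule does not depend on the parameters.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Area under a normal curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # Same share for the standard normal and for an arbitrary normal curve.
    print(k, round(share_within(k), 4), round(share_within(k, mu=50, sigma=10), 4))
```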
Standard normal distribution
A special case of the normal distribution which has a mean of zero and a standard deviation of one. The normal random variable of a standard normal distribution is called a standard score or a z-score. \( z = (X - \mu) / \sigma \), where X is a normal random variable, \( \mu \) is the mean of X, and \( \sigma \) is the standard deviation of X.
t distribution
The t distribution (aka, Student’s t-distribution) is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown.
According to the central limit theorem, the sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is sufficiently large. Therefore, when we know the standard deviation of the population, we can compute a z-score, and use the normal distribution to evaluate probabilities with the sample mean.
But sample sizes are sometimes small, and often we do not know the standard deviation of the population. When either of these problems occurs, statisticians rely on the distribution of the t statistic (also known as the t score), whose values are given by:
\( t = (\bar{x} - \mu) / (s / \sqrt{n}) \)
where \( \bar{x} \) is the sample mean, \( \mu \) is the population mean, s is the standard deviation of the sample, and n is the sample size. The distribution of the t statistic is called the t distribution or the Student t distribution.
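A worked sketch of the t statistic. The sample values and the population mean of 7 are assumptions made up for the example.

```python
import math
from statistics import mean, stdev

sample = [5, 7, 9, 11, 13]  # made-up sample
mu = 7                      # assumed population mean

x_bar = mean(sample)        # sample mean
s = stdev(sample)           # sample standard deviation (n - 1 denominator)
n = len(sample)

t = (x_bar - mu) / (s / math.sqrt(n))
print(x_bar, s, t)
```

Here the sample mean is 9, the sample variance is 10, and t works out to the square root of 2, about 1.41.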
Properties of the t Distribution
T-Distribution vs. Normal Distribution
The t distribution and the normal distribution can both be used with statistics that have a bell-shaped distribution, so guidelines exist to help you choose between them. Some focus on the population standard deviation:
If the population standard deviation is known, use the normal distribution. If the population standard deviation is unknown, use the t-distribution.
Others focus on the sample size:
If the sample size is large, use the normal distribution. If the sample size is small, use the t-distribution.
Estimation in statistics
In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.
An estimate of a population parameter may be expressed in two ways: as a point estimate (a single value) or as an interval estimate (a range of values).
Confidence interval
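A confidence interval expresses the precision and uncertainty associated with a particular sampling method. Its confidence level describes the likelihood that the sampling method will produce a confidence interval that includes the true population parameter. A confidence interval consists of three parts: a confidence level, a statistic, and a margin of error.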
Standard error
The standard error is an estimate of the standard deviation of a statistic and is a measure of variability. For the sample mean, SE = s / sqrt( n ). The equations for the standard error are identical to the equations for the standard deviation, except for one thing: the standard error equations use statistics where the standard deviation equations use parameters (which come from the population and are often unknown). Specifically, the standard error equations use p in place of P, and s in place of sigma.
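For the sample mean, the standard error can be sketched in a few lines. The sample is invented, and `standard_error_of_mean` is a hypothetical helper name.

```python
import math
from statistics import stdev

def standard_error_of_mean(sample):
    """SE = s / sqrt(n), using the sample standard deviation s."""
    return stdev(sample) / math.sqrt(len(sample))

sample = [12, 15, 11, 18, 14, 16, 13, 17]  # made-up observations
se = standard_error_of_mean(sample)

# Repeating the same values four times mimics a larger sample with similar spread.
se_larger_sample = standard_error_of_mean(sample * 4)

print(se, se_larger_sample)
```

With a similar spread, the larger sample gives a smaller standard error, because n sits under a square root in the denominator.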
Margin of error
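The margin of error can be defined by either of the following equations: margin of error = critical value * standard deviation of the statistic, or margin of error = critical value * standard error of the statistic.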
The central limit theorem states that the sampling distribution of a statistic will be nearly normal, if the sample size is large enough. As a rough guide, many statisticians say that a sample size of 30 is large enough when the population distribution is bell-shaped. But if the original population is badly skewed, has multiple peaks, and/or has outliers, researchers like the sample size to be even larger.
When the sampling distribution is nearly normal, the critical value can be expressed as a t score or as a z score. When the sample size is smaller, the critical value should only be expressed as a t score.
To find the critical value, follow the steps:
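One common recipe for a z critical value, assumed here rather than taken from these notes, is: compute alpha = 1 - confidence level, compute the critical probability 1 - alpha/2, and look up the z-score with that cumulative probability. The sketch below does this with `statistics.NormalDist` and multiplies the result by a standard error to get a margin of error. The helper names and the standard error of 2.0 are made up.

```python
from statistics import NormalDist

def z_critical_value(confidence_level):
    """z-score whose cumulative probability is 1 - alpha / 2."""
    alpha = 1 - confidence_level
    critical_probability = 1 - alpha / 2
    return NormalDist().inv_cdf(critical_probability)

def margin_of_error(confidence_level, standard_error):
    """Margin of error = critical value * standard error (z-based)."""
    return z_critical_value(confidence_level) * standard_error

z_95 = z_critical_value(0.95)
me = margin_of_error(0.95, standard_error=2.0)  # hypothetical standard error

print(round(z_95, 3), round(me, 3))
```

This only covers the z case. For small samples a t critical value would be needed, which the standard library does not provide.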
Proportion confidence interval
Prerequisite:
The standard deviation of the sample proportion, sigma(p), is:
sigma(p) = sqrt[ P * ( 1 - P ) / n ] * sqrt[ ( N - n ) / ( N - 1 ) ]
where P is the population proportion, n is the sample size, and N is the population size.
When the population size is much larger (at least 20 times larger) than the sample size, the standard deviation can be approximated by:
sigma(p) = sqrt[ P * ( 1 - P ) / n ]
When the true population proportion P is unknown, use the standard error, which replaces P with the sample proportion p:
SE(p) = sqrt[ p * ( 1 - p ) / n ] * sqrt[ ( N - n ) / ( N - 1 ) ]
When the population size is much larger (at least 20 times larger) than the sample size, the standard error can be approximated by:
SE(p) = sqrt[ p * ( 1 - p ) / n ]
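The sketch below compares the full formula with the large-population approximation. `se_proportion` is a hypothetical helper, and the sample proportion (0.4), sample size (100), and population sizes are invented.

```python
import math

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion p from a sample of size n.

    If the population size N is given, the finite population
    correction sqrt((N - n) / (N - 1)) is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

se_small_population = se_proportion(0.4, n=100, N=1_000)      # N is only 10x n
se_large_population = se_proportion(0.4, n=100, N=1_000_000)  # N is much larger
se_approximation = se_proportion(0.4, n=100)                  # no correction

print(se_small_population, se_large_population, se_approximation)
```

When the population is only ten times the sample size, the correction lowers the standard error noticeably. With a very large population the corrected and approximate values are almost identical.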
Hypothesis testing
There are two types of statistical hypotheses: