Statistics deals with the collection, organization, analysis, interpretation and presentation of data.
Statistics is the practice of turning data into information to identify trends and understand features of populations.
A variable is a characteristic of an individual in a population, the value of which can differ between entities within that population.
A numeric variable is one whose observations are naturally recorded as numbers.
A continuous variable can be recorded as any value in some interval.
A discrete variable may take on only distinct numeric values.
A categorical variable may take only one of a finite number of possibilities.
A nominal variable is a categorical variable that cannot be logically ranked.
An ordinal variable is a categorical variable that can be naturally ranked.
Univariate data deals with only one variable at a time. Bivariate data deals with two variables at a time. Multivariate data deals with many variables at a time.
A population is defined as the entire collection of individuals or entities of interest.
Characteristics that describe a population are called parameters. We typically use Greek letters to denote these.
A sample is any subset of the population.
Characteristics that describe a sample are called statistics. We typically use Latin letters to denote these.
In inferential statistics our goal is to use the information we have available in our sample to find out about our population.
The distribution of a variable tells us what values it takes and how often. There are many numbers we can calculate that will tell this story.
The mean is the arithmetic average: the sum of all the observations divided by the number of observations \(n\).
\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\]
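As a minimal sketch of the formula above, the mean can be computed directly from its definition. The function name `mean` and the sample values are illustrative assumptions, not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n


# Made-up sample of five observations.
data = [2, 4, 4, 5, 10]
print(mean(data))  # 5.0
```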
Order all the observations from smallest to largest. The median is the middle value (for odd \(n\)) or the average of the two middle values (for even \(n\)).
The median divides the data in half. It is less affected by outliers than the mean and is preferred when dealing with skewed data.
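The procedure can be sketched as below, under the assumption that the data fit in a list. The name `median` and the sample values are hypothetical; the sample includes one extreme value to show how the median and the mean react differently.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two


# Made-up sample with one extreme observation (100).
data = [3, 1, 4, 2, 100]
print(median(data))           # 3
print(sum(data) / len(data))  # 22.0, the mean is pulled toward the outlier
```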
The mode is the most common value.
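A data set can have more than one most common value, so the sketch below returns every value that is tied for the highest count. The helper name `modes` and the sample values are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    if not counts:
        raise ValueError("the mode of an empty data set is undefined")
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


print(modes(["a", "b", "a", "c"]))  # ['a']
print(modes([1, 1, 2, 2, 3]))       # [1, 2], two values are tied
```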
For categorical data, it makes no sense to calculate numeric summaries. However, it is sometimes useful to count the number of observations that fall within each category. These counts or frequencies represent the most elementary summary statistic of categorical data.
A table is a useful way to summarize these frequencies.
From the table it is easy to calculate the proportion that represents the fraction of observations in each category, usually expressed as a decimal.
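One possible way to build such a table is to count each category and divide by the total number of observations. The function `frequency_table` and the color data are hypothetical examples.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, proportion)} for categorical observations."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in counts.items()}


# Hypothetical observations of a nominal variable.
colors = ["red", "blue", "red", "green", "red", "blue"]
table = frequency_table(colors)
for category, (count, proportion) in table.items():
    print(f"{category:<6} {count:>3} {proportion:.3f}")
```

The proportions always sum to 1, which is a quick check that no observation was dropped.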
A quantile is a value computed from a collection of numeric measurements that indicates an observation’s rank when compared to all the other present observations.
Percentiles are quantiles expressed on a percent scale. They divide the distribution of a variable into 100 equal parts.
To find the \(p\)th percentile, rank-order the data and calculate:
\[m = \frac{p}{100}n\]
If \(m\) is not an integer, round it up: the \(p\)th percentile is the data value at the position of the next integer greater than \(m\). If \(m\) is an integer, the \(p\)th percentile is the mean of the values at positions \(m\) and \(m+1\). We can also talk about deciles (10 equal parts), quartiles (4 equal parts), etc.
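The rule above can be sketched as follows. This is only one of several percentile conventions in use, so library functions may return slightly different values. The name `percentile` is an assumption, and the sketch restricts \(p\) to strictly between 0 and 100 so that positions \(m\) and \(m+1\) always exist.

```python
import math


def percentile(values, p):
    """p-th percentile using the position rule m = (p / 100) * n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or not 0 < p < 100:
        raise ValueError("need non-empty data and 0 < p < 100")
    m = p * n / 100
    if m.is_integer():
        k = int(m)
        # m is an integer: average the values at 1-based positions m and m + 1.
        return (ordered[k - 1] + ordered[k]) / 2
    # m is not an integer: round up and take the value at that 1-based position.
    return ordered[math.ceil(m) - 1]


data = [15, 20, 35, 40, 50]  # made-up sample, n = 5
print(percentile(data, 25))  # m = 1.25 -> position 2 -> 20
print(percentile(data, 40))  # m = 2    -> mean of positions 2 and 3 -> 27.5
print(percentile(data, 50))  # m = 2.5  -> position 3 -> 35
```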
The five-number summary of a variable consists of the 0th percentile (the minimum), the 25th percentile (the first quartile), the 50th percentile (the median), the 75th percentile (the third quartile), and the 100th percentile (the maximum).
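A short sketch of the five-number summary, assuming Python 3.8 or later for `statistics.quantiles`. The wrapper name `five_number_summary` is hypothetical, and the `"inclusive"` method is one of several quartile conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up sample
print(five_number_summary(data))
```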
Measures of spread or dispersion tell us how much variability there is in the data.
The range is the difference between the maximum and minimum values.
The interquartile range (IQR) is the difference between the third and first quartiles.
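Both measures can be sketched in a few lines. The names `data_range` and `iqr` are assumptions, and the quartiles come from `statistics.quantiles` (Python 3.8+) with the `"inclusive"` method, which is only one of several quartile conventions.

```python
import statistics


def data_range(values):
    """Difference between the largest and smallest observation."""
    return max(values) - min(values)


def iqr(values):
    """Difference between the third and first quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up sample
print(data_range(data))  # 8
print(iqr(data))         # 4.0
```

Because the IQR ignores the lowest and highest quarters of the data, a single extreme value changes the range but usually leaves the IQR untouched.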
The sample variance measures the degree of the spread of numeric observations around their arithmetic mean.
The variance is approximately the average squared distance of the observations from the mean. For a sample, the sum of squared deviations is divided by \(n-1\) rather than \(n\).
\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2\]
The standard deviation is simply the square root of the variance, \(s = \sqrt{s^2}\).
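The two formulas above translate almost line for line into code. In this sketch the names `sample_variance` and `sample_sd` are assumptions, and the standard library's `statistics.variance` is used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample with mean 5
print(sample_variance(data))      # 32 / 7, about 4.571
print(sample_sd(data))            # about 2.138
print(statistics.variance(data))  # library cross-check, same n - 1 denominator
```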
The z-score indicates the relative position of a data value within its data set: it counts how many standard deviations the value lies above or below the mean.
Computing z-scores is a transformation of the data, also called standardization or normalization. To do this transformation, subtract the mean and divide by the standard deviation:
\[z_i = \frac{x_i - \bar{x}}{s}\]
An outlier is an observation that does not appear to “fit” with the rest of the data.
A value whose z-score is beyond \(\pm 3\) (or, under a stricter rule of thumb, beyond \(\pm 2.5\)) might be considered an outlier.
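A minimal sketch of standardization and of the rule of thumb for outliers. The names `z_scores` and `flag_outliers`, the default threshold, and the sample values are all illustrative assumptions. The sketch uses the sample standard deviation (denominator \(n-1\)).

```python
import statistics


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in values]


def flag_outliers(values, threshold=3.0):
    """Return the observations whose |z| exceeds the chosen threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Made-up sample: nine similar values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
print(flag_outliers(data, threshold=2.5))  # [50], its z-score is about 2.84
print(flag_outliers(data, threshold=3.0))  # [], nothing exceeds 3
```

The example shows that the choice of cutoff matters: the same observation is flagged at 2.5 but not at 3.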
Sometimes we are interested in investigating the relationship between two numeric variables.
The covariance expresses how much two numeric variables change together and the nature of that relationship, whether it is positive or negative.
\[s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})\]
The correlation allows you to interpret the covariance further by identifying both the direction and the strength of any association.
\[r_{xy}=\frac{s_{xy}}{s_x s_y}\] where \(-1 \le r_{xy} \le +1\).
Notice that this is Pearson’s product-moment correlation coefficient and it measures only the linear association between two variables. Correlations closer to \(\pm 1\) are strong and those closer to \(0\) are weak.
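The sample covariance and Pearson's correlation can be sketched from their definitions. The function names and the paired sample are assumptions for illustration. The sketch uses the fact that the covariance of a variable with itself is its variance, so \(s_x s_y = \sqrt{s_{xx}\, s_{yy}}\).

```python
import math


def covariance(x, y):
    """Sample covariance of paired observations (denominator n - 1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson's r: the covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Made-up paired sample.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covariance(x, y))   # 1.5
print(correlation(x, y))  # about 0.775, a fairly strong positive association
```

An exactly linear relationship such as `y = 2 * x + 1` gives a correlation of 1 up to floating-point rounding, and reversing the sign of the slope gives -1.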