Chapter 13: Elementary Statistics

Basic Definitions

Statistics deals with the collection, organization, analysis, interpretation and presentation of data.

Statistics is the practice of turning data into information to identify trends and understand features of populations.

A variable is a characteristic of an individual in a population, the value of which can differ between entities within that population.

Types of Variables

A numeric variable is one whose observations are naturally recorded as numbers.

A continuous variable can be recorded as any value in some interval.

A discrete variable may take on only distinct numeric values.

A categorical variable may take only one of a finite number of possibilities.

A nominal variable is a categorical variable that cannot be logically ranked.

An ordinal variable is a categorical variable that can be naturally ranked.

Univariate data deals with only one variable at a time. Bivariate data deals with two variables at a time. Multivariate data deals with many variables at a time.

Summary Statistics

The distribution of a variable tells us what values it takes and how often it takes them. Several summary numbers help describe a distribution.

Measures of Central Tendency

Mean

The arithmetic average: the sum of all the observations divided by the number of observations \(n\).

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\]

Median

Order all the observations from smallest to largest. The median is the middle value (for odd \(n\)) or the average of the two middle values (for even \(n\)).

This value divides our data in half. The median is less affected by outliers than the mean and preferred when dealing with skewed data.

Mode

The most frequently occurring value. A data set can have more than one mode.
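As a small illustration of the three measures above, the following sketch uses Python's standard `statistics` module. The data values are hypothetical and were chosen so that a single extreme observation (40) pulls the mean well above the median.

```python
from statistics import mean, median, multimode

# Hypothetical observations; 40 is an extreme value.
data = [2, 3, 3, 5, 7, 10, 40]

data_mean = mean(data)        # sum of the observations divided by n
data_median = median(data)    # middle value of the sorted data
data_modes = multimode(data)  # list of the most frequent value(s)

print(data_mean, data_median, data_modes)  # 10 5 [3]
```

Here the median (5) stays near the bulk of the data, while the mean (10) is larger than six of the seven observations.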

Counts, Percentages, and Proportions

For categorical data, numeric summaries such as the mean make no sense. However, it is often useful to count the number of observations that fall within each category. These counts, or frequencies, are the most elementary summary statistic for categorical data.

A table is a useful way to summarize these frequencies.

From the table it is easy to calculate the proportion of observations in each category, usually expressed as a decimal. Multiplying a proportion by 100 gives the corresponding percentage.
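A frequency table and the matching proportions can be sketched in a few lines of Python. The category labels below are hypothetical, and `collections.Counter` is just one convenient way to tally them.

```python
from collections import Counter

# Hypothetical categorical observations.
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)  # frequency of each category
n = len(colors)
proportions = {category: count / n for category, count in counts.items()}
percentages = {category: 100 * prop for category, prop in proportions.items()}

for category in counts:
    print(category, counts[category], round(proportions[category], 3))
```

The proportions always sum to 1, which is a quick check on the table.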

Measures of Location

A quantile is a value computed from a collection of numeric measurements that indicates an observation’s rank when compared to all the other observations.

Percentiles are quantiles expressed on a percent scale. They divide the distribution of a variable into 100 equal parts.

To find the \(p\)th percentile, sort the data from smallest to largest and calculate:

\[m = \frac{p}{100}n\]

If \(m\) is not an integer, round up: the \(p\)th percentile is the data value at the position of the next integer greater than \(m\). If \(m\) is an integer, the \(p\)th percentile is the mean of the values at positions \(m\) and \(m+1\).

Other quantiles, such as deciles and quartiles, are defined in the same way.
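The position rule above can be sketched as a short Python function. The function name `percentile` and the sample data are hypothetical. Returning the minimum for \(p = 0\) and the maximum for \(p = 100\) is an assumption added here, since the rule itself does not give usable positions at those two endpoints.

```python
def percentile(values, p):
    """Return the p-th percentile of values using the position rule above."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: the 0th and 100th percentiles are the minimum and maximum.
    if p == 0:
        return ordered[0]
    if p == 100:
        return ordered[-1]
    # m = p * n / 100; divmod splits it into a whole part and a remainder.
    whole, remainder = divmod(p * n, 100)
    whole = int(whole)
    if remainder != 0:
        # m is not an integer: round up to 1-based position whole + 1,
        # which is 0-based index whole.
        return ordered[whole]
    # m is an integer: average the values at 1-based positions m and m + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


# Hypothetical data with n = 8.
data = [15, 20, 35, 40, 50, 55, 60, 70]
print(percentile(data, 40))  # m = 3.2, round up to position 4 -> 40
print(percentile(data, 25))  # m = 2, mean of positions 2 and 3 -> 27.5
```

Statistical software often uses interpolation rules that differ from this one, so results from a library function may not match exactly.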

Five-number summary

The five-number summary of a variable consists of the 0th percentile (the minimum), the 25th percentile (the first quartile), the 50th percentile (the median), the 75th percentile (the third quartile), and the 100th percentile (the maximum).

Measures of Variability

Measures of spread or dispersion tell us how much variability there is in the data.

Range

The difference between the maximum and minimum values.

IQR - Interquartile Range

The difference between the third and first quartiles.
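The five-number summary, range, and IQR can be computed together, as in the sketch below. It relies on `statistics.quantiles` with `method="inclusive"`, which treats the minimum and maximum as the 0th and 100th percentiles but interpolates between data values. That is a different convention from the position rule given earlier, so the quartiles it returns may differ slightly from a hand calculation. The data and variable names are hypothetical.

```python
from statistics import quantiles

# Hypothetical numeric observations.
data = [15, 20, 35, 40, 50, 55, 60, 70]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
data_range = max(data) - min(data)  # maximum minus minimum
iqr = q3 - q1                       # third quartile minus first quartile

print(five_number_summary)  # (15, 31.25, 45.0, 56.25, 70)
print(data_range, iqr)      # 55 25.0
```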

Variance and Standard Deviation

The sample variance measures the degree of the spread of numeric observations around their arithmetic mean.

The variance is approximately the average squared distance of the observations from the mean, with the sum of squared distances divided by \(n-1\) rather than \(n\).

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2\]

The standard deviation is simply the square root of the variance: \(s = \sqrt{s^2}\).
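To make the formula concrete, the sketch below computes the sample variance term by term and compares it with `statistics.variance` and `statistics.stdev`, which also use the \(n-1\) divisor. The data values are hypothetical.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical numeric observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
x_bar = mean(data)  # 5

# Sum of squared deviations from the mean, divided by n - 1.
s2_manual = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s_manual = math.sqrt(s2_manual)

print(s2_manual, variance(data))  # both equal 32/7, about 4.571
print(s_manual, stdev(data))      # both about 2.138
```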

Measures of Relative Standing

The z-score indicates the relative position of a data value within the data set it belongs to, measured as the number of standard deviations it lies above or below the mean.

Computing z-scores is a transformation of the data known as standardization (sometimes also called normalization). To do this transformation, subtract the mean and divide by the standard deviation:

\[z_i = \frac{x_i - \bar{x}}{s}\]
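A minimal sketch of the standardization step follows. The function name `z_scores` and the data are hypothetical. After the transformation the values have mean 0 and sample standard deviation 1, up to floating-point rounding.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample sd."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(x - x_bar) / s for x in values]


# Hypothetical numeric observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)

print([round(value, 2) for value in z])
print(round(mean(z), 10), round(stdev(z), 10))
```

An observation equal to the mean, such as 5 here, gets a z-score of 0.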

Outliers

An outlier is an observation that does not appear to “fit” with the rest of the data.

A value whose z-score lies beyond \(\pm 3\) (or, under a stricter rule, beyond \(\pm 2.5\)) might be considered an outlier.
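The z-score rule of thumb can be sketched as a simple filter. The function name `flag_outliers`, the default threshold, and the data are hypothetical. Note that in a small sample no z-score can exceed \((n-1)/\sqrt{n}\) in absolute value, so the rule can only flag points at the \(\pm 3\) cutoff when there are at least 11 observations.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score exceeds threshold in absolute value."""
    x_bar = mean(values)
    s = stdev(values)
    return [x for x in values if abs((x - x_bar) / s) > threshold]


# Hypothetical data: eleven values near 10 and one extreme value.
data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 50]

print(flag_outliers(data))                 # [50]
print(flag_outliers(data, threshold=2.5))  # [50]
```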

Measures of Association

Sometimes we are interested in investigating the relationship between two numeric variables.

The covariance expresses how much two numeric variables change together and the nature of that relationship, whether it is positive or negative.

\[s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})\]

The correlation allows you to interpret the covariance further by identifying both the direction and the strength of any association.

\[r_{xy} = \frac{s_{xy}}{s_x s_y}\]

where \(s_x\) and \(s_y\) are the standard deviations of the two variables and \(-1 \le r_{xy} \le +1\).

Note that this is Pearson’s product-moment correlation coefficient, which measures only the linear association between two variables. Correlations closer to \(\pm 1\) indicate a strong linear association and those closer to \(0\) indicate a weak one.
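The two formulas can be sketched directly in Python. The function names and data are hypothetical. The second example is chosen to show the limitation mentioned above: \(y = x^2\) is perfectly determined by \(x\), yet on data symmetric around zero the linear correlation is 0.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))


# Hypothetical data: a perfect linear relationship, y = 2x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(covariance(x, y), round(correlation(x, y), 6))  # 5.0 1.0

# Hypothetical data: a perfect but nonlinear relationship, v = u squared.
u = [-2, -1, 0, 1, 2]
v = [4, 1, 0, 1, 4]
print(correlation(u, v))  # 0.0
```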