Introduction to Statistics: Terminology and Concepts

Rasim Muzaffer Musal

What is Statistics? What is Data?

  • Datum (Data): Set of recorded observation(s) on object(s). (customers, animals, energy, purchase decisions, demographics etc…)

  • Statistics can be defined as the study of data and the processes that create that data.

  • Taxonomy is the science of how an organism/system is organized/classified.

  • Similar to other sciences, there will be more than one way to organize statistics. Here we present one such taxonomy.

Intro Stats Taxonomy

Descriptive Statistics

  • Measures of Central Tendency: Mean, Median, Mode

    • Describes where balance lies in the data.
  • Mean: Average. If every observation’s value is given equal weight, where would the central point be?

  • Median: If every observation’s position in the sorted data is given equal weight, where would the central point be?

  • Mode: What value repeats most in data?

Descr. Stats: Measures of Dispersion

  • Variance, Standard Deviation, Interquartile Range. These describe how much the data deviate (vary).

    • Variance, Standard Deviation: Standard deviation is the square root of variance. Each is a number between 0 and positive infinity \((+\infty)\). They measure how the data deviate around the mean.

Descr. Stats: Measures of Dispersion

  • Interquartile Range: Difference between the \(75^{th}\) and \(25^{th}\) percentiles. Used to define what is and is not an outlier.
    • A percentile is a value such that \(p\) percent of the data is below that value. \(p\) is a number between 0 and 100, where the 0th percentile is the minimum and the 100th percentile is the maximum observed value.
  • Range: Difference between the maximum and the minimum values.

Terminology: Population vs Sample

  • Population is a predefined set of elements.

    • Grades of students in a particular class, life expectancy of Ankole Watusi cattle, amount of money people ages 18-22 spent in Durham NH and San Marcos Texas… A population does not need to be large.
  • Sample is a selected group of elements from that population.

  • Data can be obtained by observing the population or sample.

What is statistics: Descriptive

  • Say we observe 3 values (context is deliberately omitted for now): 5, 9, and 10. If I ask you what the average is, you might instinctively add them up and divide by 3: \(\frac{5+9+10}{3}=8\).

  • But what is the idea here? How can we present this in a more general form?

  • First let us define a new concept. A random variable.

Terminology: Random Variable

  • A random variable is a quantity of interest observed from a process whose outcome is not known with certainty.

  • We usually represent them with Latin letters, X,Y,Z,M etc…

  • The capital Latin letter is shorthand for the random variable that the researcher/scientist defines. In formulations we use the lowercase of the same letter to indicate an arbitrary value that the random variable can take.

  • The actual values are represented with Arabic numbers.

Using a random variable in formulations.

\[X=x\]

  • X is the definition of what the value x represents.

  • If a formulation needs to refer to a particular value among those observed for X, the value carries an index and is written \(x_{i}\). For instance, suppose an arbitrary X has the following values in a list: \(X=\{5,9,10\}\).

  • The symbol \(x_{i=1}\) will refer to the value 5, \(x_{i=2}=9\) and of course \(x_{i=3}=10\).

Using a random variable in formulations.

\[\sum_{i=1}^{i=N} X=x_{i}\]

  • Describes that we should sum \((\Sigma)\) values of X observed from the first indexed value to the value represented by N.

  • If I declare that N is the number of observations and apply the formulation on this slide to the list of observed numbers \(X=\{5,9,10\}\), what I am asking you to do is:

Math Symbolism

\[\begin{aligned} X &= \{5,9,10\}\\ x_{1}&=5,\ x_{2}=9,\ x_{3}=10\\ N &= 3\\ \sum_{i=1}^{i=N} X=x_{i} \text{ or } \sum_{i=1}^{i=N} x_{i} &= x_{1}+x_{2}+x_{3}\\ &= 5+9+10=24 \end{aligned}\]
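The summation above can be mirrored in a few lines of Python. This is only a minimal sketch: the names `x`, `N`, and `total` are our own choices, and Python lists are indexed from 0 while the slides index from 1, so the index is shifted by one.

```python
# Observed values of X from the slide; x[0] plays the role of x_1.
x = [5, 9, 10]
N = len(x)  # number of observations

# Sum x_i for i = 1..N (shift by one because Python indexes from 0).
total = 0
for i in range(1, N + 1):
    total += x[i - 1]

print(total)  # 24, the same value the built-in sum(x) gives
```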

Formulation: Mean

\[\begin{aligned} \mu_{X}=\frac{\sum_{i=1}^{i=N}{X=x_{i}}}{N} \\ \bar{X}=\frac{\sum_{i=1}^{i=n}{X=x_{i}}}{n} \end{aligned}\]
  • \(\mu_{X}\) is the population mean.
  • \(\bar{X}\) is the sample mean.
  • N is population size and n is sample size.
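As an illustration, the mean formulation can be computed directly in Python. This sketch assumes the three values 5, 9, and 10 used earlier; the variable names are ours. The arithmetic is the same for \(\mu_{X}\) and \(\bar{X}\); what differs is whether the list holds the whole population or only a sample.

```python
from statistics import mean

x = [5, 9, 10]   # observed values (population or sample)
n = len(x)       # N for a population, n for a sample

# Sum of the values divided by how many there are.
x_bar = sum(x) / n
print(x_bar)  # 8.0

# The standard library's statistics.mean gives the same value.
same_as_library = (x_bar == mean(x))
print(same_as_library)  # True
```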

Algorithm: Median

  • We go through a set of steps to find the median.
    • Sort the data from smallest to largest or largest to smallest.
    • Find \(i^{*}\), the middle position of the sorted data, for data of length N. \[i^{*}=\frac{N+1}{2}\]
    • If \(i^{*}\) is a whole number, \(x_{i^{*}}\) exists and is the median. Otherwise,
    • find the two points neighboring \(i^{*}\) (positions \(i^{*}-0.5\) and \(i^{*}+0.5\)) and average them.

Example: Median

\[\begin{aligned} X=&\{5,9,10\}\\ N=&3, i^{*}=\frac{3+1}{2}=2, x_{2}=9\\ X=&\{5,9,10,20\}\\ N=&4, i^{*}=\frac{4+1}{2}=2.5, \frac{x_{2}+x_{3}}{2}=\frac{9+10}{2}=9.5\\ \end{aligned}\]
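The median steps can be sketched as a small Python function. The function name `median_by_steps` is hypothetical (it is not from the slides), and positions are converted from the 1-based index \(i^{*}\) to Python's 0-based indexing.

```python
def median_by_steps(values):
    """Median via the slide's steps: sort, find i*, average neighbors if needed."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)        # step 1: sort the data
    N = len(ordered)
    i_star = (N + 1) / 2            # step 2: middle position (1-based)
    if i_star.is_integer():
        # step 3: x_{i*} exists, so it is the median
        return ordered[int(i_star) - 1]
    # step 4: average the two neighbors of i*
    lower = ordered[int(i_star) - 1]   # position i* - 0.5 (1-based)
    upper = ordered[int(i_star)]       # position i* + 0.5 (1-based)
    return (lower + upper) / 2

print(median_by_steps([5, 9, 10]))      # 9
print(median_by_steps([5, 9, 10, 20]))  # 9.5
```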

Mode Definition

  • The most repeated value in the data.
  • Data can be unimodal where only a single value is repeated the most.
  • Data can be bimodal where 2 values are repeated the most.
  • Data can be multimodal where more than 2 values are repeated the most.
  • Data can be nonmodal where no value is repeated more often than the others.

Mode Examples

  • \(\{2,5,2,2,10\}\) 2 is the mode and data is unimodal.
  • \(\{2,5,5,2,10\}\) 2,5 are the modes and data is bimodal.
  • \(\{2,5,5,2,10,10,3\}\) 2,5,10 are the modes and data is multimodal.
  • \(\{2,5,5,2,10,10,3,3\}\) Every value is repeated the same number of times and data is nonmodal.
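The four examples above can be reproduced with a short Python sketch. The function name `classify_modes` is hypothetical, and the rule "if every distinct value occurs equally often, the data is nonmodal" is our reading of the convention on these slides; other texts may treat that case differently.

```python
from collections import Counter

def classify_modes(values):
    """Return (modes, label) following the slide's mode definitions."""
    counts = Counter(values)                 # how often each value occurs
    top = max(counts.values())               # highest repetition count
    modes = sorted(v for v, c in counts.items() if c == top)
    if len(modes) == len(counts):
        # Every distinct value is repeated equally often: no mode.
        return [], "nonmodal"
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

print(classify_modes([2, 5, 2, 2, 10]))            # ([2], 'unimodal')
print(classify_modes([2, 5, 5, 2, 10]))            # ([2, 5], 'bimodal')
print(classify_modes([2, 5, 5, 2, 10, 10, 3]))     # ([2, 5, 10], 'multimodal')
print(classify_modes([2, 5, 5, 2, 10, 10, 3, 3]))  # ([], 'nonmodal')
```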

Variance

  • Population variance \(\sigma^{2}_{X}\), sample variance \(s^{2}_{X}\)
  • The idea is to calculate how much on average the data deviate from the mean.

\[E([x-\mu]^{2})\]

\[\sigma^{2}_{X}=\frac{\sum_{i=1}^{i=N}(x_{i}-\mu_{X})^{2}}{N},\quad s^{2}_{X}=\frac{\sum_{i=1}^{i=n}(x_{i}-\bar{X})^{2}}{n-1}\]

Variance Example

\[X=\{5,9,10\}\]

  • Calculate \(\mu_{X}=\bar{X}=\frac{x_{1}+x_{2}+x_{3}}{n}=\frac{5+9+10}{3}=8\) \[\begin{aligned} \sigma^{2}_{X}&=\frac{(5-8)^{2}+(9-8)^{2}+(10-8)^{2}}{3} \\ &=\frac{(-3)^{2}+1^{2}+2^{2}}{3}=\frac{14}{3}\approx 4.67 \end{aligned}\] \[s^{2}_{X}=\frac{(5-8)^{2}+(9-8)^{2}+(10-8)^{2}}{2}=\frac{14}{2}=7 \]
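The same calculation can be checked in Python. This is a sketch with variable names of our own choosing; `pvariance` and `variance` are the population and sample variance functions from Python's standard `statistics` module, used here only as a cross-check.

```python
from statistics import pvariance, variance

x = [5, 9, 10]
n = len(x)
mean_x = sum(x) / n                              # 8.0

# Squared deviations from the mean: 9.0, 1.0, 4.0
squared_devs = [(xi - mean_x) ** 2 for xi in x]

pop_var = sum(squared_devs) / n                  # divide by N: 14/3
samp_var = sum(squared_devs) / (n - 1)           # divide by n-1: 7.0

print(round(pop_var, 2))  # 4.67
print(samp_var)           # 7.0

# Cross-check against the standard library.
matches_library = (abs(pop_var - pvariance(x)) < 1e-9
                   and abs(samp_var - variance(x)) < 1e-9)
print(matches_library)    # True
```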

Standard Deviation

  • Standard deviation is simply the square root of variance. Both variance and standard deviation are useful in inference and prediction formulations.

  • \(\sigma\) is population standard deviation and \(s\) is sample standard deviation.

  • Both standard deviation and variance can be a number that is greater than or equal to 0 but never negative.

  • What would it imply for the standard deviation or variance to be 0 for a random variable X?
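As a quick numerical illustration of the square-root relationship, the sketch below takes the values 5, 9, and 10 from the variance example and computes both standard deviations. The variable names are ours; `pstdev` and `stdev` come from Python's standard `statistics` module and serve as a cross-check.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

x = [5, 9, 10]

# Standard deviation is the square root of the matching variance.
sigma = math.sqrt(pvariance(x))   # population standard deviation
s = math.sqrt(variance(x))        # sample standard deviation

print(round(sigma, 3))  # 2.16
print(round(s, 3))      # 2.646

# The library's direct functions agree with the square-root route.
agrees = abs(sigma - pstdev(x)) < 1e-9 and abs(s - stdev(x)) < 1e-9
print(agrees)           # True
```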

A bit more on statistics:

  • Descriptive statistics is about summarizing data. The alternative would be to look at the raw data, which is impractical given the size of 21st-century datasets.

  • On the other hand, summarizing nontrivial data leads to loss of information.

  • This is why you need more than one summary statistic and should always be ready to ask questions.

Gatos Curiosos

  • A person argues that his donors are giving in small amounts and cites 50 dollars as the average donation amount.

  • What other statistic should you ask for to check this claim?

InterQuartile Range and finding Outliers

  • We will discuss the calculation for this statistic in the next session. However we still need to illustrate the formulations.

  • A data point is an outlier if it appears to have been observed from a process different from the process that created the rest of the data.

  • IQR: Interquartile range = 75th Percentile - 25th Percentile

  • Upper Bound = 75th Percentile +1.5*IQR

  • Lower Bound = 25th Percentile -1.5*IQR

  • Data above the upper bound and data below the lower bound are referred to as outliers. The multiplier 1.5 is a rule of thumb; some use 3.

InterQuartile Range Example

  • 75th percentile of an arbitrary variable X is 90.
  • 25th percentile of an arbitrary variable X is 50.
  • IQR=90-50=40. We know at this point that \(50\%\) of the data is between 50 and 90.
  • Upper Bound \(=90+1.5\times 40=150\) and Lower Bound \(=50-1.5\times 40=-10\)
  • The X Values above 150 and below -10 are going to be designated as outliers.
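The bound calculation can be sketched in Python as follows. The function names `outlier_bounds` and `is_outlier` are hypothetical, the percentiles are taken as given (as on this slide) rather than computed from data, and the test values 151 and 75 are made up for illustration.

```python
def outlier_bounds(q25, q75, k=1.5):
    """Return (lower, upper) bounds from the 25th and 75th percentiles.

    k is the rule-of-thumb multiplier (1.5 by default; some use 3).
    """
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

def is_outlier(value, lower, upper):
    """A value is flagged when it falls outside the two bounds."""
    return value < lower or value > upper

lower, upper = outlier_bounds(50, 90)
print(lower, upper)                   # -10.0 150.0

# Made-up values, for illustration only.
print(is_outlier(151, lower, upper))  # True
print(is_outlier(75, lower, upper))   # False
```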

Types of random variables.

  • So far every arbitrary variable we discussed was continuous. Continuous variables are variables such that between any two arbitrarily chosen numbers on the scale of that variable, a middle point exists.

  • Discrete variables are variables that do not have this property.

Types of random variables.

  • Gas price in a station, is it continuous or discrete? It is continuous.

  • Imagine a price point of 3.12 and another price point of 3.13.

  • The middle point theoretically exists at 3.125. You might not pay that 0.005 directly, but you can calculate with that number and get a sensible result. For instance, 100 gallons at 3.125 dollars per gallon cost 312.50 dollars. In short, the gas station might not show you all the decimal places (they probably round), because in real life we act on rounded numbers and do not really care about fractions of a cent.

  • The color/brand of a car is discrete. You can represent colors or brands with numbers, but a middle point does not necessarily exist (we are ignoring the color continuum and just thinking about the car colors that are already set in the manufacturing plant).

Types of random variables.

  • Discrete vs continuous is one way to classify types of variables.

  • Another way is by level of measurement:

    • Nominal
    • Ordinal
    • Interval and Ratio

Nominal random variables

  • Nominal random variables are r.v.s that have no hierarchical information between their values. Examples: car brands, cities, continents. You might prefer one value to the other, but this preference is not part of the random variable’s representation.

  • Toyota can be represented as 1 in a dataset.

  • Subaru can be represented as 2 in a dataset.

  • But there is no inherent reason why 1<2.

  • You can count how often each value occurs, but you cannot do arithmetic on the values beyond that.

Ordinal random variables

  • There is a hierarchical relationship between the values in these r.v.s. However, the strength of these relationships is not known. So you can claim one value is larger in magnitude than the other but cannot go beyond that.

  • Ex: Ranks in military, corporate hierarchy, medal colors in Olympics.

  • A private can be represented as 1

  • A sergeant can be represented as 2

  • A lieutenant can be represented as 3.

  • We can declare 3>2>1 but we cannot claim 3=2+1
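One way to picture an ordinal variable in code is a type that supports comparison but not arithmetic. The sketch below is an assumption of ours, not part of the slides: it uses Python's standard `enum.Enum` with a hand-written `__lt__`, and the class name `Rank` is hypothetical.

```python
from enum import Enum

class Rank(Enum):
    """Ordinal codes: the order is meaningful, the distances are not."""
    PRIVATE = 1
    SERGEANT = 2
    LIEUTENANT = 3

    def __lt__(self, other):
        # Only ranks can be compared with ranks.
        if not isinstance(other, Rank):
            return NotImplemented
        return self.value < other.value

# Ordering works: private < sergeant < lieutenant.
print(Rank.PRIVATE < Rank.SERGEANT < Rank.LIEUTENANT)  # True
lowest = sorted([Rank.LIEUTENANT, Rank.PRIVATE, Rank.SERGEANT])[0]

# Arithmetic is not defined, mirroring "we cannot claim 3=2+1".
try:
    Rank.PRIVATE + Rank.SERGEANT
    addition_allowed = True
except TypeError:
    addition_allowed = False
print(addition_allowed)  # False
```

A plain `Enum` is used rather than `IntEnum` because `IntEnum` members behave like integers and would allow the addition that an ordinal scale does not support.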

Interval/Ratio random variables

  • These are r.v.s where not only do you know that there is a hierarchical relationship between the values but also the strength of these relationships.

  • There is a subtle difference between Interval and Ratio random variables; however, this class will not distinguish between them.

  • Examples: gas prices, temperature, weight, height etc…

What is statistics: Inferential

  • Make statements about what the whole of a thing (population) might be, by observing parts of that thing (sample). (Hypothesis tests, Coefficient estimation in regression).

What is statistics: Predictive

  • Use the gathered information to guess how something will turn out under similar conditions (regression).

  • Regression can be used to predict new values that do not exist in the data.