Datum (Data): A set of recorded observation(s) on object(s) (customers, animals, energy, purchase decisions, demographics, etc.).
Statistics can be defined as the study of data and the processes that create that data.
Taxonomy is the science of how an organism/system is organized/classified.
Similar to other sciences, there will be more than one way to organize statistics. Here we present one such taxonomy.
Measures of Central Tendency: Mean, Median, Mode
Mean: Average. If every observation’s value is given equal weight, where would the central point be?
Median: If every observation’s location is given equal weight where would the central point be?
Mode: What value repeats most in data?
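As a minimal sketch of these three measures, the snippet below uses Python's standard `statistics` module. The list of observations is hypothetical, made up only for illustration.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
observations = [2, 3, 3, 5, 12]

mean_value = statistics.mean(observations)      # sum of values / number of values
median_value = statistics.median(observations)  # middle value after sorting
mode_value = statistics.mode(observations)      # most frequently repeated value

print(mean_value, median_value, mode_value)  # 5 3 3
```

Note how the single large value 12 pulls the mean above the median, while the median and mode stay at 3.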
Measures of Dispersion: Variance, Standard Deviation, Interquartile Range. These describe how the data deviate (vary).
Population is a predefined set of elements.
Sample is a selected group of elements from that population.
Data can be obtained by observing the population or sample.
Say we observe 3 values (context is withheld on purpose for now): 5, 9, and 10. If I ask you what the average is, you might instinctively add them up and divide by 3: \(\frac{5+9+10}{3}=8\).
But what is the idea here? How can we present this in a more general form?
First, let us define a new concept: a random variable.
A random variable is a quantity of interest observed from a process whose outcome is not known with certainty.
We usually represent them with capital Latin letters: X, Y, Z, M, etc.
The capital Latin letter is a shorthand notation for the random variable that the researcher/scientist defines. In formulations we use the lowercase of the same letter to indicate an arbitrary value that the random variable takes.
The actual values are represented with Arabic numerals.
\[X=x\]
X is the definition of what the value x represents.
If a formulation indexes the values observed in X, an indexed value is written \(x_{i}\). For instance, suppose an arbitrary X has the following values in a list: \(X=\{5,9,10\}\).
The symbol \(x_{1}\) (that is, \(x_{i}\) with \(i=1\)) refers to the value 5, \(x_{2}=9\), and of course \(x_{3}=10\).
\[\sum_{i=1}^{N} x_{i}\]
This says that we should sum \((\Sigma)\) the values of X from the first indexed value up to the value indexed by N.
If I declare that N is the number of observations and apply the formulation on this slide to the list of observed numbers \(X=\{5,9,10\}\), what I am asking you to do is compute \(x_{1}+x_{2}+x_{3}=5+9+10=24\).
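The summation over the list of 5, 9, and 10 can be sketched in Python as a loop over the index i. The variable names are illustrative. One assumption to keep in mind: Python lists start at index 0, so the notation's \(x_{i}\) corresponds to `X[i - 1]` in the code.

```python
# The observed values of X used in the slides.
X = [5, 9, 10]
N = len(X)  # number of observations

# Sum x_i for i = 1, ..., N. Python lists are 0-indexed,
# so x_i in the notation is X[i - 1] here.
total = 0
for i in range(1, N + 1):
    total += X[i - 1]

average = total / N  # the mean: the sum divided by the number of observations
print(total, average)  # 24 8.0
```

Dividing the sum by N gives the same 8 obtained earlier by adding the values and dividing by 3.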
\[E\left[(X-\mu)^{2}\right]\]
\[\sigma^{2}_{X}=\frac{\sum_{i=1}^{N}(x_{i}-\mu)^{2}}{N},\quad s^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{X})^{2}}{n-1}\]
\[X=\{5,9,10\}\]
Standard deviation is simply the square root of variance. Both variance and standard deviation are useful in inference and prediction formulations.
\(\sigma\) is population standard deviation and \(s\) is sample standard deviation.
Both standard deviation and variance are always greater than or equal to 0; they are never negative.
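A minimal sketch of both variance formulas, applied to the values 5, 9, and 10. For illustration only, the same list is treated once as a full population (divide by N) and once as a sample (divide by n - 1); that double use is an assumption of this sketch, not something the slides state.

```python
import math

X = [5, 9, 10]
N = len(X)
mu = sum(X) / N  # mean of the list: 8.0

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mu) ** 2 for x in X]  # [9.0, 1.0, 4.0]
sum_sq = sum(squared_deviations)                 # 14.0

population_variance = sum_sq / N    # list treated as the whole population
sample_variance = sum_sq / (N - 1)  # list treated as a sample

# Standard deviation is the square root of variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, sample_variance, population_sd, sample_sd)
```

The sample variance (14 / 2 = 7) is larger than the population variance (14 / 3, about 4.67) because of the smaller divisor.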
What would it imply for the standard deviation or variance to be 0 for a random variable X?
Descriptive statistics is about summarizing data. The alternative would be to look at the raw data, which is impractical in the 21st century.
On the other hand, summarizing nontrivial data leads to loss of information.
This is why you need more than one summary statistic and should always be ready to ask questions.
A person argues that his donors give in small amounts and cites 50 dollars as the average donation amount.
What other statistic should you ask for to check this claim?
We will discuss the calculation for this statistic in the next session. However we still need to illustrate the formulations.
A data point is an outlier if it seems to have been observed from a process different from the one that created the rest.
IQR: Interquartile range = 75th Percentile - 25th Percentile
Upper Bound = 75th Percentile + 1.5*IQR
Lower Bound = 25th Percentile - 1.5*IQR
Data above the upper bound or below the lower bound are referred to as outliers. The multiplier 1.5 is a rule of thumb; some use 3.
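The bounds can be sketched in Python as follows. The data are hypothetical. The sketch uses `statistics.quantiles` with `method="inclusive"`, which is one of several percentile conventions, so other tools may give slightly different quartiles on the same data.

```python
import statistics

# Hypothetical observations; 60 is deliberately far from the rest.
data = [10, 12, 13, 14, 15, 16, 17, 18, 60]

# With n=4, quantiles() returns the three quartile cut points:
# 25th percentile, median, 75th percentile.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# Anything above the upper bound or below the lower bound is flagged.
outliers = [x for x in data if x > upper_bound or x < lower_bound]
print(q1, q3, iqr, lower_bound, upper_bound, outliers)
# 13.0 17.0 4.0 7.0 23.0 [60]
```

Replacing 1.5 with 3 widens the bounds to (1.0, 29.0); the value 60 would still be flagged in this example.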
So far, every variable we have discussed has been continuous. A continuous variable is one where, between any two arbitrarily chosen numbers on its scale, a middle point exists.
Discrete variables are variables that do not have this property.
Is the gas price at a station continuous or discrete? It is continuous.
Imagine a price point of 3.12 and another price point of 3.13.
The middle point theoretically exists at 3.125. You might not pay that 0.005 directly, but you can calculate with that number and get a meaningful result. For instance, 100 gallons at 3.125 costs 312.50 dollars. In short, the gas station might not show you all the decimal places (they probably round), since in real life we have to act on those numbers and our minds do not really care about fractions of a cent.
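The midpoint arithmetic can be checked with a short sketch. It uses Python's `decimal` module (a choice made here to avoid binary floating-point rounding on prices); the variable names are illustrative.

```python
from decimal import Decimal

low_price = Decimal("3.12")
high_price = Decimal("3.13")

# A middle point exists between the two price points.
midpoint = (low_price + high_price) / 2

# The midpoint is usable in calculations even if it never appears on the pump.
cost_of_100_gallons = 100 * midpoint

print(midpoint, cost_of_100_gallons)
```

The midpoint is 3.125 dollars per gallon, so 100 gallons come to 312.50 dollars.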
The color/brand of a car is discrete. You can represent colors or brands with numbers, but a middle point does not necessarily exist (we are ignoring the color continuum and just thinking about the car colors that are already set in the manufacturing plant).
Discrete vs. continuous is one way to classify variables.
Another way is by level of measurement: nominal, ordinal, and interval/ratio.
Nominal random variables are r.v.s that carry no hierarchical information between their values. Examples: car brands, cities, continents. You might prefer one value to another, but this preference is not part of the random variable’s representation.
Toyota can be represented as 1 in a dataset.
Subaru can be represented as 2 in a dataset.
But there is no inherent reason why 1<2.
You can count how often each value occurs, but you cannot do arithmetic operations beyond that.
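A minimal sketch of what counting looks like for a nominal variable. The brand codes follow the Toyota = 1, Subaru = 2 coding above; the list of observed codes is hypothetical.

```python
from collections import Counter

# The numbers are labels only; their order carries no meaning.
BRAND_CODES = {1: "Toyota", 2: "Subaru"}

# Hypothetical dataset of coded observations.
observed_codes = [1, 2, 1, 1, 2]

# Counting occurrences (and hence finding the mode) is meaningful.
counts = Counter(BRAND_CODES[code] for code in observed_codes)
most_common_brand, frequency = counts.most_common(1)[0]

print(counts["Toyota"], counts["Subaru"], most_common_brand)  # 3 2 Toyota
```

Averaging the codes would give 1.4, a number that corresponds to no brand, which is why arithmetic beyond counting is not meaningful here.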
Ordinal random variables are r.v.s whose values have a hierarchical relationship. However, the strength of these relationships is not known. So you can claim one value is larger in magnitude than another but cannot go beyond that.
Ex: Ranks in military, corporate hierarchy, medal colors in Olympics.
A private can be represented as 1.
A sergeant can be represented as 2.
A lieutenant can be represented as 3.
We can declare 3>2>1, but we cannot claim 3=2+1.
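A minimal sketch of the operations that are meaningful for such ranked values: comparing and sorting. The rank codes follow the 1, 2, 3 coding above; the roster and the helper `outranks` are hypothetical, introduced only for illustration.

```python
# Rank codes: only the ordering of the numbers is meaningful.
RANK_ORDER = {"private": 1, "sergeant": 2, "lieutenant": 3}

def outranks(a, b):
    """Return True if rank a is higher than rank b (hypothetical helper)."""
    return RANK_ORDER[a] > RANK_ORDER[b]

# Hypothetical roster of observed ranks.
roster = ["sergeant", "private", "lieutenant", "private"]

# Sorting and taking the maximum rely only on order, so they are valid.
sorted_roster = sorted(roster, key=RANK_ORDER.get)
highest_rank = max(roster, key=RANK_ORDER.get)

print(outranks("lieutenant", "sergeant"), sorted_roster, highest_rank)
# True ['private', 'private', 'sergeant', 'lieutenant'] lieutenant
```

Adding the codes (for example, private + sergeant = lieutenant) is deliberately left out, since the distances between ranks are unknown.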
Interval and ratio random variables are r.v.s where you know not only that there is a hierarchical relationship between the values but also the strength of these relationships.
There is a subtle difference between Interval and Ratio random variables; however, this class will not distinguish between them.
Examples: gas prices, temperature, weight, height, etc.
Use the gathered information to guess how something will turn out under similar conditions (regression).
Regression can be used to predict new values that do not exist in the data.
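As a rough sketch of predicting a value that is not in the data, the snippet below fits a straight line by ordinary least squares with a single predictor. The slides do not specify a regression method, so the choice of simple linear regression is an assumption, and the data and the `predict` helper are hypothetical.

```python
# Hypothetical observations: x is the condition, y is the outcome.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for the line y = intercept + slope * x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict(new_x):
    """Predict y for an x value (hypothetical helper for this sketch)."""
    return intercept + slope * new_x

print(slope, intercept, predict(5))  # 2.0 1.0 11.0
```

The value x = 5 does not appear in the data, yet the fitted line yields a prediction of 11 for it. How trustworthy that prediction is depends on the new conditions being similar to the observed ones.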