Descriptive Statistics
Descriptive Measures of Statistics
The purpose of this module is to identify certain features of a variable in a data set that describe its nature in a general way. The most important features are:
Central tendency: the average or typical value of the variable, in an appropriate sense.
Dispersion: the scatter or spread of the variable.
Skewness: Measure of the asymmetry of the frequency distribution.
Kurtosis: Measure of the sharpness of the peak or flatness of the tails of a frequency distribution.
In this module, we will learn these four features in detail.
Lecture 1: Measures of Central Tendency
Suppose you are asked to describe the key features of the above data on Airbnb prices (in USD). It is apparent that most of the Airbnb prices range from about \(80\) to \(120\) USD. If a representative number or typical value is to be put forward for the Austin Airbnb prices variable, it should be somewhere between \(80\) and \(120\) USD. This representative value is called a measure of central tendency or simply an average. There are many measures of central tendency available, the applicability of which depends on the context. We will learn five important measures among them.
(I) Arithmetic Mean (AM):
The most familiar notion of average is that of arithmetic mean (AM), denoted by \(\bar{x}\), which is simply the sum of all the observations, \(x_{1},\ldots,x_{n}\), divided by the number of observations, i.e.,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}.\]
Some important properties of AM:
(1) Sum of the deviations of the observations from the AM is zero, i.e., \(\sum_{i=1}^{n} (x_{i} - \bar{x}) =0\).
(2) If all the observations are equal, with common value \(c\), then \(\bar{x}=c\).
(3) [Base and scale change] If \(y_{i}= a+bx_{i}\) for each \(i=1, \ldots,n\), then \(\bar{y} = a+b\bar{x}\). Suppose a set of paired observations \((x_{i},y_{i})\) is available for each \(i=1,\ldots,n\). Define the variable \(z_{i}=ax_{i}+by_{i}+c\), for some constants \(a,b,c\). Then \(\bar{z}=a\bar{x}+b\bar{y}+c\).
(4) Let there be \(t\) sets of values of the variable \(x\), containing \(n_{1}, \ldots, n_{t}\) values, and having AMs \(\bar{x}_{1}, \cdots, \bar{x}_{t}\) , respectively, then the grand mean/pooled mean of \(x\) is
\[\bar{x} =\frac{\sum_{i=1}^{t} n_{i} \bar{x}_{i}}{\sum_{i=1}^{t}n_{i}}.\]
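These properties are easy to verify numerically. Below is a minimal R sketch with small hypothetical data sets (not the lecture's data) checking property (1) and the pooled-mean formula in property (4).

```r
# Hypothetical data: two groups of observations on the same variable.
x1 <- c(82, 95, 110, 120, 99)
x2 <- c(105, 88, 130)

# Property (1): deviations about the AM sum to zero (up to rounding error).
sum(c(x1, x2) - mean(c(x1, x2)))

# Property (4): the pooled mean is the size-weighted mean of the group means.
n1 <- length(x1); n2 <- length(x2)
pooled <- (n1 * mean(x1) + n2 * mean(x2)) / (n1 + n2)
all.equal(pooled, mean(c(x1, x2)))   # TRUE
```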
(II) Geometric Mean (GM):
Let the variable \(x\) have \(n\) observations \(x_{1},\ldots,x_{n}\); then the geometric mean (GM) of \(x\), denoted by \(x_{g}\), is given by
\[ x_{g} = \exp\left\{\frac{1}{n} \sum_{i=1}^{n} \log x_{i} \right\} = \left(\prod_{i=1}^{n} x_{i} \right)^{1/n}. \]
Some important properties of GM:
(1) Let all the observations be equal, and the common value is \(c\), then \(x_{g}=c\).
(2) [Scale change] If \(y_{i}= bx_{i}\) for each \(i=1, \ldots,n\), then \(y_{g} = bx_{g}\).
(3) Let there be \(t\) sets of values of the variable \(x\), containing \(n_{1}, \ldots, n_{t}\) values, and having GMs \(x_{g,1}, \cdots, x_{g,t}\) , respectively, then the grand/pooled geometric mean of \(x\) is
\[x_{g} =\left(x_{g,1}^{n_{1}}\cdots x_{g,t}^{n_{t}} \right)^{1/\sum_{i=1}^{t} n_{i} }.\]
(4) Suppose a set of paired observations \((x_{i},y_{i})\) is available for each \(i=1,\ldots,n\). Define the variable \(z_{i}=x_{i}/y_{i}\). Then \(z_{g}=x_{g}/y_{g}\).
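A minimal R sketch (again with hypothetical positive values) computing the GM through the log transform and checking the pooled-GM formula in property (3):

```r
x1 <- c(2, 4, 8)          # hypothetical positive observations, group 1
x2 <- c(3, 9, 27, 81)     # group 2

gm <- function(x) exp(mean(log(x)))   # geometric mean via logarithms

# Property (3): pooled GM from the group GMs and the group sizes.
n1 <- length(x1); n2 <- length(x2)
pooled <- (gm(x1)^n1 * gm(x2)^n2)^(1 / (n1 + n2))
all.equal(pooled, gm(c(x1, x2)))      # TRUE
```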
(III) Harmonic Mean (HM):
Let the variable \(x\) have \(n\) observations \(x_{1},\ldots,x_{n}\); then the harmonic mean (HM) of \(x\), denoted by \(x_{h}\), is given by
\[ x_{h} = \left( \frac{1}{n} \sum_{i=1}^{n} x_{i}^{-1} \right)^{-1} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_{i}}}. \]
Some important properties of HM:
(1) Let all the observations be equal, and the common value is \(c\neq 0\), then \(x_{h}=c\).
(2) [Scale change] If \(y_{i}= bx_{i}\) for each \(i=1, \ldots,n\), then \(y_{h} = bx_{h}\).
(3) Let there be \(t\) sets of values of the variable \(x\), containing \(n_{1}, \ldots, n_{t}\) values, and having HMs \(x_{h,1}, \cdots, x_{h,t}\) , respectively, then the grand/pooled harmonic mean of \(x\) is
\[x_{h} =\frac{\sum_{i=1}^{t} n_{i} }{\displaystyle\sum_{i=1}^{t}\frac{n_{i}}{ x_{h,i}}} .\]
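Similarly, a minimal R sketch (hypothetical data) computing the HM and checking the pooled-HM formula:

```r
x1 <- c(2, 4, 8)          # hypothetical positive observations, group 1
x2 <- c(5, 10)            # group 2

hm <- function(x) 1 / mean(1 / x)     # harmonic mean

# Property (3): pooled HM from the group HMs and the group sizes.
n1 <- length(x1); n2 <- length(x2)
pooled <- (n1 + n2) / (n1 / hm(x1) + n2 / hm(x2))
all.equal(pooled, hm(c(x1, x2)))      # TRUE
```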
Comparison of AM-GM-HM:
- When the variable of interest, say \(z\), is the ratio of two other variables, say \(z=x/y\), then one would expect that the average of \(z\) will be the ratio of averages of \(x\) and \(y\). As GM possesses this property (see Property (4) of GM), it is preferred over AM/HM in such situations.
Example: Consider the price relatives of two commodities, \(A\) and \(B\), in 2022, in comparison with 2019. Suppose the price relative of \(A\) is \(2\), i.e., the price of \(A\) has doubled in 3 years, and that of \(B\) is \(1/2\). If equal importance is given to \(A\) and \(B\), then the average price relative should be one. This criterion is satisfied by GM only.
- When a variable changes over time exponentially, GM provides a more reasonable estimate of the average than AM/HM.
Example: Suppose that the population of a country grows as \(ar^{t}\), and it is observed at time points \(t_{1}\) and \(t_{2}\). Suppose one wants to interpolate the population at time \((t_{1}+t_{2})/2\) by taking an appropriate average of the two observed values. One would expect the average to be \(ar^{(t_{1}+t_{2})/2}\). This criterion is also satisfied by GM only.
- Sometimes the variable of interest is of the form ‘\(x\) per unit \(y\)’, for example distance per hour. In such cases HM would be the proper average if equal units of \(x\) are considered, while AM would be the proper average if equal units of \(y\) are considered.
Example: Suppose a train covers a distance of 10 km at a speed of 100 km/hour, and another 10 km at 160 km/hour. Then the average speed of the train must be \[\frac{20}{\frac{10}{100}+\frac{10}{160}}, \] as the total of \(20\) km is covered by the train in \(\frac{10}{100}+\frac{10}{160}\) hours. This turns out to be the HM of \(100\) and \(160\) with frequencies \(10\) each.
On the other hand, if the train moves at \(100\) km/hour for two hours and at \(160\) km/hour for one hour, then the average speed is \[ \frac{2\times 100 + 1 \times 160}{3},\] which is the AM of \(100\) and \(160\) with frequencies \(2\) and \(1\), respectively (see the R sketch after this list).
- If a single observation is \(0\), then the GM becomes \(0\); the HM is then not well defined, since \(1/x_{i}\) does not exist for that observation (it may be taken as \(0\) only in a limiting sense).
- For any set of positive numbers, \(\boldsymbol{AM\geq GM \geq HM}\). (Why?)
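The arithmetic in the train example above can be reproduced directly; a minimal R sketch:

```r
# Equal distances (10 km each): the correct average is the weighted HM of the speeds.
20 / (10 / 100 + 10 / 160)     # 123.08 km/hour (approximately)

# Equal times (2 hours and 1 hour): the correct average is the weighted AM of the speeds.
(2 * 100 + 1 * 160) / 3        # 120 km/hour
```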
(IV) MEDIAN:
Median: The median of \({\bf x}\), \(\tilde{x}_{me}\), is a number such that at least half of the data points are bigger than or equal to it, and at least half of the data points are smaller than or equal to it, i.e., \[ \sum_{i} I(x_{i} \geq \tilde{x}_{me} ) \geq \frac{n}{2} \quad \text{and} \quad \sum_{i} I(x_{i} \leq \tilde{x}_{me} ) \geq \frac{n}{2},\] where \(I\) is the indicator function.
Suppose the observations are arranged in ascending or descending order of magnitude; then the median is the middlemost value in this arrangement. Thus if \(n\) is odd, the median is the \((n+1)/2\)-th observation of the ordered arrangement. If \(n\) is even, then any value lying between the \(n/2\)-th and the \((n/2)+1\)-th observations of the ordered arrangement is a median.
Some important properties of Median:
(1) Let all the observations be equal, and the common value is \(c\), then \(\tilde{x}_{me}=c\).
(2) [Base and scale change] If \(y_{i}= a+bx_{i}\) for each \(i=1, \ldots,n\), then \(\tilde{y}_{me} = a+b\tilde{x}_{me}\). In fact, if \(y=g(x)\), where \(g\) is a monotone function, then \(\tilde{y}_{me} = g(\tilde{x}_{me})\).
(3) The mean deviation about the median is least, i.e., \(\sum_{i=1}^{n} |x_{i} -a|\) is least when \(a=\tilde{x}_{me}\). (Why?)
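A minimal R sketch (hypothetical values) illustrating the ordered-arrangement description of the median and property (3):

```r
x <- c(7, 1, 5, 9, 3, 11)            # hypothetical data, n = 6 (even)
sort(x)                               # 1 3 5 7 9 11: any value in [5, 7] is a median
median(x)                             # R reports the midpoint, 6

# Property (3): the sum of absolute deviations is minimized at a median.
a_grid <- seq(0, 12, by = 0.1)
sad <- sapply(a_grid, function(a) sum(abs(x - a)))
range(a_grid[sad - min(sad) < 1e-9])  # the minimizers span [5, 7]
```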
(V) MODE:
The mode, \(\tilde{x}_{mo}\), of a discrete variable is a value having the highest frequency. If more than one value has the highest frequency, then the mode is not unique.
For a continuous variable, the modal class is the class with the highest frequency. Ideally, the mode is the value of the variable with the highest frequency density in the limiting distribution that would be obtained if the total frequency were increased indefinitely while, at the same time, the widths of the class intervals were decreased indefinitely.
From a frequency distribution, the mode can be approximated by the following formula \[ \tilde{x}_{mo} \approx x_{l} + \frac{f_{0}-f_{-1}}{2f_{0} - f_{-1}-f_{1}}\times c, \] where \(x_{l}\) is the lower class limit of the modal class, \(c\) is the width of the modal class, \(f_{0}\), \(f_{-1}\) and \(f_{1}\) are the frequencies of the modal class, the preceding class and the following class, respectively. (Why?)
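A minimal R sketch of the grouped-data formula above, using a hypothetical frequency distribution (the class limits and frequencies below are made up for illustration):

```r
# Mode of a frequency distribution from the modal class and its two neighbours.
grouped_mode <- function(x_l, f0, f_prev, f_next, width) {
  x_l + (f0 - f_prev) / (2 * f0 - f_prev - f_next) * width
}

# Hypothetical distribution: modal class [100, 110) with frequency 35,
# preceding class frequency 20, following class frequency 25.
grouped_mode(x_l = 100, f0 = 35, f_prev = 20, f_next = 25, width = 10)  # 106
```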
Some important properties of Mode:
(1) Let all the observations be equal, and the common value is \(c\), then \(\tilde{x}_{mo}=c\).
(2) [Base and scale change] If \(y_{i}= a+bx_{i}\) for each \(i=1, \ldots,n\), then \(\tilde{y}_{mo} = a+b\tilde{x}_{mo}\). In fact, if \(y=g(x)\), where \(g\) is a one-to-one function, then \(\tilde{y}_{mo} = g(\tilde{x}_{mo})\). (Why?)
Comparison Between Mean, Median and Mode:
(1) Mean is unique. Both median and mode may not be unique.
(2) Although all the observations are taken into consideration in determining the mean, median and mode, only the mean directly uses all the observations in its computation. The value of the sample mean changes even if a single observation is altered.
(3) Mean is least affected by sampling fluctuations.
(4) Under the existence of extreme values, mean is most affected. Median and mode are more robust measures of central tendency than mean.
(5) Mean can not be calculated if the terminal classes of a frequency distribution are open.
An exercise to understand sampling fluctuations and robustness:
- [Sampling fluctuations] Take \(100\) samples, each of size \(n=50\), from a normal distribution with mean \(5\) and variance \(2^2\). For each of these \(100\) samples calculate the mean and the median. Draw a histogram of these \(100\) values of means, and that of medians. Which is more dispersed? Why?
- [Effect of extreme values] Take \(100\) samples, each of size \(n=50\), from a standard Cauchy distribution. For each of these \(100\) samples, calculate the mean and the median. Draw a histogram of these \(100\) values of means, and that of medians. Which is more dispersed? Why?
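A minimal R sketch for the two exercises above; the numerical results depend on the random seed.

```r
set.seed(1)

# [Sampling fluctuations] 100 samples of size 50 from N(5, 2^2)
means_n   <- replicate(100, mean(rnorm(50, mean = 5, sd = 2)))
medians_n <- replicate(100, median(rnorm(50, mean = 5, sd = 2)))
c(sd(means_n), sd(medians_n))    # the medians fluctuate more

# [Effect of extreme values] 100 samples of size 50 from the standard Cauchy
means_c   <- replicate(100, mean(rcauchy(50)))
medians_c <- replicate(100, median(rcauchy(50)))
c(sd(means_c), sd(medians_c))    # now the means are far more dispersed

par(mfrow = c(2, 2))
hist(means_n,   main = "Normal: sample means")
hist(medians_n, main = "Normal: sample medians")
hist(means_c,   main = "Cauchy: sample means")
hist(medians_c, main = "Cauchy: sample medians")
```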
Lecture 2: Measures of Dispersion
Consider the following two data sets.
The mean, median and mode for both data sets are at 250. However, the data on the annual bills of the two companies have different spreads. Therefore, we need a measure of spread, along with the location, to describe the difference between the two companies. There are many measures of dispersion available, the applicability of which depends on the context. We will learn some important measures among them.
I) Range:
The simplest measure of dispersion is the range, which is defined as the difference between the maximum and the minimum observation. A higher value of the range indicates higher dispersion.
Properties:
If \(y_{i}=a+bx_{i}\), for \(i=1,\ldots,n\), then \(\mathrm{Range}(y)= |b|\mathrm{Range}(x)\).
\(\mathrm{Range}(x)=0\) if and only if (iff) \(x_{i}=c\) for some constant \(c\) for all \(i\).
II) Mean Deviation (MD):
Let \(m\) be a chosen central value of a variable \(x\) in a data set consisting of observations \(x_{1},\ldots,x_{n}\). Then the mean deviation of \(x\) about \(m\) is defined as
\[\mathrm{MD}_{m}(x)=\frac{1}{n} \sum_{i=1}^{n} |x_{i}-m|.\] A higher value of the MD indicates higher average distance of the data points from the central value \(m\), which in turn indicates higher dispersion.
Properties:
If \(y_{i}=a+bx_{i}\), for \(i=1,\ldots,n\), and the measure of central tendency \(m\) satisfies \(m(y)=a+bm(x)\), then \(\mathrm{MD}_{m}(y)= |b|\mathrm{MD}_{m}(x)\).
\(\mathrm{MD}_{m}(x)\), regarded as a function of \(m\), is minimized at \(m=\tilde{x}_{me}\).
When \(m=\bar{x}\), then \(\mathrm{MD}_{\bar{x}}(x)\) can be simplified as
\[\mathrm{MD}_{\bar{x}}(x) = \frac{2}{n} \sum_{i:x_{i}>\bar{x}}(x_{i}-\bar{x})= \frac{2}{n} \sum_{i:x_{i}<\bar{x}}(\bar{x}-x_{i}).\]
\(\mathrm{MD}_{m}(x)=0\) iff \(x_{i}=m\) for each \(i\).
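A minimal R sketch (hypothetical data) computing MD about the mean and the median, and checking the simplification given above:

```r
x <- c(2, 3, 5, 8, 13, 21)                # hypothetical data
xbar <- mean(x)

md_mean   <- mean(abs(x - xbar))          # MD about the mean
md_median <- mean(abs(x - median(x)))     # MD about the median; never larger than md_mean

# Simplification: MD about the mean needs only the observations above the mean.
md_mean_alt <- 2 / length(x) * sum(x[x > xbar] - xbar)
c(md_mean, md_median, md_mean_alt)
all.equal(md_mean, md_mean_alt)           # TRUE
```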
III) Variance and Standard Deviation (SD):
Let \(x_{1},\ldots,x_{n}\) be \(n\) values of a variable \(x\) in a data set, then the variance of \(x\) is represented as \[\mathrm{var}(x)=\frac{1}{n}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}.\]
Variance is the average squared departure of the observations from the mean. The positive square root of the variance is called the standard deviation (\(\mathrm{SD}(x)\)), and can be treated as an average distance of the data points from the mean.
A generalization of SD, which can be directly compared with MD, is the root-mean-square deviation (RMSD) about a chosen average \(m\), expressed as
\[ \mathrm{RMSD}_{m}(x)=\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{i}-m)^{2}}.\]
Properties:
If \(y_{i}=a+bx_{i}\), for \(i=1,\ldots,n\), then \(\mathrm{SD}(y)= |b|\mathrm{SD}(x)\).
\(\mathrm{var}(x)\) can be simplified as
\[\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n}x_{i}^{2}-\bar{x}^{2}.\]
All the observations of a variable are equal if and only if (iff) the SD is zero.
Let there be \(t\) groups of observations \({\bf x}_{(1)}, \cdots, {\bf x}_{(t)}\), where \({\bf x}_{(j)}=(x_{j,1},\ldots,x_{j,n_{j}})^{\top}\). Let the mean and variance of the \(j\)-th group be \(\bar{x}_{j}\) and \(s_{j}^{2}\), respectively. Then the combined/pooled variance of the \(t\) groups is given by \[ \mathrm{var} (x) =\frac{\sum_{j=1}^{t} n_{j} s_{j}^{2}}{\sum_{j=1}^{t} n_{j}} + \frac{\sum_{j=1}^{t} n_{j} (\bar{x}_{j} -\bar{x})^{2}}{\sum_{j=1}^{t} n_{j}},\] where \(\bar{x}\) is the pooled mean of the \(t\) groups.
\(\mathrm{RMSD}_{m}(x)\) is minimum if \(m=\bar{x}\).
Define the Gini’s mean-squared difference measure as \[\Delta=\frac{1}{n^{2}} \sum_{i=1}^{n}\sum_{j=1}^{n} (x_{i}-x_{j})^{2}.\]
\(\Delta\) is a measure of dispersion, which is independent of any particular measure of central tendency. However, it can be shown that \(\Delta=2\mathrm{var}(x)\).
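A minimal R sketch (hypothetical data) checking the pooled-variance decomposition and the relation \(\Delta=2\mathrm{var}(x)\). Note that the lecture defines the variance with divisor \(n\), whereas R's var() divides by \(n-1\), so a divisor-\(n\) version is defined explicitly.

```r
var_n <- function(x) mean((x - mean(x))^2)   # divisor-n variance

x1 <- c(4, 8, 6, 5)       # hypothetical group 1
x2 <- c(10, 12, 9)        # hypothetical group 2
x  <- c(x1, x2)
n1 <- length(x1); n2 <- length(x2)

# Pooled variance = weighted average of within-group variances
#                   + weighted average of squared deviations of group means.
within  <- (n1 * var_n(x1) + n2 * var_n(x2)) / (n1 + n2)
between <- (n1 * (mean(x1) - mean(x))^2 + n2 * (mean(x2) - mean(x))^2) / (n1 + n2)
all.equal(var_n(x), within + between)     # TRUE

# Gini's mean-squared difference equals twice the variance.
delta <- mean(outer(x, x, "-")^2)
all.equal(delta, 2 * var_n(x))            # TRUE
```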
IV) Quartile deviation (QD):
Quartiles:
Given \(n\) observations \(x_{1},\ldots,x_{n}\) of a variable \(x\), the \(j\)-th quartile \(Q_{j}({\bf x})\) is a number such that at least \(j\times n/4\) observations are smaller than or equal to it, and at least \((4-j)\times n/4\) observations are greater than or equal to it, \(j=1,\ldots,4\).
The second quartile is the median, \(Q_{2}({\bf x})=\tilde{x}_{me}\).
Let the observations be arranged in ascending order of magnitude, i.e., \(x_{1} \leq x_{2} \leq \cdots \leq x_{n}\). If \(n\) is a multiple of \(4\), then any number between the \(n/4\)-th and the \((n/4)+1\)-th observation is \(Q_{1}({\bf x})\), and any number between the \(3n/4\)-th and the \((3n/4)+1\)-th observation is \(Q_{3}({\bf x})\). If \(n\) is not a multiple of \(4\), then the \([n/4]+1\)-th observation is \(Q_{1}({\bf x})\), and the \([3n/4]+1\)-th observation is \(Q_{3}({\bf x})\).
It is understood that if the variables are more dispersed then the quartiles would be more distant from one another. In particular, if the observations above a central value are more dispersed, then the difference \(Q_{3}({\bf x})-Q_{2}({\bf x})\) would be large, and if the observations below a central value are more dispersed, then the difference \(Q_{2}({\bf x})-Q_{1}({\bf x})\) would be large. From this understanding, the average of the two differences \(Q_{3}({\bf x})-Q_{2}({\bf x})\) and \(Q_{2}({\bf x})-Q_{1}({\bf x})\) is regarded as a measure of dispersion, and is called the quartile deviation (\(\mathrm{QD}({\bf x})\)), i.e., \[\mathrm{QD}({\bf x}) = \frac{\{Q_{3}({\bf x})-Q_{2}({\bf x})\}+\{Q_{2}({\bf x})-Q_{1}({\bf x})\}}{2}=\frac{Q_{3}({\bf x})-Q_{1}({\bf x})}{2}.\]
As \(Q_{3}({\bf x})-Q_{1}({\bf x})\) is called the interquartile range, \(\mathrm{QD}({\bf x})\) is also known as semi-interquartile range.
Properties:
- If \(y_{i}=a+bx_{i}\), for \(i=1,\ldots,n\), then \(\mathrm{QD}(y)= |b|\mathrm{QD}(x)\).
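A minimal R sketch (hypothetical data). R's quantile() supports several conventions through its type argument; type = 1 follows the ordered-observation style of definition used above, so other choices may give slightly different values.

```r
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)   # hypothetical data, n = 9

q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 1)
qd  <- unname(q[3] - q[1]) / 2           # semi-interquartile range
iqr <- IQR(x, type = 1)                  # interquartile range, Q3 - Q1
c(Q1 = unname(q[1]), Q3 = unname(q[3]), QD = qd, IQR = iqr)
```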
Comparison Between Range, MD, SD and QD:
Merits and Demerits of Range:
Range is simplest to compute. It is often employed as a measure of dispersion in Statistical Quality Control (SQC), where the analysis of the data must be done immediately after the data is collected.
Compared to SD and QD, MD and Range are simple to understand.
Range is most affected by extreme values. As the variance averages squared deviations and MD averages absolute deviations from a central value, these two measures are also affected by extreme values. Like the median, QD is least affected by extreme values.
Range, MD and SD can not be computed if the data is open ended.
Calculation of the Range is based on the two extreme observations only. Calculation of QD also does not depend on all the observations: even if most of the observations change, keeping the quartiles unchanged, the QD will not change. However, in the calculation of SD and MD, all the observations are taken into account.
Like mean, SD and MD are not much affected by sampling fluctuations. QD and range are more affected by sampling fluctuations.
Unlike MD, SD has some desirable properties which make it easily amenable to algebraic treatments.
Some important relations:
SD can not be smaller than MD about mean.
The difference between mean and median can not be greater than \(\mathrm{MD}_{\tilde{x}_{me}}(x)\).
Let the variable \(x\) have \(n\) observations, and let \(R\) be the range of \(x\); then \[ \frac{R^{2}}{2n} \leq \mathrm{var}(x) \leq \frac{R^2}{4}.\]
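A minimal R sketch (hypothetical data) checking the range-variance bounds above:

```r
var_n <- function(x) mean((x - mean(x))^2)   # divisor-n variance
x <- c(1, 4, 4, 6, 9, 2, 7)                  # hypothetical data
n <- length(x)
R <- diff(range(x))                          # range = max - min

c(lower = R^2 / (2 * n), var = var_n(x), upper = R^2 / 4)
# the variance lies between the two bounds
```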
V) Relative Measures of Dispersion:
The measures of dispersion discussed so far are called absolute measures, as each of them possesses the same unit as the variable itself. Naturally, these measures can not be used to compare the dispersion of two or more variables having different units of measurement. Therefore we require some measures of dispersion which are independent of units. Such measures are referred to as relative measures of dispersion.
The most common relative measure of dispersion is the Coefficient of Variation (CV), which is defined as \[\mathrm{CV}(x)= \frac{\mathrm{SD}(x)}{\bar{x}}\times 100\%. \]
Remarks: 1. CV is also used to compare the dispersion of variables whose means are far apart.
Example: Consider two variables, where the first variable consists of repeated measurements of a table of length \(210\)cm, and the second consists of the same for a chair of length \(48\)cm. Let there be \(n\) measurements for both. The mean and SD of the measurements of the table are \(209.5\)cm and \(1.25\)cm, and those for the chair are \(47.76\)cm and \(0.35\)cm. Although the SD of the chair measurements is much smaller, the CVs of the measurements of the table and the chair are approximately \(0.6\%\) and \(0.7\%\), respectively (see the R sketch below). This indicates that the measurements of the table are relatively more stable.
- Although CV is the most popular relative measure of dispersion, there exist other measures of relative dispersion, such as
coefficient of mean deviation, defined as \(MD_{m}(x)\times 100/m\%\), and
coefficient of quartile deviation, defined as \(QD(x)\times 100/\tilde{x}_{me}\%\).
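Returning to the table-and-chair example above, a minimal R sketch reproducing the CV comparison from the quoted summary figures:

```r
cv <- function(s, m) s / m * 100    # coefficient of variation in per cent

cv(s = 1.25, m = 209.5)   # table:  ~0.60 %
cv(s = 0.35, m = 47.76)   # chair:  ~0.73 %
```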
Lecture 3: Measures of Skewness
Let \(x_{1},\ldots,x_{n}\) be a set of observations on a variable \(x\).
Suppose \(x\) is discrete. Then \(x\) is called symmetric about the point \(x_{0}\) if the frequency of the point \((x_{0}+h)\) is same as that of \((x_{0}-h)\), for any \(h>0\) (if \(x_{0}+h\) is not a plausible value of \(x\) then the corresponding frequency is considered zero).
Suppose \(x\) is continuous. Then \(x\) is called symmetric about the point \(x_{0}\) if there are the same number of observations below \((x_{0}-h)\) and above \((x_{0}+h)\) for each \(h>0\). The term skewness refers to departure from symmetry.
Let \(y_{i}=x_{i} -m\), where \(m\) is a measure of central tendency. Then \(x\) is called positively skewed or right skewed if there are larger positive values in \(\{y_{1},\ldots,y_{n}\}\) compared to negative values. Consequently, the right tail of the histogram (or barplot) is thicker and larger, compared to the left tail. Further \(x\) is called negatively skewed or left skewed if there are larger negative values in \(\{y_{1},\ldots,y_{n}\}\) compared to positive values. Consequently, the left tail of the histogram (or barplot) is thicker and larger, compared to the right tail.
There are three popular measures of skewness: (i) the Fisher-Pearson coefficient of skewness (a moment measure), (ii) Pearson’s coefficient of skewness, and (iii) Bowley’s coefficient of skewness (a quantile based measure).
I) Fisher-Pearson coefficient of skewness:
Observe that the standardized observations satisfy the following property: \[ g_{1}(x)=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_{i}- \bar{x}}{SD}\right)^{3} =\left\{\begin{array}{lll} 0 \qquad & \mbox{if $x$ is symmetric,}\\ +ve \qquad & \mbox{if $x$ is positively skewed,} \\ -ve \qquad & \mbox{if $x$ is negatively skewed.} \end{array} \right. \] Therefore the quantity \(g_{1}\) is considered as a measure of skewness. Note that it is unit free.
Properties:
(1) \(g_{1}\) is scale-free, i.e., a relative measure of skewness.
(2) \(g_{1}\) can take any value from \(-\infty\) to \(\infty\). However, if the distribution is left skewed, then \(g_{1}<0\), and if right skewed then \(g_{1}>0\).
(3) [Base and scale change] Let \(y_{i}=a+bx_{i}\), then \(g_{1}(y)=\mathrm{sign}(b) g_{1}(x)\), where \(\mathrm{sign}(b)\) is \(+1\) if \(b>0\), \(0\) if \(b=0\), and \(-1\) if \(b<0\).
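A minimal R sketch computing \(g_{1}\) with a divisor-\(n\) SD, applied to simulated right- and left-skewed samples (the values depend on the random seed):

```r
g1 <- function(x) {
  s <- sqrt(mean((x - mean(x))^2))          # divisor-n SD, as in the lecture
  mean(((x - mean(x)) / s)^3)
}

set.seed(2)
g1(rexp(500))      # positive: exponential data are right skewed
g1(-rexp(500))     # negative: the reflected data are left skewed
```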
II) Pearson’s coefficient of skewness:
For a symmetric and unimodal variable, the mean, median and mode are all equal to the point of symmetry.
For a left skewed variable \(x\),
\[
\bar{x} < \tilde{x}_{me} < \tilde{x}_{mo}.
\]
For a right-skewed distribution
\[\tilde{x}_{mo} < \tilde{x}_{me} < \bar{x}.\]
Keeping this relation in mind the following measure was proposed:
\[ Sk(x) =\frac{\bar{x}-\tilde{x}_{mo}}{SD}. \]
\(Sk\) is called Pearson’s measure of skewness.
\(Sk\) satisfies all three properties which are satisfied by \(g_{1}\). However, both \(Sk\) and \(g_{1}\) can be unbounded. Thus, it is difficult to judge how large a value of these measures indicates a substantial departure from symmetry. Therefore, a bounded measure of skewness is desirable.
Based on the empirical relation between \((\bar{x}-\tilde{x}_{me})\) and \((\bar{x}-\tilde{x}_{mo})\), which is \[(\bar{x}-\tilde{x}_{mo}) \approx 3(\bar{x}-\tilde{x}_{me}), \tag{*}\] and the fact that \(|\bar{x}-\tilde{x}_{me}| \leq SD\), another measure of skewness, \(Sk_{2}\), is proposed: \[Sk_{2}(x) =\frac{3(\bar{x}-\tilde{x}_{me})}{SD}. \] \(Sk_{2}\) satisfies the properties (1)-(3), except that the value of \(Sk_{2}\) lies between \(-3\) and \(3\). Due to the empirical relation \((*)\), which is valid for moderately skewed distributions, the approximate range of \(Sk\) is also \(-3\) to \(3\).
III) Bowley’s coefficient of skewness:
For a variable \(x\) whose frequency distribution is symmetric, \(\{Q_{3}(x)-Q_{2}(x)\}=\{Q_{2}(x) - Q_{1}(x)\}\). For a left skewed distribution \(\{Q_{3}(x)-Q_{2}(x)\} < \{Q_{2}(x)-Q_{1}(x)\}\), and for a right-skewed distribution \(\{Q_{3}(x)-Q_{2}(x)\} > \{Q_{2}(x)-Q_{1}(x)\}\).
These relations induce a quantile based measure of skewness, viz., \[ Sk_{3} = \frac{ \{Q_{3}(x)-Q_{2}(x)\} - \{Q_{2}(x)-Q_{1}(x)\}}{ Q_{3}(x)-Q_{1}(x)}= \frac{ Q_{3}(x)-2Q_{2}(x)+Q_{1}(x)}{ Q_{3}(x)-Q_{1}(x)}.\] \(Sk_{3}\) also satisfies the properties (1)-(3), except that the value of \(Sk_{3}\) lies between \(-1\) and \(1\).
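A minimal R sketch computing Pearson's \(Sk_{2}\) and Bowley's \(Sk_{3}\) for a simulated right-skewed sample (values depend on the random seed):

```r
sk2 <- function(x) {                  # Pearson's second measure
  s <- sqrt(mean((x - mean(x))^2))    # divisor-n SD
  3 * (mean(x) - median(x)) / s
}
sk3 <- function(x) {                  # Bowley's quartile-based measure
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  (q[3] - 2 * q[2] + q[1]) / (q[3] - q[1])
}

set.seed(3)
x <- rexp(500)                        # a right-skewed sample
c(Sk2 = sk2(x), Sk3 = sk3(x))         # both positive; Sk3 is bounded in [-1, 1]
```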
Lecture 4: Moments and Quantiles
So far we have come across two main types of measures: (i) mean type measures (examples include \(\bar{x}\), \(\mathrm{MD}_{m}(x)\), variance, etc.), and (ii) median type measures (examples include the median, quartile deviation, etc.). Generally, the mean type measures are averages of \(n\) quantities, one based on each observation. A broad class of such quantities is called moments. The median type measures are based on a few observations appearing at some special (ordered) positions. A broad class of such quantities is called quantiles.
Moments:
Let \(x_{1}, \ldots, x_{n}\) be \(n\) observations on a variable \(x\). The sample \(r\)-th moment of \(x\) about the origin \(A\) is defined as \[ m_{r}^{\prime}(A) = \frac{1}{n} \sum_{i=1}^{n} (x_{i}-A)^{r}, \quad r\geq 0. \tag{*} \]
- When \(A=0\), the moments are called raw moments, and are denoted by \(m_{r}^{\prime}\), \(r\geq 0\).
- When \(A=\bar{x}\) then the moments are called central moments, and are denoted by \(m_{r}\), \(r\geq 0\).
- Instead of \((x_{i}-A)^{r}\) in (*), if we consider \(|x_{i}-A|^{r}\), then the moments are called absolute moments.
Example:
The \(0\)-th order raw (or central, or absolute) moment is \(1\).
AM is the first order raw moment.
The first order central moment is zero.
Variance is the second order central moment.
\(\mathrm{MD}_{m}(x)\) is the first order absolute moment about \(m\).
Properties of moments:
If \(y_{i}=a+bx_{i}\) for each \(i=1,\ldots,n\), then \(m_{r}(y)=b^{r} m_{r}(x)\).
Let \(r\) be a positive integer. Then the \(r\)-th order central moment, \(m_{r}\), can be expressed in terms of moment about an arbitrary origin \(A\).
Let \(r\) be a positive integer, and \(A\) be any number. Then the \(r\)-th order moment about an arbitrary origin \(A\), can be expressed in terms of the \(r\)-th order central moment, \(m_{r}\).
Let \(m_{r,a}(A)\) and \(m_{s,a}(A)\) be the \(r\)-th and \(s\)-th order absolute moment of \(x\) about \(A\), and \(r<s\). Then \(m_{r,a}(A) \leq 1+m_{s,a}(A)\).
Interpretation of higher order moments:
The higher order central moments can be regarded as a weighted average of \(n\) quantities, one corresponding to each of the observations. The observations lying at larger distance from the center receive higher weight. We can interpret higher order moments with the help of the following example.
Take a random sample of size \(n=50\) from \(N(0,1)\) distribution. Let the observations be \(x_{1},\ldots,x_{n}\).
Let \(\{y_{1},\ldots,y_{n+4}\}\) be defined as follows: \(y_{i}=x_{i}\) for \(i=1,\ldots,n\), \(y_{n+1}=y_{n+2}=4\) and \(y_{n+3}=y_{n+4}=-4\). Clearly, the distribution of \(y\) has heavier tail than that of \(x\). Let us verify how the moments of \(x\) and \(y\) differ.
[1] "The mean and sd of x are -0.34 0.87"
[1] "The mean and sd of y are -0.32 1.38"
[1] "The 3rd, 4th central and 3rd absolute moments of standardized x are -0.2 2.25 and 1.42"
[1] "The 3rd, 4th central and 3rd absolute moments of standardized y are 0.36 5.72 and 2.16"
Quantiles:
Let \(x_{1}, \ldots, x_{n}\) be \(n\) observations on a variable \(x\). The sample \(p\)-th quantile of \(x\) is a number such that at least \(p\times 100\%\) of the observations are smaller than or equal to it, and at least \((1-p)\times 100\%\) of the observations are bigger than or equal to it.
Example:
Median is the \(0.5\)-th quantile.
Interquartile range is the difference of \(0.75\)-th quantile and \(0.25\)-th quantile.
Quantile-Quantile plot:
Quantile-quantile plot allows us to compare two distributions using the sample quantiles.
This kind of comparison is much more detailed than a simple comparison of means or medians.
There is a cost associated with this extra detail. We need more observations than for simple comparisons.
[1] "Mean, SD and g1 of 100 random samples from N(0,1) are 0.14, 0.96 and 0.8 respectively"
[1] "Mean, SD and g1 of 80 random samples from t(8) are 0.14, 0.97 and -0.24 respectively"
Lecture 5: Normal Approximation for Data
Along with the summary statistics, it is often important to provide a distributional assumption to the variable under consideration.
The normal distribution plays an important role in this aspect.
Owing to a number of very important properties of the normal distribution, and appealing to the central limit theorem (CLT), many variables are modeled using a normal distribution.
Important properties of normal distribution
Normal distribution is completely characterized by two parameters, the location parameter, \(\mu\), and the scale parameter, \(\sigma^{2}\), we write \(N(\mu,\sigma^2)\). If \(X \sim N(\mu,\sigma^2)\), then \(E(X)=\mu\), and \(\text{var}(X)=\sigma^2\).
Normal distribution is symmetric about \(\mu\). Consequently, if \(X \sim N(\mu,\sigma^2)\), then \(P(X\leq \mu -h)=P(X\geq \mu +h)\) for each \(h\geq 0\).
Let \(X\) be distributed as the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Then \[ P(\mu -\sigma \leq X\leq \mu+\sigma)\approx 0.683, \quad P(\mu -2\sigma \leq X\leq \mu+2\sigma)\approx 0.954, \quad \text{and} \quad P(\mu -3\sigma \leq X\leq \mu+3\sigma)\approx 0.997. \]
[CLT] Let \(X_{i}, ~i=1,\ldots,n\), be \(n\) independent and identically distributed (i.i.d.) samples from any distribution with mean \(\mu\) and variance \(\sigma^{2}\); then \[ \frac{\sqrt{n} (\bar{X} -\mu)}{\sigma} \xrightarrow{d} N(0,1), \qquad \text{as} \quad n\rightarrow \infty. \]
If \(X\sim N(\mu,\sigma^{2})\), and \(a,b\) are two constants (\(b\neq 0\)), then \(Y=a+bX \sim N(a+b\mu, b^{2}\sigma^{2})\).
Implication of the above properties:
When the sample size \(n\) is sufficiently large, then the sample mean \(\bar{x}\) is a good estimator of population expectation \(\mu\) (when it exists?!), and sample variance \(s^{2}\) is a good estimator of population variance \(\sigma^{2}\) (when it exists?!).
In view of the above property of \((\bar{x},s^{2})\), and property (1) of normal distribution, it is expected that if the variable \(x\) is modeled using a normal distribution, then it is enough to summarize \(x\) using \((\bar{x},s^{2})\).
If \(x\) is modeled using a normal distribution, then it is expected that about \(68\%\) of the data points lie in the interval \((\bar{x}-s,\bar{x}+s)\), and almost \(100\%\) (about \(99.7\%\)) of the observations lie in the interval \((\bar{x}-3s,\bar{x}+3s)\) (see the sketch after this list).
Many variables observed in practice are basically sum of several quantities which may be considered as i.i.d. For example, the number of votes for an electoral candidate is the sum of individual votes, your marks in end-semester exam of MTH201A is the sum of the marks obtained in each question, etc. In such cases, appealing to the CLT, it is natural to approximate the variate by a normal distribution.
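A minimal R sketch checking the empirical \(68\)-\(95\)-\(99.7\) rule mentioned above on a simulated sample (proportions depend on the random seed):

```r
set.seed(6)
x <- rnorm(1000, mean = 10, sd = 3)
xbar <- mean(x); s <- sd(x)

# proportion of observations within 1, 2 and 3 sample SDs of the sample mean
sapply(1:3, function(k) mean(abs(x - xbar) <= k * s))
# roughly 0.68, 0.95 and 1.00
```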
Deviation from normal distribution:
Although normal distribution has many favorable properties, not all variables can be approximated by a normal distribution.
A histogram gives some idea, but a more direct measure of the validity of the normal approximation is needed.
[1] "Mean, SD and Sk2 of the 250 random samples are 0.02, 1.11 and -0.04 respectively"
Let \(n\) samples be collected on a variable \(x\). The standardized observations \(y_{i}=(x_{i}-\bar{x})/\mathrm{SD}(x)\), \(i=1,\ldots,n\), have mean \(0\) and variance \(1\). To check if the \(N(0,1)\) approximation is valid for the transformed data, one may first look at the skewness property. Left or right skewness of \(y\) is indicative of inappropriateness of the normal approximation.
When \(y\) is (nearly) symmetric, one can test the validity of normal approximation by the ‘tail-thickness property’, more commonly known as Kurtosis.
Another way to test normality is using quantile-quantile plot. As the values of almost all quantiles of the standard normal distribution are tabulated, one may check if the sample quantile nearly matches the population quantile.
Kurtosis
Kurtosis (peakedness) is a measure of the shape of a distribution. According to Balanda and MacGillivray (1988), “…it is best to define kurtosis vaguely as the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails.”
It is typically measured using the fourth moment of the standardized distribution. If \(\{x_{1},\ldots,x_{n}\}\) are \(n\) observations on a variable \(x\), then Pearson’s measure of kurtosis is defined as \[ b_{2} =\frac{m_{4}}{m_{2}^{2}}= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_{i} - \bar{x} }{SD} \right)^{4}. \]
If a random variable \(Y\) is distributed as \(N(0,1)\) distribution, then \(E\left(Y^{4}\right)=3\). Consequently, if the normality assumption is valid for a variable \(x\), then \(b_{2}(x)\) will be close to the number \(3\), at least when the sample size is sufficiently large.
A distribution having a kurtosis measure sufficiently larger than \(3\) is called leptokurtic. A leptokurtic distribution has a higher concentration of values near the center and thicker tails compared to the normal distribution.
A distribution having a kurtosis measure sufficiently smaller than \(3\) is called platykurtic. A platykurtic distribution has a lower concentration of values near the center and thinner tails compared to the normal distribution.
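A minimal R sketch computing Pearson's \(b_{2}\) (with a divisor-\(n\) SD) for a normal sample, a heavier-tailed sample and a lighter-tailed sample; values depend on the random seed.

```r
b2 <- function(x) {
  s <- sqrt(mean((x - mean(x))^2))       # divisor-n SD
  mean(((x - mean(x)) / s)^4)
}

set.seed(7)
b2(rnorm(2000))          # close to 3
b2(rt(2000, df = 5))     # noticeably larger than 3: leptokurtic (heavy tails)
b2(runif(2000))          # smaller than 3: platykurtic (about 1.8 for the uniform)
```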
In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high risk for an investment because it indicates high probabilities of extremely large and extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.
In the above example, the \(b_{2}\) measure for the sample of size \(250\) is \(4.03\).