Descriptive Statistics

Author

Minerva Mukhopadhyay

The term “descriptive statistics” refers to the analysis, summary, and presentation of findings related to a data set derived from a sample or the entire population.

Lecture 1: (I) Data collection

One meaning of Statistics is “data”. The subject of Statistics teaches tools for extracting information from data. It is therefore important to learn about the sources of data and the methods of data collection.

  • Where then does data come from? How is it gathered?
  • How do we ensure it is accurate? Is the data reliable?
  • Is it representative of the population from which it was drawn?

Generally, data collection methods are divided into two main categories: primary data collection methods (the collected data are called primary data) and secondary data collection methods (the collected data are called secondary data).

1) Primary Data:

Primary data are data collected by researchers directly from the original sources through interviews, surveys, experiments, etc. Because they are gathered at the source, where the data originate, they are usually regarded as the best kind of data in research.

2) Secondary Data:

Secondary data are data that have previously been collected by someone else and have been made available for use by others. They were typically primary data originally, but when they are reused by a third party, they become secondary.

Data collection methods (source: link)

3) Primary Data Collection Methods:

  • Survey methods:
    • Questionnaire Method: Collecting information through the answers to a set of questions (called questionnaire) from respondents representing a specific population.

    • Interviews: Collecting information via direct conversations with the representatives of a specific population.

    • Observational Methods: Gathering first-hand data through the observation of events, behaviors, interactions, processes, etc. directly to obtain an understanding of the concepts.

  • Designed experiments: A designed experiment is a controlled study whose purpose is to control as many factors as possible to isolate the effects of a particular factor. Designed experiments must be carefully set up to achieve their purposes.

For details of these data collection methods, read this reference material.

(II) Population and Samples

  • In statistics, we are interested in obtaining information about a total collection of elements, which we refer to as the population.

  • The population is often too large to examine each of its members. For example, suppose we are interested in the average order price and average number of orders in Hall 4 canteen in 2022. An exhaustive method of obtaining the data would be to record the data on each order placed in Hall 4 canteen on each day of 2022.

  • In such cases, we try to learn about the population by choosing and then examining a subgroup of its elements. This subgroup of a population is called a sample. Sampling is often more economical, more accessible to researchers, and more practical.

  • The sample should be collected in such a way that it is representative of the underlying population. In the above example, if one collects information on sale of Hall 4 Canteen only on institute holidays, the sample data so collected will not be representative of the entire population (why?).

Properties of representative samples:

  • If a sample is representative of a population, then statistics calculated from sample data will be close to corresponding values from the population.

  • Samples contain less information than full populations, so estimates from samples about population quantities always involve some uncertainty.

  • Random sampling, in which every potential sample of a given size has the same chance of being selected, is one of the best ways to obtain a representative sample.

  • Thus, it is important to understand both how to conduct a random sample in practice and the properties of random samples.

  • However, it is often impossible or impractical to obtain a random sample.

Exercise:

(1) Consider the population of students taking the course MTH211A. Set a statistical problem of interest of your choice and collect data on the entire population of interest accordingly. For example, if you are interested in the association of JEE score and school records, collect information on results of 10th standard, 12th standard along with the JEE score, etc.

(2) Collect a secondary data set from either of the following archives: UCI, Kaggle. Describe the underlying problem of interest.

Lecture 2: Methods of Sampling

(I) Simple Random Sampling:

Simple Random Sampling is a type of probability sampling that ensures that each sample of size \(n\) has an equal chance of being selected.

  • In a simple random sample, all individuals are equally likely (equal probability) to be included in the sample.

  • The converse, however, is untrue: Consider sampling either all five men or all five women, with equal probability, from a population of ten people. Each person has a 50% chance of being included, but any sample with a mix of men and women has zero probability of being chosen.

  • Estimates from simple random samples are unbiased; there is no systematic discrepancy between sample estimates and corresponding population values.

  • For random samples, larger samples are typically more accurate; the chance difference between sample estimates and population values is smaller (on average) for larger samples (but not necessarily for specific samples).
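The men/women counterexample above can be checked by direct enumeration. The following is a minimal sketch in Python (the labels `M1`, ..., `W5` and all function names are illustrative assumptions, not part of the notes): it computes each person's inclusion probability under simple random sampling of size 5 and under the "all men or all women" design, and then compares the probability of one mixed sample under the two designs.

```python
from fractions import Fraction
from itertools import combinations

# Population of ten people: five men (M1..M5) and five women (W1..W5).
men = [f"M{i}" for i in range(1, 6)]
women = [f"W{i}" for i in range(1, 6)]
people = men + women

# Design A: simple random sampling of size 5 (every 5-subset equally likely).
all_subsets = list(combinations(people, 5))
design_a = {subset: Fraction(1, len(all_subsets)) for subset in all_subsets}

# Design B: take all five men or all five women, each with probability 1/2.
design_b = {tuple(men): Fraction(1, 2), tuple(women): Fraction(1, 2)}


def inclusion_probability(person, design):
    """Probability that `person` belongs to the selected sample."""
    return sum(prob for sample, prob in design.items() if person in sample)


incl_a = {p: inclusion_probability(p, design_a) for p in people}
incl_b = {p: inclusion_probability(p, design_b) for p in people}

mixed = ("M1", "M2", "M3", "W1", "W2")  # one sample containing men and women
print("inclusion probability of M1:", incl_a["M1"], incl_b["M1"])
print("probability of the mixed sample:", design_a.get(mixed, 0), design_b.get(mixed, 0))
```

Both designs give every person the same inclusion probability, but only Design A gives every sample of size 5 the same chance, which is what the definition of simple random sampling requires.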

(II) Simple Random Sampling With and Without Replacement:

Simple random sampling with replacement (SRSWR):

SRSWR is a method of selection of \(n\) units out of the \(N\) units one by one such that at each stage of selection, each of the \(N\) units has an equal probability of being selected, i.e., \(1/N\).

Note: In SRSWR one particular unit can be sampled more than once.

Simple random sampling without replacement (SRSWOR):

SRSWOR is a method of selection of \(n\) units out of the \(N\) units one by one such that at any stage of selection, each of the remaining units has the same chance of being selected. Therefore, at the first draw each of the \(N\) units of the population has probability \(1/N\) of being selected. At the second draw, each of the remaining \(N-1\) units of the population has probability \(1/(N-1)\) of being selected, and so on.

Note: In SRSWOR one particular unit, once sampled, is not considered for further sampling. Thus, none of the population units can be sampled more than once.
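The two draw-by-draw schemes can be written out almost verbatim. The sketch below is an illustration in Python, assuming a small hypothetical population of unit labels 1 to 10; the function names `srswr` and `srswor` are chosen here for convenience.

```python
import random


def srswr(population, n, rng):
    """Draw n units one by one; every draw is uniform over all N units."""
    return [population[rng.randrange(len(population))] for _ in range(n)]


def srswor(population, n, rng):
    """Draw n units one by one; every draw is uniform over the units not yet selected."""
    remaining = list(population)
    sample = []
    for _ in range(n):
        idx = rng.randrange(len(remaining))   # probability 1/(number of remaining units)
        sample.append(remaining.pop(idx))     # a selected unit is removed from further draws
    return sample


rng = random.Random(1)          # fixed seed so that the run is repeatable
units = list(range(1, 11))      # hypothetical population of N = 10 unit labels
with_repl = srswr(units, 6, rng)
without_repl = srswor(units, 6, rng)
print("SRSWR :", with_repl)     # repeated labels are possible
print("SRSWOR:", without_repl)  # all labels are distinct
```

In practice the standard library functions `random.choices(population, k=n)` and `random.sample(population, n)` perform sampling with and without replacement respectively; the explicit loops are shown only to mirror the definitions.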

(III) How accurate is the simple random sample estimate when we are interested in the population mean of a variable of interest?

Consider the example of the average number of orders in Hall 4 canteen in 2022. Let \(N\) be the number of days on which the canteen was open. For the \(i\)-th day, let \(x_{i}\) be the actual number of orders placed in the canteen, \(i=1,\ldots,N\). So, we are interested in the quantity

\[ \mu=\frac{x_{1}+\ldots+ x_{N}}{N}.\]

Let the variance of the number of orders be denoted by \(\sigma^{2}\), that is,

\[ \sigma^{2}= \frac{1}{N} \sum_{i=1}^{N} (x_{i}-\mu)^{2}. \]

Let us assume for simplicity that the population values \(\{x_{1},\cdots,x_{N}\}\) are all different.

Case 1: SRSWR.

Let a sample of size \(n\) be collected with replacement from this population. Let \(Y_{1},\ldots, Y_{n}\) be the sample obtained. What do we know about the distribution of \(Y_{j}, ~j=1, \ldots, n\)?

  • (Identically distributed) \(Y_{1}\) can take any value from the set \(\{x_{1}, \ldots, x_{N}\}\) with probability \(1/N\). In fact, for any \(j\), \(Y_{j}\) can take any value from the set \(\{x_{1}, \ldots, x_{N}\}\) with probability \(1/N\).

  • (Independent) The value of \(Y_{1}\) does not affect the distribution of \(Y_{2}\). In fact, the value of \(Y_{i}\) does not affect the distribution of \(Y_{j}\), for each \(i\neq j\). Therefore \(Y_{j} \stackrel{i.i.d.}{\sim} \mathrm{Discrete~Uniform}(x_{1},\ldots,x_{N})\).

  • Suppose we want to estimate the population average \(\mu\) by the sample average \[\bar{Y}=\frac{Y_{1}+\ldots+Y_{n}}{n}.\]

  • Then the expectation of \(\bar{Y}\) is \(E(\bar{Y})=\mu\) (Why?).

  • Further, \(\mathrm{var}(\bar{Y})= \frac{\sigma^{2}}{n}\) (Why?).
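One possible way to answer the two “Why?” questions above uses only the i.i.d. structure of \(Y_{1},\ldots,Y_{n}\) and the definitions of \(\mu\) and \(\sigma^{2}\) given earlier. For a single draw,

\[ E(Y_{j})=\sum_{i=1}^{N} x_{i}\cdot\frac{1}{N}=\mu, \qquad \mathrm{var}(Y_{j})=\sum_{i=1}^{N}(x_{i}-\mu)^{2}\cdot\frac{1}{N}=\sigma^{2}, \qquad j=1,\ldots,n. \]

Linearity of expectation gives the mean of \(\bar{Y}\), and independence of the draws (so that all covariance terms vanish) gives its variance:

\[ E(\bar{Y})=\frac{1}{n}\sum_{j=1}^{n}E(Y_{j})=\mu, \qquad \mathrm{var}(\bar{Y})=\frac{1}{n^{2}}\sum_{j=1}^{n}\mathrm{var}(Y_{j})=\frac{\sigma^{2}}{n}. \]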

Case 2: SRSWOR.

Let a sample of size \(n\) be collected without replacement from this population. Let \(Y_{1},\ldots, Y_{n}\) be the sample obtained. What do we know about the distribution of \(Y_{j}, ~j=1, \ldots, n\)?

  • The probability of observing a particular sample \(\{y_{1},\cdots, y_{n}\}\) is:
\[\begin{aligned} P(Y_{1}=y_{1},\cdots, Y_{n}=y_{n}) &= P(Y_{1}=y_{1}) P(Y_{2}=y_{2} \mid Y_{1}=y_{1}) \cdots P(Y_{n}=y_{n} \mid Y_{1}=y_{1},\cdots,Y_{n-1}=y_{n-1})\\ &= \frac{1}{N(N-1)\cdots (N-n+1)}=\frac{(N-n)!}{N!}=\frac{1}{(N)_{n}}. \end{aligned}\]
  • The number of possible distinct (ordered) samples is \((N)_{n}\).

  • If the order in which the sample units are obtained is ignored, then probability of collecting the sample \(\{y_{1},\cdots,y_{n}\}\) is \(\frac{1}{\binom{N}{n}}\). (Why?)

  • Then the expectation of \(\bar{Y}\) is \(E(\bar{Y})=\mu\). (Why?)

  • It can be shown that \(\mathrm{var}(\bar{Y})=\frac{\sigma^{2}}{n}\left(1-\frac{n-1}{N-1} \right)\). (Why?)
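The mean and variance formulas for both cases can be verified exactly on a small population by listing every possible sample. The sketch below is an illustration only: the population `x` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the comparison with the formulas involves no rounding.

```python
from fractions import Fraction
from itertools import combinations, product

x = [3, 7, 8, 12, 20]   # hypothetical population with distinct values, N = 5
n = 3                   # sample size
N = len(x)

mu = Fraction(sum(x), N)
sigma2 = sum((v - mu) ** 2 for v in x) / N   # population variance with divisor N


def exact_mean_var(sample_means):
    """Mean and variance of a list of equally likely values of the sample mean."""
    m = len(sample_means)
    centre = sum(sample_means) / m
    spread = sum((s - centre) ** 2 for s in sample_means) / m
    return centre, spread


# SRSWOR: every unordered subset of size n is equally likely.
wor_means = [Fraction(sum(s), n) for s in combinations(x, n)]
wor_mean, wor_var = exact_mean_var(wor_means)

# SRSWR: every ordered n-tuple (repeats allowed) is equally likely.
wr_means = [Fraction(sum(s), n) for s in product(x, repeat=n)]
wr_mean, wr_var = exact_mean_var(wr_means)

print("mu =", mu, " sigma^2 =", sigma2)
print("SRSWR :", wr_mean, wr_var, " formula:", sigma2 / n)
print("SRSWOR:", wor_mean, wor_var, " formula:", sigma2 / n * (1 - Fraction(n - 1, N - 1)))
```

Enumerating unordered subsets is enough for SRSWOR because each subset corresponds to the same number, \(n!\), of equally likely ordered samples.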

Note:

(1) For both SRSWR and SRSWOR the sample mean is expected to be close to the population mean.

(2) The variance of \(\bar{Y}\) is smaller in SRSWOR than in SRSWR (for \(n>1\)), implying that the sample estimate is more stable in SRSWOR.

(3) When \(N=n\), \(\bar{Y}=\mu\) with probability one in SRSWOR. However, this is not the case, as expected, in SRSWR.

A Simulation Exercise:

1) Suppose the number of orders on a weekday follows a Poisson distribution with mean \(75\), and that on a weekend day follows a Poisson distribution with mean \(120\). Setting seed \(=1\) and assuming the Hall 4 canteen was open on all 365 days in 2022, generate a (simulated) data set of daily orders in Hall 4 canteen. Treat this data as the population data, and calculate \(\mu\).

2) Take a sample with and without replacement of size \(n\) from this data set and find the mean \(\bar{Y}\).

3) Repeat the experiment by taking \(n=25, 50, 100\). Based on your results, comment on the closeness of \(\bar{Y}\) and \(\mu\) under SRSWR and SRSWOR schemes, as \(n\) grows.

4) How will you compare the performance of two schemes using simulation?
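One way to set up this simulation is sketched below in Python, using only the standard library. Several choices are assumptions made for illustration: Saturdays and Sundays are treated as the weekend, the Poisson draws use Knuth's multiplication method (the standard library has no Poisson generator), and the two schemes are compared through the average squared error of \(\bar{Y}\) over `reps` repeated samples. Python's generator with seed 1 will not reproduce the numbers obtained from R with the same seed.

```python
import math
import random
from datetime import date, timedelta
from statistics import mean


def poisson(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method (adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(1)

# Step 1: simulated population of daily orders for the 365 days of 2022.
days = [date(2022, 1, 1) + timedelta(days=i) for i in range(365)]
is_weekend = [d.weekday() >= 5 for d in days]          # Saturday = 5, Sunday = 6
population = [poisson(120 if w else 75, rng) for w in is_weekend]
mu = mean(population)


def sample_mean(values, n, replace, rng):
    """Step 2: mean of one sample of size n, with or without replacement."""
    draws = rng.choices(values, k=n) if replace else rng.sample(values, n)
    return mean(draws)


def mean_squared_error(values, n, replace, rng, reps=2000):
    """Step 4: average squared distance between the sample mean and mu over many samples."""
    target = mean(values)
    errors = [(sample_mean(values, n, replace, rng) - target) ** 2 for _ in range(reps)]
    return mean(errors)


# Step 3: repeat for growing n; each entry is (MSE under SRSWR, MSE under SRSWOR).
results = {
    n: (mean_squared_error(population, n, True, rng),
        mean_squared_error(population, n, False, rng))
    for n in (25, 50, 100)
}
print("population mean:", mu)
for n, (mse_wr, mse_wor) in results.items():
    print(n, round(mse_wr, 3), round(mse_wor, 3))
```

A single sample mean can land closer to \(\mu\) under either scheme by chance, which is why the comparison in step 4 averages over many repeated samples rather than relying on one draw.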

Theoretical Exercise:

  1. How will you modify the above results on sample mean when the population data points \(\{x_{1},\cdots, x_{N}\}\) are not necessarily distinct? Assume the distinct values of \(x_{i}\)’s are \(\xi_{1},\cdots,\xi_{m}\), and let \(N_{i}\) be the frequency of \(\xi_{i}\) in the population, \(i=1,\cdots,m\). Find the mean and variance of \(\bar{Y}\) under SRSWR and SRSWOR.

(IV) Stratified Random Sampling:

In many situations, it is not possible to conduct simple random sampling. For example, if we are interested in data on the monthly household income in rural areas of Uttar Pradesh (UP), the population consists of all the households residing in rural areas of UP. Conducting an SRS for this population is very inconvenient administratively. It is much more convenient to split the entire region of UP into districts and conduct SRS within each district separately. This method is called Stratified Sampling.

  • In Stratified Random Sampling (StrRS), one divides the population into \(k\) sub-populations called Strata, which are relatively homogeneous within themselves.

  • The sample \({\bf y}\) consists of \(k\) different sub-samples, \({\bf y}=\begin{bmatrix} {\bf y}_{1} \\ \vdots \\ {\bf y}_{k} \end{bmatrix}\), where \({\bf y}_{i}\) is an SRS from the \(i\)-th stratum, drawn usually without replacement.

  • The samples from different strata are usually assumed to be independent.

  • The final estimator of the population parameter \(\mu\) is the weighted average of sample estimators obtained from different strata.

Why is StrRS preferred to SRS?

(1) In many situations StrRS is administratively more convenient.

(2) StrRS is more representative than SRS as representation of all segments of the population is ensured here.

(3) Along with providing an estimate of the population parameter, StrRS also provides separate estimates for the individual strata.

How accurate is the stratified sample estimate when we are interested in the population mean of a variable of interest?

  • Suppose the population is divided into \(k\) strata.

  • The size, population mean and population variance of the \(j\)-th stratum are \(N_{j}\), \(\mu_{j}\) and \(\sigma_{j}^{2}\), respectively.

  • Then the population mean is

\[\mu = \frac{1}{N} \sum_{j=1}^{k} N_{j} \mu_{j}. \qquad \mbox{[Why?]}\] We are interested in estimating \(\mu\).

  • Suppose we take a sample of size \(n_{j}\) from the \(j\)-th stratum, and the sample mean of \(j\)-th stratum is \(\bar{Y}_{j}\), \(j=1,\ldots,k\).
  • Then it can be shown that the best linear unbiased estimator (BLUE) of \(\mu\) is \(\bar{\bar{Y}}=\frac{1}{N} \sum_{j=1}^{k} N_{j} \bar{Y}_{j}\). (How?)

  • Further, the sample size \(n_{j}\) of the \(j\)-th stratum should be chosen carefully. If the variability of the strata is comparable, then it is optimal to choose \(n_{j}\propto N_{j}\). If the population variances of the strata are widely different, then the optimal choice of \(n_{j}\) is \(n_{j} \propto N_{j} S_{j}\), where \(S_{j}^{2} = N_{j} \sigma_{j}^{2}/(N_{j}-1)\). (Why?)

  • If the between strata variability is large, and within strata variance is small, then StrRS is more efficient than SRS.
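The estimator \(\bar{\bar{Y}}\) and the two allocation rules above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the two strata are small hypothetical lists, the helper names `allocate` and `stratified_mean` are chosen here, and the allocated sizes are simply rounded, so they need not add up exactly to the requested total in general.

```python
import random
from statistics import mean, stdev


def allocate(strata, n, neyman=False):
    """Split a total sample size n across strata.

    Proportional allocation uses n_j proportional to N_j. With neyman=True it uses
    n_j proportional to N_j * S_j, where S_j is the stratum standard deviation with
    divisor N_j - 1. Sizes are rounded and kept between 1 and N_j.
    """
    weights = [len(v) * (stdev(v) if neyman else 1.0) for v in strata]
    total = sum(weights)
    return [max(1, min(len(v), round(n * w / total))) for v, w in zip(strata, weights)]


def stratified_mean(strata, sizes, rng):
    """Weighted average (1/N) * sum_j N_j * Ybar_j, with an SRSWOR inside each stratum."""
    N = sum(len(v) for v in strata)
    return sum(len(v) * mean(rng.sample(v, n_j)) for v, n_j in zip(strata, sizes)) / N


# Two hypothetical strata: N_1 = 30 values with small spread, N_2 = 10 with larger spread.
strata = [list(range(60, 90)), list(range(100, 140, 4))]
rng = random.Random(1)

prop_sizes = allocate(strata, 8)                  # n_j proportional to N_j
neyman_sizes = allocate(strata, 8, neyman=True)   # n_j proportional to N_j * S_j

print("proportional:", prop_sizes, stratified_mean(strata, prop_sizes, rng))
print("Neyman      :", neyman_sizes, stratified_mean(strata, neyman_sizes, rng))
print("population mean:", mean(strata[0] + strata[1]))
```

With the second allocation, the more variable stratum receives a larger share of the sample than its size alone would give it.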

A Simulation Exercise (continued):

5) In the above Hall 4 canteen example, consider two strata: weekdays (stratum:1) and weekends (stratum:2). Considering \(n_{i}\propto N_{i}\), find an estimate of \(\mu\).

6) Compare your estimate with the previous estimates as \(n\) increases.

7) Also, considering \(n_{i} \propto N_{i} \sigma_{i}\), repeat the experiment, and compare the results. Here \(\sigma^{2}_{i}\) is the population variance of the \(i\)-th stratum.

Reference: Fundamentals of Statistics Vol 2 by Goon, Gupta and Dasgupta

Lecture 3: What are the different types of variables?

  • The variables of interest are broadly classified into two categories, (A) Qualitative Variables or Categorical variables, and (B) Quantitative Variables or measurement variables.

  • Qualitative variables: The objects being studied are grouped into categories based on some qualitative trait. The resulting data are merely labels or categories. Examples: hair color, gender, opinion of students about politics, etc.

    • The qualitative variables are further categorized into two sub-categories, (Aa) Nominal variables, and (Ab) Ordinal variables.

      • Nominal variables: A type of categorical variables in which objects fall into unordered categories. Examples: hair color, smoking status, race, etc.

      • Ordinal variables: A type of categorical variable in which order is important. Examples: grade (\(A^{*}, A, B^{+},\cdots\) ), degree of illness (none, mild, moderate, severe,\(\cdots\) ).

      • Binary variable: A type of categorical variable in which there are only two categories. Binary variable can either be nominal or ordinal. Example: smoking status (smoker, non-smoker), attendance (present, absent).

  • Quantitative variable: The objects being studied are “measured” based on some quantitative trait. The resulting data are set of numbers. Examples: cholesterol level, height, age, GATE score, etc.

    • Quantitative variables are further classified into (Ba) Discrete variables or (Bb) Continuous variables.

      • Discrete variables: Only certain values are possible (there are gaps between the possible values). Examples: number of students present in a class, shoe size of residents of a community, etc.

        • Note: Discrete data do not necessarily imply count data. For example, the shoe sizes \(4, 4.5, 5, \ldots\) are possible. However, there does not exist any shoe size between \(4\) and \(4.5\).
      • Continuous variables: Theoretically, any value within an interval is possible with a fine enough measuring device. Examples: height, age, time, etc.

Why is it important to learn the data types? The type(s) of data collected in a study determine the type of statistical analysis used.

Some Examples:

(1) Suppose data is collected on smoking status of students in IITK. The data typically looks like:

Serial no. Student ID Smoking Status
1 20221001 Non-smoker
2 20221002 Non-smoker
3 20221003 Smoker
\(\vdots\) \(\vdots\) \(\vdots\)
  • How will you summarize the data?

  • How will you plot the data?

  • What are the questions of interest?

(2) Suppose data is collected on the grades of students in MSO201A. The data typically looks like:

Serial no. Student ID Grade
1 20221001 \(B^{+}\)
2 20221002 C
3 20221003 \(C^{+}\)
\(\vdots\) \(\vdots\) \(\vdots\)
  • How will you summarize the data?

  • How will you plot the data?

  • What are the questions of interest?

(3) Suppose data is collected on shoe size of students in IITK. The data typically looks like:

Serial no. Student ID Shoe size
1 20221001 \(6.5\)
2 20221002 \(6\)
3 20221003 \(5\)
\(\vdots\) \(\vdots\) \(\vdots\)
  • How will you summarize the data?

  • How will you plot the data?

  • What are the questions of interest?

(4) Suppose data is collected on the average time (per week) spent by students on MSO201A. The data typically looks like:

Serial no. Student ID Time (in hours)
1 20221001 \(7.39\)
2 20221002 \(12.05\)
3 20221003 \(10.99\)
\(\vdots\) \(\vdots\) \(\vdots\)
  • How will you summarize the data?

  • How will you plot the data?

  • What are the questions of interest?

  • A typical real data set, however, is a combination of all types. See the data sets in the UCI machine learning repository. A typical example: the Cars93 data set available in the MASS R-package.

  Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1        Acura Integra   Small      12.9  15.9      18.8       25          31
2        Acura  Legend Midsize      29.2  33.9      38.7       18          25
3         Audi      90 Compact      25.9  29.1      32.3       20          26
4         Audi     100 Midsize      30.8  37.7      44.6       19          26
5          BMW    535i Midsize      23.7  30.0      36.2       22          30
6        Buick Century Midsize      14.2  15.7      17.3       22          31
             AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1               None      Front         4        1.8        140 6300
2 Driver & Passenger      Front         6        3.2        200 5500
3        Driver only      Front         6        2.8        172 5500
4 Driver & Passenger      Front         6        2.8        172 5500
5        Driver only       Rear         4        3.5        208 5700
6        Driver only      Front         4        2.2        110 5200
  Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1         2890             Yes               13.2          5    177       102
2         2335             Yes               18.0          5    195       115
3         2280             Yes               16.9          5    180       102
4         2535             Yes               21.1          6    193       106
5         2545             Yes               21.1          4    186       109
6         2565              No               16.4          6    189       105
  Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1    68          37           26.5           11   2705 non-USA Acura Integra
2    71          38           30.0           15   3560 non-USA  Acura Legend
3    67          37           28.0           14   3375 non-USA       Audi 90
4    70          37           31.0           17   3405 non-USA      Audi 100
5    69          39           27.0           13   3640 non-USA      BMW 535i
6    69          41           28.0           16   2880     USA Buick Century

Lecture 4: Data Representation and Visualization

Based on the variable type the data representation and visualization techniques would vary.

Case 1: Qualitative data

Suppose the variable of interest is nominal or ordinal. Example: Grades of students.

One of the popular ways of representing such data is to use a frequency table and a barplot.

Grade A* A B+ B C+ C D+ D F
\(\#\) students 3 12 22 26 19 14 9 2 0
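As a small illustration, the frequency table above can be turned into relative frequencies and a rough text barplot. The sketch below is in Python and uses only the counts listed in the table; keeping the grades in a list preserves their ordinal order, and the variable names are chosen here for convenience.

```python
grades = ["A*", "A", "B+", "B", "C+", "C", "D+", "D", "F"]   # ordinal order, best to worst
counts = [3, 12, 22, 26, 19, 14, 9, 2, 0]                     # number of students per grade

total = sum(counts)
rel_freq = [c / total for c in counts]
modal_grade = grades[counts.index(max(counts))]   # grade with the highest frequency

# One line per grade: label, frequency, relative frequency, and a bar of '#' symbols.
for g, c, r in zip(grades, counts, rel_freq):
    print(f"{g:<3} {c:>3} {r:6.3f} {'#' * c}")
print("total:", total, " modal grade:", modal_grade)
```

For a nominal variable the same code applies, except that the order of the categories carries no meaning and can be chosen freely, for example by decreasing frequency.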

Case 2: Quantitative data

Suppose we are interested in a quantitative variable. There are a number of ways to represent it.

(A) Discrete:

In case of discrete variables, a frequency table along with a bar graph can be used to represent the data. If the number of distinct values is large, then a line graph provides a better representation.

Example:

The following example is from the Cars93 data set available on R-package MASS, which contains data about cars on sale in the USA in 1993.

(B) Continuous: When the variable of interest is continuous, or discrete with many possible distinct values, there are a number of ways in which the data can be summarized and represented. A non-exhaustive list is given below:

  • Visualizing the distribution: To visualize the entire distribution of the data on a continuous variable (or discrete variable with many possible values), the following tools are useful.

    • Stem and leaf plot: A stem-and-leaf plot of a quantitative variable is a textual graph that classifies data items according to their most significant numeric digits.
[1] "Stem and Leaf plot of Engine Size (litres) of Cars93 data set"

  The decimal point is at the |

  1 | 02333
  1 | 55555556666688888889
  2 | 000012222222222333333444
  2 | 5555888
  3 | 000000000002233444
  3 | 5588888888
  4 | 3
  4 | 56669
  5 | 0
  5 | 77
  • Summarizing using frequency distribution (FD) and cumulative frequency distribution (CFD):

Data on continuous variables are often summarized using frequency tables. To construct a frequency table, we divide the range of the observations into small classes. The number of observations in each class is called the frequency of that class. A Frequency Table or Frequency Distribution is a table showing the classes next to their frequencies. When grouping the data in classes, we make sure that every observation falls into exactly one of the classes.

[1] "Frequency table of Engine Size (litres) of Cars93 data set"
     Class Frequency Relative_frequency Cumulative_Less_type Cumulative_Greater_type
1  [1,1.5)         5         0.05376344                    5                      93
2  [1.5,2)        20         0.21505376                   25                      88
3  [2,2.5)        24         0.25806452                   49                      68
4  [2.5,3)         7         0.07526882                   56                      44
5  [3,3.5)        18         0.19354839                   74                      37
6  [3.5,4)        10         0.10752688                   84                      19
7  [4,4.5)         1         0.01075269                   85                       9
8  [4.5,5)         5         0.05376344                   90                       8
9  [5,5.5)         1         0.01075269                   91                       3
10 [5.5,6)         2         0.02150538                   93                       2
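The construction of such a table can be sketched in a few lines. The Python sketch below is an illustration under stated assumptions: the 93 engine sizes are read back from the stem-and-leaf display shown earlier (one decimal place), they are stored in tenths of a litre as integers so that class boundaries are handled without rounding error, and the function name `frequency_table` is chosen here.

```python
# Rows of the stem-and-leaf display: (stem, leaves); each leaf is the first decimal digit.
stem_leaf = [
    (1, "02333"),
    (1, "55555556666688888889"),
    (2, "000012222222222333333444"),
    (2, "5555888"),
    (3, "000000000002233444"),
    (3, "5588888888"),
    (4, "3"),
    (4, "56669"),
    (5, "0"),
    (5, "77"),
]
# Engine sizes in tenths of a litre, e.g. 1.3 litres is stored as 13.
tenths = [10 * stem + int(leaf) for stem, leaves in stem_leaf for leaf in leaves]


def frequency_table(values, start, width, k):
    """Frequencies for the k classes [start + i*width, start + (i+1)*width), i = 0..k-1."""
    freq = [0] * k
    for v in values:
        i = (v - start) // width
        if not 0 <= i < k:
            raise ValueError(f"value {v} falls outside all classes")
        freq[i] += 1
    n = len(values)
    rel = [f / n for f in freq]
    cum_less, running = [], 0
    for f in freq:
        running += f
        cum_less.append(running)            # observations below the upper class boundary
    # observations at or above the lower class boundary
    cum_greater = [n - c + f for c, f in zip(cum_less, freq)]
    return freq, rel, cum_less, cum_greater


# Classes [1, 1.5), [1.5, 2), ..., [5.5, 6) expressed in tenths of a litre.
freq, rel, cum_less, cum_greater = frequency_table(tenths, start=10, width=5, k=10)
for i, row in enumerate(zip(freq, rel, cum_less, cum_greater)):
    lower = (10 + 5 * i) / 10
    print(f"[{lower},{lower + 0.5})", *row)
```

Half-open classes guarantee that every observation falls into exactly one class, including values such as 1.5 that sit on a class boundary.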
  • Relevant plots:
  • Histogram: The graph of relative frequency distribution is called a histogram.

  • Ogive: The line graph of cumulative frequency distributions is called an ogive.

  • Summarizing using summary statistics as measures of central tendency and dispersion: Sometimes a list of summary statistics is used to represent a continuous variable.
    • Measure of central tendency: One may summarize data using a measure of central tendency, a value which is at the center of the data cloud in an appropriate sense. Two popular measures of central tendency are:

      • Mean: Let \({\bf x} =\{x_{1},\ldots,x_{n}\}\) be the \(n\) observations on a continuous variable, say \(X\). Then the arithmetic mean of \({\bf x}\) is \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_{i}\). The arithmetic mean (commonly known as the mean) provides the central value in the sense that the total deviation from \(\bar{x}\) of all the observations bigger than \(\bar{x}\) is the same as that of the observations less than \(\bar{x}\), i.e., \[\sum_{i: x_{i}>\bar{x}}(x_{i}-\bar{x})=\sum_{i: x_{i}<\bar{x}}(\bar{x}-x_{i}).\]

      • Median: The median of \({\bf x}\), \(\tilde{x}_{me}\), is a number such that half of the data points are bigger than or equal to it, and half of the data points are smaller than or equal to it, i.e., \[ \sum_{i} I(x_{i} \geq \tilde{x}_{me} ) = \sum_{i} I(x_{i} \leq \tilde{x}_{me} ),\] where \(I\) is the indicator function.

  • Measure of dispersion: Along with the central tendency, it is also important to provide a measure of scatter (or dispersion) of the data. The combination of a measure of central tendency and a measure of dispersion provides an idea of how scattered the data cloud is, and around which point it is scattered. A small value of a measure of dispersion indicates that the data points are close to the central value. A large value, on the contrary, indicates that the data points are more scattered. Two popular measures of dispersion are:
    • Standard deviation: Let \({\bf x} =\{x_{1},\ldots,x_{n}\}\) be the \(n\) observations on a continuous variable, say \(X\). Then the variance of \({\bf x}\) is \(\mathrm{var}({\bf x})=\frac{1}{n}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}\). Variance provides the average of the squared departures of the observations from the mean. The square root of the variance, the standard deviation (sd), can be interpreted as a typical (root-mean-square) distance of the data points from the mean.
    • Interquartile range: Given a data set \({\bf x}\), the first quartile \(Q_{1}({\bf x})\) is a number such that three-fourths of the data points are bigger than or equal to it, and one-fourth of the data points are smaller than or equal to it. Further, the third quartile \(Q_{3}({\bf x})\) is a number such that one-fourth of the data points are bigger than or equal to it, and three-fourths of the data points are smaller than or equal to it. The interval between \(Q_{1}({\bf x})\) and \(Q_{3}({\bf x})\) is the range around the median in which the middle \(50\%\) of the data points lie. Its length, \(Q_{3}({\bf x})-Q_{1}({\bf x})\), is called the interquartile range (IQR). A large IQR indicates high dispersion of the data points, while a small IQR indicates that the data points are concentrated near the median.
  • Relevant plots:
    • Boxplot: The graph of the summary statistics, depicting the minimum value, maximum value, \(Q_{1}({\bf x})\), \(Q_{3}({\bf x})\) and \(\tilde{x}_{me}\) . A typical boxplot is shown in Figure 1.

      Figure 1

The boxplots produced by statistical packages are rarely exactly as described above. An attempt is made to alert the user to sample values which are suspected to be outliers, that is, values unusually far removed from the bulk of the data. Thus, the lowest point of the lower whisker is limited by \(L_{1}=Q_1({\bf x}) - 1.5 \times \{Q_3({\bf x})-Q_1({\bf x})\}\), and the highest point of the upper whisker by \(L_{3}=Q_3({\bf x}) + 1.5 \times \{Q_3({\bf x})-Q_1({\bf x})\}\). In most software (for example, the default boxplot in R), each whisker ends at the most extreme observation that still lies within \([L_{1}, L_{3}]\), and observations outside this interval are plotted as individual points. This modification is usually called a “box and whisker plot”.

Figure 2

Source of Figure 1 and 2: https://web.pdx.edu/~stipakb/download/PA551/boxplot.html
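The summary statistics behind a boxplot can be computed directly. The sketch below is in Python with a small hypothetical data set; two points are assumptions worth flagging. First, quartile conventions differ between textbooks and packages, and the `inclusive` method of `statistics.quantiles` is only one of them. Second, the whisker ends are taken here to be the most extreme observations inside the fences \(L_{1}\) and \(L_{3}\), as many packages do.

```python
import math
from statistics import mean, median, pstdev, quantiles

x = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]   # hypothetical observations

# Mean and its balancing property: deviations above and below the mean cancel.
xbar = mean(x)
above = sum(v - xbar for v in x if v > xbar)
below = sum(xbar - v for v in x if v < xbar)

sd = pstdev(x)   # standard deviation with divisor n, matching the formula in the text

# Quartiles, IQR, and the 1.5 * IQR fences used for the whiskers.
q1, med, q3 = quantiles(x, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in x if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in x if v < lower_fence or v > upper_fence]

print("mean:", xbar, " sd:", round(sd, 3), " median:", med)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high, " outliers:", outliers)
```

In this example the single large value pulls the mean above the median, while the median and the IQR are barely affected by it.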

Case 3: Bivariate or Trivariate Observations

Usually the data set contains records on several variables for a set of individuals or elements. Rather than the distribution of each variable separately, the relation between two or more variables is usually of interest. We will now learn some of the popular methods of representing paired observations.

Paired Observations on Qualitative Variables:

To represent paired nominal/ordinal observations, one may use a contingency table along with a stacked barplot.

Example:

         
          USA non-USA
  Compact   7       9
  Large    11       0
  Midsize  10      12
  Small     7      14
  Sporty    8       6
  Van       5       4
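A contingency table is simply a count of how often each pair of categories occurs. The Python sketch below illustrates this with the six Cars93 rows printed earlier (their `Type` and `Origin` values), not the full data set, so its counts are much smaller than those in the table above; the variable names are chosen here for convenience.

```python
from collections import Counter

# (Type, Origin) for the first six cars shown in the Cars93 listing.
records = [
    ("Small", "non-USA"),     # Acura Integra
    ("Midsize", "non-USA"),   # Acura Legend
    ("Compact", "non-USA"),   # Audi 90
    ("Midsize", "non-USA"),   # Audi 100
    ("Midsize", "non-USA"),   # BMW 535i
    ("Midsize", "USA"),       # Buick Century
]

cell_counts = Counter(records)                       # frequency of each (Type, Origin) pair
row_labels = sorted({car_type for car_type, _ in records})
col_labels = sorted({origin for _, origin in records})

# Nested dictionary: table[row][column] is the joint frequency (0 if the pair never occurs).
table = {r: {c: cell_counts[(r, c)] for c in col_labels} for r in row_labels}
row_totals = {r: sum(table[r].values()) for r in row_labels}
col_totals = {c: sum(table[r][c] for r in row_labels) for c in col_labels}

print(f"{'':<8}" + "".join(f"{c:>9}" for c in col_labels))
for r in row_labels:
    print(f"{r:<8}" + "".join(f"{table[r][c]:>9}" for c in col_labels))
print("row totals:", row_totals)
print("column totals:", col_totals)
```

The row and column totals are the marginal frequency distributions of the two variables, and each row of the table gives the heights of the segments of one bar in a stacked barplot.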

Paired Observations on a Qualitative and a Quantitative Variable:

When a qualitative and a quantitative variable are observed together on a set of individuals/elements, a joint frequency distribution along with paired boxplots are the most popular tools of data representation.

[1] "Frequency table of Engine Size (litres) of set for US and non-US cars"
     Class Frequency_USA Frequeny_non_USA
1  [1,1.5)             1                4
2  [1.5,2)             7               13
3  [2,2.5)            11               13
4  [2.5,3)             2                5
5  [3,3.5)            10                8
6  [3.5,4)             9                1
7  [4,4.5)             1                0
8  [4.5,5)             4                1
9  [5,5.5)             1                0
10 [5.5,6)             2                0

Paired Observations on Quantitative Variables:

When both the variables are quantitative, the most common way of representing the data is through scatter plots.

Scatter plot: A scatter plot is a two-dimensional graph, each dimension representing a variable. Let \(var_{1}\) be plotted on the \(x\)-axis and \(var_2\) on the \(y\)-axis. Then each individual/element gives a point in the \(xy\)-plane whose abscissa is its value of \(var_1\) and whose ordinate is its value of \(var_{2}\).

Extending to Three Dimensions:

When there are data on two quantitative variables and one qualitative variable, an appropriately colored scatter plot can be used to represent the data.

To represent three quantitative variables 3-dimensional scatter plots can be employed.

Case 4: Multivariate observations

Multivariate data, however, cannot be directly represented in a 2 or 3-dimensional graph. Some of the methods to visualize multivariate observations are given below:

Scatterplot matrix:

Scatterplot matrix is an extension of scatter plots for multidimensional data where a collection of scatter plots is organized in a matrix simultaneously to provide correlation information among the attributes.

Parallel coordinates:

Parallel coordinates is a well-known technique where attributes are represented by parallel vertical axes linearly scaled within their data range. Each data item is represented by a polygonal line that intersects each axis at respective attribute data value. Parallel coordinates can be used to study the correlations among attributes by spotting the locations of the intersection points.

Chernoff Faces:

In Chernoff face visualization, two attributes are mapped to the 2D position of a face and the remaining attributes are mapped to properties of the face, for instance, the shape of the nose, mouth, eyes, and that of the face itself.

effect of variables:
 modified item       Var           
 "height of face   " "MPG.city"    
 "width of face    " "Horsepower"  
 "structure of face" "Passengers"  
 "height of mouth  " "Weight"      
 "width of mouth   " "Width"       
 "smiling          " "Luggage.room"
 "height of eyes   " "MPG.city"    
 "width of eyes    " "Horsepower"  
 "height of hair   " "Passengers"  
 "width of hair   "  "Weight"      
 "style of hair   "  "Width"       
 "height of nose  "  "Luggage.room"
 "width of nose   "  "MPG.city"    
 "width of ear    "  "Horsepower"  
 "height of ear   "  "Passengers"  

Star plot:

In star plot the dimensions are represented as equal angular axes radiating from the center of a circle, with an outer line connecting the data value points on each axis. Each data item is presented by a star.

Note: (1) There are many other ways to visualize multivariate data. For a comparative study, see, for example, the paper ‘A Survey on Multivariate Data Visualization’ by Winnie Wing-Yi Chan.