Probability and Statistics I

1 Topic One: Nature and Presentation of Statistical Data

1.1 Objectives

By the end of the session, you should be able to:

Understand the meaning, nature, importance and limitations of statistics
Explain the types of variables
Classify measurements and data into various types

1.2 Introduction

1.2.1 Meaning and Definition of Statistics

Statistics means different things to different people and for different purposes. It has also been defined in different ways by different writers, because the scope of statistics has changed with the passage of time.

Statistics is used in two senses:

In the plural sense, it means a collection of facts or estimates: the figures themselves (numerical data).
As a singular noun, Statistics is the scientific method of collecting, organizing, summarizing, presenting and analyzing data, as well as interpreting them. (Interpretation means drawing valid conclusions and making reasonable decisions on the basis of such analysis.)

Collection of data: Once an investigator has collected data through a survey, it is necessary to edit these data in order to correct any apparent inconsistencies, ambiguities, recording errors or any other mistakes that could enter into the actual computations. Even before the data have been collected and edited, it is assumed that they can be suitably classified according to some common characteristic of the population sampled.
Description of data: The organized data can now be presented in the form of tables or diagrams or graphs. This presentation in an orderly manner facilitates the understanding as well as analysis of data.
Analysis of data: The basic purpose of data analysis is to make the data useful for drawing conclusions. The analysis may simply be a critical observation of the data to draw some meaningful conclusions, or it may involve highly complex and sophisticated mathematical techniques. Simple statistical tools such as averages, the dispersion of data around averages, and percentages are commonly used to analyze data.
Interpretation of data: Interpretation means drawing conclusions from the data which form the basis of decision making. Correct interpretation requires a high degree of skill and experience and is necessary in order to draw valid conclusions.

1.2.2 Uses of statistics

Statistics is an increasingly important subject which is useful in many types of scientific investigations. Statistics is particularly useful in situations where there is experimental uncertainty and may be defined as ‘the science of making decisions in the face of uncertainty’. It is applicable in various fields including education, business, agriculture and engineering. Its main uses are as follows:

To present data in a concise and definite form: statistics helps in classifying and tabulating raw data for processing and for further tabulation by other users.
To make large and complex data easy to understand: statistics permits the summarization and presentation of large quantities of information, i.e. it condenses voluminous data into a few presentable, understandable and precise figures. For example, the prices of individual stocks and their trends are hard to comprehend, but a graph of price trends gives the overall picture at a glance.
To undertake and understand research in our areas of interest: statistics helps in determining the functional relationship between two or more phenomena. Statistical techniques such as correlation analysis assist in establishing the degree of association between two or more variables. For example, the coefficient of correlation between literacy and employment measures the degree of association between these two variables.
To formulate programmes and policies: statistics is used in government and other organizations to formulate new programmes and policies as well as in administration, i.e. it helps central management and the government in formulating policies. For example, the recently conducted census will be used as a source of information for government planning for the next 10 years, until another census is conducted in 2019.
To compare variables in different sets of data: arranging data with respect to different characteristics facilitates comparison and interpretation. For example, data on the age, height, gender and family income of college students give a much better picture of the students when categorized by these characteristics.
To forecast the outcomes of future events: statistical methods are highly useful tools for analyzing past data and predicting future trends, which helps businesses make decisions based on future estimates and expectations. For example, the sales of a particular product for the next year can be estimated from the sales of the same product over previous years, the current market trends and the possible changes in the variables that affect demand for the product.

1.2.3 Scope of Statistics

Some of the important areas where the knowledge of statistics is usefully applied are as follows:

Government. Various departments of the government collect and interpret vast amounts of data and information for efficient functioning and decision making.
Economics. Statistics are widely used in economic study and research. The subject of economics is mainly concerned with the production and distribution of wealth as well as savings and investments. Some of the areas of economic interest in which statistical tools are used are as follows:
- Statistical methods are extensively used in measuring and forecasting Gross National Product ( GNP ).
- Economic stability is primarily judged by statistical studies of business cycles.
- Statistical analyses of population growth, unemployment figures, rural or urban population shifts and so on influence much of economic policy making.
- Econometric models, which involve the application of statistical methods, are used for optimum utilization of available resources.
- Financial statistics are necessary in the fields of money and banking including consumer savings and credit availability.
Physical, Natural and Social Sciences. In physical sciences, as an example, the science of meteorology uses statistics in analyzing the data gathered by satellites in predicting weather conditions.
Statistics and Research. There is hardly any advanced research going on without the use of statistics in one form or another. Statistics are used extensively in medical, pharmaceutical and agricultural research. The effectiveness of a new drug is determined by statistical experimentation and evaluation.
Other Areas. Statistics are commonly used by insurance companies, stock brokerage firms, banks, public utility companies and so on. Statistics are also immensely useful to politicians, who can estimate their chances of winning by randomly sampling voters and studying their attitudes on issues and policies.

1.2.4 Limitations of Statistics

Statistics has a number of limitations, the most pertinent of which are as follows:

It does not deal with individual values. Statistics only deals with aggregate values. For example, the marks obtained by one student in a class do not carry any meaning in themselves unless they can be compared with a set standard, with other students in the same class, or with the student's own marks obtained earlier.
It cannot deal with qualitative characteristics. Statistics is not applicable to qualitative characteristics such as honesty, kindness, goodness, colour, poverty, beauty, and so on, since these cannot be expressed in quantitative terms. These characteristics can, however, be dealt with statistically if quantitative values can be assigned to them using logical criteria.
Statistical conclusions are not universally true. Since statistics is not an exact science, as is the case with natural sciences, the statistical conclusions are true only under certain assumptions.
Statistical interpretation requires a high degree of skill and understanding of the subject. In order to get meaningful results, it is necessary that the data be properly and professionally collected and critically interpreted. It requires extensive training to read and analyze statistics in its proper context.
Statistics can be misused. The famous statement that ‘figures don’t lie but liars can figure’ is a testimony to the misuse of statistics. Inaccurate or incomplete figures can be manipulated to produce the desired inferences. Example: advertising slogans such as “4 out of 5 dentists recommend brand X toothpaste” give the impression that 80% of all dentists recommend this brand. This may not be true, since we don’t know how big the sample is or whether the sample represents the entire population. Another example is opinion polls on the news where percentages are given without the sample size or any indication of representativeness.
There are certain phenomena or concepts where statistics cannot be used. This is because these phenomena or concepts are not amenable to measurement. For example, beauty, intelligence, and courage cannot be quantified. Statistics has no place where quantification is not possible.
Statistics reveal the average behaviour—the normal or general trend. Applying an ‘average’ to an individual may lead to wrong or dangerous conclusions. For example, an average river depth of four feet does not mean it is safe throughout; some points may be much deeper.
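As a small illustration of this point, the R sketch below uses made-up depth readings (hypothetical values, not data from these notes) whose average is four feet even though one point is far deeper.

```r
# Hypothetical depth readings (in feet) at six points across a river
depth_ft <- c(1, 2, 2, 3, 4, 12)

mean_depth <- mean(depth_ft)  # average depth over the six points
max_depth  <- max(depth_ft)   # depth at the deepest point

cat("Average depth:", mean_depth, "ft; deepest point:", max_depth, "ft\n")
```

The average alone hides the deepest reading, which is the value that matters to someone wading across.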
Since statistics are collected for a particular purpose, such data may not be relevant or useful in other situations. For example, secondary data (i.e., data originally collected by someone else) may not be useful for another person.
Statistics are not as precise as Mathematics or Accountancy. Users should be aware of this limitation.
In statistical surveys, sampling is generally used as it is not physically possible to cover the whole universe. The results may not fully represent the universe. Moreover, surveys with identical sample sizes but different sample units may give different outcomes.
At times, association or relationship between two or more variables is studied, but this does not indicate a cause-and-effect relationship. It only shows similarity or dissimilarity in movement. Interpretation requires care.
A major limitation of statistics is that it does not reveal everything about a phenomenon. Some background information or other relevant aspects may not be covered. The user of statistics must interpret results while considering other relevant information.

1.2.5 Misuses

Sometimes people, knowingly or unknowingly, use statistical data wrongly. Such forms of misuse include:

Failure to give the sources of data: This may compromise the reliability of the data, because the user will not know how well the data fit his or her situation and will be unable to refer to the original source.
Defective data: Defective data may be used knowingly in order to defend one’s position or to prove a particular point. Apart from this, the definition used to denote a certain phenomenon may be defective. For example, in data relating to unemployed persons, the definition may include even those who are partially employed. The question is how far it is justified to count partially employed persons among the unemployed.
Unrepresentative sample: Conducting a survey often requires choosing a sample from the given population or universe. The sample may turn out to be unrepresentative of the universe. One may choose a sample merely on the basis of convenience, collecting the desired information from friends or from nearby respondents in the neighbourhood, even though such respondents do not constitute a representative sample.
Inadequate sample: At times one may conduct a survey based on an extremely inadequate sample. For example, a city may have 100,000 households. When conducting a household survey, we may take a sample of merely 100 households, comprising only 0.1 per cent of the universe. A survey based on such a small sample may not yield the right information.
Unfair comparisons: For instance, one may construct an index of production choosing a base year in which production was much lower than usual, and then compare subsequent years’ production against this low base. Such a comparison will give an exaggerated picture of the growth in production. Another source of unfair comparisons is making absolute comparisons instead of relative ones. An absolute comparison of two figures, say of production or exports, may show a good increase, but in relative terms the increase may turn out to be negligible. A further example is comparing overall death rates, or deaths from a particular disease, in two cities whose populations differ; such a comparison is wrong. Likewise, when data are not properly classified, or when changes in the composition of the population between the two years are not taken into consideration, comparisons would be unfair because they lead to misleading conclusions.
Unwarranted conclusions: These may result from making false assumptions. For example, while making projections of population for the next five years, one may assume a lower rate of growth even though the past two years indicate otherwise. Sometimes one may not be sure about changes in the business environment in the near future, and may adopt an assumption that turns out to be wrong. Another source of unwarranted conclusions is the use of the wrong average. Suppose a series contains extreme values, one too high and the other too low, such as 800 and 50. The arithmetic average may give a wrong idea in such a case; the harmonic mean would be more appropriate.
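To make the point about averages concrete, the R sketch below computes the arithmetic and harmonic means of the two extreme values mentioned above, 800 and 50. It only shows how far apart the two averages can be; which average suits a given data set is a judgment that depends on the context, and the variable names are illustrative.

```r
x <- c(800, 50)

# Arithmetic mean: sum of the values divided by their number
arithmetic_mean <- mean(x)                  # (800 + 50) / 2 = 425

# Harmonic mean: number of values divided by the sum of their reciprocals
harmonic_mean <- length(x) / sum(1 / x)     # 2 / (1/800 + 1/50), about 94.12

cat("Arithmetic mean:", arithmetic_mean, "\n")
cat("Harmonic mean:  ", round(harmonic_mean, 2), "\n")
```

The arithmetic mean is pulled towards the large value, while the harmonic mean stays much closer to the small one.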
Confusion of correlation and causation: In statistics, one often has to examine the relationship between two variables. A close relationship between two variables does not by itself establish a cause-and-effect relationship, in the sense that one variable is the cause and the other is the effect. Correlation should be taken as a measure of the degree of association rather than as evidence of a causal relationship.

1.2.6 Branches of statistics

Statistics can be divided into two branches:

Descriptive: statistics that summarize the characteristics of given data, without trying to extrapolate or make predictions. It uses numerical and graphical methods to summarize the information, look for patterns in the data set and present the information in a convenient form (it describes or summarizes things you definitely know).
Inferential: statistics used to make claims or predictions about a larger population based on a subset (sample) of that population. It uses sample data to make estimates, decisions, predictions and other generalizations about a larger set of data (it compares groups, tests hypotheses, predicts or infers). The conclusions reached are called statistical inferences; they cannot be absolutely certain, hence the need to use probability in drawing conclusions.
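The contrast between the two branches can be sketched in R as follows. The scores are hypothetical, and the confidence interval assumes a random sample with roughly normally distributed scores; it is meant as an illustration rather than a recommended analysis.

```r
# Hypothetical sample: exam scores of 8 students drawn from a larger class
scores <- c(52, 61, 47, 70, 66, 58, 63, 55)

# Descriptive statistics: summarize the sample itself
sample_mean <- mean(scores)
sample_sd   <- sd(scores)

# Inferential statistics: a 95% confidence interval for the mean score of
# the whole class, based only on the sample
ci <- t.test(scores, conf.level = 0.95)$conf.int

cat("Sample mean:", sample_mean, " Sample SD:", round(sample_sd, 2), "\n")
cat("95% CI for the class mean:", round(ci[1], 2), "to", round(ci[2], 2), "\n")
```

The mean and standard deviation describe what was observed; the interval is a statement about the unobserved class, which is why it carries a stated level of confidence rather than certainty.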

Remark:
In this course, you will study numerical and graphical ways to describe and display your data. This area of statistics is what we have called “Descriptive Statistics.” You will learn how to calculate, and even more importantly, how to interpret these measurements and graphs.

1.3 Data

1.3.1 Definition of some terms.

Organization of Data - Data organization, in broad terms, refers to the method of classifying and organizing data sets to make them more useful. Some IT experts apply this primarily to physical records, although some types of data organization can also be applied to digital records.
Data is a collection of observations from an experiment or a survey.
A population is a set of units (people, objects, transactions or events): the entire set of all possible outcomes or measurements of interest.
In collecting data, it is often not possible to observe the whole group of interest, referred to as the target population; hence one observes a smaller group called a sample (a subset of the population for which we have data, and which we hope is representative of the population).
If the whole group is observed, a census has been conducted.
If a smaller group is observed, a sample survey has been conducted.
If the sample is representative of the population, then important conclusions about the population can be drawn from it.
Target population may be finite or infinite.
Finite Population: e.g. number of students in ABC University.
Infinite Population: e.g. number of insects in ABC University.
Variable: A characteristic or property of an individual population unit; a quantity that can assume any of a prescribed set of values. It may be discrete, continuous or constant.
Discrete Variable - Takes a finite or countable number of values. E.g. size of a family.
Continuous Variable - Takes any value within a specified range. E.g. height of students.
Constant Variable - Takes only one value. E.g. number of hours in a day.

1.3.2 Levels of Measurement

Measurement is the process of assigning numbers to variables of individual population units according to a set of rules.
Nominal measurement: classifies data into mutually exclusive (non-overlapping), exhaustive categories on which no order or ranking can be imposed, e.g. gender (male, female), blood group (O, A, B, AB), eye colour (blue, brown), religion, etc.
Ordinal measurement: classifies data into categories that can be ranked or ordered with respect to each other. For example, a guest speaker might be ranked as good, average or poor, and the health condition of a patient can be good, better or best. Precise differences between ranks do not exist. More examples: grades (A, B, etc.), ranking scales (poor, good, excellent, etc.), judging (1st, 2nd, etc.).
Interval measurement: classifies and ranks data, and precise differences between units of measurement exist. However, there is no meaningful zero. For example, temperature in degrees Celsius has a meaningful difference between each unit, but 0 degrees Celsius does not mean there is no heat. Other examples: IQ, exam score.
Ratio measurement: there are precise differences between units and a true zero exists. Examples: height, time, age, salary, etc.
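One way to see the four levels side by side is to encode a small example of each in R. The sketch below uses hypothetical values; the choice of unordered factors for nominal data and ordered factors for ordinal data is a common convention, not something prescribed by these notes.

```r
# Nominal: categories with no order
blood_group <- factor(c("O", "A", "B", "AB", "O"))

# Ordinal: categories with a meaningful order but no precise distances
rating <- factor(c("good", "poor", "average", "good"),
                 levels = c("poor", "average", "good"), ordered = TRUE)

is.ordered(blood_group)   # FALSE: ranking blood groups makes no sense
is.ordered(rating)        # TRUE
rating[1] > rating[2]     # TRUE: "good" ranks above "poor"

# Interval: differences are meaningful, but there is no true zero
temp_c <- c(10, 20)
temp_diff <- temp_c[2] - temp_c[1]          # a 10-degree difference

# Ratio: a true zero exists, so ratios are meaningful
height_cm <- c(90, 180)
height_ratio <- height_cm[2] / height_cm[1] # 2: twice as tall
```

Note that 20 degrees Celsius is not "twice as hot" as 10 degrees Celsius, whereas 180 cm really is twice 90 cm; that is the practical difference between the interval and ratio levels.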

1.3.3 Types of Data

All data can be classified as one of two general types: Quantitative Data and Qualitative Data.

Quantitative data (numerical data: data that yield numerical responses, for example, “What is your age?”) are data measured on a naturally occurring numerical scale. They represent a measurable quantity. Observations are numbers representing an amount or count of a certain characteristic, such as height or weight.

Examples: The number of patients admitted to the County hospital, the current unemployment rate for each county, the scores of a sample of 150 students in an exam, the number of male students in the class.

Ratio and interval measurement fall under the quantitative category.

These data can be classified into two types: discrete and continuous.

Discrete Data - Discrete data can only take on particular values and thus have clear boundaries; they assume only a countable number of values. Example: you can have 30 students or 31 students, but not 30.5 students, so “number of students” is a discrete variable, as is family size. In fact, any variable based on counting is discrete, whether you are counting the number of books purchased in a year or the number of motor accidents reported in a year.
Continuous Data - Continuous data can take any value, or any value within a range or an interval. Most data measured on interval and ratio scales, other than those based on counting, are continuous. Example: the weight and height of students, the distance from town to campus and the income received by an employee are all continuous.

Qualitative data (categorical data: data that yield responses such as Yes or No, for example, “Did you buy the books?”)

Qualitative data cannot be measured on a natural numerical scale; they can only be classified into groups or categories. They take on values that are names or labels. The categories are non-overlapping and may or may not suggest an order or rank.

Examples: The political party affiliations in a sample of 50 chief executive officers, the size of car (subcompact, compact, mid-size, or full-size) rented by each of a sample of 30 business travelers, each coffee taster’s ranking (best, worst, etc.) of four brands of coffee for a panel of 10 tasters.

These data can be classified into three types: Attribute, Nominal and Ordinal.

Attribute Data: Also known as dichotomous data. These data have only two categories. Example: yes/no, male/female.
Nominal Data: These data have several unordered categories. Example: type of insurance policy (motor, medical, fire, burglary, life insurance policies).
Ordinal or Ranked Data: These data have several ordered categories. Examples: questionnaire responses ranging from Strongly Agree to Strongly Disagree to statements such as “I am the best student in my class”, “My classmates are very co-operative” or “I live in the best hostel”; muscle response (none, partial, complete); tree vigor (healthy, sick, dead); income (less than KSh10,000, KSh10,000-KSh19,999, KSh20,000-KSh49,999, KSh50,000 or more).
Remark:
In economics, data is also often categorized by how it relates to time.
Cross-sectional data.
In cross-sectional data, all observations come from the same point in time. The observations typically correspond to individuals or groups like states or countries. For instance, a survey of Americans on who they support in the upcoming presidential election is cross-sectional data. So is a data set with the homicide rate for each state in a single year.
Longitudinal or time-series data.
In longitudinal or time-series data, each data point corresponds to a particular point in time, usually for a single individual or group. For instance, if you recorded your income every day for a year, that would give you a longitudinal data set. The GDP of the U.S. from 1945 to the present is also a longitudinal data set.
Panel data.
Panel data is both cross-sectional and longitudinal. It involves getting cross-sectional data for many time periods (or, alternatively, time-series data for many different individuals or groups). For instance, if you recorded the income for each one of your classmates every year for the next 20 years, that would be a panel data set. One way to think of this is in terms of dimensions. Both cross-sectional and time-series data are one-dimensional; panel data is two-dimensional.
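The three layouts can be sketched as small R data frames. All names and income figures below are hypothetical and chosen only so that the three tables agree with one another.

```r
# Cross-sectional: several individuals, one point in time
cross_section <- data.frame(student = c("A", "B", "C"),
                            year    = 2020,
                            income  = c(100, 150, 120))

# Time series: one individual, several points in time
time_series <- data.frame(student = "A",
                          year    = 2020:2022,
                          income  = c(100, 110, 125))

# Panel: several individuals, each observed at several points in time.
# expand.grid() varies the first argument fastest, so rows run
# A, B, C for 2020, then A, B, C for 2021, and so on.
panel <- expand.grid(student = c("A", "B", "C"), year = 2020:2022)
panel$income <- c(100, 150, 120, 110, 155, 130, 125, 160, 135)

nrow(panel)   # 9 rows: 3 students x 3 years
```

Filtering the panel to a single year recovers a cross-section, and filtering it to a single student recovers a time series, which is the sense in which panel data has two dimensions.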

1.3.4 Data Sources and Collection Tools

1.3.4.1 Data Collection

Figure 1.1: Data Collection Methods

N/B: In experimental methods, the researcher controls the independent variables, while in non-experimental methods there is no such control.

1.3.4.2 Sources of Data

There are two main sources of data: primary and secondary sources. There is also a third source known as internal data.

Primary Data

Primary data are measurements observed and recorded as part of an original study. Data are primary if they have been collected by the same person or entity that is using them. Such data have not yet been published and are generally regarded as more reliable, authentic and objective, since they have not been changed or altered. The work of collecting original data is usually limited by the time, money and manpower available for the study.
Common methods of obtaining primary data include:

Surveys – most commonly used method in social sciences, management, psychology etc.
Questionnaire – commonly used in surveys, where people are asked questions (questioning). It is a formal list of questions, either open-ended or closed-ended, to which the respondent gives answers. It may be administered by telephone, mail, in person, electronic mail, fax, etc.
Direct Observation – when data are collected by observation, the investigator asks no questions and may or may not let the subject know that he or she is being observed.
Interviews – face to face with the respondent. They are slow, expensive and may take respondents away from their working hours, but they allow in-depth and follow-up questioning.
Experiments – subjects are divided into treatment groups and control groups to measure the difference between them after some kind of treatment is given to the former group. This is very common in medical testing.

Secondary Data

Secondary data are data which have already been collected by, and are available from, other sources: primary data collected for another purpose and reused for ours. Secondary data can be obtained from journals, reports, government publications, publications of research organizations, trade and professional bodies, compilations from computerized databases and information systems, magazines, newspapers, the internet, stories told by people, etc. Analyzing such existing data is sometimes referred to as data mining (also called data or knowledge discovery): the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. N/B: Information from the Census Bureau, the Bureau of Labor Statistics, the Dept. of Commerce, etc., is secondary if you use it. If they (that is, employees of the Census Bureau) use it, it is primary.

Internal Data

Internal data refer to the measurements that are the by-product of routine business record keeping like accounting, finance, production, personnel, quality control, sales, etc.

Exercise 1.1

Describe the meaning of each of the following terms:
- Statistics.
- Data
- Frequency distribution
Discuss four functions of statistics.
What are the major limitations of Statistics? Explain with suitable examples.
Distinguish between the following terms as used in statistics:
- Descriptive and inferential statistics.
- Target population and sample.
- Census and sample survey.
- Nominal and interval measurement.
- Quantitative Data and Qualitative Data.
Explain the two main sources of data.
Categorize these measurements according to their level:
- Students’ performance: Distinction, Pass, Fail.
- Annual net income for Afya Insurance in 2012.
- Names of insurance products.
- Religious preference of tourists.
- Room temperature measured in Kelvin scale.
- The length of time spent in a restaurant.
- The rank of an army officer.
- The type of a vehicle driven by the president.
- The mass of a pig.
State which of the following variables are discrete and which are continuous:
- Height of a person.
- Number of employees in ABC bank.
- Temperature on a certain day.
- Age of a building.
- Length of a train journey.
- Time taken to complete a project.
- Volume of water in a container.
- Number of children in a family.
Classify the following examples of data as nominal, ordinal, interval or ratio giving reasons for each:
- The species of trees growing in a farm.
- The grades of students at the end of semester exams.
- The financial stability of banks in Kenya.
- The number of years of service of all employees in Karatina University.
- Favorite rainbow colours among a sample of 50 pupils in ABC school.
- The number of defective bulbs produced by XYZ factory between January and May 2000.
List the various methods of data collection you know of.
Sometimes people, knowingly or unknowingly, use statistical data wrongly. State any two forms of misuse of statistical data.
Classify the different measurement systems into one of the four types of scales:

The distance around your forehead measured with a tape measure as a measure of your intelligence.
A response to the statement “My dress my choice” where “Strongly Disagree” = 1, “Disagree” = 2, “No Opinion” = 3, “Agree” = 4, and “Strongly Agree” = 5, as a measure of women’s attitude toward manner of dressing.
Research Question: Write down the advantages of data classification.

1.4 Data Presentation

1.4.1 Objectives

By the end of the lecture the learner should be able to:

Summarize a set of data using a table or frequency distribution table.
Display data graphically using bar graphs, histograms, frequency polygons, frequency curves and cumulative frequency (Ogive) curves, and interpret the graphs.

1.4.2 Introduction

When data are first collected (raw data), they are usually not organized. After the data have been collected, the next step is to present them in some suitable form. Proper presentation is necessary because statistical data in raw form are difficult to comprehend.

Often, the first stage in presenting data is to produce a table.

If the data are few, they can be easily presented and understood.
If the number of figures is large, proper classification is essential for analysis.

Next is to represent the data diagrammatically or graphically.

A statistical graph is a tool that helps you learn about the shape or distribution of a sample or population. Graphs often communicate information more effectively than large sets of numbers.

Common graphs include:

Dot plot
Bar graph
Histogram
Stem-and-leaf plot
Frequency curve
Frequency polygon
Pie chart
Box plot
Cumulative frequency (Ogive) curve

In this course, we will look at:

Histogram
Line graphs
Bar graphs
Frequency polygons
Cumulative frequency (Ogive) curve

1.4.3 Frequency Distribution

One method of data presentation is the frequency distribution.

The frequency of a value is the number of times that value appears. When observations are few and values repeat, we can arrange them in a table showing each value and its frequency. This is called a frequency table.

A frequency table/distribution is a listing of possible values for a variable together with the number of observations (or relative frequencies) for each value.

1.4.3.1 Ungrouped Data

Suppose we record some observations where some values occur once and others multiple times.

Data recorded as they appear, without any grouping, are called ungrouped (or raw) data; working with them directly is tedious.
When the number of distinct values is small (discrete distribution), it is convenient to use an ungrouped frequency distribution table.

Example 1.1: Ungrouped Frequency Distribution

The following set of data consists of exam scores for 25 students:

3, 3, 6, 4, 5, 4, 10, 5, 29, 3, 5, 6, 10, 31, 4, 10, 3, 29, 5, 31, 29, 11, 31, 6, 10

Construct an ungrouped frequency distribution table to represent this data set.

Solution: Steps of Construction of Ungrouped Frequency Distribution Table

Identify the smallest and the largest value in the data set and arrange all values in ascending (or descending) order.
Tally the number of times each value appears in the data.
Count the number of tallies of each value and record them as frequencies.

The smallest value is 3 and the largest is 31.
Arranging the values in ascending order, we obtain:

3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 10, 10, 10, 10, 11, 29, 29, 29, 31, 31, 31

The next step is to construct the frequency distribution table.

Note: If a tally reaches 5, the fifth stroke is drawn across the previous four to form a bundle of five, rather than being written as a fifth separate stroke (/////).

Ungrouped Frequency Distribution Table

Scores (x)	Tallies	Frequency (f)
3	////	4
4	///	3
5	////	4
6	///	3
10	////	4
11	/	1
29	///	3
31	///	3
Total		25

R Code: Construction of Ungrouped Frequency Table

##   Scores (x) Frequency (f)
## 1          3             4
## 2          4             3
## 3          5             4
## 4          6             3
## 5         10             4
## 6         11             1
## 7         29             3
## 8         31             3
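The code that produced the output above is not shown in these notes. A minimal base-R sketch that should give a table of the same shape is below; the object names `scores`, `freq` and `freq_table` are illustrative choices, not the author's.

```r
# Exam scores of the 25 students from Example 1.1
scores <- c(3, 3, 6, 4, 5, 4, 10, 5, 29, 3, 5, 6, 10, 31, 4, 10, 3, 29,
            5, 31, 29, 11, 31, 6, 10)

# table() counts how many times each distinct value occurs,
# with the distinct values sorted in ascending order
freq <- table(scores)

# Arrange the counts as a data frame with the column names used above;
# check.names = FALSE keeps the spaces and brackets in the names
freq_table <- data.frame(
  "Scores (x)"    = as.numeric(names(freq)),
  "Frequency (f)" = as.vector(freq),
  check.names = FALSE
)

print(freq_table)
sum(freq_table[["Frequency (f)"]])   # total number of observations
```

Checking that the frequencies sum to the number of observations (25 here) is a quick way to catch tallying mistakes.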

1.4.3.2 Categorical Frequency Distributions

A categorical frequency distribution is used for data that can be placed into categories such as gender, religion, marital status, blood group, etc. These categories are mutually exclusive and collectively exhaustive.

The categorical frequency distribution is used for data that can be placed in specific categories, such as nominal- or ordinal-level data. For example, data such as political affiliation, religious affiliation, blood group, tree species or major field of study would use categorical frequency distributions.

Example 1.2: Categorical Frequency Distribution Table

A lecturer recorded the major field of study for 30 first-year students. The categories observed were Statistics, Mathematics, Computer Science, and Actuarial Science. The data collected are as follows:

Statistics, Mathematics, Statistics, Computer Science, Actuarial Science, Statistics, Mathematics, Mathematics, Computer Science, Statistics, Actuarial Science, Statistics, Mathematics, Computer Science, Statistics, Mathematics, Actuarial Science, Computer Science, Statistics, Mathematics, Statistics, Actuarial Science, Computer Science, Mathematics, Statistics, Mathematics, Computer Science, Statistics, Actuarial Science, Statistics.

Construct a categorical frequency distribution table for this data.

Solution

Category	Frequency
Actuarial Science	5
Computer Science	6
Mathematics	8
Statistics	11
Total	30
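
The counts can be checked by machine. The following is a small sketch in base R, where `majors` simply re-types the 30 observations listed in the example and the object names are illustrative:

```r
# The 30 recorded fields of study, in the order given in Example 1.2
majors <- c(
  "Statistics", "Mathematics", "Statistics", "Computer Science",
  "Actuarial Science", "Statistics", "Mathematics", "Mathematics",
  "Computer Science", "Statistics", "Actuarial Science", "Statistics",
  "Mathematics", "Computer Science", "Statistics", "Mathematics",
  "Actuarial Science", "Computer Science", "Statistics", "Mathematics",
  "Statistics", "Actuarial Science", "Computer Science", "Mathematics",
  "Statistics", "Mathematics", "Computer Science", "Statistics",
  "Actuarial Science", "Statistics"
)

# Count the observations in each category
major_counts <- table(majors)

major_table <- data.frame(
  Category  = names(major_counts),
  Frequency = as.vector(major_counts)
)
print(major_table)

# The frequencies must add up to the number of students
sum(major_table$Frequency)
```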

1.4.4 Grouped Frequency Distribution Tables (Classification According to Class-Intervals)

If the amount of data is large, we put it into groups (categories or classes) and determine the number of units in each class (the class frequency).

A grouped frequency distribution table normally has columns which show the class intervals, class mid-points, class frequencies, and cumulative frequencies, the last of these being a running total of the frequencies themselves. There may also be a column of tallied frequencies, if the table is being constructed from the raw data without having first arranged the values in rank order.

1.4.4.1 Principles of Classification

For the purpose of further calculations in statistical work the mid-point of each class is taken to represent that class.
There are two methods of classifying the data according to class-intervals, namely:
- “Exclusive” method: When the class-intervals are so fixed that the upper limit of one class is the lower limit of the next class, it is known as the “exclusive” method of classification. This method ensures continuity of the data, since each class begins exactly where the previous one ends.
- “Inclusive” method: Under the “inclusive” method of classification, the upper limit of one class is included in that class itself.
The number of classes, denoted by $k$, should normally fall between 5 and 15. (However, there is no rigidity about it. There can be more than 15 classes, depending upon the total number of observations in the data and the details required.) Further, the precise number of classes to be used for a given variable may depend upon personal judgment and other considerations such as the details required, the ease of calculation in further statistical work, etc.
The classes should be mutually exclusive.
The starting point, i.e., the lower limit of the first class, should preferably be zero or a multiple of 5. For example, if the lowest value of the data is 63 and we have taken a class-interval of 10, then the first class should be 60 – 70, instead of 63 – 73.
To ensure continuity and to obtain the correct class-interval, we should adopt the “exclusive” method of classification. However, where the “inclusive” method has been adopted, it is necessary to make an adjustment to determine the correct class-interval and to have continuity. See the steps in the Construction of a Grouped Frequency Distribution below. The adjustment consists of finding the difference between the lower limit of the second class and the upper limit of the first class, dividing the difference by two, subtracting the value so obtained from all lower limits and adding it to all upper limits. This can be expressed in a formula as follows:

\[\text{Correction factor} = \frac{(\text{Lower limit of 2nd class}) - (\text{Upper limit of 1st class})}{2}\]

Whenever possible all classes should be of the same size.

Steps in the Construction of a Grouped Frequency Distribution

Step 1. Select the number of classes $k$. One guideline is to pick the smallest $k$ such that $2^k \geq n$. For example, if the sample size is $n = 20$ then $k = 5$, because $2^5 = 32 \geq 20$ while $2^4 = 16 < 20$; and if $n = 80$ then $k = 7$, because $2^7 = 128 \geq 80$. To be more specific, we can solve for $k$ to get:

\[k \geq \frac{\log n}{\log 2}\]

Alternatively, Sturges suggested the following formula for determining the approximate number of classes:

\[k = 1 + 3.322 \log(n)\]

where $k$ = the approximate number of classes, $n$ = the total number of observations and $\log$ = the common logarithm (base 10).
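
The two guidelines can be compared numerically. The sketch below uses base R; rounding the Sturges value up to the next whole number is an assumption made here (the formula itself only gives an approximate $k$), and the names `k_power` and `k_sturges` are illustrative.

```r
# Sample sizes used in Step 1
n <- c(20, 80)

# Smallest k with 2^k >= n, from k >= log(n) / log(2)
k_power <- ceiling(log(n) / log(2))

# Sturges' formula with the base-10 logarithm, rounded up (assumption)
k_sturges <- ceiling(1 + 3.322 * log10(n))

data.frame(n = n, k_power = k_power, k_sturges = k_sturges)
```

For these two sample sizes the rounded Sturges value is one class larger than the $2^k \geq n$ rule; either is only a starting point, and the final choice remains a matter of judgment.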

Step 2. Find the largest and smallest values and compute the working range, denoted by $R$:

\[R = \text{Maximum value} - \text{Minimum value (or desired lower class limit (LCL) of the starting class)}\]

The LCL of the starting class is normally the minimum value in the data or a value slightly less than the minimum value.

Step 3. Identify the smallest unit of measurement ($u$) used in the data collection. The value of $u$ can be inferred from the given data or the given starting value (usually tens (10), ones (1), tenths (0.1), hundredths (0.01), etc.).

\[u = (\text{LCL of 2nd class}) - (\text{UCL of 1st class})\]

Estimate the class interval ($i$) (sometimes denoted by $c$) as:

\[\text{Class width } CW(i) = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Desirable number of classes}}\]

\[i = \text{Round up} \left(\frac{R}{k}\right) \text{ to the nearest } u\]

Note: You must round up, not round off. For $u = 1$, Round Up(5.2) = 6, not 5; for $u = 0.1$, the value is rounded up to the next tenth. If $R/k$ is already exact (no remainder when divided by $u$), add one to the number of classes. Put simply, for $u = 1$, round $i$ up to the next whole number so that the classes cover the whole data set.
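
The rounding rule can be written as a one-line helper. This is a sketch in base R; `round_up_to_unit` is a hypothetical name, and the range of 26 with 6 classes is taken from Example 1.3.

```r
# Round x up to the next multiple of the unit of measurement u
round_up_to_unit <- function(x, u = 1) {
  ceiling(x / u) * u
}

R <- 34 - 8   # working range
k <- 6        # chosen number of classes

# Class interval: R / k = 4.33..., rounded up to the next whole number
i <- round_up_to_unit(R / k, u = 1)
i

# Rounding up differs from rounding off
round_up_to_unit(5.2, u = 1)   # rounds up to 6
round(5.2)                     # rounds off to 5
```

Note that `ceiling()` leaves an exact multiple of $u$ unchanged, so the case where $R/k$ has no remainder still has to be handled separately, as described in the note above.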

Step 4. The starting value used in calculation of $R$ above is picked as the lower class limit (LCL) of the first class. Add the class interval $i$ to this LCL successively to get the rest of the lower class limits.

Step 5. Find the Upper Class Limit (UCL) of the first class by subtracting $u$ from the LCL of the second class. Then continue to add the class interval $i$ to this UCL to find the rest of the upper limits.

Step 6. If necessary, find the class boundaries (CB) for each class as follows:

Lower Class Boundary: $LCB = LCL - 0.5u \quad (0.5u = \text{the correction factor})$
Upper Class Boundary: $UCB = UCL + 0.5u$

Step 7. Tally the number of observations falling in each class and find the frequencies.

Note: A value $x$ falls into a class $LCL - UCL$ only if $LCB \leq x < UCB$. That is, $x$ can be equal to $LCB$ but not $UCB$ of that class.

Step 8. Record the number of tallies in each category as the class frequencies.

Step 9. Compute the cumulative frequencies and confirm that the last value in the column is equal to the sum of the frequencies.

Step 10. Compute the midpoints of each class using the class boundaries.

Example 1.3

The idea of grouped data can also be illustrated by considering the following raw dataset:

Time taken (in seconds) by a group of students to answer a simple math question

Table 1.1: Raw data: time taken (seconds) by students

20	25	24	33	13	16	21	17	11	34
26	8	19	31	11	14	15	21	18	17

The above data can be organized into a frequency distribution (or a grouped data) in several ways. One method is to use intervals as a basis.

The smallest value in the above data is 8 and the largest is 34. The interval from 8 to 34 is broken up into smaller subintervals (called class intervals). Suppose we choose the number of classes using Sturges' formula, rounding up to the next whole number:

\[k = 1 + 3.322 \log(20) = 5.322 \approx 6\]

Then the class width is obtained as:

\[CW = \frac{34 - 8}{6} = 4.33 \implies \text{rounding to the next whole number, } CW = i = 5\]

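Following the principle that the lower limit of the first class should be a multiple of 5, the first class starts at 5 rather than at the minimum value 8. The results are tabulated as a frequency distribution as follows: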

Frequency distribution of the time taken (in seconds) by the group of students to answer a simple math question:

Using Exclusive Method of Classification

Table 1.3: Exclusive method of classification
Time taken (seconds)	Interval notation	Tallies	Frequency	Cumulative frequencies	Class mid-point
5-10	5 ≤ t < 10	/	1	1	7.5
10-15	10 ≤ t < 15	////	4	5	12.5
15-20	15 ≤ t < 20	/////	6	11	17.5
20-25	20 ≤ t < 25	////	4	15	22.5
25-30	25 ≤ t < 30	//	2	17	27.5
30-35	30 ≤ t < 35	///	3	20	32.5
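
The exclusive-method table can be reproduced as follows. This is a sketch in base R, assuming the class intervals above; `cut()` with `right = FALSE` places a value $t$ in a class when lower ≤ t < upper, and the object names are illustrative.

```r
# Raw data from Example 1.3 (time in seconds)
times <- c(20, 25, 24, 33, 13, 16, 21, 17, 11, 34,
           26,  8, 19, 31, 11, 14, 15, 21, 18, 17)

# Class limits 5, 10, ..., 35 (class interval i = 5)
breaks <- seq(5, 35, by = 5)

# right = FALSE closes each class on the left: [5,10), [10,15), ...
classes <- cut(times, breaks = breaks, right = FALSE)
freq    <- as.vector(table(classes))

grouped <- data.frame(
  class     = levels(classes),
  frequency = freq,
  cum_freq  = cumsum(freq),                          # running total
  midpoint  = head(breaks, -1) + diff(breaks) / 2    # lower limit + i/2
)
print(grouped)
```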

Using the Inclusive Method of Classification

Table 1.5: Inclusive method of classification
Time taken (seconds)	Tallies	Frequency	Cumulative frequencies	Class mid-point	Class boundaries
5-9	/	1	1	7	4.5-9.5
10-14	////	4	5	12	9.5-14.5
15-19	/////	6	11	17	14.5-19.5
20-24	////	4	15	22	19.5-24.5
25-29	//	2	17	27	24.5-29.5
30-34	///	3	20	32	29.5-34.5

Note: To ensure continuity, the class limits are adjusted to obtain the true class limits (class boundaries) using the correction factor described under the principles of classification. Here the correction factor is $(10 - 9)/2 = 0.5$, and the resulting boundaries are shown in the last column.
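
The adjustment from class limits to class boundaries can be sketched in base R as follows. The vectors `lcl` and `ucl` hold the inclusive class limits of the table above; all names are illustrative.

```r
# Inclusive class limits: 5-9, 10-14, ..., 30-34
lcl <- seq(5, 30, by = 5)   # lower class limits
ucl <- lcl + 4              # upper class limits

# Correction factor: half the gap between consecutive classes
correction <- (lcl[2] - ucl[1]) / 2

# Subtract from every lower limit, add to every upper limit
lcb <- lcl - correction
ucb <- ucl + correction

data.frame(limits     = paste(lcl, ucl, sep = "-"),
           boundaries = paste(lcb, ucb, sep = "-"),
           width      = ucb - lcb)
```

After the adjustment each upper boundary coincides with the next lower boundary, so the classes are continuous and every class has width 5.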

Example 1.4

Let the marks of 50 students of a class be:

Table 1.7: Marks of 50 students

46	58	54	52	55	59	52	62	65	67
64	63	77	78	92	6	7	12	18	16
3	23	25	25	27	81	88	24	29	22
34	33	30	37	36	42	48	28	22	28
17	13	70	37	32	36	41	40	43	44

We can arrange them as follows:

Table 1.9: Grouped frequency distribution of marks of 50 students

Marks	Frequency	Marks	Frequency
0 – 10	3	50 – 60	6
10 – 20	5	60 – 70	5
20 – 30	10	70 – 80	3
30 – 40	8	80 – 90	2
40 – 50	7	90 – 100	1

Data organized and summarized as in the above frequency distribution is called grouped data.
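
As a check on the table, the class frequencies can also be obtained with `hist()` from base R with plotting switched off. This is only a sketch: the exclusive reading of the classes (a mark of 30 belongs to 30 – 40) is assumed, and the object names are illustrative.

```r
# Marks of the 50 students
marks <- c(46, 58, 54, 52, 55, 59, 52, 62, 65, 67,
           64, 63, 77, 78, 92,  6,  7, 12, 18, 16,
            3, 23, 25, 25, 27, 81, 88, 24, 29, 22,
           34, 33, 30, 37, 36, 42, 48, 28, 22, 28,
           17, 13, 70, 37, 32, 36, 41, 40, 43, 44)

# right = FALSE counts each class as lower <= x < upper
# (by default the last class also includes its upper limit, 100)
h <- hist(marks, breaks = seq(0, 100, by = 10),
          right = FALSE, plot = FALSE)

grouped_marks <- data.frame(
  lower     = head(h$breaks, -1),
  upper     = tail(h$breaks, -1),
  frequency = h$counts
)
print(grouped_marks)
```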

Remark:

Consider the following:

Mass (Kg)	Number of students
60–62	5
63–65	18
66–68	42
69–71	27
72–74	8
75–	0

66–68 is referred to as a class interval, where 66 is the lower class limit and 68 is the upper class limit. 75– is an open class interval.

If measurements are taken to the nearest Kg then, for example, the class 66–68 has the true class limits (class boundaries) 65.5–68.5.

The mid-point between the class limits is called the class mark (class mid-point). It is used to represent the class in all mathematical analysis of the frequency distribution.

\[\text{Mid-point of a class} = \frac{\text{Upper class boundary} + \text{Lower class boundary}}{2}\]

Note: Relative frequencies may also be calculated by dividing the number of cases in each class by the total number of students (100) and multiplying by 100 to express the result as a percentage. For example, in the class 66–68: \[\text{Relative frequency} = \frac{42}{100} \times 100 = 42\%\]
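
The mid-point and relative-frequency calculations for the mass data can be sketched in base R as follows (the object names are illustrative; the open class 75– is left out of the mid-point calculation because it has no upper limit):

```r
# Class limits and frequencies for the closed classes
lower    <- c(60, 63, 66, 69, 72)
upper    <- c(62, 65, 68, 71, 74)
students <- c(5, 18, 42, 27, 8)

# Class boundaries for measurements to the nearest Kg (u = 1)
lcb <- lower - 0.5
ucb <- upper + 0.5

# Mid-point of each class from its boundaries
midpoint <- (lcb + ucb) / 2

# Relative frequency as a percentage of all students
rel_freq_pct <- 100 * students / sum(students)

data.frame(class = paste(lower, upper, sep = "-"),
           midpoint, students, rel_freq_pct)
```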

Relative frequencies are most useful where the class size is different.

Self-Test Question

The list below shows One-way Commuting Distances (in Km) for 60 workers in Nairobi city.

Table 1.11: One-way commuting distances (km) for 60 Nairobi workers

13	7	12	6	34	14	47	25	45	2
13	26	10	8	1	14	41	10	3	21
8	13	28	24	16	19	4	7	36	37
20	15	16	15	17	31	17	3	11	46
24	8	40	17	18	12	27	16	4	14
23	9	29	12	2	6	12	18	9	16

Construct a grouped frequency distribution table, including the cumulative frequencies and class mid-points, using:
1. The exclusive method of classification, with the class boundaries ending in either 0 or 5.
2. The inclusive method of classification.
Find the class boundaries in part 2 to ensure continuity.

1.5 Diagrammatic Representation of Data

1.5.1 Histogram

A histogram consists of a set of adjoining rectangles whose bases lie on the x-axis, with centers at the class marks and widths equal to the class interval sizes. The horizontal axis is labeled with what the data represents (for instance, distance from campus to your hostel). The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph will have the same shape with either label. The histogram can give you the shape of the data, the center, and the spread of the data.

The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample or population. If:

$f$ = frequency
$n$ = total number of data values (or the sum of the individual frequencies), and
$RF$ = relative frequency,

Then:

\[RF = \frac{\text{frequency}}{\text{total frequencies}} = \frac{f}{n}\]

The areas of the rectangles are proportional to the class frequencies. If the class intervals have equal sizes, the histogram is obtained by plotting the frequencies against the true class limits (class boundaries), so that the heights of the rectangles are proportional to the class frequencies.

If the class intervals are not equal, plot the frequency density (the frequency per unit of class width) against the class boundaries, so that the areas of the rectangles remain proportional to the class frequencies. This is illustrated in Example 1.4 (ii).

\[\text{Frequency density} = \frac{\text{frequency } (f)}{\text{class width } (i)}\]

To construct a histogram, first decide how many bars (intervals, also called classes) will represent the data. This usually equals the number of classes in the frequency distribution. Choose as the starting point the lower class boundary of the class just below the first interval in the data set. For instance, if the class intervals are 10–15, 15–20, …, then the first interval will be 5–10, with a height (frequency) of zero.

Example 1.4 (i): Histogram – Equal Class Widths

Represent the following data by a histogram.

Table 1.13: Frequency distribution – equal class widths
Marks	Frequency	Marks	Frequency
0–10	5	50–60	10
10–20	11	60–70	8
20–30	19	70–80	6
30–40	21	80–90	3
40–50	16	90–100	1
Total: 100

The class intervals are of equal size and class boundaries are given since the exclusive method of data classification has been used.

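library(ggplot2)

# Class mid-points and frequencies for the classes 0–10, ..., 90–100
marks_mid  <- seq(5, 95, by = 10)
marks_freq <- c(5, 11, 19, 21, 16, 10, 8, 6, 3, 1)

# One row per class: lower and upper class boundaries and the frequency
hist_df <- data.frame(
  lower = seq(0, 90, by = 10),
  upper = seq(10, 100, by = 10),
  freq  = marks_freq
)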

ggplot(hist_df, aes(xmin = lower, xmax = upper, ymin = 0, ymax = freq)) +
  geom_rect(fill = "#2e86c1", color = "white", linewidth = 0.5) +
  scale_x_continuous(breaks = seq(0, 100, by = 10),
                     labels = seq(0, 100, by = 10)) +
  labs(title = "Histogram of student marks",
       x     = "Marks",
       y     = "Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.2: Histogram of student marks (equal class widths)

Example 1.4 (ii): Histogram – Unequal Class Widths (Frequency Density)

Construct a histogram to represent the following data set:

Table 1.15: Frequency distribution – unequal class widths
Class limits (x)	Frequency (f)	Class boundaries	Relative frequency	Class size (i)	Frequency density (fd = f/i)
15-19	5	14.5-19.5	5/100	5	5/5
20-29	8	19.5-29.5	8/100	10	8/10
30-34	22	29.5-34.5	22/100	5	22/5
35-39	35	34.5-39.5	35/100	5	35/5
40-54	20	39.5-54.5	20/100	15	20/15
55-59	10	54.5-59.5	10/100	5	10/5

The class sizes are unequal and therefore to construct the histogram we use frequency density for each class calculated as $fd = \frac{f}{i}$.

Class limits are given; hence, to obtain the class boundaries (true class limits), we adjust the limits using the correction factor.

unequal_hist <- data.frame(
  lower = c(14.5, 19.5, 29.5, 34.5, 39.5, 54.5),
  upper = c(19.5, 29.5, 34.5, 39.5, 54.5, 59.5),
  freq  = c(5, 8, 22, 35, 20, 10),
  width = c(5, 10, 5, 5, 15, 5)
)
unequal_hist$fd <- unequal_hist$freq / unequal_hist$width

ggplot(unequal_hist,
       aes(xmin = lower, xmax = upper, ymin = 0, ymax = fd)) +
  geom_rect(fill = "#117a65", color = "white", linewidth = 0.5) +
  scale_x_continuous(breaks = c(14.5, 19.5, 29.5, 34.5, 39.5, 54.5, 59.5)) +
  labs(title = "Histogram using frequency density (unequal class widths)",
       x     = "Class boundaries",
       y     = "Frequency density (f/i)") +
  theme_classic(base_size = 13) +
  theme(plot.title    = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        axis.text.x   = element_text(angle = 45, hjust = 1))

Figure 1.3: Histogram of unequal class widths (frequency density)

Exercise (DIY): Construct a histogram for the following data, in which the classes are of equal width.

Class limits	15–19	20–24	25–29	30–34	35–39	40–44
Frequency	1	4	22	35	20	8

1.5.2 Frequency Polygon

A frequency polygon is a graphical form of representation of data. It is used to depict the shape of the data and to depict trends. It is usually drawn with the help of a histogram but can be drawn without it as well. If a histogram has already been drawn, joining the mid-points of the tops of adjacent rectangles by straight lines gives the frequency polygon.

Steps to Draw a Frequency Polygon

Mark the class intervals for each class on the horizontal axis. We will plot the frequency on the vertical axis.
Calculate the class mark for each class interval. The formula for class mark is:

\[\text{Class mark} = \frac{\text{Upper limit} + \text{Lower limit}}{2}\]

Mark all the class marks on the horizontal axis. It is also known as the mid-value of every class.
Corresponding to each class mark, plot the frequency as given to you. The height always depicts the frequency. Make sure that the frequency is plotted against the class mark and not the upper or lower limit of any class.
Join all the plotted points using line segments. The curve obtained will be kinked.
This resulting curve is called the frequency polygon.

N.B.: It can be drawn without rectangles.

Example: Frequency Polygon of Student Marks

Plot the frequency polygon of the marks of students given in Example 1.4 (i) above.

Solution:

Table 1.17: Midpoints and frequencies for frequency polygon
Marks	Frequency	Midpoint
–	0	-5
0–10	5	5
10–20	11	15
20–30	19	25
30–40	21	35
40–50	16	45
50–60	10	55
60–70	8	65
70–80	6	75
80–90	3	85
90–100	1	95
–	0	105

Note: It is customary to add the extensions PQ and RS to the next lower and next higher midpoints (here -5 and 105), which have corresponding class frequencies of zero.

# Midpoints and frequencies, with a zero-frequency midpoint added at each end
fp_data <- data.frame(
  midpoint  = c(-5, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105),
  frequency = c( 0, 5, 11, 19, 21, 16, 10,  8,  6,  3,  1,   0)
)

# Underlying histogram bars
hist_bars <- data.frame(
  lower = seq(0, 90, by = 10),
  upper = seq(10, 100, by = 10),
  freq  = c(5, 11, 19, 21, 16, 10, 8, 6, 3, 1)
)

ggplot() +
  geom_rect(data = hist_bars,
            aes(xmin = lower, xmax = upper, ymin = 0, ymax = freq),
            fill = "#aed6f1", color = "white", linewidth = 0.4, alpha = 0.6) +
  geom_line(data = fp_data,
            aes(x = midpoint, y = frequency),
            color = "#1a5276", linewidth = 1.2) +
  geom_point(data = fp_data,
             aes(x = midpoint, y = frequency),
             color = "#e74c3c", size = 2.5) +
  scale_x_continuous(breaks = seq(-5, 105, by = 10)) +
  labs(title = "Frequency polygon of student marks",
       x     = "Marks (midpoints)",
       y     = "Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.4: Frequency polygon of student marks

Plot the Frequency Polygon – Given Data Set

Table 1.19: Data for second frequency polygon
Class limits	Frequency	Class mid-point
15-19	1	17
20-24	4	22
25-29	22	27
30-34	35	32
35-39	20	37
40-44	8	42

fp2_data <- data.frame(
  midpoint  = c(12, 17, 22, 27, 32, 37, 42, 47),
  frequency = c( 0,  1,  4, 22, 35, 20,  8,  0)
)

hist2_bars <- data.frame(
  lower = c(14.5, 19.5, 24.5, 29.5, 34.5, 39.5),
  upper = c(19.5, 24.5, 29.5, 34.5, 39.5, 44.5),
  freq  = c(1, 4, 22, 35, 20, 8)
)

ggplot() +
  geom_rect(data = hist2_bars,
            aes(xmin = lower, xmax = upper, ymin = 0, ymax = freq),
            fill = "#a9dfbf", color = "white", linewidth = 0.4, alpha = 0.6) +
  geom_line(data = fp2_data,
            aes(x = midpoint, y = frequency),
            color = "#117a65", linewidth = 1.2) +
  geom_point(data = fp2_data,
             aes(x = midpoint, y = frequency),
             color = "#e74c3c", size = 2.5) +
  scale_x_continuous(breaks = c(12, 17, 22, 27, 32, 37, 42, 47)) +
  labs(title = "Frequency polygon",
       x     = "Class mid-points",
       y     = "Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.5: Frequency polygon – second data set

1.5.3 Bar Graph (Chart)

In a bar graph, the height of each bar is proportional to the frequency of the variate, but the thickness of the bar has no significance. A bar chart comprises a number of spaced rectangles, which generally have their major axes vertical; because the bars are separated, they do not suggest continuity. Bar charts can be used to represent a large variety of statistical data, and are appropriate for displaying discrete data with only a few categories.

(a) Simple Bar Chart

Example 1.5 (i): Birth Rates by Country

The following table gives the birth rate per thousand of different countries over a certain period of time.

Table 1.21: Birth rate per thousand by country
Country	Birth rate
Kenya	30
India	33
China	40
Uganda	29
U.K.	20
Sweden	15

Represent the above data by a suitable diagram.

Solution: The appropriate diagram for this data is a simple bar diagram.

birth_rate <- data.frame(
  Country    = c("Kenya","India","China","Uganda","U.K.","Sweden"),
  Birth_Rate = c(30, 33, 40, 29, 20, 15)
)
birth_rate$Country <- factor(birth_rate$Country,
                             levels = birth_rate$Country[order(birth_rate$Birth_Rate)])

ggplot(birth_rate, aes(x = Country, y = Birth_Rate)) +
  geom_bar(stat = "identity", fill = "#2e86c1", width = 0.6) +
  geom_text(aes(label = Birth_Rate), vjust = -0.4, size = 4, color = "#1a5276") +
  labs(title = "Birth rate per thousand by country",
       x     = "Country",
       y     = "Birth Rate (per thousand)") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.6: Simple bar chart of birth rates by country

Comparing the size of the bars, you can easily see that China has the highest birth rate while Sweden has the lowest.

Example 1.5 (ii): Bacterial Meningitis Cases

Consider data relating to the number of patients diagnosed with Bacterial meningitis in a hospital each year.

Table 1.23: Bacterial meningitis patients per year
Year	No. of patients
2001	141
2002	225
2003	205
2004	108
2005	192

This data can be represented by a bar chart, as shown below, of the number of patients diagnosed with bacterial meningitis in the hospital during the period 2001–2005.

mening <- data.frame(
  Year     = factor(2001:2005),
  Patients = c(141, 225, 205, 108, 192)
)

ggplot(mening, aes(x = Year, y = Patients)) +
  geom_bar(stat = "identity", fill = "#117a65", width = 0.6) +
  geom_text(aes(label = Patients), vjust = -0.4, size = 4, color = "#117a65") +
  labs(title = "Bacterial meningitis patients (2001–2005)",
       x     = "Year",
       y     = "Number of patients") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.7: Bacterial meningitis cases 2001–2005

Notice that it is now easy to see that there are variations in the number of cases over this period of time.

(b) Multiple Bar Chart

Bar charts often prove most useful if we have two (or more) sets of comparable data and wish to compare and contrast them.

Example 1.6

Suppose that apart from the data relating to the number of patients diagnosed with Bacterial meningitis in a hospital each year, we also have the corresponding numbers for Malaria cases.

Table 1.25: Meningitis and malaria patients per year
Year	Number of patients (Meningitis)	Number of patients (Malaria)
2001	141	321
2002	225	251
2003	205	123
2004	108	547
2005	192	148

multi_long <- data.frame(
  Year     = rep(factor(2001:2005), 2),
  Disease  = c(rep("Meningitis", 5), rep("Malaria", 5)),
  Patients = c(141, 225, 205, 108, 192, 321, 251, 123, 547, 148)
)

ggplot(multi_long, aes(x = Year, y = Patients, fill = Disease)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  geom_text(aes(label = Patients),
            position = position_dodge(width = 0.7),
            vjust = -0.4, size = 3.5) +
  scale_fill_manual(values = c("Meningitis" = "#2e86c1",
                               "Malaria"    = "#e67e22")) +
  labs(title = "Meningitis vs malaria cases (2001–2005)",
       x     = "Year",
       y     = "Number of patients",
       fill  = "Disease") +
  theme_classic(base_size = 13) +
  theme(plot.title   = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        legend.position = "top")

Figure 1.8: Multiple bar chart: meningitis vs malaria cases 2001–2005

(c) Component Bar Charts (Sub-divided Bar Diagrams)

In this type of bar chart each bar is subdivided into two or more components.

Example 1.7

Suppose further that the data in the example above is grouped according to sex as follows:

Table 1.27: Meningitis patients by sex (2001–2005)
Year	Number of Male patients	Number of Female patients	Total Patients
2001	100	41	141
2002	125	100	225
2003	90	115	205
2004	20	88	108
2005	102	90	192

This data can be represented in a component bar chart as shown in the figure below. Looking at this presentation, it is possible to discern two main features: firstly, we can see how the meningitis cases vary from year to year, and secondly, we can get a good idea of the make-up of each total in terms of the proportions of patients who are male or female.

comp_long <- data.frame(
  Year     = rep(factor(2001:2005), 2),
  Sex      = c(rep("Male", 5), rep("Female", 5)),
  Patients = c(100, 125, 90, 20, 102, 41, 100, 115, 88, 90)
)

ggplot(comp_long, aes(x = Year, y = Patients, fill = Sex)) +
  geom_bar(stat = "identity", position = "stack", width = 0.6) +
  geom_text(aes(label = Patients),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_manual(values = c("Male" = "#2e86c1", "Female" = "#e74c3c")) +
  labs(title = "Component bar chart: meningitis patients by sex (2001–2005)",
       x     = "Year",
       y     = "Number of patients",
       fill  = "Sex") +
  theme_classic(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        legend.position = "top")

Figure 1.9: Component bar chart: meningitis cases by sex 2001–2005

1.5.4 Pie Chart

A pie chart presents data in the form of a circle. The slices represent absolute or relative proportions. A pie chart is formed by marking off a sector (slice) of the circle for each characteristic being displayed, with the size of the sector proportional to the share it represents.

Example 1.8

A researcher studying the distribution of manufacturing costs in ABC Ltd found that 20% of the firm’s unit cost is due to labour, 40% raw materials, 25% maintenance costs and 15% debt servicing. Present this information in a pie chart.

Table 1.29: ABC Ltd manufacturing cost distribution
Component	Percentage
Labour	20
Raw Materials	40
Maintenance Costs	25
Debt Servicing	15
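
When a pie chart is drawn by hand, each percentage is converted into a sector angle as a share of the full 360 degrees. The conversion is not spelled out in the text, so the following base R lines are only a sketch of that arithmetic, with illustrative names:

```r
# Percentage share of unit cost for each component
cost_share <- c("Labour"            = 20,
                "Raw Materials"     = 40,
                "Maintenance Costs" = 25,
                "Debt Servicing"    = 15)

# Sector angle in degrees: share of the total, times 360
angles <- 360 * cost_share / sum(cost_share)
angles

# The angles of all sectors must add up to a full circle
sum(angles)
```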

pie_data <- data.frame(
  Component  = c("Labour","Raw Materials","Maintenance costs","Debt servicing"),
  Percentage = c(20, 40, 25, 15)
)
pie_data$Component <- factor(pie_data$Component,
                             levels = pie_data$Component)
pie_data$label     <- paste0(pie_data$Component, "\n", pie_data$Percentage, "%")

ggplot(pie_data, aes(x = "", y = Percentage, fill = Component)) +
  geom_bar(stat = "identity", width = 1, color = "white", linewidth = 0.7) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Labour"           = "#2e86c1",
                               "Raw Materials"     = "#e67e22",
                               "Maintenance costs" = "#117a65",
                               "Debt servicing"    = "#8e44ad")) +
  labs(title = "ABC Ltd: per unit manufacturing cost distribution",
       fill  = "Component") +
  theme_void(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276",
                                       face = "bold", size = 13),
        legend.position = "right")

Figure 1.10: Pie chart: ABC Ltd manufacturing cost distribution

1.6 Graphical Representation of Data

1.6.1 Frequency Curve

Consider, for example, the sales data for a company over a period of five years, as shown in the table below:

Year	Sales
2000	420000
2001	370000
2002	360000
2003	380000
2004	540000

This data can be presented in graphical form as follows.

# Annual sales, one row per year
sales_df2 <- data.frame(
  Year  = c(2000, 2001, 2002, 2003, 2004),
  Sales = c(420000, 370000, 360000, 380000, 540000)
)

ggplot(sales_df2, aes(x = Year, y = Sales)) +
  geom_line(color = "#1a237e", linewidth = 1.2) +
  geom_point(color = "#1a237e", size = 2) +
  scale_x_continuous(breaks = c(2000, 2001, 2002, 2003, 2004),
                     labels = c("2000","2001","2002","2003","2004")) +
  scale_y_continuous(breaks = seq(0, 600000, by = 100000),
                     labels = c("0","100,000","200,000","300,000",
                                "400,000","500,000","600,000"),
                     limits = c(0, 620000)) +
  labs(title = NULL, x = NULL, y = NULL) +
  theme_gray(base_size = 12) +
  theme(
    panel.background = element_rect(fill = "#c0c0c0", color = NA),
    plot.background  = element_rect(fill = "white", color = "black", linewidth = 0.8),
    panel.grid.major = element_line(color = "white", linewidth = 0.4),
    panel.grid.minor = element_blank(),
    axis.text        = element_text(color = "black", size = 10)
  )

Figure 1.11: Sales data for a company (2000–2004)

1.6.2 Cumulative Frequency Curve (Ogive)

(i) “Less Than” Ogive

The cumulative frequency curve is obtained by plotting the upper class boundary of each class interval on the X-axis against the corresponding cumulative frequency on the Y-axis, starting from the lower boundary of the first class, where the cumulative frequency is zero. The points are joined by means of a freehand smooth curve. This cumulative frequency curve is called the “less than” ogive.

Example 1.9

Plot the “less than” ogive curve of the marks of students given in Example 1.4 (i) above.

Solution:

Table 1.31: Less than ogive – cumulative frequency table
Marks	Frequency	Cumulative frequency	Upper class boundary
0 – 10	5	5	10
10 – 20	11	16	20
20 – 30	19	35	30
30 – 40	21	56	40
40 – 50	16	72	50
50 – 60	10	82	60
60 – 70	8	90	70
70 – 80	6	96	80
80 – 90	3	99	90
90 – 100	1	100	100

lt_data <- data.frame(
  upper_boundary = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  cum_freq       = c(0,  5, 16, 35, 56, 72, 82, 90, 96, 99, 100)
)

ggplot(lt_data, aes(x = upper_boundary, y = cum_freq)) +
  geom_line(color = "#1f618d", linewidth = 1.2) +
  geom_point(color = "#e74c3c", size = 3) +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  labs(title = '"Less than" Ogive Curve of Student Marks',
       x     = "Upper Class Boundary (Marks)",
       y     = "Cumulative Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.12: ‘Less than’ ogive curve of student marks

Each point (x, y) on the graph shows that y students scored less than x marks.

(ii) “More Than” Ogive

If we plot the “more than” cumulative frequencies against the corresponding lower class boundaries and join the points by a smooth curve, we get a “more than” ogive.

Example 1.10

Plot the “more than” ogive curve of the marks of students given in Example 1.4 (i) above.

Solution:

Table 1.33: More than ogive – cumulative frequency table
Marks	Frequency	More than cumulative frequency	Lower class boundary
0 – 10	5	100	0
10 – 20	11	95	10
20 – 30	19	84	20
30 – 40	21	65	30
40 – 50	16	44	40
50 – 60	10	28	50
60 – 70	8	18	60
70 – 80	6	10	70
80 – 90	3	4	80
90 – 100	1	1	90
		0	100
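
Both cumulative frequency columns can be derived from the class frequencies instead of being typed in by hand. This is a short sketch in base R with illustrative names:

```r
# Class frequencies for 0-10, 10-20, ..., 90-100
freq <- c(5, 11, 19, 21, 16, 10, 8, 6, 3, 1)

# "Less than" cumulative frequencies: running total from the first class,
# read against the upper class boundaries
less_than <- cumsum(freq)

# "More than" cumulative frequencies: running total from the last class,
# read against the lower class boundaries
more_than <- rev(cumsum(rev(freq)))

data.frame(lower = seq(0, 90, by = 10),
           upper = seq(10, 100, by = 10),
           freq, less_than, more_than)
```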

mt_data <- data.frame(
  lower_boundary = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  cum_freq       = c(100, 95, 84, 65, 44, 28, 18, 10,  4,  1,   0)
)

ggplot(mt_data, aes(x = lower_boundary, y = cum_freq)) +
  geom_line(color = "#117a65", linewidth = 1.2) +
  geom_point(color = "#e74c3c", size = 3) +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  labs(title = '"More than" Ogive Curve of Student Marks',
       x     = "Lower Class Boundary (Marks)",
       y     = "More Than Cumulative Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title = element_text(hjust = 0.5, color = "#1a5276", face = "bold"))

Figure 1.13: ‘More than’ ogive curve of student marks

From the graph, each point $(x, y)$ on the curve shows that “y” students scored more than “x” marks.

The value of x at the intersection of the two graphs is the median value.

Both Ogive Curves on the Same Graph

The intersection of the “less than” and “more than” ogive curves gives the median.

# Tag each ogive with its type and give both data frames the same
# column names so that they can be stacked with rbind()
lt_data$type <- "Less than"
mt_data2 <- data.frame(
  boundary = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),  # lower class boundaries
  cum_freq = c(100, 95, 84, 65, 44, 28, 18, 10,  4,  1,   0),
  type     = "More than"
)
names(lt_data)[1] <- "boundary"  # previously upper_boundary

both_ogive <- rbind(lt_data, mt_data2)

ggplot(both_ogive, aes(x = boundary, y = cum_freq, color = type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.5) +
  scale_color_manual(values = c("Less than" = "#1f618d", "More than" = "#117a65")) +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  labs(title = '"Less than" and "More than" Ogive Curves',
       x     = "Class Boundary (Marks)",
       y     = "Cumulative Frequency",
       color = "Ogive type") +
  theme_classic(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        legend.position = "top")

Figure 1.14: Both ogive curves – median at intersection

In this graph the “less than” curve plots cumulative frequencies ($c_f$) against upper class boundaries, while the “more than” curve plots them against lower class boundaries.
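At any class boundary the “less than” and “more than” cumulative frequencies add up to the total frequency $N$, so the two curves cross where both equal $N/2$. A hedged sketch of locating that crossing numerically for the marks data, assuming straight-line segments between the plotted points (all variable names are illustrative):

```r
freq     <- c(5, 11, 19, 21, 16, 10, 8, 6, 3, 1)
boundary <- seq(0, 100, by = 10)

# Cumulative frequencies at each boundary (0, 10, ..., 100)
less_than_cf <- c(0, cumsum(freq))            # students below the boundary
more_than_cf <- c(rev(cumsum(rev(freq))), 0)  # students above the boundary

N <- sum(freq)

# The curves cross where the "less than" curve reaches N / 2;
# interpolate the boundary (marks) at that cumulative frequency
median_est <- approx(x = less_than_cf, y = boundary, xout = N / 2)$y
print(median_est)
```

The crossing falls in the 30 – 40 class, giving a median of roughly 37 marks under the straight-line assumption.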

Exercise 1.2

Consider the following data:

Table 1.35: Table 1.36: Data for Exercise 1.2 (Question 1)

32	46	25	57	39	45	55	42	20	36
58	12	38	34	22	40	33	64	43	46
31	40	52	29	14	57	66	36	32	48
46	42	47	54	65	44	35	19	54	25
23	33	38	45	32	38	41	42	58	43

Arrange the data in a frequency distribution with the first class interval 10 – 19.
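One possible way to tally such a distribution in base R is with `cut()` and `table()`. The sketch below assumes classes 10–19, 20–29, …, 60–69 (a class width of 10 is an assumption; the question fixes only the first class), and the object names are illustrative:

```r
marks <- c(32, 46, 25, 57, 39, 45, 55, 42, 20, 36,
           58, 12, 38, 34, 22, 40, 33, 64, 43, 46,
           31, 40, 52, 29, 14, 57, 66, 36, 32, 48,
           46, 42, 47, 54, 65, 44, 35, 19, 54, 25,
           23, 33, 38, 45, 32, 38, 41, 42, 58, 43)

# Lower class limits 10, 20, ..., 60 plus a closing value of 70.
# right = FALSE makes each interval include its lower limit and
# exclude the lower limit of the next class.
lower_limits <- seq(10, 70, by = 10)
class_labels <- paste(lower_limits[-7], lower_limits[-7] + 9, sep = "-")

classes    <- cut(marks, breaks = lower_limits, right = FALSE,
                  labels = class_labels)
freq_table <- table(classes)
print(freq_table)
```

The frequencies should add up to the 50 observations, which is a quick check that no value fell outside the chosen classes.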

The highway patrol set up a radar checkpoint and recorded the speed in miles per hour of a random sample of 50 cars that passed the checkpoint in one hour. The speed of the cars was recorded as follows:

Table 1.37: Table 1.38: Speed (mph) of 50 cars at a radar checkpoint

74	66	65	55	48	56	50	75	75	67
76	68	50	65	60	65	60	68	68	76
68	77	63	65	52	52	63	80	80	70
65	81	70	63	45	45	65	71	71	64
55	70	64	45	64	64	40	55	55	71

Make a frequency distribution table using 5 as the class width.

Given the data below:

Table 1.39: Table 1.40: Data for Exercise 1.2 (Question 3)

3.0	3.4	4.1	4.1	4.3	2.7	3.5	3.7	3.4	3.4
3.8	4.2	3.1	3.9	3.1	4.1	2.8	3.7	4.4	3.5
3.5	3.4	3.7	3.7	2.8	4.3	3.8	3.4	4.1	3.0
4.4	4.1	4.1	3.6	3.4	2.7	3.6	3.0	3.4	4.3
3.8	3.2	4.2	3.9	4.2	3.4	2.9	4.4	3.5	3.9

Form a frequency distribution using the classes 2.7–2.9, 3.0–3.2, 3.3–3.5, …

Using Sturges’ rule,

\[K = 1 + 3.322 \log_{10} N\]

where $K$ = number of class-intervals and $N$ = total number of observations; classify, in equal intervals, the following hours worked by 20 workers in a factory for one month:

Table 1.41: Table 1.42: Hours worked by 20 factory workers in one month

155	120	50	110	116	95	125	42	175	130
160	90	68	71	135	147	115	108	140	98

Find the percentage frequency in each class-interval.
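A sketch of how Sturges' rule and the percentage frequencies might be evaluated in base R. The rounding of $K$ to a whole number, the rounding-up of the class width and the choice of the smallest observation as the first lower limit are assumptions made for illustration, not conventions fixed by the text:

```r
hours <- c(155, 120, 50, 110, 116, 95, 125, 42, 175, 130,
           160, 90, 68, 71, 135, 147, 115, 108, 140, 98)

N <- length(hours)
K <- 1 + 3.322 * log10(N)   # Sturges' rule
k <- round(K)               # number of classes actually used (assumed rounding)

# Assumed convention: equal class width, rounded up to a whole number,
# with the first class starting at the smallest observation
width  <- ceiling((max(hours) - min(hours)) / k)
breaks <- seq(min(hours), by = width, length.out = k + 1)

classes  <- cut(hours, breaks = breaks, right = FALSE)
freq     <- table(classes)
pct_freq <- 100 * freq / N  # percentage frequency per class

print(K)
print(pct_freq)
```

With these choices the rule suggests about five classes of width 27 hours; a different starting point or width would give a slightly different but equally valid table.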

Represent the following data by a histogram.

Table 1.43: Table 1.44: Frequency distribution of student marks
Marks	Frequency	Marks	Frequency
0 – 10	5	50 – 60	10
10 – 20	11	60 – 70	8
20 – 30	19	70 – 80	6
30 – 40	21	80 – 90	3
40 – 50	16	90 – 100	1
Total: 100

Using the data classified in questions 1, 2 and 3, draw:
1. A Histogram
2. A Frequency polygon
3. “Less than” and “more than” Ogive curves

A nutritionist is interested in knowing the percentage of calories from fat that Kenyans consume on a daily basis. To study this, the nutritionist randomly selects 25 Kenyans and evaluates the percentage of calories from fat consumed in a typical day. The results of the study are as follows:

Table 1.45: Table 1.46: Percent of calories from fat (25 Kenyans)

34	18	33	25	30
42	40	33	39	40
45	35	45	25	27
23	32	33	47	23
27	32	30	28	36

Construct a frequency distribution and the corresponding histogram.

In Kenya, approximately 45% of the population has blood type O; 40% type A; 11% type B; and 4% type AB. Illustrate this distribution of blood types with a pie chart.

blood_df <- data.frame(
  Type       = c("Type O", "Type A", "Type B", "Type AB"),
  Percentage = c(45, 40, 11, 4)
)

ggplot(blood_df, aes(x = "", y = Percentage, fill = Type)) +
  geom_bar(stat = "identity", width = 1, color = "white", linewidth = 0.7) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Type O"  = "#2e86c1",
                               "Type A"  = "#e67e22",
                               "Type B"  = "#117a65",
                               "Type AB" = "#8e44ad")) +
  labs(title = "Distribution of Blood Types in Kenya",
       fill  = "Blood Type") +
  theme_void(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276",
                                       face = "bold", size = 13),
        legend.position = "right")

Figure 1.15: Distribution of blood types in Kenya

In the academic years 1982–83 to 1984–85, the numbers of students in College ABC were as follows:

Table 1.47: Table 1.48: Number of students in College ABC by faculty (1982–1985)
Year	Science	Arts	Law
1982–83	1000	1500	200
1983–84	1600	2000	350
1984–85	2100	4000	420

Represent the data by an appropriate diagram (Component bar chart).

ex9_long <- data.frame(
  Year    = rep(c("1982–83", "1983–84", "1984–85"), 3),
  Faculty = c(rep("Science", 3), rep("Arts", 3), rep("Law", 3)),
  Count   = c(1000, 1600, 2100, 1500, 2000, 4000, 200, 350, 420)
)
ex9_long$Year    <- factor(ex9_long$Year, levels = c("1982–83","1983–84","1984–85"))
ex9_long$Faculty <- factor(ex9_long$Faculty, levels = c("Law","Science","Arts"))

ggplot(ex9_long, aes(x = Year, y = Count, fill = Faculty)) +
  geom_bar(stat = "identity", position = "stack", width = 0.6) +
  geom_text(aes(label = Count),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_manual(values = c("Science" = "#2e86c1",
                               "Arts"    = "#e67e22",
                               "Law"     = "#117a65")) +
  labs(title = "Students in College ABC by Faculty (1982–1985)",
       x     = "Academic Year",
       y     = "Number of Students",
       fill  = "Faculty") +
  theme_classic(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        legend.position = "top")

Figure 1.16: Component bar chart: students in College ABC by faculty (1982–1985)

The table below gives data relating to Kenyan exports and imports (in millions of Ksh) during the five years from 1999–2000 to 2003–2004:

Table 1.49: Table 1.50: Kenyan exports and imports (millions of Ksh), 1999–2004. Source: KNBS
Year	Export	Import
1999–2000	160000	200000
2000–2001	170000	300000
2001–2002	180000	350000
2002–2003	200000	300000
2003–2004	200000	380000

Represent this information using a suitable diagram (multiple bar chart).

ex10_long <- data.frame(
  Year  = rep(c("1999–2000","2000–2001","2001–2002","2002–2003","2003–2004"), 2),
  Type  = c(rep("Export", 5), rep("Import", 5)),
  Value = c(160000,170000,180000,200000,200000,
            200000,300000,350000,300000,380000)
)
ex10_long$Year <- factor(ex10_long$Year,
                         levels = c("1999–2000","2000–2001","2001–2002",
                                    "2002–2003","2003–2004"))

ggplot(ex10_long, aes(x = Year, y = Value / 1000, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  geom_text(aes(label = paste0(Value/1000, "K")),
            position = position_dodge(width = 0.7),
            vjust = -0.4, size = 3.5) +
  scale_fill_manual(values = c("Export" = "#2e86c1", "Import" = "#e67e22")) +
  labs(title = "Kenyan Exports and Imports (1999–2004)",
       x     = "Year",
       y     = "Value (thousands of millions Ksh)",
       fill  = "Trade type",
       caption = "Source: KNBS") +
  theme_classic(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        axis.text.x     = element_text(angle = 30, hjust = 1),
        legend.position = "top")

Figure 1.17: Multiple bar chart: Kenyan exports and imports (1999–2004)

The following table shows the Kenyan population age structure as per the 2009 census:

Table 1.51: Table 1.52: Kenyan population age structure – 2009 Census. Source: CIA World Factbook 2017
Age	% of total population	Male	Female
0–14	40.02	9557274	9497870
15–24	19.15	4552448	4567894
25–54	33.91	8170264	7976751
55–64	3.92	856092	1009075
65 years and above	3.00	614751	813320

How best would you represent this data diagrammatically?

ex11_long <- data.frame(
  Age = rep(c("0–14","15–24","25–54","55–64","65+"), 2),
  Sex = c(rep("Male", 5), rep("Female", 5)),
  Population = c(9557274, 4552448, 8170264, 856092, 614751,
                 9497870, 4567894, 7976751, 1009075, 813320)
)
ex11_long$Age <- factor(ex11_long$Age,
                        levels = c("0–14","15–24","25–54","55–64","65+"))

ggplot(ex11_long, aes(x = Age, y = Population / 1e6, fill = Sex)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("Male" = "#2e86c1", "Female" = "#e74c3c")) +
  labs(title = "Kenyan Population Age Structure by Sex (2009 Census)",
       x     = "Age Group",
       y     = "Population (millions)",
       fill  = "Sex",
       caption = "Source: CIA World Factbook 2017") +
  theme_classic(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        legend.position = "top")

Figure 1.18: Multiple bar chart: Kenyan population by age group and sex (2009 Census)

The following data represents the maximum temperatures in degrees centigrade predicted for some 55 major cities on the 24th September 1993.

Table 1.53: Table 1.54: Maximum temperatures (°C) for 55 major cities, 24 September 1993

17	25	21	18	14	15	24	22	15	21	25
17	25	15	18	17	29	16	24	39	30	23
23	27	43	28	29	15	15	19	32	30	32
23	13	18	13	27	32	17	17	25	25	30
20	18	17	33	28	27	26	32	32	33	19

Construct a frequency distribution table for these temperatures starting with the classes 11–17, 18–24, …

Solution:

Table 1.55: Table 1.56: Frequency distribution of maximum temperatures for 55 cities
Temperature (°C)	Frequency
11 – 17	15
18 – 24	15
25 – 31	16
32 – 38	7
39 – 45	2

(a) Histogram of Maximum Temperatures

Represent the data using a histogram.

temp_hist <- data.frame(
  lower = c(10.5, 17.5, 24.5, 31.5, 38.5),
  upper = c(17.5, 24.5, 31.5, 38.5, 45.5),
  freq  = c(15, 15, 16, 7, 2)
)

ggplot(temp_hist, aes(xmin = lower, xmax = upper, ymin = 0, ymax = freq)) +
  geom_rect(fill = "#2e86c1", color = "white", linewidth = 0.5) +
  scale_x_continuous(breaks = c(10.5, 17.5, 24.5, 31.5, 38.5, 45.5)) +
  labs(title = "Histogram of Maximum Temperatures (55 Cities, Sept 1993)",
       x     = "Temperature (°C) – Class Boundaries",
       y     = "Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title  = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))

Figure 1.19: Histogram of maximum temperatures for 55 cities

(b) Frequency Polygon of Maximum Temperatures

temp_poly <- data.frame(
  # class mid-points, plus one empty class (width 7) at each end to close the polygon
  midpoint  = c(7, 14, 21, 28, 35, 42, 49),
  frequency = c(0, 15, 15, 16,  7,  2,  0)
)

temp_bars <- data.frame(
  lower = c(10.5, 17.5, 24.5, 31.5, 38.5),
  upper = c(17.5, 24.5, 31.5, 38.5, 45.5),
  freq  = c(15, 15, 16, 7, 2)
)

ggplot() +
  geom_rect(data = temp_bars,
            aes(xmin = lower, xmax = upper, ymin = 0, ymax = freq),
            fill = "#aed6f1", color = "white", linewidth = 0.4, alpha = 0.6) +
  geom_line(data = temp_poly,
            aes(x = midpoint, y = frequency),
            color = "#1a5276", linewidth = 1.2) +
  geom_point(data = temp_poly,
             aes(x = midpoint, y = frequency),
             color = "#e74c3c", size = 3) +
  scale_x_continuous(breaks = c(7, 14, 21, 28, 35, 42, 49)) +
  labs(title = "Frequency Polygon of Maximum Temperatures (55 Cities)",
       x     = "Class Mid-points (°C)",
       y     = "Frequency") +
  theme_classic(base_size = 13) +
  theme(plot.title  = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))

Figure 1.20: Frequency polygon of maximum temperatures for 55 cities

(c) Ogive Curves of Maximum Temperatures

Table 1.57: Table 1.58: Ogive table for maximum temperatures
Temperature (°C)	Frequency	Less than cumulative frequency	More than cumulative frequency	Upper class boundary	Lower class boundary
11 – 17	15	15	55	17.5	10.5
18 – 24	15	30	40	24.5	17.5
25 – 31	16	46	25	31.5	24.5
32 – 38	7	53	9	38.5	31.5
39 – 45	2	55	2	45.5	38.5

temp_lt <- data.frame(
  boundary = c(10.5, 17.5, 24.5, 31.5, 38.5, 45.5),
  cum_freq = c(0, 15, 30, 46, 53, 55),
  type     = "Less than"
)
temp_mt <- data.frame(
  boundary = c(10.5, 17.5, 24.5, 31.5, 38.5, 45.5),
  cum_freq = c(55, 40, 25, 9, 2, 0),
  type     = "More than"
)
temp_both <- rbind(temp_lt, temp_mt)

ggplot(temp_both, aes(x = boundary, y = cum_freq, color = type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Less than" = "#1f618d", "More than" = "#117a65")) +
  scale_x_continuous(breaks = c(10.5, 17.5, 24.5, 31.5, 38.5, 45.5)) +
  scale_y_continuous(breaks = seq(0, 55, by = 5)) +
  labs(title = "Ogive Curves for Maximum Temperatures (55 Cities)",
       x     = "Class Boundary (°C)",
       y     = "Cumulative Frequency",
       color = "Ogive type") +
  theme_classic(base_size = 13) +
  theme(plot.title      = element_text(hjust = 0.5, color = "#1a5276", face = "bold"),
        legend.position = "top",
        axis.text.x     = element_text(angle = 30, hjust = 1))

Figure 1.21: Less than and more than ogive curves for maximum temperatures

(d) Using the ogive curves, estimate:

i. The modal temperature

The modal class is 25 – 31°C (highest frequency = 16). Taking the mid-point of the modal class as the estimate, the modal temperature is approximately 28°C.

ii. The median temperature

The median is at cumulative frequency $55/2 = 27.5$. On the “less than” ogive, $cf = 27.5$ falls between the boundaries $17.5$ ($cf = 15$) and $24.5$ ($cf = 30$), so the median is approximately $17.5 + \dfrac{27.5 - 15}{15} \times 7 \approx 23.3°C$.

iii. The lower and upper class boundaries of the temperature range within which the middle 50% of all cities lie.

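The middle 50% lies between $Q_1$ (at $cf = 55/4 = 13.75$) and $Q_3$ (at $cf = 3 \times 55/4 = 41.25$). From the ogive: $Q_1 \approx 10.5 + \dfrac{13.75}{15} \times 7 \approx 16.9°C$ and $Q_3 \approx 24.5 + \dfrac{41.25 - 30}{16} \times 7 \approx 29.4°C$.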

iv. The minimum and maximum temperature of the middle 80% of the cities.

The middle 80% lies between the 10th percentile ($cf = 5.5$) and the 90th percentile ($cf = 49.5$). From the ogive: $P_{10} \approx 10.5 + \dfrac{5.5}{15} \times 7 \approx 13.1°C$ and $P_{90} \approx 31.5 + \dfrac{49.5 - 46}{7} \times 7 = 35°C$.

v. On this particular day, a researcher was collecting data and required data from cities whose temperatures were above $29.5°C$. How many of these cities did he include in his study?

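From the “less than” ogive, at $x = 29.5°C$ the cumulative frequency is $30 + \dfrac{29.5 - 24.5}{7} \times 16 \approx 41$. Therefore, the estimated number of cities with temperature $> 29.5°C$ is $55 - 41 = \mathbf{14}$. (A direct count of the raw data gives 12 cities; the ogive gives only an estimate because it treats the temperatures as evenly spread within each class.)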
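The readings in part (d) can be cross-checked by linear interpolation on the “less than” ogive. The sketch below assumes straight-line segments between the plotted points, so values read by eye from a smooth hand-drawn curve may differ slightly; `read_x` is a hypothetical helper, not a library function:

```r
boundary     <- c(10.5, 17.5, 24.5, 31.5, 38.5, 45.5)
less_than_cf <- c(0, 15, 30, 46, 53, 55)
N            <- 55

# Inverse reading of the ogive: temperature at a given cumulative frequency
read_x <- function(cf) approx(x = less_than_cf, y = boundary, xout = cf)$y

median_est <- read_x(N / 2)       # cf = 27.5
q1_est     <- read_x(N / 4)       # cf = 13.75
q3_est     <- read_x(3 * N / 4)   # cf = 41.25
p10_est    <- read_x(0.10 * N)    # cf = 5.5
p90_est    <- read_x(0.90 * N)    # cf = 49.5

# Direct reading of the ogive: cumulative frequency at 29.5 degrees
below_x <- approx(x = boundary, y = less_than_cf, xout = 29.5)$y
above_x <- N - below_x

print(round(c(median = median_est, Q1 = q1_est, Q3 = q3_est,
              P10 = p10_est, P90 = p90_est, above = above_x), 1))
```

Swapping the roles of `x` and `y` in `approx()` is what distinguishes reading a temperature off the curve from reading a cumulative frequency off it.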

2 Topic Two: Measures of Central Tendency

2.1 Objectives

By the end of the topic, the learner should be able to:

Define measure of central tendency and state the objectives of averaging.
Calculate arithmetic mean using different methods.
Compute combined mean for two or more data sets.
Calculate weighted average for a given data set.

2.2 Introduction

Even after the data have been classified and tabulated, one often finds too much detail for many of the uses that may be made of the information available. We, therefore, frequently need further analysis of the tabulated data. One of the powerful tools of analysis is to calculate a single average value that represents the entire mass of data. An “average” is a single value which is considered the most representative or typical value for a given set of data. Such a value is neither the smallest nor the largest value, but is a number whose value is somewhere in the middle of the group. For this reason an average is frequently referred to as a measure of central tendency or central value.

Definition: A measure of central tendency is a single value around which the data are scattered.

2.3 Objectives of Averaging

There are two main objectives of study of averages:

To get one single value that describes the characteristics of the entire data.

Measures of central value, by condensing the mass of data into one single value, enable us to get an idea of the entire data.

To facilitate comparison.

Measures of central value, by reducing the mass of data to one single value, enable comparisons to be made. Comparison can be made either at a point of time or over a period of time.

2.4 Characteristics of a Good Average

Since an average is a single value representing a group of values, it is desirable that such a value satisfies the following properties:

(i) It should be easy to understand.

Since statistical methods are designed to simplify complexity, it is desirable that an average be such that it can be readily understood; otherwise its use is bound to be very limited.

(ii) It should be simple to compute.

It should be simple to compute so that it can be used widely; however, simplicity should not be sought at the expense of other advantages.

(iii) It should be based on all observations.

The average should depend upon each and every observation so that if any of the observations is dropped, the average itself is altered.

(iv) It should be rigidly defined.

An average should be properly defined so that it has one and only one interpretation.

(v) It should be capable of further algebraic or statistical treatment/analysis.

We should prefer to have an average that could be used for further statistical computations.

(vi) It should have sampling stability.

We should prefer to get a value which has what statisticians call “sampling stability”, that is, it should be least affected by the fluctuations of sampling.

(vii) It should not be affected by the presence of extreme values.

Although each and every observation should influence the value of the average, none of the observations should influence it unduly.

In this course we will look at the following important measures of central tendency which are generally used in various fields e.g. business, education, etc:

Arithmetic mean
Median
Mode
Geometric mean
Harmonic mean

2.5 Arithmetic Mean

The most popular and widely used measure for representing the entire data by one value is what most laymen call an “average” and what statisticians call the arithmetic mean. Its value is obtained by adding together all the observations and by dividing this total by the number of observations.

(a) Calculation of Arithmetic Mean of Ungrouped Data Using Direct Method

Suppose we have $n$ observations: $x_1, x_2, x_3, \ldots, x_n$

$\Sigma$ (sigma) is the notation for sum. Thus,

\[\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \ldots + x_n\]

is the sum of all observations. The arithmetic mean is denoted by $\bar{x}$.

$\bar{x}$ of ungrouped data is given by:

\[\bar{x} = \frac{x_1 + x_2 + x_3 + \ldots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}\]

(b) Calculation of Arithmetic Mean of Grouped Data Using Direct Method

If the $x_i$’s occur with frequencies $f_1, f_2, f_3, \ldots, f_n$ respectively, i.e.

\[x_1 \rightarrow f_1, \quad x_2 \rightarrow f_2, \quad \ldots, \quad x_n \rightarrow f_n\]

Then the arithmetic mean is given by:

\[\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}\]

where $\sum_{i=1}^{n} f_i$ is the total number of observations.

(c) Properties of Arithmetic Mean

(i) Sum of Deviations from Mean is Zero

Proof:

Consider $n$ observations $x_1, x_2, x_3, \ldots, x_n$ with mean $\bar{x}$.

Let the deviations from the mean for each observation be:

\[x_1 - \bar{x} = d_1, \quad x_2 - \bar{x} = d_2, \quad x_3 - \bar{x} = d_3, \quad \ldots, \quad x_n - \bar{x} = d_n\]

Then sum of the deviations is:

\[d_1 + d_2 + \ldots + d_n = \sum_{i=1}^{n} d_i = \sum_{i=1}^{n}(x_i - \bar{x})\]

\[= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x}\]

But by definition $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} \implies \sum_{i=1}^{n} x_i = n\bar{x}$, and since $\bar{x}$ is a constant, $\sum_{i=1}^{n} \bar{x} = n\bar{x}$.

Thus:

\[\sum_{i=1}^{n} d_i = n\bar{x} - n\bar{x} = 0 \qquad \blacktriangle\]
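The property can be checked numerically. The five values below are made up purely for illustration and do not come from the text:

```r
# Illustrative (made-up) observations
x <- c(3, 7, 8, 12, 15)

x_bar   <- mean(x)     # arithmetic mean
d       <- x - x_bar   # deviations from the mean
sum_dev <- sum(d)      # should be zero, up to floating-point rounding

print(x_bar)
print(d)
print(sum_dev)
```

With data whose mean is not exactly representable in binary, `sum_dev` may come out as a tiny non-zero number, which is why a comparison such as `all.equal(sum_dev, 0)` is preferable to `sum_dev == 0`.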

Exercise: If the $x_i$’s occur with frequencies $f_1, f_2, f_3, \ldots, f_n$ respectively, show that the sum of the deviations from the arithmetic mean is zero.

(ii) Data Coding

Change of Origin

For a given set of data $x_1, x_2, x_3, \ldots, x_n$ with mean $\bar{x}$, if a constant value $a$ is added to or subtracted from each value in the set, the mean of the new data set is $\bar{x} + a$ or $\bar{x} - a$ respectively.

Change of Scale

For a given set of data $x_1, x_2, x_3, \ldots, x_n$ with mean $\bar{x}$, if each value in the set is multiplied or divided by a constant $a$ (with $a \neq 0$ for division), the mean of the new data set is $a\bar{x}$ or $\dfrac{\bar{x}}{a}$ respectively.

Illustration – Change of Origin

Adding a constant: $x_1 + a, \quad x_2 + a, \quad \ldots, \quad x_n + a$

Thus:

\[\text{new mean} = \frac{\sum(x_i + a)}{n} = \frac{\sum x_i}{n} + \frac{\sum a}{n} = \bar{x} + a\]

where $\bar{x} = \dfrac{\sum x_i}{n}$ and $\sum a = na$.

Subtracting a constant: $d_i = x_i - a$

Thus:

\[\text{new mean} = \frac{\sum(x_i - a)}{n} = \frac{\sum x_i}{n} - \frac{\sum a}{n} = \bar{x} - a\]

Therefore if $a$ is an assumed mean and $d_i = x_i - a$ (deviations from $a$):

\[\implies x_i = d_i + a, \quad \text{and} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Then:

\[\bar{x} = \frac{\sum_{i=1}^{n}(d_i + a)}{n} = \frac{\sum_{i=1}^{n} d_i}{n} + \frac{\sum_{i=1}^{n} a}{n} = \frac{\sum_{i=1}^{n} d_i}{n} + a\]

And for grouped data:

\[\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{\sum f_i(d_i + a)}{\sum f_i} = \frac{\sum f_i d_i + a\sum f_i}{\sum f_i} = \frac{\sum f_i d_i}{\sum f_i} + a\]

Therefore to calculate arithmetic mean using the assumed mean method we have:

\[\bar{x} = a + \frac{\sum d_i}{n} \quad \text{(ungrouped data)} \qquad \bar{x} = a + \frac{\sum f_i d_i}{\sum f_i} \quad \text{(grouped data)}\]

Illustration – Change of Scale

Assuming that all classes have the same class width $c$, each deviation $d_i = x_i - a$ can be divided by $c$ to get a value $u_i \left(u_i = \dfrac{d_i}{c}\right)$ where $u_i$ is positive, negative or zero such that $d_i = cu_i$. Then:

\[\bar{x} = a + \left(\frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i}\right)c\]

Proof:

If $c$ is the size of each class, then the class mid-points are equally spaced:

\[x_2 = x_1 + c, \quad x_3 = x_1 + 2c, \quad x_4 = x_1 + 3c, \quad \ldots \quad x_q = x_1 + (q-1)c \quad \ldots \quad x_p = x_1 + (p-1)c\]

This shows that the difference between any two mid-points is a multiple of $c$:

\[x_p - x_q = x_1 + (p-1)c - x_1 - (q-1)c = (p - q)c\]

Taking the assumed mean $a$ to be one of the mid-points, say $a = x_q$, each deviation $d_i = x_i - a$ is a multiple of $c$, hence deviations can be written as $d_i = cu_i$.

And therefore:

\[\bar{x} = a + \frac{\sum d_i}{n} = a + \frac{\sum cu_i}{n} = a + c\frac{\sum u_i}{n}\]

To calculate the arithmetic mean using the coding method we use:

\[\bar{x} = a + c\frac{\sum u_i}{n} \quad \text{(ungrouped data)} \qquad \bar{x} = a + c\frac{\sum f_i u_i}{\sum f_i} \quad \text{(grouped data)}\]

where $u_i = \dfrac{d_i}{c}$ and $d_i = x_i - a$.

Example 2.1

The winning scores in a certain golf tournament in the years from 2000 to 2009 were as follows:

\[284, \ 280, \ 277, \ 282, \ 279, \ 285, \ 281, \ 283, \ 278, \ 277\]

Find the arithmetic mean of these scores.

Solution:

i. Using direct method

By definition $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

\[\bar{x} = \frac{284 + 280 + 277 + 282 + 279 + 285 + 281 + 283 + 278 + 277}{10} = \frac{2806}{10} = 280.6\]

ii. Using assumed mean method (change of origin)

Rather than directly adding these values, we first subtract $a = 280$ from each one to obtain the new values $d_i = x_i - 280$:

\[d_i: \quad 4, \ 0, \ -3, \ 2, \ -1, \ 5, \ 1, \ 3, \ -2, \ -3 \quad \text{and} \quad \sum d_i = 6\]

By definition $\bar{x} = a + \dfrac{\sum d_i}{n} = 280 + \dfrac{6}{10} = \mathbf{280.6}$

iii. Using coding method (change of scale)

By definition $\bar{x} = a + c\dfrac{\sum u_i}{n}$

This is ungrouped data and therefore we choose an appropriate value of $c$ — either the g.c.d of the $d_i$’s or any other value (use a factor that will not result in recurring decimals).

Let $c = 5$, then $u_i = \dfrac{d_i}{5}$ results in:

\[0.8, \ 0, \ -0.6, \ 0.4, \ -0.2, \ 1, \ 0.2, \ 0.6, \ -0.4, \ -0.6\]

Hence: $\bar{x} = a + c\dfrac{\sum u_i}{n} = 280 + 5 \times \dfrac{1.2}{10} = \mathbf{280.6}$
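The three methods of Example 2.1 can be reproduced in a few lines of base R. This is a sketch for checking the arithmetic; the scale factor is called `k` here rather than $c$ because `c()` is already an R function:

```r
x <- c(284, 280, 277, 282, 279, 285, 281, 283, 278, 277)
n <- length(x)
a <- 280   # assumed mean
k <- 5     # scale factor (c in the text)

# i. Direct method
mean_direct <- sum(x) / n

# ii. Assumed mean method (change of origin)
d            <- x - a
mean_assumed <- a + sum(d) / n

# iii. Coding method (change of origin and scale)
u          <- d / k
mean_coded <- a + k * sum(u) / n

print(c(direct = mean_direct, assumed = mean_assumed, coded = mean_coded))
```

All three values agree, which illustrates that the choice of $a$ and $c$ affects only the size of the numbers handled, not the result.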

Example 2.2

The following is a frequency table giving the ages of members of a cultural club for young adults.

Table 2.1: Table 2.2: Ages of members of a cultural club
Age	Frequency
15	2
16	5
17	11
18	9
19	14
20	13

Find the arithmetic mean of the ages of the 54 members of the club.

Solution:

This data is ungrouped but has been placed in a simple frequency distribution table. Hence:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{15 \times 2 + 16 \times 5 + 17 \times 11 + 18 \times 9 + 19 \times 14 + 20 \times 13}{54} = \frac{985}{54} \approx 18.24\]

This is equivalent to writing the formula as:

\[\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}\]
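The equivalence of the two formulas can be seen by expanding the frequency table back into its 54 individual ages. A short sketch in base R (object names are illustrative):

```r
age  <- c(15, 16, 17, 18, 19, 20)
freq <- c(2, 5, 11, 9, 14, 13)

# Frequency-table formula: sum(f * x) / sum(f)
mean_from_table <- sum(freq * age) / sum(freq)

# Expand to one entry per member, then take the ordinary mean
all_ages      <- rep(age, times = freq)
mean_expanded <- mean(all_ages)

# Base R's weighted.mean() with the frequencies as weights gives the same value
mean_weighted <- weighted.mean(age, w = freq)

print(c(mean_from_table, mean_expanded, mean_weighted))
```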

Example 2.3

Calculate the arithmetic mean of the following data using the three methods.

Solution: Use $a = 75$, $c = 5$

Table 2.3: Table 2.4: Arithmetic mean computation – three methods (a = 75, c = 5)
Class	f	x	fx	d = x − a	fd	u = d/c	fu
53–57	2	55	110	-20	-40	-4	-8
58–62	12	60	720	-15	-180	-3	-36
63–67	12	65	780	-10	-120	-2	-24
68–72	25	70	1750	-5	-125	-1	-25
73–77	27	75	2025	0	0	0	0
78–82	10	80	800	5	50	1	10
83–87	9	85	765	10	90	2	18
88–92	3	90	270	15	45	3	9
	Σf = 100		Σfx = 7220		Σfd = −280		Σfu = −56

i. Direct method:

\[\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{7220}{100} = \mathbf{72.2}\]

ii. Assumed mean method:

\[\bar{x} = a + \frac{\sum fd}{\sum f} = 75 + \frac{-280}{100} = 75 - 2.8 = \mathbf{72.2}\]

iii. Coding method:

\[\bar{x} = a + c\frac{\sum fu}{\sum f} = 75 + 5 \times \frac{-56}{100} = 75 - 2.8 = \mathbf{72.2}\]
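The working table of Example 2.3 can be rebuilt column by column, which is a convenient way to check the column totals. A sketch in base R, with `k` standing in for the class width $c$:

```r
f <- c(2, 12, 12, 25, 27, 10, 9, 3)   # class frequencies
x <- seq(55, 90, by = 5)              # class mid-points
a <- 75                               # assumed mean
k <- 5                                # class width (c in the text)

d <- x - a
u <- d / k

work <- data.frame(f, x, fx = f * x, d, fd = f * d, u, fu = f * u)
print(work)
print(colSums(work[c("f", "fx", "fd", "fu")]))

mean_direct  <- sum(work$fx) / sum(f)
mean_assumed <- a + sum(work$fd) / sum(f)
mean_coded   <- a + k * sum(work$fu) / sum(f)
print(c(mean_direct, mean_assumed, mean_coded))
```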

2.5.1 Correcting Incorrect Values

It sometimes happens that due to an oversight or mistake in copying, certain wrong values are taken while calculating the mean. The process of correction is simple: from $\sum x$ deduct the wrong observations and add the correct observations, then divide the correct $\sum x$ by the number of observations.

Example 2.4

i. The average weekly wage for a group of 25 persons working in a factory was calculated to be $378.40. It was later discovered that one figure was misread as $160 instead of the correct value $200. Calculate the correct average wage.

Solution:

\[\text{Incorrect } \sum x = 378.40 \times 25 = 9{,}460\]

\[\text{Correct } \sum x = 9{,}460 - 160 + 200 = 9{,}500\]

\[\text{Correct mean} = \frac{9{,}500}{25} = \mathbf{\$380}\]

ii. The mean of 200 observations was 50. Later on, it was discovered that two observations were wrongly read as 92 and 8 instead of 192 and 88. Find out the correct mean.

Solution:

\[\text{Incorrect } \sum x = 50 \times 200 = 10{,}000\]

\[\text{Correct } \sum x = 10{,}000 - 92 - 8 + 192 + 88 = 10{,}180\]

\[\text{Correct mean} = \frac{10{,}180}{200} = \mathbf{50.9}\]
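The correction procedure can be wrapped in a small function. `correct_mean` below is a hypothetical helper written for this illustration, not a built-in R function:

```r
# Recompute a mean after replacing misread values with the correct ones
correct_mean <- function(wrong_mean, n, wrong_values, right_values) {
  wrong_total   <- wrong_mean * n   # incorrect sum of the observations
  correct_total <- wrong_total - sum(wrong_values) + sum(right_values)
  correct_total / n
}

# Example 2.4 (i): 160 was recorded instead of 200
wage_mean <- correct_mean(378.40, 25, wrong_values = 160, right_values = 200)

# Example 2.4 (ii): 92 and 8 were recorded instead of 192 and 88
obs_mean <- correct_mean(50, 200, wrong_values = c(92, 8),
                         right_values = c(192, 88))

print(wage_mean)
print(obs_mean)
```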

2.5.2 Exercise

The mean of seven numbers is seven. One number is removed and the mean increases to 10. Find the number which was removed.
The average weight of a group of 30 friends increases by 1 kg when the weight of their football coach was added. If the average weight of the group after including the weight of the football coach is 31 kg, what is the weight of their football coach?
The average wage of a worker during a fortnight comprising 15 consecutive working days was $90 per day. During the first 7 days, his average wage was $87 per day and the average wage during the last 7 days was $92 per day. What was his wage on the 8th day?
The average age of a group of 10 students was 20. The average age increased by 2 years when two new students joined the group. What is the average age of the two new students who joined the group?

2.5.3 Combined Mean

If we have the arithmetic mean and number of observations of two or more related groups, we can compute the combined average using the following formula:

\[\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}\]

where:

$\bar{X}_{12}$ = Combined mean of the two groups
$\bar{X}_1$ = Arithmetic mean of the first group
$\bar{X}_2$ = Arithmetic mean of the second group
$N_1$ = Number of observations in the first group
$N_2$ = Number of observations in the second group

Example 2.5

i. There are two branches of a company employing 100 and 80 employees respectively. If the arithmetic means of the monthly salaries paid by two branches are $4,570 and $6,750 respectively, find the arithmetic mean of the salaries of the employees of the company as a whole.

Solution:

\[\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2} = \frac{100 \times 4570 + 80 \times 6750}{100 + 80} = \frac{457000 + 540000}{180} = \frac{997000}{180} = \mathbf{\$5538.89}\]

If we have to find out the combined mean of three related groups, the above formula can be extended as: \[\bar{X}_{123} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2 + N_3\bar{X}_3}{N_1 + N_2 + N_3}\]

ii. The mean of marks in Statistics of 100 students of a class was 72. The mean of marks of boys was 75, while their number was 70. Find out the mean marks of girls in the class.

Solution:

We are given $N_1 + N_2 = 100$, $\bar{X}_{12} = 72$, mean of boys $\bar{X}_1 = 75$, number of boys $N_1 = 70$. We have to find out the mean marks of girls, i.e., $\bar{X}_2$.

\[72 = \frac{70 \times 75 + 30 \times \bar{X}_2}{100}\]

\[7200 = 5250 + 30\bar{X}_2 \implies \bar{X}_2 = \frac{7200 - 5250}{30} = \frac{1950}{30} = \mathbf{65}\]

Hence the mean marks of girls in the class = 65.

iii. The mean age of a combined group of men and women is 30 years. If the mean age of the group of men is 32 and that of the group of women is 25, find out the percentage of men and women in the group.

Solution:

Let $N_1$ represent the percentage of men and $N_2$ represent the percentage of women so that $N_1 + N_2 = 100$. We are given $\bar{X}_{12} = 30$, $\bar{X}_1 = 32$, $\bar{X}_2 = 25$.

\[30 = \frac{32N_1 + 25N_2}{N_1 + N_2} = \frac{32N_1 + 25(100 - N_1)}{100}\]

\[3000 = 32N_1 + 2500 - 25N_1 = 7N_1 + 2500\]

\[N_1 = \frac{500}{7} \approx 71.43\% \quad \text{(men)}, \qquad N_2 \approx 28.57\% \quad \text{(women)}\]

iv. A shopkeeper has 50 cold drink bottles. Some of the bottles are 1-litre and some are 2-litre bottles. The average volume of the bottles is 1200 ml. Find the number of 2-litre bottles. (1 litre = 1000 ml)

Solution:

Let the number of 1-litre bottles be $N_1$ and the number of 2-litre bottles be $N_2$. We know that $N_1 + N_2 = 50$. The average of group 1 ($\bar{X}_1$) is 1000 ml and the average of group 2 ($\bar{X}_2$) is 2000 ml. The weighted average is 1200 ml.

\[1200 = \frac{1000 N_1 + 2000 N_2}{N_1 + N_2} = \frac{1000 N_1 + 2000 N_2}{50}\]

\[60000 = 1000 N_1 + 2000 N_2\]

Since $N_1 + N_2 = 50 \implies N_1 = 50 - N_2$:

\[60000 = 1000(50 - N_2) + 2000 N_2 = 50000 + 1000 N_2\]

\[N_2 = 10 \quad \text{and} \quad N_1 = 40\]

Thus, the shopkeeper has ten 2-litre bottles.
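The combined-mean formula extends to any number of groups, which makes it natural to write as a vectorised function. `combined_mean` is a hypothetical helper for illustration; the second calculation simply rearranges the same formula to recover an unknown group mean, as in Example 2.5 (ii):

```r
# Combined mean of several groups: sum(N_i * mean_i) / sum(N_i)
combined_mean <- function(n, means) sum(n * means) / sum(n)

# Example 2.5 (i): two branches with 100 and 80 employees
salary_mean <- combined_mean(n = c(100, 80), means = c(4570, 6750))

# Example 2.5 (ii): solve the formula for the unknown second mean
n_total <- 100; mean_total <- 72
n_boys  <- 70;  mean_boys  <- 75
mean_girls <- (n_total * mean_total - n_boys * mean_boys) / (n_total - n_boys)

print(round(salary_mean, 2))
print(mean_girls)
```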

2.5.4 Weighted Arithmetic Mean

The arithmetic mean discussed above gives equal importance to all observations. But there are cases where the relative importance of the different observations is not the same. When this is so, we compute weighted arithmetic mean. The term “weight” stands for the relative importance of the different observations. The formula for computing weighted arithmetic mean is:

\[\bar{X}_w = \frac{\sum WX}{\sum W}\]

where $W$ represents the respective weights.

Example 2.6

A student’s final marks in Mathematics, Physics, English and Accounting are respectively 82, 86, 90, and 70. If the respective credits received for these courses are 3, 5, 3, and 1; determine the approximate average mark.

Solution:

Table 2.5: Table 2.6: Weighted arithmetic mean of student marks
Subject	Marks (X)	Weight (W)	WX
Mathematics	82	3	246
Physics	86	5	430
English	90	3	270
Accounting	70	1	70
Total		ΣW = 12	ΣWX = 1016

\[\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{1016}{12} = \mathbf{84.67}\]
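As a quick check of Example 2.6, the weighted mean can be computed in a few lines of Python. This is only a sketch; the helper name `weighted_mean` is an assumption made for illustration.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * X) / sum(W)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [82, 86, 90, 70]   # Mathematics, Physics, English, Accounting
credits = [3, 5, 3, 1]     # weights (course credits)

average_mark = weighted_mean(marks, credits)
print(round(average_mark, 2))
```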

2.5.5 Merits and Demerits of the Arithmetic Mean

Merits: Satisfies properties (i), (ii), (iii), (iv), (v), and (vi).

Demerits: Does not satisfy property (vii), i.e., it is affected by extreme observations.

2.6 Quartiles, Deciles and Percentiles

Objectives

By the end of the sub-topic, the learner should be able to:

Calculate and interpret quartiles and percentiles of given data sets.
Estimate the measures of location of given data sets graphically.
Determine the modal value of a given data set.

Median (Ungrouped Data)

Order the values of a data set of size $n$ from smallest to largest (in order of magnitude).

If $n$ is odd, the median is the value in position $\dfrac{n+1}{2}$
If $n$ is even, the median is the average of the values in positions $\dfrac{n}{2}$ and $\dfrac{n}{2} + 1$, i.e. it is the arithmetic mean of the two middle values.

Example 2.7

i. Find the median of: $1, 10, 7, 20, 5$

Solution:

Put the data in an array and arrange in ascending order: $1, 5, 7, 10, 20$

Since $n = 5$ is odd, the median is the value in position $\dfrac{5+1}{2} = 3$ of the ordered data:

\[\text{Median} = 7\]

ii. Find the median of the set of numbers: $21, 3, 7, 17, 19, 31, 46, 20$ and $43$.
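The positional rule for ungrouped data can be sketched in Python as follows. The function name `median_ungrouped` is an assumption for illustration; the standard library function `statistics.median` applies the same rule.

```python
def median_ungrouped(values):
    """Median of raw data: middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # positions n / 2 and n / 2 + 1


part_i = median_ungrouped([1, 10, 7, 20, 5])
part_ii = median_ungrouped([21, 3, 7, 17, 19, 31, 46, 20, 43])
print(part_i, part_ii)
```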

Median (Grouped Data)

The following formula is used:

\[\text{Median} = l_m + \frac{\left(\frac{N}{2} - c_f\right)}{f_m} \times c\]

where:

$l_m$ = Lower class boundary of the median class
$N = \sum f$ = total number of units
$c$ = Size of median class
$f_m$ = Frequency of median class
$c_f$ = Cumulative frequency of class preceding the median class

Example 2.9

Table 2.7: Table 2.8: Frequency distribution for median calculation
Class	f	$c_f$
53–57	2	2
58–62	12	14
63–67	12	26
68–72	25	51
73–77	27	78
78–82	10	88
83–87	9	97
88–92	3	100

Median class: $\dfrac{N}{2} = \dfrac{100}{2} = 50$, so the median class is 68–72.

\[\text{Median} = l_m + \frac{\left(\frac{N}{2} - c_f\right)}{f_m} \times c = 67.5 + \frac{(50 - 26) \times 5}{25} = 67.5 + 4.8 = \mathbf{72.3}\]

Lower Quartile ($Q_1$)

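The quartiles divide the distribution into four equal parts. The lower quartile is the value below which one quarter (25%) of the observations lie.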

Calculation of Lower Quartile – Grouped Data

Determine the particular class in which the value of the lower quartile lies. Use $\dfrac{N}{4}$ to locate the lower quartile class. Apply the following formula:

\[Q_1 = L + \frac{\left(\frac{N}{4} - \text{p.c.f.}\right)}{f} \times i\]

where:

$L$ = Lower limit of the lower quartile class
p.c.f. = Preceding cumulative frequency to the lower quartile class
$f$ = Frequency of the lower quartile class
$i$ = The class-interval of the lower quartile class

Upper Quartile ($Q_3$)

The upper quartile is the value below which three quarters (75%) of the observations lie.

Calculation of Upper Quartile – Grouped Data

Determine the particular class in which the value of the upper quartile lies. Use $\dfrac{3N}{4}$ to locate the upper quartile class. Apply the following formula:

\[Q_3 = L + \frac{\left(\frac{3N}{4} - \text{p.c.f.}\right)}{f} \times i\]

where:

$L$ = Lower limit of the upper quartile class
p.c.f. = Preceding cumulative frequency to the upper quartile class
$f$ = Frequency of the upper quartile class
$i$ = The class-interval of the upper quartile class

Example 2.8

The profits earned by 100 companies during 2010–2011 are given below:

Table 2.9: Table 2.10: Profits earned by 100 companies (2010–2011)
Profits ($)	No. of companies	Profits ($)	No. of companies
20 – 30	4	60 – 70	15
30 – 40	8	70 – 80	10
40 – 50	18	80 – 90	8
50 – 60	30	90 – 100	7

Solution:

Table 2.11: Table 2.12: Cumulative frequency table for quartile calculation
Profits ($)	No. of companies (f)	Cumulative frequency
20 – 30	4	4
30 – 40	8	12
40 – 50	18	30
50 – 60	30	60
60 – 70	15	75
70 – 80	10	85
80 – 90	8	93
90 – 100	7	100

Lower Quartile, $Q_1$ = size of $\dfrac{N}{4} = \dfrac{100}{4} = 25^{th}$ observation.

Hence $Q_1$ lies in the class 40 – 50. $L = 40$, p.c.f. $= 12$, $f = 18$, $i = 10$.

\[Q_1 = 40 + \frac{\left(25 - 12\right)}{18} \times 10 = 40 + \frac{130}{18} = 40 + 7.22 = \mathbf{\$47.22}\]

Hence 25% of the companies earn an annual profit of $47.22 or less.

Upper Quartile, $Q_3$ = size of $\dfrac{3N}{4} = \dfrac{3 \times 100}{4} = 75^{th}$ observation.

Hence $Q_3$ lies in the class 60 – 70. $L = 60$, p.c.f. $= 60$, $f = 15$, $i = 10$.

\[Q_3 = 60 + \frac{\left(75 - 60\right)}{15} \times 10 = 60 + \frac{150}{15} = 60 + 10 = \mathbf{\$70}\]

Hence 75% of the companies earn an annual profit of $70 or less.

These values, i.e., $Q_1$, median, and $Q_3$ can also be obtained from the Ogive curve.

In general, the $p^{th}$ percentile $X_p$ is the value of $x$ on the ogive corresponding to a cumulative frequency of $\dfrac{pN}{100}$.

Note:

The median is the 50th percentile value.

The lower quartile is the 25th percentile value.

The upper quartile is the 75th percentile value.

The formula for evaluating $X_p$ is:

\[X_p = L + \frac{\left(\frac{pN}{100} - \text{p.c.f.}\right)}{f} \times i\]
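The percentile formula covers the median ($p = 50$) and both quartiles ($p = 25$ and $p = 75$). The sketch below applies it to the profit distribution of the quartile example. It assumes equal class widths and that no class at the target position has zero frequency; the function name `grouped_percentile` is illustrative.

```python
def grouped_percentile(lower_limits, frequencies, width, p):
    """X_p = L + (pN/100 - p.c.f.) / f * i for a grouped frequency distribution."""
    n = sum(frequencies)
    target = p * n / 100
    preceding_cf = 0
    for lower, f in zip(lower_limits, frequencies):
        # The percentile class is the first class whose cumulative frequency reaches the target.
        if preceding_cf + f >= target:
            return lower + (target - preceding_cf) / f * width
        preceding_cf += f
    raise ValueError("p must lie between 0 and 100")


lower_limits = [20, 30, 40, 50, 60, 70, 80, 90]   # profit classes 20-30, ..., 90-100
frequencies = [4, 8, 18, 30, 15, 10, 8, 7]

q1 = grouped_percentile(lower_limits, frequencies, 10, 25)
median = grouped_percentile(lower_limits, frequencies, 10, 50)
q3 = grouped_percentile(lower_limits, frequencies, 10, 75)
print(round(q1, 2), round(median, 2), round(q3, 2))
```

For distributions with gaps between classes (for example 53–57, 58–62), the lower class boundaries (52.5, 57.5, ...) should be passed instead of the stated lower limits.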

2.7 Mode

It is the value with the highest frequency.

For ungrouped data e.g. $1, 2, 3, 4, 5, 5, 5$ the mode is 5.

Example 2.9 (Mean, Median, Mode and Range)

Find the mean, median, mode, and range for the following list of values: $13, 18, 13, 14, 13, 16, 14, 21, 13$

Solution:

The median is the middle value, so first rewrite the list in numerical order:

\[13, 13, 13, 13, 14, 14, 16, 18, 21\]

There are nine numbers in the list, so the middle one is the $\dfrac{9+1}{2} = 5^{th}$ number:

\[13, 13, 13, 13, \mathbf{14}, 14, 16, 18, 21\]

So the median is 14.

The mode is 13, since 13 is repeated 4 times.

The range $= 21 - 13 = 8$.

\[\text{Mean} = \frac{13+18+13+14+13+16+14+21+13}{9} = \frac{135}{9} = \mathbf{15}\]

Note: The mean, in this case, is not a value from the original list. This is a common result. You should not assume that your mean will be one of your original numbers.

Statistic	Value
Mean	15
Median	14
Mode	13
Range	8
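The four summary values can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "range": max(values) - min(values),
}
print(summary)
```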

Mode for Grouped Data

\[\text{Mode} = l_m + \left(\frac{\delta_1}{\delta_1 + \delta_2}\right)c\]

where:

$l_m$ = Lower class boundary of modal class
$\delta_1 = f_{\text{mode}} - f_1$ = Excess of the modal frequency over the frequency of the next lower class
$\delta_2 = f_{\text{mode}} - f_2$ = Excess of the modal frequency over the frequency of the next higher class
$c$ = Class size

Mode Calculation – Example

Table 2.13: Table 2.14: Frequency distribution for mode calculation
Class	Frequencies
58–62	12
63–67	12
68–72	25
73–77	27
78–82	10
83–87	9
88–92	3

The modal class is 73–77 (highest frequency = 27).

\[l_m = 72.5, \quad c = 5, \quad \delta_1 = 27 - 25 = 2, \quad \delta_2 = 27 - 10 = 17\]

\[\text{Mode} = l_m + \frac{\delta_1 \, c}{\delta_1 + \delta_2} = 72.5 + \frac{2 \times 5}{2 + 17} = 72.5 + \frac{10}{19} = 72.5 + 0.5263 = \mathbf{73.03 \text{ Units}}\]
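The grouped-mode formula can be sketched in Python as below. The function name `grouped_mode` is an assumption, and a neighbouring frequency of zero is assumed when the modal class is the first or last class.

```python
def grouped_mode(lower_boundaries, frequencies, width):
    """Mode = l_m + d1 / (d1 + d2) * c, using the class with the highest frequency."""
    k = max(range(len(frequencies)), key=frequencies.__getitem__)   # index of modal class
    f_lower = frequencies[k - 1] if k > 0 else 0                    # next lower class
    f_higher = frequencies[k + 1] if k < len(frequencies) - 1 else 0  # next higher class
    d1 = frequencies[k] - f_lower
    d2 = frequencies[k] - f_higher
    return lower_boundaries[k] + d1 / (d1 + d2) * width


lower_boundaries = [57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5]
frequencies = [12, 12, 25, 27, 10, 9, 3]

mode = grouped_mode(lower_boundaries, frequencies, 5)
print(round(mode, 2))
```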

2.8 Geometric Mean (GM)

In business and economic problems, very often we are faced with questions pertaining to percentage rates of change over time. Neither the mean, the median nor mode is appropriate in these instances. The correct average is obtained through the use of the geometric mean or, what amounts to the same thing, through the use of the familiar compound interest formula.

Geometric mean is defined as the $N$th root of the product of $N$ observations of a given data. Symbolically:

\[G = \sqrt[N]{x_1 \cdot x_2 \cdot x_3 \cdots x_N} = (x_1 \cdot x_2 \cdot x_3 \cdots x_N)^{1/N}\]

where $x_1, x_2, \ldots, x_N$ refer to the various observations of the data.

When the number of observations is three or more, logarithms are used to simplify calculations:

\[G = \text{Antilog}\left(\frac{\sum \log x}{N}\right)\]

Calculation of Geometric Mean – Ungrouped Data

\[G = \text{Antilog}\left(\frac{\sum \log x}{N}\right)\]

For grouped data, first find the midpoints and then apply:

\[G = \text{Antilog}\left(\frac{\sum f \log X}{\sum f}\right)\]

where $X$ is the midpoint.

Applications of Geometric Mean

Used to find the average per cent increase in sales, production, population, etc.
It is considered to be the best average in construction of index numbers.

Example 2.10

Compared to the previous year the overhead expenses went up by 32% in 2006; they increased by 40% in the next year and by 50% in the following year. Calculate the average rate of increase in the overhead expenses over the three years.

Solution:

When averaging ratios and percentages, the geometric mean is more appropriate.

Table 2.15: Table 2.16: Geometric mean – overhead expenses
% Rise	Expenses at end of year (X)	Log X
32	132	2.1206
40	140	2.1461
50	150	2.1761
Σ Log X = 6.4428

\[G = \text{Antilog}\left(\frac{6.4428}{3}\right) = \text{Antilog}(2.1476) = 140.5\]

Average rate of increase in overhead expenses $= 140.5 - 100 = \mathbf{40.5\%}$

Example 2.11

The annual rates of growth of output of a factory in 5 years are 5.0, 7.5, 2.5, 5.0, and 10.0 per cent respectively. What is the compound rate of growth of output per annum for the period?

Solution:

Table 2.17: Table 2.18: Geometric mean – annual rates of growth of factory output
Annual rate of growth	Output relatives at end of year (X)	Log X
5.0	105.0	2.0212
7.5	107.5	2.0314
2.5	102.5	2.0107
5.0	105.0	2.0212
10.0	110.0	2.0414
Σ Log X = 10.1259

\[G = \text{Antilog}\left(\frac{10.1259}{5}\right) = \text{Antilog}(2.0252) \approx 105.97\]

The compound rate of growth of output per annum $= 105.97 - 100 = \mathbf{5.97\%}$, i.e. approximately 6% per annum.
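The logarithm method used in the two geometric mean examples can be reproduced in Python, which avoids the rounding of four-figure log tables. This is a sketch; `geometric_mean` here is a hand-written helper, not a reference to any particular library function.

```python
import math


def geometric_mean(values):
    """G = antilog(sum(log x) / N), using base-10 logarithms as in the text."""
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log


overhead = geometric_mean([132, 140, 150])                  # expenses relative to 100
growth = geometric_mean([105.0, 107.5, 102.5, 105.0, 110.0])  # output relatives

print(round(overhead - 100, 1))   # average percentage rise in overheads
print(round(growth - 100, 2))     # compound growth rate per annum, in per cent
```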

2.9 Harmonic Mean (HM)

Harmonic mean is based on the reciprocal of the numbers averaged. It is defined as the reciprocal of the arithmetic mean of the reciprocals of the individual observations:

\[H = \frac{N}{\sum \dfrac{1}{x}} = \frac{N}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_N}}\]

where $x_1, x_2, \ldots, x_N$ refer to the various observations of the data.

For grouped data:

\[H = \frac{\sum f}{\sum \dfrac{f}{X}}\]

where $X$ is the midpoint of the various classes and $f$ their corresponding frequencies.

Applications of Harmonic Mean

Useful for computing the average:

Rate of increase of profits
Speed at which a journey has been performed
Price at which an article has been sold

Example 2.12

(a) Calculate harmonic mean of numbers 10, 20, 25, 40, 50.

(b) Calculate harmonic mean from the following frequency distribution:

Table 2.19: Table 2.20: Frequency distribution for harmonic mean
Marks	No. of students
0 – 10	8
10 – 20	15
20 – 30	20
30 – 40	4
40 – 50	3

Solution:

(a)

Table 2.21: Table 2.22: Harmonic mean – ungrouped data
X	1/X
10	0.1
20	0.05
25	0.04
40	0.025
50	0.02
	Σ(1/X) = 0.235

\[H = \frac{N}{\sum \frac{1}{X}} = \frac{5}{0.235} = \mathbf{21.28}\]

(b)

Table 2.23: Table 2.24: Harmonic mean – grouped data
Marks	X	f	f/X
0 – 10	5	8	1.6
10 – 20	15	15	1
20 – 30	25	20	0.8
30 – 40	35	4	0.114
40 – 50	45	3	0.067
		Σf = 50	Σ(f/X) = 3.581

\[H = \frac{\sum f}{\sum \frac{f}{X}} = \frac{50}{3.581} = \mathbf{13.96}\]
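Both parts of Example 2.12 follow the same pattern, since ungrouped data is just the case where every frequency is 1. A minimal Python sketch, with `harmonic_mean` as an illustrative helper name:

```python
def harmonic_mean(values, frequencies=None):
    """H = sum(f) / sum(f / x); with no frequencies given, every f is taken as 1."""
    if frequencies is None:
        frequencies = [1] * len(values)
    return sum(frequencies) / sum(f / x for x, f in zip(values, frequencies))


h_ungrouped = harmonic_mean([10, 20, 25, 40, 50])

midpoints = [5, 15, 25, 35, 45]     # class midpoints of 0-10, ..., 40-50
students = [8, 15, 20, 4, 3]
h_grouped = harmonic_mean(midpoints, students)

print(round(h_ungrouped, 2), round(h_grouped, 2))
```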

3 Topic Three: Measures of Dispersion

3.1 Objectives

By the end of the topic, the learner should be able to:

Define measure of variation and explain its importance.
Distinguish between absolute measure of variation and relative measure of variation.
Calculate and interpret Range, Average Deviation, Quartile Deviation.

3.2 Introduction

The various measures of central tendency discussed in the previous chapter give us a single value that represents the entire data. But the average alone cannot adequately describe a set of observations unless all observations are alike. It is necessary to describe the variability or dispersion of the observations. Measures of variation help us study an important characteristic of a distribution: the extent to which the observations vary from one another and from some average value. The degree to which numerical data tend to spread about an average is called the variation or dispersion of the data.

3.3 Significance of Measuring Variation

Measures of variation are needed for four basic purposes:

(i) To determine the reliability of an average

Measures of variation point out as to how far an average is representative of the entire data. When variation is small, the average is a typical value in the sense that it closely represents the individual value and it is reliable in the sense that it is a good estimate of the average in the corresponding universe. On the other hand, when variation is large, the average is not so typical, and unless the sample is very large, the average may be quite unreliable.

(ii) To serve as a basis for the control of variability

Another purpose of measuring variation is to determine nature and cause of variation in order to control the variation itself.

(iii) To compare two or more series with regard to their variability

It may also be looked upon as a means of determining uniformity or consistency. A high degree of variation would mean little uniformity or consistency whereas a low degree of variation would mean greater uniformity or consistency.

(iv) To facilitate the use of other statistical measures

Many powerful analytical tools in statistics such as correlation analysis, the testing of hypothesis, the analysis of fluctuations, etc., are based on measures of variation of one kind or another.

3.4 Properties of a Good Measure of Variation

A good measure of variation should possess, as far as possible, the following properties:

It should be easy to understand.
It should be simple to compute.
It should be based on all observations.
It should be rigidly defined.
It should be capable of further algebraic treatment.
It should have sampling stability.
It should not be affected by the presence of extreme values.

The following are the important methods of studying variation:

The range
The Interquartile Range or Quartile Deviation
The Average Deviation
The Standard Deviation
The Lorenz Curve

Of these, the first four are mathematical methods and the last is a graphical one.

3.5 Absolute and Relative Measures of Variation

Measures of variation may be either absolute or relative. Absolute measures of variation are expressed in the same statistical unit in which the original data are given such as kilograms, dollars, tonnes, etc. These values may be used to compare variation in two or more than two distributions provided the variables are expressed in the same units and have almost the same average value. In case the two sets of data are expressed in different units, such as quintals of sugar versus tonnes of sugarcane, or if the average value is very much different, such as manager’s salary versus worker’s salary, the absolute measures of variation are not comparable. In such cases measures of relative variation should be used. A measure of relative variation is the ratio of a measure of absolute variation to an average. It is sometimes called a coefficient of variation, because “coefficient” means a pure number that is independent of the unit of measurement. It should be remembered that while computing the relative variation the average used as base should be the same one from which the absolute deviations were measured.

3.6 Range

Range is the simplest method of studying variation. It is defined as the difference between the value of the smallest observation and the value of the largest observation. Symbolically:

\[\text{Range} = L - S\]

where $L$ = Largest value, and $S$ = Smallest value.

The relative measure corresponding to range, called the coefficient of range, is obtained by applying the following formula:

\[\text{Coefficient of Range} = \frac{L - S}{L + S}\]

Example 3.1

The following are the prices of shares of a company from Monday to Saturday. Calculate the range and the coefficient of range.

Table 3.1: Table 3.2: Prices of shares of a company (Monday to Saturday)
Day	Price ($)	Day	Price ($)
Monday	200	Thursday	160
Tuesday	210	Friday	220
Wednesday	208	Saturday	250

Solution

\[\text{Range} = L - S = 250 - 160 = 90\]

\[\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{250 - 160}{250 + 160} = \frac{90}{410} = 0.2195\]

In a frequency distribution, range is calculated by taking the difference between the lower limit of the lowest class and the upper limit of the highest class.

Example 3.2

Calculate the coefficient of range from the following data:

Table 3.3: Table 3.4: Profits and number of companies
Profits ($)	No. of Companies	Profits ($)	No. of Companies
10 – 20	8	40 – 50	8
20 – 30	10	50 – 60	4
30 – 40	12

Solution

\[\text{Range} = L - S = 60 - 10 = 50\]

\[\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{60 - 10}{60 + 10} = \frac{50}{70} = 0.714\]
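Examples 3.1 and 3.2 can be verified with a short Python sketch. The helper name `range_and_coefficient` is an assumption for illustration.

```python
def range_and_coefficient(largest, smallest):
    """Return (range, coefficient of range) = (L - S, (L - S) / (L + S))."""
    spread = largest - smallest
    return spread, spread / (largest + smallest)


# Example 3.1: daily share prices
prices = [200, 210, 208, 160, 220, 250]
price_range, price_coeff = range_and_coefficient(max(prices), min(prices))

# Example 3.2: grouped data, using the extreme class limits 10 and 60
profit_range, profit_coeff = range_and_coefficient(60, 10)

print(price_range, round(price_coeff, 4))
print(profit_range, round(profit_coeff, 3))
```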

3.7 Average Deviation

Average deviation is obtained by calculating the absolute deviations of each observation from the median (or mean), and then averaging these deviations by taking their arithmetic mean. The formula for average deviation may be written as:

Mean Absolute Deviation (M.A.D):

\[\text{M.A.D} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \quad \text{(average deviation from mean)}\]

\[\text{M.A.D} = \frac{\sum f_i|x_i - \bar{x}|}{\sum f_i} \quad \text{(grouped data)}\]

In case deviations are taken from median the formula shall be written as:

\[\text{Median A.D} = \frac{\sum_{i=1}^{n} |x_i - \text{median}|}{n} \quad \text{(average deviation from median)}\]

\[\text{Median A.D} = \frac{\sum f_i|x_i - \text{median}|}{\sum f_i} \quad \text{(grouped data)}\]

The reason for taking absolute deviations, that is, deviations in which signs are ignored, is that it is the amount of the differences of observations from median rather than the direction of the difference which is of main interest.

The relative measure corresponding to the average deviation, called the coefficient of average deviation, is obtained by dividing average deviation by the particular average used in computing average deviation.

If median has been used while calculating average deviation, then:

\[\text{Coefficient of Average Deviation} = \frac{\text{Average Deviation}}{\text{Median}}\]

If mean has been used while calculating average deviation, then:

\[\text{Coefficient of Average Deviation} = \frac{\text{Average Deviation}}{\text{Mean}}\]
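No worked example accompanies the average deviation formulas, so the sketch below applies them to the share prices of Example 3.1. Reusing that data is an illustrative choice, and `average_deviation` is an assumed helper name.

```python
import statistics


def average_deviation(values, centre):
    """Mean of the absolute deviations of the values from the chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)


prices = [200, 210, 208, 160, 220, 250]
mean = statistics.mean(prices)
median = statistics.median(prices)

ad_mean = average_deviation(prices, mean)        # deviations taken from the mean
ad_median = average_deviation(prices, median)    # deviations taken from the median

# Each coefficient divides by the same average used for the deviations.
coeff_mean = ad_mean / mean
coeff_median = ad_median / median

print(round(ad_mean, 2), round(coeff_mean, 4))
print(round(ad_median, 2), round(coeff_median, 4))
```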

3.7.1 Merits and Limitations of Average Deviation

Merits

It is simple to understand and easy to compute.
It is based on each and every observation of the data.
It is less affected by the values of extreme observations.
Since deviations are taken from a central value, comparisons of the formation of different distributions can easily be made.

Limitations

The greatest drawback of this method is that algebraic signs are ignored while taking deviations of the items.
This method may not give us very accurate results.
It is not capable of further algebraic treatment.
It is rarely used in sociological and business studies.

Because of these limitations its use is limited and it is overshadowed as a measure of variation by the superior standard deviation.

3.8 Semi Interquartile Range (Quartile Deviation)

The interquartile range is the range which includes the middle 50% of the observations. That is, one quartile of the observations at the lower end and another quartile of the observations at the upper end of the distribution are excluded in computing the inter-quartile range. In other words, inter-quartile range represents the difference between the upper quartile and the lower quartile. Symbolically:

\[\text{Inter-quartile range} = Q_3 - Q_1\]

Very often the inter-quartile range is reduced to the form of semi-interquartile range or quartile deviation by dividing it by 2. Symbolically:

\[\text{Quartile Deviation (Q.D)} = \frac{Q_3 - Q_1}{2} = \text{semi-interquartile range}\]

Quartile deviation is an absolute measure of variation. The relative measure corresponding to this measure, called the coefficient of quartile deviation, is calculated as follows:

\[\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1}\]

Coefficient of quartile deviation can be used to compare the degree of variation in different distributions.
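As an illustration, the quartiles $Q_1 = 47.22$ and $Q_3 = 70$ obtained earlier for the profits of 100 companies give the following measures. This is a sketch with assumed variable names.

```python
q1, q3 = 47.22, 70.0   # quartiles of the company profits distribution (in $)

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2     # semi-interquartile range
coefficient_qd = (q3 - q1) / (q3 + q1)           # relative measure, unit-free

print(round(interquartile_range, 2))
print(round(quartile_deviation, 2))
print(round(coefficient_qd, 3))
```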

3.8.1 Merits and Limitations of Quartile Deviation

Merits

In certain respects it is superior to the range as a measure of variation.
It has a special utility in measuring variation in case of open-end distributions or one in which the data may be ranked but not measured quantitatively.
It is also useful in erratic or highly skewed distributions, where the other measures of variation would be warped by extreme values.
It is not affected by the presence of extreme values.

Limitations

Quartile deviation ignores 50% of the items, i.e., the first 25% and the last 25%. As the value of quartile deviation does not depend upon every observation it cannot be regarded as a good method of measuring variation.
It is not capable of mathematical manipulation.
Its value is very much affected by sampling fluctuations.

3.9 Standard Deviation

3.9.1 Objectives

By the end of the topic, the learner should be able to:

Calculate and interpret standard deviation.
Calculate and interpret coefficient of variation for given data sets.
Calculate combined standard deviation for two or more data sets.

3.9.2 Introduction

The standard deviation is by far the most important and widely used measure of variation. Its significance lies in the fact that it is free from the defects of the earlier methods and satisfies most of the properties of a good measure of variation. It is a measure of how much “spread” or “variability” is present in the sample. If all the numbers in the sample are very close to each other, the standard deviation is close to zero. If the numbers are well dispersed, the standard deviation will tend to be large. Standard deviation is also known as the root mean square deviation because it is the square root of the mean of the squared deviations from the arithmetic mean. Standard deviation is denoted by the small Greek letter $\sigma$ (read as sigma) and is defined as:

\[\sigma = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n}}\]

If we square standard deviation, we get what is called Variance.

\[\text{Variance} = \sigma^2 = \frac{\sum(x_i - \bar{x})^2}{n}\]

The standard deviation measures the absolute variation of a distribution; the greater the amount of variation, the greater the standard deviation. A small standard deviation means a high degree of uniformity of the observations as well as homogeneity of a series; a large standard deviation means a low degree of uniformity of the observations as well as homogeneity of a series. Thus if we have two or more comparable series with identical means, it is the distribution with the smallest standard deviation that has the most representative mean.

3.9.3 Calculation Using Direct Method

\[S = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n}}\]

\[= \sqrt{\frac{\sum f(x_i - \bar{x})^2}{\sum f}}\]

The square of standard deviation is variance:

\[S^2 = \frac{\sum(x_i - \bar{x})^2}{n}\]

\[S^2 = \frac{\sum f(x_i - \bar{x})^2}{\sum f}\]

Proof that $\dfrac{\sum(x_i - \bar{x})^2}{n} = \dfrac{\sum x_i^2}{n} - \left(\dfrac{\sum x_i}{n}\right)^2$

\[\frac{\sum(x_i - \bar{x})^2}{n} = \frac{\sum(x_i^2 - 2x_i\bar{x} + \bar{x}^2)}{n}\]

\[= \frac{\sum x_i^2}{n} - 2\bar{x}\frac{\sum x_i}{n} + \frac{n\bar{x}^2}{n}\]

\[= \frac{\sum x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2\]

\[= \frac{\sum x_i^2}{n} - \bar{x}^2\]

\[= \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2\]

Therefore:

\[S^2 = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2\]

3.9.4 Calculation Using Assumed Mean Method

If $d_i = x_i - a$ where $a$ is a constant, then:

\[S^2 = \frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2\]

Proof

\[S^2 = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2 \quad \text{where } d_i = x_i - a,\ x_i = d_i + a\]

\[= \frac{\sum f(d_i + a)^2}{\sum f} - \left(\frac{\sum f(d_i + a)}{\sum f}\right)^2\]

\[= \frac{\sum f(d_i^2 + 2d_ia + a^2)}{\sum f} - \left(\frac{\sum fd_i + \sum fa}{\sum f}\right)^2\]

\[= \frac{\sum fd_i^2}{\sum f} + \frac{2a\sum fd_i}{\sum f} + \frac{\sum fa^2}{\sum f} - \left(\frac{\sum fd_i}{\sum f} + \frac{\sum fa}{\sum f}\right)^2\]

\[= \frac{\sum fd_i^2}{\sum f} + \frac{2a\sum fd_i}{\sum f} + a^2 - \left(\frac{\sum fd_i}{\sum f} + a\right)^2\]

\[= \frac{\sum fd_i^2}{\sum f} + 2a\frac{\sum fd_i}{\sum f} + a^2 - \left(\frac{\sum fd_i}{\sum f}\right)^2 - 2a\frac{\sum fd_i}{\sum f} - a^2\]

\[= \frac{\sum fd_i^2}{\sum f} - \left(\frac{\sum fd_i}{\sum f}\right)^2 = S^2\]

3.9.5 Calculation Using Coding Method

If $d_i = x_i - a = cu_i$, then:

\[S^2 = c^2\left\{\frac{\sum fu_i^2}{\sum f} - \left(\frac{\sum fu_i}{\sum f}\right)^2\right\}\]

Proof

\[S^2 = \frac{\sum fd_i^2}{\sum f} - \left(\frac{\sum fd_i}{\sum f}\right)^2\]

\[= \frac{\sum f(cu_i)^2}{\sum f} - \left(\frac{\sum f\,cu_i}{\sum f}\right)^2\]

\[= c^2\frac{\sum fu_i^2}{\sum f} - c^2\left(\frac{\sum fu_i}{\sum f}\right)^2\]

\[S^2 = c^2\left[\frac{\sum fu_i^2}{\sum f} - \left(\frac{\sum fu_i}{\sum f}\right)^2\right]\]

Example

Table 3.5: Table 3.6: Calculation of standard deviation using three methods
Mass (kg)	f	x	fx	fx²	d	fd	fd²	$u_i$	fu	fu²
53–57	2	55	110	6050	-20	-40	800	-4	-8	32
58–62	12	60	720	43200	-15	-180	2700	-3	-36	108
63–67	12	65	780	50700	-10	-120	1200	-2	-24	48
68–72	25	70	1750	122500	-5	-125	625	-1	-25	25
73–77	27	75	2025	151875	0	0	0	0	0	0
78–82	10	80	800	64000	5	50	250	1	10	10
83–87	9	85	765	65025	10	90	900	2	18	36
88–92	2	90	180	16200	15	30	450	3	6	18
Total	99		7130	519550		-295	6925		-59	277

Using Direct Method:

\[S^2 = \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2 = \frac{519550}{99} - \left(\frac{7130}{99}\right)^2 = 5247.9798 - 5186.9095 = 61.0703\]

Using Assumed Mean Method ($a = 75$):

\[S^2 = \frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2 = \frac{6925}{99} - \left(\frac{-295}{99}\right)^2 = 69.9495 - 8.8792 = 61.0703\]

Using Coding Method ($c = 5$):

\[S^2 = c^2\left[\frac{\sum fu_i^2}{\sum f} - \left(\frac{\sum fu_i}{\sum f}\right)^2\right] = 25\left[\frac{277}{99} - \left(\frac{-59}{99}\right)^2\right] = 25(2.7980 - 0.3552) = 61.0703\]
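All three methods are the same computation carried out about a different origin and scale, which the Python sketch below makes explicit. The function name `variance_about` is an assumption; the data are the midpoints and frequencies from the table above.

```python
midpoints = [55, 60, 65, 70, 75, 80, 85, 90]
frequencies = [2, 12, 12, 25, 27, 10, 9, 2]
n = sum(frequencies)


def variance_about(origin, scale=1):
    """S^2 = c^2 * (sum(f u^2)/sum(f) - (sum(f u)/sum(f))^2) with u = (x - origin) / scale."""
    u = [(x - origin) / scale for x in midpoints]
    mean_u = sum(f * ui for f, ui in zip(frequencies, u)) / n
    mean_u_squared = sum(f * ui ** 2 for f, ui in zip(frequencies, u)) / n
    return scale ** 2 * (mean_u_squared - mean_u ** 2)


direct = variance_about(0)        # direct method: u = x
assumed = variance_about(75)      # assumed mean method: d = x - 75
coded = variance_about(75, 5)     # coding method: u = (x - 75) / 5

print(round(direct, 4), round(assumed, 4), round(coded, 4))
print(round(direct ** 0.5, 4))    # standard deviation
```

Shifting the origin and rescaling leave the result unchanged; the assumed mean and coding methods simply keep the intermediate numbers small for hand calculation.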

3.9.6 Pooled Mean and Variance

The following is a summary of statistics of 2 samples:

Table 3.7: Table 3.8: Summary statistics of 2 samples
	Mean	Variance	Size
Sample 1	$55\ (\bar{X}_1)$	$100\ (s_1^2)$	$100\ (n_1)$
Sample 2	$50\ (\bar{X}_2)$	$64\ (s_2^2)$	$150\ (n_2)$

Determine mean and variance of combined sample.

Combined mean:

\[\bar{X} = \frac{\text{total of sample 1} + \text{total of sample 2}}{\text{total size}} = \frac{(100 \times 55) + (150 \times 50)}{100 + 150} = \frac{13000}{250} = 52\]

For combined samples of sizes $n_1$ and $n_2$ with means $\bar{X}_1$ and $\bar{X}_2$ and variances $s_1^2$ and $s_2^2$:

\[\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}\]

Combined Variance:

For first sample, $s_1^2 = 100$:

\[s_1^2 = \frac{\sum fx^2}{n} - \bar{X}^2 \implies 100 = \frac{\sum fx^2}{100} - 55^2 \implies \sum fx^2 = 312500\]

For second sample, $s_2^2 = 64$:

\[s_2^2 = \frac{\sum fy^2}{n} - \bar{Y}^2 \implies 64 = \frac{\sum fy^2}{150} - 50^2 \implies \sum fy^2 = 384600\]

\[\text{Combined variance} = \frac{\sum fx^2 + \sum fy^2}{n_1 + n_2} - (\text{combined mean})^2\]

\[= \frac{312500 + 384600}{100 + 150} - (52)^2 = 2788.4 - 2704 = 84.4\]

In general:

If the two samples have equal means, then:

\[\textbf{Pooled variance} = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}\]

If means are not equal, then:

\[\textbf{Pooled variance} = \frac{(s_1^2 + \bar{X}_1^2)n_1 + (s_2^2 + \bar{X}_2^2)n_2}{n_1 + n_2} - m^2\]

where $m$ is the combined mean of the two samples.
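The general formula can be sketched in Python using the figures that appear in the working above ($n_1 = 100$, $\bar{X}_1 = 55$, $s_1^2 = 100$; $n_2 = 150$, $\bar{X}_2 = 50$, $s_2^2 = 64$). The function name `combine_samples` is an assumption for illustration.

```python
def combine_samples(n1, mean1, var1, n2, mean2, var2):
    """Combined mean and variance of two samples from their summary statistics."""
    n = n1 + n2
    combined_mean = (n1 * mean1 + n2 * mean2) / n
    # Each sample's sum of squares is n * (variance + mean^2).
    sum_of_squares = n1 * (var1 + mean1 ** 2) + n2 * (var2 + mean2 ** 2)
    combined_variance = sum_of_squares / n - combined_mean ** 2
    return combined_mean, combined_variance


combined_mean, combined_variance = combine_samples(100, 55, 100, 150, 50, 64)
print(combined_mean, round(combined_variance, 1))
```

When the two means are equal, the same function reduces to the simpler weighted average of the two variances.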

3.9.7 Merits and Limitations of Standard Deviation

Merits

It is based on each and every observation of the data.
It is amenable to further algebraic treatment.
It is less affected by fluctuations of sampling than most other measures of variation.
It is possible to calculate the combined standard deviation of two or more groups. This is not possible with any other measure.
For comparing the variability of two or more distributions, coefficient of variation is considered to be most appropriate and this measure is based on mean and standard deviation.
Standard deviation is most prominently used in further statistical work.

Limitations

As compared to other measures it is difficult to compute.
It gives more weight to extreme values and less to those which are near the mean.

3.9.8 Coefficient of Variation

The standard deviation discussed so far is an absolute measure of variation. The corresponding relative measure is known as the coefficient of variation denoted by C.V. This measure developed by Karl Pearson is the most commonly used measure of relative variation. It is used in such problems where we want to compare the variability of two or more than two series. That series (or group) for which the coefficient of variation is greater is said to be more variable or conversely less consistent, less uniform, less stable or less homogeneous. On the other hand, the series (or group) for which the coefficient of variation is less is said to be less variable or conversely more consistent, more uniform, more stable or more homogeneous.

Note:

Standard deviation of an exponential distribution = its mean, so its C.V = 1.

Distributions with C.V $< 1$ tend to have low variance.

Distributions with C.V $> 1$ tend to have high variance.

The coefficient of variation is the principal measure of relative dispersion. It is the ratio of the standard deviation $\sigma$ to the mean $\mu$, expressed as a percentage of the mean:

\[C.V = \frac{S}{\bar{X}} \times 100\%\]

Coefficient of Variation is more useful when the two distributions are entirely different and the units of measurement are also different.

Advantages

The co-efficient of variation is useful because the standard deviation of data must always be understood in the context of the mean of the data.
The co-efficient of variation is a dimensionless number. So when comparing between data sets with different units with widely different means one should use co-efficient of variation for comparison.

Disadvantages

When the mean value is near zero, the co-efficient of variation is sensitive to small changes in the mean, limiting its usefulness.
Unlike standard deviation it cannot be used to construct confidence intervals for the mean.

Example

Calculate the coefficient of variation for each of the following populations:

Table 3.9: Table 3.10: Values for Population A and Population B
Population	Value 1	Value 2	Value 3	Value 4	Value 5
Population A	2	7	3	2	1
Population B	8	3	12	1	6

Solution:

Population A:

\[S = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{67}{5} - \left(\frac{15}{5}\right)^2} = \sqrt{4.4}\]

\[C.V = \frac{S}{\bar{X}} = \frac{\sqrt{4.4}}{3} = 0.6992 \cong 69.92\%\]

Population B:

\[S = \sqrt{\frac{\sum y^2}{n} - \left(\frac{\sum y}{n}\right)^2} = \sqrt{\frac{254}{5} - \left(\frac{30}{5}\right)^2} = \sqrt{14.8}\]

\[C.V = \frac{S}{\bar{X}} = \frac{\sqrt{14.8}}{6} = 0.6412 \cong 64.12\%\]

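Since C.V $< 1$ for both populations, each has relatively low variance. Population A has the larger C.V, so it is relatively more variable than Population B.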
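The two coefficients can be checked with a short Python sketch that uses the population form of the standard deviation (divisor $n$), as in the solution. The helper name `coefficient_of_variation` is an assumption.

```python
import math


def coefficient_of_variation(values):
    """C.V = S / mean * 100, with S computed using divisor n."""
    n = len(values)
    mean = sum(values) / n
    variance = sum(x ** 2 for x in values) / n - mean ** 2
    return math.sqrt(variance) / mean * 100


cv_a = coefficient_of_variation([2, 7, 3, 2, 1])     # Population A
cv_b = coefficient_of_variation([8, 3, 12, 1, 6])    # Population B
print(round(cv_a, 2), round(cv_b, 2))
```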

Combined Standard Deviation (Pooled Mean and Variance)

Example 3.5

The following is a summary of statistics of 2 samples:

Table 3.11: Table 3.12: Summary statistics – Example 3.5
	Mean	Variance	Size
Sample 1	$55\ (\bar{X}_1)$	$100\ (s_1^2)$	$100\ (n_1)$
Sample 2	$50\ (\bar{X}_2)$	$64\ (s_2^2)$	$150\ (n_2)$

Determine mean and variance of combined sample.

Combined mean:

\[\bar{X} = \frac{\text{total of sample 1} + \text{total of sample 2}}{\text{total size}} = \frac{(100 \times 55) + (50 \times 150)}{100 + 150} = 52\]

For combined samples of sizes $n_1$ and $n_2$ with means $\bar{X}_1$ and $\bar{X}_2$ and variances $s_1^2$ and $s_2^2$:

\[\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}\]

Combined Variance:

For first sample, $s_1^2 = 100$:

\[s_1^2 = \frac{\sum fx^2}{n} - \bar{X}^2 \implies 100 = \frac{\sum fx^2}{100} - 55^2 \implies \sum fx^2 = 312500\]

For second sample, $s_2^2 = 64$:

\[s_2^2 = \frac{\sum fy^2}{n} - \bar{Y}^2 \implies 64 = \frac{\sum fy^2}{150} - 50^2 \implies \sum fy^2 = 384600\]

\[\text{Combined variance} = \frac{\sum fx^2 + \sum fy^2}{n_1 + n_2} - (\text{combined mean})^2\]

\[= \frac{312500 + 384600}{100 + 150} - (52)^2 = 2788.4 - 2704 = 84.4\]

In general:

If the two samples have equal means, then:

\[\textbf{Pooled variance} = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}\]

If the means are not equal, then, with $m$ denoting the combined mean:

\[\textbf{Pooled variance} = \frac{(s_1^2 + \bar{X}_1^2)n_1 + (s_2^2 + \bar{X}_2^2)n_2}{n_1 + n_2} - m^2\]
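As a rough check of the general formula, the sketch below combines two samples from their sizes, means and variances. The function name `combine` is hypothetical, and the inputs are the figures used in the worked solution of Example 3.5 ($n_1 = 100$, $n_2 = 150$, $s_2^2 = 64$).

```python
def combine(n1, mean1, var1, n2, mean2, var2):
    """Combine two samples' (size, mean, variance) into one mean and variance."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Recover each sample's sum of squares from var = (sum of squares)/n - mean^2.
    sum_sq = n1 * (var1 + mean1 ** 2) + n2 * (var2 + mean2 ** 2)
    variance = sum_sq / n - mean ** 2
    return mean, variance

combined_mean, combined_var = combine(100, 55, 100, 150, 50, 64)
print(combined_mean, round(combined_var, 2))
```

When the two means are equal, the same function reduces to the weighted average of the two variances.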


4 Topic Four: Moments

4.1 Objectives

By the end of the topic, the learner should be able to:

Distinguish between the different types of moments.
Express central moments in terms of moments about an arbitrary point.
Compute moments for both ungrouped and grouped data.

4.2 Introduction

Moments are widely used to describe the characteristics of a distribution. The Greek letter $\mu$ (read as mu) is generally used to denote moments.

4.3 Raw Moments

Suppose a variable $X$ takes the values $x_1, x_2, x_3, \ldots, x_m$. Then:

\[\bar{X}^n = \frac{x_1^n + x_2^n + \ldots + x_m^n}{m} = \frac{\sum_{i=1}^{m} x_i^n}{m}\]

is called the $n^{th}$ moment about the origin (raw moment). Here $\bar{X}^n$ denotes the mean of the $n^{th}$ powers, not the $n^{th}$ power of the mean.

Example 4.1

$X = 10, 11, 12$

$\bar{X}^2$ is called $2^{nd}$ moment about the origin:

\[\bar{X}^2 = \frac{10^2 + 11^2 + 12^2}{3}\]

$\bar{X}$ is called the first moment about the origin (mean, $E(X)$):

\[\bar{X} = \frac{10 + 11 + 12}{3}\]

4.4 Central Moments

Given a data set $x_1, x_2, x_3, \ldots, x_n$, the $r^{th}$ moment of a variable $X$ about the arithmetic mean is given by:

\[M_r = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^r}{n} \quad \text{(ungrouped data)} \tag{i}\]

Example 4.2

$X = 10, 12, 13$

First moment about mean:

\[\bar{X} = 11\tfrac{2}{3}\]

\[\frac{\left(10 - 11\tfrac{2}{3}\right) + \left(12 - 11\tfrac{2}{3}\right) + \left(13 - 11\tfrac{2}{3}\right)}{3} = 0\]

N/B: The first moment about the mean is always 0.

Proof: $1^{st}$ moment about mean is zero

\[M_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{X})}{m}\]

\[= \frac{\sum_{i=1}^{m} x_i}{m} - \frac{\sum_{i=1}^{m} \bar{X}}{m}\]

\[= \bar{X} - \bar{X} = 0\]

Second moment about the mean:

\[M_2 = \frac{\sum_{i=1}^{m}(x_i - \bar{X})^2}{m} = S^2 \quad \text{or variance}\]

4.5 Moments about an Arbitrary Point

The $r^{th}$ moment of a variable $X$ about any arbitrary point $a$ is given by:

\[M_r' = \frac{\sum_{i=1}^{m}(x_i - a)^r}{m} \tag{ii}\]

Example 4.3

$X = 10, 12, 13$, $a = 10$ (First moment about 10):

\[M_1' = \frac{(10-10) + (12-10) + (13-10)}{3}\]

\[= \frac{2 + 3}{3} = \frac{5}{3}\]

Remarks

$r^{th}$ moment about the origin, $E(X^r)$:

\[\bar{X}^r = \frac{\sum_{i=1}^{n} x_i^r}{n}\]

$r^{th}$ moment about the mean:

\[M_1 = 0\] \[M_2 = S^2\] \[\vdots\] \[M_r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^r}{n}\]

$r^{th}$ moment about any point $a$:

\[M_r' = \frac{\sum_{i=1}^{n}(x_i - a)^r}{n}\]

Example 4.4

Find the first, second, third & fourth moments of the following set of numbers about the origin:

$2, 3, 7, 8, 10$

\[\bar{X} = \frac{2+3+7+8+10}{5} = 6\]

\[\bar{X}^2 = \frac{2^2+3^2+7^2+8^2+10^2}{5} = 45.2\]

\[\bar{X}^3 = \frac{2^3+3^3+7^3+8^3+10^3}{5} = 378\]

\[\bar{X}^4 = \frac{2^4+3^4+7^4+8^4+10^4}{5} = 3318.8\]
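The four raw moments of Example 4.4 can be reproduced with a few lines of Python. This is only a sketch; the helper name `raw_moment` is an assumption made for illustration.

```python
def raw_moment(values, r):
    """r-th moment about the origin: the mean of the r-th powers."""
    return sum(v ** r for v in values) / len(values)

data = [2, 3, 7, 8, 10]

# First four moments about the origin.
raw = [raw_moment(data, r) for r in range(1, 5)]
for r, value in enumerate(raw, start=1):
    print(f"moment {r} about the origin = {value}")
```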

4.6 Relationship between Central Moments and Moments about an Arbitrary Point

Prove that: $M_2 = M_2' - [M_1']^2$

Let $d = x - a$, so $x = a + d$, $\bar{X} = a + \bar{d}$, and $x - \bar{X} = d - \bar{d}$.

\[M_2 = \frac{\sum(x - \bar{X})^2}{n}\]

\[= \frac{\sum(d - \bar{d})^2}{n} = \frac{\sum(d^2 - 2d\bar{d} + \bar{d}^2)}{n}\]

\[= \frac{\sum d^2}{n} - 2\bar{d}\left(\frac{\sum d}{n}\right) + \frac{\sum(\bar{d})^2}{n}\]

\[= \frac{\sum d^2}{n} - 2\bar{d}\left(\frac{\sum d}{n}\right) + (\bar{d})^2\]

\[= \frac{\sum d^2}{n} - 2\bar{d}\cdot\bar{d} + (\bar{d})^2\]

\[= \frac{\sum d^2}{n} - (\bar{d})^2\]

\[= \frac{\sum(x-a)^2}{n} - \left(\frac{\sum(x-a)}{n}\right)^2\]

\[= M_2' - \left(\frac{\sum d}{n}\right)^2 = M_2' - \left(\frac{\sum(x-a)}{n}\right)^2\]

\[\boxed{M_2 = M_2' - [M_1']^2}\]

Prove that: $M_3 = M_3' - 3M_1'M_2' + 2(M_1')^3$

\[M_3 = \frac{\sum(x - \bar{X})^3}{n} = \frac{\sum(d - \bar{d})^3}{n}\]

\[= \frac{\sum\left(d^3 - 3d^2\bar{d} + 3d(\bar{d})^2 - (\bar{d})^3\right)}{n}\]

\[= \frac{\sum d^3}{n} - 3\bar{d}\left(\frac{\sum d^2}{n}\right) + 3(\bar{d})^2\frac{\sum d}{n} - (\bar{d})^3\]

\[= \frac{\sum d^3}{n} - 3\bar{d}\left(\frac{\sum d^2}{n}\right) + 3(\bar{d})^2\bar{d} - (\bar{d})^3\]

\[= \frac{\sum d^3}{n} - 3\bar{d}\left(\frac{\sum d^2}{n}\right) + 2(\bar{d})^3\]

\[= \overline{d^3} - 3\bar{d}\cdot\overline{d^2} + 2(\bar{d})^3, \quad \text{where } \overline{d^r} = \frac{\sum d^r}{n}\]

\[= \frac{\sum(x-a)^3}{n} - 3\left(\frac{\sum(x-a)}{n}\right)\left(\frac{\sum(x-a)^2}{n}\right) + 2\left(\frac{\sum(x-a)}{n}\right)^3\]

\[\boxed{M_3 = M_3' - 3M_1'M_2' + 2(M_1')^3}\]

Prove that: $M_4 = M_4' - 4M_1'M_3' + 6(M_1')^2 M_2' - 3(M_1')^4$

\[M_4 = \frac{\sum(x - \bar{x})^4}{n} = \frac{\sum(d - \bar{d})^4}{n}\]

\[= \frac{\sum\left(d^4 - 4d^3(\bar{d}) + 6d^2(\bar{d})^2 - 4d(\bar{d})^3 + (\bar{d})^4\right)}{n}\]

\[= \overline{d^4} - 4\bar{d}\cdot\overline{d^3} + 6(\bar{d})^2\cdot\overline{d^2} - 4(\bar{d})^3\cdot\bar{d} + (\bar{d})^4, \quad \text{where } \overline{d^r} = \frac{\sum d^r}{n}\]

\[= \overline{d^4} - 4\bar{d}\cdot\overline{d^3} + 6(\bar{d})^2\cdot\overline{d^2} - 4(\bar{d})^4 + (\bar{d})^4\]

\[= \overline{d^4} - 4\bar{d}\cdot\overline{d^3} + 6(\bar{d})^2\cdot\overline{d^2} - 3(\bar{d})^4\]

\[= \frac{\sum(x-a)^4}{n} - 4\left(\frac{\sum(x-a)}{n}\right)\left(\frac{\sum(x-a)^3}{n}\right) + 6\left(\frac{\sum(x-a)}{n}\right)^2\left(\frac{\sum(x-a)^2}{n}\right) - 3\left(\frac{\sum(x-a)}{n}\right)^4\]

\[\boxed{M_4 = M_4' - 4M_1'M_3' + 6(M_1')^2 M_2' - 3(M_1')^4}\]
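The three identities can also be checked numerically. The sketch below is illustrative only: the data set and the arbitrary point `a = 5` are assumptions chosen for the check, and `moment_about` is a hypothetical helper name.

```python
def moment_about(values, a, r):
    """r-th moment of the values about the point a."""
    return sum((v - a) ** r for v in values) / len(values)

data = [2, 3, 7, 8, 10]   # illustrative data
a = 5                     # arbitrary point, chosen only for the check
mean = sum(data) / len(data)

# Moments about the arbitrary point a.
m1, m2, m3, m4 = (moment_about(data, a, r) for r in (1, 2, 3, 4))

# Central moments computed directly from the definition.
central = {r: moment_about(data, mean, r) for r in (2, 3, 4)}

# Central moments obtained from the three identities proved above.
from_identities = {
    2: m2 - m1 ** 2,
    3: m3 - 3 * m1 * m2 + 2 * m1 ** 3,
    4: m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4,
}

for r in (2, 3, 4):
    print(r, central[r], from_identities[r])
```

Changing `a` to any other value should leave the three central moments unchanged, which is the point of the identities.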

For Grouped Data

The $r^{th}$ moment about the origin is given by:

\[\bar{X}^r = \frac{f_1 x_1^r + f_2 x_2^r + \ldots + f_n x_n^r}{f_1 + f_2 + \ldots + f_n} = \frac{\sum_{i=1}^{n} f_i x_i^r}{\sum f_i}\]

$r^{th}$ Moment about the mean:

\[M_r = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{X})^r}{\sum f_i}\]

For different values of $r$, we shall get different moments. Thus if we put $r = 1$, we will get the first moment, if we put $r = 2$, we will get the second moment, and so on.

\[M_1 = 0 \qquad M_2 = S^2\]

$r^{th}$ Moment about any point $a$:

\[M_r' = \frac{\sum_{i=1}^{n} f_i(x_i - a)^r}{\sum f_i}\]

If $x_i = A + Cu_i$, where $A$ and $C$ are constants, then the $r^{th}$ moment about $A$ is:

\[M_r' = \frac{C^r \sum f_i u_i^r}{\sum f_i}\]

Example 4.5

The marks obtained by students in a class were as follows:

Mark	Frequency (f)	Midpoint (x)	d = x − 45.5	u = d/c, c = 10	fu	fu²	fu³	fu⁴
1-10	4	5.5	-40	-4	-16	64	-256	1024
11-20	5	15.5	-30	-3	-15	45	-135	405
21-30	32	25.5	-20	-2	-64	128	-256	512
31-40	89	35.5	-10	-1	-89	89	-89	89
41-50	102	45.5	0	0	0	0	0	0
51-60	78	55.5	10	1	78	78	78	78
61-70	63	65.5	20	2	126	252	504	1008
71-80	21	75.5	30	3	63	189	567	1701
81-90	9	85.5	40	4	36	144	576	2304
91-100	3	95.5	50	5	15	75	375	1875
Total	406				134	1064	1364	8996

Obtain the first 4 moments about the mean.

Moments about 45.5:

\[M_1' = \frac{C\sum fu}{\sum f} = \frac{10 \times 134}{406} = 3.30\]

\[M_2' = \frac{C^2\sum fu^2}{\sum f} = \frac{100 \times 1064}{406} = 262.07\]

\[M_3' = \frac{C^3\sum fu^3}{\sum f} = \frac{1000 \times 1364}{406} = 3359.61\]

\[M_4' = \frac{C^4\sum fu^4}{\sum f} = \frac{10000 \times 8996}{406} = 221576.35\]

\[M_1 = 0\]

\[M_2 = M_2' - [M_1']^2 = 262.07 - (3.30)^2 = 251.18\]

\[M_3 = M_3' - 3M_1'M_2' + 2(M_1')^3 = 3359.61 - 3(3.30)(262.07) + 2(3.30)^3 = 836.99\]

\[M_4 = M_4' - 4M_1'M_3' + 6(M_1')^2 M_2' - 3(M_1')^4\] \[= 221576.35 - 4(3.30)(3359.61) + 6(3.30)^2(262.07) - 3(3.30)^4 = 193997.38\]

Example 4.6

Calculate the first four moments about the mean of the following distribution:

x	f	d (x - a)	u	fu	fu²	fu³	fu⁴
12	1	-6	-3	-3	9	-27	81
14	4	-4	-2	-8	16	-32	64
16	6	-2	-1	-6	6	-6	6
18	10	0	0	0	0	0	0
20	7	2	1	7	7	7	7
22	2	4	2	4	8	16	32
Total	30			-6	46	-42	190

\[M_1' = \frac{C\sum fu}{\sum f} = \frac{2 \times (-6)}{30} = -0.4\]

\[M_2' = \frac{C^2\sum fu^2}{\sum f} = \frac{4 \times 46}{30} = 6.13\]

\[M_3' = \frac{C^3\sum fu^3}{\sum f} = \frac{8 \times (-42)}{30} = -11.2\]

\[M_4' = \frac{C^4\sum fu^4}{\sum f} = \frac{16 \times 190}{30} = 101.33\]

\[M_1 = 0\]

\[M_2 = M_2' - [M_1']^2 = 6.13 - (-0.4)^2 = 5.97\]

\[M_3 = M_3' - 3M_1'M_2' + 2(M_1')^3 = (-11.2) - 3(-0.4)(6.13) + 2(-0.4)^3 = -3.97\]

\[M_4 = M_4' - 4M_1'M_3' + 6(M_1')^2 M_2' - 3(M_1')^4\] \[= 101.33 - 4(-0.4)(-11.2) + 6(-0.4)^2(6.13) - 3(-0.4)^4 = 89.22\]
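The coding method used in Examples 4.5 and 4.6 can be sketched in Python as follows. The function name `coded_central_moments` is hypothetical; it takes the values (or class midpoints) `x`, the frequencies `f`, the assumed origin `a` and the class width `c`, and applies the conversion formulas of Section 4.6. It is shown here on the distribution of Example 4.6.

```python
def coded_central_moments(x, f, a, c):
    """First four central moments of grouped data via u = (x - a) / c."""
    n = sum(f)
    u = [(xi - a) / c for xi in x]
    # Moments about a: M'_r = c^r * sum(f * u^r) / sum(f).
    m1, m2, m3, m4 = (
        c ** r * sum(fi * ui ** r for fi, ui in zip(f, u)) / n
        for r in (1, 2, 3, 4)
    )
    # Convert moments about a into moments about the mean.
    return (
        0.0,
        m2 - m1 ** 2,
        m3 - 3 * m1 * m2 + 2 * m1 ** 3,
        m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4,
    )

x = [12, 14, 16, 18, 20, 22]
f = [1, 4, 6, 10, 7, 2]
moments = coded_central_moments(x, f, a=18, c=2)
print([round(m, 4) for m in moments])
```

Because the code carries full precision, its results can differ in the last decimal place from hand calculations that round the intermediate moments.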

5 Topic Five: Skewness and Kurtosis

5.1 Objectives

By the end of the topic, the learner should be able to:

Define and distinguish between the terms Skewness and Kurtosis
Compute the different measures of skewness and kurtosis

5.2 Skewness

The measures of central tendency and variation discussed in previous chapters do not reveal the entire story about a frequency distribution. Two distributions may have the same mean and standard deviation but may differ in their shape of the distribution. Further description of their characteristics is necessary — that is provided by measures of skewness and kurtosis.

Definition: Skewness refers to the degree of asymmetry or departure from symmetry of a given distribution.

Figure 5.1: Types of Skewness

Positively Skewed

If the frequency curve has a longer tail to the right of the central maximum, then the distribution is skewed to the right or positively skewed.

Negatively Skewed

When the tail is longer to the left of the central maximum point, then it is negatively skewed or skewed to the left.

5.2.1 Measures of Skewness

Karl Pearson’s 1st and 2nd Coefficient of Skewness

\[\text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \tag{i}\]

\[\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \tag{ii}\]

Quartile Coefficient of Skewness (Bowley’s Coefficient of Skewness)

\[= \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}\]

The 10–90 Percentile Coefficient of Skewness (Kelley’s Coefficient of Skewness)

\[\frac{(P_{90} - P_{50}) - (P_{50} - P_{10})}{P_{90} - P_{10}} = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}}\]

Moment Coefficient of Skewness, denoted by $a_3$

\[a_3 = \frac{m_3}{s^3} = \frac{m_3}{\left(\sqrt{m_2}\right)^3} \qquad \text{or} \qquad b_1 = a_3^2\]

N/B: For a symmetrical distribution such as the normal curve, both $a_3$ and $b_1$ are equal to 0.

5.2.2 Difference Between Variation and Skewness

The following two points of difference between variation and skewness should be carefully noted:

Variation tells us about the amount of the variation. Skewness tells us about the direction of variation.
In business and economic series, measures of variation have greater practical applications than measures of skewness.

Example 5.1

Calculate:

First and second coefficient of skewness
Quartile coefficient of skewness
Percentile coefficient of skewness
Moment coefficient

Table 5.1: Frequency Distribution Table — Example 5.1
$x$	$f$	$fx$	$d = x - a$	$u,\ c=2$	$fu$	$fu^2$	$fu^3$	$cf$
12	1	12	0	0	0	0	0	1
14	4	56	2	1	4	4	4	5
16	6	96	4	2	12	24	48	11
18	10	180	6	3	30	90	270	21
20	7	140	8	4	28	112	448	28
22	2	44	10	5	10	50	250	30
Total	30	528			84	280	1020

1st Coefficient of Skewness

\[\text{Mean} = \frac{\sum fx}{\sum f} = \frac{528}{30} = 17.6\]

\[\text{Variance} = C^2 \left[\frac{\sum fu^2}{\sum f} - \left(\frac{\sum fu}{\sum f}\right)^2\right] = 2^2\left[\frac{280}{30} - \left(\frac{84}{30}\right)^2\right] = \frac{448}{75}\]

\[\text{Standard Deviation} = \sqrt{\frac{448}{75}} = 2.4440\]

\[\text{Median: } \frac{n}{2} = \frac{30}{2} = 15^{\text{th}} \text{ position} = 18\]

\[\text{Mode: highest frequency} = 18\]

\[\mathbf{1^{st}\ Coefficient\ of\ Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} = \frac{17.6 - 18}{2.4440} = -0.1637\]

\[\mathbf{2^{nd}\ Coefficient} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(17.6 - 18)}{2.4440} = -0.4910\]

Quartile Coefficient of Skewness

\[\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}\]

\[Q_3 = \frac{3}{4}n = \frac{3}{4} \times 30 = 22.5^{\text{th}} \text{ term}, \quad \therefore Q_3 = 20\]

\[Q_2 = \text{median} = 18\]

\[Q_1 = \frac{1}{4}n = \frac{1}{4} \times 30 = 7.5^{\text{th}} \text{ term}, \quad Q_1 = 16\]

\[\textbf{Quartile coefficient of skewness} = \frac{20 - 2(18) + 16}{20 - 16} = \frac{0}{4} = 0\]

Percentile Coefficient of Skewness

\[\frac{(P_{90} - P_{50}) - (P_{50} - P_{10})}{P_{90} - P_{10}} = \frac{(20 - 18) - (18 - 14)}{20 - 14} = -\frac{2}{6}\]

Moment Coefficient

\[a_3 = \frac{m_3}{s^3} = \frac{m_3}{\left(\sqrt{m_2}\right)^3}\]

\[M'_1 = \frac{C \sum fu}{\sum f} = \frac{2 \times 84}{30} = 5.6\]

\[M'_2 = \frac{C^2 \sum fu^2}{\sum f} = \frac{4 \times 280}{30} = \frac{112}{3}\]

\[M'_3 = \frac{C^3 \sum fu^3}{\sum f} = \frac{2^3 \times 1020}{30} = 272\]

\[M_2 = M'_2 - \left[M'_1\right]^2 = \frac{112}{3} - (5.6)^2 = \frac{448}{75}\]

\[M_3 = M'_3 - 3M'_1 M'_2 + 2\left(M'_1\right)^3 = 272 - 3\left(5.6 \times \frac{112}{3}\right) + 2(5.6)^3 = 272 - 627.2 + 351.232 = -3.968\]

\[a_3 = \frac{m_3}{\left(\sqrt{m_2}\right)^3} = \frac{-3.968}{\left(\dfrac{448}{75}\right)^{3/2}} = -0.2718\]
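The measures in Example 5.1 can be recomputed directly from the frequency table. The sketch below is a rough check under two assumptions: the frequency table is expanded into a list of 30 individual values, and the standard deviation is taken as the square root of the population variance. All variable names are illustrative.

```python
from math import sqrt
from statistics import median, mode

x = [12, 14, 16, 18, 20, 22]
f = [1, 4, 6, 10, 7, 2]

# Expand the frequency table into the 30 individual observations.
data = [xi for xi, fi in zip(x, f) for _ in range(fi)]
n = len(data)

mean = sum(data) / n
m2 = sum((v - mean) ** 2 for v in data) / n   # second central moment (variance)
m3 = sum((v - mean) ** 3 for v in data) / n   # third central moment
sd = sqrt(m2)

pearson_1 = (mean - mode(data)) / sd          # (mean - mode) / s
pearson_2 = 3 * (mean - median(data)) / sd    # 3(mean - median) / s
a3 = m3 / sd ** 3                             # moment coefficient of skewness

print(round(pearson_1, 4), round(pearson_2, 4), round(a3, 4))
```

All three coefficients are negative, so each measure points to a slight skew to the left.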

5.3 Kurtosis

Kurtosis in Greek means “bulginess”.

Definition: In statistics, kurtosis refers to the degree of flatness or peakedness in the region about the mode of a frequency curve. The degree of kurtosis of a distribution is measured relative to the peakedness of a normal curve.

5.3.1 Types of Kurtosis

Figure 5.2: Types of Kurtosis: 1 = Leptokurtic, 2 = Mesokurtic, 3 = Platykurtic

Leptokurtic — Has a relatively higher peak
Platykurtic — Has a more flat-topped peak
Mesokurtic — This is the normal curve. It is not very peaked nor very flat-topped.

5.3.2 Measure of Kurtosis

This is measured by the $4^{\text{th}}$ moment about the mean expressed in dimensionless form, given by:

\[\text{Moment coefficient of kurtosis} \quad b_2 = a_4 = \frac{m_4}{s^4} = \frac{m_4}{m_2^2}\]

Note: For normal distribution, $b_2 = a_4 = 3$.
Hence kurtosis is sometimes defined by $b_2 - 3$, which is:

Positive for Leptokurtic

Negative for Platykurtic

Zero for normal distribution

Example 5.2

The marks obtained by students in an exam were as follows. Investigate the symmetry and peakedness of the data.

Table 5.2: Frequency Distribution Table — Example 5.2
Marks	$f$	$x$	$d$	$u,\ c=10$	$fu$	$fu^2$	$fu^3$	$fu^4$
0–20	7	10	-40	-4	-28	112	-448	1792
20–40	15	30	-20	-2	-30	60	-120	240
40–60	32	50	0	0	0	0	0	0
60–80	12	70	20	2	24	48	96	192
80–100	9	90	40	4	36	144	576	2304
Total	75				2	364	104	4528

Computations

\[M'_1 = \frac{C \sum fu}{\sum f} = \frac{10 \times 2}{75} = 0.27\]

\[M'_2 = \frac{C^2 \sum fu^2}{\sum f} = \frac{10^2 \times 364}{75} = 485.33\]

\[M'_3 = \frac{C^3 \sum fu^3}{\sum f} = \frac{10^3 \times 104}{75} = 1386.67\]

\[M'_4 = \frac{C^4 \sum fu^4}{\sum f} = \frac{10^4 \times 4528}{75} = 603{,}733.33\]

Peakedness (Kurtosis)

\[M_4 = M'_4 - 4M'_1 M'_3 + 6\left(M'_1\right)^2 M'_2 - 3\left(M'_1\right)^4\]

\[= 603{,}733.33 - 4\left(\frac{4}{15} \times 1386\tfrac{2}{3}\right) + 6\left(\frac{4}{15}\right)^2 \times 485\tfrac{1}{3} - 3\left(\frac{4}{15}\right)^4 = 602{,}461.2826\]

\[M_2 = M'_2 - \left[M'_1\right]^2 = 485\tfrac{1}{3} - \left(\frac{4}{15}\right)^2 = 485\tfrac{59}{225}\]

\[\textbf{Peakedness} = \frac{m_4}{m_2^2} = \frac{602{,}461.2826}{\left(485\tfrac{59}{225}\right)^2} = 2.5584\]

Since $b_2 = 2.5584 < 3$, the distribution is Platykurtic (flatter than normal).

Skewness

\[M_3 = M'_3 - 3M'_1 M'_2 + 2\left(M'_1\right)^3\]

\[= 1386\tfrac{2}{3} - 3\left(\frac{4}{15} \times 485\tfrac{1}{3}\right) + 2\left(\frac{4}{15}\right)^3 = 998.4379\]

\[a_3 = \frac{m_3}{\left(\sqrt{m_2}\right)^3} = \frac{998.4379}{\left(\sqrt{485.2622}\right)^3} = 0.0934\]

Since $a_3 = 0.093 > 0$, the distribution is slightly positively skewed.
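The two shape measures of Example 5.2 can be cross-checked by computing the central moments directly from the class midpoints rather than through the coded values. This is a sketch only; the helper name `central` is an assumption.

```python
x = [10, 30, 50, 70, 90]   # class midpoints
f = [7, 15, 32, 12, 9]     # class frequencies

n = sum(f)
mean = sum(fi * xi for fi, xi in zip(f, x)) / n

def central(r):
    """r-th moment about the mean for the grouped data above."""
    return sum(fi * (xi - mean) ** r for fi, xi in zip(f, x)) / n

m2, m3, m4 = central(2), central(3), central(4)

a3 = m3 / m2 ** 1.5   # moment coefficient of skewness
b2 = m4 / m2 ** 2     # moment coefficient of kurtosis

print(f"a3 = {a3:.4f}, b2 = {b2:.4f}")
```

A value of `b2` below 3 together with a small positive `a3` agrees with the conclusions drawn above: platykurtic and slightly positively skewed.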

6 Topic Six: Probability

6.1 Outline of Statistics

Statistics is an increasingly important subject which is useful in many types of scientific investigation. It has become the science of collecting, analysing and interpreting data in the best possible way. Statistics is particularly useful in situations where there is experimental uncertainty and may be defined as ‘the science of making decisions in the face of uncertainty’.

6.2 The Concept of Probability

Probability theory enables us to calculate the chance or probability of getting a given event, while Statistical theory enables us to estimate that chance.

6.3 Some Definitions

We begin our study of probability theory with some definitions. The sample space is defined as the set of all possible outcomes of an experiment. For example:

When a die is thrown the sample space is 1, 2, 3, 4, 5 and 6, i.e. $S = \{1, 2, 3, 4, 5, 6\}$
If two coins are tossed, the sample space is HH, TT, HT, and TH, i.e. $S = \{HH, TT, HT, TH\}$
In testing the reliability of a machine, the sample space is ‘success’ and ‘failure’.

Each possible outcome is a sample point. A collection of sample points with a common property is called an event. It is usually denoted by $E$.

If a die is thrown and a number less than 4 is obtained, this is an event containing the sample points 1, 2, and 3.
If two coins are tossed and at least one head is obtained, this is an event containing the sample points HH, HT and TH.

NOTE:

$E \subset S$
A sample point is an event consisting of one element only.
An empty set, $\phi$, is an event in $S$, i.e., $\phi \subset S$
$S$ itself is also an event

Let $A$ and $B$ be two events in $S$, i.e., $A \subset S$ and $B \subset S$, then:

$A \cup B$ is an event that occurs if $A$ occurs or $B$ occurs or both $A$ and $B$ occur.
$A \cap B$ is an event that occurs if both $A$ and $B$ occur.
$A'$, $A^c$, or $\bar{A}$ is the event that occurs if $A$ does not occur.

If all the sample points in $S$ are equally likely, the probability of an event $A$ is:

\[P(A) = \frac{n(A)}{n(S)} = \frac{\text{No. of elements in } A}{\text{No. of elements in } S} \quad \text{and} \quad 0 \leq P(A) \leq 1 \text{ for every event } A \text{ in } S.\]

The probability of a sample point is the proportion of occurrences of the sample point in a long series of experiments.

We will denote the probability that sample point $x$ will occur by $P(x)$. For example, a coin is said to be ‘fair’ if heads and tails are equally likely to occur, so that $P(H) = P(T) = \frac{1}{2}$. By this we mean that if the coin is tossed $N$ times and $f_H$ heads are observed, then the ratio $f_H / N$ tends to get closer to $\frac{1}{2}$ as $N$ increases. On the other hand if the coin is ‘loaded’ then the ratio $f_H / N$ will not tend to $\frac{1}{2}$.

The probability of a sample point always lies between zero and one.
If the sample point cannot occur, then its probability is zero, but if the sample point must occur, then its probability is one.
The sum of the probabilities of all the sample points is one.

Probability theory is concerned with setting up a list of rules for manipulating these probabilities and for calculating the probabilities of more complex events. Most probabilities have to be estimated from sample data but simple examples deal with equally likely sample points which are known to all have the same probability.

Example 1

Toss two fair coins. Denote heads by H and tails by T. There are four points in the sample space and they are equally likely to occur.

Table 6.1: Sample Space for Two Fair Coins
Sample space	Probability
HH	1/4
HT	1/4
TH	1/4
TT	1/4

Example 2

If two fair dice are tossed, the sample space consists of the thirty-six combinations shown below:

Table 6.2: Sample Space for Two Fair Dice
	Die 2 = 1	Die 2 = 2	Die 2 = 3	Die 2 = 4	Die 2 = 5	Die 2 = 6
Die 1 = 1	(1,1)	(1,2)	(1,3)	(1,4)	(1,5)	(1,6)
Die 1 = 2	(2,1)	(2,2)	(2,3)	(2,4)	(2,5)	(2,6)
Die 1 = 3	(3,1)	(3,2)	(3,3)	(3,4)	(3,5)	(3,6)
Die 1 = 4	(4,1)	(4,2)	(4,3)	(4,4)	(4,5)	(4,6)
Die 1 = 5	(5,1)	(5,2)	(5,3)	(5,4)	(5,5)	(5,6)
Die 1 = 6	(6,1)	(6,2)	(6,3)	(6,4)	(6,5)	(6,6)

Each of the thirty-six sample points is equally likely to occur and so each has probability $\frac{1}{36}$. By inspection we can see for example that:

\[P(\text{sum of the 2 dice is 7}) = \frac{6}{36} = \frac{1}{6}\]

\[P(\text{sum is 2}) = \frac{1}{36}\]

\[P(\text{sum is 7 or 11}) = \frac{6}{36} + \frac{2}{36} = \frac{8}{36}\]
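The probabilities read off by inspection can be confirmed by enumerating the thirty-six sample points. The sketch below assumes equally likely outcomes, as stated above; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (die 1, die 2).
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = favourable sample points / all sample points."""
    favourable = sum(1 for point in sample_space if event(point))
    return Fraction(favourable, len(sample_space))

p_sum_7 = prob(lambda p: sum(p) == 7)
p_sum_2 = prob(lambda p: sum(p) == 2)
p_sum_7_or_11 = prob(lambda p: sum(p) in (7, 11))

print(p_sum_7, p_sum_2, p_sum_7_or_11)
```

`Fraction` keeps the results exact and in lowest terms, so $\frac{8}{36}$ is reported as $\frac{2}{9}$.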

Types of Events

6.3.1 Mutually Exclusive Events

If two events, $E_1$ and $E_2$, are mutually exclusive, they have no common sample points. In a single trial mutually exclusive events cannot both occur; the probability that one of the mutually exclusive events occurs being the sum of their respective probabilities. This is the addition law for mutually exclusive events:

\[P(E_1 \cup E_2) = P(E_1) + P(E_2)\]

The notation $E_1 \cup E_2$ means that at least one of the events occurs; as applied to mutually exclusive events it means that one event or the other occurs. For example, if a coin is tossed once and a head appears, a tail cannot also appear.

6.3.2 Not Mutually Exclusive Events

Two events that are not mutually exclusive contain one or more common sample points. The probability that at least one of these events occurs is given by the general addition law:

\[P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)\]

where $(E_1 \cap E_2)$ is the event that both $E_1$ and $E_2$ occur.

Example 3

What is the probability of rolling two dice to obtain the sum seven and/or the number three on at least one die?

Solution

Let event $E_1$ be ‘the sum is 7’ and event $E_2$ be ‘at least one 3 turns up’. By inspection of the table in the previous example, we find:

\[P(E_1) = \frac{6}{36}, \quad P(E_2) = \frac{11}{36}, \quad P(E_1 \cap E_2) = \frac{2}{36}\]

Thus:

\[P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2) = \frac{6}{36} + \frac{11}{36} - \frac{2}{36} = \frac{15}{36}\]

Events that are not mutually exclusive may be further classified as dependent or independent events. Dependence between events is treated by the notion of conditional probability.

6.3.3 Conditional Probability

We define the probability of an event as the sum of the probabilities of the sample points in the event:

\[P(E) = \sum_{\text{event}} P(\text{sample point } s \text{ in } E)\]

Now suppose that we are interested in the probability of an event $E_1$ and we are told that event $E_2$ has occurred. The conditional probability of $E_1$, given that $E_2$ has occurred, is written $P(E_1 | E_2)$, read as $P(E_1 \text{ given } E_2)$. The conditional probability can be defined:

\[P(E_1 | E_2) = \frac{\sum P(\text{sample points common to } E_1 \text{ and } E_2)}{\sum P(\text{sample points in } E_2)} = \frac{P(E_1 \cap E_2)}{P(E_2)}\]

The effect of the conditional information is to restrict the sample space to the sample points contained in event $E_2$.

Example 4

Given that a roll of two fair dice has produced at least one three, what is the probability that the sum is seven?

Solution

Let event $E_1$ be ‘the sum is 7’ and event $E_2$ be ‘at least one 3 turns up’.

\[P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)} = \frac{2/36}{11/36} = \frac{2}{11}\]

This result can be obtained directly from the table of Example 2, by removing all points in which no three occurs. Of the remaining eleven points exactly two give a sum which is seven.
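Examples 3 and 4 can both be checked by treating events as sets of sample points. In this sketch the names `S`, `E1`, `E2` and `P` mirror the notation of the text, and the outcomes are assumed equally likely.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))    # sample space of two dice
E1 = {p for p in S if sum(p) == 7}         # the sum is 7
E2 = {p for p in S if 3 in p}              # at least one 3 turns up

def P(event):
    """Probability of an event under equally likely sample points."""
    return Fraction(len(event), len(S))

# General addition law (Example 3).
p_union = P(E1 | E2)

# Conditional probability from the definition (Example 4).
p_given = P(E1 & E2) / P(E2)

# Same result by restricting the sample space to E2.
restricted = Fraction(len(E1 & E2), len(E2))

print(p_union, p_given, restricted)
```

The last two quantities agree, which illustrates that conditioning on $E_2$ amounts to shrinking the sample space to the points of $E_2$.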

6.4 Axioms of Probability

For every event $A \subset S$, $\quad 0 \leq P(A) \leq 1$
$P(S) = 1$
If $A$ and $B$ are mutually exclusive events in $S$, $\quad P(A \cup B) = P(A) + P(B)$
If $A_1, A_2, \ldots, A_n$ are mutually exclusive events in $S$, \[P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n)\]

Theorem 1

If $\phi$ is the empty set, then $P(\phi) = 0$.

Proof

Let $A$ be any event in $S$. $A \cup \phi = A$ and $A \cap \phi = \phi$, i.e., $A$ and $\phi$ are mutually exclusive events.

\[P(A \cup \phi) = P(A) + P(\phi) = P(A)\] \[P(A) + P(\phi) = P(A) \implies P(\phi) = 0\]

Theorem 2

$P(A) + P(A^c) = 1$, where $A^c$ is the complement of $A$.

Proof

$S = A \cup A^c$ and $A$ and $A^c$ are mutually exclusive events. By axioms 2 and 3:

\[P(S) = P(A \cup A^c)\] \[1 = P(A) + P(A^c)\]

Theorem 3

If $A$ and $B$ are any two events in $S$:

\[P(A - B) = P(A) - P(A \cap B)\]

Figure 6.1: Venn Diagram: Events A and B

Proof

\[A = (A \cap B) \cup (A - B)\]

$(A \cap B) \cap (A - B) = \phi$, that is, $(A \cap B)$ and $(A - B)$ are mutually exclusive events.

\[\therefore P(A) = P[(A \cap B) \cup (A - B)]\] \[\implies P(A) = P(A \cap B) + P(A - B) \implies P(A - B) = P(A) - P(A \cap B)\]

Theorem 4

If $A$ and $B$ are any two events in $S$:

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

Figure 6.2: Venn Diagram: Events A and B

Proof

\[(A \cup B) = (A - B) \cup B\]

$(A - B) \cap B = \phi$, that is, $(A - B)$ and $B$ are mutually exclusive events.

\[\therefore P(A \cup B) = P[(A - B) \cup B] = P(A - B) + P(B)\]

From Theorem 3:

\[P(A - B) = P(A) - P(A \cap B)\]

Hence:

\[P(A \cup B) = P(A) - P(A \cap B) + P(B)\] \[\implies P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

In particular, if $A$ and $B$ are mutually exclusive events in $S$: $P(A \cup B) = P(A) + P(B)$.

Example 5

Let $A$ and $B$ be events in $S$ such that $P(A) = \frac{1}{2}$, $P(B) = \frac{1}{3}$, and $P(A \cap B) = \frac{1}{4}$. Evaluate the following:

(i) $P(A' \cup B')$ $\quad$ (ii) $P(A' \cap B)$ $\quad$ (iii) $P(A' \cap B')$ $\quad$ (iv) $P(A \cup B')$

Solution

(i)

\[P(A' \cup B') = 1 - P(A \cap B) = 1 - \frac{1}{4} = \frac{3}{4}\]

(ii)

\[P(A' \cap B) = P(B - A) = P(B) - P(A \cap B) = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}\]

(iii)

\[P(A' \cap B') = 1 - P(A \cup B)\] \[= 1 - \left[P(A) + P(B) - P(A \cap B)\right]\] \[= 1 - \left[\frac{1}{2} + \frac{1}{3} - \frac{1}{4}\right] = 1 - \frac{7}{12} = \frac{5}{12}\]

(iv)

\[P(A \cup B') = P(A) + P(B') - P(A \cap B')\] \[= P(A) + P(B') - [P(A) - P(A \cap B)]\] \[= P(B') + P(A \cap B) = \frac{2}{3} + \frac{1}{4} = \frac{11}{12}\]
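One way to sanity-check the four answers is to build a concrete sample space that has the given probabilities and then count. The construction below is purely an assumption made for illustration: twelve equally likely points, with $A$ and $B$ chosen so that $P(A) = \frac{1}{2}$, $P(B) = \frac{1}{3}$ and $P(A \cap B) = \frac{1}{4}$.

```python
from fractions import Fraction

S = set(range(12))            # 12 equally likely sample points (illustrative)
A = {0, 1, 2, 3, 4, 5}        # P(A) = 6/12 = 1/2
B = {3, 4, 5, 6}              # P(B) = 4/12 = 1/3, and A & B has 3 points

def P(event):
    return Fraction(len(event), len(S))

A_c, B_c = S - A, S - B       # complements of A and B

results = {
    "(i)": P(A_c | B_c),
    "(ii)": P(A_c & B),
    "(iii)": P(A_c & B_c),
    "(iv)": P(A | B_c),
}
print(results)
```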

6.5 Independent and Dependent Events

Two events, $E_1$ and $E_2$, are said to be independent if $P(E_1) = P(E_1 | E_2)$. Thus the knowledge that event $E_2$ has occurred has no effect on the probability of event $E_1$. Conversely, two events are said to be dependent if $P(E_1) \neq P(E_1 | E_2)$.

Example 6

A coin is tossed three times and the eight possible outcomes, HHH, HHT, HTH, THH, HTT, THT, TTH, and TTT, are assumed to be equally likely. If $A$ is the event that a head occurs on each of the first two tosses, $B$ is the event that a tail occurs on the third toss, and $C$ is the event that exactly two tails occur in the three tosses, show that:

events $A$ and $B$ are independent;
events $B$ and $C$ are dependent.

Solution

Since:

\[A = \{HHH, HHT\}, \quad B = \{HHT, HTT, THT, TTT\}, \quad C = \{HTT, THT, TTH\}\] \[A \cap B = \{HHT\}, \quad B \cap C = \{HTT, THT\}\]

the assumption that the eight possible outcomes are all equiprobable yields:

\[P(A) = \frac{1}{4}, \quad P(B) = \frac{1}{2}, \quad P(C) = \frac{3}{8}, \quad P(A \cap B) = \frac{1}{8}, \quad P(B \cap C) = \frac{1}{4}\]

(a) Since $P(A) \cdot P(B) = \frac{1}{4} \times \frac{1}{2} = \frac{1}{8} = P(A \cap B)$, events $A$ and $B$ are independent.

(b) Since $P(B) \cdot P(C) = \frac{1}{2} \times \frac{3}{8} = \frac{3}{16} \neq \frac{1}{4} = P(B \cap C)$, events $B$ and $C$ are not independent.
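The independence check of Example 6 can be automated by enumerating the eight outcomes and testing the product rule $P(X \cap Y) = P(X) \cdot P(Y)$. The helper name `independent` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

# The eight equally likely outcomes of three tosses, e.g. "HHT".
S = {"".join(t) for t in product("HT", repeat=3)}

A = {s for s in S if s[:2] == "HH"}        # heads on the first two tosses
B = {s for s in S if s[2] == "T"}          # tail on the third toss
C = {s for s in S if s.count("T") == 2}    # exactly two tails

def P(event):
    return Fraction(len(event), len(S))

def independent(X, Y):
    """True when P(X and Y) equals P(X) * P(Y)."""
    return P(X & Y) == P(X) * P(Y)

print(independent(A, B), independent(B, C))
```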

6.5.1 Conditional Probability (Multiplication Law)

Let $A$ be an event in a sample space $S$. The probability that an event $B$ occurs given that $A$ has occurred is defined as:

\[P(B | A) = \frac{P(A \cap B)}{P(A)}, \quad P(A) > 0 \tag{i}\]

$P(B | A)$ is the conditional probability of $B$ given that $A$ has occurred.

From definition (i) above, for any two events $A$, $B$ in $S$:

\[P(A \cap B) = P(A) \times P(B | A) \quad \text{[Multiplication law of probability]}\]

Also:

\[P(A \cap B) = P(B \cap A)\] \[P(B) \times P(A | B) = P(A) \times P(B | A)\]

6.5.2 Extension of Multiplication Law for Three Events

For three events $A$, $B$, and $C$ in $S$:

\[P(A \cap B \cap C) = P(A) \times P(B | A) \times P(C | A \cap B)\]

In general, for $n$ events $A_1, A_2, \ldots, A_n$:

\[P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \times P(A_2 | A_1) \times P(A_3 | A_1 \cap A_2) \times \cdots \times P(A_n | A_1 \cap A_2 \cap \cdots \cap A_{n-1})\]

This is also known as the Multiplication Theorem.

Example 7

A pair of dice is tossed. If the two numbers appearing are different, find the probability that:

The sum is six
The sum is four or less

Solution

$S = \{(a,b) \mid a, b = 1, 2, 3, 4, 5, 6\}$ consists of 36 equally likely points. Let $A$ be the event that “the numbers appearing are different”:

\[A = \{(a,b) \mid a, b = 1, 2, 3, 4, 5, 6 \text{ and } a \neq b\}. \quad \text{Hence } P(A) = \frac{30}{36}\]

(i) Let $B$ be the event that “the sum is six”:

\[B = \{(a,b) \mid a, b = 1, 2, 3, 4, 5, 6 \text{ and } a + b = 6\} = \{(1,5),(2,4),(3,3),(4,2),(5,1)\}\]

Hence $P(B) = \frac{5}{36}$. $\quad A \cap B = \{(1,5),(2,4),(4,2),(5,1)\}$. Hence $P(A \cap B) = \frac{4}{36}$.

\[P(B | A) = \frac{P(A \cap B)}{P(A)} = \frac{4/36}{30/36} = \frac{4}{30} = \frac{2}{15}\]

(ii) Let $C$ be the event that “the sum is four or less”:

\[C = \{(a,b) \mid a, b = 1, 2, 3, 4, 5, 6 \text{ and } a + b \leq 4\} = \{(1,3),(2,2),(3,1),(1,2),(2,1),(1,1)\}\]

Hence $P(C) = \frac{6}{36}$. $\quad A \cap C = \{(1,3),(3,1),(1,2),(2,1)\}$. Hence $P(A \cap C) = \frac{4}{36}$.

\[P(C | A) = \frac{P(A \cap C)}{P(A)} = \frac{4/36}{30/36} = \frac{4}{30} = \frac{2}{15}\]

Example 8

A class has ten boys and 5 girls. Three students are selected at random without replacement from the class. Find the probability that:

All are girls
The first two are boys and the third is a girl
The first and the third are of the same sex and the second is of the opposite sex

Solution

Let $G_i$ denote the event “the $i$th student selected is a girl” and $B_i$ denote the event “the $i$th student selected is a boy”.

(i)

\[P(G_1 \cap G_2 \cap G_3) = P(G_1) \times P(G_2 | G_1) \times P(G_3 | G_1 \cap G_2) = \frac{5}{15} \times \frac{4}{14} \times \frac{3}{13} = \frac{2}{91}\]

(ii)

\[P(B_1 \cap B_2 \cap G_3) = P(B_1) \times P(B_2 | B_1) \times P(G_3 | B_1 \cap B_2) = \frac{10}{15} \times \frac{9}{14} \times \frac{5}{13} = \frac{15}{91}\]

(iii) We need $P(B_1 \cap G_2 \cap B_3)$ and $P(G_1 \cap B_2 \cap G_3)$:

\[P(B_1 \cap G_2 \cap B_3) = \frac{10}{15} \times \frac{5}{14} \times \frac{9}{13} = \frac{15}{91}\]

\[P(G_1 \cap B_2 \cap G_3) = \frac{5}{15} \times \frac{10}{14} \times \frac{4}{13} = \frac{20}{273}\]

Required probability:

\[P\left[(B_1 \cap G_2 \cap B_3) \cup (G_1 \cap B_2 \cap G_3)\right] = \frac{15}{91} + \frac{20}{273} = \frac{45}{273} + \frac{20}{273} = \frac{65}{273} = \frac{5}{21}\]
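Example 8 can be verified by brute force: list every ordered selection of three different students and count the selections that match each pattern. The sketch assumes every ordered selection is equally likely; the helper name `prob` and the pattern strings are illustrative.

```python
from fractions import Fraction
from itertools import permutations

students = ["B"] * 10 + ["G"] * 5      # ten boys and five girls

# Every ordered selection of three different students: 15 * 14 * 13 = 2730.
draws = list(permutations(range(len(students)), 3))

def prob(pattern):
    """Probability that the sexes of the three selected students match pattern."""
    hits = sum(1 for d in draws if "".join(students[i] for i in d) == pattern)
    return Fraction(hits, len(draws))

p_all_girls = prob("GGG")
p_bbg = prob("BBG")
p_alternating = prob("BGB") + prob("GBG")   # mutually exclusive patterns

print(p_all_girls, p_bbg, p_alternating)
```

The two alternating patterns cannot occur together, so their probabilities are added, in line with the addition law for mutually exclusive events.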

6.5.3 Joint Events

The probability of the joint event $(E_1 \cap E_2)$ can be obtained by considering:

\[P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}\]

which can be rearranged to give:

\[P(E_1 \cap E_2) = P(E_2) \cdot P(E_1 | E_2)\]

Similarly:

\[P(E_1 \cap E_2) = P(E_1) \cdot P(E_2 | E_1)\]

These relations are general and apply to both dependent and independent events. Of course if the events are independent then $P(E_1 | E_2) = P(E_1)$, so that the equation simplifies to give:

\[P(E_1 \cap E_2) = P(E_1) \cdot P(E_2)\]

This is called the product law for independent events.

Example 9

When two fair dice are rolled, what is the probability that the sum is seven and at least one of the dice shows a three?

Solution

Let event $E_1$ be ‘the sum is 7’ and event $E_2$ be ‘at least one 3 occurs’.

\[P(E_1 \cap E_2) = P(E_2) \cdot P(E_1 | E_2) = \frac{11}{36} \times \frac{2}{11} = \frac{1}{18}\]
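
The product rule used in Example 9 can be verified by listing the 36 outcomes directly. This is a small sketch, assuming both dice are fair so that all ordered pairs are equally likely. The names `sum_is_7`, `has_a_3` and `p` are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

sum_is_7 = {o for o in outcomes if sum(o) == 7}   # event E1
has_a_3 = {o for o in outcomes if 3 in o}         # event E2

def p(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(len(event), len(outcomes))

p_e2 = p(has_a_3)
# Conditional probability: restrict the sample space to E2.
p_e1_given_e2 = Fraction(len(sum_is_7 & has_a_3), len(has_a_3))
p_joint = p(sum_is_7 & has_a_3)
print(p_e2, p_e1_given_e2, p_joint)
```

The joint probability counted directly equals the product $P(E_2) \cdot P(E_1 | E_2)$, as the rearranged definition of conditional probability requires.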

If there are three events then similar reasoning provides all required relationships:

\[P(E_1 \cup E_2 \cup E_3) = P(E_1) + P(E_2) + P(E_3) - P(E_1 \cap E_2) - P(E_1 \cap E_3) - P(E_2 \cap E_3) + P(E_1 \cap E_2 \cap E_3)\]

\[P(E_1 | E_2 \cap E_3) = \frac{P(E_1 \cap E_2 \cap E_3)}{P(E_2 \cap E_3)}\]

\[P(E_1 \cap E_2 \cap E_3) = P(E_1) \cdot P(E_2 | E_1) \cdot P(E_3 | E_1 \cap E_2)\]

If $E_1$, $E_2$, and $E_3$ are independent events then:

\[P(E_1 \cap E_2 \cap E_3) = P(E_1) \cdot P(E_2) \cdot P(E_3)\]

Example 10

Suppose that if a person visits his dentist, the probability that he will have his teeth cleaned is 0.44, the probability that he will have a cavity filled is 0.24, the probability that he will have a tooth extracted is 0.21, the probability that he will have his teeth cleaned and a cavity filled is 0.08, the probability that he will have his teeth cleaned and a tooth extracted is 0.11, the probability that he will have a cavity filled and a tooth extracted is 0.07, and the probability that he will have his teeth cleaned, a cavity filled, and a tooth extracted is 0.03. What is the probability that a person visiting his dentist will have at least one of these things done to him?

Solution

If $C$ is the event that the person will have his teeth cleaned, $F$ is the event that he will have a cavity filled, and $E$ is the event that he will have a tooth extracted, we are given:

\[P(C) = 0.44, \quad P(F) = 0.24, \quad P(E) = 0.21\] \[P(C \cap F) = 0.08, \quad P(C \cap E) = 0.11, \quad P(F \cap E) = 0.07, \quad P(C \cap F \cap E) = 0.03\]

Substituting into the formula for addition of three events:

\[P(C \cup F \cup E) = 0.44 + 0.24 + 0.21 - 0.08 - 0.11 - 0.07 + 0.03 = 0.66\]
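
The substitution in Example 10 is easy to mistype, so a short check may help. The sketch below wraps the three-event addition rule in a function. The name `union_of_three` and its argument order are assumptions made for this illustration. The inputs are parsed as exact fractions to avoid floating-point noise.

```python
from fractions import Fraction

def union_of_three(p_a, p_b, p_c, p_ab, p_ac, p_bc, p_abc):
    """Addition rule (inclusion-exclusion) for three events."""
    return p_a + p_b + p_c - p_ab - p_ac - p_bc + p_abc

# P(C), P(F), P(E), P(C∩F), P(C∩E), P(F∩E), P(C∩F∩E) from Example 10.
values = [Fraction(s) for s in
          ("0.44", "0.24", "0.21", "0.08", "0.11", "0.07", "0.03")]

p_at_least_one = union_of_three(*values)
print(float(p_at_least_one))
```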

6.6 Permutations and Combinations

Many problems in probability require the total number of points in a sample space to be counted. When there is a very large number of these, knowledge of combinatorial theory, and in particular of permutations and combinations, is useful.

A quantity frequently used in combinatorial theory is factorial $n$, which is defined as:

\[n! = n(n-1)(n-2) \cdots 3 \times 2 \times 1\]

6.6.1 Permutations

The number of ways in which $r$ items can be selected from $n$ distinct items, taking notice of the order of selection, is called the number of permutations of $n$ items taken $r$ at a time, and is denoted by $^nP_r$ or $P(n,r)$:

\[^nP_r = n(n-1)(n-2) \cdots (n-r+1) = \frac{n!}{(n-r)!}\]

For example, the number of permutations of the letters $a$, $b$, $c$ taken two at a time is $^3P_2 = 3 \times 2 = 6$. These are: $ab$, $ac$, $bc$, $ba$, $ca$, $cb$.

6.6.2 Combinations

The number of ways in which $r$ items can be selected from $n$ distinct items, disregarding the order of selection, is called the number of combinations of $n$ items taken $r$ at a time, and is denoted by $^nC_r$ or $\binom{n}{r}$. Each combination of $r$ items can be arranged in $r!$ different orders, so $r!$ permutations correspond to the same combination. Thus we have:

\[^nC_r = \frac{^nP_r}{r!} = \frac{n!}{(n-r)!\, r!}\]

For example, the number of combinations of the letters $a$, $b$, $c$ taken two at a time is $^3C_2 = \frac{6}{2!} = 3$. These are: $ab$, $ac$, $bc$.

Important note: $ab$ is the same combination as $ba$ but not the same permutation.
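
The two counting formulas can be illustrated with the letters $a$, $b$, $c$ from the examples above. The sketch below uses Python's standard `itertools` and `math` modules. It assumes Python 3.8 or later, where `math.perm` and `math.comb` are available.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

letters = "abc"

# Ordered selections of two letters: order matters.
perms = ["".join(p) for p in permutations(letters, 2)]
# Unordered selections of two letters: order is disregarded.
combs = ["".join(c) for c in combinations(letters, 2)]

print(perms, perm(3, 2))   # listing and count of permutations
print(combs, comb(3, 2))   # listing and count of combinations

# Each combination of r items accounts for r! permutations.
assert perm(3, 2) == comb(3, 2) * factorial(2)
```

The listing contains both `ab` and `ba` among the permutations but only `ab` among the combinations, which is the distinction made in the note above.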

Example 11

An inspector draws a sample of five items from a batch of a hundred valves which are numbered from one to a hundred to distinguish them. How many distinct samples can he choose? If there is one defective item in the batch, how many distinct samples of size five can be drawn which contain the defective item? What is the probability of drawing a sample which contains the defective item?

Solution

As we are not concerned with the order in which the valves are selected, the total number of distinct samples is the number of combinations of five from a hundred, i.e., $^{100}C_5$. In the second case one of the items is fixed and we need to find the total number of ways of selecting four more items from the remaining ninety-nine valves. This is $^{99}C_4$. Thus the probability of drawing a sample which contains the defective item is:

\[\frac{^{99}C_4}{^{100}C_5} = \frac{5}{100} = \frac{1}{20}\]

This probability can of course be written down straight away by noting that any valve has a probability of $\frac{5}{100}$ of being in any particular sample.
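
The counts in Example 11 are too large to evaluate comfortably by hand, so a brief numerical check is sketched here. It relies only on `math.comb` from the standard library (Python 3.8 or later is assumed). The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total_samples = comb(100, 5)            # all samples of five valves
samples_with_defective = comb(99, 4)    # defective valve fixed, four more chosen

p_defective_in_sample = Fraction(samples_with_defective, total_samples)
print(total_samples, samples_with_defective, p_defective_in_sample)
```

The ratio reduces to the same value as the direct argument that each valve has probability $\frac{5}{100}$ of appearing in a given sample.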

6.7 Exercise

(1) Let $A$ and $B$ be events in $S$ such that $P(A) = \frac{1}{2}$, $P(B) = \frac{3}{8}$, and $P(A \cup B) = \frac{3}{4}$. Evaluate the following:

(i) $P(A \cap B)$ $\quad$ (ii) $P(A' \cap B')$ $\quad$ (iii) $P(A' \cup B')$ $\quad$ (iv) $P(A' \cap B)$

(2) Let a die be weighted so that the probability of a number appearing when the die is tossed is proportional to the given number. Let $A = \{\text{even number}\}$, $B = \{\text{prime number}\}$, and $C = \{\text{odd number}\}$.

Find the probability of each point. $\left[\frac{1}{21}, \frac{2}{21}, \ldots\right]$
Find $P(A)$, $P(B)$ and $P(C)$. $\left[P(A) = \frac{12}{21},\ P(B) = \frac{10}{21},\ P(C) = \frac{9}{21}\right]$
Find the probability that an even or prime number occurs. $\left[\frac{20}{21}\right]$

(3) In a class of 125 students, 60 students take mathematics, 55 take physics, and 40 take chemistry. 30 of them take physics and mathematics, 15 take chemistry and physics, and 25 take chemistry and mathematics. 10 students take all the three subjects. One student is selected at random from the class. Find the probability that he takes none of the three subjects.

(4) A box contains 12 items of which four are defective. Three items are drawn at random from the box without replacement. Find the probability that:

All the items drawn are non-defective. $\left[\frac{14}{55}\right]$
The last item drawn is defective. $\left[\frac{1}{3}\right]$

(5) A box contains 12 items of which four are defective. Three items are drawn at random from the box with replacement. Find the probability that:

All the items drawn are non-defective. $\left[\frac{8}{27}\right]$
The last item drawn is defective. $\left[\frac{1}{3}\right]$
Only one item drawn is defective. $\left[\frac{4}{9}\right]$

(6) Two fair dice are rolled one time. Given that the sum of the two numbers that occurred was at least 7, compute the probability that it was equal to $i$, for $i = 7, 8, 9, 10, 11, 12$.

\[\left[\frac{6}{21},\ \frac{5}{21},\ \frac{4}{21},\ \frac{3}{21},\ \frac{2}{21},\ \frac{1}{21}\right]\]

(7) Three students, Joe, Hugh, and Rachael, are each given the same problem to solve. They work on the problem independently and have probabilities 0.8, 0.7, and 0.6 of solving it, respectively.

What is the probability that none of them solve the problem? [0.024]
What is the probability that the problem will be solved (by one or more of them)? [0.976]
Granted the problem was solved, what is the probability that the solution is due to Rachael alone? [0.037]

(8) An urn contains 2 black balls and 5 brown balls. A ball is selected at random. If the ball drawn is brown, it is replaced and 2 additional brown balls are also put into the urn. If the ball is black, it is not replaced in the urn and no additional balls are added. A ball is then drawn from the urn the second time. What is the probability that it is brown? [$\frac{50}{63} \approx 0.7937$]

6.8 Bayes’ Theorem

The events $E_1, E_2, \ldots, E_n$ are called a partition of the sample space $S$ if $E_i \cap E_j = \phi$ for all $i \neq j$ and $E_1 \cup E_2 \cup \cdots \cup E_n = S$. Thus a partition cuts the whole sample space into mutually exclusive pieces. Figure 1 below gives a Venn diagram with $n = 7$ events in the partition.

If $A \subset S$ is any event and $E_1, E_2, \ldots, E_n$ is a partition of $S$, then $E_1, E_2, \ldots, E_n$ also partition $A$; that is:

\[A = (A \cap E_1) \cup (A \cap E_2) \cup \cdots \cup (A \cap E_n)\]

and $(A \cap E_i) \cap (A \cap E_j) = \phi$ for all $i \neq j$.

It then follows that we can write:

\[P(A) = P(A \cap E_1) + P(A \cap E_2) + \cdots + P(A \cap E_n)\]

a result known as the theorem of total probability.

Example 1 (Total Probability)

A calculator manufacturer buys the same integrated circuit from three different suppliers, call them I, II, and III. From past experience, 1% of the circuits supplied by I have been defective, 3% of those supplied by II have been defective and 4% of those supplied by III have been defective. Granted that this manufacturer buys 30% of his circuits from I, 50% from II, and the rest from III, use the theorem of total probability to compute the probability that an integrated circuit, checked just before final assembly into a calculator, is found to be defective.

Solution

Let $A$ be the event that the circuit is found defective and let $E_1, E_2, E_3$ be the events that the circuit selected was supplied by I, II, III, respectively:

\[P(E_1) = 0.3,\quad P(E_2) = 0.5,\quad P(E_3) = 0.2\] \[P(A/E_1) = 0.01,\quad P(A/E_2) = 0.03,\quad P(A/E_3) = 0.04\]

Thus:

\[P(A \cap E_1) = 0.003,\quad P(A \cap E_2) = 0.015,\quad P(A \cap E_3) = 0.008\]

\[P(A) = 0.003 + 0.015 + 0.008 = 0.026\]
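
The total-probability calculation in Example 1 can be reproduced in a few lines. The sketch below assumes the supplier shares and defect rates stated in the example. The dictionary names `priors` and `defect_rates` are illustrative only.

```python
from fractions import Fraction

# P(E_i): share of circuits bought from each supplier.
priors = {"I": Fraction("0.3"), "II": Fraction("0.5"), "III": Fraction("0.2")}
# P(A / E_i): probability a circuit from that supplier is defective.
defect_rates = {"I": Fraction("0.01"), "II": Fraction("0.03"), "III": Fraction("0.04")}

# P(A ∩ E_i) = P(E_i) P(A / E_i) for each piece of the partition.
joint = {s: priors[s] * defect_rates[s] for s in priors}

# Theorem of total probability: sum over the partition.
p_defective = sum(joint.values())
print({s: float(v) for s, v in joint.items()}, float(p_defective))
```

Summing the three joint probabilities 0.003, 0.015 and 0.008 gives $P(A) = 0.026$.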

The theorem of total probability can be used to establish Bayes’ Theorem, named after the Reverend Thomas Bayes.

Bayes’ Theorem is stated as follows:

Let $E_1, E_2, \ldots, E_n$ be a partition of $S$. Then for any event $A \subset S$:

\[P(E_i / A) = \frac{P(E_i)\, P(A/E_i)}{\displaystyle\sum_{j=1}^{n} P(E_j)\, P(A/E_j)}, \quad i = 1, 2, \ldots, n\]

Proof

By definition:

\[P(E_i / A) = \frac{P(E_i \cap A)}{P(A)}\]

and since $P(E_i \cap A) = P(E_i)\, P(A/E_i)$, and:

\[P(A) = \sum_{j=1}^{n} P(A \cap E_j) = \sum_{j=1}^{n} P(E_j)\, P(A/E_j)\]

the result follows immediately.

Example 2 (Bayes’ Theorem)

Assume that the probability is 0.95 that the jury selected to try a criminal case will arrive at the appropriate verdict. That is, given a guilty defendant on trial, the probability is 0.95 that the jury will find him guilty and, conversely, given an innocent man on trial, the probability is 0.95 that the jury will find him innocent. Suppose that the local police force is quite diligent in its duties and that 99% of the people brought before the court are actually guilty. Compute the probability that a defendant is innocent, given that the jury finds him innocent.

Solution

Let $G$ be the event that the defendant is guilty and let $J$ be the event that the jury finds him guilty.

Then we are given that $P(J/G) = P(\bar{J}/\bar{G}) = 0.95$, $P(G) = 0.99$, and we want to compute $P(\bar{G}/\bar{J})$.

$G$ and $\bar{G}$ form a partition of the sample space, so from Bayes’ Theorem:

\[P(\bar{G}/\bar{J}) = \frac{P(\bar{J}/\bar{G})\, P(\bar{G})}{P(\bar{J}/G)\, P(G) + P(\bar{J}/\bar{G})\, P(\bar{G})} = \frac{(0.95)(0.01)}{(0.05)(0.99) + (0.95)(0.01)} = 0.161\]

Thus there is about 1 chance in 6 he really is innocent, if found innocent, and about 5 chances in 6 he is guilty, although found innocent.

\[P(G/\bar{J}) = \frac{P(\bar{J}/G)\, P(G)}{P(\bar{J}/G)\, P(G) + P(\bar{J}/\bar{G})\, P(\bar{G})} = \frac{(0.05)(0.99)}{(0.95)(0.01) + (0.05)(0.99)} = 0.839\]
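
Bayes’ Theorem translates directly into a short routine: multiply each prior by its likelihood, then divide by the sum. The following sketch applies it to the jury example. The function name `bayes_posterior` and the ordering of the partition as (innocent, guilty) are assumptions made here for illustration.

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(E_i / A) for a partition, given P(E_i) and P(A / E_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # P(A), by the theorem of total probability
    return [j / total for j in joint]

# Partition: (innocent, guilty). Evidence: the jury finds the defendant innocent.
priors = [0.01, 0.99]          # P(not G), P(G)
likelihoods = [0.95, 0.05]     # P(not J / not G), P(not J / G)

p_innocent, p_guilty = bayes_posterior(priors, likelihoods)
print(round(p_innocent, 3), round(p_guilty, 3))
```

With these inputs the sketch prints `0.161 0.839`, and the two posterior probabilities sum to one because $G$ and $\bar{G}$ partition the sample space.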

It is not, at first glance, easy to grasp the meaning of Bayes’ Theorem. Note that it gives the probability of occurrence of $E_i$, one of the events in a partition, given an event $A$ has occurred, which seems in a sense backward. Some of its original uses were concerned with $E_1, E_2, \ldots, E_n$, which represented various possible mutually exclusive theories of how the world was created or how it reached its current state; the values $P(E_i)$ were called prior probabilities.

The event $A$ represents some event that is known to have occurred, such as a recorded history to a given point in time. Bayes’ theorem then shows how to evaluate $P(E_i/A)$, the conditional probability that theory $i$ is correct (called the posterior probability), given the occurrence of $A$.

Bayes’ theorem is well suited to adaptive schemes that use data or experience to modify prior beliefs. Criticism of Bayesian procedures generally centres on the assumption that $P(E_i)$ and $P(A/E_i)$ are necessarily known. If they are not, of course, it is not possible to employ the theorem.

Example 3 (Bayes’ Theorem — Repeated Application)

Let us assume the same situation as Example 1: 30% of the integrated circuits are supplied by I, 50% by II, and 20% by III, and the probabilities of defects for these suppliers are $P(A/E_1) = 0.01$, $P(A/E_2) = 0.03$, $P(A/E_3) = 0.04$.

Now suppose that an unlabelled box of these integrated circuits is found. It is known only that all the integrated circuits in the box came from the same supplier; which supplier is unknown. Without testing any of the circuits, it would seem reasonable to assume that the probabilities that the box came from each of the three suppliers are $P(E_1) = 0.3$, $P(E_2) = 0.5$, $P(E_3) = 0.2$, respectively, because these are the proportions of circuits purchased from the three suppliers. If one circuit is selected from the box and tested, use Bayes’ theorem to compute new probabilities that the box came from each of the three suppliers, given the result of testing the circuit.

Solution

If the circuit is found defective:

\[P(E_1/A) = \frac{(0.3)(0.01)}{(0.3)(0.01) + (0.5)(0.03) + (0.2)(0.04)} = 0.115\]

\[P(E_2/A) = \frac{(0.5)(0.03)}{(0.3)(0.01) + (0.5)(0.03) + (0.2)(0.04)} = 0.577\]

\[P(E_3/A) = \frac{(0.2)(0.04)}{(0.3)(0.01) + (0.5)(0.03) + (0.2)(0.04)} = 0.308\]

If the circuit is found non-defective:

\[P(E_1/\bar{A}) = \frac{(0.3)(0.99)}{(0.3)(0.99) + (0.5)(0.97) + (0.2)(0.96)} = 0.305\]

\[P(E_2/\bar{A}) = \frac{(0.5)(0.97)}{(0.3)(0.99) + (0.5)(0.97) + (0.2)(0.96)} = 0.498\]

\[P(E_3/\bar{A}) = \frac{(0.2)(0.96)}{(0.3)(0.99) + (0.5)(0.97) + (0.2)(0.96)} = 0.197\]

Thus the posterior probabilities that the box was supplied by I, II, or III are changed more from their initial unconditional values by finding the circuit defective than they are by finding the circuit non-defective.

Bayes’ theorem can be applied more than once, with the posterior probabilities from one test serving as the prior probabilities for the next. If one circuit has already been selected, tested, and found defective, the probabilities that the box came from suppliers I, II, and III are:

\[P(E_1) = 0.115, \quad P(E_2) = 0.577, \quad P(E_3) = 0.308\]

If a second item is selected and found defective, we have:

\[P(E_1/A) = \frac{(0.115)(0.01)}{(0.115)(0.01) + (0.577)(0.03) + (0.308)(0.04)} = 0.037\]

\[P(E_2/A) = \frac{(0.577)(0.03)}{(0.115)(0.01) + (0.577)(0.03) + (0.308)(0.04)} = 0.562\]

\[P(E_3/A) = \frac{(0.308)(0.04)}{(0.115)(0.01) + (0.577)(0.03) + (0.308)(0.04)} = 0.400\]

which would be the probabilities that the box came from the three suppliers, given two circuits were tested and both found defective.
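
The repeated application in Example 3 amounts to feeding each posterior back in as the next prior. A minimal sketch follows, assuming that successive test results are independent once the supplier is fixed (the example implies this but does not state it). The function name `update` is illustrative. Exact fractions are used, so no intermediate rounding occurs.

```python
from fractions import Fraction

def update(priors, likelihoods):
    """One application of Bayes' theorem over a partition."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

priors = [Fraction(3, 10), Fraction(5, 10), Fraction(2, 10)]       # suppliers I, II, III
p_defect = [Fraction(1, 100), Fraction(3, 100), Fraction(4, 100)]  # P(A / E_i)

after_one = update(priors, p_defect)      # one defective circuit observed
after_two = update(after_one, p_defect)   # second defective circuit observed

print([float(p) for p in after_one])
print([float(p) for p in after_two])
```

Carried out exactly, the second update gives 3/80, 9/16 and 2/5, that is 0.0375, 0.5625 and 0.4. These agree with the three-decimal figures in the example up to the rounding of the intermediate posteriors.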

Exercise (Bayes’ Theorem)

(1) Suppose that medical science has a cancer-diagnostic test that is 95% accurate on both those who do and those who do not have cancer. If 0.005 of the population actually does have cancer, compute the probability that a particular individual has cancer, given that the test says he has cancer. [0.087]

(2) Two different suppliers, A and B, provide a manufacturer with the same part. All supplies of this part are kept in a large bin. In the past, 5% of the parts supplied by A and 9% of the parts supplied by B have been defective. A supplies four times as many parts as B. Suppose you reach into the bin and select a part, and find it is non-defective. What is the probability that it was supplied by A? [0.807]

End of Topic — Probability

	Mean	Variance	Size
Sample 1	\(55\ (\bar{X}_1)\)	\(100\ (s_1^2)\)	\(100\ (n_1)\)
Sample 2	\(50\ (\bar{X}_2)\)	\(150\ (s_2^2)\)	\(100\ (n_2)\)


46	58	54	52	55	59	52	62	65	67
64	63	77	78	92	6	7	12	18	16
3	23	25	25	27	81	88	24	29	22
34	33	30	37	36	42	48	28	22	28
17	13	70	37	32	36	41	40	43	44


13	7	12	6	34	14	47	25	45	2
13	26	10	8	1	14	41	10	3	21
8	13	28	24	16	19	4	7	36	37
20	15	16	15	17	31	17	3	11	46
24	8	40	17	18	12	27	16	4	14
23	9	29	12	2	6	12	18	9	16


32	46	25	57	39	45	55	42	20	36
58	12	38	34	22	40	33	64	43	46
31	40	52	29	14	57	66	36	32	48
46	42	47	54	65	44	35	19	54	25
23	33	38	45	32	38	41	42	58	43


74	66	65	55	48	56	50	75	75	67
76	68	50	65	60	65	60	68	68	76
68	77	63	65	52	52	63	80	80	70
65	81	70	63	45	45	65	71	71	64
55	70	64	45	64	64	40	55	55	71


3.0	3.4	4.1	4.1	4.3	2.7	3.5	3.7	3.4	3.4
3.8	4.2	3.1	3.9	3.1	4.1	2.8	3.7	4.4	3.5
3.5	3.4	3.7	3.7	2.8	4.3	3.8	3.4	4.1	3.0
4.4	4.1	4.1	3.6	3.4	2.7	3.6	3.0	3.4	4.3
3.8	3.2	4.2	3.9	4.2	3.4	2.9	4.4	3.5	3.9


17	25	21	18	14	15	24	22	15	21	25
17	25	15	18	17	29	16	24	39	30	23
23	27	43	28	29	15	15	19	32	30	32
23	13	18	13	27	32	17	17	25	25	30
20	18	17	33	28	27	26	32	32	33	19

