September 3, 2019

Learning Outcomes

At the end of this lesson, students should be able to:

  1. differentiate all types of variables.
  2. understand the scales of measurement.
  3. describe and calculate the properties of data.

Outlines

  • Introduction
  • Types of Variables
  • Scales of Measurements
  • Population & Samples
  • Measures of Central Tendency
  • Measures of Dispersion
  • Properties of Distribution

What is statistical analysis?

  • Collection, analysis, interpretation and presentation of data to discover its underlying causes, patterns, relationships and trends.
  • Two major branches of statistics:

    1. Descriptive statistics
    2. Inferential statistics

Descriptive Statistics

  • Procedures used to summarise and organise a set of scores or observations in a meaningful way.
  • Typically presented graphically, in tabular form (in tables), or as summary statistics (single values).
  • Descriptive statistics do not allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made.

Descriptive Statistics (cont.)

  • Allow simpler interpretation of data.
  • Used when the intent is to describe the data that they actually collected.
  • Example:
    • A clinical psychologist conducted a study in which she gave some of her clients a new depression treatment and she wanted to describe the average depression score of only those clients who got the treatment.

Inferential Statistics

  • Procedures that allow researchers to infer or generalize observations made with samples to the larger population from which they are selected.
  • Infer the value of a population parameter from a sample statistic.
  • Determine the probability of characteristics of population based on the characteristics of the sample.

Inferential Statistics (cont.)

  • Help assess the strength of the relationship between independent (causal) variables and dependent (effect) variables.
  • Examples:
    • A clinical psychologist conducted a study in which she gave some of her clients a new depression treatment and she wanted to estimate what the results would be if she were to give the same treatment to additional clients.

Statistical analysis

Types of Variables

  • Variables are measurements or observations that are typically numeric.
  • Three categories of variables:

    1. Independent VS Dependent
    2. Continuous VS Discrete
    3. Quantitative VS Qualitative

Independent VS Dependent

  1. Independent variable (IV)
    • Variable with two or more levels that are expected to have effects on another variable. (affect/change other outcome variable)
    • Sometimes called a predictor or experimental variable.
  2. Dependent variable (DV)
    • The outcome variable that is used to compare the effects of the different independent variable (IV) levels.

Independent VS Dependent (cont.)

  • Example:
In an experiment to study the effects of a new treatment on reducing depressive symptoms, researchers gave the new treatment to one sample of people with depression and withheld it from another sample of people with depression (i.e. new treatment vs no treatment). After both samples had received their respective treatment levels, the amount of depression in each sample was compared by counting the number of depressive symptoms.

Independent VS Dependent (cont.)

  1. Independent variable (IV): the treatment, with two levels
    • The new treatment
    • No treatment
  2. Dependent variable (DV)
    • Amount of depression (number of depressive symptoms)

Continuous VS Discrete

  • A continuous variable is measured along a continuum.
    • measured in whole units or fractional units.
    • e.g.: between heights of 35 cm and 36 cm, an infinite number of heights is possible (35.1 cm, 35.25 cm, and so on).
  • A discrete variable is not measured along a continuum.
    • measured in whole units or categories.
    • e.g.: The number of your siblings and your family’s socioeconomic class (working class, middle class, upper class)

Quantitative VS Qualitative

  • Quantitative data is information about quantities; that is, information that can be measured and written down with numbers.
    • varies by amount
    • e.g.: height, shoe size, and the length of your fingernails.
  • Qualitative data is information about qualities; information that cannot be easily measured, but can be observed subjectively.
    • varies by class
    • e.g.: smell, taste, texture, attractiveness, color.

Quantitative Types of Data

  • Measured in numeric units, so both continuous and discrete variables can be quantitative.
  • e.g.: we can measure food intake in calories (a continuous variable) or we can count the number of pieces of food consumed (a discrete variable)

Qualitative Types of Data

  • Only discrete variables can fall into this category.
  • e.g.: socioeconomic class (working class, middle class, upper class), mental disorders/depression (unipolar/bipolar) or drug use (none, experimental, abusive)

Scales of Measurement

  • Important to determine the kind of statistical procedures that can be used on that variable.
  • Four different scales of measurement:

    1. Nominal
    2. Ordinal
    3. Interval
    4. Ratio

Nominal

  • Known as categorical data.
  • Cannot be measured, because of their qualitative nature.
  • Do not denote quantity and have no inherent order.
  • Categorise individuals into groups that are qualitatively different from other groups.
  • Usually these categorical variables are coded (converted into numbers).
    • e.g. a person’s race (Malay [1], Chinese [2], Indian [3]), gender (female [1], male [2]), marital status (single [1], married [2])

Nominal (cont.)

  • Sometimes, data that could be expressed on a different scale of measurement (ordinal, interval, ratio) are recorded in nominal categories.
  • E.g.
    • Marks from 50.0 to 100.0 = PASS
    • Marks from 0.0 to below 50.0 = FAIL

Ordinal

  • An ordinal scale of measurement is one that conveys order alone (no indication of how much more).
  • Only indicates that some value is greater or less than another value, so differences between ranks do not have meaning.
  • E.g.:
    • level of education (PhD, MSc, Bachelor’s)
    • level of satisfaction (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied)
    • degree of illnesses (mild, moderate, severe).

Interval

  • Interval scales are measurements where the values have no true zero and the distance between each value is equidistant.
    • True zero – values where the value 0 truly indicates nothing.
    • Equidistant – those values whose intervals are distributed in equal units.
  • E.g.: temperature in Celsius or Fahrenheit. The difference between 5°F and 3°F is the same as that between 8°F and 6°F (equidistant), but 0°F is not the absence of heat.

Ratio

  • Similar to interval scales in that scores are distributed in equal units (equidistant).
  • Unlike interval scales, a distribution of scores on a ratio scale has a true zero (the absence of the quantity being measured).
  • E.g: salary amount, duration of drug abuse (in years), score (from 0 to 100%) on an exam, weight (in pounds) of an infant.

Scales of Measurement (Example)

Types of Data (Exercise)

State whether the variable is continuous or discrete, and quantitative or qualitative.

Types of Data (Answer)

Types of Data (cont.)

Exercises (Types of Variables)

(1) Name the variable being measured, (2) state whether it is continuous or discrete, and (3) state whether the variable is quantitative or qualitative.

  1. A researcher records the month of birth among patients with schizophrenia.
  2. A professor records the number of students absent during final exam.
  3. A researcher asks children to choose which type of cereal they prefer (one with a toy inside or one without). He records the choice of cereal.
  4. A therapist measures the time (in hours) that clients continue a recommended program of counseling.

Population VS Sample

  • Population is a group of all things that share a set of characteristics.
    • Population parameter – the value that would be obtained if the entire population were actually studied.
  • Sample is a subset of population that is intended to represent the population.
    • Sample statistic – the value obtained from the sample. It is used to estimate the population parameter value.

Population VS Sample (cont.)

  • A set of data is a sample from our population; the sample is a subset of the population.
  • Statistical inference is the process we use to draw conclusions about a population from data.
  • Inference involves using statistics we calculate from the sample to make an informed guess about the population.
  • When we do statistical inference we are interested in drawing conclusions from a set of data (sample) so that we can estimate population parameters.

Statistical Inference

Population VS Sample (cont.)

Describing Distributions

  1. Measures of Central Tendency
  2. Measures of Dispersion

Measures of Central Tendency

  • Also known as measures of central location (locate central distribution).
  • Three kinds of averages of a data set to answer “where do the data center?”
  • Measures include:

    1. Mean
    2. Mode
    3. Median

The Mean

  • The usual “average” that is familiar to everyone.
  • Add up all the numbers (\(\sum{X}\)) and divide by how many numbers there are (N for a population or n for a sample).
  • Formula:
    • Sample mean: \[\bar{X}=\frac{\sum_{i=1}^n X_i}{n}\]
    • Population mean: \[\mu=\frac{\sum_{i=1}^N X_i}{N}\]

The Mean (example)

The reduction in blood pressure (mmHg) in 6 patients 4 hours after administration of a standard dose of a novel antihypertensive agent is shown in Table 1.1. Calculate the mean reduction in blood pressure in the 6 patients.

Table 1.1 Effect of an antihypertensive drug on blood pressure lowering in six patients
Patient number    Reduction in BP (mmHg)
1                 20
2                 25
3                 21
4                 34
5                 41
6                 37

The Mean (example)

  • Substituting the figures from Table 1.1 into the equation for the mean, we obtain:

\[\begin{aligned} \bar{X} &= (20 + 25 + 21 + 34 + 41 + 37)/6 \\ &= 178/6 \\ &= 29.67 \text{ mmHg} \end{aligned}\]
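The same arithmetic can be checked with a few lines of Python. This is only a sketch of the sample mean formula; the list name `reductions` is an illustrative choice, and the values are the six figures substituted in the calculation above.

```python
# Reductions in blood pressure (mmHg), as substituted in the calculation above.
reductions = [20, 25, 21, 34, 41, 37]

n = len(reductions)          # number of patients (n)
total = sum(reductions)      # sum of all X_i
mean_reduction = total / n   # sample mean: sum of X_i divided by n

print(f"sum = {total}, n = {n}, mean = {mean_reduction:.2f} mmHg")
# sum = 178, n = 6, mean = 29.67 mmHg
```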

The Weighted Mean

  • Not every data point contributes equally to the overall calculation of the mean.
  • Data are divided into groups, each of which carries a different weight.
  • Formula (\(w_j\) is the weight of group \(j\), e.g. its frequency, and \(k\) is the number of groups):
    \[\bar{X}_w=\frac{\sum_{j=1}^{k} w_j X_j}{\sum_{j=1}^{k} w_j}\]

The Weighted Mean (Example)

The effect of a defined dose of a commercially available analgesic to suppress pain following a painful stimulus was evaluated in 20 volunteers using an analogue scale (Table 1.2). Calculate the mean of the pain assessment by the 20 volunteers.

Table 1.2 Recorded assessment of pain by 20 volunteers following administration of analgesic and exposure to a painful stimulus
Number of volunteers    Pain assessment by volunteers
2                       3 (extreme pain)
12                      2 (moderate pain)
6                       1 (slight pain)

The Weighted Mean (Example)

  • Substituting the figures from Table 1.2 into the equation for the weighted arithmetic mean, we obtain:

\[\begin{aligned} \bar{X}_w &= \frac{(2 \times 3) + (12 \times 2) + (6 \times 1)}{20} \\ &= 36/20 \\ &= 1.8 \end{aligned}\]
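As a rough check, the weighted mean for Table 1.2 can be computed by treating the number of volunteers in each group as the weight. The variable names below are illustrative and not part of the original example.

```python
# (number of volunteers, pain score) pairs taken from Table 1.2.
groups = [(2, 3), (12, 2), (6, 1)]

# Numerator: each score multiplied by its weight (the number of volunteers).
weighted_sum = sum(count * score for count, score in groups)
# Denominator: the sum of the weights, i.e. the total number of volunteers.
total_weight = sum(count for count, _ in groups)

weighted_mean = weighted_sum / total_weight
print(weighted_sum, total_weight, weighted_mean)  # 36 20 1.8
```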

The Weighted Mean (Frequency Distribution)


Diameter (mm)    Frequency (f)    Midpoint (x)    f.x
35-39            6                37              222
40-44            12               42              504
45-49            15               47              705
50-54            10               52              520
55-59            7                57              399
Total            50                               2350


  • Mean \(= 2350 / 50 = 47\) mm
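The frequency-distribution calculation can be sketched in Python as follows. The approach assumes, as the table does, that every observation in a class sits at the class midpoint; the names `classes`, `total_f` and `total_fx` are illustrative.

```python
# ((lower limit, upper limit), frequency) for each diameter class in mm.
classes = [((35, 39), 6), ((40, 44), 12), ((45, 49), 15),
           ((50, 54), 10), ((55, 59), 7)]

total_f = 0    # running sum of frequencies
total_fx = 0   # running sum of frequency * midpoint
for (lower, upper), freq in classes:
    midpoint = (lower + upper) / 2   # class midpoint x
    total_f += freq
    total_fx += freq * midpoint

grouped_mean = total_fx / total_f
print(total_f, total_fx, grouped_mean)  # 50 2350.0 47.0
```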

The Median

  • An alternative method of describing the central nature of data.
  • Relatively unaffected by extreme values in the data.
  • Is the middle number. It is found by putting the numbers in order and taking the actual middle number if there is one, or the average of the two middle numbers if not.

The Median (Example)

A random sample of the yearly incomes of 7 employees (in thousands of dollars, rounded to the nearest hundred dollars):


24.8 22.8 24.6 192.5 25.2 18.5 23.7

The mean (rounded to 1 decimal place) is 47.4, but the statement that the average income of the 7 employees is $47,400 is certainly misleading.

The Median (Outliers)

24.8 22.8 24.6 192.5 25.2 18.5 23.7


  • The number 192.5 is called an outlier (a value far removed from most or all of the remaining measurements).
  • An outlier is often the result of some sort of error (but not always).
  • Mean is sensitive to extreme values.
  • So, a better measure of the “center” of the data can be obtained if we were to arrange the data in numerical order.

The Median (Outliers)

  • The order
    18.5 22.8 23.7 24.6 24.8 25.2 192.5


  • Then select the middle number in the list, in this case 24.6.
  • In this sense, it locates the center of the data.

The Median (Outliers)

  • If there is an even number of measurements in the data set, there will be two middle elements -> take the mean of the middle two as the median.
  • Example:
    18.5 22.8 23.7 24.6 24.8 25.2 28.9 192.5


  • Median: (24.6 + 24.8) / 2 = 24.7
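Both cases (odd and even counts) can be reproduced with Python's standard `statistics` module, which sorts the data internally. The list names are illustrative; the values are the incomes from the two examples above.

```python
import statistics

# Seven incomes (odd count) and the same list with 28.9 added (even count).
incomes_odd = [24.8, 22.8, 24.6, 192.5, 25.2, 18.5, 23.7]
incomes_even = incomes_odd + [28.9]

median_odd = statistics.median(incomes_odd)    # the single middle value
median_even = statistics.median(incomes_even)  # mean of the two middle values

print(f"{median_odd:.1f} {median_even:.1f}")  # 24.6 24.7
```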

The Mode

  • The easiest measure of the average.
  • Defined as the item of data with the highest frequency.
  • Most frequently occurring number.
  • For any data set there is always exactly one mean and exactly one median.
  • However, several different values could occur with the highest frequency.

The Mode (cont.)

  • Data set 1:
    -1 0 2 0

  • The mode of this data set is 0.

  • Data set 2:
    2 2 3 1 1 5

  • The two most frequently observed values in this data set are 1 and 2. Therefore the mode is a set of two values: {1, 2}.
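A small sketch of how both data sets could be handled in Python: `statistics.multimode` (available from Python 3.8) returns every value that ties for the highest frequency, so it covers the one-mode and two-mode cases alike. The variable names are illustrative.

```python
import statistics

data_set_1 = [-1, 0, 2, 0]
data_set_2 = [2, 2, 3, 1, 1, 5]

# multimode returns a list of all values sharing the highest frequency;
# sorting only makes the output order predictable.
modes_1 = sorted(statistics.multimode(data_set_1))
modes_2 = sorted(statistics.multimode(data_set_2))

print(modes_1, modes_2)  # [0] [1, 2]
```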

The Mode, Median and Mean (Example)

Weight of luggage presented by airline passengers at check-in (measured to the nearest kg).

18 23 20 21 24 23 20 20 15 19 24

Calculate the mode, median and mean.
  • Mode: 20

  • Median: 20
    • Put the numbers in order first, then take the middle number (odd count) or the average of the two middle numbers (even count).
15 18 19 20 20 20 21 23 23 24 24


  • Mean = (15+18+19+20+20+20+21+23+23+24+24) / 11 = 227 / 11 = 20.64
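For comparison, all three measures for the luggage weights can be obtained from Python's standard `statistics` module. This is a sketch for checking the hand calculation; the list name `weights` is an illustrative choice.

```python
import statistics

# Luggage weights (kg) from the example above, in the order recorded.
weights = [18, 23, 20, 21, 24, 23, 20, 20, 15, 19, 24]

mode = statistics.mode(weights)      # most frequent value
median = statistics.median(weights)  # middle value after sorting
mean = statistics.mean(weights)      # sum divided by count

print(mode, median, round(mean, 2))  # 20 20 20.64
```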

When not to use the mean?

  • Mean is good for dataset that is evenly spread.
  • For normally distributed data, the mean and median should be (approximately) identical and either may be used to describe the central point.
  • The use of the median is preferable for distributions that possess extreme values (the mean is unacceptably distorted).
Staff     1     2     3     4     5     6     7     8     9     10
Salary    15k   18k   16k   14k   15k   15k   12k   17k   90k   95k


  • The mean salary is 30.7k, although eight of the ten staff earn 18k or less; the median is 15.5k.
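The distortion can be made concrete by computing both measures for the salary table. A minimal sketch, with salaries entered in thousands and illustrative variable names:

```python
import statistics

# Salaries of the ten staff members, in thousands (k).
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries_k)      # pulled upward by 90k and 95k
median_salary = statistics.median(salaries_k)  # mean of the 5th and 6th sorted values

# Mean of the eight salaries that are not extreme, for comparison.
mean_without_extremes = statistics.mean(s for s in salaries_k if s < 90)

print(mean_salary, median_salary, mean_without_extremes)  # 30.7 15.5 15.25
```

The median stays close to what most staff earn, while the mean is roughly double that because of two values.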

Measures of Dispersion

  • Also known as measures of variation
  • How spread out are the data?
  • Describing quantitative data will not be complete without knowing how observed values are spread out from the average.
  • E.g: two classes who sat the same exam might have the same mean mark but the marks may vary in a different pattern around this.
  • Measures include:

    1. Range
    2. Variance
    3. Standard Deviation (SD)

Measures of Dispersion (Example)

  • Example:

    Table 1.3: Individual values associated with two sets of data possessing identical means.
    Set A       Set B
    10          28
    20          29
    30          30
    40          31
    50          32
    Mean: 30    Mean: 30

Range

  • Very simple measure of dispersion.
  • Difference between the maximum value and the minimum value (max-min).
  • Example
    1. Marks on test A

    2. Marks on test B

Range Calculation

  1. Marks on test A

  2. Marks on test B

    On test A, the range of marks is 70-45=25.
    On test B, the range of marks is 65-45=20.

Range (another example)

  1. Marks on test C


  • On test C, the range of marks is 75-40=35.

Variance

  • The average of the squared deviations from the mean.
  • The sample variance of a set of n sample data is the number (\(s^2\)) defined by the formula: \[s^2=\frac{\sum(X-\bar{X})^2}{n-1}\]
  • The population variance (\(\sigma^2\)) formula : \[\sigma^2=\frac{\sum(X-\mu)^2}{N}\]
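The only difference between the two formulas is the divisor. The sketch below applies both to the luggage weights from the earlier example and cross-checks them against `statistics.variance` (divides by n - 1) and `statistics.pvariance` (divides by N). Treating the same list first as a sample and then as a whole population is purely for illustration.

```python
import math
import statistics

weights = [18, 23, 20, 21, 24, 23, 20, 20, 15, 19, 24]

n = len(weights)
mean = sum(weights) / n
# Sum of squared deviations from the mean (the shared numerator).
sum_sq_dev = sum((x - mean) ** 2 for x in weights)

sample_variance = sum_sq_dev / (n - 1)   # s^2: divide by n - 1
population_variance = sum_sq_dev / n     # sigma^2: divide by N

# Cross-check against the standard library implementations.
matches_sample = math.isclose(sample_variance, statistics.variance(weights))
matches_population = math.isclose(population_variance, statistics.pvariance(weights))

print(round(sample_variance, 3), round(population_variance, 3))
print(matches_sample, matches_population)
```

For the same data the sample variance is always the larger of the two, because it divides the same total by a smaller number.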

Standard Deviation

  • Measure of variation (deviation) of all values from the mean.
  • Positive square root of the variance.
  • Properties include:
    • The value is never negative.
    • 0 indicates no variation.
    • Larger values indicate greater variation.
    • The value can increase dramatically with the inclusion of one or more outliers.
    • The units are the same as the units of the original data values.

Standard Deviation (formula)

  • Sample standard deviation formula: \[s=\sqrt{\frac{\sum(X-\bar{X})^2}{n-1}}\]
  • Population standard deviation formula: \[\sigma=\sqrt{\sigma^2}=\sqrt{\frac{\sum(X-\mu)^2}{N}}\]

Procedure for finding the standard deviation (sample)

  1. Compute the mean (\(\bar{X}\))
  2. Subtract the mean from each individual value to get a list of deviations of the form: \(X - \bar{X}\)
  3. Square each of the differences obtained from step 2: \((X - \bar{X})^2\)
  4. Add all of the squares obtained from step 3: \(\sum(X - \bar{X})^2\)
  5. Divide the total from step 4 by the number (n - 1), which is 1 less than the total number of values present: \(\frac{\sum(X - \bar{X})^2}{n-1}\)
  6. Find the square root of the result of step 5: \(\sqrt{\frac{\sum(X - \bar{X})^2}{n-1}}\)

Standard Deviation (Example)

  • Waiting times (min) in multiple waiting lines: 1, 3, 14
  • Compute the mean: \(\bar{X}\) = 18 / 3 = 6 min

\(X\)        \(X-\bar{X}\)    \((X-\bar{X})^2\)
1            -5               25
3            -3               9
14           8                64
Total: 18                     98


  • Variance: 98 / 2 = 49 min²
  • Standard deviation: \(\sqrt{49}\) = 7.0 min
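The six-step procedure can be followed line by line in Python using the waiting times tabulated above (1, 3 and 14 min). This is a sketch with illustrative variable names; in practice `statistics.stdev` gives the same result in one call.

```python
import math

waiting_times = [1, 3, 14]   # minutes, from the table above
n = len(waiting_times)

mean = sum(waiting_times) / n                     # step 1: 6.0
deviations = [x - mean for x in waiting_times]    # step 2: [-5.0, -3.0, 8.0]
squared = [d ** 2 for d in deviations]            # step 3: [25.0, 9.0, 64.0]
sum_squared = sum(squared)                        # step 4: 98.0
variance = sum_squared / (n - 1)                  # step 5: 49.0
std_dev = math.sqrt(variance)                     # step 6: 7.0

print(mean, sum_squared, variance, std_dev)  # 6.0 98.0 49.0 7.0
```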

Describing Distribution

Range Rule of Thumb

  • For interpreting a known value of the standard deviation

    1. Minimum “usual” value = (mean) – 2 x standard deviation
    2. Maximum “usual” value = (mean) + 2 x standard deviation

Rule of Thumb (Example)

Past results from the National Health Survey suggest that the head circumferences of 2 months old girls have a mean of 40.05 cm and a standard deviation of 1.64 cm. Determine whether a circumference of 42.6 cm would be considered “unusual”.

Rule of Thumb (Solution)

  • With a mean of 40.05 cm and a standard deviation of 1.64 cm, we use the range rule of thumb to find the minimum and maximum usual circumferences as follows:
    1. Minimum usual value = (mean) – 2 x (standard deviation) = 40.05 – 2 x 1.64 = 36.77 cm
    2. Maximum usual value = (mean) + 2 x (standard deviation) = 40.05 + 2 x 1.64 = 43.33 cm
  • Based on these results, we expect that typical two-month-old girls have head circumferences between 36.77 cm and 43.33 cm.
  • Because 42.6 cm falls within those limits, it would be considered usual or typical, not unusual.
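The rule is simple enough to wrap in a small helper. The function name `usual_range` is hypothetical and chosen only for this sketch; the inputs are the mean, standard deviation and observed value from the example above.

```python
def usual_range(mean, sd):
    """Return the (minimum, maximum) usual values: mean -/+ 2 standard deviations."""
    return mean - 2 * sd, mean + 2 * sd


low, high = usual_range(40.05, 1.64)
observed = 42.6
is_usual = low <= observed <= high   # True when the value lies inside the limits

print(f"{low:.2f} {high:.2f} {is_usual}")  # 36.77 43.33 True
```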

Skewness

  • Skewness
    • Refers to the shape of a distribution: its symmetry or asymmetry
  • Kurtosis
    • Refers to the flatness or peakedness of a distribution

Positively Skewed Distribution

  • Tail pointing to the larger value
  • Mean > Median > Mode


Negatively Skewed Distribution

  • Tail pointing to the smaller value
  • Mode > Median > Mean


Kurtosis

  • Platykurtic distribution
    • Low peak
    • Flatter than the normal curve
  • Leptokurtic distribution
    • High peak
    • More peaked than the normal curve