MT5762 Lecture 4

C. Donovan

Summary Statistics

Centre and spread

Fundamental to statistics are measures of centre and measures of spread.

  • Knowing something about both of these for a particular set of data is far more powerful than just one alone.
  • Our sense of what is unusual is a function of both these measures, e.g. a company you have invested in has had a decrease in profits of $1,000,000 in the last quarter: should you be concerned?
  • As humans we appear to use distributional information collected over time to calibrate our perception of what is normal and what is odd, e.g.
    • Is someone 7 feet tall (about 2.1 metres) unusually tall? Why?
    • Is an adult Giant Squid (Architeuthis dux) 20m in length unusual? Why?

A large? squid


...

Centre: sample mean and sample median.

Calculating centre

  • The sample mean \( \bar{x} \) is given by:

\[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \]

  • The sample median is the middle value in the sample sorted in ascending order. With an even number of points, the median lies between the middle two values and is conventionally taken as their average.
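As a small check of the two definitions above, here is a minimal sketch in R. The vectors x and y are made-up values for illustration, not part of the lecture data.

```r
# Hypothetical sample of seven values (illustrative only)
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Sample mean from the formula: (1/n) * sum of the x_i
xbar <- sum(x) / n
xbar                      # 6, the same as mean(x)

# Median with odd n: the middle value of the sorted sample
sort(x)[(n + 1) / 2]      # 5, the same as median(x)

# Median with even n: R averages the two middle values
y <- c(2, 4, 5, 7)
median(y)                 # 4.5, i.e. (4 + 5) / 2
```

In practice mean() and median() are all that is needed; the explicit versions are only there to mirror the definitions.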

Comments

  • \( \bar{x} \) is the balancing point of a histogram.
  • \( median \) is the point where 50\% of the area of a histogram is to its left and 50\% to its right
  • \( \bar{x} \) is sensitive to outliers, while \( median \) is robust to outliers.
  • With symmetric data, \( \bar{x}=median \).
  • With right skewed data, \( \bar{x} > median \); left skewed, \( \bar{x} < median \).
  • The \( median \) is often more appropriate with highly skewed data, e.g. income, house costs.
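The comments on outliers and skew can be illustrated with a short sketch. The income figures below are invented for the purpose (in thousands) and are not taken from any dataset used in the course.

```r
# Hypothetical incomes in thousands (illustrative values only)
income <- c(18, 22, 25, 27, 30, 34, 41)
mean(income)
median(income)            # 27

# Replace the largest value with one extreme income
incomeOutlier <- income
incomeOutlier[7] <- 900

mean(incomeOutlier)       # pulled strongly towards the outlier
median(incomeOutlier)     # still 27: the median is unaffected
```

After the change the data are heavily right skewed, and the mean sits well above the median, as described in the comments above.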

Mean of a binary variable

With qualitative data with two categories, e.g., dead or alive, we can denote one category by a 1, say alive, and the other by a 0, for dead. With such binary data, \( \bar{x} \) corresponds to a sample proportion for category = 1.

Sample proportions are usually denoted \( \hat{p} \).
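A minimal sketch of this correspondence, using a made-up vector of ten 0/1 outcomes (the name alive and its values are assumptions for illustration):

```r
# Hypothetical survival indicator: 1 = alive, 0 = dead
alive <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)

# The mean of a 0/1 variable is the proportion of 1s
pHat <- mean(alive)
pHat                                  # 0.7

# The same proportion computed by counting
sum(alive == 1) / length(alive)       # 0.7
```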

Spread

Measures of spread

There are many measures of spread; we look at three here: range, interquartile range, and standard deviation (variance).

  • range: simply maximum - minimum.
  • interquartile range (IQR): 75th percentile - 25th percentile, or 3rd quartile - 1st quartile.
  • (Sample) standard deviation:

\[ s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} \]

A related measure, called the sample variance, is simply the square of the standard deviation and denoted \( s^2 \).
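The three measures can be computed as follows. This is a sketch on a made-up sample; note that quantiles can be defined in several ways, and IQR() uses R's default (type 7), so other software may give slightly different quartiles.

```r
# Hypothetical sample (illustrative only)
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Range as a single number; range(x) itself returns c(min, max)
rangeX <- max(x) - min(x)
rangeX                    # 9

# Interquartile range: 75th percentile minus 25th percentile
iqrX <- IQR(x)
iqrX                      # 4

# Sample standard deviation from the formula, with divisor n - 1
s <- sqrt(sum((x - mean(x))^2) / (n - 1))
s                         # the same as sd(x)

# Sample variance is the square of s
s^2                       # 10, the same as var(x)
```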

Comments

  • Range is of limited use. Sensitive or robust?
  • IQR: sensitive or robust?
  • \( s \): sensitive or robust?

Objectives

This section outlines:

  • the broad types of data that exist,
  • given their type what sort of numerical or graphical summary is appropriate
  • how these summaries are interpreted
  • Further information: Chapters on displaying qualitative and quantitative data e.g. Chapters 2-3 in Wild & Seber 2000.

Types of data

2 broad categories

The type of variable affects the way to summarise and study the variable. For example, a variable like eye colour (brown, blue, green, red) cannot be described by a numerical average in contrast to a variable like annual income.

There are two general categories of variables:

  • Quantitative: numerical valued
  • Qualitative: non-numerical valued; e.g., color, sex, religion

More refined categories

These can be further partitioned:

  • Quantitative

    • Continuous: infinite number of possible values
    • Discrete: count data, data with gaps between values

Note the distinction between these can become blurred under certain circumstances.

  • Qualitative

    • Ordinal: nonnumeric but relative values; e.g., poor, fair, good, excellent
    • Nominal: unordered, distinct by name only; e.g., green, red, white
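In R, one common way to represent the two kinds of qualitative variable is with factors: an unordered factor for nominal data and an ordered factor for ordinal data. The sketch below uses invented values; the variable names colour and rating are assumptions for illustration.

```r
# Nominal: categories distinct by name only
colour <- factor(c("green", "red", "white", "red"))

# Ordinal: categories with a meaningful order, given explicitly
rating <- factor(c("fair", "poor", "excellent", "good", "fair"),
                 levels = c("poor", "fair", "good", "excellent"),
                 ordered = TRUE)

is.ordered(colour)        # FALSE
is.ordered(rating)        # TRUE

# Comparisons only make sense for the ordered factor
rating < "good"           # TRUE TRUE FALSE FALSE TRUE

# Frequencies are reported in the level order
table(rating)
```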

Tables/plots for a qualitative variable

Tabular presentation of data

  • Round numbers for presentation, not for calculation
  • For tables intended for presentation: round the figures, arrange the table so that comparisons are made down columns rather than across rows, provide row and column averages/summaries, and include verbal summaries

Example: Marital status of people in Wales aged 16 and older in 2001.

Status                                      Number
Single (never married)                     649,512
Married                                  1,031,511
Re-married                                 172,466
Separated (but still legally married)       43,819
Divorced                                   200,991
Widowed                                    217,631
Total                                    2,315,930

Alternatively

Status                                 Number (1000s)  Percentage
Married                                         1,032        44.5
Single (never married)                            650        28.0
Widowed                                           218         9.4
Divorced                                          201         8.7
Re-married                                        172         7.4
Separated (but still legally married)              44         1.9
Total                                           2,316       100.0
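The second table can be derived from the first along the following lines. This is a sketch: the counts are those given above, while the data frame and column names (marital, Thousands, Percentage) are assumptions for illustration. Percentages are computed from the unrounded counts and rounded only for display.

```r
# Counts from the first table
status <- c("Single (never married)", "Married", "Re-married",
            "Separated (but still legally married)", "Divorced", "Widowed")
number <- c(649512, 1031511, 172466, 43819, 200991, 217631)
marital <- data.frame(Status = status, Number = number)

# Order rows from most to least common
marital <- marital[order(-marital$Number), ]

# Calculate on the raw counts, round for presentation only
marital$Thousands  <- round(marital$Number / 1000)
marital$Percentage <- round(100 * marital$Number / sum(marital$Number), 1)

marital[, c("Status", "Thousands", "Percentage")]
```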

One of our recurrent datasets

This data is based on customer mortgage defaults. There are a large number of covariates with a binary response indicating the defaulted/non-defaulted categories. It contains

  • BAD - The target variable/response, coded as 1 for a loan default and 0 if paid back.
  • REASON - What the loan was for: HomeImp for home improvement and DebtCon for debt consolidation.
  • JOB - a set of six occupational categories.
  • MORTDUE - the amount remaining on the existing mortgage.
  • VALUE - value of the current property

  • DEBTINC - the debt-to-income ratio, i.e. if this is high, then the person owes a lot relative to their income (they're highly indebted).
  • YOJ - the number of years at the present job.
  • DEROG - the number of major derogatory reports
  • CLNO - the number of trade lines
  • DELINQ - the number of delinquent trade lines
  • CLAGE - the age of the oldest trade line (months)
  • NINQ - the number of recent credit checks.

One of our recurrent datasets

  # header = TRUE: the first row of the file holds the variable names
  loanData <- read.csv("data/hmeq.csv", header = TRUE)

  head(loanData)
  BAD LOAN MORTDUE  VALUE  REASON    JOB  YOJ DEROG DELINQ     CLAGE NINQ
1   1 1100   25860  39025 HomeImp  Other 10.5     0      0  94.36667    1
2   1 1300   70053  68400 HomeImp  Other  7.0     0      2 121.83333    0
3   1 1500   13500  16700 HomeImp  Other  4.0     0      0 149.46667    1
4   1 1500      NA     NA                  NA    NA     NA        NA   NA
5   0 1700   97800 112000 HomeImp Office  3.0     0      0  93.33333    0
6   1 1700   30548  40320 HomeImp  Other  9.0     0      0 101.46600    1
  CLNO  DEBTINC
1    9       NA
2   14       NA
3   10       NA
4   NA       NA
5   14       NA
6    8 37.11361

Plots for qualitative variables

Pie charts - often maligned.

3D pie? Just say no: the perspective alters the apparent area of the slices.

  # get the frequencies
  freqTable <- as.data.frame(table(loanData$JOB))

  names(freqTable) <- c('Job', 'Freq')
  pie(freqTable$Freq, labels = freqTable$Job)


Plots for qualitative variables

Better just this IMO

  barplot(freqTable$Freq, names.arg = freqTable$Job)


Plots for qualitative variables

Bit prettier(?)

  library(ggplot2)

  p <- ggplot(freqTable, aes(x = Job, y = Freq, fill = Job)) +
    geom_bar(stat = "identity")

  p


Plots for a quantitative variable

Single quantitative variable

Typically interested in the distribution of the data.

  • Histograms: These are very similar to bar-charts with the difference that the x-axis gives numerical magnitude.

Single quantitative variable

histogram

Note the splitting of the data to produce bins for plotting can alter the appearance appreciably. Typically analysis software will have algorithms to make this decision.

  hist(loanData$VALUE, col = 'purple')
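To see how much the choice of bins matters, the sketch below bins the same data three ways. The vector x is artificial right-skewed data (evenly spaced exponential quantiles), used here only so the example runs without the loan file. In hist(), a single number passed to breaks is treated as a suggestion, so the actual number of bins may differ from the value requested.

```r
# Artificial right-skewed data: 200 exponential quantiles (illustrative only)
x <- qexp(ppoints(200), rate = 1 / 100000)

# plot = FALSE returns the binning without drawing anything
hDefault <- hist(x, plot = FALSE)               # software's default choice
hCoarse  <- hist(x, breaks = 5,  plot = FALSE)  # roughly 5 bins requested
hFine    <- hist(x, breaks = 50, plot = FALSE)  # roughly 50 bins requested

# Number of bins actually used in each case
length(hDefault$counts)
length(hCoarse$counts)
length(hFine$counts)

# Every observation is counted exactly once whatever the binning
sum(hCoarse$counts)
sum(hFine$counts)
```

Calling plot(hCoarse) and plot(hFine) draws the two versions for comparison.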


Single quantitative variable


  p <- ggplot(loanData) + geom_histogram(aes(VALUE), fill = 'orange', col = 'black')

  p


Single quantitative variable

Typically interested in the distribution of the data.

  • Boxplots: These display the median and interquartile range (forming the box). The whiskers extend from the box to the most extreme data points lying within 1.5 \( \times \) the interquartile range of the box. Data beyond the whiskers are indicated as points.

  boxplot(loanData$VALUE)
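The numbers behind a boxplot can be inspected with boxplot.stats(), which boxplot() uses internally. The sketch below uses a made-up sample with one obviously large value. Note that R builds the box from Tukey's hinges, which are close to, but not always identical to, the quartiles returned by quantile().

```r
# Hypothetical sample with one extreme value (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

b <- boxplot.stats(x)     # default coef = 1.5, i.e. 1.5 x the box length

# Lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats                   # 1.0 3.0 5.5 8.0 9.0

# Points beyond the whiskers, drawn individually on the plot
b$out                     # 30
```

Here the box runs from 3 to 8, so any point more than 7.5 above 8 lies beyond the whisker; 30 is the only such point, and the upper whisker stops at 9, the largest value inside that limit.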


Single quantitative variable

It is gross features of the data distribution that are highlighted in histograms or boxplots e.g.

  • Modes: most common values
  • Symmetry
  • Outliers: extremely small or large values

Summaries for two variables

Let us consider pairs of variables now. Note: in many situations one variable (conventionally \( X \)) will have the special status of an explanatory variable, while the other variable (conventionally \( Y \)) is deemed to be the response variable.

Examples of related variables (where the theoretical causal direction is clear):

\( X \)            \( Y \)                     Explanatory type   Response type
Acupuncture        Level of lower back pain    Qualitative        Quantitative
Cell phone usage   Occurrence of cancer        Quantitative       Qualitative?
Ethnicity          Type of employment          Qualitative        Qualitative

Relationships between two quantitative variables

Examples

Two variables, \( X \) and \( Y \), measured on the same subject (unit) often are co-related. Some examples.

  • As rainfall volume increases (\( X \)), the highway runoff volume (\( Y \)) increases.
  • The flow in cubic feet per second, cfs (\( X \)), in the lower Sacramento river and the salinity (\( Y \)) have an inverse relationship: as flow increases, salinity decreases.
  • The probability of an O-ring on the space shuttles failing (\( Y \)) (prior to 1986) increased as air temperature (\( X \)) decreased.

Scatterplots

Given a sample of \( n \) \( X,Y \) pairs, the first thing to do is to draw a scatterplot, namely a plot of \( Y \) vs \( X \). What one looks for:

  • any relationship at all, and if so, linear or nonlinear?
  • any outliers
  • any influential points

Scatterplots

Two quantitative variables lend themselves to coordinates in 2D

  p <- ggplot(data = loanData) + geom_point(aes(x = VALUE, y = MORTDUE), pch = 21, size = 4, fill = 'darkorange')

  p


Relationships between a quantitative and a qualitative variable

  • Common in observational studies or randomized experiments where two or more treatments are applied and a numerical value is measured on the units.
  • The treatment is the explanatory variable and is qualitative; the measured value is the response variable and is quantitative.
  • If there are two or more treatments then side-by-side boxplots are an appropriate display.

Boxplot groups

  # one box of DEBTINC per JOB category
  boxplot(DEBTINC ~ JOB, data = loanData)


Grouped histograms

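  # non-defaulters (BAD == 0) first, then defaulters (BAD == 1) overlaid in semi-transparent red
  hist(loanData$VALUE[which(loanData$BAD == 0)], col = 'skyblue', border = FALSE)
  hist(loanData$VALUE[which(loanData$BAD == 1)], add = TRUE, col = scales::alpha('red', .5), border = FALSE)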


Grouped histograms

  p <- ggplot(loanData) + geom_histogram(aes(VALUE, fill = factor(BAD)), alpha = 0.8) +
    facet_grid(factor(BAD) ~ .)

  p


Relationships between two qualitative variables

Two-way frequency tables are the simplest and most complete way to summarize the relationship between two qualitative variables.

Look at Job versus Loan intention


             DebtCon HomeImp
          38      41      20
  Mgr      3      75      23
  Office   3      65      32
  Other    3      67      30
  ProfExe  2      66      32
  Sales    0      89      11
  Self     3      38      60

Example

Is there a relationship?

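It looks as though the self-employed have a different distribution: home improvement is markedly more likely than for the other job types (60% of their loans, against 11-32% for the other named job categories). The unlabelled first row and first column hold the records where JOB or REASON, respectively, is blank.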

  # simple frequency table
  crossTab <- table(loanData$JOB, loanData$REASON)

  # let's adjust for different numbers in each job
  # (called jobTotals so as not to mask the base function rowSums())
  jobTotals <- apply(crossTab, 1, sum)

  # now we get rows as %-age i.e. rows sum to 100(-ish)
  percentTable <- apply(crossTab, 2, function(q){round(q/jobTotals*100)})
  percentTable

             DebtCon HomeImp
          38      41      20
  Mgr      3      75      23
  Office   3      65      32
  Other    3      67      30
  ProfExe  2      66      32
  Sales    0      89      11
  Self     3      38      60
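Row percentages of this kind can also be obtained with the base R function prop.table(). The sketch below is an alternative to the apply() approach above, shown on a small invented dataset (the vectors job and reason stand in for loanData$JOB and loanData$REASON so that the example runs on its own).

```r
# Hypothetical toy data standing in for the JOB and REASON columns
job    <- c("Mgr", "Mgr", "Mgr", "Mgr",
            "Self", "Self", "Self", "Self", "Self")
reason <- c("DebtCon", "DebtCon", "DebtCon", "HomeImp",
            "DebtCon", "DebtCon", "HomeImp", "HomeImp", "HomeImp")

# Two-way frequency table
crossTab <- table(job, reason)

# margin = 1 divides each cell by its row total, so each row sums to 1
rowPercent <- round(100 * prop.table(crossTab, margin = 1))
rowPercent
```

With these toy values the Mgr row is 75/25 and the Self row is 40/60. Using margin = 2 instead would give column percentages.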

Recap and look-forwards

We've covered:

  • Classes of data, how we treat them

Next:

  • Probability