C. Donovan
Fundamental to statistics are measures of centre and measures of spread.
Is an adult giant squid (Architeuthis dux) 20 m in length unusual? Why?
\[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \]
With qualitative data with two categories, e.g., dead or alive, we can denote one category by a 1, say alive, and the other by a 0, for dead. With such binary data, \( \bar{x} \) corresponds to a sample proportion for category = 1.
Sample proportions are usually denoted \( \hat{p} \).
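A small sketch of the point above, using a made-up vector (the values are illustrative and not taken from any dataset in these notes): with alive coded as 1 and dead as 0, the sample mean equals the sample proportion alive.

```r
# Hypothetical survival outcomes for 8 animals (illustrative values only)
status <- c("alive", "dead", "alive", "alive", "dead", "alive", "dead", "alive")

# Code alive as 1 and dead as 0
x <- as.integer(status == "alive")

# Sample mean of the 0/1 values, following the formula for x-bar
xBar <- sum(x) / length(x)

# Sample proportion in the category coded 1
pHat <- mean(status == "alive")

xBar
pHat
```

Both quantities are the count of 1s divided by n, which is why they agree.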
There are many measures of spread; we look at three here: range, interquartile range, and standard deviation (with the related variance).
\[ s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} \]
A related measure, called the sample variance, is simply the square of the standard deviation and denoted \( s^2 \).
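The three measures can be sketched in R as follows. The data vector is made up for illustration; the manual calculation follows the formula for \( s \) above and is compared with the built-in `sd()` and `var()`.

```r
# Hypothetical measurements (illustrative values only)
x <- c(2, 4, 4, 5, 7, 8)
n <- length(x)
xBar <- sum(x) / n

# Range: distance between the smallest and largest value
rangeWidth <- diff(range(x))

# Interquartile range: distance between the upper and lower quartiles
iqr <- IQR(x)

# Standard deviation from the formula, with the n - 1 divisor
sManual <- sqrt(sum((x - xBar)^2) / (n - 1))

# Built-in equivalents
s <- sd(x)
s2 <- var(x)   # sample variance, the square of s

c(range = rangeWidth, IQR = iqr, sd = s, var = s2)
```

Note that `IQR()` depends on the quantile definition in use; R's default is only one of several conventions, so other software may give a slightly different value on small samples.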
This section outlines:
The type of variable affects the way to summarise and study the variable. For example, a variable like eye colour (brown, blue, green, red) cannot be described by a numerical average in contrast to a variable like annual income.
There are two general categories of variables:
These can be further partitioned:
Quantitative
Note the distinction between these can become blurred under certain circumstances.
Qualitative
Example: Marital status of people in Wales aged 16 and older in 2001.
| Status | Number |
|---|---|
| Single (never married) | 649,512 |
| Married | 1,031,511 |
| Re-married | 172,466 |
| Separated (but still legally married) | 43,819 |
| Divorced | 200,991 |
| Widowed | 217,631 |
| Total | 2,315,930 |
| Status | Number (1000s) | Percentage |
|---|---|---|
| Married | 1,032 | 44.5 |
| Single (never married) | 650 | 28.0 |
| Widowed | 218 | 9.4 |
| Divorced | 201 | 8.7 |
| Re-married | 172 | 7.4 |
| Separated (but still legally married) | 44 | 1.9 |
| Total | 2,316 | 100.0 |
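The second table can be derived from the first with a few lines of R. This is a sketch: the counts are copied from the table above, while the object names (`counts`, `sortedCounts`, `summaryTable`) are chosen here for illustration.

```r
# Counts copied from the marital status table above
counts <- c(
  "Single (never married)"                = 649512,
  "Married"                               = 1031511,
  "Re-married"                            = 172466,
  "Separated (but still legally married)" = 43819,
  "Divorced"                              = 200991,
  "Widowed"                               = 217631
)

# Order categories from most to least frequent
sortedCounts <- sort(counts, decreasing = TRUE)

# Express counts in thousands and as percentages of the total
summaryTable <- data.frame(
  Status     = names(sortedCounts),
  Thousands  = round(sortedCounts / 1000),
  Percentage = round(100 * sortedCounts / sum(sortedCounts), 1),
  row.names  = NULL
)

summaryTable
```

Sorting by frequency and rounding to a sensible precision makes the dominant categories easier to read off than in the raw counts.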
This data is based on customer mortgage defaults. There are a large number of covariates with a binary response indicating the defaulted/non-defaulted categories. It contains
loanData <- read.csv("data/hmeq.csv", header = TRUE)
head(loanData)
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ
1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1
2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0
3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1
4 1 1500 NA NA NA NA NA NA NA
5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0
6 1 1700 30548 40320 HomeImp Other 9.0 0 0 101.46600 1
CLNO DEBTINC
1 9 NA
2 14 NA
3 10 NA
4 NA NA
5 14 NA
6 8 37.11361
Pie charts - often maligned.
3D pie? Just say no: the perspective alters the apparent area of the slices.
# get the frequencies
freqTable <- as.data.frame(table(loanData$JOB))
names(freqTable) <- c('Job', 'Freq')
pie(freqTable$Freq, labels = freqTable$Job)
Better just this IMO
barplot(freqTable$Freq, names.arg = freqTable$Job)
Bit prettier(?)
library(ggplot2)
p <- ggplot(freqTable, aes(x=Job, y=Freq, fill=Job)) +
geom_bar(stat="identity")
p
Typically interested in the distribution of the data.
histogram
Note that the way the data are split into bins for plotting can alter the appearance appreciably. Analysis software typically has an algorithm to make this decision.
hist(loanData$VALUE, col = 'purple')
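To see how much the binning matters, the sketch below draws the same values with 5 bins and with 40 bins. The data are simulated as a stand-in for a right-skewed variable such as VALUE (an assumption, not the loan data), and the break points are set explicitly so the number of bins is exactly what is asked for. By default `hist()` chooses the number of bins with Sturges' rule, and `geom_histogram()` uses 30 bins unless told otherwise.

```r
set.seed(1)

# Simulated right-skewed values (an assumption standing in for real data)
x <- rlnorm(500, meanlog = 11, sdlog = 0.5)

# Explicit break points covering the whole range of the data
upper <- 1.01 * max(x)
coarseBreaks <- seq(0, upper, length.out = 6)    # 5 bins
fineBreaks   <- seq(0, upper, length.out = 41)   # 40 bins

# Bin counts without plotting, for inspection
coarse <- hist(x, breaks = coarseBreaks, plot = FALSE)
fine   <- hist(x, breaks = fineBreaks, plot = FALSE)

# Side-by-side plots of the same data under the two binnings
op <- par(mfrow = c(1, 2))
hist(x, breaks = coarseBreaks, main = "5 bins")
hist(x, breaks = fineBreaks, main = "40 bins")
par(op)
```

With few bins the skewness is visible but detail is lost; with many bins the outline becomes noisier.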
p <- ggplot(loanData) + geom_histogram(aes(VALUE), fill = 'orange', col = 'black')
p
boxplot(loanData$VALUE)
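A boxplot is a picture of a five-number summary. The sketch below, on made-up values, shows the numbers behind the plot using `fivenum()` and `boxplot.stats()`; the latter flags points lying more than 1.5 times the box length beyond the box, which is the default rule `boxplot()` uses to draw individual points.

```r
# Hypothetical values with one unusually large observation
x <- c(3, 5, 6, 7, 8, 9, 10, 12, 30)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

# The statistics boxplot() draws: whisker ends, box edges, median,
# plus any points flagged as outlying
stats <- boxplot.stats(x)

five
stats$stats   # whiskers stop at the most extreme non-outlying points
stats$out     # points drawn individually

boxplot(x, horizontal = TRUE)
```

This gives one working answer to "is this value unusual?": a value is flagged when it lies far from the box relative to the spread of the middle half of the data.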
It is gross features of the data distribution that are highlighted in histograms or boxplots e.g.
Let us consider pairs of variables now. Note: in many situations one variable (conventionally \( X \)) has the special status of an explanatory variable, while the other variable (conventionally \( Y \)) is deemed to be the response variable.
Examples of related variables (where the theoretical causal direction is clear):
| \( X \) | \( Y \) | Explanatory Type | Response Type |
|---|---|---|---|
| Acupuncture | Level of lower back pain | Qualitative | Quantitative |
| Cell phone usage | Occurrence of cancer | Quantitative | Qualitative? |
| Ethnicity | Type of employment | Qualitative | Qualitative |
Two variables, \( X \) and \( Y \), measured on the same subject (unit) often are co-related. Some examples.
Given a sample of \( n \) \( X,Y \) pairs, the first thing to do is to draw a scatterplot, namely a plot of \( Y \) vs \( X \). What one looks for:
Two quantitative variables lend themselves to coordinates in 2D
p <- ggplot(data = loanData) + geom_point(aes(x = VALUE, y = MORTDUE), pch = 21, size = 4, fill = 'darkorange')
p
boxplot(loanData$DEBTINC~loanData$JOB)
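# overlay the VALUE distributions for non-defaulters (BAD == 0) and defaulters (BAD == 1)
hist(loanData$VALUE[which(loanData$BAD == 0)], col='skyblue', border=FALSE)
hist(loanData$VALUE[which(loanData$BAD == 1)], add=TRUE, col=scales::alpha('red',.5), border=FALSE)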
p <- ggplot(loanData) + geom_histogram(aes(VALUE, fill = factor(BAD)), alpha = 0.8) +
facet_grid(factor(BAD) ~ .)
p
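Grouped plots like these can be backed up with grouped numerical summaries. The sketch below uses simulated data (the data frame `toy` and its columns are hypothetical, not the loan data) and `tapply()` to compute the centre and spread of a quantitative variable within each level of a qualitative one.

```r
set.seed(2)

# Simulated data: a two-level group and a quantitative value (hypothetical)
toy <- data.frame(
  group = rep(c("0", "1"), times = c(60, 40)),
  value = c(rnorm(60, mean = 100, sd = 15), rnorm(40, mean = 85, sd = 25))
)

# Centre and spread of value within each group
medians <- tapply(toy$value, toy$group, median)
iqrs    <- tapply(toy$value, toy$group, IQR)
sizes   <- table(toy$group)

# One row per group
groupSummary <- cbind(n = as.vector(sizes), median = medians, IQR = iqrs)
groupSummary
```

Reporting the group sizes alongside the summaries matters because groups of very different size can make overlaid histograms misleading.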
Two-way frequency tables are the simplest and most complete way to summarize the relationship between two qualitative variables.
Look at Job versus Loan intention
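| Job | (blank) | DebtCon | HomeImp |
|---|---|---|---|
| (blank) | 38 | 41 | 20 |
| Mgr | 3 | 75 | 23 |
| Office | 3 | 65 | 32 |
| Other | 3 | 67 | 30 |
| ProfExe | 2 | 66 | 32 |
| Sales | 0 | 89 | 11 |
| Self | 3 | 38 | 60 |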
Is there a relationship?
It looks like the self-employed have a different distribution: home improvement is markedly more likely for them (60% of loans) than for the other job types (11% to 32%).
# simple frequency table
crossTab <- table(loanData$JOB, loanData$REASON)
# let's adjust for different numbers in each job
# (named rowTotals so that the base function rowSums() is not masked)
rowTotals <- rowSums(crossTab)
# now we get rows as %-age i.e. rows sum to 100(-ish)
percentTable <- apply(crossTab, 2, function(q){round(q/rowTotals*100)})
percentTable
DebtCon HomeImp
38 41 20
Mgr 3 75 23
Office 3 65 32
Other 3 67 30
ProfExe 2 66 32
Sales 0 89 11
Self 3 38 60
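The same row-percentage calculation can be written more compactly with `prop.table()`. The sketch below uses a tiny made-up data frame (`toy`, with hypothetical `job` and `reason` columns) so that the result can be checked by hand.

```r
# Hypothetical data: 4 managers and 5 self-employed borrowers
toy <- data.frame(
  job    = c("Mgr", "Mgr", "Mgr", "Mgr",
             "Self", "Self", "Self", "Self", "Self"),
  reason = c("DebtCon", "DebtCon", "DebtCon", "HomeImp",
             "DebtCon", "DebtCon", "HomeImp", "HomeImp", "HomeImp")
)

# Two-way table of counts
counts <- table(toy$job, toy$reason)

# Row proportions: margin = 1 divides each row by its own total
rowPercent <- round(100 * prop.table(counts, margin = 1))

counts
rowPercent
```

Using `margin = 2` instead would give column percentages, which answer a different question: of the loans taken for a given reason, what share comes from each job type.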
We've covered:
Next: