Initial Data Exploration with R

Graphical Data Exploration

A data set to use

Download the data dpt.csv from the Teams Class Materials folder in the Files section in the General channel
Overview of the data here
This has two columns:
- Date: date on which the measurements were taken
- PM10: level of particulate matter (daily average) at Dublin Port Tunnel
PM10 describes inhalable particles, with diameters that are generally 10 $\mu$m and smaller.
- Level is measured in $\mu g\,m^{-3}$ (micrograms per cubic metre)
- 50 $\mu g\,m^{-3}$ is considered a limit value

Why not just look at the numbers?

library(readr)
dpt <- read_csv('dpt.csv')
dpt$PM10

##   [1]  8.219792  8.764583 13.368750  7.432292  6.008333 13.534375 21.034375
##   [8] 11.550000 11.792708 20.832292  9.136458 13.201923 12.856250 61.700000
##  [15] 16.633333 15.459375 17.439583 26.765625 56.996875 14.800000 23.027083
##  [22] 20.506250 37.785417 28.714583 14.471875 13.919444  9.177083 11.206250
##  [29]  5.441667  5.827083  3.803125  5.195833  6.114583 10.163542 21.531250
##  [36] 19.636458 13.588542 15.700000 20.180208 14.850000 16.828125 12.143182
##  [43] 10.792361 11.877083  7.111458  6.513542 11.756250 20.808333 12.647917
##  [50]  6.856250 12.396875  5.323958 31.517708 20.792014 11.925694  5.243750
##  [57]  6.693750 14.632292  9.880208 13.438542 12.381250 23.457292 34.204167
##  [64] 19.496875 13.539583 16.554167 18.290625 24.913542 30.075000 14.872569
##  [71] 10.289583  6.861458  7.903125 11.429167 19.469792 19.133333 16.719792
##  [78] 24.810417 18.256250 14.785417  8.616667 32.409615 24.564236 83.942708
##  [85] 39.155208 21.264583 13.112500 32.660417 11.938636        NA  8.550000
##  [92] 17.038542  7.354167  6.591667 13.840625  7.443750 10.374306 19.456597
##  [99] 26.453125 20.583333 33.951042 23.797917 32.590625 27.559375 22.747917
## ...

Not the easiest way to see what is going on.
- Possibly the only obvious thing is the NA (missing) value.
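A quicker check than scanning the printout is to ask R to count and locate the missing values. The following is a minimal sketch; it uses a handful of values copied from the printout above with one NA placed among them, not the full file, and `pm10_sample` is just an illustrative name.

```r
# A few values from the printout above, plus one NA; not the full data set
pm10_sample <- c(8.219792, 8.764583, 13.368750, NA, 61.700000, 16.633333)

n_missing <- sum(is.na(pm10_sample))        # how many values are missing
where_missing <- which(is.na(pm10_sample))  # their positions in the vector

n_missing
where_missing
summary(pm10_sample)                        # quartiles, mean and the NA count
```

With the real data the same calls would be applied to `dpt$PM10`.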

Getting an idea of distribution of values

hist(dpt$PM10)

Using Some Options To Control Appearance

hist(dpt$PM10,col='dodgerblue',
     main='PM10 Concentration (Daily Average)',
     xlab='Conc (micrograms per cubic metre)') # Histogram with options
abline(v=50,col='red') # Also add a vertical line at 50

Alternative approaches - Boxplots

boxplot(dpt$PM10,horizontal = TRUE,
     main='PM10 Concentration (Daily Average)',
     xlab='Conc (micrograms per cubic metre)') # Boxplot with options
abline(v=50,col='red') # Also add a vertical line at 50

How it works

source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
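The numbers behind a boxplot can be inspected with `boxplot.stats()` from base R. The sketch below uses the first 14 values from the printout above rather than the full file (`pm10_sample` is an illustrative name). By default, points lying more than 1.5 times the box length beyond the box are drawn individually.

```r
# First 14 values from the printout above; not the full data set
pm10_sample <- c(8.219792, 8.764583, 13.368750, 7.432292, 6.008333,
                 13.534375, 21.034375, 11.550000, 11.792708, 20.832292,
                 9.136458, 13.201923, 12.856250, 61.700000)

bs <- boxplot.stats(pm10_sample)  # default coef = 1.5

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond the whiskers, drawn as individual points
```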

Alternatively - density plots (cf session 1)

PM10_dens <- density(dpt$PM10,na.rm = TRUE)
plot(PM10_dens,main='PM10 Concentration (Daily Average)')
abline(v=50,col='red')

Notes

This works slightly differently to the other approaches
The density function doesn’t automatically make a plot
You have to create a density object, then plot it
No logical reason - it was just coded that way…

density(dpt$PM10,na.rm = TRUE) # Call without a plot command

## 
## Call:
##  density.default(x = dpt$PM10, na.rm = TRUE)
## 
## Data: dpt$PM10 (272 obs.);   Bandwidth 'bw' = 2.002
## 
##        x                y            
##  Min.   :-2.688   Min.   :8.000e-08  
##  1st Qu.:20.471   1st Qu.:1.635e-04  
##  Median :43.630   Median :7.390e-04  
##  Mean   :43.630   Mean   :1.078e-02  
##  3rd Qu.:66.789   3rd Qu.:1.237e-02  
##  Max.   :89.948   Max.   :6.345e-02

How it works

Create a normal curve around each data point
Average them
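The two steps above can be sketched by hand and compared with `density()`. This is an illustration only: it uses the first seven values from the printout, and the bandwidth `bw = 2` and the grid of 200 points are arbitrary choices made for the example.

```r
x <- c(8.219792, 8.764583, 13.368750, 7.432292, 6.008333, 13.534375, 21.034375)
bw <- 2   # width (sd) of each normal curve; an arbitrary choice here

# Positions at which the curves are evaluated
grid <- seq(min(x) - 4 * bw, max(x) + 4 * bw, length.out = 200)

# One column per data point: a normal curve centred on that point
curves <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw))

# Average the curves at each grid position
dens_by_hand <- rowMeans(curves)

# Area under the averaged curve (simple Riemann sum) should be close to 1
area <- sum(dens_by_hand) * (grid[2] - grid[1])
area

# density() with the same bandwidth should give a very similar curve
d <- density(x, bw = bw, from = min(grid), to = max(grid), n = 200)
max(abs(d$y - dens_by_hand))
```

`density()` uses a fast approximation internally, so the two curves should agree closely rather than exactly.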

Numerical Summary Statistics

Median

Middle value when the observations are sorted in order
If $n$ is even, it is the halfway point between the two central values (here 12.405)

Interquartile Range

Difference between the quarter and three quarter points along the ordered observations: here it is 9.146
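Both definitions can be followed step by step on a small vector. A minimal sketch, using the first eight values from the printout above (so $n$ is even); the variable names are illustrative.

```r
x <- c(8.219792, 8.764583, 13.368750, 7.432292,
       6.008333, 13.534375, 21.034375, 11.550000)  # n = 8, so n is even

x_sorted <- sort(x)
n <- length(x_sorted)

# Median: halfway between the two central values when n is even
median_by_hand <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2

# IQR: three-quarter point minus one-quarter point
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

median_by_hand
iqr_by_hand
```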

Computing these in R

Median and IQR are straightforward - the na.rm option tells R to ignore the NA value

median(dpt$PM10,na.rm=TRUE)

## [1] 12.40469

IQR(dpt$PM10,na.rm=TRUE)

## [1] 9.145833

Also the five number summary:

fivenum(dpt$PM10)

## [1]  3.317708  8.521875 12.404687 17.755208 83.942708

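The numbers are: min, lower hinge (approximately Q1), median, upper hinge (approximately Q3), max
fivenum drops NA values by default, so na.rm is not needed here
The hinges are computed slightly differently from the quartiles used by IQR, so their difference (9.233) is not exactly the IQR (9.146)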

Mean

Imagine each observation is a solid block, placed at its value on a plank
The mean (add values and divide by $n$) is the balancing point
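The balancing idea can be checked numerically: the deviations below the mean exactly cancel those above it. A short sketch on five values from the printout above:

```r
x <- c(8.219792, 8.764583, 13.368750, 7.432292, 6.008333)

x_bar <- sum(x) / length(x)   # add values and divide by n
deviations <- x - x_bar       # signed distance of each block from the pivot

x_bar
sum(deviations)               # zero, up to rounding error
```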

The standard deviation

Not so easy to explain with a diagram, but it's a measure of spread (like the IQR)
The main idea is this:
- Find the difference of each observation from the mean
- Square these (so deviations above and below the mean both count as positive)
- Take the average of these squared deviations
- Take the square root to bring it back to the original units

\[ \sigma = \sqrt{\frac{1}{n} \sum_i \left(x_i - \bar{x}\right)^2} \]
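The steps can be followed one at a time in R. One detail worth knowing: the formula above divides by $n$, whereas R's `sd()` divides by $n - 1$ (the sample standard deviation), so the two differ slightly for small $n$. A sketch on the first eight values from the printout, with illustrative variable names:

```r
x <- c(8.219792, 8.764583, 13.368750, 7.432292,
       6.008333, 13.534375, 21.034375, 11.550000)
n <- length(x)

dev <- x - mean(x)   # difference of each observation from the mean
sq_dev <- dev^2      # squared, so all contributions are positive

sd_n  <- sqrt(sum(sq_dev) / n)        # the formula as written above
sd_n1 <- sqrt(sum(sq_dev) / (n - 1))  # what sd() computes

sd_n
sd_n1
sd(x)
```

With 272 observations, as in the PM10 data, the difference between the two versions is very small.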

Computing these in R

As with other functions, the na.rm option removes NA values before computing the value:

mean(dpt$PM10,na.rm=TRUE)

## [1] 14.6721

sd(dpt$PM10,na.rm=TRUE)

## [1] 9.774103

For a normal distribution
- Around 95% of values lie within mean $\pm$ 2 $\times$ SD.

mean(dpt$PM10,na.rm=TRUE) + c(-2,2)*sd(dpt$PM10,na.rm=TRUE)

## [1] -4.876102 34.220310

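The lower limit is negative, which is impossible for a concentration, and values such as 83.9 lie far above the upper limit: this suggests PM10 is not normally distributed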
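The 95% figure can be checked directly, and the same check can be applied to any sample. The sketch below is illustrative: `prop_within_2sd` is a hypothetical helper name, and the simulated normal data stand in for a real variable.

```r
# Proportion of a normal distribution within 2 SDs of the mean
p_within_2sd <- pnorm(2) - pnorm(-2)
p_within_2sd   # about 0.954

# Hypothetical helper: proportion of a sample inside mean +/- 2 SD
prop_within_2sd <- function(x) {
  x <- x[!is.na(x)]                         # drop missing values first
  limits <- mean(x) + c(-2, 2) * sd(x)
  mean(x >= limits[1] & x <= limits[2])
}

set.seed(1)
sim_prop <- prop_within_2sd(rnorm(10000))   # simulated normal data
sim_prop                                    # close to p_within_2sd
```

Calling `prop_within_2sd(dpt$PM10)` would give the corresponding proportion for the PM10 data.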

More than 1 Variable

A New Data Set

Download the data rain_jan80.csv from the Teams Class Materials folder in the Files section in the General channel
Rainfall for 25 Irish weather stations, January 1980
This one has two columns
- Distance of a weather station from the coast (km)
- Monthly total rainfall (inches)
Source
- Data rescue in the classroom: research-led teaching to extend historical records (ICARUS)
- https://journals.ametsoc.org/bams/article/99/9/1757/70408/Integrating-Data-Rescue-into-the-Classroom

Visual Inspection

rain_jan80 <- read_csv('rain_jan80.csv')
plot(rain_jan80)

A fitted straight line

fit_line <- lm(Rainfall~Coast,data=rain_jan80)
plot(rain_jan80)
abline(fit_line,col='darkblue')
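The fitted line is described by two numbers, an intercept and a slope, which `coef()` returns. The sketch below uses invented values (not the rain_jan80 data) with the same column names, purely to show the calls; `toy` and `fit_toy` are illustrative names.

```r
# Invented values for illustration only; NOT the rain_jan80 data
toy <- data.frame(Coast    = c(2, 10, 25, 40, 60, 85),
                  Rainfall = c(6.1, 5.2, 4.8, 3.9, 4.1, 3.0))

fit_toy <- lm(Rainfall ~ Coast, data = toy)
coef(fit_toy)   # intercept and slope of the fitted line

# For a single predictor, the slope equals cov(x, y) / var(x)
slope_by_hand <- cov(toy$Coast, toy$Rainfall) / var(toy$Coast)
slope_by_hand

# Rainfall predicted by the line at 50 km from the coast
predict(fit_toy, newdata = data.frame(Coast = 50))
```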

The Correlation Coefficient - 1

With Negative correlation, most points in (-,+) or (+,-)

The Correlation Coefficient - 2

With Positive correlation, most points in (-,-) or (+,+)

The Correlation Coefficient - 3

With no correlation, points equally in all quadrants
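The quadrants are defined by whether each point lies below (-) or above (+) the mean of x and the mean of y, so they can be counted directly. A sketch with invented values (not the rain_jan80 data):

```r
# Invented values for illustration only; NOT the rain_jan80 data
toy <- data.frame(Coast    = c(2, 10, 25, 40, 60, 85),
                  Rainfall = c(6.1, 5.2, 4.8, 3.9, 4.1, 3.0))

dx <- toy$Coast - mean(toy$Coast)        # - if left of mean x, + if right
dy <- toy$Rainfall - mean(toy$Rainfall)  # - if below mean y, + if above

# Cross-tabulate the signs: one cell per quadrant
table(x_side = sign(dx), y_side = sign(dy))

n_opposite <- sum(dx * dy < 0)  # points in (-,+) or (+,-)
n_same     <- sum(dx * dy > 0)  # points in (-,-) or (+,+)
c(opposite = n_opposite, same = n_same)
```

Here most points fall in the opposite-sign quadrants, which is the pattern of negative correlation.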

Covariance

\[ \textrm{cov}(x,y) = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})}{n} \]

Add up, over all observations, the difference between x and the mean of x times the difference between y and the mean of y
Find the mean of this
This is in the units of x times the units of y
It therefore depends on the units used.
In the rainfall example the unit is odd, to say the least
- it is in [km] $\times$ [inches] $\div$ [months]
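Both points can be seen in a few lines of R. Note that the formula above divides by $n$, whereas R's `cov()` divides by $n - 1$. The values below are invented for illustration and are not the rain_jan80 data.

```r
# Invented values for illustration only; NOT the rain_jan80 data
x <- c(2, 10, 25, 40, 60, 85)         # distance from coast, km
y <- c(6.1, 5.2, 4.8, 3.9, 4.1, 3.0)  # rainfall, inches
n <- length(x)

cov_n  <- sum((x - mean(x)) * (y - mean(y))) / n  # formula as written above
cov_n1 <- cov(x, y)                               # R divides by n - 1

cov_n
cov_n1

# Same distances expressed in metres: the covariance is 1000 times larger
cov_metres <- cov(x * 1000, y)
cov_metres
```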

Correlation rescales covariance

Remove the unit effects by dividing by $\textrm{sd}(x)\,\textrm{sd}(y)$: rescale by both x and y
Thus $\textrm{cor}(x,y) = \frac{\textrm{cov}(x,y)}{ \textrm{sd}(x) \textrm{sd}(y)}$
Correlation always takes values between -1 and 1
- $-1 \rightarrow$ Strongest possible negative (linear) association
- $\phantom{+} 0 \rightarrow$ No linear association
- $+1 \rightarrow$ Strongest possible positive (linear) association
For Jan 1980 rainfall data
- cor(rain_jan80$Coast,rain_jan80$Rainfall) = -0.6555913
- Greater distance from coast mostly associated with less rainfall…
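The rescaling can be checked by hand, along with the fact that correlation does not change when the units change. A sketch with invented values (not the rain_jan80 data):

```r
# Invented values for illustration only; NOT the rain_jan80 data
x <- c(2, 10, 25, 40, 60, 85)         # distance from coast, km
y <- c(6.1, 5.2, 4.8, 3.9, 4.1, 3.0)  # rainfall, inches

r_by_hand <- cov(x, y) / (sd(x) * sd(y))  # covariance rescaled by both sds
r_builtin <- cor(x, y)

r_by_hand
r_builtin

# Convert km to metres and inches to millimetres: correlation is unchanged
r_other_units <- cor(x * 1000, y * 25.4)
r_other_units
```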

Conclusion

💡 New ideas

New general ideas
- Using graphics to explore data
- Basic statistics to explore data
New techniques
- Mean, median, sd, correlation, IQR
- Boxplot, scatterplot, density plot, histogram
Practical issues
- R commands for above, and annotating graphs
Next lecture - Using Geographical Data and Mapping
This link may be useful particularly the sf format discussion.