Graphical Data Exploration

A data set to use

  • Download the data dpt.csv from the Teams Class Materials folder in the Files section in the General channel

  • Overview of the data here

  • This has two columns

    • Date : date on which measurements taken
    • PM10 : Level of particulate matter (daily average) at Dublin Port Tunnel
  • PM10 describes inhalable particles, with diameters that are generally 10 \(\mu\)m and smaller.

    • Level is measured in \(\mu g\, m^{-3}\) (micrograms per cubic metre)
    • 50 \(\mu g\, m^{-3}\) is considered a limit value

Why not just look at the numbers?

library(readr)
dpt <- read_csv('dpt.csv')
dpt$PM10
##   [1]  8.219792  8.764583 13.368750  7.432292  6.008333 13.534375 21.034375
##   [8] 11.550000 11.792708 20.832292  9.136458 13.201923 12.856250 61.700000
##  [15] 16.633333 15.459375 17.439583 26.765625 56.996875 14.800000 23.027083
##  [22] 20.506250 37.785417 28.714583 14.471875 13.919444  9.177083 11.206250
##  [29]  5.441667  5.827083  3.803125  5.195833  6.114583 10.163542 21.531250
##  [36] 19.636458 13.588542 15.700000 20.180208 14.850000 16.828125 12.143182
##  [43] 10.792361 11.877083  7.111458  6.513542 11.756250 20.808333 12.647917
##  [50]  6.856250 12.396875  5.323958 31.517708 20.792014 11.925694  5.243750
##  [57]  6.693750 14.632292  9.880208 13.438542 12.381250 23.457292 34.204167
##  [64] 19.496875 13.539583 16.554167 18.290625 24.913542 30.075000 14.872569
##  [71] 10.289583  6.861458  7.903125 11.429167 19.469792 19.133333 16.719792
##  [78] 24.810417 18.256250 14.785417  8.616667 32.409615 24.564236 83.942708
##  [85] 39.155208 21.264583 13.112500 32.660417 11.938636        NA  8.550000
##  [92] 17.038542  7.354167  6.591667 13.840625  7.443750 10.374306 19.456597
##  [99] 26.453125 20.583333 33.951042 23.797917 32.590625 27.559375 22.747917
## ...
  • Not the easiest way to see what is going on.

    • Possibly the only obvious feature is the NA (missing) value.

Getting an idea of distribution of values

hist(dpt$PM10)

Using Some Options To Control Appearance

hist(dpt$PM10,col='dodgerblue',
     main='PM10 Concentration (Daily Average)',
     xlab='Conc (micrograms per cubic metre)') # Histogram with options
abline(v=50,col='red') # Also add a vertical line at 50

Alternative approaches - Boxplots

boxplot(dpt$PM10,horizontal = TRUE,
     main='PM10 Concentration (Daily Average)',
     xlab='Conc (micrograms per cubic metre)') # Boxplot with options
abline(v=50,col='red') # Also add a vertical line at 50

How it works
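
A minimal sketch of what the boxplot draws, using only the first 14 PM10 values printed earlier (so the numbers differ from those for the full data set). It relies on boxplot.stats from base R, which returns the values behind the picture: the box runs between the lower and upper hinges, the line inside it is the median, the whiskers reach to the most extreme points within 1.5 times the box length, and anything beyond that is drawn as a separate point.

```r
# First 14 PM10 values from the printed output (small sample, illustration only)
pm10 <- c(8.219792, 8.764583, 13.368750, 7.432292, 6.008333, 13.534375, 21.034375,
          11.550000, 11.792708, 20.832292, 9.136458, 13.201923, 12.856250, 61.700000)

bs <- boxplot.stats(pm10)  # default coef = 1.5 sets the whisker length
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond the whiskers, plotted individually
```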

Alternatively - density plots (cf session 1)

PM10_dens <- density(dpt$PM10,na.rm = TRUE)
plot(PM10_dens,main='PM10 Concentration (Daily Average)')
abline(v=50,col='red')

Notes

  • This works slightly differently to the other approaches
  • The density function doesn’t automatically make a plot
  • You have to create a density object, then plot it
  • No logical reason - it was just coded that way…
density(dpt$PM10,na.rm = TRUE) # Call without a plot command
## 
## Call:
##  density.default(x = dpt$PM10, na.rm = TRUE)
## 
## Data: dpt$PM10 (272 obs.);   Bandwidth 'bw' = 2.002
## 
##        x                y            
##  Min.   :-2.688   Min.   :8.000e-08  
##  1st Qu.:20.471   1st Qu.:1.635e-04  
##  Median :43.630   Median :7.390e-04  
##  Mean   :43.630   Mean   :1.078e-02  
##  3rd Qu.:66.789   3rd Qu.:1.237e-02  
##  Max.   :89.948   Max.   :6.345e-02

How it works

  • Create a normal curve around each data point
  • Average them
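
The two steps above can be sketched by hand and compared with density(). This is an illustration under stated assumptions, not the exact internal algorithm: the bandwidth of 2 and the evaluation grid are chosen here for convenience, and only the first seven PM10 values from the printed output are used.

```r
# First seven PM10 values from the printed output (illustration only)
x    <- c(8.219792, 8.764583, 13.368750, 7.432292, 6.008333, 13.534375, 21.034375)
bw   <- 2                      # assumed bandwidth: the sd of each normal curve
grid <- seq(0, 30, by = 0.5)   # assumed points at which the curve is evaluated

# At each grid point: height of every normal curve there, then their average
kde_by_hand <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = bw)))

# Same bandwidth and grid passed to density() for comparison
kde_r <- density(x, bw = bw, from = 0, to = 30, n = length(grid))
max(abs(kde_by_hand - kde_r$y))  # small: density() uses a fast approximation
```

A larger bandwidth gives wider normal curves and a smoother result; a smaller one gives a spikier curve.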

Numerical Summary Statistics

Median

  • Middle value when observations are sorted in order
  • If \(n\) is even, it is the halfway point between the two central values (for the PM10 data the median is 12.405)

Interquartile Range

  • Difference between the quarter and three quarter points along the ordered observations: here it is 9.146

Computing these in R

Median and IQR are straightforward; the na.rm option tells R to ignore the NA value

median(dpt$PM10,na.rm=TRUE)
## [1] 12.40469
IQR(dpt$PM10,na.rm=TRUE)
## [1] 9.145833

Also the five number summary:

fivenum(dpt$PM10)
## [1]  3.317708  8.521875 12.404687 17.755208 83.942708

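The numbers are: minimum, Q1, median, Q3, maximum. fivenum drops NA values by default, and it computes Q1 and Q3 as "hinges", so Q3 - Q1 here (9.233) differs slightly from the IQR() result (9.146).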
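
As a check on the definitions, the median and IQR can be worked out by hand on a small sample. The sketch below uses the first eight PM10 values from the printed output, so the results are not those of the full data set. It also shows that the hinges returned by fivenum need not match the quartiles returned by quantile, which is what IQR uses.

```r
# First eight PM10 values from the printed output (illustration only)
x <- c(8.219792, 8.764583, 13.368750, 7.432292,
       6.008333, 13.534375, 21.034375, 11.550000)
s <- sort(x)
n <- length(s)                           # n = 8, an even number

# Median for even n: halfway between the two central values
med_by_hand <- (s[n / 2] + s[n / 2 + 1]) / 2

# IQR: distance between the quarter and three-quarter points
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

# Spread between the hinges reported by fivenum()
hinge_spread <- fivenum(x)[4] - fivenum(x)[2]

c(median = med_by_hand, IQR = iqr_by_hand, hinge_spread = hinge_spread)
```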

Mean

  • Imagine each observation is a solid block of equal weight, placed on a plank at the position of its value

  • The mean (add the values and divide by \(n\)) is the point at which the plank balances

The standard deviation

  • Not so easy to explain with a diagram, but it is a measure of spread (like the IQR)

  • Main idea is this:

    • Find the difference of each observation from the mean

    • Square these (so deviations above and below the mean both count as positive)

    • Take the average of these squared deviations

    • Take square root to bring it back to the original units

\[ \sigma = \sqrt{\frac{1}{n} \sum_i \left(x_i - \bar{x}\right)^2} \]
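
The steps above can be followed literally in R. One caveat: the formula divides by \(n\), whereas R's sd() divides by \(n - 1\), so the two give slightly different answers (the gap shrinks as \(n\) grows). The sketch uses the first seven PM10 values from the printed output, purely for illustration.

```r
# First seven PM10 values from the printed output (illustration only)
x <- c(8.219792, 8.764583, 13.368750, 7.432292, 6.008333, 13.534375, 21.034375)
n <- length(x)

dev    <- x - mean(x)     # difference of each observation from the mean
sq_dev <- dev^2           # squared, so every term is non-negative

sd_n  <- sqrt(sum(sq_dev) / n)        # divide by n, as in the formula above
sd_n1 <- sqrt(sum(sq_dev) / (n - 1))  # divide by n - 1, as sd() does

c(by_formula = sd_n, n_minus_1 = sd_n1, r_sd = sd(x))
```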

Computing these in R

  • As with the other functions, the na.rm option removes NA values before the statistic is computed:
mean(dpt$PM10,na.rm=TRUE) 
## [1] 14.6721
sd(dpt$PM10,na.rm=TRUE) 
## [1] 9.774103
  • For a normal distribution

    • Around 95% of values lie within mean \(\pm\) 2 \(\times\) SD.
mean(dpt$PM10,na.rm=TRUE) + c(-2,2)*sd(dpt$PM10,na.rm=TRUE) 
## [1] -4.876102 34.220310
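  • The lower limit above is negative, which is impossible for a concentration, and the data include values far above the upper limit (up to 83.9); this suggests PM10 is not normally distributed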
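
A quick way to see the 95% rule at work is to simulate values that really are normal and count how many fall inside mean \(\pm\) 2 \(\times\) SD. This is a sketch on simulated data: the seed, sample size, mean and SD below are arbitrary choices made for illustration, not properties of the PM10 data.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
z <- rnorm(100000, mean = 14.7, sd = 9.8)    # simulated normal values

lims <- mean(z) + c(-2, 2) * sd(z)           # same calculation as for PM10
prop_within <- mean(z >= lims[1] & z <= lims[2])

theory <- pnorm(2) - pnorm(-2)               # exact share for a normal distribution
c(simulated = prop_within, theory = theory)

# The same check on the real data, assuming dpt has been loaded as above:
# lims <- mean(dpt$PM10, na.rm = TRUE) + c(-2, 2) * sd(dpt$PM10, na.rm = TRUE)
# mean(dpt$PM10 >= lims[1] & dpt$PM10 <= lims[2], na.rm = TRUE)
```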

More than 1 Variable

A New Data Set

Visual Inspection

rain_jan80 <- read_csv('rain_jan80.csv')
plot(rain_jan80)

A fitted straight line

fit_line <- lm(Rainfall~Coast,data=rain_jan80)
plot(rain_jan80)
abline(fit_line,col='darkblue')

The Correlation Coefficient - 1

  • With negative correlation, most points lie in the (-,+) or (+,-) quadrants, where the signs are those of \(x - \bar{x}\) and \(y - \bar{y}\)

The Correlation Coefficient - 2

  • With positive correlation, most points lie in the (-,-) or (+,+) quadrants

The Correlation Coefficient - 3

  • With no correlation, points are spread roughly equally across all four quadrants
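
The quadrant idea can be checked by counting. In the sketch below the x and y values are made up for illustration (they are not the rainfall data); each point is assigned to a quadrant by the signs of its deviations from the two means.

```r
# Hypothetical values, invented for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(9, 7, 8, 4, 3, 1)

dx <- x - mean(x)   # sign says whether x is below (-) or above (+) its mean
dy <- y - mean(y)   # likewise for y

table(x_side = sign(dx), y_side = sign(dy))   # number of points per quadrant

opposite_signs <- sum(dx * dy < 0)   # points in (-,+) or (+,-)
same_signs     <- sum(dx * dy > 0)   # points in (-,-) or (+,+)
c(opposite = opposite_signs, same = same_signs, cor = cor(x, y))
```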

Covariance

\[ \textrm{cov}(x,y) = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})}{n} \]

  • For each observation, multiply the difference between x and the mean of x by the difference between y and the mean of y, then add up these products

  • Divide by \(n\) to get the mean of these products

  • This is in the units of x times the units of y

  • Therefore depends on units used.

  • In the rainfall example it is odd, to say the least

    • it is in [Km] \(\times\) [inches] \(\div\) [Months]
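
A short sketch of the covariance calculation and its dependence on units. The data frame below is hypothetical: the column names follow rain_jan80, but the values are invented for illustration. Note that R's cov() divides by \(n - 1\) rather than \(n\), so it differs slightly from the formula above.

```r
# Hypothetical stand-in for the rainfall data (invented values)
rain_demo <- data.frame(Coast    = c(5, 20, 40, 60, 90),
                        Rainfall = c(7.1, 6.0, 5.2, 4.1, 3.9))
x <- rain_demo$Coast
y <- rain_demo$Rainfall
n <- nrow(rain_demo)

prods  <- (x - mean(x)) * (y - mean(y))  # product of deviations, one per row
cov_n  <- sum(prods) / n                 # divide by n, as in the formula above
cov_n1 <- sum(prods) / (n - 1)           # divide by n - 1, as cov() does

# Changing the units of x (km to metres) rescales the covariance
cov_metres <- cov(x * 1000, y)
c(by_formula = cov_n, r_cov = cov(x, y), in_metres = cov_metres)
```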

Correlation rescales covariance

  • Remove the unit effects by dividing by sd(x) \(\times\) sd(y): rescale by both x and y

  • Thus \(\textrm{cor}(x,y) = \frac{\textrm{cov}(x,y)}{ \textrm{sd}(x) \textrm{sd}(y)}\)

  • Correlation always takes values between -1 and 1

    • \(-1 \rightarrow\) Perfect negative (linear) association
    • \(\phantom{+} 0 \rightarrow\) No linear association
    • \(+1 \rightarrow\) Perfect positive (linear) association
  • For Jan 1980 rainfall data

    • cor(rain_jan80$Coast,rain_jan80$Rainfall) = -0.6555913
    • Greater distance from coast mostly associated with less rainfall…
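
The rescaling can be verified directly. The sketch below again uses a hypothetical data frame (column names as in rain_jan80, values invented) and checks three things: that cov divided by the two standard deviations reproduces cor(), that the result is the same whether \(n\) or \(n - 1\) is used throughout, and that changing the units of x leaves the correlation unchanged.

```r
# Hypothetical stand-in for the rainfall data (invented values)
rain_demo <- data.frame(Coast    = c(5, 20, 40, 60, 90),
                        Rainfall = c(7.1, 6.0, 5.2, 4.1, 3.9))
x <- rain_demo$Coast
y <- rain_demo$Rainfall

# Correlation as rescaled covariance, using R's n - 1 versions
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Same calculation with division by n throughout: the n's cancel
cov_n <- mean((x - mean(x)) * (y - mean(y)))
sd_n  <- function(v) sqrt(mean((v - mean(v))^2))
r_n   <- cov_n / (sd_n(x) * sd_n(y))

# Units no longer matter: km to metres leaves the correlation unchanged
r_metres <- cor(x * 1000, y)
c(by_hand = r_by_hand, with_n = r_n, in_metres = r_metres, r_cor = cor(x, y))
```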

Conclusion

💡 New ideas

  • New general ideas

    • Using graphics to explore data
    • Basic statistics to explore data
  • New techniques

    • Mean, median, sd, correlation, IQR
    • Boxplot, scatterplot, density plot, histogram
  • Practical issues

    • R commands for above, and annotating graphs
  • Next lecture - Using Geographical Data and Mapping

  • This link may be useful, particularly the sf format discussion.