Download the data dpt.csv from the Teams Class Materials folder in the Files section in the General channel
Overview of the data here
This has two columns:
Date: date on which measurements were taken
PM10: level of particulate matter (daily average) at Dublin Port Tunnel
PM10 describes inhalable particles, with diameters that are generally 10 \(\mu\)m and smaller.
library(readr)
dpt <- read_csv('dpt.csv')
dpt$PM10
##  [1]  8.219792  8.764583 13.368750  7.432292  6.008333 13.534375 21.034375
##  [8] 11.550000 11.792708 20.832292  9.136458 13.201923 12.856250 61.700000
## [15] 16.633333 15.459375 17.439583 26.765625 56.996875 14.800000 23.027083
## [22] 20.506250 37.785417 28.714583 14.471875 13.919444  9.177083 11.206250
## [29]  5.441667  5.827083  3.803125  5.195833  6.114583 10.163542 21.531250
## [36] 19.636458 13.588542 15.700000 20.180208 14.850000 16.828125 12.143182
## [43] 10.792361 11.877083  7.111458  6.513542 11.756250 20.808333 12.647917
## [50]  6.856250 12.396875  5.323958 31.517708 20.792014 11.925694  5.243750
## [57]  6.693750 14.632292  9.880208 13.438542 12.381250 23.457292 34.204167
## [64] 19.496875 13.539583 16.554167 18.290625 24.913542 30.075000 14.872569
## [71] 10.289583  6.861458  7.903125 11.429167 19.469792 19.133333 16.719792
## [78] 24.810417 18.256250 14.785417  8.616667 32.409615 24.564236 83.942708
## [85] 39.155208 21.264583 13.112500 32.660417 11.938636        NA  8.550000
## [92] 17.038542  7.354167  6.591667 13.840625  7.443750 10.374306 19.456597
## [99] 26.453125 20.583333 33.951042 23.797917 32.590625 27.559375 22.747917
## ...
Not the easiest way to see what is going on.
Note the NA (missing) value.
hist(dpt$PM10)
hist(dpt$PM10,col='dodgerblue',
     main='PM10 Concentration (Daily Average)',
     xlab='Conc (micrograms per cubic metre)') # Histogram with options
abline(v=50,col='red') # Also add a vertical line at 50
boxplot(dpt$PM10,horizontal = TRUE,
        main='PM10 Concentration (Daily Average)',
        xlab='Conc (micrograms per cubic metre)') # Boxplot with options
abline(v=50,col='red') # Also add a vertical line at 50
source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
PM10_dens <- density(dpt$PM10,na.rm = TRUE)
plot(PM10_dens,main='PM10 Concentration (Daily Average)')
abline(v=50,col='red')
The density function doesn’t automatically make a plot:
density(dpt$PM10,na.rm = TRUE) # Call without a plot command
##
## Call:
##  density.default(x = dpt$PM10, na.rm = TRUE)
##
## Data: dpt$PM10 (272 obs.);  Bandwidth 'bw' = 2.002
##
##        x                y
##  Min.   :-2.688   Min.   :8.000e-08
##  1st Qu.:20.471   1st Qu.:1.635e-04
##  Median :43.630   Median :7.390e-04
##  Mean   :43.630   Mean   :1.078e-02
##  3rd Qu.:66.789   3rd Qu.:1.237e-02
##  Max.   :89.948   Max.   :6.345e-02
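A minimal sketch of what a density object holds, using simulated toy values (an assumption, not the PM10 series). The object stores a grid of x values and the estimated curve height y at each one; nothing is drawn until plot is called on it.

```r
# Toy data: simulated values plus one NA (hypothetical, not dpt.csv)
set.seed(1)
x <- c(rnorm(50, mean = 10, sd = 2), NA)

d <- density(x, na.rm = TRUE)   # estimate only; no plot is produced

class(d)       # "density"
length(d$x)    # number of grid points (512 by default)

# The estimated curve should enclose an area of roughly 1
area <- sum(d$y) * diff(d$x[1:2])
area
```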
Median and IQR are straightforward: the na.rm option tells R to ignore the NA value
median(dpt$PM10,na.rm=TRUE)
## [1] 12.40469
IQR(dpt$PM10,na.rm=TRUE)
## [1] 9.145833
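A small sketch of why na.rm matters, on a made-up vector (hypothetical values, not the PM10 data). Without na.rm the median comes back as NA.

```r
# Hypothetical values with one missing observation
x <- c(4, 8, NA, 15, 16, 23, 42)

sum(is.na(x))                          # how many values are missing
med_with_na <- median(x)               # NA: the missing value propagates
med_clean   <- median(x, na.rm = TRUE) # median of the six observed values
iqr_clean   <- IQR(x, na.rm = TRUE)    # Q3 - Q1 of the observed values

med_with_na
med_clean
iqr_clean
```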
Also the five number summary:
fivenum(dpt$PM10)
## [1] 3.317708 8.521875 12.404687 17.755208 83.942708
The numbers are: min, lower hinge (roughly Q1), median, upper hinge (roughly Q3), max
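fivenum reports Tukey's hinges, which can differ slightly from the quartiles that quantile and IQR use by default. That is why the gap between the second and fourth numbers above (9.233333) is not exactly the IQR (9.145833). A sketch on made-up values (hypothetical, not the PM10 data):

```r
# Hypothetical values
x <- c(4, 8, 15, 16, 23, 42)

five <- fivenum(x)                           # min, lower hinge, median, upper hinge, max
q    <- unname(quantile(x, c(0.25, 0.75)))   # default quartiles

five
q
five[4] - five[2]   # hinge spread
IQR(x)              # quartile spread; not the same number here
```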
Imagine each observation is a solid block; plot is balanced on a plank
The mean (add values and divide by \(n\)) is the balancing point
The standard deviation is not so easy to explain with a diagram, but it's a measure of spread (like the IQR)
Main idea is this:
Find the difference of each observation from the mean
Square these (so being above or below the mean is always positive)
Take the average of these squared deviations
Take square root to bring it back to the original units
\[ \sigma = \sqrt{\frac{1}{n} \sum_i \left(x_i - \bar{x}\right)^2} \]
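The steps can be sketched directly in R on a made-up vector (hypothetical values). One caveat: the formula here divides by \(n\), while R's sd divides by \(n - 1\), so the two differ slightly for small samples.

```r
# Hypothetical values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

dev    <- x - mean(x)          # difference of each observation from the mean
sq_dev <- dev^2                # squared, so all contributions are positive
sd_n   <- sqrt(mean(sq_dev))   # average, then square root: divides by n

sd_r <- sd(x)                  # R's built-in version divides by n - 1

sd_n
sd_r
sd_r * sqrt((n - 1) / n)       # rescaling recovers the divide-by-n value
```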
The na.rm option removes NA values before computing the value:
mean(dpt$PM10,na.rm=TRUE)
## [1] 14.6721
sd(dpt$PM10,na.rm=TRUE)
## [1] 9.774103
For a normal distribution, about 95% of values lie within mean \(\pm\) 2 \(\times\) SD.
mean(dpt$PM10,na.rm=TRUE) + c(-2,2)*sd(dpt$PM10,na.rm=TRUE)
## [1] -4.876102 34.220310
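The mean \(\pm\) 2 \(\times\) SD interval is usually read as covering roughly 95% of a normal distribution. A quick check on simulated values (an illustration only; the mean and SD used are arbitrary choices, not estimates from the PM10 data):

```r
# Simulated normal data (hypothetical parameters)
set.seed(42)
z <- rnorm(10000, mean = 15, sd = 10)

limits <- mean(z) + c(-2, 2) * sd(z)             # same calculation as for PM10
inside <- mean(z >= limits[1] & z <= limits[2])  # share of values within the limits

theory <- pnorm(2) - pnorm(-2)                   # exact share for a normal distribution

inside
theory
```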
PM10 is not normally distributed: the lower limit above is negative, which a concentration can never be.
Download the data rain_jan80.csv from the Teams Class Materials folder in the Files section in the General channel
Rainfall for 25 Irish weather stations, January 1980
This one has two columns: Coast and Rainfall
Source
rain_jan80 <- read_csv('rain_jan80.csv')
plot(rain_jan80)
fit_line <- lm(Rainfall~Coast,data=rain_jan80)
plot(rain_jan80)
abline(fit_line,col='darkblue')
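A sketch of what lm stores, on a tiny made-up data frame. The column names match the code above, but the values are hypothetical and chosen to lie exactly on a line. abline draws the line using the fitted intercept and slope.

```r
# Hypothetical values lying exactly on Rainfall = 100 - 1 * Coast
toy <- data.frame(Coast    = c(0, 10, 20, 30),
                  Rainfall = c(100, 90, 80, 70))

toy_fit <- lm(Rainfall ~ Coast, data = toy)

cf <- unname(coef(toy_fit))   # intercept first, then slope
cf
```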
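(-,+) or (+,-): the deviations of x and y from their means have opposite signs, so their product is negative
(-,-) or (+,+): the deviations have the same sign, so their product is positive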
\[ \textrm{cov}(x,y) = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})}{n} \]
Add up (difference between x and the mean of x) times (difference between y and the mean of y)
Find the mean of this
This is in the units of x times the units of y
Therefore it depends on the units used.
In the rainfall example it is odd, to say the least
Remove the unit effects by dividing by sd(x) \(\times\) sd(y): rescale by both x and y
Thus \(\textrm{cor}(x,y) = \frac{\textrm{cov}(x,y)}{ \textrm{sd}(x) \textrm{sd}(y)}\)
Correlation always takes values between -1 and 1
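A sketch of covariance and correlation by hand, on made-up vectors (hypothetical values, not the rainfall data). Note that R's cov and sd divide by \(n - 1\) rather than \(n\); the factor cancels in the correlation, so cor is the same either way.

```r
# Hypothetical values
x <- c(1, 2, 3, 4, 5)
y <- c(10, 8, 7, 3, 2)

cov_n <- mean((x - mean(x)) * (y - mean(y)))  # divide-by-n covariance
cov_r <- cov(x, y)                            # R's version divides by n - 1

r_hand <- cov_r / (sd(x) * sd(y))             # rescale by both spreads
r_r    <- cor(x, y)                           # built-in correlation

# Changing the units of x changes the covariance but not the correlation
cov_scaled <- cov(100 * x, y)
r_scaled   <- cor(100 * x, y)

c(cov_n = cov_n, cov_r = cov_r, r_hand = r_hand, r_r = r_r)
```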
For Jan 1980 rainfall data, cor(rain_jan80$Coast,rain_jan80$Rainfall) = -0.6555913
New general ideas
New techniques
Practical issues
Next lecture - Using Geographical Data and Mapping
This link may be useful, particularly the sf format discussion.