Software Tools for Earth and Environmental Science

– 12th WEEK –

Emir Toker

08/01/2021

R Probability

Syllabus and Book
R Statistics - Repeat
R - Probability
Next Week

Syllabus and Book

Syllabus

Book

LINK

R - Elemantary Statistics

Centrality: Mean, Median, Mode

Measures of centrality are commonly used to explain large collections of data by describing where numeric observations are centered.

R - Elemantary Statistics

Centrality: Mean, Median, Mode

Median : “middle magnitude” of your observations

0 . 0 . 0 . 0 . 0

o . o . o . o . o . o

R - Elemantary Statistics

Centrality: Mean, Median, Mode

Mode : Simply the “most common” observation.

Sample : 2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4

2 , 2 , 2 , 2.2 , 3 , 3 , 4, 4.4 ( n=8 , n/2 = 4)

R - Elemantary Statistics

Centrality: Mean, Median, Mode

xdata <- c(2,4.4,3,3,2,2.2,2,4)

mean(xdata)
median(xdata)
min(xdata)
max(xdata)
range(xdata)

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

A quantile is a value indicates an observation rank when compared to all the other present observations.
For example, the median is itself a quantile. It’s the 0.5th quantile.
Alternatively, quantiles can be expressed as a percentile.

The median = the 0.5th quantile = The 50th percentile

Sample : 2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4

2 , 2 , 2 , 2.2 , 3 , 3 , 4, 4.4

0.5th quantile = median = 2.6

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

xdata <- c(2,4.4,3,3,2,2.2,2,4)
quantile(xdata,prob=0.8) # the 0.8th quan- tile (or 80th percentile)

## 80% 
## 3.6

quantile(xdata,prob=c(0,0.25,0.5,0.75,1))

##   0%  25%  50%  75% 100% 
## 2.00 2.00 2.60 3.25 4.40

summary(xdata) # A quartile is a type of quantile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.600   2.825   3.250   4.400

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

A quartile is a type of quantile.

R - Elemantary Statistics

Quantiles , Percentiles, and the Five-Number Summary

xdata <- c(2,4.4,3,3,2,2.2,2,4)

quantile(xdata)
summary(xdata)

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

How dispersed your data are. For this, measures of spread are needed.

xdata <- c(2,4.4,3,3,2,2.2,2,4)
ydata <- c(1,4.4,1,3,2,2.2,2,7)

mean(xdata)

## [1] 2.825

mean(ydata)

## [1] 2.825

R - Elemantary Statistics

plot(xdata,type="n",xlab="values",ylab="data vector",yaxt="n",bty="n", xlim = c(0,7)  )
abline(h=c(2.5,3),lty=2,col="red")
abline(v=2.825,lwd=2,lty=3)
text(c(0.0,0.0),c(2.5,3),labels=c("x","y"))
text(c(2.825),c(4),labels=c("mean"))
points(jitter(c(xdata,ydata)),c(rep(2.5,length(xdata)), rep(3,length(ydata))))

the observations in ydata are more “spread out”

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

The sample variance measures the degree of the spread of numeric observations around their arithmetic mean.

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4 ( mean = 2.825)

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

The standard deviation is simply the square root of the variance. The scale of the original observations.

0.953 represents the average distance of each observation from the mean

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range,

Unlike the variance and standard deviation, the interquartile range (IQR) is not computed with respect to the sample mean.

IQR is computed as the difference between the upper and lower quartiles of your data

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

xdata <- c(2,4.4,3,3,2,2.2,2,4)

var(xdata)

## [1] 0.9078571

sd(xdata)

## [1] 0.9528154

IQR(xdata)

## [1] 1.25

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

xdata <- c(2,4.4,3,3,2,2.2,2,4)

var()
sd()
IQR()

R - Elemantary Statistics

Covariance and Correlation

Investigate the relationship between two numeric variables to assess trends
The covariance expresses how much two numeric variables “change together” and the nature of that relationship, whether it is positive or negative.

R - Elemantary Statistics

Covariance and Correlation

x = {x1,x2,…,xn}

y = {y1,y2,…,yn}

for i = 1,. . . ,n

When you get a positive result for rxy, it shows that there is a positive lin- ear relationship. When rxy = 0, this indicates that there is no linear relationship.

R - Elemantary Statistics

Covariance and Correlation

x = {2,4.4,3,3,2,2.2,2,4}

y = {1,4.4,1,3,2,2.2,2,7}

mean x and y = 2.825

positive relationship

R - Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )

R Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )
abline(lm(y~x), col="blue")

R - Elemantary Statistics

Covariance and Correlation

Correlation allows you to interpret the covariance further by identifying the strength of any association.

R - Elemantary Statistics

Covariance and Correlation

Most common of these is Pearson’s product-moment correlation coefficient. (R default)
The correlation coefficient estimates the nature of the linear relationship between two sets of observations

−1 ≤ ρxy ≤ 1

ρxy = 1, which is a perfect positive linear relationship

R - Elemantary Statistics

Covariance and Correlation

x = {2,4.4,3,3,2,2.2,2,4}

y = {1,4.4,1,3,2,2.2,2,7}

(mean x and y = 2.825)

(sx = 0.953 and sy = 2.013)

(rxy = 1.479)

ρxy is positive

R - Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )
abline(lm(y~x), col="blue")

R - Elemantary Statistics

Covariance and Correlation

xdata <- c(2,4.4,3,3,2,2.2,2,4)
ydata <- c(1,4.4,1,3,2,2.2,2,7)

cov(xdata,ydata)

## [1] 1.479286

cor(xdata,ydata)

## [1] 0.7713962

R - Elemantary Statistics

Covariance and Correlation

cov( , )
cor( , )

R - Elemantary Statistics

Basic Data Visualization

Barplots

station_data <- read.csv("https://web.itu.edu.tr/~tokerem/18397_Cekmekoy_Omerli_15min.txt", header=T, sep = ";")

head(station_data)

##   sta_no year month day hour minutes temp precipitation pressure
## 1  18397 2017     7  26   18       0 23.9             0   1003.0
## 2  18397 2017     7  26   18      15 23.9             0   1003.1
## 3  18397 2017     7  26   18      30 23.8             0   1003.2
## 4  18397 2017     7  26   18      45 23.8             0   1003.2
## 5  18397 2017     7  26   19       0 23.6             0   1003.2
## 6  18397 2017     7  26   19      15 23.2             0   1003.1
##   relative_humidity
## 1                94
## 2                95
## 3                96
## 4                96
## 5                96
## 6                97

Basic Data Visualization

Barplots

head(station_data$temp)

## [1] 23.9 23.9 23.8 23.8 23.6 23.2

barplot(station_data$temp)

Basic Data Visualization

Barplots

station_data$temp

##   [1] 23.9 23.9 23.8 23.8 23.6 23.2 23.2 23.1 23.0 22.8 22.5 22.4 22.2 22.3 22.2
##  [16] 21.7 21.9 21.7 21.6 22.2 22.2 22.1 22.3 22.5 22.3 22.2 22.5 22.6 22.6 22.6
##  [31] 22.6 22.7 22.6 22.5 22.6 22.5 22.5 22.4 22.5 22.4 22.5 22.6 23.0 23.2 24.2
##  [46] 25.1 25.5 26.1 27.1 26.9 27.6 28.0 28.4 28.5 29.3 30.2 30.1 30.1 30.4 30.4
##  [61] 30.8 30.9 31.0 31.5 31.2 30.9 30.9 30.4 30.4 30.0 29.2 29.5 29.4 29.3 29.6
##  [76] 28.8 29.0 29.0 29.2 28.4 27.8 27.4 26.6 26.2 25.8 25.6 25.4 24.2 19.2 19.5
##  [91] 20.1 20.8 21.2 21.4 21.4 21.4 21.2 21.0 20.8 20.9 20.8 20.7 20.8 20.8 20.9
## [106] 20.6 20.6 20.5 20.7 20.8 20.4 20.4 20.6 20.5 20.4 20.5 20.5 20.6 20.5 20.5
## [121] 20.4

Basic Data Visualization

Barplots

table(station_data$temp)

## 
## 19.2 19.5 20.1 20.4 20.5 20.6 20.7 20.8 20.9   21 21.2 21.4 21.6 21.7 21.9 22.1 
##    1    1    1    4    6    4    2    6    2    1    2    3    1    2    1    1 
## 22.2 22.3 22.4 22.5 22.6 22.7 22.8   23 23.1 23.2 23.6 23.8 23.9 24.2 25.1 25.4 
##    5    3    3    8    7    1    1    2    1    3    1    2    2    2    1    1 
## 25.5 25.6 25.8 26.1 26.2 26.6 26.9 27.1 27.4 27.6 27.8   28 28.4 28.5 28.8   29 
##    1    1    1    1    1    1    1    1    1    1    1    1    2    1    1    2 
## 29.2 29.3 29.4 29.5 29.6   30 30.1 30.2 30.4 30.8 30.9   31 31.2 31.5 
##    2    2    1    1    1    1    2    1    4    1    3    1    1    1

f_temp <- table(station_data$temp)

Basic Data Visualization

Barplots

barplot(f_temp)

Basic Data Visualization

Histogram

hist(station_data$temp)

Basic Data Visualization

Histogram

library(ggplot2)
qplot(station_data$temp,geom="blank",main="Temp Hist",xlab="Temp")+ 
  geom_histogram(color="black",fill="white",breaks=seq(19,32,1),closed="right") + 
  geom_vline(mapping=aes(xintercept=c(mean(station_data$tem), median(station_data$tem)), linetype=factor(c("mean","median"))) , col=c("blue","red"),show.legend=TRUE)+ 
  scale_linetype_manual(values=c(2,3)) + 
  labs(linetype="")

Basic Data Visualization

Boxplot

boxplot(station_data$temp)

mean(station_data$temp)

## [1] 24.24132

median(station_data$temp)

## [1] 22.6

quantile(station_data$temp)

##   0%  25%  50%  75% 100% 
## 19.2 21.4 22.6 27.6 31.5

# summary(station_data$temp)

Basic Data Visualization

Boxplot

Basic Data Visualization

Histogram and Boxplot

hist(station_data$temp)
abline(v=median(station_data$temp),col="red")

boxplot(station_data$temp,horizontal = T)
abline(v=median(station_data$temp),col="red")

Basic Data Visualization

Scatter Plots

plot(station_data$temp,station_data$relative_humidity)

R - Probability

Probability
Qualitative and Quantitative Data
Discrete and Continuous Data
Probability ‘Density vs Mass’ Function (PDF vs PMF)
Cumulative Distribution Function (CDF)
Shape of PDF : Symmetry, Skewness, Modality, Kurtosis
Common Probability Distributions - Discrete
Common Probability Distributions - Continuous
Practice : Write A Function for outliers

Probability

A probability is a number that describes the “magnitude of chance” associated with making a particular observation or statement.

? what is the probability of rolling a 3 with this black dice

It’s always a number between 0 and 1 (inclusive) and is often expressed as a fraction.

Qualitative and Quantitative Data

Quantitative data can be counted, measured, and expressed using numbers. Qualitative data is descriptive and conceptual.

Discrete and Continuous Data

Discrete data is information that can only take certain values. Continuous data is data that can take any value

Probability ‘Mass vs Density’ Function (PDF vs PMF)

Probability mass and density functions are used to describe discrete and continuous probability distributions, respectively.

Cumulative Distribution Function

CDF describes the probability (with a given probability distribution) at less than or equal to x.

Dice Example - White and Black Dice

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability mass function")

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
X.cumul <- cumsum(X.prob)
barplot(X.cumul,names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X <= x)", main = "cumulative distribution function")

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability mass function")
abline(v=c(0.5:10.5))

PDF - Probability Density Function

lower < 7 < upper

X >= 2  &  X <= 7
(X[lower] - 1)/36

X > 7 & X <= 12
13 - X[upper])/36

X.outcomes <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)
lower <- X.outcomes >= 2 & X.outcomes <= 7
upper <- X.outcomes > 7 & X.outcomes <= 12

fx <- rep(0,length(X.outcomes))
fx[lower] <- (X.outcomes[lower] - 1)/36
fx[upper] <- (13 - X.outcomes[upper])/36

plot(X.outcomes,fx,type="l",ylab="f(x)", xlim = c(0,14), main = "probability density function")
abline(h=0,col="gray",lty=2)

fx.specific <- (4.5-1)/36
fx.specific.area <- 3.5*fx.specific*0.5
fx.specific.vertices <- rbind(c(1,0),c(4.5,0),c(4.5,fx.specific))

plot(X.outcomes,fx,type="l",ylab="f(x)", xlim = c(0,14), main = "probability density function")
abline(h=0,col="gray",lty=2)
polygon(fx.specific.vertices,col="gray",border=NA)
abline(v=4.5,lty=3)
text(4,0.01,labels=fx.specific.area)

R - Probability - Shape

Symmetry : Draw a vertical line down the center, and it is equally reflected with 0.5 probability.
Skewness : If a distribution is asymmetric, look at the “tail” of a distribution. Positive or right skew indicates a tail extending longer to the right of center.
Modality : Describes the number of easily identifiable peaks in the distribution of interest. Unimodal, bimodal, and trimodal…
Kurtosis : Measure of the “tailedness” of the probability distribution. Positive kurtosis indicates a distribution where more of the values are located in the tails

Kurtosis

Skewness

table(station_data$temp)

## 
## 19.2 19.5 20.1 20.4 20.5 20.6 20.7 20.8 20.9   21 21.2 21.4 21.6 21.7 21.9 22.1 
##    1    1    1    4    6    4    2    6    2    1    2    3    1    2    1    1 
## 22.2 22.3 22.4 22.5 22.6 22.7 22.8   23 23.1 23.2 23.6 23.8 23.9 24.2 25.1 25.4 
##    5    3    3    8    7    1    1    2    1    3    1    2    2    2    1    1 
## 25.5 25.6 25.8 26.1 26.2 26.6 26.9 27.1 27.4 27.6 27.8   28 28.4 28.5 28.8   29 
##    1    1    1    1    1    1    1    1    1    1    1    1    2    1    1    2 
## 29.2 29.3 29.4 29.5 29.6   30 30.1 30.2 30.4 30.8 30.9   31 31.2 31.5 
##    2    2    1    1    1    1    2    1    4    1    3    1    1    1

df_temp_table <- data.frame(table(station_data$temp))
df_temp_table

##    Var1 Freq
## 1  19.2    1
## 2  19.5    1
## 3  20.1    1
## 4  20.4    4
## 5  20.5    6
## 6  20.6    4
## 7  20.7    2
## 8  20.8    6
## 9  20.9    2
## 10   21    1
## 11 21.2    2
## 12 21.4    3
## 13 21.6    1
## 14 21.7    2
## 15 21.9    1
## 16 22.1    1
## 17 22.2    5
## 18 22.3    3
## 19 22.4    3
## 20 22.5    8
## 21 22.6    7
## 22 22.7    1
## 23 22.8    1
## 24   23    2
## 25 23.1    1
## 26 23.2    3
## 27 23.6    1
## 28 23.8    2
## 29 23.9    2
## 30 24.2    2
## 31 25.1    1
## 32 25.4    1
## 33 25.5    1
## 34 25.6    1
## 35 25.8    1
## 36 26.1    1
## 37 26.2    1
## 38 26.6    1
## 39 26.9    1
## 40 27.1    1
## 41 27.4    1
## 42 27.6    1
## 43 27.8    1
## 44   28    1
## 45 28.4    2
## 46 28.5    1
## 47 28.8    1
## 48   29    2
## 49 29.2    2
## 50 29.3    2
## 51 29.4    1
## 52 29.5    1
## 53 29.6    1
## 54   30    1
## 55 30.1    2
## 56 30.2    1
## 57 30.4    4
## 58 30.8    1
## 59 30.9    3
## 60   31    1
## 61 31.2    1
## 62 31.5    1

barplot(df_temp_table$Freq/121,names.arg=df_temp_table$Var1)

Common Probability Distributions - Discrete

Bernoulli Distribution
Binomial Distribution
Poisson Distribution

Bernoulli

Bernoulli Distribution has only two possible outcomes, such as success or failure.

Bernoulli

Probability of rain : 0.6

x <- 1
p <- 0.6
b_fx <- p^x*((1-p)^(1-x))

barplot(c(1-p,p),names.arg=c(0,1))

Binomial

The binomial distribution, x number of successes in a sequence of n independent experiments. (success = p, failure q = 1 − p)

Binomial

Density, distribution function, quantile function and random generation for the binomial distribution with parameters size and prob.

dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
size is the number of trials.
prob is the probability of success of each trial.

Binomial

Suppose there are twelve questions in a quiz. (size=12)
Each question has two possible answers, and only one of them is correct. ( p = 1/2 = 0.5 )
Find the probability of having four or less correct answers (x = 0:4) (student answer questions randomly)

x <- 0:4  
size <- 12
prob <- 0.5

dbinom(x , size , prob)

## [1] 0.0002441406 0.0029296875 0.0161132813 0.0537109375 0.1208496094

pbinom(4 , size , prob)

## [1] 0.1938477

Binomial

prob <- dbinom(x = 0:12 , size = 12 , prob = 0.5)
plot(prob)

Binomial

Each question has five possible answers ( p = 1/5 = 0.2 )

prob <- dbinom(x = 0:12 , size = 12 , prob = 0.2)
plot(prob)

Poisson

In general, rarely seen event, basically. While n is too big, p is too small.

λp should be interpreted as the “mean number of occurrences”

Poisson

There are three functions associated with Binomial distributions.

dpois(x, lambda)
ppois(q, lambda, lower.tail)
qpois(p, lambda, lower.tail)
rpois(n, lambda)
x : successes in a period
λ : the expected number of events
lower.tail = TRUE for left tail
q vector of quantiles
n number of random values to return
p vector of probabilities

Poisson

Historical average: 4 babies born in this hospital every day.
What is the probability that 6 babies will be born in this hospital tomorrow?

x <- 6
lambpa <- 4

dpois(x, lambpa)

## [1] 0.1041956

Poisson

What is the probability, between 6 and 16 babies in this hospital tomorrow?

x <- 6:16
lambpa <- 4
prob <- dpois(x, lambpa)
plot(x = 5:15, prob)

Common Probability Distributions - Continuous

Uniform Distribution
Normal Distribution

Uniform

The uniform distribution is a simple density function that describes a continuous random variable whose interval of possible values offers no fluctuations in probability.

Uniform

dunif( x, min, max )
punif( q, min, max )
qunif( p, min, max )
runif( n, min, max )
x : a vector of numbers.
p : a vector of probabilities.
n : number of observations.
min, max : lower and upper limits

Uniform

Select ten random numbers between one and three.

runif(n = 10, min = 1, max = 3)

##  [1] 2.821992 2.765888 2.561574 2.424556 2.928520 1.079249 2.050842 1.525587
##  [9] 1.482912 1.915594

r1 <- runif(n = 10, min = 1, max = 3)
table(r1)

## r1
## 1.08275828650221 1.22311208304018 1.29288631817326  1.4663178306073 
##                1                1                1                1 
## 1.68820196110755 1.88678338751197 2.05697202542797 2.19751014048234 
##                1                1                1                1 
## 2.34811947587878 2.48047358682379 
##                1                1

t1 <- table(r1)    # frequence

Uniform

barplot(t1)      # frequence

Uniform

barplot(table(runif(n = 100,1,3)))

Uniform

barplot(table(runif(n = 1000,1,3)))

Normal

Theoretical, characterized by a distinctive “bell-shaped” curve
Aso referred to as the Gaussian distribution
Mean = Median = Mode, Symmetric, Skewness = 0, Kurtosis = 3
50% of values less than the mean and 50% greater than the mean

Normal

Standart Normal

The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1.

Normal

dnorm( x, mean, sd )
pnorm( q, mean, sd )
qnorm( p, mean, sd )
rnorm( n, mean, sd )
x, q : vector of quantiles.
p : vector of probabilities.
n : number of observations.
mean : vector of means.
sd : vector of standard deviations.

Normal

x <- seq(-15, 15, by=1)
x

##  [1] -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
## [20]   4   5   6   7   8   9  10  11  12  13  14  15

Normal

plot(x)

Normal

hist(x)

Normal

barplot(x)

Normal

barplot(table(x))

Normal

dn <- dnorm(x, mean(x), sd(x)) 
dn

##  [1] 0.01125172 0.01340898 0.01578770 0.01836489 0.02110592 0.02396441
##  [7] 0.02688287 0.02979414 0.03262365 0.03529236 0.03772032 0.03983055
## [13] 0.04155314 0.04282898 0.04361321 0.04387780 0.04361321 0.04282898
## [19] 0.04155314 0.03983055 0.03772032 0.03529236 0.03262365 0.02979414
## [25] 0.02688287 0.02396441 0.02110592 0.01836489 0.01578770 0.01340898
## [31] 0.01125172

Normal

plot(x, dn)