Software Tools for Earth and Environmental Science

– 12th WEEK –

Emir Toker

08/01/2021

R Probability

  • Syllabus and Book
  • R Statistics - Repeat
  • R - Probability
  • Next Week

Syllabus and Book

Syllabus

Book

LINK

R - Elemantary Statistics

R - Elemantary Statistics

R - Elemantary Statistics

Centrality: Mean, Median, Mode

  • Measures of centrality are commonly used to explain large collections of data by describing where numeric observations are centered.

R - Elemantary Statistics

Centrality: Mean, Median, Mode

  • Median : “middle magnitude” of your observations

0 . 0 . 0 . 0 . 0

o . o . o . o . o . o

R - Elemantary Statistics

Centrality: Mean, Median, Mode

  • Mode : Simply the “most common” observation.

Sample : 2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4

2 , 2 , 2 , 2.2 , 3 , 3 , 4, 4.4 ( n=8 , n/2 = 4)

R - Elemantary Statistics

Centrality: Mean, Median, Mode

xdata <- c(2,4.4,3,3,2,2.2,2,4)

  • mean(xdata)
  • median(xdata)
  • min(xdata)
  • max(xdata)
  • range(xdata)

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

  • A quantile is a value indicates an observation rank when compared to all the other present observations.
  • For example, the median is itself a quantile. It’s the 0.5th quantile.
  • Alternatively, quantiles can be expressed as a percentile.

The median = the 0.5th quantile = The 50th percentile

Sample : 2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4

2 , 2 , 2 , 2.2 , 3 , 3 , 4, 4.4

0.5th quantile = median = 2.6

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

xdata <- c(2,4.4,3,3,2,2.2,2,4)
quantile(xdata,prob=0.8) # the 0.8th quan- tile (or 80th percentile)
## 80% 
## 3.6
quantile(xdata,prob=c(0,0.25,0.5,0.75,1))
##   0%  25%  50%  75% 100% 
## 2.00 2.00 2.60 3.25 4.40
summary(xdata) # A quartile is a type of quantile.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.600   2.825   3.250   4.400

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

A quartile is a type of quantile.

R - Elemantary Statistics

Quantiles , Percentiles, and the Five-Number Summary

xdata <- c(2,4.4,3,3,2,2.2,2,4)

  • quantile(xdata)
  • summary(xdata)

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

  • How dispersed your data are. For this, measures of spread are needed.
xdata <- c(2,4.4,3,3,2,2.2,2,4)
ydata <- c(1,4.4,1,3,2,2.2,2,7)

mean(xdata)
## [1] 2.825
mean(ydata)
## [1] 2.825

R - Elemantary Statistics

plot(xdata,type="n",xlab="values",ylab="data vector",yaxt="n",bty="n", xlim = c(0,7)  )
abline(h=c(2.5,3),lty=2,col="red")
abline(v=2.825,lwd=2,lty=3)
text(c(0.0,0.0),c(2.5,3),labels=c("x","y"))
text(c(2.825),c(4),labels=c("mean"))
points(jitter(c(xdata,ydata)),c(rep(2.5,length(xdata)), rep(3,length(ydata))))

the observations in ydata are more “spread out”

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

  • The sample variance measures the degree of the spread of numeric observations around their arithmetic mean.

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4 ( mean = 2.825)

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

  • The standard deviation is simply the square root of the variance. The scale of the original observations.

0.953 represents the average distance of each observation from the mean

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range,

  • Unlike the variance and standard deviation, the interquartile range (IQR) is not computed with respect to the sample mean.

  • IQR is computed as the difference between the upper and lower quartiles of your data

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

xdata <- c(2,4.4,3,3,2,2.2,2,4)

var(xdata)
## [1] 0.9078571
sd(xdata)
## [1] 0.9528154
IQR(xdata)
## [1] 1.25

R - Elemantary Statistics

R - Elemantary Statistics

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

xdata <- c(2,4.4,3,3,2,2.2,2,4)

  • var()
  • sd()
  • IQR()

R - Elemantary Statistics

Covariance and Correlation

  • Investigate the relationship between two numeric variables to assess trends
  • The covariance expresses how much two numeric variables “change together” and the nature of that relationship, whether it is positive or negative.

R - Elemantary Statistics

Covariance and Correlation

x = {x1,x2,…,xn}

y = {y1,y2,…,yn}

for i = 1,. . . ,n

When you get a positive result for rxy, it shows that there is a positive lin- ear relationship. When rxy = 0, this indicates that there is no linear relationship.

R - Elemantary Statistics

Covariance and Correlation

x = {2,4.4,3,3,2,2.2,2,4}

y = {1,4.4,1,3,2,2.2,2,7}

mean x and y = 2.825

positive relationship

R - Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )

R Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )
abline(lm(y~x), col="blue")

R - Elemantary Statistics

Covariance and Correlation

  • Correlation allows you to interpret the covariance further by identifying the strength of any association.

R - Elemantary Statistics

Covariance and Correlation

  • Most common of these is Pearson’s product-moment correlation coefficient. (R default)

  • The correlation coefficient estimates the nature of the linear relationship between two sets of observations

−1 ≤ ρxy ≤ 1

ρxy = 1, which is a perfect positive linear relationship

R - Elemantary Statistics

Covariance and Correlation

x = {2,4.4,3,3,2,2.2,2,4}

y = {1,4.4,1,3,2,2.2,2,7}

(mean x and y = 2.825)

(sx = 0.953 and sy = 2.013)

(rxy = 1.479)

ρxy is positive

R - Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )
abline(lm(y~x), col="blue")

R - Elemantary Statistics

Covariance and Correlation

xdata <- c(2,4.4,3,3,2,2.2,2,4)
ydata <- c(1,4.4,1,3,2,2.2,2,7)

cov(xdata,ydata)
## [1] 1.479286
cor(xdata,ydata)
## [1] 0.7713962

R - Elemantary Statistics

Covariance and Correlation

  • cov( , )
  • cor( , )

R - Elemantary Statistics

Basic Data Visualization

Basic Data Visualization

Barplots

station_data <- read.csv("https://web.itu.edu.tr/~tokerem/18397_Cekmekoy_Omerli_15min.txt", header=T, sep = ";")

head(station_data)
##   sta_no year month day hour minutes temp precipitation pressure
## 1  18397 2017     7  26   18       0 23.9             0   1003.0
## 2  18397 2017     7  26   18      15 23.9             0   1003.1
## 3  18397 2017     7  26   18      30 23.8             0   1003.2
## 4  18397 2017     7  26   18      45 23.8             0   1003.2
## 5  18397 2017     7  26   19       0 23.6             0   1003.2
## 6  18397 2017     7  26   19      15 23.2             0   1003.1
##   relative_humidity
## 1                94
## 2                95
## 3                96
## 4                96
## 5                96
## 6                97

Basic Data Visualization

Barplots

head(station_data$temp)
## [1] 23.9 23.9 23.8 23.8 23.6 23.2
barplot(station_data$temp)

Basic Data Visualization

Barplots

station_data$temp
##   [1] 23.9 23.9 23.8 23.8 23.6 23.2 23.2 23.1 23.0 22.8 22.5 22.4 22.2 22.3 22.2
##  [16] 21.7 21.9 21.7 21.6 22.2 22.2 22.1 22.3 22.5 22.3 22.2 22.5 22.6 22.6 22.6
##  [31] 22.6 22.7 22.6 22.5 22.6 22.5 22.5 22.4 22.5 22.4 22.5 22.6 23.0 23.2 24.2
##  [46] 25.1 25.5 26.1 27.1 26.9 27.6 28.0 28.4 28.5 29.3 30.2 30.1 30.1 30.4 30.4
##  [61] 30.8 30.9 31.0 31.5 31.2 30.9 30.9 30.4 30.4 30.0 29.2 29.5 29.4 29.3 29.6
##  [76] 28.8 29.0 29.0 29.2 28.4 27.8 27.4 26.6 26.2 25.8 25.6 25.4 24.2 19.2 19.5
##  [91] 20.1 20.8 21.2 21.4 21.4 21.4 21.2 21.0 20.8 20.9 20.8 20.7 20.8 20.8 20.9
## [106] 20.6 20.6 20.5 20.7 20.8 20.4 20.4 20.6 20.5 20.4 20.5 20.5 20.6 20.5 20.5
## [121] 20.4

Basic Data Visualization

Barplots

table(station_data$temp)
## 
## 19.2 19.5 20.1 20.4 20.5 20.6 20.7 20.8 20.9   21 21.2 21.4 21.6 21.7 21.9 22.1 
##    1    1    1    4    6    4    2    6    2    1    2    3    1    2    1    1 
## 22.2 22.3 22.4 22.5 22.6 22.7 22.8   23 23.1 23.2 23.6 23.8 23.9 24.2 25.1 25.4 
##    5    3    3    8    7    1    1    2    1    3    1    2    2    2    1    1 
## 25.5 25.6 25.8 26.1 26.2 26.6 26.9 27.1 27.4 27.6 27.8   28 28.4 28.5 28.8   29 
##    1    1    1    1    1    1    1    1    1    1    1    1    2    1    1    2 
## 29.2 29.3 29.4 29.5 29.6   30 30.1 30.2 30.4 30.8 30.9   31 31.2 31.5 
##    2    2    1    1    1    1    2    1    4    1    3    1    1    1
f_temp <- table(station_data$temp)

Basic Data Visualization

Barplots

barplot(f_temp)

Basic Data Visualization

Histogram

hist(station_data$temp)

Basic Data Visualization

Histogram

library(ggplot2)
qplot(station_data$temp,geom="blank",main="Temp Hist",xlab="Temp")+ 
  geom_histogram(color="black",fill="white",breaks=seq(19,32,1),closed="right") + 
  geom_vline(mapping=aes(xintercept=c(mean(station_data$tem), median(station_data$tem)), linetype=factor(c("mean","median"))) , col=c("blue","red"),show.legend=TRUE)+ 
  scale_linetype_manual(values=c(2,3)) + 
  labs(linetype="")

Basic Data Visualization

Boxplot

boxplot(station_data$temp)

mean(station_data$temp)
## [1] 24.24132
median(station_data$temp)
## [1] 22.6
quantile(station_data$temp)
##   0%  25%  50%  75% 100% 
## 19.2 21.4 22.6 27.6 31.5
# summary(station_data$temp)

Basic Data Visualization

Boxplot

Basic Data Visualization

Histogram and Boxplot

hist(station_data$temp)
abline(v=median(station_data$temp),col="red")

boxplot(station_data$temp,horizontal = T)
abline(v=median(station_data$temp),col="red")

Basic Data Visualization

Scatter Plots

plot(station_data$temp,station_data$relative_humidity)

R - Probability

R - Probability

  • Probability
  • Qualitative and Quantitative Data
  • Discrete and Continuous Data
  • Probability ‘Density vs Mass’ Function (PDF vs PMF)
  • Cumulative Distribution Function (CDF)
  • Shape of PDF : Symmetry, Skewness, Modality, Kurtosis
  • Common Probability Distributions - Discrete
  • Common Probability Distributions - Continuous
  • Practice : Write A Function for outliers

Probability

Probability

A probability is a number that describes the “magnitude of chance” associated with making a particular observation or statement.

? what is the probability of rolling a 3 with this black dice

It’s always a number between 0 and 1 (inclusive) and is often expressed as a fraction.

Qualitative and Quantitative Data

Quantitative data can be counted, measured, and expressed using numbers. Qualitative data is descriptive and conceptual.

Discrete and Continuous Data

Discrete data is information that can only take certain values. Continuous data is data that can take any value

Probability ‘Mass vs Density’ Function (PDF vs PMF)

Probability mass and density functions are used to describe discrete and continuous probability distributions, respectively.

Cumulative Distribution Function

CDF describes the probability (with a given probability distribution) at less than or equal to x.

Dice Example - White and Black Dice

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability mass function")

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
X.cumul <- cumsum(X.prob)
barplot(X.cumul,names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X <= x)", main = "cumulative distribution function")

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability mass function")
abline(v=c(0.5:10.5))

PDF - Probability Density Function

lower < 7 < upper

X >= 2  &  X <= 7
(X[lower] - 1)/36

X > 7 & X <= 12
13 - X[upper])/36

X.outcomes <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)
lower <- X.outcomes >= 2 & X.outcomes <= 7
upper <- X.outcomes > 7 & X.outcomes <= 12

fx <- rep(0,length(X.outcomes))
fx[lower] <- (X.outcomes[lower] - 1)/36
fx[upper] <- (13 - X.outcomes[upper])/36

plot(X.outcomes,fx,type="l",ylab="f(x)", xlim = c(0,14), main = "probability density function")
abline(h=0,col="gray",lty=2)

fx.specific <- (4.5-1)/36
fx.specific.area <- 3.5*fx.specific*0.5
fx.specific.vertices <- rbind(c(1,0),c(4.5,0),c(4.5,fx.specific))

plot(X.outcomes,fx,type="l",ylab="f(x)", xlim = c(0,14), main = "probability density function")
abline(h=0,col="gray",lty=2)
polygon(fx.specific.vertices,col="gray",border=NA)
abline(v=4.5,lty=3)
text(4,0.01,labels=fx.specific.area)

R - Probability - Shape

  • Symmetry : Draw a vertical line down the center, and it is equally reflected with 0.5 probability.

  • Skewness : If a distribution is asymmetric, look at the “tail” of a distribution. Positive or right skew indicates a tail extending longer to the right of center.

  • Modality : Describes the number of easily identifiable peaks in the distribution of interest. Unimodal, bimodal, and trimodal…

  • Kurtosis : Measure of the “tailedness” of the probability distribution. Positive kurtosis indicates a distribution where more of the values are located in the tails

Kurtosis

Skewness

table(station_data$temp)
## 
## 19.2 19.5 20.1 20.4 20.5 20.6 20.7 20.8 20.9   21 21.2 21.4 21.6 21.7 21.9 22.1 
##    1    1    1    4    6    4    2    6    2    1    2    3    1    2    1    1 
## 22.2 22.3 22.4 22.5 22.6 22.7 22.8   23 23.1 23.2 23.6 23.8 23.9 24.2 25.1 25.4 
##    5    3    3    8    7    1    1    2    1    3    1    2    2    2    1    1 
## 25.5 25.6 25.8 26.1 26.2 26.6 26.9 27.1 27.4 27.6 27.8   28 28.4 28.5 28.8   29 
##    1    1    1    1    1    1    1    1    1    1    1    1    2    1    1    2 
## 29.2 29.3 29.4 29.5 29.6   30 30.1 30.2 30.4 30.8 30.9   31 31.2 31.5 
##    2    2    1    1    1    1    2    1    4    1    3    1    1    1
df_temp_table <- data.frame(table(station_data$temp))
df_temp_table
##    Var1 Freq
## 1  19.2    1
## 2  19.5    1
## 3  20.1    1
## 4  20.4    4
## 5  20.5    6
## 6  20.6    4
## 7  20.7    2
## 8  20.8    6
## 9  20.9    2
## 10   21    1
## 11 21.2    2
## 12 21.4    3
## 13 21.6    1
## 14 21.7    2
## 15 21.9    1
## 16 22.1    1
## 17 22.2    5
## 18 22.3    3
## 19 22.4    3
## 20 22.5    8
## 21 22.6    7
## 22 22.7    1
## 23 22.8    1
## 24   23    2
## 25 23.1    1
## 26 23.2    3
## 27 23.6    1
## 28 23.8    2
## 29 23.9    2
## 30 24.2    2
## 31 25.1    1
## 32 25.4    1
## 33 25.5    1
## 34 25.6    1
## 35 25.8    1
## 36 26.1    1
## 37 26.2    1
## 38 26.6    1
## 39 26.9    1
## 40 27.1    1
## 41 27.4    1
## 42 27.6    1
## 43 27.8    1
## 44   28    1
## 45 28.4    2
## 46 28.5    1
## 47 28.8    1
## 48   29    2
## 49 29.2    2
## 50 29.3    2
## 51 29.4    1
## 52 29.5    1
## 53 29.6    1
## 54   30    1
## 55 30.1    2
## 56 30.2    1
## 57 30.4    4
## 58 30.8    1
## 59 30.9    3
## 60   31    1
## 61 31.2    1
## 62 31.5    1

barplot(df_temp_table$Freq/121,names.arg=df_temp_table$Var1)

Common Probability Distributions - Discrete

Common Probability Distributions - Discrete

  • Bernoulli Distribution
  • Binomial Distribution
  • Poisson Distribution

Bernoulli

  • Bernoulli Distribution has only two possible outcomes, such as success or failure.

Bernoulli

  • Probability of rain : 0.6
x <- 1
p <- 0.6
b_fx <- p^x*((1-p)^(1-x))

barplot(c(1-p,p),names.arg=c(0,1))

Binomial

  • The binomial distribution, x number of successes in a sequence of n independent experiments. (success = p, failure q = 1 − p)

Binomial

Density, distribution function, quantile function and random generation for the binomial distribution with parameters size and prob.

  • dbinom(x, size, prob)
  • pbinom(x, size, prob)
  • qbinom(p, size, prob)
  • rbinom(n, size, prob)
  • x is a vector of numbers.
  • p is a vector of probabilities.
  • n is number of observations.
  • size is the number of trials.
  • prob is the probability of success of each trial.

Binomial

  • Suppose there are twelve questions in a quiz. (size=12)
  • Each question has two possible answers, and only one of them is correct. ( p = 1/2 = 0.5 )
  • Find the probability of having four or less correct answers (x = 0:4) (student answer questions randomly)
x <- 0:4  
size <- 12
prob <- 0.5

dbinom(x , size , prob)
## [1] 0.0002441406 0.0029296875 0.0161132813 0.0537109375 0.1208496094
pbinom(4 , size , prob)
## [1] 0.1938477

Binomial

prob <- dbinom(x = 0:12 , size = 12 , prob = 0.5)
plot(prob)

Binomial

  • Each question has five possible answers ( p = 1/5 = 0.2 )
prob <- dbinom(x = 0:12 , size = 12 , prob = 0.2)
plot(prob)

Poisson

  • In general, rarely seen event, basically. While n is too big, p is too small.

λp should be interpreted as the “mean number of occurrences”

Poisson

There are three functions associated with Binomial distributions.

  • dpois(x, lambda)
  • ppois(q, lambda, lower.tail)
  • qpois(p, lambda, lower.tail)
  • rpois(n, lambda)
  • x : successes in a period
  • λ : the expected number of events
  • lower.tail = TRUE for left tail
  • q vector of quantiles
  • n number of random values to return
  • p vector of probabilities

Poisson

  • Historical average: 4 babies born in this hospital every day.
  • What is the probability that 6 babies will be born in this hospital tomorrow?
x <- 6
lambpa <- 4

dpois(x, lambpa)
## [1] 0.1041956

Poisson

  • What is the probability, between 6 and 16 babies in this hospital tomorrow?
x <- 6:16
lambpa <- 4
prob <- dpois(x, lambpa)
plot(x = 5:15, prob)

Common Probability Distributions - Continuous

Common Probability Distributions - Continuous

  • Uniform Distribution
  • Normal Distribution

Uniform

The uniform distribution is a simple density function that describes a continuous random variable whose interval of possible values offers no fluctuations in probability.

Uniform

Uniform

  • dunif( x, min, max )
  • punif( q, min, max )
  • qunif( p, min, max )
  • runif( n, min, max )
  • x : a vector of numbers.
  • p : a vector of probabilities.
  • n : number of observations.
  • min, max : lower and upper limits

Uniform

  • Select ten random numbers between one and three.
runif(n = 10, min = 1, max = 3)
##  [1] 2.821992 2.765888 2.561574 2.424556 2.928520 1.079249 2.050842 1.525587
##  [9] 1.482912 1.915594
r1 <- runif(n = 10, min = 1, max = 3)
table(r1)
## r1
## 1.08275828650221 1.22311208304018 1.29288631817326  1.4663178306073 
##                1                1                1                1 
## 1.68820196110755 1.88678338751197 2.05697202542797 2.19751014048234 
##                1                1                1                1 
## 2.34811947587878 2.48047358682379 
##                1                1
t1 <- table(r1)    # frequence

Uniform

barplot(t1)      # frequence

Uniform

barplot(table(runif(n = 100,1,3)))

Uniform

barplot(table(runif(n = 1000,1,3)))

Normal

  • Theoretical, characterized by a distinctive “bell-shaped” curve
  • Aso referred to as the Gaussian distribution
  • Mean = Median = Mode, Symmetric, Skewness = 0, Kurtosis = 3
  • 50% of values less than the mean and 50% greater than the mean

Normal

Normal

Standart Normal

The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1.

Normal

  • dnorm( x, mean, sd )
  • pnorm( q, mean, sd )
  • qnorm( p, mean, sd )
  • rnorm( n, mean, sd )
  • x, q : vector of quantiles.
  • p : vector of probabilities.
  • n : number of observations.
  • mean : vector of means.
  • sd : vector of standard deviations.

Normal

x <- seq(-15, 15, by=1)
x
##  [1] -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
## [20]   4   5   6   7   8   9  10  11  12  13  14  15

Normal

plot(x)

Normal

hist(x)

`

Normal

barplot(x)

Normal

barplot(table(x))

Normal

dn <- dnorm(x, mean(x), sd(x)) 
dn
##  [1] 0.01125172 0.01340898 0.01578770 0.01836489 0.02110592 0.02396441
##  [7] 0.02688287 0.02979414 0.03262365 0.03529236 0.03772032 0.03983055
## [13] 0.04155314 0.04282898 0.04361321 0.04387780 0.04361321 0.04282898
## [19] 0.04155314 0.03983055 0.03772032 0.03529236 0.03262365 0.02979414
## [25] 0.02688287 0.02396441 0.02110592 0.01836489 0.01578770 0.01340898
## [31] 0.01125172

Normal

plot(x, dn) 

Next Week

Next Week