[Compilation] Elementary Statistical Analysis

source: based on lectures from the YouTube channel by datasciencelim
→ R programming

Required packages

library(tidyverse)

Normal Distribution

\[ \text{pdf of normal distribution}:\\ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \]

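1.1 dnorm → input: x value / output: density f(x) (pdf) of the normal distribution
1.2 qnorm → input: probability p (i.e. P(X ≤ k) = p) / output: the cutoff point k
1.3 pnorm → input: cutoff point k / output: probability P(X ≤ k) (the inverse of qnorm)
1.4 rnorm → input: number of draws n / output: n random values of X drawn from the normal distribution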
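The four functions can be tried side by side. The following is a minimal sketch using base R only; the mean of 80, the sd of 10, the cutoff of 90 and the seed are arbitrary values chosen for illustration.

```r
# Assumed parameters for illustration only
mu <- 80
sigma <- 10

# dnorm: density at the mean, i.e. the peak of the curve, 1 / (sigma * sqrt(2 * pi))
d_peak <- dnorm(mu, mean = mu, sd = sigma)

# pnorm: cumulative probability P(X <= 90)
p_90 <- pnorm(90, mean = mu, sd = sigma)

# qnorm: inverse of pnorm, so it maps that probability back to the cutoff 90
q_back <- qnorm(p_90, mean = mu, sd = sigma)

# rnorm: five random draws of X (values of x, not densities)
set.seed(1)  # arbitrary seed so the draws are reproducible
draws <- rnorm(5, mean = mu, sd = sigma)
```

Feeding the output of pnorm into qnorm (or the other way round) is a quick way to check which one takes a cutoff and which one takes a probability.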

Normal distribution visualisation

#probability density function of normal distribution with mean of 80 and sd of 10
x <- seq(40, 120, length = 300) #300 evenly spaced points from 40 to 120
y <- dnorm(x, mean = 80, sd = 10)
plot(x, y, type = "l") #type = "l": make it a line, not scatterplot

pnorm example: calculate probability

#visualise the region we are interested in
plot(x, y, type = "l")
x2 <- seq(60, 120, length = 300)
y2 <- dnorm(x2, mean = 80, sd = 10)
polygon(c(60, x2, 120), c(0, y2, 0), col = "grey")

pnorm(60, mean = 80, sd = 10) #P(X<60)

## [1] 0.02275013

1 - pnorm(60, mean = 80, sd = 10) #P(X>60) --> answer

## [1] 0.9772499
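The same upper-tail probability can be obtained without subtracting from 1. A small sketch, assuming the same mean of 80 and sd of 10 as above; it uses the `lower.tail` argument of pnorm and, as a cross-check, the standardised value z = (60 - 80) / 10.

```r
# P(X > 60) computed three equivalent ways
upper_a <- 1 - pnorm(60, mean = 80, sd = 10)                   # complement of the lower tail
upper_b <- pnorm(60, mean = 80, sd = 10, lower.tail = FALSE)   # upper tail directly

# Standardise first, then use the standard normal (mean 0, sd 1)
z <- (60 - 80) / 10                                            # z = -2
upper_z <- pnorm(z, lower.tail = FALSE)
```

`lower.tail = FALSE` tends to be the safer choice for very small tail probabilities, because it avoids the rounding that comes with computing 1 minus a number close to 1.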

qnorm example

#visualise the region we are interested in
plot(x, y, type = "l")
x3 <- seq(40, 74.76, length = 300)
y3 <- dnorm(x3, mean = 80, sd = 10)
polygon(c(40, x3, 74.76), c(0, y3, 0), col = "dark green") #P(X < 74.76) = 0.3

#confirm
qnorm(0.3, mean = 80, sd = 10) #should return about 74.76 --> confirmed

## [1] 74.75599

The d, p, q, r prefixes work the same way for other distributions as well (e.g. F-distribution → df, pf, qf, rf).
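As a quick illustration of the shared prefixes, the sketch below repeats the quantile/probability round trip for the F- and t-distributions. The degrees of freedom (3 and 20 for F, 10 for t) are arbitrary choices, not values from the lectures.

```r
# F-distribution with assumed degrees of freedom 3 and 20
f_crit <- qf(0.95, df1 = 3, df2 = 20)     # 95th percentile
f_prob <- pf(f_crit, df1 = 3, df2 = 20)   # maps back to 0.95

# Same pattern for Student's t with an assumed 10 degrees of freedom
t_crit <- qt(0.975, df = 10)              # 97.5th percentile
t_prob <- pt(t_crit, df = 10)             # maps back to 0.975
```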

Chi-Square distribution / test

\[ \text{pdf of }\chi^2\text{-distribution}\\ \begin{align} &f(x) = \frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2}, \quad x > 0 \\ &\text{where } \Gamma \text{ is the gamma function and } k \text{ is the degrees of freedom} \end{align} \]
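The formula can be checked against R's built-in dchisq. In the sketch below, `chisq_pdf` is a hypothetical helper written only for this comparison, and the x values and k = 5 are arbitrary.

```r
# Hypothetical helper: the chi-square pdf written out from the formula
chisq_pdf <- function(x, k) {
  x^(k / 2 - 1) * exp(-x / 2) / (2^(k / 2) * gamma(k / 2))
}

x_vals <- c(0.5, 1, 5, 12.3)         # arbitrary positive test points
k <- 5                               # assumed degrees of freedom

manual  <- chisq_pdf(x_vals, k)      # formula by hand
builtin <- dchisq(x_vals, df = k)    # R's implementation

same <- isTRUE(all.equal(manual, builtin))  # TRUE when both agree up to rounding
```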

Chi-square distribution visualisation

code source: https://rpubs.com/mpfoley73/460935

#Chi-square distribution visualisation
data.frame(chisq = 0:7000 / 100) %>% #x values from 0.00 to 70.00 (by 0.01); this becomes the first column
  mutate(df_05 = dchisq(x = chisq, df = 5),
         df_15 = dchisq(x = chisq, df = 15),
         df_30 = dchisq(x = chisq, df = 30)) %>%
  gather(key = "df", value = "density", -chisq) %>% #reshape to long format; -chisq keeps the first column as it is
  ggplot() +
  geom_line(aes(x = chisq, y = density, color = df)) +
  labs(title = "Chi-Square at Various Degrees of Freedom",
       x = "Chi-square",
       y = "Density") +
  theme_classic() #white background

Chi-square test example: Independence test

data <- matrix(c(42, 30, 50, 87), nrow = 2, byrow = FALSE)
data #Row1: Male / Row2: Female

##      [,1] [,2]
## [1,]   42   50
## [2,]   30   87

\[ \begin{align} &\text{H0: the row variable (sex) and the column variable are independent} \\ &\text{H1: the two variables are dependent} \\ &(\alpha = 0.05) \end{align} \]

chisq.test(data, correct = FALSE) #correct = FALSE: no Yates continuity correction

## 
##  Pearson's Chi-squared test
## 
## data:  data
## X-squared = 9.1329, df = 1, p-value = 0.00251

#df: degrees of freedom = (rows - 1) * (columns - 1) = 1
#p-value: below 0.05 --> reject the null hypothesis of independence
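To see where the statistic comes from, the test can be reproduced by hand: expected counts under independence are (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. A minimal sketch using the same 2×2 table; the variable names are my own.

```r
observed <- matrix(c(42, 30, 50, 87), nrow = 2)  # same table as above

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic without continuity correction
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value from the chi-square distribution
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(stat, df = dof, lower.tail = FALSE)
```

`stat` and `p_value` should agree with the X-squared and p-value reported by chisq.test with `correct = FALSE`.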

Binomial Distribution

\[ \begin{align} &\text{pmf of binomial distribution}\\ &f(k) = {n \choose k} p^{k} (1-p)^{n-k} \end{align} \]

Binomial distribution visualisation

code source: p. 308, The R book (changed some numbers)

#coin flipping: the number of heads
p <- 0.5 #prob of head
n <- 100 #the number of trials
x <- 0:n
px <- choose(n, x) * p^x * (1 - p)^(n - x) #pmf, written out from the formula
barplot(px, names = x, xlab = "outcome", ylab = "probability", col = "dark green")

dbinom example: visualisation revisited

y <- dbinom(x, 100, 0.5)
plot(x, y, type = "h") #type = "h": vertical lines, one per outcome

pbinom example

#probability of obtaining less than or equal to 40 heads
pbinom(40, 100, 0.5)

## [1] 0.02844397
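Since the binomial distribution is discrete, pbinom is just a running sum of dbinom. The sketch below checks this for the same coin-flipping setup and also shows how "fewer than 40" differs from "40 or fewer"; the variable names are illustrative.

```r
# P(X <= 40) as a sum of point probabilities P(X = 0) + ... + P(X = 40)
cum_manual  <- sum(dbinom(0:40, size = 100, prob = 0.5))
cum_builtin <- pbinom(40, size = 100, prob = 0.5)

# Strictly fewer than 40 heads means X <= 39
fewer_than_40 <- pbinom(39, size = 100, prob = 0.5)

# The two differ by exactly P(X = 40)
point_40 <- cum_builtin - fewer_than_40
```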

qbinom example

#smallest k such that P(X <= k) >= 0.2844397
qbinom(0.2844397, 100, 0.5)

## [1] 47
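For a discrete distribution, qbinom cannot hit every probability exactly, so it returns the smallest k whose cumulative probability reaches the target. A short sketch that verifies this for the value used above; `p_target` and `k` are names chosen for illustration.

```r
p_target <- 0.2844397
k <- qbinom(p_target, size = 100, prob = 0.5)

# k is the first point where the cdf reaches the target:
below <- pbinom(k - 1, size = 100, prob = 0.5)  # still under p_target
at_k  <- pbinom(k, size = 100, prob = 0.5)      # at or above p_target

reaches_target <- below < p_target && at_k >= p_target
```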

Hypergeometric Distribution

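\[ \begin{align} &\text{pmf of hypergeometric distribution}\\ &f(x) = \frac{{b \choose x}{N-b \choose n-x}}{{N \choose n}} \\ &\text{where } N \text{: population size, } b \text{: successes in the population, } n \text{: number of draws} \end{align} \]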

Think of it as the binomial distribution with sampling done without replacement (the replace = FALSE version).
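The formula can be written out with choose and compared with R's dhyper, whose arguments are `x`, `m` (successes in the population), `n` (failures in the population) and `k` (number of draws). The numbers below (30 items, 12 successes, 10 draws, 3 successes drawn) and the variable names are assumptions for illustration.

```r
pop_size  <- 30   # N: population size
pop_succ  <- 12   # b: successes in the population
n_draws   <- 10   # n: items drawn without replacement
x_succ    <- 3    # x: successes among the drawn items

# pmf written out from the formula
manual <- choose(pop_succ, x_succ) *
  choose(pop_size - pop_succ, n_draws - x_succ) /
  choose(pop_size, n_draws)

# R's implementation; note that dhyper's n is the number of failures, N - b
builtin <- dhyper(x_succ, m = pop_succ, n = pop_size - pop_succ, k = n_draws)
```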

Binomial approximation of hypergeometric distribution

# dbinom(x, n, p)
# x: number of successes
# n: number of trials
# p: probability of success in each trial
round(dbinom(3, 10, 0.4), 4)

## [1] 0.215

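# dhyper(x, t, f, n)  (R itself names these arguments x, m, n, k)
# x: number of successes drawn
# t: targets (successes) in the population
# f: the rest (failures) in the population
# --> population total (t + f)
# n: number of draws
round(dhyper(3, 12, 18, 10), 4)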

## [1] 0.233

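#Approximation: keep the share of targets at 0.4 while the population grows
hyper_values <- function(a){
  app <- numeric(length = a)

  for (i in 1:a) {
    app[i] <- dhyper(3, 4*i, 6*i, 10)      #population of 10*i items with 4*i targets
  }
  return(app[-1])                          #drop i = 1: all 10 items are drawn there, so P(X = 3) = 0
}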

hyper_values(100)

##  [1] 0.2400572 0.2330263 0.2286508 0.2259296 0.2240987 0.2227880 0.2218050
##  [8] 0.2210411 0.2204307 0.2199319 0.2195167 0.2191658 0.2188653 0.2186050
## [15] 0.2183776 0.2181770 0.2179988 0.2178395 0.2176962 0.2175666 0.2174489
## [22] 0.2173415 0.2172430 0.2171525 0.2170689 0.2169916 0.2169198 0.2168530
## [29] 0.2167907 0.2167324 0.2166777 0.2166264 0.2165781 0.2165326 0.2164896
## [36] 0.2164489 0.2164104 0.2163739 0.2163392 0.2163062 0.2162748 0.2162449
## [43] 0.2162163 0.2161890 0.2161628 0.2161378 0.2161139 0.2160909 0.2160688
## [50] 0.2160476 0.2160273 0.2160076 0.2159888 0.2159706 0.2159530 0.2159361
## [57] 0.2159198 0.2159040 0.2158887 0.2158740 0.2158597 0.2158459 0.2158325
## [64] 0.2158195 0.2158070 0.2157947 0.2157829 0.2157714 0.2157602 0.2157494
## [71] 0.2157388 0.2157285 0.2157186 0.2157088 0.2156994 0.2156902 0.2156812
## [78] 0.2156724 0.2156639 0.2156556 0.2156474 0.2156395 0.2156318 0.2156242
## [85] 0.2156169 0.2156096 0.2156026 0.2155957 0.2155890 0.2155824 0.2155760
## [92] 0.2155697 0.2155635 0.2155575 0.2155516 0.2155458 0.2155401 0.2155345
## [99] 0.2155291

hyper_values(100)[99] #Compare this to the binomial value below

## [1] 0.2155291

dbinom(3, 10, 0.4) #close to the hypergeometric value above: the approximation works

## [1] 0.2149908
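The convergence can also be shown without a loop, since dhyper is vectorised over its arguments. The sketch below keeps the share of targets at 0.4 and measures the gap to the binomial value as the population grows; the scale factors are arbitrary.

```r
scale <- c(2, 10, 100, 1000)  # population sizes of 20, 100, 1000 and 10000

# Hypergeometric P(X = 3) in 10 draws, with 40% targets in every population
hyper <- dhyper(3, m = 4 * scale, n = 6 * scale, k = 10)

# Absolute gap to the binomial probability with p = 0.4
gap <- abs(hyper - dbinom(3, 10, 0.4))

shrinking <- all(diff(gap) < 0)  # TRUE when each larger population is closer
```

The intuition: when 10 draws are a tiny fraction of the population, removing an item barely changes the share of targets, so sampling without replacement behaves almost like sampling with replacement.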