Collection of R Code from My Time at ISRT

Author

Mutasim Billah

AST 230

Incourse Solutions

2020

1st Incourse

Question no. 01
rm(list = ls())

#a
samp<-rpois(n = 500,lambda = 2)
summary(samp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.068   3.000   7.000 
min(samp)
[1] 0
max(samp)
[1] 7
mean(samp)
[1] 2.068
var(samp)
[1] 1.983343
#b
hist(samp)

Question no. 02
samp2<-rnorm(n = 16, mean = 5,sd = sqrt(2))

#a
matrix(data = samp2, nrow = 2)
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
[1,] 9.543024 4.851597 5.794626 4.590608 6.445296 4.841330 6.364498 4.503076
[2,] 3.548736 6.310540 4.470656 8.319919 3.224517 4.426277 5.707208 4.474784
#b
matrix(samp2, ncol = 2)
         [,1]     [,2]
[1,] 9.543024 6.445296
[2,] 3.548736 3.224517
[3,] 4.851597 4.841330
[4,] 6.310540 4.426277
[5,] 5.794626 6.364498
[6,] 4.470656 5.707208
[7,] 4.590608 4.503076
[8,] 8.319919 4.474784
Question no. 03

The attach() function adds the data frame to the R search path.

attach(mtcars)

The detach() function removes the data frame from the search path.

detach(mtcars)

The with() function evaluates an expression in an environment constructed from the supplied data.

with(data = mtcars,{summary(mpg)})
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
Question no. 04
id<-1:16
iq<-c(180,150,140,110,122,181,186,170,117,142,115,175,103,90,101,135)
math<-c(85,60,70,71,55,78,90,89,75,65,55,80,46,40,39,60)

mean(iq)
[1] 138.5625
sqrt(var(iq))
[1] 32.02701
mean(math)
[1] 66.125
sqrt(var(math))
[1] 16.42711
cor(iq,math)
[1] 0.8510083
cor.test(iq,math)

    Pearson's product-moment correlation

data:  iq and math
t = 6.0633, df = 14, p-value = 2.92e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6145508 0.9471563
sample estimates:
      cor 
0.8510083 
plot(iq,math)
abline(lm(math~iq))

lm(math~iq)

Call:
lm(formula = math ~ iq)

Coefficients:
(Intercept)           iq  
     5.6432       0.4365  
Question no. 05
matrix<-matrix(data = 1:110, nrow = 10) # note: naming the object `matrix` masks the base matrix() function
matrix
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,]    1   11   21   31   41   51   61   71   81    91   101
 [2,]    2   12   22   32   42   52   62   72   82    92   102
 [3,]    3   13   23   33   43   53   63   73   83    93   103
 [4,]    4   14   24   34   44   54   64   74   84    94   104
 [5,]    5   15   25   35   45   55   65   75   85    95   105
 [6,]    6   16   26   36   46   56   66   76   86    96   106
 [7,]    7   17   27   37   47   57   67   77   87    97   107
 [8,]    8   18   28   38   48   58   68   78   88    98   108
 [9,]    9   19   29   39   49   59   69   79   89    99   109
[10,]   10   20   30   40   50   60   70   80   90   100   110
colnames(x = matrix)<-c("a","b","c","d","e","f","g","h","i","j","k")
matrix
       a  b  c  d  e  f  g  h  i   j   k
 [1,]  1 11 21 31 41 51 61 71 81  91 101
 [2,]  2 12 22 32 42 52 62 72 82  92 102
 [3,]  3 13 23 33 43 53 63 73 83  93 103
 [4,]  4 14 24 34 44 54 64 74 84  94 104
 [5,]  5 15 25 35 45 55 65 75 85  95 105
 [6,]  6 16 26 36 46 56 66 76 86  96 106
 [7,]  7 17 27 37 47 57 67 77 87  97 107
 [8,]  8 18 28 38 48 58 68 78 88  98 108
 [9,]  9 19 29 39 49 59 69 79 89  99 109
[10,] 10 20 30 40 50 60 70 80 90 100 110
matrix[,c(2,4,8)]
       b  d  h
 [1,] 11 31 71
 [2,] 12 32 72
 [3,] 13 33 73
 [4,] 14 34 74
 [5,] 15 35 75
 [6,] 16 36 76
 [7,] 17 37 77
 [8,] 18 38 78
 [9,] 19 39 79
[10,] 20 40 80
Question no. 06
doses<-c(10,15,20,30,45,50,60,65,70,80,85,90)
doses
 [1] 10 15 20 30 45 50 60 65 70 80 85 90
drug_c<-c(16,18,24,32,40,48,50,65,65,80,82,87)
drug_c
 [1] 16 18 24 32 40 48 50 65 65 80 82 87
drug_d<-c(15,16,20,25,35,42,47,55,56,70,75,79)
drug_d
 [1] 15 16 20 25 35 42 47 55 56 70 75 79
cbind(doses,drug_c,drug_d)
      doses drug_c drug_d
 [1,]    10     16     15
 [2,]    15     18     16
 [3,]    20     24     20
 [4,]    30     32     25
 [5,]    45     40     35
 [6,]    50     48     42
 [7,]    60     50     47
 [8,]    65     65     55
 [9,]    70     65     56
[10,]    80     80     70
[11,]    85     82     75
[12,]    90     87     79
rbind(doses, drug_c, drug_d)
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
doses    10   15   20   30   45   50   60   65   70    80    85    90
drug_c   16   18   24   32   40   48   50   65   65    80    82    87
drug_d   15   16   20   25   35   42   47   55   56    70    75    79
#main part
plot(x = doses,y = drug_c, type = "b", main = "Drug C vs D", xlab = "Drug Doses", ylab = "Response to Drug", col="red", pch="c")
lines(x = doses, y = drug_d,type = "b", col="blue", pch="d")
legend("topleft",title = "Drug Type", c("C","D"), fill = c("red","blue"))

Question no. 07
attach(mtcars)
par(mfrow=c(2,2))
plot(wt,hp)
plot(mpg,disp)
hist(disp)
boxplot(mpg)

2nd Incourse

Question no. 03
  • Detect the errors in the following code and explain why they are errors.
    qplot(x1, x2, data=(), color=x30)
    The data argument should be assigned to a valid data frame (data=() is not valid), and the plotting variables x1, x2 and x3 must exist in that data frame.

    boxplot(nitrogen=Month, airquality, xlab=Month, ylab=Nitrogen)
    There is no Nitrogen variable in the airquality data frame; it should be replaced with Ozone. The formula is also incorrect: it must be written with the ~ operator (e.g. Ozone ~ Month), not =. Finally, the xlab and ylab annotations should be character strings, not bare object names.
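    For reference, a corrected version of the second call might look like this (a sketch using the built-in airquality data):
    boxplot(Ozone ~ Month, data = airquality, xlab = "Month", ylab = "Ozone")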

    • Write down the executable R codes to generate 30 random numbers from the following distributions:
rm(list = ls())
set.seed(30)

# log normal
round(rlnorm(30), 2)
 [1] 0.28 0.71 0.59 3.57 6.20 0.22 1.12 0.47 0.51 1.32 0.36 0.16 0.51 0.94 2.41
[16] 1.31 0.98 0.59 0.24 0.16 0.85 2.13 0.40 2.23 4.44 0.33 0.59 0.24 0.29 1.26
# geometric
rgeom(30, .4)
 [1] 2 1 1 4 0 0 0 0 2 0 0 5 2 0 4 0 0 7 1 2 0 1 1 0 0 1 0 2 0 0
# gamma
round(rgamma(30, shape = 5), 2)
 [1] 8.70 7.18 3.07 4.74 2.55 5.63 3.09 2.58 4.09 8.79 4.34 3.64 5.70 6.61 7.64
[16] 6.91 2.55 5.78 4.04 6.85 2.73 5.07 4.72 5.09 4.96 2.04 3.50 6.04 3.41 6.38
Distribution Abbreviation
Beta beta
Binomial binom
Cauchy cauchy
Chi-squared (noncentral) chisq
Exponential exp
F f
Gamma gamma
Geometric geom
Hypergeometric hyper
Lognormal lnorm
Logistic logis
Multinomial multinom
Negative binomial nbinom
Normal norm
Poisson pois
Wilcoxon Signed Rank signrank
T t
Uniform unif
Weibull weibull
Wilcoxon Rank Sum wilcox
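Each abbreviation is combined with one of the prefixes d (density/pmf), p (cumulative distribution), q (quantile) and r (random generation). As a small illustration for the normal distribution:
dnorm(0)      # density at 0
pnorm(1.96)   # P(Z <= 1.96)
qnorm(0.975)  # 97.5th percentile
rnorm(5)      # five random draws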
Question no. 04
  • Interpret the following R codes

apply(data5, 2, median, trim=.12)

Calculates the trimmed (based on the middle 76 percent of values) column medians of the data object data5.

marks <- rnorm(50, 30, 50)

Create a numeric vector named marks by generating 50 random numbers that follow a normal distribution with mean 30 and standard deviation 50.

results <- ifelse(marks >= 50, "Passed", "Failed")

Create a character vector named results of size 50, containing "Passed" where the marks obtained are 50 or more and "Failed" where the marks obtained are less than 50.

boxplot(oxygen ~ year, weather, xlab = "year", ylab = "oxygen")

Create boxplots of oxygen (which must be a numeric vector) split by year, using the data object weather, with oxygen on the y-axis and year on the x-axis.
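As a runnable sketch of the marks/results lines interpreted above (set.seed() added here only for reproducibility):
set.seed(1)
marks <- rnorm(50, 30, 50)
results <- ifelse(marks >= 50, "Passed", "Failed")
table(results)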

Question no. 05
  • Write a function to calculate
    \(1-e^{-\lambda x}\) where \(x \geq 0\), \(\lambda\) is the rate parameter
rm(list = ls())
p_exp <- function(x, lambda){
  if (! x >= 0){
    0
  } else {
    1-exp(-1*(lambda*x))
  }
}
p_exp(1,4) # function we created 
[1] 0.9816844
pexp(1,4) # function built in 'stats' package
[1] 0.9816844
p_exp(-1,4) 
[1] 0
pexp(-1,4)
[1] 0

2019

2nd Incourse

Question no. 01
rm(list=ls())
attach(trees)
par(mfrow = c(2,2))
plot(Girth, Height)
plot(Girth, Volume)
hist(Volume)
boxplot(Volume)

detach(trees)
Question no. 02

The swiss data set is available in R as a built-in resource.

  • Recode the variable Infant.Mortality into four categories- the values upto first quartile as Good, first quartile until median as Normal, median until third quartile as Weaker and above third quartile as Poor.
rm(list = ls())
library(tidyverse) # needed for tibble & ggplot
library(knitr) # needed for kable function

tibble(swiss) # original data set
# A tibble: 47 × 6
   Fertility Agriculture Examination Education Catholic Infant.Mortality
       <dbl>       <dbl>       <int>     <int>    <dbl>            <dbl>
 1      80.2        17            15        12     9.96             22.2
 2      83.1        45.1           6         9    84.8              22.2
 3      92.5        39.7           5         5    93.4              20.2
 4      85.8        36.5          12         7    33.8              20.3
 5      76.9        43.5          17        15     5.16             20.6
 6      76.1        35.3           9         7    90.6              26.6
 7      83.8        70.2          16         7    92.8              23.6
 8      92.4        67.8          14         8    97.2              24.9
 9      82.4        53.3          12         7    97.7              21  
10      82.9        45.2          16        13    91.4              24.4
# ℹ 37 more rows
# solution process
quantile(swiss$Infant.Mortality)
   0%   25%   50%   75%  100% 
10.80 18.15 20.00 21.70 26.60 
swiss <- within(data = swiss, expr = {
  status <- NA
  status[Infant.Mortality <= 18.15] <- "Good"
  status[Infant.Mortality > 18.15 & Infant.Mortality <= 20] <- "Normal"
  status[Infant.Mortality > 20 & Infant.Mortality <= 21.70] <- "Weaker"
  status[Infant.Mortality > 21.70] <- "Poor"
})
# let's spot the change
kable(swiss)
Fertility Agriculture Examination Education Catholic Infant.Mortality status
Courtelary 80.2 17.0 15 12 9.96 22.2 Poor
Delemont 83.1 45.1 6 9 84.84 22.2 Poor
Franches-Mnt 92.5 39.7 5 5 93.40 20.2 Weaker
Moutier 85.8 36.5 12 7 33.77 20.3 Weaker
Neuveville 76.9 43.5 17 15 5.16 20.6 Weaker
Porrentruy 76.1 35.3 9 7 90.57 26.6 Poor
Broye 83.8 70.2 16 7 92.85 23.6 Poor
Glane 92.4 67.8 14 8 97.16 24.9 Poor
Gruyere 82.4 53.3 12 7 97.67 21.0 Weaker
Sarine 82.9 45.2 16 13 91.38 24.4 Poor
Veveyse 87.1 64.5 14 6 98.61 24.5 Poor
Aigle 64.1 62.0 21 12 8.52 16.5 Good
Aubonne 66.9 67.5 14 7 2.27 19.1 Normal
Avenches 68.9 60.7 19 12 4.43 22.7 Poor
Cossonay 61.7 69.3 22 5 2.82 18.7 Normal
Echallens 68.3 72.6 18 2 24.20 21.2 Weaker
Grandson 71.7 34.0 17 8 3.30 20.0 Normal
Lausanne 55.7 19.4 26 28 12.11 20.2 Weaker
La Vallee 54.3 15.2 31 20 2.15 10.8 Good
Lavaux 65.1 73.0 19 9 2.84 20.0 Normal
Morges 65.5 59.8 22 10 5.23 18.0 Good
Moudon 65.0 55.1 14 3 4.52 22.4 Poor
Nyone 56.6 50.9 22 12 15.14 16.7 Good
Orbe 57.4 54.1 20 6 4.20 15.3 Good
Oron 72.5 71.2 12 1 2.40 21.0 Weaker
Payerne 74.2 58.1 14 8 5.23 23.8 Poor
Paysd’enhaut 72.0 63.5 6 3 2.56 18.0 Good
Rolle 60.5 60.8 16 10 7.72 16.3 Good
Vevey 58.3 26.8 25 19 18.46 20.9 Weaker
Yverdon 65.4 49.5 15 8 6.10 22.5 Poor
Conthey 75.5 85.9 3 2 99.71 15.1 Good
Entremont 69.3 84.9 7 6 99.68 19.8 Normal
Herens 77.3 89.7 5 2 100.00 18.3 Normal
Martigwy 70.5 78.2 12 6 98.96 19.4 Normal
Monthey 79.4 64.9 7 3 98.22 20.2 Weaker
St Maurice 65.0 75.9 9 9 99.06 17.8 Good
Sierre 92.2 84.6 3 3 99.46 16.3 Good
Sion 79.3 63.1 13 13 96.83 18.1 Good
Boudry 70.4 38.4 26 12 5.62 20.3 Weaker
La Chauxdfnd 65.7 7.7 29 11 13.79 20.5 Weaker
Le Locle 72.7 16.7 22 13 11.22 18.9 Normal
Neuchatel 64.4 17.6 35 32 16.92 23.0 Poor
Val de Ruz 77.6 37.6 15 7 4.97 20.0 Normal
ValdeTravers 67.6 18.7 25 7 8.65 19.5 Normal
V. De Geneve 35.0 1.2 37 53 42.34 18.0 Good
Rive Droite 44.7 46.6 16 29 50.43 18.2 Normal
Rive Gauche 42.8 27.7 22 29 58.33 19.3 Normal
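The same recode can also be written with cut(), using the quartiles as break points; a sketch (stored in status2 so it does not overwrite the column created above):
swiss$status2 <- cut(swiss$Infant.Mortality,
                     breaks = quantile(swiss$Infant.Mortality),
                     labels = c("Good", "Normal", "Weaker", "Poor"),
                     include.lowest = TRUE)
table(swiss$status, swiss$status2) # the two codings should agree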
  • Draw bar diagram of the newly created variable and interpret the results
ggplot(data = swiss)+
  geom_bar(mapping = aes(x = status), fill = "blue4")

  • Draw a random sample of size 10 from the swiss data set and calculate summary statistics for the variables Fertility and Education
set.seed(30)
new_swiss <- swiss[sample(nrow(swiss), 10),] # don't forget that comma at the end!!
kable(new_swiss)
Fertility Agriculture Examination Education Catholic Infant.Mortality status
Sarine 82.9 45.2 16 13 91.38 24.4 Poor
Rive Droite 44.7 46.6 16 29 50.43 18.2 Normal
Aubonne 66.9 67.5 14 7 2.27 19.1 Normal
Aigle 64.1 62.0 21 12 8.52 16.5 Good
Rive Gauche 42.8 27.7 22 29 58.33 19.3 Normal
Val de Ruz 77.6 37.6 15 7 4.97 20.0 Normal
Vevey 58.3 26.8 25 19 18.46 20.9 Weaker
Moutier 85.8 36.5 12 7 33.77 20.3 Weaker
Franches-Mnt 92.5 39.7 5 5 93.40 20.2 Weaker
Entremont 69.3 84.9 7 6 99.68 19.8 Normal
summary(new_swiss$Fertility)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.80   59.75   68.10   68.49   81.58   92.50 
summary(new_swiss$Education)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    5.0     7.0     9.5    13.4    17.5    29.0 
Question no. 03
  • Distinguish between FOR loop and WHILE loop by giving R executable examples.

For
A for loop executes a statement repeatedly, once for each value of the variable in the sequence.
The syntax is for(variable in sequence) statement. Example:

rm(list = ls())
for (i in 1:5) print("Hello world!")
[1] "Hello world!"
[1] "Hello world!"
[1] "Hello world!"
[1] "Hello world!"
[1] "Hello world!"

While
A while loop executes a statement repeatedly as long as the condition is true.
The syntax is while(condition) statement
Example:

rm(list = ls())
i <-5
while(i > 0){
  print("Hello world!");
  i <- i-1
}
[1] "Hello world!"
[1] "Hello world!"
[1] "Hello world!"
[1] "Hello world!"
[1] "Hello world!"
Question no. 04
  • Aggregate the warpbreaks data for breaks by the variables wool and tension. Ignore the warnings and find the mean number of breaks corresponding to wool=A and tension=M
rm(list = ls())
kable(aggregate(warpbreaks, by = list(wool=warpbreaks$wool,tension=warpbreaks$tension), FUN=mean))
wool tension breaks wool tension
A L 44.55556 NA NA
B L 28.22222 NA NA
A M 24.00000 NA NA
B M 28.77778 NA NA
A H 24.55556 NA NA
B H 18.77778 NA NA
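The specific mean asked for (wool = A, tension = M) can also be extracted directly with a formula-style aggregate; a quick sketch:
agg <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)
agg$breaks[agg$wool == "A" & agg$tension == "M"] # 24, as in the table above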
Question no. 05
  • Write a function to calculate
    \(1-e^{-(\lambda x)^{a}}\)
    where \(x \geq 0, a > 0\) is the shape parameter and \(\lambda > 0\) is the scale parameter.
rm(list = ls())
p_weibull <- function(x, a, lambda){
  if (! x >= 0){
    0
  } else {
    if (! a > 0){
      "a must be greater than zero"
    } else {
      if (! lambda > 0){
       "lambda must be greater than zero" 
      } else {
        1-exp(-(lambda*x)^a)
      }
    }
  }
}

p_weibull(5,3,1) # function we created
[1] 1
pweibull(5,3,1) # built-in 'stats' function; note that its third argument is scale = 1/lambda
[1] 1
p_weibull(1.0,3,2)
[1] 0.9996645
pweibull(1.0,3,.5)
[1] 0.9996645
p_weibull(1.5,3,5)
[1] 1
pweibull(1.5,3,5^-1)
[1] 1
p_weibull(-5,3,1)
[1] 0
pweibull(-5,3,1)
[1] 0
p_weibull(5,-3,1)
[1] "a must be greater than zero"
pweibull(5,-3,1)
[1] NaN
p_weibull(5,3,-1)
[1] "lambda must be greater than zero"
pweibull(5,3,-1)
[1] NaN

Final Solutions

2019

Ans. to 1

clean environment and load packages
rm(list = ls())
library(tidyverse)
library(knitr)
library(skimr)
get to know your data
head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
str(state.x77)
 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
colnames(state.x77)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"    
[6] "HS Grad"    "Frost"      "Area"      
skim_without_charts(state.x77)
Data summary
Name state.x77
Number of rows 50
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Population 0 1 4246.42 4464.49 365.00 1079.50 2838.50 4968.50 21198.0
Income 0 1 4435.80 614.47 3098.00 3992.75 4519.00 4813.50 6315.0
Illiteracy 0 1 1.17 0.61 0.50 0.62 0.95 1.58 2.8
Life Exp 0 1 70.88 1.34 67.96 70.12 70.67 71.89 73.6
Murder 0 1 7.38 3.69 1.40 4.35 6.85 10.67 15.1
HS Grad 0 1 53.11 8.08 37.80 48.05 53.25 59.15 67.3
Frost 0 1 104.46 51.98 0.00 66.25 114.50 139.75 188.0
Area 0 1 70735.88 85327.30 1049.00 36985.25 54277.00 81162.50 566432.0
answer of (a)
# sorting the data
data <- as.data.frame(state.x77)
a.1 <- data |> arrange(-Illiteracy) # Louisiana
a.2 <- data |> arrange(Population) # Alaska
a.3 <- data |> arrange(Murder) # North Dakota
head(a.1)
               Population Income Illiteracy Life Exp Murder HS Grad Frost
Louisiana            3806   3545        2.8    68.76   13.2    42.2    12
Mississippi          2341   3098        2.4    68.09   12.5    41.0    50
South Carolina       2816   3635        2.3    67.96   11.6    37.8    65
New Mexico           1144   3601        2.2    70.32    9.7    55.2   120
Texas               12237   4188        2.2    70.90   12.2    47.4    35
Alabama              3615   3624        2.1    69.05   15.1    41.3    20
                 Area
Louisiana       44930
Mississippi     47296
South Carolina  30225
New Mexico     121412
Texas          262134
Alabama         50708
head(a.2)
             Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alaska              365   6315        1.5    69.31   11.3    66.7   152 566432
Wyoming             376   4566        0.6    70.29    6.9    62.9   173  97203
Vermont             472   3907        0.6    71.64    5.5    57.1   168   9267
Delaware            579   4809        0.9    70.06    6.2    54.6   103   1982
Nevada              590   5149        0.5    69.03   11.5    65.2   188 109889
North Dakota        637   5087        0.8    72.78    1.4    50.3   186  69273
head(a.3)
             Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
North Dakota        637   5087        0.8    72.78    1.4    50.3   186 69273
South Dakota        681   4167        0.5    72.08    1.7    53.3   172 75955
Iowa               2861   4628        0.5    72.56    2.3    59.0   140 55941
Minnesota          3921   4675        0.6    72.96    2.3    57.6   160 79289
Rhode Island        931   4558        1.3    71.90    2.4    46.4   127  1049
Maine              1058   3694        0.7    70.39    2.7    54.7   161 30920
# you can also use `order` function
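# e.g., a base-R equivalent of the Illiteracy sort above (a sketch):
# head(data[order(-data$Illiteracy), ])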
answer of (b)
b.1 <- data |> arrange(-Area) # Alaska
b.2 <- data |> arrange(-Population) # California
b.3 <- data |> arrange(-Murder) # Alabama
head(b.1)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Texas           12237   4188        2.2    70.90   12.2    47.4    35 262134
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Montana           746   4347        0.6    70.56    5.0    59.2   155 145587
New Mexico       1144   3601        2.2    70.32    9.7    55.2   120 121412
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
head(b.2)
             Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
California        21198   5114        1.1    71.71   10.3    62.6    20 156361
New York          18076   4903        1.4    70.55   10.9    52.7    82  47831
Texas             12237   4188        2.2    70.90   12.2    47.4    35 262134
Pennsylvania      11860   4449        1.0    70.43    6.1    50.2   126  44966
Illinois          11197   5107        0.9    70.14   10.3    52.6   127  55748
Ohio              10735   4561        0.8    70.82    7.4    53.2   124  40975
head(b.3)
               Population Income Illiteracy Life Exp Murder HS Grad Frost
Alabama              3615   3624        2.1    69.05   15.1    41.3    20
Georgia              4931   4091        2.0    68.54   13.9    40.6    60
Louisiana            3806   3545        2.8    68.76   13.2    42.2    12
Mississippi          2341   3098        2.4    68.09   12.5    41.0    50
Texas               12237   4188        2.2    70.90   12.2    47.4    35
South Carolina       2816   3635        2.3    67.96   11.6    37.8    65
                 Area
Alabama         50708
Georgia         58073
Louisiana       44930
Mississippi     47296
Texas          262134
South Carolina  30225
# you can also use `order` function
answer of (c)
c <- state.x77[sample(nrow(state.x77), 10),]
kable(c)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Montana 746 4347 0.6 70.56 5.0 59.2 155 145587
Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149 54464
Virginia 4981 4701 1.4 70.08 9.5 47.8 85 39780
South Carolina 2816 3635 2.3 67.96 11.6 37.8 65 30225
Kentucky 3387 3712 1.6 70.10 10.6 38.5 95 39650
Idaho 813 4119 0.6 71.87 5.3 59.5 126 82677
Ohio 10735 4561 0.8 70.82 7.4 53.2 124 40975
North Carolina 5441 3875 1.8 69.21 11.1 38.5 80 48798
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
South Dakota 681 4167 0.5 72.08 1.7 53.3 172 75955
kable(summary(c))
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Min. : 681 Min. :3635 Min. :0.500 Min. :67.96 Min. : 1.700 Min. :37.80 Min. : 20.00 Min. : 30225
1st Qu.: 1314 1st Qu.:3936 1st Qu.:0.625 1st Qu.:70.08 1st Qu.: 5.075 1st Qu.:40.83 1st Qu.: 81.25 1st Qu.: 40079
Median : 3988 Median :4257 Median :0.950 Median :70.69 Median : 8.450 Median :53.25 Median :109.50 Median : 51631
Mean : 5539 Mean :4270 Mean :1.140 Mean :70.69 Mean : 7.550 Mean :50.49 Mean :107.10 Mean : 71447
3rd Qu.: 5326 3rd Qu.:4538 3rd Qu.:1.550 3rd Qu.:71.83 3rd Qu.:10.525 3rd Qu.:58.02 3rd Qu.:143.25 3rd Qu.: 80997
Max. :21198 Max. :5114 Max. :2.300 Max. :72.48 Max. :11.600 Max. :62.60 Max. :172.00 Max. :156361
answer of (d)
cor(data$Illiteracy, data$Murder, method = "pearson")
[1] 0.7029752

The Pearson correlation coefficient is 0.703, so there is a fairly strong positive correlation between the murder rate and the illiteracy rate.

answer of (e)
quantile(data$Area, c(.33,.66,1))
      33%       66%      100% 
 41940.34  69089.52 566432.00 
d <- within(data = data, expr = {
  size <- NA
  size [Area < 41940.34] <- "small"
  size [Area >= 41940.34 & Area < 69089.52] <- "medium"
  size [Area >= 69089.52] <- "large"
})
head(d)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
             size
Alabama    medium
Alaska      large
Arizona     large
Arkansas   medium
California  large
Colorado    large
ggplot(data = d) +
  geom_bar(mapping = aes(x = size), fill = "wheat4")

nrow(d[(d$size == "large"),])
[1] 17
nrow(d[(d$size == "small"),])
[1] 17
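The three group sizes can also be read off in a single call (the medium group is the remaining 50 - 17 - 17 = 16 states):
table(d$size)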

Ans. to 2

rm(list=ls())
attach(trees)
par(mfrow = c(2,2))
plot(Volume, Height)
plot(Girth, Volume)
hist(Height)
boxplot(Volume)

Ans. to 3

rm(list = ls())

#(a)
quantile(swiss$Infant.Mortality)
   0%   25%   50%   75%  100% 
10.80 18.15 20.00 21.70 26.60 
data <- within(data = swiss, expr = {
  status <- NA
  status [Infant.Mortality <= 18.15] <- "Good"
  status [Infant.Mortality > 18.15 & Infant.Mortality <= 20] <- "Normal"
  status [Infant.Mortality > 20 & Infant.Mortality <= 21.7] <- "Weaker"
  status [Infant.Mortality > 21.7] <- "Poor"
})
head(data)
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
             Infant.Mortality status
Courtelary               22.2   Poor
Delemont                 22.2   Poor
Franches-Mnt             20.2 Weaker
Moutier                  20.3 Weaker
Neuveville               20.6 Weaker
Porrentruy               26.6   Poor
#(b)

ggplot(data) +
  geom_bar(mapping = aes(x = status), fill = "wheat4")

# (c)
c <- swiss[(sample(nrow(swiss), 15)),]
summary(c$Agriculture)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.70   28.90   53.30   47.67   66.00   78.20 
summary(c$Education)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    9.00   11.07   12.00   29.00 

Ans. to 4

aggregate(mtcars, by = list(Cyl = mtcars$cyl), mean)
  Cyl      mpg cyl     disp        hp     drat       wt     qsec        vs
1   4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
2   6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
3   8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
         am     gear     carb
1 0.7272727 4.090909 1.545455
2 0.4285714 3.857143 3.428571
3 0.1428571 3.285714 3.500000
aggregate(mtcars, by = list(Gear = mtcars$gear), median)
  Gear  mpg cyl  disp  hp drat   wt   qsec vs am gear carb
1    3 15.5   8 318.0 180 3.08 3.73 17.420  0  0    3    3
2    4 22.8   4 130.9  94 3.92 2.70 18.755  1  1    4    2
3    5 19.7   6 145.0 175 3.77 2.77 15.500  0  1    5    4

Ans. to 5

warpbreaks |> 
  group_by(wool, tension) |> 
  drop_na() |> 
  summarise(avg.breaks = mean(breaks))
# A tibble: 6 × 3
# Groups:   wool [2]
  wool  tension avg.breaks
  <fct> <fct>        <dbl>
1 A     L             44.6
2 A     M             24  
3 A     H             24.6
4 B     L             28.2
5 B     M             28.8
6 B     H             18.8

Ans. to 6

# pre-arrangements
rm(list = ls())
library(dplyr)
chicago <- readRDS("chicago.rds")
# get to know your data
colnames(chicago)
[1] "city"       "tmpd"       "dptp"       "date"       "pm25tmean2"
[6] "pm10tmean2" "o3tmean2"   "no2tmean2" 
head(chicago)
  city tmpd   dptp       date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5 31.500 1987-01-01         NA   34.00000 4.250000  19.98810
2 chic 33.0 29.875 1987-01-02         NA         NA 3.304348  23.19099
3 chic 33.0 27.375 1987-01-03         NA   34.16667 3.333333  23.81548
4 chic 29.0 28.625 1987-01-04         NA   47.00000 4.375000  30.43452
5 chic 32.0 28.875 1987-01-05         NA         NA 4.750000  30.33333
6 chic 40.0 35.125 1987-01-06         NA   48.00000 5.833333  25.77233
skim_without_charts(chicago)
Data summary
Name chicago
Number of rows 6940
Number of columns 8
_______________________
Column type frequency:
character 1
Date 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 4 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 1987-01-01 2005-12-31 1996-07-01 6940

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
tmpd 1 1.00 50.31 19.41 -16.00 35.00 51.00 67.00 92.00
dptp 2 1.00 40.34 18.49 -25.62 27.00 39.88 55.75 78.25
pm25tmean2 4447 0.36 16.23 8.70 1.70 9.70 14.66 20.60 61.50
pm10tmean2 242 0.97 33.90 17.97 2.00 21.50 30.28 42.00 365.00
o3tmean2 0 1.00 19.44 11.39 0.15 10.07 18.52 27.00 66.59
no2tmean2 0 1.00 25.23 7.99 6.16 19.65 24.56 30.14 62.48
# (a)
tibble(select(chicago, starts_with("d")))
# A tibble: 6,940 × 2
    dptp date      
   <dbl> <date>    
 1  31.5 1987-01-01
 2  29.9 1987-01-02
 3  27.4 1987-01-03
 4  28.6 1987-01-04
 5  28.9 1987-01-05
 6  35.1 1987-01-06
 7  26.8 1987-01-07
 8  22   1987-01-08
 9  29   1987-01-09
10  27.8 1987-01-10
# ℹ 6,930 more rows
# (b)
chicago |> 
  filter(pm25tmean2 > 35 & tmpd > 75) |> 
  summarise(mean_nitrogen = mean(no2tmean2))
  mean_nitrogen
1      29.22863
# (c)
chicago |> 
  separate(col = date, into = c("year", "month", "date")) |> 
  group_by(year) |> 
  summarise(ozone = mean(o3tmean2), nitrogen = mean(no2tmean2))
# A tibble: 19 × 3
   year  ozone nitrogen
   <chr> <dbl>    <dbl>
 1 1987   20.5     25.3
 2 1988   22.2     25.3
 3 1989   20.8     27.4
 4 1990   19.7     23.1
 5 1991   20.0     21.9
 6 1992   16.1     25.9
 7 1993   15.8     26.1
 8 1994   17.3     29.1
 9 1995   18.1     27.9
10 1996   16.7     27.0
11 1997   18.6     25.8
12 1998   19.3     24.7
13 1999   20.5     24.9
14 2000   18.5     24.1
15 2001   19.4     25.4
16 2002   21.0     23.7
17 2003   21.0     25.2
18 2004   20.5     23.4
19 2005   23.2     23.2
# (d) # mutate  ## all calculations are fictional 
d <- chicago |>  
  mutate(gas = o3tmean2 + no2tmean2)
tibble(select(.data = d, gas, o3tmean2, no2tmean2))
# A tibble: 6,940 × 3
     gas o3tmean2 no2tmean2
   <dbl>    <dbl>     <dbl>
 1  24.2     4.25      20.0
 2  26.5     3.30      23.2
 3  27.1     3.33      23.8
 4  34.8     4.38      30.4
 5  35.1     4.75      30.3
 6  31.6     5.83      25.8
 7  29.9     9.29      20.6
 8  28.3    11.3       17.0
 9  27.9     4.5       23.4
10  24.5     4.96      19.5
# ℹ 6,930 more rows
# (e)
## all above

Ans. to 7

% 2019
% answer to (a)
% 5050
clear all
sum = 0;
for i = 1:100
  sum = sum + i;
end

% answer to (b)
% 3280
clear all
y = 1
t = 1
while t ~= 2187
  t = t*3
  y = y+t
end

% answer to (c)
clear all
C1 = [1,2,3]
C2 = [4,5,6]
C = C1.*C2 % [4, 10, 18]
D = C1 * C2' % 32

% answer to (d)
% [2.2373, 2.6780, 2.1017]
clear all
A = [2,3,-5; 1,-7,5; 3,5,-1]
B = [2; -6; 18]
sol = inv(A) * B
sol2 = A\B % left division solves A*sol2 = B (A|B is elementwise OR, not a solver)
sol3 = linsolve(A,B)

% 2020
% answer to (b)
% 5461
clear all
y = 1
t = 1
while t ~= 4096
  t = t * 4
  y = y + t
  end

Ans. to 08

rm(list = ls())
func_plot <- function(a) {
  fx <- vector()
  Fx <- vector()
  i <- 1:10
  fx[i] <- a * exp(-a*i)
  Fx[i] <- 1 - exp(-a*i)
  par(mfrow = c(1,2))
  plot(i, fx, xlab = 'x', ylab = 'f(x)' ,main = "Density", type = "o", col = "red")
  plot(i, Fx, xlab = "x", ylab = 'F(x)' ,main = "CDF", type = "o", col = "red")
}
func_plot(1)

func_plot(2)

Ans. to 9

rm(list = ls())

avgpm <- read.csv("avgpm25.csv", sep = ",")
colnames(avgpm)
[1] "PM25"         "Daily_AQ_Val" "STATE"        "COUNTY_CODE"  "COUNTY"      
[6] "latitude"     "longitude"   
head(avgpm)
  PM25 Daily_AQ_Val   STATE COUNTY_CODE COUNTY latitude longitude
1  8.8           37 Georgia          21   Bibb 32.77746   -83.641
2 10.4           43 Georgia          21   Bibb 32.77746   -83.641
3  7.9           33 Georgia          21   Bibb 32.77746   -83.641
4  3.8           16 Georgia          21   Bibb 32.77746   -83.641
5  5.8           24 Georgia          21   Bibb 32.77746   -83.641
6  9.8           41 Georgia          21   Bibb 32.77746   -83.641
# (a)
summary(avgpm$PM25)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -3.000   5.400   7.600   8.287  10.400  50.200 
boxplot(avgpm$PM25)

hist(avgpm$PM25)

# (b)
library(tidyverse)
b <- table(avgpm$COUNTY)
b

      Bibb   Charlton    Chatham     Clarke    Clayton       Cobb     Coffee 
       560         77        640        395        120        121        119 
    DeKalb  Dougherty      Floyd     Fulton      Glynn   Gwinnett       Hall 
       712        686        365        606         99        452        416 
     Henry    Houston    Lowndes     Murray   Muscogee   Richmond     Walker 
       359        434        409         77        703        447        440 
Washington 
       121 
ggplot(avgpm) +
  geom_bar(mapping = aes(x = COUNTY))

# (c)
par(mfrow = c(2,2))
c.1 <- avgpm[avgpm$COUNTY == "Richmond",]
hist(c.1$PM25, xlab = "PM25", main = "Richmond")

c.2 <- avgpm[avgpm$COUNTY == "Clayton",]
hist(c.2$PM25, xlab = "PM25", main = "Clayton")

c.3 <- avgpm[avgpm$COUNTY == "Washington",]
hist(c.3$PM25, xlab = "PM25", main = "Washington")

c.4 <- avgpm[avgpm$COUNTY == "Houston",]
hist(c.4$PM25, xlab = "PM25", main = "Houston")

2020

Ans. to 01

rm(list = ls())
library(tidyverse)

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 2/5, position = 'identity')

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(fill = NA, position = 'identity')

ggplot(data = diamonds) +
  stat_summary(mapping = aes(x = cut, y = price), fun.ymin = min, fun.ymax = max, fun.y = median) # in ggplot2 >= 3.3.0 these arguments are named fun.min, fun.max and fun

Ans. to 02

# (a)

lr_estimates <- function(x, y) {
  b_1 <-  sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
  
  b_0 <- mean(y) - b_1 * mean(x)
  
  return(c(b_0, b_1))
}

lr_estimates(mtcars$wt, mtcars$mpg)
[1] 37.285126 -5.344472
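These agree with the built-in fit, which serves as a quick cross-check:
coef(lm(mpg ~ wt, data = mtcars)) # (Intercept) 37.285, wt -5.344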
## (b)

r <- function(x, y) {
  
  r <- sum((x-mean(x))*(y-mean(y)))/
    sqrt((sum((x-mean(x))^2))*(sum((y-mean(y))^2)))
  
  r
}

r(mtcars$mpg, mtcars$hp)
[1] -0.7761684
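A cross-check against the built-in correlation function:
cor(mtcars$mpg, mtcars$hp) # -0.7761684, matching the hand-rolled version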

Ans. to 03

rm(list = ls())
library(tidyverse)
library(reshape) # i, ii

## (i)
# ?melt
names(airquality) <- tolower(names(airquality))
head(melt(airquality, id=c("month", "day")))
  month day variable value
1     5   1    ozone    41
2     5   2    ozone    36
3     5   3    ozone    12
4     5   4    ozone    18
5     5   5    ozone    NA
6     5   6    ozone    28
## (ii)
# ?cast
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
head(cast(aqm, day ~ month ~ variable))
, , variable = ozone

   month
day  5  6   7  8  9
  1 41 NA 135 39 96
  2 36 NA  49  9 78
  3 12 NA  32 16 73
  4 18 NA  NA 78 91
  5 NA NA  64 35 47
  6 28 NA  40 66 32

, , variable = solar.r

   month
day   5   6   7  8   9
  1 190 286 269 83 167
  2 118 287 248 24 197
  3 149 242 236 77 183
  4 313 186 101 NA 189
  5  NA 220 175 NA  95
  6  NA 264 314 NA  92

, , variable = wind

   month
day    5    6    7    8    9
  1  7.4  8.6  4.1  6.9  6.9
  2  8.0  9.7  9.2 13.8  5.1
  3 12.6 16.1  9.2  7.4  2.8
  4 11.5  9.2 10.9  6.9  4.6
  5 14.3  8.6  4.6  7.4  7.4
  6 14.9 14.3 10.9  4.6 15.5

, , variable = temp

   month
day  5  6  7  8  9
  1 67 78 84 81 91
  2 72 74 85 81 92
  3 74 67 81 82 93
  4 62 84 84 86 93
  5 56 85 83 85 87
  6 66 79 83 87 84
## (iii)
# ?grep # base
grep("[a-z]", letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26
## (iv)
# ?qnorm
qnorm(.5)
[1] 0
## (v)
# ?ggplot
ggplot() + 
  geom_bar(data = diamonds, mapping = aes(x = cut, fill = clarity))

# library(MASS)
rm(list = ls())
library(lattice)
state <- data.frame(state.x77, region = state.region)
colnames(state)
[1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"    
[6] "HS.Grad"    "Frost"      "Area"       "region"    
xyplot(data = state, Life.Exp ~ Income | region)

library(tidyverse)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

ggplot( data = diamonds, mapping = aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = "identity")

Ans. to 04

preliminary
rm(list = ls())
library(tidyverse)
data("AirPassengers")

ap <- as.data.frame(AirPassengers)

months <- rep(month.abb, times = 12)

as.character(seq(1949, 1960))
 [1] "1949" "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958"
[11] "1959" "1960"
years <- rep(c("1949", "1950", "1951", "1952", "1953", "1954", "1955",
               "1956", "1957", "1958", "1959", "1960"), each = 12)

airpassengers <- as.data.frame(cbind(ap, months, years))
tibble(airpassengers)
# A tibble: 144 × 3
       x months years
   <dbl> <chr>  <chr>
 1   112 Jan    1949 
 2   118 Feb    1949 
 3   132 Mar    1949 
 4   129 Apr    1949 
 5   121 May    1949 
 6   135 Jun    1949 
 7   148 Jul    1949 
 8   148 Aug    1949 
 9   136 Sep    1949 
10   119 Oct    1949 
# ℹ 134 more rows
(a)
library(tidyverse)

head(airpassengers |> arrange(ap))
    x months years
1 104    Nov  1949
2 112    Jan  1949
3 114    Nov  1950
4 115    Jan  1950
5 118    Feb  1949
6 118    Dec  1949
head(airpassengers |> arrange (-ap))
    x months years
1 622    Jul  1960
2 606    Aug  1960
3 559    Aug  1959
4 548    Jul  1959
5 535    Jun  1960
6 508    Sep  1960
cat("a.1 <- airpassengers |> 
  group_by(months) |> 
  summarise(min = min(x), max = max(x))

a.1 |>  arrange(min)
a.1 |>  arrange(-max)

a.2 <- airpassengers |> 
  group_by(years) |> 
  summarise(min = min(x), max = max(x))

a.2 |>  arrange(min)
a.2 |>  arrange(-max)")
a.1 <- airpassengers |> 
  group_by(months) |> 
  summarise(min = min(x), max = max(x))

a.1 |>  arrange(min)
a.1 |>  arrange(-max)

a.2 <- airpassengers |> 
  group_by(years) |> 
  summarise(min = min(x), max = max(x))

a.2 |>  arrange(min)
a.2 |>  arrange(-max)
(b)
plot(AirPassengers)

(c)
ggplot(data = airpassengers) +
  geom_boxplot(mapping = aes(x = months, y = x), col = "wheat4") +
  labs (y = "Passengers", x = "Months")

AST 231

Lectures

Probability Distribution

Generating random samples from probability distributions

rm(list=ls())
# Probability distribution of x
x = c(1, 3, 5)
px = c(0.6, 0.3, 0.1)

# What is the theoretical mean and variance of the probability distribution?
mu <- sum(x*px)
mu
[1] 2
sigma <-sqrt(sum((x^2)*px)-mu^2)
sigma
[1] 1.341641
# Generate 1000 values at random from the above distribution with replacement
draws = sample(x, size = 1000, replace = TRUE, prob = px)


# hist(draws, main = "1000 discrete draws")
hist(draws, breaks = seq(1, 5, by = 1), main = "1000 discrete draws")

mean(draws) # Close to theoretical mean?
[1] 1.98
mu
[1] 2
sigma
[1] 1.341641
sd(draws)
[1] 1.36582

Drawing random numbers from common probability distributions

births01 = sample(c("boy","girl"),100, replace=T)
births02 = sample(c("boy","girl"),100, replace=T, prob=c(0.513,0.487))
birth03 = rbinom(n=100,size=1,prob=0.513)
birth04 = rbinom(10,100,0.513)

# how close is the sample mean and variance to theoretical mean and variance?
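# e.g., for the Bernoulli draws in birth03 (exact values vary from run to run):
mean(birth03)   # should be close to p = 0.513
var(birth03)    # should be close to p*(1-p) = 0.2498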

# Probability of each outcome (pmf)
numbirths=10 # no of trials
numboys = 0:numbirths # all possible values of x (no. of success)
dd=dbinom(x=numboys,size=numbirths,prob=0.513)
barplot(dd~numboys, xlab = "number of boy babies in 10 births", ylab="probability that this occurs")

# cumulative probability distribution
pp=pbinom(q=numboys,size=numbirths,prob=0.513)
plot(numboys,pp,type="b", xlab = "number of boy babies in 10 births", ylab = "probability of this many boy births or fewer")

#quantile function
qbinom(p=0.04612,size=numbirths,prob=0.513, lower.tail=TRUE)
[1] 2
# Other common distributions
# rpois, rnorm, rexp, runif etc.

Some properties of normal distribution

mu=3
sig=1
# area to the left
pnorm(q=mu,mean=mu,sd=sig)
[1] 0.5
# symmetry
pnorm(1,mu,sig)
[1] 0.02275013
pnorm(5,mu,sig,lower.tail = FALSE)
[1] 0.02275013
# area under normal curve
pnorm(2,0,1)-pnorm(-2,0,1)
[1] 0.9544997
qnorm(0.025,0,1)
[1] -1.959964
dnorm(1,0,1)
[1] 0.2419707

Exercises

  • Generate a sample of size 400 from a Poisson distribution with lambda=5
rm(list = ls())
samp <- rpois(400, 5)
table(samp)
samp
 0  1  2  3  4  5  6  7  8  9 10 12 13 
 1 15 42 54 74 65 48 44 33 15  6  2  1 

Sampling Distribution & CLT

Sampling distribution of sample mean

rm(list = ls())
# Create a population of IQ scores from a normal distribution with mean 100 and standard deviation of 15.
popMean = 100; # The mean of our population
popSD = 15; # The standard deviation of our population
sampSize = 10; # The size of the samples we'll draw from the population

population <- rnorm(1000000, mean = popMean, sd = popSD)
hist(population)

mean(population)
[1] 99.99444
sd(population)
[1] 15.01108
# Drawing a single sample (sampling error)
sampChamp <- sample(population, sampSize, replace=TRUE)
mean(sampChamp)
[1] 101.4726
sd(sampChamp)
[1] 12.74773
# sampling distribution of sample mean

sampDist_mean <- replicate(10000,mean(sample(population, sampSize, replace=TRUE)))
hist(sampDist_mean)

mean(sampDist_mean) # When samples are drawn from a normal population, the sampling distribution of the sample mean is itself normal.
[1] 99.98973
sd(sampDist_mean)
[1] 4.762446
# 1. How close are these mean and standard deviation to theoretical mean and standard deviation of sample means?
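# For reference, the theoretical values are popMean and popSD/sqrt(sampSize):
# c(popMean, popSD/sqrt(sampSize))  # 100 and about 4.74, close to the simulated values above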

# 2. Now increase the sample size to n=50. What changes can you see in sampling distribution of sample mean?

The sampling distribution of sample variance: sum((x - xbar)^2)/(n - 1)

sampDist_var <- replicate(10000,var(sample(population, sampSize, replace=TRUE)))
hist(sampDist_var)

mean(sampDist_var)
[1] 226.0014
sd(sampDist_var)
[1] 105.3965

The central limit theorem

  • The sampling distribution of the sample mean is approximately normal, irrespective of the distribution of the population (in most cases), provided the sample size is sufficiently large.
population <- rexp(1000000, rate=0.3)
hist(population)

mean(population)
[1] 3.337654
sd(population)
[1] 3.334343
sampSize = 250

# sampling distribution of sample mean

sampDist_mean <- replicate(10000,mean(sample(population, sampSize, replace=TRUE)))
hist(sampDist_mean)

# Is the distribution normal?
# Does the shape change if you set n=50?


mean(sampDist_mean)
[1] 3.343083
sd(sampDist_mean)
[1] 0.2112527

Exercise

  • Show the sampling distribution of coefficient of variation (sd/mean*100)
samp_dist_cv <- replicate(10000, {
  s <- sample(population, sampSize, replace = TRUE)
  sqrt((sampSize - 1)/sampSize) * sd(s)/mean(s) * 100 # cv = sd/mean*100, with sd and mean taken from the same sample
})

hist(samp_dist_cv)

Unbiasedness

Concept of unbiasedness using all possible samples

rm(list = ls())
# Sample mean
population <-seq(1,20,2)
mean(population)
[1] 10
var(population)
[1] 36.66667
sampSize <- 2

smeans<-combn (population, sampSize, mean)
mean(smeans) # equal to the population mean
[1] 10
# Sample variance: E[sum(x-xbar)^2/(n-1)]=sigma^2
svars<-combn (population, sampSize, var)
mean(svars) # equal to the population variance
[1] 36.66667
hist(smeans) # normally distributed

# When the population size is large, evaluating unbiasedness over all possible samples becomes computationally heavy.

Provided that the same notation is maintained consistently, all results are equivalent in either notation.

We can then check unbiasedness by taking many (but not all) samples.

population <-seq(1,20000,2)
mean(population)
[1] 10000
sampSize <- 100

# allsampl <-combn (population,sampSize)
# smeans <-apply(allsampl,2,mean)
smeans <- replicate(10000,mean(sample(population, sampSize, replace= FALSE)))
hist(smeans)

mean(smeans)
[1] 9997.699

Assignment

  • Draw a population of size 30 from an exponential distribution with rate(theta) = 0.9.

    • By considering all possible samples of size 4 show that sample mean is an unbiased estimator of population mean.

    • Is sample median an unbiased estimator of population median?

rm(list = ls())
pop <- rexp(30, .9)

mean(pop)
[1] 1.418007
mean(combn(pop, 4, mean)) # equal to the population mean
[1] 1.418007
median(pop)
[1] 0.9329535
mean(combn(pop, 4, median)) # not equal to the population median
[1] 1.124802

Least Square Estimates

Using analytic formula

rm(list=ls())

x <- 1:10
y <- c(3.9,7.1,12.6,14.5,16.7,19.2,23.1,27.2,27.8,31.1)
plot(y~x)
abline(lm(y~x))

ssxy=sum((x-mean(x))*(y-mean(y)))
ssx=sum((x-mean(x))^2)
beta=ssxy/ssx
alpha=mean(y)-beta*mean(x)

beta01 <- lm(y~x)$coef[2]
alpha01 <- lm(y~x)$coef[1]

Using numeric optimization

Erssq <- function(parameter,c){
  a <- parameter[1]
  b <- parameter[2]
  x <- c[,1]
  y <- c[,2]
  return(sum((y-a-b*x)^2)) # compute error sum of squares
  }
p <- data.frame(x=x,y=y)# convert matrix c into dataframe.
res <- optim(par=c(1000,2000),fn = Erssq, c=p, method = "BFGS")
res
$par
[1] 1.953333 2.975758

$value
[1] 8.887515

$counts
function gradient 
      29       10 

$convergence
[1] 0

$message
NULL
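A quick cross-check against the analytic formulas above (both should give essentially the same estimates):
c(alpha, beta) # analytic least-squares estimates
res$par        # numeric optimum from optim()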

Unbiasedness of OLS estimate of regression coefficient

rm(list = ls())
n <- 25
rep <- 1000
beta <- rep(NA, rep)

for (i in 1:rep) {
  x <- rnorm(n)
  eps <- rnorm(n)
  y <- 2 + 4*x + eps
  beta[i] <- lm(y~x)$coeff[2]
}
mean(beta) # should be close to real value (the value we used in the equation) 4
[1] 3.995884

MLE

Notes

The stats4 package is required for the mle() function.

we know the distribution but not the parameter(s)

Normal

rm(list = ls())
library(stats4) # necessary for mle function
# We know the distribution but not the parameter(s)
set.seed(804)
n <- 100
x <- rnorm(n, mean = 3, sd = 2)

LL <- function(mu, sigma) {
  R = dnorm(x, mu, sigma)
  -sum(log(R))
}

mle(minuslogl = LL, start = list(mu = 1, sigma=1))

Call:
mle(minuslogl = LL, start = list(mu = 1, sigma = 1))

Coefficients:
      mu    sigma 
2.799094 2.015202 

Bernoulli

rm(list = ls())
set.seed(804)
n <- 100
x <- rbinom(n,1,0.3)

LL3 <- function(p) {
  R = dbinom(x, 1, p)
  -sum(log(R))
}

mle(minuslogl = LL3, start = list(p=0.9))

Call:
mle(minuslogl = LL3, start = list(p = 0.9))

Coefficients:
        p 
0.3100002 

Poisson

rm(list = ls())
set.seed(804)
n <- 100
x <- rpois(n, lambda = .4)

LL <- function(lmbda){
  R <- dpois(x, lmbda)
  -sum(log(R))
}

mle(minuslogl = LL, start = list(lmbda=.8))

Call:
mle(minuslogl = LL, start = list(lmbda = 0.8))

Coefficients:
    lmbda 
0.3600002 
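For the Poisson model the MLE also has a closed form, the sample mean, so the numeric result can be cross-checked directly:
mean(x) # analytic MLE of lambda; should match the estimate above (about 0.36)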

Gamma

rm(list = ls())
set.seed(804)
n <- 100
x <- rgamma(n, shape = 5, rate = 5)

LL <- function(alpha, beta){
  R <- dgamma(x, alpha, beta)
  -sum(log(R))
}
  
mle(minuslogl = LL, start = list(alpha=5, beta=5))

Call:
mle(minuslogl = LL, start = list(alpha = 5, beta = 5))

Coefficients:
   alpha     beta 
5.511013 5.608799 

Exponential

rm(list = ls())
set.seed(804)
n <- 100
x <- rexp(n, rate = .5)

LL <- function(theta){
  R <- dexp(x, theta)
  -sum(log(R))
}

mle(minuslogl = LL, start = list(theta=.8))

Call:
mle(minuslogl = LL, start = list(theta = 0.8))

Coefficients:
    theta 
0.4372334 

CI

Computing CI from a single sample

# sigma known
rm(list=ls())
n <- 1000
mu <- 100
sigma <- 10

x <- rnorm(n,mu,sigma) # drawing the sample
meanx<-mean(x)

# 95% Confidence Interval
lower<- meanx-1.96*sigma/sqrt(n)
upper<- meanx+1.96*sigma/sqrt(n)

c(lower,upper)
[1]  99.80381 101.04342

Confidence Interval for population mean

  • sigma known (can use normal approximation for sampling dist of \(\bar x\))

    • Case 1: If population is normal
    • Case 2: If population is not normal/unknown, but sample size \(n \geq 30\)
rm(list=ls())
n <- 1000
mu <- 100 
sigma <- 10
x <- rnorm(n,mu,sigma)
meanx<-mean(x)

# 99% Confidence interval
CL <- 0.99
alpha <- 1- CL
alpha_by2 <- alpha/2
z_alpha2 <- qnorm(alpha_by2, mean=0, sd=1, lower.tail=FALSE)

lower<- meanx-z_alpha2*sigma/sqrt(n)
upper<- meanx+z_alpha2*sigma/sqrt(n)

c(lower,upper)
[1]  98.79309 100.42218

Interpretation of 95 % CI for population mean

  • Count the proportion of intervals that contain the population parameter \(\mu\)
rm(list = ls())
n <- 100
mu <- 100
sigma <- 10
nint<-1000

means <- rep(NA,nint) 
lower <- rep(NA,nint) 
upper <- rep(NA,nint) 
# sample size and parameters

for(i in 1:nint) {
x <- rnorm(n,mu,sigma) # generate a random sample
  
means[i] <- mean(x)  # calculate mean for each sample               
  
lower[i] <- means[i] - 1.96*sigma/sqrt(n) # calculate lower bound
  
upper[i] <- means[i] + 1.96*sigma/sqrt(n) # calculate upper bound
}      


count<- 0 # count will hold the number of intervals that contain the mean mu
for(i in 1:nint) {
  if(mu >= lower[i] && mu <= upper[i])
    count<- count + 1     } 

p  <- count/nint # proportion of intervals that contain the mean mu
p  
[1] 0.958

Interpretation of 95 % CI for population proportion

  • \(\frac{pq}{n}\)
  • Count the proportion of intervals that contain the population parameter \(\pi\)
rm(list = ls())
n <- 30
trial<- 1
prob<- 0.7
nint<-1000

prop <- rep(NA,nint) 
lower <- rep(NA,nint) 
upper <- rep(NA,nint) 
# sample size and parameters

for(i in 1:nint) {
x <- rbinom(n,trial,prob) #generate a random sample

prop[i] <- mean(x) #calculate mean for each sample          
  
lower[i] <- prop[i] - 1.96*sqrt(prop[i]*(1-prop[i])/n) #calculate lower bound
  
upper[i] <- prop[i] + 1.96*sqrt(prop[i]*(1-prop[i])/n) #calculate upper bound
}      

count<- 0 # count will hold the number of intervals that contain the true proportion

for(i in 1:nint) {
  if(prob >= lower[i] && prob <= upper[i]) # population parameter
    count<- count + 1     } 

p  <- count/nint   # proportion of intervals that contain the true proportion
p  
[1] 0.954
# Exercises ####
# Check confidence level for CI for proportion

Interpretation of 95% CI for population mean (exponential distribution)

rm(list = ls())
n <- 5
lambda = 0.7
mu <- 1/lambda
nint<-1000 # number of samples - confidence intervals

means <- rep(NA,nint) 
lower <- rep(NA,nint) 
upper <- rep(NA,nint) 

for(i in 1:nint) {
x <- rexp(n,rate = lambda) # generate a random sample
  
means[i] <- mean(x) # calculate mean for each sample
  
lower[i] <- means[i] - 1.96*(1/lambda)/sqrt(n)# calculate lower bound
upper[i] <- means[i] + 1.96*(1/lambda)/sqrt(n)# calculate upper bound
} 
CIs<- c(lower,upper)

count<- 0 # count will hold the number of intervals that contain the mean mu

for(i in 1:nint) {
  if(mu >= lower[i] && mu <= upper[i]) # population parameter
    count<- count + 1} 

p  <- count/nint # proportion of intervals that contain the mean mu
p  
[1] 0.955

Test of Hypotheses

Lower Tail Test of Population Mean with Known Population Variance

The National Sleep Foundation recommends that teenagers get at least 8 hours of sleep for proper health. A stats class at a large high school suspects that students at their school are getting less than eight hours of sleep on average. To test the theory, they randomly sample 42 of these students and ask how many hours of sleep they get per night. The mean of the sample is 7.5 hours. Suppose that the population standard deviation is 2.1 hours. What is the conclusion of their test?

# The null hypothesis is that mu=8.
# Alternative is mu < 8. 
# We begin with computing the test statistic.
rm(list = ls())
xbar = 7.5            # sample mean 
mu0 = 8               # hypothesized value 
sigma = 2.1           # population standard deviation 
n = 42                # sample size 
z = (xbar-mu0)/(sigma/sqrt(n)) 
z                      # test statistic 
[1] -1.543033
# We then compute the critical value at .05 significance level.

alpha = .05 
z.alpha = qnorm(alpha) 
z.alpha               # critical value 
[1] -1.644854
# What is the decision of the test?

# The p-value approach
pval = pnorm(z) 
pval     # lower tail p-value
[1] 0.06141132

Two Tail Test of Population Mean with Unknown Variance

Suppose the food label on a cookie bag states that there are 2 grams of saturated fat in a single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per cookie is 2.2 grams with a standard deviation of 0.3 gram. At the .05 significance level, can we reject the claim on the food label?

rm(list = ls())
xbar = 2.2             # sample mean 
mu0 = 2                # hypothesized value 
s = 0.3                # sample standard deviation 
n = 35                 # sample size 
t = (xbar-mu0)/(s/sqrt(n)) 
t 
[1] 3.944053
# Compute critical values

alpha = .05
alpha_by2 = alpha/2
t.alpha = qt(1-alpha_by2, df=n-1) 
t.alpha 
[1] 2.032245
# p-value

pval = 2*pt(t, df=n-1, lower.tail=FALSE) 
pval                   # two tail p-value 
[1] 0.000380034

Exercises

Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg. In a sample of 35 penguins taken at the same time this year in the same colony, the mean penguin weight is 14.6 kg. Assume the population standard deviation is 2.5 kg. At the .05 significance level, can we reject the null hypothesis that the mean penguin weight does not differ from last year?

rm(list = ls())
mu_0 = 15.4
x_bar = 14.6
sigma = 2.5
n = 35
# null mu = 15.4
# alternative mu ~= 15.4

z_cal <-abs((x_bar - mu_0)/(sigma/sqrt(n))) 
z_cal
[1] 1.893146
z_tab <- qnorm(0.025, lower.tail = F)
z_tab 
[1] 1.959964
pvalue <- 2 * pnorm(z_cal, lower.tail = F)
pvalue
[1] 0.05833852

Lower Tail Test of Population Proportion

\(\frac{p_0 q_0}{n}\)

It is claimed by a university authority that more than 30% of the students attending the university are hostile to a decision to increase student fees in order to enhance its employees' salaries. A sample survey conducted among 600 students revealed that 163 are opposed to the decision of the university. Choose a significance level of 0.01 to see whether the sample data speak in favor of the claim.

What are the assumptions made when testing the hypothesis?

# The null hypothesis is pi = 0.3 and alternative pi < 0.3. 
# We begin with computing the test statistic.

# Calculation of test statistic
rm(list = ls())

pi_0=0.3
pi_hat=163/600
n=600
## Numbers of successes and failures under the null hypothesis need to be more than 10
n*pi_0
[1] 180
n*(1-pi_0)
[1] 420
z = (pi_hat-pi_0)/sqrt(pi_0*(1-pi_0)/n) # test statistic # null_p_0
z                      
[1] -1.51448
# We then compute the critical value at .01 significance level.

alpha = .01 
z.alpha = qnorm(alpha)
z.alpha # critical value 
[1] -2.326348
# What is the decision of the test?

# The p-value approach
pval = pnorm(z) 
pval                   # lower tail pvalue 
[1] 0.06495202
# Is the decision same as the critical value approach?

Another Problem

\(\frac{p_0 q_0}{n}\)

Use the R data set mtcars. The column mpg of mtcars contains gas mileage data for 32 U.S. automobiles (1973-74 models). Another column, named am, indicates the transmission type of each automobile model (0 = automatic, 1 = manual). Treat the data as a random sample representing the population of automobiles in the US.

The US government announced that around 50% of the automobiles in the US are automatic. You think that the percentage is higher. How would you test this at the 3% level of significance?

Based on the evidence, can you conclude that the average miles per gallon for US automobiles is different from 20 at the 1% significance level?

# The null hypothesis is pi = 0.5 and alternative pi > 0.5 

rm(list = ls())

data<-mtcars
data$mpg 
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
data$am 
 [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
n=nrow(data)
n1=nrow(data[data$am==0,])

x<-data[data$am==0,]
is.data.frame(x)
[1] TRUE
pi_hat=n1/n

pi_0=0.5

n*pi_0
[1] 16
n*(1-pi_0)
[1] 16
z = (pi_hat-pi_0)/sqrt(pi_0*(1-pi_0)/n) # test statistic 
z   
[1] 1.06066
# We then compute the critical value at .03 significance level.

alpha = .03
z.alpha = qnorm(alpha, lower.tail = FALSE)
z.alpha # critical value 
[1] 1.880794
# The p-value approach
pval = pnorm(z, lower.tail = FALSE) 
pval 
[1] 0.1444222

Test for equality of two means: Independent samples

Assuming that the data in mtcars follows the normal distribution, find the 95% confidence interval estimate of the difference between the mean gas mileage of manual and automatic transmissions. Perform a test of equality of two population means.

rm(list = ls())
mtcars$mpg 
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
mtcars$am 
 [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
# L = mtcars$am == 0 
# mpg.auto = mtcars[L,]$mpg 

mpg.auto = mtcars[mtcars$am==0,]$mpg
mpg.auto # automatic transmission mileage 
 [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 21.5
[16] 15.5 15.2 13.3 19.2
# mpg.manual = mtcars[!L,]$mpg 
mpg.manual = mtcars[mtcars$am==1,]$mpg
mpg.manual    # manual transmission mileage 
 [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4
# We can now apply the t.test function to compute the difference in means of the two sample data.

t.test(mpg.auto, mpg.manual, var.equal=TRUE)

    Two Sample t-test

data:  mpg.auto and mpg.manual
t = -4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.84837  -3.64151
sample estimates:
mean of x mean of y 
 17.14737  24.39231 
t.test(mpg.auto, mpg.manual, var.equal=FALSE) 

    Welch Two Sample t-test

data:  mpg.auto and mpg.manual
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean of x mean of y 
 17.14737  24.39231 
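Which of the two versions (pooled or Welch) is more defensible can be checked with an F test for equality of the two variances; a minimal sketch:
var.test(mpg.auto, mpg.manual)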

Test for equality of two means: paired/dependent samples

A school athletics team has taken on a new instructor and wants to test the effectiveness of the new type of training by comparing the average times of 10 runners in the 100 meters. Below are the times in seconds before and after training for each athlete.

Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After training: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1

Test whether the new type of training made a difference in the runners’ performance.

rm(list = ls())

a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

t.test(a,b, paired=TRUE)

    Paired t-test

data:  a and b
t = -0.21331, df = 9, p-value = 0.8358
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.5802549  0.4802549
sample estimates:
mean difference 
          -0.05 
#Manual computation
D=a-b
D_bar=mean(D)
S_D=sd(D)
n=length(a)

# test statistic
t=D_bar/(S_D/sqrt(n))
t
[1] -0.2133085
# p-value

pval = 2*pt(t, df=n-1)  # two-sided p-value (valid here because t < 0; in general use 2*pt(-abs(t), df = n-1))
pval                  
[1] 0.83584

Suppose now that the team manager, given the results obtained, fires the coach who produced no improvement and hires another, more promising one. The athletes' times after the second training programme are reported below:

Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
After the second training: 12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0

Test whether there was actually an improvement.

# For this question, b should first be updated to the second-training times:
# b = c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)
# (the output shown below was produced with the original vector b)
t.test(a, b, paired=TRUE, alt="greater")

    Paired t-test

data:  a and b
t = -0.21331, df = 9, p-value = 0.5821
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 -0.4796859        Inf
sample estimates:
mean difference 
          -0.05 

Test for comparing two proportions: independent samples

\(Z = \dfrac{\hat p_1 - \hat p_2}{\sqrt{\hat p\,\hat q\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}\), where \(\hat p = \frac{a_1 + a_2}{n_1 + n_2}\) is the pooled proportion and \(\hat q = 1 - \hat p\).

Test whether the proportion of female students is smaller in the Aboriginal ethnic group, using the quine data set from the MASS package.

rm(list = ls())
library(MASS)
head(quine) 
  Eth Sex Age Lrn Days
1   A   M  F0  SL    2
2   A   M  F0  SL   11
3   A   M  F0  SL   14
4   A   M  F0  AL    5
5   A   M  F0  AL    5
6   A   M  F0  AL   13
table(quine$Eth, quine$Sex) 
   
     F  M
  A 38 31
  N 42 35
# Z-test

n_a=38+31
n_n=42+35
x_a=38
x_n=42

p_a=x_a/n_a
p_n=x_n/n_n

p_com=(x_a + x_n)/(n_a + n_n)

Z=(p_a - p_n)/sqrt(p_com*(1-p_com)*(1/n_a + 1/n_n))

# The p-value approach
pval = pnorm(Z, lower.tail = TRUE) 
pval               
[1] 0.5254661
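
The same comparison can be run with the built-in prop.test(); without continuity correction its chi-squared statistic equals Z^2, so the one-sided p-value should agree with the manual calculation above.

# Built-in two-sample proportion test, one-sided (Aboriginal proportion smaller)
prop.test(x = c(x_a, x_n), n = c(n_a, n_n), alternative = "less", correct = FALSE)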

Anova

Analysis of variance (one-factor / one way)

rm(list = ls())

a = c(65, 48, 66, 75, 70, 55)
b = c(64, 44, 70, 70, 68, 59)
c = c(60, 50, 65, 69, 69, 57)
d = c(62, 46, 68, 72, 67, 56)

dati = c(a, b, c, d)

# create a group identifier
groups = factor(rep(letters[1:4], each = 6))

# Check assumptions

# test for homogeneity of variance

bartlett.test(dati, groups) # 'stats' package

    Bartlett test of homogeneity of variances

data:  dati and groups
Bartlett's K-squared = 0.48217, df = 3, p-value = 0.9228
# Since p-value > 0.05, the null of equal variances (homogeneity) cannot be rejected.


# Anova
fit = lm(formula = dati ~ groups)

# Check normality of residuals
shapiro.test(fit$residuals) # 'stats' package

    Shapiro-Wilk normality test

data:  fit$residuals
W = 0.91162, p-value = 0.03824
# Note: the Shapiro-Wilk p-value (0.038) is below 0.05, so the normality of the residuals is questionable at the 5% level.

# Anova
anova (fit)
Analysis of Variance Table

Response: dati
          Df  Sum Sq Mean Sq F value Pr(>F)
groups     3    8.46   2.819  0.0327 0.9918
Residuals 20 1726.50  86.325               

Alternative way

model=aov(dati~groups)
shapiro.test(model$residuals)

    Shapiro-Wilk normality test

data:  model$residuals
W = 0.91162, p-value = 0.03824
summary(model)
            Df Sum Sq Mean Sq F value Pr(>F)
groups       3    8.5    2.82   0.033  0.992
Residuals   20 1726.5   86.33               
# verify the calculation of F-statistic and the p-value from the output table.

# Since p-value > 0.05, we do not reject the null hypothesis H0: the four means are statistically equal. 

# we can also compare the critical value at 5% and computed value

qf(0.05, 3, 20, lower.tail=FALSE)
[1] 3.098391
# Tabulated F-value > computed F-value: we do not reject the null hypothesis.
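
As the comment above suggests, the F statistic and its p-value can be verified by hand from the ANOVA table:

a_tab <- anova(fit)
F_manual <- a_tab[1, "Mean Sq"] / a_tab[2, "Mean Sq"]  # MS(groups) / MS(residuals)
F_manual
pf(F_manual, df1 = 3, df2 = 20, lower.tail = FALSE)    # right-tail p-value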

Exercises

Are average pulse rates different across different levels of smoking?

rm(list = ls())
library(MASS) # load the package
library(tidyverse) # for drop_na()

survey_new <- drop_na(survey[, c(6,9)])

pulse_heavy <- survey_new[survey_new$Smoke == "Heavy",]$Pulse

pulse_never <- survey_new[survey_new$Smoke == "Never",]$Pulse

pulse_occas <- survey_new[survey_new$Smoke == "Occas",]$Pulse

pulse_regul <- survey_new[survey_new$Smoke == "Regul",]$Pulse

pulse <- c(pulse_heavy, pulse_never, pulse_occas, pulse_regul)

trt <- factor(c(rep("h", times = length(pulse_heavy)),
                rep("n", times = length(pulse_never)),
                rep("o", times = length(pulse_occas)),
                rep("r", times = length(pulse_regul))))

summary(aov(pulse~trt))
             Df Sum Sq Mean Sq F value Pr(>F)
trt           3    127   42.48   0.306  0.821
Residuals   187  25927  138.65               

Independence

Test of independence

In the built-in data set survey, the Smoke column records the students' smoking habits, while the Exer column records their exercise level.
The allowed values in Smoke are “Heavy”, “Regul” (regularly), “Occas” (occasionally) and “Never”.
As for Exer, they are “Freq” (frequently), “Some” and “None”.
Test whether the students' smoking habit is independent of their exercise level.

rm(list = ls())
library(MASS) # load the MASS package 
tibble(survey)
# A tibble: 237 × 12
   Sex    Wr.Hnd NW.Hnd W.Hnd Fold    Pulse Clap  Exer  Smoke Height M.I     Age
   <fct>   <dbl>  <dbl> <fct> <fct>   <int> <fct> <fct> <fct>  <dbl> <fct> <dbl>
 1 Female   18.5   18   Right R on L     92 Left  Some  Never   173  Metr…  18.2
 2 Male     19.5   20.5 Left  R on L    104 Left  None  Regul   178. Impe…  17.6
 3 Male     18     13.3 Right L on R     87 Neit… None  Occas    NA  <NA>   16.9
 4 Male     18.8   18.9 Right R on L     NA Neit… None  Never   160  Metr…  20.3
 5 Male     20     20   Right Neither    35 Right Some  Never   165  Metr…  23.7
 6 Female   18     17.7 Right L on R     64 Right Some  Never   173. Impe…  21  
 7 Male     17.7   17.7 Right L on R     83 Right Freq  Never   183. Impe…  18.8
 8 Female   17     17.3 Right R on L     74 Right Freq  Never   157  Metr…  35.8
 9 Male     20     19.5 Right R on L     72 Right Some  Never   175  Metr…  19  
10 Male     18.5   18.5 Right R on L     90 Right Some  Never   167  Metr…  22.3
# ℹ 227 more rows
colnames(survey)
 [1] "Sex"    "Wr.Hnd" "NW.Hnd" "W.Hnd"  "Fold"   "Pulse"  "Clap"   "Exer"  
 [9] "Smoke"  "Height" "M.I"    "Age"   
# We can create a contingency table

tbl = table(survey$Smoke, survey$Exer) 
tbl                 # the contingency table 
       
        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7
ChiTest <- chisq.test(tbl) 
ChiTest

    Pearson's Chi-squared test

data:  tbl
X-squared = 5.4885, df = 6, p-value = 0.4828
# Observed counts
obs <- ChiTest$observed
obs
       
        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7
# Expected counts
exp <- round(ChiTest$expected,2)
exp
       
         Freq  None  Some
  Heavy  5.36  1.07  4.57
  Never 92.10 18.42 78.48
  Occas  9.26  1.85  7.89
  Regul  8.28  1.66  7.06
# (O-E)/sqrt(E)
res <- round(ChiTest$residuals, 3) 
res
       
          Freq   None   Some
  Heavy  0.708 -0.070 -0.734
  Never -0.531 -0.098  0.623
  Occas  0.901  0.844 -1.385
  Regul  0.249 -0.510 -0.022
# plot Pearson residuals
# library(corrplot)
# corrplot(res, is.cor = FALSE)

# Cell contribution to total Chi-square test statistic

# contrib <- 100*res^2/ChiTest$statistic
# round(contrib, 3)

# corrplot(contrib, is.cor = FALSE) # visualize contribution

The expected frequencies of some cells are less than 5, so the chi-squared approximation may be incorrect.

# We may merge two columns

ctbl = cbind(tbl[,"Freq"], tbl[,"None"] + tbl[,"Some"]) 
ctbl 
      [,1] [,2]
Heavy    7    4
Never   87  102
Occas   12    7
Regul    9    8
ChiTest_comb <- chisq.test(ctbl)
ChiTest_comb

    Pearson's Chi-squared test

data:  ctbl
X-squared = 3.2328, df = 3, p-value = 0.3571

Exercises

Is there any association between gender and how frequently students exercise?

rm(list = ls())
colnames(survey)
 [1] "Sex"    "Wr.Hnd" "NW.Hnd" "W.Hnd"  "Fold"   "Pulse"  "Clap"   "Exer"  
 [9] "Smoke"  "Height" "M.I"    "Age"   
tbl <- table(survey$Sex, survey$Exer)
tbl
        
         Freq None Some
  Female   49   11   58
  Male     65   13   40
chisq <- chisq.test(tbl)
chisq

    Pearson's Chi-squared test

data:  tbl
X-squared = 5.7184, df = 2, p-value = 0.05731
observed <- chisq$observed
observed
        
         Freq None Some
  Female   49   11   58
  Male     65   13   40
expected <- chisq$expected
expected ## all expected counts are at least 5, so the chi-squared approximation is adequate
        
         Freq None Some
  Female   57   12   49
  Male     57   12   49
residuals <- chisq$residuals
residuals
        
               Freq       None       Some
  Female -1.0596259 -0.2886751  1.2857143
  Male    1.0596259  0.2886751 -1.2857143

Distribution Fitting

Fitting Discrete (Poisson) Distribution

rm(list = ls())
set.seed(804)
x.poi<-rpois(n=200,lambda=2.5)
Observed frequency
tab.os<-table(x.poi) ## table with empirical(observed) frequencies 
tab.os
x.poi
 0  1  2  3  4  5  6  7  8 
14 48 55 41 25 11  3  2  1 
freq.os<-vector()
for(i in 1: length(tab.os)) {
  freq.os[i]<-tab.os[[i]] } ## vector of empirical frequencies
freq.os
[1] 14 48 55 41 25 11  3  2  1
Expected frequency
lambda.est<-mean(x.poi)## estimate of parameter lambda
lambda.est
[1] 2.38
freq.ex<-(dpois(0:max(x.poi),lambda=lambda.est)*200) ## vector of fitted (expected) frequencies
freq.ex
[1] 18.5101155 44.0540749 52.4243491 41.5899836 24.7460403 11.7791152  4.6723823
[8]  1.5886100  0.4726115
Calculation
a=((freq.os-freq.ex)^2)/freq.ex
a
[1] 1.098920309 0.353436656 0.126543821 0.008369340 0.002606298 0.051533620
[7] 0.598594574 0.106534476 0.588514396
Chisq <- sum(a)
Chisq
[1] 2.935053
Pvalue <- pchisq(Chisq, df=length(tab.os)-1-1,lower.tail = F)
Pvalue
[1] 0.8909446

Fitting Continuous (Gamma) Distribution

For a continuous variable, such as the gamma distribution in the following example, the parameters are estimated from the sample data:

rm(list = ls())
set.seed(804)
x.gam <- rgamma(200,rate=0.5,shape=3.5) ## sampling from a gamma distribution with rate=0.5 (scale parameter 2) and N1=3.5 (shape parameter)
Observed frequency
x.gam.cut<-cut(x.gam,breaks=5) ## binning data

tab.os <- table(x.gam.cut) ## binned data table
tab.os
x.gam.cut
(1.41,5.57] (5.57,9.71] (9.71,13.8]   (13.8,18]   (18,22.1] 
         82          80          26          11           1 
f.os<-vector()
for(i in 1:5) {
  f.os[i]<- tab.os[[i]] ## empirical frequencies
} 
f.os
[1] 82 80 26 11  1
Expected frequency

\(rate = \frac{mean}{variance}\) \(shape = \frac{{mean}^2}{variance}\)

Chi-square tests are always right tailed

mean.gam<-mean(x.gam) ## sample mean
var.gam<-var(x.gam) ## sample variance
l.est<-mean.gam/var.gam ## lambda estimate (corresponds to rate) 
a.est<-((mean.gam)^2)/var.gam ## alpha estimate (corresponds to shape)

E1=(pgamma(5.57,shape=a.est,rate=l.est)-pgamma(1.41,shape=a.est,rate=l.est))*200

E2=(pgamma(9.71,shape=a.est,rate=l.est)-pgamma(5.57,shape=a.est,rate=l.est))*200

E3=(pgamma(13.8,shape=a.est,rate=l.est)-pgamma(9.71,shape=a.est,rate=l.est))*200

E4=(pgamma(18,shape=a.est,rate=l.est)-pgamma(13.8,shape=a.est,rate=l.est))*200

E5=(pgamma(22.1,shape=a.est,rate=l.est)-pgamma(18,shape=a.est,rate=l.est))*200

f.ex<-c(E1,E2,E3,E4, E5) ## expected frequencies vector
Calculation
a <- ((f.os-f.ex)^2)/f.ex
a
[1] 0.75559597 0.02786197 1.00051444 0.79072188 0.28514740
chisq <- sum(a) ## chi-square statistic
chisq
[1] 2.859842
pvalue <- pchisq(q = chisq, df = 5-1-2, lower.tail = F)
pvalue
[1] 0.2393279

Exercises

Fitting a Bernoulli Distribution with parameter 0.7

rm(list = ls())
set.seed(804)
x <- rbinom(n = 400, size = 1, prob = 0.7)
tab.os <- table(x)
freq.os <- vector()
for (i in 1:length(tab.os)) {
  freq.os[i] <- tab.os[[i]]
}

freq.ex <- (dbinom(x = 0:max(x), size = 1, prob = 0.7)*400)

a <- ((freq.os - freq.ex)**2)/freq.ex

chisq <- sum(a)
chisq
[1] 0.01190476
pvalue <- pchisq(chisq, df = length(tab.os)-1, lower.tail = F) # always right tailed
pvalue
[1] 0.9131161

AST 232

Design

CRD-19

rm(list = ls())

obs <- c(7, 7, 15,15,11,
         12,17,12,18,18,
         14,18,18,19,19,
         19,25,22,19,23,
         7, 10,11,15,11)
trt <- as.factor(rep(LETTERS[1:5], each = 5))

data <- data.frame(obs, trt)

model <- aov(obs~trt)
summary(model)
            Df Sum Sq Mean Sq F value  Pr(>F)    
trt          4  418.6  104.66    11.6 4.9e-05 ***
Residuals   20  180.4    9.02                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Contrast

Summary

T test -> gmodels -> fit.contrast()  
Scheffe -> DescTools -> ScheffeTest()  
Tukey -> stats -> TukeyHSD()  

T test

# install.packages("gmodels")
library(gmodels)

cmat = rbind("i" = c(1,1,-1,-1,0),
             "ii" = c(3,-1,-1,-1,0))
fit.contrast(model = model, varname = trt, coeff = cmat, 
             conf.int = 0.95)
      Estimate Std. Error   t value     Pr(>|t|)  lower CI   upper CI
trti     -12.8   2.686261 -4.764987 0.0001182076 -18.40344  -7.196557
trtii    -21.6   4.652741 -4.642425 0.0001570703 -31.30545 -11.894552

Scheffe’s method

# install.packages("DescTools")
library(DescTools)
ScheffeTest(model, contrasts = t(cmat))

  Posthoc multiple comparisons of means: Scheffe Test 
    95% family-wise confidence level

$trt
         diff    lwr.ci    upr.ci   pval    
A,B-C,D -12.8 -21.89541 -3.704585 0.0032 ** 
A-B,C,D -21.6 -37.35372 -5.846279 0.0041 ** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tukey HSD

TukeyHSD(model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = obs ~ trt)

$trt
     diff         lwr       upr     p adj
B-A   4.4  -1.2839365 10.083937 0.1807540
C-A   6.6   0.9160635 12.283937 0.0180254
D-A  10.6   4.9160635 16.283937 0.0001614
E-A  -0.2  -5.8839365  5.483937 0.9999696
C-B   2.2  -3.4839365  7.883937 0.7741275
D-B   6.2   0.5160635 11.883937 0.0283527
E-B  -4.6 -10.2839365  1.083937 0.1502294
D-C   4.0  -1.6839365  9.683937 0.2561572
E-C  -6.8 -12.4839365 -1.116063 0.0143237
E-D -10.8 -16.4839365 -5.116063 0.0001280

RCBD

rm(list = ls())

obs <- c(25,35,21,37,40,52,39,58,27,34,20,31)
trt <- as.factor(rep(LETTERS[1:3], each = 4))
block <- as.factor(rep(c("NE", "NW", "SE", "SW"), times = 3))

data.frame(obs, trt, block)
   obs trt block
1   25   A    NE
2   35   A    NW
3   21   A    SE
4   37   A    SW
5   40   B    NE
6   52   B    NW
7   39   B    SE
8   58   B    SW
9   27   C    NE
10  34   C    NW
11  20   C    SE
12  31   C    SW
model <- aov(obs ~ block + trt)
summary(model)
            Df Sum Sq Mean Sq F value   Pr(>F)    
block        3  496.9   165.6   19.55 0.001686 ** 
trt          2  917.2   458.6   54.13 0.000145 ***
Residuals    6   50.8     8.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LSD (Latin Square Design)

rm(list = ls())

obs <- c(10,14,7,8,7,18,11,8,5,10,11,9,10,10,12,14)
row <- as.factor(rep(1:4, each = 4))
col <- as.factor(rep(1:4, times = 4))
trt <- as.factor(c(LETTERS[c(3,4,1,2,2,3,4,1,1,2,3,4,4,1,2,3)]))

data.frame(obs, row, col, trt)
   obs row col trt
1   10   1   1   C
2   14   1   2   D
3    7   1   3   A
4    8   1   4   B
5    7   2   1   B
6   18   2   2   C
7   11   2   3   D
8    8   2   4   A
9    5   3   1   A
10  10   3   2   B
11  11   3   3   C
12   9   3   4   D
13  10   4   1   D
14  10   4   2   A
15  12   4   3   B
16  14   4   4   C
model <- aov(obs~row+col+trt)
summary(model)
            Df Sum Sq Mean Sq F value  Pr(>F)   
row          3   18.5   6.167   3.524 0.08852 . 
col          3   51.5  17.167   9.810 0.00993 **
trt          3   72.5  24.167  13.810 0.00421 **
Residuals    6   10.5   1.750                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

BIBD

rm(list = ls())

obs <- c(13, 13, 15, 17, 
         15, 12, 18, 17, 
         16, 15, 14, 18, 
         10, 14, 16, 20, 
         16, 11, 12, 20)
trt <- as.factor(c(2:5, 1,2,4,5,1,3,4,5,1,2,3,4,1,2,3,5))
block <- as.factor(rep(1:5, each = 4))

data.frame(obs, trt, block)
   obs trt block
1   13   2     1
2   13   3     1
3   15   4     1
4   17   5     1
5   15   1     2
6   12   2     2
7   18   4     2
8   17   5     2
9   16   1     3
10  15   3     3
11  14   4     3
12  18   5     3
13  10   1     4
14  14   2     4
15  16   3     4
16  20   4     4
17  16   1     5
18  11   2     5
19  12   3     5
20  20   5     5
model <- aov(obs~block+trt)
summary(model)
            Df Sum Sq Mean Sq F value Pr(>F)  
block        4   4.30   1.075   0.185 0.9414  
trt          4  79.57  19.892   3.422 0.0474 *
Residuals   11  63.93   5.812                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Contrast

# install.packages("gmodels")
library(gmodels)

fit.contrast(model = model, varname = trt, coeff = c(1,0,0,0,-1), 
             conf.int = 0.95)
                      Estimate Std. Error   t value   Pr(>|t|)  lower CI
trt c=( 1 0 0 0 -1 ) -4.133333   1.760624 -2.347652 0.03864707 -8.008441
                       upper CI
trt c=( 1 0 0 0 -1 ) -0.2582253

Sampling

Stratified

Data Entry

rm(list = ls())

a <- c(35,43,36,39,28,28,29,25,38,27,26,32,29,40,35,41,37,31,45,34)
b <- c(27,15,4,41,49,25,10,30)
c <- c(8,14,12,15,30,32,21,20,34,7,11,24)
data <- list(a,b,c)

N <- c(155,62,93)
n <- c(20,8,12)
f <- n/N
w <- N/sum(N)

Estimates for mean
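
For reference, the code below implements the usual stratified estimators: \(\bar y_{st} = \sum_h W_h \bar y_h\), with estimated variance \(v(\bar y_{st}) = \sum_h W_h^2 (1 - f_h)\, s_h^2 / n_h\), where \(W_h = N_h/N\) and \(f_h = n_h/N_h\).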

y_bar <- rep(NA, 3)
for (i in 1:3) {
  y_bar[i] <- mean(data[[i]])
}
y_st_bar <- sum(y_bar * w)
y_st_bar
[1] 27.675
v_y_bar <- rep(NA, 3)
for (i in 1:3) {
  v_y_bar[i] <- (1-f[i])*(var(data[[i]])/n[i])
}
v_y_st_bar <- sum(v_y_bar * w**2)
v_y_st_bar
[1] 1.969519
std_error <- sqrt(v_y_st_bar)
std_error
[1] 1.403396

Estimates for total

y_st <- sum(N) * y_st_bar
y_st
[1] 8579.25
std_error_total <- sum(N) * std_error
std_error_total
[1] 435.0527

Estimates for proportion

a <- c(8,3,5)

p <- a/n

p_st <- sum(p*w)
p_st
[1] 0.4
v_p <- (1-f) * ((p*(1-p))/(n-1))

v_p_st <- sum(v_p * (w^2))
v_p_st
[1] 0.005648937

Assignment 01

rm(list = ls())
data <- c(0,1,1,2,5,4,7,7,8,6,
        6,8,9,10,13,12,15,16,16,17,
        18,19,20,20,24,23,25,28,29,27,
        26,30,31,31,33,32,35,37,38,38)
mat <- matrix(data, nrow = 4, byrow = T)

Systematic

k = 10
n = 4
N = 40
y_bar <- mean(data)
yi_bar <- apply(mat, 2, mean)

v_sy <- (1/k)* sum((yi_bar - y_bar)**2)
v_sy
[1] 11.62562

SRS

v_srs <- ((N-n)/N)*(var(data)/n)
v_srs
[1] 30.65639

Stratified

mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    1    1    2    5    4    7    7    8     6
[2,]    6    8    9   10   13   12   15   16   16    17
[3,]   18   19   20   20   24   23   25   28   29    27
[4,]   26   30   31   31   33   32   35   37   38    38
nh <- rep(1,4)
Nh <- rep(10,4)
fh <- nh/Nh
wh <- Nh/N

v_y_bar <- rep(NA, 4)
for (i in 1:4) {
  v_y_bar[i] <- (1-fh[i])*((var(mat[i,]))/nh[i])
}

v_y_st <- sum(v_y_bar * wh**2)
v_y_st
[1] 3.034375

Assignment 02

Ratio
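
For reference, the estimators implemented below are \(r = \frac{\bar y}{\bar x}\), \(\hat{\bar Y}_R = r\,\bar X\), \(\hat Y_R = N\,\hat{\bar Y}_R\), and \(v(r) = \frac{1-f}{n\,\bar X^2} \sum_{i=1}^{n} \frac{(y_i - r x_i)^2}{n-1}\), with \(v(\hat{\bar Y}_R) = \bar X^2\, v(r)\) and \(v(\hat Y_R) = N^2\, v(\hat{\bar Y}_R)\).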

rm(list = ls())
x<-c(76,138,67,29,381,23,37,120,61,387,
     2,507,179,121,50,44,77,64,64,56,
     243,87,30,71,256,43,25,94,43,298)
y<-c(80,143,70,50,464,48,63,115,69,459,
     50,634,260,113,64,58,79,63,77,142,
     291,105,111,79,288,61,57,85,50,317)
x_bar <- mean(x)
y_bar <- mean(y)
X <- 18528
X_BAR <- X/140
N <- 140
n <- 30
f <- n/N
# estimate of population ratio
r <- y_bar / x_bar
r
[1] 1.237408
# ratio estimate of population mean
y_r_bar <- r * X_BAR
y_r_bar
[1] 163.7621
# ratio estimate of population total
y_r <- y_r_bar * N
y_r
[1] 22926.7
# estimated variance of the population ratio
v_r <- ((1-f)/(n*X_BAR**2)) * sum(((y-r*x)**2)/(n-1))
v_r
[1] 0.001313316
# estimated variance of population mean
v_y_r_bar <- v_r * X_BAR**2
v_y_r_bar
[1] 23.00224
# estimated variance of population total
v_y_r <- v_y_r_bar * N**2
v_y_r
[1] 450843.9

AST 430

Required Packages

# Required Packages
library(tidyverse)
library(MASS)
library(psych)
library(cluster) # clustering algorithm
library(factoextra) # clustering visualization
library(GGally)
library(CCA)
library(CCP)
library(mvtnorm) # dmvnorm
library(ggcorrplot)
library("reshape2")

Restricted Maximum Likelihood (REML)

rm(list = ls())

set.seed(30)

y <- matrix(data = rnorm(n = 25, mean = 20, sd = 5),
            ncol = 1)
x <- matrix(data = rep(x = 1, times = 25),
            ncol = 1)
A <- diag(length(y)) - x %*% solve(t(x) %*% x) %*% t(x)

# residual likelihood
reml_lik <- function(par, y, A) {
  w <- t(A) %*% y
  mu <- rep(x = 0, times = length(w))
  cmat <- t(A) %*% (par * diag(length(y))) %*% A
  -sum(mvtnorm::dmvnorm(
    x = as.numeric(w),
    mean = mu,
    sigma = cmat,
    log = TRUE
  ))
}

# reml_lik(par = 25, y = y, A = A[, -length(y)])

optim(
  par = 25,
  f = reml_lik,
  y = y,
  A = A[, -length(y)],
)
$par
[1] 25.625

$value
[1] 71.36822

$counts
function gradient 
      18       NA 

$convergence
[1] 0

$message
NULL
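
As a sanity check (a known result, not part of the original code): for an intercept-only normal model the REML estimate of the variance is the usual unbiased sample variance, so the value returned by optim should be close to var(y).

# Closed-form REML estimate for the intercept-only model: the unbiased sample variance
var(as.numeric(y))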

Profile Likelihood

rm(list = ls())

# data
set.seed(30)

x <- round(rnorm(n = 50, mean = 35, sd = 4), 2)


# functionized method (optim is doing the work for us)

plik <- function(mu, data) {
  sig <- optim(
    par = 1,
    
    fn = function(sigma, data) {
      -sum(log(dnorm(data, mu, sigma)))
    },
    data = data
  )$par
  - sum(dnorm(data, mu, sig, log = TRUE))
}

optim(par = 15, plik, data = x)
$par
[1] 34.01953

$value
[1] 144.4455

$counts
function gradient 
      28       NA 

$convergence
[1] 0

$message
NULL
# more naive method (we are picking the value for which the function is maximized)

n <- 500 # number of values you want to check for
m <- seq(15, 50, length.out = n)

mplik <- NULL
for (i in 1:n) {
  mplik[i] = -plik(m[i], x)
}

mu_lik = as.data.frame(cbind(m, mplik)) %>%
  filter(mplik == max(mplik)) %>%
  dplyr::select(m) %>%
  as.numeric()

mu_lik
[1] 34.00802
# graphical approach

ggplot() +
  geom_line(aes(x = m, y = mplik)) +
  theme_bw() +
  xlab(expression(mu)) + ylab(expression(l(mu))) +
  geom_vline(aes(m, mplik), xintercept = mu_lik, color = "red")

Numerical Optimization

Newton’s Method (One dimensional)

theta0 <- 10
eps <- 10


while (eps > 0.001) {
  l1 <- 257.59 - 15 * theta0   # first derivative of the log-likelihood (hard-coded for the course example)
  l2 <- -15                    # second derivative
  theta1 <- theta0 - l1 / l2
  eps <- abs(theta1 - theta0)
  theta0 <- theta1
}

theta1
[1] 17.17267

Example 2

theta0 <- 2.5
eps <- 10


while (eps > 0.001) {
  l1 <- -1 * (15 / 2 * theta0) + 124.88 / (2 * theta0 ** 2)
  l2 <- 15 / (2 * theta0 ** 2) - (124.88 / theta0 ** 3)
  theta1 <- theta0 - l1 / l2
  eps <- abs(theta1 - theta0)
  theta0 <- theta1
}

theta1
[1] 2.026356

Newton-Raphson Algorithm

rm(list = ls())

## Function for score, U

U <- function(y, theta) {
  N <- length(y)
  u <- (-(2 * N) / theta) + ((2 * sum(y ^ 2)) / theta ^ 3)
  return(u)
}# end of function U


## Function for  U'

Uprime <- function(y, theta) {
  N <- length(y)
  uprime <- ((2 * N) / theta ^ 2) - ((2 * 3 * sum(y ^ 2)) / theta ^ 4)
  return(uprime)
}# end of function Uprime


#### Function for Newton-Raphson Iterative Method

NR <- function(y, theta.old) {
  ### Newton-Raphson Iterative Equation
  theta.new <- theta.old - U(y, theta.old) / Uprime(y, theta.old)
  
  i <- 1
  tol <- 0.0001
  
  while (abs(theta.new - theta.old) > tol) {
    cat("iteration", i, ":", "theta=", theta.new, "\n")
    theta.old <- theta.new
    theta.new <- theta.old - U(y, theta.old) / Uprime(y, theta.old)
    
    i <- i + 1
  } ## end of while loop
  
  return(theta.new)
  
} ## end of function NR


### Reading the data lifetime

dat <- read.csv("datasets/lifetimes.csv", header = FALSE)
# head(dat)
# str(dat)


### Initial value of theta is the mean of the data, ybar
(theta1 <- mean(dat$V1))
[1] 8805.694
NR(dat$V1, theta1)
iteration 1 : theta= 9633.777 
iteration 2 : theta= 9875.898 
iteration 3 : theta= 9892.11 
iteration 4 : theta= 9892.177 
[1] 9892.177

Expectation Maximization (EM) Algorithm

## Russell B. Millar; Example: 5.3; Page: 109


# Observed Data
y11 <- 10
y12 <- 5
y21 <- 20
y23 <- 30

# Initial guesses

r1 <- 1 / 3
r2 <- 1 - r1

c1 <- 1 / 3
c2 <- 1 / 6
c3 <- 1 - c1 - c2

for (k in 1:10) {
  # E Step
  p13 = r1 * c3
  p22 = r2 * c2
  
  y13 = 35 * p13 / (p13 + p22)
  y22 = 35 - y13
  
  
  # M Step
  r1 = (y11 + y12 + y13) / 100
  r2 = 1 - r1
  
  c1 = (y11 + y21) / 100
  c2 = (y12 + y22) / 100
  c3 = 1 - c1 - c2
  
  cat("\n",
      "Iter",
      k,
      ": P(Row 1) =",
      r1,
      ", P(Col 1) =",
      c1,
      ", P(Col 2) =",
      c2,
      sep = " ")
}

 Iter 1 : P(Row 1) = 0.36 , P(Col 1) = 0.3 , P(Col 2) = 0.19
 Iter 2 : P(Row 1) = 0.3605505 , P(Col 1) = 0.3 , P(Col 2) = 0.1894495
 Iter 3 : P(Row 1) = 0.3610844 , P(Col 1) = 0.3 , P(Col 2) = 0.1889156
 Iter 4 : P(Row 1) = 0.361602 , P(Col 1) = 0.3 , P(Col 2) = 0.188398
 Iter 5 : P(Row 1) = 0.3621036 , P(Col 1) = 0.3 , P(Col 2) = 0.1878964
 Iter 6 : P(Row 1) = 0.3625895 , P(Col 1) = 0.3 , P(Col 2) = 0.1874105
 Iter 7 : P(Row 1) = 0.3630601 , P(Col 1) = 0.3 , P(Col 2) = 0.1869399
 Iter 8 : P(Row 1) = 0.3635155 , P(Col 1) = 0.3 , P(Col 2) = 0.1864845
 Iter 9 : P(Row 1) = 0.3639562 , P(Col 1) = 0.3 , P(Col 2) = 0.1860438
 Iter 10 : P(Row 1) = 0.3643824 , P(Col 1) = 0.3 , P(Col 2) = 0.1856176

Factor Analysis

Fit a factor model with 3 factors using both PCA and MLE.
Then rotate the loading matrix.

crmat <- matrix(c(1,0.505,0.569,0.602,0.621,0.603,
                  0.505,1,0.422,0.467,0.482,0.450,
                  0.569,0.422,1,0.926,0.877,0.878,
                  0.602,0.467,0.926,1,0.874,0.894,
                  0.621,0.482,0.877,0.874,1,0.937,
                  0.603,0.450,0.878,0.894,0.937,1),
                nrow = 6, ncol = 6, byrow = TRUE)
crmat  ## correlation matrix
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] 1.000 0.505 0.569 0.602 0.621 0.603
[2,] 0.505 1.000 0.422 0.467 0.482 0.450
[3,] 0.569 0.422 1.000 0.926 0.877 0.878
[4,] 0.602 0.467 0.926 1.000 0.874 0.894
[5,] 0.621 0.482 0.877 0.874 1.000 0.937
[6,] 0.603 0.450 0.878 0.894 0.937 1.000
n <- 276 # nrow(data)
p <- 6 # total variables  # ncol(data)
m <- 3  # since 3 factors

PCA

# Eigen Values
lmda <- eigen(crmat)$values[1:m]
lmda
[1] 4.4564485 0.7824099 0.4584251
# Eigen Vectors
eL <- eigen(crmat)$vectors[, 1:m]
eL
           [,1]       [,2]        [,3]
[1,] -0.3507933  0.3956532  0.84666989
[2,] -0.2862040  0.8146406 -0.50202982
[3,] -0.4399784 -0.2632446 -0.11051604
[4,] -0.4468851 -0.1972029 -0.09896293
[5,] -0.4488806 -0.1614667 -0.06586761
[6,] -0.4474933 -0.2134503 -0.06906640
# Loadings
L <- matrix(NA, nrow = p, ncol = m)

for (i in 1:m) {
  L[, i] <- sqrt(lmda[i]) * eL[, i]
}


# Communalities
cm <- apply(X = L ** 2, MARGIN = 1, FUN = sum)
cm
[1] 0.9994939 0.9998164 0.9225019 0.9248977 0.9203343 0.9302392
# Specific Variance
sv <- 1 - cm
sv
[1] 0.0005060766 0.0001835913 0.0774981111 0.0751022503 0.0796656638
[6] 0.0697608285
# Proportion explained:
prop <- lmda / p
prop
[1] 0.74274142 0.13040165 0.07640418
# Cumulative proportion:
cumsum(prop)
[1] 0.7427414 0.8731431 0.9495472

So about 94.95% of the total variance is explained by the three factors.
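
As a quick check of fit, the correlations reproduced by the three-factor model, \(LL' + \Psi\), can be compared with the original matrix:

# Residual matrix: original correlations minus those reproduced by the 3-factor model
round(crmat - (L %*% t(L) + diag(sv)), 3)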

Rotation

varimax

varimax(L)
$loadings

Loadings:
     [,1]   [,2]   [,3]  
[1,] -0.354  0.244  0.903
[2,] -0.234  0.949  0.211
[3,] -0.921  0.166  0.218
[4,] -0.903  0.214  0.252
[5,] -0.887  0.229  0.284
[6,] -0.907  0.192  0.264

                [,1]  [,2]  [,3]
SS loadings    3.454 1.123 1.120
Proportion Var 0.576 0.187 0.187
Cumulative Var 0.576 0.763 0.950

$rotmat
          [,1]       [,2]       [,3]
[1,] 0.8546542 -0.3385584 -0.3936299
[2,] 0.4824845  0.7979214  0.3612896
[3,] 0.1917680 -0.4986980  0.8452960

MLE Method

factanal for factor analysis using MLE

# factanal performs MLE factor analysis on a covariance (or correlation) matrix supplied via `covmat`

fit <-
  factanal(
    covmat = crmat,
    n.obs = n,
    factors = m,
    rotation = "none"
  )
fit

Call:
factanal(factors = m, covmat = crmat, n.obs = n, rotation = "none")

Uniquenesses:
[1] 0.512 0.317 0.117 0.005 0.019 0.088

Loadings:
     Factor1 Factor2 Factor3
[1,]  0.622   0.284   0.143 
[2,]  0.484   0.660   0.121 
[3,]  0.937                 
[4,]  0.992          -0.105 
[5,]  0.920           0.367 
[6,]  0.926           0.232 

               Factor1 Factor2 Factor3
SS loadings      4.187   0.520   0.236
Proportion Var   0.698   0.087   0.039
Cumulative Var   0.698   0.784   0.824

The degrees of freedom for the model is 0 and the fit was 8e-04 
fit2 <-
  factanal(
    covmat = crmat,
    n.obs = n,
    factors = m,
    rotation = "varimax"
  )
fit2

Call:
factanal(factors = m, covmat = crmat, n.obs = n, rotation = "varimax")

Uniquenesses:
[1] 0.512 0.317 0.117 0.005 0.019 0.088

Loadings:
     Factor1 Factor2 Factor3
[1,]  0.466   0.502   0.139 
[2,]  0.208   0.798         
[3,]  0.889   0.289         
[4,]  0.935   0.345         
[5,]  0.823   0.359   0.418 
[6,]  0.851   0.323   0.288 

               Factor1 Factor2 Factor3
SS loadings      3.326   1.324   0.293
Proportion Var   0.554   0.221   0.049
Cumulative Var   0.554   0.775   0.824

The degrees of freedom for the model is 0 and the fit was 8e-04 

Canonical Correlation Analysis

mm <- read_csv("datasets/mmreg.csv")
# summary(mm)

psych <- mm[, 1:3]
acad <- mm[, 4:8]

# ggpairs(psych)
# ggpairs(acad)

CCA::matcor(psych, acad) ## correlation matrices within and between datasets
$Xcor
                 locus_of_control self_concept motivation
locus_of_control        1.0000000    0.1711878  0.2451323
self_concept            0.1711878    1.0000000  0.2885707
motivation              0.2451323    0.2885707  1.0000000

$Ycor
               read     write       math    science      female
read     1.00000000 0.6285909  0.6792757  0.6906929 -0.04174278
write    0.62859089 1.0000000  0.6326664  0.5691498  0.24433183
math     0.67927568 0.6326664  1.0000000  0.6495261 -0.04821830
science  0.69069291 0.5691498  0.6495261  1.0000000 -0.13818587
female  -0.04174278 0.2443318 -0.0482183 -0.1381859  1.00000000

$XYcor
                 locus_of_control self_concept motivation        read
locus_of_control        1.0000000   0.17118778 0.24513227  0.37356505
self_concept            0.1711878   1.00000000 0.28857075  0.06065584
motivation              0.2451323   0.28857075 1.00000000  0.21060992
read                    0.3735650   0.06065584 0.21060992  1.00000000
write                   0.3588768   0.01944856 0.25424818  0.62859089
math                    0.3372690   0.05359770 0.19501347  0.67927568
science                 0.3246269   0.06982633 0.11566948  0.69069291
female                  0.1134108  -0.12595132 0.09810277 -0.04174278
                      write       math     science      female
locus_of_control 0.35887684  0.3372690  0.32462694  0.11341075
self_concept     0.01944856  0.0535977  0.06982633 -0.12595132
motivation       0.25424818  0.1950135  0.11566948  0.09810277
read             0.62859089  0.6792757  0.69069291 -0.04174278
write            1.00000000  0.6326664  0.56914983  0.24433183
math             0.63266640  1.0000000  0.64952612 -0.04821830
science          0.56914983  0.6495261  1.00000000 -0.13818587
female           0.24433183 -0.0482183 -0.13818587  1.00000000
cc1 <- CCA::cc(psych, acad)

# display the canonical correlations
cc1$cor
[1] 0.4640861 0.1675092 0.1039911
# raw canonical coefficients
cc1[3:4]
$xcoef
                       [,1]       [,2]       [,3]
locus_of_control -1.2538339 -0.6214776 -0.6616896
self_concept      0.3513499 -1.1876866  0.8267210
motivation       -1.2624204  2.0272641  2.0002283

$ycoef
                [,1]         [,2]         [,3]
read    -0.044620600 -0.004910024  0.021380576
write   -0.035877112  0.042071478  0.091307329
math    -0.023417185  0.004229478  0.009398182
science -0.005025152 -0.085162184 -0.109835014
female  -0.632119234  1.084642326 -1.794647036
# compute canonical loadings

cc2 <- CCA::comput(psych, acad, cc1) # additional computation

# display canonical loadings
cc2[3:6]
$corr.X.xscores
                        [,1]       [,2]       [,3]
locus_of_control -0.90404631 -0.3896883 -0.1756227
self_concept     -0.02084327 -0.7087386  0.7051632
motivation       -0.56715106  0.3508882  0.7451289

$corr.Y.xscores
              [,1]        [,2]        [,3]
read    -0.3900402 -0.06010654  0.01407661
write   -0.4067914  0.01086075  0.02647207
math    -0.3545378 -0.04990916  0.01536585
science -0.3055607 -0.11336980 -0.02395489
female  -0.1689796  0.12645737 -0.05650916

$corr.X.yscores
                         [,1]        [,2]        [,3]
locus_of_control -0.419555307 -0.06527635 -0.01826320
self_concept     -0.009673069 -0.11872021  0.07333073
motivation       -0.263206910  0.05877699  0.07748681

$corr.Y.yscores
              [,1]        [,2]       [,3]
read    -0.8404480 -0.35882541  0.1353635
write   -0.8765429  0.06483674  0.2545608
math    -0.7639483 -0.29794884  0.1477611
science -0.6584139 -0.67679761 -0.2303551
female  -0.3641127  0.75492811 -0.5434036

Test

# tests of canonical dimensions
rho <- cc1$cor

## Define number of observations, number of variables in first set,
# and number of variables in the second set.
n <- nrow(psych) # same for both
p <- ncol(psych)
q <- ncol(acad)

## Calculate p-values using the F-approximations of different test statistics:

CCP::p.asym(rho = rho,N = n, p = p, q = q, tstat = "Wilks") 
Wilks' Lambda, using F-approximation (Rao's F):
              stat    approx df1      df2     p.value
1 to 3:  0.7543611 11.715733  15 1634.653 0.000000000
2 to 3:  0.9614300  2.944459   8 1186.000 0.002905057
3 to 3:  0.9891858  2.164612   3  594.000 0.091092180
# asymptotic tests for significance of canonical correlation coefficients


p.asym(rho, n, p, q, tstat = "Hotelling")
 Hotelling-Lawley Trace, using F-approximation:
               stat    approx df1  df2     p.value
1 to 3:  0.31429738 12.376333  15 1772 0.000000000
2 to 3:  0.03980175  2.948647   8 1778 0.002806614
3 to 3:  0.01093238  2.167041   3 1784 0.090013176
p.asym(rho, n, p, q, tstat = "Pillai")
 Pillai-Bartlett Trace, using F-approximation:
               stat    approx df1  df2     p.value
1 to 3:  0.25424936 11.000571  15 1782 0.000000000
2 to 3:  0.03887348  2.934093   8 1788 0.002932565
3 to 3:  0.01081416  2.163421   3 1794 0.090440474
p.asym(rho, n, p, q, tstat = "Roy")
 Roy's Largest Root, using F-approximation:
              stat   approx df1 df2 p.value
1 to 1:  0.2153759 32.61008   5 594       0

 F statistic for Roy's Greatest Root is an upper bound.

Standardization

# standardized psych canonical coefficients

s1 <- diag(sqrt(diag(cov(psych))))
s1 %*% cc1$xcoef
           [,1]       [,2]       [,3]
[1,] -0.8404196 -0.4165639 -0.4435172
[2,]  0.2478818 -0.8379278  0.5832620
[3,] -0.4326685  0.6948029  0.6855370
# standardized acad canonical coefficients

s2 <- diag(sqrt(diag(cov(acad))))
s2 %*% cc1$ycoef
            [,1]        [,2]        [,3]
[1,] -0.45080116 -0.04960589  0.21600760
[2,] -0.34895712  0.40920634  0.88809662
[3,] -0.22046662  0.03981942  0.08848141
[4,] -0.04877502 -0.82659938 -1.06607828
[5,] -0.31503962  0.54057096 -0.89442764

Discrimination and Classification

Example - 01

xb1 <- matrix(c(-0.0065,-0.0390), ncol=1)
xb2 <- matrix(c(-0.2483,0.0262), ncol=1)
Sp_inv <- matrix(c(131.158,-90.423,-90.423,108.147), nrow = 2, byrow = T)

# Discriminant Function

func_discrimination <- t(xb1- xb2) %*% Sp_inv

# Allocation Rule

val_compare <- (1/2) * t(xb1 - xb2) %*% Sp_inv %*% (xb1 + xb2)

# Data

x <- matrix(c(-0.210, -0.044), nrow = 2, byrow = TRUE)

# Comparison

val_test <- func_discrimination %*% x

result_classification <- !(val_test < val_compare)

So the statement “the observation comes from population 1” is FALSE.

Example - 02

rmowners <- readr::read_csv("datasets/rmowners.csv")

dat <- rmowners %>%
  group_split(Owner)

#non-owner
x_N <- dat[[1]] %>%
  dplyr::select(-Owner)

#owner
x_Y <- dat[[2]] %>%
  dplyr::select(-Owner)

x1_bar <- as.numeric(apply(x_Y, 2, mean))
x2_bar <- as.numeric(apply(x_N, 2, mean))
s_1 <- matrix(as.numeric(var(x_Y)), nrow = 2, byrow = T)
s_2 <- matrix(as.numeric(var(x_N)), nrow = 2, byrow = T)
n1 <- nrow(x_Y)
n2 <- nrow(x_N)
# Inverse of the pooled covariance matrix
spinv <- solve( ( (n1-1) * s_1 + (n2-1) * s_2 ) / (n1 + n2 -2) )

a <- t(x1_bar - x2_bar) %*% spinv

D2 <- 0.5 * t(x1_bar - x2_bar) %*% spinv %*% (x1_bar + x2_bar)  # midpoint (cut-off) of the discriminant scores
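
The fitted rule can be applied back to the data to see how it classifies each observation; a minimal sketch, assuming the two columns of rmowners other than Owner are exactly the predictors used above:

# Allocate an observation x0 to the owner group when a'x0 >= the midpoint D2
X <- as.matrix(rmowners %>% dplyr::select(-Owner))
scores <- X %*% t(a)                       # discriminant scores a'x0
predicted_owner <- scores >= as.numeric(D2)
table(Predicted = predicted_owner, Actual = rmowners$Owner)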

LDA & QDA

data('iris')

# str(iris)

# psych::pairs.panels(iris[1:4], 
#                    gap = 0,
#                    bg = c("red", "green", "blue")[iris$Species],
#                    pch = 21)

set.seed(123)

# head(iris)

# table(iris$Species)

# iris %>%  count(Species)

ind <- sample(x = 2,
              size = nrow(iris),
              replace = TRUE,
              prob = c(0.6, 0.4))

training <- iris[ind == 1,]
testing <- iris[ind == 2, ]

# LDA

linear <- MASS::lda(Species~., training)    # . means remaining all

# confusion matrix

p1 <- predict(linear, testing)$class

tab1 <- table(Predicted = p1, Actual = testing$Species)

# error rate
err_rate_lda <- (1-sum(diag(tab1))/sum(tab1))*100


# QDA

quad <- MASS::qda(Species~., training)

p2 <- predict(quad, testing)$class

tab2 <- table(Predicted = p2, Actual = testing$Species)

# error rate
err_rate_qda <- (1-sum(diag(tab2))/sum(tab2))*100


# Hold out procedure

cnt = 0

for(i in 1:nrow(iris)){
  training <- iris[-i, ]
  testing <- iris[i, ]
  linear <- lda(Species~., training)
  
  # confusion matrix
  p1 <- predict(linear, testing)$class
  
  if(p1 == testing$Species) cnt = cnt + 1
}

# Error rate
err_rate <- (1 - cnt / nrow(iris))* 100

## Assignment
# Perform this hold out procedure by taking 1 to 149 observations
# and plot the difference in error rate

Logistic Regression

data('iris')

data1 <- iris %>% filter(Species != "setosa")
table(data1$Species)

    setosa versicolor  virginica 
         0         50         50 
ind <- sample(x = 2,
              size = nrow(data1),
              replace = TRUE,
              prob = c(0.6,0.4))


training <-data1[ind==1,]
testing <-data1[ind==2,]



linear <- MASS::lda(Species~., training)
linear
Call:
lda(Species ~ ., data = training)

Prior probabilities of groups:
versicolor  virginica 
 0.5081967  0.4918033 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor     6.032258    2.803226     4.270968    1.329032
virginica      6.610000    2.976667     5.636667    1.973333

Coefficients of linear discriminants:
                   LD1
Sepal.Length -1.575896
Sepal.Width  -1.101478
Petal.Length  2.378646
Petal.Width   2.608424
#confusion matrix

p2 <- predict(linear, testing)$class

tab1 <- table(Predicted = p2, Actual = testing$Species)

# error rate
err_rate_lda <- (1-sum(diag(tab1))/sum(tab1))*100



# Logistic model, for comparison
model_logistic = glm(Species~., data=training ,family=binomial)

logistic_probs = data.frame(probs = predict(model_logistic, testing, type="response"))

predictions_logistic = logistic_probs %>%
  mutate(class = ifelse(probs>.5, "virginica", "versicolor"))

table(testing$Species, predictions_logistic$class)
            
             versicolor virginica
  setosa              0         0
  versicolor         15         4
  virginica           0        20

Cluster Analysis

df <- USArrests
df <- na.omit(df)
df <- scale(df)

head(df)
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207
# Distance / Dissimilarity matrix
d <- dist(df, method = "euclidean")

# Hierarchical clustering using complete linkage
hc1 <- hclust(d=d, method = "complete")

plot(hc1, cex = 0.6, hang = -1)
abline(h =1.5, col = "red")

# Agglomerative coefficient
cluster::coef.hclust(hc1)
[1] 0.8531583
# Compute with agnes: Agglomerative Nesting
hc2 <- cluster::agnes(df, method = "complete")

names(hc2)
[1] "order"     "height"    "ac"        "merge"     "diss"      "call"     
[7] "method"    "order.lab" "data"     
# Agglomerative coefficient
hc2$ac
[1] 0.8531583
# DI_visive ANA_lysis clustering

# compute divisive hierarchical clustering
hc4 <- cluster::diana(df)
names(hc4)
[1] "order"     "height"    "dc"        "merge"     "diss"      "call"     
[7] "order.lab" "data"     
hc4$dc
[1] 0.8514345
# plot(hc4, hang = -1)

cluster::pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

sub_grp <- cutree(hc4, k = 4) # one t  # cut a tree into groups of data

# Number of members in each cluster
table(sub_grp)
sub_grp
 1  2  3  4 
 7 13 17 13 
rect.hclust(hc4, k = 4, border = 2:5) # Draw Rectangles Around Hierarchical Clusters

factoextra::fviz_cluster(list(data =df, cluster = sub_grp))

K Means cluster

k2 <- kmeans(df, centers = 2, nstart = 23)

factoextra::fviz_cluster(k2, data = df)

df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()

k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = df) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
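
To help choose the number of clusters, factoextra's fviz_nbclust() plots the total within-cluster sum of squares (elbow method) or the average silhouette width against k:

# Elbow method: total within-cluster sum of squares against the number of clusters
factoextra::fviz_nbclust(df, kmeans, method = "wss")

# Average silhouette width against the number of clusters
factoextra::fviz_nbclust(df, kmeans, method = "silhouette")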

Markov Chain Monte Carlo (MCMC)

Monte Carlo Methods

  • Importance Sampling

  • Rejection Sampling

    • Using uniform proposal density

    • Using beta proposal density

  • Gibbs Sampling

  • Metropolis Hasting

Rejection Sampling

Using uniform proposal density

#number of simulation
nsim <- 2500

#generate thitaTilda from h
thitaTilda <- runif(nsim)

#generate U from U(0,1)
U <- runif(nsim)

#the value of M is 2.669
M <- 2.669

# calculate the quantity required to accept thitaTilda

quantity <- dbeta(thitaTilda, 2.7, 6.3)/(M*dunif(thitaTilda)) # dunif(x)=1 actually

#test if U<quantity
test <- (U<quantity)

#accept those thitaTilda for which (U<quantity)
theta <- thitaTilda*(U<quantity)
accepted.values <- theta[theta!=0]

#percent accepted
length(accepted.values)/nsim
[1] 0.3712
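
Where the bound M ≈ 2.669 comes from: since the uniform proposal density equals 1 on (0, 1), M is just the maximum of the Beta(2.7, 6.3) target density, which can be checked numerically:

# Maximum of the Beta(2.7, 6.3) density over (0, 1); this is the envelope constant M
optimize(f = function(x) dbeta(x, 2.7, 6.3), interval = c(0, 1), maximum = TRUE)$objective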

Using beta proposal density

#number of simulation
nsim <- 2500

#generate thitaTilda from h
thitaTilda <- rbeta(nsim, 2, 6)

#generate U from U(0,1)
U <- runif(nsim)

#the value of M is 1.67
M <- 1.67

# calculate the quantity required to accept thitaTilda
quantity <- dbeta(thitaTilda, 2.7, 6.3)/(M*dbeta(thitaTilda, 2, 6))

#test if U<quantity
test <- (U<quantity)

#accept those thitaTilda for which (U<quantity)
theta <- thitaTilda*(U<quantity)
accepted.values <- theta[theta!=0]

#percent accepted
length(accepted.values)/nsim
[1] 0.6044

Metropolis Hasting

#number of simulations
nsim <- 10000
theta <- vector()
theta[1] <- 1 #initialize theta value

for(i in 2:nsim){
  theta_tilda <- rnorm(1, mean=theta[i-1], sd=1) # proposal dist.
  
  alpha <- dgamma(theta_tilda, 5, 5)/dgamma(theta[i-1], 5, 5) # acceptance ratio (the normal random-walk proposal is symmetric, so its densities cancel)
  
  if(runif(1)< alpha){
    theta[i] <- theta_tilda
  } else{
    theta[i] <- theta[i-1]
  }
}#end of for loop

hist(theta)

Gibbs Sampling

N = 26000 # Total number of iterations
rho = 0.5

simx = simy = vector()
simx[1] = simy[1] = 0 # Initialise vectors

# Gibbs sampler
for(i in 2:N){
  simx[i] = rnorm(1,rho*simy[i-1],sqrt(1-rho^2))
  simy[i] = rnorm(1,rho*simx[i],sqrt(1-rho^2))
}

# burnin and thinning
burnin = 1000
thinning = 25

# Gibbs sample
sim = cbind(simx[seq(from=burnin,to=N,by=thinning)],simy[seq(from=burnin,to=N,by=thinning)])

# A diagnosis tool to assess the convergence of the sampler
plot(ts(sim[,1]))

plot(ts(sim[,2]))

# Plots of the marginals and the joint distribution

hist(sim[,1])

hist(sim[,2])

plot(sim,col=1:1000)

# plot(sim,type="l")

Importance Sampling

theta <- round(rnorm(25, mean=5, sd=sqrt(2.5)), 0)

weight <- dbinom(theta, size=10, prob=0.5)/dnorm(theta, mean=5, sd=sqrt(2.5))

imp.average <- sum(theta*weight)/sum(weight)  # self-normalised importance-sampling estimate of E[theta]

Gaussian Mixture Model

set.seed(1)

comp1.vals <- data_frame(comp = "A", 
                         vals = rnorm(50, mean = 1, sd = 0.5))
comp2.vals <- data_frame(comp = "B", 
                         vals = rnorm(50, mean = 1.5, sd = 0.5))

vals.df <- bind_rows(comp1.vals, comp2.vals)

vals.df %>%
  ggplot(aes(x = vals, y = "A", color = factor(comp))) +
  geom_point(alpha = 0.4) +
  scale_color_discrete(name = "Source of Data") +
  xlab("Values") +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "top")

wait <- faithful$waiting

wait.kmeans <- kmeans(wait, 2)
wait.kmeans.cluster <- wait.kmeans$cluster

wait.df <- data_frame(x = wait, cluster = wait.kmeans.cluster)

wait.df %>%
  mutate(num = row_number()) %>%
  ggplot(aes(y = num, x = x, color = factor(cluster))) +
  geom_point() +
  ylab("Values") +
  ylab("Data Point Number") +
  scale_color_discrete(name = "Cluster") +
  ggtitle("K-means Clustering")

wait.summary.df <- wait.df %>%
  group_by(cluster) %>%
  summarize(mu = mean(x), variance = var(x), std = sd(x), size = n())

wait.summary.df %>%
  dplyr::select(cluster, mu, variance, std)
# A tibble: 2 × 4
  cluster    mu variance   std
    <int> <dbl>    <dbl> <dbl>
1       1  54.8     34.8  5.90
2       2  80.3     31.7  5.63
wait.summary.df <- wait.summary.df %>%
  mutate(alpha = size / sum(size))

wait.summary.df %>%
  dplyr::select(cluster, size, alpha)
# A tibble: 2 × 3
  cluster  size alpha
    <int> <int> <dbl>
1       1   100 0.368
2       2   172 0.632
comp1.prod <- 
  dnorm(66, wait.summary.df$mu[1], wait.summary.df$std[1]) *
  wait.summary.df$alpha[1]


comp2.prod <- 
  dnorm(66, wait.summary.df$mu[2], wait.summary.df$std[2]) *
  wait.summary.df$alpha[2]

normalizer <- comp1.prod + comp2.prod
comp1.prod / normalizer
[1] 0.6926023
comp1.prod <- dnorm(x = wait, mean = wait.summary.df$mu[1], 
                    sd = wait.summary.df$std[1]) * wait.summary.df$alpha[1]

comp2.prod <- dnorm(x = wait, mean = wait.summary.df$mu[2], 
                    sd = wait.summary.df$std[2]) * wait.summary.df$alpha[2]

normalizer <- comp1.prod + comp2.prod

comp1.post <- comp1.prod / normalizer
comp2.post <- comp2.prod / normalizer


comp1.n <- sum(comp1.post)
comp2.n <- sum(comp2.post)

comp1.mu <- 1/comp1.n * sum(comp1.post * wait)
comp2.mu <- 1/comp2.n * sum(comp2.post * wait)

comp1.var <- sum(comp1.post * (wait - comp1.mu)^2) * 1/comp1.n
comp2.var <- sum(comp2.post * (wait - comp2.mu)^2) * 1/comp2.n

comp1.alpha <- comp1.n / length(wait)
comp2.alpha <- comp2.n / length(wait)

comp.params.df <- data.frame(comp = c("comp1", "comp2"),
                             comp.mu = c(comp1.mu, comp2.mu),
                             comp.var = c(comp1.var, comp2.var),
                             comp.alpha = c(comp1.alpha, comp2.alpha),
                             comp.cal = c("self", "self"))


# Already calculate component responsibilities for each data point from above
sum.of.comps <- comp1.prod + comp2.prod
sum.of.comps.ln <- log(sum.of.comps, base = exp(1))
sum(sum.of.comps.ln)
[1] -1034.246
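
The E-step and M-step above can be iterated until the log-likelihood stops improving. A minimal sketch (em_gmm2 is a hypothetical helper written for this note, not a library function), starting from the k-means based values computed above:

# Two-component Gaussian mixture fitted by repeating the E- and M-steps above
em_gmm2 <- function(x, mu, var, alpha, tol = 1e-6, maxit = 1000) {
  loglik_old <- -Inf
  for (it in 1:maxit) {
    # E-step: weighted densities and responsibilities at the current parameters
    d1 <- dnorm(x, mu[1], sqrt(var[1])) * alpha[1]
    d2 <- dnorm(x, mu[2], sqrt(var[2])) * alpha[2]
    post1 <- d1 / (d1 + d2)
    post2 <- 1 - post1
    # log-likelihood at the current parameters
    loglik <- sum(log(d1 + d2))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
    # M-step: update means, variances and mixing weights
    n1 <- sum(post1); n2 <- sum(post2)
    mu <- c(sum(post1 * x) / n1, sum(post2 * x) / n2)
    var <- c(sum(post1 * (x - mu[1])^2) / n1,
             sum(post2 * (x - mu[2])^2) / n2)
    alpha <- c(n1, n2) / length(x)
  }
  list(mu = mu, var = var, alpha = alpha, loglik = loglik, iterations = it)
}

# Starting values taken from the k-means summary computed above
em_gmm2(wait, mu = wait.summary.df$mu, var = wait.summary.df$variance,
        alpha = wait.summary.df$alpha)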

AST 431

Required Packages

library(tidyverse)
library(tseries)
library(xts)
library(smooth)
library(TTR)
library(forecast)
library(dLagM)
library(car)
library(plm)
library(systemfit)

Time Series

Intro to Time series

# create and plot a quarterly time series

y <- rnorm(24, 5, 1)

yquart <- ts(y, start = c(2000, 1), frequency = 4)

# plot(as.xts(yquart))
plot(yquart)

# Classical decomposition into trend, seasonality using moving average
decompose(yquart)
$x
         Qtr1     Qtr2     Qtr3     Qtr4
2000 5.046580 3.869614 5.576719 3.719251
2001 6.625447 4.499303 6.678297 4.587480
2002 4.027713 5.025383 5.027475 3.319817
2003 6.053751 3.880401 5.335617 5.494796
2004 5.138053 4.881208 5.197684 3.931307
2005 4.196787 3.886235 6.580092 6.497819

$seasonal
           Qtr1       Qtr2       Qtr3       Qtr4
2000  0.3703085 -0.4980842  0.7297874 -0.6020116
2001  0.3703085 -0.4980842  0.7297874 -0.6020116
2002  0.3703085 -0.4980842  0.7297874 -0.6020116
2003  0.3703085 -0.4980842  0.7297874 -0.6020116
2004  0.3703085 -0.4980842  0.7297874 -0.6020116
2005  0.3703085 -0.4980842  0.7297874 -0.6020116

$trend
         Qtr1     Qtr2     Qtr3     Qtr4
2000       NA       NA 4.750399 5.026469
2001 5.242877 5.489103 5.272915 5.013958
2002 4.873366 4.508555 4.603352 4.713484
2003 4.608879 4.919269 5.076679 5.087318
2004 5.195177 4.982499 4.669405 4.427375
2005 4.475804 4.969419       NA       NA

$random
            Qtr1        Qtr2        Qtr3        Qtr4
2000          NA          NA  0.09653205 -0.70520665
2001  1.01226151 -0.49171569  0.67559458  0.17553335
2002 -1.21596092  1.01491207 -0.30566393 -0.79165491
2003  1.07456356 -0.54078376 -0.47084909  1.00948988
2004 -0.42743256  0.39679307 -0.20150794  0.10594400
2005 -0.64932592 -0.58510001          NA          NA

$figure
[1]  0.3703085 -0.4980842  0.7297874 -0.6020116

$type
[1] "additive"

attr(,"class")
[1] "decomposed.ts"
plot(decompose(yquart))

Smoothing

yt <- ts(rnorm(10,5,3), start = 2010, frequency = 1)

# Simple Moving Average
yt2 <- TTR::SMA(yt, n =2)
plot(yt)
lines(yt2, col = "red")

Autocorrelation

# Manual calculation of the lag-1 autocorrelation
# (assumes a numeric series `dat`; `dat_centered` is its mean-centred version)
s1 <- dat[-1]               # drop the first observation
s2 <- dat[-length(dat)]     # drop the last observation
cor(s1, s2)

# with lag = 2
s1 <- dat_centered[-c(1, 2)]
s2 <- dat_centered[-c(length(dat_centered) - 1, length(dat_centered))]
cor(s1, s2)
# Autocorrelation function (ACF)
acf(yt)

# Partial Autocorrelation function (PACF)
pacf(yt)

Stationarity

# Augmented Dickey-Fuller test
tseries::adf.test(yt)

    Augmented Dickey-Fuller Test

data:  yt
Dickey-Fuller = -2.6008, Lag order = 2, p-value = 0.3435
alternative hypothesis: stationary

Modeling

# get the ARIMA model
forecast::auto.arima(yt)
Series: yt 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
        mean
      4.5163
s.e.  0.8660

sigma^2 = 8.332:  log likelihood = -24.26
AIC=52.53   AICc=54.24   BIC=53.13
# Differencing
yt_diff <- diff(yt, lag = 1)

# Fit the model
f1 <- arima(yt, order = c(0,0,0))
summary(f1)

Call:
arima(x = yt, order = c(0, 0, 0))

Coefficients:
      intercept
         4.5163
s.e.     0.8660

sigma^2 estimated as 7.499:  log likelihood = -24.26,  aic = 52.53

Training set error measures:
                       ME     RMSE      MAE     MPE     MAPE      MASE
Training set 9.325873e-16 2.738371 2.438409 -58.961 90.78016 0.5689473
                   ACF1
Training set -0.2980673
# Forecast using the model
#f1_forecast <- forecast(f1, h = 10)
#plot(f1_forecast)

# mean absolute error in forecasting for 10 period
#mean(abs(test - f1_forecast$mean)) 

In a Nutshell

# Single Exponential Smoothing (SES)
simple_exponential <- ses(y = y, h = 500, alpha = 0.2)
plot(simple_exponential)

h1 <- HoltWinters(y,
                  alpha = 0.2,
                  beta = F,
                  gamma = F) #no trend,no seasonality hence same as `ses`
h1
h2 <- HoltWinters(y,
                  alpha = 0.2,
                  beta = 0.5,
                  gamma = F) #only trend but no seasonality
h2
h3 <- HoltWinters(y,
                  alpha = 0.2,
                  beta = 0.5,
                  gamma = 0.3) #both trend and seasonality (y must be a ts with frequency > 1)

#Searching for potential smoother and model - 
ses_optimal <- ses(y)
ses_optimal$model
holts_optimal <- HoltWinters(y)
holts_optimal[c("alpha","beta","gamma")]
arima_optimal <- auto.arima(y)
arima_optimal


###Forecasting :
forecast(object = h1, h = 50)
forecast(object = arima_optimal, h = 50)


### Cross Validation :
#Splitting the data into train (90%) and test(10%) set
n = length(y) #Number of data values
train_data <- head(y, round(n * 0.90))
test_data <- tail(y, (n - length(train_data)))

#Exploring the training dataset to search for possible models:
acf(train_data)
pacf(train_data)
ndiffs(train_data)
auto.arima(train_data)

#Finding the ARIMA model that minimizes Mean Squared Error (MSE): fit each candidate model on the training data, forecast the test period, and select the model with the smallest test-set RMSE.
fit1 <- arima(train_data, order = c(2,1,0))
accuracy(forecast(fit1, h = length(test_data)), test_data)["Test set", "RMSE"]

fit2 <- arima(train_data, order = c(3,1,0))
accuracy(forecast(fit2, h = length(test_data)), test_data)["Test set", "RMSE"]

fit3 <- arima(train_data, order = c(4,1,0))
accuracy(forecast(fit3, h = length(test_data)), test_data)["Test set", "RMSE"]

Simulating a time series

times <- 1000

# white noise term
wt <- rnorm(times)

# Simulating AR(1) model
ar1 <- numeric(times)
ar1[1] <- wt[1]
for (i in 2:times){
  ar1[i] <- 0.8 * ar1[i-1] + wt[i]
}

ar1 <- ts(ar1)

# Using one single function

ar1 <- arima.sim(model = list(order = c(1, 0, 0), ar = 0.8), n = times)

# MA(1)
ma1 <- arima.sim(list(order = c(0, 0, 1), ma = -0.8), n = times)

# ARMA(1,1)
arma11 <- arima.sim(list(order = c(1, 0, 1), ar = 0.8, ma = -0.7), n = times)
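
The sample ACF and PACF of the simulated series should show the textbook signatures (for an AR(1), the ACF tails off while the PACF cuts off after lag 1; for an MA(1), the reverse):

# Empirical ACF and PACF of the simulated AR(1) series
acf(ar1)
pacf(ar1)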

Econometrics

Regression

# fitting the model (y, x and z are assumed to be numeric vectors already in the workspace)
modelA <- lm(y ~ x + z)
summary(modelA)

# residuals
resA <- residuals(modelA)

# Durbin-Watson test for autocorrelation in the residuals
car::durbinWatsonTest(modelA)

Panel Data

dat <- data.frame(country = factor(indiv), year, y, x1, x2)



# pooled OLS
ols <- lm(y ~ x1, data = dat)



# LSDV (adjusting for countries)
lsdv <- lm(y ~ x1 + country, data = dat)




# fixed effect estimator
# within model
mf <- plm::plm(y ~ x1, data = dat, model = "within", index = c("country", "year"))

plm::fixef(mf) # effects of countries

# testing for fixed effects, H0: OLS better than fixed
plm::pFtest(mf, ols)





# random effect
mr <- plm::plm(y ~ x1, data = dat, model = "random", index = c("country", "year"))

plm::ranef(mr) # effects of countries


sum(fixef(mf)) # not 0
sum(ranef(mr)) # close to 0





# Hausman specification test
# H_0: random effect (no covariance between X and alpha)
# H_1: fixed effect (covariance between X and alpha exists)
# if p‐value < 0.05, then use fixed effects model
plm::phtest(mf, mr)






# value of y_hat when x1 = 5, and country = 6

# for fixed effects model:
5 * coefficients(mf) + fixef(mf)[6]

# for random effects model:
coefficients(mr)[1] + 5 * coefficients(mr)[2] + ranef(mr)[6]

Simultaneous Equations Modeling

# lec 13-14

data("Kmenta")

# From the reduced form equation of price:
lm_p <- lm(price ~ income + farmPrice + trend, data = Kmenta)
Kmenta$phat <- lm_p$fitted.values

lm_qd <- lm(consump ~ phat + income, data = Kmenta)
lm_qd$coefficients
(Intercept)        phat      income 
 94.6333039  -0.2435565   0.3139918 
lm_qs <- lm(consump ~ phat + farmPrice + trend, data = Kmenta)
lm_qs$coefficients
(Intercept)        phat   farmPrice       trend 
 49.5324417   0.2400758   0.2556057   0.2529242 
# Using function (OLS method, using estimated price)

qd <- consump ~ phat + income # using fitted values of price
qs <- consump ~ phat + farmPrice + trend # using fitted values of price

sem <- list(demand = qd, supply = qs)
systemfit::systemfit(sem, method = "OLS", data = Kmenta)

systemfit results 
method: OLS 

Coefficients:
demand_(Intercept)        demand_phat      demand_income supply_(Intercept) 
         94.633304          -0.243557           0.313992          49.532442 
       supply_phat   supply_farmPrice       supply_trend 
          0.240076           0.255606           0.252924 
# Using instrumental variables, 2SLS method (no need to estimate price manually)

qd1 <- consump ~ price + income  # using actual values of price
qs1 <- consump ~ price + farmPrice + trend  # using actual values of price
sem <- list(demand = qd1, supply = qs1)
instrument <- ~income+ farmPrice + trend  # probably from the reduced form equation of price
systemfit(sem, method = "2SLS", inst = instrument, data = Kmenta)

systemfit results 
method: 2SLS 

Coefficients:
demand_(Intercept)       demand_price      demand_income supply_(Intercept) 
         94.633304          -0.243557           0.313992          49.532442 
      supply_price   supply_farmPrice       supply_trend 
          0.240076           0.255606           0.252924 

Distributed Lag Models

# Rolling correlation (used for deciding whether a distributed lag model is appropriate);
# dLagM::rolCorPlot() plots the correlation between y and x over rolling windows.
# Here seaL and temp are assumed to be the sea-level and temperature series.

# try several rolling-window widths, with the SD test for significance limits
rolCorPlot(y = seaL, x = temp, width = c(3, 4, 7, 11), SDtest = TRUE)

Modeling

mod1 <- dlm(GMSL ~ LandOcean, data = seaLevelTempSOI, q = 8) # q = lag length
summary(mod1)


mod2 <- dlm(GMSL ~ LandOcean, data = seaLevelTempSOI, q = 10)
summary(mod2)


rm <- list(LandOcean = c(0, 1, 2, 8, 9, 10)) # removing the insignificant terms
mod3 <- dlm(GMSL ~ LandOcean, data = seaLevelTempSOI, q = 10, remove = rm)
summary(mod3)


# more predictor:
mod4 <- dlm(GMSL ~ LandOcean + SOI, data = seaLevelTempSOI, q = 7)
summary(mod4)


rm2 <- list(LandOcean = 0:3, SOI = 1:7)
mod5 <- dlm(GMSL ~ LandOcean + SOI, data = seaLevelTempSOI, q = 7, remove = rm2)
summary(mod5)

Koyck Model

data("seaLevelTempSOI")
dat <- seaLevelTempSOI
colnames(dat)
[1] "GMSL"      "LandOcean" "SOI"      
m1 <- dLagM::koyckDlm(x = dat$GMSL, y = dat$LandOcean)
summary(m1)

Call:
"Y ~ (Intercept) + Y.1 + X.t"

Residuals:
      Min        1Q    Median        3Q       Max 
-0.457938 -0.057694  0.002804  0.055437  0.387750 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.304e-05  2.320e-03  -0.040    0.968    
Y.1          2.064e-01  2.451e-02   8.422   <2e-16 ***
X.t          1.082e-03  7.761e-04   1.394    0.164    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09264 on 1591 degrees of freedom
Multiple R-Squared: 0.04555,    Adjusted R-squared: 0.04435 
Wald test: 36.92 on 2 and 1591 DF,  p-value: < 2.2e-16 

Diagnostic tests:
NULL

                                 alpha        beta       phi
Geometric coefficients:  -0.0001172352 0.001081569 0.2063877

AR DLM

# AR DLM of order (p, q) with one predictor
# p = finite lag of the predictor, q = order of AR process, 

m2 <- dLagM::ardlDlm(GMSL ~ LandOcean, p = 5, q = 5, data = dat)
summary(m2)

Time series regression with "ts" data:
Start = 6, End = 1595

Call:
dynlm(formula = as.formula(model.text), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0530  -1.2446  -0.0263   1.1106   7.5420 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0001958  0.0488147   0.004 0.996800    
LandOcean.t  1.1036152  0.5541146   1.992 0.046579 *  
LandOcean.1  0.2428656  0.5500900   0.442 0.658910    
LandOcean.2 -1.4858504  0.5464113  -2.719 0.006614 ** 
LandOcean.3  0.4608764  0.5473941   0.842 0.399946    
LandOcean.4 -0.3554964  0.5491155  -0.647 0.517468    
LandOcean.5 -0.7130085  0.5534996  -1.288 0.197871    
GMSL.1       0.9709340  0.0249981  38.840  < 2e-16 ***
GMSL.2       0.0170635  0.0340200   0.502 0.616038    
GMSL.3      -0.7362338  0.0285341 -25.802  < 2e-16 ***
GMSL.4       0.3205406  0.0340105   9.425  < 2e-16 ***
GMSL.5      -0.0829308  0.0250027  -3.317 0.000931 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.946 on 1578 degrees of freedom
Multiple R-squared:  0.7727,    Adjusted R-squared:  0.7711 
F-statistic: 487.7 on 11 and 1578 DF,  p-value: < 2.2e-16