1 Introduction to R

R is a powerful programming language and software environment designed for statistical computing, data analysis, and graphical representation. It is widely used by statisticians, data analysts, and researchers due to its flexibility and extensive libraries.

1.1 What is R

R is a language and an environment in which you can perform statistical data analysis and graphics. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

  • It is relatively new and growing rapidly in use.
  • It is completely free software and comes with ABSOLUTELY NO WARRANTY.
  • It is easily downloaded and installed from the Internet.
  • It was initially written in the 1990s by Ross Ihaka and Robert Gentleman, and is now supported by an international R-core team (in addition to thousands of people who contribute to R).

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

1.3 R and Statistics

Our introduction to the R environment did not mention statistics, yet many people use R as a statistics system. We prefer to think of it as an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. There are about 25 packages supplied with R (called "standard" and "recommended" packages) and many more are available through the CRAN family of Internet sites (visit the R Project website) and elsewhere. More details on packages are given later. Most classical statistics and much of the latest methodology is available for use with R, but users may need to be prepared to do a little work to find it.

There is an important difference in philosophy between S (and hence R) and the other main statistical systems. In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.

1.4 What can R do?

  • Handle and store data
  • Perform mathematical operations on data points, vectors or matrices
  • Perform basic and advanced statistical analyses through either the preloaded functions or user-contributed functions (creating your own programs)
  • Be used as a simple and effective programming language, with conditionals, loops, if-else, etc. (plus any user-contributed or user-created functions)
  • Produce graphics through its graphical facilities

1.5 Starting with R

To start R in the Windows environment, double-click on the R icon. A new screen will appear containing one window. The last line to appear will be `>`, the standard prompt indicating that R is expecting a command.

The R Interface

1.5.1 On-line Help

The on-line help gives useful information. Getting used to it and understanding the help pages will make R easier to use. Start the HTML help system with help.start(); a keyword search is possible using its Search Engine and Keywords link.

To do a keyword search use the function apropos(). For example:

apropos("test")
##  [1] ".valueClassTest"      "ansari.test"          "bartlett.test"       
##  [4] "binom.test"           "Box.test"             "chisq.test"          
##  [7] "cor.test"             "file_test"            "fisher.test"         
## [10] "fligner.test"         "friedman.test"        "kruskal.test"        
## [13] "ks.test"              "mantelhaen.test"      "mauchly.test"        
## [16] "mcnemar.test"         "mood.test"            "oneway.test"         
## [19] "pairwise.prop.test"   "pairwise.t.test"      "pairwise.wilcox.test"
## [22] "poisson.test"         "power.anova.test"     "power.prop.test"     
## [25] "power.t.test"         "PP.test"              "prop.test"           
## [28] "prop.trend.test"      "quade.test"           "shapiro.test"        
## [31] "t.test"               "testInheritedMethods" "testVirtual"         
## [34] "var.test"             "wilcox.test"

Note that:
- You need to put the keyword in double quotes ("keyword").
- A lot of valuable information about R can be found through the on-line help system and the manuals on the R Project website.

Classwork

Use the help system to find help on the following topics.

  1. Analysis of Variance (ANOVA)
  2. Trigonometric Functions (e.g. sin, cos)
  3. Mean and median
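
One possible way to tackle these with the help system (a sketch; ?topic and help.search() are standard base R):

?aov                                  # 1. analysis of variance
help.search("analysis of variance")   # keyword search across installed packages
?sin                                  # 2. the help page covers sin, cos, tan, ...
?mean                                 # 3. mean ...
?median                               #    ... and median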

1.5.2 Mathematics in R

R can perform many types of mathematical manipulation. Simple arithmetic: whatever is typed at the prompt is evaluated, and the result is printed.

2+3                  # Addition
## [1] 5
2*4 + 7            # Multiplication and addition
## [1] 15
10/3                 # Division
## [1] 3.333333
5**3               # ** to the power of...
## [1] 125
5^3                 # or we can use ^
## [1] 125

Standard functions that are found on a scientific calculator are available in R, for example:

sqrt(2)                       # Square root
## [1] 1.414214

It also provides \(\pi\) as a constant. Note that sin(pi) below is not exactly zero: pi is stored with finite floating-point precision, so the result is of order \(10^{-16}\), which is effectively zero.

pi
## [1] 3.141593
sin(pi)
## [1] 1.224606e-16
cos(pi)
## [1] -1

Table 1: The Arithmetic Functions in R

Name                   Operation
sqrt                   square root
abs                    absolute value
sin, cos, tan          trigonometric functions (radians)
asin, acos, atan       inverse trigonometric functions
sinh, cosh, tanh       hyperbolic functions
asinh, acosh, atanh    inverse hyperbolic functions
exp, log               exponential and natural logarithm
log10                  common (base-10) logarithm
gamma, lgamma          gamma function and its natural log
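
A few of the functions from Table 1 tried at the prompt (base R; the outputs are deterministic):

log(100)   # natural logarithm
## [1] 4.60517
log10(100) # common (base-10) logarithm
## [1] 2
exp(1)     # the constant e
## [1] 2.718282
abs(-3.5)  # absolute value
## [1] 3.5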

Table 2: Mathematical and Logical Comparison Operators in R

Operation                   Operator
less than                   <
less than or equal to       <=
exactly equal to            ==
greater than or equal to    >=
greater than                >
not equal to                !=
or                          |
and                         &
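
Comparisons evaluate to logical values and can be combined with & and |, for example:

5 > 3 & 2 == 2 # TRUE AND TRUE
## [1] TRUE
5 < 3 | 2 != 2 # FALSE OR FALSE
## [1] FALSE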

Table 3: Functions that return a single value

Name        Operation
length(x)   the number of elements in x
sum(x)      the sum of the values of x
mean(x)     the mean of the values of x
var(x)      the variance of the values of x
sd(x)       the standard deviation of the values of x
min(x)      the minimum value of x
max(x)      the maximum value of x
prod(x)     the product of the values of x
range(x)    the range of the values of x

1.5.3 Storing values

A value can be stored in a named variable by assigning it with the <- or = symbols, for example:

x <- 5 # assigns 5 to x
x # type x to print value
## [1] 5
y = sqrt(9) # assign the square-root of 9 to y
y
## [1] 3

In fact you can make assignments using ->, for example:

7 -> z # assigns 7 to z
z
## [1] 7

This could be interpreted as z is assigned 7. Now it is possible to do arithmetic with x, y and z, for example:

(x * y) + z
## [1] 22
y^x - z + 6
## [1] 242
w <- x+y
w
## [1] 8

Note that: to delete an object from the workspace, use the rm() function. For example:

rm(w)
ls() # to list the objects (e.g., variables, data, functions) that you have created
## [1] "x" "y" "z"

1.5.4 Logical Values

R enables computation with Boolean or logical values, which take on either TRUE or FALSE. You can use conditional tests to generate these values:

x <- 32
x > 16 # Is x greater than 16?
## [1] TRUE
x <= 16 # Is x less than or equal to 16?
## [1] FALSE

Logical values can be stored in variables in the same way as numeric values:

tf <- x>16
tf
## [1] TRUE

1.6 Data Structure in R

1.6.1 Vectors

All the values presented so far have been scalars. R can also handle vectors, which combine several scalars in a single structure.

1.6.1.1 Creating Simple Vectors

The c(...) construct creates vectors; the name is short for "combine" (or "concatenate"), and it means "treat as a data set". To set up a vector named x, say, consisting of five numbers, namely 2, 4, 6, 10 and 11, use the R command:

x <- c(2,4,6,10,11)
x
## [1]  2  4  6 10 11

Vectors are the simplest type of object in R.

  • They can easily be created with c(), the combine function.
  • There are 3 main types of vectors:
    - Numeric vectors
    - Character vectors
    - Logical vectors

Functions that return vectors of the same length

  • To print the ordering of the values of x (the indices that would sort them; use rank() for the ranks themselves):
order(x)
## [1] 1 2 3 4 5
sort.list(x)
## [1] 1 2 3 4 5
  • To print the values of x in increasing order
sort(x)
## [1]  2  4  6 10 11
x[order(x)]
## [1]  2  4  6 10 11
x[sort.list(x)]
## [1]  2  4  6 10 11
  • To print the reciprocals of the values of x
1/x
## [1] 0.50000000 0.25000000 0.16666667 0.10000000 0.09090909

Numeric vector: a single entity consisting of an ordered collection of numbers.

Example: To set up a numeric vector x consisting of the 5 numbers 10, 6, 3, 6, 22, use any one of the following commands:

x<-c(10,6,3,6,22)#OR
x=c(10,6,3,6,22)#OR
assign("x",c(10,6,3,6,22))#OR
c(10,6,3,6,22)->x
y<-c(x,0,x) # combine x, a zero and x again into a longer vector y
x
## [1] 10  6  3  6 22

Character vectors: To set up a character/string vector z consisting of 3 place names use:

z<-c("Canberra","Sydney","Newcastle") #Or
z<-c("Canberra","Sydney","Newcastle")

Common useful escape sequences are:

  • "\n" for new line
  • "\t" for tab
  • "\b" for backspace

Example:

cat("Abebe","\n","zelalem","\n")
## Abebe 
##  zelalem

Indexing Vectors

  • Vector indices are placed within square brackets [ ].
  • Vectors can be indexed in any of the following ways:
    - Vector of positive integers
    - Vector of negative integers
    - Vector of named items
    - Logical vector
const=c(3.1416,2.7183,1.4142,1.6180)
names(const)=c("pi","euler","sqrt2","golden")
const[c(1,3,4)] # printing the 1st, 3rd and 4th elements of const.
##     pi  sqrt2 golden 
## 3.1416 1.4142 1.6180
const[c(-1,-2)]
##  sqrt2 golden 
## 1.4142 1.6180
const>2
##     pi  euler  sqrt2 golden 
##   TRUE   TRUE  FALSE  FALSE
const[const>2]
##     pi  euler 
## 3.1416 2.7183

1.6.1.2 Generating Sequences in vectors

1.6.1.2.1 Regular Sequences

R can be used to generate sequences as vectors. The notation a:b generates a vector containing the sequence of integers from a to b.

1:12
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
4:18
##  [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
-10:3
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
x<- 1:10 # storing sequence on variable x
x
##  [1]  1  2  3  4  5  6  7  8  9 10
y <- 40:1
y
##  [1] 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
## [26] 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

The last example (y) shows that if a vector is too long for one line it simply continues on the following line (using as many lines as are necessary).

The number in square brackets at the start of each output line shows the position at which that line begins (e.g., [26] means that the first value on that line is the 26th element).

  1. To generate a repeating sequence, use the rep() function, for example:
rep(c(1:4), 3) # repeat 1 2 3 4 three times
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4
r <- rep(c(1.2, 6.1, 3.2, 5.5), 2) # repeat vector 1.2, 6.1, 3.2, 5.5 two times
r
## [1] 1.2 6.1 3.2 5.5 1.2 6.1 3.2 5.5
  2. To generate a sequence of real numbers use the function seq(), for example:
seq(4, 6, .25) # generate a sequence from 4 to 6 in increments of 0.25
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
  • The following examples produce the same as above:
seq(length=9, from=4, to=6)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(from=4, to=6, by=0.25)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
  • To understand the seq() function type the following:
seq(7)
## [1] 1 2 3 4 5 6 7
seq(7, 10)
## [1]  7  8  9 10
seq(7, 10, 0.5)
## [1]  7.0  7.5  8.0  8.5  9.0  9.5 10.0
rep(seq(1, 3, 0.4), 2) # generate a sequence from 1 to 3 in steps of 0.4, then repeat it twice
##  [1] 1.0 1.4 1.8 2.2 2.6 3.0 1.0 1.4 1.8 2.2 2.6 3.0
1.6.1.2.2 Vector Arithmetic

We can manipulate vectors in much the same way as scalars. However, care must be taken when doing so, as the results may not always be the ones you expect (see the recycling sketch after the example below).

x <- 12:1 # x is a decreasing sequence from 12 to 1
x
##  [1] 12 11 10  9  8  7  6  5  4  3  2  1
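
A minimal sketch of element-wise arithmetic and R's recycling rule: the shorter vector is reused from the beginning, and a warning is issued when the longer length is not a multiple of the shorter one.

x + 1            # 1 is recycled across all 12 elements
##  [1] 13 12 11 10  9  8  7  6  5  4  3  2
x * c(1, 10)     # c(1, 10) is recycled six times
##  [1]  12 110  10  90   8  70   6  50   4  30   2  10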
1.6.1.2.2.1 Extracting values of a Vector

If we are interested in extracting only some values of the vector, then we can do this using square brackets [ ].

x<- c(1:10)**2 # generate the sequence 1 to 10 and square each element
x
##  [1]   1   4   9  16  25  36  49  64  81 100
x[6] # extracting the 6th value from vector x above
## [1] 36
x[2:4] # extracting the 2nd, 3rd and 4th values from x
## [1]  4  9 16
x[c(1,7,9)] # extracting the 1st, 7th and 9th values from x
## [1]  1 49 81
y <- x[c(1,5,8)] # assigning the 1st, 5th and 8th values of x to y.
y
## [1]  1 25 64
x[9:6] # extracting the 9th, 8th, 7th and 6th values from x
## [1] 81 64 49 36
x[c(1:3, 8:10)] # extracting the 1st, 2nd, 3rd and 8th, 9th, 10th values
## [1]   1   4   9  64  81 100
x[c(1,2,3,1,2,3,1,2,3,1,2,3)] # repetition of the index...
##  [1] 1 4 9 1 4 9 1 4 9 1 4 9
x[c(8,2,5,10)] # any order you please....
## [1]  64   4  25 100
x[x > 5] # Extract the value of x greater than 5
## [1]   9  16  25  36  49  64  81 100
1.6.1.2.3 Excluding values from a vector

If you use a negative subscript in your selection, the corresponding element is not included in the returned vector. For example:

x <- c(3,6,9,12,15,18,21)
x
## [1]  3  6  9 12 15 18 21
x[-4] # exclude the 4th element...
## [1]  3  6  9 15 18 21
x[c(-4,-6)] # exclude 4th and 6th elements...
## [1]  3  6  9 15 21

Note that:
- Vectors can only contain entries of the same type: numeric or character; don't mix them.
- Characters should be surrounded by quotes (" ").

Example:

x<-c("A","A","A","A","B","B","B")
x
## [1] "A" "A" "A" "A" "B" "B" "B"
1.6.1.2.4 More Functions on Vectors

There are functions that operate on vectors and return useful information:

x <- 5:14
length(x) # Number of elements in x
## [1] 10
max(x) # Largest value in x
## [1] 14
min(x) # Smallest value in x
## [1] 5
sum(x) # Sum of all the values in x
## [1] 95
prod(x) # The product of all the values in x
## [1] 3632428800
mean(x) # The mean of all the values in x
## [1] 9.5
range(x) # Range of vector x
## [1]  5 14
var(x) # The variance of x
## [1] 9.166667
sd(x) # The standard deviation of x
## [1] 3.02765
sqrt(var(x)) # The square root of the variance (i.e. stand. deviation)
## [1] 3.02765
  • Notice that range(x) produces a vector of length two, whereas the other functions produce a scalar. Arithmetic operations on logical values work with TRUE being equal to one and FALSE being equal to zero, so you can count the number of TRUE values in a vector using sum(), as shown below.
x <- 1:10
x>7
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
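
Counting the TRUE values from the comparison above (a one-line sketch):

sum(x > 7) # TRUE counts as 1, FALSE as 0
## [1] 3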

1.6.2 Matrix

R lets you store data in a two-dimensional matrix.

  • A matrix can be regarded as a generalization of a vector.
  • As with vectors, all the elements of a matrix must be of the same data type.
  • A matrix can be generated in two ways.
  • Method 1: Using the function dim():

Example

x <-c(1:8)
dim(x) <-c(2,4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

Method 2: Using the function matrix:

x <-matrix(c(1:8),2,4,byrow=F)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
w<-matrix(c(1,2,40,2,3,9,20,4,60),nrow=3) # create 3-row by 3-coln. matrix
w
##      [,1] [,2] [,3]
## [1,]    1    2   20
## [2,]    2    3    4
## [3,]   40    9   60
w <-matrix(c(1,2,40,2,3,9,20,4,60),3,3)# Similar to the above create 3 by 3
w
##      [,1] [,2] [,3]
## [1,]    1    2   20
## [2,]    2    3    4
## [3,]   40    9   60
x <- matrix(c(2,3,5,7,11,13),ncol=2) # create 3-row by 2-colm matrix
x
##      [,1] [,2]
## [1,]    2    7
## [2,]    3   11
## [3,]    5   13
  • You give matrix() a vector of the values, and you need to specify either ncol or nrow to tell the function the size of the matrix. Can you see the labels on the matrix rows and columns? These are like the labels you see when printing a vector. They also tell us how to extract parts of a matrix: use square brackets with two comma-separated indices:
x[2,1] # extract 2nd-row and 1st-coln. element of x
## [1] 3
x[2,2] # extract 2nd-row and 2nd-coln. element of matrix x
## [1] 11
  • If you leave out one number, you get the whole row or column:
x[,1]  # The first column
## [1] 2 3 5
x[3,]  # The third row
## [1]  5 13
  • Notice that returning a single row or column produces a vector. Be careful with a single vector index and no comma: the matrix is then treated as a vector in column-major order, so you get individual elements rather than a sub-matrix:
x[2:3] # the 2nd and 3rd elements, counting down the columns
## [1] 3 5
  • You can specify a vector for the row and column subscripts to get a piece of the original matrix (a sub-matrix sketch follows the examples below):
xy<- matrix(1:16,ncol=4)
xy
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
rowSums(xy) # the sum of each rows
## [1] 28 32 36 40
colSums(xy) # the sum of each columns
## [1] 10 26 42 58
mean(xy[, 2]) # the mean of column 2 of xy
## [1] 6.5
mean(xy[3, ]) # the mean of row 3 of xy
## [1] 9
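
The sub-matrix extraction itself, as a short sketch on xy:

xy[1:2, c(2, 4)] # rows 1-2, columns 2 and 4
##      [,1] [,2]
## [1,]    5   13
## [2,]    6   14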

1.6.2.1 Matrix Assignments and Construction

  • By using subscripts you can change values in a matrix:
mx <- matrix(seq(0, 95, length=20), ncol=5)
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   20   40   60   80
## [2,]    5   25   45   65   85
## [3,]   10   30   50   70   90
## [4,]   15   35   55   75   95
mx [3,1] <- -12 # replace 3rd-row and 1st-coln element by -12
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   20   40   60   80
## [2,]    5   25   45   65   85
## [3,]  -12   30   50   70   90
## [4,]   15   35   55   75   95
mx[3,] <- 0 # replace the whole 3rd row with 0
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   20   40   60   80
## [2,]    5   25   45   65   85
## [3,]    0    0    0    0    0
## [4,]   15   35   55   75   95
  • You can replace whole rows or columns too:
mx[,2] <- c(3, 6, 9, 12) #replace the 2nd column of mx by 3,6,9,12
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    3   40   60   80
## [2,]    5    6   45   65   85
## [3,]    0    9    0    0    0
## [4,]   15   12   55   75   95
mx[4,] <- c(-1, -2, -3, -4, -5) #replace 4th-row element of mx by -1, -2, -3, -4, -5
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    3   40   60   80
## [2,]    5    6   45   65   85
## [3,]    0    9    0    0    0
## [4,]   -1   -2   -3   -4   -5
  • As with vector arithmetic, the assigned value is recycled to fill out the matrix section if it is not long enough:
mx[,1] <- c(0.5, 0.9)
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   60   80
## [2,]  0.9    6   45   65   85
## [3,]  0.5    9    0    0    0
## [4,]  0.9   -2   -3   -4   -5
  • Extra rows and columns can be added to a matrix using the rbind() and cbind() functions respectively, for example:
cbind(mx, c(0.1, 0.2, 0.3, 0.4)) # Add an extra column
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  0.5    3   40   60   80  0.1
## [2,]  0.9    6   45   65   85  0.2
## [3,]  0.5    9    0    0    0  0.3
## [4,]  0.9   -2   -3   -4   -5  0.4
rbind(mx, 100:104) # Add an extra row.
##       [,1] [,2] [,3] [,4] [,5]
## [1,]   0.5    3   40   60   80
## [2,]   0.9    6   45   65   85
## [3,]   0.5    9    0    0    0
## [4,]   0.9   -2   -3   -4   -5
## [5,] 100.0  101  102  103  104
cbind(mx,mx) # Add the matrix to itself (column wise).
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  0.5    3   40   60   80  0.5    3   40   60    80
## [2,]  0.9    6   45   65   85  0.9    6   45   65    85
## [3,]  0.5    9    0    0    0  0.5    9    0    0     0
## [4,]  0.9   -2   -3   -4   -5  0.9   -2   -3   -4    -5
rbind(mx, mx) # Add the matrix to itself (row wise).
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   60   80
## [2,]  0.9    6   45   65   85
## [3,]  0.5    9    0    0    0
## [4,]  0.9   -2   -3   -4   -5
## [5,]  0.5    3   40   60   80
## [6,]  0.9    6   45   65   85
## [7,]  0.5    9    0    0    0
## [8,]  0.9   -2   -3   -4   -5
  • You can add rows or columns to the middle of a matrix with clever use of subscripts and rbind() or cbind(), respectively:
cbind(mx[,1:3], c(91,92,93,94), mx[,4])
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   91   60
## [2,]  0.9    6   45   92   65
## [3,]  0.5    9    0   93    0
## [4,]  0.9   -2   -3   94   -4
  • The function t() transposes a matrix (switches the rows and columns); a sketch of t(mx) follows the printout of mx below.
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   60   80
## [2,]  0.9    6   45   65   85
## [3,]  0.5    9    0    0    0
## [4,]  0.9   -2   -3   -4   -5
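
Applying t() to mx above turns the 4 by 5 matrix into a 5 by 4 one:

t(mx)
##      [,1] [,2] [,3] [,4]
## [1,]  0.5  0.9  0.5  0.9
## [2,]  3.0  6.0  9.0 -2.0
## [3,] 40.0 45.0  0.0 -3.0
## [4,] 60.0 65.0  0.0 -4.0
## [5,] 80.0 85.0  0.0 -5.0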

1.6.3 Arrays in R

  • Arrays can be considered as a multiply subscripted collection of data entries, for example numeric ones. Arrays are generalizations of vectors and matrices: vectors in the mathematical sense are one-dimensional arrays, whereas matrices are two-dimensional arrays; higher dimensions are also possible.
  • There are two methods of creating arrays in R (the second is shown after the example below).

Method 1: Using vectors

  • A vector can be used by R as an array only if it has a dimension vector as its dim attribute.
  • A dimension vector is a vector of non-negative integers. If its length is k then the array is k-dimensional.
  • An array can be created by giving a vector structure a dim attribute of the form c(dim1, dim2, ..., dimk).

Example
The following is a (3-dimensional) array with dimension vector c(3,5,100), built from a vector of 1500 elements.

z=c(1:1500)
dim(z) <- c(3,5,100)
dim(z)
## [1]   3   5 100
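
Method 2: Using the array() function, which builds the same structure directly from a data vector and a dim argument (a minimal sketch):

z2 <- array(1:1500, dim = c(3, 5, 100)) # same array as above, in one step
dim(z2)
## [1]   3   5 100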

1.6.4 Random Sequences

R can generate random sequences from a number of probability distributions. The general format for generating such sequences is rname(n, p1, p2, ...), where name is the R name of the distribution (see the table below), n is the number of values to generate, and p1, p2, ... are the parameters of the distribution.

For example:

rnorm(40, 6, 2.3) # generate 40 values from a Normal distribution with mean 6 and standard deviation 2.3
##  [1]  6.274633  6.500628  8.467002  5.960944  8.330951  3.512421  9.536310
##  [8]  7.411002  5.893920  1.106829  9.220178  0.556495  7.679172  8.611401
## [15]  5.508841  4.811146  6.164987  9.693829  6.602183  3.023027  5.845941
## [22]  7.063279  5.608085  4.731206 11.162613  5.818375  3.707433  4.898752
## [29]  4.860086  9.690570  7.435617  4.595275  4.993286  1.911186  2.810205
## [36]  8.683388  5.589493  4.526243  7.393666  3.915919
rpois(20, lambda=5) # 20 values from a Poisson distribution with lambda = 5
##  [1] 6 5 6 2 4 7 8 1 5 3 3 7 6 2 6 4 5 8 2 3
rbinom(20, size=18, prob=0.3) # 20 values from a Binomial distribution with size 18 and p = 0.3
##  [1] 7 6 9 6 6 8 4 5 7 5 2 5 7 8 6 5 5 5 6 6
runif(20, min=3, max=6) # 20 values from a Uniform distribution on (3, 6)
##  [1] 4.780706 5.720903 3.188569 4.749774 4.951848 5.387701 3.962943 5.800947
##  [9] 5.337865 4.133706 5.743204 3.846184 5.579625 3.219199 3.662900 3.795063
## [17] 4.003560 3.835260 3.522762 4.018478
rexp(20, rate=1.5) # 20 values from an Exponential distribution with rate = 1.5
##  [1] 1.00918117 0.58033463 0.27131303 0.48328905 0.44910690 0.35606058
##  [7] 0.21091749 0.20358100 0.30693818 0.44368296 0.40651114 0.90213626
## [13] 1.37793051 0.40301819 1.75014186 0.98725216 0.30260951 0.03899923
## [19] 0.03128597 1.90235076

Table: The distributions available for random generation, their R names and arguments:

Distribution         R Name     Arguments
Normal               norm       mean, sd
Student's t          t          df, ncp
F                    f          df1, df2, ncp
Exponential          exp        rate
Log-normal           lnorm      meanlog, sdlog
Beta                 beta       shape1, shape2, ncp
Binomial             binom      size, prob
Cauchy               cauchy     location, scale
Chi-squared          chisq      df, ncp
Gamma                gamma      shape, scale
Geometric            geom       prob
Hypergeometric       hyper      m, n, k
Logistic             logis      location, scale
Negative binomial    nbinom     size, prob
Poisson              pois       lambda
Signed rank          signrank   n
Uniform              unif       min, max
Weibull              weibull    shape, scale
Wilcoxon             wilcox     m, n

The function sample() generates random permutations of data or a random sample from a given vector. The first argument is a vector or a positive integer and the second the size of the sample.

For example:

sample(12) # Sampling without replacement.
##  [1]  7  5  4  6  2 11  3  8  9 10  1 12
sample(12, 5) # Sampling 5 values without replacement.
## [1]  5  2  3  1 12
sample(12, replace=TRUE) # Sampling with replacement.
##  [1]  9  3  6  8  3 11 10  2 10  9  8  1
sample(12, 8, replace=TRUE) # Sampling 8 values with replacement.
## [1]  2 12  6  3  6  9  1  3
sample(c("Heads", "Tails"), 6, replace=TRUE) # Simulating 6 coin tosses.
## [1] "Heads" "Heads" "Heads" "Tails" "Tails" "Tails"

1.6.5 Data frames in R

A data frame is a data structure which contains one or more vectors of equal length. It is very much like a matrix, except that it is designed for storing statistical or experimental data. Each row represents a unit, and each column a collection of measurements on the units. Each column can store a different type of data, such as numeric or character. Let us take an example:

Example

name= c("Omer", "Faduma", "Ibrahim","Jamal")
age=c(18,22,25,27)
sex=c("F","M","M","F")
stud=data.frame(name, age, sex)
stud
##      name age sex
## 1    Omer  18   F
## 2  Faduma  22   M
## 3 Ibrahim  25   M
## 4   Jamal  27   F
  • Using $ sign we can access each vector from the data frame. For example:
stud$name # extract only the names column from the data frame
## [1] "Omer"    "Faduma"  "Ibrahim" "Jamal"

1.7 Reading(Importing) Data

Large data objects will usually be read as values from external files rather than entered at the keyboard during an R session. So far, all data has been entered using the keyboard or generated using R functions. In reality, data is usually too large to type in and is stored in files. There are functions to read data from external files such as Excel, SPSS, Epi Info, Stata and so on.

In R you can import text files with the function read.table(); the functions read.csv() and read.delim() read comma-separated-values files and tab-delimited files respectively (both with a period as the decimal point).

R syntax used to read data:

  • read.csv() reads a comma-separated-values (CSV) file: read.csv(file.choose())

  • read.table() reads a tabular text file: read.table(file.choose(), header = TRUE)

  • read.delim() and read.delim2() read tab-delimited text files into a data frame (read.delim2() expects a comma as the decimal point): read.delim2(file.choose())

  • SPSS files can be read with read_sav() from the haven package (or read.spss() from foreign): read_sav(file.choose())

  • Stata files can be read with read_dta() from the haven package (or read.dta() from foreign): read_dta(file.choose())

  • SAS files (.sas7bdat) can be read with read_sas() from the haven package: read_sas(file.choose())

  • MINITAB portable files can be read with read.mtp() from the foreign package: read.mtp(file.choose())

  • Epi Info (.rec) files can be read with read.epiinfo() from the foreign package: read.epiinfo(file.choose())

Example

  • Read CSV data
# R <- read.csv(file.choose(), header = TRUE)
  • Read text data
# R <- read.table(file.choose(), header = TRUE)

1.8 Exporting data

In R, you can export data to various file formats such as CSV, Excel, Stata, SPSS and others using functions provided by different R packages. Here are some common examples of exporting data in R:

# Create a sample data frame for five Students
data <- data.frame(
  ID = 1:5,
  Name = c("Ali", "Bilan", "Cawale", "Deeqsi", "Omer"),
  Score = c(85, 90, 88, 92, 87))
data
##   ID   Name Score
## 1  1    Ali    85
## 2  2  Bilan    90
## 3  3 Cawale    88
## 4  4 Deeqsi    92
## 5  5   Omer    87
# Export data frame to a CSV file
write.csv(data, "data.csv", row.names = FALSE)
# Install and load the 'writexl' package for writing Excel files
#install.packages("writexl")
library(writexl)

# Export data frame to an Excel file
data1<-write_xlsx(data, "data1.xlsx")
data1
## [1] "D:\\Rpubs Projects\\Stat Comp II\\rmd\\data1.xlsx"

1.9 WEEK 1 Exercise

Instruction: submit this exercise as an R script only.

  1. Write the R code of the following equations

    1. \(\left(\frac{12^3 - 45}{4}\right) + 4 \times \left(\frac{72}{2.34} - 3\right)\)
    2. \(\log_e(72) \quad \text{or} \quad \ln(72)\)
    3. \(\log_{10}(72)\)
    4. \(e^{1.45} - 2.612\)
    5. \(\cos\left(\frac{\pi}{3}\right)\)
  2. Generate the following sequences

    1. 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0
    2. 10.0 9.8 9.6 9.4 9.2 9.0 8.8 8.6 8.4 8.2 8.0
  3. What is the meaning of rep(letters[1:5], 4)?

  4. Generate 80 values from a standard normal distribution.

  5. Generate 100 values from a binomial distribution of size 23 and probability 0.25.

  6. Generate 100 values from a log normal distribution with mean 12 and standard deviation 2.

  7. Create the following vector sequences:
    1 4 7 10 13 16 19 22 25 28
    1.0 1.2 1.4 1.6 1.8 2.0 1.0 1.2 1.4 1.6 1.8 2.0
    10.0 9.5 9.0 8.5 8.0 7.5 7.0
    1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
    4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
    1 4 9 16 25 36 49 64 81 100
    1 4 9 16 25 6 14 24 36 50

  8. For each of the sequences above calculate the mean, median, variance and the range.

  9. Generate 15 values from a uniform distribution between 0.7 and 1.

  10. Generate 15 values from a uniform distribution between 0 and 0.3.

  11. Generate 28 values from a uniform distribution between 1.5 and 7.5.

  12. Using the functions runif() and trunc() simulate 20 die throws.

  13. Generate 30 values from a Normal distribution with mean 3.4 and variance 5.25 rounded to 2 decimal places. Calculate the median, range, and the upper and lower quartiles.

  14. Generate 30 values from a Poisson distribution with lambda 5.

  15. Generate 20 values from a Binomial distribution of size 15 and probability of success 0.3.

  16. Generate a randomization list of size 150 for three treatment groups: “Placebo”, “Drug A” and “Drug B”.

  17. Generate 100 values from a standard normal distribution and count the number of values greater than 1.96. Extract these into a vector called signif.100. Do the same again but this time generate 2000 values.

Submission
To submit your exercise, please fill out the form and upload your R script file.

2 Frequency Tables

In this Section, we discuss how one can tabulate and visualize single variables.

Example
In a 2-week study of the productivity of workers, the following data were obtained on the total number of acceptable pieces that 100 workers produced:

A. Construct a grouped frequency distribution.
B. Construct a histogram, frequency polygon, and Ogives.

workers<-c(65,  36, 49, 88, 79, 56, 28, 43, 67, 36, 43, 78, 37, 40, 68, 72, 55, 62, 22, 82, 88, 50, 60, 56, 57, 46, 39, 57, 73, 65, 59, 48, 76, 74, 70, 51, 40, 75, 56, 45, 35, 62, 52, 63, 32, 80, 64, 53, 74, 34, 76, 60, 48, 55, 51, 54, 45, 44, 35, 51, 21, 35, 61, 45, 33, 61, 77, 60, 85, 68, 45, 53, 34, 67, 42, 69, 52, 68, 52, 47, 63, 65, 55, 61, 73, 50, 53, 59, 41, 54, 41, 74, 82, 58, 26, 35, 47, 50, 38, 70)
range(workers)
## [1] 21 88
factor.workers<-factor(cut(workers,breaks=nclass.Sturges(workers)))
table(factor.workers)
## factor.workers
## (20.9,29.4] (29.4,37.8] (37.8,46.1] (46.1,54.5] (54.5,62.9] (62.9,71.2] 
##           4          11          15          19          19          14 
## (71.2,79.6] (79.6,88.1] 
##          12           6
workers.dat <- as.data.frame(table(factor.workers))
workers.dat <- transform(workers.dat, cumFreq = cumsum(Freq), relative = prop.table(Freq))
workers.dat
##   factor.workers Freq cumFreq relative
## 1    (20.9,29.4]    4       4     0.04
## 2    (29.4,37.8]   11      15     0.11
## 3    (37.8,46.1]   15      30     0.15
## 4    (46.1,54.5]   19      49     0.19
## 5    (54.5,62.9]   19      68     0.19
## 6    (62.9,71.2]   14      82     0.14
## 7    (71.2,79.6]   12      94     0.12
## 8    (79.6,88.1]    6     100     0.06
hist(workers)

par(bg = "gray90")
plot(workers, type = "o", col = "blue", pch = 20)
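
For part B of the example, a sketch of the frequency polygon and ogive built from workers.dat (the class midpoints and upper boundaries below are approximated from the cut() intervals above):

mids <- seq(25.15, 83.85, length = 8)   # approximate class midpoints
upper <- seq(29.4, 88.1, length = 8)    # approximate upper class boundaries
plot(mids, workers.dat$Freq, type = "b", main = "Frequency polygon",
     xlab = "Pieces produced", ylab = "Frequency")
plot(upper, workers.dat$cumFreq, type = "b", main = "Less-than ogive",
     xlab = "Upper class boundary", ylab = "Cumulative frequency")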

Plot Types in R

The type argument in the plot() function determines the type of plot. Below is a list of the possible values for type and their explanations:

  • “p”: Points
  • “l”: Lines
  • “b”: Both (points + lines)
  • “c”: Lines without points
  • “o”: Both (points over lines)
  • “h”: Histogram-like vertical lines
  • “s”: Step function (horizontal)
  • “S”: Step function (vertical)
  • “n”: No plot

Example 2:

library(readr)
demo2 <- read_csv("C:/Users/abdik/Desktop/DMA/HU Data Management And Analysis/demo2.csv")
## Rows: 24 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): id, group, pulse, time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(demo2)
## # A tibble: 6 × 4
##      id group pulse  time
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     1    14     1
## 2     1     1    19     2
## 3     1     1    29     3
## 4     2     1    15     1
## 5     2     1    25     2
## 6     2     1    26     3
names(demo2)
## [1] "id"    "group" "pulse" "time"
factor.pulse<-factor(cut(demo2$pulse,breaks=nclass.Sturges(demo2$pulse)))
table(factor.pulse)
## factor.pulse
## (9.97,14.2] (14.2,18.3] (18.3,22.5] (22.5,26.7] (26.7,30.8]   (30.8,35] 
##           3           5           4           6           1           5

Passing nclass.Sturges() as the breaks argument of cut() takes care of choosing the number of bins according to Sturges' formula, which makes life a lot easier for us.

pluse.demo1 <- as.data.frame(table(factor.pulse))
pluse.demo1 <- transform(pluse.demo1, cumFreq = cumsum(Freq),relative = prop.table(Freq))
pluse.demo1
##   factor.pulse Freq cumFreq  relative
## 1  (9.97,14.2]    3       3 0.1250000
## 2  (14.2,18.3]    5       8 0.2083333
## 3  (18.3,22.5]    4      12 0.1666667
## 4  (22.5,26.7]    6      18 0.2500000
## 5  (26.7,30.8]    1      19 0.0416667
## 6    (30.8,35]    5      24 0.2083333

3 Descriptive Statistics

In an earlier section, we introduced the summary command. This produced a number of summary statistics that, at the time, made little sense. Now that we have discussed summary statistics in the lecture, it becomes possible to work with them. The current section describes a number of methods in R for obtaining summary statistics. In addition, we discuss how those statistics should be interpreted.

The functions included in this section perform descriptive statistics by quantitatively describing or summarizing different characteristics of a sample. Graphical methods are covered in the next section.

library(survival)
attach(cancer)
## The following objects are masked _by_ .GlobalEnv:
## 
##     age, sex
head(cancer)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
summary(cancer)
##       inst            time            status           age       
##  Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00  
##  1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00  
##  Median :11.00   Median : 255.5   Median :2.000   Median :63.00  
##  Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45  
##  3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00  
##  Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00  
##  NA's   :1                                                       
##       sex           ph.ecog          ph.karno        pat.karno     
##  Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00  
##  Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00  
##  Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00  
##  Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00  
##                  NA's   :1        NA's   :1        NA's   :3       
##     meal.cal         wt.loss       
##  Min.   :  96.0   Min.   :-24.000  
##  1st Qu.: 635.0   1st Qu.:  0.000  
##  Median : 975.0   Median :  7.000  
##  Mean   : 928.8   Mean   :  9.832  
##  3rd Qu.:1150.0   3rd Qu.: 15.750  
##  Max.   :2600.0   Max.   : 68.000  
##  NA's   :47       NA's   :14
str(cancer)
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...

In the cancer dataset, the sex variable is numeric but represents a categorical variable encoding the biological sex of the patient.

# Convert 'sex' variable to factor with levels 'male' and 'female'
cancer$sex <- factor(cancer$sex, levels = c(1, 2), labels = c("male", "female"))
head(cancer)
##   inst time status age  sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74 male       1       90       100     1175      NA
## 2    3  455      2  68 male       0       90        90     1225      15
## 3    3 1010      1  56 male       0       90        90       NA      15
## 4    5  210      2  57 male       1       90        60     1150      11
## 5    1  883      2  60 male       0      100        90       NA       0
## 6   12 1022      1  74 male       1       50        80      513       0

Remove NA values from the dataset

cancer1<- na.omit(cancer)
summary(cancer1)
##       inst            time            status           age            sex     
##  Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00   male  :103  
##  1st Qu.: 3.00   1st Qu.: 174.5   1st Qu.:1.000   1st Qu.:57.00   female: 64  
##  Median :11.00   Median : 268.0   Median :2.000   Median :64.00               
##  Mean   :10.71   Mean   : 309.9   Mean   :1.719   Mean   :62.57               
##  3rd Qu.:15.00   3rd Qu.: 419.5   3rd Qu.:2.000   3rd Qu.:70.00               
##  Max.   :32.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00               
##     ph.ecog          ph.karno        pat.karno         meal.cal     
##  Min.   :0.0000   Min.   : 50.00   Min.   : 30.00   Min.   :  96.0  
##  1st Qu.:0.0000   1st Qu.: 70.00   1st Qu.: 70.00   1st Qu.: 619.0  
##  Median :1.0000   Median : 80.00   Median : 80.00   Median : 975.0  
##  Mean   :0.9581   Mean   : 82.04   Mean   : 79.58   Mean   : 929.1  
##  3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00   3rd Qu.:1162.5  
##  Max.   :3.0000   Max.   :100.00   Max.   :100.00   Max.   :2600.0  
##     wt.loss       
##  Min.   :-24.000  
##  1st Qu.:  0.000  
##  Median :  7.000  
##  Mean   :  9.719  
##  3rd Qu.: 15.000  
##  Max.   : 68.000

Dimension of the dataset

dim(cancer1)
## [1] 167  10

After removing NA Values from the data.

print(head(cancer1))
##   inst time status age    sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 2    3  455      2  68   male       0       90        90     1225      15
## 4    5  210      2  57   male       1       90        60     1150      11
## 6   12 1022      1  74   male       1       50        80      513       0
## 7    7  310      2  68 female       2       70        60      384      10
## 8   11  361      2  71 female       2       60        80      538       1
## 9    1  218      2  53   male       1       70        80      825      16
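
With the cleaned data, a quick sketch of group-wise summaries using base R's tapply() (output omitted here; it depends on the rows that survive na.omit()):

# Mean age and mean weight loss by sex in the cleaned data
tapply(cancer1$age, cancer1$sex, mean)
tapply(cancer1$wt.loss, cancer1$sex, mean)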

3.1 Graphical Plotting in R

R graphics capabilities are very powerful and flexible: R can build graphics from the simplest to the most complex. Now that we have some data to work with, and we have learned about the data at the most basic level, our next task is to visualize it. Often, a proper visualization can illuminate features of the data that can inform further analysis. We will look at four methods of visualizing data that we will use throughout the course:

  • Histograms
  • Barplots
  • Boxplots
  • Scatterplots

3.1.1 Histograms

When visualizing a single numerical variable, a histogram will be our go-to tool; it can be created in R using the hist() function. (A strip plot is an alternative that looks best when the data set is not too large.) The histogram is the most popular graphical summary for numeric (quantitative) data.

A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. Then vertical bars are drawn over the bins with height proportional to the number of observations that fell into the bin. The scale on the y axis can be frequency, percentage, or density (relative frequency).

library(DAAG)
## 
## Attaching package: 'DAAG'
## The following object is masked from 'package:survival':
## 
##     lung
attach(ais)
## The following object is masked _by_ .GlobalEnv:
## 
##     sex
## The following object is masked from cancer:
## 
##     sex
head(ais)
##    rcc wcc   hc   hg ferr   bmi   ssf pcBfat   lbm    ht   wt sex  sport
## 1 3.96 7.5 37.5 12.3   60 20.56 109.1  19.75 63.32 195.9 78.9   f B_Ball
## 2 4.41 8.3 38.2 12.7   68 20.67 102.8  21.30 58.55 189.7 74.4   f B_Ball
## 3 4.14 5.0 36.4 11.6   21 21.86 104.6  19.88 55.36 177.8 69.1   f B_Ball
## 4 4.11 5.3 37.3 12.6   69 21.88 126.4  23.66 57.18 185.0 74.9   f B_Ball
## 5 4.45 6.8 41.5 14.0   29 18.96  80.3  17.64 53.20 184.6 64.6   f B_Ball
## 6 4.10 4.4 37.4 12.5   42 21.04  75.2  15.58 53.77 174.0 63.7   f B_Ball
summary(ais)
##       rcc             wcc               hc              hg       
##  Min.   :3.800   Min.   : 3.300   Min.   :35.90   Min.   :11.60  
##  1st Qu.:4.372   1st Qu.: 5.900   1st Qu.:40.60   1st Qu.:13.50  
##  Median :4.755   Median : 6.850   Median :43.50   Median :14.70  
##  Mean   :4.719   Mean   : 7.109   Mean   :43.09   Mean   :14.57  
##  3rd Qu.:5.030   3rd Qu.: 8.275   3rd Qu.:45.58   3rd Qu.:15.57  
##  Max.   :6.720   Max.   :14.300   Max.   :59.70   Max.   :19.20  
##                                                                  
##       ferr             bmi             ssf             pcBfat      
##  Min.   :  8.00   Min.   :16.75   Min.   : 28.00   Min.   : 5.630  
##  1st Qu.: 41.25   1st Qu.:21.08   1st Qu.: 43.85   1st Qu.: 8.545  
##  Median : 65.50   Median :22.72   Median : 58.60   Median :11.650  
##  Mean   : 76.88   Mean   :22.96   Mean   : 69.02   Mean   :13.507  
##  3rd Qu.: 97.00   3rd Qu.:24.46   3rd Qu.: 90.35   3rd Qu.:18.080  
##  Max.   :234.00   Max.   :34.42   Max.   :200.80   Max.   :35.520  
##                                                                    
##       lbm               ht              wt         sex         sport   
##  Min.   : 34.36   Min.   :148.9   Min.   : 37.80   f:100   Row    :37  
##  1st Qu.: 54.67   1st Qu.:174.0   1st Qu.: 66.53   m:102   T_400m :29  
##  Median : 63.03   Median :179.7   Median : 74.40           B_Ball :25  
##  Mean   : 64.87   Mean   :180.1   Mean   : 75.01           Netball:23  
##  3rd Qu.: 74.75   3rd Qu.:186.2   3rd Qu.: 84.12           Swim   :22  
##  Max.   :106.00   Max.   :209.4   Max.   :123.20           Field  :19  
##                                                            (Other):47
hist(ais$rcc, main = "Histogram of Red blood cell count")

Note: The graph obtained depends on the bins chosen (see the example below). The best choice for a given data set is the one which illuminates the data set’s underlying structure (if any). In R there are algorithms to automatically choose bins that are likely to display well, often the default bins do a good job. This is not always the case, however, and it is better to investigate many bin choices to test the stability of the display.

par(mfrow=c(1,2))
hist(ais$rcc, breaks = 8, main = "Histogram of Red blood cell count")
hist(ais$rcc, breaks = 20, main = " ")

3.1.2 Density Plots

Density plots are smoothed, generalized versions of histograms.

## Density plot
plot(density(ais$ht))     # density for height

3.1.3 Barplots

Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.

barplot(table(ais$sport), beside = TRUE, main = "Sport type")

3.1.4 Boxplots

To visualize the relationship between a numerical and a categorical variable, we will use a boxplot. For example, in the mpg data set (from the ggplot2 package), the drv variable takes a small, finite number of values: a car can only be front-wheel drive, 4-wheel drive, or rear-wheel drive.

  • A boxplot is constructed by drawing a box alongside the data axis with sides located at the upper and lower hinges. A line is drawn parallel to the sides to denote the sample median. Lastly, whiskers are extended from the sides of the box to the maximum and minimum data values (more precisely, to the most extreme values that are not potential outliers).

  • Boxplots are good for quick visual summaries of data sets, and the relative positions of the values in the five numbers summary (minimum, first quartile, median, third quartile, maximum), indicating the underlying shape of the data distribution.

  • They can be a handy alternative to a stripchart when the sample size is large.

  • The greatest advantage of a boxplot is that it can help to objectively identify extreme observations in the data set as described in the next section.

Boxplots are also good because one can visually assess multiple features of the data set simultaneously:

Center: can be estimated by the sample median.

Spread: can be judged by the width of the box (IQR = third quartile minus first quartile).

Shape: is indicated by the relative lengths of the whiskers and the position of the median inside the box. Boxes with unbalanced whiskers indicate skewness in the direction of the long whisker. Skewed distributions often have the median tending in the opposite direction of skewness. Kurtosis can be assessed using the box and whiskers: a wide box with short whiskers will tend to be platykurtic, while a skinny box with long whiskers indicates a leptokurtic distribution. Extreme observations are identified with open circles.

boxplot(ais)

3.1.4.1 Outliers

A potential outlier is any observation that falls beyond 1.5 times the width of the box on either side, that is, any observation less than (first quartile – 1.5*IQR) or greater than (third quartile + 1.5*IQR). A suspected outlier is any observation that falls beyond 3 times the width of the box on either side. In R, both potential and suspected outliers (if present) are denoted by open circles; there is no distinction between the two.

boxplot(lbm~sex, data = ais, main = "Boxplot of Lean body mass (kg)", xlab = "Sex")

boxplot(lbm~sport, data = ais, xlab = "Sport types")

3.1.5 Scatterplots

Lastly, to visualize the relationship between two numeric variables we will use a scatterplot. This can be done with the plot() function and the ~ syntax we just used with a boxplot. (The function plot() can also be used more generally; see the documentation for details.)
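
For example, on the ais data (a sketch; in ais, height ht is recorded in cm and weight wt in kg):

plot(ht ~ wt, data = ais, main = "Height versus Weight",
     xlab = "Weight (kg)", ylab = "Height (cm)")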

3.1.6 Multivariate data plots

Density plots

library(lattice)
densityplot(~lbm | sport, data = ais, groups = sex, scales = "free", plot.points = FALSE, auto.key = TRUE, main = "Density plots of Lean body mass by Sport type and Sex", xlab = "Lean body mass")

Scatterplots

xyplot(ht ~ wt | sport, data = ais, main = "Scatterplots of Height versus Weight by Sport type", xlab = "Weight")

Scatterplot matrix

splom(ais[,2:5], main = "Scatterplot matrix of four variables in the ais data")

4 Writing your own functions

One of the most powerful features of R is the ability to write your own functions. These functions may help pre-process data or implement specific computational algorithms. In this class we will introduce the R function syntax and in particular the passing of variables into the function, and argument handling.

If you find you have a process that you use a lot, perhaps one that calls several R functions, you can write your own function to do it. Probably one of the most powerful aspects of the R language is the ability of a user to write functions. When doing so, many computations can be incorporated into a single function and (unlike with scripts) intermediate variables used during the computation are local to the function and are not saved to the workspace. In addition, functions allow for input values used during a computation that can be changed when the function is executed.

  • Conditional execution: if(), if()else and ifelse()

Syntax:

  • if(condition) {commands1}

  • if(condition) {commands1} else {commands2}; condition must evaluate to a single logical value (i.e., either TRUE or FALSE)

  • ifelse(condition vector, yes vector, no vector)

if Condition in R

The statement is carried out only if the condition evaluates to TRUE. R makes it even easier than some languages: there is no then keyword; you simply put your choice in braces after the if condition.

if (test_expression) {
statement
}

Example:

values <- 1:10
x <- sample(values, 1)  # draw once and store it, so we test and print the same value
if (x <= 10) {
  print(paste(x, "is less than or equal to 10"))
}
## [1] "10 is less than or equal to 10"

Example 2:
Write a function that calculate arithmetic mean for grouped data. Group midpoints are (10, 20, 30, 40) and Frequencies for each group (5, 10, 8, 7)

grouped_mean <- function(groups, frequencies) {
  if (length(groups) != length(frequencies)) {
    stop("Lengths of 'groups' and 'frequencies' should be the same.")
  }
  total_sum <- sum(groups * frequencies)  # sum of midpoint * frequency
  total_freq <- sum(frequencies)          # total number of observations
  mean <- total_sum / total_freq
  return(mean)
}
groups <- c(10, 20, 30, 40) # Group midpoints
frequencies <- c(5, 10, 8, 7) # Frequencies for each group
result <- grouped_mean(groups, frequencies)
print(result)
## [1] 25.66667

if-else Condition in R: An if...else statement contains the same elements as an if statement (see the preceding section), with some extra elements:

  • The keyword else, placed after the first code block.
  • The second block of code, contained within braces, that has to be carried out, only if the result of the condition in the if() statement is FALSE.

if (test_expression) {
statement
} else {
statement
}

x <- 9
if (x > 0) sqrt(x) else sqrt(-x)
## [1] 3
ifelse(x >= 0, sqrt(x), sqrt(-x))
## [1] 3
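
Unlike if(), ifelse() is vectorized. Note (a standard R behaviour worth flagging) that both the yes and no expressions are evaluated for the whole vector, which is why the warning below appears even though the result is correct:

xv <- c(9, -16, 25)
ifelse(xv >= 0, sqrt(xv), sqrt(-xv)) # sqrt(xv) is evaluated for all elements, including -16
## Warning in sqrt(xv): NaNs produced
## [1] 3 4 5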
before<-c(26.4,28.2,22.1,24.5,21.4,27.3)
after<-c(33.9,37.1,36.1,39.5,30.0,31.4)
df<-data.frame(before,after)
df
before after
26.4 33.9
28.2 37.1
22.1 36.1
24.5 39.5
21.4 30.0
27.3 31.4
t_test<-t.test(df)
t_test
## 
##  One Sample t-test
## 
## data:  df
## t = 17.469, df = 11, p-value = 2.27e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  26.06716 33.58284
## sample estimates:
## mean of x 
##    29.825
pvalue <- t_test$p.value
if (pvalue < 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}
## [1] "Reject the null hypothesis"

Loops: for (), while () and repeat ()

Syntax:

  • for ( var in set ) {commands}

  • while ( condition ) {commands}

  • repeat {commands}

Where

  1. the object set is a vector,

  2. commands is a single command or a sequence of commands and

  3. var is a variable which may be used in commands.
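
A minimal sketch of the three loop types; each prints the squares of 1 to 3:

for (i in 1:3) print(i^2)                            # for: iterate over a set
## [1] 1
## [1] 4
## [1] 9
i <- 1
while (i <= 3) { print(i^2); i <- i + 1 }            # while: repeat while the condition holds
## [1] 1
## [1] 4
## [1] 9
i <- 1
repeat { print(i^2); i <- i + 1; if (i > 3) break }  # repeat: loop until break
## [1] 1
## [1] 4
## [1] 9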

Exercise

  1. Write a function that calculate arithmetic mean for grouped data

  2. Consider a function that returns the maximum of two scalars or the statement that they are equal.

  3. Write a function that computes the zero(s) of a quadratic equation.

  4. Write a function that returns the mean of the values in a vector that are greater than the median. Use the median() function. Call your function gtmean.
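
As a hint for exercise 4, one possible solution sketch:

gtmean <- function(v) {
  mean(v[v > median(v)])  # mean of the elements above the median
}
gtmean(c(1, 2, 3, 10, 20))
## [1] 15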

5 Probability Distributions

In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).

A probability distribution describes how the probabilities of outcomes are distributed for a random variable. It provides a mathematical framework to represent and analyze uncertainty in experiments and data. Probability distributions are categorized into Discrete and Continuous types based on the nature of the random variable.

  1. Random Variable:
  • A variable whose value is determined by the outcome of a random experiment.

    - Discrete Random Variable: Takes on countable values (e.g., number of heads in a coin toss).
    - Continuous Random Variable: Takes on any value within a range (e.g., height, weight).
  2. Probability Mass Function (PMF):
  • For discrete random variables, it gives the probability of each specific value.
  3. Probability Density Function (PDF):
  • For continuous random variables, it represents the density of probabilities over a range of values. The total area under the PDF is always 1.
  4. Cumulative Distribution Function (CDF):
  • It represents the probability that a random variable takes a value less than or equal to a specific point.

Table 4: Common Probability Distribution Functions in R

Distribution   Density (or Mass) Function   Cumulative Distribution Function (CDF)   Quantile Function
Normal         dnorm(Z, mean, sd)           pnorm(Z, mean, sd)                       qnorm(Q, mean, sd)
Poisson        dpois(N, lambda)             ppois(N, lambda)                         qpois(Q, lambda)
Binomial       dbinom(N, size, prob)        pbinom(N, size, prob)                    qbinom(Q, size, prob)
Exponential    dexp(N, rate)                pexp(N, rate)                            qexp(Q, rate)
Chi-Squared    dchisq(X, df)                pchisq(X, df)                            qchisq(Q, df)
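
A quick sketch of the d/p/q prefixes from Table 4 (all base R; the values are deterministic):

pnorm(1.96)            # CDF: P(Z <= 1.96) for a standard normal
## [1] 0.9750021
qnorm(0.975)           # quantile: the 97.5th percentile of a standard normal
## [1] 1.959964
dpois(2, lambda = 5)   # PMF: P(X = 2) for a Poisson with lambda = 5
## [1] 0.08422434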

5.1 Discrete Probability Distributions

5.1.1 Bernoulli Distribution

  • Definition: The Bernoulli distribution is a probability distribution for a binary random variable that takes only two possible values, usually labeled as success (1) and failure (0).

\[P(X=x) = p^x (1-p)^{1-x}, \quad x \in \{0,1\}\]

Let X be a discrete random variable. \(X \sim \mathrm{Bern}(p)\) means that X follows a Bernoulli distribution where the probability of success is p.

  • Assumptions: The binary random variable has a fixed probability of success and failure, and the outcomes are independent.

  • Application: The Bernoulli distribution is commonly used to model the outcome of a single trial of an experiment where the outcome can only be success or failure.

  • Real-life example:

      - A coin toss, where the outcome can only be heads (success) or tails (failure).
      - Medical trials: Researchers use the Bernoulli distribution to analyze the efficacy of 
        a new drug or treatment by categorizing treatment outcomes as successful or unsuccessful.
      - Passing or failing an exam 
      - Winning or losing a tennis match 
# Generate 10 Bernoulli random variables with probability of success 0.3
library(Rlab)
## Rlab 4.0 attached.
## 
## Attaching package: 'Rlab'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cancer
## The following object is masked from 'package:DAAG':
## 
##     ozone
## The following object is masked from 'package:survival':
## 
##     cancer
## The following objects are masked from 'package:stats':
## 
##     dexp, dgamma, dweibull, pexp, pgamma, pweibull, qexp, qgamma,
##     qweibull, rexp, rgamma, rweibull
## The following object is masked from 'package:datasets':
## 
##     precip
rbern(n = 10, prob = 0.3)
##  [1] 0 0 0 1 0 0 0 0 0 1
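
The Rlab package is optional here: in base R a Bernoulli variable is simply a binomial with size = 1, so the same draw can be written as:

rbinom(n = 10, size = 1, prob = 0.3) # base-R equivalent of rbern(10, 0.3)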

5.1.2 Binomial Distribution

  • Definition: The binomial distribution is the probability distribution of the number of successes in a fixed number of independent binary trials, where each trial takes only two possible values, usually labeled as success (1) and failure (0).

The binomial distribution formula can be written as:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where:
- \(P(X = k)\) is the probability of getting exactly \(k\) successes in \(n\) trials,
- \(n\) is the total number of trials,
- \(p\) is the probability of success in a single trial,
- \(\binom{n}{k}\) is the binomial coefficient, also known as "n choose k".

  • Assumptions: The binary random variable has a fixed probability of success and failure, the outcomes are independent, and the trials are identical.

  • Application: The binomial distribution is commonly used to model the number of successes in a fixed number of independent trials where the outcome can only be success or failure.

  • Real-life examples:

      - The number of heads in 10 coin tosses.
      - Clinical trials: A new drug is tested on 50 patients, and each has a 60% chance of 
        showing improvement. The number of patients showing improvement follows a Binomial 
        distribution with parameters n = 50, p = 0.6.

Example 1:

# Generate 10 binomial random variables with 5 trials and probability of success 0.3
rbinom(n = 10, size = 5, prob = 0.3)
##  [1] 2 2 3 2 0 2 1 1 1 0

Example 2: Suppose that an examination consists of six multiple-choice questions, and assume that a student has no knowledge of the subject matter. The probability that the student will guess the correct answer to the first question is 30%. Likewise, the probability of guessing each of the remaining questions correctly is also 30%.

  1. What is the probability of getting more than three correct answers?
  2. What is the probability of getting at least two correct answers?
  3. What is the probability of getting at most three correct answers?
  4. What is the probability of getting less than five correct answers?
# Parameters
n <- 6       # Number of questions
p <- 0.30    # Probability of guessing correctly
a <- 1 - pbinom(3, n, p)   # 1. P(X > 3): more than three correct answers
a
## [1] 0.07047
b <- 1 - pbinom(1, n, p)   # 2. P(X >= 2): at least two correct answers
b
## [1] 0.579825
c <- pbinom(3, n, p)       # 3. P(X <= 3): at most three correct answers
c
## [1] 0.92953
d <- pbinom(4, n, p)       # 4. P(X < 5): fewer than five correct answers
d
## [1] 0.989065

5.1.3 Poisson Distribution

  • Definition: The Poisson distribution is a probability distribution for a discrete random variable that represents the number of events occurring in a fixed interval of time or space.

The Poisson distribution formula is:

\[P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}\]

where:
- \(P(X = k)\) is the probability of observing \(k\) events in a fixed interval,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\lambda\) is the average rate of event occurrences in that interval,
- \(k\) is the number of events that we are interested in,
- \(k!\) is the factorial of \(k\) (the product of all positive integers up to \(k\)).

  • Assumptions: The events occur independently, the probability of an event occurring is constant over time or space, and the events occur at a fixed rate.

  • Application: The Poisson distribution is commonly used to model the number of occurrences of rare events, such as accidents or defects in a manufacturing process.

  • Real-life examples:

      - The number of car accidents that occur in a city in a given day.
      - Number of births in a maternity hospital: a hospital can use the Poisson distribution 
        to predict how many births will occur during the night.

# Generate 10 Poisson random variables with lambda = 2
rpois(n = 10, lambda = 2)
##  [1] 4 1 1 0 0 0 5 2 0 4

Example 2: Suppose that 4% of all TVs made by A&B Company in 2000 are defective. If eight of these TVs are randomly selected from across the country and tested, what is the probability that exactly three of them are defective? Assume that each TV is made independently of the others. (With a fixed number of trials, this is an exact binomial calculation; a Poisson approximation follows the code.)

# Parameters
n <- 8          # Number of TVs tested
p <- 0.04       # Probability a TV is defective
x <- 3          # Number of defective TVs
# Probability of exactly 3 defective TVs
prob_exactly_three <- dbinom(x, n, p)
prob_exactly_three
## [1] 0.002922296
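
Although this example appears in the Poisson section, the exact calculation above is binomial. The Poisson distribution with \(\lambda = np\) is the standard approximation to the binomial when \(n\) is large and \(p\) is small; with \(n = 8\) the approximation is rough, but the sketch below shows the idea:

# Poisson approximation with lambda = n * p = 8 * 0.04 = 0.32
dpois(3, lambda = 8 * 0.04)  # compare with the exact binomial value above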

5.2 Continuous Probability Distributions

Continuous probability distributions are a fundamental concept in probability theory and statistics. They are used to model random variables that can take on any value within a range of values, as opposed to discrete probability distributions that model random variables that can only take on certain values. Continuous probability distributions are commonly used in various fields, such as finance, engineering, healthcare, and social sciences, to model measurements of time, length, weight, or other continuous variables.

Some of the most commonly used continuous probability distributions include the Uniform, Normal, Exponential, Log-Normal, Gamma, Beta, Chi-Square, Student’s t, F, Weibull, Cauchy, Laplace, Pareto, Gumbel, and Logistic distributions. Each of these distributions has a unique shape and set of parameters that determine its properties, such as mean, variance, skewness, and kurtosis.

Understanding continuous probability distributions is essential for statistical analysis and data modeling. They are used to estimate probabilities, calculate expected values, and make predictions based on observed data. In addition, the properties of continuous probability distributions, such as their moments and tail behavior, have important implications for statistical inference and hypothesis testing.

In this context, it is important to note that continuous probability distributions are all based on the fundamental concept of probability density functions (PDFs), which describe the probability distribution of a continuous random variable. For a continuous variable the probability of any single exact value is zero; instead, the PDF is used to calculate the probability of the variable falling within a range of values. The area under the PDF curve over a range represents the probability of the random variable falling within that range.
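
As a quick check of this area interpretation, numerically integrating a PDF over an interval reproduces the corresponding difference of CDF values:

# P(-1 < X < 1) for a standard normal, computed two ways
integrate(dnorm, lower = -1, upper = 1)$value  # area under the PDF
pnorm(1) - pnorm(-1)                           # CDF difference, about 0.6827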

Overall, continuous probability distributions play a critical role in modern statistics and are essential for analyzing and interpreting data in a wide range of fields. Understanding their properties, assumptions, and applications is essential for anyone seeking to apply statistical methods to real-world problems.

5.2.1 Uniform Distribution

  • Definition: The uniform distribution is a probability distribution for a continuous random variable that takes values within a fixed interval, and all values within that interval are equally likely.

The probability density function of a continuous uniform distribution can be expressed as follows:

\[f(x|a, b) = \frac{1}{b-a} \text{ for } a \leq x \leq b\]

where:
- \(f(x|a, b)\) is the probability density function of the uniform distribution between \(a\) and \(b\),
- \(a\) is the lower limit of the distribution,
- \(b\) is the upper limit of the distribution.

This formula denotes that for a continuous uniform distribution over the interval \([a, b]\), the probability density is constant over that interval, equal to \(\frac{1}{b-a}\).

  • Assumptions: The continuous random variable has a fixed interval, and all values within that interval are equally likely.

  • Application: The uniform distribution is commonly used to model outcomes where all values within an interval are equally likely, such as the arrival time of a customer within a fixed window. (The roll of a fair die is the discrete analogue.)

  • Real-life example: The arrival time of a customer at a store between 9:00 am and 10:00 am.
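
Unlike the other distributions in this chapter, the uniform section has no code example above, so here is a minimal sketch; runif()'s min and max arguments correspond to \(a\) and \(b\) (the arrival-time example is assumed to be measured in hours, so the interval is [9, 10]):

# Generate 10 uniform random variables on [9, 10] (arrival times between 9 and 10 am)
runif(n = 10, min = 9, max = 10)
dunif(9.5, min = 9, max = 10)   # density = 1/(b - a) = 1 everywhere in [9, 10]
punif(9.25, min = 9, max = 10)  # P(arrival by 9:15 am) = 0.25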

5.2.2 Normal Distribution

  • Definition: The normal distribution is a probability distribution for a continuous random variable that has a bell-shaped curve, with most of the values around the mean and fewer values further away from the mean.

The probability density function of the normal (Gaussian) distribution can be represented as:

\[f(x|\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

where:
- \(f(x|\mu,\sigma^2)\) is the probability density function of the normal distribution with mean \(\mu\) and variance \(\sigma^2\),
- \(\mu\) is the mean of the distribution,
- \(\sigma^2\) is the variance of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\pi\) is the mathematical constant pi (\(\approx 3.14159\)).

This formula describes how the probability density function of a normal distribution varies as a function of \(x\), \(\mu\), and \(\sigma\).

  • Assumptions: The continuous random variable has a mean and standard deviation, and the values are normally distributed around the mean.

  • Application: The normal distribution is commonly used to model outcomes where the values are normally distributed, such as the height of a population or the scores on a standardized test.

  • Real-life example: The height of a population of people.

# Generate 10 normal random variables with mean 0 and standard deviation 1
rnorm(n = 10, mean = 0, sd = 1)
##  [1]  0.77185278 -0.52840536 -0.18737593  0.34490423 -0.43188270  0.02726214
##  [7]  0.70013407  0.65812203  2.55559330  0.62398999

Example 2:
Draw two random samples of size 100 from the distributions \(N(0, 0.5^2)\) and \(N(2, 0.5^2)\) and produce the histograms of these samples on one page. (a) Use the function rnorm() to generate the samples in the following way:

#x <- rnorm(sample size, mean ,standard deviation) #general call of rnorm
x1 <- rnorm(100,0,0.5)
x2 <- rnorm(100,2,0.5)
  (b) Use the par() function to define the number of figures on one page:
par(mfrow=c(1,2)) # put two figures side by side (one row, two columns)
hist(x1)
hist(x2)

  (c) What is the difference between the two histograms? Do the two distributions look the same (in terms of shape)?

5.2.3 Exponential Distribution

  • Definition: The exponential distribution is a probability distribution for a continuous random variable that represents the time between events in a Poisson process.

The probability density function of the exponential distribution can be expressed as:

\[f(x|\lambda) = \lambda e^{-\lambda x} \text{ for } x \geq 0\]

where:
- \(f(x|\lambda)\) is the probability density function of the exponential distribution with rate parameter \(\lambda\),
- \(\lambda\) is the rate parameter of the distribution, often representing the average number of events in a unit time,
- \(e\) is Euler’s number \(\approx 2.71828\).

This formula describes how the probability density function of the exponential distribution changes with \(x\) and the rate parameter \(\lambda\).

  • Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.

  • Application: The exponential distribution is commonly used to model the time between occurrences of rare events, such as the time between accidents or between defects in a manufacturing process.

  • Real-life example: The time between car accidents on a particular road.

# Generate 10 exponential random variables with rate parameter 0.5
rexp(n = 10, rate = 0.5)
##  [1] 6.9879500 1.2832438 1.5573691 2.2863128 2.0873891 1.9828075 4.9821932
##  [8] 1.3043925 0.7218954 2.7616857

5.2.4 Gamma Distribution

  • Definition: The gamma distribution is a probability distribution for a continuous random variable that represents the sum of n independent exponential random variables, each with the same rate parameter.

The probability density function of the gamma distribution can be expressed as:

\[f(x|k,\theta) = \frac{1}{\Gamma(k) \theta^k} x^{k-1} e^{-\frac{x}{\theta}} \text{ for } x > 0\]

where:
- \(f(x|k,\theta)\) is the probability density function of the gamma distribution with shape parameter \(k\) and scale parameter \(\theta\),
- \(\Gamma(k)\) is the gamma function evaluated at \(k\),
- \(x\) is the random variable,
- \(k\) is the shape parameter that determines the shape of the distribution,
- \(\theta\) is the scale parameter that influences the spread of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\).

This formula describes how the probability density function of the gamma distribution varies with the random variable \(x\), shape parameter \(k\), and scale parameter \(\theta\).

  • Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.

  • Application: The gamma distribution is commonly used to model the time to complete a task that is composed of multiple independent sub-tasks, each with a known time to completion.

  • Real-life example: The time to complete a software development project that is composed of multiple independent sub-tasks.

# Generate 10 gamma random variables with shape parameter 2 and rate parameter 0.5
rgamma(n = 10, shape = 2, rate = 0.5)
##  [1] 7.643492 2.673613 4.857552 6.438503 9.477041 3.351033 6.038383 4.489101
##  [9] 6.183145 2.370551

5.2.5 Chi-Squared Distribution

  • Definition: The chi-squared distribution is a probability distribution for a continuous random variable that is used to model the distribution of the sum of squared standard normal random variables.

The chi-squared (\(\chi^2\)) test is a nonparametric test that is used to test hypotheses about distributions of frequencies across categories of data. The sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.

The probability density function of the chi-squared distribution can be expressed as:

\[f(x|k) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}\]

where:
- \(f(x|k)\) is the probability density function of the chi-squared distribution with \(k\) degrees of freedom,
- \(\Gamma(\cdot)\) is the gamma function,
- \(x\) is the random variable,
- \(k\) is the degrees of freedom parameter.

This formula describes how the probability density function of the chi-squared distribution changes with the random variable \(x\) and the degrees of freedom parameter \(k\).

  • Assumptions: The continuous random variable is the sum of squared standard normal random variables.

  • Application: The chi-squared distribution is commonly used in hypothesis testing (for example, goodness-of-fit and independence tests) and in confidence interval estimation for a population variance.

  • Real-life example: The sum of squared deviations from the mean in a sample of 100 measurements.

# Generate 10 chi-squared random variables with 5 degrees of freedom
rchisq(n = 10, df = 5)
##  [1]  3.611192  2.541391  1.557380  3.992988  2.903059  3.744683  4.553366
##  [8]  5.548837  5.276927 13.165022

5.2.6 F Distribution

  • Definition: The F distribution is a probability distribution for a continuous random variable that models the ratio of two independent chi-squared random variables, each divided by its degrees of freedom.

  • Assumptions: The continuous random variable is the ratio of two independent chi-squared random variables, each scaled by its degrees of freedom.

  • Application: The F distribution is commonly used in hypothesis testing and confidence interval estimation when comparing the variances of two populations.

  • Real-life example: The ratio of the variances of two sets of data.

# Generate 10 F random variables with numerator degrees of freedom 5 and denominator degrees of freedom 10
rf(n = 10, df1 = 5, df2 = 10)
##  [1] 0.1924399 0.9693179 0.7488310 1.1828210 1.8700149 1.0530074 2.6654673
##  [8] 1.3521161 0.4974382 0.1966812
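
Because the F distribution underlies comparisons of variances, the corresponding test in R is var.test(); a minimal sketch on simulated samples (output omitted):

# F test comparing the variances of two simulated samples
set.seed(3)
a <- rnorm(20, sd = 1)
b <- rnorm(25, sd = 2)
var.test(a, b)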

6 Parametric and Non-parametric Tests

6.1 Parametric Tests

Parametric tests are used when the data can be assumed to follow a known distribution, typically the normal distribution.

Examples include:

  • T-test (paired or unpaired)
  • ANOVA
  • Linear regression
  • Pearson correlation

6.1.1 T-test

The t-test is a statistical test used to assess whether the difference between the responses of two groups is statistically significant. It is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known (typically, the scaling term is unknown and is therefore a nuisance parameter).

When the scaling term is estimated based on the data, the test statistic—under certain conditions—follows a Student’s t distribution. The t-test’s most common application is to test whether the means of two populations are significantly different. In many cases, a Z-test will yield very similar results to a t-test because the latter converges to the former as the size of the dataset increases.

  • T-test: used to test hypotheses about population means.

  • The key assumption for using a t-test is normality of the population from which the random samples are drawn.

  • There are three different types of t-tests, each serving a different purpose.

  • One-sample t-test: Compares the mean of a population with a certain known or hypothesized population mean value. The hypotheses to be tested are:

          - \(H_0\): \(\mu = \mu_0\)
          - \(H_a\): \(\mu \neq \mu_0\)
  • The test statistic is given by:
    \[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]
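
A minimal one-sample sketch on simulated data (the hypothesized mean of 5 and the sample parameters are arbitrary):

# One-sample t-test: is the population mean equal to 5?
set.seed(1)
x <- rnorm(30, mean = 5.4, sd = 1)
t.test(x, mu = 5)

The block below, by contrast, reads a dataset from file and runs an independent two-sample t-test assuming equal variances: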

wtdat <- read.csv('~/Independent sample T-test.csv')
t.test (Leaf.length ~ varaity, var.equal=TRUE, data = wtdat)
## 
##  Two Sample t-test
## 
## data:  Leaf.length by varaity
## t = -2.8527, df = 16, p-value = 0.01152
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -0.8672012 -0.1277988
## sample estimates:
## mean in group 1 mean in group 2 
##         27.5625         28.0600

6.1.1.1 Two samples t-test

y1<-rnorm(100,0,1)
y2<-rnorm(57,2,1)
boxplot(y1,y2)

Here we know that the variances are equal; we generated the data that way. Note that by default t.test() performs the Welch test, which does not assume equal variances; the pooled (equal-variance) test requires the var.equal = TRUE option, sketched after the discussion below.

t.test(y1,y2)
## 
##  Welch Two Sample t-test
## 
## data:  y1 and y2
## t = -11.54, df = 139.39, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.173409 -1.537620
## sample estimates:
##  mean of x  mean of y 
## 0.01300588 1.86852028

The Welch two sample t-test is a statistical method used to compare the means of two groups, \(y_1\) and \(y_2\), when their variances are not assumed equal. The test statistic is t = -11.54, indicating a large difference between the groups. The degrees of freedom (df = 139.39) are calculated using Welch’s formula, adjusted for unequal variances. The p-value is very small, indicating strong evidence against the null hypothesis that the true difference in means is 0. The 95% confidence interval for the difference in means lies entirely below 0, and the sample means confirm that \(y_2\) is substantially higher than \(y_1\). In conclusion, the Welch two sample t-test indicates a statistically significant difference in means between \(y_1\) and \(y_2\), with \(y_2\) significantly higher than \(y_1\).
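
Since these samples were in fact generated with equal variances, the pooled test is also legitimate here; a sketch of the equal-variance variant (output omitted):

# Classical two-sample t-test with pooled variance
t.test(y1, y2, var.equal = TRUE)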

6.1.1.2 Paired samples t-Test

The paired samples t-test compares the mean difference for the same subjects before and after a certain treatment. It tests whether the average difference is 0 after measurements are made on the same group (subjects) on two different occasions or under two different conditions.

Example: pre or post test results of the same subjects.

yld<-read.csv("~/Paired sample T test.csv", header=TRUE, na.strings="NA", dec=".")
# Save the data in two different vector 
before <- yld$before 
after <- yld$after 
# Compute t-test 
rt <- t.test(before, after, paired = TRUE) 
rt
## 
##  Paired t-test
## 
## data:  before and after
## t = -5.7654, df = 5, p-value = 0.002205
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -14.00080  -5.36587
## sample estimates:
## mean difference 
##       -9.683333

6.1.2 One way ANOVA

The one-way ANOVA model is given by the following formula:

\[Y_{ij} = \mu + \tau_i + \varepsilon_{ij}\]

Where:
- \(Y_{ij}\) represents the \(j\)th observation in the \(i\)th group,
- \(\mu\) and \(\tau_i\) (the overall mean and the effect of the \(i\)th group) are parameters: fixed but unknown and needing to be estimated,
- \(\varepsilon_{ij}\) is the random error, assumed to follow a normal distribution with constant variance.

Model assumptions are:

  1. Independence of Observations: The observations within each group and between groups are independent of each other.
  2. Normality of Residuals: This can be tested using statistical tests such as the Shapiro-Wilk test or graphically using Q-Q plots.
  3. Homogeneity of Variance (Homoscedasticity): This assumption can be checked using tests such as Levene’s test or Bartlett’s test.
  4. Dependent Variable is Continuous
  5. Independent Variable is Categorical

Violation of Assumptions
If assumptions are violated, consider transformations of the data, using a non-parametric alternative (e.g., Kruskal-Wallis test), or employing a robust ANOVA method.

duodenal<-read.table("~/duodenal.txt", header = FALSE, na.strings = "NA", dec = ".")
names(duodenal) <- c("Trypsin", "Protein")    
attach(duodenal)
dim(duodenal)
## [1] 28  2
Fit.aov<-aov(Protein~Trypsin)
summary(Fit.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Trypsin      2  19.63   9.814   1.265    0.3
## Residuals   25 193.99   7.760
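
Before interpreting the table, the assumptions listed above can be checked formally; a sketch of a normality test on the residuals and of the non-parametric fallback (output omitted):

# Shapiro-Wilk test on the residuals; Kruskal-Wallis as the
# non-parametric alternative if the assumptions fail
shapiro.test(residuals(Fit.aov))
kruskal.test(Protein ~ Trypsin, data = duodenal)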

Histograms by group

par(mfrow=c(2,2))
hist(Protein[Trypsin=="ls50"],col=0)
hist(Protein[Trypsin=="b51_1000"],col=0)
hist(Protein[Trypsin=="ge1000"],col=0)

Q-Q normal plots by group

par(mfrow=c(2,2))
qqnorm(Protein[Trypsin=="ls50"])
qqnorm(Protein[Trypsin=="b51_1000"])
qqnorm(Protein[Trypsin=="ge1000"])

par(mfrow=c(1,2))
boxplot(split(Protein,Trypsin))

tapply(Protein,Trypsin,mean)
## b51_1000   ge1000     ls50 
## 5.410000 5.944444 3.933333

Diagnostic plot

par(mfrow=c(2,2))
qqnorm(Fit.aov$resid)
hist(Fit.aov$resid,col=0)
boxplot(split(Fit.aov$resid,Trypsin))

6.1.2.1 Homogeneity of variances test

Bartlett’s test

bartlett.test(Protein ~ Trypsin,data = duodenal)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Protein by Trypsin
## Bartlett's K-squared = 1.5415, df = 2, p-value = 0.4627

Levene’s test

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:DAAG':
## 
##     vif
attach(duodenal)
## The following objects are masked from duodenal (pos = 5):
## 
##     Protein, Trypsin
leveneTest(Protein, Trypsin)
## Warning in leveneTest.default(Protein, Trypsin): Trypsin coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.6845 0.5136
##       25

6.1.3 Two-way ANOVA

Two-way ANOVA is a statistical technique used to evaluate the effect of two categorical independent variables (factors) and their interaction on a continuous dependent variable. It extends one-way ANOVA by including a second factor, allowing for the analysis of interaction effects between the two factors.

\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]

Where:
- \(Y_{ijk}\) represents the observation in the cell defined by levels \(i\) and \(j\) of the two factors,
- \(\mu\) is the overall mean,
- \(\alpha_i\) is the effect of the \(i\)th level of factor A,
- \(\beta_j\) is the effect of the \(j\)th level of factor B,
- \((\alpha\beta)_{ij}\) is the interaction effect between the \(i\)th level of factor A and the \(j\)th level of factor B,
- \(\varepsilon_{ijk}\) is the random error term.

Two-Way ANOVA Model and Table

Model Definition \[Y_{ijk} = \mu + \tau_i + \beta_j + \gamma_{ij} + e_{ijk}\]

  • \(\mu\): Overall mean.
  • \(\tau_i\): Effect of the \(i\)-th level of factor A.
  • \(\beta_j\): Effect of the \(j\)-th level of factor B.
  • \(\gamma_{ij}\): Interaction effect between the \(i\)-th level of factor A and \(j\)-th level of factor B.
  • \(e_{ijk}\): Random error term.

Decomposition of Sum of Squares \[SS_{\text{TOTAL}} = SS_A + SS_B + SS_{AB} + SS_{\text{Error}}\]

  • \(SS_A\): Sum of squares due to factor A.
  • \(SS_B\): Sum of squares due to factor B.
  • \(SS_{AB}\): Sum of squares due to interaction.
  • \(SS_{\text{Error}}\): Sum of squares due to residual error.

Assumptions \[Y_{ijk} \sim N(\mu, \sigma^2), \quad e_{ijk} \sim N(0, \sigma^2)\]

  • Observations are normally distributed: \(Y_{ijk} \sim N(\mu,\sigma^2)\).
  • Error terms are independent and normally distributed with mean 0 and constant variance: \(e_{ijk} \sim N(0, \sigma^2)\).

Hypotheses

Main Effects for Factor A \[H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0 \quad \text{versus} \quad H_a: \tau_i \neq 0 \ \text{for some } i.\]

Main Effects for Factor B \[H_0: \beta_1 = \beta_2 = \ldots = \beta_b = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0 \ \text{for some } j.\]

Interaction Effects \[ H_0: \gamma_{ij} = 0 \ \text{for all } i, j \quad \text{versus} \quad H_a: \gamma_{ij} \neq 0 \ \text{for some } i, j. \]

##Two way ANOVA###
trypData=read.delim2("~/Trypanosomosis.txt")
str(trypData)
## 'data.frame':    12 obs. of  4 variables:
##  $ Breed     : chr  "Boran" "Boran" "Boran" "Boran" ...
##  $ Drug      : chr  "Berenil" "Berenil" "Berenil" "Samorin" ...
##  $ PCV_before: num  18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
##  $ PCV_after : num  26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
trypData$PCV_diff <-trypData$PCV_after-trypData$PCV_before
str(trypData)
## 'data.frame':    12 obs. of  5 variables:
##  $ Breed     : chr  "Boran" "Boran" "Boran" "Boran" ...
##  $ Drug      : chr  "Berenil" "Berenil" "Berenil" "Samorin" ...
##  $ PCV_before: num  18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
##  $ PCV_after : num  26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
##  $ PCV_diff  : num  7.9 7.8 5.6 13.8 11.9 13.5 7 9.4 7.6 15.9 ...
library(ggplot2)
ggplot(trypData,aes(y = PCV_diff, x = Drug, fill = Breed)) +geom_boxplot(aes(fill = Breed)) 

ft = aov(PCV_diff~Breed * Drug, data = trypData)
print(ft)
## Call:
##    aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
## 
## Terms:
##                    Breed     Drug Breed:Drug Residuals
## Sum of Squares   0.44083 89.10750    0.80083  23.99333
## Deg. of Freedom        1        1          1         8
## 
## Residual standard error: 1.73181
## Estimated effects may be unbalanced
summary(ft)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Breed        1   0.44    0.44   0.147 0.711420    
## Drug         1  89.11   89.11  29.711 0.000608 ***
## Breed:Drug   1   0.80    0.80   0.267 0.619316    
## Residuals    8  23.99    3.00                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(ft,type="means")
## Tables of means
## Grand mean
##        
## 10.275 
## 
##  Breed 
## Breed
##    Boran Holstein 
##   10.083   10.467 
## 
##  Drug 
## Drug
## Berenil Samorin 
##    7.55   13.00 
## 
##  Breed:Drug 
##           Drug
## Breed      Berenil Samorin
##   Boran     7.100  13.067 
##   Holstein  8.000  12.933
TukeyHSD(ft)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
## 
## $Breed
##                     diff       lwr      upr     p adj
## Holstein-Boran 0.3833333 -1.922351 2.689017 0.7114204
## 
## $Drug
##                 diff      lwr      upr     p adj
## Samorin-Berenil 5.45 3.144316 7.755684 0.0006082
## 
## $`Breed:Drug`
##                                         diff        lwr       upr     p adj
## Holstein:Berenil-Boran:Berenil     0.9000000 -3.6281806  5.428181 0.9173043
## Boran:Samorin-Boran:Berenil        5.9666667  1.4384861 10.494847 0.0124387
## Holstein:Samorin-Boran:Berenil     5.8333333  1.3051527 10.361514 0.0140891
## Boran:Samorin-Holstein:Berenil     5.0666667  0.5384861  9.594847 0.0293847
## Holstein:Samorin-Holstein:Berenil  4.9333333  0.4051527  9.461514 0.0334869
## Holstein:Samorin-Boran:Samorin    -0.1333333 -4.6615139  4.394847 0.9996731
plot(TukeyHSD(ft,"Breed"),las=1)

mastitisData =read.csv2("~/Mastitis.csv")
str(mastitisData)
## 'data.frame':    12 obs. of  3 variables:
##  $ Parity   : chr  "heifer" "heifer" "heifer" "heifer" ...
##  $ Dose     : chr  "high" "high" "medium" "medium" ...
##  $ Reduction: num  6.79 3.87 30.03 38.08 53.67 ...
mastitisData$Dose = factor(mastitisData$Dose,levels =c("low", "medium", "high"))
ggplot(mastitisData,aes(y = Reduction, x = Dose, fill = Parity))+geom_point(aes(colour = Parity), size = 6)

fm= aov(Reduction ~Parity * Dose, data = mastitisData)
summary(fm)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Parity       1    627     627   36.79 0.000911 ***
## Dose         2   8758    4379  257.06 1.54e-06 ***
## Parity:Dose  2    385     192   11.30 0.009236 ** 
## Residuals    6    102      17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(fm,type="means")
## Tables of means
## Grand mean
##       
## 39.64 
## 
##  Parity 
## Parity
##      heifer multiparous 
##       32.41       46.87 
## 
##  Dose 
## Dose
##    low medium   high 
##  72.62  39.84   6.45 
## 
##  Parity:Dose 
##              Dose
## Parity        low   medium high 
##   heifer      57.86 34.06   5.33
##   multiparous 87.40 45.63   7.58
TukeyHSD(fm)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Reduction ~ Parity * Dose, data = mastitisData)
## 
## $Parity
##                        diff     lwr      upr    p adj
## multiparous-heifer 14.45333 8.62264 20.28403 0.000911
## 
## $Dose
##                 diff       lwr       upr    p adj
## medium-low  -32.7825 -41.73701 -23.82799 7.34e-05
## high-low    -66.1725 -75.12701 -57.21799 1.30e-06
## high-medium -33.3900 -42.34451 -24.43549 6.60e-05
## 
## $`Parity:Dose`
##                                        diff       lwr       upr     p adj
## multiparous:low-heifer:low           29.540  13.11411  45.96589 0.0029142
## heifer:medium-heifer:low            -23.800 -40.22589  -7.37411 0.0089285
## multiparous:medium-heifer:low       -12.225 -28.65089   4.20089 0.1529274
## heifer:high-heifer:low              -52.525 -68.95089 -36.09911 0.0001178
## multiparous:high-heifer:low         -50.280 -66.70589 -33.85411 0.0001512
## heifer:medium-multiparous:low       -53.340 -69.76589 -36.91411 0.0001079
## multiparous:medium-multiparous:low  -41.765 -58.19089 -25.33911 0.0004333
## heifer:high-multiparous:low         -82.065 -98.49089 -65.63911 0.0000071
## multiparous:high-multiparous:low    -79.820 -96.24589 -63.39411 0.0000085
## multiparous:medium-heifer:medium     11.575  -4.85089  28.00089 0.1824850
## heifer:high-heifer:medium           -28.725 -45.15089 -12.29911 0.0033811
## multiparous:high-heifer:medium      -26.480 -42.90589 -10.05411 0.0051795
## heifer:high-multiparous:medium      -40.300 -56.72589 -23.87411 0.0005298
## multiparous:high-multiparous:medium -38.055 -54.48089 -21.62911 0.0007303
## multiparous:high-heifer:high          2.245 -14.18089  18.67089 0.9916016
plot(TukeyHSD(fm,"Dose",order=T))

6.1.3.1 Pearson correlation coefficient

In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1.

As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a primary school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).
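
As a sanity check of the definition, the ratio of the sample covariance to the product of the sample standard deviations reproduces cor(); illustrated on simulated data:

# Pearson correlation as normalized covariance
set.seed(42)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.5)
cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)
cor(x, y)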

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:Rlab':
## 
##     michelson
## The following object is masked from 'package:DAAG':
## 
##     hills
data(cats)
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
par(mfrow=c(1,2))
with(cats, plot(Bwt, Hwt,col=3))
title(main="Heart Weight (g) vs. Body Weight (kg)of Domestic Cats")
plot(Hwt~Bwt, data=cats,col=4,main="H vrs B W")

with(cats, cor.test(Bwt, Hwt,method = "pearson",conf.level=0.95))
## 
##  Pearson's product-moment correlation
## 
## data:  Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274
with(cats, cor(Bwt, Hwt)) 
## [1] 0.8041274
with(cats, cor(Bwt, Hwt))^2
## [1] 0.6466209
###Correlation test###
with(cats, cor.test(Bwt, Hwt))
## 
##  Pearson's product-moment correlation
## 
## data:  Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274
### Scatterplot details ###
with(cats, plot(Bwt, Hwt, type="n", las=1, xlab="Body Weight in kg", ylab="Heart Weight in g", main="Heart Weight vs. Body Weight of Cats"))
with(cats, points(Bwt[Sex=="F"], Hwt[Sex=="F"], pch=16,col="red"))
with(cats, points(Bwt[Sex=="M"], Hwt[Sex=="M"], pch=17,col="blue"))

6.2 Non-parametric Tests

Non-parametric tests are used when data is not normally distributed, when the assumptions of parametric tests are violated, or when dealing with ordinal or nominal data. These tests do not depend on any distribution and do not make any assumptions.

Examples include:
- Sign test
- Mann–Whitney U test (Wilcoxon rank-sum test)
- Wilcoxon signed-rank test
- Chi-square test
- Kruskal–Wallis test
- Kolmogorov–Smirnov test
- Spearman correlation

6.2.1 Wilcoxon signed-rank test

The Wilcoxon signed-rank test is a non-parametric rank test for statistical hypothesis testing used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples. The one-sample version serves a purpose similar to that of the one-sample Student’s t-test. For two matched samples, it is a paired difference test like the paired Student’s t-test (also known as the “t-test for matched pairs” or “t-test for dependent samples”).

The Wilcoxon test is a good alternative to the t-test when the normal distribution of the differences between paired individuals cannot be assumed. Instead, it assumes a weaker hypothesis that the distribution of this difference is symmetric around a central value and it aims to test whether this center value differs significantly from zero. The Wilcoxon test is a more powerful alternative to the sign test because it considers the magnitude of the differences, but it requires this moderately strong assumption of symmetry.

- Alternative: paired t-test
- Application: comparing the medians of two dependent groups when the assumptions of normality or equal variances are not met.
- Real-life example: comparing the pre- and post-treatment scores of patients in a clinical trial when the scores are not normally distributed.
# Wilcoxon signed-rank test
before <- c(8, 9, 10, 12, 14)
after <- c(10, 11, 12, 13, 15)
wilcox.test(before, after, paired=TRUE)
## Warning in wilcox.test.default(before, after, paired = TRUE): cannot compute
## exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  before and after
## V = 0, p-value = 0.05334
## alternative hypothesis: true location shift is not equal to 0

6.2.2 Chi-square test

A chi-square test is a statistical test used to determine if there is a significant association between categorical variables. It compares the observed frequencies of occurrences in different categories with the frequencies expected if there were no associations between the variables.

The chi-squared test statistic is given as:

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

where \(O_i\) is the observed frequency and \(E_i\) is the expected frequency in category \(i\), summed over the \(k\) categories.
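
To make the formula concrete, a hand computation for a fair six-sided die (the counts are hypothetical), which matches what chisq.test() reports:

# Chi-squared statistic by hand: 60 die rolls, equal expected counts
O <- c(8, 12, 9, 11, 6, 14)   # hypothetical observed counts
E <- rep(60 / 6, 6)           # expected counts under fairness
sum((O - E)^2 / E)            # 4.2, the same value chisq.test(O) returns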

6.2.2.1 Types of chi-square tests

There are two main types of chi-square tests:

  1. Chi-Square Test of Independence: It helps determine whether the variables are independent or if there’s a relationship between them. For example, you might want to know if gender affects voting preference.

  2. Chi-Square Test of Goodness of Fit: This test checks if the sample data fits a population distribution. For example, you might want to see if a die is fair by comparing the observed frequency of each face with the expected frequency if the die were fair.

6.2.2.2 Assumptions of the chi-square test

To ensure the validity of the chi-square test, certain assumptions must be met:

  • The data must be in the form of frequencies or counts of cases.
  • The categories should be mutually exclusive.
  • For the chi-square test of independence, the expected frequency in each category should be at least 5.
  • The goodness of fit test expects a frequency of at least 1, with no more than 20% of expected frequencies being less than 5.

6.2.2.3 Practical applications of chi-square tests

Chi-square tests are widely used in academia and industry, especially for testing hypotheses about the independence of categorical variables. Some of these practical applications are:

  • Market Research: Some applications include analyzing if customer preferences for a product vary across different age groups and income levels or determining if different marketing campaigns are equally effective across demographic segments.

  • Medical Research: Common use cases are studying the association between lifestyle factors (e.g., smoking, exercise) and the incidence of diseases (e.g., lung cancer, heart disease) or evaluating whether different treatment groups show different recovery rates in clinical trials.

  • Quality Control: Commonly used to examine whether product defects are independent of the manufacturing process or specific production lines and to compare the quality of products from different suppliers to determine if there is a significant difference in defect rates.

  • Education: The tests are often used to determine if there is a significant difference in pass rates between students from different schools or teaching methods and to evaluate if introducing a new curriculum improves student performance across subjects. It’s worth noting that these are some of the many applications across academia and the industry and can be extended to other domains and fields.

layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
for (i in 1:4) 
plot(dchisq(1:28, i*3),pch=18,col="blue",main = paste("chi-square density, df =", i*3))

Example 1:
The investigator collected the blood types of patients from a hospital medical record:
A = 25, B = 32, AB = 18, O = 20.
Test the null hypothesis that blood types are equally represented for those 95 patients. This is a goodness-of-fit test with equal expected frequencies; the “p” vector does not need to be specified, since equal frequencies is the default…

chisq.test(c(25,32,18,20)) 
## 
##  Chi-squared test for given probabilities
## 
## data:  c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781
chi=chisq.test(c(25,32,18,20))
chi
## 
##  Chi-squared test for given probabilities
## 
## data:  c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781

The null hypothesis cannot be rejected at \(\alpha = 0.05\).

Example 2:
The eye colors of 592 students fall into the following categories.

Eye Color Count
Brown 220
Blue 215
Hazel 93
Green 64

Suppose a certain physiologist proposes that 50% of students should have brown eyes, 25% blue eyes, 15% hazel eyes, and 10% green eyes. Does the sample at hand agree with the physiologist’s hypothesis?

eyes<-c(220,215,93,64)
chi<-chisq.test(eyes, p=c(.5,.25,.15,.1))
chi
## 
##  Chi-squared test for given probabilities
## 
## data:  eyes
## X-squared = 50.432, df = 3, p-value = 6.462e-11

Example 3:
The following data show the numbers of patients who died and survived, by sex.

data<-c(4298,767,7136,643)
data.tab<-matrix(data,byrow=T,nrow=2)
data.tab
##      [,1] [,2]
## [1,] 4298  767
## [2,] 7136  643
dimnames(data.tab)=list("Sex"=c("F","M"),"Died"=c("Alive","died"))
data.tab
##    Died
## Sex Alive died
##   F  4298  767
##   M  7136  643
chisq.test(data.tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data.tab
## X-squared = 147.76, df = 1, p-value < 2.2e-16
chisq.test(data.tab)$expected
##    Died
## Sex   Alive     died
##   F 4508.97 556.0301
##   M 6925.03 853.9699
chisq.test(data.tab)$residuals
##    Died
## Sex     Alive      died
##   F -3.141825  8.946877
##   M  2.535186 -7.219370

Example 4:
Let us look at data on students’ smoking habits and exercise levels.

data.st<-c(7,6,6,9,7,5,12,5,5,87,84,18)
data.matrix<-matrix(data.st , byrow=T, nrow=4)
dimnames(data.matrix)=list("Exe"=c("heavy","regula","occas","never"),"Smoke"=c("freq","some","none"))
data.matrix
##         Smoke
## Exe      freq some none
##   heavy     7    6    6
##   regula    9    7    5
##   occas    12    5    5
##   never    87   84   18
chisq.test(data.matrix)
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  data.matrix
## X-squared = 13.633, df = 6, p-value = 0.03402

As the p-value 0.03402 is less than the 0.05 significance level, we reject the null hypothesis that the smoking habit is independent of the exercise level of the students.

chisq.test(data.matrix)$expected
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
##         Smoke
## Exe           freq      some      none
##   heavy   8.705179  7.721116  2.573705
##   regula  9.621514  8.533865  2.844622
##   occas  10.079681  8.940239  2.980080
##   never  86.593625 76.804781 25.601594
chisq.test(data.matrix)$residual
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
##         Smoke
## Exe             freq       some      none
##   heavy  -0.57793792 -0.6193983  2.135725
##   regula -0.20036837 -0.5250663  1.277942
##   occas   0.60485311 -1.3177955  1.170093
##   never   0.04367003  0.8210127 -1.502350

Example 5:
We apply the test to a dataset. We’ll use the Anemia Levels in Nigeria dataset, which can be downloaded from Kaggle. The dataset comes from the 2018 Nigeria Demographic and Health Surveys (NDHS). It explores the impact of mothers’ age and socioeconomic factors on anemia levels among children aged 0–59 months across Nigeria’s 36 states and the Federal Capital Territory.

# Load the necessary libraries
#install.packages('readr')
library(readr)

# Load the dataset from the CSV file
dataset <- read_csv("C:/Users/abdik/Desktop/DMA/children_anaemia.csv")
## New names:
## Rows: 33924 Columns: 18
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (13): Age in 5-year groups, Type of place of residence, Highest educatio... dbl
## (5): ...1, Births in last five years, Age of respondent at 1st birth, H...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Display the first few rows of the dataset
head(dataset)
(Output: the first six rows of the 18-column dataset, including the mother’s age group, type of place of residence, highest educational level, wealth index, births in the last five years, age at first birth, adjusted hemoglobin levels, anemia level, and related survey variables.)
# Rename a column
colnames(dataset)[colnames(dataset) == "Anemia level...8"] <- "Anemia level"
# Display the column names
colnames(dataset)
##  [1] "...1"                                                                 
##  [2] "Age in 5-year groups"                                                 
##  [3] "Type of place of residence"                                           
##  [4] "Highest educational level"                                            
##  [5] "Wealth index combined"                                                
##  [6] "Births in last five years"                                            
##  [7] "Age of respondent at 1st birth"                                       
##  [8] "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)"
##  [9] "Anemia level"                                                         
## [10] "Have mosquito bed net for sleeping (from household questionnaire)"    
## [11] "Smokes cigarettes"                                                    
## [12] "Current marital status"                                               
## [13] "Currently residing with husband/partner"                              
## [14] "When child put to breast"                                             
## [15] "Had fever in last two weeks"                                          
## [16] "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)"            
## [17] "Anemia level.1"                                                       
## [18] "Taking iron pills, sprinkles or syrup"
# Install and load the package
#install.packages('dplyr')
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:Rlab':
## 
##     count
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Select the columns of interest
selected_data <- dataset %>% select(`Highest educational level`, `Anemia level`)


# Create a contingency table for Highest educational level and Anemia level
contingency_table <- table(selected_data[["Highest educational level"]], 
                           selected_data[["Anemia level"]])

# View the contingency table
print(contingency_table)
##               
##                Mild Moderate Not anemic Severe
##   Higher        296      238        591     14
##   No education 1450     1769       1874    113
##   Primary       608      645        900     42
##   Secondary    1240     1322       1972     62
# Perform chi-square test
chi_square_test <- chisq.test(contingency_table)
print(chi_square_test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 142.86, df = 9, p-value < 2.2e-16

6.2.3 Spearman rank correlation

  • Alternative: Pearson correlation
  • Application: measuring the strength of association between two variables when one or both variables are ordinal.
  • Real-life example: measuring the correlation between the rankings of different restaurants based on their ratings and the number of customers they have.
# Spearman rank correlation
x <- c(10, 20, 30, 40, 50)
y <- c(5, 15, 25, 35, 45)
cor.test(x, y, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 4.4409e-15, p-value = 0.01667
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##   1

7 Statistical Models in R

So far we have seen only some data manipulation and the calculation of summary statistics. R can perform a multitude of statistical analyses, including linear models (lm), generalised linear models (glm), analysis of variance (anova), non-linear models with mixed effects (nlme), generalised additive models (gam), survival analysis (survival), time series analysis (tseries), multivariate analysis (multiv) and many more.

7.1 Regression Model

7.1.1 Linear Regression model in R

7.1.1.1 Simple Linear Regression

The formula for simple linear regression, where you have one independent variable (predictor) and one dependent variable, can be expressed as follows:

The linear regression model is represented as:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

where:
- \(Y\) is the dependent variable,
- \(X\) is the independent variable (predictor),
- \(\beta_0\) is the intercept (y-intercept) of the regression line,
- \(\beta_1\) is the slope (coefficient) of the independent variable,
- \(\varepsilon\) is the error term (residuals).

The goal of simple linear regression is to determine the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
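
For a single predictor this minimization has a closed-form solution, \(\hat{\beta}_1 = S_{xy}/S_{xx}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), which a short sketch on simulated data can verify against lm():

# Closed-form least-squares estimates versus lm()
set.seed(7)
x <- runif(40, 150, 200)                 # e.g. heights
y <- -120 + 1.1 * x + rnorm(40, sd = 8)  # e.g. weights
b1 <- cov(x, y) / var(x)                 # slope
b0 <- mean(y) - b1 * mean(x)             # intercept
c(b0, b1)
coef(lm(y ~ x))                          # same estimates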

library(DAAG)
attach(ais)
## The following object is masked _by_ .GlobalEnv:
## 
##     sex
## The following objects are masked from ais (pos = 12):
## 
##     bmi, ferr, hc, hg, ht, lbm, pcBfat, rcc, sex, sport, ssf, wcc, wt
## The following object is masked from cancer:
## 
##     sex
dim(ais)
## [1] 202  13
names(ais)
##  [1] "rcc"    "wcc"    "hc"     "hg"     "ferr"   "bmi"    "ssf"    "pcBfat"
##  [9] "lbm"    "ht"     "wt"     "sex"    "sport"
fit.ais <- lm(wt ~ ht, data=ais)
plot(wt ~ ht, data = ais, ylab = "Weight", xlab = "Height")
x <- ais$ht
y <- fit.ais$fit
lines(x,y)

anova(fit.ais)
## Analysis of Variance Table
## 
## Response: wt
##            Df   Sum Sq  Mean Sq F value    Pr(>F)
## ht          1 23769.79 23769.79  312.63 < 2.2e-16
## Residuals 200 15206.35    76.03

Note: aov() and anova() outputs are different

summary(fit.ais)
## 
## Call:
## lm(formula = wt ~ ht, data = ais)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.372  -5.296  -1.196   4.378  38.031 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -126.19049   11.39566  -11.07   <2e-16 ***
## ht             1.11712    0.06318   17.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.72 on 200 degrees of freedom
## Multiple R-squared:  0.6099, Adjusted R-squared:  0.6079 
## F-statistic: 312.6 on 1 and 200 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(ais$wt,fit.ais$fit,xlab="Observed", ylab="Predicted", main="")
title("Observed Versus Predicted Values")
abline(0,1)
hist(fit.ais$resid,col=0, main="")
title("Histogram for residuals")
qqnorm(fit.ais$resid) 

Default plots

par(mfrow=c(2,2))
plot(fit.ais)

Exercise
1. Read the dataset fuel.txt.
2. Fit a simple linear regression model in which the mileage (third column in the dataset) is the response and the weight (the second column in the dataset) of the car is a predictor.
3. Test the hypothesis that the slope is zero.
4. Use the default plots of an lm() object to produce the diagnostic plots.

7.1.2 Multiple Linear Regression

In multiple linear regression, the model involves more than one independent variable. The general formula for multiple linear regression with \(p\) independent variables can be expressed as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\]

where:
- \(Y\) is the dependent variable,
- \(X_1\), \(X_2\), \(\ldots\), \(X_p\) are the independent variables,
- \(\beta_0\) is the intercept,
- \(\beta_1, \beta_2, \ldots, \beta_p\) are the coefficients associated with each independent variable,
- \(\varepsilon\) is the error term (residuals).

The goal of multiple linear regression is to estimate the coefficients \(\beta_0\), \(\beta_1\), \(\ldots\), \(\beta_p\) that best fit the data. This is typically done by minimizing the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.

Example

Y<-c(78.5,74.3 ,104.3,87.6,95.9,109.2,102.7,72.5 ,93.1 ,115.9 ,83.8 ,113.3,109.4)
x1<-c(7 ,1 ,11, 11 ,7 ,11 ,3 ,1 ,2 ,21 ,1 ,11 ,10 )
x2<-c(26, 29,56,31,52,55,71,31,54,47,40,66,68)
x3<-c(6,15, 8,8,6,9,17,22,18,4,23,9 ,8)
x4<-c(60,52,20,47,33,22,6,44,22,26,34,12,12)
  1. Identify the optimal model or models based on \(R^2_{\text{adj}}\), AIC, AICc, and BIC from the approach based on all possible subsets.
  2. Identify the optimal model or models based on AIC and BIC from the approach based on forward selection.
  3. Identify the optimal model or models based on AIC and BIC from the approach based on backward elimination.
  4. Carefully explain why the models chosen in (a), (b) & (c) are not all the same.
  5. Recommend a final model. Give detailed reasons to support your choice.
Halds<-data.frame(Y,x1,x2,x3,x4)
model<-lm(Y~x1+x2+x3+x4,data=Halds)
summary(model)
## 
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4, data = Halds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07
par(mfrow=c(2,2))
plot(model)

residuals <- residuals(model)
fitted_values <- fitted(model)

# Correlation test
cor.test(fitted_values, residuals)
## 
##  Pearson's product-moment correlation
## 
## data:  fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5509853  0.5509853
## sample estimates:
##          cor 
## 3.829213e-17
model1<-lm(Y~x1+x2,data=Halds)
summary(model1)
## 
## Call:
## lm(formula = Y ~ x1 + x2, data = Halds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.893 -1.574 -1.302  1.363  4.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
## x1           1.46831    0.12130   12.11 2.69e-07 ***
## x2           0.66225    0.04585   14.44 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9744 
## F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09
model2<-lm(Y~x1+x2+x4,data=Halds)
summary(model2)
## 
## Call:
## lm(formula = Y ~ x1 + x2 + x4, data = Halds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0919 -1.8016  0.2562  1.2818  3.8982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.6483    14.1424   5.066 0.000675 ***
## x1            1.4519     0.1170  12.410 5.78e-07 ***
## x2            0.4161     0.1856   2.242 0.051687 .  
## x4           -0.2365     0.1733  -1.365 0.205395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08
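
A sketch of the selection approaches the exercise asks for, using step(), which compares models by AIC by default (output omitted):

# Backward elimination from the full model
step(model, direction = "backward")
# Forward selection from the intercept-only model
null <- lm(Y ~ 1, data = Halds)
step(null, scope = ~ x1 + x2 + x3 + x4, direction = "forward")
# For BIC-based selection, add k = log(nrow(Halds)) to either call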

Example 2:

library(car)
attach(mtcars)
## The following object is masked from ais (pos = 3):
## 
##     wt
## The following object is masked from package:ggplot2:
## 
##     mpg
## The following object is masked from ais (pos = 13):
## 
##     wt
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
x <- mtcars$mpg
y <- mtcars$hp

# Pearson correlation test
cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684
# Fit a model
model11 <- lm(mpg ~ hp + wt, data = mtcars)
summary(model11)
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
# Extract residuals and fitted values from the mtcars model
residuals11 <- residuals(model11)
fitted_values11 <- fitted(model11)

# Correlation test: by construction, OLS fitted values and residuals are
# uncorrelated, so the estimated correlation is numerically zero (p-value = 1)
cor.test(fitted_values11, residuals11)
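
Since the car package is already loaded, it is also worth checking the predictors for multicollinearity. A minimal sketch using vif() on the model fitted above:

# Variance inflation factors for hp and wt;
# values above roughly 5-10 suggest problematic collinearity
library(car)
vif(model11)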
# Load ggplot2
library(ggplot2)

# Create Scatter Plot with Regression Line
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue", size = 3) +                    # Scatter points
  geom_smooth(method = "lm", se = TRUE, color = "red") +    # Regression line with confidence interval
  labs(title = "Scatter Plot of MPG vs Horsepower",
       x = "Horsepower (hp)",
       y = "Miles Per Gallon (mpg)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Pairwise scatterplots for selected variables
pairs(mtcars[, c("mpg", "hp", "wt")],
      main = "Pairwise Scatterplots",
      pch = 19, col = "blue")

# Fit a Multiple Regression Model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Plot Residuals vs Fitted Values
plot(model$fitted.values, model$residuals,
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values",
     ylab = "Residuals",
     pch = 19,
     col = "darkgreen")
abline(h = 0, col = "red", lwd = 2)    # Horizontal line at 0

# Optionally, add a smooth line to detect patterns
lines(lowess(model$fitted.values, model$residuals), col = "blue", lwd = 2)

# Create Residuals vs Fitted Plot
ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point(color = "darkgreen", size = 3) +             # Residual points
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +  # Horizontal line at 0
  geom_smooth(method = "loess", color = "blue") +         # Smooth line to detect patterns
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Generate All Diagnostic Plots
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 grid
plot(model)

par(mfrow = c(1, 1))  # Reset to default
# Q-Q Plot Using ggplot2
ggplot(model, aes(sample = .stdresid)) +
  stat_qq(color = "purple") +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

# Load library for heatmap
library(corrplot)
## corrplot 0.92 loaded
# Calculate correlation matrix
cor_matrix <- cor(mtcars)

# Plot heatmap
corrplot(cor_matrix, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45,
         title = "Correlation Heatmap", mar = c(0, 0, 1, 0))

7.2 Time Series Models

Time series analysis is a set of methods for analyzing time-ordered data in order to extract meaningful statistics and other characteristics of the data: what does the past trend imply, and is it likely to continue? It is concerned with the dependence of current values on past values, and it provides a systematic approach to the mathematical and statistical questions posed by time correlation. In short, it is used to examine how data values change over time.

  • Statistical data collected, observed, or recorded at successive intervals of time.
  • A set/sequence/collection of observations recorded over a period of time (t), with data points recorded at equal time intervals.
  • The measurements may be taken every day, week, month, quarter, or year, or at any other regular interval.
  • A sequence of observations on a variable measured at successive points in time or over successive periods of time (see the sketch after this list).
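
The sketch below shows how such a sequence is represented in R as a ts object; the values are made up purely for illustration.

# A quarterly series starting in 2022 Q1 (illustrative values only)
births <- c(120, 135, 128, 142, 131, 150, 144, 160)
births_ts <- ts(births, start = c(2022, 1), frequency = 4)
print(births_ts)
plot(births_ts, main = "A simple quarterly time series")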

7.2.1 Areas that Generate Time Series Data

  • Economics: Daily stock market quotations (the time unit is trading days) or monthly unemployment figures
  • Social Science: Population series such as annual birth rates or school enrolments
  • Epidemiology: Monthly numbers of influenza cases
  • Medicine: Blood pressure traced over time, or functional Magnetic Resonance Imaging (fMRI) of brain waves

7.2.1.1 Significance of time series analysis

The analysis of time series is of great significance not only to economists and business people but also to scientists, astronomers, geologists, sociologists, biologists, research workers, and others, for the reasons below.

  1. It helps in understanding past behavior & predict the future
  2. It helps in planning future operations
  3. It helps in evaluating current accomplishments
  4. It facilitates comparison between different series and between different time periods.

7.2.2 Types of Time Series Analysis

  1. Univariate Time series Analysis
  2. Multivariate Time series Analysis

7.2.3 ARIMA Model

Autoregressive Integrated Moving Average (ARIMA) models were popularized by George Box and Gwilym Jenkins in the early 1970s. ARIMA models are a class of linear models capable of representing both stationary and non-stationary time series. They do not involve independent variables in their construction; instead, they use the information in the series itself to generate forecasts.

ARIMA is one of the most widely implemented methods for analyzing univariate time series data. In order to understand the modelling procedure, it is useful to briefly introduce the basic building blocks. ARIMA models are a specific subset of univariate models in which a time series is expressed in terms of past values of itself (the autoregressive component) plus current and lagged values of a 'white noise' error term (the moving average component).

The ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting method that combines autoregressive (AR), differencing (I), and moving average (MA) components. The general formula for an ARIMA model of order \((p, d, q)\), written in terms of the \(d\)-times differenced series \(Y_t\), can be expressed as:

\[Y_t = c + \phi_1 Y_{t-1} + \ldots + \phi_p Y_{t-p} + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q} + \varepsilon_t\]

where:
- \(Y_t\) is the time series at time \(t\),
- \(c\) is a constant term,
- \(\phi_1\), \(\ldots\), \(\phi_p\) are the autoregressive coefficients,
- \(\theta_1\), \(\ldots\), \(\theta_q\) are the moving average coefficients,
- \(\varepsilon_t\) is the error term at time \(t\),
- \(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}\) are lagged values of the time series,
- \(\varepsilon_{t-1}\), \(\varepsilon_{t-2}\), \(\ldots\), \(\varepsilon_{t-q}\) are lagged values of the error terms,
- \(d\) is the degree of differencing.

The ARIMA model aims to capture the autocorrelation in the time series data by incorporating the autoregressive and moving average terms. The differencing term helps in making the series stationary. The parameters \(p\), \(d\), and \(q\) are chosen based on the characteristics of the time series data.
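
As an illustration of this notation, the sketch below simulates a series from a known ARIMA process; the coefficients \(\phi_1 = 0.6\) and \(\theta_1 = 0.4\) are chosen arbitrarily.

# Simulate 200 observations from an ARIMA(1,1,1) process
set.seed(123)
sim <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.4), n = 200)
plot(sim, main = "Simulated ARIMA(1,1,1) series")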

7.2.3.1 Model Assumptions

The ARIMA model makes several assumptions about the underlying time series data, including:

  • Stationarity: The time series should be stationary, meaning that the statistical properties of the series (such as the mean and variance) do not depend on time. If the series is non-stationary, it may need to be differenced to make it stationary before using the ARIMA model.

  • Linearity: The relationship between past observations and future predictions should be linear. Nonlinear relationships may require a different type of model.

  • Independence: The error terms in the ARIMA model should be independent and identically distributed (IID), meaning that each error term should be unrelated to previous error terms and should have the same distribution. If the error terms are not IID, it may indicate that the model is misspecified.

  • Normality: The error terms should be normally distributed, with a mean of zero and constant variance. If the error terms are not normally distributed, it may indicate that the model is misspecified.

  • Adequate Sample Size: The ARIMA model assumes that the time series has a sufficient number of observations to estimate the model parameters with reasonable accuracy. A general guideline is to have at least 50 observations in the time series.

These assumptions are important to keep in mind when using the ARIMA model, as violations of these assumptions can lead to inaccurate predictions and biased parameter estimates. It is important to check these assumptions before fitting an ARIMA model to a time series and to take appropriate steps to address any violations.
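
A minimal sketch of how some of these checks can be carried out in R, using the built-in annual lynx series purely as a stand-in for real data:

library(forecast)
library(tseries)
fit <- auto.arima(lynx)                                  # fit an ARIMA model
Box.test(residuals(fit), lag = 10, type = "Ljung-Box")   # independence of the errors
shapiro.test(residuals(fit))                             # normality of the errors
adf.test(lynx, alternative = "stationary")               # stationarity of the input series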

Example

# Read the quarterly male live-birth data and convert it to a time series
male_births <- read.csv("~/male.csv")
male <- ts(male_births, start = 2000, freq = 4)
ts.plot(male, ylab = "Male Births", main = "Male Live Births")

# ACF and PACF of the raw series
par(mfrow = c(2, 1))
acf(male, main = "ACF for actual series")
pacf(male, main = "PACF for actual series")

#Stationary Checking
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(male, alternative = "stationary") # ADF (Augmented Dickey-Fuller) test: p-value > 0.05, so we fail to reject non-stationarity
## 
##  Augmented Dickey-Fuller Test
## 
## data:  male
## Dickey-Fuller = -1.5531, Lag order = 4, p-value = 0.762
## alternative hypothesis: stationary
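Since the ADF test fails to reject non-stationarity, one common remedy is to difference the series before modelling. The sketch below uses ndiffs() from the forecast package to estimate how many first differences are needed (note that auto.arima() below nevertheless selects d = 0 on AIC grounds):

library(forecast)
ndiffs(male)                                     # estimated number of first differences
male_diff <- diff(male)                          # first-differenced series
adf.test(male_diff, alternative = "stationary")  # re-test after differencing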
# Fit ARIMA models automatically, selecting the order by AIC
library(forecast)
library(ggplot2)
fit_model <- auto.arima(male, ic = "aic", seasonal = TRUE, trace = TRUE)
## 
##  ARIMA(2,0,2)(1,0,1)[4] with non-zero mean : Inf
##  ARIMA(0,0,0)           with non-zero mean : 1595.311
##  ARIMA(1,0,0)(1,0,0)[4] with non-zero mean : 1509.133
##  ARIMA(0,0,1)(0,0,1)[4] with non-zero mean : 1535.53
##  ARIMA(0,0,0)           with zero mean     : 2152.192
##  ARIMA(1,0,0)           with non-zero mean : 1521.978
##  ARIMA(1,0,0)(2,0,0)[4] with non-zero mean : 1509.428
##  ARIMA(1,0,0)(1,0,1)[4] with non-zero mean : Inf
##  ARIMA(1,0,0)(0,0,1)[4] with non-zero mean : 1513.905
##  ARIMA(1,0,0)(2,0,1)[4] with non-zero mean : Inf
##  ARIMA(0,0,0)(1,0,0)[4] with non-zero mean : 1528.147
##  ARIMA(2,0,0)(1,0,0)[4] with non-zero mean : 1485.386
##  ARIMA(2,0,0)           with non-zero mean : 1489.306
##  ARIMA(2,0,0)(2,0,0)[4] with non-zero mean : 1485.986
##  ARIMA(2,0,0)(1,0,1)[4] with non-zero mean : Inf
##  ARIMA(2,0,0)(0,0,1)[4] with non-zero mean : 1486.623
##  ARIMA(2,0,0)(2,0,1)[4] with non-zero mean : Inf
##  ARIMA(3,0,0)(1,0,0)[4] with non-zero mean : 1486.887
##  ARIMA(2,0,1)(1,0,0)[4] with non-zero mean : 1486.866
##  ARIMA(1,0,1)(1,0,0)[4] with non-zero mean : 1491.064
##  ARIMA(3,0,1)(1,0,0)[4] with non-zero mean : 1488.765
##  ARIMA(2,0,0)(1,0,0)[4] with zero mean     : Inf
## 
##  Best model: ARIMA(2,0,0)(1,0,0)[4] with non-zero mean
print(summary(fit_model))
## Series: male 
## ARIMA(2,0,0)(1,0,0)[4] with non-zero mean 
## 
## Coefficients:
##          ar1     ar2    sar1       mean
##       0.3129  0.4855  0.2552  7458.7561
## s.e.  0.0848  0.0870  0.1054   176.1787
## 
## sigma^2 = 86857:  log likelihood = -737.69
## AIC=1485.39   AICc=1486   BIC=1498.61
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -1.866228 288.9912 233.0705 -0.1768792 3.109607 0.7421209
##                     ACF1
## Training set -0.02615957
male_forecasting <- forecast(fit_model, level = c(95), h = 5)  # forecast 5 quarters ahead with 95% intervals
male_forecasting
##         Point Forecast    Lo 95    Hi 95
## 2026 Q1       7468.843 6891.213 8046.472
## 2026 Q2       7500.321 6895.070 8105.571
## 2026 Q3       7411.571 6718.813 8104.329
## 2026 Q4       7476.532 6757.332 8195.733
## 2027 Q1       7461.016 6651.517 8270.515
autoplot(male_forecasting)
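
Before trusting these forecasts, it is worth inspecting the residuals of the fitted model. checkresiduals() from the forecast package combines a time plot, an ACF plot, a histogram, and a Ljung-Box test in one call:

checkresiduals(fit_model)   # residual diagnostics for the fitted ARIMA model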

Example
Gasoline Sales Time Series After Obtaining the Contract with the Maroodi-Jeex Region Police.

gasoline <- data.frame(weeks = 1:22, sales = c(17,21,19,23,18,16,20,18,22,20,15,22,31,34,31,33,28,32,30,29,34,33))
head(gasoline)
##   weeks sales
## 1     1    17
## 2     2    21
## 3     3    19
## 4     4    23
## 5     5    18
## 6     6    16
gas_oline<-ts(gasoline$sales, start = 2024, frequency = 52)
print(gas_oline)
## Time Series:
## Start = c(2024, 1) 
## End = c(2024, 22) 
## Frequency = 52 
##  [1] 17 21 19 23 18 16 20 18 22 20 15 22 31 34 31 33 28 32 30 29 34 33
library(forecast)
arima_model <- auto.arima(gas_oline)
summary(arima_model)
## Series: gas_oline 
## ARIMA(0,1,0) 
## 
## sigma^2 = 16.86:  log likelihood = -59.46
## AIC=120.92   AICc=121.13   BIC=121.96
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set 0.7280455 4.011349 3.455318 1.442402 14.70862  NaN -0.1952415

The ARIMA(0,1,0) model was fitted to the gas_oline time series, which represents weekly gasoline sales after the contract with the Maroodi-Jeex Region Police was obtained. Key metrics include an estimated innovation variance of 16.86, a log likelihood of -59.46, an Akaike Information Criterion (AIC) of 120.92, a corrected AIC (AICc) of 121.13, and a Bayesian Information Criterion (BIC) of 121.96.

Training set error measures show a mean error (ME) of 0.728, indicating that the model tends to slightly under-predict on average. The Root Mean Squared Error (RMSE) is 4.011, the Mean Absolute Error (MAE) is 3.455, the Mean Percentage Error (MPE) is 1.44%, and the Mean Absolute Percentage Error (MAPE) is 14.71%. The MASE is reported as NaN because the series is shorter than one full seasonal cycle at frequency 52, so the seasonal naive benchmark used for its scaling cannot be computed. The lag-1 autocorrelation of the residuals is -0.195, suggesting little remaining autocorrelation.

Based on these metrics, the ARIMA(0,1,0) model may provide a reasonable fit to the gasoline sales data in the Maroodi-Jeex Region Police, with room for improvement in accuracy. Further analysis and model refinement may be necessary to enhance forecasting performance.
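
One candidate refinement, sketched below, is to add a drift term, since the series shifts to a markedly higher level after week 12. Arima() from the forecast package supports this directly:

# ARIMA(0,1,0) with drift: a random walk with a constant average step
fit_drift <- Arima(gas_oline, order = c(0, 1, 0), include.drift = TRUE)
summary(fit_drift)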

autoplot(arima_model$residuals, main = "Residuals of the ARIMA(0,1,0) model")

forecast_values <- forecast(arima_model, h = 10)
forecast_values
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2024.423             33 27.73827 38.26173 24.952884 41.04712
## 2024.442             33 25.55880 40.44120 21.619660 44.38034
## 2024.462             33 23.88642 42.11358 19.061987 46.93801
## 2024.481             33 22.47655 43.52345 16.905768 49.09423
## 2024.500             33 21.23442 44.76558 15.006102 50.99390
## 2024.519             33 20.11146 45.88854 13.288672 52.71133
## 2024.538             33 19.07878 46.92122 11.709333 54.29067
## 2024.558             33 18.11759 47.88241 10.239319 55.76068
## 2024.577             33 17.21482 48.78518  8.858653 57.14135
## 2024.596             33 16.36096 49.63904  7.552785 58.44721
autoplot(forecast_values)

The figure depicts the ARIMA(0,1,0) forecasts for the gasoline sales data from 2024.0 to 2024.6. The x-axis represents time (in fractions of the year 2024), while the y-axis represents sales. The solid black line shows the observed series, the blue line shows the point forecasts, which are flat at the last observed value of 33, as expected for a random walk model, and the shaded bands show the 80% and 95% prediction intervals, which widen as the forecast horizon increases. The title "Forecasts from ARIMA(0,1,0)" indicates that the graph displays the forecasts generated by the fitted model.
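
In-sample error measures can flatter a model, so a more honest check is to hold out the last few weeks, refit, and compare the forecasts against the held-out observations. A minimal sketch:

# Hold out the last 4 weeks as a test set
train <- window(gas_oline, end = c(2024, 18))
test  <- window(gas_oline, start = c(2024, 19))
fit_train <- auto.arima(train)
accuracy(forecast(fit_train, h = length(test)), test)   # test-set error measures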

8 Conclusion

The course Statistical Computing II (R Programming) delved into statistical methodologies and advanced computational techniques using R, preparing students to analyze and interpret complex data. It covered topics like data manipulation, visualization, regression, machine learning, and survival analysis. Students gained hands-on experience in implementing algorithms, writing efficient code, and applying statistical methods to real-world datasets. Emphasis was placed on data wrangling, simulations, and model diagnostics. The course laid the groundwork for building expertise in data science, analytics, and statistical research, allowing students to confidently tackle problems in their academic and professional endeavors using R as a powerful tool.