R is a powerful programming language and software environment designed for statistical computing, data analysis, and graphical representation. It is widely used by statisticians, data analysts, and researchers because of its flexibility and extensive libraries. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
Our introduction to the R environment did not mention statistics, yet many people use R as a statistics system. We prefer to think of it as an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. There are about 25 packages supplied with R (called “standard” and “recommended” packages) and many more are available through the CRAN family of Internet sites (see the R Project website) and elsewhere. More details on packages are given later. Most classical statistics and much of the latest methodology is available for use with R, but users may need to be prepared to do a little work to find it. There is an important difference in philosophy between S (and hence R) and the other main statistical systems. In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus, whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.
To start R in the Windows environment, double click on the R icon. A new screen will appear containing one window. The last line to appear will be `>`, a standard prompt indicating that R is expecting a command.
The on-line help gives useful information; getting used to it and understanding it will make R easier to use. Start the HTML help system with help.start(). A keyword search is possible using the Search Engine and Keywords link.
To do a keyword search use the function apropos(). For example:
apropos("test")
## [1] ".valueClassTest" "ansari.test" "bartlett.test"
## [4] "binom.test" "Box.test" "chisq.test"
## [7] "cor.test" "file_test" "fisher.test"
## [10] "fligner.test" "friedman.test" "kruskal.test"
## [13] "ks.test" "mantelhaen.test" "mauchly.test"
## [16] "mcnemar.test" "mood.test" "oneway.test"
## [19] "pairwise.prop.test" "pairwise.t.test" "pairwise.wilcox.test"
## [22] "poisson.test" "power.anova.test" "power.prop.test"
## [25] "power.t.test" "PP.test" "prop.test"
## [28] "prop.trend.test" "quade.test" "shapiro.test"
## [31] "t.test" "testInheritedMethods" "testVirtual"
## [34] "var.test" "wilcox.test"
Note that:
- You need to put the keyword in double quotes ("keyword").
- A lot of valuable information about R can be found through the on-line help system.
Classwork
Use the help system to find help on the following topics.
R can perform many types of mathematical manipulation. Simple arithmetic: whatever is typed at the prompt is evaluated, and the result is printed.
2+3 # Addition
## [1] 5
2*4 + 7 # Multiplication and addition
## [1] 15
10/3 # Division
## [1] 3.333333
5**3 # ** to the power of...
## [1] 125
5^3 # or we can use ^
## [1] 125
Standard functions that are found on a scientific calculator are available in R, for example:
sqrt(2) # Square root
## [1] 1.414214
It also provides \(\pi\) as a constant.
pi
## [1] 3.141593
sin(pi) # not exactly 0, due to floating-point rounding
## [1] 1.224606e-16
cos(pi)
## [1] -1
Table 1: The Arithmetic Functions in R
| Name | Operation |
|---|---|
| sqrt | square root |
| abs | absolute value |
| sin cos tan | trigonometric functions (radians) |
| asin acos atan | inverse trigonometric functions |
| sinh cosh tanh | hyperbolic functions |
| asinh acosh atanh | inverse hyperbolic functions |
| exp log | exponential and natural logarithm |
| log10 | common logarithm |
| gamma lgamma | gamma function and its natural log |
Table 2: Mathematical and Logical Comparison Operators in R
| Comparison | Operator |
|---|---|
| less than | < |
| less than or equal to | <= |
| exactly equal to | == |
| greater than or equal to | >= |
| greater than | > |
| not equal to | != |
| or | \| |
| and | & |
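For example, comparisons can be combined with the logical operators & and |. A small illustration (the vector a is made up for this example):
a <- c(2, 5, 8, 11)
a > 3 & a < 10   # TRUE where both conditions hold
## [1] FALSE  TRUE  TRUE FALSE
a < 3 | a > 10   # TRUE where at least one condition holds
## [1]  TRUE FALSE FALSE  TRUE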
Table 3: Functions that return a single value
| Name | Operation |
|---|---|
| length(x) | the number of elements in x |
| sum(x) | the sum of the values of x |
| mean(x) | the mean of the values of x |
| var(x) | the variance of the values of x |
| sd(x) | the standard deviation of the values of x |
| min(x) | the minimum value from the values of x |
| max(x) | the maximum value from the values of x |
| prod(x) | the product of the values of x |
| range(x) | the range of the values of x |
A value can be stored in a named variable by assigning it with the <- or = symbols, for example:
x <- 5 # assigns 5 to x
x # type x to print value
## [1] 5
y = sqrt(9) # assign the square-root of 9 to y
y
## [1] 3
Assignments can also be made in the other direction using ->, for example:
7 -> z # assigns 7 to z
z
## [1] 7
This could be read as “z is assigned 7”. Now it is possible to do arithmetic with x, y and z, for example:
(x * y) + z
## [1] 22
y^x - z + 6
## [1] 242
w <- x+y
w
## [1] 8
Note that: to delete an object from the workspace, use the rm() function. For example:
rm(w)
ls() # to list the objects (e.g., variables, data, functions) that you have created
## [1] "x" "y" "z"
R enables computation with Boolean or logical variables, which take on either the value TRUE or FALSE. You can use conditional tests to generate these values:
x <- 32
x > 16 # Is x greater than 16?
## [1] TRUE
x <= 16 # Is x less than or equal to 16?
## [1] FALSE
Logical values can be stored in variables in the same way as numeric values:
tf <- x>16
tf
## [1] TRUE
All the values presented so far have been scalars. R can also handle vectors, which combine several scalars in a single structure.
The c(...) construct (read “concatenate”, meaning “treat as a data set”) is used to create vectors. To set up a vector named x, say, consisting of five numbers, namely 2, 4, 6, 10 and 11, use the R command:
x <- c(2,4,6,10,11)
x
## [1] 2 4 6 10 11
Vectors are the simplest type of object in R.
Functions that return vectors of the same length as their input:
order(x)
## [1] 1 2 3 4 5
sort.list(x)
## [1] 1 2 3 4 5
sort(x)
## [1] 2 4 6 10 11
x[order(x)]
## [1] 2 4 6 10 11
x[sort.list(x)]
## [1] 2 4 6 10 11
1/x
## [1] 0.50000000 0.25000000 0.16666667 0.10000000 0.09090909
Numeric vector: a single entity consisting of an ordered collection of numbers.
Example: to set up a numeric vector x consisting of the 5 numbers 10, 6, 3, 6, 22, we can use any one of the following commands:
x <- c(10,6,3,6,22)            # or
x = c(10,6,3,6,22)             # or
assign("x", c(10,6,3,6,22))    # or
c(10,6,3,6,22) -> x
y<-c(x,0,x)
x
## [1] 10 6 3 6 22
Character vectors: to set up a character (string) vector z consisting of 3 place names, use:
z <- c("Canberra","Sydney","Newcastle")
Common useful escape sequences are "\n" (newline) and "\t" (tab).
Example:
cat("Abebe","\n","zelalem","\n")
## Abebe
## zelalem
Indexing Vectors
const=c(3.1416,2.7183,1.4142,1.6180)
names(const)=c("pi","euler","sqrt2","golden")
const[c(1,3,4)] # print the 1st, 3rd and 4th elements of const
## pi sqrt2 golden
## 3.1416 1.4142 1.6180
const[c(-1,-2)] # negative indices exclude the 1st and 2nd elements
## sqrt2 golden
## 1.4142 1.6180
const>2
## pi euler sqrt2 golden
## TRUE TRUE FALSE FALSE
const[const>2]
## pi euler
## 3.1416 2.7183
R can be used to generate sequences. The notation #1:#2 generates a vector consisting of the integer sequence from #1 to #2, where #1 and #2 are two integers.
1:12
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
4:18
## [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
-10:3
## [1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
x<- 1:10 # storing sequence on variable x
x
## [1] 1 2 3 4 5 6 7 8 9 10
y <- 40:1
y
## [1] 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
## [26] 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
The last example (y) shows that if a vector is too long for one line, it simply continues on the following line (using as many lines as are necessary).
Furthermore, the number in square brackets shows the position at which that line begins (e.g., [26] means that the first value on that line is the 26th element).
rep(c(1:4), 3) # repeat 1 2 3 4 three times
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
r <- rep(c(1.2, 6.1, 3.2, 5.5), 2) # repeat vector 1.2, 6.1, 3.2, 5.5 two times
r
## [1] 1.2 6.1 3.2 5.5 1.2 6.1 3.2 5.5
seq(4, 6, .25) # generate a sequence from 4 to 6 in steps of 0.25
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(length=9, from=4, to=6)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(from=4, to=6, by=0.25)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(7)
## [1] 1 2 3 4 5 6 7
seq(7, 10)
## [1] 7 8 9 10
seq(7, 10, 0.5)
## [1] 7.0 7.5 8.0 8.5 9.0 9.5 10.0
rep(seq(1, 3, 0.4), 2) # generate a sequence from 1 to 3 in steps of 0.4, then repeat it twice
## [1] 1.0 1.4 1.8 2.2 2.6 3.0 1.0 1.4 1.8 2.2 2.6 3.0
We can manipulate vectors in a similar manner to scalars. However, care must be taken when doing so, as the results may not be the desired ones.
x <- 12:1 # x is a decreasing sequence from 12 to 1
x
## [1] 12 11 10 9 8 7 6 5 4 3 2 1
If we are interested in extracting only some values of the vector, we can do this using square brackets [ ].
x<- c(1:10)**2 # generate sequence of vector (x) 1 to 10 and square each
x
## [1] 1 4 9 16 25 36 49 64 81 100
x[6] # extracting the 6th value from vector x above
## [1] 36
x[2:4] # extracting the 2nd, 3rd and 4th values from x
## [1] 4 9 16
x[c(1,7,9)] # extracting the 1st, 7th and 9th values from x
## [1] 1 49 81
y <- x[c(1,5,8)] # assigning the 1st, 5th and 8th values of x to y.
y
## [1] 1 25 64
x[9:6] # extracting the 9th, 8th, 7th and 6th values from x
## [1] 81 64 49 36
x[c(1:3, 8:10)] # extracting the 1st, 2nd, 3rd and 8th, 9th, 10th values
## [1] 1 4 9 64 81 100
x[c(1,2,3,1,2,3,1,2,3,1,2,3)] # repetition of the index...
## [1] 1 4 9 1 4 9 1 4 9 1 4 9
x[c(8,2,5,10)] # any order you please....
## [1] 64 4 25 100
x[x > 5] # Extract the value of x greater than 5
## [1] 9 16 25 36 49 64 81 100
If you use a negative subscript in your selection procedure, the corresponding numbered element is not included in the return vector. For example
x <- c(3,6,9,12,15,18,21)
x
## [1] 3 6 9 12 15 18 21
x[-4] # exclude the 4th element...
## [1] 3 6 9 15 18 21
x[c(-4,-6)] # exclude 4th and 6th elements...
## [1] 3 6 9 15 21
Note that:
- Vectors can only contain entries of the same type, numeric or character; don't mix them.
- Character entries must be surrounded by quotes (" ").
Example:
x<-c("A","A","A","A","B","B","B")
x
## [1] "A" "A" "A" "A" "B" "B" "B"
There are functions that operate on vectors and return useful information:
x <- 5:14
length(x) # Number of elements in x
## [1] 10
max(x) # Largest value in x
## [1] 14
min(x) # Smallest value in x
## [1] 5
sum(x) # Sum of all the values in x
## [1] 95
prod(x) # The product of all the values in x
## [1] 3632428800
mean(x) # The mean of all the values in x
## [1] 9.5
range(x) # Range of vector x
## [1] 5 14
var(x) # The variance of x
## [1] 9.166667
sd(x) # The standard deviation of x
## [1] 3.02765
sqrt(var(x)) # The square root of the variance (i.e. stand. deviation)
## [1] 3.02765
x <- 1:10
x>7
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
R lets you store data in a two-dimensional matrix.
Method 1: assigning a dim attribute to a vector:
x <-c(1:8)
dim(x) <-c(2,4)
x
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
Method 2: Using the function matrix:
x <-matrix(c(1:8),2,4,byrow=F)
x
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
w<-matrix(c(1,2,40,2,3,9,20,4,60),nrow=3) # create 3-row by 3-coln. matrix
w
## [,1] [,2] [,3]
## [1,] 1 2 20
## [2,] 2 3 4
## [3,] 40 9 60
w <-matrix(c(1,2,40,2,3,9,20,4,60),3,3)# Similar to the above create 3 by 3
w
## [,1] [,2] [,3]
## [1,] 1 2 20
## [2,] 2 3 4
## [3,] 40 9 60
x <- matrix(c(2,3,5,7,11,13),ncol=2) # create 3-row by 2-colm matrix
x
## [,1] [,2]
## [1,] 2 7
## [2,] 3 11
## [3,] 5 13
x[2,1] # extract 2nd-row and 1st-coln. element of x
## [1] 3
x[2,2] # extract 2nd-row and 2nd-coln. element of matrix x
## [1] 11
x[,1] # the first column of x
## [1] 2 3 5
x[3,] # the third row of x
## [1] 5 13
x[2:3] # a single index treats the matrix as a vector: elements 2 and 3 in column-major order
## [1] 3 5
xy<- matrix(1:16,ncol=4)
xy
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
rowSums(xy) # the sum of each row
## [1] 28 32 36 40
colSums(xy) # the sum of each column
## [1] 10 26 42 58
mean(xy[, 2]) # the mean of column 2 of xy
## [1] 6.5
mean(xy[3, ]) # the mean of row 3 of xy
## [1] 9
mx <- matrix(seq(0, 95, length=20), ncol=5)
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 20 40 60 80
## [2,] 5 25 45 65 85
## [3,] 10 30 50 70 90
## [4,] 15 35 55 75 95
mx [3,1] <- -12 # replace 3rd-row and 1st-coln element by -12
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 20 40 60 80
## [2,] 5 25 45 65 85
## [3,] -12 30 50 70 90
## [4,] 15 35 55 75 95
mx [3,] <- 0 # replace 3rd-row element by 0
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 20 40 60 80
## [2,] 5 25 45 65 85
## [3,] 0 0 0 0 0
## [4,] 15 35 55 75 95
mx[,2] <- c(3, 6, 9, 12) #replace the 2nd column of mx by 3,6,9,12
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 3 40 60 80
## [2,] 5 6 45 65 85
## [3,] 0 9 0 0 0
## [4,] 15 12 55 75 95
mx[4,] <- c(-1, -2, -3, -4, -5) #replace 4th-row element of mx by -1, -2, -3, -4, -5
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 3 40 60 80
## [2,] 5 6 45 65 85
## [3,] 0 9 0 0 0
## [4,] -1 -2 -3 -4 -5
mx[,1] <- c(0.5, 0.9) # a shorter vector is recycled to fill the column
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
cbind(mx, c(0.1, 0.2, 0.3, 0.4)) # Add an extra column
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.5 3 40 60 80 0.1
## [2,] 0.9 6 45 65 85 0.2
## [3,] 0.5 9 0 0 0 0.3
## [4,] 0.9 -2 -3 -4 -5 0.4
rbind(mx, 100:104) # Add an extra row.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
## [5,] 100.0 101 102 103 104
cbind(mx,mx) # Add the matrix to itself (column wise).
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.5 3 40 60 80 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5 0.9 -2 -3 -4 -5
rbind(mx, mx) # Add the matrix to itself (row wise).
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
## [5,] 0.5 3 40 60 80
## [6,] 0.9 6 45 65 85
## [7,] 0.5 9 0 0 0
## [8,] 0.9 -2 -3 -4 -5
cbind(mx[,1:3], c(91,92,93,94), mx[,4])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 91 60
## [2,] 0.9 6 45 92 65
## [3,] 0.5 9 0 93 0
## [4,] 0.9 -2 -3 94 -4
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
Method 1: Using vectors
- A vector can be used by R as an array only if it has a dimension vector as its dim attribute.
- A dimension vector is a vector of non-negative integers. If its length is k, then the array is k-dimensional.
- An array can be created by giving a data vector a dim attribute of the form c(d1, d2, ..., dk).
Example: the following is a 3-dimensional array with dimension vector c(3,5,100), built from a vector of 1500 elements.
z=c(1:1500)
dim(z) <- c(3,5,100)
dim(z)
## [1] 3 5 100
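Individual elements are then indexed with one subscript per dimension; a small illustration using the array z just created:
z[2, 4, 1]  # row 2, column 4 of the first 3-by-5 layer
## [1] 11
z[, , 1]    # the whole first layer, a 3-by-5 matrix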
R can generate a random sequence from a number of probability distributions. The general format for generating such sequences is rdensity(n, p1, p2, ...), where density is the abbreviated name of the distribution, n is the number of values to generate, and p1, p2, ... are the parameters of the distribution.
For example:
rnorm(40, 6, 2.3) # generate 40 values from a Normal distribution with mean 6 and standard deviation 2.3
## [1] 6.274633 6.500628 8.467002 5.960944 8.330951 3.512421 9.536310
## [8] 7.411002 5.893920 1.106829 9.220178 0.556495 7.679172 8.611401
## [15] 5.508841 4.811146 6.164987 9.693829 6.602183 3.023027 5.845941
## [22] 7.063279 5.608085 4.731206 11.162613 5.818375 3.707433 4.898752
## [29] 4.860086 9.690570 7.435617 4.595275 4.993286 1.911186 2.810205
## [36] 8.683388 5.589493 4.526243 7.393666 3.915919
rpois(20, lambda=5) # 20 values from a Poisson distribution with lambda = 5
## [1] 6 5 6 2 4 7 8 1 5 3 3 7 6 2 6 4 5 8 2 3
rbinom(20, size=18, prob=0.3) # 20 values from a Binomial distribution with size 18 and p = 0.3
## [1] 7 6 9 6 6 8 4 5 7 5 2 5 7 8 6 5 5 5 6 6
runif(20, min=3, max=6) # Uniform distribution (3, 6)
## [1] 4.780706 5.720903 3.188569 4.749774 4.951848 5.387701 3.962943 5.800947
## [9] 5.337865 4.133706 5.743204 3.846184 5.579625 3.219199 3.662900 3.795063
## [17] 4.003560 3.835260 3.522762 4.018478
rexp(20, rate=1.5) # 20 values from an Exponential distribution with rate = 1.5
## [1] 1.00918117 0.58033463 0.27131303 0.48328905 0.44910690 0.35606058
## [7] 0.21091749 0.20358100 0.30693818 0.44368296 0.40651114 0.90213626
## [13] 1.37793051 0.40301819 1.75014186 0.98725216 0.30260951 0.03899923
## [19] 0.03128597 1.90235076
Table: For each distribution, the R name and the arguments of the generating function are:
| Distribution | R Name | Arguments |
|---|---|---|
| Normal | norm | mean, sd |
| Student’s t | t | df, ncp |
| F | f | df1, df2, ncp |
| Exponential | exp | rate |
| Log-normal | lnorm | meanlog, sdlog |
| Beta | beta | shape1, shape2, ncp |
| Binomial | binom | size, prob |
| Cauchy | cauchy | location, scale |
| Chi-squared | chisq | df, ncp |
| Gamma | gamma | shape, scale |
| Geometric | geom | prob |
| Hypergeometric | hyper | m, n, k |
| Logistic | logis | location, scale |
| Negative binomial | nbinom | size, prob |
| Poisson | pois | lambda |
| Signed rank | signrank | n |
| Uniform | unif | min, max |
| Weibull | weibull | shape, scale |
| Wilcoxon | wilcox | m, n |
The function sample() generates random permutations of data or a random sample from a given vector. The first argument is a vector or a positive integer, and the second is the size of the sample.
For example:
sample(12) # Sampling without replacement.
## [1] 7 5 4 6 2 11 3 8 9 10 1 12
sample(12, 5) # Sampling 5 values without replacement.
## [1] 5 2 3 1 12
sample(12, replace=TRUE) # Sampling with replacement.
## [1] 9 3 6 8 3 11 10 2 10 9 8 1
sample(12, 8, replace=TRUE) # Sampling 8 values with replacement.
## [1] 2 12 6 3 6 9 1 3
sample(c("Heads", "Tails"), 6, replace=TRUE) # Simulating 6 coin tosses.
## [1] "Heads" "Heads" "Heads" "Tails" "Tails" "Tails"
A data frame is a data structure that contains one or more vectors of equal length. A data frame is very much like a matrix, except that it is designed for storing statistical or experimental data. Each row represents a unit, and each column a collection of measurements on the units. Each column can store a different type of data, such as numeric or character. Let us take an example:
Example
name= c("Omer", "Faduma", "Ibrahim","Jamal")
age=c(18,22,25,27)
sex=c("F","M","M","F")
stud=data.frame(name, age, sex)
stud
| name | age | sex |
|---|---|---|
| Omer | 18 | F |
| Faduma | 22 | M |
| Ibrahim | 25 | M |
| Jamal | 27 | F |
stud$name #access or extract student data only names
## [1] "Omer" "Faduma" "Ibrahim" "Jamal"
Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. So far, all data has been entered in R using the keyboard or generated using R functions. In reality, data is usually too large to type in and is stored in files. There are functions used to read data from external files such as Excel, SPSS, Epi-info, Stata and so on.
In R you can import text files with the function read.table. The functions read.csv and read.delim read comma-separated-values files and tab-delimited files, respectively (both with a period as the decimal point).
R syntax used to read data:
- read.csv() reads comma-separated-values (CSV) files: read.csv(file.choose())
- read.table() reads plain-text tables: read.table(file.choose(), header = TRUE)
- read.delim2() reads tab-delimited text files (with a comma as the decimal point) into a data frame: read.delim2(file.choose())
- read_sav() (haven package) reads SPSS files
- read_dta() (haven package) reads Stata files
- read_sas() (haven package) reads SAS files
- read.mtp() (foreign package) reads Minitab Portable Worksheet files
- read.epiinfo() (foreign package) reads Epi Info .REC files
Example
#R=read.csv(file.choose(),header=T)
#R=read.table(file.choose(),header=T)
In R, you can export data to various file formats such as CSV, Excel, Stata, SPSS and others using functions provided by R packages. Here are some common examples of exporting data in R:
# Create a sample data frame for five Students
data <- data.frame(
ID = 1:5,
Name = c("Ali", "Bilan", "Cawale", "Deeqsi", "Omer"),
Score = c(85, 90, 88, 92, 87))
data
| ID | Name | Score |
|---|---|---|
| 1 | Ali | 85 |
| 2 | Bilan | 90 |
| 3 | Cawale | 88 |
| 4 | Deeqsi | 92 |
| 5 | Omer | 87 |
# Export data frame to a CSV file
write.csv(data, "data.csv", row.names = FALSE)
# Install and load the 'writexl' package for writing Excel files
#install.packages("writexl")
library(writexl)
# Export data frame to an Excel file
data1<-write_xlsx(data, "data1.xlsx")
data1
## [1] "D:\\Rpubs Projects\\Stat Comp II\\rmd\\data1.xlsx"
Instruction: submit your R scripts only for this exercise.
Write the R code of the following equations
Generate the following sequences
What is the meaning of rep(letters[1:5], 4)?
Generate 80 values from a standard normal distribution.
Generate 100 values from a binomial distribution of size 23 and probability 0.25.
Generate 100 values from a log normal distribution with mean 12 and standard deviation 2.
Create the following vector sequences:
1 4 7 10 13 16 19 22 25 28
1.0 1.2 1.4 1.6 1.8 2.0 1.0 1.2 1.4 1.6 1.8 2.0
10.0 9.5 9.0 8.5 8.0 7.5 7.0
1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
1 4 9 16 25 36 49 64 81 100
1 4 9 16 25 6 14 24 36 50
For each of the sequences above calculate the mean, median, variance and the range.
Generate 15 values from a uniform distribution between 0.7 and 1.
Generate 15 values from a uniform distribution between 0 and 0.3.
Generate 28 values from a uniform distribution between 1.5 and 7.5.
Using the functions runif() and trunc() simulate 20 die throws.
Generate 30 values from a Normal distribution with mean 3.4 and variance 5.25 rounded to 2 decimal places. Calculate the median, range, and the upper and lower quartiles.
Generate 30 values from a Poisson distribution with lambda 5.
Generate 20 values from a Binomial distribution of size 15 and probability of success 0.3.
Generate a randomization list of size 150 for three treatment groups: “Placebo”, “Drug A” and “Drug B”.
Generate 100 values from a standard normal distribution and count the number of values greater than 1.96. Extract these into a vector called signif.100. Do the same again but this time generate 2000 values.
Submission
To submit your exercise, please fill out the form below and upload your R script file.
In this section, we discuss how one can tabulate and visualize single variables.
Example
In a 2-week study of the productivity of workers, the following data were obtained on the total number of acceptable pieces that 100 workers produced:
A. Construct a grouped frequency distribution.
B. Construct a histogram, frequency polygon, and Ogives.
workers<-c(65, 36, 49, 88, 79, 56, 28, 43, 67, 36, 43, 78, 37, 40, 68, 72, 55, 62, 22, 82, 88, 50, 60, 56, 57, 46, 39, 57, 73, 65, 59, 48, 76, 74, 70, 51, 40, 75, 56, 45, 35, 62, 52, 63, 32, 80, 64, 53, 74, 34, 76, 60, 48, 55, 51, 54, 45, 44, 35, 51, 21, 35, 61, 45, 33, 61, 77, 60, 85, 68, 45, 53, 34, 67, 42, 69, 52, 68, 52, 47, 63, 65, 55, 61, 73, 50, 53, 59, 41, 54, 41, 74, 82, 58, 26, 35, 47, 50, 38, 70)
range(workers)
## [1] 21 88
factor.workers<-factor(cut(workers,breaks=nclass.Sturges(workers)))
table(factor.workers)
## factor.workers
## (20.9,29.4] (29.4,37.8] (37.8,46.1] (46.1,54.5] (54.5,62.9] (62.9,71.2]
## 4 11 15 19 19 14
## (71.2,79.6] (79.6,88.1]
## 12 6
workers.dat <- as.data.frame(table(factor.workers))
workers.dat <- transform(workers.dat, cumFreq = cumsum(Freq), relative = prop.table(Freq))
workers.dat
| factor.workers | Freq | cumFreq | relative |
|---|---|---|---|
| (20.9,29.4] | 4 | 4 | 0.04 |
| (29.4,37.8] | 11 | 15 | 0.11 |
| (37.8,46.1] | 15 | 30 | 0.15 |
| (46.1,54.5] | 19 | 49 | 0.19 |
| (54.5,62.9] | 19 | 68 | 0.19 |
| (62.9,71.2] | 14 | 82 | 0.14 |
| (71.2,79.6] | 12 | 94 | 0.12 |
| (79.6,88.1] | 6 | 100 | 0.06 |
hist(workers)
par(bg = "gray90")
plot(workers, type = "o", col = "blue", pch = 20)
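Part B also asks for a frequency polygon and ogive; a minimal sketch built from the binned counts returned by hist() (its default bin boundaries may differ slightly from the cut() table above):
h <- hist(workers, plot = FALSE)                  # compute the bins without plotting
plot(h$mids, h$counts, type = "o", col = "blue",
     xlab = "Pieces produced", ylab = "Frequency", main = "Frequency polygon")
plot(h$breaks, c(0, cumsum(h$counts)), type = "o", col = "blue",
     xlab = "Pieces produced", ylab = "Cumulative frequency", main = "Less-than ogive")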
Plot Types in R
The type argument in the plot() function determines the type of plot. The possible values for type are:
- "p": points
- "l": lines
- "b": both points and lines
- "c": the lines part of "b" alone
- "o": points overplotted on lines
- "h": histogram-like vertical lines
- "s", "S": stair steps
- "n": no plotting (sets up the axes only)
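A quick side-by-side comparison of a few of these types (a small illustration):
par(mfrow = c(1, 3))                          # three plots in one row
plot(1:10, type = "p", main = "type = 'p'")   # points only
plot(1:10, type = "l", main = "type = 'l'")   # lines only
plot(1:10, type = "o", main = "type = 'o'")   # points over lines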
Example 2:
library(readr)
demo2 <- read_csv("C:/Users/abdik/Desktop/DMA/HU Data Management And Analysis/demo2.csv")
## Rows: 24 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): id, group, pulse, time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(demo2)
| id | group | pulse | time |
|---|---|---|---|
| 1 | 1 | 14 | 1 |
| 1 | 1 | 19 | 2 |
| 1 | 1 | 29 | 3 |
| 2 | 1 | 15 | 1 |
| 2 | 1 | 25 | 2 |
| 2 | 1 | 26 | 3 |
names(demo2)
## [1] "id" "group" "pulse" "time"
factor.pulse<-factor(cut(demo2$pulse,breaks=nclass.Sturges(demo2$pulse)))
table(factor.pulse)
## factor.pulse
## (9.97,14.2] (14.2,18.3] (18.3,22.5] (22.5,26.7] (26.7,30.8] (30.8,35]
## 3 5 4 6 1 5
The nclass.Sturges option in the cut function takes care of the creation of the bins according to Sturges’ formula, which makes life a lot easier for us.
pluse.demo1 <- as.data.frame(table(factor.pulse))
pluse.demo1 <- transform(pluse.demo1, cumFreq = cumsum(Freq),relative = prop.table(Freq))
pluse.demo1
| factor.pulse | Freq | cumFreq | relative |
|---|---|---|---|
| (9.97,14.2] | 3 | 3 | 0.1250000 |
| (14.2,18.3] | 5 | 8 | 0.2083333 |
| (18.3,22.5] | 4 | 12 | 0.1666667 |
| (22.5,26.7] | 6 | 18 | 0.2500000 |
| (26.7,30.8] | 1 | 19 | 0.0416667 |
| (30.8,35] | 5 | 24 | 0.2083333 |
Earlier, we introduced the summary command. It produced a number of summary statistics that, at the time, made little sense. Now that we have discussed summary statistics in the lecture, it becomes possible to work with them. The current section describes a number of methods in R for obtaining summary statistics. In addition, we discuss how those statistics should be interpreted.
The functions included in this section perform descriptive statistics by quantitatively describing or summarizing different characteristics of a sample. Graphical summaries are discussed in the next section.
library(survival)
attach(cancer)
## The following objects are masked _by_ .GlobalEnv:
##
## age, sex
head(cancer)
| inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
| 3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
| 3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
| 5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
| 1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
| 12 | 1022 | 1 | 74 | 1 | 1 | 50 | 80 | 513 | 0 |
summary(cancer)
## inst time status age
## Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00
## 1st Qu.: 3.00 1st Qu.: 166.8 1st Qu.:1.000 1st Qu.:56.00
## Median :11.00 Median : 255.5 Median :2.000 Median :63.00
## Mean :11.09 Mean : 305.2 Mean :1.724 Mean :62.45
## 3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:2.000 3rd Qu.:69.00
## Max. :33.00 Max. :1022.0 Max. :2.000 Max. :82.00
## NA's :1
## sex ph.ecog ph.karno pat.karno
## Min. :1.000 Min. :0.0000 Min. : 50.00 Min. : 30.00
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 75.00 1st Qu.: 70.00
## Median :1.000 Median :1.0000 Median : 80.00 Median : 80.00
## Mean :1.395 Mean :0.9515 Mean : 81.94 Mean : 79.96
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00
## Max. :2.000 Max. :3.0000 Max. :100.00 Max. :100.00
## NA's :1 NA's :1 NA's :3
## meal.cal wt.loss
## Min. : 96.0 Min. :-24.000
## 1st Qu.: 635.0 1st Qu.: 0.000
## Median : 975.0 Median : 7.000
## Mean : 928.8 Mean : 9.832
## 3rd Qu.:1150.0 3rd Qu.: 15.750
## Max. :2600.0 Max. : 68.000
## NA's :47 NA's :14
str(cancer)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
In the cancer dataset, the sex variable is numeric but represents a categorical variable encoding the biological sex of the patient.
# Convert 'sex' variable to factor with levels 'male' and 'female'
cancer$sex <- factor(cancer$sex, levels = c(1, 2), labels = c("male", "female"))
head(cancer)
| inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 306 | 2 | 74 | male | 1 | 90 | 100 | 1175 | NA |
| 3 | 455 | 2 | 68 | male | 0 | 90 | 90 | 1225 | 15 |
| 3 | 1010 | 1 | 56 | male | 0 | 90 | 90 | NA | 15 |
| 5 | 210 | 2 | 57 | male | 1 | 90 | 60 | 1150 | 11 |
| 1 | 883 | 2 | 60 | male | 0 | 100 | 90 | NA | 0 |
| 12 | 1022 | 1 | 74 | male | 1 | 50 | 80 | 513 | 0 |
Remove NA values from the dataset
cancer1<- na.omit(cancer)
summary(cancer1)
## inst time status age sex
## Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00 male :103
## 1st Qu.: 3.00 1st Qu.: 174.5 1st Qu.:1.000 1st Qu.:57.00 female: 64
## Median :11.00 Median : 268.0 Median :2.000 Median :64.00
## Mean :10.71 Mean : 309.9 Mean :1.719 Mean :62.57
## 3rd Qu.:15.00 3rd Qu.: 419.5 3rd Qu.:2.000 3rd Qu.:70.00
## Max. :32.00 Max. :1022.0 Max. :2.000 Max. :82.00
## ph.ecog ph.karno pat.karno meal.cal
## Min. :0.0000 Min. : 50.00 Min. : 30.00 Min. : 96.0
## 1st Qu.:0.0000 1st Qu.: 70.00 1st Qu.: 70.00 1st Qu.: 619.0
## Median :1.0000 Median : 80.00 Median : 80.00 Median : 975.0
## Mean :0.9581 Mean : 82.04 Mean : 79.58 Mean : 929.1
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:1162.5
## Max. :3.0000 Max. :100.00 Max. :100.00 Max. :2600.0
## wt.loss
## Min. :-24.000
## 1st Qu.: 0.000
## Median : 7.000
## Mean : 9.719
## 3rd Qu.: 15.000
## Max. : 68.000
Dimension of the dataset
dim(cancer1)
## [1] 167 10
After removing NA values, the first rows of the data are:
print(head(cancer1))
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 2 3 455 2 68 male 0 90 90 1225 15
## 4 5 210 2 57 male 1 90 60 1150 11
## 6 12 1022 1 74 male 1 50 80 513 0
## 7 7 310 2 68 female 2 70 60 384 10
## 8 11 361 2 71 female 2 60 80 538 1
## 9 1 218 2 53 male 1 70 80 825 16
R graphics capabilities are very powerful and flexible; R is a powerful tool for building graphics from the simplest to the most complex. Now that we have some data to work with, and we have learned about the data at the most basic level, our next task is to visualize the data. Often, a proper visualization can illuminate features of the data that can inform further analysis. We will look at four methods of visualizing data that we will use throughout the course: histograms, barplots, boxplots, and scatterplots.
When visualizing a single numerical variable, a histogram will be our go-to tool; it can be created in R using the hist() function. The histogram is the most popular graphical summary for numeric (quantitative) data. (Strip plots are an alternative, but they look best when the data set is not too large.)
A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. Then vertical bars are drawn over the bins with height proportional to the number of observations that fell into the bin. The scale on the y axis can be frequency, percentage, or density (relative frequency).
library(DAAG)
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:survival':
##
## lung
attach(ais)
## The following object is masked _by_ .GlobalEnv:
##
## sex
## The following object is masked from cancer:
##
## sex
head(ais)
| rcc | wcc | hc | hg | ferr | bmi | ssf | pcBfat | lbm | ht | wt | sex | sport |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.96 | 7.5 | 37.5 | 12.3 | 60 | 20.56 | 109.1 | 19.75 | 63.32 | 195.9 | 78.9 | f | B_Ball |
| 4.41 | 8.3 | 38.2 | 12.7 | 68 | 20.67 | 102.8 | 21.30 | 58.55 | 189.7 | 74.4 | f | B_Ball |
| 4.14 | 5.0 | 36.4 | 11.6 | 21 | 21.86 | 104.6 | 19.88 | 55.36 | 177.8 | 69.1 | f | B_Ball |
| 4.11 | 5.3 | 37.3 | 12.6 | 69 | 21.88 | 126.4 | 23.66 | 57.18 | 185.0 | 74.9 | f | B_Ball |
| 4.45 | 6.8 | 41.5 | 14.0 | 29 | 18.96 | 80.3 | 17.64 | 53.20 | 184.6 | 64.6 | f | B_Ball |
| 4.10 | 4.4 | 37.4 | 12.5 | 42 | 21.04 | 75.2 | 15.58 | 53.77 | 174.0 | 63.7 | f | B_Ball |
summary(ais)
## rcc wcc hc hg
## Min. :3.800 Min. : 3.300 Min. :35.90 Min. :11.60
## 1st Qu.:4.372 1st Qu.: 5.900 1st Qu.:40.60 1st Qu.:13.50
## Median :4.755 Median : 6.850 Median :43.50 Median :14.70
## Mean :4.719 Mean : 7.109 Mean :43.09 Mean :14.57
## 3rd Qu.:5.030 3rd Qu.: 8.275 3rd Qu.:45.58 3rd Qu.:15.57
## Max. :6.720 Max. :14.300 Max. :59.70 Max. :19.20
##
## ferr bmi ssf pcBfat
## Min. : 8.00 Min. :16.75 Min. : 28.00 Min. : 5.630
## 1st Qu.: 41.25 1st Qu.:21.08 1st Qu.: 43.85 1st Qu.: 8.545
## Median : 65.50 Median :22.72 Median : 58.60 Median :11.650
## Mean : 76.88 Mean :22.96 Mean : 69.02 Mean :13.507
## 3rd Qu.: 97.00 3rd Qu.:24.46 3rd Qu.: 90.35 3rd Qu.:18.080
## Max. :234.00 Max. :34.42 Max. :200.80 Max. :35.520
##
## lbm ht wt sex sport
## Min. : 34.36 Min. :148.9 Min. : 37.80 f:100 Row :37
## 1st Qu.: 54.67 1st Qu.:174.0 1st Qu.: 66.53 m:102 T_400m :29
## Median : 63.03 Median :179.7 Median : 74.40 B_Ball :25
## Mean : 64.87 Mean :180.1 Mean : 75.01 Netball:23
## 3rd Qu.: 74.75 3rd Qu.:186.2 3rd Qu.: 84.12 Swim :22
## Max. :106.00 Max. :209.4 Max. :123.20 Field :19
## (Other):47
hist(rcc, main = "Histogram of Red blood cell count")  # rcc is visible because ais is attached
Note: The graph obtained depends on the bins chosen (see the example below). The best choice for a given data set is the one which illuminates the data set’s underlying structure (if any). In R there are algorithms to automatically choose bins that are likely to display well; often the default bins do a good job. This is not always the case, however, and it is better to investigate several bin choices to check the stability of the display.
par(mfrow=c(1,2))
hist(rcc, breaks = 8, main = "Histogram of Red blood cell count")
hist(rcc, breaks = 20, main = " ")
Density plots are generalized histograms.
## Density plot
plot(density(ht))  # density estimate for height (ht is visible because ais is attached)
Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.
barplot(table(ais$sport), beside = TRUE, main = "Sport type")
To visualize the relationship between a numerical and categorical variable, we will use a boxplot. In the mpg dataset, the drv variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.
A boxplot is constructed by drawing a box alongside the data axis with sides located at the upper and lower hinges. A line is drawn parallel to the sides to denote the sample median. Lastly, whiskers are extended from the sides of the box to the maximum and minimum data values (more precisely, to the most extreme values that are not potential outliers).
Boxplots are good for quick visual summaries of data sets; the relative positions of the values in the five-number summary (minimum, first quartile, median, third quartile, maximum) indicate the underlying shape of the data distribution.
- They can be a handy alternative to a stripchart when the sample size is large.
Boxplots are also good because one can visually assess multiple features of the data set simultaneously:
Center: can be estimated by the sample median. Spread: can be judged by the width of the box, (IQR = third quartile – first quartile).
Shape: is indicated by the relative lengths of the whiskers, and the position of the median inside the box. Boxes with unbalanced whiskers indicate skewness in the direction of the long whisker. Skewed distributions often have the median tending in the opposite direction of skewness. Kurtosis can be assessed using the box and whiskers. A wide box with short whiskers will tend to be platykurtic, while a skinny box with wide whiskers indicates leptokurtic distributions. Extreme observations are identified with open circles.
boxplot(ais)
A potential outlier is any observation that falls beyond 1.5 times the width of the box on either side, that is, any observation less than (first quartile – 1.5*IQR) or greater than (third quartile + 1.5*IQR). A suspected outlier is any observation that falls beyond 3 times the width of the box on either side. In R, both potential and suspected outliers (if present) are denoted by open circles; there is no distinction between the two.
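The fences can also be computed directly; a minimal sketch for the red blood cell counts (rcc is visible because ais is attached):
q <- quantile(rcc, c(0.25, 0.75))   # first and third quartiles
iqr <- q[2] - q[1]                  # interquartile range
lower <- q[1] - 1.5 * iqr           # lower fence
upper <- q[2] + 1.5 * iqr           # upper fence
rcc[rcc < lower | rcc > upper]      # observations flagged as potential outliers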
boxplot(lbm ~ sex, data = ais, main = "Boxplot of Lean body mass (kg)", xlab = "Sex")
boxplot(lbm~sport, data = ais, xlab = "Sport types")
Lastly, to visualize the relationship between two numeric variables we will use a scatterplot. This can be done with the plot() function and the ~ syntax we just used with a boxplot. (The function plot() can also be used more generally; see the documentation for details.)
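For example, height against weight in the ais data (a small illustration):
plot(ht ~ wt, data = ais, main = "Height versus Weight",
     xlab = "Weight (kg)", ylab = "Height (cm)")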
Density plots
library(lattice)
densityplot(~lbm | sport, data = ais, groups = sex, scales = "free", plot.points = FALSE, auto.key = TRUE, main = "Density plots of Lean body mass by Sport type and Sex", xlab = "Lean body mass")
Scatterplots
xyplot(ht ~ wt | sport, data = ais, main = "Scatterplots of Height versus Weight by Sport type", xlab = "Weight")
Scatterplot matrix
splom(ais[,2:5], main = "Scatterplot matrix of the first four variables in ais data")
One of the most powerful features of R is the ability to write your own functions. These functions may help pre-process data or implement specific computational algorithms. In this class we will introduce the R function syntax and, in particular, the passing of variables into a function and argument handling.
If you find you have a process that you use a lot, perhaps one that calls several R functions, you can write your own function to do it. Probably one of the most powerful aspects of the R language is the ability of a user to write functions. When doing so, many computations can be incorporated into a single function and (unlike with scripts) intermediate variables used during the computation are local to the function and are not saved to the workspace. In addition, functions allow for input values used during a computation that can be changed when the function is executed.
Syntax:
if (condition) {commands1}
if (condition) {commands1} else {commands2}
ifelse(condition_vector, yes_vector, no_vector)
The condition in if() must evaluate to a single logical value (i.e. either TRUE or FALSE).
if Condition in R
The commands are carried out only if the condition evaluates to TRUE. R makes it easy: there is no then keyword; you specify your choice directly in the if statement.
if (test_expression) {
statement
}
Example:
values <- 1:10
x <- sample(values, 1)  # draw once and store it, so the tested and printed values match
if (x <= 10) {
  print(paste(x, "is less than or equal to 10"))
}
## [1] "10 is less than or equal to 10"
Example 2:
Write a function that calculates the arithmetic mean for grouped data. The group midpoints are (10, 20, 30, 40) and the frequencies for each group are (5, 10, 8, 7).
grouped_mean <- function(groups, frequencies) {
  if (length(groups) != length(frequencies)) {
    stop("Lengths of 'groups' and 'frequencies' should be the same.")
  }
  total_sum <- sum(groups * frequencies)
  total_freq <- sum(frequencies)
  mean <- total_sum / total_freq
  return(mean)
}
groups <- c(10, 20, 30, 40) # Group midpoints
frequencies <- c(5, 10, 8, 7) # Frequencies for each group
result <- grouped_mean(groups, frequencies)
print(result)
## [1] 25.66667
if-else Condition in R
An if...else statement contains the same elements as an if statement (see the preceding section), with some extra elements:
if (test_expression) {
statement
} else {
statement
}
x <- 9
if (x > 0) sqrt(x) else sqrt(-x)
## [1] 3
ifelse(x >= 0, sqrt(x), sqrt(-x))
## [1] 3
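Unlike if, ifelse() is vectorized: it applies the test to every element of a vector and picks from the yes and no vectors element-wise. A small illustration (the vector v is made up for this example):
v <- c(-2, 3, -5, 7)
ifelse(v > 0, "positive", "non-positive")
## [1] "non-positive" "positive"     "non-positive" "positive"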
before<-c(26.4,28.2,22.1,24.5,21.4,27.3)
after<-c(33.9,37.1,36.1,39.5,30.0,31.4)
df<-data.frame(before,after)
df
| before | after |
|---|---|
| 26.4 | 33.9 |
| 28.2 | 37.1 |
| 22.1 | 36.1 |
| 24.5 | 39.5 |
| 21.4 | 30.0 |
| 27.3 | 31.4 |
t_test <- t.test(df) # one-sample t-test treating all 12 values as a single sample
t_test
##
## One Sample t-test
##
## data: df
## t = 17.469, df = 11, p-value = 2.27e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 26.06716 33.58284
## sample estimates:
## mean of x
## 29.825
pvalue <- t_test$p.value
if (pvalue < 0.05) {
print("Reject the null hypothesis")
} else {
print("Fail to reject the null hypothesis")
}
## [1] "Reject the null hypothesis"
Loops: for(), while() and repeat
Syntax:
for ( var in set ) {commands}
while ( condition ) {commands}
repeat {commands}
where the object set is a vector, commands is a single command or a sequence of commands, and var is a variable which may be used in commands.
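As a small illustration, each loop below computes the sum of the integers 1 to 5:
total <- 0
for (i in 1:5) total <- total + i   # for: iterate over the vector 1:5
total
## [1] 15
total <- 0; i <- 1
while (i <= 5) { total <- total + i; i <- i + 1 }   # while: loop as long as the condition holds
total
## [1] 15
total <- 0; i <- 1
repeat { total <- total + i; i <- i + 1; if (i > 5) break }   # repeat: needs an explicit break
total
## [1] 15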
Exercise
Write a function that calculate arithmetic mean for grouped data
Consider a function that returns the maximum of two scalars or the statement that they are equal.
Write a function that computes the zero(s) of a quadratic equation.
Write a function that returns the mean of the values in a vector that are greater than the median. Use the median() function. Call your function gtmean.
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).
A probability distribution describes how the probabilities of outcomes are distributed for a random variable. It provides a mathematical framework to represent and analyze uncertainty in experiments and data. Probability distributions are categorized into Discrete and Continuous types based on the nature of the random variable.
A variable whose value is determined by the outcome of a random experiment.
- Discrete Random Variable: Takes on countable values (e.g., number of heads in a coin toss).
- Continuous Random Variable: Takes on any value within a range (e.g., height, weight).

Table 4: Common Probability Distribution Functions in R
| Distribution | Probability Density (or Mass) Function | Cumulative Distribution Function (CDF) | Quantile Function |
|---|---|---|---|
| Normal | dnorm(x, mean, sd) | pnorm(q, mean, sd) | qnorm(p, mean, sd) |
| Poisson | dpois(x, lambda) | ppois(q, lambda) | qpois(p, lambda) |
| Binomial | dbinom(x, size, prob) | pbinom(q, size, prob) | qbinom(p, size, prob) |
| Exponential | dexp(x, rate) | pexp(q, rate) | qexp(p, rate) |
| Chi-Squared | dchisq(x, df) | pchisq(q, df) | qchisq(p, df) |
The probability mass function of the Bernoulli distribution is:
\[P(X=x) = p^x (1-p)^{1-x}, \quad x \in \{0,1\}\]
Let X be a discrete random variable. \(X \sim \mathrm{Bern}(p)\) means X follows a Bernoulli distribution where the probability of success is p.
Assumptions: The binary random variable has a fixed probability of success and failure, and the outcomes are independent.
Application: The Bernoulli distribution is commonly used to model the outcome of a single trial of an experiment where the outcome can only be success or failure.
Real-life examples:
- A coin toss, where the outcome can only be heads (success) or tails (failure).
- Medical trials: researchers use the Bernoulli distribution to analyze the efficacy of a new drug or treatment by categorizing treatment outcomes as successful or unsuccessful.
- Passing or failing an exam.
- Winning or losing a tennis match.

# Generate 10 Bernoulli random variables with probability of success 0.3
library(Rlab)
## Rlab 4.0 attached.
##
## Attaching package: 'Rlab'
## The following object is masked _by_ '.GlobalEnv':
##
## cancer
## The following object is masked from 'package:DAAG':
##
## ozone
## The following object is masked from 'package:survival':
##
## cancer
## The following objects are masked from 'package:stats':
##
## dexp, dgamma, dweibull, pexp, pgamma, pweibull, qexp, qgamma,
## qweibull, rexp, rgamma, rweibull
## The following object is masked from 'package:datasets':
##
## precip
rbern(n = 10, prob = 0.3)
## [1] 0 0 0 1 0 0 0 0 0 1
The binomial distribution formula can be written as:
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
where:
- \(P(X = k)\) is the probability of getting exactly \(k\) successes in \(n\) trials,
- \(n\) is the total number of trials,
- \(p\) is the probability of success in a single trial,
- \(\binom{n}{k}\) is the binomial coefficient, also known as “n choose k”.
Assumptions: The binary random variable has a fixed probability of success and failure, the outcomes are independent, and the trials are identical.
Application: The binomial distribution is commonly used to model the number of successes in a fixed number of independent trials where the outcome can only be success or failure.
Real-life examples:
- The number of heads in 10 coin tosses.
- Clinical trials: a new drug is tested on 50 patients, and each has a 60% chance of showing improvement. The number of patients showing improvement follows a Binomial distribution with parameters n = 50, p = 0.6.
Example 1:
# Generate 10 binomial random variables with 5 trials and probability of success 0.3
rbinom(n = 10, size = 5, prob = 0.3)
## [1] 2 2 3 2 0 2 1 1 1 0
Example 2: Suppose that an examination consists of six true/false questions, and assume that a student has no knowledge of the subject matter. The probability that the student will guess the correct answer to the first question is 30%. Likewise, the probability of guessing each of the remaining questions correctly is also 30%. We can then compute, for example:
# Parameters
n <- 6    # Number of questions
p <- 0.30 # Probability of guessing correctly
a <- 1 - pbinom(3, n, p)  # P(X > 3): more than 3 correct answers
a
## [1] 0.07047
b <- 1 - pbinom(1, n, p)  # P(X > 1): more than 1 correct answer
b
## [1] 0.579825
c <- pbinom(3, n, p)      # P(X <= 3): at most 3 correct answers
c
## [1] 0.92953
d <- pbinom(4, n, p)      # P(X <= 4): at most 4 correct answers
d
## [1] 0.989065
The Poisson distribution formula is:
\[P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}\]
where:
- \(P(X = k)\) is the probability of observing \(k\) events in a fixed interval,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\lambda\) is the average rate of event occurrences in that interval,
- \(k\) is the number of events that we are interested in,
- \(k!\) is the factorial of \(k\) (the product of all positive integers up to \(k\)).
Assumptions: The events occur independently, the probability of an event occurring is constant over time or space, and the events occur at a fixed rate.
Application: The Poisson distribution is commonly used to model the number of occurrences of rare events such as accidents or defects in a manufacturing process.
Real-life examples:
- The number of car accidents that occur in a city in a given day.
- Number of births in a maternity hospital: a maternity hospital can use the Poisson distribution to predict how many births will occur during the night.
# Generate 10 Poisson random variables with lambda = 2
rpois(n = 10, lambda = 2)
## [1] 4 1 1 0 0 0 5 2 0 4
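The probability mass function dpois() matches the formula above; for example, P(X = 3) when \(\lambda = 2\):
dpois(3, lambda = 2)           # built-in probability mass function
## [1] 0.180447
exp(-2) * 2^3 / factorial(3)   # the formula evaluated directly
## [1] 0.180447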
Example 2: Suppose that 4% of all TVs made by A&B Company in 2000 are defective. If eight of these TVs are randomly selected from across the country and tested, what is the probability that exactly three of them are defective? Assume that each TV is made independently of the others. (The exact answer uses the binomial distribution, as below; when n is large and p is small, the Poisson distribution with \(\lambda = np\) gives a close approximation.)
# Parameters
n <- 8 # Number of TVs tested
p <- 0.04 # Probability a TV is defective
x <- 3 # Number of defective TVs
# Probability of exactly 3 defective TVs
prob_exactly_three <- dbinom(x, n, p)
prob_exactly_three
## [1] 0.002922296
Continuous probability distributions are a fundamental concept in probability theory and statistics. They are used to model random variables that can take on any value within a range of values, as opposed to discrete probability distributions that model random variables that can only take on certain values. Continuous probability distributions are commonly used in various fields, such as finance, engineering, healthcare, and social sciences, to model measurements of time, length, weight, or other continuous variables.
Some of the most commonly used continuous probability distributions include the Uniform, Normal, Exponential, Log-Normal, Gamma, Beta, Chi-Square, Student’s t, F, Weibull, Cauchy, Laplace, Pareto, Gumbel, and Logistic distributions. Each of these distributions has a unique shape and set of parameters that determine its properties, such as mean, variance, skewness, and kurtosis.
Understanding continuous probability distributions is essential for statistical analysis and data modeling. They are used to estimate probabilities, calculate expected values, and make predictions based on observed data. In addition, the properties of continuous probability distributions, such as their moments and tail behavior, have important implications for statistical inference and hypothesis testing.
In this context, it is important to note that continuous probability distributions are all based on the fundamental concept of probability density functions (PDFs), which describe the probability distribution of a continuous random variable. PDFs are used to calculate the probability of a random variable taking on a specific value or falling within a certain range of values. The area under a PDF curve represents the probability of the random variable falling within that range of values.
Overall, continuous probability distributions play a critical role in modern statistics and are essential for analyzing and interpreting data in a wide range of fields. Understanding their properties, assumptions, and applications is essential for anyone seeking to apply statistical methods to real-world problems.
The probability density function of a continuous uniform distribution can be expressed as follows:
\[f(x|a, b) = \frac{1}{b-a} \text{ for } a \leq x \leq b\]
where:
- \(f(x|a, b)\)is the probability
density function of the uniform distribution between \(a\) and \(b\),
- \(a\) is the lower limit of the
distribution,
- \(b\) is the upper limit of the
distribution.
This formula denotes that for a continuous uniform distribution over the interval \([a, b]\), the probability density is constant over that interval, equal to \(\frac{1}{b-a}\).
Assumptions: The continuous random variable has a fixed interval, and all values within that interval are equally likely.
Application: The uniform distribution is commonly used to model outcomes where all values within an interval are equally likely, such as the roll of a fair die or the arrival time of a customer at a store.
Real-life example: The arrival time of a customer at a store between 9:00 am and 10:00 am.
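In R, dunif(), punif() and runif() handle the uniform distribution; a small illustration for the arrival-time example, measuring time in minutes after 9:00 am:
dunif(30, min = 0, max = 60)  # density is constant: 1/(b - a) = 1/60
## [1] 0.01666667
punif(15, min = 0, max = 60)  # P(arrival within the first 15 minutes)
## [1] 0.25
runif(5, min = 0, max = 60)   # 5 simulated arrival times (output is random)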
The probability density function of the normal (Gaussian) distribution can be represented as:
\[f(x|\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
where:
- \(f(x|\mu,\sigma^2)\) is the probability density function of the normal distribution with mean \(\mu\) and variance \(\sigma^2\),
- \(\mu\) is the mean of the distribution,
- \(\sigma^2\) is the variance of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\pi\) is the mathematical constant pi (\(\approx 3.14159\)).
This formula describes how the probability density function of a normal distribution varies as a function of \(x\), \(\mu\), and \(\sigma\).
Assumptions: The continuous random variable has a mean and standard deviation, and the values are normally distributed around the mean.
Application: The normal distribution is commonly used to model outcomes where the values are normally distributed, such as the height of a population or the scores on a standardized test.
Real-life example: The height of a population of people.
# Generate 10 normal random variables with mean 0 and standard deviation 1
rnorm(n = 10, mean = 0, sd = 1)
## [1] 0.77185278 -0.52840536 -0.18737593 0.34490423 -0.43188270 0.02726214
## [7] 0.70013407 0.65812203 2.55559330 0.62398999
Example 2:
Draw two random samples of size 100 from the distributions \(N(0, 0.5^2)\) and \(N(2, 0.5^2)\) and produce the histograms of these samples on one page. (a) Use the function rnorm() to generate the samples in the following way:
#x <- rnorm(sample size, mean ,standard deviation) #general call of rnorm
x1 <- rnorm(100,0,0.5)
x2 <- rnorm(100,2,0.5)
par(mfrow=c(1,2)) # arrange two figures side by side on one page
hist(x1)
hist(x2)
The probability density function of the exponential distribution can be expressed as:
\[f(x|\lambda) = \lambda e^{-\lambda x} \text{ for } x \geq 0\]
where:
- \(f(x|\lambda)\) is the probability density function of the exponential distribution with rate parameter \(\lambda\),
- \(\lambda\) is the rate parameter of the distribution, often representing the average number of events in a unit time,
- \(e\) is Euler’s number \(\approx 2.71828\).
This formula describes how the probability density function of the exponential distribution changes with \(x\) and the rate parameter \(\lambda\).
Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.
Application: The exponential distribution is commonly used to model the time between occurrences of rare events such as accidents or defects in a manufacturing process.
Real-life example: The time between car accidents on a particular road.
# Generate 10 exponential random variables with rate parameter 0.5
rexp(n = 10, rate = 0.5)
## [1] 6.9879500 1.2832438 1.5573691 2.2863128 2.0873891 1.9828075 4.9821932
## [8] 1.3043925 0.7218954 2.7616857
The probability density function of the gamma distribution can be represented as:
\[f(x|k,\theta) = \frac{1}{\Gamma(k) \theta^k} x^{k-1} e^{-\frac{x}{\theta}} \text{ for } x > 0\]
where:
- \(f(x|k,\theta)\) is the probability density function of the gamma distribution with shape parameter \(k\) and scale parameter \(\theta\),
- \(\Gamma(k)\) is the gamma function evaluated at \(k\),
- \(x\) is the random variable,
- \(k\) is the shape parameter that determines the shape of the distribution,
- \(\theta\) is the scale parameter that influences the spread of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\).
This formula describes how the probability density function of the gamma distribution varies with the random variable \(x\), shape parameter \(k\), and scale parameter \(\theta\).
Assumptions: The events occur independently at a constant average rate; for integer shape \(k\), the gamma distribution describes the waiting time until the \(k\)th event, i.e., the sum of \(k\) independent exponential waiting times.
Application: The gamma distribution is commonly used to model the time to complete a task that is composed of multiple independent sub-tasks, each with a known time to completion.
Real-life example: The time to complete a software development project that is composed of multiple independent sub-tasks.
# Generate 10 gamma random variables with shape parameter 2 and rate parameter 0.5
rgamma(n = 10, shape = 2, rate = 0.5)
## [1] 7.643492 2.673613 4.857552 6.438503 9.477041 3.351033 6.038383 4.489101
## [9] 6.183145 2.370551
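Note that rgamma() is parameterized here by rate = \(1/\theta\). For integer shape \(k\), a gamma variable is the sum of \(k\) independent exponential waiting times; a small simulation sketch illustrates this (the sample sizes are arbitrary):
# Sum of k = 2 independent Exponential(0.5) variables behaves like Gamma(2, rate = 0.5)
set.seed(1)
sums <- rexp(100000, rate = 0.5) + rexp(100000, rate = 0.5)
c(mean(sums), mean(rgamma(100000, shape = 2, rate = 0.5))) # both near k/rate = 4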
The chi-square (\(\chi^2\)) test is a nonparametric test used to test hypotheses about the distribution of frequencies across categories of data. The sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
The probability density function of the chi-squared distribution can be expressed as:
\[f(x|k) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2} \text{ for } x > 0\]
where:
- \(f(x|k)\) is the probability density function of the chi-squared distribution with \(k\) degrees of freedom,
- \(\Gamma(\cdot)\) is the gamma function,
- \(x\) is the random variable,
- \(k\) is the degrees of freedom parameter.
This formula describes how the probability density function of the chi-squared distribution changes with the random variable \(x\) and the degrees of freedom parameter \(k\).
Assumptions: The continuous random variable is the sum of squared standard normal random variables.
Application: The chi-squared distribution is commonly used in hypothesis testing (for example, goodness-of-fit and independence tests) and in confidence interval estimation for a population variance.
Real-life example: The sum of squared deviations from the mean in a sample of 100 measurements.
# Generate 10 chi-squared random variables with 5 degrees of freedom
rchisq(n = 10, df = 5)
## [1] 3.611192 2.541391 1.557380 3.992988 2.903059 3.744683 4.553366
## [8] 5.548837 5.276927 13.165022
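The defining property stated in the assumptions can be checked by simulation; a minimal sketch (sample size arbitrary):
# Sum of k = 5 squared N(0,1) draws behaves like a chi-squared with 5 df
set.seed(1)
k <- 5
z2 <- colSums(matrix(rnorm(100000 * k), nrow = k)^2)
c(mean(z2), mean(rchisq(100000, df = k))) # both near k = 5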
Definition: The F distribution is a probability distribution for a continuous random variable, used to model the ratio of two independent chi-squared random variables, each divided by its degrees of freedom.
Assumptions: The continuous random variable is the ratio of two independent chi-squared random variables, each scaled by its degrees of freedom.
Application: The F distribution is commonly used in hypothesis testing and confidence interval estimation when comparing the variances of two populations.
Real-life example: The ratio of the variances of two sets of data.
# Generate 10 F random variables with numerator degrees of freedom 5 and denominator degrees of freedom 10
rf(n = 10, df1 = 5, df2 = 10)
## [1] 0.1924399 0.9693179 0.7488310 1.1828210 1.8700149 1.0530074 2.6654673
## [8] 1.3521161 0.4974382 0.1966812
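The construction of the F distribution can likewise be checked by simulation; a minimal sketch:
# (X1/df1)/(X2/df2) with independent chi-squared X1, X2 behaves like F(df1, df2)
set.seed(1)
f_manual <- (rchisq(100000, df = 5)/5) / (rchisq(100000, df = 10)/10)
c(mean(f_manual), mean(rf(100000, df1 = 5, df2 = 10))) # both near df2/(df2 - 2) = 1.25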
Parametric tests are used when the data can be assumed to follow a normal distribution (or another specified distribution); their test statistics rely on that distributional assumption.
Examples include:
The t-test is a statistical test used to determine whether the difference between the responses of two groups is statistically significant. It is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known (typically, the scaling term is unknown and is therefore a nuisance parameter).
When the scaling term is estimated based on the data, the test statistic—under certain conditions—follows a Student’s t distribution. The t-test’s most common application is to test whether the means of two populations are significantly different. In many cases, a Z-test will yield very similar results to a t-test because the latter converges to the former as the size of the dataset increases.
T-test: used to test hypotheses about population means.
The key assumption of the t-test is normality of the populations from which the random samples are drawn.
There are three types of t-tests, each serving a different purpose:
One-sample t-test: compares the mean of a population with a known or hypothesized value \(\mu_0\). The hypotheses to be tested are:
- \(H_0\): \(\mu = \mu_0\)
- \(H_a\): \(\mu \neq \mu_0\)
The test statistic is given by:
\[t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\]
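A minimal one-sample sketch on hypothetical data shows t.test() and the formula agreeing (the values below are invented for illustration):
# One-sample t-test of H0: mu = 25 on hypothetical measurements
x <- c(26.1, 24.8, 25.9, 27.2, 24.5, 26.8)
t.test(x, mu = 25)
# The test statistic computed from the formula above matches t.test()'s t value
(mean(x) - 25) / (sd(x)/sqrt(length(x)))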
# Independent two-sample t-test: compare mean leaf length between two varieties
wtdat <- read.csv('~/Independent sample T-test.csv')
t.test(Leaf.length ~ varaity, var.equal = TRUE, data = wtdat) # 'varaity' is the column name in the data file
##
## Two Sample t-test
##
## data: Leaf.length by varaity
## t = -2.8527, df = 16, p-value = 0.01152
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.8672012 -0.1277988
## sample estimates:
## mean in group 1 mean in group 2
## 27.5625 28.0600
y1<-rnorm(100,0,1)
y2<-rnorm(57,2,1)
boxplot(y1,y2)
Here we know the variances are equal because we generated the data that way. Repeat the t-test; note that by default t.test() does not assume equal variances and performs the Welch test:
t.test(y1,y2)
##
## Welch Two Sample t-test
##
## data: y1 and y2
## t = -11.54, df = 139.39, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.173409 -1.537620
## sample estimates:
## mean of x mean of y
## 0.01300588 1.86852028
The Welch Two Sample t-test is a statistical method used to compare the means of two groups, \(y_1\) and \(y_2\), when their variances are not assumed equal. The test statistic is t = -11.54, indicating a significant difference between the groups. The degrees of freedom (df = 139.39) are calculated using Welch’s formula, adjusted for unequal variances. The p-value is extremely small, providing strong evidence against the null hypothesis that the true difference in means is 0. The confidence interval for the difference in means lies entirely below 0, and the sample means confirm that \(y_2\) is substantially higher than \(y_1\). In conclusion, the Welch Two Sample t-test indicates a statistically significant difference in means, with \(y_2\) significantly higher than \(y_1\).
The paired samples t-test: used to compare the mean difference for the same subjects before and after a treatment. It tests whether the average difference is 0 when measurements are made on the same group (subjects) on two different occasions or under two conditions.
Example: pre- and post-test results of the same subjects.
yld<-read.csv("~/Paired sample T test.csv", header=TRUE, na.strings="NA", dec=".")
# Save the data in two different vector
before <- yld$before
after <- yld$after
# Compute t-test
rt <- t.test(before, after, paired = TRUE)
rt
##
## Paired t-test
##
## data: before and after
## t = -5.7654, df = 5, p-value = 0.002205
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -14.00080 -5.36587
## sample estimates:
## mean difference
## -9.683333
The one-way ANOVA model can be written as:
\[Y_{ij} = \mu_i + \varepsilon_{ij}\]
Where:
- \(Y_{ij}\) represents the observation of the \(j\)th sample in the \(i\)th group,
- \(\mu_i\) is the mean of the \(i\)th group: a fixed but unknown parameter to be estimated,
- \(\varepsilon_{ij}\) is the random error, assumed to follow a normal distribution with constant variance.
Model assumptions are: independence of the observations, normality of the residuals, and homogeneity of variance (equal variance across groups).
Violation of Assumptions
If assumptions are violated, consider transformations of the data, using a non-parametric alternative (e.g., the Kruskal-Wallis test), or employing a robust ANOVA method.
duodenal<-read.table("~/duodenal.txt", header = FALSE, na.strings = "NA", dec = ".")
names(duodenal) <- c("Trypsin", "Protein")
attach(duodenal)
dim(duodenal)
## [1] 28 2
Fit.aov<-aov(Protein~Trypsin)
summary(Fit.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Trypsin 2 19.63 9.814 1.265 0.3
## Residuals 25 193.99 7.760
Histograms by group
par(mfrow=c(2,2))
hist(Protein[Trypsin=="ls50"],col=0)
hist(Protein[Trypsin=="b51_1000"],col=0)
hist(Protein[Trypsin=="ge1000"],col=0)
QQ normal plots by group
par(mfrow=c(2,2))
qqnorm(Protein[Trypsin=="ls50"])
qqnorm(Protein[Trypsin=="b51_1000"])
qqnorm(Protein[Trypsin=="ge1000"])
par(mfrow=c(1,2))
boxplot(split(Protein,Trypsin))
tapply(Protein,Trypsin,mean)
## b51_1000 ge1000 ls50
## 5.410000 5.944444 3.933333
Diagnostic plot
par(mfrow=c(2,2))
qqnorm(Fit.aov$resid)
hist(Fit.aov$resid,col=0)
boxplot(split(Fit.aov$resid,Trypsin))
Bartlett’s test
bartlett.test(Protein ~ Trypsin,data = duodenal)
##
## Bartlett test of homogeneity of variances
##
## data: Protein by Trypsin
## Bartlett's K-squared = 1.5415, df = 2, p-value = 0.4627
Levene’s test
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:DAAG':
##
## vif
attach(duodenal)
## The following objects are masked from duodenal (pos = 5):
##
## Protein, Trypsin
leveneTest(Protein, Trypsin)
## Warning in leveneTest.default(Protein, Trypsin): Trypsin coerced to factor.
|       | Df | F value   | Pr(>F)    |
|-------|----|-----------|-----------|
| group | 2  | 0.6844803 | 0.5135556 |
|       | 25 | NA        | NA        |
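Both tests above are consistent with equal variances; had they not been, the Kruskal-Wallis test mentioned earlier would be the usual non-parametric fallback. A minimal sketch on the same data:
# Non-parametric alternative to the one-way ANOVA on the same variables
kruskal.test(Protein ~ Trypsin, data = duodenal)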
Two-way ANOVA is a statistical technique used to evaluate the effect of two categorical independent variables (factors) and their interaction on a continuous dependent variable. It extends one-way ANOVA by including a second factor, allowing for the analysis of interaction effects between the two factors.
\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]
Where:
- \(Y_{ijk}\) represents the observation in the cell defined by levels \(i\) and \(j\) of the two factors,
- \(\mu\) is the overall mean,
- \(\alpha_i\) is the effect of the \(i\)th level of factor A,
- \(\beta_j\) is the effect of the \(j\)th level of factor B,
- \((\alpha\beta)_{ij}\) is the interaction effect between the \(i\)th level of factor A and the \(j\)th level of factor B,
- \(\varepsilon_{ijk}\) is the random error term.
Two-Way ANOVA Table
Model Definition \[Y_{ijk} = \mu + \tau_i + \beta_j + \gamma_{ij} + e_{ijk}\]
Decomposition of Sum of Squares \[SS_{\text{TOTAL}} = SS_A + SS_B + SS_{AB} + SS_{\text{Error}}\]
Assumptions \[Y_{ijk} \sim N(\mu_{ij}, \sigma^2), \quad e_{ijk} \sim N(0, \sigma^2)\] where \(\mu_{ij} = \mu + \tau_i + \beta_j + \gamma_{ij}\).
Hypotheses
Main Effects for Factor A \[H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0 \quad \text{versus} \quad H_a: \tau_i \neq 0 \ \text{for some } i.\]
Main Effects for Factor B \[H_0: \beta_1 = \beta_2 = \ldots = \beta_b = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0 \ \text{for some } j.\]
Interaction Effects \[ H_0: \gamma_{ij} = 0 \ \text{for all } i, j \quad \text{versus} \quad H_a: \gamma_{ij} \neq 0 \ \text{for some } i, j. \]
##Two way ANOVA###
trypData=read.delim2("~/Trypanosomosis.txt")
str(trypData)
## 'data.frame': 12 obs. of 4 variables:
## $ Breed : chr "Boran" "Boran" "Boran" "Boran" ...
## $ Drug : chr "Berenil" "Berenil" "Berenil" "Samorin" ...
## $ PCV_before: num 18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
## $ PCV_after : num 26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
trypData$PCV_diff <-trypData$PCV_after-trypData$PCV_before
str(trypData)
## 'data.frame': 12 obs. of 5 variables:
## $ Breed : chr "Boran" "Boran" "Boran" "Boran" ...
## $ Drug : chr "Berenil" "Berenil" "Berenil" "Samorin" ...
## $ PCV_before: num 18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
## $ PCV_after : num 26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
## $ PCV_diff : num 7.9 7.8 5.6 13.8 11.9 13.5 7 9.4 7.6 15.9 ...
library(ggplot2)
ggplot(trypData,aes(y = PCV_diff, x = Drug, fill = Breed)) +geom_boxplot(aes(fill = Breed))
ft = aov(PCV_diff~Breed * Drug, data = trypData)
print(ft)
## Call:
## aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
##
## Terms:
## Breed Drug Breed:Drug Residuals
## Sum of Squares 0.44083 89.10750 0.80083 23.99333
## Deg. of Freedom 1 1 1 8
##
## Residual standard error: 1.73181
## Estimated effects may be unbalanced
summary(ft)
## Df Sum Sq Mean Sq F value Pr(>F)
## Breed 1 0.44 0.44 0.147 0.711420
## Drug 1 89.11 89.11 29.711 0.000608 ***
## Breed:Drug 1 0.80 0.80 0.267 0.619316
## Residuals 8 23.99 3.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(ft,type="means")
## Tables of means
## Grand mean
##
## 10.275
##
## Breed
## Breed
## Boran Holstein
## 10.083 10.467
##
## Drug
## Drug
## Berenil Samorin
## 7.55 13.00
##
## Breed:Drug
## Drug
## Breed Berenil Samorin
## Boran 7.100 13.067
## Holstein 8.000 12.933
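The cell means above can also be inspected visually: near-parallel profiles suggest little interaction, consistent with the non-significant Breed:Drug term. A hedged sketch using base R's interaction.plot():
# Interaction plot of the Breed-by-Drug cell means
with(trypData, interaction.plot(factor(Drug), factor(Breed), PCV_diff))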
TukeyHSD(ft)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
##
## $Breed
## diff lwr upr p adj
## Holstein-Boran 0.3833333 -1.922351 2.689017 0.7114204
##
## $Drug
## diff lwr upr p adj
## Samorin-Berenil 5.45 3.144316 7.755684 0.0006082
##
## $`Breed:Drug`
## diff lwr upr p adj
## Holstein:Berenil-Boran:Berenil 0.9000000 -3.6281806 5.428181 0.9173043
## Boran:Samorin-Boran:Berenil 5.9666667 1.4384861 10.494847 0.0124387
## Holstein:Samorin-Boran:Berenil 5.8333333 1.3051527 10.361514 0.0140891
## Boran:Samorin-Holstein:Berenil 5.0666667 0.5384861 9.594847 0.0293847
## Holstein:Samorin-Holstein:Berenil 4.9333333 0.4051527 9.461514 0.0334869
## Holstein:Samorin-Boran:Samorin -0.1333333 -4.6615139 4.394847 0.9996731
plot(TukeyHSD(ft,"Breed"),las=1)
mastitisData =read.csv2("~/Mastitis.csv")
str(mastitisData)
## 'data.frame': 12 obs. of 3 variables:
## $ Parity : chr "heifer" "heifer" "heifer" "heifer" ...
## $ Dose : chr "high" "high" "medium" "medium" ...
## $ Reduction: num 6.79 3.87 30.03 38.08 53.67 ...
mastitisData$Dose = factor(mastitisData$Dose,levels =c("low", "medium", "high"))
ggplot(mastitisData,aes(y = Reduction, x = Dose, fill = Parity))+geom_point(aes(colour = Parity), size = 6)
fm= aov(Reduction ~Parity * Dose, data = mastitisData)
summary(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Parity 1 627 627 36.79 0.000911 ***
## Dose 2 8758 4379 257.06 1.54e-06 ***
## Parity:Dose 2 385 192 11.30 0.009236 **
## Residuals 6 102 17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(fm,type="means")
## Tables of means
## Grand mean
##
## 39.64
##
## Parity
## Parity
## heifer multiparous
## 32.41 46.87
##
## Dose
## Dose
## low medium high
## 72.62 39.84 6.45
##
## Parity:Dose
## Dose
## Parity low medium high
## heifer 57.86 34.06 5.33
## multiparous 87.40 45.63 7.58
TukeyHSD(fm)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Reduction ~ Parity * Dose, data = mastitisData)
##
## $Parity
## diff lwr upr p adj
## multiparous-heifer 14.45333 8.62264 20.28403 0.000911
##
## $Dose
## diff lwr upr p adj
## medium-low -32.7825 -41.73701 -23.82799 7.34e-05
## high-low -66.1725 -75.12701 -57.21799 1.30e-06
## high-medium -33.3900 -42.34451 -24.43549 6.60e-05
##
## $`Parity:Dose`
## diff lwr upr p adj
## multiparous:low-heifer:low 29.540 13.11411 45.96589 0.0029142
## heifer:medium-heifer:low -23.800 -40.22589 -7.37411 0.0089285
## multiparous:medium-heifer:low -12.225 -28.65089 4.20089 0.1529274
## heifer:high-heifer:low -52.525 -68.95089 -36.09911 0.0001178
## multiparous:high-heifer:low -50.280 -66.70589 -33.85411 0.0001512
## heifer:medium-multiparous:low -53.340 -69.76589 -36.91411 0.0001079
## multiparous:medium-multiparous:low -41.765 -58.19089 -25.33911 0.0004333
## heifer:high-multiparous:low -82.065 -98.49089 -65.63911 0.0000071
## multiparous:high-multiparous:low -79.820 -96.24589 -63.39411 0.0000085
## multiparous:medium-heifer:medium 11.575 -4.85089 28.00089 0.1824850
## heifer:high-heifer:medium -28.725 -45.15089 -12.29911 0.0033811
## multiparous:high-heifer:medium -26.480 -42.90589 -10.05411 0.0051795
## heifer:high-multiparous:medium -40.300 -56.72589 -23.87411 0.0005298
## multiparous:high-multiparous:medium -38.055 -54.48089 -21.62911 0.0007303
## multiparous:high-heifer:high 2.245 -14.18089 18.67089 0.9916016
plot(TukeyHSD(fm,"Dose",order=T))
In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1.
As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a primary school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).
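As a quick illustration of the definition, the coefficient can be computed directly from the covariance and the two standard deviations (the data below are hypothetical):
# r equals the covariance scaled by the product of the standard deviations
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.9, 5.1, 5.2)
cov(x, y) / (sd(x) * sd(y)) # identical to cor(x, y)
cor(x, y)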
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:Rlab':
##
## michelson
## The following object is masked from 'package:DAAG':
##
## hills
data(cats)
str(cats)
## 'data.frame': 144 obs. of 3 variables:
## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
par(mfrow=c(1,2))
with(cats, plot(Bwt, Hwt,col=3))
title(main="Heart Weight (g) vs. Body Weight (kg) of Domestic Cats")
plot(Hwt~Bwt, data=cats, col=4, main="Hwt vs Bwt")
with(cats, cor.test(Bwt, Hwt,method = "pearson",conf.level=0.95))
##
## Pearson's product-moment correlation
##
## data: Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7375682 0.8552122
## sample estimates:
## cor
## 0.8041274
with(cats, cor(Bwt, Hwt))
## [1] 0.8041274
with(cats, cor(Bwt, Hwt))^2
## [1] 0.6466209
###Correlation test###
with(cats, cor.test(Bwt, Hwt))
##
## Pearson's product-moment correlation
##
## data: Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7375682 0.8552122
## sample estimates:
## cor
## 0.8041274
#### Scatterplot details ###
with(cats, plot(Bwt, Hwt, type="n", las=1, xlab="Body Weight in kg", ylab="Heart Weight in g", main="Heart Weight vs. Body Weight of Cats"))
with(cats, points(Bwt[Sex=="F"], Hwt[Sex=="F"], pch=16,col="red"))
with(cats, points(Bwt[Sex=="M"], Hwt[Sex=="M"], pch=17,col="blue"))
Non-parametric tests are used when data is not normally distributed, when the assumptions of parametric tests are violated, or when dealing with ordinal or nominal data. These tests do not assume a particular distribution and make far weaker assumptions than their parametric counterparts.
Examples include:
- Sign test
- Mann–Whitney
- Wilcoxon test
- Chi-square test
- Kruskal-Wallis test
- Kolmogorov-Smirnov test
- Wilcoxon Rank sum test
- Spearman correlation
The Wilcoxon signed-rank test is a non-parametric rank test for statistical hypothesis testing used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples.[1] The one-sample version serves a purpose similar to that of the one-sample Student’s t-test.[2] For two matched samples, it is a paired difference test like the paired Student’s t-test (also known as the “t-test for matched pairs” or “t-test for dependent samples”).
The Wilcoxon test is a good alternative to the t-test when the normal distribution of the differences between paired individuals cannot be assumed. Instead, it assumes the weaker hypothesis that the distribution of the differences is symmetric around a central value, and it tests whether this central value differs significantly from zero. The Wilcoxon test is a more powerful alternative to the sign test because it considers the magnitude of the differences, but it requires this moderately strong assumption of symmetry.
- Alternative: paired t-test
- Application: comparing the medians of two dependent groups when the assumptions of normality or equal variances are not met.
- Real-life example: comparing the pre- and post-treatment scores of patients in a clinical trial when the scores are not normally distributed.
# Wilcoxon signed-rank test
before <- c(8, 9, 10, 12, 14)
after <- c(10, 11, 12, 13, 15)
wilcox.test(before, after, paired=TRUE)
## Warning in wilcox.test.default(before, after, paired = TRUE): cannot compute
## exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: before and after
## V = 0, p-value = 0.05334
## alternative hypothesis: true location shift is not equal to 0
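For two independent (unpaired) samples, the Mann-Whitney test from the list above uses the same function without paired = TRUE; a minimal sketch on hypothetical data:
# Mann-Whitney (Wilcoxon rank-sum) test for two independent groups
groupA <- c(12, 15, 14, 10, 13)
groupB <- c(18, 21, 16, 22, 19)
wilcox.test(groupA, groupB)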
A chi-square test is a statistical test used to determine if there is a significant association between categorical variables. It compares the observed frequencies of occurrences in different categories with the frequencies expected if there were no associations between the variables.
The chi-squared test statistic is given by:
\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]
where \(O_i\) is the observed frequency and \(E_i\) the expected frequency in category \(i\), and \(k\) is the number of categories.
There are two main types of chi-square tests:
Chi-Square Test of Independence: It helps determine whether the variables are independent or if there’s a relationship between them. For example, you might want to know if gender affects voting preference.
Chi-Square Test of Goodness of Fit: This test checks if the sample data fits a population distribution. For example, you might want to see if a die is fair by comparing the observed frequency of each face with the expected frequency if the die were fair.
To ensure the validity of the chi-square test, certain assumptions must be met: the observations must be independent, the data must be counts of categories, and the expected frequency in each cell should be sufficiently large (a common rule of thumb is at least 5).
Chi-square tests are widely used in academia and industry, especially for testing hypotheses about the independence of categorical variables. Some of these practical applications are:
Market Research: Some applications include analyzing if customer preferences for a product vary across different age groups and income levels or determining if different marketing campaigns are equally effective across demographic segments.
Medical Research: Common use cases are studying the association between lifestyle factors (e.g., smoking, exercise) and the incidence of diseases (e.g., lung cancer, heart disease) or evaluating whether different treatment groups show different recovery rates in clinical trials.
Quality Control: Commonly used to examine whether product defects are independent of the manufacturing process or specific production lines and to compare the quality of products from different suppliers to determine if there is a significant difference in defect rates.
Education: The tests are often used to determine if there is a significant difference in pass rates between students from different schools or teaching methods and to evaluate if introducing a new curriculum improves student performance across subjects. It’s worth noting that these are some of the many applications across academia and the industry and can be extended to other domains and fields.
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
for (i in 1:4)
plot(dchisq(1:28, i*3),pch=18,col="blue",main = paste("chi-square density, df =", i*3))
Example 1:
The investigator collected the blood test results of patients from a hospital medical record.
A=25
B=32
AB=18
O=20
Test the null hypothesis that blood types are equally represented for those 95 patients. This is a goodness-of-fit test with equal expected frequencies. The “p” vector does not need to be specified, since equal frequencies is the default.
chisq.test(c(25,32,18,20))
##
## Chi-squared test for given probabilities
##
## data: c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781
chi=chisq.test(c(25,32,18,20))
chi
##
## Chi-squared test for given probabilities
##
## data: c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781
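The statistic can be reproduced by hand from the formula given earlier; with four equally likely categories the expected count in each cell is 95/4 = 23.75:
# Manual chi-squared statistic; matches X-squared = 4.9158 above
obs <- c(25, 32, 18, 20)
expct <- rep(sum(obs)/4, 4)
sum((obs - expct)^2 / expct)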
The null hypothesis cannot be rejected at \(\alpha = 0.05\); there is no evidence that the blood types are unequally represented.
Example 2:
The eye colors of 592 students fall into the following categories:
| Eye Color | Count |
|---|---|
| Brown | 220 |
| Blue | 215 |
| Hazel | 93 |
| Green | 64 |
Suppose a certain physiologist proposes that 50% of students should have brown eyes, 25% blue eyes, 15% hazel eyes, and 10% green eyes. Does the sample at hand agree with the physiologist’s hypothesis?
eyes<-c(220,215,93,64)
chi<-chisq.test(eyes, p=c(.5,.25,.15,.1))
chi
##
## Chi-squared test for given probabilities
##
## data: eyes
## X-squared = 50.432, df = 3, p-value = 6.462e-11
Since the p-value (6.462e-11) is far below 0.05, the null hypothesis is rejected: the observed eye colors do not agree with the physiologist’s proposed proportions.
Example 3:
The following data show the numbers of patients who died and who survived, by sex.
data<-c(4298,767,7136,643)
data.tab<-matrix(data,byrow=T,nrow=2)
data.tab
## [,1] [,2]
## [1,] 4298 767
## [2,] 7136 643
dimnames(data.tab)=list("Sex"=c("F","M"),"Died"=c("Alive","died"))
data.tab
## Died
## Sex Alive died
## F 4298 767
## M 7136 643
chisq.test(data.tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data.tab
## X-squared = 147.76, df = 1, p-value < 2.2e-16
chisq.test(data.tab)$expected
## Died
## Sex Alive died
## F 4508.97 556.0301
## M 6925.03 853.9699
chisq.test(data.tab)$residuals
## Died
## Sex Alive died
## F -3.141825 8.946877
## M 2.535186 -7.219370
Example 4:
Let us examine data on students’ smoking habits and exercise levels.
data.st<-c(7,6,6,9,7,5,12,5,5,87,84,18)
data.matrix<-matrix(data.st , byrow=T, nrow=4)
dimnames(data.matrix)=list("Exe"=c("heavy","regula","occas","never"),"Smoke"=c("freq","some","none"))
data.matrix
## Smoke
## Exe freq some none
## heavy 7 6 6
## regula 9 7 5
## occas 12 5 5
## never 87 84 18
chisq.test(data.matrix)
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: data.matrix
## X-squared = 13.633, df = 6, p-value = 0.03402
As the p-value 0.03402 is less than the .05 significance level, we reject the null hypothesis that the smoking habit is independent of the exercise level of the students. (The warning above arises because some expected cell counts are below 5, so the chi-squared approximation may be inaccurate.)
chisq.test(data.matrix)$expected
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
## Smoke
## Exe freq some none
## heavy 8.705179 7.721116 2.573705
## regula 9.621514 8.533865 2.844622
## occas 10.079681 8.940239 2.980080
## never 86.593625 76.804781 25.601594
chisq.test(data.matrix)$residual
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
## Smoke
## Exe freq some none
## heavy -0.57793792 -0.6193983 2.135725
## regula -0.20036837 -0.5250663 1.277942
## occas 0.60485311 -1.3177955 1.170093
## never 0.04367003 0.8210127 -1.502350
Example 5:
We apply the test to a real dataset: the Anemia Levels in Nigeria dataset, which can be downloaded from Kaggle. The dataset comes from the 2018 Nigeria Demographic and Health Surveys (NDHS). It explores the impact of mothers’ age and socioeconomic factors on anemia levels among children aged 0–59 months across Nigeria’s 36 states and the Federal Capital Territory.
# Load the necessary libraries
#install.packages('readr')
library(readr)
# Load the dataset from the CSV file
dataset <- read_csv("C:/Users/abdik/Desktop/DMA/children_anaemia.csv")
## New names:
## Rows: 33924 Columns: 18
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (13): Age in 5-year groups, Type of place of residence, Highest educatio... dbl
## (5): ...1, Births in last five years, Age of respondent at 1st birth, H...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Display the first few rows of the dataset
head(dataset)
| …1 | Age in 5-year groups | Type of place of residence | Highest educational level | Wealth index combined | Births in last five years | Age of respondent at 1st birth | Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal) | Anemia level | Have mosquito bed net for sleeping (from household questionnaire) | Smokes cigarettes | Current marital status | Currently residing with husband/partner | When child put to breast | Had fever in last two weeks | Hemoglobin level adjusted for altitude (g/dl - 1 decimal) | Anemia level.1 | Taking iron pills, sprinkles or syrup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40-44 | Urban | Higher | Richest | 1 | 22 | NA | NA | Yes | No | Living with partner | Staying elsewhere | Immediately | No | NA | NA | Yes |
| 1 | 35-39 | Urban | Higher | Richest | 1 | 28 | NA | NA | Yes | No | Married | Living with her | Hours: 1 | No | NA | NA | No |
| 2 | 25-29 | Urban | Higher | Richest | 1 | 26 | NA | NA | No | No | Married | Living with her | Immediately | No | NA | NA | No |
| 3 | 25-29 | Urban | Secondary | Richest | 1 | 25 | 95 | Moderate | Yes | No | Married | Living with her | 105 | No | 114 | Not anemic | No |
| 4 | 20-24 | Urban | Secondary | Richest | 1 | 21 | NA | NA | Yes | No | No longer living together/separated | NA | Immediately | No | NA | NA | No |
| 5 | 30-34 | Urban | Higher | Richest | 1 | 30 | 113 | Mild | Yes | No | Married | Living with her | NA | No | 119 | Not anemic | No |
# Rename a column
colnames(dataset)[colnames(dataset) == "Anemia level...8"] <- "Anemia level"
# Display the column names
colnames(dataset)
## [1] "...1"
## [2] "Age in 5-year groups"
## [3] "Type of place of residence"
## [4] "Highest educational level"
## [5] "Wealth index combined"
## [6] "Births in last five years"
## [7] "Age of respondent at 1st birth"
## [8] "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)"
## [9] "Anemia level"
## [10] "Have mosquito bed net for sleeping (from household questionnaire)"
## [11] "Smokes cigarettes"
## [12] "Current marital status"
## [13] "Currently residing with husband/partner"
## [14] "When child put to breast"
## [15] "Had fever in last two weeks"
## [16] "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)"
## [17] "Anemia level.1"
## [18] "Taking iron pills, sprinkles or syrup"
# Install and load the package
#install.packages('dplyr')
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:Rlab':
##
## count
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Select the columns of interest
selected_data <- dataset %>% select(`Highest educational level`, `Anemia level`)
# Create a contingency table for Highest educational level and Anemia level
contingency_table <- table(selected_data[["Highest educational level"]],
selected_data[["Anemia level"]])
# View the contingency table
print(contingency_table)
##
## Mild Moderate Not anemic Severe
## Higher 296 238 591 14
## No education 1450 1769 1874 113
## Primary 608 645 900 42
## Secondary 1240 1322 1972 62
# Perform chi-square test
chi_square_test <- chisq.test(contingency_table)
print(chi_square_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 142.86, df = 9, p-value < 2.2e-16
# Spearman rank correlation
x <- c(10, 20, 30, 40, 50)
y <- c(5, 15, 25, 35, 45)
cor.test(x, y, method="spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 4.4409e-15, p-value = 0.01667
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 1
So far we have seen only some data manipulation and the calculation of summary statistics. R can perform a multitude of statistical analyses, including linear models (lm), generalised linear models (glm), analysis of variance (anova), non-linear mixed-effects models (nlme), generalised additive models (gam), survival analysis (survival), time series analysis (tseries), multivariate analysis (multiv), and many more.
The formula for simple linear regression, where you have one independent variable (predictor) and one dependent variable, can be expressed as follows:
The linear regression model is represented as:
\[Y = \beta_0 + \beta_1 X + \varepsilon\]
where:
- \(Y\) is the dependent variable,
- \(X\) is the independent variable (predictor),
- \(\beta_0\) is the intercept (y-intercept) of the regression line,
- \(\beta_1\) is the slope (coefficient) of the independent variable,
- \(\varepsilon\) is the error term (residuals).
The goal of simple linear regression is to determine the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
library(DAAG)
attach(ais)
## The following object is masked _by_ .GlobalEnv:
##
## sex
## The following objects are masked from ais (pos = 12):
##
## bmi, ferr, hc, hg, ht, lbm, pcBfat, rcc, sex, sport, ssf, wcc, wt
## The following object is masked from cancer:
##
## sex
dim(ais)
## [1] 202 13
names(ais)
## [1] "rcc" "wcc" "hc" "hg" "ferr" "bmi" "ssf" "pcBfat"
## [9] "lbm" "ht" "wt" "sex" "sport"
fit.ais <- lm(wt ~ ht, data=ais)
plot(wt ~ ht, data = ais, ylab = "Weight", xlab = "Height")
x <- ais$ht
y <- fit.ais$fit
lines(x,y)
anova(fit.ais)
|           | Df  | Sum Sq   | Mean Sq     | F value  | Pr(>F) |
|-----------|-----|----------|-------------|----------|--------|
| ht        | 1   | 23769.79 | 23769.79446 | 312.6298 | 0      |
| Residuals | 200 | 15206.35 | 76.03176    | NA       | NA     |
Note: aov() and anova() outputs are different
summary(fit.ais)
##
## Call:
## lm(formula = wt ~ ht, data = ais)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.372 -5.296 -1.196 4.378 38.031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -126.19049 11.39566 -11.07 <2e-16 ***
## ht 1.11712 0.06318 17.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.72 on 200 degrees of freedom
## Multiple R-squared: 0.6099, Adjusted R-squared: 0.6079
## F-statistic: 312.6 on 1 and 200 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(ais$wt,fit.ais$fit,xlab="Observed", ylab="Predicted", main="")
title("Observed Versus Predicted Values")
abline(0,1)
hist(fit.ais$resid,col=0, main="")
title("Histogram for residuals")
qqnorm(fit.ais$resid)
Default plots
par(mfrow=c(2,2))
plot(fit.ais)
Exercise
1. Read the dataset fuel.txt.
2. Fit a simple linear regression model in which the mileage (third column in the dataset) is the response and the weight (the second column in the dataset) of the car is a predictor.
3. Test the hypothesis that the slope is zero.
4. Use the default plots of an lm() object to produce the diagnostic plots. (A solution sketch follows.)
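A hedged solution sketch, assuming fuel.txt is whitespace-delimited with a header row and that the second and third columns hold the car’s weight and mileage respectively (the file layout is an assumption):
# Read the data; column positions are assumed from the exercise wording
fuel <- read.table("~/fuel.txt", header = TRUE)
# Simple linear regression: mileage (column 3) on weight (column 2)
fit.fuel <- lm(fuel[[3]] ~ fuel[[2]])
summary(fit.fuel) # the t-test on the slope coefficient tests H0: beta1 = 0
par(mfrow = c(2, 2))
plot(fit.fuel) # default lm() diagnostic plots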
In multiple linear regression, the model involves more than one independent variable. The general formula for multiple linear regression with \(p\) independent variables can be expressed as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\]
where:
- \(Y\) is the dependent variable,
- \(X_1\), \(X_2\), \(\ldots\), \(X_p\) are the independent variables,
- \(\beta_0\) is the intercept,
- \(\beta_1\), \(\beta_2\), \(\ldots\), \(\beta_p\) are the coefficients associated with each independent variable,
- \(\varepsilon\) is the error term (residuals).
The goal of multiple linear regression is to estimate the coefficients \(\beta_0\), \(\beta_1\), \(\ldots\), \(\beta_p\) that best fit the data. This is typically done by minimizing the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
Example
Y<-c(78.5,74.3 ,104.3,87.6,95.9,109.2,102.7,72.5 ,93.1 ,115.9 ,83.8 ,113.3,109.4)
x1<-c(7 ,1 ,11, 11 ,7 ,11 ,3 ,1 ,2 ,21 ,1 ,11 ,10 )
x2<-c(26, 29,56,31,52,55,71,31,54,47,40,66,68)
x3<-c(6,15, 8,8,6,9,17,22,18,4,23,9 ,8)
x4<-c(60,52,20,47,33,22,6,44,22,26,34,12,12)
Halds<-data.frame(Y,x1,x2,x3,x4)
model<-lm(Y~x1+x2+x3+x4,data=Halds)
summary(model)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4, data = Halds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## x1 1.5511 0.7448 2.083 0.0708 .
## x2 0.5102 0.7238 0.705 0.5009
## x3 0.1019 0.7547 0.135 0.8959
## x4 -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
par(mfrow=c(2,2))
plot(model)
residuals <- residuals(model)
fitted_values <- fitted(model)
# Correlation test
cor.test(fitted_values, residuals)
##
## Pearson's product-moment correlation
##
## data: fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5509853 0.5509853
## sample estimates:
## cor
## 3.829213e-17
model1<-lm(Y~x1+x2,data=Halds)
summary(model1)
##
## Call:
## lm(formula = Y ~ x1 + x2, data = Halds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## x1 1.46831 0.12130 12.11 2.69e-07 ***
## x2 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
model2<-lm(Y~x1+x2+x4,data=Halds)
summary(model2)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x4, data = Halds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
Example 2:
library(car)
attach(mtcars)
## The following object is masked from ais (pos = 3):
##
## wt
## The following object is masked from package:ggplot2:
##
## mpg
## The following object is masked from ais (pos = 13):
##
## wt
head(mtcars)
|                   | mpg  | cyl | disp | hp  | drat | wt    | qsec  | vs | am | gear | carb |
|-------------------|------|-----|------|-----|------|-------|-------|----|----|------|------|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
x <- mtcars$mpg
y <- mtcars$hp
# Pearson correlation test
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
# Fit a model
model11 <- lm(mpg ~ hp + wt, data = mtcars)
summary(model11)
##
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
# Extract residuals and fitted values
# Note: 'model' here still refers to the four-predictor Halds fit above;
# use residuals(model11) / fitted(model11) to diagnose the mtcars model instead
residuals <- residuals(model)
fitted_values <- fitted(model)
# Correlation test
cor.test(fitted_values, residuals)
##
## Pearson's product-moment correlation
##
## data: fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5509853 0.5509853
## sample estimates:
## cor
## 3.829213e-17
# Load ggplot2
library(ggplot2)
# Create Scatter Plot with Regression Line
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = "blue", size = 3) + # Scatter points
geom_smooth(method = "lm", se = TRUE, color = "red") + # Regression line with confidence interval
labs(title = "Scatter Plot of MPG vs Horsepower",
x = "Horsepower (hp)",
y = "Miles Per Gallon (mpg)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Pairwise scatterplots for selected variables
pairs(mtcars[, c("mpg", "hp", "wt")],
main = "Pairwise Scatterplots",
pch = 19, col = "blue")
# Fit a Multiple Regression Model
model <- lm(mpg ~ hp + wt, data = mtcars)
# Plot Residuals vs Fitted Values
plot(model$fitted.values, model$residuals,
main = "Residuals vs Fitted Values",
xlab = "Fitted Values",
ylab = "Residuals",
pch = 19,
col = "darkgreen")
abline(h = 0, col = "red", lwd = 2) # Horizontal line at 0
# Optionally, add a smooth line to detect patterns
lines(lowess(model$fitted.values, model$residuals), col = "blue", lwd = 2)
# Create Residuals vs Fitted Plot
ggplot(model, aes(x = .fitted, y = .resid)) +
geom_point(color = "darkgreen", size = 3) + # Residual points
geom_hline(yintercept = 0, color = "red", linetype = "dashed") + # Horizontal line at 0
geom_smooth(method = "loess", color = "blue") + # Smooth line to detect patterns
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Generate All Diagnostic Plots
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(model)
par(mfrow = c(1, 1)) # Reset to default
# Q-Q Plot Using ggplot2
ggplot(model, aes(sample = .stdresid)) +
stat_qq(color = "purple") +
stat_qq_line(color = "red") +
labs(title = "Normal Q-Q Plot",
x = "Theoretical Quantiles",
y = "Standardized Residuals") +
theme_minimal()
# Load library for heatmap
library(corrplot)
## corrplot 0.92 loaded
# Calculate correlation matrix
cor_matrix <- cor(mtcars)
# Plot heatmap
corrplot(cor_matrix, method = "circle", type = "upper",
tl.col = "black", tl.srt = 45,
title = "Correlation Heatmap", mar = c(0, 0, 1, 0))
Time series analysis is a method for analyzing time series data in order to extract meaningful statistics and other characteristics of the data: what does the previous trend imply, and can it continue? It is concerned with the dependence of current values on past values and provides a systematic approach to the mathematical and statistical questions posed by time correlation. It is used to examine how data values change over time.
The analysis of time series is of great significance not only to economists and business people but also to scientists, astronomers, geologists, sociologists, biologists, and other researchers.
Autoregressive Integrated Moving Average models (ARIMA models) were popularized by “George Box and Gwilym Jenkins” in the early 1970s. ARIMA models are a class of linear models that is capable of representing stationary as well as non-stationary time series. ARIMA models do not involve independent variables in their construction. They make use of the information in the series itself to generate forecasts.
It is one of the most widely implemented methods for analyzing univariate time series data. In order to understand the modeling procedure, it is useful to briefly introduce the basic building blocks. ARIMA models are a specific subset of univariate modeling, in which a time series is expressed in terms of past values of itself (the autoregressive component) plus current and lagged values of a ‘white noise’ error term (the moving average component).
The ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting method that combines autoregressive (AR), differencing (I), and moving average (MA) components. The general formula for an ARIMA model of order \((p, d, q)\) can be expressed as:
\[Y_t = c + \phi_1 Y_{t-1} + \ldots + \phi_p Y_{t-p} + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q} + \varepsilon_t\]
where:
- \(Y_t\) is the time series at time \(t\),
- \(c\) is a constant term,
- \(\phi_1\), \(\ldots\), \(\phi_p\) are the autoregressive coefficients,
- \(\theta_1\), \(\ldots\), \(\theta_q\) are the moving average coefficients,
- \(\varepsilon_t\) is the error term at time \(t\),
- \(Y_{t-1}\), \(Y_{t-2}\), \(\ldots\), \(Y_{t-p}\) are lagged values of the time series,
- \(\varepsilon_{t-1}\), \(\varepsilon_{t-2}\), \(\ldots\), \(\varepsilon_{t-q}\) are lagged values of the error terms,
- \(d\) is the degree of differencing.
The ARIMA model aims to capture the autocorrelation in the time series data by incorporating the autoregressive and moving average terms. The differencing term helps in making the series stationary. The parameters \(p\), \(d\), and \(q\) are chosen based on the characteristics of the time series data.
The ARIMA model makes several assumptions about the underlying time series data, including:
Stationarity: The time series should be stationary, meaning that the statistical properties of the series (such as the mean and variance) do not depend on time. If the series is non-stationary, it may need to be differenced to make it stationary before using the ARIMA model.
Linearity: The relationship between past observations and future predictions should be linear. Nonlinear relationships may require a different type of model.
Independence: The error terms in the ARIMA model should be independent and identically distributed (IID), meaning that each error term should be unrelated to previous error terms and should have the same distribution. If the error terms are not IID, it may indicate that the model is misspecified.
Normality: The error terms should be normally distributed, with a mean of zero and constant variance. If the error terms are not normally distributed, it may indicate that the model is misspecified.
Adequate Sample Size: The ARIMA model assumes that the time series has a sufficient number of observations to estimate the model parameters with reasonable accuracy. A general guideline is to have at least 50 observations in the time series.
These assumptions are important to keep in mind when using the ARIMA model, as violations of these assumptions can lead to inaccurate predictions and biased parameter estimates. It is important to check these assumptions before fitting an ARIMA model to a time series and to take appropriate steps to address any violations.
Example
male_births<-read.csv("~/male.csv")
male<-ts(male_births, start=2000,freq=4)
ts.plot(male, ylab="Male Births", main="Male Live Births")
par(mfrow=c(2,1))
acf(ts(male,freq=4),NULL,main="Acf for actual series")
pp=pacf(ts(male, freq=4),NULL,main="Pacf for actual series")
#Stationary Checking
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(male, alternative="stationary") # ADF (Augmented Dickey-Fuller) test: H0 is non-stationarity; here p-value = 0.762 > 0.05, so the series is not stationary
##
## Augmented Dickey-Fuller Test
##
## data: male
## Dickey-Fuller = -1.5531, Lag order = 4, p-value = 0.762
## alternative hypothesis: stationary
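Since the ADF test does not reject non-stationarity (p = 0.762), first differencing is the usual remedy before fitting an ARIMA model; a minimal sketch (output not shown):
# Difference the series once and re-run the stationarity test
male_diff <- diff(male)
adf.test(male_diff, alternative = "stationary")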
# fit Auto ARIMA Models
library(forecast)
library(ggplot2)
fit_model<-auto.arima(male,ic="aic",seasonal=TRUE,trace=TRUE)
##
## ARIMA(2,0,2)(1,0,1)[4] with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 1595.311
## ARIMA(1,0,0)(1,0,0)[4] with non-zero mean : 1509.133
## ARIMA(0,0,1)(0,0,1)[4] with non-zero mean : 1535.53
## ARIMA(0,0,0) with zero mean : 2152.192
## ARIMA(1,0,0) with non-zero mean : 1521.978
## ARIMA(1,0,0)(2,0,0)[4] with non-zero mean : 1509.428
## ARIMA(1,0,0)(1,0,1)[4] with non-zero mean : Inf
## ARIMA(1,0,0)(0,0,1)[4] with non-zero mean : 1513.905
## ARIMA(1,0,0)(2,0,1)[4] with non-zero mean : Inf
## ARIMA(0,0,0)(1,0,0)[4] with non-zero mean : 1528.147
## ARIMA(2,0,0)(1,0,0)[4] with non-zero mean : 1485.386
## ARIMA(2,0,0) with non-zero mean : 1489.306
## ARIMA(2,0,0)(2,0,0)[4] with non-zero mean : 1485.986
## ARIMA(2,0,0)(1,0,1)[4] with non-zero mean : Inf
## ARIMA(2,0,0)(0,0,1)[4] with non-zero mean : 1486.623
## ARIMA(2,0,0)(2,0,1)[4] with non-zero mean : Inf
## ARIMA(3,0,0)(1,0,0)[4] with non-zero mean : 1486.887
## ARIMA(2,0,1)(1,0,0)[4] with non-zero mean : 1486.866
## ARIMA(1,0,1)(1,0,0)[4] with non-zero mean : 1491.064
## ARIMA(3,0,1)(1,0,0)[4] with non-zero mean : 1488.765
## ARIMA(2,0,0)(1,0,0)[4] with zero mean : Inf
##
## Best model: ARIMA(2,0,0)(1,0,0)[4] with non-zero mean
print(summary(fit_model))
## Series: male
## ARIMA(2,0,0)(1,0,0)[4] with non-zero mean
##
## Coefficients:
## ar1 ar2 sar1 mean
## 0.3129 0.4855 0.2552 7458.7561
## s.e. 0.0848 0.0870 0.1054 176.1787
##
## sigma^2 = 86857: log likelihood = -737.69
## AIC=1485.39 AICc=1486 BIC=1498.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.866228 288.9912 233.0705 -0.1768792 3.109607 0.7421209
## ACF1
## Training set -0.02615957
male_forecasting<-forecast(fit_model,level = c(95),h=5)
male_forecasting
## Point Forecast Lo 95 Hi 95
## 2026 Q1 7468.843 6891.213 8046.472
## 2026 Q2 7500.321 6895.070 8105.571
## 2026 Q3 7411.571 6718.813 8104.329
## 2026 Q4 7476.532 6757.332 8195.733
## 2027 Q1 7461.016 6651.517 8270.515
autoplot(male_forecasting)
Example
Gasoline sales time series after obtaining the contract with the Maroodi-Jeex Region Police.
gasoline<-data.frame(weeks=1:22, sales=c(17,21,19,23,18,16,20,18,22,20,15,22,31,34,31,33,28,32,30,29,34,33))
head(gasoline)
| weeks | sales |
|---|---|
| 1 | 17 |
| 2 | 21 |
| 3 | 19 |
| 4 | 23 |
| 5 | 18 |
| 6 | 16 |
gas_oline<-ts(gasoline$sales, start = 2024, frequency = 52)
print(gas_oline)
## Time Series:
## Start = c(2024, 1)
## End = c(2024, 22)
## Frequency = 52
## [1] 17 21 19 23 18 16 20 18 22 20 15 22 31 34 31 33 28 32 30 29 34 33
library(forecast)
arima_model <- auto.arima(gas_oline)
summary(arima_model)
## Series: gas_oline
## ARIMA(0,1,0)
##
## sigma^2 = 16.86: log likelihood = -59.46
## AIC=120.92 AICc=121.13 BIC=121.96
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.7280455 4.011349 3.455318 1.442402 14.70862 NaN -0.1952415
The ARIMA(0,1,0) model was fitted to the gas_oline time series data representing gasoline sales in the Maroodi-Jeex Region Police after obtaining the contract. Key metrics include an estimated variance of 16.86, a log likelihood of -59.46, an Akaike Information Criterion (AIC) of 120.92, a corrected AIC of 121.13, and a Bayesian Information Criterion (BIC) of 121.96.
Training set error measures show a mean error (ME) of 0.728, indicating the model tends to slightly under-predict. The Root Mean Squared Error (RMSE) is 4.011, the Mean Absolute Error (MAE) is 3.455, the Mean Percentage Error (MPE) is 1.44%, and the Mean Absolute Percentage Error (MAPE) is 14.71%. The MASE is reported as NaN, likely because the series is too short relative to its declared seasonal frequency (52) for the scaling benchmark to be computed. The autocorrelation of the residuals at lag 1 is -0.195, indicating little remaining autocorrelation in the residuals.
Based on these metrics, the ARIMA(0,1,0) model may provide a reasonable fit to the gasoline sales data in the Maroodi-Jeex Region Police, with room for improvement in accuracy. Further analysis and model refinement may be necessary to enhance forecasting performance.
autoplot(arima_model$residuals, main = "Time Series plot of gasoline")
forecast_values <- forecast(arima_model, h = 10)
forecast_values
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2024.423 33 27.73827 38.26173 24.952884 41.04712
## 2024.442 33 25.55880 40.44120 21.619660 44.38034
## 2024.462 33 23.88642 42.11358 19.061987 46.93801
## 2024.481 33 22.47655 43.52345 16.905768 49.09423
## 2024.500 33 21.23442 44.76558 15.006102 50.99390
## 2024.519 33 20.11146 45.88854 13.288672 52.71133
## 2024.538 33 19.07878 46.92122 11.709333 54.29067
## 2024.558 33 18.11759 47.88241 10.239319 55.76068
## 2024.577 33 17.21482 48.78518 8.858653 57.14135
## 2024.596 33 16.36096 49.63904 7.552785 58.44721
autoplot(forecast_values)
The figure depicts a line graph illustrating the ARIMA(0,1,0) model’s forecasts for the gasoline sales data from 2024.0 to 2024.6. The x-axis represents time and the y-axis represents sales. The solid black line shows the observed data, while the shaded blue area shows the prediction intervals around the point forecasts. The title “Forecasts from ARIMA(0,1,0)” indicates that the graph displays the forecasted values generated by the ARIMA model, giving a visual impression of how the forecasts extend the observed gasoline sales over the specified period.
The course Statistical Computing II (R Programming) delved into statistical methodologies and advanced computational techniques using R, preparing students to analyze and interpret complex data. It covered topics like data manipulation, visualization, regression, machine learning, and survival analysis. Students gained hands-on experience in implementing algorithms, writing efficient code, and applying statistical methods to real-world datasets. Emphasis was placed on data wrangling, simulations, and model diagnostics. The course laid the groundwork for building expertise in data science, analytics, and statistical research, allowing students to confidently tackle problems in their academic and professional endeavors using R as a powerful tool.