1 Introduction To R

1.1 The R environment

The R programming environment is a free, open-source software environment and programming language specifically designed for statistical computing and data graphics. It is widely used by data analysts, data scientists, and researchers across various fields, including finance, healthcare, and academia.

1.2 Getting started with R and the online help system

R is a powerful programming language and software environment designed for statistical computing, data analysis, and graphical representation. It is widely used by statisticians, data analysts, and researchers due to its flexibility and extensive libraries.

1.2.1 What is R

R is a language and an environment in which you can perform statistical data analysis and graphics. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

  • It is relatively new and growing rapidly in use
  • It is Completely free software and comes with ABSOLUTELY NO WARRANTY ,
  • It is easily download/installed from Internet
  • it was initially written in the 1990s by Ross Ihaka and Robert Gentleman, and is now supported by an international R-core team (in addition to thousands of people who contribute to R).

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

1.2.3 R and Statistics

Our introduction to the R environment did not mention statistics, yet many people use R as a statistics system. We prefer to think of it of an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. There are about 25 packages supplied with R (called “standard” and “recommended” packages) and many more are available through the CRAN family of Internet sites Visit R Project and elsewhere. More details on packages are given later. Most classical statistics and much of the latest methodology is available for use with R, but users may need to be prepared to do a little work to find it. There is an important difference in philosophy between S (and hence R) and the other main statistical systems. In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.

1.2.4 What can R do?

  • Handle and store data
  • Perform mathematical operations on data points, vectors or matrices
  • Perform basic and advanced statistical analyses through either the preloaded functions or user contributed functions(Creating your own program)
  • Can be used as a simple and effective programming language, which includes conditionals, loops, if-else, etc. (plus any user contributed or created functions)
  • graphical facilities

1.2.5 Starting with R

To start R in the Windows environment double click on the R icon A new screen will appear containing one window The last line to appear will be `>’, a standard prompt to indicate that R is expecting a command. The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

The R Interface
The R Interface

1.2.6 On-line Help

The on-line help gives useful information. Getting used to using it and understanding the help will make it easier to use R. A keyword search is possible using the Search Engine and Keywords link. “help.start()”

To do a keyword search use the function apropos(). For example:

apropos("test")
##  [1] ".valueClassTest"      "ansari.test"          "bartlett.test"       
##  [4] "binom.test"           "Box.test"             "chisq.test"          
##  [7] "cor.test"             "file_test"            "fisher.test"         
## [10] "fligner.test"         "friedman.test"        "kruskal.test"        
## [13] "ks.test"              "mantelhaen.test"      "mauchly.test"        
## [16] "mcnemar.test"         "mood.test"            "oneway.test"         
## [19] "pairwise.prop.test"   "pairwise.t.test"      "pairwise.wilcox.test"
## [22] "poisson.test"         "power.anova.test"     "power.prop.test"     
## [25] "power.t.test"         "PP.test"              "prop.test"           
## [28] "prop.trend.test"      "quade.test"           "shapiro.test"        
## [31] "t.test"               "testInheritedMethods" "testVirtual"         
## [34] "var.test"             "wilcox.test"

Note that: - You need to put the keyword in double quotes (“keyword’’).
- A lot of valuable information about R can be found through:

Classwork

Use the help system to find help on the following topics.

  1. Analysis of Variance (ANOVA)
  2. Trigonometric Functions (e.g.sin, cos)
  3. Mean and median

1.3 Mathematics in R

R can do any types of mathematical manipulation Simple arithmetic Whatever is typed at the prompt is evaluated, and the result is printed.

2+3                  # Addition
## [1] 5
2*4 + 7            # Multiplication and addition
## [1] 15
10/3                 # Division
## [1] 3.333333
5**3               # ** to the power of...
## [1] 125
5^3                 # or we can use ^
## [1] 125

Standard functions that are found on a scientific calculator are available in R, for example:

sqrt(2)                       # Square root
## [1] 1.414214

It also provides \(\pi\) as a constant.

pi
## [1] 3.141593
sin(pi)
## [1] 1.224606e-16
cos(pi)
## [1] -1

Table 1: The Arithmetic Functions in R

Name Operation
sqrt square root
abs absolute value
sin cos tan trigonometric functions (radians)
asin acos atan inverse trigonometric functions
sinh cosh tanh hyperbolic functions
asinh acosh atanh inverse hyperbolic functions
exp log exponential and natural logarithm
log10 common logarithm
gamma lgamma gamma function and its natural log

Table 3: Functions that return a single value

Name Operation
length(x) the number of elements in x
sum(x) the sum of the values of x
mean(x) the mean of the values of x
var(x) the variance of the values of x
sd(x) the standard deviation of the values of x
min(x) the minimum value from the values of x
max(x) the maximum value from the values of x
prod(x) the product of the values of x
range(x) the range of the values of x

1.3.1 Storing values

A value can be stored in a named variable by assigning it with the <- or = symbols, for example:

x <- 5 # assigns 5 to x
x # type x to print value
## [1] 5
y = sqrt(9) # assign the square-root of 9 to y
y
## [1] 3

In fact you can make assignments using ->, for example:

7 -> z # assigns 7 to z
z
## [1] 7

This could be interpreted as z is assigned 7. Now it is possible to do arithmetic with x, y and z, for example:

(x * y) + z
## [1] 22
y^x - z + 6
## [1] 242
w <- x+y
w
## [1] 8

Note that: To delete an object from the list, use the rm() functions. For example

rm(w)
ls() # to list the objects (e.g., variables, data, functions) that you have created
## [1] "x" "y" "z"

1.3.2 Logical Values

R enables computation with Boolean or Logical variables. These take on either the value True or False. You can use conditional tests to generate these values:

Table 2: Mathematical and Logical Comparison Operators in R

Operator Syntax
less than <
less than or equal to <=
exactly equal to ==
greater than or equal to >=
greater than >
not equal to !=
or |
and &
x <- 32
x > 16 # Is x greater than 16?
## [1] TRUE
x <= 16 # Is x less than equal to 16?
## [1] FALSE

Logical values can be stored in variables in the same way as numeric values:

tf <- x>16
tf
## [1] TRUE

1.4 Simple manipulation

Simple manipulation in R typically involves using base R functions or specialized packages like dbplyr from the tidyverse for tasks such as creating new variables, filtering rows, selecting columns, and summarizing data. Generally refers to tasks like selecting, filtering, sorting, and creating new variables within data structures such as vectors and data frames.

1.4.1 Dplyr Introduction

In this section, we will focus on the very basics of data manipulation and transformation using one of the most important packages from the tidyverse: dplyr. dplyr will also help us begin to explore and summarize our data. Make sure you have the Tidyverse loaded in your R environment.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1.4.1.1 What is dplyr

dplyr is one of the core packages of the tidyverse, and is one of the most widely used packages in the R ecosystem. You can think of dplyr as a package that allows for many of the basic data transformation techniques any Actuary will need in order to take raw data and transform it into usable data for an analysis. For those familiar with SQL, you will notice many parallels to SQL statements. Note that SQL knowledge is not necessary for this course, but we may make connections to SQL where appropriate since it is the most widely used language to query and manipulate data.

dplyr itself is almost a ‘sub-language’ of R that requires practice to get the hang of, but once you do, it will make performing complex data transformations simple, succinct, and easy to read.

1.4.1.2 Data Manipulation in R with Dplyr Package

Data manipulation in R involves cleaning, transforming, and organizing data to make it suitable for analysis. It includes tasks like selecting, filtering, sorting, and creating new variables, often done using the dplyr package for simple and efficient operations. The dplyr package provides functions to perform these operations using simple syntax.

Common dplyr Functions
Some of the key functions in the dplyr package used for manipulating data in R are:

Function Description
filter() Produces a subset of a data frame based on conditions.
distinct() Removes duplicate rows from a data frame.
arrange() Reorders the rows of a data frame.
select() Selects specific columns from a data frame.
rename() Renames column (variable) names in a data frame.
mutate() Creates new variables without removing existing ones.
transmute() Creates new variables and drops the original ones.
summarize() Produces summary statistics.

Additionally, there is a seventh function, group_by() that can be used in conjunction with the above functions.

If you pay attention to the bolded text above, you’ll notice that the filter(), arrange() and slice() functions work on rows in your dataset, whereas the select() and mutate() functions work on the columns or fields in your data. The summarize() function is unique in that it works on groups of rows.

1. filter() Method
We use the filter() method to extract rows that meet a given condition.
Syntax:

filter(dataframeName, condition)

  • dataframe: The input data frame to filter.
  • condition: A logical statement used to determine which rows to keep.

Example:\ We will create a sample data frame from which we extract rows where the runs column is greater than 100.

  • library(dplyr): Loads the dplyr package.
  • data.frame(): Creates a sample data frame.
  • filter(): Filters rows based on the condition runs > 100.
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

filter(stats, runs > 100)
player runs wickets
B 200 20
C 408 NA

2. distinct() Method
We use the distinct() method to remove duplicate rows based on one or more columns.

Syntax:
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)

  • dataframe: The input data frame.
  • columns: Optional. Column(s) to consider when identifying duplicates.
  • keep_all: If TRUE, keeps all columns in output; otherwise, keeps only the specified columns.

Example:\ We remove duplicate rows from the stats data frame, first entirely and then based on the player column.

  • distinct(stats): Removes all duplicate rows.
  • distinct(stats, player, .keep_all = TRUE): Keeps only the first occurrence of each player and retains other columns.
stats <- data.frame(player = c('A', 'B', 'C', 'D', 'A', 'A'),
                    runs = c(100, 200, 408, 19, 56, 100),
                    wickets = c(17, 20, NA, 5, 2, 17))

distinct(stats)
player runs wickets
A 100 17
B 200 20
C 408 NA
D 19 5
A 56 2
distinct(stats, player, .keep_all = TRUE) 
player runs wickets
A 100 17
B 200 20
C 408 NA
D 19 5

3. arrange() Method
We use the arrange() method to sort rows in ascending order based on one or more columns.

Syntax:
arrange(dataframeName, columnName)

  • dataframe: The input data frame.
  • column: The column by which to sort the data.

Example:\ We sort the rows of the stats data frame based on the runs column in ascending order.

  • arrange(stats, runs): Orders the rows by values in the runs column.
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

arrange(stats, runs)
player runs wickets
D 19 5
A 100 17
B 200 20
C 408 NA

4. select() Method
We use the select() method to extract specific columns from a data frame.

Syntax:
select(dataframeName, col1,col2,…)

Example:\ Imagine you had no familiarity with the AIS dataset and wanted to sum of skin folds for each individual. You would first need to use the select() function to extract the relevant columns. Let’s first look at all the possible columns in the AIS data using the colnames() function.

library(DAAG)     # Load the package
library(dplyr)   # Load dplyr package
data(ais)         # Load the dataset into memory
colnames(ais)     # Now this works
##  [1] "rcc"    "wcc"    "hc"     "hg"     "ferr"   "bmi"    "ssf"    "pcBfat"
##  [9] "lbm"    "ht"     "wt"     "sex"    "sport"
# Now select the relevant columns
ais<- dplyr::select(ais, rcc, wcc, lbm, bmi, ht, wt, sex, sport)
head(ais)
rcc wcc lbm bmi ht wt sex sport
3.96 7.5 63.32 20.56 195.9 78.9 f B_Ball
4.41 8.3 58.55 20.67 189.7 74.4 f B_Ball
4.14 5.0 55.36 21.86 177.8 69.1 f B_Ball
4.11 5.3 57.18 21.88 185.0 74.9 f B_Ball
4.45 6.8 53.20 18.96 184.6 64.6 f B_Ball
4.10 4.4 53.77 21.04 174.0 63.7 f B_Ball

5. rename() Method
We use the rename() method to change the names of columns in a data frame.

Syntax:

rename(dataframeName, newName=oldName)

ais <- dplyr::rename(ais, Body_Mass_index=bmi)
head(ais)
rcc wcc lbm Body_Mass_index ht wt sex sport
3.96 7.5 63.32 20.56 195.9 78.9 f B_Ball
4.41 8.3 58.55 20.67 189.7 74.4 f B_Ball
4.14 5.0 55.36 21.86 177.8 69.1 f B_Ball
4.11 5.3 57.18 21.88 185.0 74.9 f B_Ball
4.45 6.8 53.20 18.96 184.6 64.6 f B_Ball
4.10 4.4 53.77 21.04 174.0 63.7 f B_Ball

6. mutate() and transmute() Methods
We use mutate() to create new columns while keeping the existing ones.
Syntax:

mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)

ais_mutated <- mutate(ais, 
                      Ht_m = ht / 100,
                      Body_Mass_index = wt / (Ht_m^2))
# Transmute to create BMI and keep only specific columns
ais_subset <- ais %>%
  transmute(
    Sex = sex,
    Sport = sport,
    Body_Mass_index = wt / (ht / 100)^2)

# View the result
head(ais_subset)
Sex Sport Body_Mass_index
f B_Ball 20.55929
f B_Ball 20.67466
f B_Ball 21.85821
f B_Ball 21.88459
f B_Ball 18.95698
f B_Ball 21.03977

7. summarize() Method
We use the summarize() method to reduce a set of values to a single summary value using aggregation functions like sum(), mean(), etc.

Syntax:

summarize(dataframeName, aggregate_function(columnName))

# Summary of the entire dataset
summary(ais)
##       rcc             wcc              lbm         Body_Mass_index
##  Min.   :3.800   Min.   : 3.300   Min.   : 34.36   Min.   :16.75  
##  1st Qu.:4.372   1st Qu.: 5.900   1st Qu.: 54.67   1st Qu.:21.08  
##  Median :4.755   Median : 6.850   Median : 63.03   Median :22.72  
##  Mean   :4.719   Mean   : 7.109   Mean   : 64.87   Mean   :22.96  
##  3rd Qu.:5.030   3rd Qu.: 8.275   3rd Qu.: 74.75   3rd Qu.:24.46  
##  Max.   :6.720   Max.   :14.300   Max.   :106.00   Max.   :34.42  
##                                                                   
##        ht              wt         sex         sport   
##  Min.   :148.9   Min.   : 37.80   f:100   Row    :37  
##  1st Qu.:174.0   1st Qu.: 66.53   m:102   T_400m :29  
##  Median :179.7   Median : 74.40           B_Ball :25  
##  Mean   :180.1   Mean   : 75.01           Netball:23  
##  3rd Qu.:186.2   3rd Qu.: 84.12           Swim   :22  
##  Max.   :209.4   Max.   :123.20           Field  :19  
##                                           (Other):47
# Summarize average RCC and Weight by sex
ais_summary <- ais %>%
  group_by(sex) %>%
  summarise(
    Avg_Weight = mean(wt),
    Avg_RCC = mean(rcc),
    Count = n())

print(ais_summary)
## # A tibble: 2 × 4
##   sex   Avg_Weight Avg_RCC Count
##   <fct>      <dbl>   <dbl> <int>
## 1 f           67.3    4.40   100
## 2 m           82.5    5.03   102

1.4.2 Data types

Understanding the different types of data and how R deals with these data is important. The temptation is to glaze over and skip these technical details, but beware, this can come back to bite you somewhere unpleasant if you don’t pay attention. We’ve already seen an example of this when we tried (and failed) to add two character objects together using the ‘+’ operator.

R has six basic types of data; numeric, integer, logical, complex and character.

  • Numeric: The default data type for numbers in R. It stores decimal or real numbers (e.g., 10.5, 55, 787).
  • Integer: Whole numbers without decimal points. To ensure a value is treated as an integer, you append a capital L to the number (e.g., 1L, 100L).
  • Logical: Also known as boolean, it stores values of either TRUE or FALSE.
  • Character: Used for strings, words, or mixed characters and numbers, enclosed in single or double quotes (e.g., “R is exciting”, “11.5”).
  • Complex: Used for complex numbers with an imaginary part, denoted by i (e.g., 9 + 3i).
  • Raw: Used to store raw bytes and is rarely encountered in general use.

1.5 Data Structure in R

A data structure is a particular way of organizing data in a computer so that it can be used effectively. The idea is to reduce the space and time complexities of different tasks. Data structures in R programming are tools for holding multiple values. R has several built-in data structures, including:

  • Vectors
  • Matrices
  • Arrays
  • Data Frames
  • Lists
  • Factors

1.5.1 Vectors

All the values presented so far have been scalars. R can handle vectors which are a combination of scalars in a single structure.

1.5.1.1 Creating Simple Vectors

The c(…) construct can be used to create vectors, named as “concatenate” in R .That means “treat as data set”. To set up a vector named x, say, consisting of five numbers, namely 2, 4,6,10 and 11, use the R command:

x <- c(2,4,6,10,11)
x
## [1]  2  4  6 10 11

Vectors are the simplest type of object in R. They can easily be created with c, the combined function. There are 3 main types of vectors

  • Numeric vectors
  • Character vectors
  • Logical vectors

Functions that return vectors with the same length

  • To print the rank of the values of x:
order(x)
## [1] 1 2 3 4 5
sort.list(x)
## [1] 1 2 3 4 5
  • To print the values of x in increasing order
sort(x)
## [1]  2  4  6 10 11
x[order(x)]
## [1]  2  4  6 10 11
x[sort.list(x)]
## [1]  2  4  6 10 11
  • To print the reciprocals of the values of x
1/x
## [1] 0.50000000 0.25000000 0.16666667 0.10000000 0.09090909

Numeric Vector: is a single entity consisting of an ordered collection of numbers.

Example: To set up a numeric vector X consisting of 5 numbers ,10,6,3,6,22, we use any one of the following commands:

x<-c(10,6,3,6,22)#OR
x=c(10,6,3,6,22)#OR
assign("x",c(10,6,3,6,22))#OR
c(10,6,3,6,22)->x
y<-c(x,0,x)
x
## [1] 10  6  3  6 22

Character vectors: To set up a character/string vector z consisting of 3 place names use:

z<-c("Canberra","Sydney","Newcastle") #Or
z<-c("Canberra","Sydney","Newcastle")

Common useful escape sequences are:

  • \("\n"\) for new line
  • \("\t"\) for tab
  • \("\b"\) for backspace

Example:

cat("Abebe","\n","zelalem","\n")
## Abebe 
##  zelalem

Indexing Vectors

  • Vectors indices are placed with square brackets”[]“.
  • Vectors can be indexed in any of the following ways
  • Vector of positive integers
  • Vector of negative integers
  • Vector of named items
  • Logical vector
const=c(3.1416,2.7183,1.4142,1.6180)
names(const)=c("pi","euler","sqrt2","golden")
const[c(1,3,4)] # printing the 1st, 2ndand the 4thelements of const.
##     pi  sqrt2 golden 
## 3.1416 1.4142 1.6180
const[c(-1,-2)]
##  sqrt2 golden 
## 1.4142 1.6180
const>2
##     pi  euler  sqrt2 golden 
##   TRUE   TRUE  FALSE  FALSE
const[const>2]
##     pi  euler 
## 3.1416 2.7183

1.5.1.2 Generating Sequences in vectors

Regular Sequences

R can be used to generate vectors sequences. You can use the notation #1:#2 to generate a vector that consists of a sequence, where #1 and #2 are two integer numbers.

1:12
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
4:18
##  [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
-10:3
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
x<- 1:10 # storing sequence on variable x
x
##  [1]  1  2  3  4  5  6  7  8  9 10
y <- 40:1
y
##  [1] 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
## [26] 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

The last example (y) shows that is a vector is too long for one line it simply continues on the following line (using as many lines as are necessary).

Further in the square brackets shows the position at which that line begins ( e.g: [26] mean that the first values on this line are the 26th value).

  1. To generate a repeating sequences, use the rep() functions, for example:
rep(c(1:4), 3) # repeat 1 2 3 4 three times
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4
r <- rep(c(1.2, 6.1, 3.2, 5.5), 2) # repeat vector 1.2, 6.1, 3.2, 5.5 two times
r
## [1] 1.2 6.1 3.2 5.5 1.2 6.1 3.2 5.5
  1. To generate a sequence of real numbers use the function seq(), for example:
seq(4, 6, .25) # generate integer b\n 4 & 6 with increments of 0.25.
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
  • The following examples produce the same as above:
seq(length=9, from=4, to=6)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(from=4, to=6, by=0.25)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
  • To understand the seq() function type the following:
seq(7)
## [1] 1 2 3 4 5 6 7
seq(7, 10)
## [1]  7  8  9 10
seq(7, 10, 0.5)
## [1]  7.0  7.5  8.0  8.5  9.0  9.5 10.0
rep(seq(1, 3, 0.4), 2) # generate integer b\n 1& 3 with increments of 0.4,then repeat it 2-times.
##  [1] 1.0 1.4 1.8 2.2 2.6 3.0 1.0 1.4 1.8 2.2 2.6 3.0
1.5.1.2.1 Vector Arithmetic

We can manipulate vectors in a similar manner to scalars. However, care must be take when doing such things as the results may not be the desired ones.

x <- 12:1 # x is a sequence .....
x
##  [1] 12 11 10  9  8  7  6  5  4  3  2  1

Extracting values of a Vector

If we are interested in only extracting values of the vector, then we can do this using square bracket [ ].

x<- c(1:10)**2 # generate sequence of vector (x) 1 to 10 and square each
x
##  [1]   1   4   9  16  25  36  49  64  81 100
x[6] # extracting the 6th value from vector x above
## [1] 36
x[2:4] # extracting the 2st, 3th and 4th values from x
## [1]  4  9 16
x[c(1,7,9)] # extracting the 1st, 7th and 9th values from x
## [1]  1 49 81
y <- x[c(1,5,8)] # assigning the 1st, 5th and 8th values of x to y.
y
## [1]  1 25 64
x[9:6] # extracting the 9th, 8th ,7thand 6th values from x
## [1] 81 64 49 36
x[c(1:3, 8:10)] # extracting the 1st, 2th ,3thand 8th,9th,10th values
## [1]   1   4   9  64  81 100
x[c(1,2,3,1,2,3,1,2,3,1,2,3)] # repetition of the index...
##  [1] 1 4 9 1 4 9 1 4 9 1 4 9
x[c(8,2,5,10)] # any order you please....
## [1]  64   4  25 100
x[x > 5] # Extract the value of x greater than 5
## [1]   9  16  25  36  49  64  81 100

Excluding value from vector

If you use a negative subscript in your selection procedure, the corresponding numbered element is not included in the return vector. For example

x <- c(3,6,9,12,15,18,21)
x
## [1]  3  6  9 12 15 18 21
x[-4] # exclude the 4th element...
## [1]  3  6  9 15 18 21
x[c(-4,-6)] # exclude 4th and 6th elements...
## [1]  3  6  9 15 21
  • Vectors can only contain entries of the same type: numeric or character; don’t mix them.
  • Characters should be surrounded by Quote “ ”

Example:

x<-c("A","A","A","A","B","B","B")
x
## [1] "A" "A" "A" "A" "B" "B" "B"

More Function on Vector

There are functions that operate on vectors and return useful information:

x <- 5:14
length(x) # Number of elements in x
## [1] 10
max(x) # Largest value in x
## [1] 14
min(x) # Smallest value in x
## [1] 5
sum(x) # Sum of all the values in x
## [1] 95
prod(x) # The product of all the values in x
## [1] 3632428800
mean(x) # The mean of the all values in x
## [1] 9.5
range(x) # Range of vector x
## [1]  5 14
var(x) # The variance of x
## [1] 9.166667
sd(x) # The standard deviation of x
## [1] 3.02765
sqrt(var(x)) # The square root of the variance (i.e. stand. deviation)
## [1] 3.02765
  • Notice that range(x) produces a vector of length two, whereas the other functions produce a scalar. Arithmetic operations on logical values work with true being equal to one, and False being equal to zero. You can count the number of True values in a vector by using sum(x).
x <- 1:10
x>7
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

1.5.2 Matrix

R lets you store data in a two-dimensional matrix.

  • A matrix can be regarded as a generalization of a vector.
  • As with vectors, all the elements of a matrix must be of the same data type.
  • A matrix can be generated in two ways.
  • Method1:Using the function dim:

Example

x <-c(1:8)
dim(x) <-c(2,4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

Method 2: Using the function matrix:

x <-matrix(c(1:8),2,4,byrow=F)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
w<-matrix(c(1,2,40,2,3,9,20,4,60),nrow=3) # create 3-row by 3-coln. matrix
w
##      [,1] [,2] [,3]
## [1,]    1    2   20
## [2,]    2    3    4
## [3,]   40    9   60
w <-matrix(c(1,2,40,2,3,9,20,4,60),3,3)# Similar to the above create 3 by 3
w
##      [,1] [,2] [,3]
## [1,]    1    2   20
## [2,]    2    3    4
## [3,]   40    9   60
x <- matrix(c(2,3,5,7,11,13),ncol=2) # create 3-row by 2-colm matrix
x
##      [,1] [,2]
## [1,]    2    7
## [2,]    3   11
## [3,]    5   13
  • You give matrix a vector of the values, and you need to specify either ncol or nrow to tell the function the size of the matrix. Can you see the labels on the matrix rows and columns? These are like the labels you see when printing a vector. They also tell us how to extract parts of a matrix. You use square brackets with two comma-separated values:
x[2,1] # extract 2nd-row and 1st-coln. element of x
## [1] 3
x[2,2] # extract 2nd-row and 2nd-coln. element of matrix x
## [1] 11
  • If you leave out one number, you get the whole row or column:
x[,1]  # The first column element
## [1] 2 3 5
x[3,]  # The third row element
## [1]  5 13
  • Notice that returning a single row or column produces a vector. You can extract sub-matrices from a matrix by specifying a vector as one of the indices:
x[2:3] # extract Rows 2 and coln. 2 and form matrix
## [1] 3 5
  • You can specify a vector for the rows and columns subscripts to get a piece of the original matrix:
xy<- matrix(1:16,ncol=4)
xy
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
rowSums(xy) # the sum of each rows
## [1] 28 32 36 40
colSums(xy) # the sum of each columns
## [1] 10 26 42 58
mean (xy [, 2]) # to find the mean of matrix-xy in the coln-2
## [1] 6.5
mean (xy [3,]) # to find the mean of matrix-xy in the row-3
## [1] 9

1.5.2.1 Matrix Assignments and Construction

  • By using subscripts you can change values in a matrix:
mx <- matrix(seq(0, 95, length=20), ncol=5)
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   20   40   60   80
## [2,]    5   25   45   65   85
## [3,]   10   30   50   70   90
## [4,]   15   35   55   75   95
mx [3,1] <- -12 # replace 3rd-row and 1st-coln element by -12
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   20   40   60   80
## [2,]    5   25   45   65   85
## [3,]  -12   30   50   70   90
## [4,]   15   35   55   75   95
mx [3,] <- 0 # replace 3rd-row element by 0
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   20   40   60   80
## [2,]    5   25   45   65   85
## [3,]    0    0    0    0    0
## [4,]   15   35   55   75   95
  • You can replace whole rows or columns too:
mx[,2] <- c(3, 6, 9, 12) #replace the 2nd column of mx by 3,6,9,12
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    3   40   60   80
## [2,]    5    6   45   65   85
## [3,]    0    9    0    0    0
## [4,]   15   12   55   75   95
mx[4,] <- c(-1, -2, -3, -4, -5) #replace 4th-row element of mx by -1, -2, -3, -4, -5
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    3   40   60   80
## [2,]    5    6   45   65   85
## [3,]    0    9    0    0    0
## [4,]   -1   -2   -3   -4   -5
  • As with vector arithmetic, the assigned value is repeated to fill out the matrix section if it is not long enough:
mx[,1] <- c(0.5, 0.9)
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   60   80
## [2,]  0.9    6   45   65   85
## [3,]  0.5    9    0    0    0
## [4,]  0.9   -2   -3   -4   -5
  • Extra rows and columns can be added to a matrix using the rbind() and cbind() functions respectively, for example:
cbind(mx, c(0.1, 0.2, 0.3, 0.4)) # Add an extra column
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  0.5    3   40   60   80  0.1
## [2,]  0.9    6   45   65   85  0.2
## [3,]  0.5    9    0    0    0  0.3
## [4,]  0.9   -2   -3   -4   -5  0.4
rbind(mx, 100:104) # Add a an extra row.
##       [,1] [,2] [,3] [,4] [,5]
## [1,]   0.5    3   40   60   80
## [2,]   0.9    6   45   65   85
## [3,]   0.5    9    0    0    0
## [4,]   0.9   -2   -3   -4   -5
## [5,] 100.0  101  102  103  104
cbind(mx,mx) # Add the matrix to itself (column wise).
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  0.5    3   40   60   80  0.5    3   40   60    80
## [2,]  0.9    6   45   65   85  0.9    6   45   65    85
## [3,]  0.5    9    0    0    0  0.5    9    0    0     0
## [4,]  0.9   -2   -3   -4   -5  0.9   -2   -3   -4    -5
rbind(mx, mx) # Add the matrix to itself (row wise).
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   60   80
## [2,]  0.9    6   45   65   85
## [3,]  0.5    9    0    0    0
## [4,]  0.9   -2   -3   -4   -5
## [5,]  0.5    3   40   60   80
## [6,]  0.9    6   45   65   85
## [7,]  0.5    9    0    0    0
## [8,]  0.9   -2   -3   -4   -5
  • You can add rows or columns to the middle of a matrix with clever use of subscripts and rbind() or cbind(), respectively:
cbind(mx[,1:3], c(91,92,93,94), mx[,4])
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   91   60
## [2,]  0.9    6   45   92   65
## [3,]  0.5    9    0   93    0
## [4,]  0.9   -2   -3   94   -4
  • The function t() transposes a matrix (switched the rows and columns).
mx
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5    3   40   60   80
## [2,]  0.9    6   45   65   85
## [3,]  0.5    9    0    0    0
## [4,]  0.9   -2   -3   -4   -5

1.5.3 Arrays in R

  • Arrays: can be considered as a multiply subscripted collection of data entries, for example numeric. Arrays are generalizations of vectors and matrices That means, vectors in the mathematical sense are one-dimensional arrays where as matrices are two-dimensional arrays; higher dimensions are also possible.
  • There are two methods of creating arrays in R

Method 1: Using vectors
- A vector can be used by R as an array only if it has a dimension vector as its dim attribute. - A dimension vector is a vector of non-negative integers. If its length is k then the array is k-dimensional. - An array can be created by giving a vector structure, a dim, which has the form

Example
The following is a (3-dimentional) array with dimension vector c(3,5,100) and a vector of 1500 elements.

z=c(1:1500)
dim(z) <- c(3,5,100)
dim(z)
## [1]   3   5 100

1.5.4 Factors

Factors are the data objects which are used to categorize the data and store it as levels. They are useful for storing categorical data. They can store both strings and integers. They are useful to categorize unique values in columns like (“TRUE” or “FALSE”) or (“MALE” or “FEMALE”), etc.. They are useful in data analysis for statistical modeling.

# Creating factor using factor()
fac = factor(c("Male", "Female", "Male",
               "Male", "Female", "Male", "Female"))

print(fac)
## [1] Male   Female Male   Male   Female Male   Female
## Levels: Female Male

1.5.5 Lists in R

A list is a generic object consisting of an ordered collection of objects. A list is a generic vector containing other objects. Unlike vectors,Lists are heterogeneous data structures it can contain elements of different types, such as numbers, strings, and even other lists. Lists are created using the list() function.

# Creating a list
my_list <- list(
  name = "Jamal",
  age = 30,
  is_student = FALSE,
  scores = c(85, 90, 78),
  address = list(
    street = "123 Main St",
    city = "Anytown",
    zip = "12345"
  )
)
print(my_list)
## $name
## [1] "Jamal"
## 
## $age
## [1] 30
## 
## $is_student
## [1] FALSE
## 
## $scores
## [1] 85 90 78
## 
## $address
## $address$street
## [1] "123 Main St"
## 
## $address$city
## [1] "Anytown"
## 
## $address$zip
## [1] "12345"
empId = c(1, 2, 3, 4)

empName = c("Debi", "Sandeep", "Subham", "Shiba")

numberOfEmp = 4

empList = list(empId, empName, numberOfEmp)

print(empList)
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] "Debi"    "Sandeep" "Subham"  "Shiba"  
## 
## [[3]]
## [1] 4

1.5.6 Data frames in R

Data frame is a data structure which contains more than 1 vector of equal length. A data frame is very much like a matrix, except it is designed for storing statistical or experimental data. Each row represents a unit, and each column a collection of measurements on the units. Each column can store a different type of data, such as numeric or character. Let us take some example:

Example

name= c("Omer", "Faduma", "Ibrahim","Jamal")
age=c(18,22,25,27)
sex=c("F","M","M","F")
stud=data.frame(name, age, sex)
stud
name age sex
Omer 18 F
Faduma 22 M
Ibrahim 25 M
Jamal 27 F
  • Using $ sign we can access each vector from the data frame. For example:
stud$name #access or extract student data only names
## [1] "Omer"    "Faduma"  "Ibrahim" "Jamal"

1.6 Reading(Importing) Data

Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. So far, all data has been entered in R using the keyboard or generated using R functions. In reality, data is usually too large to type in and is stored in files. There are functions used to read data from external files such as Excel, SPSS, Epi-info, Stata and so on.

In R you can import text file swith the function “read.table” the functions read.csv and read.delim are functions to read―comma separated values ‖files period as a decimal and tab delimited files period as a decimal respectively.

R Syntax: used for to read Data

  • read.sav is ued to reads a file or data stored by the SPSS read.sav(file_name())

  • read.csv is ued to reads a file or data stored by CSV read.csv(file.choose())

  • read_dta is ued to reads a file or data stored by STATA read.dta(file_name())

  • read.table is ued to reads a file or data stored by TABLE read.txt(file.choose())

  • The functions read.mtp is functions to read MINITAB files read.mtp(file_name())

  • read_sas is ued to reads a file or data stored by SAS read.as7bdat(file_name())

  • read.epiinfo is ued to reads a file or data stored by EPI-Info read.rec (file_name())

  • The function in R is used to read text files containing tabular data into a data frame read.delim2(file.choose()) .

Example

  • Read CSV data
#R=read.csv(file.choose(),header=T)
  • Read text data
#R=read.table(file.choose(),header=T)

1.7 Exporting data

In R, you can export data to various file formats such as CSV, Excel,Stata, SPSS and others using different functions provided by R packages. Here are some common examples of exporting data in R:

# Create a sample data frame for five Students
data <- data.frame(
  ID = 1:5,
  Name = c("Ali", "Bilan", "Cawale", "Deeqsi", "Omer"),
  Score = c(85, 90, 88, 92, 87))
data
ID Name Score
1 Ali 85
2 Bilan 90
3 Cawale 88
4 Deeqsi 92
5 Omer 87
# Export data frame to a CSV file
write.csv(data, "data.csv", row.names = FALSE)
# Install and load the 'writexl' package for writing Excel files
#install.packages("writexl")
library(writexl)

# Export data frame to an Excel file
data1<-write_xlsx(data, "data1.xlsx")
data1
## [1] "D:\\Rpubs Projects\\Stat Comp II\\rmd\\data1.xlsx"

1.8 WEEK 1 Exercise

Instruction: Submit This Exercise R Scripts only

  1. Write the R code of the following equations

    1. \(\left(\frac{12^3 - 45}{4}\right) + 4x \left(\frac{72}{2.34} - 3\right)\)
    2. \(\log_e(72) \quad \text{or} \quad \ln(72)\)
    3. \(\log_{10}(72)\)
    4. \(e^{1.45} - 2.612\)
    5. \(\cos\left(\frac{\pi}{3}\right)\)
  2. Generate the following sequences

    1. 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0
    2. 10.0 9.8 9.6 9.4 9.2 9.0 8.8 8.6 8.4 8.2 8.0
  3. What is the meaning of rep (letters [1:5], 4))?

  4. Generate 80 values from a standard normal distribution.

  5. Generate 100 values from a binomial distribution of size 23 and probability 0.25.

  6. Generate 100 values from a log normal distribution with mean 12 and standard deviation 2.

  7. Create the following vector sequences:

    1. 1 4 7 10 13 16 19 22 25 28
    2. 1.0 1.2 1.4 1.6 1.8 2.0 1.0 1.2 1.4 1.6 1.8 2.0
    3. 10.0 9.5 9.0 8.5 8.0 7.5 7.0
    4. 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
    5. 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
    6. 1 4 9 16 25 36 49 64 81 100
    7. 1 4 9 16 25 6 14 24 36 50
  8. For each of the sequences above calculate the mean, median, variance and the range.

  9. Generate 15 values from a uniform distribution between 0.7 and 1.

  10. Generate 15 values from a uniform distribution between 0 and 0.3.

  11. Generate 28 values from a uniform distribution between 1.5 and 7.5.

  12. Using the functions runif() and trunc() simulate 20 die throws.

  13. Generate 30 values from a Normal distribution with mean 3.4 and variance 5.25 rounded to 2 decimal places. Calculate the median, range, and the upper and lower quartiles.

  14. Generate 30 values from a Poisson distribution with lambda 5.

  15. Generate 20 values from a Binomial distribution of size 15 and probability of success 0.3.

  16. Generate a randomization list of size 150 for three treatment groups: “Placebo”, “Drug A” and “Drug B”.

  17. Generate 100 values from a standard normal distribution and count the number of values greater than 1.96. Extract these into a vector called signif.100. Do the same again but this time generate 2000 values.

Submission
To submit your exercise, please fill out the form below and upload your R script file. Click Here

2 Control Statements and Programming with functions

2.1 Conditional and Repetitive executions

Control statements are used to control the flow of execution of R commands. They are used to carry out conditional and repetitive executions of R commands. Control statements include conditional statements (if, if-else, ifelse) and loops (for, while, repeat). These control statements are very useful when writing your own functions in R.

In this class we will introduce the R function syntax and in particular the passing of variables into the function, and argument handling.

If you find you have a process that you use a lot, perhaps one that calls several R functions, you can write your own function to do it. Probably one of the most powerful aspects of the R language is the ability of a user to write functions. When doing so, many computations can be incorporated into a single function and (unlike with scripts) intermediate variables used during the computation are local to the function and are not saved to the workspace. In addition, functions allow for input values used during a computation that can be changed when the function is executed.

  • Conditional execution: if(), if()else and ifelse()

Syntax:

  • if(condition) {commands 1}

  • if(condition) {commands 1} else { commands 2} condition must evaluate to a single logical value, (i.e either TRUE or FALSE.)  

  • ifelse (conditions vector, yes vector, no vector )

if Condition in R

This task is carried out only if this condition is returned as TRUE. R makes it even easier: You can drop the word then and specify your choice in an if statement.

if (test_expression) {
statement
}

Example 1:

values <- 1:10
if (sample(values, 1) <= 10) {
  print(paste(sample(values, 1), "is less than or equal to 10"))
}
## [1] "5 is less than or equal to 10"

Example 2:
Write a function that calculate arithmetic mean for grouped data. Group midpoints are (10, 20, 30, 40) and Frequencies for each group (5, 10, 8, 7)

grouped_mean <- function(groups, frequencies) {
if (length(groups) != length(frequencies)) {
stop("Lengths of ’groups’ and ’frequencies’
should be the same.")
}
total_sum <- sum(groups * frequencies)
total_freq <- sum(frequencies)
mean <- total_sum / total_freq
return(mean)
}
groups <- c(10, 20, 30, 40) # Group midpoints
frequencies <- c(5, 10, 8, 7) # Frequencies for each group
result <- grouped_mean(groups, frequencies)
print(result)
## [1] 25.66667

if-else Condition in R
An if. . . else statement contains the same elements as an if statement (see the preceding section), with some extra elements:

  • The keyword else, placed after the first code block.
  • The second block of code, contained within braces, that has to be carried out, only if the result of the condition in the if() statement is FALSE.

if (test_expression) {
statement
} else {
statement
}

x <- 9
if (x > 0) sqrt(x) else sqrt(-x)
## [1] 3
ifelse(x >= 0, sqrt(x), sqrt(-x))
## [1] 3
before<-c(26.4,28.2,22.1,24.5,21.4,27.3)
after<-c(33.9,37.1,36.1,39.5,30.0,31.4)
df<-data.frame(before,after)
df
before after
26.4 33.9
28.2 37.1
22.1 36.1
24.5 39.5
21.4 30.0
27.3 31.4
t_test<-t.test(df)
t_test
## 
##  One Sample t-test
## 
## data:  df
## t = 17.469, df = 11, p-value = 2.27e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  26.06716 33.58284
## sample estimates:
## mean of x 
##    29.825
pvalue <- t_test$p.value
if (pvalue < 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}
## [1] "Reject the null hypothesis"

Loops: for (), while () and repeat ()

Syntax:

  • for ( var in set ) {commands}

  • while ( condition ) {commands}

  • repeat {commands}

Where

  1. the object set is a vector,

  2. commands is a single command or a sequence of commands and

  3. var is a variable which may be used in commands.

Exercise

  1. Write a function that calculate arithmetic mean for grouped data

  2. Consider a function that returns the maximum of two scalars or the statement that they are equal.

  3. Write a function that computes the zero(s) of a quadratic equation.

  4. Write a function that returns the mean of the values in a vector that are greater than the median. Use the median() function. Call your function gtmean.

2.2 Function definition

A function accepts input arguments and produces the output by executing valid R commands that are inside the function. Functions are useful when we want to perform a certain task multiple times. In R Programming Language when we are creating a function the function name and the file in which we are creating the function need not be the same and we can have one or more functions in R.

# name_of_function <- function(argument1, argument2) {
#     statements or code that does something
#     return(something)
# }

In programming, parameters and arguments refer to the values passed into a function. They are often used interchangeably, but there is a subtle difference:

  • Parameters are the variables defined in the function definition.
  • Arguments are the actual values passed to the function when it is called.
    A function can have multiple parameters, and these are separated by commas within the parentheses.

Example1: Let’s try creating a simple example function. This function will take in a numeric value as input, and return the squared value.

square_it <- function(x) {
    square <- x * x
    return(square)
}
square_it(5)  # Calling the function with an argument of 5
## [1] 25
square_it(10)  # Calling the function with an argument of 10
## [1] 100

Example2:

add_num <- function(a,b)
{
  sum_result <- a+b
  return(sum_result)
  }

sum = add_num(35,34)

print(sum)
## [1] 69

Exercise

Write a function called multiply_it, which takes two inputs: a numeric value x, and a numeric value y. The function will return the product of these two numeric values, which is x * y. For example, multiply_it(x = 4, y = 6) will return output 24.

2.3 Scope and its Consequences

In R, scope refers to the visibility and accessibility of variables and functions within different parts of a program. Understanding scope is crucial for writing effective and bug-free code. There are two main types of scope in R: global scope and local scope. - Global Scope: Variables and functions defined in the global environment are accessible from anywhere in the R session. They can be accessed and modified by any function or code block. - Local Scope: Variables and functions defined within a function have local scope. They are only accessible within that function and cannot be accessed from outside the function.

When a function is called, R creates a new environment for that function. This environment contains the local variables and functions defined within the function. If a variable or function is not found in the local environment, R will look for it in the parent environment (the environment where the function was defined) and continue searching up the chain until it reaches the global environment.

2.4 Debugging functions

Debugging is a process of cleaning a program code from bugs to run it successfully. While writing codes, some mistakes or problems automatically appears after the compilation of code are harder to diagnose. So, fixing it takes a lot of time and after multiple levels of calls. Debugging in R is through warnings, messages, and errors. Debugging in R means debugging functions.

R provides a number of tools to help you with debugging your code. The primary tools for debugging functions in R are

  • traceback(): prints out the function call stack after an error occurs; does nothing if there’s no error
  • debug(): flags a function for “debug” mode which allows you to step through execution of a function one line at a time
  • browser(): suspends the execution of a function wherever it is called and puts the function in debug mode
  • trace(): allows you to insert debugging code into a function a specific places
  • recover(): allows you to modify the error behavior so that you can browse the function call stack

These functions are interactive tools specifically designed to allow you to pick through a function. There’s also the more blunt technique of inserting print() or cat() statements in the function.

Editor Breakpoints

Editor Breakpoints can be added in RStudio by clicking to the left of the line in RStudio or pressing Shift+F9 with the cursor on your line. A breakpoint is same as browser() but it doesn’t involve changing codes. Breakpoints are denoted by a red circle on the left side, indicating that debug mode will be entered at this line after the source is run.

traceback() Function

The traceback() function is used to give all the information on how your function arrived at an error. It will display all the functions called before the error occurred called the "call stack" in many languages, R favors calling traceback.

Example:

# # Function 1
# function_1 <- function(a){
#  a + 5
# }
# 
# # Function 2
# function_2 <- function(b) {
#  function_1(b)
# }
# 
# # Calling function
# function_2("s")
# # Call traceback()
# traceback()

For example, you may have a function a() which subsequently calls function b() which calls c() and then d(). If an error occurs, it may not be immediately clear in which function the error occurred. The traceback() function shows you how many levels deep you were when the error occurred.

# mean(x)
# traceback()

The traceback() function must be called immediately after an error occurs. Once another function is called, you lose the traceback.

Here is a slightly more complicated example using the lm() function for linear modeling.

# lm(y ~ x)
# traceback()

You can see now that the error did not get thrown until the 7th level of the function call stack, in which case the eval() function tried to evaluate the formula y ~ x and realized the object y did not exist. Looking at the traceback is useful for figuring out roughly where an error occurred but it’s not useful for more detailed debugging. For that you might turn to the debug() function.

traceback() function displays the error during evaluations. The call stack is read from the function that was run(at the bottom) to the function that was running(at the top). Also we can use traceback() as an error handler which will display error immediately without calling of traceback.

# # Function 1
# function_1 <- function(a){
#  a + 5
# }
# 
# # Function 2
# function_2 <- function(b){
#  function_1(b)
# }
# 
# # Calling error handler
# options(error = traceback)
# function_2("s")

browser() Function
browser() function is inserted into functions to open R interactive debugger. It will stop the execution of function() and you can examine the function with the environment of itself. In debug mode, we can modify objects, look at the objects in the current environment, and also continue executing. 

Example:

# function(a){
#   browser()
#   a + 5
# }

browser[1]> command in consoles confirms that you are in debug mode. Some commands to follow:

  • ls(): Objects available in current environment.
  • print(): To evaluate objects.
  • n: To examine the next statement.
  • s: To examine the next statement by stepping into function calls.
  • where: To print a stack trace.
  • c: To leave debugger and continue with execution.
  • C: To exit debugger and go back to R prompt.

Also, debug() statement automatically inserts browser() statement at the beginning of the function.

recover() Function
recover() statement is used as an error handler and not like the direct statement. In recover(), R prints the whole call stack and lets you select which function browser you would like to enter. Then debugging session starts at the selected location. Example:

# # Calling recover
# options(error = recover)
# 
# # Function 1
# function_1 <- function(a){
#  a + 5
# }
# 
# # Function 2
# function_2 <- function(b) {
#  function_1(b)
# }
# 
# # Calling function
# function_2("s")

3 Descriptive Statistics

In this Secton, we introduced the summary command. This produced a number of summary statistics that, at the time, made little sense. Now that we have discussed summary statistics in the lecture, it becomes possible to work with them. The current chapter describes a number of methods in R for obtaining summary statistics. In addition, we discuss how those statistics should be interpreted.

The functions included in this section perform Descriptive Statistics by quantitatively describing or summarizing different characteristics from a sample. Graphical.

3.1 Summary Statistics

library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:DAAG':
## 
##     lung
attach(cancer)
## The following objects are masked _by_ .GlobalEnv:
## 
##     age, sex
head(cancer)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
3 306 2 74 1 1 90 100 1175 NA
3 455 2 68 1 0 90 90 1225 15
3 1010 1 56 1 0 90 90 NA 15
5 210 2 57 1 1 90 60 1150 11
1 883 2 60 1 0 100 90 NA 0
12 1022 1 74 1 1 50 80 513 0
summary(cancer)
##       inst            time            status           age       
##  Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00  
##  1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00  
##  Median :11.00   Median : 255.5   Median :2.000   Median :63.00  
##  Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45  
##  3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00  
##  Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00  
##  NA's   :1                                                       
##       sex           ph.ecog          ph.karno        pat.karno     
##  Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00  
##  Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00  
##  Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00  
##  Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00  
##                  NA's   :1        NA's   :1        NA's   :3       
##     meal.cal         wt.loss       
##  Min.   :  96.0   Min.   :-24.000  
##  1st Qu.: 635.0   1st Qu.:  0.000  
##  Median : 975.0   Median :  7.000  
##  Mean   : 928.8   Mean   :  9.832  
##  3rd Qu.:1150.0   3rd Qu.: 15.750  
##  Max.   :2600.0   Max.   : 68.000  
##  NA's   :47       NA's   :14
str(cancer)
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...

In the cancer dataset, the sex variable is numeric but represents a categorical variable encoding the biological sex of the patient.

# Convert 'sex' variable to factor with levels 'male' and 'female'
cancer$sex <- factor(cancer$sex, levels = c(1, 2), labels = c("male", "female"))
head(cancer)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
3 306 2 74 male 1 90 100 1175 NA
3 455 2 68 male 0 90 90 1225 15
3 1010 1 56 male 0 90 90 NA 15
5 210 2 57 male 1 90 60 1150 11
1 883 2 60 male 0 100 90 NA 0
12 1022 1 74 male 1 50 80 513 0

Remove NA values from the dataset

cancer1<- na.omit(cancer)
summary(cancer1)
##       inst            time            status           age            sex     
##  Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00   male  :103  
##  1st Qu.: 3.00   1st Qu.: 174.5   1st Qu.:1.000   1st Qu.:57.00   female: 64  
##  Median :11.00   Median : 268.0   Median :2.000   Median :64.00               
##  Mean   :10.71   Mean   : 309.9   Mean   :1.719   Mean   :62.57               
##  3rd Qu.:15.00   3rd Qu.: 419.5   3rd Qu.:2.000   3rd Qu.:70.00               
##  Max.   :32.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00               
##     ph.ecog          ph.karno        pat.karno         meal.cal     
##  Min.   :0.0000   Min.   : 50.00   Min.   : 30.00   Min.   :  96.0  
##  1st Qu.:0.0000   1st Qu.: 70.00   1st Qu.: 70.00   1st Qu.: 619.0  
##  Median :1.0000   Median : 80.00   Median : 80.00   Median : 975.0  
##  Mean   :0.9581   Mean   : 82.04   Mean   : 79.58   Mean   : 929.1  
##  3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00   3rd Qu.:1162.5  
##  Max.   :3.0000   Max.   :100.00   Max.   :100.00   Max.   :2600.0  
##     wt.loss       
##  Min.   :-24.000  
##  1st Qu.:  0.000  
##  Median :  7.000  
##  Mean   :  9.719  
##  3rd Qu.: 15.000  
##  Max.   : 68.000

Dimension of the dataset

dim(cancer1)
## [1] 167  10

After removing NA Values from the data.

print(head(cancer1))
##   inst time status age    sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 2    3  455      2  68   male       0       90        90     1225      15
## 4    5  210      2  57   male       1       90        60     1150      11
## 6   12 1022      1  74   male       1       50        80      513       0
## 7    7  310      2  68 female       2       70        60      384      10
## 8   11  361      2  71 female       2       60        80      538       1
## 9    1  218      2  53   male       1       70        80      825      16

3.2 Frequency Tables

In this Section, we discuss how one can tabulate and visualize single variables.

Example
In a 2-week study of the productivity of workers, the following data were obtained on the total number of acceptable pieces that 100 workers produced:

A. Construct a grouped frequency distribution.
B. Construct a histogram, frequency polygon, and Ogives.

workers<-c(65,  36, 49, 88, 79, 56, 28, 43, 67, 36, 43, 78, 37, 40, 68, 72, 55, 62, 22, 82, 88, 50, 60, 56, 57, 46, 39, 57, 73, 65, 59, 48, 76, 74, 70, 51, 40, 75, 56, 45, 35, 62, 52, 63, 32, 80, 64, 53, 74, 34, 76, 60, 48, 55, 51, 54, 45, 44, 35, 51, 21, 35, 61, 45, 33, 61, 77, 60, 85, 68, 45, 53, 34, 67, 42, 69, 52, 68, 52, 47, 63, 65, 55, 61, 73, 50, 53, 59, 41, 54, 41, 74, 82, 58, 26, 35, 47, 50, 38, 70)
range(workers)
## [1] 21 88
factor.workers<-factor(cut(workers,breaks=nclass.Sturges(workers)))
table(factor.workers)
## factor.workers
## (20.9,29.4] (29.4,37.8] (37.8,46.1] (46.1,54.5] (54.5,62.9] (62.9,71.2] 
##           4          11          15          19          19          14 
## (71.2,79.6] (79.6,88.1] 
##          12           6
workers.dat <- as.data.frame(table(factor.workers))
workers.dat <- transform(workers.dat, cumFreq = cumsum(Freq), relative = prop.table(Freq))
workers.dat
factor.workers Freq cumFreq relative
(20.9,29.4] 4 4 0.04
(29.4,37.8] 11 15 0.11
(37.8,46.1] 15 30 0.15
(46.1,54.5] 19 49 0.19
(54.5,62.9] 19 68 0.19
(62.9,71.2] 14 82 0.14
(71.2,79.6] 12 94 0.12
(79.6,88.1] 6 100 0.06
hist(workers)

par(bg = "gray90")
plot(workers, type = "o", col = "blue", pch = 20)

Example 2:

library(readr)
demo2 <- read_csv("C:/Users/abdik/Desktop/DMA/HU Data Management And Analysis/demo2.csv")
## Rows: 24 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): id, group, pulse, time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(demo2)
id group pulse time
1 1 14 1
1 1 19 2
1 1 29 3
2 1 15 1
2 1 25 2
2 1 26 3
names(demo2)
## [1] "id"    "group" "pulse" "time"
Class<-factor(cut(demo2$pulse,breaks=nclass.Sturges(demo2$pulse)))
table(Class)
## Class
## (9.97,14.2] (14.2,18.3] (18.3,22.5] (22.5,26.7] (26.7,30.8]   (30.8,35] 
##           3           5           4           6           1           5

The nclass.Sturges option in the cut function takes care of the creation of the bins according to Sturges’ formula, which makes life a lot easier for us.

pluse.demo1 <- as.data.frame(table(Class))
pluse.demo1 <- transform(pluse.demo1, cumFreq = cumsum(Freq),relative = prop.table(Freq))
pluse.demo1
Class Freq cumFreq relative
(9.97,14.2] 3 3 0.1250000
(14.2,18.3] 5 8 0.2083333
(18.3,22.5] 4 12 0.1666667
(22.5,26.7] 6 18 0.2500000
(26.7,30.8] 1 19 0.0416667
(30.8,35] 5 24 0.2083333

3.3 Graphical Procedures

After completing this unit, students will be able to:

  • draw graphs using the base graphics packages,
  • understand plotting commands
  • identify graphics parameters

Graphical facilities are an important and extremely versatile component of the R environment. It is possible to use the facilities to display a wide variety of statistical graphs and also to build entirely new types of graph.

3.3.1 Plotting commands

Once the device driver is running, R plotting commands can be used to produce a variety of graphical displays and to create entirely new kinds of display.

Plotting commands are divided into three basic groups:
- High-level plotting functions create a new plot on the graphics device, possibly with axes, labels, titles and so on.
- Low-level plotting functions add more information to an existing plot, such as extra points, lines and labels.
- Interactive graphics functions allow you interactively add information to, or extract information from, an existing plot, using a pointing device such as a mouse.

In addition, R maintains a list of graphical parameters which can be manipulated to customize your plots.

3.3.2 High-level, low-level level and interactive plotting commands

high-level plotting functions There are a number of arguments which may be passed to high-level graphics functions, as follows:

  • add=TRUE Forces the function to act as a low-level graphics function, superimposing the plot on the current plot (some functions only).

  • axes=FALSE Suppresses generation of axes—useful for adding your own custom axes with the axis() function. The default, axes=TRUE, means include axes.

  • log=“x”

  • log=“y”

  • log=“xy”

Causes the x, y or both axes to be logarithmic. This will work for many, but not all, types of plot.

  • type=
    The type= argument controls the type of plot produced, as follows:

  • type=“p”
    Plot individual points (the default)

  • type=“l”
    Plot lines

  • type=“b”
    Plot points connected by lines (both)

  • type=“o”
    Plot points overlaid by lines

  • type=“h”
    Plot vertical lines from points to the zero axis (high-density)

  • type=“s”

  • type=“S”

Step-function plots. In the first form, the top of the vertical defines the point; in the second, the bottom.

  • type=“n” No plotting at all. However axes are still drawn (by default) and the coordinate system is set up according to the data. Ideal for creating plots with subsequent low-level graphics functions.

xlab=string ylab=string

Axis labels for the x and y axes. Use these arguments to change the default labels, usually the names of the objects used in the call to the high-level plotting function.

  • main=string Figure title, placed at the top of the plot in a large font.

  • sub=string Sub-title, placed just below the x-axis in a smaller font.

3.3.3 Graphics parameters list

In R programming language we can customize the appearance of plots. There are a variety of graphical parameters that can be used to do so. Some of the most commonly used graphical parameters include:

  • xlim and ylim: These parameters set the limits of the x and y axes, respectively. They take a vector of two values: the minimum and maximum values of the axis.
  • xlab and ylab: These parameters set the labels of the x and y axes, respectively. They take a string value. main: This parameter sets the main title of the plot. It takes a string value.
  • col: This parameter sets the color of points or lines in a plot. It takes a string value specifying a color name or a code in the format "#RRGGBB", where RR, GG, and BB are the red, green, and blue components of the color, respectively.
  • lwd: This parameter sets the width of lines in a plot. It takes numeric value.
  • sub: This parameter sets the sub-title/label of the plot.
  • pch: This parameter sets the plotting character for points in a plot. It takes an integer value between 0 and 25.
  • lty: This parameter can be used to change the line types of the plot font: This parameter sets the font style and font size in the plot. We can make the text bold, italic, bold italic, etc.
  • cex: It is a short form for character expansion. This parameter sets the size of elements in the plot, such as points or text. cex takes a numeric value with, 1 being the default size.

Similarly, there are many more graphical parameters that could be used to customize the appearance of the plot. We can set this parameter by giving them directly to the plotting function. This can be understood by the following example. Let’s plot a graph with random points with no parameters.

x <- rnorm(100)
y <- rnorm(100)
plot(x, y)

This is the output we obtain, let’s try to customize this plot.

x <- rnorm(100)
y <- rnorm(100)
plot(x, y, xlim = c(-2, 2),
     ylim = c(-2, 2),
     xlab = "X", ylab = "Y", 
     main = "My Plot", col = "blue",
     pch = 16, lwd= 2)

We can see that this plot is more informative than the previous one. We have added a title and labeled the y and x-axis, also given the limits of the axes (-2 to 2), the color is also changed from black to blue. Let’s try some examples using bar graphs.

# plotting a bar-graph with title and color
barplot(c(1,3), main="Main title",
        xlab="X axis title",
        ylab="Y axis title",
        col.main="red", col.lab="blue")

Now let’s change the size font size of the axis titles and the main title of the graph.

# Increasing the size of titles
barplot(c(1,3), main="Main title",
        xlab="X axis title",
        ylab="Y axis title",
        cex.main=2, cex.lab=1.7)

Adding an Axis to a Plot We add an axis to our plot using the axis() function in R programming. It has an attribute named ‘side’, which takes an integer as a value. These values are,

  • below
  • left
  • above
  • right
# This will create a plot of x and y with no axis,
x <- c(1, 2, 3, 4, 5)
y <- x^2
plot(x, y, xaxt = "n", yaxt = "n")

# then add x and y axis to the plot.
axis(1)
axis(2)

Changing Axis Scale

We can set x and y-axis limits by specifying the minimum and maximum values of each axis. The below code will create a plot of x and y with x-axis limits between 0 and 15 and y-axis limits between 1 and 100, and also with a log scale on the x and y-axis

x <- c(1, 2, 3, 4, 5)
y <- x^2
# setting the limits for axis
plot(x, y, xlim=c(1,15),
     ylim=c(1,100), log="xy")

Customizing Tick Mark Labels We can the appearance of the tick marks in the plot. We can change the color, font-size of tick labels using the col.axis function.

x <- 1:10
y <- x*x*x

# adding red color to the tick labels
plot(x,y, col.axis="red")

We can also hide the tick labels using xaxt and yaxt parameters. They both take a single character value, “s” for showing the axis and “n” for hiding the axis.

x <- 1:10;
y <- x*x*x

# plots with no tick labels
plot(x, y, xaxt="n", yaxt="n")

Changing plotting symbols We have many plotting symbols in R. These are generated using the 'pch' parameter which takes up an integer value to set the type of symbol. For example,

  • pch = 0, for square
  • pch = 1, for circle
  • pch = 2, for triangle point up
  • pch = 3, for plus
  • pch = 4, for cross
  • pch = 5, for diamond , etc.
x <- c(1, 2, 3, 4, 5)
y <- x^2
# makes the pointer a cross
plot(x, y, pch = 4, col = "blue")

Changing the Line Types In R, we can change the line types of a plot using the lty parameter. It can be solid, blank, dotted, dashed, etc. Ity takes up an integer value which is 0 for blank, 1 for solid, 3 for dotted, and so on.

x <- c(1, 2, 3, 4, 5)
y <- x^2

# makes a line graph with dotted lines
plot(x, y, type = "l", lty = 2, col = "red")

3.4 Graphs and Charts

R graphics capabilities are very powerful and flexible. R is a powerful tool for building graphics from the simplest to the most complex ones. Now that we have some data to work with, and we have learned about the data at the most basic level, our next tasks is to visualize the data. Often, a proper visualization can illuminate features of the data that can inform further analysis. We will look at four methods of visualizing data that we will use throughout the course:

  • Histograms
  • Barplots
  • Boxplots
  • Scatterplots
  • qqplot
  • dot charts, and
  • pie charts

Plot Types in R

The type argument in the plot() function determines the type of plot. Below is a list of the possible values for type and their explanations:

  • “p”: Points
  • “l”: Lines
  • “b”: Both (points + lines)
  • “c”: Lines without points
  • “o”: Both (points over lines)
  • “h”: Histogram-like vertical lines
  • “s”: Step function (horizontal)
  • “S”: Step function (vertical)
  • “n”: No plot

3.4.1 Histograms

When visualizing a single numerical variable, a histogram will be our go-to tool, which can be created in R using the hist() function. Strip plots look best when the data set is not too large. The most popular graphical summary for numeric (quantitative) data is the histogram.

A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. Then vertical bars are drawn over the bins with height proportional to the number of observations that fell into the bin. The scale on the y axis can be frequency, percentage, or density (relative frequency).

library(DAAG)
attach(ais)
## The following object is masked _by_ .GlobalEnv:
## 
##     sex
## The following object is masked from cancer:
## 
##     sex
head(ais)
rcc wcc lbm Body_Mass_index ht wt sex sport
3.96 7.5 63.32 20.56 195.9 78.9 f B_Ball
4.41 8.3 58.55 20.67 189.7 74.4 f B_Ball
4.14 5.0 55.36 21.86 177.8 69.1 f B_Ball
4.11 5.3 57.18 21.88 185.0 74.9 f B_Ball
4.45 6.8 53.20 18.96 184.6 64.6 f B_Ball
4.10 4.4 53.77 21.04 174.0 63.7 f B_Ball
summary(ais)
##       rcc             wcc              lbm         Body_Mass_index
##  Min.   :3.800   Min.   : 3.300   Min.   : 34.36   Min.   :16.75  
##  1st Qu.:4.372   1st Qu.: 5.900   1st Qu.: 54.67   1st Qu.:21.08  
##  Median :4.755   Median : 6.850   Median : 63.03   Median :22.72  
##  Mean   :4.719   Mean   : 7.109   Mean   : 64.87   Mean   :22.96  
##  3rd Qu.:5.030   3rd Qu.: 8.275   3rd Qu.: 74.75   3rd Qu.:24.46  
##  Max.   :6.720   Max.   :14.300   Max.   :106.00   Max.   :34.42  
##                                                                   
##        ht              wt         sex         sport   
##  Min.   :148.9   Min.   : 37.80   f:100   Row    :37  
##  1st Qu.:174.0   1st Qu.: 66.53   m:102   T_400m :29  
##  Median :179.7   Median : 74.40           B_Ball :25  
##  Mean   :180.1   Mean   : 75.01           Netball:23  
##  3rd Qu.:186.2   3rd Qu.: 84.12           Swim   :22  
##  Max.   :209.4   Max.   :123.20           Field  :19  
##                                           (Other):47
hist(rcc, main = "Histogram of Red blood cell count",data = ais)
## Warning in plot.window(xlim, ylim, log, ...): "data" is not a graphical
## parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "data" is not a graphical parameter

Note: The graph obtained depends on the bins chosen (see the example below). The best choice for a given data set is the one which illuminates the data set’s underlying structure (if any). In R there are algorithms to automatically choose bins that are likely to display well, often the default bins do a good job. This is not always the case, however, and it is better to investigate many bin choices to test the stability of the display.

par(mfrow=c(1,2))
hist(rcc, breaks = 8, main = "Histogram of Red blood cell count", data = ais)
## Warning in plot.window(xlim, ylim, log, ...): "data" is not a graphical
## parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "data" is not a graphical parameter
hist(rcc, breaks = 20,  main = " ", data = ais)
## Warning in plot.window(xlim, ylim, log, ...): "data" is not a graphical
## parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "data" is not a graphical parameter

list=c(5,20,20,30,30,30,30,40,40,40,40,40,50,50,50,50,50,50,60,60,60,60,60,70,70,70,70,80,80,90)
hist(list, breaks=10, density=c(10,20,30,40,50,40,30,20,10), angle=40,
col=rainbow(9), main="")
title(main=" Histogram", font.main=3)
#par(new=TRUE)
#plot(density(list), yaxt="n", xaxt="n", ann=FALSE, col="blue")
#curve(dnorm(x, mean(list), sd(list)), col="red", lwd=2, add=T)
box() 

3.4.2 Density Plots

Density plots are generalized histograms.

## Density plot 
plot(density(ht), data = ais)     # density for height
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

3.4.3 Barplots

Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.

barplot(table(ais$sport), beside = TRUE, main = "Sport type")

To visualize the relationship between a numerical and categorical variable, we will use a boxplot. In the mpg dataset, the drv variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.

  • A boxplot is constructed by drawing a box alongside the data axis with sides located at the upper and lower hinges. A line is drawn parallel to the sides to denote the sample median. Lastly, whiskers are extended from the sides of the box to the maximum and minimum data values (more precisely, to the most extreme values that are not potential outliers).

  • Boxplots are good for quick visual summaries of data sets, and the relative positions of the values in the five numbers summary (minimum, first quartile, median, third quartile, maximum), indicating the underlying shape of the data distribution.

-They can be a handy alternative to a stripchart when the sample size is large.

  • The greatest advantage of a boxplot is that it can help to objectively identify extreme observations in the data set as described in the next section.

Boxplots are also good because one can visually assess multiple features of the data set simultaneously:

Center: can be estimated by the sample median. Spread: can be judged by the width of the box, (IQR = third quartile – first quartile).

Shape: is indicated by the relative lengths of the whiskers, and the position of the median inside the box. Boxes with unbalanced whiskers indicate skewness in the direction of the long whisker. Skewed distributions often have the median tending in the opposite direction of skewness. Kurtosis can be assessed using the box and whiskers. A wide box with short whiskers will tend to be platykurtic, while a skinny box with wide whiskers indicates leptokurtic distributions. Extreme observations are identified with open circles.

boxplot(ais)

boxplotdata=data.frame(Altitude = c(2150, 2170, 2180, 2700, 2640, 2660,
2460, 2470, 2200, 2220, 2300, 2330, 2350, 2350, 2350, 1900, 1950, 2000,
1900, 1900, 1900, 1900, 1940, 1940, 1930, 2100, 2100, 2300, 2290, 2320,
2250, 2250, 1590, 1600, 1620, 1560, 2320, 2330, 2350, 2300, 2300, 2530,
2540, 2600, 2600, 2460, 2120, 2110, 2150, 2130, 2210, 2210, 2050, 1900,
1750, 1680, 1710, 1950, 2000, 1990, 2590, 2620,1800, 1900, 1350),
groupcode = c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 3, 2, 2, 4, 4, 4, 4, 4, 4, 4, 5, 5,
5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 3, 5, 5, 5, 4, 4,
5, 5, 5, 5, 5, 5, 3, 3,5, 4, 5))
attach(boxplotdata)
boxplot(Altitude~groupcode, main="Altitude", las=1, col=rainbow(5),
varwidth=TRUE, cex.lab=0.8, cex.axis=0.8, ylab="Altitude", xlab="Group
Number")

treelength=data.frame(Height=c(21,23,34,23,24,43,23,23,43,42,32,43,42,54,65,23,42,45,54,32,43,54,23,32,43,54,23,54,65,45,54,34,24,34,54,54),
Slope =gl(2,18,36, labels=c("Plain","Steep")),
Altitude=gl(3,6,36, labels=c(1500,1800,2400)))

library(ggplot2)
boxplot(Height~Slope+Altitude, data=treelength, boxwex=0.50,
col=c("yellow", "orange") ,xlab="Altitude in m", ylab="Tree height",
ylim=c(0,70), cex.axis = 0.8, cex.lab=0.8)
title(main="Height of Juniperus Procera along altitudinal gradient",
font.main=2, cex.main= 0.8)
legend("topleft", c("Southern Ethiopia", "Western Ethiopia"), fill =
c("yellow", "orange"), horiz=TRUE, cex=0.8)

3.4.3.1 Outliers

A potential outlier is any observation that falls beyond 1.5 times the width of the box on either side, that is, any observation less than (first quartile – 1.5*IQR) or greater than (third quartile + 1.5*IQR). A suspected outlier is any observation that falls beyond 3 times the width of the box on either side. In R, both potential and suspected outliers (if present) are denoted by open circles; there is no distinction between the two.

boxplot(lbm~sex, data = ais, main = "Boxplot of Lean 
+ body mass (kg)", xlab = "Sex")

boxplot(lbm~sport, data = ais, xlab = "Sport types")

3.4.4 Scatterplots

Lastly, to visualize the relationship between two numeric variables we will use a scatterplot. This can be done with the plot() function and the ~ syntax we just used with a boxplot. (The function plot() can also be used more generally; see the documentation for details.)

3.4.5 Multivariate data plots

Density plots

library(lattice)
densityplot(~lbm | sport, data = ais, groups = sex, scales = "free", plot.points = FALSE,  auto.key = TRUE, main = "Density plots of Lean body mass  by Support type and Sex",   xlab = "Lean body mass") 

Scatterplots

xyplot(ht ~ wt | sport, data = ais, main = "Scatterplots of Height versus Weight by Spport type",  xlab = "Weight")

Scatterplot matrix

splom(ais[,2:5], main = "Scatterplot matrix of the first four variables in ais data")

4 Probability and Sampling Distributions

4.1 Random Sampling

In statistics, random sampling is a technique used to select a subset of individuals or items from a larger population in such a way that each individual or item has an equal chance of being selected. This method is essential for ensuring that the sample is representative of the population, allowing for accurate inferences and generalizations to be made about the entire population based on the sample data.

# Set my random seed 
set.seed(100)

# Draw the random sample 
sample( 1:50 , 10, replace = FALSE)
##  [1] 10 38 48 25 14 44 23 22  6  4

There are two primary types of random sampling:

  • Simple Random Sampling: Each element of the population has an equal probability of being selected.
  • Stratified Sampling: The population is divided into distinct subgroups and samples are randomly selected from each subgroup.

Random sampling is used in many fields, such as in predictive modeling, hypothesis testing and when building or validating machine learning models.

Random Sampling in R We will be implementing random sampling in R programming language using the dplyr Package. We first need to install and load the package into our R environment.

#install.packages("dplyr")
library(dplyr)

Random Sampling using the sample_n() Function The sample_n() function is used when we need to select a fixed number of random rows from a data frame. This function is used in situations where we need a specific number of observations for training or testing machine learning models or when performing statistical tests on a subset of the data.

Syntax:
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL, .funs = NULL)

  • tbl: The data frame from which to sample.
  • size: The number of rows to sample.
  • replace: A logical value indicating whether sampling should be done with replacement.
  • weight: An optional vector of weights for weighted sampling.
  • .env: An optional environment for evaluating expressions.
  • .funs: An optional list of functions to apply to the sampled data.
library(dplyr)

random_sample <- sample_n(iris, 10)
print(random_sample)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1           6.5         2.8          4.6         1.5 versicolor
## 2           5.6         2.5          3.9         1.1 versicolor
## 3           6.2         2.9          4.3         1.3 versicolor
## 4           6.1         2.6          5.6         1.4  virginica
## 5           4.6         3.4          1.4         0.3     setosa
## 6           4.4         3.2          1.3         0.2     setosa
## 7           6.9         3.1          5.4         2.1  virginica
## 8           7.0         3.2          4.7         1.4 versicolor
## 9           4.8         3.4          1.9         0.2     setosa
## 10          4.9         3.0          1.4         0.2     setosa

Random Sampling using the sample_frac() Function
The sample_frac() function allows to select a random fraction of rows from a data frame. This is useful when we need a percentage-based random sample.

Syntax:

<span style="color:blue"> sample_frac(tbl, size, replace = FALSE, weight = NULL, .env = NULL, .funs = NULL) </span>

library(dplyr)

random_fraction <- sample_frac(iris, 0.065)
print(random_fraction)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1           5.8         2.7          4.1         1.0 versicolor
## 2           5.9         3.0          5.1         1.8  virginica
## 3           4.6         3.2          1.4         0.2     setosa
## 4           5.4         3.4          1.5         0.4     setosa
## 5           5.4         3.0          4.5         1.5 versicolor
## 6           5.5         2.6          4.4         1.2 versicolor
## 7           6.9         3.2          5.7         2.3  virginica
## 8           5.7         4.4          1.5         0.4     setosa
## 9           6.4         3.2          5.3         2.3  virginica
## 10          5.8         2.6          4.0         1.2 versicolor

4.1.1 Random Sequences

R can generate a random sequence from a number of probability density functions. The general format for generating such sequences is: rdensity(n, p1, p2, …) where density is the probability density function, n is the number of values to generate and p1, p2, … are the values needed to generate from the density function.

For example:

rnorm(40, 6, 2.3) #generate 40 values Normal distribution with mean 6 and standard deviation 2.3
##  [1]  4.1269280  4.9915637  4.3434904  6.5311724  3.3372222  6.5682748
##  [7]  5.7904388 10.0419639  5.6827619  5.7442550  4.4129671  5.4898733
## [13]  6.4206877  6.9598436  8.4504254  8.2314646  5.7662528  9.2273680
## [19]  1.9134160  7.4325950  4.7987483  9.0411312  5.1640872  9.0338512
## [25]  6.1006919  1.6790915  4.9717570  2.0012247  6.4113892 10.3641711
## [31]  0.7745714  8.2550675  2.7827011 10.1972066  9.1769871  4.0706407
## [37]  5.3974097  5.8416587  5.1285678 11.9385055
rpois(20, lambda=5) # 20 value from Poisson distribution with lambda(۸)=5
##  [1] 5 2 3 7 6 5 5 1 5 2 5 6 6 2 3 8 2 6 9 2
rbinom(20, size=18, prob=0.3) # 18 values from Binom. distribution with p= 0.3
##  [1] 2 4 5 8 3 4 6 7 7 7 2 6 7 6 5 3 3 3 2 6
runif(20, min=3, max=6) # Uniform distribution (3, 6)
##  [1] 5.283055 5.572424 4.311797 4.251409 4.756948 5.474903 5.379418 3.978458
##  [9] 5.873796 4.959059 4.376027 4.818066 3.866166 5.153493 5.770192 5.024818
## [17] 3.557820 4.044201 3.370559 3.324349
rexp(20, rate=1.5) # 20 values from Exponential distribution e=1.5
##  [1] 0.58997332 0.45093320 0.09360471 0.21384948 0.59040967 1.44183419
##  [7] 0.38457602 0.78102098 0.38046218 0.64738387 0.28250675 1.49211983
## [13] 0.91966535 0.71400639 0.33564758 1.05919080 0.17594923 0.21128162
## [19] 1.95176195 0.09135269

Table: The general generating for value from distribution, the R name and argument are:

Distribution R Name Arguments
Normal norm mean, sd
Student’s t t df, ncp
F f df1, df2, ncp
Exponential exp rate
Log-normal lnorm meanlog, sdlog
Beta beta shape1, shape2, ncp
Binomial binom size, prob
Cauchy cauchy location, scale
Chi-squared chisq df, ncp
Gamma gamma shape, scale
Geometric geom prob
Hypergeometric hyper m, n, k
Logistic logis location, scale
Negative binomial nbinom size, prob
Poisson pois lambda
Signed rank signrank n
Uniform unif min, max
Weibull weibull shape, scale
Wilcoxon wilcox m, n

The function sample() generates random permutations of data or a random sample from a given vector. The first argument is a vector or a positive integer and the second the size of the sample.

For example:

sample(12) # Sampling without replacement.
##  [1]  4 10 11 12  8  7  9  2  6  1  3  5
sample(12, 5) # Sampling 5 values without replacement.
## [1] 2 6 9 5 4
sample(12, replace=TRUE) # Sampling with replacement.
##  [1] 11  6  7  4  7  5 12  9  5 12 11  3
sample(12, 8, replace=TRUE) # Sampling 8 values with replacement.
## [1] 11 12  4 10  2 12  5  7
sample(c("Heads", "Tails"), 6, replace=TRUE) # Simulating 6 coin tosses.
## [1] "Heads" "Tails" "Heads" "Tails" "Heads" "Tails"

4.2 Probability Distributions

In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).

A probability distribution describes how the probabilities of outcomes are distributed for a random variable. It provides a mathematical framework to represent and analyze uncertainty in experiments and data. Probability distributions are categorized into Discrete and Continuous types based on the nature of the random variable.

  1. Random Variable:
  • A variable whose value is determined by the outcome of a random experiment.

    - Discrete Random Variable: Takes on countable values (e.g., number of heads in a coin toss).
    - Continuous Random Variable: Takes on any value within a range (e.g., height, weight).
  1. Probability Mass Function (PMF):
  • For discrete random variables, it gives the probability of each specific value.
  1. Probability Density Function (PDF):
  • For continuous random variables, it represents the density of probabilities over a range of values. The total area under the PDF is always 1.
  1. Cumulative Distribution Function (CDF):
  • It represents the probability that a random variable takes a value less than or equal to a specific point.

Table 4: Common Probability Distribution Functions in R

Distribution Probability Density (or Mass) Function Cumulative Distribution Function (CDF) Quantile Function
Normal dnorm(Z, mean, sd) pnorm(Z, mean, sd) qnorm(Q, mean, sd)
Poisson dpois(N, lambda) ppois(N, lambda) qpois(Q, lambda)
Binomial dbinom(N, size, prob) pbinom(N, size, prob) qbinom(Q, size, prob)
Exponential dexp(N, rate) pexp(N, rate) qexp(Q, rate)
Chi-Squared dchisq(X, df) pchisq(X, df) qchisq(Q, df)

4.3 Built-in Discrete Probability Distributions

4.3.1 Bernoulli Distribution

  • Definition: The Bernoulli distribution is a probability distribution for a binary random variable that takes only two possible values, usually labeled as success (1) and failure (0).

\[P(X=x) = p^x (1-p)^{1-x}, \quad x \in \{0,1\}\]

Let X be a discrete random variable. \(X∼Bern(p)\) means X takes on a Bernoulli distribution where the probability of success is p.

  • Assumptions: The binary random variable has a fixed probability of success and failure, and the outcomes are independent.

  • Application: The Bernoulli distribution is commonly used to model the outcome of a single trial of an experiment where the outcome can only be success or failure.

  • Real-life example:

      - A coin toss, where the outcome can only be heads (success) or tails (failure).
      - Medical trials: Researchers use the Bernoulli distribution to analyze the efficacy of 
        a new drug or treatment by categorizing treatment outcomes as successful or unsuccessful.
      - Passing or failing an exam 
      - Winning or losing a tennis match 
# Generate 10 Bernoulli random variables with probability of success 0.3
library(Rlab)
## Rlab 4.0 attached.
## 
## Attaching package: 'Rlab'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cancer
## The following object is masked from 'package:survival':
## 
##     cancer
## The following object is masked from 'package:DAAG':
## 
##     ozone
## The following object is masked from 'package:dplyr':
## 
##     count
## The following object is masked from 'package:tibble':
## 
##     view
## The following objects are masked from 'package:stats':
## 
##     dexp, dgamma, dweibull, pexp, pgamma, pweibull, qexp, qgamma,
##     qweibull, rexp, rgamma, rweibull
## The following object is masked from 'package:datasets':
## 
##     precip
rbern(n = 10, prob = 0.3)
##  [1] 0 0 0 0 0 0 0 0 0 0

4.3.2 Binomial Distribution

  • Definition: The binomial distribution is a probability distribution for a binary random variable that takes only two possible values, usually labeled as success (1) and failure (0), and is the result of a fixed number of independent trials.

The binomial distribution formula can be written as:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where:
- $P(X = k) $ is the probability of getting exactly \(k\) successes in \(n\) trials,
- \(n\) is the total number of trials,
- \(p\) is the probability of success in a single trial,
- $ $ is the binomial coefficient, also known as “n choose k”.

  • Assumptions: The binary random variable has a fixed probability of success and failure, the outcomes are independent, and the trials are identical.

  • Application: The binomial distribution is commonly used to model the number of successes in a fixed number of independent trials where the outcome can only be success or failure.

  • Real-life example: The number of heads in 10 coin tosses. - Clinical Trials: A new drug is tested on 50 patients, and each has a 60% chance of showing improvement. The number of patients showing improvement follows a Binomial distribution.
    Parameters: n=50, p=0.6.

Example 1:

# Generate 10 binomial random variables with 5 trials and probability of success 0.3
rbinom(n = 10, size = 5, prob = 0.3)
##  [1] 2 3 0 1 0 2 1 2 1 0

Example 2:
Generate 20000 random sample and verify that Binomial distribution will be approximated by normal distribution for large number of trial by varying number of trials(n=10,30,50,500)and for probability of success is 0.05) and students t distribution will be approximated by normal distribution for large degrees of freedom by varying degree of freedom from 10,30,50,500.

To perform the simulation, we will take m samples from binomial distribution for same n and p. we should take m to be some large number so that we get a good idea of the underlying population the samples come from. We will then compare our samples to the normal distribution

# R code to generate 20000 random sample from Binomial by varying number of trials 
# df= 10
m=20000
par(mfrow=c(2,2))
sample1= rbinom(m,size=10,prob=0.05)
hist(sample1,main="Sample1 for n=10",prob=TRUE)
qqnorm(sample1)
qqline(sample1)
boxplot(sample1)

#df= 30
m=20000
par(mfrow=c(2,2))
sample2= rbinom(m,size=30,prob=0.05)
hist(sample2, prob=TRUE,main="Sample2 for n=30")
qqnorm(sample2)
qqline(sample2)
boxplot(sample2)

# df= 50
m=20000
par(mfrow=c(2,2))
sample3= rbinom(m,size=50,prob=0.05)
hist(sample3,prob=TRUE,main="Sample3 for n=50")
qqnorm(sample3)
qqline(sample3)
boxplot(sample3)

# df= 500
m=20000
par(mfrow=c(2,2))
sample4= rbinom(m,size=500,prob=0.05)
hist(sample4,main="Sample4 for n=500", prob=TRUE)
qqnorm(sample4)
qqline(sample4)
boxplot(sample4)

Example 3: Suppose that an examination consists of six true and false questions, and assume that a student has no knowledge of the subject matter. The probability that the student will guess the correct answer to the first question is 30%. Likewise, the probability of guessing each of the remaining questions correctly is also 30%.

  1. What is the probability of getting more than three correct answers?
  2. What is the probability of getting at least two correct answers?
  3. What is the probability of getting at most three correct answers?
  4. What is the probability of getting less than five correct answers?
# Parameters
n <- 6       # Number of questions
p <- 0.30    # Probability of guessing correctly
a<- 1 - pbinom(3, n, p)
a
## [1] 0.07047
b<- 1 - pbinom(1, n, p)
b
## [1] 0.579825
c<- pbinom(3, n, p)
c
## [1] 0.92953
d<- pbinom(4, n, p)
d
## [1] 0.989065

4.3.3 Poisson Distribution

  • Definition: The Poisson distribution is a probability distribution for a discrete random variable that represents the number of events occurring in a fixed interval of time or space.

The Poisson distribution formula is:

\[P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}\]

where:
- \(P(X = k)\) is the probability of observing \(k\) events in a fixed interval,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\lambda\) is the average rate of event occurrences in that interval,
- \(k\) is the number of events that we are interested in,
- \(k!\) is the factorial of \(k\) (the product of all positive integers up to \(k\)).

  • Assumptions: The events occur independently, the probability of an event occurring is constant over time or space, and the events occur at a fixed rate.

  • Application: The Poisson distribution is commonly used to model the number of occurrences of rare events such as accidents, defects, or defects in a manufacturing process.

  • Real-life example: The number of car accidents that occur in a city in a given day.
    - Number of births in a maternity hospital : A maternity hospital can use the Poisson distribution to predict how many births will occur during the nigh

# Generate 10 Poisson random variables with lambda = 2
rpois(n = 10, lambda = 2)
##  [1] 1 2 1 2 3 2 2 5 2 0

Example 2: Suppose that 4% of all TVs made by A&B Company in 2000 are defective. If eight of these TVs are randomly selected from across the country and tested, what is the probability that exactly three of them are defective? Assume that each TV is made independently of the others.

# Parameters
n <- 8          # Number of TVs tested
p <- 0.04       # Probability a TV is defective
x <- 3          # Number of defective TVs
# Probability of exactly 3 defective TVs
prob_exactly_three <- dbinom(x, n, p)
prob_exactly_three
## [1] 0.002922296

4.3.4 Geometric Distribution

The geometric distribution in R is one of the fundamental discrete probability distributions in statistics. It models the number of trials required to get the first success in a sequence of independent Bernoulli trials (i.e., trials with two possible outcomes: success and failure). In this article, we will explore the theory behind the geometric distribution, its probability mass function (PMF), cumulative distribution function (CDF), and its applications. We will also cover how to work with the geometric distribution in R.

The geometric distribution in R arises when we perform a sequence of independent and identically distributed Bernoulli trials, where each trial has:

  • Two possible outcomes Success (with probability p) and Failure (with probability 1−p).
  • We are interested in the number of trials needed to get the first success.
    Mathematically, the probability mass function (PMF) of a geometric distribution is given by:

\[P(X = k) = (1-p)^{k-1} p\]

where:
- \(P(X = k)\) is the probability of getting the first success on the k-th trial,
- \(p\) is the probability of success on each trial,
- \(k\) is the number of trials until the first success (k = 1, 2, 3, …).

  • Assumptions: The trials are independent, and the probability of success remains constant across trials.
  • Application: The geometric distribution is commonly used in various fields, such as quality control, reliability engineering, and marketing, to model the number of trials needed to achieve a specific outcome.
# Load necessary libraries
library(ggplot2)

# Parameters
p <- 0.3
k_values <- 1:10

# Calculate PMF
pmf_values <- dgeom(k_values - 1, prob = p)

# Create a data frame for plotting
pmf_data <- data.frame(k = k_values, pmf = pmf_values)

# Plot the PMF
ggplot(pmf_data, aes(x = k, y = pmf)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Geometric Distribution PMF (p = 0.3)",
       x = "Number of Trials (k)",
       y = "Probability") +
  theme_minimal()

4.4 Built-in Continuous Probability Distributions

Continuous probability distributions are a fundamental concept in probability theory and statistics. They are used to model random variables that can take on any value within a range of values, as opposed to discrete probability distributions that model random variables that can only take on certain values. Continuous probability distributions are commonly used in various fields, such as finance, engineering, healthcare, and social sciences, to model measurements of time, length, weight, or other continuous variables.

Some of the most commonly used continuous probability distributions include the Uniform, Normal, Exponential, Log-Normal, Gamma, Beta, Chi-Square, Student’s t,F distributions. Each of these distributions has a unique shape and set of parameters that determine its properties, such as mean, variance, skewness, and kurtosis.

Understanding continuous probability distributions is essential for statistical analysis and data modeling. They are used to estimate probabilities, calculate expected values, and make predictions based on observed data. In addition, the properties of continuous probability distributions, such as their moments and tail behavior, have important implications for statistical inference and hypothesis testing.

In this context, it is important to note that continuous probability distributions are all based on the fundamental concept of probability density functions (PDFs), which describe the probability distribution of a continuous random variable. PDFs are used to calculate the probability of a random variable taking on a specific value or falling within a certain range of values. The area under a PDF curve represents the probability of the random variable falling within that range of values.

Continuous probability distributions play a critical role in modern statistics and are essential for analyzing and interpreting data in a wide range of fields. Understanding their properties, assumptions, and applications is essential for anyone seeking to apply statistical methods to real-world problems.

4.4.1 Uniform Distribution

  • Definition: The uniform distribution is a probability distribution for a continuous random variable that takes values within a fixed interval, and all values within that interval are equally likely.

The probability density function of a continuous uniform distribution can be expressed as follows:

\[f(x|a, b) = \frac{1}{b-a} \text{ for } a \leq x \leq b\]

where:
- \(f(x|a, b)\)is the probability density function of the uniform distribution between \(a\) and \(b\),
- \(a\) is the lower limit of the distribution,
- \(b\) is the upper limit of the distribution.

This formula denotes that for a continuous uniform distribution over the interval \([a, b]\), the probability density is constant over that interval, equal to \(\frac{1}{b-a}\).

  • Assumptions: The continuous random variable has a fixed interval, and all values within that interval are equally likely.

  • Application: The uniform distribution is commonly used to model outcomes where all values within an interval are equally likely, such as the roll of a fair die or the arrival time of a customer at a store.

  • Real-life example: The arrival time of a customer at a store between 9:00 am and 10:00 am.

4.4.2 Normal Distribution

  • Definition: The normal distribution is a probability distribution for a continuous random variable that has a bell-shaped curve, with most of the values around the mean and fewer values further away from the mean.

The probability density function of the normal (Gaussian) distribution can be represented as:

\[f(x|\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

where:
- \(f(x|\mu,\sigma^2)\) is the probability density function of the normal distribution with mean \(\mu\) and variance \(\sigma^2\),
- \(\mu\) is the mean of the distribution,
- \(\sigma^2\) is the variance of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\pi\) is the mathematical constant pi (\(\approx 3.14159\)).

This formula describes how the probability density function of a normal distribution varies as a function of \(x\), \(\mu\), and \(\sigma\).

  • Assumptions: The continuous random variable has a mean and standard deviation, and the values are normally distributed around the mean.

  • Application: The normal distribution is commonly used to model outcomes where the values are normally distributed, such as the height of a population or the scores on a standardized test.

  • Real-life example: The height of a population of people.

# Generate 10 normal random variables with mean 0 and standard deviation 1
rnorm(n = 10, mean = 0, sd = 1)
##  [1]  0.02355804  0.57633584  1.06700611  1.44528308 -1.76027157 -1.02274299
##  [7] -0.66032349  0.74305991 -0.04390308  0.45594501

Example 2:
Draw two random samples of size 100 from the following distributions N(0, 0.52) and N(2, 0.52) and produce the histograms of these samples on one page. (a) Use the function rnorm()to generate the samples in the following way:

#x <- rnorm(sample size, mean ,standard deviation) #general call of rnorm
x1 <- rnorm(100,0,0.5)
x2 <- rnorm(100,2,0.5)
  1. Use the par() function in order to deffine the number of figures in one page:
par(mfrow=c(1,2)) #put two figures in one page, two figures in one column
hist(x1)
hist(x2)

  1. What is the difference between the two histograms ? D. Does the two distributions look the same (in terms of shape)?

Example 1:
Generate 50 standard normally distributed random sample and display in histogram.

mysample <- rnorm(50) # generates random numbers
hist(mysample, prob = TRUE) # draws the histogram

Describe the shape of random sample

4.4.3 Exponential Distribution

  • Definition: The exponential distribution is a probability distribution for a continuous random variable that represents the time between events in a Poisson process.

The probability density function of the exponential distribution can be expressed as:

\[f(x|\lambda) = \lambda e^{-\lambda x} \text{ for } x \geq 0\]

where:
- \(f(x|\lambda)\) is the probability density function of the exponential distribution with rate parameter \(\lambda\),
- \(\lambda\) is the rate parameter of the distribution, often representing the average number of events in a unit time,
- \(e\) is Euler’s number \(\approx 2.71828\).

This formula describes how the probability density function of the exponential distribution changes with \(x\) and the rate parameter \(\lambda\).

  • Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.

  • Application: The exponential distribution is commonly used to model the time between occurrences of rare events such as accidents, defects, or defects in a manufacturing process.

  • Real-life example: The time between car accidents on a particular road.

# Generate 10 exponential random variables with rate parameter 0.5
rexp(n = 10, rate = 0.5)
##  [1] 1.1490388 2.5859117 0.3146110 1.9089339 4.8798685 0.5295571 0.6737470
##  [8] 1.6965217 3.2391039 2.3716948

4.4.4 Gamma Distribution

  • Definition: The gamma distribution is a probability distribution for a continuous random variable that represents the sum of n independent exponential random variables, each with the same rate parameter.

\[f(x|k,\theta) = \frac{1}{\Gamma(k) \theta^k} x^{k-1} e^{-\frac{x}{\theta}} \text{ for } x > 0\]

where:
- \(f(x|k,\theta)\) is the probability density function of the gamma distribution with shape parameter \(k\) and scale parameter \(\theta\),
- \(\Gamma(k)\) is the gamma function evaluated at \(k\),
- \(x\) is the random variable,
- \(k\) is the shape parameter that determines the shape of the distribution,
- \(\theta\) is the scale parameter that influences the spread of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\).

This formula describes how the probability density function of the gamma distribution varies with the random variable \(x\), shape parameter \(k\), and scale parameter \(\theta\).

  • Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.

  • Application: The gamma distribution is commonly used to model the time to complete a task that is composed of multiple independent sub-tasks, each with a known time to completion.

  • Real-life example: The time to complete a software development project that is composed of multiple independent sub-tasks.

# Generate 10 gamma random variables with shape parameter 2 and rate parameter 0.5
rgamma(n = 10, shape = 2, rate = 0.5)
##  [1] 0.9319142 2.9074790 2.8970930 3.1962679 1.6054692 9.8556978 2.7600363
##  [8] 0.9559170 4.2967674 0.4309526

4.4.5 Chi-Squared Distribution

  • Definition: The chi-squared distribution is a probability distribution for a continuous random variable that is used to model the distribution of the sum of squared standard normal random variables.

The Chi-Square (X2) test is a nonparametric test that is used to test hypotheses about distributions of frequencies across categories of data. The sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.

\[f(x|k) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}\] where:
- \(f(x|k)\) is the probability density function of the chi-squared distribution with \(k\) degrees of freedom,
- \(\Gamma(\cdot)\) is the gamma function,
- \(x\) is the random variable,
- \(k\) is the degrees of freedom parameter.

This formula describes how the probability density function of the chi-squared distribution changes with the random variable \(x\) and the degrees of freedom parameter \(k\).

  • Assumptions: The continuous random variable is the sum of squared standard normal random variables.

  • Application: The chi-squared distribution is commonly used in hypothesis testing and confidence interval estimation when the population standard deviation is unknown and the sample size is large.

  • Real-life example: The sum of squared deviations from the mean in a sample of 100 measurements.

# Generate 10 chi-squared random variables with 5 degrees of freedom
rchisq(n = 10, df = 5)
##  [1] 2.040255 5.424924 2.837399 1.196678 1.980183 8.372367 4.269050 9.790097
##  [9] 2.328423 2.969191

4.4.6 F Distribution

  • Definition: The F distribution is a probability distribution for a continuous random variable that is used to model the distribution of the ratio of two chi-squared random variables.

  • Assumptions: The continuous random variable is the ratio of two chi-squared random variables.

  • Application: The F distribution is commonly used in hypothesis testing and confidence interval estimation when comparing the variances of two populations.

  • Real-life example: The ratio of the variances of two sets of data.

# Generate 10 F random variables with numerator degrees of freedom 5 and denominator degrees of freedom 10
rf(n = 10, df1 = 5, df2 = 10)
##  [1] 1.7271958 1.9262566 4.4256425 0.5940662 2.1703496 1.7676095 9.4035169
##  [8] 0.9775437 0.3089306 2.1442481

Example: Students t Distribution

###########################################################################
# R code to generate 20000 random sample from Students t Distribution by varying degree of freedoms
###########################################################################
m=20000
par(mfrow=c(2,2))
sample1= rt(m,df=10)
hist(sample1,main="Sample1 for df=10",prob=TRUE)
qqnorm(sample1)
qqline(sample1)
boxplot(sample1)

###########################################################################
m=20000
par(mfrow=c(2,2))
sample3= rt(m,df=50)
hist(sample3,prob=TRUE,main="Sample3 for df=50")
qqnorm(sample3)
qqline(sample3)
boxplot(sample3)

###########################################################################
m=20000
par(mfrow=c(2,2))
sample4= rt(m,df=500)
hist(sample4,main="Sample4 for df=500", prob=TRUE)
qqnorm(sample4)
qqline(sample4)
boxplot(sample4)

5 Parametric and Non-parametric Tests

5.1 Parametric Tests

Non-parametric tests are used when data is not normally distributed, when the assumptions of parametric tests are violated, or when dealing with ordinal or nominal data.These tests are used when a normal distribution is assumed.

Examples include:

  • T-test (paired or unpaired)
  • ANOVA
  • Linear regression
  • Pearson rank correlation

5.1.1 T-test

T-test is a statistical test used to test whether the difference between the response of two groups is statistically significant or not. It is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known (typically, the scaling term is unknown and is therefore a nuisance parameter).

When the scaling term is estimated based on the data, the test statistic—under certain conditions—follows a Student’s t distribution. The t-test’s most common application is to test whether the means of two populations are significantly different. In many cases, a Z-test will yield very similar results to a t-test because the latter converges to the former as the size of the dataset increases.

  • T-Test: used to test the mean of the populations

  • The key assumption to use t-test is normality of the population from which the random samples are drawn.

  • There are three different types of t-tests which performs different tests

  • One-sample t-test: Compares the mean of population with certain known or hypothesized population mean value. Hypothesis to be tested is:

    • \(H_0:\mu =\mu_0\) 
    • \(H_a: not H_0\) or
  • The test statistic is given by:
    \[t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}\]

wtdat <- read.csv('~/Independent sample T-test.csv')
t.test (Leaf.length ~ varaity, var.equal=TRUE, data = wtdat)
## 
##  Two Sample t-test
## 
## data:  Leaf.length by varaity
## t = -2.8527, df = 16, p-value = 0.01152
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -0.8672012 -0.1277988
## sample estimates:
## mean in group 1 mean in group 2 
##         27.5625         28.0600

5.1.1.1 Two samples t-test

y1<-rnorm(100,0,1)
y2<-rnorm(57,2,1)
boxplot(y1,y2)

Here we know that the variances are equal, we generated the data that way. Repeat the t-test with the relevant option.

t.test(y1,y2)
## 
##  Welch Two Sample t-test
## 
## data:  y1 and y2
## t = -13.051, df = 124.05, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.308484 -1.700514
## sample estimates:
##  mean of x  mean of y 
## 0.08573994 2.09023908

The Welch Two Sample t-test is a statistical method used to compare the means of two groups, \(y_1\) and \(y_2\), when their variances are not equal. The test statistic (t) is -12.708, indicating a significant difference between the groups. The degrees of freedom (df) are calculated using Welch’s formula, adjusted for unequal variances. The p-value is small, indicating strong evidence against the null hypothesis. The alternative hypothesis is that the true difference in means is not equal to 0. The confidence interval for the difference in means is entirely below 0, indicating that \(y_2\) has a significantly higher mean than y1. The sample means show that \(y_2\) is substantially higher than y1. In conclusion, the Welch Two Sample t-test indicates a statistically significant difference in means between \(y_1\) and \(y_2\), with the confidence interval and sample means suggesting that \(y_2\) is significantly higher than \(y_1\).

5.1.1.2 Paired samples t-Test

The Paired samples t-Test: used to compares the mean difference of the same subject after and before having certain treatments. It tests whether average difference is 0 or not after measurements are made on the same group(subject) but at two different occasion or conditions.

Example: pre or post test results of the same subjects.

yld<-read.csv("~/Paired sample T test.csv", header=TRUE, na.strings="NA", dec=".")
# Save the data in two different vector 
before <- yld$before 
after <- yld$after 
# Compute t-test 
rt <- t.test(before, after, paired = TRUE) 
rt
## 
##  Paired t-test
## 
## data:  before and after
## t = -5.7654, df = 5, p-value = 0.002205
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -14.00080  -5.36587
## sample estimates:
## mean difference 
##       -9.683333

5.1.2 One way ANOVA

The One-Way ANOVA model formula using the following Formula:

\[Y_{ij} = \mu + \varepsilon_{ij}\]

Where:
- \(Y_{ij}\) represents the observation in the \(i\)th group and the \(j\)th sample. - \(\mu\) Parameters: fixed but unknown and needed to be estimated - \(\varepsilon_{ij}\): Random error, assumed to follow normal distribution with constant varaince.

Model assumptions are:

  1. Independence of Observations: The observations within each group and between groups are independent of each other.
  2. Normality of Residuals: This can be tested using statistical tests such as the Shapiro-Wilk test or graphically using Q-Q plots.
  3. Homogeneity of Variance (Homoscedasticity): This assumption can be checked using tests such as Levene’s test or Bartlett’s test.
  4. Dependent Variable is Continuous
  5. Independent Variable is Categorical

Violation of Assumptions
If assumptions are violated, consider transformations of the data, using a non-parametric alternative (e.g., Kruskal-Wallis test), or employing a robust ANOVA method.

duodenal<-read.table("~/duodenal.txt", header = FALSE, na.strings = "NA", dec = ".")
names(duodenal) <- c("Trypsin", "Protein")    
attach(duodenal)
dim(duodenal)
## [1] 28  2
Fit.aov<-aov(Protein~Trypsin)
summary(Fit.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Trypsin      2  19.63   9.814   1.265    0.3
## Residuals   25 193.99   7.760

Histograms by group

par(mfrow=c(2,2))
hist(Protein[Trypsin=="ls50"],col=0)
hist(Protein[Trypsin=="b51_1000"],col=0)
hist(Protein[Trypsin=="ge1000"],col=0)

qq normal plots by group

par(mfrow=c(2,2))
qqnorm(Protein[Trypsin=="ls50"])
qqnorm(Protein[Trypsin=="b51_1000"])
qqnorm(Protein[Trypsin=="ge1000"])

par(mfrow=c(1,2))
boxplot(split(Protein,Trypsin))

tapply(Protein,Trypsin,mean)
## b51_1000   ge1000     ls50 
## 5.410000 5.944444 3.933333

Diagnostic plot

par(mfrow=c(2,2))
qqnorm(Fit.aov$resid)
hist(Fit.aov$resid,col=0)
boxplot(split(Fit.aov$resid,Trypsin))

5.1.2.1 Homogeneity of variances test

Bartlett’s test

bartlett.test(Protein ~ Trypsin,data = duodenal)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Protein by Trypsin
## Bartlett's K-squared = 1.5415, df = 2, p-value = 0.4627

Levene’s test

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:DAAG':
## 
##     vif
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
attach(duodenal)
## The following objects are masked from duodenal (pos = 5):
## 
##     Protein, Trypsin
leveneTest(Protein, Trypsin)
## Warning in leveneTest.default(Protein, Trypsin): Trypsin coerced to factor.
Df F value Pr(>F)
group 2 0.6844803 0.5135556
25 NA NA

5.1.3 Two-way ANOVA

Two-way ANOVA is a statistical technique used to evaluate the effect of two categorical independent variables (factors) and their interaction on a continuous dependent variable. It extends one-way ANOVA by including a second factor, allowing for the analysis of interaction effects between the two factors.

\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]

Where:
- \(Y_{ijk}\) represents the observation in the cell defined by levels \(i\) and \(j\) of the two factors. - \(\mu\) is the overall mean. - \(\alpha_i\) is the effect of the \(i\)th level of factor A. - \(\beta_j\) is the effect of the \(j\)th level of factor B. - \((\alpha\beta)_{ij}\) is the interaction effect between the \(i\)th level of factor A and the \(j\)th level of factor B. - \(\varepsilon_{ijk}\) is the random error term.

TWO WAY ANOVA Table

Model Definition \[Y_{ijk} = \mu + \tau_i + \beta_j + (\gamma)_{ij} + e_{ijk}\]

  • \(\mu\): Overall mean.
  • \(\tau_i\): Effect of the \(i\)-th level of factor A.
  • \(\beta_j\): Effect of the \(j\)-th level of factor B.
  • \((\gamma)_{ij}\): Interaction effect between the \(i\)-th level of factor A and \(j\)-th level of factor B.
  • \(e_{ijk}\): Random error term.

Decomposition of Sum of Squares \[SS_{\text{TOTAL}} = SS_A + SS_B + SS_{AB} + SS_{\text{Error}}\]

  • \(SS_A\): Sum of squares due to factor A.
  • \(SS_B\): Sum of squares due to factor B.
  • \(SS_{AB}\): Sum of squares due to interaction.
  • \(SS_{\text{Error}}\): Sum of squares due to residual error.

Assumptions \[Y_{ijk} \sim N(\mu, \sigma^2), \quad e_{ijk} \sim N(0, \sigma^2)\]

  • Observations are normally distributed: \(Y_{ijk} \sim N(\mu,\sigma^2)\).
  • Error terms are independent and normally distributed with mean 0 and constant variance: \(e_{ijk} \sim N(0, \sigma^2)\).

Hypotheses

Main Effects for Factor A \[H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0 \quad \text{versus} \quad H_a: \tau_i \neq 0 \ \text{for some } i.\]

Main Effects for Factor B \[H_0: \beta_1 = \beta_2 = \ldots = \beta_b = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0 \ \text{for some } j.\]

Interaction Effects \[ H_0: \gamma_{ij} = 0 \ \text{for all } i, j \quad \text{versus} \quad H_a: \gamma_{ij} \neq 0 \ \text{for some } i, j. \]

The R Interface
The R Interface
##Two way ANOVA###
trypData=read.delim2("~/Trypanosomosis.txt")
str(trypData)
## 'data.frame':    12 obs. of  4 variables:
##  $ Breed     : chr  "Boran" "Boran" "Boran" "Boran" ...
##  $ Drug      : chr  "Berenil" "Berenil" "Berenil" "Samorin" ...
##  $ PCV_before: num  18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
##  $ PCV_after : num  26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
trypData$PCV_diff <-trypData$PCV_after-trypData$PCV_before
str(trypData)
## 'data.frame':    12 obs. of  5 variables:
##  $ Breed     : chr  "Boran" "Boran" "Boran" "Boran" ...
##  $ Drug      : chr  "Berenil" "Berenil" "Berenil" "Samorin" ...
##  $ PCV_before: num  18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
##  $ PCV_after : num  26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
##  $ PCV_diff  : num  7.9 7.8 5.6 13.8 11.9 13.5 7 9.4 7.6 15.9 ...
library(ggplot2)
ggplot(trypData,aes(y = PCV_diff, x = Drug, fill = Breed)) +geom_boxplot(aes(fill = Breed)) 

ft = aov(PCV_diff~Breed * Drug, data = trypData)
print(ft)
## Call:
##    aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
## 
## Terms:
##                    Breed     Drug Breed:Drug Residuals
## Sum of Squares   0.44083 89.10750    0.80083  23.99333
## Deg. of Freedom        1        1          1         8
## 
## Residual standard error: 1.73181
## Estimated effects may be unbalanced
summary(ft)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Breed        1   0.44    0.44   0.147 0.711420    
## Drug         1  89.11   89.11  29.711 0.000608 ***
## Breed:Drug   1   0.80    0.80   0.267 0.619316    
## Residuals    8  23.99    3.00                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(ft,type="means")
## Tables of means
## Grand mean
##        
## 10.275 
## 
##  Breed 
## Breed
##    Boran Holstein 
##   10.083   10.467 
## 
##  Drug 
## Drug
## Berenil Samorin 
##    7.55   13.00 
## 
##  Breed:Drug 
##           Drug
## Breed      Berenil Samorin
##   Boran     7.100  13.067 
##   Holstein  8.000  12.933
TukeyHSD(ft)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
## 
## $Breed
##                     diff       lwr      upr     p adj
## Holstein-Boran 0.3833333 -1.922351 2.689017 0.7114204
## 
## $Drug
##                 diff      lwr      upr     p adj
## Samorin-Berenil 5.45 3.144316 7.755684 0.0006082
## 
## $`Breed:Drug`
##                                         diff        lwr       upr     p adj
## Holstein:Berenil-Boran:Berenil     0.9000000 -3.6281806  5.428181 0.9173043
## Boran:Samorin-Boran:Berenil        5.9666667  1.4384861 10.494847 0.0124387
## Holstein:Samorin-Boran:Berenil     5.8333333  1.3051527 10.361514 0.0140891
## Boran:Samorin-Holstein:Berenil     5.0666667  0.5384861  9.594847 0.0293847
## Holstein:Samorin-Holstein:Berenil  4.9333333  0.4051527  9.461514 0.0334869
## Holstein:Samorin-Boran:Samorin    -0.1333333 -4.6615139  4.394847 0.9996731
plot(TukeyHSD(ft,"Breed"),las=1)

mastitisData =read.csv2("~/Mastitis.csv")
str(mastitisData)
## 'data.frame':    12 obs. of  3 variables:
##  $ Parity   : chr  "heifer" "heifer" "heifer" "heifer" ...
##  $ Dose     : chr  "high" "high" "medium" "medium" ...
##  $ Reduction: num  6.79 3.87 30.03 38.08 53.67 ...
mastitisData$Dose = factor(mastitisData$Dose,levels =c("low", "medium", "high"))
ggplot(mastitisData,aes(y = Reduction, x = Dose, fill = Parity))+geom_point(aes(colour = Parity), size = 6)

fm= aov(Reduction ~Parity * Dose, data = mastitisData)
summary(fm)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Parity       1    627     627   36.79 0.000911 ***
## Dose         2   8758    4379  257.06 1.54e-06 ***
## Parity:Dose  2    385     192   11.30 0.009236 ** 
## Residuals    6    102      17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(fm,type="means")
## Tables of means
## Grand mean
##       
## 39.64 
## 
##  Parity 
## Parity
##      heifer multiparous 
##       32.41       46.87 
## 
##  Dose 
## Dose
##    low medium   high 
##  72.62  39.84   6.45 
## 
##  Parity:Dose 
##              Dose
## Parity        low   medium high 
##   heifer      57.86 34.06   5.33
##   multiparous 87.40 45.63   7.58
TukeyHSD(fm)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Reduction ~ Parity * Dose, data = mastitisData)
## 
## $Parity
##                        diff     lwr      upr    p adj
## multiparous-heifer 14.45333 8.62264 20.28403 0.000911
## 
## $Dose
##                 diff       lwr       upr    p adj
## medium-low  -32.7825 -41.73701 -23.82799 7.34e-05
## high-low    -66.1725 -75.12701 -57.21799 1.30e-06
## high-medium -33.3900 -42.34451 -24.43549 6.60e-05
## 
## $`Parity:Dose`
##                                        diff       lwr       upr     p adj
## multiparous:low-heifer:low           29.540  13.11411  45.96589 0.0029142
## heifer:medium-heifer:low            -23.800 -40.22589  -7.37411 0.0089285
## multiparous:medium-heifer:low       -12.225 -28.65089   4.20089 0.1529274
## heifer:high-heifer:low              -52.525 -68.95089 -36.09911 0.0001178
## multiparous:high-heifer:low         -50.280 -66.70589 -33.85411 0.0001512
## heifer:medium-multiparous:low       -53.340 -69.76589 -36.91411 0.0001079
## multiparous:medium-multiparous:low  -41.765 -58.19089 -25.33911 0.0004333
## heifer:high-multiparous:low         -82.065 -98.49089 -65.63911 0.0000071
## multiparous:high-multiparous:low    -79.820 -96.24589 -63.39411 0.0000085
## multiparous:medium-heifer:medium     11.575  -4.85089  28.00089 0.1824850
## heifer:high-heifer:medium           -28.725 -45.15089 -12.29911 0.0033811
## multiparous:high-heifer:medium      -26.480 -42.90589 -10.05411 0.0051795
## heifer:high-multiparous:medium      -40.300 -56.72589 -23.87411 0.0005298
## multiparous:high-multiparous:medium -38.055 -54.48089 -21.62911 0.0007303
## multiparous:high-heifer:high          2.245 -14.18089  18.67089 0.9916016
plot(TukeyHSD(fm,"Dose",order=T))

5.1.3.1 Pearson correlation coefficient

In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1.

As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a primary school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:Rlab':
## 
##     michelson
## The following object is masked from 'package:DAAG':
## 
##     hills
## The following object is masked from 'package:dplyr':
## 
##     select
data(cats)
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
par(mfrow=c(1,2))
with(cats, plot(Bwt, Hwt,col=3))
title(main="Heart Weight (g) vs. Body Weight (kg)of Domestic Cats")
plot(Hwt~Bwt, data=cats,col=4,main="H vrs B W")

with(cats, cor.test(Bwt, Hwt,method = "pearson",conf.level=0.95))
## 
##  Pearson's product-moment correlation
## 
## data:  Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274
with(cats, cor(Bwt, Hwt)) 
## [1] 0.8041274
with(cats, cor(Bwt, Hwt))^2
## [1] 0.6466209
###Correlation test###
with(cats, cor.test(Bwt, Hwt))
## 
##  Pearson's product-moment correlation
## 
## data:  Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274
with(cats, cor.test(Bwt, Hwt,method = "pearson",conf.level=0.95))
## 
##  Pearson's product-moment correlation
## 
## data:  Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274
####Scatterplotdetails###
with(cats, plot(Bwt, Hwt, type="n", las=1, xlab="Body Weightin kg",ylab="Heart Weight in g",main="Heart Weight vs. BodyWeight of Cats"))
with(cats, points(Bwt[Sex=="F"], Hwt[Sex=="F"], pch=16,col="red"))
with(cats, points(Bwt[Sex=="M"], Hwt[Sex=="M"], pch=17,col="blue"))

5.2 Non-parametric Tests

Non-parametric tests are used when data is not normally distributed, when the assumptions of parametric tests are violated, or when dealing with ordinal or nominal data. These tests do not depend on any distribution and do not make any assumptions.

Examples include:
- Sign test
- Mann–Whitney
- Wilcoxon test
- Chi-square test
- Kruskal_Wallis test
- Kolmogorov_Smirnov test
- Wilcoxon Rank sum test
- Spearman correlation

5.2.1 Wilcoxon signed-rank test

The Wilcoxon signed-rank test is a non-parametric rank test for statistical hypothesis testing used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples.

[1] The one-sample version serves a purpose similar to that of the one-sample Student’s t-test.

[2] For two matched samples, it is a paired difference test like the paired Student’s t-test (also known as the “t-test for matched pairs” or “t-test for dependent samples”).

The Wilcoxon test is a good alternative to the t-test when the normal distribution of the differences between paired individuals cannot be assumed. Instead, it assumes a weaker hypothesis that the distribution of this difference is symmetric around a central value and it aims to test whether this center value differs significantly from zero. The Wilcoxon test is a more powerful alternative to the sign test because it considers the magnitude of the differences, but it requires this moderately strong assumption of symmetry.

- Alternative: paired t-test
- Application: comparing the medians of two dependent groups when the assumptions of normality or equal variances are not met.
- Real-life example: comparing the pre- and post-treatment scores of patients in a clinical trial when the scores are not normally distributed.
# Wilcoxon signed-rank test
before <- c(8, 9, 10, 12, 14)
after <- c(10, 11, 12, 13, 15)
wilcox.test(before, after, paired=TRUE)
## Warning in wilcox.test.default(before, after, paired = TRUE): cannot compute
## exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  before and after
## V = 0, p-value = 0.05334
## alternative hypothesis: true location shift is not equal to 0

5.2.2 Chi-square test

A chi-square test is a statistical test used to determine if there is a significant association between categorical variables. It compares the observed frequencies of occurrences in different categories with the frequencies expected if there were no associations between the variables.

The Chi-Squared Test The mathematical formula is give as:

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

5.2.2.1 Types of chi-square tests

There are two main types of chi-square tests:

  1. Chi-Square Test of Independence: It helps determine whether the variables are independent or if there’s a relationship between them. For example, you might want to know if gender affects voting preference.

  2. Chi-Square Test of Goodness of Fit: This test checks if the sample data fits a population distribution. For example, you might want to see if a die is fair by comparing the observed frequency of each face with the expected frequency if the die were fair.

5.2.2.2 Assumptions of the chi-square test

To ensure the validity of the chi-square test, certain assumptions must be met:

  • The data must be in the form of frequencies or counts of cases.
  • The categories should be mutually exclusive.
  • For the chi-square test of independence, the expected frequency in each category should be at least 5.
  • The goodness of fit test expects a frequency of at least 1, with no more than 20% of expected frequencies being less than 5.

5.2.2.3 Practical applications of chi-square tests

Chi-square tests are widely used in academia and industry, especially for testing hypotheses about the independence of categorical variables. Some of these practical applications are:

  • Market Research: Some applications include analyzing if customer preferences for a product vary across different age groups and income levels or determining if different marketing campaigns are equally effective across demographic segments.

  • Medical Research: Common use cases are studying the association between lifestyle factors (e.g., smoking, exercise) and the incidence of diseases (e.g., lung cancer, heart disease) or evaluating whether different treatment groups show different recovery rates in clinical trials.

  • Quality Control: Commonly used to examine whether product defects are independent of the manufacturing process or specific production lines and to compare the quality of products from different suppliers to determine if there is a significant difference in defect rates.

  • Education: The tests are often used to determine if there is a significant difference in pass rates between students from different schools or teaching methods and to evaluate if introducing a new curriculum improves student performance across subjects. It’s worth noting that these are some of the many applications across academia and the industry and can be extended to other domains and fields.

layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
for (i in 1:4) 
plot(dchisq(1:28, i*3),pch=18,col="blue",main = paste("chi-square density, df =", i*3))

Example 1:
The investigator collected the blood test of patients from a hospital medical record.
A=25
B=32
AB=18
O=20
Test the null hypothesis that blood types are equally represented for those 97 patients. This is a goodness of fit test with equal expected frequencies. The “p” vector does not need to be specified, since equal frequencies is the default…

chisq.test(c(25,32,18,20)) 
## 
##  Chi-squared test for given probabilities
## 
## data:  c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781
chi=chisq.test(c(25,32,18,20))
chi
## 
##  Chi-squared test for given probabilities
## 
## data:  c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781

the null hypothesis cannot be rejected at alpha =0 .05

Example 2:
the eye color of 592 students found in the following category.

Eye Color Count
Brown 220
Blue 215
Hazel 93
Green 64

Suppose a certain physiologist proposes that 50% of student should have brown eyes, 25% blue eyes, 15% hazel eyes, and 10% green eyes. Does the sample at hand agree with the physiologist’s hypothesis

eyes<-c(220,215,93,64)
chi<-chisq.test(eyes, p=c(.5,.25,.15,.1))
chi
## 
##  Chi-squared test for given probabilities
## 
## data:  eyes
## X-squared = 50.432, df = 3, p-value = 6.462e-11

Example 3:
The following data shows the number of died and alive patients across in gender.

data<-c(4298,767,7136,643)
data.tab<-matrix(data,byrow=T,nrow=2)
data.tab
##      [,1] [,2]
## [1,] 4298  767
## [2,] 7136  643
dimnames(data.tab)=list("Sex"=c("F","M"),"Died"=c("Alive","died"))
data.tab
##    Died
## Sex Alive died
##   F  4298  767
##   M  7136  643
chisq.test(data.tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data.tab
## X-squared = 147.76, df = 1, p-value < 2.2e-16
chisq.test(data.tab)$expected
##    Died
## Sex   Alive     died
##   F 4508.97 556.0301
##   M 6925.03 853.9699
chisq.test(data.tab)$residuals
##    Died
## Sex     Alive      died
##   F -3.141825  8.946877
##   M  2.535186 -7.219370

Example 4:
Let us see the data for students smoking habit and exercise level.

data.st<-c(7,6,6,9,7,5,12,5,5,87,84,18)
data.matrix<-matrix(data.st , byrow=T, nrow=4)
dimnames(data.matrix)=list("Exe"=c("heavy","regula","occas","never"),"Smoke"=c("freq","some","none"))
data.matrix
##         Smoke
## Exe      freq some none
##   heavy     7    6    6
##   regula    9    7    5
##   occas    12    5    5
##   never    87   84   18
chisq.test(data.matrix)
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  data.matrix
## X-squared = 13.633, df = 6, p-value = 0.03402

As the p-value 0.03402 is less than the .05 significance level, we do not accept the null hypothesis that the smoking habit is independent of the exercise level of the students.

chisq.test(data.matrix)$expected
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
##         Smoke
## Exe           freq      some      none
##   heavy   8.705179  7.721116  2.573705
##   regula  9.621514  8.533865  2.844622
##   occas  10.079681  8.940239  2.980080
##   never  86.593625 76.804781 25.601594
chisq.test(data.matrix)$residual
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
##         Smoke
## Exe             freq       some      none
##   heavy  -0.57793792 -0.6193983  2.135725
##   regula -0.20036837 -0.5250663  1.277942
##   occas   0.60485311 -1.3177955  1.170093
##   never   0.04367003  0.8210127 -1.502350

Example 5:
We apply the test to a dataset. We’ll use the Anemia Levels in Nigeria dataset, which can be downloaded from Kaggle Click here. The dataset comes from the 2018 Nigeria Demographic and Health Surveys (NDHS). It explores the impact of mothers’ age and socioeconomic factors on anemia levels among children aged 0–59 months across Nigeria’s 36 states and the Federal Capital Territory.

# Load the necessary libraries
#install.packages('readr')
library(readr)

# Load the dataset from the CSV file
dataset <- read_csv("C:/Users/abdik/Desktop/DMA/children_anaemia.csv")
## New names:
## Rows: 33924 Columns: 18
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (13): Age in 5-year groups, Type of place of residence, Highest educatio... dbl
## (5): ...1, Births in last five years, Age of respondent at 1st birth, H...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Display the first few rows of the dataset
#head(dataset)
# Rename a column
colnames(dataset)[colnames(dataset) == "Anemia level...8"] <- "Anemia level"
# Display the column names
colnames(dataset)
##  [1] "...1"                                                                 
##  [2] "Age in 5-year groups"                                                 
##  [3] "Type of place of residence"                                           
##  [4] "Highest educational level"                                            
##  [5] "Wealth index combined"                                                
##  [6] "Births in last five years"                                            
##  [7] "Age of respondent at 1st birth"                                       
##  [8] "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)"
##  [9] "Anemia level"                                                         
## [10] "Have mosquito bed net for sleeping (from household questionnaire)"    
## [11] "Smokes cigarettes"                                                    
## [12] "Current marital status"                                               
## [13] "Currently residing with husband/partner"                              
## [14] "When child put to breast"                                             
## [15] "Had fever in last two weeks"                                          
## [16] "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)"            
## [17] "Anemia level.1"                                                       
## [18] "Taking iron pills, sprinkles or syrup"
# Install and load the package
#install.packages('dplyr')
library(dplyr)

# Select the columns of interest
selected_data <- dplyr::select(dataset,`Highest educational level`,`Anemia level`)


# Create a contingency table for Highest educational level and Anemia level
contingency_table <- table(selected_data[["Highest educational level"]], 
                           selected_data[["Anemia level"]])

# View the contingency table
print(contingency_table)
##               
##                Mild Moderate Not anemic Severe
##   Higher        296      238        591     14
##   No education 1450     1769       1874    113
##   Primary       608      645        900     42
##   Secondary    1240     1322       1972     62
# Perform chi-square test
chi_square_test <- chisq.test(contingency_table)
print(chi_square_test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 142.86, df = 9, p-value < 2.2e-16

5.2.3 Spearman rank correlation

  • Alternative: Pearson correlation
  • Application: measuring the strength of association between two variables when one or both variables are ordinal.
  • Real-life example: measuring the correlation between the rankings of different restaurants based on their ratings and the number of customers they have.
# Spearman rank correlation
x <- c(10, 20, 30, 40, 50)
y <- c(5, 15, 25, 35, 45)
cor.test(x, y, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 4.4409e-15, p-value = 0.01667
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##   1

6 Statistical Models in R

So far we have seen only some data manipulation and the calculation of summary statistics. R can do a multitude of statistical analysis including linear models (lm), generalised linear models (glm), analysis of variance (anova), non linear models with mixed effects (nlme), generalised additive models (gam), survival analysis (survival), time series analysis (tseries), multivariate analysis (multiv) and many more.

6.1 Regression Model

Regression models, in which explanatory variables are used to model the behaviour of a response variable, are without a doubt the most commonly used class of models in the statistical toolbox. In this chapter, we will have a look at different types of regression models tailored to many different sorts of data and applications.

After reading this chapter, you will be able to use R to:

  • Fit and evaluate linear models, including linear regression and ANOVA,
  • Fit and evaluate generalised linear models, including logistic regression and Poisson regression,
  • Use multiple imputation to handle missing data,
  • Fit and evaluate mixed models, and
  • Create matched samples.

6.1.1 Linear Regression model in R

6.1.1.1 Simple Linear Regression

The formula for simple linear regression, where you have one independent variable (predictor) and one dependent variable, can be expressed as follows:

The linear regression model is represented as:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

where:
- \(Y\) is the dependent variable,
- \(X\) is the independent variable (predictor),
- \(\beta_0\) is the intercept (y-intercept) of the regression line,
- \(\beta_1\) is the slope (coefficient) of the independent variable,
- \(\varepsilon\) is the error term (residuals).

The goal of simple linear regression is to determine the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.

library(DAAG)
attach(ais)
## The following object is masked _by_ .GlobalEnv:
## 
##     sex
## The following objects are masked from ais (pos = 11):
## 
##     Body_Mass_index, ht, lbm, rcc, sex, sport, wcc, wt
## The following object is masked from cancer:
## 
##     sex
dim(ais)
## [1] 202   8
names(ais)
## [1] "rcc"             "wcc"             "lbm"             "Body_Mass_index"
## [5] "ht"              "wt"              "sex"             "sport"
fit.ais <- lm(wt ~ ht, data=ais)
plot(wt ~ ht, data = ais, ylab = "Weight", xlab = "Height")
x <- ais$ht
y <- fit.ais$fit
lines(x,y)

anova(fit.ais)
Df Sum Sq Mean Sq F value Pr(>F)
ht 1 23769.79 23769.79446 312.6298 0
Residuals 200 15206.35 76.03176 NA NA

Note: aov() and anova() outputs are different

summary(fit.ais)
## 
## Call:
## lm(formula = wt ~ ht, data = ais)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.372  -5.296  -1.196   4.378  38.031 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -126.19049   11.39566  -11.07   <2e-16 ***
## ht             1.11712    0.06318   17.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.72 on 200 degrees of freedom
## Multiple R-squared:  0.6099, Adjusted R-squared:  0.6079 
## F-statistic: 312.6 on 1 and 200 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(ais$wt,fit.ais$fit,xlab="Observed", ylab="Predicted", main="")
title("Observed Versus Predicted Values")
abline(0,1)
hist(fit.ais$resid,col=0, main="")
title("Histogram for residuals")
qqnorm(fit.ais$resid) 

Default plots

par(mfrow=c(2,2))
plot(fit.ais)

Exercise
1. Read the dataset fuel.txt.
2. Fit a simple linear regression model in which the mileage (third column in the dataset) in the response and the weight (the second column in the dataset) of the car is a predictor.
3. Test the hypothesis that the slope is zero.
4. Use the default plots of an \(lm()\) object to produce the diagnostic plot.

6.1.2 Multiple Linear Regression

In multiple linear regression, the model involves more than one independent variable. The general formula for multiple linear regression with \(p\) independent variables can be expressed as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\]

where:
- \(Y\) is the dependent variable,
- \(X_1\), \(X_2\), \(\ldots\), \(X_p\) are the independent variables,
- \(\beta_0\) is the intercept,
- \(\beta_1\), \(\beta_2\), $, \(\beta_p\) are the coefficients associated with each independent variable,
- \(\varepsilon\) is the error term (residuals).

The goal of multiple linear regression is to estimate the coefficients \(\beta_0\), \(\beta_1\), \(\ldots\), \(\beta_p\) that best fit the data. This is typically done by minimizing the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.

Example

Y<-c(78.5,74.3 ,104.3,87.6,95.9,109.2,102.7,72.5 ,93.1 ,115.9 ,83.8 ,113.3,109.4)
x1<-c(7 ,1 ,11, 11 ,7 ,11 ,3 ,1 ,2 ,21 ,1 ,11 ,10 )
x2<-c(26, 29,56,31,52,55,71,31,54,47,40,66,68)
x3<-c(6,15, 8,8,6,9,17,22,18,4,23,9 ,8)
x4<-c(60,52,20,47,33,22,6,44,22,26,34,12,12)
  1. Identify the optimal model or models based on R^2_adj , AIC, AICC, BIC from the approach based on all possible subsets.
  2. Identify the optimal model or models based on AIC and BIC from the approach based on forward selection.
  3. Identify the optimal model or models based on AIC and BIC from the approach based on backward elimination.
  4. Carefully explain why the models chosen in (a), (b) & (c) are not all the same.
  5. Recommend a final model. Give detailed reasons to support your choice.
Halds<-data.frame(Y,x1,x2,x3,x4)
model<-lm(Y~x1+x2+x3+x4,data=Halds)
summary(model)
## 
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4, data = Halds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07
par(mfrow=c(2,2))
plot(model)

residuals <- residuals(model)
fitted_values <- fitted(model)

# Correlation test
cor.test(fitted_values, residuals)
## 
##  Pearson's product-moment correlation
## 
## data:  fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5509853  0.5509853
## sample estimates:
##          cor 
## 3.829213e-17
model1<-lm(Y~x1+x2,data=Halds)
summary(model1)
## 
## Call:
## lm(formula = Y ~ x1 + x2, data = Halds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.893 -1.574 -1.302  1.363  4.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
## x1           1.46831    0.12130   12.11 2.69e-07 ***
## x2           0.66225    0.04585   14.44 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9744 
## F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09
model2<-lm(Y~x1+x2+x4,data=Halds)
summary(model2)
## 
## Call:
## lm(formula = Y ~ x1 + x2 + x4, data = Halds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0919 -1.8016  0.2562  1.2818  3.8982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.6483    14.1424   5.066 0.000675 ***
## x1            1.4519     0.1170  12.410 5.78e-07 ***
## x2            0.4161     0.1856   2.242 0.051687 .  
## x4           -0.2365     0.1733  -1.365 0.205395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

Example 2:

library(car)
attach(mtcars)
## The following object is masked from ais (pos = 3):
## 
##     wt
## The following object is masked from ais (pos = 12):
## 
##     wt
## The following object is masked from package:ggplot2:
## 
##     mpg
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
x <- mtcars$mpg
y <- mtcars$hp

# Pearson correlation test
cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684
# Fit a model
model11 <- lm(mpg ~ hp + wt, data = mtcars)
summary(model11)
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
# Extract residuals and fitted values
residuals <- residuals(model)
fitted_values <- fitted(model)

# Correlation test
cor.test(fitted_values, residuals)
## 
##  Pearson's product-moment correlation
## 
## data:  fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5509853  0.5509853
## sample estimates:
##          cor 
## 3.829213e-17
# Load ggplot2
library(ggplot2)

# Create Scatter Plot with Regression Line
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue", size = 3) +                    # Scatter points
  geom_smooth(method = "lm", se = TRUE, color = "red") +    # Regression line with confidence interval
  labs(title = "Scatter Plot of MPG vs Horsepower",
       x = "Horsepower (hp)",
       y = "Miles Per Gallon (mpg)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Pairwise scatterplots for selected variables
pairs(mtcars[, c("mpg", "hp", "wt")],
      main = "Pairwise Scatterplots",
      pch = 19, col = "blue")

# Fit a Multiple Regression Model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Plot Residuals vs Fitted Values
plot(model$fitted.values, model$residuals,
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values",
     ylab = "Residuals",
     pch = 19,
     col = "darkgreen")
abline(h = 0, col = "red", lwd = 2)    # Horizontal line at 0

# Optionally, add a smooth line to detect patterns
lines(lowess(model$fitted.values, model$residuals), col = "blue", lwd = 2)

# Create Residuals vs Fitted Plot
ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point(color = "darkgreen", size = 3) +             # Residual points
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +  # Horizontal line at 0
  geom_smooth(method = "loess", color = "blue") +         # Smooth line to detect patterns
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()
## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
##   Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

# Generate All Diagnostic Plots
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 grid
plot(model)

par(mfrow = c(1, 1))  # Reset to default
# Q-Q Plot Using ggplot2
ggplot(model, aes(sample = .stdresid)) +
  stat_qq(color = "purple") +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

# Load library for heatmap
library(corrplot)
## corrplot 0.95 loaded
# Calculate correlation matrix
cor_matrix <- cor(mtcars)

# Plot heatmap
corrplot(cor_matrix, method = "circle", type = "upper",
         tl.col = "black", tl.srt = 45,
         title = "Correlation Heatmap", mar = c(0, 0, 1, 0))

6.2 Time Series models

It is a method/technique for analyzing time series data in order to extract meaningful statistics/figure and other characteristics of the data. That is, what is the previous trend implies for and can it continues or not. It deals all about the time dependencies of the current on the past. It is a systematic approach to deal with mathematical and statistical questions posed by time correlation. It used to examine how data value change over time.

  • Statistical data which are collected, observed or recorded at successive intervals of time.
  • It is a set/sequence/collection of observation which can be recorded over the period of time(t) with data point recorded at equal time interval.
  • The measurements(time) may be taken every day, week, month, quarterly and year, or at any other regular interval.
  • It is a sequence of observations on a variable measured at successive points in time or over successive periods of time.

6.2.1 Area generate TS data

  • Economics: Daily stock market quotations (time unit is trading days) or monthly unemployment figures
  • Social Science: Population series such as annual birthrates or school enrolments
  • Epidemiology: Monthly number of influenza cases.
  • Medicine: Blood pressure traced over time or Functional
  • Magnetic Resonance Imaging (FMRI) of brain wave

6.2.1.1 Significance of time series analysis

The analysis of time series is of great significance not only to the economists and business man but also to the scientists, astronomists, geologist, sociologists, biologists, research worker etc. for the reason below.

  1. It helps in understanding past behavior & predict the future
  2. It helps in planning future operations
  3. It helps in evaluating current accomplishments
  4. It facilitates comparison of future relationship of data.

6.2.2 Type of time series Analysis

  1. Univariate Time series Analysis
  2. Multivariate Time series Analysis

6.2.3 ARIMA Model

Autoregressive Integrated Moving Average models (ARIMA models) were popularized by “George Box and Gwilym Jenkins” in the early 1970s. ARIMA models are a class of linear models that is capable of representing stationary as well as non-stationary time series. ARIMA models do not involve independent variables in their construction. They make use of the information in the series itself to generate forecasts.

It is one of the most widely implemented methods for analyzing univariate time series data. In order to understand the modeling procedure, it is useful to briefly introduce the following basic models. ARIMA models are specific subset of univariate modeling, in which a time series is expressed in terms of past values of itself (the autoregressive component) plus current and lagged values of a ‘white noise’ error term (the moving average component).

The ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting method that combines autoregressive (AR), differencing (I), and moving average (MA) components. The general formula for an ARIMA model of order $ (p, d, q)$ can be expressed as:

\[Y_t = c + \phi_1 Y_{t-1} + \ldots + \phi_p Y_{t-p} + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q} + \varepsilon_t\]

where:
- \(Y_t\) is the time series at time \(t\),
- \(c\) is a constant term,
- \(\phi_1\), \(\ldots\), \(\phi_p\) are the autoregressive coefficients,
- \(\theta_1\), \(\ldots\), \(\theta_q\) are the moving average coefficients,
- \(\varepsilon_t\) is the error term at time \(t\),
- \(Y_{t-1}, Y_{t-2},\)$, \(Y_{t-p}\) are lagged values of the time series,
- \(\varepsilon_{t-1}\), \(\varepsilon_{t-2}\), \(\ldots\), \(\varepsilon_{t-q}\) are lagged values of the error terms,
- \(d\) is the degree of differencing.

The ARIMA model aims to capture the autocorrelation in the time series data by incorporating the autoregressive and moving average terms. The differencing term helps in making the series stationary. The parameters \(p\), \(d\), and \(q\) are chosen based on the characteristics of the time series data.

6.2.3.1 Model Assumptions

The ARIMA model makes several assumptions about the underlying time series data, including:

  • Stationarity: The time series should be stationary, meaning that the statistical properties of the series (such as the mean and variance) do not depend on time. If the series is non-stationary, it may need to be differenced to make it stationary before using the ARIMA model.

  • Linearity: The relationship between past observations and future predictions should be linear. Nonlinear relationships may require a different type of model.

  • Independence: The error terms in the ARIMA model should be independent and identically distributed (IID), meaning that each error term should be unrelated to previous error terms and should have the same distribution. If the error terms are not IID, it may indicate that the model is misspecified.

  • Normality: The error terms should be normally distributed, with a mean of zero and constant variance. If the error terms are not normally distributed, it may indicate that the model is misspecified.

  • Adequate Sample Size: The ARIMA model assumes that the time series has a sufficient number of observations to estimate the model parameters with reasonable accuracy. A general guideline is to have at least 50 observations in the time series.

These assumptions are important to keep in mind when using the ARIMA model, as violations of these assumptions can lead to inaccurate predictions and biased parameter estimates. It is important to check these assumptions before fitting an ARIMA model to a time series and to take appropriate steps to address any violations.

Example

male_births<-read.csv("~/male.csv")
male<-ts(male_births, start=2000,freq=4)
ts.plot(male, ylab="Male Births", main="Male Live Births")

par(mfrow=c(2,1))
acf(ts(male,freq=4),NULL,main="Acf for actual series")
pp=pacf(ts(male, freq=4),NULL,main="Pacf for actual series")

#Stationary Checking
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(male,alternative="stationary") # ADF (Augmented Dickey-Fuller)p-value is less than 0.05
## 
##  Augmented Dickey-Fuller Test
## 
## data:  male
## Dickey-Fuller = -1.5531, Lag order = 4, p-value = 0.762
## alternative hypothesis: stationary
# fit Auto ARIMA Models
library(forecast)
library(ggplot2)
fit_model<-auto.arima(male,ic="aic",seasonal=TRUE,trace=TRUE) 
## 
##  ARIMA(2,0,2)(1,0,1)[4] with non-zero mean : Inf
##  ARIMA(0,0,0)           with non-zero mean : 1595.311
##  ARIMA(1,0,0)(1,0,0)[4] with non-zero mean : 1509.133
##  ARIMA(0,0,1)(0,0,1)[4] with non-zero mean : 1535.53
##  ARIMA(0,0,0)           with zero mean     : 2152.192
##  ARIMA(1,0,0)           with non-zero mean : 1521.978
##  ARIMA(1,0,0)(2,0,0)[4] with non-zero mean : 1509.428
##  ARIMA(1,0,0)(1,0,1)[4] with non-zero mean : Inf
##  ARIMA(1,0,0)(0,0,1)[4] with non-zero mean : 1513.905
##  ARIMA(1,0,0)(2,0,1)[4] with non-zero mean : Inf
##  ARIMA(0,0,0)(1,0,0)[4] with non-zero mean : 1528.147
##  ARIMA(2,0,0)(1,0,0)[4] with non-zero mean : 1485.386
##  ARIMA(2,0,0)           with non-zero mean : 1489.306
##  ARIMA(2,0,0)(2,0,0)[4] with non-zero mean : 1485.986
##  ARIMA(2,0,0)(1,0,1)[4] with non-zero mean : Inf
##  ARIMA(2,0,0)(0,0,1)[4] with non-zero mean : 1486.623
##  ARIMA(2,0,0)(2,0,1)[4] with non-zero mean : Inf
##  ARIMA(3,0,0)(1,0,0)[4] with non-zero mean : 1486.887
##  ARIMA(2,0,1)(1,0,0)[4] with non-zero mean : 1486.866
##  ARIMA(1,0,1)(1,0,0)[4] with non-zero mean : 1491.064
##  ARIMA(3,0,1)(1,0,0)[4] with non-zero mean : 1488.765
##  ARIMA(2,0,0)(1,0,0)[4] with zero mean     : Inf
## 
##  Best model: ARIMA(2,0,0)(1,0,0)[4] with non-zero mean
print(summary(fit_model))
## Series: male 
## ARIMA(2,0,0)(1,0,0)[4] with non-zero mean 
## 
## Coefficients:
##          ar1     ar2    sar1       mean
##       0.3129  0.4855  0.2552  7458.7561
## s.e.  0.0848  0.0870  0.1054   176.1787
## 
## sigma^2 = 86857:  log likelihood = -737.69
## AIC=1485.39   AICc=1486   BIC=1498.61
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -1.866228 288.9912 233.0705 -0.1768792 3.109607 0.7421209
##                     ACF1
## Training set -0.02615957
male_forecasting<-forecast(fit_model,level = c(95),h=5)
male_forecasting
##         Point Forecast    Lo 95    Hi 95
## 2026 Q1       7468.843 6891.213 8046.472
## 2026 Q2       7500.321 6895.070 8105.571
## 2026 Q3       7411.571 6718.813 8104.329
## 2026 Q4       7476.532 6757.332 8195.733
## 2027 Q1       7461.016 6651.517 8270.515
autoplot(male_forecasting)

Example
Gasoline Sales Time Series After Obtaining the Contract with the Maroodi-Jeex Region Police.

gasoline<-data.frame(weeks=1:22, sales=c(17,21,19,23,18,16,20,18,22,20,15,22,31,34,31,33,28,32,30,29,34,33))
head(gasoline)
weeks sales
1 17
2 21
3 19
4 23
5 18
6 16
gas_oline<-ts(gasoline$sales, start = 2024, frequency = 52)
print(gas_oline)
## Time Series:
## Start = c(2024, 1) 
## End = c(2024, 22) 
## Frequency = 52 
##  [1] 17 21 19 23 18 16 20 18 22 20 15 22 31 34 31 33 28 32 30 29 34 33
library(forecast)
arima_model <- auto.arima(gas_oline)
summary(arima_model)
## Series: gas_oline 
## ARIMA(0,1,0) 
## 
## sigma^2 = 16.86:  log likelihood = -59.46
## AIC=120.92   AICc=121.13   BIC=121.96
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set 0.7280455 4.011349 3.455318 1.442402 14.70862  NaN -0.1952415

The ARIMA(0,1,0) model was fitted to the gas_oline time series data representing gasoline sales in the Maroodi-Jeex Region Police after obtaining the contract. Key metrics include an estimated variance of 16.86, a log likelihood of -59.46, Akaike Information Criterion (AIC) of 120.92, corrected AIC of 121.13, and Bayesian Information Criterion (BIC) of 121.96.

Training set error measures show an average of 0.728 forecast errors, indicating the model tends to slightly over-predict. The Root Mean Squared Error (RMSE) is 4.011, the Mean Absolute Error (MAE) is 3.455, the Mean Percentage Error (MPE) is 1.44%, and the Mean Absolute Percentage Error (MAPE) is 14.71%. The MASE is not available, indicating a small average percentage error in the predictions. The autocorrelation of the first lag of the residuals is -0.195, indicating any remaining autocorrelation in the residuals.

Based on these metrics, the ARIMA(0,1,0) model may provide a reasonable fit to the gasoline sales data in the Somali Region State Police, with room for improvement in accuracy. Further analysis and model refinement may be necessary to enhance forecasting performance.

autoplot(arima_model$residuals, main = "Time Series plot of gasoline")

forecast_values <- forecast(arima_model, h = 10)
forecast_values
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2024.423             33 27.73827 38.26173 24.952884 41.04712
## 2024.442             33 25.55880 40.44120 21.619660 44.38034
## 2024.462             33 23.88642 42.11358 19.061987 46.93801
## 2024.481             33 22.47655 43.52345 16.905768 49.09423
## 2024.500             33 21.23442 44.76558 15.006102 50.99390
## 2024.519             33 20.11146 45.88854 13.288672 52.71133
## 2024.538             33 19.07878 46.92122 11.709333 54.29067
## 2024.558             33 18.11759 47.88241 10.239319 55.76068
## 2024.577             33 17.21482 48.78518  8.858653 57.14135
## 2024.596             33 16.36096 49.63904  7.552785 58.44721
autoplot(forecast_values)

The Figure depicts a line graph illustrating the ARIMA(0,1,0) model’s forecasts for gasoline sales data from 2024.0 to 2024.6. The x-axis represents the observed values over time, while the y-axis represents the predicted sales. The solid black line represents actual data points, while the shaded blue area represents the predicted sales. A diagonal line serves as a trend or reference line for comparison. The title “Forecasts from ARIMA(0,1,0)” indicates that the graph is specifically displaying the forecasted values generated by the ARIMA model. The graph provides a visual representation of how well the ARIMA model forecasts match the actual gasoline sales data over the specified time period.