The R programming environment is a free, open-source software environment and programming language specifically designed for statistical computing and data graphics. It is widely used by data analysts, data scientists, and researchers across various fields, including finance, healthcare, and academia.
R is a powerful programming language and software environment designed for statistical computing, data analysis, and graphical representation. It is widely used by statisticians, data analysts, and researchers due to its flexibility and extensive libraries.
R is a language and an environment in which you can perform statistical data analysis and graphics. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
Our introduction to the R environment did not mention statistics, yet many people use R as a statistics system. We prefer to think of it of an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. There are about 25 packages supplied with R (called “standard” and “recommended” packages) and many more are available through the CRAN family of Internet sites Visit R Project and elsewhere. More details on packages are given later. Most classical statistics and much of the latest methodology is available for use with R, but users may need to be prepared to do a little work to find it. There is an important difference in philosophy between S (and hence R) and the other main statistical systems. In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.
To start R in the Windows environment double click on the R icon A new screen will appear containing one window The last line to appear will be `>’, a standard prompt to indicate that R is expecting a command. The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
The on-line help gives useful information. Getting used to using it and understanding the help will make it easier to use R. A keyword search is possible using the Search Engine and Keywords link. “help.start()”
To do a keyword search use the function apropos(). For example:
apropos("test")
## [1] ".valueClassTest" "ansari.test" "bartlett.test"
## [4] "binom.test" "Box.test" "chisq.test"
## [7] "cor.test" "file_test" "fisher.test"
## [10] "fligner.test" "friedman.test" "kruskal.test"
## [13] "ks.test" "mantelhaen.test" "mauchly.test"
## [16] "mcnemar.test" "mood.test" "oneway.test"
## [19] "pairwise.prop.test" "pairwise.t.test" "pairwise.wilcox.test"
## [22] "poisson.test" "power.anova.test" "power.prop.test"
## [25] "power.t.test" "PP.test" "prop.test"
## [28] "prop.trend.test" "quade.test" "shapiro.test"
## [31] "t.test" "testInheritedMethods" "testVirtual"
## [34] "var.test" "wilcox.test"
Note that: - You need to put the keyword in double
quotes (“keyword’’).
- A lot of valuable information about R can be found through:
Classwork
Use the help system to find help on the following topics.
R can do any types of mathematical manipulation Simple arithmetic Whatever is typed at the prompt is evaluated, and the result is printed.
2+3 # Addition
## [1] 5
2*4 + 7 # Multiplication and addition
## [1] 15
10/3 # Division
## [1] 3.333333
5**3 # ** to the power of...
## [1] 125
5^3 # or we can use ^
## [1] 125
Standard functions that are found on a scientific calculator are available in R, for example:
sqrt(2) # Square root
## [1] 1.414214
It also provides \(\pi\) as a constant.
pi
## [1] 3.141593
sin(pi)
## [1] 1.224606e-16
cos(pi)
## [1] -1
Table 1: The Arithmetic Functions in R
| Name | Operation |
|---|---|
| sqrt | square root |
| abs | absolute value |
| sin cos tan | trigonometric functions (radians) |
| asin acos atan | inverse trigonometric functions |
| sinh cosh tanh | hyperbolic functions |
| asinh acosh atanh | inverse hyperbolic functions |
| exp log | exponential and natural logarithm |
| log10 | common logarithm |
| gamma lgamma | gamma function and its natural log |
Table 3: Functions that return a single value
| Name | Operation |
|---|---|
| length(x) | the number of elements in x |
| sum(x) | the sum of the values of x |
| mean(x) | the mean of the values of x |
| var(x) | the variance of the values of x |
| sd(x) | the standard deviation of the values of x |
| min(x) | the minimum value from the values of x |
| max(x) | the maximum value from the values of x |
| prod(x) | the product of the values of x |
| range(x) | the range of the values of x |
A value can be stored in a named variable by assigning it with the <- or = symbols, for example:
x <- 5 # assigns 5 to x
x # type x to print value
## [1] 5
y = sqrt(9) # assign the square-root of 9 to y
y
## [1] 3
In fact you can make assignments using ->, for example:
7 -> z # assigns 7 to z
z
## [1] 7
This could be interpreted as z is assigned 7. Now it is possible to do arithmetic with x, y and z, for example:
(x * y) + z
## [1] 22
y^x - z + 6
## [1] 242
w <- x+y
w
## [1] 8
Note that: To delete an object from the list, use the rm() functions. For example
rm(w)
ls() # to list the objects (e.g., variables, data, functions) that you have created
## [1] "x" "y" "z"
R enables computation with Boolean or Logical variables. These take
on either the value True or False. You can use conditional tests to
generate these values:
Table 2: Mathematical and Logical Comparison Operators in R
| Operator | Syntax |
|---|---|
| less than | < |
| less than or equal to | <= |
| exactly equal to | == |
| greater than or equal to | >= |
| greater than | > |
| not equal to | != |
| or | | |
| and | & |
x <- 32
x > 16 # Is x greater than 16?
## [1] TRUE
x <= 16 # Is x less than equal to 16?
## [1] FALSE
Logical values can be stored in variables in the same way as numeric values:
tf <- x>16
tf
## [1] TRUE
Simple manipulation in R typically involves using base R functions or
specialized packages like dbplyr from the
tidyverse for tasks such as creating new variables,
filtering rows, selecting columns, and summarizing data. Generally
refers to tasks like selecting, filtering,
sorting, and creating new variables within data structures
such as vectors and data frames.
In this section, we will focus on the very basics of data manipulation and transformation using one of the most important packages from the tidyverse: dplyr. dplyr will also help us begin to explore and summarize our data. Make sure you have the Tidyverse loaded in your R environment.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dplyr is one of the core packages of the tidyverse, and
is one of the most widely used packages in the R ecosystem. You can
think of dplyr as a package that allows for many of the basic data
transformation techniques any Actuary will need in order to take raw
data and transform it into usable data for an analysis. For those
familiar with SQL, you will notice many parallels to SQL statements.
Note that SQL knowledge is not necessary for this course, but we may
make connections to SQL where appropriate since it is the most widely
used language to query and manipulate data.
dplyr itself is almost a ‘sub-language’ of R that
requires practice to get the hang of, but once you do, it will make
performing complex data transformations simple, succinct, and easy to
read.
Data manipulation in R involves cleaning, transforming, and
organizing data to make it suitable for analysis. It includes tasks like
selecting, filtering, sorting, and creating new variables, often done
using the dplyr package for simple and efficient operations. The dplyr
package provides functions to perform these operations using simple
syntax.
Common dplyr Functions
Some of the key functions in the dplyr package used for manipulating
data in R are:
| Function | Description |
|---|---|
filter() |
Produces a subset of a data frame based on conditions. |
distinct() |
Removes duplicate rows from a data frame. |
arrange() |
Reorders the rows of a data frame. |
select() |
Selects specific columns from a data frame. |
rename() |
Renames column (variable) names in a data frame. |
mutate() |
Creates new variables without removing existing ones. |
transmute() |
Creates new variables and drops the original ones. |
summarize() |
Produces summary statistics. |
Additionally, there is a seventh function, group_by()
that can be used in conjunction with the above functions.
If you pay attention to the bolded text above, you’ll notice that the
filter(), arrange() and slice()
functions work on rows in your dataset, whereas the
select() and mutate() functions work on the
columns or fields in your data. The summarize() function is
unique in that it works on groups of rows.
1. filter() Method
We use the filter() method to extract rows that meet a given
condition.
Syntax:
filter(dataframeName, condition)
Example:\ We will create a sample
data frame from which we extract rows where the runs column is greater
than 100.
library(dplyr)
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
runs = c(100, 200, 408, 19),
wickets = c(17, 20, NA, 5))
filter(stats, runs > 100)
| player | runs | wickets |
|---|---|---|
| B | 200 | 20 |
| C | 408 | NA |
2. distinct() Method
We use the distinct() method to remove duplicate rows based
on one or more columns.
Syntax:
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)
Example:\ We remove duplicate
rows from the stats data frame, first entirely and then based on the
player column.
stats <- data.frame(player = c('A', 'B', 'C', 'D', 'A', 'A'),
runs = c(100, 200, 408, 19, 56, 100),
wickets = c(17, 20, NA, 5, 2, 17))
distinct(stats)
| player | runs | wickets |
|---|---|---|
| A | 100 | 17 |
| B | 200 | 20 |
| C | 408 | NA |
| D | 19 | 5 |
| A | 56 | 2 |
distinct(stats, player, .keep_all = TRUE)
| player | runs | wickets |
|---|---|---|
| A | 100 | 17 |
| B | 200 | 20 |
| C | 408 | NA |
| D | 19 | 5 |
3. arrange() Method
We use the arrange() method to sort rows in ascending order
based on one or more columns.
Syntax:
arrange(dataframeName, columnName)
Example:\ We sort the rows of the stats data frame based on the runs column in ascending order.
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
runs = c(100, 200, 408, 19),
wickets = c(17, 20, NA, 5))
arrange(stats, runs)
| player | runs | wickets |
|---|---|---|
| D | 19 | 5 |
| A | 100 | 17 |
| B | 200 | 20 |
| C | 408 | NA |
4. select() Method
We use the select() method to extract specific columns from
a data frame.
Syntax:
select(dataframeName, col1,col2,…)
Example:\ Imagine you had no
familiarity with the AIS dataset and wanted to sum of skin folds for
each individual. You would first need to use the select()
function to extract the relevant columns. Let’s first look at all the
possible columns in the AIS data using the colnames()
function.
library(DAAG) # Load the package
library(dplyr) # Load dplyr package
data(ais) # Load the dataset into memory
colnames(ais) # Now this works
## [1] "rcc" "wcc" "hc" "hg" "ferr" "bmi" "ssf" "pcBfat"
## [9] "lbm" "ht" "wt" "sex" "sport"
# Now select the relevant columns
ais<- dplyr::select(ais, rcc, wcc, lbm, bmi, ht, wt, sex, sport)
head(ais)
| rcc | wcc | lbm | bmi | ht | wt | sex | sport |
|---|---|---|---|---|---|---|---|
| 3.96 | 7.5 | 63.32 | 20.56 | 195.9 | 78.9 | f | B_Ball |
| 4.41 | 8.3 | 58.55 | 20.67 | 189.7 | 74.4 | f | B_Ball |
| 4.14 | 5.0 | 55.36 | 21.86 | 177.8 | 69.1 | f | B_Ball |
| 4.11 | 5.3 | 57.18 | 21.88 | 185.0 | 74.9 | f | B_Ball |
| 4.45 | 6.8 | 53.20 | 18.96 | 184.6 | 64.6 | f | B_Ball |
| 4.10 | 4.4 | 53.77 | 21.04 | 174.0 | 63.7 | f | B_Ball |
5. rename() Method
We use the rename() method to change the names of columns in a data
frame.
Syntax:
rename(dataframeName, newName=oldName)
ais <- dplyr::rename(ais, Body_Mass_index=bmi)
head(ais)
| rcc | wcc | lbm | Body_Mass_index | ht | wt | sex | sport |
|---|---|---|---|---|---|---|---|
| 3.96 | 7.5 | 63.32 | 20.56 | 195.9 | 78.9 | f | B_Ball |
| 4.41 | 8.3 | 58.55 | 20.67 | 189.7 | 74.4 | f | B_Ball |
| 4.14 | 5.0 | 55.36 | 21.86 | 177.8 | 69.1 | f | B_Ball |
| 4.11 | 5.3 | 57.18 | 21.88 | 185.0 | 74.9 | f | B_Ball |
| 4.45 | 6.8 | 53.20 | 18.96 | 184.6 | 64.6 | f | B_Ball |
| 4.10 | 4.4 | 53.77 | 21.04 | 174.0 | 63.7 | f | B_Ball |
6. mutate() and transmute() Methods
We use mutate() to create new columns while keeping the existing
ones.
Syntax:
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
ais_mutated <- mutate(ais,
Ht_m = ht / 100,
Body_Mass_index = wt / (Ht_m^2))
# Transmute to create BMI and keep only specific columns
ais_subset <- ais %>%
transmute(
Sex = sex,
Sport = sport,
Body_Mass_index = wt / (ht / 100)^2)
# View the result
head(ais_subset)
| Sex | Sport | Body_Mass_index |
|---|---|---|
| f | B_Ball | 20.55929 |
| f | B_Ball | 20.67466 |
| f | B_Ball | 21.85821 |
| f | B_Ball | 21.88459 |
| f | B_Ball | 18.95698 |
| f | B_Ball | 21.03977 |
7. summarize() Method
We use the summarize() method to reduce a set of values to
a single summary value using aggregation functions like
sum(), mean(), etc.
Syntax:
summarize(dataframeName,
aggregate_function(columnName))
# Summary of the entire dataset
summary(ais)
## rcc wcc lbm Body_Mass_index
## Min. :3.800 Min. : 3.300 Min. : 34.36 Min. :16.75
## 1st Qu.:4.372 1st Qu.: 5.900 1st Qu.: 54.67 1st Qu.:21.08
## Median :4.755 Median : 6.850 Median : 63.03 Median :22.72
## Mean :4.719 Mean : 7.109 Mean : 64.87 Mean :22.96
## 3rd Qu.:5.030 3rd Qu.: 8.275 3rd Qu.: 74.75 3rd Qu.:24.46
## Max. :6.720 Max. :14.300 Max. :106.00 Max. :34.42
##
## ht wt sex sport
## Min. :148.9 Min. : 37.80 f:100 Row :37
## 1st Qu.:174.0 1st Qu.: 66.53 m:102 T_400m :29
## Median :179.7 Median : 74.40 B_Ball :25
## Mean :180.1 Mean : 75.01 Netball:23
## 3rd Qu.:186.2 3rd Qu.: 84.12 Swim :22
## Max. :209.4 Max. :123.20 Field :19
## (Other):47
# Summarize average RCC and Weight by sex
ais_summary <- ais %>%
group_by(sex) %>%
summarise(
Avg_Weight = mean(wt),
Avg_RCC = mean(rcc),
Count = n())
print(ais_summary)
## # A tibble: 2 × 4
## sex Avg_Weight Avg_RCC Count
## <fct> <dbl> <dbl> <int>
## 1 f 67.3 4.40 100
## 2 m 82.5 5.03 102
Understanding the different types of data and how R deals with these
data is important. The temptation is to glaze over and skip these
technical details, but beware, this can come back to bite you somewhere
unpleasant if you don’t pay attention. We’ve already seen an example of
this when we tried (and failed) to add two character objects together
using the ‘+’ operator.
R has six basic types of data; numeric, integer, logical, complex and
character.
A data structure is a particular way of organizing data in a computer
so that it can be used effectively. The idea is to reduce the space and
time complexities of different tasks. Data structures in R programming
are tools for holding multiple values. R has several built-in data
structures, including:
All the values presented so far have been scalars. R can handle vectors which are a combination of scalars in a single structure.
The c(…) construct can be used to create vectors, named as “concatenate” in R .That means “treat as data set”. To set up a vector named x, say, consisting of five numbers, namely 2, 4,6,10 and 11, use the R command:
x <- c(2,4,6,10,11)
x
## [1] 2 4 6 10 11
Vectors are the simplest type of object in R. They can easily be
created with c, the combined function. There are 3 main types of
vectors
Functions that return vectors with the same length
order(x)
## [1] 1 2 3 4 5
sort.list(x)
## [1] 1 2 3 4 5
sort(x)
## [1] 2 4 6 10 11
x[order(x)]
## [1] 2 4 6 10 11
x[sort.list(x)]
## [1] 2 4 6 10 11
1/x
## [1] 0.50000000 0.25000000 0.16666667 0.10000000 0.09090909
Numeric Vector: is a single entity consisting of an ordered collection of numbers.
Example: To set up a numeric vector X consisting of 5 numbers ,10,6,3,6,22, we use any one of the following commands:
x<-c(10,6,3,6,22)#OR
x=c(10,6,3,6,22)#OR
assign("x",c(10,6,3,6,22))#OR
c(10,6,3,6,22)->x
y<-c(x,0,x)
x
## [1] 10 6 3 6 22
Character vectors: To set up a character/string vector z consisting of 3 place names use:
z<-c("Canberra","Sydney","Newcastle") #Or
z<-c("Canberra","Sydney","Newcastle")
Common useful escape sequences are:
Example:
cat("Abebe","\n","zelalem","\n")
## Abebe
## zelalem
Indexing Vectors
const=c(3.1416,2.7183,1.4142,1.6180)
names(const)=c("pi","euler","sqrt2","golden")
const[c(1,3,4)] # printing the 1st, 2ndand the 4thelements of const.
## pi sqrt2 golden
## 3.1416 1.4142 1.6180
const[c(-1,-2)]
## sqrt2 golden
## 1.4142 1.6180
const>2
## pi euler sqrt2 golden
## TRUE TRUE FALSE FALSE
const[const>2]
## pi euler
## 3.1416 2.7183
Regular Sequences
R can be used to generate vectors sequences. You can use the notation #1:#2 to generate a vector that consists of a sequence, where #1 and #2 are two integer numbers.
1:12
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
4:18
## [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
-10:3
## [1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
x<- 1:10 # storing sequence on variable x
x
## [1] 1 2 3 4 5 6 7 8 9 10
y <- 40:1
y
## [1] 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
## [26] 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
The last example (y) shows that is a vector is too long for one line
it simply continues on the following line (using as many lines as are
necessary).
Further in the square brackets shows the position at which that line begins ( e.g: [26] mean that the first values on this line are the 26th value).
rep(c(1:4), 3) # repeat 1 2 3 4 three times
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
r <- rep(c(1.2, 6.1, 3.2, 5.5), 2) # repeat vector 1.2, 6.1, 3.2, 5.5 two times
r
## [1] 1.2 6.1 3.2 5.5 1.2 6.1 3.2 5.5
seq(4, 6, .25) # generate integer b\n 4 & 6 with increments of 0.25.
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(length=9, from=4, to=6)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(from=4, to=6, by=0.25)
## [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
seq(7)
## [1] 1 2 3 4 5 6 7
seq(7, 10)
## [1] 7 8 9 10
seq(7, 10, 0.5)
## [1] 7.0 7.5 8.0 8.5 9.0 9.5 10.0
rep(seq(1, 3, 0.4), 2) # generate integer b\n 1& 3 with increments of 0.4,then repeat it 2-times.
## [1] 1.0 1.4 1.8 2.2 2.6 3.0 1.0 1.4 1.8 2.2 2.6 3.0
We can manipulate vectors in a similar manner to scalars. However, care must be take when doing such things as the results may not be the desired ones.
x <- 12:1 # x is a sequence .....
x
## [1] 12 11 10 9 8 7 6 5 4 3 2 1
Extracting values of a Vector
If we are interested in only extracting values of the vector, then we can do this using square bracket [ ].
x<- c(1:10)**2 # generate sequence of vector (x) 1 to 10 and square each
x
## [1] 1 4 9 16 25 36 49 64 81 100
x[6] # extracting the 6th value from vector x above
## [1] 36
x[2:4] # extracting the 2st, 3th and 4th values from x
## [1] 4 9 16
x[c(1,7,9)] # extracting the 1st, 7th and 9th values from x
## [1] 1 49 81
y <- x[c(1,5,8)] # assigning the 1st, 5th and 8th values of x to y.
y
## [1] 1 25 64
x[9:6] # extracting the 9th, 8th ,7thand 6th values from x
## [1] 81 64 49 36
x[c(1:3, 8:10)] # extracting the 1st, 2th ,3thand 8th,9th,10th values
## [1] 1 4 9 64 81 100
x[c(1,2,3,1,2,3,1,2,3,1,2,3)] # repetition of the index...
## [1] 1 4 9 1 4 9 1 4 9 1 4 9
x[c(8,2,5,10)] # any order you please....
## [1] 64 4 25 100
x[x > 5] # Extract the value of x greater than 5
## [1] 9 16 25 36 49 64 81 100
Excluding value from vector
If you use a negative subscript in your selection procedure, the corresponding numbered element is not included in the return vector. For example
x <- c(3,6,9,12,15,18,21)
x
## [1] 3 6 9 12 15 18 21
x[-4] # exclude the 4th element...
## [1] 3 6 9 15 18 21
x[c(-4,-6)] # exclude 4th and 6th elements...
## [1] 3 6 9 15 21
Example:
x<-c("A","A","A","A","B","B","B")
x
## [1] "A" "A" "A" "A" "B" "B" "B"
More Function on Vector
There are functions that operate on vectors and return useful information:
x <- 5:14
length(x) # Number of elements in x
## [1] 10
max(x) # Largest value in x
## [1] 14
min(x) # Smallest value in x
## [1] 5
sum(x) # Sum of all the values in x
## [1] 95
prod(x) # The product of all the values in x
## [1] 3632428800
mean(x) # The mean of the all values in x
## [1] 9.5
range(x) # Range of vector x
## [1] 5 14
var(x) # The variance of x
## [1] 9.166667
sd(x) # The standard deviation of x
## [1] 3.02765
sqrt(var(x)) # The square root of the variance (i.e. stand. deviation)
## [1] 3.02765
x <- 1:10
x>7
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
R lets you store data in a two-dimensional matrix.
Example
x <-c(1:8)
dim(x) <-c(2,4)
x
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
Method 2: Using the function matrix:
x <-matrix(c(1:8),2,4,byrow=F)
x
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
w<-matrix(c(1,2,40,2,3,9,20,4,60),nrow=3) # create 3-row by 3-coln. matrix
w
## [,1] [,2] [,3]
## [1,] 1 2 20
## [2,] 2 3 4
## [3,] 40 9 60
w <-matrix(c(1,2,40,2,3,9,20,4,60),3,3)# Similar to the above create 3 by 3
w
## [,1] [,2] [,3]
## [1,] 1 2 20
## [2,] 2 3 4
## [3,] 40 9 60
x <- matrix(c(2,3,5,7,11,13),ncol=2) # create 3-row by 2-colm matrix
x
## [,1] [,2]
## [1,] 2 7
## [2,] 3 11
## [3,] 5 13
x[2,1] # extract 2nd-row and 1st-coln. element of x
## [1] 3
x[2,2] # extract 2nd-row and 2nd-coln. element of matrix x
## [1] 11
x[,1] # The first column element
## [1] 2 3 5
x[3,] # The third row element
## [1] 5 13
x[2:3] # extract Rows 2 and coln. 2 and form matrix
## [1] 3 5
xy<- matrix(1:16,ncol=4)
xy
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
rowSums(xy) # the sum of each rows
## [1] 28 32 36 40
colSums(xy) # the sum of each columns
## [1] 10 26 42 58
mean (xy [, 2]) # to find the mean of matrix-xy in the coln-2
## [1] 6.5
mean (xy [3,]) # to find the mean of matrix-xy in the row-3
## [1] 9
mx <- matrix(seq(0, 95, length=20), ncol=5)
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 20 40 60 80
## [2,] 5 25 45 65 85
## [3,] 10 30 50 70 90
## [4,] 15 35 55 75 95
mx [3,1] <- -12 # replace 3rd-row and 1st-coln element by -12
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 20 40 60 80
## [2,] 5 25 45 65 85
## [3,] -12 30 50 70 90
## [4,] 15 35 55 75 95
mx [3,] <- 0 # replace 3rd-row element by 0
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 20 40 60 80
## [2,] 5 25 45 65 85
## [3,] 0 0 0 0 0
## [4,] 15 35 55 75 95
mx[,2] <- c(3, 6, 9, 12) #replace the 2nd column of mx by 3,6,9,12
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 3 40 60 80
## [2,] 5 6 45 65 85
## [3,] 0 9 0 0 0
## [4,] 15 12 55 75 95
mx[4,] <- c(-1, -2, -3, -4, -5) #replace 4th-row element of mx by -1, -2, -3, -4, -5
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 3 40 60 80
## [2,] 5 6 45 65 85
## [3,] 0 9 0 0 0
## [4,] -1 -2 -3 -4 -5
mx[,1] <- c(0.5, 0.9)
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
cbind(mx, c(0.1, 0.2, 0.3, 0.4)) # Add an extra column
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.5 3 40 60 80 0.1
## [2,] 0.9 6 45 65 85 0.2
## [3,] 0.5 9 0 0 0 0.3
## [4,] 0.9 -2 -3 -4 -5 0.4
rbind(mx, 100:104) # Add a an extra row.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
## [5,] 100.0 101 102 103 104
cbind(mx,mx) # Add the matrix to itself (column wise).
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.5 3 40 60 80 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5 0.9 -2 -3 -4 -5
rbind(mx, mx) # Add the matrix to itself (row wise).
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
## [5,] 0.5 3 40 60 80
## [6,] 0.9 6 45 65 85
## [7,] 0.5 9 0 0 0
## [8,] 0.9 -2 -3 -4 -5
cbind(mx[,1:3], c(91,92,93,94), mx[,4])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 91 60
## [2,] 0.9 6 45 92 65
## [3,] 0.5 9 0 93 0
## [4,] 0.9 -2 -3 94 -4
mx
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 3 40 60 80
## [2,] 0.9 6 45 65 85
## [3,] 0.5 9 0 0 0
## [4,] 0.9 -2 -3 -4 -5
Method 1: Using vectors
- A vector can be used by R as an array only if it has a dimension
vector as its dim attribute. - A dimension vector is a vector of
non-negative integers. If its length is k then the array is
k-dimensional. - An array can be created by giving a vector structure, a
dim, which has the form
Example
The following is a (3-dimentional) array with dimension vector
c(3,5,100) and a vector of 1500 elements.
z=c(1:1500)
dim(z) <- c(3,5,100)
dim(z)
## [1] 3 5 100
Factors are the data objects which are used to categorize the data and store it as levels. They are useful for storing categorical data. They can store both strings and integers. They are useful to categorize unique values in columns like (“TRUE” or “FALSE”) or (“MALE” or “FEMALE”), etc.. They are useful in data analysis for statistical modeling.
# Creating factor using factor()
fac = factor(c("Male", "Female", "Male",
"Male", "Female", "Male", "Female"))
print(fac)
## [1] Male Female Male Male Female Male Female
## Levels: Female Male
A list is a generic object consisting of an ordered collection of objects. A list is a generic vector containing other objects. Unlike vectors,Lists are heterogeneous data structures it can contain elements of different types, such as numbers, strings, and even other lists. Lists are created using the list() function.
# Creating a list
my_list <- list(
name = "Jamal",
age = 30,
is_student = FALSE,
scores = c(85, 90, 78),
address = list(
street = "123 Main St",
city = "Anytown",
zip = "12345"
)
)
print(my_list)
## $name
## [1] "Jamal"
##
## $age
## [1] 30
##
## $is_student
## [1] FALSE
##
## $scores
## [1] 85 90 78
##
## $address
## $address$street
## [1] "123 Main St"
##
## $address$city
## [1] "Anytown"
##
## $address$zip
## [1] "12345"
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(empId, empName, numberOfEmp)
print(empList)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] "Debi" "Sandeep" "Subham" "Shiba"
##
## [[3]]
## [1] 4
Data frame is a data structure which contains more than 1 vector of equal length. A data frame is very much like a matrix, except it is designed for storing statistical or experimental data. Each row represents a unit, and each column a collection of measurements on the units. Each column can store a different type of data, such as numeric or character. Let us take some example:
Example
name= c("Omer", "Faduma", "Ibrahim","Jamal")
age=c(18,22,25,27)
sex=c("F","M","M","F")
stud=data.frame(name, age, sex)
stud
| name | age | sex |
|---|---|---|
| Omer | 18 | F |
| Faduma | 22 | M |
| Ibrahim | 25 | M |
| Jamal | 27 | F |
stud$name #access or extract student data only names
## [1] "Omer" "Faduma" "Ibrahim" "Jamal"
Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. So far, all data has been entered in R using the keyboard or generated using R functions. In reality, data is usually too large to type in and is stored in files. There are functions used to read data from external files such as Excel, SPSS, Epi-info, Stata and so on.
In R you can import text file swith the function “read.table” the functions read.csv and read.delim are functions to read―comma separated values ‖files period as a decimal and tab delimited files period as a decimal respectively.
R Syntax: used for to read Data
read.sav is ued to reads a file or data stored by the SPSS read.sav(file_name())
read.csv is ued to reads a file or data stored by CSV read.csv(file.choose())
read_dta is ued to reads a file or data stored by STATA read.dta(file_name())
read.table is ued to reads a file or data stored by TABLE read.txt(file.choose())
The functions read.mtp is functions to read MINITAB files read.mtp(file_name())
read_sas is ued to reads a file or data stored by SAS read.as7bdat(file_name())
read.epiinfo is ued to reads a file or data stored by EPI-Info read.rec (file_name())
The function in R is used to read text files containing tabular data into a data frame read.delim2(file.choose()) .
Example
#R=read.csv(file.choose(),header=T)
#R=read.table(file.choose(),header=T)
In R, you can export data to various file formats such as CSV, Excel,Stata, SPSS and others using different functions provided by R packages. Here are some common examples of exporting data in R:
# Create a sample data frame for five Students
data <- data.frame(
ID = 1:5,
Name = c("Ali", "Bilan", "Cawale", "Deeqsi", "Omer"),
Score = c(85, 90, 88, 92, 87))
data
| ID | Name | Score |
|---|---|---|
| 1 | Ali | 85 |
| 2 | Bilan | 90 |
| 3 | Cawale | 88 |
| 4 | Deeqsi | 92 |
| 5 | Omer | 87 |
# Export data frame to a CSV file
write.csv(data, "data.csv", row.names = FALSE)
# Install and load the 'writexl' package for writing Excel files
#install.packages("writexl")
library(writexl)
# Export data frame to an Excel file
data1<-write_xlsx(data, "data1.xlsx")
data1
## [1] "D:\\Rpubs Projects\\Stat Comp II\\rmd\\data1.xlsx"
Instruction: Submit This Exercise R Scripts only
Write the R code of the following equations
Generate the following sequences
What is the meaning of rep (letters [1:5], 4))?
Generate 80 values from a standard normal distribution.
Generate 100 values from a binomial distribution of size 23 and probability 0.25.
Generate 100 values from a log normal distribution with mean 12 and standard deviation 2.
Create the following vector sequences:
For each of the sequences above calculate the mean, median, variance and the range.
Generate 15 values from a uniform distribution between 0.7 and 1.
Generate 15 values from a uniform distribution between 0 and 0.3.
Generate 28 values from a uniform distribution between 1.5 and 7.5.
Using the functions runif() and trunc() simulate 20 die throws.
Generate 30 values from a Normal distribution with mean 3.4 and variance 5.25 rounded to 2 decimal places. Calculate the median, range, and the upper and lower quartiles.
Generate 30 values from a Poisson distribution with lambda 5.
Generate 20 values from a Binomial distribution of size 15 and probability of success 0.3.
Generate a randomization list of size 150 for three treatment groups: “Placebo”, “Drug A” and “Drug B”.
Generate 100 values from a standard normal distribution and count the number of values greater than 1.96. Extract these into a vector called signif.100. Do the same again but this time generate 2000 values.
Submission
To submit your exercise, please fill out the form below and upload your
R script file.
Click
Here
Control statements are used to control the flow of execution of R commands. They are used to carry out conditional and repetitive executions of R commands. Control statements include conditional statements (if, if-else, ifelse) and loops (for, while, repeat). These control statements are very useful when writing your own functions in R.
In this class we will introduce the R function syntax and in particular the passing of variables into the function, and argument handling.
If you find you have a process that you use a lot, perhaps one that calls several R functions, you can write your own function to do it. Probably one of the most powerful aspects of the R language is the ability of a user to write functions. When doing so, many computations can be incorporated into a single function and (unlike with scripts) intermediate variables used during the computation are local to the function and are not saved to the workspace. In addition, functions allow for input values used during a computation that can be changed when the function is executed.
if(), if()else and
ifelse()Syntax:
if(condition) {commands 1}
if(condition) {commands 1} else { commands 2} condition must evaluate to a single logical value, (i.e either TRUE or FALSE.)
ifelse (conditions vector, yes vector, no vector )
if Condition in R
This task is carried out only if this condition is returned as TRUE. R makes it even easier: You can drop the word then and specify your choice in an if statement.
if (test_expression) {
statement
}
Example 1:
values <- 1:10
if (sample(values, 1) <= 10) {
print(paste(sample(values, 1), "is less than or equal to 10"))
}
## [1] "5 is less than or equal to 10"
Example 2:
Write a function that calculate arithmetic mean for grouped data. Group
midpoints are (10, 20, 30, 40) and Frequencies for each group (5, 10, 8,
7)
grouped_mean <- function(groups, frequencies) {
if (length(groups) != length(frequencies)) {
stop("Lengths of ’groups’ and ’frequencies’
should be the same.")
}
total_sum <- sum(groups * frequencies)
total_freq <- sum(frequencies)
mean <- total_sum / total_freq
return(mean)
}
groups <- c(10, 20, 30, 40) # Group midpoints
frequencies <- c(5, 10, 8, 7) # Frequencies for each group
result <- grouped_mean(groups, frequencies)
print(result)
## [1] 25.66667
if-else Condition in R
An if. . . else statement contains the same elements as an
if statement (see the preceding section), with some extra elements:
if() statement is FALSE.if (test_expression) {
statement
} else {
statement
}
x <- 9
if (x > 0) sqrt(x) else sqrt(-x)
## [1] 3
ifelse(x >= 0, sqrt(x), sqrt(-x))
## [1] 3
before<-c(26.4,28.2,22.1,24.5,21.4,27.3)
after<-c(33.9,37.1,36.1,39.5,30.0,31.4)
df<-data.frame(before,after)
df
| before | after |
|---|---|
| 26.4 | 33.9 |
| 28.2 | 37.1 |
| 22.1 | 36.1 |
| 24.5 | 39.5 |
| 21.4 | 30.0 |
| 27.3 | 31.4 |
t_test<-t.test(df)
t_test
##
## One Sample t-test
##
## data: df
## t = 17.469, df = 11, p-value = 2.27e-09
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 26.06716 33.58284
## sample estimates:
## mean of x
## 29.825
pvalue <- t_test$p.value
if (pvalue < 0.05) {
print("Reject the null hypothesis")
} else {
print("Fail to reject the null hypothesis")
}
## [1] "Reject the null hypothesis"
Loops: for (), while ()
and repeat ()
Syntax:
for ( var in set ) {commands}
while ( condition ) {commands}
repeat {commands}
Where
the object set is a vector,
commands is a single command or a sequence of commands and
var is a variable which may be used in commands.
Exercise
Write a function that calculate arithmetic mean for grouped data
Consider a function that returns the maximum of two scalars or the statement that they are equal.
Write a function that computes the zero(s) of a quadratic equation.
Write a function that returns the mean of the values in a vector that are greater than the median. Use the median() function. Call your function gtmean.
A function accepts input arguments and produces the output by executing valid R commands that are inside the function. Functions are useful when we want to perform a certain task multiple times. In R Programming Language when we are creating a function the function name and the file in which we are creating the function need not be the same and we can have one or more functions in R.
# name_of_function <- function(argument1, argument2) {
# statements or code that does something
# return(something)
# }
In programming, parameters and arguments refer to the values passed
into a function. They are often used interchangeably, but there is a
subtle difference:
Example1: Let’s try creating a simple example function. This function will take in a numeric value as input, and return the squared value.
square_it <- function(x) {
square <- x * x
return(square)
}
square_it(5) # Calling the function with an argument of 5
## [1] 25
square_it(10) # Calling the function with an argument of 10
## [1] 100
Example2:
add_num <- function(a,b)
{
sum_result <- a+b
return(sum_result)
}
sum = add_num(35,34)
print(sum)
## [1] 69
Exercise
Write a function called
multiply_it, which takes two inputs: a numeric value
x, and a numeric value y. The function will
return the product of these two numeric values, which is x *
y. For example, multiply_it(x = 4, y = 6) will
return output 24.
In R, scope refers to the visibility and accessibility of variables and functions within different parts of a program. Understanding scope is crucial for writing effective and bug-free code. There are two main types of scope in R: global scope and local scope. - Global Scope: Variables and functions defined in the global environment are accessible from anywhere in the R session. They can be accessed and modified by any function or code block. - Local Scope: Variables and functions defined within a function have local scope. They are only accessible within that function and cannot be accessed from outside the function.
When a function is called, R creates a new environment for that function. This environment contains the local variables and functions defined within the function. If a variable or function is not found in the local environment, R will look for it in the parent environment (the environment where the function was defined) and continue searching up the chain until it reaches the global environment.
Debugging is a process of cleaning a program code from bugs to run it successfully. While writing codes, some mistakes or problems automatically appears after the compilation of code are harder to diagnose. So, fixing it takes a lot of time and after multiple levels of calls. Debugging in R is through warnings, messages, and errors. Debugging in R means debugging functions.
R provides a number of tools to help you with debugging your code.
The primary tools for debugging functions in R are
traceback(): prints out the function call stack after
an error occurs; does nothing if there’s no errordebug(): flags a function for “debug” mode which allows
you to step through execution of a function one line at a timebrowser(): suspends the execution of a function
wherever it is called and puts the function in debug modetrace(): allows you to insert debugging code into a
function a specific placesrecover(): allows you to modify the error behavior so
that you can browse the function call stackThese functions are interactive tools specifically designed to allow
you to pick through a function. There’s also the more blunt technique of
inserting print() or cat() statements in the
function.
Editor Breakpoints
Editor Breakpoints can be added in RStudio by clicking to the left of
the line in RStudio or pressing Shift+F9 with the cursor on your line. A
breakpoint is same as browser() but it doesn’t involve
changing codes. Breakpoints are denoted by a red circle on the left
side, indicating that debug mode will be entered at this line after the
source is run.
traceback() Function
The traceback() function is used to give all the
information on how your function arrived at an error. It will display
all the functions called before the error occurred called the
"call stack" in many languages, R favors calling
traceback.
Example:
# # Function 1
# function_1 <- function(a){
# a + 5
# }
#
# # Function 2
# function_2 <- function(b) {
# function_1(b)
# }
#
# # Calling function
# function_2("s")
# # Call traceback()
# traceback()
For example, you may have a function a() which
subsequently calls function b() which calls
c() and then d(). If an error occurs, it may
not be immediately clear in which function the error occurred. The
traceback() function shows you how many levels deep you were when the
error occurred.
# mean(x)
# traceback()
The traceback() function must be called immediately
after an error occurs. Once another function is called, you lose the
traceback.
Here is a slightly more complicated example using the
lm() function for linear modeling.
# lm(y ~ x)
# traceback()
You can see now that the error did not get thrown until the 7th level
of the function call stack, in which case the eval()
function tried to evaluate the formula y ~ x and realized
the object y did not exist. Looking at the traceback is useful for
figuring out roughly where an error occurred but it’s not useful for
more detailed debugging. For that you might turn to the
debug() function.
traceback() function displays the error during
evaluations. The call stack is read from the function that was run(at
the bottom) to the function that was running(at the top). Also we can
use traceback() as an error handler which will display
error immediately without calling of traceback.
# # Function 1
# function_1 <- function(a){
# a + 5
# }
#
# # Function 2
# function_2 <- function(b){
# function_1(b)
# }
#
# # Calling error handler
# options(error = traceback)
# function_2("s")
browser() Function
browser() function is inserted into functions to open R
interactive debugger. It will stop the execution of
function() and you can examine the function with the
environment of itself. In debug mode, we can modify objects, look at the
objects in the current environment, and also continue executing.
Example:
# function(a){
# browser()
# a + 5
# }
browser[1]> command in consoles confirms that you are
in debug mode. Some commands to follow:
ls(): Objects available in current environment.print(): To evaluate objects.Also, debug() statement automatically inserts
browser() statement at the beginning of the function.
recover() Function
recover() statement is used as an error handler and not
like the direct statement. In recover(), R prints the whole
call stack and lets you select which function browser you would like to
enter. Then debugging session starts at the selected location.
Example:
# # Calling recover
# options(error = recover)
#
# # Function 1
# function_1 <- function(a){
# a + 5
# }
#
# # Function 2
# function_2 <- function(b) {
# function_1(b)
# }
#
# # Calling function
# function_2("s")
In this Secton, we introduced the summary command. This produced a number of summary statistics that, at the time, made little sense. Now that we have discussed summary statistics in the lecture, it becomes possible to work with them. The current chapter describes a number of methods in R for obtaining summary statistics. In addition, we discuss how those statistics should be interpreted.
The functions included in this section perform Descriptive Statistics by quantitatively describing or summarizing different characteristics from a sample. Graphical.
library(survival)
##
## Attaching package: 'survival'
## The following object is masked from 'package:DAAG':
##
## lung
attach(cancer)
## The following objects are masked _by_ .GlobalEnv:
##
## age, sex
head(cancer)
| inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
| 3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
| 3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
| 5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
| 1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
| 12 | 1022 | 1 | 74 | 1 | 1 | 50 | 80 | 513 | 0 |
summary(cancer)
## inst time status age
## Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00
## 1st Qu.: 3.00 1st Qu.: 166.8 1st Qu.:1.000 1st Qu.:56.00
## Median :11.00 Median : 255.5 Median :2.000 Median :63.00
## Mean :11.09 Mean : 305.2 Mean :1.724 Mean :62.45
## 3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:2.000 3rd Qu.:69.00
## Max. :33.00 Max. :1022.0 Max. :2.000 Max. :82.00
## NA's :1
## sex ph.ecog ph.karno pat.karno
## Min. :1.000 Min. :0.0000 Min. : 50.00 Min. : 30.00
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 75.00 1st Qu.: 70.00
## Median :1.000 Median :1.0000 Median : 80.00 Median : 80.00
## Mean :1.395 Mean :0.9515 Mean : 81.94 Mean : 79.96
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00
## Max. :2.000 Max. :3.0000 Max. :100.00 Max. :100.00
## NA's :1 NA's :1 NA's :3
## meal.cal wt.loss
## Min. : 96.0 Min. :-24.000
## 1st Qu.: 635.0 1st Qu.: 0.000
## Median : 975.0 Median : 7.000
## Mean : 928.8 Mean : 9.832
## 3rd Qu.:1150.0 3rd Qu.: 15.750
## Max. :2600.0 Max. : 68.000
## NA's :47 NA's :14
str(cancer)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
In the cancer dataset, the sex variable is numeric but represents a categorical variable encoding the biological sex of the patient.
# Convert 'sex' variable to factor with levels 'male' and 'female'
cancer$sex <- factor(cancer$sex, levels = c(1, 2), labels = c("male", "female"))
head(cancer)
| inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 306 | 2 | 74 | male | 1 | 90 | 100 | 1175 | NA |
| 3 | 455 | 2 | 68 | male | 0 | 90 | 90 | 1225 | 15 |
| 3 | 1010 | 1 | 56 | male | 0 | 90 | 90 | NA | 15 |
| 5 | 210 | 2 | 57 | male | 1 | 90 | 60 | 1150 | 11 |
| 1 | 883 | 2 | 60 | male | 0 | 100 | 90 | NA | 0 |
| 12 | 1022 | 1 | 74 | male | 1 | 50 | 80 | 513 | 0 |
Remove NA values from the dataset
cancer1<- na.omit(cancer)
summary(cancer1)
## inst time status age sex
## Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00 male :103
## 1st Qu.: 3.00 1st Qu.: 174.5 1st Qu.:1.000 1st Qu.:57.00 female: 64
## Median :11.00 Median : 268.0 Median :2.000 Median :64.00
## Mean :10.71 Mean : 309.9 Mean :1.719 Mean :62.57
## 3rd Qu.:15.00 3rd Qu.: 419.5 3rd Qu.:2.000 3rd Qu.:70.00
## Max. :32.00 Max. :1022.0 Max. :2.000 Max. :82.00
## ph.ecog ph.karno pat.karno meal.cal
## Min. :0.0000 Min. : 50.00 Min. : 30.00 Min. : 96.0
## 1st Qu.:0.0000 1st Qu.: 70.00 1st Qu.: 70.00 1st Qu.: 619.0
## Median :1.0000 Median : 80.00 Median : 80.00 Median : 975.0
## Mean :0.9581 Mean : 82.04 Mean : 79.58 Mean : 929.1
## 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:1162.5
## Max. :3.0000 Max. :100.00 Max. :100.00 Max. :2600.0
## wt.loss
## Min. :-24.000
## 1st Qu.: 0.000
## Median : 7.000
## Mean : 9.719
## 3rd Qu.: 15.000
## Max. : 68.000
Dimension of the dataset
dim(cancer1)
## [1] 167 10
After removing NA Values from the data.
print(head(cancer1))
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 2 3 455 2 68 male 0 90 90 1225 15
## 4 5 210 2 57 male 1 90 60 1150 11
## 6 12 1022 1 74 male 1 50 80 513 0
## 7 7 310 2 68 female 2 70 60 384 10
## 8 11 361 2 71 female 2 60 80 538 1
## 9 1 218 2 53 male 1 70 80 825 16
In this Section, we discuss how one can tabulate and visualize single
variables.
Example
In a 2-week study of the productivity of workers, the following data
were obtained on the total number of acceptable pieces that 100 workers
produced:
A. Construct a grouped frequency distribution.
B. Construct a histogram, frequency polygon, and Ogives.
workers<-c(65, 36, 49, 88, 79, 56, 28, 43, 67, 36, 43, 78, 37, 40, 68, 72, 55, 62, 22, 82, 88, 50, 60, 56, 57, 46, 39, 57, 73, 65, 59, 48, 76, 74, 70, 51, 40, 75, 56, 45, 35, 62, 52, 63, 32, 80, 64, 53, 74, 34, 76, 60, 48, 55, 51, 54, 45, 44, 35, 51, 21, 35, 61, 45, 33, 61, 77, 60, 85, 68, 45, 53, 34, 67, 42, 69, 52, 68, 52, 47, 63, 65, 55, 61, 73, 50, 53, 59, 41, 54, 41, 74, 82, 58, 26, 35, 47, 50, 38, 70)
range(workers)
## [1] 21 88
factor.workers<-factor(cut(workers,breaks=nclass.Sturges(workers)))
table(factor.workers)
## factor.workers
## (20.9,29.4] (29.4,37.8] (37.8,46.1] (46.1,54.5] (54.5,62.9] (62.9,71.2]
## 4 11 15 19 19 14
## (71.2,79.6] (79.6,88.1]
## 12 6
workers.dat <- as.data.frame(table(factor.workers))
workers.dat <- transform(workers.dat, cumFreq = cumsum(Freq), relative = prop.table(Freq))
workers.dat
| factor.workers | Freq | cumFreq | relative |
|---|---|---|---|
| (20.9,29.4] | 4 | 4 | 0.04 |
| (29.4,37.8] | 11 | 15 | 0.11 |
| (37.8,46.1] | 15 | 30 | 0.15 |
| (46.1,54.5] | 19 | 49 | 0.19 |
| (54.5,62.9] | 19 | 68 | 0.19 |
| (62.9,71.2] | 14 | 82 | 0.14 |
| (71.2,79.6] | 12 | 94 | 0.12 |
| (79.6,88.1] | 6 | 100 | 0.06 |
hist(workers)
par(bg = "gray90")
plot(workers, type = "o", col = "blue", pch = 20)
Example 2:
library(readr)
demo2 <- read_csv("C:/Users/abdik/Desktop/DMA/HU Data Management And Analysis/demo2.csv")
## Rows: 24 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): id, group, pulse, time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(demo2)
| id | group | pulse | time |
|---|---|---|---|
| 1 | 1 | 14 | 1 |
| 1 | 1 | 19 | 2 |
| 1 | 1 | 29 | 3 |
| 2 | 1 | 15 | 1 |
| 2 | 1 | 25 | 2 |
| 2 | 1 | 26 | 3 |
names(demo2)
## [1] "id" "group" "pulse" "time"
Class<-factor(cut(demo2$pulse,breaks=nclass.Sturges(demo2$pulse)))
table(Class)
## Class
## (9.97,14.2] (14.2,18.3] (18.3,22.5] (22.5,26.7] (26.7,30.8] (30.8,35]
## 3 5 4 6 1 5
The nclass.Sturges option in the cut function takes care of the creation of the bins according to Sturges’ formula, which makes life a lot easier for us.
pluse.demo1 <- as.data.frame(table(Class))
pluse.demo1 <- transform(pluse.demo1, cumFreq = cumsum(Freq),relative = prop.table(Freq))
pluse.demo1
| Class | Freq | cumFreq | relative |
|---|---|---|---|
| (9.97,14.2] | 3 | 3 | 0.1250000 |
| (14.2,18.3] | 5 | 8 | 0.2083333 |
| (18.3,22.5] | 4 | 12 | 0.1666667 |
| (22.5,26.7] | 6 | 18 | 0.2500000 |
| (26.7,30.8] | 1 | 19 | 0.0416667 |
| (30.8,35] | 5 | 24 | 0.2083333 |
After completing this unit, students will be able to:
Graphical facilities are an important and extremely versatile component of the R environment. It is possible to use the facilities to display a wide variety of statistical graphs and also to build entirely new types of graph.
Once the device driver is running, R plotting commands can be used to
produce a variety of graphical displays and to create entirely new kinds
of display.
Plotting commands are divided into three basic groups:
- High-level plotting functions create a new plot on
the graphics device, possibly with axes, labels, titles and so on.
- Low-level plotting functions add more information to
an existing plot, such as extra points, lines and labels.
- Interactive graphics functions allow you
interactively add information to, or extract information from, an
existing plot, using a pointing device such as a mouse.
In addition, R maintains a list of graphical parameters which can be manipulated to customize your plots.
high-level plotting functions There are a number of arguments which may be passed to high-level graphics functions, as follows:
add=TRUE Forces the function to act as a low-level graphics function, superimposing the plot on the current plot (some functions only).
axes=FALSE Suppresses generation of axes—useful for adding your own custom axes with the axis() function. The default, axes=TRUE, means include axes.
log=“x”
log=“y”
log=“xy”
Causes the x, y or both axes to be logarithmic. This will work for many, but not all, types of plot.
type=
The type= argument controls the type of plot produced, as
follows:
type=“p”
Plot individual points (the default)
type=“l”
Plot lines
type=“b”
Plot points connected by lines (both)
type=“o”
Plot points overlaid by lines
type=“h”
Plot vertical lines from points to the zero axis (high-density)
type=“s”
type=“S”
Step-function plots. In the first form, the top of the vertical defines the point; in the second, the bottom.
xlab=string ylab=string
Axis labels for the x and y axes. Use these arguments to change the default labels, usually the names of the objects used in the call to the high-level plotting function.
main=string Figure title, placed at the top of the plot in a large font.
sub=string Sub-title, placed just below the x-axis in a smaller font.
In R programming language we can customize the appearance of plots. There are a variety of graphical parameters that can be used to do so. Some of the most commonly used graphical parameters include:
"#RRGGBB", where RR, GG, and BB are the
red, green, and blue components of the color, respectively.Similarly, there are many more graphical parameters that could be used to customize the appearance of the plot. We can set this parameter by giving them directly to the plotting function. This can be understood by the following example. Let’s plot a graph with random points with no parameters.
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
This is the output we obtain, let’s try to customize this plot.
x <- rnorm(100)
y <- rnorm(100)
plot(x, y, xlim = c(-2, 2),
ylim = c(-2, 2),
xlab = "X", ylab = "Y",
main = "My Plot", col = "blue",
pch = 16, lwd= 2)
We can see that this plot is more informative than the previous one. We have added a title and labeled the y and x-axis, also given the limits of the axes (-2 to 2), the color is also changed from black to blue. Let’s try some examples using bar graphs.
# plotting a bar-graph with title and color
barplot(c(1,3), main="Main title",
xlab="X axis title",
ylab="Y axis title",
col.main="red", col.lab="blue")
Now let’s change the size font size of the axis titles and the main title of the graph.
# Increasing the size of titles
barplot(c(1,3), main="Main title",
xlab="X axis title",
ylab="Y axis title",
cex.main=2, cex.lab=1.7)
Adding an Axis to a Plot We add an axis to our plot
using the axis() function in R programming. It has an
attribute named ‘side’, which takes an integer as a value. These values
are,
# This will create a plot of x and y with no axis,
x <- c(1, 2, 3, 4, 5)
y <- x^2
plot(x, y, xaxt = "n", yaxt = "n")
# then add x and y axis to the plot.
axis(1)
axis(2)
Changing Axis Scale
We can set x and y-axis limits by specifying the minimum and maximum values of each axis. The below code will create a plot of x and y with x-axis limits between 0 and 15 and y-axis limits between 1 and 100, and also with a log scale on the x and y-axis
x <- c(1, 2, 3, 4, 5)
y <- x^2
# setting the limits for axis
plot(x, y, xlim=c(1,15),
ylim=c(1,100), log="xy")
Customizing Tick Mark Labels We can the appearance of the tick marks in the plot. We can change the color, font-size of tick labels using the col.axis function.
x <- 1:10
y <- x*x*x
# adding red color to the tick labels
plot(x,y, col.axis="red")
We can also hide the tick labels using xaxt and yaxt parameters. They both take a single character value, “s” for showing the axis and “n” for hiding the axis.
x <- 1:10;
y <- x*x*x
# plots with no tick labels
plot(x, y, xaxt="n", yaxt="n")
Changing plotting symbols We have many plotting
symbols in R. These are generated using the 'pch' parameter
which takes up an integer value to set the type of symbol. For
example,
x <- c(1, 2, 3, 4, 5)
y <- x^2
# makes the pointer a cross
plot(x, y, pch = 4, col = "blue")
Changing the Line Types In R, we can change the line types of a plot using the lty parameter. It can be solid, blank, dotted, dashed, etc. Ity takes up an integer value which is 0 for blank, 1 for solid, 3 for dotted, and so on.
x <- c(1, 2, 3, 4, 5)
y <- x^2
# makes a line graph with dotted lines
plot(x, y, type = "l", lty = 2, col = "red")
R graphics capabilities are very powerful and flexible. R is a powerful tool for building graphics from the simplest to the most complex ones. Now that we have some data to work with, and we have learned about the data at the most basic level, our next tasks is to visualize the data. Often, a proper visualization can illuminate features of the data that can inform further analysis. We will look at four methods of visualizing data that we will use throughout the course:
Plot Types in R
The type argument in the plot() function determines the
type of plot. Below is a list of the possible values for type and their
explanations:
When visualizing a single numerical variable, a histogram will be our
go-to tool, which can be created in R using the hist()
function. Strip plots look best when the data set is not too large. The
most popular graphical summary for numeric (quantitative) data is the
histogram.
A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. Then vertical bars are drawn over the bins with height proportional to the number of observations that fell into the bin. The scale on the y axis can be frequency, percentage, or density (relative frequency).
library(DAAG)
attach(ais)
## The following object is masked _by_ .GlobalEnv:
##
## sex
## The following object is masked from cancer:
##
## sex
head(ais)
| rcc | wcc | lbm | Body_Mass_index | ht | wt | sex | sport |
|---|---|---|---|---|---|---|---|
| 3.96 | 7.5 | 63.32 | 20.56 | 195.9 | 78.9 | f | B_Ball |
| 4.41 | 8.3 | 58.55 | 20.67 | 189.7 | 74.4 | f | B_Ball |
| 4.14 | 5.0 | 55.36 | 21.86 | 177.8 | 69.1 | f | B_Ball |
| 4.11 | 5.3 | 57.18 | 21.88 | 185.0 | 74.9 | f | B_Ball |
| 4.45 | 6.8 | 53.20 | 18.96 | 184.6 | 64.6 | f | B_Ball |
| 4.10 | 4.4 | 53.77 | 21.04 | 174.0 | 63.7 | f | B_Ball |
summary(ais)
## rcc wcc lbm Body_Mass_index
## Min. :3.800 Min. : 3.300 Min. : 34.36 Min. :16.75
## 1st Qu.:4.372 1st Qu.: 5.900 1st Qu.: 54.67 1st Qu.:21.08
## Median :4.755 Median : 6.850 Median : 63.03 Median :22.72
## Mean :4.719 Mean : 7.109 Mean : 64.87 Mean :22.96
## 3rd Qu.:5.030 3rd Qu.: 8.275 3rd Qu.: 74.75 3rd Qu.:24.46
## Max. :6.720 Max. :14.300 Max. :106.00 Max. :34.42
##
## ht wt sex sport
## Min. :148.9 Min. : 37.80 f:100 Row :37
## 1st Qu.:174.0 1st Qu.: 66.53 m:102 T_400m :29
## Median :179.7 Median : 74.40 B_Ball :25
## Mean :180.1 Mean : 75.01 Netball:23
## 3rd Qu.:186.2 3rd Qu.: 84.12 Swim :22
## Max. :209.4 Max. :123.20 Field :19
## (Other):47
hist(rcc, main = "Histogram of Red blood cell count",data = ais)
## Warning in plot.window(xlim, ylim, log, ...): "data" is not a graphical
## parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "data" is not a graphical parameter
Note: The graph obtained depends on the bins chosen (see the example below). The best choice for a given data set is the one which illuminates the data set’s underlying structure (if any). In R there are algorithms to automatically choose bins that are likely to display well, often the default bins do a good job. This is not always the case, however, and it is better to investigate many bin choices to test the stability of the display.
par(mfrow=c(1,2))
hist(rcc, breaks = 8, main = "Histogram of Red blood cell count", data = ais)
## Warning in plot.window(xlim, ylim, log, ...): "data" is not a graphical
## parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "data" is not a graphical parameter
hist(rcc, breaks = 20, main = " ", data = ais)
## Warning in plot.window(xlim, ylim, log, ...): "data" is not a graphical
## parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "data"
## is not a graphical parameter
## Warning in axis(1, ...): "data" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "data" is not a graphical parameter
list=c(5,20,20,30,30,30,30,40,40,40,40,40,50,50,50,50,50,50,60,60,60,60,60,70,70,70,70,80,80,90)
hist(list, breaks=10, density=c(10,20,30,40,50,40,30,20,10), angle=40,
col=rainbow(9), main="")
title(main=" Histogram", font.main=3)
#par(new=TRUE)
#plot(density(list), yaxt="n", xaxt="n", ann=FALSE, col="blue")
#curve(dnorm(x, mean(list), sd(list)), col="red", lwd=2, add=T)
box()
Density plots are generalized histograms.
## Density plot
plot(density(ht), data = ais) # density for height
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter
Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.
barplot(table(ais$sport), beside = TRUE, main = "Sport type")
To visualize the relationship between a numerical and categorical variable, we will use a boxplot. In the mpg dataset, the drv variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.
A boxplot is constructed by drawing a box alongside the data axis with sides located at the upper and lower hinges. A line is drawn parallel to the sides to denote the sample median. Lastly, whiskers are extended from the sides of the box to the maximum and minimum data values (more precisely, to the most extreme values that are not potential outliers).
Boxplots are good for quick visual summaries of data sets, and the relative positions of the values in the five numbers summary (minimum, first quartile, median, third quartile, maximum), indicating the underlying shape of the data distribution.
-They can be a handy alternative to a stripchart when the sample size is large.
Boxplots are also good because one can visually assess multiple features of the data set simultaneously:
Center: can be estimated by the sample median. Spread: can be judged by the width of the box, (IQR = third quartile – first quartile).
Shape: is indicated by the relative lengths of the whiskers, and the position of the median inside the box. Boxes with unbalanced whiskers indicate skewness in the direction of the long whisker. Skewed distributions often have the median tending in the opposite direction of skewness. Kurtosis can be assessed using the box and whiskers. A wide box with short whiskers will tend to be platykurtic, while a skinny box with wide whiskers indicates leptokurtic distributions. Extreme observations are identified with open circles.
boxplot(ais)
boxplotdata=data.frame(Altitude = c(2150, 2170, 2180, 2700, 2640, 2660,
2460, 2470, 2200, 2220, 2300, 2330, 2350, 2350, 2350, 1900, 1950, 2000,
1900, 1900, 1900, 1900, 1940, 1940, 1930, 2100, 2100, 2300, 2290, 2320,
2250, 2250, 1590, 1600, 1620, 1560, 2320, 2330, 2350, 2300, 2300, 2530,
2540, 2600, 2600, 2460, 2120, 2110, 2150, 2130, 2210, 2210, 2050, 1900,
1750, 1680, 1710, 1950, 2000, 1990, 2590, 2620,1800, 1900, 1350),
groupcode = c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 3, 2, 2, 4, 4, 4, 4, 4, 4, 4, 5, 5,
5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 3, 5, 5, 5, 4, 4,
5, 5, 5, 5, 5, 5, 3, 3,5, 4, 5))
attach(boxplotdata)
boxplot(Altitude~groupcode, main="Altitude", las=1, col=rainbow(5),
varwidth=TRUE, cex.lab=0.8, cex.axis=0.8, ylab="Altitude", xlab="Group
Number")
treelength=data.frame(Height=c(21,23,34,23,24,43,23,23,43,42,32,43,42,54,65,23,42,45,54,32,43,54,23,32,43,54,23,54,65,45,54,34,24,34,54,54),
Slope =gl(2,18,36, labels=c("Plain","Steep")),
Altitude=gl(3,6,36, labels=c(1500,1800,2400)))
library(ggplot2)
boxplot(Height~Slope+Altitude, data=treelength, boxwex=0.50,
col=c("yellow", "orange") ,xlab="Altitude in m", ylab="Tree height",
ylim=c(0,70), cex.axis = 0.8, cex.lab=0.8)
title(main="Height of Juniperus Procera along altitudinal gradient",
font.main=2, cex.main= 0.8)
legend("topleft", c("Southern Ethiopia", "Western Ethiopia"), fill =
c("yellow", "orange"), horiz=TRUE, cex=0.8)
A potential outlier is any observation that falls beyond 1.5 times the width of the box on either side, that is, any observation less than (first quartile – 1.5*IQR) or greater than (third quartile + 1.5*IQR). A suspected outlier is any observation that falls beyond 3 times the width of the box on either side. In R, both potential and suspected outliers (if present) are denoted by open circles; there is no distinction between the two.
boxplot(lbm~sex, data = ais, main = "Boxplot of Lean
+ body mass (kg)", xlab = "Sex")
boxplot(lbm~sport, data = ais, xlab = "Sport types")
Lastly, to visualize the relationship between two numeric variables
we will use a scatterplot. This can be done with the plot()
function and the ~ syntax we just used with a boxplot. (The function
plot() can also be used more generally; see the
documentation for details.)
Density plots
library(lattice)
densityplot(~lbm | sport, data = ais, groups = sex, scales = "free", plot.points = FALSE, auto.key = TRUE, main = "Density plots of Lean body mass by Support type and Sex", xlab = "Lean body mass")
Scatterplots
xyplot(ht ~ wt | sport, data = ais, main = "Scatterplots of Height versus Weight by Spport type", xlab = "Weight")
Scatterplot matrix
splom(ais[,2:5], main = "Scatterplot matrix of the first four variables in ais data")
In statistics, random sampling is a technique used to select a subset of individuals or items from a larger population in such a way that each individual or item has an equal chance of being selected. This method is essential for ensuring that the sample is representative of the population, allowing for accurate inferences and generalizations to be made about the entire population based on the sample data.
# Set my random seed
set.seed(100)
# Draw the random sample
sample( 1:50 , 10, replace = FALSE)
## [1] 10 38 48 25 14 44 23 22 6 4
There are two primary types of random sampling:
Random sampling is used in many fields, such as in predictive
modeling, hypothesis testing and when building or validating machine
learning models.
Random Sampling in R We will be implementing random
sampling in R programming language using the dplyr Package.
We first need to install and load the package into our R
environment.
#install.packages("dplyr")
library(dplyr)
Random Sampling using the sample_n() Function The
sample_n() function is used when we need to select a fixed
number of random rows from a data frame. This function is used in
situations where we need a specific number of observations for training
or testing machine learning models or when performing statistical tests
on a subset of the data.
Syntax:
sample_n(tbl, size, replace = FALSE, weight =
NULL, .env = NULL, .funs = NULL)
library(dplyr)
random_sample <- sample_n(iris, 10)
print(random_sample)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.5 2.8 4.6 1.5 versicolor
## 2 5.6 2.5 3.9 1.1 versicolor
## 3 6.2 2.9 4.3 1.3 versicolor
## 4 6.1 2.6 5.6 1.4 virginica
## 5 4.6 3.4 1.4 0.3 setosa
## 6 4.4 3.2 1.3 0.2 setosa
## 7 6.9 3.1 5.4 2.1 virginica
## 8 7.0 3.2 4.7 1.4 versicolor
## 9 4.8 3.4 1.9 0.2 setosa
## 10 4.9 3.0 1.4 0.2 setosa
Random Sampling using the sample_frac()
Function
The sample_frac() function allows to select a random
fraction of rows from a data frame. This is useful when we need a
percentage-based random sample.
Syntax:
<span style="color:blue"> sample_frac(tbl, size, replace = FALSE, weight = NULL, .env = NULL, .funs = NULL) </span>
library(dplyr)
random_fraction <- sample_frac(iris, 0.065)
print(random_fraction)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.8 2.7 4.1 1.0 versicolor
## 2 5.9 3.0 5.1 1.8 virginica
## 3 4.6 3.2 1.4 0.2 setosa
## 4 5.4 3.4 1.5 0.4 setosa
## 5 5.4 3.0 4.5 1.5 versicolor
## 6 5.5 2.6 4.4 1.2 versicolor
## 7 6.9 3.2 5.7 2.3 virginica
## 8 5.7 4.4 1.5 0.4 setosa
## 9 6.4 3.2 5.3 2.3 virginica
## 10 5.8 2.6 4.0 1.2 versicolor
R can generate a random sequence from a number of probability density functions. The general format for generating such sequences is: rdensity(n, p1, p2, …) where density is the probability density function, n is the number of values to generate and p1, p2, … are the values needed to generate from the density function.
For example:
rnorm(40, 6, 2.3) #generate 40 values Normal distribution with mean 6 and standard deviation 2.3
## [1] 4.1269280 4.9915637 4.3434904 6.5311724 3.3372222 6.5682748
## [7] 5.7904388 10.0419639 5.6827619 5.7442550 4.4129671 5.4898733
## [13] 6.4206877 6.9598436 8.4504254 8.2314646 5.7662528 9.2273680
## [19] 1.9134160 7.4325950 4.7987483 9.0411312 5.1640872 9.0338512
## [25] 6.1006919 1.6790915 4.9717570 2.0012247 6.4113892 10.3641711
## [31] 0.7745714 8.2550675 2.7827011 10.1972066 9.1769871 4.0706407
## [37] 5.3974097 5.8416587 5.1285678 11.9385055
rpois(20, lambda=5) # 20 value from Poisson distribution with lambda(۸)=5
## [1] 5 2 3 7 6 5 5 1 5 2 5 6 6 2 3 8 2 6 9 2
rbinom(20, size=18, prob=0.3) # 18 values from Binom. distribution with p= 0.3
## [1] 2 4 5 8 3 4 6 7 7 7 2 6 7 6 5 3 3 3 2 6
runif(20, min=3, max=6) # Uniform distribution (3, 6)
## [1] 5.283055 5.572424 4.311797 4.251409 4.756948 5.474903 5.379418 3.978458
## [9] 5.873796 4.959059 4.376027 4.818066 3.866166 5.153493 5.770192 5.024818
## [17] 3.557820 4.044201 3.370559 3.324349
rexp(20, rate=1.5) # 20 values from Exponential distribution e=1.5
## [1] 0.58997332 0.45093320 0.09360471 0.21384948 0.59040967 1.44183419
## [7] 0.38457602 0.78102098 0.38046218 0.64738387 0.28250675 1.49211983
## [13] 0.91966535 0.71400639 0.33564758 1.05919080 0.17594923 0.21128162
## [19] 1.95176195 0.09135269
Table: The general generating for value from distribution, the R name and argument are:
| Distribution | R Name | Arguments |
|---|---|---|
| Normal | norm | mean, sd |
| Student’s t | t | df, ncp |
| F | f | df1, df2, ncp |
| Exponential | exp | rate |
| Log-normal | lnorm | meanlog, sdlog |
| Beta | beta | shape1, shape2, ncp |
| Binomial | binom | size, prob |
| Cauchy | cauchy | location, scale |
| Chi-squared | chisq | df, ncp |
| Gamma | gamma | shape, scale |
| Geometric | geom | prob |
| Hypergeometric | hyper | m, n, k |
| Logistic | logis | location, scale |
| Negative binomial | nbinom | size, prob |
| Poisson | pois | lambda |
| Signed rank | signrank | n |
| Uniform | unif | min, max |
| Weibull | weibull | shape, scale |
| Wilcoxon | wilcox | m, n |
The function sample() generates random permutations of
data or a random sample from a given vector. The first argument is a
vector or a positive integer and the second the size of the
sample.
For example:
sample(12) # Sampling without replacement.
## [1] 4 10 11 12 8 7 9 2 6 1 3 5
sample(12, 5) # Sampling 5 values without replacement.
## [1] 2 6 9 5 4
sample(12, replace=TRUE) # Sampling with replacement.
## [1] 11 6 7 4 7 5 12 9 5 12 11 3
sample(12, 8, replace=TRUE) # Sampling 8 values with replacement.
## [1] 11 12 4 10 2 12 5 7
sample(c("Heads", "Tails"), 6, replace=TRUE) # Simulating 6 coin tosses.
## [1] "Heads" "Tails" "Heads" "Tails" "Heads" "Tails"
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).
A probability distribution describes how the probabilities of outcomes are distributed for a random variable. It provides a mathematical framework to represent and analyze uncertainty in experiments and data. Probability distributions are categorized into Discrete and Continuous types based on the nature of the random variable.
A variable whose value is determined by the outcome of a random experiment.
- Discrete Random Variable: Takes on countable values (e.g., number of heads in a coin toss).
- Continuous Random Variable: Takes on any value within a range (e.g., height, weight).Table 4: Common Probability Distribution Functions in R
| Distribution | Probability Density (or Mass) Function | Cumulative Distribution Function (CDF) | Quantile Function |
|---|---|---|---|
| Normal | dnorm(Z, mean, sd) |
pnorm(Z, mean, sd) |
qnorm(Q, mean, sd) |
| Poisson | dpois(N, lambda) |
ppois(N, lambda) |
qpois(Q, lambda) |
| Binomial | dbinom(N, size, prob) |
pbinom(N, size, prob) |
qbinom(Q, size, prob) |
| Exponential | dexp(N, rate) |
pexp(N, rate) |
qexp(Q, rate) |
| Chi-Squared | dchisq(X, df) |
pchisq(X, df) |
qchisq(Q, df) |
\[P(X=x) = p^x (1-p)^{1-x}, \quad x \in \{0,1\}\]
Let X be a discrete random variable. \(X∼Bern(p)\) means X takes on a Bernoulli distribution where the probability of success is p.
Assumptions: The binary random variable has a fixed probability of success and failure, and the outcomes are independent.
Application: The Bernoulli distribution is commonly used to model the outcome of a single trial of an experiment where the outcome can only be success or failure.
Real-life example:
- A coin toss, where the outcome can only be heads (success) or tails (failure).
- Medical trials: Researchers use the Bernoulli distribution to analyze the efficacy of
a new drug or treatment by categorizing treatment outcomes as successful or unsuccessful.
- Passing or failing an exam
- Winning or losing a tennis match # Generate 10 Bernoulli random variables with probability of success 0.3
library(Rlab)
## Rlab 4.0 attached.
##
## Attaching package: 'Rlab'
## The following object is masked _by_ '.GlobalEnv':
##
## cancer
## The following object is masked from 'package:survival':
##
## cancer
## The following object is masked from 'package:DAAG':
##
## ozone
## The following object is masked from 'package:dplyr':
##
## count
## The following object is masked from 'package:tibble':
##
## view
## The following objects are masked from 'package:stats':
##
## dexp, dgamma, dweibull, pexp, pgamma, pweibull, qexp, qgamma,
## qweibull, rexp, rgamma, rweibull
## The following object is masked from 'package:datasets':
##
## precip
rbern(n = 10, prob = 0.3)
## [1] 0 0 0 0 0 0 0 0 0 0
The binomial distribution formula can be written as:
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
where:
- $P(X = k) $ is the probability of getting exactly \(k\) successes in \(n\) trials,
- \(n\) is the total number of
trials,
- \(p\) is the probability of success
in a single trial,
- $ $ is the binomial coefficient, also known as “n choose k”.
Assumptions: The binary random variable has a fixed probability of success and failure, the outcomes are independent, and the trials are identical.
Application: The binomial distribution is commonly used to model the number of successes in a fixed number of independent trials where the outcome can only be success or failure.
Real-life example: The number of heads in 10
coin tosses. - Clinical Trials: A new drug is tested on 50 patients, and
each has a 60% chance of showing improvement. The number of patients
showing improvement follows a Binomial distribution.
Parameters: n=50, p=0.6.
Example 1:
# Generate 10 binomial random variables with 5 trials and probability of success 0.3
rbinom(n = 10, size = 5, prob = 0.3)
## [1] 2 3 0 1 0 2 1 2 1 0
Example 2:
Generate 20000 random sample and verify that Binomial distribution will
be approximated by normal distribution for large number of trial by
varying number of trials(n=10,30,50,500)and for probability of success
is 0.05) and students t distribution will be approximated by normal
distribution for large degrees of freedom by varying degree of freedom
from 10,30,50,500.
To perform the simulation, we will take m samples from binomial distribution for same n and p. we should take m to be some large number so that we get a good idea of the underlying population the samples come from. We will then compare our samples to the normal distribution
# R code to generate 20000 random sample from Binomial by varying number of trials
# df= 10
m=20000
par(mfrow=c(2,2))
sample1= rbinom(m,size=10,prob=0.05)
hist(sample1,main="Sample1 for n=10",prob=TRUE)
qqnorm(sample1)
qqline(sample1)
boxplot(sample1)
#df= 30
m=20000
par(mfrow=c(2,2))
sample2= rbinom(m,size=30,prob=0.05)
hist(sample2, prob=TRUE,main="Sample2 for n=30")
qqnorm(sample2)
qqline(sample2)
boxplot(sample2)
# df= 50
m=20000
par(mfrow=c(2,2))
sample3= rbinom(m,size=50,prob=0.05)
hist(sample3,prob=TRUE,main="Sample3 for n=50")
qqnorm(sample3)
qqline(sample3)
boxplot(sample3)
# df= 500
m=20000
par(mfrow=c(2,2))
sample4= rbinom(m,size=500,prob=0.05)
hist(sample4,main="Sample4 for n=500", prob=TRUE)
qqnorm(sample4)
qqline(sample4)
boxplot(sample4)
Example 3: Suppose that an examination consists of six true and false questions, and assume that a student has no knowledge of the subject matter. The probability that the student will guess the correct answer to the first question is 30%. Likewise, the probability of guessing each of the remaining questions correctly is also 30%.
# Parameters
n <- 6 # Number of questions
p <- 0.30 # Probability of guessing correctly
a<- 1 - pbinom(3, n, p)
a
## [1] 0.07047
b<- 1 - pbinom(1, n, p)
b
## [1] 0.579825
c<- pbinom(3, n, p)
c
## [1] 0.92953
d<- pbinom(4, n, p)
d
## [1] 0.989065
The Poisson distribution formula is:
\[P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}\]
where:
- \(P(X = k)\) is the probability of
observing \(k\) events in a fixed
interval,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\lambda\) is the average rate of
event occurrences in that interval,
- \(k\) is the number of events that we
are interested in,
- \(k!\) is the factorial of \(k\) (the product of all positive integers
up to \(k\)).
Assumptions: The events occur independently, the probability of an event occurring is constant over time or space, and the events occur at a fixed rate.
Application: The Poisson distribution is commonly used to model the number of occurrences of rare events such as accidents, defects, or defects in a manufacturing process.
Real-life example: The number of car accidents
that occur in a city in a given day.
- Number of births in a maternity hospital : A maternity
hospital can use the Poisson distribution to predict how many births
will occur during the nigh
# Generate 10 Poisson random variables with lambda = 2
rpois(n = 10, lambda = 2)
## [1] 1 2 1 2 3 2 2 5 2 0
Example 2: Suppose that 4% of all TVs made by A&B Company in 2000 are defective. If eight of these TVs are randomly selected from across the country and tested, what is the probability that exactly three of them are defective? Assume that each TV is made independently of the others.
# Parameters
n <- 8 # Number of TVs tested
p <- 0.04 # Probability a TV is defective
x <- 3 # Number of defective TVs
# Probability of exactly 3 defective TVs
prob_exactly_three <- dbinom(x, n, p)
prob_exactly_three
## [1] 0.002922296
The geometric distribution in R is one of the fundamental discrete probability distributions in statistics. It models the number of trials required to get the first success in a sequence of independent Bernoulli trials (i.e., trials with two possible outcomes: success and failure). In this article, we will explore the theory behind the geometric distribution, its probability mass function (PMF), cumulative distribution function (CDF), and its applications. We will also cover how to work with the geometric distribution in R.
The geometric distribution in R arises when we perform a sequence of
independent and identically distributed Bernoulli trials, where each
trial has:
\[P(X = k) = (1-p)^{k-1} p\]
where:
- \(P(X = k)\) is the probability of
getting the first success on the k-th trial,
- \(p\) is the probability of success
on each trial,
- \(k\) is the number of trials until
the first success (k = 1, 2, 3, …).
# Load necessary libraries
library(ggplot2)
# Parameters
p <- 0.3
k_values <- 1:10
# Calculate PMF
pmf_values <- dgeom(k_values - 1, prob = p)
# Create a data frame for plotting
pmf_data <- data.frame(k = k_values, pmf = pmf_values)
# Plot the PMF
ggplot(pmf_data, aes(x = k, y = pmf)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Geometric Distribution PMF (p = 0.3)",
x = "Number of Trials (k)",
y = "Probability") +
theme_minimal()
Continuous probability distributions are a fundamental concept in probability theory and statistics. They are used to model random variables that can take on any value within a range of values, as opposed to discrete probability distributions that model random variables that can only take on certain values. Continuous probability distributions are commonly used in various fields, such as finance, engineering, healthcare, and social sciences, to model measurements of time, length, weight, or other continuous variables.
Some of the most commonly used continuous probability distributions
include the Uniform, Normal,
Exponential, Log-Normal, Gamma,
Beta, Chi-Square,
Student’s t,F distributions. Each of
these distributions has a unique shape and set of parameters that
determine its properties, such as mean, variance, skewness, and
kurtosis.
Understanding continuous probability distributions is essential for statistical analysis and data modeling. They are used to estimate probabilities, calculate expected values, and make predictions based on observed data. In addition, the properties of continuous probability distributions, such as their moments and tail behavior, have important implications for statistical inference and hypothesis testing.
In this context, it is important to note that continuous probability distributions are all based on the fundamental concept of probability density functions (PDFs), which describe the probability distribution of a continuous random variable. PDFs are used to calculate the probability of a random variable taking on a specific value or falling within a certain range of values. The area under a PDF curve represents the probability of the random variable falling within that range of values.
Continuous probability distributions play a critical role in modern statistics and are essential for analyzing and interpreting data in a wide range of fields. Understanding their properties, assumptions, and applications is essential for anyone seeking to apply statistical methods to real-world problems.
The probability density function of a continuous uniform distribution can be expressed as follows:
\[f(x|a, b) = \frac{1}{b-a} \text{ for } a \leq x \leq b\]
where:
- \(f(x|a, b)\)is the probability
density function of the uniform distribution between \(a\) and \(b\),
- \(a\) is the lower limit of the
distribution,
- \(b\) is the upper limit of the
distribution.
This formula denotes that for a continuous uniform distribution over the interval \([a, b]\), the probability density is constant over that interval, equal to \(\frac{1}{b-a}\).
Assumptions: The continuous random variable has a fixed interval, and all values within that interval are equally likely.
Application: The uniform distribution is commonly used to model outcomes where all values within an interval are equally likely, such as the roll of a fair die or the arrival time of a customer at a store.
Real-life example: The arrival time of a customer at a store between 9:00 am and 10:00 am.
The probability density function of the normal (Gaussian) distribution can be represented as:
\[f(x|\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
where:
- \(f(x|\mu,\sigma^2)\) is the
probability density function of the normal distribution with mean \(\mu\) and variance \(\sigma^2\),
- \(\mu\) is the mean of the
distribution,
- \(\sigma^2\) is the variance of the
distribution,
- \(e\) is Euler’s number \(\approx 2.71828\),
- \(\pi\) is the mathematical constant
pi (\(\approx 3.14159\)).
This formula describes how the probability density function of a normal distribution varies as a function of \(x\), \(\mu\), and \(\sigma\).
Assumptions: The continuous random variable has a mean and standard deviation, and the values are normally distributed around the mean.
Application: The normal distribution is commonly used to model outcomes where the values are normally distributed, such as the height of a population or the scores on a standardized test.
Real-life example: The height of a population of people.
# Generate 10 normal random variables with mean 0 and standard deviation 1
rnorm(n = 10, mean = 0, sd = 1)
## [1] 0.02355804 0.57633584 1.06700611 1.44528308 -1.76027157 -1.02274299
## [7] -0.66032349 0.74305991 -0.04390308 0.45594501
Example 2:
Draw two random samples of size 100 from the following distributions
N(0, 0.52) and N(2, 0.52) and produce the histograms of these samples on
one page. (a) Use the function rnorm()to generate the samples in the
following way:
#x <- rnorm(sample size, mean ,standard deviation) #general call of rnorm
x1 <- rnorm(100,0,0.5)
x2 <- rnorm(100,2,0.5)
par() function in order to deffine the number
of figures in one page:par(mfrow=c(1,2)) #put two figures in one page, two figures in one column
hist(x1)
hist(x2)
Example 1:
Generate 50 standard normally distributed random sample and display in
histogram.
mysample <- rnorm(50) # generates random numbers
hist(mysample, prob = TRUE) # draws the histogram
Describe the shape of random sample
The probability density function of the exponential distribution can be expressed as:
\[f(x|\lambda) = \lambda e^{-\lambda x} \text{ for } x \geq 0\]
where:
- \(f(x|\lambda)\) is the probability
density function of the exponential distribution with rate parameter
\(\lambda\),
- \(\lambda\) is the rate parameter of
the distribution, often representing the average number of events in a
unit time,
- \(e\) is Euler’s number \(\approx 2.71828\).
This formula describes how the probability density function of the exponential distribution changes with \(x\) and the rate parameter \(\lambda\).
Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.
Application: The exponential distribution is commonly used to model the time between occurrences of rare events such as accidents, defects, or defects in a manufacturing process.
Real-life example: The time between car accidents on a particular road.
# Generate 10 exponential random variables with rate parameter 0.5
rexp(n = 10, rate = 0.5)
## [1] 1.1490388 2.5859117 0.3146110 1.9089339 4.8798685 0.5295571 0.6737470
## [8] 1.6965217 3.2391039 2.3716948
\[f(x|k,\theta) = \frac{1}{\Gamma(k) \theta^k} x^{k-1} e^{-\frac{x}{\theta}} \text{ for } x > 0\]
where:
- \(f(x|k,\theta)\) is the probability
density function of the gamma distribution with shape parameter \(k\) and scale parameter \(\theta\),
- \(\Gamma(k)\) is the gamma function
evaluated at \(k\),
- \(x\) is the random variable,
- \(k\) is the shape parameter that
determines the shape of the distribution,
- \(\theta\) is the scale parameter
that influences the spread of the distribution,
- \(e\) is Euler’s number \(\approx 2.71828\).
This formula describes how the probability density function of the gamma distribution varies with the random variable \(x\), shape parameter \(k\), and scale parameter \(\theta\).
Assumptions: The events occur independently, the probability of an event occurring is constant over time, and the time between events follows an exponential distribution.
Application: The gamma distribution is commonly used to model the time to complete a task that is composed of multiple independent sub-tasks, each with a known time to completion.
Real-life example: The time to complete a software development project that is composed of multiple independent sub-tasks.
# Generate 10 gamma random variables with shape parameter 2 and rate parameter 0.5
rgamma(n = 10, shape = 2, rate = 0.5)
## [1] 0.9319142 2.9074790 2.8970930 3.1962679 1.6054692 9.8556978 2.7600363
## [8] 0.9559170 4.2967674 0.4309526
The Chi-Square (X2) test is a nonparametric test that is used to test hypotheses about distributions of frequencies across categories of data. The sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
\[f(x|k) = \frac{1}{2^{k/2} \Gamma(k/2)}
x^{k/2 - 1} e^{-x/2}\] where:
- \(f(x|k)\) is the probability density
function of the chi-squared distribution with \(k\) degrees of freedom,
- \(\Gamma(\cdot)\) is the gamma
function,
- \(x\) is the random variable,
- \(k\) is the degrees of freedom
parameter.
This formula describes how the probability density function of the chi-squared distribution changes with the random variable \(x\) and the degrees of freedom parameter \(k\).
Assumptions: The continuous random variable is the sum of squared standard normal random variables.
Application: The chi-squared distribution is commonly used in hypothesis testing and confidence interval estimation when the population standard deviation is unknown and the sample size is large.
Real-life example: The sum of squared deviations from the mean in a sample of 100 measurements.
# Generate 10 chi-squared random variables with 5 degrees of freedom
rchisq(n = 10, df = 5)
## [1] 2.040255 5.424924 2.837399 1.196678 1.980183 8.372367 4.269050 9.790097
## [9] 2.328423 2.969191
Definition: The F distribution is a probability distribution for a continuous random variable that is used to model the distribution of the ratio of two chi-squared random variables.
Assumptions: The continuous random variable is the ratio of two chi-squared random variables.
Application: The F distribution is commonly used in hypothesis testing and confidence interval estimation when comparing the variances of two populations.
Real-life example: The ratio of the variances of two sets of data.
# Generate 10 F random variables with numerator degrees of freedom 5 and denominator degrees of freedom 10
rf(n = 10, df1 = 5, df2 = 10)
## [1] 1.7271958 1.9262566 4.4256425 0.5940662 2.1703496 1.7676095 9.4035169
## [8] 0.9775437 0.3089306 2.1442481
Example: Students t Distribution
###########################################################################
# R code to generate 20000 random sample from Students t Distribution by varying degree of freedoms
###########################################################################
m=20000
par(mfrow=c(2,2))
sample1= rt(m,df=10)
hist(sample1,main="Sample1 for df=10",prob=TRUE)
qqnorm(sample1)
qqline(sample1)
boxplot(sample1)
###########################################################################
m=20000
par(mfrow=c(2,2))
sample3= rt(m,df=50)
hist(sample3,prob=TRUE,main="Sample3 for df=50")
qqnorm(sample3)
qqline(sample3)
boxplot(sample3)
###########################################################################
m=20000
par(mfrow=c(2,2))
sample4= rt(m,df=500)
hist(sample4,main="Sample4 for df=500", prob=TRUE)
qqnorm(sample4)
qqline(sample4)
boxplot(sample4)
Non-parametric tests are used when data is not normally distributed, when the assumptions of parametric tests are violated, or when dealing with ordinal or nominal data.These tests are used when a normal distribution is assumed.
Examples include:
T-test is a statistical test used to test whether the difference between the response of two groups is statistically significant or not. It is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known (typically, the scaling term is unknown and is therefore a nuisance parameter).
When the scaling term is estimated based on the data, the test statistic—under certain conditions—follows a Student’s t distribution. The t-test’s most common application is to test whether the means of two populations are significantly different. In many cases, a Z-test will yield very similar results to a t-test because the latter converges to the former as the size of the dataset increases.
T-Test: used to test the mean of the populations
The key assumption to use t-test is normality of the population from which the random samples are drawn.
There are three different types of t-tests which performs different tests
One-sample t-test: Compares the mean of population with certain
known or hypothesized population mean value. Hypothesis to be tested
is:
The test statistic is given by:
\[t = \frac{\bar{x} -
\mu}{\frac{s}{\sqrt{n}}}\]
wtdat <- read.csv('~/Independent sample T-test.csv')
t.test (Leaf.length ~ varaity, var.equal=TRUE, data = wtdat)
##
## Two Sample t-test
##
## data: Leaf.length by varaity
## t = -2.8527, df = 16, p-value = 0.01152
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.8672012 -0.1277988
## sample estimates:
## mean in group 1 mean in group 2
## 27.5625 28.0600
y1<-rnorm(100,0,1)
y2<-rnorm(57,2,1)
boxplot(y1,y2)
Here we know that the variances are equal, we generated the data that way. Repeat the t-test with the relevant option.
t.test(y1,y2)
##
## Welch Two Sample t-test
##
## data: y1 and y2
## t = -13.051, df = 124.05, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.308484 -1.700514
## sample estimates:
## mean of x mean of y
## 0.08573994 2.09023908
The Welch Two Sample t-test is a statistical method used to compare the means of two groups, \(y_1\) and \(y_2\), when their variances are not equal. The test statistic (t) is -12.708, indicating a significant difference between the groups. The degrees of freedom (df) are calculated using Welch’s formula, adjusted for unequal variances. The p-value is small, indicating strong evidence against the null hypothesis. The alternative hypothesis is that the true difference in means is not equal to 0. The confidence interval for the difference in means is entirely below 0, indicating that \(y_2\) has a significantly higher mean than y1. The sample means show that \(y_2\) is substantially higher than y1. In conclusion, the Welch Two Sample t-test indicates a statistically significant difference in means between \(y_1\) and \(y_2\), with the confidence interval and sample means suggesting that \(y_2\) is significantly higher than \(y_1\).
The Paired samples t-Test: used to compares the mean difference of
the same subject after and before having certain treatments. It tests
whether average difference is 0 or not after measurements are made on
the same group(subject) but at two different occasion or
conditions.
Example: pre or post test results of the same subjects.
yld<-read.csv("~/Paired sample T test.csv", header=TRUE, na.strings="NA", dec=".")
# Save the data in two different vector
before <- yld$before
after <- yld$after
# Compute t-test
rt <- t.test(before, after, paired = TRUE)
rt
##
## Paired t-test
##
## data: before and after
## t = -5.7654, df = 5, p-value = 0.002205
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -14.00080 -5.36587
## sample estimates:
## mean difference
## -9.683333
The One-Way ANOVA model formula using the following Formula:
\[Y_{ij} = \mu + \varepsilon_{ij}\]
Where:
- \(Y_{ij}\) represents the observation
in the \(i\)th group and the \(j\)th sample. - \(\mu\) Parameters: fixed but unknown and
needed to be estimated - \(\varepsilon_{ij}\): Random error, assumed
to follow normal distribution with constant varaince.
Model assumptions are:
Violation of Assumptions
If assumptions are violated, consider transformations of the data, using
a non-parametric alternative (e.g., Kruskal-Wallis test),
or employing a robust ANOVA method.
duodenal<-read.table("~/duodenal.txt", header = FALSE, na.strings = "NA", dec = ".")
names(duodenal) <- c("Trypsin", "Protein")
attach(duodenal)
dim(duodenal)
## [1] 28 2
Fit.aov<-aov(Protein~Trypsin)
summary(Fit.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Trypsin 2 19.63 9.814 1.265 0.3
## Residuals 25 193.99 7.760
Histograms by group
par(mfrow=c(2,2))
hist(Protein[Trypsin=="ls50"],col=0)
hist(Protein[Trypsin=="b51_1000"],col=0)
hist(Protein[Trypsin=="ge1000"],col=0)
qq normal plots by group
par(mfrow=c(2,2))
qqnorm(Protein[Trypsin=="ls50"])
qqnorm(Protein[Trypsin=="b51_1000"])
qqnorm(Protein[Trypsin=="ge1000"])
par(mfrow=c(1,2))
boxplot(split(Protein,Trypsin))
tapply(Protein,Trypsin,mean)
## b51_1000 ge1000 ls50
## 5.410000 5.944444 3.933333
Diagnostic plot
par(mfrow=c(2,2))
qqnorm(Fit.aov$resid)
hist(Fit.aov$resid,col=0)
boxplot(split(Fit.aov$resid,Trypsin))
Bartlett’s test
bartlett.test(Protein ~ Trypsin,data = duodenal)
##
## Bartlett test of homogeneity of variances
##
## data: Protein by Trypsin
## Bartlett's K-squared = 1.5415, df = 2, p-value = 0.4627
Levene’s test
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:DAAG':
##
## vif
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
attach(duodenal)
## The following objects are masked from duodenal (pos = 5):
##
## Protein, Trypsin
leveneTest(Protein, Trypsin)
## Warning in leveneTest.default(Protein, Trypsin): Trypsin coerced to factor.
| Df | F value | Pr(>F) | |
|---|---|---|---|
| group | 2 | 0.6844803 | 0.5135556 |
| 25 | NA | NA |
Two-way ANOVA is a statistical technique used to evaluate the effect of two categorical independent variables (factors) and their interaction on a continuous dependent variable. It extends one-way ANOVA by including a second factor, allowing for the analysis of interaction effects between the two factors.
\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]
Where:
- \(Y_{ijk}\) represents the
observation in the cell defined by levels \(i\) and \(j\) of the two factors. - \(\mu\) is the overall mean. - \(\alpha_i\) is the effect of the \(i\)th level of factor A. - \(\beta_j\) is the effect of the \(j\)th level of factor B. - \((\alpha\beta)_{ij}\) is the interaction
effect between the \(i\)th level of
factor A and the \(j\)th level of
factor B. - \(\varepsilon_{ijk}\) is
the random error term.
TWO WAY ANOVA Table
Model Definition \[Y_{ijk} = \mu + \tau_i + \beta_j + (\gamma)_{ij} + e_{ijk}\]
Decomposition of Sum of Squares \[SS_{\text{TOTAL}} = SS_A + SS_B + SS_{AB} + SS_{\text{Error}}\]
Assumptions \[Y_{ijk} \sim N(\mu, \sigma^2), \quad e_{ijk} \sim N(0, \sigma^2)\]
Hypotheses
Main Effects for Factor A \[H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0 \quad \text{versus} \quad H_a: \tau_i \neq 0 \ \text{for some } i.\]
Main Effects for Factor B \[H_0: \beta_1 = \beta_2 = \ldots = \beta_b = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0 \ \text{for some } j.\]
Interaction Effects \[ H_0: \gamma_{ij} = 0 \ \text{for all } i, j \quad \text{versus} \quad H_a: \gamma_{ij} \neq 0 \ \text{for some } i, j. \]
##Two way ANOVA###
trypData=read.delim2("~/Trypanosomosis.txt")
str(trypData)
## 'data.frame': 12 obs. of 4 variables:
## $ Breed : chr "Boran" "Boran" "Boran" "Boran" ...
## $ Drug : chr "Berenil" "Berenil" "Berenil" "Samorin" ...
## $ PCV_before: num 18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
## $ PCV_after : num 26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
trypData$PCV_diff <-trypData$PCV_after-trypData$PCV_before
str(trypData)
## 'data.frame': 12 obs. of 5 variables:
## $ Breed : chr "Boran" "Boran" "Boran" "Boran" ...
## $ Drug : chr "Berenil" "Berenil" "Berenil" "Samorin" ...
## $ PCV_before: num 18.4 20.3 22.2 16.3 15.4 19.2 21.3 17.4 18.2 22.2 ...
## $ PCV_after : num 26.3 28.1 27.8 30.1 27.3 32.7 28.3 26.8 25.8 38.1 ...
## $ PCV_diff : num 7.9 7.8 5.6 13.8 11.9 13.5 7 9.4 7.6 15.9 ...
library(ggplot2)
ggplot(trypData,aes(y = PCV_diff, x = Drug, fill = Breed)) +geom_boxplot(aes(fill = Breed))
ft = aov(PCV_diff~Breed * Drug, data = trypData)
print(ft)
## Call:
## aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
##
## Terms:
## Breed Drug Breed:Drug Residuals
## Sum of Squares 0.44083 89.10750 0.80083 23.99333
## Deg. of Freedom 1 1 1 8
##
## Residual standard error: 1.73181
## Estimated effects may be unbalanced
summary(ft)
## Df Sum Sq Mean Sq F value Pr(>F)
## Breed 1 0.44 0.44 0.147 0.711420
## Drug 1 89.11 89.11 29.711 0.000608 ***
## Breed:Drug 1 0.80 0.80 0.267 0.619316
## Residuals 8 23.99 3.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(ft,type="means")
## Tables of means
## Grand mean
##
## 10.275
##
## Breed
## Breed
## Boran Holstein
## 10.083 10.467
##
## Drug
## Drug
## Berenil Samorin
## 7.55 13.00
##
## Breed:Drug
## Drug
## Breed Berenil Samorin
## Boran 7.100 13.067
## Holstein 8.000 12.933
TukeyHSD(ft)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PCV_diff ~ Breed * Drug, data = trypData)
##
## $Breed
## diff lwr upr p adj
## Holstein-Boran 0.3833333 -1.922351 2.689017 0.7114204
##
## $Drug
## diff lwr upr p adj
## Samorin-Berenil 5.45 3.144316 7.755684 0.0006082
##
## $`Breed:Drug`
## diff lwr upr p adj
## Holstein:Berenil-Boran:Berenil 0.9000000 -3.6281806 5.428181 0.9173043
## Boran:Samorin-Boran:Berenil 5.9666667 1.4384861 10.494847 0.0124387
## Holstein:Samorin-Boran:Berenil 5.8333333 1.3051527 10.361514 0.0140891
## Boran:Samorin-Holstein:Berenil 5.0666667 0.5384861 9.594847 0.0293847
## Holstein:Samorin-Holstein:Berenil 4.9333333 0.4051527 9.461514 0.0334869
## Holstein:Samorin-Boran:Samorin -0.1333333 -4.6615139 4.394847 0.9996731
plot(TukeyHSD(ft,"Breed"),las=1)
mastitisData =read.csv2("~/Mastitis.csv")
str(mastitisData)
## 'data.frame': 12 obs. of 3 variables:
## $ Parity : chr "heifer" "heifer" "heifer" "heifer" ...
## $ Dose : chr "high" "high" "medium" "medium" ...
## $ Reduction: num 6.79 3.87 30.03 38.08 53.67 ...
mastitisData$Dose = factor(mastitisData$Dose,levels =c("low", "medium", "high"))
ggplot(mastitisData,aes(y = Reduction, x = Dose, fill = Parity))+geom_point(aes(colour = Parity), size = 6)
fm= aov(Reduction ~Parity * Dose, data = mastitisData)
summary(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Parity 1 627 627 36.79 0.000911 ***
## Dose 2 8758 4379 257.06 1.54e-06 ***
## Parity:Dose 2 385 192 11.30 0.009236 **
## Residuals 6 102 17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(fm,type="means")
## Tables of means
## Grand mean
##
## 39.64
##
## Parity
## Parity
## heifer multiparous
## 32.41 46.87
##
## Dose
## Dose
## low medium high
## 72.62 39.84 6.45
##
## Parity:Dose
## Dose
## Parity low medium high
## heifer 57.86 34.06 5.33
## multiparous 87.40 45.63 7.58
TukeyHSD(fm)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Reduction ~ Parity * Dose, data = mastitisData)
##
## $Parity
## diff lwr upr p adj
## multiparous-heifer 14.45333 8.62264 20.28403 0.000911
##
## $Dose
## diff lwr upr p adj
## medium-low -32.7825 -41.73701 -23.82799 7.34e-05
## high-low -66.1725 -75.12701 -57.21799 1.30e-06
## high-medium -33.3900 -42.34451 -24.43549 6.60e-05
##
## $`Parity:Dose`
## diff lwr upr p adj
## multiparous:low-heifer:low 29.540 13.11411 45.96589 0.0029142
## heifer:medium-heifer:low -23.800 -40.22589 -7.37411 0.0089285
## multiparous:medium-heifer:low -12.225 -28.65089 4.20089 0.1529274
## heifer:high-heifer:low -52.525 -68.95089 -36.09911 0.0001178
## multiparous:high-heifer:low -50.280 -66.70589 -33.85411 0.0001512
## heifer:medium-multiparous:low -53.340 -69.76589 -36.91411 0.0001079
## multiparous:medium-multiparous:low -41.765 -58.19089 -25.33911 0.0004333
## heifer:high-multiparous:low -82.065 -98.49089 -65.63911 0.0000071
## multiparous:high-multiparous:low -79.820 -96.24589 -63.39411 0.0000085
## multiparous:medium-heifer:medium 11.575 -4.85089 28.00089 0.1824850
## heifer:high-heifer:medium -28.725 -45.15089 -12.29911 0.0033811
## multiparous:high-heifer:medium -26.480 -42.90589 -10.05411 0.0051795
## heifer:high-multiparous:medium -40.300 -56.72589 -23.87411 0.0005298
## multiparous:high-multiparous:medium -38.055 -54.48089 -21.62911 0.0007303
## multiparous:high-heifer:high 2.245 -14.18089 18.67089 0.9916016
plot(TukeyHSD(fm,"Dose",order=T))
In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1.
As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a primary school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:Rlab':
##
## michelson
## The following object is masked from 'package:DAAG':
##
## hills
## The following object is masked from 'package:dplyr':
##
## select
data(cats)
str(cats)
## 'data.frame': 144 obs. of 3 variables:
## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
par(mfrow=c(1,2))
with(cats, plot(Bwt, Hwt,col=3))
title(main="Heart Weight (g) vs. Body Weight (kg)of Domestic Cats")
plot(Hwt~Bwt, data=cats,col=4,main="H vrs B W")
with(cats, cor.test(Bwt, Hwt,method = "pearson",conf.level=0.95))
##
## Pearson's product-moment correlation
##
## data: Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7375682 0.8552122
## sample estimates:
## cor
## 0.8041274
with(cats, cor(Bwt, Hwt))
## [1] 0.8041274
with(cats, cor(Bwt, Hwt))^2
## [1] 0.6466209
###Correlation test###
with(cats, cor.test(Bwt, Hwt))
##
## Pearson's product-moment correlation
##
## data: Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7375682 0.8552122
## sample estimates:
## cor
## 0.8041274
with(cats, cor.test(Bwt, Hwt,method = "pearson",conf.level=0.95))
##
## Pearson's product-moment correlation
##
## data: Bwt and Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7375682 0.8552122
## sample estimates:
## cor
## 0.8041274
####Scatterplotdetails###
with(cats, plot(Bwt, Hwt, type="n", las=1, xlab="Body Weightin kg",ylab="Heart Weight in g",main="Heart Weight vs. BodyWeight of Cats"))
with(cats, points(Bwt[Sex=="F"], Hwt[Sex=="F"], pch=16,col="red"))
with(cats, points(Bwt[Sex=="M"], Hwt[Sex=="M"], pch=17,col="blue"))
Non-parametric tests are used when data is not normally distributed, when the assumptions of parametric tests are violated, or when dealing with ordinal or nominal data. These tests do not depend on any distribution and do not make any assumptions.
Examples include:
- Sign test
- Mann–Whitney
- Wilcoxon test
- Chi-square test
- Kruskal_Wallis test
- Kolmogorov_Smirnov test
- Wilcoxon Rank sum test
- Spearman correlation
The Wilcoxon signed-rank test is a non-parametric rank test for
statistical hypothesis testing used either to test the location of a
population based on a sample of data, or to compare the locations of two
populations using two matched samples.
[1] The one-sample version serves a purpose similar to that of the
one-sample Student’s t-test.
[2] For two matched samples, it is a paired difference test like the
paired Student’s t-test (also known as the “t-test for matched pairs” or
“t-test for dependent samples”).
The Wilcoxon test is a good alternative to the t-test when the normal distribution of the differences between paired individuals cannot be assumed. Instead, it assumes a weaker hypothesis that the distribution of this difference is symmetric around a central value and it aims to test whether this center value differs significantly from zero. The Wilcoxon test is a more powerful alternative to the sign test because it considers the magnitude of the differences, but it requires this moderately strong assumption of symmetry.
- Alternative: paired t-test
- Application: comparing the medians of two dependent groups when the assumptions of normality or equal variances are not met.
- Real-life example: comparing the pre- and post-treatment scores of patients in a clinical trial when the scores are not normally distributed.
# Wilcoxon signed-rank test
before <- c(8, 9, 10, 12, 14)
after <- c(10, 11, 12, 13, 15)
wilcox.test(before, after, paired=TRUE)
## Warning in wilcox.test.default(before, after, paired = TRUE): cannot compute
## exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: before and after
## V = 0, p-value = 0.05334
## alternative hypothesis: true location shift is not equal to 0
A chi-square test is a statistical test used to determine if there is a significant association between categorical variables. It compares the observed frequencies of occurrences in different categories with the frequencies expected if there were no associations between the variables.
The Chi-Squared Test The mathematical formula is give as:
\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]
There are two main types of chi-square tests:
Chi-Square Test of Independence: It helps determine whether the variables are independent or if there’s a relationship between them. For example, you might want to know if gender affects voting preference.
Chi-Square Test of Goodness of Fit: This test checks if the sample data fits a population distribution. For example, you might want to see if a die is fair by comparing the observed frequency of each face with the expected frequency if the die were fair.
To ensure the validity of the chi-square test, certain assumptions must be met:
Chi-square tests are widely used in academia and industry, especially for testing hypotheses about the independence of categorical variables. Some of these practical applications are:
Market Research: Some applications include analyzing if customer preferences for a product vary across different age groups and income levels or determining if different marketing campaigns are equally effective across demographic segments.
Medical Research: Common use cases are studying the association between lifestyle factors (e.g., smoking, exercise) and the incidence of diseases (e.g., lung cancer, heart disease) or evaluating whether different treatment groups show different recovery rates in clinical trials.
Quality Control: Commonly used to examine whether product defects are independent of the manufacturing process or specific production lines and to compare the quality of products from different suppliers to determine if there is a significant difference in defect rates.
Education: The tests are often used to determine if there is a significant difference in pass rates between students from different schools or teaching methods and to evaluate if introducing a new curriculum improves student performance across subjects. It’s worth noting that these are some of the many applications across academia and the industry and can be extended to other domains and fields.
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
for (i in 1:4)
plot(dchisq(1:28, i*3),pch=18,col="blue",main = paste("chi-square density, df =", i*3))
Example 1:
The investigator collected the blood test of patients from a hospital
medical record.
A=25
B=32
AB=18
O=20
Test the null hypothesis that blood types are equally represented for
those 97 patients. This is a goodness of fit test with equal expected
frequencies. The “p” vector does not need to be specified, since equal
frequencies is the default…
chisq.test(c(25,32,18,20))
##
## Chi-squared test for given probabilities
##
## data: c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781
chi=chisq.test(c(25,32,18,20))
chi
##
## Chi-squared test for given probabilities
##
## data: c(25, 32, 18, 20)
## X-squared = 4.9158, df = 3, p-value = 0.1781
the null hypothesis cannot be rejected at alpha =0 .05
Example 2:
the eye color of 592 students found in the following category.
| Eye Color | Count |
|---|---|
| Brown | 220 |
| Blue | 215 |
| Hazel | 93 |
| Green | 64 |
Suppose a certain physiologist proposes that 50% of student should have brown eyes, 25% blue eyes, 15% hazel eyes, and 10% green eyes. Does the sample at hand agree with the physiologist’s hypothesis
eyes<-c(220,215,93,64)
chi<-chisq.test(eyes, p=c(.5,.25,.15,.1))
chi
##
## Chi-squared test for given probabilities
##
## data: eyes
## X-squared = 50.432, df = 3, p-value = 6.462e-11
Example 3:
The following data shows the number of died and alive patients across in
gender.
data<-c(4298,767,7136,643)
data.tab<-matrix(data,byrow=T,nrow=2)
data.tab
## [,1] [,2]
## [1,] 4298 767
## [2,] 7136 643
dimnames(data.tab)=list("Sex"=c("F","M"),"Died"=c("Alive","died"))
data.tab
## Died
## Sex Alive died
## F 4298 767
## M 7136 643
chisq.test(data.tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data.tab
## X-squared = 147.76, df = 1, p-value < 2.2e-16
chisq.test(data.tab)$expected
## Died
## Sex Alive died
## F 4508.97 556.0301
## M 6925.03 853.9699
chisq.test(data.tab)$residuals
## Died
## Sex Alive died
## F -3.141825 8.946877
## M 2.535186 -7.219370
Example 4:
Let us see the data for students smoking habit and exercise level.
data.st<-c(7,6,6,9,7,5,12,5,5,87,84,18)
data.matrix<-matrix(data.st , byrow=T, nrow=4)
dimnames(data.matrix)=list("Exe"=c("heavy","regula","occas","never"),"Smoke"=c("freq","some","none"))
data.matrix
## Smoke
## Exe freq some none
## heavy 7 6 6
## regula 9 7 5
## occas 12 5 5
## never 87 84 18
chisq.test(data.matrix)
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: data.matrix
## X-squared = 13.633, df = 6, p-value = 0.03402
As the p-value 0.03402 is less than the .05 significance level, we do not accept the null hypothesis that the smoking habit is independent of the exercise level of the students.
chisq.test(data.matrix)$expected
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
## Smoke
## Exe freq some none
## heavy 8.705179 7.721116 2.573705
## regula 9.621514 8.533865 2.844622
## occas 10.079681 8.940239 2.980080
## never 86.593625 76.804781 25.601594
chisq.test(data.matrix)$residual
## Warning in chisq.test(data.matrix): Chi-squared approximation may be incorrect
## Smoke
## Exe freq some none
## heavy -0.57793792 -0.6193983 2.135725
## regula -0.20036837 -0.5250663 1.277942
## occas 0.60485311 -1.3177955 1.170093
## never 0.04367003 0.8210127 -1.502350
Example 5:
We apply the test to a dataset. We’ll use the Anemia Levels in Nigeria
dataset, which can be downloaded from Kaggle Click
here. The dataset comes from the 2018 Nigeria Demographic and Health
Surveys (NDHS). It explores the impact of mothers’ age and socioeconomic
factors on anemia levels among children aged 0–59 months across
Nigeria’s 36 states and the Federal Capital Territory.
# Load the necessary libraries
#install.packages('readr')
library(readr)
# Load the dataset from the CSV file
dataset <- read_csv("C:/Users/abdik/Desktop/DMA/children_anaemia.csv")
## New names:
## Rows: 33924 Columns: 18
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (13): Age in 5-year groups, Type of place of residence, Highest educatio... dbl
## (5): ...1, Births in last five years, Age of respondent at 1st birth, H...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Display the first few rows of the dataset
#head(dataset)
# Rename a column
colnames(dataset)[colnames(dataset) == "Anemia level...8"] <- "Anemia level"
# Display the column names
colnames(dataset)
## [1] "...1"
## [2] "Age in 5-year groups"
## [3] "Type of place of residence"
## [4] "Highest educational level"
## [5] "Wealth index combined"
## [6] "Births in last five years"
## [7] "Age of respondent at 1st birth"
## [8] "Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)"
## [9] "Anemia level"
## [10] "Have mosquito bed net for sleeping (from household questionnaire)"
## [11] "Smokes cigarettes"
## [12] "Current marital status"
## [13] "Currently residing with husband/partner"
## [14] "When child put to breast"
## [15] "Had fever in last two weeks"
## [16] "Hemoglobin level adjusted for altitude (g/dl - 1 decimal)"
## [17] "Anemia level.1"
## [18] "Taking iron pills, sprinkles or syrup"
# Install and load the package
#install.packages('dplyr')
library(dplyr)
# Select the columns of interest
selected_data <- dplyr::select(dataset,`Highest educational level`,`Anemia level`)
# Create a contingency table for Highest educational level and Anemia level
contingency_table <- table(selected_data[["Highest educational level"]],
selected_data[["Anemia level"]])
# View the contingency table
print(contingency_table)
##
## Mild Moderate Not anemic Severe
## Higher 296 238 591 14
## No education 1450 1769 1874 113
## Primary 608 645 900 42
## Secondary 1240 1322 1972 62
# Perform chi-square test
chi_square_test <- chisq.test(contingency_table)
print(chi_square_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 142.86, df = 9, p-value < 2.2e-16
# Spearman rank correlation
x <- c(10, 20, 30, 40, 50)
y <- c(5, 15, 25, 35, 45)
cor.test(x, y, method="spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 4.4409e-15, p-value = 0.01667
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 1
So far we have seen only some data manipulation and the calculation of summary statistics. R can do a multitude of statistical analysis including linear models (lm), generalised linear models (glm), analysis of variance (anova), non linear models with mixed effects (nlme), generalised additive models (gam), survival analysis (survival), time series analysis (tseries), multivariate analysis (multiv) and many more.
Regression models, in which explanatory variables are used to model the behaviour of a response variable, are without a doubt the most commonly used class of models in the statistical toolbox. In this chapter, we will have a look at different types of regression models tailored to many different sorts of data and applications.
After reading this chapter, you will be able to use R to:
The formula for simple linear regression, where you have one independent variable (predictor) and one dependent variable, can be expressed as follows:
The linear regression model is represented as:
\[Y = \beta_0 + \beta_1 X + \varepsilon\]
where:
- \(Y\) is the dependent
variable,
- \(X\) is the independent variable
(predictor),
- \(\beta_0\) is the intercept
(y-intercept) of the regression line,
- \(\beta_1\) is the slope
(coefficient) of the independent variable,
- \(\varepsilon\) is the error term
(residuals).
The goal of simple linear regression is to determine the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
library(DAAG)
attach(ais)
## The following object is masked _by_ .GlobalEnv:
##
## sex
## The following objects are masked from ais (pos = 11):
##
## Body_Mass_index, ht, lbm, rcc, sex, sport, wcc, wt
## The following object is masked from cancer:
##
## sex
dim(ais)
## [1] 202 8
names(ais)
## [1] "rcc" "wcc" "lbm" "Body_Mass_index"
## [5] "ht" "wt" "sex" "sport"
fit.ais <- lm(wt ~ ht, data=ais)
plot(wt ~ ht, data = ais, ylab = "Weight", xlab = "Height")
x <- ais$ht
y <- fit.ais$fit
lines(x,y)
anova(fit.ais)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| ht | 1 | 23769.79 | 23769.79446 | 312.6298 | 0 |
| Residuals | 200 | 15206.35 | 76.03176 | NA | NA |
Note: aov() and anova() outputs are different
summary(fit.ais)
##
## Call:
## lm(formula = wt ~ ht, data = ais)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.372 -5.296 -1.196 4.378 38.031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -126.19049 11.39566 -11.07 <2e-16 ***
## ht 1.11712 0.06318 17.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.72 on 200 degrees of freedom
## Multiple R-squared: 0.6099, Adjusted R-squared: 0.6079
## F-statistic: 312.6 on 1 and 200 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(ais$wt,fit.ais$fit,xlab="Observed", ylab="Predicted", main="")
title("Observed Versus Predicted Values")
abline(0,1)
hist(fit.ais$resid,col=0, main="")
title("Histogram for residuals")
qqnorm(fit.ais$resid)
Default plots
par(mfrow=c(2,2))
plot(fit.ais)
Exercise
1. Read the dataset fuel.txt.
2. Fit a simple linear regression model in which the mileage (third
column in the dataset) in the response and the weight (the second column
in the dataset) of the car is a predictor.
3. Test the hypothesis that the slope is zero.
4. Use the default plots of an \(lm()\)
object to produce the diagnostic plot.
In multiple linear regression, the model involves more than one independent variable. The general formula for multiple linear regression with \(p\) independent variables can be expressed as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\]
where:
- \(Y\) is the dependent
variable,
- \(X_1\), \(X_2\), \(\ldots\), \(X_p\) are the independent variables,
- \(\beta_0\) is the intercept,
- \(\beta_1\), \(\beta_2\), $, \(\beta_p\) are the coefficients associated
with each independent variable,
- \(\varepsilon\) is the error term
(residuals).
The goal of multiple linear regression is to estimate the coefficients \(\beta_0\), \(\beta_1\), \(\ldots\), \(\beta_p\) that best fit the data. This is typically done by minimizing the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
Example
Y<-c(78.5,74.3 ,104.3,87.6,95.9,109.2,102.7,72.5 ,93.1 ,115.9 ,83.8 ,113.3,109.4)
x1<-c(7 ,1 ,11, 11 ,7 ,11 ,3 ,1 ,2 ,21 ,1 ,11 ,10 )
x2<-c(26, 29,56,31,52,55,71,31,54,47,40,66,68)
x3<-c(6,15, 8,8,6,9,17,22,18,4,23,9 ,8)
x4<-c(60,52,20,47,33,22,6,44,22,26,34,12,12)
Halds<-data.frame(Y,x1,x2,x3,x4)
model<-lm(Y~x1+x2+x3+x4,data=Halds)
summary(model)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4, data = Halds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## x1 1.5511 0.7448 2.083 0.0708 .
## x2 0.5102 0.7238 0.705 0.5009
## x3 0.1019 0.7547 0.135 0.8959
## x4 -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
par(mfrow=c(2,2))
plot(model)
residuals <- residuals(model)
fitted_values <- fitted(model)
# Correlation test
cor.test(fitted_values, residuals)
##
## Pearson's product-moment correlation
##
## data: fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5509853 0.5509853
## sample estimates:
## cor
## 3.829213e-17
model1<-lm(Y~x1+x2,data=Halds)
summary(model1)
##
## Call:
## lm(formula = Y ~ x1 + x2, data = Halds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## x1 1.46831 0.12130 12.11 2.69e-07 ***
## x2 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
model2<-lm(Y~x1+x2+x4,data=Halds)
summary(model2)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x4, data = Halds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
Example 2:
library(car)
attach(mtcars)
## The following object is masked from ais (pos = 3):
##
## wt
## The following object is masked from ais (pos = 12):
##
## wt
## The following object is masked from package:ggplot2:
##
## mpg
head(mtcars)
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
x <- mtcars$mpg
y <- mtcars$hp
# Pearson correlation test
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
# Fit a model
model11 <- lm(mpg ~ hp + wt, data = mtcars)
summary(model11)
##
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
# Extract residuals and fitted values
residuals <- residuals(model)
fitted_values <- fitted(model)
# Correlation test
cor.test(fitted_values, residuals)
##
## Pearson's product-moment correlation
##
## data: fitted_values and residuals
## t = 1.27e-16, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5509853 0.5509853
## sample estimates:
## cor
## 3.829213e-17
# Load ggplot2
library(ggplot2)
# Create Scatter Plot with Regression Line
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = "blue", size = 3) + # Scatter points
geom_smooth(method = "lm", se = TRUE, color = "red") + # Regression line with confidence interval
labs(title = "Scatter Plot of MPG vs Horsepower",
x = "Horsepower (hp)",
y = "Miles Per Gallon (mpg)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Pairwise scatterplots for selected variables
pairs(mtcars[, c("mpg", "hp", "wt")],
main = "Pairwise Scatterplots",
pch = 19, col = "blue")
# Fit a Multiple Regression Model
model <- lm(mpg ~ hp + wt, data = mtcars)
# Plot Residuals vs Fitted Values
plot(model$fitted.values, model$residuals,
main = "Residuals vs Fitted Values",
xlab = "Fitted Values",
ylab = "Residuals",
pch = 19,
col = "darkgreen")
abline(h = 0, col = "red", lwd = 2) # Horizontal line at 0
# Optionally, add a smooth line to detect patterns
lines(lowess(model$fitted.values, model$residuals), col = "blue", lwd = 2)
# Create Residuals vs Fitted Plot
ggplot(model, aes(x = .fitted, y = .resid)) +
geom_point(color = "darkgreen", size = 3) + # Residual points
geom_hline(yintercept = 0, color = "red", linetype = "dashed") + # Horizontal line at 0
geom_smooth(method = "loess", color = "blue") + # Smooth line to detect patterns
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals") +
theme_minimal()
## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
## Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
# Generate All Diagnostic Plots
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(model)
par(mfrow = c(1, 1)) # Reset to default
# Q-Q Plot Using ggplot2
ggplot(model, aes(sample = .stdresid)) +
stat_qq(color = "purple") +
stat_qq_line(color = "red") +
labs(title = "Normal Q-Q Plot",
x = "Theoretical Quantiles",
y = "Standardized Residuals") +
theme_minimal()
# Load library for heatmap
library(corrplot)
## corrplot 0.95 loaded
# Calculate correlation matrix
cor_matrix <- cor(mtcars)
# Plot heatmap
corrplot(cor_matrix, method = "circle", type = "upper",
tl.col = "black", tl.srt = 45,
title = "Correlation Heatmap", mar = c(0, 0, 1, 0))
It is a method/technique for analyzing time series data in order to extract meaningful statistics/figure and other characteristics of the data. That is, what is the previous trend implies for and can it continues or not. It deals all about the time dependencies of the current on the past. It is a systematic approach to deal with mathematical and statistical questions posed by time correlation. It used to examine how data value change over time.
The analysis of time series is of great significance not only to the economists and business man but also to the scientists, astronomists, geologist, sociologists, biologists, research worker etc. for the reason below.
Autoregressive Integrated Moving Average models (ARIMA models) were popularized by “George Box and Gwilym Jenkins” in the early 1970s. ARIMA models are a class of linear models that is capable of representing stationary as well as non-stationary time series. ARIMA models do not involve independent variables in their construction. They make use of the information in the series itself to generate forecasts.
It is one of the most widely implemented methods for analyzing univariate time series data. In order to understand the modeling procedure, it is useful to briefly introduce the following basic models. ARIMA models are specific subset of univariate modeling, in which a time series is expressed in terms of past values of itself (the autoregressive component) plus current and lagged values of a ‘white noise’ error term (the moving average component).
The ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting method that combines autoregressive (AR), differencing (I), and moving average (MA) components. The general formula for an ARIMA model of order $ (p, d, q)$ can be expressed as:
\[Y_t = c + \phi_1 Y_{t-1} + \ldots + \phi_p Y_{t-p} + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q} + \varepsilon_t\]
where:
- \(Y_t\) is the time series at time
\(t\),
- \(c\) is a constant term,
- \(\phi_1\), \(\ldots\), \(\phi_p\) are the autoregressive
coefficients,
- \(\theta_1\), \(\ldots\), \(\theta_q\) are the moving average
coefficients,
- \(\varepsilon_t\) is the error term
at time \(t\),
- \(Y_{t-1}, Y_{t-2},\)$, \(Y_{t-p}\) are lagged values of the time
series,
- \(\varepsilon_{t-1}\), \(\varepsilon_{t-2}\), \(\ldots\), \(\varepsilon_{t-q}\) are lagged values of
the error terms,
- \(d\) is the degree of
differencing.
The ARIMA model aims to capture the autocorrelation in the time series data by incorporating the autoregressive and moving average terms. The differencing term helps in making the series stationary. The parameters \(p\), \(d\), and \(q\) are chosen based on the characteristics of the time series data.
The ARIMA model makes several assumptions about the underlying time series data, including:
Stationarity: The time series should be stationary, meaning that the statistical properties of the series (such as the mean and variance) do not depend on time. If the series is non-stationary, it may need to be differenced to make it stationary before using the ARIMA model.
Linearity: The relationship between past observations and future predictions should be linear. Nonlinear relationships may require a different type of model.
Independence: The error terms in the ARIMA model should be independent and identically distributed (IID), meaning that each error term should be unrelated to previous error terms and should have the same distribution. If the error terms are not IID, it may indicate that the model is misspecified.
Normality: The error terms should be normally distributed, with a mean of zero and constant variance. If the error terms are not normally distributed, it may indicate that the model is misspecified.
Adequate Sample Size: The ARIMA model assumes that the time series has a sufficient number of observations to estimate the model parameters with reasonable accuracy. A general guideline is to have at least 50 observations in the time series.
These assumptions are important to keep in mind when using the ARIMA model, as violations of these assumptions can lead to inaccurate predictions and biased parameter estimates. It is important to check these assumptions before fitting an ARIMA model to a time series and to take appropriate steps to address any violations.
Example
male_births<-read.csv("~/male.csv")
male<-ts(male_births, start=2000,freq=4)
ts.plot(male, ylab="Male Births", main="Male Live Births")
par(mfrow=c(2,1))
acf(ts(male,freq=4),NULL,main="Acf for actual series")
pp=pacf(ts(male, freq=4),NULL,main="Pacf for actual series")
#Stationary Checking
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(male,alternative="stationary") # ADF (Augmented Dickey-Fuller)p-value is less than 0.05
##
## Augmented Dickey-Fuller Test
##
## data: male
## Dickey-Fuller = -1.5531, Lag order = 4, p-value = 0.762
## alternative hypothesis: stationary
# fit Auto ARIMA Models
library(forecast)
library(ggplot2)
fit_model<-auto.arima(male,ic="aic",seasonal=TRUE,trace=TRUE)
##
## ARIMA(2,0,2)(1,0,1)[4] with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 1595.311
## ARIMA(1,0,0)(1,0,0)[4] with non-zero mean : 1509.133
## ARIMA(0,0,1)(0,0,1)[4] with non-zero mean : 1535.53
## ARIMA(0,0,0) with zero mean : 2152.192
## ARIMA(1,0,0) with non-zero mean : 1521.978
## ARIMA(1,0,0)(2,0,0)[4] with non-zero mean : 1509.428
## ARIMA(1,0,0)(1,0,1)[4] with non-zero mean : Inf
## ARIMA(1,0,0)(0,0,1)[4] with non-zero mean : 1513.905
## ARIMA(1,0,0)(2,0,1)[4] with non-zero mean : Inf
## ARIMA(0,0,0)(1,0,0)[4] with non-zero mean : 1528.147
## ARIMA(2,0,0)(1,0,0)[4] with non-zero mean : 1485.386
## ARIMA(2,0,0) with non-zero mean : 1489.306
## ARIMA(2,0,0)(2,0,0)[4] with non-zero mean : 1485.986
## ARIMA(2,0,0)(1,0,1)[4] with non-zero mean : Inf
## ARIMA(2,0,0)(0,0,1)[4] with non-zero mean : 1486.623
## ARIMA(2,0,0)(2,0,1)[4] with non-zero mean : Inf
## ARIMA(3,0,0)(1,0,0)[4] with non-zero mean : 1486.887
## ARIMA(2,0,1)(1,0,0)[4] with non-zero mean : 1486.866
## ARIMA(1,0,1)(1,0,0)[4] with non-zero mean : 1491.064
## ARIMA(3,0,1)(1,0,0)[4] with non-zero mean : 1488.765
## ARIMA(2,0,0)(1,0,0)[4] with zero mean : Inf
##
## Best model: ARIMA(2,0,0)(1,0,0)[4] with non-zero mean
print(summary(fit_model))
## Series: male
## ARIMA(2,0,0)(1,0,0)[4] with non-zero mean
##
## Coefficients:
## ar1 ar2 sar1 mean
## 0.3129 0.4855 0.2552 7458.7561
## s.e. 0.0848 0.0870 0.1054 176.1787
##
## sigma^2 = 86857: log likelihood = -737.69
## AIC=1485.39 AICc=1486 BIC=1498.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.866228 288.9912 233.0705 -0.1768792 3.109607 0.7421209
## ACF1
## Training set -0.02615957
male_forecasting<-forecast(fit_model,level = c(95),h=5)
male_forecasting
## Point Forecast Lo 95 Hi 95
## 2026 Q1 7468.843 6891.213 8046.472
## 2026 Q2 7500.321 6895.070 8105.571
## 2026 Q3 7411.571 6718.813 8104.329
## 2026 Q4 7476.532 6757.332 8195.733
## 2027 Q1 7461.016 6651.517 8270.515
autoplot(male_forecasting)
Example
Gasoline Sales Time Series After Obtaining the Contract with the
Maroodi-Jeex Region Police.
gasoline<-data.frame(weeks=1:22, sales=c(17,21,19,23,18,16,20,18,22,20,15,22,31,34,31,33,28,32,30,29,34,33))
head(gasoline)
| weeks | sales |
|---|---|
| 1 | 17 |
| 2 | 21 |
| 3 | 19 |
| 4 | 23 |
| 5 | 18 |
| 6 | 16 |
gas_oline<-ts(gasoline$sales, start = 2024, frequency = 52)
print(gas_oline)
## Time Series:
## Start = c(2024, 1)
## End = c(2024, 22)
## Frequency = 52
## [1] 17 21 19 23 18 16 20 18 22 20 15 22 31 34 31 33 28 32 30 29 34 33
library(forecast)
arima_model <- auto.arima(gas_oline)
summary(arima_model)
## Series: gas_oline
## ARIMA(0,1,0)
##
## sigma^2 = 16.86: log likelihood = -59.46
## AIC=120.92 AICc=121.13 BIC=121.96
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.7280455 4.011349 3.455318 1.442402 14.70862 NaN -0.1952415
The ARIMA(0,1,0) model was fitted to the gas_oline time
series data representing gasoline sales in the Maroodi-Jeex Region
Police after obtaining the contract. Key metrics include an estimated
variance of 16.86, a log likelihood of -59.46, Akaike Information
Criterion (AIC) of 120.92, corrected AIC of 121.13, and Bayesian
Information Criterion (BIC) of 121.96.
Training set error measures show an average of 0.728 forecast errors, indicating the model tends to slightly over-predict. The Root Mean Squared Error (RMSE) is 4.011, the Mean Absolute Error (MAE) is 3.455, the Mean Percentage Error (MPE) is 1.44%, and the Mean Absolute Percentage Error (MAPE) is 14.71%. The MASE is not available, indicating a small average percentage error in the predictions. The autocorrelation of the first lag of the residuals is -0.195, indicating any remaining autocorrelation in the residuals.
Based on these metrics, the ARIMA(0,1,0) model may provide a reasonable fit to the gasoline sales data in the Somali Region State Police, with room for improvement in accuracy. Further analysis and model refinement may be necessary to enhance forecasting performance.
autoplot(arima_model$residuals, main = "Time Series plot of gasoline")
forecast_values <- forecast(arima_model, h = 10)
forecast_values
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2024.423 33 27.73827 38.26173 24.952884 41.04712
## 2024.442 33 25.55880 40.44120 21.619660 44.38034
## 2024.462 33 23.88642 42.11358 19.061987 46.93801
## 2024.481 33 22.47655 43.52345 16.905768 49.09423
## 2024.500 33 21.23442 44.76558 15.006102 50.99390
## 2024.519 33 20.11146 45.88854 13.288672 52.71133
## 2024.538 33 19.07878 46.92122 11.709333 54.29067
## 2024.558 33 18.11759 47.88241 10.239319 55.76068
## 2024.577 33 17.21482 48.78518 8.858653 57.14135
## 2024.596 33 16.36096 49.63904 7.552785 58.44721
autoplot(forecast_values)
The Figure depicts a line graph illustrating the ARIMA(0,1,0) model’s forecasts for gasoline sales data from 2024.0 to 2024.6. The x-axis represents the observed values over time, while the y-axis represents the predicted sales. The solid black line represents actual data points, while the shaded blue area represents the predicted sales. A diagonal line serves as a trend or reference line for comparison. The title “Forecasts from ARIMA(0,1,0)” indicates that the graph is specifically displaying the forecasted values generated by the ARIMA model. The graph provides a visual representation of how well the ARIMA model forecasts match the actual gasoline sales data over the specified time period.