The R programming language has proven to be an invaluable tool for climate data analysis, enabling researchers and scientists to unravel intricate patterns and trends within complex environmental datasets. Its versatility in handling diverse types of data, coupled with its extensive library of specialized packages, empowers analysts to efficiently process, visualize, and model climate information. From temperature fluctuations and precipitation trends to sea-level rise projections and ecosystem dynamics, R’s robust statistical functions and advanced graphical capabilities provide the means to extract meaningful insights from raw climate data. Whether scrutinizing historical records or delving into real-time sensor readings, R’s user-friendly syntax and rich visualization options facilitate the exploration of climate phenomena, fostering a deeper understanding of our planet’s ever-evolving climatic processes.
By convention, we will start learning R programming by writing a “Hello, World!” program. Depending on your needs, you can work either at the R command prompt or in an R script file. The example below can be run either way.
# My first program in R Programming Language
myString <- "Hello, there!, My Name is ------, and I'm ready for my first journey on R"
print(myString)
## [1] "Hello, there!, My Name is ------, and I'm ready for my first journey on R"
In any programming language, you need variables to store information. Variables are reserved memory locations that hold values, so creating a variable reserves some space in memory.
You may want to store information of various data types such as character, wide character, integer, floating point, double floating point, Boolean and so on. Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory. There are many types of R objects; the most frequently used are vectors, lists, matrices, arrays, factors and data frames.
The simplest of these objects is the vector, and there are six data types of atomic vectors, also termed the six classes of vectors. The other R objects are built upon the atomic vectors.
When you want to create a vector with more than one element, use the c() function, which combines the elements into a vector.
# Create a vector.
weather <- c('Sunny','Cloudy',"Rainy")
print(weather)
## [1] "Sunny" "Cloudy" "Rainy"
# Get the class of the vector.
print(class(weather))
## [1] "character"
A list is an R object that can contain many different types of elements, such as vectors, functions and even another list.
# Create a list
weather_list <- list(c(24,27,25),28.5,sin)
# Print the list.
print(weather_list)
## [[1]]
## [1] 24 27 25
##
## [[2]]
## [1] 28.5
##
## [[3]]
## function (x) .Primitive("sin")
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix() function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
## [,1] [,2] [,3]
## [1,] "a" "a" "b"
## [2,] "c" "b" "a"
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute which creates the required number of dimensions. In the example below we create an array with two elements, each of which is a 3x3 matrix.
# Create an array.
weather <- array(c('Rainy','Cloudy','Sunny'),dim = c(3,3,2))
print(weather)
## , , 1
##
## [,1] [,2] [,3]
## [1,] "Rainy" "Rainy" "Rainy"
## [2,] "Cloudy" "Cloudy" "Cloudy"
## [3,] "Sunny" "Sunny" "Sunny"
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] "Rainy" "Rainy" "Rainy"
## [2,] "Cloudy" "Cloudy" "Cloudy"
## [3,] "Sunny" "Sunny" "Sunny"
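As a rough illustration of how the dim attribute maps onto indexing, the sketch below builds a small numeric array and pulls out one element and one slice. The values and the names `temps`, `one_value` and `week1` are invented for this example.

```r
# Hypothetical layout: 2 stations x 3 days x 2 weeks (values are invented)
temps <- array(1:12, dim = c(2, 3, 2))

# A single element: station 1, day 2, week 2
one_value <- temps[1, 2, 2]

# A whole slice: all stations and days for week 1 (a 2 x 3 matrix)
week1 <- temps[, , 1]

print(one_value)
print(dim(week1))
```

Leaving an index position empty keeps every entry along that dimension, which is why `temps[, , 1]` returns a full matrix.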
Factors are R objects created from a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character, irrespective of whether the input vector is numeric, character or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
temp_level <- c(27,28,30,22,24,23,23.5,31)
# Create a factor object.
factor_apple <- factor(apple_colors)
factor_temp <- factor(temp_level)
# Print the factor.
print(factor_apple)
## [1] green green yellow red red red green
## Levels: green red yellow
print(nlevels(factor_apple))
## [1] 3
print(factor_temp)
## [1] 27 28 30 22 24 23 23.5 31
## Levels: 22 23 23.5 24 27 28 30 31
print(nlevels(factor_temp))
## [1] 8
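By default the levels are sorted alphabetically or numerically, as the output above shows. When the categories have a natural order, the levels can be set explicitly. The following is a minimal sketch with invented category values; `status` and `status_factor` are illustrative names.

```r
# Hypothetical categories observed at a station (invented values)
status <- c("Wet", "Drought", "Near Normal", "Drought", "Wet")

# Set the level order explicitly instead of relying on alphabetical order
status_factor <- factor(status,
                        levels = c("Drought", "Near Normal", "Wet"),
                        ordered = TRUE)

print(levels(status_factor))
print(table(status_factor))    # counts per level, in level order
print(status_factor < "Wet")   # ordered factors support comparisons
```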
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric, the second character and the third logical. A data frame is a list of vectors of equal length.
Data frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
## gender height weight Age
## 1 Male 152.0 81 42
## 2 Male 171.5 93 38
## 3 Female 165.0 78 26
# Create the data frame
df <- data.frame(indeks = c(0, 1, 1.5, -0.5, -1, -1, -1.5, 0.5, 0, 1, 0.5,1))
bulan <- seq(as.Date("2022-01-01"),
as.Date("2022-12-01"),
by = "month")
df$bulan <- bulan
df$bulan <- month.abb
# Add SPI and SPEI category labels directly to df
df$spi_kategori <- cut(df$indeks,
breaks = c(-Inf, -2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2, Inf),
labels = c("Extreme Drought", "Severe Drought", "Moderate Drought",
"Mild Drought", "Near Normal", "Mild Wet",
"Moderate Wet", "Severe Wet", "Extreme Wet"))
df
## indeks bulan spi_kategori
## 1 0.0 Jan Near Normal
## 2 1.0 Feb Mild Wet
## 3 1.5 Mar Moderate Wet
## 4 -0.5 Apr Mild Drought
## 5 -1.0 May Moderate Drought
## 6 -1.0 Jun Moderate Drought
## 7 -1.5 Jul Severe Drought
## 8 0.5 Aug Near Normal
## 9 0.0 Sep Near Normal
## 10 1.0 Oct Mild Wet
## 11 0.5 Nov Near Normal
## 12 1.0 Dec Mild Wet
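Once a data frame exists, columns and rows can be selected from it. The sketch below uses a small stand-in with the same column names as df above; the values and the names `spi` and `dry_months` are made up for illustration.

```r
# A small data frame shaped like df above (values are invented)
spi <- data.frame(
  bulan  = c("Jan", "Feb", "Mar", "Apr"),
  indeks = c(0.0, 1.0, -1.5, -0.5)
)

# Select one column as a vector
print(spi$indeks)

# Keep only the rows where the index is negative
dry_months <- spi[spi$indeks < 0, ]
print(dry_months)

# Inspect the column types
str(spi)
```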
A variable provides us with named storage that our programs can manipulate. A variable in R can store an atomic vector, a group of atomic vectors or a combination of many R objects. A valid variable name consists of letters, numbers and the dot or underscore characters. The name must start with a letter, or with a dot that is not followed by a number.
| Variable Name | Validity | Reason |
|---|---|---|
| var_name2. | Valid | Has letters, numbers, dot and underscore |
| var_name% | Invalid | Has the character ‘%’. Only dot(.) and underscore allowed |
| 2var_name | Invalid | Starts with a number |
| .var_name, var.name | Valid | Can start with a dot(.) but the dot(.) should not be followed by a number. |
| .2var_name | Invalid | The starting dot is followed by a number, making it invalid. |
| _var_name | Invalid | Starts with _ which is not valid |
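One way to check these rules yourself is with the base function make.names(), which rewrites an invalid name into a valid one. A name that comes back unchanged was already valid. This is only a sketch; the vector names `candidates` and `is_valid` are illustrative.

```r
# Candidate names modelled on the table above
candidates <- c("var_name2.", "var_name%", "2var_name",
                ".var_name", "var.name", ".2var_name", "_var_name")

# make.names() leaves valid names untouched and repairs invalid ones
is_valid <- make.names(candidates) == candidates
print(data.frame(name = candidates, valid = is_valid))
```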
Variables can be assigned values using the leftward, rightward and equal-to operators. The values of variables can be printed using the print() or cat() function. The cat() function combines multiple items into a continuous print output.
# Assignment using equal operator.
var.1 = c(0,1,2,3)
# Assignment using leftward operator.
var.2 <- c("learn","R")
# Assignment using rightward operator.
c(TRUE,1) -> var.3
print(var.1)
## [1] 0 1 2 3
cat ("var.1 is ", var.1 ,"\n")
## var.1 is 0 1 2 3
cat ("var.2 is ", var.2 ,"\n")
## var.2 is learn R
cat ("var.3 is ", var.3 ,"\n")
## var.3 is 1 1
Note − The vector c(TRUE,1) mixes the logical and numeric classes, so the logical value is coerced to numeric, turning TRUE into 1.
In R, a variable is not declared with a data type; it takes the data type of the R object assigned to it. R is therefore called a dynamically typed language, which means that the data type of the same variable can change again and again within a program.
var_x <- "Hello"
cat("The class of var_x is ",class(var_x),"\n")
## The class of var_x is character
var_x <- 34.5
cat(" Now the class of var_x is ",class(var_x),"\n")
## Now the class of var_x is numeric
var_x <- 27L
cat(" Next the class of var_x becomes ",class(var_x),"\n")
## Next the class of var_x becomes integer
To list all the variables currently available in the workspace, we use the ls() function.
print(ls())
## [1] "apple_colors" "BMI" "bulan" "df" "factor_apple"
## [6] "factor_temp" "M" "myString" "temp_level" "var.1"
## [11] "var.2" "var.3" "var_x" "weather" "weather_list"
Note − This is sample output; yours depends on which variables are declared in your environment. The ls() function can also use patterns to match the variable names.
# List the variables starting with the pattern "var".
print(ls(pattern = "var"))
## [1] "var.1" "var.2" "var.3" "var_x"
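The pattern argument is a regular expression, so "var" matches anywhere in a name. To match only names that begin with "var", anchor the pattern with "^". The sketch below uses a separate environment (the name `env` and the variables in it are invented) so that it does not depend on your workspace.

```r
# Use a separate environment so the example is independent of the workspace
env <- new.env()
assign("var.1", 1, envir = env)
assign("my_var", 2, envir = env)
assign("variance", 3, envir = env)

# "var" matches anywhere in the name
print(ls(env, pattern = "var"))

# "^var" matches only names that start with "var"
print(ls(env, pattern = "^var"))
```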
Variables whose names start with a dot (.) are hidden; they can be listed by passing the all.names = TRUE argument to the ls() function.
# List all variables, including hidden ones.
print(ls(all.names = TRUE))
## [1] "apple_colors" "BMI" "bulan" "df" "factor_apple"
## [6] "factor_temp" "M" "myString" "temp_level" "var.1"
## [11] "var.2" "var.3" "var_x" "weather" "weather_list"
Variables can be deleted using the rm() function. Below we delete the variable var.3; printing it afterwards throws an error.
rm(var.3)
print(var.3)
## Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() functions together.
rm(list = ls())
print(ls())
## character(0)
# SEE, YOUR ENVIRONMENT IS EMPTY
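An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation. R is rich in built-in operators, which fall into the following types: arithmetic, relational, logical, assignment and miscellaneous operators. The following table shows the arithmetic operators; each acts element by element on two vectors.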
| Operator | Description |
|---|---|
| + | Adds two vectors |
| - | Subtracts the second vector from the first |
| * | Multiplies both vectors |
| / | Divides the first vector by the second |
| %% | Gives the remainder of dividing the first vector by the second |
| %/% | Gives the quotient of dividing the first vector by the second |
| ^ | Raises the first vector to the exponent of the second vector |
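The arithmetic operators can be tried on two short vectors. The values below are invented; `v` and `w` are just illustrative names.

```r
# Two small numeric vectors (invented values)
v <- c(10, 7, 9)
w <- c(3, 2, 4)

print(v + w)    # element-wise addition
print(v - w)    # element-wise subtraction
print(v * w)    # element-wise multiplication
print(v / w)    # element-wise division
print(v %% w)   # remainder of each division
print(v %/% w)  # integer quotient of each division
print(v ^ w)    # each element of v raised to the matching element of w
```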
The following table shows the relational operators supported by R. Each element of the first vector is compared with the corresponding element of the second vector. The result of each comparison is a Boolean value.
| Operator | Description |
|---|---|
| > | Checks if each element of the first vector is greater than the corresponding element of the second vector. |
| < | Checks if each element of the first vector is less than the corresponding element of the second vector. |
| == | Checks if each element of the first vector is equal to the corresponding element of the second vector. |
| <= | Checks if each element of the first vector is less than or equal to the corresponding element of the second vector. |
| >= | Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector. |
| != | Checks if each element of the first vector is unequal to the corresponding element of the second vector. |
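A short sketch of the relational operators on two invented vectors follows. The last two lines show one common use of the result: counting matches and filtering.

```r
# Two small numeric vectors (invented values)
v <- c(2, 5.5, 6, 9)
w <- c(8, 2.5, 14, 9)

print(v > w)
print(v < w)
print(v == w)
print(v <= w)
print(v >= w)
print(v != w)

# A comparison result can be counted or used to filter
print(sum(v > w))   # how many elements of v exceed w
print(v[v >= w])    # the elements of v that are at least as large as w
```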
The following examples show the logical operators supported by R. They apply only to vectors of type logical, numeric or complex. Zero is treated as FALSE and every non-zero number as TRUE.
Each element of the first vector is compared with the corresponding element of the second vector. The result of comparison is a Boolean value.
## Operator &
a <- c(3,1,TRUE, 2+3i)
b <- c(4,1, FALSE, 2+3i)
print(a&b)
## [1] TRUE TRUE FALSE TRUE
It is called the element-wise logical AND operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if both elements are TRUE.
## Operator |
a <- c(3,0,TRUE, 2+3i)
b <- c(4,0, FALSE, 2+3i)
print(a|b)
## [1] TRUE FALSE TRUE TRUE
It is called the element-wise logical OR operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if at least one of the elements is TRUE.
## Operator !
a <- c(3,0,TRUE, 2+3i)
print(!a)
## [1] FALSE TRUE FALSE FALSE
It is called the logical NOT operator. It takes each element of the vector and gives the opposite logical value.
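To see how numbers are interpreted as logical values, and how a logical vector can be summarised, consider the following sketch. The temperatures and variable names are invented for illustration.

```r
# Zero becomes FALSE; every other number, including negatives and fractions, becomes TRUE
print(as.logical(c(-2, 0, 0.5, 1, 3)))

# any() and all() summarise a logical vector into a single value
temps <- c(24, 27, 31)   # invented temperatures
hot <- temps > 30
print(any(hot))
print(all(hot))

# && and || work on single values and stop evaluating once the result is known
x <- 5
print(x > 0 && x < 10)
```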
These operators are used to assign values to vectors.
# Called Left Assignment ( <- or = or <<-)
a1 <- c(3,1,TRUE,2+3i)
b2 <<- c(3,1,TRUE,2+3i)
c3 = c(3,1,TRUE,2+3i)
print(a1)
## [1] 3+0i 1+0i 1+0i 2+3i
print(b2)
## [1] 3+0i 1+0i 1+0i 2+3i
print(c3)
## [1] 3+0i 1+0i 1+0i 2+3i
# Called Right Assignment ( -> or ->>)
c(3,1,TRUE,2+3i) -> a1
c(3,1,TRUE,2+3i) ->> b2
print(a1)
## [1] 3+0i 1+0i 1+0i 2+3i
print(b2)
## [1] 3+0i 1+0i 1+0i 2+3i
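At the top level, <- and <<- behave the same, as the output above suggests. The difference appears inside a function: <- creates a local variable, while <<- assigns to a variable in an enclosing environment. A minimal sketch, with invented function names `add_local` and `add_global`:

```r
total <- 0

add_local <- function(x) {
  total <- total + x    # creates a local copy; the outer total is unchanged
  invisible(total)
}

add_global <- function(x) {
  total <<- total + x   # modifies total in the enclosing environment
  invisible(total)
}

add_local(5)
print(total)

add_global(5)
print(total)
```

Because <<- changes state outside the function, it is usually worth reserving for cases where that side effect is intended.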
These operators are used for specific purposes rather than general mathematical or logical computation.
# :
a <- 1:8
print(a)
## [1] 1 2 3 4 5 6 7 8
Colon operator. It creates the series of numbers in sequence for a vector.
# %in%
a <- 8
b <- 15
c <- 1:10
print(a %in% c)
## [1] TRUE
print(b %in% c)
## [1] FALSE
This operator is used to identify if an element belongs to a vector.
# %*%
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
# Store the product under a name that does not mask the built-in t() function.
product = M %*% t(M)
print(M)
## [,1] [,2] [,3]
## [1,] 2 6 5
## [2,] 1 10 4
print(t(M))
## [,1] [,2]
## [1,] 2 1
## [2,] 6 10
## [3,] 5 4
print(product)
## [,1] [,2]
## [1,] 65 82
## [2,] 82 117
The %*% operator performs matrix multiplication; here it multiplies a matrix by its transpose.
Decision making structures require the programmer to specify one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is determined to be true, and optionally, other statements to be executed if the condition is determined to be false.
An if statement consists of a Boolean expression followed by one or more statements.
if(boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to be true, then the block of code inside the if statement will be executed. If Boolean expression evaluates to be false, then the first set of code after the end of the if statement (after the closing curly brace) will be executed.
x <- 30L
if(is.integer(x)){
print("X is an Integer")
}
## [1] "X is an Integer"
if(boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
} else {
  # statement(s) will execute if the boolean expression is false.
}
If Else
x <- c("what", "is", "truth")
if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
## [1] "Truth is not found"
Here “Truth” and “truth” are two different strings, because string comparison in R is case-sensitive.
An if statement can be followed by an optional else if…else statement, which is very useful for testing several conditions in a single if…else if statement.
When using if, else if and else statements, there are a few points to keep in mind.
- An if can have zero or one else, and it must come after any else if.
- An if can have zero to many else if branches, and they must come before the else.
- Once an else if succeeds, none of the remaining else if or else branches will be tested.
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
## [1] "truth is found the second time"
There may be situations when you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple times.
A repeat loop executes a sequence of statements multiple times and abbreviates the code that manages the loop variable. It keeps running until a break statement is reached.
The basic syntax for creating a repeat loop in R is
repeat {
commands
if(condition) {
break
}
}
Break Statement
v <- c("Hello","loop")
cnt <- 2
repeat {
print(v)
cnt <- cnt+1
if(cnt > 5) {
break
}
}
## [1] "Hello" "loop"
## [1] "Hello" "loop"
## [1] "Hello" "loop"
## [1] "Hello" "loop"
A while loop repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
The while loop executes the same code again and again until a stop condition is met.
while (test_expression) {
statement
}
While
init <- 1
while (init <5){
init = init +1
print(init)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
A For loop is a repetition control structure that allows you to efficiently write a loop that needs to execute a specific number of times.
For
me <- c(0,50,100)
for (you in me){
print(paste(you, "% Love Me"))
}
## [1] "0 % Love Me"
## [1] "50 % Love Me"
## [1] "100 % Love Me"
Next, we classify each of the SPI/SPEI values below into a category.
indeks <- c(-2, -1.5, -1, -0.5, 0, 1, 1.5, 2)
for (SPI in indeks) {
if (SPI > 1) {
print("Wet")
} else if (SPI < -1) {
print("Drought")
} else {
print("Near Normal")
}
}
## [1] "Drought"
## [1] "Drought"
## [1] "Near Normal"
## [1] "Near Normal"
## [1] "Near Normal"
## [1] "Near Normal"
## [1] "Wet"
## [1] "Wet"
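The same classification can also be sketched without an explicit loop, using the vectorised ifelse() function. The thresholds are the ones used in the loop above; the result name `kategori` is an assumption made for this example.

```r
indeks <- c(-2, -1.5, -1, -0.5, 0, 1, 1.5, 2)

# ifelse() tests every element at once, so no explicit loop is needed
kategori <- ifelse(indeks > 1, "Wet",
                   ifelse(indeks < -1, "Drought", "Near Normal"))
print(kategori)
```

The loop version is easier to extend with side effects such as printing; the vectorised version returns the categories as a vector that can be stored in a data frame column.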
The loop stops executing once the condition becomes FALSE.
init <- 1
while(init <5){
print(init)
init = init + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
indeks <- c(-2, -1.5, -1, -0.5, 0, 1, 1.5, 2)
i <- 1
while (i <= length(indeks)) {
SPI <- indeks[i]
if (SPI > 1) {
print(paste("SPI =", SPI, "is categorized as Wet"))
} else if (SPI < -1) {
print(paste("SPI =", SPI, "is categorized as Drought"))
} else {
print(paste("SPI =", SPI, "is categorized as Near Normal"))
}
i <- i + 1
}
## [1] "SPI = -2 is categorized as Drought"
## [1] "SPI = -1.5 is categorized as Drought"
## [1] "SPI = -1 is categorized as Near Normal"
## [1] "SPI = -0.5 is categorized as Near Normal"
## [1] "SPI = 0 is categorized as Near Normal"
## [1] "SPI = 1 is categorized as Near Normal"
## [1] "SPI = 1.5 is categorized as Wet"
## [1] "SPI = 2 is categorized as Wet"
The break statement stops the loop as soon as a condition is met.
# The program stops when chval > 50 (ch holds rainfall values in mm)
ch <- c(0,90,5,10)
for (chval in ch){
  if (chval > 50){
    print("Rainfall is more than 50 mm")
    break
  }
}
## [1] "Rainfall is more than 50 mm"
indeks <- c(-2, -1.5, -1, -0.5, 0, 1, 1.5, 2)
for (SPI in indeks) {
if (SPI > 1) {
print(paste("SPI =", SPI, "is categorized as Wet"))
} else if (SPI < -1) {
print(paste("SPI =", SPI, "is categorized as Drought"))
} else {
print(paste("SPI =", SPI, "is categorized as Near Normal"))
}
# Add a break statement here if you want to exit the loop
if (SPI == 1.5) {
break
}
}
## [1] "SPI = -2 is categorized as Drought"
## [1] "SPI = -1.5 is categorized as Drought"
## [1] "SPI = -1 is categorized as Near Normal"
## [1] "SPI = -0.5 is categorized as Near Normal"
## [1] "SPI = 0 is categorized as Near Normal"
## [1] "SPI = 1 is categorized as Near Normal"
## [1] "SPI = 1.5 is categorized as Wet"
##### Next Statement
# The program skips to the next iteration when chval > 50
ch <- c(0, 90, 0, 3)
for (chval in ch) {
  if (chval > 50) {
    next
  } else {
    print("Rainfall is less than 50 mm")
  }
}
## [1] "Rainfall is less than 50 mm"
## [1] "Rainfall is less than 50 mm"
## [1] "Rainfall is less than 50 mm"
indeks <- c(-2, -1.5, -1, -0.5, 0, 1, 1.5, 2)
for (SPI in indeks) {
if (SPI > 1) {
print(paste("SPI =", SPI, "is categorized as Wet"))
next
} else if (SPI < -1) {
print(paste("SPI =", SPI, "is categorized as Drought"))
next
} else {
print(paste("SPI =", SPI, "is categorized as Near Normal"))
next
}
}
## [1] "SPI = -2 is categorized as Drought"
## [1] "SPI = -1.5 is categorized as Drought"
## [1] "SPI = -1 is categorized as Near Normal"
## [1] "SPI = -0.5 is categorized as Near Normal"
## [1] "SPI = 0 is categorized as Near Normal"
## [1] "SPI = 1 is categorized as Near Normal"
## [1] "SPI = 1.5 is categorized as Wet"
## [1] "SPI = 2 is categorized as Wet"
The return() statement returns a value from a function created with function().
fun1 <- function(ch) {
  if (ch >= 1) {
    result <- "Rain"
  } else {
    result <- "No Rain"
  }
  return(result)
}
fun1(2)
## [1] "Rain"
fun1(0.5)
## [1] "No Rain"
categorize_SPI <- function(indeks) {
results <- character(length(indeks)) # Preallocate a vector for results
for (i in seq_along(indeks)) {
SPI <- indeks[i]
if (SPI > 1) {
results[i] <- paste("SPI =", SPI, "is categorized as Wet")
} else if (SPI < -1) {
results[i] <- paste("SPI =", SPI, "is categorized as Drought")
} else {
results[i] <- paste("SPI =", SPI, "is categorized as Near Normal")
}
}
return(results) # Return the categorized results
}
indeks <- c(-2, -1.5, -1, -0.5, 0, 1, 1.5, 2)
categorized_results <- categorize_SPI(indeks)
print(categorized_results)
## [1] "SPI = -2 is categorized as Drought"
## [2] "SPI = -1.5 is categorized as Drought"
## [3] "SPI = -1 is categorized as Near Normal"
## [4] "SPI = -0.5 is categorized as Near Normal"
## [5] "SPI = 0 is categorized as Near Normal"
## [6] "SPI = 1 is categorized as Near Normal"
## [7] "SPI = 1.5 is categorized as Wet"
## [8] "SPI = 2 is categorized as Wet"
In R, a function is a block of organized, reusable code designed to perform a specific task. It lets you encapsulate a sequence of commands into a single unit that can be called many times with different inputs. Functions help modularize code, improve readability and promote reuse.
Here’s a breakdown of the key components of an R function:
- Function Name: the identifier you give to your function. It should be a meaningful name that describes the purpose of the function.
- Arguments: functions can take one or more arguments, which are inputs provided when the function is called. Arguments specify the data or values the function will operate on. They are enclosed in parentheses and separated by commas.
- Function Body: the sequence of statements that the function executes when called. It includes the code that performs the desired task and is enclosed in curly braces {}.
- Return Value: a function can produce a result that is returned to the caller. You use the return() statement to specify the value you want the function to return. If there is no return() statement, the function returns the value of the last evaluated expression.
Here’s a simple example of an R function:
# Define a function that calculates the square of a number
square <- function(x) {
result <- x^2
return(result)
}
# Call the function and store the result
squared_value <- square(5)
print(squared_value) # Output: 25
## [1] 25
In this example:
- The function name is square.
- The function takes an argument x.
- The function calculates the square of x using the expression x^2.
- The result is returned using the return() statement.
You can define your own functions in R and use them to encapsulate complex logic, avoid code repetition, and make your code more organized and readable.
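Two of the points above, default argument values and the implicit return of the last evaluated expression, can be sketched as follows. The function `convert_temp` and its `to` argument are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: convert Celsius to Fahrenheit ("F") or Kelvin (anything else)
convert_temp <- function(celsius, to = "F") {
  if (to == "F") {
    celsius * 9 / 5 + 32   # last evaluated expression is returned
  } else {
    celsius + 273.15
  }
}

print(convert_temp(25))            # uses the default to = "F"
print(convert_temp(25, to = "K"))  # overrides the default
```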
R has many built-in functions that can be called directly in a program without defining them first. We can also create and use our own functions, referred to as user-defined functions.
Simple examples of built-in functions are seq(), mean(), max(), sum(x) and paste(…). They are called directly by user-written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
## [1] 32 33 34 35 36 37 38 39 40 41 42 43 44
# Find mean of numbers from 25 to 82.
print(mean(25:82))
## [1] 53.5
# Find sum of numbers from 41 to 68.
print(sum(41:68))
## [1] 1526
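Climate records often contain missing readings. Many built-in summary functions accept an na.rm argument for this case. The sketch below uses invented temperatures with one missing value.

```r
# Invented daily temperatures with one missing reading
temps <- c(24.5, 27.1, NA, 25.3, 28.0)

print(mean(temps))                          # NA, because one value is missing
print(mean(temps, na.rm = TRUE))            # mean of the four available values
print(max(temps, na.rm = TRUE))
print(round(sum(temps, na.rm = TRUE), 1))
```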
Strings in R are commonly combined using the paste() function. It can take any number of arguments to be combined together.
The basic syntax for the paste function is
paste(..., sep = " ", collapse = NULL)
Following is the description of the parameters used −
(…) represents any number of arguments to be combined.
sep represents the separator placed between the arguments. It is optional and defaults to a single space.
collapse is an optional string used to join the elements of the result into a single string. It does not affect spaces inside the individual strings.
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
## [1] "Hello How are you? "
print(paste(a,b,c, sep = "-"))
## [1] "Hello-How-are you? "
print(paste(a,b,c, sep = "", collapse = ""))
## [1] "HelloHoware you? "
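The difference between sep and collapse is easier to see when one of the arguments is a vector with several elements. A small sketch, with `month_names` as an illustrative name:

```r
month_names <- c("Jan", "Feb", "Mar")

# sep joins the arguments element by element; the result is still a vector of length 3
print(paste(month_names, 2022, sep = "-"))

# collapse joins the elements of that result into a single string
print(paste(month_names, collapse = ", "))

# paste0() is shorthand for paste(..., sep = "")
print(paste0("SPI_", 1:3))
```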
Formatting Numbers and Strings
The basic syntax for the format function is
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Following is the description of the parameters used −
x is the vector input.
digits is the total number of significant digits displayed.
nsmall is the minimum number of digits to the right of the decimal point.
scientific is set to TRUE to display scientific notation.
width indicates the minimum width to be displayed, padding with blanks at the beginning.
justify aligns a string to the left, right or centre.
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)
## [1] "23.1234568"
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
## [1] "6.000000e+00" "1.314521e+01"
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
## [1] "23.47000"
# Format treats everything as a string.
result <- format(6)
print(result)
## [1] "6"
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
## [1] "  13.7"
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
## [1] "Hello   "
# Justify string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
## [1] " Hello  "
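An alternative worth knowing is the base function sprintf(), which formats values through a template string. This is offered as a sketch alongside format(), not as a replacement; the value and the station label are invented.

```r
temp <- 23.456   # invented value

# Fixed number of decimals
print(sprintf("%.1f", temp))

# Minimum width of 8 characters, right-aligned, two decimals
print(sprintf("%8.2f", temp))

# Mix text and numbers in one template
print(sprintf("Station %s recorded %.1f degrees", "A", temp))
```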
Counting the number of characters in a string: nchar() function. The basic syntax is nchar(x), where x is the vector input.
nchar("I LOVE YOU")
## [1] 10
Changing the case: toupper() and tolower() functions
# Changing to Upper case.
result <- toupper("i love you")
print(result)
## [1] "I LOVE YOU"
# Changing to lower case.
result <- tolower("I Love You")
print(result)
## [1] "i love you"
Extracting parts of a string: substring() function. The basic syntax is
substring(x,first,last)
# Extract characters from 1st to 7th position.
result <- substring("ILOVEYOU", 1, 7)
print(result)
## [1] "ILOVEYO"
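A typical use in data work is pulling the year or month out of a date stored as text. The following sketch assumes dates in the form YYYY-MM-DD; the date strings themselves are invented.

```r
stamp <- "2022-07-15"   # invented date string

year  <- substring(stamp, 1, 4)
month <- substring(stamp, 6, 7)
print(year)
print(month)

# substring() is vectorised over its input
stamps <- c("2022-07-15", "2023-01-02")
print(substring(stamps, 1, 4))

# Omitting 'last' extracts through the end of the string
print(substring(stamp, 6))
```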
Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical, integer, double, complex, character and raw.
# Atomic vector of type character.
print("abc")
## [1] "abc"
# Atomic vector of type double.
print(12.5)
## [1] 12.5
# Atomic vector of type integer.
print(63L)
## [1] 63
# Atomic vector of type logical.
print(TRUE)
## [1] TRUE
# Atomic vector of type complex.
print(2+3i)
## [1] 2+3i
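The sixth type, raw, is not shown above. The sketch below lists the storage type and the class of one value of each atomic type; charToRaw() is used here simply as a convenient way to produce a raw value.

```r
values <- list(TRUE, 63L, 12.5, 2+3i, "abc", charToRaw("A"))

# typeof() reports the storage type, class() the class seen by most functions
for (v in values) {
  cat("typeof:", typeof(v), " class:", class(v), "\n")
}
```

Note that a double is reported as type "double" but class "numeric".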
Using colon operator with numeric data
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
## [1] 5 6 7 8 9 10 11 12 13
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
## [1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
## [1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))
## [1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
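Besides a step size, seq() can also be given the number of values wanted, and seq_len() is a convenient way to generate positions. A brief sketch:

```r
# Five evenly spaced values between 0 and 1
print(seq(0, 1, length.out = 5))

# A decreasing sequence
print(seq(10, 2, by = -2))

# seq_len(n) gives 1..n and returns an empty vector when n is 0
print(seq_len(4))
print(length(seq_len(0)))
```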
#Using the c() function
s <- c("apple", "red", 5, TRUE)
print(s)
## [1] "apple" "red" "5" "TRUE"
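Elements of a vector are accessed using indexing. The [ ] brackets are used for indexing. Indexing starts with position 1. Giving a negative value in the index drops that element from the result. The logical values TRUE and FALSE can also be used for indexing. Numeric 0 and 1 are treated as positions rather than as a logical mask: 0 selects nothing and 1 selects the first element.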
# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
## [1] "Mon" "Tue" "Fri"
# Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
## [1] "Sun" "Fri"
# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
## [1] "Sun" "Tue" "Wed" "Fri" "Sat"
# Numeric 0 and 1 are positions: 0 selects nothing, 1 selects the first element.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
## [1] "Sun"
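In practice, the logical index is usually produced by a condition rather than typed by hand, and elements can also be selected by name. A sketch with invented temperatures:

```r
# Invented daily maximum temperatures, named by weekday
temps <- c(Sun = 29.5, Mon = 31.2, Tue = 27.8, Wed = 33.0)

# Logical condition: keep the days hotter than 30
print(temps[temps > 30])

# Names as an index
print(temps[c("Sun", "Wed")])

# which() returns the positions where the condition holds
print(which(temps > 30))
```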
Two vectors of same length can be added, subtracted, multiplied or divided giving the result as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
## [1] 7 19 4 13 1 13
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
## [1] -1 -3 4 -3 -1 9
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
## [1] 12 88 0 40 0 22
# Vector division.
divi.result <- v1/v2
print(divi.result)
## [1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
## [1] 7 19 8 16 4 22
sub.result <- v1-v2
print(sub.result)
## [1] -1 -3 0 -6 -4 0
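When the longer length is not a multiple of the shorter one, R still recycles but issues a warning. The sketch below catches that warning so the message and the result can both be seen; the handler shown is only one way to do this.

```r
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 20)

# Lengths 5 and 2: v2 is recycled as 10, 20, 10, 20, 10 and a warning is raised
result <- withCallingHandlers(
  v1 + v2,
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(result)
```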
Elements in a vector can be sorted using the sort() function.
a <- c(4,1,23,4,6,27,90,-9,100)
# Sort the elements of the vector
sort.result <- sort(a)
print(sort.result)
## [1] -9 1 4 4 6 23 27 90 100
# Sort the elements in the reverse order.
revsort.result <- sort(a, decreasing = TRUE)
print(revsort.result)
## [1] 100 90 27 23 6 4 4 1 -9
# Sorting character vectors.
a <- c("Red","Blue","yellow","violet")
sort.result <- sort(a)
print(sort.result)
## [1] "Blue" "Red" "violet" "yellow"
# Sorting character vectors in reverse order.
revsort.result <- sort(a, decreasing = TRUE)
print(revsort.result)
## [1] "yellow" "violet" "Red" "Blue"
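sort() returns the sorted values themselves. When a second vector has to follow the same ordering, the related function order() is useful because it returns positions instead. The station labels and rainfall totals below are invented.

```r
stations <- c("A", "B", "C", "D")   # invented station labels
rainfall <- c(120, 45, 300, 80)     # invented totals in mm

# order() returns the positions that would sort the vector
idx <- order(rainfall, decreasing = TRUE)
print(idx)

# Use the same positions to reorder a related vector
print(stations[idx])
print(rainfall[idx])
```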
Lists are R objects that contain elements of different types, such as numbers, strings, vectors and even another list. A list can also contain a matrix or a function as an element. A list is created using the list() function.
# Create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
## [[1]]
## [1] "Red"
##
## [[2]]
## [1] "Green"
##
## [[3]]
## [1] 21 32 11
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] 51.23
##
## [[6]]
## [1] 119.1
The list elements can be given names and they can be accessed using these names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Show the list.
print(list_data)
## $`1st Quarter`
## [1] "Jan" "Feb" "Mar"
##
## $A_Matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
##
## $`A Inner list`
## $`A Inner list`[[1]]
## [1] "green"
##
## $`A Inner list`[[2]]
## [1] 12.3
Accessing List Elements
Elements of the list can be accessed by the index of the element in the list. In case of named lists it can also be accessed using the names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Access the first element of the list.
print(list_data[1])
## $`1st Quarter`
## [1] "Jan" "Feb" "Mar"
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
## $`A Inner list`
## $`A Inner list`[[1]]
## [1] "green"
##
## $`A Inner list`[[2]]
## [1] 12.3
# Access the list element using the name of the element.
print(list_data$A_Matrix)
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
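Single brackets, as used above, return a list that contains the selected element. Double brackets return the element itself, which is usually what further computation needs. A short sketch using a reduced version of the same list:

```r
list_data <- list(c("Jan", "Feb", "Mar"), matrix(c(3, 9, 5, 1, -2, 8), nrow = 2))
names(list_data) <- c("1st Quarter", "A_Matrix")

# Single brackets return a list of length one
print(class(list_data[1]))

# Double brackets return the element itself
print(class(list_data[[1]]))
print(list_data[["1st Quarter"]][2])   # second month
print(list_data$A_Matrix[2, 3])        # row 2, column 3 of the matrix
```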
We can add, delete and update list elements as shown below. The examples add and delete an element at the end of the list, but any element can be added, deleted or updated by its index or name.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])
## [[1]]
## [1] "New element"
# Remove the last element.
list_data[4] <- NULL
# Print the 4th Element.
print(list_data[4])
## $<NA>
## NULL
# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])
## $`A Inner list`
## [1] "updated element"
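Changes are not limited to the end of a list. As a minimal sketch, the example below inserts, removes and updates elements at other positions; the months list is made up for illustration and is not part of the original example.

```r
# A small list to modify (illustrative values).
months <- list("Jan", "Feb", "Apr")

# Insert "Mar" after the second element with append().
months <- append(months, list("Mar"), after = 2)

# Remove the first element by assigning NULL to it.
months[1] <- NULL

# Update the last element in place.
months[[length(months)]] <- "April"

print(unlist(months))
```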
You can merge many lists into one list by passing all of them to a single c() function.
# Create three lists.
list1 <- list(20,1,8.5)
list2 <- list("Sun","Mon","Tue")
list3 <- list("Rain", "Sunny", "Cloudy")
# Merge the three lists.
merged.list <- c(list1,list2,list3)
# Print the merged list.
print(merged.list)
## [[1]]
## [1] 20
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 8.5
##
## [[4]]
## [1] "Sun"
##
## [[5]]
## [1] "Mon"
##
## [[6]]
## [1] "Tue"
##
## [[7]]
## [1] "Rain"
##
## [[8]]
## [1] "Sunny"
##
## [[9]]
## [1] "Cloudy"
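It may help to see why c() is the function used for merging rather than list(). The sketch below compares the two on a pair of small lists; the names flat and nested are illustrative.

```r
list1 <- list(20, 1, 8.5)
list2 <- list("Sun", "Mon", "Tue")

# c() concatenates the elements into one flat list of six elements.
flat <- c(list1, list2)
print(length(flat))

# list() keeps each input as a single element, giving a list of two lists.
nested <- list(list1, list2)
print(length(nested))
```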
A list can be converted to a vector so that the elements of the
vector can be used for further manipulation. All the arithmetic
operations on vectors can be applied after the list is converted into
vectors. To do this conversion, we use the unlist()
function. It takes the list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
## [[1]]
## [1] 1 2 3 4 5
list2 <-list(10:14)
print(list2)
## [[1]]
## [1] 10 11 12 13 14
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
## [1] 1 2 3 4 5
print(v2)
## [1] 10 11 12 13 14
# Now add the vectors
result <- v1+v2
print(result)
## [1] 11 13 15 17 19
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout. They contain elements of the same atomic types. Though we can create a matrix containing only characters or only logical values, they are not of much use. We use matrices containing numeric elements to be used in mathematical calculations.
A Matrix is created using the matrix() function.
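Basic syntax is matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used −
- data is the input vector whose elements fill the matrix.
- nrow is the number of rows.
- ncol is the number of columns.
- byrow is a logical value. If TRUE, the matrix is filled row by row; the default, FALSE, fills it column by column.
- dimnames is a list holding the row names and the column names.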
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
## [,1] [,2] [,3]
## [1,] 3 4 5
## [2,] 6 7 8
## [3,] 9 10 11
## [4,] 12 13 14
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 4 8 12
## [3,] 5 9 13
## [4,] 6 10 14
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
## col1 col2 col3
## row1 3 4 5
## row2 6 7 8
## row3 9 10 11
## row4 12 13 14
Elements of a matrix can be accessed by using the column and row index of the element. We consider the matrix P above to find the specific elements below.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
# Create the matrix.
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
# Access the element at 3rd column and 1st row.
print(P[1,3])
## [1] 5
# Access the element at 2nd column and 4th row.
print(P[4,2])
## [1] 13
# Access only the 2nd row.
print(P[2,])
## col1 col2 col3
## 6 7 8
# Access only the 3rd column.
print(P[,3])
## row1 row2 row3 row4
## 5 8 11 14
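Because P has row and column names, its elements can also be selected by name. The sketch below shows this, and also shows the drop argument, which keeps a single row as a matrix instead of a vector; the names by_name and one_row are illustrative.

```r
P <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(c("row1", "row2", "row3", "row4"),
                            c("col1", "col2", "col3")))

# Select one element by row name and column name.
by_name <- P["row2", "col3"]
print(by_name)

# Selecting a single row normally returns a plain vector;
# drop = FALSE keeps the result as a 1 x 3 matrix.
one_row <- P[2, , drop = FALSE]
print(dim(one_row))
```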
Various mathematical operations are performed on the matrices using the R operators. The result of the operation is also a matrix.
The arithmetic operators work element by element, so the matrices involved must have the same dimensions (number of rows and columns).
# Matrix Addition and Subtraction
## Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
## [,1] [,2] [,3]
## [1,] 3 -1 2
## [2,] 9 4 6
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
## [,1] [,2] [,3]
## [1,] 5 0 3
## [2,] 2 9 4
# Add the matrices.
result <- matrix1 + matrix2
cat("Result of addition","\n")
## Result of addition
print(result)
## [,1] [,2] [,3]
## [1,] 8 -1 5
## [2,] 11 13 10
# Subtract the matrices
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
## Result of subtraction
print(result)
## [,1] [,2] [,3]
## [1,] -2 -1 -1
## [2,] 7 -5 2
# Element-wise Matrix Multiplication and Division
## Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
## [,1] [,2] [,3]
## [1,] 3 -1 2
## [2,] 9 4 6
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
## [,1] [,2] [,3]
## [1,] 5 0 3
## [2,] 2 9 4
# Multiply the matrices element by element (use %*% for the matrix product).
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
## Result of multiplication
print(result)
## [,1] [,2] [,3]
## [1,] 15 0 6
## [2,] 18 36 24
# Divide the matrices
result <- matrix1 / matrix2
cat("Result of division","\n")
## Result of division
print(result)
## [,1] [,2] [,3]
## [1,] 0.6 -Inf 0.6666667
## [2,] 4.5 0.4444444 1.5000000
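The * operator used above multiplies the matrices element by element. For the matrix product from linear algebra, R provides the %*% operator. The sketch below reuses matrix1 and matrix2; since both are 2x3, the second one is transposed with t() so that the inner dimensions match.

```r
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)

# (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix.
product <- matrix1 %*% t(matrix2)
print(product)
```

Each entry of product is the sum of products of one row of matrix1 with one row of matrix2; for example, the top-left entry is 3*5 + (-1)*0 + 2*3 = 21.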
Arrays are the R data objects which can store data in more than two dimensions. For example − If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can store only one data type.
An array is created using the array() function. It takes
vectors as input and uses the values in the dim parameter to create an
array.
# The following example creates an array of two matrices, each with 3 rows and 3 columns.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
## , , 1
##
## [,1] [,2] [,3]
## [1,] 5 10 13
## [2,] 9 11 14
## [3,] 3 12 15
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 5 10 13
## [2,] 9 11 14
## [3,] 3 12 15
We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,column.names,
matrix.names))
print(result)
## , , Matrix1
##
## COL1 COL2 COL3
## ROW1 5 10 13
## ROW2 9 11 14
## ROW3 3 12 15
##
## , , Matrix2
##
## COL1 COL2 COL3
## ROW1 5 10 13
## ROW2 9 11 14
## ROW3 3 12 15
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,
column.names, matrix.names))
# Print the third row of the second matrix of the array.
print(result[3,,2])
## COL1 COL2 COL3
## 3 12 15
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
## [1] 13
# Print the 2nd Matrix.
print(result[,,2])
## COL1 COL2 COL3
## ROW1 5 10 13
## ROW2 9 11 14
## ROW3 3 12 15
Because an array is made up of matrices stacked along additional dimensions, operations on the elements of an array are carried out by accessing the matrices inside it.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
array1 <- array(c(vector1,vector2),dim = c(3,3,2))
# Create two more vectors of different lengths.
vector3 <- c(9,1,0)
vector4 <- c(6,0,11,3,14,1,2,6,9)
# The 12 values are recycled to fill the 18 cells of the array.
array2 <- array(c(vector3,vector4),dim = c(3,3,2))
# create matrices from these arrays.
matrix1 <- array1[,,2]
matrix2 <- array2[,,2]
# Add the matrices.
result <- matrix1+matrix2
print(result)
## [,1] [,2] [,3]
## [1,] 7 19 19
## [2,] 15 12 14
## [3,] 12 12 26
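We can do calculations across the elements in an array using the apply() function. Basic syntax is
apply(x, margin, fun)
where x is an array, margin names the dimension (or dimensions) to keep, and fun is the function applied across the remaining elements.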
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
new.array <- array(c(vector1,vector2),dim = c(3,3,2))
print(new.array)
## , , 1
##
## [,1] [,2] [,3]
## [1,] 5 10 13
## [2,] 9 11 14
## [3,] 3 12 15
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 5 10 13
## [2,] 9 11 14
## [3,] 3 12 15
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
## [1] 56 68 60
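The margin argument decides which dimension is kept in the result. As a minimal sketch on the same array, the example below sums over columns, over whole matrices, and over each (row, column) cell across the two matrices; the result names are illustrative.

```r
new.array <- array(c(5, 9, 3, 10, 11, 12, 13, 14, 15), dim = c(3, 3, 2))

# margin = 2: one sum per column, across all rows and both matrices.
col_sums <- apply(new.array, 2, sum)
print(col_sums)

# margin = 3: one sum per matrix.
matrix_sums <- apply(new.array, 3, sum)
print(matrix_sums)

# margin = c(1, 2): one sum per (row, column) cell, across the matrices.
cell_sums <- apply(new.array, c(1, 2), sum)
print(cell_sums)
```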
Factors are data objects used to categorize data and store it as levels. They can store both strings and integers. They are useful for columns that have a limited number of unique values, such as “Male”/“Female” or TRUE/FALSE, and they are widely used in data analysis for statistical modeling.
Factors are created using the factor() function by
taking a vector as input.
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
## [1] "East" "West" "East" "North" "North" "East" "West" "West" "West"
## [10] "East" "North"
print(is.factor(data))
## [1] FALSE
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
## [1] East West East North North East West West West East North
## Levels: East North West
print(is.factor(factor_data))
## [1] TRUE
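Before R 4.0.0, creating a data frame with a column of text data made R treat that column as categorical data and convert it to a factor. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the text column stays a character vector, as the output below shows, unless it is converted explicitly.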
# Create the vectors for data frame.
precip <- c(10,5,0,0,1,2,1)
temp <- c(23,24,26,25,25,27,27)
con <- c("Rain","Rain","Sunny","Sunny","Cloudy","Cloudy","Cloudy")
# Create the data frame.
input_data <- data.frame(precip,temp,con)
print(input_data)
## precip temp con
## 1 10 23 Rain
## 2 5 24 Rain
## 3 0 26 Sunny
## 4 0 25 Sunny
## 5 1 25 Cloudy
## 6 2 27 Cloudy
## 7 1 27 Cloudy
# Test if the condition column is a factor.
print(is.factor(input_data$con))
## [1] FALSE
# Print the condition column. It is a character vector, so no levels are listed.
print(input_data$con)
## [1] "Rain" "Rain" "Sunny" "Sunny" "Cloudy" "Cloudy" "Cloudy"
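When the text column should be treated as categorical, it can be converted explicitly. The sketch below does this with factor() on a reduced version of the data frame above; passing stringsAsFactors = TRUE to data.frame() is an alternative.

```r
con <- c("Rain", "Rain", "Sunny", "Sunny", "Cloudy", "Cloudy", "Cloudy")
input_data <- data.frame(precip = c(10, 5, 0, 0, 1, 2, 1), con = con)

# Convert the character column to a factor.
input_data$con <- factor(input_data$con)

print(is.factor(input_data$con))
# Levels are sorted alphabetically by default.
print(levels(input_data$con))
# Count the observations in each level.
print(table(input_data$con))
```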
The order of the levels in a factor can be changed by applying the factor function again with the new order of the levels.
data <- c("East","West","East","North","North","East","West",
"West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)
## [1] East West East North North East West West West East North
## Levels: East North West
# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
## [1] East West East North North East West West West East North
## Levels: East West North
We can generate factor levels by using the gl() function. It takes two integers as input, which indicate how many levels there are and how many times each level is repeated. Basic syntax is
gl(n, k, labels)
Following is the description of the parameters used −
- n is an integer giving the number of levels.
- k is an integer giving the number of replications.
- labels is a vector of labels for the resulting factor levels.
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
## [1] Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston
## [10] Boston Boston Boston
## Levels: Tampa Seattle Boston
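To make clear what gl() produces, the sketch below builds the same values by hand with rep() and factor() and compares the two. The variable names cities and by_hand are illustrative.

```r
cities <- c("Tampa", "Seattle", "Boston")

# Three levels, each repeated four times.
v <- gl(3, 4, labels = cities)

# The same pattern built by hand: repeat each label four times
# and keep the level order as given.
by_hand <- factor(rep(cities, each = 4), levels = cities)

print(all(as.character(v) == as.character(by_hand)))
print(table(v))
```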
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
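Following are the characteristics of a data frame.
- The column names should be non-empty.
- The row names should be unique.
- Columns can hold different types, such as numeric, character, factor or Date.
- Each column should contain the same number of data items.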
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Dani","Matty","Star","Ali","Amy"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
## emp_id emp_name salary start_date
## 1 1 Dani 623.30 2012-01-01
## 2 2 Matty 515.20 2013-09-23
## 3 3 Star 611.00 2014-11-15
## 4 4 Ali 729.00 2014-05-11
## 5 5 Amy 843.25 2015-03-27
# The structure of the data frame can be seen by using str() function.
str(emp.data)
## 'data.frame': 5 obs. of 4 variables:
## $ emp_id : int 1 2 3 4 5
## $ emp_name : chr "Dani" "Matty" "Star" "Ali" ...
## $ salary : num 623 515 611 729 843
## $ start_date: Date, format: "2012-01-01" "2013-09-23" ...
# The statistical summary and nature of the data can be obtained by applying summary() function.
summary(emp.data)
## emp_id emp_name salary start_date
## Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
## 1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
## Median :3 Mode :character Median :623.3 Median :2014-05-11
## Mean :3 Mean :664.4 Mean :2014-01-14
## 3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
## Max. :5 Max. :843.2 Max. :2015-03-27
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
## emp.data.emp_name emp.data.salary
## 1 Dani 623.30
## 2 Matty 515.20
## 3 Star 611.00
## 4 Ali 729.00
## 5 Amy 843.25
# Extract the first two rows and then all columns
result <- emp.data[1:2,]
print(result)
## emp_id emp_name salary start_date
## 1 1 Dani 623.3 2012-01-01
## 2 2 Matty 515.2 2013-09-23
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
## emp_name start_date
## 3 Star 2014-11-15
## 5 Amy 2015-03-27
# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
## emp_id emp_name salary start_date dept
## 1 1 Dani 623.30 2012-01-01 IT
## 2 2 Matty 515.20 2013-09-23 Operations
## 3 3 Star 611.00 2014-11-15 IT
## 4 4 Ali 729.00 2014-05-11 HR
## 5 5 Amy 843.25 2015-03-27 Finance
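Rows can also be extracted by a condition rather than by position. The following is a minimal sketch on a reduced version of emp.data; the salary threshold of 620 is an arbitrary value chosen for illustration.

```r
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Dani", "Matty", "Star", "Ali", "Amy"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Keep the rows where the logical condition is TRUE, with all columns.
high_paid <- emp.data[emp.data$salary > 620, ]
print(high_paid)

# subset() expresses the same filter and can pick columns at the same time.
names_only <- subset(emp.data, salary > 620, select = emp_name)
print(names_only)
```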
To add more rows permanently to an existing data frame, we need to
bring in the new rows in the same structure as the existing data frame
and use the rbind() function.
# Create the first data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)
# Create the second data frame
emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
## emp_id emp_name salary start_date dept
## 1 1 Rick 623.30 2012-01-01 IT
## 2 2 Dan 515.20 2013-09-23 Operations
## 3 3 Michelle 611.00 2014-11-15 IT
## 4 4 Ryan 729.00 2014-05-11 HR
## 5 5 Gary 843.25 2015-03-27 Finance
## 6 6 Rasmi 578.00 2013-05-21 IT
## 7 7 Pranab 722.50 2013-07-30 Operations
## 8 8 Tusar 632.80 2014-06-17 Finance
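"The same structure" here means the same column names. For data frames, rbind() matches columns by name rather than by position, which the small sketch below illustrates; the two data frames first and second are made up for this purpose.

```r
first <- data.frame(emp_id = 1:2, dept = c("IT", "HR"),
                    stringsAsFactors = FALSE)

# Same column names, but listed in a different order.
second <- data.frame(dept = "Finance", emp_id = 3L,
                     stringsAsFactors = FALSE)

# rbind() lines the columns up by name and keeps the order of the first data frame.
combined <- rbind(first, second)
print(combined)
```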
Data Reshaping in R is about changing the way data is organized into rows and columns. Most of the time data processing in R is done by taking the input data as a data frame. It is easy to extract data from the rows and columns of a data frame but there are situations when we need the data frame in a format that is different from format in which we received it. R has many functions to split, merge and change the rows to columns and vice-versa in a data frame.
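We can combine multiple vectors column-wise using the cbind() function, and we can append the rows of one table to another using the rbind() function. Note that cbind() on plain vectors returns a matrix, not a data frame; because city is text, every value in the matrix below is stored as a character string.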
# Create vector objects.
city <- c("Bogor","Jakarta","Tangerang","Bandung")
precip <- c(200,130,100,90)
temp <- c(27,30,30,23)
hum <- c(95,80,85,90)
# Combine the four vectors above column-wise (the result is a character matrix).
clim <- cbind(city,precip,temp,hum)
# Print a header.
cat("# # # # The First data frame\n")
## # # # # The First data frame
# Print the data frame.
print(clim)
## city precip temp hum
## [1,] "Bogor" "200" "27" "95"
## [2,] "Jakarta" "130" "30" "80"
## [3,] "Tangerang" "100" "30" "85"
## [4,] "Bandung" "90" "23" "90"
# Create another data frame with similar columns
new.clim <- data.frame(
city = c("Medan", "Solo", "Tangerang Selatan", "Riau"),
precip = c(240, 200, 180, 300),
temp = c(30, 29, 29, 28),
hum = c(94, 85, 89, 92),
stringsAsFactors = FALSE
)
# Print a header.
cat("# # # The Second data frame\n")
## # # # The Second data frame
# Print the data frame.
print(new.clim)
## city precip temp hum
## 1 Medan 240 30 94
## 2 Solo 200 29 85
## 3 Tangerang Selatan 180 29 89
## 4 Riau 300 28 92
# Combine the rows of both objects into one data frame.
all_clim <- rbind(clim, new.clim)
# Print a header.
cat("# # # The combined data frame\n")
## # # # The combined data frame
# Print the result.
print(all_clim)
## city precip temp hum
## 1 Bogor 200 27 95
## 2 Jakarta 130 30 80
## 3 Tangerang 100 30 85
## 4 Bandung 90 23 90
## 5 Medan 240 30 94
## 6 Solo 200 29 85
## 7 Tangerang Selatan 180 29 89
## 8 Riau 300 28 92
# Sort combined data frame by temperature (highest to lowest)
sorted_clim <- all_clim[order(all_clim$temp, decreasing = TRUE), ]
# Print a header and the sorted data frame
cat("# # # The sorted data frame\n")
## # # # The sorted data frame
print(sorted_clim)
## city precip temp hum
## 2 Jakarta 130 30 80
## 3 Tangerang 100 30 85
## 5 Medan 240 30 94
## 6 Solo 200 29 85
## 7 Tangerang Selatan 180 29 89
## 8 Riau 300 28 92
## 1 Bogor 200 27 95
## 4 Bandung 90 23 90
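The quoted numbers in the first printout come from cbind() turning everything into text. As a minimal sketch of the difference, the example below builds the same table with cbind() and with data.frame() and checks the type of the precip column in each; the names as_matrix and as_frame are illustrative.

```r
city <- c("Bogor", "Jakarta", "Tangerang", "Bandung")
precip <- c(200, 130, 100, 90)
temp <- c(27, 30, 30, 23)

# cbind() on vectors gives a matrix; mixing text and numbers makes it all text.
as_matrix <- cbind(city, precip, temp)
print(class(as_matrix[, "precip"]))

# data.frame() keeps each column's own type.
as_frame <- data.frame(city, precip, temp, stringsAsFactors = FALSE)
print(class(as_frame$precip))

# Numeric columns can be used in calculations directly.
print(mean(as_frame$precip))
```

When the numeric columns are needed for calculations, building the table with data.frame() avoids a later conversion with as.numeric().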
We can merge two data frames by using the merge() function. The merge happens on key columns, which by default are the columns that have the same names in both data frames.
In the example below, we consider the data sets about diabetes in Pima Indian women available in the library named “MASS”. We merge the two data sets based on the values of blood pressure (“bp”) and body mass index (“bmi”). With these two columns chosen for merging, the records whose values of both variables match in the two data sets are combined to form a single data frame.
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
## bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
## 1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
## 2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
## 3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
## 4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
## 5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
## 6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
## 7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
## 8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
## 9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
## 10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
## 11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
## 12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
## 13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
## 14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
## 15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
## 16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
## 17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
## age.y type.y
## 1 31 No
## 2 21 No
## 3 24 No
## 4 21 No
## 5 21 No
## 6 43 Yes
## 7 36 Yes
## 8 40 No
## 9 29 Yes
## 10 28 No
## 11 55 No
## 12 39 No
## 13 39 No
## 14 49 Yes
## 15 49 Yes
## 16 38 No
## 17 28 No
nrow(merged.Pima)
## [1] 17
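By default merge() keeps only the rows whose key values appear in both data frames. The sketch below shows this on two small made-up tables (stations and rain are hypothetical, not part of MASS) and also shows the all and all.x arguments, which keep unmatched rows and fill the gaps with NA.

```r
stations <- data.frame(id = c(1, 2, 3),
                       city = c("Bogor", "Jakarta", "Bandung"))
rain <- data.frame(id = c(2, 3, 4),
                   precip = c(130, 90, 240))

# Default: only ids present in both tables (2 and 3).
inner <- merge(stations, rain, by = "id")
print(inner)

# all = TRUE keeps every id from either table.
outer <- merge(stations, rain, by = "id", all = TRUE)
print(outer)

# all.x = TRUE keeps every row of the first table.
left <- merge(stations, rain, by = "id", all.x = TRUE)
print(left)
```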
One of the most interesting aspects of R programming is about
changing the shape of the data in multiple steps to get a desired shape.
The functions used to do this are called melt() and
cast(), provided by the reshape2 and reshape packages loaded in the examples below.
We consider the dataset called ships present in the library called “MASS”.
library(MASS)
print(ships)
## type year period service incidents
## 1 A 60 60 127 0
## 2 A 60 75 63 0
## 3 A 65 60 1095 3
## 4 A 65 75 1095 4
## 5 A 70 60 1512 6
## 6 A 70 75 3353 18
## 7 A 75 60 0 0
## 8 A 75 75 2244 11
## 9 B 60 60 44882 39
## 10 B 60 75 17176 29
## 11 B 65 60 28609 58
## 12 B 65 75 20370 53
## 13 B 70 60 7064 12
## 14 B 70 75 13099 44
## 15 B 75 60 0 0
## 16 B 75 75 7117 18
## 17 C 60 60 1179 1
## 18 C 60 75 552 1
## 19 C 65 60 781 0
## 20 C 65 75 676 1
## 21 C 70 60 783 6
## 22 C 70 75 1948 2
## 23 C 75 60 0 0
## 24 C 75 75 274 1
## 25 D 60 60 251 0
## 26 D 60 75 105 0
## 27 D 65 60 288 0
## 28 D 65 75 192 0
## 29 D 70 60 349 2
## 30 D 70 75 1208 11
## 31 D 75 60 0 0
## 32 D 75 75 2051 4
## 33 E 60 60 45 0
## 34 E 60 75 0 0
## 35 E 65 60 789 7
## 36 E 65 75 437 7
## 37 E 70 60 1157 5
## 38 E 70 75 2161 12
## 39 E 75 60 0 0
## 40 E 75 75 542 1
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
library(reshape2)
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
## type year variable value
## 1 A 60 period 60
## 2 A 60 period 75
## 3 A 65 period 60
## 4 A 65 period 75
## 5 A 70 period 60
## 6 A 70 period 75
## 7 A 75 period 60
## 8 A 75 period 75
## 9 B 60 period 60
## 10 B 60 period 75
## 11 B 65 period 60
## 12 B 65 period 75
## 13 B 70 period 60
## 14 B 70 period 75
## 15 B 75 period 60
## 16 B 75 period 75
## 17 C 60 period 60
## 18 C 60 period 75
## 19 C 65 period 60
## 20 C 65 period 75
## 21 C 70 period 60
## 22 C 70 period 75
## 23 C 75 period 60
## 24 C 75 period 75
## 25 D 60 period 60
## 26 D 60 period 75
## 27 D 65 period 60
## 28 D 65 period 75
## 29 D 70 period 60
## 30 D 70 period 75
## 31 D 75 period 60
## 32 D 75 period 75
## 33 E 60 period 60
## 34 E 60 period 75
## 35 E 65 period 60
## 36 E 65 period 75
## 37 E 70 period 60
## 38 E 70 period 75
## 39 E 75 period 60
## 40 E 75 period 75
## 41 A 60 service 127
## 42 A 60 service 63
## 43 A 65 service 1095
## 44 A 65 service 1095
## 45 A 70 service 1512
## 46 A 70 service 3353
## 47 A 75 service 0
## 48 A 75 service 2244
## 49 B 60 service 44882
## 50 B 60 service 17176
## 51 B 65 service 28609
## 52 B 65 service 20370
## 53 B 70 service 7064
## 54 B 70 service 13099
## 55 B 75 service 0
## 56 B 75 service 7117
## 57 C 60 service 1179
## 58 C 60 service 552
## 59 C 65 service 781
## 60 C 65 service 676
## 61 C 70 service 783
## 62 C 70 service 1948
## 63 C 75 service 0
## 64 C 75 service 274
## 65 D 60 service 251
## 66 D 60 service 105
## 67 D 65 service 288
## 68 D 65 service 192
## 69 D 70 service 349
## 70 D 70 service 1208
## 71 D 75 service 0
## 72 D 75 service 2051
## 73 E 60 service 45
## 74 E 60 service 0
## 75 E 65 service 789
## 76 E 65 service 437
## 77 E 70 service 1157
## 78 E 70 service 2161
## 79 E 75 service 0
## 80 E 75 service 542
## 81 A 60 incidents 0
## 82 A 60 incidents 0
## 83 A 65 incidents 3
## 84 A 65 incidents 4
## 85 A 70 incidents 6
## 86 A 70 incidents 18
## 87 A 75 incidents 0
## 88 A 75 incidents 11
## 89 B 60 incidents 39
## 90 B 60 incidents 29
## 91 B 65 incidents 58
## 92 B 65 incidents 53
## 93 B 70 incidents 12
## 94 B 70 incidents 44
## 95 B 75 incidents 0
## 96 B 75 incidents 18
## 97 C 60 incidents 1
## 98 C 60 incidents 1
## 99 C 65 incidents 0
## 100 C 65 incidents 1
## 101 C 70 incidents 6
## 102 C 70 incidents 2
## 103 C 75 incidents 0
## 104 C 75 incidents 1
## 105 D 60 incidents 0
## 106 D 60 incidents 0
## 107 D 65 incidents 0
## 108 D 65 incidents 0
## 109 D 70 incidents 2
## 110 D 70 incidents 11
## 111 D 75 incidents 0
## 112 D 75 incidents 4
## 113 E 60 incidents 0
## 114 E 60 incidents 0
## 115 E 65 incidents 7
## 116 E 65 incidents 7
## 117 E 70 incidents 5
## 118 E 70 incidents 12
## 119 E 75 incidents 0
## 120 E 75 incidents 1
We can cast the molten data into a new form where the aggregate of
each type of ship for each year is created. It is done using the
cast() function.
library(reshape)
recasted.ship <- cast(molten.ships, type+year~variable,sum)
print(recasted.ship)
## type year period service incidents
## 1 A 60 135 190 0
## 2 A 65 135 2190 7
## 3 A 70 135 4865 24
## 4 A 75 135 2244 11
## 5 B 60 135 62058 68
## 6 B 65 135 48979 111
## 7 B 70 135 20163 56
## 8 B 75 135 7117 18
## 9 C 60 135 1731 2
## 10 C 65 135 1457 1
## 11 C 70 135 2731 8
## 12 C 75 135 274 1
## 13 D 60 135 356 0
## 14 D 65 135 480 0
## 15 D 70 135 1557 13
## 16 D 75 135 2051 4
## 17 E 60 135 45 0
## 18 E 65 135 1226 14
## 19 E 70 135 3318 17
## 20 E 75 135 542 1
# Create a sample data frame
city <- c("Bogor", "Jakarta", "Tangerang", "Bandung")
precip <- c(200, 130, 100, 90)
temp <- c(27, 30, 30, 23)
data <- data.frame(city, precip, temp)
# Print the original data frame
cat("# # # Original Data Frame\n")
## # # # Original Data Frame
print(data)
## city precip temp
## 1 Bogor 200 27
## 2 Jakarta 130 30
## 3 Tangerang 100 30
## 4 Bandung 90 23
# Melt the data frame from wide to long format
melted_data <- melt(data, id.vars = "city", variable.name = "variable", value.name = "value")
# Print the melted data frame
cat("# # # Melted Data Frame\n")
## # # # Melted Data Frame
print(melted_data)
## city variable value
## 1 Bogor precip 200
## 2 Jakarta precip 130
## 3 Tangerang precip 100
## 4 Bandung precip 90
## 5 Bogor temp 27
## 6 Jakarta temp 30
## 7 Tangerang temp 30
## 8 Bandung temp 23
# Cast the melted data back to wide format
casted_data <- dcast(melted_data, city ~ variable, value.var = "value")
# Print the casted data frame
cat("# # # Casted Data Frame\n")
## # # # Casted Data Frame
print(casted_data)
## city precip temp
## 1 Bandung 90 23
## 2 Bogor 200 27
## 3 Jakarta 130 30
## 4 Tangerang 100 30
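The same round trip can be approximated without extra packages. The sketch below is a base R alternative, not the reshape2 method used above: stack() turns the value columns into long format, and tapply() spreads them back into a city-by-variable table. The names long and wide are illustrative.

```r
data <- data.frame(city = c("Bogor", "Jakarta", "Tangerang", "Bandung"),
                   precip = c(200, 130, 100, 90),
                   temp = c(27, 30, 30, 23))

# Wide to long: stack() returns the columns "values" and "ind" (the source column name).
long <- data.frame(city = rep(data$city, times = 2),
                   stack(data[, c("precip", "temp")]))
print(long)

# Long to wide: one row per city, one column per variable.
# sum() is the aggregate, as in cast(); each cell holds a single value here.
wide <- tapply(long$values, list(long$city, long$ind), sum)
print(wide)
```

Note that tapply() returns a matrix with the cities as row names, sorted alphabetically, rather than a data frame.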
Key points about using data visualization in climate modeling with RStudio:
Enhanced Understanding: Data visualization in RStudio enables climate modelers to grasp complex climate data more effectively, uncover trends, and recognize patterns crucial for informed decision-making.
Diverse Visualizations: RStudio offers a wide range of plotting functions and packages, allowing modelers to create diverse visualizations such as line charts, scatter plots, heatmaps, and interactive graphs.
Temporal Insights: Time series plots created in RStudio enable researchers to showcase the dynamic changes in temperature, precipitation, and other climate variables over time.
Customization: RStudio’s flexibility empowers modelers to customize visuals by adding labels, legends, and annotations, enhancing the interpretation of model results.
Interactive Exploration: Packages like plotly, which can also turn ggplot2 charts into interactive ones, enable researchers to explore various climate scenarios and gain deeper insights into complex data.
Communication: Compelling visualizations generated using RStudio serve as effective communication tools, helping researchers convey findings and contribute to the broader understanding of climate processes.
Impactful Decisions: Visualizing climate model outcomes aids in unraveling intricate climate interactions, empowering decision-makers to address climate change impacts with more precision.
Scientific Advancement: The synergy between data visualization and RStudio accelerates scientific advancement in climate research by fostering data-driven discoveries and innovative insights.
In essence, the combination of data visualization and RStudio is a powerful toolset that enables climate modelers to distill complex information into clear, impactful visuals, advancing our comprehension of the ever-changing climate.
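In R the pie chart is created using the pie() function,
which takes a vector of non-negative numbers as input. The additional
parameters are used to control labels, color, title etc.
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
- x is a vector containing the numeric values used in the pie chart.
- labels gives the descriptions of the slices.
- radius is the radius of the pie.
- main is the title of the chart.
- col is the color palette for the slices.
- clockwise is a logical value indicating whether the slices are drawn clockwise or anti-clockwise.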
library(ggplot2)
# Create data for the graph.
tree_cover_mha <- c(5.3, 32, 14, 8)
labels <- c("Riau", "Kalimantan", "Papua", "NTT")
# Calculate percentages
piepercent <- round(100 * tree_cover_mha / sum(tree_cover_mha), 1)
# Create the pie chart
pie(tree_cover_mha, labels = paste(labels, "\n", piepercent, "%"), main = "Tree Cover", col = rainbow(length(tree_cover_mha)))
# Add legend
legend("topright", labels, cex = 0.8, fill = rainbow(length(tree_cover_mha)))
R uses the function barplot() to create bar charts.
Basic syntax is :
barplot(H,xlab,ylab,main, names.arg,col)
# Create data for the graph.
tree_cover_mha <- c(5.3, 32, 14, 8)
labels <- c("Riau", "Kalimantan", "Papua", "NTT")
# Create a bar plot with ordered data
# color_gradient <- colorRampPalette(c("yellow", "red"))
barplot(tree_cover_mha,
names.arg = labels,
ylab = "Tree Cover Loss (MHa)",
xlab = "Location")
# Create the data for the chart
precip <- c(7,12,28,3,41)
month <- c("Mar","Apr","May","Jun","Jul")
# Plot the bar chart
barplot(precip,names.arg=month,xlab="Month",ylab="Precipitation (mm)",col="blue",
main="Bogor 2022",border="red")
We can create a bar chart with groups of bars, or with stacks in each bar, by using a matrix as the input values.
# Data
fires <- c(1, 2, 5, 4, 7)
tc_loss <- c(12, 10, 9, 20, 4)
others <- c(0.5, 0.9, 1, 1.5, 2)
locations <- c("Location 1", "Location 2", "Location 3", "Location 4", "Location 5")
# Combine data into a matrix with one row per category and one column per location
data_matrix <- rbind(fires, tc_loss, others)
# Create a grouped bar plot (set beside = FALSE for stacked bars)
barplot(data_matrix, beside = TRUE, col = c("red", "blue", "green"),
names.arg = locations,
legend.text = c("Fires", "Tree Cover Loss", "Others"),
args.legend = list(x = "topright"))
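For comparison, the stacked version of the same chart only needs beside = FALSE. The sketch below builds the matrix with rbind(), so that each row is a category and each column a location, and then checks the bar heights with colSums(); the name mids is illustrative.

```r
fires <- c(1, 2, 5, 4, 7)
tc_loss <- c(12, 10, 9, 20, 4)
others <- c(0.5, 0.9, 1, 1.5, 2)
locations <- paste("Location", 1:5)

# One row per category, one column per location.
data_matrix <- rbind(fires, tc_loss, others)

# beside = FALSE (the default) stacks the categories within each bar.
mids <- barplot(data_matrix, beside = FALSE, col = c("red", "blue", "green"),
                names.arg = locations,
                legend.text = c("Fires", "Tree Cover Loss", "Others"),
                args.legend = list(x = "topright"))

# The total height of each stacked bar is the column sum.
print(colSums(data_matrix))
```

barplot() returns the x positions of the bar midpoints, which is handy when adding text or points on top of the bars.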
library(tidyverse)
# Data
fires <- c(1, 2, 5, 4, 7)
tc_loss <- c(12, 10, 9, 20, 4)
others <- c(0.5, 0.9, 1, 1.5, 2)
locations <- c("Location 1", "Location 2", "Location 3", "Location 4", "Location 5")
# Create a data frame
data <- data.frame(fires = fires, tc_loss = tc_loss, others = others, locations = locations)
# Reshape data to long format
data_long <- data %>%
pivot_longer(cols = c(fires, tc_loss, others), names_to = "category", values_to = "value")
# Create a stacked bar plot using ggplot2
ggplot(data_long, aes(x = locations, y = value, fill = category)) +
geom_bar(stat = "identity") +
labs(x = "Locations", y = "Value", title = "Stacked Bar Plot") +
scale_fill_manual(values = c(fires = "red", tc_loss = "blue", others = "green")) +
theme_minimal() +
theme(legend.position = "right")
A boxplot shows how the values in a data set are distributed. It marks the three quartiles that split the data into four parts, and represents the minimum, maximum, median, first quartile and third quartile of the data set. It is also useful for comparing the distribution of data across data sets by drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot()
function.
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used :
- x is a vector or a formula.
- data is the data frame.
- notch is a logical value. Set to TRUE to draw a notch.
- varwidth is a logical value. Set to TRUE to draw the box widths proportional to the square roots of the group sizes.
- names are the group labels which will be printed under each boxplot.
- main is used to give a title to the graph.
input <- mtcars[,c('mpg','cyl')]
print(head(input))
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon",
main = "Mileage Data",
notch = TRUE,
varwidth = TRUE,
col = c("green","yellow","purple"),
names = c("High","Medium","Low")
)
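The numbers behind the boxes can be inspected directly. As a minimal sketch, calling boxplot() with plot = FALSE returns the statistics without drawing anything; the names box_stats and medians are illustrative.

```r
# Compute the boxplot statistics without drawing the chart.
box_stats <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# One column per group; the rows are lower whisker, lower hinge,
# median, upper hinge and upper whisker.
print(box_stats$stats)

# Number of observations in each group.
print(box_stats$n)

# The middle row matches the group medians computed directly.
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(medians)
```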
The plot() function in R is used to create the line
graph.
plot(v,type,col,xlab,ylab)
precip <- c(5,12,4,2,1,3,8)
plot(precip, type = "l", ylab="mm", xlab="month", col = "red")
# You can also try different styles of line graph by changing the type argument.
par(mfrow = c(2, 3))
plot(precip, type = "l", main = "type = 'l'")
plot(precip, type = "s", main = "type = 's'")
plot(precip, type = "p", main = "type = 'p'")
plot(precip, type = "o", main = "type = 'o'")
plot(precip, type = "b", main = "type = 'b'")
plot(precip, type = "h", main = "type = 'h'")
More than one line can be drawn on the same chart by using the
lines() function.
bogor <- c(20,15,18, 8,6,4,3,10,13,20)
jakarta <- c(15,21,8, 7,12,1,4,5,10,13)
plot(y = bogor, x= c(1:10),
type = "l",
ylab= "Precipitation (mm)",
xlab = "Month",
ylim = c(0,30),
xlim = c(1,10),
col= "blue",
main = "Precipitation 2020")
# Add lines
lines(jakarta, type = "l", col = "red", lty = 2)
# Add legends
legend("topright", legend = c("Bogor", "Jakarta"), col = c("blue", "red"), lty = c(1,2))
Note: For more details about plotting in R, see the linked page "Plot in R".
Basic statistics in R can provide valuable insights into climate data, enabling us to understand key characteristics and relationships within the dataset. By employing statistical measures like mean, median, mode, quantiles, and percentiles, we can grasp the central tendency, variability, and distribution of variables such as temperature, rainfall, and humidity. These measures help us to identify common patterns, assess data spread, and recognize extreme values.
Furthermore, normality tests such as the Shapiro-Wilk test can assess whether our climate data follow a normal distribution. Linear regression allows us to explore the relationship between two variables, like temperature and rainfall, revealing potential correlations or trends. Multiple regression extends this analysis by considering several predictor variables, such as temperature and humidity, and their combined influence on rainfall.
Time series analysis is crucial in climate studies, enabling us to visualize how temperature or other variables change over time. This analysis helps us spot trends, seasonality, and potential cycles in the data. ANOVA (Analysis of Variance) assists in understanding the impact of categorical variables, like different climate zones, on continuous variables such as temperature.
Finally, significance tests, like t-tests, help determine if observed differences in, for example, temperature between two groups, are statistically significant. These statistical techniques collectively provide a solid foundation for extracting meaningful information from climate data, contributing to a better understanding of climatic patterns and changes.
# Create the data (you can also load your own dataset instead)
temp <- c(25, 28, 26, 22, 30, 29, 27, 24, 23, 26)
precip <- c(10, 15, 8, 5, 20, 18, 12, 6, 7, 9)
rh <- c(70, 75, 72, 68, 80, 78, 73, 69, 71, 74)
type_clim <- c("Tropis", "Tropis", "Subtropis", "Tropis", "Subtropis", "Subtropis", "Tropis", "Tropis", "Subtropis", "Tropis")
group_clim <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")
# combine datasets into data frame
iklim <- data.frame(temp, precip, rh, type_clim, group_clim)
print(iklim)
## temp precip rh type_clim group_clim
## 1 25 10 70 Tropis A
## 2 28 15 75 Tropis B
## 3 26 8 72 Subtropis A
## 4 22 5 68 Tropis B
## 5 30 20 80 Subtropis A
## 6 29 18 78 Subtropis B
## 7 27 12 73 Tropis A
## 8 24 6 69 Tropis B
## 9 23 7 71 Subtropis A
## 10 26 9 74 Tropis B
# Mean, median, mode (modus), quartiles, 90th percentile, and Shapiro-Wilk normality test
mean_temperature <- mean(iklim$temp)
median_temperature <- median(iklim$temp)
modus_temperature <- as.numeric(names(sort(table(iklim$temp), decreasing = TRUE)[1]))
quartiles <- quantile(iklim$temp, probs = c(0.25, 0.5, 0.75))
percentile_90 <- quantile(iklim$precip, probs = 0.9)
shapiro_test <- shapiro.test(iklim$temp)
# Print the results
cat("Mean Temperature:", mean_temperature, "\n")
## Mean Temperature: 26
cat("Median Temperature:", median_temperature, "\n")
## Median Temperature: 26
cat("Modus Temperature:", modus_temperature, "\n")
## Modus Temperature: 26
cat("Quartiles:", quartiles, "\n")
## Quartiles: 24.25 26 27.75
cat("90th Percentile of Rainfall:", percentile_90, "\n")
## 90th Percentile of Rainfall: 18.2
cat("Shapiro-Wilk Normality Test p-value:", shapiro_test$p.value, "\n\n")
## Shapiro-Wilk Normality Test p-value: 0.9629413
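The mode calculation above keeps only the first entry of the sorted frequency table, so it reports a single value even when several values are tied for the highest count. The following is a minimal sketch of an alternative, assuming a hypothetical helper named get_modes() (not part of base R or of the original example) that returns every tied value.

```r
# Hypothetical helper: return every value that occurs most often
get_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Same temperature values as in the example above: 26 is the only mode
temp <- c(25, 28, 26, 22, 30, 29, 27, 24, 23, 26)
modes_temp <- get_modes(temp)

# Illustrative vector in which 1 and 2 are tied for the highest count
modes_tied <- get_modes(c(1, 1, 2, 2, 3))

print(modes_temp)
print(modes_tied)
```

For continuous measurements where most values are unique, a table-based mode is of limited use; binning the data or inspecting a histogram is usually more informative.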
Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. It aims to find the best-fitting straight line that minimizes the differences between the observed data points and the predicted values. This line represents the linear relationship between variables, allowing us to make predictions or understand the effect of changes in independent variables on the dependent variable.
In a simple linear regression, there is one independent variable, and the relationship is represented by a straight line equation (y = mx + b). The goal is to find the slope (m) and intercept (b) that best fits the data.
# Linear Regression
linear_model <- lm(precip ~ temp, data = iklim)
summary(linear_model)
##
## Call:
## lm(formula = precip ~ temp, data = iklim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0000 -1.1458 0.5583 1.4375 1.6500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.9667 5.9817 -6.347 0.000221 ***
## temp 1.8833 0.2291 8.222 3.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.774 on 8 degrees of freedom
## Multiple R-squared: 0.8942, Adjusted R-squared: 0.881
## F-statistic: 67.61 on 1 and 8 DF, p-value: 3.583e-05
# Creating a scatter plot with linear regression line and points
plot(temp, precip, type = "p", pch = 16, col = "black",
ylim = c(0,20),
main = "Scatter Plot with Linear Regression Line")
# Adding the linear regression line
abline(linear_model, col = "blue")
# Adding a legend
legend("topleft", legend=c("Data Points", "Linear Regression Line"),
pch = c(16, NA), lty= c(NA,1), col=c("black", "blue"), cex=0.8)
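The slope (m) and intercept (b) of a simple linear regression can also be computed by hand, which shows what lm() estimates. The following is a minimal sketch, assuming the same temp and precip values as above; the names slope, intercept, and fit are illustrative. The least-squares slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the point of means.

```r
temp   <- c(25, 28, 26, 22, 30, 29, 27, 24, 23, 26)
precip <- c(10, 15, 8, 5, 20, 18, 12, 6, 7, 9)

# m = cov(x, y) / var(x)
slope <- cov(temp, precip) / var(temp)

# b = mean(y) - m * mean(x)
intercept <- mean(precip) - slope * mean(temp)

# Compare with the coefficients estimated by lm()
fit <- lm(precip ~ temp)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both approaches give a slope of about 1.8833 and an intercept of about -37.9667, matching the summary output above.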
Multiple linear regression extends the concept of linear regression by accommodating multiple independent variables that influence a single dependent variable. It aims to establish a linear equation that best represents the relationship between the dependent variable and the multiple predictors.
The equation takes the form y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ, where y is the dependent variable, x₁, x₂, …, xₙ are the independent variables, and b₀, b₁, b₂, …, bₙ are the coefficients that represent the influence of each predictor while considering others.
Multiple linear regression helps us understand the combined impact of different factors on the dependent variable, making it a powerful tool in analyzing complex relationships and making predictions based on multiple input variables.
# Multiple Regression
multiple_model <- lm(precip ~ temp + rh, data = iklim)
summary(multiple_model)
##
## Call:
## lm(formula = precip ~ temp + rh, data = iklim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5976 -0.3989 0.4827 0.8318 1.8394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -59.8374 16.0283 -3.733 0.00733 **
## temp 1.0467 0.6132 1.707 0.13157
## rh 0.5976 0.4103 1.456 0.18864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.662 on 7 degrees of freedom
## Multiple R-squared: 0.9188, Adjusted R-squared: 0.8956
## F-statistic: 39.6 on 2 and 7 DF, p-value: 0.0001526
# Predict precip with regression model
predict_precip <- predict(multiple_model)
predict_precip
## 1 2 3 4 5 6 7 8
## 8.160569 14.288618 10.402439 3.825203 19.369919 17.128049 12.046748 6.516260
## 9 10
## 6.664634 11.597561
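The fitted values returned by predict() are the regression equation y = b₀ + b₁x₁ + b₂x₂ evaluated at the observed predictors. The following is a minimal sketch, assuming the same data as above, that applies the estimated coefficients by hand and then predicts precipitation for a new observation. The values in new_obs (temp = 27, rh = 74) are hypothetical and chosen only for illustration.

```r
temp   <- c(25, 28, 26, 22, 30, 29, 27, 24, 23, 26)
precip <- c(10, 15, 8, 5, 20, 18, 12, 6, 7, 9)
rh     <- c(70, 75, 72, 68, 80, 78, 73, 69, 71, 74)
iklim  <- data.frame(temp, precip, rh)

multiple_model <- lm(precip ~ temp + rh, data = iklim)
b <- coef(multiple_model)

# Apply y = b0 + b1 * temp + b2 * rh by hand
manual_fit <- b["(Intercept)"] + b["temp"] * iklim$temp + b["rh"] * iklim$rh

# Largest difference between the manual values and predict(); should be ~0
max_abs_diff <- max(abs(manual_fit - predict(multiple_model)))
print(max_abs_diff)

# Prediction for a new, hypothetical observation
new_obs <- data.frame(temp = 27, rh = 74)
new_pred <- predict(multiple_model, newdata = new_obs)
print(new_pred)
```

Predictions are most trustworthy for predictor values inside the range of the data used to fit the model.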
# Correlation of the predicted precipitation with the observed precipitation and with each predictor
correlation_df <- data.frame(
predicted_obs = cor(precip,predict_precip),
predicted_rh = cor(predict_precip, rh),
predicted_temp = cor(predict_precip, temp),
stringsAsFactors = FALSE
)
correlation_df
## predicted_obs predicted_rh predicted_temp
## 1 0.958537 0.9814305 0.986519
# Creating a scatter plot with multiple regression fitted values
plot(y= precip,x= c(1:10),
type = "p",
col = "red",
ylim = c(0,20),
ylab= "Precipitation",
xlab= "month")
lines(predict_precip,type="l", col = "blue")
ANOVA, or Analysis of Variance, is a statistical technique used to analyze the variation between group means in a dataset. It determines whether there are significant differences among the means of two or more groups, and is most often used with three or more; with exactly two groups it is equivalent to a pooled-variance t-test. ANOVA compares the variability within groups (due to random chance) to the variability between groups (due to the factors being studied). By calculating the F-statistic, ANOVA helps us decide whether the group means are significantly different from each other.
ANOVA is commonly used when comparing means across different treatments or categories, such as in experimental designs or comparing multiple groups in observational studies. It provides insights into whether observed differences are due to actual effects or mere chance, helping researchers make informed decisions about the impact of different factors.
# ANOVA
anova_result <- aov(temp ~ type_clim, data = iklim)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## type_clim 1 6.67 6.667 1 0.347
## Residuals 8 53.33 6.667
Let’s try another example.
# Simulated temperature data for three locations (A, B, C) over several months
location <- rep(c("A", "B", "C"), each = 10)
temperature <- c(23.5, 24.8, 25.2, 22.6, 22.9, 24.0, 25.1, 25.5, 24.7, 23.8,
26.3, 27.6, 26.9, 27.1, 26.0, 23.7, 22.8, 24.5, 23.4, 22.0,
28.5, 29.2, 28.0, 27.6, 26.8, 25.4, 24.9, 23.3, 22.5, 24.2)
data <- data.frame(location, temperature)
data
## location temperature
## 1 A 23.5
## 2 A 24.8
## 3 A 25.2
## 4 A 22.6
## 5 A 22.9
## 6 A 24.0
## 7 A 25.1
## 8 A 25.5
## 9 A 24.7
## 10 A 23.8
## 11 B 26.3
## 12 B 27.6
## 13 B 26.9
## 14 B 27.1
## 15 B 26.0
## 16 B 23.7
## 17 B 22.8
## 18 B 24.5
## 19 B 23.4
## 20 B 22.0
## 21 C 28.5
## 22 C 29.2
## 23 C 28.0
## 24 C 27.6
## 25 C 26.8
## 26 C 25.4
## 27 C 24.9
## 28 C 23.3
## 29 C 22.5
## 30 C 24.2
# Load libraries (aov() itself is part of base R's stats package and does not require them)
library(dplyr)
library(ggplot2)
# Perform ANOVA
anova_result <- aov(temperature ~ location, data = data)
# Display ANOVA summary
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## location 2 16.80 8.402 2.443 0.106
## Residuals 27 92.87 3.440
Interpretation of the ANOVA result
In the ANOVA performed on the simulated temperature data from three locations (A, B, and C), the F-statistic was 2.443 on 2 and 27 degrees of freedom, with an associated p-value of 0.106. The critical value of F at a significance level of α = 0.05 for these degrees of freedom is about 3.35.
Comparing the p-value to the significance level (α), we observe that the p-value (0.106) is greater than 0.05; equivalently, the F-statistic (2.443) is below the critical value. The differences in mean temperature among the three locations are therefore not statistically significant at the 5% level.
Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the location means are equal. This does not prove that the means are identical; it only means that, with 10 observations per location, the data do not provide strong enough evidence of a difference.
In conclusion, based on this ANOVA, we cannot assert that the mean temperatures at locations A, B, and C differ. Post-hoc tests (for example, Tukey's HSD) are normally used to find which specific pairs of locations differ only after the overall F-test is significant, so they are not needed here.
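The F-statistic, p-value, and critical value can be read from the fitted model rather than copied by hand. The following is a minimal sketch, assuming the same simulated data as above; the object names (temp_data, fit, anova_table, and so on) are illustrative. It extracts the statistics from summary() and computes the critical value with qf().

```r
location <- rep(c("A", "B", "C"), each = 10)
temperature <- c(23.5, 24.8, 25.2, 22.6, 22.9, 24.0, 25.1, 25.5, 24.7, 23.8,
                 26.3, 27.6, 26.9, 27.1, 26.0, 23.7, 22.8, 24.5, 23.4, 22.0,
                 28.5, 29.2, 28.0, 27.6, 26.8, 25.4, 24.9, 23.3, 22.5, 24.2)
temp_data <- data.frame(location, temperature)

fit <- aov(temperature ~ location, data = temp_data)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# Critical F value at alpha = 0.05 for the model's degrees of freedom
alpha <- 0.05
f_critical <- qf(1 - alpha,
                 df1 = anova_table[["Df"]][1],
                 df2 = anova_table[["Df"]][2])

# Reject the null hypothesis of equal means only if p < alpha
reject_null <- p_value < alpha

print(c(F = f_value, p = p_value, F_critical = f_critical))
print(reject_null)
```

For these data the F-statistic is below the critical value and the p-value is above 0.05, so reject_null is FALSE.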
A significance test, also known as a hypothesis test, is a statistical method used to determine whether the observed results in a dataset are statistically significant or if they could likely occur due to random chance. It involves formulating a null hypothesis (H₀), which assumes no effect or no difference, and an alternative hypothesis (H₁), which suggests a specific effect or difference.
The test calculates a p-value, which indicates the probability of obtaining results as extreme as or more extreme than the observed results, assuming that the null hypothesis is true. If the p-value is below a predefined significance level (often denoted as α), typically 0.05, researchers reject the null hypothesis in favor of the alternative hypothesis, suggesting that the observed effect is statistically significant.
Significance tests play a crucial role in research by helping researchers make decisions based on data, determine the validity of hypotheses, and assess the reliability of findings. Common significance tests include t-tests, chi-square tests, and ANOVA, among others.
# Significance Test
t_test_result <- t.test(temp ~ group_clim, data = iklim)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: temp by group_clim
## t = 0.23171, df = 7.9197, p-value = 0.8226
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -3.587814 4.387814
## sample estimates:
## mean in group A mean in group B
## 26.2 25.8
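The statistics reported by the Welch two-sample t-test can be reproduced from the group means and variances. The following is a minimal sketch, assuming the same temp and group_clim values as above; the intermediate names (temp_a, va, df_welch, and so on) are illustrative. It computes the t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with t.test().

```r
temp <- c(25, 28, 26, 22, 30, 29, 27, 24, 23, 26)
group_clim <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")

temp_a <- temp[group_clim == "A"]
temp_b <- temp[group_clim == "B"]

# Per-group variance of the mean: s^2 / n
va <- var(temp_a) / length(temp_a)
vb <- var(temp_b) / length(temp_b)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(temp_a) - mean(temp_b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (va + vb)^2 /
  (va^2 / (length(temp_a) - 1) + vb^2 / (length(temp_b) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_welch)

# Built-in test for comparison
t_builtin <- t.test(temp ~ group_clim, data = data.frame(temp, group_clim))

print(c(t = t_manual, df = df_welch, p = p_manual))
print(unname(t_builtin$statistic))
```

The manual values agree with the output above (t ≈ 0.2317, df ≈ 7.92, p ≈ 0.82), so the difference between groups A and B is not statistically significant at α = 0.05.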