This article is an elementary introduction for people who intend to use R. My motto here is ‘Learning is FUN!’ Let’s get started with a basic introduction to R.
The starting point is the R website at www.r-project.org. Find the CRAN (Comprehensive R Archive Network) link under the Download section. If you click on the CRAN link, you will be shown a list of network servers all over the planet. You can choose https://cran.cnr.berkeley.edu/. Then select the appropriate download for Windows or Mac OS X. Windows users have to select ‘install R for the first time’ under the base section. The next prompt depends on the browser you are using. I usually use Google Chrome, where the download usually appears in the lower-left corner of the screen. If you are using Internet Explorer, select “Run”, then select “Yes” to allow changes to be made. Follow the prompts on the next few screens; I go with the defaults. After the installation is complete, install RStudio.
Once R is installed, go to https://www.rstudio.com to install the free version of RStudio. Scroll to the middle of the page and select ‘DOWNLOAD RSTUDIO DESKTOP’ under the column ‘Open Source Edition’.
When working in R, keep in mind that there are often several ways to accomplish a single task, and the built-in help is the quickest way to explore them. If you want to view the data in a figure, for example, you can look up boxplot. All you have to do is type ‘?boxplot’ or help(boxplot). A help window opens, showing a document with the headings Description, Usage, Arguments, Details, Value, References, See Also, and Examples. www.r-project.org is another important place to start if you are stuck with a problem.
The getwd() and setwd() commands identify the current working directory and set a new working directory, respectively. To find the default working directory, type getwd().
getwd()
## [1] "/Users/mominul/OneDrive - South Dakota State University - SDSU/R Code/R DATA LOADING"
My current directory is shown above. Another useful command is q(), which quits R. Before it exits, R will ask whether it should save the workspace. If you decide to save it, we strongly advise that you do not save it in its default directory. Doing so will cause R to load all your results automatically when it is restarted. To avoid R asking whether it should save your data, use
q(save = "no")
To save a workspace, click File -> Save Workspace. To load an existing workspace, use File -> Load Workspace. If you want to begin a new analysis on a different dataset, it may be useful to remove all variables. One option is to quit R and restart it. Alternatively, click Misc -> Remove all objects. This will execute the command
rm(list = ls(all = TRUE))
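Removing everything is not always what you want. As a minimal sketch, the workspace can also be inspected and cleaned one object at a time. The object names x_demo and y_demo are hypothetical and exist only for this illustration.

```r
# Two throwaway objects (hypothetical names, used only for this illustration)
x_demo <- 1
y_demo <- 2

ls()               # list the objects currently in the workspace
rm(x_demo)         # remove a single object instead of everything
exists("x_demo")   # check whether the object is still there
exists("y_demo")   # the other object is untouched
```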
Let’s get started! We use a dataset (unpublished data, Chris Elphick, University of Connecticut) containing seven body measurements taken from approximately 1100 saltmarsh sharp-tailed sparrows (Ammodramus caudacutus). For our purposes we use only four morphometric variables of eight birds.
#simplest way to store results in each variable
a <- 59
b <- 55
c <- 53.5
d <- 55
e <- 52.5
Alternatively, the ‘‘=’’ symbol can be used instead of ‘‘<-’’. When naming your variables, avoid uninformative names such as a, b, c, and d; use names related to the data instead.
To concatenate data, we can use the c() function.
#Concatenating Data with the c Function
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55) # c function has created a single vector of length 8
Tarsus <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6,21.5)
Head <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5,NA)
Wt <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6,15.7)
Wingcrd[1] # gives result 59
## [1] 59
Wingcrd[1:5] # view the first five values
## [1] 59.0 55.0 53.5 55.0 52.5
Wingcrd[-2] # view all except the second value
## [1] 59.0 53.5 55.0 52.5 57.5 53.0 55.0
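Besides single positions, ranges, and negative indices, square brackets also accept a vector of positions or a logical condition. The following is a small sketch that reuses the Wingcrd values from above.

```r
# Same wing measurements as above
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)

Wingcrd[c(1, 3, 8)]    # first, third, and eighth values
Wingcrd[Wingcrd > 55]  # logical index: only the wing lengths above 55
length(Wingcrd)        # number of values in the vector
```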
Note that there is one bird for which the size of the head was not measured. It is indicated by NA. Depending on the function, the presence of an NA may, or may not, cause trouble. By default, the ‘‘na.rm = FALSE’’ option causes the R function sum to return an NA if there is a missing value in the vector (rm refers to remove). To avoid this, use ‘‘na.rm = TRUE’’.
sum(Head)
## [1] NA
sum(Head, na.rm = TRUE)
## [1] 216.1
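The same option is available in other summary functions. As a short sketch, is.na can be used to locate and count the missing values, and mean accepts na.rm in the same way as sum.

```r
# Same head measurements as above, including the missing value
Head <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)

is.na(Head)               # TRUE marks the missing measurement
sum(is.na(Head))          # number of missing values
which(is.na(Head))        # position of the missing value
mean(Head, na.rm = TRUE)  # mean of the birds that were measured
```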
#repeat number using 'rep' function
Id <- rep(1 : 4, each = 8)
Id
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
#repeat numbers using 'seq' function
a <- seq(from = 1, to = 4, by = 1)
rep(a, each = 8)
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
The above chunk repeats the values from 1 to 4, each eight times. You can also use the seq function to generate a sequence.
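A few more variations, as a sketch: rep can repeat the whole sequence rather than each value, and seq can use a step other than 1 or a fixed number of points.

```r
rep(1:4, times = 2)                    # whole sequence repeated: 1 2 3 4 1 2 3 4
rep(1:4, each = 2)                     # each value repeated:     1 1 2 2 3 3 4 4
seq(from = 1, to = 4, by = 0.5)        # steps of 0.5 from 1 to 4
seq(from = 1, to = 4, length.out = 4)  # four equally spaced values from 1 to 4
```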
# Combine the four vectors column-wise into a matrix with the 'cbind' function
Z <- cbind(Wingcrd, Tarsus, Head, Wt)
Z[,1] #data in the first column
## [1] 59.0 55.0 53.5 55.0 52.5 57.5 53.0 55.0
Z[1 : 8, 1] # same output as before
## [1] 59.0 55.0 53.5 55.0 52.5 57.5 53.0 55.0
Z [2,] #second row data
## Wingcrd Tarsus Head Wt
## 55.0 19.7 30.4 13.8
Z[1, 1] # accesses the value of the first bird for Wingcrd
## Wingcrd
## 59
Z[, 2 : 3] #gives all the data for columns 2 and 3
## Tarsus Head
## [1,] 22.3 31.2
## [2,] 19.7 30.4
## [3,] 20.8 30.6
## [4,] 20.3 30.3
## [5,] 20.8 30.3
## [6,] 21.5 30.8
## [7,] 20.6 32.5
## [8,] 21.5 NA
X <- Z[4, 4] #X contains the weight for bird 4
Y <- Z[, 4] # all the Wt data
W <- Z[, -3] # minus sign is used to exclude columns or rows
D <- Z[, c(1, 3, 4)]
E <- Z[, c(-1, -3)]
# if you only need to store the number of rows in Z
zrow <- dim(Z)[1]
Z2 <- rbind(Wingcrd, Tarsus, Head, Wt) # combines the data in rows
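As a quick check of how the column-wise and row-wise versions relate, the sketch below compares their dimensions. It rebuilds the vectors so that it runs on its own.

```r
# Same four variables as above
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
Tarsus  <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)
Head    <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)
Wt      <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)

Z  <- cbind(Wingcrd, Tarsus, Head, Wt)  # variables as columns
Z2 <- rbind(Wingcrd, Tarsus, Head, Wt)  # variables as rows

dim(Z)               # rows and columns of Z
dim(Z2)              # rows and columns of Z2
nrow(Z)              # shortcut for dim(Z)[1]
ncol(Z)              # shortcut for dim(Z)[2]
all.equal(t(Z), Z2)  # compares the transpose of Z with Z2
```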
We can create a matrix of dimension 8 by 4 that contains the data.
##create a matrix of dimension 8 by 4
Dmat <- matrix(nrow = 8, ncol = 4)
Dmat[, 1] <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
Dmat[, 2] <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5,20.6, 21.5)
Dmat[, 3] <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)
Dmat[, 4] <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6,15.6, 15.7)
Dmat
## [,1] [,2] [,3] [,4]
## [1,] 59.0 22.3 31.2 9.5
## [2,] 55.0 19.7 30.4 13.8
## [3,] 53.5 20.8 30.6 14.8
## [4,] 55.0 20.3 30.3 15.2
## [5,] 52.5 20.8 30.3 15.5
## [6,] 57.5 21.5 30.8 15.6
## [7,] 53.0 20.6 32.5 15.6
## [8,] 55.0 21.5 NA 15.7
The elements of Dmat, in this case, are entered by column, but we could have filled them in by row. Typing Dmat into R gives the same data matrix as we obtained with the cbind function, except that Dmat does not have column labels.
We can use the existing colnames function to add column names to Dmat.
#use the existing colnames function to add column names to Dmat
colnames(Dmat) <- c("Wingcrd", "Tarsus", "Head","Wt")
Dmat
## Wingcrd Tarsus Head Wt
## [1,] 59.0 22.3 31.2 9.5
## [2,] 55.0 19.7 30.4 13.8
## [3,] 53.5 20.8 30.6 14.8
## [4,] 55.0 20.3 30.3 15.2
## [5,] 52.5 20.8 30.3 15.5
## [6,] 57.5 21.5 30.8 15.6
## [7,] 53.0 20.6 32.5 15.6
## [8,] 55.0 21.5 NA 15.7
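The remark above that the matrix could also have been filled in by row can be sketched with the byrow argument of matrix. For brevity this example uses only the first two birds; the name Drow is hypothetical.

```r
# Measurements of the first two birds, written bird by bird
# (order within each bird: Wingcrd, Tarsus, Head, Wt)
vals <- c(59, 22.3, 31.2,  9.5,
          55, 19.7, 30.4, 13.8)

# byrow = TRUE fills the matrix one row at a time instead of one column at a time
Drow <- matrix(vals, nrow = 2, ncol = 4, byrow = TRUE)
colnames(Drow) <- c("Wingcrd", "Tarsus", "Head", "Wt")
Drow
```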
Another useful way to combine data:
Dmat2 <- as.matrix(cbind(Wingcrd, Tarsus, Head, Wt))
Dmat2
## Wingcrd Tarsus Head Wt
## [1,] 59.0 22.3 31.2 9.5
## [2,] 55.0 19.7 30.4 13.8
## [3,] 53.5 20.8 30.6 14.8
## [4,] 55.0 20.3 30.3 15.2
## [5,] 52.5 20.8 30.3 15.5
## [6,] 57.5 21.5 30.8 15.6
## [7,] 53.0 20.6 32.5 15.6
## [8,] 55.0 21.5 NA 15.7
Dmat2 and Dmat are identical. Functions such as ‘as.matrix’, ‘is.matrix’ (which returns TRUE if its argument is a matrix, and FALSE otherwise), ‘as.data.frame’, and ‘is.data.frame’ can come in handy.
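A brief sketch of how these checking and converting functions behave, using two of the variables from above. The names M and DF are hypothetical.

```r
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
Tarsus  <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)

M <- cbind(Wingcrd, Tarsus)  # cbind on vectors returns a matrix
is.matrix(M)                 # is M a matrix?
is.data.frame(M)             # is M a data frame?

DF <- as.data.frame(M)       # convert the matrix to a data frame
is.data.frame(DF)
is.matrix(as.matrix(DF))     # and back to a matrix again
```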
In a data frame we can combine variables of equal length, with each row in the data frame containing observations on the same sampling unit.
# 'data.frame' is similar to the 'matrix' or 'cbind' functions
Dfrm <- data.frame(WC = Wingcrd,
TS = Tarsus,
HD = Head,
W = Wt)
Dfrm
## WC TS HD W
## 1 59.0 22.3 31.2 9.5
## 2 55.0 19.7 30.4 13.8
## 3 53.5 20.8 30.6 14.8
## 4 55.0 20.3 30.3 15.2
## 5 52.5 20.8 30.3 15.5
## 6 57.5 21.5 30.8 15.6
## 7 53.0 20.6 32.5 15.6
## 8 55.0 21.5 NA 15.7
We can use the knitr package in R to generate a nice table with the kable function.
library(knitr)
kable(Dfrm)
| WC | TS | HD | W |
|---|---|---|---|
| 59.0 | 22.3 | 31.2 | 9.5 |
| 55.0 | 19.7 | 30.4 | 13.8 |
| 53.5 | 20.8 | 30.6 | 14.8 |
| 55.0 | 20.3 | 30.3 | 15.2 |
| 52.5 | 20.8 | 30.3 | 15.5 |
| 57.5 | 21.5 | 30.8 | 15.6 |
| 53.0 | 20.6 | 32.5 | 15.6 |
| 55.0 | 21.5 | NA | 15.7 |
Basically, the data.frame function creates an object, called Dfrm in this case, and within Dfrm it stores values of the four morphometric variables. The advantage of a data frame is that you can make changes to the data without affecting the original data. For example, it is possible to combine the original weights and the square-root-transformed weights in a data frame, here called Dfrm2.
Dfrm2 <- data.frame(WC = Wingcrd,
TS = Tarsus,
HD = Head,
W = Wt,
Wsq = sqrt(Wt))
kable(Dfrm2)
| WC | TS | HD | W | Wsq |
|---|---|---|---|---|
| 59.0 | 22.3 | 31.2 | 9.5 | 3.082207 |
| 55.0 | 19.7 | 30.4 | 13.8 | 3.714835 |
| 53.5 | 20.8 | 30.6 | 14.8 | 3.847077 |
| 55.0 | 20.3 | 30.3 | 15.2 | 3.898718 |
| 52.5 | 20.8 | 30.3 | 15.5 | 3.937004 |
| 57.5 | 21.5 | 30.8 | 15.6 | 3.949683 |
| 53.0 | 20.6 | 32.5 | 15.6 | 3.949683 |
| 55.0 | 21.5 | NA | 15.7 | 3.962323 |
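The statement that changes to a data frame leave the original data untouched can be checked with a short sketch. The name Dfrm3 is hypothetical, and only the weights are used to keep the example small.

```r
Wt <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)

Dfrm3 <- data.frame(W = Wt)  # the data frame stores its own copy of Wt
Dfrm3$Wsq <- sqrt(Dfrm3$W)   # add a column to an existing data frame
Dfrm3$W[1] <- NA             # change a value inside the data frame

Wt[1]                        # the original vector is unchanged
head(Dfrm3, 3)
```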
In the data frame, we can also combine numerical variables, character strings, and factors.
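As a small sketch of such a mixed data frame: the weights below are the first three values of Wt, while the Observer and Sex columns are hypothetical and invented purely for illustration.

```r
Birds <- data.frame(
  Wt       = c(9.5, 13.8, 14.8),        # numeric (first three weights from above)
  Observer = c("A", "B", "A"),          # character strings (hypothetical labels)
  Sex      = factor(c("M", "F", "M")),  # factor (hypothetical values)
  stringsAsFactors = FALSE              # keep Observer as character
)

str(Birds)            # one line per column, showing its type
sapply(Birds, class)  # class of each column
```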
Suppose you want a black box into which you can put as many variables as you want; some may be related, some may have similar dimensions, some may be vectors, others matrices, and yet others may contain character strings of variable names. This is what the list function can do.
# 'list' function combines data of different dimensions
x1 <- c(1, 2, 3)
x2 <- c("a", "b", "c", "d")
x3 <- 3
x4 <- matrix(nrow = 2, ncol = 2)
x4[, 1] <- c(1, 2)
x4[, 2] <- c( 3, 4)
Y <- list(x1 = x1, x2 = x2, x3 = x3, x4 = x4)
Y
## $x1
## [1] 1 2 3
##
## $x2
## [1] "a" "b" "c" "d"
##
## $x3
## [1] 3
##
## $x4
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
All information contained in Y is accessible by typing, for example, Y$x1, Y$x2, and so on. Nearly all functions (e.g., linear regression, generalised linear modelling, t-test, etc.) in R produce output that is stored in a list.
Y$x1 # fetch the data for x1
## [1] 1 2 3
Y$x2 # fetch the data for x2
## [1] "a" "b" "c" "d"
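To illustrate the point that modelling functions return lists, here is a minimal sketch using a linear regression of weight on wing length. The regression is used only to show the list structure, not as a meaningful analysis of eight birds.

```r
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
Wt      <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)

fit <- lm(Wt ~ Wingcrd)  # fit a simple linear regression

is.list(fit)             # the fitted model is stored as a list
names(fit)               # names of the list components
fit$coefficients         # components are accessed with $, as with Y above
```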
# Load data from a text file
# The 'header = TRUE' option in the 'read.table' function tells R that the first row contains labels.
Squid <- read.table(file = "/Users/mominul/OneDrive - South Dakota State University - SDSU/Data Science/Books/R Books/A Beginners Guide to R/RBook/squid.txt",header = TRUE)
#Loading Excel File
library(readxl)
Vole_Skulls <- read_excel("/Users/mominul/OneDrive - South Dakota State University - SDSU/STAT 601/Project 1/Vole Skulls.xlsm")
# First 6 rows of the loaded data (the default for 'head')
head(Vole_Skulls)
## # A tibble: 6 × 5
## `Vole Skull Index` `Chromosomal Analysis ID` `Condylo Incisi… `Skull Height A…
## <dbl> <chr> <dbl> <dbl>
## 1 1 subterraneus 2350 735
## 2 2 subterraneus 2285 750
## 3 3 subterraneus 2170 738
## 4 4 subterraneus 2060 720
## 5 5 subterraneus 2275 785
## 6 6 subterraneus 2330 790
## # … with 1 more variable: Skull Width Across Rostrum (0.01mm) <dbl>
# Read a CSV data set
water <- read.csv('https://umich.instructure.com/files/399172/download?download_frd=1', header = TRUE)
# For tab-delimited text files, use the 'read.delim()' command instead
colnames(water) # Show column Names
## [1] "Year..string."
## [2] "WHO.region..string."
## [3] "Country..string."
## [4] "Residence.Area.Type..string."
## [5] "Population.using.improved.drinking.water.sources......numeric."
## [6] "Population.using.improved.sanitation.facilities......numeric."
#selecting parts of a data set
wa2 <- water$Year..string.
#selecting multiple columns like 2,4 and 5
wa1 <- water[, c(2, 4, 5)]
#selecting columns from 1 to 5
wa <- water[, c(1:5)]
To read a tab-delimited text file, use the ‘read.delim()’ command.
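Because the files above live on a particular machine, here is a self-contained sketch that writes a tiny tab-delimited file to a temporary location and reads it back. The column names and values are hypothetical and only mimic the layout of a typical data file.

```r
# Write a small tab-delimited file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sample\tSex\tGSI",
             "1\t2\t10.44",
             "2\t2\t9.83",
             "3\t1\t5.30"), tmp)

dat1 <- read.delim(tmp)                             # assumes a header row and tab separators
dat2 <- read.table(tmp, header = TRUE, sep = "\t")  # the same settings spelled out

str(dat1)
all.equal(dat1, dat2)  # compare the two results

unlink(tmp)            # remove the temporary file
```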
Let us now discuss a bad way of accessing variables. We have used ‘$’ to access variables from a data frame. It can be tedious to type Squid$ each time we want to use a variable, such as GSI, from the Squid data frame. It is possible to avoid this by using the attach command.
attach(Squid)
#base R Boxplot
boxplot(GSI)
# mean of a certain variable
mean(GSI)
## [1] 2.187034
If you use the attach command, make sure that you use unique variable names. Refrain from common names such as Month, Location, and the like.
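One way to get the same convenience without attach is the with function, or the data argument that many functions accept. The sketch below uses a small hypothetical data frame, Squid_demo, with invented values, because the real Squid file is not available here.

```r
# Hypothetical stand-in for the Squid data (values invented for illustration)
Squid_demo <- data.frame(Sex = c(2, 2, 1, 1),
                         GSI = c(10.4, 9.8, 5.3, 3.5))

mean(Squid_demo$GSI)                               # explicit, with $
with(Squid_demo, mean(GSI))                        # variables looked up inside the data frame
aggregate(GSI ~ Sex, data = Squid_demo, FUN = mean)  # 'data' argument: mean GSI per sex
```

Nothing is added to the search path here, so there is no need to call detach afterwards and no risk of clashing variable names.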
You may face a situation where you only want to work with, for example, the female data, data from a certain location, or data from the females of a certain location. To extract the subsets of data, we need to know how sex was coded.
a <- Squid$Sex
head(a)
## [1] 2 2 2 2 2 2
This shows only the first six values of the variable Sex. A better option is the unique command, which shows which unique values occur in this variable.
unique(Squid$Sex)
## [1] 2 1
The 1 stands for male, and the 2 for female.
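If you also want to know how often each code occurs, or prefer readable labels, table and factor can help. This is a sketch on a short hypothetical vector of codes; only the coding itself (1 for male, 2 for female) is taken from the text.

```r
# Hypothetical sex codes, using the coding described above (1 = male, 2 = female)
Sex <- c(2, 2, 2, 1, 1, 2)

unique(Sex)  # which codes occur
table(Sex)   # how often each code occurs

# Attach readable labels to the numeric codes
fSex <- factor(Sex, levels = c(1, 2), labels = c("M", "F"))
table(fSex)
```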
SquidM <- Squid[Squid$Sex == 1, ] #male data
SquidF <- Squid[Squid$Sex == 2, ] #female data
unique(Squid$Location)
## [1] 1 3 4 2
The unique command applied on Squid$Location shows that there are four locations coded as 1, 2, 3, and 4. To extract the data from location 1, 2, or 3 we can use the following statements that all give the same result (the | symbol stands for the Boolean ‘‘or’’ and the != for ‘‘not equal’’).
Squid13 <- Squid[Squid$Location == 1 | Squid$Location == 2 | Squid$Location == 3, ] # or symbol
Squid12 <- Squid[Squid$Location != 4, ]
Squid31 <- Squid[Squid$Location < 4, ]
Squid123 <- Squid[Squid$Location <= 3, ]
id123 <- Squid[Squid$Location >= 1 & Squid$Location <= 3, ]
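Another common way to write the same selection is the %in% operator. The sketch below checks on a small hypothetical data frame (Squid_demo, with invented values) that the different formulations select the same rows.

```r
# Hypothetical stand-in for the Squid data (values invented for illustration)
Squid_demo <- data.frame(Location = c(1, 3, 4, 2, 4, 1),
                         GSI      = c(5.3, 3.5, 3.2, 2.9, 10.4, 9.8))

by_or  <- Squid_demo[Squid_demo$Location == 1 | Squid_demo$Location == 2 |
                       Squid_demo$Location == 3, ]
by_neq <- Squid_demo[Squid_demo$Location != 4, ]
by_in  <- Squid_demo[Squid_demo$Location %in% 1:3, ]  # membership test

identical(by_or, by_neq)  # do the first two selections agree?
identical(by_or, by_in)   # does %in% give the same rows?
```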
Next we use ‘‘&’’, which is the Boolean ‘‘and’’ operator. Suppose we want to extract the male data from location 1. This means that the data have to be both from male squid and from location 1.
SquidM.1 <- Squid[Squid$Sex == 1 & Squid$Location == 1,]
dim(Squid) # Dimension of the original Dataset
## [1] 2644 6
dim(SquidM.1) #dimension after modification
## [1] 1031 6
head(SquidM.1) # First few rows in the data set
## Sample Year Month Location Sex GSI
## 24 24 1 5 1 1 5.2970
## 58 58 1 6 1 1 3.5008
## 60 60 1 6 1 1 3.2487
## 61 61 1 6 1 1 3.2304
## 63 63 1 6 1 1 3.1848
## 65 65 1 6 1 1 2.9683
Using ‘‘and’’ and ‘‘or’’ in a single statement:
# data from males and from location 1 or 2
male.12 <- Squid[Squid$Sex == 1 & (Squid$Location == 1 | Squid$Location == 2), ]
dim(male.12)
## [1] 1119 6
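The same kind of selection can also be written with the subset function, which lets you refer to the columns without the Squid$ prefix. The sketch below runs on a hypothetical data frame, Squid_demo, with invented values.

```r
# Hypothetical stand-in for the Squid data (values invented for illustration)
Squid_demo <- data.frame(Sex      = c(1, 1, 2, 1, 2, 1),
                         Location = c(1, 2, 1, 4, 2, 3),
                         GSI      = c(5.3, 3.5, 10.4, 3.2, 9.8, 2.9))

# Bracket notation, as in the text
male.12.demo <- Squid_demo[Squid_demo$Sex == 1 &
                             (Squid_demo$Location == 1 | Squid_demo$Location == 2), ]

# The same selection with subset(); columns are referred to by name
male.12.sub <- subset(Squid_demo, Sex == 1 & Location %in% c(1, 2))

all.equal(male.12.demo, male.12.sub)  # compare the two results
dim(male.12.sub)
```

One difference to keep in mind: if the condition evaluates to NA for some rows, bracket indexing returns rows filled with NA for them, whereas subset drops those rows.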