Why do we use R?

Data analysts need to access data from a wide range of sources (database management systems, text files, statistical packages, and spreadsheets), merge the pieces of data together, clean and annotate them, analyze them with modern methods, and present the findings in meaningful and graphically appealing ways.

R is a comprehensive software package that’s ideally suited to accomplish these goals.

Exploring R

Observations (rows) and variables (columns)

Data types

  • Numeric or integer: counts, height, weight, temperature

  • Character: names, cities, texts

  • Logical (TRUE/FALSE)

  • Complex

  • Raw (bytes)

Data structures

R has a wide variety of objects for holding the data, including scalars, vectors, matrices, arrays, data frames, and lists.

  1. Vector: a one-dimensional array that stores a collection of data of the same mode
a <- 3
b <- 4
c <- c(1,2,3,4,5)
d <- c("one", "two", "three")
e <- c(1, "two", 3)

Note: check the class of a vector using class().
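As a quick illustration of the note above, the sketch below calls class() on vectors like the ones just created. Because a vector holds a single mode, mixing numbers and strings coerces every element to character. The name c_vec is a hypothetical stand-in for the vector called c above, used here only to keep it visually distinct from the c() function.

```r
# Vectors mirroring the examples above (c_vec is a hypothetical name)
c_vec <- c(1, 2, 3, 4, 5)
d <- c("one", "two", "three")
e <- c(1, "two", 3)

class(c_vec)  # "numeric"
class(d)      # "character"
class(e)      # "character": the numbers were coerced to strings
e             # "1"   "two" "3"
```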

To generate a sequence:

f <- 1:20
f
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

🦆 How about making a vector containing 1 to 25 and 30 to 40?

Calculate using the objects you assigned

g <- a+b
g
## [1] 7

To generate repetitions, we can use rep().

h <- rep(4, times= 3)
h
## [1] 4 4 4
i <- rep(c(4,5), each= 3)
i
## [1] 4 4 4 5 5 5
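
Besides the colon operator and rep(), base R's seq() gives control over the step size or the number of values. The following is a small sketch with made-up values; the object names s1, s2, and r1 are hypothetical.

```r
# Sequence with a fixed step
s1 <- seq(from = 0, to = 10, by = 2)         # 0 2 4 6 8 10

# Sequence with a fixed number of evenly spaced values
s2 <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# times can be a vector: repeat "a" twice and "b" three times
r1 <- rep(c("a", "b"), times = c(2, 3))      # "a" "a" "b" "b" "b"
```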
  2. Matrices: a two-dimensional array in which every element has the same mode. Matrices are created with the matrix() function.
y <- matrix(1:20, nrow=5, ncol=4)
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

Check the documentation for the matrix() function (?matrix)!

cells <- c(1, 26, 24, 28)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                   dimnames = list(rnames, cnames))
mymatrix
##    C1 C2
## R1  1 26
## R2 24 28
  3. Arrays: similar to matrices but can have more than two dimensions. Arrays are created with the array() function.
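
A minimal sketch of array(), using made-up values: the dim argument gives the size of each dimension, and elements are filled column by column, just as with matrix(). The name arr is hypothetical.

```r
# A 2 x 3 x 4 array filled with the values 1 to 24
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)      # 2 3 4
arr[2, 3, 1]  # 6: row 2, column 3 of the first 2 x 3 layer
arr[1, 1, 4]  # 19: first element of the fourth layer
```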

  4. Data frame: a more general structure for storing data. A data frame can contain different modes of data (numeric, character, etc.).

    mydf <- data.frame(col1, col2, col3, …)

🦆 Create a data frame (named “mydf”) using information about the people in this lecture: name, nationality, sex, and favorite number.

  5. Factors: Categorical (nominal) and ordered categorical (ordinal) variables are called “factors”. Factors are crucial in R because they determine how data will be analyzed and presented visually.

Make the patient data

id <- c(1,2,3,4)
age <- c(25,34,28,52)
diabetes <- c("type1", "type2", "type1", "type1")
status <- c("poor", "improved", "excellent", "poor")
data <- data.frame(id, age, diabetes, status)

data
##   id age diabetes    status
## 1  1  25    type1      poor
## 2  2  34    type2  improved
## 3  3  28    type1 excellent
## 4  4  52    type1      poor

Check the data type.

str(data)
## 'data.frame':    4 obs. of  4 variables:
##  $ id      : num  1 2 3 4
##  $ age     : num  25 34 28 52
##  $ diabetes: chr  "type1" "type2" "type1" "type1"
##  $ status  : chr  "poor" "improved" "excellent" "poor"

Convert the variables using factor()!

data$diabetes <- factor(data$diabetes)
data$status <- factor(data$status, ordered = TRUE)

str(data)
## 'data.frame':    4 obs. of  4 variables:
##  $ id      : num  1 2 3 4
##  $ age     : num  25 34 28 52
##  $ diabetes: Factor w/ 2 levels "type1","type2": 1 2 1 1
##  $ status  : Ord.factor w/ 3 levels "excellent"<"improved"<..: 3 2 1 3
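
Note that the ordered factor above uses the default alphabetical level order ("excellent" < "improved" < "poor"), which may not be the intended ranking. The sketch below sets the order explicitly with the levels argument, assuming the intended order is poor < improved < excellent; the name status_f is hypothetical.

```r
status <- c("poor", "improved", "excellent", "poor")

# Specify the level order explicitly instead of relying on alphabetical order
status_f <- factor(status,
                   levels = c("poor", "improved", "excellent"),
                   ordered = TRUE)

levels(status_f)        # "poor" "improved" "excellent"
as.integer(status_f)    # 1 2 3 1
status_f < "excellent"  # TRUE TRUE FALSE TRUE
```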
  6. Lists: The most complex data structure. A list allows you to gather a variety of objects (e.g., matrices, data frames, and even other lists) under one name. Lists are created with the list() function.
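
A short sketch of list(), with made-up contents: the component names title, ages, and m are hypothetical and chosen only for illustration.

```r
# A list can hold objects of different types and sizes under one name
mylist <- list(title = "My first list",
               ages = c(25, 34, 28, 52),
               m = matrix(1:4, nrow = 2))

mylist$title         # access a component by name: "My first list"
mylist[["ages"]][2]  # second element of the ages vector: 34
length(mylist)       # 3 components
```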

Data input

1. Importing data

From a comma-separated values (.csv) file

Let’s import the emergency ambulance dispatch (EAD) data with read.csv() and check the first 5 rows of the data using the head() function.

data1 <- read.csv("H:\\class2023-advance environmental\\Basics of R programming_20230629\\dataset\\data2014.csv")

head(data1, 5)
##       date year month day dcount
## 1 2014/1/1 2014     1   1    100
## 2 2014/1/2 2014     1   2    118
## 3 2014/1/3 2014     1   3     91
## 4 2014/1/4 2014     1   4     82
## 5 2014/1/5 2014     1   5    115

From an Excel (.xls or .xlsx) file

We need an additional package to read Excel data (csv files can be read with base R).

Just run install.packages("readxl") once to install it.

library(readxl)
data2 <- read_xlsx("H:/class2023-advance environmental/Basics of R programming_20230629/dataset/data2015.xlsx", sheet = 1)

head(data2, 5)
## # A tibble: 5 × 5
##   date                 year month   day dcount
##   <dttm>              <dbl> <dbl> <dbl>  <dbl>
## 1 2015-01-01 00:00:00  2015     1     1    109
## 2 2015-01-02 00:00:00  2015     1     2    115
## 3 2015-01-03 00:00:00  2015     1     3    108
## 4 2015-01-04 00:00:00  2015     1     4     98
## 5 2015-01-05 00:00:00  2015     1     5    115

2. Checking data

Check the data type of each variable using str().

str(data1)
## 'data.frame':    365 obs. of  5 variables:
##  $ date  : chr  "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
##  $ year  : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ month : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dcount: int  100 118 91 82 115 93 93 99 82 74 ...

Get the variable names (column names)

names(data1)
## [1] "date"   "year"   "month"  "day"    "dcount"

Count the number of columns

ncol(data1)
## [1] 5

Count the number of rows

nrow(data1)
## [1] 365

3. Subsetting data

We can select and exclude variables and observations using the following code.

  1. Selecting variables: data[row, column]

Extract the 5th row

data1[5,] 
##       date year month day dcount
## 5 2014/1/5 2014     1   5    115

🦆 Extract the 5th column

🦆 Subset rows 1 and 50 to 100, keeping only the columns date and dcount.

  2. Excluding variables

Exclude the columns year, month, and day from data1.

sub1 <- data1[,-c(2,3,4)]
head(sub1, 5)
##       date dcount
## 1 2014/1/1    100
## 2 2014/1/2    118
## 3 2014/1/3     91
## 4 2014/1/4     82
## 5 2014/1/5    115
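
Negative positions work, but they silently drop the wrong columns if the column order changes. A sketch of excluding columns by name instead, on a toy data frame that copies the first two rows of data1 shown above (toy, drop_cols, and sub_toy are hypothetical names):

```r
# Toy data frame standing in for data1
toy <- data.frame(date = c("2014/1/1", "2014/1/2"),
                  year = c(2014, 2014),
                  month = c(1, 1),
                  day = c(1, 2),
                  dcount = c(100, 118))

# Keep every column whose name is NOT in drop_cols
drop_cols <- c("year", "month", "day")
sub_toy <- toy[, !(names(toy) %in% drop_cols)]

names(sub_toy)  # "date" "dcount"
```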
  3. Selecting observations

We can subset the data based on the values of a variable. For example, let’s keep only the days when EAD > 100.

Using subset()

sub2 <- subset(data1, dcount > 100)
head(sub2, 5)
##         date year month day dcount
## 2   2014/1/2 2014     1   2    118
## 5   2014/1/5 2014     1   5    115
## 13 2014/1/13 2014     1  13    104
## 14 2014/1/14 2014     1  14    106
## 27 2014/1/27 2014     1  27    109

Using which()

sub3 <- data1[which(data1$dcount > 100),]
head(sub3, 5)
##         date year month day dcount
## 2   2014/1/2 2014     1   2    118
## 5   2014/1/5 2014     1   5    115
## 13 2014/1/13 2014     1  13    104
## 14 2014/1/14 2014     1  14    106
## 27 2014/1/27 2014     1  27    109
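
Both approaches give the same result here. They differ from plain logical indexing when the variable contains missing values, as the following sketch on a made-up data frame shows (toy and the by_* names are hypothetical):

```r
toy <- data.frame(day = 1:4, dcount = c(118, NA, 82, 104))

by_logical <- toy[toy$dcount > 100, ]         # the NA comparison yields an all-NA row
by_which   <- toy[which(toy$dcount > 100), ]  # which() drops the NA
by_subset  <- subset(toy, dcount > 100)       # subset() also drops the NA

nrow(by_logical)  # 3
nrow(by_which)    # 2
nrow(by_subset)   # 2
```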

🦆 Let’s subset data1:

  • sub4 for September

  • sub5 for days in September when EAD is less than 50

  • sub6 for the whole data except for September


4. Formatting date variable

Check the data structure and convert the “date” variable into Date format.

str(data1)
## 'data.frame':    365 obs. of  5 variables:
##  $ date  : chr  "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
##  $ year  : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ month : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dcount: int  100 118 91 82 115 93 93 99 82 74 ...
data1$date_new <- as.Date(data1$date, format = "%Y/%m/%d")
str(data1)
## 'data.frame':    365 obs. of  6 variables:
##  $ date    : chr  "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
##  $ year    : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ month   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dcount  : int  100 118 91 82 115 93 93 99 82 74 ...
##  $ date_new: Date, format: "2014-01-01" "2014-01-02" ...

Suppose we only have separate columns for year, month, and day. We can use paste() to build a new date column.

data1$date_new1 <- as.Date(paste(data1$year, data1$month, data1$day, sep = "/"), format = "%Y/%m/%d")

str(data1)
## 'data.frame':    365 obs. of  7 variables:
##  $ date     : chr  "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
##  $ year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dcount   : int  100 118 91 82 115 93 93 99 82 74 ...
##  $ date_new : Date, format: "2014-01-01" "2014-01-02" ...
##  $ date_new1: Date, format: "2014-01-01" "2014-01-02" ...
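
The format argument describes how the text is laid out: %Y is a four-digit year, %m the month number, and %d the day of the month. A small sketch with made-up date strings (d1 and d2 are hypothetical names):

```r
# The same calendar day written in two different layouts
d1 <- as.Date("2014/1/5", format = "%Y/%m/%d")    # year/month/day
d2 <- as.Date("05-01-2014", format = "%d-%m-%Y")  # day-month-year

d1 == d2                                # TRUE: both are 5 January 2014
format(d1, "%Y-%m")                     # "2014-01"
as.numeric(d2 - as.Date("2014-01-01"))  # 4 days after 1 January
```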

🦆 Check the date format in data2 and convert it!


5. Renaming the variables

Check the variable’s name!

names(data1)
## [1] "date"      "year"      "month"     "day"       "dcount"    "date_new" 
## [7] "date_new1"

We can rename all columns at once using the colnames() function, or rename a specific column by indexing names().

Let’s change the variable name of “dcount” to “EAD”.

names(data1)[5] <- "EAD"
names(data1)
## [1] "date"      "year"      "month"     "day"       "EAD"       "date_new" 
## [7] "date_new1"
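
Indexing by position (5) breaks if the column order changes. One alternative, sketched here on a made-up one-row data frame (toy is a hypothetical name), is to match on the current column name:

```r
toy <- data.frame(date = "2014/1/1", dcount = 100)

# Match on the current name rather than hard-coding the position
names(toy)[names(toy) == "dcount"] <- "EAD"

names(toy)  # "date" "EAD"
```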

🦆 Change the variable name of “dcount” of data2 to “EAD”!


6. Combining two data frames (merging datasets)

If the data exist in multiple locations, you’ll need to combine them before moving forward. This section shows how to add the rows (observations) and columns (variables) to a data frame.

  1. Adding rows

To join two data frames vertically, use the rbind() function.

Note that the data frames must have the same variables! Here we first drop the date_new1 column (column 7) from data1.

data1 <- data1[,-7]
head(data1)
##       date year month day EAD   date_new
## 1 2014/1/1 2014     1   1 100 2014-01-01
## 2 2014/1/2 2014     1   2 118 2014-01-02
## 3 2014/1/3 2014     1   3  91 2014-01-03
## 4 2014/1/4 2014     1   4  82 2014-01-04
## 5 2014/1/5 2014     1   5 115 2014-01-05
## 6 2014/1/6 2014     1   6  93 2014-01-06

Combine it!

data3 <- rbind(data1, data2)
head(data3)
##       date year month day EAD   date_new
## 1 2014/1/1 2014     1   1 100 2014-01-01
## 2 2014/1/2 2014     1   2 118 2014-01-02
## 3 2014/1/3 2014     1   3  91 2014-01-03
## 4 2014/1/4 2014     1   4  82 2014-01-04
## 5 2014/1/5 2014     1   5 115 2014-01-05
## 6 2014/1/6 2014     1   6  93 2014-01-06
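
To illustrate the "same variables" requirement, here is a sketch on made-up one-row data frames (a, b, bad, and their values are hypothetical). rbind() matches data frame columns by name, so column order may differ, but a mismatched name is an error.

```r
a <- data.frame(date = as.Date("2014-12-31"), EAD = 90)
b <- data.frame(EAD = 109, date = as.Date("2015-01-01"))  # same names, different order

ab <- rbind(a, b)  # columns are matched by name, not by position
ab$EAD             # 90 109

# A data frame with a different column name cannot be stacked
bad <- data.frame(date = as.Date("2015-01-02"), dcount = 115)
res <- try(rbind(a, bad), silent = TRUE)
inherits(res, "try-error")  # TRUE
```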
  2. Adding columns

To merge two data frames horizontally, you use the merge() function. In most cases, two data frames are joined by one or more common key variables.

🦆 Let’s import environmental data!

env <- read.csv("H:/class2023-advance environmental/Basics of R programming_20230629/dataset/env.csv", skip = 1)

head(env)
##   X     date     dmean temp
## 1 1 2014/4/1 27.083333 13.5
## 2 2 2014/4/2 14.434783 16.5
## 3 3 2014/4/3 21.043478 16.9
## 4 4 2014/4/4 13.416667 11.5
## 5 5 2014/4/5 13.375000  8.2
## 6 6 2014/4/6  8.958333  7.8

🦆 Check (and change) the format of the “date” variable!

env$date_new <- as.Date(env$date, format = "%Y/%m/%d")
head(env)
##   X     date     dmean temp   date_new
## 1 1 2014/4/1 27.083333 13.5 2014-04-01
## 2 2 2014/4/2 14.434783 16.5 2014-04-02
## 3 3 2014/4/3 21.043478 16.9 2014-04-03
## 4 4 2014/4/4 13.416667 11.5 2014-04-04
## 5 5 2014/4/5 13.375000  8.2 2014-04-05
## 6 6 2014/4/6  8.958333  7.8 2014-04-06
env <- env[,-c(1,2)]
head(env)
##       dmean temp   date_new
## 1 27.083333 13.5 2014-04-01
## 2 14.434783 16.5 2014-04-02
## 3 21.043478 16.9 2014-04-03
## 4 13.416667 11.5 2014-04-04
## 5 13.375000  8.2 2014-04-05
## 6  8.958333  7.8 2014-04-06

Subset data3 to keep only EAD and date_new.

data3 <- data3[,5:6]
head(data3)
##   EAD   date_new
## 1 100 2014-01-01
## 2 118 2014-01-02
## 3  91 2014-01-03
## 4  82 2014-01-04
## 5 115 2014-01-05
## 6  93 2014-01-06

Now merge env and data3 using the merge() function.

data4 <- merge(data3, env, by = "date_new", all.x = TRUE)
head(data4)
##     date_new EAD dmean temp
## 1 2014-01-01 100    NA   NA
## 2 2014-01-02 118    NA   NA
## 3 2014-01-03  91    NA   NA
## 4 2014-01-04  82    NA   NA
## 5 2014-01-05 115    NA   NA
## 6 2014-01-06  93    NA   NA
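
The NA values in the first rows arise because env, as head(env) showed, starts on 2014-04-01 while data3 starts on 2014-01-01; the all.x argument keeps every row of the first data frame even when it has no match. A sketch on made-up data (ead, env_toy, and the EAD values are hypothetical):

```r
ead <- data.frame(date_new = as.Date(c("2014-03-31", "2014-04-01", "2014-04-02")),
                  EAD = c(80, 95, 70))
env_toy <- data.frame(date_new = as.Date(c("2014-04-01", "2014-04-02", "2014-04-03")),
                      temp = c(13.5, 16.5, 16.9))

left  <- merge(ead, env_toy, by = "date_new", all.x = TRUE)  # keep every row of ead
inner <- merge(ead, env_toy, by = "date_new")                # keep matching dates only

nrow(left)   # 3
left$temp    # NA 13.5 16.5
nrow(inner)  # 2
```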

7. Summarizing data and basic statistics

  1. Descriptive statistics via summary()
summary(data4)
##     date_new               EAD             dmean            temp      
##  Min.   :2014-01-01   Min.   : 49.00   Min.   : 5.50   Min.   : 2.10  
##  1st Qu.:2014-07-02   1st Qu.: 72.00   1st Qu.:14.00   1st Qu.:12.40  
##  Median :2014-12-31   Median : 79.00   Median :18.52   Median :19.40  
##  Mean   :2014-12-31   Mean   : 80.37   Mean   :19.64   Mean   :18.38  
##  3rd Qu.:2015-07-01   3rd Qu.: 88.00   3rd Qu.:24.17   3rd Qu.:24.23  
##  Max.   :2015-12-31   Max.   :133.00   Max.   :46.04   Max.   :32.10  
##                       NA's   :5        NA's   :97      NA's   :90
  2. Identifying missing values using the is.na() function.
na_temp <- is.na(data4$temp)
sum(na_temp)
## [1] 90

Let’s calculate the average temperature in data4!

mean(data4$temp)
## [1] NA

The result is NA because temp contains missing values. We need to omit them!

mean(data4$temp, na.rm = TRUE)
## [1] 18.38172
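
To see missing values across a whole data frame rather than one column, is.na() can be combined with colSums(), and complete.cases() flags rows with no missing values. A sketch on a made-up data frame (toy and its values are hypothetical):

```r
toy <- data.frame(EAD = c(100, NA, 91), temp = c(NA, NA, 16.9))

colSums(is.na(toy))           # missing values per column: EAD 1, temp 2
complete.cases(toy)           # FALSE FALSE TRUE
mean(toy$temp, na.rm = TRUE)  # 16.9
```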

🦆 Try other functions to summarize the data (e.g., mean, median, quantile)!

median(data4$temp, na.rm = TRUE)
## [1] 19.4
quantile(data4$temp, probs = c(0.01, 0.5, 0.9), na.rm = TRUE)
##     1%    50%    90% 
##  3.695 19.400 27.900

8. Basic plot

  1. Histograms

A histogram displays the distribution of a continuous variable by dividing the range of values into a specified number of bins on the x-axis and showing the frequency of values in each bin on the y-axis.

hist(data4$EAD)

  2. Box plots

A “box-and-whiskers” plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum.

boxplot(data4$EAD)

We can also draw box plots of EAD for each day of the week (DOW). Let’s make a column for day of week (dow)!

data4$dow <- weekdays(data4$date_new, abbreviate = TRUE)
head(data4)
##     date_new EAD dmean temp dow
## 1 2014-01-01 100    NA   NA Wed
## 2 2014-01-02 118    NA   NA Thu
## 3 2014-01-03  91    NA   NA Fri
## 4 2014-01-04  82    NA   NA Sat
## 5 2014-01-05 115    NA   NA Sun
## 6 2014-01-06  93    NA   NA Mon

Then, make the box plots for EAD by dow.

boxplot(data4$EAD ~ data4$dow)
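
Because dow is a character variable, boxplot() arranges the boxes alphabetically (Fri, Mon, Sat, ...). One way to get calendar order, sketched below on made-up dates, is to convert dow to a factor with explicit levels. The names dates, week_levels, and dow_f are hypothetical; the levels are built from a known Monday (2014-01-06, as shown in the output above) so that they match the locale's weekday labels.

```r
dates <- as.Date("2014-01-01") + 0:13  # two weeks of example dates
dow   <- weekdays(dates, abbreviate = TRUE)

# Monday-to-Sunday labels generated from a week that starts on a Monday
week_levels <- weekdays(as.Date("2014-01-06") + 0:6, abbreviate = TRUE)
dow_f <- factor(dow, levels = week_levels)

levels(dow_f)[1] == dow[6]  # TRUE: the first level is the label of 2014-01-06
table(dow_f)                # each weekday appears twice, in calendar order
```

Passing a factor such as dow_f on the right-hand side of the boxplot() formula would then draw the boxes from Monday to Sunday.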

  3. Time series
plot(data4$date_new, data4$EAD)

plot(data4$date_new, data4$temp)


Data output

1. Exporting data

Data can be archived or imported into external applications. You can use the write.csv() function to write an R object to a .csv file.
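
A minimal sketch of write.csv(), using a made-up data frame and a temporary file so that nothing is written to your working folder (toy, out_file, and back are hypothetical names). Setting row.names = FALSE omits the column of row numbers that write.csv() adds by default.

```r
toy <- data.frame(date = c("2014/1/1", "2014/1/2"), EAD = c(100, 118))

out_file <- tempfile(fileext = ".csv")       # temporary path used only for this sketch
write.csv(toy, out_file, row.names = FALSE)  # write without the row-number column

# Read the file back to confirm what was written
back <- read.csv(out_file)
names(back)  # "date" "EAD"
nrow(back)   # 2
```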

For other file types, consult the reference below or the R documentation.


Reference

Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.