Why do we use R?
Data analysts need to access data from a wide range of sources (database management systems, text files, statistical packages, and spreadsheets), merge the pieces of data together, clean and annotate them, analyze them with modern methods, and present the findings in meaningful and graphically appealing ways.
R is a comprehensive software package that’s ideally suited to accomplish these goals.
Observations (rows) and variables (columns)
Numeric or integer: counts, height, weight, temperature
Character: names, cities, texts
Logical (TRUE/FALSE)
Complex
Raw (bytes)
R has a wide variety of objects for holding the data, including scalars, vectors, matrices, arrays, data frames, and lists.
a <- 3
b <- 4
c <- c(1,2,3,4,5)
d <- c("one", "two", "three")
e <- c(1, "two", 3)
Note! Check the class of a vector using class().
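As a minimal sketch of what class() reports, the vectors below reuse the same values as c, d, and e above under hypothetical names (num_vec, chr_vec, mixed_vec). A vector can hold only one type, so mixing numbers and text turns everything into character.

```r
# Hypothetical names; the values mirror c, d, and e above
num_vec   <- c(1, 2, 3, 4, 5)
chr_vec   <- c("one", "two", "three")
mixed_vec <- c(1, "two", 3)   # numbers are coerced to character

class(num_vec)    # "numeric"
class(chr_vec)    # "character"
class(mixed_vec)  # "character"
mixed_vec         # "1" "two" "3"
```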
To generate a sequence:
f <- 1:20
f
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
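The colon operator always steps by 1. As a small sketch, the base function seq() can be used when a different step or a fixed number of values is needed; the names s1 and s2 are only for illustration.

```r
# Step by 2 from 2 to 20
s1 <- seq(from = 2, to = 20, by = 2)
# Five evenly spaced values between 0 and 1
s2 <- seq(from = 0, to = 1, length.out = 5)

s1
s2
```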
🦆 How about making a vector containing 1 to 25 and 30 to 40?
Calculate using the objects you assigned
g <- a+b
g
## [1] 7
To generate repetitions, we can use rep().
h <- rep(4, times= 3)
h
## [1] 4 4 4
i <- rep(c(4,5), each= 3)
i
## [1] 4 4 4 5 5 5
Matrices are created with the matrix() function.
y <- matrix(1:20, nrow=5, ncol=4)
y
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
Check the matrix() function!
cells <- c(1, 26, 24, 28)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = T,
dimnames = list(rnames, cnames))
mymatrix
## C1 C2
## R1 1 26
## R2 24 28
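To see what byrow changes, here is a short sketch that fills the same four values both ways and picks out one cell with [row, column]. The names by_row and by_col are hypothetical.

```r
cells <- c(1, 26, 24, 28)

# Fill row by row (as in mymatrix above)
by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
# Default behaviour: fill column by column
by_col <- matrix(cells, nrow = 2, ncol = 2)

by_row[1, 2]   # first row, second column: 26
by_col[1, 2]   # first row, second column: 24
```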
Arrays: Similar to matrices but can have more than two dimensions. They are created with the array() function.
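A minimal sketch of a three-dimensional array, assuming we simply want to arrange the numbers 1 to 24 into 2 rows, 3 columns, and 4 layers (the name arr is hypothetical):

```r
# dim gives the size of each dimension: rows, columns, layers
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)       # 2 3 4
arr[1, 2, 3]   # row 1, column 2, layer 3
arr[, , 1]     # the whole first layer (a 2 x 3 matrix)
```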
Data frame: A more general structure for storing data. A data frame can contain different modes of data (numeric, character, etc.) in different columns.
mydf <- data.frame(col1, col2, col3, …)
🦆 Create a data frame (named “mydf”) using information about the people in this lecture, containing name, nationality, sex, and favorite number.
Make the patient data
id <- c(1,2,3,4)
age <- c(25,34,28,52)
diabetes <- c("type1", "type2", "type1", "type1")
status <- c("poor", "improved", "excellent", "poor")
data <- data.frame(id, age, diabetes, status)
data
## id age diabetes status
## 1 1 25 type1 poor
## 2 2 34 type2 improved
## 3 3 28 type1 excellent
## 4 4 52 type1 poor
Check the data type.
str(data)
## 'data.frame': 4 obs. of 4 variables:
## $ id : num 1 2 3 4
## $ age : num 25 34 28 52
## $ diabetes: chr "type1" "type2" "type1" "type1"
## $ status : chr "poor" "improved" "excellent" "poor"
Using factor!
data$diabetes <- factor(data$diabetes)
data$status <- factor(data$status, ordered = T)
str(data)
## 'data.frame': 4 obs. of 4 variables:
## $ id : num 1 2 3 4
## $ age : num 25 34 28 52
## $ diabetes: Factor w/ 2 levels "type1","type2": 1 2 1 1
## $ status : Ord.factor w/ 3 levels "excellent"<"improved"<..: 3 2 1 3
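Note that the ordered levels above are alphabetical ("excellent" < "improved" < "poor"), which is probably not the intended ranking. A sketch of one way to control the order, assuming the intended ranking is poor < improved < excellent, is to pass levels explicitly. The names status_default and status_explicit are hypothetical.

```r
status <- c("poor", "improved", "excellent", "poor")

# Default: levels are sorted alphabetically
status_default <- factor(status, ordered = TRUE)
levels(status_default)

# Explicit levels (assumed ranking: poor < improved < excellent)
status_explicit <- factor(status, ordered = TRUE,
                          levels = c("poor", "improved", "excellent"))
levels(status_explicit)
as.integer(status_explicit)   # position of each value in the level order
```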
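Lists: An ordered collection of objects, which can be of different types, created using the list() command.
from Comma Separated Value (.csv) file
Let's import the emergency ambulance dispatch (EAD) data with read.csv() and check the top 5 rows of the data using the head() command.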
data1 <- read.csv("H:\\class2023-advance environmental\\Basics of R programming_20230629\\dataset\\data2014.csv")
head(data1, 5)
## date year month day dcount
## 1 2014/1/1 2014 1 1 100
## 2 2014/1/2 2014 1 2 118
## 3 2014/1/3 2014 1 3 91
## 4 2014/1/4 2014 1 4 82
## 5 2014/1/5 2014 1 5 115
from Excel (.xls or .xlsx) file
We need an additional package to read Excel data (csv files can be opened using the base packages).
Just run install.packages("readxl") once, then load the package.
library(readxl)
data2 <- read_xlsx("H:/class2023-advance environmental/Basics of R programming_20230629/dataset/data2015.xlsx", sheet = 1)
head(data2, 5)
## # A tibble: 5 × 5
## date year month day dcount
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2015-01-01 00:00:00 2015 1 1 109
## 2 2015-01-02 00:00:00 2015 1 2 115
## 3 2015-01-03 00:00:00 2015 1 3 108
## 4 2015-01-04 00:00:00 2015 1 4 98
## 5 2015-01-05 00:00:00 2015 1 5 115
Check the data type of each variable using str().
str(data1)
## 'data.frame': 365 obs. of 5 variables:
## $ date : chr "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dcount: int 100 118 91 82 115 93 93 99 82 74 ...
Get the variable names (column name)
names(data1)
## [1] "date" "year" "month" "day" "dcount"
Count the number of columns
ncol(data1)
## [1] 5
Count the number of rows
nrow(data1)
## [1] 365
We can select and exclude variables and observations using the following code. The general form is data[row, column].
Extract the 5th row
data1[5,]
## date year month day dcount
## 5 2014/1/5 2014 1 5 115
🦆 Extract the 5th column
🦆 Subset rows 1 and 50 to 100, and the columns date and dcount.
Exclude the columns year, month, and day from data1.
sub1 <- data1[,-c(2,3,4)]
head(sub1, 5)
## date dcount
## 1 2014/1/1 100
## 2 2014/1/2 118
## 3 2014/1/3 91
## 4 2014/1/4 82
## 5 2014/1/5 115
We can subset the data based on the values of a variable. For example, let’s keep only the days when EAD > 100.
Using subset()
sub2 <- subset(data1, dcount > 100)
head(sub2, 5)
## date year month day dcount
## 2 2014/1/2 2014 1 2 118
## 5 2014/1/5 2014 1 5 115
## 13 2014/1/13 2014 1 13 104
## 14 2014/1/14 2014 1 14 106
## 27 2014/1/27 2014 1 27 109
Using which()
sub3 <- data1[which(data1$dcount > 100),]
head(sub3, 5)
## date year month day dcount
## 2 2014/1/2 2014 1 2 118
## 5 2014/1/5 2014 1 5 115
## 13 2014/1/13 2014 1 13 104
## 14 2014/1/14 2014 1 14 106
## 27 2014/1/27 2014 1 27 109
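The two approaches give the same rows here because dcount has no missing values. As a hedged sketch on a small made-up data frame (toy, not the EAD data), the code below shows how plain logical indexing, which(), and subset() behave when the condition evaluates to NA.

```r
# Hypothetical toy data with one missing count
toy <- data.frame(day = 1:5, dcount = c(120, 80, NA, 105, 99))

by_logical <- toy[toy$dcount > 100, ]         # keeps an extra all-NA row
by_which   <- toy[which(toy$dcount > 100), ]  # which() drops the NA
by_subset  <- subset(toy, dcount > 100)       # subset() also drops the NA

nrow(by_logical)   # 3
nrow(by_which)     # 2
nrow(by_subset)    # 2
```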
🦆 Let’s subset data1:
sub4 for September
sub5 for the days in September when EAD is less than 50
sub6 for the whole data except for September
Check the data structure and convert the “date” variable into Date format.
str(data1)
## 'data.frame': 365 obs. of 5 variables:
## $ date : chr "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dcount: int 100 118 91 82 115 93 93 99 82 74 ...
data1$date_new <- as.Date(data1$date, format = "%Y/%m/%d")
str(data1)
## 'data.frame': 365 obs. of 6 variables:
## $ date : chr "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dcount : int 100 118 91 82 115 93 93 99 82 74 ...
## $ date_new: Date, format: "2014-01-01" "2014-01-02" ...
Assume we only have separate columns for year, month, and day. We can use paste() to build a date string and convert it into a new date column.
data1$date_new1 <- as.Date(paste(data1$year, data1$month, data1$day, sep = "/"), format = "%Y/%m/%d")
str(data1)
## 'data.frame': 365 obs. of 7 variables:
## $ date : chr "2014/1/1" "2014/1/2" "2014/1/3" "2014/1/4" ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dcount : int 100 118 91 82 115 93 93 99 82 74 ...
## $ date_new : Date, format: "2014-01-01" "2014-01-02" ...
## $ date_new1: Date, format: "2014-01-01" "2014-01-02" ...
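The format argument has to match the text exactly. The sketch below uses a small made-up data frame (parts) to show the paste() step on its own, what happens when the format does not match, and how format() can pull pieces back out of a Date.

```r
# Hypothetical toy data with separate year, month, and day columns
parts <- data.frame(year = c(2014, 2014), month = c(1, 12), day = c(5, 31))

date_chr <- paste(parts$year, parts$month, parts$day, sep = "/")
date_chr   # "2014/1/5" "2014/12/31"

good <- as.Date(date_chr, format = "%Y/%m/%d")   # separators match
bad  <- as.Date(date_chr, format = "%Y-%m-%d")   # "-" does not match "/": NA

good
bad
format(good, "%Y-%m")   # year and month as text
```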
🦆 Check the date format for data2 and convert them!
Check the variable’s name!
names(data1)
## [1] "date" "year" "month" "day" "dcount" "date_new"
## [7] "date_new1"
We can change the column names for all variables at once using the colnames() function, and change the name of a specific column by indexing names().
Let’s change the variable name of “dcount” to “EAD”.
names(data1)[5] <- "EAD"
names(data1)
## [1] "date" "year" "month" "day" "EAD" "date_new"
## [7] "date_new1"
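Renaming by position breaks if the column order changes. A sketch of renaming by the current name instead, on a made-up data frame (toy):

```r
# Hypothetical toy data
toy <- data.frame(date = c("2014/1/1", "2014/1/2"), dcount = c(100, 118))

# Find the column called "dcount" and rename it, wherever it is
names(toy)[names(toy) == "dcount"] <- "EAD"
names(toy)   # "date" "EAD"
```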
🦆 Change the variable name of “dcount” of data2 to “EAD”!
If the data exist in multiple locations, you’ll need to combine them before moving forward. This section shows how to add the rows (observations) and columns (variables) to a data frame.
To join two data frames vertically, use the rbind() function.
Note that the data frames must have the same variables!
data1 <- data1[,-7]
head(data1)
## date year month day EAD date_new
## 1 2014/1/1 2014 1 1 100 2014-01-01
## 2 2014/1/2 2014 1 2 118 2014-01-02
## 3 2014/1/3 2014 1 3 91 2014-01-03
## 4 2014/1/4 2014 1 4 82 2014-01-04
## 5 2014/1/5 2014 1 5 115 2014-01-05
## 6 2014/1/6 2014 1 6 93 2014-01-06
Combine it!
data3 <- rbind(data1, data2)
head(data3)
## date year month day EAD date_new
## 1 2014/1/1 2014 1 1 100 2014-01-01
## 2 2014/1/2 2014 1 2 118 2014-01-02
## 3 2014/1/3 2014 1 3 91 2014-01-03
## 4 2014/1/4 2014 1 4 82 2014-01-04
## 5 2014/1/5 2014 1 5 115 2014-01-05
## 6 2014/1/6 2014 1 6 93 2014-01-06
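For rbind() to work, data2 must already have the same column names as data1 (for example after the renaming and date conversion exercises). The sketch below uses made-up data frames (y2014, y2015, bad) to show that columns are matched by name and that a mismatched name produces an error.

```r
# Hypothetical toy data: same column names, different column order
y2014 <- data.frame(date_new = as.Date(c("2014-01-01", "2014-01-02")),
                    EAD = c(100, 118))
y2015 <- data.frame(EAD = c(109, 115),
                    date_new = as.Date(c("2015-01-01", "2015-01-02")))

both <- rbind(y2014, y2015)   # columns are matched by name
both

# A data frame with a different column name cannot be stacked
bad <- data.frame(date_new = as.Date("2015-01-03"), dcount = 108)
result <- tryCatch(rbind(y2014, bad),
                   error = function(e) "rbind failed: column names differ")
result
```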
To merge two data frames horizontally, use the merge() function. In most cases, two data frames are joined by one or more common key variables.
🦆 Let’s import environmental data!
env <- read.csv("H:/class2023-advance environmental/Basics of R programming_20230629/dataset/env.csv", skip = 1)
head(env)
## X date dmean temp
## 1 1 2014/4/1 27.083333 13.5
## 2 2 2014/4/2 14.434783 16.5
## 3 3 2014/4/3 21.043478 16.9
## 4 4 2014/4/4 13.416667 11.5
## 5 5 2014/4/5 13.375000 8.2
## 6 6 2014/4/6 8.958333 7.8
🦆 Check (and change) the format of the “date” variable!
env$date_new <- as.Date(env$date, format = "%Y/%m/%d")
head(env)
## X date dmean temp date_new
## 1 1 2014/4/1 27.083333 13.5 2014-04-01
## 2 2 2014/4/2 14.434783 16.5 2014-04-02
## 3 3 2014/4/3 21.043478 16.9 2014-04-03
## 4 4 2014/4/4 13.416667 11.5 2014-04-04
## 5 5 2014/4/5 13.375000 8.2 2014-04-05
## 6 6 2014/4/6 8.958333 7.8 2014-04-06
env <- env[,-c(1,2)]
head(env)
## dmean temp date_new
## 1 27.083333 13.5 2014-04-01
## 2 14.434783 16.5 2014-04-02
## 3 21.043478 16.9 2014-04-03
## 4 13.416667 11.5 2014-04-04
## 5 13.375000 8.2 2014-04-05
## 6 8.958333 7.8 2014-04-06
Subset data3 for only EAD and date_new
data3 <- data3[,5:6]
head(data3)
## EAD date_new
## 1 100 2014-01-01
## 2 118 2014-01-02
## 3 91 2014-01-03
## 4 82 2014-01-04
## 5 115 2014-01-05
## 6 93 2014-01-06
Now merge env and data3 using the merge() function.
data4 <- merge(data3, env, by= c("date_new"), all.x = T)
head(data4)
## date_new EAD dmean temp
## 1 2014-01-01 100 NA NA
## 2 2014-01-02 118 NA NA
## 3 2014-01-03 91 NA NA
## 4 2014-01-04 82 NA NA
## 5 2014-01-05 115 NA NA
## 6 2014-01-06 93 NA NA
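The NA values appear because all.x = T keeps every row of the first data frame, even when env has no matching date. A small sketch on made-up data (ead and env_toy are hypothetical) comparing the main options of merge():

```r
# Hypothetical toy data: the two tables overlap on two dates only
ead <- data.frame(date_new = as.Date(c("2014-03-31", "2014-04-01", "2014-04-02")),
                  EAD = c(90, 85, 77))
env_toy <- data.frame(date_new = as.Date(c("2014-04-01", "2014-04-02", "2014-04-03")),
                      temp = c(13.5, 16.5, 16.9))

left  <- merge(ead, env_toy, by = "date_new", all.x = TRUE)  # keep all rows of ead
inner <- merge(ead, env_toy, by = "date_new")                # keep matching rows only
full  <- merge(ead, env_toy, by = "date_new", all = TRUE)    # keep all rows of both

left    # temp is NA for 2014-03-31
nrow(inner)
nrow(full)
```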
Summarize each variable using the summary() function.
summary(data4)
## date_new EAD dmean temp
## Min. :2014-01-01 Min. : 49.00 Min. : 5.50 Min. : 2.10
## 1st Qu.:2014-07-02 1st Qu.: 72.00 1st Qu.:14.00 1st Qu.:12.40
## Median :2014-12-31 Median : 79.00 Median :18.52 Median :19.40
## Mean :2014-12-31 Mean : 80.37 Mean :19.64 Mean :18.38
## 3rd Qu.:2015-07-01 3rd Qu.: 88.00 3rd Qu.:24.17 3rd Qu.:24.23
## Max. :2015-12-31 Max. :133.00 Max. :46.04 Max. :32.10
## NA's :5 NA's :97 NA's :90
Missing values (NA) can be identified using the is.na() function.
na_temp <- is.na(data4$temp)
sum(na_temp)
## [1] 90
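To count missing values in every column at once, one possible sketch (on a made-up data frame, toy) combines is.na() with colSums(). complete.cases() then keeps only the rows without any NA.

```r
# Hypothetical toy data with some missing values
toy <- data.frame(EAD = c(100, NA, 91), temp = c(NA, NA, 16.9))

na_per_column <- colSums(is.na(toy))   # number of NAs in each column
na_per_column

complete <- toy[complete.cases(toy), ] # rows with no NA in any column
nrow(complete)
```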
Let’s calculate the average temperature from data4!
mean(data4$temp)
## [1] NA
The result is NA because temp contains missing values. We need to omit the NAs with na.rm!
mean(data4$temp, na.rm = T)
## [1] 18.38172
🦆 Try other functions to summarize the data (e.g., mean, median, quantile)!
median(data4$temp, na.rm = T)
## [1] 19.4
quantile(data4$temp, probs = c(0.01,0.5,0.9), na.rm= T)
## 1% 50% 90%
## 3.695 19.400 27.900
A histogram displays the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin on the y-axis.
hist(data4$EAD)
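The number of bins can be adjusted. The sketch below uses simulated counts (not the EAD data) and the breaks argument of hist(); note that breaks = 20 is treated as a suggestion, so R may choose a slightly different number of bins.

```r
# Simulated daily counts, for illustration only
set.seed(1)
counts <- rpois(200, lambda = 80)

# Ask for roughly 20 bins and add a title and axis label
h <- hist(counts, breaks = 20,
          main = "Simulated daily counts", xlab = "Count")

h$breaks        # the bin boundaries R actually used
sum(h$counts)   # total number of observations across all bins
```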
A “box-and-whiskers” plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum.
boxplot(data4$EAD)
We can also draw box plots of EAD for each day of the week (DOW). Let’s make a column for day of week (dow)!
data4$dow <- weekdays(data4$date_new, abbreviate = T)
head(data4)
## date_new EAD dmean temp dow
## 1 2014-01-01 100 NA NA Wed
## 2 2014-01-02 118 NA NA Thu
## 3 2014-01-03 91 NA NA Fri
## 4 2014-01-04 82 NA NA Sat
## 5 2014-01-05 115 NA NA Sun
## 6 2014-01-06 93 NA NA Mon
Then, make the box plots for EAD by dow.
boxplot(data4$EAD ~ data4$dow)
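Because dow is a character column, the boxes are drawn in alphabetical order. A sketch of one way to put them in calendar order is to convert dow to a factor with explicit levels. The data frame toy and its EAD values are made up; the week order is built from 2014-01-06, which the output above shows to be a Monday, so the labels match the current locale.

```r
# Hypothetical toy data: two weeks of made-up counts
toy <- data.frame(
  date_new = seq(as.Date("2014-01-01"), by = "day", length.out = 14),
  EAD = c(100, 118, 91, 82, 115, 93, 93, 99, 82, 74, 88, 95, 104, 106)
)
toy$dow <- weekdays(toy$date_new, abbreviate = TRUE)

# Day labels from Monday to Sunday in the current locale
monday_first <- weekdays(as.Date("2014-01-06") + 0:6, abbreviate = TRUE)
toy$dow <- factor(toy$dow, levels = monday_first)

boxplot(EAD ~ dow, data = toy)
```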
Plot EAD and temperature against date to see how they change over time.
plot(data4$date_new, data4$EAD)
plot(data4$date_new, data4$temp)
Data can be archived or exported for use in external applications. You can use the write.csv() function to output an R object to a .csv file.
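A minimal sketch of write.csv(), assuming a made-up data frame (toy) and a temporary file so that nothing is overwritten. In practice you would give your own file path instead of tempfile().

```r
# Hypothetical toy data
toy <- data.frame(date = c("2014/1/1", "2014/1/2"), EAD = c(100, 118))

# Write to a temporary file; row.names = FALSE avoids an extra index column
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Read it back to check the result
back <- read.csv(out_file)
back
```

Without row.names = FALSE, the row numbers are saved as an extra unnamed column, which shows up as a column called X when the file is read again.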
For other file types, please check the reference below or the R documentation!
Reference
Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.