Johnson Hsieh, PhD

2015-09-06

Source: http://myfootpath.com/careers/engineering-careers/statistician-careers/

Source: http://www.r-bloggers.com/mapping-the-worlds-biggest-airlines/
Source: http://r4stats.com/2013/03/19/r-2012-growth-exceeds-sas-all-time-total/
Source: http://img.diynetwork.com/DIY/2003/09/18/t134_3ca_med.jpg
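Type "hello world" and press Enter, then check the output on screen (remember the quotation marks).
Type 1 + 1 and press Enter, then check the output; note whether quotation marks appear.
Type 1 + and press Enter, then check the output; note that the prompt in the lower-left corner changes to +.
At the > prompt, type me and press Enter.
Type me and press Tab.
Type me and press Ctrl + Enter, then watch the console pane.
Place the cursor after me, confirm that it is blinking right after me, and press Tab.
1; 2;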
[1] 1
[1] 2
"1; 2;"
[1] "1; 2;"
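A statement ends with ; or a line break (pressing Enter).
Text enclosed in single quotes ' or double quotes " is treated as a string.
A prompt that starts with + means the previous statement is not yet complete.
# basic arithmetic
1 + 2 + 3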
[1] 6
1 + 2 + 3
[1] 6
x <- 10
y <- 4
(x + y) / 2
[1] 7
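A vector is written with c() (c stands for "combine"); its elements are separated by commas.
The seq function generates a regular numeric vector (a sequence).
# basic expression of a numeric vector
c(1, 2, 3, 4)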
[1] 1 2 3 4
# simple expression
1:4
[1] 1 2 3 4
4:1
[1] 4 3 2 1
# use seq() function
seq(1, 4, 1)
[1] 1 2 3 4
seq(1, 9, by = 2) # try pressing Tab to auto-complete
[1] 1 3 5 7 9
seq(1, 9, length.out = 5)
[1] 1 3 5 7 9
Exercise: use the seq function to produce the even sequence 2, 4, 6, 8, 10.
Vectors have recycling properties, which makes element-wise arithmetic convenient.
# shorter arguments are recycled
1:3 * 2
[1] 2 4 6
1:4 + 1:2
[1] 2 4 4 6
c(0.5, 1.5, 2.5, 3.5) * c(2, 1)
[1] 1.0 1.5 5.0 3.5
# warning (why?)
1:3 * 1:2
Warning in 1:3 * 1:2: longer object length is not a multiple of shorter object length
[1] 1 4 3
1:3 + 1:4
Warning in 1:3 + 1:4: longer object length is not a multiple of shorter object length
[1] 2 4 6 5
Exercise (vector arithmetic): compute the BMI of the following five actresses. Hint: BMI = weight (kg) / height (m)^2.
height <- c(174, 158, 160, 168, 173)
weight <- c(52, 39, 42, 46, 48)
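One possible way to approach the BMI exercise is sketched below. It assumes the heights are in centimetres and the weights in kilograms, and uses the usual definition BMI = weight (kg) / height (m)^2; the variable names `height_m` and `bmi` are introduced here only for illustration.

```r
# Heights in cm and weights in kg, copied from the exercise
height <- c(174, 158, 160, 168, 173)
weight <- c(52, 39, 42, 46, 48)

# Assumption: BMI = weight (kg) / height (m)^2, so convert cm to m first
height_m <- height / 100
bmi <- weight / height_m^2

# Arithmetic is element-wise, so bmi holds one value per person
round(bmi, 1)
```

Because both vectors have length 5, no recycling is involved: each weight is divided by the squared height at the same position.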
Ways to help yourself
help.start()
ab # type `ab`, then press Tab
?abs # same as help(abs)
??abs
apropos("abs")
example(abs)
vignette()
vignette("Introduction", "Matrix")
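Functions whose names start with ab
Statistics: sum, mean, abs, sd, length, table, cut, …
x <- c(174, 158, 160, 168, 173)
sum(x) # sum of the vector's elements
[1] 833
length(x) # length of the vector (number of elements)
[1] 5
mean(x) # mean, equivalent to sum(x) / length(x)
[1] 166.6
abs(x - mean(x)) / sd(x) # absolute standardized score (|z|) of each element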
[1] 1.0088825 1.1724850 0.8998141 0.1908697 0.8725470
a <- runif(20, 1, 100) # draw 20 random numbers between 1 and 100
cut(a, c(1,20,40,60,80,100)) # group numeric data by the given breaks
 [1] (60,80]  (20,40]  (1,20]   (60,80]  (80,100] (20,40]  (1,20]
 [8] (20,40]  (1,20]   (40,60]  (60,80]  (80,100] (1,20]   (40,60]
[15] (20,40]  (40,60]  (80,100] (1,20]   (20,40]  (20,40]
Levels: (1,20] (20,40] (40,60] (60,80] (80,100]
table(c("a", "a", "b", "c", "b", "a"))
a b c
3 2 1
Sorting: order vs. sort
order(x) # positions of x's elements, from the smallest value to the largest
[1] 2 3 4 5 1
x[order(x)]
[1] 158 160 168 173 174
x[order(x, decreasing = TRUE)] # sort from largest to smallest
[1] 174 173 168 160 158
sort(x)
[1] 158 160 168 173 174
sort(x, decreasing = TRUE)
[1] 174 173 168 160 158
# `sort`: sorts the elements directly
# `order`: returns the positions that would sort the elements
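The difference matters most when a second vector has to follow the same ordering. The sketch below reuses `x` from the slides; the vector `tags` is a hypothetical set of labels invented for this illustration.

```r
x <- c(174, 158, 160, 168, 173)
# Hypothetical labels, one per element of x (not from the original data)
tags <- c("A", "B", "C", "D", "E")

idx <- order(x)            # positions of x from smallest to largest
sorted_x <- x[idx]         # same values as sort(x)
sorted_tags <- tags[idx]   # labels rearranged to follow the sorted values

identical(sorted_x, sort(x))
sorted_tags
```

`sort(x)` alone would lose the link between each value and its label, which is why `order` is usually the tool for reordering rows of a table.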
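Resampling: sample
sample(x) # shuffle x into a random order
[1] 168 158 174 160 173
sample(1:5)
[1] 1 4 5 2 3
sample(1:5, size = 3) # draw 3 values from 1:5 (without replacement)
[1] 5 1 3
sample(1:5, size = 3, replace = TRUE) # draw 3 values from 1:5 (with replacement)
[1] 5 4 5
x <- c(174, 158, 160, 168, 173)
x[1] # select the element at position 1
[1] 174
x[c(1, 3)] # select the elements at positions 1 and 3
[1] 174 160
x[c(2, 3, 1)] # select the elements at positions 2, 3, 1 (in that order)
[1] 158 160 174
# use a minus sign (-) inside [ ] to drop the given positions (negative selection)
x[-1]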
[1] 158 160 168 173
x[-c(1, 3, 4)]
[1] 158 173
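# use a comparison operator together with `which` to select values
x > 160 # TRUE where the condition holds, FALSE otherwise
[1] TRUE FALSE FALSE TRUE TRUE
index <- which(x > 160) # positions where the condition is TRUE
x[index]
[1] 174 168 173
# compress the commands: write the expression inside [ ] to shorten the code
x[which(x > 160)]
[1] 174 168 173
# logical operators can also be used for selection
x[x > 160 & x < 170] # select the positions that are TRUE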
[1] 168
Exercise: using the compressed form, get the elements of x that are greater than 170 or less than 160.
Hint: use the logical operator or (|)
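A minimal sketch of one way to do this exercise, using the same `x` as above; the name `picked` is introduced only for this illustration.

```r
x <- c(174, 158, 160, 168, 173)

# | is the element-wise "or": TRUE where either condition holds
picked <- x[x > 170 | x < 160]
picked
```

The elements keep their original order, because the logical vector only marks which positions to keep.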
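Replacing elements
x[2] <- 158.5 # replace the second element of x
x
[1] 174.0 158.5 160.0 168.0 173.0
x[c(1,3)] <- 0 # replace the first and third elements with 0
x[6] <- 166 # append 166 as the sixth element; equivalent to x <- c(x, 166)
x
[1] 0.0 158.5 0.0 168.0 173.0 166.0
x[x > 160] <- 170 # replace values greater than 160 with 170
x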
[1] 0.0 158.5 0.0 170.0 170.0 170.0
x <- c(174, 158, 160, 168, 173)
y <- c("林志玲", "蔡依林", "楊丞琳", "天心", "隋棠")
z <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
Use the class function to check an object's type.
class(x); class(y); class(z)
[1] "numeric"
[1] "character"
[1] "logical"
Types can be converted with as.character, as.numeric, and as.logical.
# a vector holds only one class (character > numeric > logical)
c(174, 52, "林志玲") # the numbers are coerced to strings
# logical TRUE is coerced to 1, FALSE to 0
c(174, 52, TRUE)
c(1.1, 2.4, TRUE, FALSE)
# every element is coerced to a string
c("林志玲", 174, 52, TRUE)
# string to number
a1 <- c("89", "91", "102")
as.numeric(a1)
# logical to number
a2 <- c(TRUE, TRUE, FALSE)
as.numeric(a2)
# number to logical
a3 <- c(-2, -1, 0, 1, 2)
# only 0 becomes FALSE
as.logical(a3)
# number to string
as.character(a3)
Sys.time() # "2015-09-03 08:50:24 CST"
[1] "2015-09-06 15:14:14 CST"
factor(c("male", "female", "female", "male"))
[1] male   female female male
Levels: female male
When a vector is a categorical variable (for example, gender or education level), R represents it as a factor.
# variable gender with 2 "male" entries and 3 "female" entries
# rep(x, n) repeats the object x n times
gender <- c(rep("male",2), rep("female", 3))
gender
[1] "male" "male" "female" "female" "female"
gender <- factor(gender)
gender
[1] male   male   female female female
Levels: female male
levels(gender)
[1] "female" "male"
as.numeric converts a factor object to its numeric codes.
# 1=female, 2=male internally (alphabetically)
as.numeric(gender)
[1] 2 2 1 1 1
# change vector of labels for the levels
factor(gender, levels=c("male", "female"), labels=c("M", "F"))
[1] M M F F F
Levels: M F
# factor to string
as.character(gender)
[1] "male" "male" "female" "female" "female"
# use cut to bin the data
x <- c(75, 81, 82, 76, 91, 92)
cut(x, breaks = c(70, 80, 90, 100))
[1] (70,80]  (80,90]  (80,90]  (70,80]  (90,100] (90,100]
Levels: (70,80] (80,90] (90,100]
x <- c("1", "2", "3", "2", "a")
as.numeric(x)
Warning: NAs introduced by coercion
[1] 1 2 3 2 NA
NA stands for "Not Available" and denotes a missing value.
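Once a vector contains NA, most summaries return NA as well unless the missing values are handled explicitly. The following is a small sketch of that behaviour with the same input; `suppressWarnings()` and the `na.rm` argument are standard base R, while the name `num` is chosen here only for illustration.

```r
x <- c("1", "2", "3", "2", "a")

# suppressWarnings() hides the coercion warning; "a" still becomes NA
num <- suppressWarnings(as.numeric(x))

is.na(num)               # TRUE only where the conversion failed
mean(num)                # NA, because one element is missing
mean(num, na.rm = TRUE)  # drop NA values before averaging
```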
Unit: millions (元)
5,023,763
5,614,679
6,205,338
gdp <- c("5,023,763", "5,614,679", "6,205,338")
as.numeric(gsub(",", "", gdp))
[1] 5023763 5614679 6205338
Exercise: convert Republic of China (Minguo) calendar years (strings) to Gregorian years (numeric).
year <- c("民國101", "民國102", "民國103", "民國104")
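One possible approach follows the same pattern as the `gdp` example: strip the non-numeric part with `gsub`, convert, then shift. The sketch assumes the usual offset of 1911 between the Minguo and Gregorian calendars; the name `ad_year` is introduced here only for illustration.

```r
year <- c("民國101", "民國102", "民國103", "民國104")

# Remove the "民國" prefix, convert the remainder to a number,
# then add 1911 (assumption: Gregorian year = Minguo year + 1911)
ad_year <- as.numeric(gsub("民國", "", year, fixed = TRUE)) + 1911
ad_year
```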
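A list is a vector of R objects.
A data.frame is a vector of R objects that all have the same length.
data.frame is the most commonly used object.
The data.frame concept is very common across data-processing fields.
The form of a data.frame
What a data.frame can do
Advanced processing that starts from a data.frame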
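The relationship between the two structures can be checked directly. In this sketch all names and values (`info`, `df`, the columns) are made up for illustration.

```r
# A list can hold objects of different types and lengths
info <- list(name = "ubike", stations = c(1, 2, 3), active = TRUE)
length(info)              # number of components

# A data.frame is a list whose components (columns) all have the same length
df <- data.frame(
  height = c(174, 158, 160),
  weight = c(52, 39, 42),
  stringsAsFactors = FALSE
)
is.list(df)               # a data.frame is also a list
sapply(df, length)        # every column has the same number of elements
df[["height"]]            # extract one column as a vector
df[2, "weight"]           # row/column coordinates
```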
| Column | Type | Column | Type | Column | Type |
|---|---|---|---|---|---|
| date | character | tot | integer | min.bemp | integer |
| hour | integer | avg.sbi | numeric | std.bemp | numeric |
| sno | integer | max.sbi | integer | temp | numeric |
| sarea | character | min.sbi | integer | humidity | numeric |
| sna | character | std.sbi | numeric | pressure | numeric |
| lat | numeric | avg.bemp | numeric | max.anemo | numeric |
| lng | numeric | max.bemp | integer | rainfall | numeric |
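
| Column | Type | Column | Type | Column | Type |
|---|---|---|---|---|---|
| 日期 (date) | character | 總停車格 (total docks) | integer | 最小空位數 (min empty docks) | integer |
| 時間 (hour) | integer | 平均車輛數 (mean bikes) | numeric | 空位數標準差 (SD of empty docks) | numeric |
| 場站代號 (station ID) | integer | 最大車輛數 (max bikes) | integer | 平均氣溫 (mean temperature) | numeric |
| 場站區域 (station district) | character | 最小車輛數 (min bikes) | integer | 溼度 (humidity) | numeric |
| 場站名稱 (station name) | character | 車輛數標準差 (SD of bikes) | numeric | 氣壓 (pressure) | numeric |
| 緯度 (latitude) | numeric | 平均空位數 (mean empty docks) | numeric | 最大風速 (max wind speed) | numeric |
| 經度 (longitude) | numeric | 最大空位數 (max empty docks) | integer | 降雨量 (rainfall) | numeric |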
# path <- "data/ubikeweatherbig5.csv"
path <- file.choose()
readLines(path, n = 5)
ubike <- read.table(path, sep = ",", header = TRUE, nrows = 100)
head(ubike)
ubike <- read.table(path, sep = ",", header = TRUE,
colClasses = c("factor", "integer", "integer", "factor", "factor",
"numeric", "numeric", "integer", "numeric", "integer", "integer",
"numeric", "numeric", "integer", "integer", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric"))
# object.size(ubike) # about 86 MB
path <- "wrong_path"
power <- read.table(file = path, header = TRUE, sep = ",")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'wrong_path': No such file or directory
Use getwd to find R's current working directory.
path <- "data/ubikeweatherbig5.csv"
power <- read.table(file = path, header = TRUE, sep = "1")
Error in read.table(file = path, header = TRUE, sep = "1") : more columns than column names
path <- "data/ubikeweatherbig5.csv"
power <- read.table(file = path, header = TRUE, sep = ",", nrows = 10)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals, : invalid multibyte string at '<ab>H<b8>q<b0><cf>'
Encodings: UTF-8 and BIG-5.
Use the file function to specify the encoding.
Use readLines, iconv, and write to produce a file that matches the system encoding.
data.frame
class(ubike)
[1] "data.frame"
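Returning to the encoding notes above, the sketch below shows how an encoding can be declared when reading. It builds a tiny BIG-5 file in a temporary location so that it runs without the course data; the file content, the column names `sna` and `tot`, and the variable names are all made up, and whether the encoding name "BIG5" is recognised depends on the platform's iconv.

```r
# Build a small BIG-5 encoded CSV in a temporary file (made-up content)
utf8_lines <- c("sna,tot", "\u5e02\u653f\u5e9c,180")
big5_lines <- iconv(utf8_lines, from = "UTF-8", to = "BIG5", toRaw = TRUE)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
for (line in big5_lines) {
  writeBin(c(line, charToRaw("\n")), con)   # raw bytes plus a newline
}
close(con)

# Option 1: pass a connection that declares the file's encoding
demo1 <- read.table(file(tmp, encoding = "BIG5"), sep = ",", header = TRUE)

# Option 2: let read.table re-encode through its fileEncoding argument
demo2 <- read.table(tmp, sep = ",", header = TRUE, fileEncoding = "BIG5")
```

Both options convert the bytes to the session's native encoding while reading, which is what avoids the "invalid multibyte string" error.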
colnames(ubike) <-
c("日期", "時間", "場站代號", "場站區域", "場站名稱",
"緯度", "經度", "總停車格", "平均車輛數", "最大車輛數",
"最小車輛數", "車輛數標準差", "平均空位數", "最大空位數",
"最小空位數", "空位數標準差", "平均氣溫", "溼度",
"氣壓", "最大風速", "降雨量")
# install.packages("RSQLite")
library(RSQLite)
Loading required package: DBI
db.path <- "ubike.db"
drv <- dbDriver("SQLite")
db <- dbConnect(drv, db.path)
dbWriteTable(db, "ubike", head(ubike))
dbListTables(db)
dbReadTable(db, "ubike")
dbDisconnect(db)
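Beyond writing and reading whole tables, a database connection can also run SQL so that filtering happens inside the database. This is a minimal sketch using an in-memory SQLite database; the table `toy` and its columns are invented for illustration and are not part of the ubike data.

```r
library(DBI)
library(RSQLite)

# ":memory:" creates a temporary in-memory database (nothing is written to disk)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table standing in for ubike; the column names are made up
toy <- data.frame(sno = c(1, 2, 3), tot = c(180, 48, 40))
dbWriteTable(con, "toy", toy)

# Let the database do the filtering and return a data.frame
big <- dbGetQuery(con, "SELECT sno, tot FROM toy WHERE tot > 45")
big

dbDisconnect(con)
```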
RMySQL, RPostgreSQL, ROracle, RJDBC, RODBC
rmongodb, rredis
The XML package and XPath
The RJSONIO package
ubike[2, 3]
[1] 2
        日期 時間 場站代號
1 2014-12-08   15        1
2 2014-12-08   15        2
3 2014-12-08   15        3
4 2014-12-08   15        4
5 2014-12-08   15        5
6 2014-12-08   15        6
head(ubike[["日期"]])
[1] "2014-12-08" "2014-12-08" "2014-12-08" "2014-12-08" "2014-12-08"
[6] "2014-12-08"
# head(ubike$日期)
head(ubike[,1])
[1] "2014-12-08" "2014-12-08" "2014-12-08" "2014-12-08" "2014-12-08"
[6] "2014-12-08"
Exercise: extract all rows whose 場站代號 is 1.
Select 場站代號 from ubike.
Inspect the distinct values with unique.
Compare with 1.
Convert the result to positions with which.
ans1 <- ubike[["場站代號"]]
ans2 <- unique(ans1)
ans3 <- ans1 == 1
ans4 <- which(ans3)
ans5 <- ubike[ans3,]
ans5 <- ubike[ans4,]
Select 場站代號 from ubike.
Compare with 1099.
After selecting the rows from step 2 out of ubike, use the method of step 1 to select 平均氣溫.
3.1 Row and column coordinates can be used to select the result in one step.
ubike[ubike[["場站代號"]] == 1 & ubike[["日期"]] == "2015-03-01",]
x1 <- ubike[["場站代號"]] == 1
x2 <- ubike[["日期"]] == "2015-03-01"
x3 <- x1 & x2
x4 <- ubike[x3,]
magrittr partly solves this problem.
ans1 <- ubike[["場站代號"]]
ans1.1 <- unique(ans1)
unique(ubike[["場站代號"]])
# install.packages("magrittr")
library(magrittr)
ubike[["場站代號"]] %>%
unique
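The pipe rewrites nested calls as a left-to-right chain. A small sketch on a made-up vector `v` (not the ubike data) shows the equivalence:

```r
library(magrittr)

v <- c(3, 1, 3, 2, 1)

# Nested form: read from the inside out
nested <- length(unique(v))

# Piped form: x %>% f is f(x), so the steps read left to right
piped <- v %>% unique %>% length

# Extra arguments follow the piped value: x %>% f(y) is f(x, y)
sorted <- v %>% unique %>% sort(decreasing = TRUE)
```

Both forms compute the same result; the piped form simply avoids intermediate variables such as `ans1` and `ans1.1`.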
dplyr is designed for data.frame (the d in its name).
It works on a data.frame or on a table in a database.
vignette
vignette(all = TRUE, package = "dplyr")
vignette("introduction", package = "dplyr")
filter: select rows
select: select columns
mutate: modify or add columns
arrange: sort rows
group_by + summarise: summarise by group
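To make the five verbs concrete, here is a minimal sketch on a toy data frame. The table `toy` and its columns (`area`, `sno`, `tot`, `bike`) are invented stand-ins and do not come from the ubike file.

```r
library(dplyr)

# Toy data standing in for ubike; names and values are made up
toy <- data.frame(
  area = c("A", "A", "B", "B"),
  sno  = c(1, 2, 3, 4),
  tot  = c(180, 48, 40, 60),
  bike = c(90, 10, 30, 20),
  stringsAsFactors = FALSE
)

filter(toy, tot > 45)               # keep rows that satisfy a condition
select(toy, sno, tot)               # keep only the named columns
mutate(toy, ratio = bike / tot)     # add a column computed from others
arrange(toy, desc(tot))             # sort rows, largest tot first

# group_by + summarise: one summary row per group
by_area <- toy %>%
  group_by(area) %>%
  summarise(stations = n(), mean_tot = mean(tot))
by_area
```

Each verb takes a data frame as its first argument and returns a new one, which is what lets them be chained with `%>%`.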
How to use sd
?sd
Try to learn on your own how the standard deviation function is used.
Select the data with 場站代號 equal to 1 and 日期 equal to "2015-03-01".
Compute the standard deviation of 降雨量 at 捷運市政府站(3號出口) on "2015-03-01".
x1 <- ubike[["場站代號"]] == 1
x2 <- ubike[["日期"]] == "2015-03-01"
ubike[x1 & x2, "降雨量"]
 [1] 0.000 0.432 0.702 0.947 1.129 1.224 1.241 1.218 1.201 1.207 1.225
[12] 1.233 1.227 1.218 1.220 1.233 1.244 1.246 1.242 1.242 1.249 1.257
[23] 1.258 1.252
sd(ubike[x1 & x2, "降雨量"])
[1] 0.3078623
How to use sd
library(dplyr)
sd(select(
filter(ubike, 場站代號 == 1, 日期 == "2015-03-01"),
降雨量)[["降雨量"]])
filter(ubike, 場站代號 == 1, 日期 == "2015-03-01") %>%
select(降雨量) %>%
extract2("降雨量") %>%
sd
group_by
group_by(ubike, 日期) %>% summarise(平均降雨量 = mean(降雨量))
group_by
group_by(ubike, 場站區域) %>%
  summarise(站點數 = length(unique(場站代號))) %>%
  arrange(站點數)
group_by(ubike, 場站區域) %>% summarise(站點代號清單 = paste(unique(場站代號), collapse = ","))
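Explore a qualitative variable: use table to count how many times each 場站名稱 (station name) appears.
ftable: qualitative vs. qualitative
bar chart: qualitative vs. quantitative
scatter plot: quantitative vs. quantitative
?ftable
example(ftable)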
ftable> ## Start with a contingency table.
ftable> ftable(Titanic, row.vars = 1:3)
                   Survived  No Yes
Class Sex    Age
1st   Male   Child            0   5
             Adult          118  57
      Female Child            0   1
             Adult            4 140
Select the rows in 信義區 whose 日期 is "2015-03-01".
Select 平均車輛數 and 總停車格.
Check whether 平均車輛數 exceeds half of 總停車格.
Define 空位較多 (more empty docks).
Select 時間.
Cross-tabulate 時間 against 空位較多.
x1 <- ubike[["場站區域"]] == "信義區"
x2 <- ubike[["日期"]] == "2015-03-01"
x3 <- ubike[x1 & x2, "平均車輛數"]
x4 <- ubike[x1 & x2, "總停車格"]
x5 <- x3 < x4 / 2
x6 <- ubike[x1 & x2, "時間"]
ftable(x6, x5)
x1 <- filter(ubike, 場站區域 == "信義區", 日期 == "2015-03-01")
x2 <- mutate(x1, 空位較多 = 平均車輛數 < 總停車格 / 2)
ftable(x2[["時間"]], x2[["空位較多"]])
tbl <- filter(ubike, 場站區域 == "信義區", 日期 == "2015-03-01") %>%
  mutate(空位較多 = 平均車輛數 < 總停車格 / 2)
ftable(tbl[["時間"]], tbl[["空位較多"]])
ggplot(data=..., aes(x=..., y=...)) + geom_xxx(...) + stat_xxx(...) + facet_xxx(...) + ...
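ggplot describes where the data comes from.
aes describes how elements of the plot map to the data.
geom_xxx describes the type of plot to draw and its adjustable parameters.
Commonly used types include geom_bar, geom_point, geom_line …
Store the data in a data.frame (a matrix object is not accepted).
Use long format (the reshape2 package can convert data so that 1 row = 1 observation).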
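The template can be seen at work on a small long-format data frame. Everything in this sketch (the data frame `long`, its columns, and the values) is made up for illustration.

```r
library(ggplot2)

# Long format: one row per observation (values are made up)
long <- data.frame(
  hour    = rep(0:3, times = 2),
  station = rep(c("S1", "S2"), each = 4),
  bikes   = c(10, 12, 9, 7, 20, 18, 21, 25)
)

p <- ggplot(data = long, aes(x = hour, y = bikes, colour = station)) +
  geom_line() +                 # one line per station
  geom_point() +                # mark each observation
  facet_wrap(~ station)         # one panel per station
p
```

Because the station is stored as a column rather than spread across several columns, the same `aes` mapping can drive both the colour and the panels.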
# grepl("string to search for", x, fixed = TRUE)
x1.1 <- grepl("2015-02", ubike[["日期"]], fixed = TRUE)
x1.2 <- ubike[["場站區域"]] == "信義區"
x2 <- group_by(ubike[x1.1,], 場站名稱)
x3 <- summarise(x2, 平均降雨量 = mean(降雨量))
x3 <- filter(ubike, grepl("2015-02", 日期, fixed = TRUE),
場站區域 == "信義區") %>%
group_by(場站名稱) %>% summarise(平均降雨量=mean(降雨量))
library(ggplot2)
# start from the complete theme, then override the text size so it is not reset
thm <- theme_gray(base_family = "STHeiti") + theme(text=element_text(size=18))
las2 <- theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(x3) +
  geom_bar(aes(x = 場站名稱, y = 平均降雨量), stat = "identity") +
  thm + las2
x1.1 <- grepl("2015-02", ubike[["日期"]], fixed = TRUE)
x1.2 <- ubike[["場站區域"]] == "信義區"
x2 <- group_by(ubike[x1.1,], 場站名稱)
# x3 <- summarise(x2, 平均降雨量 = mean(降雨量))
x3 <- filter(ubike, grepl("2015-02", 日期, fixed = TRUE),
場站區域 == "信義區") # %>%
# group_by(場站名稱) %>% summarise(平均降雨量=mean(降雨量))
ggplot(x3) +
geom_boxplot(aes(x = 場站名稱, y = 降雨量)) +
thm + las2
# grepl("string to search for", x, fixed = TRUE)
x1.1 <- grepl("2015-02", ubike[["日期"]], fixed = TRUE)
x1.2 <- ubike[["場站區域"]] == "信義區"
x2 <- group_by(ubike[x1.1,], 場站名稱)
x3 <- summarise(x2, 平均降雨量 = mean(降雨量), 平均溼度 = mean(溼度))
x3 <- filter(ubike, grepl("2015-02", 日期, fixed = TRUE),
場站區域 == "信義區") %>%
group_by(場站名稱) %>% summarise(平均降雨量 = mean(降雨量), 平均溼度 = mean(溼度))
ggplot(x3) + geom_point(aes(x = 平均溼度, y = 平均降雨量)) + thm + las2
ggplot(x3) + geom_point(aes(x = 平均溼度, y = 平均降雨量, colour = 場站名稱)) + thm + las2
savePlot
bmp, png, jpeg, or tiff
ggsave
write.csv
The xtable package
The sos package; see the demo.
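As a rough illustration of the export functions listed above, the sketch below writes a table and two figures to a temporary folder. The data frame, file names, and sizes are made up; `capabilities("png")` guards the base-graphics part in case the PNG device is unavailable on the machine.

```r
library(ggplot2)

df <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))    # made-up data
p <- ggplot(df, aes(x = x, y = y)) + geom_point()

out_dir <- tempdir()                 # temporary folder, used only for this demo
csv_path <- file.path(out_dir, "demo.csv")
pdf_path <- file.path(out_dir, "demo.pdf")

write.csv(df, csv_path, row.names = FALSE)          # table to CSV
ggsave(pdf_path, plot = p, width = 4, height = 3)   # ggplot2 figure to file

# Base graphics go through a device function such as png()
if (capabilities("png")) {
  png_path <- file.path(out_dir, "demo.png")
  png(png_path, width = 480, height = 360)
  plot(df$x, df$y)
  dev.off()
}
```

`ggsave` picks the output format from the file extension, so changing "demo.pdf" to "demo.png" would be enough to switch formats where that device is supported.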