1 課程目標

R的控制流程，也就是條件判斷與迴圈，用途是讓電腦可以按照給定的邏輯，反覆執行指令，一直到滿足條件為止，這對於運算以及擷取資料非常有幫助。例如我們隨機找到13位民眾，訪問其身高與體重，得到身體質量指數(BMI)，想依據民眾的BMI是否在18.5到24之間，判斷他們的健康情形，並且顯示成為一個資料框，包含原始資料以及計算得到的BMI與判斷是否健康。BMI計算公式是以體重（公斤）除以身高（公尺）的平方。可寫語法如下：

library(dplyr)
bmi.f<-
    function(x,y){
   bmi<-y/x^2
   n<-length(x)
   for (i in 1:n)
  if(bmi[i]>=24){
 cat("No.", i, "BMI is:", bmi[i], ". Please lose weight.\n")
   }else if(bmi[i]>=18.5){
cat("No.", i, "BMI is:", bmi[i], ". Congratulations.\n")
   }else {
cat("No.", i, "BMI is:", bmi[i], ". Please gain weight.\n")
   }
  df<-data.frame(height=h, weight=w, BMI=bmi, row.names=NULL)
df<- df %>% mutate(health=ifelse(BMI>=24|BMI<=18.5, 'not healthy', 'healthy'))
  return(df)
 }

h<-c(1.55, 1.72, 1.72, 1.70, 1.85, 1.62, 1.68, 1.73, 1.71, 1.67, 1.55, 1.65, 1.79)
w<-c(49, 61, 59, 90, 85, 48, 65, 77, 66, 80,45,55, 81)
bmi.f(h, w)
## No. 1 BMI is: 20.4 . Congratulations.
## No. 2 BMI is: 20.62 . Congratulations.
## No. 3 BMI is: 19.94 . Congratulations.
## No. 4 BMI is: 31.14 . Please lose weight.
## No. 5 BMI is: 24.84 . Please lose weight.
## No. 6 BMI is: 18.29 . Please gain weight.
## No. 7 BMI is: 23.03 . Congratulations.
## No. 8 BMI is: 25.73 . Please lose weight.
## No. 9 BMI is: 22.57 . Congratulations.
## No. 10 BMI is: 28.69 . Please lose weight.
## No. 11 BMI is: 18.73 . Congratulations.
## No. 12 BMI is: 20.2 . Congratulations.
## No. 13 BMI is: 25.28 . Please lose weight.
##    height weight   BMI      health
## 1    1.55     49 20.40     healthy
## 2    1.72     61 20.62     healthy
## 3    1.72     59 19.94     healthy
## 4    1.70     90 31.14 not healthy
## 5    1.85     85 24.84 not healthy
## 6    1.62     48 18.29 not healthy
## 7    1.68     65 23.03     healthy
## 8    1.73     77 25.73 not healthy
## 9    1.71     66 22.57     healthy
## 10   1.67     80 28.69 not healthy
## 11   1.55     45 18.73     healthy
## 12   1.65     55 20.20     healthy
## 13   1.79     81 25.28 not healthy

我們先介紹條件判斷，再進入迴圈，最後是遞迴。

2 條件判斷

2.1 ifelse()

ifelse()適用在「非A即B」的邏輯，也就是若A為真則進行某一動作，若A不為真則不進行。
例如面訪員到某一個家戶訪問，戶中有4人，訪員需要進行戶中抽樣，但是前提是只訪問18歲以上成年人，語法可以這樣寫：

x=c(20, 50, 16, 18)
interview<-ifelse(x>=18, "Yes", "No")
print(interview)
## [1] "Yes" "Yes" "No"  "Yes"

ifelse()可以判斷一個向量的真偽，其功能類似先前提到的變數轉換：

vote <- rep(NA, 3)
vote[x>=18]<-"Yes"
vote[x<18]<-"No"
vote
## [1] "Yes" "Yes" "No"  "Yes"

\(\texttt{ifelse}\)可以增加條件判斷，例如要同時判斷年齡大於等於18歲以及是否戶籍(reg)設在受訪的縣市，符合這兩個條件是1，不符合是0，可寫成如下：

x=c(20, 50, 16, 18, 15)
reg<-c('T', 'F', 'F', 'T', 'T')#T:設籍

y<-ifelse(x>=18 & reg=='T', 1, 0)
y
## [1] 1 0 0 1 0

ifelse()也可以轉換字串為字串或者是數字，也可以把日期轉換為字串或是數字。

S<-c("2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01")
S <- as.Date(S, format='%Y-%m-%d')
day<-as.Date("2018/02/28", format='%Y/%m/%d')
new.S<-ifelse(difftime(day, S)>=0, "Earlier", "Later")
new.S
## [1] "Earlier" "Earlier" "Later"   "Later"

當條件成立時的執行的動作可以是函數，例如在日光節約時間結束之後，下午茶時間延後一個小時。假設日光節約時間在11月5日結束，三個日期的下午茶時間可計算如下：


ds<-as.POSIXct("2018-11-05 00:00:00", 
      format = "%Y-%m-%d %H:%M:%S", tz="Asia/Taipei")

teatime<-as.POSIXct(c("2018-07-01 13:50:00", "2018-12-01 14:10:00", 
      "2019-02-02 14:15:00"),  format = "%Y-%m-%d %H:%M:%S", tz="Asia/Taipei")

hrs <- function(u) {
 x <- u * 3600
 return(x)
}

ifelse(difftime(teatime, ds)>=0, paste(teatime+hrs(1)), paste(teatime))
## [1] "2018-07-01 13:50:00" "2018-12-01 15:10:00" "2019-02-02 15:15:00"

請注意時間的格式，以及設定時區的函數as.POSIXct()。另外，paste()函數可貼上計算結果。

2.2 if-else

if-else 可以幫助我們建立一個條件式的函數，當函數內的向量滿足某個條件，便會進行某一個動作。要注意一次只能處理一個向量的元素。

temperature<-c(31) 
if (temperature>28){
   cat ("Turn on air condition")
}else {
  cat ("Turn off air condition")
}
## Turn on air condition

也可以進行運算

speed<-50 
if (speed>=70){
     print(speed)
}else {
    print(speed*1.6)
}
## [1] 80

如果進行迴圈的條件是向量中的某一個或是所有元素符合某項條件，其他元素也會隨之進行我們希望的動作。例如mtcars 資料中有mpg這個變數，如果是任何一輛車mpg高於50就顯示符合該條件的所有資料，反之輸出NA：

mpg <- mtcars$mpg
y <- rep(0, length(mpg))
if (any(mpg>=50)){
      mtcars[which(mpg>=50),]
} else{
      print('NA')
}
## [1] "NA"

此處使用any()這個函數，表示當物件或是多個物件滿足某一條件則回傳TRUE。例如：

A<-c(-1, 1.5); B<-c(2); C<-c('OK')
any (A>0)
## [1] TRUE
any (is.numeric(A) & is.numeric(B) & is.numeric(C))
## [1] FALSE
any (is.numeric(A) | is.numeric(B) | is.numeric(C))
## [1] TRUE

☛請練習如果「所有」車輛超過時速110，警察就會開罰單，顯示最快的車速，否則都不會開單。車速是130, 115, 120。

2.3 if-else if-else

if-else if-else 可以幫助我們建立一個條件式的函數，當函數內的向量滿足第一個條件，便會進行第一個動作，滿足第二個條件，便會進行第二個動作，一直到結束。例如電影的長度如果長過180分鐘則是太長，165分鐘是長，不到165分鐘是短。同樣的，一次只能處理一個向量的元素。

movie<-c(176) 
if(movie>=180){
     cat('Very long')
    }else if(movie>=165){
    cat('Long')
  }else{
          cat('Short')
}
## Long

在寫if-else時，請注意}應該與else或是else if連在一起。
假設在入住前3個月（90天）訂飯店為原價的85折，2個月（60天）訂飯店為原價的9折，入住前1周到兩個月是原價，入住日期為1周內則是原價加上兩成。如果入住日期為今年5月20日，原價是3000元，請練習今天的日期以及隔兩週後的今天訂房的話，分別需要多少錢？

booking<-as.Date(Sys.Date(), format='%Y-%m-%d')
booking
## [1] "2025-03-10"
checkin<-as.Date(c("2020-05-20"),    format='%Y-%m-%d')

if (difftime(checkin, booking)>90){
    print (3000*0.85)
}else if (difftime(checkin, booking)>=60){
   print (3000*0.9)
}else if (difftime(checkin, booking)>=7){
   print (3000)
}else{
  print (3000*1.2)
}
## [1] 3600

 booking <-booking + 14
if (difftime(checkin, booking)>90){
    print (3000*0.85)
}else if (difftime(checkin, booking)>=60){
   print (3000*0.9)
}else if (difftime(checkin, booking)>=30){
   print (3000)
}else{
  print (3000*1.2)
}
## [1] 3600

3 迴圈

3.1 for

for迴圈的功能是系統根據起始的條件，反覆進行同一動作。
例如重複顯示一個句子\(n\)次：

for (U in 1:5){
  cat("All work and no play","\n")
}
## All work and no play 
## All work and no play 
## All work and no play 
## All work and no play 
## All work and no play

在這個迴圈中，U是一個變數，雖然U沒有出現在後面的語法，但是系統會執行該語法所設定的次數。
也可以貼上次數：

for (i in 1:5){
  cat("All work and no play", paste(i), "times \n")
}
## All work and no play 1 times 
## All work and no play 2 times 
## All work and no play 3 times 
## All work and no play 4 times 
## All work and no play 5 times

在這個迴圈中，我們用paste()這個函數貼上\(i\)這個變數的值。
又例如從1加到10：

sum<-0
for (i in 1:10){
  sum = sum + i
  }
print(sum)

[1] 55

如果我們想保留每一次的計算結果，可以用迴圈，儲存計算結果在一個新變數：

sum<-0
x<-c(0:10)
y<-c()
for (i in 1:10){
  sum = sum + x[i]
  y[i]=sum
  cat(y[i], '+', x[i+1], '=', paste(sum(y[i]+x[i+1])),"\n")
    }
## 0 + 1 = 1 
## 1 + 2 = 3 
## 3 + 3 = 6 
## 6 + 4 = 10 
## 10 + 5 = 15 
## 15 + 6 = 21 
## 21 + 7 = 28 
## 28 + 8 = 36 
## 36 + 9 = 45 
## 45 + 10 = 55
print(y)
##  [1]  0  1  3  6 10 15 21 28 36 45

回到判斷價格的例子，如果有5個貨品，超出100元打9折，超出90寫ok，其他記錄sale，可以用for迴圈如下：

price<-c(76, 98, 100, 120, 65)
tmp<-data.frame(price, fix=rep(NA,5))
for (i in 1:5){
if(price[i]>=100){
   tmp[i, 2] = paste(price[i]*0.9)
}else if(price[i]>=90){
    tmp[i, 2]=paste('ok')
}else{
     tmp[i, 2]=paste('sale')
}
}
tmp
##   price  fix
## 1    76 sale
## 2    98   ok
## 3   100   90
## 4   120  108
## 5    65 sale

3.1.1 用sample隨機產生亂數

擲三顆公正的六面骰子1000次並且加總點數，將每一次得到的總和畫成長條圖 3.1：

set.seed(02138)
dice <- seq(1:6)
x <- c()
for (i in 1:1000){
  x[i]<-sum(sample(dice, 1), sample(dice, 1), sample(dice, 1)) 
}
# graphic
df<-data.frame(Dice=x)
library(ggplot2)
g <- ggplot(aes(Dice), data=df) + 
  geom_histogram(binwidth = 0.8, 
    fill='lightgreen', aes(y=..density..), 
    position="identity")
g

Figure 3.1: 三顆骰子點數長條圖

可以看出點數的總和近似常態分佈，集中在10點附近。
這個迴圈運用到「索引」的概念，紀錄每一次抽樣並且加總的結果，但是不需要顯示在螢幕上，而是成為一個向量，作為後續統計的資料。

3.1.2 for 與函數

撲克牌點數為1到13點，抽出三張牌，如果前兩張的點數總和小於17，而且其中一張牌小於10，那麼就抽第三張，然後顯示三張的總和；如果不符合前一個條件，那麼就顯示兩張的總和。當我們給定隨機亂數的數字，我們設定的條件式函數根據隨機亂數得到的結果輸出。

card<-function(x) {
set.seed(x)
for (i in 1:3)
  x[i]<-sample(1:13, 1)
  if (x[1]+x[2]<17 & x[1]<10 | x[2]<10 ){
  print(x[1:3])
  cat(sum(x[1:3]),"is sum of three cards \n")
    }else {
        print(x[1:2])
        cat(sum(x[1:2]), "is sum of the first 2 cards \n")
        }
  }
card(100); card(5003); card(02138)
## [1] 10  7  6
## 23 is sum of three cards
## [1] 3 2 6
## 11 is sum of three cards
## [1] 11  5  1
## 17 is sum of three cards

上述的程式可以產生一個自訂函數card()，該函數的參數x是任意整數，將產生隨機亂數。這個迴圈同樣運用到「索引」的概念。

3.1.3 for 與 if-else if-else

某飯店單人房原價是3000元。假設在入住前3個月（90天）訂房為原價的85折，不到3個月但是超過2個月（60天）訂房為原價的9折，不到2個月但是超過1個月（30天）是原價，入住前1個月內則是原價加上兩成。如果入住日期為今年的12月31日、4月20日、5月20日、6月1日、6月30日以及隔一週的今天，請練習如果今天的日期訂房的話，分別需要多少錢？

today<-as.Date(Sys.Date(), format='%Y-%m-%d')

hotel <- function(checkin){
n <- length(checkin)
price <- 3000
diff <- difftime(checkin, today)  

for (i in 1:n)
    if (diff[i]>90){
          print(checkin[i])
             cat (round(diff[i]/30,1), "months:", price*0.85, "\n")
      }else if (diff[i]>=60){
              print(checkin[i])
              cat (round(diff[i]/30,1), "months:",price*0.9,"\n")
    }else if (diff[i]>=30){
            print(checkin[i])
            cat (round(diff[i]/30,1), "months:",price,"\n")
    }else{
            print(checkin[i])
            cat (diff[i],  "days:",price*1.2, "\n")
    }
 }
checkin<-as.Date(c("2022-12-31", "2022-04-20","2022-05-20",
                   "2022-06-01"), format='%Y-%m-%d')
hotel(checkin)
## [1] "2022-12-31"
## -800 days: 3600 
## [1] "2022-04-20"
## -1055 days: 3600 
## [1] "2022-05-20"
## -1025 days: 3600 
## [1] "2022-06-01"
## -1013 days: 3600

3.1.4 雙重迴圈

如果需要兩個變數才能產生所需要的結果，可以考慮迴圈之中的迴圈，例如我們想知道11到20的乘法表，需要兩個變數相乘：

multiplication <- matrix(nrow=10, ncol=10)
for (i in 1:dim(multiplication)[1]){
  for (j in 1:dim(multiplication)[2]){
    multiplication[i,j] <- (i+10)*(j+10)
  }
}
rownames(multiplication)<-c(11:20)
colnames(multiplication)<-c(11:20)
multiplication
##     11  12  13  14  15  16  17  18  19  20
## 11 121 132 143 154 165 176 187 198 209 220
## 12 132 144 156 168 180 192 204 216 228 240
## 13 143 156 169 182 195 208 221 234 247 260
## 14 154 168 182 196 210 224 238 252 266 280
## 15 165 180 195 210 225 240 255 270 285 300
## 16 176 192 208 224 240 256 272 288 304 320
## 17 187 204 221 238 255 272 289 306 323 340
## 18 198 216 234 252 270 288 306 324 342 360
## 19 209 228 247 266 285 304 323 342 361 380
## 20 220 240 260 280 300 320 340 360 380 400

☛請嘗試用陣列加上迴圈產生三維的資料，例如我們模擬兩種隨機抽樣方式，第一種方式從5種隨機分佈每次抽出1, 10, 100, 200, 500, 1000個樣本，然後計算其平均值。第二種方式除了從5種隨機分佈每次抽出1, 10, 100, 200, 500, 1000個樣本，然後計算其平均值，還要重複以上步驟10, 100,1000次，分別計算平均值。

set.seed(02138)
sampleresult <- matrix(nrow=6, ncol=5)
R<-c(1, 10, 100, 200, 500, 1000)
L<-list(rnorm(1e+04,0,1), rnorm(1e+05,0,1), 
rnorm(1e+06,0,1), rnorm(1e+07,0,1), rnorm(1e+08,0,1))

for (i in 1:length(R)){
  for (j in 1:5){
    sampleresult[i,j] <- mean(sample (L[[j]], size=R[i], replace=T))
  }
}
sampleresult

#replication
sampleresult2 <- array(dim=c(6, 5, 3))
S<-c(10,100, 1000)
for (i in 1:length(R)){
  for (j in 1:5){
    for(s in 1:length(S)){
      su<-c();
    sampleresult2[i,j,s] <- mean({su[s]=mean(sample 
    (L[[j]], size=R[i], replace=T))})
  }
  }
}
sampleresult2

3.1.5 清理資料

for迴圈可以幫助我們清理資料，例如讀取一筆23個縣市的統計資料：

cs<-here::here('data','CS3171D1A.csv')
stat.dat<-read.csv(cs,
                  header=TRUE,sep=";",dec=".",
                  fileEncoding="BIG5")
head(stat.dat)
##                        X 臺北縣 宜蘭縣 桃園縣 新竹縣 苗栗縣 臺中縣 彰化縣
## 1 老年人口比率(65歲以上)     NA     NA     NA     NA     NA     NA     NA
## 2                   2000   6.37  10.20   7.46   9.69  10.98   7.16   9.42
## 3                   2001   6.44  10.49   7.49   9.91  11.21   7.32   9.73
## 4                   2002   6.55  10.82   7.51  10.17  11.57   7.50  10.03
## 5                   2003   6.67  11.17   7.56  10.39  11.87   7.68  10.31
## 6                   2004   6.86  11.54   7.62  10.58  12.19   7.90  10.65
##   南投縣 雲林縣 嘉義縣 臺南縣 高雄縣 屏東縣 臺東縣 花蓮縣 澎湖縣 基隆市 新竹市
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2  10.60  11.61  12.41  10.75   8.35  10.00  11.27  10.73  14.40   8.81   8.46
## 3  10.90  11.99  12.75  11.04   8.52  10.25  11.40  10.83  14.29   9.06   8.50
## 4  11.23  12.41  13.14  11.31   8.75  10.54  11.55  11.00  14.42   9.28   8.59
## 5  11.56  12.82  13.58  11.56   8.95  10.84  11.76  11.19  14.58   9.47   8.69
## 6  11.96  13.26  13.98  11.82   9.16  11.13  12.01  11.41  14.78   9.71   8.81
##   臺中市 嘉義市 臺南市 臺北市 高雄市
## 1     NA     NA     NA     NA     NA
## 2   6.49   8.67   7.69   9.67   7.16
## 3   6.60   8.85   7.85   9.94   7.41
## 4   6.79   9.15   8.06  10.25   7.63
## 5   6.94   9.46   8.24  10.58   7.93
## 6   7.15   9.70   8.46  10.92   8.24

這筆資料的最左邊一欄有一個變數名稱，但是不是位在第一列，而是在第二列，我們如何正確地讀取每一列的資料？
首先創造一個有23個元素的向量
對某一個變數進行23次的迴圈
第一個元素應該來自於資料的第二列、第二行

old.2000<-rep(NA, 23)  #讀取2010年老年人口比率
 for (u in 1:23){
   old.2000[u]<-stat.dat[2,u+1]
 }
old.2000
##  [1]  6.37 10.20  7.46  9.69 10.98  7.16  9.42 10.60 11.61 12.41 10.75  8.35
## [13] 10.00 11.27 10.73 14.40  8.81  8.46  6.49  8.67  7.69  9.67  7.16

可以進一步讀取2001, 2002…的資料，然後用資料框儲存起來。

old.2001<-c()
for (u in 1:23){
  old.2001[u]<-stat.dat[3,u+1]
     }
city <- colnames(stat.dat)
data.frame(city=city[-1],old.2000, old.2001)
##      city old.2000 old.2001
## 1  臺北縣     6.37     6.44
## 2  宜蘭縣    10.20    10.49
## 3  桃園縣     7.46     7.49
## 4  新竹縣     9.69     9.91
## 5  苗栗縣    10.98    11.21
## 6  臺中縣     7.16     7.32
## 7  彰化縣     9.42     9.73
## 8  南投縣    10.60    10.90
## 9  雲林縣    11.61    11.99
## 10 嘉義縣    12.41    12.75
## 11 臺南縣    10.75    11.04
## 12 高雄縣     8.35     8.52
## 13 屏東縣    10.00    10.25
## 14 臺東縣    11.27    11.40
## 15 花蓮縣    10.73    10.83
## 16 澎湖縣    14.40    14.29
## 17 基隆市     8.81     9.06
## 18 新竹市     8.46     8.50
## 19 臺中市     6.49     6.60
## 20 嘉義市     8.67     8.85
## 21 臺南市     7.69     7.85
## 22 臺北市     9.67     9.94
## 23 高雄市     7.16     7.41

3.2 while

如果要系統在執行到滿足某一個條件時中斷，可以用while迴圈。例如：

power<-0
while (power <= 12) {
  if (2^power<1000){
    cat(2^power, "\n")
    }else{
        cat("Stop")
    }
  power <- power +1
}
## 1 
## 2 
## 4 
## 8 
## 16 
## 32 
## 64 
## 128 
## 256 
## 512 
## StopStopStop

如果用for迴圈，可以輸出\(2^{0}\)到\(2^{12}\)，但是無法像while中斷迴圈

 for (a in -1:11){
    a <- a +1
   print(2^a)
 }
## [1] 1
## [1] 2
## [1] 4
## [1] 8
## [1] 16
## [1] 32
## [1] 64
## [1] 128
## [1] 256
## [1] 512
## [1] 1024
## [1] 2048
## [1] 4096

3.3 break

break可以在滿足某項條件情況下中斷迴圈，以上面的迴圈為例，假設我們要在超過1000時中斷：

power<-0
while (power <= 12) {
  if (2^power<1000){
    cat(2^power, "\n")
    }else{
        cat("Stop")
        break
    }
  power <- power +1
}
## 1 
## 2 
## 4 
## 8 
## 16 
## 32 
## 64 
## 128 
## 256 
## 512 
## Stop

也可以應用在訂旅館的例子中，例如超過原價就停止計算房價：

today<-as.Date(Sys.Date(), format='%Y-%m-%d')

hotel <- function(checkin){
n <- length(checkin)
price <- 3000
diff <- difftime(checkin, today)  
for (i in 1:n)
    if (diff[i]>90){
          print(checkin[i])
             cat (round(diff[i]/30,1), "months:", price*0.85, "\n")
      }else if (diff[i]>=60){
              print(checkin[i])
              cat (round(diff[i]/30,1), "months:",price*0.9,"\n")
    }else if (diff[i]>=30){
            print(checkin[i])
            cat (round(diff[i]/30,1), "months:",price,"\n")
            
    }else{
            print(checkin[i])
            cat("Over the budget")
            break
            #cat (diff[i],  "days:",price*1.2, "\n")
    }
 }
checkin<-as.Date(c("2018-12-31", "2018-04-20","2018-05-20",
                   "2018-06-01","2018-06-30"), format='%Y-%m-%d')
checkin<-c(checkin, today+7)
hotel(checkin)
## [1] "2018-12-31"
## Over the budget

3.4 警告訊息

我們可以在函數中放入停止訊息，並且中斷運作。例如分數的分母不得等於0：

divide <- function(x, y) {
  if (y == 0) {
    stop("Error: Division by zero is not allowed")
  } 
  return(x / y)
}

divide(10, 0) # This will stop execution and print the error message

改分母之後就可以執行了：

divide <- function(x, y) {
  if (y == 0) {
    stop("Error: Division by zero is not allowed")
  } 
  return(x / y)
}

divide(10, 1) # This will stop execution and print the error message
## [1] 10

3.5 next

如果想跳過滿足某個條件的迴圈，執行下一個迴圈，可用next。例如：

for (i in 1:10){
     if (i<= 4){
       next
     }
    #a <- a +1
    cat(i, "squared is ", i^2, "\n")
 }
## 5 squared is  25 
## 6 squared is  36 
## 7 squared is  49 
## 8 squared is  64 
## 9 squared is  81 
## 10 squared is  100

4 遞迴

遞迴(recursion)的目的是用比較少的條件，完成類似的工作，這對處理大量資料的工作相當有用。例如我們想計算\(1\times 2\times 3\times\ldots\)，數學叫做階乘，我們可以寫成：

fac<-function(n){
    s<-1
    for(i in 1:n){ 
            s<-s*((1:n)[i])
            message(paste(i,s,sep="\t"))
    }

    return(s) 
}
fac(5)
## [1] 120

上述的函數從1開始，函數內的迴圈每一次都進行\(1\times\ 2\times\ldots\times n\)的連乘，到第n次時，則是我們想要的階乘。

我們用遞迴的概念，重寫階乘的函數，可寫成如下：

recursive.factorial <- function(x) {
if (x == 0)    return (1)
else           return (x * recursive.factorial(x-1))
}
recursive.factorial(1)
## [1] 1
recursive.factorial(5)
## [1] 120

當x=0，得到1，其實當x=1，也會得到1，如果不等於0，那麼就是往回乘，例如x=5，應該得到5，然後又輪回第一行指令，因為x=4，所以得到4，依此類推，得到\(5\times 4\times 3\ldots 1\)的結果。

另一個例子是Fibonacci數列，Fibonacci數列指的是數列中的數是前面兩個數的和，例如：

\[0,1,1,2,3,5,8,13,21, \ldots\]。用數學式可表示為 \[ \text{x}_{n}=\text{x}_{n-1}+\text{x}_{n-2} \]

當\(n=0\)，\(\text{x}_{0}=0\)。所以第1個數不是0，而是\(\text{x}_{1}=1\)。
遞迴的函數可寫成如下：

recurse_fibonacci <- function(n) {
if(n <= 1) {
  return(n)
} else {
  return(recurse_fibonacci(n-1) + recurse_fibonacci(n-2))
  }
}

可以看到函數裡面有自己的函數，產生遞迴的效果。
試著代入\(n=9, 10, 11\)驗證是否得到Fibonacci數列：

recurse_fibonacci(9)

[1] 34

recurse_fibonacci(10)

[1] 55

recurse_fibonacci(11)

[1] 89

還有很多遞迴的例子，大家可以找來研究一下，例如利用遞迴找到一個數列中的最大值。

5 自訂統計函數結合繪圖

在許多函式中可以放入自訂的函數，節省輸入指令的時間。
例如建立一個函數，隨著資料的變異程度(variability)調整長條圖的長條寬度，使得每張長條圖呈現相同的長條數。方式是計算資料的最大值與最小值，兩個相減除以輸入的數字：

binwidth_bins <- function(n){
  force(n)
  function(x){
    (max(x) - min(x)) / n
  }
}

用常態分佈隨機抽樣產生三個長條圖5.1：

sd <- c(1, 2, 0.5)
n <- 100
set.seed(02139)
df <- data.frame(x = rnorm(3 * n, sd = sd), sd = rep(sd, n))
ggplot(df, aes(x)) +
  geom_histogram(aes(fill = as.factor(sd)), binwidth = 1) +
  facet_wrap(~ sd, scales = 'free_x') +
  labs(x = NULL)

Figure 5.1: 長條圖與函數1

把binwidth的函數放入長條圖中，同樣畫三個長條圖5.2：

sd <- c(1, 2, 0.5)
n <- 100
set.seed(02139)
df <- data.frame(x = rnorm(3 * n, sd = sd), sd = rep(sd, n))
ggplot(df, aes(x)) +
  geom_histogram(aes(fill = as.factor(sd)), binwidth = binwidth_bins(20)) +
  facet_wrap(~ sd, scales = 'free_x') +
  labs(x = NULL)

Figure 5.2: 長條圖與函數2

結合\(\texttt{binwidth_bins}\)以及創造均等分配的資料的函數，加上\(\texttt{stat_function}\)的功能，可以產生不同的平均數、樣本數的均等分配的資料，並且畫成有均等分佈及常態分佈的圖5.3：

library(ggplot2)

# Define your custom function
generate_uniform <- function(n, mean_value, range = c(0, 10)) {
  # Calculate the half-width of the uniform distribution based on the desired mean
  half_width <- mean_value / 2
  # Calculate the lower and upper bounds of the uniform distribution
  lower_bound <- mean_value - half_width
  upper_bound <- mean_value + half_width
  
  # Generate random samples from the uniform distribution
  samples <- runif(n, min = lower_bound, max = upper_bound)
  return(samples)
}

# Example usage:
set.seed(02138)  # for reproducibility
n <- 100
mean_value <- 3
samples <- data.frame(generate_uniform(n, mean_value))

# Plot the generated samples with binwidth_bins function
ggplot(samples, aes(samples[,1])) +
  geom_histogram(aes(y = ..density..), 
    binwidth = binwidth_bins(20), fill = 'pink') +
    labs(x = 'Samples', y = 'Density') +
    stat_function(fun = dnorm, 
    args = list(mean = mean(samples[,1], na.rm = TRUE), 
                    sd = sd(samples[,1], na.rm = TRUE)), 
        colour = 'black', size = 1)

Figure 5.3: 均等分佈與常態分佈

也可以直接建立一個常態分佈的函數，然後用\(\texttt{stat_function}\)來畫圖5.4：

# A normal curve function
mynormalfun <- function(xvar) {
    (1/sqrt(2*pi)*sd(xvar))*exp((-(xvar-mean(xvar))^2)/2*var(xvar))
}
#graphic
p <- ggplot(data.frame(x = seq(-3, 3, length.out=100)), aes(x = x)) +
  stat_function(fun = mynormalfun, geom = "line", col = '#22a011') +
  labs (y = 'Probability') +
  theme_bw()

#formula expression
p +
  annotate("text", x = 0, 
       y = mean(mynormalfun(seq(-3, 3, length.out=100))), 
       parse = TRUE,
       label = expression(f(x) == paste(frac(1, sqrt(2 * pi) * sigma) * e ^ {-(x-mu)^2 / 2*sigma^2}))
  )

Figure 5.4: 自建函數之常態機率密度分佈圖

如果要轉換成累積機率密度函數，我們先創建有\(\mu\)與\(\sigma\)的常態分佈函數，然後加以積分該函數，也就是\(\int_{-\infty}^{x}\{\}du\)。要注意積分有上限。畫成圖5.5表示常態分佈的資料的累積機率密度函數：

library(ggplot2); library(dplyr)
# Define the probability density function (PDF) of the normal distribution
normal_pdf <- function(x, mu, sigma) {
  exp(-((x - mu)^2) / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

# Define the cumulative density function (CDF) of the normal distribution
mycdffun <- function(x, mu, sigma) {
  integrate(normal_pdf, lower = -Inf, upper = x, mu = mu, sigma = sigma)$value
}

# Parameters of the normal distribution
mu <- 0   # Mean
sigma <- 1   # Standard deviation

# CDF graph
dplyr::tibble(x = seq(-3, 3, length.out = 300), 
             y = sapply(x, mycdffun, mu = mu, sigma = sigma)) %>%
ggplot(aes(x = x, y = y)) +
  geom_line(col = '#a0011a') +
  labs(y = "Cumulative Probability",
     title = "Cumulative Distribution Function of Normal Distribution")+
  theme_bw()

Figure 5.5: 累積機率密度函數圖

假設有一個二項分佈，p是0.25，n是100，我們可以計算成功事件x=10到40的機率密度函數，同樣的，用\(\texttt{stat_function}\)畫成圖為5.6：

# parameters
p = 0.25; q = 1 - p; n = 100
# A binomial curve function
mybinomialfun <- function(xvar) {
    (1/sqrt(2*n*pi*p*q))*exp((-(xvar - n * p)^2)/(2 * p * q * n))
}
#graphic
ggplot(data.frame(x = seq(10, 40, by = 1)), 
       aes(x = x)) +
  stat_function(fun = mybinomialfun, geom = "line", col = '#1200bb') +
  theme_bw() +
  labs (y = 'Probability') +
  annotate("text", x = 25, 
       y = mean(mybinomialfun(x)) + 0.01, 
       parse = TRUE,
label = expression(f(x) == paste(frac(1, sqrt(2 * n * p * q * pi)) * e ^ {frac(-(x - n * p)^2, 2 * n * p * q)}))
)

Figure 5.6: 自建函數之二項分佈機率密度分佈圖

由此可見\(p(10),\ldots,p(40)\)成常態分佈．可以輸入\(\texttt{pbinom(10, 100, p = 0.25)},\ldots,\texttt{pbinom(40, 100, p = 0.25)}\) 確認。當\(f(x) = 25\)時，\(\frac{1}{\sqrt{2npq\pi}}\text{exp}^{\frac{-(x-\mu)^2}{2npq}}\)有最大值。

6 map

purrr套件裡面有一個函式\(\texttt{map}\)可以把一個集合中的元素對應到另一個集合同等或者更多的元素。例如，我們想把台幣換成日圓，假設1日圓等於0.21台幣，我們先建立一個函數，然後輸入我們想轉換的台幣金額如下：

yen <- function(x) x / 0.21
purrr::map(c(1800, 2900, 4100), yen)
## [[1]]
## [1] 8571
## 
## [[2]]
## [1] 13810
## 
## [[3]]
## [1] 19524

\(\texttt{map_dbl}\)可以用來計算連續變數的平均值、最大值、最小值、變異數。變異數開根號等於標準差。

purrr::map_dbl(UsingR::cancer, mean)
##  stomach bronchus    colon    ovary   breast 
##    286.0    211.6    457.4    884.3   1395.9
v.cancer <- purrr::map_dbl(UsingR::cancer, var)
sqrt(v.cancer)
##  stomach bronchus    colon    ovary   breast 
##    346.3    209.9    427.2   1098.6   1239.0
sd(UsingR::cancer$colon)
## [1] 427.2

7 作業

1. 請把美國的州名排成一個陣列，然後找出州名長度多於或等於13的州（提示：nchar()傳回字串的長度）

2. 請讀取studentsfull.txt這個檔案，然後取出經濟系與化學系的學生資料。

3. 老師決定把某次考試之中，60分以下開根號乘以10，60分以上維持原來批改分數，請寫一個程式幫老師轉換以下成績：34, 81, 55, 69, 77, 40, 49, 26，分數計算至小數點後第二位四捨五入。

4. 請寫一個函數可以把兩個日期之間的差距轉換成月，並且以今天日期與2020年的7月31日之間的差距為例。

5. 請練習讀取「失業率」，並且把2000年的各縣市的失業率與老年人口比率組成一個資料框。

6. 請寫一個函數計算英呎的面積以及相對應的公尺面積，並且以一個\(100\times 30\)英呎的土地為例。公尺與英呎的轉換公式：\(\mathrm{m}=\frac{\mathrm{ft}}{3.2808}\)

7. 請寫一個函數把以下的體重資料，單位從公斤轉成英磅。1 公斤 = 2.204磅。

weight<-c(31.5, 27.8, 39.2, 34.3, 28.8, 29.1, 31.1,
          31.6, 29.1, 29.8, 27.7, 28.5, 27.9, 30.3,
          29.8, 30.2, 30.4, 28.6, 29.9, 29.9)

8. 續上題，我們想從這20個受訪者中抽樣，每次抽出3位然後計算平均值，抽20次，然後畫成長條圖。

9. 請把第7題的資料中的數值轉換為字串，標準為大於但不包含32稱為過重，小於30稱為過輕。

10. 請寫一個迴圈顯示Fibonacci數列中的每一個數字，例如從0開始的7個數字。

8 更新講義時間

最後更新時間: 2025-03-10 20:40:57

社會科學統計方法

條件判斷與迴圈

蔡佳泓

4/11/2025