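library(dplyr)      # %>%, filter(), select(), mutate(), group_by(), summarize(), arrange()
library(gtsummary)  # tbl_summary()
dta <- read.csv("C:/RStudio/ncku_prof_V6.csv", header = TRUE, stringsAsFactors = TRUE)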
head(dta)
## ID Initial Citation H.id Gender Degree Rank College Dept Grads FPY
## 1 10001 YCC 305 9 M D 3 ENG ESC 3 2013
## 2 10002 CYC 355 11 M D 2 ENG ESC 10 2008
## 3 10003 HBC 3452 10 M D 1 ENG ESC 0 2011
## 4 10004 HHC 15808 65 M O 1 ENG ESC 92 1997
## 5 10005 JSC 280 10 F O 2 ENG ESC 25 2011
## 6 10006 MYC 2506 22 M D 2 ENG ESC 41 2002
## Articles StuApp Colprof
## 1 30 169 309
## 2 22 169 309
## 3 14 169 309
## 4 349 169 309
## 5 23 169 309
## 6 90 169 309
str(dta)
## 'data.frame': 460 obs. of 14 variables:
## $ ID : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
## $ Initial : Factor w/ 347 levels "BCT","BHC","BLC",..: 308 60 81 90 145 201 293 176 198 276 ...
## $ Citation: int 305 355 3452 15808 280 2506 672 5735 1118 685 ...
## $ H.id : int 9 11 10 65 10 22 14 40 19 14 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 2 2 2 2 ...
## $ Degree : Factor w/ 2 levels "D","O": 1 1 1 2 2 1 1 1 2 1 ...
## $ Rank : int 3 2 1 1 2 2 1 1 2 1 ...
## $ College : Factor w/ 5 levels "ENG","LIB","MNG",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : Factor w/ 25 levels "ACC","BAD","CEN",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Grads : int 3 10 0 92 25 41 36 54 74 195 ...
## $ FPY : int 2013 2008 2011 1997 2011 2002 2008 2001 1994 1991 ...
## $ Articles: int 30 22 14 349 23 90 36 123 26 70 ...
## $ StuApp : int 169 169 169 169 169 169 169 169 169 169 ...
## $ Colprof : int 309 309 309 309 309 309 309 309 309 309 ...
Assessment 1
Keep the rows where H.id is greater than 12, then select the variables H.id, Gender, College, Rank, Degree, and Grads.
dta1 <- dta %>%
filter(H.id > 12) %>%
select(H.id, Gender, College, Rank, Degree, Grads)
tail(dta1)
## H.id Gender College Rank Degree Grads
## 187 14 F MNG 2 D 17
## 188 13 M MNG 1 O 89
## 189 15 M MNG 2 D 27
## 190 17 M MNG 1 D 26
## 191 27 M MNG 1 D 23
## 192 14 M MNG 2 D 9
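The filter-then-select pattern above can be tried on a small data frame. This is a minimal sketch: `toy` and `toy1` are hypothetical names, and the values are made up (only the column names mirror `dta`). The `dplyr::` prefixes are optional; they are written out here because other loaded packages can mask `filter()` and `select()`.

```r
library(dplyr)

# Hypothetical toy data; the column names mirror dta, the values are made up.
toy <- data.frame(
  H.id    = c(9, 11, 65, 22, 13),
  Gender  = c("M", "M", "M", "M", "F"),
  College = c("ENG", "ENG", "ENG", "SCI", "LIB"),
  FPY     = c(2013, 2008, 1997, 2002, 2010)
)

toy1 <- toy %>%
  dplyr::filter(H.id > 12) %>%          # keep rows with H.id strictly above 12
  dplyr::select(H.id, Gender, College)  # keep only the listed columns

print(toy1)
```

Note that the comparison is strict, so a professor with H.id exactly 12 would be dropped.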
Assessment 2
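Select the variables H.id, Gender, Degree, Rank, and Grads from dta, then create two new variables: "academicy", the total number of academic years (2022 minus FPY), and "Grads_m", the average number of graduate students per academic year.
dta2 <- dta %>%
  # compute the new variables while FPY is still available in the pipeline
  mutate(academicy = 2022 - FPY,
         Grads_m = Grads / academicy) %>%
  select(H.id, Gender, Degree, Rank, Grads, academicy, Grads_m)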
head(dta2)
## H.id Gender Degree Rank Grads academicy Grads_m
## 1 9 M D 3 3 9 0.3333333
## 2 11 M D 2 10 14 0.7142857
## 3 10 M D 1 0 11 0.0000000
## 4 65 M O 1 92 25 3.6800000
## 5 10 F O 2 25 11 2.2727273
## 6 22 M D 2 41 20 2.0500000
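The arithmetic behind the two new columns can be checked by hand: for the first row, 2022 - 2013 = 9 academic years and 3 / 9 ≈ 0.333 graduate students per year. The sketch below repeats that calculation on a toy data frame. The first three rows reuse the Grads and FPY values printed above; the fourth row (FPY of 2022) is a hypothetical case added to show that dividing by zero years would otherwise give `Inf`. The names `toy`, `toy2`, and `ref_year`, and the choice to return `NA` in that case, are assumptions made for illustration.

```r
library(dplyr)

# Rows 1-3 reuse Grads/FPY from head(dta); row 4 is a hypothetical edge case.
toy <- data.frame(Grads = c(3, 10, 0, 5),
                  FPY   = c(2013, 2008, 2011, 2022))
ref_year <- 2022  # reference year used in the assignment

toy2 <- toy %>%
  mutate(academicy = ref_year - FPY,
         # guard: report NA instead of Inf/NaN when no full year has elapsed
         Grads_m = if_else(academicy > 0, Grads / academicy, NA_real_))

print(toy2)
```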
Assessment 3
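Sorting is ascending by default; to sort in descending order, wrap the variable name in desc().
dta3 <- dta %>%
  group_by(College, Gender, Rank, Degree) %>%
  summarize(mean_H.id = mean(H.id, na.rm = TRUE),
            sd_H.id = sd(H.id),
            var_H.id = var(H.id),
            max_H.id = max(H.id),
            # min(), not max(); the output printed below came from a run that used max() here
            min_H.id = min(H.id),
            count = n()) %>%
  arrange(desc(mean_H.id))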
## `summarise()` has grouped output by 'College', 'Gender', 'Rank'. You can
## override using the `.groups` argument.
head(dta3)
## # A tibble: 6 × 10
## # Groups: College, Gender, Rank [3]
## College Gender Rank Degree mean_H.id sd_H.id var_H.id max_H.id min_H.id count
## <fct> <fct> <int> <fct> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 ENG F 1 D 34 10.4 108 46 46 3
## 2 ENG M 1 D 24.4 11.0 121. 54 54 28
## 3 ENG M 1 O 24.2 13.9 192. 92 92 76
## 4 SCI M 1 D 21 16.1 258 39 39 4
## 5 ENG F 1 O 19.5 9.71 94.3 32 32 4
## 6 SCI M 1 O 18.8 15.2 231. 58 58 24
tail(dta3)
## # A tibble: 6 × 10
## # Groups: College, Gender, Rank [5]
## College Gender Rank Degree mean_H.id sd_H.id var_H.id max_H.id min_H.id count
## <fct> <fct> <int> <fct> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 LIB F 3 O 0.5 0.707 0.5 1 1 2
## 2 LIB M 2 D 0.167 0.408 0.167 1 1 6
## 3 LIB F 2 D 0 0 0 0 0 6
## 4 LIB F 3 D 0 0 0 0 0 2
## 5 LIB M 1 D 0 0 0 0 0 4
## 6 LIB M 3 D 0 NA NA 0 0 1
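Two details of the output above may be worth isolating on a small example: the `summarise()` message about grouped output, and the `NA` standard deviation in the last row, whose group has a count of 1. The sketch below uses a hypothetical toy data frame (`toy`, `toy3`, and all values are made up). Passing `.groups = "drop"` is one way to get an ungrouped result without the message; it is shown here as an option, not as what the original analysis did.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  College = c("ENG", "ENG", "ENG", "LIB", "LIB", "SCI"),
  Gender  = c("M", "M", "F", "F", "M", "M"),
  H.id    = c(20, 10, 30, 2, 0, 12)
)

toy3 <- toy %>%
  group_by(College, Gender) %>%
  summarise(mean_H.id = mean(H.id),
            sd_H.id   = sd(H.id),    # NA when a group has a single row
            min_H.id  = min(H.id),
            max_H.id  = max(H.id),
            count     = n(),
            .groups   = "drop") %>%  # return an ungrouped tibble, no message
  arrange(desc(mean_H.id))           # largest group mean first

print(toy3)
```

Because `sd()` and `var()` need at least two observations, any group with `count` equal to 1 will show `NA` in those columns regardless of `na.rm`.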
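(1) By college (College), professors in LIB (College of Liberal Arts) have lower H.id values than those in ENG (College of Engineering) and SCI (College of Science). By rank (Rank), professors of higher rank also tend to have a higher H.id, although the last six rows show exceptions. Finally, by where the degree was earned (Degree), the six groups with the highest H.id are split evenly, three overseas and three domestic, whereas five of the bottom six groups hold domestic degrees. Even so, this is not enough to conclude that an overseas degree leads to a better H.id.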
dta3_2 <- filter(dta3, College == "ENG")
dta3_2 <- select(dta3_2, College, Gender, mean_H.id)
## Adding missing grouping variables: `Rank`
show(dta3_2)
## # A tibble: 10 × 4
## # Groups: College, Gender, Rank [6]
## Rank College Gender mean_H.id
## <int> <fct> <fct> <dbl>
## 1 1 ENG F 34
## 2 1 ENG M 24.4
## 3 1 ENG M 24.2
## 4 1 ENG F 19.5
## 5 2 ENG M 17.3
## 6 2 ENG F 13
## 7 3 ENG M 12.4
## 8 3 ENG M 12
## 9 2 ENG M 11.2
## 10 3 ENG F 9.17
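The message "Adding missing grouping variables: `Rank`" appears because `dta3` is still grouped by College, Gender, and Rank, and `select()` always keeps grouping columns. A minimal sketch of the behaviour, on a hypothetical toy data frame (`toy`, `g`, `kept`, and `flat` are illustrative names, values made up): calling `ungroup()` first lets `select()` return exactly the requested columns.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  College = c("ENG", "ENG", "LIB"),
  Gender  = c("F", "M", "M"),
  Rank    = c(1L, 2L, 1L),
  H.id    = c(34, 17, 4)
)

g <- toy %>% group_by(College, Gender, Rank)

# Grouped: Rank is a grouping variable, so select() adds it back (with a message).
kept <- g %>% select(College, Gender, H.id)

# Ungrouped: only the requested columns are returned.
flat <- g %>% ungroup() %>% select(College, Gender, H.id)

print(names(kept))
print(names(flat))
```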
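(2) Looking only at the College of Engineering rows and then at gender, the female groups do not all have a high H.id; only the first row (female, Rank 1, mean H.id of 34) stands out. The first five rows alone therefore cannot support the claim that male professors' average academic productivity is lower than that of female professors.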
dta3_3 <- filter(dta3, College == "LIB")
dta3_3 <- select(dta3_3, College, Gender, mean_H.id)
## Adding missing grouping variables: `Rank`
show(dta3_3)
## # A tibble: 11 × 4
## # Groups: College, Gender, Rank [6]
## Rank College Gender mean_H.id
## <int> <fct> <fct> <dbl>
## 1 1 LIB M 3.86
## 2 1 LIB F 3.19
## 3 1 LIB F 1.5
## 4 2 LIB F 0.833
## 5 2 LIB M 0.727
## 6 3 LIB F 0.5
## 7 2 LIB M 0.167
## 8 2 LIB F 0
## 9 3 LIB F 0
## 10 1 LIB M 0
## 11 3 LIB M 0
(3) The last six rows shown earlier, together with dta3_3, show that a mean_H.id of 0 occurs only among professors in the College of Liberal Arts.
Assessment 4
dta %>%
select(College, Gender, Degree, Rank) %>%
tbl_summary(by = College)
## Warning: The `fmt_missing()` function is deprecated and will soon be removed
## * Use the `sub_missing()` function instead
| Characteristic | ENG, N = 184¹ | LIB, N = 63¹ | MNG, N = 84¹ | SCI, N = 72¹ | SSC, N = 57¹ |
|---|---|---|---|---|---|
| Gender | | | | | |
| F | 17 (9.2%) | 34 (54%) | 24 (29%) | 14 (19%) | 22 (39%) |
| M | 167 (91%) | 29 (46%) | 60 (71%) | 58 (81%) | 35 (61%) |
| Degree | | | | | |
| D | 63 (34%) | 21 (33%) | 25 (30%) | 16 (22%) | 13 (23%) |
| O | 121 (66%) | 42 (67%) | 59 (70%) | 56 (78%) | 44 (77%) |
| Rank | | | | | |
| 1 | 111 (60%) | 29 (46%) | 36 (43%) | 36 (50%) | 28 (49%) |
| 2 | 44 (24%) | 29 (46%) | 27 (32%) | 27 (38%) | 22 (39%) |
| 3 | 29 (16%) | 5 (7.9%) | 21 (25%) | 9 (12%) | 7 (12%) |

¹ n (%)
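The percentages in the table are column percentages, that is, each count divided by the number of professors in that college. As a quick check, the sketch below rebuilds the Gender rows from the counts printed in the table and recomputes the percentages in base R. It is a hand verification only, not how `tbl_summary()` works internally; `counts` and `pct` are illustrative names.

```r
# Gender counts copied from the table above (rows: F, M; columns: colleges).
counts <- matrix(
  c( 17, 34, 24, 14, 22,
    167, 29, 60, 58, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender  = c("F", "M"),
                  College = c("ENG", "LIB", "MNG", "SCI", "SSC"))
)

# Column totals should reproduce the N shown in each column header.
print(colSums(counts))

# Column percentages: each cell divided by its college total.
pct <- round(100 * prop.table(counts, margin = 2), 1)
print(pct)
```

The column totals also sum to 460, which matches the number of observations reported by `str(dta)`.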