Read table
The data was extracted from http://www.cuny.edu/about/alumni-students-faculty/faculty/distinguished-professors/. It lists the distinguished professors who teach at CUNY. I have already preprocessed the data a little.
library(dplyr)  # %>%, filter(), count(), mutate(), select(), arrange()
library(DT)     # datatable()
data <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Assignments/a6/professors_in_cuny.csv")
datatable(data)
Tidy data
The data is already fairly clean because I extracted only simple fields: name, department, email, and office phone number. However, some professors did not provide an office phone number, so we are going to remove those rows.
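Before dropping rows, it can help to see how many are affected. The sketch below uses a small made-up data frame (`toy`, with hypothetical names and numbers) rather than the real file. It also counts empty strings, because `read.csv()` keeps a blank field in a character column as `""` rather than `NA`, so `is.na()` alone would miss such rows if the CSV stores missing phones that way. Whether the real file does so is an assumption to check, not something the data above confirms.

```r
library(dplyr)

# Hypothetical toy data standing in for the real CSV
toy <- data.frame(
  name = c("A. Example", "B. Example", "C. Example"),
  office_phone = c("212-555-0100", NA, ""),
  stringsAsFactors = FALSE
)

# Count rows with a missing phone: either NA or an empty string
n_missing <- toy %>%
  filter(is.na(office_phone) | office_phone == "") %>%
  nrow()

# Keep only rows with a usable phone number
toy_clean <- toy %>%
  filter(!is.na(office_phone), office_phone != "")
```

If `n_missing` matches the number of rows that `is.na()` alone finds, the simpler filter is enough.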
data1 <- data %>%
  filter(!is.na(office_phone))
Some Analysis Example
Here, we are going to perform some simple analysis based on the “clean” data.
How many distinct colleges are listed, and how many professors are shown in each college?
count() = group_by() + summarise(n = n())
As the calculation shows, there are 15 distinct colleges listed. The number of professors at each college appears below in the column n (head() shows only the first six rows).
head(data1 %>%
  count(college))
##                            college  n
## 1                   Baruch College  7
## 2                 Brooklyn College  6
## 3         College of Staten Island  4
## 4             CUNY Graduate Center 42
## 5               CUNY School of Law  1
## 6 Graduate School of Public Health  2
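As a rough check of the count() shorthand, the sketch below compares it with the longer group_by() plus summarise() form, and uses n_distinct() to get the number of distinct colleges without printing the whole table. The data frame `toy` and its professor labels are made up for illustration; only the college names come from the output above.

```r
library(dplyr)

# Hypothetical toy data; the real data1 has one row per professor
toy <- data.frame(
  name = c("P1", "P2", "P3", "P4"),
  college = c("Baruch College", "Baruch College",
              "Brooklyn College", "CUNY Graduate Center")
)

# Shorthand: one row per college, with the row count in column n
by_count <- toy %>% count(college)

# Long form that count() abbreviates
by_summarise <- toy %>%
  group_by(college) %>%
  summarise(n = n())

# Number of distinct colleges
n_colleges <- n_distinct(toy$college)
```

On the real data, `n_distinct(data1$college)` should return the same number as `nrow(count(data1, college))`.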
Are there popular departments (departments with more than three distinguished professors)?
filter() = keep only the rows that satisfy the condition(s) you provide
Eight departments count as popular in this case. I was surprised that so few science departments appear, and that English is the largest department in the result.
a <- data1 %>%
  count(department)
a %>% filter(n > 3)
##    department  n
## 1     English 12
## 2     History 11
## 3 Mathematics  6
## 4       Music  4
## 5  Philosophy  5
## 6     Physics  4
## 7  Psychology  5
## 8   Sociology  6
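The two steps above can also be written as one pipeline, with arrange(desc(n)) added so the largest departments come first. This is only a sketch on a made-up data frame (`toy`, with invented counts), not a replacement for the result shown above.

```r
library(dplyr)

# Hypothetical toy data: one row per professor
toy <- data.frame(
  department = c(rep("English", 5), rep("History", 4), rep("Music", 2))
)

# Count per department, keep those with more than three, largest first
popular <- toy %>%
  count(department) %>%
  filter(n > 3) %>%
  arrange(desc(n))
```

Sorting by n makes it easier to see at a glance which department leads.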
Who works in the Graduate Center? Show each professor's name and status.
mutate() = add a column and fill in its values
select() = specify which columns you want to keep
data1 <- data1 %>%
  mutate(work_in_grad_center = ifelse(college == "CUNY Graduate Center", "yes", "no"))
b <- data1 %>%
  select(name, work_in_grad_center)
datatable(b)
- What percentage of professors work in the Graduate Center, and what percentage do not?
b %>%
  count(work_in_grad_center) %>%
  mutate(percentage = n/sum(n))
##   work_in_grad_center  n percentage
## 1                  no 73  0.6347826
## 2                 yes 42  0.3652174
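Note that n/sum(n) gives a proportion between 0 and 1, so the values above correspond to about 63.5% and 36.5%. A minimal sketch of turning proportions into percentages, on a made-up data frame (`toy`) with the same column name:

```r
library(dplyr)

# Hypothetical toy data: one row per professor
toy <- data.frame(work_in_grad_center = c("yes", "no", "no", "no"))

shares <- toy %>%
  count(work_in_grad_center) %>%
  mutate(
    proportion = n / sum(n),              # share between 0 and 1
    percent = round(100 * proportion, 1)  # same share on a 0-100 scale
  )
```

Keeping both columns avoids ambiguity about which scale a value is on.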
Sort by name (alphabetical)?
arrange() = sort the rows by the values of the column(s) you provide
head(data1 %>%
  select(name) %>%
  arrange(name))
##               name
## 1 Alexandra Juhasz
## 2  Alison Griffiths
## 3     Andre Aciman
## 4 Anthony Tamburri
## 5     Arthur Apter
## 6    Azriel Genack
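Because each name is stored as "First Last", arrange(name) sorts by first name. If sorting by surname is wanted instead, one possible sketch is below. It assumes the surname is the last space-separated word of the name, which will not hold for every name; the column `last_name` and the data frame `toy` are introduced here only for illustration.

```r
library(dplyr)

# Toy data using three names from the output above
toy <- data.frame(
  name = c("Andre Aciman", "Alexandra Juhasz", "Arthur Apter"),
  stringsAsFactors = FALSE
)

by_surname <- toy %>%
  # Drop everything up to and including the last space (assumed surname rule)
  mutate(last_name = sub(".* ", "", name)) %>%
  # Sort by surname, then by full name to break ties
  arrange(last_name, name)
```

Wrapping a column in desc(), as in arrange(desc(name)), reverses the order.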