Web crawling

Web crawling is convenient to do in R with the rvest package. This example uses Alibaba's recruitment website.
The site is well organized, and it is easy to generate all the initial URLs because only the position id in the URL changes.
The other reason is that Alibaba's web service is robust and can withstand a large number of requests, so the program is less likely to be blocked quickly than it would be on other websites.

[Figure: website example]

1. Create all the URLs to crawl (note: some of the pages have disappeared, so one should add tryCatch to catch the error and continue with the next iteration)

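library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe
setwd("D:\\rworkspace\\alipay")
pattern<- "https://job.alibaba.com/zhaopin/position_detail.htm?positionId="
id<-1:25000
url.list<-paste0(pattern,id)  # paste0() is vectorized, so sapply() is not needed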
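
The note above about missing pages can be sketched in base R as follows. This is only an illustration, not the crawler itself: `fetch_page` is a hypothetical stand-in for a real download that fails on purpose for some ids, and `safe_fetch` is a hypothetical wrapper that turns the error into `NULL` so the loop can move on.

```r
# Hypothetical stand-in for a page download: ids divisible by 3 "do not exist"
fetch_page <- function(id) {
  if (id %% 3 == 0) stop("page not found: ", id)
  paste("content of page", id)
}

# Hypothetical wrapper: return NULL instead of stopping when a page is missing
safe_fetch <- function(id) {
  tryCatch(fetch_page(id), error = function(e) NULL)
}

results <- list()
for (id in 1:6) {
  page <- safe_fetch(id)
  if (is.null(page)) next   # skip the missing page and continue the loop
  results[[length(results) + 1]] <- page
}

length(results)  # number of pages that were fetched successfully
```

Returning `NULL` (or the error object itself) from the handler gives the loop something explicit to test before deciding to skip.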

Write a function to crawl the pages and save the data

spider<-function(path,filename){
     info<-c()
     for(i in seq_along(path)){
       
       possibleMissing <- tryCatch({
         
             # read the page inside tryCatch so that a missing page raises a catchable error
             web<-path[i] %>% read_html()
             title<-web %>% html_nodes(".bg-title") %>% html_text() %>% iconv(from="utf-8",to="gbk") # job title
             ali<-web %>% html_nodes("tr td") %>% .[1:12] 
             ali.data<-html_text(ali) %>% iconv(from="utf-8",to="gbk")
             pub.time<-ali.data[2] %>% as.Date() # publication date
             location<-ali.data[4] # work location
             experience<-ali.data[6] # required experience
             department<-ali.data[8] # department
             education<-ali.data[10] # education requirement
    
             headcount<- gsub(pattern = "\n|\t|\\s",x=ali.data[12],replacement = "") # number of openings
             
             obs<-data.frame(title,pub.time,location,experience,department,education,headcount,stringsAsFactors = FALSE) # combine into one row
             
             # report progress
             percent<-round(i/length(path)*100,2)
             percent<-paste(percent," %")
             session<-paste(percent,i,sep = "------finished id ")
             print(session)
             info<-rbind(info,obs)
             Sys.sleep(0.5) # short pause between requests
           }
           ,
           error=function(e) {
             print("------missing page")
             e  # return the error object so the inherits() check can detect the failure
           }
       )
       
       if(inherits(possibleMissing, "error")) next  # on a missing page, do not stop; move on to the next URL
       
       # longer pause on every 20th or 50th page
       if(i%%20==0||i%%50==0){
         Sys.sleep(3)
       }
     }
  saveRDS(object = info,file = filename)   
  return(info)
}
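
The extraction step inside `spider()` relies on the table cells alternating between a label and its value, which is why only the even positions are kept. A minimal offline sketch of that idea follows. The HTML below is made up for illustration and is not copied from the real site; only the `.bg-title` and `tr td` selectors are taken from the function above.

```r
library(rvest)

# Made-up HTML that imitates a label/value table; not the real page
html <- '<html><body>
  <h3 class="bg-title">Data Engineer</h3>
  <table>
    <tr><td>Published</td><td>2016-04-06</td><td>Location</td><td>Hangzhou</td></tr>
    <tr><td>Experience</td><td>3 years</td><td>Headcount</td><td> 2 </td></tr>
  </table>
</body></html>'

page  <- read_html(html)
title <- page %>% html_nodes(".bg-title") %>% html_text()
cells <- page %>% html_nodes("tr td") %>% html_text()

# Labels sit at odd positions, values at even positions
values    <- cells[seq(2, length(cells), by = 2)]
pub.time  <- as.Date(values[1])
location  <- values[2]
headcount <- gsub("\\s", "", values[4])  # strip surrounding whitespace
```

Testing the selectors on a small literal string like this makes it easier to see which index holds which field before running the crawler over thousands of pages.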

Start downloading

d<-spider(path = url.list,filename = "alipay.RDS")
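
Because `spider()` also writes its result with `saveRDS()`, the table can be reloaded later without crawling again. A minimal round-trip sketch, assuming a made-up two-row data frame and a temporary file rather than the real `alipay.RDS`:

```r
# Made-up stand-in for the crawl result
jobs <- data.frame(title = c("Engineer", "Designer"),
                   headcount = c("1", "2"),
                   stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".RDS")   # temporary file, not the real output file
saveRDS(object = jobs, file = tmp)

restored <- readRDS(tmp)            # same columns and types as before saving
identical(jobs, restored)
```

Unlike a CSV export, an RDS file keeps column types such as `Date`, so `pub.time` would not need to be parsed again after reloading.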

Check the first results: well done!

head(d,30)

##                                                  title   pub.time location
## 1                                PHP开发工程师         2012-11-12   杭州市
## 2                         消息中心开发专家             2014-01-13   杭州市
## 3                             资深运营专员             2014-01-13   北京市
## 4                   2014.02.22北京技术专场             2014-01-13   杭州市
## 5                                     采购助理         2013-12-31   杭州市
## 6            蚂蚁金服-资深Java应用中间件架构师         2016-04-06   杭州市
## 7                                 行政主管             2013-12-02   北京市
## 8                                 培训经理             2013-11-04   北京市
## 9              资深爱好市场运营专员-（淘宝网）         2014-01-14   杭州市
## 10     资深爱好市场产品运营专员-（淘宝网）             2014-01-14   杭州市
## 11                        资深文案策划专员             2014-01-14   杭州市
## 12                        数字阅读-UED主管             2014-01-14   杭州市
## 13                  前端开发工程师（应用市场）         2014-01-14   北京市
## 14                              高级交互设计师         2014-01-14   北京市
## 15                       业务合作经理/专家             2014-01-14   杭州市
## 16                蚂蚁金服-数据挖掘工程师-上海         2016-03-01   上海市
## 17                蚂蚁金服-数据挖掘工程师-杭州         2016-03-03   杭州市
## 18                       数据仓库ETL工程师             2013-04-08   杭州市
## 19      International Business Development             2014-03-19   杭州市
## 20                       虾米-版权合作专员             2014-01-14   北京市
## 21  资深数字产品C/C++服务端开发工程师/专家             2014-01-14   杭州市
## 22              高级Java工程师（核心平台）             2014-01-14   杭州市
## 23                                课程运营             2014-01-14   杭州市
## 24             课程运营（淘宝网-淘宝大学）             2014-01-14   杭州市
## 25 资深数字产品Java基础平台开发工程师/专家             2014-01-15   杭州市
## 26                [城市智能]前端开发工程师             2014-01-15   杭州市
## 27                [城市智能]项目java开发工程师         2014-01-15   杭州市
## 28                              镜像平台PM             2014-01-15   北京市
## 29                                  镜像平台PM         2014-01-15   杭州市
## 30                            镜像市场客户运营         2014-01-15   北京市
##    experience         department education headcount
## 1    三年以上     信息平台事业部      本科         1
## 2    三年以上           技术保障      本科         5
## 3    三年以上         搜索事业部      本科         1
## 4    三年以上         国际事业部      本科      若干
## 5    二年以上             采购部      本科         1
## 6    五年以上           蚂蚁金服      本科         5
## 7    三年以上         国际事业部      本科         1
## 8    三年以上         国际事业部      本科         1
## 9    五年以上 淘宝行业市场事业部      本科         1
## 10   五年以上 淘宝行业运营事业部      本科         1
## 11   三年以上           蚂蚁金服      本科      若干
## 12   五年以上     数字娱乐事业群      本科         1
## 13   三年以上     手机助手发展部      本科         2
## 14   三年以上                CTO      本科         2
## 15   五年以上           小微金服      本科         3
## 16   三年以上           蚂蚁金服      本科         2
## 17   三年以上           蚂蚁金服      本科         2
## 18   五年以上           蚂蚁金服      本科        20
## 19   三年以上            国际B2C      本科         2
## 20   三年以上           阿里音乐      本科         1
## 21   三年以上         淘宝技术部      本科      若干
## 22   二年以上           蚂蚁金服      本科      若干
## 23   三年以上           淘宝大学      本科      若干
## 24   三年以上           淘宝大学      本科      若干
## 25   三年以上         淘宝技术部      本科      若干
## 26   三年以上            W工作室      本科         2
## 27   三年以上            W工作室      本科         2
## 28   三年以上       阿里云事业群      本科      若干
## 29   三年以上       阿里云事业群      本科      若干
## 30   三年以上       阿里云事业群      本科      若干

The full data set contains over 20,000 records, which are not shown here.

A deeper analysis of the data will follow later; I am a little busy with course work these days. :)

A small crawler

Xinyu Jiao

April 15, 2016
