How much are charity, fundraising, NGO and non-profit organisations currently paying their new staff? Web scraping CharityJobs…

Last updated: 22 October 2016


In this post I try to explore this and some other related questions using public recruitment data from the CharityJobs website. According to CharityJob, the site is the United Kingdom’s busiest one for charity, fundraising, NGO and not for profit jobs.

In addition to presenting these powerful open-source tools and data-exploration techniques, I hope that this post can help the public, especially applicants and workers in development aid organisations, to get an update on salaries and trends in the sector. The jobs analysed here are mostly UK-based and published by UK-based organisations. Therefore, the results are not meant to represent the entire sector worldwide. I still hope that this post can make a positive contribution to the evolution of development aid work in both the southern and the northern hemispheres.

For those of you who are only interested in the end analysis, please jump to the results section. However, I encourage you to explore how these tools work. I believe that they can help speed up and improve the quality of the much-needed charity, social-enterprise, development-aid and humanitarian work globally.

I used here some basic techniques of web scraping (also called web harvesting or web data extraction), a software technique for extracting information from websites. The source code in RMarkdown is available for download and use under the GNU General Public License at: Rmarkdown code. Everything was prepared with the open-source and freely accessible statistical computing software R (R version 3.2.0 - http://cran.r-project.org/) and the IDE RStudio (Version 0.99.441 - http://www.rstudio.com/).

This post is based on public data, is my sole responsibility and can in no way be taken to reflect the views of CharityJobs’ staff.

Downloading data from CharityJobs

Using RStudio, the first step is to download the website data. CharityJobs’ search engine contains over 140 webpages, each of them with a list of 18 jobs in most cases. Hence I expected to get information on around 2,500 job announcements. To get there, I first had to download the data and get rid of what I did not want (e.g. CSS and HTML code). The code chunk below describes how I did it. The code contains explanatory comments marked with hash signs (‘#’). I am sure that many would be able to write this code in a much more elegant and efficient way. I would be very thankful to receive your comments and suggestions!

# Loading the necessary packages. This assumes that they are already installed.
# Type '?install.packages' in the R console for additional information.
suppressWarnings(suppressPackageStartupMessages(library(rvest))) # Credits to Hadley Wickham (2016)
suppressPackageStartupMessages(library(stringr)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(library(lubridate)) # Credits to Garrett Grolemund, Hadley Wickham (2011)
suppressPackageStartupMessages(library(dplyr)) # Credits to Hadley Wickham and Romain Francois (2015)
suppressPackageStartupMessages(library(xml2)) # Credits to Hadley Wickham (2015)
suppressPackageStartupMessages(library(pander)) # Credits to Gergely Daróczi and Roman Tsegelskyi (2015)
suppressPackageStartupMessages(library(ggplot2)) # Credits to Hadley Wickham (2009)

## Creating the list of URLs (one per results page)
urls <- paste0("https://www.charityjob.co.uk/jobs?page=", 1:140)

## Downloading each page into a list called `charityjobs`
charityjobs <- lapply(urls, read_html)
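
Downloading 140 pages in one go can fail halfway if a single request times out. The sketch below is one possible way to make the loop more tolerant; it is not part of the original analysis. The helper name `fetch_pages` and its `reader` and `pause` arguments are hypothetical: `reader` is any function that takes a URL (for example `xml2::read_html`), and `pause` is the waiting time in seconds between requests.

```r
# Hypothetical helper (not in the original post): applies `reader` to each URL,
# waits `pause` seconds before each request and returns NULL for pages that fail.
fetch_pages <- function(urls, reader, pause = 1) {
  lapply(urls, function(u) {
    Sys.sleep(pause)
    tryCatch(reader(u), error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NULL
    })
  })
}

# Possible usage with the objects defined above (requires network access):
# charityjobs <- fetch_pages(urls, xml2::read_html)
# charityjobs <- Filter(Negate(is.null), charityjobs)  # drop pages that failed
```

Dropping the failed pages before parsing keeps the later steps from stopping on an empty list element.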

Tidying up and parsing data

The next step is to parse or clean up the text string of each of the roughly 140 webpages. I decided to build a custom function for that, which I could use to loop through the content of each element of the charityjobs list. The function should also save the parsed data into a data frame. This data frame should include information on recruiters, position titles, salary ranges and application deadlines. The code chunk below presents this function, which I called salarydata.

## Creating a function for parsing data which uses the read_html output (list 'charityjobs')
salarydata <- function(list) {
  
  # Creating auxiliary variables and databases
  list_size <- length(list)
  salaries  <- data.frame(deadline=character(),
                          recruiter=character(), 
                          position=character(),
                          salary_range=character()) 
  
  for (i in seq_len(list_size)){ 
    size <- list[[i]] %>% html_nodes(".salary") %>% html_text() %>% length()
    
    #Intermediary dataframe
    sal  <- data.frame(deadline=rep(NA, size),
                       recruiter=rep(NA, size), 
                       position=rep(NA, size),
                       salary_range=rep(NA, size)) 
    
    ## Filling out intermediary data for deadlines for application
    sal$deadline[1:size]  <- list[[i]] %>% 
      html_nodes(".closing:nth-child(4) span") %>% html_text() %>% 
      .[!grepl("^[Closing:](*)",.)]  %>% rbind()
    
    ## Filling out intermediary data for recruiters
    sal$recruiter[1:size]  <- (list[[i]] %>% 
                                 html_nodes(".recruiter") %>% html_text() %>% 
                                 gsub("\r\n\ \\s+", "",.) %>% 
                                 gsub("\r\n", " ", .) %>% 
                                 gsub("^\\s+|\\s+$", "", .)) %>% 
      rbind()
    
    ## Filling out intermediary data for positions
    sal$position[1:size]  <-   list[[i]] %>% 
      html_nodes(".title") %>% html_text() %>% 
      gsub("\r\n\ \\s+", "",.) %>% 
      gsub("\r\n", " ", .) %>% 
      gsub("^\\s+|\\s+$", "", .) %>% 
      rbind()
    
    ## Filling out intermediary data for salary ranges
    sal$salary_range[1:size]  <- list[[i]] %>%
      html_nodes(".salary") %>% html_text() %>%
      gsub("(£..)\\.", "\\1", .) %>% gsub("\\.(.)k(+) |\\.(.)K(+)", "\\100 \\2", .) %>%
      gsub("(*)k(+) |(*)K(+)", "\\1000 \\2", .) %>%
      gsub("k", "000 ", .) %>% # Substituting remaining ks
      gsub("^(£..)\\-", "\\1000; ", .) %>% # Adding thousands for figures without "k"
      gsub("- £", "; ", .) %>% gsub("-£", "; ", .) %>% gsub("£", "", .) %>% # Removing pound signs
      gsub("-", ";", .) %>% gsub("–", ";", .) # Replacing remaining dashes with ";"
      
    ## Excluding per-hour, per-day and per-week jobs, and amounts quoted with "plus" or "+"
    sal <- sal %>% 
      filter(!grepl("hour|p/h|ph|week|day|daily|plus|\\+", salary_range))
    
    salaries <- rbind(salaries, sal) 
    
  }
  return(salaries)
}
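
The chain of `gsub()` calls above is tailored to the formats found on the site at the time. As an illustration of the same idea, the sketch below pulls the numbers out of a salary string directly. It is a hypothetical alternative, not the method used in this post: the function name `extract_amounts` and the rule for bare lower bounds are my own assumptions, and free text containing other numbers (e.g. "3 days") would still need extra filtering.

```r
# Hypothetical alternative to the gsub() chain (not the original method):
# extract every number, allowing decimals and a trailing "k", and convert it to pounds.
extract_amounts <- function(x) {
  x      <- gsub(",", "", x)                                   # "30,000" -> "30000"
  hits   <- regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?[kK]?", x))[[1]]
  is_k   <- grepl("[kK]$", hits)
  values <- as.numeric(sub("[kK]$", "", hits))
  values[is_k] <- values[is_k] * 1000                          # "30k" -> 30000
  # Assumed heuristic: in "25-30k" the lower bound omits the "k"
  if (any(values >= 1000)) {
    values[values < 1000] <- values[values < 1000] * 1000
  }
  values
}

extract_amounts("25-30k")     # 25000 30000
extract_amounts("30,000")     # 30000
extract_amounts("25k - 30k")  # 25000 30000
```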

Creating full dataframe and other adjustments

The last step before exploring the data was to run the function salarydata to create the full dataframe. After that, I parsed lower and upper salaries into separate columns and deleted data which may have been incorrectly parsed or which concerned daily-rate and hourly-rate jobs / consulting assignments. Only yearly salaries between GBP 4,000 and GBP 150,000 have been considered. All salary data is in British Pounds (GBP) and refers to annual salaries, which sometimes do not include benefits such as pension.

Cleaning the salary-range variable was a tricky step, as the website allows users to type in both salary amounts and additional text (e.g. 30,000, 30K, or 25-30k). Therefore, I had to iterate several times until the output was good enough. I am quite sure that the code chunk below can be written in a more elegant way. Again, please let me know if you have any suggestions here.

# Creating a full and clean dataframe
  salaries <- salarydata(charityjobs)

# Parsing salary-range variable
  salaries$salary_range <- gsub(", ", ",", salaries$salary_range) %>% 
    gsub(" ; ", ";", .) %>% gsub("; ", ";", .) %>% 
    gsub(",[:A-z:]", " ", .) %>%
    gsub("\\(*", "", .) %>% 
    gsub("\\:", "", .) %>% 
    gsub("[:A-z:],[:A-z:]", " ", .) %>% 
    gsub("(..),00\\...", "\\1,000", .) %>% 
    gsub("(..),0\\...", "\\1,000", .) %>% 
    gsub("[A-z]", "", .) %>% gsub(",", "", .) %>% 
    gsub("\\.", "", .) %>% gsub("^\\s+", "", .) %>% 
    gsub("\\s([[:digit:]])", ";\\1", .) %>% 
    gsub("\\s+", "", .) %>% gsub("^[[:digit:]]{1};", "", .) %>% 
    gsub("\\(", "", .) %>% gsub("\\)", "", .) %>% # Deleting "(" and ")" 
    gsub("\\/", "", .) %>% gsub("000000", "0000", .) %>% # Deleting "/" and correcting digits
    gsub("([[:digit:]]{2})00000", "\\1000", .)  %>%  # Correcting number of digits
    gsub("([[:digit:]]{5})00", "\\1", .) # Correcting number of digits 
  
# Adjusting data and computing lower and upper salaries using ";" as separator
  salaries <- suppressWarnings(
    salaries %>% 
      mutate(upper_salary = as.numeric(gsub("^.*;", "", salary_range)),
             lower_salary = as.numeric(gsub(";.*", "", salary_range))) %>% 
      filter(upper_salary < 150000, upper_salary > 4000,
             lower_salary < 150000, lower_salary > 4000) %>%
      # A single amount (no range) gives lower == upper: keep it as upper salary only
      mutate(lower_salary = ifelse(lower_salary >= upper_salary, NA, lower_salary)) %>%
      filter(!is.na(upper_salary)) %>% tbl_df() %>% 
      select(deadline, recruiter, position, 
             lower_salary, upper_salary, salary_range) %>% 
      mutate(deadline = dmy(deadline))
  ) %>% 
    filter(as.Date(deadline) < Sys.Date() + 30)
  
  # Removing entries which are too different from most entries in terms of format
  salaries  <- salaries %>% 
    filter(position!="WELFARE OFFICER The Cinema and Television Benevolent Fund") %>%
    filter(position!="Senior Development Manager (International) University of Exeter") %>%
    filter(position!="Head of communications TMP (UK) Limited") %>%  
    filter(position!="Programme Partnerships Manager (Flexible location) Alzheimer's Society") %>%
    unique() # Filtering any duplicated entries
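
To make the splitting step easier to follow, here is a minimal base-R sketch on three made-up values in the cleaned "lower;upper" format. The toy vector is an assumption for illustration only; the two `gsub()` patterns are the same ones used in the chunk above.

```r
# Toy cleaned values (made up) in the "lower;upper" format produced by the parsing step
salary_range <- c("25000;30000", "28000", "32000;32000")

upper <- as.numeric(gsub("^.*;", "", salary_range))  # text after the last ";"
lower <- as.numeric(gsub(";.*", "", salary_range))   # text before the first ";"
lower[lower >= upper] <- NA                          # single amounts have no lower bound

data.frame(salary_range, lower, upper)
```

A string without ";" is returned unchanged by both patterns, so a single amount ends up in both columns and the lower value is then set to NA.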

The output below presents a summary of the full dataframe (first 10 observations).

## Source: local data frame [1,355 x 6]
## 
##      deadline                                                  recruiter
##        (time)                                                      (chr)
## 1  2016-10-31                                     Rethink Mental Illness
## 2  2016-11-12                                            TPP Recruitment
## 3  2016-11-07                                      CARE International UK
## 4  2016-11-19                                             Robertson Bell
## 5  2016-11-07                                                  The Winch
## 6  2016-11-02                                        Canal & River Trust
## 7  2016-11-02                                             Missing People
## 8  2016-11-02 North London Cares & South London Cares (The Cares Family)
## 9  2016-11-06                                         Family Futures CIC
## 10 2016-11-06                                Helplines Partnership (HLP)
## ..        ...                                                        ...
## Variables not shown: position (chr), lower_salary (dbl), upper_salary
##   (dbl), salary_range (chr)

Results

The table below presents standard descriptive statistics for lower and upper salaries. The final dataset contains information on 1,355 jobs of various types, based on yearly salary figures. It excludes consultancy assignments and other jobs paid by the hour or by the day, as well as jobs which did not provide salary information.

For job announcements providing a single value (not a salary range), that amount was stored in the variable upper_salary, while the variable lower_salary was set to NA (not available). That is why the number of observations (N) is 708 for lower salaries and 1,355 for upper salaries: about 48% of the job announcements gave a single salary amount rather than a range.

Summary statistics of salaries (in British pounds)

Statistic	N	Mean	Median	St. Dev.	Min.	Max.
Lower salary	708	30,211	28,000	10,119	4,290	70,000
Upper salary	1,355	32,546	30,000	11,477	7,048	75,000
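
For readers who want to reproduce a table of this kind, the following is a small base-R sketch. The helper name `describe` and the toy data are hypothetical, and this is not necessarily how the table above was generated; it only shows the statistics being computed with missing values dropped per column.

```r
# Hypothetical helper: one row of descriptive statistics per numeric column,
# ignoring missing values (which is why N can differ between columns)
describe <- function(df) {
  t(sapply(df, function(x) {
    x <- x[!is.na(x)]
    c(N = length(x), Mean = mean(x), Median = median(x),
      St.Dev = sd(x), Min = min(x), Max = max(x))
  }))
}

# Made-up data with one missing lower salary
toy <- data.frame(lower_salary = c(25000, NA, 30000),
                  upper_salary = c(30000, 28000, 35000))
round(describe(toy))
```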

In a more in-depth analysis for some future post, it could be interesting to look into payments for jobs paid by the hour and by the day, as well as into more specific job categories. One way of approaching specific job categories would be to define tags for job titles using standard words from titles (e.g., director, management, assistant) and to group them by tag type in a new factor variable.
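
As a rough sketch of that tagging idea, the code below assigns each title the first matching tag from a small dictionary. The dictionary `tags`, the function `tag_title` and the example titles are all hypothetical choices for illustration; a real analysis would need a more carefully built list of keywords.

```r
# Hypothetical tag dictionary: tag name -> regular expression matched against the title
tags <- c(director  = "director|head of|chief",
          manager   = "manager|management",
          officer   = "officer",
          assistant = "assistant|administrator")

# Returns the first tag whose pattern matches the title, or "other" if none does
tag_title <- function(title) {
  hit <- names(tags)[sapply(tags, grepl, x = title, ignore.case = TRUE)]
  if (length(hit) == 0) "other" else hit[1]
}

titles  <- c("Director of Philanthropy", "Shop Manager",
             "Play Worker", "Projects Assistant")
job_tag <- factor(sapply(titles, tag_title, USE.NAMES = FALSE))
table(job_tag)
```

Because the first matching tag wins, the order of the dictionary decides how titles such as "Assistant Director" are classified.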

Histogram with distribution of lower salaries

Histogram with distribution of upper salaries

The 10 most frequent recruiters

The table below presents the ranking of the 10 most frequent recruiters in the dataset. Column “N” presents the number of total announcements for each recruiter while column “Freq” shows the percentage of total announcements for each recruiter. Among these are also recruitment agencies.

Ranking	Recruiter	N	Freq
1	Robertson Bell	59	4.35
2	TPP Recruitment	44	3.25
3	Prospectus Ltd	42	3.10
4	Charity People Ltd	41	3.03
5	Harris Hill Charity Recruitment	29	2.14
6	Morgan Law Limited	24	1.77
7	British Red Cross	21	1.55
8	Creative Support	20	1.48
9	Eden Brown	18	1.33
10	Morgan Hunt	17	1.25
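
A ranking like the one above can be built from a simple frequency count. The sketch below uses a made-up vector of six recruiter names rather than the real dataset, and the column names simply mirror the table; it is an illustration, not the code behind the original table.

```r
# Made-up recruiter column with six announcements
recruiter <- c("Robertson Bell", "TPP Recruitment", "Robertson Bell",
               "British Red Cross", "Robertson Bell", "TPP Recruitment")

counts  <- sort(table(recruiter), decreasing = TRUE)
ranking <- data.frame(Ranking   = seq_along(counts),
                      Recruiter = names(counts),
                      N         = as.vector(counts),
                      # share of all announcements, in percent
                      Freq      = round(100 * as.vector(counts) / length(recruiter), 2))
head(ranking, 10)
```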

The tables below show the ranking of the jobs with the 10 lowest and 10 highest upper salaries.

The jobs with the 10 lowest upper salaries

Ranking	Title	Amount
1	Play Quarters Project Worker London Play	7,048
2	Housing Worker - Mortimer House, Exeter Westward	7,231
3	Play Worker PACT	7,722
4	Administrator (Female)* Stonewater	8,902
5	Projects Assistant, Urban Wild Places Octopus Community Network Limited	10,800
6	Shop Manager Oxfam GB	11,360
7	Peer Support Coordinator - Dementia Project City & Hackney Carers Centre	11,538
8	20’s-30’s Development Leader - St Michael’s Church, Budbrooke Diocese of Coventry	11,979
9	Grants and Projects Assistant (term time only) Family Holiday Association	12,906
10	Part-time FSE/ESOL Tutor The ClementJames Centre	13,000

The jobs with the 10 highest upper salaries

Ranking	Title	Amount
1	Group Financial Controller Allen Lane Financial Recruitment	75,000
2	Interim International Group Head of Financial Accounts Robertson Bell	75,000
3	Director of Philanthropy I CAN	75,000
4	Programme Director Prospectus	75,000
5	Director of Income International HIV/AIDS Alliance	74,000
6	Head of Legal and Democratic Services East Herts Council	73,530
7	Head of Infrastructure Management The Guide Dogs for the Blind Association	72,000
8	Programme Director - Financial Transparency Pro-Finance	70,000
9	Director of Advice & Content Asthma UK	70,000
10	Head of Finance Hays London Ebury Gate	70,000

I also wanted to quickly explore possible relationships between deadline dates and salary levels, just for fun. It could be, for example, that some periods had lower average-salary offers than others.

Despite the large number of job announcements in the dataset (N=1,355), all observations refer to jobs with application deadlines between 22 October 2016 and 20 November 2016. This is a short time span for such analysis, but I explored it anyway just as an example of what these tools and techniques can do.

The plot below shows upper salaries by application deadline (restricted to deadlines less than 30 days after 22 October 2016). The red line shows a fitted linear regression, a very basic attempt to describe how changes in the deadline variable (predictor) relate to changes in the upper_salary variable (response). The linear model explains almost none of the variation in the response variable (\(R^{2}\) = 0.006). However, it suggests a statistically significant relationship between upper salary and application-deadline date at \(\alpha\) = 0.05 (p = 0.004). The output below presents the summary of the results obtained in the linear regression.

Summary of linear regression (Upper salary vs. application deadline)

## 
## Call:
## lm(formula = upper_salary ~ deadline, data = salaries_under30d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24511  -8113  -2122   6054  43377 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -2.015e+06  7.187e+05  -2.804  0.00512 **
## deadline     1.197e+02  4.201e+01   2.849  0.00445 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11450 on 1353 degrees of freedom
## Multiple R-squared:  0.005963,   Adjusted R-squared:  0.005229 
## F-statistic: 8.117 on 1 and 1353 DF,  p-value: 0.004452
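
When reading these coefficients, it helps to remember that R stores a Date as the number of days since 1 January 1970. Assuming deadline is a Date, the slope is therefore expressed in pounds per day (about GBP 120 per additional day here), and the very large negative intercept is simply the fitted value extrapolated back to 1970. The sketch below fits the same kind of model on simulated data; the data frame `toy` and all its numbers are made up for illustration.

```r
# Simulated data (not the real dataset): deadlines over 30 days and noisy salaries
set.seed(1)
start <- as.Date("2016-10-22")
toy   <- data.frame(deadline = start + sample(0:29, 200, replace = TRUE))
toy$upper_salary <- 30000 + 120 * as.numeric(toy$deadline - start) +
  rnorm(200, sd = 11000)

fit   <- lm(upper_salary ~ deadline, data = toy)
slope <- coef(fit)[["deadline"]]   # estimated change in GBP per additional day
r2    <- summary(fit)$r.squared    # share of the variance explained by the model
c(slope = slope, r.squared = r2)
```

With this much noise relative to the trend, a small \(R^{2}\) alongside a non-zero slope is what one would expect, which mirrors the pattern in the output above.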

Next, I will use word clouds to explore job titles. The larger the word in the cloud, the higher its frequency in the dataset. The words below are only those mentioned in at least 10 job announcements. The plot indicates that management positions are the most frequent ones, followed by coordination jobs, as well as officer, recruitment and fundraising jobs.

Word cloud of job titles
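
The input to a word cloud is just a table of word frequencies. The base-R sketch below shows one way to compute it from a handful of made-up titles; the minimum word length and the threshold of 2 are assumptions chosen for the toy data (the post uses a threshold of 10). A plotting package such as wordcloud could then be used to draw the result.

```r
# Made-up job titles for illustration
titles <- c("Fundraising Manager", "Finance Manager", "Project Officer",
            "Fundraising Officer", "Head of Finance")

words <- unlist(strsplit(tolower(titles), "[^a-z]+"))  # split on anything that is not a letter
words <- words[nchar(words) > 2]                       # drop empty strings and short words such as "of"
freq  <- sort(table(words), decreasing = TRUE)
freq[freq >= 2]                                        # keep words above the chosen threshold
```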

The cloud below shows the most frequent words in the names of the recruiting institutions. I assumed that its results could provide hints about the most active thematic areas in terms of job announcements. The words in the plot below are also those which have been mentioned in at least 10 job announcements. The word cloud suggests that recruitment agencies are among the leading recruiters, as expected (see section “The 10 most frequent recruiters”). Organisations working with children, cancer, law, international action, education and Alzheimer's patients also seem to stand out.

Word cloud of recruiters

Moving forward

The charity, development aid, not-for-profit and social enterprise sector is evolving rapidly. This process is powered both by increasingly critical global challenges and, of course, by capable and motivated entrepreneurs, staff and service suppliers. It is a sector which some people romanticise too much. As a consultant and entrepreneur in the sector, I am often asked how I manage to deal with all the daydreamers I come across. I make no judgment about that, but it indicates how little the public still knows about the sector, which has become increasingly professional and results-oriented. I believe that data science can help the sector, particularly in monitoring and evaluating performance, including staff and beneficiary / client satisfaction.

I hope you enjoyed this tour. I would be happy to receive your suggestions for additional analysis and improvement.

Keep coding and take care!

Written by: Dr. rer. pol. Eduardo W. Ferreira (Ph. D.) / Consultant, trainer and facilitator in designing, managing and evaluating projects and programmes in Africa, Asia, Europe, Central and South America for non-governmental organisations, governments, consultancy firms, research institutions and international organisations (Additional information).