Project Overview:

This project serves as the first phase in a broader study of 501c3 (tax exempt) not-for-profit organizations operating in the state of Vermont. It represents a proof-of-concept for collecting, organizing, and evaluating financial data and related metrics for these organizations. And an initial attempt to identify emergent features in their operations and leadership that, when studied in aggregate, can provide information to understand their status, distribution, resources, and collaborative potential at a regional scale.

The information that informs this research is drawn from annual financial disclosures to the IRS by the 501c3 organizations. These disclosures include FORM 990, FORM 990EZ, FORM 990N, and Form 990PF along with their respective schedules. While these are public documents, options for collating them in a systematic manner (i.e., across multiple organizations by sector, geography, range-of-time, etc.) are limited - absent a fee-for-service platform.

This report is divided into two .rmd files (a and b) and outlines steps for accessing and decomposing 501c3 disclosures via. web-based search queries and scraping techniques - in batch. It also provides initial results of exploratory data analysis using Form 990 information, focusing on cross-sector comparisons, as well as description of challenges inherent to this approach.

The bulk of data collected in this manner were identified using Propublica’s Nonprofit Explorer (https://projects.propublica.org/nonprofits/) and sequentially extracted as unstructured XML pages served by Amazon Web Server. From search to analysis, the following steps are included in the report:

Part a – see project file: Data607_Proj4a_Connin

  1. Data identification and preliminary scraping
  2. Secondary scraping to collect XML links to 990 Forms
  3. Tertiary scaping to collect/download/convert-to-text raw 990 data

Part b

  1. Data processing - text analysis
  2. Initial exploratory data analysis and assessment
  3. Conclusions

R packages/libraries.

Establish file paths for text analysis

Step 4. Extract the data from individual .txt files by category. Process data for analysis.

#Create collector dataframe for string data extracted from raw 990 text documents

org_990 <- data.frame(matrix(ncol=25))

colnames(org_990)<-c("xml_num","organization", "tax_year", "form", "address", "city", "website", "ein", "start_yr", "mission", "principal_officer", "employee_count", "volunteer_count", "PY_total_revenue", "PY_total_expenses", "PY_net", "CY_total_revenue", "CY_total_expenses", "CY_net", "CY_total_EOY_assets", "CY_total_EOY_liabilities", "CY_fund_balance", "CY_total_salaries", "CY_grants_contributions", "CY_program_service_rev")

for(file in arts_files){
  
    string<-readLines(file, skipNul = TRUE)  #we only the lines that have text. Blank lines still counted.
    
    if(length(string)<3) next #skip files that are NA

    string <- read_lines(string,skip=16,n_max = 97)  #skip lines to target select information
    string<-glue_collapse(string, sep="\n")
  
  xml_num<-file%>%str_extract("\\d+")
  organization<-string%>%str_extract("(?<=<BusinessNameLine1Txt>).+[:alnum:](?=<)")
  tax_year<-string%>%str_extract("(?<=<TaxYr>).+[:alnum:](?=<)")
  form<-string%>%str_extract("(?<=<ReturnTypeCd>).+[:alnum:](?=<)")
  address<-string%>%str_extract("(?<=<AddressLine1Txt>).+[:alnum:](?=<)")
  city<-string%>%str_extract("(?<=<CityNm>).+[:alnum:](?=<)")
  website<-string%>%str_extract("(?<=<WebsiteAddressTxt>).+[:alnum:](?=<)")
  ein<-string%>%str_extract("(?<=<EIN>).+[:alnum:](?=<)")
  start_yr<-string%>%str_extract("(?<=<FormationYr>).+[:alnum:](?=<)")
  mission<-string%>%str_extract("(?<=<MissionDesc>).+[:alnum:].(?=<)")
  principal_officer<-string%>%str_extract("(?<=<PrincipalOfficerNm>).+[:alnum:](?=<)")
  employee_count<-string%>%str_extract("(?<=<TotalEmployeeCnt>).+[:alnum:](?=<)")
  volunteer_count<-string%>%str_extract("(?<=<TotalVolunteersCnt>).+[:alnum:](?=<)")
  PY_total_revenue<-string%>%str_extract("(?<=<PYTotalRevenueAmt>).+[:alnum:](?=<)")
  PY_total_expenses<-string%>%str_extract("(?<=<PYTotalExpensesAmt>).+[:alnum:](?=<)")
  PY_net<-string%>%str_extract("(?<=<PYRevenuesLessExpensesAmt>).+[:alnum:](?=<)")
  CY_total_revenue<-string%>%str_extract("(?<=<CYTotalRevenueAmt>).+[:alnum:](?=<)")
  CY_total_expenses<-string%>%str_extract("(?<=<PYTotalExpensesAmt>).+[:alnum:](?=<)")
  CY_net<-string%>%str_extract("(?<=<CYRevenuesLessExpensesAmt>).+[:alnum:](?=<)")
  CY_total_EOY_assets<-string%>%str_extract("(?<=<TotalAssetsEOYAmt>).+[:alnum:](?=<)")
  CY_total_EOY_liabilities<-string%>%str_extract("(?<=<TotalLiabilitiesEOYAmt>).+[:alnum:](?=<)")
  CY_fund_balance<-string%>%str_extract("(?<=<NetAssetsOrFundBalancesEOYAmt>).+[:alnum:](?=<)")
  CY_total_salaries<-string%>%str_extract("(?<=<CYSalariesCompEmpBnftPaidAmt>).+[:alnum:](?=<)")
  CY_grants_contributions<-string%>%str_extract("(?<=<CYContributionsGrantsAmt>).+[:alnum:](?=<)")
  CY_program_service_rev<-string%>%str_extract("(?<=<CYProgramServiceRevenueAmt>).+[:alnum:](?=<)")
  
  df <-data.frame(xml_num, organization, tax_year, form, address, city, website, ein, start_yr, mission, principal_officer, employee_count, volunteer_count, PY_total_revenue, PY_total_expenses, PY_net, CY_total_revenue, CY_total_expenses, CY_net, CY_total_EOY_assets, CY_total_EOY_liabilities, CY_fund_balance, CY_total_salaries, CY_grants_contributions, CY_program_service_rev)%>%trimws()
  
  org_990<-rbind(org_990, df)

} 

org_990$xml_num<-as.numeric(org_990$xml_num) # convert from char to dbl for sorting
org_990%<>%arrange(org_990, (xml_num))
View(org_990)

#write_csv(org_990,"arts990.csv")

Create an aggregated dataset of 990 information.

Note: for demonstrative purposes this report focuses on the 990 Form and does not include data from forms: 990EZ, 990PF, 990N. Additional scripts will be required to differentiate and parse these forms. In addition, this report only includes the Section 1 summary data from the 990’s and is limited to the most recent filing year listed for each organization. Future work will focus on extracting information from other 990 sections, schedules, and reporting periods.

Step 5. Initial data exploration and visualization.

The following code chunk highlights organization counts and start-dates by category from the mid-19th century to present.

#create data subset for analysis and limit forms to 2019 or 2018 filing 

vtnps<- data.frame(rbind(env, educ, inter, relig, heal, arts, publ))%>%
  relocate(category, .after=organization)%>%
  filter(form %in% "990")%>%
  filter(tax_year==2019 | tax_year==2018)

# show basic stats and df composition

skim(vtnps)
Data summary
Name vtnps
Number of rows 597
Number of columns 26
_______________________
Column type frequency:
character 9
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
organization 0 1.00 8 70 0 522 0
category 0 1.00 6 23 0 7 0
form 0 1.00 3 3 0 1 0
address 0 1.00 6 35 0 499 0
city 0 1.00 5 20 0 170 0
website 39 0.93 3 44 0 414 0
ein 0 1.00 9 9 0 524 0
mission 195 0.67 9 1000 0 352 0
principal_officer 54 0.91 6 26 0 460 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
xml_num 0 1.00 270.35 168.92 3 132.00 250.0 390.0 679 <U+2587><U+2587><U+2586><U+2585><U+2582>
tax_year 0 1.00 2018.45 0.50 2018 2018.00 2018.0 2019.0 2019 <U+2587><U+2581><U+2581><U+2581><U+2586>
start_yr 24 0.96 1985.44 32.12 1819 1975.00 1995.0 2005.0 2019 <U+2581><U+2581><U+2581><U+2582><U+2587>
employee_count 312 0.48 164.26 587.27 10 18.00 30.0 92.0 8686 <U+2587><U+2581><U+2581><U+2581><U+2581>
volunteer_count 284 0.52 302.11 1303.67 10 25.00 50.0 117.0 14000 <U+2587><U+2581><U+2581><U+2581><U+2581>
PY_total_revenue 9 0.98 8927343.18 63662126.87 -16281 277462.25 592029.5 1717746.2 1433600411 <U+2587><U+2581><U+2581><U+2581><U+2581>
PY_total_expenses 9 0.98 8278029.14 60609912.23 -1005823 238535.75 550441.0 1421340.8 1360613119 <U+2587><U+2581><U+2581><U+2581><U+2581>
PY_net 13 0.98 653761.39 4686280.87 -7299122 -19803.75 22864.5 137345.2 72987292 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_revenue 0 1.00 8980391.21 67253585.89 -33419 272757.00 603793.0 1825858.0 1529529583 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_expenses 9 0.98 8278029.14 60609912.23 -1005823 238535.75 550441.0 1421340.8 1360613119 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_net 3 0.99 279235.25 3158780.06 -25315580 -35786.25 12086.0 111772.0 62328659 <U+2581><U+2587><U+2581><U+2581><U+2581>
CY_total_EOY_assets 8 0.99 13331113.04 83460700.94 -36495 299968.00 1025034.0 4225068.0 1695000215 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_EOY_liabilities 94 0.84 5643250.12 44897442.59 39 20406.00 120996.0 690764.0 723839777 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_fund_balance 134 0.78 5966779.33 49711368.60 -3444185 176380.00 688895.0 2253933.5 1033709335 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_salaries 80 0.87 5650006.93 40412075.85 760 143793.00 324377.0 973829.0 832147624 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_grants_contributions 39 0.93 1036136.54 2952325.40 17 94915.75 272284.5 746145.0 40996621 <U+2587><U+2581><U+2581><U+2581><U+2581>
CY_program_service_rev 130 0.78 9805485.16 73336731.06 25 64017.00 262558.0 1024773.0 1477791395 <U+2587><U+2581><U+2581><U+2581><U+2581>
# glimpse dataset

glimpse(vtnps)
## Rows: 597
## Columns: 26
## $ xml_num                  <dbl> 9, 25, 37, 38, 49, 57, 60, 62, 63, 72, 76, 90~
## $ organization             <chr> "Clean Energy States Alliance Inc", "GUANACAS~
## $ category                 <chr> "environment_animals", "environment_animals",~
## $ tax_year                 <dbl> 2019, 2018, 2019, 2018, 2019, 2018, 2018, 201~
## $ form                     <chr> "990", "990", "990", "990", "990", "990", "99~
## $ address                  <chr> "50 State Street", "4780 MAIN ROAD", "PO Box ~
## $ city                     <chr> "Montpelier", "HUNTINGTON", "Charlotte", "Bro~
## $ website                  <chr> "www.cesa.org", "www.gdfcf.org", "www.lewiscr~
## $ ein                      <chr> "270029803", "943280315", "030349148", "03600~
## $ start_yr                 <dbl> 2002, 1997, 1995, 1915, 1992, 2000, 2016, 197~
## $ mission                  <chr> "Technical support to States' clean energy ag~
## $ principal_officer        <chr> "Anthony Vargo", NA, "Martha Illick", "SUSAN ~
## $ employee_count           <dbl> NA, NA, NA, 18, 31, NA, NA, 47, 19, 21, 16, N~
## $ volunteer_count          <dbl> 14, 15, 74, NA, NA, NA, NA, 180, 16, 83, NA, ~
## $ PY_total_revenue         <dbl> 1565197, 6007561, 156205, 458419, 7980411, 11~
## $ PY_total_expenses        <dbl> 1395391, 2166001, 128464, 488587, 7726366, 71~
## $ PY_net                   <dbl> 169806, 3841560, 27741, -30168, 254045, 38863~
## $ CY_total_revenue         <dbl> 1541453, 4011292, 226889, 393201, 7861211, 18~
## $ CY_total_expenses        <dbl> 1395391, 2166001, 128464, 488587, 7726366, 71~
## $ CY_net                   <dbl> -17351, 1794998, 33768, -144257, -748228, 136~
## $ CY_total_EOY_assets      <dbl> 1094126, 17111063, 95950, 3401790, 6736910, 2~
## $ CY_total_EOY_liabilities <dbl> 269797, 181243, NA, 7082, 1725796, NA, 22502,~
## $ CY_fund_balance          <dbl> 824329, 16929820, NA, 3394708, 5011114, 22175~
## $ CY_total_salaries        <dbl> 838077, 760634, 44213, 238038, 4465965, NA, 1~
## $ CY_grants_contributions  <dbl> 1251720, 3534249, 224829, 224514, 6888145, 18~
## $ CY_program_service_rev   <dbl> 284379, 271040, 1556, 69841, 871478, NA, 4881~
# plot history of start year with annual count -- w/o category

vtnps %>%
  ggplot(aes(x=start_yr)) +
  geom_histogram(alpha=0.6, binwidth = 1, fill="dark blue") +
  labs( x='\nStart Year', y = 'Total Count', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")+
  theme_minimal()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ 
  theme(axis.text.x = element_text(angle = 45, hjust = 0, vjust = 1, size=10)) 

#profile organization start-yr and number by category

vtnps%>%
  filter(start_yr>1850)%>%
  count(start_yr,category) %>%
  ggplot()+geom_col(aes(x=start_yr,y=n,fill=category))+
  labs( x='\nStart Year', y = 'Total Count', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")

#density of start over time

ggplot(data=vtnps, aes(x=start_yr, group=category, fill=category)) +
    geom_density(adjust=1.5, alpha=.3) +
    labs( x='\nStart Year', y = 'Density', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")+
    theme_ipsum()

# annual addition of new nonprofits since 1950

vtnps %>%
  filter(start_yr>1950)%>%
  ggplot(aes(x=start_yr, color=category, fill=category, alpha=.4)) +
  geom_histogram(alpha=0.6, binwidth = 1) +
  scale_fill_viridis(discrete=TRUE, option="D") +
  scale_color_viridis(discrete=TRUE, option="D") +
  labs( x='\nStart Year', y = 'Total Addition\n', title = "Annual Addition of 501c3 Non-Profits in Vermont", subtitle="1950 to Present")+
  theme_minimal()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ 

  facet_wrap(.~category, ncol=2)

#annual addition of new nonprofits since 2000

vtnps %>%
  filter(start_yr>2000)%>%
  ggplot(aes(x=start_yr, color=category, fill=category, alpha=.4)) +
  geom_histogram(alpha=0.6, binwidth = 1) +
  scale_fill_viridis(discrete=TRUE, option="D") +
  scale_color_viridis(discrete=TRUE, option="D") +
  labs( x='\nStart Year', y = 'Total Addition\n', title = "Annual Addition of 501c3 Non-Profits in Vermont", subtitle="2000 to Present")+
  theme_minimal()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ 
  theme(axis.text.x = element_text(angle = 45, hjust = 0, vjust = 1, size=10)) 

  facet_wrap(~category, ncol=2)
## <ggproto object: Class FacetWrap, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetWrap, Facet, gg>

The following code chunk focuses on employee count data by category

#summary stats for employee count

 vtnps$employee_count%>%summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0    18.0    30.0   164.3    92.0  8686.0     312
# Frequency of employee counts in organization

vtnps%>%
  ggplot(aes(x=employee_count))+
  geom_histogram(color="dark blue", fill="dark blue", alpha=.7)+
  labs( x='\nTotal Employees in Organization', y = 'Frequency\n', title = "Employee Counts for 501c3 Non-Profits in Vermont")+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# quantiles for total employee counts

vtnps%>%group_by(category)%>%
  group_by(category)%>%
  mutate(category=fct_reorder(category, employee_count, .fun='median'))%>%
  ggplot( aes(x=reorder(category, employee_count), y=employee_count)) + 
    geom_boxplot(fill="dark blue", alpha=.3, outlier.colour = "red")+
    coord_flip()+
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))

# Top 20 employee counts by organization

vtnps%>%
  top_n(20, employee_count) %>%
  ggplot(aes(x=employee_count, y=reorder(organization, employee_count), fill=category))+
  geom_col()+
  scale_fill_manual(values = c("#C3D7A4","#4E84C4"))+
  labs( x='\nTotal Employees in Organization', y = 'Organization\n', title = "Top 20 Employee Counts", subtitle= "501c3 Non-Profits in Vermont")+
  theme_minimal()

# Limit analysis to organizations below 100 employees
  
  vtnps%>%
  filter(employee_count<100)%>%
  ggplot(aes(x=employee_count)) +
  geom_boxplot()+
  theme_minimal()

#Distribution of employee counts <1000
  
vtnps%>%
  filter(employee_count<1000)%>%
  ggplot(aes(x=employee_count)) + 
  geom_bar(colour="blue", fill="dark blue", alpha=.3)+
  labs( x="Number of Employees", y = "Number of Organizations\n", title = "Distribution of Employee Counts", subtitle="Organizations with 1000 or Fewer Employees")+
  theme_minimal()+
  theme(panel.grid = element_blank())

# quantiles for organizations with fewer than 1000 employees
  
vtnps%>%
    group_by(category)%>%
    filter(employee_count<1000)%>%
    ggplot(aes(x=reorder(category, employee_count), y=employee_count))+
    geom_boxplot(fill="dark blue", alpha=.3, outlier.colour = "red")+
    labs( x="", y = "\n\nNumber of Employees", title = "Total Employees by 501c3 Category", subtitle="Organizations with 1000 or Fewer Employees")+
    coord_flip()+
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))

The following code chunk focuses on salary data by category

# average salary by category --> excludes salaries over 200,000

vtnps%>%
  mutate(avg_salary = CY_total_salaries/employee_count)%>%
  filter(avg_salary<200000)%>%
  group_by(category)%>%
  mutate(category=fct_reorder(category, avg_salary, .fun='median'))%>%
  ggplot( aes(x=reorder(category, avg_salary), y=avg_salary)) + 
  geom_boxplot(fill="dark green", alpha=.3, outlier.colour = "blue")+
   labs( x="", y = "\n\nAverage Salary", title = "Average Salaries by 501c3 Category", subtitle="Salaries less than $200,000\n\n")+
  coord_flip()+
  theme_minimal()+
  theme(axis.title.y = element_blank())+
  theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10)) 

The following code chunk highights selected financial data by category

#Ratio of liabilities assets >1 indicates an excess of debt -- the proportion of a company’s assets that are financed by debt. A ratio greater than 1 shows that a considerable portion of debt is funded by assets. In other words, the company has more liabilities than assets. A high ratio also indicates that a company may be putting itself at risk of default on its loans if interest rates were to rise suddenly.


vtnps%>%
  mutate(liability_ratio = CY_total_EOY_liabilities/CY_total_EOY_assets)%>%
  filter(liability_ratio<50)%>%  #note, one outlier at 400+
  ggplot(aes(x=category, y=liability_ratio))+
  geom_point(color='darkblue')+
  coord_flip()+
  geom_hline(yintercept = 1, color="red")+
   labs( x="501c3 Category", y = "\n\nLiability Ratio", title = "Liability Ratios by 501c3 Category\n", subtitle="A ratio >1 indicates an excess of debt over assets\n\n")+
  theme_minimal()+
  theme(panel.grid = element_blank())+
  theme(axis.title.y = element_blank())

lratio<-vtnps%>%
  mutate(liability_ratio = CY_total_EOY_liabilities/CY_total_EOY_assets)

labove<-length(which(lratio$liability_ratio > 1))
lbelow<-length(which(lratio$liability_ratio <= 1))
(lprop<-round((labove/lbelow)*100, 0))
## [1] 5
glue("The number of organizations with a liability >1 and <=1 is {labove} and {lbelow}, respectively. The latter represents {lprop}% of the total count.")
## The number of organizations with a liability >1 and <=1 is 22 and 480, respectively. The latter represents 5% of the total count.
#Ratio of revenue to expenses between yrs. net operating ratio of income to expenses, which tells you how efficiently the organization is using its money to fund operations

vtnps%>%
  mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
  mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
  mutate(net_annual_diff=current_net-prior_net)%>%
  group_by(category)%>%
  filter(current_net<100)%>%
  ggplot(aes(x=category, net_annual_diff))+
  geom_point()+
  geom_hline(yintercept = 0, color="red")+
  coord_flip()+
  labs(x="", y= "Net Difference in Revenue:Expenses", title=("\n\nDifference in end-of-year revenue:expense"), subtitle=('Last two reporting years\n\n'))+
  theme_minimal()+
  theme(panel.grid = element_blank())+
  theme(axis.title.y = element_blank())

#further comparison of revenue to expense

adiff<-vtnps%>%
  mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
  mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
  mutate(net_annual_diff=current_net-prior_net)

(adiff_a<-length(which(adiff$net_annual_diff > 0)))
## [1] 339
(adiff_b<-length(which(adiff$net_annual_diff <= 0)))
## [1] 249
(adiff_r<-round((adiff_b/adiff_a)*100, 0))
## [1] 73
a_diff<-vtnps%>%
  mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
  mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
  mutate(net_annual_diff=current_net-prior_net)

labove<-length(which(lratio$liability_ratio > 1))
lbelow<-length(which(lratio$liability_ratio <= 1))
(lprop<-round((labove/lbelow)*100, 0))
## [1] 5
#per capita net revenue by organization category

vtnps%>%
  mutate(net=CY_total_revenue-CY_total_expenses)%>%
  mutate(rev_person=net/employee_count)%>%
  group_by(category)%>%
  filter(net<500000)%>%
  ggplot(aes(x=reorder(category, rev_person), y=rev_person))+
  geom_boxplot()+
  geom_hline(yintercept = 0, color="red")+
  coord_flip()+
  labs(x="Per Capita Net", y= "", title=("Current Year Per Capita Net Revenue\n"), subtitle=('Organizations with Net Revenue < $500,000\n\n'))+
  theme_minimal()+
  theme(panel.grid = element_blank())+
  theme(axis.title.y = element_blank())
## Warning: Removed 282 rows containing non-finite values (stat_boxplot).

# preliminary comparison of NY revenue and employee metrics

vtnps%>%
  filter(CY_total_salaries<200000000)%>%
  ggplot(aes(x=employee_count, y=CY_total_revenue))+
  geom_point()+
  theme_minimal()
## Warning: Removed 232 rows containing missing values (geom_point).

vtnps%>%
  filter(CY_total_salaries<200000000)%>%
  ggplot(aes(x=CY_total_salaries, y=CY_total_revenue))+
  geom_point()+
  theme_minimal()

##Conclusions and Recommendations

This project will continue with addition work focused on filling in missing information as well as constructing a data-table that holds information from multiple 990 forms and multiple years (2013 to present). The latter steps will require further scripting and automation to streamline data processing steps. It may also be advantageous to construct a relational database to facilitate data exploration and analysis.

In addition, additional scripts will be required to effectively extract key financial information from 990 forms that were not included in this initial phase. These features should include the following for 501c3 metrics:

  1. Liquidity - cash on hand that is not restricted
  2. Program expenses as percentage of total expenses.
  3. Sources of unrestricted recurring dollars.
  4. Full-cost coverage.
  5. Fund-raising expenses as percentage of total contributions.
  6. Cash flow from operations.

Other metrics include revenue reliability (across years) and revenue surplus.

Epilogue

The following chunk provides code for accessing 990 related information using Propublica’s API. In this context, select data is available via. query in JSON format.

url <- “https://projects.propublica.org/nonprofits/api/v2/search.json?limit=1000&q=%22vermont%22&ntee%5Bid%5D=7

res<-httr::GET(url)

restxt<-httr::content(res, as = ‘text’)

j <- fromJSON(restxt)

glimpse(j)

length(str(j$organizations[[1]]))

df_json2 <- fromJSON(json2)

str(j, nest.lev)

ny <- c() for (i in length(j)) { for (x in i) { print(x) } }