Data607_Proj4b

Project Overview:

This project serves as the first phase in a broader study of 501c3 (tax exempt) not-for-profit organizations operating in the state of Vermont. It represents a proof-of-concept for collecting, organizing, and evaluating financial data and related metrics for these organizations. And an initial attempt to identify emergent features in their operations and leadership that, when studied in aggregate, can provide information to understand their status, distribution, resources, and collaborative potential at a regional scale.

The information that informs this research is drawn from annual financial disclosures to the IRS by the 501c3 organizations. These disclosures include FORM 990, FORM 990EZ, FORM 990N, and Form 990PF along with their respective schedules. While these are public documents, options for collating them in a systematic manner (i.e., across multiple organizations by sector, geography, range-of-time, etc.) are limited - absent a fee-for-service platform.

This report is divided into two .rmd files (a and b) and outlines steps for accessing and decomposing 501c3 disclosures via. web-based search queries and scraping techniques - in batch. It also provides initial results of exploratory data analysis using Form 990 information, focusing on cross-sector comparisons, as well as description of challenges inherent to this approach.

The bulk of data collected in this manner were identified using Propublica’s Nonprofit Explorer (https://projects.propublica.org/nonprofits/) and sequentially extracted as unstructured XML pages served by Amazon Web Server. From search to analysis, the following steps are included in the report:

Part a – see project file: Data607_Proj4a_Connin

Data identification and preliminary scraping
Secondary scraping to collect XML links to 990 Forms
Tertiary scaping to collect/download/convert-to-text raw 990 data

Part b

Data processing - text analysis
Initial exploratory data analysis and assessment
Conclusions

R packages/libraries.

Establish file paths for text analysis

Step 4. Extract the data from individual .txt files by category. Process data for analysis.

#Create collector dataframe for string data extracted from raw 990 text documents

org_990 <- data.frame(matrix(ncol=25))

colnames(org_990)<-c("xml_num","organization", "tax_year", "form", "address", "city", "website", "ein", "start_yr", "mission", "principal_officer", "employee_count", "volunteer_count", "PY_total_revenue", "PY_total_expenses", "PY_net", "CY_total_revenue", "CY_total_expenses", "CY_net", "CY_total_EOY_assets", "CY_total_EOY_liabilities", "CY_fund_balance", "CY_total_salaries", "CY_grants_contributions", "CY_program_service_rev")

for(file in arts_files){
  
    string<-readLines(file, skipNul = TRUE)  #we only the lines that have text. Blank lines still counted.
    
    if(length(string)<3) next #skip files that are NA

    string <- read_lines(string,skip=16,n_max = 97)  #skip lines to target select information
    string<-glue_collapse(string, sep="\n")
  
  xml_num<-file%>%str_extract("\\d+")
  organization<-string%>%str_extract("(?<=<BusinessNameLine1Txt>).+[:alnum:](?=<)")
  tax_year<-string%>%str_extract("(?<=<TaxYr>).+[:alnum:](?=<)")
  form<-string%>%str_extract("(?<=<ReturnTypeCd>).+[:alnum:](?=<)")
  address<-string%>%str_extract("(?<=<AddressLine1Txt>).+[:alnum:](?=<)")
  city<-string%>%str_extract("(?<=<CityNm>).+[:alnum:](?=<)")
  website<-string%>%str_extract("(?<=<WebsiteAddressTxt>).+[:alnum:](?=<)")
  ein<-string%>%str_extract("(?<=<EIN>).+[:alnum:](?=<)")
  start_yr<-string%>%str_extract("(?<=<FormationYr>).+[:alnum:](?=<)")
  mission<-string%>%str_extract("(?<=<MissionDesc>).+[:alnum:].(?=<)")
  principal_officer<-string%>%str_extract("(?<=<PrincipalOfficerNm>).+[:alnum:](?=<)")
  employee_count<-string%>%str_extract("(?<=<TotalEmployeeCnt>).+[:alnum:](?=<)")
  volunteer_count<-string%>%str_extract("(?<=<TotalVolunteersCnt>).+[:alnum:](?=<)")
  PY_total_revenue<-string%>%str_extract("(?<=<PYTotalRevenueAmt>).+[:alnum:](?=<)")
  PY_total_expenses<-string%>%str_extract("(?<=<PYTotalExpensesAmt>).+[:alnum:](?=<)")
  PY_net<-string%>%str_extract("(?<=<PYRevenuesLessExpensesAmt>).+[:alnum:](?=<)")
  CY_total_revenue<-string%>%str_extract("(?<=<CYTotalRevenueAmt>).+[:alnum:](?=<)")
  CY_total_expenses<-string%>%str_extract("(?<=<PYTotalExpensesAmt>).+[:alnum:](?=<)")
  CY_net<-string%>%str_extract("(?<=<CYRevenuesLessExpensesAmt>).+[:alnum:](?=<)")
  CY_total_EOY_assets<-string%>%str_extract("(?<=<TotalAssetsEOYAmt>).+[:alnum:](?=<)")
  CY_total_EOY_liabilities<-string%>%str_extract("(?<=<TotalLiabilitiesEOYAmt>).+[:alnum:](?=<)")
  CY_fund_balance<-string%>%str_extract("(?<=<NetAssetsOrFundBalancesEOYAmt>).+[:alnum:](?=<)")
  CY_total_salaries<-string%>%str_extract("(?<=<CYSalariesCompEmpBnftPaidAmt>).+[:alnum:](?=<)")
  CY_grants_contributions<-string%>%str_extract("(?<=<CYContributionsGrantsAmt>).+[:alnum:](?=<)")
  CY_program_service_rev<-string%>%str_extract("(?<=<CYProgramServiceRevenueAmt>).+[:alnum:](?=<)")
  
  df <-data.frame(xml_num, organization, tax_year, form, address, city, website, ein, start_yr, mission, principal_officer, employee_count, volunteer_count, PY_total_revenue, PY_total_expenses, PY_net, CY_total_revenue, CY_total_expenses, CY_net, CY_total_EOY_assets, CY_total_EOY_liabilities, CY_fund_balance, CY_total_salaries, CY_grants_contributions, CY_program_service_rev)%>%trimws()
  
  org_990<-rbind(org_990, df)

} 

org_990$xml_num<-as.numeric(org_990$xml_num) # convert from char to dbl for sorting
org_990%<>%arrange(org_990, (xml_num))
View(org_990)

#write_csv(org_990,"arts990.csv")

Create an aggregated dataset of 990 information.

Note: for demonstrative purposes this report focuses on the 990 Form and does not include data from forms: 990EZ, 990PF, 990N. Additional scripts will be required to differentiate and parse these forms. In addition, this report only includes the Section 1 summary data from the 990’s and is limited to the most recent filing year listed for each organization. Future work will focus on extracting information from other 990 sections, schedules, and reporting periods.

Step 5. Initial data exploration and visualization.

The following code chunk highlights organization counts and start-dates by category from the mid-19th century to present.

#create data subset for analysis and limit forms to 2019 or 2018 filing 

vtnps<- data.frame(rbind(env, educ, inter, relig, heal, arts, publ))%>%
  relocate(category, .after=organization)%>%
  filter(form %in% "990")%>%
  filter(tax_year==2019 | tax_year==2018)

# show basic stats and df composition

skim(vtnps)

Data summary

Name	vtnps
Number of rows	597
Number of columns	26
_______________________
Column type frequency:
character	9
numeric	17
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
organization	0	1.00	8	70	522
category	0	1.00	6	23	7
form	0	1.00	3	3	1
address	0	1.00	6	35	499
city	0	1.00	5	20	170
website	39	0.93	3	44	414
ein	0	1.00	9	9	524
mission	195	0.67	9	1000	352
principal_officer	54	0.91	6	26	460

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
xml_num	0	1.00	270.35	168.92	3	132.00	250.0	390.0	679	<U+2587><U+2587><U+2586><U+2585><U+2582>
tax_year	0	1.00	2018.45	0.50	2018	2018.00	2018.0	2019.0	2019	<U+2587><U+2581><U+2581><U+2581><U+2586>
start_yr	24	0.96	1985.44	32.12	1819	1975.00	1995.0	2005.0	2019	<U+2581><U+2581><U+2581><U+2582><U+2587>
employee_count	312	0.48	164.26	587.27	10	18.00	30.0	92.0	8686	<U+2587><U+2581><U+2581><U+2581><U+2581>
volunteer_count	284	0.52	302.11	1303.67	10	25.00	50.0	117.0	14000	<U+2587><U+2581><U+2581><U+2581><U+2581>
PY_total_revenue	9	0.98	8927343.18	63662126.87	-16281	277462.25	592029.5	1717746.2	1433600411	<U+2587><U+2581><U+2581><U+2581><U+2581>
PY_total_expenses	9	0.98	8278029.14	60609912.23	-1005823	238535.75	550441.0	1421340.8	1360613119	<U+2587><U+2581><U+2581><U+2581><U+2581>
PY_net	13	0.98	653761.39	4686280.87	-7299122	-19803.75	22864.5	137345.2	72987292	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_revenue	0	1.00	8980391.21	67253585.89	-33419	272757.00	603793.0	1825858.0	1529529583	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_expenses	9	0.98	8278029.14	60609912.23	-1005823	238535.75	550441.0	1421340.8	1360613119	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_net	3	0.99	279235.25	3158780.06	-25315580	-35786.25	12086.0	111772.0	62328659	<U+2581><U+2587><U+2581><U+2581><U+2581>
CY_total_EOY_assets	8	0.99	13331113.04	83460700.94	-36495	299968.00	1025034.0	4225068.0	1695000215	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_EOY_liabilities	94	0.84	5643250.12	44897442.59	39	20406.00	120996.0	690764.0	723839777	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_fund_balance	134	0.78	5966779.33	49711368.60	-3444185	176380.00	688895.0	2253933.5	1033709335	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_total_salaries	80	0.87	5650006.93	40412075.85	760	143793.00	324377.0	973829.0	832147624	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_grants_contributions	39	0.93	1036136.54	2952325.40	17	94915.75	272284.5	746145.0	40996621	<U+2587><U+2581><U+2581><U+2581><U+2581>
CY_program_service_rev	130	0.78	9805485.16	73336731.06	25	64017.00	262558.0	1024773.0	1477791395	<U+2587><U+2581><U+2581><U+2581><U+2581>

# glimpse dataset

glimpse(vtnps)

## Rows: 597
## Columns: 26
## $ xml_num                  <dbl> 9, 25, 37, 38, 49, 57, 60, 62, 63, 72, 76, 90~
## $ organization             <chr> "Clean Energy States Alliance Inc", "GUANACAS~
## $ category                 <chr> "environment_animals", "environment_animals",~
## $ tax_year                 <dbl> 2019, 2018, 2019, 2018, 2019, 2018, 2018, 201~
## $ form                     <chr> "990", "990", "990", "990", "990", "990", "99~
## $ address                  <chr> "50 State Street", "4780 MAIN ROAD", "PO Box ~
## $ city                     <chr> "Montpelier", "HUNTINGTON", "Charlotte", "Bro~
## $ website                  <chr> "www.cesa.org", "www.gdfcf.org", "www.lewiscr~
## $ ein                      <chr> "270029803", "943280315", "030349148", "03600~
## $ start_yr                 <dbl> 2002, 1997, 1995, 1915, 1992, 2000, 2016, 197~
## $ mission                  <chr> "Technical support to States' clean energy ag~
## $ principal_officer        <chr> "Anthony Vargo", NA, "Martha Illick", "SUSAN ~
## $ employee_count           <dbl> NA, NA, NA, 18, 31, NA, NA, 47, 19, 21, 16, N~
## $ volunteer_count          <dbl> 14, 15, 74, NA, NA, NA, NA, 180, 16, 83, NA, ~
## $ PY_total_revenue         <dbl> 1565197, 6007561, 156205, 458419, 7980411, 11~
## $ PY_total_expenses        <dbl> 1395391, 2166001, 128464, 488587, 7726366, 71~
## $ PY_net                   <dbl> 169806, 3841560, 27741, -30168, 254045, 38863~
## $ CY_total_revenue         <dbl> 1541453, 4011292, 226889, 393201, 7861211, 18~
## $ CY_total_expenses        <dbl> 1395391, 2166001, 128464, 488587, 7726366, 71~
## $ CY_net                   <dbl> -17351, 1794998, 33768, -144257, -748228, 136~
## $ CY_total_EOY_assets      <dbl> 1094126, 17111063, 95950, 3401790, 6736910, 2~
## $ CY_total_EOY_liabilities <dbl> 269797, 181243, NA, 7082, 1725796, NA, 22502,~
## $ CY_fund_balance          <dbl> 824329, 16929820, NA, 3394708, 5011114, 22175~
## $ CY_total_salaries        <dbl> 838077, 760634, 44213, 238038, 4465965, NA, 1~
## $ CY_grants_contributions  <dbl> 1251720, 3534249, 224829, 224514, 6888145, 18~
## $ CY_program_service_rev   <dbl> 284379, 271040, 1556, 69841, 871478, NA, 4881~

# plot history of start year with annual count -- w/o category

vtnps %>%
  ggplot(aes(x=start_yr)) +
  geom_histogram(alpha=0.6, binwidth = 1, fill="dark blue") +
  labs( x='\nStart Year', y = 'Total Count', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")+
  theme_minimal()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ 
  theme(axis.text.x = element_text(angle = 45, hjust = 0, vjust = 1, size=10))

#profile organization start-yr and number by category

vtnps%>%
  filter(start_yr>1850)%>%
  count(start_yr,category) %>%
  ggplot()+geom_col(aes(x=start_yr,y=n,fill=category))+
  labs( x='\nStart Year', y = 'Total Count', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")

#density of start over time

ggplot(data=vtnps, aes(x=start_yr, group=category, fill=category)) +
    geom_density(adjust=1.5, alpha=.3) +
    labs( x='\nStart Year', y = 'Density', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")+
    theme_ipsum()

# annual addition of new nonprofits since 1950

vtnps %>%
  filter(start_yr>1950)%>%
  ggplot(aes(x=start_yr, color=category, fill=category, alpha=.4)) +
  geom_histogram(alpha=0.6, binwidth = 1) +
  scale_fill_viridis(discrete=TRUE, option="D") +
  scale_color_viridis(discrete=TRUE, option="D") +
  labs( x='\nStart Year', y = 'Total Addition\n', title = "Annual Addition of 501c3 Non-Profits in Vermont", subtitle="1950 to Present")+
  theme_minimal()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ 

  facet_wrap(.~category, ncol=2)

#annual addition of new nonprofits since 2000

vtnps %>%
  filter(start_yr>2000)%>%
  ggplot(aes(x=start_yr, color=category, fill=category, alpha=.4)) +
  geom_histogram(alpha=0.6, binwidth = 1) +
  scale_fill_viridis(discrete=TRUE, option="D") +
  scale_color_viridis(discrete=TRUE, option="D") +
  labs( x='\nStart Year', y = 'Total Addition\n', title = "Annual Addition of 501c3 Non-Profits in Vermont", subtitle="2000 to Present")+
  theme_minimal()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ 
  theme(axis.text.x = element_text(angle = 45, hjust = 0, vjust = 1, size=10))

  facet_wrap(~category, ncol=2)

## <ggproto object: Class FacetWrap, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetWrap, Facet, gg>

The following code chunk focuses on employee count data by category

#summary stats for employee count

 vtnps$employee_count%>%summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0    18.0    30.0   164.3    92.0  8686.0     312

# Frequency of employee counts in organization

vtnps%>%
  ggplot(aes(x=employee_count))+
  geom_histogram(color="dark blue", fill="dark blue", alpha=.7)+
  labs( x='\nTotal Employees in Organization', y = 'Frequency\n', title = "Employee Counts for 501c3 Non-Profits in Vermont")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# quantiles for total employee counts

vtnps%>%group_by(category)%>%
  group_by(category)%>%
  mutate(category=fct_reorder(category, employee_count, .fun='median'))%>%
  ggplot( aes(x=reorder(category, employee_count), y=employee_count)) + 
    geom_boxplot(fill="dark blue", alpha=.3, outlier.colour = "red")+
    coord_flip()+
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))

# Top 20 employee counts by organization

vtnps%>%
  top_n(20, employee_count) %>%
  ggplot(aes(x=employee_count, y=reorder(organization, employee_count), fill=category))+
  geom_col()+
  scale_fill_manual(values = c("#C3D7A4","#4E84C4"))+
  labs( x='\nTotal Employees in Organization', y = 'Organization\n', title = "Top 20 Employee Counts", subtitle= "501c3 Non-Profits in Vermont")+
  theme_minimal()

# Limit analysis to organizations below 100 employees
  
  vtnps%>%
  filter(employee_count<100)%>%
  ggplot(aes(x=employee_count)) +
  geom_boxplot()+
  theme_minimal()

#Distribution of employee counts <1000
  
vtnps%>%
  filter(employee_count<1000)%>%
  ggplot(aes(x=employee_count)) + 
  geom_bar(colour="blue", fill="dark blue", alpha=.3)+
  labs( x="Number of Employees", y = "Number of Organizations\n", title = "Distribution of Employee Counts", subtitle="Organizations with 1000 or Fewer Employees")+
  theme_minimal()+
  theme(panel.grid = element_blank())

# quantiles for organizations with fewer than 1000 employees
  
vtnps%>%
    group_by(category)%>%
    filter(employee_count<1000)%>%
    ggplot(aes(x=reorder(category, employee_count), y=employee_count))+
    geom_boxplot(fill="dark blue", alpha=.3, outlier.colour = "red")+
    labs( x="", y = "\n\nNumber of Employees", title = "Total Employees by 501c3 Category", subtitle="Organizations with 1000 or Fewer Employees")+
    coord_flip()+
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))

The following code chunk focuses on salary data by category

# average salary by category --> excludes salaries over 200,000

vtnps%>%
  mutate(avg_salary = CY_total_salaries/employee_count)%>%
  filter(avg_salary<200000)%>%
  group_by(category)%>%
  mutate(category=fct_reorder(category, avg_salary, .fun='median'))%>%
  ggplot( aes(x=reorder(category, avg_salary), y=avg_salary)) + 
  geom_boxplot(fill="dark green", alpha=.3, outlier.colour = "blue")+
   labs( x="", y = "\n\nAverage Salary", title = "Average Salaries by 501c3 Category", subtitle="Salaries less than $200,000\n\n")+
  coord_flip()+
  theme_minimal()+
  theme(axis.title.y = element_blank())+
  theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))

The following code chunk highights selected financial data by category

#Ratio of liabilities assets >1 indicates an excess of debt -- the proportion of a company’s assets that are financed by debt. A ratio greater than 1 shows that a considerable portion of debt is funded by assets. In other words, the company has more liabilities than assets. A high ratio also indicates that a company may be putting itself at risk of default on its loans if interest rates were to rise suddenly.


vtnps%>%
  mutate(liability_ratio = CY_total_EOY_liabilities/CY_total_EOY_assets)%>%
  filter(liability_ratio<50)%>%  #note, one outlier at 400+
  ggplot(aes(x=category, y=liability_ratio))+
  geom_point(color='darkblue')+
  coord_flip()+
  geom_hline(yintercept = 1, color="red")+
   labs( x="501c3 Category", y = "\n\nLiability Ratio", title = "Liability Ratios by 501c3 Category\n", subtitle="A ratio >1 indicates an excess of debt over assets\n\n")+
  theme_minimal()+
  theme(panel.grid = element_blank())+
  theme(axis.title.y = element_blank())

lratio<-vtnps%>%
  mutate(liability_ratio = CY_total_EOY_liabilities/CY_total_EOY_assets)

labove<-length(which(lratio$liability_ratio > 1))
lbelow<-length(which(lratio$liability_ratio <= 1))
(lprop<-round((labove/lbelow)*100, 0))

## [1] 5

glue("The number of organizations with a liability >1 and <=1 is {labove} and {lbelow}, respectively. The latter represents {lprop}% of the total count.")

## The number of organizations with a liability >1 and <=1 is 22 and 480, respectively. The latter represents 5% of the total count.

#Ratio of revenue to expenses between yrs. net operating ratio of income to expenses, which tells you how efficiently the organization is using its money to fund operations

vtnps%>%
  mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
  mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
  mutate(net_annual_diff=current_net-prior_net)%>%
  group_by(category)%>%
  filter(current_net<100)%>%
  ggplot(aes(x=category, net_annual_diff))+
  geom_point()+
  geom_hline(yintercept = 0, color="red")+
  coord_flip()+
  labs(x="", y= "Net Difference in Revenue:Expenses", title=("\n\nDifference in end-of-year revenue:expense"), subtitle=('Last two reporting years\n\n'))+
  theme_minimal()+
  theme(panel.grid = element_blank())+
  theme(axis.title.y = element_blank())

#further comparison of revenue to expense

adiff<-vtnps%>%
  mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
  mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
  mutate(net_annual_diff=current_net-prior_net)

(adiff_a<-length(which(adiff$net_annual_diff > 0)))

## [1] 339

(adiff_b<-length(which(adiff$net_annual_diff <= 0)))

## [1] 249

(adiff_r<-round((adiff_b/adiff_a)*100, 0))

## [1] 73

a_diff<-vtnps%>%
  mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
  mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
  mutate(net_annual_diff=current_net-prior_net)

labove<-length(which(lratio$liability_ratio > 1))
lbelow<-length(which(lratio$liability_ratio <= 1))
(lprop<-round((labove/lbelow)*100, 0))

## [1] 5

#per capita net revenue by organization category

vtnps%>%
  mutate(net=CY_total_revenue-CY_total_expenses)%>%
  mutate(rev_person=net/employee_count)%>%
  group_by(category)%>%
  filter(net<500000)%>%
  ggplot(aes(x=reorder(category, rev_person), y=rev_person))+
  geom_boxplot()+
  geom_hline(yintercept = 0, color="red")+
  coord_flip()+
  labs(x="Per Capita Net", y= "", title=("Current Year Per Capita Net Revenue\n"), subtitle=('Organizations with Net Revenue < $500,000\n\n'))+
  theme_minimal()+
  theme(panel.grid = element_blank())+
  theme(axis.title.y = element_blank())

## Warning: Removed 282 rows containing non-finite values (stat_boxplot).

# preliminary comparison of NY revenue and employee metrics

vtnps%>%
  filter(CY_total_salaries<200000000)%>%
  ggplot(aes(x=employee_count, y=CY_total_revenue))+
  geom_point()+
  theme_minimal()

## Warning: Removed 232 rows containing missing values (geom_point).

vtnps%>%
  filter(CY_total_salaries<200000000)%>%
  ggplot(aes(x=CY_total_salaries, y=CY_total_revenue))+
  geom_point()+
  theme_minimal()

##Conclusions and Recommendations

This project will continue with addition work focused on filling in missing information as well as constructing a data-table that holds information from multiple 990 forms and multiple years (2013 to present). The latter steps will require further scripting and automation to streamline data processing steps. It may also be advantageous to construct a relational database to facilitate data exploration and analysis.

In addition, additional scripts will be required to effectively extract key financial information from 990 forms that were not included in this initial phase. These features should include the following for 501c3 metrics:

Liquidity - cash on hand that is not restricted
Program expenses as percentage of total expenses.
Sources of unrestricted recurring dollars.
Full-cost coverage.
Fund-raising expenses as percentage of total contributions.
Cash flow from operations.

Other metrics include revenue reliability (across years) and revenue surplus.

Epilogue

The following chunk provides code for accessing 990 related information using Propublica’s API. In this context, select data is available via. query in JSON format.

url <- “https://projects.propublica.org/nonprofits/api/v2/search.json?limit=1000&q=%22vermont%22&ntee%5Bid%5D=7”

res<-httr::GET(url)

restxt<-httr::content(res, as = ‘text’)

j <- fromJSON(restxt)

glimpse(j)

length(str(j$organizations[[1]]))

df_json2 <- fromJSON(json2)

str(j, nest.lev)

ny <- c() for (i in length(j)) { for (x in i) { print(x) } }

Data607_Proj4b_Connin

Sean Connin

Project Overview:

Step 4. Extract the data from individual .txt files by category. Process data for analysis.

Step 5. Initial data exploration and visualization.

Epilogue