8871 Main Form
The 8871 main form has all the information that is 1:1 with the company. In this header form we see a lot of basic information you would expect on any company record such as addresses, contact information, purpose, and ID numbers.
Query DB
df <- dbGetQuery(con, 'SELECT * FROM main_8871')
df <- df %>%
mutate(across(c(exempt_state, business_address_state,contact_address_state,custodian_address_state,
mailing_address_state), as.factor)) %>%
mutate(across(c(established_date,material_change_date), ymd))
names(df)## [1] "record_type" "form_type"
## [3] "form_id_number" "initial_report_indicator"
## [5] "amended_report_indicator" "final_report_indicator"
## [7] "ein" "organization_name"
## [9] "mailing_address_1" "mailing_address_2"
## [11] "mailing_address_city" "mailing_address_state"
## [13] "mailing_address_zip_code" "mailing_address_zip_ext"
## [15] "e_mail_address" "established_date"
## [17] "custodian_name" "custodian_address_1"
## [19] "custodian_address_2" "custodian_address_city"
## [21] "custodian_address_state" "custodian_address_zip_code"
## [23] "custodian_address_zip_ext" "contact_person_name"
## [25] "contact_address_1" "contact_address_2"
## [27] "contact_address_city" "contact_address_state"
## [29] "contact_address_zip_code" "contact_address_zip_ext"
## [31] "business_address_1" "business_address_2"
## [33] "business_address_city" "business_address_state"
## [35] "business_address_zip_code" "business_address_zip_ext"
## [37] "exempt_8872_indicator" "exempt_state"
## [39] "exempt_990_indicator" "purpose"
## [41] "material_change_date" "insert_datetime"
## [43] "related_entity_bypass" "eain_bypass"
Identifiers
Each form submission gets a unique form_id.
I think a unique organization can be identified by the EIN number. It’s certainly very close to that - this will be useful in figuring out how to collapse down amendments.
p1 <- df %>%
select(ein) %>%
count(ein) %>%
ggplot(aes(n)) +
geom_histogram(fill = "violet") +
xlim(0,5) +
labs(title="EINs")
p2 <- df %>%
select(organization_name) %>%
count(organization_name) %>%
ggplot(aes(n)) +
geom_histogram(fill = "orange") +
xlim(0,5) +
labs(title="Organization Names")
p3 <- df %>%
mutate(organization_name=tolower(organization_name)) %>%
select(ein,organization_name) %>%
group_by(ein) %>%
summarise(n=n_distinct(organization_name)) %>%
ggplot(aes(x=n)) +
geom_histogram(fill = 'darkgreen') +
labs(title="Organization Names Per EIN")
(p1 + p2) / p3Null Values
Null values are a good place to start. While I can get an idea of what my missing data is like I can generally look at what columns I even have. There isn’t a ton to be concerned about here, but this is what I noticed:
Most of the columns with heavy nulls surround zip extension and address 2 fields.
There are also null values around exemption and change date fields. I would expect that not every form has these.
It is a bit interesting that established date is null, and I would be curious when that value isn’t required.
df %>%
sample_frac(0.2) %>%
naniar::vis_miss()Amendments and Bypasses
Thoughts:
The majority of reports are not final reports. We will need to figure out what needs to be done about amendments. Can we just use the final report? Or is there value in what was changed? Do the final reports have all data, or can the exclude things that were completed on previous versions?
Exemptions and bypasses are much more common than I thought and it may be interesting to look at when these apply.
p1 <- categorical_plot_cnt(df,"initial_report_indicator","Initial Report Indicator")
p2 <- categorical_plot_cnt(df,"amended_report_indicator","Amended Report Indicator")
p3 <- categorical_plot_cnt(df,"final_report_indicator","Final Report Indicator")
p4 <- categorical_plot_cnt(df,"exempt_8872_indicator","Exempt 8872 Indicator")
p5 <- categorical_plot_cnt(df,"related_entity_bypass","Related Entity Bypass")
p6 <- categorical_plot_cnt(df,"eain_bypass","EAIN Bypass")
(p1 + p2 + p3) / (p4 + p5 + p6)Purpose
I don’t see anything that surprises me from a simple word count on the purpose fields, but this is a very simple look so more digging could reveal something more interesting.
library(qdap)
term_count <- freq_terms(df$purpose, 50,stopwords=tm::stopwords("english"))
term_count %>%
arrange(FREQ) %>%
mutate(WORD=factor(WORD,levels=WORD)) %>%
ggplot(aes(x=WORD,y=FREQ,fill=FREQ)) +
geom_bar(stat='identity') +
coord_flip() +
theme_minimal() +
theme(legend.position = "none") +
labs(title="Top 50 Words in Purpose")Entities
Olson Hagel & Fishburn LLP is dominating this list. They are a firm specializing in political law.
I will be doing some more research on exactly what services they provide and what their level of involvement is.
no@email is a very common email address.
p1 <- categorical_plot_cnt(df,"custodian_name","Custodian Name",10)
p2 <- categorical_plot_cnt(df,"contact_person_name","Contact Name",10)
p3 <- categorical_plot_cnt(df,"e_mail_address","Email",20)
(p1 + p2) / p3Location
We see that a lot of the geographic files line up here, which is good. A much deeper analysis on organization locations is warranted, but this is just about getting a quick peek.
p1 <- categorical_plot_cnt(df,"business_address_state","Business Adress State",10)
p2 <- categorical_plot_cnt(df,"business_address_city","Business Adress City",10)
p3 <- categorical_plot_cnt(df,"custodian_address_state","Custodian Adress State",10)
p4 <- categorical_plot_cnt(df,"custodian_address_city","Custodian Adress City",10)
p5 <- categorical_plot_cnt(df,"contact_address_state","Contact Adress State",10)
p6 <- categorical_plot_cnt(df,"contact_address_city","Contact Adress City",10)
p7 <- categorical_plot_cnt(df,"mailing_address_state","Mailing Adress State",10)
p8 <- categorical_plot_cnt(df,"mailing_address_city","Mailing Adress City",10)
(p1 + p2) / (p3 + p4) / (p5 + p6) / (p7 + p8)Correlations
If the report is an amended report, it’s negatively correlated with related entity bypasses, so it may be they can bypass the form because the information was previously submitted. This indicates we can’t simply discard early revisions in favor of the final report.
We do however wee that these bypasses are also correlated with the state listed on the form, which is interesting. This could be related to state regulations, and would be interesting to see if there are higher volumes of entities in states that allow for exemptions and bypasses.
df %>%
select(initial_report_indicator, amended_report_indicator,final_report_indicator,
exempt_8872_indicator,related_entity_bypass,eain_bypass,exempt_state,
business_address_state,contact_address_state,custodian_address_state,
mailing_address_state) %>%
replace_na(list(initial_report_indicator=-1, amended_report_indicator=-1,
final_report_indicator=-1,exempt_8872_indicator=-1,related_entity_bypass=-1,
eain_bypass=-1,exempt_state=-1,business_address_state=-1,
contact_address_state=-1,custodian_address_state=-1,
mailing_address_state=-1)) %>%
mutate(across(everything(), as.numeric)) %>%
cor(use = "pairwise.complete.obs", method = "spearman") %>%
corrplot(type = "upper", diag = FALSE, method = "square")