SBIR Award Analysis using Topic Modelling

Group 3: Sudeshna Sarkar, Animesh Jain, Ashish Poojari, Santosh Neelapala

I. Problem Description

SBIR Awards: The SBIR awards dataset covers the period from 1983 to 2022. The Department of Health and Human Services (HHS) uses grants, cooperative agreements and contracts to support research and development in the Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) programs. The largest percentage of HHS SBIR and STTR awards are made as grants, primarily through the National Institutes of Health (NIH). The NIH is the largest granting organization participating in the SBIR and STTR programs. The NIH budget for Fiscal Year 2020 was $1.1B for the SBIR program and $150M for the STTR program.

1.) Understand the award abstracts and identify the major areas involved.
2.) Analyze and dig deeper into the subdivisions of Health and Human Services.
3.) Examine how technology, education, techniques and conditions have impacted the Department of Health and Human Services over the years.

II. Peer Comment

  1. What are the other qualitative analysis techniques you will be using apart from KeyATM?
    • CTM
    • STM
    • KeyATM Base

  2. Computation and visualization of the data:
    • Coherence and exclusivity within the topics
    • LDAvis

  3. What document variables are you using in your modeling?
    • Award.Year

  4. What are the various keywords you will be using for the Keyword model?
    • Education, Technology, Techniques and Conditions for Health Data

III. Data Analysis

SBIR Awards: The Small Business Innovation Research (SBIR) program is a United States Government program, coordinated by the Small Business Administration, intended to help certain small businesses conduct research and development (R&D). URL: https://www.sbir.gov/

1.) Exploring the Agency column, we found 13 different departments and focused on the Department of Health and Human Services because it has the highest number of awards.
2.) Generated SerialNum from the Contract column for use as docid_field.
3.) Subset the dataset to HHS and saved it as a new .csv file.
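
The three steps above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the project. The column names Agency, Contract, Award.Year and Abstract come from this report; the sample rows are hypothetical, and the way the serial number is derived (a running index over distinct Contract values) is an assumption.

```python
import csv
import io

from collections import Counter

# Tiny in-memory stand-in for the awards file. All rows are hypothetical.
RAW = """Agency,Contract,Award.Year,Abstract
Department of Health and Human Services,C-001,2019,Novel imaging device for tumor detection
Department of Defense,C-002,2018,Lightweight armor materials
Department of Health and Human Services,C-003,2020,Training software for nursing students
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Step 1: count awards per agency and pick the agency with the most awards.
agency_counts = Counter(row["Agency"] for row in rows)
top_agency, top_count = agency_counts.most_common(1)[0]

# Step 2 (assumption): map each distinct Contract value to a running integer
# and store it as the document id.
serial_by_contract = {}
for row in rows:
    contract = row["Contract"]
    if contract not in serial_by_contract:
        serial_by_contract[contract] = len(serial_by_contract) + 1
    row["serialNum"] = serial_by_contract[contract]

# Step 3: keep only the rows of the selected agency and write them out as CSV.
hhs_rows = [row for row in rows if row["Agency"] == top_agency]
out = io.StringIO()  # a real script would open a .csv file here instead
writer = csv.DictWriter(out, fieldnames=list(hhs_rows[0].keys()))
writer.writeheader()
writer.writerows(hhs_rows)
hhs_csv = out.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of file paths; swapping `io.StringIO()` for an opened file gives the saved subset.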

Corpus and Tokens

1.) A corpus is designed to be a more or less static container of texts with respect to processing and analysis.
2.) The corpus uses text_field = “Abstract” and docid_field = “serialNum”.

Stop Words and Document Feature Matrix

We identified the stop words, removed them, and converted the remaining tokens into a document-feature matrix (DFM).
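
To make the idea concrete, here is a small plain-Python sketch of tokenizing abstracts, dropping stop words, and building a document-feature matrix with its top features. It only illustrates the concept and is not the project's code; the abstracts and the short stop-word list are hypothetical.

```python
import re

from collections import Counter

# Illustrative stop-word list only; a real analysis would use a fuller list.
STOP_WORDS = {"a", "the", "for", "of", "and", "to", "in", "is"}

# Hypothetical abstracts keyed by document id.
abstracts = {
    "1": "A novel imaging device for the detection of tumors",
    "3": "Training software for nursing students and the training of staff",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Tokens per document with stop words removed.
tokens = {
    doc_id: [t for t in tokenize(text) if t not in STOP_WORDS]
    for doc_id, text in abstracts.items()
}

# Features are the sorted vocabulary; each DFM row holds per-document counts.
features = sorted({t for toks in tokens.values() for t in toks})
counts = {doc_id: Counter(toks) for doc_id, toks in tokens.items()}
dfm = {doc_id: [counts[doc_id][f] for f in features] for doc_id in tokens}

# Top features: the most frequent terms across the whole matrix.
totals = Counter()
for doc_counts in counts.values():
    totals.update(doc_counts)
top_features = totals.most_common(3)
```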

Top Feature and Perplexity

Top Features for the dataset

Perplexity is lowest at 5 topics, so this setting appears to give the best performance.
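
For reference, perplexity is the exponential of the negative average log-probability a model assigns to held-out tokens, so lower is better. The sketch below shows that calculation in plain Python; the two sets of token probabilities are hypothetical and do not come from the models in this report.

```python
import math


def perplexity(token_probs):
    """Return exp(-mean log p) over the probabilities of held-out tokens."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# Hypothetical probabilities that two models assign to the same four tokens.
model_a = [0.25, 0.25, 0.25, 0.25]
model_b = [0.5, 0.5, 0.25, 0.25]

perplexity_a = perplexity(model_a)
perplexity_b = perplexity(model_b)

# The model with the lower perplexity predicts the held-out tokens better.
better = "model_b" if perplexity_b < perplexity_a else "model_a"
```

A model that spreads probability evenly over four outcomes has perplexity 4, which is why perplexity is often read as an effective number of equally likely choices.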

Modelling

CTM: Correlated Topic Model

1.) The CTM function in the topicmodels package is very slow. Another way to fit CTMs is to use the stm package, which fits a correlated topic model when no covariates are supplied.
2.) Plotting with labeltype set to frex (FREX plot) and prob (Prob plot).
3.) frex: words that are both frequent and exclusive, identifying the words that distinguish topics.
4.) prob: words within each topic with the highest probability.
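
The difference between the two label types can be illustrated with a simplified FREX score: the weighted harmonic mean of a word's within-topic rank by frequency and by exclusivity. The plain-Python sketch below follows that definition but is not the stm package's exact implementation; the topic-word probabilities, the vocabulary, and the weight w = 0.5 are assumptions made for illustration.

```python
# Hypothetical topic-word probabilities: one row per topic, one column per word.
vocab = ["patient", "cell", "student", "device", "protein"]
beta = [
    [0.40, 0.30, 0.02, 0.18, 0.10],  # topic 0
    [0.40, 0.03, 0.35, 0.20, 0.02],  # topic 1
]


def ecdf(values):
    """Empirical CDF of each value: share of entries less than or equal to it."""
    n = len(values)
    return [sum(1 for u in values if u <= v) / n for v in values]


def frex_scores(beta, k, w=0.5):
    """Simplified FREX for topic k; w weights exclusivity against frequency."""
    n_words = len(beta[k])
    col_sums = [sum(row[v] for row in beta) for v in range(n_words)]
    # Exclusivity: the share of a word's total probability held by topic k.
    exclusivity = [beta[k][v] / col_sums[v] for v in range(n_words)]
    excl_rank = ecdf(exclusivity)
    freq_rank = ecdf(beta[k])
    return [1.0 / (w / e + (1.0 - w) / f) for e, f in zip(excl_rank, freq_rank)]


def top_words(scores, vocab, n):
    """Return the n words with the highest scores."""
    order = sorted(range(len(vocab)), key=lambda v: -scores[v])
    return [vocab[v] for v in order[:n]]


prob_labels = top_words(beta[0], vocab, 2)
frex_labels = top_words(frex_scores(beta, 0), vocab, 2)
```

In this toy example "patient" is the most probable word in both topics, so it leads the prob labels, while "cell" leads the FREX labels because it is nearly exclusive to topic 0.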

STM: Structural Topic Model

1.) STM allows us to use document-level variables in topic modeling.
2.) We used the document-level variable Award.Year to fit the model.
3.) For our earlier CTM model, ai_tmod_ctm, we did not pass the prevalence and data arguments to stm.

Comparing CTM and STM

1.) Two metrics are used: semantic coherence and exclusivity.
2.) Semantic coherence measures the consistency of the words used within the topic. Larger values are better and mean the topic is more consistent. Low values sometimes imply the topic may be composed of sub-topics.
3.) Exclusivity measures how distinctive the top words are to that topic. For this, larger or smaller is not necessarily better or worse, but indicates whether the topic is unique (high value) or broad (low value).
4.) Overall, there is not much difference between the two models, although the STM has slightly higher exclusivity.
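
As an illustration of the first metric, semantic coherence can be computed from document co-occurrence of a topic's top words: for each pair of top words, take the log of (number of documents containing both, plus 1) divided by the number of documents containing the higher-ranked word, then sum over pairs. The plain-Python sketch below implements that formula on hypothetical documents and word lists; it is not the project's code and makes no claim about the values reported here.

```python
import math

# Hypothetical documents, each reduced to its set of distinct words.
docs = [
    {"cancer", "tumor", "therapy"},
    {"cancer", "tumor", "cell"},
    {"student", "training", "course"},
    {"student", "course", "cancer"},
]


def doc_freq(*words):
    """Number of documents that contain every one of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))


def semantic_coherence(top_words):
    """Sum of log((D(v_m, v_l) + 1) / D(v_l)) over pairs with l ranked above m."""
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


# Top words are listed from most to least probable within each topic.
coherent_topic = ["cancer", "tumor", "therapy"]
mixed_topic = ["cancer", "course", "therapy"]

coherent_score = semantic_coherence(coherent_topic)
mixed_score = semantic_coherence(mixed_topic)
```

Scores are zero or negative; the topic whose top words appear together more often sits closer to zero, which matches the reading that larger values indicate a more consistent topic.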

Keyword Assisted Topic Model

1.) Prepared keywords for the model using the topic data.
2.) Four keyword topics: Education, Techniques, Technology and Conditions.
3.) KeyATM Base is used to better understand the topics and their correlation.
Key Takeaways

1.) Explored different topic modeling algorithms: STM and CTM.
2.) Gained a better understanding of keyword-assisted topic models. A future direction is to use KeyATM Dynamic.