What research field is leading in “Big Data” research methods?

Institutions with the largest percent of awards that contain at least one of the following words: "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science"
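The code below relies on a keyword indicator column ind that is not defined in this section. A minimal sketch of one plausible construction (the str_detect approach here is an assumption; given the per-keyword columns big.data, analytics, etc. listed further down, ind may actually be derived from those):

library(dplyr)
library(stringr)

# hypothetical reconstruction: flag awards whose abstract mentions any keyword
keywords <- c("big data", "analytics", "machine learning",
              "predictive modeling", "artificial intelligence", "data science")
allawards <- allawards %>%
  mutate(ind = str_detect(str_to_lower(AbstractNarration),
                          str_c(keywords, collapse = "|")))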

allawards %>%
  filter(!is.na(institution)) %>% 
  group_by(institution, year) %>%
  count(ind) %>%                       # rows per institution-year-ind; grouping is kept
  mutate(percent = round(n / sum(n) * 100, 2)) %>%
  filter(ind == TRUE) %>%
  ungroup() %>%                        # ungroup so slice_max() ranks across all rows
  slice_max(percent, n = 25) %>% 
  ggplot(aes(x = year, y = percent, color = institution)) +
  geom_line(show.legend = FALSE) +
  labs(title = "Institutions with the largest percent of awards that contain at least one of the following words",
       subtitle = "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science")

Institutions with the largest number of awards that contained at least one keyword per year:

allawards %>%
  filter(!is.na(institution)) %>% 
  group_by(institution, year) %>%
  count(ind) %>%
  filter(ind == TRUE) %>%
  group_by(year) %>%                   # rank institutions within each year
  slice_max(n, n = 10) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = n, fill = institution)) +
  geom_col(position = "dodge") +
  labs(title = "Top institutions by number of awards that contained at least one of the following keywords",
       subtitle = "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science")

I combined the Carnegie classification list with the awards data. The new data set contains R1/R2 research status, state location, HBCU and HSI flags, and other classifications that may be useful later on.

##  [1] "year"                             "Dir"                             
##  [3] "Div"                              "big.data"                        
##  [5] "analytics"                        "machine.learning"                
##  [7] "predictive.modeling"              "artificial.intelligence"         
##  [9] "data.science"                     "ind"                             
## [11] "nkeys"                            "AbstractNarration"               
## [13] "AwardTitle"                       "AwardAmount"                     
## [15] "Performance_Institution"          "AwardInstrument"                 
## [17] "Organization"                     "ProgramOfficer"                  
## [19] "AbstractNarration.1"              "FUND_AGCY_CODE"                  
## [21] "AwardID"                          "Investigator"                    
## [23] "Institution"                      "ProgramElement"                  
## [25] "ProgramReference"                 "FUND_OBLG"                       
## [27] "email"                            "label"                           
## [29] "institution"                      "unitid"                          
## [31] "city"                             "state"                           
## [33] "level"                            "control"                         
## [35] "Undergraduate Program"            "Graduate Program"                
## [37] "Enrollment Profile"               "Undergraduate Profile"           
## [39] "Size & Setting"                   "Basic"                           
## [41] "Community Engagement"             "3939 results for  all categories"
## [43] "R_stat"                           "pop_serve"                       
## [45] "hbcu"                             "hsi"
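The merge step itself is not shown; a minimal sketch, assuming a carnegie data frame and a shared institution key (both the object name and the join key are assumptions; the actual merge may key on unitid instead):

library(dplyr)

# hypothetical join of Carnegie classifications onto the awards data
allawards <- allawards %>%
  left_join(carnegie, by = "institution")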

Explanation of visuals

Rasmussen (2021) inspired the analysis and visualizations in this project. One of the methods Rasmussen uses is term counting: "This method simply counts the number of documents containing some form of a given term. For example, for "inclusion" terms we look for documents containing any of the following forms: "inclusive," "inclusivity," and "inclusion." Even if a document contains multiple instances of different forms of a given term, it will only be counted once. The percentage of documents containing a particular term over the total number of documents in each NSF directorate in a given year is reported."
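A minimal sketch of that term-counting idea in R, using the "inclusion" forms from the quote (the column names follow the awards data above; this is my reading of the method, not Rasmussen's code):

library(dplyr)
library(stringr)

# each document counts at most once per term, no matter how many times the
# term appears; report the percent of documents per directorate and year
inclusion_forms <- "inclusive|inclusivity|inclusion"
allawards %>%
  mutate(has_term = str_detect(str_to_lower(AbstractNarration), inclusion_forms)) %>%
  group_by(Dir, year) %>%
  summarize(percent = round(mean(has_term, na.rm = TRUE) * 100, 2), .groups = "drop")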

Stemming is another method used in the author's analysis. I did not apply it in this exploration. Stemming is available in the NLTK package for Python; I will look into how to apply this technique to my project in R.
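In R, a comparable stemmer is available in the SnowballC package; a quick sketch (not applied anywhere in this analysis):

library(SnowballC)

# Porter-style stemming collapses inflected forms toward a common stem
wordStem(c("inclusive", "inclusivity", "inclusion"), language = "english")
# all three reduce to a shared stem such as "inclus"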

The following visuals are my first attempt at employing tidytext functions to surface terms with high TF-IDF using bind_tf_idf(), with Silge & Robinson (2017) as my reference. TF-IDF is a heuristic: information theory experts consider its theoretical foundations somewhat soft, but it works well in practice as a measure of how important a word is to a document within a collection (Silge & Robinson, 2017).
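Concretely, bind_tf_idf() multiplies a term's frequency within a document (tf) by idf = ln(number of documents / number of documents containing the term). A tiny numeric illustration of the idf part:

# idf boosts rare terms and zeroes out ubiquitous ones
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)
idf(100, 1)    # appears in 1 of 100 documents: ~4.61
idf(100, 100)  # appears in every document: 0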

library(tidytext)
library(stringr)

# yr21: all abstract words in 2021 awards, before removing stop words
awards_21 <- allawards %>% 
  filter(year == 2021)

# tokenize abstract words from keyword-flagged awards in 2021
abst_words_21 <- awards_21 %>% 
    filter(ind == TRUE) %>% 
    unnest_tokens(output = word, input = AbstractNarration) %>% 
    # I remove stop words here, although the example code in
    # Silge & Robinson (2017) keeps them at this stage
    anti_join(stop_words, by = "word") %>% 
    filter(str_detect(word, "^[[:alpha:]]")) %>%   # keep tokens that start with a letter
    count(AwardID, Dir, word, sort = TRUE)

# total word count per award in 2021
total_words21 <- abst_words_21 %>% 
    group_by(AwardID) %>% 
    summarize(total = sum(n))

# join the totals back on, so each row has the word's count and the award's total
abst_words_21 <- left_join(abst_words_21, total_words21, by = "AwardID")

ggplot(abst_words_21, aes(n / total, fill = Dir)) +
  geom_histogram(show.legend = FALSE) +
  xlim(0, 0.1) +
  facet_wrap(~Dir, ncol = 3, scales = "free_y") +
  labs(title = "Term frequency distribution of 2021 awards that contained at least one of the following words",
       subtitle = "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science")

Terms with high TF-IDF among 2021 awards with at least one keyword. I used only year 2021 as a trial before running the full data set.
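The tables below come from a step whose code is omitted; a minimal sketch of it, assuming the abst_words_21 counts from the chunk above (the object name award21_tf_idf matches its use in the plot code further down):

library(tidytext)

# add tf, idf, and tf-idf for each word within each award
award21_tf_idf <- abst_words_21 %>%
  bind_tf_idf(word, AwardID, n)

# inspect the highest-scoring terms
award21_tf_idf %>%
  arrange(desc(tf_idf))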

## # A tibble: 59,386 × 7
##    AwardID Dir   word                 n      tf   idf  tf_idf
##    <chr>   <chr> <chr>            <int>   <dbl> <dbl>   <dbl>
##  1 2150816 Other heterointerface      9 0.00522  2.20 0.0115 
##  2 2153784 GEO   plateaus            13 0.00355  2.20 0.00779
##  3 2150816 Other heterointerfaces     5 0.00290  2.20 0.00637
##  4 2102143 GEO   hadean               9 0.00245  2.20 0.00539
##  5 2114942 BIO   pollen               9 0.00232  2.20 0.00510
##  6 2154238 Other asi                  4 0.00232  2.20 0.00510
##  7 2120943 GEO   permafrost          11 0.00300  1.50 0.00451
##  8 2132013 Other aurora               7 0.00406  1.10 0.00446
##  9 2121063 BIO   pepbinding           7 0.00181  2.20 0.00397
## 10 2106996 Other leeds                3 0.00174  2.20 0.00382
## # … with 59,376 more rows
## # A tibble: 10 × 8
##    AwardID Dir   word         n total       tf   idf   tf_idf
##    <chr>   <chr> <chr>    <int> <int>    <dbl> <dbl>    <dbl>
##  1 2124224 CISE  lt          40   118 0.00118  -5.22 -0.00616
##  2 2130263 CISE  ai          29   105 0.000856 -3.66 -0.00314
##  3 2117860 SBE   ai          26    95 0.00617  -3.66 -0.0226 
##  4 2148680 EDU   ai          26   105 0.00263  -3.66 -0.00964
##  5 2129072 SBE   lt          24    70 0.00570  -5.22 -0.0297 
##  6 2123818 EDU   language    23    90 0.00233  -2.72 -0.00634
##  7 2124052 EDU   language    23    90 0.00233  -2.72 -0.00634
##  8 2116109 EDU   ai          22   103 0.00223  -3.66 -0.00815
##  9 2103936 CISE  ai          21    88 0.000620 -3.66 -0.00227
## 10 2104105 CISE  ai          21    88 0.000620 -3.66 -0.00227
library(forcats)

award21_tf_idf %>%
  group_by(Dir) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = Dir)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Dir, ncol = 3, scales = "free") +
  labs(x = "tf-idf", y = NULL, 
title = "2021 abstracts with high TF-IDF") 

Next steps for continuing research over the summer

- Continue to clean up the institution data.
- Build stronger institution visuals that compare institutions on their "Big Data" awards.
- Create word network visualizations inspired by Li and Jiang (2021).
- Create n-grams and word associations (see the sketch after this list).
- Possibly apply topic modeling.
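The bigram counts below were produced by code not shown in this section; a minimal sketch of the first count, assuming the same allawards abstracts (the second, shorter table applies further filtering whose exact rule I have not reproduced):

library(tidytext)

# split abstracts into two-word sequences and count them; the top pairs
# ("br gt", "lt br") look like HTML line-break residue in the raw abstracts
allawards %>%
  unnest_tokens(bigram, AbstractNarration, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)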

## # A tibble: 770,427 × 2
##    bigram           n
##    <chr>        <int>
##  1 br gt        39800
##  2 lt br        39800
##  3 of the       22076
##  4 gt lt        19464
##  5 in the       15716
##  6 this project 14101
##  7 will be      13499
##  8 gt this      11766
##  9 this award   10381
## 10 project will 10237
## # … with 770,417 more rows
## # A tibble: 577,867 × 2
##    bigram                 n
##    <chr>              <int>
##  1 br gt              39800
##  2 lt br              39800
##  3 gt lt              19464
##  4 gt this            11766
##  5 project will       10237
##  6 broader impacts     9403
##  7 intellectual merit  8980
##  8 support through     8864
##  9 merit and           8857
## 10 evaluation using    8856
## # … with 577,857 more rows
## # A tibble: 9 × 2
##   Dir       n
##   <chr> <int>
## 1 BIO      38
## 2 CISE    316
## 3 EDU     196
## 4 ENG      68
## 5 GEO      20
## 6 MPS      91
## 7 Other     8
## 8 SBE      15
## 9 TIP       9