Institutions with the largest percentage of awards that contain at least one of the following words: “big data, analytics, machine learning, predictive modeling, artificial intelligence, data science”
allawards %>%
  filter(!is.na(institution)) %>%
  group_by(institution, year) %>%
  count(ind) %>%                                   # TRUE/FALSE counts per institution-year
  mutate(percent = round(n / sum(n) * 100, 2)) %>% # share of awards with a keyword
  filter(ind == TRUE) %>%
  ungroup() %>%                                    # ungroup so top_n() ranks across all rows
  top_n(25, percent) %>%
  ggplot(aes(x = year, y = percent, color = institution)) +
  geom_line(show.legend = FALSE) +
  labs(title = "Institutions with the largest percentage of awards that contain at least one of the following words",
       subtitle = "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science")
# Institutions with the largest number of awards that contained at least one keyword per year
allawards %>%
  filter(!is.na(institution)) %>%
  group_by(institution, year) %>%
  count(ind) %>%
  mutate(percent = round(n / sum(n) * 100, 2)) %>%
  filter(ind == TRUE) %>%
  ungroup() %>%                 # ungroup so top_n() ranks across all rows
  top_n(10, n) %>%
  ggplot(aes(x = year, y = n, fill = institution)) +
  geom_col(position = "dodge") +
  labs(title = "Institutions with the largest number of awards that contained at least one of the following keywords",
       subtitle = "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science")
I combined the Carnegie classification list with the awards data. The resulting data set contains R1/R2 ranking, state location, HBCU and HSI status, and other classifications that may be useful later on; a sketch of the join appears after the column listing below.
## [1] "year" "Dir"
## [3] "Div" "big.data"
## [5] "analytics" "machine.learning"
## [7] "predictive.modeling" "artificial.intelligence"
## [9] "data.science" "ind"
## [11] "nkeys" "AbstractNarration"
## [13] "AwardTitle" "AwardAmount"
## [15] "Performance_Institution" "AwardInstrument"
## [17] "Organization" "ProgramOfficer"
## [19] "AbstractNarration.1" "FUND_AGCY_CODE"
## [21] "AwardID" "Investigator"
## [23] "Institution" "ProgramElement"
## [25] "ProgramReference" "FUND_OBLG"
## [27] "email" "label"
## [29] "institution" "unitid"
## [31] "city" "state"
## [33] "level" "control"
## [35] "Undergraduate Program" "Graduate Program"
## [37] "Enrollment Profile" "Undergraduate Profile"
## [39] "Size & Setting" "Basic"
## [41] "Community Engagement" "3939 results for all categories"
## [43] "R_stat" "pop_serve"
## [45] "hbcu" "hsi"
Rasmussen (2021) inspired the analysis and visualizations of this project. One of the methods Rasmussen uses is term counting: “This method simply counts the number of documents containing some form of a given term. For example, for “inclusion” terms we look for documents containing any of the following forms: “inclusive,” “inclusivity,” and “inclusion.” Even if a document contains multiple instances of different forms of a given term, it will only be counted once. The percentage of documents containing a particular term over the total number of documents in each NSF directorate in a given year is reported.”
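Term counting of this kind can be reproduced with stringr; a minimal sketch, assuming the allawards data frame and its AbstractNarration column (the has_inclusion name is illustrative):

library(stringr)

# Flag each abstract once if it contains any form of "inclusion",
# then report the percentage of flagged documents per directorate and year
allawards %>%
  mutate(has_inclusion = str_detect(str_to_lower(AbstractNarration),
                                    "inclusive|inclusivity|inclusion")) %>%
  group_by(Dir, year) %>%
  summarize(percent = mean(has_inclusion, na.rm = TRUE) * 100,
            .groups = "drop")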
Stemming is another method used in the author’s analysis. I did not apply it in this exploration; stemming is available in the NLTK package for Python, and I will look into how to apply the technique in R.
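For reference, one option for stemming in R is the SnowballC package’s wordStem(); a minimal sketch, not something applied in this analysis:

library(SnowballC)

# Porter stemming collapses different forms of a term to a common stem,
# so "inclusive", "inclusivity", and "inclusion" count as one term
wordStem(c("inclusive", "inclusivity", "inclusion"))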
The following visuals are my first attempt at employing tidytext functions to surface terms with high TF-IDF using bind_tf_idf(). I used Silge & Robinson (2017) as my reference. The method is heuristic: it is a rule-of-thumb way of measuring word importance that information theory experts regard as ad hoc (Silge & Robinson, 2017).
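For reference, the quantities that bind_tf_idf() computes are

$$\mathrm{idf}(t) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing } t}}\right), \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),$$

where $\mathrm{tf}(t, d)$ is the proportion of words in document $d$ (here, an abstract) that are the term $t$ (Silge & Robinson, 2017).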
library(tidytext)

# 2021 awards only
awards_21 <- allawards %>%
  filter(year == 2021)

# Tokenize abstract words from the keyword-containing (ind == TRUE) awards in 2021
abst_words_21 <- awards_21 %>%
  filter(ind == TRUE) %>%
  unnest_tokens(output = word, input = AbstractNarration) %>%
  # I remove stop words, although the example code in Silge & Robinson (2017)
  # does not suggest removing them
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "^[[:alpha:]]")) %>%   # keep tokens that start with a letter
  count(AwardID, Dir, word, sort = TRUE)
# Total word count per award in 2021
total_words21 <- abst_words_21 %>%
  group_by(AwardID) %>%
  summarize(total = sum(n))

# Join the per-word counts with the per-award totals
abst_words_21 <- left_join(abst_words_21, total_words21, by = "AwardID")
ggplot(abst_words_21, aes(n / total, fill = Dir)) +
  geom_histogram(show.legend = FALSE) +
  xlim(0, 0.1) +
  facet_wrap(~Dir, ncol = 3, scales = "free_y") +
  labs(title = "Term frequency distribution of 2021 awards that contained at least one of the following words",
       subtitle = "big data, analytics, machine learning, predictive modeling, artificial intelligence, data science")
Terms with high TF-IDF among 2021 awards with at least one keyword. I used only year 2021 as a trial before running the full data set.
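The chunk that produces this table is not echoed; a minimal sketch using bind_tf_idf() on the abst_words_21 data frame built above (the name award21_tf_idf matches the plotting code further down):

# Add tf, idf, and tf-idf columns, treating each award's abstract as a document
award21_tf_idf <- abst_words_21 %>%
  bind_tf_idf(term = word, document = AwardID, n = n) %>%
  arrange(desc(tf_idf))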
## # A tibble: 59,386 × 7
## AwardID Dir word n tf idf tf_idf
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 2150816 Other heterointerface 9 0.00522 2.20 0.0115
## 2 2153784 GEO plateaus 13 0.00355 2.20 0.00779
## 3 2150816 Other heterointerfaces 5 0.00290 2.20 0.00637
## 4 2102143 GEO hadean 9 0.00245 2.20 0.00539
## 5 2114942 BIO pollen 9 0.00232 2.20 0.00510
## 6 2154238 Other asi 4 0.00232 2.20 0.00510
## 7 2120943 GEO permafrost 11 0.00300 1.50 0.00451
## 8 2132013 Other aurora 7 0.00406 1.10 0.00446
## 9 2121063 BIO pepbinding 7 0.00181 2.20 0.00397
## 10 2106996 Other leeds 3 0.00174 2.20 0.00382
## # … with 59,376 more rows
## # A tibble: 10 × 8
## AwardID Dir word n total tf idf tf_idf
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 2124224 CISE lt 40 118 0.00118 -5.22 -0.00616
## 2 2130263 CISE ai 29 105 0.000856 -3.66 -0.00314
## 3 2117860 SBE ai 26 95 0.00617 -3.66 -0.0226
## 4 2148680 EDU ai 26 105 0.00263 -3.66 -0.00964
## 5 2129072 SBE lt 24 70 0.00570 -5.22 -0.0297
## 6 2123818 EDU language 23 90 0.00233 -2.72 -0.00634
## 7 2124052 EDU language 23 90 0.00233 -2.72 -0.00634
## 8 2116109 EDU ai 22 103 0.00223 -3.66 -0.00815
## 9 2103936 CISE ai 21 88 0.000620 -3.66 -0.00227
## 10 2104105 CISE ai 21 88 0.000620 -3.66 -0.00227
library(forcats)

award21_tf_idf %>%
  group_by(Dir) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = Dir)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Dir, ncol = 3, scales = "free") +
  labs(x = "tf-idf", y = NULL,
       title = "2021 abstracts with high TF-IDF")
Next steps:

- Continue to clean up the institution data.
- Work on stronger institution visuals that compare “Big Data” awards across institutions.
- Create word-network visualizations inspired by Li and Jiang (2021).
- Create n-grams and word associations (a first bigram pass is sketched below).
- Possibly apply topic modeling.
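The bigram tables below come from chunks that are not echoed; a minimal sketch following Silge & Robinson (2017), assuming the same AbstractNarration column (the bigrams object name is illustrative). The “br”, “gt”, and “lt” tokens appear to be remnants of escaped <br/> HTML tags that still need cleaning:

library(tidyr)

# Tokenize abstracts into two-word sequences
bigrams <- allawards %>%
  unnest_tokens(output = bigram, input = AbstractNarration,
                token = "ngrams", n = 2)

# Raw bigram counts (first table below)
bigrams %>%
  count(bigram, sort = TRUE)

# Drop bigrams in which either word is a stop word (second table below)
bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)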
## # A tibble: 770,427 × 2
## bigram n
## <chr> <int>
## 1 br gt 39800
## 2 lt br 39800
## 3 of the 22076
## 4 gt lt 19464
## 5 in the 15716
## 6 this project 14101
## 7 will be 13499
## 8 gt this 11766
## 9 this award 10381
## 10 project will 10237
## # … with 770,417 more rows
## # A tibble: 577,867 × 2
## bigram n
## <chr> <int>
## 1 br gt 39800
## 2 lt br 39800
## 3 gt lt 19464
## 4 gt this 11766
## 5 project will 10237
## 6 broader impacts 9403
## 7 intellectual merit 8980
## 8 support through 8864
## 9 merit and 8857
## 10 evaluation using 8856
## # … with 577,857 more rows
## # A tibble: 9 × 2
## Dir n
## <chr> <int>
## 1 BIO 38
## 2 CISE 316
## 3 EDU 196
## 4 ENG 68
## 5 GEO 20
## 6 MPS 91
## 7 Other 8
## 8 SBE 15
## 9 TIP 9