OVERVIEW

Goal: Track national and regional support for data science in the United States by examining periodicals from around the country, measuring relative positive or negative sentiment and word frequencies.

What patterns do you see, and why do you believe this to be the case? What additional information might you want? Be as specific as possible, but keep in mind this is an initial exploratory effort…more analysis might be needed…but the result can and should advise the next steps you present to the firm.

Methodology

  1. Broke the US up into 6 major regions
  2. Chose a newspaper from each region (in the Rockies, chose 2 because of limited data)
  3. Ran afinn and bing sentiment analyses on each corpus
  4. Visualized results
  5. Ran word frequency analysis
  6. Drew conclusions in context

Code Implementation

Example with South East Region (used articles from Richmond)

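library(striprtf)  # read_rtf()
library(tibble)    # tibble()
library(dplyr)     # %>%, anti_join(), inner_join(), count()
library(tidytext)  # unnest_tokens(), stop_words, get_sentiments()

va_text <- read_rtf('va_articles.RTF') # read the articles as a character vector

va_text <- tibble(va_text = as.character(va_text)) # convert to a one-column tibble
va_text <- va_text %>%
  unnest_tokens(word, va_text) %>%       # tokenize into one word per row
  anti_join(stop_words, by = "word") %>% # drop common stop words
  count(word, sort = TRUE)               # count occurrences of each word

View(va_text)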

get_sentiments('afinn') # preview the afinn lexicon (words scored from -5 to 5)
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments('bing') # preview the bing lexicon (words labeled positive or negative)
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
va_text_afinn <- va_text %>%
  inner_join(get_sentiments("afinn"), by = "word") # inner join keeps only words found in afinn and adds their score
View(va_text_afinn)

va_text_bing <- va_text %>%
  inner_join(get_sentiments("bing"), by = "word") # inner join keeps only words found in bing and adds their sentiment label

View(va_text_bing)
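
The tallies printed next appear without the code that produced them. Each printed share equals the larger count divided by the total (for example, 171 / (117 + 171) = 0.59375). The following is a minimal sketch of one way to get such a tally, assuming a joined table with `word`, `n`, and `sentiment` columns like the bing join above. The toy data and the helper name `majority_share` are hypothetical and are not the authors' code.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the bing-joined table: one row per distinct word, with its
# corpus count and bing label. Words and counts are made up for illustration.
toy_bing <- tibble(
  word      = c("growth", "innovative", "risk", "failure", "benefit"),
  n         = c(4L, 2L, 3L, 1L, 5L),
  sentiment = c("positive", "positive", "negative", "negative", "positive")
)

# Tally distinct words per sentiment (each word counts once).
tally <- table(toy_bing$sentiment)
print(tally)

# Hypothetical helper: share held by the larger of the two classes.
majority_share <- function(tab) max(tab) / sum(tab)
print(majority_share(tally))

# Alternative view: weight each word by how often it occurs in the corpus.
weighted <- toy_bing %>%
  count(sentiment, wt = n, name = "occurrences")
print(weighted)
```

Because the word counts are computed before the join, a plain `table()` counts each distinct word once. Weighting by `n` instead counts every occurrence, and the two views can disagree when a few sentiment words dominate a corpus.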

## 
## negative positive 
##      117      171
## [1] 0.59375

## 
## negative positive 
##      294      236
## [1] 0.554717

## 
## negative positive 
##      132      186
## [1] 0.5849057

## 
## negative positive 
##      254      239
## [1] 0.515213

## 
## negative positive 
##      120      136
## [1] 0.53125

## 
## negative positive 
##      256      201
## [1] 0.5601751
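
The regional write-ups refer to afinn histograms and their largest bins. The following sketch shows how such bins could be tabulated and drawn, assuming an afinn-joined table with `word`, `n`, and `value` columns. The toy words and scores are illustrative only (not looked up from the lexicon), and `modal_score` is a hypothetical name.

```r
library(tibble)

# Toy stand-in for the afinn-joined table; scores are illustrative, not looked up.
toy_afinn <- tibble(
  word  = c("growth", "opportunity", "support", "problem", "bad", "good", "win"),
  n     = c(3L, 1L, 2L, 4L, 2L, 1L, 2L),
  value = c(2, 2, 2, -2, -3, 3, 4)
)

# Count distinct words at each score, keeping the full -5..5 scale so that
# empty bins appear as zeros.
bins <- table(factor(toy_afinn$value, levels = -5:5))
print(bins)

# Score of the largest bin (the first one wins if there is a tie).
modal_score <- as.integer(names(bins)[which.max(bins)])
print(modal_score)

# Draw the distribution as a bar chart.
barplot(bins,
        xlab = "afinn score",
        ylab = "distinct words",
        main = "afinn score distribution (toy data)")
```

Fixing the factor levels to -5:5 keeps the x-axis comparable across regions, which matters when the largest bin of one region is read against another's.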

Conclusions

South East Analysis:

Looking at both the bing and afinn sentiment analyses, the South East tends to have a slightly positive outlook on data science. The bing analysis was 59% positive, and the largest bin in the afinn histogram was at 2 on the -5 to 5 scale. The positive skew may be due to the development of the Data Science School at UVA (the topic of a couple of the articles analyzed), but it is tempered by a lack of knowledge about the field and limited data science job opportunities in the region. That gap could narrow with data science job growth in the region's major industries, such as agriculture, fossil fuels, and manufacturing.

Rocky Mountain Analysis:

Looking at both the bing and afinn sentiment analyses, the Rocky Mountain region tends to have a slightly negative outlook on data science. The bing analysis was 56% negative, and the largest bin in the afinn histogram was at -2 on the -5 to 5 scale. This may be due to limited data science job opportunities outside of major cities (Denver, Salt Lake,…) and a lack of knowledge about the field. Since the sentiment is close to neutral, there is an opportunity to build support for data science in the region through increased publicity.

North East Analysis:

Looking at both the bing and afinn sentiment analyses, the North East tends to have a slightly positive outlook on data science. The bing analysis showed that roughly 53% of the sentiment-bearing words in data science articles were positive. The afinn histogram likewise showed a slightly left-skewed distribution, with more points on the positive side.

This skew could be attributed to the number of big companies in New York. Among the more frequent words in the New York articles, “facebook” and “google” appeared often, and “company” and “companies” were also among the top words, suggesting a focus on industry. These positive associations may be partly offset by the current environment, specifically the coronavirus, as reflected in the prevalence of words such as “covid” and “pandemic”.

Pacific Analysis:

Looking at both the bing and afinn sentiment analyses, the Pacific region tends to have a slightly negative outlook on data science. The bing analysis suggested that 56% of the sentiment-bearing words in data science articles were negative. In the afinn histogram, the two largest bins were on the positive side, but there was also a moderate number of points in the three bins on the negative side.

This slightly negative skew could be attributed to the heavy focus on climate change research in Oregon, as seen in the prevalence of words such as “researchers”, “research”, “climate”, “weather”, and “plastic”.

South West Analysis:

After creating tables for both the “bing” and “afinn” sentiment analyses, we can see that the words in data science articles from Oklahoma are generally more positive than negative. The positive sentiment also spans a larger range: in the “afinn” analysis, positive values go up to 5 while negative values only go down to -3.

From the chart above, we can see that for Oklahoma, the words “oklahoma”, “oklahoman”, and “thunder” are the three most important words in the corpus of Oklahoma articles. This makes sense, as articles from Oklahoma may revolve around the state and common things like thunder.

Midwest Analysis:

For Illinois, the three most important words are “illinois”, “editionnc”, and “mwrd”. While “illinois” makes sense, more analysis and possibly more cleaning are needed, as “editionnc” and “mwrd” aren’t actual words but show up in the articles many times.
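
One way to do the extra cleaning mentioned here is to extend the stop-word step with a custom list. The following is a minimal sketch, assuming word counts shaped like the tokenized table from the code section. The counts are made up, and `il_counts` and `custom_stops` are hypothetical names.

```r
library(tibble)
library(dplyr)

# Toy word counts standing in for the Illinois corpus; the counts are made up.
il_counts <- tibble(
  word = c("illinois", "editionnc", "mwrd", "data", "science"),
  n    = c(40L, 25L, 18L, 12L, 9L)
)

# Hypothetical custom stop list holding the artifacts named in the text.
custom_stops <- tibble(word = c("editionnc", "mwrd"))

# Drop every row whose word appears in the custom list.
il_clean <- il_counts %>%
  anti_join(custom_stops, by = "word")
print(il_clean)
```

A list like this is worth reviewing by hand before use: a token such as “mwrd” may be an organization's acronym rather than export noise, in which case it says something about the articles selected and not just about the cleaning.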

Overall

In conclusion, the overall sentiment toward the field of data science in the US is fairly neutral. Based on the regional results above, the South East, North East, and South West skew slightly positive, while the Rocky Mountains and Pacific skew slightly negative. The Midwest, whose share is not broken out above, would by elimination correspond to the one remaining bing tally (254 negative, 239 positive), which is slightly negative as well. Overall these sentiments are rather mild and will likely change as the field develops and the public gains knowledge of the subject.

Going forward, more data could be collected for each region to give a better idea of regional sentiment. Data from one local newspaper per region does not give the full picture. There are likely differing opinions between the rural and urban areas of each region that this analysis did not capture. Also, for the tf-idf analysis to be helpful, more data cleaning is necessary: the top-ranked words per corpus were not related to data science at all.
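
To make the tf-idf point concrete, here is a small sketch that computes tf-idf by hand with dplyr, treating each region as one document. The regions, words, and counts are made up. The formula assumed here is the usual one: term frequency times the natural log of (number of documents / number of documents containing the word). tidytext's `bind_tf_idf()` is understood to compute the same quantity.

```r
library(tibble)
library(dplyr)

# Toy per-region word counts (made up). Each region is one document.
counts <- tibble(
  region = c("va", "va", "va", "ok", "ok", "ok"),
  word   = c("data", "uva", "science", "data", "thunder", "science"),
  n      = c(10L, 5L, 5L, 8L, 8L, 4L)
)

n_docs <- n_distinct(counts$region)

tf_idf <- counts %>%
  group_by(region) %>%
  mutate(tf = n / sum(n)) %>%                         # share of the region's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(region))) %>%  # 0 when every region uses the word
  ungroup() %>%
  mutate(tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))
print(tf_idf)
```

In this toy case “data” and “science” score zero because both regions use them, while region-specific names rise to the top. That may be part of why the top-ranked words were unrelated to data science: tf-idf rewards what distinguishes one corpus from the others, so it surfaces place names and export artifacts unless those are cleaned out first.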