Hypothesis

Hotel reviews in the Southern United States are more positive than hotel reviews in the Northern United States.

Introduction

To test my hypothesis, I will look separately at hotel reviews in the Northern U.S. and the Southern U.S. My study will only consist of states in the eastern half of the country. I am using a Kaggle data set of 10,000 reviews of hotels across the United States (https://www.kaggle.com/datafiniti/hotel-reviews). There is debate about which states belong to the Northern United States and which to the Southern United States. According to the U.S. government, as cited on Britannica.com (https://www.britannica.com/place/the-North), the northern states include CT, IL, IN, IA, KS, ME, MA, MI, MN, NE, NH, NJ, NY, ND, OH, PA, RI, SD, VT, and WI. Britannica.com also lists the southern states as AL, AR, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, and WV (https://www.britannica.com/place/the-South-region).

I am going to generate bar charts that show the level of sentiment for the top words in Southern and Northern hotel reviews. My data set contains thousands of hotel reviews from 2002 to 2018, so I will have enough data to test whether there is a difference in the level of sentiment between hotels in the north and south. I suspect that sentiment could be more negative for northern hotels, since northern states typically have a higher cost of living than southern states. People might therefore have higher expectations for hotels in the Northern United States.

Code Description

I am using the Datafiniti_Hotel_Reviews data to see whether hotel reviews in the Southern U.S. are more positive than hotel reviews in the Northern United States. The reviews.text variable, which holds the text of each review, will help me determine this.

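library(readr)
library(tidyverse)  # pipes, dplyr verbs, and ggplot2 used in the code below
library(tidytext)   # unnest_tokens, stop_words, and get_sentiments
Datafiniti_Hotel_Reviews <- read_csv("~/Downloads/hotel-reviews/Datafiniti_Hotel_Reviews.csv")
View(Datafiniti_Hotel_Reviews)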

Code Description

I wrote code below to pull out the 20 most frequent words in the reviews.text variable. I used the anti_join function to get rid of stop words such as “and”, “if”, “or”, and “an”. I used the ggplot function to make a bar chart of the 20 most frequent words in descending order.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() +
  labs(title = "Most Frequent Words in the reviews.text variable",
       x = "20 Most Frequent Words",
       y = "Number of Cases")

Analysis

According to the bar chart, the five most common words in the data set's hotel reviews were “hotel”, “staff”, “stay”, “clean”, and “breakfast”. “Hotel”, the most frequent word, was used over 6,000 times among the data set’s 10,000 observations. This is not surprising, because these are words you would expect to see often in hotel reviews. I made this visualization to get a good idea of the most frequent words overall before looking at the most common sentiment words.

Code Description

I wrote code below using the wordcloud2 function to display the 20 most frequent words in the reviews.text variable as a word cloud. I used the anti_join function to get rid of stop words such as “if”, “and”, “or”, and “an”.

library(wordcloud2)
Datafiniti_Hotel_Reviews %>%
unnest_tokens(word, reviews.text) %>% 
anti_join(stop_words) %>%  
count(word, sort=TRUE) %>%
head(20) %>%
wordcloud2()

Analysis

The word cloud shows the 20 most frequent words, with font size reflecting each word's rank from 1 to 20. “Check” is the 20th most frequent word since it has the smallest font size, and “hotel” is the most frequent since it clearly has the largest.

Code Description

I ran code to find the average number of characters per review for southern hotels. I wanted to see whether the average review length is roughly the same for southern and northern hotels.

Datafiniti_Hotel_Reviews %>%
filter(province %in% c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV")) -> southern
mean(nchar(southern$reviews.text))

## [1] 311.1902

Analysis

For southern hotels, the average review was about 311 characters long.

Code Description

I ran code to find the average number of characters per review for northern hotels.

Datafiniti_Hotel_Reviews %>%
filter(province %in% c("CT", "IL", "IN", "IA", "KS", "ME", "MA","MI", "MN", "NE", "NH", "NJ", "NY", "ND", "OH", "PA", "RI", "SD", "VT", "WI")) ->northern
mean(nchar(northern$reviews.text))

## [1] 320.9687

Analysis

For northern hotels, the average review was about 321 characters long. This indicates that reviews of Southern and Northern hotels are roughly the same length on average, since the means differ by only about 10 characters.
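
As a rough sketch of how the two regional averages could be computed in a single step, the code below labels each review with a region and then averages the review length by region. The `reviews` data frame and its five rows are made up for illustration, and the `region` column is a hypothetical helper; only the `province` and `reviews.text` column names and the two state lists come from the analysis above.

```r
# Made-up rows standing in for the real data set (not taken from the Kaggle file)
reviews <- data.frame(
  province = c("FL", "NY", "TX", "OH", "CA"),
  reviews.text = c("Great stay", "Room was small", "Friendly staff",
                   "Clean and quiet", "Nice pool"),
  stringsAsFactors = FALSE
)

# State lists as used in this report
southern_states <- c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS",
                     "NC", "OK", "SC", "TN", "TX", "VA", "WV")
northern_states <- c("CT", "IL", "IN", "IA", "KS", "ME", "MA", "MI", "MN", "NE",
                     "NH", "NJ", "NY", "ND", "OH", "PA", "RI", "SD", "VT", "WI")

# Label each review; states in neither list (here "CA") get NA
reviews$region <- ifelse(reviews$province %in% southern_states, "South",
                  ifelse(reviews$province %in% northern_states, "North",
                         NA_character_))

# Mean number of characters per review for each region (NA rows are dropped)
region_means <- tapply(nchar(reviews$reviews.text), reviews$region, mean)
region_means
```

Keeping the state lists in two named vectors would also avoid retyping them in every later filter.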

Code Description

I wrote code below to see what the most frequent sentiment words, both positive and negative, are for southern hotel reviews. I used the inner_join function to attach sentiment values and filtered the data to states in the south.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  filter(province %in% c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV")) %>%
  inner_join(get_sentiments("afinn")) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head(25) %>%
  ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() +
  labs(title = "25 Most Frequent Sentiment Words for Southern Hotels",
       x = "25 Most Frequent Sentiment Words",
       y = "Number of Cases")

Analysis

Based on the bar chart, the most frequent negative word is “bad”, which was used about 400 times. For Southern hotel reviews, positive sentiment words appear to be used more often, since the six most frequent words are all positive.

Code Description

I wrote code below to see what the most frequent sentiment words, both positive and negative, are for northern hotel reviews. I used the inner_join function to attach sentiment values and filtered the data to states in the north.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  filter(province %in% c("CT", "IL", "IN", "IA", "KS", "ME", "MA", "MI", "MN", "NE", "NH", "NJ", "NY", "ND", "OH", "PA", "RI", "SD", "VT", "WI")) %>%
  inner_join(get_sentiments("afinn")) %>%
  anti_join(stop_words) %>%
  count(word, value, sort = TRUE) %>%
  head(25) %>%
  ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() +
  labs(title = "25 Most Frequent Sentiment Words for Northern Hotels",
       x = "25 Most Frequent Sentiment Words",
       y = "Number of Cases")

Analysis

Based on the bar chart, “bad” is also the most frequent negative word for Northern hotel reviews. It is also used about 400 times, so we do not have evidence to suggest that hotel reviews in the north or south are necessarily more negative. Positive sentiment words also make up most of the top sentiment words for Northern hotel reviews, which matches what I saw for Southern hotel reviews.

Code Description

I wrote code below to filter my data set to reviews of hotels in the South. I took out the stop words so that I would not have any unnecessary words in my visualization. I used the inner_join function to attach AFINN sentiment values, which range from -5 to 5. Sorting by desc(value) in the arrange function puts the words with the most positive sentiment first, and head(25) keeps the top 25 of them. A ggplot bar chart then shows how often each of these 25 words appears in reviews of Southern hotels.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  filter(province %in% c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV")) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>%
  arrange(desc(value)) %>%  # most positive sentiment values first
  head(25) %>%
  ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() +
  labs(title = "Top Words for Southern Hotel Reviews that Give Positive Sentiments",
       x = "25 Words with the Highest Positive Sentiment",
       y = "Number of Cases")

Analysis

According to the bar chart, lighter colors indicate more positive sentiment. “Nice”, the most frequent word, has a sentiment value of 3 and is used in the southern hotel reviews over 1,000 times. “Outstanding” has a very high positive sentiment value of 5, but it was used only about fifty times in the southern hotel reviews.

Code Description

I wrote code below to filter my data to hotel reviews for northern states only. I used the anti_join function to eliminate stop words from my results and the inner_join function to attach sentiment values. Sorting by desc(value) in the arrange function puts the most positive words first, and head(25) keeps the top 25. The ggplot function was used to create a bar chart showing how often each of these 25 words appears.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  filter(province %in% c("CT", "IL", "IN", "IA", "KS", "ME", "MA", "MI", "MN", "NE", "NH", "NJ", "NY", "ND", "OH", "PA", "RI", "SD", "VT", "WI")) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>%
  arrange(desc(value)) %>%  # most positive sentiment values first
  head(25) %>%
  ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() +
  labs(title = "Top Words for Northern Hotel Reviews that Give Positive Sentiments",
       x = "25 Words with the Highest Positive Sentiment",
       y = "Number of Cases")

Analysis

According to the bar chart, the most frequent word was “nice”, which has a value of 3 and is also the most frequent positive sentiment word for southern hotel reviews. In northern hotel reviews, more people used the word “wonderful” than “perfect”, whereas in southern hotel reviews more people used “perfect” than “wonderful”. This could mean something, as “wonderful” has a positive sentiment value of 4 whereas “perfect” has a value of 3. Overall, the level of positive sentiment in hotel reviews does not appear to differ much between northern and southern hotels.

Code Description

I wrote code below to filter to the southern hotels. I got rid of all the stop words and attached sentiment values with the inner_join function. The arrange step sorts the words from the most negative value upward, and head(25) keeps the 25 most negative words. A bar graph was generated to show how often each of these 25 words appears.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  filter(province %in% c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV")) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>%
  arrange(value) %>%  # most negative sentiment values first; same order as desc(-value)
  head(25) %>%
  ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() +
  labs(title = "Top Words for Southern Hotel Reviews that Give Negative Sentiments",
       x = "25 Words with the Most Negative Sentiment",
       y = "Number of Cases")

Analysis

According to the bar chart, the most frequent negative word was “bad”, which has a sentiment score of -3. The vast majority of the 25 words shown had sentiment scores of -3. Only a couple of them had a score other than -3.

Code Description

I wrote code below to generate a bar chart of the 25 words in northern hotel reviews with the most negative sentiment values. I filtered the data to just northern states so that I could compare the result to southern states. I used the anti_join function to eliminate stop words, and the arrange step sorts the words from the most negative value upward.

Datafiniti_Hotel_Reviews %>%
  unnest_tokens(word, reviews.text) %>%
  filter(province %in% c("CT", "IL", "IN", "IA", "KS", "ME", "MA", "MI", "MN", "NE", "NH", "NJ", "NY", "ND", "OH", "PA", "RI", "SD", "VT", "WI")) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>%
  arrange(value) %>%  # most negative sentiment values first; same order as desc(-value)
  head(25) %>%
  ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() +
  labs(title = "Top Words for Northern Hotel Reviews that Give Negative Sentiments",
       x = "25 Words with the Most Negative Sentiment",
       y = "Number of Cases")

Analysis

According to the bar chart, the most frequent of the strongly negative words was “bad”. This matches the results for the southern hotel reviews. In both the northern and southern hotel reviews, very few of the words shown have a negative sentiment score other than -3.
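
The charts above compare individual words. As a hedged sketch of a more direct regional comparison, the code below averages the sentiment value over every scored word in each region. Everything here is illustrative: the four reviews are made up, `toy_lexicon` is a hypothetical three-word stand-in for the AFINN lexicon (using the values for “nice”, “wonderful”, and “bad” quoted in this report), and the tokenizer is a simple base R split rather than unnest_tokens.

```r
# Made-up reviews, already labeled by region (not real data)
reviews <- data.frame(
  region = c("South", "South", "North", "North"),
  text = c("Nice room, wonderful staff", "Bad wifi",
           "Nice view", "Bad service but nice lobby"),
  stringsAsFactors = FALSE
)

# Hypothetical three-word lexicon standing in for get_sentiments("afinn")
toy_lexicon <- data.frame(
  word = c("nice", "wonderful", "bad"),
  value = c(3, 4, -3),
  stringsAsFactors = FALSE
)

# One row per (region, word): lowercase the text and split on non-letters
tokens <- do.call(rbind, lapply(seq_len(nrow(reviews)), function(i) {
  words <- strsplit(tolower(reviews$text[i]), "[^a-z]+")[[1]]
  data.frame(region = reviews$region[i], word = words[words != ""],
             stringsAsFactors = FALSE)
}))

# Keep only words found in the lexicon, then average their values by region
scored <- merge(tokens, toy_lexicon, by = "word")
mean_sentiment <- tapply(scored$value, scored$region, mean)
mean_sentiment
```

On the real data, the same idea would amount to grouping the joined AFINN values by region and taking their mean, which gives one number per region to compare.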

Code Description

I wrote code below to create bigrams from my reviews.text variable. I used the count function to count how many times the same two words appear next to each other in a hotel review. I used the separate function to split each bigram into its two words. I filtered out all bigrams with a stop word as word 1 or word 2, so that unnecessary words such as “and”, “if”, and “or” would not appear in the table.

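library(tidyverse)
# Count every pair of adjacent words (bigrams) in the review text
Hotel1 <- Datafiniti_Hotel_Reviews %>%
  unnest_tokens(bigram, reviews.text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# Split each bigram into its two words
Hotel2 <- Hotel1 %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Drop bigrams in which either word is a stop word
Hotel3 <- Hotel2 %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
Hotel3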

## # A tibble: 40,824 x 3
##    word1       word2         n
##    <chr>       <chr>     <int>
##  1 front       desk        942
##  2 friendly    staff       348
##  3 walking     distance    333
##  4 hotel       staff       205
##  5 free        breakfast   199
##  6 customer    service     194
##  7 desk        staff       189
##  8 continental breakfast   178
##  9 highly      recommend   160
## 10 nice        hotel       155
## # … with 40,814 more rows

Analysis

A majority of the bigrams consist of an adjective followed by a noun, which is what I was expecting. I was not surprised to see “nice”, the most frequent word used to express positive sentiment in northern hotel reviews, in one of the top 10 most frequent bigrams. “Friendly”, the first word of the 2nd most frequent bigram, was not among the 25 top positive sentiment words for either region, which surprises me. It could be that “friendly” does not register as a positive sentiment word even though it can be interpreted as one. Another possible explanation is that those charts keep the 25 words with the highest sentiment values, so a frequent word with a lower score would not appear in them.

Conclusion

My hypothesis does not appear to be true, based on comparing the bar charts of the 25 words with the highest positive sentiment values for northern and southern hotels. Based on the bar charts, the most frequent word by far for each region is “nice”. The distributions in the bar charts appear to be roughly the same in terms of the most frequent words at the different levels of sentiment. For words that express negative sentiment, the results for southern and northern hotel reviews did not differ much in terms of sentiment values. They also hardly differed in terms of the most frequent negative or positive words. I think I obtained these results because not everyone who stays at a hotel in the north is from the north, and vice versa. It is therefore hard to determine whether people in the north have higher expectations for northern hotels. People could have the same expectations for hotels regardless of the region of the country the hotel is in. I could extend my research by examining whether weather affects reviews: do southern hotels get more positive reviews in winter because the weather tends to be warmer?

Text Analysis Project 1

Sam Greenberg

3/23/2020
