State Appeal Index

11/22/2020

Introduction

As a result of COVID 19, some may argue that the workforce will likely become increasingly mobile. People now have the option of working from home, and in many cases, out of state. Many people, including myself, are starting to think about uprooting for a change of scenery.

My goal is to create a state appeal ranking by referencing home price statistics compared to travel forum sentiment. A key assumption here is that sentiment of travelers is likely to overlap with sentiment of residents.

Obtain

FHFA

Home Price Index data from the Federal Housing Finance Agency https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index-Datasets.aspx#mpo

fhfa <- read_xls("state_statistics_for_download.xls")

#remove junk headers
fhfa <- fhfa %>%
  set_names(fhfa[3,])%>%
  slice(-(1:3))

TripAdvisor

Posts from the TripAdvisor forum: Found in discussion threads links, organized state links; found at the below path. https://www.TripAdvisor.com/ListForums-g191-i3-United_States.html

trip_url <- "https://www.tripadvisor.com"
trip_usforums_url <-paste0(trip_url,"/ListForums-g191-i3-United_States")
trip_usforums_source <- readLines(trip_usforums_url, encoding = "UTF-8")

Trip Advisor State Links

Gather all links from trip advisor United States forum main page

trip_usforums_all_links <- getHTMLLinks(trip_usforums_source)

Identify links for the state specific forums

#Using "ShowForum" keyword to identify state bucket links
trip_usforums_state_links <- str_subset(trip_usforums_all_links, "ShowForum")

Trip Advisor Discussion Links

Create function for gathering discussion topic links

getlinks <- function(root, ext, keyword) {
  links <- paste0(root,ext)%>%
            readLines()%>%
            getHTMLLinks()%>%
            str_subset(keyword)
  return(links)
}

Loop through the state forums to pull the discussion topic links

topic_keyword <- "ShowTopic"
trip_discussion_links <- data_frame()
for(i in 1:length(trip_usforums_state_links)){
  topic_links <- getlinks(trip_url,trip_usforums_state_links[i],topic_keyword)
  trip_discussion_links <- union_all(trip_discussion_links,data_frame(topic_links,trip_usforums_state_links[i]))
}

#Remove duplicate paths
trip_discussion_links <- filter(trip_discussion_links,row_number() %% 2 ==0)

Sentiment

Loop through the discussion thread links to perform sentiment analysis resulting in a score column

for (i in 1:length(trip_discussion_links$topic_links)) {
  #testing confirmed we can isolate the discussion thread by sub-setting with the pattern u002Fschema.org"
  thread <- str_subset(readLines(paste0(trip_url,trip_discussion_links$topic_links[i][1]), encoding = "UTF-8"),"u002Fschema.org")
  
  #tokenize the dicsussion thread and assign a sentiment score baded on the "afinn" lexicon.
  scoring <- thread %>%
  unlist()%>%
  data.frame()%>%
  rename(text = 1)%>%
  unnest_tokens(word,text)%>%
  anti_join(stop_words)%>%
  inner_join(get_sentiments("afinn"))
  
  #take the average score and save it into our dataframe
  trip_discussion_links$sentiment_score[i] <- mean(scoring$value)

}

Scrub

Now we are ready to clean and prepare the data we acquired.

FHFA Data frame

master_fhfa <- fhfa %>%
  mutate(year = str_sub(`Year-Quarter`,1,4), quarter = str_sub(`Year-Quarter`,-2))%>%
  #filter to the most recent year
  filter(year== max(year))%>%
  rename(abb = State)%>%
  #using the inner join we can filter out any non-state netered metrix
  inner_join(usa::states)%>%
  rename(state = name)%>%
  group_by(state)%>%
  summarise(home_price = round(mean(as.numeric(`Average Price`))))%>%
  mutate(state = str_replace_all(state," ","_"))

Trip Advisor Data frame

master_trip_advisor <- trip_discussion_links%>%
  rename(topic = 1, state = 2)%>%
  mutate(state = str_extract(state, "(?<=([:digit:]-))[:upper:]([:alpha:]+(_?))+"))%>%
  select(2,3)%>%
  group_by(state)%>%
  summarise(sentiment_score = round(mean(sentiment_score),2))

Explore

Combine the data sets for examine distribution and correlation

master_data <- master_trip_advisor %>%
  inner_join(master_fhfa)

Sentiment Score Distribution

Normal distribution for sentiment_score

master_data %>%
  ggplot(aes(x=sentiment_score))+geom_density()

Home Price Distribution

Somewhat normal distribution for home_price; right-skewed as a result of a floor effect.

master_data %>%
  ggplot(aes(x=home_price))+geom_density()

Correlation

ggplot(master_data, aes(x= sentiment_score, y = home_price))+
  geom_point()  +
  geom_smooth(method = "lm")

## Correlation

cor(master_data$sentiment_score,master_data$home_price)

## [1] -0.1953569

summary(lm(master_data$sentiment_score ~ master_data$home_price))

## 
## Call:
## lm(formula = master_data$sentiment_score ~ master_data$home_price)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52333 -0.10849 -0.00554  0.11719  0.52920 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.056e+00  7.896e-02  13.377   <2e-16 ***
## master_data$home_price -4.711e-07  3.378e-07  -1.394    0.169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2229 on 49 degrees of freedom
## Multiple R-squared:  0.03816,    Adjusted R-squared:  0.01854 
## F-statistic: 1.944 on 1 and 49 DF,  p-value: 0.1695

Correlation

The data shows an interesting and possibly counter-intuitive result; a -.25 correlation between the variable. Higher sentiment_score, correlates to a lower average home price. With a p-value of .083, these results can be considered statistically significant.

Model

After some interesting EDA, we now move to normalize the metrics by using standard error, and create a ranking based on those combined values.

Get SDs

score_sd <- sd(master_data$sentiment_score)
score_mean <- mean(master_data$sentiment_score)
price_sd <- sd(master_data$home_price)
price_mean <- mean(master_data$home_price)

Create Z-score columns for to normalize Calculate the ranking metric by subtracting the home price z-score from the sentiment z-score

master_data_normal <- master_data %>%
  mutate(sentiment_z = round((sentiment_score - score_mean)/score_sd,2) , home_price_z = round((home_price - price_mean)/price_sd,2))%>%
  mutate(appeal = sentiment_z - home_price_z)%>%
  arrange(desc(appeal))

Top Ten

head(master_data_normal, 10)

## # A tibble: 10 x 6
##    state         sentiment_score home_price sentiment_z home_price_z appeal
##    <chr>                   <dbl>      <dbl>       <dbl>        <dbl>  <dbl>
##  1 Michigan                 1.53     114148        2.56        -1.08   3.64
##  2 Arizona                  1.5      181358        2.42        -0.36   2.78
##  3 Arkansas                 1.3      137057        1.53        -0.83   2.36
##  4 Wyoming                  1.39     208331        1.93        -0.07   2   
##  5 Nebraska                 1.12     137220        0.73        -0.83   1.56
##  6 Louisiana                1.18     170112        1           -0.48   1.48
##  7 Missouri                 1.11     143968        0.69        -0.76   1.45
##  8 West_Virginia            1.12     149282        0.73        -0.7    1.43
##  9 Idaho                    1.15     165429        0.87        -0.53   1.4 
## 10 Kentucky                 1.06     148909        0.47        -0.7    1.17

Bottom Ten

tail(master_data_normal, 10)

## # A tibble: 10 x 6
##    state              sentiment_score home_price sentiment_z home_price_z appeal
##    <chr>                        <dbl>      <dbl>       <dbl>        <dbl>  <dbl>
##  1 Illinois                     0.63      182672      -1.44         -0.34  -1.10
##  2 New_York                     0.89      296177      -0.290         0.87  -1.16
##  3 Maryland                     0.83      290126      -0.56          0.81  -1.37
##  4 Oregon                       0.63      231372      -1.44          0.18  -1.62
##  5 New_Mexico                   0.52      189543      -1.93         -0.27  -1.66
##  6 South_Dakota                 0.45      175984      -2.24         -0.41  -1.83
##  7 Colorado                     0.570     263992      -1.71          0.53  -2.24
##  8 District_of_Colum…           0.89      490828      -0.290         2.96  -3.25
##  9 California                   0.63      384832      -1.44          1.82  -3.26
## 10 Hawaii                       0.74      592312      -0.96          4.05  -5.01

Interpret

The results below reflect the most appealing states based on the index we have developed. Aside from Florida, these results do not necessarily show the most populated states. This may have been foreshadowed by the negative correlation between sentiment and home price uncovered above; as home prices are typically a substantial consequence of population. I believe the results beg a hypothesis to be tested.

Null - The results below are not trustworthy, as the sentiment expressed in travel forums do no represent the actual experiences of long-term inhabitants, and it may be the case that these low population states make for lovely trips, but not as lovely homes.

Alternative - The results below are representative of human experiences on those state residents, and this is an indicator of the current and future population direction. It may be that the practicality and novelty of the big cities and power states is slowly wearing off. The expansion of remote work may give people the freedom to live a happier and more financially free life.