Best Places to Live in the USA for Data Scientist

Final Project Proposal - Finding an Ideal Place to Live

There is a quote, “If you want to catch a fish, you have to go where the fish are!”
In this project, We would try to find good places to live in america for a Data Scientist. The question We are asking is “What place are ideal for technology and management people with families to live?”

There is always a big demand of Data Scientist skills everywhere but there are lots of factors that go into making a place an ideal place. Mostly today, Data Scientist are used in Technology, Finance and Pharma areas. We would assume that placces where these jobs are higher would also have higer need for Data Scientists.

Some of the key considerations is always availability of ample work opportunities along with strong wage support. But stronger wages in some areas also come with higher cost of living (like California, Boston or NY) which can be indicated by Consumer Price Index to taxes in certain areas. Along with these, other factors like number of other families, diversity also play a big role in bring up a family. In addition to these, We also plan to search and analyze some additional parameters like availability of recreational facilities like parks in the area.


Inspiration - Finding a Happy Place! :)

Most people end up where they live as a result of uncontrolled events in their life like a new job. A high paying job is important but cannot be the only criteria of determining where to live. Many a times, where a person ends up might not be the best place for that person and takes a toll on his/her happiness. 

Personally, I myself ended up living in New York area as as result of events that occured in my life. But I have often wondered where would I live if I was ever offered a do-over. This project would help me do a very objective analysis and discover what an ideal place for me might look like.


Import Libraries

# Read the Data
lib.list <- c("readxl")    

# Analyze Data
lib.list <- c(lib.list, 
                "tidyverse",
                "DT"
              )

# Model Data
lib.list <- c(lib.list, 
                "modelr",
                "scales",
                "car"
            )

# Plot Data
lib.list <- c(lib.list, 
                "maps",
                "mapdata",
                "ggmap",
                "ggrepel",   
                "plotly",
                "gridExtra"
)

lapply(lib.list, require, character.only = TRUE)

Data Preparation

All wages Data

wages_estimated <- read_excel("C:\\CUNY\\607Data\\Assignments\\finalProject\\mwe_2016complete.xlsx", 
                       sheet ="MWE_2016")

all_wages <-
  wages_estimated %>%
  select(`Average Hourly Wage`, `Area Level`, `Area`, `Occupation Text`) %>%
  filter(!grepl('Alaska|Hawaii|*AK|*HI',`Area`))

states <- map_data("state")

head(all_wages, 100) %>%
  datatable(options = list(
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))

State Longitude Data

statelonglat <- read_excel("C:\\CUNY\\607Data\\Assignments\\finalProject\\state_longitude_latutude.xlsx")

State Parity Data

state_price_parity <-
  read_excel("C:\\CUNY\\607Data\\Assignments\\finalProject\\state_price_parity.xlsx") %>%
  filter(!grepl('Alaska|Hawaii',`Area`))

Most Educated State

state_educational_level <-    
    read_excel("C:\\CUNY\\607Data\\Assignments\\finalProject\\most_educated_states.xlsx")

State Employment

state_employment <- read_excel("C:\\CUNY\\607Data\\Assignments\\finalProject\\state_employment.xlsx")

State Percent Real GDP

state_pc_real_gdp <- read_excel("C:\\CUNY\\607Data\\Assignments\\finalProject\\state_pc_real_gdp.xlsx")

National Level - Data Analysis using ANOVA

National Level - Data Analysis using ANOVA

**Hypothesis Testing**
We can use Hypothesis testing to check if there is significant difference amoung the hourly wages

\(H_0\) would state that there is NO difference in average hourly wage of all the jobs

\(H_0 states \mu_1 = \mu_2 = \mu_3 ...\mu_n\)

\(H_a\) states that there is a difference in average hourly wage of all the occupations

\(H_a states \mu_1 \ne \mu_2 \ne \mu_3 ...\mu_n\)

usa_wages <-
all_wages %>%
  group_by(`Occupation Text`) %>%
  filter(grepl('National',`Area Level`)) %>%
  filter(grepl('Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>%
  filter(!grepl('Keyers|Clerk',`Occupation Text`))

usa_anova <- aov(`Average Hourly Wage`~`Occupation Text`, data = usa_wages)

summary(usa_anova)
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## `Occupation Text`  20  14248   712.4   3.498 6.6e-06 ***
## Residuals         140  28515   203.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-Value = 6.6e-06, which is smaller than 0.05 This proves that our \(H_0\) is false


We can use GGplot boxplots to analyze the data

ggplotly(
  usa_wages %>% ggplot(aes(reorder(`Occupation Text`, `Average Hourly Wage`), `Average Hourly Wage`)) + labs(x = 'Technology Related Jobs') + geom_boxplot() + scale_y_continuous() + ggtitle("Wages of Different Technology Related Jobs") + theme_minimal() + theme(panel.grid.minor = element_blank(),
        plot.title = element_text(size = 14, margin = margin(b = 10))) +
  coord_flip(), width = 880, height = 660
)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

State Level - Data Analysis using ANOVA

State Level - Data Analysis

But the wages of the technology jobs also differ by states

techjob <- all_wages %>%
  group_by(`Area`) %>% 
  filter(grepl('State',`Area Level`)) %>%
  filter(grepl('Computer Systems Analysts|Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>% 
    summarise(median_wage = median(`Average Hourly Wage`)) %>%
  left_join(statelonglat, by = "Area")


ggplot(data = states) + 
  geom_polygon(aes(x = states$long, y = states$lat, group = group), fill = "pink", color = "white") +
  geom_point( data=techjob, aes(x = Longitude, y = Latitude, size=median_wage, color = median_wage)) +
  scale_size( name="median_wage") +
  scale_colour_gradientn(colours=rainbow(4)) +
  geom_text_repel(data = techjob,aes(x = Longitude, y = Latitude, label = Area)) +
  coord_fixed(1.3) +
  guides(fill=FALSE) +
  labs(x="Longitute", y="Latitude") +
  ggtitle("Technology Related Wages Across States") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10)))


State Price Parity - Data Analysis

newwage <-
all_wages %>%
  filter(grepl('Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>%
  filter(!grepl('Keyers|Clerk',`Occupation Text`)) %>%
  select(`Average Hourly Wage`, "Area Level", "Area", "Occupation Text") %>%
  filter(grepl('State',`Area Level`)) %>%
  left_join(state_price_parity) %>%
  mutate(adjusted_wage = `Average Hourly Wage`/`Price Parity`*100)
## Joining, by = "Area"
adjustedTechJobs <-
newwage %>%
group_by(`Area`) %>%
filter(grepl('Computer Systems Analysts|Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>%
filter(!grepl('Keyers|Clerk',`Occupation Text`)) %>%
summarise(median_wage = median(`Average Hourly Wage`), median_adjusted_wage = median(adjusted_wage)) %>%
left_join(statelonglat, by = "Area")

ggplot(data = states) + 
geom_polygon(aes(x = states$long, y = states$lat, group = group), fill = "pink", color = "white") +
geom_point( data=adjustedTechJobs, aes(x = adjustedTechJobs$Longitude, y = adjustedTechJobs$Latitude, size=median_adjusted_wage, color = median_adjusted_wage)) +
scale_size( name="median_adjusted_wage") +
scale_colour_gradientn(colours=rainbow(4))+
geom_text_repel(data = adjustedTechJobs,aes(x = adjustedTechJobs$Longitude, y = adjustedTechJobs$Latitude, label = Area)) +
coord_fixed(1.3) +
guides(fill=FALSE) +
labs(x="Longitute", y="Latitude") +
ggtitle("Parity Adjusted Technology Wages") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10)))

Conclusion

As seen from above analysis
1. State with highest wages are 
    a. New York
    b. New Jersey
    c. Virginia
    d. Massachusetts
    e. California
2. Once you adjust for partiy the states with highest income would be
    a. Virginia
    b. North Carolina
    c. New York
    d. Iowa
    e. Kansas
3. The best sectors for employment were 
    a. Computer and Information Systems
    b. Securities, Commodoties and Financial Sectors
    c. Financial Managers

I wanted to do more analysis with the data but could not do so due to some some personal emergency on my side. 

I want to thank the professor Andy for all his support. I learnt alot during this class and hopefully the quality of my previous projects highlight the learnings well.

Thank you!

Appendix - Important Links

1. Wage Date: https://www.bls.gov/mwe/
2. Consumer Price Index Data: https://www.bls.gov/cpi/
3. Unemployment Data: https://www.bls.gov/web/laus/laumstrk.htm
4. State and National Parks: https://en.wikipedia.org/wiki/Lists_of_state_parks_by_U.S._state
5. Census Data: https://www.census.gov/data/datasets/2017/demo/popest/nation-detail.html#ds
6. Average House Prices: https://www.statisticbrain.com/home-sales-average-price/
7. Price Parity Data: https://www.coursera.org/learn/analytics-tableau/lecture/IIUOH/blending-price-parity-data-with-our-salary-data