Welcome to my final project!

Introduction

1. Problem

Which city or state is the ideal location for a MSBA student to work and live? Nowadays Big Data is a really hot topic. Accordingly, there is a big demand of data related jobs everywhere in USA, so it is difficult but necessary to choose the most suitable place. Why am I so interested in this problem? Because as a STEM program student, I plan to work in USA for a few years. Unlike many other Chinese students who go back to China right after graduation, I do believe that working and living in US will be one of the most valuable treasures in my life!

2. Data

Wages should be considered at first, but other factors like Price Parity are also very important, because no matter how high a person’s Wage is, his living quality can be reduced due to big living expenditure.

However, does the Original Wage not matter at all? I mean, the Wages posted on LinkedIn can only be the Original Wage. So when we search jobs on LinkedIn, can we just simply ignore the Wage differences among similar job positions? When we choose where to live and work, other important factors such as the Economic Development Level(which can be reflected by the index, Per Capita Real GDP), Employment and Education Level should be considered, too. Only considering Price Parity, those important factors may be neglected. But the Original Wage can be influenced by them more easily, so it can help us predict these factors better.

3. Approach/Analytic Technique

Mainly, I plan to use visualization to answer my question. I will use the boxplots to show that the Wage differences between jobs on the national level. I will also use the maps to show the differences between the data grouped by area. What’s more, I will use some statistical techniques like regression to analyze the relationship between the Wages and other variables.

4. How to Decide

I want to say, it is always difficult to make a decision. However, my project can help you make a better decision based on considering the most important factors such as Wages, Price Parity, Economic Development Level, Employment and Education Level, comprehensively.

Packages Required

To reproduce the code and results throughout this project you will need to load the following packages.

library(tidyverse) #visualizeing, transforming, inputing, tidying and joining data
library(readxl)    #inputing data from Excel
library(plotly)    #showing plots and data at the same time
library(ggmap)     #loading the function of maping
library(maps)      #loading maps
library(mapdata)   #loading map data
library(DT)        #previewing the data sets
library(ggrepel)   #avoid label overlapping
library(car)       #testing multicollinearity and heteroscedasticity 
library(modelr)    #testing if the model can be applied generally
library(scales)    #formatting scales to non-scientific type
library(gridExtra) #putting multiple plots into 1 page

Data Preparation

I got data sets from different resources. I got Wage Data by State and by Job Characterisrics and state_employment from Bureau of Labor Statistics(BLS), State Price Parity from Coursera, Metropolitan Area Price Parity and state_pc_real_gdp from Bureau of Economic Analysis(BEA), State Longitude and Latitude from Ink Plant Code, Metropolitan Area Longitude and Latitude from StatCrunch, state_educational_level from WalletHub.
Wage Data by State and by Job Characterisrics is the main data set I use. It was originally collected in 2016, by BLS to summarise the Wages in different states, occupation titles and occupation levels. The original data set has 12 variables. It is a relatively complete data set and doesn’t have missing values. I use several other different data sets so I think it’s better to make a brief description than to list them thoroughly here. Price Parity is used to calculate the Adjusted Wage, Longitude and Latitude data sets are used to be merged with my main data sets so I can put data into maps. state_pc_real_gdp, state_employment and state_educational_level are used to analyze the factors that influence the Wage.
Then I read these files into R.

messwage <- read_excel("Rfinalprojectdata/mwe_2016complete.xlsx", sheet ="MWE_2016")
statelonglat <- read_excel("Rfinalprojectdata/longlat.xlsx")
metro <- read_excel("Rfinalprojectdata/metro.xlsx")
state_price_parity <-
  read_excel("Rfinalprojectdata/Coursera_price_parity.xlsx") %>%
  filter(!grepl('Alaska|Hawaii',`region`))
metro_price_parity <- read_excel("Rfinalprojectdata/metro_price_parity.xlsx")
state_pc_real_gdp <- read_excel("Rfinalprojectdata/per_capita_real_GDP_by_state.xlsx")
state_employment <- read_excel("Rfinalprojectdata/state_M2016_dl.xlsx")
state_educational_level <- read_excel("Rfinalprojectdata/Most_Educated_States.xlsx")

The data is not very messy so I just need to select the variables I want to analyze. Besides, to make the further visualization more convenient and clear, I filter these 2 states Alaska and Hawaii, as well as the metropolitan areas in these 2 states

wage <-
  messwage %>%
  select(`Average Hourly Wage`, `Area Level`, `region`, `Occupation Text`) %>%
  filter(!grepl('Alaska|Hawaii|*AK|*HI',`region`))
states <- map_data("state")

Preview of the clean data set.

head(wage,100) %>%
  datatable(options = list(
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))

For the main data set Wage Data by State and by Job Characterisrics, it has 255479 observations and 12 variables. I only need to concern the variables Averge Hourly Wage, Area Level, region and Occupation Text. The 1st one is a numeric variable, while others are characters. Average Hourly Wage is the Original Wage, and it has the Min. 8.26, Median 16.84, Mean 20.04, and Max. 99.88. Area Level has National, State, Metropolitan and Non-metropolitan Levels, which I need to make groups to analyze. region has the spesific names of the areas, so I can see which place is the best place to live and work. Occupation Text has the speicific job titles, so I can find the data related jobs in it.

Exploratory Data Analysis

National

the Wages of Data Related Jobs Differ on National Level

First, on national level, is there a significant Average Hourly Wage difference among different data related jobs? ANOVA will test the hypotheses:

H₀ : μ₁ = μ₂ = … = μ_n (The means of Wage of different data related jobs(Marketing Managers, Software Developers, Financial Analysts, etc) are all the same.)

H_a : Not all the population means are equal

national_wage <-
wage %>%
  group_by(`Occupation Text`) %>%
  filter(grepl('National',`Area Level`)) %>%
  filter(grepl('Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>%
  filter(!grepl('Keyers|Clerk',`Occupation Text`))
anova <- aov(`Average Hourly Wage`~`Occupation Text`, data = national_wage)
summary(anova)

##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## `Occupation Text`  20  14248   712.4   3.498 6.6e-06 ***
## Residuals         140  28515   203.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the p-value = 6.6e-06, which is much smaller than α = 0.05 or even 0.01. Therefore, we can conclude that there is a significant Wage difference among different data related jobs.

To see more clearly, I draw some boxplots to analyze the differences in Average Hourly Wage. Computer, Software and Information related jobs have the highest Wages, and then Financial jobs. Also, Wage of the Manager level is higher than Analyst level.

ggplotly(
  national_wage %>%
  ggplot(aes(reorder(`Occupation Text`, `Average Hourly Wage`), `Average Hourly Wage`)) +
  labs(x = 'Data Related Jobs') +
  geom_boxplot() +
  scale_y_continuous() +
  ggtitle("Figure 1: Wages of Different Data Related Jobs") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10))) +
  coord_flip(),
width = 880, height = 660)

State

the Wages of Data Related Jobs Differ on State Level

Computer System Analysts have obersavations with the biggest quantity of states and metropolitan areas, which will be easier to compare the differences, so I take this job title for example.

csAnalysts <-
wage %>%
  group_by(`region`) %>%
  filter(grepl('State',`Area Level`)) %>%
  filter(grepl('Computer Systems Analysts',`Occupation Text`)) %>%
  summarise(median_wage = median(`Average Hourly Wage`)) %>%
  left_join(statelonglat, by = "region")

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, group = group), fill = "grey90", color = "white") +
  geom_point( data=csAnalysts, aes(x = long, y = lat, size=median_wage, color = median_wage)) +
  scale_size( name="median_wage") +
  scale_colour_gradientn(colours=rainbow(4)) +
  geom_text_repel(data = csAnalysts,aes(x = long, y = lat, label = region)) +
  coord_fixed(1.3) +
  guides(fill=FALSE) +
  labs(x="Longitute", y="Latitude") +
  ggtitle("Figure 2: Computer System Analyst Wages Across States") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10)))

It seems that Computer System Analysts in New Jersey State and New York State are really paid well, right? But now let me take the Price Parity into consideration.

newwage <-
wage %>%
  filter(grepl('Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>%
  filter(!grepl('Keyers|Clerk',`Occupation Text`)) %>%
  select(`Average Hourly Wage`, "Area Level", "region", "Occupation Text") %>%
  filter(grepl('State',`Area Level`)) %>%
  left_join(state_price_parity) %>%
  mutate(adjusted_wage = `Average Hourly Wage`/`Price Parity`*100)

AdjustedCSAnalysts <-
newwage %>%
group_by(`region`) %>%
filter(grepl('Computer Systems Analysts',`Occupation Text`)) %>%
summarise(median_wage = median(`Average Hourly Wage`), median_adjusted_wage = median(adjusted_wage)) %>%
left_join(statelonglat, by = "region")

ggplot(data = states) + 
geom_polygon(aes(x = long, y = lat, group = group), fill = "grey90", color = "white") +
geom_point( data=AdjustedCSAnalysts, aes(x = long, y = lat, size=median_adjusted_wage, color = median_adjusted_wage)) +
scale_size( name="median_adjusted_wage") +
scale_colour_gradientn(colours=rainbow(4))+
geom_text_repel(data = AdjustedCSAnalysts,aes(x = long, y = lat, label = region)) +
coord_fixed(1.3) +
guides(fill=FALSE) +
labs(x="Longitute", y="Latitude") +
ggtitle("Figure 3: Adjusted Computer System Analyst Wages Across States") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10)))

So it is a totally different story now. We found that the Computer System Analysts in Missouri and Ohio are the luckiest guys, right?

Here is the datatable for Computer System Analysts Wages across states. I arrange the data in a descending order of Adjusted Wage.

AdjustedCSAnalysts %>%
mutate(median_adjusted_wage = format(round(median_adjusted_wage, 2), nsmall = 2),
       median_wage = format(round(median_wage, 2), nsmall = 2)) %>%
select(region, median_wage, median_adjusted_wage) %>%
arrange(desc(median_adjusted_wage)) %>%
datatable(options = list(
columnDefs = list(list(className = 'dt-center', targets = 1:3))
))

Metro

the Wages of Data Related Jobs Differ on Metropolitan Area Level

So now let’s have a look at Metropolitan Area.

metronewwage <-
  wage %>%
  filter(grepl('Analysts|Data|Intelligence|Software|Machine Learning|Solutions|Consultant|Financial|Operation|Computer and Information Systems Managers|Network and Computer Systems Administrators',`Occupation Text`)) %>%
  filter(!grepl('Keyers|Clerk',`Occupation Text`)) %>%
  select(`Average Hourly Wage`, "Area Level", "region", "Occupation Text") %>%
  filter(grepl('Metro area',`Area Level`)) %>%
  left_join(metro_price_parity) %>%
  mutate(metro_adjusted_wage = `Average Hourly Wage`/`Price Parity`*100)

MetroAdjustedCSAnalysts <-
metronewwage %>%
  group_by(`region`, `ID`) %>%
  filter(grepl('Metro area',`Area Level`)) %>%
  filter(grepl('Computer Systems Analysts',`Occupation Text`)) %>%
  summarise(median_metro_adjusted_wage = median(metro_adjusted_wage)) %>%
  left_join(metro, by = "region")

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, group = group), fill = "grey90", color = "white") +
  geom_point( data = MetroAdjustedCSAnalysts, aes(x = long, y = lat, size=median_metro_adjusted_wage, color =                        median_metro_adjusted_wage)) +
  scale_size( name="median_metro_adjusted_wage") +
  scale_colour_gradientn(colours=rainbow(4))+
  geom_text_repel(data = MetroAdjustedCSAnalysts,aes(x = long, y = lat, label = ID)) +
  coord_fixed(1.3) +
  guides(fill=FALSE) +
  labs(x="Longitute", y="Latitude") +
  ggtitle("Figure 4: Adjusted Computer System Analyst Wages Across Metro Areas") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10)))

So Computer System Analysts in St.Louis, MO-IL, and Cincinnati, OH-KY-IN are best paid. And here I attach the Metropolitan Area list with ID. I arrange the data in a descending order of Adjusted Wage.

MetroAdjustedCSAnalysts %>%
  mutate(median_metro_adjusted_wage = format(round(median_metro_adjusted_wage, 2), nsmall = 2)) %>%
  select(ID, region, median_metro_adjusted_wage) %>%
  arrange(desc(median_metro_adjusted_wage)) %>%
  na.omit() %>%
  datatable(options = list(
  columnDefs = list(list(className = 'dt-center', targets = 1:3))
  ))

Factors

the Factors that May influence the Wages

However, now can we really consider the Adjusted Wage level as the only index when relocating? Does the highest Adjusted Wage mean really mean the highest living quality? I do believe that the Original “Average Hourly Wage” has some reference value. For example, can the Adjusted Wage reflect important factors such as Per Capita Real GDP, Employment and Educational Level? Maybe the Original Wage can help us predict these factors better. In other words, we need to find if these factors influence the Wage levels.

First, let’s build a data set with the information of Wages, Per Capita Real GDP, Employment and Education Level, by merging the related data sets together.

regression_wage <-
newwage %>%
  left_join(state_pc_real_gdp) %>%
  left_join(state_employment) %>%
  mutate(TOT_EMP = as.numeric(as.character(TOT_EMP))) %>%
  left_join(state_educational_level) %>%
  select(`region`,`Occupation Text`, `Average Hourly Wage`, `adjusted_wage`, `2016`, TOT_EMP, `Total Score`) %>%
  group_by(region) %>%
  summarise(median_wage = median(`Average Hourly Wage`, na.rm = TRUE), median_adjusted_wage = median(adjusted_wage,      na.rm = TRUE), `Per Capita Real GDP` = median(`2016`, na.rm = TRUE),`Employment` = median(TOT_EMP, na.rm = TRUE),      `Education Level` = median(`Total Score`, , na.rm = TRUE)) %>%
  na.omit()

Next, let’s see if there is correlation between the dependent variable Wage, and the independent variables, Per Capita Real GDP, Employment and Education Level.

cor(regression_wage$median_wage, regression_wage$`Per Capita Real GDP`)
cor(regression_wage$median_wage, regression_wage$`Employment`)
cor(regression_wage$median_wage, regression_wage$`Education Level`)
cor(regression_wage$median_adjusted_wage, regression_wage$`Per Capita Real GDP`)
cor(regression_wage$median_adjusted_wage, regression_wage$`Employment`)
cor(regression_wage$median_adjusted_wage, regression_wage$`Education Level`)

Correlations	Per Capita Real GDP	Employment	Education Level
Original Wage	0.717766	0.4715283	0.5932142
Adjusted Wage	0.3956327	0.2588557	0.218387

It seems that the Original Wage has strong correlation with independent variables, while the Adjusted Wage has a much weaker correlation with independent variables.

To see more clearly about the correlations between the Original Wage and independent variables, let’s look at the plots.

plot1 <-
  regression_wage %>%
  ggplot(aes(`median_wage`,`Per Capita Real GDP`)) +
  geom_point(alpha = .3) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  geom_smooth() +
  geom_smooth(method = "lm", color = "red")

plot2 <-
regression_wage%>%
  ggplot(aes(`median_wage`, Employment)) +
  geom_point(alpha = .3) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = comma) +
  geom_smooth() +
  geom_smooth(method = "lm", color = "red")

plot3 <-
regression_wage %>%
  ggplot(aes(`median_wage`,`Education Level`)) +
  geom_point(alpha = .3) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = comma)+
  geom_smooth() +
  geom_smooth(method = "lm", color = "red")
grid.arrange(plot1, plot2, plot3, ncol=3)

Let’s start building the regression model! model1 is for the Original Wage.

set.seed(31)
sample <- sample(c(TRUE, FALSE), nrow(regression_wage), replace = T, prob = c(0.6, 0.4))
train <- regression_wage[sample, ]

model1 <- lm(median_wage ~ `Per Capita Real GDP`+Employment+`Education Level`, data = train)
summary(model1)

## 
## Call:
## lm(formula = median_wage ~ `Per Capita Real GDP` + Employment + 
##     `Education Level`, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7500 -1.6112 -0.1987  1.2980  4.8711 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.974e+01  2.277e+00   8.667 2.05e-09 ***
## `Per Capita Real GDP` 2.607e-04  6.777e-05   3.847 0.000632 ***
## Employment            5.937e-06  2.248e-06   2.640 0.013386 *  
## `Education Level`     2.540e-02  2.981e-02   0.852 0.401353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.148 on 28 degrees of freedom
## Multiple R-squared:  0.7335, Adjusted R-squared:  0.705 
## F-statistic: 25.69 on 3 and 28 DF,  p-value: 3.417e-08

Yes, it seems a pretty good model! R-square is 0.7335, which means 73.55% of the Wage change can be explained by all these 3 independent variables. The p-value of F-statistic is 3.417e-08, much smaller than α, which means these 3 variables together have a significant influence on the Original Wage. Also either Per Capita Real GDP or Employment has a significant influence on the Original Wage independently. But Education Level itself doesn’t have a significant influence on the Original Wage.

So now. let’s exclude this independent variable Education Level, to build model2.

model2 <- lm(median_wage ~ `Per Capita Real GDP`+Employment, data = train)
summary(model2)

## 
## Call:
## lm(formula = median_wage ~ `Per Capita Real GDP` + Employment, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4951 -1.4264  0.0524  1.4245  4.6753 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.904e+01  2.116e+00   8.999 6.82e-10 ***
## `Per Capita Real GDP` 3.022e-04  4.690e-05   6.445 4.74e-07 ***
## Employment            5.227e-06  2.078e-06   2.515   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.138 on 29 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7078 
## F-statistic: 38.54 on 2 and 29 DF,  p-value: 6.807e-09

The Adjusted R-squared of model2 is 0.7078, which is a little higher than that of model1, 0.705. So I think model2 is better than model1!

Now let’s check if there is a strong relationship between the Adjusted Wage and these 2 independent variable?

model3 <- lm(median_adjusted_wage ~ `Per Capita Real GDP`+Employment, data = train)
summary(model3)

## 
## Call:
## lm(formula = median_adjusted_wage ~ `Per Capita Real GDP` + Employment, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0354 -2.2303 -0.8023  1.9672  6.0959 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.997e+01  2.700e+00  11.098  5.9e-12 ***
## `Per Capita Real GDP` 1.090e-04  5.985e-05   1.822   0.0788 .  
## Employment            1.699e-06  2.652e-06   0.641   0.5268    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.729 on 29 degrees of freedom
## Multiple R-squared:  0.1691, Adjusted R-squared:  0.1118 
## F-statistic: 2.951 on 2 and 29 DF,  p-value: 0.06814

The R-Square, adjusted R-Square, F-statistic for the whole model3 and p-values of each independent variable show that, compared to the Adjusted Wage, the Original Wage has a much stronger relationship with Per Capita Real GDP and Education Level.

So far model2 is the best model. So let’s check it further. Look at the plots first.

par(mfrow=c(2,3))
plot(model2, which = 1)
plot(model2, which = 2)
plot(model2, which = 3)
plot(model2, which = 4)
plot(model2, which = 5)

From Residuals vs. Fitted plot, we can see that there is no serious heteroscedasticity problem. From Q-Q plot and Scale-Location plot, we can see that residuals are normally distributed. From Cook’s Distance plot, and Residuals vs Leverage plot, we can see that there aren’t too many outliers.

Next, use vif and ncvTest to make sure if clean_model has multicollinearity and heteroscedasticity problems.

vif(model2)

## `Per Capita Real GDP`            Employment 
##              1.240596              1.240596

ncvTest(model2)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.6596209    Df = 1     p = 0.4166938

VIF < 4, so we can conclude that model2 doesn’t have multicollinearity problems. p > 0.1, which means there is some but not serious heteroscedasticity problem.

And then, let’s see if model2 can be applied out of sample.

test <- regression_wage[!sample,]
test %>%
  select(median_wage, `Per Capita Real GDP`, Employment) %>%
  add_predictions(model2) %>%
  summarise(MSE = mean((median_wage - pred)))

## # A tibble: 1 x 1
##           MSE
##         <dbl>
## 1 -0.05502762

We can see that MSE = -0.05502762, which is very small. So model2 can be well applied out of sample.

Compare

Compare The Most Important Factors by State

So Per Capita Real GDP and Employment seem to be the most important factors related to Wages. Now let’s look at Employment by State. California, Texas, New York and Florida have the biggest employment amount, and Ohio also does a good job. Missouri is not that good but above the median level.

state_employment %>%
  group_by(region) %>%
  mutate(TOT_EMP = as.numeric(as.character(TOT_EMP))) %>%
  summarise(Employment = median(TOT_EMP, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(region,Employment), y = Employment)) +
  geom_bar(aes(fill = Employment),stat = "identity")+
  labs(x = "States") +
  ggtitle("Figure 5: Employment Across States") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia", size = 7),
        plot.title = element_text(size = 20, margin = margin(b = 10))) +
  coord_flip()

Next let’s look at Per Capita Real GDP by State. Massachusetts，New York, Connecticut, Delaware and North Dakota have the highest Per Capita Real GDP, while Missouri and Ohio are below the average level.

clean_state_gdp <-
  state_pc_real_gdp %>%
  filter(!grepl('District of Columbia|Alaska|Hawaii|Far West|Great Lakes|Mideast|New England|Plains|Rocky Mountain|Southeast|Southwest|United States',
                region)) %>%
  na.omit() %>%
  gather(Year, `Per Capita Real GDP`, 8:27) %>%
  select(region, Year, `Per Capita Real GDP`)

top5 <- clean_state_gdp %>%
  arrange(desc(`Per Capita Real GDP`)) %>%
  filter(Year == 2016) %>%
  slice(1:5)

bottom5 <- clean_state_gdp %>%
  arrange(`Per Capita Real GDP`) %>%
  filter(Year == 2016) %>%
  slice(1:5)

avg <- clean_state_gdp %>%
  group_by(Year) %>%
  summarise(Avg_mn = mean(`Per Capita Real GDP`),
            Avg_md = median(`Per Capita Real GDP`)) %>%
  mutate(Avg = "Average")

ggplotly(
  clean_state_gdp %>%
  ggplot(aes(Year, `Per Capita Real GDP`, group = region)) +
  geom_line(alpha = .1) +
  geom_line(data = filter(clean_state_gdp, region %in% top5$region),
            aes(Year, `Per Capita Real GDP`, group = region), color = "dodgerblue") +
  geom_line(data = filter(clean_state_gdp, region %in% bottom5$region),
            aes(Year, `Per Capita Real GDP`, group = region), color = "red") +
  geom_line(data = avg, aes(Year, Avg_mn, group = 1), linetype = "dashed") +
  geom_line(data = filter(clean_state_gdp, grepl('Ohio',region)),
            aes(Year, `Per Capita Real GDP`), color = "purple") +
  geom_line(data = filter(clean_state_gdp,grepl('Missouri',region)),
            aes(Year, `Per Capita Real GDP`), color = "green") +
  geom_text_repel(data = top5, aes(label = region), nudge_x = .5, size = 3) +
  geom_point(data = top5, aes(Year, `Per Capita Real GDP`), color = "dodgerblue") +
  geom_text_repel(data = bottom5, aes(label = region), nudge_x = 0.5, size = 3) +
  geom_point(data = bottom5, aes(Year, `Per Capita Real GDP`), color = "red") +
  scale_y_continuous(NULL, labels = scales::dollar) +
  ggtitle("Figure 6: 1997~2016 Per Capita Real GDP by State") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        text = element_text(family = "Georgia"),
        plot.title = element_text(size = 20, margin = margin(b = 10))),
width = 880, height = 660)

Summary

Yes, here it is! I finally made it!

So, Which city or state is the ideal location for a MSBA student to work and live? What factors need to be considered, such as Wage, Price Parity, Economic Development Level(which can be reflected by the index, Per Capita Real GDP), Employment and Education Level, and how these factors are related to each other?
I got several different data sets such as Wage Data by State and by Job Characterisrics, State Longitude and Latitude, Metropolitan Area Longitude and Latitude, state_employment, State Price Parity, Metropolitan Area Price Parity, state_pc_real_gdp, state_employment and state_educational_level. Mainly, I use visualization to answer my question. I also use the boxplots to show that the Wage differences between jobs on the national level. Besides, I use the maps to show the Original Wage differences and also the Adjusted Wage differences between the data grouped by area. What’s more, I use some statistical techniques like regression to analyze the relationship between the Wages and other factors, and how the factors influence our choices.
As for the Data Related Job Choices, Computer, Software and Information related jobs have the highest Wages, and then Financial jobs. Also, Wage of the Manager level is higher than Analyst level. So we have to learn more knowledge about computer science and finance to get well-paid job. And then work hard to get promotion, which will make you better paid! As for the Adjusted Wage, the states Missouri and Ohio, as well as some metropolitan areas in these 2 states have relatively high Adjusted Wages, which means that, considering living costs, people in Missouri and Ohio with data related jobs have relatively higher Wages than the counterparts elsewhere. As for the Wage, it is more complicated, because it can also reflect other 2 factors, Employment and Economic Development Level well. California, Texas, New York and Florida have the highes Employment amount, and Ohio also does a good job. Missouri is not that good but above the median level. Massachusetts，New York, Connecticut, Delaware and North Dakota have the highest Economic Development Level, while Missouri and Ohio are below the average level.
So here comes to the conclusion. If you want to have a stable and decent job, and enjoy relatively high quality life, then you should go to Missouri and Ohio! If you want to find or transfer jobs easily, California, Texas, New York and Florida are your top choices, while Ohio is also very good and Missouri is not too bad. If you like pressue and competition, and you want large developing space in your career, Massachusetts，New York, Connecticut, Delaware and North Dakota are waiting for you! As the ancient Chinese philosopher Mencius said,“You can never have both fish and bear’s paw”, which is the same as “You can’t have your cake and eat it too”.
The main limitations are about data. There is not enough data related job information in each State and Metropolitan Area. For example, there’re only 5 States with the information of Financial Analysts, which is not enough for me to draw an exact and general conclusion. Also, I only build a model on State Level, because I tried but I couldn’t build up a well-fitted model on Metropolitan Level. I think it is also due to the lack of data.

R Final Project: Where to Go?

Yang Ding

8/23/2017