Introdustion to the data

The data set I will be using for my analysis is income-prediction-data set-us-20th-century-data found on Kaggle. I will be using two csv files from the data set and the names of the files are Train and Test. The train file contains 200,000 randomly selected immigrants with 43 columns. The test file contains 100,000 immigrants with 42 columns. The contents of the two files are immigrates coming to America from 1994 to 1995.

Data set: income-prediction-data set-us-20th-century-data

Key variables

  • Age
  • Gender
  • Education
  • Class
  • Education institute
  • Marital status
  • Race
  • Is Hispanic
  • Employment commitment
  • Unemployment reason
  • Employment state
  • Wage per hour
  • Is part of labor union
  • Working week per year
  • Industry code
  • Main Industry code
  • Occupation code
  • Main Occupation code
  • Total employed
  • Household stat
  • Household summary
  • Under 18 family
  • Veterans admin questionnaire
  • Veteran benefit
  • Tax status
  • Gains
  • Losses
  • Stocks status
  • Citizenship
  • Migration year
  • Country of birth own
  • Country of birth father
  • Country of birth mother
  • Migration code change in msa
  • Migration previous sunbelt
  • Migration code move within registration
  • Migration code change in registration
  • Residence 1 year ago
  • Old residence registration
  • Old residence state
  • Importance of record
  • Income above limit

Research Goal

The goal for the anylist is to find out more about the migrants that were coming to america in 1994 to 1995. The variables that I will be focusing on to figure this out are in red in Key Variables. The reasoning for the choosing variables is to focus on what the migrants bring to the work force and a gimps into the background of the migrants.

library(data.table)
library(plyr)
Train <- fread("/Users/colemcnulty/Desktop/Income Prediction./Train.csv")
Test <- fread("/Users/colemcnulty/Desktop/Income Prediction./Test.csv")

#______________Main Data frame________________________________________
df1 <- join(Train,Test, by=c('ID'), type='full')
df2 <- df1[df1$age >= 18, ]
df3 <- transform(df2,
                 income_above_limit = as.numeric(as.factor(income_above_limit)))
df_3 <- na.omit(df3)

#__________Getting Hispanic seperated____________________________________

mex_df <- transform(df_3,
                 is_hispanic = as.numeric(as.factor(is_hispanic)))
mex_df1 <- mex_df[mex_df$is_hispanic != 1, ]
other_race <- mex_df[mex_df$is_hispanic == 1, ]
rac_numer <- transform(other_race,
                 race = as.numeric(as.factor(race)))

#--------Frist Plot----------------------------------------------------
white_df <- rac_numer[rac_numer$race == 5, ]
black_df <- rac_numer[rac_numer$race == 3,]
asin_df <- rac_numer[rac_numer$race == 2,]
Indian_df <- rac_numer[rac_numer$race == 1,]


#---------Second plot---------------------------------------------------
white_ind <- count(white_df$industry_code_main)
black_ind <- count(black_df$industry_code_main)
asin_ind <- count(asin_df$industry_code_main)
Indian_ind <- count(Indian_df$industry_code_main)
mex_ind <- count(mex_df1$industry_code_main)


#--------Third plot-----------------------------------------------------
white_tax <- count(white_df$tax_status)
black_tac <- count(black_df$tax_status)
asin_tax <- count(asin_df$tax_status)
Indian_tax <- count(Indian_df$tax_status)
mex_tax <- count(mex_df1$tax_status)


#--------Fourth Graph---------------------------------------------------
mex_educ <- count(mex_df1$education)
indian_educ <- count(Indian_df$education)
black_educ <- count(black_df$education)
white_edu <- count(white_df$education)
asin_edu <- count(asin_df$education)

Cleaning

In the code above you can see my cleaning process for the data I want to represent. In my main data frame I only want migrants who are above the age of 18. This is due to there being many children in my data set that will not contribute for to the work forces. I will also be making subset of migrants to compare the subset with each other.

Subset infomation

The subsets will be determined by the race of the migrant and there will be four subset: “White”, “Black”, “Asian”, and “Amer Indian Aleut or Eskimo”.

Bar Chart Graph

#_______________Hard Code_______________
first_graph <- data.frame(
  race = c("White", "Black", "Asian","Amer Indian Aleut or Eskimo", "Latino"),
  percent = c(90, 95.85, 89, 96, 96.5 )
)

#____________First Graph_______________________

library(ggplot2)
ggplot(first_graph, aes(x = reorder(race, -percent), y=percent)) +
  geom_bar(colour = "black", fill ="blue", stat = "identity") +
  labs(title = "Percent of Immigrants Below $50k A Year by Race", x = "Race", y = "Percent") +
  geom_text(aes(label = paste0(percent,"%")), vjust = -.5, size = 3.2)

Information: Bar Chart

In the Bar chart above it shows Immigrant by race and what percent of them are making below income (less the 50k a year).

Notes from the visualization

  • Black, Latino, Amer Indian Aleut or Eskimo have a very similar percentage of people making less than $50k a year.
  • White and Asian Immigrants have a very similar percentage of people making less than $50k a year.
  • Most of the migrants are making a low income per year for every race.

Conclusion: Bar Chart

We see from the chart that most of the migrants are making below 50k a year. But White and Asian Immigrants have the lowest percent of migrants below 50k a year. While Black, Latino, Amer Indian Aleut or Eskimo have the highest percent of migrants below 50k a year.

Stack Bi Bar Plot

library(dplyr)

#===============BI CHART=======================================================

#----Preparing Data----------------------------------------------------------


#____White_df_industry_______________________________________________________
white_ind <-white_ind %>%
  mutate(race = 'White')
sorted_wind <- order(white_ind$freq, decreasing = TRUE)
top_3_white <- head(white_ind[sorted_wind, ], 3)

#____Black_df_industry_______________________________________________________
black_ind <-black_ind %>%
  mutate(race = 'Black')
sorted_black <- order(black_ind$freq, decreasing = TRUE)
top_3_black <- head(black_ind[sorted_black, ], 3)

#_____First_join_______________________________________________________
join_1 <- join(top_3_black,top_3_white, by=c('race'), type='full')

#____Asian_df_industry_______________________________________________________
asin_ind <-asin_ind %>%
  mutate(race = 'Asian')
sorted_asin <- order(asin_ind$freq, decreasing = TRUE)
top_3_asin <- head(asin_ind[sorted_asin, ], 3)

#____Indian_df_industry______________________________________________________
Indian_ind <-Indian_ind %>%
  mutate(race = 'Amer Indian Aleut or Eskimo')
sorted_ind <- order(Indian_ind$freq, decreasing = TRUE)
top_3_ind <- head(Indian_ind[sorted_ind, ], 3)

#___2_3_join_________________________________________________________________
join_2 <- join(top_3_asin,top_3_ind, by=c('race'), type='full')
join_3 <- join(join_1, join_2, by=c('race'), type='full')

#____latino_df_industry______________________________________________________
mex_ind <-mex_ind %>%
  mutate(race = 'Latino')
sorted_m <- order(mex_ind$freq, decreasing = TRUE)
top_3_m <- head(mex_ind[sorted_m, ], 3)

#____last_join_df_industry___________________________________________________
join_4 <-  join(join_3, top_3_m, by=c('race'), type='full')

#-----Graphing---------------------------------------------------------------
library(viridis)
library(hrbrthemes)
library(plyr)
library(ggplot2)

ggplot(join_4, aes(fill=race, y=freq, x=x)) + 
  geom_bar(position="fill", stat="identity") +
  scale_fill_viridis(discrete = T) +
  labs(title = "Industry For Each Race",
       x = "Industry",
       y = "Frequency",
       fill = "Race")+
  theme_ipsum()+
  theme(axis.text.x = element_text(angle = 90))

Information: Stack Bi Bar Plot

In this chart we can see what percentage of the subgroups for industry. The Industry that were chosen where the top three industry of each race. One important thing to point out is “Not in universe or children” means unemployed. This is because we took out all of the underage migrants from the data so we need to assume that the variable means unemployed.

Notes from the visualization

  • Education industry is mostly made up of black migrants with a small percentage of Amer Indian Aleut or Eskimo migrants.
  • White migrants take up a large percentage of Manufacturing durable goods, “unemployed”, and Retail industry.
  • Latino migrants take up almost all of the Manufacturing non-durable goods.

Conclusion: Stack Bi Bar Plot

The plot tells us what percent of race is in each of the selected industry. By inspecting the graph we can now see the industry that make the migrants be considered below income. One thing to point out about White migrants, is that the reason a good percent of them are considered below income is due to a large portion of them being unemployed as seen in the plot. We also see the trends in industry and race.

Multiple Bar Chart

#==============TAX Graph=======================================================

#------Preparing---------------------------------------------------------------
#____White_tax_df___________________________________________________________
white_tax <-white_tax %>%
  mutate(race = 'White')
white_tax$percent <-  white_tax$freq/(sum(white_tax$freq))

#____black_tax_df___________________________________________________________
black_tac <-black_tac %>%
  mutate(race = 'Black')
black_tac$percent <-  black_tac$freq/(sum(black_tac$freq))

#____1_join__________________________________________________________________
join_tax1 <- join(white_tax,black_tac, by=c('race'), type='full')

#____asin_tax_df___________________________________________________________
asin_tax <-asin_tax %>%
  mutate(race = 'Asin')
asin_tax$percent <-  asin_tax$freq/(sum(asin_tax$freq))

#____Indian_tax_df___________________________________________________________
Indian_tax <-Indian_tax %>%
  mutate(race = 'Amer Indian Aleut or Eskimo')
Indian_tax$percent <-  Indian_tax$freq/(sum(Indian_tax$freq))

#____1_2_join________________________________________________________________
join_tax2 <- join(Indian_tax,asin_tax, by=c('race'), type='full')
join_tax3 <- join(join_tax1, join_tax2, by=c('race'), type='full')

#____mex_tax_df______________________________________________________________
mex_tax <-mex_tax %>%
  mutate(race = 'Latino')
mex_tax$percent <-  mex_tax$freq/(sum(mex_tax$freq))

#____4_join__________________________________________________________________
join_tax4 <-  join(join_tax3, mex_tax, by=c('race'), type='full')

#-----Graphing---------------------------------------------------------------
library(lubridate)
library(scales)
library(ggthemes)
library(RColorBrewer)

ggplot(join_tax4, aes(x= race, y = percent, fill=x))+
  geom_bar(stat = "identity", position="dodge")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(fill = "Tax Status",
       x = "Race",
       y = "Percent",
       title = "Tax Status For Race")+
  facet_wrap(~x, ncol= 4, nrow =2,) +
  theme(strip.text.x = element_text(size = 5))+
  theme(axis.text.x = element_text(angle = 90))

Information: Miltiple Bar Chart

In these chart we get to see the different taxes states of each races. The charts show use the percent of migrants of what tax status they fall into.

Tax Status key

  • Head of household: An unmarried taxpayer who has dependents and pays for more than half the costs of the home and everyone who resides there.
  • Joint both 65+: If they are married and both agree to file a joint return and above the above the age of 65.
  • Joint one under 65 & one 65+: One person in the party is above 65 and one is under 65, both party agree to file a joint return.
  • Nonfiler: Not filled
  • Single: This status is for taxpayers who are unmarried, divorced or legally separated under a divorce or separate maintenance decree governed by state law

Notes from the visualization

  • Most of the migrants fall under the “joint both under 65”
  • Black migrants have the highest percent of “Head of household” tax status
  • White migrants have a low but highest percent out of all the races in “join both 65+” tax status.

Conclusion: Multiple Bar Chart

In the charts we can see that most of the migrants tax status is Joint both under 65. We can conclude by this information that most of the migrants are married, non elderly, and split there taxes between themselves and their spouses.

Pie Charts

#==============Education Graph===============================================

#-------Preparing-----------------------------------------------------------_

#____Mex_edu_df______________________________________________________________
sorted_mex <- order(mex_educ$freq, decreasing = TRUE)
top_5_mex <- head(mex_educ[sorted_mex, ], 5)
top_5_mex <-top_5_mex %>%
  mutate(race = 'Latino')

#____Indian_edu_df___________________________________________________________
sorted_indian <- order(indian_educ$freq, decreasing = TRUE)
top_5_indian <- head(indian_educ[sorted_indian, ], 5)
top_5_indian <-top_5_indian %>%
  mutate(race = 'Amer Indian Aleut or Eskimo')

#____1_join__________________________________________________________________
join_edu1 <- join(top_5_mex,top_5_indian, by=c('race'), type='full')

#____black_edu_df____________________________________________________________
sorted_black <- order(black_educ$freq, decreasing = TRUE)
top_5_black <- head(black_educ[sorted_black, ], 5)
top_5_black <-top_5_black %>%
  mutate(race = 'Black')

#____white_edu_df____________________________________________________________
sorted_white <- order(white_edu$freq, decreasing = TRUE)
top_5_white <- head(white_edu[sorted_white, ], 5)
top_5_white <-top_5_white %>%
  mutate(race = 'White')

#____2_3_join______________________________________________________________
join_edu2 <- join(top_5_black,top_5_white, by=c('race'), type='full')
join_edu3 <- join(join_edu2,join_edu1, by=c('race'), type='full')

#____Asin_edu_df_____________________________________________________________
sorted_asin <- order(asin_edu$freq, decreasing = TRUE)
top_5_asin <- head(asin_edu[sorted_asin, ], 5)
top_5_asin <-top_5_asin %>%
  mutate(race = 'Asian')

#____4_join_________________________________________________________________
join_edu4 <- join(join_edu3,top_5_asin, by=c('race'), type='full')

#------ploting---------------------------------------------------------------

library(plotly)

plot_ly() %>%
  add_pie(data=join_edu4[join_edu4$race == 'Asian',], labels = ~x, values =~freq,
          name = "Asian", title = "Asian Education", domain=list(row=0, column=0)) %>%
  add_pie(data=join_edu4[join_edu4$race == 'Black',], labels = ~x, values =~freq,
          name = "Black", title = "Black Education", domain=list(row=0, column=1)) %>%
  add_pie(data=join_edu4[join_edu4$race == 'Amer Indian Aleut or Eskimo',], labels = ~x, values =~freq,
          name = "Amer Indian Aleut or Eskimo", title = "Amer Indian Aleut or Eskimo Education", domain=list(row=1, column=0)) %>%
  add_pie(data=join_edu4[join_edu4$race == 'White',], labels = ~x, values =~freq,
          name = "White", title = "White Education", domain=list(row=1, column=1)) %>%
  add_pie(data=join_edu4[join_edu4$race == 'Latino',], labels = ~x, values =~freq,
          name = "Latino", title = "Latino Education", domain=list(row=1, column=2)) %>%
  layout(title = "Race's Education", showlegend = TRUE,
         grid =list(rows=2, columns=3))

Information: Pie Charts

In the pie charts we can see the education experience of each race. The education variables that were selected are the top 5 education of each race.

Notes from the visualization

  • The highest percent education of education for each race is “High school graduate”
  • Asian migrants has the highest percent of migrants with a bachelors degree
  • Latino and White migrants have the highest percent of migrants with just a 7th or 8th grade of education.

Conclusion: Pie Charts

With the information of education of each race given by the pie charts, we can see how educated the migrants are. We can now see that the most educated sub group are Asian and White migrants, which is correlated with why they have a lower below income percent then the other 3 races.

Horizontal Stack Bar Chart

#==========stacked bar plot==================================================

#_____Preparing_______________________________________________________________

black_wage <-black_df %>%
  mutate(race2 = 'Black')

white_wage <-white_df %>%
  mutate(race2 = 'White')

join_heat1 <- join(white_wage,black_wage, by=c('race2'), type='full')

asin_wage <-asin_df %>%
  mutate(race2 = 'Asian')


Indian_wage <-Indian_df %>%
  mutate(race2 = 'Amer Indian Aleut or Eskimo')

join_heat2 <- join(asin_wage,Indian_wage, by=c('race2'), type='full')
join_heat3 <- join(join_heat1,join_heat2, by=c('race2'), type='full')

mex_wage <-mex_df1 %>%
  mutate(race2 = 'Latino')

join_heat4 <- join(join_heat3,mex_wage, by=c('race2'), type='full')


#____Selcting the varibales I need___________________________________________
heat_df <- select(join_heat4, race2, industry_code_main, working_week_per_year)

#___Transform_Due_to_filtiering_issues_______________________________________
transformed_heat<- transform(heat_df,
                             industry_code_main = as.numeric(as.factor(industry_code_main)))

#_______Getting the industry variables_____________________________

clean_heat <- transformed_heat[transformed_heat$industry_code_main %in% c(12, 20,11,6), ]

table_data <- table(clean_heat$race2, clean_heat$industry_code_main,clean_heat$working_week_per_year)
percentage_table <- prop.table(table_data, margin = 1) * 100
tabel <- data.frame(percentage_table)

tabel_clean <- tabel[tabel$Freq != 0.000000000, ]
tabel_clean2 <- tabel_clean[tabel$Var3 != 0, ]
table_clean3<- na.omit(tabel_clean2)

#-----Graphing---------------------------------------------------------------

ggplot(table_clean3, aes(x= reorder(Var2,Freq,sum), y=Freq, fill = Var1)) +
  geom_bar(stat = "identity")+
  coord_flip() +
  labs(title = "Working Weeks Per Year by Industry",
       x = "Industry",
       y = "Working WeeksPer Year",
       fill = "Race") +
  theme_hc()

Key: Horizontal Stack Bar Chart

  • 20: Retail
  • 6: Education
  • 11: Manufacturing durable goods
  • 12: Manufacturing non-durable goods

Information: Horizontal Stack Bar Chart

This chart tells us how many weeks per year a migrant works in a industry. The reason for this graph is to see how much a migrant works in a industry. This graph also shows the amount hours the each subset works in the said industry.

Notes from the visualization

  • Retail works the most out of any other industry and evenly spared threw the sub-groups.
  • Manufacturing non-durable goods works the least.

Conclusion: Horizontal Stack Bar Chart

This chart in particular shows us some very interesting findings. Going back to the bi graph we see that the races that took up most of the industry does not work the most. For example in Manufacturing non-durable goods seems to be almost all Latino migrants in the bi-chart, but they do not consist of all of the weeks per year, it is a even spread. This is not a bad thing because it shows that there in not discrimination with race in the industry.

Research Conclusion

With the help of visualization we can see a better gimps into the migrants background and why they are below the income. By looking at the industry we can see the trends in what industry each race dominates. We get information with the education level and tax status of each race. Then final we see which industry works the most.

Final Finding

  • Low education is a driving factor to unemployment and low income.
  • Most of the migrants are married young with a joint tax status.
  • Work is evenly distributed between the race, even if a race dominates a industry.
  • Based in your race you are more likely to work in a certain industry.