The data set I will be using for my analysis is income-prediction-data set-us-20th-century-data found on Kaggle. I will be using two csv files from the data set and the names of the files are Train and Test. The train file contains 200,000 randomly selected immigrants with 43 columns. The test file contains 100,000 immigrants with 42 columns. The contents of the two files are immigrates coming to America from 1994 to 1995.
Data set: income-prediction-data set-us-20th-century-data
The goal for the anylist is to find out more about the migrants that were coming to america in 1994 to 1995. The variables that I will be focusing on to figure this out are in red in Key Variables. The reasoning for the choosing variables is to focus on what the migrants bring to the work force and a gimps into the background of the migrants.
library(data.table)
library(plyr)
Train <- fread("/Users/colemcnulty/Desktop/Income Prediction./Train.csv")
Test <- fread("/Users/colemcnulty/Desktop/Income Prediction./Test.csv")
#______________Main Data frame________________________________________
df1 <- join(Train,Test, by=c('ID'), type='full')
df2 <- df1[df1$age >= 18, ]
df3 <- transform(df2,
income_above_limit = as.numeric(as.factor(income_above_limit)))
df_3 <- na.omit(df3)
#__________Getting Hispanic seperated____________________________________
mex_df <- transform(df_3,
is_hispanic = as.numeric(as.factor(is_hispanic)))
mex_df1 <- mex_df[mex_df$is_hispanic != 1, ]
other_race <- mex_df[mex_df$is_hispanic == 1, ]
rac_numer <- transform(other_race,
race = as.numeric(as.factor(race)))
#--------Frist Plot----------------------------------------------------
white_df <- rac_numer[rac_numer$race == 5, ]
black_df <- rac_numer[rac_numer$race == 3,]
asin_df <- rac_numer[rac_numer$race == 2,]
Indian_df <- rac_numer[rac_numer$race == 1,]
#---------Second plot---------------------------------------------------
white_ind <- count(white_df$industry_code_main)
black_ind <- count(black_df$industry_code_main)
asin_ind <- count(asin_df$industry_code_main)
Indian_ind <- count(Indian_df$industry_code_main)
mex_ind <- count(mex_df1$industry_code_main)
#--------Third plot-----------------------------------------------------
white_tax <- count(white_df$tax_status)
black_tac <- count(black_df$tax_status)
asin_tax <- count(asin_df$tax_status)
Indian_tax <- count(Indian_df$tax_status)
mex_tax <- count(mex_df1$tax_status)
#--------Fourth Graph---------------------------------------------------
mex_educ <- count(mex_df1$education)
indian_educ <- count(Indian_df$education)
black_educ <- count(black_df$education)
white_edu <- count(white_df$education)
asin_edu <- count(asin_df$education)
In the code above you can see my cleaning process for the data I want to represent. In my main data frame I only want migrants who are above the age of 18. This is due to there being many children in my data set that will not contribute for to the work forces. I will also be making subset of migrants to compare the subset with each other.
The subsets will be determined by the race of the migrant and there will be four subset: “White”, “Black”, “Asian”, and “Amer Indian Aleut or Eskimo”.
#_______________Hard Code_______________
first_graph <- data.frame(
race = c("White", "Black", "Asian","Amer Indian Aleut or Eskimo", "Latino"),
percent = c(90, 95.85, 89, 96, 96.5 )
)
#____________First Graph_______________________
library(ggplot2)
ggplot(first_graph, aes(x = reorder(race, -percent), y=percent)) +
geom_bar(colour = "black", fill ="blue", stat = "identity") +
labs(title = "Percent of Immigrants Below $50k A Year by Race", x = "Race", y = "Percent") +
geom_text(aes(label = paste0(percent,"%")), vjust = -.5, size = 3.2)
In the Bar chart above it shows Immigrant by race and what percent of them are making below income (less the 50k a year).
Notes from the visualization
We see from the chart that most of the migrants are making below 50k a year. But White and Asian Immigrants have the lowest percent of migrants below 50k a year. While Black, Latino, Amer Indian Aleut or Eskimo have the highest percent of migrants below 50k a year.
library(dplyr)
#===============BI CHART=======================================================
#----Preparing Data----------------------------------------------------------
#____White_df_industry_______________________________________________________
white_ind <-white_ind %>%
mutate(race = 'White')
sorted_wind <- order(white_ind$freq, decreasing = TRUE)
top_3_white <- head(white_ind[sorted_wind, ], 3)
#____Black_df_industry_______________________________________________________
black_ind <-black_ind %>%
mutate(race = 'Black')
sorted_black <- order(black_ind$freq, decreasing = TRUE)
top_3_black <- head(black_ind[sorted_black, ], 3)
#_____First_join_______________________________________________________
join_1 <- join(top_3_black,top_3_white, by=c('race'), type='full')
#____Asian_df_industry_______________________________________________________
asin_ind <-asin_ind %>%
mutate(race = 'Asian')
sorted_asin <- order(asin_ind$freq, decreasing = TRUE)
top_3_asin <- head(asin_ind[sorted_asin, ], 3)
#____Indian_df_industry______________________________________________________
Indian_ind <-Indian_ind %>%
mutate(race = 'Amer Indian Aleut or Eskimo')
sorted_ind <- order(Indian_ind$freq, decreasing = TRUE)
top_3_ind <- head(Indian_ind[sorted_ind, ], 3)
#___2_3_join_________________________________________________________________
join_2 <- join(top_3_asin,top_3_ind, by=c('race'), type='full')
join_3 <- join(join_1, join_2, by=c('race'), type='full')
#____latino_df_industry______________________________________________________
mex_ind <-mex_ind %>%
mutate(race = 'Latino')
sorted_m <- order(mex_ind$freq, decreasing = TRUE)
top_3_m <- head(mex_ind[sorted_m, ], 3)
#____last_join_df_industry___________________________________________________
join_4 <- join(join_3, top_3_m, by=c('race'), type='full')
#-----Graphing---------------------------------------------------------------
library(viridis)
library(hrbrthemes)
library(plyr)
library(ggplot2)
ggplot(join_4, aes(fill=race, y=freq, x=x)) +
geom_bar(position="fill", stat="identity") +
scale_fill_viridis(discrete = T) +
labs(title = "Industry For Each Race",
x = "Industry",
y = "Frequency",
fill = "Race")+
theme_ipsum()+
theme(axis.text.x = element_text(angle = 90))
In this chart we can see what percentage of the subgroups for industry. The Industry that were chosen where the top three industry of each race. One important thing to point out is “Not in universe or children” means unemployed. This is because we took out all of the underage migrants from the data so we need to assume that the variable means unemployed.
Notes from the visualization
The plot tells us what percent of race is in each of the selected industry. By inspecting the graph we can now see the industry that make the migrants be considered below income. One thing to point out about White migrants, is that the reason a good percent of them are considered below income is due to a large portion of them being unemployed as seen in the plot. We also see the trends in industry and race.
#==============TAX Graph=======================================================
#------Preparing---------------------------------------------------------------
#____White_tax_df___________________________________________________________
white_tax <-white_tax %>%
mutate(race = 'White')
white_tax$percent <- white_tax$freq/(sum(white_tax$freq))
#____black_tax_df___________________________________________________________
black_tac <-black_tac %>%
mutate(race = 'Black')
black_tac$percent <- black_tac$freq/(sum(black_tac$freq))
#____1_join__________________________________________________________________
join_tax1 <- join(white_tax,black_tac, by=c('race'), type='full')
#____asin_tax_df___________________________________________________________
asin_tax <-asin_tax %>%
mutate(race = 'Asin')
asin_tax$percent <- asin_tax$freq/(sum(asin_tax$freq))
#____Indian_tax_df___________________________________________________________
Indian_tax <-Indian_tax %>%
mutate(race = 'Amer Indian Aleut or Eskimo')
Indian_tax$percent <- Indian_tax$freq/(sum(Indian_tax$freq))
#____1_2_join________________________________________________________________
join_tax2 <- join(Indian_tax,asin_tax, by=c('race'), type='full')
join_tax3 <- join(join_tax1, join_tax2, by=c('race'), type='full')
#____mex_tax_df______________________________________________________________
mex_tax <-mex_tax %>%
mutate(race = 'Latino')
mex_tax$percent <- mex_tax$freq/(sum(mex_tax$freq))
#____4_join__________________________________________________________________
join_tax4 <- join(join_tax3, mex_tax, by=c('race'), type='full')
#-----Graphing---------------------------------------------------------------
library(lubridate)
library(scales)
library(ggthemes)
library(RColorBrewer)
ggplot(join_tax4, aes(x= race, y = percent, fill=x))+
geom_bar(stat = "identity", position="dodge")+
theme_light()+
theme(plot.title = element_text(hjust = 0.5))+
scale_y_continuous(labels = comma) +
scale_fill_brewer(palette = "Set2") +
labs(fill = "Tax Status",
x = "Race",
y = "Percent",
title = "Tax Status For Race")+
facet_wrap(~x, ncol= 4, nrow =2,) +
theme(strip.text.x = element_text(size = 5))+
theme(axis.text.x = element_text(angle = 90))
In these chart we get to see the different taxes states of each races. The charts show use the percent of migrants of what tax status they fall into.
Notes from the visualization
In the charts we can see that most of the migrants tax status is Joint both under 65. We can conclude by this information that most of the migrants are married, non elderly, and split there taxes between themselves and their spouses.
#==============Education Graph===============================================
#-------Preparing-----------------------------------------------------------_
#____Mex_edu_df______________________________________________________________
sorted_mex <- order(mex_educ$freq, decreasing = TRUE)
top_5_mex <- head(mex_educ[sorted_mex, ], 5)
top_5_mex <-top_5_mex %>%
mutate(race = 'Latino')
#____Indian_edu_df___________________________________________________________
sorted_indian <- order(indian_educ$freq, decreasing = TRUE)
top_5_indian <- head(indian_educ[sorted_indian, ], 5)
top_5_indian <-top_5_indian %>%
mutate(race = 'Amer Indian Aleut or Eskimo')
#____1_join__________________________________________________________________
join_edu1 <- join(top_5_mex,top_5_indian, by=c('race'), type='full')
#____black_edu_df____________________________________________________________
sorted_black <- order(black_educ$freq, decreasing = TRUE)
top_5_black <- head(black_educ[sorted_black, ], 5)
top_5_black <-top_5_black %>%
mutate(race = 'Black')
#____white_edu_df____________________________________________________________
sorted_white <- order(white_edu$freq, decreasing = TRUE)
top_5_white <- head(white_edu[sorted_white, ], 5)
top_5_white <-top_5_white %>%
mutate(race = 'White')
#____2_3_join______________________________________________________________
join_edu2 <- join(top_5_black,top_5_white, by=c('race'), type='full')
join_edu3 <- join(join_edu2,join_edu1, by=c('race'), type='full')
#____Asin_edu_df_____________________________________________________________
sorted_asin <- order(asin_edu$freq, decreasing = TRUE)
top_5_asin <- head(asin_edu[sorted_asin, ], 5)
top_5_asin <-top_5_asin %>%
mutate(race = 'Asian')
#____4_join_________________________________________________________________
join_edu4 <- join(join_edu3,top_5_asin, by=c('race'), type='full')
#------ploting---------------------------------------------------------------
library(plotly)
plot_ly() %>%
add_pie(data=join_edu4[join_edu4$race == 'Asian',], labels = ~x, values =~freq,
name = "Asian", title = "Asian Education", domain=list(row=0, column=0)) %>%
add_pie(data=join_edu4[join_edu4$race == 'Black',], labels = ~x, values =~freq,
name = "Black", title = "Black Education", domain=list(row=0, column=1)) %>%
add_pie(data=join_edu4[join_edu4$race == 'Amer Indian Aleut or Eskimo',], labels = ~x, values =~freq,
name = "Amer Indian Aleut or Eskimo", title = "Amer Indian Aleut or Eskimo Education", domain=list(row=1, column=0)) %>%
add_pie(data=join_edu4[join_edu4$race == 'White',], labels = ~x, values =~freq,
name = "White", title = "White Education", domain=list(row=1, column=1)) %>%
add_pie(data=join_edu4[join_edu4$race == 'Latino',], labels = ~x, values =~freq,
name = "Latino", title = "Latino Education", domain=list(row=1, column=2)) %>%
layout(title = "Race's Education", showlegend = TRUE,
grid =list(rows=2, columns=3))
In the pie charts we can see the education experience of each race. The education variables that were selected are the top 5 education of each race.
Notes from the visualization
With the information of education of each race given by the pie charts, we can see how educated the migrants are. We can now see that the most educated sub group are Asian and White migrants, which is correlated with why they have a lower below income percent then the other 3 races.
#==========stacked bar plot==================================================
#_____Preparing_______________________________________________________________
black_wage <-black_df %>%
mutate(race2 = 'Black')
white_wage <-white_df %>%
mutate(race2 = 'White')
join_heat1 <- join(white_wage,black_wage, by=c('race2'), type='full')
asin_wage <-asin_df %>%
mutate(race2 = 'Asian')
Indian_wage <-Indian_df %>%
mutate(race2 = 'Amer Indian Aleut or Eskimo')
join_heat2 <- join(asin_wage,Indian_wage, by=c('race2'), type='full')
join_heat3 <- join(join_heat1,join_heat2, by=c('race2'), type='full')
mex_wage <-mex_df1 %>%
mutate(race2 = 'Latino')
join_heat4 <- join(join_heat3,mex_wage, by=c('race2'), type='full')
#____Selcting the varibales I need___________________________________________
heat_df <- select(join_heat4, race2, industry_code_main, working_week_per_year)
#___Transform_Due_to_filtiering_issues_______________________________________
transformed_heat<- transform(heat_df,
industry_code_main = as.numeric(as.factor(industry_code_main)))
#_______Getting the industry variables_____________________________
clean_heat <- transformed_heat[transformed_heat$industry_code_main %in% c(12, 20,11,6), ]
table_data <- table(clean_heat$race2, clean_heat$industry_code_main,clean_heat$working_week_per_year)
percentage_table <- prop.table(table_data, margin = 1) * 100
tabel <- data.frame(percentage_table)
tabel_clean <- tabel[tabel$Freq != 0.000000000, ]
tabel_clean2 <- tabel_clean[tabel$Var3 != 0, ]
table_clean3<- na.omit(tabel_clean2)
#-----Graphing---------------------------------------------------------------
ggplot(table_clean3, aes(x= reorder(Var2,Freq,sum), y=Freq, fill = Var1)) +
geom_bar(stat = "identity")+
coord_flip() +
labs(title = "Working Weeks Per Year by Industry",
x = "Industry",
y = "Working WeeksPer Year",
fill = "Race") +
theme_hc()
This chart tells us how many weeks per year a migrant works in a industry. The reason for this graph is to see how much a migrant works in a industry. This graph also shows the amount hours the each subset works in the said industry.
Notes from the visualization
This chart in particular shows us some very interesting findings. Going back to the bi graph we see that the races that took up most of the industry does not work the most. For example in Manufacturing non-durable goods seems to be almost all Latino migrants in the bi-chart, but they do not consist of all of the weeks per year, it is a even spread. This is not a bad thing because it shows that there in not discrimination with race in the industry.
With the help of visualization we can see a better gimps into the migrants background and why they are below the income. By looking at the industry we can see the trends in what industry each race dominates. We get information with the education level and tax status of each race. Then final we see which industry works the most.