This data set is published by the United Nations and reports female migrants as a percentage of the international migrant stock. It compares this share across several Asian countries or areas of destination and shows how it changed from 1990 to 2017.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
Immigration<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Immigration%20Data.csv")
Immigration
## Sort.order Major.area..region..country.or.area.of.destination Code
## 1 1 China 156
## 2 2 China, Hong Kong SAR 344
## 3 3 China, Macao SAR 446
## 4 4 Dem. People's Republic of Korea 408
## 5 5 Japan 392
## 6 6 Mongolia 496
## 7 7 Republic of Korea 410
## Type.of.data..a. X1990 X1995 X2000 X2005 X2010 X2015 X2017
## 1 C 49.0 49.6 50.0 44.0 40.5 38.6 38.6
## 2 B 49.3 51.9 54.1 56.4 58.7 60.5 60.5
## 3 B 53.0 53.9 54.6 53.3 54.0 54.6 54.6
## 4 I 49.0 50.0 51.0 50.7 50.5 50.2 50.2
## 5 C 49.1 51.3 52.7 54.1 55.0 55.0 55.0
## 6 C 49.1 46.7 44.4 34.0 26.0 27.0 27.0
## 7 C 43.7 42.9 41.4 41.5 42.9 43.9 43.9
Some headers are very long and need to be simplified. At the same time, we remove the X that read.csv() prepends to each year.
cols <- c("Order","County","Code","Type","1990","1995","2000","2005","2010","2015","2017")
colnames(Immigration) <- cols
Immigration
## Order County Code Type 1990 1995 2000 2005 2010
## 1 1 China 156 C 49.0 49.6 50.0 44.0 40.5
## 2 2 China, Hong Kong SAR 344 B 49.3 51.9 54.1 56.4 58.7
## 3 3 China, Macao SAR 446 B 53.0 53.9 54.6 53.3 54.0
## 4 4 Dem. People's Republic of Korea 408 I 49.0 50.0 51.0 50.7 50.5
## 5 5 Japan 392 C 49.1 51.3 52.7 54.1 55.0
## 6 6 Mongolia 496 C 49.1 46.7 44.4 34.0 26.0
## 7 7 Republic of Korea 410 C 43.7 42.9 41.4 41.5 42.9
## 2015 2017
## 1 38.6 38.6
## 2 60.5 60.5
## 3 54.6 54.6
## 4 50.2 50.2
## 5 55.0 55.0
## 6 27.0 27.0
## 7 43.9 43.9
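As an alternative to retyping every header, the X prefix could be stripped with a pattern. The following is only a minimal sketch on a toy data frame (the name df is hypothetical; the values are copied from the first two rows of the table above), not the method used in this report.

```r
# Toy data frame whose names mimic what read.csv() produces for year headers.
df <- data.frame(
  Code  = c(156, 344),
  X1990 = c(49.0, 49.3),
  X1995 = c(49.6, 51.9)
)

# Drop a leading "X" only when it is followed by exactly four digits,
# so other column names are left untouched.
names(df) <- sub("^X([0-9]{4})$", "\\1", names(df))

names(df)
```

read.csv() also accepts check.names = FALSE, which keeps the headers exactly as they appear in the file, so the X is never added in the first place.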
The data set stores the year values as column headers. This wide layout makes it hard to compare countries over time, so we gather the year columns into a single Year column with the matching values in Percentage.
Immigration<-gather(Immigration,"Year","Percentage",5:11)
head(Immigration)
## Order County Code Type Year Percentage
## 1 1 China 156 C 1990 49.0
## 2 2 China, Hong Kong SAR 344 B 1990 49.3
## 3 3 China, Macao SAR 446 B 1990 53.0
## 4 4 Dem. People's Republic of Korea 408 I 1990 49.0
## 5 5 Japan 392 C 1990 49.1
## 6 6 Mongolia 496 C 1990 49.1
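The same reshaping can be sketched with pivot_longer(), the newer tidyr function for this job. The example below is a hedged illustration on a two-row toy table (the names wide and long are hypothetical; the values come from the Japan and Mongolia rows above). It also converts Year to an integer so that it can be used as a numeric axis later.

```r
library(tidyr)

# Two countries and two years, in the same wide layout as the original data.
wide <- data.frame(
  County = c("Japan", "Mongolia"),
  `1990` = c(49.1, 49.1),
  `1995` = c(51.3, 46.7),
  check.names = FALSE
)

# Move every column except County into Year/Percentage pairs,
# converting the year labels from character to integer on the way.
long <- pivot_longer(
  wide,
  cols = -County,
  names_to = "Year",
  values_to = "Percentage",
  names_transform = list(Year = as.integer)
)

long
```

The result has one row per country and year, which is the same shape that gather() produces above.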
We calculate the average female migrant share for each Asian country or area and look for the one with the highest share.
Immigration2<-Immigration%>%
group_by(County)%>%
mutate(mean=mean(Percentage))%>%
arrange(Year)
Immigration2
## # A tibble: 49 x 7
## # Groups: County [7]
## Order County Code Type Year Percentage mean
## <int> <fct> <int> <fct> <chr> <dbl> <dbl>
## 1 1 China 156 C 1990 49 44.3
## 2 2 China, Hong Kong SAR 344 B 1990 49.3 55.9
## 3 3 China, Macao SAR 446 B 1990 53 54
## 4 4 Dem. People's Republic of Korea 408 I 1990 49 50.2
## 5 5 Japan 392 C 1990 49.1 53.2
## 6 6 Mongolia 496 C 1990 49.1 36.3
## 7 7 Republic of Korea 410 C 1990 43.7 42.9
## 8 1 China 156 C 1995 49.6 44.3
## 9 2 China, Hong Kong SAR 344 B 1995 51.9 55.9
## 10 3 China, Macao SAR 446 B 1995 53.9 54
## # ... with 39 more rows
ggplot(Immigration2, aes(x=County, y=mean, fill=County)) +
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
coord_flip() +
ggtitle("Female migrants as a percentage of the international migrant stock") +
xlab("Country or area") + ylab("Average percentage of the international migrant stock")
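Because mutate() keeps all 49 rows, each bar in the chart above is drawn from seven identical rows per country. A possible alternative, sketched here under the assumption that only the averages are needed, is summarise(), which returns one row per country. The data frame names are hypothetical and the values are copied from the Hong Kong and Mongolia rows of the table.

```r
library(dplyr)

# Long-format data for two of the seven countries or areas.
long <- data.frame(
  County = rep(c("China, Hong Kong SAR", "Mongolia"), each = 7),
  Year = rep(c(1990, 1995, 2000, 2005, 2010, 2015, 2017), times = 2),
  Percentage = c(49.3, 51.9, 54.1, 56.4, 58.7, 60.5, 60.5,
                 49.1, 46.7, 44.4, 34.0, 26.0, 27.0, 27.0)
)

# One row per country with its average share, highest first.
country_means <- long %>%
  group_by(County) %>%
  summarise(mean = mean(Percentage)) %>%
  arrange(desc(mean))

country_means
```

A summary table like country_means can be passed to ggplot() in place of Immigration2 without changing the rest of the plotting code.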
ggplot(Immigration, aes(x=Year, y=Percentage, group=1, color=County)) +
geom_line() +
facet_wrap(~County, scales = "free_x") +
ggtitle("Change of female migrants as a percentage of the international migrant stock") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
As the bar chart shows, China, Hong Kong SAR has the highest average female share, 55.91%, which means that women make up more than half of the international migrants living in Hong Kong. The data are organized by area of destination, so the figure describes migrants who have arrived in an area, not residents who have left it. In contrast, Mongolia's average is only 36.31%. The yearly line charts show that Hong Kong's share rose in every period from 1990 to 2015, while Mongolia's fell sharply from 49.1% in 1990 to 26.0% in 2010 before leveling off at 27.0%.
This table comes from the Pew Research Center and explores the relationship between income and religious tradition.
Income<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Income%20distribution2.csv")
Income
## Religious.tradition Less.than..30.000 X.30.000..49.999
## 1 Buddhist 0.36 0.18
## 2 Catholic 0.36 0.19
## 3 Evangelical Protestant 0.35 0.22
## 4 Hindu 0.17 0.13
## 5 Historically Black Protestant 0.53 0.22
## 6 Jehovah's Witness 0.48 0.25
## 7 Jewish 0.16 0.15
## 8 Mainline Protestant 0.29 0.20
## 9 Mormon 0.27 0.20
## 10 Muslim 0.34 0.17
## 11 Orthodox Christian 0.18 0.17
## 12 Unaffiliated (religious "nones") 0.33 0.20
## X.50.000..99.999 X.100.000.or.more Sample.Size
## 1 0.32 0.13 233
## 2 0.26 0.19 6137
## 3 0.28 0.14 7462
## 4 0.34 0.36 172
## 5 0.17 0.08 1704
## 6 0.22 0.04 208
## 7 0.24 0.44 708
## 8 0.28 0.23 5208
## 9 0.33 0.20 594
## 10 0.29 0.20 205
## 11 0.36 0.29 155
## 12 0.26 0.21 6790
I simplify the headers and remove the X that read.csv() added to the income columns.
cols <- c("Religious","Less_than_$30,000","$30,000-$49,999","$50,000-$99,999","$100,000_or_more","Sample_Size")
colnames(Income) <- cols
Income
## Religious Less_than_$30,000 $30,000-$49,999
## 1 Buddhist 0.36 0.18
## 2 Catholic 0.36 0.19
## 3 Evangelical Protestant 0.35 0.22
## 4 Hindu 0.17 0.13
## 5 Historically Black Protestant 0.53 0.22
## 6 Jehovah's Witness 0.48 0.25
## 7 Jewish 0.16 0.15
## 8 Mainline Protestant 0.29 0.20
## 9 Mormon 0.27 0.20
## 10 Muslim 0.34 0.17
## 11 Orthodox Christian 0.18 0.17
## 12 Unaffiliated (religious "nones") 0.33 0.20
## $50,000-$99,999 $100,000_or_more Sample_Size
## 1 0.32 0.13 233
## 2 0.26 0.19 6137
## 3 0.28 0.14 7462
## 4 0.34 0.36 172
## 5 0.17 0.08 1704
## 6 0.22 0.04 208
## 7 0.24 0.44 708
## 8 0.28 0.23 5208
## 9 0.33 0.20 594
## 10 0.29 0.20 205
## 11 0.36 0.29 155
## 12 0.26 0.21 6790
The income ranges should not be column headers. I create a variable named Income and gather the range headers into it as values.
Income<-gather(Income,"Income","Percentage",2:5)
head(Income)
## Religious Sample_Size Income Percentage
## 1 Buddhist 233 Less_than_$30,000 0.36
## 2 Catholic 6137 Less_than_$30,000 0.36
## 3 Evangelical Protestant 7462 Less_than_$30,000 0.35
## 4 Hindu 172 Less_than_$30,000 0.17
## 5 Historically Black Protestant 1704 Less_than_$30,000 0.53
## 6 Jehovah's Witness 208 Less_than_$30,000 0.48
We select the lowest income band (less than $30,000) and the highest ($100,000 or more) to show how each is distributed across religious traditions.
Low_income<-Income%>%
filter(Income=="Less_than_$30,000")%>%
arrange(desc(Percentage))
Low_income
## Religious Sample_Size Income
## 1 Historically Black Protestant 1704 Less_than_$30,000
## 2 Jehovah's Witness 208 Less_than_$30,000
## 3 Buddhist 233 Less_than_$30,000
## 4 Catholic 6137 Less_than_$30,000
## 5 Evangelical Protestant 7462 Less_than_$30,000
## 6 Muslim 205 Less_than_$30,000
## 7 Unaffiliated (religious "nones") 6790 Less_than_$30,000
## 8 Mainline Protestant 5208 Less_than_$30,000
## 9 Mormon 594 Less_than_$30,000
## 10 Orthodox Christian 155 Less_than_$30,000
## 11 Hindu 172 Less_than_$30,000
## 12 Jewish 708 Less_than_$30,000
## Percentage
## 1 0.53
## 2 0.48
## 3 0.36
## 4 0.36
## 5 0.35
## 6 0.34
## 7 0.33
## 8 0.29
## 9 0.27
## 10 0.18
## 11 0.17
## 12 0.16
ggplot(Low_income, aes(x=Religious, y=Percentage, fill=Religious)) +
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
coord_flip() +
ggtitle("Low-income Distribution by Religious") +
xlab("Religious") + ylab("Low_Income_Percentage")
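Note that arrange() sorts the rows of the table but does not change the order in which ggplot2 draws a discrete axis. One way to make the bars follow the sorted order is reorder(), sketched below on three rows taken from the table (the names low and p are hypothetical, and geom_col() is used as a shorthand for geom_bar(stat="identity")).

```r
library(ggplot2)

# Three rows from the low-income table above.
low <- data.frame(
  Religious  = c("Jewish", "Historically Black Protestant", "Catholic"),
  Percentage = c(0.16, 0.53, 0.36)
)

# reorder() turns Religious into a factor whose levels follow Percentage,
# from the smallest share to the largest.
low$Religious <- reorder(low$Religious, low$Percentage)

# With coord_flip(), the first level is drawn at the bottom,
# so the largest share ends up at the top of the chart.
p <- ggplot(low, aes(x = Religious, y = Percentage)) +
  geom_col(colour = "black", width = 0.5) +
  coord_flip() +
  labs(x = "Religious tradition", y = "Share earning less than $30,000")

levels(low$Religious)
```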
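# Note: the name Low_income is reused here, but it now holds the high-income ($100,000 or more) subset.
Low_income<-Income%>%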
filter(Income=="$100,000_or_more")%>%
arrange(desc(Percentage))
Low_income
## Religious Sample_Size Income
## 1 Jewish 708 $100,000_or_more
## 2 Hindu 172 $100,000_or_more
## 3 Orthodox Christian 155 $100,000_or_more
## 4 Mainline Protestant 5208 $100,000_or_more
## 5 Unaffiliated (religious "nones") 6790 $100,000_or_more
## 6 Mormon 594 $100,000_or_more
## 7 Muslim 205 $100,000_or_more
## 8 Catholic 6137 $100,000_or_more
## 9 Evangelical Protestant 7462 $100,000_or_more
## 10 Buddhist 233 $100,000_or_more
## 11 Historically Black Protestant 1704 $100,000_or_more
## 12 Jehovah's Witness 208 $100,000_or_more
## Percentage
## 1 0.44
## 2 0.36
## 3 0.29
## 4 0.23
## 5 0.21
## 6 0.20
## 7 0.20
## 8 0.19
## 9 0.14
## 10 0.13
## 11 0.08
## 12 0.04
ggplot(Low_income, aes(x=Religious, y=Percentage, fill=Religious)) +
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
coord_flip() +
ggtitle("High-income Distribution by Religious") +
xlab("Religious") + ylab("High_Income_Percentage")
These data could inform governments that offer benefits to low-income groups. The charts show that 53% of Historically Black Protestants and 48% of Jehovah's Witnesses have incomes below $30,000, the largest low-income shares among the groups surveyed. Governments may need to do more to raise incomes in these groups. Another interesting finding is that 44% of Jewish respondents have incomes of $100,000 or more per year. What causes such a large difference is a question that sociologists have probably studied already.
This dataset shows counts of new PhDs in the mathematical sciences for 2008-09 and 2011-12 categorized by type of institution, gender, and US citizenship status.
The type variable is a factor with levels I(Pu) for group I public universities, I(Pr) for group I private universities, II and III for groups II and III, IV for statistics and biostatistics programs, and Va for applied mathematics programs.
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
math<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/AMSsurvey.csv")
math
## X type sex citizen count count11
## 1 1 I(Pu) Male US 132 148
## 2 2 I(Pu) Female US 35 40
## 3 3 I(Pr) Male US 87 63
## 4 4 I(Pr) Female US 20 22
## 5 5 II Male US 96 161
## 6 6 II Female US 47 53
## 7 7 III Male US 47 71
## 8 8 III Female US 32 28
## 9 9 IV Male US 71 89
## 10 10 IV Female US 54 55
## 11 11 Va Male US 34 42
## 12 12 Va Female US 14 21
## 13 13 I(Pu) Male Non-US 130 136
## 14 14 I(Pu) Female Non-US 29 32
## 15 15 I(Pr) Male Non-US 79 82
## 16 16 I(Pr) Female Non-US 25 26
## 17 17 II Male Non-US 89 116
## 18 18 II Female Non-US 50 56
## 19 19 III Male Non-US 53 61
## 20 20 III Female Non-US 39 30
## 21 21 IV Male Non-US 122 153
## 22 22 IV Female Non-US 105 115
## 23 23 Va Male Non-US 28 27
## 24 24 Va Female Non-US 12 17
The count column holds the counts for 2008-09, and count11 holds the counts for 2011-12.
cols <- c("Num","Type","Sex","Citizen","2008-09","2011-12")
colnames(math) <- cols
math
## Num Type Sex Citizen 2008-09 2011-12
## 1 1 I(Pu) Male US 132 148
## 2 2 I(Pu) Female US 35 40
## 3 3 I(Pr) Male US 87 63
## 4 4 I(Pr) Female US 20 22
## 5 5 II Male US 96 161
## 6 6 II Female US 47 53
## 7 7 III Male US 47 71
## 8 8 III Female US 32 28
## 9 9 IV Male US 71 89
## 10 10 IV Female US 54 55
## 11 11 Va Male US 34 42
## 12 12 Va Female US 14 21
## 13 13 I(Pu) Male Non-US 130 136
## 14 14 I(Pu) Female Non-US 29 32
## 15 15 I(Pr) Male Non-US 79 82
## 16 16 I(Pr) Female Non-US 25 26
## 17 17 II Male Non-US 89 116
## 18 18 II Female Non-US 50 56
## 19 19 III Male Non-US 53 61
## 20 20 III Female Non-US 39 30
## 21 21 IV Male Non-US 122 153
## 22 22 IV Female Non-US 105 115
## 23 23 Va Male Non-US 28 27
## 24 24 Va Female Non-US 12 17
The dataset has two columns, 2008-09 and 2011-12, but these are really values of a single variable, Year. We use gather() to turn the two columns into rows.
math<-gather(math,"Year","n",5:6)
#math <- math %>%
#gather(Year, Frequency, 5:6) %>%
#spread(Sex, Frequency)
head(math)
## Num Type Sex Citizen Year n
## 1 1 I(Pu) Male US 2008-09 132
## 2 2 I(Pu) Female US 2008-09 35
## 3 3 I(Pr) Male US 2008-09 87
## 4 4 I(Pr) Female US 2008-09 20
## 5 5 II Male US 2008-09 96
## 6 6 II Female US 2008-09 47
math2<-math%>%
group_by(Type,Sex)%>%
summarise(sum=sum(n))%>%
arrange(desc(sum))
math2
## # A tibble: 12 x 3
## # Groups: Type [6]
## Type Sex sum
## <fct> <fct> <int>
## 1 I(Pu) Male 546
## 2 II Male 462
## 3 IV Male 435
## 4 IV Female 329
## 5 I(Pr) Male 311
## 6 III Male 232
## 7 II Female 206
## 8 I(Pu) Female 136
## 9 Va Male 131
## 10 III Female 129
## 11 I(Pr) Female 93
## 12 Va Female 64
ggplot(data = math2, aes(x=Sex,y=sum))+
geom_bar(stat = 'identity',aes(fill=Type))+
geom_text(aes(x = Sex, y = sum,
label = paste(sum),
group = Sex,
vjust = -0.01)) +
labs(title = "PhD Distribution",
x = "Sex",
y = "Number of PhDs") +
facet_wrap(~Type, ncol = 8)+
theme_bw()
The bar chart compares the number of PhDs across types of institutions. In this sample, I(Pu), the group I public universities, produced the most male PhDs (546). Among women, however, group IV, the statistics and biostatistics programs, leads with 329 PhDs. If the sample is large enough, this suggests that female PhD students favor statistics and biostatistics programs.
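Raw counts depend on how large each group of institutions is, so the female share within each type may be a fairer comparison. The sketch below is only an illustration: the data frame sums and the column FemaleShare are hypothetical names, and the counts are copied from the math2 table above.

```r
# Male and female totals per institution type, taken from math2.
sums <- data.frame(
  Type   = c("I(Pu)", "I(Pr)", "II", "III", "IV", "Va"),
  Male   = c(546, 311, 462, 232, 435, 131),
  Female = c(136,  93, 206, 129, 329,  64)
)

# Female PhDs as a percentage of all PhDs within each type.
sums$FemaleShare <- round(100 * sums$Female / (sums$Female + sums$Male), 1)

# Show the types from the highest female share to the lowest.
sums[order(-sums$FemaleShare), ]
```

Under this measure group IV still ranks first, at about 43% women, which is consistent with the reading of the chart above.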
math3<-math%>%
group_by(Sex)%>%
summarise(sum=sum(n))
math3
## # A tibble: 2 x 2
## Sex sum
## <fct> <int>
## 1 Female 957
## 2 Male 2117
ggplot(math3, aes(x=Sex, y=sum, fill=Sex)) +
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.2) +
ggtitle("Female PhDs vs. Male PhDs") +
xlab("Sex") + ylab("Number of PhDs")
math4<-math%>%
group_by(Citizen)%>%
summarise(sum=sum(n))
math4
## # A tibble: 2 x 2
## Citizen sum
## <fct> <int>
## 1 Non-US 1612
## 2 US 1462
ggplot(math4, aes(x=Citizen, y=sum, fill=Citizen))+
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.2) +
ggtitle("US Citizen PhDs vs. Non-US Citizen PhDs") +
xlab("Citizen") + ylab("Number of PhDs")
math5<-math%>%
group_by(Year)%>%
summarise(sum=sum(n))
math5
## # A tibble: 2 x 2
## Year sum
## <chr> <int>
## 1 2008-09 1430
## 2 2011-12 1644
ggplot(math5, aes(x=Year, y=sum, fill=Year))+
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.2) +
ggtitle("The number of PhDs increased from 2008-09 to 2011-12") +
xlab("Year") + ylab("Number of PhDs")
We can analyze the data further by sex, citizenship, and survey year. From the tables, the female share of PhDs is 957/(957+2117) = 31.13%, and the non-US citizen share is 1612/(1612+1462) = 52.44%. Between 2008-09 and 2011-12, the total number of PhDs increased by (1644-1430)/1430 = 14.97%.
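The three percentages can be reproduced directly in R. This is a small sketch in which the totals are typed in from the math3, math4, and math5 tables; the variable names and the helper pct() are hypothetical.

```r
# Totals copied from the summary tables above.
female <- 957
male   <- 2117
non_us <- 1612
us     <- 1462
y0809  <- 1430
y1112  <- 1644

# Helper: express a proportion as a percentage with two decimals.
pct <- function(x) round(100 * x, 2)

female_share <- pct(female / (female + male))
non_us_share <- pct(non_us / (non_us + us))
growth       <- pct((y1112 - y0809) / y0809)

c(female_share, non_us_share, growth)
```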