607_project2

DataSet I -Immigration

This data set, published by the United Nations, reports female migrants as a percentage of the international migrant stock. It compares that percentage across several Asian countries or areas of destination and shows how it changed from 1990 to 2017.

Load the data into R

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(ggplot2)
Immigration<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Immigration%20Data.csv")
Immigration

##   Sort.order Major.area..region..country.or.area.of.destination Code
## 1          1                                              China  156
## 2          2                               China, Hong Kong SAR  344
## 3          3                                   China, Macao SAR  446
## 4          4                    Dem. People's Republic of Korea  408
## 5          5                                              Japan  392
## 6          6                                           Mongolia  496
## 7          7                                  Republic of Korea  410
##   Type.of.data..a. X1990 X1995 X2000 X2005 X2010 X2015 X2017
## 1                C  49.0  49.6  50.0  44.0  40.5  38.6  38.6
## 2                B  49.3  51.9  54.1  56.4  58.7  60.5  60.5
## 3                B  53.0  53.9  54.6  53.3  54.0  54.6  54.6
## 4                I  49.0  50.0  51.0  50.7  50.5  50.2  50.2
## 5                C  49.1  51.3  52.7  54.1  55.0  55.0  55.0
## 6                C  49.1  46.7  44.4  34.0  26.0  27.0  27.0
## 7                C  43.7  42.9  41.4  41.5  42.9  43.9  43.9

change the header

Some headers are very long and need to be simplified. At the same time, we remove the X prefix that read.csv() adds in front of each year.

cols <- c("Order","County","Code","Type","1990","1995","2000","2005","2010","2015","2017")
colnames(Immigration) <- cols
Immigration

##   Order                          County Code Type 1990 1995 2000 2005 2010
## 1     1                           China  156    C 49.0 49.6 50.0 44.0 40.5
## 2     2            China, Hong Kong SAR  344    B 49.3 51.9 54.1 56.4 58.7
## 3     3                China, Macao SAR  446    B 53.0 53.9 54.6 53.3 54.0
## 4     4 Dem. People's Republic of Korea  408    I 49.0 50.0 51.0 50.7 50.5
## 5     5                           Japan  392    C 49.1 51.3 52.7 54.1 55.0
## 6     6                        Mongolia  496    C 49.1 46.7 44.4 34.0 26.0
## 7     7               Republic of Korea  410    C 43.7 42.9 41.4 41.5 42.9
##   2015 2017
## 1 38.6 38.6
## 2 60.5 60.5
## 3 54.6 54.6
## 4 50.2 50.2
## 5 55.0 55.0
## 6 27.0 27.0
## 7 43.9 43.9
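As a hedged alternative to typing every header by hand, the year columns can also be cleaned programmatically. The sketch below uses a small inline CSV (an assumption standing in for the real file, with a hypothetical `Country` header) and shows two base R options: reading with `check.names = FALSE`, or stripping the leading `X` with `sub()`.

```r
# A tiny inline CSV standing in for the real file (values taken from the table above)
csv_text <- "Country,1990,1995
China,49.0,49.6
Japan,49.1,51.3"

# Option 1: keep the original headers by turning off name mangling
raw_names <- read.csv(text = csv_text, check.names = FALSE)
names(raw_names)

# Option 2: read with defaults (headers become X1990, X1995), then strip the X
mangled <- read.csv(text = csv_text)
names(mangled) <- sub("^X(\\d)", "\\1", names(mangled))
names(mangled)
```

Either option leaves the year headers as plain year strings, so only the long descriptive headers still need manual renaming.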

Data tidy

The data set uses year values as column headers. That makes it hard to compare the percentages by country across years, so we gather the year columns into a single Year column.

Immigration<-gather(Immigration,"Year","Percentage",5:11)
head(Immigration)

##   Order                          County Code Type Year Percentage
## 1     1                           China  156    C 1990       49.0
## 2     2            China, Hong Kong SAR  344    B 1990       49.3
## 3     3                China, Macao SAR  446    B 1990       53.0
## 4     4 Dem. People's Republic of Korea  408    I 1990       49.0
## 5     5                           Japan  392    C 1990       49.1
## 6     6                        Mongolia  496    C 1990       49.1
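For comparison, the same reshaping can be sketched with `pivot_longer()`, the newer tidyr function for this task. The example assumes a tidyr version that provides `pivot_longer()` and uses two rows copied from the table above rather than the full file. It also converts Year to an integer, a step the analysis above does not take.

```r
library(tidyr)
library(dplyr)

# Two rows from the wide table, with year columns named as after the rename step
wide <- data.frame(
  County = c("China", "Japan"),
  `1990` = c(49.0, 49.1),
  `1995` = c(49.6, 51.3),
  check.names = FALSE
)

long <- wide %>%
  pivot_longer(cols = -County, names_to = "Year", values_to = "Percentage") %>%
  mutate(Year = as.integer(Year))   # store the year as a number, not text

long
```

Note that `pivot_longer()` returns the rows grouped by country, while `gather()` returns them grouped by year. The content is the same.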

Data Analysis

We calculate the average female share of migrants for each Asian country or area and look for the one with the highest average.

Immigration2 <- Immigration %>%
  group_by(County) %>%
  mutate(mean = mean(Percentage)) %>%
  arrange(Year)
Immigration2

## # A tibble: 49 x 7
## # Groups:   County [7]
##    Order County                           Code Type  Year  Percentage  mean
##    <int> <fct>                           <int> <fct> <chr>      <dbl> <dbl>
##  1     1 China                             156 C     1990        49    44.3
##  2     2 China, Hong Kong SAR              344 B     1990        49.3  55.9
##  3     3 China, Macao SAR                  446 B     1990        53    54  
##  4     4 Dem. People's Republic of Korea   408 I     1990        49    50.2
##  5     5 Japan                             392 C     1990        49.1  53.2
##  6     6 Mongolia                          496 C     1990        49.1  36.3
##  7     7 Republic of Korea                 410 C     1990        43.7  42.9
##  8     1 China                             156 C     1995        49.6  44.3
##  9     2 China, Hong Kong SAR              344 B     1995        51.9  55.9
## 10     3 China, Macao SAR                  446 B     1995        53.9  54  
## # ... with 39 more rows

 ggplot(Immigration2, aes(x=County, y=mean, fill=County)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    coord_flip() + 
    ggtitle("Female migrants as a percentage of the international migrant stock") +
    xlab("County") + ylab("Average percentage of the international migrant stock")

ggplot(Immigration, aes(x = Year, y = Percentage, group = 1, color = County)) +
  geom_line() +
  facet_wrap(~County, scales = "free_x") +
  ggtitle("Change of female migrants as a percentage of the international migrant stock") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
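One possible refinement, offered only as a sketch: if Year is converted from text to a number, ggplot2 can draw a continuous time axis, so the x-axis labels need not be blanked out. The data frame `trend` and the axis labels are illustrative. The values are copied from the table above for two destinations.

```r
library(ggplot2)

# Two destinations, with Year stored as text, as it is after gather()
trend <- data.frame(
  County = rep(c("China, Hong Kong SAR", "Mongolia"), each = 7),
  Year = rep(c("1990", "1995", "2000", "2005", "2010", "2015", "2017"), times = 2),
  Percentage = c(49.3, 51.9, 54.1, 56.4, 58.7, 60.5, 60.5,
                 49.1, 46.7, 44.4, 34.0, 26.0, 27.0, 27.0)
)

# Convert the year labels to integers so the axis is continuous
trend$Year <- as.integer(trend$Year)

p <- ggplot(trend, aes(x = Year, y = Percentage, color = County)) +
  geom_line() +
  facet_wrap(~County) +
  labs(x = "Year", y = "Female migrants (% of migrant stock)")
p
```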

As the bar chart shows, China, Hong Kong SAR has the highest average share of female migrants, 55.91%. Because the data are organized by country or area of destination, this means that more than half of the international migrants living in Hong Kong are women. At the other extreme, Mongolia's average is only 36.31%. The yearly line charts show that Hong Kong's share rose in every period from 49.3% in 1990 to 60.5% in 2015, while Mongolia's fell sharply from 49.1% in 1990 to 26.0% in 2010 and has stayed near 27% since.
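The two averages quoted above can be reproduced with a short sketch. It uses `summarise()` rather than `mutate()`, which yields one row per destination instead of repeating the mean on every yearly row. The data frame `shares` is illustrative and holds only the Hong Kong and Mongolia values copied from the table.

```r
library(dplyr)

# Values copied from the table above for two destinations (1990 to 2017)
shares <- data.frame(
  County = rep(c("China, Hong Kong SAR", "Mongolia"), each = 7),
  Year = rep(c(1990, 1995, 2000, 2005, 2010, 2015, 2017), times = 2),
  Percentage = c(49.3, 51.9, 54.1, 56.4, 58.7, 60.5, 60.5,
                 49.1, 46.7, 44.4, 34.0, 26.0, 27.0, 27.0)
)

# One summary row per destination
avg <- shares %>%
  group_by(County) %>%
  summarise(mean = round(mean(Percentage), 2))

avg
```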

DataSet II -Income Distribution

This table comes from the Pew Research Center and explores the relationship between income and religious tradition.

Load the data to R

Income<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Income%20distribution2.csv")
Income

##                 Religious.tradition Less.than..30.000 X.30.000..49.999
## 1                          Buddhist              0.36             0.18
## 2                          Catholic              0.36             0.19
## 3            Evangelical Protestant              0.35             0.22
## 4                             Hindu              0.17             0.13
## 5     Historically Black Protestant              0.53             0.22
## 6                 Jehovah's Witness              0.48             0.25
## 7                            Jewish              0.16             0.15
## 8               Mainline Protestant              0.29             0.20
## 9                            Mormon              0.27             0.20
## 10                           Muslim              0.34             0.17
## 11               Orthodox Christian              0.18             0.17
## 12 Unaffiliated (religious "nones")              0.33             0.20
##    X.50.000..99.999 X.100.000.or.more Sample.Size
## 1              0.32              0.13         233
## 2              0.26              0.19        6137
## 3              0.28              0.14        7462
## 4              0.34              0.36         172
## 5              0.17              0.08        1704
## 6              0.22              0.04         208
## 7              0.24              0.44         708
## 8              0.28              0.23        5208
## 9              0.33              0.20         594
## 10             0.29              0.20         205
## 11             0.36              0.29         155
## 12             0.26              0.21        6790

change the header

We simplify the headers and remove the X prefix from the income columns.

cols <- c("Religious","Less_than_$30,000","$30,000-$49,999","$50,000-$99,999","$100,000_or_more","Sample_Size")
colnames(Income) <- cols
Income

##                           Religious Less_than_$30,000 $30,000-$49,999
## 1                          Buddhist              0.36            0.18
## 2                          Catholic              0.36            0.19
## 3            Evangelical Protestant              0.35            0.22
## 4                             Hindu              0.17            0.13
## 5     Historically Black Protestant              0.53            0.22
## 6                 Jehovah's Witness              0.48            0.25
## 7                            Jewish              0.16            0.15
## 8               Mainline Protestant              0.29            0.20
## 9                            Mormon              0.27            0.20
## 10                           Muslim              0.34            0.17
## 11               Orthodox Christian              0.18            0.17
## 12 Unaffiliated (religious "nones")              0.33            0.20
##    $50,000-$99,999 $100,000_or_more Sample_Size
## 1             0.32             0.13         233
## 2             0.26             0.19        6137
## 3             0.28             0.14        7462
## 4             0.34             0.36         172
## 5             0.17             0.08        1704
## 6             0.22             0.04         208
## 7             0.24             0.44         708
## 8             0.28             0.23        5208
## 9             0.33             0.20         594
## 10            0.29             0.20         205
## 11            0.36             0.29         155
## 12            0.26             0.21        6790

Tidy Data

Income ranges should be values, not column headers. We create a variable named Income and move the range labels into that column.

Income<-gather(Income,"Income","Percentage",2:5)
head(Income)

##                       Religious Sample_Size            Income Percentage
## 1                      Buddhist         233 Less_than_$30,000       0.36
## 2                      Catholic        6137 Less_than_$30,000       0.36
## 3        Evangelical Protestant        7462 Less_than_$30,000       0.35
## 4                         Hindu         172 Less_than_$30,000       0.17
## 5 Historically Black Protestant        1704 Less_than_$30,000       0.53
## 6             Jehovah's Witness         208 Less_than_$30,000       0.48
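A quick sanity check after tidying, sketched here on three rows copied from the table above, is to confirm that the four income shares for each tradition add up to roughly 1. The names `income_wide`, `income_long` and `totals` are illustrative.

```r
library(tidyr)
library(dplyr)

# Three rows copied from the Pew table above
income_wide <- data.frame(
  Religious = c("Buddhist", "Catholic", "Hindu"),
  `Less_than_$30,000` = c(0.36, 0.36, 0.17),
  `$30,000-$49,999`   = c(0.18, 0.19, 0.13),
  `$50,000-$99,999`   = c(0.32, 0.26, 0.34),
  `$100,000_or_more`  = c(0.13, 0.19, 0.36),
  check.names = FALSE
)

# Same reshaping step as in the text
income_long <- gather(income_wide, "Income", "Percentage", 2:5)

# Shares within one tradition should sum to about 1 (rounding can leave 0.99)
totals <- income_long %>%
  group_by(Religious) %>%
  summarise(total = sum(Percentage))

totals
```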

Data Analysis

We select the brackets below $30,000 and of $100,000 or more to compare the income distribution across religious traditions.

Low_income<-Income%>%
  filter(Income=="Less_than_$30,000")%>%
  arrange(desc(Percentage))
 Low_income

##                           Religious Sample_Size            Income
## 1     Historically Black Protestant        1704 Less_than_$30,000
## 2                 Jehovah's Witness         208 Less_than_$30,000
## 3                          Buddhist         233 Less_than_$30,000
## 4                          Catholic        6137 Less_than_$30,000
## 5            Evangelical Protestant        7462 Less_than_$30,000
## 6                            Muslim         205 Less_than_$30,000
## 7  Unaffiliated (religious "nones")        6790 Less_than_$30,000
## 8               Mainline Protestant        5208 Less_than_$30,000
## 9                            Mormon         594 Less_than_$30,000
## 10               Orthodox Christian         155 Less_than_$30,000
## 11                            Hindu         172 Less_than_$30,000
## 12                           Jewish         708 Less_than_$30,000
##    Percentage
## 1        0.53
## 2        0.48
## 3        0.36
## 4        0.36
## 5        0.35
## 6        0.34
## 7        0.33
## 8        0.29
## 9        0.27
## 10       0.18
## 11       0.17
## 12       0.16

ggplot(Low_income, aes(x=Religious, y=Percentage, fill=Religious)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    coord_flip() + 
    ggtitle("Low-Income Distribution by Religious Tradition") +
    xlab("Religious tradition") + ylab("Share with income under $30,000")

# The name Low_income is reused here, but it now holds the $100,000-or-more rows
Low_income<-Income%>%
  filter(Income=="$100,000_or_more")%>%
  arrange(desc(Percentage))
Low_income

##                           Religious Sample_Size           Income
## 1                            Jewish         708 $100,000_or_more
## 2                             Hindu         172 $100,000_or_more
## 3                Orthodox Christian         155 $100,000_or_more
## 4               Mainline Protestant        5208 $100,000_or_more
## 5  Unaffiliated (religious "nones")        6790 $100,000_or_more
## 6                            Mormon         594 $100,000_or_more
## 7                            Muslim         205 $100,000_or_more
## 8                          Catholic        6137 $100,000_or_more
## 9            Evangelical Protestant        7462 $100,000_or_more
## 10                         Buddhist         233 $100,000_or_more
## 11    Historically Black Protestant        1704 $100,000_or_more
## 12                Jehovah's Witness         208 $100,000_or_more
##    Percentage
## 1        0.44
## 2        0.36
## 3        0.29
## 4        0.23
## 5        0.21
## 6        0.20
## 7        0.20
## 8        0.19
## 9        0.14
## 10       0.13
## 11       0.08
## 12       0.04

ggplot(Low_income, aes(x=Religious, y=Percentage, fill=Religious)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    coord_flip() + 
    ggtitle("High-Income Distribution by Religious Tradition") +
    xlab("Religious tradition") + ylab("Share with income of $100,000 or more")

These figures may be useful to government agencies that target benefits at the lowest-income households. The first chart shows that 53% of Historically Black Protestants and 48% of Jehovah's Witnesses report incomes below $30,000, the highest shares among the groups surveyed, which suggests these groups would benefit most from such support. At the other end, 44% of Jewish respondents report incomes of $100,000 or more per year. The table alone cannot explain these differences; that question is left to sociological research.
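The contrast between the two brackets can also be put in a single table. The sketch below keeps both brackets, spreads them into two columns with `pivot_wider()`, and computes the gap between them. It uses only three traditions copied from the tables above. The names `low`, `high` and `gap` are illustrative, and a tidyr version that provides `pivot_wider()` is assumed.

```r
library(dplyr)
library(tidyr)

# Lowest and highest bracket for three traditions, copied from the tables above
income_long <- data.frame(
  Religious = rep(c("Historically Black Protestant", "Jehovah's Witness", "Jewish"),
                  times = 2),
  Income = rep(c("Less_than_$30,000", "$100,000_or_more"), each = 3),
  Percentage = c(0.53, 0.48, 0.16, 0.08, 0.04, 0.44)
)

brackets <- income_long %>%
  filter(Income %in% c("Less_than_$30,000", "$100,000_or_more")) %>%
  pivot_wider(names_from = Income, values_from = Percentage) %>%
  rename(low = `Less_than_$30,000`, high = `$100,000_or_more`) %>%
  mutate(gap = high - low) %>%   # positive: more high earners than low earners
  arrange(desc(gap))

brackets
```

Keeping both brackets in one data frame also avoids reusing a single variable name for two different subsets.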

DataSet III-American Math Society Survey

Data Description:

This dataset shows counts of new PhDs in the mathematical sciences for 2008-09 and 2011-12 categorized by type of institution, gender, and US citizenship status.

The type variable is a factor with levels I(Pu) for group I public universities, I(Pr) for group I private universities, II and III for groups II and III, IV for statistics and biostatistics programs, and Va for applied mathematics programs.

library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
math<-read.csv("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/AMSsurvey.csv")
math

##     X  type    sex citizen count count11
## 1   1 I(Pu)   Male      US   132     148
## 2   2 I(Pu) Female      US    35      40
## 3   3 I(Pr)   Male      US    87      63
## 4   4 I(Pr) Female      US    20      22
## 5   5    II   Male      US    96     161
## 6   6    II Female      US    47      53
## 7   7   III   Male      US    47      71
## 8   8   III Female      US    32      28
## 9   9    IV   Male      US    71      89
## 10 10    IV Female      US    54      55
## 11 11    Va   Male      US    34      42
## 12 12    Va Female      US    14      21
## 13 13 I(Pu)   Male  Non-US   130     136
## 14 14 I(Pu) Female  Non-US    29      32
## 15 15 I(Pr)   Male  Non-US    79      82
## 16 16 I(Pr) Female  Non-US    25      26
## 17 17    II   Male  Non-US    89     116
## 18 18    II Female  Non-US    50      56
## 19 19   III   Male  Non-US    53      61
## 20 20   III Female  Non-US    39      30
## 21 21    IV   Male  Non-US   122     153
## 22 22    IV Female  Non-US   105     115
## 23 23    Va   Male  Non-US    28      27
## 24 24    Va Female  Non-US    12      17

Change header name

The column count holds the counts for 2008-09, and count11 holds the counts for 2011-12.

cols <- c("Num","Type","Sex","Citizen","2008-09","2011-12")
colnames(math) <- cols
math

##    Num  Type    Sex Citizen 2008-09 2011-12
## 1    1 I(Pu)   Male      US     132     148
## 2    2 I(Pu) Female      US      35      40
## 3    3 I(Pr)   Male      US      87      63
## 4    4 I(Pr) Female      US      20      22
## 5    5    II   Male      US      96     161
## 6    6    II Female      US      47      53
## 7    7   III   Male      US      47      71
## 8    8   III Female      US      32      28
## 9    9    IV   Male      US      71      89
## 10  10    IV Female      US      54      55
## 11  11    Va   Male      US      34      42
## 12  12    Va Female      US      14      21
## 13  13 I(Pu)   Male  Non-US     130     136
## 14  14 I(Pu) Female  Non-US      29      32
## 15  15 I(Pr)   Male  Non-US      79      82
## 16  16 I(Pr) Female  Non-US      25      26
## 17  17    II   Male  Non-US      89     116
## 18  18    II Female  Non-US      50      56
## 19  19   III   Male  Non-US      53      61
## 20  20   III Female  Non-US      39      30
## 21  21    IV   Male  Non-US     122     153
## 22  22    IV Female  Non-US     105     115
## 23  23    Va   Male  Non-US      28      27
## 24  24    Va Female  Non-US      12      17

Tidy Data

The dataset has two columns, 2008-09 and 2011-12, whose names are really values of a Year variable. We use gather() to turn these columns into rows.

math<-gather(math,"Year","n",5:6)

#math <- math %>% 
  #gather(Year, Frequency, 5:6) %>%  
  #spread(Sex, Frequency)   
head(math)

##   Num  Type    Sex Citizen    Year   n
## 1   1 I(Pu)   Male      US 2008-09 132
## 2   2 I(Pu) Female      US 2008-09  35
## 3   3 I(Pr)   Male      US 2008-09  87
## 4   4 I(Pr) Female      US 2008-09  20
## 5   5    II   Male      US 2008-09  96
## 6   6    II Female      US 2008-09  47

Data Analysis

math2<-math%>%
  group_by(Type,Sex)%>%
  summarise(sum=sum(n))%>%
  arrange(desc(sum)) 
math2

## # A tibble: 12 x 3
## # Groups:   Type [6]
##    Type  Sex      sum
##    <fct> <fct>  <int>
##  1 I(Pu) Male     546
##  2 II    Male     462
##  3 IV    Male     435
##  4 IV    Female   329
##  5 I(Pr) Male     311
##  6 III   Male     232
##  7 II    Female   206
##  8 I(Pu) Female   136
##  9 Va    Male     131
## 10 III   Female   129
## 11 I(Pr) Female    93
## 12 Va    Female    64

ggplot(data = math2, aes(x=Sex,y=sum))+
  geom_bar(stat = 'identity',aes(fill=Type))+
  geom_text(aes(x = Sex, y = sum, 
                label = paste(sum),
                group = Sex,
                vjust = -0.01)) +
  labs(title = "PhD Distribution", 
       x = "Sex", 
       y = "Number of PhDs") +
  facet_wrap(~Type, ncol = 8)+
  theme_bw()

The bar chart compares the number of PhDs across institution types. In this sample, group I public universities, I(Pu), have the most male PhDs (546). Among women, the largest count is in group IV, statistics and biostatistics programs, with 329. If the sample is representative, this suggests that female PhD students are more likely to choose statistics and biostatistics programs than other program types.
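Raw counts depend on program size, so a hedged way to check the point about female students is to compute the female share within each institution type. The sketch below uses the totals from the summary table above. The names `totals`, `female_share` and `share` are illustrative, and `pivot_wider()` is assumed to be available in the installed tidyr.

```r
library(dplyr)
library(tidyr)

# Totals by institution type and sex, copied from the summary table above
totals <- data.frame(
  Type = rep(c("I(Pu)", "I(Pr)", "II", "III", "IV", "Va"), each = 2),
  Sex  = rep(c("Male", "Female"), times = 6),
  sum  = c(546, 136, 311, 93, 462, 206, 232, 129, 435, 329, 131, 64)
)

# One row per type, with the female share of that type's PhDs
female_share <- totals %>%
  pivot_wider(names_from = Sex, values_from = sum) %>%
  mutate(share = Female / (Female + Male)) %>%
  arrange(desc(share))

female_share
```

On these totals, group IV has the highest female share and group I public universities the lowest.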

math3<-math%>%
  group_by(Sex)%>%
  summarise(sum=sum(n))
math3

## # A tibble: 2 x 2
##   Sex      sum
##   <fct>  <int>
## 1 Female   957
## 2 Male    2117

ggplot(math3, aes(x=Sex, y=sum, fill=Sex)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.2) +
    ggtitle("Female PhDs vs. Male PhDs") +
    xlab("Sex") + ylab("Number of PhDs")

math4<-math%>%
  group_by(Citizen)%>%
  summarise(sum=sum(n))
math4

## # A tibble: 2 x 2
##   Citizen   sum
##   <fct>   <int>
## 1 Non-US   1612
## 2 US       1462

ggplot(math4, aes(x=Citizen, y=sum, fill=Citizen))+
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.2) +
    ggtitle("US-Citizen PhDs vs. Non-US PhDs") +
    xlab("Citizen") + ylab("Number of PhDs")

math5<-math%>%
  group_by(Year)%>%
  summarise(sum=sum(n))
math5

## # A tibble: 2 x 2
##   Year      sum
##   <chr>   <int>
## 1 2008-09  1430
## 2 2011-12  1644

ggplot(math5, aes(x=Year, y=sum, fill=Year))+
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.2) +
    ggtitle("The number of PhDs increased from 2008-09 to 2011-12") +
    xlab("Year") + ylab("Number of PhDs")

Conclusion

The totals by sex, citizenship and survey year give three summary rates. Women account for 957/(957+2117) = 31.13% of the PhDs, and non-US citizens account for 1612/(1612+1462) = 52.44%. From 2008-09 to 2011-12, the total number of PhDs increased by (1644-1430)/1430 = 14.97%.
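The arithmetic in this conclusion can be checked directly in R. The variable names below are illustrative, and the counts are the totals from the three summary tables above.

```r
# Totals from the summary tables above
female <- 957;  male <- 2117
non_us <- 1612; us   <- 1462
phd_0809 <- 1430; phd_1112 <- 1644

female_rate <- female / (female + male)          # share of PhDs awarded to women
non_us_rate <- non_us / (non_us + us)            # share awarded to non-US citizens
growth      <- (phd_1112 - phd_0809) / phd_0809  # relative change between surveys

# Express all three as percentages with two decimals
rates <- round(100 * c(female = female_rate, non_us = non_us_rate, growth = growth), 2)
rates
```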

Mengqin Cai

10/6/2019
