library(dplyr)
library(ggplot2)
library(ggrepel)

HW exercise 1.

Use the dataset containing the average number of visitors (monthly) in New Zealand by country of residence to explore the seasonal patterns between the eight countries. Is there a hemisphere effect?

Load and check

dta <- read.csv('../data/visitors_NZ.csv')
str(dta)

'data.frame':   170 obs. of  10 variables:
 $ Month    : Factor w/ 170 levels "1998M09","1998M10",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Australia: int  17244 18090 16750 25909 27228 19461 19200 19595 12564 10987 ...
 $ ChinaPRof: int  748 941 1054 1270 1375 1660 1456 1488 1449 1413 ...
 $ Japan    : int  6093 5039 6112 6670 6008 7478 7341 5030 4275 3758 ...
 $ Korea    : int  979 1083 1144 1836 2716 2245 1611 1416 1192 1255 ...
 $ Germany  : int  1320 2459 5195 5499 6430 7320 6094 3157 1450 765 ...
 $ UK       : int  5794 7876 13362 20238 22557 27477 20187 13448 8587 7271 ...
 $ Canada   : int  973 1418 2236 2935 3623 4394 3573 1795 1160 903 ...
 $ USA      : int  3837 6093 8468 7865 10007 12533 10519 6737 4932 4845 ...
 $ Total    : int  57930 68203 84370 113853 122130 124305 104106 83414 59800 52426 ...

dta

Data transformation

dta_long <- dta %>% reshape2::melt() %>%
  mutate(Year_Month = paste0(
    gsub(as.character(Month), pattern = 'M', replacement = '/'), '/01'),
         Year_Month2 = gsub(as.character(Month), pattern = 'M', replacement = '/'),
         Year_Month3 = as.Date(Year_Month ,format = '%Y/%m/%d'),
         Month = as.integer(substr(as.character(Month), 6, 7)),
         hemisphere = factor(ifelse(variable %in% c('Australia'), 'S', 'N')))
dta_long

summary(dta_long)

     Month             variable       value         Year_Month       
 Min.   : 1.000   Australia:170   Min.   :   748   Length:1530       
 1st Qu.: 4.000   ChinaPRof:170   1st Qu.:  4304   Class :character  
 Median : 7.000   Japan    :170   Median :  7180   Mode  :character  
 Mean   : 6.535   Korea    :170   Mean   : 22928                     
 3rd Qu.:10.000   Germany  :170   3rd Qu.: 19420                     
 Max.   :12.000   UK       :170   Max.   :212693                     
                  (Other)  :510                                      
 Year_Month2         Year_Month3         hemisphere
 Length:1530        Min.   :1998-09-01   N:1360    
 Class :character   1st Qu.:2002-03-01   S: 170    
 Mode  :character   Median :2005-09-16             
                    Mean   :2005-09-15             
                    3rd Qu.:2009-04-01             
                    Max.   :2012-10-01

Data visualization

Time series plot for the total count

date_labels <- levels(factor(dta_long$Year_Month2))
dta_long %>% dplyr::filter(variable == 'Total') %>%
  ggplot(data = ., aes(x = Year_Month3, y = value)) +
  geom_line(aes(), size = .5) +
  geom_point(aes(color = ifelse((Month <= 9) & (Month >= 4), 
                                'Summer', 'Winter')), size = 1) +
  scale_x_date(breaks = as.Date(paste(rep(1998:2012, each = 2), c(6, 12), 1, sep = '/')),
               labels = c('1996/06', date_labels[(1:length(date_labels)) %% 6 == 4], '2012/12')) +
  scale_fill_manual(values = c('hotpink', 'dodgeblue')) + 
  labs(x = 'Time', colour = 'Season in North hemisphere') +
  facet_wrap(. ~ variable, ncol = 2) +
  theme_bw() +
  theme(legend.position = 'top',
        axis.text.x = element_text(angle = 45, size = 5, hjust = 1,
                                   margin = margin(t = 3)))

Seasonal trend is obvious. There are more visitors in April to September (winter of North hemisphere, summer in South hemisphere). It is not surprising that people tend to go traveling in warm weather.

Time series plot for eight countries

dta_long %>% dplyr::filter(variable != 'Total') %>%
  ggplot(data = ., aes(x = Year_Month3, y = value)) +
  geom_line(aes(), size = .5) +
  geom_point(aes(color = ifelse((Month <= 9) & (Month >= 4), 
                                'Summer', 'Winter')), size = 1) +
  scale_x_date(breaks = as.Date(paste(rep(1998:2012, each = 2), c(6, 12), 1, sep = '/')),
               labels = c('1996/06', date_labels[(1:length(date_labels)) %% 6 == 4], '2012/12')) +
  scale_fill_manual(values = c('hotpink', 'dodgeblue')) + 
  labs(x = 'Time', colour = 'Season in North hemisphere') +
  facet_wrap(. ~ variable, ncol = 2) +
  theme_bw() +
  theme(legend.position = 'top',
        axis.text.x = element_text(angle = 45, size = 5, hjust = 1,
                                   margin = margin(t = 3)))

Seasonal trend is not so obvious in China but is pretty obvious in other seven countries, especially in Australia and the UK. It is possible that beacuse these two countries have more visitors visiting New Zealand.

Visitor counts comparison

Lollipop plot

dta_MEAN <- dta_long %>% dplyr::filter(variable != 'Total') %>%
  group_by(variable) %>% summarise(MEAN = mean(value)) %>% arrange(-MEAN)
dta_MEAN %>% mutate(variable = factor(variable, levels = variable[8:1])) %>%
  qplot(data =., x = MEAN, y = variable) +
  geom_segment(aes(xend = 0, yend = variable)) +
  scale_x_continuous(breaks = seq(0, max(dta_MEAN$MEAN) + 2000, by = 2000)) +
  labs(x = 'Average visitor counts in 1998 - 2012', y = 'Country') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, margin = margin(t = 7)))

Box plot

dta_long %>% dplyr::filter(variable != 'Total') %>%
  mutate(variable = factor(variable, levels = dta_MEAN$variable[8:1])) %>%
  qplot(data =., x = value, y = variable, geom = 'boxplot') + 
  scale_x_continuous(breaks = seq(0, max(dta[-c(1,10)]) + 2000, by = 4000)) +
  labs(x = 'Visitor counts in 1998 - 2012', y = 'Country') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, margin = margin(t = 7)))

Time series plot

dta_long %>% dplyr::filter(variable != 'Total') %>%
  ggplot(data = ., aes(x = Year_Month3, y = value)) +
  geom_line(aes(color = variable), size = .5) +
  geom_point(aes(color = variable), size = 1) +
  scale_x_date(breaks = as.Date(paste(rep(1998:2012, each = 2), c(6, 12), 1, sep = '/')),
               labels = c('1996/06', date_labels[(1:length(date_labels)) %% 6 == 4], '2012/12')) +
  scale_y_continuous(breaks = seq(0, 60000, by = 5000)) +
  labs(x = 'Time', y = 'Visitor counts in month', colour = 'Country') +
  theme_bw() +
  theme(legend.position = 'top',
        axis.text.x = element_text(angle = 45, size = 5, hjust = 1,
                                   margin = margin(t = 3)))

Compared to other six countries, there are more visitors from Australia and the UK, especially Australia. It is possible that beacuse these two countries and New Zealand belong to the Common Wealth. Also, both Australia and New Zealand belong to the South hemisphere, which is possibly one of the reason why there are so many visitors from Australia visiting New Zealand.

Conclusion

The hemisphere effect can be found in:

There are more visitors visit New Zealand in the winter of S. hemisphere (the summer in N. hemisphere).
There are more visitors visiting New Zealand from Australia, which is also located in S. hemisphere.

HW exercise 2.

Use the sample data set to estimate the mean life expectancy of Nobel prize winners.

Load and check

dta <- read.table('../data/Nobel_prize_winners.txt', header = TRUE)
str(dta)

'data.frame':   21 obs. of  3 variables:
 $ ID  : Factor w/ 21 levels "Bernard Katz",..: 3 8 17 1 13 15 7 14 2 16 ...
 $ Born: Factor w/ 21 levels "August 8, 1902",..: 5 9 2 12 1 14 6 10 16 3 ...
 $ Died: Factor w/ 21 levels "April 16, 1972",..: 21 8 20 2 19 6 5 16 7 4 ...

dta

Compute

library(lubridate)
dta_trans <- apply(dta[,-1], 2, function(v)
  sapply(strsplit(as.character(v), split = ' '),
         function(x) gsub(paste(x[2], x[1], x[3], sep = '-'),
                          pattern = ',', replacement = '')))
Life_time <- as.Date(dmy(dta_trans[,2])) - as.Date(dmy(dta_trans[,1]))
mean(Life_time / dyears(1))

[1] 82.39517

The estimated mean life expectancy of Nobel prize winners is 82.39517 years.

Discuss

dta %>% mutate(born = mdy(Born), died = mdy(Died),
               span = born %--% died, y = as.duration(span) / dyears(1))

這樣的語法在日期格式調整時，似乎會出一些狀況，像是第一列的出生日期應為February 27, 1926，轉換後變成了1926-02-07，還有蠻多其他錯誤。

HW exercise 3.

Use the following sample of records for profit made, arrival date, and departure date of group travel booked at a travel agency in Taiwan to estimate the mean profit per day of service.

Load and check

dta <- read.csv('../data/travel_agency_TW.csv')
str(dta)

'data.frame':   96 obs. of  3 variables:
 $ Expense : int  15393 27616 8876 57378 32613 46998 10744 3269 16195 55842 ...
 $ Arrival : Factor w/ 83 levels "2014/10/10","2014/10/13",..: 79 83 78 74 75 31 31 26 29 77 ...
 $ Depature: Factor w/ 79 levels "2014/10/1","2014/10/13",..: 73 77 73 76 75 32 69 26 28 73 ...

dta

Compute

days <- as.numeric(as.Date(dta$Depature) - as.Date(dta$Arrival)) + 1
mean(dta$Expense) / mean(days)

[1] 5522.453

The estimated mean profit per day of service is 5522.453 units.

HW exercise 4.

The following rather awful plot is shown on a web page hosted by the Taiwanese Ministry of Education.

$The original plot$

The original plot

Revise it so that it is a proper time series plot. For your convenience, the data points have been extracted and saved in the the file. What had happened in the early 1990’s and how do we know if the trend reversal is real? You may want to augment the data set with further data points from 2012 to 2018 available in the foreign students in the U.S. data file.

Load and check

dtaTW <- read.table('../data/data_hw0518_4_1.txt')
dtaUS <- read.table('../data/data_hw0518_4_2.txt', header = TRUE)
summary(dtaTW)

       V1       
 Min.   : 2553  
 1st Qu.:14905  
 Median :28930  
 Mean   :24497  
 3rd Qu.:32286  
 Max.   :37580

summary(dtaUS)

      Year      Country     Number      
 Min.   :2012   CN:7    Min.   : 18105  
 1st Qu.:2013   JP:7    1st Qu.: 19334  
 Median :2015   TW:7    Median : 21516  
 Mean   :2015           Mean   :119596  
 3rd Qu.:2017           3rd Qu.:274439  
 Max.   :2018           Max.   :369548

str(dtaUS)

'data.frame':   21 obs. of  3 variables:
 $ Year   : int  2012 2013 2014 2015 2016 2017 2018 2012 2013 2014 ...
 $ Country: Factor w/ 3 levels "CN","JP","TW": 3 3 3 3 3 3 3 2 2 2 ...
 $ Number : int  21867 21266 20993 21127 21516 22454 23369 19568 19334 19064 ...

dtaUS

dtaUS %>% dplyr::filter(Country == 'TW')

Transform

dtaTW <- dtaTW %>% mutate(Year = c(seq(1950, 1990, by=5), 1991:2004),
                          Count = V1, V1 = NULL)
dtaTW

dtaUS_TW <- dtaUS %>% dplyr::filter(Country == 'TW') %>% 
  left_join(., dtaUS %>% group_by(Year) %>% summarise(Year_number = sum(Number))) %>%
  mutate(Proportion = Number / Year_number)
dtaUS_TW

Visualize

1. Time series plot of the number of Taiwanese students in the US from 1950 to 2005

dtaTW %>% ggplot(data = ., aes(x = Year, y = Count)) +
  geom_line(aes(), size = .5) +
  geom_point(aes(), size = 1) +
  geom_label_repel(aes(label = Count), size = 1.5,
                   segment.color = 'grey10', segment.size = .25) + 
  scale_x_continuous(limits = c(min(dtaTW$Year) - 5, max(dtaTW$Year) + 5),
                     breaks = seq(min(dtaTW$Year) - 5, max(dtaTW$Year) + 5, by = 5)) +
  scale_y_continuous(limits = c(0, max(dtaTW$Count) + 4000),
                     breaks = seq(0, max(dtaTW$Count) + 4000, by = 4000)) +
  theme_bw()

2. Time series plot of the number (proportion) of Taiwanese students in the US from 2012 to 2018

dtaUS_TW %>% ggplot(data = ., aes(x = Year, y = Number)) +
  geom_line(aes(), size = .5) +
  geom_point(aes(), size = 1) +
  geom_label_repel(aes(label = Number), size = 2.5,
                   segment.color = 'grey10', segment.size = .25) + 
  scale_x_continuous(breaks = seq(min(dtaUS_TW$Year) - 1, 
                                  max(dtaUS_TW$Year) + 1, by = 1)) +
  scale_y_continuous(limits = c((min(dtaUS_TW$Number) %/% 1000)*1000,
                                (max(dtaUS_TW$Number) %/% 1000 + 1)*1000),
                     breaks = seq((min(dtaUS_TW$Number) %/% 1000)*1000,
                                  (max(dtaUS_TW$Number) %/% 1000 + 1)*1000, by = 500)) +
  theme_bw()

dtaUS_TW %>% ggplot(data = ., aes(x = Year, y = Proportion)) +
  geom_line(aes(), size = .5) +
  geom_point(aes(), size = 1) +
  geom_label_repel(aes(label = paste0(round(Proportion, 2)*100, '%')), size = 2.5,
                   segment.color = 'grey10', segment.size = .25) + 
  scale_x_continuous(breaks = seq(min(dtaUS_TW$Year) - 1, 
                                  max(dtaUS_TW$Year) + 1, by = 1)) +
  scale_y_continuous(limits = c(.05, .1),
                     breaks = seq(.05, .1, by = .01),
                     labels = paste0(seq(.05, .1, by = .01)*100, '%')) +
  theme_bw()

3. Time series plot of the number of Taiwanese students in the US from 1950 to 2018 (the combination of two datasets)

dtaUS_TW %>% mutate(Count = Number) %>%
  dplyr::select(Year, Count) %>% rbind(., dtaTW) %>%
  ggplot(data = ., aes(x = Year, y = Count)) +
  geom_line(aes(), size = .5) +
  geom_point(aes(), size = 1) +
  geom_label_repel(aes(label = Count), size = 1.5,
                   segment.color = 'grey10', segment.size = .25) + 
  scale_x_continuous(limits = c(1950, 2020), breaks = seq(1950, 2020, by = 5)) +
  scale_y_continuous(limits = c(0, max(dtaTW$Count) + 4000),
                     breaks = seq(0, max(dtaTW$Count) + 4000, by = 4000)) +
  theme_bw()

HW exercise 5.

How different groups spend their day is an article published in The New York Times using the data collected from The American Time Use Survey. Discuss what we need to have in order to replicate this piece of graphical journalism in Taiwan.

[ANS]

We need to collect a representative sample via a reasonable sampling method (e.g., sampling stratified by the cities and counties), and obtain the following measurement/indices from the sample:

Basic demographic variables
1. gender
2. age
3. ethnicity
4. education level
5. employment status
6. the no. of children
Daily-activity-related measurements
1. daily activities
2. start and end time of the daily activities.

In the part of data analysis, we need to turn the data set into the format of time series data to calculate the percentage of each activity in each time period.

DataM: Homework Exercise 0518 (Dates & Times)

DataM: Homework Exercise 0518 (Dates & Times)

HW exercise 1.

Load and check

Data transformation

Data visualization

Time series plot for the total count

Time series plot for eight countries

Visitor counts comparison

Lollipop plot

Box plot

Time series plot

Conclusion

HW exercise 2.

Load and check

Compute

Discuss

HW exercise 3.

Load and check

Compute

HW exercise 4.

Load and check

Transform

Visualize

1. Time series plot of the number of Taiwanese students in the US from 1950 to 2005

2. Time series plot of the number (proportion) of Taiwanese students in the US from 2012 to 2018

3. Time series plot of the number of Taiwanese students in the US from 1950 to 2018 (the combination of two datasets)

HW exercise 5.