DataM: Homework Exercise 0518 (Dates & Times)
HW exercise 1.
Use the dataset containing the average number of visitors (monthly) in New Zealand by country of residence to explore the seasonal patterns between the eight countries. Is there a hemisphere effect?
Load and check
'data.frame': 170 obs. of 10 variables:
$ Month : Factor w/ 170 levels "1998M09","1998M10",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Australia: int 17244 18090 16750 25909 27228 19461 19200 19595 12564 10987 ...
$ ChinaPRof: int 748 941 1054 1270 1375 1660 1456 1488 1449 1413 ...
$ Japan : int 6093 5039 6112 6670 6008 7478 7341 5030 4275 3758 ...
$ Korea : int 979 1083 1144 1836 2716 2245 1611 1416 1192 1255 ...
$ Germany : int 1320 2459 5195 5499 6430 7320 6094 3157 1450 765 ...
$ UK : int 5794 7876 13362 20238 22557 27477 20187 13448 8587 7271 ...
$ Canada : int 973 1418 2236 2935 3623 4394 3573 1795 1160 903 ...
$ USA : int 3837 6093 8468 7865 10007 12533 10519 6737 4932 4845 ...
$ Total : int 57930 68203 84370 113853 122130 124305 104106 83414 59800 52426 ...
Data transformation
dta_long <- dta %>% reshape2::melt() %>%
mutate(Year_Month = paste0(
gsub(as.character(Month), pattern = 'M', replacement = '/'), '/01'),
Year_Month2 = gsub(as.character(Month), pattern = 'M', replacement = '/'),
Year_Month3 = as.Date(Year_Month ,format = '%Y/%m/%d'),
Month = as.integer(substr(as.character(Month), 6, 7)),
hemisphere = factor(ifelse(variable %in% c('Australia'), 'S', 'N')))
dta_long Month variable value Year_Month
Min. : 1.000 Australia:170 Min. : 748 Length:1530
1st Qu.: 4.000 ChinaPRof:170 1st Qu.: 4304 Class :character
Median : 7.000 Japan :170 Median : 7180 Mode :character
Mean : 6.535 Korea :170 Mean : 22928
3rd Qu.:10.000 Germany :170 3rd Qu.: 19420
Max. :12.000 UK :170 Max. :212693
(Other) :510
Year_Month2 Year_Month3 hemisphere
Length:1530 Min. :1998-09-01 N:1360
Class :character 1st Qu.:2002-03-01 S: 170
Mode :character Median :2005-09-16
Mean :2005-09-15
3rd Qu.:2009-04-01
Max. :2012-10-01
Data visualization
Time series plot for the total count
date_labels <- levels(factor(dta_long$Year_Month2))
dta_long %>% dplyr::filter(variable == 'Total') %>%
ggplot(data = ., aes(x = Year_Month3, y = value)) +
geom_line(aes(), size = .5) +
geom_point(aes(color = ifelse((Month <= 9) & (Month >= 4),
'Summer', 'Winter')), size = 1) +
scale_x_date(breaks = as.Date(paste(rep(1998:2012, each = 2), c(6, 12), 1, sep = '/')),
labels = c('1996/06', date_labels[(1:length(date_labels)) %% 6 == 4], '2012/12')) +
scale_fill_manual(values = c('hotpink', 'dodgeblue')) +
labs(x = 'Time', colour = 'Season in North hemisphere') +
facet_wrap(. ~ variable, ncol = 2) +
theme_bw() +
theme(legend.position = 'top',
axis.text.x = element_text(angle = 45, size = 5, hjust = 1,
margin = margin(t = 3)))Seasonal trend is obvious. There are more visitors in April to September (winter of North hemisphere, summer in South hemisphere). It is not surprising that people tend to go traveling in warm weather.
Time series plot for eight countries
dta_long %>% dplyr::filter(variable != 'Total') %>%
ggplot(data = ., aes(x = Year_Month3, y = value)) +
geom_line(aes(), size = .5) +
geom_point(aes(color = ifelse((Month <= 9) & (Month >= 4),
'Summer', 'Winter')), size = 1) +
scale_x_date(breaks = as.Date(paste(rep(1998:2012, each = 2), c(6, 12), 1, sep = '/')),
labels = c('1996/06', date_labels[(1:length(date_labels)) %% 6 == 4], '2012/12')) +
scale_fill_manual(values = c('hotpink', 'dodgeblue')) +
labs(x = 'Time', colour = 'Season in North hemisphere') +
facet_wrap(. ~ variable, ncol = 2) +
theme_bw() +
theme(legend.position = 'top',
axis.text.x = element_text(angle = 45, size = 5, hjust = 1,
margin = margin(t = 3)))Seasonal trend is not so obvious in China but is pretty obvious in other seven countries, especially in Australia and the UK. It is possible that beacuse these two countries have more visitors visiting New Zealand.
Visitor counts comparison
Lollipop plot
dta_MEAN <- dta_long %>% dplyr::filter(variable != 'Total') %>%
group_by(variable) %>% summarise(MEAN = mean(value)) %>% arrange(-MEAN)
dta_MEAN %>% mutate(variable = factor(variable, levels = variable[8:1])) %>%
qplot(data =., x = MEAN, y = variable) +
geom_segment(aes(xend = 0, yend = variable)) +
scale_x_continuous(breaks = seq(0, max(dta_MEAN$MEAN) + 2000, by = 2000)) +
labs(x = 'Average visitor counts in 1998 - 2012', y = 'Country') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, margin = margin(t = 7)))Box plot
dta_long %>% dplyr::filter(variable != 'Total') %>%
mutate(variable = factor(variable, levels = dta_MEAN$variable[8:1])) %>%
qplot(data =., x = value, y = variable, geom = 'boxplot') +
scale_x_continuous(breaks = seq(0, max(dta[-c(1,10)]) + 2000, by = 4000)) +
labs(x = 'Visitor counts in 1998 - 2012', y = 'Country') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, margin = margin(t = 7)))Time series plot
dta_long %>% dplyr::filter(variable != 'Total') %>%
ggplot(data = ., aes(x = Year_Month3, y = value)) +
geom_line(aes(color = variable), size = .5) +
geom_point(aes(color = variable), size = 1) +
scale_x_date(breaks = as.Date(paste(rep(1998:2012, each = 2), c(6, 12), 1, sep = '/')),
labels = c('1996/06', date_labels[(1:length(date_labels)) %% 6 == 4], '2012/12')) +
scale_y_continuous(breaks = seq(0, 60000, by = 5000)) +
labs(x = 'Time', y = 'Visitor counts in month', colour = 'Country') +
theme_bw() +
theme(legend.position = 'top',
axis.text.x = element_text(angle = 45, size = 5, hjust = 1,
margin = margin(t = 3)))Compared to other six countries, there are more visitors from Australia and the UK, especially Australia. It is possible that beacuse these two countries and New Zealand belong to the Common Wealth. Also, both Australia and New Zealand belong to the South hemisphere, which is possibly one of the reason why there are so many visitors from Australia visiting New Zealand.
Conclusion
The hemisphere effect can be found in:
- There are more visitors visit New Zealand in the winter of S. hemisphere (the summer in N. hemisphere).
- There are more visitors visiting New Zealand from Australia, which is also located in S. hemisphere.
HW exercise 2.
Use the sample data set to estimate the mean life expectancy of Nobel prize winners.
Load and check
'data.frame': 21 obs. of 3 variables:
$ ID : Factor w/ 21 levels "Bernard Katz",..: 3 8 17 1 13 15 7 14 2 16 ...
$ Born: Factor w/ 21 levels "August 8, 1902",..: 5 9 2 12 1 14 6 10 16 3 ...
$ Died: Factor w/ 21 levels "April 16, 1972",..: 21 8 20 2 19 6 5 16 7 4 ...
Compute
library(lubridate)
dta_trans <- apply(dta[,-1], 2, function(v)
sapply(strsplit(as.character(v), split = ' '),
function(x) gsub(paste(x[2], x[1], x[3], sep = '-'),
pattern = ',', replacement = '')))
Life_time <- as.Date(dmy(dta_trans[,2])) - as.Date(dmy(dta_trans[,1]))
mean(Life_time / dyears(1))[1] 82.39517
The estimated mean life expectancy of Nobel prize winners is 82.39517 years.
Discuss
dta %>% mutate(born = mdy(Born), died = mdy(Died),
span = born %--% died, y = as.duration(span) / dyears(1)) 這樣的語法在日期格式調整時,似乎會出一些狀況,像是第一列的出生日期應為February 27, 1926,轉換後變成了1926-02-07,還有蠻多其他錯誤。
HW exercise 3.
Use the following sample of records for profit made, arrival date, and departure date of group travel booked at a travel agency in Taiwan to estimate the mean profit per day of service.
Load and check
'data.frame': 96 obs. of 3 variables:
$ Expense : int 15393 27616 8876 57378 32613 46998 10744 3269 16195 55842 ...
$ Arrival : Factor w/ 83 levels "2014/10/10","2014/10/13",..: 79 83 78 74 75 31 31 26 29 77 ...
$ Depature: Factor w/ 79 levels "2014/10/1","2014/10/13",..: 73 77 73 76 75 32 69 26 28 73 ...
Compute
[1] 5522.453
The estimated mean profit per day of service is 5522.453 units.
HW exercise 4.
The following rather awful plot is shown on a web page hosted by the Taiwanese Ministry of Education.
The original plot
Revise it so that it is a proper time series plot. For your convenience, the data points have been extracted and saved in the the file. What had happened in the early 1990’s and how do we know if the trend reversal is real? You may want to augment the data set with further data points from 2012 to 2018 available in the foreign students in the U.S. data file.
Load and check
dtaTW <- read.table('../data/data_hw0518_4_1.txt')
dtaUS <- read.table('../data/data_hw0518_4_2.txt', header = TRUE)
summary(dtaTW) V1
Min. : 2553
1st Qu.:14905
Median :28930
Mean :24497
3rd Qu.:32286
Max. :37580
Year Country Number
Min. :2012 CN:7 Min. : 18105
1st Qu.:2013 JP:7 1st Qu.: 19334
Median :2015 TW:7 Median : 21516
Mean :2015 Mean :119596
3rd Qu.:2017 3rd Qu.:274439
Max. :2018 Max. :369548
'data.frame': 21 obs. of 3 variables:
$ Year : int 2012 2013 2014 2015 2016 2017 2018 2012 2013 2014 ...
$ Country: Factor w/ 3 levels "CN","JP","TW": 3 3 3 3 3 3 3 2 2 2 ...
$ Number : int 21867 21266 20993 21127 21516 22454 23369 19568 19334 19064 ...
Transform
dtaUS_TW <- dtaUS %>% dplyr::filter(Country == 'TW') %>%
left_join(., dtaUS %>% group_by(Year) %>% summarise(Year_number = sum(Number))) %>%
mutate(Proportion = Number / Year_number)
dtaUS_TWVisualize
1. Time series plot of the number of Taiwanese students in the US from 1950 to 2005
dtaTW %>% ggplot(data = ., aes(x = Year, y = Count)) +
geom_line(aes(), size = .5) +
geom_point(aes(), size = 1) +
geom_label_repel(aes(label = Count), size = 1.5,
segment.color = 'grey10', segment.size = .25) +
scale_x_continuous(limits = c(min(dtaTW$Year) - 5, max(dtaTW$Year) + 5),
breaks = seq(min(dtaTW$Year) - 5, max(dtaTW$Year) + 5, by = 5)) +
scale_y_continuous(limits = c(0, max(dtaTW$Count) + 4000),
breaks = seq(0, max(dtaTW$Count) + 4000, by = 4000)) +
theme_bw()2. Time series plot of the number (proportion) of Taiwanese students in the US from 2012 to 2018
dtaUS_TW %>% ggplot(data = ., aes(x = Year, y = Number)) +
geom_line(aes(), size = .5) +
geom_point(aes(), size = 1) +
geom_label_repel(aes(label = Number), size = 2.5,
segment.color = 'grey10', segment.size = .25) +
scale_x_continuous(breaks = seq(min(dtaUS_TW$Year) - 1,
max(dtaUS_TW$Year) + 1, by = 1)) +
scale_y_continuous(limits = c((min(dtaUS_TW$Number) %/% 1000)*1000,
(max(dtaUS_TW$Number) %/% 1000 + 1)*1000),
breaks = seq((min(dtaUS_TW$Number) %/% 1000)*1000,
(max(dtaUS_TW$Number) %/% 1000 + 1)*1000, by = 500)) +
theme_bw()dtaUS_TW %>% ggplot(data = ., aes(x = Year, y = Proportion)) +
geom_line(aes(), size = .5) +
geom_point(aes(), size = 1) +
geom_label_repel(aes(label = paste0(round(Proportion, 2)*100, '%')), size = 2.5,
segment.color = 'grey10', segment.size = .25) +
scale_x_continuous(breaks = seq(min(dtaUS_TW$Year) - 1,
max(dtaUS_TW$Year) + 1, by = 1)) +
scale_y_continuous(limits = c(.05, .1),
breaks = seq(.05, .1, by = .01),
labels = paste0(seq(.05, .1, by = .01)*100, '%')) +
theme_bw()3. Time series plot of the number of Taiwanese students in the US from 1950 to 2018 (the combination of two datasets)
dtaUS_TW %>% mutate(Count = Number) %>%
dplyr::select(Year, Count) %>% rbind(., dtaTW) %>%
ggplot(data = ., aes(x = Year, y = Count)) +
geom_line(aes(), size = .5) +
geom_point(aes(), size = 1) +
geom_label_repel(aes(label = Count), size = 1.5,
segment.color = 'grey10', segment.size = .25) +
scale_x_continuous(limits = c(1950, 2020), breaks = seq(1950, 2020, by = 5)) +
scale_y_continuous(limits = c(0, max(dtaTW$Count) + 4000),
breaks = seq(0, max(dtaTW$Count) + 4000, by = 4000)) +
theme_bw()HW exercise 5.
How different groups spend their day is an article published in The New York Times using the data collected from The American Time Use Survey. Discuss what we need to have in order to replicate this piece of graphical journalism in Taiwan.
[ANS]
We need to collect a representative sample via a reasonable sampling method (e.g., sampling stratified by the cities and counties), and obtain the following measurement/indices from the sample:
- Basic demographic variables
- gender
- age
- ethnicity
- education level
- employment status
- the no. of children
- Daily-activity-related measurements
- daily activities
- start and end time of the daily activities.
In the part of data analysis, we need to turn the data set into the format of time series data to calculate the percentage of each activity in each time period.