I proposed a data set containing the sleep patterns and work schedules of Railroad Dispatchers. The data sets can be found here. I am curious how naps may impact the self-reported alertness at work of these workers. My assumption is that those who take fewer naps will not report feeling as alert at work. I will look at the overall count of naps over the two-week period and the average length of the naps to see whether there are any differences in the feeling of alertness at work across the nap summary data.
# Packages used in this section (they may already be loaded in an earlier setup chunk);
# plyr is attached before dplyr so that dplyr's verbs are not masked
library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
library(ggplot2)
library(scales)
library(knitr)

# `url` (the base location of the CSV files) is defined earlier in the document
background.url <- file(paste(url,
                             "Dispatchers_Background_Data - Christophe Hunt.csv",
                             sep = ""), open = "r")
background.data <- read.csv(background.url, sep = ",", header = TRUE, stringsAsFactors = FALSE)
background.dict.url <- file(paste(url,
                                  "Dispatchers_Background_Data - variable key - Christophe Hunt.csv",
                                  sep = ""), open = "r")
background.data.dict <- read.csv(background.dict.url, sep = ",", header = TRUE, stringsAsFactors = FALSE)
dis.sleep.work.sched.url <- file(paste(url,
                                       "Dispatchers_Daily_Log - Christophe Hunt.csv",
                                       sep = ""), open = "r")
dis.data <- read.csv(dis.sleep.work.sched.url, sep = ",", header = TRUE, stringsAsFactors = FALSE)
dis.sleep.work.sched.dict.url <- file(paste(url,
                                            "Dispatchers_Daily_Log - variable key - Christophe Hunt.csv",
                                            sep = ""), open = "r")
dis.data.dict <- read.csv(dis.sleep.work.sched.dict.url, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Format of the date-time fields in the daily log
data.time.format <- "%e-%b-%Y %H:%M:%S"
background.data$ID <- as.integer(background.data$ID)
dis.data <- dis.data %>%
  select(ID, Date, contains("nap")) %>%
  # length of each nap = time awoke minus time fell asleep, in minutes
  mutate(Time.of.Nap1 = difftime(strptime(Nap1_time_awoke, format = data.time.format),
                                 strptime(Nap1_time_fell_asleep, format = data.time.format),
                                 units = "mins"),
         Time.of.Nap2 = difftime(strptime(Nap2_time_awoke, format = data.time.format),
                                 strptime(Nap2_time_fell_asleep, format = data.time.format),
                                 units = "mins"),
         Time.of.Nap3 = difftime(strptime(Nap3_time_awoke, format = data.time.format),
                                 strptime(Nap3_time_fell_asleep, format = data.time.format),
                                 units = "mins")) %>%
  select(ID, contains("Time.of.Nap")) %>%
  # one row per nap, then summarise per dispatcher
  gather(Nap.Type, Nap.Length, Time.of.Nap1:Time.of.Nap3) %>%
  group_by(ID) %>%
  summarise(Nap.avg = mean(Nap.Length, na.rm = TRUE), Nap.count = sum(!is.na(Nap.Length))) %>%
  left_join(background.data, by = "ID") %>%
  select(ID, contains("Nap."), contains("Alert_at_Work"))
# Build a lookup table of Alert_at_Work codes -> descriptions from the data dictionary
Alert.at.Work <- background.data.dict[(background.data.dict$Name == "Alert_at_Work"), ]
Description <- as.data.frame(gsub("=", "", unlist(str_extract_all(Alert.at.Work$Definition, "=\\w+"))))
Alert.at.Work <- setDT(Description, keep.rownames = TRUE)
colnames(Alert.at.Work) <- c("ID", "Description")
Alert.at.Work$Description <- as.character(Alert.at.Work$Description)

# Replace the numeric survey codes with their descriptions and set missing values to NA
dis.data$Alert_at_Work <- mapvalues(dis.data$Alert_at_Work,
                                    from = Alert.at.Work$ID, to = Alert.at.Work$Description)
dis.data[dis.data == "NaN"] <- NA
dis.data$Alert_at_Work <- factor(dis.data$Alert_at_Work,
                                 levels = unique(Alert.at.Work$Description))
In the data transformation above, I calculate the length of each nap, then group by ID and calculate the average nap length and the total count of naps for each dispatcher. I then join on the background data to get the reported feeling of alertness at work. The remaining transformation reads the value labels from the data dictionary and cleans up the data frame for the next step of plotting.
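To make the reshape-and-summarise step concrete, here is a minimal sketch on a made-up two-dispatcher log (the values are hypothetical, not from the data set):

toy <- data.frame(ID = c(1, 1, 2),
                  Time.of.Nap1 = c(30, 45, NA),  # hypothetical nap lengths in minutes
                  Time.of.Nap2 = c(60, NA, NA))
toy %>%
  gather(Nap.Type, Nap.Length, Time.of.Nap1:Time.of.Nap2) %>%
  group_by(ID) %>%
  summarise(Nap.avg = mean(Nap.Length, na.rm = TRUE),
            Nap.count = sum(!is.na(Nap.Length)))
# ID 1: Nap.avg = 45, Nap.count = 3; ID 2: Nap.avg = NaN, Nap.count = 0
# (dispatchers with no recorded naps come out as NaN, hence the "NaN" -> NA replacement above)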
ggplot(dis.data, aes(Nap.count, fill = Alert_at_Work)) +
  geom_density(alpha = .3) +
  ggtitle("Density plot of number of naps \nwithin two week period for Railroad Dispatchers") +
  facet_wrap(~Alert_at_Work, nrow = length(Alert.at.Work$Description)) +
  xlab("Count of naps within two week period") +
  theme_minimal() +
  theme(legend.position = "none")
ggplot(dis.data, aes((Nap.avg / 60), fill = Alert_at_Work)) +
  geom_density(alpha = .3) +
  facet_wrap(~Alert_at_Work, nrow = length(Alert.at.Work$Description)) +
  ggtitle("Density plot of average length of naps \nwithin two week period for Railroad Dispatchers (in hours)") +
  xlab("Average length of naps within two week period (in hours)") +
  theme_minimal() +
  theme(legend.position = "none")
From the above density plots, we see that those who report always feeling alert at work seem to take fewer naps, but for a longer time, than others. However, it is interesting that those reporting never feeling alert seem to have distributions similar to those reporting always feeling alert. It would be interesting to review other aspects of the background survey against the nap summary data to see more possible differences.
Jeffrey Nieman proposed some interesting questions related to the calorie intake of those who eat out compared to those who do not. I find the subject fascinating as well, so I chose his data set to inspect.
Jeffrey extracted the data from this page and provided a nice CSV file. Many thanks to Jeffrey.
cal.url <- file(paste(url,"calories and restaurants - Jeffrey Nieman.csv", sep = ""), open="r" )
cal.intake <- read.csv(cal.url, sep=",", header=TRUE, stringsAsFactors = FALSE)
cal.intake$Gender[cal.intake$Gender == ""] <- NA
cal.intake <- cal.intake %>%
  fill(Gender) %>%
  mutate(percent.of.cals.from.rest =
           intake.kcal.from.restaurants.for.all.individuals /
           Total.Intake.kcal.for.all.individuals) %>%
  select(Gender, Age, percent.of.cals.from.rest)
cal.intake$Age <- factor(cal.intake$Age, levels = unique(cal.intake$Age))
In the above transformation, I take the calorie intake data and fill in the missing gender fields, then divide restaurant calorie intake by total intake to get the percent of calories from restaurants. Finally, I clean up the data set so it is appropriately formatted for graphing.
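As a quick illustration of the fill() step, here is a minimal sketch on made-up rows (not from Jeffrey's file): fill() carries the last non-missing Gender value down through the NA rows.

toy <- data.frame(Gender = c("Male", NA, NA, "Female", NA),
                  Age = c("2 to 5", "6 to 11", "12 to 19", "2 to 5", "6 to 11"))
toy %>% fill(Gender)
# Gender becomes: Male, Male, Male, Female, Female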
ggplot(cal.intake, aes(x = Age, y = percent.of.cals.from.rest, fill = Gender)) +
  ylab("Percent of Overall Calories from Restaurants\n") +
  xlab("\nAge Group") +
  scale_y_continuous(labels = percent) +
  theme_classic() +
  ggtitle("Percentage of Calories obtained from Restaurants by Age Group and Gender\n") +
  geom_bar(stat = "identity", position = "dodge", width = .7)
colnames(cal.intake)[3] <- "Percent of Calories obtained from Restaurants"
cal.intake[,3] <- percent(cal.intake[,3])
kable(cal.intake, align = "c")
| Gender | Age | Percent of Calories obtained from Restaurants |
|---|---|---|
| Male | 2 to 5 | 12.9% |
| Male | 6 to 11 | 13.1% |
| Male | 12 to 19 | 29.1% |
| Male | 20 to 39 | 33.3% |
| Male | 40 to 59 | 22.9% |
| Male | 60 + | 19.0% |
| Female | 2 to 5 | 11.6% |
| Female | 6 to 11 | 18.8% |
| Female | 12 to 19 | 28.5% |
| Female | 20 to 39 | 28.9% |
| Female | 40 to 59 | 23.3% |
| Female | 60 + | 18.3% |
The results are very interesting; I am surprised at the share of calories obtained from restaurants for the 2 to 5 age group. I would not have thought that this group would be obtaining 11.6% of its calories from restaurants for females and 12.9% for males. I would be interested to find out more about this rate.
Musa Ganiyu proposed an interesting data set that looks at Global Life Expectancy data from the WHO. I am interested in this data set because I find the probability of death at certain ages an important topic.
life.exp.url <- file(paste(url,"life expectancy - Musa Ganiyu.csv", sep = ""), open="r" )
life.exp <- read.csv(life.exp.url, sep=",", header=TRUE, stringsAsFactors = FALSE)
# The CSV has a two-row header: combine the column names with the first data row,
# then strip the "X" prefixes and ".1"/".2" suffixes that read.csv adds to duplicate names
colnames(life.exp) <- paste(colnames(life.exp), life.exp[1, ], sep = " ")
colnames(life.exp) <- str_trim(gsub("\\.[0-9]\\s", " ", gsub("[X]", "", colnames(life.exp))))
life.exp <- life.exp[-1, ]
life.exp <- life.exp %>%
  select(Indicator, `Age Group`, contains("male")) %>%
  filter(Indicator == "nqx - probability of dying between ages x and x+n") %>%
  gather("Observation", "Probability", `2013 Female`:`1990 Male`) %>%
  extract(Observation, c("Year", "Gender"), "([[:alnum:]]+)\\s([[:alnum:]]+)")
life.exp$`Age Group` <- factor(life.exp$`Age Group`, levels = unique(life.exp$`Age Group`))
ggplot(life.exp, aes(x = `Age Group`, y = as.numeric(Probability), group = Year)) +
  facet_wrap(~Gender, nrow = 2) +
  ylab("Probability of Dying at Age Group\n") +
  ggtitle("Probability of Dying by Age Group and Gender from 1990 to 2013") +
  theme_light() +
  geom_line(aes(colour = Year, group = Year)) +
  scale_y_continuous(labels = percent) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top")
Looking at the data, there do not appear to be any significant findings. The probability of death increases to 100% once we reach 100 years old, which is expected, as is the decreasing probability of death over the years with medical advances. My thought is that finding another element by which to further divide the groups, such as distinct income brackets, may be more informative.
life.exp$Gender <- as.factor(life.exp$Gender)
life.exp$Year <- as.numeric(life.exp$Year)
life.exp$Probability <- as.numeric(life.exp$Probability)
colnames(life.exp)[2] <- "Age_Group"
### the above formats the life.exp data frame into an easier-to-use format

fit <- lm(Probability ~ Age_Group + Year + Gender, data = life.exp)
new.year <- 2015
# prediction grid: every Age_Group/Gender combination for 2015
life.exp2015 <- expand.grid(Age_Group = unique(life.exp$Age_Group),
                            Gender = unique(life.exp$Gender))
life.exp2015$Year <- new.year
life.exp2015$Probability <- predict(fit, newdata = life.exp2015)
### the above block creates our model, then predicts the new values for 2015

life.exp <- life.exp[, -1]
life.exp <- life.exp[, c(1, 3, 2, 4)]
life.exp <- rbind(life.exp, life.exp2015)
life.exp$Year <- as.factor(life.exp$Year)
### the above block combines the observed data with the 2015 predictions
ggplot(life.exp, aes(x = Age_Group, y = as.numeric(Probability), group = Year)) +
  facet_wrap(~Gender, nrow = 2) +
  ylab("Probability of Dying at Age Group\n") +
  ggtitle("Probability of Dying by Age Group and Gender\n from 1990 to 2013 with a 2015 prediction") +
  theme_light() +
  geom_line(aes(colour = Year, group = Year)) +
  scale_y_continuous(labels = percent) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top")
The above is a prediction of the probability of death for each age group in the year 2015. Obviously there are a number of issues to be worked out: the prediction goes negative for females, which is impossible for a death probability. Also, the curve is logarithmic and I am not entirely sure how to set up my model for that, so I will have to accept this less than optimal solution until a later date when I have improved my prediction skills. I think Musa proposed a very interesting question and I look forward to seeing some possible solutions to the prediction.
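One possible way around the negative predictions, sketched below, is to fit on the logit scale so the predictions are constrained to lie between 0 and 1. This is only a sketch of an alternative, not a final model: it assumes it is run in place of the lm() fit above, while Probability still holds the observed values and Year is still numeric.

# A sketch only: a quasibinomial GLM keeps predicted probabilities in (0, 1).
# Assumes life.exp with numeric Year and observed (not predicted) Probability values.
logit.fit <- glm(Probability ~ Age_Group + Year + Gender,
                 family = quasibinomial(link = "logit"),
                 data = life.exp)
life.exp2015$Probability <- predict(logit.fit, newdata = life.exp2015, type = "response")

Whether this form actually fits the shape of the curve any better would still need to be checked against the observed years.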