by Jose Rey, 2014. For the Data Management for Clinical Research MOOC.
This is an exploration of the data from an anonymous survey about the FIFA Soccer World Cup. It is part of the final project for the Data Management for Clinical Research Coursera MOOC. This pilot survey has a total of 53 participants.
Required libraries
require(ggplot2)
Read the data exported from REDCap. Both the '.R' and the '.csv' files must be in the working directory. When sourced within the R programming environment, the '.R' file loads the '.csv' file into memory, in a variable called 'data'.
source('TheFIFASoccerWorldCu_R_2014-07-12_0204.r')
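Some data cleaning: drop row 17, then build two subsets, one restricted to male and female respondents (data2) and one restricted to single and married respondents (data3).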
data <- data[-17,]
data2 <- data[data$gender%in%c(1,2),] # only male and female
data3 <- data[data$marital_status%in%c(1,2),] # only single and married
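The filtering steps above can be tried on a toy data frame. This is only a sketch: the `toy` data frame and its values are made up, and it assumes, as the code comments suggest, that codes 1 and 2 are the categories to keep. It also shows why `%in%` is a convenient choice here: a missing value simply fails the test, so no all-NA rows sneak into the subset.

```r
# Hypothetical toy data; these are not the survey responses
toy <- data.frame(
  gender         = c(1, 2, 3, 1, 2, NA),
  marital_status = c(1, 2, 1, 3, 2, 1)
)

toy  <- toy[-4, ]                               # drop one row by position
toy2 <- toy[toy$gender %in% c(1, 2), ]          # keep gender codes 1 and 2
toy3 <- toy[toy$marital_status %in% c(1, 2), ]  # keep marital status codes 1 and 2

nrow(toy2)  # rows left after the gender filter
nrow(toy3)  # rows left after the marital status filter
```

With `toy$gender == 1 | toy$gender == 2` instead of `%in%`, the row with a missing gender would come back as a row full of NA values.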
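A quick graph of a correlation we would expect to see, as a sanity check. In this graph, the horizontal axis is the level of involvement with sports on TV (sports_tv_involvement) and the vertical axis is the percentage of World Cup matches the participant expects to watch (percent_games).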
x_lb <- rev(levels(data$sports_tv_involvement.factor)) # labels to use
graph <- ggplot(data3, aes(sports_tv_involvement, percent_games))
graph + stat_smooth(method='lm', formula=y~I(x^2), lwd = 2) +
  geom_point(size=4, aes(colour=marital_status.factor)) + # size, not lwd, sets the point size
  scale_x_continuous(labels=x_lb)
As expected, there is a positive correlation with a decent R-squared (0.684):
summary(lm(percent_games~I(sports_tv_involvement^2), data=data3))
##
## Call:
## lm(formula = percent_games ~ I(sports_tv_involvement^2), data = data3)
##
## Residuals:
## What percentage of the soccer world cup matches do you expect to watch?
##    Min     1Q Median     3Q    Max
## -20.37  -6.42  -2.80  -0.95  41.73
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   2.799      2.610    1.07     0.29
## I(sports_tv_involvement^2)    8.619      0.975    8.84  1.5e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 36 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.684, Adjusted R-squared: 0.676
## F-statistic: 78.1 on 1 and 36 DF, p-value: 1.52e-10
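To read the fitted model more concretely, the two coefficients can be plugged back into the formula percent_games = intercept + slope * sports_tv_involvement^2. The sketch below does this by hand. The coefficients are copied from the summary above, but the involvement codes 1 to 3 are an assumption made for illustration, since the actual coding of the variable is not shown here.

```r
# Coefficients copied from the regression summary above
b0 <- 2.799  # (Intercept)
b1 <- 8.619  # I(sports_tv_involvement^2)

# Hypothetical involvement codes (assumption: the real coding may differ)
involvement <- 1:3

# Predicted percentage of matches watched: b0 + b1 * x^2
predicted <- b0 + b1 * involvement^2
data.frame(involvement, predicted)
```

Because the predictor enters squared, each extra involvement step adds more than the previous one, which is the curvature drawn by the smoothing line in the graph.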
This graph shows the density distribution of the expected percentage of games to watch (percent_games) for males and females. It doesn't really say much, but it is a cool graph.
qplot(percent_games, data=data2, geom="density", fill=gender.factor,
alpha=I(.5), ylab="Density")
Here, a different kind of density graph (a violin plot) is used to visually compare the distribution of physical activity (times exercised per week) by marital status.
p <- ggplot(data3, aes(marital_status.factor, sports_physical_involvment))
p + geom_violin(aes(fill=marital_status.factor)) +
geom_jitter(position = position_jitter(width = .02, height=.02), size=3)
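A violin plot is easier to interpret next to a few numbers. The sketch below computes the median activity per group with `tapply()`. It runs on a made-up data frame that reuses the column names from the plot above; the values and the "Single" and "Married" labels are assumptions, not survey results.

```r
# Hypothetical toy data with the same column names as the plot above
toy <- data.frame(
  marital_status.factor = factor(c("Single", "Single", "Single",
                                   "Married", "Married", "Married")),
  sports_physical_involvment = c(0, 2, 4, 1, 3, 5)
)

# Median times exercised per week, one value per marital status
medians <- tapply(toy$sports_physical_involvment,
                  toy$marital_status.factor, median)
medians
```

On the real data, the same call with `data3` in place of `toy` would give the values to compare against the widest part of each violin.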
Finally, this graph compares the trends in the relationship between age group and physical sport activity in our populations. It yields an interesting result: single people tend to exercise less with age (red), while married people tend to exercise more with age (blue). The regression doesn't show a very significant relationship; however, there seems to be a trend, at least for the surveyed population. The vertical axis corresponds to physical activity (times exercised per week).
x_lb <- levels(data$age_group.factor) # labels to use
graph <- ggplot(aes(age_group, sports_physical_involvment,
                    colour=marital_status.factor), data=data3)
graph + stat_smooth(method='lm', formula=y~x, lwd = 2, alpha=0.3) +
  geom_jitter(size=4, aes(colour=marital_status.factor), # size, not lwd, sets the point size
              position = position_jitter(width = .1, height=.1)) +
  scale_x_continuous(breaks=1:length(x_lb), labels=x_lb)
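The two trend lines in the graph correspond to one linear regression per marital status. A minimal sketch of how the slopes could be compared numerically is shown below. It runs on made-up data that reuses the column names from the plot; the values and the "Single" and "Married" labels are assumptions. The interaction model at the end is one possible way to ask whether the two slopes differ, not necessarily the analysis used for the graph.

```r
# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(
  age_group = rep(1:4, times = 2),
  marital_status.factor = factor(rep(c("Single", "Married"), each = 4)),
  sports_physical_involvment = c(4, 3, 3, 2,   # single: decreasing with age
                                 1, 2, 2, 3)   # married: increasing with age
)

# One regression per group, keeping only the age_group slope
slopes <- sapply(split(toy, toy$marital_status.factor), function(d) {
  coef(lm(sports_physical_involvment ~ age_group, data = d))[["age_group"]]
})
slopes

# Interaction model: the age_group:marital_status term is the slope difference
fit <- lm(sports_physical_involvment ~ age_group * marital_status.factor,
          data = toy)
slope_diff <- coef(fit)[["age_group:marital_status.factorSingle"]]
slope_diff
```

On the survey data, `summary(fit)` with `data3` in place of `toy` would report a p-value for the interaction term, which is the number that says whether the difference in trends is more than noise.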
Study data were collected and managed using REDCap electronic data capture tools hosted at the course's mirror server [1]. REDCap (Research Electronic Data Capture) is a secure, web-based application designed to support data capture for research studies, providing 1) an intuitive interface for validated data entry; 2) audit trails for tracking data manipulation and export procedures; 3) automated export procedures for seamless data downloads to common statistical packages; and 4) procedures for importing data from external sources.
[1] Paul A. Harris, Robert Taylor, Robert Thielke, Jonathon Payne, Nathaniel Gonzalez, Jose G. Conde. Research electronic data capture (REDCap) - A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009 Apr;42(2):377-81.