Due on or before 23 December 2016 at 23:59
In your final paper, you will conduct and interpret analyses in R on a dataset of your choice. The goal of this assignment is to reinforce the main topics in the course, and to give you one document that you can look back on to help you in your future R adventures.
Our last two class periods (14 Dec and 21 Dec) are entirely dedicated to giving you time to work on your papers in class. I encourage you to use this time to work on your papers and get help from me and your fellow R pirates.
Can I work with someone else on the same dataset?
Yes! You are welcome to use the same dataset as another student in the class, and you are welcome to work together on your analyses. However, you must each write your own code and text and turn in your own work! If I suspect that you did not write your paper / code, I reserve the right to ask you to explain it to me personally. If you cannot convince me that your code is your own, you will not pass the class.
How do I get a dataset?
You can use any new dataset you’d like (it can’t be one we’ve already used in class). If you have data for a Bachelor’s or Master’s thesis, you are very welcome to use that. Ideally, your dataset should have at least 100 rows and 5 columns – if you have a dataset you really want to analyze with fewer rows or columns, come talk to me.
If you do not have a dataset already that you’d like to analyze, here are a few places to get one.
- Published psychology datasets. In recent years, more and more researchers have been posting their raw data online. This is a great development for science, and gives you real-world datasets to analyze! Here are a few options:
- Harvard dataverse: https://dataverse.harvard.edu/dataverse/harvard. Harvard University publishes thousands of open datasets. You can search for one related to psychology.
- Journal of Judgment and Decision Making: http://journal.sjdm.org/. The Journal of Judgment and Decision Making requires all recent authors to publish their data. If you look at issues after 2014, you should find raw data for most if not all published articles.
- Machine Learning datasets
- UCI machine learning database: http://archive.ics.uci.edu/ml/datasets.html. This database has 307 datasets from many fields. If you want to use one of these datasets, I recommend you use those with a Multivariate data type and a Regression default task (http://goo.gl/hm2v4B).
- Kaggle: https://www.kaggle.com/datasets. Kaggle is a data science website with hundreds of datasets. Some of these datasets are very, very large, so I recommend you look for one that is on the smaller side.
- R-Bloggers: https://www.r-bloggers.com/r-packages-for-data-access/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29. This page has many links to R packages containing data.
- Other
- ClinicalTrials.gov: https://clinicaltrials.gov/ct2/home. ClinicalTrials.gov has hundreds of clinical datasets.
- Financial datasets: http://www.r-bloggers.com/financial-data-accessible-from-r-part-iii/. Several financial datasets are available here.
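Once you have a dataset saved as a text file, reading it into R might look like the sketch below. It is only an illustration: to keep it self-contained, it first writes a few rows of the built-in ChickWeight data to a temporary tab-delimited file, and the names example.path and mydata are placeholders I made up. With your own file, you would skip the write.table() step and point read.table() at your file, with sep set to whatever delimiter your file actually uses.

```r
# Stand-in for a downloaded dataset: write six rows of ChickWeight
# to a temporary tab-delimited text file (hypothetical example file)
example.path <- tempfile(fileext = ".txt")
write.table(head(as.data.frame(ChickWeight)),
            file = example.path,
            sep = "\t",
            row.names = FALSE,
            quote = FALSE)

# Read the text file into a dataframe
mydata <- read.table(file = example.path,
                     header = TRUE,  # First row contains column names
                     sep = "\t")     # Columns are separated by tabs

# Check that the data arrived as expected
str(mydata)    # Structure and column types
head(mydata)   # First few rows
names(mydata)  # Column names
```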
How should I write and format my paper?
Your paper will be a combination of text, R code, and R output. Your paper does not need to be formatted in any particular style (though it would be good practice for you to try to approximate APA style). However, you do need to include your name, date, and a title of your paper at the beginning of the document.
Here are three ways you can write your paper (in order of my recommendation):
Markdown. Markdown is the most elegant and easiest way to write documents. With Markdown, you can easily include all of your R code, output, and writing in one document. We’ll go over how to write APA style papers with R code in Markdown later in the course.
LaTeX. We won’t cover LaTeX in this course, but if you already use it, you are welcome to use it for this course.
Word. This is the least elegant solution. It will require you to copy and paste all of your code and results from R to Word. However, if you insist on using it, you can.
Including R code and output
When you refer to an analysis in your text, you must display both your R code and R output at the same location. If you use Markdown, this will happen automatically. However, if you use Word, you will need to take a screenshot of both your R code and your output from RStudio, and paste it into your Word document.
What should be in my paper?
There should be four sections in your paper: Dataset description, Questions, Analyses, and Conclusion.
Section 1: Dataset Description
Your paper should start with a description of the dataset. In describing your data, you must answer the following questions (in paragraph form). Be as descriptive as you can, but if you don’t know exactly how the data were collected (perhaps because you got them online) that’s ok. Just say as much as you can.
- How did you obtain the dataset?
- How were the data originally collected?
- What are the columns in the dataset? For each column, give the variable name and a brief description of what it represents. You only need to describe columns that you actually use in your analysis.
Section 2: Questions
Next, you should list 5-10 questions that you would like to answer. For example, if I were analyzing the ChickWeight dataset, I could ask the following:
- How did the chicken weights generally change over time?
- Was there a difference in the average chicken weights as a result of the different diets?
- Were the chicken weights at time 1 normally distributed?
- Was there a difference in weights between time 2 and time 4?
- Did more chickens die in one diet than another?
Section 3: Analyses (including Tasks)
This is the main section of your document. In this section, you should conduct the relevant tasks to answer each of your 5-10 research questions. You need to complete each of the following tasks at some point in your paper. However, you can complete them in any order you wish, and you do not need to restrict yourself to one task for each question.
Task Checklist
- Read the data into R from a text file using read.table().
- Show summary statistics of your dataframe using summary() or str().
- Display the first few rows of your dataframe with head().
- Print the names of the columns in the dataframe with names().
- Recode the values of at least one column using indexing and assignment. For example, in a column of sex data, you could change “female” to 1, and “male” to 0.
- Calculate at least one standard deviation with sd(), one mean with mean(), and one median with median().
- Print a table of the frequencies of outcomes of a categorical variable with table().
- Count the number of outliers in one numerical vector. Define an outlier as any datapoint that is more than 3 standard deviations away from the mean.
- Create at least one scatterplot with plot(). Include a legend with legend() and a regression line.
- Create at least one histogram with hist().
- Create at least one pirateplot with pirateplot().
- Use aggregate() or dplyr to calculate descriptive statistics across groups of data.
- Calculate at least one t-test with t.test(). Write your results in APA format.
- Calculate at least one t-test on a subset of data by combining subset() with t.test(). For example, you could conduct a t-test comparing results from a performance test between two experimental groups, but only for females. Write your results in APA format.
- Calculate at least one correlation test with cor.test(). Write your results in APA format.
- Calculate at least one correlation test on a subset of data by combining subset() with cor.test(). For example, you could conduct a correlation between height and weight but only for males. Write your results in APA format.
- Calculate at least one chi-square test with chisq.test(). Write your results in APA format.
- Calculate at least one one-way ANOVA analysis with aov() with 1 independent variable. Calculate post-hoc tests for any significant effects. Write your results in APA format.
- Calculate at least one multiple-variable ANOVA analysis with aov() with 2 or more independent variables. Calculate post-hoc tests for any significant effects. Write your results in APA format.
- Calculate at least one regression analysis with lm() or glm(). Write your results in APA format.
- Create (and use!) at least one new custom function. For example, you could create a function called my.histogram() that creates a histogram with colors that you like. Or you could create a function called find.outliers() that looks for outliers in a vector and then tells you where they are in that vector.
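As a rough illustration of three of the tasks above (recoding with indexing and assignment, counting outliers, and writing a custom function), here is a minimal sketch using ChickWeight. The names chicks, diet1, and count.outliers() are hypothetical ones I chose for the example, and flagging Diet 1 is just an arbitrary recoding choice. Your own versions will depend on your dataset.

```r
# Work on a plain dataframe copy so the built-in dataset stays untouched
chicks <- as.data.frame(ChickWeight)

# Recode with indexing and assignment:
# the new column diet1 is 1 for chickens on Diet 1, and 0 for all others
chicks$diet1 <- 0
chicks$diet1[chicks$Diet == 1] <- 1

# Frequency table of the recoded column
table(chicks$diet1)

# Custom function (hypothetical name): count the values in x that are
# more than `cutoff` standard deviations away from the mean
count.outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  sum(abs(z) > cutoff, na.rm = TRUE)
}

# Use the function on the chicken weights
count.outliers(chicks$weight)
```

Wrapping the outlier rule in a function means the same definition (3 standard deviations by default) can be reused on any numeric column, and the cutoff can be changed in one place.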
An Example of completing some tasks in the form of a question
Here’s how to stay organized in completing your tasks.
Write what you are going to do (e.g., “I will calculate the mean age for each time period”).
Display your nicely formatted R code with appropriate comments.
Display the output from your R code.
Interpret what you found (e.g., “The mean age at time period 2 was XX”). If the task involves a statistical test, report a full APA style conclusion in your text (either by typing it manually, or by using mini-chunks in Markdown). Do not just include the apa() function in your R code (though you may use the apa() function to help you!).
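If you would rather build the APA text from your test object than type the numbers by hand, one possible approach is sketched below for a correlation test. The helper apa.cor() is a hypothetical function written for this example (it is not the apa() function mentioned above), and its output format is a simplified approximation of APA style, so check it against what you are expected to report.

```r
# Correlation test between time and weight in ChickWeight
weight.time <- cor.test(x = ChickWeight$Time,
                        y = ChickWeight$weight)

# Hypothetical helper: turn a Pearson correlation test object
# into a short APA-like string
apa.cor <- function(test) {

  p <- test$p.value

  # Report very small p-values as "p < .01", otherwise print two decimals
  if (p < .01) {
    p.text <- "p < .01"
  } else {
    p.text <- paste0("p = ", sub("^0", "", sprintf("%.2f", p)))
  }

  paste0("r(", test$parameter, ") = ",
         sprintf("%.2f", test$estimate), ", ",
         p.text)
}

# Print the formatted result
apa.cor(weight.time)
```

In a Markdown document, you could then place the call in an inline mini-chunk, for example `r apa.cor(weight.time)`, inside a sentence that interprets the result in your own words.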
For example, in the following, I will ask questions about chicken weights, run the appropriate R code, display the results, and then interpret them.
Was there an effect of diet on chicken weights? To start my analysis, I calculated the mean weight of chickens separately for each diet.
# Calculate mean weight for each diet
aggregate(weight ~ Diet,
          data = ChickWeight,
          FUN = mean)
## Diet weight
## 1 1 102.6455
## 2 2 122.6167
## 3 3 142.9500
## 4 4 135.2627
My results showed that chickens on Diet 1 had the smallest mean weight (102.65), and chickens on Diet 3 had the highest mean weight (142.95). To see if this effect was significant, I conducted a one-way ANOVA:
# One-way ANOVA of weight on Diet
weight.aov <- aov(formula = weight ~ Diet, data = ChickWeight)
# Print the results
summary(weight.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 3 155863 51954 10.81 6.43e-07 ***
## Residuals 574 2758693 4806
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA was significant, F(3, 574) = 10.81, p < .01, suggesting that diet did affect weights. To understand the nature of this difference, I conducted a post-hoc test:
# Post-hoc test on anova
TukeyHSD(weight.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ Diet, data = ChickWeight)
##
## $Diet
## diff lwr upr p adj
## 2-1 19.971212 -0.2998092 40.24223 0.0552271
## 3-1 40.304545 20.0335241 60.57557 0.0000025
## 4-1 32.617257 12.2353820 52.99913 0.0002501
## 3-2 20.333333 -2.7268370 43.39350 0.1058474
## 4-2 12.646045 -10.5116315 35.80372 0.4954239
## 4-3 -7.687288 -30.8449649 15.47039 0.8277810
Post-hoc tests revealed a significant difference between diets 1 and 3 (p < .05), and diets 1 and 4 (p < .05). No other comparisons were significant at the .05 level.
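To follow up one of these comparisons at a single time point, the subset() and t.test() task from the checklist could be sketched as below. This is an illustration rather than part of the analysis above: the choice of Time 21 and of Diets 1 and 3 is my own assumption, and the names final.13 and final.ttest are made up for the example.

```r
# Keep only the final measurement (Time 21) for chickens on Diets 1 and 3
final.13 <- subset(x = ChickWeight,
                   subset = Time == 21 & Diet %in% c(1, 3))

# Two-sample t-test comparing weight between the two remaining diets
# (t.test() uses the Welch test by default)
final.ttest <- t.test(weight ~ Diet,
                      data = final.13)

# Print the results
final.ttest
```

As with the ANOVA, the printed output would then be followed by an APA style sentence that reports the t statistic, degrees of freedom, and p-value, along with what they mean for the question.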
Section 4: Conclusion
Write a brief summary of the main conclusions from your results in 1-3 paragraphs. You don’t need to go nuts here.
How do I submit my paper?
Submit your paper like a regular WPA by emailing your document to me by 23 December 2016 at 23:59.
How will I be graded?
I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.
- Are all four sections present?
- Were all tasks completed? If not, did the author ask me for help (before the final deadline)?
- Is the code properly formatted and commented – that is, can I read it and understand it?
- Is the paper nicely formatted? That is, does it look like the student put effort into making a nice document that (s)he will be proud to show to his/her grandchildren?
If the answer to all these questions is “Yes” (or “Mostly Yes”), you will be just fine.