Due on or before 16 June 2017 at 23:59

In your final paper, you will conduct and interpret analyses in R on a dataset of your choice. The goal of this assignment is to reinforce the main topics in the course, and to give you one document that you can look back on to help you in your future R adventures.

Our last few class periods are entirely dedicated to giving you time to work on your papers in class. I encourage you to use this time to work on your papers and get help from me and your fellow R pirates.

What data can I analyse?

You need to analyze one new dataset that we have not already used in class. Ideally, your dataset should have at least 100 rows and 5 columns, where at least two columns are numeric and continuous (like height and age), and two columns are categorical (like sex and favorite movie). If you have a dataset you really want to analyze that does not satisfy these criteria, come talk to me and we’ll find a solution.

You can obtain a dataset from one of the following three methods:

Method 1: Use your own dataset

If you have data for a Bachelor’s or Master’s thesis, you are very welcome to use that. I don’t care if you have analyzed it in the past.

Method 2: Ask me for a randomly generated dataset

Send me an email with the subject “Please send me a dataset for my final paper!”. I will email a randomly generated dataset to you.

Method 3: Find a real dataset online

Ifyou would like to analyze a real dataset, here are a few places to get one. Note: I recommend you look for data in .csv or .txt format as these will be much easier to read than .xls or .sav files.

  1. Journal of Judgment and Decision Making: http://journal.sjdm.org/. The Journal of Judgment and Decision Making requires all recent authors to publish their data. If you look at issues after 2014, you should find raw data for most if not all published articles.

  2. Psychological Science: http://www.psychologicalscience.org/publications/psychological_science. Many articles published in Psychological Science in the past 2 years have open data.

  3. Harvard dataverse: https://dataverse.harvard.edu/dataverse/harvard. Harvard University publishes thousands of open datasets. You can search for one related to psychology.

  4. ClinicalTrials.gov: https://clinicaltrials.gov/ct2/home. ClinicalTrials.gov has hundreds of clinical datasets.

  5. Financial datasets: http://www.r-bloggers.com/financial-data-accessible-from-r-part-iii/. Several financial datasets are available here.

Do not spend too long trying to find a dataset! Many students in past years have spent most of their final class periods just finding data, rather than using that time to write their papers and get help. This is not the best use of your time. Instead, spend an hour or two outside of class looking for a dataset.

If you are unsuccessful finding a good dataset that you can get into R and understand, ask me for one!

Can I work with someone else on the same dataset?

Yes! You are welcome to use the same dataset as another student in the class, and you are welcome to work together on your analyses. However, you must each write your own code and text and turn in your own work! If I suspect that you did not write your paper / code, I reserve the right to ask you to explain it to me personally. If you cannot convince me that your code is your own, you will not pass the class.

How should I write and format my paper?

Your paper will be a combination of text, R code, and R output. Your paper does not need to be formatted in any particular style - though it would be good practice for you to try to approximate APA style using papaja. However, you do need to include your name, date, and a title of your paper at the beginning of the document.

Here are three ways you can write your paper (in order of my recommendation)

  1. Markdown, either regular markdown or in APA format using papaja. The result will be a pdf file in pseudo-APA format.

  2. Word. This is the least elegant solution. It will require you to copy and paste all of your code and and results from R to Word. However, if you insist on using it you can.

Including R code and output

When you refer to an analysis in your text, you must display both your R code and R output at the same location. If you use Markdown, this will happen automatically.

If you use Word, you will need to take a screenshot of both your R code and your output from RStudio, and paste it into your Word document. This will be a pain in the ass, I recommend using Markdown instead!!

What should be in my paper?

There should be four sections in your paper: Dataset description, Questions, Analyses, and Conclusion.

Section 1: Dataset Description

Your paper should start with a description of the dataset. In describing your data, you must answer the following questions (in paragraph form). Be as descriptive as you can, but if you don’t know exactly how the data were collected (perhaps because you got it online) that’s ok. Just say as much as you can.

  1. How did you obtain the dataset?
  2. How were the data originally collected?

Section 2: Questions

Next you should list 5-10 questions that you would like to answer. For example, if I was analyzing the ChickWeight dataset, I could ask the following:

  1. How did the chicken weights generally change over time?
  2. Was there a difference in the the average chicken weights as a result of the different diets?
  3. Were the chicken weights at time 1 normally distributed?
  4. Was there a difference in weights between time 2 and time 4?
  5. Did more chickens die in one diet than another?

You don’t need to go nuts here, just show me that you have some questions you want to answer before diving into analysing data.

Section 3: Analyses (including Tasks)

This is the main section of your document. In this section, you should conduct the relevant tasks to answer each of your 5-10 research questions. You need to complete each of the following tasks at some point in your paper. However, you can complete them in any order you wish, and you do not need to restrict yourself to one task for each question.

Task Checklist

  1. Read the data into R from a text file using read.table()

  2. Show summary statistics of your dataframe using summary() or str().

  3. Print the names of the columns in the dataframe with names()

  4. Recode the values of at least one column using indexing and assignment. For example, in a column of sex data, you could change “female” to 1, and “male” to 0.

# Hint: 
data <- data.frame(sex = c(1, 2, 1, 1, 2),
                   age = c(20, 32, 19, 22, 24))

# Create a new variable sex.2 with NA values
data$sex.2 <- NA

# Change values of sex.2 based on values of sex
data$sex.2[data$sex == 1] <- "male"
data$sex.2[data$sex == 2] <- "female"
  1. Calculate at least one standard deviation with sd(), one mean with mean(), and one median with median().

  2. Print a table of the frequencies of outcomes of a categorical variable with table().

  3. Count the number of outliers in one numerical vector. Define an outlier as any datapoint that is more than 3 standard deviations away from the mean.

## Hint:
outliers <- sum(data$x > ___ | data$x < __)
  1. Create at least one scatterplot with plot(). Include a regression line.

  2. Create at least one histogram with hist()

  3. Create at least one pirateplot with pirateplot()

  4. Use aggregate() or dplyr to calculate descriptive statistics across groups of data.

  5. Calculate at least one t-test with t.test(). Write your results in APA format.

  6. Calculate at least one t-test test on a subset of data by combining subset() with t.test(). For example, you could conduct a t-test comparing results from a performance test between two experimental groups, but only for females. Write your results in APA format.

  7. Calculate at least one correlation test with cor.test(). Write your results in APA format.

  8. Calculate at least one correlation test on a subset of data by combining subset() with cor.test(). For example, you could conduct a correlation between height and weight but only for males. Write your results in APA format.

  9. Calculate at least one chi-square test with chisq.test(). Write your results in APA format.

  10. Calculate at least one one-way ANOVA analysis with aov() with 1 independent variable. Calculate post-hoc tests for any significant effects. Write your results in APA format. If there are more than 4 paired comparisons in your analysis, you don’t need to describe every paired difference, just explain the most important ones.

  11. Calculate at least one multiple-variable ANOVA analysis with aov() with 2 or more independent variables. Calculate post-hoc tests for any significant effects. Write your results in APA format. If there are more than than 4 paired comparisons in your analysis, you don’t need to describe every paired difference, just explain the most important ones.

  12. Calculate at least one regression analysis with lm() or glm(). Write your results in APA format.

  13. Create (and use!) at least one custom function. For example, you could create a function called my.histogram() that creates a histogram with colors that you like. Or you could create a function called find.outliers() that looks for outliers in a vector and then tells you where they are in that vector.

  14. Create one loop (e.g.; to create multiple plots or to do calculations across several rows or columns)

An Example of completing some tasks in the form of a question

Here’s how to stay organized in completing your tasks.

  1. Write what you are going to do: (e.g.; “I will calculate the mean age for each time period”).

  2. Display your nicely formatted R code with appropriate comments.

  3. Display the output from your R code.

  4. Interpret what you found. (e.g.; “The mean age at time period 2 was XX”). If the task involves a statistical test, report a full APA style conclusion in your text (either by typing it manually, or by using mini-chunks in Markdown).

For example, in the following, I will ask questions about chicken weights, run the appropriate R code, display the results, and then interpret them.

Was there an effect of diet on Chicken weights? To start my analysis, I calculated the mean weight of chickens separately for each diet.

# Calculate mean weight for each diet
aggregate(formula = weight ~ Diet, 
          data = ChickWeight,
          FUN = mean)
##   Diet   weight
## 1    1 102.6455
## 2    2 122.6167
## 3    3 142.9500
## 4    4 135.2627

My results showed that chickens on Diet 1 had the smallest mean weight (102.65), and chickens on Diet 3 had the highest mean weight (142.95). To see if this effect was significant, I conducted a one-way ANOVA

# One-way ANOVA of weight on Diet
weight.aov <- aov(formula = weight ~ Diet, data = ChickWeight)

# Print the results
summary(weight.aov)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## Diet          3  155863   51954   10.81 6.43e-07 ***
## Residuals   574 2758693    4806                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The one-way ANOVA was significant F(3, 574) = 10.81, p < .01, suggesting that diet did affect weights. To understand the nature of this difference, I conducted a post-hoc test

# Post-hoc test on anova

TukeyHSD(weight.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ Diet, data = ChickWeight)
## 
## $Diet
##          diff         lwr      upr     p adj
## 2-1 19.971212  -0.2998092 40.24223 0.0552271
## 3-1 40.304545  20.0335241 60.57557 0.0000025
## 4-1 32.617257  12.2353820 52.99913 0.0002501
## 3-2 20.333333  -2.7268370 43.39350 0.1058474
## 4-2 12.646045 -10.5116315 35.80372 0.4954239
## 4-3 -7.687288 -30.8449649 15.47039 0.8277810

Post-hoc tests revealed a significant difference between diets 1 and 3 (p < .05), and diets 1 and 4 (p < .05). No other comparisons were significant at the .05 level.

What if some of these tasks don’t make sense for my data?

If some of these tasks don’t make sense for your data, explain why, then replace the task with another one at the appropriate level of difficulty that displays your R knowledge.

Section 4: Conclusion

Write a brief summary of your main conclusions in 1-3 paragraphs. You don’t need to go nuts here. Just write a few paragraphs that summarize the main conclusions from your results.

Example Final Paper

Here is an example of a final paper written on the ChickWeight dataset

HTML: http://rpubs.com/YaRrr/finalpaper_example_2017

How do I submit my paper?

Submit your paper like a regular WPA by emailing your documents (ie., in .R, .Rmd, .doc, or .pdf format) to me by 16 June 2017 at 23:59.

How will I be graded?

I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.

  • Are all four sections present?
  • Were all tasks completed (assuming they make sense for your data)? If not, the did author ask me for help (before the final deadline)?
  • Is the code properly formatted and commented – that is, can I read it and understand it?
  • Is the paper nicely formatted? That is, does it look like the student put effort into making a nice document that (s)he will be proud to show to his/her grandchildren?

If the answer to all these questions is “Yes” (or “Mostly Yes”), you will be just fine.