Due on or before 15 June 2016 at 23:59

In your final paper, you will conduct an extensive set of analyses in R on a dataset of your choice. You can use any dataset you’d like. If you have data for a Bachelor’s or Master’s thesis, you are very welcome to use that. Don’t worry if you do not have a dataset. I’ll show you how to get one in the section “How do I get a dataset?”.

Our last two class periods (25 May and 1 June) are entirely dedicated to giving you time to work on your papers in class. I encourage you to use this time to work on your papers and get help from me and your fellow R pirates.

Can I work with someone else on the same dataset?

Yes. You are welcome to use the same dataset as another student in the class, and you are welcome to work together on your analyses. However, you must each write your own code and text and turn in your own work! If I suspect that you did not write your paper / code, I reserve the right to ask you to explain it to me personally. If you cannot convince me that your code is your own, you will get a 0 on the assignment.

How do I get a dataset?

You need to use a new dataset that we have not already used in class. Ideally, you should use a dataset with at least 100 rows and 5 columns. If you really want to analyze a dataset with fewer rows or columns, come talk to me. If you do not already have a dataset that you’d like to analyze, here are a few places to get one.

  1. Datasets from the UCI machine learning database (http://archive.ics.uci.edu/ml/datasets.html). This database has 307 datasets from many fields. If you want to use one of these datasets, I recommend you use those with a Multivariate data type and a Regression default task (http://goo.gl/hm2v4B).

  2. Financial datasets: Several financial datasets are available at the following link (http://www.r-bloggers.com/financial-data-accessible-from-r-part-iii/).

  3. The Flights dataset. This dataset contains data from all flights leaving the Houston airport in one year. You can access the data at http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt

  4. Create your own new dataset! You are welcome to collect your own data by, for example, conducting a survey.
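Once you have a candidate dataset loaded, a quick check against the size guideline above might look like the sketch below. I am using the built-in iris dataframe purely as a stand-in for your own data, and the object names (my.data, n.rows, n.cols, meets.guideline) are made up for illustration.

```r
# Stand-in data: replace iris with the dataframe you actually loaded
my.data <- iris

# Count the rows and columns
n.rows <- nrow(my.data)
n.cols <- ncol(my.data)

# TRUE if the data meet the recommended minimum of 100 rows and 5 columns
meets.guideline <- n.rows >= 100 && n.cols >= 5
meets.guideline
```

If the result is FALSE for a dataset you still want to use, that is the case where you should come talk to me.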

How should I write and format my paper?

Your paper will be a combination of text, R code, and R output. Your paper does not need to be formatted in any particular style (though it would be good practice for you to write an APA style paper). However, you do need to include your name, the date, and a title at the beginning of the document.

Here are three ways you can write your paper (in order of preference):

  1. LaTeX. LaTeX is the most elegant and easiest way to write APA style papers. You can easily include all of your R code and output using Sweave in RStudio. We’ll go over how to write APA style papers with R code in LaTeX later.

  2. Markdown. We’ve already gone over the basics of Markdown.

  3. Word. This is the least elegant solution. However, if you insist on using it you can.

Including R code and output

When you refer to an analysis in your text, you must display both your R code and R output at the same location. If you use LaTeX or Markdown, this will happen automatically. However, if you use Word, you will need to take a screenshot of both your R code and your output from RStudio, and paste it into your Word document.

What should be in my paper?

There should be four sections in your paper: Dataset description, Questions, Analyses, and Conclusion.

Section 1: Dataset Description

Your paper should start with a description of the dataset. In describing your data, you must answer the following three questions (in paragraph form). Be as descriptive as you can, but if you don’t know exactly how the data were collected (perhaps because you got them online), that’s OK. Just say as much as you can.

  1. How did you obtain the dataset?
  2. How were the data originally collected?
  3. What are the columns in the dataset? For each column, give the variable name and a brief description of what it represents. You only need to describe columns that you actually use in your analysis.

Section 2: Questions

Next, you should list 5-10 questions that you would like to answer. For example, if I were analyzing the ChickWeight dataset, I could ask the following:

  1. How did the chicken weights generally change over time?
  2. Was there a difference in the average chicken weights as a result of the different diets?
  3. Were the chicken weights at time 1 normally distributed?
  4. Was there a difference in weights between time 2 and time 4?
  5. Did more chickens die in one diet than another?

Section 3: Analyses (including Tasks)

In this section, you should conduct the relevant tasks to answer each of your 5-10 research questions. You need to complete each of the following tasks at some point in your paper. You do not need to restrict yourself to one task for each question. For example, to answer the question “How did the chicken weights change over time?”, I could create a pirateplot (task 10), calculate a regression (task 15), and/or calculate the mean weight at each time point (task 11).

Tasks

  1. Display the first few rows of your dataframe.

  2. Display the number of rows and columns in the dataframe.

  3. Show summary statistics of every column in your dataframe.

  4. Recode the values of at least one column using indexing and reassignment. For example, in a column of sex data, you could change “female” to 1, and “male” to 0.

  5. Calculate at least one standard deviation, one mean, and one median.

  6. Count the number of outliers in a numerical vector. Define an outlier as any datapoint that is more than 3 standard deviations away from the mean.

  7. Print a table of the frequencies of outcomes of a categorical variable.

  8. Create at least one scatterplot containing data from two different groups (e.g., a set of green points and a set of red points representing different groups). Include a legend.

  9. Create at least one histogram and add additional elements (e.g., lines and/or text) showing the mean and median of the data.

  10. Using a pirateplot, beanplot, or boxplot, show the distribution of a dependent variable for each level of a categorical independent variable.

  11. Use the aggregate function or dplyr to calculate descriptive statistics across groups of data.

  12. Calculate at least one t-test. Write your results in APA format.

  13. Calculate at least one correlation test. Write your results in APA format.

  14. Calculate at least one chi-square test. Write your results in APA format.

  15. Calculate at least one regression analysis. Write your results in APA format.

  16. Calculate at least one one-way ANOVA (1 independent variable). Calculate post-hoc tests for any significant effects. Write your results in APA format.

  17. Calculate at least one ANOVA with 2 or more independent variables. Calculate post-hoc tests for any significant effects. Write your results in APA format.

  18. Create (and use!) at least one new custom function. For example, you could create a function called my.histogram() that creates a histogram with colors you like. Or you could create a function called find.outliers() that looks for outliers in a vector and then tells you where they are in that vector.
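To make a few of these tasks concrete, here is a minimal sketch of tasks 4, 6, 7, and 18 on the ChickWeight data. It is only one possible approach: the column name phase, its labels "early" and "late", and the cutoff at day 10 are invented for illustration, and find.outliers() is a hypothetical custom function of the kind described in task 18.

```r
# Work on a plain dataframe copy of the built-in ChickWeight data
chicks <- as.data.frame(ChickWeight)

# Task 4: recode with indexing and reassignment
# (the phase labels and the day-10 cutoff are arbitrary choices)
chicks$phase <- NA
chicks$phase[chicks$Time <= 10] <- "early"
chicks$phase[chicks$Time > 10] <- "late"

# Task 18: custom function returning the positions of outliers,
# defined as values more than n.sd standard deviations from the mean
find.outliers <- function(x, n.sd = 3) {
  z <- abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  which(z > n.sd)
}

# Task 6: count the outliers in a numerical vector
outlier.positions <- find.outliers(chicks$weight)
n.outliers <- length(outlier.positions)
n.outliers

# Task 7: frequency table of a categorical variable
table(chicks$phase)
```

Writing the outlier rule as a function means you can reuse it on any numeric column, and the n.sd argument lets you try a different cutoff without rewriting the code.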

4 Steps for completing each task

  1. Write what you are going to do (e.g., “I will calculate the mean age for each time period”).

  2. Display your nicely formatted R code with appropriate comments.

  3. Display the output from your R code.

  4. Interpret what you found (e.g., “The mean age at time period 2 was XX”). If the task involves a statistical test, report a full APA style conclusion in your text (either by typing it manually, or by using mini-chunks in LaTeX or Markdown). Do not just include the apa() function in your R code.
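For step 4, one way to build the numbers for an APA style sentence directly from a test object is sketched below. This is an illustration, not a required method: the comparison (Diet 1 versus Diet 3 on the last measurement day) and the object names (final, diet1, diet3, test, p.text, apa.text) are my own choices, and the test is R’s default Welch t-test.

```r
# Keep only the final measurement day so each chick appears once
final <- subset(as.data.frame(ChickWeight), Time == 21)

# Weights for the two diets being compared
diet1 <- final$weight[final$Diet == "1"]
diet3 <- final$weight[final$Diet == "3"]

# Welch two-sample t-test (the default in t.test)
test <- t.test(diet1, diet3)

# APA style drops the leading zero of p and reports very small values as p < .001
p.text <- if (test$p.value < .001) {
  "p < .001"
} else {
  paste0("p = ", sub("^0", "", sprintf("%.3f", test$p.value)))
}

# Combine degrees of freedom, test statistic, and p-value into one string
apa.text <- sprintf("t(%.2f) = %.2f, %s",
                    test$parameter, test$statistic, p.text)
apa.text
```

You would still write the full sentence around this string yourself, including the group means and the direction of the difference.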

Here is an example of how to complete a task in the context of a question.

To start my analysis, I calculated the mean weight of chickens separately for each diet.

# Calculate mean weight for each diet
aggregate(weight ~ Diet, 
          data = ChickWeight, 
          FUN = mean)
##   Diet   weight
## 1    1 102.6455
## 2    2 122.6167
## 3    3 142.9500
## 4    4 135.2627

My results showed that chickens on Diet 1 had the smallest mean weight (102.65), and chickens on Diet 3 had the highest mean weight (142.95).
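The same pattern works for a plotting task. As a second sketch (again on ChickWeight, with the title, axis label, colors, and line styles chosen arbitrarily), here is a histogram with added lines for the mean and median, as asked for in task 9.

```r
# Histogram of all chicken weights with the mean and median marked
weights <- ChickWeight$weight

hist(weights,
     main = "Chicken weights",
     xlab = "Weight (g)",
     col = "gray")

# Solid line for the mean, dashed line for the median
abline(v = mean(weights), lwd = 2, lty = 1)
abline(v = median(weights), lwd = 2, lty = 2)

# Legend explaining the two lines
legend("topright",
       legend = c("Mean", "Median"),
       lty = c(1, 2),
       lwd = 2)
```

In the paper, this code would be followed by the plot itself and a sentence or two interpreting it, for example by comparing the positions of the two lines.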

Section 4: Conclusion

Write a brief summary of the main conclusions from your results in 1-3 paragraphs. You don’t need to go nuts here.

How do I submit my paper?

Submit your paper like a regular WPA by emailing your document to me by 15 June 2016 at 23:59.

How will I be graded?

I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.

  • Is the paper nicely formatted?
  • Are all four sections present?
  • Were all tasks completed?
  • Is the code properly formatted – that is, can I read it and understand it?
  • Are there comments in the code where appropriate?

If the answer to all these questions is “Yes”, you will get a good grade.