Data analysis and visualization in R: Final Paper

Due on or before 15 March 2016 at 23:59

In your final paper, you will conduct an extensive set of analyses in R on a dataset of your choice. You can use any dataset you’d like. If you have data for a Bachelor’s or Master’s thesis, you are very welcome to use that. If you do not have a dataset, see the section “How do I get a dataset?” below.

Our last two class periods (February 3 and February 10) are entirely dedicated to giving you time to work on your papers in class. I encourage you to use this time to work on your papers and get help from me and your fellow R pirates.

Can I work with someone else on the same dataset?

Yes. You are welcome to use the same dataset as another student in the class, and you are welcome to work together on your analyses. However, you must each write your own code and text and turn in your own work! If I suspect that you did not write your paper / code, I reserve the right to ask you to explain it to me personally. If you cannot convince me that your code is your own, you will get a 0 on the assignment.

How do I get a dataset?

You need to use a new dataset that we have not already used in class. You should use a dataset with at least 100 rows and 5 columns. If you do not already have a dataset that you’d like to analyze, here are a few places to get one.

Datasets from the UCI machine learning database (http://archive.ics.uci.edu/ml/datasets.html). This database has 307 datasets from many fields. If you want to use one of these datasets, I recommend you use those with a Multivariate data type and a Regression default task (http://goo.gl/hm2v4B).
Financial datasets: Several financial datasets are available at the following link (http://www.r-bloggers.com/financial-data-accessible-from-r-part-iii/).
The Flights dataset. This dataset contains data from all flights leaving the Houston airport in one year. You can access the data at http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt
Create your own new dataset! You are welcome to collect your own data by, for example, conducting a survey.
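
Whichever source you pick, it may help to confirm that the data meet the size requirement (at least 100 rows and 5 columns) before you start. The sketch below is only an illustration: the helper name meets_size_requirement is hypothetical, the commented-out file name and its header/separator settings are assumptions about your own file, and the two built-in datasets are used purely to show a passing and a failing case.

```r
# Hypothetical helper (not from the course materials): TRUE if a data frame
# has at least min_rows rows and min_cols columns.
meets_size_requirement <- function(df, min_rows = 100, min_cols = 5) {
  nrow(df) >= min_rows && ncol(df) >= min_cols
}

# Reading your own file might look like this. The file name, header and
# separator are assumptions; adjust them to match your data.
# my_data <- read.table("my_data.txt", header = TRUE, sep = "\t")

# Built-in examples, used here only to demonstrate the check
meets_size_requirement(airquality)    # 153 rows, 6 columns
meets_size_requirement(ChickWeight)   # 578 rows, but only 4 columns
```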

How should I format my paper?

Your paper must be written entirely in an RMarkdown document and knitted to HTML. Make sure to print all of your R code in the document by including “echo = T” in the chunk options (or leave it blank and it will print your code by default). You must include your name, date, and the title of the course on the first page of your paper. There is no minimum or maximum page length. It does not need to follow APA or any other style format.
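
One way to make sure every chunk prints its code is to set the option once in a setup chunk at the top of the document. This is a minimal sketch, assuming the knitr package (which RStudio uses when you knit); it is optional, since echo is on by default.

```r
# Placed in the first chunk of the .Rmd: show the code of every later chunk.
# echo = TRUE is already knitr's default; setting it makes the intent explicit.
# Writing TRUE rather than T is slightly safer, because T can be reassigned.
knitr::opts_chunk$set(echo = TRUE)
```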

If for some reason you absolutely cannot get R to knit your document to an HTML file, you can turn in your .Rmd markdown document via email. However, you may receive a 15% penalty to your grade.

What should be in my paper?

There should be four sections in your paper: Dataset Description, Questions, Analyses, and Conclusion.

Section 1: Dataset Description

Your paper should start with a description of the dataset. Make sure you answer these four questions (in paragraph form). Be as descriptive as you can, but if you don’t know exactly how the data were collected (perhaps because you got it online) that’s ok. Just say as much as you can.

How did you obtain the dataset?
How were the data originally collected?
How many rows and columns are in the dataset?
What are the columns in the dataset? For each column, give the variable name and a brief description of what it represents.
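
R can supply the raw facts for the last two questions; the written description is still up to you. The following is a sketch using the built-in ChickWeight data purely as a stand-in for your own dataset. The object names cw and column_overview are illustrative.

```r
# Stand-in dataset; replace with your own data frame
cw <- as.data.frame(ChickWeight)

dim(cw)     # number of rows and columns
names(cw)   # variable names
str(cw)     # compact overview of each column's type and first values

# A small table of variable names and classes to describe in the text
column_overview <- data.frame(
  variable = names(cw),
  class = sapply(cw, function(col) class(col)[1]),
  row.names = NULL
)
column_overview
```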

Section 2: Questions

Next you should list 5-10 questions that you would like to answer. For example, if I were analyzing the ChickWeight dataset, I could ask the following:

How did the chicken weights generally change over time?
Was there a difference in the average chicken weights as a result of the different diets?
Were the chicken weights at time 1 normally distributed?
Was there a difference in weights between time 2 and time 4?
Did more chickens die in one diet than another?

Section 3: Analyses (including 11 Tasks)

In this section, you should conduct the relevant analyses to answer each of your 5-10 research questions. I expect to see all of the relevant R code in (separate) chunks. I also expect you to use comments in your code when appropriate. Before each chunk, write what you are going to do (e.g., “I will calculate the mean age for each time period”). After each chunk, write what you found (e.g., “The mean age at time period 2 was XX”).

You need to complete each of the following tasks at some point in your paper (not for every question, just at least once in the entire paper). Please comment in your R code when you have completed a task by writing “TASK X” before the relevant code as follows:

### TASK 1
code code code

11 Tasks

Recode the values of at least one column using indexing and reassignment.
Calculate at least one standard deviation using sd(), one mean using mean() and one median using median().
Calculate at least one t-test using t.test(). Write your results in APA format.
Calculate at least one correlation test using cor.test(). Write your results in APA format.
Calculate at least one regression analysis using lm() or glm(). Write your results in APA format.
Create at least one scatterplot containing data from two different groups (e.g., a set of green points and a set of red points representing different groups), with added regression lines using abline().
Create at least one histogram using hist() with additional reference lines showing the mean and median of the group.
Using a pirateplot(), beanplot(), or boxplot(), show the distribution of a dependent variable for each level of a categorical independent variable.
Use the aggregate() function or dplyr to calculate descriptive statistics across groups of data.
Create (and use!) at least one new custom function that is different from those in the book and the WPAs.
Construct and use a loop. You can use the loop to calculate data or to create plots (or any other reasonable purpose).
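
To give a feel for what tagged task code might look like, here is a rough sketch of several of the tasks on the built-in ChickWeight data. Everything in it is an illustrative choice rather than a required approach: the labels "diet1" and "other", the focus on day 21, the model formula, and the std_error function are all hypothetical, and your own custom function still needs to differ from those in the book and the WPAs. In a real paper each task would sit in its own chunk, with text before and after it.

```r
# Illustrative only: built-in ChickWeight data, arbitrary choices throughout.
cw <- as.data.frame(ChickWeight)

### TASK 1
# Recode Diet into two labels using indexing and reassignment
cw$diet_label <- NA
cw$diet_label[cw$Diet == 1] <- "diet1"
cw$diet_label[cw$Diet != 1] <- "other"

# Weights on the last measurement day (Time == 21)
final <- cw[cw$Time == 21, ]

### TASK 2
mean_w <- mean(final$weight)
median_w <- median(final$weight)
sd_w <- sd(final$weight)

### TASK 3
tt <- t.test(weight ~ diet_label, data = final)
# Pieces needed for an APA-style report: t, df, p
c(t = unname(tt$statistic), df = unname(tt$parameter), p = tt$p.value)

### TASK 4
ct <- cor.test(cw$Time, cw$weight)
ct

### TASK 5
fit <- lm(weight ~ Time + Diet, data = cw)
summary(fit)

### TASK 7
hist(final$weight, main = "Weight on day 21", xlab = "Weight (g)")
abline(v = mean_w, lty = 1)     # solid line: mean
abline(v = median_w, lty = 2)   # dashed line: median

### TASK 8
boxplot(weight ~ Diet, data = final, xlab = "Diet", ylab = "Weight (g)")

### TASK 10
# Hypothetical custom function: standard error of the mean, ignoring NAs
std_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}
std_error(final$weight)

### TASK 11
# Loop over the diets and store the mean day-21 weight for each one
diet_means <- numeric(0)
for (d in levels(final$Diet)) {
  diet_means[d] <- mean(final$weight[final$Diet == d])
}
diet_means
```

The values stored in tt, ct, and summary(fit) are what you would then write up in APA format in the text after each chunk.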

You do not need to restrict yourself to one task for each question. For example, to answer the question “How did the chicken weights change over time?”, I could create a plot, calculate a regression and/or calculate the mean (or median) weight at each time point.
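
As a sketch of how several tasks can serve one question, the chunk below approaches “How did the chicken weights change over time?” with group means, regressions, and a plot. The choice of diets 1 and 3, the colors, and the object names are arbitrary illustrations, not part of the assignment.

```r
cw <- as.data.frame(ChickWeight)

### TASK 9
# Mean weight at each time point
mean_by_time <- aggregate(weight ~ Time, data = cw, FUN = mean)
mean_by_time

# Two diets to compare (1 and 3 chosen arbitrarily for illustration)
d1 <- cw[cw$Diet == 1, ]
d3 <- cw[cw$Diet == 3, ]

# One simple regression of weight on time per group
fit1 <- lm(weight ~ Time, data = d1)
fit3 <- lm(weight ~ Time, data = d3)

### TASK 6
# Scatterplot of both groups with their regression lines
plot(d1$Time, d1$weight, col = "red", pch = 16,
     xlab = "Time (days)", ylab = "Weight (g)",
     ylim = range(cw$weight), main = "Chick weight over time by diet")
points(d3$Time, d3$weight, col = "green", pch = 16)
abline(fit1, col = "red")
abline(fit3, col = "green")
legend("topleft", legend = c("Diet 1", "Diet 3"),
       col = c("red", "green"), pch = 16)
```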

Section 4: Conclusion

Write a brief summary of your main conclusions in 1-3 paragraphs. You don’t need to go nuts here.

How do I submit my paper?

Submit your paper like a regular WPA by publishing the knitted HTML document to RPubs and then entering the RPubs .html link on the WPA submission page. If you do not want your paper to be public (all RPubs documents are public), you can also email the .html file to me.

How will I be graded?

I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.

Was the paper knitted to an .html document?
Are all four sections present?
Were all 11 tasks completed (with the comment ### TASK X before each task)?
Is the code properly formatted – that is, can I read it and understand it?
Are there comments in the code where appropriate?

If the answer to all these questions is “Yes”, you will get a good grade.