In your final paper, you will conduct an extensive set of analyses in R on a dataset of your choice. You can use any dataset you’d like. If you have data for a Bachelor’s or Master’s thesis, you are very welcome to use that. If you do not have a dataset, look at the next section “How do I get a dataset?”.
Our last two class periods (February 3 and February 10) are entirely dedicated to giving you time to work on your papers in class. I encourage you to use this time to work on your papers and get help from me and your fellow R pirates.
Yes. You are welcome to use the same dataset as another student in the class, and you are welcome to work together on your analyses. However, you must each write your own code and text and turn in your own work! If I suspect that you did not write your paper / code, I reserve the right to ask you to explain it to me personally. If you cannot convince me that your code is your own, you will get a 0 on the assignment.
You need to use a new dataset that we have not already used in class. You should use a dataset with at least 100 rows and 5 columns. If you do not already have a dataset that you’d like to analyze, here are a few places to get one.
Datasets from the UCI machine learning database (http://archive.ics.uci.edu/ml/datasets.html). This database has 307 datasets from many fields. If you want to use one of these datasets, I recommend you use those with a Multivariate data type and a Regression default task (http://goo.gl/hm2v4B).
Financial datasets: Several financial datasets are available at the following link (http://www.r-bloggers.com/financial-data-accessible-from-r-part-iii/).
The Flights dataset. This dataset contains data from all flights leaving the Houston airport in one year. You can access the data at http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt
Create your own new dataset! You are welcome to collect your own data by, for example, conducting a survey.
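Whichever source you pick, it is worth checking the size requirement (at least 100 rows and 5 columns) before you start. Here is a minimal sketch of such a check. The helper name meets_size_requirement is hypothetical and not part of the course materials, and the built-in airquality and ChickWeight datasets are used only as stand-ins for your own data.

```r
# Hypothetical helper: TRUE if a data frame meets the size requirement
meets_size_requirement <- function(data, min_rows = 100, min_cols = 5) {
  nrow(data) >= min_rows && ncol(data) >= min_cols
}

# airquality ships with R and is large enough (153 rows, 6 columns)
dim(airquality)
meets_size_requirement(airquality)

# ChickWeight has plenty of rows but only 4 columns, so it falls short
dim(ChickWeight)
meets_size_requirement(ChickWeight)
```

Replace airquality with the data frame you actually loaded to run the same check on your own dataset.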
Your paper must be written entirely in an RMarkdown document and knitted to HTML. Make sure to print all of your R code in the document by including “echo = TRUE” in the chunk options (or omit the option, since code is printed by default). You must include your name, the date, and the title of the course on the first page of your paper. There is no minimum or maximum page length. It does not need to follow APA or any other style format.
If for some reason you absolutely cannot get R to knit your document to an HTML file, you can turn in your .Rmd markdown document via email. However, you may receive a 15% penalty to your grade.
There should be four sections in your paper: Dataset description, Questions, Analyses, and Conclusion.
Your paper should start with a description of the dataset. Make sure you answer these four questions (in paragraph form). Be as descriptive as you can, but if you don’t know exactly how the data were collected (perhaps because you got it online) that’s ok. Just say as much as you can.
Next, you should list 5-10 questions that you would like to answer. For example, if I were analyzing the ChickWeight dataset, I could ask the following:
In this section, you should conduct the relevant analyses to answer each of your 5-10 research questions. I expect to see all of the relevant R code in (separate) chunks. I also expect you to use comments in your code when appropriate. Before each chunk, write what you are going to do (e.g., “I will calculate the mean age for each time period”). After each chunk, write what you found (e.g., “The mean age at time period 2 was XX”).
You need to complete each of the following tasks at some point in your paper (not for every question, just at least once in the entire paper). Please comment in your R code when you have completed a task by writing “TASK X” before the relevant code as follows:
### TASK 1
code code code
11 Tasks
Recode the values of at least one column using indexing and reassignment.
Calculate at least one standard deviation using sd(), one mean using mean() and one median using median().
Calculate at least one t-test using t.test(). Write your results in APA format.
Calculate at least one correlation test using cor.test(). Write your results in APA format.
Calculate at least one regression analysis using lm() or glm(). Write your results in APA format.
Create at least one scatterplot containing data from two different groups (e.g., a set of green points and a set of red points representing different groups), with added regression lines using abline().
Create at least one histogram using hist() with additional reference lines showing the mean and median of the group.
Using a pirateplot(), beanplot(), or boxplot(), show the distribution of a dependent variable for each level of a categorical independent variable.
Use the aggregate function or dplyr to calculate descriptive statistics across groups of data.
Create (and use!) at least one new custom function that is different from those in the book and the WPAs.
Construct and use a loop. You can use the loop to calculate data or to create plots (or any other reasonable purpose).
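To give a feel for what several of these tasks can look like in code, here is a rough sketch using R's built-in ToothGrowth data as a stand-in (your paper needs its own, larger dataset). The TASK numbers assume the tasks are numbered in the order listed above. The helpers apa_t and std_error are hypothetical names made up for this illustration, so check that any function you write yourself really differs from those in the book and the WPAs.

```r
# Stand-in data: ToothGrowth ships with R (columns: len, supp, dose)
tooth <- ToothGrowth

### TASK 1
# Recode with indexing and reassignment: spell out the supplement codes
tooth$supp <- as.character(tooth$supp)
tooth$supp[tooth$supp == "OJ"] <- "orange juice"
tooth$supp[tooth$supp == "VC"] <- "vitamin C"

### TASK 2
len_mean   <- mean(tooth$len)
len_median <- median(tooth$len)
len_sd     <- sd(tooth$len)

### TASK 3
# Welch two-sample t-test comparing tooth length between supplements
len_test <- t.test(len ~ supp, data = tooth)

# Hypothetical helper: paste the pieces needed for an APA-style report
apa_t <- function(test) {
  sprintf("t(%.2f) = %.2f, p = %.3f",
          test$parameter, test$statistic, test$p.value)
}
apa_t(len_test)

### TASK 7
# Histogram with reference lines for the mean and the median
hist(tooth$len, main = "Tooth length", xlab = "Length")
abline(v = len_mean, col = "blue", lty = 1)
abline(v = len_median, col = "red", lty = 2)

### TASK 8
# Distribution of the dependent variable for each supplement
boxplot(len ~ supp, data = tooth, xlab = "Supplement", ylab = "Length")

### TASK 9
# Mean length for each supplement and dose combination
len_by_group <- aggregate(len ~ supp + dose, data = tooth, FUN = mean)

### TASK 10
# Hypothetical custom function: standard error of the mean, ignoring NAs
std_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}
len_se <- std_error(tooth$len)

### TASK 11
# Loop over the doses and store the median length at each one
doses <- sort(unique(tooth$dose))
median_by_dose <- numeric(length(doses))
for (i in seq_along(doses)) {
  median_by_dose[i] <- median(tooth$len[tooth$dose == doses[i]])
}
names(median_by_dose) <- doses
median_by_dose
```

The string returned by apa_t is only a starting point. You still need to write the result out in a full sentence in your text.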
You do not need to restrict yourself to one task for each question. For example, to answer the question “How did the chicken weights change over time?”, I could create a plot, calculate a regression and/or calculate the mean (or median) weight at each time point.
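As a sketch of how the chicken weight question could combine all three approaches, the code below uses the built-in ChickWeight data. The choice of diets 1 and 2 for the plot and the object names (chicks, mean_by_time, weight_model) are assumptions made for this illustration only.

```r
# ChickWeight ships with R (columns: weight, Time, Chick, Diet)
chicks <- ChickWeight

# Mean weight at each measurement day
mean_by_time <- aggregate(weight ~ Time, data = chicks, FUN = mean)
mean_by_time

# Simple linear regression of weight on time
weight_model <- lm(weight ~ Time, data = chicks)
summary(weight_model)

# Scatterplot for two diets, with one regression line per diet
diet1 <- subset(chicks, Diet == 1)
diet2 <- subset(chicks, Diet == 2)

plot(diet1$Time, diet1$weight, col = "red", pch = 16,
     xlab = "Time (days)", ylab = "Weight (g)",
     ylim = range(chicks$weight),
     main = "Chick weight over time, diets 1 and 2")
points(diet2$Time, diet2$weight, col = "darkgreen", pch = 16)
abline(lm(weight ~ Time, data = diet1), col = "red")
abline(lm(weight ~ Time, data = diet2), col = "darkgreen")
legend("topleft", legend = c("Diet 1", "Diet 2"),
       col = c("red", "darkgreen"), pch = 16)
```

Each piece answers the same question from a different angle: the table shows the average at each time point, the regression gives an overall rate of change, and the plot shows how that rate differs between groups.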
Write a brief summary of your main conclusions in 1-3 paragraphs. You don’t need to go nuts here. Just write a few paragraphs that summarize the main conclusions from your results.
Submit your paper like a regular WPA by publishing the knitted HTML document to RPubs and then entering the RPubs .html link on the WPA submission page. If you do not want your paper to be public (all RPubs documents are public), you can also email the .html file to me.
I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.
If the answer to all these questions is “Yes”, you will get a good grade.