Due on 10th of June 2017

Your final assignment for the course is to conduct an analysis on a new dataset in R and to interpret the results. The final document you submit will contain the R code and output for this analysis, as well as written interpretations and descriptions.

The goal of this analysis is for you to practise the skills you have learnt in this course. Therefore, the analysis you are required to conduct will cover a broad range of the tasks you have been doing throughout the course.

What data can I use?

The dataset has to be one that is new to this course. By this I mean it can’t be one of the datasets which we have analysed in the textbook or weekly WPAs. It can, however, be a dataset which you already have, such as from a master’s or PhD project, even if you have already analysed this data for other purposes. If you don’t have a dataset, you can find one yourself (some journals provide open access to the data from papers, for instance), or you can ask me to give you a dataset.

Standard Project

Most of you will be working on datasets which are similar in form to the datasets we have used throughout this course. In general you will have a dataset which has several columns of data, with some columns containing numeric/continuous variables and others containing categorical variables.

For these standard datasets, I will provide a list of tasks/analyses that you need to complete. If you want to use a standard dataset, ideally it should have at least 5 columns of data, with two columns containing categorical variables and two containing numeric variables. All datasets I provide will meet these criteria.

If you are using your own dataset, these columns don’t have to contain variables you are actually interested in for your external project. For instance, you may have a dataset with a column containing a continuous DV and a column containing a categorical IV. In addition, you may have collected demographic information which you aren’t really interested in, such as age (a continuous variable) and gender/sex (a categorical variable). This dataset would be fine to use, as you can use the age and gender data for some of the analyses. If you don’t have the correct data types to complete the standard tasks/analyses, you have two options: 1) you can explain why the analysis is inappropriate and conduct an alternative of similar difficulty, or 2) you can generate a column of fake data.
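
If you go for the fake data option, one possible approach is sketched below. The data.frame study and its columns are made up purely for illustration, so substitute your own object and column names.

```r
# Hypothetical stand-in for your own dataset (names and values are invented)
set.seed(100)  # makes the simulated values reproducible
study <- data.frame(id = 1:50,
                    score = rnorm(n = 50, mean = 100, sd = 15))

# Fake categorical column: sample group labels at random
study$group <- sample(x = c("control", "treatment"),
                      size = nrow(study),
                      replace = TRUE)

# Fake numeric column: e.g. ages drawn from a normal distribution
study$age <- round(rnorm(n = nrow(study), mean = 25, sd = 4))

head(study)
```

If you do this, say clearly in your data description which columns are simulated.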

Non-standard Project

Some of you have already discussed with me projects which are very different to the types of analyses we have conducted in this course. For these projects you will still submit a final document with your R code, R output and interpretations, but you won’t be following the checklist of standard tasks provided below. If you have a project like this in mind, talk to me directly by the end of class on the 18th (or sooner ideally), and we can agree on whether the project is an appropriate assessment and what you need to do.

Structure of the paper

Your paper should contain four sections: data description, research questions, analyses and conclusion.

Section 1: Data description

This section should contain (in paragraph form) a basic verbal description/explanation of your dataset. In this section you should explain where the dataset came from, what the different variables are that have been measured, and how the original study was conducted. If you are using a dataset provided by me, you can make up some of these details. This section doesn’t have to be exhaustive; it should just provide a basic idea of the dataset.

Section 2: Research Questions

In this section you should provide a list of 5 to 10 research questions that you want to answer with this dataset. The purpose of this section is to highlight some of the research questions that could be answered by this dataset, before you start going through the actual analyses. For example, if you were using the math dataset from WPA #8 you might have come up with the following research questions:

Do the two schools differ in third grade performance?
Does family support affect performance?
Are urban students more likely to do extra-curricular activities than rural students?
Do any of age, absences and sex predict third grade performance?
Is there a correlation between health status and absences?
Is first period performance normally distributed?
Is there an interaction between family support and school support with regard to third period performance?

If you are doing a non-standard project, your research questions may look very different, and may require more explanation.

Section 3: Analyses

If you are conducting a standard project you should complete the list of tasks below. These tasks have been broken into 5 subsections (Descriptives, T-tests & ANOVA, Regression etc.), but you can complete them in any order that you consider appropriate for your dataset. If a required task isn’t appropriate for your dataset, explain why, and conduct an alternative analysis of similar difficulty.

For each of these tasks, you should do the following:

Write what you are going to do (e.g. “I will calculate the mean age for each condition”).
Display your nicely formatted R code with appropriate comments.
Display the output from your R code.
Interpret what you found (e.g. “The mean age in condition 1 was XX”). If the task involves a statistical test, report a full APA style conclusion in your text.

If you are completing a non-standard project, you don’t need to complete the tasks below. However, you should follow a similar format to the dot points above for each task/analysis which you complete in your project.

Descriptives

You should complete all tasks in this subsection.

Read/load data into R from a text file or .Rdata file.
Provide a summary of the data.frame, using str and summary.
For at least one column, report the mean, median and sd.
For at least one column, report a table of frequencies with table.
Create at least one histogram (hist).
Recode at least one variable using indexing (e.g. if gender is coded as “male” or “female”, recode these as 1 or 2 in a new column).
For at least one column, check for any data that is outside the acceptable range and, if there is any, recode it as NA. For example, if one of your variables was age, you would want to exclude any responses that were below 0.
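
As a rough sketch of how these tasks might fit together, the code below runs through them on a simulated data.frame. The object survey, its columns and the acceptable age range of 0 to 120 are all assumptions made for illustration; with real data you would start by reading your file in with something like read.table() or load().

```r
# Simulated stand-in for a dataset you would normally read from a file
set.seed(1)
survey <- data.frame(
  age    = c(sample(18:65, size = 48, replace = TRUE), -5, 250),  # last two are impossible
  gender = sample(c("male", "female"), size = 50, replace = TRUE),
  stringsAsFactors = FALSE
)

# Summaries of the whole data.frame
str(survey)
summary(survey)

# Recode out-of-range ages as NA (assumed acceptable range: 0 to 120)
survey$age[survey$age < 0 | survey$age > 120] <- NA

# Mean, median and sd of one column (na.rm = TRUE ignores the NA values)
mean(survey$age, na.rm = TRUE)
median(survey$age, na.rm = TRUE)
sd(survey$age, na.rm = TRUE)

# Table of frequencies for a categorical column
table(survey$gender)

# Histogram of one numeric column
hist(survey$age, main = "Distribution of age", xlab = "Age (years)")

# Recode gender into a new numeric column using logical indexing
survey$gender.num <- NA
survey$gender.num[survey$gender == "male"] <- 1
survey$gender.num[survey$gender == "female"] <- 2
```

Here the range check is done before the descriptives so that the impossible values don’t distort the mean and sd.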

T-tests & ANOVA.

Choose two of the tests from this subsection.

Conduct at least one t-test with t.test.
Conduct at least one one-way ANOVA (with aov), with an Independent Variable that has at least 3 levels. If the result is significant, or if warranted by your research question, conduct appropriate post-hoc tests.
Conduct at least one multi-variable ANOVA (with aov), with at least 2 Independent Variables. If it is appropriate to do so, conduct post-hoc tests.
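
For the two ANOVA options, a minimal sketch on simulated data might look like the following. The data.frame diet and its columns are hypothetical, and TukeyHSD is just one possible choice of post-hoc test, not the only acceptable one.

```r
# Simulated data: one numeric DV and two categorical IVs (all names are invented)
set.seed(2)
diet <- data.frame(
  group  = factor(rep(c("A", "B", "C"), each = 20)),
  sex    = factor(rep(c("f", "m"), times = 30)),
  weight = c(rnorm(20, mean = 70, sd = 5),
             rnorm(20, mean = 75, sd = 5),
             rnorm(20, mean = 80, sd = 5))
)

# One-way ANOVA with a 3-level Independent Variable
diet.aov <- aov(weight ~ group, data = diet)
summary(diet.aov)

# Post-hoc pairwise comparisons (Tukey's Honest Significant Differences)
TukeyHSD(diet.aov)

# ANOVA with 2 Independent Variables and their interaction
diet.aov2 <- aov(weight ~ group * sex, data = diet)
summary(diet.aov2)
```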

For both tests you choose you should do the following:

Conduct the appropriate test.
Calculate appropriate descriptive statistics by group using aggregate.
Produce an appropriately labelled and interpretable figure using barplot, boxplot, vioplot or pirateplot.
Interpret/write your results in APA style (approximately).
Repeat the test on a subset of data, using subset, and also report these results (without the figure or extra descriptives).
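
The five steps above could be sketched as follows for a t-test. The data.frame exam and its columns are simulated for illustration only, and a boxplot is used here simply because it needs no extra packages.

```r
# Simulated data: scores from two schools across two year groups (invented names)
set.seed(3)
exam <- data.frame(
  school = rep(c("north", "south"), each = 30),
  year   = rep(c(1, 2), times = 30),
  score  = c(rnorm(30, mean = 60, sd = 10),
             rnorm(30, mean = 66, sd = 10))
)

# 1. Conduct the test and save it as an object
exam.ttest <- t.test(score ~ school, data = exam)
exam.ttest

# 2. Descriptive statistics by group
aggregate(score ~ school, data = exam, FUN = mean)
aggregate(score ~ school, data = exam, FUN = sd)

# 3. A labelled figure comparing the groups
boxplot(score ~ school, data = exam,
        main = "Exam score by school",
        xlab = "School", ylab = "Score")

# 4. The APA style write-up goes in your text, using values from exam.ttest

# 5. Repeat the test on a subset of the data (here: first-year students only)
exam.y1 <- subset(exam, subset = year == 1)
t.test(score ~ school, data = exam.y1)
```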

Regression, Correlation and Chi-Square.

Complete all the tasks in this subsection.

Conduct at least one regression analysis using lm. You can have multiple predictors or a single predictor. Interpret/write the results in APA style.
Produce an appropriately labelled and interpretable scatterplot for one of the predictors and the DV using plot. It should include a regression line.
Conduct at least one Chi-Square test (chisq.test). Write the results in APA style.
Conduct at least one correlation test (cor.test). Write the results in APA style.
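
A minimal sketch of these four tasks on simulated data is given below. The data.frame students, its columns and the built-in relationship between absences and grade are assumptions made so that the example runs on its own.

```r
# Simulated data (all names and values are invented for illustration)
set.seed(4)
n <- 80
students <- data.frame(
  absences = sample(0:20, size = n, replace = TRUE),
  area     = sample(c("urban", "rural"), size = n, replace = TRUE),
  club     = sample(c("yes", "no"), size = n, replace = TRUE)
)
students$grade <- 15 - 0.3 * students$absences + rnorm(n, mean = 0, sd = 2)

# Regression with a single predictor
grade.lm <- lm(grade ~ absences, data = students)
summary(grade.lm)

# Scatterplot of the predictor and the DV, with the regression line added
plot(x = students$absences, y = students$grade,
     main = "Grade by number of absences",
     xlab = "Absences", ylab = "Grade")
abline(grade.lm, col = "red")

# Chi-square test on the cross-table of two categorical variables
chisq.test(table(students$area, students$club))

# Correlation test between two numeric variables
cor.test(x = students$absences, y = students$grade)
```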

Loops

Use a loop to create histograms for a subset of the columns of your dataset (i.e. at least 4). The histograms should all be displayed in a single plot window and ideally each histogram should have an appropriate label on the x-axis, and an appropriate title.
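
One way to approach this, sketched on a simulated data.frame called scores (a hypothetical name, as are its columns), is to loop over a vector of column names and use par(mfrow = ...) to put all the histograms in one plot window.

```r
# Simulated data with four numeric columns (invented for illustration)
set.seed(5)
scores <- data.frame(
  period1  = rnorm(100, mean = 10, sd = 3),
  period2  = rnorm(100, mean = 11, sd = 3),
  period3  = rnorm(100, mean = 12, sd = 3),
  absences = rpois(100, lambda = 4)
)

# Names of the columns to plot
cols <- c("period1", "period2", "period3", "absences")

# Split the plot window into a 2 x 2 grid
par(mfrow = c(2, 2))

# One histogram per column, each with its own title and x-axis label
for (col.i in cols) {
  hist(scores[[col.i]],
       main = paste("Distribution of", col.i),
       xlab = col.i)
}

# Reset the plot window to a single panel
par(mfrow = c(1, 1))
```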

Functions

Choose one of the tasks in this subsection.

Write a custom function that will replicate the results of one of the tests you conducted in the “T-tests & ANOVA” subsection. By this I mean that the function should accept as inputs vectors of data for each of the necessary variables, return the test object and the appropriate descriptive statistics, and produce an appropriate plot. For example, if you chose the t-test, it should accept a vector of data for each group (in vector or formula format, whichever you prefer) and return the results of a t.test comparing these groups (saved as an object, not printed) and the means and standard deviations of each group, and it should produce a plot comparing these groups. Simulate some data to test this function.
Write your own custom function to perform a task of similar difficulty.
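
To give an idea of the general shape of such a function, here is a sketch for the t-test case using the vector format. The function name ttest.report, its arguments and the choice of a boxplot are all my own illustrative choices, so treat this as one possible structure and not as the required solution.

```r
# Hypothetical helper: runs a t-test on two vectors, returns the test object
# and descriptives in a list, and draws a boxplot comparing the groups
ttest.report <- function(x, y, labels = c("Group 1", "Group 2")) {

  # Save the test as an object instead of printing it
  test <- t.test(x = x, y = y)

  # Mean and standard deviation of each group
  descriptives <- data.frame(
    group = labels,
    mean  = c(mean(x), mean(y)),
    sd    = c(sd(x), sd(y))
  )

  # Plot comparing the two groups
  boxplot(list(x, y), names = labels,
          main = "Comparison of groups", ylab = "Value")

  # Return everything in one list
  return(list(test = test, descriptives = descriptives))
}

# Simulate some data to test the function
set.seed(6)
group.a <- rnorm(25, mean = 50, sd = 8)
group.b <- rnorm(25, mean = 55, sd = 8)

result <- ttest.report(x = group.a, y = group.b, labels = c("A", "B"))
result$descriptives
result$test
```

Returning a list means the test object and the descriptives can each be pulled out later with $.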

Section 4: Conclusion

Write a brief summary of your main conclusions in 1-3 paragraphs. This is just a summary of the main results and what you can conclude from them.

How do I actually write/format all this?

At the end of the day I want to receive a document that contains a mixture of written sections, R code and R output. I don’t care how you produce this document. Here are two suggestions:

The easy but inefficient way is to take screenshots of your R code and output for each task. You can then produce the written sections in whatever program you normally use (e.g. Word or LaTeX) and add the screenshots in the appropriate sections.
The more efficient method is to use R Markdown. This is an R package (with built-in support in RStudio) which lets you create PDF, Word and HTML documents which include sections of R code and output. This is the package I have used to create the WPAs for this course. There are only a few features of R Markdown that you will need to understand to produce the document for your final project. I’ll go through these features in the first Final Project lesson, and provide a template to anyone who wants to try using R Markdown to produce their document. Nathaniel Phillips has also written a good document on the basics, and there are tutorials on the R Markdown website.

You can use either of these two methods to produce your document, or do something completely different. It won’t affect your mark. Here is an example of what your final project document could look like. All the sections with a comment of # INSERT CODE HERE would be actual R code producing the output following it. The tasks in the example document don’t perfectly match the ones you are required to do, nor do the lengths of some of the sections (like the conclusion). This is just an example of the overall format.

How do I submit it?

Just like with the WPAs, you need to email it to me before the due date (10/06/2017). You should put the course code in the subject line of the email, and name your document LastnameFirstname-RFinalProject. You should also have your name at the top of the document.

How will it be graded?

I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.

Are all four sections present?
Were all tasks completed (assuming they make sense for your data)? If not, did the author ask me for help (before the final deadline)?
Is the code properly formatted and commented – that is, can I read it and understand it?
Is the paper nicely formatted? That is, does it look like the student put effort into making a nice document that is easy to follow?

If the answer to all these questions is “Yes” (or “Mostly Yes”), you will be just fine.

R Course Final Project (Spring 2017)