Due on 20th of December 2018 at 5pm.

Your final assignment for the course is to conduct an analysis on a new dataset in R and to interpret the results. The final document you submit will contain the R code and output for this analysis, as well as written interpretations and descriptions.

The goal of this analysis is for you to practise the skills you have learnt in this course. Therefore, the analysis you will be required to conduct will contain a broad range of the tasks you have been doing throughout the course.

What data can I use?

The dataset has to be one that is new to this course. By this I mean it can’t be one of the datasets which we have analysed in the textbook or weekly WPAs. It can, however, be a dataset which you already have, such as from a master’s or PhD project, even if you have already analysed this data for other purposes. If you don’t have a dataset, you can find one yourself (some journals provide open access to the data from papers, for instance), or you can ask me to give you a dataset.

Dataset Requirements

Most of you should try to find datasets which are similar in form to the datasets we have used throughout this course. In general you need a dataset which has several columns of data, with some columns containing numeric/continuous variables and others containing categorical variables.

For the project I will provide a list of tasks/analyses that you need to complete. To complete these tasks, ideally, your dataset should have at least 5 columns of data, with at least two columns containing categorical variables and at least two containing numeric variables. All datasets I provide will meet these criteria.

If you are using your own dataset, these columns don’t have to contain variables you are actually interested in for your external project. For instance, you may have a dataset with a column of a continuous DV and a column of a categorical IV. In addition, you may have collected demographic information which you aren’t really interested in, such as age (a continuous variable) and gender/sex (a categorical variable). This dataset would be fine to use, as you can use the age and gender data for some of the analyses. If you don’t have the correct data types to complete a particular task/analysis, you will have two options: 1) you can provide an explanation of why the analysis is inappropriate and conduct an alternative of similar difficulty, or 2) you can generate a column of fake data.
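
As a rough illustration of the second option, a fake column can be generated with base R's random sampling functions. This is only a sketch: the data frame `my_data` and the column names `score`, `age` and `group` are made up for the example and do not refer to any course dataset.

```r
# Hypothetical data frame standing in for your own dataset
set.seed(100)  # makes the simulated values reproducible
my_data <- data.frame(score = rnorm(50, mean = 20, sd = 5))

# Fake continuous column: ages drawn from a normal distribution, rounded to whole years
my_data$age <- round(rnorm(nrow(my_data), mean = 30, sd = 8))

# Fake categorical column: group labels sampled with replacement
my_data$group <- sample(c("a", "b", "c"), size = nrow(my_data), replace = TRUE)

head(my_data)
```

If you do this, say clearly in your data description which columns are simulated.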

Custom Project

As much as possible I want to encourage you to use real datasets for this project. However, some of you may have your own dataset which isn’t really appropriate for most of the tasks listed below. In this case you can meet with me and we can see if we can come up with a list of custom tasks for you to perform, based on the dataset you have and the analysis you ultimately want to conduct on it. For these projects you will still submit a final document with your R code, R output and interpretations, but you won’t be following the checklist of standard tasks provided below. If you have a project like this in mind, talk to me directly by the end of class on the 29th or arrange a time to meet with me before the 6th, and we can agree on whether the project is an appropriate assessment and what you need to do.

Where can I find data?

If you don’t have your own dataset, below are links to a few journals that encourage or require data to be made available with publication. The links generally direct to the data pages of the journal, or the page where they outline their policy on data sharing:

And here is a link to some more general repositories where data from studies is uploaded for public access.

Structure of the paper

Your paper should contain four sections: Data description, research question, analyses and conclusion.

Section 1: Data description

This section should contain (in paragraph form) a basic verbal description/explanation of your dataset. In this section you should explain where the dataset came from, what the different variables are that have been measured, and how the original study was conducted. If you are using a dataset provided by me, you can make up some of these details. This section doesn’t have to be exhaustive; it should just provide a basic idea of the dataset.

Section 2: Research Questions

In this section you should provide a list of 5 to 10 research questions that you want to answer with this dataset. The purpose of this section is to highlight some of the research questions that could be answered by this dataset, before you start going through actual analyses. For example, if you were using the math dataset from WPA #8 you might have come up with the following research questions:

Do the two schools differ in third grade performance?
Does family support affect performance?
Are urban students more likely to do extra-curricular activities than rural students?
Do any of age, absences and sex predict third grade performance?
Is there a correlation between health status and absences?
Is first period performance normally distributed?
Is there an interaction between family support and school support with regard to third period performance?

If you are doing a custom project, your research questions may look very different, and may require more explanation.

Section 3: Analyses

If you are conducting a standard project you should complete the list of tasks below. These tasks have been broken into 5 subsections (Descriptives, T-tests & ANOVA, Regression etc.), but you can complete them in any order that you consider appropriate for your dataset. If a required task isn’t appropriate for your dataset, explain why, and conduct an alternative analysis of similar difficulty.

For each of these tasks, you should do the following:

Write what you are going to do (e.g. “I will calculate the mean age for each condition”).
Display your nicely formatted R code with appropriate comments.
Display the output from your R code.
Interpret what you found (e.g. “The mean age in condition 1 was XX”). If the task involves a statistical test, report a full APA style conclusion in your text.

If you are completing a custom project, you don’t need to complete the tasks below. However, you should follow a similar format to the points above for each task/analysis which you complete in your project.

Descriptives

You should complete all tasks in this subsection.

Read/load data into R from a text file or .Rdata file.
Provide a summary of the data.frame, using str and summary.
For at least one column, report the mean, median and standard deviation.
For at least one column, report a table of frequencies.
Create at least one histogram.
Recode at least one variable using indexing (e.g. if gender is coded as “male” or “female”, recode them as 1 or 2 in a new column).
For at least one column, check for any data that is outside the acceptable range and, if there is any, recode it as NA. For example, if one of your variables was age, you would want to exclude any responses that were below 0.
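
To give a sense of what these tasks can look like, here is a minimal sketch in base R. It uses simulated data written to a temporary text file in place of a real dataset; the object name `survey`, the columns `age` and `gender`, and the deliberately impossible age of -4 are all made up for the illustration. Note that the out-of-range check is done before the mean is calculated, so the invalid value does not distort the descriptives.

```r
# Simulated data written to a temporary text file, standing in for your own file
set.seed(1)
tmp_file <- tempfile(fileext = ".txt")
write.table(data.frame(age = c(sample(18:65, 29, replace = TRUE), -4),
                       gender = sample(c("male", "female"), 30, replace = TRUE)),
            file = tmp_file, sep = "\t", row.names = FALSE)

# Read the data into R from the text file
survey <- read.table(tmp_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Summarise the data frame
str(survey)
summary(survey)

# Recode out-of-range values as NA (an age below 0 is impossible)
survey$age[survey$age < 0] <- NA

# Mean, median and standard deviation of one column, ignoring missing values
mean(survey$age, na.rm = TRUE)
median(survey$age, na.rm = TRUE)
sd(survey$age, na.rm = TRUE)

# Table of frequencies for a categorical column
table(survey$gender)

# Histogram with a title and an x-axis label
hist(survey$age, main = "Distribution of age", xlab = "Age (years)")

# Recode gender as 1 or 2 in a new column using logical indexing
survey$gender_num <- NA
survey$gender_num[survey$gender == "male"] <- 1
survey$gender_num[survey$gender == "female"] <- 2
```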

T-tests & ANOVA.

Choose two of the tests from this subsection.

Conduct at least one t test.
Conduct at least one one-way ANOVA, with an Independent Variable that has at least 3 levels. If the result is significant, or if warranted by your research question, conduct appropriate post-hoc tests.
Conduct at least one multi-variable ANOVA, with at least 2 Independent Variables. If it is appropriate to do so, conduct post-hoc tests.

For both tests you choose you should do the following:

Conduct the appropriate test.
Calculate appropriate descriptive statistics by group.
Produce an appropriately labelled and interpretable figure using barplot, boxplot, vioplot or pirateplot. For a barplot this should include error bars.
Interpret/write your results in APA style (approximately). This includes interpreting any post-hoc tests or interactions.
Repeat the test on a subset of the data and also report these results (without the figure or extra descriptives).
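
The steps above could be sketched roughly as follows for a t-test and a one-way ANOVA. This is an illustration on simulated data, not a template you have to follow: the data frame `study` and its columns `condition`, `sex` and `score` are invented for the example, and a boxplot is used because it does not need separately calculated error bars.

```r
# Simulated data: three conditions with 20 participants each
set.seed(2)
study <- data.frame(
  condition = factor(rep(c("control", "low", "high"), each = 20),
                     levels = c("control", "low", "high")),
  sex = rep(c("female", "male"), times = 30),
  score = c(rnorm(20, 50, 10), rnorm(20, 55, 10), rnorm(20, 60, 10))
)

# t-test comparing two of the conditions (unused factor level dropped)
two_groups <- droplevels(subset(study, condition %in% c("control", "high")))
score_ttest <- t.test(score ~ condition, data = two_groups)
score_ttest

# One-way ANOVA with a 3-level independent variable
score_aov <- aov(score ~ condition, data = study)
summary(score_aov)

# Post-hoc comparisons between all pairs of conditions
TukeyHSD(score_aov)

# Descriptive statistics by group
aggregate(score ~ condition, data = study, FUN = mean)
aggregate(score ~ condition, data = study, FUN = sd)

# Labelled figure
boxplot(score ~ condition, data = study,
        main = "Score by condition", xlab = "Condition", ylab = "Score")

# Repeat the t-test on a subset of the data (female participants only)
score_ttest_female <- t.test(score ~ condition,
                             data = subset(two_groups, sex == "female"))
score_ttest_female
```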

Regression, Correlation and Chi-Square.

Complete all the tasks in this subsection.

Conduct at least one regression analysis. You can have multiple predictors or a single predictor. Interpret/write the results in APA style.
Produce an appropriately labelled and interpretable scatterplot for one of the predictors and the DV. It should include a regression line.
Conduct at least one Chi-Square test. Write the results in APA style.
Conduct at least one correlation test. Write the results in APA style.
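
A minimal sketch of these four tasks in base R, again on simulated data, might look like this. The data frame `students` and its columns `hours`, `area`, `club` and `grade` are hypothetical names chosen for the example.

```r
# Simulated data: study hours, two categorical variables and a grade
set.seed(3)
n <- 100
students <- data.frame(
  hours = runif(n, min = 0, max = 10),
  area = sample(c("urban", "rural"), n, replace = TRUE),
  club = sample(c("yes", "no"), n, replace = TRUE)
)
students$grade <- 40 + 3 * students$hours + rnorm(n, sd = 8)

# Regression with a single predictor
grade_lm <- lm(grade ~ hours, data = students)
summary(grade_lm)

# Scatterplot of the predictor and the DV, with the regression line added
plot(grade ~ hours, data = students,
     main = "Grade by hours of study", xlab = "Hours of study", ylab = "Grade")
abline(grade_lm, col = "red")

# Chi-square test of independence between two categorical variables
club_chisq <- chisq.test(table(students$area, students$club))
club_chisq

# Correlation test between two numeric variables
grade_cor <- cor.test(students$hours, students$grade)
grade_cor
```

The saved test objects hold the values needed for an APA style write-up, for example `club_chisq$statistic`, `club_chisq$parameter` and `club_chisq$p.value`.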

Loops (This section may change if WPA9 doesn’t go well.)

Use a loop to create histograms for a subset of the columns of your dataset (i.e. at least 4). The histograms should all be displayed in a single plot window and ideally each histogram should have an appropriate label on the x-axis, and an appropriate title.
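
One possible shape for such a loop is sketched below, assuming a data frame with four numeric columns. The data frame `scores` and its column names are simulated for the illustration; `par(mfrow = c(2, 2))` is what places all four histograms in a single plot window.

```r
# Simulated data frame with four numeric columns
set.seed(4)
scores <- data.frame(
  test1 = rnorm(50, mean = 10, sd = 2),
  test2 = rnorm(50, mean = 12, sd = 3),
  test3 = rnorm(50, mean = 15, sd = 2),
  test4 = rnorm(50, mean = 11, sd = 4)
)

# Names of the columns to plot
cols_to_plot <- c("test1", "test2", "test3", "test4")

# Split the plot window into a 2 x 2 grid, saving the old settings
old_par <- par(mfrow = c(2, 2))

# Draw one histogram per column, using the column name in the title and x-axis label
for (col_name in cols_to_plot) {
  hist(scores[[col_name]],
       main = paste("Histogram of", col_name),
       xlab = col_name)
}

# Restore the original plotting settings
par(old_par)
```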

Functions (This section may change if WPA9 doesn’t go well.)

Choose one of the tasks in this subsection.

Write a custom function that will replicate the results of one of the tests you conducted in the “T-tests & ANOVA” section. By this I mean that the function should accept as inputs vectors of data for each of the necessary variables, return the test object and the appropriate descriptive statistics, and produce an appropriate plot. For example, if you chose the t.test, it should accept a vector of data for each group (in vector or formula format, whichever you prefer) and return the results of a t.test comparing these groups (saved as an object, not printed) and the means and standard deviations of each group, and produce a plot comparing these groups. Simulate some data to test this function.
Write your own custom function to perform a task of similar difficulty.
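
As a sketch of what the first task is asking for, a function for the t-test case could be structured as below. The function name `ttest_report`, its arguments and the returned list elements are all invented for this example; your own function can be organised differently as long as it returns the test object and descriptives and produces a plot.

```r
# Hypothetical function: runs a t-test on two vectors, returns the test object
# and group descriptives in a list, and draws a boxplot as a side effect
ttest_report <- function(group_a, group_b, labels = c("A", "B")) {
  # Run the test and save it as an object rather than printing it
  test <- t.test(group_a, group_b)

  # Mean and standard deviation of each group
  descriptives <- data.frame(
    group = labels,
    mean = c(mean(group_a), mean(group_b)),
    sd = c(sd(group_a), sd(group_b))
  )

  # Plot comparing the two groups
  boxplot(list(group_a, group_b), names = labels,
          main = "Comparison of groups", ylab = "Value")

  # Return both results together
  list(test = test, descriptives = descriptives)
}

# Simulate some data to test the function
set.seed(5)
result <- ttest_report(rnorm(30, mean = 100, sd = 15),
                       rnorm(30, mean = 110, sd = 15))
result$descriptives
result$test
```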

Section 4: Conclusion

Write a brief summary of your main conclusions in 1-3 paragraphs. This is just a summary of the main results and what you can conclude from them.

How do I actually write/format all this?

At the end of the day I want to receive a document that contains a mixture of written sections, R code and R output. I don’t care how you produce this document. Here are two suggestions:

The easy but inefficient way is to take screenshots of your R code and output for each task. You can then produce the written sections in whatever program you normally use (e.g. Word, LaTeX) and add the screenshots in the appropriate sections.
The more efficient method is to use R markdown. This is a package for RStudio which lets you create PDF, Word and HTML documents which include sections of R code and output. This is the package I have used to create the WPAs for this course. There are only a few features of markdown that you will need to understand to produce the document for your final project. I’ll go through these features in the first Final Project lesson, and provide a template to anyone who wants to try and use R markdown to produce their document. Nathaniel Phillips, the author of your textbook, has also written a good document for the basics, and there are tutorials on the R markdown website.

You can use either of these two methods to produce your document, or do something completely different. It won’t affect your mark. I think the easiest method would be to use R markdown to create a Word document which contains your R code, R output and figures. You can then add the other text to this Word document like normal. On the 6th I’ll go through how to set up a basic R markdown document to do this.

How do I submit it?

Just like with the WPAs, you need to email it to me by 5pm on the due date (20/12/2018). You should put the course code in the subject line of the email, and name your document LastnameFirstname-RFinalProject. You should also have your name at the top of the document.

How will it be graded?

I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.

Are all four sections present?
Were all tasks completed (assuming they make sense for your data)? If not, did the author ask me for help (before the final deadline)?
Is the code properly formatted and commented – that is, can I read it and understand it?
Is the paper nicely formatted? That is, does it look like the student put effort into making a nice document that is easy to follow?

If the answer to all these questions is “Yes” (or “Mostly Yes”), you will be just fine.

When can I start?

You can start as soon as you have found a dataset. It would be a good idea to check with me that the dataset you plan to use will be acceptable, but if it meets the guidelines above then it isn’t necessary for you to ask me. Here are some points to consider before starting:
1. If you want to do a custom project you need permission from me. So I would recommend not starting until you have talked to me about it.
2. I want to encourage people to find their own datasets, either from papers you have read, or from research you are doing yourself. Therefore, if you don’t want to find your own dataset, and instead want to use a dataset provided by me, you won’t be able to start until the 6th of December. This is because I won’t send out these datasets until the 6th of December.
3. You can change your dataset at any time, BUT you will need to do the whole project with the new dataset.
4. You can help each other with the final project, but you can’t share a dataset with your friends or do the report together. So if you want to help each other, make sure you have different datasets and that everyone is ultimately doing their own work. This is an individual project, but done in a collaborative environment.
5. Don’t start on the loops and functions sections until after WPA 9, as I may need to adjust these.

R Course Final Project (Autumn 2018)