Your final assignment for the course is to conduct an analysis of a new dataset in R and to interpret the results. The final document you submit will contain the R code and output for this analysis, as well as written interpretations and descriptions.
The goal of this analysis is for you to practise the skills you have learnt in this course. The analysis you are required to conduct therefore covers a broad range of the tasks you have been doing throughout the course.
The dataset has to be one that is new to this course. By this I mean it can’t be one of the datasets which we have analysed in the textbook or weekly WPAs. It can, however, be a dataset which you already have, such as one from a master’s or PhD project, even if you have already analysed this data for other purposes. If you don’t have a dataset, you can find one yourself (some journals provide open access to the data from papers, for instance), or you can ask me to give you a dataset.
Most of you will be working on datasets which are similar in form to the datasets we have used throughout this course. In general you will have a dataset which has several columns of data, with some columns containing numeric/continuous variables and others containing categorical variables.
For these standard datasets, I will provide a list of tasks/analyses that you need to complete. If you want to use a standard dataset, ideally it should have at least 5 columns of data, with at least two columns containing categorical variables and at least two containing numeric variables. All datasets I provide will meet these criteria.
If you are using your own dataset, these columns don’t have to contain variables you are actually interested in for your external project. For instance, you may have a dataset with a column for a continuous DV and a column for a categorical IV. In addition, you may have collected demographic information which you aren’t really interested in, such as age (a continuous variable) and gender/sex (a categorical variable). This dataset would be fine to use, as you can use the age and gender data for some of the analyses. If you don’t have the correct data types to complete the standard tasks/analyses, you have two options: (1) you can explain why the analysis is inappropriate and conduct an alternative of similar difficulty, or (2) you can generate a column of fake data.
Some of you have already discussed with me projects which are very different from the types of analyses we have conducted in this course. For these projects you will still submit a final document with your R code, R output and interpretations, but you won’t be following the checklist of standard tasks provided below. If you have a project like this in mind, talk to me directly by the end of class on the 18th (or sooner, ideally), and we can agree on whether the project is an appropriate assessment and what you need to do.
Your paper should contain four sections: Data description, research question, analyses and conclusion.
This section should contain (in paragraph form) a basic verbal description/explanation of your dataset. In this section you should explain where the dataset came from, which variables were measured, and how the original study was conducted. If you are using a dataset provided by me, you can make up some of these details. This section doesn’t have to be exhaustive; it should just give a basic idea of the dataset.
In this section you should provide a list of 5 to 10 research questions that you want to answer with this dataset. The purpose of this section is to highlight some of the research questions that could be answered by this dataset, before you start going through actual analyses. For example, if you were using the math dataset from WPA # 8 you might have come up with the following research questions:
If you are doing a non-standard project, your research questions may look very different, and may require more explanation.
If you are conducting a standard project you should complete the list of tasks below. These tasks have been broken into 5 subsections (e.g. Descriptives, t-tests & ANOVA, Regression), but you can complete them in any order that you consider appropriate for your dataset. If a required task isn’t appropriate for your dataset, explain why, and conduct an alternative analysis of similar difficulty.
For each of these tasks, you should do the following:
Write what you are going to do (e.g. “I will calculate the mean age for each condition”).
Display your nicely formatted R code with appropriate comments.
Display the output from your R code.
Interpret what you found (e.g. “The mean age in condition 1 was XX”). If the task involves a statistical test, report a full APA style conclusion in your text.
If you are completing a non-standard project, you don’t need to complete the tasks below. However, you should follow a format similar to the dot points above for each task/analysis which you complete in your project.
You should complete all tasks in this subsection.
Read/load data into R from a text file or .Rdata file.
Provide a summary of the data.frame, using str and summary.
For at least one column, report the mean, median and sd.
For at least one column, report a table of frequencies with table.
Create at least one histogram (hist).
Recode at least one variable using indexing. (E.g. if gender is coded as “male” or “female”, recode these values as 1 or 2 in a new column.)
For at least one column, check for any values that are outside the acceptable range and, if there are any, recode them as NA. For example, if one of your variables was age, you would want to exclude any responses that were below 0.
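The descriptive tasks above could be worked through roughly as in the following sketch. It is only an illustration under stated assumptions: the data frame `study` and its columns (`gender`, `group`, `age`, `score`) are simulated and hypothetical, and the temporary text file stands in for whatever file your own dataset comes in.

```r
# Hypothetical, simulated data standing in for a real dataset
set.seed(1)
study <- data.frame(
  id     = 1:40,
  gender = sample(c("male", "female"), 40, replace = TRUE),
  group  = sample(c("a", "b", "c"), 40, replace = TRUE),
  age    = round(rnorm(40, mean = 30, sd = 8)),
  score  = rnorm(40, mean = 50, sd = 10),
  stringsAsFactors = FALSE
)
study$age[3] <- -5  # plant one impossible value to illustrate the range check

# Write the data to a temporary text file, then read it back in,
# to mimic loading a dataset from a text file
path <- tempfile(fileext = ".txt")
write.table(study, file = path, sep = "\t", row.names = FALSE)
study <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Summaries of the whole data frame
str(study)
summary(study)

# Mean, median and standard deviation of one column
mean(study$score)
median(study$score)
sd(study$score)

# Table of frequencies for a categorical column
table(study$gender)

# Histogram of a numeric column
hist(study$score, main = "Distribution of score", xlab = "Score")

# Recode gender as 1/2 in a new column using logical indexing
study$gender.num <- NA
study$gender.num[study$gender == "male"] <- 1
study$gender.num[study$gender == "female"] <- 2

# Recode out-of-range ages (below 0) as NA
study$age[study$age < 0] <- NA
```

In your own document each of these steps would be its own task, with a sentence before it saying what you will do and an interpretation after the output.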
Choose two of the tests from this subsection.
Conduct at least one t test with t.test.
Conduct at least one one-way ANOVA (with aov), with an Independent Variable that has at least 3 levels. If the result is significant, or if warranted by your research question, conduct appropriate post-hoc tests.
Conduct at least one multi-variable ANOVA (with aov), with at least 2 Independent Variables. If it is appropriate to do so, conduct post-hoc tests.
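As a rough sketch of what these three tests can look like in R, the example below uses simulated data. The data frame `dat` and its columns (`condition`, `sex`, `score`) are hypothetical, and Tukey's HSD is just one possible choice of post-hoc test, not necessarily the one your data calls for.

```r
# Hypothetical, simulated data: one numeric DV and two categorical IVs
set.seed(2)
dat <- data.frame(
  condition = factor(rep(c("control", "low", "high"), each = 20)),
  sex       = factor(rep(c("female", "male"), times = 30)),
  score     = rnorm(60, mean = rep(c(50, 52, 58), each = 20), sd = 8)
)

# Descriptive statistics per group, and a simple plot
aggregate(score ~ condition, data = dat, FUN = mean)
boxplot(score ~ condition, data = dat,
        xlab = "Condition", ylab = "Score", main = "Score by condition")

# t-test comparing the two levels of sex
sex.ttest <- t.test(score ~ sex, data = dat)
sex.ttest

# One-way ANOVA with a 3-level IV, followed by a post-hoc test
cond.aov <- aov(score ~ condition, data = dat)
summary(cond.aov)
TukeyHSD(cond.aov)

# ANOVA with two IVs and their interaction
two.aov <- aov(score ~ condition * sex, data = dat)
summary(two.aov)
```

Saving each test as an object (e.g. `sex.ttest`) makes it easy to pull out the exact values you need for the APA style write-up.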
For both tests you choose you should do the following:
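Calculate appropriate descriptive statistics for each group, for example with aggregate.
Produce an appropriate plot of the data using barplot, boxplot, vioplot or pirateplot.
Repeat the test on a subset of the data selected with subset, and also report these results (without the figure or extra descriptives).
Complete all the tasks in this subsection.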
Conduct at least one regression analysis using lm. You can have multiple predictors or a single predictor. Interpret/write the results in APA style.
Produce an appropriately labelled and interpretable scatterplot for one of the predictors and the DV using plot. It should include a regression line.
Conduct at least one Chi-Square test (chisq.test). Write the results in APA style.
Conduct at least one correlation test (cor.test). Write the results in APA style.
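The following sketch shows one way these four tasks might be carried out. The data frame `survey` and its columns (`hours`, `grade`, `handed`, `passed`) are simulated and purely hypothetical; substitute the variables from your own dataset.

```r
# Hypothetical, simulated data
set.seed(3)
n <- 80
survey <- data.frame(
  hours  = runif(n, min = 0, max = 10),
  handed = sample(c("left", "right"), n, replace = TRUE),
  passed = sample(c("no", "yes"), n, replace = TRUE)
)
survey$grade <- 40 + 3 * survey$hours + rnorm(n, sd = 6)

# Regression with a single predictor
grade.lm <- lm(grade ~ hours, data = survey)
summary(grade.lm)

# Labelled scatterplot of the predictor and the DV, with the regression line
plot(survey$hours, survey$grade,
     xlab = "Hours studied", ylab = "Grade",
     main = "Grade as a function of hours studied")
abline(grade.lm, col = "blue", lwd = 2)

# Chi-square test of independence between two categorical variables
handed.chisq <- chisq.test(table(survey$handed, survey$passed))
handed.chisq

# Correlation test between two numeric variables
hours.cor <- cor.test(survey$hours, survey$grade)
hours.cor
```

Passing the fitted `lm` object to `abline` draws the fitted line, which works here because the model has a single predictor.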
Choose one of the tasks in this subsection.
Write a custom function that will replicate the results of one of the tests you conducted in the “t.test and ANOVA” section. By this I mean that the function should accept as inputs vectors of data for each of the necessary variables, and return the test object, the appropriate descriptive statistics, and produce an appropriate plot. For example if you chose the t.test, it should accept a vector of data for each group (in vector or formula format, whichever you prefer) and return the results of a t.test comparing these groups (saved as an object, not printed), the means and standard deviations of each group and a plot comparing these groups. Simulate some data to test this function.
Write your own custom function to perform a task of similar difficulty.
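To make the first option more concrete, here is a minimal sketch of such a function for the t-test case. The function name `my.ttest`, its arguments, and the choice of a boxplot are all hypothetical choices made for illustration; your own function may reasonably look quite different.

```r
# Hypothetical custom function: takes one vector of data per group and
# returns the t-test object plus descriptive statistics, and draws a plot
my.ttest <- function(x, y, plot = TRUE) {
  # Run the test and keep the result as an object (it is not printed here)
  test <- t.test(x, y)

  # Mean and standard deviation of each group
  descriptives <- data.frame(
    group = c("x", "y"),
    mean  = c(mean(x, na.rm = TRUE), mean(y, na.rm = TRUE)),
    sd    = c(sd(x, na.rm = TRUE), sd(y, na.rm = TRUE))
  )

  # Plot comparing the two groups
  if (plot) {
    boxplot(list(x = x, y = y), ylab = "Value", main = "Comparison of groups")
  }

  list(test = test, descriptives = descriptives)
}

# Simulate some data to test the function
set.seed(4)
group.a <- rnorm(30, mean = 10, sd = 2)
group.b <- rnorm(30, mean = 12, sd = 2)

result <- my.ttest(group.a, group.b)
result$descriptives
result$test
```

Returning a list keeps the test object and the descriptives together, so both can be inspected after the function has been called.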
Write a brief summary of your main conclusions in 1-3 paragraphs. This is just a summary of the main results and what you can conclude from them.
At the end of the day I want to receive a document that contains a mixture of written sections, R code and R output. I don’t care how you produce this document. Here are two suggestions:
The easy but inefficient way is to take screenshots of your R code and output for each task. You can then produce the written sections in whatever program you normally use (e.g. Word or LaTeX) and add the screenshots in the appropriate sections.
The more efficient method is to use R Markdown. This is a package for RStudio which lets you create PDF, Word and HTML documents which include sections of R code and output. This is the package I have used to create the WPAs for this course. There are only a few features of R Markdown that you will need to understand to produce the document for your final project. I’ll go through these features in the first Final Project lesson, and provide a template to anyone who wants to try using R Markdown to produce their document. Nathaniel Phillips has also written a good document for the basics, and there are tutorials on the R Markdown website.
You can use either of these two methods to produce your document, or do something completely different. It won’t affect your mark. Here is an example of what your final project document could look like. All the sections with a comment of # INSERT CODE HERE would be actual R code producing the output following it. The tasks in the example document don’t perfectly match the ones you are required to do, nor do the lengths of some of the sections (like the conclusion). This is just an example of the overall format.
Just like with the WPAs, you need to email it to me before the due date (10/06/2017). You should put the course code in the subject line of the email, and name your document LastnameFirstname-RFinalProject. You should also have your name at the top of the document.
How will I be graded?
I will grade your paper based on how well you followed the instructions above. Here is a checklist I will use when grading your paper.
If the answer to all these questions is “Yes” (or “Mostly Yes”), you will be just fine.