In your final paper, you will conduct an extensive set of analyses in R on a dataset of your choice. You can use any dataset you’d like; however, it would be best if your dataset had at least 100 rows and 5 columns. If you have data for a Bachelor’s or Master’s thesis, you are very welcome to use that. If you do not have a dataset, look at the next section “How do I get a dataset?”.

You are welcome to use the same dataset as another student in the class, and you are welcome to work together on your analyses. However, you must each write your own code and text and turn in your own work!

How do I get a dataset?

If you do not have a dataset, here are a few places to get datasets:

  1. R Datasets: run “library(help = "datasets")” to see a list of the datasets preloaded in R. Make sure to use one with several columns and (hopefully) at least 100 rows.

  2. Datasets from the UCI machine learning database ( This database has 307 datasets from many fields. If you want to use one of these datasets, I recommend you use those with a Multivariate data type and a Regression default task (

  3. Flights dataset: Use this dataset on flights leaving the Houston airport (

  4. Create your own new dataset!: You are welcome to collect your own data by, for example, conducting a survey of students.
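Whichever source you pick, it helps to check the size of the dataset before you commit to it. The following is a minimal sketch using the built-in ChickWeight dataset; the object names (n_rows, n_cols, meets_rows, meets_cols) are just illustrative choices.

```r
# ChickWeight ships with R's datasets package, so no download is needed
data(ChickWeight)

n_rows <- nrow(ChickWeight)  # number of observations
n_cols <- ncol(ChickWeight)  # number of variables

# Compare against the suggested size of 100 rows and 5 columns
meets_rows <- n_rows >= 100
meets_cols <- n_cols >= 5

names(ChickWeight)  # column names
str(ChickWeight)    # column types and a preview of the values
```

ChickWeight has plenty of rows but only four columns, so it falls a little short of the suggested column count. The size guideline is a recommendation rather than a hard rule, but a wider dataset gives you more questions to ask.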

How should I format my paper?

Your paper must be written entirely in an RMarkdown document and knitted to either an HTML or PDF document. Make sure to print all of your R code in the document by including “echo = T” in the chunk options (or leave it blank and it will print your code by default). You must include your name, date, and the title of the course on the first page of your paper. There is no minimum or maximum page length.

If for some reason you absolutely cannot get R to knit your document to an HTML or PDF file, you can turn in a printout of your R code; however, you may receive a 15% penalty to your grade.

What should be in my paper?

There should be four sections in your paper: Dataset Description, Questions, Analyses, and Conclusion.

Section 1: Dataset Description

Your paper should start with a description of the dataset. Make sure you answer these four questions (in paragraph form).

  1. How did you obtain the dataset?
  2. How were the data originally collected?
  3. How many rows and columns are in the dataset?
  4. What are the columns in the dataset? For each column, give the variable name and a brief description of what it represents.

Section 2: Questions

Next, you should list 5 questions that you would like to answer. For example, if I were analyzing the ChickWeight dataset, I could ask the following:

  1. How did the chicken weights generally change over time?
  2. Was there a difference in the average chicken weights as a result of the different diets?
  3. Were the chicken weights at time 1 normally distributed?
  4. Was there a difference in weights between time 2 and time 4?
  5. Did more chickens die in one diet than another?

Section 3: Analyses

In this section, you should conduct the relevant analyses to answer each of your five research questions. I expect to see all of the relevant R code in a chunk, and I expect you to include the main result in your written text using a mini-chunk.
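As a rough illustration of the chunk plus mini-chunk pattern, the sketch below stores a result in an object inside a regular chunk and then shows, in a comment, how that object could be referenced from the written text. The object name mean_final_weight is a made-up example, and the day 21 mean is only a stand-in for whatever your main result is.

```r
# Inside a regular chunk: run the analysis and store the main result
mean_final_weight <- mean(ChickWeight$weight[ChickWeight$Time == 21])

# In the written text (outside the chunk), refer to the stored result
# with a mini-chunk, i.e. inline R code between single backticks:
#
#   The mean weight on day 21 was `r round(mean_final_weight, 1)` grams.
#
# When the document is knitted, the mini-chunk is replaced by its value.
```

Storing the result first means the number in your text always matches the code, even if you later change the analysis.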

You do not need to restrict yourself to one analysis for each question. For example, to answer the question “How did the chicken weights change over time?”, I could create a plot, calculate a regression and/or calculate the mean (or median) weight at each time point.
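To make this concrete, here is one possible way to combine a plot, a regression, and group means for the weight-over-time question. It is only a sketch on the built-in ChickWeight data; the object names weight_lm and mean_by_time are illustrative.

```r
# 1. Plot: weight against time for all chickens
plot(ChickWeight$Time, ChickWeight$weight,
     xlab = "Time (days)", ylab = "Weight (g)",
     main = "Chicken weights over time")

# 2. Regression: linear change in weight per day
weight_lm <- lm(weight ~ Time, data = ChickWeight)
abline(weight_lm, col = "blue")  # add the fitted line to the plot
summary(weight_lm)

# 3. Descriptive statistics: mean weight at each time point
mean_by_time <- aggregate(weight ~ Time, data = ChickWeight, FUN = mean)
mean_by_time
```

Each of the three pieces answers the same question from a different angle, so you can pick whichever ones support your written conclusion best.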

At some point in your analyses, you need to use each of the following 10 techniques. You do not need to do all of these for each of your analysis questions! You just need to do each one once across all of your analyses.

  1. Recode the values of at least one column using indexing and reassignment.
  2. Calculate at least one standard deviation using sd(), one mean using mean() and one median using median().
  3. Calculate at least one t-test using t.test().
  4. Calculate at least one correlation test using cor.test().
  5. Calculate at least one regression analysis using lm() or glm().
  6. Create at least one scatterplot containing data from two different groups (e.g., a set of green points and a set of red points representing different groups), with added regression lines using abline().
  7. Create at least one histogram using hist() with additional reference lines showing the mean and median of the group.
  8. Use the aggregate() function or dplyr to calculate descriptive statistics across groups of data.
  9. Use par(mfrow = c(x, y)) to put two or more plots next to each other.
  10. Create (and use!) at least one custom function.
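The sketch below shows what several of these requirements might look like on the ChickWeight data. It is not a template to copy: the names cw, diet_group, day21, and standard_error, and the "diet_1" / "other_diets" labels, are all invented for illustration, and your own analyses should follow from your own questions.

```r
cw <- as.data.frame(ChickWeight)  # plain data frame copy to work on

# Recode with indexing and reassignment: Diet 1 versus all other diets
cw$diet_group <- NA
cw$diet_group[cw$Diet == "1"] <- "diet_1"
cw$diet_group[cw$Diet != "1"] <- "other_diets"

day21 <- cw[cw$Time == 21, ]  # weights at the last time point

# t-test: do the two diet groups differ in weight on day 21?
tt <- t.test(weight ~ diet_group, data = day21)

# Correlation test: is weight related to time?
ct <- cor.test(cw$Time, cw$weight)

# Custom function: standard error of the mean
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# aggregate() with the custom function: standard error per diet on day 21
se_by_diet <- aggregate(weight ~ Diet, data = day21, FUN = standard_error)

# Two plots next to each other
par(mfrow = c(1, 2))

# Scatterplot with two groups and one regression line per group
g1 <- cw[cw$diet_group == "diet_1", ]
g2 <- cw[cw$diet_group == "other_diets", ]
plot(g1$Time, g1$weight, col = "red", ylim = range(cw$weight),
     xlab = "Time (days)", ylab = "Weight (g)", main = "Weight by diet group")
points(g2$Time, g2$weight, col = "green")
abline(lm(weight ~ Time, data = g1), col = "red")
abline(lm(weight ~ Time, data = g2), col = "green")

# Histogram with reference lines for the mean (solid) and median (dashed)
hist(day21$weight, xlab = "Weight (g)", main = "Weights on day 21")
abline(v = mean(day21$weight), col = "blue", lwd = 2)
abline(v = median(day21$weight), col = "blue", lwd = 2, lty = 2)

par(mfrow = c(1, 1))  # reset the plotting layout
```

Note that one piece of code can cover more than one requirement; here the custom function is used inside aggregate(), and par(mfrow = c(1, 2)) places the scatterplot and the histogram side by side.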

Section 4: Conclusion

Write a brief summary of your main conclusions in a few paragraphs.

How will I be graded?

I will grade your paper based on how well you followed the instructions above and how well formatted and clean your code is. If you follow the instructions and have well-formatted code, you’ll get a good grade.