Here is an example of a dataset with some questions to help guide your preparation for the midterm. To be clear, your questions will differ depending on your data set, but you should expect the style of questions to be very similar. If you need some guidance on how any part of this might go for your data set, I’m happy to meet and walk you through some steps.
My goal for this midterm is to assess your ability to use RStudio to analyze data and extract information. There are two core skills being assessed here: summarizing or visualizing data with R, and thinking critically about these summaries and observations.
Before the exam, you’ll choose a dataset you’re interested in and come up with a few questions you have about it. During the exam, you’ll show me you can manipulate this dataset in R and give preliminary answers to those questions.
I’ve put a Google Form on the class Moodle page which will ask you for the following information. You’ll sign up for a time when you’d like to take your exam, link me to the dataset you’d like to work with, and write 3 questions which the dataset might be able to answer.
To start you off, here are some websites with freely available data from a wide range of sources. Some of these datasets are in formats different from the .csv files we’ve used in class, others will be far too large for us to deal with, and still others will be too small to contain anything interesting; the sketch after this list shows how a few other common formats can be read. I’m available to help you find a suitable dataset and load it into RStudio before your exam. If you want to use a different dataset from someplace else, that’s ok too. I’m confident that whatever your interests are, we can find something relevant to you.
Jester: ratings of jokes
MovieLens: movie ratings with year and genres
Sports-reference: statistics for baseball, basketball, football, hockey at pro and college levels
Kaggle.com: great source of introductory and interesting datasets
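If your dataset isn’t a .csv file, here is a minimal sketch of reading a few other common formats. The file names below are made up, so substitute your own:
# A sketch with made-up file names: readr (part of the tidyverse) and readxl
# handle several common formats besides comma-separated .csv files.
library(tidyverse)
library(readxl)   # installed with the tidyverse, but loaded separately

ratings <- read_tsv("ratings.tsv")              # tab-separated values
survey <- read_excel("survey.xlsx")             # Excel spreadsheet
scores <- read_delim("scores.txt", delim=";")   # some other delimiter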
Before the exam begins, I ask that you already have ‘tidyverse’ loaded and ready to go. This will make sure that during your exam we’ll have as much time as possible to work! There will only be 15 minutes per exam, and running out the clock won’t help here. All you should need is
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
I’ll ask you to start by loading the data you intend to work with. Using read_csv, or whatever command your file format requires, the end result should be a tibble containing your dataset. In this case our data file is shapes.csv, so we upload the file into our workspace and run the command below.
shapes <- read_csv("shapes.csv")
## Parsed with column specification:
## cols(
## shape = col_character(),
## color = col_character(),
## area = col_double()
## )
Next, I’ll want you to go through the variables in your data set and explain briefly what they represent and whether they are categorical or numerical. If there are a lot of variables, it’s ok to just explain those you think are interesting.
shapes
## # A tibble: 1,000 x 3
## shape color area
## <chr> <chr> <dbl>
## 1 square yellow 9409
## 2 circle yellow 4072.
## 3 triangle blue 2028
## 4 square blue 3025
## 5 square blue 9216
## 6 square yellow 4356
## 7 circle yellow 78.5
## 8 triangle red 4563
## 9 square yellow 1156
## 10 triangle red 5043
## # … with 990 more rows
It may also help to list all the variables at once, without any observations. You can do this with the following command.
names(shapes)
## [1] "shape" "color" "area"
In this case, we have 3 variables: shape, color, and area. The first two are categorical, while the third is numerical and continuous.
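If your dataset has many variables, glimpse() is another useful command: it prints one line per column, including its type, which makes this step easier.
glimpse(shapes)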
This is where questions will begin to differ based on your individual data sets. I will ask you 1-2 questions like this to get a sense of how well you can manipulate your dataset and extract information. I’ve given some example questions below to guide your preparation. It will help you a lot to prepare for this ahead of time. Can you compute summary statistics for the numerical variables? Can you find counts for each value of the categorical variables?
shapes %>% summarize(maxarea=max(area), minarea=min(area), meanarea=mean(area), medianarea=median(area), sdarea=sd(area))
## # A tibble: 1 x 5
## maxarea minarea meanarea medianarea sdarea
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31416. 0.8 3945. 2611. 4754.
I would also want you to interpret these in the context of your dataset. For example, if your dataset has to do with cooking, I’d ask what these statistics tell you about that. There isn’t much to go on in this shapes example because the data doesn’t represent anything, but presumably your dataset will have more interesting context and content.
shapes %>% group_by(color) %>% summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## color count
## <chr> <int>
## 1 blue 360
## 2 green 78
## 3 red 290
## 4 yellow 272
shapes %>% group_by(shape) %>% summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## shape count
## <chr> <int>
## 1 circle 120
## 2 square 477
## 3 triangle 403
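As a shortcut, dplyr’s count() does this group_by() and summarize(n()) combination in one step; the count column is simply named n instead of count.
shapes %>% count(color)
shapes %>% count(shape)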
To see how color relates to area, group by color and compute the same summary statistics within each group.
shapes %>% group_by(color) %>% summarize(maxarea=max(area), minarea=min(area), meanarea=mean(area), medianarea=median(area), sdarea=sd(area))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 6
## color maxarea minarea meanarea medianarea sdarea
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 blue 21642. 0.8 3208. 2500 3039.
## 2 green 27759. 1 5761. 3844 6695.
## 3 red 31416. 0.8 3816. 2437. 5093.
## 4 yellow 31416. 1 4538. 2818. 5352.
Obtaining a breakdown by shape is very similar.
shapes %>% group_by(shape) %>% summarize(maxarea=max(area), minarea=min(area), meanarea=mean(area), medianarea=median(area), sdarea=sd(area))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 6
## shape maxarea minarea meanarea medianarea sdarea
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 circle 31416. 28.3 10703. 7857. 9245.
## 2 square 9801 1 3411. 2500 2949.
## 3 triangle 7351. 0.8 2565. 2107. 2166.
shapes %>% filter(color=="yellow", shape=="square") %>% summarize(meangreenarea=mean(area))
## # A tibble: 1 x 1
## meangreenarea
## <dbl>
## 1 3333.
shapes %>% filter(color=="red", area>3000) %>% group_by(shape) %>% summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## shape count
## <chr> <int>
## 1 circle 20
## 2 square 21
## 3 triangle 84
Now we’ll use ggplot to visualize some of this data. There is more than enough to cover with just the 5NG, the five named graphs: boxplots, scatterplots, histograms, bar plots, and line graphs.
ggplot(shapes, mapping=aes(x=area))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This distribution is skewed right (to the side of the long tail), with center close to 0.
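The message above suggests choosing the bins yourself. The binwidth of 1000 below is just a starting guess; experiment until the shape of your distribution is clear.
ggplot(shapes, mapping=aes(x=area))+geom_histogram(binwidth=1000)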
ggplot(shapes, mapping=aes(x=shape,y=area))+geom_boxplot()
ggplot(shapes, mapping=aes(x=color,y=area))+geom_boxplot()
ggplot(shapes, mapping=aes(x=shape, fill=color))+geom_bar()
Your dataset will likely have more variables than this one. If shapes.csv had a second numerical variable, the two remaining named graphs (a scatterplot or a line graph) would have made sense; with only one numerical variable, we didn’t need them here. The same may be true of your data set. When you are preparing ahead of time, you can decide which graphs are most appropriate for your questions.
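Just to sketch the syntax of those two graphs, here is a tiny made-up tibble with two numerical variables (the numbers mean nothing); with your data you would use real columns instead.
# A made-up example: shapes.csv has only one numerical variable, so we invent
# a small tibble with two numerical columns to show what the syntax looks like.
made_up <- tibble(year=2011:2020, total=c(5, 8, 7, 12, 15, 14, 18, 21, 19, 25))
ggplot(made_up, mapping=aes(x=year, y=total))+geom_point()   # scatterplot
ggplot(made_up, mapping=aes(x=year, y=total))+geom_line()    # line graph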
After showing me that you can use some basic data manipulation and visualization tools, we’ll come back to the questions you proposed when you chose this dataset. You may or may not have definitively answered them, and this is fine. What I’m looking for are some of your thoughts, questions, and insights on the data. Here is your chance to shine: tell me what interests you about the dataset you chose! Here are some prompts to help frame this discussion. As in the previous section, it’s ok to prepare this ahead of time. In fact, it will help you a lot to do so!
Were you able to answer your questions with the dataset you chose? If not, why not? If so, what summary statistics or visualization helped you answer this question?
Show me a summary statistic or visualization from the data which confirms something you thought would be true. Why did you expect to see this?
Show me a summary statistic or visualization from the data that you did not expect to see or were surprised to see. Why was this surprising?