Midterm 1 Framework, Example Dataset with Questions

Here is an example of a dataset with some questions to help guide your preparation for the midterm. To be clear, your questions will differ depending on your data set, but you should expect the style of questions to be very similar. If you need some guidance on how any part of this might go for your data set, I’m happy to meet and walk you through some steps.

Task 0: (2 weeks before the exam) Schedule your exam, choose a dataset, and give three questions that you’d like to answer using this data

My goal for this midterm is to assess your ability to use RStudio to analyze data and extract information. There are two core skills being assessed here: summarizing or visualizing data with R, and thinking critically about these summaries and observations.

Before the exam, you’ll choose a dataset you’re interested in and come up with a few questions you have about it. During the exam, you’ll show me you can manipulate this dataset in R and give preliminary answers to those questions.

I’ve put a Google Form on the class Moodle page which will ask you for the following information. You’ll sign up for a time when you’d like to take your exam, link me to the dataset you’d like to work with, and write 3 questions which the dataset might be able to answer.

To start you off, here are some websites with freely available data from a wide range of sources. Not every dataset on these sites will work for the exam: some are in formats different from the .csv files we’ve used in class, others will be far too large for us to deal with, and still others will be too small to contain anything interesting. I’m available to help you find a suitable dataset and load it into RStudio before your exam. If you want to use a different dataset from someplace else, that’s ok too. I’m confident that whatever your interests are, we can find something relevant to you.

Awesome Public Datasets

Data Repositories

Jester: ratings of jokes

MovieLens: movie ratings with year and genres

Sports-reference: statistics for baseball, basketball, football, hockey at pro and college levels

Kaggle.com: great source of introductory and interesting datasets

fivethirtyeight: data journalism website. Your questions shouldn’t be answered by the article the data comes from.

CDC WONDER: the Centers for Disease Control and Prevention’s public health data

Police Data Initiative: info from police departments across the country. Datasets on hate crimes, arrests, community surveys, etc.

Task 1: (Right before the exam) Make sure the tidyverse is loaded in your workspace

Before the exam begins, I ask that you already have ‘tidyverse’ loaded and ready to go. This will make sure that during your exam we’ll have as much time as possible to work! There will only be 15 minutes per exam, and running out the clock won’t help here. All you should need is

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
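If library(tidyverse) gives an error saying there is no such package, the tidyverse has not been installed on your machine yet. Here is a minimal sketch of one way to handle that, assuming you have an internet connection and a CRAN mirror set (RStudio sets one by default). It installs the package only when it is missing and then loads it.

```r
# Install the tidyverse only if it is not already available on this machine.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it as usual.
library(tidyverse)
```

Installing can take several minutes, so it is worth doing this well before your exam slot.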

Task 2: Load in your data set

I’ll ask you to start by uploading the data you intend to work with. Using read_csv or whatever command you might need, the end result should be a tibble containing your dataset. In this case our data file is shapes.csv, so we upload the file into our workspace and run the command below.

shapes <- read_csv("shapes.csv")

## Parsed with column specification:
## cols(
##   shape = col_character(),
##   color = col_character(),
##   area = col_double()
## )
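If your file is not a comma-separated .csv, readr has close relatives of read_csv that work the same way. The sketch below is only an illustration: the file name shapes_sample.tsv is made up, and the three rows are copied from the shapes data. It writes a small tab-separated file to a temporary folder, reads it back with read_tsv and read_delim, and then shows how to state the column types yourself instead of letting R guess them.

```r
library(tidyverse)

# Write a tiny tab-separated file so this example is self-contained.
# The file name is hypothetical; the rows come from the shapes data.
tsv_path <- file.path(tempdir(), "shapes_sample.tsv")
writeLines(
  c("shape\tcolor\tarea",
    "square\tyellow\t9409",
    "triangle\tblue\t2028",
    "square\tblue\t3025"),
  tsv_path
)

# read_tsv() handles tab-separated files.
shapes_tsv <- read_tsv(tsv_path)

# read_delim() takes any single-character delimiter.
shapes_delim <- read_delim(tsv_path, delim = "\t")

# Stating the column types up front replaces the guessing step
# that read_csv reports in its "column specification" message.
shapes_typed <- read_tsv(
  tsv_path,
  col_types = cols(
    shape = col_character(),
    color = col_character(),
    area  = col_double()
  )
)
shapes_typed
```

Whichever command you use, the goal is the same as above: a tibble with one column per variable and one row per observation.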

Task 3: Identify variables and observations

Next, I’ll want you to go through the variables in your data set and explain briefly what they represent and whether they are categorical or numerical. If there are a lot of variables, it’s ok to just explain those you think are interesting.

shapes

## # A tibble: 1,000 x 3
##    shape    color    area
##    <chr>    <chr>   <dbl>
##  1 square   yellow 9409  
##  2 circle   yellow 4072. 
##  3 triangle blue   2028  
##  4 square   blue   3025  
##  5 square   blue   9216  
##  6 square   yellow 4356  
##  7 circle   yellow   78.5
##  8 triangle red    4563  
##  9 square   yellow 1156  
## 10 triangle red    5043  
## # … with 990 more rows

It may also help to list all the variables at once, without any observations. You can do this with the following command.

names(shapes)

## [1] "shape" "color" "area"

In this case, we have 3 variables: shape, color, and area. The first two are categorical, while the third is numerical and continuous.
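If you are unsure whether a variable is categorical or numerical, two quick checks can help. The sketch below uses a stand-in tibble, shapes_head, built from the ten rows printed above (with areas as displayed, so slightly rounded), since the full shapes.csv is not included here. glimpse shows each variable with its type, and counting distinct values hints at which variables behave like categories.

```r
library(tidyverse)

# Stand-in for the real data: the ten rows printed above.
# Areas are copied as displayed, so two of them are rounded.
shapes_head <- tibble(
  shape = c("square", "circle", "triangle", "square", "square",
            "square", "circle", "triangle", "square", "triangle"),
  color = c("yellow", "yellow", "blue", "blue", "blue",
            "yellow", "yellow", "red", "yellow", "red"),
  area  = c(9409, 4072, 2028, 3025, 9216, 4356, 78.5, 4563, 1156, 5043)
)

# One line per variable: its name, its type, and the first few values.
glimpse(shapes_head)

# Number of distinct values in each column.
# A small number relative to the row count suggests a categorical variable.
distinct_counts <- shapes_head %>%
  summarize(across(everything(), n_distinct))
distinct_counts
```

These checks are only a starting point. A variable stored as a number (a zip code, for example) can still be categorical, so you should also think about what the variable means.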

Task 4: Exploratory Data Analysis (Summary Statistics)

This is where questions will begin to differ based on your individual data sets. I will ask you 1-2 questions like this to get a sense of how well you can manipulate your dataset and extract information. I’ve given some example questions below to guide your preparation. It will help you a lot to prepare for this ahead of time. Can you compute summary statistics for the numerical variables? Can you find counts for each value of the categorical variables?

Summary Statistics Sample Questions

In this case, we only have one numerical variable, but it would be natural to ask for some summary statistics: the maximum, minimum, mean, median, and standard deviation.

shapes %>% summarize( maxarea=max(area), minarea=min(area), meanarea=mean(area), medianarea=median(area), sdarea=sd(area))

## # A tibble: 1 x 5
##   maxarea minarea meanarea medianarea sdarea
##     <dbl>   <dbl>    <dbl>      <dbl>  <dbl>
## 1  31416.     0.8    3945.      2611.  4754.

I would also want you to interpret these in the context of your dataset. For example, if your dataset has to do with cooking, I’d ask what these statistics would be telling you about that. There isn’t much to go on in this shapes example because the data doesn’t represent anything, but presumably your example will have more interesting context and content.
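One practical wrinkle with real datasets: if a numerical variable has missing values, functions like mean and sd return NA. The shapes data has no missing values, so the sketch below uses a made-up four-row tibble called toy to show the problem and one common way around it, the na.rm = TRUE argument.

```r
library(tidyverse)

# Hypothetical toy data with one missing area, purely for illustration.
toy <- tibble(area = c(9409, 2028, NA, 3025))

# With a missing value present, the mean is reported as NA.
with_na <- toy %>% summarize(meanarea = mean(area))
with_na

# na.rm = TRUE drops missing values before computing the statistic.
# Counting the missing values tells you how much data was dropped.
without_na <- toy %>%
  summarize(
    meanarea  = mean(area, na.rm = TRUE),
    n_missing = sum(is.na(area))
  )
without_na
```

If you do drop missing values, be ready to say how many there were and whether ignoring them could change your conclusions.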

Now we also have some categorical variables in play. It will help to know how many of each shape and color there are.

shapes %>% group_by(color) %>% summarize(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 2
##   color  count
##   <chr>  <int>
## 1 blue     360
## 2 green     78
## 3 red      290
## 4 yellow   272

shapes %>% group_by(shape) %>% summarize(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   shape    count
##   <chr>    <int>
## 1 circle     120
## 2 square     477
## 3 triangle   403

To see how color relates to area, group by color and compute the same summary statistics within each group.

shapes %>% group_by(color) %>%  summarize( maxarea=max(area), minarea=min(area), meanarea=mean(area), medianarea=median(area), sdarea=sd(area))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 6
##   color  maxarea minarea meanarea medianarea sdarea
##   <chr>    <dbl>   <dbl>    <dbl>      <dbl>  <dbl>
## 1 blue    21642.     0.8    3208.      2500   3039.
## 2 green   27759.     1      5761.      3844   6695.
## 3 red     31416.     0.8    3816.      2437.  5093.
## 4 yellow  31416.     1      4538.      2818.  5352.

Obtaining a breakdown by shape is very similar.

shapes %>% group_by(shape) %>%  summarize( maxarea=max(area), minarea=min(area), meanarea=mean(area), medianarea=median(area), sdarea=sd(area))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 6
##   shape    maxarea minarea meanarea medianarea sdarea
##   <chr>      <dbl>   <dbl>    <dbl>      <dbl>  <dbl>
## 1 circle    31416.    28.3   10703.      7857.  9245.
## 2 square     9801      1      3411.      2500   2949.
## 3 triangle   7351.     0.8    2565.      2107.  2166.

We can also combine these questions: what is the average area of a yellow square?

shapes %>% filter(color=="yellow", shape=="square") %>%  summarize(meanyellowarea=mean(area))

## # A tibble: 1 x 1
##   meanyellowarea
##            <dbl>
## 1          3333.

If you know an object is red with area greater than 3,000, what percentage of these are squares? Triangles? Circles? We can filter for the objects that satisfy both conditions, group by shape, and count how many there are of each shape using n(). There are 125 such objects, of which 84 (about 67%) are triangles, 21 (about 17%) are squares, and 20 (16%) are circles, so triangles are by far the most common.

shapes %>% filter(color=="red", area>3000) %>% group_by(shape) %>%  summarize(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   shape    count
##   <chr>    <int>
## 1 circle      20
## 2 square      21
## 3 triangle    84

Task 5: Exploratory Data Analysis (Visualization)

Now we will use ggplot to visualize some of this data. There is more than enough to cover with just the five named graphs (5NG): boxplots, scatterplots, histograms, bar plots, and line graphs.

Visualization Sample Questions

In parallel with the summary statistics questions, we can visualize the center and spread of the area variable with a histogram.

ggplot(shapes, mapping=aes(x=area))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

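This distribution is skewed right (the long tail extends to the right), with most values concentrated at the low end of the range. The median (about 2,611) sits below the mean (about 3,945), which is typical of a right-skewed distribution.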
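The message above suggests picking a binwidth yourself rather than accepting the default of 30 bins. Here is a minimal sketch of how that looks. The binwidth of 1000 is just a guess that suits areas in the thousands, and the data is a stand-in made of the ten areas printed in Task 3, so you should try a few values on your own dataset.

```r
library(tidyverse)

# Stand-in for the real data: the ten areas printed in Task 3.
shapes_head <- tibble(
  area = c(9409, 4072, 2028, 3025, 9216, 4356, 78.5, 4563, 1156, 5043)
)

# binwidth sets how wide each bar is, in the units of the x variable.
# boundary = 0 makes the bins start at 0, 1000, 2000, and so on.
area_hist <- ggplot(shapes_head, mapping = aes(x = area)) +
  geom_histogram(binwidth = 1000, boundary = 0)
area_hist
```

A binwidth that is too small gives a spiky, noisy picture, and one that is too large hides the shape of the distribution.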

It’s also natural to see how area splits up along different shapes and colors. Since we’re comparing how summary statistics for a numerical variable split along a categorical variable, it makes sense to use side-by-side boxplots.

ggplot(shapes, mapping=aes(x=shape,y=area))+geom_boxplot()

ggplot(shapes, mapping=aes(x=color,y=area))+geom_boxplot()

To compare the two categorical variables of shape and color, we can make a stacked barplot that shows the distribution of each color within each shape. (Notice that the colors in the diagram don’t match the actual colors at all! If this bothers you as much as it bothered me, you’ll be comforted to know that it’s a very quick fix.)

ggplot(shapes, mapping=aes(x=shape, fill=color))+geom_bar()
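Here is one possible version of that quick fix, which may not be the only one. Because the values of the color variable ("blue", "green", "red", "yellow") happen to be valid R color names, scale_fill_identity tells ggplot to use them directly as the fill colors. The sketch uses a stand-in tibble built from the ten rows printed in Task 3.

```r
library(tidyverse)

# Stand-in for the real data: shape and color from the ten rows in Task 3.
shapes_head <- tibble(
  shape = c("square", "circle", "triangle", "square", "square",
            "square", "circle", "triangle", "square", "triangle"),
  color = c("yellow", "yellow", "blue", "blue", "blue",
            "yellow", "yellow", "red", "yellow", "red")
)

# scale_fill_identity() uses the data values themselves as colors.
# guide = "legend" keeps the legend, which this scale hides by default.
shape_bars <- ggplot(shapes_head, mapping = aes(x = shape, fill = color)) +
  geom_bar() +
  scale_fill_identity(guide = "legend")
shape_bars
```

This only works when the category names are real color names. For other categorical variables, scale_fill_manual lets you choose a color for each value yourself.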

Your dataset will likely have many more variables. If this dataset included other types of variables it would have made sense to use the two other named graphs (a scatterplot or a line graph). However, in this case, we didn’t need those. This may also be true in your data set! When you are preparing ahead of time, you can decide which graphs are most appropriate for your question.

Conclusion: Questions and answers

After showing you can use some basic data manipulation and visualization tools, we’ll come back to the questions you proposed when you chose this dataset. You may or may not have definitively answered them, and this is fine. What I’m looking for are some of your thoughts, questions, and insights on the data. Here is your chance to shine - tell me what interests you about the dataset you chose! Here are some prompts to help frame this discussion. As in the previous section, it’s ok to prepare this ahead of time. In fact, it will help you a lot to do so!

Were you able to answer your questions with the dataset you chose? If not, why not? If so, what summary statistics or visualization helped you answer this question?
Show me a summary statistic or visualization from the data which confirms something you thought would be true. Why did you expect to see this?
Show me a summary statistic or visualization from the data that you did not expect to see or were surprised to see. Why was this surprising?