HW 2 Instructions

Due Wednesday, September 10, 2025 at 11:59 PM

Purpose

This assignment will give you experience with:

  • creating an R Project Directory with data and img folders.

  • saving, editing and using an Quarto (.qmd) file (Review).

  • knitting (rendering) an R Quarto file to create an HTML file.

  • creating a README file.

  • working with a larger dataset.

  • using the dplyr package to select variables and slice and filter data.

  • creating a basic plot with minimal formatting.

Instructions

HW 2 - First Steps

  1. Create an R project named HW 2 <first name> <last name>.

    1. File > New Project > New Directory > R Project

    2. In box, name this project:

    • HW 2 <first name> <last name>

    • My project name: HW 2 Penelope Pooler

    1. Click Create Project.
    • Note that if you create an R Project that is NOT a Quarto project, a Quarto file is not created.
  2. Create img and data folders with the R Project.

  3. Download the provided file, HW2_Template.qmd from the Homework Assignments page of the 455 website.

  4. Save the downloaded HW2_Template.qmd file to your R project.

  • Change file name to be HW2_FirstName_LastName.qmd.

    • For example, I would change the template file to be named HW2_Penelope_Pooler.qmd.

    • There should be no spaces in a file name of a Quarto (.qmd) file.

    • Change title in the file header to be ‘HW 2’.

    • Specify yourself as the author.


NOTES

  • Provided header text below shows the correct format.

  • This header text also creates a floating Table of Contents (toc) and will show chunk labels.

  • Note that the options below will make the code chunks and code chunk labels visible in the output. We will change these options in later assignments and in the group project.

---
title: "HW 2"
author: "Penelope Pooler"
date: last-modified
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced 
---
  1. Create a new chunk under the Setup header and text and add the following text to the body of the chunk:
#|label: setup

# this line specifies options for default options for all R chunks
knitr::opts_chunk$set(echo=T,  
                      highlight=T)

# suppress scientific notation
options(scipen=100)

# install helper package (pacman) if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")

# install and load required packages
# pacman should be first package in parentheses and then list others
pacman::p_load(pacman, tidyverse, gridExtra, magrittr)

# verify packages (comment out in finished documents)
p_loaded()
  1. Click the green triangle on the right side of the setup chunk to run this code.

HW 2 - Part 1: glimpse and unique

Chunk 2: Examining the diamonds Data

Notes:

  • This chunk reviews the glimpse and unique commands from Week 1.

  • diamonds is large R dataset that is part of the ggplot2 package in the tidyverse package suite.

  • Provided code for HW 2 - Part 1 (Chunk 2) WILL NOT RUN until the tidyverse package suite is loaded by running the setup chunk with the provided code.

  • When you run the glimpse() command you will see the variable type <ord>, which is an ordered factor variable.

Steps to Follow:

  1. Run the R code in the provided R chunk for HW 2 - Part 1 (Chunk 2) which reviews:

    • how to save a dataset and examine it using glimpse.

    • how to examine the levels of a variable using unique.

    • traditional and piped code to do the same task.

  2. Use the unique command with or without piping to examine the levels of:

    • the clarity variable in the diamonds R dataset.

    • the color variable in the diamonds R dataset.

    • Note that the new unique commands that you write should be ADDED to Chunk 2.

  3. Answer these Blackboard Questions:

BB Question 1

How many rows (observations) and columns (variables) are in the my_diamonds dataset which is a copy of the diamonds R dataset that you have saved to your Global Environment?

  • ____ rows

  • ____ columns

BB Question 2

Order the levels of the clarity variable from first to last based on the output from using the unique command with this variable.

BB Question 3

Fill in the blanks. The color variable in the diamonds dataset has factor levels that are alphabetical.

  • The first level of diamond color is ____.

  • The last level of diamond color is ____.

HW 2 - Part 2: select

Chunk 3: Selecting variables in a dataset

Notes:

  • This chunk demonstrates using the select command to select variables in a dataset.

  • In the provided code for HW 2 - Part 2 (Chunk 3),there are three code examples that select the first 7 variables in the my_diamonds dataset and save them as a new dataset:

    • my_diamonds1 is created by specifying variables to INCLUDE.

    • my_diamonds2 is created by specifying variables to EXCLUDE.

    • select is ALSO used to reorder the variables (price is first).

Steps to Follow:

  1. Create my_diamonds3 using the select command to only INCLUDE the first FIVE variables:

    • Variables included: price, carat, cut, color, clarity
  2. Create my_diamonds4 which will be identical to my_diamonds3, but is created by EXCLUDING the last FIVE variables using the ! operator and the c(...) operator to group the variables:

    • Variables excluded: depth, table, x, y, z

HW 2 - Part 3: slice

Chunk 4: Selecting rows by row number

Notes:

  • This chunk demonstrates using the slice command to select observations (rows).

  • In the provided code for HW 2 - Part 3 (Chunk 4), and subsequent chunks you will continue to build on your code from Part 2 (Chunk 3) with more commands.

  • Using piping in your code makes this process more efficient.

Steps to Follow:

  1. Copy and the code you wrote to create my_diamonds3 in Chunk 3 and paste this code into Chunk 4.

  2. Use the examples to add on to your code and select rows: 1001 through 30000 and 45001 through 50000

    • This dataset should still be named my_diamonds3

    • Piping will make your coding more efficient and easier to read.

  3. Answer the following Blackboard Question to verify that your dataset is correct:

BB Question 4

After successfully completing the R code in Chunk 4 of HW 2, the my_diamonds3 dataset is smaller than the original dataset.

my_diamonds3 has:

  • fewer variables (columns) after using the select command as specified.

  • fewer observations (rows) after using the slice command as specified.

After completing Chunk 4, the my_diamonds3 dataset has:

  • ____ rows.

  • ____ columns.

HW 2 - Part 4: filter and summary

Chunk 5: Filtering data by value and summarizing

Notes:

  • This chunk demonstrates:

    • using the filter command to select rows by variables values.

    • using the summary command to summarize variables.

  • The filter command enables us to select observations by one or more values of one or more variables.

  • You can use multiple consecutive filter commands or you can use the and operator, &, or the or operator, |.

  • You can use filter to INCLUDE rows or EXCLUDE rows with !.

  • The provided code includes multiple examples of how to complete the same two filtering tasks

Steps to Follow:

  1. Copy the code you wrote in Chunk 4 and then paste it into Chunk 5.

  2. Add the specified slice command to subset the my_diamonds3 dataset in Chunk 5.

  3. Use one of the examples in the provided R code for Chunk 5 to complete these TWO specified filter tasks:

  1. Filter my_diamonds3 to diamonds weighing 1.25 or more carats.

  2. filter my_diamonds3 to these cut categories: `

    • Very Good, Premium, Ideal
  1. Use the example summary command code to summarize the factor variable clarity in the final my_diamonds3 dataset.

  2. Answer the following Blackboard Questions:


BB Question 5

In Chunk 5 of HW 2, you use the my_diamonds3 dataset from Chunk 4, and then you filter the data by carat and by cut category.

  • How many observations are in this final my_diamonds3 dataset?
BB Question 6

Fill in the blanks to indicate how many observations are in each of the three most valuable categories in the my_diamonds3 dataset.

  • There are ____ observations in VVS2 level of clarity variable.

  • There are ____ observations in VVS1 level of clarity variable.

  • There are ____ observations in IF level of clarity variable.

HW 2 - Part 5: Creating plots with ggplot

Chunk 6: Creating Basic Plots

Notes:

  • The provided R code demonstrates using

    • ggplot to make some basic plots

    • grid.arrange to present multiple plots in a grid or column.

  • In Part 5, you will create a chunk and copy the provided R code into the chunk you create.

Steps to Follow:

  1. Create a new chunk (Chunk 6) under the HW 2 - Part 5 heading in your HW 2 Markdown file (created from the provided template).

  2. Copy and paste the provided R code below into the chunk you created.

  3. After the label fence add this text: creating plots with ggplot

    • Leave one space after the colon: #|label: creating plots with ggplot

    • This step is included and required so that students no how to label chunks in Quarto files.

  4. Use the example scatter plot code below to create a saved plot named scatter_cut.

    • This will be similar to the example code for the scatter_clarity scatter plot.

    • Remove # at the beginning of this line # scatter_cut <- to start code.

    • Replace color=clarity with color=cut.

  5. Use the example code provided below to create a saved plot named scatter_color.

    • This will be similar to the code for the scatter_clarity scatter plot.

    • Remove # at the beginning of this line # scatter_color <- to start code.

    • Replace color=clarity with color=color.

  6. Use the provided grid.arrange command to create a 2x2 grid of all four scatter plots.

    • Remove # at the beginning of the line with grid.arrange(..., ncol=2)

    • This code will only work if you use the provided names to save your scatterplots.

  7. Use the provided example boxplot code below to create a saved plot named box_clarity

    • This will be similar to the provided code for the side-by-side grouped boxplot, box_color.

    • Note that there is a ggsave command after the box_color code that will export this plot to the img file you created.

  8. Use the provided grid.arrange command to create a stacked column of the two boxplot figures.

    • Remove # at the beginning of the line with grid.arrange(..., ncol=1)

    • This code will only work if you use the provided name to save your new boxplot figure.

  9. Answer the following Blackboard Questions:

BB Question 7

Compare the three scatter plots to determine which one of the three variables, clarity, cut, or color, shows the least evidence of a relationship with price or carats, i.e., shows no trending color pattern.

BB Question 8

Fill in the blank:

Compare the two boxplots of the my_diamonds3 dataset to determine which variable, color, or clarity, has one category that is substantially lower in prices from the other categories.

  • The ____ level in the ____ variable includes diamonds that are substantially lower in price than the other levels.
Provided R code for Chunk 6
  • Create chunk then copy and paste code below into it.
#|label: 

#### scatterplots ####

# scatter_none is the most basic scatter plot of carat vs. price 
# no other variables are included
# to view this plot by itself (not required), enclose code in parentheses
scatter_none <- my_diamonds |>
   ggplot() +
   geom_point(aes(x=carat, y=price))


# scatter_clarity adds the option color=clarity to the aes (aesthetic)
# observations are color coded by diamond clarity level
# theme_classic() added to remove background

scatter_clarity <- my_diamonds |>
  ggplot() +
  geom_point(aes(x=carat, y=price, color=clarity)) +
  theme_classic()


# create plot named scatter_cut using the above code
# change color=clarity to color=cut

# scatter_cut <- 


# create plot named scatter_color using the above code
# change color=clarity to color=color

# scatter_color <- 


# plot all 4 plots above in a 2x2 grid and answer Blackboard Questions

# grid.arrange(scatter_none, scatter_clarity, 
#              scatter_color, scatter_cut, ncol=2)


#### boxplots ####

# below is a plot of grouped side-by_side boxplots 
# Within each cut category there is a separate boxplot for each color
# this is one good way to examine categorical data

# code is enclosed in parentheses 
# plot is saved as box_color AND is shown on the screen

(box_color <- my_diamonds3 |>
  ggplot() +
  geom_boxplot(aes(x=cut, y=price, fill=color))+ 
  theme_classic())

ggsave(filename="img/HW2_Diamond_Color_Boxplots.png", 
  width = 6, height = 4)

# create a plot of grouped side-by_side boxplots that show
# boxplots for each clarity category with each cut category

# same plot as above but change fill=color to fill=clarity
# name this plot box_clarity

# (box_clarity <- )


# plot these two sets of plots in a stacked column (ncol=1)

# grid.arrange(box_color, box_clarity, ncol=1)

HW 2 - Final Steps

  1. Save your completed HW 2 R Quarto File (.qmd) within your project folder.

  2. Render your .qmd file to create a .html file.

  3. Answer all 8 Blackboard questions associated with this assignment.

    • Reminder: You are welcome and encouraged to work together and practice sharing Quarto (.qmd) files but each student should submit their own zipped R project and Blackboard assignment.
  4. Create a README file for HW 2 using the README template provided with the HW 2 files.

  5. Zip your entire Project Directory into a compressed File and submit it.

    • The zipped project directory should contain:

    • The HW 2 project folder with the following inside:

      • the HW2 Quarto (.qmd) and HTML files labeled with your full name.

      • the img folder with the exported plot file

      • the empty data folder

      • The complete and accurate README.txt file that is saved with the same name as the project, e.g. `HW 2 Penelope Pooler README.txt

      • The .Rproj file

Grading Criteria

  • (8 pts.) Each Blackboard question for HW 2 is worth 1 point.

  • (2 pts.) Completing HW 1 - First Steps as specified.

  • (2 pts.) Part 1: Full credit for:

    • Correctly completing the chunk labeled examining the diamonds data

      • Each unique command is 1 point.


  • (2 pts.) Part 2: Full credit for:

    • Creating the two identical datasets using select to include variables and exclude variables.

      • Each select command is 1 point


  • (2 pts.) Part 3: Full credit for:

    • Creating the specified dataset using the select command and the slice command.


  • (2 pts.) Part 4: Full credit for:

    • Creating the specified dataset using the select, slice, and filter commands


  • (3 pts.) Part 5: Full credit for:

    • 1 point for creating both specified scatterplots

    • 1 point for creating the specified 2x2 grid of scatterplots

    • 1 point for creating the specified grouped boxplots and the column of 2 grouped boxplots

  • (4 pts.) Completing the HW 2 - Final Steps and correctly submitting your zipped project directory.

    • 1 point for creating a correct README file

    • 1 point for having both the .qmd and .html files

    • 1 point for having the empty data folder and img folder with one plot in it

    • 1 point for zipping and submitting your project correctly