1 Instructions

Welcome to your first problem set! This assignment will help you practice the fundamental R skills we’ve covered in the first few weeks of class. You’ll work with both simple objects and real survey data to build your data manipulation skills.

Before you begin:

  • Change the author name at the top of this document to your own name
  • Adjust the working directory in the setup chunk to match your folder structure
  • Save this file as DS4P_A1_[lastname].rmd (replace [lastname] with your actual last name, no brackets)

How to complete this assignment:

  • Write your code in the provided code chunks
  • Add brief comments to explain what your code does
  • Make sure your code and results are visible in the knitted HTML file
  • You can discuss with classmates, but write your own answers
  • Submit the knitted .html file to Canvas by the deadline

Grading:

  • Full credit (1 point): Correct code, results, and explanations
  • Partial credit (0.5 points): Mostly correct with minor errors
  • No credit (0 points): Missing, incomplete, or mostly incorrect

2 Part 1: Building R Fundamentals

In this section, we’ll practice creating and manipulating objects in R. These skills form the foundation for all data analysis work.

2.1 Working with Vectors

Vectors are the basic building blocks of data in R. Let’s start by creating and exploring a numeric vector.

2.1.1 Creating and indexing sequences

  1. Create a numeric sequence (1 point): Create an object called my_vector that contains numbers from -5 to 35, incrementing by 0.5. This gives us a nice range of values to work with. Use the seq() function.
# your code here
  1. Check the vector length (1 point): How many elements are in your vector? Use the length() function.
# your code here
  1. Access specific elements (1 point): Sometimes we need to look at just part of our data. Print the last three elements of my_vector.
# your code here
  1. Pattern-based selection (1 point): R makes it easy to select elements following a pattern. Print every fifth element of my_vector (i.e., the 5th, 10th, 15th, etc.). This is a little tricky, so don’t worry if you aren’t sure where to start.

Hint: Think about using the seq() function to create a sequence of positions (5, 10, 15, …) and then use those positions as indices for my_vector. Alternatively, you can use a special sequence notation inside the square brackets.

# your code here
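
The position-based approach from the hint can be sketched on a different vector. This is only an illustration: toy, positions, and every_third are made-up names, and the example picks every third letter rather than every fifth number.

```r
# Illustrative sketch on a toy vector (not the assignment's my_vector)
toy <- letters  # built-in vector of the 26 lowercase letters

# Build the positions 3, 6, 9, ... up to the end of the vector
positions <- seq(from = 3, to = length(toy), by = 3)

# Use those positions as indices inside square brackets
every_third <- toy[positions]
print(every_third)
```

The same seq() call could also be written directly inside the square brackets instead of being stored first.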

2.1.2 Real-world application: Temperature conversion

Now let’s apply what we’ve learned to a practical problem. Imagine you’re a climate scientist who has collected temperature data in Celsius, but your American colleagues need it in Fahrenheit.

  1. Organize your data (1 point): First, create a new object called celsius_readings that contains the same values as my_vector.
# your code here
  1. Convert temperatures (1 point): Convert all Celsius readings to Fahrenheit using the formula \(F = (C \times 9/5) + 32\). Store the results in fahrenheit_readings. Notice how R automatically applies this calculation to every element!
# your code here
  1. Compare scales (1 point): Let’s see how these two temperature scales compare. Print the first 5 values from both celsius_readings and fahrenheit_readings side by side.
# your code here
  1. Calculate the average (1 point): What’s the average temperature in Fahrenheit across all our readings? Calculate the mean and round to one decimal place. Use the round() function.
# your code here
  1. Find the median (1 point): Calculate the median temperature in Fahrenheit (rounded to one decimal). How does it compare to the mean? What might any difference tell us about the distribution? (Hint: Remember the function summary())
# your code here
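
As a rough illustration of the ideas in this section (element-wise arithmetic, side-by-side printing, and rounded summaries), here is a sketch on unrelated data. The distances and the names km and miles are hypothetical, and cbind() is just one possible way to show two vectors next to each other.

```r
# Illustrative sketch with hypothetical distances in kilometers
km <- c(1, 5, 10, 21.1, 42.2)

# Arithmetic is applied to every element at once (vectorization)
miles <- km * 0.621371

# Show the two vectors side by side as columns
print(cbind(km, miles))

# Summary statistics, rounded to one decimal place
mean_miles <- round(mean(miles), 1)
median_miles <- round(median(miles), 1)
print(c(mean = mean_miles, median = median_miles))
```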

2.2 Understanding Data Types

R handles different types of data differently. Let’s explore what happens when we mix data types.

2.2.1 Working with mixed data types

  1. Create character data (1 point): Create a character vector called id_codes containing these values: “101”, “14A”, “003”, “27”. These might represent product codes or participant IDs.
# your code here
  1. Check the data type (1 point): What class is id_codes? Use the appropriate function to find out.
# your code here
  1. Attempt type conversion (1 point): Try to convert these ID codes to numeric values using the as.numeric() function. Store the result as id_codes_num. What happens?
# your code here
  1. Count the problems (1 point): How many NA (missing) values appeared after the conversion? This tells us how many codes couldn’t be converted.
# your code here
  1. Identify the culprits (1 point): Which original entries caused the NAs? Why do you think R couldn’t convert them?
# your code here
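
The general pattern for spotting failed conversions might look like the sketch below. The strings and object names are hypothetical and are not the assignment's id_codes. suppressWarnings() is used here only to keep the output quiet; without it, R prints a warning that NAs were introduced by coercion.

```r
# Illustrative sketch with hypothetical strings
raw <- c("12", "7.5", "abc", "40")

# as.numeric() returns NA for any string it cannot parse as a number
nums <- suppressWarnings(as.numeric(raw))
print(nums)

# is.na() flags the failures; sum() counts the TRUE values
n_failed <- sum(is.na(nums))
print(n_failed)

# Logical indexing recovers the original entries that failed
failed_entries <- raw[is.na(nums)]
print(failed_entries)
```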

2.2.2 Working with ordered categories

Many survey questions use ordered response scales (like “low”, “medium”, “high”). R has a special data type for these: ordered factors.

  1. Create satisfaction data (1 point): Create a character vector called satisfy that contains six satisfaction ratings in this order: “low”, “high”, “medium”, “high”, “low”, “medium”. This represents survey responses about satisfaction levels from six respondents.
# your code here
  1. Convert to ordered factor (1 point): Convert this to an ordered factor called satisfy_fct where the order is: low < medium < high. This tells R that these categories have a meaningful order.
# your code here
  1. Convert to numeric scores (1 point): Sometimes we need numeric values for analysis. Convert satisfy_fct to its numeric version and print it. Which number do you think corresponds to “high”? Why do you think that?
# your code here
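
To see how ordered factors behave, consider this sketch on unrelated data. The shirt sizes and the names sizes, sizes_fct, and sizes_num are hypothetical; factor() with levels and ordered = TRUE is one common way to build an ordered factor.

```r
# Illustrative sketch with hypothetical shirt sizes
sizes <- c("M", "S", "L", "M")

# levels sets the order; ordered = TRUE tells R that S < M < L is meaningful
sizes_fct <- factor(sizes, levels = c("S", "M", "L"), ordered = TRUE)
print(sizes_fct)

# as.integer() returns each value's position in the levels vector
sizes_num <- as.integer(sizes_fct)
print(sizes_num)
```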

3 Part 2: Analyzing Real Survey Data

Now we’ll apply our skills to real data from the 2024 Cooperative Congressional Election Study (CCES), one of the most important political surveys in the United States. This survey interviews tens of thousands of Americans about their political attitudes and behaviors.

3.1 Understanding the Data

The CCES dataset includes these variables:

  • pid7: Party identification (1=Strong Democrat to 7=Strong Republican)
  • birthyr: Birth year (four-digit year)
  • votereg: Voter registration (1=registered, 0=not registered)
  • gender4: Gender identity (1=Man, 2=Woman, 3=Non-binary, 4=Other)
  • educ: Education (1=Less than HS to 6=Postgraduate degree)
  • race: Race/ethnicity (1=White, 2=Black, 3=Hispanic, 4=Asian, 5=Native American, 6=Middle Eastern, 7=Two or more races, 8=Other)

3.2 Loading and exploring the data

  1. Import the dataset (3 points): Load the CCES data and get familiar with its structure:
    • Load cces_short.csv using read.csv() with the row.names = 1 option
    • Display the first 6 rows to preview the data
    • Show all variable names
# your code here
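
If the row.names = 1 option is unfamiliar, the sketch below shows what it does on a tiny made-up file. The file contents and the names tmp and toy_df are hypothetical; the file is written to a temporary location so that nothing in your working directory is touched.

```r
# Illustrative sketch: write a tiny hypothetical CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c(",score,group", "1,10,a", "2,20,b", "3,30,a"), tmp)

# row.names = 1 uses the first column as row names instead of as a variable
toy_df <- read.csv(tmp, row.names = 1)

print(head(toy_df))   # first rows (up to 6 by default)
print(names(toy_df))  # variable names
```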

3.3 Preparing variables for analysis

  1. Convert to appropriate data types (4 points): Most of these variables represent categories, not quantities. Convert pid7, votereg, gender4, educ, and race to factors. Keep birthyr as numeric since it represents an actual year.
# your code here
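
Converting columns to factors can be done one at a time or for several columns together. The sketch below shows both on a hypothetical data frame; the column names are invented, and the lapply() approach is just one option among several.

```r
# Illustrative sketch with a hypothetical data frame
toy_df <- data.frame(
  party  = c(1, 7, 4),
  region = c(2, 2, 3),
  year   = c(1990, 1985, 2001)
)

# Option 1: convert a single column
toy_df$party <- as.factor(toy_df$party)

# Option 2: convert several columns in one step
cat_cols <- c("party", "region")
toy_df[cat_cols] <- lapply(toy_df[cat_cols], as.factor)

# str() confirms the type of each column; year stays numeric
str(toy_df)
```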

3.4 Creating new variables

  1. Calculate respondent ages (2 points): Create a new variable called age by subtracting birth year from 2024. Then calculate the average age of respondents (remember to handle any missing values).
# your code here
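
The effect of missing values on mean() can be seen in a small sketch. The data frame and the column names start_year and years_since are hypothetical.

```r
# Illustrative sketch with a hypothetical data frame containing one NA
toy_df <- data.frame(start_year = c(2010, 2015, NA, 2020))

# A new column is created by assigning to a name that does not exist yet
toy_df$years_since <- 2024 - toy_df$start_year

# With a missing value present, mean() returns NA by default
print(mean(toy_df$years_since))

# na.rm = TRUE drops missing values before averaging
avg <- mean(toy_df$years_since, na.rm = TRUE)
print(avg)
```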

3.5 Comparing political groups

  1. Create partisan subsets (2 points): Let’s compare Democrats and Republicans. Create two new data frames:
    • democrats: respondents with pid7 values 1, 2, or 3 (Strong Dem, Not very strong Dem, Lean Dem)
    • republicans: respondents with pid7 values 5, 6, or 7 (Lean Rep, Not very strong Rep, Strong Rep)
# your code here
  1. Compare average ages (3 points): Calculate the average age of respondents in each partisan group. Which group is older? What might explain this difference?
# your code here
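
Subsetting rows by a set of values can be sketched as follows. The data frame, its columns, and the names low and high are hypothetical; %in% and subset() are two of several ways to do this in base R.

```r
# Illustrative sketch with a hypothetical data frame
toy_df <- data.frame(
  group = factor(c(1, 2, 5, 7, 3)),
  score = c(10, 20, 30, 40, 50)
)

# %in% tests whether each value appears in a set; it also works on factors
low <- toy_df[toy_df$group %in% c(1, 2, 3), ]

# subset() is an alternative that avoids repeating the data frame name
high <- subset(toy_df, group %in% c(5, 7))

# Compare a summary across the two subsets
print(mean(low$score))
print(mean(high$score))
```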

3.6 Working with Tidyverse

The tidyverse provides powerful tools for data manipulation. Let’s explore how it can make our analysis more efficient.

  1. Load tidyverse and practice selection (4 points):
    • Load the tidyverse package
    • Use tidyverse functions to select only the pid7, birthyr, and votereg columns
    • Filter to show only respondents born after 1990 (not including the year 1990)
    • Assign this dataset to cces_small
# your code here
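
A select-then-filter pipeline might look like the sketch below. It assumes the dplyr package is installed (it is loaded as part of the tidyverse) and uses a hypothetical data frame; the names toy_df and toy_small are invented for illustration.

```r
library(dplyr)  # also loaded by library(tidyverse)

# Illustrative sketch with a hypothetical data frame
toy_df <- data.frame(
  id    = 1:4,
  year  = c(1988, 1990, 1995, 2002),
  extra = c("a", "b", "c", "d")
)

toy_small <- toy_df %>%
  select(id, year) %>%  # keep only these columns
  filter(year > 1990)   # strictly greater than, so 1990 itself is excluded

print(toy_small)
```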

3.7 Advanced logical operations

  1. Identify strong partisans (2 points): Use your dataset of people born after 1990 (cces_small). Create a logical variable partisan that equals TRUE for Strong Democrats (pid7==1) OR Strong Republicans (pid7==7), and FALSE for everyone else. What proportion of respondents are strongly partisan? (Don’t forget to remove the NAs.)
# your code here
  1. Complex filtering (3 points): Create a subset called target_voters that includes only people meeting ALL these criteria:
    • Are registered to vote
    • Were born after 2000
    • Are NOT Strong Republicans (pid7 != 7)
    How many rows are left?
# your code here
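
The logical operators used in this section can be sketched on made-up data. The data frame and the names code, active, extreme, and both are hypothetical. Wrapping the condition in which() is one way to keep rows with NA conditions out of a base R subset.

```r
# Illustrative sketch with a hypothetical data frame containing one NA
toy_df <- data.frame(
  code   = c(1, 4, 7, NA, 2),
  active = c(1, 1, 0, 1, 1)
)

# | means OR; a missing code gives NA rather than TRUE or FALSE
toy_df$extreme <- toy_df$code == 1 | toy_df$code == 7

# The mean of a logical vector is the proportion of TRUE values
prop_extreme <- mean(toy_df$extreme, na.rm = TRUE)
print(prop_extreme)

# & means AND; which() keeps only rows where the condition is TRUE
both <- toy_df[which(toy_df$active == 1 & toy_df$code != 7), ]
print(nrow(both))
```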

3.8 Saving your work

  1. Export cleaned data (2 points): Save the target_voters data as cces_short_clean.csv for future analysis.
# your code here
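
Writing a data frame to a CSV file can be sketched as below. The data frame and file name are hypothetical, and the file goes to a temporary directory. Setting row.names = FALSE is a choice made for this sketch; leaving it out writes the row names as a first column, which matches files meant to be read back with row.names = 1.

```r
# Illustrative sketch with a hypothetical data frame
toy_df <- data.frame(id = 1:3, score = c(10, 20, 30))

# Build a path inside R's temporary directory
out_path <- file.path(tempdir(), "toy_clean.csv")

# Write the data frame without a row-name column
write.csv(toy_df, out_path, row.names = FALSE)

# Confirm that the file was created
print(file.exists(out_path))
```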

4 Reflection

After completing this problem set, you should be comfortable with:

  • Creating and manipulating vectors
  • Converting between data types
  • Loading and exploring real datasets
  • Creating new variables
  • Using logical operations to subset data
  • Basic tidyverse operations
  • Saving processed data for future use

These skills form the foundation for all the data analysis we’ll do throughout the course!

5 AI Usage Declaration

Please answer the following questions about your use of AI tools on this assignment:

  1. AI Usage Level (select one):
  • No use - I completed this assignment without any AI assistance
  • Checking - I used AI only to check my work after completing problems
  • Some use - I used AI for hints or help on specific problems
  • Heavy use - I used AI extensively throughout the assignment
  1. AI System Used (if applicable):

[Your text here]

  1. Reflection on AI use (2-3 sentences): If you used AI, briefly describe whether you found it helpful for learning the material. If you didn’t use AI, you can skip this or explain why you chose not to use it.