Instructions
Welcome to your first problem set! This assignment will help you
practice the fundamental R skills we’ve covered in the first few weeks
of class. You’ll work with both simple objects and real survey data to
build your data manipulation skills.
Before you begin:
- Change the author name at the top of this document to your own
name
- Adjust the working directory in the setup chunk to match your folder
structure
- Save this file as DS4P_A1_[lastname].rmd (replace [lastname] with
your actual last name, no brackets)
How to complete this assignment:
- Write your code in the provided code chunks
- Add brief comments to explain what your code does
- Make sure your code and results are visible in the knitted HTML
file
- You can discuss with classmates, but write your own answers
- Submit the knitted .html file to Canvas by the deadline
Grading:
- Full credit (1 point): Correct code, results, and explanations
- Partial credit (0.5 points): Mostly correct with minor errors
- No credit (0 points): Missing, incomplete, or mostly incorrect
Part 1: Building R Fundamentals
In this section, we’ll practice creating and manipulating objects in
R. These skills form the foundation for all data analysis work.
Working with Vectors
Vectors are the basic building blocks of data in R. Let’s start by
creating and exploring a numeric vector.
Creating and indexing sequences
- Create a numeric sequence (1 point): Create an object called
my_vector that contains the numbers from -5 to 35, incrementing by 0.5.
This gives us a nice range of values to work with. Use the seq()
function.
# your code here
- Check the vector length (1 point): How many elements are in your
vector? Use the length() function.
# your code here
- Access specific elements (1 point): Sometimes we
need to look at just part of our data. Print the last three elements of
my_vector.
# your code here
- Pattern-based selection (1 point): R makes it easy to select
elements following a pattern. Print every fifth element of my_vector
(i.e., the 5th, 10th, 15th, etc.). This is a little tricky, so don’t
worry if you aren’t sure where to start.
Hint: Think about using the seq() function to create
a sequence of positions (5, 10, 15, …) and then use those positions as
indices for my_vector. Alternatively, you can use a special
sequence notation inside the square brackets.
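As a small illustration of the position-based indexing described in the hint, the sketch below picks every third element of a made-up vector. The names `toy_letters`, `positions`, `every_third`, and `every_third_alt` are hypothetical and not part of the assignment. The logical-pattern version is just one possible alternative, shown for comparison.

```r
# Hypothetical toy vector (not the assignment's my_vector)
toy_letters <- letters[1:10]

# Build the positions 3, 6, 9 and use them as indices
positions <- seq(from = 3, to = length(toy_letters), by = 3)
every_third <- toy_letters[positions]

# One possible alternative: a logical pattern that R recycles along the vector
every_third_alt <- toy_letters[c(FALSE, FALSE, TRUE)]

print(every_third)
print(every_third_alt)
```

The same idea should carry over to any step size once you change the `from` and `by` arguments.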
# your code here
Real-world application: Temperature conversion
Now let’s apply what we’ve learned to a practical problem. Imagine
you’re a climate scientist who has collected temperature data in
Celsius, but your American colleagues need it in Fahrenheit.
- Organize your data (1 point): First, create a new object called
celsius_readings that contains the same values as my_vector.
# your code here
- Convert temperatures (1 point): Convert all Celsius readings to
Fahrenheit using the formula \(F = (C \times 9/5) + 32\). Store the
results in fahrenheit_readings. Notice how R automatically applies
this calculation to every element!
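To see what "applies this calculation to every element" means in practice, here is a minimal sketch of vectorized arithmetic using a different unit conversion (inches to centimeters). The vector `inches` and its values are made up for illustration.

```r
# Hypothetical lengths in inches (not assignment data)
inches <- c(1, 2, 10)

# A single expression converts every element (1 inch = 2.54 cm);
# no loop is needed because arithmetic in R is vectorized
centimeters <- inches * 2.54

print(centimeters)
```

The result has the same length as the input, with one converted value per original element.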
# your code here
- Compare scales (1 point): Let’s see how these two
temperature scales compare. Print the first 5 values from both
celsius_readings and fahrenheit_readings side
by side.
# your code here
- Calculate the average (1 point): What’s the average temperature in
Fahrenheit across all our readings? Calculate the mean and round it to
one decimal place. Use the round() function.
# your code here
- Find the median (1 point): Calculate the median temperature in
Fahrenheit (rounded to one decimal place). How does it compare to the
mean? What might any difference tell us about the distribution?
(Hint: Remember the function summary().)
# your code here
Understanding Data Types
R handles different types of data differently. Let’s explore what
happens when we mix data types.
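The sketch below shows two coercion behaviors that may help with the tasks in this section. It uses made-up values, and the names `mixed` and `converted` are hypothetical. `suppressWarnings()` is used only to keep the output tidy; R would otherwise print a coercion warning.

```r
# Mixing numbers and text in c() coerces everything to character
mixed <- c(1, "two", 3)
print(class(mixed))

# Converting text to numbers: entries that are not valid numbers become NA
converted <- suppressWarnings(as.numeric(c("12", "7b", "3.5")))
print(converted)

# is.na() flags the failed conversions; sum() counts the TRUE values
print(sum(is.na(converted)))
```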
Working with mixed data types
- Create character data (1 point): Create a character vector called
id_codes containing these values: “101”, “14A”, “003”, “27”. These
might represent product codes or participant IDs.
# your code here
- Check the data type (1 point): What class is
id_codes? Use the appropriate function to find out.
# your code here
- Attempt type conversion (1 point): Try to convert these ID codes to
numeric values. Use the as.numeric() function. Store the result as
id_codes_num. What happens?
# your code here
- Count the problems (1 point): How many NA (missing)
values appeared after the conversion? This tells us how many codes
couldn’t be converted.
# your code here
- Identify the culprits (1 point): Which original
entries caused the NAs? Why do you think R couldn’t convert them?
# your code here
Working with ordered categories
Many survey questions use ordered response scales (like “low”,
“medium”, “high”). R has a special data type for these: ordered
factors.
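As a rough illustration of what an ordered factor adds, the sketch below uses a made-up set of clothing sizes. The names `sizes` and `sizes_fct` are hypothetical. Once the order is declared, comparisons and `max()` respect it.

```r
# Hypothetical responses (not the assignment's satisfaction data)
sizes <- c("large", "small", "medium", "small")

# Declare the categories and their order explicitly
sizes_fct <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)
print(sizes_fct)

# Comparisons follow the declared order, not alphabetical order
print(sizes_fct < "large")

# max() returns the highest category present
print(max(sizes_fct))
```

Without `ordered = TRUE`, the comparison would not be meaningful and R would return NA with a warning.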
- Create satisfaction data (1 point): Create a character vector called
satisfy that contains six satisfaction ratings in this order: “low”,
“high”, “medium”, “high”, “low”, “medium”. This represents survey
responses about satisfaction levels from six respondents.
# your code here
- Convert to ordered factor (1 point): Convert this to an ordered
factor called satisfy_fct where the order is: low < medium < high.
This tells R that these categories have a meaningful order.
# your code here
- Convert to numeric scores (1 point): Sometimes we need numeric
values for analysis. Convert satisfy_fct to numeric and print the
result. Which number do you think corresponds to “high”? Why do you
think that?
# your code here
Part 2: Analyzing Real Survey Data
Now we’ll apply our skills to real data from the 2024 Cooperative
Congressional Election Study (CCES), one of the most important political
surveys in the United States. This survey interviews tens of thousands
of Americans about their political attitudes and behaviors.
Understanding the Data
The CCES dataset includes these variables:
- pid7: Party identification (1=Strong Democrat to
7=Strong Republican)
- birthyr: Birth year (four-digit year)
- votereg: Voter registration (1=registered, 0=not
registered)
- gender4: Gender identity (1=Man, 2=Woman,
3=Non-binary, 4=Other)
- educ: Education (1=Less than HS to 6=Postgraduate
degree)
- race: Race/ethnicity (1=White, 2=Black, 3=Hispanic,
4=Asian, 5=Native American, 6=Middle Eastern, 7=Two or more races,
8=Other)
Loading and exploring the data
- Import the dataset (3 points): Load the CCES data
and get familiar with its structure:
- Load cces_short.csv using read.csv() with the row.names = 1 option
- Display the first 6 rows to preview the data
- Show all variable names
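The sketch below illustrates what the row.names = 1 option does, using a small made-up data frame written to a temporary file. The names `toy`, `tmp_path`, and `toy_back` are hypothetical, and the file is not the course dataset. By default, `write.csv()` stores row names as the first column, and `row.names = 1` tells `read.csv()` to treat that column as row names rather than as a variable.

```r
# Hypothetical data frame (not the CCES data)
toy <- data.frame(score = c(10, 20, 30), group = c("a", "b", "a"))

# Write to a temporary file; row names become the first column by default
tmp_path <- tempfile(fileext = ".csv")
write.csv(toy, tmp_path)

# Read it back, using the first column as row names
toy_back <- read.csv(tmp_path, row.names = 1)

print(head(toy_back))   # first rows (up to 6)
print(names(toy_back))  # variable names
```

If `row.names = 1` were omitted here, the stored row names would come back as an extra column.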
# your code here
Preparing variables for analysis
- Convert to appropriate data types (4 points): Most
of these variables represent categories, not quantities. Convert
pid7, votereg, gender4,
educ, and race to factors. Keep
birthyr as numeric since it represents an actual year.
# your code here
Creating new variables
- Calculate respondent ages (2 points): Create a new variable called
age by subtracting birth year from 2024. Then calculate the average
age of respondents (remember to handle any missing values).
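As a reminder of how missing values affect summary functions, here is a minimal sketch with made-up numbers. The names `scores`, `mean_with_na`, and `mean_without_na` are hypothetical.

```r
# Hypothetical scores with one missing value
scores <- c(10, NA, 30)

# By default, any NA in the input makes the mean NA
mean_with_na <- mean(scores)

# na.rm = TRUE drops missing values before averaging
mean_without_na <- mean(scores, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```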
# your code here
Comparing political groups
- Create partisan subsets (2 points): Let’s compare Democrats and
Republicans. Create two new data frames:
- democrats: respondents with pid7 values 1, 2, or 3 (Strong Dem, Not
very strong Dem, Lean Dem)
- republicans: respondents with pid7 values 5, 6, or 7 (Lean Rep, Not
very strong Rep, Strong Rep)
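One way to keep rows whose value falls in a set is base R's `%in%` operator, sketched below on a made-up data frame. The names `toy`, `team`, `low_teams`, and `high_teams` are hypothetical. This is only one approach among several (for example, `subset()` or combining conditions with `|`).

```r
# Hypothetical data frame with a numeric group code and one missing value
toy <- data.frame(id = 1:6, team = c(1, 2, 3, 4, 5, NA))

# %in% returns FALSE (never NA) for missing values,
# so rows with a missing team are left out of both subsets
low_teams  <- toy[toy$team %in% c(1, 2), ]
high_teams <- toy[toy$team %in% c(4, 5), ]

print(low_teams)
print(high_teams)
```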
# your code here
- Analyze voter registration (3 points): Calculate
the average age of voters in each partisan group. Which group is older?
What might explain this difference?
# your code here
Working with Tidyverse
The tidyverse provides powerful tools for data manipulation. Let’s
explore how it can make our analysis more efficient.
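As a rough sketch of the pipe-based style the tidyverse encourages, the example below chains `select()` and `filter()` on a made-up data frame. It loads only dplyr (one of the tidyverse packages) and assumes that package is installed. The names `toy` and `toy_recent` are hypothetical.

```r
library(dplyr)  # part of the tidyverse; assumed to be installed

# Hypothetical data frame (not the CCES data)
toy <- data.frame(
  id   = 1:5,
  year = c(1985, 1992, 2001, 1990, 1998),
  city = c("A", "B", "C", "D", "E")
)

# Keep two columns, then keep rows with year strictly greater than 1990
toy_recent <- toy %>%
  select(id, year) %>%
  filter(year > 1990)

print(toy_recent)
```

Note that `>` excludes the boundary year itself, whereas `>=` would include it.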
- Load tidyverse and practice selection (4 points):
- Load the tidyverse package
- Use tidyverse functions to select only the pid7, birthyr, and
votereg columns
- Filter to keep only respondents born after 1990 (not including the
year 1990)
- Assign this dataset to cces_small
# your code here
Advanced logical operations
- Identify strong partisans (2 points): Use your dataset of people
born after 1990 (cces_small). Create a logical variable partisan that
equals TRUE for Strong Democrats (pid7==1) OR Strong Republicans
(pid7==7), and FALSE for everyone else. What proportion of respondents
are strongly partisan? (Don’t forget to remove the NAs.)
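The sketch below shows how an OR condition and a proportion can be computed on a made-up vector. The names `codes`, `is_extreme`, and `prop_extreme` are hypothetical. It relies on the fact that the mean of a logical vector is the share of TRUE values.

```r
# Hypothetical codes with one missing value
codes <- c(1, 4, 7, NA, 2)

# TRUE when the code is 1 OR 7; a missing code stays NA
is_extreme <- codes == 1 | codes == 7
print(is_extreme)

# TRUE counts as 1 and FALSE as 0, so the mean is a proportion;
# na.rm = TRUE leaves the missing entry out of the calculation
prop_extreme <- mean(is_extreme, na.rm = TRUE)
print(prop_extreme)
```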
# your code here
- Complex filtering (3 points): Create a subset called target_voters
that includes only people meeting ALL these criteria:
- Are registered to vote
- Were born after 2000
- Are NOT Strong Republicans (pid7 != 7)
How many rows are left?
# your code here
Saving your work
- Export cleaned data (2 points): Save the target_voters data frame as
cces_short_clean.csv for future analysis.
# your code here
Reflection
After completing this problem set, you should be comfortable
with:
- Creating and manipulating vectors
- Converting between data types
- Loading and exploring real datasets
- Creating new variables
- Using logical operations to subset data
- Basic tidyverse operations
- Saving processed data for future use
These skills form the foundation for all the data analysis we’ll do
throughout the course!
AI Usage Declaration
Please answer the following questions about your use of AI tools on
this assignment:
- AI Usage Level (select one):
- No use - I completed this assignment without any AI assistance
- Checking - I used AI only to check my work after completing
problems
- Some use - I used AI for hints or help on specific problems
- Heavy use - I used AI extensively throughout the assignment
- AI System Used (if applicable):
[Your text here]
- Reflection on AI use (2-3 sentences): If you used
AI, briefly describe whether you found it helpful for learning the
material. If you didn’t use AI, you can skip this or explain why you
chose not to use it.