The American National Election Study is a gold-standard survey that runs every two years. It is one of the most important sources of data for the study of American politics.
In this final exam/project, you will use these data to investigate the attitudes of Americans.
This project is based on the ANES Time Series Cumulative Data File, which aggregates many questions that have been included in the survey since 1948. When you work with survey data like these, you generally work with a data file that includes responses, a codebook file that documents the questions asked, and a variable guide that lists in abbreviated form the questions on the survey and the variable name (column) of the data associated with each question. You’ll need to use all these files to perform your analysis. Make sure you get a copy of the abbreviated variable list to avoid sifting through the entire codebook when you want to find a variable of interest.
For these questions, do not worry about survey weights. Similarly, discard any responses where the interviewee answered “Don’t know,” refused to answer, or gave a similar non-response. Recode these kinds of responses as missing data, and do not report or include them in your answers or analysis. (In real life, you probably would not adopt this approach, but for our purposes, it’s OK for now.)
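As a minimal sketch of the recoding step in pandas: the codes treated as missing differ from variable to variable and must be looked up in the codebook. The values in `MISSING_CODES` below are placeholders chosen for illustration, and `recode_missing` is a hypothetical helper, not part of any library.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: placeholder non-response codes for one variable.
# Look up the real "Don't know" / "Refused" / "NA" codes in the codebook.
MISSING_CODES = {
    "VCF0803": [0, 9],
}


def recode_missing(df, missing_codes):
    """Return a copy of df where the listed codes are replaced by NaN."""
    out = df.copy()
    for column, codes in missing_codes.items():
        # Keep values that are NOT in the missing list; everything else becomes NaN.
        out[column] = out[column].where(~out[column].isin(codes), np.nan)
    return out


# Toy data standing in for a column of the real file.
toy = pd.DataFrame({"VCF0803": [1, 4, 9, 7, 0, 3]})
clean = recode_missing(toy, MISSING_CODES)

# pandas skips NaN by default, so the mean uses only the substantive answers.
print(clean["VCF0803"].mean())
```

Because pandas ignores NaN in `mean()`, `value_counts()`, and `groupby` aggregations by default, recoding once up front keeps the non-responses out of every later summary.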
Part One includes a series of guided questions. This is worth 40 points.
In Part Two, you explore questions of interest to you. Part Two is worth 20 points.
You’ll work in a combination of R and Python, so you’ll have some code chunks in R and some code chunks in Python.
Create a .qmd file that answers the questions in Part 1 and describes and performs your investigation in Part 2. The .qmd file should be attractive and well-formatted when rendered - for example, points will be reduced if you dump huge amounts of console output into the rendered document. You want the document to be readable and meaningful. Display the code chunks as you go to show your work. Render an html file with your report and submit it along with the .qmd file.
When you import the ANES data into Python, you might encounter some issues with data types. To avoid these problems, use the following code to make sure that pandas coerces all numbers into numeric format. Note that I renamed the actual .csv to ANES.csv here for clarity - the actual name when you download the document is longer.
#import pandas as pd
#anes = pd.read_csv("ANES.csv")
#anes2 = anes.apply(pd.to_numeric, errors="coerce")
This code forces pandas to parse the data as numeric, avoiding problems with mixed types of data in columns (for some reason, ANES included some spaces in the .csv, which Python doesn’t deal with well). If pandas cannot coerce the data in a cell into a number, it will replace whatever value is there with a NaN. Note: there are a few columns in this data set where this transformation will cause a problem, like VCF0901b, which is the state postal abbreviation. If you want to do any state-level analysis in Python, you’ll need to account for this. I would just re-import the original .csv and then bind that unaltered state code column back to the transformed data frame as a new column. In R, you shouldn’t encounter any of these issues.
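A minimal sketch of that workaround, assuming a tiny in-memory stand-in for ANES.csv. It is a variant of the approach described above: instead of re-importing the file, it keeps the original text frame around and copies the state column back from it. The `text_columns` list is a name introduced here for illustration.

```python
import io

import pandas as pd

# Toy stand-in for ANES.csv; the blank cell mimics the stray spaces in the real file.
raw_csv = io.StringIO(
    "VCF0004,VCF0901b,VCF0101\n"
    "1996,VA,34\n"
    "1996,TX, \n"
    "2000,CA,51\n"
)

# Read everything as text first so pandas does not guess mixed column types.
anes = pd.read_csv(raw_csv, dtype=str)

# Columns that must stay as text (state postal abbreviation).
text_columns = ["VCF0901b"]

# Coerce every column to numeric; anything unparseable becomes NaN...
anes2 = anes.apply(pd.to_numeric, errors="coerce")

# ...then restore the text columns from the untouched original.
anes2[text_columns] = anes[text_columns]

print(anes2)
```

Both frames share the same row order and index, so the copied column lines up with the transformed data without a merge.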
Part One: Guided Questions (40 points)
Question 1 (2 points)
R
Create a tibble that shows how many respondents are in each wave of the survey. Use VCF0004 to get the year/wave of the survey.
How are survey respondents distributed across the major geographic regions of the US in the 1996 wave of the survey? (i.e., how many respondents per region). Use VCF0112 to get the Census region.
import pandas as pd
import sys
print(sys.version)
3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
Region
South 642
North Central 458
West 354
Northeast 260
Name: count, dtype: int64
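A count like the one above could be produced along the following lines. This is a sketch on toy data rather than the code that generated the output: the region labels in `REGION_LABELS` are an assumption about how VCF0112 is coded and should be confirmed in the codebook, and `region_counts` is a hypothetical helper.

```python
import pandas as pd

# ASSUMPTION: VCF0112 code-to-label mapping; confirm against the codebook.
REGION_LABELS = {1: "Northeast", 2: "North Central", 3: "South", 4: "West"}


def region_counts(df, year):
    """Count respondents per Census region for a single survey wave."""
    wave = df[df["VCF0004"] == year]
    # Codes outside the mapping become NaN and are dropped by value_counts().
    regions = wave["VCF0112"].map(REGION_LABELS).rename("Region")
    return regions.value_counts()


# Toy data standing in for the real file.
toy = pd.DataFrame({
    "VCF0004": [1996, 1996, 1996, 1996, 2000],
    "VCF0112": [3, 3, 1, 4, 2],
})
counts = region_counts(toy, 1996)
print(counts)
```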
Question 3 (4 points)
R
Considering the 2008 wave and subsequent waves, what percent of these interviews in each wave were partially or entirely translated to Spanish? For this question for simplicity, consider only pre-election interviews, VCF0018a.
One of the questions on this survey has the interviewer read a list of words and phrases that people use to describe political figures. Then, the interviewer asks the interviewee to think about a given political figure, and the interviewer asks whether a given phrase describes that political figure extremely well, quite well, not too well or not well at all.
So, for example, the interviewer might say, “Think about Ronald Reagan. In your opinion, does the phrase or word ‘intelligent’ describe Ronald Reagan extremely well, quite well, not too well, or not well at all?”
Based on the survey data between 1980 and 2008, which president did women under the age of 40 think was the most knowledgeable? Which president was the least knowledgeable in the eyes of this group? You can average across all surveys during a president’s term. Some presidents will be included in more waves than others - that’s fine, use the average regardless of the number of terms.
Use VCF0342 for the measure of how knowledgeable the president is, VCF0104 for the respondent gender, and VCF0101 for the respondent age.
import pandas as pd

anes = pd.read_csv("ANES.csv", low_memory=False, dtype=str)
anes2 = anes.apply(pd.to_numeric, errors="coerce")

# Keep the 1980-2008 waves, then respondents with VCF0104 == 2 (the code used here for women) under age 40
anes_filtered = anes2[(anes2['VCF0004'] >= 1980) & (anes2['VCF0004'] <= 2008)]
filtered_data = anes_filtered[(anes_filtered['VCF0104'] == 2) & (anes_filtered['VCF0101'] < 40)]
# Keep only the four substantive responses; .copy() avoids a SettingWithCopyWarning below
filtered_data = filtered_data[filtered_data['VCF0342'].between(1, 4, inclusive='both')].copy()

# Map each survey year to the sitting president
year_to_president = {
    1980: 'Jimmy Carter',
    1981: 'Ronald Reagan', 1982: 'Ronald Reagan', 1983: 'Ronald Reagan', 1984: 'Ronald Reagan',
    1985: 'Ronald Reagan', 1986: 'Ronald Reagan', 1987: 'Ronald Reagan', 1988: 'Ronald Reagan',
    1989: 'George H. W. Bush', 1990: 'George H. W. Bush', 1991: 'George H. W. Bush', 1992: 'George H. W. Bush',
    1993: 'Bill Clinton', 1994: 'Bill Clinton', 1995: 'Bill Clinton', 1996: 'Bill Clinton',
    1997: 'Bill Clinton', 1998: 'Bill Clinton', 1999: 'Bill Clinton', 2000: 'Bill Clinton',
    2001: 'George W. Bush', 2002: 'George W. Bush', 2003: 'George W. Bush', 2004: 'George W. Bush',
    2005: 'George W. Bush', 2006: 'George W. Bush', 2007: 'George W. Bush', 2008: 'George W. Bush',
}
filtered_data['President'] = filtered_data['VCF0004'].map(year_to_president)

# Average response per president
average_knowledgeable = filtered_data.groupby('President')['VCF0342'].mean()

# Note: idxmax/idxmin treat a higher mean as "more knowledgeable". If VCF0342 is coded
# 1 = "extremely well" through 4 = "not well at all" (check the codebook), a lower mean
# means more knowledgeable and these two labels should be swapped.
most_knowledgeable_president = average_knowledgeable.idxmax()
least_knowledgeable_president = average_knowledgeable.idxmin()
print(f"The most knowledgeable president according to women under 40 is: {most_knowledgeable_president}")
The most knowledgeable president according to women under 40 is: George W. Bush
print(f"The least knowledgeable president according to women under 40 is: {least_knowledgeable_president}")
The least knowledgeable president according to women under 40 is: Bill Clinton
Question 5 (8 points)
R
These days, the evidence suggests that higher levels of education are associated with more liberal political attitudes, as measured on a traditional seven-point ideology scale.
Track this pattern over time. Use respondents from 1980, 1992, 2000, and 2020. What is the average political ideology of survey respondents with a college degree or greater vs. the political ideology of respondents without a college degree? (Note: some college doesn’t count)
In addition, repeat this, but compare how this breaks down along racial lines. Is the pattern the same for whites and non-whites?
Use VCF0803 for the ideology scale and VCF0110 as the education measure.
`summarise()` has grouped output by 'education_group'. You can override using
the `.groups` argument.
print("Average ideology by education and race group:")
[1] "Average ideology by education and race group:"
print(average_ideology_by_education_race)
# A tibble: 4 × 3
# Groups: education_group [2]
education_group race_group average_ideology
<chr> <chr> <dbl>
1 College or Greater Non-White 3.45
2 College or Greater White 3.71
3 No College Degree Non-White 3.47
4 No College Degree White 3.99
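Q5. For both whites and non-whites, respondents with a college degree or greater were more liberal on average than respondents without one, although the gap was much larger among whites (3.71 vs. 3.99) than among non-whites (3.45 vs. 3.47).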
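The results above come from R. For comparison, the education split over time could be sketched in pandas as follows, on toy data. It assumes that VCF0803 runs from 1 to 7 for substantive answers and that VCF0110 runs from 1 to 4 with 4 meaning a college degree or more; both should be checked in the codebook. `ideology_by_education` is a hypothetical helper, and the race breakdown is left out because the question does not name the race variable.

```python
import numpy as np
import pandas as pd


def ideology_by_education(df, years):
    """Mean ideology by survey year and education group."""
    subset = df[df["VCF0004"].isin(years)]
    # ASSUMPTION: 1-7 are the substantive ideology answers, 1-4 the valid education codes.
    subset = subset[subset["VCF0803"].between(1, 7) & subset["VCF0110"].between(1, 4)].copy()
    # ASSUMPTION: code 4 is "college degree or greater"; "some college" falls below it.
    subset["education_group"] = np.where(
        subset["VCF0110"] >= 4, "College or Greater", "No College Degree"
    )
    means = subset.groupby(["VCF0004", "education_group"])["VCF0803"].mean()
    # One row per year, one column per education group.
    return means.unstack("education_group")


# Toy data standing in for the real file.
toy = pd.DataFrame({
    "VCF0004": [1980, 1980, 1980, 2020, 2020, 2020, 2020],
    "VCF0110": [4, 2, 0, 4, 4, 3, 1],
    "VCF0803": [3, 5, 4, 2, 3, 9, 6],
})
table = ideology_by_education(toy, [1980, 1992, 2000, 2020])
print(table)
```

Adding a race column to the `groupby` list would extend the same table to the white versus non-white comparison.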
Question 6 (8 points)
Python
Several questions on this survey are related to social trust. I’m talking here about questions VCF0619-VCF0621. Let’s just look at the 2004 survey responses.
Construct a scale that adds together the responses to these three questions so that higher values indicate greater social trust. Set the scale so it runs from zero (the minimum amount of trust).
Now, consider how this scale relates to respondents’ partisan identity (strong Democrat to strong Republican). Do you see any evidence that greater social trust is associated with partisan identity? Use VCF0301 to measure party identity.
There is some evidence that greater social trust is associated with partisan identity, particularly showing that Republicans have higher social trust on average compared to Democrats and Independents. However, the differences between Democrats and Independents are minimal, suggesting that while there is an association, it may not be very pronounced for all party identities.
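One way to build such a scale is sketched below on toy data. The mapping in `TRUSTING_CODE` (which response code counts as the trusting answer for each item) and the list of valid codes are placeholders, not values taken from the codebook; the direction of each item must be checked there before the scale means anything. `add_trust_scale` is a hypothetical helper.

```python
import pandas as pd

# ASSUMPTION: placeholder codes marking the "trusting" answer on each item.
TRUSTING_CODE = {"VCF0619": 1, "VCF0620": 2, "VCF0621": 2}
# ASSUMPTION: the substantive answers on all three items are coded 1 or 2.
VALID_CODES = [1, 2]


def add_trust_scale(df):
    """Drop rows with non-substantive answers and add a 0-3 trust scale."""
    items = list(TRUSTING_CODE)
    valid = df[items].isin(VALID_CODES).all(axis=1)
    out = df[valid].copy()
    # Each item contributes 1 for the trusting answer and 0 otherwise.
    out["trust_scale"] = sum(
        (out[item] == code).astype(int) for item, code in TRUSTING_CODE.items()
    )
    return out


# Toy data standing in for the 2004 wave.
toy = pd.DataFrame({
    "VCF0301": [1, 1, 7, 7, 4],
    "VCF0619": [1, 2, 1, 1, 9],
    "VCF0620": [2, 1, 2, 1, 2],
    "VCF0621": [2, 1, 2, 2, 1],
})
scaled = add_trust_scale(toy)

# Mean trust for each point on the party identification scale.
means = scaled.groupby("VCF0301")["trust_scale"].mean()
print(means)
```

Recoding each item to 0/1 before summing is what makes the scale start at zero and guarantees that higher totals mean more trust, regardless of how the raw items are coded.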
Question 7 (8 points)
R
A common type of question on political surveys is the “feeling thermometer” where respondents are asked how warm/cold they feel about certain topics or political groups or figures.
It is widely believed that political polarization today is worse than in the past - Republicans have more negative feelings about Democrats today than they did in years past, and vice versa.
Use these survey data to assess this claim. Use VCF0302 for the party identification of the respondent, VCF0218 for the thermometer for the Democratic party, and VCF0224 for the Republican party.
The data supports the claim that political polarization has worsened over time. The increase in the average feeling thermometer differences between Democrats and Republicans from 1980 to 2020 highlights a trend of growing partisan divide, corroborating the observation that political polarization today is more pronounced than in the past.
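The thermometer difference described above can be computed in several ways. The sketch below uses one common choice, the in-party rating minus the out-party rating, averaged by year, and runs in Python on toy data. The party codes for VCF0302 and the 0 to 97 range of valid thermometer scores are assumptions to verify in the codebook, and `in_out_gap_by_year` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: VCF0302 codes for the two parties; verify in the codebook.
REPUBLICAN, DEMOCRAT = 1, 5
# ASSUMPTION: thermometer scores above this value are non-response codes.
MAX_VALID_THERMOMETER = 97


def in_out_gap_by_year(df):
    """Mean (own-party minus other-party) thermometer rating for each year."""
    partisans = df[df["VCF0302"].isin([REPUBLICAN, DEMOCRAT])]
    valid = (
        partisans["VCF0218"].between(0, MAX_VALID_THERMOMETER)
        & partisans["VCF0224"].between(0, MAX_VALID_THERMOMETER)
    )
    out = partisans[valid].copy()
    dem_minus_rep = out["VCF0218"] - out["VCF0224"]
    # Democrats: Democratic minus Republican thermometer; Republicans: the reverse.
    out["gap"] = np.where(out["VCF0302"] == DEMOCRAT, dem_minus_rep, -dem_minus_rep)
    return out.groupby("VCF0004")["gap"].mean()


# Toy data standing in for the real file.
toy = pd.DataFrame({
    "VCF0004": [1980, 1980, 2020, 2020, 2020],
    "VCF0302": [5, 1, 5, 1, 2],
    "VCF0218": [70, 50, 90, 10, 50],
    "VCF0224": [50, 70, 10, 85, 50],
})
gaps = in_out_gap_by_year(toy)
print(gaps)
```

A gap that grows across waves would be consistent with rising polarization; splitting it into separate in-party and out-party averages shows whether the change comes from warmer feelings toward one's own party or colder feelings toward the other.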
Part 2: Exploration (20 points)
For the remainder of this project, you will further explore the ANES data and then create and answer some questions that are interesting to you about American politics.
Answer four different questions or test four different claims, or conduct some other analysis of your choice, in the style of what you did in Part 1. In each question, introduce at least one new variable into your analysis.
If you have any concerns about what makes a good question, please reach out for help.
You’ll probably need about 500 words of text (1.5-2 pages) to describe what you are looking at/what analysis you are performing, and of course you’ll need to include the associated code that provides the numerical/statistical evidence for your claims.
Use Python for at least two of these explorations.
You have flexibility to pursue questions that interest you. But, you need to leverage the data in a meaningful way. For example, “The mean response to question X was higher than the mean response to question Y” is not good enough. Use the temporal dimension of these data, the demographics of survey respondents, or the responses to different questions to create interesting comparisons or to focus on populations of particular interest to you. Let your creativity flow!
Good luck and have fun!
Q1. How has political ideology shifted over time, and how does this shift compare between genders?
The analysis of political ideology over time by gender reveals several noteworthy trends. In 1980, women had slightly more liberal average ideology scores compared to men, but by 1992, this gap had narrowed significantly, with both genders exhibiting similar levels of political ideology. In 2000, both men and women showed a substantial decrease in their average ideology scores, although women continued to have slightly higher scores. By 2020, men’s average political ideology scores surpassed those of women, indicating a shift towards more conservative views among men compared to women.
Q2.
library(dplyr)

anes <- read.csv("ANES.csv")

# Keep the education (VCF0110) and voting (VCF0720) variables as numeric
anes_filtered <- anes %>%
  select(VCF0110, VCF0720)
anes_filtered$VCF0110 <- as.numeric(anes_filtered$VCF0110)
anes_filtered$VCF0720 <- as.numeric(anes_filtered$VCF0720)

# Education codes of 4 and above are treated as a college degree or greater
anes_filtered <- anes_filtered %>%
  mutate(Education_Group = ifelse(VCF0110 >= 4, "College or Greater", "No College Degree"))

# Average of the voting measure within each education group
average_voting_by_education <- anes_filtered %>%
  group_by(Education_Group) %>%
  summarize(Average_Voting = mean(VCF0720, na.rm = TRUE))

print(average_voting_by_education)
# A tibble: 2 × 2
Education_Group Average_Voting
<chr> <dbl>
1 College or Greater 1.06
2 No College Degree 1.02
Q2. The analysis of political participation by education level shows a small difference. Respondents with a college degree or higher have a slightly higher average voting frequency (1.0638) than those without a college degree (1.0166). This suggests that higher education levels are associated with greater political participation, as measured by voting frequency, which aligns with the broader understanding that higher educational attainment often correlates with increased political engagement and activity.
Q3. The Relationship Between Political Discussion Frequency and Environmental Regulation Preferences
The examination of environmental regulation preferences based on the frequency of political discussions reveals interesting insights. Those who engage in frequent political discussions tend to support stronger environmental regulations (average score of 4.656) compared to those who discuss politics occasionally (average score of 3.620) or rarely (average score of 4.004). This indicates that individuals who frequently discuss politics are more likely to favor rigorous environmental policies, potentially reflecting a higher level of awareness and concern for environmental issues among these individuals.
Q4. Trends in the Importance of Gun Control Over Time
import pandas as pd

anes = pd.read_csv("ANES.csv", low_memory=False)
anes['VCF9239'] = pd.to_numeric(anes['VCF9239'], errors='coerce')

negative_values_2008 = anes[(anes['VCF0004'] == 2008) & (anes['VCF9239'] < 0)]
print("Negative values in 2008 before correction:")

anes.loc[(anes['VCF0004'] == 2008) & (anes['VCF9239'] < 0), 'VCF9239'] = None

average_importance_by_year = anes.groupby('VCF0004')['VCF9239'].mean()
average_importance_by_year_clean = average_importance_by_year.dropna()
print("Average importance of gun control issue by year after correction:")
Average importance of gun control issue by year after correction:
Q4.
In 2000, the average importance rating was 3.736, indicating a relatively high level of concern about gun control at that time. By 2004, this average dropped to 2.137, reflecting a decrease in perceived importance. The importance rating rose again in 2008 to 3.598, but then fell to 2.992 in 2012, demonstrating fluctuating levels of concern. In 2016, the importance rating was 2.135, almost identical to the 2004 level. By 2020, the importance had decreased significantly to 0.982, indicating a substantial decline in concern about gun control in recent years.
The data suggests that public concern about gun control has experienced considerable fluctuations over the past two decades, with a notable decline in recent years. This variation could be influenced by numerous factors, including political events, policy changes, and shifts in public opinion.