Final_Draugns

Programming and Data Management Final Project/Exam

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

anes <- read.csv("C:/Users/krisj/OneDrive/Desktop/JHU/Code/ANES.csv")

The American National Election Study is a gold-standard survey that runs every two years. It is one of the most important sources of data for the study of American politics.

In this final exam/project, you will use these data to investigate the attitudes of Americans.

This project is based on the ANES Time Series Cumulative Data File, which aggregates many questions that have been included in the survey since 1948. When you work with survey data like these, you generally work with a data file that includes responses, a codebook file that documents the questions asked, and a variable guide that lists in abbreviated form the questions on the survey and the variable name (column) of the data associated with each question. You’ll need to use all these files to perform your analysis. Make sure you get a copy of the abbreviated variable list to avoid sifting through the entire codebook when you want to find a variable of interest.

For these questions, do not worry about survey weights. Similarly, you should discard any responses where the interviewee reported “Don’t know” or “refused to answer” or similar responses. You can just recode these kinds of responses as missing data, and you should not report or include these in your answers/analysis. (In real life, you probably would not adopt this approach, but for our purposes, it’s ok for now).
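As a minimal sketch of the recoding described above, the helper below replaces a set of non-substantive codes with NaN before averaging. The helper name `recode_missing` and the example codes (0, 8, 9) are assumptions for illustration only; the real missing codes differ by variable and have to be taken from the codebook.

```python
import pandas as pd


def recode_missing(series, missing_codes):
    """Return a numeric copy of `series` with the given codes set to NaN."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.mask(numeric.isin(missing_codes))


# Toy responses, not ANES data: 1-7 are substantive answers,
# 0/8/9 stand in for "don't know"/refused/NA (assumed codes).
toy = pd.Series([1, 4, 9, 7, 0, 8, " "])
clean = recode_missing(toy, missing_codes=[0, 8, 9])
print(clean.mean())  # mean of the substantive answers 1, 4 and 7
```

Applying a step like this before every `mean()` keeps codes such as 9 from inflating an average.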

Part One includes a series of guided questions. This is worth 40 points.

In Part Two, you explore questions of interest to you. Part Two is worth 20 points.

You’ll work in a combination of R and Python, so you’ll have some code chunks in R and some code chunks in Python.

Create a .qmd file that answers the questions in Part 1 and describes and performs your investigation in Part 2. The .qmd file should be attractive and well-formatted when rendered - for example, points will be reduced if you dump huge amounts of console output into the rendered document. You want the document to be readable and meaningful. Display the code chunks as you go to show your work. Render an html file with your report and submit it along with the .qmd file.

When you import the ANES data into Python, you might encounter some issues with data types. To avoid these problems, use the following code to make sure that pandas coerces all numbers into numeric format. Note that I renamed the actual .csv to ANES.csv here for clarity - the actual name when you download the document is longer.

#import pandas as pd
#anes = pd.read_csv("ANES.csv")
#anes2 = anes.apply(pd.to_numeric, errors="coerce")

This code forces pandas to parse the data as numeric, avoiding problems with mixed types of data in columns (for some reason, ANES included some spaces in the .csv which Python doesn’t deal with well). If pandas cannot coerce the data in a cell into a number, it replaces the value with NaN. Note: there are a few columns in this data set where this transformation will cause a problem, like VCF0901b, which is the state postal abbreviation. If you want to do any state-level analysis in Python, you’ll need to account for this. I would just re-import the original .csv and then bind the unaltered state code column back to the transformed data frame as a new column. In R, you shouldn’t encounter any of these issues.
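One way to handle the state column, sketched on a tiny in-memory stand-in for the real file: coerce everything to numeric, then copy the untouched text column back under a new name. The column name `state` and the three sample rows are assumptions made for this illustration.

```python
import io
import pandas as pd

# Tiny stand-in for ANES.csv; the blank cell mimics the stray spaces in the real file.
csv_text = "VCF0004,VCF0112,VCF0901b\n1996,3,TX\n1996, ,CA\n2000,1,NY\n"

raw = pd.read_csv(io.StringIO(csv_text), dtype=str)   # keep every cell as text
numeric = raw.apply(pd.to_numeric, errors="coerce")   # non-numbers become NaN

# The state abbreviations were coerced to NaN, so copy the original text column back.
numeric["state"] = raw["VCF0901b"]
print(numeric)
```

Because both frames come from the same file in the same row order, assigning the column directly keeps the rows aligned without a merge.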

Part One: Guided Questions (40 points)

Question 1 (2 points)

Create a tibble that shows how many respondents are in each wave of the survey. Use VCF0004 to get the year/wave of the survey.

library(dplyr)
anes_data <- read.csv("ANES.csv")
wave_counts <- anes_data %>%
  group_by(VCF0004) %>%
  summarise(Respondents = n())

print(wave_counts)

# A tibble: 32 × 2
   VCF0004 Respondents
     <int>       <int>
 1    1948         662
 2    1952        1899
 3    1954        1139
 4    1956        1762
 5    1958        1450
 6    1960        1181
 7    1962        1297
 8    1964        1571
 9    1966        1291
10    1968        1557
# ℹ 22 more rows

Question 2 (2 points)

Python

How are survey respondents distributed across the major geographic regions of the US in the 1996 wave of the survey? (i.e., how many respondents per region). Use VCF0112 to get the Census region.

import pandas as pd

import sys
print(sys.version)

3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]

anes = pd.read_csv("ANES.csv", low_memory=False, dtype=str)
anes2 = anes.apply(pd.to_numeric,errors="coerce")

code_to_region = {
    1.0: 'Northeast',
    2.0: 'North Central',
    3.0: 'South',
    4.0: 'West'
}

anes2['Region'] = anes2['VCF0112'].map(code_to_region)

region_counts = anes2[anes2['VCF0004'] == 1996]['Region'].value_counts()


print(region_counts)

Region
South            642
North Central    458
West             354
Northeast        260
Name: count, dtype: int64

Question 3 (4 points)

Considering the 2008 wave and subsequent waves, what percent of these interviews in each wave were partially or entirely translated to Spanish? For this question for simplicity, consider only pre-election interviews, VCF0018a.

spanish_interviews <- anes_data %>%
  filter(VCF0004 >= 2008, !is.na(VCF0018a)) %>%
  group_by(VCF0004) %>%
  summarise(Spanish = sum(VCF0018a == 1),
            Total = n()) %>%
  mutate(Percentage = (Spanish / Total) * 100)

print(spanish_interviews)

# A tibble: 4 × 4
  VCF0004 Spanish Total Percentage
    <int>   <int> <int>      <dbl>
1    2008      94  2322      4.05 
2    2012     225  5914      3.80 
3    2016      63  4270      1.48 
4    2020      69  8280      0.833

Question 4 (8 points)

Python

One of the questions on this survey has the interviewer read a list of words and phrases that people use to describe political figures. Then, the interviewer asks the interviewee to think about a given political figure, and the interviewer asks whether a given phrase describes that political figure extremely well, quite well, not too well or not well at all.

So, for example, the interviewer might say, “Think about Ronald Reagan. In your opinion, does the phrase or word ‘intelligent’ describe Ronald Reagan extremely well, quite well, not too well, or not well at all?”

Based on the survey data between 1980 and 2008, which president did women under the age of 40 think was the most knowledgeable? Which president was the least knowledgeable in the eyes of this group? You can average across all surveys during a president’s term. Some presidents will be included in more waves than others - that’s fine, use the average regardless of the number of terms.

Use VCF0342 for the measure of how knowledgeable the president is, VCF0104 for the respondent gender, and VCF0101 for the respondent age.

import pandas as pd

anes = pd.read_csv("ANES.csv", low_memory=False, dtype=str)  
anes2 = anes.apply(pd.to_numeric,errors="coerce")

anes_filtered = anes2[(anes2['VCF0004'] >= 1980) & (anes2['VCF0004'] <= 2008)]


filtered_data = anes_filtered[(anes_filtered['VCF0104'] == 2) & (anes_filtered['VCF0101'] < 40)]

filtered_data = filtered_data[filtered_data['VCF0342'].between(1, 4, inclusive='both')]

year_to_president = {
    1980: 'Jimmy Carter', 1981: 'Ronald Reagan', 1982: 'Ronald Reagan', 1983: 'Ronald Reagan', 1984: 'Ronald Reagan',
    1985: 'Ronald Reagan', 1986: 'Ronald Reagan', 1987: 'Ronald Reagan', 1988: 'Ronald Reagan',
    1989: 'George H. W. Bush', 1990: 'George H. W. Bush', 1991: 'George H. W. Bush', 1992: 'George H. W. Bush',
    1993: 'Bill Clinton', 1994: 'Bill Clinton', 1995: 'Bill Clinton', 1996: 'Bill Clinton',
    1997: 'Bill Clinton', 1998: 'Bill Clinton', 1999: 'Bill Clinton', 2000: 'Bill Clinton',
    2001: 'George W. Bush', 2002: 'George W. Bush', 2003: 'George W. Bush', 2004: 'George W. Bush',
    2005: 'George W. Bush', 2006: 'George W. Bush', 2007: 'George W. Bush', 2008: 'George W. Bush'
}



filtered_data['President'] = filtered_data['VCF0004'].map(year_to_president)


average_knowledgeable = filtered_data.groupby('President')['VCF0342'].mean()

# Assuming the codebook coding for VCF0342 (1 = "extremely well", 4 = "not well at all"),
# a lower mean rating means the president was seen as more knowledgeable.
most_knowledgeable_president = average_knowledgeable.idxmin()
least_knowledgeable_president = average_knowledgeable.idxmax()

print(f"The most knowledgeable president according to women under 40 is: {most_knowledgeable_president}")

The most knowledgeable president according to women under 40 is: Bill Clinton

print(f"The least knowledgeable president according to women under 40 is: {least_knowledgeable_president}")

The least knowledgeable president according to women under 40 is: George W. Bush

Question 5 (8 points)

These days, the evidence suggests that higher levels of education are associated with more liberal political attitudes, as measured on a traditional seven-point ideology scale.

Track this pattern over time. Use respondents from 1980, 1992, 2000, and 2020. What is the average political ideology of survey respondents with a college degree or greater vs. the political ideology of respondents without a college degree? (Note: some college doesn’t count)

In addition, repeat this, but compare how this breaks down on along racial lines. Is the pattern the same for whites and non-whites?

Use VCF0803 for the ideology scale and VCF0110 as the education measure.

library(dplyr)
anes <- read.csv("ANES.csv")
anes$VCF0803 <- as.numeric(as.character(anes$VCF0803))
anes$VCF0110 <- as.numeric(as.character(anes$VCF0110))

# Race groups are built from VCF0105a further below.

anes_filtered <- anes %>% filter(VCF0004 %in% c(1980, 1992, 2000, 2020))

anes_filtered <- anes_filtered %>%
  mutate(education_group = ifelse(VCF0110 >= 4, "College or Greater", "No College Degree"))

anes_filtered <- anes_filtered %>% filter(!is.na(VCF0803) & VCF0803 <= 7)

anes_filtered <- anes_filtered %>%
  mutate(race_group = case_when(
    VCF0105a == 1 ~ "White",
    VCF0105a %in% c(2, 3, 4, 5, 6) ~ "Non-White",
    TRUE ~ NA_character_
  ))

anes_filtered <- anes_filtered %>% filter(!is.na(race_group))

average_ideology_by_education <- anes_filtered %>%
  group_by(education_group) %>%
  summarize(average_ideology = mean(VCF0803, na.rm = TRUE))

print("Average ideology by education group:")

[1] "Average ideology by education group:"

print(average_ideology_by_education)

# A tibble: 2 × 2
  education_group    average_ideology
  <chr>                         <dbl>
1 College or Greater             3.66
2 No College Degree              3.87

average_ideology_by_education_race <- anes_filtered %>%
  group_by(education_group, race_group) %>%
  summarize(average_ideology = mean(VCF0803, na.rm = TRUE))

`summarise()` has grouped output by 'education_group'. You can override using
the `.groups` argument.

print("Average ideology by education and race group:")

[1] "Average ideology by education and race group:"

print(average_ideology_by_education_race)

# A tibble: 4 × 3
# Groups:   education_group [2]
  education_group    race_group average_ideology
  <chr>              <chr>                 <dbl>
1 College or Greater Non-White              3.45
2 College or Greater White                  3.71
3 No College Degree  Non-White              3.47
4 No College Degree  White                  3.99

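Q5. Pooling the four waves, respondents with a college degree average lower (more liberal) ideology scores than respondents without one (3.66 vs. 3.87). The pattern holds for both whites and non-whites, but the gap is much larger among whites (3.71 vs. 3.99) than among non-whites (3.45 vs. 3.47).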
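The question also asks for the pattern over time. A minimal sketch of a per-year breakdown is shown here on synthetic rows rather than the ANES file. It assumes that substantive ideology answers run from 1 to 7 and that education codes 1 to 4 are valid, with 4 meaning a college degree; those ranges should be checked against the codebook.

```python
import numpy as np
import pandas as pd

# Synthetic rows, not ANES data; column names mirror the variables used above.
toy = pd.DataFrame({
    "VCF0004": [1980, 1980, 1980, 2020, 2020, 2020, 2020],
    "VCF0110": [4, 2, 0, 4, 3, 1, 4],
    "VCF0803": [5, 4, 3, 2, 6, 9, 3],
})

# Keep only substantive answers (assumed ranges: ideology 1-7, education 1-4).
valid = toy[toy["VCF0803"].between(1, 7) & toy["VCF0110"].between(1, 4)].copy()
valid["education_group"] = np.where(
    valid["VCF0110"] >= 4, "College or Greater", "No College Degree"
)

# One row per survey year, one column per education group.
by_year = valid.groupby(["VCF0004", "education_group"])["VCF0803"].mean().unstack()
print(by_year)
```

Adding a race column to the `groupby` list would give the same year-by-year view along racial lines.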

Question 6 (8 points)

Python

Several questions on this survey are related to social trust. I’m talking here about questions VCF0619-VCF0621. Let’s just look at the 2004 survey responses.

Construct a scale that adds together the responses to these three questions so that higher values indicate greater social trust. Set the scale so it runs from zero (the minimum amount of trust).

Now, consider how this scale relates to respondents’ partisan identity (strong Democrat to strong Republican). Do you see any evidence that greater social trust is associated with partisan identity? Use VCF0301 to measure party identity.

import pandas as pd

anes = pd.read_csv("ANES.csv", low_memory=False, dtype=str)

anes[['VCF0004', 'VCF0619', 'VCF0620', 'VCF0621']] = anes[['VCF0004', 'VCF0619', 'VCF0620', 'VCF0621']].apply(pd.to_numeric, errors='coerce')

anes_2004 = anes[anes['VCF0004'] == 2004].copy()

anes_2004['social_trust'] = anes_2004[['VCF0619', 'VCF0620', 'VCF0621']].sum(axis=1)
anes_2004['social_trust'] = anes_2004['social_trust'] - anes_2004['social_trust'].min()

anes_2004 = anes_2004.dropna(subset=['VCF0301'])

anes_2004['VCF0301'] = pd.to_numeric(anes_2004['VCF0301'], errors='coerce')

def party_identity_label(x):
    if x <= 3:
        return 'Democrat'
    elif x == 4:
        return 'Independent'
    else:
        return 'Republican'

anes_2004['party_identity'] = anes_2004['VCF0301'].apply(party_identity_label)

average_social_trust_by_party = anes_2004.groupby('party_identity')['social_trust'].mean()

print(anes_2004[['VCF0619', 'VCF0620', 'VCF0621', 'social_trust']].head())

       VCF0619  VCF0620  VCF0621  social_trust
46226      1.0      9.0      2.0          12.0
46227      2.0      2.0      1.0           5.0
46228      2.0      2.0      2.0           6.0
46229      0.0      0.0      0.0           0.0
46230      1.0      2.0      1.0           4.0

print("Average Social Trust by Party Identity:")

Average Social Trust by Party Identity:

print(average_social_trust_by_party)

party_identity
Democrat       4.157895
Independent    4.157025
Republican     4.538302
Name: social_trust, dtype: float64

Q6. Democrat 4.157895. Independent 4.157025. Republican 4.538302.

There is some evidence that social trust is associated with partisan identity: Republicans average higher on the scale (4.54) than Democrats (4.16) and Independents (4.16), who are nearly identical. The association is therefore modest and is driven by the Republican group. The first rows printed above include values of 9 and 0 in the component items, which appear to be non-substantive codes, so these averages should be treated as provisional.
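A sketch of one way to build the scale so that it starts at zero and excludes non-substantive codes, using synthetic rows rather than the ANES file. It assumes each item is coded 1 for the low-trust answer and 2 for the high-trust answer, which should be confirmed in the codebook.

```python
import pandas as pd

items = ["VCF0619", "VCF0620", "VCF0621"]

# Synthetic rows, not ANES data. Assumed coding: 1 = low-trust answer, 2 = high-trust answer.
toy = pd.DataFrame({
    "VCF0619": [1, 2, 2, 0, 1],
    "VCF0620": [9, 2, 2, 0, 2],
    "VCF0621": [2, 1, 2, 0, 1],
    "VCF0301": [1, 7, 4, 2, 6],
})

# Treat anything other than 1 or 2 as missing, then require all three items.
answers = toy[items].where(toy[items].isin([1, 2]))
complete = toy.loc[answers.notna().all(axis=1)].copy()

# Subtracting the item count shifts the 3-6 sum onto a 0-3 scale.
complete["social_trust"] = answers.loc[complete.index].sum(axis=1) - len(items)
print(complete[["VCF0301", "social_trust"]])
```

Grouping `social_trust` by all seven values of VCF0301, rather than three collapsed labels, would show whether trust changes steadily across the partisan scale.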

Question 7 (8 points)

A common type of question on political surveys is the “feeling thermometer” where respondents are asked how warm/cold they feel about certain topics or political groups or figures.

It is widely believed that political polarization today is worse than in the past - Republicans have more negative feelings about Democrats today than they did in years past, and vice versa.

Use these survey data to assess this claim. Use VCF0302 for the party identification of the respondent, VCF0218 for the thermometer for the Democratic party, and VCF0224 for the Republican party.

library(dplyr)
library(readr)

anes <- read_csv("ANES.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 68224 Columns: 1030
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr    (3): Version, VCF0900c, VCF0901b
dbl (1019): VCF0004, VCF0006, VCF0006a, VCF0009x, VCF0010x, VCF0011x, VCF000...
lgl    (8): VCF0391c, VCF0391d, VCF1030a, VCF1030b, VCF1031a, VCF1031b, VCF1...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

anes <- anes %>%
  mutate(across(c(VCF0004, VCF0302, VCF0218, VCF0224), as.numeric))

anes_filtered <- anes %>%
  filter(VCF0004 %in% c(1980, 1992, 2000, 2020))

party_identity_label <- function(x) {
  if (x <= 3) {
    return("Democrat")
  } else if (x >= 5) {
    return("Republican")
  } else {
    return(NA)}}

anes_filtered <- anes_filtered %>%
  mutate(party_identity = sapply(VCF0302, party_identity_label))

anes_filtered <- anes_filtered %>%
  mutate(feeling_therm_diff = ifelse(party_identity == "Democrat", 
                                     VCF0224 - VCF0218, 
                                     ifelse(party_identity == "Republican", 
                                            VCF0218 - VCF0224, NA)))

anes_filtered <- anes_filtered %>%
  filter(!is.na(feeling_therm_diff))

polarization_over_time <- anes_filtered %>%
  group_by(VCF0004) %>%
  summarize(avg_feeling_therm_diff = mean(feeling_therm_diff, na.rm = TRUE))

print("Average Feeling Thermometer Differences Over Time:")

[1] "Average Feeling Thermometer Differences Over Time:"

print(polarization_over_time)

# A tibble: 4 × 2
  VCF0004 avg_feeling_therm_diff
    <dbl>                  <dbl>
1    1980                   19.1
2    1992                   16.7
3    2000                   21.1
4    2020                   37.2

Q7. Average thermometer difference by year: 1980: 19.11637; 1992: 16.65981; 2000: 21.06966; 2020: 37.24604.

The data support the claim that political polarization has worsened over time. The average thermometer difference dipped from 19.1 in 1980 to 16.7 in 1992, rose to 21.1 in 2000, and then climbed sharply to 37.2 in 2020, nearly double the 1980 level. Most of the increase therefore falls between 2000 and 2020.
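The thermometer gap can also be written as own-party warmth minus other-party warmth, which makes the direction of the difference explicit. The sketch below uses synthetic rows, and the `party` column is assumed to have been labeled from VCF0302 beforehand using the codebook.

```python
import pandas as pd

# Synthetic rows, not ANES data. `party` is assumed to be already labeled from VCF0302.
toy = pd.DataFrame({
    "VCF0004": [1980, 1980, 2020, 2020],
    "party":   ["Democrat", "Republican", "Democrat", "Republican"],
    "VCF0218": [70, 50, 85, 15],   # thermometer: Democratic party
    "VCF0224": [50, 75, 10, 80],   # thermometer: Republican party
})

is_dem = toy["party"] == "Democrat"
# Own-party warmth minus other-party warmth; larger gaps mean more polarization.
toy["in_party"] = toy["VCF0218"].where(is_dem, toy["VCF0224"])
toy["out_party"] = toy["VCF0224"].where(is_dem, toy["VCF0218"])
toy["gap"] = toy["in_party"] - toy["out_party"]

gap_by_year = toy.groupby("VCF0004")["gap"].mean()
print(gap_by_year)
```

Averaging `out_party` on its own by year would test the narrower claim that feelings toward the other party have become colder.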

Part 2: Exploration (20 points)

For the remainder of this project, you will further explore the ANES data and then create and answer some questions that are interesting to you about American politics.

Answer four different questions or test four different claims, or conduct some other analysis of your choice, in the style of what you did in Part 1. In each question, introduce at least one new variable into your analysis.

If you have any concerns about what makes a good question, please reach out for help.

You’ll probably need about 500 words of text (1.5-2 pages) to describe what you are looking at/what analysis you are performing, and of course you’ll need to include the associated code that provides the numerical/statistical evidence for your claims.

Use Python for at least two of these explorations.

You have flexibility to pursue questions that interest you. But, you need to leverage the data in a meaningful way. For example, “The mean response to question X was higher than the mean response to question Y” is not good enough. Use the temporal dimension of these data, the demographics of survey respondents, or the responses to different questions to create interesting comparisons or to focus on populations of particular interest to you. Let your creativity flow!

Good luck and have fun!

Q1. How has political ideology shifted over time, and how does this shift compare between genders?

import pandas as pd

anes = pd.read_csv("ANES.csv", low_memory=False)

years = [1980, 1992, 2000, 2020]
anes_filtered = anes[anes['VCF0004'].isin(years)].copy()

anes_filtered['VCF0803'] = pd.to_numeric(anes_filtered['VCF0803'], errors='coerce')
anes_filtered['VCF0104'] = pd.to_numeric(anes_filtered['VCF0104'], errors='coerce')

anes_filtered['Gender'] = anes_filtered['VCF0104'].map({1: 'Male', 2: 'Female', 3: 'Other'})

average_ideology_by_gender = anes_filtered.groupby(['VCF0004', 'Gender'])['VCF0803'].mean().unstack()
print(average_ideology_by_gender)

Gender     Female      Male
VCF0004                    
1980     5.964091  5.604317
1992     5.644310  5.254750
2000     2.579154  2.515190
2020     4.758202  4.836301

Q1.

Year  Female    Male
1980  5.964091  5.604317
1992  5.644310  5.254750
2000  2.579154  2.515190
2020  4.758202  4.836301

The averages by gender show no stable gender gap. On the seven-point scale, higher scores are more conservative. In 1980 women averaged higher than men (5.96 vs. 5.60), and the gap was about the same size in 1992 (5.64 vs. 5.25). In 2000 both averages dropped sharply and the gap nearly disappeared (2.58 vs. 2.52). By 2020 the ordering had reversed, with men slightly higher than women (4.84 vs. 4.76). The size of the swings between waves suggests that non-substantive codes such as “don’t know” may still be included in these means, so the levels should be read with caution.

Q2. Political Participation by Education Level

library(dplyr)

anes <- read.csv("ANES.csv")

anes_filtered <- anes %>% select(VCF0110, VCF0720)

anes_filtered$VCF0110 <- as.numeric(anes_filtered$VCF0110)
anes_filtered$VCF0720 <- as.numeric(anes_filtered$VCF0720)

anes_filtered <- anes_filtered %>%
  mutate(Education_Group = ifelse(VCF0110 >= 4, "College or Greater", "No College Degree"))

average_voting_by_education <- anes_filtered %>%
  group_by(Education_Group) %>%
  summarize(Average_Voting = mean(VCF0720, na.rm = TRUE))

print(average_voting_by_education)

# A tibble: 2 × 2
  Education_Group    Average_Voting
  <chr>                       <dbl>
1 College or Greater           1.06
2 No College Degree            1.02

Q2. Respondents with a college degree or higher have a slightly higher average on the voting measure (VCF0720) than those without a college degree (1.0638 vs. 1.0166). The difference is small, about 0.05, so it is at most weak evidence that higher education is associated with greater political participation. The direction is consistent with the broader understanding that higher educational attainment often correlates with increased political engagement.

Q3. The Relationship Between Political Discussion Frequency and Environmental Regulation Preferences

library(dplyr)

anes <- read.csv("ANES.csv")

anes_filtered <- anes %>% select(VCF0731, VCF0842)

anes_filtered$VCF0731 <- as.numeric(as.character(anes_filtered$VCF0731))
anes_filtered$VCF0842 <- as.numeric(as.character(anes_filtered$VCF0842))

anes_filtered <- anes_filtered %>%
  mutate(Political_Discussion = case_when(
    VCF0731 >= 1 & VCF0731 <= 3 ~ "Occasional",
    VCF0731 >= 4 & VCF0731 <= 6 ~ "Frequent",
    TRUE ~ "Rarely"
  ))

average_environmental_regulation_by_discussion <- anes_filtered %>%
  group_by(Political_Discussion) %>%
  summarize(Average_Environmental_Regulation = mean(VCF0842, na.rm = TRUE))

print(average_environmental_regulation_by_discussion)

# A tibble: 3 × 2
  Political_Discussion Average_Environmental_Regulation
  <chr>                                           <dbl>
1 Frequent                                         4.66
2 Occasional                                       3.62
3 Rarely                                           4.00

Q3.

Political_Discussion  Average_Environmental_Regulation
Frequent              4.656824
Occasional            3.619450
Rarely                4.003953

The relationship between discussion frequency and environmental regulation preferences is not monotonic. Respondents in the Frequent group have the highest average score (4.657), but the Rarely group (4.004) scores higher than the Occasional group (3.619). If higher scores indicate support for stronger regulation, as assumed here, frequent discussants stand out from the other two groups, but the data do not show support rising steadily with more discussion. The Rarely category is also the catch-all branch of the recoding, so it includes respondents with missing values on VCF0731.

Q4. Trends in the Importance of Gun Control Over Time

import pandas as pd

anes = pd.read_csv("ANES.csv", low_memory=False)

anes['VCF9239'] = pd.to_numeric(anes['VCF9239'], errors='coerce')

negative_values_2008 = anes[(anes['VCF0004'] == 2008) & (anes['VCF9239'] < 0)]
print("Negative values in 2008 before correction:")

Negative values in 2008 before correction:

print(negative_values_2008[['VCF0004', 'VCF9239']])

       VCF0004  VCF9239
47445     2008     -9.0
47446     2008     -9.0
47447     2008     -9.0
47451     2008     -9.0
47453     2008     -9.0
...        ...      ...
49753     2008     -9.0
49755     2008     -9.0
49756     2008     -9.0
49757     2008     -9.0
49759     2008     -9.0

[1171 rows x 2 columns]

anes.loc[(anes['VCF0004'] == 2008) & (anes['VCF9239'] < 0), 'VCF9239'] = None

average_importance_by_year = anes.groupby('VCF0004')['VCF9239'].mean()

average_importance_by_year_clean = average_importance_by_year.dropna()

print("Average importance of gun control issue by year after correction:")

Average importance of gun control issue by year after correction:

print(average_importance_by_year_clean)

VCF0004
2000    3.736027
2004    2.136964
2008    3.597741
2012    2.992222
2016    2.135129
2020    0.982005
Name: VCF9239, dtype: float64

Q4.
In 2000, the average importance rating was 3.736. It dropped to 2.137 in 2004, rose to 3.598 in 2008, and fell to 2.992 in 2012. The 2016 average (2.135) was almost identical to the 2004 level, and by 2020 the average had fallen to 0.982.

The data suggest that public concern about gun control has fluctuated considerably over the past two decades, with a notable decline in recent years. Negative codes were recoded as missing only for 2008, however, so the averages for the other waves, including the low 2020 value, may still include missing-data codes. Political events, policy changes, and shifts in public opinion could all contribute to the variation.
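A sketch of recoding negative values in every wave at once, shown on synthetic rows rather than the ANES file. Treating all negative values as missing-data codes is an assumption to verify against the codebook entry for VCF9239.

```python
import pandas as pd

# Synthetic rows, not ANES data; -9 stands in for a missing-data code.
toy = pd.DataFrame({
    "VCF0004": [2008, 2008, 2020, 2020, 2020],
    "VCF9239": [4, -9, 3, -9, 5],
})

# Blank out negative codes in every wave rather than in a single year.
toy["VCF9239"] = toy["VCF9239"].where(toy["VCF9239"] >= 0)
mean_by_year = toy.groupby("VCF0004")["VCF9239"].mean()
print(mean_by_year)
```

With the same rule applied to every year, differences between waves reflect substantive answers rather than how many missing codes each wave happens to contain.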