1 Section - Objectives

The objective of this project is to create a relational database and pull various data sets together into in order to perform exploratory data analysis of Election, Unemployment, Poverty, and Education data. The goal is to learn how to use sql and data management skills to pull needed variables into one dataframe to perform necessary Exploratory Data Analysis.

1.1 Data Integration

1.1.1 Data Preparation - Initial Download and Analysis

In order to create a database all four of the datasets needed must be downloaded. Once all data is incorporated it is necessary to get a basic understanding of each dataset. The four datasets used in this project as education, which contains variables looking at education levels in various states. The second dataset is unemployment, which contains information pertaining to unemployment rates in different states and counties. The third dataset is poverty, which pertains information about poverty rates across the united states. The final dataset is contains election data, such as parties and the winning party for previous elections.

Here is a quick summary of the four datasets. A connection is then made to a sqlite database and data tables are input. Now the process of data integration can begin.

1.1.2 Data Preparation - Presidential Election Data

When preparing the first data set, Presidential Election Data, it must fit the following requirements: 1.only use the 2020 election data 2.only keep the data for the two major parties: Democrats and Republicans 3.aggregate the total votes and keep the winning party in the data 4.county FIPS code, State name, county name, total votes received in the winning party, and the name of the party.

SELECT 
    county_fips AS fips_code,
    state,
    county_name,
    SUM(totalvotes) AS total_votes,
    CASE 
        WHEN SUM(CASE WHEN party = 'DEMOCRAT' THEN candidatevotes ELSE 0 END) > 
             SUM(CASE WHEN party = 'REPUBLICAN' THEN candidatevotes ELSE 0 END) 
        THEN 'DEMOCRAT'
        ELSE 'REPUBLICAN'
    END AS winning_party
FROM 
    election
WHERE 
    year = 2020 
    AND party IN ('DEMOCRAT', 'REPUBLICAN')
GROUP BY 
    county_fips, state, county_name;

1.1.3 Data Preparation - Unemployment Data

When preparing the second data set, Unemployment Data, it must fit the following requirements: 1.Only keep the unemployment date in the year 2020 or the most recent year if the 2020 employment rate is unavailable 2.County FIPS code 3.Unemployment rate

I renamed ‘FIPS_Code’ to ‘fips_code’ for data consistency. Value was also renamed to unemployment_rate for clarity when it comes to merging datasets. The substring function was used on the ‘Attribute’ varible in order to extract the year 2020. This was aptly named ‘year’.

SELECT 
    FIPS_Code AS fips_code,
    Value AS unemployment_rate, 
    SUBSTR(Attribute, -4) AS year
FROM 
    unemployment
WHERE 
    year = "2020"

1.1.4 Data Preparation - Poverty Data

When preparing the third data set, Poverty Data, it must fit the following requirements: 1.Keep only 2019 poverty rate 2.County FIPS code

I renamed ‘FIPStxt’ to ‘fips_code’ for data consistancy. Value was also renamed to poverty_rate for clarity when it comes to merging datasets.

SELECT 
    FIPStxt AS fips_code,
    Value as poverty_rate
FROM 
    poverty
WHERE 
    Attribute = "PCTPOVALL_2019";

1.1.5 Data Preparation - Education Data

When preparing the final data set, Education Data, it must fit the following requirements: 1.only keep the percentage of education levels between 2015 and 2019. a.Education levels b.less than a high school diploma c.high school diploma d.completed some college (1-3 years) e.completed four years of college 2.County FIPS code

I renamed the variables above to ‘fips_code’ and replaced all “.” with “_” in order to run a sql query.

SELECT 
    fips_code,
    Percent_of_adults_with_less_than_a_high_school_diploma__2015_19,
    Percent_of_adults_with_a_high_school_diploma_only__2015_19,
    Percent_of_adults_completing_some_college_or_associate_s_degree__2015_19,
    Percent_of_adults_with_a_bachelor_s_degree_or_higher__2015_19
FROM 
    education
GROUP BY 
    fips_code;

1.1.6 Merge Datasets

Here I left joined all datasets together. Then, using the dplyr library, I removed all duplicates based on the fips_code and any missing observations. The final dataset is called fips_202_elections. This dataset has 3113 observations consisting of 12 variables of interest.

2 Exploratory Data Anaylsis

2.1 Univariate Analysis

The first EDA I want to do is a univariate analysis of the variable “winning_party” Below is pie chart that shows the distribution of winning parties.

From this chart we can see that the republican party has a higher amount of winning counties than the democratic party. However, in the electoral college, it is not how many counties your party wins, but how many delegates belong to that county. In 2020 the Democratic party won the presidency, this was done by winning fewer but larger counties with more delegates.

2.2 Bivariate Anaylsis

In looking at two continous varibles “poverty_rate” and “unemployment_rate” the best way to visualize if there is any association is through a scatterplot. The plot can be seen below.

Here we can see that there does not seem to be an association between poverty rate and unemployment rate in the 2020 presidential election. A statistical test to confirm this would be Pearson’s Correlation test. ## Pairwise comparisons

A pairwise comparisons was conducted for adults and various educational levels. The graphic can be seen below.

# Results and Discussions In this project four datasets were queried in a SQLite database and merged into one dataframe in order to perform Exploratory Data Analysis. When looking at the distribution of winning parties, we see that the Republication Party won a larger amount of counties compared to the Democratic Party. In a Bivariate analysis looking for an association between poverty rate and unemployment rate, none was found. Finally a pairwise analysis was conducted looking at correlations between party and educational levels. Here it can be seen higher education had a negative correlation both parties, except for republicans with a bachelor degree or higher. This can be due to where colleges are located in specific counties or job opportunists for professionals.


---
title: "SQL and EDA of Election Data"
author: "Jessica Gorr"
date: "9/25/2024"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---
```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight:bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: system-ui;
    font-weight:bold;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight:bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight:bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```


```{r setup, include=FALSE}
options(repos = list(CRAN="http://cran.rstudio.com/"))
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("odbc")) {
   install.packages("odbc")
   library(odbc)
}
if (!require("DBI")) {
   install.packages("DBI")
   library(DBI)
}
if (!require("RSQLite")) {
   install.packages("RSQLite")
   library(RSQLite)
}

if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}

knitr::opts_chunk$set(echo = TRUE,       
                      warning = FALSE,   
                      result = TRUE,   
                      message = FALSE,
                      comment = NA)
```


\

# Section - Objectives
The objective of this project is to create a relational database and pull various data sets together into in order to perform exploratory data analysis of Election, Unemployment, Poverty, and Education data. The goal is to learn how to use sql and data management skills to pull needed variables into one dataframe to perform necessary Exploratory Data Analysis.

## Data Integration
### Data Preparation - Initial Download and Analysis
In order to create a database all four of the datasets needed must be downloaded. Once all data is incorporated it is necessary to get a basic understanding of each dataset. The four datasets used in this project as education, which contains variables looking at education levels in various states. The second dataset is unemployment, which contains information pertaining to unemployment rates in different states and counties. The third dataset is poverty, which pertains information about poverty rates across the united states. The final dataset is contains election data, such as parties and the winning party for previous elections.
```{r, echo=FALSE, results='hide'}
#Load the sample data
education <- read.csv("https://jessgorr01.github.io/STA551/Education.csv")
poverty <- read.csv("https://jessgorr01.github.io/STA551/PovertyEstimates.csv")
unemployment <- read.csv("https://jessgorr01.github.io/STA551/Unemployment.csv")
election <- read.csv("https://jessgorr01.github.io/STA551/countypresidential_election_2000-2020.csv")

education <- education %>%
  rename(fips_code = FIPS.Code,
         Percent_of_adults_with_less_than_a_high_school_diploma__2015_19 = Percent.of.adults.with.less.than.a.high.school.diploma..2015.19,
         Percent_of_adults_with_a_high_school_diploma_only__2015_19 = Percent.of.adults.with.a.high.school.diploma.only..2015.19,
         Percent_of_adults_completing_some_college_or_associate_s_degree__2015_19 = Percent.of.adults.completing.some.college.or.associate.s.degree..2015.19,
         Percent_of_adults_with_a_bachelor_s_degree_or_higher__2015_19 = Percent.of.adults.with.a.bachelor.s.degree.or.higher..2015.19)
```
Here is a quick summary of the four datasets. A connection is then made to a sqlite database and data tables are input. Now the process of data integration can begin.

```{r, echo=FALSE, results='hide'}
summary(education)
summary(poverty)
summary(unemployment)
summary(election)
```
``````{r, echo=FALSE, results='hide'}
#Create database
con <- dbConnect(drv = SQLite(),
                 dbname = ":memory:")

#store sample data in the database
dbWriteTable(conn = con, 
             name = "education",
             value = education)

dbWriteTable(conn = con, 
             name = "poverty",
             value = poverty)

dbWriteTable(conn = con, 
             name = "unemployment",
             value = unemployment)
dbWriteTable(conn = con, 
             name = "election",
             value = election)
 
#remove the local data from the environment
rm(education,poverty, unemployment, election)
```

``````{r, echo=FALSE, results='hide'}
tbl(src = con,         #  the source if the database connection profile
    c("education"))      #  the name of the table to preview
tbl(src = con, "poverty")
tbl(src = con, "unemployment")
tbl(src = con, "election")
```
### Data Preparation - Presidential Election Data

When preparing the first data set, Presidential Election Data, it must fit the following requirements:
  1.only use the 2020 election data
  2.only keep the data for the two major parties: Democrats and Republicans
  3.aggregate the total votes and keep the winning party in the data
  4.county FIPS code, State name, county name, total votes received in the winning party, and the name of the party.
  
```{sql, connection = "con", output.var = "election_new"}
SELECT 
    county_fips AS fips_code,
    state,
    county_name,
    SUM(totalvotes) AS total_votes,
    CASE 
        WHEN SUM(CASE WHEN party = 'DEMOCRAT' THEN candidatevotes ELSE 0 END) > 
             SUM(CASE WHEN party = 'REPUBLICAN' THEN candidatevotes ELSE 0 END) 
        THEN 'DEMOCRAT'
        ELSE 'REPUBLICAN'
    END AS winning_party
FROM 
    election
WHERE 
    year = 2020 
    AND party IN ('DEMOCRAT', 'REPUBLICAN')
GROUP BY 
    county_fips, state, county_name;

```
### Data Preparation - Unemployment Data
When preparing the second data set, Unemployment Data, it must fit the following requirements:
  1.Only keep the unemployment date in the year 2020 or the most recent year if the 2020 employment rate is unavailable
  2.County FIPS code
  3.Unemployment rate
  
I renamed 'FIPS_Code' to 'fips_code' for data consistency. Value was also renamed to unemployment_rate for clarity when it comes to merging datasets. The substring function was used on the 'Attribute' varible in order to extract the year 2020. This was aptly named 'year'.
```{sql, connection = "con", output.var = "unemployment_new"}
SELECT 
    FIPS_Code AS fips_code,
    Value AS unemployment_rate, 
    SUBSTR(Attribute, -4) AS year
FROM 
    unemployment
WHERE 
    year = "2020"

```
  
### Data Preparation - Poverty Data
When preparing the third data set, Poverty Data, it must fit the following requirements:
  1.Keep only 2019 poverty rate
  2.County FIPS code

I renamed 'FIPStxt' to 'fips_code' for data consistancy. Value was also renamed to poverty_rate for clarity when it comes to merging datasets.  
```{sql, connection = "con", output.var = "poverty_new"}
SELECT 
    FIPStxt AS fips_code,
    Value as poverty_rate
FROM 
    poverty
WHERE 
    Attribute = "PCTPOVALL_2019";
```

### Data Preparation - Education Data
When preparing the final data set, Education Data, it must fit the following requirements:
  1.only keep the percentage of education levels between 2015 and 2019.
      a.Education levels
      b.less than a high school diploma
      c.high school diploma
      d.completed some college (1-3 years)
      e.completed four years of college
  2.County FIPS code

I renamed the variables above to 'fips_code' and replaced all "." with "_" in order to run a sql query.
```{sql, connection = "con", output.var = "education_new"}
SELECT 
    fips_code,
    Percent_of_adults_with_less_than_a_high_school_diploma__2015_19,
    Percent_of_adults_with_a_high_school_diploma_only__2015_19,
    Percent_of_adults_completing_some_college_or_associate_s_degree__2015_19,
    Percent_of_adults_with_a_bachelor_s_degree_or_higher__2015_19
FROM 
    education
GROUP BY 
    fips_code;
```
### Merge Datasets
Here I left joined all datasets together. Then, using the dplyr library, I removed all duplicates based on the fips_code and any missing observations. The final dataset is called fips_202_elections. This dataset has 3113 observations consisting of 12 variables of interest.
```{r, echo=FALSE, results='hide'}
# Load dplyr for data manipulation
library(dplyr)

# Merge datasets
combined_data <- election_new %>%
    left_join(unemployment_new, by = "fips_code") %>%
    left_join(poverty_new, by = "fips_code") %>%
    left_join(education_new, by = "fips_code")
#remove any duplicates
fips_2020_election <-combined_data %>%
  distinct(fips_code, .keep_all = TRUE)%>%
  drop_na()

# View the final combined dataset
head(fips_2020_election)
write_csv(fips_2020_election, "C:/Users/jessg/OneDrive - West Chester University of PA/sta551/fips_2020_election.csv")
#write.csv(fips_2020_election, file = "fips_2020_election.csv", row.names = FALSE)
```

# Exploratory Data Anaylsis

## Univariate Analysis
The first EDA I want to do is a univariate analysis of the variable "winning_party" Below is pie chart that shows the distribution of winning parties.


```{r, echo=FALSE}
freq.tbl = table(fips_2020_election$winning_party)
pie(freq.tbl, xlab="Winning Party", ylab = "Counts", main="Distribution of Winning Party in the 2020 Election")

```
From this chart we can see that the republican party has a higher amount of winning counties than the democratic party. However, in the electoral college, it is not how many counties your party wins, but how many delegates belong to that county. In 2020 the Democratic party won the presidency, this was done by winning fewer but larger counties with more delegates.

## Bivariate Anaylsis
In looking at two continous varibles "poverty_rate" and "unemployment_rate" the best way to visualize if there is any association is through a scatterplot. The plot can be seen below.


```{r, echo=FALSE}
#plot(fips_2020_election$poverty_rate, fips_2020_election$unemployment_rate)

plot(fips_2020_election$poverty_rate, fips_2020_election$unemployment_rate,
     main = "Association between Poverty Rate and Unemployment Rate",     # Title
     xlab = "Poverty Rate",             # X-axis label
     ylab = "Unemployment Rate",             # Y-axis label
     pch = 19,                          # Point type
     col = "blue")                     # Point color
```


Here we can see that there does not seem to be an association between poverty rate and unemployment rate in the 2020 presidential election. A statistical test to confirm this would be Pearson's Correlation test.
## Pairwise comparisons

A pairwise comparisons was conducted for adults and various educational levels. The graphic can be seen below.

```{r, echo=FALSE}
library(GGally)
ggpairs(fips_2020_election, columns = 9:12, aes(color=winning_party, alpha = 0.5),
        lower = list(continuous = "smooth"))
```
# Results and Discussions
In this project four datasets were queried in a SQLite database and merged into one dataframe in order to perform Exploratory Data Analysis. When looking at the distribution of winning parties, we see that the Republication Party won a larger amount of counties compared to the Democratic Party. In a Bivariate analysis looking for an association between poverty rate and unemployment rate, none was found. Finally a pairwise analysis was conducted looking at correlations between party and educational levels. Here it can be seen higher education had a negative correlation both parties, except for republicans with a bachelor degree or higher. This can be due to where colleges are located in specific counties or job opportunists for professionals.
\














