The objective of this project is to create a relational dataset out of four separate datasets about the United States. The four datasets provide 1) election data, 2) unemployment data, 3) poverty data, and 4) education data for counties across the United States corresponding to individual counties in the country, each identified by their unique government-assigned five-digit FIPS code.
To begin, we create the database where we will store each individual dataset and the eventual final relational dataset.
# read in the data
election <- read.csv("C:/Users/qinfa/OneDrive/Desktop/alice's laptop/school/STA551/countypresidential_election_2000-2020.csv")
education <- read.csv("C:/Users/qinfa/OneDrive/Desktop/alice's laptop/school/STA551/Education.csv")
unemployment <- read.csv("C:/Users/qinfa/OneDrive/Desktop/alice's laptop/school/STA551/Unemployment.csv")
poverty <- read.csv("C:/Users/qinfa/OneDrive/Desktop/alice's laptop/school/STA551/PovertyEstimates.csv")
#Create database
data <- dbConnect(drv = SQLite(),
dbname = ":memory:")
#store sample data in the database
dbWriteTable(conn = data,
name = "education",
value = education)
dbWriteTable(conn = data,
name = "election",
value = election)
dbWriteTable(conn = data,
name = "poverty",
value = poverty)
dbWriteTable(conn = data,
name = "unemployment",
value = unemployment)
#remove the local data from the environment
rm(education, election, poverty, unemployment)
The goal of the data integration portion of this project is to create a relational dataset that corresponds to the information we desire from the original four datasets and compile them into a single relational dataset with one observation for each county.
For the election data, we begin by subsetting the data for the 2020 election and 2 major parties.
SELECT *
FROM election
WHERE year = 2020 AND (party LIKE "REPUBLICAN" OR party LIKE "DEMOCRAT");
# add 2020 election data into sql database
dbWriteTable(conn = data,
name = "election2020",
value = election2020)
rm(election2020)
We proceed to aggregate the total votes for each party in order to find the winning party.
SELECT party, SUM(candidatevotes) AS totalcandidatevotes
FROM election2020
GROUP BY party;
The winning party was the Democratic party with 81264648 votes.
# add 2020 election data into sql database
dbWriteTable(conn = data,
name = "totalvotes2020",
value = totalvotes2020)
rm(totalvotes2020)
We proceed to add a variable for the total votes for the Democratic party as aggregated in the previous step.
SELECT *
FROM election2020
LEFT JOIN totalvotes2020 USING (party);
# add 2020 election data into sql database
dbWriteTable(conn = data,
name = "joinedelection",
value = joinedelection)
rm(joinedelection)
Our final subset of the election dataset includes the county name, county FIPS code, state name, political party (only keeping the Democratic electoral information in the dataset), and the total votes for the Democratic candidate in each county.
SELECT county_name, county_fips AS elec_FIP, state, party AS elec_party, candidatevotes AS elec_candidatevotes
FROM joinedelection
WHERE party LIKE "DEMOCRAT";
# add 2020 election data into sql database
dbWriteTable(conn = data,
name = "electionfinal",
value = electionfinal)
rm(electionfinal)
For the unemployment data, we begin by creating variables from the Attribute variable in the dataset to identify the year for each value and what the value is identified as. This is done by taking the last four characters of Attribute for Year and everything but the last six characters for the identifying attribute.
SELECT *
FROM unemployment;
## My SQL wasn't running any of the string functions I tried for some reason, so I did the data manipulation in R
unemployment <- unemployment %>% mutate(
year = str_sub(Attribute, start = -4),
att = str_sub(Attribute, end=-6)
)
unemployment$year <- as.numeric(unemployment$year)
dbWriteTable(conn = data,
name = "unemployment_att",
value = unemployment)
#remove the local data from the environment
rm(unemployment)
The latest record for unemployment rate for each county is found through finding the maximum value for year when grouped by the FIPS code.
SELECT FIPS_Code AS fip, MAX(year) AS max_year
FROM unemployment_att
WHERE att = 'Unemployment_rate'
GROUP BY FIPS_Code;
dbWriteTable(conn = data,
name = "unemployment_max_year",
value = unemployment_max_year)
#remove the local data from the environment
rm(unemployment_max_year)
Only the latest value for the unemployment rate for each year is kept through merging the collected maximum values of year with the original data. The variables we want to keep in the final dataset include the year of the unemployment rate, the value of the unemployment rate, and the county FIPS code.
SELECT FIPS_Code AS unemployment_FIP, year AS unemployment_year, Value AS Unemployment_Rate
FROM unemployment_att
RIGHT JOIN unemployment_max_year
ON unemployment_att.FIPS_Code = unemployment_max_year.fip
WHERE unemployment_att.att = 'Unemployment_rate' AND year = max_year;
dbWriteTable(conn = data,
name = "unemployment_final",
value = unemployment_final)
#remove the local data from the environment
rm(unemployment_final)
For the poverty data, we want to keep the FIPS code and the poverty rate, which are recorded in the variable Value for observations where Attribute = ‘PCTPOVALL_2019’. The variables are selected and renamed for clarity accordingly.
SELECT FIPStxt AS poverty_FIP, Value AS poverty_Rate
FROM poverty
WHERE Attribute = 'PCTPOVALL_2019';
dbWriteTable(conn = data,
name = "poverty_final",
value = poverty_final)
#remove the local data from the environment
rm(poverty_final)
The percentages of adults from 2015-2019 in each county who belong to one of four education levels are kept in the subset of the education data that will be used in the relational dataset as well as the county FIPS code. The variables were renamed for clarity, with the four education levels being 1) less than a high school diploma, 2) only a high school diploma, 3) some college education (1-3 years), or 4) 4+ years of college education.
SELECT `FIPS.Code` AS edu_FIP, `Percent.of.adults.with.less.than.a.high.school.diploma..2015.19` AS edu_lessthanhsdiploma, `Percent.of.adults.with.a.high.school.diploma.only..2015.19` AS edu_highschooldiplomapercent, `Percent.of.adults.completing.some.college.or.associate.s.degree..2015.19` AS edu_1to3yearscollegepercent, `Percent.of.adults.with.a.bachelor.s.degree.or.higher..2015.19` AS edu_fouryearscollegepercent
FROM education;
dbWriteTable(conn = data,
name = "education_final",
value = education_final)
#remove the local data from the environment
rm(education_final)
In order to create the relational dataset, the four subsetted datasets are joined together with full joins in order to preserve all of the information.
SELECT *
FROM education_final
FULL JOIN electionfinal
ON electionfinal.elec_FIP = education_final.edu_FIP
ORDER BY education_final.edu_FIP;
SELECT *
FROM poverty_final
FULL JOIN unemployment_final
ON unemployment_final.unemployment_FIP = poverty_final.poverty_FIP
ORDER BY unemployment_final.unemployment_FIP;
dbWriteTable(conn = data,
name = "join1",
value = join1, overwrite = TRUE)
#remove the local data from the environment
rm(join1)
dbWriteTable(conn = data,
name = "join2",
value = join2, overwrite = TRUE)
#remove the local data from the environment
rm(join2)
SELECT *
FROM join1
FULL JOIN join2
ON join1.edu_FIP = unemployment_FIP
ORDER BY edu_FIP, unemployment_FIP, poverty_FIP, elec_FIP;
dbWriteTable(conn = data,
name = "full_join",
value = full_join)
To begin with EDA, we take a look at the number of missing values for each of the variables in the dataset.
colSums(is.na(full_join))
edu_FIP edu_lessthanhsdiploma
40 50
edu_highschooldiplomapercent edu_1to3yearscollegepercent
50 50
edu_fouryearscollegepercent county_name
50 168
elec_FIP state
170 168
elec_party elec_candidatevotes
168 168
poverty_FIP poverty_Rate
130 130
unemployment_FIP unemployment_year
48 48
Unemployment_Rate
48
We can see that there are missing values in each column, with especially many in the variables that came from the election dataset. Not every county that had observations recorded in one dataset had information recorded for it in all of the other datasets. Still, with 5000 observations, even the highest rate of missing data is about 3%, which means that we have more than enough information to proceed with exploratory data analysis.
We examine the distribution of poverty rate through a histogram.
hist(full_join$poverty_Rate, breaks = 50, xlab = "Poverty Rate", ylab = "Count", main = "County Poverty Rates")
The poverty rate across counties is distributed unimodally with right skew. The mode is around 10-15%.
boxplot(full_join$poverty_Rate)
A boxplot shows a median of again, around 10-15% poverty rate, with 75%
of the counties in the dataset falling between 10 and 20 percent. No
outliers are identified in the left tail, but many are identified at the
right, matching the right skew shown in the histogram.
We continue our analysis of poverty rate by examining its relationship to one of the four recorded percentages for education level in a scatter plot.
p <- ggplot(data=full_join) + ggtitle("Percentage of people with less than a high school diploma vs. Poverty rate") + geom_point(aes(poverty_Rate, edu_lessthanhsdiploma))
ggplotly(p)
The scatterplot shows a moderately strong, relatively linear positive relationship between poverty rate and the percentage of individuals with less than a high school diploma. This indicates that generally speaking, in the dataset, the higher the poverty rate, the higher the percentage of individuals who were identified as having less than a high school diploma in the county. This generally follows the trend that we expect. We can also see a clear outlier with a poverty rate of roughly 14.2% but a percentage recorded of over 73.6% for those with less than a high school diploma. This would be worth examining, to see if this observation perhaps has a data entry error or some other discrepancy that may explain the disparity.
We continue to look at the relationships between poverty rate and the other education levels as well as unemployment rate through a pairwise comparison.
ggpairs(full_join, columns = c(2:5, 12, 15), title = "Pairwise Comparisons between Education Levels, Poverty Rate, and Unemployment Rate")
Many of the correlations were marked as significant, but the influence of sample size must be considered. Further, all four percentages for education level are interconnected, and therefore the pairwise comparisons comparing them are not particularly meaningful. We will focus on the scatterplots between poverty_Rate and the other variables and Unemployment_Rate and the other variables for this part of the project. We can see that all of the variables have unimodal distributions from the density plots, and that many of them feature right skew.
Both poverty rate and unemployment rate are positive correlated with percentage of people who have less than a high school diploma or only a high school diploma, but negatively correlated with percentage of people who have 1-3 or 4+ years of a college education. Poverty rate and unemployment rate also have a positive correlation with each other. All of this would follow our assumptions–having more money is well-known as a factor in whether or not people are able to pursue a higher education, and a higher education is often also associated with higher-paying jobs.
We created a relational dataset out of four individual datasets countaining county information in the United States about elections, unemployment, poverty, and education. We then proceeded to do exploratory analysis on the variable for poverty rate for each county and its relationship with other continuous variables in the dataset.