Introduction

Do state populations bear any relationship to state voter turnouts and party selections in presidential elections? What can we learn about voter turnout and state party preference based on state populations? I intend to answer these questions in my exploration of state population data from 1900 through 2016 in comparison to state voting percentages and party selection from 1900 through 2016. I will use census data and federal elections data to do the analysis. This data is available on the web. I will download, clean, transform, analyze, and visualize the data using R and R Markdown.

Gather Data

Gathering the data was quite difficult and required a great deal of code due to the many source files from the Census Bureau and the many slightly varying webpages containing Presidential election data at the national and state levels. Consequently, I have placed that code in a separate file, located here: GitHub RPubs.

Clean/Transform Data

The code referenced in the above section output many different .csv files to my local computer, which I have placed in GitHub. After joining similar files together, I have these 5 master files to use in the following analyses:

  1. state_population_1900_2015.csv: Every state’s population from 1900 through 2015

  2. national_population_1900_2016.csv: The United States' national population from 1900 through 2016.

  3. state_header_df.csv: The year and Presidential candidate names for each presidential election 1824 - 2016. This matches with the file summary_state_df.csv.

  4. summary_state_df.csv: The year and total votes cast by state for each Presidential candidate from 1824 - 2016. This matches with the file state_header_df.csv.

  5. summary_df.csv: The year, party, Presidential Candidate, Vice Presidential Candidate, Electoral Votes, Electoral Percentage, Popular Votes, and Popular Vote Percentage for each Presidential candidate from 1789 - 2016.

Verify Data

Before moving on, let’s verify the data that we have.

We have the state populations for each year 1900-2015. In 2015, CA has the largest state population at 39144818, followed by TX at 27469114 and FL at 20271272. The smallest populations are WY (586107) and VT (626042), followed by DC (672228, included in the data although it is not a state) and AK (738432).

#state population
kable(head(state_population))
State Year Population
AL 1900 1830000
AR 1900 1314000
AZ 1900 124000
CA 1900 1490000
CO 1900 543000
CT 1900 910000
state_population %>% subset(Year == 2015) %>% arrange(desc(Population)) %>% head() %>% kable()
State Year Population
CA 2015 39144818
TX 2015 27469114
FL 2015 20271272
NY 2015 19795791
IL 2015 12859995
PA 2015 12802503
state_population %>% subset(Year == 2015) %>% arrange(desc(Population)) %>% tail() %>% kable()
State Year Population
46 SD 2015 858469
47 ND 2015 756927
48 AK 2015 738432
49 DC 2015 672228
50 VT 2015 626042
51 WY 2015 586107

We confirm that we have the national population from 1900-2016. The US population in 2016 was almost 324 million. Also shown below are the heads of the national election and state election files.

#national population
kable(head(arrange(national_population, desc(Population))))
Year Population
2016 323889854
2015 321418820
2014 318907401
2013 316427395
2012 314102623
2011 311718857
#national election
kable(head(national_election))
Year Party President VicePresident ElectoralVote ElectoralPerc PopularVote PopularPerc
1789 unofficially Federalist George Washington NA 69 100.0 NA NA
1792 unofficially Federalist George Washington NA 132 100.0 NA NA
1796 Federalist John Adams NA 71 51.1 NA NA
1796 Democratic-Republican Thomas Jefferson NA 68 48.9 NA NA
1800 Democratic-Republican Thomas Jefferson NA 73 52.9 NA NA
1800 Federalist John Adams NA 65 47.1 NA NA
#state election
kable(head(state_election_data))
State Year TotalPopularVote Candidate1PopVotes Candidate1PVPerc Candidate1EV Candidate2PopVotes Candidate2PVPerc Candidate2EV Candidate3PopVotes Candidate3PVPerc Candidate3EV Candidate4PopVotes Candidate4PVPerc Candidate4EV
2 AL 1824 13603 2422 17.8 NA 9429 69.3 5 96 0.7 NA 1656 12.2 NA
3 AL 1828 18618 16736 89.9 5 1878 10.1 NA NA NA NA NA NA NA
4 AL 1832 14291 14286 100.0 7 5 0.0 NA NA NA NA NA NA NA
5 AL 1836 37296 20638 55.3 7 NA NA NA 16658 44.7 NA NA NA NA
6 AL 1840 62511 28515 45.6 NA 33996 54.4 7 NA NA NA NA NA NA
7 AL 1844 63403 37401 59.0 9 26002 41.0 NA NA NA NA NA NA NA
kable(head(state_election_header))
Year Candidate1 Candidate2 Candidate3 Candidate4
1824 ADAMS JACKSON CLAY CRAWFORD
1828 ANDREW JACKSON JOHN Q. ADAMS NA NA
1832 ANDREW JACKSON HENRY CLAY WILLIAM WIRT NA
1836 MARTIN VAN BUREN WILLIAM HENRY HARRISON HUGH LAWSON WHITE DANIEL WEBSTER
1840 WILLIAM H. HARRISON MARTIN VAN BUREN NA NA
1844 JAMES K. POLK HENRY CLAY NA NA

Exploratory Data Analysis

Population

As visible below, the United States population has steadily increased since 1900. Growth slowed slightly during WWI and WWII, and each slowdown was followed by a boom. Since then, the growth rate has been fairly consistent. (Note: there is a slight jump between 1999 and 2000 due to a change in the data source.)

At the state level, we can see that several states have very large and growing populations (CA, FL, TX) while others appear to have plateaued (NY, DC, WV). This is easier to see in the second plot, which frees the y axis to the appropriate scale for each state. We can also see periods of rapid change. ND in particular is interesting in that it experienced rapid growth, rapid decline, slow growth, and then recently, extremely rapid growth (due to the oil fracking industry).

What about state population as a ratio to the national population? We can see that several states have increased over time (CA, TX, FL) while others have decreased (NY, PA, IL). This again is easier to see when the y axis is free, which reveals many very interesting population changes in states.
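The ratio described above is each state's population divided by the national population for the same year. The following is a minimal sketch of that calculation, not the original code: the 2015 figures come from the tables in the Verify Data section, while the column names StatePopulation, NationalPopulation, and PopulationRatio are borrowed from later output and their use here is an assumption.

```r
# 2015 figures copied from the Verify Data tables.
state_population <- data.frame(
  State = c("CA", "TX", "WY"),
  Year = 2015,
  StatePopulation = c(39144818, 27469114, 586107)
)
national_population <- data.frame(
  Year = 2015,
  NationalPopulation = 321418820
)

# Attach the national total to every state row for the same year.
pop <- merge(state_population, national_population, by = "Year")

# Share of the national population living in each state.
pop$PopulationRatio <- pop$StatePopulation / pop$NationalPopulation

print(pop[order(-pop$PopulationRatio), ])
```

With these inputs, CA holds roughly 12% of the national population and WY well under 1%, which is the contrast the free-axis plot brings out.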

We can visualize the above information differently using a map of the United States. This first map shows the change in absolute population for each state from 1900-2016. The map starts dark in 1900, but various states begin to emerge as their populations increase. Early on we can see NY, PA, IL, and TX emerging. CA begins to be noticeable in the 1920s. FL becomes more noticeable in the 1980s. Today, we can see that CA, TX, FL, NY, and IL have large populations.

In this second map, we visualize the population ratio, that is, the ratio of each state's population to the national population. As noted above, we can see that northeastern states (such as NY and PA) and IL start off having the largest percentage of US residents. But very early on we can see the rising importance of CA and TX. CA has had the largest proportion of US residents since the 1970s.

Voting

What about voting trends? At the national level since 1789, the maximum electoral percentage for any candidate in a given voting year has jumped around wildly. But a very loose trend (as given by geom_smooth) suggests that voters were more unified at the founding, with a decline bottoming out right before the Civil War. The electoral percentage for the winning candidate then increased until about 1950, when politics began to get more divisive again. This brings us to the most recent election, which was also very divisive.

Notice that the only dot below 50% occurred in 1824, in a contest between Andrew Jackson and John Quincy Adams. While Jackson got more electoral votes, the election went to the House of Representatives for a vote because no candidate got a majority. Adams was then elected President by the House.

The popular vote percentage tells a similar story, though a less extreme one, because popular vote percentages vary less than electoral vote percentages. The highest winning popular vote percentage was 61.1%, received by Lyndon B. Johnson in 1964. The lowest was in 1860, when Abraham Lincoln received only 39.9% of the popular vote. The two tables below list the six highest and the six lowest values.

Year max(PopularPerc)
1964 61.1
1936 60.8
1972 60.7
1920 60.3
1984 58.8
1928 58.2
Year max(PopularPerc)
1860 39.9
1824 41.3
1912 41.8
1992 43.0
1968 43.4
1856 45.3

How does the national population relate to the national popular vote? As you can see in the graphs below, the relationship is very linear. This is not especially surprising, for as the population increases, so does the voting population. What is more revealing is that the ratio of popular vote to population is increasing over time. The trend is not perfect (people voted less prior to and during WWI), but overall we see an increase in the voting rate that is slowing over time. In other words, a larger share of the population is voting in each election, but the increase is getting smaller and smaller. Perhaps the ratio will approach 45% without ever exceeding it.

Why might the voting ratio be increasing over time? We know that the population is increasing over time, as is the population density. Below I calculate the population density. You can see from the simple linear models that Year is a better predictor of the ratio (R^2 of 0.75), but PopulationDensity is fairly good too (R^2 of 0.65). We will come back to that later.

## 
## Call:
## lm(formula = Ratio ~ Year, data = national)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.075665 -0.032212  0.000921  0.033547  0.085421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.7913331  0.4502621  -8.420 3.70e-09 ***
## Year         0.0021037  0.0002299   9.149 6.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0436 on 28 degrees of freedom
##   (115 observations deleted due to missingness)
## Multiple R-squared:  0.7494, Adjusted R-squared:  0.7404 
## F-statistic: 83.71 on 1 and 28 DF,  p-value: 6.594e-10
## 
## Call:
## lm(formula = Ratio ~ PopulationDensity, data = national)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.091947 -0.040616  0.008271  0.032232  0.094746 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.1609169  0.0249159   6.458 5.38e-07 ***
## PopulationDensity 0.0026782  0.0003706   7.227 7.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05145 on 28 degrees of freedom
##   (115 observations deleted due to missingness)
## Multiple R-squared:  0.651,  Adjusted R-squared:  0.6385 
## F-statistic: 52.23 on 1 and 28 DF,  p-value: 7.246e-08

At the state level, we can also view trends in voting percentages. The plot below shows that most states have an increasing ratio of popular vote to state population over time, but that the increase has slowed in recent years or plateaued. Others do not, however. CA appears to have peaked at 47% in 1940 and has declined since. HI has been relatively flat at 32% since it became a state. Overall, it looks like southern states have continued to increase in their voting ratios while northern and western states have not or have slowed dramatically. I would suspect that the Civil War and the civil rights movement can explain much of the initial low turnout and the subsequent increase in the south.

Maine has had the highest ratio at 56% in 2004. At present, MN has the highest at 54%. The lowest historically was SC in 1924 at 3%. In 2012, the lowest was TX at 31%.

State Year SumPopularVote StatePopulation Ratio NationalPopulation PopulationRatio
ME 2004 740752 1313688 0.5638721 292805298 0.0044866
ME 2016 747927 1330272 0.5622361 323889854 0.0041072
NH 2016 744296 1332924 0.5583935 323889854 0.0041154
MN 2004 2828387 5087713 0.5559250 292805298 0.0173758
MN 2008 2910369 5247018 0.5546711 304093966 0.0172546
ME 1992 679499 1235748 0.5498686 255029699 0.0048455
State Year SumPopularVote StatePopulation Ratio NationalPopulation PopulationRatio
CA 2016 0 39509088.1 0.0000000 323889854 0.1219831
IN 2016 0 6647711.7 0.0000000 323889854 0.0205246
MN 2016 0 5527177.9 0.0000000 323889854 0.0170650
MT 2016 0 1040420.8 0.0000000 323889854 0.0032123
SD 2016 0 869563.7 0.0000000 323889854 0.0026848
SC 1924 50755 1710000.0 0.0296813 114109000 0.0149857
State Year SumPopularVote StatePopulation Ratio NationalPopulation PopulationRatio
TX 2012 7993851 26089741 0.3063983 314102623 0.0830612
HI 2012 434697 1392641 0.3121386 314102623 0.0044337
CA 2012 13038547 38056055 0.3426143 314102623 0.1211580
OK 2012 1334872 3817679 0.3496554 314102623 0.0121542
AZ 2012 2299254 6553262 0.3508564 314102623 0.0208634
UT 2012 1017440 2856343 0.3562037 314102623 0.0090937
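The Ratio column in the tables above is the popular vote total divided by the state population. Below is a minimal sketch of that calculation using a few rows copied from those tables. Recoding a zero vote total as missing is an assumption added here for illustration; the original output keeps these rows as 0.

```r
# Rows copied from the tables above (CA 2016 had no published total yet).
turnout <- data.frame(
  State = c("ME", "TX", "SC", "CA"),
  Year = c(2004, 2012, 1924, 2016),
  SumPopularVote = c(740752, 7993851, 50755, 0),
  StatePopulation = c(1313688, 26089741, 1710000, 39509088.1)
)

# Assumption: a total of 0 means "not yet published", so treat it as missing
# instead of letting it rank as the lowest turnout on record.
turnout$SumPopularVote[turnout$SumPopularVote == 0] <- NA

# Votes cast per resident (total population, not eligible voters).
turnout$Ratio <- turnout$SumPopularVote / turnout$StatePopulation

# Highest ratio first; missing values sort to the end.
print(turnout[order(-turnout$Ratio), ])
```

Because the denominator is total population, including children and non-citizens, this ratio is lower than turnout measured against eligible voters.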

We can also visualize this information using a map of the United States. As mentioned above, it looks like most states but particularly the south had little voter turnout before 1920. After 1920, the south continued to have low voter turnout until the 1960s. Since then, the south has come to have voter turnout similar to all other states. (Note: data for the 2016 election was not fully published by several states as of the final data pull on 12/15/16. Consequently, these states have a “0%” voter turnout ratio.)

What about candidates and parties? We use the below code to find the winning candidate and party for each state for each voting year from 1824 - 2016.
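The original code is not reproduced here. As a minimal sketch of one way to find the winning candidate, the example below picks the candidate column with the most popular votes in each state and year, then looks up the name in the header table. The two AL rows and the column names come from the state election tables in the Verify Data section; the object names votes and header are hypothetical, and mapping each winner to a party (via the national election file) is left out.

```r
# Two AL rows from the state election table.
votes <- data.frame(
  State = c("AL", "AL"),
  Year = c(1824, 1828),
  Candidate1PopVotes = c(2422, 16736),
  Candidate2PopVotes = c(9429, 1878),
  Candidate3PopVotes = c(96, NA),
  Candidate4PopVotes = c(1656, NA)
)

# Matching rows from the header table.
header <- data.frame(
  Year = c(1824, 1828),
  Candidate1 = c("ADAMS", "ANDREW JACKSON"),
  Candidate2 = c("JACKSON", "JOHN Q. ADAMS"),
  Candidate3 = c("CLAY", NA),
  Candidate4 = c("CRAWFORD", NA),
  stringsAsFactors = FALSE
)

vote_cols <- paste0("Candidate", 1:4, "PopVotes")
name_cols <- paste0("Candidate", 1:4)

# which.max() skips NA values, so candidates with no votes are ignored.
# Rows where every vote column is NA would need to be dropped beforehand.
winner_idx <- apply(votes[, vote_cols], 1, which.max)

# Look up the winner's name: row = election year, column = winning slot.
header_row <- match(votes$Year, header$Year)
winners <- votes[, c("State", "Year")]
winners$Winner <- as.matrix(header[, name_cols])[cbind(header_row, winner_idx)]

print(winners)
```

This sketch ranks candidates by popular vote; ranking by the electoral vote columns (Candidate1EV and so on) would be an alternative definition of a state's winner.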

We can visualize the information using the mapping capabilities from ggplot2. We produce a map of the United States for each voting year 1824 - 2016 and color each state with the winning party. The Democratic party is in blue while the Republican party is in red. Other parties are in various colors.

There is a lot of interesting information here. We see in 1860 that the south voted as a bloc for the “Southern Democratic” party. Abraham Lincoln won the election, and we can see that in 1864 none of the south voted (this was during the Civil War). It’s also interesting to observe that the “Democratic” party used to be the conservative, states’ rights party of the south. From 1876 through about 1960, Texas and other deep south states were typically Democratic even when most of the other states were Republican. After 1960, the south transitioned to becoming consistently Republican by 1980. We can also see the emergence of a Democratic west coast, New England, and Great Lakes region vs. a Republican midwest and south beginning in 1988. This is the pattern we have at present.

While I made the above judgement about when “Democratic” and “Republican” switched in meaning using a visual analysis, can we mathematically determine when this changed? The main assumption we have to make is that the states that started off as more national and liberal (in our modern sense) have stayed that way. That is, they started off as “Republican” or “Whig”, and at some point, when the meaning of “Republican” changed relative to “Democrat”, these states became “Democratic” (in our present usage). Similarly, states that started off more state-oriented and conservative (in our modern sense) have also stayed that way. That is, they started off as “Democratic” or “Southern Democratic”, and as the meaning of “Democrat” changed relative to “Republican”, these states became “Republican” (in our modern sense).

My strategy then was to find the year at which a state switched from having a past of being “Democrat” to having a future of being “Republican”, and vice versa. I started with the year 1920, which is a year when the south is clearly Democratic in the old sense (conservative, state oriented) and the rest of the country is Republican in the old sense (liberal, national oriented), which contrasts with our present state/party orientation (south=“Republican”, most of the rest of the country = “Democrat”). I then assigned a 1 to each state for each year if the state was labeled as “Democratic” and a 0 if the state was labeled “Republican”.

I then found the year in which the running average over 10 elections for a Democratic (old sense) starting state switched from being above .5 to being below .5, indicating a switch to being “Republican” (current sense). Similarly, I found the year when a starting Republican (old sense) state switched from being below .5 to being above .5 in the running average over 10 elections, meaning it had switched to being “Democratic” (current sense). I averaged the years in which this switch occurred for all of the states in which a switch did occur. This was the estimated year in which the meaning of “Democrat” and “Republican” switched to be more similar to the other term’s previous meaning going forward.
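The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the function names are hypothetical, the two state histories are toy data, and the 10-election window is centered (4 elections back, 5 ahead), which is a guess about how the running average was aligned.

```r
# Running share of Democratic wins over a window of elections.
# Assumption: the window covers (window %/% 2 - 1) elections before the
# current one and the remainder after it; edges without a full window are NA.
rolling_share <- function(x, window = 10) {
  n <- length(x)
  back <- window %/% 2 - 1
  ahead <- window - 1 - back
  sapply(seq_len(n), function(i) {
    lo <- i - back
    hi <- i + ahead
    if (lo < 1 || hi > n) NA_real_ else mean(x[lo:hi])
  })
}

# First election year in which the running share crosses 0.5,
# in the direction away from where the state started.
switch_year <- function(years, dem, window = 10) {
  share <- rolling_share(dem, window)
  valid <- which(!is.na(share))
  if (length(valid) == 0) return(NA_real_)
  starts_dem <- share[valid[1]] > 0.5
  crossed <- if (starts_dem) share < 0.5 else share > 0.5
  hit <- which(crossed)  # which() drops NA entries
  if (length(hit) == 0) NA_real_ else years[hit[1]]
}

years <- seq(1920, 2016, by = 4)

# Toy histories: 1 = Democratic winner, 0 = Republican winner.
state_a <- c(rep(1, 11), rep(0, 14))  # starts Democratic, later Republican
state_b <- c(rep(0, 18), rep(1, 7))   # starts Republican, later Democratic

switches <- c(
  state_a = switch_year(years, state_a),
  state_b = switch_year(years, state_b)
)
print(switches)

# Summaries across states, skipping states that never switched.
print(mean(switches, na.rm = TRUE))
print(median(switches, na.rm = TRUE))
```

The alignment of the window matters: a trailing window (the current election plus the previous nine) would date the same switch several elections later than a centered one.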

Which year did I calculate? The average was 1966.211, and the median was 1964. This is consistent with my above visual conclusion, and it makes sense when we look at the map. Before 1964, the southern states are regularly, although not always, Democratic as a group, even when the rest of the country is Republican (in 1948, they were the “States’ Rights” party). In 1964, the rest of the country is “Democratic”, but the southern states as a group are “Republican” for the first time. In 1968, the south votes for the “American Independent” party, but afterwards, apart from 1976, the south as a group regularly goes “Republican”.

Before moving on, consider one last look at the map. In particular, compare 1900 to 2012 (see below map). Notice that they are almost exactly flipped: red states are now blue, and blue states are now red. Now compare 2012 to 2016, and notice that many New England and Great Lakes states have switched from blue to red. Could this be the start of another significant meaning change in the terms “Republican” and “Democrat”? We shall have to wait and see.

Predicting Voter Turnout and Party

Voter Turnout

Can we predict voter turnout? Let’s start with a state’s population ratio. From the graph below, we can see that there isn’t really any relationship.

What about population density? In order to calculate population density, we need to read in the square miles in each state. This is done below.

State Land sq miles
4 Alaska 570374
5 Texas 261914
6 California 155973
7 Montana 145556
8 New Mexico 121365
9 Arizona 113642

Now that we have the square miles of land per state, we can calculate the population density per state per year.
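Population density here is population divided by land area in square miles. The sketch below illustrates the calculation with three land areas from the table above and the matching 2015 populations from the Verify Data section. The column names LandSqMiles and Abb are hypothetical, and using R's built-in state.name and state.abb vectors to convert full names to abbreviations is an assumption about how the two files might be joined.

```r
# Land areas from the table above (full state names).
state_land <- data.frame(
  State = c("Alaska", "Texas", "California"),
  LandSqMiles = c(570374, 261914, 155973),
  stringsAsFactors = FALSE
)

# Convert full names to postal abbreviations using R's built-in vectors.
# DC is not in state.name and would need to be handled separately.
state_land$Abb <- state.abb[match(state_land$State, state.name)]

# 2015 populations from the Verify Data tables (abbreviated state names).
pop_2015 <- data.frame(
  State = c("CA", "TX", "AK"),
  Year = 2015,
  StatePopulation = c(39144818, 27469114, 738432)
)

# Join population to land area, then compute residents per square mile.
density <- merge(pop_2015, state_land[, c("Abb", "LandSqMiles")],
                 by.x = "State", by.y = "Abb")
density$StatePopulationDensity <- density$StatePopulation / density$LandSqMiles

print(density[order(-density$StatePopulationDensity), ])
```

Land area is treated as constant across years, so within a state the density series is simply the population series rescaled.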

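Does the state population density correlate with the voter turnout ratio? As the plot below shows, there doesn’t seem to be much of a relationship. We get an R^2 value of 0.04, meaning that density explains only about 4% of the variance in turnout, even though the coefficient is statistically significant.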

## 
## Call:
## lm(formula = Ratio ~ StatePopulationDensity, data = state)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35744 -0.06098  0.02820  0.08471  0.23253 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.261e-01  3.670e-03  88.840  < 2e-16 ***
## StatePopulationDensity 1.238e-04  1.557e-05   7.951 3.67e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1174 on 1460 degrees of freedom
##   (5018 observations deleted due to missingness)
## Multiple R-squared:  0.0415, Adjusted R-squared:  0.04085 
## F-statistic: 63.22 on 1 and 1460 DF,  p-value: 3.674e-15

What if we look at each state individually? From the plot below, we can see that, for each state, as population density increases relative to the state, so does voter turnout. But this is problematic since then we can’t use population density as a common variable to predict voter turnout.

What then do the states have in common that could account for an increase in voter turnout? The commonality is the passage of time. In the plot below, we use the Year to predict voter turnout for each state. As you can see, the results for each state are fairly similar and approximately linear (although we do see a bit of a curve using the red geom_smooth line). An aggregated view is also included. R^2 using Year to predict voter turnout ratio is 0.36.

## 
## Call:
## lm(formula = Ratio ~ Year, data = state)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45967 -0.05768  0.00548  0.06830  0.23138 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.718e+00  1.420e-01  -26.18   <2e-16 ***
## Year         2.072e-03  7.246e-05   28.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09597 on 1474 degrees of freedom
##   (5004 observations deleted due to missingness)
## Multiple R-squared:  0.3568, Adjusted R-squared:  0.3564 
## F-statistic: 817.7 on 1 and 1474 DF,  p-value: < 2.2e-16

Party Prediction

Can we predict the winning party in any state using the population variables we have looked at? I subset the data to only look at elections from 1970 onward so as to ensure that “Democratic” and “Republican” have roughly the same meaning as today.

In the first box plot comparing Party to the state population, we can see that there isn’t much separation in the parties. The same is true in the second box plot that uses population ratio. However, the third box plot which uses population density does provide some noticeable separation.

If we subset the data to only include elections from 1990 onward, we can see a similar but more pronounced pattern to what we saw above. Both population and population ratio do not matter much, but population density matters a great deal.

Since population density seems to be a good predictor of party outcome, we can use this in a model. Using all data from 1970 onward, we get an R^2 of 0.11. For 1990 onward, 0.18. For 2000 onward, 0.21.

## 
## Call:
## lm(formula = Republican ~ StatePopulationDensity, data = state_population_party_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7513 -0.5690  0.2571  0.3042  0.9452 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7.551e-01  2.282e-02  33.086   <2e-16 ***
## StatePopulationDensity -6.736e-04  7.709e-05  -8.738   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4532 on 598 degrees of freedom
## Multiple R-squared:  0.1132, Adjusted R-squared:  0.1117 
## F-statistic: 76.36 on 1 and 598 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Republican ~ StatePopulationDensity, data = subset(state_population_party_sub, 
##     Year >= 1990))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6485 -0.5041  0.3476  0.3950  0.6656 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             6.533e-01  3.030e-02  21.563   <2e-16 ***
## StatePopulationDensity -8.403e-04  9.724e-05  -8.642   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.455 on 348 degrees of freedom
## Multiple R-squared:  0.1767, Adjusted R-squared:  0.1743 
## F-statistic: 74.68 on 1 and 348 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Republican ~ StatePopulationDensity, data = subset(state_population_party_sub, 
##     Year >= 2000))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7073 -0.5165  0.2877  0.3485  0.6210 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.7208320  0.0350185  20.584  < 2e-16 ***
## StatePopulationDensity -0.0009008  0.0001095  -8.225 1.11e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4429 on 248 degrees of freedom
## Multiple R-squared:  0.2143, Adjusted R-squared:  0.2111 
## F-statistic: 67.65 on 1 and 248 DF,  p-value: 1.106e-14
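To make the coefficients above concrete, the sketch below plugs a few density values into the three fitted lines. The intercepts and slopes are copied from the model output; the function name fitted_share and the chosen densities (100 and 1000 people per square mile) are illustrative assumptions. Since Republican is a 0/1 outcome, each fitted value can be read roughly as the estimated chance that a state at that density votes Republican.

```r
# Coefficients copied from the three lm() summaries above.
models <- data.frame(
  Subset = c("1970 onward", "1990 onward", "2000 onward"),
  Intercept = c(0.7551, 0.6533, 0.7208320),
  Slope = c(-6.736e-04, -8.403e-04, -0.0009008)
)

# Fitted value of a simple linear model at a given density.
fitted_share <- function(density, intercept, slope) {
  intercept + slope * density
}

# Fitted Republican share at a low and a high density.
models$At100 <- fitted_share(100, models$Intercept, models$Slope)
models$At1000 <- fitted_share(1000, models$Intercept, models$Slope)

# Density at which the fitted value equals 0.5 (an even split).
models$BreakEven <- (0.5 - models$Intercept) / models$Slope

print(models)
```

At 1000 people per square mile the 1990 and 2000 lines fall below zero, which is a known limitation of fitting a straight line to a 0/1 outcome; a logistic regression (glm with family = binomial) is a common alternative that keeps fitted values between 0 and 1.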

Conclusion

So what can we conclude? We can draw several conclusions about population and voting trends in the United States. The United States population as a whole is steadily increasing. At the state level, however, some states are growing both absolutely and proportionally while others are not.

With respect to voting, while our current political situation is rather divisive, this is not uncommon in the past. Divisive elections have swiftly been followed by relatively unified elections, and vice versa. The political map of the United States has changed often, and will almost certainly change again. So the current divisive situation should not lead us to despair.

What is encouraging is that the level of political participation through voting is higher than it has ever been at the national level. However, this increase does seem to be slowing down. In some individual states, it appears to have plateaued or decreased. Such places present opportunities to make a renewed effort to “get out the vote”.

Finally, population density seems to be a significant factor, especially more recently, in predicting party outcome. It should not surprise us that in less densely populated states, where people live more independently and traditionally and may have less need of government assistance and coordinated social planning, a Republican platform (less government, traditional values, less taxation) would be more appealing. As urbanization occurs, meeting the needs and demands of an increasingly diverse population requires more social cooperation and interaction, so it makes sense that a Democratic platform (social welfare, government programs and projects, heterogeneous values) would become more appealing.

This is certainly not the whole story (this would explain at most about 20% of the variance in party preference), but it is a reasonable interpretation of why population density aligns to a significant degree with party preference.