My name is Chris Woodward. I am a recent graduate of the Quantitative Methods in the Social Sciences master’s program at Columbia University looking to pivot towards a career in data science. Below are a few of the data visualization projects I have worked on in the R language and ggplot2. In addition to data visualization, I have training in machine learning and predictive analysis and am always looking to learn more. I can be contacted by email at woodwardchris5@gmail.com or by phone at 774-571-1515. Thank you.
The following is a plot of the 10 most successful countries at the Winter Olympics, updated through the 2018 Winter Games. Countries that competed under different designations in the past were merged with their present-day counterparts. Specifically, the Soviet Union, Unified Team (1992), and Olympic Athletes from Russia (2018) were merged with Russia. In addition, East Germany, West Germany, and the Unified Team of Germany were merged with Germany. Finally, Czechoslovakia was merged with the Czech Republic.
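The merging step can be sketched as a simple lookup table. The original work was done in R; the Python below is only an illustration, and the designation labels, the function names, and the toy medal counts in the usage note are assumptions rather than the actual data.

```python
# Hypothetical mapping from historical designations to present-day countries,
# following the merges described in the text. The exact labels used in the
# source data are an assumption.
MERGE_MAP = {
    "Soviet Union": "Russia",
    "Unified Team": "Russia",
    "Olympic Athletes from Russia": "Russia",
    "East Germany": "Germany",
    "West Germany": "Germany",
    "Unified Team of Germany": "Germany",
    "Czechoslovakia": "Czech Republic",
}


def merge_country(name):
    """Return the present-day country for a designation (unchanged if not listed)."""
    return MERGE_MAP.get(name, name)


def total_medals(rows):
    """Sum medal counts per merged country from (designation, medal_count) pairs."""
    totals = {}
    for name, count in rows:
        key = merge_country(name)
        totals[key] = totals.get(key, 0) + count
    return totals
```

For example, `total_medals([("East Germany", 2), ("Germany", 3), ("Norway", 1)])` groups the two German rows together and leaves Norway untouched.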
The following plot looks at how the 5 most successful countries – Germany, Norway, Russia, United States, and Austria – have attained their medals over time.
There are some interesting takeaways from this cumulative medal count plot. First, in the earlier years (1924 - 1980), Norway led the all-time medal count. However, by 1984, other countries had caught up. In addition, the plot shows that Russia/Soviet Union did not begin competing in the Winter Olympics until 1956. Despite starting from behind, Russia/Soviet Union had significant success right away and quickly climbed the ladder. Germany was relatively unimpressive in the early years of the Winter Olympics before having a lot of success in the 1970s and 1980s. In recent years, Germany has separated itself from the rest of the pack, with 408 total medals as of 2018. The United States fell a bit behind Germany, Russia/Soviet Union, and Norway in the 1980s and 1990s, competing more closely with Austria. More recently, however, the Americans have pulled ahead of Austria a bit. With the Norwegians’ success at the 2018 Winter Games (and many Russians not participating due to the doping controversy), Norway has overtaken Russia for second place in the all-time medal count.
The following plots look at the check-in behavior of users of the popular app Foursquare in the Bay Area region. This data captures check-ins between April 2012 and September 2013 and was collected by Dingqi Yang, Daqing Zhang, Longbiao Chen, and Bingqing Qu.
The following plots compare the distribution of check-ins at the 8 most popular venue categories between weekdays (defined as Monday through Friday) and weekends (defined as Saturday and Sunday). Check-ins have been binned by the hour of the day in which they occurred (for instance, a check-in occurring at 15:54:36 would be binned in the 3 PM hour). Since there are more weekdays than weekend days (and thus more total check-ins on weekdays), the plots have been normalized by dividing the counts (for each category across all hours of the day) by the number of days within each type of day (2 for weekend days, 5 for weekdays). As such, the resulting plots can be thought of as the “average” check-in behavior at these popular venue categories throughout a single day on weekdays and weekends.
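The binning and normalization described above can be sketched as follows. This is an illustrative Python version (the plots themselves were made in R), and the input shape, a list of `(timestamp, category)` pairs, along with the function names, is an assumption.

```python
from collections import Counter
from datetime import datetime

# Divisors taken from the text: 5 weekdays, 2 weekend days.
DAYS_PER_TYPE = {"weekday": 5, "weekend": 2}


def day_type(ts):
    """Classify a timestamp as 'weekday' (Mon-Fri) or 'weekend' (Sat-Sun)."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return "weekend" if ts.weekday() >= 5 else "weekday"


def normalized_hourly_counts(checkins):
    """Count check-ins per (day type, category, hour), divided by days per type.

    `checkins` is an iterable of (datetime, category) pairs (assumed shape).
    """
    counts = Counter((day_type(ts), cat, ts.hour) for ts, cat in checkins)
    return {key: n / DAYS_PER_TYPE[key[0]] for key, n in counts.items()}
```

A check-in at 15:54:36 on a Monday lands in the `("weekday", category, 15)` bin, matching the 3 PM example in the text.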
Looking at the weekday plot, check-ins in these categories peak in the morning, in the 7, 8, and 9 o’clock hours. Check-ins at coffee shops and offices make up the majority of these check-ins, with check-ins at train stations also popular. Visits to coffee shops generally plateau in the early afternoon before declining later in the afternoon and through the night. Check-ins at offices tend to dwindle in the early afternoon. Check-ins at bars tend to increase starting in the 5 o’clock hour.
Looking at the weekend plot, the distribution of check-ins is different. The peak of check-ins in these popular categories comes later in the day – around noon and early afternoon. Visits to coffee shops slowly increase as the morning goes on, peak in the 11 o’clock hour, and gradually decrease in the afternoon. This is noticeably different from the distribution of coffee shop visits on weekdays. Visits to grocery stores also seem to be popular in the afternoon on weekends. In addition, visits to bars start earlier in the day on weekends than on weekdays and, much like on weekdays, continue to increase late at night. American restaurants also appear to be very popular on weekends. Finally, check-ins at offices and train stations are not nearly as popular on weekends as on weekdays.
While these plots are very helpful in visualizing the temporal patterns of check-ins, they only consider the 8 most popular venue categories. In addition, the check-in behavior at specific venues is not considered. The following map plots the popularity of specific venues (as measured by total check-ins) at different parts of the day – morning, mid-day, and night. The size of the venue circle corresponds to the total number of check-ins during that part of the day. Parts of the day were categorized as follows:
- Morning: 5:00:00 AM - 10:59:59 AM
- Mid-Day: 11:00:00 AM - 4:59:59 PM
- Night: 5:00:00 PM - 2:00:00 AM
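The three time windows above can be expressed as a small classifier. This is a hedged Python sketch, not the original R code; the function name is hypothetical, and returning `None` for times after 2:00:00 AM and before 5:00:00 AM is an assumption, since the list does not assign those hours to any part of the day.

```python
from datetime import time


def part_of_day(t):
    """Map a time of day to 'Morning', 'Mid-Day', or 'Night' per the windows above."""
    if time(5, 0) <= t < time(11, 0):
        return "Morning"
    if time(11, 0) <= t < time(17, 0):
        return "Mid-Day"
    # Night wraps past midnight: 5:00:00 PM through 2:00:00 AM inclusive.
    if t >= time(17, 0) or t <= time(2, 0):
        return "Night"
    # Times between 2:00 AM and 5:00 AM are not covered by the list (assumption).
    return None
```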
To color the venue circles, venue categories were collapsed to the 8 categories at the highest hierarchical level, as defined by Foursquare. The map is interactive and allows the user to zoom in on certain areas to see how check-in behavior changes in certain places throughout the day. In addition, the venue circle can be clicked on to get information on the more specific venue category and the total number of check-ins during that part of the day.
The map plots all check-ins within the following coordinate bounds: (37.619 \(\leq\) Latitude \(\leq\) 37.885) and (-122.535 \(\leq\) Longitude \(\leq\) -122.195).
Focusing on downtown San Francisco, check-ins in the morning hours are largely dominated by the “Work/Home/Other” category. In addition, several “Travel” venues are very popular, most notably the 4th & King Street train station in Mission Bay. Shifting to the mid-day hours, the “Work/Home/Other” category begins to dwindle, and there is a noticeable uptick in the popularity of Food and Drink venues in the downtown area. In addition, shops in Union Square are extremely popular, as is the Ferry Building in The Embarcadero neighborhood. Finally, AT&T Park (stadium of the San Francisco Giants baseball team) in South Beach is the most popular venue in the mid-day hours. Shifting to the night-time hours, Nightlife venues gain significant popularity, particularly around Union Square. Shops in Union Square generally maintain their popularity, while The Metreon on the corner of 4th Street and Mission Street is noticeably more popular than it was in the middle of the day. Finally, AT&T Park continues to be the most popular venue.
This plot attempts to provide a very broad overview of how college basketball programs’ on-court success compares to their recruiting performance between 2006 and 2016. One would expect that a team’s on-court success is highly correlated with the team’s talent level. However, are there some teams that perform on the court well above their on-paper talent level, and are there some teams that perform well below their talent level? The plot includes all teams from the ACC, Big Ten, Big 12, SEC, Pac-12, and Big East, as well as some notable teams from other conferences.
To measure on-court success over the last decade, popular college basketball website and statistical archive KenPom.com was used. Each program’s KenPom adjusted efficiency margin (AdjEM) from each season was extracted, starting with the 2006-07 season and concluding with the 2016-17 season. The AdjEMs were simply grouped by team and then the average was taken. These averages were then ranked from highest to lowest to create an on-court performance rank seen on the vertical axis.
To measure recruiting rankings over this time, the recruiting rankings for all players to commit to each school were extracted from popular recruiting service 247Sports, starting with the class of 2005 and concluding with the class of 2016. The bottom 15% of players from each team (as measured by recruiting ranking) were dropped to mitigate the effect of outliers. The average player recruiting ranking was then taken for each program and then this average was sorted to create a recruiting ranking for all teams, as seen on the horizontal axis.
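The trimming step can be sketched as below. This is an illustrative Python version under stated assumptions: a larger ranking number means a lower-rated recruit, the number of players dropped is 15% of the roster rounded down, and the function name is hypothetical. The original analysis may have handled rounding differently.

```python
import math


def trimmed_mean_ranking(rankings, drop_fraction=0.15):
    """Average a program's player rankings after dropping its lowest-rated players.

    Assumes a larger ranking number means a lower-rated recruit and that the
    number dropped is rounded down (both are assumptions, not from the source).
    """
    if not rankings:
        raise ValueError("rankings must not be empty")
    ordered = sorted(rankings)  # best (smallest number) first
    n_drop = math.floor(len(ordered) * drop_fraction)
    kept = ordered[: len(ordered) - n_drop]
    return sum(kept) / len(kept)
```

With ten players ranked 10, 20, ..., 100, one player (the one ranked 100) is dropped and the remaining nine are averaged.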
The dashed line is where Recruiting Rank = On-Court Performance Rank, so the (vertical) distance between the center of the team’s logo and the dashed line is the ranking differential. As such, teams above the dashed line performed on the court above their recruiting ranking, and teams below the dashed line performed on the court below their recruiting ranking.
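As a rough illustration, the ranking differential can be computed by ranking teams on each measure and subtracting, so that a positive value means a team performed above its recruiting rank. The Python below is a sketch rather than the original R code; the input dictionaries, function names, and tie handling (ties broken by sort order) are assumptions.

```python
def rank_by(values, higher_is_better):
    """Return {team: rank}, with rank 1 as best; ties are broken by sort order."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {team: i + 1 for i, team in enumerate(ordered)}


def rank_differentials(avg_adjem, avg_recruit_ranking):
    """Recruiting rank minus on-court rank for each team.

    avg_adjem: {team: average adjusted efficiency margin}, higher is better.
    avg_recruit_ranking: {team: average player ranking}, lower is better.
    """
    on_court = rank_by(avg_adjem, higher_is_better=True)
    recruiting = rank_by(avg_recruit_ranking, higher_is_better=False)
    return {team: recruiting[team] - on_court[team] for team in on_court}
```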
The upper right is a bit crowded. Zooming in:
Top 10 Biggest “Over-Performers”
(Rank Differential = Recruiting Rank - On-Court Rank in parentheses)
1. Wichita State (+66)
2. BYU (+52)
3. Saint Mary’s (+51)
4. Butler (+46)
5. VCU (+44)
6. Northern Iowa (+43)
7. West Virginia (+42)
8. Kansas State (+40)
9. Wisconsin (+39)
10. Gonzaga (+36)
Top 10 Biggest “Under-Performers”
1. Rutgers (-94)
2. DePaul (-80)
3. Auburn (-66)
4. Wake Forest (-58)
5. Oregon State (-56)
6. UNC-Charlotte (-53)
T7. NC State (-50)
T7. Georgia Tech (-50)
9. Marshall (-47)
T10. LSU (-45)
T10. St. John’s (-45)
Once again, the plot attempts to provide only a very broad overview, as there are certainly limitations to the analysis. For instance, it was assumed that recruiting classes across years are equal in quality, which is obviously not the case. In addition, recruiting rankings are an ordinal variable, and as such, the gap in quality between adjacent rankings is not constant across the scale. There is obviously a large amount of subjectivity in recruiting rankings, as well. Furthermore, KenPom does not give extra weight to postseason games, so perhaps some teams that won championships during the time period are ranked slightly lower in on-court success than they should be. For instance, UConn won 2 NCAA Tournament championships during this time period (2011 and 2014) but is only ranked 24th in on-court success.
The following is a choropleth map of terrorism-related casualties in 2016 around the world. Please note that the natural logarithm of casualties was taken to improve variation in shading of colors, as the distribution is very skewed. The map allows the user to hover over countries and get a count of total terrorism-related casualties for each country. In addition, the map allows the user to zoom in on certain regions.
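The log transform used for shading can be sketched as follows. This is an illustrative Python version, not the original R code; the function name is hypothetical, and leaving out countries with zero casualties (where the logarithm is undefined) is an assumption about how such cases were handled.

```python
import math


def log_casualties(casualties_by_country):
    """Natural log of each country's casualty count, for use as a shading value.

    Countries with zero casualties are omitted because log(0) is undefined
    (an assumption; the original handling is not stated).
    """
    return {
        country: math.log(count)
        for country, count in casualties_by_country.items()
        if count > 0
    }
```

On this scale a count of 12,603 maps to about 9.44 and a count of 207 to about 5.33, so countries with far fewer casualties still receive visibly different shades.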
There are several notable takeaways from the world map. First, with over 26,000 casualties, Iraq endured by far the most terrorism-related casualties in 2016, followed by Afghanistan with 12,603. Several African countries, such as Nigeria and Somalia, experienced significant terrorism-related casualties in 2016. Relatively speaking, South America, with the exceptions of Colombia and Venezuela, was largely unaffected by terrorism. Several countries in West, South, and Southeast Asia experienced significant terrorism-related casualties, including Yemen, Pakistan, India, and the Philippines. Finally, looking at the United States, while the country did endure 207 casualties in 2016, there were several other areas of the world that suffered significantly more.
(Sorry for the excessive blank space. I believe there is a bug in the plotly package that prevents you from changing the relative length of the shading bar without having to significantly increase the overall plot size.)
While the choropleth map is interesting to look at to consider countries as a whole, it’s also obviously important to look at specific, unique terrorism incidents. The following map shows terrorism acts colored in layers by casualty count. In addition, the size of the circles corresponds to the total number of casualties (the more casualties, the larger the circle). The map allows the user to zoom in and out and click on specific circles to get further information about the incident.
The map is consistent with the choropleth map but provides further information about specific attacks. Many of the high-casualty incidents occurred in Iraq, Afghanistan, and different regions of Africa, which is unsurprising given what was seen in the choropleth map. However, there are of course some exceptions, including the tragic Orlando, Florida shooting that resulted in 103 casualties.
This analysis looks at the behavior of all 100 current U.S. senators on Twitter. Tweets from each senator’s official Twitter handle were scraped. Given API limits, the maximum number of past tweets for each senator was capped at 3,200. This analysis (and the scraping of the tweets) was conducted in early April 2018.
#Hashtag Wordcloud of Tweets Since Trump’s Inauguration
The data was first subsetted to include only tweets by senators since Trump’s inauguration on January 20, 2017. The following is a comparison wordcloud of the most popular hashtags (by total count) used by Republican and Democratic senators since Trump’s inauguration.
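The subsetting and hashtag counting can be sketched as below. This is an illustrative Python version rather than the original R code; the input shape of `(party, date, text)` tuples, the function name, and the lowercasing of hashtags are assumptions.

```python
import re
from collections import Counter
from datetime import date

INAUGURATION = date(2017, 1, 20)  # date given in the text
HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts_by_party(tweets, since=INAUGURATION):
    """Count hashtags per party for tweets dated on or after `since`.

    `tweets` is an iterable of (party, date, text) tuples (assumed shape).
    Hashtags are lowercased so that '#TaxReform' and '#taxreform' match.
    """
    counts = {}
    for party, tweet_date, text in tweets:
        if tweet_date < since:
            continue  # drop tweets from before the inauguration
        tags = [tag.lower() for tag in HASHTAG.findall(text)]
        counts.setdefault(party, Counter()).update(tags)
    return counts
```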
Some of the most common hashtags used by Republicans include “taxreform”, “jobs”, “obamacare”, and “sesta”, while some of the most common hashtags by Democrats include “trumpcare”, “netneutrality”, “dreamers”, and “goptaxscam”.
Parkland Shooting
Tweets by senators were filtered on the 24 hours following the tragic Parkland, Florida school shooting on February 14, 2018.
The following plot compares the most frequent bigrams (any two consecutive words) used in tweets by senators in the 24 hours following the shooting. Standard text mining techniques were used, including removing stop words (the, is, a, etc.).
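A minimal sketch of bigram counting, assuming stop words are removed before adjacent words are paired, is shown below. It is written in Python for illustration (the original analysis used R), and the tiny stop word list, the tokenizing pattern, and the function name are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}


def bigram_counts(texts, stop_words=STOP_WORDS):
    """Count pairs of consecutive words after lowercasing and dropping stop words."""
    counts = Counter()
    for text in texts:
        words = [
            w for w in re.findall(r"[a-z']+", text.lower())
            if w not in stop_words
        ]
        # Pair each remaining word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts
```

Because stop words are dropped first, a phrase like "enough is enough" yields the bigram `("enough", "enough")`.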
There appear to be some similarities and differences between the two parties. Both contain mentions of the first responders, as well as factual information such as the name of the school and the town the shooting took place in. The tweets from the Republican senators tend to focus on expressing sadness over the event and sending prayers to those affected. In the tweets from the Democratic senators, there seems to be more of a sense of anger and a call for policy change, with phrases such as “gun violence”, “yet another”, “take action”, “background checks”, and “enough [is] enough” among the most common bigrams from the Democrats.
Mentions Network
The following network takes a look at which senators any certain senator mentions in his/her tweets and to what extent. A mention was counted any time a senator used another senator’s Twitter handle in a tweet. Simple retweets of other senators were excluded, as the aim is more so to see who mentions whom in senators’ own tweets.
The color of the node corresponds to the senator’s party affiliation (red - Republican, blue - Democratic, purple - Independent). The color of the edge between nodes corresponds to whether the senators are from the same party or not. Edges colored red correspond to mentions between senators both from the Republican party. Edges colored blue correspond to mentions between senators both from the Democratic party. Mentions between senators of different parties are colored in purple. The only exception is that mentions between the two Independent senators (Bernie Sanders and Angus King) are also colored purple, even though they are both Independent. The width of the edges corresponds to the frequency that a senator mentions another particular senator in his/her tweets (the more mentions, the wider the edge).
The visualization allows the user to select any of the 100 senators to see which senators the selected senator tweets at, and whether he/she mostly mentions senators from his/her own political party or also frequently mentions senators from opposing parties. Please note that mentions between any two senators were not combined to create one edge width, as some senators tweet at others but do not get equivalent mentions back. This allows the user to see a particular senator’s tweeting behavior more closely.
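The edge construction and coloring rules described above can be sketched as follows. This is an illustrative Python version, not the original R code; the input shape of `(author_handle, text, is_retweet)` tuples, the function names, and the choice to ignore self-mentions are assumptions.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets, senator_handles):
    """Count directed mentions between senators, skipping simple retweets.

    `tweets` is an iterable of (author_handle, text, is_retweet) tuples
    (assumed shape). Edges stay directed: (a, b) and (b, a) are separate.
    """
    handles = {h.lower() for h in senator_handles}
    edges = Counter()
    for author, text, is_retweet in tweets:
        if is_retweet:
            continue
        source = author.lower()
        for mentioned in MENTION.findall(text):
            target = mentioned.lower()
            # Ignoring self-mentions is an assumption, not stated in the text.
            if target in handles and target != source:
                edges[(source, target)] += 1
    return edges


def edge_color(party_a, party_b):
    """Red for two Republicans, blue for two Democrats, purple otherwise."""
    if party_a == party_b == "Republican":
        return "red"
    if party_a == party_b == "Democratic":
        return "blue"
    return "purple"  # cross-party edges and the Independent-Independent edge
```

The count stored for each `(source, target)` pair would drive the edge width, and keeping both directions separate preserves the asymmetry noted in the text.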
Since the network is quite extensive with 100 nodes and over 4,000 connected edges, the visualization also allows the user to zoom in and out and go left/right/up/down to focus in on certain areas of the network.
Some exploration with the network visualization brings about some interesting takeaways. First, senators do indeed mention opposing party senators pretty often. In addition, many of the senators mention their fellow senator from the same state most frequently, regardless of political party affiliation. Thad Cochran (MS-R) tweets at fellow Mississippi Republican senator Roger Wicker more than any senator mentions any other in the network. Maine senators Angus King and Susan Collins, as well as Wyoming senators John Barrasso and Mike Enzi, also mention each other a lot in their tweets. Finally, Tennessee Republican senator Bob Corker appears to have very little connection in the network – not mentioning other senators in his tweets often, nor being mentioned by other senators in theirs much.