Data Visualization Portfolio

Author

Ash Johnson

Published

December 21, 2024

As an undergraduate at Harvard, I have taken courses such as Data Visualization and Sociology in the Age of Big Data to learn how to tidy, manage, and visualize data. Through these courses and my time as a Course Assistant for Statistics 100: Introduction to Statistics and Data Science, I have become quite interested in data visualization. Here are a few visualizations that I have created:

Visualization #1: Skill - Basic Interactive Maps

#hover your cursor over the map to display county names and their densities.

#loading packages

suppressWarnings({
suppressPackageStartupMessages({
library(here)
library(socviz)
library(tidyverse)
library(maps)
library(plotly)

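#organizing data frames
#us_counties is built from the maps package but is not used in the plot below
us_counties <- as_tibble(map_data("county"))

#county_data and county_map are datasets shipped with socviz
county_data <- as_tibble(county_data)

#attach each county's attributes to its polygon outline via the shared id column
us_counties_elec <- left_join(county_data,
                              county_map,
                              by = "id")

#lowercase county names
us_counties_elec <- us_counties_elec |>
  mutate(subregion = tolower(name))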

#creating theme
theme_map <- function(base_size=9, base_family="") {
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              panel.background=element_blank(),
              panel.border=element_blank(),
              panel.grid=element_blank(),
              panel.spacing=unit(0, "lines"),
              plot.background=element_blank(),
              legend.justification = c(0,0),
              legend.position = c(0,0)
              )
}


#static mapping 

countiesmap <- us_counties_elec |>
  ggplot(mapping = aes(x = long, 
                       y = lat,
                       fill = pop_dens,
                       group = group,
                       #tooltip text: county name plus its density bin
                       text = paste0(name, "<br>Density: ", pop_dens))) +
  geom_polygon(color = "gray90",
               linewidth = 0.1) +
  scale_fill_brewer(labels = c("0,10", "10,50", "50,100", "100,500", "500,1000", "1000,5000", "5000,71762")) + 
  labs(title = "Population Density in American Counties",
       fill = "Population Density") +
  theme_map() + theme(legend.position = "right")


ggplotly(countiesmap, tooltip = "text")

})
})

This map illustrates the population densities of counties across the United States, with each county shaded according to its number of people per square mile. Darker shades represent higher density ranges, as indicated in the accompanying legend. By hovering over the map, you can highlight specific counties and instantly view their population density. This interactive feature makes it easy to explore variation across regions, from urbanized areas with high densities to rural areas with low densities. The county data and map outlines used in the plot come from the county_data and county_map datasets in the socviz package for R.
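The legend ranges are density bins. As a minimal sketch of how values fall into such bins, the base R example below uses cut() with breakpoints read off the legend labels. The three sample densities are hypothetical, and the open-ended top bin (Inf) is an assumption made for illustration.

```r
# Hypothetical densities (people per square mile) for three made-up counties.
sample_density <- c(3, 75, 12000)

# Breakpoints read off the legend labels; Inf for the top bin is an assumption.
breaks <- c(0, 10, 50, 100, 500, 1000, 5000, Inf)

# right = FALSE makes each bin closed on the left, e.g. [10,50).
density_bin <- cut(sample_density, breaks = breaks, right = FALSE)

# Integer codes of the bins: 1 is the sparsest range, 7 the densest.
bin_code <- as.integer(density_bin)
print(data.frame(sample_density, density_bin, bin_code))
```

Each bin then receives one shade from the fill palette, which is why the map shows seven discrete colors rather than a continuous gradient.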

Visualization #2: Skill - Interactive Slippy Maps

#explore the map by dragging and scrolling. click the red dots to reveal the name of each HIV/AIDS clinic in the District of Columbia.

suppressPackageStartupMessages({
library(ggthemes)
library(sf)
library(dplyr)
library(ggplot2)

hivdc <- read_sf("/Users/ashleyjohnson/Desktop/data/hivaids")

econdc <- read_sf("/Users/ashleyjohnson/Desktop/data/econdc")

theme_map <- function(base_size=9, base_family="") {
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              panel.background=element_blank(),
              panel.border=element_blank(),
              panel.grid=element_blank(),
              panel.spacing=unit(0, "lines"),
              plot.background=element_blank(),
              legend.justification = c(0,0),
              legend.position = c(0,0)
              )
}

econdc <- econdc |> 
  mutate(unemployment = DP03_0009P, 
         avgincome = DP03_0063E, 
         povlevel = DP03_0128P )

econdc <- econdc |> 
  mutate(finallabel = paste0("Poverty Level: ", povlevel, "<br>", "Avg. Income per Year: ", avgincome))
  

library(tidyverse)  
library(scales)     
library(plotly)




#adding an interactive basemap with leaflet (CartoDB tiles)
library(leaflet)
#38.90775080029937, -77.07224241025676
#38.904308635459394, -77.0161111

hivdc_sf <- st_transform(hivdc, crs = 4326)

econdc1 <- st_transform(econdc, crs = 4326 )

#bin the poverty rates and map the bins onto the "Reds" palette
mypal <- colorBin(palette = "Reds", domain = econdc$povlevel, bins = 5, pretty = TRUE)

dc2 <- leaflet(options = leafletOptions(zoomSnap = 0)) |>
  addProviderTiles("CartoDB.Voyager") |>
  setView(lng = -77.0161111, lat = 38.904308635459394, zoom = 11.5) |>
  #census tracts shaded by poverty rate
  addPolygons(data = econdc1, fillColor = ~mypal(povlevel), color = "#FFFDD0",
              fillOpacity = 0.5, weight = 1, smoothFactor = 0.2) |>
  addLegend(pal = mypal, values = econdc$povlevel, position = "topright",
            title = "Percent Below the Poverty Line") |>
  #clinic locations: a ring sized in meters plus a clickable marker sized in pixels
  addCircles(data = hivdc_sf, weight = 3, radius = 70, color = "darkred", fillOpacity = 0.5) |>
  addCircleMarkers(data = hivdc_sf, popup = ~NAME, weight = 3, radius = 4, color = "red")
 


dc2

})

This is a map of the census tracts in the District of Columbia in 2020. The tracts are shaded by the percentage of residents living below the poverty line, a status determined by income level and benefits. The red dots represent HIV/AIDS clinics within these census tracts. Areas with significant poverty (60 percent or higher) tend to have fewer HIV/AIDS clinics nearby. These findings may be important for city planners and activists who need to understand where impoverished people in DC may lack access to these clinics. Data were retrieved from OpenDC.
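The color classes in the legend come from binning the poverty percentages. The sketch below shows that idea in base R under stated assumptions: the five tract values are hypothetical, and pretty() is used here to pick round breakpoints, which is similar in spirit to requesting pretty bins from a binned palette. The actual breaks on the map depend on the real data.

```r
# Hypothetical poverty percentages for five made-up census tracts.
pov_pct <- c(2.1, 8.4, 15.0, 33.7, 61.2)

# Round breakpoints covering the data, aiming for about five bins.
bin_edges <- pretty(pov_pct, n = 5)

# Assign each tract to a bin; include.lowest keeps a value equal to the
# lowest edge inside the first bin.
pov_bin <- cut(pov_pct, breaks = bin_edges, include.lowest = TRUE)

# One sequential color per bin (hcl.colors needs R >= 3.6); rev = TRUE
# reverses the palette's default order so lower bins get lighter shades.
bin_colors <- hcl.colors(nlevels(pov_bin), palette = "Reds", rev = TRUE)
tract_color <- bin_colors[as.integer(pov_bin)]
print(data.frame(pov_pct, pov_bin, tract_color))
```

Because pretty() favors round numbers, the number of bins it returns can differ from the number requested.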

Visualization #3: Skill - Text Analysis

suppressWarnings({
suppressPackageStartupMessages({

library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)
library(tidyverse)
  
library(tidyr)

tidy_books <- austen_books() |>
  group_by(book) |>
  #number the lines within each book and count chapter headings as they appear
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) |>
  ungroup() |>
  #one row per word
  unnest_tokens(word, text)
  
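jane_austen_sentiment <- tidy_books |>
  #keep only words that appear in the bing lexicon
  inner_join(get_sentiments("bing"), by = "word") |>
  #count positive and negative words in each 80-line section
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  #net sentiment of the section
  mutate(sentiment = positive - negative)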

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

})
  
})

This graph visualizes how the sentiment of Jane Austen’s six books changes as each story progresses. Each bar represents a section of 80 lines of text, and its height is the number of positive words minus the number of negative words in that section. I used the bing lexicon, which labels words as either positive or negative. Because the 80-line sections appear in order for each book, we can see how the sentiment (positivity or negativity) fluctuates over the course of the story.
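The net score behind each bar can be worked through by hand on a tiny example. The sketch below uses base R only; the four-word lexicon and the token table are hypothetical stand-ins for the bing lexicon and the tokenized novels.

```r
# Toy lexicon standing in for bing (hypothetical, four words only).
toy_lexicon <- data.frame(
  word = c("happy", "love", "sad", "angry"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Hypothetical tokenized text: one row per word, with its source line number.
tokens <- data.frame(
  linenumber = c(1, 5, 40, 79, 80, 95, 130),
  word = c("happy", "love", "sad", "the", "angry", "sad", "love")
)

# Keep only words found in the lexicon (an inner join on "word").
scored <- merge(tokens, toy_lexicon, by = "word")

# Integer division: line numbers 1-79 fall in section 0, 80-159 in section 1.
scored$index <- scored$linenumber %/% 80

# Net sentiment per section = positive word count minus negative word count.
scored$score <- ifelse(scored$sentiment == "positive", 1, -1)
net <- tapply(scored$score, scored$index, sum)
print(net)
```

Here section 0 contains two positive words and one negative word, so its net score is 1. Section 1 contains one positive and two negative words, so its net score is -1. The word "the" is dropped because it is not in the lexicon.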

Skill Application: This kind of sentiment analysis can be applied to any text, including court documents, contracts, and emails. Specific sentiments can also be analyzed, so one could track the “angry” sentiment in an affidavit or the tone of a contract. This is a useful skill when many documents need to be processed or analyzed quickly!