Introduction

This script produces three data visualizations, drawing on data from the Sams Club database, as well as some selected data from the US Census. This document diplays the graphs and the code used to create them.

Packages

The script uses four packages: dplyr, ggplot2, plotly, and readr

Data Intake

We need to pull in data to R from three *.csv files. I downloaded them and placed them into the “data” in my home directory. In R, it’s good practice to specify an active “Workding Directory”.

I then use the read_csv function to create three table data frames (“tibbles”)

#setwd("~/")
df1 <- read_csv("~/data/Sam_Sales_ by_city.csv")
## Parsed with column specification:
## cols(
##   City = col_character(),
##   State = col_character(),
##   Total_Sales = col_double(),
##   X4 = col_character()
## )
states <- read_csv("~/data/Stateabbrevs.csv")
## Parsed with column specification:
## cols(
##   `State Abbreviation` = col_character(),
##   `State Name` = col_character()
## )
cities <- read_csv("~/data/US_cities.csv")
## Parsed with column specification:
## cols(
##   Municipality = col_character(),
##   State = col_character(),
##   Pop2000 = col_number(),
##   LandArea2000 = col_number(),
##   PopDensity2000 = col_number(),
##   Pop1990 = col_number(),
##   PopChange = col_character(),
##   X8 = col_character(),
##   X9 = col_character()
## )

Data Preparation for first two graphs

There is a small amount to do by way of JOINING tables, transforming some variables, and computing some aggregate summaries. Package dplyr does most of the work.

First, we join the Sams Data to the city population data. In Sam’s data, city names are in ALL CAPITAL LETTERS, but in US_Cities dataframe they are not. This code chunk uses the r funciton toupper to make every name all upper case, and then joins the tables.

cities$Municipality <- toupper(cities$Municipality)
Samcities <- inner_join(df1, cities, by=c("City" = "Municipality"))

Population histogram & Scatter plot of Sales v. Population

g1 <- ggplot(data=Samcities, aes(Samcities$Pop2000)) +
      geom_histogram(binwidth = 100000) +
      ggtitle("Population of Cities with Sams Club Stores")
g1

g2 <- ggplot(data=Samcities, aes(x=Pop2000, y=Total_Sales)) +
  geom_point() +
  ggtitle("Sales vs. Population") +
  geom_smooth(method=lm)
g2

# Data prep for mapping

Our data is provided at the city level, not the state level. For mapping, aggregate total sales by STATE and join to state abbreviation, and then use package plotly to create a choropleth map.

dfss <- df1 %>%
  group_by(State) %>%
  summarize(Tot_Sales = sum(Total_Sales))

dfss <-  left_join(dfss, states, by = c("State" = "State Abbreviation"))


dfss$hover <- with(dfss, paste("State",  State, '<br>', "Sales", Tot_Sales))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g4 <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

p <- plot_geo(dfss, locationmode = 'USA-states') %>%
  add_trace(
    z = ~Tot_Sales, text = ~hover, locations = ~State,
    color = ~Tot_Sales, colors = 'Purples'
  ) %>%
  colorbar(title = "Total Sales in USD") %>%
  layout(
    title = '2000 Sams Club Sales by State<br>(Hover for details)',
    geo = g4
  )

p