1 Preface

The purpose of this coding sample is to demonstrate proficiency in R and the use of tidyverse to explore, manipulate, and visualize data.

In this exercise, I explore internet speeds in the state of Washington. The primary source of my data is the FCC, in which using an API I download a list of 1,321,153 internet providers and relevant details in all 195,574 Census Blocks in the Evergreen State. I manipulate this dataset and join with other Census sources to find the average maximum advertised download speed per block, and conduct basic statistical inference to evaluate the relationship between the number of internet providers and the average advertised maximum download speed. I then include population in a basic multiple linear regression. I conclude by mapping these averages onto maps of the City of Seattle and its home County and State.

library(RSocrata)  #Reading in dataset
library(dplyr)
library(data.table)
library(tidyverse)
library(scales)  # Labels for ggplot
library(stats)  # lm model

library(tidycensus)
census_api_key("44081b88b5af24eaa694a50130f50bb22a05b422")
library(tigris)

2 Sourcing Data

To gather the data from the FCC, I query the SODA API using the R package RSocrata. It allows me to request a preview before initiating a download, so I can make sure the specification is correct. Because the data is national and I seek only one state, I filter only for New York. I also specify which variables I need. It’s important to filter at this step to make sure that the information processed by my machine is only what I need, and no processing time is wasted on loading data for other states, or redundant string variables like business names (I keep the DBA name in case I need it later).

FCC_URL <- "https://opendata.fcc.gov/resource/whue-6pnt.json?$select=blockcode,dbaname,consumer,business,maxaddown,maxadup&%24where=stateabbr='WA'"

# This code allows me to pull a preview. jsonlite::fromJSON(readLines(FCC_URL))

# The preview looks good, so I import the set.
testdata <- read.socrata(FCC_URL)
# A limited version of the dataset, sometimes used for faster iterations for
# troubleshooting further steps. testdata <-
# read.socrata(paste(FCC_URL,'&$limit=20000',sep=''))
# I use the following code to check the number of observations from the FCC to
# see if it matches with what's on the website. It does.
length(unique(testdata$blockcode))
## [1] 195574

After downloading the dataset, I need to aggregate the provider-level observations to become block-level in order to answer questions about census blocks. I group by the blockcode and construct averages and maximums of provided maximum advertised upload and download speeds, as well as count how many providers are operating in each block.

Here, I make a judgment to filter out rows where Consumer = 0. I do this because my exploratory analysis revealed that for all observations where Consumer = 0, broadband speeds are also 0. Because I will be averaging all advertised broadband speeds, zeroes present from non-operating providers would ā€œunfairlyā€ lower the average of advertised speeds of operating providers.

When aggregating from provider-level to block-level, I construct a few variables: I gather the average maximum advertised download and upload speeds of all providers in a block, as well as the maximum maximum advertised download and upload speeds.

census <- testdata %>%
  filter(consumer != 0) %>% #Filtering out Consumer=0 to not average between 0s and values
  group_by(blockcode) %>%
  summarize(
    count = n(),
    avg_maxdown = mean(na.omit(as.numeric(maxaddown))),
    max_maxdown = max(na.omit(as.numeric(maxaddown))),
    avg_maxup = mean(na.omit(as.numeric(maxadup))),
    max_maxup = max(na.omit(as.numeric(maxadup))),
  )

3 Mean and Median Number of Providers

I construct a histogram of the number of providers per block, and add a line to delineate the mean on the histogram.

ggplot(census, aes(x = count)) + geom_histogram(binwidth = 1, boundary = -0.5, color = "black ",
    fill = "white") + geom_vline(aes(xintercept = mean(count)), color = "blue", linetype = "dashed",
    size = 1) + scale_x_continuous(breaks = 1:36) + scale_y_continuous(breaks = seq(0,
    120000, 10000), labels = comma) + stat_bin(binwidth = 1, geom = "text", aes(label = ..count..,
    ), vjust = 0.4, hjust = -0.5, srt = 90, size = 3) + expand_limits(y = c(0, 110000)) +
    theme_classic() + ggtitle("Histogram of Blocks' broadband Provider Count") +
    xlab("Number of broadband Providers in Block") + ylab("Number of Blocks")

As visible on the histogram, the median number of broadband providers per block is 5, and the number of blocks with exactly 5 providers is 36,277. The mean number of broadband providers per block is 5.72.

Observing the histogram reveals an interesting trend in the number of providers. No blocks have zero active providers (this is confirmed by the same number of unique blocks existing in the original dataset as well as the block-level aggregation after zeroes are filtered out) yet very few have two, while the upward trend continues to 3 and 4 providers. This insinuates the assumption that, perhaps, it’s important for each block to have one, but for some blocks it may not be useful for two providers to offer services. Then, as demand increases, more providers service some blocks (the upward trend in 3 and 4 providers) up until a falloff. The number of broadband providers per census block is relatively normally distributed, indicating it may be an appropriate measure to use for OLS.

4 The Relationship Between Provider Count and Internet Speed

I must first define internet speed from the given variables per provider: Maximum advertised upload speed and maximum advertised upload speed. With more communication with stakeholders, I would consider constructing some index which balances both, but I make the judgment to focus on download speed, as this is of the most focus to the average consumer.

Then, as I have aggregated the provider-level dataset to become a block-level dataset, where each block has (as shown above) between 1 and 16 providers, I choose to use the average maximum advertised download speed, as opposed to the maximum maximum per block. This better captures the diverse offerings of speeds per block. I name this variable avgmaxdown.

I construct a simple linear regression to estimate the relationship between the number of providers in a block, and the average maximum advertised download speed per block. This relationship is expressed as

\[ \begin{array}{ccc}avgmaxdown \sim \beta_1 + \beta_2Providers + e\end{array} \]

I evaluate the simple linear regression to find the following results:

reg <- lm(avg_maxdown ~ count, census)
summary(reg)
## 
## Call:
## lm(formula = avg_maxdown ~ count, data = census)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212.92  -71.33  -32.11   59.57  579.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.2857     0.6189   36.01   <2e-16 ***
## count        17.3860     0.1015  171.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 95.23 on 195572 degrees of freedom
## Multiple R-squared:  0.1306, Adjusted R-squared:  0.1306 
## F-statistic: 2.937e+04 on 1 and 195572 DF,  p-value: < 2.2e-16

The estimated coefficient of the number of providers in a census block is 17.39. I interpret this to mean that we expect an increase of about 17 mb/s in download speed in a census block with one additional provider in contrast to a census block without one additional provider, all else held constant.

This estimated coefficient is statistically significant, with the model having an R-squared of 0.13. I can interpret this to mean that there is a relationship between the number of broadband providers per census block and this measure of how fast the internet is, however this model does not control for any other factors, has a low R-squared, and poor residual standard error of 95.23 - while I have identified a relationship, this model performs poorly in predicting internet speeds given the number of internet providers.

5 Variation by Geography

On the surface, the most obvious factor to include in our consideration of internet speed is the number of people per census block. The most recent available data for this granular level is the 2010 decennial census, so I import this data directly from the Census with the help of the Tidycensus package.

census_import <- get_decennial(geography = "block", variables = "P001001", state = "53",
    year = 2010, sumfile = "pl")
census_import <- rename(census_import, blockcode = GEOID)
census <- merge(census, census_import[, c("blockcode", "value")], by = "blockcode")

I then construct another model to control for the population of each census block when evaluating the relationship between internet speeds and the number of available providers. The estimated relationship is defined as:

\[ \begin{array}{ccc}avgmaxdown \sim \beta_1 + \beta_2Providers + \beta_3Population+ e\end{array} \]

regpop <- lm(avg_maxdown ~ count + value, data = census)
regpop_resid = residuals(regpop)
summary(regpop)
## 
## Call:
## lm(formula = avg_maxdown ~ count + value, data = census)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -768.09  -66.96  -30.34   56.96  581.35 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.222039   0.607744   46.44   <2e-16 ***
## count       14.809446   0.102626  144.31   <2e-16 ***
## value        0.255944   0.002646   96.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.04 on 195571 degrees of freedom
## Multiple R-squared:  0.1703, Adjusted R-squared:  0.1702 
## F-statistic: 2.006e+04 on 2 and 195571 DF,  p-value: < 2.2e-16

This model performs slightly better, with a higher R-squared of 0.17 and lower residual standard error of 93.04. I observe that the estimated coefficient of provider count is slightly lower, 14.8, and population is estimated to have a statistically significant positive coefficient of 0.26. I interpret this to mean that I would expect about a 2.5mb/s higher average maximum advertised upload speed in blocks with 10 additional people, all else held constant. This provides evidence for population having a surprisingly strong effect.

I call on the Plot function to visually diagnose the strength of the model - I’m not concerned with specifying an accurate model in this exercise, but I want to see how far this one got us.

All plots unanimously visually confirm strong heteroskedasticity in the model, with interesting shapes that would prompt further discussion if I was trying to build a full model - It appears that a majority of the error originates from overestimation of low speed blocks, but this is almost balanced by the stronger underestimation of high-speed blocks. This underestimation is reflected by a clear kink in the Normal Q-Q plot.

I construct a Residuals Density plot to visually evaluate what would certainly be a failed test for normality in the residuals. I employ a personal trick to overlay an arbitrary, hypothetical normal distribution to compare against.

par(mfrow = c(2, 3))
plot(regpop)
plot(density(resid(regpop)[!is.na(resid(regpop))]), main = "Residuals Density")
curve(dnorm(x, 0, 50), from = -200, to = 200, col = "blue", add = TRUE)
par(mfrow = c(1, 1))

7 Visualizing and Mapping

Finally, I graph the average maximum advertised broadband speeds on a map of Washington State to geographically visualize the distribution of high and low internet speeds.

tracts_tbl <- as_tibble(tracts)
ggplot(data = tracts, aes(fill = avg_maxdown_t)) + aes(geometry = geometry) + geom_sf(color = NA) +
    scale_fill_viridis_c(option = "D", trans = "log2", name = "Speed (Mb)") + labs(title = "Washington Average broadband Speeds",
    subtitle = "Census Tract Level", color = "Average Maximum Advertised Download Speed") +
    theme(axis.text = element_blank(), axis.ticks = element_blank(), plot.background = element_rect(fill = NA)) +
    theme_void()

Next, I zoom into King County to observe the area around Seattle.

tractscity_tbl <- tracts_tbl %>%
    filter(County == "King County")
ggplot(data = tractscity_tbl, aes(fill = avg_maxdown_t)) + aes(geometry = geometry) +
    geom_sf(color = NA) + scale_fill_viridis_c(option = "D", trans = "log2", name = "Speed (Mb)") +
    labs(title = "King County Average broadband Speeds", subtitle = "Census Tract Level",
        color = "Average Maximum Advertised Download Speed") + theme(axis.text = element_blank(),
    axis.ticks = element_blank(), plot.background = element_rect(fill = NA)) + theme_void()

Finally, I used Seattle GeoData from data-seattlecitygis.opendata.arcgis.com to identify only census tracts considered to be part of the City of Seattle, and filter my dataset to reflect only those census tracts. I plot this to produce a map of Seattle’s census tracts and the average of their highest advertised download speeds.

seattle_tracts <- read.csv(file = "C:/Users/dange/Downloads/Census_Tracts_2010.csv")
seattle <- tracts %>%
    filter(tract %in% seattle_tracts$GEOID10)

ggplot(data = seattle, aes(fill = avg_maxdown_t)) + aes(geometry = geometry) + geom_sf(color = "white") +
    scale_fill_viridis_c(option = "D", trans = "log2", name = "Speed (Mb)") + labs(title = "Seattle Average broadband Speeds",
    subtitle = "Census Tract Level", color = "Average Maximum Advertised Download Speed") +
    theme(axis.text = element_blank(), axis.ticks = element_blank(), plot.background = element_rect(fill = NA)) +
    theme_void()