Research Assessment Exercise: Daneal O’Habib

Prelimaries

R code his hidden by default, but you can view it clicking the code tab above.

R Packages

Code
pacman::p_load(tidycensus, tidyverse, censusapi, janitor, vroom, scales, sf, units, stargazer, modelsummary)

options(scipen = 999)

options(tigris_use_cache = TRUE)
Code
df_raw <- vroom("data/Fixed_Broadband_Deployment_Data__December_2019_20240916.csv", .name_repair = janitor::make_clean_names) 

Question 1

The mean is 5.9 and the median is 6.

143225 providers have exactly the mean and median number of providers

Code
count_providers <- df_raw %>% 
  group_by(census_block_fips_code) %>% 
  summarise(unique_providers_name = n_distinct(provider_name),
            #unique_holding_name = n_distinct(holding_company_final),
            #n = n(),
            mean_downstream = mean(max_advertised_downstream_speed_mbps), 
            mean_upstream = mean(max_advertised_upstream_speed_mbps)) %>% 
  ungroup()

median_providers <- median(count_providers$unique_providers_name)

mean_providers <- round(mean(count_providers$unique_providers_name),1)

n_providers <- count_providers %>% 
  filter(unique_providers_name == median_providers) %>% 
  nrow()
Code
count_providers %>%
  ggplot(aes(unique_providers_name)) +
  geom_histogram(bins = 23, color = "black", fill = "gray") +
  geom_vline(xintercept = median_providers, color = "red", linetype = "dashed") +
  labs(x = "# providers in census block", 
       title = "Dense Competition",
       subtitle = "Most Census Blocks in New York Hover Around Six Broadband Providers") +
  theme_light() +
  annotate("text", 
           x = 20, 
           y = 125000, 
           label = paste0("mean = ", mean_providers, "\nmedian = ", median_providers)) +
  scale_y_continuous(labels = comma) +
  labs(caption = paste0(n_providers, " census blocks have excatly the median number of providers"))

Question 2

Is the number of broadband providers related to how fast the available internet is?

The scatter plot shows a non-linear relationship between the number of providers and mean download speed, suggesting diminishing returns in speed improvement as the number of providers increases.

Code
# Is the number of broadband providers related to how fast the available internet is
count_providers %>% 
  ggplot(aes(unique_providers_name, mean_downstream)) +
  geom_point(alpha = .7) +
  theme_light() +
  scale_y_continuous(labels = comma) +
  labs(title = "More Choices, Faster Speeds?",
       subtitle = "The Link Between Broadband Competition and Download Rates in New York",
       y = "mean download speed",
       x = "# providers in census block",
       caption = "")

Question 3

For Question 3, I used data from both the 2010 Decennial Census and the 2019 American Community Survey (ACS) to analyze the variation in internet speeds across New York State. The first regression used granular data from the 2010 Decennial Census at the census block level, which we can use to analyze of how population density, area, and the number of broadband providers influence internet speeds in these small geographical units. The downside of using this data is that it’s outdated, but the fine geographic resolution offers a clearer view of how infrastructure (like providers) and geographical factors affect broadband speeds.

In the second regression, I integrated 2019 ACS data at the block group level, which provides richer socioeconomic variables such as median income, educational attainment, and age. However, the trade-off here is that block groups are larger geographical units than census blocks, and this aggregation loses some spatial detail. Additionally, although the ACS data includes socioeconomic factors, it lacks data on the number of providers, which is important for understanding broadband competition and its impact on speeds. Despite this, spatial autocorrelation suggests that combining these datasets can still reveal important insights about how economic and demographic characteristics relate to broadband access.

Code
# First, we extract the census block FIPS codes from the `count_providers` dataset,
# which contains information about the number of broadband providers and other statistics.
census_blocks <- count_providers %>% 
  pull(census_block_fips_code)

# Load variables from the 2010 Decennial Census (using SF1 dataset) to later retrieve population data.
# The `load_variables` function is used to inspect available variables in the dataset.
all_vars_acs1 <- load_variables(year = 2010, dataset = "sf1")

# Pull population data for New York at the census block level using the "P001001" variable from the 2010 Decennial Census.
# Census block geometry is included for spatial analysis (geometry = TRUE).
ny <- get_decennial(geography = "block", 
                    variables = c(population = "P001001"), 
                    year = 2010,
                    state = "NY", 
                    geometry = TRUE)

# Process the census block data to compute population density and prepare it for joining with other datasets.
ny_processed <- ny %>% 
  mutate(area_m = st_area(geometry),  # Calculate the area of each census block in square meters.
         area_km = set_units(area_m, km^2),  # Convert the area to square kilometers for easier interpretation.
         pop_density = value / area_km) %>%  # Calculate population density (population per square kilometer).
  select(census_block_fips_code = GEOID,  # Keep only relevant columns.
         pop = value,  # Population
         area_km,  # Area in square kilometers
         pop_density) %>%  # Population density
  st_drop_geometry() %>%  # Drop geometry for easier manipulation.
  as_tibble()  # Convert the dataset to a tibble for easier manipulation.

# Join the population density data with the broadband data using the census block FIPS code.
final_data <- count_providers %>% 
  mutate(census_block_fips_code = as.character(census_block_fips_code)) %>%  # Ensure FIPS code is treated as a character.
  inner_join(ny_processed) %>%  # Join with population density data.
  mutate(pop_density = as.numeric(pop_density),  # Ensure population density is numeric for analysis.
         area_km = as.numeric(area_km))  # Ensure area is numeric for analysis.

# Fetch 2019 ACS data at the block group level to retrieve additional demographic and economic variables.
# These include total population, median income, race, education, housing units, home values, and poverty levels.
variables <- c(
  population = "B01003_001",           # Total population
  median_income = "B19013_001",         # Median household income
  white_population = "B02001_002",      # White alone population
  black_population = "B02001_003",      # Black or African American alone
  median_age = "B01002_001",            # Median age
  high_school_grad = "B15003_017",      # High school graduates
  bachelors_degree = "B15003_022",      # Bachelor's degree holders
  housing_units = "B25001_001",         # Total housing units
  median_home_value = "B25077_001",     # Median home value
  poverty = "B17001_003"                # Poverty status
)

# Retrieve the ACS data at the block group level and include geometry for later spatial analysis.
acs_data <- get_acs(geography = "block group",
                    variables = variables,
                    year = 2019,
                    state = "NY",
                    geometry = TRUE)

# Clean the ACS data by pivoting the `variable` column so that each variable becomes its own column.
# Compute additional variables such as population density based on the area, and create proportions for certain demographics.
acs_data_clean <- acs_data %>%
  select(GEOID, variable, estimate) %>%  # Keep the GEOID, variable, and estimate columns.
  pivot_wider(names_from = variable, values_from = estimate) %>%  # Pivot the data so each variable has its own column.
  mutate(area_km2_acs = as.numeric(st_area(geometry)) * .000001,  # Convert area from square meters to square kilometers for the ACS block group.
         pop_density_acs = population / area_km2_acs) %>%  # Calculate population density from the ACS data.
  st_drop_geometry() %>%  # Drop the geometry for easier analysis.
  as_tibble() %>%  # Convert to a tibble.
  mutate(prop_white = white_population / population,  # Calculate proportion of white population.
         prop_university = bachelors_degree / population)  # Calculate proportion of the population with a university degree.

# Join the cleaned ACS data with the broadband and population data using block group FIPS codes.
# We truncate the census block FIPS code to block group level (first 12 characters) for the join.
final_data_extended <- final_data %>% 
  mutate(census_block_group_code = substr(census_block_fips_code, 1, 12)) %>%  # Convert block FIPS to block group FIPS.
  left_join(acs_data_clean, by = c("census_block_group_code" = "GEOID"))  # Join the data.

Regression

Code
# Create two regression models to examine the impact of population density and other variables on internet speed.
models <- list(
  # First model using block-level data: internet speed as a function of population density, area, and number of broadband providers.
  "Internet Speed (block)" = lm(log(mean_downstream + 1) ~ log(pop_density + 1) + area_km + unique_providers_name, data = final_data_extended),
  
  # Second model using block group data from ACS: internet speed as a function of ACS-derived population density, income, age, and educational attainment.
  "Internet Speed (block 2)" = lm(log(mean_downstream + 1) ~ log(pop_density_acs + 1) + log(median_income + 1) + area_km2_acs + median_age + 
                                    prop_university, data = final_data_extended))

# The `p.value` for each coefficient is reported, with significance stars for p-values.
# Certain goodness-of-fit (GOF) metrics like AIC and BIC are omitted for simplicity.
modelsummary(models,
  statistic = "p.value", stars = TRUE,
  gof_omit = "DF|Deviance|AIC|BIC|Log.Lik.|RMSE",
  output = "markdown",
  coef_rename = c(
    "unique_providers_name" = "# providers",  # Rename number of providers for easier interpretation.
    "log(pop_density + 1)" = "log (pop density)",  # Rename log population density for easier interpretation.
    "log(median_income + 1)" = "log (median income)",  # Rename log median income for easier interpretation.
    "area_km" = "area (km^2)",  # Rename area for clarity.
    "median_age" = "median age",  # Rename median age.
    "prop_white" = "% white",  # Rename proportion of white population.
    "log(pop_density_acs + 1)" = "log (pop density, acs)",  # Rename ACS-derived population density.
    "area_km2_acs" = "area (km^2) acs"  # Rename ACS-derived area for clarity.
)) 
Internet Speed (block) Internet Speed (block 2)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 3.393*** 5.391***
(<0.001) (<0.001)
log (pop density) 0.063***
(<0.001)
area (km^2) -0.001***
(<0.001)
# providers 0.195***
(<0.001)
log (pop density, acs) 0.121***
(<0.001)
log (median income) -0.115***
(<0.001)
area (km^2) acs -0.001***
(<0.001)
median age 0.001**
(0.001)
prop_university 0.342***
(<0.001)
Num.Obs. 350169 336935
R2 0.274 0.151
R2 Adj. 0.274 0.151

Higher population density is associated with faster internet speeds, as demonstrated by positive coefficients in both models. A higher number of broadband providers within a census block significantly increases internet speeds, supporting the idea that competition enhances service quality. Conversely, the negative coefficient for median income indicates that wealthier areas might not enjoy faster internet. Larger geographic areas within census blocks and block groups are linked to slower speeds. Finally, the positive impact of a higher proportion of university-educated residents on internet speed suggests that areas with higher educational attainment have greater demand for or access to faster internet services.