Penetration Rate Analysis for New Vehicle Sales

Introduction

In this document I look at the “penetration rate” for new vehicles (cars and light trucks), based on the distribution of household income (what percentages of households fall into certain income ranges) versus the distribution of vehicle prices (what percentages of new vehicles sold fall into certain price ranges).

The 2018 Howard County Rental Survey defines the penetration rate as follows in the context of rental apartments:

The penetration rate is calculated by dividing the total number of units targeting a particular income range by the number of renter households with incomes that fall within that range. A penetration rate of 100 percent would indicate that there is equal number of multifamily units in an affordability classification and renter households with income sufficient to afford rents at that level. A penetration rate over 100 percent would indicate an oversupply of units, while a penetration rate of less than 100 percent would indicate an inadequate supply of units relative to the number of renter households in that income range.

I adapt this definition to the context of new vehicle sales, and do the analysis at the level of the United States as a whole, using U.S. Census household income data and third-party reports on new vehicle sales and prices.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the following packages for the following purposes:

tidyverse: do general data manipulation.
tidycensus: retrieve U.S. census data.
knitr: display tabular data.

library(tidyverse)
library(tidycensus)
library(knitr)

Data sources

I use data from the following sources:

Number of households in various income ranges in 2018.
Number of vehicles sold in 2017 by model.
Minimum and maximum vehicle prices in 2019 by model.

The first set of data is from U.S. Census surveys of household income, and is obtained via the Census API.

The next two sets of data are taken from the CSV file vehicle-sales.csv. Each row of the file contains data values as follows:

The vehicle model name.
The number of vehicles of that model sold in 2017.
The minimum list price of that model vehicle in 2019, in dollars. If no list prices are available for the model, this value is blank.
The maximum list price of that model vehicle in 2019, in dollars. If no list prices are available for the model, this value is blank. If only a single list price is available for the model, this value is the same as the minimum list price.

See the References section below for more information, including the data sources from which I compiled the data in vehicle-sales.csv.

Reading in the data

I use the Census API to obtain household income data for 2018 from the American Community Survey 1-year survey estimates. I first get the median household income for the U.S. as a whole, using the Census variable B19013_001.

mhi_us <- get_acs(survey = 'acs1', year = 2018, variable = 'B19013_001', geography = 'us') %>%
  select(estimate) %>%
  as.integer()

## Getting data from the 2018 1-year ACS

## The one-year ACS provides data for geographies with populations of 65,000 and greater.

I next read in the Census table B19001 for the entire U.S. This table contains data on the number of households in each of 16 different income ranges, with one row per range, plus an initial row giving the estimated number of all households for the geography. I retain only the Census variable names used for each income range and the estimates for the number of households in each range. (For data at the level of the entire U.S. the margins of error are less than 1% of the estimates, and can be safely ignored.)

households_by_income <- get_acs(survey = 'acs1', year = 2018, table = 'B19001', geography = 'us') %>%
  select(variable, estimate)

## Getting data from the 2018 1-year ACS

## The one-year ACS provides data for geographies with populations of 65,000 and greater.

## Loading ACS1 variables for 2018 from table B19001. To cache this dataset for faster access to ACS tables in the future, run this function with `cache_table = TRUE`. You only need to do this once per ACS dataset.

I next read in the vehicle sales and price data, and retain only those models for which I have price data.

veh_prices <- read_csv(
  "vehicle-sales.csv",
  col_names = c("model", "units", "min_price", "max_price"),
  col_types = "cidd"
) %>%
  filter(!is.na(min_price))

Preparing the data for the analysis

I have to do additional preparation of the data before I can analyze it.

First, the household income data provides estimates on the number of households in each income range, but I want the percentage of households in each range relative to the total number of U.S. households. The total number of households is stored in the first row of the households_by_income table, under the Census variable B19001_001. To compute the percentages I therefore do the following to create a new table pct_by_income:

Start with the households_by_income table read in above.
Use the spread() function to convert each row of the table into a column, with the column names taken from the Census variable names and the column values taken from the estimates for those variables.
Use the mutate_at() function to apply to each column a function that takes the column value (the number of households in a given income range) and replaces it by that value as a percentage of the value in the column corresponding to the Census variable B19001_001 (the total number of households).
Use the gather() function to convert each column back into a row, with the column names of the new table being variable (for the Census variable) and h_pct (for the percentage of households in each income range).

pct_by_income <- households_by_income %>%
  spread(variable, estimate) %>%
  mutate_at(1:ncol(.), function(x) 100 * x / .[["B19001_001"]]) %>%
  gather(variable, h_pct)

I next extend the vehicle sales data to cover additional price points. The rationale for doing this is as follows:

If a vehicle model has only one list price in the data then I can reasonably assume that all vehicles of that model are sold at that price. (This ignores dealer discounts and sales of additional vehicle options, but I don’t have any data on those, and for purposes of this analysis I’ll ignore them.)

If a vehicle has a range of list prices, I could assume that all vehicles of that model are sold at the average of the minimum and maximum prices. This discards useful information though: a vehicle model with minimum and maximum prices of $30,000 and $70,000 respectively would be treated the same in terms of affordability as another vehicle model with minimum and maximum prices of $49,000 and $51,000 respectively, since the average price in both cases is $50,000. However, in practice the former vehicle model would be more affordable to those of lower incomes than the latter, since its minimum price (the list price of the base model) is lower.

To account for this factor, I divide the price range for a given vehicle model into five price points: the minimum price, the price intermediate between the minimum price and the average price, the average price, the price intermediate between the average price and the maximum price, and the maximum price. I then apportion vehicle sales among these price points in the ratio 1:2:4:2:1, to account for the typical scenario where sales of the base vehicle package and the most expensive package are relatively low compared to sales of the mid-range packages.

For example, for a vehicle model with sales of 50,000 units, a minimum price of $30,000, and a maximum price of $70,000, I apportion the units sold as follows:

5,000 units at $30,000
10,000 units at $40,000
20,000 units at $50,000
10,000 units at $60,000
5,000 units at $70,000

I follow this scheme in creating a new table veh_prices_extended, as follows:

I create null vectors for the columns of the new table.
I loop through each row of the veh_prices table. If the minimum and maximum list prices are the same for a given model then I just add the model name and prices to the vectors that will make up the columns of the new table.
Otherwise I calculate the five new price points to be used, and apportion the number of units sold among those price points in a 1:2:4:2:1 ratio as discussed above. This results in five new values being added to each of the vectors that will make up the columns of the new table: the model name (repeated five times), the five calculated price points, and the number of units apportioned to each of the five price points.
Having looped through all rows of the veh_prices table, I use the resulting vectors for model names, prices, and units to construct a new table veh_prices_extended.

model_vec = NULL
units_vec = NULL
price_vec = NULL
for (row in 1:nrow(veh_prices)) {
    if (veh_prices$min_price[row] == veh_prices$max_price[row]) {
        model_vec <- c(model_vec, veh_prices$model[row])
        units_vec <- c(units_vec, veh_prices$units[row])
        price_vec <- c(price_vec, veh_prices$min_price[row])
    } else {
        price_interval <- (veh_prices$max_price[row] - veh_prices$min_price[row]) / 4
        new_prices <- seq(veh_prices$min_price[row], veh_prices$max_price[row], price_interval)
        units_tenth <- (veh_prices$units[row] + 5) %/% 10
        new_model <- veh_prices$model[row]
        model_vec <- c(model_vec, rep(new_model, 5))
        units_vec <- c(units_vec, units_tenth, 2 * units_tenth, 4 * units_tenth, 2 * units_tenth, units_tenth)
        price_vec <- c(price_vec, new_prices[1], new_prices[2], new_prices[3], new_prices[4], new_prices[5])
    }
}

veh_prices_extended <- tibble(model = model_vec, units = units_vec, price = price_vec)

Helper functions

In this section I create additional tables and functions for use in the analysis.

I create a table income_ranges that maps the Census variable used for each range (e.g., B19001_002) to the corresponding text description of the range (e.g., “Less than $10,000”). I define the income range descriptions as factors in order to force them to be displayed in a desired order when doing graphs.

hi_ranges <- c(
  "Less than $10,000",
  "$10,000 to $14,999",
  "$15,000 to $19,999",
  "$20,000 to $24,999",
  "$25,000 to $29,999",
  "$30,000 to $34,999",
  "$35,000 to $39,999",
  "$40,000 to $44,999",
  "$45,000 to $49,999",
  "$50,000 to $59,999",
  "$60,000 to $74,999",
  "$75,000 to $99,999",
  "$100,000 to $124,999",
  "$125,000 to $149,999",
  "$150,000 to $199,999",
  "$200,000 or more"
)
hi_ranges <- factor(hi_ranges, levels = hi_ranges)

hi_variables <- c(
  "B19001_002",
  "B19001_003",
  "B19001_004",
  "B19001_005",
  "B19001_006",
  "B19001_007",
  "B19001_008",
  "B19001_009",
  "B19001_010",
  "B19001_011",
  "B19001_012",
  "B19001_013",
  "B19001_014",
  "B19001_015",
  "B19001_016",
  "B19001_017"
)

income_ranges <- tibble(variable = hi_variables, income_range = hi_ranges)

I also create a function assign_to_range() to take a particular income value and return the (text description of the) range in which that income falls.

assign_to_range <- function(x) {
  ifelse(x < 10000,  "Less than $10,000",
  ifelse(x < 15000,  "$10,000 to $14,999",
  ifelse(x < 20000,  "$15,000 to $19,999",
  ifelse(x < 25000,  "$20,000 to $24,999",
  ifelse(x < 30000,  "$25,000 to $29,999",
  ifelse(x < 35000,  "$30,000 to $34,999",
  ifelse(x < 40000,  "$35,000 to $39,999",
  ifelse(x < 45000,  "$40,000 to $44,999",
  ifelse(x < 50000,  "$45,000 to $49,999",
  ifelse(x < 60000,  "$50,000 to $59,999",
  ifelse(x < 75000,  "$60,000 to $74,999",
  ifelse(x < 100000, "$75,000 to $99,999",
  ifelse(x < 125000, "$100,000 to $124,999",
  ifelse(x < 150000, "$125,000 to $149,999",
  ifelse(x < 175000, "$150,000 to $199,999",
  "$200,000 or more"
  )))))))))))))))
}

Finally, I define a function weighted_median to compute the weighted median of vehicle prices weighted by the number of vehicle models sold at each price. (See the References section for the source of the code for this function.) For example, if one vehicle of a given model were sold at a price of $20,000, 9 units of a second vehicle model sold at a price of $50,000, and one other vehicle of a third model sold at a price of $100,000, the weighted median price would be the same as the median value of the data values {20000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 100000}, or $50,000.

weighted_quantile <- function(x, w, probs=seq(0,1,0.25), na.rm=TRUE) {
  x <- as.numeric(as.vector(x))
  w <- as.numeric(as.vector(w))
  if(anyNA(x) || anyNA(w)) {
    ok <- !(is.na(x) | is.na(w))
    x <- x[ok]
    w <- w[ok]
  }
  stopifnot(all(w >= 0))
  if(all(w == 0)) stop("All weights are zero", call.=FALSE)
  #'
  oo <- order(x)
  x <- x[oo]
  w <- w[oo]
  Fx <- cumsum(w)/sum(w)
  #'
  result <- numeric(length(probs))
  for(i in seq_along(result)) {
    p <- probs[i]
    lefties <- which(Fx <= p)
    if(length(lefties) == 0) {
      result[i] <- x[1]
    } else {
      left <- max(lefties)
      result[i] <- x[left]
      if(Fx[left] < p && left < length(x)) {
        right <- left+1
        y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]-Fx[left])
        if(is.finite(y)) result[i] <- y
      }
    }
  }
  names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
  return(result)
}

weighted_median <- function(x, w, na.rm=TRUE) {
  unname(weighted_quantile(x, probs=0.5, w=w, na.rm=na.rm))
}

Analysis

I do analyses to answer the following questions:

What is a reasonable affordability criterion for new vehicles? In other words, what price vehicle would be ”affordable” for a household with a given income, or alternatively what household income would be required to afford a vehicle with a given price?
What percentage of households fall into each income range?
What percentage of new vehicles sold are affordable for households in each income range?
What is the penetration rate for new vehicles for each income range?

Defining an affordability criterion for new vehicles

I want to get a sense for what price vehicle is affordable to a household with a given income. My goal is to come up with a measure analogous to that used in housing affordability discussions. In those discussions people use 30% of household income as a criterion of affordability for renters: anyone spending more than that for rent is assumed to be “rental cost burdened”.

As noted below, for my purposes it’s simpler to relate new vehicle purchase price to household income: for a given household income, what vehicle purchase price can we consider affordable? Or, to put it another way, for a given vehicle purchase price, what household income is required to buy that vehicle and not be over-burdened?

I approach this problem from multiple directions. First, I can look to various third-party recommendations on how much one should spend on a vehicle:

The popular financial pundit Dave Ramsey recommends not spending more than 50% of household income on vehicles: “The total value of all your vehicles—things with a motor in them—should not be more than half of your annual income.” Ramsey does not specify whether this refers to before-tax or after-tax income, but the context indicates that he’s probably referring to before-tax income.
According to the Edmunds Auto Affordability Calculator, “Other experts say that a vehicle that costs roughly half of your annual take-home pay will be affordable.” This recommendation references after-tax income, so this corresponds to a lower percentage of before-tax income.
The Money Under 30 site recommends “Spend no more than 35 percent of your pre-tax annual income on a car.”
The instamotor.com site includes an “affordability rank” for 100 metro areas, comparing the median household income for a given area with the price of a vehicle (new or used) considered to be affordable for that income level: “The Affordable Price factors in the total monthly expenses an average household should spend on a vehicle. This includes auto loan payments, the cost of insurance and gas, where the total monthly payment does not exceed 15% of the gross monthly income.” Those prices range from 38.7% of median household income for New York City down to 11.2% of median household income for McAllen, Texas. At the mid-point in the rankings is Las Vegas, where the “affordable price” is about 30.2% of the median household income.
Finally, the independent blogger at Break Free offers the following recommendation: “Take 20% of your gross annual income (before taxes, social security, etc) and that’s how much car you can afford.”

Expressed in terms of the income needed to purchase a given vehicle, these recommendations span from requiring household income of 2 times the vehicle purchase price (Dave Ramsey) up to 5 times the purchase price (Break Free).

These recommendations also for the most part assume purchase of a single vehicle. According to the University of Michigan Transportation Research Institure, U.S. households have just under two vehicles on average. That means that under Dave Ramsey’s criterion (for example) a two-car household should not purchase a vehicle with a price of more than 25% of their household income.

Second, I can compare a “typical” new vehicle price to the U.S. median household income. (Even though the underlying data I’m using is from different years—2017 for vehicles sold, 2018 for household income, and 2019 for new vehicle prices—I ignore inflation in these calculations.)

I calculate a “typical” new vehicle price in three different ways:

Using an average sale price as reported by Kelly Blue Book for July
Using the weighted mean of new vehicle prices from the veh_price_extended table.
Using the weighted median of new vehicle prices from the veh_price_extended table.

kbb_mean_price = 35359

total_units <- sum(veh_prices_extended$units)
total_sales <- veh_prices_extended %>%
  mutate(sales_per_model = units * price) %>%
  summarize(total_sales = sum(sales_per_model)) %>%
  as.numeric()
weighted_mean_price <- total_sales / total_units

weighted_median_price <- weighted_median(veh_prices_extended$price, veh_prices_extended$units)

This gives three different values: the Kelly Blue Book average of $35,359, a weighted mean price of $37,944, and a weighted median price of $33,230. I can therefore take $35,000 as an approximate typical price for a new vehicle.

How do these prices relate to median household income? In 2018 the median U.S. household income was $61,937, so a $35,000 vehicle would be about 56.5% of the median household income. Put another way, it we adopt the relationship between median household income and the typical vehicle price as a criterion for affordability, buying a $35,000 vehicle would require having an income of 1.77 times that price. This is a looser criterion even than Dave Ramsey’s, which requires having an income more than twice the vehicle purchase price.

For purposes of this analysis I’ll arbitrarily decree that for a vehicle to be affordable its price must not be more than 50% of household income, or (to put it another way) a new vehicle with a given price will require an income of at least twice that price to be affordable. To allow for future tweaking I capture this criterion in the form of a variable required_income_vs price:

required_income_vs_price <- 2

Percentage of households in each income range

Before looking at vehicle affordability, I first want to get a sense of how many households are in each of the U.S. Census income ranges. I already computed the percentages of households in each income range, so I can simply graph that data as follows:

Start with the table pct_by_income.
Remove the row for the variable B19001_001 corresponding to all households, since it will just be 100%.
Join the resulting table with the table income_ranges containing text descriptions for each range, using the common variable variable.
Replace any missing values with zero, and save the resulting table as h_pcts.
Plot the h_pcts table as a bar chart (using geom_col) with the x-axis being the various income ranges and the y-axis the percentage of households with incomes in those ranges.

h_pcts <- pct_by_income %>%
  filter(variable != "B19001_001") %>%
  right_join(income_ranges, by = "variable") %>%
  replace_na(list(h_pct = 0))

h_pcts %>%
  ggplot(mapping = aes(x = income_range, y = h_pct)) +
  geom_col() +
  xlab("Household Income Range") +
  ylab("% of Households in Range") +
  scale_y_continuous(limits = c(0, 15), breaks = seq(0, 15, 5)) +
  labs(
    title = "Percentages of U.S. Households by Income Range in 2018",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2018 American Community Survey 1-Year Estimates, Table B19001",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

Since the income ranges are not evenly spaced, the above graph is not the same as an income distribution graph. That’s why there are higher percentages of households in the higher income ranges, since those ranges are so wide compared to the income ranges below $50,000.

Percentage of vehicles affordable for each income range

I now calculate the percentage of new vehicles sold that are affordable to households in each income range, using the affordability criterion above. I do this as follows, creating a new table veh_pcts:

I start with the table veh_prices_extended containing imputed numbers of vehicles sold for each pricepoint for all vehicle models.
I multiply the price times two (more correctly, by the value of required_income_vs_price) to get the income level required to afford a vehicle of that price.
I use the assign_to_range() function to take the resulting income and assign it to one of the Census income ranges.
For each income range I take a subtotal units_per_range of the number of vehicles sold whose prices require incomes in that range.
I compute the percentage of the total number of vehicles sold that are targeted to each income range. (The variable total_units was computed above.)
I join the resulting table with the income_ranges table, using the full_join() function so that the resulting table has rows for all income ranges.
I replace any missing values for the percentages with the value zero.

veh_pcts <- veh_prices_extended %>%
  mutate(income_range = assign_to_range(required_income_vs_price * price)) %>%
  mutate(income_range = factor(income_range, levels = hi_ranges)) %>%
  group_by(income_range) %>%
  summarize(units_per_range = sum(units)) %>%
  mutate(veh_pct = 100 * (units_per_range / total_units)) %>%
  full_join(income_ranges, by = "income_range") %>%
  replace_na(list(veh_pct = 0))

I plot a bar chart of the percentage of cars affordable to each income range, using the table veh_pcts created above.

veh_pcts %>%
  ggplot(mapping = aes(x = income_range, y = veh_pct)) +
  geom_col() +
  xlab("Household Income Ranges") +
  ylab("% of New Vehicles for Range") +
  scale_y_continuous(limits = c(0, 30), breaks = seq(0, 30, 5)) +
  labs(
    title = "Percentage of New Vehicles Targeted to Each Income Range",
    subtitle = "(Assumes Required Income of Twice the Vehicle Price)",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2018 American Community Survey 1-Year Estimates, Table B19001",
      "\n  GoodCarBadCar.net, 2017 Year End U.S. Vehicle Sales Rankings",
      "\n  Consumer Reports, 2019 New Car Guide",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

Based on this graph it appears that most new vehicles are priced to sell to households with incomes between $50,000 and $100,000.

Penetration rate for new vehicles by income range

I now calculate the penetration rate for new vehicles by income range. The penetration rate is calculated by analogy to the penetration rate for rental apartments, by dividing the percentage of vehicles targeted to a particular income range (based on my affordability criterion) by the percentage of households with incomes that fall within that range.

A penetration rate of 100 percent would indicate that there are equal numbers of vehicles targeted to an income range and households within that range. A penetration rate over 100 percent would indicate an oversupply of vehicles (for that income range), while a penetration rate of less than 100 percent would indicate an inadequate supply of vehicles relative to the number of households in that income range.

I create the table pen_rates as follows:

I start with the h_pcts table.
I then join to that table the veh_pcts tables, using the common column income_range.
I calculate the penetration rate as the ratio of the percentage of vehicles targeted to a particular income range to the percentage of households in that range.

pen_rates <- h_pcts %>%
  full_join(veh_pcts, by = 'income_range') %>%
  mutate(pen_rate = 100 * (veh_pct / h_pct))

Finally I plot a bar chart showing the penetration rate for each household income range.

pen_rates %>%
  ggplot(mapping = aes(x = income_range, y = pen_rate)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 100) +
  xlab("Household Income Ranges") +
  ylab("Penetration Rate (%)") +
  scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 50)) +
  labs(
    title = "Penetration Rate of New Vehicles by Income Range",
    subtitle = "Percentage Sold vs. Percentage Affordable to Each Income Range",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2018 American Community Survey 1-Year Estimates, Table B19001",
      "\n  GoodCarBadCar.net, 2017 Year End U.S. Vehicle Sales Rankings",
      "\n  Consumer Reports, 2019 New Car Guide",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

There are several points worth making about this graph. First, the details of the graph are obviously dependent on the assumptions made about vehicle affordability. If the criterion for affordability, i.e., a vehicle price no more than 50% of household income, were changed then the penetration rates for the various income groups would also change. However the overall shape of the graph should remain the same, with under-penetration in the lowest and highest income ranges, and over-penetration in the middle income ranges.

This general shape can be explained as follows:

In the higher-income ranges there are fewer vehicles sold than one might expect. For example, a household in the income range “$200,000 or more” should be able to afford a vehicle costing $100,000 or more, but relatively few such vehicles are sold (about a 25% penetration rate). This is probably due to two factors:

For most households there is an upper limit on how luxurious a vehicle they want or need. Of vehicle models with prices above $100,000 (especially models whose prices start above $100,000), many are high-end sports cars unsuitable for general use (e.g., Acura NSX, Audi R8, BMW i8, or Nissan GT-R) or prestige brands with limited market presence (e.g., Bentley or Maserati). The typical high-income household will be very satisfied with vehicles in the $75,000-to-$100,000 price range.
The penetration rate analysis I’m doing implicitly assumes that a household buys only one vehicle, but higher-income households almost certainly have multiple vehicles. Thus, for example, a typical household with income of $300,000 following the “50% of before-tax income” rule is more likely to buy two vehicles for $75,000 each, or three vehicles for $50,000 each, than to buy a single vehicle for $150,000.

In the lower-income ranges there are also fewer vehicles sold than one might expect. Again this is likely due to two factors:

It is difficult for automobile manufacturers to make and sell truly low-cost vehicles in the U.S. market, due to U.S. standards for emissions, crashworthiness, etc., and to limited U.S. market appeal for small low-powered vehicles with few or no passenger amenities. The lowest-priced vehicle in the data I’m using, the Nissan Versa, has a minimum list price around $12,500, which according to my chosen affordability criterion would require a household income of $25,000 or more. This means that essentially no vehicles are targeted to the four lowest income ranges defined by the U.S. Census Bureau (reflected in the graph above). Buyers in those income ranges are served by the used vehicle market (or by public transit).
Some households in the lower-income ranges buy vehicles that are more expensive than what they could afford based on my criterion. This may be due to need (for example, needing to transport a large family) or simply the desire to have a nicer vehicle.

The flip-side of under-penetration in the lowest and highest household income ranges is over-penetration in the middle income ranges. This is easily explained by lower-income households buying vehicles more expensive than they could afford by my criterion and by higher-income households buying vehicles less expensive than they could afford by this criterion, for reasons discussed above.

Appendix

Caveats

Values from the American Community Survey are estimates based on survey samples, with associated margins of error. However at the national level the margins of error are relatively small even for ACS 1-year estimates.

References

To obtain the 2017 ACS data I used the tidycensus package, which provides an easy-to-use interface to U.S. Census Bureau data made available via a set of public APIs. As its name suggests, the tidycensus package is designed to be compatible with the tidyverse approach to representing and manipulating data.

I obtained data for 2019 new car prices from the Consumer Reports guide to cars. (Consumer Reports does not apparently store archived versions of the guide for previous years, and there are no usable copies archived by the Wayback Machine.) Consumer Reports typically lists a range of prices for each vehicle. In those cases I assume that sales of a particular model were distributed between the minimum and maximum price in a 1-2-4-2-1 distribution, as discussed above.

I obtained data for the number of vehicles sold from “2017 Year End U.S. Vehicle Sales Rankings – Top 296 Best-Selling Vehicles In America – Every Vehicle Ranked” on the goodcarbadcar.net website.

I obtained the average sales price for new vehicles in July 2018 from the press release “Demand Quickly Backing Away from Cars, Pushing Average New-Car Transaction Prices Up for July 2018, According to Kelley Blue Book”.

The value for the average number of cars per household is from Table 1 of the University of Michigan Sustainable Worldwide Transportation Report No. SWT-2018-2, “Has Motorization in the U.S. Peaked? Part 10: Vehicle Ownership and Distance Driven, 1984 TO 2016”.

Recommendations for how much to spend on a vehicle are from the following sources:

Dave Ramsey: “How Much Car Is Too Much?”.
Edmunds: “How Much Car Can I Afford?”.
Money Under 30: “How Much Should You Spend On A Car?”.
instamotor.com: “How Much Car Can I Afford? Comparing Car Affordability in United States”.
Break Free: “How Much Car Can I Afford? The 20% Rule”.

The code for the weighted_median() function is taken from the code for the weighted.median() function in the spatstat package, as discussed in the StackOverflow question “Is there a weighted.median() function?”

Suggestions for others

It would be nice to have data on household income, vehicle sales by model, and vehicle prices all from the same year.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.24       tidycensus_0.9.2 forcats_0.4.0    stringr_1.4.0   
##  [5] dplyr_0.8.3      purrr_0.3.2      readr_1.3.1      tidyr_0.8.3     
##  [9] tibble_2.1.3     ggplot2_3.2.1    tidyverse_1.2.1 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.8           sf_0.7-7          
##  [4] haven_2.1.1        lattice_0.20-38    colorspace_1.4-1  
##  [7] generics_0.0.2     vctrs_0.2.0        htmltools_0.3.6   
## [10] yaml_2.2.0         rlang_0.4.0        e1071_1.7-2       
## [13] pillar_1.4.2       foreign_0.8-71     glue_1.3.1        
## [16] withr_2.1.2        DBI_1.0.0          rappdirs_0.3.1    
## [19] sp_1.3-1           uuid_0.1-2         modelr_0.1.5      
## [22] readxl_1.3.1       munsell_0.5.0      gtable_0.3.0      
## [25] cellranger_1.1.0   rvest_0.3.4        tigris_0.8.2      
## [28] evaluate_0.14      maptools_0.9-5     curl_4.0          
## [31] class_7.3-15       broom_0.5.2        Rcpp_1.0.2        
## [34] KernSmooth_2.23-15 scales_1.0.0       backports_1.1.4   
## [37] classInt_0.4-1     jsonlite_1.6       hms_0.5.0         
## [40] digest_0.6.20      stringi_1.4.3      grid_3.6.0        
## [43] rgdal_1.4-4        cli_1.1.0          tools_3.6.0       
## [46] magrittr_1.5       lazyeval_0.2.2     crayon_1.3.4      
## [49] pkgconfig_2.0.2    zeallot_0.1.0      xml2_1.2.2        
## [52] lubridate_1.7.4    assertthat_0.2.1   rmarkdown_1.14    
## [55] httr_1.4.1         rstudioapi_0.10    R6_2.4.0          
## [58] units_0.6-3        nlme_3.1-139       compiler_3.6.0

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.