Below is a table of my calculated HHI values by year. For each year it reports the average industry HHI, the standard deviation, the minimum and maximum values, and the number of industry HHIs reported that year (the count_firms column, which the printout truncates).
## # A tibble: 17 × 6
## year avg_industry_hhi sd_industry_hhi max_industry_hhi min_industry_hhi
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2003 0.638 0.332 1 0.0410
## 2 2004 0.642 0.330 1 0.0348
## 3 2005 0.645 0.331 1 0.0344
## 4 2006 0.652 0.333 1 0.0359
## 5 2007 0.656 0.332 1 0.0372
## 6 2008 0.664 0.335 1 0.0390
## 7 2009 0.679 0.333 1 0.0386
## 8 2010 0.687 0.335 1 0.0424
## 9 2011 0.692 0.339 1 0.0390
## 10 2012 0.689 0.339 1 0.0387
## 11 2013 0.694 0.338 1 0.0401
## 12 2014 0.693 0.341 1 0.0393
## 13 2015 0.692 0.341 1 0.0420
## 14 2016 0.697 0.341 1 0.0405
## 15 2017 0.692 0.335 1 0.0401
## 16 2018 0.695 0.334 1 0.0390
## 17 2019 0.704 0.333 1 0.0439
## # ℹ 1 more variable: count_firms <int>
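For reference, a summary table of this shape could be produced along the following lines. This is a minimal sketch in base R on made-up numbers, not the code behind the table above. The data frame `sales`, its columns, and `n_industries` are hypothetical names. It assumes the HHI is on a 0 to 1 scale (the sum of squared revenue shares within an industry-year), which is consistent with the maximum of 1 in the table.

```r
# Hypothetical firm-level data: one row per firm, industry, and year.
sales <- data.frame(
  year     = c("2003", "2003", "2003", "2003", "2003", "2004", "2004", "2004"),
  industry = c("A", "A", "B", "B", "B", "A", "A", "B"),
  revenue  = c(50, 50, 80, 10, 10, 70, 30, 100)
)

# Each firm's share of the total observed revenue in its industry-year.
industry_total <- ave(sales$revenue, sales$year, sales$industry, FUN = sum)
sales$share <- sales$revenue / industry_total

# HHI per industry-year: the sum of squared shares.
hhi <- aggregate(share ~ year + industry, data = sales,
                 FUN = function(s) sum(s^2))
names(hhi)[names(hhi) == "share"] <- "hhi"

# Yearly summary across industries, mirroring the columns of the table.
hhi_by_year <- do.call(rbind, lapply(split(hhi, hhi$year), function(d) {
  data.frame(
    year             = d$year[1],
    avg_industry_hhi = mean(d$hhi),
    sd_industry_hhi  = sd(d$hhi),
    max_industry_hhi = max(d$hhi),
    min_industry_hhi = min(d$hhi),
    n_industries     = nrow(d)   # hypothetical name for the count column
  )
}))
print(hhi_by_year)
```

An industry with a single observed firm gets an HHI of exactly 1 under this definition, which would explain a maximum of 1 in every year.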
For this question, I first created a subset of the NAICS-to-SIC crosswalk containing only the NAICS codes present in both the crosswalk and our BEA.dta dataset. Then, I searched the BEA.dta dataframe for all possible SIC codes corresponding to each BEA industry code in the data. Where the crosswalk offered more than one possible SIC code, I performed fuzzy matching between the SIC industry descriptions and the BEA industry descriptions. This matching used the Jaro-Winkler distance metric, which measures how dissimilar two strings are. The program then selected the SIC code whose description had the smallest distance from the BEA description, that is, the closest match. I added a new column to the BEA dataframe containing the assigned SIC code. In the end, only 17 codes in the BEA dataframe remained unmatched, which is a strong result in terms of data retention.
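The tie-breaking step described above can be sketched as follows. This is an illustrative base R version, not the original matching code: `jaro_winkler_distance()` and `pick_closest_sic()` are hypothetical helper names, the lower-casing of descriptions is an assumption, and the prefix scaling factor of 0.1 is the conventional default rather than something stated in the write-up. Packages such as `stringdist` offer a tested implementation of the same metric.

```r
# Jaro-Winkler distance between two strings (0 = identical, 1 = no similarity).
jaro_winkler_distance <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 && n2 == 0) return(0)
  if (n1 == 0 || n2 == 0) return(1)

  # Characters match if equal and no farther apart than this window.
  window <- max(floor(max(n1, n2) / 2) - 1, 0)
  matched1 <- rep(FALSE, n1)
  matched2 <- rep(FALSE, n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched1)
  if (m == 0) return(1)

  # Transpositions: matched characters that appear in a different order.
  transpositions <- sum(s1[matched1] != s2[matched2]) / 2
  jaro <- (m / n1 + m / n2 + (m - transpositions) / m) / 3

  # Winkler adjustment: reward a shared prefix of up to four characters.
  prefix <- 0
  for (k in seq_len(min(4, n1, n2))) {
    if (s1[k] == s2[k]) prefix <- prefix + 1 else break
  }
  1 - (jaro + prefix * p * (1 - jaro))
}

# Pick the SIC code whose description is closest to the BEA description.
# `sic_descriptions` is a character vector named by SIC code.
pick_closest_sic <- function(bea_description, sic_descriptions) {
  d <- vapply(
    sic_descriptions,
    function(s) jaro_winkler_distance(tolower(bea_description), tolower(s)),
    numeric(1)
  )
  names(d)[which.min(d)]
}

# Made-up candidate list for one BEA industry.
candidates <- c(
  "1311" = "Crude petroleum and natural gas",
  "1381" = "Drilling oil and gas wells",
  "1389" = "Oil and gas field services"
)
best <- pick_closest_sic("Oil and gas extraction", candidates)
print(best)
```

Because the metric rewards shared prefixes and character order rather than meaning, a sample of the selected matches is worth checking by hand.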
Next, I created a column in the Sales and Profits dataframe containing the corresponding BEA industry output for each row’s year and SIC code. This allowed me to recalculate the HHI using the BEA industry output as the total market size, instead of relying solely on the sum of private company data we had available.
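The change of denominator can be illustrated with a small worked example. This is a sketch on invented numbers: the data frames `sales_profits` and `bea`, their column names, and the assumption that BEA output and firm sales are in the same units are all hypothetical. Industries without a BEA match are simply left as `NA` here.

```r
# Hypothetical firm-level sales, one row per firm, year, and SIC code.
sales_profits <- data.frame(
  year  = c("2003", "2003", "2003", "2003"),
  sic   = c("1311", "1311", "2834", "2834"),
  sales = c(40, 10, 30, 30)
)

# Hypothetical BEA industry output, assumed to be in the same units as sales.
# SIC 2834 is deliberately left without a BEA match.
bea <- data.frame(year = "2003", sic = "1311", bea_output = 200)

# Attach the BEA output to every row by year and SIC code (NA when unmatched).
merged <- merge(sales_profits, bea, by = c("year", "sic"), all.x = TRUE)

# Shares under the two denominators: observed sales versus BEA output.
private_total <- ave(merged$sales, merged$year, merged$sic, FUN = sum)
share_private <- merged$sales / private_total
share_bea     <- merged$sales / merged$bea_output

# HHI under each denominator: sum of squared shares within year and SIC code.
merged$hhi_private <- ave(share_private^2, merged$year, merged$sic, FUN = sum)
merged$hhi_bea     <- ave(share_bea^2, merged$year, merged$sic, FUN = sum)

hhi <- unique(merged[, c("year", "sic", "hhi_private", "hhi_bea")])
print(hhi)
```

In this toy case the two observed firms have shares of 0.8 and 0.2 of observed sales (HHI of 0.68), but only 0.2 and 0.05 of the BEA output (HHI of 0.0425). The BEA-based figure omits the squared shares of firms that are not observed, so it is better read as a lower bound than as the true HHI.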
The two scatterplots below illustrate the impact of incorporating BEA data. The first scatterplot includes both the new HHI values and the data points that could not be matched to BEA data. This makes it easier to see how using BEA market size estimates generally made industries appear less concentrated than they initially seemed based on private data alone. The second scatterplot focuses only on the industries where BEA data could be matched, highlighting the shift in HHI values more cleanly.
Overall, this analysis demonstrated that our initial assessment of market concentration based solely on private data did not capture the full picture. The exercise revealed that market concentration metrics like the HHI can shift meaningfully when incorporating broader market size estimates from public data sources. Interestingly, the results can sometimes be counterintuitive — we might not expect every industry to appear less concentrated once the BEA data is factored in. While the BEA data offers a more comprehensive market denominator, the absence of firm-level revenue data for private companies means we still lack the full set of numerators needed for a precise recalculation of the HHI. As a result, we can better contextualize the private data’s limitations, but remain somewhat constrained in determining how unobserved private firms truly affect market concentration.
## Question 3: Calculating IHHI
Below is a table of my calculated IHHI values by year. For each year it reports the average industry IHHI and the weighted average IHHI, along with their standard deviations, the minimum and maximum values, and the number of IHHIs reported that year.
## # A tibble: 17 × 8
## year avg_industry_ihhi sd_industry_ihhi weighted_avg_ihhi sd_weighted_ihhi
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2003 0.0446 0.0158 0.0423 0.0182
## 2 2004 0.0423 0.0136 0.0395 0.0144
## 3 2005 0.0421 0.0138 0.0388 0.0137
## 4 2006 0.0418 0.0154 0.0391 0.0162
## 5 2007 0.0414 0.0193 0.0395 0.0191
## 6 2008 0.0405 0.0136 0.0374 0.0140
## 7 2009 0.0400 0.0176 0.0374 0.0179
## 8 2010 0.0406 0.0184 0.0382 0.0185
## 9 2011 0.0390 0.0145 0.0373 0.0152
## 10 2012 0.0400 0.0190 0.0385 0.0199
## 11 2013 0.0384 0.0129 0.0371 0.0134
## 12 2014 0.0383 0.0127 0.0370 0.0119
## 13 2015 0.0395 0.0134 0.0384 0.0132
## 14 2016 0.0406 0.0113 0.0400 0.0136
## 15 2017 0.0415 0.00922 0.0410 0.0126
## 16 2018 0.0429 0.00935 0.0421 0.0110
## 17 2019 0.0454 0.00983 0.0445 0.0107
## # ℹ 3 more variables: max_industry_ihhi <dbl>, min_industry_hhi <dbl>,
## # count_firms <int>
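The write-up does not spell out the IHHI formula, so the sketch below rests on two labeled assumptions: that IHHI is the investor-level HHI of a firm (the sum of squared ownership fractions across its investors), and that the weighted average uses firm sales as weights. The data and all names (`holdings`, `firms`, `pct_owned`) are invented for illustration.

```r
# Hypothetical holdings: fraction of each firm's shares held by each investor.
holdings <- data.frame(
  firm      = c("F1", "F1", "F2", "F2", "F2"),
  investor  = c("I1", "I2", "I1", "I3", "I4"),
  pct_owned = c(0.10, 0.05, 0.20, 0.10, 0.10)
)

# Hypothetical firm characteristics; sales serve as weights (an assumption).
firms <- data.frame(firm = c("F1", "F2"), industry = "X", sales = c(300, 100))

# Firm-level IHHI (assumed definition): sum of squared ownership fractions.
ihhi <- aggregate(pct_owned ~ firm, data = holdings,
                  FUN = function(b) sum(b^2))
names(ihhi)[names(ihhi) == "pct_owned"] <- "ihhi"
ihhi <- merge(ihhi, firms, by = "firm")

# Industry-level simple and sales-weighted averages of the firm IHHIs.
industry_ihhi <- do.call(rbind, lapply(split(ihhi, ihhi$industry), function(d) {
  data.frame(
    industry          = d$industry[1],
    avg_ihhi          = mean(d$ihhi),
    weighted_avg_ihhi = weighted.mean(d$ihhi, w = d$sales)
  )
}))
print(industry_ihhi)
```

Here F1 has an IHHI of 0.0125 and F2 of 0.06, so the simple average is 0.03625, while the sales-weighted average is pulled toward the larger firm at 0.024375.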
The two scatter plots below show the relationship between industry HHI and industry IHHI. The first uses the average industry IHHI and the second uses the weighted average IHHI. In most cases there is no strong relationship between IHHI and HHI.
To measure common ownership within industries, I created a variable that captures the percentage of a firm’s total shares owned by investors who also hold stakes in other companies within the same industry during the same quarter. I weighted this by each investor’s share of other firms in the industry, so that investors with larger positions in competitors had a greater influence on the metric. I then ran a regression of firm profits on this weighted common ownership percentage, industry concentration (HHI), and their interaction. The results showed a statistically significant negative coefficient on common ownership: at an HHI of zero, moving from no common ownership to full weighted common ownership was associated with a drop in profits of about $293k, or almost 6% of the intercept. Interestingly, the interaction term was positive, suggesting that in more concentrated industries the negative effect of common ownership on profits is less severe; taken at face value, the point estimates imply that the effect reaches zero at an HHI of roughly 0.31. However, while these coefficients are highly statistically significant, the model explains only a tiny portion of the overall variation in profits (an R-squared of about 0.0001), which makes sense since firm profitability depends on many other factors. Because of this, we should be cautious about overinterpreting the size of the effect, but it does suggest a consistent relationship worth exploring further.
##
## Call:
## lm(formula = profits ~ weighted_percent_common_ownership * HHI,
## data = common_ownership)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5010.7 -2484.9 5.9 2505.1 5102.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4998.645 1.963 2546.402 < 2e-16 ***
## weighted_percent_common_ownership -293.222 10.012 -29.288 < 2e-16 ***
## HHI 23.240 2.753 8.442 < 2e-16 ***
## weighted_percent_common_ownership:HHI 940.368 139.365 6.748 1.5e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2882 on 16753624 degrees of freedom
## (92852 observations deleted due to missingness)
## Multiple R-squared: 0.0001094, Adjusted R-squared: 0.0001093
## F-statistic: 611.3 on 3 and 16753624 DF, p-value: < 2.2e-16
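Because the model includes an interaction, the coefficient on common ownership alone is the effect at an HHI of zero. The short calculation below uses only the point estimates printed above to show how the implied effect varies with HHI. The variable and function names are illustrative, and the arithmetic ignores the standard errors.

```r
# Point estimates copied from the regression output above.
b_intercept   <- 4998.645
b_common      <- -293.222  # weighted_percent_common_ownership
b_interaction <- 940.368   # weighted_percent_common_ownership:HHI

# Implied change in profits from moving common ownership from 0 to 1,
# evaluated at a given HHI.
effect_of_common_ownership <- function(hhi) b_common + b_interaction * hhi

hhi_grid <- c(0, 0.1, 0.3, 0.5, 1)
effects <- data.frame(hhi = hhi_grid,
                      effect = effect_of_common_ownership(hhi_grid))
print(effects)

# HHI at which the implied effect crosses zero (about 0.312).
break_even_hhi <- -b_common / b_interaction

# Size of the HHI = 0 effect relative to the intercept (about 5.9%).
relative_drop <- abs(b_common) / b_intercept
```

Above the break-even HHI the point estimates imply a positive association between common ownership and profits, so the sign of the relationship depends on where an industry sits in the HHI distribution.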
If I were to formally study the relationship between profits and common ownership, I’d start by building a precise measure of how much overlap exists between investors in competing firms within the same industry, weighted by the size of their stakes. I’d use firm-level financial data (like Compustat) and detailed ownership data (FactSet) to construct this, along with industry classifications like SIC or NAICS codes.
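One possible reading of such a stake-weighted overlap measure is sketched below for a single industry and quarter. This is an interpretation, not the definition used for the regression above: for each firm, every investor's ownership fraction is multiplied by that investor's average stake in the other firms of the industry (counting zero where the investor holds nothing), and the products are summed. The function name `weighted_common_ownership()` and the data are hypothetical.

```r
# Hypothetical holdings within one industry and one quarter.
holdings <- data.frame(
  investor  = c("I1", "I2", "I1", "I3", "I1"),
  firm      = c("F1", "F1", "F2", "F2", "F3"),
  pct_owned = c(0.10, 0.05, 0.20, 0.30, 0.30)
)

# Stake-weighted common ownership per firm (assumed definition, see above).
weighted_common_ownership <- function(h) {
  firms <- unique(h$firm)
  vapply(firms, function(f) {
    own    <- h[h$firm == f, ]
    rivals <- setdiff(firms, f)
    if (length(rivals) == 0) return(0)
    # Each investor's average stake across rival firms, zeros included.
    weights <- vapply(own$investor, function(i) {
      sum(h$pct_owned[h$investor == i & h$firm %in% rivals]) / length(rivals)
    }, numeric(1))
    sum(own$pct_owned * weights)
  }, numeric(1))
}

common_ownership <- weighted_common_ownership(holdings)
print(common_ownership)
```

In this example only investor I1 holds more than one firm, so I2 and I3 contribute nothing. For real data the function would be applied separately to each industry-quarter group.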
The biggest challenges would be dealing with endogeneity — since ownership structures and profits likely influence each other — and data limitations, since not all ownership data is comprehensive. I’d consider using lagged variables, fixed effects, or natural experiments (like investor mergers) to isolate causal effects.
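As a rough illustration of two of these tools, the sketch below builds a within-firm lag of common ownership and estimates a regression with firm fixed effects. The panel is synthetic, with profits constructed to depend on lagged common ownership with a known coefficient of -2, so the output demonstrates the mechanics only and says nothing about real firms. All names are hypothetical.

```r
# Synthetic panel: three firms observed over four quarters.
panel <- data.frame(
  firm    = rep(c("F1", "F2", "F3"), each = 4),
  quarter = rep(1:4, times = 3),
  common_ownership = c(0.1, 0.2, 0.3, 0.4,
                       0.5, 0.4, 0.6, 0.7,
                       0.2, 0.1, 0.3, 0.2)
)

# Lag common ownership by one quarter within each firm.
panel <- panel[order(panel$firm, panel$quarter), ]
panel$lag_common_ownership <- ave(
  panel$common_ownership, panel$firm,
  FUN = function(x) c(NA, head(x, -1))
)

# Synthetic profits: a firm-specific level plus a known effect of -2.
firm_level <- c(F1 = 10, F2 = 20, F3 = 30)
panel$profits <- unname(firm_level[as.character(panel$firm)]) -
  2 * panel$lag_common_ownership

# Firm fixed effects via dummy variables; rows with a missing lag are dropped.
fit <- lm(profits ~ lag_common_ownership + factor(firm), data = panel)
print(coef(fit)[["lag_common_ownership"]])
```

The firm dummies absorb time-invariant differences in profitability, and the lag ensures ownership is measured before profits. Neither addresses reverse causality fully, which is where a natural experiment would come in.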
Finally, I’d look at how the relationship varies across industries and firm sizes, and whether it operates through pricing, investment, or competitive behavior. It’s a tricky but important question for understanding market competition and corporate strategy.
While I completed this empirical exercise in R, I am also highly proficient in Stata and have extensive experience working with both programs for a wide range of applied econometric tasks. In Stata, I have advanced skills in data cleaning, management, and pre-processing, as well as conducting both foundational and more sophisticated econometric analyses. Beyond typical coursework/research, I’ve also assisted classmates and peers with implementing techniques such as propensity score matching and synthetic control methods for policy evaluation and causal inference projects. Additionally, I am comfortable working with panel data models, fixed effects, clustered standard errors, and instrumental variable approaches in empirical research.