Overview

Total Sources: 30

Languages: 16

Articles Collected: 3,121,342

Sources Complete: 63%

In Progress

Blocked / Failed

Lines of Inquiry

01: Collection Completeness

I started here because before doing any content analysis, I needed to know how much of the pipeline was actually finished. Tracking each source across five completion states let me figure out where the real gaps were and whether the corpus was even ready to use downstream.

02: Language and Volume Distribution

I looked at this because language balance is fundamental to a regional news project. I wanted to see if we were over-relying on Hindi and Marathi, and as I suspected, languages like Khasi, Ladakhi, and Manipuri have almost no coverage, which I knew would skew any cross-linguistic analysis.

03: Technical Barriers to Harvesting

I categorized the issues because “it didn’t work” is not actionable on its own. I needed to know whether something was a firewall problem, a date parsing bug, or an E-Paper situation, since each one requires a completely different approach to fix.

04: Temporal Depth and Efficiency

I included this because a source with 50,000 articles across a decade is very different from one with 50,000 articles from last month. I wanted to understand which sources were actually carrying the longitudinal weight and which ones looked bigger than they really were.

Chart 1: Overall Collection Status

Line of Inquiry 01: Collection Completeness

I built this donut chart first because I wanted an immediate high-level read on pipeline health before going into any of the detail charts. About 63% of all sources are fully collected or marked Done, which is a reasonable baseline, but there are still meaningful gaps that would affect any analysis that assumes the corpus is complete. Five sources are completely blocked, and I wanted that visible right away: Ajit hit a subscription wall, Poknapham and Dainik Sambad only publish image-based E-Papers, The Hindu sits behind a Cloudflare firewall, and Vanglaini returned empty files on every run. I also flagged the in-progress sources separately because their article counts are still growing, which means every other chart in this report slightly understates their final numbers. The “Complete (Unverified)” slice specifically flags Esakal, which has enormous volume but had not yet been through a final quality check at the time of writing.
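The 63% figure can be rechecked from the Status column of the Full Source Table. The snippet below is a minimal Python sketch, not the dashboard's own R code. It assumes that Done, Complete, and Complete (Unverified) together form the "fully collected or marked Done" group; the names `status_counts` and `FINISHED` are illustrative.

```python
# Number of sources per value of the Status column in the Full Source Table.
status_counts = {
    "Done": 9,
    "Complete": 9,
    "Complete (Unverified)": 1,
    "In Progress": 4,
    "Failed": 1,
    "NA": 6,
}

# Assumption: these three states together count as "finished".
FINISHED = ("Done", "Complete", "Complete (Unverified)")

total_sources = sum(status_counts.values())
finished_sources = sum(status_counts[s] for s in FINISHED)
finished_share = finished_sources / total_sources

print(f"{finished_sources} of {total_sources} sources finished ({finished_share:.0%})")
```

Under that grouping, 19 of the 30 sources count as finished, which rounds to the 63% shown in the chart.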

Chart 2: Articles per Source

Line of Inquiry 02: Volume by Source and Language

I used a horizontal bar chart here because I wanted to be able to read all the source names without them overlapping, and sorting by article count makes it immediately obvious where the corpus weight sits. Prabhat Khabar (Hindi) towers above everything else with nearly 2 million articles, accounting for roughly 64% of the entire corpus on its own. I applied a log scale on the x-axis specifically to keep sources like Stawa (Ladakhi, 126 articles) visible in the same chart as Prabhat Khabar, since on a linear scale the smaller sources would essentially disappear. The language color-coding was intentional too, so you can immediately see that the sources clustering at the far left are Khasi, Nepali, Ladakhi, and Mizo, which is a structural gap I knew would be a problem for any cross-linguistic work.
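To make the log-scale argument concrete, here is a small Python sketch (an illustration, not the dashboard's R code) that compares how long Stawa's bar would be relative to Prabhat Khabar's on each kind of axis. It assumes the linear axis starts at 0 and the log axis starts at 1 article; the variable names are illustrative.

```python
import math

prabhat_khabar = 1_994_893   # largest source in the Full Source Table
stawa = 126                  # smallest non-empty source
corpus_total = 3_121_342     # Articles Collected, from the Overview

# Share of the whole corpus held by the largest source.
share = prabhat_khabar / corpus_total

# Bar length relative to the longest bar.
# Assumption: linear axis starts at 0, log10 axis starts at 10**0 = 1.
linear_fraction = stawa / prabhat_khabar
log_fraction = math.log10(stawa) / math.log10(prabhat_khabar)

print(f"Prabhat Khabar share of corpus: {share:.1%}")
print(f"Stawa bar length, linear axis: {linear_fraction:.4%} of the longest bar")
print(f"Stawa bar length, log axis:    {log_fraction:.0%} of the longest bar")
```

On the linear axis the Stawa bar is a tiny fraction of a percent of the longest bar, while on the log axis it is about a third of it. Sources with zero articles have no position on a log axis at all, so they would need separate handling.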

Chart 3: Language Volume Treemap

Line of Inquiry 02: Language Equity in the Corpus

I added the treemap because I wanted a different view of the same language imbalance that the bar chart shows, but in a way where the relative proportions feel more visceral. When you see Hindi taking up most of the canvas and Khasi as a tiny sliver, that communicates the gap much more intuitively than numbers alone. I find that decision-makers respond better to area-based visualizations when the differences are this extreme. Marathi and Oriya hold their own mainly because Esakal and Sambad had clean pipelines and I could extract full archives. The near-invisible tiles for Khasi, Ladakhi, Nepali, and Mizo do not mean those regions have no news; the digital archives were limited or blocked, and I wanted that distinction to be clear in the hover text.
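The tile areas in a treemap are proportional to each language's article total. The following Python sketch (illustrative only, not the dashboard's R code) aggregates the Language and Articles columns of the Full Source Table to show what those proportions are; `rows` and `totals` are names chosen for this example.

```python
from collections import defaultdict

# (language, articles) pairs copied row by row from the Full Source Table.
rows = [
    ("Hindi", 1_994_893), ("Marathi", 550_233), ("Kannada", 196_615),
    ("Oriya", 126_587), ("English", 70_331), ("Hindi", 42_150),
    ("Khasi", 19_206), ("Telugu", 16_189), ("Bengali", 15_566),
    ("Hindi", 14_373), ("Gujarati", 13_628), ("Malayalam", 13_217),
    ("Hindi", 8_369), ("Marathi", 7_338), ("English", 6_383),
    ("Tamil", 6_133), ("Telugu", 4_457), ("Nepali", 3_524),
    ("Hindi", 3_265), ("English", 3_028), ("Hindi", 2_667),
    ("Hindi", 1_687), ("Hindi", 1_377), ("Ladakhi", 126),
    ("Punjabi", 0), ("English", 0), ("Manipuri (Meitei)", 0),
    ("Mizo", 0), ("Bengali", 0), ("English", 0),
]

# Sum article counts per language.
totals = defaultdict(int)
for language, articles in rows:
    totals[language] += articles

corpus_total = sum(totals.values())

# Print languages from largest to smallest with their share of the corpus.
for language, articles in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{language:<18} {articles:>9,} {articles / corpus_total:7.2%}")
```

Languages whose sources returned zero articles still appear in `totals`, but a tile with zero area cannot be drawn, so they would be absent from the treemap itself.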

Chart 4: Temporal Coverage (Gantt)

Line of Inquiry 04: Temporal Depth

I made this Gantt-style chart because I wanted to show temporal coverage in a way that a simple table cannot. Once I started looking at the date ranges in the raw CSV, it was immediately clear that some sources cover a week and some cover a decade, and that difference completely changes how useful they are for longitudinal work. Prabhat Khabar is the standout here, with an archive starting in July 2013, which gives over a decade of Hindi-language news and makes it by far the most valuable source for anything time-series related. I also made sure the in-progress sources (orange bars) are visible with their current end dates so it is obvious their right edges will keep moving. Sources like Dainik Bhaskar MP (one week) and Anandabazar Patrika (a few days) came out this narrow because older archives were either behind paywalls or simply never digitized.

Chart 5: Collection Barriers

Line of Inquiry 03: Why Collection Was Limited

I built this chart because I needed to communicate the breakdown of what went wrong, not just that things went wrong. I used a bubble chart so that the source count is encoded in both the y-axis position and the bubble size, a deliberate redundancy that makes the most common barriers stand out immediately. Limited Archive being the most common issue was actually reassuring in a way, because it means the scraper is working fine and there is just nothing more to collect, which is a different problem from access being blocked. Access Blocked is the one I flagged as highest priority for follow-up because Cloudflare and paywalls are the hardest to fix without institutional credentials or paid API access. I specifically called out the Date Parsing issue with Assam Tribune because every one of their articles shows up as “15 Sept 2010” in the raw data, which means the articles exist but I cannot place them on a timeline, and that completely breaks any time-series analysis that touches that source. E-Paper Only sources are a separate category because they require OCR rather than standard scraping, and I wanted that distinction visible before anyone tries to rerun the pipeline.
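The Assam Tribune symptom, every article carrying the same date, is the kind of problem an automated check could catch before a time-series analysis runs. The sketch below is a hypothetical Python illustration and not part of the actual pipeline: the function names, the 0.95 threshold, and the sample inputs are all assumptions made for this example.

```python
from collections import Counter
from datetime import date, timedelta


def dominant_date_share(dates):
    """Return the most frequent date and the fraction of articles that carry it."""
    if not dates:
        return None, 0.0
    most_common_date, count = Counter(dates).most_common(1)[0]
    return most_common_date, count / len(dates)


def looks_like_parse_fallback(dates, threshold=0.95):
    """Flag a source whose dates collapse onto one value (threshold is hypothetical)."""
    _, share = dominant_date_share(dates)
    return share >= threshold


# Illustrative inputs, not real scraped data.
stuck = [date(2010, 9, 15)] * 6_383  # every article stamped with the same day
spread = [date(2025, 1, 1) + timedelta(days=i % 300) for i in range(3_000)]

print(looks_like_parse_fallback(stuck))   # True: one date covers all articles
print(looks_like_parse_fallback(spread))  # False: dates are spread over 300 days
```

A source that was genuinely scraped for a single day would also trip this check, so in practice it would likely need to be combined with a minimum article count or a manual review.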

Chart 6: Collection Efficiency

Line of Inquiry 04: Temporal Depth and Efficiency

I made this scatter plot because I wanted to look at efficiency, not just volume. A source can look impressive in the bar chart simply because its scrape ran for a long time, and this plot exposes whether the articles per day actually justify that impression. Plotting archive span on the x-axis against article count on a log y-axis lets you immediately see who is in the top-right corner, which is where the genuinely useful sources live. Prabhat Khabar is the clear outlier at roughly 4,600 days of coverage and nearly 2 million articles, working out to about 430 articles per day, which is far above anything else in the dataset. Sources clustering in the bottom-left are the ones I am most concerned about for longitudinal studies, since they either have very short archives or restricted access that kept the harvest shallow regardless of how long the scraper ran. I used a log scale on the y-axis here for the same reason as Chart 2, because otherwise the smaller sources would be invisible.
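The span and articles-per-day figures quoted above follow directly from the Start, End, and Articles columns of the Full Source Table. This Python sketch (an illustration, not the dashboard's R code) reproduces the arithmetic. It assumes the span is the plain difference between end and start dates, without counting the end day; `articles_per_day` is a name chosen for this example.

```python
from datetime import date


def articles_per_day(start, end, articles):
    """Return (span in days, average articles per day) for one source."""
    span_days = (end - start).days  # assumption: end day itself is not counted
    if span_days <= 0:
        raise ValueError("end date must be after start date")
    return span_days, articles / span_days


# Values taken from the Full Source Table.
pk_span, pk_rate = articles_per_day(date(2013, 7, 1), date(2026, 2, 25), 1_994_893)
db_span, db_rate = articles_per_day(date(2026, 1, 14), date(2026, 1, 21), 2_667)

print(f"Prabhat Khabar: {pk_span} days, {pk_rate:.0f} articles/day")
print(f"Dainik Bhaskar (MP edition): {db_span} days, {db_rate:.0f} articles/day")
```

Prabhat Khabar comes out at 4,622 days and roughly 432 articles per day, in line with the rounded figures in the text. Dainik Bhaskar (MP edition) has a similar daily rate over only 7 days, which is why the plot needs both axes: a high rate alone does not make a source useful for longitudinal work.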

Full Source Table

Source	Language	Start	End	Articles	Status	Issue Types
Prabhat Khabar	Hindi	2013-07-01	2026-02-25	1,994,893	Done	Date Parsing
Esakal	Marathi	2023-05-20	2026-02-16	550,233	Complete (Unverified)	No Major Issue
Vijaya Karnataka	Kannada	2020-01-01	2025-12-31	196,615	Complete	File Corruption
Sambad	Oriya	2022-04-15	2026-02-27	126,587	Complete	Date Parsing
Rising Kashmir	English	2022-05-25	2025-12-18	70,331	Complete	Access Blocked, Date Parsing
Amar Ujala	Hindi	2024-06-29	2025-09-04	42,150	Done	No Major Issue
U Nongsain Hima	Khasi	2023-02-10	2025-09-09	19,206	Done	No Major Issue
Namasthe Telangana	Telugu	2023-01-07	2024-06-29	16,189	In Progress	Content Noise
Anandabazar Patrika	Bengali	2026-02-09	2026-02-11	15,566	In Progress	No Major Issue
Dainik Jagran (Uttarakhand)	Hindi	2025-08-13	2026-03-25	14,373	Done	Date Parsing
Sandesh	Gujarati	2022-03-01	2026-03-21	13,628	In Progress	No Major Issue
Malayala Manorama	Malayalam	2024-07-09	2026-02-26	13,217	Complete	Access Blocked, Date Parsing
Haribhoomi	Hindi	2024-01-01	2026-02-05	8,369	Complete	Date Parsing
Gomantak	Marathi	2020-01-01	2025-07-15	7,338	Complete	No Major Issue
Assam Tribune	English	2025-11-07	NA	6,383	Complete	Date Parsing, Content Noise
Daily Thanti	Tamil	2022-02-11	2025-12-17	6,133	Complete	Date Parsing, E-Paper Only, Limited Archive
Eenadu	Telugu	2025-12-07	2026-02-19	4,457	Complete	Limited Archive
Himalayan Darpan	Nepali	2025-04-09	2026-02-04	3,524	Done	No Major Issue
Navbharat Times	Hindi	2025-01-02	2026-01-15	3,265	In Progress	Date Parsing, Limited Archive
Nagaland Post	English	2025-05-26	2026-01-14	3,028	Done	Limited Archive
Dainik Bhaskar (MP edition)	Hindi	2026-01-14	2026-01-21	2,667	NA	Limited Archive
Punjab Kesari	Hindi	2025-01-10	2026-01-14	1,687	Done	Limited Archive
Rajasthan Patrika	Hindi	2026-01-31	2026-02-02	1,377	Done	No Major Issue
Stawa	Ladakhi	2020-04-12	2025-11-27	126	Done	No Major Issue
Ajit	Punjabi	NA	NA	0	Failed	Access Blocked, Date Parsing
Echoes of Arunachal	English	NA	NA	0	NA	No Major Issue
Poknapham	Manipuri (Meitei)	NA	NA	0	NA	E-Paper Only
Vanglaini	Mizo	NA	NA	0	NA	File Corruption
Dainik Sambad	Bengali	NA	NA	0	NA	E-Paper Only
The Hindu	English	NA	NA	0	NA	Access Blocked

Generated with R · plotly · knitr · RMarkdown · Source: data description(Sheet1).csv

Indian Regional News Pipeline Analysis Dashboard

Research Team | Professor Sunandan Chakraborty Lab

May 07, 2026