Overview

Total Sources: 30
Languages: 16
Articles Collected: 3,121,342
Sources Complete: 63%
In Progress: 4
Blocked / Failed: 1

Lines of Inquiry

01: Collection Completeness

I started here because before doing any content analysis, I needed to know how much of the pipeline was actually finished. Tracking each source across five completion states let me figure out where the real gaps were and whether the corpus was even ready to use downstream.

02: Language and Volume Distribution

I looked at this because language balance is fundamental to a regional news project. I wanted to see whether the corpus over-relies on Hindi and Marathi. As I suspected, it does: Khasi, Ladakhi, and Manipuri have little or no coverage, which will skew any cross-linguistic analysis.

03: Technical Barriers to Harvesting

I categorized the issues because “it didn’t work” is not actionable on its own. I needed to know whether something was a firewall problem, a date parsing bug, or an E-Paper situation, since each one requires a completely different approach to fix.

04: Temporal Depth and Efficiency

I included this because a source with 50,000 articles across a decade is very different from one with 50,000 articles from last month. I wanted to understand which sources were actually carrying the longitudinal weight and which ones looked bigger than they really were.


Chart 1: Overall Collection Status

Line of Inquiry 01: Collection Completeness

I built this donut chart first because I wanted an immediate high-level read on pipeline health before going into any of the detail charts. About 63% of all sources are marked Complete or Done, which is a reasonable baseline, but there are still meaningful gaps that would affect any analysis that assumes the corpus is complete. Five sources are completely blocked and I wanted that visible right away: Ajit hit a subscription wall, Poknapham and Dainik Sambad only publish image-based E-Papers, The Hindu sits behind a Cloudflare firewall, and Vanglaini returned empty files on every run. I also flagged the four in-progress sources separately because their article counts are still growing, which means every other chart in this report slightly understates their final numbers. The “Complete (Unverified)” slice specifically flags Esakal, which has enormous volume but had not yet been through a final quality check at the time of writing.
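The 63% figure can be reproduced from the Status column of the Full Source Table. The sketch below is in Python and is independent of the R code that generated this report. The counts are tallied by hand from the table, and treating Done, Complete, and Complete (Unverified) as "finished" is an assumption about how the headline figure was defined.

```python
# Status counts tallied by hand from the Full Source Table (30 sources in total).
status_counts = {
    "Done": 9,
    "Complete": 9,
    "Complete (Unverified)": 1,
    "In Progress": 4,
    "Failed": 1,
    "NA": 6,
}

# Assumption: these three states are what the headline figure counts as finished.
FINISHED = {"Done", "Complete", "Complete (Unverified)"}

total = sum(status_counts.values())
finished = sum(n for status, n in status_counts.items() if status in FINISHED)
share_complete = finished / total

print(f"{finished} of {total} sources finished ({share_complete:.0%})")
```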


Chart 2: Articles per Source

Line of Inquiry 02: Volume by Source and Language

I used a horizontal bar chart here because I wanted all the source names to be readable without overlapping, and sorting by article count makes it immediately obvious where the corpus weight sits. Prabhat Khabar (Hindi) towers above everything else with nearly 2 million articles, accounting for roughly 64% of the entire corpus on its own. I applied a log scale on the x-axis specifically to keep sources like Stawa (Ladakhi, 126 articles) visible in the same chart as Prabhat Khabar; on a linear scale the smaller sources would essentially disappear. The language color-coding was intentional too, so you can immediately see that the sources clustering at the far left are Khasi, Nepali, Ladakhi, and Mizo, which is a structural gap I knew would be a problem for any cross-linguistic work.
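To make the log-scale argument concrete, here is a small Python sketch (not part of the R pipeline) comparing how long the Stawa bar would be relative to the Prabhat Khabar bar on each scale. It assumes the log axis starts at 1 article, so bar length is proportional to log10 of the count; the actual axis range in the chart may differ.

```python
import math

# Article counts taken from the Full Source Table.
prabhat_khabar = 1_994_893
stawa = 126
corpus_total = 3_121_342

# Share of the whole corpus held by the largest source.
corpus_share = prabhat_khabar / corpus_total

# Length of the Stawa bar relative to the Prabhat Khabar bar on each scale.
# Assumption: the log axis starts at 1 article, i.e. at log10(1) = 0.
linear_ratio = stawa / prabhat_khabar
log_ratio = math.log10(stawa) / math.log10(prabhat_khabar)

print(f"Prabhat Khabar share of corpus: {corpus_share:.0%}")
print(f"Stawa bar, linear scale: {linear_ratio:.4%} of the longest bar")
print(f"Stawa bar, log scale:    {log_ratio:.0%} of the longest bar")
```

On a linear axis the Stawa bar is a small fraction of one percent of the longest bar, while on the log axis it is roughly a third of it, which is why it stays readable.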


Chart 3: Language Volume Treemap

Line of Inquiry 02: Language Equity in the Corpus

I added the treemap because I wanted a different view of the same language imbalance that the bar chart shows, but in a way where the relative proportions feel more visceral. When you see Hindi taking up most of the canvas and Khasi as a tiny sliver, that communicates the gap much more intuitively than numbers alone. I did this because I find that decision-makers respond better to area-based visualizations when the differences are this extreme. Marathi and Oriya hold their own mainly because Esakal and Sambad had clean pipelines and I could extract full archives. The near-invisible tiles for Khasi, Ladakhi, Nepali, and Mizo are not because those regions have no news, but because the digital archives were limited or blocked, and I wanted that distinction to be clear in the hover text.
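The tile areas in a treemap like this come from per-language totals. As a rough Python sketch (separate from the R code behind the chart), the aggregation could look like the following; the rows are copied from the Full Source Table, with zero-article sources left out.

```python
from collections import defaultdict

# (language, article count) pairs copied from the Full Source Table.
# Sources with zero articles are omitted because they add nothing to the totals.
rows = [
    ("Hindi", 1_994_893), ("Marathi", 550_233), ("Kannada", 196_615),
    ("Oriya", 126_587), ("English", 70_331), ("Hindi", 42_150),
    ("Khasi", 19_206), ("Telugu", 16_189), ("Bengali", 15_566),
    ("Hindi", 14_373), ("Gujarati", 13_628), ("Malayalam", 13_217),
    ("Hindi", 8_369), ("Marathi", 7_338), ("English", 6_383),
    ("Tamil", 6_133), ("Telugu", 4_457), ("Nepali", 3_524),
    ("Hindi", 3_265), ("English", 3_028), ("Hindi", 2_667),
    ("Hindi", 1_687), ("Hindi", 1_377), ("Ladakhi", 126),
]

# Sum article counts per language.
totals = defaultdict(int)
for language, articles in rows:
    totals[language] += articles

corpus_total = sum(totals.values())
shares = {lang: n / corpus_total for lang, n in totals.items()}

# Largest tiles first, as a treemap would lay them out.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {totals[lang]:>9,} {share:6.2%}")
```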


Chart 4: Temporal Coverage (Gantt)

Line of Inquiry 04: Temporal Depth

I made this Gantt-style chart because I wanted to show temporal coverage in a way that a simple table cannot. Once I started looking at the date ranges in the raw CSV, it was immediately clear that some sources cover a week and some cover a decade, and that difference completely changes how useful they are for longitudinal work. Prabhat Khabar is the standout here, with an archive starting in July 2013, which gives over a decade of Hindi-language news and makes it by far the most valuable source for anything time-series related. I also made sure the in-progress sources (orange bars) are shown with their current end dates so it is obvious their right edges will keep moving. Sources like Dainik Bhaskar MP (one week) and Anandabazar Patrika (a few days) came out this narrow because older archives were either behind paywalls or simply never digitized.
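The bar lengths in a coverage chart like this are just the difference between each source's first and last article dates. A minimal Python sketch (not the R code used for the chart), using a handful of rows from the Full Source Table:

```python
from datetime import date

# (source, first article date, last article date) from the Full Source Table.
coverage = [
    ("Prabhat Khabar", date(2013, 7, 1), date(2026, 2, 25)),
    ("Stawa", date(2020, 4, 12), date(2025, 11, 27)),
    ("Dainik Bhaskar (MP edition)", date(2026, 1, 14), date(2026, 1, 21)),
    ("Rajasthan Patrika", date(2026, 1, 31), date(2026, 2, 2)),
]


def span_days(start: date, end: date) -> int:
    """Number of days between the first and last collected article."""
    return (end - start).days


spans = {name: span_days(start, end) for name, start, end in coverage}

# Longest archives first.
for name, days in sorted(spans.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} {days:>5} days (~{days / 365.25:.1f} years)")
```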


Chart 5: Collection Barriers

Line of Inquiry 03: Why Collection Was Limited

I built this chart because I needed to communicate what went wrong, not just that things went wrong. I used a bubble chart that encodes the source count in both the y-axis position and the bubble size, which doubles up the signal and makes the most common barriers stand out immediately. Limited Archive being the most common issue was actually reassuring, because it means the scraper is working fine and there is simply nothing more to collect, which is a different problem from access being blocked. Access Blocked is the one I flagged as highest priority for follow-up, because Cloudflare and paywalls are the hardest to get past without institutional credentials or paid API access. I specifically called out the Date Parsing issue with Assam Tribune because every one of its articles shows up as “15 Sept 2010” in the raw data. The articles exist, but I cannot place them on a timeline, and that breaks any time-series analysis that touches that source. E-Paper Only sources are a separate category because they require OCR rather than standard scraping, and I wanted that distinction visible before anyone tries to rerun the pipeline.
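A date-parsing failure of the Assam Tribune kind, where every article collapses onto one date, is easy to screen for automatically. The Python sketch below is one possible check and is not part of the existing pipeline; the function names, the 0.95 threshold, and the sample values are all hypothetical.

```python
from collections import Counter


def dominant_date_share(date_strings):
    """Return the most frequent date string and the share of rows it covers."""
    if not date_strings:
        return None, 0.0
    counts = Counter(date_strings)
    value, n = counts.most_common(1)[0]
    return value, n / len(date_strings)


def looks_stuck(date_strings, threshold=0.95):
    """Flag a source whose dates collapse onto one value.

    The threshold is an assumption; tune it to the publication rhythm.
    """
    _, share = dominant_date_share(date_strings)
    return share >= threshold


# Hypothetical samples, not real rows from the corpus.
stuck_sample = ["15 Sept 2010"] * 50
healthy_sample = ["2025-11-07", "2025-11-08", "2025-11-08", "2025-11-09"]

print(looks_stuck(stuck_sample))
print(looks_stuck(healthy_sample))
```

Running a check like this per source before building any timeline would surface stuck dates early, instead of discovering them in a chart.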


Chart 6: Collection Efficiency

Line of Inquiry 04: Temporal Depth and Efficiency

I made this scatter plot because I wanted to look at efficiency, not just volume. A source can look impressive in the bar chart simply because its scrape covers a long period, and this plot shows whether its articles per day actually justify that impression. Plotting archive span on the x-axis against article count on a log y-axis makes it easy to see who sits in the top-right corner, which is where the genuinely useful sources live. Prabhat Khabar is the clear outlier at roughly 4,600 days of coverage and nearly 2 million articles, which works out to about 430 articles per day, far above anything else in the dataset. Sources clustering in the bottom-left are the ones I am most concerned about for longitudinal studies, since they either have very short archives or restricted access that kept the harvest shallow regardless of how long the scraper ran. I used a log scale on the y-axis here for the same reason as in Chart 2: otherwise the smaller sources would be invisible.
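The articles-per-day figure is the article count divided by the archive span in days. The Python sketch below (independent of the R code behind the plot) works it through for Prabhat Khabar and, for contrast, for Stawa, using dates and counts from the Full Source Table.

```python
from datetime import date


def articles_per_day(start: date, end: date, articles: int) -> float:
    """Average number of collected articles per day of archive span."""
    span = (end - start).days
    return articles / span


# Figures from the Full Source Table.
rate_prabhat_khabar = articles_per_day(date(2013, 7, 1), date(2026, 2, 25), 1_994_893)
rate_stawa = articles_per_day(date(2020, 4, 12), date(2025, 11, 27), 126)

print(f"Prabhat Khabar: {rate_prabhat_khabar:.1f} articles/day")
print(f"Stawa:          {rate_stawa:.3f} articles/day")
```

Stawa has a multi-year span but well under one article per day, which is the kind of source that looks long on the coverage chart yet carries little longitudinal weight.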


Full Source Table

| Source | Language | Start | End | Articles | Status | Issue Types |
|---|---|---|---|---|---|---|
| Prabhat Khabar | Hindi | 2013-07-01 | 2026-02-25 | 1,994,893 | Done | Date Parsing |
| Esakal | Marathi | 2023-05-20 | 2026-02-16 | 550,233 | Complete (Unverified) | No Major Issue |
| Vijaya Karnataka | Kannada | 2020-01-01 | 2025-12-31 | 196,615 | Complete | File Corruption |
| Sambad | Oriya | 2022-04-15 | 2026-02-27 | 126,587 | Complete | Date Parsing |
| Rising Kashmir | English | 2022-05-25 | 2025-12-18 | 70,331 | Complete | Access Blocked, Date Parsing |
| Amar Ujala | Hindi | 2024-06-29 | 2025-09-04 | 42,150 | Done | No Major Issue |
| U Nongsain Hima | Khasi | 2023-02-10 | 2025-09-09 | 19,206 | Done | No Major Issue |
| Namasthe Telangana | Telugu | 2023-01-07 | 2024-06-29 | 16,189 | In Progress | Content Noise |
| Anandabazar Patrika | Bengali | 2026-02-09 | 2026-02-11 | 15,566 | In Progress | No Major Issue |
| Dainik Jagran (Uttarakhand) | Hindi | 2025-08-13 | 2026-03-25 | 14,373 | Done | Date Parsing |
| Sandesh | Gujarati | 2022-03-01 | 2026-03-21 | 13,628 | In Progress | No Major Issue |
| Malayala Manorama | Malayalam | 2024-07-09 | 2026-02-26 | 13,217 | Complete | Access Blocked, Date Parsing |
| Haribhoomi | Hindi | 2024-01-01 | 2026-02-05 | 8,369 | Complete | Date Parsing |
| Gomantak | Marathi | 2020-01-01 | 2025-07-15 | 7,338 | Complete | No Major Issue |
| Assam Tribune | English | 2025-11-07 | NA | 6,383 | Complete | Date Parsing, Content Noise |
| Daily Thanti | Tamil | 2022-02-11 | 2025-12-17 | 6,133 | Complete | Date Parsing, E-Paper Only, Limited Archive |
| Eenadu | Telugu | 2025-12-07 | 2026-02-19 | 4,457 | Complete | Limited Archive |
| Himalayan Darpan | Nepali | 2025-04-09 | 2026-02-04 | 3,524 | Done | No Major Issue |
| Navbharat Times | Hindi | 2025-01-02 | 2026-01-15 | 3,265 | In Progress | Date Parsing, Limited Archive |
| Nagaland Post | English | 2025-05-26 | 2026-01-14 | 3,028 | Done | Limited Archive |
| Dainik Bhaskar (MP edition) | Hindi | 2026-01-14 | 2026-01-21 | 2,667 | NA | Limited Archive |
| Punjab Kesari | Hindi | 2025-01-10 | 2026-01-14 | 1,687 | Done | Limited Archive |
| Rajasthan Patrika | Hindi | 2026-01-31 | 2026-02-02 | 1,377 | Done | No Major Issue |
| Stawa | Ladakhi | 2020-04-12 | 2025-11-27 | 126 | Done | No Major Issue |
| Ajit | Punjabi | NA | NA | 0 | Failed | Access Blocked, Date Parsing |
| Echoes of Arunachal | English | NA | NA | 0 | NA | No Major Issue |
| Poknapham | Manipuri (Meitei) | NA | NA | 0 | NA | E-Paper Only |
| Vanglaini | Mizo | NA | NA | 0 | NA | File Corruption |
| Dainik Sambad | Bengali | NA | NA | 0 | NA | E-Paper Only |
| The Hindu | English | NA | NA | 0 | NA | Access Blocked |

Generated with R · plotly · knitr · RMarkdown · Source: data description(Sheet1).csv