I started here because before doing any content analysis, I needed to know how much of the pipeline was actually finished. Tracking each source across five completion states let me figure out where the real gaps were and whether the corpus was even ready to use downstream.
I looked at this because language balance is fundamental to a regional news project. I wanted to see whether we were over-relying on Hindi and Marathi. As I suspected, languages like Khasi, Ladakhi, and Manipuri have almost no coverage, which would skew any cross-linguistic analysis.
I categorized the issues because “it didn’t work” is not actionable on its own. I needed to know whether something was a firewall problem, a date parsing bug, or an E-Paper situation, since each one requires a completely different approach to fix.
I included this because a source with 50,000 articles across a decade is very different from one with 50,000 articles from last month. I wanted to understand which sources were actually carrying the longitudinal weight and which ones looked bigger than they really were.
I built this donut chart first because I wanted an immediate high-level read on pipeline health before going into any of the detail charts. About 63% of all sources are fully collected or marked Done, which is a reasonable baseline, but the remaining gaps are large enough to affect any analysis that assumes the corpus is complete. Five sources are completely blocked, and I wanted that visible right away: Ajit hit a subscription wall, Poknapham and Dainik Sambad publish only image-based E-Papers, The Hindu sits behind a Cloudflare firewall, and Vanglaini returned empty files on every run. I also flagged the NA/In Progress sources separately because their article counts are still growing, which means every other chart in this report slightly understates their final numbers. The “Complete (Unverified)” slice specifically flags Esakal, which has enormous volume but had not yet been through a final quality check at the time of writing.
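The 63% figure can be reproduced from the summary table at the end of this report. The following is a minimal Python sketch, not the R code behind the chart. The status counts were tallied by hand from that table, and treating Done, Complete, and Complete (Unverified) as "collected" is an assumption about how the slices were grouped.

```python
# Status counts tallied by hand from the summary table in this report (30 sources).
status_counts = {
    "Done": 9,
    "Complete": 9,
    "Complete (Unverified)": 1,
    "In Progress": 4,
    "NA": 6,
    "Failed": 1,
}

# Assumption: these three statuses are what "fully collected or marked Done" covers.
COLLECTED = {"Done", "Complete", "Complete (Unverified)"}


def collected_share(counts, collected=COLLECTED):
    """Fraction of sources whose status is in the collected set."""
    total = sum(counts.values())
    done = sum(n for status, n in counts.items() if status in collected)
    return done / total


share = collected_share(status_counts)
print(f"{share:.1%} of {sum(status_counts.values())} sources")
```

Dropping Complete (Unverified) from the set gives 18 of 30 sources, or 60%, so the headline number depends on whether Esakal is counted.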
I used a horizontal bar chart here because I wanted to be able to read all the source names without them overlapping, and sorting by article count makes it immediately obvious where the corpus weight sits. Prabhat Khabar (Hindi) towers above everything else with nearly 2 million articles, accounting for roughly 64% of the entire corpus on its own. I applied a log scale on the x-axis specifically to keep sources like Stawa (Ladakhi, 126 articles) visible in the same chart as Prabhat Khabar; on a linear scale the smaller sources would essentially disappear. The language color-coding was intentional too, so you can immediately see that the sources clustering at the far left are Khasi, Nepali, Ladakhi, and Mizo, which is a structural gap I knew would be a problem for any cross-linguistic work.
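To make the log-scale argument concrete, here is a small Python sketch comparing how long Stawa's bar would be relative to Prabhat Khabar's on each kind of axis. It is an illustration only: the helper names are hypothetical, and the log axis is assumed to start at 1 article.

```python
import math

# Article counts taken from the summary table in this report.
largest = 1_994_893   # Prabhat Khabar
smallest = 126        # Stawa


def linear_fraction(value, max_value):
    """Bar length as a fraction of the longest bar on a linear axis starting at 0."""
    return value / max_value


def log_fraction(value, max_value, axis_min=1):
    """Bar length as a fraction of the longest bar on a log10 axis starting at axis_min."""
    low = math.log10(axis_min)
    return (math.log10(value) - low) / (math.log10(max_value) - low)


lin = linear_fraction(smallest, largest)
log = log_fraction(smallest, largest)
print(f"linear axis: {lin:.4%} of the longest bar")
print(f"log axis:    {log:.1%} of the longest bar")
```

On the linear axis the Stawa bar is well under a hundredth of a percent of the longest bar, which would not render as a visible mark. On the log axis it is about a third as long.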
I added the treemap because I wanted a different view of the same language imbalance that the bar chart shows, but in a way where the relative proportions feel more visceral. Seeing Hindi take up most of the canvas and Khasi reduced to a tiny sliver communicates the gap much more intuitively than numbers alone, and I find that decision-makers respond better to area-based visualizations when the differences are this extreme. Marathi and Oriya hold their own mainly because Esakal and Sambad had clean pipelines and I could extract full archives. The tiles for Khasi, Ladakhi, Nepali, and Mizo are nearly invisible not because those regions have no news, but because the digital archives were limited or blocked, and I wanted that distinction to be clear in the hover text.
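The tile areas in a treemap like this are proportional to each language's share of the corpus. As a rough Python sketch (not the code that produced the chart), the shares can be recomputed from the summary table. Only sources with at least one article are included, since zero-count sources have no area to draw.

```python
from collections import defaultdict

# (language, article count) for every source with at least one article,
# copied from the summary table in this report.
rows = [
    ("Hindi", 1_994_893), ("Marathi", 550_233), ("Kannada", 196_615),
    ("Oriya", 126_587), ("English", 70_331), ("Hindi", 42_150),
    ("Khasi", 19_206), ("Telugu", 16_189), ("Bengali", 15_566),
    ("Hindi", 14_373), ("Gujarati", 13_628), ("Malayalam", 13_217),
    ("Hindi", 8_369), ("Marathi", 7_338), ("English", 6_383),
    ("Tamil", 6_133), ("Telugu", 4_457), ("Nepali", 3_524),
    ("Hindi", 3_265), ("English", 3_028), ("Hindi", 2_667),
    ("Hindi", 1_687), ("Hindi", 1_377), ("Ladakhi", 126),
]


def language_shares(rows):
    """Sum article counts per language and return each language's corpus share."""
    totals = defaultdict(int)
    for language, count in rows:
        totals[language] += count
    grand_total = sum(totals.values())
    return {language: n / grand_total for language, n in totals.items()}


shares = language_shares(rows)
for language, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{language:<10} {share:7.3%}")
```

With these figures Hindi accounts for roughly two thirds of all articles, while Khasi, Nepali, and Ladakhi each fall below 1%.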
I made this Gantt-style chart because I wanted to show temporal coverage in a way that a simple table cannot. Once I started looking at the date ranges in the raw CSV, it was immediately clear that some sources cover a week and some cover more than a decade, and that difference completely changes how useful they are for longitudinal work. Prabhat Khabar is the standout here, with an archive starting in July 2013, which gives over a decade of Hindi-language news and makes it by far the most valuable source for anything time-series related. I also made sure the in-progress sources (orange bars) are shown with their current end dates so it is obvious their right edges will keep moving. Sources like Dainik Bhaskar MP (one week) and Anandabazar Patrika (a few days) came out this narrow because older archives were either behind paywalls or simply never digitized.
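The bar lengths in a chart like this are just end date minus start date. A minimal Python sketch follows, using three rows from the summary table. The 30-day cut-off for calling an archive "short" is a hypothetical threshold chosen for illustration, not one used in the report.

```python
from datetime import date

# Start and end dates copied from the summary table in this report.
coverage = {
    "Prabhat Khabar": (date(2013, 7, 1), date(2026, 2, 25)),
    "Stawa": (date(2020, 4, 12), date(2025, 11, 27)),
    "Dainik Bhaskar (MP edition)": (date(2026, 1, 14), date(2026, 1, 21)),
}

SHORT_SPAN_DAYS = 30  # hypothetical cut-off for a "short" archive


def span_days(start, end):
    """Number of days between the first and last article dates."""
    return (end - start).days


for name, (start, end) in coverage.items():
    days = span_days(start, end)
    label = "short" if days < SHORT_SPAN_DAYS else "long"
    print(f"{name}: {days} days ({label})")
```

Prabhat Khabar comes out at 4,622 days and the Dainik Bhaskar MP edition at 7, which matches the decade-versus-week contrast described above.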
I built this chart because I needed to communicate the breakdown of what went wrong, not just that things went wrong. I used a bubble chart so that the source count is encoded in both the y-axis position and the bubble size, which doubles up the signal and makes the most common barriers pop immediately. Limited Archive being the most common issue was actually reassuring in a way, because it means the scraper is working fine and there is simply nothing more to collect, which is a different problem from access being blocked. Access Blocked is the one I flagged as highest priority for follow-up because Cloudflare and paywalls are the hardest to get past without institutional credentials or paid API access. I specifically called out the Date Parsing issue with Assam Tribune because every one of their articles shows up as “15 Sept 2010” in the raw data, which means the articles exist but I cannot place them on a timeline, and that completely breaks any time-series analysis that touches that source. E-Paper Only sources are a separate category because they require OCR rather than standard scraping, and I wanted that distinction visible before anyone tries to rerun the pipeline.
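A date-parsing failure of the Assam Tribune kind, where every article carries the same date, can be caught with a simple check on how much of a source is covered by its most common date. This Python sketch is one possible approach and not part of the original pipeline. The function names, the 95% threshold, the minimum row count, and the sample data are all hypothetical.

```python
from collections import Counter


def dominant_date_share(dates):
    """Return (most common date, share of rows carrying it) for a list of date strings."""
    if not dates:
        return None, 0.0
    value, count = Counter(dates).most_common(1)[0]
    return value, count / len(dates)


def looks_like_parser_fallback(dates, threshold=0.95, min_rows=50):
    """Flag a source when one date covers nearly all rows (hypothetical thresholds)."""
    _, share = dominant_date_share(dates)
    return len(dates) >= min_rows and share >= threshold


# Hypothetical sample rows for illustration, not real scraped data.
broken = ["2010-09-15"] * 200
healthy = [f"2025-11-{day:02d}" for day in range(1, 29)] * 5

print(looks_like_parser_fallback(broken))   # True
print(looks_like_parser_fallback(healthy))  # False
```

The minimum row count keeps very small sources from being flagged just because they happened to publish everything on one day.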
I made this scatter plot because I wanted to look at efficiency, not just volume. A source can look impressive in the bar chart simply because its scrape ran for a long time, and this plot exposes whether its articles-per-day rate actually justifies that impression. Plotting archive span on the x-axis against article count on a log y-axis lets you immediately see who is in the top-right corner, which is where the genuinely useful sources live. Prabhat Khabar is the clear outlier at roughly 4,600 days of coverage and nearly 2 million articles, which works out to about 430 articles per day, far above anything else in the dataset. Sources clustering in the bottom-left are the ones I am most concerned about for longitudinal studies, since they either have very short archives or restricted access that kept the harvest shallow regardless of how long the scraper ran. I used a log scale on the y-axis here for the same reason as in Chart 2: otherwise the smaller sources would be invisible.
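The articles-per-day figure quoted above follows directly from the summary table. Here is a short Python sketch of the arithmetic. It assumes the span is counted as end date minus start date (not inclusive of both endpoints), and the function name is illustrative.

```python
from datetime import date


def articles_per_day(articles, start, end):
    """Average number of articles per day of archive span."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be later than start")
    return articles / days


# Prabhat Khabar: values copied from the summary table in this report.
rate = articles_per_day(1_994_893, date(2013, 7, 1), date(2026, 2, 25))
print(f"{rate:.1f} articles per day")
```

This gives a rate between 431 and 432 articles per day over 4,622 days, consistent with the rounded "about 430" and "roughly 4,600 days" in the text.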
| Source | Language | Start | End | Articles | Status | Issue Types |
|---|---|---|---|---|---|---|
| Prabhat Khabar | Hindi | 2013-07-01 | 2026-02-25 | 1,994,893 | Done | Date Parsing |
| Esakal | Marathi | 2023-05-20 | 2026-02-16 | 550,233 | Complete (Unverified) | No Major Issue |
| Vijaya Karnataka | Kannada | 2020-01-01 | 2025-12-31 | 196,615 | Complete | File Corruption |
| Sambad | Oriya | 2022-04-15 | 2026-02-27 | 126,587 | Complete | Date Parsing |
| Rising Kashmir | English | 2022-05-25 | 2025-12-18 | 70,331 | Complete | Access Blocked, Date Parsing |
| Amar Ujala | Hindi | 2024-06-29 | 2025-09-04 | 42,150 | Done | No Major Issue |
| U Nongsain Hima | Khasi | 2023-02-10 | 2025-09-09 | 19,206 | Done | No Major Issue |
| Namasthe Telangana | Telugu | 2023-01-07 | 2024-06-29 | 16,189 | In Progress | Content Noise |
| Anandabazar Patrika | Bengali | 2026-09-02 | 2026-11-02 | 15,566 | In Progress | No Major Issue |
| Dainik Jagran (Uttarakhand) | Hindi | 2025-08-13 | 2026-03-25 | 14,373 | Done | Date Parsing |
| Sandesh | Gujarati | 2022-03-01 | 2026-03-21 | 13,628 | In Progress | No Major Issue |
| Malayala Manorama | Malayalam | 2024-07-09 | 2026-02-26 | 13,217 | Complete | Access Blocked, Date Parsing |
| Haribhoomi | Hindi | 2024-01-01 | 2026-02-05 | 8,369 | Complete | Date Parsing |
| Gomantak | Marathi | 2020-01-01 | 2025-07-15 | 7,338 | Complete | No Major Issue |
| Assam Tribune | English | 2025-11-07 | NA | 6,383 | Complete | Date Parsing, Content Noise |
| Daily Thanti | Tamil | 2022-02-11 | 2025-12-17 | 6,133 | Complete | Date Parsing, E-Paper Only, Limited Archive |
| Eenadu | Telugu | 2025-12-07 | 2026-02-19 | 4,457 | Complete | Limited Archive |
| Himalayan Darpan | Nepali | 2025-04-09 | 2026-02-04 | 3,524 | Done | No Major Issue |
| Navbharat Times | Hindi | 2025-01-02 | 2026-01-15 | 3,265 | In Progress | Date Parsing, Limited Archive |
| Nagaland Post | English | 2025-05-26 | 2026-01-14 | 3,028 | Done | Limited Archive |
| Dainik Bhaskar (MP edition) | Hindi | 2026-01-14 | 2026-01-21 | 2,667 | NA | Limited Archive |
| Punjab Kesari | Hindi | 2025-01-10 | 2026-01-14 | 1,687 | Done | Limited Archive |
| Rajasthan Patrika | Hindi | 2026-01-31 | 2026-02-02 | 1,377 | Done | No Major Issue |
| Stawa | Ladakhi | 2020-04-12 | 2025-11-27 | 126 | Done | No Major Issue |
| Ajit | Punjabi | NA | NA | 0 | Failed | Access Blocked, Date Parsing |
| Echoes of Arunachal | English | NA | NA | 0 | NA | No Major Issue |
| Poknapham | Manipuri (Meitei) | NA | NA | 0 | NA | E-Paper Only |
| Vanglaini | Mizo | NA | NA | 0 | NA | File Corruption |
| Dainik Sambad | Bengali | NA | NA | 0 | NA | E-Paper Only |
| The Hindu | English | NA | NA | 0 | NA | Access Blocked |
Generated with R · plotly · knitr · RMarkdown · Source: data description(Sheet1).csv