Lessons from Diagram Chasing
https://aman.bh/to/indiafoss
Government website hoarder
Occasional open data publisher
Fan of maps (including OpenStreetMap)
Designer, programmer
Maps, data journalism, public technology enthusiast
Has too many ideas
Drainage patterns in Bangalore as a convenient map, with historical context.
Easy and intuitive explorer for browsing affidavits and parliamentary activity of elected MPs.

Building on the previous project, we analyzed election candidate names to find “namesakes” who could potentially have flipped the election.
How do you find more namesakes like S Veeramani and S .V Ramani?
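One possible way to approach this question is fuzzy string matching. The sketch below is an assumption, not necessarily what the project used: it strips punctuation and spacing from names and compares them by Levenshtein edit distance. The function names and the default threshold of 2 are hypothetical.

```javascript
// Reduce a name to lowercase letters only, so "S .V Ramani" becomes "svramani".
function normalizeName(name) {
  return name.toLowerCase().replace(/[^a-z]/g, '');
}

// Classic Levenshtein distance, keeping only the previous row of the table.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical rule: two names are namesake candidates if they are
// within maxDistance edits of each other after normalization.
function looksLikeNamesake(a, b, maxDistance = 2) {
  return editDistance(normalizeName(a), normalizeName(b)) <= maxDistance;
}
```

With the default threshold, looksLikeNamesake("S Veeramani", "S .V Ramani") returns true, since the normalized names differ by two deleted letters. A real pipeline would likely need a review step, because a small edit distance also catches unrelated short names.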
This is what 0.0001% of the National Time Use Survey looks like for the average user
Usually the answer is somewhat.
Open data that’s technically available but practically inaccessible to most people who could benefit from it.
We cleaned up the dataset and made it so that you can answer any query without code.
Run complex aggregations and SQL queries all in-browser in a GUI.



It’s never been easier to spin up a dashboard for your open data with a few prompts.

It’s also never been easier for your audience to ignore yet another dashboard.

Each film leads you to others like it, creating links between 18,000 movies based on censorship

Normally the only way to find a certificate for a movie is to go around searching for something like this in a theatre
On opening the URL, the certificate is displayed with the list of cuts made to the film.

And what is better, a URL with numbers that look like they could make sense.
100090292400000155
100090292400000001
100090292300000001
100080202300000001
What could these numbers mean?
1000: appears everywhere, probably a prefix.
These sets of digits appear in pairs; we’ll figure them out later.
90 appears in most examples, 80 in some examples. Maybeeee a code?
Clear year pattern: 2924 → 2024, 2923 → 2023, 2023 → 2023.
The final digits represent the sequential certificate number for that office and period.
Now we could guess more i values, so we kept changing them in the URL. Sometimes you get a 404, but other times something loads!
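The guesses above can be written down as a small helper. This is only a sketch: the field widths (4, 2, 2, 2, 8) are inferred from the four sample numbers, the meaning of the middle two digits is unknown, and the function and field names are hypothetical.

```javascript
// Assumed layout, inferred only from the four sample IDs:
// prefix(4) | code(2) | unknown(2) | year(2) | sequence(8)
function splitCertificateId(id) {
  if (!/^\d{18}$/.test(id)) {
    throw new Error('Expected an 18-digit ID, got: ' + id);
  }
  return {
    prefix: id.slice(0, 4),
    code: id.slice(4, 6),
    unknown: id.slice(6, 8), // not yet understood
    year: 2000 + Number(id.slice(8, 10)),
    sequence: Number(id.slice(10)),
  };
}

// Produce the next `count` IDs by incrementing only the sequence part.
function nextCandidateIds(id, count) {
  const head = id.slice(0, 10);
  const start = Number(id.slice(10));
  return Array.from({ length: count }, (_, k) =>
    head + String(start + k + 1).padStart(8, '0')
  );
}
```

Each candidate still has to be tried against the site; as noted above, many of them will simply return a 404.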
How do you get it out of the page?
If a website is loading data in your browser, you can probably scrape it.
Use the Network Tab to see what is being loaded in the browser
HAR is a convenient format to record the API calls for a page

The HAR file contains all HTTP requests made when the page loaded, including any headers, session IDs, cookies; whatever the site needs to fetch data.
For us, it looked something like this:
There are many ways to go from a HAR file to a functional scraper
One quick way is to feed it to an LLM and prompt it to write a scraper
Works surprisingly well most of the time!
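Before handing a HAR file to anything, it can help to list which requests actually returned data. The sketch below assumes the standard HAR 1.2 layout (log.entries, each with request and response.content.mimeType). The helper name and the sample object are made up for illustration; they are not the real CBFC requests.

```javascript
// Keep only the requests whose response was JSON, with the
// method, URL and headers needed to replay them.
function listJsonRequests(har) {
  return har.log.entries
    .filter((entry) => (entry.response.content.mimeType || '').includes('json'))
    .map((entry) => ({
      method: entry.request.method,
      url: entry.request.url,
      headers: Object.fromEntries(
        entry.request.headers.map((h) => [h.name, h.value])
      ),
    }));
}

// Hypothetical, heavily trimmed HAR for demonstration.
const sampleHar = {
  log: {
    entries: [
      {
        request: { method: 'GET', url: 'https://example.org/index.html', headers: [] },
        response: { content: { mimeType: 'text/html' } },
      },
      {
        request: {
          method: 'POST',
          url: 'https://example.org/api/certificate',
          headers: [{ name: 'Cookie', value: 'session=abc' }],
        },
        response: { content: { mimeType: 'application/json' } },
      },
    ],
  },
};

const apiCalls = listJsonRequests(sampleHar);
```

In a real run the HAR would be loaded with JSON.parse from the file exported by the Network tab, and the surviving entries are the ones worth turning into scraper requests.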

Soon enough, we had a functional scraper


This data was still unusable because the information that users cared about was hidden in piles of text and timestamps with no context.
Manual Classification

If we want to answer questions that interest others, those questions should come from our own interests and curiosity. If we don’t care, why would you?
But then why use it here?
Large Language Models
Detailed prompts + Edge case examples = Text categorization that also cleans up messy content for better readability.
Results:
01:32:59:00 Replaced the whole V.O. stanza about caste system of Manu Maharaj. Aabadi hain Aabad .Aur unka jivan sarthak hoga To Aabadi hain Aabad nahiazhadi ki
Clean Description: Replaced a voice-over passage discussing the caste system.
Categories: - REPLACEMENT - TEXT DIALOGUE - IDENTITY REFERENCE
Topic: CASTE
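A rough sketch of the "detailed prompt + edge-case examples" idea follows. Everything here is an assumption: the label list contains only the three categories visible in the example above (the project's full taxonomy is not shown), the JSON reply shape is hypothetical, and the call to the model itself is left out.

```javascript
// Hypothetical label set: only the categories that appear in the slide's example.
const ALLOWED_CATEGORIES = ['REPLACEMENT', 'TEXT DIALOGUE', 'IDENTITY REFERENCE'];

// Assemble a prompt from instructions, worked examples and the raw cut text.
function buildPrompt(rawCut, examples) {
  const shots = examples
    .map((ex) => `Input: ${ex.input}\nOutput: ${JSON.stringify(ex.output)}`)
    .join('\n\n');
  return [
    'Classify the censorship cut below. Reply with JSON only:',
    '{"description": string, "categories": string[], "topic": string}',
    `Allowed categories: ${ALLOWED_CATEGORIES.join(', ')}`,
    shots,
    `Input: ${rawCut}\nOutput:`,
  ].join('\n\n');
}

// Parse the model's reply and reject labels outside the allowed set,
// so unexpected labels are caught before they reach the dataset.
function parseReply(replyText) {
  const reply = JSON.parse(replyText);
  const unknown = reply.categories.filter((c) => !ALLOWED_CATEGORIES.includes(c));
  if (unknown.length > 0) {
    throw new Error('Unexpected categories: ' + unknown.join(', '));
  }
  return reply;
}
```

Validating the reply matters as much as the prompt: a strict allow-list is a cheap way to notice when the model invents a category.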
80% of the people will not go through this CSV.
For us, exposing people to what we found exciting is important.
Otherwise, this dataset is futile.
How do we communicate this excitement?
Can we think from the user’s point of view? What can be a hook for them?
Find movies of their interest
Search random keywords
Share with others
Browse and roam through this dataset
Why would you not want your users to share things?
Deeplinks everywhere.
https://cbfc.watch/film/sinners-2025
https://cbfc.watch/browse/actors/fahadh-faasil
https://cbfc.watch/browse/content/religious
https://cbfc.watch/search?q=maps+language%3AEnglish
Since we needed almost every field of the metadata to be searchable, we needed to come up with a solution to minimize the UI surface.
This is bad

Let users make queries like they would in Google; we’ll do the complex operations.
Typesense doesn’t come with this, so we built a parser ourselves.
function parseQuery(input) {
  // Find field:value patterns. A value is either a "quoted phrase"
  // or a run of non-space characters (wildcards and operators included).
  const fieldPattern = /(\w+):("[^"]*"|\S+)/g;
  const fieldQueries = [];
  let textQuery = input;
  let match;
  while ((match = fieldPattern.exec(input)) !== null) {
    const [fullMatch, field, value] = match;
    let operator = '=';
    let processedValue = value;
    if (value.startsWith('"')) {
      processedValue = value.slice(1, -1); // exact match: strip the quotes
    } else if (value.endsWith('*')) {
      operator = 'CONTAINS'; // wildcard search
      processedValue = value.slice(0, -1);
    } else if (/^[><=]+/.test(value)) {
      operator = value.match(/^([><=]+)/)[1]; // comparison
      processedValue = parseFloat(value.replace(/^[><=]+/, ''));
    }
    fieldQueries.push({ field, operator, value: processedValue });
    // Remove the field:value token, leaving only the free-text part
    textQuery = textQuery.replace(fullMatch, '').replace(/\s+/g, ' ').trim();
  }
  return { textQuery, fieldQueries };
}
Most dashboards, including government ones, pretend to be the final answer. LOOK NO BEYOND ME!
Instead,
Think about the various kinds of users your data might attract. Can you make their lives easier?
Show them one way to slice the data. Get them thinking of more!

We extensively document all data releases because we want users to use this data.
No guesswork.

But good design, in all its forms, becomes evident when you’re an end-user yourself.
Well-designed tools make the process of accessing data very inviting.
https://vahan.parivahan.gov.in/vahan4dashboard/
https://india-vehicle-stats.pages.dev/KA/ALL/2025?name=BYD+INDIA+PRIVATE+LIMITED
FlatGithub, HyParquet let people see the data without downloading.
Static sites and web-apps mean you get to move on with your life.
Examples: SvelteKit for apps • PMTiles + MapLibre for maps • DuckDB-WASM for data queries • Observable Plot for charts • Pyodide/R-WASM for data analysis and notebooks
This includes everything from just creating a set of CSVs to explorers.
Especially when designing applications and dashboards:
Design for sharing.
If users can’t link to it, they can’t discuss it
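One common way to make a view linkable is to keep its state in the URL's query string. The sketch below uses the standard URL and URLSearchParams APIs; the helper names and the example.org base URL are hypothetical and are not taken from cbfc.watch.

```javascript
// Serialize the current view (search text, filters, page...) into a shareable URL.
function stateToUrl(base, state) {
  const url = new URL(base);
  for (const [key, value] of Object.entries(state)) {
    // Skip empty values so links stay short.
    if (value !== undefined && value !== null && value !== '') {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

// Restore the view from a URL someone shared. All values come back as strings.
function urlToState(href) {
  return Object.fromEntries(new URL(href).searchParams.entries());
}

const link = stateToUrl('https://example.org/search', {
  q: 'maps language:English',
  page: 2,
});
const restored = urlToState(link);
```

Because the state round-trips through the URL, every search or filter a user lands on is something they can paste into a chat or a post.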
Documentation is what saves your data from being ignored.
Write data dictionaries, document sample analysis, create basic charts, explain edge-cases and limitations upfront.
Invite people in.
Extract data from web APIs and endpoints
Tools: Network tab, HAR files
Clean and structure the raw data
Tools: BeautifulSoup, regex
Join, analyze, and prepare final datasets
Tools: Pandas, DuckDB
Individual datasets have limited utility; good context comes from joining multiple sources.

Aesthetic quality signals care and invites exploration and sharing.
Good design makes your work memorable in a sea of utilitarian dashboards.
For us, this will be fully achieved when we are able to create educational resources around the project in addition to the output.

Thanks for listening!
diagramchasing.fun