Data Design by Dialogue - VizChitra 2025
Key points from S Anand's "Data Design by Dialogue - VizChitra 2025" talk:
- LLMs for Data Democratisation: Large Language Models (LLMs) are rapidly improving, evolving from the intelligence level of an 8th-grade student to roughly a postgraduate in two years. They are seen as the single biggest democratisation of data insights since the advent of open data. LLMs can assist in every step of the data-to-story value chain, including engineering, analysis, and visualisation.
- Experimentation and Delegation are Key:
  - Due to the "jagged edge" nature of LLMs (highly capable in some areas, less so in others), experimentation is crucial.
  - The speaker advocates delegating tasks to LLMs at every stage of data processing. Instead of trying to solve problems yourself, ask the LLM to fix them or perform the next step.
- Handling LLM Mistakes and Reliability (a retry-loop sketch of these techniques follows this list):
  - LLMs make mistakes, "kind of like humans". When an LLM makes a mistake, the next attempt might not repeat it.
  - Technique 1: If an error occurs, tell the LLM to "fix it".
  - Technique 2: If the first attempt fails, simply re-run the same prompt or "throw it away and redo it". LLMs can go wrong in independent ways, so a re-run may yield a different, correct result.
  - If a task still doesn't work after three attempts, add it to an "impossibility list" and revisit it later, as models are constantly improving.
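These two techniques are easy to combine into a retry loop. Below is a minimal sketch using the OpenAI Python SDK; the model name, the caller-supplied `looks_ok` check, and the loop structure are illustrative assumptions, not code from the talk.

```python
from openai import OpenAI

client = OpenAI()
MAX_ATTEMPTS = 3  # after three failures, park the task on the "impossibility list"

def ask(messages):
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages)
    return resp.choices[0].message.content

def solve(task, looks_ok):
    for _ in range(MAX_ATTEMPTS):
        # Technique 2: a fresh, independent run each time -- LLM errors are often
        # uncorrelated, so redoing the same prompt can succeed where the last run failed.
        messages = [{"role": "user", "content": task}]
        answer = ask(messages)
        if looks_ok(answer):
            return answer
        # Technique 1: keep the failed answer in context and just say "fix it".
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "That has an error. Fix it."},
        ]
        answer = ask(messages)
        if looks_ok(answer):
            return answer
    return None  # add to the impossibility list and revisit as models improve
```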
- LLMs Excel at Code Generation:
  - This capability enables anyone to write programs in English, making coding far more accessible.
  - The cost of generating complex programs (e.g., end-to-end topic modeling) can be remarkably low, often less than a dollar even for jobs with significant execution time.
- Effective Prompting Strategies:
  - Talk more than type: Anand finds that speaking lets him be more verbose, and therefore more efficient, than typing when working with an LLM.
  - Start a new chat if the LLM becomes confused.
  - When processing text, explicitly ask the LLM to write code for scraping or analysis, rather than expecting it to convert text to structured form directly.
  - Never ask for just one output: ask for "a dozen" analyses or stories, even if the LLM initially suggests fewer. This helps uncover diverse insights and offloads work.
  - Let the LLM generate hypotheses: avoid injecting your own biases by letting the LLM come up with the initial hypotheses.
  - Do as little as possible: many prompt-engineering tricks (like "emotion prompting") are temporary; as models improve, they become unnecessary.
- Practical Applications Demonstrated:
  - Scraping data from WhatsApp: pasting JavaScript generated by ChatGPT into the browser's developer tools.
  - Data Quality Checks: asking the LLM to identify missing values by writing and executing a program.
  - Data Imputation: instructing the LLM to interpolate or extrapolate missing timestamps from nearby messages (a pandas sketch of these two steps follows this list).
  - Topic Modeling and Clustering: generating Python code to compute text embeddings, cluster them with K-means, and then use GPT-4.1 to name the clusters (see the second sketch after this list).
  - Generating Data Stories and Visualizations: uploading the processed data and asking the LLM to derive 10 diverse data stories, including "quirky ones", and to provide analysis code, tables, charts, and interpretations.
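A minimal pandas sketch of the quality-check and imputation steps, assuming the chat export has already been parsed into a CSV with a `timestamp` column. The file and column names are illustrative; in the demo, the LLM wrote and executed the equivalent code itself.

```python
import pandas as pd

# Assumed file layout: one row per message, with a (possibly missing) timestamp.
df = pd.read_csv("whatsapp_messages.csv", parse_dates=["timestamp"])

# Data quality check: count missing values per column.
print(df.isna().sum())

# Imputation: fill missing timestamps by linear interpolation between the
# timestamps of neighbouring messages (rows are assumed to be in send order).
# Datetimes are cast to nanoseconds-since-epoch so numeric interpolation applies.
ts = df["timestamp"]
ns = pd.Series(ts.to_numpy("int64"), index=ts.index).where(ts.notna())
df["timestamp"] = pd.to_datetime(ns.interpolate(method="linear"))
```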
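And a sketch of the topic-modeling pipeline as described: embed each message, cluster with K-means, then have GPT-4.1 name each cluster. GPT-4.1 as the cluster-namer is from the talk; the embedding model, cluster count, and sample size are assumptions.

```python
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()
texts = df["text"].tolist()  # the messages prepared above

# 1. Compute text embeddings (text-embedding-3-small is an assumed choice).
emb = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [e.embedding for e in emb.data]

# 2. Cluster the embeddings with K-means.
k = 12
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

# 3. Ask GPT-4.1 to name each cluster from a sample of its messages.
names = {}
for c in range(k):
    sample = [t for t, l in zip(texts, labels) if l == c][:20]
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user",
                   "content": "Give a short topic name for these messages:\n"
                              + "\n".join(sample)}],
    )
    names[c] = resp.choices[0].message.content.strip()
print(names)
```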
- Model Usage Recommendations:
  - The speaker uses the $20 ChatGPT subscription, considering it "incredibly worth it".
  - For serious tasks, always use the o3 model (or the highest available) until its quota is exhausted, then move to lower models. o4-mini is a toned-down version of an even more advanced model, and more "thinking time" generally leads to better results on programming tasks.
- Current Limitations:
  - As of the talk, LLMs struggle to apply style guides and create high-quality visualizations (e.g., in Matplotlib) from scratch, even though they can write the style guide itself.
The slides contain the prompts Anand used in the demo, and you can check the transcript if you prefer reading to watching the video.