Chapter 6: LLMs in Data Science Workflows

Large language models have transformed how we work with unstructured data. Text that once required extensive manual processing (cleaning, feature engineering, labeling) can now be converted into rich numerical representations, annotated at scale, and analyzed with unprecedented sophistication. For data scientists, LLMs are not replacements for statistical thinking but powerful additions to the computational toolkit we’ve developed throughout this course.

This chapter addresses a practical question: how can data scientists leverage pre-trained language models to enhance analytical workflows? We focus on integration patterns rather than model architecture, on responsible deployment rather than training from scratch. The goal is not to turn you into an ML engineer but to equip you as a data scientist who can thoughtfully incorporate LLMs alongside the frequentist, Bayesian, and resampling methods from earlier chapters.

We begin with foundations: what LLMs are, how they’re trained, and why they exhibit their remarkable capabilities. From there, we develop embeddings and feature extraction—transforming text into dense vectors that capture semantic meaning and can serve as inputs to statistical models. Data annotation addresses the persistent bottleneck of labeled data, showing how LLMs can annotate at scale with quality approaching that of human labelers. Retrieval-augmented generation (RAG) extends LLM capabilities to domain-specific knowledge by grounding responses in authoritative sources. Throughout, we emphasize prompt engineering for reliable outputs, reliability assessment for understanding when to trust LLM predictions, and responsible AI practices for navigating privacy, bias, and transparency concerns.

Learning Objectives: Upon completing this chapter, you will be able to:

LLM Foundations

  • Explain transformer architecture at a conceptual level (attention, pre-training, fine-tuning)

  • Compare model families (GPT, Claude, Llama) and their capabilities

  • Evaluate deployment options (API-based vs. local) and their trade-offs

  • Distinguish between open and closed models and implications for different use cases
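
To make the attention idea concrete, here is a toy scaled dot-product attention computation in pure Python. This is a sketch of the core mechanism only: real transformers apply it with learned projection matrices, many heads, and many layers, and the vectors below are hand-made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Similarity between the query and each key determines how much
    each value vector contributes to the output.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]

# A query most similar to the second key draws its output mostly
# from the second value vector.
q = [1.0, 0.0]
K = [[0.0, 1.0], [1.0, 0.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, K, V))
```

Because the weights come from a softmax, every value contributes a little; attention mixes rather than selects, which is part of why transformers propagate context so effectively.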

Embeddings and Feature Extraction

  • Generate text embeddings using API-based and open-source models

  • Apply embeddings for similarity search, clustering, and classification

  • Integrate embedding-derived features into statistical models from earlier chapters

  • Evaluate embedding quality and choose appropriate dimensionality
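
The similarity-search objective above can be sketched in a few lines. The 3-dimensional "embeddings" below are hand-made stand-ins; in practice they would come from an embedding API or an open-source model and have hundreds of dimensions, but the cosine-similarity ranking logic is the same.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(query_vec, corpus):
    """Rank documents by cosine similarity to the query embedding."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in ranked]

# Toy 3-d "embeddings" standing in for real model output.
corpus = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "returns and refunds": [0.8, 0.2, 0.1],
}
print(nearest([1.0, 0.0, 0.0], corpus))
```

The same vectors can feed the clustering and classification methods from earlier chapters: each embedding dimension is simply a numeric feature.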

Text Preprocessing

  • Build preprocessing pipelines for tokenization, chunking, and normalization

  • Handle long documents through effective chunking strategies

  • Manage context windows and token limits

  • Clean and normalize text data for LLM consumption
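
A minimal version of the chunking objective looks like this. Character-based chunking with overlap is one simple strategy (the sizes here are illustrative); more sophisticated pipelines split on sentence or paragraph boundaries, or count tokens rather than characters to respect model context windows.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of about chunk_size characters.

    Overlap preserves context across boundaries, so a sentence that
    straddles a boundary appears intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 100  # 500 characters of toy text
pieces = chunk_text(doc, chunk_size=200, overlap=50)
print(len(pieces), [len(p) for p in pieces])
```

Choosing `chunk_size` is a trade-off: small chunks retrieve precisely but lose context, large chunks keep context but dilute relevance and consume token budget.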

Data Annotation

  • Design annotation workflows with appropriate prompts and instructions

  • Implement quality control through consensus, sampling, and validation

  • Evaluate annotation quality against human baselines

  • Determine when LLM annotation is appropriate and when human labels remain essential
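
The consensus idea behind the quality-control objective can be sketched as a majority vote with an agreement score. The repeated labels below are hypothetical stand-ins for multiple LLM runs at nonzero temperature (or an LLM plus human spot checks); low agreement flags an item for review.

```python
from collections import Counter

def consensus_label(labels):
    """Return the majority label and the agreement rate across
    repeated annotations of the same item."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical: each item annotated three times.
runs = {
    "item-1": ["positive", "positive", "positive"],
    "item-2": ["negative", "positive", "negative"],
}
for item, labels in runs.items():
    label, agreement = consensus_label(labels)
    flag = "accept" if agreement == 1.0 else "review"
    print(item, label, round(agreement, 2), flag)
```

Routing only the low-agreement items to human annotators is one way to get near-human quality at a fraction of the labeling cost.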

Retrieval-Augmented Generation

  • Implement RAG pipelines: document chunking, embedding, retrieval, generation

  • Design retrieval strategies that balance precision and recall

  • Evaluate RAG system quality through retrieval and generation metrics

  • Identify RAG failure modes and implement mitigations
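
The pipeline objective above can be illustrated end to end with toy components. Keyword overlap stands in for embedding similarity, and the generation step is reduced to building the grounded prompt that would be sent to an LLM; everything here (the scoring function, the corpus, the prompt wording) is an illustrative assumption, not a production design.

```python
def score(query, doc):
    """Keyword-overlap score; a stand-in for embedding similarity."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve(query, corpus, k=2):
    """Return the k documents most relevant to the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, context_docs):
    """Ground the model's answer in the retrieved sources."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

corpus = [
    "Refunds are issued within 14 days of return receipt.",
    "Standard shipping takes 3-5 business days.",
    "Gift cards are non-refundable.",
]
docs = retrieve("how long do refunds take", corpus, k=1)
print(build_prompt("how long do refunds take", docs))
```

Each stage is a separate failure point: retrieval can miss the relevant chunk, and generation can ignore or contradict the supplied context, which is why the evaluation objective above measures the two stages separately.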

Prompt Engineering

  • Apply systematic prompt design: clear instructions, few-shot examples, chain-of-thought

  • Implement output formatting for structured, parseable responses

  • Manage prompts as code with version control and testing

  • Debug prompt failures through systematic iteration
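
For the structured-output objective, a common pattern is to request JSON and validate the response before using it. The template and schema below are illustrative assumptions; the key idea is that parsing failures return a sentinel so callers can retry or fall back instead of crashing the pipeline.

```python
import json

PROMPT_TEMPLATE = """Classify the sentiment of the review below.
Respond with JSON only, in the form:
{{"label": "positive" | "negative" | "neutral", "confidence": <0-1>}}

Review: {review}"""

def parse_response(raw):
    """Validate a model response against the expected schema.

    Returns None on any failure (not JSON, unknown label, bad
    confidence) so the caller can retry or fall back.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if data.get("label") not in {"positive", "negative", "neutral"}:
        return None
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        return None
    return data

print(PROMPT_TEMPLATE.format(review="Great battery life."))
print(parse_response('{"label": "positive", "confidence": 0.92}'))
print(parse_response("Sure! The sentiment is positive."))  # rejected: None
```

Keeping `PROMPT_TEMPLATE` as a named constant is the start of the prompts-as-code objective: templates can live in version control and be exercised by the same validation tests.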

Reliability and Evaluation

  • Assess LLM reliability through consistency, calibration, and ground-truth comparison

  • Implement strategies for improving reliability (self-consistency, verification)

  • Design evaluation protocols appropriate for LLM-assisted workflows

  • Quantify uncertainty in LLM outputs using techniques from earlier chapters
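
Self-consistency, named in the objectives above, can be sketched as sampling the model several times and keeping the modal answer. The `cycle` of canned responses below is a hypothetical stand-in for repeated LLM calls at temperature > 0; the agreement rate is only a rough, uncalibrated confidence proxy, not a probability in the calibrated sense of earlier chapters.

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_fn, n=5):
    """Sample the model n times and keep the modal answer.

    Agreement across samples serves as a rough confidence proxy;
    low agreement flags an item for human review.
    """
    answers = [sample_fn() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

# Hypothetical stand-in for repeated sampled LLM calls that
# occasionally return a divergent answer.
responses = cycle(["42", "42", "41"])
answer, agreement = self_consistency(lambda: next(responses), n=6)
print(answer, round(agreement, 2))  # prints: 42 0.67
```

The agreement rates collected this way are ordinary proportions, so the resampling and interval-estimation tools from earlier chapters apply directly when quantifying their uncertainty.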

Responsible AI

  • Navigate privacy considerations when sending data to external APIs

  • Identify potential biases in LLM outputs and implement mitigations

  • Determine appropriate disclosure of AI assistance

  • Evaluate ethical implications of LLM deployment in specific contexts
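
One concrete piece of the privacy objective is redacting PII before text leaves your environment for an external API. The regexes below are illustrative only (emails and one phone format); production redaction needs far broader coverage (names, addresses, identifiers) and is usually better served by a dedicated PII-detection tool.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace detected PII with typed placeholders so the redacted
    text can be sent to an external API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

msg = "Contact Jane at jane.doe@example.com or 555-867-5309."
print(redact(msg))
```

Typed placeholders like `[EMAIL]` preserve enough structure for most downstream analyses while keeping the underlying values out of third-party logs.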