Chapter 6: LLMs in Data Science Workflows

Large language models have transformed how we work with unstructured data. Text that once required extensive manual processing (cleaning, feature engineering, labeling) can now be converted into rich numerical representations, annotated at scale, and analyzed with unprecedented sophistication. For data scientists, LLMs are not replacements for statistical thinking but powerful additions to the computational toolkit we’ve developed throughout this course.

This chapter addresses a practical question: how can data scientists leverage pre-trained language models to enhance analytical workflows? We focus on integration patterns rather than model architecture, on responsible deployment rather than training from scratch. The goal is not to turn you into an ML engineer but to equip you as a data scientist who can thoughtfully incorporate LLMs alongside the frequentist, Bayesian, and resampling methods from earlier chapters.

We begin with foundations: what LLMs are, how they’re trained, and why they exhibit their remarkable capabilities. From there, we develop embeddings and feature extraction—transforming text into dense vectors that capture semantic meaning and can serve as inputs to statistical models. Data annotation addresses the persistent bottleneck of labeled data, showing how LLMs can annotate at scale with quality approaching that of human labelers. Retrieval-augmented generation (RAG) extends LLM capabilities to domain-specific knowledge by grounding responses in authoritative sources. Throughout, we emphasize prompt engineering for reliable outputs, reliability assessment for understanding when to trust LLM predictions, and responsible AI practices for navigating privacy, bias, and transparency concerns.

Learning Objectives: Upon completing this chapter, you will be able to:

LLM Foundations

  • Explain transformer architecture at a conceptual level (attention, pre-training, fine-tuning)

  • Compare model families (GPT, Claude, Llama) and their capabilities

  • Evaluate deployment options (API-based vs. local) and their trade-offs

  • Distinguish between open and closed models and implications for different use cases
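
To make the attention idea concrete, here is a toy scaled dot-product attention computation in pure Python. This is a sketch of the core mechanism only: real transformers apply it with learned projection matrices, many heads, and many layers, and the vectors below are hand-made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Similarity between the query and each key determines how much
    each value vector contributes to the output.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(dim)]

# A query most similar to the second key draws its output mostly
# from the second value vector.
q = [1.0, 0.0]
K = [[0.0, 1.0], [1.0, 0.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, K, V))
```

Because the weights come from a softmax, every value contributes a little; attention mixes rather than selects, which is part of why transformers propagate context so effectively.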

Embeddings and Feature Extraction

  • Generate text embeddings using API-based and open-source models

  • Apply embeddings for similarity search, clustering, and classification

  • Integrate embedding-derived features into statistical models from earlier chapters

  • Evaluate embedding quality and choose appropriate dimensionality
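
The similarity-search objective above can be sketched in a few lines. The 3-dimensional "embeddings" below are hand-made stand-ins; in practice they would come from an embedding API or an open-source model and have hundreds of dimensions, but the cosine-similarity ranking logic is the same.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(query_vec, corpus):
    """Rank documents by cosine similarity to the query embedding."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in ranked]

# Toy 3-d "embeddings" standing in for real model output.
corpus = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "returns and refunds": [0.8, 0.2, 0.1],
}
print(nearest([1.0, 0.0, 0.0], corpus))
```

The same vectors can feed the clustering and classification methods from earlier chapters: each embedding dimension is simply a numeric feature.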

Text Preprocessing

  • Build preprocessing pipelines for tokenization, chunking, and normalization

  • Handle long documents through effective chunking strategies

  • Manage context windows and token limits

  • Clean and normalize text data for LLM consumption
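
A minimal version of the chunking objective looks like this. Character-based chunking with overlap is one simple strategy (the sizes here are illustrative); more sophisticated pipelines split on sentence or paragraph boundaries, or count tokens rather than characters to respect model context windows.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of about chunk_size characters.

    Overlap preserves context across boundaries, so a sentence that
    straddles a boundary appears intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 100  # 500 characters of toy text
pieces = chunk_text(doc, chunk_size=200, overlap=50)
print(len(pieces), [len(p) for p in pieces])
```

Choosing `chunk_size` is a trade-off: small chunks retrieve precisely but lose context, large chunks keep context but dilute relevance and consume token budget.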

Data Annotation

  • Design annotation workflows with appropriate prompts and instructions

  • Implement quality control through consensus, sampling, and validation

  • Evaluate annotation quality against human baselines

  • Determine when LLM annotation is appropriate and when human labels remain essential
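
The consensus idea behind the quality-control objective can be sketched as a majority vote with an agreement score. The repeated labels below are hypothetical stand-ins for multiple LLM runs at nonzero temperature (or an LLM plus human spot checks); low agreement flags an item for review.

```python
from collections import Counter

def consensus_label(labels):
    """Return the majority label and the agreement rate across
    repeated annotations of the same item."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical: each item annotated three times.
runs = {
    "item-1": ["positive", "positive", "positive"],
    "item-2": ["negative", "positive", "negative"],
}
for item, labels in runs.items():
    label, agreement = consensus_label(labels)
    flag = "accept" if agreement == 1.0 else "review"
    print(item, label, round(agreement, 2), flag)
```

Routing only the low-agreement items to human annotators is one way to get near-human quality at a fraction of the labeling cost.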

Retrieval-Augmented Generation

  • Implement RAG pipelines: document chunking, embedding, retrieval, generation

  • Design retrieval strategies that balance precision and recall

  • Evaluate RAG system quality through retrieval and generation metrics

  • Identify RAG failure modes and implement mitigations
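
The pipeline objective above can be illustrated end to end with toy components. Keyword overlap stands in for embedding similarity, and the generation step is reduced to building the grounded prompt that would be sent to an LLM; everything here (the scoring function, the corpus, the prompt wording) is an illustrative assumption, not a production design.

```python
def score(query, doc):
    """Keyword-overlap score; a stand-in for embedding similarity."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve(query, corpus, k=2):
    """Return the k documents most relevant to the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, context_docs):
    """Ground the model's answer in the retrieved sources."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

corpus = [
    "Refunds are issued within 14 days of return receipt.",
    "Standard shipping takes 3-5 business days.",
    "Gift cards are non-refundable.",
]
docs = retrieve("how long do refunds take", corpus, k=1)
print(build_prompt("how long do refunds take", docs))
```

Each stage is a separate failure point: retrieval can miss the relevant chunk, and generation can ignore or contradict the supplied context, which is why the evaluation objective above measures the two stages separately.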

Prompt Engineering

  • Apply systematic prompt design: clear instructions, few-shot examples, chain-of-thought

  • Implement output formatting for structured, parseable responses

  • Manage prompts as code with version control and testing

  • Debug prompt failures through systematic iteration
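
For the structured-output objective, a common pattern is to request JSON and validate the response before using it. The template and schema below are illustrative assumptions; the key idea is that parsing failures return a sentinel so callers can retry or fall back instead of crashing the pipeline.

```python
import json

PROMPT_TEMPLATE = """Classify the sentiment of the review below.
Respond with JSON only, in the form:
{{"label": "positive" | "negative" | "neutral", "confidence": <0-1>}}

Review: {review}"""

def parse_response(raw):
    """Validate a model response against the expected schema.

    Returns None on any failure (not JSON, unknown label, bad
    confidence) so the caller can retry or fall back.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if data.get("label") not in {"positive", "negative", "neutral"}:
        return None
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        return None
    return data

print(PROMPT_TEMPLATE.format(review="Great battery life."))
print(parse_response('{"label": "positive", "confidence": 0.92}'))
print(parse_response("Sure! The sentiment is positive."))  # rejected: None
```

Keeping `PROMPT_TEMPLATE` as a named constant is the start of the prompts-as-code objective: templates can live in version control and be exercised by the same validation tests.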

Reliability and Evaluation

  • Assess LLM reliability through consistency, calibration, and ground-truth comparison

  • Implement strategies for improving reliability (self-consistency, verification)

  • Design evaluation protocols appropriate for LLM-assisted workflows

  • Quantify uncertainty in LLM outputs using techniques from earlier chapters
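
Self-consistency, named in the objectives above, can be sketched as sampling the model several times and keeping the modal answer. The `cycle` of canned responses below is a hypothetical stand-in for repeated LLM calls at temperature > 0; the agreement rate is only a rough, uncalibrated confidence proxy, not a probability in the calibrated sense of earlier chapters.

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_fn, n=5):
    """Sample the model n times and keep the modal answer.

    Agreement across samples serves as a rough confidence proxy;
    low agreement flags an item for human review.
    """
    answers = [sample_fn() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

# Hypothetical stand-in for repeated sampled LLM calls that
# occasionally return a divergent answer.
responses = cycle(["42", "42", "41"])
answer, agreement = self_consistency(lambda: next(responses), n=6)
print(answer, round(agreement, 2))  # prints: 42 0.67
```

The agreement rates collected this way are ordinary proportions, so the resampling and interval-estimation tools from earlier chapters apply directly when quantifying their uncertainty.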

Responsible AI

  • Navigate privacy considerations when sending data to external APIs

  • Identify potential biases in LLM outputs and implement mitigations

  • Determine appropriate disclosure of AI assistance

  • Evaluate ethical implications of LLM deployment in specific contexts
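
One concrete piece of the privacy objective is redacting PII before text leaves your environment for an external API. The regexes below are illustrative only (emails and one phone format); production redaction needs far broader coverage (names, addresses, identifiers) and is usually better served by a dedicated PII-detection tool.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace detected PII with typed placeholders so the redacted
    text can be sent to an external API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

msg = "Contact Jane at jane.doe@example.com or 555-867-5309."
print(redact(msg))
```

Typed placeholders like `[EMAIL]` preserve enough structure for most downstream analyses while keeping the underlying values out of third-party logs.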