.. _chapter6:

===========================================
Chapter 6: LLMs in Data Science Workflows
===========================================

.. contents:: Chapter Contents
   :local:
   :depth: 2

Large language models have transformed how we work with unstructured data. Text that once required extensive manual processing (cleaning, feature engineering, labeling) can now be converted into rich numerical representations, annotated at scale, and analyzed with unprecedented sophistication. For data scientists, LLMs are not replacements for statistical thinking but powerful additions to the computational toolkit we've developed throughout this course.

This chapter addresses a practical question: how can data scientists leverage pre-trained language models to enhance analytical workflows? We focus on integration patterns rather than model architecture, and on responsible deployment rather than training from scratch. The goal is not to turn you into an ML engineer but to equip you as a data scientist who can thoughtfully incorporate LLMs alongside the frequentist, Bayesian, and resampling methods from earlier chapters.

We begin with **foundations**: what LLMs are, how they're trained, and why they exhibit their remarkable capabilities. From there, we develop **embeddings and feature extraction**: transforming text into dense vectors that capture semantic meaning and can serve as inputs to statistical models. **Data annotation** addresses the persistent bottleneck of labeled data, showing how LLMs can annotate at scale with quality approaching that of human labelers. **Retrieval-augmented generation (RAG)** extends LLM capabilities to domain-specific knowledge by grounding responses in authoritative sources.
Throughout, we emphasize **prompt engineering** for reliable outputs, **tool use** for connecting models to computation and external systems, **reliability assessment** for understanding when to trust LLM predictions, and **responsible AI practices** for navigating privacy, bias, and transparency concerns.

**Learning Objectives:** Upon completing this chapter, you will be able to:

**LLM Foundations**

* **Explain** transformer architecture at a conceptual level (attention, pre-training, fine-tuning)
* **Compare** model families (GPT, Claude, Llama) and their capabilities
* **Evaluate** deployment options (API-based vs. local) and their trade-offs
* **Distinguish** between open and closed models and the implications for different use cases

**Embeddings and Feature Extraction**

* **Generate** text embeddings using API-based and open-source models
* **Apply** embeddings for similarity search, classification, and regression
* **Integrate** embedding-derived features into statistical models from earlier chapters
* **Evaluate** embedding quality and choose appropriate dimensionality

**Text Preprocessing**

* **Build** preprocessing pipelines for tokenization, chunking, and normalization
* **Handle** long documents through effective chunking strategies
* **Manage** context windows and token limits
* **Clean** and normalize text data for LLM consumption

**Data Annotation**

* **Design** annotation workflows with appropriate prompts and instructions
* **Implement** quality control through consensus, sampling, and validation
* **Evaluate** annotation quality against human baselines
* **Determine** when LLM annotation is appropriate and when human labels are essential

**Retrieval-Augmented Generation**

* **Implement** RAG pipelines: document chunking, embedding, retrieval, generation
* **Design** retrieval strategies that balance precision and recall
* **Evaluate** RAG system quality through retrieval and generation metrics
* **Identify** RAG failure modes and implement mitigations


**Prompt Engineering**

* **Apply** systematic prompt design: clear instructions, few-shot examples, chain-of-thought
* **Implement** output formatting for structured, parseable responses
* **Manage** prompts as code with version control and testing
* **Debug** prompt failures through systematic iteration

**Tool Use**

* **Explain** tool use (function calling): the model requests a call, your code runs it, and the result grounds the answer
* **Declare** a tool with the ``@tool`` decorator and read the JSON schema the model is given
* **Determine** when tool use is appropriate (live data, exact computation, private systems) and when it is not
* **Apply** safeguards for tool execution: argument validation, allowlisting, human-in-the-loop, and logging

**Reliability and Evaluation**

* **Assess** LLM reliability through consistency, self-consistency uncertainty, and ground-truth comparison
* **Implement** strategies for improving reliability (self-consistency, verification)
* **Design** evaluation protocols appropriate for LLM-assisted workflows
* **Quantify** uncertainty in LLM outputs using techniques from earlier chapters

**Responsible AI**

* **Navigate** privacy considerations when sending data to external APIs
* **Identify** potential biases in LLM outputs and implement mitigations
* **Determine** appropriate disclosure of AI assistance
* **Evaluate** ethical implications of LLM deployment in specific contexts

.. toctree::
   :maxdepth: 2
   :caption: Sections

   ch6_1-llm-foundations
   ch6_2-embeddings-features
   ch6_3-text-preprocessing
   ch6_4-data-annotation
   ch6_5-retrieval-augmented-generation
   ch6_6-prompt-engineering
   ch6_7-tool-use
   ch6_8-reliability-evaluation
   ch6_9-responsible-ai
   ch6_10-chapter-summary