Section 6.10 Chapter Summary

This chapter developed the practical skills for integrating large language models into data science workflows. We began with the conceptual foundations of how LLMs work and progressed through the complete toolkit: generating embeddings, preprocessing text, annotating data, building RAG systems, engineering prompts, calling external tools, evaluating reliability, and navigating responsible AI concerns. Throughout, we connected these new techniques to the statistical methods from Chapters 1–5, ensuring that LLMs augment—rather than replace—rigorous quantitative reasoning.

Fig. 251 Figure 6.10.1: The complete LLM integration workflow for data science. Raw data flows through preprocessing and embedding, branches into applications (classification, annotation, RAG, prompt engineering), and passes through evaluation and responsible AI practices before deployment.

Section-by-Section Recap

Section 6.1: LLM Foundations introduced the transformer architecture (attention as weighted voting), the pre-training/fine-tuning/in-context learning paradigm, and the landscape of model families. We set up GenAI Studio and wrote our first API calls.

Section 6.2: Embeddings and Feature Extraction transformed text into dense numerical vectors. We computed cosine similarity, interpreted what the leading principal components encode (length, sentiment, category), trained classifiers on embedding features, and used bootstrap confidence intervals (CIs) to quantify uncertainty in embedding-derived regression coefficients.
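As a reminder of what the cosine-similarity step computes, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` below is an illustrative helper written for this recap, not the SDK's `GenAIStudio.cosine_similarity`.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

# Toy 3-d vectors standing in for real embeddings (made-up values).
great = [0.9, 0.1, 0.2]
excellent = [0.8, 0.2, 0.1]
terrible = [-0.7, 0.3, 0.1]

sim_close = cosine_similarity(great, excellent)
sim_far = cosine_similarity(great, terrible)
print(f"great vs excellent: {sim_close:.3f}")
print(f"great vs terrible:  {sim_far:.3f}")
```

Vectors pointing in similar directions score near 1, while vectors pointing in opposite directions score near -1; vector length does not affect the result.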

Section 6.3: Text Preprocessing developed the pipeline between raw data and LLM input: tokenization, context window management, and chunking strategies (fixed-size, overlap, semantic).
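The fixed-size-with-overlap strategy can be sketched in a few lines. This is an illustrative version, assuming whitespace-separated words as a crude stand-in for model tokens; `chunk_with_overlap` and its parameters are hypothetical names, not part of any library.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into fixed-size chunks sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks

# Whitespace splitting is a crude stand-in for a real tokenizer.
text = "the quick brown fox jumps over the lazy dog near the quiet river bank"
words = text.split()
chunks = chunk_with_overlap(words, chunk_size=6, overlap=2)
for i, chunk in enumerate(chunks):
    print(i, " ".join(chunk))
```

Each chunk repeats the last two words of its predecessor, so a sentence that straddles a boundary still appears intact in at least one chunk. Larger overlaps protect more context at the cost of more chunks to embed.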

Section 6.4: Data Annotation used LLMs to label text at scale. We designed annotation prompts, built batch pipelines, and evaluated quality with Cohen’s kappa and bootstrap confidence intervals.
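To make the evaluation step concrete, the sketch below computes Cohen's kappa from its definition and wraps it in a percentile bootstrap. It uses only the standard library; the labels are made up for illustration, and `cohen_kappa` and `bootstrap_kappa_ci` are hypothetical helper names (in practice `sklearn.metrics.cohen_kappa_score` does the first part).

```python
import math
import random
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one label only
    return (observed - expected) / (1 - expected)

def bootstrap_kappa_ci(labels_a, labels_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([labels_a[i] for i in idx], [labels_b[i] for i in idx])
        if not math.isnan(k):  # skip degenerate resamples
            stats.append(k)
    stats.sort()
    low = stats[int((alpha / 2) * len(stats))]
    high = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return low, high

# Made-up labels for 20 items: human annotator vs. LLM annotator.
human = ["pos"] * 8 + ["neg"] * 8 + ["neu"] * 4
llm = ["pos"] * 7 + ["neu"] + ["neg"] * 7 + ["neu"] + ["neu"] * 3 + ["pos"]

kappa = cohen_kappa(human, llm)
ci_low, ci_high = bootstrap_kappa_ci(human, llm)
print(f"kappa = {kappa:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With only 20 items the interval is wide, which is the point: a single kappa value from a small validation sample says little about annotation quality until its uncertainty is reported alongside it.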

Section 6.5: Retrieval-Augmented Generation grounded LLM responses in external documents. We built RAG both via GenAI Studio’s knowledge base API and manually from scratch (chunk → embed → retrieve → generate).
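The manual pipeline (chunk → embed → retrieve → generate) can be sketched without any API access. In this illustrative version, `toy_embed` uses bag-of-words counts as a stand-in for a real embedding model, and the "generate" step stops at building the prompt that would be sent to the model. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Bag-of-words counts: a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Assemble the grounded prompt that would be sent to the model."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# Pre-chunked toy knowledge base (made-up sentences).
chunks = [
    "The bootstrap resamples the observed data with replacement.",
    "Cohen's kappa measures agreement between two annotators beyond chance.",
    "Cosine similarity compares the direction of two embedding vectors.",
]
query = "How does the bootstrap resample data?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Swapping `toy_embed` for a real embedding call and passing `prompt` to a chat call would turn this into a working RAG loop; the retrieval logic itself would not change.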

Section 6.6: Prompt Engineering treated prompts as code—versioned, tested, and iterated. We developed systematic techniques: structured instructions, few-shot examples, chain-of-thought reasoning, and self-consistency (framed as bootstrap resampling).

Section 6.7: Tool Use connected models to live data and exact computation through function calling: declaring a tool with @tool, letting the model request a call, running it in our own code, and returning the result to ground the answer—together with the safeguards that executing model-chosen actions demands.
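The request/execute/return cycle, along with two of its safeguards, can be sketched locally. This is a simplified illustration: the `tool` decorator below is a hypothetical registry written for this example (it is not the SDK's `@tool`), and the model's request is simulated with a hard-coded JSON string in an assumed format.

```python
import json

TOOLS = {}  # registry of functions the model is allowed to call

def tool(fn):
    """Hypothetical local decorator (not the SDK's @tool): register fn by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def mean(values):
    """Exact arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def run_tool_call(request_json):
    """Validate and execute one model-requested call; return a JSON result."""
    request = json.loads(request_json)
    name, args = request.get("name"), request.get("arguments", {})
    if name not in TOOLS:  # safeguard: only allow-listed tools may run
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**args)
    except Exception as exc:  # safeguard: a bad call must not crash the loop
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})

# Simulated model output; a real model would emit this request itself.
model_request = '{"name": "mean", "arguments": {"values": [2, 4, 9]}}'
tool_result = run_tool_call(model_request)
print(tool_result)
```

The key design point is that the model only names a tool and its arguments; our own code decides whether to run it. The JSON string returned here is what would be sent back to the model so it can ground its final answer.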

Section 6.8: Reliability and Evaluation measured LLM trustworthiness through consistency assessment, LLM-as-judge evaluation, and self-consistency uncertainty quantification.

Section 6.9: Responsible AI Practices addressed privacy (PII detection/redaction), bias (differential treatment testing), transparency (disclosure frameworks), and ethical frameworks (NIST AI RMF).
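As a reminder of what a redaction pass looks like, here is a minimal regex-based sketch. The patterns are illustrative assumptions covering only a few common US-style formats; `PII_PATTERNS` and `redact` are hypothetical names, and a real pipeline would need broader coverage (names and addresses, for instance, require more than regular expressions).

```python
import re

# Illustrative patterns only: they catch a few common formats, not all PII.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace each match with a [LABEL] placeholder; return text and counts."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

raw = "Contact Jane at jane.doe@example.com or 765-555-0100. SSN: 123-45-6789."
clean_text, found = redact(raw)
print(clean_text)
print(found)
```

Note that the name "Jane" survives redaction, which illustrates why pattern matching alone is a first line of defense and not a complete solution. Returning the match counts makes it easy to log how much was redacted before any text leaves your machine.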

Putting it all together. These capabilities compose. A realistic workflow engineers a prompt (Section 6.6) to steer the model, grounds it in your own documents with RAG (Section 6.5), lets it call tools (Section 6.7) for computation and live data, then evaluates the result (Section 6.8) and applies responsible-AI safeguards (Section 6.9). Chaining tool calls into an autonomous loop—where the model decides each next action itself—is an agent: a powerful but less predictable frontier that builds directly on the single-tool-use foundation from this chapter.

GenAI Studio Quick Reference

Table 63 GenAI Studio API Quick Reference

| Method | Description |
| --- | --- |
| GenAIStudio(api_key=) | Initialize client (uses GENAI_STUDIO_API_KEY if not provided) |
| ai.select_model("model") | Set the model for subsequent calls |
| ai.chat(prompt) | Simple chat, returns string |
| ai.chat_complete(prompt) | Chat with metadata (tokens, model info) |
| ai.chat_stream(prompt) | Streaming chat, yields chunks |
| ai.chat_messages(messages) | Multi-turn conversation from message list |
| ai.chat_conversation(conv) | Multi-turn conversation from Conversation object |
| ai.embed(text) | Generate embedding vector(s) |
| ai.embed_complete(text) | Embeddings with metadata |
| ai.similarity(text1, text2) | Cosine similarity between two texts |
| GenAIStudio.cosine_similarity(v1, v2) | Cosine similarity between two vectors (static) |
| ai.upload_file(path) | Upload file for RAG |
| ai.create_knowledge_base(name) | Create a knowledge base |
| ai.add_file_to_knowledge_base(kb_id, file_id) | Link file to knowledge base |
| ai.chat(prompt, collections=[kb_id]) | RAG-augmented chat |

Connections to Earlier Chapters

Table 64 Chapter 6 Connections to Earlier Material

| Chapter | Concept | Application in Chapter 6 |
| --- | --- | --- |
| Ch 2 | Monte Carlo simulation | Self-consistency runs multiple reasoning paths (analogous to Monte Carlo trials) |
| Ch 3 | Linear regression | Embedding PCA components as covariates in OLS |
| Ch 3 | Hypothesis testing | Testing annotation quality against thresholds |
| Ch 4 | Bootstrap | CIs on agreement metrics, embedding coefficients, and self-consistency rates |
| Ch 4 | Cross-validation | Evaluating embedding-based classifiers |
| Ch 4 | Permutation tests | A/B testing prompt variants |
| Ch 5 | Posterior uncertainty | Self-consistency agreement as an uncertainty heuristic (stability, not correctness) |
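To illustrate one row of Table 64, the sketch below applies a permutation test to an A/B comparison of two prompt variants. The correctness scores are made up, and the setup assumes each variant was evaluated on its own independent set of items; if both prompts were scored on the same items, a paired test would be more appropriate. `permutation_test` is a hypothetical helper name.

```python
import random

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel items at random under the null
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float ties
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)  # add-one correction avoids p = 0
    return observed, p_value

# Made-up per-item correctness (1 = correct) for two prompt variants.
prompt_a = [1] * 18 + [0] * 2    # 90% correct on 20 items
prompt_b = [1] * 10 + [0] * 10   # 50% correct on 20 items
diff, p = permutation_test(prompt_a, prompt_b)
print(f"accuracy difference = {diff:.2f}, p = {p:.4f}")
```

Because the test only shuffles labels, it makes no distributional assumptions about the scores, which suits the small evaluation sets typical of prompt comparisons.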

Purdue AI Working Competency Mapping

Fig. 252 Figure 6.10.2: Each section of Chapter 6 maps to one or more pillars of Purdue’s AI Working Competency Requirement.

Table 65 Section → Competency Pillar Mapping

Section

Pillar 1: Understand & Apply

Pillar 2: Communicate

Pillar 3: Adapt

6.1 Foundations

✓ (core)

6.2 Embeddings

6.3 Preprocessing

6.4 Annotation

6.5 RAG

6.6 Prompt Engineering

6.7 Tool Use

6.8 Reliability

6.9 Responsible AI

✓ (core)

Common Pitfalls Checklist

Table 66 Common Pitfalls and Remedies

| Pitfall | Remedy |
| --- | --- |
| Trusting LLM output without validation | Always evaluate against ground truth or human labels (Section 6.8) |
| Sending PII to external APIs | Run PII detection/redaction before any API call (Section 6.9) |
| Using raw LLM confidence scores | Validate self-consistency against ground truth; use agreement to triage, not as a probability, because LLMs are typically overconfident (Section 6.8) |
| Fixed chunk size for all documents | Experiment with chunk sizes; use semantic chunking when structure matters (Section 6.3) |
| Ignoring prompt version control | Treat prompts as code: version, test, and document (Section 6.6) |
| No bias testing before deployment | Run differential treatment probes across demographic groups (Section 6.9) |
| Using LLM annotation for domain-expert tasks | Validate with Cohen’s kappa; use hybrid human+LLM workflows (Section 6.4) |
| Omitting AI disclosure | Disclose at the level matching AI’s contribution (Section 6.9) |

End-to-End Example

The program below strings together five of the chapter’s techniques—preprocessing (Section 6.3), embeddings (Section 6.2), self-consistent annotation (Sections 6.4 and 6.6), evaluation against ground truth (Section 6.8), and disclosure (Section 6.9)—into one minimal, runnable workflow:

import re
from collections import Counter

from genai_studio import GenAIStudio
from sklearn.metrics import cohen_kappa_score

ai = GenAIStudio()  # reads GENAI_STUDIO_API_KEY when no key is passed
ai.select_model("gemma3:12b")

# 1. Preprocess (Section 6.3): strip HTML tags, collapse whitespace
def clean(text):
    text = re.sub(r'<[^>]+>', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# 2. Embed (Section 6.2)
reviews = ["Great product!", "Terrible quality.", "It works fine.",
           "Best purchase ever!", "Waste of money."]
cleaned = [clean(r) for r in reviews]
# gemma3 has no embedding endpoint; use an embed-capable model (3072-d)
embeddings = ai.embed(cleaned, model="llama3.2:latest")
print(f"Embedded {len(embeddings)} reviews ({len(embeddings[0])} dims)")

# 3. Annotate with self-consistency (Sections 6.4 + 6.6)
N_RUNS = 5
PROMPT = ("Classify as positive, negative, or neutral. "
          "Respond with ONLY one word.\nText: {text}\nLabel:")

annotations = []
for text in cleaned:
    runs = [ai.chat(PROMPT.format(text=text)).strip().lower()
            for _ in range(N_RUNS)]
    majority, votes = Counter(runs).most_common(1)[0]
    agreement = votes / N_RUNS  # stability across runs, not a probability
    annotations.append({"label": majority, "agreement": agreement})
    print(f"  [{majority:>8}] ({agreement:.0%}) {text}")

# 4. Evaluate (Section 6.8)
ground_truth = ["positive", "negative", "neutral", "positive", "negative"]
predicted = [a["label"] for a in annotations]
kappa = cohen_kappa_score(ground_truth, predicted)
print(f"\nCohen's kappa: {kappa:.3f}")

# 5. Disclose (Section 6.9)
print("\nDisclosure: Labels generated by gemma3:12b via Purdue GenAI Studio "
      f"with {N_RUNS}-run self-consistency. Agreement with ground truth: "
      f"kappa = {kappa:.3f}.")
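
Running the program produces output like the following (labels and agreement rates can vary between runs):

Embedded 5 reviews (3072 dims)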
  [positive] (100%) Great product!
  [negative] (100%) Terrible quality.
  [ neutral] (80%) It works fine.
  [positive] (100%) Best purchase ever!
  [negative] (100%) Waste of money.

Cohen's kappa: 1.000

Disclosure: Labels generated by gemma3:12b via Purdue GenAI Studio with 5-run self-consistency. Agreement with ground truth: kappa = 1.000.

Learning Outcomes Checklist

Upon completing this chapter, verify that you can:

| Outcome | Key Section |
| --- | --- |
| Explain transformer architecture at a conceptual level | 6.1 |
| Generate embeddings and compute cosine similarity | 6.2 |
| Build preprocessing pipelines with appropriate chunking strategies | 6.3 |
| Design annotation prompts and evaluate quality with Cohen’s kappa | 6.4 |
| Implement RAG pipelines (SDK and manual) | 6.5 |
| Apply systematic prompt design: few-shot, CoT, self-consistency | 6.6 |
| Use tool calling: declare a tool, judge when it is warranted, apply safeguards | 6.7 |
| Assess reliability through consistency, self-consistency uncertainty, and LLM-as-judge | 6.8 |
| Navigate privacy, bias, and disclosure concerns | 6.9 |
| Integrate LLM techniques with statistical methods from Chapters 1–5 | All |

Further Reading

LLM Foundations and Architecture

  • Raschka, Build a Large Language Model (From Scratch) — for those who want deeper architectural understanding

  • Alammar, The Illustrated Transformer (blog) — visual explanations of attention

  • Bommasani et al., “On the Opportunities and Risks of Foundation Models” (2021)

Embeddings and Applications

  • Alammar & Grootendorst, Hands-On Large Language Models — practical embedding techniques

  • Jurafsky & Martin, Speech and Language Processing (3rd ed.) — comprehensive NLP reference

How Concepts Are Encoded (Interpretability)

  • Elhage et al., “Toy Models of Superposition” (Anthropic, 2022) — concepts share overlapping, non-orthogonal directions (“superposition”), which is why naive vector arithmetic on embeddings is noisy

  • Templeton et al., “Scaling Monosemanticity” (Anthropic, 2024) — sparse autoencoders extract interpretable feature directions from Claude that can be added or clamped to steer behavior

  • Anthropic, “Persona Vectors” (2025) — high-level traits encoded as linear directions you can add or subtract to control a model

RAG and Production Systems

  • Chip Huyen, AI Engineering — production LLM systems

  • Peters & Bouchard, Building LLMs for Production — RAG chapters

  • Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey” (2024)

Prompt Engineering

  • Schulhoff et al., “The Prompt Report” (2024) — comprehensive survey

  • Wei et al., “Chain-of-Thought Prompting” (2022)

  • Wang et al., “Self-Consistency Improves Chain of Thought Reasoning” (2022)

Reliability and Evaluation

  • Liang et al., “Holistic Evaluation of Language Models (HELM)” (2022)

  • Farquhar et al., “Detecting Hallucinations in Large Language Models” (2024)

Responsible AI

  • Narayanan & Kapoor, AI Snake Oil — critical perspective on AI capabilities and limitations

  • NIST AI Risk Management Framework (AI RMF 1.0)

Final Perspective

Large language models are the most versatile tools to enter the data scientist’s toolkit in a generation. They transform unstructured text into analyzable features, annotate data at unprecedented speed, retrieve information from domain-specific knowledge bases, and assist with analytical reasoning. But they are tools, not oracles. They hallucinate, they encode biases, they require careful evaluation, and they demand transparent disclosure.

The data scientist’s job is not to use LLMs uncritically—it is to use them well. This means applying the statistical thinking from earlier chapters: validating outputs, quantifying uncertainty, testing for bias, and communicating limitations honestly. The bootstrap confidence interval you compute on LLM-annotated data is only as meaningful as the annotation quality you verified. The regression model you train on embedding features is only as trustworthy as the evaluation protocol you designed.

Throughout this chapter, we have treated LLMs not as black boxes but as components in analytical workflows subject to the same rigor as any other tool. This is the perspective that will serve you as models evolve, APIs change, and new capabilities emerge. The specific models and APIs will change; the discipline of careful, skeptical, statistically-grounded integration will not.

That change will not be uniform. Expect the frontier/open-weight boundary itself to keep moving: capabilities that today exist only behind proprietary APIs will keep migrating into open-weight models through distillation and open releases, while frontier systems push further into the long-horizon agentic work that only centrally served scale supports. Neither trend invalidates anything in this chapter. The workflows you built—embedding, annotating, retrieving, evaluating—were chosen because they rest on deployment constraints you control (privacy, cost, reproducibility) rather than on any particular model’s position on a leaderboard. When the landscape shifts again, and it will, re-run the same decision you learned in Section 6.1: constraints first, capability second.

Key Takeaways

  1. LLMs are powerful additions to the data science toolkit, not replacements for statistical thinking. Embeddings, annotation, RAG, and prompt engineering each solve specific problems.

  2. All LLM outputs must be evaluated — consistency, accuracy, and uncertainty checks are prerequisites for responsible use, not optional extras.

  3. Embeddings bridge NLP and statistics — once text is embedded, every technique from Chapters 1–5 applies: regression, bootstrap, cross-validation, hypothesis testing.

  4. Responsible AI is not optional — privacy, bias, and disclosure concerns are integral to any LLM deployment. Purdue’s AI Working Competency Requirement makes this a graduation standard.

  5. The discipline matters more than the model — specific LLMs will evolve, but the practice of careful evaluation, uncertainty quantification, and transparent reporting will remain essential.