Section 6.10 Chapter Summary

This chapter developed the practical skills for integrating large language models into data science workflows. We began with the conceptual foundations of how LLMs work and progressed through the complete toolkit: generating embeddings, preprocessing text, annotating data, building RAG systems, engineering prompts, calling external tools, evaluating reliability, and navigating responsible AI concerns. Throughout, we connected these new techniques to the statistical methods from Chapters 1–5, ensuring that LLMs augment—rather than replace—rigorous quantitative reasoning.

Fig. 251 **Figure 6.10.1:** The complete LLM integration workflow for data science. Raw data flows through preprocessing and embedding, branches into applications (classification, annotation, RAG, prompt engineering), and passes through evaluation and responsible AI practices before deployment.

Section-by-Section Recap

Section 6.1: LLM Foundations introduced the transformer architecture (attention as weighted voting), the pre-training/fine-tuning/in-context learning paradigm, and the landscape of model families. We set up GenAI Studio and wrote our first API calls.

Section 6.2: Embeddings and Feature Extraction transformed text into dense numerical vectors. We computed cosine similarity, decoded what each principal component encodes (length, sentiment, category), trained classifiers on embedding features, and used bootstrap CIs to quantify uncertainty in embedding-derived regression coefficients.
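As a reminder of the computation behind those similarity scores, here is a minimal sketch of cosine similarity in plain Python. The three short vectors are made-up stand-ins for real embeddings (which have thousands of dimensions), and the function name is ours, not part of the GenAI Studio SDK.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-d vectors standing in for real embeddings (values invented for illustration).
great = [0.9, 0.1, 0.2]
excellent = [0.8, 0.2, 0.1]
terrible = [-0.7, 0.3, 0.1]

print(cosine_similarity(great, excellent))  # close to 1: similar direction
print(cosine_similarity(great, terrible))   # negative: opposing direction
```

Because the score depends only on direction, rescaling either vector leaves it unchanged.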

Section 6.3: Text Preprocessing developed the pipeline between raw data and LLM input: tokenization, context window management, and chunking strategies (fixed-size, overlap, semantic).
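The fixed-size and overlap strategies can be sketched in a few lines. This is an illustrative version under simplifying assumptions: whitespace-separated words stand in for model tokens, and the function name and parameters are hypothetical rather than taken from Section 6.3.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace splitting is a stand-in for a real tokenizer.
words = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Setting `overlap=0` recovers plain fixed-size chunking; a positive overlap repeats the boundary tokens so that a sentence cut at a chunk edge still appears whole in at least one window when it is short enough.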

Section 6.4: Data Annotation used LLMs to label text at scale. We designed annotation prompts, built batch pipelines, and evaluated quality with Cohen’s kappa and bootstrap confidence intervals.
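To make the evaluation step concrete, the sketch below computes Cohen's kappa by hand and wraps it in a percentile bootstrap. It is a simplified illustration, not the chapter's pipeline: the labels are invented, the helper names are ours, and in practice `sklearn.metrics.cohen_kappa_score` would replace the hand-rolled function.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k == k:  # skip resamples where kappa is undefined (NaN)
            stats.append(k)
    if not stats:
        raise ValueError("kappa was undefined in every resample")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Invented labels: 10 items, the two raters disagree on 2 of them.
human = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neg"]
llm   = ["pos", "neg", "neu", "pos", "neg", "neu", "neu", "neg", "pos", "pos"]

kappa = cohen_kappa(human, llm)  # (0.80 - 0.34) / (1 - 0.34), about 0.697
lo, hi = bootstrap_kappa_ci(human, llm)
print(f"kappa = {kappa:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

With only ten items the interval is wide, which is the point: a single kappa value says little about annotation quality until its uncertainty is reported alongside it.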

Section 6.5: Retrieval-Augmented Generation grounded LLM responses in external documents. We built RAG both via GenAI Studio’s knowledge base API and manually from scratch (chunk → embed → retrieve → generate).
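The embed, retrieve, and generate steps of the manual pipeline can be sketched without any model at all. In this toy version, a bag-of-words count vector stands in for a real embedding (an assumption made so the example runs offline), all function names are hypothetical, and the final generation call is left out: the sketch stops at the grounded prompt that would be sent to the model.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)  # Counter returns 0 for missing words
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

chunks = [
    "The bootstrap resamples the data with replacement to estimate uncertainty.",
    "Cohen's kappa measures agreement between two annotators beyond chance.",
    "Cosine similarity compares the direction of two embedding vectors.",
]
query = "How does the bootstrap estimate uncertainty?"
top = retrieve(query, chunks, k=1)
print(build_prompt(query, top))
```

Swapping `toy_embed` for a real embedding call changes the quality of the ranking but not the shape of the pipeline.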

Section 6.6: Prompt Engineering treated prompts as code—versioned, tested, and iterated. We developed systematic techniques: structured instructions, few-shot examples, chain-of-thought reasoning, and self-consistency (framed as bootstrap resampling).

Section 6.7: Tool Use connected models to live data and exact computation through function calling: declaring a tool with @tool, letting the model request a call, running it in our own code, and returning the result to ground the answer—together with the safeguards that executing model-chosen actions demands.
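The request, run, and return cycle can be illustrated with a small local dispatcher. Everything here is a hypothetical stand-in: `register_tool` is our own decorator rather than the SDK's `@tool`, and the model's request is simulated as a JSON string whose shape (`name`, `arguments`) is an assumption for this sketch.

```python
import json

TOOLS = {}  # allow-list: only functions registered here can ever be executed

def register_tool(fn):
    """Hypothetical local decorator that adds fn to the allow-list by name."""
    TOOLS[fn.__name__] = fn
    return fn

@register_tool
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(values) / len(values)

def run_tool_call(request_json):
    """Execute one model-requested call and return a JSON result string."""
    try:
        request = json.loads(request_json)
        name = request["name"]
        args = request["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return json.dumps({"error": f"malformed request: {exc}"})
    if name not in TOOLS:  # safeguard: refuse anything outside the allow-list
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**args)
    except Exception as exc:  # safeguard: bad arguments become an error message
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})

# Simulated model output requesting a tool call (format assumed for this sketch).
simulated_request = '{"name": "mean", "arguments": {"values": [2, 4, 9]}}'
print(run_tool_call(simulated_request))  # {"result": 5.0}
```

The key design choice is that the model only ever names a tool; our own code decides whether that name is permitted and performs the call.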

Section 6.8: Reliability and Evaluation measured LLM trustworthiness through consistency assessment, LLM-as-judge evaluation, and self-consistency uncertainty quantification.

Section 6.9: Responsible AI Practices addressed privacy (PII detection/redaction), bias (differential treatment testing), transparency (disclosure frameworks), and ethical frameworks (NIST AI RMF).
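As a reminder of what the redaction step looks like, here is a deliberately small regex-based sketch. The two patterns (email addresses and US-style phone numbers) are illustrative assumptions and far from exhaustive; names, addresses, and ID numbers pass straight through, so a pattern list like this is a first filter rather than a guarantee.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace each matched pattern with a [LABEL] placeholder; return text and counts."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

message = "Contact Jane at jane.doe@example.com or 765-555-0123."
redacted, counts = redact(message)
print(redacted)  # Contact Jane at [EMAIL] or [PHONE].
print(counts)    # {'EMAIL': 1, 'PHONE': 1}
```

Note that the name "Jane" survives redaction, which is why pattern matching should be paired with review before text leaves your machine.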

Putting it all together. These capabilities compose. A realistic workflow engineers a prompt (Section 6.6) to steer the model, grounds it in your own documents with RAG (Section 6.5), lets it call tools (Section 6.7) for computation and live data, then evaluates the result (Section 6.8) and applies responsible-AI safeguards (Section 6.9). Chaining tool calls into an autonomous loop—where the model decides each next action itself—is an agent: a powerful but less predictable frontier that builds directly on the single-tool-use foundation from this chapter.

GenAI Studio Quick Reference

Table 63 GenAI Studio API Quick Reference
| Method | Description |
|---|---|
| `GenAIStudio(api_key=)` | Initialize client (uses GENAI_STUDIO_API_KEY if not provided) |
| `ai.select_model("model")` | Set the model for subsequent calls |
| `ai.chat(prompt)` | Simple chat, returns string |
| `ai.chat_complete(prompt)` | Chat with metadata (tokens, model info) |
| `ai.chat_stream(prompt)` | Streaming chat, yields chunks |
| `ai.chat_messages(messages)` | Multi-turn conversation from message list |
| `ai.chat_conversation(conv)` | Multi-turn conversation from `Conversation` object |
| `ai.embed(text)` | Generate embedding vector(s) |
| `ai.embed_complete(text)` | Embeddings with metadata |
| `ai.similarity(text1, text2)` | Cosine similarity between two texts |
| `GenAIStudio.cosine_similarity(v1, v2)` | Cosine similarity between two vectors (static) |
| `ai.upload_file(path)` | Upload file for RAG |
| `ai.create_knowledge_base(name)` | Create a knowledge base |
| `ai.add_file_to_knowledge_base(kb_id, file_id)` | Link file to knowledge base |
| `ai.chat(prompt, collections=[kb_id])` | RAG-augmented chat |

Connections to Earlier Chapters

Table 64 Chapter 6 Connections to Earlier Material
| Chapter | Concept | Application in Chapter 6 |
|---|---|---|
| Ch 2 | Monte Carlo simulation | Self-consistency runs multiple reasoning paths (analogous to Monte Carlo trials) |
| Ch 3 | Linear regression | Embedding PCA components as covariates in OLS |
| Ch 3 | Hypothesis testing | Testing annotation quality against thresholds |
| Ch 4 | Bootstrap | CIs on agreement metrics, embedding coefficients, and self-consistency rates |
| Ch 4 | Cross-validation | Evaluating embedding-based classifiers |
| Ch 4 | Permutation tests | A/B testing prompt variants |
| Ch 5 | Posterior uncertainty | Self-consistency agreement as an uncertainty heuristic (stability, not correctness) |
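
To illustrate one of these connections, the permutation-test row can be sketched as a comparison of two prompt variants scored on per-item correctness. The data are invented, the function name is ours, and the test statistic (difference in mean accuracy) is one reasonable choice among several.

```python
import random

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in means between a and b."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign scores to the two prompts at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = sum(pa) / len(pa) - sum(pb) / len(pb)
        if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float ties
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)

# Invented per-item correctness (1 = correct) for two prompt variants.
prompt_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]  # 10 of 12 correct
prompt_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # 5 of 12 correct

diff, p_value = permutation_test(prompt_a, prompt_b)
print(f"accuracy difference = {diff:.3f}, permutation p-value = {p_value:.4f}")
```

Shuffling the pooled scores simulates the null hypothesis that the prompt label is irrelevant to correctness, so no distributional assumption about the scores is needed.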

Purdue AI Working Competency Mapping

Table 65 Section → Competency Pillar Mapping
| Section | Pillar 1: Understand & Apply | Pillar 2: Communicate | Pillar 3: Adapt |
|---|---|---|---|
| 6.1 Foundations | ✓ (core) | | |
| 6.2 Embeddings | ✓ | | |
| 6.3 Preprocessing | ✓ | | ✓ |
| 6.4 Annotation | ✓ | ✓ | |
| 6.5 RAG | ✓ | | ✓ |
| 6.6 Prompt Engineering | ✓ | ✓ | |
| 6.7 Tool Use | ✓ | | ✓ |
| 6.8 Reliability | ✓ | ✓ | |
| 6.9 Responsible AI | ✓ | ✓ (core) | ✓ |

Common Pitfalls Checklist

Table 66 Common Pitfalls and Remedies
| Pitfall | Remedy |
|---|---|
| Trusting LLM output without validation | Always evaluate against ground truth or human labels (Section 6.8) |
| Sending PII to external APIs | Run PII detection/redaction before any API call (Section 6.9) |
| Using raw LLM confidence scores | Validate self-consistency against ground truth; use agreement to triage, not as a probability, because LLMs are typically overconfident (Section 6.8) |
| Fixed chunk size for all documents | Experiment with chunk sizes; use semantic chunking when structure matters (Section 6.3) |
| Ignoring prompt version control | Treat prompts as code: version, test, and document (Section 6.6) |
| No bias testing before deployment | Run differential treatment probes across demographic groups (Section 6.9) |
| Using LLM annotation for domain-expert tasks | Validate with Cohen’s kappa; use hybrid human+LLM workflows (Section 6.4) |
| Omitting AI disclosure | Disclose at the level matching AI’s contribution (Section 6.9) |

End-to-End Example

The program below strings together five of the chapter’s techniques—preprocessing (Section 6.3), embeddings (Section 6.2), self-consistent annotation (Sections 6.4 and 6.6), evaluation against ground truth (Section 6.8), and disclosure (Section 6.9)—into one minimal, runnable workflow:

import re
from collections import Counter

from genai_studio import GenAIStudio
from sklearn.metrics import cohen_kappa_score

ai = GenAIStudio()  # reads GENAI_STUDIO_API_KEY when no api_key is passed
ai.select_model("gemma3:12b")

# 1. Preprocess (Section 6.3)
def clean(text):
    """Strip HTML tags and collapse runs of whitespace."""
    text = re.sub(r'<[^>]+>', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# 2. Embed (Section 6.2)
reviews = ["Great product!", "Terrible quality.", "It works fine.",
           "Best purchase ever!", "Waste of money."]
cleaned = [clean(r) for r in reviews]
embeddings = ai.embed(cleaned, model="llama3.2:latest")  # gemma3 has no embedding endpoint; use an embed-capable model (3072-d)
print(f"Embedded {len(embeddings)} reviews ({len(embeddings[0])} dims)")

# 3. Annotate with self-consistency (Sections 6.4 + 6.6)
PROMPT = ("Classify as positive, negative, or neutral. "
          "Respond with ONLY one word.\nText: {text}\nLabel:")

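N_RUNS = 5  # independent samples per text for self-consistency
annotations = []
for text in cleaned:
    runs = [ai.chat(PROMPT.format(text=text)).strip().lower()
            for _ in range(N_RUNS)]
    # Majority label and the number of runs that voted for it
    majority, votes = Counter(runs).most_common(1)[0]
    agreement = votes / N_RUNS
    # Agreement measures stability across runs, not a calibrated probability
    annotations.append({"label": majority, "agreement": agreement})
    print(f"  [{majority:>8}] ({agreement:.0%}) {text}")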

# 4. Evaluate (Section 6.8)
ground_truth = ["positive", "negative", "neutral", "positive", "negative"]
predicted = [a["label"] for a in annotations]
kappa = cohen_kappa_score(ground_truth, predicted)
print(f"\nCohen's kappa: {kappa:.3f}")

# 5. Disclose (Section 6.9)
print("\nDisclosure: Labels generated by gemma3:12b via Purdue GenAI Studio "
      f"with 5-run self-consistency. Agreement with ground truth: "
      f"kappa = {kappa:.3f}.")

Embedded 5 reviews (3072 dims)
  [positive] (100%) Great product!
  [negative] (100%) Terrible quality.
  [ neutral] (80%) It works fine.
  [positive] (100%) Best purchase ever!
  [negative] (100%) Waste of money.

Cohen's kappa: 1.000

Disclosure: Labels generated by gemma3:12b via Purdue GenAI Studio with 5-run self-consistency. Agreement with ground truth: kappa = 1.000.
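
One practical refinement, offered as a sketch rather than as part of the program above: model replies such as "Positive." will not match the expected label strings unless they are normalized, and low-agreement items are natural candidates for human review. The helper names, the label set, and the 0.8 review threshold below are all assumptions chosen for illustration.

```python
import re
from collections import Counter

VALID_LABELS = {"positive", "negative", "neutral"}

def normalize(raw):
    """Lowercase a reply and drop everything except letters; flag unknown labels."""
    label = re.sub(r"[^a-z]", "", raw.lower())
    return label if label in VALID_LABELS else "invalid"

def vote(runs, threshold=0.8):
    """Majority label, agreement rate, and whether a human should review the item."""
    labels = [normalize(r) for r in runs]
    label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    needs_review = label == "invalid" or agreement < threshold
    return label, agreement, needs_review

# Simulated replies from five runs on one text (no API call is made here).
print(vote(["Positive.", "positive", " POSITIVE ", "neutral", "positive"]))
# ('positive', 0.8, False)
print(vote(["positive", "negative", "neutral", "positive", "negative"]))
# ('positive', 0.4, True)
```

Agreement is used here only to route items for review, in line with the pitfalls table: it signals stability across runs, not the probability that the label is correct.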

Learning Outcomes Checklist

Upon completing this chapter, verify that you can:

| ✓ | Outcome | Key Section |
|---|---|---|
| ☐ | Explain transformer architecture at a conceptual level | 6.1 |
| ☐ | Generate embeddings and compute cosine similarity | 6.2 |
| ☐ | Build preprocessing pipelines with appropriate chunking strategies | 6.3 |
| ☐ | Design annotation prompts and evaluate quality with Cohen’s kappa | 6.4 |
| ☐ | Implement RAG pipelines (SDK and manual) | 6.5 |
| ☐ | Apply systematic prompt design: few-shot, CoT, self-consistency | 6.6 |
| ☐ | Use tool calling: declare a tool, judge when it is warranted, apply safeguards | 6.7 |
| ☐ | Assess reliability through consistency, self-consistency uncertainty, and LLM-as-judge | 6.8 |
| ☐ | Navigate privacy, bias, and disclosure concerns | 6.9 |
| ☐ | Integrate LLM techniques with statistical methods from Chapters 1–5 | All |

Final Perspective

Large language models are among the most versatile tools to enter the data scientist’s toolkit in a generation. They transform unstructured text into analyzable features, annotate data at unprecedented speed, retrieve information from domain-specific knowledge bases, and assist with analytical reasoning. But they are tools, not oracles. They hallucinate, they encode biases, they require careful evaluation, and they demand transparent disclosure.

The data scientist’s job is not to use LLMs uncritically—it is to use them well. This means applying the statistical thinking from earlier chapters: validating outputs, quantifying uncertainty, testing for bias, and communicating limitations honestly. The bootstrap confidence interval you compute on LLM-annotated data is only as meaningful as the annotation quality you verified. The regression model you train on embedding features is only as trustworthy as the evaluation protocol you designed.

Throughout this chapter, we have treated LLMs not as black boxes but as components in analytical workflows subject to the same rigor as any other tool. This perspective will keep serving you as new capabilities emerge. The specific models and APIs will change; the discipline of careful, skeptical, statistically grounded integration will not.

That change will not be uniform. Expect the frontier/open-weight boundary itself to keep moving: capabilities that today exist only behind proprietary APIs will keep migrating into open-weight models through distillation and open releases, while frontier systems push further into long-horizon agentic work that only centrally served infrastructure at scale can support. Neither trend invalidates anything in this chapter. The workflows you built (embedding, annotating, retrieving, evaluating) were chosen because they rest on deployment constraints you control (privacy, cost, reproducibility) rather than on any particular model’s position on a leaderboard. When the landscape shifts again, and it will, apply the same decision rule you learned in Section 6.1: constraints first, capability second.

Key Takeaways

- LLMs are powerful additions to the data science toolkit, not replacements for statistical thinking. Embeddings, annotation, RAG, and prompt engineering each solve specific problems.
- All LLM outputs must be evaluated: consistency, accuracy, and uncertainty checks are prerequisites for responsible use, not optional extras.
- Embeddings bridge NLP and statistics: once text is embedded, every technique from Chapters 1–5 applies, including regression, bootstrap, cross-validation, and hypothesis testing.
- Responsible AI is not optional: privacy, bias, and disclosure concerns are integral to any LLM deployment. Purdue’s AI Working Competency Requirement makes this a graduation standard.
- The discipline matters more than the model: specific LLMs will evolve, but the practice of careful evaluation, uncertainty quantification, and transparent reporting will remain essential.