.. _ch6_3-text-preprocessing:

=====================================================
Section 6.3 Text Preprocessing for LLM Pipelines
=====================================================

Before text reaches a language model, it must pass through a series of transformations that are easy to overlook but difficult to recover from when done poorly. A model that receives HTML tags alongside article text will waste tokens on markup. A document that exceeds the model's context window will be silently truncated, losing potentially critical information. A pipeline that chunks text mid-sentence will produce embeddings that capture fragments rather than ideas.

Text preprocessing for LLMs differs from traditional NLP preprocessing. Classical pipelines aggressively transform text (lowercasing, stemming, removing stop words) because the downstream model (a bag-of-words classifier, say) benefits from reduced vocabulary. LLMs, by contrast, *understand* language; they benefit from well-formed, natural text. The goal is not to simplify the text but to ensure it arrives in a form the model can process effectively within its constraints.

This section develops the preprocessing pipeline that sits between raw data and the LLM techniques of subsequent sections. We focus on three critical steps: understanding how models *see* text (tokenization), managing the fundamental constraint of context windows, and dividing long documents into meaningful chunks.

.. admonition:: Road Map 🧭
   :class: important

   • **Understand** how tokenizers convert text to tokens, the actual units models process
   • **Manage** context windows as a fixed token budget that constrains everything
   • **Choose** chunking strategies appropriate for different document types and downstream tasks
   • **Build** preprocessing pipelines that clean, normalize, and prepare text for LLM consumption

Tokenization: How Models See Text
-----------------------------------

Subword Tokenization
~~~~~~~~~~~~~~~~~~~~~~

Language models do not process text character by character or word by word. They use **subword tokenization**, which splits text into units called *tokens* that balance vocabulary size against sequence length.

Common English words typically map to a single token ("the", "cat", "important"). Rare or technical words are split into subword pieces: "immunohistochemistry" might become ["immun", "oh", "ist", "oche", "mist", "ry"]. Numbers, punctuation, and code often tokenize unpredictably: "3.14159" might become ["3", ".", "14", "159"].

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIV/Chapter6/ch6_3_fig01_tokenization.png
   :alt: How tokenizers see different types of text
   :align: center
   :width: 80%

   **Figure 6.3.1:** Subword tokenization in action. Common English words become single tokens, while rare or technical terms are split into multiple subword pieces. Code and mathematical notation often tokenize less efficiently than prose.

Why does this matter for data scientists? Because **everything in an LLM pipeline is measured in tokens, not words or characters.** Context windows, API pricing, embedding input limits, and generation budgets are all denominated in tokens. A rough heuristic: 1 token ≈ 0.75 English words, or equivalently, 1 English word ≈ 1.3 tokens. But this ratio varies significantly with content type.
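To make the splitting behavior concrete, here is a minimal sketch of a greedy longest-match splitter. It is a toy, not the algorithm any GenAI Studio model uses: real tokenizers learn their vocabularies from data (for example with byte-pair encoding), whereas ``toy_vocab`` below is a hypothetical hand-picked set chosen only to reproduce the split quoted above, and ``toy_subword_tokenize`` is an illustrative name.

```python
def toy_subword_tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Characters not covered by any vocabulary entry fall back to
    single-character tokens, so the loop always makes progress.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary, chosen only for illustration
toy_vocab = {"the", "cat", "important",
             "immun", "oh", "ist", "oche", "mist", "ry"}

for word in ["the", "immunohistochemistry", "bootstrap"]:
    pieces = toy_subword_tokenize(word, toy_vocab)
    print(f"{word!r}: {len(pieces)} token(s) -> {pieces}")
```

A word that is in the vocabulary costs one token, a technical term costs six, and a word with no matching pieces degrades to one token per character. That spread is the reason token budgets cannot be read off word counts alone.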
Counting Tokens in Practice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since we cannot directly access the tokenizer used by models on GenAI Studio, we measure token counts indirectly: send text through ``chat_complete()`` and read back the token usage the gateway reports for the full prompt:

.. code-block:: python

   from genai_studio import GenAIStudio

   ai = GenAIStudio()
   ai.select_model("llama3.2:latest")

   # Use chat_complete to measure token usage
   test_text = ("The bootstrap resamples data with replacement to estimate "
                "the sampling distribution of a statistic. By repeating this "
                "process thousands of times, we build an empirical approximation "
                "to the true sampling distribution.")

   prompt = f"Repeat the following text exactly: {test_text}"
   response = ai.chat_complete(prompt)

   print(f"Prompt tokens:     {response.prompt_tokens}")
   print(f"Completion tokens: {response.completion_tokens}")
   print(f"Total tokens:      {response.total_tokens}")
   print(f"Word count:        {len(prompt.split())}")
   print(f"Token/word ratio:  {response.prompt_tokens / len(prompt.split()):.2f}")

.. code-block:: text

   Prompt tokens:     52
   Completion tokens: 43
   Total tokens:      95
   Word count:        36
   Token/word ratio:  1.44

The reported prompt count covers the full prompt, including any chat-template tokens the gateway adds, so the ratio for a short prompt can run somewhat above the 1.3 heuristic.

Token Estimation Heuristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For planning purposes, use these approximations:

.. code-block:: python

   def estimate_tokens(text, method="words"):
       """Estimate token count from text.

       The word-based heuristic (1 word ≈ 1.3 tokens) is reasonably
       accurate for English prose. Character-based (1 token ≈ 4 chars)
       is better for mixed content (code, URLs, technical text).
       """
       if method == "words":
           return int(len(text.split()) * 1.3)
       elif method == "chars":
           return len(text) // 4
       else:
           raise ValueError(f"Unknown method: {method}")

   sample = "The bootstrap resamples data with replacement."
   print(f"Word estimate: {estimate_tokens(sample, 'words')} tokens")
   print(f"Char estimate: {estimate_tokens(sample, 'chars')} tokens")

.. code-block:: text

   Word estimate: 7 tokens
   Char estimate: 11 tokens

These are approximations. For production pipelines where token counts matter (e.g., staying within context windows), always measure actual token usage with ``chat_complete()`` on representative samples.

Context Windows and Token Limits
----------------------------------

What Context Windows Mean
~~~~~~~~~~~~~~~~~~~~~~~~~~

Every language model has a **context window**: the maximum number of tokens it can process in a single call. This window must contain *everything*: system prompts, user input, and the model's response. If the total exceeds the window, input is truncated or the call fails.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIV/Chapter6/ch6_3_fig02_context_window.png
   :alt: Context window token budget
   :align: center
   :width: 85%

   **Figure 6.3.2:** The context window is a fixed token budget shared between system prompt, user input, and model output. For a model with an 8,192-token window, a 500-token system prompt and a 1,000-token expected output leave roughly 6,700 tokens for your actual data.

Context Window Sizes on GenAI Studio
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The models available on GenAI Studio have varying context windows:

.. list-table:: Model Context Windows
   :header-rows: 1
   :widths: 30 20 50

   * - Model
     - Context Window
     - Notes
   * - llama3.2:latest
     - 8,192 tokens
     - Good balance of capability and context
   * - mistral:latest
     - 8,192 tokens
     - Fast, efficient for shorter tasks
   * - gemma3:12b
     - 8,192 tokens
     - Strong general-purpose model
   * - phi4:latest
     - 16,384 tokens
     - Larger context window
   * - deepseek-r1:1.5b
     - 8,192 tokens
     - Smallest model, fastest inference

Managing the Token Budget
~~~~~~~~~~~~~~~~~~~~~~~~~~

A practical function for checking whether text fits within a model's context window:

.. code-block:: python

   def check_fits_context(text, model_context=8192,
                          system_tokens=200, output_tokens=1000):
       """Check if text fits within the context window budget."""
       estimated_input = estimate_tokens(text, "words")
       available = model_context - system_tokens - output_tokens
       fits = estimated_input <= available

       print(f"Context window:  {model_context:,} tokens")
       print(f"System prompt:   -{system_tokens:,}")
       print(f"Output reserve:  -{output_tokens:,}")
       print(f"Available:       {available:,} tokens")
       print(f"Input estimate:  {estimated_input:,} tokens")
       print(f"Fits:            {'Yes' if fits else 'NO (chunking required)'}")
       return fits

   short_text = "The mean is sensitive to outliers. " * 10
   long_text = "Statistical analysis reveals patterns. " * 5000

   print("Short text:")
   check_fits_context(short_text)
   print("\nLong text:")
   check_fits_context(long_text)

.. code-block:: text

   Short text:
   Context window:  8,192 tokens
   System prompt:   -200
   Output reserve:  -1,000
   Available:       6,992 tokens
   Input estimate:  78 tokens
   Fits:            Yes

   Long text:
   Context window:  8,192 tokens
   System prompt:   -200
   Output reserve:  -1,000
   Available:       6,992 tokens
   Input estimate:  26,000 tokens
   Fits:            NO (chunking required)

When text does not fit, we have two options: use a model with a larger context window, or *chunk* the text into smaller pieces. Chunking is almost always the more robust approach, because it works regardless of model and scales to arbitrarily long documents.

Chunking Strategies
---------------------

**Chunking** divides a long document into smaller segments, each of which fits within the model's context window. The challenge is choosing *where* to split: cuts in the wrong places destroy context and produce incoherent fragments.

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIV/Chapter6/ch6_3_fig03_chunking_strategies.png
   :alt: Three chunking strategies compared
   :align: center
   :width: 90%

   **Figure 6.3.3:** Three chunking strategies. Fixed-size chunking is simple but may split mid-sentence. Overlap ensures no information is lost at boundaries. Semantic chunking splits at natural boundaries (paragraphs, sections) for more coherent chunks.

Fixed-Size Chunking
~~~~~~~~~~~~~~~~~~~~~

The simplest approach: split text into chunks of approximately equal token count.

.. code-block:: python

   def chunk_fixed_size(text, chunk_size=500):
       """Split text into fixed-size chunks by word count.

       A rough proxy for token count (1 word ≈ 1.3 tokens).
       For more precise control, use character-based chunking
       with chunk_size = desired_tokens * 4.
       """
       words = text.split()
       chunks = []
       for i in range(0, len(words), chunk_size):
           chunk = " ".join(words[i:i + chunk_size])
           chunks.append(chunk)
       return chunks

   # Example with a long text
   long_text = " ".join([f"Sentence {i} discusses topic {i % 5}."
                         for i in range(200)])

   chunks = chunk_fixed_size(long_text, chunk_size=50)
   print(f"Document: {len(long_text.split())} words")
   print(f"Chunks: {len(chunks)}")
   for i, chunk in enumerate(chunks[:3]):
       print(f"\n  Chunk {i} ({len(chunk.split())} words): {chunk[:80]}...")

.. code-block:: text

   Document: 1000 words
   Chunks: 20

     Chunk 0 (50 words): Sentence 0 discusses topic 0. Sentence 1 discusses topic 1. Sentence 2 discusses...

     Chunk 1 (50 words): Sentence 10 discusses topic 0. Sentence 11 discusses topic 1. Sentence 12 discus...

     Chunk 2 (50 words): Sentence 20 discusses topic 0. Sentence 21 discusses topic 1. Sentence 22 discus...

Fixed-size chunking is fast and predictable but crude. It may split mid-sentence, breaking coherence. This is acceptable for some tasks (e.g., embedding large corpora for approximate search) but problematic when context matters.

Chunking with Overlap
~~~~~~~~~~~~~~~~~~~~~~~

Adding overlap between consecutive chunks ensures that information near boundaries is not lost:

.. code-block:: python

   def chunk_with_overlap(text, chunk_size=500, overlap=50):
       """Split text into overlapping chunks.

       Overlap ensures no information falls into a gap between chunks.
       Typical overlap: 10-20% of chunk_size.
       """
       if overlap >= chunk_size:
           raise ValueError("overlap must be smaller than chunk_size")

       words = text.split()
       chunks = []
       step = chunk_size - overlap  # how far the window advances each time

       for i in range(0, len(words), step):
           chunk = " ".join(words[i:i + chunk_size])
           if chunk:
               chunks.append(chunk)
           # Stop once the current window has reached the end of the text
           if i + chunk_size >= len(words):
               break
       return chunks

   chunks_overlap = chunk_with_overlap(long_text, chunk_size=50, overlap=10)
   print(f"Chunks without overlap: {len(chunks)}")
   print(f"Chunks with overlap:    {len(chunks_overlap)}")
   print(f"Overlap cost: {len(chunks_overlap) - len(chunks)} extra chunks "
         f"({(len(chunks_overlap)/len(chunks) - 1)*100:.0f}% more)")

.. code-block:: text

   Chunks without overlap: 20
   Chunks with overlap:    25
   Overlap cost: 5 extra chunks (25% more)

The trade-off is clear: overlap improves boundary coverage but increases the total number of chunks (and therefore embedding or LLM costs). A 10–20% overlap is usually sufficient.

Semantic Chunking
~~~~~~~~~~~~~~~~~~~

Semantic chunking splits at natural document boundaries (paragraphs, section headers, or sentence endings) rather than at arbitrary word counts:

.. code-block:: python

   def chunk_semantic(text, max_chunk_size=500):
       """Split text at paragraph boundaries, merging small paragraphs.

       Respects natural document structure by never splitting
       mid-paragraph. Merges consecutive short paragraphs until
       the chunk approaches max_chunk_size words.
       """
       paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

       chunks = []
       current_chunk = []
       current_size = 0

       for para in paragraphs:
           para_size = len(para.split())
           if current_size + para_size > max_chunk_size and current_chunk:
               # Adding this paragraph would overflow: close the current chunk
               chunks.append('\n\n'.join(current_chunk))
               current_chunk = [para]
               current_size = para_size
           else:
               current_chunk.append(para)
               current_size += para_size

       if current_chunk:
           chunks.append('\n\n'.join(current_chunk))
       return chunks

   # Example with structured text
   structured_text = """
   Introduction to Bootstrap Methods

   The bootstrap is a resampling method introduced by Bradley Efron in 1979. It estimates the sampling distribution of a statistic by repeatedly resampling from the observed data with replacement.

   How the Bootstrap Works

   Given a dataset of n observations, we draw n samples with replacement to create a bootstrap sample. We compute the statistic of interest on this sample. Repeating this B times gives B bootstrap estimates.

   Confidence Intervals

   The percentile method takes the alpha/2 and 1-alpha/2 quantiles of the bootstrap distribution as confidence interval endpoints. This is simple but can be improved with bias-corrected methods.

   Advantages and Limitations

   The bootstrap requires no distributional assumptions, making it widely applicable. However, it can fail for heavy-tailed distributions or when the sample size is very small.
   """.strip()

   chunks = chunk_semantic(structured_text, max_chunk_size=40)
   for i, chunk in enumerate(chunks):
       print(f"Chunk {i} ({len(chunk.split())} words):")
       print(f"  {chunk[:100]}...")
       print()

.. code-block:: text

   Chunk 0 (37 words):
     Introduction to Bootstrap Methods

   The bootstrap is a resampling method introduced by Bradley Efron ...

   Chunk 1 (36 words):
     Given a dataset of n observations, we draw n samples with replacem...

   Chunk 2 (30 words):
     The percentile method takes the alpha/2 and 1-alpha/2 quantiles o...

   Chunk 3 (25 words):
     The bootstrap requires no distributional assumptions, making it w...
Semantic chunking never splits mid-paragraph, so every chunk reads as coherent prose. The word counts reveal one limitation of the greedy merger, though: a short heading always fits in the chunk being built, so "How the Bootstrap Works" is absorbed into the *tail* of Chunk 0 rather than leading Chunk 1; production chunkers treat headings as boundaries that force a new chunk. Even so, structure-aware chunking is the right default for tasks where context coherence matters, such as embedding for RAG (see :ref:`Section 6.5 `) or feeding to an LLM for annotation.

Choosing a Chunking Strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Chunking Strategy Selection Guide
   :header-rows: 1
   :widths: 25 25 25 25

   * - Strategy
     - Best For
     - Advantages
     - Disadvantages
   * - Fixed-size
     - Large corpora, approximate search
     - Simple, predictable size
     - May break mid-sentence
   * - Fixed + overlap
     - Embedding for retrieval
     - No boundary information loss
     - More chunks, higher cost
   * - Semantic
     - RAG, annotation, analysis
     - Preserves document structure
     - Variable chunk sizes
   * - Recursive
     - Complex documents
     - Hierarchical splitting
     - More complex to implement

Text Normalization and Cleaning
---------------------------------

Chunking solves the *size* problem; normalization solves the *quality* problem. However carefully we split a document, each chunk is only as useful as the text inside it, and text scraped from web pages, PDFs, or logs rarely arrives clean.

Standard Cleaning Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Real-world text often contains artifacts that waste tokens and confuse models: HTML tags, duplicate whitespace, encoding errors, and invisible Unicode characters.

.. code-block:: python

   import re

   def clean_text(text):
       """Standard text cleaning for LLM input.

       Removes HTML tags, replaces URLs with a placeholder, collapses
       whitespace, and strips control characters. Deliberately does NOT
       lowercase, stem, or remove stop words; LLMs benefit from natural,
       well-formed text.
       """
       # Remove HTML tags
       text = re.sub(r'<[^>]+>', '', text)
       # Replace URLs with a short placeholder
       text = re.sub(r'https?://\S+', '[URL]', text)
       # Normalize whitespace (collapse runs of spaces, tabs, and newlines)
       text = re.sub(r'\s+', ' ', text)
       # Remove any remaining non-whitespace control characters
       text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
       return text.strip()

   raw = """<div><p>This is a test of the cleaning pipeline.</p></div>
   Visit https://example.com/very/long/url for more info.
   Extra       spaces       and\ttabs\there.
   """

   cleaned = clean_text(raw)
   print(f"Raw ({len(raw)} chars): {repr(raw[:80])}")
   print(f"Clean ({len(cleaned)} chars): {cleaned}")

.. code-block:: text

   Raw (154 chars): '<div><p>This is a test of the cleaning pipeline.</p></div>\nVisit https://example'
   Clean (95 chars): This is a test of the cleaning pipeline. Visit [URL] for more info. Extra spaces and tabs here.

When NOT to Normalize
~~~~~~~~~~~~~~~~~~~~~~~

LLMs are *not* traditional NLP models. Aggressive preprocessing can remove information the model needs:

- **Do not lowercase**: Case carries meaning ("Apple" the company vs. "apple" the fruit).
- **Do not remove stop words**: LLMs understand grammar; removing "not" from "not significant" inverts meaning.
- **Do not stem or lemmatize**: The model's tokenizer handles morphological variation.
- **Do not remove punctuation**: Sentence boundaries, questions, and emphasis depend on punctuation.

The principle: clean *noise* (HTML, encoding errors, duplicate whitespace) but preserve *signal* (case, grammar, punctuation, natural phrasing).

Building a Complete Preprocessing Pipeline
--------------------------------------------

A production preprocessing pipeline combines all the steps above into a single, testable class:

.. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/PartIV/Chapter6/ch6_3_fig04_preprocessing_pipeline.png
   :alt: Complete text preprocessing pipeline
   :align: center
   :width: 90%

   **Figure 6.3.4:** The preprocessing pipeline from raw document to LLM-ready chunks. Each stage can be tested independently, and the pipeline can be adapted for different document types and downstream tasks.

.. code-block:: python

   import re

   class TextPreprocessor:
       """Configurable text preprocessing pipeline for LLM workflows.

       chunk_size and overlap are measured in words.
       """

       def __init__(self, chunk_size=500, overlap=50, strategy="semantic"):
           self.chunk_size = chunk_size
           self.overlap = overlap
           self.strategy = strategy

       def clean(self, text):
           """Remove noise while preserving natural language structure."""
           text = re.sub(r'<[^>]+>', '', text)
           text = re.sub(r'https?://\S+', '[URL]', text)
           text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
           # Collapse whitespace inside each paragraph but keep blank-line
           # breaks, so the semantic strategy still sees paragraph boundaries
           paragraphs = re.split(r'\n\s*\n', text)
           paragraphs = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
           return '\n\n'.join(p for p in paragraphs if p)

       def chunk(self, text):
           """Split text into chunks using the configured strategy."""
           if self.strategy == "fixed":
               return self._chunk_fixed(text)
           elif self.strategy == "overlap":
               return self._chunk_overlap(text)
           elif self.strategy == "semantic":
               return self._chunk_semantic(text)
           else:
               raise ValueError(f"Unknown strategy: {self.strategy}")

       def _chunk_fixed(self, text):
           words = text.split()
           return [" ".join(words[i:i+self.chunk_size])
                   for i in range(0, len(words), self.chunk_size)]

       def _chunk_overlap(self, text):
           step = self.chunk_size - self.overlap
           if step <= 0:
               raise ValueError("overlap must be smaller than chunk_size")
           words = text.split()
           chunks = []
           for i in range(0, len(words), step):
               chunk = " ".join(words[i:i+self.chunk_size])
               if chunk:
                   chunks.append(chunk)
               if i + self.chunk_size >= len(words):
                   break
           return chunks

       def _chunk_semantic(self, text):
           paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
           chunks, current, size = [], [], 0
           for para in paragraphs:
               para_size = len(para.split())
               if size + para_size > self.chunk_size and current:
                   chunks.append('\n\n'.join(current))
                   current, size = [para], para_size
               else:
                   current.append(para)
                   size += para_size
           if current:
               chunks.append('\n\n'.join(current))
           return chunks

       def process(self, text):
           """Full pipeline: clean → check size → chunk if needed."""
           cleaned = self.clean(text)
           word_count = len(cleaned.split())
           # chunk_size is in words, so compare words with words
           if word_count <= self.chunk_size:
               return [cleaned]
           return self.chunk(cleaned)

       def validate(self, chunks):
           """Check that all chunks meet size constraints."""
           report = {
               "n_chunks": len(chunks),
               "sizes": [len(c.split()) for c in chunks],
               "total_words": sum(len(c.split()) for c in chunks),
               "all_within_limit": all(
                   len(c.split()) <= self.chunk_size * 1.1 for c in chunks
               ),
           }
           return report

   # Usage
   preprocessor = TextPreprocessor(chunk_size=100, overlap=15, strategy="overlap")

   long_doc = " ".join([f"Paragraph {i} covers topic {i % 3}. " * 10
                        for i in range(20)])
   chunks = preprocessor.process(long_doc)
   report = preprocessor.validate(chunks)

   print(f"Input: {len(long_doc.split())} words")
   print(f"Chunks: {report['n_chunks']}")
   print(f"Chunk sizes: {report['sizes'][:5]}...")
   print(f"All within limit: {report['all_within_limit']}")

.. code-block:: text

   Input: 1000 words
   Chunks: 12
   Chunk sizes: [100, 100, 100, 100, 100]...
   All within limit: True

Pipeline Validation
~~~~~~~~~~~~~~~~~~~~

Always validate your preprocessing pipeline before running it at scale:

.. code-block:: python

   def validate_pipeline(preprocessor, sample_texts):
       """Test preprocessing on sample texts and report statistics."""
       all_reports = []
       for text in sample_texts:
           chunks = preprocessor.process(text)
           report = preprocessor.validate(chunks)
           all_reports.append(report)

       total_chunks = sum(r['n_chunks'] for r in all_reports)
       all_sizes = [s for r in all_reports for s in r['sizes']]

       print(f"Documents processed: {len(sample_texts)}")
       print(f"Total chunks: {total_chunks}")
       print(f"Avg chunks/doc: {total_chunks/len(sample_texts):.1f}")
       print(f"Chunk size range: [{min(all_sizes)}, {max(all_sizes)}] words")
       print(f"Mean chunk size: {sum(all_sizes)/len(all_sizes):.0f} words")
       print(f"All within limit: {all(r['all_within_limit'] for r in all_reports)}")

   sample_docs = [
       "Short text that fits in one chunk.",
       " ".join(["Medium text. "] * 100),
       " ".join(["Long document with many paragraphs. "] * 500),
   ]
   validate_pipeline(preprocessor, sample_docs)

.. code-block:: text

   Documents processed: 3
   Total chunks: 34
   Avg chunks/doc: 11.3
   Chunk size range: [7, 100] words
   Mean chunk size: 93 words
   All within limit: True

Chapter 6.3 Exercises: Text Preprocessing
--------------------------------------------

.. admonition:: Exercise 6.3.1: Tokenization Explorer
   :class: hint

   a) Write a function that uses ``chat_complete()`` to measure the token count of a given text. Test it on five different content types: (1) standard English prose, (2) Python code, (3) a URL-heavy web page, (4) a mathematical expression, (5) text in a language other than English.

   b) Compute the token-to-word ratio for each content type. Which content types are most and least "token-efficient"?

   c) Based on your findings, explain why token estimation heuristics need to account for content type.

   .. dropdown:: Solution
      :icon: unlock

      .. code-block:: python

         def measure_tokens(text, ai):
             """Measure actual token count by sending text through chat_complete."""
             response = ai.chat_complete(f"Repeat exactly: {text}")
             words = len(text.split())
             chars = len(text)
             return {
                 "text_preview": text[:50],
                 "words": words,
                 "chars": chars,
                 "prompt_tokens": response.prompt_tokens,
                 "token_word_ratio": response.prompt_tokens / max(words, 1),
                 "token_char_ratio": response.prompt_tokens / max(chars, 1),
             }

         test_cases = {
             "prose": "The bootstrap method estimates sampling variability by "
                      "repeatedly drawing samples with replacement from the data.",
             "code": "def bootstrap(data, n=1000):\n    return [np.mean("
                     "np.random.choice(data, len(data))) for _ in range(n)]",
             "urls": "Visit https://genai.rcac.purdue.edu/api/v1 and "
                     "https://docs.python.org/3/library/statistics.html",
             "math": "E[X] = Σ x_i * P(x_i), Var(X) = E[X²] - (E[X])²",
             "spanish": "El método bootstrap estima la variabilidad muestral "
                        "mediante remuestreo con reemplazo de los datos.",
         }

         for content_type, text in test_cases.items():
             result = measure_tokens(text, ai)
             print(f"{content_type:8s}: {result['token_word_ratio']:.2f} "
                   f"tokens/word ({result['prompt_tokens']} tokens, "
                   f"{result['words']} words)")

.. admonition:: Exercise 6.3.2: Chunking Strategy Comparison
   :class: hint

   a) Create a long document (1,000+ words) with clear section structure (headers, paragraphs). Chunk it using fixed-size, fixed-size with overlap, and semantic strategies, all targeting ~200-word chunks.

   b) Embed each set of chunks using ``ai.embed()`` and compute the average cosine similarity between each chunk embedding and the full-document embedding. Which chunking strategy produces chunks that are most representative of the full document?

   c) For the fixed-size strategy, deliberately create a chunk that splits mid-sentence. Show how this affects the embedding compared to a clean sentence-boundary chunk.

   .. dropdown:: Solution
      :icon: unlock

      .. code-block:: python

         import numpy as np

         # Scaled-down illustration: a short document with ~30-word chunks.
         # Substitute a 1,000+ word document and chunk_size=200 for part (a).
         document = """
         Introduction to Statistical Learning

         Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised.

         Supervised Learning

         In supervised learning, for each observation of the predictor measurements, there is an associated response measurement. We wish to fit a model that relates the response to the predictors.

         Unsupervised Learning

         In unsupervised learning, for every observation, we observe a vector of measurements but no associated response. We seek to understand the relationships between the variables.

         Model Assessment

         In order to evaluate the performance of a statistical learning method, we need some way to measure how well its predictions match the observed data. The most common approach is to compute the mean squared error.
         """.strip()

         preprocessor_fixed = TextPreprocessor(chunk_size=30, strategy="fixed")
         preprocessor_overlap = TextPreprocessor(chunk_size=30, overlap=5,
                                                 strategy="overlap")
         preprocessor_semantic = TextPreprocessor(chunk_size=30, strategy="semantic")

         chunks_fixed = preprocessor_fixed.process(document)
         chunks_overlap = preprocessor_overlap.process(document)
         chunks_semantic = preprocessor_semantic.process(document)

         doc_emb = ai.embed(document)

         for name, chunks in [("Fixed", chunks_fixed),
                              ("Overlap", chunks_overlap),
                              ("Semantic", chunks_semantic)]:
             chunk_embs = ai.embed(chunks)
             sims = [GenAIStudio.cosine_similarity(doc_emb, ce)
                     for ce in chunk_embs]
             print(f"{name:8s}: {len(chunks)} chunks, "
                   f"avg similarity to full doc: {np.mean(sims):.4f}")

.. admonition:: Exercise 6.3.3: Context Window Budget Calculator
   :class: hint

   Write a ``ContextBudget`` class that:

   a) Takes a model's context window size, system prompt, and desired output length.

   b) Computes the remaining token budget for user input.

   c) Given a long text, determines whether chunking is needed and, if so, how many chunks.

   d) Reports a complete budget breakdown.

   Test it with a 2,000-word document against the 8,192-token window of llama3.2.

   .. dropdown:: Solution
      :icon: unlock

      .. code-block:: python

         import math

         class ContextBudget:
             def __init__(self, context_window=8192, system_tokens=200,
                          output_tokens=1000, safety_margin=100):
                 self.context_window = context_window
                 self.system_tokens = system_tokens
                 self.output_tokens = output_tokens
                 self.safety_margin = safety_margin
                 self.available = (context_window - system_tokens
                                   - output_tokens - safety_margin)

             def analyze(self, text):
                 est_tokens = int(len(text.split()) * 1.3)
                 needs_chunking = est_tokens > self.available
                 # Round up so the last partial chunk is counted exactly once
                 n_chunks = (math.ceil(est_tokens / self.available)
                             if needs_chunking else 1)
                 return {
                     "context_window": self.context_window,
                     "system_tokens": self.system_tokens,
                     "output_reserve": self.output_tokens,
                     "safety_margin": self.safety_margin,
                     "available_for_input": self.available,
                     "estimated_input_tokens": est_tokens,
                     "needs_chunking": needs_chunking,
                     "recommended_chunks": n_chunks,
                 }

             def report(self, text):
                 analysis = self.analyze(text)
                 for key, val in analysis.items():
                     print(f"  {key}: {val}")

         budget = ContextBudget(context_window=8192)
         long_doc = "Statistical analysis of data. " * 500  # 4 words x 500 = 2,000 words
         budget.report(long_doc)

.. admonition:: Exercise 6.3.4: Preprocessing Pipeline for Academic Papers
   :class: hint

   a) Design a preprocessing pipeline specifically for academic paper abstracts. It should handle: LaTeX artifacts (``\textbf{}``, ``\cite{}``), reference markers like ``[1]`` or ``(Smith et al., 2023)``, and mathematical notation.

   b) Apply your pipeline to 5 sample abstracts (you can write synthetic ones). Verify that the cleaned text is still readable and retains the key content.

   c) Measure token counts before and after cleaning. How many tokens does preprocessing save?

   .. dropdown:: Solution
      :icon: unlock

      .. code-block:: python

         import re

         class AcademicPreprocessor(TextPreprocessor):
             def clean(self, text):
                 # Unwrap LaTeX formatting commands, drop citations
                 text = re.sub(r'\\textbf\{([^}]+)\}', r'\1', text)
                 text = re.sub(r'\\textit\{([^}]+)\}', r'\1', text)
                 text = re.sub(r'\\cite\{[^}]+\}', '', text)
                 text = re.sub(r'\\ref\{[^}]+\}', '[ref]', text)
                 # Remove bracketed references [1], [1,2,3]
                 text = re.sub(r'\[\d+(?:,\s*\d+)*\]', '', text)
                 # Remove parenthetical citations
                 text = re.sub(r'\([A-Z][a-z]+ et al\.,? \d{4}\)', '', text)
                 # Standard cleaning
                 text = super().clean(text)
                 return text

         pipeline = AcademicPreprocessor(chunk_size=200, strategy="semantic")

         abstracts = [
             r"We present a \textbf{novel} approach to bootstrap inference "
             r"\cite{efron1979} that improves coverage [1,2]. Our method "
             r"(Smith et al., 2023) achieves 95\% coverage in simulations.",
         ]

         for abstract in abstracts:
             cleaned = pipeline.clean(abstract)
             print(f"Before ({len(abstract.split())} words): {abstract[:80]}...")
             print(f"After  ({len(cleaned.split())} words): {cleaned[:80]}...")
             print()

Transition to What Follows
----------------------------

With preprocessing in place, we can now move to a core application of LLMs: **data annotation**. In :ref:`Section 6.4 `, we build annotation pipelines that label text data at scale, addressing the persistent bottleneck of labeled data for supervised learning. Good preprocessing ensures that the text reaching those annotation prompts is clean, properly sized, and ready for reliable classification.

Key Takeaways
~~~~~~~~~~~~~~

.. admonition:: Key Takeaways 📝
   :class: tip

   1. **Tokenization** determines how models see text. Common words become single tokens; rare or technical terms are split into multiple subwords. Everything in LLM pipelines is measured in tokens.

   2. **Context windows** are a fixed token budget shared between input and output. When documents exceed the window, chunking is required.

   3. **Chunking strategies** trade off simplicity against coherence. Fixed-size is simple; overlap prevents boundary information loss; semantic chunking preserves document structure.

   4. **LLM preprocessing differs from classical NLP**: do *not* lowercase, stem, or remove stop words. Clean noise but preserve natural language structure.

   5. **Validate your pipeline** before deploying at scale. Check chunk sizes, boundary handling, and content preservation on representative samples.