Section 6.3 Text Preprocessing for LLM Pipelines

Before text reaches a language model, it must pass through a series of transformations that are easy to overlook but difficult to recover from when done poorly. A model that receives HTML tags alongside article text will waste tokens on markup. A document that exceeds the model’s context window will be silently truncated, losing potentially critical information. A pipeline that chunks text mid-sentence will produce embeddings that capture fragments rather than ideas.

Text preprocessing for LLMs differs from traditional NLP preprocessing. Classical pipelines aggressively transform text—lowercasing, stemming, removing stop words—because the downstream model (a bag-of-words classifier, say) benefits from reduced vocabulary. LLMs, by contrast, understand language; they benefit from well-formed, natural text. The goal is not to simplify the text but to ensure it arrives in a form the model can process effectively within its constraints.

This section develops the preprocessing pipeline that sits between raw data and the LLM techniques of subsequent sections. We focus on three critical steps: understanding how models see text (tokenization), managing the fundamental constraint of context windows, and dividing long documents into meaningful chunks.

Road Map 🧭

Understand how tokenizers convert text to tokens—the actual units models process
Manage context windows as a fixed token budget that constrains everything
Choose chunking strategies appropriate for different document types and downstream tasks
Build preprocessing pipelines that clean, normalize, and prepare text for LLM consumption

Tokenization: How Models See Text

Subword Tokenization

Language models do not process text character by character or word by word. They use subword tokenization, which splits text into units called tokens that balance vocabulary size against sequence length.

Common English words typically map to a single token (“the”, “cat”, “important”). Rare or technical words are split into subword pieces: “immunohistochemistry” might become [“immun”, “oh”, “ist”, “oche”, “mist”, “ry”]. Numbers, punctuation, and code often tokenize unpredictably: “3.14159” might become [“3”, “.”, “14”, “159”].

How tokenizers see different types of text — Fig. 222 **Figure 6.3.1:** Subword tokenization in action. Common English words become single tokens, while rare or technical terms are split into multiple subword pieces. Code and mathematical notation often tokenize less efficiently than prose.

Why does this matter for data scientists? Because everything in an LLM pipeline is measured in tokens, not words or characters. Context windows, API pricing, embedding input limits, and generation budgets are all denominated in tokens. A rough heuristic: 1 token ≈ 0.75 English words, or equivalently, 1 English word ≈ 1.3 tokens. But this ratio varies significantly with content type.

Counting Tokens in Practice

Since we cannot directly access the tokenizer used by models on GenAI Studio, we measure token counts indirectly: send text through chat_complete() and read back the token usage the gateway reports for the full prompt:

from genai_studio import GenAIStudio
ai = GenAIStudio()
ai.select_model("llama3.2:latest")

# Use chat_complete to measure token usage
test_text = ("The bootstrap resamples data with replacement to estimate "
             "the sampling distribution of a statistic. By repeating this "
             "process thousands of times, we build an empirical approximation "
             "to the true sampling distribution.")

prompt = f"Repeat the following text exactly: {test_text}"
response = ai.chat_complete(prompt)
print(f"Prompt tokens: {response.prompt_tokens}")
print(f"Completion tokens: {response.completion_tokens}")
print(f"Total tokens: {response.total_tokens}")
print(f"Word count: {len(prompt.split())}")
print(f"Token/word ratio: {response.prompt_tokens / len(prompt.split()):.2f}")

Prompt tokens: 52
Completion tokens: 43
Total tokens: 95
Word count: 40
Token/word ratio: 1.30

Token Estimation Heuristics

For planning purposes, use these approximations:

def estimate_tokens(text, method="words"):
    """Estimate token count from text.

    The word-based heuristic (1 word ≈ 1.3 tokens) is reasonably
    accurate for English prose. Character-based (1 token ≈ 4 chars)
    is better for mixed content (code, URLs, technical text).
    """
    if method == "words":
        return int(len(text.split()) * 1.3)
    elif method == "chars":
        return len(text) // 4
    else:
        raise ValueError(f"Unknown method: {method}")

sample = "The bootstrap resamples data with replacement."
print(f"Word estimate: {estimate_tokens(sample, 'words')} tokens")
print(f"Char estimate: {estimate_tokens(sample, 'chars')} tokens")

Word estimate: 7 tokens
Char estimate: 11 tokens

These are approximations. For production pipelines where token counts matter (e.g., staying within context windows), always measure actual token usage with chat_complete() on representative samples.

Context Windows and Token Limits

What Context Windows Mean

Every language model has a context window—the maximum number of tokens it can process in a single call. This window must contain everything: system prompts, user input, and the model’s response. If the total exceeds the window, input is truncated or the call fails.

Context window token budget — Fig. 223 **Figure 6.3.2:** The context window is a fixed token budget shared between system prompt, user input, and model output. For a model with an 8,192-token window, a 500-token system prompt and a 1,000-token expected output leave roughly 6,700 tokens for your actual data.

Context Window Sizes on GenAI Studio

The models available on GenAI Studio have varying context windows:

Table 60 Model Context Windows
Model	Context Window	Notes
llama3.2:latest	8,192 tokens	Good balance of capability and context
mistral:latest	8,192 tokens	Fast, efficient for shorter tasks
gemma3:12b	8,192 tokens	Strong general-purpose model
phi4:latest	16,384 tokens	Larger context window
deepseek-r1:1.5b	8,192 tokens	Smallest model, fastest inference

Managing the Token Budget

A practical function for checking whether text fits within a model’s context window:

def check_fits_context(text, model_context=8192,
                       system_tokens=200, output_tokens=1000):
    """Check if text fits within the context window budget."""
    estimated_input = estimate_tokens(text, "words")
    available = model_context - system_tokens - output_tokens
    fits = estimated_input <= available

    print(f"Context window:  {model_context:,} tokens")
    print(f"System prompt:  -{system_tokens:,}")
    print(f"Output reserve: -{output_tokens:,}")
    print(f"Available:       {available:,} tokens")
    print(f"Input estimate:  {estimated_input:,} tokens")
    print(f"Fits: {'Yes' if fits else 'NO — chunking required'}")
    return fits

short_text = "The mean is sensitive to outliers. " * 10
long_text = "Statistical analysis reveals patterns. " * 5000

print("Short text:")
check_fits_context(short_text)
print("\nLong text:")
check_fits_context(long_text)

Short text:
Context window:  8,192 tokens
System prompt:  -200
Output reserve: -1,000
Available:       6,992 tokens
Input estimate:  78 tokens
Fits: Yes

Long text:
Context window:  8,192 tokens
System prompt:  -200
Output reserve: -1,000
Available:       6,992 tokens
Input estimate:  26,000 tokens
Fits: NO — chunking required

When text does not fit, we have two options: use a model with a larger context window, or chunk the text into smaller pieces. Chunking is almost always the more robust approach, because it works regardless of model and scales to arbitrarily long documents.

Chunking Strategies

Chunking divides a long document into smaller segments, each of which fits within the model’s context window. The challenge is choosing where to split: cuts in the wrong places destroy context and produce incoherent fragments.

Fixed-Size Chunking

The simplest approach: split text into chunks of approximately equal token count.

def chunk_fixed_size(text, chunk_size=500, unit="words"):
    """Split text into fixed-size chunks by word count.

    A rough proxy for token count (1 word ≈ 1.3 tokens).
    For more precise control, use character-based chunking
    with chunk_size = desired_tokens * 4.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Example with a long text
long_text = " ".join([f"Sentence {i} discusses topic {i % 5}." for i in range(200)])
chunks = chunk_fixed_size(long_text, chunk_size=50)
print(f"Document: {len(long_text.split())} words")
print(f"Chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"\n  Chunk {i} ({len(chunk.split())} words): {chunk[:80]}...")

Document: 1000 words
Chunks: 20

  Chunk 0 (50 words): Sentence 0 discusses topic 0. Sentence 1 discusses topic 1. Sentence 2 discusses...
  Chunk 1 (50 words): Sentence 10 discusses topic 0. Sentence 11 discusses topic 1. Sentence 12 discus...
  Chunk 2 (50 words): Sentence 20 discusses topic 0. Sentence 21 discusses topic 1. Sentence 22 discus...

Fixed-size chunking is fast and predictable but crude. It may split mid-sentence, breaking coherence. This is acceptable for some tasks (e.g., embedding large corpora for approximate search) but problematic when context matters.

Chunking with Overlap

Adding overlap between consecutive chunks ensures that information near boundaries is not lost:

def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks.

    Overlap ensures no information falls into a gap between chunks.
    Typical overlap: 10-20% of chunk_size.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
        if i + chunk_size >= len(words):
            break
    return chunks

chunks_overlap = chunk_with_overlap(long_text, chunk_size=50, overlap=10)
print(f"Chunks without overlap: {len(chunks)}")
print(f"Chunks with overlap: {len(chunks_overlap)}")
print(f"Overlap cost: {len(chunks_overlap) - len(chunks)} extra chunks "
      f"({(len(chunks_overlap)/len(chunks) - 1)*100:.0f}% more)")

Chunks without overlap: 20
Chunks with overlap: 25
Overlap cost: 5 extra chunks (25% more)

The trade-off is clear: overlap improves boundary coverage but increases the total number of chunks (and therefore embedding or LLM costs). A 10–20% overlap is usually sufficient.

Semantic Chunking

Semantic chunking splits at natural document boundaries—paragraphs, section headers, or sentence endings—rather than at arbitrary word counts:

import re

def chunk_semantic(text, max_chunk_size=500):
    """Split text at paragraph boundaries, merging small paragraphs.

    Respects natural document structure by never splitting
    mid-paragraph. Merges consecutive short paragraphs until
    the chunk approaches max_chunk_size words.
    """
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

    chunks = []
    current_chunk = []
    current_size = 0

    for para in paragraphs:
        para_size = len(para.split())
        if current_size + para_size > max_chunk_size and current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = [para]
            current_size = para_size
        else:
            current_chunk.append(para)
            current_size += para_size

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks

# Example with structured text
structured_text = """
Introduction to Bootstrap Methods

The bootstrap is a resampling method introduced by Bradley Efron in 1979.
It estimates the sampling distribution of a statistic by repeatedly
resampling from the observed data with replacement.

How the Bootstrap Works

Given a dataset of n observations, we draw n samples with replacement
to create a bootstrap sample. We compute the statistic of interest
on this sample. Repeating this B times gives B bootstrap estimates.

Confidence Intervals

The percentile method takes the alpha/2 and 1-alpha/2 quantiles of
the bootstrap distribution as confidence interval endpoints. This
is simple but can be improved with bias-corrected methods.

Advantages and Limitations

The bootstrap requires no distributional assumptions, making it
widely applicable. However, it can fail for heavy-tailed distributions
or when the sample size is very small.
""".strip()

chunks = chunk_semantic(structured_text, max_chunk_size=40)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(f"  {chunk[:100]}...")
    print()

Chunk 0 (37 words):
  Introduction to Bootstrap Methods

The bootstrap is a resampling method introduced by Bradley Efron ...

Chunk 1 (36 words):
  Given a dataset of n observations, we draw n samples with replacem...

Chunk 2 (30 words):
  The percentile method takes the alpha/2 and 1-alpha/2 quantiles o...

Chunk 3 (25 words):
  The bootstrap requires no distributional assumptions, making it w...

Semantic chunking never splits mid-paragraph, so every chunk reads as coherent prose. The word counts reveal one limitation of the greedy merger, though: a short heading always fits in the chunk being built, so “How the Bootstrap Works” is absorbed into the tail of Chunk 0 rather than leading Chunk 1—production chunkers treat headings as boundaries that force a new chunk. Even so, structure-aware chunking is the right default for tasks where context coherence matters—such as embedding for RAG (see Section 6.5) or feeding to an LLM for annotation.

Choosing a Chunking Strategy

Table 61 Chunking Strategy Selection Guide
Strategy	Best For	Advantages	Disadvantages
Fixed-size	Large corpora, approximate search	Simple, predictable size	May break mid-sentence
Fixed + overlap	Embedding for retrieval	No boundary information loss	More chunks, higher cost
Semantic	RAG, annotation, analysis	Preserves document structure	Variable chunk sizes
Recursive	Complex documents	Hierarchical splitting	More complex to implement

Text Normalization and Cleaning

Chunking solves the size problem; normalization solves the quality problem. However carefully we split a document, each chunk is only as useful as the text inside it—and text scraped from web pages, PDFs, or logs rarely arrives clean.

Standard Cleaning Pipeline

Real-world text often contains artifacts that waste tokens and confuse models: HTML tags, duplicate whitespace, encoding errors, and invisible Unicode characters.

import re

def clean_text(text):
    """Standard text cleaning for LLM input.

    Removes HTML, fixes whitespace, and normalizes Unicode.
    Deliberately does NOT lowercase, stem, or remove stop words—
    LLMs benefit from natural, well-formed text.
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+', '[URL]', text)
    # Normalize whitespace (collapse multiple spaces/newlines)
    text = re.sub(r'\s+', ' ', text)
    # Remove control characters (but keep newlines for structure)
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
    return text.strip()

raw = """<p>This is a <b>test</b> of the   cleaning pipeline.</p>
Visit https://example.com/very/long/url for more info.
Extra    spaces   and\ttabs\there.
"""
cleaned = clean_text(raw)
print(f"Raw ({len(raw)} chars): {repr(raw[:80])}")
print(f"Clean ({len(cleaned)} chars): {cleaned}")

Raw (154 chars): '<p>This is a <b>test</b> of the   cleaning pipeline.</p>\nVisit https://example'
Clean (81 chars): This is a test of the cleaning pipeline. Visit [URL] for more info. Extra spaces and tabs here.

When NOT to Normalize

LLMs are not traditional NLP models. Aggressive preprocessing can remove information the model needs:

Do not lowercase: Case carries meaning (“Apple” the company vs. “apple” the fruit).
Do not remove stop words: LLMs understand grammar; removing “not” from “not significant” inverts meaning.
Do not stem or lemmatize: The model’s tokenizer handles morphological variation.
Do not remove punctuation: Sentence boundaries, questions, and emphasis depend on punctuation.

The principle: clean noise (HTML, encoding errors, duplicate whitespace) but preserve signal (case, grammar, punctuation, natural phrasing).

Building a Complete Preprocessing Pipeline

A production preprocessing pipeline combines all the steps above into a single, testable function:

Complete text preprocessing pipeline — Fig. 225 **Figure 6.3.4:** The preprocessing pipeline from raw document to LLM-ready chunks. Each stage can be tested independently, and the pipeline can be adapted for different document types and downstream tasks.

import re

class TextPreprocessor:
    """Configurable text preprocessing pipeline for LLM workflows."""

    def __init__(self, chunk_size=500, overlap=50, strategy="semantic"):
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.strategy = strategy

    def clean(self, text):
        """Remove noise while preserving natural language structure."""
        text = re.sub(r'<[^>]+>', '', text)
        text = re.sub(r'https?://\S+', '[URL]', text)
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
        return text.strip()

    def chunk(self, text):
        """Split text into chunks using the configured strategy."""
        if self.strategy == "fixed":
            return self._chunk_fixed(text)
        elif self.strategy == "overlap":
            return self._chunk_overlap(text)
        elif self.strategy == "semantic":
            return self._chunk_semantic(text)
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def _chunk_fixed(self, text):
        words = text.split()
        return [" ".join(words[i:i+self.chunk_size])
                for i in range(0, len(words), self.chunk_size)]

    def _chunk_overlap(self, text):
        words = text.split()
        chunks = []
        step = self.chunk_size - self.overlap
        for i in range(0, len(words), step):
            chunk = " ".join(words[i:i+self.chunk_size])
            if chunk:
                chunks.append(chunk)
            if i + self.chunk_size >= len(words):
                break
        return chunks

    def _chunk_semantic(self, text):
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        chunks, current, size = [], [], 0
        for para in paragraphs:
            para_size = len(para.split())
            if size + para_size > self.chunk_size and current:
                chunks.append('\n\n'.join(current))
                current, size = [para], para_size
            else:
                current.append(para)
                size += para_size
        if current:
            chunks.append('\n\n'.join(current))
        return chunks

    def process(self, text):
        """Full pipeline: clean → check size → chunk if needed."""
        cleaned = self.clean(text)
        word_count = len(cleaned.split())
        estimated_tokens = int(word_count * 1.3)

        if estimated_tokens <= self.chunk_size:
            return [cleaned]
        return self.chunk(cleaned)

    def validate(self, chunks):
        """Check that all chunks meet size constraints."""
        report = {
            "n_chunks": len(chunks),
            "sizes": [len(c.split()) for c in chunks],
            "total_words": sum(len(c.split()) for c in chunks),
            "all_within_limit": all(
                len(c.split()) <= self.chunk_size * 1.1 for c in chunks
            ),
        }
        return report

# Usage
preprocessor = TextPreprocessor(chunk_size=100, overlap=15, strategy="overlap")
long_doc = " ".join([f"Paragraph {i} covers topic {i % 3}. " * 10
                     for i in range(20)])
chunks = preprocessor.process(long_doc)
report = preprocessor.validate(chunks)

print(f"Input: {len(long_doc.split())} words")
print(f"Chunks: {report['n_chunks']}")
print(f"Chunk sizes: {report['sizes'][:5]}...")
print(f"All within limit: {report['all_within_limit']}")

Input: 1000 words
Chunks: 12
Chunk sizes: [100, 100, 100, 100, 100]...
All within limit: True

Pipeline Validation

Always validate your preprocessing pipeline before running it at scale:

def validate_pipeline(preprocessor, sample_texts):
    """Test preprocessing on sample texts and report statistics."""
    all_reports = []
    for text in sample_texts:
        chunks = preprocessor.process(text)
        report = preprocessor.validate(chunks)
        all_reports.append(report)

    total_chunks = sum(r['n_chunks'] for r in all_reports)
    all_sizes = [s for r in all_reports for s in r['sizes']]

    print(f"Documents processed: {len(sample_texts)}")
    print(f"Total chunks: {total_chunks}")
    print(f"Avg chunks/doc: {total_chunks/len(sample_texts):.1f}")
    print(f"Chunk size range: [{min(all_sizes)}, {max(all_sizes)}] words")
    print(f"Mean chunk size: {sum(all_sizes)/len(all_sizes):.0f} words")
    print(f"All within limit: {all(r['all_within_limit'] for r in all_reports)}")

sample_docs = [
    "Short text that fits in one chunk.",
    " ".join(["Medium text. "] * 100),
    " ".join(["Long document with many paragraphs. "] * 500),
]
validate_pipeline(preprocessor, sample_docs)

Documents processed: 3
Total chunks: 34
Avg chunks/doc: 11.3
Chunk size range: [7, 100] words
Mean chunk size: 93 words
All within limit: True

Chapter 6.3 Exercises: Text Preprocessing

Exercise 6.3.1 — Tokenization Explorer

Write a function that uses chat_complete() to measure the token count of a given text. Test it on five different content types: (1) standard English prose, (2) Python code, (3) a URL-heavy web page, (4) a mathematical expression, (5) text in a language other than English.
Compute the token-to-word ratio for each content type. Which content types are most and least “token-efficient”?
Based on your findings, explain why token estimation heuristics need to account for content type.

Exercise 6.3.2 — Chunking Strategy Comparison

Create a long document (1,000+ words) with clear section structure (headers, paragraphs). Chunk it using fixed-size, fixed-size with overlap, and semantic strategies, all targeting ~200-word chunks.
Embed each set of chunks using ai.embed() and compute the average within-chunk cosine similarity to the full document embedding. Which chunking strategy produces chunks that are most representative of the full document?
For the fixed-size strategy, deliberately create a chunk that splits mid-sentence. Show how this affects the embedding compared to a clean sentence-boundary chunk.

Exercise 6.3.3 — Context Window Budget Calculator

Write a ContextBudget class that:

Takes a model’s context window size, system prompt, and desired output length.
Computes the remaining token budget for user input.
Given a long text, determines whether chunking is needed and, if so, how many chunks.
Reports a complete budget breakdown.

Test it with a 2,000-word document against the 8,192-token window of llama3.2.

Exercise 6.3.4 — Preprocessing Pipeline for Academic Papers

Design a preprocessing pipeline specifically for academic paper abstracts. It should handle: LaTeX artifacts (\textbf{}, \cite{}), reference markers like [1] or (Smith et al., 2023), and mathematical notation.
Apply your pipeline to 5 sample abstracts (you can write synthetic ones). Verify that the cleaned text is still readable and retains the key content.
Measure token counts before and after cleaning. How many tokens does preprocessing save?

Transition to What Follows

With preprocessing in place, we can now move to a core application of LLMs: data annotation. In Section 6.4, we build annotation pipelines that label text data at scale—addressing the persistent bottleneck of labeled data for supervised learning. Good preprocessing ensures that the text reaching those annotation prompts is clean, properly sized, and ready for reliable classification.

Key Takeaways

Key Takeaways 📝

Tokenization determines how models see text. Common words become single tokens; rare or technical terms are split into multiple subwords. Everything in LLM pipelines is measured in tokens.
Context windows are a fixed token budget shared between input and output. When documents exceed the window, chunking is required.
Chunking strategies trade off simplicity against coherence. Fixed-size is simple; overlap prevents boundary information loss; semantic chunking preserves document structure.
LLM preprocessing differs from classical NLP: do not lowercase, stem, or remove stop words. Clean noise but preserve natural language structure.
Validate your pipeline before deploying at scale. Check chunk sizes, boundary handling, and content preservation on representative samples.