Building an AI-Powered Video Segmentation Toolkit
Transform lecture recordings into structured, navigable educational content using LLMs, Python, and modern web technologies.
1. Overview
This tutorial guides you through building a complete toolkit for transforming lecture video transcripts into AI-analyzed pedagogical segments. The system processes SRT subtitle files, uses large language models to identify educational boundaries, maps content to YouTube uploads, and generates interactive viewing experiences for students.
What You'll Build
AI Segmentation Engine
Process transcripts through LLMs to identify introductions, concept explanations, examples, and summaries with pedagogical metadata.
Structured JSON Output
Generate validated JSON with learning objectives, prerequisites, difficulty levels, and engagement tips for each segment.
YouTube Integration
Automatically map segmentation data to YouTube video IDs for embedded playback with timestamp navigation.
Interactive Viewer
Single-file HTML viewer with role-based content, search functionality, and theater mode for immersive learning.
Design Principles
- Bounded Expertise Oracle: AI analysis anchored to transcript content with mandatory timecode citations
- Graceful Degradation: Token truncation with clear markers when transcripts exceed limits
- Human-in-the-Loop: Flask-based annotation tool for timestamp corrections
- Schema Enforcement: Pydantic validation ensures reliable JSON parsing
- Idempotent Operations: Backup files before any mutation
This toolkit was developed for STAT 350 at Purdue University, processing 70+ lecture videos into a searchable, navigable learning platform. The same architecture applies to any course with video content.
2. System Architecture
The toolkit follows a pipeline architecture where each component produces artifacts consumed by downstream processes. This design enables incremental processing and easy debugging.
Component Responsibilities
| Component | Input | Output | Purpose |
|---|---|---|---|
| srt_pedagogical_segmentation.py | SRT files | JSON segments | Core AI analysis engine |
| segmentation_schema.py | JSON data | Validated objects | Schema enforcement |
| estimate_limit.py | SRT file | Token count | Context window sizing |
| create_youtube_mapping_smart.py | Upload log + JSON | Mapping JSON | Video-to-lecture matching |
| generate_segmentation_report_youtube.py | JSON + mapping | HTML viewer | Interactive interface |
| lecture_segment_annotator.py | JSON segments | Corrected JSON | Manual timestamp refinement |
3. Prerequisites
Required Software
# Core dependencies
pandas>=1.5.0
pydantic>=2.0.0
tiktoken>=0.5.0

# LLM providers (install one or both)
openai>=1.0.0
anthropic>=0.18.0

# Web application (for annotation tool)
flask>=3.0.0
flask-cors>=4.0.0

# Optional: video probing
# ffprobe (from ffmpeg) - install via system package manager
API Keys
Configure environment variables for your chosen LLM provider:
# OpenAI (for GPT-4o, o1, o3 models)
OPENAI_API_KEY=sk-your-key-here

# Anthropic (for Claude models)
ANTHROPIC_API_KEY=sk-ant-your-key-here

# Or use a local/custom endpoint
OPENAI_BASE_URL=https://your-endpoint.com/v1
Project Structure
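The tutorial does not mandate a specific layout; the arrangement below is one reasonable setup, assumed from the scripts and output directories used throughout this guide:

project/
├── srt_pedagogical_segmentation.py
├── segmentation_schema.py
├── estimate_limit.py
├── create_youtube_mapping_smart.py
├── generate_segmentation_report_youtube.py
├── lecture_segment_annotator.py
├── transcripts/                  # input SRT files
├── segmentation_reports/
│   └── json/                     # AI-generated segment JSON
├── corrected/                    # human-corrected JSON (annotation tool)
├── youtube_mapping.json
└── video_viewer.html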
4. Schema Design
Reliable LLM output parsing requires strict schema enforcement. We use Pydantic models to validate AI-generated JSON and provide clear error messages when output doesn't conform.
Segment Types
The toolkit recognizes ten pedagogical segment types, each with distinct visual styling and semantic meaning:
| Type | Description | Color |
|---|---|---|
| introduction | Topic introduction and learning objectives | #2E86DE |
| concept_explanation | Core concept explanation and theory | #5F27CD |
| example | Worked examples and demonstrations | #00B894 |
| deep_reasoning | Deep reasoning and intuition building | #D63031 |
| common_mistakes | Common mistakes and misconceptions | #E17055 |
| practice_problem | Practice problems and exercises | #00CEC9 |
| real_world_application | Real-world applications and context | #A29BFE |
| summary | Summary and key takeaways | #6C5CE7 |
| q_and_a | Student questions and answers | #00B894 |
| transition | Topic transitions and administrative content | #636E72 |
Pydantic Models
from typing import Dict, List, Optional
from pydantic import BaseModel, Field, validator
from enum import Enum
class SegmentType(str, Enum):
"""Valid segment types."""
INTRODUCTION = "introduction"
CONCEPT_EXPLANATION = "concept_explanation"
EXAMPLE = "example"
DEEP_REASONING = "deep_reasoning"
COMMON_MISTAKES = "common_mistakes"
PRACTICE_PROBLEM = "practice_problem"
REAL_WORLD_APPLICATION = "real_world_application"
SUMMARY = "summary"
Q_AND_A = "q_and_a"
TRANSITION = "transition"
class DifficultyLevel(str, Enum):
"""Valid difficulty levels."""
EASY = "Easy"
MEDIUM = "Medium"
HARD = "Hard"
class TimeRange(BaseModel):
"""Represents a time range in HH:MM:SS,mmm format."""
start: str = Field(..., pattern=r'^\d{2}:\d{2}:\d{2}[,;]\d{2,3}$')
end: str = Field(..., pattern=r'^\d{2}:\d{2}:\d{2}[,;]\d{2,3}$')
class SegmentSchema(BaseModel):
"""Schema for a single segment."""
time_range: TimeRange
segment_type: SegmentType
title: str = Field(..., min_length=1, max_length=200)
description: str = Field(..., min_length=1, max_length=500)
key_concepts: List[str] = Field(default_factory=list, max_items=10)
learning_objectives: List[str] = Field(default_factory=list, max_items=5)
prerequisites: List[str] = Field(default_factory=list, max_items=5)
difficulty: DifficultyLevel = Field(default=DifficultyLevel.MEDIUM)
engagement_tips: List[str] = Field(default_factory=list, max_items=5)
microlecture_suitable: bool = Field(default=False)
class Config:
use_enum_values = True
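# The three supporting models referenced below are not shown in full in this
# excerpt. The sketches here are assumptions: LectureOverview mirrors the
# overview keys used in the prompt later in this tutorial, and the other two
# are minimal placeholders.
class LectureOverview(BaseModel):
    """High-level summary of the whole lecture (assumed fields)."""
    learning_objectives: List[str] = Field(default_factory=list)
    prerequisites: List[str] = Field(default_factory=list)
    key_takeaways: List[str] = Field(default_factory=list)
class InteractiveOpportunity(BaseModel):
    """Placeholder: a point in the lecture suited to an interactive activity."""
    time_range: Optional[TimeRange] = None
    description: str = ""
class MicrolectureRecommendation(BaseModel):
    """Placeholder: a segment span recommended for extraction as a short clip."""
    time_range: Optional[TimeRange] = None
    rationale: str = ""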
class SegmentationAnalysis(BaseModel):
"""Complete schema for segmentation analysis output."""
overview: LectureOverview
segments: List[SegmentSchema] = Field(..., min_items=1)
interactive_opportunities: List[InteractiveOpportunity] = Field(default_factory=list)
microlecture_recommendations: List[MicrolectureRecommendation] = Field(default_factory=list)
Always validate LLM output before storing. Use validate_analysis(data) to catch malformed responses early. LLMs occasionally produce invalid JSON, especially with complex schemas.
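The validation helper itself is not shown above; a minimal sketch, assuming Pydantic v2's model_validate, might look like this:

from pydantic import ValidationError

def validate_analysis(data: dict) -> SegmentationAnalysis:
    """Validate raw LLM output against the schema, raising a clear error."""
    try:
        return SegmentationAnalysis.model_validate(data)
    except ValidationError as exc:
        # Surface the exact fields that failed so the response can be regenerated
        raise ValueError(f"LLM output failed schema validation:\n{exc}") from exc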
5. Core Segmentation Engine
The segmentation engine processes SRT transcripts through several stages: parsing, enhancement, LLM analysis, and output generation.
Timestamp Handling
SRT files use millisecond timestamps (HH:MM:SS,mmm), while video editing uses frame-based timecodes. The toolkit handles both formats:
import logging
import re

logger = logging.getLogger(__name__)

def tc_to_seconds(tc: str, fps: float = 30.0) -> float:
"""Convert timecode to seconds - handles all formats."""
try:
if ',' in tc: # Millisecond format: HH:MM:SS,mmm
h, m, rest = tc.split(':')
s, ms = rest.split(',')
return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
else: # Frame format: HH:MM:SS:FF or HH:MM:SS;FF
parts = re.split(r'[:;]', tc)
if len(parts) == 4:
h, m, s, ff = parts
return int(h) * 3600 + int(m) * 60 + int(s) + int(ff) / fps
else:
h, m, s = parts
return int(h) * 3600 + int(m) * 60 + int(s)
except Exception as e:
logger.error(f"Error parsing timecode '{tc}': {e}")
return 0.0
def seconds_to_tc(sec: float, fps: float = 30.0, mode: str = "ms") -> str:
"""Convert seconds to timecode string."""
h, rem = divmod(sec, 3600)
m, s_full = divmod(rem, 60)
s = int(s_full)
if mode == "frames":
frames = int(round((s_full - s) * fps))
return f"{int(h):02d}:{int(m):02d}:{int(s):02d}:{frames:02d}"
else: # milliseconds
        ms = int(round((sec - int(sec)) * 1000))  # round, not truncate (185.2 s would otherwise yield 199 ms)
return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"
Transcript Enhancement
Before sending to the LLM, we enhance transcripts with embedded timestamp markers. This anchors the AI to actual timecodes rather than hallucinating times:
import re
from typing import List, Tuple

# Assumed module-level pattern: matches SRT timing lines such as
# "00:00:01,000 --> 00:00:04,200"
TIMESTAMP_RE = re.compile(
    r'(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})')

def enhance_transcript_with_timestamps(srt_text: str) -> Tuple[str, List]:
"""
Add timestamp markers to transcript for AI reference.
Returns: (enhanced_text, timestamp_mapping)
"""
enhanced_lines = []
timestamp_mapping = []
# Split into caption blocks
blocks = re.split(r'\n\n+', srt_text.strip())
for block in blocks:
lines = block.strip().split('\n')
if len(lines) < 3:
continue
# Extract timestamp
timestamp_match = TIMESTAMP_RE.search(lines[1])
if timestamp_match:
start_tc, end_tc = timestamp_match.groups()
# Get caption text (lines after timestamp)
text = ' '.join(lines[2:])
# Add to enhanced transcript with timestamp marker
enhanced_lines.append(f"[{start_tc}] {text}")
timestamp_mapping.append((start_tc, end_tc, text))
return '\n'.join(enhanced_lines), timestamp_mapping
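Applied to a single caption block, the enhancement produces output like this (illustrative transcript text):

sample = """1
00:00:00,000 --> 00:00:04,200
Welcome to STAT 350. Today we cover sampling distributions."""
enhanced, mapping = enhance_transcript_with_timestamps(sample)
print(enhanced)
# [00:00:00,000] Welcome to STAT 350. Today we cover sampling distributions.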
Token Budget Management
Long lectures may exceed model context windows. Use estimate_limit.py to calculate token requirements:
python estimate_limit.py --srt lecture_42.srt --model gpt-4o

# Output:
# Transcript characters : 45,230
# Exact tokens (gpt-4o) : 12,847
# ---------------------------------------------------------
# Recommended config.token_limit : 19,764 (largest safe model window)
# Minimum-for-current heuristic : 12,848 (fits transcript + 1-token reply)
The toolkit uses a 65/25/10 heuristic: 65% of context for transcript, 25% for model output, 10% safety margin. If transcripts exceed limits, they're truncated with a clear [TRANSCRIPT TRUNCATED] marker.
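As a rough sketch of how such an estimate can be computed with tiktoken (the actual estimate_limit.py may differ in details):

import tiktoken

def recommend_token_limit(srt_text: str, model: str = "gpt-4o",
                          transcript_share: float = 0.65) -> int:
    """Count exact tokens, then size the context so the transcript uses ~65% of it."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback for unknown model names
    n_tokens = len(enc.encode(srt_text))
    # 65% transcript / 25% output / 10% safety margin heuristic
    return int(n_tokens / transcript_share)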
6. LLM Prompting Strategy
Effective prompting is critical for reliable segmentation. The toolkit uses a structured prompt that constrains the LLM to use actual transcript timestamps.
Model Configuration
MODEL_CONFIGS = {
'gpt-4o': {
'temperature': 0.1, # Low temp for consistent output
'max_tokens': 4000,
'token_limit': 128000,
'provider': 'openai'
},
'claude-sonnet-4': {
'temperature': 0.1,
'max_tokens': 4000,
'token_limit': 200000,
'provider': 'claude'
},
'o3': {
'temperature': None, # O-series ignores temperature
'max_completion_tokens': 25000,
'token_limit': 200000,
'provider': 'openai',
'reasoning_effort': 'high'
}
}
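A simplified sketch of how these configs might be dispatched to the two providers (assumed wiring; the real engine also handles retries, truncation, and o-series parameters such as max_completion_tokens):

from openai import OpenAI
import anthropic

def run_analysis(model_name: str, system_prompt: str, transcript: str) -> str:
    """Route one segmentation request to the configured provider."""
    cfg = MODEL_CONFIGS[model_name]
    if cfg['provider'] == 'openai':
        client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment
        # Note: o-series models expect max_completion_tokens instead of max_tokens (not handled here)
        resp = client.chat.completions.create(
            model=model_name,
            temperature=cfg.get('temperature') or 1.0,
            max_tokens=cfg.get('max_tokens', 4000),
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": transcript}])
        return resp.choices[0].message.content
    else:  # Anthropic
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        resp = client.messages.create(
            model=model_name,
            max_tokens=cfg.get('max_tokens', 4000),
            temperature=cfg.get('temperature', 0.1),
            system=system_prompt,
            messages=[{"role": "user", "content": transcript}])
        return resp.content[0].text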
System Prompt Structure
The prompt follows a specific structure to maximize reliability:
You are an expert educational content analyst specializing in
pedagogical segmentation of lecture videos.
## CRITICAL TIMESTAMP RULES
1. ONLY use timestamps that appear in [HH:MM:SS,mmm] markers in the transcript
2. Segment boundaries MUST align with actual caption timestamps
3. DO NOT interpolate or invent timestamps
4. If uncertain about a boundary, use the nearest visible timestamp
## Segment Types
- introduction: Topic setup and learning objectives
- concept_explanation: Core theory and definitions
- example: Worked problems and demonstrations
- deep_reasoning: Intuition building and "why" explanations
- common_mistakes: Pitfalls and misconceptions
- practice_problem: Student exercises
- real_world_application: Practical applications
- summary: Key takeaways and review
- q_and_a: Student questions
- transition: Topic changes and administrative content
## Output Format
Return valid JSON matching this schema:
{
"overview": {
"learning_objectives": ["..."],
"prerequisites": ["..."],
"key_takeaways": ["..."]
},
"segments": [
{
"time_range": {"start": "HH:MM:SS,mmm", "end": "HH:MM:SS,mmm"},
"segment_type": "concept_explanation",
"title": "Descriptive Title",
"description": "1-2 sentence summary",
"key_concepts": ["concept1", "concept2"],
"difficulty": "Easy|Medium|Hard"
}
]
}
LLMs frequently hallucinate timestamps when not properly constrained. The embedded [HH:MM:SS,mmm] markers in enhanced transcripts and explicit instructions to only use visible timestamps significantly reduce this problem, but human verification remains essential.
7. YouTube Video Mapping
After uploading lectures to YouTube, you need to map video IDs to lecture indices. The create_youtube_mapping_smart.py script automates this using a two-pass matching algorithm.
Input: YouTube Upload Log
Most YouTube upload tools generate a CSV log. The script expects this format:
Video File,YouTube Video ID,Status
"/path/to/STAT 350 - Chapter 6.3.1 Intro.mp4",dQw4w9WgXcQ,UPLOAD SUCCESS
"/path/to/STAT 350 - Chapter 6.3.2 Examples.mp4",9bZkp7q19f0,UPLOAD SUCCESS
"/path/to/STAT 350 - Chapter 7.1 Overview.mp4",kJQP7kiw5Fk,UPLOAD SUCCESS
Matching Algorithm
1. Extract Chapter Information: Parse chapter numbers from filenames using regex patterns like Chapter\s*(\d+(?:\.\d+)*).
2. Pass 1 - Chapter Group Matching: Group videos and lectures by base chapter (e.g., "6.3") and match in order. This handles sub-chapters like 6.3.1, 6.3.2.
3. Pass 2 - Fuzzy Title Matching: For remaining unmatched items, use difflib.SequenceMatcher to find the best title match above a similarity threshold (default 0.7), as sketched below.
4. Generate Mapping: Output youtube_mapping.json with lecture index → video ID pairs.
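A condensed sketch of the two matching primitives (illustrative; the production script adds chapter grouping, ordering, and report generation):

import re
from difflib import SequenceMatcher

CHAPTER_RE = re.compile(r'Chapter\s*(\d+(?:\.\d+)*)', re.IGNORECASE)

def extract_chapter(filename: str):
    """Return the chapter string (e.g. '6.3.1') embedded in a filename, if any."""
    m = CHAPTER_RE.search(filename)
    return m.group(1) if m else None

def best_fuzzy_match(title: str, candidates: list, threshold: float = 0.7):
    """Pass 2: pick the most similar candidate title, or None below the threshold."""
    scored = [(SequenceMatcher(None, title.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None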
python create_youtube_mapping_smart.py \
--log-file youtube_upload_log.csv \
--json-dir segmentation_reports/json \
--output youtube_mapping.json \
--threshold 0.7
# Output files:
# - youtube_mapping.json (primary mapping)
# - youtube_mapping_detailed.csv (for human review)
# - unmatched_videos.csv (failed matches)
Output Format
{
"1": "dQw4w9WgXcQ",
"2": "9bZkp7q19f0",
"3": "kJQP7kiw5Fk",
"38": "abc123xyz",
"39": "def456uvw"
}
8. Interactive Video Viewer
The HTML viewer is a self-contained single-file application that renders all segmentation data with embedded YouTube playback.
Key Features
Role-Based Views
Students see simplified content; instructors see engagement tips, difficulty badges, and analytics.
Full-Text Search
Search across all lectures by title, concept, description, or learning objective.
Theater Mode
Immersive full-width video playback with hidden sidebar.
Visual Timeline
Color-coded segment bars showing lecture structure at a glance.
Generation Command
python generate_segmentation_report_youtube.py \
--json-dir segmentation_reports/json \
--youtube-mapping youtube_mapping.json \
--out video_viewer.html
Data Embedding
Segment data is embedded directly in the HTML as JavaScript objects:
// Embedded in generated HTML
window.segmentData = {
"1": {
"lecture_index": 1,
"lecture_title": "Chapter 1.1: Introduction to Statistics",
"total_duration": 1847.5,
"segments": [
{
"start_time": 0,
"end_time": 185.2,
"start_tc": "00:00:00,000",
"end_tc": "00:03:05,200",
"segment_type": "introduction",
"title": "Course Overview",
"description": "Introduction to the course...",
"key_concepts": ["statistics", "data analysis"],
"difficulty_level": "Easy"
}
// ... more segments
]
}
};
const YOUTUBE_MAPPING = {
"1": "dQw4w9WgXcQ",
"2": "9bZkp7q19f0"
};
The viewer is entirely self-contained: no server required. Host it on GitHub Pages, drop it in an LMS, or open it directly in a browser. All styles, scripts, and data are inline.
9. Manual Annotation Tool
AI-generated timestamps often need refinement. The Flask-based annotation tool provides a side-by-side interface for human correction.
Starting the Server
# Start the Flask server
python lecture_segment_annotator.py

# Server runs at http://localhost:5005
# Open in browser to begin annotation
Workflow
1. Initialize: On first run, the tool copies all original JSONs to a corrected/ directory.
2. Select Lecture: Browse available lectures in the sidebar. Videos with corrections show a badge.
3. Adjust Timestamps: Play the YouTube video and click on segments to adjust start/end times.
4. Save Corrections: Changes are saved to the corrected directory with metadata timestamps.
5. Export: Download all corrected segments as a ZIP for deployment.
API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /api/lectures | GET | List all available lectures with correction status |
| /api/segments/<filename> | GET | Get segments for a specific lecture |
| /api/segments/save | POST | Save corrected timestamps |
| /api/reset/<filename> | POST | Reset to original timestamps |
| /api/export | GET | Download all segments as ZIP |
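To illustrate the shape of these endpoints, here is a hedged sketch of the save route; the request payload fields shown are assumptions, and the real handler also records correction metadata:

from pathlib import Path
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
CORRECTED_DIR = Path("corrected")  # assumed location of corrected JSON files

@app.route("/api/segments/save", methods=["POST"])
def save_segments():
    """Persist corrected segment timestamps for one lecture."""
    payload = request.get_json(force=True)
    filename = payload.get("filename")
    segments = payload.get("segments")
    if not filename or segments is None:
        return jsonify({"error": "filename and segments are required"}), 400
    out_path = CORRECTED_DIR / filename
    data = json.loads(out_path.read_text()) if out_path.exists() else {}
    data["segments"] = segments
    out_path.write_text(json.dumps(data, indent=2))
    return jsonify({"status": "saved", "file": filename})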
10. Utility Scripts
LLM output often contains artifacts that need post-processing. These utilities clean and repair JSON files.
fix_json_line_breaks.py
Removes unwanted line breaks and HTML artifacts from text fields:
python fix_json_line_breaks.py --json-dir segmentation_reports/json

# Fixes:
# - Removes \n, \r, \t sequences
# - Cleans stray HTML (e.g. <br/> tags and escaped entities)
# - Normalizes whitespace
# - Creates .bak backups
merge_split_items.py
LLMs sometimes split list items incorrectly. This script merges fragments:
// Before (incorrectly split)
"prerequisites": [
"High",
"school algebra",
"Basic probability"
]
// After (merged)
"prerequisites": [
"High-school algebra",
"Basic probability"
]
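The exact rules live in merge_split_items.py; a minimal illustrative heuristic (assumed, not the script's actual logic) is to re-join any fragment that starts with a lowercase letter:

def merge_split_items(items: list) -> list:
    """Re-join fragments that were split mid-phrase by the LLM."""
    merged = []
    for item in items:
        text = item.strip()
        # Heuristic: a fragment starting lowercase continues the previous item
        if merged and text and text[0].islower():
            merged[-1] = f"{merged[-1]} {text}".strip()
        else:
            merged.append(text)
    return merged

print(merge_split_items(["High", "school algebra", "Basic probability"]))
# ['High school algebra', 'Basic probability']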
fix_lecture_indices.py
Ensures lecture_index fields match filename prefixes:
# Preview changes
python fix_lecture_indices.py --json-dir output/json --dry-run

# Apply changes
python fix_lecture_indices.py --json-dir output/json

# File: 038_STAT_350_Chapter_7_segments.json
# Before: lecture_index: 1
# After: lecture_index: 38
rebuild_course_context.py
After reprocessing individual lectures, rebuild the cross-lecture concept graph:
python rebuild_course_context.py \
--input-dir segmentation_analysis \
--verify \
--enhanced-summary
# Outputs:
# - course_context.json (concept tracking)
# - course_summary.md (human-readable overview)
11. Deployment
Static Hosting (GitHub Pages)
The viewer is a single HTML file that can be hosted anywhere:
# Create a docs folder for GitHub Pages
mkdir -p docs
cp video_viewer.html docs/index.html

# Commit and push
git add docs/
git commit -m "Deploy video viewer"
git push

# Enable GitHub Pages in repository settings
# Source: Deploy from branch → main → /docs
LMS Integration
For Brightspace, Canvas, or Blackboard:
- Upload video_viewer.html to course files
- Create a content item linking to the uploaded file
- Or embed in an iframe if your LMS allows
The YouTube IFrame API requires the page to be served over HTTP/HTTPS. Opening the HTML file directly (file://) may not load videos. Use a local server for development:
# Python 3
python -m http.server 8000

# Then open http://localhost:8000/video_viewer.html
12. Extensions & Future Work
Potential Enhancements
Chatbot Integration
Connect segmentation data to an AI tutor that can answer questions about specific video segments.
Learning Analytics
Track which segments students watch, rewind, or skip to identify difficult content.
Auto-Clip Generation
Automatically extract "microlecture" clips based on segment boundaries using ffmpeg.
Quiz Generation
Use LLMs to generate comprehension questions for each segment.
EDL Export for Video Editing
The toolkit can export Edit Decision Lists for Adobe Premiere Pro:
def export_segmentation_edl(segmentation, output_path, video_title, fps):
"""Export segmentation as EDL file for Adobe Premiere Pro."""
with open(output_path, 'w') as f:
f.write(f"TITLE: {video_title}_segments\n")
f.write("FCM: NON-DROP FRAME\n\n")
for i, segment in enumerate(segmentation.segments, 1):
start_tc = seconds_to_tc(segment.start_time, fps, 'frames')
end_tc = seconds_to_tc(segment.end_time, fps, 'frames')
f.write(f"{i:03d} AX V C ")
f.write(f"{start_tc} {end_tc} {start_tc} {end_tc}\n")
f.write(f"* COMMENT: TYPE={segment.segment_type} | {segment.title}\n\n")
Contributing
This toolkit is designed to be modular and extensible. Key extension points include:
- New segment types: Add entries to the SEGMENT_TYPES dictionary
- Custom LLM providers: Implement provider adapters in the segmentation engine
- Alternative viewers: Generate React, Vue, or native mobile apps from the JSON
- Assessment integration: Connect to LMS gradebooks via LTI
You now have a complete understanding of the pedagogical video segmentation toolkit. Start with a single lecture, validate the output, then scale to your full course library. The human-in-the-loop annotation ensures quality while AI handles the heavy lifting.