# YouTube Transcript Summary

## Overview

This SOP downloads the transcript of a YouTube video and generates a structured summary of the main topics discussed. It extracts clean text from the video's subtitles and produces a concise, topic-based summary with relevant data points.

## Parameters

- **youtube_url** (required): The YouTube video URL to process
- **language** (optional, default: "en"): Language code for transcript (e.g., "en", "es", "fr")

**Constraints for parameter acquisition:**
- You MUST ask for all required parameters upfront in a single prompt rather than one at a time
- You MUST support multiple input methods including direct input, file path, or URL
- You MUST confirm successful acquisition of all parameters before proceeding

## Steps

### 1. Download and Clean Transcript

Download the transcript using yt-dlp and clean it for processing.

**Constraints:**
- You MUST use yt-dlp to download subtitles, passing the `language` parameter to `--sub-lang` (for example `--sub-lang en`): `yt-dlp --write-subs --write-auto-subs --sub-lang <language> --skip-download "URL"` because it reliably handles YouTube's various subtitle formats and authentication requirements
- You MUST clean VTT files using this Python script, run with the video ID as its only argument, because raw VTT files contain timing codes, duplicate text, and formatting tags that interfere with content analysis:
```python
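import os
import re
import sys

# Find the VTT file whose name contains the video ID (passed as the first argument)
if len(sys.argv) < 2:
    sys.exit(f'Usage: python {sys.argv[0]} VIDEO_ID')
video_id = sys.argv[1]
matches = [f for f in os.listdir('.') if f.endswith('.vtt') and video_id in f]
if not matches:
    sys.exit(f'No .vtt file found for video ID {video_id}')
vtt_file = matches[0]
with open(vtt_file, 'r', encoding='utf-8') as f:
    content = f.read()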
# Clean VTT content
lines = content.split('\n')
text_lines = []
for line in lines:
    line = line.strip()
    if (line and not line.startswith('WEBVTT') and not line.startswith('Kind:')
        and not line.startswith('Language:') and not re.match(r'\d{2}:\d{2}:\d{2}', line)
        and 'align:start position:' not in line and not line.isdigit()):
        line = re.sub(r'<[^>]+>', '', line)  # Remove inline VTT timing tags
        text_lines.append(line)
# Join and clean
text = ' '.join(text_lines)
text = re.sub(r'\s+', ' ', text)
# Collapse phrases of 3-14 words that repeat three times in a row (common in auto-generated VTT)
words = text.split()
cleaned = []
i = 0
while i < len(words):
    found = False
    for n in range(3, 15):
        if i + n*3 <= len(words) and words[i:i+n] == words[i+n:i+n*2] == words[i+n*2:i+n*3]:
            cleaned.extend(words[i:i+n])
            i += n*3
            found = True
            break
    if not found:
        cleaned.append(words[i])
        i += 1
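# Write the cleaned transcript as UTF-8 so non-ASCII characters survive on every platform
with open('subtitle.txt', 'w', encoding='utf-8') as f:
    f.write(' '.join(cleaned))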
```
- You MUST save the cleaned transcript as "subtitle.txt" because subsequent steps expect this exact filename
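
To make the repeated-phrase step easier to check in isolation, the sketch below pulls the same logic out of the script into a standalone function. The name `collapse_triples` and its `min_len`/`max_len` parameters are illustrative assumptions, not part of the SOP; the defaults mirror the script's `range(3, 15)`.

```python
def collapse_triples(text, min_len=3, max_len=14):
    """Collapse any phrase of min_len..max_len words repeated three times in a row."""
    words = text.split()
    cleaned = []
    i = 0
    while i < len(words):
        for n in range(min_len, max_len + 1):
            first = words[i:i + n]
            if (i + 3 * n <= len(words)
                    and first == words[i + n:i + 2 * n] == words[i + 2 * n:i + 3 * n]):
                cleaned.extend(first)  # keep a single copy of the phrase
                i += 3 * n
                break
        else:
            cleaned.append(words[i])  # no triple starts here; keep the word
            i += 1
    return ' '.join(cleaned)


if __name__ == '__main__':
    sample = 'hello and welcome back ' * 3 + 'to the show'
    print(collapse_triples(sample))  # hello and welcome back to the show
```

Phrases shorter than three words, or phrases repeated only twice, are left untouched, so ordinary speech such as "no no no" passes through unchanged.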

### 2. Generate Summary

Process the cleaned transcript to create a scannable summary that helps readers decide if they should watch the video.

**Constraints:**
- You MUST read the entire transcript before writing because the summary requires understanding the full context and arc of the discussion
- You MUST use this exact structure because it provides scannable, decision-oriented information:

```markdown
# [Descriptive Title]

## TL;DR
2-3 sentences max. What is this talk about and why should someone care?

## The Big Idea
One paragraph explaining the core concept or problem being solved. Set the context.

## Key Takeaways
- 3-5 bullets maximum
- Each is a complete thought, not a feature list
- Focus on "so what" not just "what"
- Include specific numbers/results when impactful

## Who Should Watch
- Target audience in 1-2 bullets

## Skip If
- When this talk isn't relevant (1-2 bullets)
```

- You MUST derive a short descriptive filename from the video content with the YouTube video ID appended (e.g., "strands-agents-autonomous-ai-NrbzlvjX0GQ.md") because this makes files easily identifiable and prevents naming conflicts
- You MUST extract the video ID from the YouTube URL (the "v" parameter) because it uniquely identifies the video
- You MUST NOT use generic filenames like "summary.md" because they create confusion when managing multiple summaries
- You MUST NOT list detailed features or create comprehensive documentation because the goal is a quick decision aid, not exhaustive coverage
- You MUST save the summary to the derived filename because this ensures consistent file organization
- You MUST display the summary content using `python -c "print(open('filename.md').read())"` for easy copy/paste access because it provides immediate visibility of the output
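
The filename rules above could be implemented along these lines. This is a minimal sketch: the helper names `extract_video_id` and `summary_filename`, the five-word slug limit, and the handling of `youtu.be` short links are assumptions added for illustration; the SOP itself only specifies the "v" parameter.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_video_id(youtube_url):
    """Return the video ID from a watch URL ("v" parameter) or a youtu.be short link."""
    parsed = urlparse(youtube_url)
    host = (parsed.hostname or '').lower()
    if host == 'youtu.be':
        # Assumed extra case: short links carry the ID as the first path segment
        video_id = parsed.path.strip('/').split('/')[0]
    else:
        video_id = parse_qs(parsed.query).get('v', [''])[0]
    if not video_id:
        raise ValueError(f'No video ID found in {youtube_url!r}')
    return video_id


def summary_filename(title, video_id, max_words=5):
    """Build "<short-slug>-<video_id>.md" from a descriptive title."""
    words = re.findall(r'[a-z0-9]+', title.lower())[:max_words]
    if not words:
        raise ValueError('Title must contain at least one alphanumeric word')
    return f"{'-'.join(words)}-{video_id}.md"


if __name__ == '__main__':
    vid = extract_video_id('https://www.youtube.com/watch?v=NrbzlvjX0GQ')
    print(summary_filename('Strands Agents: Autonomous AI', vid))
```

Raising an error on an empty slug, rather than falling back to a bare ID, keeps the output consistent with the rule against generic filenames.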

## Examples

### Example Input
```
youtube_url: "https://www.youtube.com/watch?v=NrbzlvjX0GQ"
language: "en"
```

### Example Output Files
- `subtitle.txt` (cleaned transcript)
- `strands-agents-autonomous-ai-NrbzlvjX0GQ.md` (summary)

## Troubleshooting

### yt-dlp Not Found
If yt-dlp is not installed, install it with a package manager (for example, `pip install yt-dlp`), or download the binary from https://github.com/yt-dlp/yt-dlp/releases and place it in the current directory or on your `PATH`.

### No Subtitles Available
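If the video has no subtitles in the requested language, yt-dlp writes no `.vtt` file and the cleaning script has nothing to process. Check whether the video has captions enabled on YouTube, or run `yt-dlp --list-subs "URL"` to see which subtitle languages are available.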
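
One way to catch this case early is to check for the subtitle file right after the download step. The sketch below is an assumption-laden illustration, not part of the SOP: `find_vtt` and `list_subs_command` are hypothetical helper names, and the only yt-dlp option used is `--list-subs`.

```python
import os


def list_subs_command(youtube_url):
    """Command that asks yt-dlp to list the subtitle tracks available for a video."""
    return ['yt-dlp', '--list-subs', youtube_url]


def find_vtt(video_id, directory='.'):
    """Return the first .vtt file in `directory` whose name contains video_id, or None."""
    for name in sorted(os.listdir(directory)):
        if name.endswith('.vtt') and video_id in name:
            return os.path.join(directory, name)
    return None


if __name__ == '__main__':
    video_id = 'NrbzlvjX0GQ'
    if find_vtt(video_id) is None:
        # No subtitle file was written; suggest the diagnostic command instead of failing silently
        cmd = list_subs_command(f'https://www.youtube.com/watch?v={video_id}')
        print('No subtitles downloaded. To see available tracks, run:', ' '.join(cmd))
```

The command is returned as a list so it can be passed directly to `subprocess.run` without shell quoting.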
