Social Content Extractor

A CLI tool that turns public social media content into reusable text.

If you've ever tried to turn useful Instagram content into notes or LLM-ready context, you already know the pain. One image is manageable - take a screenshot, run OCR, copy the text. But the process falls apart when the useful content is spread across a 10-slide carousel, or buried inside a reel where you have to pause frame by frame.

The manual workflow usually looks like this: take screenshots one by one, open Google Lens, copy the text, and repeat for every slide or frame. That is exactly the repetitive work this tool removes.

You give it a URL, and it gives you the content back in a usable form.

What It Does

Downloads media from public Instagram posts, reels, and YouTube Shorts
Extracts visible on-screen text via OCR
Captures captions, hashtags, and mentions
Handles carousel slides and short-form video scenes automatically
Saves everything in a clean folder structure
Prepares extracted content for notes, docs, or LLM workflows

Real Example Inputs

These are the kinds of content this tool processes - carousel posts, reels, and Shorts with useful on-screen text.

Give it the URL of any such post, and the tool downloads the media, extracts the visible text, and saves everything into structured files you can use immediately.

Real Example Output

Carousel post

social-content-extractor "https://www.instagram.com/p/DVqbs3Qjifn" --sarvam

╔══════════════════════════════════════════════════════════════════════════════╗
║ Social Content Extractor                                                     ║
║ https://www.instagram.com/p/DVqbs3Qjifn/                                     ║
╚══════════════════════════════════════════════════════════════════════════════╝
                     Post Info
╭───────────────────┬─────────────────────────────╮
│  Platform         │  Instagram                  │
│  Owner            │  @clouddevopsengineer       │
│  Type             │  CAROUSEL                   │
│  Media Count      │  12                         │
╰───────────────────┴─────────────────────────────╯
╭─ Caption ────────────────────────────────────────────────────────────────────╮
│ 100 Real-World Kubernetes Use Cases 🚀                                       │
│                                                                              │
│ Kubernetes isn't just about containers — it powers modern cloud              │
│ infrastructure.                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭─ Slide 1 - OCR Text (94.4%) ─────────────────────────────────────────────────╮
│ 100 Kubernetes Real-Time UseCases                                            │
│ 1. Microservices Deployment                                                  │
│ Example: Deploying a microservices-based e-commerce application using        │
│ multiple Kubernetes Pods, each representing a different service.             │
│ 2. Automated CI/CD Pipelines                                                 │
│ Example: Setting up Jenkins in Kubernetes to automatically build, test, and  │
│ deploy applications using Kubernetes Pods and Jobs.                          │
╰──────────────────────────────────────────────────────────────────────────────╯

... slides 2-11 omitted for brevity ...

╭─ Slide 12 - OCR Text (94.0%) ────────────────────────────────────────────────╮
│ 99. API Gateway Integration                                                  │
│ Example: Integrating an API Gateway like Kong or Ambassador with Kubernetes  │
│ for managing and securing microservices APIs.                                │
│ 100. Performance Optimization                                                │
│ Example: Optimizing application performance using Kubernetes resource        │
│ management, autoscaling, and monitoring tools.                               │
╰──────────────────────────────────────────────────────────────────────────────╯

Downloaded media: 12 file(s)
OCR text saved to:
downloads/instagram/posts/DVqbs3Qjifn/content/DVqbs3Qjifn.sarvam.ocr.txt

The full run extracted 12 carousel slides at 94%+ confidence. Each slide's text is saved individually and combined into a single .ocr.txt file.

Reel

social-content-extractor "https://www.instagram.com/reel/DTTBJSgE6pP" --sarvam

╔══════════════════════════════════════════════════════════════════════════════╗
║ Social Content Extractor                                                     ║
║ https://www.instagram.com/reel/DTTBJSgE6pP/                                  ║
╚══════════════════════════════════════════════════════════════════════════════╝
                     Post Info
╭───────────────────┬─────────────────────────────╮
│  Platform         │  Instagram                  │
│  Owner            │  @amanrahangdale_2108       │
│  Type             │  VIDEO                      │
│  Media Count      │  1                          │
╰───────────────────┴─────────────────────────────╯

╭─ Slide 1 - Video OCR Scenes ─────────────────────────────────────────────────╮
│ 00:02                                                                        │
│ DevOps                                                                       │
│ Roadmap                                                                      │
│ you actually need                                                            │
│                                                                              │
│ 00:04                                                                        │
│ DevOps Foundations                                                           │
│ Linux: files, permissions, processes, services                               │
│ Networking: HTTP/HTTPS, DNS, ports, SSH                                      │
│ Scripting: Bash (mandatory), Python (basic)                                  │
│ Version Control: Git, GitHub/GitLab                                          │
│                                                                              │
│ 00:06                                                                        │
│ Core DevOps Tools                                                            │
│ CI/CD: Jenkins / GitHub Actions                                              │
│ Containers: Docker, Dockerfile, Docker Compose                               │
│ Orchestration: Kubernetes (Pods, Deployments, Services)                      │
│ Cloud: AWS (EC2, S3, IAM, EKS)                                               │
│ Terraform (infra automation)                                                 │
│                                                                              │
│ 00:07                                                                        │
│ Advanced + Career                                                            │
│ Monitoring: Prometheus, Grafana                                              │
│ Logging: ELK Stack                                                           │
│ Security: IAM, Secrets, Trivy, SonarQube                                     │
│ Roles: DevOps Engineer | Cloud Engineer | SRE                                │
╰──────────────────────────────────────────────────────────────────────────────╯

Downloaded media: 1 file(s)
OCR text saved to:
downloads/instagram/reels/DTTBJSgE6pP/content/DTTBJSgE6pP.sarvam.ocr.txt

The reel was sampled into scenes with timestamps. Duplicate frames were skipped, and the text was cleaned up by Sarvam to remove OCR noise.

How It Works

Input URL (Instagram post / reel / YouTube Short)
        │
        ▼
  Parse URL → detect platform + media type
        │
        ▼
  Fetch metadata (Instaloader / yt-dlp)
        │
        ▼
  Download media (images / video)
        │
        ▼
  Run OCR on slides or sampled video frames
        │         ├── Tesseract (local)
        │         └── Sarvam Vision (optional)
        │
        ▼
  Cleanup text (optional Sarvam AI)
        │
        ▼
  Save structured output
        ├── OCR text file
        ├── JSON metadata
        └── Downloaded media

The extraction adapts to the content type. For carousels, each slide is OCRed individually. For reels and Shorts, the video is sampled into scenes, timestamps are preserved, and duplicate frames are skipped.
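
The first step in the flow above, parsing the URL into a platform and media type, could look roughly like the sketch below. It is not the tool's actual parser: `ParsedUrl`, `parse_url`, and the `_ROUTES` table are hypothetical names, and only the three URL shapes shown in this README are handled.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ParsedUrl:
    platform: str    # "instagram" or "youtube"
    media_type: str  # "posts", "reels", or "shorts"
    shortcode: str


# Hypothetical routing table: (host, first path segment) -> (platform, media type).
_ROUTES = {
    ("instagram.com", "p"): ("instagram", "posts"),
    ("instagram.com", "reel"): ("instagram", "reels"),
    ("youtube.com", "shorts"): ("youtube", "shorts"),
}


def parse_url(url: str) -> ParsedUrl:
    """Detect platform and media type from a post, reel, or Short URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop empty segments so a trailing slash does not matter.
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2 or (host, segments[0]) not in _ROUTES:
        raise ValueError(f"Unsupported URL: {url}")
    platform, media_type = _ROUTES[(host, segments[0])]
    return ParsedUrl(platform, media_type, segments[1])
```

With this sketch, `parse_url("https://www.instagram.com/reel/DTTBJSgE6pP/")` yields the platform `instagram`, the media type `reels`, and the shortcode `DTTBJSgE6pP`.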

Quick Start

git clone git@github.com:shaikahmadnawaz/social-content-extractor.git
cd social-content-extractor

python3 -m venv .venv
source .venv/bin/activate

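# Install the package in editable mode
pip install -e .

# Install system dependencies (Homebrew shown; use your platform's package manager otherwise)
brew install tesseract ffmpeg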

Run it:

# Instagram post
social-content-extractor "https://www.instagram.com/p/DVqbs3Qjifn" --sarvam

# Instagram reel
social-content-extractor "https://www.instagram.com/reel/DTTBJSgE6pP" --sarvam

# YouTube Short
social-content-extractor "https://www.youtube.com/shorts/Lay3pQF3I3c" --sarvam

OCR Modes

`--local`

Pure local OCR using Tesseract. No external API calls. Best when you want the simplest path.

`--sarvam`

Local OCR with Tesseract, then text cleanup via Sarvam AI. Fixes OCR junk, normalizes formatting, and preserves meaning. This is the recommended default for all content types.

`--sarvam-vision`

Sarvam Vision does the OCR instead of Tesseract, then Sarvam cleans up the text. Useful for comparing OCR quality on posts and carousels. Still experimental on reels.

Sarvam Pricing

Sarvam 30B (chat cleanup / sentence formatting) - Free
Sarvam Vision (image OCR / vision extraction) - ₹1.5 per page

--sarvam uses local OCR plus free Sarvam chat cleanup, making it the cheaper default. --sarvam-vision invokes Sarvam Vision OCR, so use it only when you specifically want to compare vision-based extraction quality.
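
To get a feel for what `--sarvam-vision` might cost, here is a back-of-the-envelope calculation. It assumes each image slide is billed as one page, which this README does not confirm; `vision_cost_inr` is a hypothetical helper, not part of the tool.

```python
from decimal import Decimal

# Rate taken from the pricing list above.
VISION_RATE_INR = Decimal("1.5")


def vision_cost_inr(pages: int) -> Decimal:
    """Estimated Sarvam Vision cost, assuming one billed page per image."""
    return VISION_RATE_INR * pages


# The 12-slide carousel from the example output:
print(vision_cost_inr(12))  # 18.0
```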

Content Source Handling

Short-form social content doesn't always carry its message as text on the image. Sometimes the real message is in the caption, sometimes it's in the slides, and sometimes both matter.

The extractor doesn't merge caption and OCR blindly. It keeps them separate and adds a lightweight decision layer:

caption_only - Caption is substantial, OCR is weak
ocr_only - OCR is substantial, caption is promotional
caption_plus_ocr - Both are meaningful
media_representational - Media is decorative, caption carries the message

This keeps raw data intact while giving you a practical default for downstream use.
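
One way such a decision layer could work is sketched below. The word-count threshold, the "empty OCR means decorative media" rule, and the fallback are all assumptions made for illustration; the tool's real heuristics (for example, how it detects a promotional caption) are not documented here.

```python
def is_substantial(text: str, min_words: int = 15) -> bool:
    """Crude proxy for 'substantial': enough whitespace-separated words."""
    return len(text.split()) >= min_words


def choose_strategy(caption: str, ocr_text: str, min_words: int = 15) -> str:
    """Pick one of the four strategy names listed above (hypothetical rules)."""
    caption_ok = is_substantial(caption, min_words)
    ocr_ok = is_substantial(ocr_text, min_words)
    if caption_ok and ocr_ok:
        return "caption_plus_ocr"
    if ocr_ok:
        return "ocr_only"
    if caption_ok:
        # Assumption: no OCR text at all suggests purely decorative media.
        return "caption_only" if ocr_text.strip() else "media_representational"
    # Neither source is strong: keep both rather than discard anything.
    return "caption_plus_ocr"
```

Because the caption and OCR text are stored separately, a rule like this only selects a default; the raw fields stay available if the choice turns out wrong.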

Output Layout

The tool uses a platform-first directory structure:

downloads/
  instagram/
    posts/
      DVqbs3Qjifn/
        media/
          DVqbs3Qjifn_1.jpg
          DVqbs3Qjifn_2.jpg
        content/
          DVqbs3Qjifn.sarvam.ocr.txt
          DVqbs3Qjifn.sarvam.json
    reels/
      DTTBJSgE6pP/
        media/
          DTTBJSgE6pP_1.mp4
        content/
          DTTBJSgE6pP.sarvam.ocr.txt
  youtube/
    shorts/
      Lay3pQF3I3c/
        media/
          Lay3pQF3I3c_1.mp4
        content/
          Lay3pQF3I3c.sarvam.ocr.txt

Media and content are separated. Artifacts are named by OCR mode so you can compare results from different modes side by side.
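
The layout above can be expressed as two small path builders. This is a sketch with hypothetical function names; only the `sarvam` mode name appears in the example tree, so the file names produced for other modes are an assumption.

```python
from pathlib import Path


def content_path(root: str, platform: str, media_type: str,
                 shortcode: str, mode: str, ext: str) -> Path:
    """Path of a content artifact, e.g. <shortcode>.<mode>.ocr.txt."""
    return (Path(root) / platform / media_type / shortcode
            / "content" / f"{shortcode}.{mode}.{ext}")


def media_path(root: str, platform: str, media_type: str,
               shortcode: str, index: int, ext: str) -> Path:
    """Path of a downloaded media file, e.g. <shortcode>_<index>.jpg."""
    return (Path(root) / platform / media_type / shortcode
            / "media" / f"{shortcode}_{index}.{ext}")


print(content_path("downloads", "instagram", "posts",
                   "DVqbs3Qjifn", "sarvam", "ocr.txt").as_posix())
```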

Caching

Downloaded media isn't fetched again if a valid local copy exists. Images are verified by opening them. Videos are validated for duration and stream presence using ffprobe when available. Broken cached files are deleted and re-downloaded automatically.
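
A video check along these lines could be built on ffprobe's JSON output. The sketch below is an illustration, not the tool's code: the function names are hypothetical, and the fallback used when ffprobe is missing (accept any non-empty file) is an assumption.

```python
import json
import shutil
import subprocess
from pathlib import Path


def probe_looks_valid(probe: dict) -> bool:
    """True if ffprobe JSON reports a positive duration and a video stream."""
    try:
        duration = float(probe.get("format", {}).get("duration", 0))
    except (TypeError, ValueError):
        return False
    has_video = any(s.get("codec_type") == "video"
                    for s in probe.get("streams", []))
    return duration > 0 and has_video


def cached_video_is_valid(path: str) -> bool:
    """Validate a cached video, using ffprobe when it is installed."""
    file = Path(path)
    if not file.is_file() or file.stat().st_size == 0:
        return False
    if shutil.which("ffprobe") is None:
        return True  # assumption: without ffprobe, trust a non-empty file
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries",
         "format=duration:stream=codec_type", "-of", "json", str(file)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False
    try:
        return probe_looks_valid(json.loads(result.stdout))
    except json.JSONDecodeError:
        return False
```

A caller would delete the file and download it again whenever `cached_video_is_valid` returns `False`.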

Sarvam AI Cleanup

The cleanup is designed to preserve meaning and order, not rewrite content. It removes obvious OCR junk, normalizes formatting, and avoids inventing missing text.

Safety behavior:

Reasoning-style responses from the model are ignored
Markdown fences are stripped
Generic provider artifacts like data:image/... payloads are filtered
Image-description prose like "The image is..." is filtered

The cleanup uses no content-specific hardcoded hacks or platform-specific text replacements.
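
Three of the filters listed above (fences, `data:image/...` payloads, image-description prose) can be approximated with a few generic patterns. This is a sketch under assumptions: the regular expressions and the `sanitize_cleanup_output` name are illustrative, and handling of reasoning-style responses is left out because its format is not described here.

```python
import re

# A Markdown fence line: three backticks, optionally followed by a language tag.
_FENCE = re.compile(r"^\s*`{3}.*$")
# An inline data URI such as data:image/png;base64,....
_DATA_URI = re.compile(r"data:image/\S+")
# Prose that describes the image instead of transcribing it (assumed phrasing).
_DESCRIPTION = re.compile(
    r"^\s*(the|this) image (is|shows|contains|depicts)\b", re.IGNORECASE
)


def sanitize_cleanup_output(text: str) -> str:
    """Drop fences, data URIs, and image-description lines; keep the rest in order."""
    kept = []
    for line in text.splitlines():
        if _FENCE.match(line):
            continue  # remove the fence line itself, keep what it wrapped
        stripped = _DATA_URI.sub("", line)
        if stripped != line and not stripped.strip():
            continue  # the line held only a data URI
        if _DESCRIPTION.match(stripped):
            continue
        kept.append(stripped.rstrip())
    return "\n".join(kept).strip()
```

The patterns target generic provider artifacts only, in keeping with the rule against content-specific replacements.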

OCR Behavior

Posts and carousels

Image slides are OCRed one by one
Results are stored per slide with confidence scores
Combined text is saved into a single .ocr.txt file

Reels and Shorts

Video is sampled into scenes at key intervals
Timestamps are formatted as MM:SS and listed in playback order (e.g. 00:02, 00:04)
Repeated scenes are deduplicated so the same text isn't saved twice
If local frame OCR fails, thumbnail OCR is used as a fallback
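
The timestamp formatting and scene deduplication described above might be sketched as follows. Matching scenes on case- and whitespace-normalized text is an assumption; the tool may compare frames or text differently, and the function names are hypothetical.

```python
from __future__ import annotations


def format_timestamp(seconds: float) -> str:
    """Format a playback offset as MM:SS, e.g. 2.4 -> '00:02'."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical OCR text matches."""
    return " ".join(text.lower().split())


def dedupe_scenes(scenes: list[tuple[float, str]]) -> list[tuple[str, str]]:
    """Return (timestamp, text) pairs in playback order, skipping repeats."""
    seen: set[str] = set()
    result: list[tuple[str, str]] = []
    for seconds, text in sorted(scenes, key=lambda scene: scene[0]):
        key = _normalize(text)
        if not key or key in seen:
            continue  # empty OCR result or text already captured earlier
        seen.add(key)
        result.append((format_timestamp(seconds), text))
    return result
```

Each scene keeps the timestamp of its first appearance, which matches the way the reel example lists one entry per distinct block of text.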

JSON Structure

When you pass --json, the saved file includes:

shortcode, url, post_type, owner
caption, accessibility_caption, hashtags, mentions
date, date_local, likes, comments_count
media_count, media, slides, downloaded_files
ocr_text, ocr_combined_text, ocr_provider
ocr_cleanup_model (when Sarvam is used)
content_strategy, primary_source, primary_text

{
  "content_strategy": "caption_plus_ocr",
  "primary_source": "ocr",
  "primary_text": "The most useful extracted content goes here"
}

The content_strategy field tells you how the extractor decided to handle the content, and primary_text gives you the most useful text to use downstream.
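
Downstream code could read the saved JSON along these lines. The field names come from the list above; the fallback to `caption` plus `ocr_combined_text` when `primary_text` is empty is an assumption for illustration, as is the guess that these fields hold plain strings.

```python
import json


def text_for_downstream(payload: dict) -> str:
    """Prefer primary_text; otherwise join caption and combined OCR text."""
    primary = (payload.get("primary_text") or "").strip()
    if primary:
        return primary
    parts = [payload.get("caption") or "", payload.get("ocr_combined_text") or ""]
    return "\n\n".join(part.strip() for part in parts if part.strip())


sample = json.loads("""
{
  "content_strategy": "caption_plus_ocr",
  "primary_source": "ocr",
  "primary_text": "The most useful extracted content goes here"
}
""")
print(text_for_downstream(sample))
```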

Project Structure

src/social_content_extractor/
  __init__.py
  __main__.py
  cli.py
  extractor/
    __init__.py
    constants.py
    core.py
    sources.py
    text.py

Tech Stack

Language - Python
Instagram fetching - Instaloader
YouTube fetching - yt-dlp
Local OCR - Tesseract
Video processing - FFmpeg
AI cleanup - Sarvam AI (chat completions)
AI vision OCR - Sarvam Vision (optional)
CLI - argparse
Rich terminal output - Rich library

CLI Flags

--local - Pure Tesseract OCR
--sarvam - Tesseract OCR + Sarvam cleanup (recommended)
--sarvam-vision - Sarvam Vision OCR + Sarvam cleanup
--sarvam-model - Choose sarvam-30b or sarvam-105b
--json - Save JSON output
--ocr-lang - Tesseract language code
--ocr-psm - Tesseract page segmentation mode
--ocr-min-confidence - Minimum word confidence threshold
--show-accessibility - Show Instagram accessibility caption
--no-download - Skip media download
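
To illustrate what a word-level threshold like `--ocr-min-confidence` does, the sketch below filters a dictionary shaped like the per-word output that Tesseract wrappers such as pytesseract return (parallel `text` and `conf` lists, with `-1` marking non-word rows). Whether the tool uses pytesseract, and whether the per-slide percentage is a mean of word confidences, are assumptions.

```python
def filter_words(data: dict, min_confidence: float) -> tuple:
    """Keep words at or above the threshold; return (text, mean confidence)."""
    words = []
    confidences = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # confidences may arrive as strings, ints, or floats
        if text.strip() and conf >= min_confidence:
            words.append(text.strip())
            confidences.append(conf)
    mean = sum(confidences) / len(confidences) if confidences else 0.0
    return " ".join(words), mean


# Hypothetical OCR rows: one layout row, two clean words, one noisy token.
rows = {"text": ["", "Kubernetes", "UseCases", "x#"],
        "conf": ["-1", "96.0", 92, "31.5"]}
print(filter_words(rows, 60))  # ('Kubernetes UseCases', 94.0)
```

Raising the threshold trades recall for cleaner text, which matters most on stylized or low-contrast slides.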

Current Limitations

What works well:

Posts and carousels with local OCR
Reels with local OCR + Sarvam cleanup
YouTube Shorts extraction
Cached media reuse
Clean output folder structure

What's still imperfect:

Sarvam Vision on reels can still be noisy
Stylized or low-contrast visuals are challenging for any OCR provider
Anonymous Instagram requests may occasionally hit 403 or rate limits

What This Is NOT

This isn't a content scraper for mass downloading. It's a personal productivity tool - designed to extract text from content you've already found useful and want to reference later. It's built for the workflow of "I saw something useful, I want to save the text" - not "download everything from this account."

Built with Sarvam AI · GitHub