Social Content Extractor

March 22, 2026

Part 2 of 2

A CLI tool that turns public social media content into reusable text.

If you've ever tried to turn useful Instagram content into notes or LLM-ready context, you already know the pain. One image is manageable: take a screenshot, run OCR, copy the text. But the process falls apart when the useful content is spread across a 10-slide carousel or buried inside a reel that you have to pause frame by frame.

The manual workflow usually looks like this: take screenshots one by one, open Google Lens, copy the text manually, and repeat for every slide or frame. That is exactly the repetitive work this tool removes.

You give it a URL, and it gives you the content back in a usable form.


What It Does

  • Downloads media from public Instagram posts, reels, and YouTube Shorts
  • Extracts visible on-screen text via OCR
  • Captures captions, hashtags, and mentions
  • Handles carousel slides and short-form video scenes automatically
  • Saves everything in a clean folder structure
  • Prepares extracted content for notes, docs, or LLM workflows

Real Example Inputs

These are the kinds of content this tool processes - carousel posts, reels, and Shorts with useful on-screen text.

  • Instagram carousel post
  • Instagram reel
  • YouTube Short

Give it any of these URLs, and the tool downloads the media, extracts all visible text, and saves everything into structured files you can use immediately.


Real Example Output

Carousel post

social-content-extractor "https://www.instagram.com/p/DVqbs3Qjifn" --sarvam
╔══════════════════════════════════════════════════════════════════════════════╗
║ Social Content Extractor                                                     ║
║ https://www.instagram.com/p/DVqbs3Qjifn/                                     ║
╚══════════════════════════════════════════════════════════════════════════════╝
                     Post Info
╭───────────────────┬─────────────────────────────╮
│  Platform         │  Instagram                  │
│  Owner            │  @clouddevopsengineer       │
│  Type             │  CAROUSEL                   │
│  Media Count      │  12                         │
╰───────────────────┴─────────────────────────────╯
╭─ Caption ────────────────────────────────────────────────────────────────────╮
│ 100 Real-World Kubernetes Use Cases 🚀                                       │
│                                                                              │
│ Kubernetes isn't just about containers — it powers modern cloud              │
│ infrastructure.                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

╭─ Slide 1 - OCR Text (94.4%) ─────────────────────────────────────────────────╮
│ 100 Kubernetes Real-Time UseCases                                            │
│ 1. Microservices Deployment                                                  │
│ Example: Deploying a microservices-based e-commerce application using        │
│ multiple Kubernetes Pods, each representing a different service.             │
│ 2. Automated CI/CD Pipelines                                                 │
│ Example: Setting up Jenkins in Kubernetes to automatically build, test, and  │
│ deploy applications using Kubernetes Pods and Jobs.                          │
╰──────────────────────────────────────────────────────────────────────────────╯

... slides 2-11 omitted for brevity ...

╭─ Slide 12 - OCR Text (94.0%) ────────────────────────────────────────────────╮
│ 99. API Gateway Integration                                                  │
│ Example: Integrating an API Gateway like Kong or Ambassador with Kubernetes  │
│ for managing and securing microservices APIs.                                │
│ 100. Performance Optimization                                                │
│ Example: Optimizing application performance using Kubernetes resource        │
│ management, autoscaling, and monitoring tools.                               │
╰──────────────────────────────────────────────────────────────────────────────╯

Downloaded media: 12 file(s)
OCR text saved to:
downloads/instagram/posts/DVqbs3Qjifn/content/DVqbs3Qjifn.sarvam.ocr.txt

The full run extracted 12 carousel slides at 94%+ confidence. Each slide's text is saved individually and combined into a single .ocr.txt file.

Reel

social-content-extractor "https://www.instagram.com/reel/DTTBJSgE6pP" --sarvam
╔══════════════════════════════════════════════════════════════════════════════╗
║ Social Content Extractor                                                     ║
║ https://www.instagram.com/reel/DTTBJSgE6pP/                                  ║
╚══════════════════════════════════════════════════════════════════════════════╝
                     Post Info
╭───────────────────┬─────────────────────────────╮
│  Platform         │  Instagram                  │
│  Owner            │  @amanrahangdale_2108       │
│  Type             │  VIDEO                      │
│  Media Count      │  1                          │
╰───────────────────┴─────────────────────────────╯

╭─ Slide 1 - Video OCR Scenes ─────────────────────────────────────────────────╮
│ 00:02                                                                        │
│ DevOps                                                                       │
│ Roadmap                                                                      │
│ you actually need                                                            │
│                                                                              │
│ 00:04                                                                        │
│ DevOps Foundations                                                           │
│ Linux: files, permissions, processes, services                               │
│ Networking: HTTP/HTTPS, DNS, ports, SSH                                      │
│ Scripting: Bash (mandatory), Python (basic)                                  │
│ Version Control: Git, GitHub/GitLab                                          │
│                                                                              │
│ 00:06                                                                        │
│ Core DevOps Tools                                                            │
│ CI/CD: Jenkins / GitHub Actions                                              │
│ Containers: Docker, Dockerfile, Docker Compose                               │
│ Orchestration: Kubernetes (Pods, Deployments, Services)                      │
│ Cloud: AWS (EC2, S3, IAM, EKS)                                               │
│ Terraform (infra automation)                                                 │
│                                                                              │
│ 00:07                                                                        │
│ Advanced + Career                                                            │
│ Monitoring: Prometheus, Grafana                                              │
│ Logging: ELK Stack                                                           │
│ Security: IAM, Secrets, Trivy, SonarQube                                     │
│ Roles: DevOps Engineer | Cloud Engineer | SRE                                │
╰──────────────────────────────────────────────────────────────────────────────╯

Downloaded media: 1 file(s)
OCR text saved to:
downloads/instagram/reels/DTTBJSgE6pP/content/DTTBJSgE6pP.sarvam.ocr.txt

The reel was sampled into scenes with timestamps. Duplicate frames were skipped, and the text was cleaned up by Sarvam to remove OCR noise.


How It Works

Input URL (Instagram post / reel / YouTube Short)
        │
        ▼
  Parse URL → detect platform + media type
        │
        ▼
  Fetch metadata (Instaloader / yt-dlp)
        │
        ▼
  Download media (images / video)
        │
        ▼
  Run OCR on slides or sampled video frames
        ├── Tesseract (local)
        └── Sarvam Vision (optional)
        │
        ▼
  Cleanup text (optional Sarvam AI)
        │
        ▼
  Save structured output
        ├── OCR text file
        ├── JSON metadata
        └── Downloaded media

The extraction adapts to the content type. For carousels, each slide is OCRed individually. For reels and Shorts, the video is sampled into scenes, timestamps are preserved, and duplicate frames are skipped.
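To make the first step of that pipeline concrete, here is a minimal sketch of how a URL could be mapped to a platform, a media type, and a content ID. It is not the tool's actual code: the function name `parse_social_url`, the `INSTAGRAM_KINDS` table, and the returned dictionary keys are all assumptions made for illustration. It only relies on the URL shapes shown in this post.

```python
from urllib.parse import urlparse

# Hypothetical mapping from the first path segment to a media kind.
INSTAGRAM_KINDS = {"p": "post", "reel": "reel"}


def parse_social_url(url: str) -> dict:
    """Return platform, media kind, and content ID for a supported URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower().removeprefix("www.")
    # Drop empty segments so a trailing slash does not matter.
    parts = [segment for segment in parsed.path.split("/") if segment]

    if host == "instagram.com" and len(parts) >= 2 and parts[0] in INSTAGRAM_KINDS:
        return {"platform": "instagram", "kind": INSTAGRAM_KINDS[parts[0]], "id": parts[1]}
    if host == "youtube.com" and len(parts) >= 2 and parts[0] == "shorts":
        return {"platform": "youtube", "kind": "short", "id": parts[1]}
    raise ValueError(f"Unsupported URL: {url}")
```

Calling `parse_social_url("https://www.instagram.com/reel/DTTBJSgE6pP/")` would give the platform `instagram`, the kind `reel`, and the ID `DTTBJSgE6pP`, which is enough to decide between per-slide OCR and scene sampling.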


Quick Start

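# Clone the repository (SSH URL; requires a GitHub SSH key)
git clone git@github.com:shaikahmadnawaz/social-content-extractor.git
cd social-content-extractor

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the package in editable mode, then the system dependencies
# (Homebrew shown here; use your platform's package manager elsewhere)
pip install -e .
brew install tesseract ffmpeg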

Run it:

# Instagram post
social-content-extractor "https://www.instagram.com/p/DVqbs3Qjifn" --sarvam

# Instagram reel
social-content-extractor "https://www.instagram.com/reel/DTTBJSgE6pP" --sarvam

# YouTube Short
social-content-extractor "https://www.youtube.com/shorts/Lay3pQF3I3c" --sarvam

OCR Modes

--local

Pure local OCR using Tesseract. No external API calls. Best when you want the simplest path.

--sarvam

Local OCR with Tesseract, then text cleanup via Sarvam AI. Fixes OCR junk, normalizes formatting, and preserves meaning. This is the recommended default for all content types.

--sarvam-vision

Sarvam Vision does the OCR instead of Tesseract, then Sarvam cleans up the text. Useful for comparing OCR quality on posts and carousels. Still experimental on reels.

Sarvam Pricing

  • Sarvam 30B (chat cleanup / sentence formatting) - Free
  • Sarvam Vision (image OCR / vision extraction) - ₹1.5 per page

--sarvam uses local OCR plus free Sarvam chat cleanup, making it the cheaper default. --sarvam-vision invokes Sarvam Vision OCR, so use it only when you specifically want to compare vision-based extraction quality.


Content Source Handling

Short-form social content isn't always text-on-image first. Sometimes the real message is in the caption, sometimes it's inside the slides, and sometimes both matter.

The extractor doesn't merge caption and OCR blindly. It keeps them separate and adds a lightweight decision layer:

  • caption_only - Caption is substantial, OCR is weak
  • ocr_only - OCR is substantial, caption is promotional
  • caption_plus_ocr - Both are meaningful
  • media_representational - Media is decorative, caption carries the message

This keeps raw data intact while giving you a practical default for downstream use.
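A decision layer like this can be sketched in a few lines. The version below is an illustration only, not the extractor's real logic. The word-count threshold `min_words`, the function name `choose_strategy`, and the rule that "no OCR text at all" means the media is representational are all assumptions. The four strategy names and the `primary_source` / `primary_text` fields come from this post.

```python
def choose_strategy(caption: str, ocr_text: str, min_words: int = 15) -> dict:
    """Pick a content strategy from caption and OCR text (heuristic sketch)."""
    caption = caption.strip()
    ocr_text = ocr_text.strip()
    # Assumption: "substantial" simply means at least min_words words.
    caption_ok = len(caption.split()) >= min_words
    ocr_ok = len(ocr_text.split()) >= min_words

    if caption_ok and ocr_ok:
        strategy, source = "caption_plus_ocr", "ocr"
    elif ocr_ok:
        strategy, source = "ocr_only", "ocr"
    elif caption_ok and ocr_text:
        # Some on-screen text exists, but it is too weak to use.
        strategy, source = "caption_only", "caption"
    else:
        # Assumption: no usable OCR text means the media is decorative.
        strategy, source = "media_representational", "caption"

    return {
        "content_strategy": strategy,
        "primary_source": source,
        "primary_text": ocr_text if source == "ocr" else caption,
    }
```

Because the caption and OCR text are passed in separately and returned untouched, the raw data stays intact and only the "primary" choice is a judgment call.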


Output Layout

The tool uses a platform-first directory structure:

downloads/
  instagram/
    posts/
      DVqbs3Qjifn/
        media/
          DVqbs3Qjifn_1.jpg
          DVqbs3Qjifn_2.jpg
        content/
          DVqbs3Qjifn.sarvam.ocr.txt
          DVqbs3Qjifn.sarvam.json
    reels/
      DTTBJSgE6pP/
        media/
          DTTBJSgE6pP_1.mp4
        content/
          DTTBJSgE6pP.sarvam.ocr.txt
  youtube/
    shorts/
      Lay3pQF3I3c/
        media/
          Lay3pQF3I3c_1.mp4
        content/
          Lay3pQF3I3c.sarvam.ocr.txt

Media and content are separated. Artifacts are named by OCR mode so you can compare results from different modes side by side.
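The layout above is regular enough to compute. Here is a small sketch of how the paths could be derived from the platform, content type, ID, and OCR mode. The helper name `artifact_paths` and the `KIND_DIRS` table are hypothetical, and only the `sarvam` mode suffix is confirmed by the examples in this post.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a media kind to its directory name.
KIND_DIRS = {"post": "posts", "reel": "reels", "short": "shorts"}


def artifact_paths(platform: str, kind: str, content_id: str, mode: str,
                   root: str = "downloads") -> dict:
    """Build the media directory and mode-specific artifact paths."""
    base = PurePosixPath(root) / platform / KIND_DIRS[kind] / content_id
    content = base / "content"
    return {
        "media_dir": base / "media",
        # The OCR mode is part of the file name, so runs do not overwrite each other.
        "ocr_text": content / f"{content_id}.{mode}.ocr.txt",
        "json": content / f"{content_id}.{mode}.json",
    }
```

For the carousel example, `artifact_paths("instagram", "post", "DVqbs3Qjifn", "sarvam")["ocr_text"]` resolves to the same `downloads/instagram/posts/DVqbs3Qjifn/content/DVqbs3Qjifn.sarvam.ocr.txt` path printed at the end of the run.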


Caching

Downloaded media isn't fetched again if a valid local copy exists. Images are verified by opening them. Videos are validated for duration and stream presence using ffprobe when available. Broken cached files are deleted and re-downloaded automatically.
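The video check can be sketched with ffprobe's JSON output. This is a rough illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, and treating a non-empty file as valid when ffprobe is missing is a guess at what "when available" means.

```python
import json
import shutil
import subprocess
from pathlib import Path


def probe_is_valid(probe: dict) -> bool:
    """Accept ffprobe JSON only if it has a positive duration and a video stream."""
    try:
        duration = float(probe.get("format", {}).get("duration", 0))
    except (TypeError, ValueError):
        return False
    has_video = any(s.get("codec_type") == "video" for s in probe.get("streams", []))
    return duration > 0 and has_video


def cached_video_is_valid(path: Path) -> bool:
    """Return True if a cached video looks usable; callers re-download otherwise."""
    if not path.is_file() or path.stat().st_size == 0:
        return False
    if shutil.which("ffprobe") is None:
        return True  # Assumption: without ffprobe, any non-empty file is accepted.
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries",
         "format=duration:stream=codec_type", "-of", "json", str(path)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False
    try:
        return probe_is_valid(json.loads(result.stdout))
    except json.JSONDecodeError:
        return False
```

Keeping the JSON check in its own function makes the rule easy to test without a real video file.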


Sarvam AI Cleanup

The cleanup is designed to preserve meaning and order, not rewrite content. It removes obvious OCR junk, normalizes formatting, and avoids inventing missing text.

Safety behavior:

  • Reasoning-style responses from the model are ignored
  • Markdown fences are stripped
  • Generic provider artifacts like data:image/... payloads are filtered
  • Image-description prose like "The image is..." is filtered

The cleanup uses no content-specific hardcoded hacks or platform-specific text replacements.
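Three of the safety filters listed above (fence stripping, data:image payloads, and image-description prose) are simple enough to sketch with regular expressions. The code below is an assumption about how such filtering could work, not the project's actual cleanup code. The patterns, the prefix list, and the function name `sanitize_cleanup_response` are illustrative, and reasoning-style responses are not handled here.

```python
import re

# A line that contains only a markdown fence marker, with an optional language tag.
FENCE_LINE = re.compile(r"^\s*`{3}[\w-]*\s*$")
# Inline base64 image payloads that some providers echo back.
DATA_URI = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+")
# Hypothetical list of openers that signal image-description prose.
DESCRIPTION_PREFIXES = ("the image is", "the image shows", "this image")


def sanitize_cleanup_response(text: str) -> str:
    """Remove fences, data URIs, and image-description lines from model output."""
    kept = []
    for line in text.splitlines():
        if FENCE_LINE.match(line):
            continue  # Drop the fence marker but keep the fenced content.
        cleaned = DATA_URI.sub("", line).rstrip()
        if line.strip() and not cleaned.strip():
            continue  # The line held nothing but a payload.
        if cleaned.strip().lower().startswith(DESCRIPTION_PREFIXES):
            continue
        kept.append(cleaned)
    return "\n".join(kept).strip()
```

Every rule here is generic: nothing depends on a specific post, account, or platform.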


OCR Behavior

Posts and carousels

  • Image slides are OCRed one by one
  • Results are stored per slide with confidence scores
  • Combined text is saved into a single .ocr.txt file

Reels and Shorts

  • Video is sampled into scenes at key intervals
  • Timestamps are formatted in playback order (e.g. 00:02, 00:04)
  • Repeated scenes are deduplicated to avoid duplicate text
  • If local frame OCR fails, thumbnail OCR is used as a fallback
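The timestamp formatting and scene deduplication for reels and Shorts can be sketched as follows. This is a simplified illustration: it assumes scenes arrive as `(seconds, text)` pairs and treats two scenes as duplicates when their text matches after lowercasing and whitespace normalization. The real tool may compare frames or text differently.

```python
def format_timestamp(seconds: float) -> str:
    """Format a playback offset as MM:SS, e.g. 2.4 -> '00:02'."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def dedupe_scenes(scenes: list[tuple[float, str]]) -> list[tuple[str, str]]:
    """Sort scenes by time and drop those whose text was already seen."""
    seen = set()
    result = []
    for seconds, text in sorted(scenes, key=lambda scene: scene[0]):
        # Normalize case and whitespace so trivial OCR differences still match.
        key = " ".join(text.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        result.append((format_timestamp(seconds), text))
    return result
```

Sorting before deduplicating keeps the earliest occurrence of each scene, so the output stays in playback order.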

JSON Structure

When you pass --json, the saved file includes:

  • shortcode, url, post_type, owner
  • caption, accessibility_caption, hashtags, mentions
  • date, date_local, likes, comments_count
  • media_count, media, slides, downloaded_files
  • ocr_text, ocr_combined_text, ocr_provider
  • ocr_cleanup_model (when Sarvam is used)
  • content_strategy, primary_source, primary_text

For example, the strategy fields look like this:

{
  "content_strategy": "caption_plus_ocr",
  "primary_source": "ocr",
  "primary_text": "The most useful extracted content goes here"
}

The content_strategy field tells you how the extractor decided to handle the content, and primary_text gives you the most useful text to use downstream.
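On the consuming side, a downstream script might read the saved JSON and pull out the text to feed into notes or an LLM prompt. The sketch below assumes the field names listed above. The fallback order (`primary_text`, then `ocr_combined_text`, then `caption`) and the function name are my own assumptions.

```python
import json


def primary_text_from_json(raw: str) -> str:
    """Return the most useful text from a saved extractor JSON document."""
    data = json.loads(raw)
    # Prefer the extractor's own choice, then fall back to the raw fields.
    for field in ("primary_text", "ocr_combined_text", "caption"):
        value = data.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return ""
```

In practice you would pass in the contents of a file such as `DVqbs3Qjifn.sarvam.json`, read with `Path(...).read_text(encoding="utf-8")`.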


Project Structure

src/social_content_extractor/
  __init__.py
  __main__.py
  cli.py
  extractor/
    __init__.py
    constants.py
    core.py
    sources.py
    text.py

Tech Stack

  • Language - Python
  • Instagram fetching - Instaloader
  • YouTube fetching - yt-dlp
  • Local OCR - Tesseract
  • Video processing - FFmpeg
  • AI cleanup - Sarvam AI (chat completions)
  • AI vision OCR - Sarvam Vision (optional)
  • CLI - argparse
  • Rich terminal output - Rich library

CLI Flags

  • --local - Pure Tesseract OCR
  • --sarvam - Tesseract OCR + Sarvam cleanup (recommended)
  • --sarvam-vision - Sarvam Vision OCR + Sarvam cleanup
  • --sarvam-model - Choose sarvam-30b or sarvam-105b
  • --json - Save JSON output
  • --ocr-lang - Tesseract language code
  • --ocr-psm - Tesseract page segmentation mode
  • --ocr-min-confidence - Minimum word confidence threshold
  • --show-accessibility - Show Instagram accessibility caption
  • --no-download - Skip media download
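Since the CLI is built on argparse, the flags above could be declared roughly like this. Treat it as a sketch rather than the real `cli.py`. The default values, the `mode` destination, and the choice to make the three OCR modes mutually exclusive are all assumptions; only the flag names and the two model names come from this post.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (defaults here are illustrative guesses)."""
    parser = argparse.ArgumentParser(prog="social-content-extractor")
    parser.add_argument("url", help="Public Instagram or YouTube Shorts URL")

    # Assumption: only one OCR mode can be chosen per run.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--local", dest="mode", action="store_const", const="local")
    mode.add_argument("--sarvam", dest="mode", action="store_const", const="sarvam")
    mode.add_argument("--sarvam-vision", dest="mode", action="store_const",
                      const="sarvam-vision")
    parser.set_defaults(mode="local")  # Assumed default when no mode flag is given.

    parser.add_argument("--sarvam-model", choices=["sarvam-30b", "sarvam-105b"],
                        default="sarvam-30b")
    parser.add_argument("--json", action="store_true", help="Save JSON output")
    parser.add_argument("--ocr-lang", default="eng", help="Tesseract language code")
    parser.add_argument("--ocr-psm", type=int, default=6,
                        help="Tesseract page segmentation mode")
    parser.add_argument("--ocr-min-confidence", type=float, default=0.0,
                        help="Minimum word confidence threshold")
    parser.add_argument("--show-accessibility", action="store_true")
    parser.add_argument("--no-download", action="store_true")
    return parser
```

Parsing `["https://www.instagram.com/p/DVqbs3Qjifn", "--sarvam", "--json"]` with this parser yields `mode == "sarvam"` and `json == True`, with everything else left at its default.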

Current Limitations

What works well:

  • Posts and carousels with local OCR
  • Reels with local OCR + Sarvam cleanup
  • YouTube Shorts extraction
  • Cached media reuse
  • Clean output folder structure

What's still imperfect:

  • Sarvam Vision on reels can still be noisy
  • Stylized or low-contrast visuals are challenging for any OCR provider
  • Anonymous Instagram requests may occasionally hit 403 or rate limits

What This Is NOT

This isn't a content scraper for mass downloading. It's a personal productivity tool - designed to extract text from content you've already found useful and want to reference later. It's built for the workflow of "I saw something useful, I want to save the text" - not "download everything from this account."


Built with Sarvam AI · GitHub