Voxa

Status: Active

A hands-free mobile app for discussing YouTube videos and documents. Learn and discuss on the go while running, riding, or multitasking.

Feb 2024 · $5,000 MRR

Tech Stack

React Native · YouTube API · Whisper · GPT-4 · Redis · Firebase · Expo

Overview

I watch a lot of YouTube while running, but I always wanted to ask questions or discuss what I'm learning. Pausing, typing on a sweaty phone screen, waiting for a response - it kills the flow. Voxa lets you have actual conversations about any YouTube video or PDF while you're moving. It transcribes what you say, understands the context of what you're watching/reading, and responds naturally. Like having a study buddy who's actually listened to the same lecture.

The Problem

Most learning is passive. You watch a programming tutorial or listen to a lecture, but you can't really engage with it unless you stop everything and type. I tried voice memos to myself, but then I'd have dozens of recordings asking questions I'd forget about. Tried podcast apps with playback controls, but they can't actually discuss the content. The gym, commutes, dog walks - all this time where you could be actively learning instead of just consuming.

The Solution

Talk to Voxa while watching or listening to any YouTube video. Ask it to explain a concept, summarize what you've watched so far, quiz you, or just discuss ideas. It maintains conversation context across the entire video. For PDFs, you upload them and chat as you read. The key innovation is the low latency - responses come back in under 2 seconds because I'm caching video transcripts and doing smart chunking. I used it myself to get through a 4-hour database course while doing errands.

Technical Details

  • React Native with Expo for cross-platform mobile (iOS and Android from one codebase)
  • YouTube transcript API + Whisper for when official transcripts aren't available or are low-quality
  • Voice recording with react-native-audio-recorder-player, streaming chunks as you speak
  • Real-time transcription with Deepgram API (faster than Whisper for real-time, but I fall back to Whisper for better accuracy on technical content)
  • GPT-4 for conversation with a custom prompt that includes video transcript chunks + conversation history
  • Redis for caching transcripts and conversation context - GPT-4 with full transcripts gets expensive fast (see the cache sketch after this list)
  • Firebase for auth, user data, and tracking video progress/bookmarks
  • Smart chunking algorithm that splits transcripts at natural boundaries (scene changes, topic shifts) instead of fixed token counts - sketched after this list
  • Background audio playback so you can lock your phone and keep talking
  • Expo voice detection to automatically pause playback when you start speaking (took forever to tune the sensitivity)
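
To make the caching concrete, here is a minimal sketch of the transcript cache, assuming ioredis on a backend service; the key format, the 30-day TTL, and the fetchTranscript helper are illustrative stand-ins rather than the actual implementation.

```typescript
// Sketch of the transcript cache: check Redis before hitting any transcript
// source. `fetchTranscript` is a hypothetical stand-in for the real pipeline
// (official captions first, then Whisper on cached audio).
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const TRANSCRIPT_TTL_SECONDS = 60 * 60 * 24 * 30; // keep transcripts for ~30 days

export interface TranscriptChunk {
  startSec: number; // where the chunk begins in the video
  endSec: number;   // where it ends
  text: string;
}

// Hypothetical placeholder for the real transcript pipeline.
async function fetchTranscript(videoId: string): Promise<TranscriptChunk[]> {
  throw new Error(`no transcript source wired up for ${videoId}`);
}

export async function getTranscript(videoId: string): Promise<TranscriptChunk[]> {
  const key = `transcript:${videoId}`;

  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached) as TranscriptChunk[]; // cache hit

  const chunks = await fetchTranscript(videoId);
  await redis.set(key, JSON.stringify(chunks), "EX", TRANSCRIPT_TTL_SECONDS);
  return chunks;
}
```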
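
And a simplified version of the chunking idea: instead of cutting at fixed token counts, walk the caption segments and start a new chunk at a long pause (a cheap proxy for the scene and topic-shift detection described above) or once a soft token budget is hit. The thresholds and the 4-characters-per-token estimate are illustrative.

```typescript
interface CaptionSegment {
  startSec: number;
  endSec: number;
  text: string;
}

interface TranscriptChunk {
  startSec: number;
  endSec: number;
  text: string;
}

// Illustrative thresholds, not the production values.
const PAUSE_BOUNDARY_SEC = 2.5; // a long pause in the captions often marks a topic shift
const SOFT_TOKEN_BUDGET = 800;  // keep each chunk comfortably below the prompt budget
const roughTokens = (s: string) => Math.ceil(s.length / 4); // ~4 chars per token

function chunkTranscript(segments: CaptionSegment[]): TranscriptChunk[] {
  const chunks: TranscriptChunk[] = [];
  let current: CaptionSegment[] = [];
  let tokens = 0;

  const flush = () => {
    if (current.length === 0) return;
    chunks.push({
      startSec: current[0].startSec,
      endSec: current[current.length - 1].endSec,
      text: current.map((s) => s.text).join(" "),
    });
    current = [];
    tokens = 0;
  };

  for (const seg of segments) {
    const prev = current[current.length - 1];
    const pause = prev ? seg.startSec - prev.endSec : 0;
    // Break at natural pauses first; fall back to the token budget.
    if (prev && (pause >= PAUSE_BOUNDARY_SEC || tokens >= SOFT_TOKEN_BUDGET)) {
      flush();
    }
    current.push(seg);
    tokens += roughTokens(seg.text);
  }
  flush();
  return chunks;
}
```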

Challenges

  • Latency was killing the experience initially - 8-10 seconds for responses. Turns out I was sending the entire transcript every time (some are 50k+ tokens). Now I only send the relevant chunks based on where the user is in the video, plus conversation history (see the sketch after this list). Got it down to 1.5-2s average.
  • YouTube's transcript API is rate-limited and sometimes just fails for no reason. Built a fallback pipeline: official transcripts → Whisper on cached audio → error with retry queue (sketched after this list). Still fails sometimes on really new videos.
  • Voice recording is a nightmare on mobile - different sampling rates, background noise, wind if you're running. Spent 2 weeks just testing different noise cancellation approaches. Settled on react-native-audio-api with a high-pass filter, but it's not perfect.
  • Context management is hard - users jump around videos, switch between multiple videos, start new conversations. Built a conversation tree structure (rough shape after this list), but the UX for navigating it is still clunky. Considering just making it linear with better search.
  • iOS app store review took 6 weeks because they kept flagging it as a wrapper app for GPT. Had to show them the custom transcript processing and conversation context system. Finally approved after the 4th submission.
  • Battery drain was insane initially - 25% per hour. Turns out I was keeping the video playing in the background. Now I pause it when Voxa is responding, which is obvious in hindsight.
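
To make the latency fix from the first bullet concrete, here is a rough sketch of the "relevant chunks only" request path: pick transcript chunks near the user's current playback position, trim them to a token budget, and send them as a system message ahead of recent conversation history. The window size, budgets, prompt wording, and helper names are illustrative; the model call uses the official OpenAI Node SDK.

```typescript
// Sketch of the latency fix: send only transcript chunks near the user's
// current position in the video, plus recent conversation history, instead
// of the whole transcript. Budgets, window sizes, and names are illustrative.
import OpenAI from "openai";

interface TranscriptChunk {
  startSec: number;
  endSec: number;
  text: string;
}

type ChatTurn =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string };

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const CONTEXT_WINDOW_SEC = 10 * 60; // consider chunks within ~10 minutes of the playhead
const CHUNK_TOKEN_BUDGET = 3000;    // soft cap on transcript tokens per request
const roughTokens = (s: string) => Math.ceil(s.length / 4); // ~4 chars per token

function relevantChunks(chunks: TranscriptChunk[], positionSec: number): TranscriptChunk[] {
  // Mostly look backwards from the playhead, with a small lookahead.
  const nearby = chunks.filter(
    (c) => c.startSec <= positionSec + 60 && positionSec - c.endSec <= CONTEXT_WINDOW_SEC
  );

  // Keep the most recent chunks until the budget runs out, then restore order.
  const selected: TranscriptChunk[] = [];
  let used = 0;
  for (const chunk of [...nearby].reverse()) {
    const cost = roughTokens(chunk.text);
    if (used + cost > CHUNK_TOKEN_BUDGET) break;
    selected.push(chunk);
    used += cost;
  }
  return selected.reverse();
}

async function answer(
  chunks: TranscriptChunk[],
  positionSec: number,
  history: ChatTurn[],
  question: string
): Promise<string | null> {
  const context = relevantChunks(chunks, positionSec)
    .map((c) => `[${Math.floor(c.startSec)}s-${Math.floor(c.endSec)}s] ${c.text}`)
    .join("\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `You are discussing a video with the user. Transcript near their current position:\n${context}`,
      },
      ...history.slice(-10), // recent turns only
      { role: "user", content: question },
    ],
  });

  return completion.choices[0].message.content;
}
```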
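
The transcript fallback from the second bullet is conceptually a chain of sources with a retry queue at the end. A sketch of that control flow, with the actual fetchers stubbed out as hypothetical placeholders and the retry queue represented as a Redis list:

```typescript
// Sketch of the transcript fallback: try official captions, then Whisper on
// cached audio, and if both fail, park the video ID on a retry queue.
// The fetcher helpers and the queue key are hypothetical stand-ins.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

interface TranscriptChunk {
  startSec: number;
  endSec: number;
  text: string;
}

// Hypothetical stand-ins for the real fetchers.
async function fetchOfficialCaptions(videoId: string): Promise<TranscriptChunk[] | null> {
  console.log(`would call the YouTube transcript API for ${videoId}`);
  return null;
}

async function transcribeWithWhisper(videoId: string): Promise<TranscriptChunk[] | null> {
  console.log(`would run Whisper over cached audio for ${videoId}`);
  return null;
}

async function resolveTranscript(videoId: string): Promise<TranscriptChunk[] | null> {
  try {
    const official = await fetchOfficialCaptions(videoId);
    if (official?.length) return official;
  } catch {
    // Rate limits and random failures are common; fall through to Whisper.
  }

  try {
    const whisper = await transcribeWithWhisper(videoId);
    if (whisper?.length) return whisper;
  } catch {
    // Whisper can also fail, e.g. on very new videos with no cached audio yet.
  }

  // Nothing worked: queue a background retry and surface an error to the user.
  await redis.rpush("transcript:retry", videoId);
  return null;
}
```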
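
And for the conversation tree, the structure is roughly one node per message, optionally anchored to a position in a specific video, with children forming branches; walking up from a leaf gives the linear history to send to the model. These field names are illustrative, not the actual schema.

```typescript
// Rough shape of the conversation tree: one node per message, optionally
// anchored to a playback position, with children forming branches.
// Field names are illustrative, not the actual schema.
interface ConversationNode {
  id: string;
  role: "user" | "assistant";
  content: string;
  videoId?: string;          // which video the message refers to, if any
  videoPositionSec?: number; // playback position when the message was sent
  parentId: string | null;   // null for the root of a conversation
  childIds: string[];        // branches created when the user jumps around
  createdAt: number;         // epoch millis
}

// Walking up from a leaf yields the linear history to send to the model.
function historyFor(nodes: Map<string, ConversationNode>, leafId: string): ConversationNode[] {
  const path: ConversationNode[] = [];
  let current = nodes.get(leafId) ?? null;
  while (current) {
    path.unshift(current);
    current = current.parentId ? nodes.get(current.parentId) ?? null : null;
  }
  return path;
}
```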

Results

  • ~890 active users, about 40% are paying subscribers at $8/month
  • $5K MRR after 6 months (launched with no marketing, just Product Hunt)
  • Users are averaging 42 minutes per day in the app - way higher than I expected
  • Most popular use case (based on surveys): programming tutorials while commuting
  • One guy told me he uses it to 'attend' university lectures he records, then discusses them on his drive home. That's both genius and probably technically cheating.
  • 4.7 stars on the App Store (would be higher but there were bugs in the first release)
  • Processing about 6,000 conversations per week
  • Transcript cache hit rate is 73% (most people watch popular videos)
  • Average conversation is 12 messages - people really engage with it
  • API costs are my biggest expense at ~$2.2K/month (mostly GPT-4). Considering fine-tuning a smaller model for common questions.