Project Prototype

Realtime Native Audio Demo

A voice chatbot demo that validates a native multimodal pipeline, which understands audio directly instead of detouring through text, together with a context caching strategy.

The core of this project is stripping out the serial STT-LLM-TTS chain and redesigning the system around a model that understands native audio directly. The resulting structure addresses three problems at once: latency, loss of emotional information, and repeated prompt cost.

Solution

The largest bottleneck was the extra network hop between STT and the LLM. Instead of transcribing user speech to text, we let the Gemini 2.5 Flash Native API receive and interpret the raw audio directly, which reduces both conversational delay and the loss of emotional tone.

At the same time, we added context caching so the character setup prompt is not resent on every turn. This lowers the cost of retransmitting a long persona prompt while keeping character responses consistent.
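
The effect of not resending the persona prompt can be estimated with simple token arithmetic. The sketch below is illustrative only: the class name, the `cached_token_weight` discount, and the assumption that the persona is billed once at full weight when the cache is created are ours, not measured values from this project, and cache storage fees are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnCostModel:
    """Token-level model of one conversation; all values are illustrative."""

    persona_tokens: int         # size of the character setup prompt
    turn_tokens: int            # average new input tokens per user turn
    cached_token_weight: float  # assumed relative cost of a cached token (0..1)

    def uncached_input(self, turns: int) -> float:
        # Without caching, the persona prompt is billed at full weight every turn.
        return turns * (self.persona_tokens + self.turn_tokens)

    def cached_input(self, turns: int) -> float:
        # Assumption: the persona is billed once at full weight when the cache
        # is created, then at the discounted weight on every turn.
        write_once = self.persona_tokens
        per_turn = self.persona_tokens * self.cached_token_weight + self.turn_tokens
        return write_once + turns * per_turn

    def savings_ratio(self, turns: int) -> float:
        # Fraction of weighted input tokens saved by caching.
        base = self.uncached_input(turns)
        return 0.0 if base == 0 else 1.0 - self.cached_input(turns) / base
```

Under these assumptions the savings grow with the number of turns, so caching pays off mainly for long sessions with a large persona prompt.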

Architecture

User audio is passed directly to the native model, which interprets the audio signal as-is and hands off to text streaming or voice response generation. Removing the separate STT layer preserves more of the input context and reduces the latency and error potential introduced by intermediate conversion.

Operationally, persona cache, endpoint capacity, and TTS rendering are managed as separate concerns. The design does not just aim to make a single model call fast; it assumes a deployment structure that keeps cost per turn and time-to-first-token (TTFT) under control at high concurrency.
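
Keeping TTFT under control requires measuring it first. A minimal sketch of how that might look, assuming the model response arrives as any Python iterable of chunks; `measure_ttft` and `percentile` are hypothetical helper names, not part of any SDK.

```python
import math
import time
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def measure_ttft(
    stream: Iterable[T],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a response stream; return (ttft_seconds, total_seconds, chunks).

    ttft_seconds is None when the stream yields nothing. The clock starts when
    this function is called, so pass a lazy stream that has not started yet.
    """
    start = clock()
    ttft: Optional[float] = None
    chunks = 0
    for _chunk in stream:
        if ttft is None:
            ttft = clock() - start  # first chunk observed
        chunks += 1
    return ttft, clock() - start, chunks


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Collecting one TTFT sample per turn and tracking a high percentile rather than the mean is one common way to see how latency behaves as concurrency rises.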

Technical Decisions

Context Caching

Resending a long persona prompt on every turn leaks significant cost. We chose to explicitly cache persona settings of 2,048 tokens or more, lowering both input cost and time to first response.
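
The threshold rule can be expressed as a small decision helper. This is a sketch under stated assumptions: `PersonaCachePlanner` is a hypothetical in-memory bookkeeping class, the token count is supplied by the caller (how tokens are counted is not specified here), and creating the actual cache entry with the model provider is deliberately left out.

```python
import hashlib
from typing import Dict

MIN_CACHE_TOKENS = 2048  # threshold stated in the text above


class PersonaCachePlanner:
    """Decides whether a persona prompt should be sent inline or cached."""

    def __init__(self, min_tokens: int = MIN_CACHE_TOKENS) -> None:
        self.min_tokens = min_tokens
        self._keys: Dict[str, str] = {}  # persona_id -> content hash

    @staticmethod
    def content_key(persona_prompt: str) -> str:
        # Hash the prompt so an edited persona is detected and re-cached.
        return hashlib.sha256(persona_prompt.encode("utf-8")).hexdigest()[:16]

    def plan(self, persona_id: str, persona_prompt: str, token_count: int) -> str:
        """Return 'inline', 'create', or 'reuse' for this persona."""
        if token_count < self.min_tokens:
            # Short prompts are simply sent with each request.
            self._keys.pop(persona_id, None)
            return "inline"
        key = self.content_key(persona_prompt)
        if self._keys.get(persona_id) == key:
            return "reuse"   # same content already planned for caching
        self._keys[persona_id] = key
        return "create"      # new or changed persona: create a cache entry
```

Keying on a content hash rather than only on the persona ID is a design choice made for this illustration, so that a stale cache is not reused after the persona text changes.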

Capacity Planning

Rather than a more expensive higher-tier model, we chose 2.5 Flash as the primary model and evaluated VAD and provisioned throughput alongside it. The goal is not peak spec but cost-effectiveness that holds up under large-scale B2C traffic.
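
One plausible role for VAD in this setup is to avoid forwarding silence to the model, which affects how much throughput needs to be provisioned. The sketch below uses a simple energy (RMS) threshold with a hangover window. This specific technique, the threshold value, and the function names are our assumptions for illustration; the demo's actual VAD logic is not described here.

```python
import math
from typing import Iterable, List, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    frames: Iterable[Sequence[int]],
    rms_threshold: float = 500.0,  # illustrative value for 16-bit PCM
    hangover_frames: int = 3,      # trailing quiet frames still kept as speech
) -> List[bool]:
    """Return one flag per frame: True if the frame should be forwarded."""
    flags: List[bool] = []
    quiet_run = hangover_frames  # start in the silent state
    for frame in frames:
        if frame_rms(frame) >= rms_threshold:
            quiet_run = 0
            flags.append(True)
        else:
            quiet_run += 1
            # Keep a short tail after speech so word endings are not clipped.
            flags.append(quiet_run <= hangover_frames)
    return flags
```

The fraction of frames flagged `True` gives a rough estimate of how much of the incoming audio actually reaches the model, which can feed into capacity estimates.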

Tech Stack

Gemini 2.5 Flash Native API, Context Caching, VAD, Provisioned Throughput, TTS Renderer

graph TD
    User([User Voice Stream]) --> Native{Native Audio Model}
    Persona[(Persona Cache)] --> Native
    Native --> Render[TTS Renderer]
    Render --> Audio([Voice Output])
    Native -. Capacity .-> PT[Provisioned Throughput]
    Native -. Guardrail .-> VAD[VAD Logic]
    style User fill:#f8fafc,stroke:#94a3b8
    style Native fill:#eff6ff,stroke:#3b82f6
    style Persona fill:#fefce8,stroke:#eab308
    style Render fill:#f1f5f9,stroke:#64748b
    style Audio fill:#ecfdf5,stroke:#10b981
    style PT fill:#fef3c7,stroke:#f59e0b
    style VAD fill:#fef3c7,stroke:#f59e0b