Project Prototype

Custom Voice Cloning Demo

A custom voice cloning prototype that uses audio preprocessing middleware to absorb API constraints and lower the failure rate from upload through generation.

The core of this project is less the model itself than the failure-defense logic in front of it. The focus is a middleware design that automatically normalizes whatever file a user uploads to the API's requirements and absorbs UI framework differences in the backend.

Solution

Uploaded reference audio is not sent directly to the API. A `librosa`-based middleware first resamples it to 24 kHz and normalizes its duration. Audio that exceeds the 10-second limit is trimmed to a safe length and regenerated as a new temporary object before the request proceeds.
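The resample-and-trim step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the WAV output format, the use of `soundfile` for writing, and the choice to cut at exactly 10 seconds are all assumptions. Only the 24 kHz target, the 10-second limit, and the use of `librosa` come from the description above.

```python
import tempfile

TARGET_SR = 24_000    # sample rate named in the text
MAX_SECONDS = 10.0    # limit named in the text; the exact "safe" length is an assumption


def trim_to_limit(samples, sr, max_seconds=MAX_SECONDS):
    """Return at most max_seconds of audio (works on lists or NumPy arrays)."""
    max_samples = int(sr * max_seconds)
    return samples[:max_samples]


def normalize_reference_audio(path):
    """Resample to 24 kHz mono, trim, and write a new temporary WAV file.

    Hypothetical helper: the name and the output format are assumptions.
    """
    # Imported lazily so the pure helper above stays usable without audio libraries.
    import librosa
    import soundfile as sf

    # librosa.load resamples to `sr` and downmixes to mono while decoding.
    samples, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    samples = trim_to_limit(samples, sr)

    # Write to a fresh temporary file so the user's original upload is never modified.
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    sf.write(tmp.name, samples, sr)
    return tmp.name
```

Returning a new temporary path, rather than overwriting the upload, matches the idea of regenerating the audio as a separate object before the request goes out.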

The frontend gives explicit guidance, but the actual failure defense is the backend's job. This kept the whole voice generation workflow more stable even when users misunderstand the requirements or browser environments differ.

Architecture

Uploaded files first pass through an input discriminator, which determines whether the payload is a string path or an object and normalizes it. The audio preprocessing layer then resamples and trims the audio, and only after validation passes does the workflow continue to the Chirp 3 API call and TTS generation.
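The ordering described here, where the API is reached only after parsing, preprocessing, and validation succeed, might look something like the sketch below. Every name in it (`PreparedAudio`, `validate`, `run_pipeline`) is hypothetical, and the API call is passed in as a plain callable because the actual Chirp 3 client code is not shown in this write-up.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class PreparedAudio:
    """Hypothetical container for audio that has been preprocessed."""
    path: str
    sample_rate: int
    duration_s: float


def validate(audio: PreparedAudio, target_sr: int = 24_000,
             max_seconds: float = 10.0) -> List[str]:
    """Return a list of problems; an empty list means the audio may be sent."""
    problems = []
    if audio.sample_rate != target_sr:
        problems.append(f"sample rate is {audio.sample_rate}, expected {target_sr}")
    if audio.duration_s <= 0:
        problems.append("audio is empty")
    if audio.duration_s > max_seconds:
        problems.append(f"duration {audio.duration_s:.2f}s exceeds {max_seconds}s")
    return problems


def run_pipeline(raw_input: Any,
                 parse: Callable[[Any], str],
                 preprocess: Callable[[str], PreparedAudio],
                 call_api: Callable[[PreparedAudio], Any]) -> Any:
    """Parse, preprocess, validate, and only then call the external API."""
    path = parse(raw_input)
    audio = preprocess(path)
    problems = validate(audio)
    if problems:
        # Fail before any network request is made.
        raise ValueError("; ".join(problems))
    return call_api(audio)
```

Injecting the stages as callables keeps the guard logic testable without real audio files or network access.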

In other words, the design hardens the layer in front of the API rather than making the API call itself smarter. The key point is that the middleware layer, not the model, is responsible for lowering the service failure rate.

Technical Decisions

Librosa Safe Wrapper

The backend does not trust frontend guidance alone: it enforces resampling and trimming itself. The decision assumes real-world conditions, where upload quality varies widely.

Compatibility-Centric Parsing

The structure of the audio input object changes when the UI library version changes. A wrapper identifies the input type first and safely extracts only the absolute file path, which absorbs those environment differences.
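A wrapper of this kind could be sketched as below. The payload shapes it handles (a plain path string, a dict with a `path` or `name` key, or an object exposing a `path` or `name` attribute) are assumptions about what different UI library versions might pass; the project's real set of cases is not listed here, and `extract_audio_path` is a hypothetical name.

```python
import os


def extract_audio_path(payload):
    """Return an absolute file path from a string, dict, or file-like payload.

    The accepted shapes are illustrative assumptions, not a documented contract.
    """
    if isinstance(payload, (str, os.PathLike)):
        candidate = payload
    elif isinstance(payload, dict):
        # Some payloads may carry the location under "path", others under "name".
        candidate = payload.get("path") or payload.get("name")
    else:
        # Temp-file style objects usually expose the location as an attribute.
        candidate = getattr(payload, "path", None) or getattr(payload, "name", None)

    if not candidate:
        raise TypeError(f"unsupported audio payload: {type(payload).__name__}")

    # Always hand downstream code an absolute path, whatever came in.
    return os.path.abspath(os.fspath(candidate))
```

Because every branch ends in the same absolute-path string, the preprocessing layer never needs to know which UI version produced the upload.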

Tech Stack

Google Chirp 3, Librosa, Gradio, Audio Middleware, TTS Pipeline

graph TD
    User([Audio Upload]) --> Parse[Input Parser]
    Parse --> Preprocess[Librosa Preprocess]
    Preprocess --> Verify[Validation]
    Verify --> Chirp[Chirp 3 API]
    Chirp --> TTS[TTS Generation]
    TTS --> Output([Generated Voice])
    style User fill:#f8fafc,stroke:#94a3b8
    style Parse fill:#eff6ff,stroke:#3b82f6
    style Preprocess fill:#eff6ff,stroke:#3b82f6
    style Verify fill:#fef3c7,stroke:#f59e0b
    style Chirp fill:#fefce8,stroke:#eab308
    style TTS fill:#f1f5f9,stroke:#64748b
    style Output fill:#ecfdf5,stroke:#10b981