GenAI Search System Guide

1. Overview

This application is an interactive Q&A system built with Streamlit. When a user enters a natural-language question, it searches a research paper database through Google Cloud's Discovery Engine and BigQuery, then generates an answer with Google Gemini Pro (LLM).

Key Features

✓ Natural Language Q&A: Ask questions using criteria such as paper content, author, and year.
✓ Exact ID Search: Queries BigQuery directly for fast, exact lookups by paper ID.
✓ Semantic Search: Uses Discovery Engine to interpret user intent and find similar documents.
✓ Complex & Follow-up Questions: Retains the conversation context to handle follow-up questions.
✓ Related Paper Recommendations: Recommends related papers based on TF-IDF and cosine similarity.
✓ History & Visualization: Manages conversation history and visualizes search results.

Technology Stack

Streamlit, Google Gemini Pro, Discovery Engine, BigQuery, Pandas, Scikit-learn, LangChain, Google ADK

2. System Architecture

System Diagram

1. Query (User Input) › 2. Analyze/Search (Type Analysis & Data Retrieval) › 3. Generate (LLM Prompt & Answer Generation) › 4. Result (UI Rendering & Streaming)

Key Components: Google Discovery Engine, BigQuery, Gemini Pro, Streamlit/ADK

1. User Input
The user enters a question in the web UI (Streamlit or ADK Web).

2. Query Analysis
The system checks the question for IDs, follow-up keywords, and similar signals to determine the query type.

3. Data Retrieval
ID searches use BigQuery; all other queries use Discovery Engine. In ADK, these functions are defined with `@tool`.

4. Filtering & Refinement
Search results are refined by keyword, year, similarity, and so on.

5. Prompt Generation
An LLM prompt is built by combining the search results with the conversation history.

6. Answer Generation
The answer generated by the LLM is streamed to the UI in real time. In ADK, the `adk-chat` component handles this.

7. Display Results
The answer and the retrieved data (as a table) are rendered in the UI.
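
The routing decision in steps 2 and 3 can be sketched as a small pure function. This is a minimal illustration rather than the application's actual code: the function name `classify_query` and the three return labels are hypothetical, while the ID pattern and the follow-up keywords are modeled on the Streamlit listing in section 3.2.

```python
import re

# Follow-up markers used by the Streamlit listing ("other", "related").
FOLLOW_UP_KEYWORDS = ("다른", "관련된")

# ID pattern modeled on extract_id in the listing; the leading \b keeps
# words that merely end in "id" (such as "lipid") from matching.
ID_PATTERN = re.compile(r"\bid[=\s:]+([a-f0-9\-]+)", re.IGNORECASE)


def classify_query(question: str, has_history: bool) -> str:
    """Return 'id_lookup', 'follow_up', or 'semantic_search' (hypothetical labels)."""
    if ID_PATTERN.search(question):
        return "id_lookup"  # routed to BigQuery
    if has_history and any(k in question.lower() for k in FOLLOW_UP_KEYWORDS):
        return "follow_up"  # answered from the previous result set
    return "semantic_search"  # routed to Discovery Engine
```

A follow-up keyword only counts when there is earlier conversation to refer back to, which is why `has_history` is part of the decision.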

3. Code Structure and Core Functions

3.1. Core Functions

`retrieve_documents_as_dataframe(question)`

Calls the Discovery Engine API to search for documents and caches the results for 1 hour (via `@st.cache_data(ttl=3600)`).
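
The step that turns the search response into table rows can be shown in isolation. The sketch below assumes the response shape implied by the listing in section 3.2 (a `results` list whose items carry `document.structData`); the helper name `extract_struct_rows` and the sample payload are invented for illustration.

```python
from typing import Any, Dict, List


def extract_struct_rows(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the structData of every search result, tolerating missing keys."""
    return [
        result.get("document", {}).get("structData", {})
        for result in response_json.get("results", [])
    ]


# Hypothetical payload; the field values are made up.
sample = {
    "results": [
        {"document": {"id": "doc-1", "structData": {"id": "a1b2", "title": "Paper A"}}},
        {"document": {"id": "doc-2"}},  # no structData, so this yields an empty row
    ]
}
rows = extract_struct_rows(sample)
```

Because every lookup uses `.get` with a default, a response without results produces an empty list instead of raising.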

`query_bigquery_by_id(doc_id)`

Queries the BigQuery table directly by paper ID, using a parameterized query to prevent SQL injection.
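
Why parameter binding blocks injection is easiest to see in a runnable example. The sketch below deliberately swaps BigQuery for the standard library's `sqlite3` so it runs without cloud credentials; the table and data are invented. The principle is the same as the named `@doc_id` parameter in the BigQuery query: the user-supplied value travels separately from the SQL text.

```python
import sqlite3

# In-memory stand-in for the papers table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id TEXT, title TEXT)")
conn.execute("INSERT INTO papers VALUES ('a1b2', 'Paper A')")


def find_by_id(doc_id: str):
    # The value is bound as a parameter, never spliced into the SQL string.
    return conn.execute("SELECT title FROM papers WHERE id = ?", (doc_id,)).fetchall()


safe_hit = find_by_id("a1b2")
# Treated as a literal ID that matches nothing, not as SQL.
injection_attempt = find_by_id("x' OR '1'='1")
```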

`find_related_documents(target_doc, all_docs_df)`

Finds the papers most similar to a reference paper by computing TF-IDF vectors and cosine similarity.
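
To make the ranking idea concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is not the application's implementation, which relies on scikit-learn's `TfidfVectorizer`; the helper names are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF formula is chosen to resemble scikit-learn's default, so exact scores may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: rare terms weigh more, and no term gets a zero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_related(target_index: int, texts: List[str], top_k: int = 5) -> List[int]:
    """Indices of the texts most similar to texts[target_index], best first."""
    vectors = tfidf_vectors(texts)
    target = vectors[target_index]
    scored = [(cosine(target, vec), i) for i, vec in enumerate(vectors) if i != target_index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


texts = [
    "protein folding dynamics",
    "protein folding simulation",
    "galaxy formation models",
]
closest = rank_related(0, texts, top_k=1)
```

The target document is excluded from its own ranking, mirroring the ID check in `find_related_documents`.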

`generate_answer_stream(question, docs)`

Builds an LLM prompt from the conversation history and reference documents, then streams the answer.
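
Prompt assembly can be separated from streaming so that it is testable on its own. The sketch below is an assumption-laden variant of that step: `build_prompt` and its `max_turns` cap are hypothetical (the original listing includes the full history), and the English wording stands in for the Korean prompt text used by the application.

```python
from typing import Dict, List


def build_prompt(
    question: str,
    docs: List[dict],
    history: List[Dict[str, str]],
    max_turns: int = 5,
) -> str:
    """Combine recent history and reference documents into one prompt string."""
    # Keep only the most recent turns so the prompt does not grow without bound.
    recent = history[-max_turns:] if max_turns > 0 else []
    history_text = "\n".join(f"Q: {t['question']}\nA: {t['answer']}" for t in recent)
    context = "\n\n---\n\n".join(str(doc) for doc in docs)
    return (
        f"Conversation history:\n{history_text}\n\n"
        f"Reference documents:\n{context}\n\n"
        "Using the conversation history and reference documents above, "
        "answer the following question.\n"
        f"Question: {question}\nAnswer:"
    )


history = [
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a2"},
]
prompt = build_prompt("q3", [{"title": "T"}], history, max_turns=1)
```

Capping the history is a design choice worth considering for long sessions, since every earlier answer is otherwise resent with each request.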

3.2. Full Code (Streamlit)

Below is the full Python code of the original Streamlit application on which this manual is based.


# This is the original Streamlit version of the code.
# See the guide below for converting it to the ADK version.
import streamlit as st
import requests
import re
import pandas as pd
from google.auth import default, exceptions as google_auth_exceptions
from google.auth.transport.requests import Request
from google.cloud import bigquery
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import logging

# ------------------------------
# Logging configuration
# ------------------------------
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# ------------------------------
# Environment configuration
# ------------------------------
# Important: in production, never hardcode sensitive values such as API keys in the code.
# Use a secure mechanism such as Streamlit Secrets, environment variables, or Google Secret Manager.
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"
PROJECT_NUMBER = "YOUR_PROJECT_NUMBER"
ENGINE_ID = "YOUR_ENGINE_ID"
BIGQUERY_TABLE = "YOUR_PROJECT_ID.YOUR_DATASET.YOUR_TABLE"

DISCOVERY_ENDPOINT = f"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search"

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-001", streaming=True)

# Initialize Streamlit session state
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
if "last_authors" not in st.session_state:
    st.session_state.last_authors = None
if "last_doc_ids" not in st.session_state:
    st.session_state.last_doc_ids = set()
if "context_docs" not in st.session_state:
    st.session_state.context_docs = []
if "last_search_query" not in st.session_state:
    st.session_state.last_search_query = ""

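# ------------------------------
# Utility functions
# ------------------------------
def extract_id(question: str):
    # \b keeps words that merely end in "id" (e.g. "lipid") from being read as an ID prefix
    match = re.search(r'\bid[=\s:]+([a-f0-9\-]+)', question, re.IGNORECASE)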
    return match.group(1) if match else None

def extract_doi_or_url(question: str):
    doi_match = re.search(r'(10\.\d{4,9}/[^\s]+)', question)
    if doi_match: return doi_match.group(1)
    url_match = re.search(r'https://(?:www\.)?biorxiv\.org/content/([^\s]+)', question)
    if url_match: return url_match.group(1)
    return None

def extract_keywords_and_years(question: str):
    # Non-capturing group, so findall returns the full four-digit year instead of just "19" or "20"
    years = re.findall(r'(?:19|20)\d{2}', question)
    year = years[0] if years else None
    words = re.findall(r'\b([A-Z][a-zA-Z\-\'.]+(?:\s[A-Z][a-zA-Z\-\'.]+)*)\b', question)
    keyword = " ".join(words) if words else None
    return keyword, year

def extract_keywords_from_question(question: str) -> str:
    keywords = re.findall(r'\b\w+\b', question)
    return " ".join([word for word in keywords if len(word) > 2])

# ------------------------------
# Filter function
# ------------------------------
def filter_relevant_documents(df: pd.DataFrame, question: str) -> pd.DataFrame:
    logging.info(f"필터링 시작. 입력 행 개수: {len(df)}")
    if df.empty: return pd.DataFrame()

    keyword, year = extract_keywords_and_years(question)
    
    if not keyword and not year: return df

    def row_matches(row):
        keyword_match = True
        if keyword:
            keyword_match = any(keyword.lower() in str(row.get(col, "")).lower() for col in ["title", "abstract", "authors"])
        
        year_match = True
        if year:
            year_match = str(year) in str(row.get("date", ""))
            
        return keyword_match and year_match

    return df[df.apply(row_matches, axis=1)]

# ------------------------------
# Search function
# ------------------------------
@st.cache_data(ttl=3600)
def retrieve_documents_as_dataframe(question: str, top_k: int = 20):
    logging.info(f"Discovery Engine 검색: '{question}'")
    try:
        credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
        credentials.refresh(Request())
    except Exception as e:
        st.error(f"❌ Google Cloud 인증 실패: {e}")
        return pd.DataFrame()

    headers = {"Authorization": f"Bearer {credentials.token}", "Content-Type": "application/json"}
    body = {
        "pageSize": top_k,
        "query": question,
        "languageCode": "ko",
        "queryExpansionSpec": {"condition": "AUTO"},
        "spellCorrectionSpec": {"mode": "AUTO"}
    }

    try:
        response = requests.post(DISCOVERY_ENDPOINT, headers=headers, json=body)
        response.raise_for_status()
        data = response.json()
        rows = [doc.get("document", {}).get("structData", {}) for doc in data.get("results", [])]
        logging.info(f"Discovery Engine 검색 성공. 결과 {len(rows)}개")
        return pd.DataFrame(rows)
    except requests.exceptions.RequestException as e:
        st.error(f"❌ Discovery Engine 검색 실패: {e}")
        return pd.DataFrame()

# ------------------------------
# Direct BigQuery query function
# ------------------------------
@st.cache_data(ttl=3600)
def query_bigquery_by_id(doc_id: str) -> pd.DataFrame:
    logging.info(f"BigQuery 쿼리: ID='{doc_id}'")
    try:
        bq_client = bigquery.Client()
        query = f"SELECT * FROM `{BIGQUERY_TABLE}` WHERE id = @doc_id"
        job_config = bigquery.QueryJobConfig(query_parameters=[bigquery.ScalarQueryParameter("doc_id", "STRING", doc_id)])
        return bq_client.query(query, job_config=job_config).to_dataframe()
    except Exception as e:
        st.error(f"❌ BigQuery 쿼리 오류: {e}")
        return pd.DataFrame()

# ------------------------------
# Related-paper lookup function
# ------------------------------
def preprocess_text(text):
    return re.sub(r'[^\w\s]', '', str(text)).lower() if text else ""

def find_related_documents(target_doc, all_docs_df: pd.DataFrame, top_k=5):
    if target_doc is None or all_docs_df.empty: return pd.DataFrame()

    all_docs_df["combined_text"] = all_docs_df["title"].fillna("") + " " + all_docs_df["abstract"].fillna("")
    all_docs_df["combined_text"] = all_docs_df["combined_text"].apply(preprocess_text)
    target_text = preprocess_text(target_doc.get("title", "") + " " + target_doc.get("abstract", ""))
    
    tfidf = TfidfVectorizer()
    all_docs_tfidf = tfidf.fit_transform(all_docs_df["combined_text"])
    target_tfidf = tfidf.transform([target_text])
    
    similarities = cosine_similarity(target_tfidf, all_docs_tfidf).flatten()
    related_indices = similarities.argsort()[::-1]
    related_indices = [i for i in related_indices if all_docs_df.iloc[i].get("id") != target_doc.get("id")]
    
    return all_docs_df.iloc[related_indices[:top_k]]

# ------------------------------
# Complex query handling function
# ------------------------------
def handle_complex_query(question: str, context_df: pd.DataFrame) -> pd.DataFrame:
    if "관련된 논문" in question.lower() and not context_df.empty:
        target_doc = context_df.iloc[0].to_dict()
        search_query = target_doc.get("title", "")
        all_docs_df = retrieve_documents_as_dataframe(search_query)
        return find_related_documents(target_doc, all_docs_df)
    
    return filter_relevant_documents(context_df, question)

# ------------------------------
# LLM answer generation
# ------------------------------
def generate_answer_stream(question: str, docs: list):
    context = "\n\n---\n\n".join([str(doc) for doc in docs])
    history = "\n".join([f"Q: {turn['question']}\nA: {turn['answer']}" for turn in st.session_state.chat_history])
    prompt = f"""
대화 기록:
{history}

참고 문서:
{context}

위 대화 기록과 참고 문서를 바탕으로 다음 질문에 대해 상세하고 친절하게 답변해 주세요.
질문: {question}
답변:
"""
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""
        for chunk in llm.stream(prompt):
            full_response += chunk.content
            message_placeholder.markdown(full_response + "▌")
        message_placeholder.markdown(full_response)
    return full_response

# ------------------------------
# Streamlit UI
# ------------------------------
st.set_page_config(page_title="QA (Discovery Engine)", layout="wide")
st.title("🔍 VertexAI 기반 BigQuery QA")
question_input = st.text_input("질문을 입력하세요", placeholder="예: 논문 ID, 저자, 키워드 등으로 검색")

if question_input:
    with st.spinner("🔎 검색 및 LLM 분석 중..."):
        extracted_id = extract_id(question_input)
        is_complex = st.session_state.chat_history and any(k in question_input.lower() for k in ["다른", "관련된"])

        if extracted_id:
            filtered_df = query_bigquery_by_id(extracted_id)
        elif is_complex:
            context_df = pd.DataFrame(st.session_state.context_docs)
            filtered_df = handle_complex_query(question_input, context_df)
        else:
            df = retrieve_documents_as_dataframe(question_input)
            filtered_df = filter_relevant_documents(df, question_input)
            st.session_state.last_search_query = question_input
        
        if not filtered_df.empty:
            st.session_state.context_docs = filtered_df.to_dict('records')
        
        docs = st.session_state.context_docs
        with st.chat_message("user"):
            st.markdown(question_input)
        
        answer = generate_answer_stream(question_input, docs)
        st.session_state.chat_history.append({"question": question_input, "answer": answer, "df": filtered_df})

# Display conversation history
if st.session_state.chat_history:
    st.subheader("💬 대화 히스토리")
    for i, turn in enumerate(reversed(st.session_state.chat_history)):
        with st.expander(f"#{len(st.session_state.chat_history)-i} {turn['question'][:60]}..."):
            with st.chat_message("user"): st.markdown(turn["question"])
            with st.chat_message("assistant"): st.markdown(turn['answer'])
            
            df = turn.get("df")
            if df is not None and not df.empty:
                st.markdown("#### 📊 검색 결과")
                st.dataframe(df, use_container_width=True)
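
The comment near the top of the listing advises against hardcoding configuration. One way to follow that advice is to read the values from environment variables at startup, as in this minimal sketch. The variable names (`PROJECT_NUMBER`, `ENGINE_ID`, `BIGQUERY_TABLE`) and the `Settings`/`load_settings` helpers are assumptions chosen to match the constants in the listing, not part of the original application.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    project_number: str
    engine_id: str
    bigquery_table: str

    @property
    def discovery_endpoint(self) -> str:
        # Same URL template as DISCOVERY_ENDPOINT in the listing.
        return (
            "https://discoveryengine.googleapis.com/v1alpha/projects/"
            f"{self.project_number}/locations/global/collections/default_collection/"
            f"engines/{self.engine_id}/servingConfigs/default_search:search"
        )


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read required settings from the environment and fail fast if any is missing."""
    env = os.environ if env is None else env
    required = ("PROJECT_NUMBER", "ENGINE_ID", "BIGQUERY_TABLE")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(env["PROJECT_NUMBER"], env["ENGINE_ID"], env["BIGQUERY_TABLE"])
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests; in the running app it falls back to `os.environ`.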

3.3. Guide: Converting from Streamlit to Google Agent Development Kit (ADK)

Streamlit is well suited to rapid prototyping, but Google ADK supports more structured and scalable applications by separating the agent's core logic from the UI and applying a tool-based architecture. ADK is designed with tight integration with, and deployment to, Google Cloud services in mind.

This guide explains how to restructure the core functionality of the existing Streamlit code to fit ADK and run it as a web-based chat agent.

Step 1: Project Setup and Installation

An ADK project typically consists of agent logic, a web frontend, and an execution script. First, install the necessary libraries.


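# Install google-adk together with the other required libraries.
# tabulate is required by pandas' DataFrame.to_markdown(); db-dtypes by BigQuery's to_dataframe().
pip install google-adk langchain-google-genai google-cloud-bigquery db-dtypes pandas scikit-learn requests tabulate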

Create the project file structure as shown below.


your-adk-project/
├── agent.py         # Agent's core logic and tool definitions
├── main.py          # Runs the ADK web server
└── static/
    └── index.html   # Web frontend UI

Step 2: Define Agent and Tools (`agent.py`)

Convert the functions from the Streamlit code into 'Tools' that ADK can recognize. Wrap each function with the `@tool` decorator, then define an agent class that uses these tools.


# agent.py

import os
import re
import pandas as pd
import requests
from google.auth import default
from google.auth.transport.requests import Request
from google.cloud import bigquery
from langchain_google_genai import ChatGoogleGenerativeAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from adk.api import *

# --- Set environment variables ---
# Using Secret Manager is recommended in production.
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"
PROJECT_NUMBER = "YOUR_PROJECT_NUMBER"
ENGINE_ID = "YOUR_ENGINE_ID"
BIGQUERY_TABLE = "YOUR_PROJECT_ID.YOUR_DATASET.YOUR_TABLE"
DISCOVERY_ENDPOINT = f"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search"

# --- Convert existing functions to ADK tools ---

@tool
def search_papers_with_discovery_engine(query: str) -> str:
    """
    Searches for relevant papers using Discovery Engine based on the user's natural language query.
    Used for general questions, topics, and keyword searches.
    The result is returned as a Markdown table string.
    """
    # Same logic as the original retrieve_documents_as_dataframe function:
    # authentication and the API call are unchanged.
    try:
        credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
        credentials.refresh(Request())
        headers = {"Authorization": f"Bearer {credentials.token}", "Content-Type": "application/json"}
        body = {"pageSize": 10, "query": query, "languageCode": "ko"}
        response = requests.post(DISCOVERY_ENDPOINT, headers=headers, json=body)
        response.raise_for_status()
        data = response.json()
        rows = [doc.get("document", {}).get("structData", {}) for doc in data.get("results", [])]
        if not rows:
            return "검색 결과가 없습니다. (No search results found.)"
        df = pd.DataFrame(rows)[['id', 'title', 'authors']].head(5)
        return df.to_markdown()
    except Exception as e:
        return f"Discovery Engine 검색 중 오류 발생 (Error during Discovery Engine search): {e}"

@tool
def get_paper_details_by_id(paper_id: str) -> str:
    """
    Retrieves detailed information for a paper from BigQuery using its exact ID.
    Used when the user mentions 'id' or a specific ID format.
    The result is returned as a Markdown table string.
    """
    # Same logic as the original query_bigquery_by_id function.
    try:
        bq_client = bigquery.Client()
        query = f"SELECT id, title, abstract, authors, date FROM `{BIGQUERY_TABLE}` WHERE id = @doc_id"
        job_config = bigquery.QueryJobConfig(query_parameters=[bigquery.ScalarQueryParameter("doc_id", "STRING", paper_id)])
        df = bq_client.query(query, job_config=job_config).to_dataframe()
        if df.empty:
            return f"ID '{paper_id}'에 해당하는 논문을 찾을 수 없습니다. (Paper with ID '{paper_id}' not found.)"
        return df.to_markdown()
    except Exception as e:
        return f"BigQuery 조회 중 오류 발생 (Error during BigQuery query): {e}"

# --- Define the agent ---
class PaperQaAgent(Agent):
    def __call__(self, query: str, history: History) -> Message:
        """A paper search agent that answers user questions."""

        # Set the system prompt so the LLM can use the tools
        prompt = f"""
You are an AI assistant with access to a research paper database.
Analyze the user's intent and use the most appropriate tool.
- For general keywords, topics, or author names, use 'search_papers_with_discovery_engine'.
- For questions containing the word 'id' or a clear ID format (e.g., a1b2c3d4), use 'get_paper_details_by_id'.

Conversation History:
{history.format()}

User Question: {query}
"""
        llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-001", convert_system_message_to_human=True)
        
        # Invoke the LLM to generate a response, including tool use
        response = llm.invoke(prompt, tools=[search_papers_with_discovery_engine, get_paper_details_by_id])
        
        return Message(content=response.content, tool_calls=response.tool_calls)

Step 3: Create Web Frontend (`static/index.html`)

The `adk-chat` web component provided by ADK makes it simple to build a chat UI. The component automatically handles communication with the backend agent, conversation history management, streaming response display, Markdown rendering, and more.





  
(index.html listing not preserved; page title: ADK-based Paper QA System)

Step 4: Run ADK Server (`main.py`)

This script starts the server, connecting the agent defined in `agent.py` to the web frontend in the `static` folder.


# main.py

from adk.web import start_server
from agent import PaperQaAgent

if __name__ == "__main__":
    start_server(PaperQaAgent, "static")

Step 5: Run the Application

Once all files are ready, run `main.py` from the terminal to start the ADK web server.


python main.py

Once the server is running, open the address shown in the terminal (default: http://127.0.0.1:8080) to use the chat agent.

4. Execution Guide

4.1. BigQuery Data Preparation (Migration via GCS)

This application uses BigQuery data. For this lab, the steps below export the `ds_test` dataset from an existing project (`YOUR_SOURCE_PROJECT_ID`) to Google Cloud Storage (GCS) and then load it into your new project.

Step 1: Export Source Dataset to GCS

The BigQuery script below exports all tables in the `ds_test` dataset of the `YOUR_SOURCE_PROJECT_ID` project to a GCS bucket (`gs://YOUR_BUCKET_NAME/ds_test/`). To avoid errors, the script automatically excludes columns with complex data types such as ARRAY, which cannot be exported directly to CSV; those columns are therefore absent from the exported files.
Run the script below in the BigQuery console of your source project (`YOUR_SOURCE_PROJECT_ID`).


-- Uses BigQuery scripting to export every table in the ds_test dataset to GCS.
-- Run this script in the BigQuery console of the source project (YOUR_SOURCE_PROJECT_ID).

-- 1. Declare a variable to hold the list of tables to export.
DECLARE tables_to_export ARRAY<STRUCT<table_name STRING>>;
-- 2. Declare the variables used inside the loop.
--    The name current_table avoids a clash with the table_name column below:
--    inside a query, a column takes precedence over a variable of the same name.
DECLARE current_table, column_names STRING;

-- 3. Collect the names of all 'BASE TABLE' entries in the ds_test dataset.
SET tables_to_export = (
  SELECT ARRAY_AGG(STRUCT(t.table_name))
  FROM `YOUR_SOURCE_PROJECT_ID.ds_test.INFORMATION_SCHEMA.TABLES` AS t
  WHERE t.table_type = 'BASE TABLE'
);

-- 4. Loop over each table.
FOR tbl_row IN (SELECT * FROM UNNEST(tables_to_export)) DO
  SET current_table = tbl_row.table_name;

  -- 5. Build the column list dynamically, skipping ARRAY and STRUCT columns,
  --    which cannot be exported directly to CSV.
  SET column_names = (
    SELECT STRING_AGG(c.column_name, ', ')
    FROM `YOUR_SOURCE_PROJECT_ID.ds_test.INFORMATION_SCHEMA.COLUMNS` AS c
    WHERE c.table_name = current_table
      AND NOT STARTS_WITH(c.data_type, 'ARRAY')
      AND NOT STARTS_WITH(c.data_type, 'STRUCT')
  );

  -- 6. Run EXPORT DATA with the dynamically built column list.
  -- Each table is written to 'gs://YOUR_BUCKET_NAME/ds_test/' as '<table name>-*.csv'.
  EXECUTE IMMEDIATE FORMAT("""
    EXPORT DATA OPTIONS(
      uri='gs://YOUR_BUCKET_NAME/ds_test/%s-*.csv',
      format='CSV',
      header=true,
      overwrite=true
    ) AS
    SELECT %s
    FROM `YOUR_SOURCE_PROJECT_ID.ds_test`.`%s`;
  """, current_table, column_names, current_table);

END FOR;
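
The column filtering and statement building performed by the export script can also be sketched in Python, which makes the logic easy to check locally before running anything in BigQuery. The helper names are hypothetical. The sketch skips ARRAY columns and, as an assumption about CSV export limits, STRUCT columns too.

```python
from typing import List, Tuple

# Type prefixes treated as not exportable to CSV (STRUCT is an assumption here).
UNSUPPORTED_PREFIXES = ("ARRAY", "STRUCT")


def exportable_columns(columns: List[Tuple[str, str]]) -> List[str]:
    """Keep the names of columns whose data type can be written to CSV."""
    return [
        name
        for name, data_type in columns
        if not data_type.startswith(UNSUPPORTED_PREFIXES)
    ]


def build_export_sql(
    project: str, dataset: str, table: str, bucket: str, columns: List[Tuple[str, str]]
) -> str:
    """Build one EXPORT DATA statement for a single table."""
    column_list = ", ".join(exportable_columns(columns))
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='gs://{bucket}/{dataset}/{table}-*.csv',\n"
        "  format='CSV', header=true, overwrite=true\n"
        ") AS\n"
        f"SELECT {column_list}\n"
        f"FROM `{project}.{dataset}`.`{table}`;"
    )


# Illustrative schema: (column name, data type) pairs as INFORMATION_SCHEMA reports them.
schema = [
    ("id", "STRING"),
    ("authors", "ARRAY<STRING>"),
    ("meta", "STRUCT<source STRING>"),
    ("date", "DATE"),
]
sql = build_export_sql("YOUR_SOURCE_PROJECT_ID", "ds_test", "papers", "YOUR_BUCKET_NAME", schema)
```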

Step 2: Load CSV Files from GCS into BigQuery in the New Project

Once Step 1 is complete and the CSV files are stored in GCS, follow the procedure below in your new Google Cloud project to import the data into BigQuery.

1. Create a Dataset:
- Go to BigQuery Studio in your new project.
- In the Explorer panel, click the three-dot icon (⋮) next to your project ID and select 'Create dataset'.
- Enter a 'Dataset ID' (e.g., `ds_test_loaded`) and click 'Create dataset'.
2. Create a Table:
- Click the three-dot icon (⋮) next to the dataset you just created and select 'Create table'.
- Source: Under 'Create table from', select 'Google Cloud Storage'.
- GCS Path: In the 'Select file from GCS bucket...' field, enter the wildcard path for one table's files, for example `gs://YOUR_BUCKET_NAME/ds_test/YOUR_TABLE_NAME-*.csv`.
- File format: Select 'CSV'.
3. Set Schema and Options:
- Table name: Specify the name of the table to load. Each exported table has its own set of files in GCS, so repeat this process for each table.
- Schema: Enable the 'Auto detect' checkbox.
- Advanced options: In the 'Header rows to skip' field, enter `1` so that the header line is not read as data.
4. Complete Load:
- Click the 'Create table' button.
- Repeat steps 2-4 for every exported table in the GCS bucket until all tables are loaded.
- Finally, update the `BIGQUERY_TABLE` constant at the top of the script to the path of the newly loaded table (e.g., `YOUR_NEW_PROJECT_ID.ds_test_loaded.YOUR_TABLE_NAME`).
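
The console procedure above can also be scripted with the `google-cloud-bigquery` client, which helps when there are many tables. The following is a sketch under stated assumptions, not a tested migration tool: the helper names are hypothetical, the URI pattern follows the `<table name>-*.csv` naming used by the export script, and running the load requires application-default credentials with access to both the bucket and the target dataset.

```python
def gcs_uri_for_table(bucket: str, table: str, prefix: str = "ds_test") -> str:
    """Wildcard URI matching every CSV shard exported for one table."""
    return f"gs://{bucket}/{prefix}/{table}-*.csv"


def load_table_from_gcs(project: str, dataset: str, table: str, bucket: str) -> int:
    """Load one table's CSV shards into BigQuery and return the resulting row count."""
    # Imported lazily so the URI helper above works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # same as 'Header rows to skip' = 1
        autodetect=True,      # same as the 'Auto detect' schema checkbox
    )
    table_id = f"{project}.{dataset}.{table}"
    job = client.load_table_from_uri(
        gcs_uri_for_table(bucket, table), table_id, job_config=job_config
    )
    job.result()  # wait for the load job to finish; raises on failure
    return client.get_table(table_id).num_rows
```

Calling `load_table_from_gcs` once per exported table replaces the repeated console steps; the target dataset must already exist.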

4.2. Prerequisites

- Python 3.8+ environment
- Required libraries installed (see the command below)
- A Google Cloud project with the Vertex AI, Discovery Engine, and BigQuery APIs enabled
- Local environment authenticated via the Google Cloud CLI (`gcloud auth application-default login`)

4.3. Library Installation

pip install google-adk langchain-google-genai google-cloud-bigquery db-dtypes pandas scikit-learn requests tabulate

4.4. Run Script (ADK)

1. Create the `agent.py`, `main.py`, and `static/index.html` files according to the guide above.
2. Edit the settings at the top of `agent.py` to match your environment.
3. Run the command below in your terminal to start the ADK web server.

python main.py
