Implementing and Testing a Simple RAG with LangChain


hyuga_ 2025. 1. 13. 17:47

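Let's build a chatbot that answers user questions using a web page as its Context.

The process is as follows.

1. Crawl the target web page

2. Split it into appropriately sized pieces to build the docs (create the chunks)

3. Receive the user's question

4. Search the VectorDB with the user's question (the chunks in docs whose embeddings are most similar to the question, e.g. by cosine similarity, are selected as the Context.)

5. Build the final prompt = pre-defined prompt + Context + user question

6. Answer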
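Steps 2 to 5 can be sketched without any external service. The snippet below is only a toy illustration: `embed` uses plain word counts as a stand-in for a real embedding model, and `PRE_PROMPT`, `retrieve`, and `build_prompt` are hypothetical names, not LangChain APIs.

```python
import math
import re
from collections import Counter

# Hypothetical pre-defined prompt, standing in for the one pulled from the hub later.
PRE_PROMPT = "Use the following context to answer the question."

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Step 4: return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Step 5: pre-defined prompt + Context + user question."""
    context = "\n\n".join(context_chunks)
    return f"{PRE_PROMPT}\nContext: {context}\nQuestion: {question}"

# Step 2 (already done by hand here): the document as a list of chunks.
chunks = [
    "Task decomposition breaks a complicated task into smaller steps.",
    "Self-reflection lets an agent learn from past mistakes.",
    "Sea otters use rocks to crack open shells.",
]
question = "What is task decomposition?"  # step 3
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The real pipeline below replaces `embed` with OpenAI embeddings and the sorted list with a Chroma vector store, but the flow is the same.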

I used the following page as the target: https://lilianweng.github.io/posts/2023-06-23-agent/ (LLM Powered Autonomous Agents)


Implementation

I ran this in Colab, but running it locally works just as well.

!pip install langchain-community langchain-chroma langchain-openai bs4

import bs4  # BeautifulSoup4: library for parsing HTML and XML
from langchain import hub  # LangChain Hub client, used to pull shared prompts
from langchain_chroma import Chroma  # Chroma vector store
from langchain_openai import ChatOpenAI  # OpenAI chat model
from langchain_openai import OpenAIEmbeddings  # OpenAI embeddings
from langchain_community.document_loaders import WebBaseLoader  # loader for web documents
from langchain_text_splitters import RecursiveCharacterTextSplitter  # text splitting tool

Preparing the OpenAI API

This step fetches the OpenAI API key.

I used gpt-4o-mini. At the time of writing (early 2025), OpenAI positions it as its most cost-efficient small model.

from google.colab import userdata

openai_api_key = userdata.get('openai_api_key')
llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key)  # initialize the OpenAI chat model (via LangChain)
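Since `google.colab` is only importable inside Colab, a local run needs another source for the key. A minimal sketch, assuming the key is exported locally as the `OPENAI_API_KEY` environment variable; `load_openai_api_key` is a hypothetical helper, not part of any library.

```python
import os

def load_openai_api_key(secret_name="openai_api_key"):
    """Try Colab secrets first, then fall back to the OPENAI_API_KEY environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(secret_name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible.
        # Swallowing every exception is a shortcut that keeps this sketch small.
        pass
    return os.environ.get("OPENAI_API_KEY")

api_key = load_openai_api_key()
print("key found:", api_key is not None)
```

The returned value can then be passed as `api_key=` exactly as in the code above.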

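Loading the web document

"""
WebBaseLoader is a utility provided by LangChain for loading web-based data.

Key features:
- Parses HTML content with BeautifulSoup.
- Can selectively load only the HTML elements you need, e.g. by class name or tag.
- Accepts one or more URLs in the web_paths parameter, so data can be pulled from several pages.
"""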

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),  # URL of the page to crawl
    bs_kwargs=dict(  # BeautifulSoup settings for extracting only the HTML elements we want
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")  # Note: must match the structure of the target page.
        )
    ),
)
docs = loader.load()  # load the document data (runs the crawl)

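(How to find the right class_ when crawling)

1. Check the page source in the browser: open the developer tools and look up the relevant HTML tags and class names.

2. Parse the page with BeautifulSoup and inspect the classes: load the crawled page into BeautifulSoup and use find_all() to check the tags and class names.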
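The second approach can be sketched as below. To keep the snippet dependency-free it swaps in the standard library's `html.parser` instead of BeautifulSoup, and it runs on a made-up `sample_html` string rather than the real page. The class `post-footer` is invented for illustration. With BeautifulSoup, a call such as `soup.find_all(class_=True)` should give a similar listing.

```python
from collections import Counter
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Counts every (tag, class name) pair seen while parsing."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    self.classes[(tag, cls)] += 1

# Made-up markup that mimics a blog post layout.
sample_html = """
<article>
  <header class="post-header"><h1 class="post-title">Title</h1></header>
  <div class="post-content"><p>Body text</p></div>
  <footer class="post-footer">Footer</footer>
</article>
"""

collector = ClassCollector()
collector.feed(sample_html)
for (tag, cls), count in sorted(collector.classes.items()):
    print(f"<{tag} class='{cls}'> x{count}")
```

From a listing like this you would pick the classes that wrap the article body and pass them to `SoupStrainer`.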

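Splitting the docs and storing them in the VectorDB

# Split the text into appropriately sized chunks to improve retrieval accuracy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum number of characters per chunk
    chunk_overlap=200  # characters shared between neighboring chunks
)

# Split the loaded documents
splits = text_splitter.split_documents(docs)

# Create the vector store (embed the documents and store them in Chroma)
vectorstore = Chroma.from_documents(
    documents=splits,  # the split documents
    embedding=OpenAIEmbeddings(api_key=openai_api_key)  # use OpenAI embeddings
)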

print(splits[0], '\n\n===============\n===============\n\n', splits[1])

Output

page_content='LLM Powered Autonomous Agents Date: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian Weng Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver. Agent System Overview# In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, co ...
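To get a feel for what chunk_size and chunk_overlap do, here is a deliberately simplified fixed-window splitter. It is not the algorithm RecursiveCharacterTextSplitter uses (that one prefers to cut at separators such as paragraph and line breaks), and `split_with_overlap` is a hypothetical helper. It only shows how neighboring chunks share their boundary text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap characters with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is why a non-zero chunk_overlap tends to help retrieval.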

Defining the retriever and the formatting function + pulling the prompt from LangChain Hub

# Initialize a retriever that can search the vector store
retriever = vectorstore.as_retriever()

# Format the retrieved documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = hub.pull("rlm/rag-prompt")  # pull the pre-defined RAG prompt from LangChain Hub
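What format_docs produces is easy to check in isolation. In this small sketch, `FakeDocument` is a hypothetical stand-in that exposes only the `page_content` attribute the function reads; it is not LangChain's Document class.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    # Same logic as above: join the chunk texts with a blank line between them.
    return "\n\n".join(doc.page_content for doc in docs)

docs = [FakeDocument("First chunk."), FakeDocument("Second chunk.")]
context = format_docs(docs)
print(context)
```

The blank line between chunks keeps them visually separate once they are pasted into the prompt's Context slot.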

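Receiving the user question + prompt.invoke()

LangChain's invoke() method

invoke() is one of the core methods that does the main work in LangChain.
As the name suggests, it runs (invokes) the object it is called on.
That is, it performs a task on a model or resource based on the given input (a prompt, a query, etc.) and returns the result.
- LLM: the AI model generates a response based on the input text.
- Retrievers: run a search based on the user's input query.
- Prompts: build a prompt from a given context and question.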
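The idea of one shared method across different components can be illustrated with toy classes. Everything below (`ToyPrompt`, `ToyRetriever`, `ToyLLM`) is hypothetical and far simpler than LangChain's real objects. The point is only that each one takes an input through invoke() and returns a result.

```python
class ToyPrompt:
    """Fills a template with the given variables."""
    def __init__(self, template):
        self.template = template

    def invoke(self, variables):
        return self.template.format(**variables)

class ToyRetriever:
    """Returns the chunks that share at least one word with the query."""
    def __init__(self, chunks):
        self.chunks = chunks

    def invoke(self, query):
        words = set(query.lower().split())
        return [c for c in self.chunks if words & set(c.lower().split())]

class ToyLLM:
    """Pretend model: just reports how much text it received."""
    def invoke(self, prompt_text):
        return f"(echo) received {len(prompt_text)} characters"

retriever = ToyRetriever([
    "planning splits a task into steps",
    "reflection reviews past actions",
])
prompt = ToyPrompt("Question: {question}\nContext: {context}")
llm = ToyLLM()

question = "what is reflection"
retrieved = retriever.invoke(question)
prompt_text = prompt.invoke({"question": question, "context": "\n\n".join(retrieved)})
answer = llm.invoke(prompt_text)
print(answer)
```

Because every component speaks the same invoke() interface, the output of one can be handed straight to the next, which is the pattern the following code uses.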

user_msg = "What is the Self-Reflection?"  # user question
retrieved_docs = retriever.invoke(user_msg)  # retrieve documents based on the user question

# Build the prompt from the retrieved documents (as context) and the user question
user_prompt = prompt.invoke({"context": format_docs(retrieved_docs), "question": user_msg})
print(user_prompt)

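Prompt structure

Looking at the final prompt output, it is composed of "(RAG pre-defined prompt) + (user prompt) + (Context)".

When the user question changes, the VectorDB retrieves different reference documents, so the document content included in the prompt changes as well.

For example, if the question is 'What is Task Decomposition?', the following prompt is printed.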

messages=[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. (pre-defined prompt)

Question: What is Task Decomposition? (user prompt)

Context: (Context) Fig. 1. Overview of a LLM-powered autonomous agent system.

Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
...
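To make the three parts concrete, the sketch below takes prompt text laid out like the output above and splits it back into its pieces. It assumes the "Question:" and "Context:" markers each start on a new line. `split_prompt` is a hypothetical helper and the sample text is shortened by hand.

```python
def split_prompt(text):
    """Split a RAG prompt into its instruction, question, and context parts."""
    instructions, _, rest = text.partition("\nQuestion: ")
    question, _, context = rest.partition("\nContext: ")
    return {
        "instructions": instructions.strip(),
        "question": question.strip(),
        "context": context.strip(),
    }

sample = (
    "You are an assistant for question-answering tasks. "
    "If you don't know the answer, just say that you don't know.\n"
    "Question: What is Task Decomposition?\n"
    "Context: Fig. 1. Overview of a LLM-powered autonomous agent system."
)

parts = split_prompt(sample)
for name, value in parts.items():
    print(f"[{name}] {value}")
```

Only the question and context parts change from one query to the next. The instruction part is fixed by the prompt pulled from the hub.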

llm.invoke() and its output

Run the LLM on the prompt built above.

response = llm.invoke(user_prompt)
print(response.content)

Output:

Self-reflection involves presenting two-shot examples to a Large Language Model (LLM), where each example consists of a failed trajectory and an ideal reflection for improving future plans. Reflections from these examples are then integrated into the agent's working memory, facilitating better context for subsequent queries. This process aims to guide the agent in avoiding past mistakes and enhancing planning efficiency.


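We can confirm that it answered correctly!

What happens if we ask for something that isn't in the document?

How will the model answer 'Who is Elon Musk?', a question about information that isn't in the reference docs?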

user_msg = "Who is the Elon Musk?" 
retrieved_docs = retriever.invoke(user_msg) 

user_prompt = prompt.invoke({"context": format_docs(retrieved_docs), "question": user_msg})
print(user_prompt)

Output:

messages=[HumanMessage(content=" ...

Question: Who is the Elon Musk?

Context: Fig. 10. A picture of a sea otter using rock to crack open a seashell, while floating in the water. While some other animals can use tools, the complexity is not comparable with humans. (Image source: Animals using tools)

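(An unrelated passage was selected as the Context)

First of all, as shown here, the Context selection in the VectorDB already fails to work properly.

When I asked about Task Decomposition and Self-Reflection, which are covered in the document, the vector search ran on those questions and brought back related content. When I asked about Elon Musk, however, unrelated content was selected.

Task Decomposition: a document with details on task decomposition was selected.
Self-Reflection: a document on the Reflexion framework, which relates to self-reflection, was selected.
Elon Musk: content unrelated to Elon Musk (tool use by animals) was selected → the search result is not appropriate.

The reason is that a VectorDB computes the similarity between the vector representation of the user's query and those of the stored documents (or text chunks) and returns the most similar ones. In other words, the search in a VectorDB fundamentally relies on 'semantic' similarity between the query and the documents. It always returns the nearest chunks, even when nothing in the store is actually about the query, so a question with no matching content still gets some Context.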
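This behavior is easy to reproduce with hand-made vectors. The sketch below is an illustration only: the three-dimensional vectors, the `search` helper, and the 0.5 cutoff are all invented, and real embeddings have far more dimensions. It shows that a plain nearest-neighbor search returns something for an off-topic query, while a minimum-score filter returns nothing. LangChain retrievers offer a comparable option via `as_retriever(search_type="similarity_score_threshold", ...)`, which this post does not use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, store, k=1, min_score=None):
    """Return the top-k (score, text) pairs, optionally dropping those below min_score."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in store), reverse=True)
    top = scored[:k]
    if min_score is not None:
        top = [(score, text) for score, text in top if score >= min_score]
    return top

# Invented vectors: axis 0 ~ "planning", axis 2 ~ "animals".
store = [
    ("Task decomposition splits work into steps.", [0.9, 0.1, 0.0]),
    ("Sea otters use rocks as tools.", [0.1, 0.2, 0.9]),
]
on_topic = [1.0, 0.0, 0.0]   # a query close to the first chunk
off_topic = [0.0, 1.0, 0.2]  # a query close to nothing in the store

print(search(on_topic, store))                   # best match, high score
print(search(off_topic, store))                  # still returns a chunk, with a low score
print(search(off_topic, store, min_score=0.5))   # [] once a cutoff is applied
```

With a cutoff in place, an empty result can be treated as "no reference found" before the LLM is ever called.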

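What is the output for a question with no reference?

The pre-defined LangChain prompt states that the model should say it doesn't know when the answer isn't in the reference. Did it actually follow this rule?

We can see that it honestly responded with "I don't know."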
