[Hands-on] Making Text That Mixes Korean and English Searchable

author

JSCODE 박재성

✅ Making Text That Mixes Korean and English Searchable

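Consider the sentence "오늘 영어 책에서 'It depends on the results.'이라는 문구를 봤다." (in English: "Today I saw the phrase 'It depends on the results.' in an English book."). We often mix Korean and English in a single piece of writing, so let's look at how text like this can be made searchable.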

Using the Analyze API to debug how a mixed Korean and English sentence is split into tokens


// Spell out the components of the Nori analyzer explicitly
GET /_analyze
{
  "text": "오늘 영어 책에서 'It depends on the results.'이라는 문구를 봤다.",
  "char_filter": [],
  "tokenizer": "nori_tokenizer",
  "filter": ["nori_part_of_speech", "nori_readingform", "lowercase"]
}

Response


{
  "tokens": [
    {
      "token": "영어",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "책",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "it",
      "start_offset": 11,
      "end_offset": 13,
      "type": "word",
      "position": 4
    },
    {
      "token": "depends",
      "start_offset": 14,
      "end_offset": 21,
      "type": "word",
      "position": 5
    },
    {
      "token": "on",
      "start_offset": 22,
      "end_offset": 24,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 25,
      "end_offset": 28,
      "type": "word",
      "position": 7
    },
    {
      "token": "results",
      "start_offset": 29,
      "end_offset": 36,
      "type": "word",
      "position": 8
    },
    {
      "token": "이",
      "start_offset": 38,
      "end_offset": 39,
      "type": "word",
      "position": 9
    },
    {
      "token": "문구",
      "start_offset": 42,
      "end_offset": 44,
      "type": "word",
      "position": 11
    },
    {
      "token": "보",
      "start_offset": 46,
      "end_offset": 47,
      "type": "word",
      "position": 13
    }
  ]
}

Even with the Nori analyzer alone, both the Korean and the English parts of the sentence are split into meaningful units.

Because the lowercase filter is applied, the English tokens are emitted in lowercase (It became it).
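The request above lists the tokenizer and filters one by one. As a point of comparison, here is a minimal sketch that calls the built-in nori analyzer by name. It assumes the analysis-nori plugin is installed; since the built-in analyzer is documented as nori_tokenizer followed by nori_part_of_speech, nori_readingform, and lowercase, it should produce the same tokens, but it is worth checking on your own cluster.

```
// Shorthand: use the built-in nori analyzer instead of listing its parts
GET /_analyze
{
  "text": "오늘 영어 책에서 'It depends on the results.'이라는 문구를 봤다.",
  "analyzer": "nori"
}
```

The explicit form is still the more useful starting point here, because extra token filters can be appended to it directly in the request.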

Two things are still lacking: English stopwords (it, on, the) are kept, and the English words are not reduced to their base form (depends, results). Let's fix this with the token filters covered earlier.

Adding token filters to the analyzer


// Nori analyzer components plus the stop and stemmer token filters
GET /_analyze
{
  "text": "오늘 영어 책에서 'It depends on the results.'이라는 문구를 봤다.",
  "char_filter": [],
  "tokenizer": "nori_tokenizer",
  "filter": ["nori_part_of_speech", "nori_readingform", "lowercase", "stop", "stemmer"]
}

Response


{
  "tokens": [
    {
      "token": "영어",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "책",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "depend",
      "start_offset": 14,
      "end_offset": 21,
      "type": "word",
      "position": 5
    },
    {
      "token": "result",
      "start_offset": 29,
      "end_offset": 36,
      "type": "word",
      "position": 8
    },
    {
      "token": "이",
      "start_offset": 38,
      "end_offset": 39,
      "type": "word",
      "position": 9
    },
    {
      "token": "문구",
      "start_offset": 42,
      "end_offset": 44,
      "type": "word",
      "position": 11
    },
    {
      "token": "보",
      "start_offset": 46,
      "end_offset": 47,
      "type": "word",
      "position": 13
    }
  ]
}

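With stop and stemmer added, the stopwords (it, on, the) are removed and the English words are reduced to their base form (depends became depend, results became result). Both filters default to English, so the Korean tokens pass through unchanged.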
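To see which filter is responsible for which change, the Analyze API also accepts an "explain" flag. The request below is a sketch of that debugging step: it reuses the same text and filter chain, and the response should then contain a "detail" section listing the tokens after the tokenizer and after each token filter. The exact shape of that output can vary between versions, so treat it as something to inspect rather than rely on.

```
// Same analysis chain, with per-stage output for debugging
GET /_analyze
{
  "text": "오늘 영어 책에서 'It depends on the results.'이라는 문구를 봤다.",
  "char_filter": [],
  "tokenizer": "nori_tokenizer",
  "filter": ["nori_part_of_speech", "nori_readingform", "lowercase", "stop", "stemmer"],
  "explain": true
}
```

Comparing the token list after lowercase with the one after stop, and then after stemmer, shows where it, on, and the disappear and where depends and results are shortened.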

✅ Summary

For text that mixes Korean and English, use the Nori analyzer as the base. From there, add character filters or token filters according to the characteristics of the field values.
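The Analyze API calls above only test an analysis chain; to use it for search, the same chain has to be registered on an index. The following is a minimal sketch of how that might look. The index name mixed_posts, the analyzer name korean_english_analyzer, and the field name content are hypothetical names chosen for illustration, and the analysis-nori plugin is assumed to be installed.

```
// Register the chain as a custom analyzer and apply it to a text field
PUT /mixed_posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean_english_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_part_of_speech", "nori_readingform", "lowercase", "stop", "stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "korean_english_analyzer"
      }
    }
  }
}

// Index the example sentence (refresh so it is searchable right away)
POST /mixed_posts/_doc?refresh=true
{
  "content": "오늘 영어 책에서 'It depends on the results.'이라는 문구를 봤다."
}

// Search with a different word form of an English term
GET /mixed_posts/_search
{
  "query": {
    "match": {
      "content": "results"
    }
  }
}
```

Because the same analyzer runs at index time and at search time, the query term results should be stemmed to result and match the stored token. Searching for a Korean term such as 문구 should match in the same way.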

category

Elasticsearch

createdAt

Dec 6, 2025 03:54 AM

series

실전에서 바로 써먹는 Elasticsearch 입문 (검색 최적화편)

📎

This post is part of the class materials for the course 실전에서 바로 써먹는 Elasticsearch 입문 (검색 최적화편).
