[Hands-on] Removing stop words (a, an, the, or, but) that are not needed for search (stop)

author

JSCODE 박재성

✅ Removing stop words (a, an, the, or, but, etc.) that are not needed for search (stop)

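Posts written in English contain many stop words, such as a, an, the, and or. These words carry little meaning on their own and are rarely used as search terms. Let's look at how to remove and manage stop words so that the inverted index is stored and used efficiently.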

Creating an index with a Custom Analyzer


// Delete the existing index
DELETE /boards

// Create the index, define the mapping, and apply the Custom Analyzer
PUT /boards
{
  "settings": {
    "analysis": {
      "analyzer": {
        "boards_content_analyzer": {
          "char_filter": [],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "boards_content_analyzer"
      }
    }
  }
}

// Check that the index was created correctly
GET /boards

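This Analyzer combines the standard tokenizer (splits text on word boundaries, which in practice means whitespace and punctuation such as , . ! ?) and the lowercase token filter (converts tokens to lowercase) with the stop token filter (removes stop words). When no stop word list is configured, the stop token filter uses the default English list.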
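To get a feel for what the three stages do, here is a minimal Python sketch of the pipeline. It is only a simulation, not Elasticsearch code: the `analyze` function is a hypothetical helper, and the regular expression is a rough assumption standing in for the standard tokenizer. The stop word set is the default English list.

```python
import re

# Default English stop word list, used by the `stop` filter
# when no `stopwords` setting is given.
ENGLISH_STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def analyze(text):
    """Rough stand-in for: standard tokenizer -> lowercase -> stop."""
    # Assumption: a run of word characters approximates one standard-tokenizer token.
    tokens = re.findall(r"\w+", text)
    # lowercase token filter
    tokens = [token.lower() for token in tokens]
    # stop token filter
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


print(analyze("The cat and the dog are friends."))  # ['cat', 'dog', 'friends']
```

The order of the filters matters here. The stop word list is all lowercase and, by default, the stop token filter matches case-sensitively, so "The" is only removed because lowercase runs first.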

Inserting data


POST /boards/_doc
{
  "content": "The cat and the dog are friends."
}

Running searches


GET /boards/_search
{
  "query": {
    "match": {
      "content": "the"
    }
  }
}

GET /boards/_search
{
  "query": {
    "match": {
      "content": "and"
    }
  }
}

GET /boards/_search
{
  "query": {
    "match": {
      "content": "are"
    }
  }
}

None of the queries above returns any documents. Let's use the Analyze API to find out why.

Using the Analyze API


GET /boards/_analyze
{
  "field": "content",
  "text": "The cat and the dog are friends."
}

Response


{
  "tokens": [
    {
      "token": "cat",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 16,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "friends",
      "start_offset": 24,
      "end_offset": 31,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

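The analyzer removed the stop words (the, and, are), which do not need to be stored as tokens, before writing to the inverted index. The match query also runs the search term through the same analyzer, so these words are dropped from the query as well. That is why searching for the, and, or are returned no documents.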
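The behavior can be reproduced with a toy inverted index. The sketch below is a simplified model under stated assumptions, not how Elasticsearch is implemented: `TinyIndex` and `analyze` are hypothetical names, and the tokenizer is approximated with a regular expression. The key point is that the same analysis runs at index time and at search time.

```python
import re
from collections import defaultdict

ENGLISH_STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def analyze(text):
    """Simplified tokenizer -> lowercase -> stop pipeline (an approximation)."""
    tokens = (word.lower() for word in re.findall(r"\w+", text))
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, content):
        # Index time: only the analyzed tokens are stored.
        for term in analyze(content):
            self.postings[term].add(doc_id)

    def match(self, query):
        # Search time: the query goes through the same analyzer.
        hits = set()
        for term in analyze(query):
            hits |= self.postings.get(term, set())
        return sorted(hits)


index = TinyIndex()
index.add(1, "The cat and the dog are friends.")

print(index.match("the"))  # [] because the query has no terms left
print(index.match("cat"))  # [1]
```

Searching for "the" fails twice over: the term was never stored, and the query itself is empty once it has been analyzed.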

✅ Summary

If the data will never be searched by stop words (a, the, are, is, etc.), use the stop token filter to keep the inverted index efficient.

When searching with stop words matters, as with song titles (e.g., Let It Be by the Beatles), do not apply the stop token filter.
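The song title case is easy to check with a small simulation. As before, this is a hedged sketch rather than Elasticsearch itself: `analyze` and its `use_stop_filter` flag are hypothetical, and the tokenizer is approximated with a regular expression.

```python
import re

ENGLISH_STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def analyze(text, use_stop_filter):
    """Simplified tokenizer -> lowercase pipeline with an optional stop filter."""
    tokens = [word.lower() for word in re.findall(r"\w+", text)]
    if use_stop_filter:
        tokens = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
    return tokens


title = "Let It Be"
print(analyze(title, use_stop_filter=True))   # ['let']
print(analyze(title, use_stop_filter=False))  # ['let', 'it', 'be']
```

With the stop filter applied, only "let" survives, so a search for "it be" would have nothing to match against. Without it, every word of the title remains searchable.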

👨🏻‍🏫

In the next lecture, we will look at another feature of the Analyzer.
