[Hands-on] How to search case-insensitively (lowercase)

author

JSCODE 박재성

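✅ How to search case-insensitively

Creating an index with a Custom Analyzer

When you create an index without any analysis settings, text fields use the standard analyzer. This time, instead of relying on that default, we will create an index with a Custom Analyzer, in which each component of the Analyzer is configured explicitly.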


// Delete the existing index
DELETE /products

// Create the index, define the mapping, and apply the Custom Analyzer
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "products_name_analyzer": {
          "char_filter": [],
          "tokenizer": "standard",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "products_name_analyzer"
      }
    }
  }
}

// Check that the index was created correctly
GET /products

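In the products index, the name field uses a Custom Analyzer named products_name_analyzer. This Analyzer is configured with only the standard tokenizer (which splits a string on whitespace and on punctuation such as ,, ., !, and ?). The lowercase token filter (which converts tokens to lowercase) is deliberately left out.

Inserting data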


POST /products/_create/1
{
  "name": "Apple 2025 맥북 에어 13 M4 10코어"
}

Searching


GET /products/_search
{
  "query": {
    "match": {
      "name": "apple"
    }
  }
}

Running the query above returns no hits.


GET /products/_search
{
  "query": {
    "match": {
      "name": "Apple"
    }
  }
}

Searching again with Apple instead of apple returns the document we just stored. Let's use the Analyze API to debug why.

Using the Analyze API


// Analyze the text with the Analyzer applied to a field of a specific index
GET /products/_analyze
{
  "field": "name",
  "text": "Apple 2025 맥북 에어 13 M4 10코어"
}

Response


{
  "tokens": [
    {
      "token": "Apple",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    ...
  ]
}

The response shows that the Analyzer produces the token Apple, not apple, so Apple is what gets stored in the inverted index. That is why the search for apple found nothing.
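
To confirm this from the query side, the same Analyze API can be run on the search term itself. This is a small sketch that reuses the products index and name field from above; only the lowercase text apple is new here. Since this Analyzer has no lowercase token filter, the query term should come back unchanged as apple, which does not equal the indexed token Apple.

```console
// Analyze the search term with the same Analyzer the name field uses
GET /products/_analyze
{
  "field": "name",
  "text": "apple"
}
```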

Recreating the index with a Custom Analyzer

This time, to make search case-insensitive, let's create the index with the lowercase token filter added.


// Delete the existing index
DELETE /products

// Create the index, define the mapping, and apply the Custom Analyzer
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "products_name_analyzer": {
          "char_filter": [],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "products_name_analyzer"
      }
    }
  }
}

// Check that the index was created correctly
GET /products
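
Before indexing any documents, the tokenizer and token filter combination can also be tried ad hoc, without referring to an index. The following is a minimal sketch, assuming the Analyze API is called with an inline tokenizer and filter list that mirror the settings above. With the lowercase token filter in place, the first token should come back as apple rather than Apple.

```console
// Try the standard tokenizer plus the lowercase token filter without an index
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Apple 2025 맥북 에어 13 M4 10코어"
}
```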

Inserting data


POST /products/_create/1
{
  "name": "Apple 2025 맥북 에어 13 M4 10코어"
}

Searching


GET /products/_search
{
  "query": {
    "match": {
      "name": "apple"
    }
  }
}

GET /products/_search
{
  "query": {
    "match": {
      "name": "Apple"
    }
  }
}

Both queries now return the document. To see why, let's use the Analyze API again.

Using the Analyze API


// Analyze the text with the Analyzer applied to a field of a specific index
GET /products/_analyze
{
  "field": "name",
  "text": "Apple 2025 맥북 에어 13 M4 10코어"
}

Response


{
  "tokens": [
    {
      "token": "apple",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    ...
  ]
}

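The response shows that the token is now apple, so that is what gets stored in the inverted index. That explains why the search for apple found the document. But why did the search for Apple find it as well?

When a document is indexed, the Analyzer splits the string into tokens and builds the inverted index from them. The Analyzer also runs at search time: it splits the search string into tokens in the same way before looking them up.

As a result, even if you enter Apple as the search term, the lowercase token filter turns it into apple before the search runs. That is why the document was returned even though we searched for Apple.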
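
One way to see the effect of search-time analysis is to compare against a query that skips it. The sketch below assumes the products index from above and uses a term query, which looks up the given term as-is without running it through the Analyzer. Under that assumption, searching for Apple should return no hits because the inverted index only contains apple, while searching for apple should return the document.

```console
// term query: the search term is not analyzed, so "Apple" is looked up as-is
GET /products/_search
{
  "query": {
    "term": {
      "name": "Apple"
    }
  }
}

// The lowercase form matches the token stored in the inverted index
GET /products/_search
{
  "query": {
    "term": {
      "name": "apple"
    }
  }
}
```

The match query used earlier differs only in that it analyzes the search term first, which is why it finds the document for either spelling.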

✅ Summary

In short, adding the lowercase token filter to the Analyzer is what makes it possible to search the data regardless of case.

👨🏻‍🏫

In the next lecture, we will look at another feature of the Analyzer.
