whoosh
Overview
pure-Python search engine
- Uses the Okapi BM25F ranking function
- No Java environment needed, unlike the headache of running Lucene
- All indexed text must be unicode
Glossary
- Analysis
- The process of breaking the text of a field into individual terms to be indexed. This consists of tokenizing the text into terms, and then optionally filtering the tokenized terms (for example, lowercasing and removing stop words). Whoosh includes several different analyzers.
- Corpus
- The set of documents you are indexing.
- Documents
- The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything – articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.
- Fields
- Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.
- Forward index
- A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.
- Indexing
- The process of examining documents in the corpus and adding them to the reverse index.
- Postings
- The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.
- Reverse index
- Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.
- Schema
- Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.
- Term vector
- A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.
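The glossary terms fit together roughly as in the sketch below. This is plain Python for illustration only, not Whoosh's implementation; the `analyze` function, the tiny stop-word list, and the two-document corpus are all made up for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "of", "and"}  # tiny illustrative list

def analyze(text):
    """Tokenize, lowercase, and drop stop words; return (position, term) pairs."""
    tokens = re.findall(r"\w+", text.lower())
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOP_WORDS]

# The corpus: document number -> text
corpus = {
    0: "The quick brown fox",
    1: "The lazy dog and the quick cat",
}

# Reverse index: term -> list of postings (docnum, frequency, positions)
reverse_index = defaultdict(list)
# Forward index: docnum -> {term: frequency}, similar in spirit to a term vector
forward_index = {}

for docnum, text in corpus.items():
    positions = defaultdict(list)
    for pos, term in analyze(text):
        positions[term].append(pos)
    forward_index[docnum] = {term: len(plist) for term, plist in positions.items()}
    for term, plist in positions.items():
        reverse_index[term].append((docnum, len(plist), plist))

print(reverse_index["quick"])  # [(0, 1, [1]), (1, 1, [5])]
print(forward_index[0])        # {'quick': 1, 'brown': 1, 'fox': 1}
```

Each tuple in `reverse_index["quick"]` is one posting: a document number plus the optional extra information (here, frequency and positions).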
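Schema design
Schema and fields
- The schema specifies the fields of the documents in an index.
- Each document can have multiple fields, such as title, content, url, and date.
- Some fields can be indexed, some stored, and some both.
- The schema is the set of all possible fields in a document; each individual document may use only a subset of the fields in the schema.
> e.g. When indexing email, the fields might be from_addr, to_addr, subject, body, and attachments.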
Built-in field types
Whoosh provides several built-in field types.
- whoosh.fields.TEXT
- For body text
- Indexes (and optionally stores) the text and stores term positions to allow phrase searching
- whoosh.fields.KEYWORD
- Space- or comma-separated keywords
- whoosh.fields.ID
- Simply indexes the entire value of the field as a single unit
- Recommended for values such as URLs, file paths, dates, and categories
- The default is stored=False, so use ID(stored=True) if you want the value stored
- whoosh.fields.STORED
- Stored with the document, but not indexed and not searchable
- whoosh.fields.NUMERIC
- int, long, or floating point numbers in a compact, sortable format
- Sortable
- whoosh.fields.DATETIME
- datetime objects
- Sortable
- whoosh.fields.BOOLEAN
- Boolean values (yes, no, true, false, 1, 0, t, f)
- whoosh.fields.NGRAM
- TBD
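To make the difference between ID, KEYWORD, and TEXT concrete, here is a rough sketch of what each type turns a value into at indexing time. It is plain Python written for illustration, not Whoosh code; the three function names are hypothetical.

```python
import re

def id_terms(value):
    """ID-style: the whole value is indexed as a single term."""
    return [value]

def keyword_terms(value, commas=False):
    """KEYWORD-style: split on commas, or on whitespace by default."""
    if commas:
        return [part.strip() for part in value.split(",") if part.strip()]
    return value.split()

def text_terms(value):
    """TEXT-style: tokenized terms with positions, which phrase search relies on."""
    return list(enumerate(re.findall(r"\w+", value.lower())))

print(id_terms("/a/b/c"))                             # ['/a/b/c']
print(keyword_terms("first short"))                   # ['first', 'short']
print(keyword_terms("new york, paris", commas=True))  # ['new york', 'paris']
print(text_terms("Hello World"))                      # [(0, 'hello'), (1, 'world')]
```

This is why a path or URL belongs in an ID field: splitting it into words would make an exact lookup by the full value impossible.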
Creating a schema
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer
schema = Schema(from_addr=ID(stored=True),
                to_addr=ID(stored=True),
                subject=TEXT(stored=True),
                body=TEXT(analyzer=StemmingAnalyzer()),
                tags=KEYWORD)
or
from whoosh.fields import SchemaClass, TEXT, KEYWORD, ID, STORED
class MySchema(SchemaClass):
    path = ID(stored=True)
    title = TEXT(stored=True)
    content = TEXT
    tags = KEYWORD
Modifying the schema after indexing
- Use add_field() and remove_field() on a writer
from whoosh import fields
writer = ix.writer()
writer.add_field("fieldname", fields.TEXT(stored=True))
writer.remove_field("content")
writer.commit()
Dynamic fields
Dynamic fields let you associate a field type with any field name that matches a given “glob” (a name pattern containing *, ?, and/or [abc] wildcards).
schema = fields.Schema(...)
# Any name ending in "_d" will be treated as a stored
# DATETIME field
schema.add("*_d", fields.DATETIME(stored=True), glob=True)
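The glob matching itself can be pictured with the standard library's `fnmatch` module. This is only a sketch of the idea, not how Whoosh resolves field names internally; `dynamic_fields`, `field_type_for`, and the `tag_?` pattern are hypothetical, and the field types are represented as plain strings.

```python
from fnmatch import fnmatchcase

# (glob pattern, field type) pairs, checked in order
dynamic_fields = [
    ("*_d", "DATETIME(stored=True)"),
    ("tag_?", "KEYWORD"),
]

def field_type_for(name, explicit_fields):
    """Prefer an explicitly named field, then fall back to the first matching glob."""
    if name in explicit_fields:
        return explicit_fields[name]
    for pattern, field_type in dynamic_fields:
        if fnmatchcase(name, pattern):
            return field_type
    return None  # no such field

explicit = {"title": "TEXT"}
print(field_type_for("title", explicit))      # TEXT
print(field_type_for("created_d", explicit))  # DATETIME(stored=True)
print(field_type_for("tag_1", explicit))      # KEYWORD
print(field_type_for("body", explicit))       # None
```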
Advanced schema setup
- Field boosts
schema = Schema(title=TEXT(field_boost=2.0), body=TEXT)
- Field types
- Formats
- Vectors
How to index documents
Creating an Index object
index.create_in
: creates an index; if the directory already contains an index, its current contents are cleared
import os, os.path
from whoosh import index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)
index.open_dir
: open an existing index
import whoosh.index as index
ix = index.open_dir("indexdir")
Indexing documents
Once you have an Index object, you can use an IndexWriter to add documents to the index.
ix = index.open_dir("index")
writer = ix.writer()
NOTE
- Because opening a writer locks the index for writing, in a multi-threaded or multi-process environment your code needs to be aware that opening a writer may raise an exception (whoosh.store.LockError) if a writer is already open. Whoosh includes a couple of example implementations (whoosh.writing.AsyncWriter and whoosh.writing.BufferedWriter) of ways to work around the write lock.
The IndexWriter’s add_document(**kwargs) method accepts keyword arguments where the field name is mapped to a value:
writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")
writer.commit()
NOTE
- You don't need to fill in every field.
Finishing adding documents
When you are done adding documents, call commit() on the IndexWriter object.
writer.commit()
To close the writer without committing, call cancel() instead of commit().
writer.cancel()
Merging segments
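Using a few segments is more efficient than rewriting the entire index every time you add a document.
However, searching many segments can slow searches down, so keep an eye on the segment count.
For that reason, Whoosh has an algorithm that merges small segments into larger ones each time you run commit().
If you don't want that: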
writer.commit(merge=False)
To merge all segments, use optimize to collapse the index into a single segment.
writer.commit(optimize=True)
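The trade-off between many small segments and one big one can be pictured with lists of documents. This is a deliberately simplified sketch: Whoosh's real merge policy is more involved, and `merge_small`, `optimize`, and the `max_small` threshold are hypothetical names chosen for the example.

```python
def merge_small(segments, max_small=2):
    """Merge all segments holding at most max_small documents into one new segment."""
    small = [seg for seg in segments if len(seg) <= max_small]
    large = [seg for seg in segments if len(seg) > max_small]
    if len(small) < 2:
        return segments  # nothing worth merging
    merged = [doc for seg in small for doc in seg]
    return large + [merged]

def optimize(segments):
    """Collapse every segment into a single one."""
    return [[doc for seg in segments for doc in seg]]

segments = [["doc1", "doc2", "doc3", "doc4"], ["doc5"], ["doc6"]]
print(merge_small(segments))  # [['doc1', 'doc2', 'doc3', 'doc4'], ['doc5', 'doc6']]
print(optimize(segments))     # [['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6']]
```

A search has to visit every segment, so fewer segments means fewer lookups per query, while merging costs time at commit.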
Deleting documents
After deleting documents, you must call commit().
delete_document(docnum)
is_deleted(docnum)
delete_by_term(fieldname, termtext)
delete_by_query(query)
# Delete document by its path -- this field must be indexed
writer = ix.writer()
writer.delete_by_term('path', u'/a/b/c')
# Save the deletion to disk
writer.commit()
Updating documents
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
schema = Schema(path=ID(unique=True), content=TEXT)
ix = index.create_in("index", schema)
writer = ix.writer()
writer.add_document(path=u"/a", content=u"The first document")
writer.add_document(path=u"/b", content=u"The second document")
writer.commit()
writer = ix.writer()
# Because "path" is marked as unique, calling update_document with path="/a"
# will delete any existing documents where the "path" field contains "/a".
writer.update_document(path=u"/a", content=u"Replacement for the first document")
writer.commit()
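Conceptually, updating is "delete whatever matches the unique field, then add". A minimal sketch of that behaviour on a plain list of dicts follows; the helper below only shares its name with the Whoosh method and is not Whoosh code.

```python
def update_document(docs, unique_field, **fields):
    """Drop documents whose unique field matches, then append the new document."""
    remaining = [d for d in docs if d.get(unique_field) != fields[unique_field]]
    remaining.append(fields)
    return remaining

docs = [
    {"path": "/a", "content": "The first document"},
    {"path": "/b", "content": "The second document"},
]
docs = update_document(docs, "path", path="/a",
                       content="Replacement for the first document")
print(docs)
# [{'path': '/b', 'content': 'The second document'},
#  {'path': '/a', 'content': 'Replacement for the first document'}]
```

If no existing document matches, the call simply behaves like adding a new document.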
Clearing the index
from whoosh import writing
with myindex.writer() as mywriter:
    # You can optionally add documents to the writer here
    # e.g. mywriter.add_document(...)
    # Using mergetype=CLEAR clears all existing segments so the index will
    # only have any documents you've added to this writer
    mywriter.mergetype = writing.CLEAR
How to search
Searcher object
Getting a whoosh.searching.Searcher object
searcher = myindex.searcher()
#or
with ix.searcher() as searcher:
    ...
# or
searcher = ix.searcher()
try:
    ...
finally:
    searcher.close()
methods
lexicon(fieldname)
>>> list(searcher.lexicon("content"))
[u"document", u"index", u"whoosh"]
search()
- The most important method
- Takes a whoosh.query.Query object and returns a Results object
from whoosh.qparser import QueryParser
qp = QueryParser("content", schema=myindex.schema)
q = qp.parse(u"hello world")
with myindex.searcher() as s:
    results = s.search(q)
limit
- Used with search()
- By default, search() returns the first 10 matching documents.
- To get more, pass limit:
results = s.search(q, limit=20)
search_page
- Retrieves a specific page of results
results = s.search_page(q, 1)
pagelen
- The default is 10.
results = s.search_page(q, 5, pagelen=20)
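Page numbers start at 1, so the page number and pagelen map onto a slice of the overall ranking as sketched below. `page_slice` is a hypothetical helper written for illustration, not part of Whoosh.

```python
def page_slice(pagenum, pagelen=10):
    """Return the (start, end) offsets of a 1-based page in the ranked results."""
    if pagenum < 1:
        raise ValueError("pagenum must be >= 1")
    start = (pagenum - 1) * pagelen
    return start, start + pagelen

print(page_slice(1))              # (0, 10)   -> hits 1 to 10
print(page_slice(5, pagelen=20))  # (80, 100) -> hits 81 to 100
```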
Results object
>>> results[0]
{"title": u"Hello World in Python", "path": u"/a/b/c"}
>>> results[0:2]
[{"title": u"Hello World in Python", "path": u"/a/b/c"},
{"title": u"Foo", "path": u"/bar"}]
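By default, Searcher.search(myquery) limits the number of scored hits to 10 (the limit argument), so the number of scored hits in the Results object may be less than the number of matching documents in the index. In other words, len(results) counts every matching document, while scored_length() counts only the hits that were actually scored and kept.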
>>> # How many documents in the entire index would have matched?
>>> len(results)
27
>>> # How many scored and sorted documents in this Results object?
>>> # This will often be less than len() if the number of hits was limited
>>> # (the default).
>>> results.scored_length()
10
- To avoid the delay that computing the exact length can cause on a large, heavy index, use the estimate methods:
found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()
    high = results.estimated_length()
    print("Scored", found, "of between", low, "and", high, "documents")
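The gap between "how many matched" and "how many were scored and kept" is what you get from any top-k selection. A rough sketch with `heapq` follows; the scores are made up and `search` here is a hypothetical helper, not the Whoosh method.

```python
import heapq

def search(scores, limit=10):
    """scores maps docnum -> score for every matching document."""
    top = heapq.nlargest(limit, scores.items(), key=lambda item: item[1])
    return {"total": len(scores), "scored": top}

# 27 matching documents with made-up scores
scores = {docnum: 1.0 / (docnum + 1) for docnum in range(27)}
result = search(scores)
print(result["total"])        # 27, like len(results)
print(len(result["scored"]))  # 10, like results.scored_length()
```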
Scoring and sorting
- Normally the list of result documents is sorted by score.
The whoosh.scoring module contains implementations of many scoring algorithms. The default is BM25F.
weighting
- Sets the scoring object to use.
from whoosh import scoring
with myindex.searcher(weighting=scoring.TF_IDF()) as s:
    ...
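For intuition about what a scoring object computes, here is a sketch of the textbook single-field BM25 term score. It is not Whoosh's BM25F code (BM25F additionally combines weights across fields), and the parameter values k1=1.2 and b=0.75 are commonly used defaults assumed for this example.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one term in one document with the textbook BM25 formula."""
    # Rare terms (low doc_freq) get a higher inverse document frequency
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates, and long documents are penalized through b
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

base = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, doc_freq=5, num_docs=1000)
more_hits = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, doc_freq=5, num_docs=1000)
long_doc = bm25_term_score(tf=1, doc_len=200, avg_doc_len=100, doc_freq=5, num_docs=1000)
common = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, doc_freq=500, num_docs=1000)

print(more_hits > base)  # True: more occurrences score higher
print(long_doc < base)   # True: the same hit in a longer document scores lower
print(common < base)     # True: a common term scores lower than a rare one
```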
Sorting
Mark the fields you want to sort by with sortable=True in the schema:
schema = fields.Schema(title=fields.TEXT(sortable=True),
content=fields.TEXT,
modified=fields.DATETIME(sortable=True)
)