This column explores Elasticsearch, a powerful, open source, distributed search tool that enables the retrieval of data through a simple search interface.
When you have a huge number of documents, wouldn't it be great if you could search them almost as well as you can with Google? Lucene (http://lucene.apache.org/) has been helping organisations search their data for years. Projects like Elasticsearch (http://www.elasticsearch.org/) are built on top of Lucene to provide distributed, scalable search over huge volumes of data. A good example is the use of Elasticsearch at WordPress (http://gibrown.wordpress.com/2014/01/09/scaling-elasticsearch-part-1-overview/).
In this experiment, you start with three nodes on OpenStack: h-mstr, h-slv1 and h-slv2 as in the previous article. Download the rpm package from the Elasticsearch site and install it on each of the nodes.
The configuration file is /etc/elasticsearch/elasticsearch.yml. You will need to configure it on each of the three nodes. Consider the following settings on the h-mstr node:
cluster.name: es
node.master: true
node.data: true
index.number_of_shards: 10
index.number_of_replicas: 0
We have given the name es to the cluster. The same value should be used on the h-slv1 and h-slv2 nodes. The h-mstr node will act as a master and store data as well. The master node processes a request by distributing the search to the data nodes and consolidating the results. The next two parameters relate to the index. The number of shards is the number of sub-indices that are created and distributed among the data nodes; the default value is 5. The number of replicas is the number of additional copies of each shard that are kept; the default value is 1, but we have set it to 0 so that no replicas are created.
You may use the same values on the h-slv1 and h-slv2 nodes, or set node.master to false on them. Once you have loaded the data, you will find that the h-mstr node has four shards while h-slv1 and h-slv2 have three shards each. The indices will be in the directory /var/lib/elasticsearch/es/nodes/0/indices/ on each node.
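For instance, the corresponding settings on the h-slv1 and h-slv2 nodes could look like the following (a sketch; cluster.name must be identical on all nodes, and node.master is set to false here only to keep them as pure data nodes):

cluster.name: es
node.master: false
node.data: true
index.number_of_shards: 10
index.number_of_replicas: 0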
You start Elasticsearch on each node by executing the following command:
$ sudo systemctl start elasticsearch
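If you would like the service to come up automatically on every boot as well (assuming a systemd-based distribution such as Fedora), enable it too:

$ sudo systemctl enable elasticsearch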
You can check the status of the cluster by browsing http://h-mstr:9200/_cluster/health?pretty.
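You can query the same endpoint with curl. The exact fields in the reply vary with the Elasticsearch version, but it will look roughly like the following, with a green (or yellow) status indicating a usable cluster:

$ curl 'http://h-mstr:9200/_cluster/health?pretty'
{
  "cluster_name" : "es",
  "status" : "green",
  "number_of_nodes" : 3,
  ...
}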
Loading the data
If you want to index the documents located on your desktop, you can use the Python client for Elasticsearch, which is available in the Fedora 20 repository. So, on your desktop, install:
$ sudo yum install python-elasticsearch
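Before indexing anything, you can verify that the client is able to reach the cluster with a couple of lines (a minimal sketch, assuming the h-mstr name resolves from your desktop):

#!/usr/bin/python
from elasticsearch import Elasticsearch

es = Elasticsearch(['h-mstr'])
# Prints basic cluster information if the connection succeeds
print es.info()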
The following is a sample program to index LibreOffice documents. The comments embedded in the code hopefully make it clear that this is not a complex task.
#!/usr/bin/python
import sys
import os
import subprocess
from elasticsearch import Elasticsearch

FILETYPES = ['odt', 'doc', 'sxw', 'abw']

# Convert a document file into a text file in /tmp
# and return the text file name
def convert_to_text(inpath, infile):
    subprocess.call(['soffice', '--headless', '--convert-to', 'txt:Text',
                     '--outdir', '/tmp', '/'.join([inpath, infile])])
    return '/tmp/' + infile.rsplit('.', 1)[0] + '.txt'

# Read the text file and return it as a string
def process_file(p, f):
    textfile = convert_to_text(p, f)
    return ' '.join([line.strip() for line in open(textfile)])

# Walk all files under a root path and select the document files
def get_documents(path):
    for curr_path, dirs, files in os.walk(path):
        for f in files:
            try:
                if f.rsplit('.', 1)[1].lower() in FILETYPES:
                    yield curr_path, f
            except IndexError:
                # the file has no extension; skip it
                pass

# Run this program with the root directory as its argument.
# If none is given, the current directory is used.
def main(argv):
    try:
        path = argv[1]
    except IndexError:
        path = '.'
    es = Elasticsearch(hosts='h-mstr')
    id = 0
    # index each document with 3 attributes:
    # path, title (the file name) and text (the text content)
    for p, f in get_documents(path):
        text = process_file(p, f)
        doc = {'path': p, 'title': f, 'text': text}
        id += 1
        es.index(index='documents', doc_type='text', id=id, body=doc)

if __name__ == "__main__":
    main(sys.argv)
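If you save the program as, say, index_documents.py (a file name chosen here purely for illustration), you can index everything under a directory as follows:

$ python index_documents.py ~/Documents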
Once the index is created, you cannot increase the number of shards. However, you can change the replication value as follows:
$ curl -XPUT 'h-mstr:9200/documents/_settings' -d '
{
  "index" : { "number_of_replicas" : 1 }
}'
Now, with one replica, there are 20 shards in all (10 primaries plus 10 copies), distributed as 7, 7 and 6 across the three nodes. As you would expect, if one of the nodes is down, you will still be able to search the documents. If more than one node is down, the search will return a partial result from the shards that are still available.
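If you are curious about where the individual shards have ended up, the _cat API (available in Elasticsearch 1.x and later) lists the allocation of each shard of the documents index:

$ curl 'h-mstr:9200/_cat/shards/documents?v'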
Searching the data
The program search_documents.py below uses the query_string option of Elasticsearch to search for the string passed as a parameter in the content field 'text'. It returns the fields 'path' and 'title' in the response, which are combined to print the full file names of the documents found.
#!/usr/bin/python
import sys
from elasticsearch import Elasticsearch

def main(query_string):
    es = Elasticsearch(['h-mstr'])
    query_body = {'query': {'query_string': {'default_field': 'text',
                                             'query': query_string}},
                  'fields': ['path', 'title']}
    # response is a dictionary with nested dictionaries
    response = es.search(index='documents', body=query_body)
    for hit in response['hits']['hits']:
        print '/'.join(hit['fields']['path'] + hit['fields']['title'])

# run the program with the search expression as a parameter
if __name__ == '__main__':
    main(' '.join(sys.argv[1:]))
You can now search using expressions like the following:
$ python search_documents.py smalltalk objects
$ python search_documents.py smalltalk AND objects
$ python search_documents.py +smalltalk objects
$ python search_documents.py +smalltalk +objects
More details on the query syntax can be found at the Lucene and Elasticsearch sites; for instance, a bare list of terms matches documents containing any of them, while a leading '+' makes a term mandatory.
Open source options let you build a custom, scalable search engine that can very conveniently include information from your databases, documents, emails, etc. Hence, it is a shame to come across sites that do not offer an easy way to search their content. One hopes that website managers will add that functionality using tools like Elasticsearch!