Big Data processing and analytics is one of the key domains in corporate, research and academic settings. A number of technologies are directly associated with Big Data, including fault-tolerant clusters and servers, load balancing, NoSQL databases, NewSQL databases, high performance cloud applications and many others.
In database technology, NoSQL database systems are widely used in social media applications, where user interactions grow exponentially. To cope with applications that handle petabytes of data, high performance and fault-tolerant database systems are needed so that real-time applications can run without a single point of failure.
Nowadays, Web based applications use unstructured and heterogeneous data, which includes video, text, audio, live streams, wireless signals, images and many other types. Each type of content comes in a number of file formats. Video, for example, includes WEBM, MPEG, OGG, DRC, RM, MP4, NSV, AVI, MXF, 3GP, WMV, FLV and others. Similarly, there are many formats for images and graphics including EXIF, BPG, GIF, CGM, PNG, ARF, CPT, JPEG, WEBP, PCX, PNM, BMP, U3D, TIFF and many others.
The major issue is the compatibility of the database application with all these file formats across assorted domains. This is where the adoption, integration and implementation of NoSQL databases comes into play. NoSQL databases do not impose a fixed schema, so they can store and process such heterogeneous data more flexibly than a rigid relational design, often with better performance at scale.
There are a number of approaches to classifying NoSQL databases. A common one is based on the data model used to store and process the data:
Column: Accumulo, Druid, Cassandra, Vertica and HBase
Document: Apache CouchDB, Couchbase, Clusterpoint, DocumentDB, MongoDB, MarkLogic, OrientDB, Lotus Notes and RethinkDB
Key-Value: Redis, MemcacheDB, Dynamo, Riak, Aerospike, FairCom c-treeACE, FoundationDB, MUMPS and CouchDB
Graph: Allegro, MarkLogic, InfiniteGraph, OrientDB, Neo4J, Virtuoso and Stardog
Apache Cassandra
Apache Cassandra is a prominent, high performance open source database system, originally developed at Facebook, with distributed storage built in so that huge amounts of data can be processed. It is a NoSQL database system that offers a high degree of load balancing, fault tolerance and security, with no single point of failure.
Apache Cassandra runs on large numbers of commodity servers because it offers scalability, flexibility and robust support for clusters spread across multiple data centres. Its asynchronous, masterless replication gives low latency to applications and users alike.
Clustering in Cassandra NoSQL
There are four types of partitioners in Cassandra, and it is very important to select the appropriate one for a particular application. The partitioner decides how partition keys are mapped to tokens, and so how data is organised across the node ring of the Cassandra cluster.
- Random partitioner: The random partitioner (RP) distributes the key-value pairs evenly across the cluster for effective load balancing. It does this by applying an MD5 hash to each partition key and using the result as the token. It was the default partitioner before Cassandra 1.2.
- Order preserving partitioner (OPP): This type of partitioner keeps the keys in their natural sorted order, which is simple to reason about. The key advantage of this approach is that a range of adjacent keys can be read from a few nodes rather than from many nodes across the ring. The cost is that load can become uneven when keys are not uniformly distributed.
- Murmur3Partitioner: This has the same features as the random partitioner (RP) but uses the MurmurHash function, which is faster than MD5. It has been the default partitioner since Cassandra 1.2.
- Byte ordered partitioner: This is an order-preserving partitioner that treats the partition key as raw bytes and orders the keys lexicographically.
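The difference between hash-based and order-preserving placement can be illustrated with a small sketch. It is not Cassandra's implementation: the four node names and the evenly spaced token ranges are assumptions made for illustration, and MD5 from Python's hashlib simply stands in for the hashing done by a hash-based partitioner.

```python
import hashlib
from bisect import bisect_left

# Hypothetical four-node ring; the node names are illustrative only.
NODES = ["node-a", "node-b", "node-c", "node-d"]
RING_SIZE = 2 ** 128  # MD5 digests are 128 bits wide

# Each node owns the token range that ends at its own token (evenly spaced).
TOKENS = [(i + 1) * RING_SIZE // len(NODES) - 1 for i in range(len(NODES))]


def md5_token(key: str) -> int:
    """Hash a partition key to an integer token, as a hash partitioner would."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def byte_ordered_token(key: str) -> int:
    """Order-preserving token: the key's own bytes, left-aligned to 16 bytes."""
    raw = key.encode("utf-8")[:16].ljust(16, b"\x00")
    return int.from_bytes(raw, "big")


def node_for_token(token: int) -> str:
    """Return the first node on the ring whose token is >= the given token."""
    return NODES[bisect_left(TOKENS, token) % len(NODES)]
```

Calling node_for_token(md5_token(key)) spreads keys according to their hash, while node_for_token(byte_ordered_token(key)) keeps them in sorted order. In this sketch every key that starts with a lower-case ASCII letter lands on node-b under the byte-ordered scheme, which hints at why ordered partitioners tend to produce hot spots.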
The features and key points of Cassandra NoSQL are:
- Scalability
- Fault tolerance
- Tunable consistency
- Column-oriented database
- Distribution Design (DD) is based on Dynamo of Amazon
- Data Model (DM) is based on Bigtable of Google
- Implementation of Dynamo-styled model of replication without any single point of failure
- Powerful data model of Column Family
- Fast linear-scale performance
- Higher throughput
- Flexibility in data storage
- Atomic, isolated and durable writes at the row level, with tunable consistency, rather than full ACID (atomicity, consistency, isolation and durability) transactions
- Fast and effective writes to thousands of terabytes without compromising on other parameters
- Support for a number of language drivers including C, C++, Java, Ruby, Go, Python, C# and many others
- Strong developer community for troubleshooting
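The consistency and Dynamo-style replication points above can be made concrete with a little arithmetic. The sketch below assumes the usual quorum rule for replicated stores: with N replicas, a read from R replicas is guaranteed to see the latest write to W replicas when R + W > N. The function names are hypothetical and are not part of any driver.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at quorum level: a majority of them."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


# With three replicas, quorum reads and quorum writes always overlap,
# while reading and writing a single replica each does not.
print(quorum(3), overlaps(quorum(3), quorum(3), 3), overlaps(1, 1, 3))
```

Lower values of R and W trade this guarantee for lower latency, which is the sense in which consistency is tunable.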
Installation of the Cassandra driver with Python
Python is a powerful programming language, widely used for Big Data processing, NoSQL databases, cloud infrastructure and related high-performance computing applications.
A Cassandra driver for Python is available in the Python package repository for connecting to and interacting with the Cassandra engine.
The Cassandra driver can be installed directly from the Internet using Python's pip installer with the following command:
$ pip install cassandra-driver
Alternatively, the Cassandra driver for Python can be fetched from the live repository of Python packages http://pypi.python.org. The source archive can be downloaded and unpacked, and the following command is then run from inside the folder of the Cassandra driver:
$ python setup.py install
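After either installation route, it can be useful to confirm that the driver is visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name driver_available is hypothetical.

```python
import importlib.util


def driver_available(module_name: str = "cassandra") -> bool:
    """Return True if the named top-level module can be imported."""
    return importlib.util.find_spec(module_name) is not None


if driver_available():
    import cassandra
    print("cassandra-driver version:", cassandra.__version__)
else:
    print("cassandra-driver is not installed; run: pip install cassandra-driver")
```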
Creating a connection with Cassandra
First of all, an instance of Cluster is created before the execution of any query in Cassandra.
This can be done by using the following command:
from cassandra.cluster import Cluster
mycluster = Cluster()
With no arguments, Cluster() targets a Cassandra instance on the local machine (localhost or 127.0.0.1). The connection itself is opened when a session is created.
A list of IP addresses can be specified for the nodes in your cluster, as follows:
from cassandra.cluster import Cluster
mycluster = Cluster(['IP-Address-1', 'IP-Address-2'])
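When the node addresses come from a configuration file or an environment variable, they can be validated before being handed to the driver. The sketch below is one possible approach, not part of the driver: the helper names are hypothetical, it accepts only IP addresses (the driver itself also accepts host names), and it assumes the default native protocol port 9042.

```python
import ipaddress


def parse_contact_points(raw: str) -> list:
    """Split a comma-separated string into validated IP address strings."""
    points = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        # ip_address() raises ValueError for anything that is not IPv4/IPv6.
        points.append(str(ipaddress.ip_address(item)))
    return points


def make_cluster(raw_contact_points: str, port: int = 9042):
    """Build a Cluster from a comma-separated list of node addresses."""
    # Imported lazily so the parsing helper works without the driver installed.
    from cassandra.cluster import Cluster
    return Cluster(contact_points=parse_contact_points(raw_contact_points), port=port)
```

For example, make_cluster("10.0.0.1, 10.0.0.2") would build a cluster object for two nodes, with the addresses here being placeholders.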
To establish connections and execute queries, a session is created by using the following instructions:
mycluster = Cluster()
session = mycluster.connect()
The connect() function accepts the optional argument keyspace that is used to set the default keyspace for queries in that particular session:
mycluster = Cluster()
session = mycluster.connect('mydbkeyspace')
The keyspace of the session can be changed using set_keyspace() or with the execution of USE <keyspacename>. Unquoted keyspace names are not case-sensitive in CQL, so a lower-case name is used here:
session.set_keyspace('mykeyspace')
session.execute('USE mykeyspace')
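A keyspace must exist before a session can switch to it. The following sketch builds a CREATE KEYSPACE statement that could be passed to session.execute(). The helper name is hypothetical, and SimpleStrategy with a replication factor of 1 is an assumption suited only to a single-node test setup.

```python
import re


def create_keyspace_cql(keyspace: str, replication_factor: int = 1) -> str:
    """Return a CREATE KEYSPACE statement that uses SimpleStrategy."""
    # Keyspace names cannot be bound as query parameters, so validate them
    # before placing them in the statement text.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", keyspace):
        raise ValueError("invalid keyspace name: %r" % keyspace)
    return (
        "CREATE KEYSPACE IF NOT EXISTS %s "
        "WITH replication = {'class': 'SimpleStrategy', "
        "'replication_factor': %d}" % (keyspace, replication_factor)
    )


# Typical use with an open session (not executed here):
#   session.execute(create_keyspace_cql("mydbkeyspace"))
#   session.set_keyspace("mydbkeyspace")
print(create_keyspace_cql("mydbkeyspace"))
```

Unquoted identifiers are folded to lower case in CQL, so MyKeySpace and mykeyspace name the same keyspace unless the name is double-quoted.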
Execution of queries
The simplest method for the execution of a query is to call the execute() method. The example below assumes a table named users in the current keyspace:
rows = session.execute('SELECT name, age, email FROM users')
for myrecord in rows:
    print(myrecord.name, myrecord.age, myrecord.email)
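Values supplied by users should be passed to execute() as parameters rather than pasted into the query string. The sketch below assumes a hypothetical users table whose partition key is name; the function name find_user is also hypothetical.

```python
def find_user(session, name):
    """Return the first row matching name, or None if there is no match."""
    # %s is the driver's placeholder for simple statements, whatever the
    # column type; the values are passed separately as a tuple.
    rows = session.execute(
        "SELECT name, age, email FROM users WHERE name = %s", (name,)
    )
    for row in rows:
        return row
    return None
```

For queries that run many times, the driver also offers prepared statements through session.prepare(), which use ? as the placeholder.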
Programming Python – MongoDB
MongoDB is another excellent, cross-platform NoSQL database. It can be classified as a document-oriented database. At the time of writing, the latest release of MongoDB was 3.2, aimed at mission-critical deployments.
The key features of MongoDB include:
- Replication for high availability
- Load balancing
- MapReduce for aggregation functions
- JavaScript at the server side
The users of MongoDB include Adobe, McAfee, SAP, eBay, Craigslist, LinkedIn and many other corporate giants.
First, MongoDB and PyMongo are installed so that the connection of Python with MongoDB can be established.
In a Python shell, the following instruction is executed:
>>> import pymongo
If this import succeeds, a MongoClient can be created to connect to the running mongod instance.
The following code is used to connect to the default host and port from Python:
>>> from pymongo import MongoClient
>>> myMongoclient = MongoClient()
A specific host and port can be given explicitly as:
>>> myMongoclient = MongoClient('localhost', 27017)
Alternatively, the MongoDB URI format can be used as follows:
>>> myMongoclient = MongoClient('mongodb://localhost:27017')
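When the server requires authentication, the user name and password are placed in the URI and must be percent-escaped if they contain reserved characters such as @, : or /. The sketch below shows one way to build such a URI with the standard library. The helper name and the credentials are hypothetical.

```python
from urllib.parse import quote_plus


def mongo_uri(user, password, host="localhost", port=27017):
    """Build a mongodb:// URI with percent-escaped credentials."""
    return "mongodb://%s:%s@%s:%d" % (
        quote_plus(user), quote_plus(password), host, port
    )


# The result can be passed straight to MongoClient(...).
print(mongo_uri("app", "p@ss:w/rd"))
```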
A single MongoDB instance can host multiple independent databases. A database is accessed with attribute-style access on the MongoClient instance, as follows:
>>> myNoSQLdb = myMongoclient.MyTestDatabase
Representation of documents
Data in MongoDB is stored and represented as JSON-style documents. In PyMongo, dictionaries are used to represent documents.
The following dictionary is used to represent a blog post:
>>> import datetime
>>> mypost = {'author': 'BlogAuthorName',
...           'text': 'My Message',
...           'tags': ['NoSQL', 'MongoDB', 'BigData'],
...           'date': datetime.datetime.utcnow()}
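When many documents share the same shape, a small helper keeps them consistent. The following is a sketch only: the function name make_post is hypothetical, and it uses a timezone-aware UTC timestamp rather than utcnow(). The JSON dump is purely for display and is not how PyMongo encodes the document.

```python
import datetime
import json


def make_post(author, text, tags):
    """Return a blog post as a plain dictionary ready for insertion."""
    return {
        "author": author,
        "text": text,
        "tags": list(tags),
        # Timezone-aware UTC timestamp; PyMongo stores it as a BSON date.
        "date": datetime.datetime.now(datetime.timezone.utc),
    }


mypost = make_post("BlogAuthorName", "My Message", ["NoSQL", "MongoDB", "BigData"])
# json.dumps cannot serialise datetime objects by default, so str() is used
# as a fallback for the preview.
print(json.dumps(mypost, default=str, indent=2))
```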
Inserting a document
To insert a document into a MongoDB collection, the insert_one() method can be used, as follows:
>>> myposts = myNoSQLdb.posts
>>> mypostid = myposts.insert_one(mypost).inserted_id
>>> mypostid
After the insertion of the document, the posts collection is created on the server. This can be verified by listing all the collections in the database. Current PyMongo releases provide list_collection_names() for this, while older releases used collection_names():
>>> myNoSQLdb.list_collection_names()
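Several documents can be inserted in one round trip with insert_many(). The sketch below wraps that call in a hypothetical helper named insert_posts, which takes any PyMongo collection object.

```python
def insert_posts(collection, posts):
    """Insert several documents at once and return their generated ids."""
    if not posts:
        return []  # insert_many() rejects an empty list
    result = collection.insert_many(posts)
    return result.inserted_ids


# Typical use with the collection from the article (not executed here):
#   ids = insert_posts(myNoSQLdb.posts, [first_post, second_post])
```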
Fetching a document
To fetch a document, type the following (the u prefixes in the output appear under Python 2):
>>> myposts.find_one()
{u'date': datetime.datetime(...), u'text': u'My Message', u'_id': ObjectId('...'), u'author': u'BlogAuthorName', u'tags': [u'NoSQL', u'MongoDB', u'BigData']}
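To fetch only the documents that match a condition, a filter dictionary is passed to find(), which returns a cursor. The following sketch uses a hypothetical helper named posts_by_author, and the default limit of 10 is an arbitrary choice.

```python
def posts_by_author(collection, author, limit=10):
    """Return up to `limit` posts by the given author as a list of dicts."""
    # find() is lazy: documents are fetched as the cursor is consumed.
    cursor = collection.find({"author": author}).limit(limit)
    return list(cursor)


# Typical use with the collection from the article (not executed here):
#   for doc in posts_by_author(myNoSQLdb.posts, "BlogAuthorName"):
#       print(doc["text"])
```

The same filter syntax works with find_one(), for example myposts.find_one({"author": "BlogAuthorName"}).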
In a similar way, other operations can be implemented for big data processing and analytics.