The MongoDB Connector for Apache Spark can take advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL datastores that do not offer secondary indexes or in-database aggregations. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for data scientists and engineers.
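As a rough illustration of the filtering described above, the sketch below builds an aggregation pipeline that keeps only customers in one country and passes it to the connector, so the filter runs inside MongoDB rather than in Spark. The field names (`address.country`, `name`) and the `load_customers` helper are hypothetical. The option names assume the 10.x release of the connector; older releases use a different format name and pipeline option, so check the documentation for your version.

```python
import json


def build_region_pipeline(country):
    """Build an aggregation pipeline that keeps only customers in one country.

    The field names are hypothetical and only serve the example.
    """
    return [
        # $match runs inside MongoDB and can use a secondary index
        # on address.country, so non-matching documents never reach Spark.
        {"$match": {"address.country": country}},
        # $project trims each document to the fields the Spark job needs.
        {"$project": {"name": 1, "address.country": 1}},
    ]


def load_customers(spark, uri, database, collection, pipeline):
    """Return a DataFrame holding only the documents the pipeline produces.

    Assumption: connector 10.x option names. `spark` is an existing
    SparkSession with the connector package on its classpath.
    """
    return (
        spark.read.format("mongodb")
        .option("connection.uri", uri)
        .option("database", database)
        .option("collection", collection)
        .option("aggregation.pipeline", json.dumps(pipeline))
        .load()
    )


pipeline = build_region_pipeline("ES")
print(json.dumps(pipeline))
```

With a running Spark session, a call such as `load_customers(spark, uri, "crm", "customers", pipeline)` would return only the matching customers; the database and collection names here are placeholders.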
To maximize performance across large, distributed data sets, the MongoDB Connector for Apache Spark can co-locate Resilient Distributed Datasets (RDDs) with the source MongoDB node, thereby minimizing data movement across the cluster and reducing latency.
While MongoDB natively offers rich real-time analytics capabilities, there are use cases where integrating the Apache Spark engine can extend the processing of operational data managed by MongoDB. This allows users to operationalize results generated from Spark within real-time business processes supported by MongoDB.
China Eastern Airlines uses the MongoDB Connector for Apache Spark in its new fare calculation engine, serving 1.6 billion queries per day.
Qumram exposes user session data stored in MongoDB to Spark's machine learning processes to help global financial institutions detect fraud through behavioral analytics, and to apply deep learning techniques for sentiment analysis with Natural Language Processing.
Artificial intelligence personal assistant company x.ai uses MongoDB and Spark for distributed machine learning problems.
Stratio implemented its Pure Spark big data platform, combining MongoDB with Apache Spark, Zeppelin, and Kafka, to build an operational data lake for Mutua Madrileña, one of Spain’s largest insurance companies. Machine learning models are built to personalize the customer experience, with analysis of marketing campaign data to measure impact and improve performance.
A global airline has consolidated customer data scattered across more than 100 systems into a single view stored in MongoDB. Spark processes are run against the live operational data in MongoDB to update customer classifications and personalize offers in real time, as the customer is live on the web or speaking with the call center.
The MongoDB Spark Connector is available for download from GitHub.
Read our new whitepaper: Turning Analytics into Real Time Action with Apache Spark and MongoDB.
Read the MongoDB Spark Connector documentation here.