Retail Reference Architecture Part 1: Building a Flexible, Searchable, Low-Latency Product Catalog

MongoDB
May 4, 2015 | Updated: May 7, 2018
#Technical

Series:

Building a Flexible, Searchable, Low-Latency Product Catalog
Approaches to Inventory Optimization
Query Optimization and Scaling
Recommendations and Personalizations

Product catalog data management is a complex problem for retailers today. After years of relying on multiple monolithic, vendor-provided systems, retailers are now reconsidering their options and looking to the future.

In today’s vendor-provided systems, product data must frequently be moved back and forth using ETL processes to ensure all systems are operating on the same data set. This approach is slow, error prone, and expensive in terms of development and management. In response, retailers are now making data services available individually as part of a centralized service-oriented architecture (SOA).

This is a pattern that we commonly see at MongoDB, so much so that we’ve begun to define some best practices and a reference architecture specifically targeted at the retail space. As part of that effort, today we’ll look at implementing a catalog service using MongoDB, the first part of our series on retail architecture.

Why MongoDB?

Many different database types are able to fulfill our product catalog use case, so why choose MongoDB?

Document flexibility: Each MongoDB document can store data represented as rich JSON structures. This makes MongoDB ideal for storing just about anything, including very large catalogs with thousands of variants per item.
Dynamic schema: JSON structures within each document can be altered at any time, allowing for increased agility and easy restructuring of data when needs change. In MongoDB, these multiple schemas can be stored within a single collection and can use shared indexes, allowing for efficient searching of both old and new formats simultaneously.
Expressive query language: The ability to perform queries across many document attributes simplifies many tasks. This can also improve application performance by lowering the required number of database requests.
Indexing: Powerful secondary, compound and geo-indexing options are available in MongoDB right out of the box, quickly enabling features like sorting and location-based queries.
Data consistency: By default, all reads and writes are sent to the primary member of a MongoDB replica set. This ensures strong consistency, an important feature for retailers, who may have many customers making requests against the same item inventory.
Geo-distributed replicas: Network latency due to geographic distance between a data source and the client can be problematic, particularly for a catalog service which would be expected to sustain a large number of low-latency reads. MongoDB replica sets can be geo-distributed, so that they are close to users for fast access, mitigating the need for CDNs in many cases.

These are just a few of the characteristics of MongoDB that make it a great option for retailers. Next, we’ll take a look at some of the specifics of how we put some of these to use in our retail reference architecture to support a number of features, including:

Searching for items and item variants
Retrieving per store pricing for items
Enabling catalog browsing with faceted search

Item Data Model

The first thing we need to consider is the data model for our items. In the following examples we are showing only the most important information about each item, such as category, brand and description:

{
	"_id": "30671", //main item ID
	"department": "Shoes",
	"category": "Shoes/Women/Pumps",
	"brand": "Calvin Klein",
	"thumbnail": "http://cdn.../pump.jpg",
	"title": "Evening Platform Pumps",
	"description": "Perfect for a casual night out or a formal event.",
	"style": "Designer",
	…
}

This simple data model lets us query for items by the most commonly requested criteria. For example, db.collection.findOne returns a single document that satisfies a query, while db.collection.find returns every matching document:

Get item by ID
db.definition.findOne({_id: "30671"})
Get items for a set of product IDs
db.definition.find({_id: {$in: ["30671", "452318"]}})
Get items by category prefix
db.definition.find({category: /^Shoes\/Women/})

Notice how the second and third queries used the $in operator and a regular expression, respectively. When performed on properly indexed documents, MongoDB is able to provide high throughput and low latency for these types of queries.
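As a minimal sketch of how an application might build these filters safely, the helpers below construct the $in filter and an anchored prefix expression from user-supplied values. The function names (escapeRegExp, categoryPrefixFilter, itemsByIdFilter) are hypothetical and not part of MongoDB or the reference architecture. Anchoring the pattern with ^ matters because a case-sensitive prefix expression is the form of regular expression that can make efficient use of an index on category.

```javascript
// Escape characters that have special meaning in a regular expression,
// so that a category path is always matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter that matches every category under the given path.
// The leading ^ anchors the pattern to the start of the string.
function categoryPrefixFilter(prefix) {
  return { category: new RegExp("^" + escapeRegExp(prefix)) };
}

// Build a filter for a set of item IDs using the $in operator.
function itemsByIdFilter(ids) {
  return { _id: { $in: [...ids] } };
}

const filter = categoryPrefixFilter("Shoes/Women");
console.log(filter.category.test("Shoes/Women/Pumps")); // true
console.log(filter.category.test("Bags/Shoes/Women")); // false
console.log(JSON.stringify(itemsByIdFilter(["30671", "452318"])));
```

The resulting objects can be passed directly to find in the shell or a driver.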

Variant Data Model

Another important consideration for our product catalog is item variants, such as available sizes, colors, and styles. Our item data model above captures only a small amount of the data about each catalog item. So what about all of the available item variations we may need to retrieve, such as size and color?

One option is to store an item and all its variants together in a single document. This approach has the advantage of retrieving an item and all of its variants in a single query, and if the number of variants and their associated data is small, it may make sense to store them in the item document. However, it is not the best approach in all cases, because it is an important best practice to avoid unbounded document growth.

Another option is to create a separate variant data model that can be referenced relative to the primary item:

{
	"_id": "93284847362823", //variant sku
	"itemId": "30671", //references the main item
	"size": 6.0,
	"color": "red",
	…
}

This data model allows us to do fast lookups of specific item variants by their SKU number:

db.variation.find({_id: "93284847362823"})

As well as all variants for a specific item by querying on the itemId attribute:

db.variation.find({itemId: "30671"}).sort({_id: 1})

In this way, we maintain fast queries on both our primary item for displaying in our catalog, as well as every variant for when the user requests a more specific product view. We also ensure a predictable size for the item and variant documents.
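To make the separation concrete, here is a small sketch of splitting a product that arrives with embedded variants into one item document and one document per variant. The function name splitItemAndVariants, the input shape with a variants array, and the second SKU are assumptions made for illustration.

```javascript
// Split a product with embedded variants into one item document and one
// document per variant. Each variant points back to its parent via itemId.
function splitItemAndVariants(product) {
  const { variants = [], ...item } = product;
  const variantDocs = variants.map(({ sku, ...attributes }) => ({
    _id: sku, // the SKU doubles as the variant's primary key
    itemId: item._id, // reference to the main item
    ...attributes,
  }));
  return { item, variantDocs };
}

const { item, variantDocs } = splitItemAndVariants({
  _id: "30671",
  title: "Evening Platform Pumps",
  variants: [
    { sku: "93284847362823", size: 6.0, color: "red" },
    { sku: "93284847362824", size: 6.5, color: "red" }, // made-up SKU
  ],
});
console.log(item); // the item document, without the variants array
console.log(variantDocs); // one small document per SKU
```

Because every variant becomes its own document, the item document stays the same size no matter how many variants are added.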

Per Store Pricing

Another consideration when defining the reference architecture for our product catalog is pricing. We’ve now seen a few ways that the data model for our items can be structured to quickly retrieve items directly or based on specific attributes. Prices, however, can vary by many factors, like store location, and we need a way to quickly retrieve the specific price of any given item or item variant. This can be very problematic for large retailers, since a catalog with a million items and one thousand stores means we must query across a collection of a billion documents to find the price of any given item.

We could, of course, store the price for each variant as a nested document within the item document, but a better solution is to again take advantage of how quickly MongoDB is able to query on _id. For example, if each item in our catalog is referenced by an itemId, while each variant is referenced by a SKU number, we can set the _id of each document to be a concatenation of the itemId or SKU and the storeId associated with that price variant. Using this model, the _id for the pair of pumps and its red variant described above would look something like this:

Item: 30671_store23
Variant: 93284847362823_store23

This approach also provides a lot of flexibility for handling pricing, as it allows us to price items at the item or the variant level. We can then query for all prices or just the price for a particular location:

All prices: db.prices.find({_id: /^30671_/})
Store price: db.prices.find({_id: "30671_store23"})

We could even add other combinations, such as pricing per store group, and get all possible prices for an item with a single query by using the $in operator:

db.prices.find({_id: {$in: [	"30671_store23",
				"30671_sgroup12",
				"93284847362823_store23",
				"93284847362823_sgroup12" ]}})
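The key scheme can be sketched in a few lines of application code. Everything named here is hypothetical: the helpers priceId, candidatePriceIds, priceFilter, and resolvePrice, the price field on each price document, and in particular the precedence order (variant before item, store before store group), which is an assumption rather than something the reference architecture prescribes.

```javascript
// Compose the _id of a price document from an item ID or SKU and a
// pricing scope such as "store23" or "sgroup12".
function priceId(itemOrSku, scope) {
  return `${itemOrSku}_${scope}`;
}

// List every candidate price _id, ordered from most to least specific.
// The ordering is an assumed business rule.
function candidatePriceIds(itemId, sku, storeId, storeGroupId) {
  return [
    priceId(sku, storeId),
    priceId(sku, storeGroupId),
    priceId(itemId, storeId),
    priceId(itemId, storeGroupId),
  ];
}

// Build the $in filter that fetches all candidates in a single query.
function priceFilter(ids) {
  return { _id: { $in: ids } };
}

// Pick the most specific price among the documents the query returned.
function resolvePrice(candidateIds, priceDocs) {
  const byId = new Map(priceDocs.map((doc) => [doc._id, doc]));
  for (const id of candidateIds) {
    if (byId.has(id)) return byId.get(id).price;
  }
  return undefined; // no price found at any level
}

const ids = candidatePriceIds("30671", "93284847362823", "store23", "sgroup12");
// Stand-in for the result of db.prices.find(priceFilter(ids)); prices are made up.
const returned = [
  { _id: "30671_sgroup12", price: 149.95 },
  { _id: "30671_store23", price: 139.95 },
];
console.log(resolvePrice(ids, returned)); // 139.95
```

Since every lookup is an exact match on _id, the query never has to scan the prices of unrelated items or stores.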

Browse and Search Products

The biggest challenge for our product catalog is to enable browsing with faceted search. While many users will want to search our product catalog for a specific item or set of criteria, many others will want to browse, then narrow the returned results by any number of attributes. So given the need to create a page like this:

[Figure: sample catalog]

We have many challenges:

Response time: As the user browses, each page of results should return in milliseconds.
Multiple attributes: As the user selects different facets—e.g. brand, size, color—new queries must be run on multiple document attributes.
Variant-level attributes: Some user-selected attributes will be queried at the item level, such as brand, while others will be at the variant level, such as size.
Multiple variants: Thousands of variants can exist for each item, but we only want to display each item once, so results must be de-duplicated.
Sorting: The user needs to be allowed to sort on multiple attributes, like price and size, and that sorting operation must perform efficiently.
Pagination: Only a small number of results should be returned per page, which requires deterministic ordering.
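One common way to get deterministic ordering for pagination is to add _id as a tie-breaker to the sort and to fetch each page relative to the last document of the previous one (often called keyset or range-based pagination). This technique is not described in the article itself; the sketch below, including the helper name nextPageQuery, is an assumption about how an application might do it.

```javascript
// Build the filter, sort, and limit for the next page of a department listing.
// `last` is the final document of the previous page, or null for the first page.
function nextPageQuery(department, sortField, last, pageSize) {
  const filter = { department };
  if (last) {
    filter.$or = [
      // Anything that sorts strictly after the last document seen...
      { [sortField]: { $gt: last[sortField] } },
      // ...or ties on the sort field, broken by the unique _id.
      { [sortField]: last[sortField], _id: { $gt: last._id } },
    ];
  }
  return { filter, sort: { [sortField]: 1, _id: 1 }, limit: pageSize };
}

const firstPage = nextPageQuery("Shoes", "price", null, 20);
const secondPage = nextPageQuery("Shoes", "price", { _id: "30671", price: 149.95 }, 20);
console.log(JSON.stringify(firstPage));
console.log(JSON.stringify(secondPage));
```

In the shell, the three parts would be applied as find(filter).sort(sort).limit(limit). Because _id is unique, no two documents compare equal, so a document cannot appear on two pages or be skipped between them.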

Many retailers may want to use a dedicated search engine as the basis of these features. MongoDB provides an open source connector project, which allows the use of Apache Solr and Elasticsearch with MongoDB. For our reference architecture, however, we wanted to implement faceted search entirely within MongoDB.

To accomplish this, we create another collection that stores what we will call summary documents. These documents contain all of the information we need to do fast lookups of items in our catalog based on various search facets.

{
	"_id": "30671",
	"title": "Evening Platform Pumps",
	"department": "Shoes",
	"category": "Shoes/Women/Pumps",
	"price": 149.95,
	"attrs": [{"brand": "Calvin Klein"}, …],
	"sattrs": [{"style": "Designer"}, …],
	"vars": [
		{
			"sku": "93284847362823",
			"attrs": [{"size": 6.0}, {"color": "red"}, …],
			"sattrs": [{"width": 8.0}, {"heelHeight": 5.0}, …]
		}, … //Many more SKUs
	]
}

Note that in this data model we are defining attributes and secondary attributes. While a user may want to be able to search on many different attributes of an item or item variant, only a core set is used most frequently. For example, given a pair of shoes, it may be more common for a user to filter their search by available size than by heel height. By using both the attrs and sattrs fields in our data model, we are able to make all of these item attributes available to search, but incur only the expense of indexing the most used attributes by indexing only attrs.
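A rough sketch of how a summary document could be assembled from an item and its variants follows. The helper names (splitAttributes, buildSummary), the flat input shapes, and the choice of brand, size, and color as the frequently used facets are all assumptions made for illustration.

```javascript
// Split a flat attribute object into frequently filtered attributes (attrs)
// and secondary attributes (sattrs), each stored as single-key documents.
function splitAttributes(attributes, primaryKeys) {
  const attrs = [];
  const sattrs = [];
  for (const [key, value] of Object.entries(attributes)) {
    const target = primaryKeys.has(key) ? attrs : sattrs;
    target.push({ [key]: value });
  }
  return { attrs, sattrs };
}

// Assemble a summary document from an item and its variants.
function buildSummary(item, variants, primaryKeys) {
  const { _id, title, department, category, price, ...itemAttributes } = item;
  return {
    _id, title, department, category, price,
    ...splitAttributes(itemAttributes, primaryKeys),
    vars: variants.map(({ sku, ...variantAttributes }) => ({
      sku,
      ...splitAttributes(variantAttributes, primaryKeys),
    })),
  };
}

const PRIMARY = new Set(["brand", "size", "color"]); // assumed most-used facets
const summary = buildSummary(
  { _id: "30671", title: "Evening Platform Pumps", department: "Shoes",
    category: "Shoes/Women/Pumps", price: 149.95,
    brand: "Calvin Klein", style: "Designer" },
  [{ sku: "93284847362823", size: 6.0, color: "red", width: 8.0, heelHeight: 5.0 }],
  PRIMARY
);
console.log(JSON.stringify(summary, null, 2));
```

Changing which attributes count as primary only requires editing the set and rebuilding the summaries, which keeps the indexed portion of each document small.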

Using this data model, we would create compound indices on the following combinations:

department + attrs + category + _id
department + vars.attrs + category + _id
department + category + _id
department + price + _id
department + rating + _id

In these indices, we always start with department, and we assume users will choose the department to refine their search results. For a catalog without departments, we could just as easily have begun with another common facet like category or type. We can then perform the queries needed for faceted search and quickly return the results to the page:

Get summary from itemId
db.summary.find({_id: "30671"})
Get summary of specific item variant
db.summary.find({"vars.sku": "93284847362823"}, {"vars.$": 1})
Get summaries for all items by department
db.summary.find({department: "Shoes"})
Get summaries with a mix of parameters
db.summary.find({"department": "Shoes",
	"vars.attrs": {"color": "red"},
	"category": /^Shoes\/Women/})
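As a sketch of how a catalog page might turn a shopper's facet selections into a filter of this shape, consider the helper below. The function facetFilter and its parameter names are hypothetical. Note one limitation of this simple form: when several variant-level facets are selected, each clause may be satisfied by a different variant of the same item; requiring one variant to match all of them would need an operator such as $elemMatch, which this sketch leaves out.

```javascript
// Escape regular-expression metacharacters so the category path matches literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Translate the facets a shopper has selected into a query filter.
function facetFilter({ department, categoryPrefix, itemAttrs = {}, variantAttrs = {} }) {
  const filter = { department };
  if (categoryPrefix) {
    filter.category = new RegExp("^" + escapeRegExp(categoryPrefix));
  }
  const clauses = [];
  for (const [key, value] of Object.entries(itemAttrs)) {
    clauses.push({ attrs: { [key]: value } }); // item-level facet, e.g. brand
  }
  for (const [key, value] of Object.entries(variantAttrs)) {
    clauses.push({ "vars.attrs": { [key]: value } }); // variant-level facet, e.g. color
  }
  if (clauses.length > 0) filter.$and = clauses;
  return filter;
}

const selected = facetFilter({
  department: "Shoes",
  categoryPrefix: "Shoes/Women",
  itemAttrs: { brand: "Calvin Klein" },
  variantAttrs: { color: "red" },
});
console.log(selected);
```

Each summary document appears at most once in the result regardless of how many of its variants match, which takes care of de-duplicating items.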

Recap

We’ve looked at some best practices for modeling and indexing data for a product catalog that supports a variety of application features, including item and item variant lookup, store pricing, and catalog browsing using faceted search. Using these approaches as a starting point can help you find the best design for your own implementation.

Learn more

To discover how you can re-imagine the retail experience with MongoDB, read our white paper. In this paper, you'll learn about the new retail challenges and how MongoDB addresses them.

Learn more about how leading brands differentiate themselves with technologies and processes that enable the omni-channel retail experience.

Read our guide on the digitally oriented consumer

Read Part 2 >>
