AWS Step Functions: Integrating MongoDB Atlas, Twilio,& AWS Simple Email Service - Part 2

Raphael Londner
May 17, 2017 | Updated: June 26, 2020
#aws #Technical #Cloud

This is Part 2 of the AWS Step Functions overview post published a few weeks ago. If you want to get more context on the sample application business scenario, head back to read Part 1. In this post, you’ll get a deep dive into the application’s technical details. As a reference, the source code of this sample app is available on GitHub.

Setting up the Lambda functions

The screenshot above is the graphical representation of the state machine we will eventually be able to test and run. But before we get there, we need to set up and publish the 4 Lambda functions this Step Functions state machine relies on. To do so, clone the AWS Step Functions with MongoDB GitHub repository and follow the instructions in the Readme file to create and configure these Lambda functions.

If you have some time to dig into their respective codebases, you'll realize they're all made up of just a few lines, making it simple to embed Twilio, AWS and MongoDB APIs in your Lambda function code. In particular, I would like to point out the concise code the Get-Restaurants lambda function uses to query the MongoDB Atlas database:

db.collection('restaurants').aggregate(
  [
    { $match: 
      { 
        "address.zipcode": jsonContents.zipcode, 
        "cuisine": jsonContents.cuisine, 
        "name": new RegExp(jsonContents.startsWith) 
     } 
   },
   { $project: 
      { 
        "_id": 0, 
        "name": 1, 
        "address.building": 1, 
        "address.street": 1, 
        "borough": 1, 
        "address.zipcode": 1, 
        "healthScoreAverage": 
          { $avg: "$grades.score" }, 
        "healthScoreWorst": 
          { $max: "$grades.score" } 
      } 
    }
  ]
)

The code snippet above is a simple yet powerful example of aggregation framework queries using the $match and $project stages along with the $avg and $max accumulator operators. In a nutshell, this aggregation filters the restaurants dataset by 3 properties (zip code, cuisine, and name) in the $match stage, returns a subset of each restaurant’s properties (to minimize the bandwidth usage and query latency), and computes the maximum and average values of health scores obtained by each restaurant (over the course of 4 years) in the $project stage. This example shows how you can very easily replace SQL clauses (such as WHERE(), MAX() and AVG()) using MongoDB’s expressive query language.

Creating the Step Functions state machine

Once you are done with setting up and configuring these Lambda functions, it's time to finally create our Step Functions state machine.

AWS created a JSON-based declarative language called the Amazon States Language, fully documented on the Amazon States Language specification page. A Step Functions state machine is essentially a JSON file whose structure conforms to this new Amazon States Language. While you don’t need to read its whole specification to understand how it works, I recommend reading the AWS Step Functions Developer Guide to understand its main concepts and artifacts.

For now, let's go ahead and create our WhatsThisRestaurantAgain state machine. Head over to the Create State Machine page in AWS Step Functions and give your new state machine a name (such as WhatsThisRestaurantAgain).

Next, copy and paste the following JSON document (also available on GitHub) into the Code text editor (at the bottom of the Create State Machine page):

{
    "Comment": "A state machine showcasing the use of MongoDB Atlas to notify a user by text message or email depending on the number of returned restaurants",
    "StartAt": "GetRestaurants",
    "States": {
        "GetRestaurants": {
            "Type": "Task",
            "Resource": "",
            "ResultPath": "$.restaurants",
            "Next": "CountItems"
        },
        "CountItems": {
            "Type": "Task",
            "Resource": "",
            "InputPath": "$.restaurants",
            "ResultPath": "$.count",
            "Next": "NotificationMethodChoice"
        },
        "NotificationMethodChoice": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.count",
                    "NumericGreaterThan": 1,
                    "Next": "SendByEmail"
                },
                {
                    "Variable": "$.count",
                    "NumericLessThanEquals": 1,
                    "Next": "SendBySMS"
                }
            ],
            "Default": "SendByEmail"
        },
        "SendByEmail": {
            "Type": "Task",
            "Resource": "",
            "End": true
        },
        "SendBySMS": {
            "Type": "Task",
            "Resource": "",
            "End": true
        }
    }
}

Once you’re done pasting this JSON document, press the Refresh button of the Preview section right above the Code editor and... voilà! The state machine now shows up in its full, visual glory:

We’re not quite done yet. But before we complete the last steps to get a fully functional Step Functions state machine, let me take a few minutes to walk you through some of the technical details of my state machine JSON file.

Note that 4 states are of type "Task" but that their Resource attributes are empty. These 4 "Task" states represent the calls to our 4 Lambda functions and should thus reference the ARNs (Amazon Resource Names) of our Lambda functions. You might think you have to get these ARNs one by one—which might prove to be tedious—but don't be discouraged; AWS provides a neat little trick to get these ARNs automatically populated!

Simply click inside the double quotes for each Resource attribute and the following drop-down list should appear (if it doesn't, make sure you are creating your state machine in the same region as your Lambda functions):

AWS Step Functions Code Editor - Lambda Functions ARN Dropdown List

Once you have filled out the 4 empty Resource attributes with their expected values, press the Create State Machine button at the bottom. Last, select the IAM role that will execute your state machine (AWS should have conveniently created one for you) and press OK:

On the page that appears, press the New execution button:

Enter the following JSON test document (with a valid emailTo field) and press Start Execution:

{
    "startsWith": "M",
    "cuisine": "Italian",
    "zipcode": "10036",
    "phoneTo": "+15555555555",
    "firstnameTo": "Raphael",
    "emailTo": "raphael@example.com",
    "subject": "List of restaurants for {{firstnameTo}}",
}

If everything was properly configured, you should get a successful result, similar to the following one:

If you see any red boxes (in lieu of a green one), check CloudWatch where the Lambda functions log their errors. For instance, here is one you might get if you forgot to update the emailTo field I mentioned above:

And that's it (I guess you can truly say we’re "done done" now)! You have successfully built and deployed a fully functional cloud workflow that mashes up various API services thanks to serverless functions.

For those of you who are still curious, read on to learn how that sample state machine was designed and architected.

Design and architecture choices

Let's start with the state machine design:

The GetRestaurants function queries a MongoDB Atlas database of restaurants using some search criteria provided by our calling application, such as the restaurant's cuisine type, its zip code and the first few letters of the restaurant's name. It retrieves a list of matching restaurants and passes that result to the next function (CountItems). As I pointed out above, it uses MongoDB's aggregation framework to retrieve the worst and average health score granted by New York's Health Department during its food safety inspections. That data provides the end user with information on the presumed cleanliness and reliability of the restaurant she intends to go to. Visit the aggregation framework documentation page to learn more about how you can leverage it for advanced insights into your data.
The CountItems method counts the number of the restaurants; we'll use this number to determine how the requesting user is notified.
If we get a single restaurant match, we'll send the name and address of the restaurant to the user's cell phone using the SendBySMS function.
However, if there's more than one match, it's probably more convenient to display that list in a table format. As such, we'll send an email to the user using the SendByEmail method.

At this point, you might ask yourself: how is the data passed from one lambda function to another?

As it turns out, the Amazon States Language provides developers with a flexible and efficient way of treating inputs and outputs. By default, the output of a state machine function becomes the input of the next function. That doesn't exactly work well for us since the SendBySMS and SendByEmail methods must know the user's cell phone number or email address to properly work. An application that would like to use our state machine would have no choice but to pass all these parameters as a single input to our state machine, so how do we go about solving this issue?

Fortunately for us, the Amazon States Language has the answer: it allows us to easily append the result of a function to the input it received and forward the concatenated result to the next function. Here's how we achieved this with our GetRestaurants function:

"GetRestaurants": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
    "ResultPath": "$.restaurants",
    "Next": "CountItems"
}

Note the ResultPath attribute above where we instruct Step Functions to append the result of our GetRestaurants task (an array of matching restaurants) to the input it received, whose structure is the test JSON document I mentioned above (duplicated here for reading convenience):

{
    "startsWith": "M",
    "cuisine": "Italian",
    "zipcode": "10036",
    "phoneTo": "+15555555555",
    "firstnameTo": "Raphael",
    "emailTo": "raphael@example.com",
    "subject": "List of restaurants for {{firstnameTo}}"
}

This input contains all the information my state machine might need, from the search criteria (startsWith, cuisine, and zipcode), to the user's cell phone number (if the state machine ends up using the SMS notification method), first name, email address and email subject (if the state machine ends up using the email notification method).

Thanks to the ResultPath attribute we set on the GetRestaurants task, its output has a structure similar to the following JSON document (additional data in bold):


{
  "firstnameTo": "Raphael",
  "emailTo": "raphael@example.com",
  "subject": "List of restaurants for {{firstnameTo}}",
 <strong>"restaurants": [
  {
    "address": {
      "building": "235-237",
      "street": "West 48 Street"
    },
    "borough": "Manhattan",
    "name": "La Masseria"
  },
  {
    "address": {
      "building": "315",
      "street": "West 48 Street"
    },
    "borough": "Manhattan",
    "name": "Maria'S Mont Blanc Restaurant"
  },
  {
    "address": {
      "building": "654",
      "street": "9 Avenue"
    },
    "borough": "Manhattan",
    "name": "Cara Mia"
  }
]</strong>
}

As expected, the restaurants sub-document has been properly appended to our original JSON input. That output becomes by default the input for the CountItems method. But, we don't want that function to have any dependency on the input it receives. Since it's a helper function, we might want to use it in another scenario where the input structure is radically different. Once again, the Amazon States Language comes to the rescue with the optional InputPath parameter. Let's take a closer look at our CountItems task declaration in the state machine’s JSON document:

"CountItems": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
    "InputPath": "$.restaurants",
    "ResultPath": "$.count",
    "Next": "NotificationMethodChoice"
}

By default, the InputPath value is the whole output of the preceding task (GetRestaurants in our state machine). The Amazon States Language allows you to override this parameter by explicitly setting it to a specific value or sub-document. As you can see in the JSON fragment above, this is exactly what I have done to only pass an array of JSON elements to the CountItems Lambda function (in my case, the array of restaurants we received from our previous GetRestaurants function), thereby making it agnostic to any JSON schema. Conversely, the result of the CountItems task is stored in a new count attribute that serves as the input of the NotificationMethodChoice choice state that follows:

"NotificationMethodChoice": {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.count",
            "NumericGreaterThan": 1,
            "Next": "SendByEmail"
        },
        {
            "Variable": "$.count",
            "NumericLessThanEquals": 1,
            "Next": "SendBySMS"
        }
    ],
    "Default": "SendByEmail"
}

The logic here is fairly simple: if the restaurants count is greater than one, the state machine will send an email message with a nicely formatted table of the restaurants to the requesting user’s email address. If only one restaurant is returned, we’ll send a text message to the user’s phone number (using Twilio’s SMS API) since it’s probably faster and more convenient for single row results (especially since the user might be on the move while requesting this piece of information). Note that my JSON "code" actually uses the NumericLessThanEquals operator to trigger the SendBySMS task and not the Equals operator as it really should. So technically speaking, even if no result is returned from the GetRestaurants task, the state machine would still send a text message to the user with no restaurant information whatsoever! I’ll leave it up to you to fix this intentional bug.

Next steps

In this post, I showed you how to create a state machine that orchestrates calls to various cloud services and APIs using a fictitious restaurant search and notification scenario. I hope you enjoyed this tutorial explaining how to deploy and test that state machine using the AWS console. Last, I went through various design and architecture considerations, with a focus on data flow abilities available in Step Functions.

If you haven’t done so already, sign up for MongoDB Atlas and create your free M0 MongoDB cluster in minutes.
Next, you can get more familiar with AWS Lambda development and deployment by following our 101 Lambda tutorial.
If you already have some experience with AWS Lambda, Developing a Facebook Chatbot with AWS Lambda and MongoDB Atlas will walk through a richer use case.
As a last step, you might be interested in Step Functions integration with API Gateway to learn how to call a state machine from an external application.

About the Author - Raphael Londner

Raphael Londner is a Principal Developer Advocate at MongoDB, focused on cloud technologies such as Amazon Web Services, Microsoft Azure and Google Cloud Engine. Previously he was a developer advocate at Okta as well as a startup entrepreneur in the identity management space. You can follow him on Twitter at @rlondner.

← Previous

Deep Learning and the Artificial Intelligence Revolution: Part 3

Welcome to part 3 of our 4-part blog series. In part 1 we looked at the history of AI, and why it is taking off now In part 2 , we discussed the differences between AI, Machine Learning, and Deep Learning In today’s part 3, we’ll dive deeper into deep learning and evaluate key considerations when selecting a database for new projects We’ll wrap up in part 4 with a discussion on why MongoDB is being used for deep learning, and provide examples of where it is being used If you want to get started right now, download the complete Deep Learning and Artificial Intelligence white paper. What is Deep Learning? Deep learning is a subset of machine learning that has attracted worldwide attention for its recent success solving particularly hard and large-scale problems in areas such as speech recognition, natural language processing, and image classification. Deep learning is a refinement of ANNs, which, as discussed earlier, “loosely” emulate how the human brain learns and solves problems. Before diving into how deep learning works, it’s important to first understand how ANNs work. ANNs are made up of an interconnected group of neurons, similar to the network of neurons in the brain. Figure 1: The Neuron Model At a simplistic level, a neuron in a neural network is a unit that receives a number of inputs (xi), performs a computation on the inputs, and then sends the output to other nodes or neurons in the network. Weights (wj), or parameters, represent the strength of the input connection and can be either positive or negative. The inputs are multiplied by the associated weights (x1w1, x2w2,..) and the neuron adds the output from all inputs. The final step is that a neuron performs a computation, or activation function. The activation function (sigmoid function is popular) allows an ANN to model complex nonlinear patterns that simpler models may not represent correctly. Figure 2: Neural Network Diagram Figure 2 represents a neural network. The first layer is called the input layer and is where features (x1, x2, x3) are input. The second layer is called the hidden layer. Any layer that is not an input or output layer is a hidden layer. Deep learning was originally coined because of the multiple levels of hidden layers. Networks typically contain more than 3 hidden layers, and in some cases more than 1,200 hidden layers. What is the benefit of multiple hidden layers? Certain patterns may need deeper investigation that can be surfaced with the additional hidden layers. Image classification is an area where deep learning can achieve high performance on very hard visual recognition tasks – even exceeding human performance in certain areas. Let’s illustrate this point with an example of how additional hidden layers help perform facial recognition. Figure 3: Deep Learning Image Recognition When a picture is input into a deep learning network, it is first decomposed into image pixels. The algorithm will then look for patterns of shapes at certain locations in the image. The first hidden layer might try to uncover specific facial patterns: eyes, mouth, nose, ears. Adding an additional hidden layer deconstructs the facial patterns into more granular attributes. For example, a “mouth” could be further deconstructed into “teeth”, “lips”, “gums”, etc. Adding additional hidden layers can devolve these patterns even further to recognize the subtlest nuances. The end result is that a deep learning network can break down a very complicated problem into a set of simple questions. The hidden layers are essentially a hierarchical grouping of different variables that provide a better defined relationship. Currently, most deep learning algorithms are supervised; thus, deep learning models are trained against a known truth. How Does Training Work? The purpose of training a deep learning model is to reduce the cost function, which is the discrepancy between the expected output and the real output. The connections between the nodes will have specific weights that need to be refined to minimize the discrepancy. By modifying the weights, we can minimize the cost function to its global minimum, which means we’ve reduced the error in our model to its lowest value. The reason deep learning is so computationally intensive is that it requires finding the correct set of weights within millions or billions of connections. This is where constant iteration is required as new sequences of weights are tested repeatedly to find the point where the cost function is at its global minimum. Figure 4: Deep Learning Cost Function A common technique in deep learning is to use backpropagation gradient descent. Gradient descent emerged as an efficient mathematical optimization that works effectively with a large number of dimensions (or features) without having to perform brute force dimensionality analysis. Gradient descent works by computing a gradient (or slope) in the direction of the function global minimum based on the weights. During training, weights are first randomly assigned, and an error is calculated. Based on the error, gradient descent will then modify the weights, backpropagate the updated weights through the multiple layers, and retrain the model such that the cost function moves towards the global minimum. This continues iteratively until the cost function reaches the global minimum. There may be instances where gradient descent resolves itself at a local minimum instead of global minimum. Methods to mitigate this issue is to use a convex function or generate more randomness for the parameters. Database Considerations with Deep Learning Non-relational databases have played an integral role in the recent advancement of the technology enabling machine learning and deep learning. The ability to collect and store large volumes of structured and unstructured data has provided deep learning with the raw material needed to improve predictions. When building a deep learning application, there are certain considerations to keep in mind when selecting a database for management of underlying data. Flexible Data Model . In deep learning there are three stages where data needs to be persisted – input data, training data, and results data. Deep learning is a dynamic process that typically involves significant experimentation. For example, it is not uncommon for frequent modifications to occur during the experimentation process – tuning hyperparameters, adding unstructured input data, modifying the results output – as new information and insights are uncovered. Therefore, it is important to choose a database that is built on a flexible data model, avoiding the need to perform costly schema migrations whenever data structures need to change. Scale . One of the biggest challenges of deep learning is the time required to train a model. Deep learning models can take weeks to train – as algorithms such as gradient descent may require many iterations of matrix operations involving billions of parameters. In order to reduce training times, deep learning frameworks try to parallelize the training workload across fleets of distributed commodity servers. There are two main ways to parallelize training: data parallelism and model parallelism. Data parallelism . Splitting the data across many nodes for processing and storage, enabled by distributed systems such as Apache Spark, MongoDB, and Apache Hadoop Model parallelism . Splitting the model and its associated layers across many nodes, enabled by software libraries and frameworks such as Tensorflow, Caffe, and Theano. Splitting provides parallelism, but does incur a performance cost in coordinating outputs between different nodes In addition to the model’s training phase, another big challenge of deep learning is that the input datasets are continuously growing, which increases the number of parameters to train against. Not only does this mean that the input dataset may exceed available server memory, but it also means that the matrices involved in gradient descent can exceed the node’s memory as well. Thus, scaling out, rather than scaling up, is important as this enables the workload and associated dataset to be distributed across multiple nodes, allowing computations to be performed in parallel. Fault Tolerance . Many deep learning algorithms use checkpointing as a way to recover training data in the event of failure. However, frequent checkpointing requires significant system overhead. An alternative is to leverage the use of multiple data replicas hosted on separate nodes. These replicas provide redundancy and data availability without the need to consume resources on the primary node of the system. Consistency . For most deep learning algorithms it is recommended to use a strong data consistency model. With strong consistency each node in a distributed database cluster is operating on the latest copy of the data, though some algorithms, such as Stochastic Gradient Descent (SGD), can tolerate a certain degree of inconsistency. Strong consistency will provide the most accurate results, but in certain situations where faster training time is valued over accuracy, eventual consistency is acceptable. To optimize for accuracy and performance, the database should offer tunable consistency. Wrapping Up Part 3 That wraps up the third part of our 4-part blog series. We’ll conclude in part 4 with a discussion on why MongoDB is being used for deep learning, and provide examples of where it is being used Remember, if you want to get started right now, download the complete the complete Deep Learning and Artificial Intelligence white paper.

May 15, 2017

Next →

Data and the European Landscape: 3 Trends for 2022

The past two years have brought massive changes for IT leaders: large and complex cloud migrations; unprecedented numbers of people suddenly working, shopping and learning from home; and a burst in demand for digital-first experiences. Like everyone else, we are hoping that 2022 isn’t so disruptive (fingers crossed!), our customer conversations in Europe do lead us to believe the new year will bring new business priorities. We’re already noticing changes in conversations around vendor lock-in, thanks to the Digital Markets Act, a new enthusiasm for combining operational and analytical data to drive new insights faster, and a more strategic embrace of sustainability. Here’s how we see these trends playing out in 2022. Digital markets act draws new attention to cloud vendor lock-in in Europe We’ve heard plenty about the European Commission’s Digital Markets Act , which, in the name of ensuring fair and open digital markets, would place new restrictions on companies that are deemed to be digital “gatekeepers” in the region. That discussion will be nothing compared to the vigorous debate we expect once the EU begins the very tricky political business of determining exactly which companies will fall under the act. If the EU sets the bar for revenues, users, and market size high enough, it’s possible that the regulation will end up affecting only Facebook, Amazon, Google, Apple, and Microsoft. But a European group representing 2,500 CIOs and almost 700 organizations is now pushing to have the regulation encompass more software companies. Their main concern centers around “distorted competition” in cloud infrastructure services and a worry that companies are being locked into one cloud vendor. A trend that will likely increase in 2022 that pushes back on cloud vendor lock-in is embracing multi-cloud strategies. We should expect to see more organisations in the region pursuing multi-cloud environments as a means to improve business continuity and agility whilst being able to access best of breed services from each cloud provider. As we have always said …”it’s fine to date your cloud provider….but don’t ever marry them.” The convergence of operational and analytical data The processing of operational and analytical data is almost always contained in different data systems, each tuned to that use case and managed by separate teams. But because that data lives in separate places, it’s almost impossible for organisations to generate insights and automate actions in real time, against live data. We believe 2022 is the year we’ll see a critical mass of companies in the region make significant progress toward a convergence of their operational and analytical data. We’re already starting to see some of the principles of microservices in operational applications, such as domain ownership, be applied to analytics as well. We’re hearing about this from so many of our customers locally, who are looking at MongoDB as an application data platform that allows them to perform queries across both real-time and historical data, using a unified platform and a single query API. This results in the applications they are building becoming more intelligent and contextual to their users, while avoiding dependencies on centralized analytics teams that otherwise slow down how quickly new, data-driven experiences can be released. Sustainability drives local strategic IT choice Technology always has some environmental cost. Sometimes that’s obvious — such as the energy needs and emissions associated with Bitcoin mining. More often, though, the environmental costs are well hidden. The European Green Deal commits the European Union to reducing emissions by 55% by 2030, with a focus on sustainable industry. With the U.N. Climate Change Conference (COP26) recently completed in Glasgow, and coming off the hottest European summer on record, climate issues have become top of mind. That means our customers are increasingly looking to make their technical operations more sustainable — including in their choice of cloud provider and data centers. According to research from IDC , more than 20% of CxOs say that sustainability is now important in selecting a strategic cloud service provider, and some 29% of CxOs are including sustainability into their RFPs for cloud services. Most interesting, 26% say they are willing to switch to providers with better sustainability credentials. Historically, it’s been difficult to make a switch like that. That’s part of the reason we built MongoDB Atlas — to give our customers the flexibility to run in any region , with any of the three largest cloud providers, and to make it easy to switch between them, and even to run a single database cluster across them. Publicly available information about the footprint of individual regions and even single data centers will make it simpler for companies to make informed decisions. Already, at least one cloud platform has added indicators to regions with the lowest carbon footprint. Source: IDC, European Customers Engage Services Providers at All Stages of Their Cloud Journey, IDC Survey Spotlight, Doc #EUR248484021, Dec 2021

December 21, 2021