The Internet of Things (IoT) is an increasingly hot topic as more and more devices and services come to market. There are numerous examples of IoT apps in action today: wearable pedometers such as Fitbit track and publish our daily steps; networked seismometers track earthquakes and monitor fracking operations; power utilities track their smart grids; doctors monitor patient health remotely. You can even track the movements of tagged sharks on the popular site http://www.ocearch.org/. The number of examples is overwhelming and constantly growing. In fact, Gartner predicts there will be nearly 26 billion internet-connected devices by 2020.
Cardinal Solutions has been involved in several IoT projects in the healthcare and manufacturing industries. Although the individual business problems are different, there are numerous similarities in terms of solution architecture and data flow.
In this particular example, the client in question utilized networked devices to track compliance and avoid financial penalties associated with an out-of-compliance state. (I am being purposefully vague here due to the competitive nature of the business and the “first-mover” nature of the solution.) The purpose of this post is not to detail the business problem, but to discuss the architectural approach and best practices associated with building IoT applications.
In this example, our client actually had a (mostly) working solution in production that utilized a cloud-based architecture. However, performance was ironically a significant issue, especially as new customers were added. Each new customer meant more users and more data that would slow response times even more.
So let’s briefly discuss the overall solution at a high level. To track compliance there are some sensors involved (see Figure 1: Device Hardware – Sensors and Gateways). There are sensors that detect when employees enter and exit an area of interest. Within the area of interest there are sensors that detect when employees perform activities that they should perform to be compliant. The sensors do not post data directly to the internet, but instead relay data to a central gateway device that is configured to post the data to a server in the cloud. We’ll discuss the nuts and bolts of how the data is processed later on.
Of course the sensor data isn’t merely posted to the internet to sit idle; rather there is a web portal that provides user access to the data. The web portal allows administrators to maintain users, customers, facilities, devices, etc. The web portal also includes various dashboards that display key performance indicators at a glance, to monitor devices and compliance. There are also numerous reports that allow users to slice and dice their data and view all sorts of graphs and tables.
Previous Architecture – Amazon Web Services
So we’ve seen that devices post data to a central server for processing, and that there’s a web portal to view and maintain all that data, but where does all the magic happen? In the cloud of course! The “old architecture” (see Figure 2: Old Architecture) is largely based on an Infrastructure as a Service (IaaS) approach, which requires support and maintenance of virtual machines (VMs) in the cloud using Amazon Web Services (AWS). The first VM hosts the web portal as a single instance. To address the stated performance problems, the client considered either upgrading to a larger VM instance (scaling vertically) or creating more instances of the web portal VM (scaling horizontally). They chose to scale vertically to a large VM instance.
The next piece of the “old architecture” is a VM hosting multiple applications that receive and process the messages sent from gateway devices. The first of these applications is the “gateway server.” It receives the messages from the gateway devices and posts them to the second application, which is a RabbitMQ queue also running on the server. The third application is the message processor, which pulls messages off of the queue, executes business logic and writes relevant information to the database. The final piece to this VM’s architecture is AWS’s ElasticIP, which allows the gateway devices to post their data to an IP address. Elastic IP then forwards the data to the gateway server application.
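The flow described above (gateway server, then queue, then message processor, then database) can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory queue.Queue stands in for RabbitMQ, a plain list stands in for the MySQL database, and the field names (device_id, event_type) and the "missed_activity" rule are hypothetical, not the client's actual message format or business logic.

```python
import json
import queue

# In-memory stand-ins (assumptions): the real system used RabbitMQ and MySQL.
message_queue = queue.Queue()
database = []


def gateway_server_receive(raw_payload: str) -> None:
    """Gateway server: accept a raw post from a gateway device and enqueue it."""
    message_queue.put(raw_payload)


def process_messages() -> int:
    """Message processor: drain the queue, apply business logic, write rows."""
    processed = 0
    while True:
        try:
            raw = message_queue.get_nowait()
        except queue.Empty:
            break
        event = json.loads(raw)
        # Hypothetical business rule: a "missed_activity" event is non-compliant.
        row = {
            "device_id": event["device_id"],
            "event_type": event["event_type"],
            "compliant": event["event_type"] != "missed_activity",
        }
        database.append(row)
        processed += 1
    return processed


# Two gateway posts, then one processing pass.
gateway_server_receive(json.dumps({"device_id": "gw-1", "event_type": "enter"}))
gateway_server_receive(json.dumps({"device_id": "gw-1", "event_type": "missed_activity"}))
count = process_messages()
```

The queue decouples the component that receives data from the component that processes it, which is why the same basic shape carries over to the new architecture.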
The final component to the old architecture is a MySQL database hosted in Amazon RDS, which is actually a Platform as a Service (PaaS) offering on AWS. With a PaaS approach the cloud provider maintains the infrastructure (hardware, operating system, database, etc.) and the client supports the cloud-hosted applications mostly via browser-based dashboards. With Amazon RDS there are options to scale the database VM, but as deployed it is a single large instance and therefore is a fairly costly single point of failure.
There are some issues with the “old architecture,” but one issue looms large. The single biggest concern is that the applications on the gateway server VM are not designed for multiple instances to be run concurrently, whether on the same VM or on different VMs. That means it is not possible to run multiple instances of the VM (horizontal scaling), which greatly limits scaling options and also introduces a single point of failure. As a result, the only scaling option is to upgrade the gateway server VM to a very large instance, which is costly. Unfortunately, even with a large VM instance, the gateway server applications have on occasion struggled to keep up with the flow of messages from a few thousand devices.
Another issue is with the gateway server VM’s RabbitMQ implementation. RabbitMQ’s maximum queue depth is related to available disk space, such that on a small VM instance RabbitMQ can support around 2,000 messages. The limited queue depth may not sound concerning, except that there have been occasions when gateway devices lost connectivity for a period of time, and when they regained connectivity they posted a large volume of messages to the queue. If they hit the maximum queue depth there is the potential for data loss, which has occurred. The queue depth limitation was one reason for upgrading the gateway server VM to a medium-sized instance.
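To make the burst scenario concrete, here is a minimal sketch of a bounded buffer receiving more messages than it can hold. It does not model RabbitMQ's actual behavior when disk space runs low; it only illustrates the general idea that a capped queue plus a reconnect burst means some messages have nowhere to go. The 2,000 limit comes from the text above, while the burst size of 2,500 and the post_burst helper are made up for illustration.

```python
import queue


def post_burst(q, messages):
    """Try to enqueue every message; count the ones that do not fit."""
    accepted, dropped = 0, 0
    for message in messages:
        try:
            q.put_nowait(message)
            accepted += 1
        except queue.Full:
            # No room left: in this sketch the message is simply lost.
            dropped += 1
    return accepted, dropped


# A queue capped at roughly the depth described for a small VM instance.
bounded = queue.Queue(maxsize=2000)

# Hypothetical burst: gateways reconnect and post 2,500 buffered events at once.
burst = [f"event-{i}" for i in range(2500)]
accepted, dropped = post_burst(bounded, burst)
```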
To summarize, the current implementation of the server architecture has limited scalability, uses more expensive medium and large server instances, lacks redundancy/failover, and has not provided high throughput, which has led to poor responsiveness in the web portal.
New Architecture – Microsoft Azure
The customer needed to improve the performance, scalability, and maintainability of the overall architecture, at a reasonable cost. They needed to be able to support a much larger customer base on the system with consistently high throughput and responsiveness. Additionally, given a very small IT staff, they needed to minimize their application support efforts. Microsoft Azure was a clear fit.
The following table shows before and after comparisons of the various components of the old and new architectures:
To summarize the new architecture (see Figure 3: New Architecture), the gateway server application is the only component that still runs on VMs (i.e. using IaaS), but by hosting the app on two small VMs it costs much less than the old gateway server VM. The web site and message processor WebJob run within Azure Web Sites and can be scaled both horizontally and vertically. The Storage Queue is massive, fast, redundant, and cheap. The SQL Database is not currently redundant, but will soon support redundancy and higher scaling.
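Horizontal scaling of the message processor boils down to several identical workers pulling from one shared queue, with each message handled by exactly one of them. The sketch below illustrates that pattern with threads and an in-memory queue. It is not Azure WebJobs or Storage Queue code; run_workers and the worker names are hypothetical.

```python
import queue
import threading


def run_workers(messages, worker_count):
    """Drain a shared queue with several workers; return (worker, message) pairs."""
    shared = queue.Queue()
    for message in messages:
        shared.put(message)

    handled = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                message = shared.get_nowait()
            except queue.Empty:
                return  # Nothing left to do, so this worker stops.
            with lock:
                handled.append((name, message))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


handled = run_workers([f"event-{i}" for i in range(100)], worker_count=4)
```

Because the queue hands each message to a single consumer, adding workers raises throughput without processing anything twice, which is exactly what the old gateway server applications could not do.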
The primary lesson learned with this particular IoT application is that the hardware integration was far more difficult than expected. There were decisions made when designing the device hardware that created serious challenges in processing the event data. For example, the gateways post the event data repeatedly so that the message processor has to contend with duplicate events. In addition, they often post events out of sequence, so we have to contend with properly sequencing the events as well. The general point is that we had no say in the design of the hardware, and limited information on its detailed behavior. Your mileage may vary, but it is critical with IoT apps to have a detailed understanding of the device hardware, its capabilities and limitations, and the data it publishes.
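As a rough illustration of what contending with duplicates and out-of-sequence events involves, here is a minimal sketch. It assumes each event carries a unique ID and a per-device sequence number; both fields are hypothetical, since the real hardware's message format is not described here, and without something like them the problem is considerably harder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str   # hypothetical unique ID assigned by the device
    device_id: str
    sequence: int   # hypothetical per-device counter
    kind: str


def dedupe_and_order(events):
    """Drop repeated events, then sort what remains by device and sequence."""
    seen = set()
    unique = []
    for event in events:
        key = (event.device_id, event.event_id)
        if key in seen:
            continue  # Duplicate post from the gateway: ignore it.
        seen.add(key)
        unique.append(event)
    return sorted(unique, key=lambda e: (e.device_id, e.sequence))


# Events arrive out of order, and one is posted twice.
received = [
    Event("b", "gw-1", 2, "activity"),
    Event("a", "gw-1", 1, "enter"),
    Event("b", "gw-1", 2, "activity"),
    Event("c", "gw-1", 3, "exit"),
]
ordered = dedupe_and_order(received)
kinds = [event.kind for event in ordered]
```

In a streaming setting the set of seen IDs cannot grow forever, so a real processor would also need a retention window or a database uniqueness constraint. That is a design decision this sketch leaves out.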
Another lesson learned over the course of multiple IoT applications is that they tend to use a relatively wide variety of cloud services such as web sites, databases, queues, batch processes, VMs, schedulers, notification hubs, load balancing, auto-scaling, etc. It’s great that so many services are available in the cloud, but there are usually also many options to choose from for a given service. For example, with the message processor we debated whether to use Azure WebJobs or Worker Roles. For the queues we could have used Azure Service Bus instead of Storage Queues. In both cases the discussion weighed simplicity versus flexibility/complexity, as well as performance and cost. The path to the optimal solution isn’t always obvious.
Finally, here are some recommendations that apply to all cloud applications but are critical to IoT apps, given that they tend to scale rapidly. We ran performance and load tests roughly two-thirds of the way through the project lifecycle, after the major plumbing of the application was in place and many of the features were well under way. (Note that we did not wait until approaching production deployment!) We loaded up the database with many millions of rows of data, and using Visual Studio Ultimate we recorded various automated web test scenarios, then simulated hundreds of users executing those scenarios across multiple browsers. We even used Visual Studio Online to simulate many users across the internet hitting our web site – very cool! Fortunately the app performed as designed and page response times were all in the desired sub-second range. We played with a couple of scaling options, but the performance was acceptable in all cases.
The other recommendation is to dedicate effort to designing robust and easy-to-support applications. We spent significant time on our approach to exception handling, logging, auditing, handling unexpected conditions, failing gracefully, etc. We also instrumented our web site with Microsoft’s Application Insights to provide analytics, which thrilled our client. With all of the logging, auditing and analytics we have had a pretty easy time tracking down potential issues, as well as verifying that our page response times are still consistently in the sub-second range.
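To give a flavor of what "failing gracefully" can look like in a message processor, here is a minimal sketch: a malformed message is logged and set aside rather than crashing the whole batch. The function name, the device_id field, and the choice of which exceptions to catch are all assumptions for illustration, not the project's actual code.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("message_processor")


def process_safely(raw_messages):
    """Process each message; log and set aside any that cannot be parsed."""
    processed, failed = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            device_id = event["device_id"]  # hypothetical required field
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Keep the bad message for later inspection instead of crashing.
            logger.warning("Skipping malformed message %r: %s", raw, exc)
            failed.append(raw)
            continue
        logger.info("Processed event from device %s", device_id)
        processed.append(event)
    return processed, failed


processed, failed = process_safely(
    ['{"device_id": "gw-1"}', "not json", '{"other": 1}']
)
```

Keeping the failed messages around, along with a log line explaining why each one failed, is what makes issues easy to track down later.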
The architecture detailed here is quite representative of other IoT applications Cardinal has built. We commonly use queues, message processors, databases, and web portals, as well as the other PaaS and IaaS components detailed here. The most common variations between applications are largely based on device hardware requirements and requested functionality. For example, the gateway server application is only needed in this case because the hardware doesn’t support writing directly to the queue. Other IoT apps have incorporated features such as notification services, in order to send push notifications to users for alerts or other relevant information.
Clearly IoT applications are on the rise, as is cloud adoption. The cloud is the perfect platform for IoT apps because of the flexibility provided by such a wide variety of available cloud features/services. The cloud also offers tremendous scalability, so an IoT app can start with a small user base and grow to a massive scale at a cost that matches the needed scale. It has been a great challenge building IoT apps thus far, and I’m already looking forward to the next!