Technical Guidance for Analytics Managed Services for CSP Partners

Introduction

This document is an addendum to the Business Guidance document. The Microsoft Cloud Solution Provider (CSP) program enables partners to sell Microsoft Online Services to customers together with additional value-added services such as assessment, consulting, deployment, migration, monitoring, support and billing. As part of the CSP program, partners are able to create tenants, provision Azure subscriptions and manage various services on behalf of the customer.

This document provides technical guidance for CSP partners to build industry vertical end-to-end solutions, with reference architectures that leverage the power of the Microsoft Azure Data Platform. It highlights specific high-value opportunities that exist for CSP partners across industry vertical scenarios such as the Healthcare, Retail and Hospitality sectors, and the benefits of choosing the Microsoft Azure Data Platform. The three key trends this document capitalizes on are:

  1. Using predictive analytics with machine learning to transform Big Data into "Huge" Insights with the Microsoft Azure Data Platform
  2. Leveraging open source with Azure to perform real-time stream analytics and sentiment analysis, capitalizing on social trends
  3. The explosion of data with IoT, and how to monetize the IoT wave

While these solutions may be built on public, private, or hybrid cloud components, the core guidance focuses on public cloud Microsoft Azure implementations.

Reference Architecture

This section provides technical details on reference architectures for the various use case scenarios across three industry verticals – Healthcare, Retail and Hospitality/Hotel.

Healthcare Sector

In the Healthcare sector, vast amounts of data are amassed across EHR (Electronic Health Record) and EMR (Electronic Medical Record) systems, and opportunities for CSP partners are abundant. According to the CDC, chronic diseases affect 52% of Americans and consume 86% of healthcare costs. The industry trend is to enable diagnosis efficiencies and the best patient outcomes at the lowest possible cost. According to Gartner, Healthcare IT is undergoing a transformational shift towards "Personalized Care", and this trend presents a significant opportunity to offer both managed services and value-add services that lower TCO by reducing inefficiencies, thereby reducing healthcare costs overall.

According to the CDC, diabetes is the leading cause of kidney failure and increases the risk of cardiovascular disease, neuropathy, retinopathy and various other health complications.

  • Detecting diabetes involves various factors, including body mass index, high blood pressure, sedentary lifestyle and hereditary conditions. In 90% of cases, detection happens after the fact.
  • The total estimated cost of diagnosed diabetes in 2012 was $245 billion, including $176 billion in direct medical costs and $69 billion in reduced productivity.

Type 2 diabetes affected 28 million Americans in 2012. It occurs primarily in patients who are genetically predisposed and lead sedentary lifestyles, and makes up 90% of diabetes cases.

Healthcare Use Case Scenario Description

Scenario 1: Diabetes Type 2 Prediction: Using the electronic health record (EHR) data set containing patient records from all 50 states in the United States, predict

  • whether a patient will have a type 2 diabetes diagnosis within the next year, so that preventive interventions can be administered to patients at high risk for type 2 diabetes
  • whether a diabetes patient will be admitted to the hospital next year – the goal is to proactively sign up high-risk patients for a diabetes case management program

Machine learning with Azure ML can be used to predict type 2 diabetes – the Microsoft Data Platform provides a comprehensive suite of libraries enabling powerful supervised and unsupervised learning algorithms to aid predictive analytics.

Scenario 2: Remote Patient Monitoring with Continuous Glucose Monitoring (CGM): The Internet of Things (IoT) powers a new class of scenarios around remote patient monitoring. CGM is a growing trend in which a small wearable device, with a sensor implanted subcutaneously, tracks glucose levels continuously throughout the day and night. Cloud-driven CGM alerts the patient to highs and lows so that timely action can be taken on episodes such as hypoglycemic or hyperglycemic conditions, and enables remote patient monitoring so that timely medical aid can be provided. CGM minimizes the guesswork that comes from making decisions based solely on a single blood glucose meter reading, supporting better diabetes management and keeping the physician abreast through remote monitoring.

Microsoft Azure Data Factory

The first step in building the reference architecture is to decipher the various "data silos" that exist across industry sectors – for instance in Healthcare, Electronic Medical Records (EMR) and Electronic Health Records (EHR) straddle hybrid cloud boundaries. The Microsoft Data Platform provides a unique opportunity for CSP partners to bridge these "data silos" across cloud boundaries and offer managed Data Quality, Data Movement (for example, using Azure for backup, or running secondary replicas of mission-critical workloads in Azure) and Data Management services – allowing partners to build automated data movement and migration pipelines for long-term retention, reducing the need to expand cold storage on premises.

Microsoft Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. Just like a manufacturing factory that runs equipment to take raw materials and transform them into finished goods, Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.

Data Factory works across on-premises and cloud data sources and SaaS to ingest, prepare, transform, analyze, and publish your data. Use Data Factory to compose services into managed data flow pipelines to transform your data using services like Azure HDInsight (Hadoop) and Azure Batch for your big data computing needs, and with Azure Machine Learning to operationalize your analytics solutions. Go beyond a tabular monitoring view, and use the rich visualizations of Data Factory to quickly display the lineage and dependencies between your data pipelines. Monitor all of your data flow pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.

Azure Data Factory enables you to compose data movement and data processing tasks as a data-driven workflow. The following walkthrough builds a first pipeline that uses HDInsight to transform and analyze web logs on a monthly basis; the steps are as follows:

1. Create the data factory. A data factory can contain one or more data pipelines that move and process data.

2. Create the linked services. You create a linked service to link a data store or a compute service to the data factory. A data store such as Azure Storage holds input/output data of activities in the pipeline. A compute service such as Azure HDInsight processes/transforms data.

3. Create input and output datasets. An input dataset represents the input for an activity in the pipeline and an output dataset represents the output for the activity.

4. Create the pipeline. A pipeline can have one or more activities such as Copy Activity to copy data from a source to a destination (or) HDInsight Hive Activity to transform input data using Hive script to produce output data. This sample uses the HDInsight Hive activity that runs a Hive script. The script first creates an external table that references the raw web log data stored in Azure blob storage and then partitions the raw data by year and month.
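
To make step 4 concrete, the sketch below stages a Hive script of the kind described above into Azure Blob Storage, where the pipeline's HDInsight Hive activity can reference it. This is a hedged illustration using the WindowsAzure.Storage client library: the container name, script path, storage account placeholder and the script body are assumptions for illustration, not the tutorial's exact artifacts.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;       // WindowsAzure.Storage NuGet package
using Microsoft.WindowsAzure.Storage.Blob;

class StageHiveScript
{
    // Illustrative Hive script: an external table over the raw web logs,
    // then a rewrite of the raw rows into year/month folders.
    const string HiveScript = @"
DROP TABLE IF EXISTS WebLogsRaw;
CREATE EXTERNAL TABLE WebLogsRaw (logLine string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/logs/raw/';

-- The pipeline passes ${hiveconf:Year} and ${hiveconf:Month} per activity run.
INSERT OVERWRITE DIRECTORY
  'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/logs/partitioned/${hiveconf:Year}/${hiveconf:Month}/'
SELECT * FROM WebLogsRaw;";

    static async Task Main()
    {
        var account = CloudStorageAccount.Parse(
            Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"));
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("adfwalkthrough");
        await container.CreateIfNotExistsAsync();

        // The pipeline's HDInsight Hive activity points at this script path.
        var blob = container.GetBlockBlobReference("scripts/partitionweblogs.hql");
        await blob.UploadTextAsync(HiveScript);
    }
}
```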

Prerequisites:

  1. Azure subscription - If you don't have an Azure subscription, you can create a free trial account in just a couple of minutes.
  2. Azure Storage – You will use an Azure storage account for storing the data. After you have created the storage account, you will need to obtain the account key used to access the storage.

Refer to the linked tutorial to create the Azure Data Factory pipeline using the Azure Data Factory Editor.

Azure Data Factory has a few key entities that work together to define the input and output data, processing events, and the schedule and resources required to execute the desired data flow.

Figure: Relationships between Dataset, Activity, Pipeline, and Linked service

Activities: Activities define the actions to perform on your data. Each activity takes zero or more datasets as inputs and produces one or more datasets as outputs. An activity is a unit of orchestration in Azure Data Factory. For example, you may use a Copy activity to orchestrate copying data from one dataset to another. Similarly, you may use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Azure Data Factory provides a wide range of data transformation, analysis, and data movement activities.

Pipelines: Pipelines are a logical grouping of Activities. They are used to group activities into a unit that together perform a task. For example, a sequence of several transformation Activities might be needed to cleanse log file data. This sequence could have a complex schedule and dependencies that need to be orchestrated and automated. All of these activities could be grouped into a single Pipeline, for instance, "CleanLogFiles". "CleanLogFiles" could then be deployed, scheduled, or deleted as one single unit instead of managing each individual activity independently.

Datasets: Datasets are named references/pointers to the data you want to use as an input or an output of an Activity. Datasets identify data structures within different data stores including tables, files, folders, and documents.

Linked service: Linked services define the information needed for Data Factory to connect to external resources. Linked services are used for two purposes in Data Factory:

  • To represent a data store including, but not limited to, an on-premises SQL Server, Oracle DB, File share or an Azure Blob Storage account. As discussed above, Datasets represent the structures within the data stores connected to Data Factory through a Linked service.
  • To represent a compute resource that can host the execution of an Activity. For example, the "HDInsightHive Activity" executes on an HDInsight Hadoop cluster.

With the four simple concepts of datasets, activities, pipelines and linked services, you can build your pipeline from the ground up.

Reference Architecture: Healthcare


Solution Building Blocks: The following are the key solution building blocks for the above reference architecture:

  • Azure Data Factory: Azure Data Factory is a fully managed service for composing data storage, processing and data movement into streamlined, scalable and reliable data production pipelines. Azure Data Factory allows composing, orchestrating and monitoring of data movement across various storage services in the cloud.
  • Azure HDInsight: Azure HDInsight is a service that deploys and provisions Apache™ Hadoop™ clusters in the cloud, providing a software framework designed to manage, analyze and report on big data. It scales to petabytes on demand and offers powerful capabilities to crunch all kinds of data – structured, semi-structured and unstructured.
  • Azure Blob Storage: Azure Blob Storage is a REST-based object store for unstructured data in the cloud that allows secure access to data over the wire and at rest.
  • Azure SQL Database: SQL Database is a relational database service in the cloud based on the market-leading Microsoft SQL Server engine, with mission-critical capabilities. SQL Database delivers predictable performance, scalability with no downtime, business continuity and data protection—all with near-zero administration. You can focus on rapid app development and accelerating your time to market, rather than managing virtual machines and infrastructure.
  • Azure ML: Azure ML is a powerful cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions. The forecasts and predictions can make apps and devices smarter, learning from existing data in order to forecast future trends, behaviors and outcomes.
  • Microsoft Power BI: Microsoft Power BI transforms data into rich visuals to spot trends as they happen and push your business further by offering richer insights into the data.

For type 2 diabetes prediction, where the data is used to predict a category, a classification-based machine learning model is used. Classification is a form of supervised learning – in this case we want the ability to predict whether a patient has the risk factors to be diagnosed with diabetes or not. When there are only two choices, this is called two-class or binomial classification. When there are more categories, as when predicting the winner of the NCAA March Madness tournament, the problem is known as multi-class classification.

The National Institutes of Health (NIH) conducted a study using machine learning to predict type 2 diabetes. In the past 3 years, if:

  • the patient had prescriptions of angiotensin-converting-enzyme (ACE) inhibitor AND the patient's maximum body mass index recorded is ≥35 
  • (Or) the patient had ≥6 diagnoses of hyperlipidemia AND the patient had prescriptions of statins AND the patient had ≥9 prescriptions
  • (Or) the patient had ≥5 diagnoses of hypertension AND the patient had prescriptions of statins AND the patient had ≥11 doctor visits  

→ then the patient is predicted to have a type 2 diabetes diagnosis within the next year.
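
To make the rule concrete, the following sketch encodes the three clauses above as a simple C# predicate. The record and its field names are hypothetical stand-ins, not the schema used in the NIH study.

```csharp
// Hypothetical patient-history record; field names are illustrative only,
// not the NIH study's schema.
record PatientHistory(
    bool HasAceInhibitorRx, double MaxBmi,
    int HyperlipidemiaDiagnoses, bool HasStatinRx, int PrescriptionCount,
    int HypertensionDiagnoses, int DoctorVisits);

static class DiabetesRule
{
    // Encodes the three rule clauses above as a boolean predicate.
    public static bool PredictsType2Diabetes(PatientHistory p) =>
        (p.HasAceInhibitorRx && p.MaxBmi >= 35)
        || (p.HyperlipidemiaDiagnoses >= 6 && p.HasStatinRx && p.PrescriptionCount >= 9)
        || (p.HypertensionDiagnoses >= 5 && p.HasStatinRx && p.DoctorVisits >= 11);
}
```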

Such predictions based on a historical training set can be made with Azure ML, which offers powerful predictive analytics capabilities with machine learning models such as the Two-Class Bayes Point Machine.

As an illustrative example, the Pima Indians Diabetes dataset (a sample dataset available in ML Studio) was used. It contains EMR data for female patients, including medical measurements and lifestyle factors –

The Two-Class Bayes Point Machine is a powerful linear classifier, shown here for illustrative purposes. To learn more about this model, refer to the MSDN article.
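
Once an experiment built around such a model is published as an Azure ML web service, applications can request predictions over HTTPS. The sketch below shows the general request-response pattern of the classic Azure ML web services; the endpoint URL, API key and input columns are placeholders to be replaced with the values from your own published experiment (the columns shown assume the Pima Indians schema).

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class ScoreDiabetesRisk
{
    static async Task Main()
    {
        // Placeholder endpoint and key - copy these from the published
        // Azure ML web service's dashboard.
        const string endpoint =
            "https://ussouthcentral.services.azureml.net/workspaces/<id>/services/<id>/execute?api-version=2.0";
        const string apiKey = "<your-api-key>";

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", apiKey);

        // Input columns must match the training dataset (Pima Indians schema assumed here).
        const string payload = @"{
  ""Inputs"": { ""input1"": {
      ""ColumnNames"": [ ""Pregnancies"", ""Glucose"", ""BloodPressure"", ""BMI"", ""Age"" ],
      ""Values"": [ [ ""2"", ""148"", ""72"", ""33.6"", ""50"" ] ] } },
  ""GlobalParameters"": {}
}";
        var response = await client.PostAsync(endpoint,
            new StringContent(payload, Encoding.UTF8, "application/json"));
        Console.WriteLine(await response.Content.ReadAsStringAsync()); // scored label + probability
    }
}
```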

Data Management Gateway, a component of Azure Data Factory, is agent software that you install on premises to connect on-premises data sources to cloud services for consumption. With Data Management Gateway, you can:

  • Connect to on-premises data for Hybrid data access – You can connect on-premises data to cloud services to benefit from cloud services while keeping the business running with on-premises data.
  • Define a secure data proxy – You can define which on-premises data sources are exposed with Data Management Gateway so that Data Management Gateway authenticates the data request from cloud services and safeguards the on-premises data sources.
  • Manage your gateway for complete governance – You are provided with full monitoring and logging of all the activities inside the Data Management Gateway for management and governance.

Data Management Gateway has a full range of on-premises data connection capabilities.

  • Non-intrusive to corporate firewall – Data Management Gateway just works after installation, without having to open up a firewall connection or requiring intrusive changes to your corporate network infrastructure.
  • Encrypt credentials with your certificate – Credentials used to connect to data sources are encrypted with a certificate fully owned by a user. Without the certificate, no one can decrypt the credentials into plain text, including Microsoft.
  • Move data efficiently – Data is compressed and transferred in parallel, resilient to intermittent network issues with auto retry logic.

Overall, here's how the solution works end-to-end:

  1. The Data Management Gateway service connects to data on premises and uploads it to Azure Blob Storage using Azure Data Factory.
  2. The Azure Data Factory Data Movement Service processes the data from various blob stores (pertaining to EHR and EMR data) and aggregates it with HDInsight for analysis and post processing.
  3. Azure ML – specifically the Two-Class Bayes Point Machine classifier – is applied to the data, using a training set built from historical records, to deduce whether a patient's data exhibits the risk factors that indicate the patient would be diagnosed as diabetic within the next year.
  4. The processed data and the classification results are then stored as processed output in SQL Database, where they can be consumed through dashboards and Power BI for analytics.

Retail Sector

The retail industry vertical presents a plethora of opportunities, both in terms of hosting/managed and value-add service offerings. According to IDC Research, success or failure in retail will be driven by an effective and profitable consumer experience. Product availability, information access, promotional relevancy, ease of checkout, and convenience are now key components of a competitive and differentiated omni-channel strategy.

Retailers are thus on a mission to bridge the "brick-2-click" transition by offering a rich and immersive shopping experience that leverages mobile and social strategies, taking the customer experience to a whole new level. This omni-channel strategy presents an incredible opportunity for MSPs to grow their business multi-fold and cash in on this mega-trend.

In 2014, more merchants ventured into omni-channel retailing and tried in-store marketing solutions such as beacons to enrich the shopping experience. We anticipate stores will double down on these strategies and continue to find ways to bridge the gap between offline and digital channels.

Retail Use Case Scenario Description

Scenario: Immersive Shopping Experience: Retailers are on a mission to enable a rich, immersive shopping experience leveraging mobile and social strategies at all points of contact. Giving customers real-time, relevant information on promotions – based on their location, personal preferences and social influencers – is key to delivering a modern "e-tailer" experience and opening new sources of revenue.

  • Retailers commonly want to entice their customers to purchase products by presenting them with products they are most likely to be interested in, and therefore most likely to buy.
  • Using Cognitive Services, the shopping experience can be customized based on customer demographics such as age, personal preferences, and social shopping "hot" trends tailored to a particular demographic. The Microsoft Data Platform can be leveraged to build a recommendation engine that takes into account customer demographics, historical shopping behavior, product information, newly introduced brands, and hot social trends.
  • The goal of these retailers is to optimize for user click-to-sale conversions and earn higher sales revenue. They achieve this by delivering contextual, behavior-based product recommendations based on customer interests and actions.

Reference Architecture: Retail

For the above reference architecture, we have shown how easy it is to integrate with 3rd party services such as Apache Storm and Apache Mahout – these services and how the solution works end-to-end are described below.

Solution Building Blocks

  • Azure IoT Hub: Azure IoT Hub is used to connect, monitor and manage millions of IoT assets and it establishes a secure and reliable bi-directional communication channel to ingest real-time data and upload files to the cloud.
  • Apache Mahout to offer personalized recommendations: Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.
  • Apache Storm: A distributed, fault-tolerant, open source engine for processing event streams in real time.
  • Azure HDInsight: Azure HDInsight is a service that deploys and provisions Apache™ Hadoop™ clusters in the cloud, providing a software framework designed to manage, analyze and report on big data. It scales to petabytes on demand and offers powerful capabilities to crunch all kinds of data – structured, semi-structured and unstructured.
  • Azure Data Factory: Azure Data Factory is a fully managed service for composing data storage, processing and data movement into streamlined, scalable and reliable data production pipelines. Azure Data Factory allows composing, orchestrating and monitoring of data movement across various storage services in the cloud.
  • Azure Blob Storage: Azure Blob Storage is a REST-based object store for unstructured data in the cloud that allows secure access to data over the wire and at rest.
  • Azure SQL Database: SQL Database is a relational database service in the cloud based on the market-leading Microsoft SQL Server engine, with mission-critical capabilities. SQL Database delivers predictable performance, scalability with no downtime, business continuity and data protection—all with near-zero administration. You can focus on rapid app development and accelerating your time to market, rather than managing virtual machines and infrastructure.
  • Azure Cognitive Services: The Face API is used to predict age from customer images (see the sketch after this list).
  • Microsoft Power BI: Microsoft Power BI transforms data into rich visuals to spot trends as they happen and push your business further by offering richer insights into the data.
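
As an illustration of the Face API building block, the following sketch detects faces and requests age and gender attributes using the Microsoft.ProjectOxford.Face client library of that era; the subscription key and image URL are placeholders, and the exact client signature may vary by SDK version.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ProjectOxford.Face;   // Microsoft.ProjectOxford.Face NuGet package

class EstimateCustomerAge
{
    static async Task Main()
    {
        // Placeholder subscription key from the Cognitive Services portal.
        var faceClient = new FaceServiceClient("<your-face-api-key>");

        // Detect faces and request age/gender attributes for the shopper's photo.
        var faces = await faceClient.DetectAsync(
            "https://example.com/shopper.jpg",
            returnFaceAttributes: new[] { FaceAttributeType.Age, FaceAttributeType.Gender });

        foreach (var face in faces)
            Console.WriteLine($"Estimated age: {face.FaceAttributes.Age}, gender: {face.FaceAttributes.Gender}");
    }
}
```

The estimated age and gender can then be fed into the recommendation engine as demographic features, as described in the end-to-end flow below.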

3rd Party services with Microsoft Azure Data Platform - Apache Mahout and Storm:

Apache Mahout to offer personalized recommendations: Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.

Apache Storm: A distributed, fault-tolerant, open source engine for processing event streams in real time.

Creating a recommendation model is usually enough for the system to provide recommendations. In this case, Apache Mahout machine learning was used to build a recommendation engine that takes customer information, customer behavior and product information as inputs, and offers personalized product recommendations based on demographic information and matching social trends for that demographic.

Nevertheless, recommendation quality varies based on the usage processed and the coverage of the catalog. For example, if you have a lot of cold items (items without significant usage), the system will have difficulties providing a recommendation for such an item or using such an item as a recommended one. In order to overcome the cold item problem, the system allows the use of metadata of the items to enhance the recommendations. This metadata is referred to as features.

Features can enhance the recommendation model, but to do so requires the use of meaningful features. For this purpose, a new build was introduced - a rank build. This build will rank the usefulness of features. A meaningful feature is a feature with a rank score of 2 and up. After understanding which of the features are meaningful, trigger a recommendation build with the list (or sub-list) of meaningful features.

A sample recommendation service for movie recommendations, following a similar approach to the one above, is described in the Azure documentation using the Apache Mahout machine learning library – CSP partners can leverage those concepts to learn how to use Apache Mahout with Azure HDInsight to build a recommendation engine.
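
Mahout's item-based recommender consumes usage data as simple userID,itemID,preference triples. The hypothetical sketch below prepares such an input file from order history; the record shape and values are illustrative. The resulting file would then be uploaded to the HDInsight cluster's blob storage and processed by Mahout's org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.

```csharp
using System.IO;
using System.Linq;

class PrepareMahoutInput
{
    // Hypothetical order-history record; shape and values are illustrative.
    record Order(long CustomerId, long ProductId, double Preference);

    static void Main()
    {
        var orders = new[]
        {
            new Order(1001, 42, 5.0),   // customer 1001 strongly prefers product 42
            new Order(1001, 17, 3.0),
            new Order(1002, 42, 4.0),
        };

        // Mahout's RecommenderJob expects comma-separated userID,itemID,value rows.
        File.WriteAllLines("usage.csv",
            orders.Select(o => $"{o.CustomerId},{o.ProductId},{o.Preference}"));
        // Next: upload usage.csv to the cluster's blob container and run
        // org.apache.mahout.cf.taste.hadoop.item.RecommenderJob over it.
    }
}
```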

Use of Azure IoT Hub

Azure IoT Hub addresses the device-connectivity challenges in the following ways:

  • Per-device authentication and secure connectivity. You can provision each device with its own security key to enable it to connect to IoT Hub. The IoT Hub identity registry stores device identities and keys in a solution. A solution back end can whitelist and blacklist individual devices, which enables complete control over device access.
  • Monitoring of device connectivity operations. You can receive detailed operation logs about device identity management operations and device connectivity events. This enables your IoT solution to easily identify connectivity issues, such as devices that try to connect with wrong credentials, send messages too frequently, or reject all cloud-to-device messages.
  • An extensive set of device libraries. Azure IoT device SDKs are available and supported for a variety of languages and platforms – C for many Linux distributions, Windows, and real-time operating systems. Azure IoT device SDKs also support managed languages, such as C#, Java, and JavaScript (see the sketch after this list).
  • IoT protocols and extensibility. If your solution cannot use the device libraries, IoT Hub exposes a public protocol that enables devices to natively use the MQTT v3.1.1, HTTP 1.1, or AMQP 1.0 protocols. You can also extend IoT Hub to provide support for custom protocols by:
    • Creating a field gateway with the Azure IoT Gateway SDK that converts your custom protocol to one of the three protocols understood by IoT Hub.
    • Customizing the Azure IoT protocol gateway, an open source component that runs in the cloud.
  • Scale. Azure IoT Hub scales to millions of simultaneously connected devices and millions of events per second.
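
As a minimal sketch of the device side, the following uses the Azure IoT device SDK for C# (Microsoft.Azure.Devices.Client) to connect with a per-device connection string and send one telemetry message; the connection string and payload shape are placeholders.

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;   // Azure IoT device SDK for C#

class BeaconDevice
{
    static async Task Main()
    {
        // Per-device connection string from the IoT Hub identity registry (placeholder).
        const string connectionString =
            "HostName=<your-hub>.azure-devices.net;DeviceId=beacon-001;SharedAccessKey=<key>";

        var deviceClient = DeviceClient.CreateFromConnectionString(
            connectionString, TransportType.Mqtt);

        // Illustrative telemetry payload, e.g. an in-store beacon proximity event.
        var payload = "{\"beaconId\":\"store42-entrance\",\"rssi\":-61,\"ts\":\"" +
                      DateTime.UtcNow.ToString("o") + "\"}";
        await deviceClient.SendEventAsync(new Message(Encoding.UTF8.GetBytes(payload)));
        Console.WriteLine("Telemetry sent.");
    }
}
```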

Overall, here's how the solution works end-to-end:

  • Event Hubs is a highly scalable publish-subscribe data integrator capable of consuming large volumes of events per second, enabling Azure to process vast amounts of data from connected applications or devices. It provides a unified collection point for a broad array of platforms and devices, and as such, abstracts the complexity of ingesting multiple different input streams directly into the streaming analytics engine. In the above reference architecture, Event Hubs provide a robust ingestion capability for incoming data that is being generated at data source endpoints such as devices, sensors or applications (a publishing sketch follows this list).
  • While the primary function of Event Hubs is to provide a single storage entry point to robustly ingest data for short term storage, the role of Stream Analytics (in this reference architecture, we use Apache Storm for stream processing) is to consume these events to provide temporal queries for analytics against the moving event stream for filter, aggregate, or join transformations that are then written to an output destination.
  • Azure Cognitive Services – specifically the Face API – can be used to analyze a customer's picture to mine analytics around the customer's demographics. This information can be fed as input into Apache Mahout, where the recommendation engine executes to offer recommendations based on hot trends applicable to the customer's demographic, making recommendations personal and meaningful.
  • Azure Machine Learning provides a cloud-based platform for mining data to identify trends and patterns across large-scale information sets. By "learning" from historical events and trends, machine learning can publish a model that can be leveraged to make predictions in real time based on incoming event data. Combining real-time rules-based stream processing with real-time predictive analytics can help businesses rapidly deploy highly scalable data solutions to support complex information challenges. In the above reference architecture, we have used Apache Mahout, an open source machine learning library that runs on Azure HDInsight, to build the recommendation engine.
  • The data storage layer along with Azure Data Factory is used to orchestrate movement of data for storage from event streams to processed output. In the reference architecture, we use Azure SQL Database for processed event stream store and use HDInsight along with Azure Blob as store for inputs/outputs from the recommendation engine.
  • The resultant data from the recommendation engine is mapped to the product catalog to offer a rich, immersive shopping experience for the customer – surfacing current offers, discounts and products on sale – through a .NET-based mobile application leveraging Microsoft Xamarin.
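
For completeness, here is a minimal sketch of publishing an event to Event Hubs with the .NET client of that era (the WindowsAzure.ServiceBus package); the connection string, hub name and event shape are placeholders.

```csharp
using System.Text;
using Microsoft.ServiceBus.Messaging;   // WindowsAzure.ServiceBus NuGet package

class PublishClickEvent
{
    static void Main()
    {
        // Placeholder namespace connection string and event hub name.
        var client = EventHubClient.CreateFromConnectionString(
            "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=<key>",
            "clickstream");

        // Illustrative clickstream event consumed downstream by Apache Storm.
        var body = Encoding.UTF8.GetBytes(
            "{\"customerId\":1001,\"productId\":42,\"action\":\"view\"}");
        client.Send(new EventData(body));
    }
}
```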

Hospitality Sector

In the Hospitality/Hotel sector, a huge opportunity exists for CSP partners around the data sprawl of Internet of Things (IoT) scenarios – both in terms of hosting/managed and value-add service offerings. Gartner predicts that by the year 2020, 25 billion connected devices will exist, and this explosion of data leads to data ingestion growing at a CAGR of 41% between 2013 and 2020.

The mission is to transform Hotel Information Systems (HIS) into Holistic Insightful Services that offer an "m"-powered guest experience. This explosion of data, and the associated big data challenges, presents an incredible opportunity for MSPs to grow their business multi-fold and cash in on this mega-trend. It also gives hotels a tremendous opportunity to provide a unique experience to their guests using predictive analytics and a connected environment across their pre-, during- and post-stay experiences.

Hospitality Use Case Scenario Description

Scenario: The mission is to transform Hotel Information Systems (HIS) into Holistic Insightful Services that offer an "M"-Powered Guest Experience. The reimagined HIS experience would combine feeds from relevant 3rd party providers with traditional hotel information systems to offer a tailored guest experience.

  • Hotels can analyze the social activity of their loyalty program members to decipher that a guest has boarded an early flight on the way to the hotel. The Hotel Information System – a smart management platform – can then alert housekeeping to prepare the guest's room right away in anticipation of an early check-in. Once the guest arrives, staff members greet them by name and can be aware of the guest's preferences: customizing dinner reservations based on the guest's profile, and knowing their typical workout schedule, favorite foods, sleeping patterns, temperature preferences, favorite TV channels, and more.
  • The "connected hotel" builds on enabling rich scenarios through a "connected" ecosystem – for instance, guest preferences such as favorite foods can enable a waiter at the restaurant to be aware of the guest's dietary requirements, food allergies etc., before they've even met them in person – this allows the restaurant to provide tailored guest specific experiences to make the guest feel "special".
  • With Azure analyzing device usage, the hotel can decipher that the guest stayed up late. The front desk then instantly sends a notification to the guest's smart device, telling them that late checkout has been arranged.

Microsoft Azure Data Platform can be leveraged to develop a data collection and analysis platform that generates reports based on spend analysis and social weightage to offer tailored loyalty rewards thereby enhancing customer stickiness. The key is to ensure a truly informed, cross-channel, and personal experience.

Reference Architecture: Hospitality


Solution Building Blocks

  • Azure IoT Hub: Azure IoT Hub is used to connect, monitor and manage millions of IoT assets and it establishes a secure and reliable bi-directional communication channel to ingest real-time data and upload files to the cloud.
  • Azure Data Factory: Azure Data Factory is a fully managed service for composing data storage, processing and data movement into streamlined, scalable and reliable data production pipelines. Azure Data Factory allows composing, orchestrating and monitoring of data movement across various storage services in the cloud.
  • Azure HDInsight: Azure HDInsight is a service that deploys and provisions Apache™ Hadoop™ clusters in the cloud, providing a software framework designed to manage, analyze and report on big data. It scales to petabytes on demand and offers powerful capabilities to crunch all kinds of data – structured, semi-structured and unstructured.
  • Azure Stream Analytics: Azure Stream Analytics (ASA) is a fully managed, cost-effective, real-time event processing engine that helps to unlock deep insights from data. Stream Analytics makes it easy to set up real-time analytic computations on data streaming from devices, sensors, web sites, social media, applications, infrastructure systems, and more. Azure Stream Analytics provides a native connector to Azure SQL Database for consuming events that are output from the stream. Business users (and business applications) can connect to the database directly to issue queries against real-time, inbound data using the same familiar T-SQL language that is part of SQL Server (see the query sketch after this list).
  • Azure Event Hubs: Azure Event Hubs is an event processing service that provides event and telemetry ingress to the cloud at massive scale, with low latency and high reliability. This service, used with other downstream services, is particularly useful in application instrumentation, user experience or workflow processing, and Internet of Things (IoT) scenarios
  • Azure Blob Storage: Azure Blob Storage is a REST-based object store for unstructured data in the cloud that allows secure access to data over the wire and at rest. Azure Blob Storage can also be designated as an output destination for the event stream. Blob Storage provides a massive scale data repository that can consume information at high transfer rates, and therefore is suitable to consuming raw data streams coming through the stream event processor. As events are written, they are appended to a single file within the Azure storage container. This is an important feature as it enables more efficient query processing by tools such as HDInsight which is Microsoft's distribution of Hadoop that runs in Azure.
  • Azure SQL Database: SQL Database is a relational database service in the cloud based on the market-leading Microsoft SQL Server engine, with mission-critical capabilities. SQL Database delivers predictable performance, scalability with no downtime, business continuity and data protection—all with near-zero administration. You can focus on rapid app development and accelerating your time to market, rather than managing virtual machines and infrastructure. Azure SQL Database provides resilient, non-volatile PaaS relational databases for storing and querying data. Filtered or aggregated events flowing out of Azure Stream Analytics can be written to a database for consumption either through direct user queries or business analytics tools. Additional attributes relevant to the streaming event but not actually part of the stream can also be stored in a database to create a holistic model for analysis. Such examples of these attributes include additional time descriptions, or product attributes such as size, color or weight.
  • Azure ML: Azure ML is a powerful cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions. The forecasts and predictions can make apps and devices smarter, learning from existing data in order to forecast future trends, behaviors and outcomes.
  • Microsoft Power BI: Microsoft Power BI transforms data into rich visuals to spot trends as they happen and push your business further by offering richer insights into the data.
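
As a sketch of the direct-query pattern mentioned for Azure SQL Database above, the following uses the standard System.Data.SqlClient API; the connection string, table and column names are hypothetical examples of Stream Analytics output, not a prescribed schema.

```csharp
using System;
using System.Data.SqlClient;   // System.Data.SqlClient

class QueryProcessedEvents
{
    static void Main()
    {
        // Placeholder connection string for the Azure SQL Database that
        // receives Stream Analytics output.
        const string connectionString =
            "Server=tcp:<server>.database.windows.net,1433;Database=hoteltelemetry;" +
            "User ID=<user>;Password=<password>;Encrypt=True;";

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // Illustrative query over an assumed table of aggregated device events.
        using var command = new SqlCommand(
            "SELECT TOP 10 RoomNumber, AvgTemperature, WindowEnd " +
            "FROM RoomTelemetrySummary ORDER BY WindowEnd DESC", connection);
        using var reader = command.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader["RoomNumber"]}: {reader["AvgTemperature"]} at {reader["WindowEnd"]}");
    }
}
```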

Overall, here's how the solution works end-to-end:

  • Location-based services are implemented with .NET using Microsoft Xamarin. GPS coordinates indicating the customer's current location are used to perform context-sensitive queries to offer a rich, immersive connected hospitality experience.
  • Event Hubs is a highly scalable publish-subscribe data integrator capable of consuming large volumes of events per second, enabling Azure to process vast amounts of data from connected applications or devices. It provides a unified collection point for a broad array of platforms and devices, and as such, abstracts the complexity of ingesting multiple different input streams directly into the streaming analytics engine. In the above reference architecture, Event Hubs provide a robust ingestion capability for incoming data that is being generated at data source endpoints such as devices, sensors or applications.
  • While the primary function of Event Hubs is to provide a single entry point to robustly ingest data for short-term storage, the role of Azure Stream Analytics in this reference architecture is to consume these events to provide temporal queries for analytics against the moving event stream, applying filter, aggregate, or join transformations that are then written to an output destination.
  • Azure Machine Learning provides a cloud-based platform for mining data to identify trends and patterns across large-scale information sets. By "learning" from historical events and trends, machine learning can publish a model that can be leveraged to make predictions in real time based on incoming event data. Combining the real-time rules-based processing capabilities of Azure Stream Analytics with the real-time predictive analytics capabilities of Azure Machine Learning can help businesses rapidly deploy highly scalable data solutions to support complex information challenges. In the above reference architecture, Azure ML is used to build the predictive models and recommendations behind the tailored guest experience.
  • The data storage layer along with Azure Data Factory is used to orchestrate movement of data for storage from event streams to processed output. In the reference architecture, we use Azure SQL Database for processed event stream store and use HDInsight along with Azure Blob as store for inputs/outputs from the recommendation engine.
  • The resultant data from the recommendation engine is mapped to the hospitality service catalog to offer a rich, immersive connected experience for the customer – surfacing current offers, discounts and deals – through a .NET-based mobile application leveraging Microsoft Xamarin.
  • In addition, from a hospitality analytics and BI perspective, business users have a number of options for accessing the raw data that is stored in Azure Blob Storage.
    • One common approach, as mentioned above, is to deploy an HDInsight cluster in Azure which effectively binds the dynamically created compute power to the persisted storage account.
    • Hive tables can be created against the schema of the consolidated events that are output from Azure Stream Analytics. This is an efficient method of resource management as the HDInsight compute can be shut down when analytics are complete whilst the storage account continues to consume and store events from the stream.
    • An alternative self-service method for consuming data from Blob Storage is through Power Query in Excel. With Power Query, users can connect directly to Azure Blob Storage using a built-in native connector and consume data to the in-memory engine within Excel.

Why bet on Microsoft Data Platform?

The Microsoft Data Platform offers a comprehensive suite of PaaS (Platform-as-a-Service) offerings that enables CSP partners to create end-to-end solutions, and provides unique advantages from a go-to-market perspective: quicker time to market and reduced infrastructure costs. Microsoft Azure PaaS offerings enable CSP partners to create differentiated, E2E, value-add, industry-vertical-specific solutions. Partners can bring solutions to market at a significantly reduced cost, while retaining the ability to craft customers' data models to unlock richer and deeper insights through analytics and integration with Internet of Things (IoT) capabilities.

Benefits of leveraging Microsoft PaaS – Quicker Time to Market, Reduced Cost

  1. Faster Time to Market

    PaaS provides built-in scalability and elasticity, delivering the efficiency and resiliency needed for scale-out scenarios such as "cloud burst". PaaS helps CSP partners rapidly construct E2E solutions and integrate the custom workflow essential to the creation of a targeted business application.

  2. Lower Costs and Higher Computing Efficiency

    PaaS allows you to rent converged platform services for the period during which they are used. It changes the cost structure from capital expense (CapEx) to operational expense (OpEx) for an enterprise, and reduces the TCO (Total Cost of Ownership) of running and deploying the application. From a pricing perspective, the pay-per-use / pay-as-you-go model is effective in eliminating unnecessary CapEx build-out costs: there is no need to buy software, middleware or a full-year license; you pay on the basis of usage.

  3. Agility in Development

    PaaS is a perfect match for agile software development methodologies. Agile development is based on iterative and incremental delivery, which may require new software and middleware platforms as iterations progress – PaaS provides these on demand, making it the right match for agile application development.

  4. Reduce Complexity

    PaaS provides the services required to support the complete life cycle of building and delivering web applications and services on the internet, allowing you to build, deploy, test, host and maintain applications. This enables you to deliver solutions faster, reduce time to market, and build iterative solutions effectively through the build-measure-learn cycle. By simplifying application developers' tasks, PaaS eliminates time-consuming configuration work through a user-friendly "plug-and-play" interface.

  5. Rich Analytics and Insights

    Microsoft Power BI transforms your company's data into rich visuals for you to collect and organize so you can focus on what matters to you. Stay in the know, spot trends as they happen, and push your business further – the possibilities are endless with richer and deeper insights.

Thus, the Microsoft Data Platform enables CSP partners to focus their efforts on building end-to-end solutions that enable recurring revenue streams, and allows partners to create differentiated offers, thereby driving customer stickiness in adoption and retention.

Best Practices

This section provides insights on best practices to follow and highlights critical areas to focus while building out the reference architecture described above:

Good Design -- Worth the Effort

Any large-scale application design takes careful thought, planning, and potentially complex implementation. For Azure, one of the most fundamental design principles is scale-out: rather than invest in increasingly more powerful (and expensive) hardware, a scale-out strategy responds to increasing demand by adding more machines or service instances.

Note that SQL Database is just a very obvious example where partitioning improves scalability. But to maximize the strengths of the platform, other roles and services must scale out in a similar way. For example, storage accounts have an upper bound on the rate of transactions, virtual machines have an upper bound on CPU and memory; maximum scale is achieved by designing for the use of multiple storage accounts and for services whose components scale out across virtual machines of set sizes.

Design for Cloud Scalability

While scalability is a driving force behind design, there are other critically important considerations. Plan for telemetry and diagnostic data collection, which becomes increasingly important as your solution becomes more componentized and partitioned. Availability and business continuity are two other major areas of focus – scalability is irrelevant when your service goes down or irretrievably loses data.

Monitoring Scenarios

Monitoring a large-scale distributed system poses a significant challenge. Each of the scenarios described in the previous section should not necessarily be considered in isolation. There is likely to be a significant overlap in the monitoring and diagnostic data that's required for each situation, although this data might need to be processed and presented in different ways. For these reasons, you should take a holistic view of monitoring and diagnostics.

You can use monitoring to gain an insight into how well a system is functioning. Monitoring is a crucial part of maintaining quality-of-service targets. Common scenarios for collecting monitoring data include:

  • Health Monitoring: Ensuring that the system remains healthy.
  • Availability Monitoring: Tracking the availability of the system and its component elements.
  • Performance Monitoring: Maintaining performance to ensure that the throughput of the system does not degrade unexpectedly as the volume of work increases.
  • SLA Monitoring: Guaranteeing that the system meets any service-level agreements (SLAs) established with customers.
  • Security Monitoring: Protecting the privacy and security of the system, users, and their data.
  • Audit: Tracking the operations that are performed for auditing or regulatory purposes.
  • Usage Monitoring: Monitoring the day-to-day usage of the system and spotting trends that might lead to problems if they're not addressed.
  • Incident Monitoring: Tracking issues that occur, from initial report through to analysis of possible causes, rectification, consequent software updates, and deployment.
  • Diagnostics Monitoring: Tracing operations and debugging software releases.

IoT Device Ingestion Best Practices: Focus on Security

Securing an Internet of Things (IoT) infrastructure requires a rigorous security-in-depth strategy. Starting from securing data in the cloud, to protecting data integrity while in transit over the public internet, and providing the ability to securely provision devices, each layer builds greater security assurance in the overall infrastructure.

  • Security hardware: If COGS permit, build in security features such as secure and encrypted storage and Trusted Platform Module (TPM) based boot functionality. These features make devices more secure, protecting the overall IoT infrastructure.
  • Choose open source software with care: Open source software provides an opportunity to develop solutions quickly. When choosing open source software, consider the activity level of the community around each component. An active community ensures the software will be supported and that issues will be discovered and addressed; by contrast, obscure and inactive open source software is unlikely to be supported, and issues will most probably go undiscovered.
  • Keep authentication keys safe: During deployment, each device requires device IDs and associated authentication keys generated by the cloud service. Keep these keys physically safe even after deployment; any compromised key can be used by a malicious device to masquerade as an existing device (see the provisioning sketch after this list).
  • Audit frequently: Auditing IoT infrastructure for security related issues is key when responding to security incidents. Most operating systems, such as Windows 10 (IoT and other SKUs), provide built-in event logging that should be reviewed frequently to make sure no security breach has occurred. Audit information can be sent as a separate telemetry stream to the cloud service and analyzed.
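
The per-device keys referenced above are created when a device is registered in the IoT Hub identity registry. Below is a minimal sketch using the Azure IoT service SDK (Microsoft.Azure.Devices); the hub connection string and device ID are placeholders.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Devices;   // Azure IoT service SDK (Microsoft.Azure.Devices)

class ProvisionDevice
{
    static async Task Main()
    {
        // Placeholder iothubowner connection string - keep it secret and rotate regularly.
        var registryManager = RegistryManager.CreateFromConnectionString(
            "HostName=<your-hub>.azure-devices.net;SharedAccessKeyName=iothubowner;SharedAccessKey=<key>");

        // Register a device; IoT Hub generates and stores its symmetric keys.
        var device = await registryManager.AddDeviceAsync(new Device("sensor-007"));

        // Hand only this device-scoped key to the device - never the hub-level key.
        Console.WriteLine($"Primary key: {device.Authentication.SymmetricKey.PrimaryKey}");
    }
}
```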

Cross-validation (CV) technique for AzureML training models:

Leverage CV as a technique to assess how well a model trained on a known dataset will generalize to data on which it has not been trained. The general idea is that the model is trained on a dataset of known data and the accuracy of its predictions is then tested against an independent dataset. A common implementation divides a dataset into K folds and then trains the model in round-robin fashion on all but one of the folds, validating against the held-out fold each time.
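
The fold-assignment logic itself is simple. The following generic sketch, independent of any ML framework, produces the K train/validate index splits described above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CrossValidation
{
    // Returns, for each of the K rounds, the indices of the training and validation rows.
    public static IEnumerable<(int[] Train, int[] Validate)> KFoldSplits(int rowCount, int k)
    {
        var indices = Enumerable.Range(0, rowCount).ToArray();
        for (int fold = 0; fold < k; fold++)
        {
            // Round-robin assignment: row i belongs to fold (i % k).
            var validate = indices.Where(i => i % k == fold).ToArray();
            var train    = indices.Where(i => i % k != fold).ToArray();
            yield return (train, validate);
        }
    }
}

// Usage: average the validation accuracy over all K rounds to estimate
// how the model will generalize, e.g.:
//   foreach (var (train, validate) in CrossValidation.KFoldSplits(rows, k: 10)) { ... }
```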

Limitations and External Components

Microsoft Azure SQL DB Limitations

Microsoft Azure SQL Database does not support SQL Server Agent or jobs. You can, however, run SQL Server Agent on your on-premises SQL Server and connect to Microsoft Azure SQL Database. In addition to the general limitations outlined in this article, SQL Database has specific resource quotas and limitations based on your service tier. For an overview of service tiers, see SQL Database service tiers.
  • For other SQL Database limits, see Azure SQL Database Resource Limits.
  • For security related guidelines, see Azure SQL Database Security Guidelines and Limitations.
  • Another related area surrounds the compatibility that Azure SQL Database has with on-premises versions of SQL Server, such as SQL Server 2014 and SQL Server 2016. The latest V12 version of Azure SQL Database has made many improvements in this area. For more details, see What's new in SQL Database V12.

Apache Storm vs. Azure Stream Analytics

Get guidance on choosing a cloud analytics platform by comparing Apache Storm with Azure Stream Analytics. Understand the value propositions of Stream Analytics versus Apache Storm as a managed service on Azure HDInsight, so you can choose the right solution for your business use cases.

While both analytics platforms provide the benefits of a PaaS solution, a few major capabilities distinguish them. The capabilities and limitations of each service are summarized in the table below to help you land on the solution you need to achieve your goals.

| Capability | Azure Stream Analytics | Apache Storm on HDInsight |
|---|---|---|
| Open source | No; Azure Stream Analytics is a Microsoft proprietary offering. | Yes; Apache Storm is an Apache-licensed technology. |
| Price | Priced by volume of data processed and the number of streaming units required (per hour the job is running). | The unit of purchase is cluster-based; charged for the time the cluster is running, independent of jobs deployed. |
| SQL DSL | Yes; easy-to-use SQL language support is available. | No; users must write code in Java or C#, or use Trident APIs. |
| Input data sources | Azure Event Hubs and Azure Blobs. | Connectors are available for Event Hubs, Service Bus, Kafka, etc. Unsupported connectors may be implemented via custom code. |
| Input data formats | Avro, JSON, CSV. | Any format may be implemented via custom code. |
| Outputs | A streaming job may have multiple outputs. Supported outputs: Azure Event Hubs, Azure Blob Storage, Azure Tables, Azure SQL DB, and Power BI. | Many outputs are supported in a topology, and each output may have custom logic for downstream processing. Out of the box, Storm includes connectors for Power BI, Azure Event Hubs, Azure Blob Storage, Azure DocumentDB, SQL, and HBase. Unsupported connectors may be implemented via custom code. |
| Data encoding formats | Requires UTF-8. | Any data encoding format may be implemented via custom code. |
| Scalability | Determined by the number of streaming units for each job; each streaming unit processes up to 1 MB/s, with a default maximum of 50 units (call to increase the limit). | Determined by the number of nodes in the HDInsight Storm cluster; there is no limit on the number of nodes (the top limit is defined by your Azure quota; call to increase it). |
| Data processing limits | Users scale the number of streaming units up or down to increase data processing or optimize costs. | Users scale the cluster size up or down to meet needs; scales up to 1 GB/s. |
| Stop/Resume | Stops and resumes from the last place stopped. | Stops and resumes from the last place stopped, based on the watermark. |

Conclusion

CSP partners looking to accelerate their time to market by embracing the "cloud-first" economy should consider leveraging the Microsoft Azure Data Platform to construct offers that deliver rich, captivating and immersive customer experiences. The Microsoft Azure Data Platform offers a continuum of rich capabilities that partners can take full advantage of. Given the trend towards data explosion, there is a significant business opportunity for MSPs, and a tremendous competitive edge to be gained by harnessing the power of data, insights and analytics. By complementing traditional batch-based analytics implementations with real-time analytics, and unifying the data visualization experience into a single dynamic interface, companies can empower their staff to decide and act on information faster.

While this document covers technical guidance for CSP partners, refer to the "Business Guidance" document for details on the business aspects of the industry vertical use case scenarios described above.