This document is an addendum to the Business Guidance document. The Microsoft Cloud Solution Provider (CSP) program enables partners to sell Microsoft Online services to customers along with value-added services such as assessment, consulting, deployment, migration, monitoring, support, and billing. As part of the CSP program, partners can create tenants, provide them with Azure subscriptions, and manage various services on behalf of the customer.
This document provides technical guidance for CSP partners building end-to-end solutions for industry verticals, including reference architectures that leverage the Microsoft Azure Data Platform. It also highlights specific high-value opportunities for CSP partners across industry verticals such as Healthcare, Retail, and Hospitality, and the benefits of choosing the Microsoft Azure Data Platform. The three key trends this document capitalizes on are:
While these solutions may be built with components on public, private, or hybrid Azure clouds, the core guidance focuses on public cloud Microsoft Azure implementations.
This section provides technical details on reference architecture for the various use case scenarios for 3 industry verticals – Healthcare, Retail and Hospitality/Hotel sector.
In the Healthcare sector, a huge amount of data is amassed across EHR (Electronic Health Records) and EMR (Electronic Medical Records) systems, and opportunities for CSP partners are abundant. According to the CDC, chronic diseases affect 52% of Americans and consume 86% of healthcare costs. The industry trend is to enable diagnostic efficiencies and the best patient outcomes at the lowest possible cost. According to Gartner, healthcare IT is undergoing a transformational shift toward "Personalized Care". This trend presents a significant opportunity to offer both managed services and value-add services that lower TCO by reducing inefficiencies, thereby reducing healthcare costs overall.
According to the CDC, diabetes is the leading cause of kidney failure and increases the risk of cardiovascular disease, neuropathy, retinopathy, and various other health complications.
Type 2 diabetes affected 28 million Americans in 2012 and makes up 90% of diabetes cases. It is primarily due to a sedentary lifestyle in patients who are genetically predisposed.
Scenario 1: Diabetes Type 2 Prediction: Using the electronic health record (EHR) data set containing patient records from all 50 states in the United States, predict –
Machine learning with AzureML can be used to predict diabetes type 2 – Microsoft Data Platform provides a comprehensive suite of libraries that enables powerful supervised and unsupervised learning algorithms to aid predictive analytics.
Scenario 2: Remote Patient Monitoring with Continuous Glucose Monitoring (CGM): The Internet of Things (IoT) powers a new class of scenarios around Remote Patient Monitoring. CGM is a new trend in which devices and sensors are implanted subcutaneously to monitor glucose continuously, so that timely medical aid can be provided for episodes such as hypoglycemia. A cloud-driven CGM device is a small wearable that tracks the patient's glucose levels throughout the day and night. It helps detect hypoglycemic or hyperglycemic incidents and alerts the patient to highs and lows so that timely corrective action can be taken. CGM minimizes the guesswork that comes from making decisions based solely on a single blood glucose meter reading, supports better diabetes management, and keeps the physician informed through remote monitoring.
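The alerting behavior described in this scenario can be sketched in a few lines of Python. This is only an illustration of threshold-based alerting on a stream of readings. The threshold values, the `Reading` type, and the message format are assumptions made for the example, not clinical guidance or part of any Azure service.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical alert thresholds in mg/dL, chosen only for illustration.
LOW_MG_DL = 70
HIGH_MG_DL = 180


@dataclass
class Reading:
    minute: int            # minutes since monitoring started
    glucose_mg_dl: float   # sensor value


def classify(reading: Reading) -> Optional[str]:
    """Return 'LOW', 'HIGH', or None when the reading is in range."""
    if reading.glucose_mg_dl < LOW_MG_DL:
        return "LOW"
    if reading.glucose_mg_dl > HIGH_MG_DL:
        return "HIGH"
    return None


def alerts(readings: List[Reading]) -> List[str]:
    """Emit one alert each time the state changes to LOW or HIGH."""
    messages = []
    previous = None
    for reading in readings:
        state = classify(reading)
        if state is not None and state != previous:
            messages.append(
                f"{state} glucose at minute {reading.minute}: "
                f"{reading.glucose_mg_dl} mg/dL"
            )
        previous = state
    return messages


stream = [Reading(0, 100), Reading(5, 65), Reading(10, 60),
          Reading(15, 110), Reading(20, 190)]
for message in alerts(stream):
    print(message)
```

Alerting only on state changes avoids repeating the same alert for every reading during a single episode. A production system would also need to handle sensor gaps and noise.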
The first step in building the reference architecture is to identify the various "data silos" that exist across industry sectors. For instance, in Healthcare, Electronic Medical Records (EMR) and Electronic Health Records (EHR) straddle hybrid cloud boundaries. The Microsoft Data Platform gives CSP partners a unique opportunity to bridge "data silos" across cloud boundaries and offer managed Data Quality, Data Movement (using Azure for backup, or running secondary replica services in Azure for mission-critical workloads), and Data Management services. This allows partners to build a pipeline of data movement and data migration automation for long-term retention, reducing the need to expand cold storage on-premises.
Microsoft Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. Just like a manufacturing factory that runs equipment to take raw materials and transform them into finished goods, Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.
Data Factory works across on-premises and cloud data sources and SaaS to ingest, prepare, transform, analyze, and publish your data. Use Data Factory to compose services into managed data flow pipelines to transform your data using services like Azure HDInsight (Hadoop) and Azure Batch for your big data computing needs, and with Azure Machine Learning to operationalize your analytics solutions. Go beyond just a tabular monitoring view, and use the rich visualizations of Data Factory to quickly display the lineage and dependencies between your data pipelines. Monitor all of your data flow pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.
Azure Data Factory enables you to compose data movement and data processing tasks as a data driven workflow. You will learn how to build your first pipeline that uses HDInsight to transform and analyze web logs on a monthly basis and the steps you will be performing are as follows:
1. Create the data factory. A data factory can contain one or more data pipelines that move and process data.
2. Create the linked services. You create a linked service to link a data store or a compute service to the data factory. A data store such as Azure Storage holds input/output data of activities in the pipeline. A compute service such as Azure HDInsight processes/transforms data.
3. Create input and output datasets. An input dataset represents the input for an activity in the pipeline and an output dataset represents the output for the activity.
4. Create the pipeline. A pipeline can have one or more activities such as Copy Activity to copy data from a source to a destination (or) HDInsight Hive Activity to transform input data using Hive script to produce output data. This sample uses the HDInsight Hive activity that runs a Hive script. The script first creates an external table that references the raw web log data stored in Azure blob storage and then partitions the raw data by year and month.
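To make the transformation in step 4 concrete, the following plain-Python sketch mimics locally what the Hive script does conceptually: it groups raw web log lines by year and month. It does not use Data Factory, HDInsight, or Hive. The comma-separated log format and the `partition_by_month` helper are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple


def partition_by_month(log_lines: List[str]) -> Dict[Tuple[int, int], List[str]]:
    """Group log lines by (year, month).

    Assumes each line starts with an ISO date (YYYY-MM-DD) followed by a comma.
    """
    partitions = defaultdict(list)
    for line in log_lines:
        date_text = line.split(",", 1)[0]
        stamp = datetime.strptime(date_text, "%Y-%m-%d")
        partitions[(stamp.year, stamp.month)].append(line)
    return dict(partitions)


# Hypothetical sample of raw web log lines.
sample = [
    "2014-01-03,GET,/index.html",
    "2014-01-17,GET,/about.html",
    "2014-02-01,POST,/login",
]

for (year, month), lines in sorted(partition_by_month(sample).items()):
    print(f"year={year}/month={month:02d}: {len(lines)} line(s)")
```

In the pipeline described above, the same grouping is expressed as a partitioned Hive table over data in Azure blob storage, so it scales beyond what fits in memory on one machine.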
Prerequisites:
Refer to the link to create the Azure Data Factory pipeline using the Azure Data Factory Editor.
Azure Data Factory has a few key entities that work together to define the input and output data, processing events, and the schedule and resources required to execute the desired data flow.
Figure: Relationships between Dataset, Activity, Pipeline, and Linked service
Activities: Activities define the actions to perform on your data. Each activity takes zero or more datasets as inputs and produces one or more datasets as outputs. An activity is a unit of orchestration in Azure Data Factory. For example, you may use a Copy activity to orchestrate copying data from one dataset to another. Similarly, you may use a Hive activity which will run a Hive query on an Azure HDInsight cluster to transform or analyze your data. Azure Data Factory provides a wide range of data transformation, analysis, and data movement activities.
Pipelines: Pipelines are a logical grouping of Activities. They are used to group activities into a unit that together performs a task. For example, a sequence of several transformation Activities might be needed to cleanse log file data. This sequence could have a complex schedule and dependencies that need to be orchestrated and automated. All of these activities could be grouped into a single Pipeline, for instance, "CleanLogFiles". "CleanLogFiles" could then be deployed, scheduled, or deleted as one single unit instead of managing each individual activity independently.
Datasets: Datasets are named references/pointers to the data you want to use as an input or an output of an Activity. Datasets identify data structures within different data stores including tables, files, folders, and documents.
Linked service: Linked services define the information needed for Data Factory to connect to external resources. Linked services are used for two purposes in Data Factory:
With the four simple concepts of datasets, activities, pipelines and linked services, you can build your pipeline from the ground up.
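The relationships between the four concepts can be modeled with a few Python dataclasses. This is a conceptual sketch only: the class names mirror the terms used above, but the fields and the `linked_services` helper are hypothetical and do not correspond to the Data Factory API or its JSON definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkedService:
    name: str
    kind: str  # e.g. "AzureStorage" or "HDInsight" (illustrative labels)


@dataclass
class Dataset:
    name: str
    linked_service: LinkedService  # where the data lives


@dataclass
class Activity:
    name: str
    kind: str  # e.g. "Copy" or "Hive"
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    activities: List[Activity] = field(default_factory=list)

    def linked_services(self) -> List[str]:
        """Names of every linked service the pipeline's datasets depend on."""
        names = set()
        for activity in self.activities:
            for dataset in activity.inputs + activity.outputs:
                names.add(dataset.linked_service.name)
        return sorted(names)


storage = LinkedService("StorageLinkedService", "AzureStorage")
raw_logs = Dataset("RawWebLogs", storage)
clean_logs = Dataset("PartitionedWebLogs", storage)
hive = Activity("PartitionLogs", "Hive", inputs=[raw_logs], outputs=[clean_logs])
pipeline = Pipeline("CleanLogFiles", [hive])
print(pipeline.linked_services())
```

The model shows the direction of the dependencies: a pipeline groups activities, activities read and write datasets, and each dataset points to the linked service that holds its data.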
Solution Building Blocks: The following are the key solution building blocks for the above reference architecture:
For Type 2 diabetes prediction, the data is used to predict a category, so a classification-based machine learning model is used. Classification is a form of supervised learning, in which the model learns from labeled examples. In this case, we want to predict whether or not a patient has the risk factors to be diagnosed with diabetes. When there are only two choices, this is called two-class or binomial classification. When there are more categories, as when predicting the winner of the NCAA March Madness tournament, the problem is known as multi-class classification.
The National Institutes of Health (NIH) conducted a study using machine learning to predict Type 2 diabetes. In the past 3 years, if:
→ the patient will have type 2 diabetes diagnosis within the next year.
Such predictions, based on a historical training set, can be made with Azure ML, which offers powerful predictive analytics capabilities with machine learning models such as the Two-Class Bayes Point Machine.
As an illustrative example, we use the Pima Indians Diabetes dataset (a sample dataset available in ML Studio). It contains EMR data for female patients, including medical data and lifestyle factors:
The Two-Class Bayes Point Machine is a powerful linear classifier, shown here for illustrative purposes. To learn more about this model, refer to the MSDN article.
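To show what a two-class linear classifier does, here is a minimal sketch in plain Python. It does not implement the Bayes Point Machine. It uses a perceptron, a different and much simpler linear classifier, purely to illustrate learning a linear decision boundary from labeled examples. The feature values are made up and are not drawn from the Pima Indians dataset.

```python
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (features, label), label is 0 or 1


def train_perceptron(samples: List[Sample], epochs: int = 100,
                     lr: float = 0.1) -> Tuple[List[float], float]:
    """Train a perceptron and return (weights, bias)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if score > 0 else 0
            error = label - predicted
            if error != 0:
                # Move the boundary toward the misclassified example.
                mistakes += 1
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights, bias


def predict(weights: List[float], bias: float, features: List[float]) -> int:
    """Return 1 (at risk) or 0 (not at risk) for one patient."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Made-up, scaled features: (glucose level, body mass index); label 1 = at risk.
data = [
    ([0.2, 0.3], 0), ([0.3, 0.2], 0), ([0.25, 0.35], 0),
    ([0.8, 0.7], 1), ([0.9, 0.8], 1), ([0.75, 0.9], 1),
]
weights, bias = train_perceptron(data)
print([predict(weights, bias, features) for features, _ in data])
```

Because the toy data is linearly separable, the perceptron finds a boundary that classifies every training example correctly. Real patient data is noisier, which is one reason to prefer a probabilistic model and to validate on held-out data.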
Data Management Gateway, a component of Azure Data Factory, is agent software that you install on-premises to connect on-premises data sources to cloud services for consumption. With Data Management Gateway, you can:
Data Management Gateway has a full range of on-premises data connection capabilities.
Overall, here's how the solution works end-to-end:
The Retail industry vertical presents a plethora of opportunities in terms of both Hosting/Managed and Value-add service offerings. According to IDC Research, success or failure in retail will be driven by an effective and profitable consumer experience. Product availability, information access, promotional relevancy, ease of checkout, and convenience are now key components of a competitive and differentiated Omni-channel strategy.
Retailers are thus on a mission to bridge the "brick-2-click" transition by offering a rich and immersive shopping experience that leverages mobile and social strategy, taking customer experience to a whole new level. This Omni-channel strategy presents an incredible opportunity for MSPs to grow their business many times over and cash in on this mega-trend.
In 2014, we saw more merchants venture into Omni-channel retailing and try in-store marketing solutions such as beacons to enrich the shopping experience. We anticipate stores to double down on these strategies and continue to find ways to bridge the gap between offline and digital channels.
Scenario: Immersive Shopping Experience: Retailers are on a mission to enable a rich, immersive shopping experience that leverages mobile and social strategy at all points of contact. Giving customers real-time, relevant information on promotions based on their location, personal preferences, and social influencers is key to delivering a modern "e-tailer" experience and opening new sources of revenue.
The above Reference Architecture shows how easy it is to integrate with third-party services such as Apache Storm and Apache Mahout. These services, and how the solution works end-to-end, are described below.
Solution Building Blocks
Apache Mahout to offer personalized recommendations: Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.
Apache Storm for real-time event processing: Apache Storm is a distributed, fault-tolerant, open source system for processing event streams in real time.
Creating a recommendation model is usually enough to allow the system to provide recommendations. In this case, Apache Mahout machine learning was used to build the recommendation engine. Customer information, customer behavior, and product information were the inputs to the engine, which offered personalized product recommendations based on each customer's demographic information and the social trends matching that demographic.
Nevertheless, recommendation quality varies based on the usage processed and the coverage of the catalog. For example, if you have a lot of cold items (items without significant usage), the system will have difficulties providing a recommendation for such an item or using such an item as a recommended one. In order to overcome the cold item problem, the system allows the use of metadata of the items to enhance the recommendations. This metadata is referred to as features.
Features can enhance the recommendation model, but to do so requires the use of meaningful features. For this purpose, a new build was introduced - a rank build. This build will rank the usefulness of features. A meaningful feature is a feature with a rank score of 2 and up. After understanding which of the features are meaningful, trigger a recommendation build with the list (or sub-list) of meaningful features.
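The feature selection step described above can be sketched as a simple filter. The threshold of 2 comes from the text. The feature names, the scores, and the `meaningful_features` helper are hypothetical stand-ins for the output of a rank build.

```python
from typing import Dict, List

MEANINGFUL_RANK = 2.0  # from the text: a rank score of 2 and up is meaningful


def meaningful_features(rank_scores: Dict[str, float]) -> List[str]:
    """Return features whose rank score meets the threshold, best first."""
    kept = [(score, name) for name, score in rank_scores.items()
            if score >= MEANINGFUL_RANK]
    # Sort by descending score, then by name so ties are deterministic.
    return [name for score, name in sorted(kept, key=lambda p: (-p[0], p[1]))]


# Hypothetical output of a rank build for a retail catalog.
scores = {"brand": 3.4, "category": 2.0, "color": 1.2, "sku_prefix": 0.3}
print(meaningful_features(scores))  # ['brand', 'category']
```

The resulting list (or a sub-list of it) is what would be passed to the subsequent recommendation build.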
A sample movie recommendation service that takes a similar approach with Apache Mahout is described in the Azure documentation. CSP partners can use it to learn how to use the Apache Mahout machine learning library with Azure HDInsight to build a recommendation engine.
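For readers new to recommendation engines, the core idea of using past behavior to suggest items can be sketched without Hadoop or Mahout. The example below is not Mahout's algorithm. It is a minimal item co-occurrence recommender in plain Python, and the shopping baskets and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set


def build_cooccurrence(baskets: Iterable[List[str]]) -> Dict[str, Counter]:
    """Count how often each pair of items appears in the same basket."""
    counts: Dict[str, Counter] = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def recommend(counts: Dict[str, Counter], owned: Set[str], top_n: int = 2) -> List[str]:
    """Recommend items that most often co-occur with what the customer owns."""
    scores: Counter = Counter()
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    ranked = sorted(scores.items(), key=lambda p: (-p[1], p[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase history.
baskets = [
    ["tent", "stove", "lantern"],
    ["tent", "stove"],
    ["tent", "boots"],
    ["stove", "lantern"],
]
counts = build_cooccurrence(baskets)
print(recommend(counts, {"tent"}))  # ['stove', 'boots']
```

An item that never appears in any basket receives no score here, which is the cold item problem in miniature and the reason item metadata (features) is used to supplement usage data.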
Azure IoT Hub addresses the device-connectivity challenges in the following ways:
Overall, here's how the solution works end-to-end:
In the Hospitality/Hotel sector, the data sprawl created by Internet of Things (IoT) scenarios presents a huge opportunity for CSP partners in terms of both Hosting/Managed and Value-add service offerings. Gartner predicts that 25 billion connected devices will exist by the year 2020, and this explosion of data leads to data ingestion growing at a CAGR of 41% between 2013 and 2020.
The mission is to transform hotel information systems (HIS) into Holistic Insightful Services that offer an "M"-Powered Guest Experience. The explosion of data and the associated big data challenges present an incredible opportunity for MSPs to grow their business many times over and cash in on this mega-trend. They also give hotels a tremendous opportunity to provide a unique experience to their guests, using predictive analytics and a connected environment before, during, and after their stay.
Scenario: The mission is to transform hotel information systems (HIS) into Holistic Insightful Services that offer an "M"-Powered Guest Experience. The reimagined HIS experience would combine feeds from relevant third-party providers with traditional hotel information systems to offer a tailored guest experience.
Microsoft Azure Data Platform can be leveraged to develop a data collection and analysis platform that generates reports based on spend analysis and social weightage to offer tailored loyalty rewards thereby enhancing customer stickiness. The key is to ensure a truly informed, cross-channel, and personal experience.
Solution Building Blocks
Overall, here's how the solution works end-to-end:
The Microsoft Data Platform offers a comprehensive suite of PaaS (Platform-as-a-Service) offerings that enable CSP partners to create end-to-end solutions. It provides unique advantages from a go-to-market perspective, with quicker time to market and reduced infrastructure costs. Microsoft Azure PaaS offerings enable CSP partners to create differentiated, value-add, end-to-end (E2E) solutions specific to an industry vertical. Partners can bring solutions to market at a significantly reduced cost while retaining the ability to craft the customer's data models and unlock richer and deeper insights through analytics and integration with Internet of Things (IoT) capabilities.
PaaS provides built in scalability and elasticity to provide efficiency and resiliency to enable scale-out scenarios such as "cloud burst". PaaS helps CSP partners in the rapid construction of E2E solutions to integrate custom workflow essential to the creation of a targeted business application.
PaaS allows you to rent converged platform services for the period during which they will be used. It changes the cost structure for an enterprise from capital expense (CapEx) to operational expense (OpEx), and it reduces the TCO (Total Cost of Ownership) of deploying and running the application. From a pricing perspective, the pay-per-use (pay-as-you-go) model is effective in reducing unnecessary CapEx build-out costs: there is no need to buy the software, middleware, or a full-year license, because you pay on the basis of usage.
PaaS is a good match for agile software development methodologies. Agile development is iterative and incremental, and the software and middleware platforms it needs may change from one iteration to the next as the work progresses, which is why PaaS suits it well.
PaaS provides the services required to support the complete life cycle of building and delivering web applications and services on the internet, allowing you to build, deploy, test, host, and maintain applications. This enables you to deliver solutions faster, reduce time to market, and iterate effectively through the build-measure-learn cycle. By offering a user-friendly "plug-and-play" interface, PaaS simplifies the work of application developers and eliminates time-consuming configuration tasks.
Microsoft Power BI transforms your company's data into rich visuals for you to collect and organize so you can focus on what matters to you. Stay in the know, spot trends as they happen, and push your business further – the possibilities are endless with richer and deeper insights.
Thus, the Microsoft Data Platform enables CSP partners to focus their efforts on building end-to-end solutions that generate recurring revenue streams, and allows partners to create differentiated offers that drive customer adoption and retention.
This section describes best practices to follow and highlights critical areas to focus on while building out the reference architectures described above:
Good Design -- Worth the Effort
Any large-scale application design takes careful thought, planning, and potentially complex implementation. For Microsoft Azure, one of the most fundamental design principles is scale-out. Rather than invest in increasingly more powerful (and expensive) hardware, a scale-out strategy responds to increasing demand by adding more machines or service instances.
Note that SQL Database is just a very obvious example where partitioning improves scalability. But to maximize the strengths of the platform, other roles and services must scale out in a similar way. For example, storage accounts have an upper bound on the rate of transactions, virtual machines have an upper bound on CPU and memory; maximum scale is achieved by designing for the use of multiple storage accounts and for services whose components scale out across virtual machines of set sizes.
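One common way to spread load across multiple storage accounts is to map each partition key to an account with a stable hash. The sketch below illustrates that idea in plain Python. The account names and the `pick_account` helper are hypothetical, and no Azure SDK calls are made.

```python
import hashlib
from typing import List


def pick_account(partition_key: str, accounts: List[str]) -> str:
    """Map a key to one storage account using a stable hash.

    The same key always maps to the same account for a given account list.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(accounts)
    return accounts[index]


# Hypothetical storage account names.
accounts = ["logsstore00", "logsstore01", "logsstore02", "logsstore03"]
for key in ["tenant-a", "tenant-b", "tenant-c"]:
    print(key, "->", pick_account(key, accounts))
```

With this simple modulo scheme, changing the number of accounts remaps most keys, so the account count should be planned up front or replaced with a lookup table or consistent hashing if it is expected to grow.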
Design for Cloud Scalability
Although scalability is a driving force behind design, there are other critically important design considerations. The paper stresses that you must plan for telemetry and diagnostic data collection, which becomes increasingly important as your solution becomes more componentized and partitioned. Availability and business continuity are two other major areas of focus throughout the paper. Scalability is irrelevant when your service goes down or irretrievably loses data.
Monitoring Scenarios
Monitoring a large-scale distributed system poses a significant challenge. Each of the scenarios described in the previous section should not necessarily be considered in isolation. There is likely to be a significant overlap in the monitoring and diagnostic data that's required for each situation, although this data might need to be processed and presented in different ways. For these reasons, you should take a holistic view of monitoring and diagnostics.
You can use monitoring to gain an insight into how well a system is functioning. Monitoring is a crucial part of maintaining quality-of-service targets. Common scenarios for collecting monitoring data include:
IoT Device Ingestion Best Practices: Focus on Security
Securing an Internet of Things (IoT) infrastructure requires a rigorous security-in-depth strategy. Starting from securing data in the cloud, to protecting data integrity while in transit over the public internet, and providing the ability to securely provision devices, each layer builds greater security assurance in the overall infrastructure.
Cross-validation (CV) technique for AzureML training models:
Use cross-validation (CV) to assess how well a model trained on a known set of data will generalize to data on which it has not been trained. The general idea is that a model is trained on a dataset of known data, and the accuracy of its predictions is then tested against an independent dataset. A common implementation is to divide a dataset into K folds and then train the model in round-robin fashion on all but one of the folds, using the held-out fold for testing.
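The K-fold procedure can be sketched as an index generator in plain Python. This is an illustration of the splitting logic only, not the Azure ML cross-validation module. The function name is hypothetical, and the sketch does not shuffle the data, which you would normally do first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Each sample appears in exactly one test fold; fold sizes differ by at most 1.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train", train, "test", test)
```

A model is trained once per fold on the `train` indices and scored on the `test` indices. Averaging the K scores gives a more stable estimate of how the model generalizes than a single train/test split.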
Get guidance choosing a cloud analytics platform by using this Apache Storm comparison to Azure Stream Analytics. Understand the value propositions of Stream Analytics versus Apache Storm as a managed service on Azure HDInsight, so you can choose the right solution for your business use cases.
Although both analytics platforms provide the benefits of a PaaS solution, a few major capabilities differentiate them. The capabilities and limitations of these services are listed below to help you choose the solution you need to achieve your goals.
Storm comparison to Stream Analytics:
| Feature | Azure Stream Analytics | Apache Storm on HDInsight |
|---|---|---|
| Open Source | No, Azure Stream Analytics is a Microsoft proprietary offering. | Yes, Apache Storm is an Apache licensed technology. |
| Price | Stream Analytics is priced by volume of data processed and the number of streaming units (per hour the job is running) required. | For Apache Storm on HDInsight, the unit of purchase is cluster-based, and is charged based on the time the cluster is running, independent of jobs deployed. |
| Capabilities: SQL DSL | Yes, easy-to-use SQL language support is available. | No, users must write code in Java or C#, or use Trident APIs. |
| Input Data Sources | The supported input sources are Azure Event Hubs and Azure Blobs. | There are connectors available for Event Hubs, Service Bus, Kafka, etc. Unsupported connectors may be implemented via custom code. |
| Input Data Formats | Supported input formats are Avro, JSON, CSV. | Any format may be implemented via custom code. |
| Outputs | A streaming job may have multiple outputs. Supported outputs: Azure Event Hubs, Azure Blob Storage, Azure Tables, Azure SQL DB, and Power BI. | Support for many outputs in a topology; each output may have custom logic for downstream processing. Out of the box, Storm includes connectors for Power BI, Azure Event Hubs, Azure Blob Store, Azure DocumentDB, SQL, and HBase. Unsupported connectors may be implemented via custom code. |
| Data Encoding Formats | Stream Analytics requires UTF-8 data format to be utilized. | Any data encoding format may be implemented via custom code. |
| Scalability | Number of Streaming Units for each job. Each Streaming Unit processes up to 1 MB/s. Max of 50 units by default. Call to increase limit. | Number of nodes in the HDI Storm cluster. No limit on number of nodes (top limit defined by your Azure quota). Call to increase limit. |
| Data Processing Limits | Users can scale up or down number of Streaming Units to increase data processing or optimize costs. Scale up to 1 GB/s. | User can scale up or down cluster size to meet needs. |
| Stop/Resume | Stop and resume from last place stopped. | Stop and resume from last place stopped based on the watermark. |
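The Scalability row lends itself to a quick sizing calculation. The sketch below estimates how many Streaming Units a job would need for a given peak ingest rate, using only the two figures stated in the table (up to 1 MB/s per unit, 50 units by default). The function names are hypothetical, and actual throughput depends on the query and data, so treat the result as a rough upper-bound estimate.

```python
import math

MB_PER_SEC_PER_UNIT = 1.0  # from the table: each Streaming Unit handles up to 1 MB/s
DEFAULT_MAX_UNITS = 50     # from the table: default maximum per job


def streaming_units_needed(peak_mb_per_sec: float) -> int:
    """Smallest whole number of Streaming Units covering the peak rate."""
    if peak_mb_per_sec <= 0:
        raise ValueError("peak_mb_per_sec must be positive")
    return math.ceil(peak_mb_per_sec / MB_PER_SEC_PER_UNIT)


def fits_default_quota(peak_mb_per_sec: float) -> bool:
    """True when the estimate stays within the default unit limit."""
    return streaming_units_needed(peak_mb_per_sec) <= DEFAULT_MAX_UNITS


for rate in (0.4, 12.5, 50.0, 80.0):
    units = streaming_units_needed(rate)
    status = "within default quota" if fits_default_quota(rate) else "needs a limit increase"
    print(f"{rate} MB/s -> {units} unit(s), {status}")
```

For Apache Storm on HDInsight the equivalent sizing exercise is done in cluster nodes rather than Streaming Units, and the cost accrues for as long as the cluster runs.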
CSP partners looking to accelerate their time to market by embracing the "Cloud-First" economy should consider leveraging the Microsoft Azure Data Platform to construct offers that deliver rich, captivating, and immersive customer experiences. The platform offers a continuum of rich capabilities that partners can take full advantage of. Given the trend toward data explosion, there is a significant opportunity for MSPs to build a business value proposition and a competitive edge by harnessing the power of data, insights, and analytics. By complementing traditional batch-based analytics implementations with real-time analytics, and unifying the data visualization experience into a single dynamic interface, companies can empower their staff to decide and act on information faster.
While this document covers the technical guidance for CSP partners, refer to the "Business Guidance" document for details on the use case scenarios across the industry verticals described above.