SQL Server 2016 Real-Time Operational Analytics - Deployment Guide

Overview

Traditionally, businesses have had separate systems for operational (i.e. OLTP) and analytics workloads. For such systems, Extract, Transform, and Load (ETL) jobs regularly move the data from the operational store to an analytics store. The analytics data is usually stored in a data warehouse or data mart dedicated to running analytics queries. While this solution has been the standard, it has these three key challenges:

  • Complexity. Implementing ETL can require considerable coding specially to load only the modified rows. It can be complex to identify which rows have been modified.
  • Cost. Implementing ETL requires the cost of purchasing additional hardware and software licenses.
  • Data Latency. Implementing ETL adds a time delay for running the analytics. For example, if the ETL job runs at the end of each business day, the analytics queries will run on data that is at least a day old. For many businesses this delay is unacceptable because the business depends on analyzing data in real-time. For example, fraud-detection requires real-time analytics on operational data.

SQL Server 2016 introduces Real-time operational analytics, the ability to run both analytics and OLTP workloads on the same database tables at the same time. Besides running analytics in real-time, you can also eliminate the need for ETL and a data warehouse. Real-Time Operational Analytics enables running analytics queries directly on your operational workload using Columnstore indexes thereby eliminating any data latency.

How It Works

Real-time analytics uses an updateable Columnstore index on a rowstore table. The Columnstore index maintains a copy of the data, so the OLTP and analytics workloads run against separate copies of the data. This minimizes the performance impact of both workloads running at the same time. SQL Server automatically maintains index changes so OLTP changes are always up-to-date for analytics. With this design, it is possible and practical to run analytics in real-time on up-to-date data. This works for both disk-based and memory-optimized tables. The image below shows one possible configuration using Analysis Server in Direct Query mode, but you can use any analytics tool or custom solution at your disposal. This solution addresses the drawbacks mentioned above since the analysis happens directly on the operational data.

The schema for OLTP databases is highly normalized (i.e. with minimal data duplication) which when used for analytic queries could lead to poor performance primarily because of complexity of joins between larger number of tables. The significant query speed up made possible by Columnstore index (i.e. NCCI) can overcome the complexity of query and still deliver most analytics queries in few seconds. At this point, it is important to emphasize the analytics query performance with real-time operational analytics will not be as fast as you can get with a dedicated DW but the key benefit is the ability to do analytics in real-time.

Prerequisite and Deployment Considerations

  • A Columnstore index on in-memory table can only be added as part of table creation. You cannot add a Columnstore index on an existing in-memory table. At least this is the case at the current time of development of this deployment guide.
  • A filtered non-clustered Columnstore index is only supported on disk-based tables. It is not supported on memory-optimized tables
  • Identify the tables in your operational schema that contain data required for analytics.
  • A new database cannot support In-Memory OLTP if it is restored from a database that was created before the In-Memory OLTP features became active.
  • When in doubt, you can always run the following T-SQL SELECT to ascertain whether your database supports In-Memory OLTP. A result of 1 means the database does support In-Memory OLTP:

SELECT DatabasePropertyEx(DB_NAME(), 'IsXTPSupported');

Environment

Prerequisites

  1. SQL Server 2016 CTP 3 should be configured on your HSP server.
  2. Your desired database should be created on the server instance.

    NOTE: If you do not have any database you can use sample database Adventureworks2016CTP3.

    For downloading sample database, you can refer to steps in 4.1.1.

  3. On the HSP server, sign in to SQL 2016 Management Studio using the windows credentials (default).

    NOTE: For reference, we are using Adventureworks2016CTP3
    database throughout the document.

Download AdventureWorks sample databases and restore it to SQL Server 2016 instance.

  1. Remote logging to the HSP server using local administrator credential.
  2. Open Server Manager (located on the task bar near to start button), click on local server and turn off IE Enhanced Security for Administrators (you can turn it on after the downloads)
  3. Go to http://www.microsoft.com/en-us/download/details.aspx?id=49502 , select and download the AdventureWorks2016CTP3 database back up (.bak) file to a folder.

    Note: The provided link is mapped to SQL 2016 CTP3 version downloads. Look for the latest version.

  4. Open SQL Server 2016 Management Studio and connect to the default instance using Windows credentials.
  5. Right click on the Server and click Restore Database, on the wizard select Device and
    browse for the
    AventureWorks2016CTP3.bak
    file downloaded. Click OK to start the restore.
  6. Wait (it will take 1 – 2 minutes) for the "Database Adventureworks2016CTP3 restored successfully" message and click OK
  7. Right click on the database restored and click on New Query, One the SQL Query window issue the below command and verify result set is returned successfully:

Select top 10 * from Sales.Customer

Post Condition

SQL Server 2016 CTP3 is up and running in the HSP server.

If you have downloaded the sample database then from SQL management studio, you can see the AdventureWorks2016CTP3 database created on server.

Figure 3 Customer records displayed through the Management Services

Figure 3 executes a simple SELECT query to retrieve data from the restored database.

Implementing Real-time Analytics on Disk-based Tables

SQL Server 2016 supports real-time operational analytics both for disk-based (i.e. traditional tables) and memory-optimized tables without requiring any changes to your transactional workload. For disk-based tables, you can do it by creating a non-clustered Columnstore index (NCCI).

  • Disk-based (i.e. traditional tables)
  • Memory-optimized tables.

We'll first walkthrough scenarios implementing real time analytics for disk based tables.

Creating Non-Clustered Columnstore Indexes to Perform Real-time Analytics

  1. Identify the tables in your operational schema that contain data required for analytics.
  2. For each table, drop all btree indexes that are primarily designed to speed up existing analytics on your OLTP workload. Replace them with a single Columnstore index. This can improve the overall performance of your OLTP workload since there will be fewer indexes to maintain.
  3. Below script will create an "Orders" table as an example and create a Non-Clustered index on top of it.

Script:

--Creating disk-based Orders Table

create table orders
(
AccountKey int not null,
customername nvarchar (50),
OrderNumber bigint,
PurchasePrice decimal (9,2),
OrderStatus smallint not NULL,
OrderStatusDesc nvarchar (50)
)
--Create a Non Clustered Columnstore Index (NCCI)

CREATE NONCLUSTERED COLUMNSTORE INDEX orders_ncci ON orders (accountkey, customername, purchaseprice, orderstatus, orderstatusdesc)

NOTE: The Columnstore index on an in-memory table allows operational analytics by integrating in-memory OLTP and in-memory Columnstore technologies to deliver high performance for both OLTP and analytics workloads. The Columnstore index on an in-memory table must include all the columns.

  1. This is all you need to do!

Let us now take an example of how NCCI can be used on a regular rowstore table in the context of an Order Management application:

The key table for this application is 'orders' that tracks customer information, purchase price and the order status. It is expected that over time a large number of rows will have the order status as 'Order Received' (i.e. the order has been received by the customer. We don't expect any more changes to the order once it has been received by the customer). So, once the order has been received by the customer, it can be considered that this row is now 'cold'.

  1. Below script will create "orders" table to be analyzed

--Creating disk-based Orders Table

create table orders
(
AccountKey int not null,
customername nvarchar (50),
OrderNumber bigint,
PurchasePrice decimal (9,2),
OrderStatus smallint not NULL,
OrderStatusDesc nvarchar (50)
)

  1. Create a Clustered index on 'OrderStatus'

--Creating Clustered Index on OrderStatus


create
clustered
index orders_ci on orders(OrderStatus)

  1. Now we load 3 million rows with the data pattern that 95% of the orders have already been received by the customer.

-- Loading 3 million rows with the data pattern that 95% of the orders have already been received by the customer.
declare @outerloop int = 0
declare @i int = 0

declare @purchaseprice decimal (9,2)
declare @customername nvarchar (50)
declare @accountkey int
declare @orderstatus smallint
declare @orderstatusdesc nvarchar(50)
declare @ordernumber bigint
while (@outerloop < 3000000)
begin

Select @i = 0

begin tran

while (@i < 2000)

begin

set @ordernumber = @outerloop + @i

set @purchaseprice = rand() * 1000.0

set @accountkey = convert (int, RAND ()*1000)

set @orderstatus = convert (smallint, RAND()*100)

if (@orderstatus >= 5)
set @orderstatus = 5
set @orderstatusdesc =
case @orderstatus
WHEN 0 THEN 'Order Started'
WHEN 1 THEN 'Order Closed'
WHEN 2 THEN 'Order Paid'
WHEN 3 THEN 'Order Fullfillment'
WHEN 4 THEN 'Order Shipped'
WHEN 5 THEN 'Order Received'
END

insert orders values (@accountkey, (convert (varchar (6), @accountkey) + 'firstname'),@ordernumber, @purchaseprice,@orderstatus, @orderstatusdesc)

set @i += 1;
end

commit
set @outerloop = @outerloop + 2000
set @i = 0

end

go

  1. Now let us create a nonclustered columnstore index as follows.

NOTE: It is just a DDL operation and similar to any other btree index that you would create on a rowstore table. No changes to the application needed because the NCCI is updateable in SQL Server 2016 so your transaction workload will continue to run.

--Create a Non Clustered Columnstore Index (NCCI)


CREATE NONCLUSTERED COLUMNSTORE INDEX orders_ncci ON orders (accountkey,
customername, purchaseprice, orderstatus, orderstatusdesc)

  1. Let us load another 200k orders but these orders are not 'yet' received by the customers. Below is the script

--Loading another 200k orders but these orders are not 'yet' received by the customers.
--insert additional 200k rows
declare @outerloop int = 3000000
declare @i int = 0

declare @purchaseprice decimal (9,2)

declare @customername nvarchar (50)

declare @accountkey int

declare @orderstatus smallint

declare @orderstatusdesc nvarchar(50)

declare @ordernumber bigint

while (@outerloop < 3200000)

begin

Select @i = 0

begin tran

while (@i < 2000)

begin

set @ordernumber = @outerloop + @i

set @purchaseprice = rand() * 1000.0

set @accountkey = convert (int, RAND ()*1000)
set @orderstatus = convert (smallint, RAND()*5)
set @orderstatusdesc =

case @orderstatus

WHEN 0 THEN 'Order Started'

WHEN 1 THEN 'Order Closed'

WHEN 2 THEN 'Order Paid'
WHEN 3 THEN 'Order Fulfillment'
WHEN 4 THEN 'Order Shipped'
WHEN 5 THEN 'Order Received'
END
insert orders values (@accountkey, (convert(varchar(6), @accountkey) + 'firstname'),
@ordernumber, @purchaseprice, @orderstatus, @orderstatusdesc)
set @i += 1;
end

commit

set @outerloop = @outerloop + 2000

set @i = 0

end

go

  1. We are done ! Let's now see how it works and improves the performance.

Post Conditions

1. Once we have defined Columnstore indexes on the table(s) identified, we'll now query the table with and without NCCI (Non Clustered Columnstore index) defined on the table to see the performance.

1.1 CPU time when running Query with NCCI defined on Table:

1.2 CPU Time when running the same query without NCCI defined on the table:

NOTE: Note that the query with NCCI ran 25x faster and took 1/60th CPU resources. The query optimizer picks NCCI for analytics queries and provides significant speed up over traditional btree indexes. Also, note that the difference in performance will only widen as you add more data.

2. Now, let's check the details on the Columnstore index.

NOTE: You will see that 3 million rows are compressed while the new 200k rows that were added after defining the non-clustered Columnstore Index on the table are in the delta rowgroup. In SQL Server 2016, the delta rowgroups for Clustered Columnstore Index (CCI) and Non-Clustered Columnstore Index (NCCI) are uncompressed unlike in SQL Server 2014 where delta rowgroup was PAGE compressed.

  1. The Query plan while running the query with NCCI

NOTE: The query optimizer automatically picks Non Clustered ColumnStore Index for analytics queries and provides significant speed up.

Use Filtered Indexes to Improve Query Performance

To minimize the overhead of maintaining a non-clustered Columnstore index on an operational workload, you can use a filtered condition to create a non-clustered Columnstore index only on the warm or slowly changing data.

GOAL: To design the filtered condition and subsequent queries to access frequently changing "hot" data from the btree index, and to access the more stable "warm" data from the Columnstore index.

  • Below script will create a filtered Non clustered Index on Orders table

--Using a filtered condition to separate hot data in a rowstore table

-- From "warm" data in a Columnstore index.

CREATE TABLE
orders_filteredNCCI (
     AccountKey        int not null,
     Customername        nvarchar (50),
     OrderNumber        bigint,
     PurchasePrice        decimal (9,2),
     OrderStatus        smallint not null,
     OrderStatusDesc        nvarchar (50))
-- OrderStatusDesc
-- 0 => 'Order Started'
-- 1 => 'Order Closed'
-- 2 => 'Order Paid'
-- 3 => 'Order Fulfillment Wait'
-- 4 => 'Order Shipped'
-- 5 => 'Order Received'
--Create the Columnstore index with a filtered condition
CREATE NONCLUSTERED COLUMNSTORE INDEX orders_ncci_filteredindex ON orders_filteredNCCI (accountkey, customername, purchaseprice, orderstatus)
where orderstatus = 5

NOTE: The query optimizer will consider, but not always choose, the Columnstore index for the query plan. When the query optimizer chooses the filtered Columnstore index, it transparently combines the rows both from Columnstore index as well as the rows that do not meet the filtered condition to allow real-time analytics.

2. This is all you need to do!

Let us now take an example of how NCCI can be used with predicate to improve query performance:

The example below shows how to use filtered indexes to minimize impact of non-clustered Columnstore index on transactional workload while still delivering analytics in real-time.

Taking the same example as above of Orders Table:

--Create Orders Tables
CREATE TABLE orders_filteredNCCI (
     AccountKey int not null,
     Customername        nvarchar (50),
     OrderNumber        bigint,
     PurchasePrice        decimal (9,2),
     OrderStatus        smallint not null,
     OrderStatusDesc        nvarchar (50))
-- OrderStatusDesc
-- 0 => 'Order Started'
-- 1 => 'Order Closed'
-- 2 => 'Order Paid'
-- 3 => 'Order Fulfillment Wait'
-- 4 => 'Order Shipped'
-- 5 => 'Order Received'
--Creating Btree Indexes
CREATE CLUSTERED INDEX orders_ci_filteredindex ON orders_filteredNCCI(OrderStatus)
-- Loading 3 million rows with the data pattern that 95% of the orders have already been received by the customer.

declare
@outerloop int = 0

declare @i int = 0
declare @purchaseprice decimal (9,2)
declare @customername nvarchar (50)
declare @accountkey int
declare @orderstatus smallint
declare @orderstatusdesc nvarchar(50)
declare @ordernumber bigint
while (@outerloop < 3000000)
begin
Select @i = 0
begin
tran

while (@i < 2000)

begin

set
@ordernumber = @outerloop + @i

set
@purchaseprice = rand() * 1000.0

set
@accountkey = convert (int, RAND ()*1000)
set @orderstatus = convert (smallint, RAND()*100)

if (@orderstatus >=
5)
set @orderstatus = 5
set @orderstatusdesc =
case @orderstatus

WHEN
0 THEN 'Order Started'
WHEN 1 THEN 'Order Closed'
WHEN 2 THEN 'Order Paid'
WHEN 3 THEN 'Order Fulfillment'
WHEN 4 THEN 'Order Shipped'
WHEN 5 THEN 'Order Received'
END

insert
orders values (@accountkey,
(convert (varchar (6), @accountkey) + 'firstname'),@ordernumber, @purchaseprice,@orderstatus, @orderstatusdesc)
set @i += 1;

end

commit
--Create the columnstore index with a filtered condition
CREATE NONCLUSTERED COLUMNSTORE INDEX orders_ncci_filteredindex ON orders_filteredNCCI (accountkey, customername, purchaseprice, orderstatus)
where orderstatus = 5
--Load another 200k orders but these orders are not 'yet' received by the customers.
-- in the same way as above

NOTE: To minimize the overhead of maintaining a non-clustered columnstore index on an operational workload, you can use a filtered condition to create a non-clustered columnstore index only on the warm or slowly changing data. For example, in an order management application, you can create a non-clustered columnstore index on the orders that have already been shipped. Once the order has shipped, it rarely changes and therefore can be considered warm data. With Filtered index, the data in non-clustered columnstore index requires fewer updates thereby lowering the impact on transactional workload.

Post Conditions

1. Once we have defined Columnstore indexes with a filtered condition, we'll now query the table to see the performance improvement.

Run the below query on Orders table:

-- The following query returns the total purchase done by customers for items > $100 .00
SELECT top 5 customername, sum (PurchasePrice)
FROM orders_filteredNCCI
WHERE purchaseprice > 100.0
Group By customername

RESULT:

The analytics query will execute with the following query plan. Also, the rows not meeting the filtered condition are accessed through clustered btree index.

NOTE: SQL Server uses the clustered index to quickly scan the rows that did not meet the filtered condition. Without this clustered index, a full table scan of the rowstore table will be required to find these rows which can negatively impact the performance of analytics query significantly. In the absence of clustered index, you could create a complementary filtered non-clustered btree index to identify such rows but it is not recommended because accessing large range of rows through non-clustered btree indexes is expensive.

Implementing Real-time Analytics on Memory-Optimized tables

SQL Server 2016 combines memory-optimized table for extreme OLTP and in-memory analytics using columnstore to deliver real-time operational analytics to customers. You can achieve it with no changes to your operational workload.

Following are the key differences from NCCI:

  • There is no explicit delta rowgroup. The new rows are only inserted into the memory-optimized table. The rows that are in memory-optimized table but not in columnstore are referred to as the 'tail' or the 'virtual delta rowgroup'. In-memory OLTP tracks the rows in the 'tail' efficiently using dedicated memory allocator.
    Ideally, you would like to keep 'hot' rows (i.e. the ones that are being change frequently) in the tail.
  • You can't create a filtered columnstore index.
  • The columnstore index is persisted but it is required to be fully resident in memory. It will take additional memory but it should roughly be 10% more as columnstore index typically compresses the data 10x.

NOTE: A columnstore index on in-memory table can only be added as part of table creation.

  • Run the following script to create a clustered columnstore Index on a memory optimized table.

--Create a memory-optimized table with a columnstore index
create table orders_Inmemory (AccountKey int not null PRIMARY KEY NONCLUSTERED,
customername nvarchar (50),
OrderNumber bigint,
PurchasePrice decimal (9,2),
OrderStatus smallint NOT NULL,
OrderStatusDesc nvarchar (50),
INDEX t_orders_cci CLUSTERED COLUMNSTORE
)
WITH (MEMORY_OPTIMIZED = ON);

Post Conditions

Once we have defined Columnstore indexes on the memory optimized table(s), analytics queries provides significant speed up as shown for disk based tables.

References