This document describes how to approach a Teradata to Azure SQL Data Warehouse migration. The whitepaper is broken into sections covering the migration phases: the preparation required for data migration (including schema migration), migration of the business logic, the data migration approach itself, and the testing strategy.
The migration should pivot on six areas (one is optional but highly recommended). With the benefit of Azure, you can get ahead of the development curve: Azure SQL Data Warehouses can be provisioned quickly, so your development team can start business object migration before the data is migrated.
Using an agile methodology, with a clear focus on what must be delivered to provide a working environment versus items that are "nice to have", will establish a foundation to build upon within Azure.
The illustration below shows the different areas and they all build on top of each other:
Figure 1: Migration Workflow
Phase One – Fact Finding
Through a question and answer session you can define the inputs and outputs for the migration project. This is the requirements-gathering step in an Agile methodology.
Phase Two – Defining Success Criteria for Proof of Concept
Taking the answers from phase one, you decide whether to run a POC to validate the required outputs, running the following phases as a POC. If you have decided on a live migration without a POC (not recommended), you would skip the POC stage. This should feed into the Agile Project Charter and establish the team norms.
Phase Three: Data layer Mapping Options
This phase is about mapping the data you have in Teradata to the data layout you will create in Azure SQL Data Warehouse.
Phase Four: Data Modelling
Once you've defined the data mappings, phase four concentrates on how to tune Azure SQL Data Warehouse to provide the best performance for the data you will be landing into it.
Depending on the state of the data warehouse you might need to deploy a team to adjust the schema into a star or snowflake data model. This layout can bring performance benefits to both the data and reporting layers.
Phase Five: Identify Migration Paths
What is the path of least resistance? What is the quickest path given your cloud maturity? Phase Five describes the options open to you so that you can decide on the path you wish to take. With Azure you have two different paths – export to Azure Data Box and ship it to the Azure datacentre, or use Azure ExpressRoute to transfer from your datacentre to Azure's datacentre.
The data export file types can make a difference to the export, transfer and import to Azure SQL Data Warehouse – these have been listed in the detailed section.
Phase Six: Execution of Migration
Migrating your Teradata data to SQL Data Warehouse involves a series of steps. These steps are executed in three logical stages – Preparation, Metadata migration and Data migration – which have been defined in the previous phases and are combined into Phase Six: Execution of Migration.
Once migrated, you should have a robust testing plan to validate the solution in a test environment before going live in production.
What is the best approach?
Phase 1 is all about fact finding and understanding if the workload is new or a migration project. This is the collecting requirements phase in Agile.
Before migrating, you want to be certain SQL Data Warehouse is the right solution for your workload. SQL Data Warehouse is a distributed system, designed to perform analytics on large volumes of data. Migrating to SQL Data Warehouse requires some design changes that are not too hard to understand, but might take some time to implement. If your business requires an enterprise-class data warehouse (DW), the benefits are worth the effort. However, if you don't need the power of SQL Data Warehouse, it is more cost-effective to use SQL Server or Azure SQL Database.
Consider using SQL Data Warehouse when you:
Don't use SQL Data Warehouse for operational (OLTP) workloads that have:
After you have answered the questions above, and some of those below, you can then look at sizing the Data Warehouse.
Sizing the Data Storage Capacity
Sizing the Processing Capacity
If you have decided to run a Proof of Concept prior to the live migration, then phase one answers should give enough information to understand what success looks like along with who the stakeholders are for certain parts of the project.
Figure 2: Example Proof of Concept
Taking those answers, you can map out what you want to validate in a POC lab and who the owners are for delivering each part. The owner(s) should also have had input into the success criteria in phase one.
In the POC above we have decided to validate data migration, report performance, query performance and delta loads. These tests will be compared against the current solution, with the requirement to be as fast or faster than the current solution at a cost point which supports the business case.
As we are migrating to an Azure solution, some areas don't need to be POC lab tasks but can be extrapolated during the lab. For example, if we take the data migration task in the lab, this would drive an extrapolation of the tasks which would be required for the Preparation for Data Migration. Post Data Migration would come from the task carried out after data migration. The other two mentioned (High Availability and SLAs) can be verified based on the number of clients already running solutions in Azure.
The Business Development Manager is key at the end of the process as they are usually the person who signs off the investment to move to production. They must be involved from the start and should define the cost target at which it makes sense to migrate from Teradata to an Azure solution.
The outputs are driven from the overall proof of concept to provide a Bill of Materials (BOM) and the costs associated with running the solution.
Finally, through the course of the POC you will identify optimisation and fixes which will need to be applied to the migration. This all drives the cost to migration, but it is important to categorise and show the return on investment by implementing these changes. Some changes will be mandatory such as identity and networking infrastructure. These mandatory requirements are usually referred to as Cloud Operating Model.
Consider this as part of your Agile Project Charter, then take the information in this and the previous phase to build out your agile project. Run two-week sprints with a well-defined backlog. The first sprint concentrates on getting the infrastructure ready; from sprint two the team focuses on schema and data migration.
Phase three is about understanding Data Layering Mapping Options. Teradata has different data types and metadata options. Modify your data types to be compatible with SQL Data Warehouse. For a list of supported and unsupported data types, see data types. This source also provides a query to identify existing types that are not supported in SQL Data Warehouse.
If Teradata has several layers such as Staging, Operational Data Store and then a Semantic model you must decide if SQL Data Warehouse is the right location for all this data. There are multiple options in Azure, and the phase one fact finding will help you focus in on the right technology.
For example, a Teradata data warehouse might include a staging database, a data warehouse database, and some data mart databases. In this topology each database runs as a separate workload with separate security policies. By contrast, SQL Data Warehouse runs the entire data warehouse workload within one database. Cross database joins are not permitted. Therefore, SQL Data Warehouse expects all tables used by the data warehouse to be stored within the one database.
This is our approach for consolidating workloads, security, domain and functional boundaries by using user defined schemas:
If schemas have already been used, then you have a few options:
SQL Data Warehouse uses different distribution types to Teradata and it is important to consider how you will distribute the tables you will be migrating.
Each table is distributed or replicated across the Compute nodes. There's a table option that lets you specify how to distribute the data. The choices are round-robin, replicated, or hash distributed. Each has pros and cons. If you don't specify the distribution option, SQL Data Warehouse will use round-robin as the default.
To choose the best distribution option for each table, see Distributed tables.
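As a sketch of what these options look like in DDL (the table and column names here are illustrative, not from the source system), the distribution style is declared in the WITH clause of CREATE TABLE:

```sql
-- Hash-distributed: rows are assigned to a distribution by hashing ProductKey.
-- Usually best for large fact tables joined on the distribution column.
CREATE TABLE dbo.FactSales
(
    SaleKey    BIGINT        NOT NULL,
    ProductKey INT           NOT NULL,
    SaleAmount DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey));

-- Replicated: a full copy of the table is kept on every Compute node.
-- Usually best for small dimension tables.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Round-robin (the default if no option is specified): rows are spread
-- evenly with no placement logic. A reasonable start for staging tables.
CREATE TABLE stg.Sales
(
    SaleKey    BIGINT        NOT NULL,
    SaleAmount DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);
```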
For best performance, minimize the row length of your tables. Since shorter row lengths lead to better performance, use the smallest data types that work for your data.
For table row width, PolyBase has a 1 MB limit. When importing terabytes of data into SQL Data Warehouse, PolyBase is the fastest way to ingest data. Therefore, update your tables to have maximum row widths of less than 1 MB.
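As a rough check against that limit, the catalog views can flag wide tables. This is a sketch only: it sums declared column lengths, and max_length reports -1 for (n)varchar(max)/varbinary(max) columns, which need separate handling:

```sql
SELECT s.name AS schemaname,
       t.name AS tablename,
       SUM(c.max_length) AS declared_row_bytes  -- -1 entries indicate MAX types
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
GROUP BY s.name, t.name
HAVING SUM(c.max_length) > 1048576  -- 1 MB
ORDER BY declared_row_bytes DESC;
```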
Depending on the state of the data warehouse you might need to deploy a team to adjust the schema into a star or snowflake data model. This methodology can bring performance benefits to both the data and reporting layers.
For example, you have a Sales fact table, and Retailer and Retailer_Store dimension tables. Your current reporting solution links Sales to Retailer on Sales.Retail_ID to Retailer.Retail_ID – this will work fine as it will bring back a retailer for each Sales line. However, you have linked Sales.Retail_ID to Retailer_Store.Retail_ID, as opposed to creating a unique surrogate key in Retailer_Store for StoreID and copying this value into a field in the Sales fact table.
The Retailer_Store has the layout of:
Retailer_ID | Store_ID |
1 | 1 |
1 | 2 |
2 | 1 |
2 | 3 |
Table 1: Retail Store Table
As the table contains duplicates for Store, you will have to write a query that joins Sales on both Retail_ID and Store_ID. The data warehouse will have to scan considerably more data than if you had a join on a single unique surrogate key.
A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance. See Using IDENTITY to create surrogate keys in Azure SQL Data Warehouse. The schema of Retailer_Store with a surrogate key would be:
Retailer_ID | Store_ID | StoreKey |
1 | 1 | 1 |
1 | 2 | 2 |
2 | 1 | 3 |
2 | 3 | 4 |
Table 2: Retail Store Table with surrogate key
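A minimal sketch of how Retailer_Store could be rebuilt with an IDENTITY surrogate key (the distribution and index choices here are illustrative):

```sql
CREATE TABLE dbo.Retailer_Store
(
    StoreKey    INT IDENTITY(1,1) NOT NULL, -- surrogate key generated by the engine
    Retailer_ID INT NOT NULL,
    Store_ID    INT NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
```

The Sales fact table then carries StoreKey, so the join uses a single unique column instead of the composite Retail_ID/Store_ID pair.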
The data migration steps usually affect the performance, maintainability and reliability of the migration. Approaches for migrating data to SQL Data Warehouse can be classified based on where the data migration is orchestrated from and based on whether the migration operations are individualized or combined. Typically you have four options:
For Teradata Migration you should use Azure Controlled, with Differentiated or Integrated history loads, depending on your Azure connectivity. These options are discussed under the Data Migration section.
Where possible, automate as much of the export as possible. This enables repeatable exports and can be used for both the proof of concept and the live migration.
Use Polybase to import the data to Azure SQL Data Warehouse. PolyBase is a technology that accesses data outside of the database via the T-SQL language. It is the best way to load data into SQL Data Warehouse. With PolyBase, the data loads in parallel from the data source directly to the compute nodes.
PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. In addition to the delimited text files, it loads from the Hadoop file formats RC File, ORC, and Parquet. PolyBase can load data from Gzip and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width format, and nested formats such as WinZip, JSON, and XML.
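A hedged sketch of the PolyBase objects involved in such a load (the storage account, container, schema and table names are placeholders, not from this migration):

```sql
-- Credential and external data source pointing at the staging container.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE StagingBlob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://exports@<storageaccount>.blob.core.windows.net',
      CREDENTIAL = BlobStorageCredential);

-- File format matching the export: delimited text, Gzip compressed.
CREATE EXTERNAL FILE FORMAT GzipDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec');

-- External table over the exported files, then a parallel CTAS load.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleKey    BIGINT,
    SaleAmount DECIMAL(18,2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = StagingBlob, FILE_FORMAT = GzipDelimitedText);

CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Sales;
```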
From Migrating data to Azure SQL Data Warehouse in practice.
After collecting information, combining it and making the key decisions in phases one to five, you move on to the Execution of the Migration. Migrating your Teradata data to SQL Data Warehouse is executed in three logical stages: Preparation, Metadata migration and Data migration.
Figure 3: Logical Stages of Migration
In each stage, tasks to be executed involve the on-premises database system, the on-premises local storage, the network connecting the local system to Azure (either internet or a dedicated circuit) and SQL Data Warehouse.
This results in a physical data movement from the source database to Azure, as shown below. (These steps are similar when moving data to SQL Data Warehouse from any other source system in the cloud rather than on-premises.)
Figure 4: Physical Flow of Migration
Bringing these two diagrams together results in a logical flow from top to bottom and a physical flow from left to right.
Figure 5: Logical and Physical Workflow
If the volume of the data to migrate is large, some steps can be time consuming. These steps are rate-determining because they influence the overall migration time. Such steps are shaded in yellow.
Some migration steps may be optional depending on the size of the data, the nature of the network, and the tools and services used for migration. Optional steps are shown with orange dots.
To ingest historical data, you need a basic cloud Data Warehouse setup for moving data from your on-premises solution to Azure SQL Data Warehouse, and to enable the development team to build Azure Analysis Services cubes once the majority of the data is loaded.
Note: You should provision a testing virtual machine in Azure, with accelerated networking enabled. This will be used to execute ad-hoc test queries. You can then look at the internals of Azure SQL DW to see the latency/execution times.
After ingesting historical data, and depending on your data loading sizes, adding Azure Databricks alongside Azure Data Lake Storage Generation 2 lets you bring together all your data at any scale, and get insights through analytical dashboards, operational reports, or advanced analytics for all your users.
This gives you the ability to easily add more data feeds, which are then moved through the pipeline up to the Azure SQL Data Warehouse layer and on to your analytics/reporting layer.
To accelerate your development sprints, you should deploy the data migration architecture first to allow the team to start migrating schemas and historical information.
Once the team starts to work on incremental loads, the Modern Data Warehouse architecture should have been built, with focus on Azure Databricks to cleanse and transform data into Azure SQL Data Warehouse.
See further details on Data Migration pipeline to understand how this architecture supports your migration.
You can provision a Teradata Database (developer edition) in Azure before migrating the data from your on-premises Teradata database. This can speed up migration efforts, as it often takes time to provision on-premises virtual machines and to go through the internal approval processes to obtain a virtual machine inside your datacenter.
By provisioning an environment in Azure in under an hour, both activities can run in parallel. The key to getting the schema from Teradata to SQL Data Warehouse is to build a Teradata Database VM on which you can recreate the on-premises schema.
The process of extracting the Teradata tables to Azure SQL Data Warehouse is:
Figure 6: Workflow for Schema Migration
If you are able to provision a virtual machine quickly in your datacenter, close to the Teradata database, then this is the preferred process, and it would allow you to start at "Create Jumpbox". However, in our experience, internal processes can slow this down, and waiting on a schema environment can put the project on hold for a long time.
A development version of Teradata Database is, as at 4th July 2018, provided free (all you pay for is the Azure compute and storage). Teradata is licensed for use on specific Azure VMs. For the latest information on supported software and VMs, visit www.teradata.com/azure/products/
Using Azure Marketplace, it is possible to deploy a 2-node development edition of Teradata Database 15.x and then create a copy of the schemas you wish to migrate. Only Teradata Database is required; no other Teradata components are needed.
You should have extracted a copy of the Teradata schema scripts for the objects you wish to migrate. You will need one vNet with three subnets provisioned, and an Azure subscription with contributor permission on at least one empty Azure resource group.
After the Teradata cluster has been created, you will need to open port 1025 to allow the Jumpbox VM to communicate with it.
Full details on creating a Teradata Azure VM can be found in the Appendix under "Create Teradata Database Azure VM".
To facilitate copying the schema between the Azure Teradata Cluster and Azure SQL Data Warehouse you should create a Windows 2012 Azure VM (Size D4_v2). Using the Data Science Virtual Machine - Windows Server 2012 edition, from the Azure Marketplace, will save you time in installing SQL Server Management Studio and Visual Studio – since these components are pre-installed on the Azure Marketplace image.
Additional software which should be installed after this has been provisioned (also applies to on-premise virtual machine): -
Full details on creating a Teradata Azure VM can be found in the Appendix under "Create Jumpbox VM"
You will need a blank SQL Data Warehouse database to migrate the schema to, and this needs to be set up prior to migrating the schema. Initially the performance level can be set to Gen2 DWU500, as you will only be copying metadata at this stage – you can scale up later when you need to load data.
Prior to using Attunity Replicate, to create the schema, you will need to create a master key for Azure SQL Data Warehouse.
To run loads with appropriate compute resources, we recommend that you create a dedicated loading user and assign it a resource class that fits the load (see Loading Best Practices).
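A sketch of this pre-load setup (the login, user and resource class names are illustrative):

```sql
-- In the SQL Data Warehouse database: create the master key
-- required before Attunity Replicate can be used.
CREATE MASTER KEY;

-- In the master database: a dedicated login for loading.
CREATE LOGIN LoaderLogin WITH PASSWORD = '<strong-password>';

-- Back in the SQL Data Warehouse database: the user, its permissions,
-- and a static resource class sized for the load.
CREATE USER LoaderUser FOR LOGIN LoaderLogin;
GRANT CONTROL ON DATABASE::[<DatabaseName>] TO LoaderUser;
EXEC sp_addrolemember 'staticrc60', 'LoaderUser';
```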
Full details on Creating an Azure SQL Data Warehouse Database can be found in the Appendix under "Create a SQL DataWarehouse Dev DB"
To create the schema, you will need to export your on-premise Teradata Schema into .sql text files and then use the Teradata Developer Tools for Visual Studio, to recreate the schema on Azure Teradata Cluster.
By approaching the schema migration using a separate instance away from Production, you can remove several barriers around connecting to a production instance.
Full details on how you can create the Teradata schema can be found in the Appendix under "Create the Teradata Schema in Azure".
Attunity Replicate for Microsoft Migrations is for Microsoft customers who want to migrate data from popular commercial and open-source databases to the Microsoft Data Platform, including Teradata to Azure SQL Data Warehouse. It can be obtained from https://www.attunity.com/products/replicate/attunity-replicate-for-microsoft-migration/.
Attunity Replicate for Microsoft Migrations is installed on the Jumpbox VM and is used to transfer the Teradata schema to Azure SQL Data Warehouse.
The table schema will need to be performance tuned as each table is created as a ROUND ROBIN table.
This schema migration consists of the following stages:
Once migrated, you should check that the schema has been transferred. The full process is described in the Appendix under "Using Attunity Replicate to Create Schema".
At this point in the schema creation process, it is advisable to extract the Azure SQL Data Warehouse schema and check it into your code store.
The easiest way to achieve this is with the mssql-scripter. mssql-scripter is the multiplatform command line equivalent of the widely used Generate Scripts Wizard experience in SSMS.
You can use mssql-scripter on Linux, macOS, and Windows to generate data definition language (DDL) and data manipulation language (DML) T-SQL scripts for database objects in SQL Server running anywhere, Azure SQL Database, and Azure SQL Data Warehouse. You can save the generated T-SQL script to a .sql file or pipe it to standard *nix utilities (for example, sed, awk, grep) for further transformations. You can edit the generated script or check it into source control and subsequently execute the script in your existing SQL database deployment processes and DevOps pipelines with standard multiplatform SQL command line tools such as sqlcmd.
Follow the installation guide at https://github.com/Microsoft/mssql-scripter/blob/dev/doc/installation_guide.md and launch a command prompt with the following command:
mssql-scripter -S <servername>.database.windows.net -d <DatabaseName> -U <UserID> -P <Password> -f ./CreateSQLDWTables.dsql --script-create --target-server-version AzureDW --display-progress
This will extract the Azure SQL Data Warehouse Schema to allow you to recreate the database later. For a full command line reference please see https://github.com/Microsoft/mssql-scripter/blob/dev/doc/usage_guide.md.
This section runs through how you can take your Teradata business logic and migrate it to SQL Data Warehouse.
Some Azure SQL Data Warehouse features can significantly improve performance as they are designed to work in a distributed fashion. However, to maintain performance and scale, some features are also not available. See Migrate your SQL code to SQL Data Warehouse.
There are two approaches which you can take to migrating the stored procedures and views which reside in the Teradata Database.
Datometry, a Microsoft Partner, should be considered in phase one (lift and shift) of any Teradata migration where you have ad-hoc Teradata (TD) SQL, legacy applications and reporting which need to be moved across quickly. It runs between the legacy applications and Azure SQL Data Warehouse, translating Teradata queries into Azure SQL Data Warehouse T-SQL at the network layer. Datometry claim they can eliminate the rewriting of database applications and reduce migration cost and project time by up to 90%, by letting existing applications, written originally for Teradata, run natively and instantly on SQL Data Warehouse.
The application works, on a VM, within Azure, in the following manner:
Existing on-premise applications can continue to be used:
If you have decided to go for a complete re-write of the application there are various toolkits and Microsoft Partners who will help you take complex business logic through to SQL Server from which you can add SQL Server Data Warehouse specifics.
If you have legacy applications and want to move quickly, our recommended approach would be to use Datometry and then concentrate on recoding the application with a Microsoft Partner.
However, if you are migrating to different application stacks and have the timelines to be able to move business logic migration and data loads, then investment upfront in this will leave you with less technical debt.
If you are moving from Teradata to Azure SQL Data Warehouse, without Datometry, then you will have to do a certain amount of recoding. Take this view definition, from Teradata, as an example of what would need to be changed to be compatible with Azure SQL Data Warehouse.
REPLACE VIEW ADW.DIMDATE AS
LOCKING ROW FOR ACCESS
SELECT
    CAL.DayNumberofWeek AS WeekName,
    CAST(CAL.DateKey AS DATE FORMAT 'MM/DD/YYYY') AS CalDate,
    CASE WHEN CAL.DayNumberOfMonth = 5 THEN 1 ELSE 0 END AS Calc_CY
FROM TESTTD.CALENDER AS CAL
WHERE CAL.DateKey BETWEEN (Current_Date - INTERVAL '3' YEAR) AND (Current_Date - INTERVAL '1' DAY);
Looking at this code there are multiple challenges with trying to execute this against SQL Data Warehouse. To convert these into T-SQL we would apply the following changes:
These are small changes and shows why our approach of Datometry followed by recoding eases the migration path to SQL Data Warehouse.
The finished T-SQL script, with changes highlighted in red, would look like:
CREATE VIEW [ADW].[DimDate] AS
SELECT
    CAL.DayNumberofWeek AS WeekName,
    CONVERT(datetime, CAL.DateKey, 101) AS CalDate,
    /* note the way the alias has changed from "expression AS columnname" to "columnname = expression" */
    Calc_CY = CASE WHEN CAL.DayNumberOfMonth = 5 THEN 1 ELSE 0 END
FROM TESTTD.CALENDER AS CAL
WHERE CAL.DateKey BETWEEN DATEADD(YEAR, -3, GETDATE()) AND DATEADD(DAY, -1, GETDATE());
The pattern for semantic layer views is to create these in Azure SQL Data Warehouse and then to create an Azure Analysis Services Tabular Model over the top of them to improve scale and speed. For concurrency and scale, Azure Analysis Services should be included in the architecture. Ad-hoc queries can be run against the Azure SQL Data Warehouse.
Several great benefits exist when leveraging Azure Analysis Services (AAS) with SQL DW:
Make sure the semantic layer conforms to a Star or Snowflake Schema.
Decide on your query model for Azure Analysis Services.
The Analysis Services Dimensions can be optimized by:
Analysis Services Facts can be optimized by:
Microsoft have published a whitepaper on modeling for AS tabular scalability.
There are multiple ways to approach data migration, and your choice will depend on the connectivity you have from your datacenter to Azure.
After assessing the Teradata export toolset and the effort required in creating BTEQ scripts to export all the data, the source-controlled pipeline approach isn't appropriate for a Teradata to Azure SQL Data Warehouse migration.
Adopting the Azure Controlled pipeline approach, all the way from Azure down to the on-premises resources, removes human intervention and reduces the number of mistakes. It relies less on skills and more on the infrastructure you have enabled.
You would use Azure Data Factory v2 (https://docs.microsoft.com/en-us/azure/data-factory/introduction) with Self-Hosted Integration Runtimes (https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime) to manage the pipeline, installed onto one or more Windows 2012 R2+ virtual machines located near the existing Teradata cluster. Data Factory is a cloud data integration service that composes data storage, movement, and processing services into automated data pipelines. A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network, and it can dispatch transform activities against compute resources in an on-premises network or an Azure virtual network.
Teradata ships with the FastExport utility, which is invoked through Teradata Parallel Transporter to export data. The Azure Data Factory v2 Teradata connector calls Teradata Parallel Transporter.
An internet connection to the virtual machine is required to allow Azure Data Factory to connect to self-hosted runtimes which will orchestrate the export/import process.
If your data export is between 1 TB and 2 TB at a time, having at least two virtual machines with self-hosted runtimes will increase throughput by about 1.5x.
The following items are required on the Virtual Machine located in the Teradata Datacenter: -
For the integrated history load process, we would suggest you create a Windows 2012 R2+ Virtual Machine located as near as possible to the existing Teradata Cluster, and follow the instructions under Configuring Jumpbox VM earlier in this document.
You should then follow the Attunity Replicate for Microsoft Migrations instructions for migrating data which are maintained outside this document.
The integrated approach relies on Azure ExpressRoute being in place between the source Teradata location datacenter and Azure datacenter location, then using Attunity Replicate for Microsoft Migrations to ease the historical load process.
Attunity Replicate for Microsoft Migrations can be scaled out onto multiple nodes and use one node to co-ordinate the load process. It is recommended that you build VMs, based on their recommended specifications, in the Teradata source location datacenter. Attunity will query data locally and then upload to blob storage before creating external tables and importing data using a Create Table as Select to the base tables.
The data is transferred to local storage which can then be copied to Azure Databox Disks and shipped back to the Azure datacenter. Full details on Azure Databox (https://docs.microsoft.com/en-us/azure/databox/data-box-disk-overview).
The time to transfer data over the internet or Azure ExpressRoute will drive the decision to use Azure Data Factory or AzCopy to upload the export files instead of the Azure Data Box service.
For example, a 100 TB export over a 70 Mbps connection, assuming you have 100% of the bandwidth all of the time, would take almost 140 days to upload. Anything that would take more than 14 days to upload should make use of the Azure Data Box service.
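The arithmetic behind that estimate, reading the link speed as 70 megabits per second (approximately 8.75 MB/s):

```latex
t = \frac{100 \times 10^{12}\ \text{bytes}}{8.75 \times 10^{6}\ \text{bytes/s}}
  \approx 1.14 \times 10^{7}\ \text{s} \approx 132\ \text{days}
```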
To summarize, the process flow, of the two methods described would be broken into: -
Where possible use ORC file format over Text Format. Delimited files lose any text formatting that has been applied and will fail if there are special characters or the delimiter appears within the data.
If you have Azure ExpressRoute or performant internet connection:
If you have no Azure ExpressRoute, or less than a 70 Mbps VPN to Azure, then:
If your POC doesn't have Azure ExpressRoute or you need to load data in stages, then you will need to look at alternative ways to load data from Teradata to SQL Data Warehouse, using the differentiated approach and your favorite ETL toolkit.
To load SQL Data Warehouse, you would make use of Azure Data Factory v2 to extract data from Teradata to local storage then copy the data from the blob storage, once the data has been transferred either by Azure Databox or AzCopy.
All configuration activities are executed from the Virtual Machine located nearest to the Teradata Cluster.
At a high level, this export involves following steps:
Note: Depending on bandwidth, you would either copy the local storage to Azure Data Box disks and ship them to the Azure datacenter, or use AzCopy with multiple threads (no more than the number of processor cores on the VM) to copy to blob storage.
To allow us to dynamically feed Azure Data Factory with the tables which need to be exported from Teradata, we create metadata tables in the Azure SQL Data Warehouse, which ADF v2 loops through for both the extract and import process.
You could keep these metadata tables simple, but keep in mind that if you have some very large tables you will want to split them into smaller export files, as opposed to creating one large export file, which can cause out-of-memory issues during export and is much harder to reload should the import fail at some point.
During our testing, we always ran a smaller 1.5TB test export a couple of times, to make sure that the multiple file sizes for one table were less than 2GB.
The simple metadata table, called TablesToProcess, has schemaname and tablename columns. If you wanted to add intelligence to this, you could add an imported smallint column to allow ADF to update this column to 1 upon successful import or 3 upon failure.
The data in the simple metadata table would look like:
schemaname | tablename | Imported |
Dimension | City | 0 |
Integration | City_Staging | 0 |
Dimension | Customer | 0 |
Integration | Customer_Staging | 0 |
Dimension | Date | 0 |
Dimension | Employee | 0 |
Table 3: Simple metadata table
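A sketch of this control table's DDL (the ctl schema is illustrative; a small round-robin heap is sufficient for metadata):

```sql
CREATE TABLE ctl.TablesToProcess
(
    schemaname NVARCHAR(128) NOT NULL,
    tablename  NVARCHAR(128) NOT NULL,
    Imported   SMALLINT      NOT NULL DEFAULT 0 -- 0 = pending, 1 = imported, 3 = failed
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
```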
A more advanced metadata table structure would still need one table called TablesToProcess, which has schemaname, tablename, filtercol, FilterValue and imported columns.
schemaname | tablename | filtercol | FilterValue | Imported |
FACT | Orders | Orderdate | 2015 | 0 |
FACT | Orders | Orderdate | 2016 | 0 |
Dimension | Customers | CustomerType | Retail | 0 |
Table 4: Advanced metadata table
In the advanced metadata table, you will see multiple entries for the same table. As Azure Data Factory accepts parameters, we can build dynamic SQL where clauses. The filtercol contains the column we will filter on and the filtervalue is the value we will compare to.
When the pipeline runs, it will use the advanced metadata table to export data into smaller files. The import will use the same table to move the data into Azure SQL Data Warehouse.
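As an illustration, the dynamic export query built from one row of the advanced metadata table might look like the following. The table and column names come from the sample rows above; the exact comparison logic (here, extracting the year) is an assumption based on the sample filter values:

```sql
-- Hypothetical Teradata export query built from the metadata row
-- (FACT, Orders, Orderdate, 2015); EXTRACT is used because the
-- sample filter values are years.
SELECT *
FROM FACT.Orders
WHERE EXTRACT(YEAR FROM Orderdate) = 2015;
```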
To best understand this pipeline we would suggest running through the detailed setup in the Appendix.
This section details the next areas you need to put into your migration plan once the schema and data have been moved to Azure SQL Data Warehouse.
The tables you created to load data into will have been distributed as ROUND_ROBIN tables, without any partitions, which isn't always optimal for Azure SQL Data Warehouse.
Data Distribution is an important concept in SQL Data Warehouse. It allows breaking down data processing requirements for efficient parallel processing across all hardware nodes. When creating any table in SQL DW, the distribution key is used to determine the node on which each row of data is stored. That is, any table will be distributed across multiple nodes based upon the distribution key. An important factor in SQL DW database design is selecting good distribution keys for fact tables.
The most expensive queries in MPP systems such as SQL DW, are those that involve a lot of data movement. When you execute a query, SQL DW will redistribute rows across the compute nodes as necessary to fulfill the query such as on JOINs and GROUP BY statements. Users should create tables with distribution styles that minimize data movement, reflective of queries that would occur in their normal workload.
SQL Data Warehouse supports three primary distribution styles: round-robin, hash and replicated.
Once data has been loaded, you should redistribute tables to a more appropriate distribution type.
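For example, a round-robin loaded table can be switched to hash distribution with a CREATE TABLE AS SELECT (CTAS) statement and a pair of renames. This is a sketch; the table and column names are illustrative:

```sql
-- Redistribute a round-robin table to hash distribution using CTAS,
-- then swap the new table in via renames.
CREATE TABLE [FACT].[Orders_New]
WITH
(
    DISTRIBUTION = HASH([CustomerKey]),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [FACT].[Orders];

RENAME OBJECT [FACT].[Orders] TO Orders_Old;
RENAME OBJECT [FACT].[Orders_New] TO Orders;
```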
SQL Data Warehouse offers several indexing options including clustered columnstore indexes, clustered indexes and nonclustered indexes, and a non-index option also known as heap.
Clustered columnstore tables offer both the highest level of data compression as well as the best overall query performance. Clustered columnstore tables will generally outperform clustered index or heap tables and are usually the best choice for large tables. Clustered columnstore is the best place to start when you are unsure of how to index your table.
There are a few scenarios where clustered columnstore may not be a good option:
- Staging or temporary tables, where data is loaded once, transformed and then removed.
- Small tables with fewer than about 60 million rows, which cannot fill the columnstore rowgroups across all distributions.
- Tables with varchar(max), nvarchar(max) or varbinary(max) columns, which clustered columnstore does not support.
For further information on indexing tables in SQL Data Warehouse, please read https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-index
Fact tables are typically millions or billions of rows in a data warehouse and using table partitioning during loading and querying can improve performance.
The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of loading data through partition deletion, switching and merging (see https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition). In most cases data is partitioned on a date column that is closely tied to the order in which the data is loaded into the database. One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just data for the past 36 months. At the end of every month, the oldest month of sales data is removed from the table. This data could be removed by using a delete statement to delete the data for the oldest month. However, removing a large amount of data row-by-row with a delete statement can take too much time, as well as create the risk of large transactions that take a long time to rollback if something goes wrong. A more optimal approach is to drop the oldest partition of data. Where deleting the individual rows could take hours, dropping an entire partition could take seconds.
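A sketch of removing the oldest month with partition switching, assuming an identically partitioned, empty archive table already exists; the table names and partition number are illustrative:

```sql
-- Switch the oldest partition out of the fact table into an archive
-- table (a metadata-only operation), then clear the archive.
ALTER TABLE [FACT].[Sales] SWITCH PARTITION 1
    TO [FACT].[Sales_Archive] PARTITION 1;

TRUNCATE TABLE [FACT].[Sales_Archive];
```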
Partitioning can also be used to improve query performance. A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions. Partition elimination can avoid a full table scan and instead scan a smaller subset of data. With the introduction of clustered columnstore indexes, the performance benefits of partition elimination are less pronounced, but in some cases there can still be a benefit to queries. For example, if the sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale date can skip searching in partitions that don't match the filter.
Hash distribute keys on common join conditions. Each table can only have one distribution key. As a rule of thumb, you should look at your typical query patterns and find the most common join conditions such as those between fact and the largest dimension tables as candidates for distribution. This will ensure that data is generally collocated at query time.
Choose an integer type for your distribution key if possible. Integer types such as INT or BIGINT can be used in query optimizations better than data types such as char, varchar, decimal, date, etc.
For even data skew, choose columns with large numbers of unique values for distribution keys. If you choose a date column as your distribution key, you may get a relatively even skew. However, if your query filters on a narrow date range, most of the data will land on a single compute node, thereby losing benefits of parallelization. Remember that distribution of data is more relevant to which data is chosen during a filtered query than how the data is at rest. As a rule of thumb, try to avoid any more than 30% skew across distributions.
Avoid a NULLABLE column for your distribution key. Rows where the distribution key is NULL or an "unknown" value such as -1 will be distributed in a single distribution which can lead to very skewed data and bottleneck performance.
Consider replicated tables for type 1 and type 2 dimensions.
Consider using ROUND_ROBIN tables for tables that cannot be replicated, have no consistent join column, or have very low cardinality (uniqueness) across all columns.
Clustered columnstore should be your default option for most table types. Clustered columnstore is a highly efficient compressed storage format that reduces IO greatly during query time.
Clustered indexes enforce a physical row sorting of your data on disk. Consider this type of index only on relatively small tables (< 60 million rows) such as (replicated) dimension tables that benefit from range filtering such as on date columns.
While partitioning can be used to improve performance in some scenarios, creating a table with too many partitions can hurt performance in others. These concerns are especially true for clustered columnstore tables. For partitioning to be helpful, it is important to understand when to use partitioning and the number of partitions to create. There is no hard-and-fast rule as to how many partitions are too many; it depends on your data and how many partitions you are loading simultaneously. A successful partitioning scheme usually has tens to hundreds of partitions, not thousands.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, SQL Data Warehouse already divides each table into 60 distributions. Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition. The Indexing article on Microsoft docs includes queries that can assess the quality of clustered columnstore indexes.
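Pulling the guidance above together, a fact table definition might look like the following sketch. All names, types and boundary values are illustrative:

```sql
-- A fact table combining the recommendations above: hash distribution
-- on an integer join key, clustered columnstore, and monthly partitions.
CREATE TABLE [FACT].[Sales]
(
    [SaleKey]     BIGINT        NOT NULL,
    [CustomerKey] INT           NOT NULL,  -- common join column, NOT NULL
    [SaleDate]    DATE          NOT NULL,
    [Amount]      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH([CustomerKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([SaleDate] RANGE RIGHT FOR VALUES
        ('2018-01-01', '2018-02-01', '2018-03-01'))
);
```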
Once the performance tuning of the schema has been completed, it is important to check that all the data has been transferred across.
You should implement a robust testing plan which includes comparing row counts and data types, and checking that queries run on the source and on Azure SQL Data Warehouse return the same results.
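As a starting point for a repeatable process, the same row count can be captured on both systems and compared; the table name below is illustrative:

```sql
-- On Azure SQL Data Warehouse: capture the row count, then run the
-- equivalent COUNT(*) on the Teradata source and compare the results.
SELECT COUNT_BIG(*) AS azure_row_count
FROM [Dimension].[Customer];
```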
A full testing plan is outside the scope of this whitepaper but consider defining a repeatable process.
We discussed selecting the correct model type for the size of your data under Semantic Layer and Reporting Views.
Microsoft has published a whitepaper called Performance Tuning of Tabular Models in SQL Server 2012 Analysis Services. This will help you optimise and performance tune your model.
Optimise the processing of Azure Analysis Services by using partitions in your model that align with the partitions in Azure SQL Data Warehouse. Large datasets normally require table partitioning to accelerate and optimize the data-load process. Azure Analysis Services partitioning enables incremental loads, increases parallelization, and reduces memory consumption. See Automated Partition Management with Azure Analysis Services and Azure Analysis Services Partitions.
After creating your Azure Analysis Services model you should recreate your reports. For a Proof of Concept, choose no more than five reports and decide if you will keep your existing reporting technology or adopt a new solution.
If the reports are complex and are going to take time to transform, then Datometry is a great solution for being able to lift and shift your reporting solution whilst you evaluate your path.
If you are moving to a Microsoft reporting solution such as Power BI it can still make sense to repoint your current reporting solution, via Datometry, whilst you redevelop your reports. This is a much quicker way to get users working with your modern data warehouse.
Figure 7: Azure Marketplace
Figure 8: Teradata Marketplace Offering
Figure 9: Teradata Costs as at July 2018
Figure 10: Create Teradata Cluster
Figure 11: Teradata Basic Settings
Figure 12: Teradata Database Configuration
Figure 13: Teradata General Settings
Figure 14: Teradata Virtual Network Settings
Figure 15: Teradata Azure Validation
Figure 16: Create Teradata Cluster
Note: Provisioning of the Teradata Database (Developer Edition) can take up to 30 minutes.
Important: once the Teradata database VM has been provisioned, do not shut this VM down until you have finished the schema conversion, or it will fail to come back online.
To allow access to the Teradata Database VM we need to enable port 1025, on the firewall, for the virtual network. Without this our Data Science VM won't be able to connect to create and then extract the schema.
Figure 17: Virtual Machine Blade
Figure 18: Networking Firewall Rules
Figure 19: Add inbound security rule
We will use this to connect to the Teradata Azure VM and load some additional software onto this machine to assist with the schema migration.
Figure 20: Azure Marketplace
Figure 21: Data Science VM in Azure Marketplace
Figure 22: Data Science VM Pricing
Figure 23: Create virtual machine
Figure 24: Virtual Machine Size
Figure 25: Virtual Machine Settings
Figure 26: Create Virtual Machine
The jump box needs to be configured with software to connect to the Teradata database, to create your schema, and with the migration software installed which will take your Teradata schema across to SQL Data Warehouse.
Figure 27: TTU Tools Installation
Figure 28: Attunity Website
Figure 29: Virtual Machine Environment Variables
Figure 30: Teradata Visual Studio Install
Close all applications and log off.
You will need a blank SQL Data Warehouse database to migrate the schema to, and this needs to be set up prior to running Attunity Replicate.
Figure 31: Azure New Service Blade
Figure 32: Azure SQL Data Warehouse Settings
Figure 33: Add New Server
Figure 34: Performance Tier
Prior to deploying the Teradata schema, you will need to check connectivity from the Data Science VM; create a login and user dedicated to loading data; and create a MASTER KEY.
Figure 35: Connect to Azure SQL Data Warehouse
Figure 36: Object View
Figure 37: New Query
Figure 38: Create User
Figure 39: Create Master Key
This describes the process for creating your Teradata Database schema in the Azure Teradata VM. It presumes that you are a DBA or have Teradata skills.
Figure 40: New Teradata Query
Figure 41: Recreate Teradata Tablespace
Figure 42: Connect to Teradata
Figure 43: Connection Information Dialogue
Figure 44: Execute Query
Figure 45: Query Output
Figure 46: Create Tables
To access the Console from a remote machine, type the following address in the address bar of your Web browser:
http://<computer name>/attunityreplicatemsm
where <computer name> is the name or IP address of the computer
After opening the console, the next step is to add a migration task.
To add a replication task:
Figure 47: Attunity Task
After adding a new task, the next step is to set up your source and target endpoints in Attunity Replicate for Microsoft Migrations. An endpoint is a set of connection parameters that enables Replicate for Microsoft Migrations to extract data from your source and migrate it to your target. Connection parameters specify information such as the database server name, the database name, the user name, and so on.
To add a Teradata Database source endpoint to Attunity Replicate for Microsoft:
To add a SQL Data Warehouse Target to Attunity Replicate for Microsoft:
After defining your source and target endpoints, the next step is to add them to the migration task.
The Endpoints pane is on the left of the console. Replicate for Microsoft Migrations supports several endpoint types (database, Hadoop, data warehouse, file, NoSQL), with each endpoint type having its own icon. As both of our endpoints are databases, the following icons will be displayed:
The source endpoint.
The target endpoint.
To add the source and target endpoints to the task
After adding the source and target endpoints, you now need to select the Teradata source tables that you want to migrate to Microsoft Azure SQL Database.
To add Teradata tables to the migration task:
To run and monitor the replication task
The Starting task dialog is displayed, and the console automatically switches to Monitor view:
Monitor view contains the following tabs:
After you have run the Attunity migration, you should check the SQL Data Warehouse database to see the schema that has been transferred.
Download all of the PowerShell and SQL scripts from the GitHub repository to "C:\ADF" on the local VM.
Using PowerShell, with the Azure PowerShell module 3.6 or later, open ".\02-ADF\01-Pre-Requisites.ps1" and step through the script.
Sign in to your Azure subscription with the Connect-AzureRmAccount command and follow the on-screen directions.
Connect-AzureRmAccount
If you don't know which location you want to use, you can list the available locations. After the list is displayed, find the one you want to use. This example uses eastus. Store this in a variable and use the variable so you can change it in one place.
Get-AzureRmLocation | select-object Location
$location = "eastus"
Create an Azure resource group with New-AzureRmResourceGroup. A resource group is a logical container into which Azure resources are deployed and managed.
$resourceGroup = "myResourceGroup"
New-AzureRmResourceGroup -Name $resourceGroup -Location $location
Create a standard general-purpose storage account with LRS replication using New-AzureRmStorageAccount, then retrieve the storage account context that defines the storage account to be used. When acting on a storage account, you reference the context instead of repeatedly providing the credentials. This example creates a storage account called mystorageaccount with locally redundant storage (LRS) and blob encryption (enabled by default).
$storageAccount = New-AzureRmStorageAccount -ResourceGroupName $resourceGroup `
-Name "mystorageaccount" `
-Location $location `
-SkuName Standard_LRS `
-Kind Storage
$ctx = $storageAccount.Context
Blobs are always uploaded into a container. You can organize groups of blobs like the way you organize your files on your computer in folders.
Set the container name, then create the container using New-AzureStorageContainer, setting the permissions to 'blob' to allow public access of the files. The container name in this example is import
$containerName = "import"
New-AzureStorageContainer -Name $containerName -Context $ctx -Permission blob
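Once the container exists, the exported files can be uploaded from the local export directory with Set-AzureStorageBlobContent. This is a sketch that assumes the export files live under C:\Export (an illustrative path) and that $containerName and $ctx from the preceding steps are still in scope:

```powershell
# Upload every exported file under C:\Export, preserving the
# schema\table folder structure as the blob name.
Get-ChildItem "C:\Export" -File -Recurse | ForEach-Object {
    $blobName = $_.FullName.Substring("C:\Export\".Length)
    Set-AzureStorageBlobContent -File $_.FullName `
        -Container $containerName `
        -Blob $blobName `
        -Context $ctx
}
```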
To import data from blob storage you need to create external tables which point to the container you created earlier. You cannot create these external tables until you have uploaded data to the blob store, but for now we can create the DDL. Creating the DDL before the data is uploaded allows us to then create a local directory structure on the Teradata export machine to copy the files to.
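The external objects over the import container would look something like this sketch. The storage account name, schema and table definition are placeholders; the real column DDL comes from your migrated schema, and no database scoped credential is shown because the container was created with public 'blob' access:

```sql
-- External data source over the 'import' container (account name is a placeholder).
CREATE EXTERNAL DATA SOURCE AzureBlobImport
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://import@mystorageaccount.blob.core.windows.net'
);

-- File format matching the ORC files the pipeline exports.
CREATE EXTERNAL FILE FORMAT OrcFormat
WITH (FORMAT_TYPE = ORC);

-- One external table per exported table, pointing at its folder.
CREATE EXTERNAL TABLE [ext].[City]
(
    [CityKey]  INT          NOT NULL,
    [CityName] VARCHAR(100) NULL
)
WITH (
    LOCATION = '/Dimension/City/',
    DATA_SOURCE = AzureBlobImport,
    FILE_FORMAT = OrcFormat
);
```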
CREATE TABLE [dbo].[TablesToProcess]
(
[schemaname] [varchar](255) NULL,
[tablename] [varchar](255) NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
GO
INSERT INTO TablesToProcess (schemaname, tablename)
select top 30000 sc.name, so.name
from sys.tables so
join sys.schemas sc on so.schema_id = sc.schema_id
left join sys.external_tables et on so.object_id = et.object_id
where et.name is NULL and so.type = 'U' order by so.name
GO
With the metadata tables created we are going to create a new data factory for your historical migration.
Connect-AzureRmAccount
Get-AzureRmSubscription
Select-AzureRmSubscription -SubscriptionId "<SubscriptionId>"
Note the following points:
To enable the VM to export data from Teradata to local storage, we need to create a self-hosted integration runtime and associate it with an on-premises machine hosting the Teradata database and local storage. The self-hosted integration runtime is the component that copies data from the Teradata database on your machine to local storage.
Here is the sample output:
Id : /subscriptions/<subscription ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/integrationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description : selfhosted IR description
Here is the sample output:
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net, *.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
Get-AzureRmDataFactoryV2IntegrationRuntimeKey `
-Name $integrationRuntimeName `
-DataFactoryName $dataFactoryName.DataFactoryName `
-ResourceGroupName $resourceGroupName.ResourceGroupName | `
ConvertTo-Json
Here is the sample output:
{
"AuthKey1": "IR@0000000000-0000-0000-0000-000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
When the self-hosted integration runtime is registered successfully, the following message is displayed:
Note all the preceding values for later use.
To link your data stores and compute services to the data factory create linked services in the data factory. You will need to create links to Teradata, local storage, your Azure storage account and Azure SQL Data Warehouse. The linked services have the connection information that the Data Factory service uses at runtime to connect to them.
Create and encrypt a Teradata linked service (source)
In this step, you link your on-premises Teradata instance to the data factory.
Create and encrypt a local storage linked service (source)
In this step, you link the local storage on your on-premises machine to the data factory.
Creating the Azure SQL Data Warehouse linked service
Create All Linked Services
Linked Services Notes
The file share should be on the local virtual machine as opposed to pushing out file copies from the VM to another location.
This will create all the datasets required for the four pipelines to use.
In SourceDW.json, in the C:\ADFv2\02-ADF\DataSets folder, "tableName" is a dummy value; the pipeline uses the SQL query in the copy activity to retrieve the data. This can be checked or changed in the Azure Data Factory User Interface SourceDW dataset.
Figure 48: SourceDW Connection Properties
In TDDataset.json, in the C:\ADFv2\02-ADF\DataSets folder, "tableName" is a dummy value; the pipeline uses the SQL query in the copy activity to retrieve the data. This can be checked or changed in the Azure Data Factory User Interface TDDataset dataset.
Figure 49: TDDataset Connection Properties
In DWDestinationDataset.json, in the C:\ADFv2\02-ADF\DataSets folder the "tableName" is set as a parameter. The copy activity that references this dataset dynamically passes the actual value into the dataset. This can be checked or changed in the Azure Data Factory User Interface DWDestinationDataset dataset.
Figure 50: DWDestinationDataset Parameters
These parameters are dynamically referenced in the table name.
Figure 51: DWDestinationDataset Connection Properties
In FileORC.json, in the C:\ADFv2\02-ADF\DataSets folder the "FileDirectory" and "FileTableName" are set as a parameter. The copy activity that references this dataset passes the actual directory and filename into the dataset. This can be checked or changed in the Azure Data Factory User Interface FileORC dataset.
Figure 52: FileORC Parameters
These are dynamically referenced in the file path and the file format is set to ORC format.
Figure 53: FileOrc Connection Properties
This will create all four pipelines used for the migration (GetTableListAndTriggerCopyTDToCSV, IterateAndCopyTDTablesToCSV, GetTableListAndTriggerCopyBlobtoSQLDW and IterateAndCopyBlobToDW).
Teradata to Local Storage
The GetTableListAndTriggerCopyTDToCSV pipeline looks up the Azure SQL Data Warehouse system table to get the list of tables to be copied from Teradata to the local storage. This can be checked or changed in the Azure Data Factory User Interface GetTableListAndTriggerCopyTDToCSV pipeline by selecting the LookupTableList activity.
Selecting the TDTableToCSVFilePipeline activity and its settings will show how we pass the output of LookupTableList to the IterateAndCopyTDTablesToCSV pipeline.
The GetTableListAndTriggerCopyTDToCSV pipeline triggers the pipeline IterateAndCopyTDTablesToCSV to do the actual data copy.
The output of the lookup activity is passed into an array parameter which is used by the ForEach loop.
By selecting the ForEach activity you will see that this array is passed to another parameter associated with the activity.
Editing the activities for IterateAndCopyTDTablesToCSV and selecting CopyTDtoCSV you can investigate how Teradata data is copied to local storage using a dynamic query.
The local storage is passed a directory and filename based on the schema and tablename. This example is a basic extract process. You will know your data, and we highly recommend that you modify this query to partition the data export based on the advanced metadata approach above.
Blob to Azure SQL Data Warehouse
The GetTableListAndTriggerCopyBlobtoSQLDW pipeline looks up the Azure SQL Data Warehouse system table to get the list of tables to be copied from blob storage to Azure SQL Data Warehouse. This can be checked or changed in the Azure Data Factory User Interface GetTableListAndTriggerCopyBlobtoSQLDW pipeline by selecting the LookupTableList activity.
Selecting the BlobtoSQLDW activity and its settings will show how we pass the output of LookupTableList to the IterateAndCopyBlobToDW pipeline.
The GetTableListAndTriggerCopyBlobtoSQLDW pipeline triggers the pipeline IterateAndCopyBlobToDW to do the actual data copy.
The output of the lookup activity is passed into an array parameter which is used by the ForEach loop.
By selecting the ForEach activity you will see that this array is passed to another parameter associated with the activity.
Editing the activities for IterateAndCopyBlobToDW and selecting CopyCSVtoSQLDW you can investigate how the blob storage is passed a directory and filename based on the schema and tablename.
Azure SQL Data Warehouse is set up as the sink; it makes use of PolyBase and parameters for the destination, and truncates the table before loading. In an advanced metadata approach, you would remove the pre-copy script and make use of Azure SQL Data Warehouse snapshot backups, which are retained for seven days.