Azure Data Factory v2 and its available components in Data Flows

Azure Data Factory v2 and its available components in Data Flows

Many of you (including me) wonder about it.
Namely: Is it possible to move my ETL process from SSIS to ADF? How can I reflect current SSIS Data Flow business logic in Azure Data Factory? And turned out that new feature in ADF: Data Flow – comes with help. Further more, such solution will be scalable as Azure Databricks works under the hood. Don’t worry – you don’t have to know Databricks and extra language (Scala, Python) at all. How does it possible? Carry on reading.

In this post’s section I would like to show you what kind of actions you can do and what is their equivalent in SQL and SSIS.

 

New world: Data Flow in Azure Data Factory

Big benefit here is that you will not write any line of code. You can design whole business logic from the scratch using Data Flow UX and appropriate code in Scala will be prepare, compile and execute in Azure Databricks behind the scenes. So that you can focus on business logic and data transformations like data cleaning, aggregation, data preparation and build code-free dataflow pipelines.
Additionally, the process would be automatically scale-out if you allow for that.

The Data Flow feature in ADF is currently in limited preview. If you would like to try this out on your Data Factories, please fill out this form to request whitelisting your Azure Subscription for ADF Data Flows: http://aka.ms/dataflowpreview. Once your subscription has been enabled, you will see “Data Factory V2 (with data flows)” as an option from the Azure Portal when creating Data Factories.

ADF Data Flow vs SSIS vs T-SQL

The main purpose of this post is to bring capabilities of (ADF) Data Flow closer and compare to its counterparts from SSIS and relevant code of T-SQL.
Why? Because it’s a far easier to understand something new by comparison to something that we know very well.
Further more, tables and icons talks to us much more, hence it easy to acquire such new knowledge.
Having that fundamentals, you can re-design current ETL process in Azure Data Factory when having clear image of mapping components between SSIS and ADFDF. To fulfil the picture out, I have added column which shows T-SQL code which does the same or similar things in SQL.
So, no matter which technology your current process uses, either Stored Procedures in SQL or SSIS, you are able to sit down and recreate that process uncover new opportunities.

Components

Operation / Activity Description SSIS equivalent SQL Server equivalent

New branch
Create a new flow branch with the same data
Multicast (+icon)

Join
Join data from two streams based on a condition
Merge join

Conditional Split
Route data into different streams based on conditions
Conditional Split

Union
Collect data from multiple streams
Union All

Lookup
Lookup additional data from another stream
Lookup
Subselect, function,
LEFT/RIGHT JOIN

Derived Column
Compute new columns based on the existing once
Derived Column

Aggregate
Calculate aggregation on the stream
Aggregate

Surrogate Key
Add a surrogate key column to output stream from a specific value
Script Component

Incremental Primary Key
(with limited capabilities)

Exists
Check the existence of data in another stream
Lookup / Merge Join

Select
Choose columns to flow to the next stream OUTPUT in components,
mapping columns

Filter
Filter rows in the stream based on a condition
Conditional Split

Sort
Order data in the stream based on column(s)
Sort

Extend
Use any custom logic from an external library
Script Component
SQL CLR

Source
Source for your data flow.
Obligatory first element of every Data Flow in ADF.

OLE DB Source and more …

Sink
Destination for your data flow
OLE DB Destination and more…

Summary

This new feature has huge capabilities. I’m very excited being had opportunities to use it more.
Automatically scalable process, like this, might be very efficient with Big Data processing. Hence, it’s worth to start designing new processes with Azure Data Factory or even migrate existing processes when your enterprise suffers on performance degradation due to amount of processing data.
Please be aware that among Microsoft solutions is another Data Flow – exists in Power BI. Do not confuse them.

In next posts of this series I will be explaining all activities from ADF Data Flow a bit deeper.
Let me know your thoughts or leave a comment once you have any questions.
Thanks for reading!

Useful links

ADF Data Flow’s documentation
ADF Data Flow’s videos
Follow this tag on the blog: ADFDF

Previous Last week reading (2018-12-02)
Next Speaking at SQL Saturday #782 Slovenia

About author

Kamil Nowinski
Kamil Nowinski 84 posts

Blogger, speaker. Data Platform MVP, MCSE. Senior Data Engineer & data geek. Member of Data Community Poland, co-organizer of SQLDay, Happy husband & father.

View all posts by this author →

You might also like

Auditing in Azure SQL Data Warehouse

The first article in a series on “Security Intelligence in Azure PaaS” inspired me to write something on auditing in Azure SQL Data Warehouse. To put it simply for anyone

Big Data 0 Comments

Setting up Code Repository for Azure Data Factory v2

In this blog post I want to quick go through one of useful capabilities that Microsoft provided with version 2 of Azure Data Factory. As a developer I always want

Big Data 2 Comments

Functions in the USQL – the hidden gem in the Summer 2017 Update

Hello SQL Folks. We all love USQL for its great extensibility and how it makes our life easier in some particular “area of data”. There are a lot of things

0 Comments

No Comments Yet!

You can be first to comment this post!

Leave a Reply

− 4 = 2

Protected with IP Blacklist CloudIP Blacklist Cloud