Copying Data into a Table in Azure SQL DW (Synapse Analytics)

In this post I want to start a conversation about copying data into Azure Synapse Analytics (SQL DW). Yes, as you would expect, there are multiple ways of doing it, and besides some of the good old ones (such as the primitive INSERT INTO … SELECT), I want to point out a few that are exclusive to Azure Synapse Analytics – and among them, besides some good old ones (*cough*, *cough*), there are a couple of new ones that Microsoft launched without huge fanfare a couple of months ago.
I am definitely not touching the Integration Services side, since the product is half-dead and, as much as I love it, with ADF Mapping Data Flows, Azure Databricks, Spark & Polybase well into the game, SSIS is progressively looking like a product that came out well over a decade ago (yeah, it is in fact 15 years old right now).

First of all, let us list some of them (and I am not even attempting to provide all of them, of course):
INSERT INTO … SELECT FROM … (the most well-known one – see the sketch right after this list)
SELECT INTO … FROM … (well known to perform well, since it will create a HEAP while copying most of the properties from the original table(s))
CREATE TABLE … AS SELECT … (the old way, which must be around 10 years old on PDW/APS & Azure SQL DW, but which has never made it into the Box Product or Azure SQL Database)
Polybase (which will use External Tables & externally stored data to transfer it into Azure SQL DW)
BCP (a good old tested friend that will give you a pain in the neck until you master it)
OPENROWSET / BULK INSERT (some very good and very old friends with complicated histories (who remembers all the code pages?), fiddly settings and an uncertain future, mostly because of their original restrictions, I guess)
COPY INTO … (the brand-new command that, given rather neat privileges, will allow you to copy data from external storage accounts, much like BULK INSERT)
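
To put the first two side by side, here is a minimal sketch of both variants, assuming a pre-existing dbo.DimAccount table (the destination table names are made up for illustration):

-- INSERT INTO … SELECT: the destination table must already exist
INSERT INTO dbo.ExistingDimAccount
SELECT *
    FROM dbo.DimAccount;

-- SELECT INTO … FROM: creates the destination table on the fly
SELECT *
    INTO dbo.AnotherDimAccount
    FROM dbo.DimAccount;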

In this blog post I will simply focus on the features that have not been ported to the other SQL Server flavours (hopefully just yet): CTAS & COPY INTO.

CTAS

The CREATE TABLE AS … syntax is probably a couple of decades old, and it allows us to define the destination table where we shall land our data with precision. Some people might point to SELECT INTO … as the easier option, but with CTAS you can define the distribution, the compression type (and even a sorted Clustered Columnstore Index), the file group (when running on-premises) or even the partitioning, and much more. I will not even step into the ACID-less behaviour where, if you are unlucky and the process gets interrupted, you might simply end up with an empty table (a table which should not exist at all, if the execution were 100% transactional).
There will always be situations where you simply want to exercise that precision in a more accessible & observable way – by defining the destination table while loading the data into it in a single step.
This functionality exists in all major databases – Oracle, MySQL, MariaDB, PostgreSQL, Apache Drill, you name it! With some exceptions, I might have to add, but you can generally count on its existence.

Azure Synapse Analytics inherited this functionality from PDW/APS (the old Microsoft appliances) and its usage is very easy for sure – it works just as you would expect.
Copying the data of an already existing table in Azure Synapse Analytics is very easy with CTAS:

DROP TABLE [dbo].[NewDimAccount];

CREATE TABLE [dbo].[NewDimAccount]
WITH
(
	DISTRIBUTION = ROUND_ROBIN,
	CLUSTERED COLUMNSTORE INDEX
) AS
SELECT *
    FROM dbo.DimAccount;
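
Should you need even more precision, nothing stops you from, say, hash-distributing the copy on a chosen column – a minimal sketch, with the distribution column being just an illustrative assumption:

DROP TABLE [dbo].[NewDimAccount];

CREATE TABLE [dbo].[NewDimAccount]
WITH
(
    -- distributing on AccountKey instead of ROUND_ROBIN (illustrative choice)
    DISTRIBUTION = HASH(AccountKey),
    CLUSTERED COLUMNSTORE INDEX
) AS
SELECT *
    FROM dbo.DimAccount;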

Notice that we cannot explicitly specify the names or the types of the destination columns, but we can surely influence their properties through a proper SELECT statement, such as:

DROP TABLE [dbo].[NewDimAccount];

CREATE TABLE [dbo].[NewDimAccount]
WITH
(
	DISTRIBUTION = ROUND_ROBIN,
	CLUSTERED COLUMNSTORE INDEX
) AS
SELECT AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey, AccountDescription,
    AccountType, 
    ISNULL(Operator,'') as Operator, 
    CustomMembers, ValueType, CustomMemberOptions
    FROM dbo.DimAccount;

In the statement above I have forced the Operator column to become a NOT NULL column, and we can play a lot of the same tricks we are used to from the SELECT INTO … FROM … statements.
I just wish we could define the column properties, constraints and indexes right away … :)
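
Another of those tricks – a minimal sketch, assuming we simply want to shrink the AccountDescription column – is to CAST the column in the SELECT, which defines the data type of the destination column:

DROP TABLE [dbo].[NewDimAccount];

CREATE TABLE [dbo].[NewDimAccount]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
) AS
SELECT AccountKey,
    -- the CAST defines the data type of the destination column
    CAST(AccountDescription AS NVARCHAR(100)) AS AccountDescription,
    -- and ISNULL still forces a NOT NULL column, as above
    ISNULL(Operator,'') AS Operator
    FROM dbo.DimAccount;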

Combining this method with the Polybase functionality, we can copy data into our local table from any table – be it located on the same Azure Synapse Analytics, on Azure Blob Storage or anywhere else – as long as an external table supports it.
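
A minimal sketch of that combination – the external data source, the file format and the location below are entirely hypothetical – could look like this:

-- the external table points to data living outside of Azure Synapse Analytics
CREATE EXTERNAL TABLE dbo.ExternalDimAccount
(
    AccountKey INT,
    AccountDescription NVARCHAR(50)
)
WITH
(
    LOCATION = '/dimaccount/',
    DATA_SOURCE = MyAzureBlobStorage,   -- hypothetical external data source
    FILE_FORMAT = MyTextFileFormat      -- hypothetical external file format
);

-- CTAS lands the external data into a local, distributed table
CREATE TABLE dbo.LocalDimAccount
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
) AS
SELECT *
    FROM dbo.ExternalDimAccount;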

*sniff* … I hope it will come one day into the Azure SQL Database as well as into the Box (SQL Server) product.

COPY INTO

A couple of months ago Microsoft announced that a new command had made it into the Azure Synapse Analytics syntax – COPY INTO. The COPY INTO command comes with the promise of being a high-performance command for loading data from external storage accounts that is also easy to set up and to manage from the security perspective. External storage accounts on Azure Synapse Analytics means, for me, Azure Blob Storage or Azure Data Lake Storage (ADLS) Gen2, but who knows – the vague name might point to the flexibility of adding support for new storage services in the future.

The syntax for this command, currently in preview, is the following:

COPY INTO [schema.]table_name
[(Column_list)]
FROM '<external_location>' [,...n]
WITH
 (
 [FILE_TYPE = {'CSV' | 'PARQUET' | 'ORC'} ]
 [,FILE_FORMAT = EXTERNAL FILE FORMAT OBJECT ]
 [,CREDENTIAL = (AZURE CREDENTIAL) ]
 [,ERRORFILE = '[http(s)://storageaccount/container]/errorfile_directory[/]']
 [,ERRORFILE_CREDENTIAL = (AZURE CREDENTIAL) ]
 [,MAXERRORS = max_errors ]
 [,COMPRESSION = { 'Gzip' | 'DefaultCodec' | 'Snappy' }]
 [,FIELDQUOTE = 'string_delimiter']
 [,FIELDTERMINATOR = 'field_terminator']
 [,ROWTERMINATOR = 'row_terminator']
 [,FIRSTROW = first_row]
 [,DATEFORMAT = 'date_format']
 [,ENCODING = {'UTF8' | 'UTF16'}]
 [,IDENTITY_INSERT = {'ON' | 'OFF'}]
)

Meaning that besides specifying the destination table, we can load data from the following 3 formats – CSV, PARQUET (yay!) and ORC. We can specify the file format and the Azure credentials (if you are using a secured connection and not a public URL – and you should never allow your data to be publicly accessible; if you want to share data, there are specific services that serve that purpose). We can also specify the destination for the error output and the necessary credential for it, plus the usual properties that one would find in the OPENROWSET or BULK INSERT statements – the maximum number of errors, the compression, and all the possible terminators and delimiters.
An extra note goes to the encoding with UTF-8 support (yay – SQL Server 2016 started allowing code page 65001 for BULK INSERT), the date format (there is none for BULK INSERT) and a bunch of options that are definitely missing here – the NULL settings, Last Row and Fire Triggers options – although, on the other side, there are no triggers in Azure Synapse Analytics anyway.
Given that BULK INSERT supports just the CSV format, I will make a wild guess that the CSV option of COPY INTO was implemented to avoid reusing some of the old code there, and because BULK INSERT is simply unavailable on Azure Synapse Analytics :)
Think of the PARQUET & ORC formats as the key feature of the COPY INTO command.
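
To give an idea of how some of those options fit together, here is a purely illustrative sketch (the table name, the storage location and the option values are all assumptions of mine) of a COPY INTO for a CSV file with a header row:

COPY INTO dbo.SomeStagingTable
FROM 'https://xxxsecretexxx.blob.core.windows.net/external/somefile.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '?sv=...'),
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,          -- skip the header row
    MAXERRORS = 10,        -- tolerate up to 10 rejected rows
    DATEFORMAT = 'ymd'     -- how date columns should be interpreted
);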

Let’s try it out with a rather simple table by setting it up:

CREATE TABLE dbo.airtravel
(
    C1 NVARCHAR(10),
    C2 NVARCHAR(10),
    C3 NVARCHAR(10),
    C4 NVARCHAR(10)
);

I am preparing a rather plain and not-so-exciting CSV file with the following content, which will be stored as a file with the name airtravel.csv on my Blob Storage account:

"JAN",  340,  360,  417;"FEB",  318,  342,  391;"MAR",  362,  406,  419;"APR",  348,  396,  461;"MAY",  363,  420,  472;"JUN",  435,  472,  535;"JUL",  491,  548,  622;"AUG",  505,  559,  606;"SEP",  404,  463,  508;"OCT",  359,  407,  461;"NOV",  310,  362,  390;"DEC",  337,  405,  432;

and then load the data into the airtravel table with the following statement (naturally, I have obscured the exact location of the Blob Storage with the xxxsecretexxx name, and the SAS key is so short because it is naturally wrong here on the blog – you will have to insert your own if you want to try out this technology):

COPY INTO dbo.airtravel
FROM 'https://xxxsecretexxx.blob.core.windows.net/external/airtravel.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='?sv=2019-02-02&yeahthaisar34ls34cr3th0h0h0'),
    FIELDTERMINATOR=',',
    ROWTERMINATOR=';'
);

In just a couple of seconds it works out perfectly, as you can verify with the query below:

SELECT *
    FROM dbo.airtravel;

This is all pretty much something that would work on SQL Server 2017+ without much fanfare as well, but what about the PARQUET and ORC files?
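
Before getting to those, for comparison, here is a rough sketch of how the very same CSV could be loaded on SQL Server 2017+ or Azure SQL Database – assuming a database scoped credential and an external data source that I am simply making up here:

-- on SQL Server 2017+ / Azure SQL Database, not on Azure Synapse Analytics
CREATE EXTERNAL DATA SOURCE MyBlobStorage
WITH ( TYPE = BLOB_STORAGE,
       LOCATION = 'https://xxxsecretexxx.blob.core.windows.net/external',
       CREDENTIAL = MyBlobCredential );  -- hypothetical database scoped credential

BULK INSERT dbo.airtravel
FROM 'airtravel.csv'
WITH ( DATA_SOURCE = 'MyBlobStorage',
       FIELDTERMINATOR = ',',
       ROWTERMINATOR = ';' );
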
Let’s consider the following demo, starting with the definition of the destination table:

CREATE TABLE dbo.userdata
(
    registration_dttm  datetime2,
    id                 int,
    first_name         nvarchar(250),
    last_name          nvarchar(250),
    email              nvarchar(250),
    gender             nvarchar(250),
    ip_address         nvarchar(250),
    cc                 nvarchar(250),
    country            nvarchar(250),
    birthdate          nvarchar(250),
    salary             float,
    title              nvarchar(250),
    comments           nvarchar(250)
);

And now, taking a userdata1.parquet file, we can load it directly from Azure Blob Storage into our table:

COPY INTO dbo.userdata
FROM 'https://xxxsecretexxx.blob.core.windows.net/external/userdata1.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='?sv=2019-02-02&yeahthaisar34ls34cr3th0h0h0')
);

And it works perfectly with ease, which is great news.

Checking that the insertion worked (there were 1000 rows), we can query the table:

select *
    from dbo.userdata;

As you can see in the picture, the table has the data imported from the PARQUET file, which is great news. In my tests, the very same procedure works just fine for the ORC format as well.
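
For completeness, the ORC variant – assuming a hypothetical userdata1.orc file sitting next to the parquet one – differs only in the FILE_TYPE:

COPY INTO dbo.userdata
FROM 'https://xxxsecretexxx.blob.core.windows.net/external/userdata1.orc'
WITH (
    FILE_TYPE = 'ORC',
    CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='?sv=2019-02-02&yeahthaisar34ls34cr3th0h0h0')
);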

Final Thoughts

I am happy to see direct access to the different storage accounts (Blob & ADLS Gen2) and support for the different modern formats, such as ORC & PARQUET. I wish the old functionality (BULK INSERT | OPENROWSET) had been implemented as well, for better portability of the applications that are being migrated from Azure SQL DB/SQL Server to Azure Synapse Analytics and back. Given the original state of the technology and its limitations, I guess this was necessary, but having incompatible statements does not make me very happy, because the goal of the technology is to facilitate and not to complicate.

Oh, and I wish there will be support for AVRO as well in the future :)
