Extracting Data from (Azure) SQL Server Huge Tables in RFC 4180-Compliant CSV Files | by Luca Zavarella | Jan, 2023 | Towards Data Science

When you need to extract data from an (Azure) SQL Server database, the first tools that come to mind are SQL Server Management Studio (SSMS) and Azure Data Studio (ADS). This is because both contain simple features that allow you to extract data from a database in a few clicks.

SSMS has long been the tool par excellence for interfacing with (Azure) SQL Server. Recently, Microsoft has been investing heavily in adding features to ADS to make it the tool of choice for the Microsoft data platform, on Azure and beyond. As a result, when you install the latest versions of SSMS today, the setup also installs ADS behind the scenes.

Any third-party system that involves importing a CSV file to load a dataset must be based on a standard that defines the CSV format. Therefore, before moving on to practical tests, let’s see if there is a standard definition of the CSV format.

What’s the RFC 4180

RFC 4180 is a standard that formalizes the format used for Comma-Separated Values (CSV) files and the specific Multipurpose Internet Mail Extensions (MIME) type associated with the CSV format (“text/csv”). The contents of this standard can be found here:

As you can see from the definition of the format in the previous link, while the first four points are fairly obvious, the remaining three need to be read carefully:

Keeping in mind also the examples given in the link, it is evident that the value of a field will be enclosed with double quotes only when needed. It doesn’t make sense to use double quotes for all the values of a field when only some of the values need them.
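As an illustration of this rule (the values here are invented for the example, not taken from the standard), an RFC 4180-compliant file quotes only the values that need it: fields containing a comma, a double quote, or a line break are enclosed in double quotes, and any embedded double quote is escaped by doubling it:

```csv
id,notes
1,plain value
2,"value, with a comma"
3,"value with ""double quotes"""
4,"value with
a line break"
5,another plain value
```

Note that row 4 is a single logical record even though it spans two physical lines: the line break belongs to the quoted field, not to the record structure.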

When you need to share information with third-party systems using CSV format files, the following applies:

It’s important that the CSV files you generate from your exports are RFC 4180 compliant to be sure that the files can be read by any external system that provides CSV file import capability.

In order to test how the aforementioned tools extract data in CSV format, let's create a simple table containing the special characters mentioned in the RFC 4180 standard, plus Unicode characters to ensure the generality of the contents of the text fields.

Creating a dummy table containing special characters

First, you have to create the extract_test table in your SQL Server instance using the following script:

Then you can add data to this table using the following script:

As you can see from the contents of the INSERT statements, we have provided all the special characters mentioned in the standard. We have also used Japanese characters, so that we can verify that the CSV file is written correctly using the Unicode character table.
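The original scripts are not reproduced here; a minimal sketch consistent with the description above (table and column names other than extract_test are hypothetical, and the actual scripts may differ) might look like this:

```sql
-- Hypothetical sketch of the test table and data.
-- NVARCHAR and N'' literals are used so Unicode (Japanese) text is preserved.
CREATE TABLE dbo.extract_test (
    id    INT           NOT NULL,
    notes NVARCHAR(200) NOT NULL
);

INSERT INTO dbo.extract_test (id, notes) VALUES
    (1, N'plain value'),
    (2, N'value, with a comma'),
    (3, N'value with "double quotes"'),
    (4, N'value with' + NCHAR(13) + NCHAR(10) + N'a carriage return'),
    (5, N'日本語のテキスト');
```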

Clearly, the table created here is not a 5 GB table, but it does contain the special characters needed to test CSV-format exports. Here is the output of a SELECT in ADS:

Don't worry if the carriage return does not show up in the output grid of ADS or SSMS. Given the way the INSERT for that row was written, the carriage return is there.

So, let’s try to extract data from this table using SSMS and ADS.

Using Microsoft user-friendly tools to extract data

Let’s first try to use the traditional tool with which we interface with SQL Server, namely the SQL Server Management Studio.

Extracting data with SSMS

Once you have opened SSMS and connected to your database instance, right-click on the name of the database hosting the table you’ve just created, go to Tasks and then Export Data:

You will be shown an initial screen describing the Extract Data activity. If you go forward, you will be shown this window:

Select the SQL Server Client data source, enter your server instance name, then choose the authentication method to log in to the database. In my case, having persisted the test table in an Azure SQL database, I used SQL Server authentication to access my test-sql-bug database, as you can see in Figure 3.

On the next screen of the Wizard you have the option of selecting the export destination. In our case, select Flat File Destination, create a CSV destination file via the Browse button in your preferred folder (remember to select the CSV extension in the Open window that opens after pressing Browse). Remember to check the Unicode flag to make sure you also handle the Japanese characters in our example. After that, select Delimited as the format, leaving the Text qualifier at “<none>”. Also make sure that the “Column names in the first data row” flag is checked. Then press Next:

In the next window select Copy data from one or more tables or views and press Next again.

In the configuration window that appears, you can then select the table [dbo].[extract_test] as the source table or view. For the other options you can leave everything as is, since the row delimiter ({CR}{LF}) and the column delimiter (comma) are the ones defined by the RFC 4180 standard. Then press Next:

In the next window keep Run immediately selected and press Finish. A summary window of the selected options will appear. Press Finish again and the extraction will start. When finished, press Close.

If you now try to open the output CSV file with a text editor (not Excel), you will notice the following:

Basically, in this case the Export Wizard extracts the contents of each text field regardless of whether it may contain special characters (comma and carriage return). This means that any carriage return contained in a text field is interpreted as a row delimiter by the system that has to read the file, just as any comma contained in a text field is interpreted as a field delimiter. Unicode characters, on the other hand, have been treated correctly. Therefore, the generated CSV file will not be recognized as correct by any third-party system that needs to import that information.

If you try to repeat the export, this time entering the double quotes as a text qualifier, you will get the following:

In this case, all extracted values are surrounded by double quotes, including the header. However, this forces an external system that must read the data to consider all numeric values as strings. Moreover, if a value in a text field contains a double quote character, it is not escaped, generating parsing problems for external systems. Therefore, again, the generated CSV file will not be recognized as correct by any third-party system that needs to import this information.

Regarding the scalability of the extraction operation on very large masses of data, there are no problems, because the Export Wizard uses SQL Server Integration Services (SSIS) as its engine, which is developed to handle huge bulk volumes of data.

Moreover, it may sometimes happen that you need to take action on the data source data types to avoid some errors during export with the Export Wizard, as highlighted in this blog post:

We can conclude this section by stating the following:

Using the SSMS Export Wizard as a tool for extracting data in CSV format from an (Azure) SQL Server database doesn’t guarantee having a format that complies with the standard defined by RFC 4180, with the consequence that the extracted information may not be properly read by an external system.

Instead, let’s see what happens when we use Azure Data Studio to extract the information in CSV format.

Extracting data with ADS

Once Azure Data Studio is open, the first thing to do is to add a new connection to your server instance. Be aware that, starting with newer versions, the Encrypted option is set to True by default. This will not result in connection errors if you connect to an Azure SQL database, but it might if your data source is an on-prem SQL Server. In that case, you can set the option to False.

That said, in order to extract the contents of a table (or view, or query) in ADS, you must first run a SELECT query and display its contents in the output grid. After that, simply press the "Save As CSV" button at the top right of the grid:

An output file selection window will open, allowing you to name the file that will be extracted (in our case ExtractTestADS.csv). As soon as you press the Save button, the contents of the CSV file will be shown directly within ADS:

Wow! The output generated by ADS complies with the RFC 4180 standard to all intents and purposes! Thus, it would seem that ADS is the perfect tool for extracting information in CSV format from an (Azure) SQL database.

However, there is a scalability problem. Because ADS requires the query output to be loaded into the output grid first, its usefulness is limited when dealing with many gigabytes of data. In those cases, holding all that data in a grid consumes so much system RAM that the application crashes.

We can therefore conclude this section as follows:

ADS’s CSV format data export procedure guarantees output that conforms to the RFC 4180 standard. However, the use of ADS for extraction tasks is indicated when the size of the dataset to be exported is rather limited. When more than 2–3 GB of data needs to be extracted, ADS may occupy the entire system memory and crash.

In general, we can therefore conclude that:

Unfortunately, the user-friendly features provided by Microsoft's data platform tools don't allow you to extract huge amounts of data in CSV format following the RFC 4180 standard.

Let’s try to see if we can achieve our goal through more specific tools known by expert users.

Using the BCP tool to extract data

The Bulk Copy Program (BCP) command line utility is used to import large numbers of new rows into SQL Server tables or to export data from tables to data files in a user-specified format. This is the solution that imports or exports data as fast as possible in even very large quantities. Therefore, it has no problem with scalability.
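As a reference point, a minimal BCP export of our test table might look like the following (server, database, and credentials are placeholders, and the exact flags depend on your setup). Note that, out of the box, the output has no header row and no double-quote handling:

```powershell
# -w selects Unicode character mode (needed for the Japanese text),
# -t, sets the comma as the field terminator.
# <server>, <database>, <user>, <password> are placeholders.
bcp dbo.extract_test out ExtractTestBCP.csv -S "<server>" -d "<database>" -U "<user>" -P "<password>" -w -t,
```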

In addition to being installed by default with a standard on-prem SQL Server installation, and to being installable stand-alone on a Windows operating system, the BCP utility can also be used from the Azure Cloud Shell to interact with an Azure SQL database, as shown in this blog post:

Without going into too much detail, the main problem with BCP is that it doesn’t extract table headers and doesn’t handle double quotes in a simple way out of the box. This is evidenced by the fact that Erland Sommarskog’s reference guide for its use reports a number of workarounds for getting both headers and double quotes, as you can see here:

One of the drawbacks of this approach is that you have to know in advance which fields need double quotes (unless you quote all text fields). Generally, I have no way of knowing in advance which fields might need double quotes; I just want to extract the data worry-free. Even if you manage to get headers and double quotes by following Erland's advice, the quotes will be applied to all values in the selected fields. As Erland himself points out:

… the assumption is that the data should always be quoted. If you only want to quote when needed, you will need to handle this in your query, which is outside the scope of this article. All I can say is: good luck. Or more directly: avoid it if you can.

Also, if a quoted field contains a string with both a comma and a double quote, BCP does not escape the double quote by doubling it.

We can therefore state that:

Using BCP to export data in a CSV format that includes both headers and double quotes is very arcane for the non-expert user. One downside is that you have to know in advance for which fields to provide double quotes. In addition, it would still not result in a format consistent with the RFC 4180 standard.

I will not go into the details of using Microsoft’s other command-line tool called SQLCMD, because the issues are similar to those highlighted in this section.

So what? How should we proceed? Since I couldn't find an application on the Internet able to extract data in an RFC 4180-compliant CSV format while also handling very large volumes of data, the only possible solution was to develop a custom one that can be used easily even by non-expert users. Let's see how this solution works.

Developing a custom solution in PowerShell

The first thing I asked myself when I decided to develop a specific solution for this problem was what programming language to use. The first language that came to mind was definitely Python. I then realized, however, that a standard user approaching the world of automation on a Windows machine may not know Python and would not find it preinstalled on the operating system. That is why the choice fell on PowerShell, which, among other things, provides a specific module for SQL Server.

Issues with the SQL Server PowerShell module

The first attempt I made was to use the SQL Server PowerShell module, which allows SQL Server developers, administrators, and business intelligence professionals to automate database development and server administration.

Specifically, the command I tried to use to send the query needed to retrieve the data from the Azure SQL database was Invoke-Sqlcmd. This command does nothing more than invoke the sqlcmd.exe command-line utility, often used by automation processes to retrieve information from a SQL Server database. So far, so good. The problem is that Invoke-Sqlcmd persists all the query output directly into PowerShell data structures. As you can guess, when the query output takes up more than 3–4 GB, you run into the same problem encountered with the extraction done in ADS: your system becomes unstable due to excessive RAM consumption.
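For illustration, the straightforward pattern would be something like the following (server and database names are placeholders); it works for small result sets, but not at scale:

```powershell
# Naive approach: simple, but the whole result set is materialized in
# memory before Export-Csv writes anything, so multi-GB exports exhaust RAM.
# $serverName and $databaseName are placeholders.
Import-Module SqlServer
Invoke-Sqlcmd -ServerInstance $serverName -Database $databaseName `
    -Query 'SELECT * FROM dbo.extract_test' |
    Export-Csv -Path ./ExtractTest.csv -UseQuotes AsNeeded
```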

Therefore, I found it appropriate to directly use ADO.NET objects in PowerShell to try to work around the problem. Let’s see how I used them in this solution.

Batch exporting data to output file

The main idea of my solution is to use an intermediate data structure (a DataTable) that collects the query data a batch of rows at a time. Once the maximum capacity of the intermediate data structure is reached, its contents are written to the target file, the structure is emptied, and it is immediately reloaded with the next rows of data from the data source:

This process goes on until there are no more rows to read from the data source.

You might wonder why I used an intermediate DataTable and didn’t implement a direct write stream to the output file through the StreamWriter. The answer lies in the ability to use PowerShell’s Export-Csv cmdlet directly.
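Putting the two ideas together, a stripped-down sketch of the batching loop might look like the following (the connection string, query, output path, and batch size are illustrative placeholders; the actual module code on GitHub is more robust):

```powershell
# Illustrative sketch only; assumes a SqlClient assembly is available
# (System.Data.SqlClient or Microsoft.Data.SqlClient) and PowerShell 7+.
$batchSize = 30000
$outFile   = './Extract.csv'                  # placeholder output path
$conn = New-Object System.Data.SqlClient.SqlConnection -ArgumentList $connectionString
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = 'SELECT * FROM dbo.extract_test'   # placeholder query
$reader = $cmd.ExecuteReader()

# An empty DataTable acts as the fixed-size intermediate buffer
$buffer = New-Object System.Data.DataTable
for ($i = 0; $i -lt $reader.FieldCount; $i++) {
    [void]$buffer.Columns.Add($reader.GetName($i), $reader.GetFieldType($i))
}

$firstBatch = $true
while ($reader.Read()) {
    $values = New-Object object[] $reader.FieldCount
    [void]$reader.GetValues($values)
    [void]$buffer.Rows.Add($values)
    if ($buffer.Rows.Count -ge $batchSize) {
        # Flush the full buffer to the CSV file, then empty it.
        # The header is written only on the first batch.
        $buffer.Rows | Export-Csv -Path $outFile -UseQuotes AsNeeded -Append:(-not $firstBatch)
        $firstBatch = $false
        $buffer.Clear()
    }
}
# Flush any remaining rows
if ($buffer.Rows.Count -gt 0) {
    $buffer.Rows | Export-Csv -Path $outFile -UseQuotes AsNeeded -Append:(-not $firstBatch)
}
$reader.Close()
$conn.Close()
```

The key point is that memory usage is bounded by the batch size rather than by the size of the result set.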

Writing data using Export-Csv

One of my goals whenever I have to solve a problem is not to reinvent the wheel if convenient solutions already exist that solve it completely or partially. In this case, by using the Export-Csv cmdlet I could dispense with rewriting all the logic that handles the special characters mentioned in the RFC 4180 standard.

Checking the PowerShell cmdlet guide, I realized that Export-Csv provides the parameters that control the use of double quotes only as of version 7:

Specifically, the UseQuotes parameter provides the value AsNeeded and defines its functionality as follows:

only quote fields that contain a delimiter character, double-quote, or newline character

Basically, it’s what we want in order to meet the requirements of the RFC 4180 standard.

Should you wish to provide double quotes only for certain fields, you can specify them explicitly via the QuoteFields parameter.
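A small self-contained demonstration of this behavior (file name and sample values are my own):

```powershell
# Requires PowerShell 7.0+, where Export-Csv gained -UseQuotes / -QuoteFields.
$rows = @(
    [pscustomobject]@{ Id = 1; Note = 'plain text' }
    [pscustomobject]@{ Id = 2; Note = 'a, comma' }
    [pscustomobject]@{ Id = 3; Note = 'some "quotes"' }
)
$rows | Export-Csv -Path ./quotes_demo.csv -UseQuotes AsNeeded
Get-Content ./quotes_demo.csv
# The header and plain values stay unquoted; only values containing a
# delimiter, a double quote, or a newline are wrapped in double quotes,
# with embedded double quotes doubled, as RFC 4180 requires.
```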

Now there is just a small problem with PowerShell versioning. Keep in mind that Windows 10, Windows 11, and Windows Server 2022 come with version 5.1 of Windows PowerShell preinstalled (also known as the Desktop edition). In order to use the newer versions of the Export-Csv cmdlet, you must install a newer version of PowerShell (at least PowerShell 7.0), which is for all intents and purposes a separate piece of software from Windows PowerShell, based on .NET Core (if you are interested in its evolution over time, you can learn more at this link).

It’s important to emphasize the following:

Since this module was developed for Core versions of PowerShell, it can also be used on Linux and macOS systems.

That said, let’s see how to use this new module.

How to use the SqlBulkExport module

The new SqlBulkExport module is available on GitHub here:

It provides two functions:

Both functions require the following parameters:

The Export-SqlBulkCsvByPeriod function provides three more mandatory parameters to be able to partition the result set according to a time period:

It’s evident that the formats used for the two input periods must be consistent with each other.

It’s important to note that extracting multiple CSV files broken down by a time period using the Export-SqlBulkCsvByPeriod function is only possible using a table/view, and not a query. If there are, for example, special needs for selecting fields and filters to be applied to a table, one must then first expose a view with these logics to then be able to extract multiple CSV files by time period.

Moreover, the Export-SqlBulkCsvByPeriod function makes use of the string token {} (open and closed curly brackets) within the name of the output CSV file; this token is replaced by the string associated with the time period of the transactions contained in the CSV file in question.

Both functions automatically recognize when to connect using Windows authentication or SQL Server authentication based on whether or not the User and Password parameters are passed.

Before proceeding with the examples, make sure you have installed the latest version of PowerShell.

Installing the latest PowerShell and SqlBulkExport versions

In order to install the latest version of PowerShell on Windows machines, download and run the 64-bit installer (in our case, version 7.3.0) from this link.

Click Next to all the Setup Wizard windows. Then click Finish. You’ll see the PowerShell 7 (x64) prompt installed into your applications:

Run it and you’ll see the PowerShell prompt ready for your commands:

You can enter the $PSVersionTable command and press Enter to check if all is working fine:

Great! If necessary, you can also install PowerShell on Linux or macOS.

Now you have to download the SqlBulkExport module files:

Ok! Now you are ready to try a few examples.

Export the content of our dummy table in one CSV file

Let us try extracting the contents of the extract_test table created at the beginning of this article to check its consistency with the RFC 4180 standard. In our case, the table in question is persisted in an Azure SQL database:

Here is the content of the output CSV file:

As you can see, the output CSV file content meets the RFC 4180 standard. Because the dummy table used had few rows, only one batch was used for extraction. Let’s now try to extract the contents of a table having a few tens of thousands of rows.

Exporting the content of a table/view in one CSV file

As before, the table we're going to extract data from is persisted in an Azure SQL database:

As you can see, it took 3 batches of 30K rows to extract the contents of a table of about 74K rows, taking a total of 1 second and 88 milliseconds. Not bad!

Let’s try using a query to export the data.

Exporting the output of a query in one CSV file

In this case we will extract data from the same table as in the previous case, but using a query like SELECT * FROM <table> WHERE <condition>.

Everything works like a charm! Let us now try exporting the contents of one view to multiple monthly CSV files.

Exporting the content of a table/view in multiple monthly CSV files

Imagine you have a transaction table containing hundreds of thousands of rows per month. There is a group of Data Scientists from outside the company who are assigned to do advanced analysis on the transaction history. For convenience, they ask you to extract a dataset consisting of subsets of the fields available in the table for a couple of months of transactions. Instead of generating a single CSV file, they ask you to provide them with multiple CSV files broken down by month.
Let’s see how to do this thanks to the Export-SqlBulkCsvByPeriod function:

Amazing! You just extracted about 1.5 million rows, broken down into three monthly CSV files, in just 1 minute and 19 seconds!

The need that prompted me to write this article was to extract a large amount of data (3–4+ GB) into one or more CSV files compliant with the RFC 4180 standard.

You have seen how the tools provided by Microsoft (whether IDEs, such as SSMS and ADS, or command-line tools, such as BCP) are unable to meet the above need. The only tool that seemed a bit more suitable is ADS, but it cannot extract large amounts of data without crashing. Without mincing words, it is quite embarrassing that to date Microsoft has not yet made a tool available to users that meets the requirements set forth above.

Not having found software on the Internet that met the above needs, I wrote the SqlBulkExport PowerShell module that solves the problem and made it available as open source on GitHub under an MIT license. I emphasize that I am not a PowerShell developer, so any input from you that would improve the solution is really welcome!
