Loading Large Volumes of Data into SQL Azure with SSIS
Loading data into SQL Databases (Azure) is fairly simple. With SQL Server Integration Services (SSIS), the task becomes even more trivial as with many ETL tasks that we’ve done in the past from data source to data source. Truly speaking, a data source is just that, a data source. The true task at hand is in the preparation and transformation of the data between the source and destination. Viewing SQL Databases as what they are, just another data source, makes the design of what you need to do less complicated.
A basic test from a box version of SQL Server to a SQL Database is pretty uneventful. In the past, these tests have proven effective with only a few catches. A few of those are
1) Network, network and network. Your line to the “cloud” is critical to stability and speed.
2) Disconnections can be painful so plan for them. Have a restart point.
3) Manipulate your connections and data flow so they are tuned for a push to a cloud environment.
Before going too far, the testing so far has been limited to SSIS. BCP and Bulk Copy are two other options that should be investigated if you run into performance problems. However, SSIS being the flagship ETL for Microsoft Platforms, it is a given this is the tool you will look at initially. There is a need to state, “Use the right tool for the task”.
The first thing you can do in order to perform a simple test of pushing data to a SQL Database is go out and get a trial of Windows Azure. The trial is pretty small in terms of nodes and performance, but it serves as a great introduction to working with Azure. You can connect to SQL Azure with a number of methods. SSMS is typically going to be a common one if you are a DBA type or used to working out of SSMS. To connect to SQL Azure from SSMS, you can follow the instructions here, “Connecting to SQL Azure from SQL Management Studio 2008” or for a more visual look and addition notes on firewall changes, “Connecting to SQL Azure with SQL Server Management Studio 2008 R2”. With SSMS 2012, all of these steps are identical and nothing changes.
Once you configure your SQL Database, a good test is to push some data from AdventureWorks. This task is almost identical to the task of pushing data from one SQL Server to another. Create a connection (ADO.NET or OLEDB) for a source from SQL Server and AdventureWorks and then create a destination using the same guidelines and options as how you would connect to SQL Azure from SSMS.
Add a data flow
Push some data
This article will not go into the basic steps of the SSIS setup but is more of a guide to what you need.
Some things that are important and also recommended from the SQL CAT team when it comes to SQL Azure.
1) Use multiple streams. Cut your data up into multiple packages or data flows and run concurrent loads
2) Manipulate your network packet sizes given the concept of pushing data out to the internet ( that is the best way to say it)
3) Think about indexes. If you have a ton of indexes, consider disabling them, dropping them and so on for a data load situation. This is the same with any data loading event.
Lots of Data
The simple test of pushing data into SQL Azure is excellent for getting used to the slight differences and dealing with throwing data out into the cloud. The problem you run into is going to be volume of data. This is where the typical trial will cause some suffering on performance.
Recently, I had the chance to finally load a real life amount of data into SQL Azure and I wasn’t all that happy with it. Now, before going on, there are some things that could make it better. More nodes in the cluster, a thicker network, possibly a design change to the table. The list could go on but we’re talking about real life and I wanted to share this experience in case you run into it and know what to be prepared for. Remember, this is pushing over the internet. We’re not hard-lined into a network so that must be taken into account when we say, “it was slow”.
The amount of data that needed to be loaded from a SQL Server instance to SQL Azure was 90GB, comprised of one table. The table was thin at 20 columns and a maximum row size of 233 bytes. This information is important when considering your packet size.
The test performed was in SSIS with 4 segments of the data needed to load.
The packet size was lowered slightly and use bulk load when available was checked. Also, the batch size was left at the defaults.
This was the painful part and really, why I’m writing this blog – prepare yourself.
The SQL Azure destination had no indexing on it to enhance the ability to load the data. The schema was identical as well so the data flow was a direct, source to destination (as shown above in the image). The rate of loading the data was around ~500,000 rows per 14 minutes. Minutes is not a typo. Yet again, this is pushing over the internet to the cloud.
Given this rate and the volume that needed to go up being around 260,000,000 rows in the first load, the estimated total time for this task was around 121 hours. Whoa!!! There’s some planning that needs to be done here, resources increased, horses kicked…something.
This article wasn’t intended to show you a step-by-step SSIS setup to load data to SQL Azure. Truly speaking, it is just another data source. What we did take away is the time that is needed when you are loading a massive amount of data to SQL Azure as part of either a migration to the cloud, integration of historic data or other needs that have pushed your data into SQL Azure.
Plan for these lengths of time, take the options that need tuning into consideration, look at testing other load methods, and make sure the other cloud options are investigated.
The next step that will be performed is a load into RDS to see if there is improvement. That will be posted in another article and should give a good view into the choices available.
Lastly and to reiterate, this was a large volume data load as an initial setup of SQL Azure as a source. Typically, we do not push 90GB tables into the cloud daily (not sure you could) so take this as an article outlining the need for really paying attention to your planning and designing when converting over to SQL Azure, if that has been a decision that was made.