Authors: J. Ateeq
cartoon graphic of male engineering developer sitting with laptop in a keyhole

I was sitting at an improvised work-from-home desk inside a quiet Toronto home. In front of me, my work machine lay staring back at me blankly. Still fingers hovered above a silent keyboard. Just moments ago, I had been clicking away feverishly. Finally, everything was in place! All that remained was to confirm the suggested on-screen changes in the SQL Server Management Studio wizard—SSMS in short—and then eight Gigabytes of MS SQL Server-formatted data could be understood by a PostgreSQL engine. One satisfying mouse click later, a few miles north of Lake Ontario, somewhere in the heart of downtown Toronto, a teammate exuberantly affirmed that the AI-powered question-answer tool we’d been inventing could now execute queries against its database service. No trumpets blew, but a sense of accomplishment hung in the air. Getting here had been complicated.

Data and metadata representations between MS SQL Server and PostgreSQL are not homogeneous (see Table 1), so migrations between these databases have a reputation for being tedious work. Even so, my team decided that it was worth the effort because the alternative was to repurpose application code and for the team to learn a different software system, which would slow down our progress. Since each RDBMS uses its own set of data types, migrating data from one to the other requires a mapping to transform the data into a format the latter can understand. While the Open Database Connectivity (ODBC) standard goes a long way in providing such mappings, automatic migration is not guaranteed and the mappings often require tweaks. Some of the data in question was of the geography data type in MS SQL Server, which is not easily exported to its counterpart in PostgreSQL. Furthermore, stored procedures written in Transact-SQL (T-SQL), the variant of SQL used by MS SQL Server, needed to be translated into a procedural language supported by PostgreSQL (for example, PL/pgSQL) before they could be executed by the PostgreSQL engine.

  MS SQL Server PostgreSQL
1 Available under a Commercial license Open source, released under PostgreSQL License
2 Examples of data type differences: 
   
   1.  Fixed-length byte string Binary
   2.  [1, 0, null] available with Bit
   3.  Native support for spatial data type via geography
Examples of data type differences: 
   
   1.  Fixed-length byte string BYTEA
   2.  [1, 0, null] available with Boolean
   3.  Support for spatial data type via optional PostGIS package
3 Supports clustered and non-clustered indexes, among others     Supports B-Trees and Generalized Search Tree (GiST) structures, among others
4 Table-valued functions and stored procedures supported in T-SQL, .NET languages, Python and others User Defined Functions supported in PL/pgSQL, Perl, Python and others

Table 1. Some differences between MS SQL Server and PostgreSQL.

In non-regulated, non-secured environments, database migration could involve some or all of the following steps. We could have acquired a snapshot of our partner’s database by unencrypted e-mail, use public shared cloud storage, or even a thumb drive. There would not have been a requirement to track who needed access to what, when and why. Off-the-shelf tools like Ispirer-MnMTK or Full Convert could have been used to migrate the database schema (including tables, views, stored procedures and functions) in one go. In the absence of such tools, the solution shown in Figure 1 might be set up. The database snapshot would be used to restore a MS SQL Server instance (Figure 2). Then, once software dependencies are resolved (via public internet in the case of Figure 1) the migration to a PostgreSQL instance could be orchestrated by tools like SSMS or pgLoader. 

Figure 1. An example of data migration in non-secured, non-regulated environments without off-the-shelf-tools. Figure 1. An example of data migration in non-secured, non-regulated environments without off-the-shelf-tools.

# Restore MS SQL Server database and log files from snapshot
$ /opt/mssql-tools/bin/sqlcmd -U <username> -P <password> -S <hostname>
1> RESTORE DATABASE <database-name>
> FROM DISK = '<source_directory>/backup.BAK'
> WITH MOVE '<database-name>' TO '<destination_directory>/db.mdf'
> MOVE '<database-name>_log' TO '<destination_directory>/db_log.ldf'

Figure 2. Example of how to restore a MS SQL Server instance from a backup.

But when data privacy is at stake, the game changes, calling for rigorous access controls and monitoring. Banks like RBC take immense care to ensure data is secure and protected. As the machine learning hub for RBC, Borealis AI is held to the same high standards to keep data safe and secure. Once access to data was approved, snapshots could only be shared with authenticated shared network drives or internal cloud platforms. With looming deadlines, the migration tools mentioned above were off the table, as a request for admin access could take time to get approved. We ended up resolving software dependencies through secure software repositories, not the public internet. Stored procedures and functions were translated from T-SQL to PL/pgSQL. Finally, with the MS SQL Server and the PostgreSQL instances set up on Borealis AI’s High Performance Compute cluster, we could use SQL Server Management Studio to migrate data. While deployment on this cluster is configured through custom YAML templates, an equivalent Docker ‘run’ command for the MS SQL Server instance is shown in Figure 3. 

# Run MS SQL Server Instance inside Docker container
docker run -e 'ACCEPT_EULA=Y' \
    -e 'SA_PASSWORD=<Password>'
    -e 'MSSQL_PID=Developer'
    -p <host-port>:1433
    -d mssql-server-linux-secure 

Figure 3. Example of how to launch an instance of MS SQL Server in a Docker container on specified TCP port.

Surprisingly, working in a more restricted environment became an opportunity for creative problem solving and learning:

  • With (relevant) off-the-shelf tools out of the picture, I needed to grasp the differences in data and metadata representation between MS SQL Server and PostgreSQL. 
  • I learned to speak T-SQL and PL/pgSQL fluently. 
  • I learned that HTTP routes, so commonly used for creating remote connections to web applications, don’t work for creating remote connections to databases, because databases use other protocols (like JDBC); instead, I forwarded TCP ports from my local machine to the compute unit running the database instance (Figure 4). 
  • I realized that because the Microsoft Tape Format-based backup file was compressed, the storage requirements of the database instance were in fact greater than the backup file size. 
  • Lastly, I realized how important it was to pay attention to the type of storage volume attached to these database instances—shared volumes with high data traffic can prolong data migrations significantly.
     

# MS SQL Server listens on port 1433 by default
oc port-forward <openshift-pod-name> <pod-port>:1433
# PostgreSQL listens on port 5432 by default 
oc port-forward <openshift-pod-name> <pod-port>:5432
 

Figure 4. Using Openshift tools to forward TCP ports from development machine to pod running on Borealis AI cluster.

So as I sit here, in front of my improvised work-from-home desk in my quiet Toronto home, I recognize that this migration deepened my understanding of the ecosystem inside which I write code. I felt satisfaction in getting to know the system. It also invited me to reflect on why restrictions are necessary when managing data. And of course, the experience taught me how to migrate data from MS SQL Server to PostgreSQL.

Authors