Data Lakes and Warehouses

Relational databases typically serve two distinct purposes: transaction processing and analysis. Because these are very different use cases, the optimal underlying architecture differs significantly between them.

Storing transactions (everyday events recorded row by row as they occur) is similar to how you would naturally store data in Excel. Analysis, however, is typically performed over long time periods, and traversing millions or even billions of rows for each analytical query would be very inefficient.
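To make the contrast concrete, here is a minimal Python sketch (the table and values are made up) of a row-oriented layout, where each record is kept together, versus a column-oriented layout, where each column is stored contiguously:

    rows = [  # row-oriented: each record kept together, natural for transactional inserts
        {"order_id": 1, "customer": "A", "amount": 10.00},
        {"order_id": 2, "customer": "B", "amount": 25.50},
        {"order_id": 3, "customer": "A", "amount": 7.25},
    ]

    columns = {  # column-oriented: each column stored contiguously, natural for analytics
        "order_id": [1, 2, 3],
        "customer": ["A", "B", "A"],
        "amount": [10.00, 25.50, 7.25],
    }

    # An aggregate such as a total only needs the "amount" column; the columnar
    # layout scans just that one array instead of touching every field of every row.
    assert sum(r["amount"] for r in rows) == sum(columns["amount"])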

How does loading data into Data Warehouses work?

Data warehouses use columnar storage (storing data column by column), which enables exceptionally strong compression and fast data aggregation. However, this means that loading data into a data warehouse requires a different process than the standard row-by-row insertion used by relational databases optimized for transaction processing. Instead, one prepares the data as files in blob storage and instructs the warehouse to ingest those files, which it does in a massively parallel manner. The files should be of an optimal size, usually in the range of 100 to 200 megabytes each, which means that data from a single large table will be spread over many files.
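The sketch below, written in Python with the pyarrow library, shows the general idea of rolling staged Parquet files at a target size. The fetch_batches callable, the file-name prefix, and the exact 150 MB threshold are illustrative assumptions rather than Omni Loader internals:

    import pyarrow as pa
    import pyarrow.parquet as pq

    TARGET_FILE_BYTES = 150 * 1024 * 1024  # aim for the 100-200 MB sweet spot

    def write_staging_files(fetch_batches, prefix="stage/orders"):
        """Buffer source batches and start a new Parquet file whenever the
        buffered data approaches the target file size. batch.nbytes is the
        in-memory size, so it only approximates the compressed size on disk."""
        buffered, size, file_no = [], 0, 0
        for batch in fetch_batches():  # each batch is a pyarrow.RecordBatch
            buffered.append(batch)
            size += batch.nbytes
            if size >= TARGET_FILE_BYTES:
                pq.write_table(pa.Table.from_batches(buffered),
                               f"{prefix}-{file_no:05d}.parquet")
                buffered, size, file_no = [], 0, file_no + 1
        if buffered:  # flush the final, possibly smaller, file
            pq.write_table(pa.Table.from_batches(buffered),
                          f"{prefix}-{file_no:05d}.parquet")

Because each staged file is independent, the warehouse can ingest all of them in parallel.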

Preparing Data

One can prepare data in several different formats depending on what the data warehouse supports. Typical options are CSV, Parquet, ORC, or Avro. Omni Loader supports the CSV and Parquet file formats, in both uncompressed and compressed variants: for CSV we support GZIP compression, while for Parquet we support both GZIP and Snappy.
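As a rough illustration of those combinations, the following Python snippet writes the same small, made-up table as Snappy-compressed Parquet, GZIP-compressed Parquet, and GZIP-compressed CSV (the file names are arbitrary):

    import csv
    import gzip
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "amount": [10.00, 25.50, 7.25]})

    # Parquet with Snappy or GZIP compression
    pq.write_table(table, "orders.snappy.parquet", compression="snappy")
    pq.write_table(table, "orders.gz.parquet", compression="gzip")

    # CSV compressed with GZIP
    with gzip.open("orders.csv.gz", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(table.column_names)
        writer.writerows(zip(*[col.to_pylist() for col in table.columns]))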

Migrating row-based data from the source is a multi-step process. The data has to be read sequentially, data types mapped to the types supported by the target database (and the intermediate file format), the row-based structure transformed into a columnar one, and the result compressed and uploaded to the data lake; only then can the target warehouse be notified that there is data to ingest. Omni Loader hides the complexity of that process, so loading data into a data warehouse looks as if one simply copies it from a standard OLTP source to a standard OLTP target.
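Under a few stated assumptions (a DB-API source cursor, a hypothetical upload_to_blob helper, and a Snowflake-style COPY statement standing in for whichever bulk-ingest command the target warehouse uses), a hand-rolled version of that pipeline might look roughly like this; Omni Loader performs the equivalent steps for you:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def migrate_table(source_cursor, upload_to_blob, warehouse_cursor,
                      table_name="orders", batch_rows=100_000):
        # 1. read the source rows sequentially
        source_cursor.execute(f"SELECT * FROM {table_name}")
        column_names = [d[0] for d in source_cursor.description]
        file_no = 0
        while True:
            rows = source_cursor.fetchmany(batch_rows)
            if not rows:
                break
            # 2./3. pivot the row batch into columns; pa.table infers Arrow types
            # here, whereas a real migration maps source types explicitly
            table = pa.table({name: [row[i] for row in rows]
                              for i, name in enumerate(column_names)})
            local_file = f"{table_name}-{file_no:05d}.parquet"
            pq.write_table(table, local_file, compression="snappy")  # 4. compress
            upload_to_blob(local_file, f"stage/{local_file}")        # 5. upload to the lake
            file_no += 1
        # 6. only now notify the warehouse; the COPY below uses Snowflake-style
        # syntax, other warehouses expose an equivalent bulk-ingest command
        warehouse_cursor.execute(
            f"COPY INTO {table_name} FROM @my_stage/{table_name} "
            "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
        )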

Note: we use Parquet by default and transform the data from its row structure into the columnar Parquet format on the fly. Parquet is a significantly better format than CSV, as it is strongly typed and columnar, with far better compression. You can still use CSV if that’s a requirement in your use case, of course.