Data Lakes and Warehouses

Relational databases are typically used for two distinct purposes: transaction processing and analysis. Because these are very different use cases, the optimal underlying architecture differs significantly between the two.


Event Transactions

Transactions (everyday events stored in rows as they occur) are kept much the way you would naturally record data in Excel. Analysis, however, is typically performed over long time periods, and it would be very inefficient to traverse millions or even billions of rows for each analytical query.

How does loading data into Data Warehouses work?

Data warehouses use columnar storage (storing data column by column), which gives them exceptionally strong compression and fast data aggregation.
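
To make the difference concrete, here is a tiny Python illustration of row-oriented versus column-oriented layout. It is purely conceptual, not how a warehouse is implemented internally, and the table and field names are invented:

# Row-oriented layout: each record keeps all of its fields together,
# so an aggregate over one field still has to touch every record.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.50},
    {"order_id": 2, "region": "US", "amount": 75.00},
]
total_rowwise = sum(r["amount"] for r in rows)

# Column-oriented layout: each field is a contiguous array of one type,
# so SUM(amount) reads only the "amount" array and compresses far better.
columns = {
    "order_id": [1, 2],
    "region":   ["EU", "US"],
    "amount":   [120.50, 75.00],
}
total_columnwise = sum(columns["amount"])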

However, this means that loading data into a data warehouse requires a different process than the standard row-by-row insertion used by relational databases optimized for transaction processing. Instead, one prepares the data as files in blob storage and instructs the warehouse to ingest those files, which it does in a massively parallel manner.
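
As a rough sketch of that stage-then-ingest pattern (not Omni Loader's internal code), the snippet below uploads one prepared file to Azure Blob Storage and then issues a Snowflake-style COPY INTO. The container, stage, and table names are placeholders; other clouds and warehouses follow the same shape:

from azure.storage.blob import BlobServiceClient
import snowflake.connector

# 1. Upload a prepared, compressed file into the staging container.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="staging", blob="orders/part-0001.parquet")
with open("part-0001.parquet", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# 2. Tell the warehouse to ingest everything staged under that prefix.
#    The warehouse pulls the files in a massively parallel manner.
conn = snowflake.connector.connect(account="<account>", user="<user>", password="<password>")
conn.cursor().execute(
    "COPY INTO orders FROM @staging_stage/orders/ "
    "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
)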

Such files must be of an optimal size, usually in the range of 100 to 200 megabytes each, which means that data from a single large table will be spread over many files.

Preparing Data

One can prepare data in several different formats, depending on what the data warehouse supports; typically that is CSV, Parquet, ORC, or Avro. Omni Loader supports the CSV and Parquet file formats in both uncompressed and compressed variants: for CSV we support GZIP compression, while for Parquet we support both GZIP and Snappy.
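
For example, using the pyarrow library (shown here purely for illustration; the data and file names are invented), both staging formats and their compressed variants can be produced like this:

import gzip
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pa.table({"order_id": [1, 2], "amount": [120.50, 75.00]})

# Parquet, compressed with Snappy or GZIP.
pq.write_table(table, "orders.snappy.parquet", compression="snappy")
pq.write_table(table, "orders.gz.parquet", compression="gzip")

# CSV, compressed with GZIP.
with gzip.open("orders.csv.gz", "wb") as f:
    pacsv.write_csv(table, f)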

Migrating row-based data from the source is a multi-step process. The data has to be read sequentially, its types mapped to the types supported by the target database (and by the intermediate file format), the row-based structure transformed into a columnar one, compressed, uploaded to the data lake, and only then can the target warehouse be notified that there is data to ingest.
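
Schematically, the sequence looks like the sketch below; read_batches, map_types, upload and notify_warehouse are hypothetical helpers standing in for the source reads, blob uploads and warehouse notifications, not Omni Loader APIs:

import pyarrow as pa
import pyarrow.parquet as pq

def stage_table(read_batches, map_types, upload, notify_warehouse):
    """Move one row-based source table into the warehouse, file by file."""
    for part, batch in enumerate(read_batches()):          # 1. sequential read of source rows
        batch = map_types(batch)                           # 2. map source types to target types
        table = pa.Table.from_pylist(batch)                # 3. pivot rows into a columnar table
        path = f"part-{part:04d}.parquet"
        pq.write_table(table, path, compression="snappy")  # 4. compress (aim for ~100-200 MB/file)
        upload(path)                                       # 5. upload to the data lake
        notify_warehouse(path)                             # 6. only now can the warehouse ingest it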

Omni Loader hides the complexity of that process, so loading data into a data warehouse looks as if one were simply copying data from a standard OLTP source to a standard OLTP target. That is exactly what one wants: a boring and predictable copying process.

An optimized process

Omni Loader reads the data, transforms it into a columnar format, compresses it, uploads it to blob storage, and then has the data warehouse ingest the data files as they become ready, all without you having to set up any ETL pipelines.

