ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.
The range of data values or the data quality in an operational system may fall outside what designers expected when the validation and transformation rules were specified. Profiling a source during data analysis is recommended to identify the data conditions that the transformation rule specifications will need to handle. Profiling usually leads to amendments of the validation rules, both explicit and implicit, that are implemented in the ETL process.
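As a minimal sketch of what column-level profiling can look like, the Python below summarises one field of a source extract. The function name `profile_column`, the sample records, and the choice of statistics are assumptions made for illustration; real profiling tools report far more.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarise one column of a list of dict records (illustrative helper)."""
    values = [row.get(column) for row in rows]
    # Treat both None and the empty string as missing values.
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top": Counter(non_null).most_common(3),
    }


# Hypothetical source extract with the kind of surprises profiling should expose.
rows = [
    {"customer_id": 1, "country": "DE", "age": 34},
    {"customer_id": 2, "country": "de", "age": 214},
    {"customer_id": 3, "country": "", "age": None},
    {"customer_id": 4, "country": "FR", "age": 41},
]

age_profile = profile_column(rows, "age")
country_profile = profile_column(rows, "country")
print(age_profile)
print(country_profile)
```

Here the age profile shows a maximum of 214 and one missing value, and the country profile shows "DE" and "de" counted as distinct values. Findings like these are what would prompt a new range check or a case-normalisation step in the transform rules.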
Data warehouses typically grow asynchronously, fed by a variety of sources that each serve a different purpose and therefore carry, for example, different reference data. ETL is a key process for bringing heterogeneous and asynchronous source extracts into a homogeneous environment.
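One common way to reconcile differing reference data is a per-source lookup table that maps each source's codes to a single warehouse code. The sketch below assumes two hypothetical sources, "crm" and "billing", and a hypothetical `conform` helper; none of these names come from a particular tool.

```python
# Hypothetical mappings from each source's own codes to one shared warehouse code.
GENDER_MAP = {
    "crm": {"M": "male", "F": "female"},
    "billing": {"1": "male", "2": "female"},
}


def conform(source, value, default="unknown"):
    """Translate a source-specific code into the shared warehouse code."""
    # Unknown sources and unmapped codes fall back to the default.
    return GENDER_MAP.get(source, {}).get(str(value).strip(), default)


print(conform("crm", "M"))
print(conform("billing", 2))
print(conform("billing", "9"))
```

Unmapped codes fall back to a default instead of raising an error, so a load can continue while the unexpected values are reported for review. Whether that is appropriate depends on the warehouse's data quality rules.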
Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may shrink, which may mean that the same amount of data has to be processed in less time. Some ETL systems have to scale to process terabytes of data in order to update data warehouses holding tens of terabytes. Increasing volumes of data may require designs that can scale from a daily batch, to several micro-batches per day, to integration with message queues or real-time change data capture for continuous transformation and update.
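The relationship between volume, processing window, and throughput can be checked with simple arithmetic. The sketch below is a rough sizing aid only: the function names, the 2 TB volume, the window lengths, and the measured rate of 100 MB/s are all illustrative assumptions.

```python
def required_throughput_mb_s(volume_gb, window_hours):
    """Sustained MB/s needed to process volume_gb within window_hours."""
    return volume_gb * 1024 / (window_hours * 3600)


def fits_window(volume_gb, window_hours, measured_mb_s):
    """True if the measured pipeline rate is enough for the window."""
    return measured_mb_s >= required_throughput_mb_s(volume_gb, window_hours)


# Hypothetical nightly load of 2 TB (2048 GB).
print(round(required_throughput_mb_s(2048, 8), 1))  # 8-hour window
print(round(required_throughput_mb_s(2048, 4), 1))  # window cut to 4 hours
print(fits_window(2048, 8, measured_mb_s=100))
print(fits_window(2048, 4, measured_mb_s=100))
```

Halving the window doubles the required sustained rate, so a pipeline that comfortably meets an 8-hour window at 100 MB/s no longer fits once the window shrinks to 4 hours. That is the point at which a design might move from one daily batch toward smaller, more frequent loads.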