BI Search engine

Sunday, May 17, 2009

Good practices for ETL


Four-layered approach for ETL architecture design

Functional layer: Core functional ETL processing (extract, transform, and load).

Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.

Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.

Utility layer: Common components supporting all other layers.

Use file-based ETL processing where possible

Storage costs relatively little

Intermediate files serve multiple purposes

Used for testing and debugging

Used for restart and recover processing

Used to calculate control statistics

Helps to reduce dependencies - enables modular programming.

Allows flexibility for job-execution and -scheduling

Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.

Use data-driven methods and minimize custom ETL coding

Parameter-driven jobs, functions, and job-control

Code definitions and mapping in database

Consideration for data-driven tables to support more complex code-mappings and business-rule 
application.

Qualities of a good ETL architecture design

Performance

Scalable

Migratable

Recoverable (run_id, ...)

Operable (completion-codes for phases, re-running from checkpoints, etc.)

Auditable (in two dimensions: business requirements and technical troubleshooting)

Rerunnability, recoverability in ETL


Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece.

Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.