Data Warehousing Concepts: 2009

Sunday, May 17, 2009

Good practices for ETL

Four-layered approach for ETL architecture design

Functional layer: Core functional ETL processing (extract, transform, and load).

Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.

Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.

Utility layer: Common components supporting all other layers.

Use file-based ETL processing where possible

Storage costs relatively little

Intermediate files serve multiple purposes

Used for testing and debugging

Used for restart and recover processing

Used to calculate control statistics

Helps to reduce dependencies - enables modular programming.

Allows flexibility for job-execution and -scheduling

Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.

Use data-driven methods and minimize custom ETL coding

Parameter-driven jobs, functions, and job-control

Code definitions and mapping in database

Consideration for data-driven tables to support more complex code-mappings and business-rule application.
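
As a minimal sketch of the data-driven approach, code mappings can live in a table instead of in job code. The table name code_map, its columns, and the helpers load_mappings and translate are hypothetical, and an in-memory sqlite3 database stands in for wherever the definitions are actually kept.

```python
import sqlite3

# Hypothetical mapping table: (domain, source_code) -> target_code.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE code_map (domain TEXT, source_code TEXT, target_code TEXT, "
    "PRIMARY KEY (domain, source_code))"
)
conn.executemany(
    "INSERT INTO code_map VALUES (?, ?, ?)",
    [("gender", "1", "M"), ("gender", "2", "F")],
)


def load_mappings(connection, domain):
    """Read one domain's mappings into a dict so each row lookup is cheap."""
    rows = connection.execute(
        "SELECT source_code, target_code FROM code_map WHERE domain = ?",
        (domain,),
    )
    return dict(rows)


def translate(value, mappings, default=None):
    """Translate a source code; unknown codes fall back to `default`."""
    return mappings.get(value, default)


gender_map = load_mappings(conn, "gender")
translated = [translate(v, gender_map, default="U") for v in ["1", "2", "9"]]
```

With this layout, supporting a new source code means inserting a row into the mapping table rather than changing and redeploying the job.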

Qualities of a good ETL architecture design

Performance

Scalable

Migratable

Recoverable (run_id, ...)

Operable (completion-codes for phases, re-running from checkpoints, etc.)

Auditable (in two dimensions: business requirements and technical troubleshooting)

Rerunnability, recoverability in ETL

Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece.

Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
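
A minimal sketch of the run_id and checkpoint idea, assuming a small JSON file as the checkpoint store. The function run_with_checkpoints, the file layout, and the phase names are hypothetical; ETL tools usually keep this state in their own repository.

```python
import json
import os
import tempfile
import uuid


def run_with_checkpoints(phases, checkpoint_path, run_id=None):
    """Run phases in order, skipping any already recorded as completed.

    `phases` is a list of (name, callable) pairs. The checkpoint file is a
    small JSON document holding the run_id and the completed phase names.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)
    else:
        state = {"run_id": run_id or uuid.uuid4().hex, "completed": []}

    for name, func in phases:
        if name in state["completed"]:
            continue  # finished in an earlier attempt of this run
        func(state["run_id"])
        state["completed"].append(name)
        # Persist the checkpoint only after the phase has succeeded.
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)
    return state


# Example: the "transform" phase fails once, then the run is restarted.
calls = []
attempts = {"transform": 0}


def extract(run_id):
    calls.append(("extract", run_id))


def transform(run_id):
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("simulated failure")
    calls.append(("transform", run_id))


phases = [("extract", extract), ("transform", transform)]
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_with_checkpoints(phases, checkpoint, run_id="run-001")
except RuntimeError:
    pass  # the first attempt stops at the failed phase

final_state = run_with_checkpoints(phases, checkpoint)
```

On the second call the extract phase is skipped because the checkpoint already lists it, and the same run_id is reused, so rows written by the failed attempt could be found and rolled back by that ID.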

Parallel processing in ETL

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data.

ETL applications implement three main types of parallelism:

Data: By splitting a single sequential file into smaller data files to provide parallel access.

Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.

Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.

All three types of parallelism usually operate combined in a single job.
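
Data parallelism, the first of the three types, can be sketched as follows. This is only an illustration: the helper names and the sample records are made up, and a thread pool from the standard library stands in for the partitioned execution an ETL engine would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split a list into n_chunks contiguous slices of near-equal size."""
    size, remainder = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def transform_chunk(chunk):
    """Hypothetical per-record transformation: qty * unit_price."""
    return [qty * unit_price for qty, unit_price in chunk]


records = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
chunks = split_into_chunks(records, 2)

# executor.map returns results in the same order as the input chunks.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(transform_chunk, chunks))

amounts = [amount for part in results for amount in part]
```

Threads keep the sketch portable; for CPU-bound transformations, ProcessPoolExecutor from the same module offers the same map interface.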

An additional difficulty comes with making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.

Performance of ETL

ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and lots of memory.

In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indexes. Thus, for better performance, it may make sense to do most of the ETL processing outside of the database, and to use bulk load operations whenever possible. Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Here are some common methods used to increase performance:

Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the partitioning).

Do all validation in the ETL layer before the load. Disable integrity checking (disable constraint ...) in the target database tables during the load.

Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.

Generate IDs in the ETL layer (not in the database).

Drop the indexes (on a table or partition) before the load - and recreate them after the load (SQL: drop index ...; create index ...).

Use parallel bulk load when possible; it works well when the table is partitioned or there are no indexes. Note: attempting parallel loads into the same table (partition) usually causes locks, if not on the data rows, then on the indexes.

If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately. You often can do bulk load for inserts, but updates and deletes commonly go through an API (using SQL).
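
The last point, deciding in the ETL layer which rows are inserts, updates, or deletes, might look like the sketch below. It assumes the source extract is a full snapshot keyed by a business key, so a key missing from the source is treated as a delete; the function name and the sample rows are hypothetical.

```python
def classify_changes(source_rows, target_rows):
    """Compare two {key: row} dicts and split the work into three sets."""
    inserts = {k: v for k, v in source_rows.items() if k not in target_rows}
    updates = {
        k: v
        for k, v in source_rows.items()
        if k in target_rows and target_rows[k] != v
    }
    deletes = sorted(k for k in target_rows if k not in source_rows)
    return inserts, updates, deletes


target = {1: ("Alice", "NY"), 2: ("Bob", "LA"), 3: ("Carol", "SF")}
source = {2: ("Bob", "LA"), 3: ("Carol", "Boston"), 4: ("Dan", "TX")}

inserts, updates, deletes = classify_changes(source, target)
```

The inserts can then go through a bulk loader, while the usually smaller update and delete sets go through ordinary SQL statements.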

Whether to do certain operations in the database or outside may involve a trade-off. For example, removing duplicates using distinct may be slow in the database; thus, it makes sense to do it outside. On the other hand, if using distinct will significantly (for example, 100-fold) decrease the number of rows to be extracted, then it makes sense to remove duplicates as early as possible, in the database, before unloading the data.

A common source of problems in ETL is a large number of dependencies among ETL jobs. For example, job "B" cannot start until job "A" has finished. You can usually achieve better performance by visualizing all processes on a graph, then reducing the graph to make maximum use of parallelism and to keep "chains" of consecutive processing as short as possible.
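
One way to inspect such a dependency graph is to group the jobs into "waves" that can run in parallel. The sketch below uses graphlib from the Python standard library (Python 3.9 or later); the job names and the parallel_waves helper are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_waves(dependencies):
    """Group jobs into waves; jobs in one wave have no mutual dependencies.

    `dependencies` maps each job to the set of jobs it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical job graph: B and C wait for A; D waits for B and C.
jobs = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
waves = parallel_waves(jobs)
```

The number of waves equals the length of the longest chain of consecutive jobs, so removing an unnecessary dependency shows up directly as fewer or wider waves.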

Again, partitioning of big tables and of their indexes can really help.

Another common issue occurs when the data is spread between several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases - and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only three layers:

Sources

Central ETL layer

Targets

This allows processing to take maximum advantage of parallel processing. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into 1st - and then replicating into the 2nd).

Of course, sometimes processing must take place sequentially. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for main "fact" tables.

Challenges in ETL

ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transform rules specifications. This will lead to an amendment of validation rules explicitly and implicitly implemented in the ETL process.

Data warehouses typically grow asynchronously, fed by a variety of sources which all serve a different purpose, resulting in, for example, different reference data. ETL is a key process to bring heterogeneous and asynchronous source extracts to a homogeneous environment.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch, to multiple micro-batches per day, to integration with message queues or real-time change-data capture for continuous transformation and update.

Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps:

1. Cycle initiation

2. Build reference data

3. Extract (from sources)

4. Validate

5. Transform (clean, apply business rules, check for data integrity, create aggregates)

6. Stage (load into staging tables, if used)

7. Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair)

8. Publish (to target tables)

9. Archive

10. Clean up

Load Concept

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DW (or even other parts of the same DW) may add new data in a historized form, for example, hourly. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.

As the load phase interacts with a database, the constraints defined in the database schema — as well as in triggers activated upon data load — apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.

Transform Concept

The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:

Selecting only certain columns to load (or selecting null columns not to load)

Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female). This calls for automated data cleansing; no manual cleansing occurs during ETL.

Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to M)

Deriving a new calculated value (e.g., sale_amount = qty * unit_price)

Filtering

Sorting

Joining data from multiple sources (e.g., lookup, merge)

Aggregation (for example, rollup - summarizing multiple rows of data - total sales for each store, and for each region, etc.)

Generating surrogate-key values

Transposing or pivoting (turning multiple columns into multiple rows or vice versa)

Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)

Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
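
A few of the transformation types above, together with partial rejection of failing rows, can be sketched as follows. The row layout, the column names, and the helper functions are assumptions made for the example, not part of any particular ETL tool.

```python
GENDER_CODES = {"1": "M", "2": "F"}  # example mapping from the list above


def transform_row(row):
    """Apply a few of the listed transformations to one extracted row."""
    if row["gender"] not in GENDER_CODES:
        raise ValueError(f"unknown gender code: {row['gender']!r}")
    return {
        "gender": GENDER_CODES[row["gender"]],          # translate coded value
        "sale_amount": row["qty"] * row["unit_price"],  # derive a new value
        "tags": row["tags"].split(","),                 # split one column
    }


def transform_all(rows):
    """Partial rejection: good rows go on, bad rows are kept with a reason."""
    accepted, rejected = [], []
    for row in rows:
        try:
            accepted.append(transform_row(row))
        except (KeyError, ValueError) as exc:
            rejected.append((row, str(exc)))
    return accepted, rejected


rows = [
    {"gender": "1", "qty": 2, "unit_price": 5, "tags": "a,b"},
    {"gender": "9", "qty": 1, "unit_price": 3, "tags": "c"},
]
accepted, rejected = transform_all(rows)
```

Keeping the rejected rows with their reasons, rather than dropping them, gives the audit step something to report on.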

Extract Concepts

The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as web spidering or screen-scraping. Extraction converts the data into a format for transformation processing.

An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check if the data meets an expected pattern or structure. If not, the data may be rejected entirely.
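
As an illustration of this parsing check, the sketch below validates each extracted line against an expected pattern and sets aside the lines that do not match. The pipe-delimited record layout and the names LINE_PATTERN and parse_lines are hypothetical.

```python
import re

# Hypothetical record layout: id|ISO date|amount, e.g. "42|2009-05-17|19.99".
LINE_PATTERN = re.compile(r"(\d+)\|(\d{4}-\d{2}-\d{2})\|(\d+\.\d{2})")


def parse_lines(lines):
    """Parse raw lines; lines that do not match the pattern are rejected."""
    parsed, rejected = [], []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line.strip())
        if match is None:
            rejected.append(line)
            continue
        record_id, date, amount = match.groups()
        parsed.append((int(record_id), date, float(amount)))
    return parsed, rejected


raw = ["42|2009-05-17|19.99", "oops", "7|2009-05-18|5.00"]
parsed, rejected = parse_lines(raw)
```

Note that the pattern only checks the shape of the date, not whether it is a valid calendar date; a stricter check would belong in the validation step.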

ETL (Extract, Transform, Load) Basics

Extract, transform, and load (ETL) in database usage and especially in data warehousing involves:

Extracting data from outside sources

Transforming it to fit operational needs (which can include quality levels)

Loading it into the end target (database or data warehouse)

The advantages of efficient and consistent databases make ETL very important as the way data actually gets loaded.

This article discusses ETL in the context of a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.

ETL can also function to integrate contemporary data with legacy systems.

Usually ETL implementations store an audit trail on positive and negative process runs. In almost all designs, this audit trail does not give the level of granularity which would allow a DBA to reproduce the ETL's result in the absence of the raw data.

Saturday, April 18, 2009

Cognos BI0-112 Sample Questions

Q1. In Report Studio, an author wants to ensure detailed report data is summarized using the default aggregation specified in the package. Which of the following is true?

A. The Aggregate Function must be set to Total.

B. The Aggregate Function property must be set to None.

C. The Auto-Group and Summarize property must be set to No.

D. The Auto-Group and Summarize property must be set to Yes.

Q2. In Report Studio, why would an author unlock a report?

A. To open a report saved locally in XML.

B. To insert an object inside a list column.

C. To apply conditional formatting.

D. To view data that has been restricted in Framework Manager.

Q3. In Report Studio, an author wants to create a variable for a conditional block so the report displays either a crosstab or a chart, depending on what the user selects in the prompt. What property of the conditional block must the author define to create this variable?

A. Style Variable

B. Current Block

C. Block Variable

D. Render Variable

Q4. Sort Key is a data item in Query1, however, it is not part of the rendered report. What must be done for the Sort Key data item to be applied to the report?

A. The Sort Key is added as a property of the list.

B. The Sort Key is added as a property of the page.

C. The Sort Key is added as a property of the query.

D. The Sort Key is added as a property of the prompt.

Solutions: 1. D, 2. B, 3. C, 4. A


Paper Pattern for BI0-112 Test for IBM- Cognos

If your job role includes building reports using relational data models, as well as enhancing, customizing, and managing professional reports, then you may consider adding this certification to your professional portfolio.

The Cognos 8 BI Author exam covers key concepts, technologies, and functionality of the Cognos products. In preparation for an exam, we recommend a combination of training and hands-on experience, and a detailed review of product documentation.

Create reports (14%)

1. Describe how to create list, crosstab, and repeater reports

2. Present data graphically

Focus reports (12%)

1. Describe how to focus reports using filters

2. Describe how to focus reports using prompts

Enhance reports (44%)

1. Describe the use of calculations in reports

2. Identify techniques to enhance layout and content

3. Describe how to customize reports with conditional formatting

4. Identify steps to set-up Drill-through Access

Create reports using the query model (18%)

1. Identify the purpose and components of the query model

2. Describe how to create the query model

3. Describe techniques used in the query model that determine how data is aggregated

4. Identify the effects on the query model(s) of creating a master/detail relationship in the report layout

Setup reports for bursting (6%)

1. Describe the function of the settings required to distribute reports through bursting

Manage events using agents (6%)

1. Describe the use of Event Studio in Cognos 8 BI

Tuesday, April 7, 2009

Notable Uses of Data Mining

Surveillance:

Previous data mining programs under the U.S. government intended to stop terrorism include the Total Information Awareness (TIA) program, Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program. These programs have been discontinued due to controversy over whether they violate the US Constitution's 4th amendment, although many programs that were formed under them continue to be funded by different organizations, or under different names.

Two plausible data mining techniques in the context of combating terrorism include "pattern mining" and "subject-based data mining".

Pattern mining:

"Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often means association rules. The original motivation for searching association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule "beer => crisps (80%)" states that four out of five customers who bought beer also bought crisps.

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity - these patterns might be regarded as small signals in a large ocean of noise." Pattern mining also includes newer areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported to classical knowledge discovery search techniques.
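
The "beer => crisps (80%)" rule from the supermarket example can be reproduced with a short calculation. The baskets below are invented so that four of the five beer baskets also contain crisps; the function name is hypothetical, and the 80% figure corresponds to what is usually called the confidence of the rule.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent)
    return support, confidence


# Made-up baskets: five contain beer, and four of those also contain crisps.
baskets = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"beer", "crisps", "milk"},
    {"beer"},
    {"bread", "milk"},
    {"crisps"},
    {"milk"},
    {"bread"},
    {"bread", "crisps"},
]
support, confidence = support_and_confidence(baskets, "beer", "crisps")
```

Here the confidence is 0.8 and the support is 0.4, since four of the ten baskets contain both items. The sketch assumes the antecedent occurs at least once; otherwise the confidence is undefined.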

Subject-based data mining:

"Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combatting terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."

Games:

Since the early 1960s, the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art (i.e. pre-tablebase knowledge), is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Business:

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to — across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that are likely to take the offer. And finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining.

Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.

Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.

Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people who play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data Mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich history of customer transactions on millions of customers dating back several years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.

Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing." In this paper the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in this paper demonstrate the ability of applying a system of mining historical die-test data to create a probabilistic model of patterns of die failure which are then utilized to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.

Science and engineering:

In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important to help improve the diagnosis, prevention and treatment of the diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.

In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Different tap positions will generate different signals. However, there was considerable variability amongst normal-condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.

Data mining techniques have also been applied to dissolved gas analysis (DGA) on power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse data and to determine trends which are not obvious to the standard DGA ratio techniques such as the Duval Triangle.

A fourth area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples of data mining applications include biomedical data facilitated by domain ontologies, mining clinical trial data, traffic analysis using SOM, et cetera.

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.

The process of data mining

Knowledge Discovery in Databases (KDD) is the name coined by Gregory Piatetsky-Shapiro in 1989 to describe the process of finding interesting, interpreted, useful and novel data. There are many nuances to this process, but roughly the steps are to preprocess raw data, mine the data, and interpret the results.

Pre-processing:

Once the objective for the KDD process is known, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a datamart or data warehouse.

The target set is then cleaned. Cleaning removes the observations with noise and missing data.

The clean data is reduced into feature vectors, one vector per observation. A feature vector is a summarized version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The feature(s) selected will depend on what the objective(s) is/are; obviously, selecting the "right" feature(s) is fundamental to successful data mining.

The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
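
A minimal sketch of this split, assuming a simple random partition with a fixed seed so the result is repeatable. The 25% test fraction and the function name are illustrative choices, not something the process prescribes.

```python
import random


def train_test_split(vectors, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and cut it into training and test sets."""
    shuffled = list(vectors)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


feature_vectors = [(i, i % 2) for i in range(20)]
training_set, test_set = train_test_split(feature_vectors)
```

Shuffling before cutting matters when the source data is ordered, for example by date or by customer, because otherwise the test set would not resemble the training set.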

Data mining:

Data mining commonly involves four classes of task:

Classification - Arranges the data into predefined groups. For example an email program might attempt to classify an email as legitimate or spam. Common algorithms include Nearest neighbor, Naive Bayes classifier and Neural network.

Clustering - Is like classification but the groups are not predefined, so the algorithm will try to group similar items together.

Regression - Attempts to find a function which models the data with the least error. One method is to use genetic programming.

Association rule learning - Searches for relationships between variables. For example a supermarket might gather data of what each customer buys. Using association rule learning, the supermarket can work out what products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis".
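
To make the classification task concrete, here is a minimal nearest-neighbor sketch for the spam example. The two features and all the data points are invented for illustration, and a practical classifier would use many more features and examples.

```python
import math


def nearest_neighbor(training, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(training, key=lambda item: math.dist(item[0], point))
    return label


# Made-up feature vectors: (number of links, number of capitalised words).
training = [
    ((0, 1), "legitimate"),
    ((1, 2), "legitimate"),
    ((8, 9), "spam"),
    ((9, 7), "spam"),
]
predictions = [nearest_neighbor(training, p) for p in [(1, 1), (7, 8)]]
```

The groups ("legitimate" and "spam") are fixed in advance, which is what distinguishes this from clustering, where the algorithm has to find the groups itself.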

Interpreting the results:

The final step of knowledge discovery from data is to evaluate the patterns produced by the data mining algorithms. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a "test set" of data on which the data mining algorithm was not trained. The learnt patterns are applied to this "test set" and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a "training set" of sample emails. Once trained, the learnt patterns would be applied to the "test set" of emails on which it had not been trained. The accuracy of these patterns can then be measured from how many emails they correctly classify. A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.
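
The comparison of predicted and desired output on the test set can be sketched as below. The labels are invented, and the evaluate helper is hypothetical; it reports accuracy together with the true-positive and false-positive rates, which form a single point on a ROC curve.

```python
def evaluate(predicted, actual, positive="spam"):
    """Return accuracy plus true- and false-positive rates (two ROC inputs)."""
    pairs = list(zip(predicted, actual))
    correct = sum(p == a for p, a in pairs)
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    positives = sum(a == positive for a in actual)
    negatives = len(actual) - positives
    return {
        "accuracy": correct / len(pairs),
        "true_positive_rate": tp / positives,
        "false_positive_rate": fp / negatives,
    }


# Made-up test-set labels and the labels a trained model assigned to them.
actual = ["spam", "spam", "legitimate", "legitimate", "legitimate"]
predicted = ["spam", "legitimate", "legitimate", "spam", "legitimate"]
metrics = evaluate(predicted, actual)
```

A large gap between accuracy on the training set and accuracy on the test set is the usual symptom of overfitting. The sketch assumes the test set contains at least one positive and one negative example.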

If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards, then the final step is to interpret the learnt patterns and turn them into knowledge.

Data Mining Background

Humans have been "manually" extracting information from data for centuries, but the increasing volume of data in modern times has called for more automatic approaches. As data sets and the information extracted from them have grown in size and complexity, direct hands-on data analysis has increasingly been supplemented and augmented with indirect, automatic data processing using more complex and sophisticated tools, methods and models. The proliferation, ubiquity and increasing power of computer technology have aided data collection, processing, management and storage. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the process of using computing power to apply methodologies, including new techniques for knowledge discovery, to data.

Data mining identifies trends within data that go beyond simple data analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of processes and target opportunities. However, shifting control and understanding of processes from statisticians to poorly informed or uninformed users can result in false positives, no useful results, and, worst of all, results that are misleading and/or misinterpreted.

Although data mining is a relatively new term, the technology is not. For many years, businesses and governments have used increasingly powerful computers to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining). Continuous innovations in computer processing power, disk storage, data capture technology, algorithms, methodologies and analysis software have dramatically increased the accuracy and usefulness of the extracted information.

The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information about the characteristics of the collected data, using a number of techniques (e.g., association rule mining). Forecasting and predictive modeling provide predictions of future events, and the processes may range from the transparent (e.g., rule-based approaches) through to the opaque (e.g., neural networks).
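As a small illustration of the association rule mining mentioned above, the sketch below computes the two standard measures of a rule, support and confidence, over a handful of made-up shopping baskets. The basket contents and function names are assumptions chosen for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support

# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

# Rule under test: bread -> butter
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both items, and two of the three baskets with bread also contain butter. Real association rule miners search over many candidate itemsets rather than scoring a single hand-picked rule.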

Metadata (data about the characteristics of a data set) are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

Data mining is usually performed on "real-world data". Such data are vulnerable to collinearity because of unknown and possibly unobserved interrelations. An unavoidable fact of data mining is that the (sub-)set of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships that exist across other parts of the domain. Alternative methods using experiment-based approaches, such as Choice Modelling for human-generated data, may be used to address this sort of issue. In these situations, inherent correlations can be either controlled for or removed altogether during the construction of the experimental design.

There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like RapidMiner, Weka, KNIME, and the R Project have become an informal standard for defining data-mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that these can be shared between different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. The latest version of PMML, version 4.0, is scheduled to be released in early 2009.

Since the availability of affordable computer processing power in the last quarter of the 20th century, organizations have been accumulating vast and ever growing amounts of data, including, for example:

operational and transactional data, such as sales, cost, inventory, payroll and accounting data

nonoperational data, such as forecasts and macroeconomic data

metadata (data about the data itself), such as logical database design and data dictionary definitions

This article outlines the longitudinal changes in data mining and knowledge discovery (DMKD) research activities during the last decade by surveying a large collection of data mining literature, in order to provide a comprehensive picture of current DMKD research and to classify these research activities into high-level categories.

Data Mining

Data mining is the process of extracting hidden patterns from large amounts of data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

While data mining can be used to uncover hidden patterns in data samples that have been "mined", it is important to be aware that the use of a sample of the data may produce results that are not indicative of the domain. Data mining will not uncover patterns that are present in the domain but not in the sample. There is a tendency for insufficiently knowledgeable "consumers" of the results to treat the technique as a sort of crystal ball and attribute "magical thinking" to it. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that the pattern is representative of the whole population from which that data was drawn. Hence, an important part of the process is the verification and validation of patterns on other samples of data.

The term data mining has also been used in a related but negative sense, to mean the deliberate searching for apparent but not necessarily representative patterns in large amounts of data. To avoid confusion with the other sense, the terms data dredging and data snooping are often used. Note, however, that dredging and snooping can be (and sometimes are) used as exploratory tools when developing and clarifying hypotheses.

Saturday, March 21, 2009

Benefits and Disadvantages of data warehouses

Some of the benefits that a data warehouse provides are as follows:

A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.

Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.

Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.

Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.

Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.

Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

Disadvantages of data warehouses

There are also disadvantages to using a data warehouse. Some of them are:

Over their life, data warehouses can have high costs. The data warehouse is usually not static. Maintenance costs are high.

Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.

There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.

Evolution in organization use of data warehouses

Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves. The following general stages of use of the data warehouse can be distinguished:

Offline Operational Database

Data warehouses in this initial stage are developed by simply copying the data off an operational system to another server where the processing load of reporting against the copied data does not impact the operational system's performance.

Offline Data Warehouse

Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data is stored in a data structure designed to facilitate reporting.

Real Time Data Warehouse

Data warehouses at this stage are updated every time an operational system performs a transaction (e.g., an order, a delivery, or a booking).

Integrated Data Warehouse

Data warehouses at this stage are updated every time an operational system performs a transaction. The data warehouses then generate transactions that are passed back into the operational systems.

Data warehouses versus operational systems

Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the rules of data normalization that originated with Codd in order to ensure data integrity. Five increasingly stringent normal forms are commonly cited; Codd defined the first three, and the fourth and fifth were added later by other researchers. Fully normalized database designs (that is, those satisfying all five normal forms) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.

Data warehouses are optimized for speed of data retrieval. Frequently, data in data warehouses are denormalised via a dimension-based model. Also, to speed data retrieval, data warehouse data are often stored multiple times: in their most granular form and in summarized forms called aggregates. Data warehouse data are gathered from the operational systems and held in the data warehouse even after the data have been purged from the operational systems.
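The idea of keeping both granular data and aggregates can be sketched with Python's built-in sqlite3 module. The table and column names (sales_detail, sales_daily_agg and so on) are hypothetical, and a real warehouse would refresh the aggregate as part of its load process rather than build it once.

```python
import sqlite3

agg_conn = sqlite3.connect(":memory:")

# Most granular form: one row per individual sale.
agg_conn.execute(
    "CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL)"
)
agg_conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [
        ("2009-03-01", "widget", 10.0),
        ("2009-03-01", "widget", 15.0),
        ("2009-03-02", "gadget", 7.5),
    ],
)

# Summarized form: the same data stored a second time, pre-aggregated per day and product.
agg_conn.execute(
    """
    CREATE TABLE sales_daily_agg AS
    SELECT sale_date, product, SUM(amount) AS total_amount, COUNT(*) AS sale_count
    FROM sales_detail
    GROUP BY sale_date, product
    """
)

# Reports read the small aggregate table instead of scanning every detail row.
aggregate_rows = agg_conn.execute(
    "SELECT sale_date, product, total_amount, sale_count "
    "FROM sales_daily_agg ORDER BY sale_date, product"
).fetchall()
print(aggregate_rows)
```

The trade-off is extra storage and load-time work in exchange for faster retrieval, which matches the optimization goal described above.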

Top-down versus bottom-up design methodologies

Bottom-up design

Ralph Kimball, a well-known author on data warehousing, is a proponent of an approach to data warehouse design that is frequently considered bottom-up. In the so-called bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventually be unioned together to create a comprehensive data warehouse. The combination of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture".

Business value can be returned as quickly as the first data marts can be created. Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".
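One simplified reading of "consistent dimensions" is that any dimension key shared by two data marts must carry the same attributes in both. The sketch below checks exactly that, and nothing more; the mart names, keys and attributes are hypothetical, and conformance in practice covers more than this single test.

```python
def nonconforming_keys(dimension_a, dimension_b):
    """Return the keys present in both dimension tables whose attributes differ."""
    shared_keys = dimension_a.keys() & dimension_b.keys()
    return sorted(k for k in shared_keys if dimension_a[k] != dimension_b[k])

# Hypothetical product dimension as held by two separate data marts.
sales_mart_products = {
    101: {"name": "Widget", "category": "Hardware"},
    102: {"name": "Gadget", "category": "Hardware"},
}
inventory_mart_products = {
    101: {"name": "Widget", "category": "Hardware"},
    102: {"name": "Gadget", "category": "Gadgets"},  # category disagrees
    103: {"name": "Sprocket", "category": "Hardware"},
}

conflicts = nonconforming_keys(sales_mart_products, inventory_mart_products)
print(conflicts)
```

Product 102 is reported because the two marts disagree on its category; product 103 is ignored because only one mart holds it. Combining facts from both marts by category would give inconsistent answers until such conflicts are resolved.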

Top-down design

Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise.[6] Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities.

Inmon states that the data warehouse is:

Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.

Time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time.

Non-volatile: Data in the data warehouse is never over-written or deleted - once committed, the data is static, read-only, and retained for future reporting.

Integrated: The data warehouse contains data from most or all of an organization's operational systems and this data is made consistent.
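The time-variant and non-volatile properties listed above can be illustrated with an append-only history. In this minimal sketch, the class name, the customer attributes and the "effective from" date are all assumptions made for the example; changes are recorded as new rows, and existing rows are never updated or deleted.

```python
from datetime import date

class CustomerHistory:
    """Append-only store: every change is a new row, nothing is overwritten."""

    def __init__(self):
        self._rows = []  # (customer_id, city, effective_from)

    def record(self, customer_id, city, effective_from):
        """Add a new version of the customer's city; earlier versions are kept."""
        self._rows.append((customer_id, city, effective_from))

    def as_of(self, customer_id, when):
        """Return the city that was in effect for the customer on the given date."""
        candidates = [
            row for row in self._rows
            if row[0] == customer_id and row[2] <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda row: row[2])[1]

    def row_count(self):
        return len(self._rows)

history = CustomerHistory()
history.record(1, "Boston", date(2008, 1, 1))
history.record(1, "Denver", date(2009, 2, 1))  # a move is a new row, not an update

print(history.as_of(1, date(2008, 6, 1)))
print(history.as_of(1, date(2009, 3, 1)))
```

Because the earlier row survives, a report can still show where the customer was in 2008 even though the operational system may only hold the current address.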

The top-down design methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage to the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the duration of time from the start of project to the point that end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.

Hybrid design

Over time it has become apparent to proponents of bottom-up and top-down data warehouse design that both methodologies have benefits and risks. Hybrid methodologies have evolved to take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.

Conforming information

Another important factor in designing a data warehouse is which data to conform and how to conform the data. For example, one operational system feeding data into the data warehouse may use "M" and "F" to denote the sex of an employee while another operational system may use "Male" and "Female".

Though this is a simple example, much of the work in implementing a data warehouse is devoted to making similar meaning data consistent when they are stored in the data warehouse.

Typically, extract, transform, load tools are used in this work.

Master Data Management has the aim of conforming data that could be considered "dimensions".
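The "M"/"F" versus "Male"/"Female" example above can be sketched as a small mapping step of the kind an ETL job might apply during the transform phase. The mapping table and function name are hypothetical; in practice such code mappings are often kept in a database table rather than in source code.

```python
# Hypothetical mapping from the codes used by different source systems
# to the single code stored in the warehouse.
SEX_CODE_MAP = {
    "m": "M",
    "male": "M",
    "f": "F",
    "female": "F",
}

def conform_sex(raw_value):
    """Translate a source-system sex code into the warehouse's conformed code."""
    key = raw_value.strip().lower()
    if key not in SEX_CODE_MAP:
        # Unknown codes are surfaced rather than guessed at.
        raise ValueError(f"unrecognized sex code: {raw_value!r}")
    return SEX_CODE_MAP[key]

source_a_rows = ["M", "F"]
source_b_rows = ["Male", " female "]
conformed = [conform_sex(value) for value in source_a_rows + source_b_rows]
print(conformed)
```

Raising an error on an unknown code is one possible design choice; a production job might instead route such rows to a reject file for later review.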

Normalized versus dimensional approach for storage of data

There are two leading approaches to storing data in a data warehouse - the dimensional approach and the normalized approach.

In the dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are:

1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and

2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.
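The split into facts and dimensions described above can be sketched as a tiny star schema using Python's built-in sqlite3 module. All table names, column names and rows are hypothetical, and only two of the dimensions mentioned in the text (customer and order date) are modelled.

```python
import sqlite3

star_conn = sqlite3.connect(":memory:")
star_conn.executescript(
    """
    -- Dimensions: reference information that gives context to the facts.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, order_date TEXT);

    -- Facts: numeric measurements, linked to the dimensions by key.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        price_paid REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO dim_date VALUES (20090301, '2009-03-01');
    INSERT INTO fact_sales VALUES
        (1, 20090301, 3, 30.0),
        (1, 20090301, 1, 12.5),
        (2, 20090301, 2, 20.0);
    """
)

# A typical query: aggregate the facts, grouped by a dimension attribute.
totals = star_conn.execute(
    """
    SELECT c.customer_name, SUM(f.quantity), SUM(f.price_paid)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON f.customer_key = c.customer_key
    GROUP BY c.customer_name
    ORDER BY c.customer_name
    """
).fetchall()
print(totals)
```

The query shape (one fact table joined to a few dimension tables) is what makes this layout comparatively easy for users to understand; the loading difficulty noted above comes from having to assign consistent dimension keys to data arriving from different source systems.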

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.).

The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to

1) join data from different sources into meaningful information and then

2) access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

These approaches are not mutually exclusive. Dimensional approaches can involve normalizing data to a degree.

Data warehouse architecture

Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture. The worthiness of the architecture can be judged in how the conceptualization aids in the building, maintenance, and usage of the data warehouse.

One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer

The source data for the data warehouse - An organization's ERP systems fall into this layer.

Data access layer

The interface between the operational and informational access layer - Tools to extract, transform, load data into the warehouse fall into this layer.

Metadata layer

The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

Informational access layer

The data accessed for reporting and analyzing and the tools for reporting and analyzing data - Business intelligence tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed elsewhere in this article, also have to do with this layer.

History of data warehousing

The concept of data warehousing dates back to the late 1980s [2] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow - mainly, the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data. The process of gathering, cleaning and integrating data from various sources, usually long-existing operational systems (often referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from the operational systems that were logically related to previously gathered data.

Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" that were tailored for ready access by users.

Key developments in the early years of data warehousing were:

1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[3]

1970s - ACNielsen and IRI provide dimensional data marts for retail sales.[3]

1983 - Teradata introduces a database management system specifically designed for decision support.

1988 - Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse".

1990 - Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.

1991 - Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.

1991 - Bill Inmon publishes the book Building the Data Warehouse.

1995 - The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.

1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.

1997 - Oracle 8, with support for star queries, is released.

Thursday, March 19, 2009

Data Warehousing: An Intro

A data warehouse is a repository of an organization's electronically stored data.
Data warehouses are designed to facilitate reporting and analysis.

This definition of the data warehouse focuses on data storage.
However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.
Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

In contrast to data warehouses, operational databases support day-to-day transaction processing.
