Sunday, May 17, 2009
Good practices for ETL
Rerunnability, recoverability in ETL
Parallel processing in ETL
Performance of ETL
Challenges in ETL
Real-life ETL cycle
Load Concept
Transform Concept
Extract Concepts
ETL (Extract, Transform, Load) Basics
Saturday, April 18, 2009
Cognos BI0-112 Sample Questions
Paper Pattern for BI0-112 Test for IBM- Cognos
Tuesday, April 7, 2009
Notable Uses of Data Mining
The process of data mining
Data Mining Background
Humans have been "manually" extracting information from data for centuries, but the increasing volume of data in modern times has called for more automatic approaches. As data sets and the information extracted from them have grown in size and complexity, direct hands-on data analysis has increasingly been supplemented and augmented with indirect, automatic data processing using more complex and sophisticated tools, methods and models. The proliferation, ubiquity and increasing power of computer technology have aided data collection, processing, management and storage. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the process of using computing power to apply methodologies, including new techniques for knowledge discovery, to data.
Data mining identifies trends within data that go beyond simple data analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of processes and target opportunities. However, shifting control and understanding of these processes from statisticians to poorly informed or uninformed users can produce false positives, no useful results, or, worst of all, results that are misleading or misinterpreted.
Although data mining is a relatively new term, the technology is not. For many years, businesses and governments have used increasingly powerful computers to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining). Continuous innovations in computer processing power, disk storage, data capture technology, algorithms, methodologies and analysis software have dramatically increased the accuracy and usefulness of the extracted information.
The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information about the characteristics of the collected data, using a number of techniques (e.g., association rule mining). Forecasting and predictive modeling provide predictions of future events, and the processes may range from the transparent (e.g., rule-based approaches) through to the opaque (e.g., neural networks).
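As a rough illustration of the association rule mining mentioned above, the following Python sketch computes the two usual rule measures, support and confidence, over a handful of transactions. The transactions and the function names are invented for this example; a real miner (for instance an Apriori implementation) would also search for candidate rules rather than score one that is given.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is unsupported
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Score the candidate rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in two of the three baskets that contain bread, which is the kind of explicit, inspectable statement that knowledge discovery aims for.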
Metadata (data about the characteristics of a data set) are often expressed in a condensed, data-minable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.
Data mining is usually performed on "real-world data". Such data are vulnerable to collinearity because of unknown and possibly unobserved interrelations. An unavoidable fact of data mining is that the (sub-)set of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships that exist across other parts of the domain. Alternative methods using experiment-based approaches, such as Choice Modelling for human-generated data, may be used to address this sort of issue. In these situations, inherent correlations can be either controlled for or removed altogether during the construction of the experimental design.
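One simple way to spot the collinearity described above is to compute pairwise Pearson correlations between attributes and flag pairs that are almost perfectly correlated. The sketch below does this in plain Python; the attribute names, the sample values, and the 0.95 threshold are assumptions made for illustration, not recommendations.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def collinear_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation exceeds threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        if abs(pearson(columns[a], columns[b])) > threshold:
            flagged.append((a, b))
    return flagged


# Hypothetical attributes: the same height recorded in two units, plus an unrelated one.
height_cm = [150.0, 160.0, 170.0, 180.0]
columns = {
    "height_cm": height_cm,
    "height_in": [h / 2.54 for h in height_cm],
    "visits": [3.0, 1.0, 4.0, 1.0],
}
print(collinear_pairs(columns))
```

Pairwise correlation only catches linear relationships between two attributes at a time; interrelations involving several attributes, or unobserved ones, need other diagnostics or the experiment-based designs mentioned in the text.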
There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like RapidMiner, Weka, KNIME, and the R Project have become an informal standard for defining data-mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that these can be shared between different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. The latest version of PMML, version 4.0, is scheduled to be released in early 2009.
Since the availability of affordable computer processing power in the last quarter of the 20th century, organizations have been accumulating vast and ever growing amounts of data, including, for example:
operational and transactional data, such as sales, cost, inventory, payroll and accounting data
nonoperational data, such as forecasts and macroeconomic data
metadata: data about the data itself, such as logical database design and data dictionary definitions
This article outlines the longitudinal changes of DMKD research activities during the last decade by surveying a large collection of Data Mining literature to provide a comprehensive picture of current DMKD research and classify these research activities into high-level categories.
Data Mining
Saturday, March 21, 2009
Benefits and Disadvantages of data warehouses
Evolution in organization use of data warehouses
Data warehouses versus operational systems
Top-down versus bottom-up design methodologies
Conforming information
Normalized versus dimensional approach for storage of data
Data warehouse architecture
One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:
Operational database layer
The source data for the data warehouse. An organization's ERP systems fall into this layer.
History of data warehousing
Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" that were tailored for ready access by users.
Key developments in early years of data warehousing were:
1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[3]
1970s - ACNielsen and IRI provide dimensional data marts for retail sales.[3]
1983 - Teradata introduces a database management system specifically designed for decision support.
1988 - Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse".
1990 - Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.
1991 - Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.
1991 - Bill Inmon publishes the book Building the Data Warehouse.
1995 - The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.
1997 - Oracle 8, with support for star queries, is released.
Thursday, March 19, 2009
Data Warehousing: An Intro
Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage.
However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.
Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
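To make the extract, transform, and load steps concrete, here is a minimal Python sketch using only the standard library. The CSV content, the `fact_orders` table, and the function names are hypothetical; an in-memory SQLite database stands in for the warehouse repository, and a real pipeline would add logging, rejected-row handling, and incremental loads.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
SOURCE_CSV = "order_id,amount,currency\n1, 10.50 ,usd\n2,abc,usd\n3,7.25,USD\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types, normalize casing, and skip rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # a non-numeric amount is dropped in this sketch
        clean.append((int(row["order_id"]), amount, row["currency"].strip().upper()))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    # INSERT OR REPLACE keeps the load rerunnable for the same order_id.
    conn.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
print(total)
```

Keeping the three stages as separate functions mirrors the way the text lists them as separate components, and it lets each stage be tested or replaced on its own.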
In contrast to data warehouses are operational databases that support day-to-day transaction processing.