BI Search engine

Tuesday, April 7, 2009

Data Mining Background

Humans have been "manually" extracting information from data for centuries, but the increasing volume of data in modern times has called for more automatic approaches. As data sets and the information extracted from them have grown in size and complexity, direct hands-on data analysis has increasingly been supplemented and augmented with indirect, automatic data processing using more complex and sophisticated tools, methods and models. The proliferation, ubiquity and increasing power of computer technology have aided data collection, processing, management and storage. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the process of using computing power to apply methodologies, including new techniques for knowledge discovery, to data.

Data mining identifies trends within data that go beyond simple data analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of processes and target opportunities. However, handing control and understanding of these processes from statisticians to poorly informed or uninformed users can produce false positives, no useful results, or, worst of all, results that are misleading or misinterpreted.

Although data mining is a relatively new term, the technology is not. For many years, businesses and governments have used increasingly powerful computers to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.) Continuous innovations in computer processing power, disk storage, data capture technology, algorithms, methodologies and analysis software have dramatically increased the accuracy and usefulness of the extracted information.

The term data mining is often applied to two separate processes: knowledge discovery and prediction. Knowledge discovery provides explicit information about the characteristics of the collected data, using a number of techniques (e.g., association rule mining). Forecasting and predictive modeling provide predictions of future events, and the processes may range from the transparent (e.g., rule-based approaches) through to the opaque (e.g., neural networks).
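As a rough illustration of the knowledge-discovery side, the sketch below scores simple one-to-one association rules by support and confidence. The transactions, the function names and the thresholds are all invented for this example; it is a brute-force toy over item pairs, not a full algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def pair_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for every item pair that
    clears both thresholds. The thresholds are arbitrary assumptions."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue
        # A rule is directional, so test both lhs -> rhs orientations.
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence({lhs}, {rhs}, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, pair_support, conf))
    return rules


rules = pair_rules(transactions)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```

The output of such a procedure is transparent in the sense used above: each rule can be read and checked directly against the data.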

Metadata (data about the characteristics of a data set) are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

Data mining is usually performed on "real-world data". Such data are vulnerable to collinearity because of unknown and possibly unobserved interrelations. An unavoidable fact of data mining is that the (sub-)set of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships that exist across other parts of the domain. Alternative methods using experiment-based approaches, such as Choice Modelling for human-generated data, may be used to address this sort of issue. In these situations, inherent correlations can be either controlled for or removed altogether during the construction of the experimental design.
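One simple way to look for collinearity in observed data is to compute pairwise correlations between input columns and flag pairs that move together almost perfectly. The sketch below does this with the Pearson coefficient; the column names, the sample values and the 0.9 cut-off are assumptions made for illustration, and a high pairwise correlation is only one symptom of collinearity.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold.
    The threshold is an arbitrary choice for this sketch."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical data: the second column is the first in different units,
# so the two carry the same information.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
columns = {
    "height_cm": height_cm,
    "height_in": [h / 2.54 for h in height_cm],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}

flagged = collinear_pairs(columns)
for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r={r:.3f})")
```

A designed experiment avoids the need for this kind of after-the-fact check, because the correlations between inputs are fixed when the design is built.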

There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like RapidMiner, Weka, KNIME, and the R Project have become an informal standard for defining data-mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that they can be shared between different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. The latest version of PMML, version 4.0, is scheduled to be released in early 2009.
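To give a feel for how an XML-based model format allows a model to move between tools, the sketch below reads a small hand-written PMML-style regression model with Python's standard library and applies it to one input row. The document, the field names and the coefficients are hypothetical, and the fragment has not been validated against the DMG schema, so it should be read as an approximation of the format rather than a reference example.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical model document in the style of PMML 4.0.
# It has NOT been validated against the official schema.
PMML_DOC = """<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="Hypothetical example model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="ad_spend" optype="continuous" dataType="double"/>
    <DataField name="sales" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy_sales" functionName="regression">
    <MiningSchema>
      <MiningField name="ad_spend"/>
      <MiningField name="sales" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="ad_spend" coefficient="2.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Namespace prefix mapping used in the element paths below.
NS = {"pmml": "http://www.dmg.org/PMML-4_0"}


def load_linear_model(pmml_text):
    """Extract the intercept and per-field coefficients from the document."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, row):
    """Apply the linear model: intercept + sum(coefficient * value)."""
    intercept, coefficients = model
    return intercept + sum(c * row[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
prediction = predict(model, {"ad_spend": 4.0})
print(prediction)
```

Because the model is plain data rather than code, any application that understands the same elements could score new records with it, which is the point of a shared exchange format.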

Since affordable computer processing power became available in the last quarter of the 20th century, organizations have been accumulating vast and ever-growing amounts of data, including, for example:

operational and transactional data, such as sales, cost, inventory, payroll and accounting data

nonoperational data, such as forecasts and macroeconomic data

metadata: data about the data itself, such as logical database design and data dictionary definitions

This article outlines the longitudinal changes in data mining and knowledge discovery (DMKD) research activities during the last decade. It surveys a large collection of data mining literature to provide a comprehensive picture of current DMKD research and classifies these research activities into high-level categories.
