INDUCTIVE DATABASES
AND KNOWLEDGE SCOUTS

Ryszard S. Michalski
Kenneth A. Kaufman
Machine Learning and Inference Laboratory
George Mason University

Contact Information

Ryszard S. Michalski
Kenneth A. Kaufman

Machine Learning and Inference Laboratory, MS 5C3
George Mason University
4400 University Dr.
Fairfax, VA 22030-4444

Phone: (703) 993-1558 or 764-9142 (Michalski), (703) 993-1709 (Kaufman)
Fax: (703) 993-3729
Email: {michalski,kaufman}@gmu.edu
Homepages: http://www.mli.gmu.edu/michalski
                     http://www.mli.gmu.edu/~kaufman

WWW Page

http://www.mli.gmu.edu/projects/idb.html

Project Award Information

Keywords

Data mining, knowledge discovery, inductive databases, knowledge scouts, multistrategy learning, intelligent agents.

Project Summary

The objectives of this research are to develop, implement, and test a methodology for building inductive databases, which extend conventional databases by integrating inductive inference capabilities. These capabilities allow a database to answer queries that require synthesizing plausible knowledge. Such knowledge is not directly or deductively obtainable from the database, but can be hypothesized through inductive inference. To implement these capabilities, a new type of operator is being developed, based on advanced methods for symbolic inductive inference. These operators, together with conventional database operators, are integrated into a knowledge generation language (KGL). Using KGL, a user can define knowledge scouts, which are specialized software agents that operate autonomously on one or more databases. Their function is to automatically synthesize and manage knowledge tailored to the specific needs of the user.

Publications and Products

The results of our early research in this direction, and closely related work, have been reported in the journal articles, book chapters, and conference papers listed on the project home page (http://www.mli.gmu.edu/projects/idb.html). A selection of these publications is listed in the Project References.

These results include the development of an experimental system, INLEN, that integrates a rule-learning system, AQ15, with deductive inference operators and a custom-built database. The aims of INLEN are to provide a personal pattern discovery and decision support system, and to serve as an educational aid for teaching principles of data mining and building advisory systems.

Project Impact

The NSF grant is impacting both graduate research and course development. It currently supports one full-time doctoral student, Guido Cervone, and one postdoc (partially), and will support another full-time doctoral student in the near future. This research has also attracted one part-time Ph.D. student, Nawal Alkharouf, who has started research on the development and application of inductive databases to microarray gene expression.

This project has also motivated us to introduce a new Ph.D. concentration, "Computational Intelligence and Knowledge Mining," in which the topic of inductive databases plays a major role (see http://www.mli.gmu.edu/cikm.html). The new concentration has already been approved by the faculty of the School of Computational Sciences, and is now being offered to interested students. As one component of this concentration area, the PI has introduced a new course entitled "Data Mining and Knowledge Discovery." Students taking this course are offered, among other options, class projects on topics related to inductive databases.

This project has also attracted researchers at the Institute of Computer Science at the Polish Academy of Sciences, who recently organized a special research group dedicated to collaborating with us on this project.

In addition to enhancing our understanding of automated knowledge discovery and developing prototype tools for it, this project may impact other areas of science. For example, our early experiments indicate that the methodology being developed may help to partially automate the process of pattern discovery in medical databases, specifically to identify the cause/effect relationships between lifestyles and health.

Goals, Objectives, and Targeted Activities

This project is concerned with the development, implementation, and testing of a methodology for building inductive databases. Inductive databases extend conventional databases by integrating into them inductive inference capabilities that allow a database to answer queries that require synthesizing plausible knowledge. Such knowledge is neither directly nor deductively derivable from the database, but can be generated by inductive inference from facts in the database and prior knowledge in the associated knowledge base. This plausible knowledge may take the form of generalized data summaries, likely consequences of the data, hypotheses about yet unseen data items, global qualitative and/or quantitative regularities, exceptions to hypothesized patterns, suspected errors and implied inconsistencies, recommended action plans, etc.

To this end, we are developing a new type of database operator, called a knowledge generation operator (KGO), and integrating such operators within a database language. A KGO takes a selection of data from the database, and possibly some prior knowledge or constraints represented in an associated knowledge base, and generates new knowledge. Knowledge generation operators (together with conventional database operators and deductive inference operators) are invoked through a knowledge generation language (KGL) that is used to define knowledge scouts. Knowledge scouts take the form of KGL scripts that guide the process of deriving knowledge of interest to a given user or class of users. A script includes a plan of operations to be performed on a database (or multiple databases), and a target knowledge specification, which abstractly characterizes the knowledge of interest to the user. A script can be a program executed upon a user's request, or a "live" software agent that runs continuously in the background and outputs its findings whenever an alert-user criterion is satisfied.
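The relationship between a knowledge generation operator and a knowledge scout can be illustrated with a minimal sketch in Python. This is not KGL and not the AQ algorithm: the names Rule, induce_rules, and KnowledgeScout, the toy records, and the rule-finding heuristic are all hypothetical, chosen only to show the shape of the idea (select data, apply an inductive operator, keep what matches the target knowledge specification). Prior knowledge from a knowledge base is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, NamedTuple

Row = Dict[str, str]


class Rule(NamedTuple):
    """Hypothesized pattern: attribute = value  =>  target = predicted."""
    attribute: str
    value: str
    predicted: str
    support: int  # number of rows that agree with the rule


def induce_rules(rows: List[Row], target: str, min_support: int = 2) -> List[Rule]:
    """Toy stand-in for a KGO (an assumption, not AQ): propose a rule for
    every attribute value that co-occurs with exactly one target class."""
    seen: Dict[tuple, Dict[str, int]] = {}
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            counts = seen.setdefault((attr, value), {})
            counts[row[target]] = counts.get(row[target], 0) + 1
    rules = []
    for (attr, value), counts in seen.items():
        if len(counts) == 1:  # value never seen with a second class
            (cls, n), = counts.items()
            if n >= min_support:
                rules.append(Rule(attr, value, cls, n))
    return rules


@dataclass
class KnowledgeScout:
    """Hypothetical scout: a plan (select, then operate) plus a target
    knowledge specification that filters the generated knowledge."""
    select: Callable[[List[Row]], List[Row]]
    operator: Callable[[List[Row]], List[Rule]]
    is_of_interest: Callable[[Rule], bool]

    def run(self, database: List[Row]) -> List[Rule]:
        candidates = self.operator(self.select(database))
        return [rule for rule in candidates if self.is_of_interest(rule)]


# Invented toy records, for illustration only.
RECORDS = [
    {"smoker": "yes", "exercise": "low", "condition": "present"},
    {"smoker": "yes", "exercise": "high", "condition": "present"},
    {"smoker": "no", "exercise": "low", "condition": "absent"},
    {"smoker": "no", "exercise": "high", "condition": "absent"},
    {"smoker": "no", "exercise": "high", "condition": "absent"},
]

scout = KnowledgeScout(
    select=lambda db: db,  # plan step 1: use every record
    operator=lambda rows: induce_rules(rows, target="condition"),
    is_of_interest=lambda rule: rule.predicted == "present",
)

for rule in scout.run(RECORDS):
    print(f"{rule.attribute} = {rule.value} => condition = {rule.predicted} "
          f"(support {rule.support})")
# prints: smoker = yes => condition = present (support 2)
```

A "live" scout in the sense described above would call run repeatedly as the database changes and report only when is_of_interest (playing the role of the alert-user criterion here) accepts a newly generated rule.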

In order to develop and apply such knowledge scouts, we are developing two knowledge generation languages, KGL2 and KQL. KGL2 is an extension of KGL1, our initial knowledge generation language, implemented in INLEN, an early system that integrated selected machine learning and inference methods with a simple database. KQL (Knowledge Query Language) is a new language that will be built directly into an SQL-integrated environment. The research plan for this year is concerned with the design of KGL2 and KQL, and the implementation of the central inductive inference operator based on the AQ20 natural induction and pattern discovery program.

The outcomes of this research will include a methodology for building inductive databases, and a prototype inductive database (IDB) system. IDB will integrate a standard database language (SQL) with a knowledge generation language, and will work with a widely available relational database system (ORACLE and/or Access). The system will seamlessly support all standard database operations, as well as several novel operators for knowledge discovery, manipulation, inference, and visualization.

Project References

1. Michalski, R.S. and Kaufman, K., "Learning Patterns in Noisy Data: The AQ Approach," in Paliouras, G., Karkaletsis, V. and Spyropoulos (Eds.), Machine Learning and Applications, Springer-Verlag, 2001 (to appear).
2. Michalski, R.S. and Kaufman, K.A., "The AQ19 System for Machine Learning and Pattern Discovery: A General Description and User's Guide," Reports of the Machine Learning and Inference Laboratory, MLI 01-2, George Mason University, Fairfax, VA, 2001.
3. Kaufman, K.A. and Michalski, R.S., "A Knowledge Scout for Discovering Medical Patterns: Methodology and System SCAMP," Proceedings of the Fourth International Conference on Flexible Query Answering Systems, FQAS'2000, Warsaw, Poland, pp. 485-496, October 25-28, 2000.
4. Michalski, R.S. and Kaufman, K., "Building Knowledge Scouts Using KGL Metalanguage," Fundamenta Informaticae, Vol. 40, pp. 433-447, 2000.
5. Michalski, R.S. and Kaufman, K.A., "A Measure of Description Quality for Data Mining and its Implementation in the AQ18 Learning System," Proceedings of the ICSC Congress on Computational Intelligence Methods and Applications (CIMA-99), Rochester, NY, pp. 369-375, June 1999.
6. Michalski, R.S. and Kaufman, K.A., "Discovering Multidimensional Patterns in Large Datasets Using Knowledge Scouts," Reports of the Machine Learning and Inference Laboratory, MLI 99-7, George Mason University, Fairfax, VA, June 1999.
7. Kaufman, K.A. and Michalski, R.S., "Learning from Inconsistent and Noisy Data: The AQ18 Approach," Proceedings of the Eleventh International Symposium on Methodologies for Intelligent Systems, Warsaw, pp. 411-419, June 8-11, 1999.
8. Kaufman, K.A. and Michalski, R.S., "Multistrategy Data Mining via the KGL Metalanguage," Proceedings of the Seventh Symposium on Intelligent Information Systems (IIS'98), Malbork, Poland, pp. 39-48, June 15-19, 1998.
9. Kaufman, K.A. and Michalski, R.S., "Discovery Planning: Multistrategy Learning in Data Mining," Proceedings of the Fourth International Workshop on Multistrategy Learning (MSL'98), Desenzano del Garda, Italy, June 11-13, 1998.
10. Michalski, R.S. and Kaufman, K.A., "Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach," in Michalski, R.S., Bratko, I. and Kubat, M. (Eds.), Machine Learning and Data Mining: Methods and Applications, London: John Wiley & Sons, pp. 71-112, 1998.
11. Michalski, R.S., "Seeking Knowledge in the Deluge of Facts," Fundamenta Informaticae, Vol. 30, pp. 283-297, 1997.
12. Kaufman, K.A. and Michalski, R.S., "KGL: A Language for Learning," Reports of the Machine Learning and Inference Laboratory, MLI 97-3, George Mason University, Fairfax, VA, 1997.
13. Kaufman, K. and Michalski, R.S., "A Method for Reasoning with Structured and Continuous Attributes in the INLEN-2 Knowledge Discovery System," Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, pp. 232-237, August 1996.
14. Michalski, R.S., Kerschberg, L., Kaufman, K.A. and Ribeiro, J.S., "Mining For Knowledge in Databases: The INLEN Architecture, Initial Implementation and First Results," Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies, Vol. 1, No. 1, pp. 85-113, August 1992.
15. Michalski, R.S., "Searching for Knowledge in a World Flooded with Facts," an invited talk, Proceedings of the Fifth International Symposium on Applied Stochastic Models and Data Analysis, Granada, Spain, April 23-26, 1991.

Area Background

The field of databases is in the midst of extraordinary growth, and databases are becoming omnipresent and globally connected. In this context, a question arises as to which new scientific directions may improve current database methodologies and open new possibilities for database applications. We believe that inductive databases constitute one of the most important of these new directions.

Area References

Potential Related Projects

There are several potentially related projects, such as the integration of databases with knowledge bases, automated learning of user interests, knowledge visualization (as opposed to data visualization), and applications of inductive databases to a wide range of practical domains.