Ryszard S. Michalski, PI
Kenneth A. Kaufman, Co-PI
Machine Learning and Inference Laboratory, MS 5C3
4400
Phone: (703) 993-1558 or 764-9142 (Michalski), (703) 993-1709 (Kaufman)
Fax: (703) 993-3729
Email: {michalski,kaufman}@gmu.edu
Homepages: http://www.mli.gmu.edu/michalski
http://www.mli.gmu.edu/~kaufman
http://www.mli.gmu.edu/projects/idb.html
Guido Cervone, Doctoral Student
Michal Draminski, Visiting Doctoral Student
Data mining, knowledge discovery, knowledge mining, inductive databases, knowledge scouts, multistrategy learning, intelligent agents.
The objectives of this research are to develop, implement, and test a methodology for building inductive databases, which extend conventional databases by integrating a wide range of inductive inference capabilities. These capabilities allow a database to answer queries that require hypothesizing plausible knowledge that is not directly or deductively obtainable from the database. This knowledge may represent probable answers to questions, strong patterns found in data, hypotheses about relationships, decision rules or equations derived from data, statistical relationships, summaries, or classifications. It is postulated that the synthesized knowledge be presented to the user in a form that is easy to interpret and/or visualize. To implement such capabilities, a new type of operator, called a knowledge generation operator, is being developed, based on advanced methods for symbolic inductive inference. These operators, together with conventional database operators, are integrated into a knowledge query language (KQL). Using KQL, a user can pose direct questions to the system, or define knowledge scouts that autonomously operate on one or more databases in search of target knowledge. The function of a knowledge scout is to automatically synthesize and manage knowledge that is tailored to the specific needs of the user.
The results of our early research in this direction, and closely related work, have been reported in the journal articles, book chapters, and conference papers listed on the project home page (http://www.mli.gmu.edu/projects/idb.html). A selection of these publications is listed in the Project References.
Current research is concerned with the development of a methodology and the implementation of an experimental system, VINLEN, that will seamlessly integrate advanced inductive learning capabilities with an SQL-accessible database. A user can invoke VINLEN's capabilities through a sophisticated visual interface and through the knowledge query language (KQL), which integrates SQL with knowledge generation operators. The aim of the VINLEN project is to provide the user with an integrated system for pattern discovery, knowledge mining, inference, and decision support. VINLEN will also serve as an educational aid for teaching the principles of data mining and for developing modern advisory systems.
The NSF grant is influencing both graduate research and course development. It currently supports one full-time doctoral student (Guido Cervone), one research scientist, one postdoctoral scientist, and two part-time Ph.D. students. We are in the process of hiring another full-time doctoral student.
This project has also motivated us to introduce a new Ph.D. concentration, "Computational Intelligence and Knowledge Mining," for which research and development of advanced inductive databases is one of the major topics (see http://www.mli.gmu.edu/cikm.html).
The new concentration has been approved by the faculty of the
To serve the needs of students in this concentration area, the PI has developed two new courses, "Data Mining and Knowledge Discovery" and "Principles of Knowledge Mining." The class projects offered to students in these courses include topics related to inductive databases. Recently, the PI has developed one more course related to this area, "Computational Learning and Discovery," which is currently under review by the Curriculum Committee.
This project has also attracted researchers at the
In addition to enhancing our understanding of automated knowledge discovery and developing prototype tools for it, this project may impact other areas of science. For example, our early experiments indicated that the methodology being developed may help to partially automate pattern discovery in medical databases (specifically, identifying cause/effect relationships between lifestyles and health) and in earth sciences. The latter research is done in collaboration with the GMU Center for Earth Observing and Space Research, directed by Dr. Menas Kafatos. We are also exploring the prospects of applying this technology in the social sciences.
This project is concerned with the development, implementation, and testing of a methodology for building inductive databases. Inductive databases extend conventional databases by integrating into them inductive inference capabilities that allow a database to answer queries that require synthesizing plausible knowledge. Such knowledge is neither directly nor deductively derivable from the database, but can be generated by inductive inference from facts in the database and prior knowledge in the associated knowledge base. This plausible knowledge may take the form of generalized data summaries, likely consequences of the data, hypotheses about yet unseen data items, global qualitative and/or quantitative regularities, exceptions to hypothesized patterns, suspected errors and implied inconsistencies, recommended action plans, etc.
To this end, we are developing a new type of database operator, called a knowledge generation operator (KGO), and integrating such operators within a database language. A KGO takes a selection of data from the database, and possibly some prior knowledge or constraints represented in an associated knowledge base, and generates new knowledge. Knowledge generation operators (plus conventional database operators and deductive inference operators) are invoked through a knowledge query language (KQL) that is used to define knowledge scouts. Knowledge scouts take the form of KQL scripts that guide the process of deriving knowledge of interest to a given user, or a class of users. A script includes a plan of operations to be performed on a database (or multiple databases), and a target knowledge specification, which abstractly characterizes the knowledge of interest to the user. A script can be a program to be executed upon a user's request, or a "live" software agent that runs continuously in the background, and outputs its findings whenever an alert-user criterion is satisfied.
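The structure of a knowledge scout described above (a plan of operations plus a target knowledge specification) can be illustrated with a minimal Python sketch. This is not KQL and not one of the project's operators: the names `Rule`, `single_condition_kgo`, and `KnowledgeScout`, the toy rule-counting method, and the six invented patient records are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]  # one database record (hypothetical representation)


@dataclass(frozen=True)
class Rule:
    attribute: str
    value: str
    target: str
    support: int        # records matching both the condition and the target
    confidence: float   # support / records matching the condition


def single_condition_kgo(rows: List[Row], target_attr: str, target_value: str) -> List[Rule]:
    """Toy stand-in for a KGO: propose rules `attr = value => target_attr = target_value`."""
    covered: Counter = Counter()
    correct: Counter = Counter()
    for row in rows:
        for attr, value in row.items():
            if attr == target_attr:
                continue
            covered[(attr, value)] += 1
            if row[target_attr] == target_value:
                correct[(attr, value)] += 1
    target = f"{target_attr}={target_value}"
    return [
        Rule(attr, value, target, correct[(attr, value)], correct[(attr, value)] / n)
        for (attr, value), n in covered.items()
        if correct[(attr, value)] > 0
    ]


@dataclass
class KnowledgeScout:
    select: Callable[[List[Row]], List[Row]]      # plan step 1: choose the data
    generate: Callable[[List[Row]], List[Rule]]   # plan step 2: apply a KGO
    is_of_interest: Callable[[Rule], bool]        # target knowledge specification

    def run(self, database: List[Row]) -> List[Rule]:
        candidates = self.generate(self.select(database))
        return [rule for rule in candidates if self.is_of_interest(rule)]


# Invented records, used only to exercise the sketch.
database = [
    {"smoker": "yes", "exercise": "low", "condition": "hypertension"},
    {"smoker": "yes", "exercise": "low", "condition": "hypertension"},
    {"smoker": "yes", "exercise": "high", "condition": "none"},
    {"smoker": "no", "exercise": "high", "condition": "none"},
    {"smoker": "no", "exercise": "low", "condition": "hypertension"},
    {"smoker": "no", "exercise": "high", "condition": "none"},
]

scout = KnowledgeScout(
    select=lambda db: db,
    generate=lambda rows: single_condition_kgo(rows, "condition", "hypertension"),
    is_of_interest=lambda r: r.support >= 2 and r.confidence >= 0.9,
)
findings = scout.run(database)
for rule in findings:
    print(f"{rule.attribute}={rule.value} => {rule.target} "
          f"(support {rule.support}, confidence {rule.confidence:.2f})")
```

In this sketch the `is_of_interest` predicate plays the role of the target knowledge specification; a "live" scout could call `run` periodically and report only when the returned list is non-empty, which would correspond to an alert-user criterion.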
To develop and apply such knowledge scouts, we are developing the knowledge query language KQL. KQL is in some ways an extension of KGL1, our initial knowledge generation language, which was implemented in INLEN, an early system that integrated selected machine learning and inference methods with a simple database. Unlike KGL1, KQL is being built directly into an SQL-integrated environment. Thus, all SQL functions are accessible through KQL, and the various KGOs can be invoked using SQL-style queries. The research plan for this year is concerned with the full implementation of KQL, and the implementation of inductive and statistical inference operators to complement the already implemented operator based on the AQ20 natural induction and pattern discovery program.
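The idea of applying a knowledge generation operator to the result of an SQL selection can be sketched as follows. KQL syntax is not shown in this summary, so the sketch does not attempt to reproduce it: it uses Python's standard `sqlite3` module as a stand-in for an SQL-accessible database, and the `characterize` function, the `patients` table, and its contents are hypothetical. The toy operator simply generalizes the selected rows into the set of values observed for each column.

```python
import sqlite3
from typing import Dict, Sequence, Set


def characterize(conn: sqlite3.Connection, sql: str, params: Sequence = ()) -> Dict[str, Set]:
    """Toy operator: summarize the rows returned by an SQL query as the
    set of values seen in each selected column."""
    cursor = conn.execute(sql, params)
    columns = [column[0] for column in cursor.description]
    summary: Dict[str, Set] = {name: set() for name in columns}
    for row in cursor:
        for name, value in zip(columns, row):
            summary[name].add(value)
    return summary


# In-memory database with invented records, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (smoker TEXT, exercise TEXT, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [
        ("yes", "low", "hypertension"),
        ("yes", "low", "hypertension"),
        ("yes", "high", "none"),
        ("no", "high", "none"),
        ("no", "low", "hypertension"),
        ("no", "high", "none"),
    ],
)

# The SQL query selects the data; the operator generalizes over it.
description = characterize(
    conn,
    "SELECT smoker, exercise FROM patients WHERE diagnosis = ?",
    ("hypertension",),
)
for column, values in description.items():
    print(column, sorted(values))
```

The design point is only that data selection stays in ordinary SQL while the inductive step is a separate operator applied to the selected rows; an actual KGO would use a far richer inference method than this value-set summary.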
Outcomes of this research will include a methodology for building inductive databases, and a prototype inductive database (IDB) system. IDB will integrate a standard database language (SQL) with a knowledge query language, and will work with several widely available relational database systems (ORACLE, Access, and Paradox are currently supported). The system will seamlessly support all standard database operations, as well as several novel operators for knowledge discovery, manipulation, inference, and visualization.
1. Kaufman, K. and Michalski, R.S., "The Development of VINLEN: A System Integrating Database, Knowledge Base, and Inductive Learning Capabilities," in preparation, 2002.
2. Michalski, R.S. and Kaufman, K., "Learning Patterns in Noisy Data: The AQ Approach," in Paliouras, G., Karkaletsis, V. and Spyropoulos (Eds.), Machine Learning and Applications, Springer-Verlag, 2002 (to appear).
3. Michalski, R.S. and Kaufman, K.A., "The AQ19 System for Machine Learning and Pattern Discovery: A General Description and User's Guide," Reports of the Machine Learning and Inference Laboratory, MLI 01-2, George Mason University, Fairfax, VA, 2001.
4. Kaufman, K.A. and Michalski, R.S., "A Knowledge Scout for Discovering Medical Patterns: Methodology and System SCAMP," Proceedings of the Fourth International Conference on Flexible Query Answering Systems, FQAS'2000, Warsaw, Poland, pp. 485-496, October 25-28, 2000.
5. Michalski, R.S. and Kaufman, K., "Building Knowledge Scouts Using KGL Metalanguage," Fundamenta Informaticae, Vol. 40, pp. 433-447, 2000.
6. Michalski, R.S. and Kaufman, K.A., "A Measure of Description Quality for Data Mining and its Implementation in the AQ18 Learning System," Proceedings of the ICSC Congress on Computational Intelligence Methods and Applications (CIMA-99),
7. Michalski, R.S. and Kaufman, K.A., "Discovering Multidimensional Patterns in Large Datasets Using Knowledge Scouts," Reports of the Machine Learning and Inference Laboratory, MLI 99-7, George Mason University, Fairfax, VA, June 1999.
8. Kaufman, K.A. and Michalski, R.S., "Learning from Inconsistent and Noisy Data: The AQ18 Approach," Proceedings of the Eleventh International Symposium on Methodologies for Intelligent Systems,
9. Kaufman, K.A. and Michalski, R.S., "Multistrategy Data Mining via the KGL Metalanguage," Proceedings of the Seventh Symposium on Intelligent Information Systems (IIS'98), Malbork, Poland, pp. 39-48, June 15-19, 1998.
10. Kaufman, K.A. and Michalski, R.S., "Discovery Planning: Multistrategy Learning in Data Mining," Proceedings of the Fourth International Workshop on Multistrategy Learning (MSL'98), Desenzano del Garda, Italy, June 11-13, 1998.
11. Michalski, R.S. and Kaufman, K.A., "Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach," in Michalski, R.S., Bratko, I. and Kubat, M. (Eds.), Machine Learning and Data Mining: Methods and Applications, London: John Wiley & Sons, pp. 71-112, 1998.
12. Michalski, R.S., "Seeking Knowledge in the Deluge of Facts," Fundamenta Informaticae, Vol. 30, pp. 283-297, 1997.
13. Kaufman, K.A. and Michalski, R.S., "KGL: A Language for Learning," Reports of the Machine Learning and Inference Laboratory, MLI 97-3, George Mason University, Fairfax, VA, 1997.
14. Kaufman, K. and Michalski, R.S., "A Method for Reasoning with Structured and Continuous Attributes in the INLEN-2 Knowledge Discovery System," Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, pp. 232-237, August 1996.
15. Michalski, R.S., Kerschberg, L., Kaufman, K.A. and Ribeiro, J.S., "Mining For Knowledge in Databases: The INLEN Architecture, Initial Implementation and First Results," Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies, Vol. 1, No. 1, pp. 85-113, August 1992.
16. Michalski, R.S., "Searching for Knowledge in a World Flooded with Facts," an invited talk, Proceedings of the Fifth International Symposium on Applied Stochastic Models and Data Analysis, Granada, Spain, April 23-26, 1991.
The field of databases is in the midst of extraordinary growth, and databases are becoming omnipresent and globally connected. In this context, a question arises as to what new scientific directions may lead to improvements in current database methodologies and open new possibilities for database applications. We strongly believe that inductive databases constitute one of the most important of these new directions.
There are several potential related projects, such as integration of databases with knowledge bases, automated learning of user interests, knowledge visualization (as opposed to data visualization), and applications of inductive databases to a wide range of practical domains.