Machine Learning and Inference Laboratory, MS 5C3
George Mason University
4400 University Dr.
Fairfax, VA 22030-4444
Phone: (703) 993-1558 or 764-9142 (Michalski), (703) 993-1709 (Kaufman)
Fax: (703) 993-3729
Email: {michalski,kaufman}@gmu.edu
Homepages: http://www.mli.gmu.edu/michalski
http://www.mli.gmu.edu/~kaufman
http://www.mli.gmu.edu/projects/idb.html
Data mining, knowledge discovery, inductive databases, knowledge scouts, multistrategy learning, intelligent agents.
The objectives of this research are to develop, implement, and test a methodology for building inductive databases, which extend conventional databases by integrating inductive inference capabilities. These capabilities allow a database to answer queries that require synthesizing plausible knowledge. Such knowledge is not directly or deductively obtainable from the database, but can be hypothesized through inductive inference. To implement these capabilities, a new type of operator is being developed, based on advanced methods for symbolic inductive inference. These operators, together with conventional database operators, are integrated into a knowledge generation language (KGL). Using KGL, a user can define knowledge scouts, which are specialized software agents that operate autonomously on one or more databases. Their function is to automatically synthesize and manage knowledge tailored to the specific needs of the user.
The results of our early research in this direction, and closely related work, have been reported in the journal articles, book chapters, and conference papers listed on the project home page (http://www.mli.gmu.edu/projects/idb.html). A selection of these publications is listed in the Project References.
These results include the development of an experimental system, INLEN, that integrates a rule-learning system, AQ15, with deductive inference operators and a custom-built database. The aims of INLEN are to provide a personal pattern discovery and decision support system, and to serve as an educational aid for teaching principles of data mining and building advisory systems.
The NSF grant is impacting both graduate research and course development. It currently supports one full-time doctoral student, Guido Cervone, and partially supports one postdoc; it will support another full-time doctoral student in the near future. This research has also attracted one part-time Ph.D. student, Nawal Alkharouf, who has started research on the development and application of inductive databases to microarray gene expression.
This project has also motivated us to introduce a new Ph.D. concentration, "Computational Intelligence and Knowledge Mining," in which the topic of inductive databases plays a major role (see http://www.mli.gmu.edu/cikm.html). The new concentration has already been approved by the faculty of the School of Computational Sciences, and is now being offered to interested students. As one component of this concentration, the PI has introduced a new course entitled "Data Mining and Knowledge Discovery". Students taking this course can choose class projects on topics related to inductive databases, among other topics.
This project has also attracted researchers at the Institute of Computer Science at the Polish Academy of Sciences, who recently organized a special research group dedicated to collaborating with us on this project.
In addition to enhancing our understanding of automated knowledge discovery and developing prototype tools for it, this project may impact other areas of science. For example, our early experiments indicate that the methodology being developed may help to partially automate the process of pattern discovery in medical databases, specifically to identify the cause/effect relationships between lifestyles and health.
This project is concerned with the development, implementation, and testing of a methodology for building inductive databases. Inductive databases extend conventional databases by integrating into them inductive inference capabilities that allow a database to answer queries that require synthesizing plausible knowledge. Such knowledge is neither directly nor deductively derivable from the database, but can be generated by inductive inference from facts in the database and prior knowledge in the associated knowledge base. This plausible knowledge may take the form of generalized data summaries, likely consequences of the data, hypotheses about as yet unseen data items, global qualitative and/or quantitative regularities, exceptions to hypothesized patterns, suspected errors and implied inconsistencies, recommended action plans, etc.
To this end, we are developing a new type of database operator, called a knowledge generation operator (KGO), and integrating such operators within a database language. A KGO takes a selection of data from the database, and possibly some prior knowledge or constraints represented in an associated knowledge base, and generates new knowledge. Knowledge generation operators (plus conventional database operators and deductive inference operators) are invoked through a knowledge generation language (KGL) that is used to define knowledge scouts. Knowledge scouts take the form of KGL scripts that guide the process of deriving knowledge of interest to a given user, or a class of users. A script includes a plan of operations to be performed on a database (or multiple databases), and a target knowledge specification, which abstractly characterizes the knowledge of interest to the user. A script can be a program executed upon a user's request, or a "live" software agent that runs continuously in the background and outputs its findings whenever an alert-user criterion is satisfied.
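The relationship between a KGO, a plan of operations, and a target knowledge specification can be illustrated with a minimal Python sketch. It is not KGL and does not reproduce the AQ programs: the names Rule, toy_rule_operator, and KnowledgeScout are hypothetical, the rule learner is a deliberately trivial stand-in (one single-condition rule per attribute value), prior knowledge is omitted, and the sample records are invented for illustration only.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    """A hypothesized pattern: IF attribute = value THEN target = outcome."""
    attribute: str
    value: str
    target: str
    outcome: str
    support: int       # number of selected rows matching the condition
    confidence: float  # fraction of those rows that have the outcome


def toy_rule_operator(rows: List[Row], target: str) -> List[Rule]:
    """Hypothetical KGO: a toy stand-in for an inductive operator (not AQ15/AQ20)."""
    counts = defaultdict(Counter)  # (attribute, value) -> Counter of target outcomes
    for row in rows:
        for attribute, value in row.items():
            if attribute != target:
                counts[(attribute, value)][row[target]] += 1
    rules = []
    for (attribute, value), outcomes in counts.items():
        outcome, hits = outcomes.most_common(1)[0]
        support = sum(outcomes.values())
        rules.append(Rule(attribute, value, target, outcome, support, hits / support))
    return rules


@dataclass
class KnowledgeScout:
    """Hypothetical scout: a plan (select, then induce) plus a target knowledge filter."""
    select: Callable[[Row], bool]            # conventional database selection
    target: str                              # attribute the rules should predict
    is_of_interest: Callable[[Rule], bool]   # target knowledge specification

    def run(self, database: List[Row]) -> List[Rule]:
        selected = [row for row in database if self.select(row)]
        candidates = toy_rule_operator(selected, self.target)
        return [rule for rule in candidates if self.is_of_interest(rule)]


# Invented sample records, for illustration only.
database = [
    {"smoker": "yes", "exercise": "low", "condition": "poor"},
    {"smoker": "yes", "exercise": "high", "condition": "poor"},
    {"smoker": "no", "exercise": "high", "condition": "good"},
    {"smoker": "no", "exercise": "low", "condition": "good"},
    {"smoker": "no", "exercise": "high", "condition": "good"},
]

scout = KnowledgeScout(
    select=lambda row: True,
    target="condition",
    is_of_interest=lambda rule: rule.support >= 2 and rule.confidence >= 0.9,
)
findings = scout.run(database)
for rule in findings:
    print(f"IF {rule.attribute} = {rule.value} THEN {rule.target} = {rule.outcome}")
```

In this sketch the scout runs once on request. A "live" scout of the kind described above would presumably rerun the same plan as the database changes and report only when its alert criterion is met; that loop is left out here.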
In order to develop and apply such knowledge scouts, we are developing two knowledge generation languages, KGL2 and KQL. KGL2 is an extension of KGL1, our initial knowledge generation language, implemented in INLEN, an early system that integrated selected machine learning and inference methods with a simple database. KQL (Knowledge Query Language) is a new language that will be built directly into an SQL-integrated environment. The research plan for this year is concerned with the design of KGL2 and KQL, and the implementation of the central inductive inference operator based on the AQ20 natural induction and pattern discovery program.
Outcomes of this research will include a methodology for building inductive databases, and a prototype inductive database (IDB) system. IDB will integrate a standard database language (SQL) with a knowledge generation language, and will work with a widely available relational database system (ORACLE and/or Access). The system will seamlessly support all standard database operations, as well as several novel operators for knowledge discovery, manipulation, inference, and visualization.
The field of databases is in the midst of extraordinary growth, and databases are becoming omnipresent and globally connected. In this context, the question arises as to which new scientific directions may improve current database methodologies and open new possibilities for database applications. We strongly believe that inductive databases constitute one of the most important such directions.
There are several potential related projects, such as integration of databases with knowledge bases, automated learning of user interests, knowledge visualization (as opposed to data visualization), and applications of inductive databases to a wide range of practical domains.