Software for Hasse diagram construction and mining
HasseManager is a program that organizes data items into a structure from which Hasse diagrams can be created. Hasse diagrams are line drawings that visualise relations between data items. Items which are "larger" are drawn higher up in the drawing than smaller items. A line drawn from Item1 up to Item2 means that Item2 is the larger of the two, and that no other item lies between them. Formally, the diagram depicts a binary relation on a partially ordered set (poset), and "larger" generalises to any sort of comparison between two elements of the set. Hasse diagrams and posets are well researched.
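As an illustration of the "no other item in between" rule, here is a minimal Python sketch (not HasseManager code, which is C#) that derives the lines of a Hasse diagram from a comparison function. The names `cover_edges` and `divides` are made up for this example, and divisibility of integers stands in for an arbitrary partial order.

```python
def cover_edges(items, leq):
    """Return (lower, upper) pairs that would be joined by a line in the diagram.

    `leq(a, b)` is assumed to be a partial order: True when a is "smaller
    than or equal to" b.
    """
    edges = []
    for a in items:
        for b in items:
            if a == b or not leq(a, b):
                continue
            # Draw a line only if no third item c lies strictly between a and b.
            between = any(
                c != a and c != b and leq(a, c) and leq(c, b) for c in items
            )
            if not between:
                edges.append((a, b))
    return edges


def divides(a, b):
    """Example order: a is 'smaller' than b when a divides b."""
    return b % a == 0


edges = cover_edges([1, 2, 3, 4, 6, 12], divides)
```

This brute-force version checks every triple of items, so it is only meant to make the definition concrete, not to be efficient.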
HasseManager works online in the sense that data items are added one at a time, with the data structure completely updated after every addition. The program will consist of classes for HasseDiagram, elements in the diagram, solvers, etc. There is also a Windows form that serves as a testbench for the classes.
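One way to picture the one-at-a-time update is the following Python sketch. It is an assumption about how such an update could work, not a description of the actual HasseManager classes; `OnlineHasse`, `insert` and `is_substring` are hypothetical names. For each new item it finds the closest smaller and closest larger items already present, removes any line that the new item now interrupts, and connects the new item to its neighbours.

```python
class OnlineHasse:
    """Keeps the set of diagram lines up to date as items arrive."""

    def __init__(self, leq):
        self.leq = leq        # partial order: leq(a, b) is True when a <= b
        self.items = []
        self.edges = set()    # (lower, upper) pairs, one per line in the diagram

    def insert(self, x):
        if x in self.items:
            return
        below = [y for y in self.items if self.leq(y, x)]
        above = [y for y in self.items if self.leq(x, y)]
        # Closest neighbours: maximal items below x and minimal items above x.
        lowers = [y for y in below
                  if not any(z != y and self.leq(y, z) for z in below)]
        uppers = [y for y in above
                  if not any(z != y and self.leq(z, y) for z in above)]
        # x now sits between each such pair, so a direct line is no longer valid.
        self.edges -= {(lo, up) for lo in lowers for up in uppers}
        self.edges |= {(lo, x) for lo in lowers}
        self.edges |= {(x, up) for up in uppers}
        self.items.append(x)


def is_substring(a, b):
    return a in b


diagram = OnlineHasse(is_substring)
for item in ["ABX", "A", "AB"]:
    diagram.insert(item)
```

After "A" is inserted the diagram holds a line from "A" to "ABX"; inserting "AB" replaces it with lines from "A" to "AB" and from "AB" to "ABX". Each insert scans all existing items, which is acceptable for small sets but not a tuned algorithm.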
HasseManager is written in C#. There is a HasseNode base class, and specific implementations will use classes derived from it. The StringHasseNode class, which is used for testing, is for Hasse diagrams where the order comes from string/substring relations.
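The base-class idea can be sketched as follows. The real classes are written in C#; this Python version only borrows the class names HasseNode and StringHasseNode from the text above, and the method name `is_larger_than` and the `text` attribute are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class HasseNode(ABC):
    """Base class: subclasses decide what 'larger' means for their data."""

    @abstractmethod
    def is_larger_than(self, other):
        """Return True when self is strictly larger than other."""


class StringHasseNode(HasseNode):
    """Order by containment: a string is larger than its proper substrings."""

    def __init__(self, text):
        self.text = text

    def is_larger_than(self, other):
        return self.text != other.text and other.text in self.text


abx = StringHasseNode("ABX")
ab = StringHasseNode("AB")
y = StringHasseNode("Y")
```

Here `abx.is_larger_than(ab)` holds, while `abx` and `y` are incomparable: neither is larger than the other, which is what makes the order partial rather than total.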
The current plan is to:
- Develop reasonably efficient algorithms for smaller problems (up to a few thousand items), where all data fits into RAM. The algorithms should work on posets in general. Some existing algorithms for FCA (formal concept analysis) may be useful, unless they rely on special properties of FC lattices; this is still an open question.
- Implement methods for fragmentation of data items. For example, if data items are text strings, like ABX and ABY, then also adding substrings AB, X and Y to the set will make sense for some applications. For chemistry applications we will want to add common substructures from molecules in a set.
- Make a database version. Develop a suitable data model and design tables to contain the relation data. Also create procedures for insert and remove. An efficient database solution should probably rely heavily on indexing.
- Produce better drawings. The program writes output in DOT format, which can be read by the graph drawing programs in the Graphviz suite.
- Explore / develop methods for regression based on data in the Hasse diagram (see below).
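To make the DOT item in the plan concrete, the sketch below shows one way a list of diagram lines could be written as DOT text. It is a Python illustration, not the program's actual output routine; `to_dot` and the graph name `hasse` are hypothetical. It relies on two standard Graphviz attributes: `rankdir=BT`, which places the head of each edge above its tail so that larger items end up higher, and `arrowhead=none`, which draws plain lines.

```python
def to_dot(edges, name="hasse"):
    """Render (lower, upper) pairs as a DOT digraph, larger items on top."""

    def quote(label):
        # Escape backslashes and double quotes so any string is a valid DOT id.
        text = str(label).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{text}"'

    lines = [
        f"digraph {name} {{",
        "  rankdir=BT;",              # bottom-to-top layout
        "  edge [arrowhead=none];",   # plain lines instead of arrows
    ]
    for lower, upper in sorted(edges):
        lines.append(f"  {quote(lower)} -> {quote(upper)};")
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("AB", "ABX"), ("A", "AB")])
```

Saving `dot_text` to a file such as `hasse.dot` and running `dot -Tpng hasse.dot -o hasse.png` should then produce a drawing with "ABX" at the top.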
The primary motivation is from chemoinformatics where one wants to discover relations between the molecular structure of chemicals and their activity as drugs. The HasseManager project will explore the possibilities to use Hasse diagrams for this purpose.
Molecules can be thought of as being built up of substructures or fragments. It should be possible to arrange structures into a diagram based on substructure/superstructure relations. Hypothetical fragments can be inserted in the database along with real data. A fragment Hasse diagram should be useful for discovering any sort of structure in a chemical dataset.
Perhaps it is possible to use molecular fragment Hasse diagrams for automated discovery of quantitative structure-activity relations (QSAR) of compounds in a set. Methodology to set up regression models based on the input diagram needs to be developed.
Some links on related matters: