Data Transformation and Editing


We are generally dealing with a large set of objects sharing common features. The challenge is to describe these objects with an appropriate set of attributes. Objects have an infinite number of properties, so the choice of attributes pertinent to the questions being addressed is decisive. Often the best result is not obtained at the first attempt, but at the end of a long series of trials and errors using an interactive, iterative feedback mechanism.

SEMANA is designed to help the user with these time-consuming tasks.

Dynamic DB Builder

Attribute Editor :
- Once built, a database generally consists of a set of objects described by a set of attributes, each taking a set of values (a so-called multi-valued table). Dynamic DB Builder includes an Attribute Editor that lets the user easily build and modify the whole set of attribute-values (AV):
        - to create new attributes and/or values
        - to change their names
        - to merge attributes
        - to split attributes.
Modifications are instantly applied to the whole DB.
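
As an illustration only, the editing operations listed above can be sketched in Python, assuming each object is stored as a dictionary mapping attribute names to values. The function names (rename_attribute, merge_attributes, split_attribute), the sample data and this storage model are hypothetical; they do not describe how SEMANA is implemented.

```python
def rename_attribute(db, old, new):
    """Rename an attribute in every object that uses it."""
    for obj in db:
        if old in obj:
            obj[new] = obj.pop(old)


def merge_attributes(db, a, b, merged):
    """Merge attributes a and b into a single attribute named merged.

    Assumption: merging is refused when some object uses both attributes.
    """
    if any(a in obj and b in obj for obj in db):
        raise ValueError(f"{a!r} and {b!r} are used together in some object")
    for obj in db:
        for name in (a, b):
            if name in obj:
                obj[merged] = obj.pop(name)


def split_attribute(db, attr, target_of):
    """Split attr: target_of maps each value to the new attribute receiving it."""
    for obj in db:
        if attr in obj:
            value = obj.pop(attr)
            obj[target_of[value]] = value


# Hypothetical two-object database with a spelling variant to merge.
db = [
    {"material": "clay", "colour": "red"},
    {"material": "bronze", "color": "green"},
]
merge_attributes(db, "colour", "color", "hue")
rename_attribute(db, "material", "matter")
```

Each function walks the whole list of objects, which mirrors the statement that a modification is applied to the whole DB at once.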

Statistics and decision support :
- Dynamic DB Builder provides statistics about the use of attributes and values. A report informs the user when two attributes could be merged, i.e. when they are mutually exclusive (whenever one is used, the other is not).

- Dynamic DB Builder also reports duplicates in the DB (i.e. objects that have exactly the same set of AV) and gives a saturation index for the DB (i.e. the number of AV combinations actually used relative to the theoretical number of combinations). This gives an idea of how representative the sample is.
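
The statistics described above can be sketched as follows. This is a minimal sketch under stated assumptions, not SEMANA's own code: objects are dictionaries of attribute-value pairs, the function names are hypothetical, and the theoretical number of combinations is taken to be the product of the number of distinct values observed for each attribute, which presumes every object has a value for every attribute. The source does not spell out the exact formula.

```python
from collections import Counter
from itertools import combinations
from math import prod


def attribute_usage(db):
    """Count how many objects use each attribute."""
    return Counter(a for obj in db for a in obj)


def exclusive_pairs(db):
    """Pairs of attributes never used together: candidates for merging."""
    attributes = sorted({a for obj in db for a in obj})
    return [
        (a, b)
        for a, b in combinations(attributes, 2)
        if not any(a in obj and b in obj for obj in db)
    ]


def duplicates(db):
    """Groups of object indices sharing exactly the same set of AV."""
    groups = {}
    for i, obj in enumerate(db):
        groups.setdefault(frozenset(obj.items()), []).append(i)
    return [idx for idx in groups.values() if len(idx) > 1]


def saturation(db):
    """Distinct AV combinations used / theoretical number (assumed formula)."""
    used = {frozenset(obj.items()) for obj in db}
    values = {}
    for obj in db:
        for a, v in obj.items():
            values.setdefault(a, set()).add(v)
    return len(used) / prod(len(v) for v in values.values())


# Hypothetical data: "colour" and "color" are never used together.
sample = [
    {"material": "clay", "colour": "red"},
    {"material": "bronze", "color": "green"},
    {"material": "clay", "colour": "black"},
]
# Hypothetical complete table: objects 0 and 3 are duplicates.
table = [
    {"material": "clay", "colour": "red"},
    {"material": "clay", "colour": "black"},
    {"material": "bronze", "colour": "red"},
    {"material": "clay", "colour": "red"},
]
```

On these data, exclusive_pairs(sample) flags the pair ("color", "colour"), duplicates(table) groups objects 0 and 3, and saturation(table) is 3 used combinations out of 2 x 2 = 4 possible ones, i.e. 0.75.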


Table handling

Collector : Dynamic DB Builder includes a procedure named Collector, which builds a table from the whole set of objects present in the DB. This is a multi-valued table (AV-type) made of symbolic features.
As such, multi-valued tables are ready for RST (Rough Set Theory) and Decision logic.
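
A rough idea of what such a collection step produces, assuming objects are dictionaries of attribute-value pairs: the function name collect is hypothetical, and representing an attribute that an object does not use by None is an assumption, not a documented SEMANA convention.

```python
def collect(db):
    """Build a multi-valued table: one row per object, one column per attribute."""
    attributes = sorted({a for obj in db for a in obj})
    rows = [[obj.get(a) for a in attributes] for obj in db]  # None = not used
    return attributes, rows


# Hypothetical objects; the second one has no "colour" attribute.
objects = [
    {"material": "clay", "colour": "red"},
    {"material": "bronze"},
]
attributes, rows = collect(objects)
```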

Table conversions :
For other procedures, such as FCA (Formal Concept Analysis) and STAT, multi-valued tables must be converted into one-valued tables.
• This is generally achieved using nominal (or plain) scaling: each value of a multi-valued attribute becomes a one-valued attribute in its own right.
• Logical scaling can also be used. It combines two (or more) attributes according to rules proposed by the expert (see the sleeping-bag example in Prediger 1997).
• Discretization: quantitative measurements (age, length, weight, scores, word counts, etc.) can be converted into discrete values (called modalities) according to conversion rules designed by the expert. Histograms can help in choosing these rules.
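
The three conversions above can be sketched in a few lines of Python. This is only an illustration: the function names, the cut points, the labels and the sample rule are all hypothetical, and the half-open intervals used for discretization are one possible convention among others.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a number to a modality.

    cut_points are ascending; a value equal to a cut point falls in the
    upper interval. len(labels) must be len(cut_points) + 1.
    """
    return labels[bisect_right(cut_points, value)]


def nominal_scale(db):
    """Turn every attribute-value pair into a one-valued (0/1) column."""
    pairs = sorted({(a, v) for obj in db for a, v in obj.items()})
    table = [[1 if obj.get(a) == v else 0 for a, v in pairs] for obj in db]
    return [f"{a}={v}" for a, v in pairs], table


# Hypothetical objects with one symbolic and one quantitative attribute.
people = [{"sex": "F", "age": 12}, {"sex": "M", "age": 40}]

# Discretization: an expert-chosen rule turns age into three modalities.
for person in people:
    person["age"] = discretize(person["age"], [18, 65], ["young", "adult", "senior"])

# Nominal scaling: one column per attribute-value pair.
columns, table = nominal_scale(people)

# Logical scaling: a derived one-valued attribute from a hypothetical expert rule.
young_female = [p["sex"] == "F" and p["age"] == "young" for p in people]
```

Here columns is ["age=adult", "age=young", "sex=F", "sex=M"], and each row of table marks with 1 the pairs that the corresponding object carries.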

Table reordering :

SEMANA implements a statistical test, the clustering index, which indicates whether the data tend toward clustering or toward seriation (Renfrew and Sterud 1969). If the trend is toward seriation, the rows and columns of the table may be reordered so as to concentrate the positive values along the diagonal. The following example is an ideal, theoretical case:

[Figure: Matrix (after Caraux 1984)]
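
To give an idea of how such a reordering can work, here is a minimal sketch of a generic barycentre heuristic: rows are sorted by the mean column position of their positive cells, then columns by the mean row position of theirs, and the two steps are repeated until the order stops changing. This is not the algorithm of Caraux (1984) and is not claimed to be the one used in SEMANA; the function names and the sample matrix are hypothetical.

```python
def _mean_positions(matrix):
    """Mean column position of the positive cells of each row."""
    means = []
    for row in matrix:
        hits = [j for j, x in enumerate(row) if x]
        means.append(sum(hits) / len(hits) if hits else 0.0)
    return means


def reorder(matrix, max_iter=50):
    """Return (row order, column order, reordered matrix)."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    for _ in range(max_iter):
        # Sort rows by the barycentre of their positive cells.
        current = [[matrix[i][j] for j in cols] for i in rows]
        means = _mean_positions(current)
        new_rows = [rows[k] for k in sorted(range(len(rows)), key=means.__getitem__)]
        # Sort columns the same way, working on the transposed table.
        current = [[matrix[i][j] for j in cols] for i in new_rows]
        means = _mean_positions([list(c) for c in zip(*current)])
        new_cols = [cols[k] for k in sorted(range(len(cols)), key=means.__getitem__)]
        if new_rows == rows and new_cols == cols:
            break  # stable order reached
        rows, cols = new_rows, new_cols
    return rows, cols, [[matrix[i][j] for j in cols] for i in rows]


# A banded matrix whose rows and columns have been shuffled (hypothetical data).
shuffled = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
]
row_order, col_order, result = reorder(shuffled)
for line in result:
    print(line)
```

On this input the positive values end up in a band along the diagonal. The heuristic is not guaranteed to do so in general: ties between barycentres can leave it in an order that is stable but not ideal.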

References

CARAUX G. (1984). Réorganisation et représentation visuelle d'une matrice de données numériques: un algorithme itératif. Revue de Statistique appliquée, t. 32, n°4, pp. 5-23.

PREDIGER S. (1997). Logical scaling in Formal Concept Analysis. In Conceptual structures: fulfilling Peirce's dream (D. Lukose et al. eds.), Proceedings of the 5th Internat. Conf. on conceptual structures (ICCS'97). Lecture notes in Artificial Intelligence n°1257, Springer-Verlag: Berlin, pp. 332-341.

RENFREW C., G. STERUD (1969). Close-Proximity Analysis: A Rapid Method for the Ordering of Archaeological Materials. American Antiquity, Vol. 34, No. 3, pp. 265-277.

DEMSAR J., B. ZUPAN (2005). From Experimental Machine Learning to Interactive Data Mining. White paper, Slovenia.

YAO Y.Y. (2006). On conceptual modeling of data mining. In Machine Learning and Applications (J. Wang, Z.H. Zhou, A.Y. Zhou eds.), Tsinghua University Press: Beijing, pp. 238-255.

