
Measuring Text Similarity With Dynamic Time Warping

authors: Michael Matuschek, Tim Schlüter, Stefan Conrad
type: Inproceedings
editor: B.C. Desai
booktitle: Proceedings of the 2008 International Symposium on Database Engineering & Applications, Coimbra, Portugal, September 10-12, 2008
publisher: ACM International Conference Proceeding Series; Vol. 299
pages: 263-267
month: September
year: 2008
ISBN: 978-1-60558-188-0
Abstract

In this work, we describe an approach that aims to make typed texts comparable using temporal data mining methods. This proposal was made in earlier work [11], but to our knowledge no significant research on the subject has been done since. The basic idea is to derive artificial time series from texts by counting the occurrences of relevant keywords in a sliding window moved across each text; the resulting time series can then be compared with techniques from time series analysis. In this particular case, the Dynamic Time Warping distance [3] was used. Adequate parameters for the time series calculation were derived through extensive testing, and we show that this approach might aid in the recognition of similar texts, since the observed distances between similar documents are significantly lower than those between unrelated texts. The idea might also be especially suitable for comparing texts in different languages, since only the translations of the keywords need to be known.
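
The two steps named in the abstract, deriving a keyword-count time series with a sliding window and comparing two such series with Dynamic Time Warping, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `keyword_series` and `dtw_distance`, the whitespace tokenization, the `window` and `step` parameters, and the absolute difference used as the local cost are all assumptions made for illustration. The parameter values the paper derived by testing are not stated in the abstract and are not reproduced here.

```python
from typing import List, Sequence, Set


def keyword_series(tokens: Sequence[str], keywords: Set[str],
                   window: int, step: int = 1) -> List[int]:
    """Count keyword hits inside each position of a sliding window.

    Assumption: tokens are already split words and keywords are lowercase.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # 1 where the token is a keyword, 0 otherwise.
    hits = [1 if t.lower() in keywords else 0 for t in tokens]
    # Texts shorter than the window yield a single (partial) window.
    last_start = max(len(hits) - window, 0)
    return [sum(hits[i:i + window]) for i in range(0, last_start + 1, step)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW with |x - y| as local cost (assumed)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Hypothetical usage: two short texts sharing the keywords "cat" and "mat".
text_a = "the cat sat on the mat the cat".split()
text_b = "a cat was sitting on a mat and the cat slept".split()
keywords = {"cat", "mat"}
series_a = keyword_series(text_a, keywords, window=3)
series_b = keyword_series(text_b, keywords, window=3)
distance = dtw_distance(series_a, series_b)
```

For a cross-language comparison as suggested in the abstract, the same sketch would be called with a translated keyword set for each text, so that only the keyword translations need to be supplied.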
