In this paper, we present (a) a method for identifying documents captured with low-resolution devices such as webcams, digital cameras, or mobile phones and (b) a technique for extracting their textual content without performing OCR. The first method associates a hierarchically structured visual signature with the low-resolution document image and matches it against the visual signatures of the original high-resolution document images, which are stored as PDF files in a repository. The matching algorithm follows the signature hierarchy, which speeds up the search by steering it toward the most promising regions of the solution space. In a second step, the content of the original PDF document is extracted, structured, and matched with its corresponding high-resolution visual signature. Finally, the matched content is attached to the visual signature of the low-resolution document image, which greatly enriches that image's content and indexing. We describe both the identification and the extraction methods and evaluate them on various documents, resolutions, and lighting conditions, using different capture devices.
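
The coarse-to-fine idea behind hierarchy-guided matching can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Signature` class, the plain numeric feature vectors, the Euclidean `distance`, and the `keep` pruning parameter are all assumptions introduced here for illustration. Each signature holds one feature vector per hierarchy level, coarsest first, and the candidate set is pruned at every level so that the finer, more expensive comparisons run on only a few documents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Signature:
    """Hypothetical hierarchical signature: levels[0] is the coarsest level."""
    doc_id: str
    levels: List[List[float]]


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hierarchical_match(query: Signature,
                       repository: List[Signature],
                       keep: int = 2) -> Optional[str]:
    """Return the doc_id of the best match, pruning candidates level by level."""
    candidates = list(repository)
    last = len(query.levels) - 1
    for level in range(len(query.levels)):
        # Rank the surviving candidates by their distance at this level.
        candidates.sort(
            key=lambda s: distance(query.levels[level], s.levels[level])
        )
        # Keep only the closest few; at the finest level keep a single winner.
        candidates = candidates[:1] if level == last else candidates[:keep]
    return candidates[0].doc_id if candidates else None
```

With `keep` much smaller than the repository size, most documents are discarded after the cheap coarse comparison. The trade-off is that a correct document ranked poorly at a coarse level is never reconsidered, so `keep` would need tuning against capture noise.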
|Publication status||Published - 28 Oct 2004|
|Event||Association for Computing Machinery (ACM) Symposium on Document Engineering (DocEng) - Milwaukee, United States|
Duration: 28 Oct 2004 → 30 Oct 2004