Information Access from Document Images of Indian languages

Primary Information

Domain

Information & Communication Technology

Project No.

5326

Sanction and Project Initiation

Sanction No: F. No.:3-18/2015-T.S.-I (Vol. III)

Sanction Date: 22/03/2017

Project Initiation date: 13/06/2017

Project Duration: 36 months

Partner Ministry/Agency/Industry

MEITY

Role of partner: To provide half of the sanctioned budget.

Support from partner: No money received yet.

Principal Investigator

Prof. Prabir Kumar Biswas
Dept. of E & ECE, IIT Kharagpur

Host Institute

Co-PIs

Prof. Jayanta Mukhopadhyay
Dept. of CSE, IIT Kharagpur

Prof. Santanu Chaudhury
CEERI Pilani

Prof. Bhabotosh Chanda
ISI Kolkata

Prof. Shamik Sural
Dept. of CSE, IIT Kharagpur

Dr. C. V. Jawahar
IIIT Hyderabad

Prof. P. P. Das
Dept. of CSE, IIT Kharagpur

Prof. Gaurav Harit
IIT Jodhpur

Scope and Objectives

A. Content-aware image processing algorithms for analysing and improving degraded documents by exploiting the priors.
B. Building recognizers for handwritten, typewritten and low-resolution Indian language documents to convert images to textual representations.
C. Exploiting information retrieval techniques for inference in the presence of noisy textual representations.
D. Creating ground truth for Indian language document image datasets (typewritten, low-resolution and handwritten) to enable machine learning.
E. Validating the methods on two immediate practical applications.

Deliverables

1. Prototype system that demonstrates the utility on real-life applications in Indian settings. Scalable deployment and delivery over the web. Relevant IP in terms of patents on noisy content based document information extraction. Possible ToT with industry.
2. Ground truth document image data (corpus of three Indian languages: Hindi, Bangla and Malayalam) and associated resources that drive the research.
3. Methods, algorithms and approaches that lead to peer-reviewed, high-quality international publications.

Scientific Output

1. Methods for pre-processing and categorization of printed documents have been developed.
2. A recognizer for typewritten documents has been developed.
3. An algorithm has been designed for restoring degraded old handwritten documents.
4. An algorithm has been implemented for text extraction from degraded images.
5. State-of-the-art handwritten word recognizers have been designed for two popular Indic scripts.
6. A graphical user interface based ground-truth generation tool called Multi-layered Document Image Annotation System (MultiDIAS) has been created.

Results and outcome till date

Work on Objectives "A" and "B" is progressing well, and a few good results have been achieved. Work on Objectives "C" and "D" is in progress.

Societal benefit and impact anticipated

As a consequence of many recent initiatives (Digital India, E-Governance), several document image collections are being digitized. However, they are not yet accessible at the content level; they can be accessed only through manually entered metadata. Content-level access to such collections can lead to powerful applications such as hyperlinking documents (for example, court records), translation, transliteration, access to handwritten documents, form processing and data entry on paper, and abnormal activity detection within a stream of fax documents. Our proposal involves developing technology that makes many Indian document image collections immediately accessible, enabling civilian applications.

Next steps

Validating the proposed methods on two immediate practical applications in the Indian scenario is the upcoming challenge.

Publications and reports

1. Kartik Dutta, Praveen Krishnan, Minesh Mathew and C. V. Jawahar, Towards Accurate Handwritten Word Recognition for Hindi and Bangla, NCVPRIPG, 2017
2. Kartik Dutta, Praveen Krishnan, Minesh Mathew and C. V. Jawahar, Offline Handwriting Recognition on Devanagari using a new Benchmark Dataset, DAS, 2018
3. Kartik Dutta, Praveen Krishnan, Minesh Mathew and C. V. Jawahar, Towards Spotting and Recognition of Handwritten Words in Indic Scripts, ICFHR, 2018
4. Praveen Krishnan, Kartik Dutta and C. V. Jawahar, Word Spotting and Recognition using Deep Embedding, DAS, 2018
5. G. Nagendar, Viresh Ranjan, Gaurav Harit and C. V. Jawahar, Efficient Query Specific DTW Distance for Document Retrieval with Unlimited Vocabulary, J. Imaging, 2018
6. M. Wadhwani, D. Kundu and B. Chanda, Text segmentation and restoration of old handwritten documents, 11th ICVGIP, Hyderabad, December, 2018. (communicated)
7. A. Poddar, R. Mukherjee, J. Mukhopadhyay, P. K. Biswas, MultiDIAS: A Hierarchical Multi-layered Document Image Annotation System for Foreground Pixels, 11th ICVGIP, Hyderabad, December, 2018. (communicated)

Patents

Scholars and Project Staff

One Senior Project Fellow, one Junior Project Fellow, two Technical Project Superintendents and one Principal Project Officer have been recruited.

Challenges faced

As this is a multi-institutional project, there are certain coordination challenges, which we are working to resolve. In addition, no money has been received from the Partner Ministry. The data from CEERI Pilani is updated till 31st July, 2018.

Financial Information

  • Total sanction: Rs. 40000000

  • Amount received: Rs. 14113000

  • Amount utilised for Equipment: Rs. 2951386

  • Amount utilised for Manpower: Rs. 1844754

  • Amount utilised for Consumables: Rs. 263226

  • Amount utilised for Contingency: Rs. 351296

  • Amount utilised for Travel: Rs. 221900

  • Amount utilised for Other Expenses: Rs. 0

  • Amount utilised for Overheads: Rs. 1891559

Equipment and facilities

Work stations, laser printers, laptop