Information Access from Document Images of Indian languages
Primary Information
Domain
Information & Communication Technology
Project No.
5326
Sanction and Project Initiation
Sanction No: F. No.:3-18/2015-T.S.-I (Vol. III)
Sanction Date: 22/03/2017
Project Initiation date: 13/06/2017
Project Duration: 36 months
Partner Ministry/Agency/Industry
MEITY
Role of partner: To provide half of the sanctioned budget.
Support from partner: No money received yet.
Principal Investigator
Prof. Prabir Kumar Biswas
Dept. of E & ECE, IIT Kharagpur
Host Institute
Co-PIs
Prof. Jayanta Mukhopadhyay
Dept. of CSE, IIT Kharagpur
Prof. Santanu Chaudhury
CEERI Pilani
Prof. Bhabatosh Chanda
ISI Kolkata
Prof. Shamik Sural
Dept. of CSE, IIT Kharagpur
Dr. C. V. Jawahar
IIIT Hyderabad
Prof. P. P. Das
Dept. of CSE, IIT Kharagpur
Prof. Gaurav Harit
IIT Jodhpur
Scope and Objectives
A. Content-aware image processing algorithms for analysing and improving degraded documents by exploiting the priors.
B. Building recognizers for handwritten, typewritten and low-resolution Indian language documents to convert images to textual representations.
C. Exploiting information retrieval techniques for inference in the presence of noisy textual representations.
D. Creating ground truth for Indian language document image datasets (typewritten, low-resolution and handwritten) to enable machine learning.
E. Validating the methods on two immediate practical applications.
Deliverables
1. Prototype system that demonstrates the utility on real life applications in Indian settings. Scalable deployment and delivery over the web. Relevant IP in terms of patents on noisy content based document information extraction. Possible ToT with industry.
2. Ground truth document image data (a corpus of three Indian languages: Hindi, Bangla and Malayalam) and associated resources that drive the research.
3. Methods, algorithms and approaches that lead to high-quality, peer-reviewed international publications.
Scientific Output
1. Methods for pre-processing and categorization of printed documents have been developed.
2. A recognizer for typewritten documents has been developed.
3. An algorithm has been designed for restoring degraded old handwritten documents.
4. An algorithm has been implemented for text extraction from degraded images.
5. State-of-the-art handwritten word recognizers for two popular Indic scripts have been designed.
6. A graphical user interface based ground-truth generation tool called Multi-layered Document Image Annotation System (MultiDIAS) has been created.
Results and outcome till date
Work on Objectives "A" and "B" is progressing well, and a few good results have been achieved. Work on Objectives "C" and "D" is in progress.
Societal benefit and impact anticipated
As a consequence of recent initiatives such as Digital India and E-Governance, many document image collections are being digitized. However, they are not yet accessible at the content level; they can be accessed only through manually entered metadata. Content-level access to such collections can enable powerful applications such as hyperlinking of documents (for example, court records), translation, transliteration, access to handwritten documents, form processing and data entry from paper, and abnormal activity detection in streams of fax documents. Our proposal involves developing technology that makes many Indian document image collections immediately accessible for civilian applications.
Next steps
The next step is to validate the proposed methods on two immediate practical applications in the Indian scenario.
Publications and reports
1. Kartik Dutta, Praveen Krishnan, Minesh Mathew, and C. V. Jawahar - Towards Accurate Handwritten Word Recognition for Hindi and Bangla, NCVPRIPG, 2017
2. Kartik Dutta, Praveen Krishnan, Minesh Mathew and C. V. Jawahar - Offline Handwriting Recognition on Devanagari using a new Benchmark Dataset, DAS 2018
3. Kartik Dutta, Praveen Krishnan, Minesh Mathew and C. V. Jawahar - Towards Spotting and Recognition of Handwritten Words in Indic Scripts, ICFHR 2018
4. Praveen Krishnan, Kartik Dutta and C. V. Jawahar - Word Spotting and Recognition using Deep Embedding, DAS 2018
5. G. Nagendar, Viresh Ranjan, Gaurav Harit and C. V. Jawahar - Efficient Query Specific DTW Distance for Document Retrieval with Unlimited Vocabulary, J. Imaging, 2018
6. M. Wadhwani, D. Kundu and B. Chanda, Text segmentation and restoration of old handwritten documents, 11th ICVGIP, Hyderabad, December, 2018. (communicated)
7. A. Poddar, R. Mukherjee, J. Mukhopadhyay, P. K. Biswas, MultiDIAS: A Hierarchical Multi-layered Document Image Annotation System for Foreground Pixels, 11th ICVGIP, Hyderabad, December, 2018. (communicated)
Patents
Scholars and Project Staff
One Senior Project Fellow, one Junior Project Fellow, two Technical Project Superintendents and one Principal Project Officer have been recruited.
Challenges faced
As this is a multi-institutional project, there are certain challenges, which we are working to resolve. Secondly, no money has been received from the Partner Ministry. The data from CEERI Pilani is updated till 31st July, 2018.
Financial Information
- Total sanction: Rs. 40000000
- Amount received: Rs. 14113000
- Amount utilised for Equipment: Rs. 2951386
- Amount utilised for Manpower: Rs. 1844754
- Amount utilised for Consumables: Rs. 263226
- Amount utilised for Contingency: Rs. 351296
- Amount utilised for Travel: Rs. 221900
- Amount utilised for Other Expenses: Rs. 0
- Amount utilised for Overheads: Rs. 1891559
Equipment and facilities
Workstations, laser printers, laptop