A Platform for Crosslingual and Multilingual Event Monitoring in Indian Languages
Primary Information
Domain
Information & Communication Technology
Project No.
5592
Sanction and Project Initiation
Sanction No: 3-18/2015-T.S.-I (Vol. III)
Sanction Date: 22/03/2017
Project Initiation date: 18/04/2018
Project Duration: 36 months
Partner Ministry/Agency/Industry
Ministry of Electronics and Information Technology
Role of partner: To support funds and resources for the execution of the project.
Support from partner:
Principal Investigator
Sudeshna Sarkar
IIT Kharagpur (Host Institute)
Co-PIs
Pushpak Bhattacharyya
IIT Patna, IIT Bombay
Pawan Goyal
IIT Kharagpur
Sobha L
AU-KBC Research Centre, MIT Campus, Anna University, Chennai
Asif Ekbal
IIT Patna
Ganesh Ramakrishnan
IIT Bombay
Malhar Arvind Kulkarni
IIT Bombay
Prasenjit Majumder
DAIICT, Gandhinagar
Scope and Objectives
1. To build a multi-lingual event monitoring platform to identify, extract and classify events of interest from news media.
2. An end-to-end system will be built and demonstrated for English and a few Indian languages such as Hindi, Bengali, Marathi and Tamil, and for domains such as health, conflict and disaster.
3. The necessary NLP tools will be developed for the corresponding Indian languages.
4. The designed system will be modular and customizable to facilitate addition and adaptation of languages, domains and other sources such as social media.
5. A user-friendly visualization interface will also be developed, which will support retrieval and analytics driven by user query.
Deliverables
The final outcome of the project will be an open platform for Indian language information aggregation, extraction and visualization, built on open source components, which may be made available for further customization. A user-friendly interface will enable users to browse and navigate through the events of their interest. Along with this main outcome, the deliverables will be:
a) The required NLP tools in Hindi, Bengali, Marathi and Tamil.
b) Shared tasks related to Indian language event extraction will be run at FIRE 2017 and 2018, making resources available and encouraging research in Indian language (IL) text mining.
c) Algorithms and modules for search, recommendation and (timeline) summarization will be made available for the individual languages, along with the cross-lingual retrieval modules. The platform will be fully open and designed so that new modules can be plugged in easily and so that it can be extended to different languages, input media and domains.

Scientific Output
The project involves state-of-the-art research in multilingual and low-resource natural language processing, event extraction, event linking, and multi-document and multilingual aggregation.
1. The project will lead to the development of algorithms and resources for event extraction in Indian languages. This will involve developing methods that apply to any low-resource language. Multilingual and transfer methods will be developed that extend the state of the art.
2. Multi-document and multilingual aggregation of frame elements poses interesting challenges that will be worked on in this project.
3. The project will focus on generating event information in multiple Indian languages, which will involve algorithms for natural language generation. The project will also contribute to cross-lingual and multilingual event-specific summarization.

Results and outcome till date
A basic system has been developed that crawls news sources and indexes them. A frontend has been developed that accepts user input as free text, through advanced search or through a map interface, presents the results as a list or on the map, and links to the retrieved documents. The online system includes basic query processing, indexing in a search engine and ranking. As a part of the project we have created ontologies of disaster and health events. The disaster ontology has 8 broad event types, 26 event types in total and 13 frame elements. A detailed annotation guideline for the disaster domain has been developed. We have also developed ontologies of 38 event types in Finance and 21 event types in Health. A multilingual annotation tool has been developed, and annotated resources have been created for 5 languages in the disaster domain. The offline system includes domain identification and event extraction. We have worked on event trigger identification and classification, time and location identification, and argument extraction and classification for monolingual documents in 5 Indian languages. Supervised ML-based methods have been developed for sentence-wise event trigger and event type identification, argument extraction and argument linking.
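
The kind of record that sentence-wise event extraction produces can be sketched roughly as below. This is only an illustration: the class names, field names, role labels, event-type labels and the tiny keyword lexicon are all assumptions made for the example. They are not the project's ontology, and the lexicon lookup merely stands in for the supervised trigger and event-type models mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Argument:
    role: str  # frame element name, e.g. "place" or "time" (hypothetical labels)
    text: str  # span of the sentence that fills the role


@dataclass
class EventFrame:
    trigger: str     # word that signals the event
    event_type: str  # label from an event-type inventory (hypothetical here)
    arguments: List[Argument] = field(default_factory=list)


# Toy trigger lexicon: a stand-in for a trained trigger classifier.
TRIGGER_LEXICON = {
    "earthquake": "EARTHQUAKE",
    "flood": "FLOOD",
    "floods": "FLOOD",
}


def extract_frame(sentence: str) -> Optional[EventFrame]:
    """Return a frame for the first lexicon trigger in the sentence, else None."""
    for token in sentence.split():
        word = token.strip(".,;:!?").lower()
        if word in TRIGGER_LEXICON:
            return EventFrame(trigger=word, event_type=TRIGGER_LEXICON[word])
    return None


frame = extract_frame("Floods hit Assam on Monday.")
if frame is not None:
    # Arguments would normally come from a separate extraction step.
    frame.arguments.append(Argument(role="place", text="Assam"))
    frame.arguments.append(Argument(role="time", text="Monday"))
```

Keeping the trigger, the event type and the list of arguments in one record is one possible design; it allows later stages such as argument linking or aggregation to work on frames rather than on raw text.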

Societal benefit and impact anticipated
The project aims to develop resources and competency in Indian language information extraction, event extraction and event aggregation, and will contribute to Indian language text mining systems.
Next steps
1. Identification of stakeholders and use cases
2. Event linking across sentences, linking of frames with events
3. Document level event aggregation
4. Increase in precision and recall of the system based on a hybrid of ML-based, knowledge-based and multilingual methods
5. Improvement of systems for argument extraction and linking
6. Multi-document aggregation of events
7. Multilingual event aggregation
8. Natural language generation
9. Event specific summarization
10. Extension to social media
Publications and reports
1. Z. Ahmad, S. Sahoo, A. Ekbal and P. Bhattacharyya (2018). A Deep Learning Model for Event Extraction and Classification in Hindi for Disaster Domain. In Proceedings of the 15th International Conference on Natural Language Processing (ICON). Accepted.
2. S. Kamila, A. Ekbal and P. Bhattacharyya (2018). Sentence Level Temporality Detection using an Implicit Time-sensed Resource. In Proceedings of LREC-2018, pp. 325-331, May 7-12, 2018, Japan.
3. Aipe, Mukuntha N S, A. Ekbal and S. Kurohashi (2018). Deep Learning Approach towards Multi-label Classification of Crisis Related Tweets. In Proceedings of the 15th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 20-23, 2018, Rochester, USA.
4. Gupta, A. Ekbal, S. Saha and P. Bhattacharyya (2018). MMQA: A Multi-domain Multi-lingual Question-Answering Framework for English and Hindi. In Proceedings of LREC-2018, pp. 2777-2784, May 7-12, 2018, Japan.
5. Gupta, A. Ekbal, S. Saha and P. Bhattacharyya (2018). A Deep Neural Network based Approach for Entity Extraction in Code-Mixed Indian Social Media Text. In Proceedings of LREC-2018, pp. 1762-1767, May 7-12, 2018, Japan.
6. Alapan Kuila, Sudeshna Sarkar: An Event Extraction System via Neural Networks. FIRE (Working Notes) 2017: 136-139
7. Pattabhi RK Rao and Sobha Lalitha Devi. (2018). "Enhancing Multi-Document Summarization using Concepts", Sadhana - Academy Proceedings in Engineering Sciences, vol. 43:2 (27). https://doi.org/10.1007/s12046-018-0789-y
8. Pattabhi RK Rao and Sobha Lalitha Devi. (2017). "Patent Document Summarization Using Conceptual Graphs", International Journal on Natural Language Computing (IJNLC), vol. 6(3).
9. Sindhuja Gopalan and Sobha Lalitha Devi. (2017). "BNEMiner: mining biomedical literature for extraction of biological target, disease and chemical entities". International Journal of Business Intelligence and Data Mining. 11, 2 (January 2017), 190-204. DOI: https://doi.org/10.1504/IJBIDM.2016.081612
10. Vijay Sundar Ram and Sobha Lalitha Devi. (2018). "A Semi-automated Annotation of Co-reference Chains in Tamil", In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 18 to 24, 2018, Hanoi, Vietnam.
11. Malarkodi C.S. and Sobha Lalitha Devi. (2018). "Twitter Named Entity Recognition for Indian Languages", In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 18 to 24, 2018, Hanoi, Vietnam.
12. Vijay Sundar Ram and Sobha Lalitha Devi. (2017). "A Robust Coreference Chain Builder for Tamil", In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing, April 17 to 23, 2017, Budapest, Hungary.
13. Sindhuja Gopalan and Sobha Lalitha Devi. (2017). "Cause and Effect Extraction from Biomedical Corpus", In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing, April 17 to 23, 2017, Budapest, Hungary.
Scholars and Project Staff
IIT Kharagpur:
SRF: Aniruddha Roy (3 years)
JRF: Debanjana Kar, MS student (3 years)
JRF: Pratik Karia, MS student (3 years)
JRF: Sarath Chandra Bussa, MS student (3 years)
Language Annotators (3): Rupak ishra, Sonali Samanta, Arnab Sadhukhan
IIT Patna:
SRF (2): Sovan Kumar Sahoo, Zishan Ahmad
JRF (1): Deeksha Varshney
Lexicographers (2): Monalisa Bhattacharjee, Jaya Jha
MTech student: Saumajit Saha
AUKBC:
PhD Scholars:
1. Pattabhi R. K. Rao
2. Vijay Sundar Ram
3. Malarkodi C. S.
Project Associates:
1. Parimala N. H.
2. Padmapriya N.
3. Suji A.
DAIICT:
Project Staff (4):
Parth Mehta
Jainisha Sankhvara
Palak Thakur
Surupendu Gangopadhyay
IIT Bombay:
Saheli Khare
Challenges faced
Suitable computation servers need to be procured to host the initial system and to support large-scale computation for crawling, offline training and processing, and real-time online presentation of output. The major equipment purchase has been deferred until funds are received from the partner ministry, so that the salary commitments to project staff can be met.
Financial Information
- Total sanction: Rs. 39283200
- Amount received: Rs. 13344000
- Amount utilised for Equipment: Rs. 454031
- Amount utilised for Manpower: Rs. 4778942
- Amount utilised for Consumables: Rs. 52550
- Amount utilised for Contingency: Rs. 272147
- Amount utilised for Travel: Rs. 337953
- Amount utilised for Other Expenses: Rs. 83869
- Amount utilised for Overheads: Rs. 2177943
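
The figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the seven utilisation heads listed are exhaustive and that all amounts are in rupees. The derived totals are computed here and are not figures reported by the project.

```python
# Figures in rupees, copied from the financial information above.
total_sanction = 39_283_200
amount_received = 13_344_000
utilised = {
    "Equipment": 454_031,
    "Manpower": 4_778_942,
    "Consumables": 52_550,
    "Contingency": 272_147,
    "Travel": 337_953,
    "Other Expenses": 83_869,
    "Overheads": 2_177_943,
}

# Derived values (not reported in the source).
total_utilised = sum(utilised.values())          # 8157435
unspent = amount_received - total_utilised       # 5186565
received_share = amount_received / total_sanction

print(f"Total utilised: Rs. {total_utilised}")
print(f"Unspent from amount received: Rs. {unspent}")
print(f"Share of sanction received: {received_share:.1%}")
```

On these assumptions, roughly a third of the sanctioned amount had been received, and manpower is by far the largest utilisation head.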
Equipment and facilities
Computers and accessories