A Platform for Crosslingual and Multilingual Event Monitoring in Indian Languages

Primary Information

Domain

Information & Communication Technology

Project No.

5592

Sanction and Project Initiation

Sanction No: 3-18/2015-T.S.-I (Vol. III)

Sanction Date: 22/03/2017

Project Initiation date: 18/04/2018

Project Duration: 36 months

Partner Ministry/Agency/Industry

Ministry of Electronics and Information Technology

 

Role of partner: To support funds and resources for the execution of the project.

 

Support from partner:

Principal Investigator

Sudeshna Sarkar
IIT Kharagpur

Host Institute

Co-PIs

Pushpak Bhattacharyya
IIT Patna, IIT Bombay

Pawan Goyal
IIT Kharagpur

Sobha L
AU-KBC Research Centre, MIT Campus, Anna University, Chennai

Asif Ekbal
IIT Patna

Ganesh Ramakrishnan
IIT Bombay

Malhar Arvind Kulkarni
IIT Bombay

Prasenjit Majumder
DAIICT, Gandhinagar

 

Scope and Objectives

1. To build a multi-lingual event monitoring platform to identify, extract and classify events of interest from news media.
2. An end-to-end system will be built and demonstrated for English and a few Indian languages such as Hindi, Bengali, Marathi, Tamil and domains such as health, conflict events and disaster.
3. The necessary NLP tools will be developed for the corresponding Indian languages.
4. The designed system will be modular and customizable to facilitate addition and adaptation of languages, domains and other sources such as social media.
5. A user-friendly visualization interface will also be developed, which will support retrieval and analytics driven by user query.

Deliverables

The final outcome of the project will be an open platform, built using open-source platforms, for aggregating, extracting and visualizing information in Indian languages; it may be made available for further customization. A user-friendly interface will let users browse and navigate the events that interest them. In addition to this main outcome, the deliverables will be:
a) The required NLP tools in Hindi, Bengali, Marathi and Tamil.
b) Shared tasks on Indian language event extraction, to be run at FIRE 2017 and 2018, which will make resources available and encourage research in Indian language (IL) text mining.
c) Algorithms and modules for search, recommendation and (timeline) summarization will be made available for the individual languages, along with the cross-lingual retrieval modules. The platform will be fully open and designed so that new modules can be plugged in easily and so that it can be extended to different languages, input media and domains.

Scientific Output

The project involves state-of-the-art research in multilingual and low-resource natural language processing, event extraction, event linking, and multi-document and multilingual aggregation.
1. The project will lead to the development of algorithms and resources for event extraction in Indian languages. This will involve developing methods that apply to any low-resource language. Multilingual and transfer methods will be developed that extend the state of the art.
2. Multi-document and multi-lingual aggregation of frame elements poses interesting challenges that will be worked on in this project.
3. The project will focus on generation of event information in multiple Indian languages that will involve algorithms for natural language generation. The project will also make contributions to cross-lingual and multilingual event-specific summarization.

Results and outcome till date

A basic system has been developed that crawls news sources and indexes them. A frontend accepts user input as free text, through advanced search or through a map interface, and presents the results as a list and on the map, with links to the output documents. The online system includes basic query processing, indexing in a search engine and ranking.
As part of the project we have created an ontology of disaster and health events. The disaster ontology has 8 broad event types, 26 event types in total and 13 frame elements. A detailed annotation guideline for the disaster domain has been developed. We have also developed ontologies of 38 event types in Finance and 21 event types in Health. A multilingual annotation tool has been developed, and annotated resources have been created for 5 languages in the disaster domain.
The offline system includes domain identification and event extraction. We have worked on event trigger identification and classification, time and location identification, and argument extraction and classification for monolingual documents in 5 Indian languages. Supervised ML-based methods have been developed for sentence-wise event trigger and event type identification, argument extraction and argument linking.
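To make the notions of event trigger, event type and frame element mentioned above more concrete, the following is a minimal sketch in Python of how one extracted event might be represented. The class names, the role labels ("Time", "Place") and the toy trigger lexicon are hypothetical illustrations and are not taken from the project; in particular, the lexicon lookup is only a stand-in for the supervised ML-based trigger classifier that the project reports.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    role: str  # frame element name, e.g. "Time" or "Place" (hypothetical labels)
    text: str  # text span copied from the sentence


@dataclass
class EventFrame:
    event_type: str  # hypothetical event type label
    trigger: str     # word in the sentence that signals the event
    arguments: List[Argument] = field(default_factory=list)

    def get(self, role: str) -> List[str]:
        """Return the text of every argument filling the given role."""
        return [a.text for a in self.arguments if a.role == role]


# Toy lexicon standing in for a trained trigger classifier (assumption).
TRIGGER_LEXICON = {
    "earthquake": "Earthquake",
    "flood": "Flood",
    "cyclone": "Cyclone",
}


def find_triggers(sentence: str) -> List[EventFrame]:
    """Create one empty frame per lexicon word found in the sentence."""
    cleaned = sentence.lower().replace(",", " ").replace(".", " ")
    frames = []
    for token in cleaned.split():
        event_type = TRIGGER_LEXICON.get(token)
        if event_type is not None:
            frames.append(EventFrame(event_type=event_type, trigger=token))
    return frames


frames = find_triggers("A strong earthquake struck the region on Monday.")
# Arguments are attached by hand here; a real system would extract them.
frames[0].arguments.append(Argument(role="Time", text="Monday"))
frames[0].arguments.append(Argument(role="Place", text="the region"))
```

Keeping arguments as a list of role and text pairs, rather than fixed fields, lets the same structure hold whichever frame elements an ontology defines for a domain.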

Societal benefit and impact anticipated

The project aims to develop resources and competency in information extraction, event extraction and event aggregation for Indian languages, and will contribute to Indian language text mining systems.

Next steps

1. Identification of stakeholders and use cases.
2. Event linking across sentences, linking of frames with events.
3. Document level event aggregation.
4. Increase in precision and recall of the system based on a hybrid of ML-based, knowledge-based and multilingual methods.
5. Improvement of systems for argument extraction and linking.
6. Multi-document aggregation of events.
7. Multilingual event aggregation.
8. Natural language generation.
9. Event specific summarization.
10. Extension to social media.

Publications and reports

1. Z. Ahmad, S. Sahoo, A. Ekbal and P. Bhattacharyya (2018). A Deep Learning Model for Event Extraction and Classification in Hindi for Disaster Domain. In Proceedings of the 15th International Conference on Natural Language Processing (ICON). Accepted.
2. S. Kamila, A. Ekbal and P. Bhattacharyya (2018). Sentence Level Temporality Detection using an Implicit Time-sensed Resource. In Proceedings of LREC-2018, pp. 325-331, May 7-12, 2018, Japan.
3. Aipe, Mukuntha, N S, A. Ekbal and S. Kurohashi (2018). Deep Learning Approach towards Multi-label Classification of Crisis Related Tweets. In Proceedings of the 15th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 20-23, 2018, Rochester, USA.
4. Gupta, A. Ekbal, S. Saha and P. Bhattacharyya (2018). MMQA: A Multi-domain Multi-lingual Question-Answering Framework for English and Hindi. In Proceedings of LREC-2018, pp. 2777-2784, May 7-12, 2018, Japan.
5. Gupta, A. Ekbal, S. Saha and P. Bhattacharyya (2018). A Deep Neural Network based Approach for Entity Extraction in Code-Mixed Indian Social Media Text. In Proceedings of LREC-2018, pp. 1762-1767, May 7-12, 2018, Japan.
6. Alapan Kuila, Sudeshna Sarkar: An Event Extraction System via Neural Networks. FIRE (Working Notes) 2017: 136-139
7. Pattabhi RK Rao and Sobha Lalitha Devi, (2018), "Enhancing Multi-Document Summarization using Concepts", Sadhana - Academy Proceedings in Engineering Sciences. vol 43:2 (27) . https://doi.org/10.1007/s12046-018-0789-y
8. Pattabhi RK Rao and Sobha Lalitha Devi. (2017). "Patent Document Summarization Using Conceptual Graphs", International Journal on Natural Language Computing (IJNLC), vol. 6(3).
9. Sindhuja Gopalan and Sobha Lalitha Devi. (2017). "BNEMiner: mining biomedical literature for extraction of biological target, disease and chemical entities". International Journal of Business Intelligence and Data Mining. 11, 2 (January 2017), 190-204. DOI: https://doi.org/10.1504/IJBIDM.2016.081612
10. Vijay Sundar Ram and Sobha Lalitha Devi. (2018). "A Semi-automated Annotation of Co-reference Chains in Tamil", In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 18 to 24, 2018, Hanoi, Vietnam
11. Malarkodi C.S. and Sobha Lalitha Devi. (2018). "Twitter Named Entity Recognition for Indian Languages", In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 18 to 24, 2018, Hanoi, Vietnam
12. Vijay Sundar Ram and Sobha Lalitha Devi. (2017). "A Robust Coreference chain Builder for Tamil", In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing, April 17 to 23, 2017, Budapest, Hungary
13. Sindhuja Gopalan and Sobha Lalitha Devi. (2017). "Cause and Effect Extraction from Biomedical Corpus", In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing, April 17 to 23, 2017, Budapest, Hungary

Scholars and Project Staff

IIT Kharagpur:
SRF: Aniruddha Roy (3 years)
JRF: Debanjana Kar, MS student (3 years)
JRF: Pratik Karia, MS student (3 years)
JRF: Sarath Chandra Bussa, MS student (3 years)

Language Annotators (Three) : Rupak ishra, Sonali Samanta, Arnab Sadhukhan

IIT Patna:
SRF (2): Sovan Kumar Sahoo, Zishan Ahmad
JRF (1): Deeksha Varshney
Lexicographers (2): Monalisa Bhattacharjee, Jaya Jha
MTech student: Saumajit Saha

AUKBC:

Ph D Scholars:
1. Pattabhi R. K Rao
2. Vijay Sundar Ram
3. Malarkodi. C. S

Project Associates
1. Parimala. N.H
2. Padmapriya. N
3. Suji. A

DAIICT:
Project Staff (4):
Parth Mehta
Jainisha Sankhvara
Palak Thakur
Surupendu Gangopadhyay

IIT Bombay:
Saheli Khare

Challenges faced

Suitable computation servers need to be procured to host the initial system and to support large-scale computation for crawling, offline training and processing, and real-time online output presentation. Major equipment purchases have been deferred until funds from the partner ministry are received, so that the salary commitments of project staff can be met.

Financial Information

  • Total sanction: Rs. 39283200

  • Amount received: Rs. 13344000

  • Amount utilised for Equipment: Rs. 454031

  • Amount utilised for Manpower: Rs. 4778942

  • Amount utilised for Consumables: Rs. 52550

  • Amount utilised for Contingency: Rs. 272147

  • Amount utilised for Travel: Rs. 337953

  • Amount utilised for Other Expenses: Rs. 83869

  • Amount utilised for Overheads: Rs. 2177943
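
The utilisation figures listed above can be tallied against the amount received. The following is a minimal sketch in Python; the variable names are illustrative, the figures are copied from the list, and the Other Expenses figure is assumed to be in rupees like the rest.

```python
# Figures in rupees, copied from the list above.
amount_received = 13_344_000

utilised = {
    "Equipment": 454_031,
    "Manpower": 4_778_942,
    "Consumables": 52_550,
    "Contingency": 272_147,
    "Travel": 337_953,
    "Other Expenses": 83_869,  # assumed to be in rupees
    "Overheads": 2_177_943,
}

# Sum of all utilisation heads, and what remains of the amount received.
total_utilised = sum(utilised.values())
balance = amount_received - total_utilised

print(f"Total utilised: Rs. {total_utilised}")   # Rs. 8157435
print(f"Balance of amount received: Rs. {balance}")  # Rs. 5186565
```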

Equipment and facilities

 

Computers and accessories