Ogmios: a scalable NLP platform for annotating large web document collections
Although NLP tools are now widely available, using them remains problematic: their input and output formats are heterogeneous, the granularity of the information they provide varies, they are hard to tune to a domain, and large amounts of documents are difficult to process in a reasonable time. To address these problems, we propose a configurable platform that combines NLP tools to enrich very large collections of French and English specialised documents. The platform is a modular and tunable framework. Each module carries out one annotation step (named entity recognition, sentence and word segmentation, lemmatisation, POS tagging, term tagging or parsing) using existing NLP tools, and can be tuned to a domain by adding specific resources. Linguistic annotations are recorded in a stand-off XML format. To manage very large collections of documents, we focus on the robustness of the annotation process and distribute it over several machines.
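The modular design and the stand-off annotations described above can be illustrated with a minimal sketch. It is not the platform's implementation: the `Document` class, the two toy modules, the `PIPELINE` list and the XML element and attribute names are all hypothetical, and the regular expressions stand in for the real NLP tools that each module would wrap.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Stand-off annotations: (layer, start, end) character offsets into text.
    annotations: list = field(default_factory=list)


def segment_sentences(doc):
    # Toy splitter (assumption): a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", doc.text):
        doc.annotations.append(("sentence", m.start(), m.end()))


def segment_words(doc):
    # Toy tokenizer (assumption): a word is a maximal run of word characters.
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(("word", m.start(), m.end()))


# The pipeline is an ordered list, so steps can be added, removed or swapped.
PIPELINE = [segment_sentences, segment_words]


def annotate(doc, pipeline):
    # Run each module in turn; every module only appends annotations.
    for module in pipeline:
        module(doc)
    return doc


def to_standoff_xml(doc):
    # The text itself is not copied: annotations only point into it by offset.
    root = ET.Element("document", id=doc.doc_id)
    for layer, start, end in doc.annotations:
        ET.SubElement(root, "annotation", layer=layer, start=str(start), end=str(end))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = annotate(Document("d1", "Ogmios annotates documents. It scales."), PIPELINE)
    print(to_standoff_xml(doc))
```

Because each document is annotated independently of the others in this sketch, a call such as `annotate(doc, PIPELINE)` is the natural unit of work to hand out to separate machines.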
In the ALVIS project (www.alvis.info/alvis), we tested the scalability of the platform with 20 computers on two collections: 55,329 biomedical web documents (107 million words) and 47,393 Search Engine News documents (13 million words). The collections were annotated up to the term tagging step in 35 hours and 3 hours respectively.
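The reported figures can be turned into rough throughput estimates, as in the short calculation below. The per-machine values rest on an assumption not stated in the text, namely that all 20 computers were used for both collections and shared the load evenly. The function and variable names are illustrative only.

```python
MACHINES = 20

# name: (documents, words, hours), as reported for the two collections.
COLLECTIONS = {
    "biomedical web documents": (55_329, 107_000_000, 35),
    "Search Engine News": (47_393, 13_000_000, 3),
}


def throughput(words, hours, machines):
    # Returns (words per hour overall, words per hour per machine).
    per_hour = words / hours
    # Assumption: the load is spread evenly over all machines.
    return per_hour, per_hour / machines


if __name__ == "__main__":
    for name, (docs, words, hours) in COLLECTIONS.items():
        total, per_machine = throughput(words, hours, MACHINES)
        print(f"{name}: {total:,.0f} words/hour overall, "
              f"{per_machine:,.0f} words/hour per machine, "
              f"{docs / hours:,.0f} documents/hour")
```

Under these assumptions, the biomedical collection was processed at roughly 3.1 million words per hour and the news collection at roughly 4.3 million words per hour.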
