The Use of Systemic-Functional Linguistics in Automated Text Mining.
Scientific Publication
- Report Number: DSTO-RR-0339
- Authors: Kappagoda, A.
- Issue Date: 2009-03
- AR Number: AR-014-419
- Classification: Unclassified
- Report Type: Research Report
- Division: Command, Control, Communication and Intelligence Division (C3ID)
- Release Authority: Chief, Command, Control, Communication and Intelligence Division
- Task Sponsor: ASCP; EXEC DIR CTSTC
- Task Number: INT 07/020
- File Number: 2009/1016253/1
- Pages: 82
- References: 38
- Terms: Information extraction; Machine learning
- URI: http://hdl.handle.net/1947/9900
Abstract
Systemic-functional linguistics is a linguistic framework for analysing grammatical and semantic information in text, and it has a potential role in automated text mining. This report outlines the essential features of the theory, its application in computational work, and the rationale for its use in automated text mining. It then develops a grammatical annotation scheme, word functions, to enrich a mixed text corpus of newspaper articles and e-mails for machine learning of semantically oriented grammatical patterns. Testing demonstrates high accuracy in predicting word functions in unseen text when co-trained with other grammatical information, which provides a basis for further grammatical and semantic text processing.
Executive Summary
Using grammatical and semantic patterns as the basis for large-scale text processing has wide potential to improve the quality and speed of information management and analytical tasks in the defence and intelligence domains. It is proposed that a robust linguistic model is needed to support the automation of these tasks, which is achieved by co-training semantic and grammatical information with unstructured text, and that systemic-functional linguistics (SFL) provides a prime means of achieving this. SFL is a linguistic theory that has had a substantial presence in natural language processing work for the past 40 years, with recent developments in both rule-based and machine learning (ML)-based text processing. An outline of the theoretical apparatus of SFL is presented, focusing on a detailed treatment of the functional structure of word groups and phrases. This outline is used to derive a grammatical annotation scheme (WFG) for labelling the functions of single tokens in unstructured text. A justification for using this scheme is presented, and a method is outlined for preprocessing unstructured text and annotating it with the WFG scheme, in order to produce training and testing corpora for an ML system employing the conditional random fields (CRF) algorithm. This system demonstrates that automated WFG annotation can be achieved with high accuracy, and that such labelling supports the automated recognition of other grammatical information, such as chunk labels. It is proposed that WFG annotation provides a robust, semantically oriented foundation for other kinds of semantically based text processing, such as information extraction and text categorisation, which are important elements of information management in defence and intelligence tasks.
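As a rough illustration of the token-level labelling task described above, the following Python sketch shows how a preprocessed sentence might be turned into per-token feature dictionaries of the kind commonly supplied to a linear-chain CRF, and how token-level accuracy could be scored. It is not the report's implementation: the function names, the feature set, the part-of-speech tags and the label names (Deictic, Numerative, Epithet, Classifier, Thing, borrowed from the SFL description of the nominal group) are all assumptions made for illustration, and the actual WFG tag set is not reproduced here.

```python
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for token i from the word, its POS tag and its neighbours."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:].lower(),
        "pos": pos_tags[i],
    }
    if i > 0:
        # Context from the preceding token.
        feats["-1:word.lower"] = tokens[i - 1].lower()
        feats["-1:pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        # Context from the following token.
        feats["+1:word.lower"] = tokens[i + 1].lower()
        feats["+1:pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], pos_tags: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical example: one nominal group with illustrative labels.
tokens = ["the", "two", "old", "steam", "trains"]
pos_tags = ["DT", "CD", "JJ", "NN", "NNS"]
gold = ["Deictic", "Numerative", "Epithet", "Classifier", "Thing"]
predicted = ["Deictic", "Numerative", "Epithet", "Thing", "Thing"]  # made-up system output

features = sentence_features(tokens, pos_tags)
accuracy = token_accuracy(gold, predicted)
print(features[3])
print(f"token accuracy: {accuracy:.2f}")
```

In a setup like this, the feature dictionaries and gold labels for each sentence would be passed to a CRF trainer, and the same feature function would be applied to unseen text at prediction time. Including the part-of-speech tag as a feature is one simple way for other grammatical information to inform the word-function prediction.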
