
Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff's (2001) Approach.

Scientific Publication

Report Number:
DSTO-TR-2290
Authors:
Parsons, K.; McCormac, A.; Butavicius, M.
Issue Date:
2009-04
AR Number:
AR-014-529
Classification:
Unclassified
Report Type:
Technical Report
Division:
Command, Control, Communication and Intelligence Division (C3ID)
Release Authority:
Chief, Command, Control, Communication and Intelligence Division
Task Sponsor:
Intelligence
Task Number:
INT 007/020
File Number:
2008/1147346
Pages:
54
References:
17
Terms:
Human performance; Algorithms; Empirical methods
URI:
http://hdl.handle.net/1947/9990

Abstract

There is a distinct lack of tools that provide a comprehensive measure of the similarity between corpora. Finding similar corpora is necessary for the design of certain user studies investigating text processing. It is also useful for ensuring comparability between studies on document analysis conducted across classified and unclassified domains. In this study, human judgements of corpora similarity were obtained as a gold standard. These were then compared to the values provided by Kilgarriff’s (2001) chi-square (Χ²) statistic. The findings indicated a high level of agreement between the participants, with 77% shared variance in overall similarity judgements. The results of the Χ² measure also correlated well with the human results, with a correlation of approximately 0.66. Although there are complexities associated with the Χ² technique that need to be examined in further research, this study provides extremely promising results, suggesting that a statistical technique could provide results that are comparable to human judgements.

Executive Summary

A corpus is a collection of written or spoken material. Corpora are a vital resource in fields such as information retrieval, machine translation and natural language processing. Corpora vary considerably, and knowledge of their similarities and differences is particularly important. For instance, a measure of similarity is necessary to determine whether findings from one corpus are applicable to other corpora when assessing document processing tools and human-user interaction abilities. There is a distinct lack of tools that provide corpora comparisons, and the tools that do exist tend to provide a single value, which does not necessarily reflect the complexity associated with a collection of text. For example, corpora could be extremely similar in content but quite different in structure or language use. Without information about the dimension of similarity being measured, the value of any corpora comparison score is limited.

In this study, seventeen corpora were used, and two random samples were taken from each corpus. Human judgements of corpora similarity were obtained and then compared to the values provided by a statistical technique. The aims of the study were to (1) obtain comprehensive human judgements of corpora similarity to act as a gold standard, (2) compare the judgements obtained by different individuals, and (3) compare human judgements with those provided by a statistical technique. The human judgements were made on a number of dimensions of similarity. The correlations between the participants’ scores were extremely high, with an overall correlation of 0.88, indicating 77% shared variance (0.88² ≈ 0.77). However, when participants’ scores were assessed according to the various dimensions and corpora categories, there was far more variation. This indicates that corpora comparison is influenced by subjectivity and individual differences.

Kilgarriff’s (2001) Χ² statistic is a word-frequency based measure that involves a statistical analysis of the most frequent words in a pair of corpora. The frequency of each word on this list is compared across both corpora, to examine the discrepancy between the observed frequency and the frequency expected if the corpora were derived from the same underlying body of text. This technique was used to determine the similarity between the corpora, and the results were then compared to the human judgements. The correlations between the chi-square results and the participants’ ratings of similarity were high, with an overall correlation of approximately 0.66. This is extremely promising, as it suggests that a statistical technique may provide results that are comparable to human judgements. However, the strength of the correlations varied widely across the corpora pairs within the various categories, so the chi-square technique may be more effective for certain types of corpora. Furthermore, there are a number of complexities associated with the chi-square technique, which could limit the generalisability of these results. For example, it is unclear whether there is an optimal number of words for the word frequency lists or an optimal size for the corpora samples.
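The word-frequency comparison described above can be illustrated with a minimal sketch. It is not the implementation used in this study. It assumes each corpus sample is already tokenised into a list of words, and the function name `chi_square_distance` and the parameter `top_n` are hypothetical. The default of 500 words is an arbitrary choice, since the summary notes that the optimal list length is unclear. Lower values indicate more similar corpora.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the joint corpus.

    Illustrative sketch only: tokenisation, top_n and any normalisation
    are assumptions, not details taken from the report.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one token")

    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    joint = counts_a + counts_b  # word frequencies in the combined corpus
    total = size_a + size_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both samples came from the same body of text:
        # the joint count is shared out in proportion to corpus size.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2


if __name__ == "__main__":
    sample_1 = "the cat sat on the mat".split()
    sample_2 = "a dog ran in a park".split()
    print(chi_square_distance(sample_1, sample_1))  # identical samples
    print(chi_square_distance(sample_1, sample_2))  # no shared vocabulary
```

Because the statistic grows with sample size and with the number of words compared, values from this sketch are only comparable between corpus pairs of equal sample size evaluated with the same `top_n`.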
