Replicability slippage

THIS IS AN UPDATED VERSION OF A PREVIOUSLY PUBLISHED BLOG POST. It was originally published 7th September 2008 on the Endangered Languages and Cultures blog, and is published here with a new introduction, and updates to the original text and links. Comments on the original post appear at the bottom.1

Introduction

There has been quite a bit of discussion recently about the notion of ‘replicability’ in linguistics research, especially as it applies to documentation and description of lesser-known languages (see Andreassen et al. 2018, 2019; Berez-Kroeker et al. 2018; Gawne et al. 2017; Gawne & Berez-Kroeker 2018). The main idea is that researchers should archive, and grant access to, their audio-visual recordings and associated analyses such as transcriptions and translations, so that others can verify claims made in their publications against these sources (sometimes misleadingly called ‘primary data’). Gawne et al. (2017: 177) indeed argue in favour of the development of citation practices ‘that will allow the reader to not only locate the larger data set of recordings and/or fieldnotes in an archive, but also to resolve back to the particular datum within the set’. Concern for veracity (or ‘falsifiability’) in linguistic research is not particularly new,2 and in this 2008 post I argue: (a) that there can be good reasons why published materials, especially text collections, may diverge from the recordings in a corpus; and (b) in favour of a method of transcriptional representation that draws on insights from epigraphy and enables researchers to be at least a little more explicit about how and why they transcribe and publish in the way they do. Little seems to have been written about transcription methods in particular in the intervening 13 years (in contrast to transcription software tools),3 and, it seems to me, the way replicability within language documentation should be viewed in practice still awaits proper theorisation and implementation.

Glossed texts — the fiddle factor (edited)

In a blog post published on 2nd September 2008, Jane Simpson reported on opinions expressed by a group at the Australian National University, meeting to discuss grammar writing:

One thing that grammars have over most dictionaries however, is the notion of publishing an accompanying set of texts. Falsifiability has traditionally been more of a concern for grammarians than for lexicographers. We all agree it’s a good thing to publish glossed texts so that readers can check out the hypotheses proposed in the grammar, and expressed by the glossing. The classic example is Jeffrey Heath’s careful analysis of R. M. W. Dixon’s Dyirbal texts (Heath, Jeffrey. 1979. Is Dyirbal ergative? Linguistics 17, 401-463) to argue against Dixon’s claim about Dyirbal being syntactically Ergative.

I’d like to inject a note of caution here. It seems to me that published texts, with or without interlinear glossing, and especially those that derive from transcriptions of spoken language, have often been fiddled with (or, to put it more politely, ‘edited’) on their way from recording to printed page. This is also often true of published texts that are based on written originals produced by literate native speakers. It is rarely the case that, as Wamut commented about Jeffrey Heath’s work on Ngandi at the end of Jane’s blog post:

What is especially great, is that when you go back to Heath’s archived field recordings, the spoken texts are there in pristine form, that is, the spoken text and written text CORRELATE PERFECTLY [emphasis added]

Heath adopted the same principle of “perfect correlation” in his published work on other languages, such as Nunggubuyu Myths and Ethnographic Texts (Heath 1980), which clearly states in the introduction:

in the texts presented here I have not ‘weeded out’ false starts, intrusive English words, or grammatical errors by the narrators

In many other cases of text publication, I know editing has taken place — I have done it myself, and some other researchers have admitted to it (though rarely indicating exactly what editorial changes were made — more on this below). The materials in my Texts in the Mantharta Languages, Western Australia (Austin 1997) were quite heavily edited in places, though I didn’t mention that in print at the time, and it was only when it came to creating a multimedia Jiwarli website, where both published texts and original recordings were presented, that I had to confess:

[y]ou may also notice that the Jiwarli texts are not word for word identical to the sound files, as Jack Butler, after recording the stories, made his own corrections in the texts.

There was no attempt to deceive here; rather, it was Jack’s explicit wish that the stories be edited for publication. As an example, consider published Text 50 (which appears on the website; a re-presentation of it with a different spelling system is also available) and the way it corresponds to the original recording. In the marked-up version below, strikethrough indicates material in the recording which was deleted in the editing process; bold indicates text added during editing; within a single { … } string, bold plus strikethrough indicates transposition of an item; and { x <=> y } indicates substitution of y for x during editing:

Nhukuramartuthu ngurrunyjarri julyumartu ngunha nhanyaartu {porcupinemanha <=> jiriparrinha} puniyanha. {Porcupine <=> Jiriparri} ngunha jakuparlarrirarru. Ngurntirarri jakuparlarru parnajipi{thu} ngunha warrirru nhanyapuka. Ngurrunyjarrilu yarnararnilaartu ngurntapuka ngunha{pa} jakuparla. Wangkirarringu. Yarnararrima nhurra. {Yarnararrima nhurra}. Ngatha {nhurranha murrurrpa manara nhurranha}. {Yarnararrima. Ngatha murrurrpa manara nhurranha. Yarnararrima. Ngatha murrurpa manara nhurranha.} Kunyarnurru ngunha kumpanhu. {Porcupinemanha <=> Jiriparri} ngunha kurlkanyunthurru yarnararrira. When he Yarnararrira{thu} parnarru thangkalpuka wurungku wirntupinyangurru pirrurru yanararri thikaru.

Translation:

The knowledgeable grey-haired old men used to see Echidna going along. Echidna would curl up now. He would lie curled up and you can’t see his head. The old men used to turn him over onto his back so he would lie there curled up. They would say: “Lie on your back. I’ll get you cicatrices. I’ll get you cicatrices”. They would tell him lies now. Echidna would open up and lie on his back pleased. He would lie on his back and then they would hit him on the head with a stick, and kill him to go and eat the meat.

Editorial changes that Jack and I made are the following:

  • replacement of the loan word ‘porcupine’ with the indigenous word jiriparri, and deletion of the English expression ‘when he’
  • omission of the enclitics: -thu ‘old information’, -pa ‘specific referent’ in order to decontextualise reference
  • omission of repetition of three instances of ‘Lie on your back. I’ll get you cicatrices’
  • reordering of constituents: the possessor ‘your’ and ‘cicatrices’ are separated on the tape but were made adjacent in the editing for publication
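The substitution notation used above is regular enough to be read mechanically. As a minimal sketch in Python (the function names are my own illustration, not part of any existing tool), the following pulls the recorded and the published reading out of a string that uses { x <=> y }. It deliberately leaves plain { … } strings untouched, since telling deletion, addition, and transposition apart depends on the strikethrough and bold formatting rather than on the brackets alone.

```python
import re

# Matches the substitution notation "{ x <=> y }"; plain "{ ... }" strings
# without "<=>" are not matched and are passed through unchanged.
SUBSTITUTION = re.compile(r"\{\s*([^{}]*?)\s*<=>\s*([^{}]*?)\s*\}")


def recorded_reading(marked_up: str) -> str:
    """Keep the left-hand item: what is heard on the recording."""
    return SUBSTITUTION.sub(lambda m: m.group(1), marked_up)


def published_reading(marked_up: str) -> str:
    """Keep the right-hand item: what appears in the published text."""
    return SUBSTITUTION.sub(lambda m: m.group(2), marked_up)


def substitutions(marked_up: str) -> list[tuple[str, str]]:
    """List every (recorded, published) pair, in order of occurrence."""
    return [(m.group(1), m.group(2)) for m in SUBSTITUTION.finditer(marked_up)]


# A sentence taken from the marked-up Text 50 above.
sample = "{Porcupine <=> Jiriparri} ngunha jakuparlarrirarru."

print(recorded_reading(sample))
print(published_reading(sample))
print(substitutions(sample))
```

Even something this simple makes the point that an explicit notation lets a reader recover both versions, and a list of what changed, from a single document.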

Wamut also mentions in his comment on Jane’s post another possible way in which published texts can differ from recordings:

“I’ve heard other spoken texts vary from the published text because the field worker has interrupted the speaker for clarification etc.”

There are also cases I know of where speakers “interrupt” themselves. My colleague David Nathan tells me that when he was working with the late Luise Hercus to produce a multimedia CD-ROM of Baagandji materials, he found Luise’s audio recordings of stories also contained interpolations and explanations in English by the speaker which do not appear in the published texts (see Nathan 2016 for discussion).

I think descriptive linguists and language documenters could well take some guidance in this area from the work of epigraphers who have been developing a TEI/XML markup for epigraphy called EpiDoc. Some of the EpiDoc proposals are concerned with adaptation of the TEI guidelines to deal with a range of issues such as legibility of characters on stone, missing elements or partially represented signs, but in addition there are several issues they address that I think should equally be of concern to language documentation:

  • additions and deletions to the text
  • editorial supplements, observations, and hypotheses, including:
    • identification and expansion of abbreviations understood by the editor
    • identification of abbreviations not understood by the editor
    • editorial supplement in which the editor makes a “subaudible” word manifest
    • editorial supplement in which the editor explains a “breviatio” or note
    • editorial supplement for characters wholly lost
    • letters omitted because the stonecutter did not carry out the text to the end
  • editorial corrections
    • letters erroneously included in the text, which the editor suppresses
    • letters erroneously omitted from the text, which the editor adds
    • letters erroneously substituted in the text, which the editor corrects

The EpiDoc guidelines contain explicit recommendations on how to encode these as markup annotations to the text, and are accompanied by a schema and custom-designed style sheets. For work on endangered languages I think there are some additional aspects that should be encoded, especially because we typically need to distinguish at least three participant roles in the process of published text creation, namely the original speaker(s), the transcriber(s), and the linguist-editor(s). We should pay attention to:

  • encoding code-switching, code-mixing and borrowing, ideally by coding for the language (or variety) of the items transcribed
  • puristic editorial amendments on the part of the transcriber
  • puristic editorial amendments on the part of the linguist-editor
  • deletions by the transcriber
  • additions by the transcriber
  • reorderings by the transcriber
  • additions and clarifications (editorial comments) by the linguist-editor
  • when the transcriber is not the originally recorded speaker we need to deal with: (1) inter-speaker variation at the dialect or idiolect level; and (2) inter-speaker variation arising from language loss, e.g. phonemic or grammatical reduction among semi-speakers in a later generation transcribing earlier recorded texts
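To give a feel for what such an encoding might look like, here is a rough sketch in Python that builds a small TEI-style fragment for part of Text 50 and then reads both versions back out of it. The element names (choice, orig, reg, surplus) and the resp and xml:lang attributes exist in the general TEI guidelines, but whether they are the right choices for this purpose, and how they relate to the specific EpiDoc recommendations, is my assumption and would need checking. The helper names and the "#transcriber" pointer are hypothetical placeholders and do not record who actually made each change.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_text(parent, text):
    """Append running text after the last child, or as the parent's own text."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def substituted(parent, recorded, published, resp, recorded_lang=None):
    """Encode a substitution: <orig> is the recording, <reg> the publication."""
    choice = ET.SubElement(parent, "choice")
    orig = ET.SubElement(choice, "orig")
    if recorded_lang is not None:
        orig.set(XML_LANG, recorded_lang)  # code the language of a loan word
    orig.text = recorded
    reg = ET.SubElement(choice, "reg", {"resp": resp})
    reg.text = published
    return choice


def omitted(parent, recorded, resp):
    """Encode material on the recording that was left out of the publication."""
    el = ET.SubElement(parent, "surplus", {"resp": resp})
    el.text = recorded
    return el


def reading(elem, published):
    """Flatten the markup into the published or the recorded reading."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            parts.append(child.find("reg" if published else "orig").text or "")
        elif child.tag == "surplus" and not published:
            parts.append(child.text or "")
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical attribution: "#transcriber" is a placeholder role pointer.
ab = ET.Element("ab")
substituted(ab, "Porcupine", "Jiriparri", resp="#transcriber", recorded_lang="en")
add_text(ab, " ngunha jakuparlarrirarru. Ngurntirarri jakuparlarru parnajipi")
omitted(ab, "thu", resp="#transcriber")
add_text(ab, " ngunha warrirru nhanyapuka.")

print(ET.tostring(ab, encoding="unicode"))
print(reading(ab, published=False))
print(reading(ab, published=True))
```

The design point is that each intervention carries both its type and the participant responsible for it, so that the recorded and the published versions can be regenerated from one source rather than maintained as two separate files.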

To my mind, it will only be when linguists make available marked-up documents encoding these aspects along with the published texts, and the original media recordings (ideally publicly available through an archive or distributed on CD or DVD along with the published texts), that we can start truly talking about ‘replicability’ or ‘falsifiability’ of grammars and other analytical claims about languages. The ‘published texts’ and archival materials (including edited transcriptions) alone are often simply not enough.

Notes

  1. The ideas presented here have been fermenting since they were first publicly presented at an ELAP Workshop at SOAS, University of London in February 2005. At the Simposio Internacional: Contacto de Lenguas y Documentación (International Symposium on Language Contact and Documentation) held in Buenos Aires in August 2008, Ulrike Mosel presented a paper entitled ‘Putting oral narratives into writing: experiences from a language documentation project in Papua New Guinea’ in which she explored the issue of editing recorded Teop texts for publication (published as Mosel 2015). She independently identified many of the same issues I outline here.
  2. So, for example, Heath (1984: 5) states that: “My concern with documentation reflects my own sad experiences as a reader of other linguists’ grammars, which have almost never provided me with the information I wanted to undertake my own (re-)analysis of the language in question. It also reflects my experience that most published grammars are based on material obtained in unreliable direct-elicitation (sentence-translation) sessions, and/or utterances which were produced by the linguist with, or without “confirmation” from a native informant. I have no confidence whatever in such data, since my own early “data” of this type often turned out to be seriously wrong. Accordingly, to other linguists who express disapproval of my emphasis on documentation I suggest that they try doing an analysis based on a comparable textual corpus and see if it doesn’t add to their understanding of their favourite language”. Thieberger (2004: 170) describes a need to “provide source information for each example sentence and text that would allow the reader to locate the example within the field recordings so as to be able to verify that the example actually did occur in the data”. Thieberger (2009: 365) turns this into a stronger moral proposal that “it is our professional responsibility to provide the data on which our claims are based. In this way we are shifting the authority of the analysis, which traditionally has been located only in the linguist’s work”. Similarly, Bender & Langendoen (2010: 20) argue for the “need to establish a culture of expecting claims to be checked against web-available data”.
  3. I have been unable to find any discussion of the importance of explicit encoding of transcriptional and analytical editing decisions among the lists of ‘best practices’ promoted, e.g. by the E-MELD School of Best Practice, despite the fact that, to me at least, such decisions play an important role in ‘practices which are intended to make digital language documentation optimally longlasting, accessible, and re-usable by other linguists and speakers’.

References

  • Andreassen, H. N., Conzett, P., De Smedt, K., Berez-Kroeker, A. & Gawne, L. 2018. Data citation and metadata standards in linguistics. Paper presented at the LDIG working session during the RDA 11th Plenary Meeting, 21-23 March 2018, Berlin, Germany. 
  • Andreassen, H. N., Berez-Kroeker, A. L., Collister, L., Conzett, P., Cox, C., De Smedt, K., McDonnell, B. & Research Data Alliance Linguistics Data Interest Group. 2019. Tromsø Recommendations for Citation of Research Data in Linguistics.
  • Austin, Peter. 1997. Texts in the Mantharta languages, Western Australia. Tokyo: ILCAA, Tokyo University of Foreign Studies.
  • Bender, Emily M. & D. Terence Langendoen. 2010. Computational linguistics in support of linguistic theory. Linguistic Issues in Language Technology 3(2), 1–31.
  • Berez-Kroeker, Andrea L., Lauren Gawne, Susan Smythe Kung, Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer, David I. Beaver, Shobhana Chelliah, Stanley Dubinsky, Richard P. Meier, Nick Thieberger, Keren Rice & Anthony C. Woodbury. 2018. Reproducible research in linguistics: A position statement on data citation and attribution in our field. Linguistics 56(1), 1–18.
  • Gawne, Lauren, Barbara F. Kelly, Andrea L. Berez-Kroeker & Tyler Heston. 2017. Putting practice into words: The state of data and methods transparency in grammatical descriptions. Language Documentation & Conservation 11, 157–189.
  • Gawne, Lauren & Andrea L. Berez-Kroeker. 2018. Reflections on reproducible research. In Bradley McDonnell, Andrea L. Berez-Kroeker & Gary Holton (eds.) Reflections on language documentation 20 years after Himmelmann 1998, 22–32. Language Documentation & Conservation Special Publication No. 15. Honolulu: University of Hawai‘i Press.
  • Heath, Jeffrey. 1980. Nunggubuyu myths and ethnographic texts. Canberra: Australian Institute of Aboriginal Studies.
  • Heath, Jeffrey. 1984. Functional grammar of Nunggubuyu. Canberra: Australian Institute of Aboriginal Studies.
  • Mosel, Ulrike. 2015. Putting oral narratives into writing: Experiences from a language documentation project in Bougainville, Papua New Guinea. In Bernard Comrie & Lucía Golluscio (eds.) Language Contact and Documentation (Contacto lingüístico y documentación), 321–342. Berlin: De Gruyter Mouton.
  • Nathan, David. 2016. ‘Don’t tell them we’re coming!’: learning to document languages with Luise Hercus. In Peter K. Austin, Harold Koch & Jane Simpson (eds.) Language, land & song: Studies in honour of Luise Hercus, 57–69. London: EL Publishing.
  • Thieberger, Nicholas. 2004. Documentation in practice: developing a linked media corpus of South Efate. Language Documentation and Description 2, 169–178.
  • Thieberger, Nick. 2009. Steps toward a grammar embedded in data. In Patricia Epps & Alexandre Arkhipov (eds.) New Challenges in Typology: Transcending the Borders and Refining the Distinctions, 389–408. New York: Mouton de Gruyter.

Comments on original post

  • Peter Austin (2008-09-16): There is some nice discussion about the material presented here over at Claire Bowern’s blog, especially in the comments section.
  • Stephen Morey (2008-09-18): Peter Austin makes a very important point. It was for this reason that I made a CD version of my PhD thesis and subsequent book on the Tai Languages of Assam. In the version of the grammar on the CD (which has exactly the same text as the printed book), linguistic examples are linked to sound files so the reader can hear the recording and judge for themselves. So far nobody has written back to me and suggested an alternative reading or analysis, but I look forward to that day.
  • Peter Austin (2008-09-20): Stephen – I haven’t seen your CD version of the grammar, but it sounds similar to Nick Thieberger’s Grammar of South Efate (and his previous PhD on the language) which was published with a CD of the sound files in the back. I was also trying to make the point that there may be good reasons to “clean up” transcriptions or written texts for publication (community preferences for editing, removal of loan words, corrections of ‘slips of the tongue’ etc) but anyone who does that should also make available annotations (e.g. in an XML version, as the epigraphers do) which explain why the published version differs from the original. That makes for good science and helps people in the future understand how published texts can differ from other potential sources of data.
  • Mark Dingemanse (2008-09-23): Interesting post. Nick Enfield’s recent Lao grammar (Mouton) has meticulously transcribed conversations (including of course all the errors, false starts, repairs, etc.) instead of the traditional texts appended. Many of the examples given in his grammar are drawn from these (and other) conversations. His motivation is worth quoting: ‘The texts supplied in this chapter illustrate the kind of discourse in which Lao grammar emerges. (…) The choice to concentrate exclusively on conversation here is a form of affirmative action. Conversation as a structured domain is under-studied in linguistics compared to research on structure in semantics and sentence-level syntax. Yet conversation is by far the dominant, unmarked genre in language usage, and in language acquisition. This chapter reverses the usual balance in the ‘texts’ section of grammars: elicited monologues, with a very occasional fragment of conversation. (…) [W]ith a large enough sample, conversation yields the full complement of a language’s structural resources, including embedded narratives, procedural descriptions, and similar genres more familiar to descriptive linguistics.’ (Enfield 2007:487)