Archiving unanalysed texts

Originally published 5th November 2013 on the previous and discontinued ELAR Blog, and retrieved from the Internet Archive on 9th December 2012. I have added a new Introduction, and reproduced the original post with updates to the text, references, and links, and an adapted re-presentation of a user-generated comment.


In a recent post on this blog, I discussed the amount of time that can be taken up with the daunting tasks of transcribing, translating, and annotating recorded audio and/or video materials. Such work is often seen as required for language documenters, especially when they wish to archive their data and analysis for use by other present or future researchers or community members. Indeed, Himmelmann (2012: 193) suggests that recordings without any transcription ‘are rarely used directly as the basis for further research’ because the information they contain is ‘too much and too complex’. As Dobrin & Schwartz (2020) argue:

To ‘make the data compiled in a documentation accessible’, transcriptions and translations are necessary (Himmelmann 2012: 204). Nathan & Austin (2004: 184) warn against ‘a future of wading in digital quicksand—a rapidly expanding mass of digitized sound, image and video, with no way to get a foothold’. Transcripts provide probably the single most important such foothold. For this reason, ‘it is standard practice to work with a transcript of the recording which, ideally, contains all and only the [relevant] aspects of the recorded event’ (Himmelmann 2012: 193).

In my blog post from 2013, I presented a case study involving an untranscribed archived recording in the Diyari language, and discussed how the content of that recording was useful and valuable ‘as is’ for language learning materials aimed at language revitalisation within the Dieri community. The original post follows.


  • Dobrin, Lise & Saul Schwartz. 2020. The social lives of linguistic legacy materials. University of Virginia. MS.
  • Himmelmann, Nikolaus. 2012. Linguistic data types and the interface between language documentation and description. Language Documentation & Conservation 6, 187-207.
  • Nathan, David & Peter K. Austin. 2004. Reconceiving metadata: Language documentation through thick and thin. Language Documentation and Description 2, 179-188.

Archiving unanalysed texts can be a good thing

Professor Paul Newman, in his rather provocatively named seminar presented on 15th October 2013 at SOAS (The Law of Unintended Consequences: How the Endangered Languages Movement Undermines Field Linguistics as a Scientific Enterprise), argues that linguists should not spend too much of their time collecting texts in the languages they are studying. Indeed, such text collecting can distract them from the ‘proper scientific work’ they should be engaged in. He puts his point of view this way (my transcription of the online video recording from 23:49 to 24:40):

The problem with texts is that people collect too many. There is a misguided emphasis on collecting more and more and more texts … far beyond what you would have the time to transcribe and analyse properly. That is, if you do a cost-benefit analysis of doing texts what you find is text collection drains time and energy from serious research. Now texts do have a value. But where texts are most useful is not something to be stored away never to be used by anyone else, but as a guide to analysis and research while you are in the field. For this you only need a couple of good texts.

In the question-answer period after the talk (at around 1:02:57 in the video recording) I commented from the audience that this perspective ignores a whole potential (current and future) audience for language documentation materials, namely the speaker (and language learner) communities, who do not necessarily need transcription and analysis but may be well served by the content of ‘unprocessed’ texts. Newman’s definition of serious research as ‘scientific’ in a narrow sense of being concerned with structural linguistic analysis leaves these constituencies out in the cold.

I agree with Prof Newman that it would not be a good idea to create ‘something … stored away never to be used by anyone’ (in a true language graveyard). However, I do believe that accessible archived materials that have at least minimal metadocumentation associated with them (like the identity of the language and speaker of a recording) can still be extremely valuable, even in the absence of any other linguistic processing. I have a case in mind which illustrates the point perfectly.

In 1959 at Hermannsburg in the Northern Territory, the late Kenneth Hale happened to meet Johannes, a speaker of the Diyari language, traditionally spoken around Cooper Creek in South Australia over 800km south-east of Hermannsburg. Johannes had been born at Killalpaninna Lutheran mission in the latter half of the 19th century and had moved to Hermannsburg after the mission at Killalpaninna was closed by the South Australian government in 1915. Ken Hale collected 66 pages of fieldnotes and tape-recorded a fluent biographical narrative text from Johannes (a copy of which is on archive tape A4604a in the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) in Canberra) but he did not transcribe or analyse it, or elicit an English translation. The recording remained unused (as far as I know) for 54 years until I employed it to prepare Yawarra pinarru Johannesaya, the 50th in a series of blog posts on the Diyari language. The blog is a graduated set of multimedia materials that explore issues in Diyari language, culture, and history, and the posts are pedagogically organised as a sequence that builds on previously presented materials. The first one and a half minutes of Hale’s recording serve, I would suggest, as a perfect addition to the previous 49 posts so that readers who have studied them carefully should be able to understand and relate to the content of Johannes’ story.

I believe this example shows that we can only be grateful that Ken Hale spent his precious time all those years ago making this unique text recording, even though he was not doing ‘proper scientific research’ on Diyari in detail (nor did he transcribe the recording), and that it has been preserved for modern-day Dieri community members and language learners to enjoy and benefit from.


Hugh Paterson III noted on 7th November 2013:

It would be a great shame if a corpus had too many texts. These texts might show various levels of variation within the language speaking community – variation at the individual speaker level, variation at the discourse level, variation in syntactic constructions, variation in the expression of ‘speaker mood’ or ‘stance’. I must agree that ‘a few good texts’ will keep formal grammars concise and manageable in size. Thereby allowing the linguist to continue to make the same impact they always have. 

In all seriousness, why don’t we see more emphasis on corpus development as a social process? Wikipedia didn’t exist as a corpus 20 years ago, and now it is one of the largest freely available corpora in the world. I am not saying that Wikipedia is everything, only that perhaps there is an interactive element which is available in Wikipedia which is not evident or available in current field methods for linguistics and language documentation. Maybe the social pressures to develop tools and workflows which allow for collaborative corpus annotation and corpus building don’t exist because as a language documentation enterprise we are focusing on languages which don’t have a large enough user-base to produce annotated corpora. If the language documentation industry did focus on some of the lesser-described but mid-sized languages, but focused on them in such a manner as to produce tools which allowed for the collaborative corpus to be produced, then these tools could make it into workflows being used to document moribund languages. And perhaps in 10 years’ time we will see a higher percentage of annotation among our archived corpora.