How long is a piece of string?

This is a republished version of a previously published blog post and comments. Originally published 14 April 2010 on the Endangered Languages and Cultures blog; republished with updates to the text, references, and links, and an adapted re-presentation of user-generated comments.

Last month (March 2010) I received the following email query from a colleague:

“I am currently submitting a grant application for a small grant at the HRELP to document …. One concern I have is how many hours it will realistically take to transcribe one hour of text. I have done fieldwork in the past, but this would be the first time that I will have trained a transcriber who would work (mostly) independently. (The linguists on the project would consult with them.) I would like to give some sort of concrete number of total hours transcribed and translated (in contrast to fully annotated).”

Since this is an issue I have been asked about several times [and continue to see raised in language documentation discussions – PKA, August 2020], I present here an elaborated version of what I wrote back to my correspondent. (Here I use ‘source language’ to refer to the language of the recording, and ‘target language’ to refer to the language of a translation of the recording. I restrict my remarks to transcription of spoken languages.)

I wrote back: The answer to your questions is kind of like the answer to the question: ‘How long is a piece of string?’ There are so many variables:

  1. the quality and listenability of the recording, often determined by the type and quality of the microphone used and where it was located, including how far it was placed from the speaker(s). A low-quality, muffled recording with high background noise will take much longer to transcribe. For video, quality and viewability are also affected by lighting and framing, which can influence transcription time;
  2. how many languages/varieties are represented in the recording (is it monolingual or multilingual) and what languages these are;
  3. the transcriber’s familiarity with and fluency in the source language(s) (including, if they are a native speaker, whether they speak the same variety as the interviewees);
  4. whether the transcriber can work alone or needs to work together with someone else (the interviewee or another speaker) to listen to or watch the recording and have it repeated back (possibly at a slower rate) for transcription. Some transcribers do a ‘first pass’ rough transcription that is then checked with another person to arrive at a more refined one. The transcription time should be calculated as the sum of the times for these two processes;
  5. the phonology of the source language – some languages have more segmental distinctions than others, and, depending on who is doing the transcribing, some distinctions may be more difficult to hear and transcribe than others. If a language has suprasegmental contrasts to be included in the transcription (e.g. tonal contrasts) the nature of these will also affect the amount of transcription time. Tony Woodbury reports that:
    “The Eastern Chatino of Quiahije has 20 phonemically different tones, with complex sandhi phenomena that affect morpholexical tones. Transcription alone by trained fluent native speakers takes 1 hour for 5-10 minutes of clear monologic speech. I’m slower than that, and I typically transcribe in tandem the post-sandhi phonemic version and a lexical version, mainly as a check on myself (that commits me to sandhi testing except if the context is just right). So for transcription alone, I’d be more like 1 hour for 2-3 minutes. In reality though, I also gloss and determine inflectional categories as I go, because that also helps me narrow down the tone possibilities. I’m a bit slower with the Eastern Chatino of Zacatepec; it is hard because two of the most ‘populous’ tone categories sound exactly alike in isolation and can only be distinguished with sandhi tests.” Transcription of other aspects, such as the melody of songs or chants, or gesture (see Seyfeddinipur 2011), will require special training and be correspondingly more time-consuming;
  6. the transcriber’s familiarity with and fluency in the orthography for the transcription (and for keyboard entry, see below);
  7. the transcriber’s familiarity with a number of aspects of the recording, including:
    • genre — talk in more everyday registers may be less time-consuming to transcribe than special and rarer genres, e.g. chants, child-directed speech;
    • topic — talk about more familiar topics is easier to transcribe than less familiar ones, e.g. topics that require specialist knowledge like religion;
    • mode (monologue vs. dialogue vs. multi-party) — conversation between two people is more difficult to transcribe than monologue (unless each is recorded in a separate track), and increasing the number of conversational partners greatly increases the difficulty of transcription;
    • setting — recordings made in noisy environments are more difficult to transcribe, especially if there is spoken language in the background (e.g. on a TV or radio, or in side conversations or commentary);
    • identity of participants — if the transcriber knows that the people recorded have particular speech traits, that can help to identify each speaker in conversation and to transcribe their speech. The more familiar the transcriber is with these factors, the easier the transcription may be.
  8. the attention spans and stamina of the linguist and/or transcribers — Peter Budd reported that he found that doing more than 60-90 minutes of transcribing at a stretch was tough for all parties;
  9. whether the transcription is digital (typed as a computer file) or analogue (handwritten), needing later conversion to digital. If typed up digitally:
    • whether there is a (continuous) power supply that allows transcribers to work for extended periods;
    • what level of IT skills the transcriber has — several colleagues have reported low levels of basic computer skills of collaborators (e.g. being able to save files and then find them again) which adds to training and transcription time;
    • what input method is used (e.g. are accents or non-ASCII characters entered via the keyboard or via ‘insert symbol’? If multiple languages with different orthographies are used, how complex is it to switch fonts and entry methods for each?);
  10. whether the transcription is time-aligned (i.e. there is a time link between parts of the textual transcription and the audio or video recording — the granularity of alignment (e.g. sentences aligned, or words aligned) will also influence the amount of time required — smaller transcribed parts will take longer to align than larger ones). Is software to be used for this? How familiar is the transcriber with the software, and how easy is it to use for the given task (e.g. ELAN is good for multi-party transcription but requires a lot of training)?
  11. whether the transcription needs checking and post-editing, and how much time needs to be allocated for that;
  12. for translation, the level of fluency in the source language and the target language;
  13. what kind of translation is intended (see Woodbury 2007) – will it be:
    • literal — word-by-word translation;
    • morpho-syntactic — morpheme-by-morpheme translation, also called interlinear glossing. For discussion see Lehmann (1982) and Schultze-Berndt (2006, 2012). For this kind of translation, conformity to a template, such as the Leipzig glossing rules, will require training and may take additional time;
    • idiomatic — a free translation that pays attention to idiomatic expression in the target language;
    • UN-style — translation done on-the-fly while listening to the audio or video recording;
    • literary — a translation that attempts to create a literary work in the target language, paying attention to metre, rhyme, alliteration, etc. in the source language and attempting to express equivalents in the target language. The field of ethnopoetics has explored the theory and practice of making such translations.
  14. whether notes, exegeses, and comments are to be included (see Schultze-Berndt (2006, 2012) on the kinds of linguistic annotations that can be added to recordings).
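A concrete pitfall behind the input-method question in point 9 (a side note of mine, not from the original post) is Unicode normalization: the same accented character can arrive from a transcriber's keyboard either precomposed or as a base letter plus a combining diacritic. The two look identical on screen but compare unequal, so searches and sorts over a transcript silently miss hits. A minimal Python sketch of the problem and the usual fix (normalizing everything to NFC on save):

```python
import unicodedata

# The same string can arrive from a keyboard in two byte-level forms:
precomposed = "\u014B\u00e4"    # ŋ + precomposed ä (NFC form)
decomposed = "\u014Ba\u0308"    # ŋ + 'a' + combining diaeresis (NFD form)

# They render identically but compare unequal, so text searches miss matches.
assert precomposed != decomposed

def normalize_transcript(text: str) -> str:
    """Normalize transcriber input to NFC so equivalent entries compare equal."""
    return unicodedata.normalize("NFC", text)

assert normalize_transcript(precomposed) == normalize_transcript(decomposed)
```

Running the normalization once at file-save time means downstream tools (concordancers, spell-checks, interlinearization) see one consistent encoding regardless of which input method each transcriber used.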

The experience of several colleagues is that having video recordings available speeds up the transcription process by making it easier to identify speaker turns and providing some access to context and extra-linguistic cues. Anthony Jukes, a colleague who worked in Indonesia on Ratahan, found that video recordings made transcription a more bearable and interesting task for the documentation team, and that the transcribers would consistently place audio-only transcription at the bottom of their ‘to do’ list.

A rough rule of thumb seems to be that for an experienced transcriber fluent in the source language and skilled with transcription software a ratio of at least 10:1 for monologue and 15:1 for conversation is needed for transcription, i.e. 6 minutes of monologue takes at least 1 hour to transcribe, 6 minutes of conversation takes at least 1.5 hours (but see Tony Woodbury’s estimate mentioned above of double that for a tonally complex language). For rough transcription plus checking and refinement, a factor of 15:1 for monologue and 30:1 for conversation seems not uncommon. If we add translation, it is not uncommon for a ratio of 50:1 to apply, i.e. for 6 minutes of recording at least 5 hours is required to transcribe and translate it.
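For planning purposes, these rules of thumb reduce to simple arithmetic: hours of work = minutes of recording × ratio ÷ 60. A small Python sketch (the ratios are the post's rough estimates, not guarantees):

```python
# Rules-of-thumb ratios from the post: hours of work per hour of recording.
RATIOS = {
    "transcribe monologue": 10,
    "transcribe conversation": 15,
    "transcribe + check monologue": 15,
    "transcribe + check conversation": 30,
    "transcribe + translate": 50,
}

def hours_needed(recording_minutes: float, task: str) -> float:
    """Estimated person-hours for a task, given recording length in minutes."""
    return recording_minutes * RATIOS[task] / 60

# 6 minutes of recording at 50:1 -> 5 hours, matching the post's example.
print(hours_needed(6, "transcribe + translate"))  # 5.0
```

Scaling up, a 30-hour corpus at 50:1 implies roughly 1,500 person-hours of transcription and translation — the kind of figure a grant budget needs to state explicitly.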

Note that, for all of this, as the car advertisements say, ‘your mileage may vary’.

References

Lehmann, Christian. 1982. Directions for interlinear morphemic translations. Folia Linguistica 16: 199-224.

Schultze-Berndt, Eva. 2006. Linguistic annotation. In Jost Gippert, Nikolaus P. Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 213-251. Berlin: de Gruyter.

Schultze-Berndt, Eva. 2012. Language documentation. In Tibor Kiss & Artemis Alexiadou (eds.) Syntax: An International Handbook of Contemporary Research (2nd edition), 2064-2096. Berlin: Mouton de Gruyter.

Seyfeddinipur, Mandana. 2011. Reasons for documenting gestures and suggestions for how to go about it. In Nicholas Thieberger (ed.) The Oxford Handbook of Linguistic Fieldwork, 147-165. Oxford: Oxford University Press.

Woodbury, Anthony C. 2007. On thick translation in language documentation. Language Documentation and Description 4: 120-135.

Footnote: Many thanks to Peter Budd, Anthony Jukes, and Tony Woodbury for comments on a draft of this post.

Comments

  1. From Ariel Gutman: Interesting post. An important issue that slows down transcription, in my experience at least, is the usage of time-aligning software, and in particular ELAN. Although time-aligning sounds like a good idea in general, the actual design of ELAN makes it highly non-user friendly and non-ergonomic, especially since using the software only through the keyboard is not so easy. Moreover, in many cases it is useful to add notes to specific time points (such as ‘here begins the second story’), and not to time intervals, but ELAN is not flexible enough for this. Of course, all of this can be solved, but the people at the MPI should start designing their software with user-friendliness in mind.
  2. From Wamut: Yeah my estimate is about the same as Peter’s – I say that one minute of recording takes about an hour for a good quality transcription, gloss and translation. Of course there are always lots of factors, but one in particular is that we’re dealing with endangered languages so I’m happy to be a bit less fussy and let some finer points go through to the keeper if it means we can actually process more recordings. As for ELAN, I’m not a tech-head but I still find it really easy to use. I’ve taught stacks of people to use it, including endangered language speakers and language workers themselves and they’ve all picked it up pretty well and run with it. Now, I’d like to think that’s attributable solely to my brilliant training skills (joking!) but no, I think ELAN is a great program.
    One of the highlights of my own remote language work has been some great sessions with an endangered language team – me and 3-4 language speakers or workers: one on ELAN playing each annotation through the speakers for us, one or two old ladies repeating and translating for us, one scribe on the white/blackboard for us to jointly agree on the translation and transcription, and someone writing it down so that we can just do ‘data entry’ into ELAN at the end of the session. A great way to work and I found ELAN to be much more a help than a hindrance.
  3. From Stuart McGill: Peter I see commodification in EL is here to stay!
    Ariel: I think you’re spot on with your comments on ELAN and keyboard use. I guess you’ve tried Transcriber for audio transcription? I mention it because previously I’d dismissed it due to its lack of support for special characters, but just recently the thought of transcribing two DVD commentaries delivered at break-neck speed drove me back to Transcriber! Using a kind of SAMPA notation and then exporting to Toolbox for further annotation turned out to be an order of magnitude quicker than ELAN (of course there are other considerations, like the number of speakers, video, and who is doing the transcription).
    More broadly, I guess a program can be ‘easy to use’ without being ‘user-friendly’. ELAN is not too difficult to learn, but the main source of frustration (for me, at least) in what is otherwise a nice program is that transcription in ELAN is simply slow(er than it could be), no matter how well you know the program. I guess different people have different tolerance-levels for that sort of thing.
  4. From Claire Bowern: Just to add to Peter’s excellent list of factors: (a) orthography and keyboard layout of the target language (aside from phonology and familiarity with it); my Yolŋu transcription sped up noticeably when I defined my own keyboard layout that let me type Yolngu and English on a single keyboard without switching between two layouts; (b) the speech style and rate of the speaker. Whether there’s head-tail linkage (speeds things up, I reckon); whether there are pauses — e.g. with clear pauses you can use ELAN’s automatic segmentation with less manual editing, which also speeds things up.
  5. From Mark Dingemanse: Excellent post. One way to speed up the process somewhat is to go through a presegmented conversation quickly while recording someone repeating what is said – but without halting each time to transcribe it. With these repeated utterances in hand it’s quite easy to do a rough transcription on your own. This avoids the situation in which consultants are waiting for you to get it right. Described in more detail here. Followers of this thread will be interested to know that ELAN from version 4.1 has a new user interface for high-speed transcription work. See Transcription mode in ELAN for details and screenshots.
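Claire Bowern's point about automatic segmentation on pauses can be sketched in a few lines. This is my own illustrative version of the idea, not ELAN's actual silence recognizer: given per-frame energy values, a run of low-energy frames long enough to count as a pause ends one segment and the next loud frame starts another.

```python
# Pause-based auto-segmentation sketch (illustrative, not ELAN's algorithm):
# frames below an energy threshold for long enough end the current segment.
def segment_on_pauses(energies, threshold=0.05, min_pause_frames=3):
    """Return (start, end) frame-index pairs for stretches of speech."""
    segments, start, silent_run = [], None, 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i          # a new speech stretch begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # pause confirmed: close the segment before the silence began
                segments.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start, len(energies)))   # speech ran to the end
    return segments

# Two speech bursts separated by a clear pause:
print(segment_on_pauses([0.6, 0.7, 0.0, 0.0, 0.0, 0.5, 0.4]))  # [(0, 2), (5, 7)]
```

The appeal for transcription is exactly what the comment describes: with clearly paused speech, the machine produces the segment boundaries and the human time goes into the words, not the clicking.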

One thought on “How long is a piece of string?”

  1. From Joseph Brooks: I love this post. It is already so thorough (possibly more than anywhere else, unless I’ve missed something in the literature). I’d like to add a couple of things from my experience working in PNG with speakers of Chini and Yaw. Essentially, what I would like to put out there is an argument in favour of being slow, as a conscious choice, for the following reasons. (1) Number one for me is that trying to transcribe as much data as possible as quickly as possible stresses people [expletives removed] out, including myself but, more importantly, the language teachers working with me. PNG is hot and social life is going on all the time; it’s already a bit odd to be spending all this time indoors sitting at a computer. I know we all want to have more transcribed texts, but whenever I’ve aimed to get it done more quickly, the research value gets degraded (often my disappointment doesn’t come until much later, when it’s too late) and the people I love get annoyed with me. (2) Text-based elicitation as part of the transcription process. I have found that often the more interesting (i.e. language-specific) some elicitable structure is, the harder it can be to elicit later when the context from the text is gone. For example, Yaw has two verbs for ‘wait’, but the target language (Tok Pisin) does not. The more unusual one came up in a text, but trying to elicit the verb forms later on was true headache material. (3) Another issue I’ve encountered, especially with conversational texts, is the vast amount of paratextual material hovering around, much of it cultural information that I either barely understand or have no idea about: references to all sorts of things, even down to the usage of proper names. The more I have sat back (with the audio recorder running!) while the language teacher goes into however much depth they choose, the more I have learned. The research value (from my point of view) is higher, and the interaction is more teacher-student than otherwise (which, from what I have seen at least, is much preferred by the folks working with me).