In the last couple of years there has been increasing concern in Linguistics about the sources of language materials cited in research and publications (as well as a growing role for citation metrics in quantification of research outcomes). Thus, Gawne et al. (2017) survey 100 reference grammars, looking at whether authors give metadata on the sources of their examples, and/or links to recordings. They find that more than 50% of the grammars do not include any metadata for their examples, i.e. it is unclear in most cases where the examples come from, whether they are written or audio-visually recorded, and in what contexts they were collected (e.g. author intuitions, elicitation, conversation, narrative, overhearing, analogue (refereed or non-refereed) publication, learner-generated materials, grey literature, media broadcasts, (written and spoken/signed) social media, experiment). Their main concern is about transparency of description and analysis, as well as replicability, i.e. the possibility that later researchers can retrace the analytical steps taken by previous researchers (an idea drawn from computer science, see Gezelter 2009).
This theme is further elaborated in Berez-Kroeker et al. (2018) who argue for ‘facilitating a culture of proper long-term care and citation of linguistic data sets’, encouraging researchers to properly cite their materials, and make them available for inspection by others. They also urge ‘editors and publishers of linguistics journals and book series to develop concrete policies for both data sharing and data citation, and to develop formats for the citation of linguistic data sets’. They remind readers that the US National Science Foundation in 2014 called for proposals to develop standards for ‘data citation and attribution, so that data producers, […] and data curators are credited for their contributions’ (National Science Foundation 2014). The Research Data Alliance Linguistics Data Research Group has also published a set of explicit principles (Berez-Kroeker et al. 2018) for such citation, including ‘data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data’ and ‘in scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited’.
Back in April 2011, I wrote a blog post where I already discussed these ideas and their importance, presenting a concrete example of how they were being violated in publications, and pointing out that I had developed a metadata system for implementing them 30 years earlier (i.e. almost 40 years ago now). Here is the story.
Blog post (edited)
In April 2011, Continuum International Publishing Group sent me a complimentary copy of Jim Miller’s textbook A Critical Introduction to Syntax which includes a chapter on Noun Phrases and Non-configurationality (pages 61-98). Since this is a topic I have published on (Austin & Bresnan 1996; Austin 2001a, 2001b), I figured I would have a quick look at this chapter first. Interestingly, on page 78 I found example sentence (27) which is ‘from the Native Australian language Jiwarli’, and for which Miller quotes the source as Pensalfini (2004: 364):
|‘Now those two eggs are lying there.’|
As readers who know about Australian Aboriginal languages will probably recognize, Pensalfini cannot be the original source for this example since only Alan Dench and I ever recorded data on Jiwarli from its last speaker, the late Jack Butler, and only I have published primary material on the language. Pensalfini (2004) indeed gives Austin (1993) as his source, but Miller makes no mention of this (my article was actually published as Austin 2001a, three years before Pensalfini’s article appeared). This seems to be what we could call ‘the example sentence variant’ of the ‘violation of citation etiquette’ described so eloquently by Pullum (1988). [Thirty years later, Penders (2018) re-emphasises the point that ‘referencing is not a neutral act. Citations are a form of scientific currency, actively conferring or denying value. Citing certain sources—and especially citing them often—legitimises ideas, solidifies theories, and establishes claims as facts. References also create transparency by allowing others to retrace your steps. Referencing is thus a moral issue, an issue upon which multiple values in science converge’.]
However, the story has a further twist to it. The glossing of the Jiwarli example, faithfully copied by Miller, is not the glossing given in Austin (1993, 2001a), but was changed by Pensalfini. Here is the example in its original form [Note 1]:
|‘Now those two eggs are lying there.’ [T51s9]|
What I was trying to show in my glossing was a syntactic analysis in which each nominal element in Jiwarli is understood as being nominative case-marked, and in which there is no evidence for noun phrases. Hence, each of ‘two’, ‘that’ and ‘egg’ is marked for case, something that Pensalfini’s reglossing (copied by Miller) distorts. Rather more egregious, however, is that a whole category of information, the [T51s9] following the English free translation, has been silently eliminated. Let me give you a bit of relevant history, and explain what this is.
In 1981 I returned to Australia from a short-term teaching post at Harvard University to take up a position at La Trobe University Linguistics Division (later, Department). I recommenced my research on Western Australian languages, including Jiwarli, after a three-year break in the United States. I started the Gascoyne-Ashburton Languages Project (GALP) at La Trobe (funded 1982-1986 by the Australian Research Grants Scheme), and as part of that established a basic principle of providing metadata giving the source of all the sentence examples (and lexical items) collected in the project. In doing so I was influenced by the same practice I had seen in the PhD research [Note 2] of Jane Simpson (I had been in contact with Jane in Boston in 1980-1981); as Simpson (1983:4, fn2) says:
‘I have tried to indicate the source of each example sentence where I know it. If the sentence is made up, I have indicated this, unless the sentence is elementary’. [In a comment on the original blog post dated 2011-05-03 Jane says: ‘When I’ve worked on languages, I’ve felt the same anxious compulsion as scholars of classical languages (or middle English, or indeed Otto Jespersen) have had in documenting their sources. Without native speaker intuition, it’s the only way to avoid some pretty hideous mistakes. (And even so they creep in.)’].
For GALP I developed a metadata source indication system that distinguished two categories (usually indicated in publications as material in square brackets following each English free translation; see the example immediately above):
- elicited examples whose metadata reference has the form [AABBCCNDDpEEsFF] where AA is an abbreviation representing the language, BB represents the speaker, CC represents the recorder, N means notebook, DD is an integer for the fieldnotes book number, EE is an integer for the page of the notebook, and FF is an integer for the sequential order of the sentence on the notebook page. Thus, [TRCYTKN01p79s07] is the seventh sentence on page 79 of notebook 1 collected by Terry Klokeid from Chubby Yowadji in Tharrkari;
- text examples whose metadata reference has the form [AABBCCTDDsEE] where AA is an abbreviation representing the language, BB represents the speaker, CC represents the recorder, T means text, DD is an integer for a text in a collection, EE is an integer for the sequential order of the sentence within the text. Thus, [WRAEOGT03s01] is the first sentence in text 3 collected by Geoffrey O’Grady from Alec Eagles speaking Warriyangka.
For Jiwarli text examples, a reference like [JIJBPAT51s09] could be reduced to [T51s9] since the only existing texts are just those recorded by myself from the late Jack Butler (and published in Austin 1997). This is the metadata missing from the example as cited by Pensalfini and Miller above.
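As a minimal sketch of how references in these two formats could be decoded mechanically, the following Python fragment parses both the elicited and the text forms, including shortened forms like [T51s9]. It is an illustration only, not the Toolbox setup used in GALP: the names `REF_PATTERN`, `SourceRef` and `parse_ref` are hypothetical, and the assumption that each language, speaker and recorder code is exactly two capital letters is drawn only from the examples given here.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed layout: three optional two-letter codes (language, speaker, recorder),
# then either N<notebook>p<page> (elicited) or T<text> (text), then s<sentence>.
REF_PATTERN = re.compile(
    r"^\[?"
    r"(?:(?P<lang>[A-Z]{2})(?P<speaker>[A-Z]{2})(?P<recorder>[A-Z]{2}))?"
    r"(?:N(?P<notebook>\d+)p(?P<page>\d+)|T(?P<text>\d+))"
    r"s(?P<sentence>\d+)"
    r"\]?$"
)


@dataclass(frozen=True)
class SourceRef:
    kind: str                  # "elicited" or "text"
    language: Optional[str]
    speaker: Optional[str]
    recorder: Optional[str]
    notebook: Optional[int]    # elicited examples only
    page: Optional[int]        # elicited examples only
    text: Optional[int]        # text examples only
    sentence: int


def parse_ref(ref: str, defaults: Optional[Tuple[str, str, str]] = None) -> SourceRef:
    """Parse a source reference.

    `defaults` supplies (language, speaker, recorder) for shortened
    forms such as [T51s9], where those codes are left implicit.
    """
    match = REF_PATTERN.match(ref.strip())
    if match is None:
        raise ValueError(f"not a recognised source reference: {ref!r}")
    g = match.groupdict()
    if g["lang"] is None and defaults is not None:
        g["lang"], g["speaker"], g["recorder"] = defaults

    def to_int(value: Optional[str]) -> Optional[int]:
        return int(value) if value is not None else None

    return SourceRef(
        kind="text" if g["text"] is not None else "elicited",
        language=g["lang"],
        speaker=g["speaker"],
        recorder=g["recorder"],
        notebook=to_int(g["notebook"]),
        page=to_int(g["page"]),
        text=to_int(g["text"]),
        sentence=int(g["sentence"]),
    )


if __name__ == "__main__":
    print(parse_ref("[TRCYTKN01p79s07]"))
    print(parse_ref("[WRAEOGT03s01]"))
    print(parse_ref("[T51s9]", defaults=("JI", "JB", "PA")))
```

With the Jiwarli defaults supplied, the shortened [T51s9] decodes to the same record as the full [JIJBPAT51s09], which is the sense in which the short form loses no information so long as the collection it belongs to is known.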
I introduced this system to keep track of the contributions of individual speakers and recorders, the genre of examples, and to ensure that it was always possible to go back to the original fieldnotes and text collections to check materials, if necessary [Note 3]. I have maintained this system in all my analysed data sets using the Field Linguist’s Toolbox software, and in published and unpublished work (like Austin 2015) ever since (for a specification of the relevant data structures see Austin 2002).
Interestingly, a feature of Miller’s A Critical Introduction to Syntax is that it makes use of ‘real language examples’ taken from spoken and written English corpora. Each such example has relevant source metadata clearly indicated (thus page 39 example (79) is from ‘Miller-Brown corpus, conversation 58’, and page 133 example (25) is from ‘The Herald, 17 October 2009, p. 4’) yet no example sentence in a language other than English gets a metadata source reference, not even Russian which is extensively exemplified. Surely what’s good for the (English) goose should be good for the (non-English) gander?
In their seminal paper on data portability and digital language documentation, Bird & Simons (2003) identify citation as one of the major problems faced by those who wish to document and describe languages (for subsequent discussion in the literature see the Introduction above). They state that: ‘[w]e value the ability of users of a resource to give credit to its creators, as well as to learn the provenance of the sources on which it is based. Thus the best practice is one that makes it easy for … language documentation and description to be cited.’ [Note 4] Having developed such a system for my own research some thirty years ago, I find it disappointing that Miller, and Pensalfini before him, simply left out the crucial identifying citation metadata.
Let’s hope that practices and policies in linguistic research improve in this area in the future, so that the hard work of language speakers and language documenters and analysts can be properly recognised, especially as material is passed around, resulting in second-hand and third-hand publications [Note 5].
It is pleasing that others have published about this topic recently — if only they had cited ‘Citation, citation’, and the principles and practices it described, dating back to 1981.
Austin, Peter. 1993. Word order in a free word order language: the case of Jiwarli. La Trobe University Manuscript.
Austin, Peter. 1997. Texts in the Mantharta Languages, Western Australia. Tokyo: ILCAA, Tokyo University of Foreign Studies.
Austin, Peter K. 2001a. Word order in a free word order language: the case of Jiwarli. In Jane Simpson, David Nash, Mary Laughren, Peter Austin, Barry Alpher (eds.) Forty years on: Ken Hale and Australian languages, 205-323. Canberra: Pacific Linguistics.
Austin, Peter K. 2001b. Zero arguments in Jiwarli, Western Australia. Australian Journal of Linguistics 21(1), 83-98.
Austin, Peter K. 2002. Developing Interactive Knowledgebases for Australian Aboriginal Languages: Malyangapa. Unpublished paper for EMELD 2003 workshop.
Austin, Peter K. 2015. A reference grammar of the Mantharta languages, Western Australia. SOAS University of London, unpublished manuscript.
Austin, Peter and Joan Bresnan. 1996. Non-configurationality in Australian Aboriginal languages. Natural Language and Linguistic Theory 14, 215-268.
Berez-Kroeker, Andrea L., Lauren Gawne, Susan Smythe Kung, Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer, David I. Beaver, Shobhana Chelliah, Stanley Dubinsky, Richard P. Meier, Nick Thieberger, Keren Rice & Anthony C. Woodbury. 2018. Reproducible research in linguistics: A position statement on data citation and attribution in our field. Linguistics 56(1), 1-18.
Berez-Kroeker, Andrea L., Helene N. Andreassen, Lauren Gawne, Gary Holton, Susan Smythe Kung, Peter Pulsifer, Lauren B. Collister, The Data Citation and Attribution in Linguistics Group & the Linguistics Data Interest Group. 2018. The Austin Principles of Data Citation in Linguistics. Version 1.0.
Bird, Steven and Gary Simons. 2003. Seven dimensions of portability. Language 79(3), 557-582.
Bow, Cathy, Baden Hughes and Steven Bird. 2003. Towards a general model of interlinear text. E-MELD workshop paper.
Gawne, Lauren, Barbara F. Kelly, Andrea L. Berez-Kroeker & Tyler Heston. 2017. Putting practice into words: The state of data and methods transparency in grammatical descriptions. Language Documentation & Conservation 11, 157-189.
Gezelter, Dan. 2009. Being scientific: Falsifiability, verifiability, empirical tests, and reproducibility. The OpenScience project.
National Science Foundation. 2014. Supporting scientific discovery through norms and practices for software and data citation and attribution (Dear Colleague letter).
Penders, Bart. 2018. Ten simple rules for responsible referencing. PLoS Computational Biology 14(4), e1006036.
Pensalfini, Robert. 2004. Towards a typology of configurationality. Natural Language and Linguistic Theory 22(2), 359-408.
Pullum, Geoffrey. 1988. Citation etiquette beyond Thunderdome. Natural Language and Linguistic Theory 6(4), 579-588.
Simpson, Jane. 1983. Aspects of Warlpiri Morphology and Syntax. PhD dissertation, MIT.
Simpson, Jane. 1991. Warlpiri Morpho-Syntax. Amsterdam: Kluwer Academic Publishers.
- The same example also appears in Austin (2001b: 85, example 3), as well as in Austin & Bresnan (1996: 246, example 42).
- Simpson (1991) is the revised published version, which continues the same practice.
- Austin (1997) also gives the date and tape number for all audio-recorded texts. This is encoded in the Toolbox files, as is the date of recording and date of transcription, which may well be different.
- See also Bow, Hughes & Bird (2003), who propose a four-level model of interlinear glossed text that includes a text level which is ‘the complete unit of data under examination which functions as a unit in its entirety … The text level includes metadata’.
- The example sentence quoted above gets particularly woeful treatment at the hands of ODIN, the Online Database of Interlinear text, which is ‘a repository of interlinear glossed text extracted mainly from scholarly linguistic papers’. ODIN identifies the language of this example as Mangala (ISO-639-3 mem), spoken a thousand kilometers north of Jiwarli on the coast of Western Australia, because of a misidentification of Jiwarli with Juwarliny, which they claim is a dialect of Mangala. As Claire Bowern points out (comment 2011-04-28): ‘There’s a double misidentification! Juwarliny is actually a dialect of Walmajarri (ISO-639-3 wlm). (It’s also called Jiwarliny)’.