Computational Linguistics News

Data Statements for NLP Polling

Because of the pandemic, the Computational Linguistics Master of Science (CLMS) program could not host its usual two-day in-person orientation in September 2020. However, the orientation was successfully moved to a virtual format, with a mix of live sessions and pre-recorded talks, as well as social interactions via breakout rooms and an online town. While the virtual format could not fully replicate the usual picnic and other activities, a feedback survey showed that the new CLMSers really valued the experience, especially the opportunities to get to know each other better!

Emily M. Bender organized an international online workshop, "Data Statements for NLP: Towards Best Practices" (May 11-13, 2020), together with Prof. Batya Friedman of the iSchool and Linguistics PhD student Angelina McMillan-Major. The workshop was sponsored by UW's Tech Policy Lab. As Bender describes it, “this workshop was initially scheduled to be a one-day event associated with the Language Resources and Evaluation Conference (LREC 2020) in Marseille. In moving to an online format, it was spread out over three days in order to catch enough hours with participants from all around the globe (including Argentina, Sri Lanka, Mauritius, Nigeria, as well as the US and Europe). The workshop was organized as a working meeting where the organizers assisted participants in writing data statements, documentation of datasets which are fundamental to research and technology development in natural language processing (NLP), for datasets that they are developing. The datasets at play included such varied languages and data types as German Sign Language, simplified writing in Basque, Mauritian riddles and proverbs, English Twitter data, and pairs of Fon-French translations (Fon being a language of Benin). Sample data statements developed during the workshop can now be found on the workshop webpage. At the same time, Bender, Friedman and McMillan-Major received input from these participants which they are using to develop best practices for creating data statements that are responsive to a broad variety of research contexts, both in terms of the institutional environment (which is quite different between, say, Sri Lanka and Germany) and in terms of the types of underlying data being described.”