This 5-day course is part of ESSLLI 2016. It aims to introduce students to the use of corpora of naturally occurring language for research in semantics/pragmatics.
The course has a hands-on component that gives participants the opportunity to get their hands dirty with actual data and the tools for handling it. To minimize the amount of software you need to install on your own laptop, the kind tech support team is providing access to a virtual machine pre-configured with all the necessary software. To access the virtual machine, you only need to install VMWare Horizon (available for most operating systems). If you want to take advantage of this part of the course (recommended!), please follow the installation instructions and bring your laptop to class. You will receive your login credentials upon registration. If you save files on the virtual machine, make sure to save them in your home directory -- everything outside the home directory will be deleted when you log out!
Lecturer
Judith Degen -- jdegen@stanford.edu
Course description
Traditionally, the primary source of data in pragmatics has been researchers' intuitions about utterance meanings. However, the small number of introspective judgments about examples, hand-selected by the very researchers who provide those judgments, introduces bias into the study of the phenomena under investigation. The recently emerging use of experimental methods for probing linguistically untrained language users' interpretations has ameliorated the bias introduced by small numbers of judgments. It cannot, however, remove item bias: researchers artificially construct the stimuli used in experiments. Fortunately, studying corpora of naturally occurring language can reduce item bias. Corpora provide naturally occurring utterances that can be used in tandem with platforms like Mechanical Turk to collect large-scale crowd-sourced interpretations of these utterances, thereby making it possible to construct large databases of different types of meanings (e.g., implicatures) in context.
In order to not only introduce you to the use of corpora of naturally occurring language for research in semantics/pragmatics but also equip you with practical skills for conducting your own research in this area, the course will contain a substantial hands-on component. We will use tools for searching syntactically parsed corpora (TGrep2, TDTlite) as well as tools for analyzing and visualizing data (R, in particular the lme4 and ggplot2 packages).
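To give a concrete flavor of what such a search looks like, here is a minimal TGrep2 sketch. The corpus file name is a placeholder, and the flag glosses below are my summary of the TGrep2 User Manual (linked under Resources), so double-check them against the manual:

```
# Minimal TGrep2 sketch (the corpus file name is a placeholder): find NPs whose
# determiner is "some" -- the construction behind the scalar implicature case
# study in Degen (2015).
#   -c  selects the compiled corpus file
#   -a  reports all matches in each sentence, not just the first
#   -w  prints the whole matching sentence
tgrep2 -c swbd.t2c.gz -a -w 'NP < (DT < some)'
```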
Prerequisites
The course has no official prerequisites. However, the more programming experience you have, the easier it will be to follow along. Remember to fill out the survey to increase the probability that the course pace aligns with your needs. We will make heavy use of the UNIX terminal for navigating directories and running the corpus search tools, so knowing how to use an editor from the terminal will be important. (I'll be using vim, but I won't get in the way of anyone looking to use emacs or something even more outlandish.) We'll use R and ggplot2 later in the course.
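If the terminal is new to you, the following sketch covers most of the shell usage we will need in class (all file and directory names are just examples):

```
# Where am I, and what is here?
pwd
ls -l

# Make a directory for course materials and move into it (the name is an example).
mkdir esslli-corpora
cd esslli-corpora

# Page through a results file (press q to quit), then open it in an editor.
less results.txt
vim results.txt
```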
If you would like to get caught up on the basics of using these tools, I recommend the following tutorials (in order of priority):
- UNIX tutorial for beginners -- complete at least the first three tutorials
- Ryan's vi tutorial and a cheat sheet of useful vi commands
- datacamp for learning R
Preliminary syllabus
When | What | Reading(s) / resources | Slides |
---|---|---|---|
Monday | Introduction: the utility of corpora for research in semantics/pragmatics; a practical example: variation in scalar implicature strength | de Marneffe & Potts 2014; Degen 2015 | |
Tuesday | TGrep2 tutorial: searching corpora for syntactic patterns based on regular expressions | TGrep2 User Manual | code sheet |
Wednesday | Hands-on project intro (projection behavior of (non-/semi-)factive verbs); developing patterns for extracting factive verbs (see the sketch below the table) | Beaver 2010; Simons et al., to appear | |
Thursday | TDTlite tutorial: building an annotated database of corpus search results | TDTlite User Manual; Gibson et al. 2011; Tonhauser 2016 | code sheet |
Friday | Visualizing corpus data and data from the mini-experiment; discussion of results; reflection | | code sheet |
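As a preview of Wednesday's hands-on project, here is a hedged first stab at the kind of TGrep2 pattern we will develop together. The verb list, the inflected forms, and the corpus file are placeholder choices, and the node labels assume a Penn Treebank-style parse:

```
# Hypothetical starting pattern for the factive-verbs project: find VPs headed
# by a form of know/discover/realize that take an SBAR (sentential) complement.
# Expect to refine this in class -- as written it will both over- and undergenerate.
tgrep2 -c swbd.t2c.gz -a -w 'VP < (/^VB/ < /^(know|knows|knew|known|discover|discovered|realize|realized)$/) < SBAR'
```

Matches extracted this way are the raw material for Thursday's session, where TDTlite is used to turn corpus search results into an annotated database.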
Resources
- TGrep2 and the TGrep2 User Manual
- TDTlite on GitHub
- TDTlite User Manual
- The IMS Open Corpus Workbench
- Linguistic Data Consortium
- Lots of excellent resources for mixed effects modeling (which we ended up not getting to in class) can be found on Florian Jaeger's HLPlab wiki
Further readings
Beaver, D. 2010. Have you Noticed that your Belly Button Lint Colour is Related to the Colour of your Clothing? In R. Bäuerle, U. Reyle, and T. E. Zimmermann (eds.), Presuppositions and Discourse: Essays offered to Hans Kamp, Elsevier, Oxford. (pp. 65–99).
Degen, J. 2015. Investigating the distribution of some (but not all) implicatures using corpora and web-based methods. Semantics and Pragmatics 8(11): 1-55.
de Marneffe, M.-C., Manning, C., and Potts, C. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics 38(2): 301-333.
de Marneffe, M.-C. and Potts, C. 2014. Developing linguistic theories using annotated corpora. To appear in Nancy Ide and James Pustejovsky, eds., The Handbook of Linguistic Annotation. Berlin: Springer.
Gibson, E., Piantadosi, S., and Fedorenko, E. 2011. Using Mechanical Turk to obtain and analyze English acceptability judgments. Language and Linguistics Compass 5(8): 509-524.
Potts, C. 2012. Goal-driven answers in the Cards dialogue corpus. In Nathan Arnett and Ryan Bennett, eds., Proceedings of the 30th West Coast Conference on Formal Linguistics, 1-20. Somerville, MA: Cascadilla Press.
Simons, M., Beaver, D., Roberts, C., and Tonhauser, J. To appear. The Best Question: Explaining the projection behavior of factive verbs. Discourse Processes.
Tonhauser, J., Beaver, D., Roberts, C., and Simons, M. 2013. Toward a taxonomy of projective content. Language 89(1): 66-109.