This 5-day course is part of ESSLLI 2016. It aims to introduce students to the use of corpora of naturally occurring language for research in semantics/pragmatics.

The course has a hands-on component that gives participants the opportunity to get their hands dirty with actual data and the tools for handling it. To minimize the amount of software you need to install on your own laptop, the kind tech support team is providing access to a virtual machine pre-configured with all the necessary software. To access the virtual machine, you only need to install VMware Horizon (available for most operating systems). If you want to take advantage of this part of the course (recommended!), please follow the installation instructions and bring your laptop to class. You will receive your login credentials upon registration. If you save files on the virtual machine, make sure to save them in your home directory -- everything not saved in the home directory will be deleted when you log out!

Lecturer

Judith Degen -- jdegen@stanford.edu

Course description

Traditionally, the primary source of data in pragmatics has been researchers’ intuitions about utterance meanings. However, relying on a small number of introspective judgments about examples, hand-selected by the same researchers who provide the judgments, introduces bias into the study of the phenomena under investigation. The recently emerging use of experimental methods for probing linguistically untrained language users’ interpretations has ameliorated the bias introduced by small numbers of judgments. It cannot, however, remove item bias, because researchers still artificially construct the stimuli used in experiments. Fortunately, studying corpora of naturally occurring language can reduce item bias. Corpora provide naturally occurring utterances that can be used in tandem with platforms like Mechanical Turk to collect large-scale crowd-sourced interpretations of these utterances, making it possible to construct large databases of different types of meanings (e.g., implicatures) in context.

The course will contain a substantial hands-on component. The goal is not only to introduce you to the use of corpora of naturally occurring language for research in semantics/pragmatics, but also to equip you with practical skills for conducting your own research in this area. We will use tools for searching syntactically parsed corpora (TGrep2, TDTlite) as well as tools for analyzing and visualizing data (R, in particular the lme4 and ggplot2 packages).
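To give a rough idea of what searching a syntactically parsed corpus involves, here is a minimal Python sketch. It is not TGrep2 or TDTlite and does not use their pattern syntax; it only illustrates the underlying idea of matching a structural relation ("a VP that immediately dominates both a verb and an SBAR") against a bracketed parse tree. The function names and the example sentence are made up for illustration, and the bracketing is assumed to follow Penn Treebank conventions.

```python
# Illustrative sketch only: hypothetical helper names, hand-written example tree.

def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return read()


def subtrees(tree):
    """Yield every labeled node of the tree, top-down."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(tree):
    """Return the words dominated by a node, left to right."""
    words = []
    for child in tree[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


def verbs_with_clausal_complement(tree):
    """Collect verbs from VPs that immediately dominate a verb and an SBAR."""
    hits = []
    for label, children in subtrees(tree):
        if label != "VP":
            continue
        kids = [c for c in children if isinstance(c, tuple)]
        has_sbar = any(k[0] == "SBAR" for k in kids)
        verbs = [leaves(k)[0] for k in kids if k[0].startswith("VB")]
        if has_sbar and verbs:
            hits.append(verbs[0])
    return hits


example = ("(S (NP (PRP She)) (VP (VBZ knows) (SBAR (IN that) "
           "(S (NP (PRP it)) (VP (VBD rained))))))")
print(verbs_with_clausal_complement(parse_tree(example)))
```

Dedicated corpus search tools express this kind of structural query as a compact pattern and run it efficiently over an entire corpus, which is why the course uses them rather than hand-written scripts like this one.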

Prerequisites

The course has no official prerequisites. However, the more programming experience you have, the easier it will be to follow along. Remember to fill out the survey so that the pace of the course is more likely to match your needs. We will be making heavy use of the UNIX terminal for navigating directories and running the corpus search tools. Knowing how to use an editor from the terminal will be important. (I'll be using vim, but I won't get in the way of anyone looking to use emacs or something even more outlandish.) We'll use R and ggplot2 later in the course.

If you would like to get caught up on the basics of using these tools, I recommend the following tutorials (in order of priority):

UNIX tutorial for beginners -- complete at least the first three tutorials
Ryan's vi tutorial and a cheat sheet of useful vi commands
datacamp for learning R

Preliminary syllabus

When	What	Reading(s) / resources	Slides
Monday	Introduction: utility of corpora for research in semantics/pragmatics; a practical example: variation in scalar implicature strength	de Marneffe & Potts 2014; Degen 2015	pdf
Tuesday	TGrep2 tutorial: search corpora for syntactic patterns based on regular expressions	TGrep2 User Manual	code sheet
Wednesday	Hands-on project intro (projection behavior of (non-/semi-)factive verbs); develop patterns for extracting factive verbs	Beaver 2010; Simons et al., to appear	pdf
Thursday	TDTlite tutorial: building an annotated database of corpus search results	TDTlite User Manual; Gibson et al. 2011; Tonhauser 2016	code sheet
Friday	Visualizing corpus data and data from mini-experiment; discussion of results; reflection		code sheet

Resources

TGrep2 and the TGrep2 User Manual
TDTlite on GitHub
TDTlite User Manual
The IMS Open Corpus Workbench
Linguistic Data Consortium
Lots of excellent resources for mixed effects modeling (which we ended up not getting to in class) can be found on Florian Jaeger's HLPlab wiki

ESSLLI 2016: Corpus Methods for Research in Pragmatics
