Wellner PDTB Head Extraction

Python implementation of Ben Wellner's Penn Discourse Treebank head extraction algorithm


Ben Wellner's 2009 Brandeis University dissertation, Sequence Models and Ranking Methods for Discourse Parsing, includes chapters on automatically identifying connectives and arguments in the Penn Discourse Treebank (PDTB). Unlike the PDTB representation, however, he identifies arguments by their syntactic heads rather than by their text spans. This program implements the text span --> syntactic head portion of his algorithm. See pages 40-43 and 60-61 of the dissertation for more information.


The script requires NLTK as a dependency. It is available in Ubuntu as the package python-nltk, but NLTK runs on all major platforms. Additionally, you must already have a copy of the PDTB, with the PDTB, PTB, and text portions of the corpus. I cannot distribute it. You can get a copy through your academic institution or from the Linguistic Data Consortium. I assume your corpus directory has three subdirectories: one called pdtb, containing subdirectories of .pdtb files; a second called ptb, containing subdirectories of .mrg files; and a third called text, containing subdirectories of extensionless raw text files.
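The directory layout described above can be checked before running the script. The following is a minimal sketch and not part of the project: the function name missing_corpus_dirs is hypothetical, and it only looks for the three top-level subdirectories, not for the files inside them.

```python
from pathlib import Path

# The three top-level subdirectories the corpus directory is assumed to have.
EXPECTED_SUBDIRS = ("pdtb", "ptb", "text")


def missing_corpus_dirs(corpus_root):
    """Return the expected subdirectory names that are absent from corpus_root.

    Hypothetical helper for illustration only; it does not inspect the
    .pdtb, .mrg, or raw text files themselves.
    """
    root = Path(corpus_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all three subdirectories were found. For example, missing_corpus_dirs("/path/to/corpus") on a directory that lacks only the text portion would return ["text"].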


Single request mode:

$ python [simple_input_file] [output_file]

Efficient batch request mode:

$ python -p [batch_input_file] [output_file]

A simple input file contains a path to a PTB .mrg file (absolute or relative to the script), followed by a literal tab, followed by a list of Gorn addresses. See the provided sample_request.txt for an example. The script writes to the given output file whether or not it already exists.
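To make the simple request format concrete, here is a rough sketch of how such a line could be split apart. It is not taken from the script. The function name parse_simple_request is hypothetical, and the sketch assumes the addresses follow the PDTB notation, where each Gorn address is a comma-separated sequence of child indices and the addresses in a list are separated by semicolons. Check sample_request.txt for the format the script actually expects.

```python
def parse_simple_request(line):
    """Split one request line into (mrg_path, list_of_gorn_addresses).

    ASSUMPTION: addresses use PDTB notation, e.g. "0,1,2;0,3" is a list of
    two addresses, (0, 1, 2) and (0, 3). This is not confirmed by the script.
    """
    # The path and the address list are separated by the first literal tab.
    path, tab, address_field = line.rstrip("\n").partition("\t")
    if not tab:
        raise ValueError("expected a tab between the path and the addresses")
    addresses = [
        tuple(int(index) for index in address.split(","))
        for address in address_field.split(";")
        if address.strip()  # skip empty entries such as a trailing ";"
    ]
    return path, addresses
```

Each address becomes a tuple of integers, which is also the form NLTK trees accept as a tree position when indexing into a parsed .mrg tree.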

A batch input file contains a path to a PTB .mrg file (absolute or relative to the script), followed by a literal tab, a first Gorn address, a tab, a second Gorn address, and then either the string "arg1" or "arg2". See the provided sample_batch_requests.txt for an example.
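A batch request can be validated in a similar way before handing the file to the script. The sketch below is illustrative only. The names BatchRequest, parse_gorn, and parse_batch_request are hypothetical, and it assumes that all four fields are separated by tabs and that each Gorn address is a comma-separated sequence of integers. Consult sample_batch_requests.txt for the authoritative format.

```python
from collections import namedtuple

# Hypothetical record type for one batch request.
BatchRequest = namedtuple(
    "BatchRequest", ["mrg_path", "first_address", "second_address", "arg"]
)


def parse_gorn(text):
    """Parse one Gorn address such as "0,1,2" (assumed notation) into a tuple."""
    return tuple(int(index) for index in text.split(","))


def parse_batch_request(line):
    """Parse one batch request, assuming four tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got %d" % len(fields))
    path, first, second, arg = fields
    if arg not in ("arg1", "arg2"):
        raise ValueError("last field must be 'arg1' or 'arg2', got %r" % arg)
    return BatchRequest(path, parse_gorn(first), parse_gorn(second), arg)
```

Rejecting malformed lines up front gives a clearer error than a failure partway through a long batch run.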


You can download this project in either zip or tar formats.

You can also clone the project with Git by running:

$ git clone git://

Alternatively, you can browse the code through GitHub.

Special Thanks

This work was done during the summer of 2009 while I was an intern in the Department of Computer Science at Penn, working with researchers at the Institute for Research in Cognitive Science (IRCS). The researchers at IRCS generously agreed to release the code under an open source license.


Yuvi Masory