Bag of what? Simple noun phrase extraction for text analysis (EMNLP, NLP + Computational Social Science, 2016)



Abram Handler, Matthew J. Denny, Hanna Wallach, and Brendan O'Connor. 2016. "Bag of what? Simple noun phrase extraction for text analysis." Proceedings of the 2016 EMNLP Workshop on Natural Language Processing and Computational Social Science, pp. 114-124.


Social scientists who do not have specialized natural language processing training often use a unigram bag-of-words (BOW) representation when analyzing text corpora. We offer a new phrase-based method, NPFST, for enriching a unigram BOW. NPFST uses a part-of-speech tagger and a finite state transducer to extract multiword phrases to be added to a unigram BOW. We compare NPFST to both n-gram and parsing methods in terms of yield, recall, and efficiency. We then demonstrate how to use NPFST for exploratory analyses; it performs well, without configuration, on many different kinds of English text. Finally, we present a case study using NPFST to analyze a new corpus of U.S. congressional bills.
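
The general idea of matching a pattern over part-of-speech tags can be illustrated with a minimal sketch. This is not the NPFST implementation: it uses a regular expression in place of a finite state transducer, it expects tokens that have already been tagged, and the tag mapping `COARSE`, the pattern `NP_PATTERN`, the function `extract_phrases`, and the `max_len` cutoff are all assumptions made for illustration.

```python
import re

# Hypothetical mapping from Penn Treebank tags to one-letter coarse classes:
# A = adjective, N = noun, P = preposition, D = determiner.
# Any tag not listed here is treated as "O" (other).
COARSE = {
    "JJ": "A", "JJR": "A", "JJS": "A",
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "IN": "P",
    "DT": "D",
}

# Assumed noun phrase pattern over coarse tags: optional adjectives and nouns
# ending in a noun, optionally followed by prepositional attachments.
NP_PATTERN = re.compile(r"(?:A|N)*N(?:PD*(?:A|N)*N)*")


def extract_phrases(tagged, max_len=8):
    """Return every multiword span whose coarse tag sequence matches NP_PATTERN.

    `tagged` is a list of (word, tag) pairs. Nested and overlapping spans are
    all returned, so that each can be counted as a separate BOW feature.
    """
    letters = [COARSE.get(tag, "O") for _, tag in tagged]
    phrases = []
    for start in range(len(tagged)):
        # Spans have at least two tokens and at most max_len tokens.
        last_end = min(start + max_len, len(tagged))
        for end in range(start + 2, last_end + 1):
            if NP_PATTERN.fullmatch("".join(letters[start:end])):
                phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases


if __name__ == "__main__":
    sentence = [("the", "DT"), ("health", "NN"), ("care", "NN"),
                ("reform", "NN"), ("bill", "NN"), ("passed", "VBD")]
    for phrase in extract_phrases(sentence):
        print(phrase)
```

In this sketch the extracted phrases would simply be appended to the unigram counts for the same document. Returning every matching span, rather than only the longest one, is a design choice made here so that shorter phrases nested inside longer ones are also available as features.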

For our open-source implementation, see