About the Book
Natural Language Processing with Python (reprint edition) offers a highly accessible introduction to natural language processing, a field that covers language technologies ranging from predictive text and email filtering to automatic summarization and translation. In this book you will learn to write Python programs that work with large collections of unstructured text. You will also access richly annotated datasets through a comprehensive set of linguistic data structures, and come to understand the main algorithms for analyzing the content and structure of written communication.
Packed with examples and exercises, Natural Language Processing with Python will help you:
Extract information from unstructured text, whether to guess the topic or to identify "named entities";
Analyze the linguistic structure of text, including parsing and semantic analysis;
Access popular linguistic databases, including WordNet and treebanks;
Integrate techniques drawn from fields as diverse as linguistics and artificial intelligence.
Natural Language Processing with Python (reprint edition) will help you gain practical natural language processing skills using the Python programming language and the Natural Language Toolkit (NLTK). Whether you are interested in developing web applications, analyzing multilingual news feeds, or documenting endangered languages, or simply curious, as a programmer, about how human language works, you will find this book both fascinating and immensely useful.
About the Authors
Authors: Steven Bird, Ewan Klein, and Edward Loper
Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania.
Ewan Klein is Professor of Language Technology in the School of Informatics at the University of Edinburgh.
Edward Loper recently completed a PhD on machine learning for natural language processing at the University of Pennsylvania, and is now a researcher at BBN Technologies in Boston.
Contents
preface
1.language processing and python
1.1 computing with language: texts and words
1.2 a closer look at python: texts as lists of words
1.3 computing with language: simple statistics
1.4 back to python: making decisions and taking control
1.5 automatic natural language understanding
1.6 summary
1.7 further reading
1.8 exercises
2.accessing text corpora and lexical resources
2.1 accessing text corpora
2.2 conditional frequency distributions
2.3 more python: reusing code
2.4 lexical resources
2.5 wordnet
2.6 summary
2.7 further reading
2.8 exercises
3.processing raw text
3.1 accessing text from the web and from disk
3.2 strings: text processing at the lowest level
3.3 text processing with unicode
3.4 regular expressions for detecting word patterns
3.5 useful applications of regular expressions
3.6 normalizing text
3.7 regular expressions for tokenizing text
3.8 segmentation
3.9 formatting: from lists to strings
3.10 summary
3.11 further reading
3.12 exercises
4.writing structured programs
4.1 back to the basics
4.2 sequences
4.3 questions of style
4.4 functions: the foundation of structured programming
4.5 doing more with functions
4.6 program development
4.7 algorithm design
4.8 a sample of python libraries
4.9 summary
4.10 further reading
4.11 exercises
5.categorizing and tagging words
5.1 using a tagger
5.2 tagged corpora
5.3 mapping words to properties using python dictionaries
5.4 automatic tagging
5.5 n-gram tagging
5.6 transformation-based tagging
5.7 how to determine the category of a word
5.8 summary
5.9 further reading
5.10 exercises
6.learning to classify text
6.1 supervised classification
6.2 further examples of supervised classification
6.3 evaluation
6.4 decision trees
6.5 naive bayes classifiers
6.6 maximum entropy classifiers
6.7 modeling linguistic patterns
6.8 summary
6.9 further reading
6.10 exercises
7.extracting information from text
7.1 information extraction
7.2 chunking
7.3 developing and evaluating chunkers
7.4 recursion in linguistic structure
7.5 named entity recognition
7.6 relation extraction
7.7 summary
7.8 further reading
7.9 exercises
8.analyzing sentence structure
8.1 some grammatical dilemmas
8.2 what's the use of syntax?
8.3 context-free grammar
8.4 parsing with context-free grammar
8.5 dependencies and dependency grammar
8.6 grammar development
8.7 summary
8.8 further reading
8.9 exercises
9.building feature-based grammars
9.1 grammatical features
9.2 processing feature structures
9.3 extending a feature-based grammar
9.4 summary
9.5 further reading
9.6 exercises
10.analyzing the meaning of sentences
10.1 natural language understanding
10.2 propositional logic
10.3 first-order logic
10.4 the semantics of english sentences
10.5 discourse semantics
10.6 summary
10.7 further reading
10.8 exercises
11.managing linguistic data
11.1 corpus structure: a case study
11.2 the life cycle of a corpus
11.3 acquiring data
11.4 working with xml
11.5 working with toolbox data
11.6 describing language resources using olac metadata
11.7 summary
11.8 further reading
11.9 exercises
afterword: the language challenge
bibliography
nltk index
general index
Preface
This is a book about Natural Language Processing. By "natural language" we mean a language that is used for everyday communication by humans; languages such as English, Hindi, or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing--or NLP for short--in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.
Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society.
This book provides a highly accessible introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics, or as a supplement to courses in artificial intelligence, text mining, or corpus linguistics. The book is intensely practical, containing hundreds of fully worked examples and graded exercises.
The book is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). NLTK includes extensive software, data, and documentation, all freely downloadable from http://www.nltk.org/. Distributions are provided for Windows, Macintosh, and Unix platforms. We strongly encourage you to download Python and NLTK, and try out the examples and exercises along the way.
Audience
NLP is important for scientific, economic, social, and cultural reasons. NLP is experiencing rapid growth as its theories and methods are deployed in a variety of new language technologies. For this reason it is important for a wide range of people to have a working knowledge of NLP. Within industry, this includes people in human-computer interaction, business information analysis, and web software development. Within academia, it includes people in areas from humanities computing and corpus linguistics through to computer science and artificial intelligence. (To many people in academia, NLP is known by the name of "Computational Linguistics.")
This book is intended for a diverse range of people who want to learn how to write programs that analyze written language, regardless of previous programming experience:
New to programming?
The early chapters of the book are suitable for readers with no prior knowledge of programming, so long as you aren't afraid to tackle new concepts and develop new computing skills. The book is full of examples that you can copy and try for yourself, together with hundreds of graded exercises. If you need a more general introduction to Python, see the list of Python resources at http://docs.python.org/.
New to Python?
Experienced programmers can quickly learn enough Python using this book to get immersed in natural language processing. All relevant Python features are carefully explained and exemplified, and you will quickly come to appreciate Python's suitability for this application area. The language index will help you locate relevant discussions in the book.
Already dreaming in Python?
Skim the Python examples and dig into the interesting language analysis material that starts in Chapter 1. You'll soon be applying your skills to this fascinating domain.
Emphasis
This book is a practical introduction to NLP. You will learn by example, write real programs, and grasp the value of being able to test an idea through implementation. If you haven't learned already, this book will teach you programming. Unlike other programming books, we provide extensive illustrations and exercises from NLP. The approach we have taken is also principled, in that we cover the theoretical underpinnings and don't shy away from careful linguistic and computational analysis. We have tried to be pragmatic in striking a balance between theory and application, identifying the connections and the tensions. Finally, we recognize that you won't get through this unless it is also pleasurable, so we have tried to include many applications and examples that are interesting and entertaining, and sometimes whimsical.
Note that this book is not a reference work. Its coverage of Python and NLP is selective, and presented in a tutorial style. For reference material, please consult the substantial quantity of searchable resources available at http://python.org/ and http://www.nltk.org/.
This book is not an advanced computer science text. The content ranges from introductory to intermediate, and is directed at readers who want to learn how to analyze text using Python and the Natural Language Toolkit. To learn about advanced algorithms implemented in NLTK, you can examine the Python code linked from http://www.nltk.org/, and consult the other materials cited in this book.
What You Will Learn
By digging into the material presented here, you will learn:
How simple programs can help you manipulate and analyze language data, and how to write these programs
How key concepts from NLP and linguistics are used to describe and analyze language
How data structures and algorithms are used in NLP
How language data is stored in standard formats, and how data can be used to evaluate the performance of NLP techniques
Depending on your background, and your motivation for being interested in NLP, you will gain different kinds of skills and knowledge from this book, as set out in Table P-1.
Organization
The early chapters are organized in order of conceptual difficulty, starting with a practical introduction to language processing that shows how to explore interesting bodies of text using tiny Python programs (Chapters 1-3). This is followed by a chapter on structured programming (Chapter 4) that consolidates the programming topics scattered across the preceding chapters. After this, the pace picks up, and we move on to a series of chapters covering fundamental topics in language processing: tagging, classification, and information extraction (Chapters 5-7). The next three chapters look at ways to parse a sentence, recognize its syntactic structure, and construct representations of meaning (Chapters 8-10). The final chapter is devoted to linguistic data and how it can be managed effectively (Chapter 11). The book concludes with an Afterword, briefly discussing the past and future of the field.
Within each chapter, we switch between different styles of presentation. In one style, natural language is the driver. We analyze language, explore linguistic concepts, and use programming examples to support the discussion. We often employ Python constructs that have not been introduced systematically, so you can see their purpose before delving into the details of how and why they work. This is just like learning idiomatic expressions in a foreign language: you're able to buy a nice pastry without first having learned the intricacies of question formation. In the other style of presentation, the programming language will be the driver. We'll analyze programs, explore algorithms, and the linguistic examples will play a supporting role.
Each chapter ends with a series of graded exercises, which are useful for consolidating the material. The exercises are graded according to the following scheme: ☼ is for easy exercises that involve minor modifications to supplied code samples or other simple activities; ◑ is for intermediate exercises that explore an aspect of the material in more depth, requiring careful analysis and design; ★ is for difficult, open-ended tasks that will challenge your understanding of the material and force you to think independently (readers new to programming should skip these).
Each chapter has a further reading section and an online "extras" section at http://www.nltk.org/, with pointers to more advanced materials and online resources. Online versions of all the code examples are also available there.
Why Python?
Python is a simple yet powerful programming language with excellent functionality for processing linguistic data. Python can be downloaded for free from http://www.python.org/. Installers are available for all platforms.
Here is a five-line Python program that processes file.txt and prints all the words ending in ing:
>>> for line in open("file.txt"):
...     for word in line.split():
...         if word.endswith('ing'):
...             print word
This program illustrates some of the main features of Python. First, whitespace is used to nest lines of code; thus the line starting with if falls inside the scope of the previous line starting with for; this ensures that the ing test is performed for each word. Second, Python is object-oriented; each variable is an entity that has certain defined attributes and methods. For example, the value of the variable line is more than a sequence of characters. It is a string object that has a "method" (or operation) called split() that we can use to break a line into its words. To apply a method to an object, we write the object name, followed by a period, followed by the method name, i.e., line.split(). Third, methods have arguments expressed inside parentheses. For instance, in the example, word.endswith('ing') had the argument 'ing' to indicate that we wanted words ending with ing and not something else. Finally--and most importantly--Python is highly readable, so much so that it is fairly easy to guess what this program does even if you have never written a program before.
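To see the two string methods used above in isolation, here is a short interactive sketch; the sample string is our own invented example rather than one from the book, but split() and endswith() are standard Python string methods:
>>> line = "the cat was purring loudly"
>>> line.split()                  # break the string into a list of words
['the', 'cat', 'was', 'purring', 'loudly']
>>> 'purring'.endswith('ing')     # test whether a word carries the ing suffix
True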
We chose Python because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality. As an interpreted language, Python facilitates interactive exploration. As an object-oriented language, Python permits data and methods to be encapsulated and re-used easily. As a dynamic language, Python permits attributes to be added to objects on the fly, and permits variables to be typed dynamically, facilitating rapid development. Python comes with an extensive standard library, including components for graphical programming, numerical processing, and web connectivity.
Python is heavily used in industry, scientific research, and education around the world. Python is often praised for the way it facilitates productivity, quality, and maintainability of software. A collection of Python success stories is posted at http://www.python.org/about/success/.
NLTK defines an infrastructure that can be used to build NLP programs in Python. It provides basic classes for representing data relevant to natural language processing; standard interfaces for performing tasks such as part-of-speech tagging, syntactic parsing, and text classification; and standard implementations for each task that can be combined to solve complex problems.
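As a brief illustration of what these interfaces look like in practice, the following sketch tokenizes a sentence and tags each token with its part of speech. The sentence and variable names are our own, and running it assumes that NLTK and the data packages used by its default tokenizer and tagger have been installed:
>>> import nltk
>>> sentence = "NLTK makes natural language processing approachable."
>>> tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
>>> print nltk.pos_tag(tokens)              # a list of (word, tag) pairs, one per token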
NLTK comes with extensive documentation. In addition to this book, the website at http://www.nltk.org/ provides API documentation that covers every module, class, and function in the toolkit, specifying parameters and giving examples of usage. The website also provides many HOWTOs with extensive examples and test cases, intended for users, developers, and instructors.
Software Requirements
To get the most out of this book, you should install several free software packages. Current download pointers and instructions are available at http://www.nltk.org/.
Python
The material presented in this book assumes that you are using Python version 2.4 or 2.5. We are committed to porting NLTK to Python 3.0 once the libraries that NLTK depends on have been ported.
NLTK
The code examples in this book use NLTK version 2.0. Subsequent releases of NLTK will be backward-compatible.
NLTK-Data
This contains the linguistic corpora that are analyzed and processed in the book.
NumPy (recommended)
This is a scientific computing library with support for multidimensional arrays and linear algebra, required for certain probability, tagging, clustering, and classification tasks.
Matplotlib (recommended)
This is a 2D plotting library for data visualization, and is used in some of the book's code samples that produce line graphs and bar charts.
NetworkX (optional)
This is a library for storing and manipulating network structures consisting of nodes and edges. For visualizing semantic networks, also install the Graphviz library.
Prover9 (optional)
This is an automated theorem prover for first-order and equational logic, used to support inference in language processing.
Natural Language Toolkit (NLTK)
NLTK was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects. Table P-2 lists the most important NLTK modules.
Table P-2. Language processing tasks and corresponding NLTK modules, with examples of functionality
NLTK was designed with four primary goals in mind:
Simplicity
To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data
Consistency
To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names
Extensibility
To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
Modularity
To provide components that can be used independently without needing to understand the rest of the toolkit
Contrasting with these goals are three non-requirements--potentially useful qualities that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic; it is a toolkit, not a system, and it will continue to evolve with the field of NLP. Second, while the toolkit is efficient enough to support meaningful tasks, it is not highly optimized for runtime performance; such optimizations often involve more complex algorithms, or implementations in lower-level programming languages such as C or C++. This would make the software less readable and more difficult to install. Third, we have tried to avoid clever programming tricks, since we believe that clear implementations are preferable to ingenious yet indecipherable ones.
For Instructors
Natural Language Processing is often taught within the confines of a single-semester course at the advanced undergraduate level or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content. NLTK was originally developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.
A significant fraction of any NLP syllabus deals with algorithms and data structures. On their own these can be rather dry, but NLTK brings them to life with the help of interactive graphical user interfaces that make it possible to view algorithms step-by-step. Most NLTK components include a demonstration that performs an interesting task without requiring any special input from the user. An effective way to deliver the materials is through interactive presentation of the examples in this book, entering them in a Python session, observing what they do, and modifying them to explore some empirical or theoretical issue.
This book contains hundreds of exercises that can be used as the basis for student assignments. The simplest exercises involve modifying a supplied program fragment in a specified way in order to answer a concrete question. At the other end of the spectrum, NLTK provides a flexible framework for graduate-level research projects, with standard implementations of all the basic data structures and algorithms, interfaces to dozens of widely used datasets (corpora), and a flexible and extensible architecture. Additional support for teaching using NLTK is available on the NLTK website.
We believe this book is unique in providing a comprehensive framework for students to learn about NLP in the context of learning to program. What sets these materials apart is the tight coupling of the chapters and exercises with NLTK, giving students--even those with no prior programming experience--a practical introduction to NLP. After completing these materials, students will be ready to attempt one of the more advanced textbooks, such as Speech and Language Processing, by Jurafsky and Martin (Prentice Hall, 2008).
This book presents programming concepts in an unusual order, beginning with a nontrivial data type--lists of strings--then introducing non-trivial control structures such as comprehensions and conditionals. These idioms permit us to do useful language processing from the start. Once this motivation is in place, we return to a systematic presentation of fundamental concepts such as strings, loops, files, and so forth. In this way, we cover the same ground as more conventional approaches, without expecting readers to be interested in the programming language for its own sake.
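As a tiny illustration of this style (the word list is our own example, not one from the book), a list of strings can be filtered and transformed with a single comprehension and conditional before loops and other basics have been covered formally:
>>> words = ['Natural', 'Language', 'Processing']
>>> [w.lower() for w in words if len(w) > 7]   # keep only the longer words, lowercased
['language', 'processing']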
Two possible course plans are illustrated in Table P-3. The first one presumes an arts/humanities audience, whereas the second one presumes a science/engineering audience. Other course plans could cover the first five chapters, then devote the remaining time to a single area, such as text classification (Chapters 6 and 7), syntax (Chapters 8 and 9), semantics (Chapter 10), or linguistic data management (Chapter 11).
Conventions Used in This Book
The following typographical conventions are used in this book:
Bold
Indicates new terms.
Italic
Used within paragraphs to refer to linguistic examples, the names of texts, and URLs; also used for filenames and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, statements, and keywords; also used for program names.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context; also used for metavariables within program code examples.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. Copyright 2009 Steven Bird, Ewan Klein, and Edward Loper, 978-0-596-51649-9."
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
Safari® Books Online
When you see a Safari® Books Online icon on the cover of your favorite technology book, that means the book is available online through the O'Reilly Network Safari Bookshelf.
Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://my.safaribooksonline.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:
http://www.oreilly.com/catalog/9780596516499
The authors provide additional materials for each chapter via the NLTK website at: http://www.nltk.org/
To comment or ask technical questions about this book, send email to: bookquestions@oreilly.com
For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our website at:
http://www.oreilly.com
Reviews
"It is rare to find a book with such a clear approach and such clean code that tackles so difficult a computing problem... This is an excellent introduction to natural language processing."
--Ken Getz, Senior Consultant, MCW Technologies