Team LiB
Previous Section Next Section

10.6. Using Containers: Text-Query Program

10.6. 容器的综合应用:文本查询程序

To conclude this chapter, we'll implement a simple text-query program.

我们将实现一个简单的文本查询程序来结束本章。

Our program will read a file specified by the user and then allow the user to search the file for words that might occur in it. The result of a query will be the number of times the word occurs and a list of lines on which it appears. If a word occurs more than once on the same line, our program should be smart enough to display that line only once. Lines should be displayed in ascending orderthat is, line 7 should be displayed before line 9, and so on.

我们的程序将读取用户指定的任意文本文件,然后允许用户从该文件中查找单词。查询的结果是该单词出现的次数,并列出每次出现所在的行。如果某单词在同一行中多次出现,程序将只显示该行一次。行号按升序显示,即第 7 行应该在第 9 行之前输出,依此类推。

For example, we might read the file that contains the input for this chapter and look for the word "element." The first few lines of the output would be:

例如,以本章的内容作为文件输入,然后查找单词“element”。输出的前几行应为:

     element occurs 125 times
        (line 62) element with a given key.
        (line 64) second element with the same key.
        (line 153) element |==| operator.
        (line 250) the element type.
        (line 398) corresponding element.

followed by the remaining 120 or so lines in which the word "element" occurs.

后面省略了大约 120 行。

10.6.1. Design of the Query Program

10.6.1. 查询程序的设计

A good way to start the design of a program is to list the program's operations. Knowing what operations we need to provide can then help us see what data structures we'll need and how we might implement those actions. Starting from requirements, the tasks our program needs to support include:

设计程序的一个良好习惯是首先将程序所涉及的操作列出来。明确需要提供的操作有助于建立需要的数据结构和实现这些行为。从需求出发,我们的程序需要支持如下任务:

  1. It must allow the user to indicate the name of a file to process. The program will store the contents of the file so that we can display the original line in which each word appears.

    它必须允许用户指明要处理的文件名字。程序将存储该文件的内容,以便输出每个单词所在的原始行。

  2. The program must break each line into words and remember all the lines in which each word appears. When it prints the line numbers, they should be presented in ascending order and contain no duplicates.

    它必须将每一行分解为各个单词,并记录每个单词所在的所有行。在输出行号时,应保证以升序输出,并且不重复。

  3. The result of querying for a particular word should be the line numbers on which that word appeared.

    对特定单词的查询将返回出现该单词的所有行的行号。

  4. To print the text in which the word appeared, it must be possible to fetch the line from the input file corresponding to a given line number.

    输出某单词所在的行文本时,程序必须能根据给定的行号从输入文件中获取相应的行。

Data Structure
数据结构

We'll implement our program as a simple class that we'll name TextQuery. Our requirements can be met quite neatly by using various containers:

我们将用一个简单的类 TextQuery 实现这个程序。再加几种容器的配合使用,就可相当巧妙地满足上述要求。

  1. We'll use a vector<string> to store a copy of the entire input file. Each line in the input file will be an element in this vector. That way, when we want to print a line, we can fetch it by using the line number as an index.

    使用一个 vector<string> 类型的对象存储整个输入文件的副本。输入文件的每一行是该 vector 对象的一个元素。因而,在希望输出某一行时,只需以行号为下标获取该行所在的元素即可。

  2. We'll store each word's line numbers in a set. Using a set will guarantee that there is only one entry per line and that the line numbers will be automatically stored in ascending order.

    将每个单词所在的行号存储在一个 set 容器对象中。使用 set 就可确保每行只有一个条目,而且行号将自动按升序排列。

  3. We'll use a map to associate each word with the set of line numbers on which the word appears.

    使用一个 map 容器将每个单词与一个 set 容器对象关联起来,该 set 容器对象记录此单词所在的行号。

Our class will have two data members: a vector to hold the input file and a map to associate each input word with the set of line numbers on which it appears.

综上所述,我们定义的 TextQuery 类将有两个数据成员:储存输入文件的 vector 对象,以及一个 map 容器对象,该对象关联每个输入的单词以及记录该单词所在行号的 set 容器对象。

Operations
操作

The requirements also lead fairly directly to an interface for our class. However, we have one important design decision to make first: The function that does the query will need to return a set of line numbers. What type should it use to do so?

对于类还要求有良好的接口。然而,一个重要的设计策略首先要确定:查询函数需返回存储一组行号的 set 对象。这个返回类型应该如何设计呢?

We can see that doing the query will be simple: We'll index into the map to obtain the associated set. The only question is how to return the set that we find. The safest design is to return a copy of that set. However, doing so means that each element in the set must be copied. Copying the set could be expensive if we process a very large file. Other possible return values are a pair of iterators into the set, or a const reference to the set. For simplicity, we'll return a copy, noting that this decision is one that we might have to revisit if the copy is too expensive in practice.

事实上,查询的过程相当简单:使用下标访问 map 对象获取关联的 set 对象即可。唯一的问题是如何返回所找到的 set 对象。安全的设计方案是返回该 set 对象的副本。但如此一来,就意味着要复制 set 中的每个元素。如果处理的是一个相当庞大的文件,则复制 set 对象的代价会非常昂贵。其他可行的方法包括:返回一个 pair 对象,存储一对指向 set 中元素的迭代器;或者返回 set 对象的 const 引用。为简单起见,我们在这里采用返回副本的方法,但注意:如果在实际应用中复制代价太大,需要新考虑其实现方法。

The first, third, and fourth tasks are actions programmers using our class will perform. The second task is internal to the class. Mapping these tasks to member functions, we'll have three public functions in the class interface:

第一、第三和第四个任务是使用这个类的程序员将执行的动作。第二个任务则是类的内部任务。将这四任务映射为类的成员函数,则类的接口需提供下列三个 public 函数:

  • read_file takes an ifstream&, which it reads a line at a time, storing the lines in the vector. Once it has read all the input, read_file will create the map that associates each word to the line numbers on which it appears.

    read_file 成员函数,其形参为一个 ifstream& 类型对象。该函数每次从文件中读入一行,并将它保存在 vector 容器中。输入完毕后,read_file 将创建关联每个单词及其所在行号的 map 容器。

  • run_query takes a string and returns the set of line numbers on which that string appears.

    run_query 成员函数,其形参为一个 string 类型对象,返回一个 set 对象,该 set 对象包含出现该 string 对象的所有行的行号。

  • text_line takes a line number and returns the corresponding text for that line from the input file.

    text_line 成员函数,其形参为一个行号,返回输入文本中该行号对应的文本行。

Neither run_query nor text_line changes the object on which it runs, so we'll define these operations as const member functions (Section 7.7.1, p. 260).

无论 run_query 还是 text_line 都不会修改调用此函数的对象,因此,可将这两个操作定义为 const 成员函数(第 7.7.1 节)。

To do the work of read_file, we'll also define two private functions to read the input file and build the map:

为实现 read_file 功能,还需定义两个 private 函数来读取输入文本和创建 map 容器:

  • store_file will read the file and store the data in our vector.

    store_file 函数读入文件,并将文件内容存储在 vector 容器对象中。

  • build_map will break each line into words and build the map, remembering the line number on which each word appeared.

    build_map 函数将每一行分解为各个单词,创建 map 容器对象,同时记录每个单词出现行号。

10.6.2. TextQuery Class

10.6.2. TextQuery

Having worked through our design, we can now write our TextQuery class:

经过前面的设计后,现在可以编写 TextQuery 类了:

     class TextQuery {
     public:
         // typedef to make declarations easier
         typedef std::vector<std::string>::size_type line_no;
         /* interface:
          * read_file builds internal data structures for the given file
          * run_query finds the given word and returns set of lines on which it appears
          * text_line returns a requested line from the input file
          */
         void read_file(std::ifstream &is)
                    { store_file(is); build_map(); }
         std::set<line_no> run_query(const std::string&) const;
         std::string text_line(line_no) const;
     private:
         // utility functions used by read_file
         void store_file(std::ifstream&); // store input file
         void build_map(); // associated each word with a set of line numbers
         // remember the whole input file
         std::vector<std::string> lines_of_text;
         // map word to set of the lines on which it occurs
         std::map< std::string, std::set<line_no> > word_map;
     };

The class directly reflects our design decisions. The only part we hadn't described is the typedef that defines an alias for size_type of vector.

这个类直接反映了我们的设计策略。唯一提及的是使用 typedefvectorsize_type 定义了一个别名。

For the reasons described on page 80, this class definition uses fully qualified std:: names for all references to library entities.

基于第 3.1 节所提及的原因,这个类的定义在引用标准库内容时都必须完整地使用 std:: 限定符。



The read_file function is defined inside the class. It calls store_file to read and store the input file and build_map to build the map from words to line numbers. We'll define the other functions in Section 10.6.4 (p. 385). First, we'll write a program that uses this class to solve our text-query problem.

read_file 函数在类的内部定义。该函数首先调用 store_file 读取并保存输入文件,然后调用 build_map 创建关联单词与行号的 map 容器。该类的其他函数将在第 10.6.4 节定义。首先,我们编写一个程序,使用这个类来解决文本查询问题。

Exercises Section 10.6.2

Exercise 10.30:

The member functions of TextQuery use only capabilities that we have already covered. Without looking ahead, write your own versions of the member functions. Hint: The only tricky part is what to return from run_query if the line number set is empty. The solution is to construct and return a new (temporary) set.

TextQuery 类的成员函数仅使用了前面介绍过的内容。先别查看后面章节,请自己编写这些成员函数。提示:唯一棘手的是 run_query 函数在行号集合 set 为空时应返回什么值?解决方法是构造并返回一个新的(临时)set 对象。


10.6.3. Using the TextQuery Class

10.6.3. TextQuery 类的使用

The following main program uses a TextQuery object to perform a simple query session with the user. Most of the work in this program involves managing the interaction with the user: prompting for the next search word and calling the print_results functionwhich we shall write nextto print the results.

下面的主程序 main 使用 TextQuery 对象实现简单的用户查询会话。这段程序的主要工作是实现与用户的互动:提示输入下一个要查询的单词,然后调用 print_results 函数(将在下面定义)输出结果。

     // program takes single argument specifying the file to query
     int main(int argc, char **argv)
     {
         // open the file from which user will query words
         ifstream infile;
         if (argc < 2 || !open_file(infile, argv[1])) {
             cerr << "No input file!" << endl;
             return EXIT_FAILURE;
         }
         TextQuery tq;
         tq.read_file(infile); // builds query map
         // iterate with the user: prompt for a word to find and print results
         // loop indefinitely; the loop exit is inside the while
         while (true) {
             cout << "enter word to look for, or q to quit: ";
             string s;
             cin >> s;
             // stop if hit eof on input or a 'q'is entered
             if (!cin || s == "q") break;
             // get the set of line numbers on which this word appears
             set<TextQuery::line_no> locs = tq.run_query(s);
             // print count and all occurrences, if any
             print_results(locs, s, tq);
          }
         return 0;
     }

Preliminaries
引子

This program checks that argv[1] is valid and then uses the open_file function (Section 8.4.3, p. 299) to open the file we're given as an argument to main. We test the stream to determine whether the input file is okay. If not, we generate an appropriate message and exit, returning EXIT_FAILURE (Section 7.3.2, p. 247) to indicate that an error occurred.

程序首先检查 argv[1] 是否合法,然后调用 open_file 函数(第 8.4.3 节)打开以 main 函数实参形式给出的文件。检查流以判断输入文件是否正确。如果不正确,就给出适当的提示信息结束程序的运行,返回 EXIT_FAILURE第 7.3.2 节)说明发生了错误。

Once we have opened the file, it is a simple matter to build up the map that will support queries. We define a local variable named tq to hold the file and associated data structures:

一旦文件成功打开,建立支持查询的 map 容器就相当简单。定义一个局部变量 tq 来保存该输入文件和所关联的数据结构:

     TextQuery tq;
     tq.read_file(infile);   builds query map

We call the read_file operation on tq, passing it the file opened by open_file.

tq 调用 read_file 操作,并将由 open_file 打开的文件传递给此函数。

After read_file completes, tq holds our two data structures: the vector that corresponds to the input file and the map from word to set of line numbers. That map contains an entry for each unique word in the input file. The map associates with each word the set of line numbers on which that word appeared.

read_file 完成后,tq 存储了两个数据结构:保存输入文件的 vector 对象,以及关联单词和行号的 map 容器对象。map 容器为输入文件中的每个单词建立唯一的元素,由每个单词关联的 set 容器记录了该单词出现的行号。

Doing the Search
实现查询

We want the program to let the user look for more than one word in each session, so we wrap the prompt in a while loop:

为了使用户在每次会话时都能查询多个单词,我们将提示语句也置于 while 循环中:

      // iterate with the user: prompt for a word to find and print results
      // loop indefinitely; the loop exit is inside the while
       while (true) {
          cout << "enter word to look for, or q to quit: ";
          string s;
          cin >> s;
          // stop if hit eof on input or a 'q' is entered
          if (!cin || s == "q") break;
          // get the set of line numbers on which this word appears
          set<TextQuery::line_no> locs = tq.run_query(s);
          // print count and all occurrences, if any
          print_results(locs, s, tq);
      }

The test in the while is the boolean literal true, which means that the test always succeeds. We exit the loop through the break after the test on cin and the value read into sought. The loop exits when cin hits an error or end-of-file or when the user enters a 'q' to quit.

while 循环条件为布尔字面值 true,这就意味着循环条件总是成立。在检查 cin 和读入 s 值后,由紧跟的 break 语句跳出循环。具体说来,当 cin 遇到错误或文件结束,或者用户输入 q 时,循环结束。

Once we have a word to look for, we ask tq for the set of line numbers on which that word appears. We pass that set along with the word we are looking for and the TextQuery object to the print_results function. That function will write the output of our program.

每次要查找一个单词时,访问 tq 获取记录该单词出现的行号的 set 对象。将 set 对象、要查找的单词和 TextQuery 对象作为参数传递给 print_results 函数,该函数输出查询结果。

Printing the Results
输出结果

What remains is to define the print_results function:

现在只剩下 print_results 函数的定义:

      void print_results(const set<TextQuery::line_no>& locs,
                         const string& sought, const TextQuery &file)
      {
          // if the word was found, then print count and all occurrences
          typedef set<TextQuery::line_no> line_nums;
          line_nums::size_type size = locs.size();
          cout << "\n" << sought << " occurs "
               << size << " "
               << make_plural(size, "time", "s") << endl;

          // print each line in which the word appeared
          line_nums::const_iterator it = locs.begin();
          for ( ; it != locs.end(); ++it) {
              cout << "\t(line "
                   // don't confound user with text lines starting at 0
                   << (*it) + 1 << ") "
                   << file.text_line(*it) << endl;
          }
     }

The function starts by defining a typedef to simplify the use of the line number set. Our output first reports how many matches were found, which we know from the size of the set. We call make_plural (Section 7.3.2, p. 248) to print time or times, depending on whether that size is equal to one.

函数首先使用 typedef 简化记录行号的 set 容器对象的使用。输出时,首先给出查询到的匹配个数,即 set 对象的大小。然后调用 make_plural第 7.3.2 节),根据 size 是否为 1 输出“time”或“times”。

The messiest part of the program is the for loop that processes locs to print the line numbers on which the word was found. The only subtlety here is remembering to change the line numbers into more human-friendly counting. When we stored the text, we stored the first line as line number zero, which is consistent with how C++ containers and arrays are numbered. Most users think of the first line as line number 1, so we systematically add one to our stored line numbers to convert to this more common notation.

这段程序最复杂的部分是处理 locs 对象的 for 循环,用于输出找到该单词的行号。其唯一的微妙之处是记得将行号修改为更友好的形式输出。为了与 C++ 的容器和数组下标编号匹配,在储存文本时,我们以行号 0 存储第一行。但考虑到很多用户会默认第一行的行号为 1,所以输出行号时,相应地所存储的行号上加 1 使之转换为更通用的形式。

Exercises Section 10.6.3

Exercise 10.31:

What is the output of main if we look for a word that is not found?

如果没有找到要查询的单词,main 函数输出什么?


10.6.4. Writing the Member Functions

10.6.4. 编写成员函数

What remains is to write the definitions of the member functions that were not defined inside the class.

现在给没有在类内定义的成员函数编写定义。

Storing the Input File
存储输入文件

Our first task is to read the file that our user wishes to query. Using string and vector operations, this task is handled easily:

第一个任务是读入需要查询的文件。使用 stringvector 容器提供的操作,可以很简便地实现这个任务:

     // read input file: store each line as element in lines_of_text
     void TextQuery::store_file(ifstream &is)
     {
         string textline;
         while (getline(is, textline))
            lines_of_text.push_back(textline);
     }

Because we want to store the file a line at a time, we use getline to read our input. We push each line we read onto the lines_of_text vector.

由于我们希望每次存储文件的一行内容,因此使用 getline 读取输入,每读入一行就将它添加到名为 lines_of_textvector 对象中。

Building the Word map
建立单词 map 容器

Each element in the vector holds a line of text. To build the map from words to line numbers, we need to break each line into its individual words. We again use an istringstream in ways outlined in the program on page 300:

vector 容器中的每个元素就是一行文本。要建立一个从单词关联到行号的 map 容器,必须将每行分解为各个单词。再次使用第 8.5 节描述的 istringstream

      // finds whitespace-separated words in the input vector
      // and puts the word in word_map along with the line number
      void TextQuery::build_map()
      {
          // process each line from the input vector
          for (line_no line_num = 0;
                       line_num != lines_of_text.size();
                       ++line_num)
          {
              //we'll use line to read the text a word at a time
              istringstream line(lines_of_text[line_num]);
              string word;
              while (line >> word)
                  // add this line number to the set;
                  // subscript will add word to the map if it's not already there
                  word_map[word].insert(line_num);
          }
     }

The for loop marches through lines_of_text a line at a time. We start by binding an istringstream object named line to the current line and use the istringstream input operator to read each word on the line. Remember that that operator, like the other istream operators, ignores whitespace. Thus, the while reads each whitespace-separated word from line into word.

for 循环以每次一行的迭代过程遍历 lines_of_text。首先将 istringstream 对象 line 与当前行绑定起来,然后使用 istringstream 的输入操作符读入该行中的每个单词。回顾此类输入操作符,与其他 istream 操作符一样,将忽略空白符号。因此,while 循环将 line 中以空白符分隔的单词读取出来。

The last part of this function is similar to our word-count programs. We use word to subscript the map. If word was not already present, then the subscript operator adds word to the word_map, giving it an inital value that is the empty set. Regardless of whether word was added, the return value from the subscript is a set. We then call insert to add the current line number. If the word occurs more than once in the same line, then the call to insert does nothing.

这个函数的结尾部分类似前面的单词统计程序。将 word 用做 map 容器的下标。如果 wordword_map 容器对象中不存在,那么下标操作符将该 word 添加到此容器中,将将其关联的值初始化为空的 set。不管是否添加了 word,下标运算都返回一个 set 对象,然后调用 insert 函数在该 set 对象中添加当前行号。如果某个单词在同一行中重复出现,那么 insert 函数的调用将不做任何操作。

Supporting Queries
支持查询

The run_query function handles the actual queries:

run_query 函数实现真正的查询功能:

     set<TextQuery::line_no>
     TextQuery::run_query(const string &query_word) const
     {
         //Note: must use find and not subscript the map directly
         //to avoid adding words to word_map!
         map<string, set<line_no> >::const_iterator
                               loc = word_map.find(query_word);
     if (loc == word_map.end())
         return set<line_no>(); // not found, return empty set
     else
         // fetch and return set of line numbers for this word
         return loc->second;
     }

The run_query function takes a reference to a const string and uses that value to index the word_map. Assuming the string is found, it returns the set of line numbers associated with the string. Otherwise, it returns an empty set.

run_query 函数带有指向 const string 类型对象的引用参数,并以这个参数作为下标来访问 word_map 对象。假设成功找到这个 string,那么该函数返回关联此 stringset 对象,否则返回一个空的 set 对象。

Using the Return from run_query
run_query 返回值的使用

Once we've run the run_query function, we get back a set of line numbers on which the word we sought appears. In addition to printing how many times each word appears, we also want to print the line on which the word appeared. The text_line function lets us do so:

运行 run_query 函数后,将获得一组所查找的单词出现的行号。除了输出该单词的出现次数之外,还需要输出出现该单词的每一行。这就是 text_line 函数实现的功能:

   string TextQuery::text_line(line_no line) const
   {
       if (line < lines_of_text.size())
           return lines_of_text[line];
       throw std::out_of_range("line number out of range");
   }

This function takes a line number and returns the input text line corresponding to that line number. Because the code using our TextQuery class cannot do so lines_of_text is privatewe first check that the line we are asked for is in range. If it is, we return the corresponding line. If it is not, we tHRow an out_of_range exception.

该函数带有一个行号参数,返回该行号所对应的输入文本行。由于上述代码使用了 TextQuery 类,因此不能直接输出(因为 lines_of_text 是私有的),应该首先检查我们要查询的行是否们于合法范围内。如果是,则返回相应的行,否则,抛出 out_of_range 异常。

Exercises Section 10.6.4

Exercise 10.32:

Reimplement the text-query program to use a vector instead of a set to hold the line numbers. Note that because lines appear in ascending order, we can append a new line number to the vector only if the last element of the vector isn't that line number. What are the performance and design characteristics of the two solutions? Which do you feel is the preferred design solution? Why?

重新实现文本查询程序,使用 vector 容器代替 set 对象来存储行号。注意,由于行以升序出现,因此只有在当前行号不是 vector 容器对象中的最后一个元素时,才能将新行号添加到 vector 中。这两种实现方法的性能特点和设计特点分别是什么?你觉得哪一种解决方法更好?为什么?

Exercise 10.33:

Why doesn't the TextQuery::text_line function check whether its argument is negative?

TextQuery::text_line 函数为什么不检查它的参数是否为负数?


Team LiB
Previous Section Next Section