Recent innovations in indexing software

ISC conference attendees were treated to a tour of four of the most popular indexing programs.

TExtract

Harry Bego, developer of TExtract, came from the Netherlands to give us a presentation and demo of his “semi-automatic” indexing software. Having been a researcher in natural language processing at Tilburg University, Bego incorporated linguistic and statistical analysis algorithms into TExtract; these identify important terms and compile them all into an initial draft index, taking out a lot of the grunt work of data entry. Bego was quick to emphasize that the user is always completely in control. Although TExtract puts together the initial index automatically, the indexer can review each entry and choose whether to accept or discard it. For each entry, the program shows its frequency and “significance score.” Users can adjust the significance threshold of a text to control what kinds of terms are picked out in the initial index, and they can add filters to determine which terms the program should exclude or include.
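To make the idea of a significance threshold concrete, here is a minimal Python sketch. It is not TExtract’s algorithm, which the presentation did not spell out; the scoring formula (a term’s relative frequency in the text divided by its relative frequency in a background word list), the function name, and the sample data are all assumptions made up for illustration.

```python
from collections import Counter
import re

def draft_index_terms(text, background_freq, threshold=2.0, exclude=(), include=()):
    """Return {term: (frequency, score)} for candidate index terms.

    Hypothetical scoring: relative frequency in the text divided by the
    relative frequency in a background word list (not TExtract's method).
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    excluded = set(exclude)
    forced = set(include)
    results = {}
    for term, freq in counts.items():
        if term in excluded:
            continue  # exclusion filter always wins
        # Words missing from the background list get a small default,
        # so unusual terms score high.
        expected = background_freq.get(term, 0.0001)
        score = (freq / total) / expected
        if score >= threshold or term in forced:
            results[term] = (freq, round(score, 1))
    return results

# Made-up sample text and background frequencies.
sample = "The codex replaced the scroll. The codex made pagination possible."
background = {"the": 0.06, "made": 0.001, "possible": 0.001, "replaced": 0.0005}
terms = draft_index_terms(sample, background, threshold=500, exclude=["scroll"])
```

Raising or lowering the threshold changes how many terms survive into the draft, and the include and exclude lists stand in for the filters; the indexer would still review every entry.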

TExtract also has a “document replacement” feature that allows the indexer to compare a new version of the text with an old one and update the index accordingly. The entries are linked to the text—a feature that supports in-context navigation and editing.

Although TExtract is in itself a complete indexing program, Bego told us that TExtract outputs can then be fed into any of the other major indexing programs (such as the ones below), if an indexer is more comfortable editing on different software.

SKY Index

SKY Index’s developer, Kamm Schreiner, was unable to join us in person, but he sent a video that showed off some of this program’s features.

SKY has a spreadsheet-like interface and allows you to do data entry on the right-hand pane while a preview pane on the left shows you the index as it’s being built. Schreiner has built in several functions that in other programs might require a macro: SKY Index can easily consume subheadings, swap acronyms, and so on. The program also flags common errors for the indexer (e.g., adding a locator to an entry that already has a See cross-reference).

SKY Index has an edit view, which allows for quick and efficient editing. You can open up a browse pane that allows you to see and compare two separate sections of the index side by side. The browse pane can be used in data entry view as well, and the program bookmarks where you left off before opening the browse pane.

More information and videos are available on the SKY Software website.

CINDEX

Frances Lennie, a freelance indexer since 1977, established Indexing Research in 1986 to develop CINDEX, which is available on both Windows and Mac. The company also features a publishers’ edition of CINDEX, available only on Windows, which accommodates multi-user production environments, such as legal publishing houses or government publishing houses.

CINDEX uses an index card metaphor and allows up to 15 levels of subheadings. The indexer enters data in the record entry window while the index builds in the background. All records are date and time stamped.

CINDEX is fully Unicode compliant and supports different sorting conventions and spell checking in a variety of languages. Lennie gave us a demonstration of an index created in Hebrew, which reads right to left and so has entries in inverted order (locator on the left, entry on the right).

Lennie showed how CINDEX supports searching specific entries based on a character string, page range, or style attributes. Editing is also easy: CINDEX offers both global and individual editing options, and it also uses a variety of techniques to check the integrity of the index. Finally, the index may be exported in many different formats and file types.

Macrex

Gale Rhoades, the North American publisher of Macrex, gave us a demo of Macrex’s just-released ninth version. Macrex is a Windows-based program that was first developed in 1981. It boasts a seemingly overwhelming list of features, but the secret to using it, explained Rhoades, is to focus only on your current project and not worry about what may seem like overhead. She has worked with many indexers over the years, she explained, and she has always managed to help them configure Macrex to do what they need it to do.

Macrex is big on giving the user control. Entries are written directly in the index; different components (e.g., cross-references and locators) are colour coded, and you can change the colour palette to suit your tastes. Further, you can create a folder with a particular client’s specifications (e.g., sort order, layout, cross-reference format, and output format) and use it essentially as a template for all of the work you do for that client.

Rhoades emphasized the client support that you get with Macrex. She hosts a weekly chat session for North American Macrex users to talk about indexing and software issues; she can also connect directly to your computer to troubleshoot Macrex problems.

Louise Spiteri—User-generated metadata: boon or bust for indexing and controlled vocabularies? (ISC conference 2013)

Louise Spiteri is the director of the School of Information Management at Dalhousie University, and she spoke at the ISC conference about social tagging and folksonomies. As a trained cataloguer, Spiteri said to us, “I’m a firm believer in controlled vocabularies, but we have to accept the fact that that’s not what our clients use.” She added, “User-generated metadata is here. Let’s accept it and learn to work with it rather than against it.”

Traditionally, a document’s metadata has been the purview of cataloguers, information architects, and professional indexers. Users could search for an item based on its existing classification, but they couldn’t amend that item’s categorization and organization based on their own needs and understanding.

In recent years, however, many blog and social media platforms have made it possible for users to store and categorize items—blog posts, photos, music, articles, and so on—based on their interests. They can organize these items by adding their own keywords, and in many cases they can add further metadata in the form of ratings or reviews.

Users typically add keywords using tags, which are non-hierarchical. A social dimension to user tagging was popularized by such sites as Delicious, CiteULike, and Flickr, on which users could not only tag information but also share those tags with a wider community. The product of such a community’s collective tagging efforts is a folksonomy (a portmanteau of “folk” and “taxonomy”): the set of terms that a group of users has used to tag content. Although such a set is open and uncontrolled, some sites offer tag recommendations based on what others have assigned, allowing for the potential for consensus.

User tagging has its limitations, of course—from ambiguity and polysemy (does the tag “port” refer to wine or a computer port or the left side of a ship?) to synonymy (especially in cases of spelling variants and singular versus plural nouns) to variations in the level of their specificity—but it can also be enormously powerful. In some communities, for example, dedicated users—avid fans who are intimately familiar with the content—can generate a set of tags that are more useful and informative than classifications offered by the vendor or a cataloguer, who is more likely to do the minimum level of cataloguing. Social tagging’s major strength is that terms can be individualized to users’ own needs. Further, folksonomies can adapt quickly to changes in user vocabulary, accommodating new terms with virtually no cost to the user or the system. Over time, particularly if the platform supports recommendations for tags, an item’s tags will tend to stabilize into an organically curated set.
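As a rough illustration of the synonymy problem, the Python sketch below folds spelling variants and simple plurals into one form before counting tags. The variant table and the naive plural rule are assumptions invented for this example, not anything Spiteri presented. Note that no amount of normalizing resolves polysemy: “port” stays ambiguous.

```python
from collections import Counter

# Hypothetical table of spelling variants mapped to a preferred form.
VARIANTS = {"colour": "color", "catalogue": "catalog"}

def normalize_tag(tag):
    """Lowercase a tag, strip a simple plural "s", and fold known variants."""
    tag = tag.strip().lower()
    # Naive rule: drop one trailing "s" (but not "ss"); real plurals need more care.
    if len(tag) > 3 and tag.endswith("s") and not tag.endswith("ss"):
        tag = tag[:-1]
    return VARIANTS.get(tag, tag)

def tag_counts(tags):
    """Count tags after normalization."""
    return Counter(normalize_tag(t) for t in tags)

# Made-up tags as several users might have entered them.
counts = tag_counts(["Cats", "cat", "colour", "Color", "colours", "port"])
```

A platform that recommends tags is doing something similar socially rather than mechanically: it nudges users toward a form that others have already chosen.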

Spiteri also briefly discussed newer forms of social tagging, including hashtags and geotags. Hashtags, common on Twitter, Tumblr, Instagram, and now Facebook, allow users to quickly follow a stream of content about a particular topic. However, they suffer the same problems as uncontrolled vocabularies; Spiteri strongly advocated promoting an official hashtag for a public event so that everyone uses the same one and the conversation isn’t split among multiple streams. Geotags, by contrast, add geographic metadata to information—allowing users to follow location-based news or identify the place a photo was taken, for example—and because they are often given in numerical format, such as latitude, longitude, and altitude, they are likely to be more consistent.

Social tagging, emphasized Spiteri, isn’t going away. How do we indexers work with it? Ideally, we would have a system that combines both controlled vocabularies and tags. On many blogs, for example, you can assign a post to one or more categories, which can be tightly controlled. User tags can then supplement or complement these categories, serving special user-focused functions. For instance, in multi-cultural communities, users can tag an item in their own language. Tags can also connect like-minded users, a function that controlled vocabularies don’t readily support. Most importantly, indexers can learn from user tags, adapting their subject headings to the language of their clients.

Caroline Diepeveen—Team indexing: The way forward? (ISC conference 2013)

Caroline Diepeveen led a small team that indexed the five-volume Encyclopedia of Jews in the Islamic World (EJIW), published by Brill. Her efforts, along with those of her co-indexers, Pierke Bosschieter and Jacqueline Belder, won the team the Society of Indexers’ Wheatley Medal in 2011. Most gratifying for Diepeveen was the jury’s remark that they couldn’t tell that this index had been composed as a team.

Indexers are used to working in isolation, Diepeveen said, and some seem averse to the idea of working in a team. But her own experience with EJIW was positive, and in a small survey she conducted about team indexing, with eleven indexers responding, she found that 73% had had good experiences, while 27% said that their experience was okay; nobody had found team indexing particularly negative. The respondents had mostly worked in groups of two or three and used such strategies as constant discussion and a controlled vocabulary to achieve consistency in their work. Many teams had one main indexer who was responsible for putting the team together and ensuring the quality of the final product.

In Diepeveen’s case, team indexing became a necessity because of EJIW’s project deadlines. She had initially signed on as the encyclopedia’s sole indexer. In theory, the encyclopedia would be built one article at a time; the editors expected a steady flow of articles from the authors, and Diepeveen could index at her leisure. In reality, the bulk of the articles came at the end, and the options for the publisher were to extend the deadline or to bring in more indexers.

Fortunately, the encyclopedia itself was compiled using a sophisticated content management system (CMS) with a fine-tuned workflow. Team members were allowed access to only the parts of the CMS that they needed; authors from all over the world contributed articles directly into the CMS, which were then edited by a team of editors and finally released for indexing. With the CMS, articles could easily be assigned to one indexer and then reassigned as needed; there was no need to mail files around. (Brill had attempted to develop a software module that allowed embedded indexing directly in the CMS, but the first version of the indexing module didn’t allow basic indexing features, such as selecting a range, and so was deemed unacceptable. In the end, the index was not fully embedded and instead was compiled using anchors in the text as locators.)

Serving as team captain, Diepeveen not only put together the indexing team but also oversaw her team’s work. She had already done some of the indexing before she brought on the other indexers, so the other team members could use her work as a reference. Helpfully, the articles in the CMS showed all indexed terms highlighted in green, and Diepeveen could easily see whether her teammates were over- or under-indexing and provide feedback as needed. She emphasized the importance of regularly communicating with team members to build trust and a strong working relationship. Geographically separated team members may not be able to meet in person, but teleconferencing and web conferencing go a long way in clarifying roles and tasks, not to mention allowing team members a chance to get to know one another.

To keep the process running smoothly, the team had to lay some groundwork:

  • Diepeveen did a thorough edit of the index near the start of the project so that all team members would have a basic structure to work towards.
  • The team disallowed double postings; cross-references could be converted to double postings at the very end if needed.
  • The team stipulated that all entries must have a subheading. When you see only one part of a publication, you don’t know how much weight or detail is given to a particular subject in another part of the publication. Again, unnecessary subheadings could be edited out at the very end if needed.

Most importantly, Diepeveen said, the team “kept asking questions. EJIW worked almost like peer review on the go. We asked each other, ‘Why did you decide to do things this way?’ We kept each other sharp by asking questions. That improved the quality of the index.”

As larger and larger electronic publications become the norm, Diepeveen said, team indexing will probably become more common. Emerging technological tools may help with the logistics, but the most important aspect of team indexing, she reiterated, was the team itself. It is critical to invest in trust, not only at the beginning of the project but also regularly throughout.

Going out of style

This post also appears on The Editors’ Weekly, the Editors’ Association of Canada’s official blog.

***

“It’s house style.”

I don’t think I fully appreciated the power of that sentence until I could no longer use it. As editors we’re constantly striving to balance the needs of the publisher, author and reader, but with the growth of self- and custom publishing, the needs of the publisher are becoming irrelevant in more and more projects. You’d think having one less item for editors to juggle would make our lives easier, but upholding certain editorial standards can get tricky when the author is the publisher and the client.

As much as we’d love to believe that all parties in the traditional publisher–editor–author relationship are equal, in reality the publisher holds the bulk of the power, as the one who signs the editor’s paycheque and decides whether an author’s work will make it to market. I’ve leveraged that tacit hierarchy (which is particularly apparent in academic publishing, where production efficiency is a priority, and corporate publishing, where brand building is important) when working with authors: in some cases it has allowed me to build a sibling-like rapport with them. We understand that we’re both answering to “Mom,” and when I invoke house style, I’m pretty much saying, “Mom says you need to go mow the lawn. Sorry.” I’m not that sorry, of course: I know that following these house rules will, in general, give us an ultimately stronger, more consistent text.

In the editor–self-publishing author dynamic, however, the concept of house style is meaningless, and because the author’s the one paying the editor’s invoices, the editor has to be prepared to bend. As Amy Einsohn advises in The Copyeditor’s Handbook, “For the working copyeditor, deference is the better part of valor: if the author’s preference is at all acceptable, it should be respected.” (p. 336) The author insists on uppercasing all corporate titles? Well, okay. The text might look a bit funny, but readers probably won’t get confused. But what if an author wants to do something so unconventional that even a casual reader would be baffled, and responds to “But The Chicago Manual of Style says…” with “I don’t care!”?

When editing sans publisher, other instruments of persuasion in our usual editorial toolkit can also lose their power. Even “I’m concerned that this editorial approach will hurt your sales” doesn’t always work, because many self-publishing authors either won’t have thought that far ahead or won’t believe you. I’ve had more success with “Your readers won’t be used to seeing that style; they’ll think it’s a mistake, which might affect your credibility,” but when that strategy fails and the author really digs in his or her heels, I have to take a step back and remind myself whose name is on the book.

Of course, I’m by no means suggesting that editors should blindly follow house style even when it is available; in many cases doing so would be to the detriment of the text. What I am saying is that I miss being able to lean on house style—and other aspects of a publisher’s editorial vision—when the publisher is out of the picture, and I have to admit that those situations leave me feeling a bit more vulnerable in my dealings with the author. But with that vulnerability, I suppose, comes a creative freedom that may allow this minimalist publication team to produce some truly innovative work.

What strategies have you developed to persuade your self-publishing authors that your style choice is in the best interest of their text?

Pilar Wyman—Metadata, marketing, and more (ISC conference 2013)

Pilar Wyman (@pilarw on Twitter) is the immediate past president of the American Society for Indexing, as well as a member of the ASI’s Digital Trends Task Force, and she spoke at the ISC conference about promoting indexes as metadata and showing our clients how our indexes can be used as sales tools for their books.

We’re used to thinking of a book’s metadata as information about the book as a product—its title, author, ISBN, etc.—but a book’s index can also serve as metadata: each index heading and subheading can be thought of as a tag for a chunk of text that we want readers to see. As a result, readers can use this metadata to provide them with a filtered view of the content that reveals specific facets or dimensions of a book.

Indexes, Wyman argues, are as important for ebooks as a search function. They

  • add browsability and help readers find what they need by expanding the number of access points to content
  • serve as a navigational tool
  • offer pre-analysis: indexes give readers a good sense of the range of topics covered and the importance of each
  • provide a conversation with the reader, allowing publishers to show what their product has to offer

Wyman advocates giving away a book’s index for free (as Amazon essentially does with its Look Inside feature) as a marketing strategy, to let readers know what they could be getting. She also showed us the potential of index mashups, in which you combine the indexes of several publications in a collection, allowing users to browse or search across all of them. These mashups could be enormously useful for “scrapbook files”: collections of content from a variety of sources, as you’d find in a university course pack, for example. Each heading in the mashed-up index is a link, taking you either directly to the content or to a summary screen of available information, with context. Most importantly for publishers, these indexes would offer users a direct link to purchase any of the books included in the mashup.
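At its simplest, an index mashup is a merge of several indexes in which every locator remembers which book it came from. Here is a minimal Python sketch of that idea; the data layout, the function name, and the sample books are assumptions for illustration, not a description of any tool Wyman demonstrated.

```python
from collections import defaultdict

def mash_up(indexes):
    """Combine {book_title: {heading: [locators]}} into one index.

    Returns {heading: [(book_title, locator), ...]} with headings
    lowercased so that "Bread" and "bread" merge, sorted alphabetically.
    """
    combined = defaultdict(list)
    for title, index in indexes.items():
        for heading, locators in index.items():
            for locator in locators:
                combined[heading.lower()].append((title, locator))
    return dict(sorted(combined.items()))

# Made-up indexes for two hypothetical books.
indexes = {
    "Book A": {"Bread": [12, 40], "soup": [88]},
    "Book B": {"bread": [5], "Pastry": [19]},
}
mashup = mash_up(indexes)
```

In a real mashup, each (book, locator) pair would become a link into the content or to a purchase page, and merging headings would take an indexer’s judgment rather than a simple lowercase match.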

To exploit this marketing potential of ebook indexes, whether they are standalone indexes or mashups, publishers should link them—both in to the content and out to further resources or places to buy the book. These linked indexes should be included as back-of-the-file chapters or, better yet, in the front of an ebook so that the index gets searched first. For usability, the index should be accessible wherever you are in the book (just as you can flip to the back of a print book anytime you want), and the “find” tool should bring up the best hits, as identified by the index. Results should show snippets of a term in context, and cross-references should help the reader refine their search terms.

Generic cross-references can often present a dilemma for the indexer (e.g., Does See specific battles really give readers the information they need?), but Wyman’s vision for the EPUB index eliminates this problem: “specific battles” would link to a list of those battles, which would in turn link to the corresponding headings in the index. She also adds that smart use of tagging would allow you to filter based not only on concept but also on type of content. For example, many of us already indicate definitions with boldface, images with italics, etc. This “decoration metadata,” as Wyman calls it, can be another layer of information that users can use to narrow their search down to what they need. Wyman also introduced the concept of a reverse index: users can highlight a section of text and discover what terms in the index are associated with it, allowing them to easily jump to other places in the text that discuss the same topic.
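The reverse index is easy to picture as an ordinary index turned inside out. The Python sketch below is one way it could work, assuming page numbers as locators; the function names and sample entries are hypothetical, and an EPUB implementation would presumably use anchors in the text rather than pages.

```python
from collections import defaultdict

def build_reverse_index(index):
    """Invert {heading: [pages]} into {page: sorted list of headings}."""
    reverse = defaultdict(set)
    for heading, pages in index.items():
        for page in pages:
            reverse[page].add(heading)
    return {page: sorted(headings) for page, headings in reverse.items()}

def related_pages(index, reverse, page):
    """Find other pages that share at least one heading with the given page."""
    others = set()
    for heading in reverse.get(page, []):
        others.update(index[heading])
    others.discard(page)
    return sorted(others)

# Made-up index entries.
index = {"codex": [3, 17], "pagination": [17, 42], "scrolls": [3]}
reverse = build_reverse_index(index)
```

Highlighting text on page 17 would look up reverse[17] to show the associated headings, and related_pages would then offer the other places where those topics are discussed.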

As indexers, Wyman said, we’re already skilled at figuring out aboutness and can easily apply those skills, especially if we’re already familiar with embedded indexing, to semantic tagging of text. If we can persuade our clients of the value of using our indexes as a sales tool, we can further leverage our expertise.

***

(My take: I think the idea of index mashups is brilliant. My colleagues who work in academic publishing spend huge amounts of time compiling different catalogues for different subject areas and markets. Offering one index mashup of all of their Aboriginal studies titles and another for their women’s studies titles, for example, could allow them to show the breadth and depth of their list to particular target markets, including academics considering course adoptions and subject-specific libraries.)

Nancy Mulvany—The repurposed book index and indexer (ISC conference 2013)

Nancy Mulvany is the author of Indexing Books, the go-to reference for any aspiring or practising indexer. She kicked off this year’s Indexing Society of Canada conference in Halifax with her keynote speech about the changing role of the index and indexer in a digital age.

We all know that we will have to evolve and adapt to this new landscape. But how do we go about it, and what potential obstacles do we face? Mulvany warned us about some of the more insidious forces behind our seeming glut of free information. She quoted Jaron Lanier’s Who Owns the Future?, saying “the dominant principle of the new economy, the information economy, has lately been to conceal the value of information.” Once you start to devalue information, she said, you devalue the people who provide access to information, and that’s us. Companies like Facebook expect us to share information freely while they profit handsomely; Mulvany noted that “making information free is survivable so long as only limited numbers of people are disenfranchised,” and right now it’s easy to get information for free by exploiting our musicians, writers, and artists.

Tracing the history of the book—from the invention of the codex to the development of the printing press and movable type to the first use of pagination—Mulvany offered context for the evolution of our consumption of information. At one time, Mulvany said, knowledge and information were highly prized: books were so valued they were chained to their bookshelves in libraries. Today we have an abundance of books in personal and office libraries, as well as in our computers and e-readers.

Finding information in an ebook, however, can be frustrating. If you go to Amazon and look through the reviews of reference books or non-fiction books that readers are trying to get information from, you’ll discover that those who have the Kindle edition can’t find what they’re looking for. Is there a way we can effectively integrate the codex and the ebook and find the information we need?

Mulvany gave us an example of what’s possible by taking us on a tour of Evernote, a note-taking application that allows you to collect information from a variety of sources—from PDFs and Word files to images and audio files—and keep them in one place. Evernote makes the text and images searchable; it even has the ability to decipher neat handwriting on a scanned or photographed document. Items in Evernote are “indexed,” in that you can assign categories and subcategories to items, then add tags—all of which are searchable. If you add another layer of information by aggregating indexes in Evernote, as Mulvany demonstrated with her collection of cookbook indexes, you can then search across multiple books at once, and she believes that there’s a market for organizing information in this way in all sorts of fields. If you want to make the information in a law office library searchable, for example, “the five-hundred-page book on torts doesn’t have to be scanned—provided there’s a good index.”

The power to aggregate information in applications such as Evernote is an example of the repurposed index, but how do we repurpose the indexer? That’s easier said than done, said Mulvany; many of us are very set in our ways, but we have an amazing skill set that includes the ability to analyze, prioritize, synthesize, and localize. By applying those skills with tools such as Evernote, we could “create a product for a client that provides an incredible depth of access to information—something that the most sophisticated search algorithm can’t provide.” In so doing, Mulvany warns, we have to remember the users, who are getting more and more difficult to anticipate—especially younger people who have never been taught how to use an index.

ISC/EAC conference notes and news

I’ve been back from the Indexing Society of Canada/Editors’ Association of Canada conference in Halifax for almost a week now but have spent these past few days trying to get caught up.

As with last year, I’ll be posting summaries of the talks I attended at the conference, but, as I learned last year, they might take me a few weeks to finish; the pockets of time I need to write have been elusive.

The ISC and EAC conference committees put on a wonderful event: the sessions were engaging and well balanced, and I loved being able to see and catch up with fellow editors and indexers from across Canada and beyond. I was also honoured to receive a President’s Award for Volunteer Service from EAC at the awards banquet—a million thanks to Frances Peck, Anne Brennan, and Eva van Emden for nominating me. I work on committees with some of the most dedicated people I know, and there’s no way I deserve this award any more than they do.

Book review: Indexing and Retrieval of Non-Text Information

This review appeared in the Spring 2013 issue of Bulletin, the Indexing Society of Canada’s newsletter.

***

I expected to learn a lot from Indexing and Retrieval of Non-Text Information (edited by Diane Rasmussen Neal and published by Walter de Gruyter); what I didn’t expect was to enjoy reading it as much as I did. Neal and her team have put together a timely and fascinating collection of texts that explore the challenges of indexing non-text material in an online world. Although geared much more toward academically minded information scientists than to back-of-the-book indexers, this book nevertheless has a lot to offer indexers who work with illustrated books or digital documents with embedded multimedia.

Covering everything from music information retrieval systems to World of Warcraft as a case study for gaming indexing, Neal’s wide-ranging book features voices from all over the world, including Bar-Ilan University in Israel, Universidade Federal Fluminense in Brazil, and Heinrich-Heine-Universität Düsseldorf, but also showcases the strength of Canadian research in the field, with contributions from doctoral students and faculty at the University of Toronto, McGill University, and Western University, where Neal is an assistant professor.

Although I read the chapters about music with interest (Jason Neal, for example, looks at the problematic definition of classical in his probe of genre in music recommender systems), I focused mostly on the content most relevant to book indexers—namely, image indexing. Chris Landbeck’s chapter about editorial cartoons was eye-opening, as he explained that several factors contribute to the complexity of indexing these images:

  1. editorial cartoons are time sensitive;
  2. there is no tradition of describing editorial cartoons for the Electronic Age to draw on;
  3. editorial cartoons do not exist in a vacuum, but in a rich and active world that a reader must be familiar with in order to both perceive the visual part of the cartoon as well the message within it. (p. 61)

This distinction between an image’s “ofness” and “aboutness” is echoed in Kathrin Knautz’s chapter about emotions in multimedia; indexing must take into account that, because “an emotion may arise for various reasons (induction, empathy, contagion),” (p. 359) an emotion depicted may not be the same as the one evoked. Pawel Rygiel extends Landbeck’s thread about the time sensitivity of an image, showing the complications that can arise when indexing photos of architectural objects “whose name, form and function might have changed throughout their history.” (p. 288) The chapter by Renata Maria Abrantes Baracho Porto and Beatriz Valadares Cendón about an image-based retrieval system for engineering drawings was also interesting; I once worked on an art book in which the designer included details of the artwork next to the tombstone data (the title, date, medium, dimensions, and inscriptions for each piece of artwork)—a lovely visual index—and this chapter in Neal’s book made me wonder whether a closer relationship between indexer and designer may yield surprising, useful results for carefully chosen projects.

The book’s biggest weakness, ironically, is its unforgivably anemic index. Only three pages in a 428-page book, the index is virtually useless, with its entry for “indexing” consisting of 108 undifferentiated locators.

Indexing and Retrieval of Non-Text Information offers indexers a lot to ponder, especially in its look at the strengths and weaknesses of social tagging and the question of whether crowdsourcing the task of indexing will ever put us out of a job. For the working book indexer, however, this book is probably overkill. If someone extracted only the information that was relevant to book indexers and edited it into a smaller, more manageable resource, that abridged volume would be a welcome addition to any indexer’s reference shelf.

Learning to type: Adventures in publishing

Huh. Well, I’ve been meaning to post a recap of Scott McIntyre’s talk at last Tuesday’s Alcuin AGM, but I’ve been swamped with work and haven’t been able to get to it. The Alcuin Society has since uploaded the full video of his talk here.

I promise to post something soon—after I get through this crush of work. Next week I’ll be heading to Halifax for the Indexing Society of Canada and Editors’ Association of Canada conferences, and I’ll make my write-ups of the sessions I attend available when I manage to get to them.