ISC Conference 2012, Day 1—Building a bilingual taxonomy for ordinary images indexing

Elaine Ménard gave ISC conference attendees a glimpse into the world of information science research. An assistant professor in the school of information studies at McGill University, Ménard embarked on a project to develop a bilingual taxonomy to see how controlled vocabularies can assist in both indexing and information retrieval. Taxonomies are inherently labour intensive to create, and the bilingualism adds an additional complication.

Ménard’s Taxonomy for Image Indexing And RetrievAl (TIIARA) project consists of three phases:

  1. a best practices review,
  2. development of the taxonomy, and
  3. testing and refinement of the taxonomy.

Phase 3 is currently underway, and she gave us an overview of the first two phases.

In phase 1, Ménard and her team evaluated 150 resources: 70 image collections held by libraries, museums, image search engines, and commercial stock agencies, and 80 image-sharing platforms with user-generated tagging. They discovered that 40% of the metadata dealt with the image’s dimensions, material, and source, and 50% addressed copyright information, with the balance devoted to subject classification. This review of best practices formed the basis for phase 2.

In phase 2, Ménard’s team constructed an image database and developed the top-level categories and subcategories of the taxonomy. To create the database, called Images DOnated Liberally (IDOL), they solicited voluntary submissions and ended up with over 6,000 photos from 14 contributors. The taxonomy was built with Miller’s Law of 7 ± 2 in mind and featured (after a series of revisions and refinements) nine top-level categories, designed to help users with retrieval while being as broad as possible, and a further forty-three second-level categories.

After the category headings were translated, two volunteers, one anglophone and the other francophone, tested the preliminary taxonomy through a card-sorting game, in which they were instructed to sort the second-level cards according to whatever structure they desired and provide a heading for each sorted group. This pretest showed a polarization of “splitters” and “lumpers” and didn’t provide any practical recommendations for the taxonomy but did suggest revisions to the card-sorting exercise.

Ten participants (five male, five female; five anglophone, five francophone) were recruited to test the taxonomy and expose problematic categories in the structure. Half of the group was instructed to sort the second-level categories according to the existing first-level structure; the other half could sort the second-level categories as they pleased. Through this test Ménard hoped to assess how well each category and subcategory was understood. The differences between the French and English sorts would reveal nuances that had to be taken into account in the translation of the structure.

Results showed that the first-level categories of “Arts,” “Places,” and “Nature” were well understood but that “Abstractions,” “Activities,” and “Business and Industry” were problematic. Feedback from participants helped researchers clarify the taxonomic structure and reduce it to seven first-level headings. Interestingly, Ménard found fewer disparities between the languages than expected.

The revised TIIARA structure was refined to include second-, third-, and fourth-level subcategories and was simultaneously developed in English and French.
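To make the idea of a multi-level bilingual structure concrete, here is a minimal Python sketch of one way such a taxonomy could be represented, with each category carrying an English and a French label so that a term in either language resolves to the same node. This is only an illustration under stated assumptions, not TIIARA’s actual data model: the `Category` class, the `find_path` helper, the subcategories, and the French labels are all hypothetical. Only the first-level names “Places” and “Nature” come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One taxonomy node carrying a label in each language (hypothetical model)."""
    label_en: str
    label_fr: str
    children: list[Category] = field(default_factory=list)

    def add(self, label_en: str, label_fr: str) -> Category:
        """Attach a subcategory and return it so deeper levels can be chained."""
        child = Category(label_en, label_fr)
        self.children.append(child)
        return child


def find_path(node, term, path=()):
    """Depth-first search for a label matching `term` in either language.

    Returns (English path, French path) for the first match, or None.
    Matching is case-insensitive.
    """
    here = path + (node,)
    if term.casefold() in (node.label_en.casefold(), node.label_fr.casefold()):
        return [n.label_en for n in here], [n.label_fr for n in here]
    for child in node.children:
        found = find_path(child, term, here)
        if found:
            return found
    return None


# Illustrative fragment only: subcategories and French labels are assumptions.
root = Category("Images", "Images")
places = root.add("Places", "Lieux")
places.add("Cities", "Villes")
nature = root.add("Nature", "Nature")
nature.add("Animals", "Animaux")

# A French search term and an English one each resolve to a path in both languages.
print(find_path(root, "villes"))
print(find_path(root, "Animals"))
```

Storing both labels on a single node, rather than keeping two parallel trees, keeps the English and French structures from drifting apart as categories are revised.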

In phase 3, now underway, two indexers (one English, one French) will index all images in the IDOL database according to the TIIARA structure. Iterative user testing will be carried out to validate and refine the taxonomy.

So far the study has shown that language barriers still prevent users from easily accessing information, including visual resources, and a bilingual taxonomy is a definite benefit for image searchers. Eventually the aim is to implement TIIARA in an image search engine.
