Yahoo! term extraction

In my current work around adding tag support to Pebble, I came across the Yahoo! Term Extraction service, via TagCloud. Several people have already tried to use this technology as a way to automate the tagging process and there's been lots of talk about how automated tagging has the potential to pollute the tag clouds being created by sites like Technorati. Rather than re-iterate, here are some references.

Most of this was written around April. A few months on, I thought I'd have a go at writing a client in Java, which turned out to be just a matter of sending an HTTP POST request to the service and reading back the XML response. Here's the code, wrapped up in a Pebble BlogEntryListener plugin (javadoc stripped for brevity).

package pebble.event.blogentry;

import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import pebble.blog.Blog;
import pebble.blog.BlogEntry;

public class YahooTermExtractionListener extends BlogEntryListenerSupport {

  private static final String URL =
    "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction";

  private static final Log log =
    LogFactory.getLog(YahooTermExtractionListener.class);

  public void blogEntryAdded(BlogEntryEvent event) {
    tag(event.getBlogEntry());
  }

  private void tag(BlogEntry blogEntry) {
    Blog blog = blogEntry.getRootBlog();

    try {
      // post to the service using HttpClient
      HttpClient httpClient = new HttpClient();
      PostMethod postMethod = new PostMethod(URL);
      postMethod.addRequestHeader("Content-Type",
        "application/x-www-form-urlencoded; charset=" +
        blog.getCharacterEncoding());
      NameValuePair[] data = {
        new NameValuePair("appid", "PebbleWeblog"),
        new NameValuePair("context", blogEntry.getBody()),
      };
      postMethod.addParameters(data);
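      int responseCode = httpClient.executeMethod(postMethod);
      if (responseCode != 200) {
        // give up, but leave a trace and hand the connection back
        log.warn("Term extraction request failed with HTTP " + responseCode);
        postMethod.releaseConnection();
        return;
      }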

      // now parse the response and extract the Result elements
      DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();
      factory.setValidating(false);
      factory.setNamespaceAware(true);
      factory.setIgnoringElementContentWhitespace(true);
      factory.setIgnoringComments(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) throws SAXException {
          log.warn(e);
          throw e;
        }

        public void error(SAXParseException e) throws SAXException {
          log.error(e);
          throw e;
        }

        public void fatalError(SAXParseException e) throws SAXException {
          log.fatal(e);
          throw e;
        }
      });

      // add the terms to the list of tags for the blog entry
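      StringBuffer tags = new StringBuffer(blogEntry.getTags());
      InputStream in = postMethod.getResponseBodyAsStream();
      Document doc;
      try {
        doc = builder.parse(in);
      } finally {
        // close the stream and release the connection even if parsing fails
        in.close();
        postMethod.releaseConnection();
      }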
      NodeList results = doc.getElementsByTagName("Result");
      if (results != null) {
        for (int i = 0; i < results.getLength(); i++) {
          Node node = results.item(i);
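          String tag = getTextValue(node);
          if (tag == null || tag.trim().length() == 0) {
            continue; // nothing usable in this Result element
          }
          if (tags.length() > 0) {
            tags.append(", ");
          }
          tags.append(tag.trim());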
        }
      }

      blogEntry.setTags(tags.toString());
      blogEntry.store();
    } catch (Exception e) {
      log.error(e.getMessage(), e);
    }
  }

  private String getTextValue(Node node) {
    if (node.hasChildNodes()) {
      return node.getFirstChild().getNodeValue();
    } else {
      return "";
    }
  }

}
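
The parsing step can be tried on its own, without calling the service. What follows is a minimal sketch rather than part of the plugin: the class name TermResponseParser, the parseTerms helper and the sample response are assumptions made for illustration (the real response may differ in detail), and it uses newer Java syntax than the listener above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class TermResponseParser {

  // Assumed sample payload: a ResultSet holding one Result element per term.
  static final String SAMPLE =
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<ResultSet xmlns=\"urn:yahoo:cate\">"
      + "<Result>term extraction</Result>"
      + "<Result>tag clouds</Result>"
      + "</ResultSet>";

  // Returns the trimmed text of every Result element, skipping empty ones.
  static List<String> parseTerms(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

      // "*" matches Result elements in any namespace
      NodeList results = doc.getElementsByTagNameNS("*", "Result");
      List<String> terms = new ArrayList<>();
      for (int i = 0; i < results.getLength(); i++) {
        String text = results.item(i).getTextContent().trim();
        if (!text.isEmpty()) {
          terms.add(text);
        }
      }
      return terms;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse response", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(String.join(", ", parseTerms(SAMPLE)));
  }
}
```

Using getTextContent() rather than the first child's node value means a Result whose text is split across several nodes still comes back whole.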

To test it, I took this short blog entry and added it to my development blog. Here's the result, with the tags at the bottom of the image.

Yahoo! term extraction in action

In summary, it's a mixed bag. Some of the terms are very relevant. Some, however, aren't. Automated tagging is certainly an interesting concept, and perhaps a great way of starting out with a larger set of tags which get chopped down by a human. I don't think you can rely on them alone though. Alternatively, and more interestingly, TagCloud are using the Yahoo! term extraction service as a basis and applying additional logic to rank the tags and ensure that they remain relevant. I think that this is something to keep an eye on.
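
As a rough idea of what "additional logic" could look like, here is a small sketch that ranks the extracted terms by how often they actually occur in the entry body and keeps only the top few. This is not TagCloud's algorithm, which I haven't seen; the TagPruner class, its method names and the ranking rule are all assumptions for illustration, and the substring matching is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;

public class TagPruner {

  // Counts non-overlapping, case-insensitive occurrences of term in text.
  static int occurrences(String text, String term) {
    String haystack = text.toLowerCase(Locale.ROOT);
    String needle = term.toLowerCase(Locale.ROOT);
    if (needle.isEmpty()) {
      return 0;
    }
    int count = 0;
    int from = 0;
    while ((from = haystack.indexOf(needle, from)) >= 0) {
      count++;
      from += needle.length();
    }
    return count;
  }

  // Keeps at most max distinct terms, most frequent in the body first.
  static List<String> prune(List<String> terms, String body, int max) {
    // LinkedHashSet drops duplicates but keeps the original order
    List<String> distinct = new ArrayList<>(new LinkedHashSet<>(terms));
    // List.sort is stable, so ties keep the order the terms arrived in
    distinct.sort(
        Comparator.comparingInt((String t) -> occurrences(body, t)).reversed());
    return new ArrayList<>(distinct.subList(0, Math.min(max, distinct.size())));
  }

  public static void main(String[] args) {
    String body = "Tags, tags and more tags: automated tagging fills tag clouds.";
    List<String> terms =
        Arrays.asList("tag clouds", "tags", "automated tagging", "tags");
    System.out.println(prune(terms, body, 2)); // prints [tags, tag clouds]
  }
}
```

A human would still want the final say, but something along these lines would at least cap the number of tags added to each entry.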



Re: Yahoo! term extraction

Recently, I found that searching for tags on Technorati returns no related results from my blog (or from yours). Does that mean the PING listener plugin is useless? It's puzzling me...

Re: Yahoo! term extraction

I think it's Technorati that hasn't picked up the tags yet. I've seen stories of this taking a couple of weeks, but I'm wondering whether the tags need to be output in the newsfeeds too. Perhaps hidden. Not sure yet.

As for the ping listener, I've even manually pinged Technorati but that didn't seem to make any difference. We'll wait and see!

Re: Yahoo! term extraction

Just a quick update on this topic: with the new Atom 1.0 feed, Technorati is picking up my blog entries pretty quickly now. :-)

Re: Yahoo! term extraction

Thanks for your response.
Simon, you could submit Pebble to Koders to compute its development cost. :-)
For example, AppFuse cost $78K to develop.
This is an interesting indicator.

Re: Yahoo! term extraction

Hi everybody!
TermExtractor, my master's thesis, is online at
http://lcl2.di.uniroma1.it.

TermExtractor is a FREE and high-performing software package for Terminology
Extraction. The software helps a web community to
extract and validate relevant domain terms in their
interest domain, by submitting an archive of
domain-related documents in any format.

TermExtractor extracts the terminology that is
consensually referred to in a specific application
domain. The software takes as input a corpus of domain
documents, parses the documents, and extracts a list of
"syntactically plausible" terms (e.g. compounds,
adjective-nouns, etc.).
Document parsing gives greater weight to terms that
appear with distinctive text layout (title, bold, italic,
underlined, etc.). Two entropy-based measures, called
Domain Relevance and Domain Consensus, are then used.
Domain Consensus is used to select only the terms
which are consensually referred to throughout the corpus
documents. Domain Relevance is used to select only the
terms which are relevant to the domain of interest; it
is computed with reference to a set of contrastive
terminologies from different domains.
Finally, extracted terms are further filtered using
Lexical Cohesion, which measures the degree of
association of all the words in a terminological
string. Accepted file formats are: txt, pdf, ps, dvi,
tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and
also zip archives.

I'd like you to participate in the TermExtractor
evaluation task. The results of your evaluation will be
included in a paper (I enclose a draft). Please contact me
if you want to participate (this is very important for
me!).

MANY THANKS!!!

--
Francesco Sclano
home page: http://lcl2.di.uniroma1.it/~sclano
msn:       francesco_sclano@yahoo.it
skype:     francesco978

