Yahoo! term extraction
While adding tag support to Pebble, I came across the Yahoo! Term Extraction service, via TagCloud. Several people have already tried to use it to automate tagging, and there's been lots of talk about how automated tagging could pollute the tag clouds built by sites like Technorati. Rather than reiterate, here are some references.
- Tags vs. Yahoo Term Extraction
- Yahoo! Term Extraction, Technorati, and Tag Pollution
- Updated: Y! Terms Extraction Plugin
Most of this was written around April, so a few months on I thought I'd have a go at writing a client in Java. It was just a matter of sending an HTTP request to the service and reading back the XML response. Here's the code, wrapped up in a Pebble BlogEntryListener plugin (javadoc stripped for brevity).
package pebble.event.blogentry;

import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import pebble.blog.Blog;
import pebble.blog.BlogEntry;

public class YahooTermExtractionListener extends BlogEntryListenerSupport {

    private static final String URL =
        "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction";

    private static final Log log =
        LogFactory.getLog(YahooTermExtractionListener.class);

    public void blogEntryAdded(BlogEntryEvent event) {
        tag(event.getBlogEntry());
    }

    private void tag(BlogEntry blogEntry) {
        Blog blog = blogEntry.getRootBlog();
        try {
            // post to the service using HttpClient
            HttpClient httpClient = new HttpClient();
            PostMethod postMethod = new PostMethod(URL);
            postMethod.addRequestHeader("Content-Type",
                "application/x-www-form-urlencoded; charset=" +
                blog.getCharacterEncoding());
            NameValuePair[] data = {
                new NameValuePair("appid", "PebbleWeblog"),
                new NameValuePair("context", blogEntry.getBody()),
            };
            postMethod.addParameters(data);
            int responseCode = httpClient.executeMethod(postMethod);
            if (responseCode != 200) {
                return;
            }

            // now parse the response and extract the Result elements
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setNamespaceAware(true);
            factory.setIgnoringElementContentWhitespace(true);
            factory.setIgnoringComments(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) throws SAXException {
                    log.warn(e);
                    throw e;
                }
                public void error(SAXParseException e) throws SAXException {
                    log.error(e);
                    throw e;
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    log.fatal(e);
                    throw e;
                }
            });

            // add the terms to the list of tags for the blog entry
            StringBuffer tags = new StringBuffer(blogEntry.getTags());
            InputStream in = postMethod.getResponseBodyAsStream();
            Document doc = builder.parse(in);
            in.close();
            NodeList results = doc.getElementsByTagName("Result");
            if (results != null) {
                for (int i = 0; i < results.getLength(); i++) {
                    Node node = results.item(i);
                    String tag = getTextValue(node);
                    if (tags.length() > 0) {
                        tags.append(", ");
                    }
                    tags.append(tag);
                }
            }
            blogEntry.setTags(tags.toString());
            blogEntry.store();
        } catch (Exception e) {
            log.error(e.getMessage(), e);
        }
    }

    private String getTextValue(Node node) {
        if (node.hasChildNodes()) {
            return node.getFirstChild().getNodeValue();
        } else {
            return "";
        }
    }
}
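The parsing half of the listener can be tried without calling the service or running Pebble. What follows is only a minimal sketch: the class name TermParser and the sample XML are made up for illustration, and the sample simply assumes a response containing Result elements, which is all the listener above looks for.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Pebble: pulls the text out of every
// Result element in an XML string.
public class TermParser {

    public static List<String> parseTerms(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // same lookup as the listener: match elements by tag name
        NodeList results = doc.getElementsByTagName("Result");
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < results.getLength(); i++) {
            String text = results.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                terms.add(text); // skip empty Result elements
            }
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        // hand-written sample, assumed shape only
        String sample =
            "<ResultSet>"
            + "<Result>java</Result>"
            + "<Result>term extraction</Result>"
            + "<Result/>"
            + "</ResultSet>";
        System.out.println(parseTerms(sample)); // prints [java, term extraction]
    }
}
```

Unlike getTextValue in the listener, this uses getTextContent and drops empty results, so an empty Result element doesn't end up as a blank tag.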
To test it, I took this short blog entry and added it to my development blog. Here's the result, with the tags at the bottom of the image.
In summary, it's a mixed bag. Some of the terms are very relevant. Some, however, aren't. Automated tagging is certainly an interesting concept, and perhaps a great way of starting out with a larger set of tags which then gets chopped down by a human. I don't think you can rely on automated tags alone, though. Alternatively, and more interestingly, TagCloud are using the Yahoo! term extraction service as a basis and applying additional logic to rank the tags and ensure that they remain relevant. I think that this is something to keep an eye on.
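To make the "chopping down" idea a bit more concrete, here is a rough sketch of the sort of crude filtering that could be applied to the extracted terms before a human looks at them. It is not TagCloud's logic, which isn't described here. The class name TagFilter, the word limit and the cap on the number of tags are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Pebble: normalises extracted terms and
// drops the ones that are unlikely to make useful tags.
public class TagFilter {

    public static List<String> filter(List<String> terms, Set<String> existing,
                                      int maxWords, int limit) {
        Set<String> kept = new LinkedHashSet<>(); // keeps order, removes duplicates
        for (String term : terms) {
            if (kept.size() >= limit) {
                break; // cap the number of suggested tags
            }
            String tag = term.trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty()) {
                continue;
            }
            if (tag.split("\\s+").length > maxWords) {
                continue; // long phrases rarely work as tags
            }
            if (existing.contains(tag)) {
                continue; // the entry already has this tag
            }
            kept.add(tag);
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "Java", "term extraction", "java ",
            "short blog entry about nothing", "pebble");
        Set<String> existing = new LinkedHashSet<>(Arrays.asList("pebble"));
        // prints: java, term extraction
        System.out.println(String.join(", ", filter(terms, existing, 2, 5)));
    }
}
```

The output is joined with ", " to match the comma-separated tag string the listener builds. Anything smarter, such as ranking by relevance, would need more information than the raw list of terms provides.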
Re: Yahoo! term extraction
As for the ping listener, I've even manually pinged Technorati but that didn't seem to make any difference. We'll wait and see!
Re: Yahoo! term extraction
Simon, submit Pebble to Koders and compute the development cost of Pebble. :-)
For example, AppFuse cost $78K to develop.
This is an interesting indicator.
Re: Yahoo! term extraction
Hi everybody!
TermExtractor, my master's thesis, is online at http://lcl2.di.uniroma1.it.
TermExtractor is a FREE and high-performing software package for terminology extraction. The software helps a web community to extract and validate relevant domain terms in their domain of interest, by submitting an archive of domain-related documents in any format.
TermExtractor extracts terminology consensually referred to in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of "syntactically plausible" terms (e.g. compounds, adjective-nouns, etc.).
Document parsing assigns greater importance to terms with text layouts (title, bold, italic, underlined, etc.). Two entropy-based measures, called Domain Relevance and Domain Consensus, are then used. Domain Consensus is used to select only the terms which are consensually referred to throughout the corpus documents. Domain Relevance is used to select only the terms which are relevant to the domain of interest; it is computed with reference to a set of contrastive terminologies from different domains. Finally, extracted terms are further filtered using Lexical Cohesion, which measures the degree of association of all the words in a terminological string. Accepted file formats are: txt, pdf, ps, dvi, tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and also zip archives.
I'd like you to participate in the TermExtractor evaluation task. The results of your evaluation will be put in a paper (I enclose a draft). Please contact me if you want to participate (this is very important for me!).
MANY THANKS!!!
--
Francesco Sclano
home page: http://lcl2.di.uniroma1.it/~sclano
msn: francesco_sclano@yahoo.it
skype: francesco978
Simon is a hands-on software architect and a senior consultant at 

