open calais | I'm Not Actually a Geek

Bit.ly Gets Better with New Data…Are You Using It Yet?

November 3, 2008 by Hutch Carpenter 8 Comments

Lately, I’ve been using bit.ly for shortening the URLs I tweet, on the advice of Marshall Kirkpatrick at ReadWriteWeb. I started using it instead of is.gd, which had been my previous favorite.

Why? Because bit.ly offers an array of useful data. Who knew that a simple URL shortener could open up so much interesting data? I can’t believe people still use tinyurl and other services that “only” shorten URLs. The tracking of metadata around a posted URL – for free – makes bit.ly really powerful.

Here’s what bit.ly was offering before the latest data features…

Last 15 URLs: Bit.ly knows your last 15 shortened URLs, courtesy of a cookie.
Post to Twitter: Post shortened URLs from bit.ly to your Twitter account
Archived web page: Yup, see that page anytime because there’s a cached version of it, even if the source link changes or disappears.
Traffic sources: See how much click action that bit.ly URL got once you put it out there. And from what apps.
Conversations: Tracks which users on Twitter and FriendFeed put the URL out there. This is really cool, as you can see others who liked the same thing you did.
Browser bookmarklet: Easy way to create a shortened URL, stay on the page you’re reading.
Semantic metadata: According to Marshall’s July post, bit.ly was going to add semantic analysis via Reuter’s OpenCalais API. Looks like it’s there. Cool to see per link, probably more interesting with a critical mass of URLs.

On October 30, bit.ly announced several nice additions to their service.

Full referring domains: Not just the top-level domain.
Graph of click activity by time: The dates and times that a URL got clicked.
Clicks by Country: The countries of people who click on your URL. This is really fascinating.

Seriously, if you’re not using bit.ly, why not?

*****

See this post on FriendFeed: http://friendfeed.com/search?q=%22Bit.ly+Gets+Better+with+New+Data%E2%80%A6Are+You+Using+It+Yet%3F%22&who=everyone

Filed under geek Tagged with bit.ly, is.gd, open calais, readwriteweb, tinyurl

FriendFeed Noise Control, Semantic Web and Dave Winer

May 21, 2008 by Hutch Carpenter 7 Comments

On a FriendFeed discussion about the noise on the Web in general, Lindsay Donaghe posted this comment:

Actually I think it’s the same problem we have in general with the firehose of information we’re exposed (or expose ourselves) to on a daily basis. The struggle of where to apply our attention will only be resolved once someone develops intelligent agents to filter the bad stuff and alert us to the good stuff. Wish someone would hurry up and make those. That will be the ultimate killer app.

Louis Gray wrote this recently in his post Content Filters Proving Evasive for RSS, Social Media Sites:

So far, despite many users calling for content-based filters, solutions to block keywords or topics are missing from the vast majority of information spigots.

The recent meme about FriendFeed noise points to the frustration of some people with an inability to manage what content hits their screens. The two comments above underscore this feeling.

Here’s me own example. Dave Winer has two passions: technology and politics. For me personally, technology = signal. Politics = noise. I went through his FriendFeed stream for the month of May, and here are the 38 different political terms that show up:

So what to do? I’d like to suggest that the semantic web might be a solution for down the road.

What Is the Semantic Web?

Semantic web is still a confusing term. Two quotes from Wikipedia help describe it. This quote tells you generally what it’s about and importantly notes that there’s much development for the future:

The Semantic Web is an evolving extension of the World Wide Web in which the semantics of information and services on the web is defined, making it possible for the web to understand and satisfy the requests of people and machines to use the web content. At its core, the semantic web comprises a set of design principles, collaborative working groups, and a variety of enabling technologies. Some elements of the semantic web are expressed as prospective future possibilities that are yet to be implemented or realized.

This quote describes the problem that the semantic web will solve:

With HTML and a tool to render it (perhaps Web browser software, perhaps another user agent), one can create and present a page that lists items for sale. The HTML of this catalog page can make simple, document-level assertions such as “this document’s title is ‘Widget Superstore'”. But there is no capability within the HTML itself to assert unambiguously that, for example, item number X586172 is an Acme Gizmo with a retail price of €199, or that it is a consumer product.

It’s that last sentence there that addresses the noise issue. How does a server know that part X586172 can be categorized as a “consumer product”? That’s where the semantic web comes into play.

And how the noise can be controlled on FriendFeed.

Noise Control: Simplify Users’ Lives

One way to think of the semantic web is as tagging on steroids. In the example above, part X586172 is tagged as “consumer product”. And the tagging occurs without human intervention.

This is what’s needed on FriendFeed. The ability to take a wide range of terms that humans can understand are related. The relationship among the terms is tag.

Here’s what such an algorithm would do for Dave Winer’s political terms:

Now, imagine this in FriendFeed. Semantically-derived tags are appended to every item that flows through. Meanwhile, users have a new ‘Hide’ feature. Hide by topic. They could elect to hide streams with terms on a one-by-one basis. For instance, I’ll hide “robert reich”. I’ll hide “republicans”. I’ll hide “congress”. I’ll hide “obama”. I’ll hide “mitt romney”. I’ll hide…well, you get the picture.

In addition, users could just hide all items with the tag “politics”, and be done with it. Simple.

This could apply for all manner of topics: football, banking, Iraq, etc.

Just How Would These Semantic Tags Be Generated?

I’m not sure anything quite with this purpose exists yet. Reuters has been a leading player in the semantic web with its Open Calais initiative. However, Open Calais focuses of its tagging on people, places, and companies. So if Open Calais was applied to Dave Winder’s FriendFeed stream would have a lot of tags related to those topics. But not metadata tags.

A company called GroupSwim described their semantic tagging approach:

We use natural language processing to analyze the data our customers put into their sites. Our datasets tend to be much smaller but are high quality since someone doesn’t add something to GroupSwim unless they want to share it. Then, we compare the language used in the content to other semantic sources including WordNet, Wikipedia, etc. to do our automatic tagging and analysis.

Interesting, not sure what the tags they produce are. But it does give insight into a requirement: a core foundation of data against which all other data can be compared to derive tags. Something that would correctly map Obama and Clinton to a politics tag.

I’m sure there are other interesting approaches. It’d be great if someone was working on something in this area.

If anyone reading this knows of any semantic approaches that can apply metadata type of tags, feel free to leave a comment.

*****

See this item on FriendFeed: http://friendfeed.com/search?q=who%3Aeveryone+%22friendfeed+noise+control+semantic+web+dave+winer%22

Filed under geek Tagged with friendfeed, groupswim, open calais, semantic web

Semantic Web = Tagging on Steroids

February 17, 2008 by Hutch Carpenter 1 Comment

I read a nice list of “11 Things to Know About Semantic Web“, over at ReadWriteWeb. “Semantic Web” is an intimidating term. “Semantic”…hmmm, that’s something to do with words, right? Don’t people say, “he’s just using semantics” in a pejorative way?

I’ve researched it a bit, and here’s my initial attempt to put it in layman’s terms. The semantic web takes a document of unstructured data (say, this blog post), and renders it into a set of tags that are readable by both humans and software programs. Not just any tags, but really powerful tags beyond what you or I would use.

Now, at this point, that sounds sort of redundant. Don’t a lot of pages already have user-created tags? Can’t search engines find the words that are in the web page? If you know a little HTML, aren’t there metadata tags?

Well, it turns out those various methodologies are great for humans, but do little for machines. That kind of makes sense, right? I mean, we all get that machines’ ability to interpret our written words is limited. Google doesn’t really make a connection between your search term and the pages it serves up. It just looks for instances of your search terms and then applies all that special magic it does (number of links to a page, number of times the page was clicked previously, etc.).

The key to making these unstructured web pages readable by a machine is something called…RDF. RDF stands for Resource Description Framework. Here’s RDF from Wikipedia: “The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology.”

That “triples” thing (or, I’ve seen it as “triplets”). The idea of casting unstructured web data into subject-predicate-object is apparently quite powerful. Again, from Wikipedia: “In the English language statement ‘New York has the postal abbreviation NY’ , ‘New York’ would be the subject, ‘has the postal abbreviation’ the predicate and ‘NY’ the object.”

At this point, people much more versed in these technologies can explain how computers will use these triplets to better serve up content for a given search. For instance, Reuters has come out with its Open Calais initiative. They aim to “make all the worlds content more accessible, interoperable and valuable.” I will do some more research and write a follow-up post on this subject.

But I do want to note a few of the points provided in Bernard Lunn’s post over at ReadWriteWeb:

Semantic Web will start the long, slow decline of relational database technology. Web 3.0 enables the transition from “structure upfront” to “structure on the fly”. The world is clearly too complex to structure upfront, despite the tremendous skills brought by data modelers. Structure on the fly is done by people adding structure as they use the service and by engines that automatically create structure from unstructured content.
Don’t look for a killer app. That implies a client/consumer win. This is much more likely to be a server/platform/enterprise win.
Semantic Web could slow the Google steamroller. This could be like the PC for IBM or the Web for Microsoft. The steamroller’s momentum carries it forward for a very long time and it can build all kinds of wrapper systems around it, but something new always does come along. Google mastered how to give some structure to countless unstructured HTML pages. Semantic Web will gradually make that less critical as the underlying content will be more structured.
Tagging is the quietly disruptive technology. Everybody tags. It is the most basic human urge to mark what we find.
Semantic Web will leverage the “community” to add structure and this will use some techniques from first generation Social Networking. But it is very unlikely that Semantic Web will emerge from the walled gardens of current social networking sites.

Final note. I ran this post through a free website that employs Reuters’ Open Calais protocols, “Calais Text Tagger“. It returns a lot of text chock full of semantic tags. I won’t repeat that here. But I did like this little output:

IndustryTerm: unstructured web, search terms, relational database technology, software programs, wrapper systems, given search, unstructured web data, search term, semantic web
Company: IBM, Reuters, Microsoft, Google
Person: Bernard Lunn

Gotta say, that was pretty slick. And it’s more tags than I’m applying to this post.

Filed under geek Tagged with metadata, open calais, RDF, semantic web, tagging

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

I'm Not Actually a Geek

Bit.ly Gets Better with New Data…Are You Using It Yet?

FriendFeed Noise Control, Semantic Web and Dave Winer

What Is the Semantic Web?

Noise Control: Simplify Users’ Lives

Just How Would These Semantic Tags Be Generated?

Semantic Web = Tagging on Steroids

Who is this guy?

Email Subscription

Subscribe to this blog

Latest tweets

The Conversation

Recent Posts

Tag Cloud

Posts By Date

Archives

Meta