The brilliant concepts of Social Software help people to articulate and organize themselves on the web and communicate like never before. One of the reasons flickr is so great is that it works better for you than most desktop photo apps. Just upload your photos: existing metadata is transferred, new metadata like date and author is added, and everything is already in place for your photostream/weblog, like Esther's for example. Flickr is also one of the most widely used services utilizing Tags, thereby solving the age-old folder problem. But Tags especially rock when used collaboratively, for sharing and browsing photos on a simple and powerful concept level. So photos get a joint URL. Same with del.icio.us, where Bookmarks get a joint URL. Same with plazes, where real-world places get a joint URL. Thoughts, ideas and conversations get a URL, be it as Tags, Permalinks or Trackbacks. So people like Joi (his weblog) or services like Technorati (AC2005 example) aggregate all the widespread information based on smart metadata.
If we dare to ask about the limitations of Social Software, the main one is most likely the rather lo-tek approach: even features that wouldn't corrode the simplicity aren't integrated (Wikipage renaming or Technorati Tag context, for example), and reliability sometimes suffers. Social Software also mostly relies on social cognition, meaning that I either get to see what's most popular in my peer group or I have to dig for pearls through thousands of RSS feeds and more. There's also the issue that it requires a certain predisposition or level of abstract thinking (sharing and being open), where we think it's rather easy to substitute this with clearer benefits for more people.
On a weblog the sentence "Bruno Haid is just giving a strange talk at AC2005" makes perfect sense to a human, but even simple concept extraction is really tough for machines. This is where the Semantic Web comes in. With the right bottom-up tools and some level of standardization, it's easy to also create a machine-readable statement that refers to the machine-readable profiles of the conference and the person, whose profile in turn refers to the flickr images, plazes, amazon profile and so on. The greatness of this is that you don't have to laboriously visit or integrate all those services, or query dozens of DBs with millions or billions of items, but rather just "walk a few nodes" on this graph of statements.
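To make "walking a few nodes" a bit more concrete, here is a minimal sketch in Python. It assumes statements are stored as simple (subject, predicate, object) triples; all identifiers such as "post:42" or "gives_talk_at" are made up for illustration and do not follow any real vocabulary or service API.

```python
from collections import defaultdict

# Hypothetical statements as (subject, predicate, object) triples.
# Every identifier here is invented for illustration only.
STATEMENTS = [
    ("post:42", "mentions", "person:bruno-haid"),
    ("post:42", "mentions", "event:ac2005"),
    ("person:bruno-haid", "gives_talk_at", "event:ac2005"),
    ("person:bruno-haid", "has_photos", "flickr:bruno"),
    ("person:bruno-haid", "has_places", "plazes:bruno"),
    ("event:ac2005", "tagged", "tag:ac2005"),
]


def build_graph(statements):
    """Index the statements by subject so outgoing edges are cheap to follow."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def walk(graph, start, max_hops):
    """Breadth-first walk; returns {node: hops} for nodes within max_hops of start."""
    seen = {start: 0}
    frontier = [start]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for _predicate, obj in graph.get(node, []):
                if obj not in seen:
                    seen[obj] = hop
                    next_frontier.append(obj)
        frontier = next_frontier
    return seen


GRAPH = build_graph(STATEMENTS)
reachable = walk(GRAPH, "post:42", 2)
```

Starting from the weblog post, two hops are enough to reach the speaker's photos and places without querying either service as a whole.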
The main problem when dealing with the Semantic Web is that it's currently being approached mainly theoretically: "Let's find some of the right statements, do some mad inferencing and the system is capable of explaining the whole world". This of course fails for various reasons and masks the opportunities arising from low-level semantic technologies like Tags or Microformats. It is also the main reason why the underlying infrastructure is so bad for real-world applications. While it's easy to cook up a great service with a standard SQL database and Helma or Ruby on Rails, it's terribly hard to get something off the ground with existing Semantic Web components.
As it would have been overbearing for us to talk too much about Information Retrieval at a conference boasting such a speaker line-up, there was just a mention of the dynamics in the areas of Vector Space Models and Fuzzy & Probabilistic Retrieval, as well as of two services. The first is DevonThink, a great Mac tool that lets you view a document, for example a .pdf, and see other similar documents on your desktop with the click of one button. This is also something Spotlight and Google Desktop will offer sooner or later, so when opening these tools the handiest feature won't be a search box, but the results already relevant to your current work. Another good example is Amazon, as they don't just show the "Customers who bought also..." but also use algorithms operating over sophisticated usage metadata to make specific offers or decide which pages to display when you click on "Look inside". So a lot is happening in employing algorithms to enhance routines and services.
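For readers who haven't met Vector Space Models, a "see similar documents" button can be roughly sketched as follows. This is a plain TF-IDF and cosine similarity toy with whitespace tokenization and made-up sample documents; it is not how DevonThink, Spotlight or Google Desktop actually work.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, index):
    """Index of the document closest to docs[index], excluding itself."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[index], v), i) for i, v in enumerate(vectors) if i != index]
    return max(scores)[1]


# Made-up sample documents.
DOCS = [
    "semantic web statements and metadata",
    "tags and metadata on the semantic web",
    "holiday photos from the mountains",
]
```

Here `most_similar(DOCS, 0)` picks the second document, since it shares several weighted terms with the first while the third shares none.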
The most predominant limitation of those algorithms is still that most of the really smart things don't scale well in terms of number of users and volume of data. Also, a lot of initiatives try to compute things that can hardly be done by machines but, more importantly, are already out there on the web.
And this is why it makes so much sense to merge the three areas of Social Software, Semantic Web and Information Retrieval: People already speak on the web, on weblogs and wikis, they upload their photos, bookmarks, describe places, link all this, tag it, have metadata extracted and so on. So on the web, there is a gigantic pool of semantic data already available.
Implementing the ideas of Semantic Transport and Semantic Logistics, this makes it possible to bring the right information, at the right time, to the right person and via the right medium. And if you draw the line between human intelligence and machine capabilities right, you get something that can be dubbed synergetic intelligence and that hugely enhances all the technologies involved:
As for Information Retrieval, as in the case of DevonThink: how can an algorithm like the vector space model handle large collections of documents, if that is what they use? (I am not 100% sure from reading the article.)
I believe the vector space model and similar models such as LSI all suffer from scalability problems. As for Probabilistic Retrieval, is that something that is still used?
Your article does not mention what are the most promising search algorithms for web/IR at the present time. That is something I was hoping to find here.
A great read.
As far as we know, DevonThink is a hybrid between Vector Spaces and Neural Networks and definitely doesn't scale that well, which I suppose isn't that much of a problem since it's a desktop tool. DevonThink is used as an example rather because of its very well executed front-end functionality, with less focus on the algorithms in the background.
There are some advancements in scaling vector space models, especially with a probabilistic notion: Thomas Hofmann, for example, developed a highly scalable Probabilistic Latent Semantic Analysis framework (http://www.cs.brown.edu/people/th/papers/Hofmann-UAI99.pdf - research paper and http://www.recommind.com - his company called Recommind).
So what's most promising? Hmm, good question. Our approach is to take as much "smart data" as possible, which means we use all things written by a user on the wikilog (personal corpus), use the structured tags created and filled in by the users (bottom-up generated ontologies/wordlists as a byproduct of daily routines), track all user actions (to build a personal text/interest profile), monitor social interaction (how user A's profile co-evolves with user B's, measured by cosine similarity) etc. This helps us to achieve way better results with 'dumber' algorithms.
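Just to illustrate the idea of "smart data, dumb algorithm", here is a rough Python sketch, not our actual implementation. It merges term counts from several signals into one weighted interest profile per user and compares users with plain cosine similarity. The source names, the weights and the sample users are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical weight per signal; these numbers are invented for illustration.
SOURCE_WEIGHTS = {"corpus": 1.0, "tags": 3.0, "actions": 0.5}


def build_profile(signals):
    """Merge term counts from several sources into one weighted interest profile."""
    profile = Counter()
    for source, terms in signals.items():
        weight = SOURCE_WEIGHTS[source]
        for term, count in Counter(terms).items():
            profile[term] += weight * count
    return profile


def cosine(a, b):
    """Cosine similarity of two sparse profiles; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up users and terms.
alice = build_profile({"corpus": ["rdf", "tags", "rdf"], "tags": ["semanticweb"], "actions": ["rdf"]})
bob = build_profile({"corpus": ["rdf", "wiki"], "tags": ["semanticweb"], "actions": []})
carol = build_profile({"corpus": ["photos", "travel"], "tags": ["flickr"], "actions": ["photos"]})
```

With these toy profiles, `cosine(alice, bob)` comes out well above `cosine(alice, carol)`, which is zero because the two share no terms. Recomputing such similarities over time would be one simple way to watch profiles drift together or apart.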
On the web, this would mean using Tag Clouds, Google Base/del.icio.us and Browser/Search History as the basis for sophisticated IR, with many more parameters in the background and more relevant and 'elegant' results on the front end. Bradley Rhodes has a great paper (http://alumni.media.mit.edu/~rhodes/Papers/rhodes-phd-JITIR.pdf) on that, and the Web 2.0 Search Matrix (http://www.nivi.com/blog/article/the-trillion-dollar-web-20-matrix) is also pretty nice.
I'll ask Tom if he can put together a more detailed technical backgrounder. Thanks again, best wishes for the remaining holidays!
I have checked the thesis paper by Thomas Hofmann. I believe that the algorithm described in that paper is patented. I am not sure whether the algorithm as described in the paper is patented, or some OTHER variation of it?
I have been reading about another algorithm called Contextual Network Graph (CNG), a variation of LSI, which is more scalable than regular LSI and is not patented. It seems like a good alternative to Dr. Hofmann's Probabilistic LSA algorithm and, unlike vanilla LSI, it is scalable. I wonder what you think about CNG?
I will keep monitoring this area for further postings/comments.
Once again thanks for your directions/feedback