Thomas Vander Wal at St Paul's

A great pleasure yesterday to have Thomas speak at St Paul's — on 'Going Social'. A talk written for us, but anticipating Thomas' FoWA talk tomorrow, it was a great overview of social software and social networking and, no surprise, of social tagging. It meshed with much of what we're now trying to do at St Paul's, from our programme for our first year students (13 year-olds) with its introduction to online, collaborative working, to the work throughout the school on social software (now fully available to students).

[Images: Folksonomy Triad; Dual Folksonomy Triad]

Those attending the talk may want to explore further some of its more technical aspects — eg, folksonomy triads. Thomas has a number of key talks and blog postings online: Folksonomy (Online Information, 2005), Folksonomy Definition and Wikipedia (November, 2005), Understanding Folksonomy: Tagging that Works presentation posted (September, 2006), Understanding Folksonomy (d.construct, 2006).

Given the current impact of Facebook, it's important to gain a perspective, see its origins and limitations (specifically, but also in the context of the general state of social networking sites — let us extract our data; give us portability; let us refind stuff)

and remember (or discover) that quite un-Facebook-like sites are ... social. I'm grateful to Thomas for setting out all of this and more. Like him, and like Demos, I place a lot of value in social bookmarking sites (such as del.icio.us) for educational use.


Community Systems

Yesterday afternoon, I went to the OII to hear Ricardo Baeza-Yates, Director of Yahoo! Research Barcelona and Yahoo! Research Latin America in Santiago, Chile:

In this talk we explore the current impact of social media or social networks, commonly called Web 2.0, where content is generated by users in sites like Yahoo! Answers, Flickr, YouTube or Del.icio.us. This phenomenon puts forward new research challenges that involve not only computer science, but also economy and psychology, just to mention a couple of related fields. We call this emerging new science, community systems, and we mention some of the issues that we are studying, as well as further open problems.

The webcast will, as ever, appear here, but here are some figures, thoughts and ideas that stuck:

  • an estimated 5 billion people will be connected to the web by 2015
  • today, there are 1.8 billion mobile phones
  • 500 million people are expected to have mobile broadband connectivity by 2010
  • the volume of internet traffic has increased 20 times in the last 5 years
  • there are more than 110 million web servers

Yahoo:

  • handles >4 billion page views per day
  • processes 12 terabytes of data per day
  • handles 2 million mail+IM messages per day

Ricardo put up a slide of what I now think of as the Bradley Horowitz creators/synthesisers/consumers pyramid (see here), followed by another of the three groups arranged in concentric circles: the history of the web has been from 'public web' (first 10 years) to 'my web' to 'our web', and consuming has now become a form of content production.

And so to user-generated content. In leading early adopter South Korea, 43.2% of the population with internet access has published UGC and 76.2% has used UGC. Examples of our web: Yahoo! Answers (the idea originated in South Korea), LAUNCHcast (Last.fm might have worked better with his audience yesterday, but the point was taken) and Flickr, where fewer than 10 employees were "aided" by millions in the Flickr community. No surprise that several times Ricardo referred to James Surowiecki's The Wisdom of Crowds. (Pointers: espgame.org; peekaboom.org.)

Yahoo!'s vision: better search through people and our trillions of artifacts. Many questions and challenges (eg, How to deal with spam?, How to establish and factor in a user's reputation?, What role does the community of users play?, What are the incentive mechanisms?, Where else can we leverage the power of the people?), but the underlying drive is to put the wisdom of crowds to work, milking query actions (breakdown: 25% informational, 40% navigational and 35% transactional). (Pointer: Yahoo!'s Mindset research site.) Semi-gnomic conclusion: Yahoo! is not seeking to personalise the search query but to personalise the search task (= active information supply driven by user activity and context).

Worth watching the webcast when it's up. There's much more in Ricardo's talk than I've tried to catch here (search language, folksonomic tagging and the inter-relatedness of meanings that Yahoo! explores in search queries …), and I came away puzzled by a couple of things said or asked about blogs, but I liked the emphasis on the web as 'scientifically young'.


The value in tags

My interest in Dave Sifry's State of the Blogosphere update of 1 May (previous post) lay primarily in the report that 'About 47% of all blog posts have non-default tags or categories associated with them'.

Joshua Porter, who gave us the del.icio.us lesson in a post last December, Learning more about Structured Blogging, writes now in The Del.icio.us Lesson:

Del.icio.us tags aren’t like meta keyword tags because of the Del.icio.us Lesson. Meta keyword tags provide no personal value whatsoever. All of their value is social. They’re for aggregation engines to find and tell other people about. In other words, they’re for getting attention only. Del.icio.us tags, on the other hand, provide personal value each time someone uses them to recall a bookmark. … the Del.icio.us Lesson might help us parse Dave’s statistics, especially this one: 47% of blog posts have tags or categories associated with them. If the Del.icio.us Lesson is predictive, it would suggest that nearly all of that 47% would be categories that users are applying for their personal value on their blog, rather than tags applied for attention only. Any way to separate out those numbers, Dave?

Joshua Porter's new post is interesting in its own right (the second half is particularly valuable) and it links to a number of articles I hadn't come across before, including Rashmi Sinha on why tags are easier than categories:

… the beauty of tagging is that it taps into an existing cognitive process without adding much cognitive cost. At the cognitive level, people already make local, conceptual observations. Tagging decouples these conceptual observations from concerns about the overall categorical scheme. The challenge for tagging systems is to then do what the brain does - intelligent computation to make sense of these local observations, and an efficient, predictable way to ensure findability.

Tagging is something I'm getting my students to use and I'm hoping that it will have a good future in our work. I take to heart Joshua's advice:

Just don’t try and make it the primary thing to do. Instead, make sure personal value precedes network value. Then you’ll have plenty to aggregate.


Librarians and the future now

Yahoo! search blog — Mark Sandler, University of Michigan:

1,100 librarians recently swarmed on the seaside town of Monterey, California for a deep dive in search technology, and I was among them. Topics included desktop search, visual and clustering search, podcasting, taxonomies and metadata, RSS, blogs, wikis, online education, intranets, spyware, digitization, wireless access, and more. In today’s world of search engines, librarians are reaching way beyond the physical walls of the library.

To make library services more compelling, some librarians have begun experimenting with new virtual reference techniques like instant messenger and text-messaging to interact with patrons. Although some adults may be slow to adopt these techniques in the library, just imagine the usefulness to all the teenagers who already use instant messenger and text-messaging as their main methods of communication. Elsewhere, librarians discussed creating online library catalogs that allow patrons to tag, comment, review, share, recommend, and otherwise create a virtual community around records in the catalog. Imagine browsing through a library catalog and seeing other people’s reviews or recommendations for similar items. Sounds like what happens on many Web sites now, places like Yahoo! Local, My Web 2.0, Flickr, Furl, Amazon.com, etc.

… librarians are continuing to evolve their roles now that people rely so heavily on search engines. What does this mean?

  • For search, knowing when to use particular vertical and specialty engines, specialty databases, meta-search engines, advanced search syntax for the big engines, and so forth.
  • For news, helping people use RSS, email alerts … to know when new and relevant content is available online.
  • For sharing information, helping people find and share with others by using blogs, wikis, and tagging.

As the world of online and offline libraries continues to converge, I think this quote summarizes the conference perfectly: “In 2020, Internet Librarian will simply be called the Librarian Conference.”


Pandora, Last.fm, MySpace

cityofsound — New Musical Experiences:

The overall model of Pandora is a fascinating alternative to Last FM, relying as it does on studied, human expert knowledge rather than software-based inferred connections. It appears to work very well at providing recommendations, which is the stated aim, though Last FM appears to be the more engaging experience and more scalable system. I suspect that welding the two approaches together into one coherent experience would deliver a very powerful system indeed. …

I've been observing myspace's development as a genuinely interesting new music experience. For example, see http://www.myspace.com/arcticmonkeys, in which fans and others can set up sites around bands and artists, containing music, images, and vibrant messageboards. These could develop into some of the more interesting new music experiences, sitting alongside Last FM and Pandora. Compared to the latter two, myspace's sites tend to be more colourful, characterful, shambolic and rich in personality. Music is deployed into this space in the form of embedded media players, with links for lyrics, videos and images alongside. In this sense, context is present as well as the music, albeit in freeform fashion, meaning that these (my)spaces become a form of combined listening/contextual experience. The music here appears as a thin veneer or layer, drifting across the experience. Music one identifies with can be adopted on personal pages within myspace, leading to 'presentation of self' opportunities which are playful and potentially powerful.… It'll be fascinating to see how myspace develops, particularly with the financial and cultural muscle of its new owner, News Corp.

The Guardian on, first, Pandora — and then Pandora and Last.fm:

"It's quite fun just putting it on when you're doing something else but I'm not sure I would use it as a music recommendation site yet," says music writer John Mullen. "It's a bit too random. And I really don't like its pretensions that it is offering some kind of universal truth about music. It just got it wrong for me too often. For example, I put in Nirvana and it was playing bands that would make Kurt Cobain turn in his grave. It's a bit crude. Often, bands that are similar on paper using their method aren't alike in reality. You could say Radiohead and Pink Floyd have similar musical 'genes' but really they create incredibly different music." Pandora is also flawed to the extent that users' personal stations, by definition, are narrowly defined. …

Pandora's top-down approach to music recommendation is in stark contrast to rival Last.fm's bottom-up approach. … "The recommendations work by finding music from users who are similar to you, who we call neighbours," says Last.fm founder, former Austrian radio DJ Martin Stiksel. "We then play you music they have listened to but which you don't have in your profile. So if you have 100 records, and 80 of them are the same, it's very likely you are going to be interested in those 20 that the other person has but which you don't. Pandora relies on a team of experts classifying the music, whereas with us it's the knowledge of the crowd, so to speak. [The crowds] know more about music, in our opinion, than a few experts." …

Both Pandora and Last.fm also have another benefit: artists and labels can submit music - providing a new way for acts to market themselves if they have been ignored by the major record labels. And Stiksel argues that they also offer the potential to steer illegal downloaders into legal ways of accessing music. "We believe in streaming rather than downloading," he says. While the recommendation sites may require fine tuning, he adds, they offer a fantastic way to navigate the ever-larger ocean of music. "In this day and age it's really difficult to find the music you like without having something like a music profile. Sites like this are going to become more and more crucial."
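
To make the 'neighbours' idea in the Guardian piece concrete for myself, here is a minimal sketch of the bottom-up approach Martin Stiksel describes: find the listeners whose collections overlap most with mine, then suggest what they have and I don't. Everything in it is an assumption for illustration (the `overlap` and `recommend` helpers, the listener names, the tiny profiles and the choice of a simple set-overlap measure); it is not Last.fm's actual method.

```python
from collections import Counter


def overlap(a, b):
    """Share of records two listeners have in common (Jaccard similarity)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(me, profiles, k=2):
    """Suggest records that my k closest neighbours have but I do not."""
    mine = profiles[me]
    # Rank every other listener by how much their collection overlaps mine.
    neighbours = sorted(
        (user for user in profiles if user != me),
        key=lambda user: overlap(mine, profiles[user]),
        reverse=True,
    )[:k]
    # Count how many neighbours own each record missing from my profile.
    votes = Counter()
    for user in neighbours:
        for record in profiles[user] - mine:
            votes[record] += 1
    return [record for record, _ in votes.most_common()]


# Hypothetical listening profiles, invented purely for illustration.
profiles = {
    "me": {"Nirvana", "Pixies", "Mudhoney"},
    "ann": {"Nirvana", "Pixies", "Mudhoney", "Sonic Youth"},
    "bob": {"Nirvana", "Pixies", "Dinosaur Jr", "Sonic Youth"},
    "cat": {"Abba", "Bee Gees"},
}

suggestions = recommend("me", profiles)
```

With these made-up profiles, `suggestions` puts Sonic Youth first (both neighbours have it) and Dinosaur Jr second, while nothing from the unrelated listener is offered. A Pandora-style system would instead compare expert-assigned attributes of the tracks themselves.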


Weird feed behaviour

1) I've had to decouple FeedDemon 1.6 RC2 (a beta) from NewsGator: the synching between the two had gone haywire, ever since a problem that developed some time around 28 December at the NewsGator end of things, and it was driving me nuts.

2) More to the point here, apologies to my FeedBurner subscribers. FeedBurner offers a range of services, including PingShot and FeedFlare, and, though I can't be sure, changing my options on both of these seems to have set off a riot in that feed: posts reappearing as unread a number of times and (most recently) a strange 'noemail' address appearing entirely unasked for in the headers of posts. I've reset my options within FeedBurner and I hope things will now quieten down again.

For good measure, I've been playing with Technorati tags: in TypePad these have to be entered manually (TypePad's categories are read as Technorati tags, but categories are not the same kind of animal as tags) which is a little bit of work. (Within Firefox, Performancing semi-automates the process for you.) The work's worth it when the tags are read by Technorati, but I'm finding the process more miss than hit. As ever, Dave Sifry is very supportive, but we still haven't cracked the problem. Niall Kennedy at Technorati suggests it may be feed-related, which led me to validate my feed and the feed of a number of blogs. Errors abound everywhere, which made me feel a bit better. I still can't get the Technorati tags to work consistently, though, and the most recent ones have simply gone unnoticed by Technorati's spiders.
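
For anyone wanting a rough first pass before reaching for a proper validator, here is a minimal sketch of the kind of check involved. It assumes the feed is RSS 2.0 (mine may well not be; an Atom feed would need different checks) and only tests that the XML is well-formed and that the channel carries the three elements RSS 2.0 requires. The `check_rss` helper and the sample feed are hypothetical, and a real feed validator checks far more than this.

```python
import xml.etree.ElementTree as ET

# RSS 2.0 requires these three elements inside <channel>.
REQUIRED = ("title", "link", "description")


def check_rss(xml_text):
    """Return a list of problems found in an RSS 2.0 document (empty if none)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    if root.tag != "rss":
        return [f"root element is <{root.tag}>, expected <rss>"]
    channel = root.find("channel")
    if channel is None:
        return ["missing <channel>"]
    return [
        f"channel is missing <{name}>"
        for name in REQUIRED
        if channel.find(name) is None
    ]


# A made-up feed with no <description>, purely for illustration.
sample = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
  </channel>
</rss>"""

problems = check_rss(sample)
```

Here `problems` holds a single complaint about the missing description. None of this explains why Technorati's spiders ignore some tags, but it does rule out the crudest feed errors.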

Web 2.0. Dontcha just luv it.


Jeff Jarvis on tagging

Well, I've had a geeky good time with the subject of tags. But this isn't just another valentine to just another cool online trend; we're so over that. No, tags have a larger lesson to teach to media. They present a clear demonstration that the web is not about flat content. The web is about connections and the value that arises from them if you enable people to collect and communicate. In the old, big, centralised, controlled world of media, a few people with a few tools - pencils, presses and Dewey decimals - thought they could organise the world and its content. But as it turns out, left to its own devices, the world is often better at organising itself. Jeff Jarvis, Media Guardian


Structured blogging

Structured blogging is back.

This is a marker so I don't lose sight of what might be a significant development next year.

Structured Blogging is a way to get more information on the web in a way that's more usable. You can enter information in this form and it'll get published on your blog like a normal entry, but it will also be published in a machine-readable format so that other services can read and understand it. Think of structured blogging as RSS for your information. Now any kind of data - events, reviews, classified ads - can be represented in your blog. Structured Blogging

Almost immediately, controversy. The engaged but non-technical punter is bound to be confused. On the one hand, Stowe Boyd:

My bet is that Structured Blogging will fail, not because people wouldn't like some of the consequences -- such as an easy way to compare blog posts about concrete things like record reviews, and so on -- but because of the inherent, and wonderful messiness of the world of blogging. Because blog posts don't have to conform to any structural standards, they can be used to do anything: nothing is out of bounds, because we haven't created the boundaries. The messiness of the world we are living in is one of the reasons that it is such a rich and rewarding experience. I am not sure who is benefitted if everyone falls into line and adopts consistent standards for the structure of blog posts. Perhaps companies like PubSub -- one of the driving forces behind all this -- who would like to be able to sort out all the blog posts about hotels, gadgets, and wine out there, and aggregate the results in some algorithmic fashion, and then make money from the resulting ratings and reviews. But I am not sure that it would be a better world for bloggers, or even blog readers. So I favor the microformat approach, which is messy, puts more of a burden on the blogger, and will require a host of tools to be built to make it all work. But microformats will work bottom-up -- tiny little tagged bits of information buried in the blog posts -- as opposed to structurally. And I am betting -- as always -- on bottom-up.

This feels right to me, but the idea that 'The promise of structured content is that we would have an explosion of software aggregating it into useful, specialized services' (bokardo) is attractive (of course) and when I find David, Marc and Thomas all lining up behind it …

Another source of confusion is the link between this, or the lack of link between this, and microformats. Bob Wyman explains that structured blogging is what we do and microformat is just what it says on the can — the format we use: 'The two concepts are orthogonal. They don't compete. They can't compete. Verbs don't compete with nouns'.

One thing seems certain: if it's as unclear as this, how on earth will it take off (assuming it should)?


David Weinberger at the OII

Back in July, 2004, I came across David Weinberger's post about Three Orders of Organisation, and then I read about his idea of Trees vs Leaves. You can read him on the former here and the latter here. The material behind and in these two postings formed much of the substance of David's seminar at the OII on Wednesday morning. In addition, I've come across a third posting, The end of data?, which also fed in to what he said this week in Oxford. There's a book on the way, Everything Is Miscellaneous — overview here — and there's a summary of an earlier version of yesterday's talk here. Finally, the OII has a webcast of the talk.

The seminar was a whistle stop tour of some "high" points in the development of taxonomies — Aristotle on nesting, Porphyry's tree, Dewey and library classification (David has blogged about Dewey a number of times, eg here and here): 'all of these systems assume there's a top down view of knowledge' and seek to banish ambiguity and present a clear picture of reality/knowledge. Everything in its right place …

But in the bottom-up world of social tagging an item can be in many categories simultaneously (I don't think 'tags' are the same as 'categories', but I'm running here with the general tenor of David's argument), and users are contributors both to the stock of tagged items and to their ordering. In this world, trees will never go away, but we need to stop looking for The Tree. Instead, we should build a big pile of data (leaves), attach as much metadata as possible and filter on the way out not on the way in. Users will do the filtering, and the moment of "taxonomizing" should be postponed until the users need to do it. There is now nothing that is not metadata — data is metadata — and we can no longer predict what users want. Messiness is a virtue.
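
A minimal sketch of what 'filter on the way out, not on the way in' might look like in practice: nothing is filed under a single heading when it is added, anyone can attach more tags later, and the grouping only happens when someone asks a question. The item titles come from David's own examples; the tags and the `add_tag` and `filter_out` helpers are my assumptions, invented for illustration.

```python
# The "big pile of leaves": items carry tags but sit in no fixed tree.
pile = [
    {"title": "Deep-fried Mars Bar", "tags": {"food", "scotland"}},
    {"title": "Heavy Metal Umlaut", "tags": {"music", "typography"}},
    {"title": "Dewey Decimal", "tags": {"libraries", "classification"}},
]


def add_tag(pile, title, tag):
    """Any user can attach another tag to an item at any time."""
    for item in pile:
        if item["title"] == title:
            item["tags"].add(tag)


def filter_out(pile, *wanted):
    """Filter on the way out: return titles carrying every requested tag."""
    wanted = set(wanted)
    return [item["title"] for item in pile if wanted <= item["tags"]]


add_tag(pile, "Heavy Metal Umlaut", "wikipedia")
music_on_wikipedia = filter_out(pile, "music", "wikipedia")
```

The same item turns up under 'music', 'typography' or 'wikipedia' depending on who is asking, which is the sense in which it sits in many categories at once. Asking for no tags at all simply returns the whole pile.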

David sees this bottom-up approach to tagging as a reaction to the semantic web. There is no end to the way the deck of digitalised knowledge in this world can be cut and sliced. (Wikipedia, as Jimmy Wales says, is not paper: for one thing, David said, where the Encyclopedia Britannica restricts itself to 32 printed volumes and 65,000 topics, Wikipedia has no such restrictions and is currently running at some 800,000 entries, including ones on the Deep-fried Mars Bar and, famously, the Heavy Metal Umlaut.) In the world of multi-subjectivity, knowledge is never going to be "perfect". Instead, we must think in terms of 'good enough'. We are living through a revolution, a fundamental change to the way we understand knowledge and our pursuit of it. The global conversation that is the net changes the roles of filters and, therefore, our understanding of what a filter is.

In the questions at the end of his talk, it seemed to me that in fact David is prepared to admit much more nuance and to accept that top-down taxonomies are not going to go away. And, yes, he agreed that the web is both a distributed library and something that is about and for connectivity. It was put to him that the top-down, authoritarian conception of the semantic web is only one model, and that there are other models where the semantic web is bottom up. I share the view developed by him and his questioner at this point, that the net can provide for many different ways of organising knowledge. And I'm sure he's right when he says that soon we will see people making a living through devising new classificatory systems.

There's a problem of scale, too: as David put it, too many taggers can make for an unhelpful, confusing tag-soup, counterable, perhaps, through cluster-analyses intended to disambiguate (eg, Flickr's Capri clusters). But in David's view, if 'good enough' is good enough then scaling should not prove a problem.

So is "good enough" good enough? Tom Chance probed whether it's sufficient in matters more important than the examples David used (eg, beer): when it comes to deciding about nuclear power, 'good enough' is surely short of the mark. My colleague, Ian, linked this point to one about the role of institutions in this new world. They're highly unlikely to go away (!), but the morning's seminar left me in no doubt that trust, and the verification of trust, in institutions is altered by the rise of online, do-it-yourself mass publishing. Yet, as Jonathan Zittrain said in his summing up, the desire for the canonical article on a topic continues.

At the start of his talk David remarked, 'This could be the bright, shiny period of the internet, of openness'. The net gives us many reasons to be happy, but there are many forces at work which may make history of David's visionary presentation. More about this soon.