The Many Skins of Web Data

As I've said in the past, I really love making stuff on the internet, as much for the thing that's created as for watching and learning from the reactions to it. This was most certainly the case with My First Tweet (which is still alive and well, by the way, with 5,370 first tweets in the DB so far). There's one response I want to highlight today, though, because I think it's particularly interesting.

A few days after launching, I got an email from someone telling me I must take down their first tweet. It wasn't offensive or anything like that; they just didn't like that it was on the site without their having said it was okay. While I didn't really understand it, I figured it was a reasonable request and would only take a minute of my time. So I took it down. When they went back to check that I had done what I said, they found their first tweet again. Once again, I took it down.

Then I realized what the problem was. You see, the site is built so that if the user's first tweet isn't already in the database, it queries Twitter's API and grabs it. That means that every time they went back to check if I had been honest, they were actually the one putting their first tweet back in the database.
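
Here is a minimal sketch of that lookup in Python. It is not the site's actual code: `db`, `fetch_first_tweet_from_api`, and `take_down` are hypothetical stand-ins, and the API call is faked with canned text. It only illustrates why a manual deletion doesn't stick when every visit can refill the database.

```python
# Hypothetical stand-in for the site's database: username -> first tweet.
db = {}

def fetch_first_tweet_from_api(username):
    # Placeholder for the real request to Twitter's API (an assumption);
    # it returns canned text so the sketch runs without a network.
    return f"first tweet by {username}"

def get_first_tweet(username):
    """Return the stored first tweet, fetching and storing it on a miss."""
    if username not in db:
        db[username] = fetch_first_tweet_from_api(username)
    return db[username]

def take_down(username):
    """Delete the stored tweet, which is all a manual takedown does."""
    db.pop(username, None)

get_first_tweet("xyz")   # first visit: the tweet is fetched and stored
take_down("xyz")         # I remove the entry by hand
get_first_tweet("xyz")   # the user checks the site, and it is stored again
```

Making the takedown stick would need something the lookup consults before fetching, such as a list of usernames that asked to be left out.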

That, I thought, is a really interesting problem. I went over to read Twitter's terms of service and indeed you, the user, own everything you create. In addition, they "encourage users to contribute their creations to the public domain or consider progressive licensing terms." However, from a technology perspective there are only two states for Twitter: public and private.

Let me step back for one second and explain the act of querying Twitter's API. Basically, when someone puts their username into the site, I send a message to Twitter saying, "hey, can I have the information for the user XYZ?" Twitter then sends me back one of two different messages. Most often they say, "sure, here's the info you requested," but sometimes they say, "sorry, we can't give you that info because the user you requested has made themselves private." (When you try to look at the tweets of a user who is private on twitter.com, you get a little lock icon and a message that says you can only see this person's tweets if they give you permission.)

So basically Twitter is a binary system: you are either public or you are private. If you're private, I can't grab your first tweet. However, if you're public, I can, whether you want me to or not.
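
To make the two possible answers concrete, here is a toy Python model of that exchange. Everything in it is an assumption for illustration: `ACCOUNTS`, `ApiResponse`, and `request_user_info` are made-up names, and this is not how Twitter's real API is shaped. The point is that the only thing the check looks at is the public/private flag.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApiResponse:
    ok: bool
    first_tweet: Optional[str] = None
    error: Optional[str] = None

# Made-up accounts standing in for Twitter's data (oldest tweet first).
ACCOUNTS = {
    "public_user": {"protected": False, "tweets": ["hello world", "lunch"]},
    "private_user": {"protected": True, "tweets": ["just for friends"]},
}

def request_user_info(username: str) -> ApiResponse:
    """Answer 'can I have the information for user XYZ?' in one of two ways."""
    account = ACCOUNTS[username]
    if account["protected"]:
        # "sorry, we can't give you that info"
        return ApiResponse(ok=False, error="user is private")
    # "sure, here's the info you requested"
    return ApiResponse(ok=True, first_tweet=account["tweets"][0])

print(request_user_info("public_user"))
print(request_user_info("private_user"))
```

Notice that nothing in `request_user_info` asks whether a public user is happy about a particular site using their tweet. There is simply no input for that.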

This is particularly interesting to me for a few reasons. First, it's a good way to explain how outdated the idea of webpages really is. Most people think of them as these hard-coded things, like pages in a magazine or something. However, many of the webpages you look at are not created until the moment you look at the site. Brand Tags, for instance, really only consists of about a dozen files. Even though there are 800 brands in the system, all the tag clouds are generated by the same few lines of code, which query the database and return the formatted results. When I got the request to take down the first tweet, I complied, but it didn't really matter, because the page never existed as anything but a database entry in the first place.
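
A rough Python sketch of that idea, using an in-memory SQLite database. The table layout, the sample brands, and `render_tag_cloud` are all invented for illustration and are not Brand Tags' real code. One function builds the page for whichever brand is requested, so no page exists until someone asks for it.

```python
import sqlite3
from html import escape

# Hypothetical schema and sample rows; the real database is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (brand TEXT, tag TEXT, votes INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [("Acme", "cheap", 3), ("Acme", "reliable", 7), ("Globex", "shiny", 5)],
)

def render_tag_cloud(brand: str) -> str:
    """Build the page for one brand at request time from its database rows."""
    rows = conn.execute(
        "SELECT tag, votes FROM tags WHERE brand = ? ORDER BY tag", (brand,)
    ).fetchall()
    # More votes means bigger text, which is all a tag cloud really is.
    spans = [
        f'<span style="font-size:{10 + votes}px">{escape(tag)}</span>'
        for tag, votes in rows
    ]
    return f"<h1>{escape(brand)}</h1>" + " ".join(spans)

print(render_tag_cloud("Acme"))
print(render_tag_cloud("Globex"))
```

The same dozen or so lines serve every brand in the table, whether there are two of them or 800.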

What's so interesting about this is that that's actually how Twitter works as well (I believe). The results that the Twitter API returns are remarkably similar to the way the pages are formatted (down to the fact that you can only get to page 160 on both Twitter.com and from their API). That means that the site isn't so much a site as it is a view for the data (of which My First Tweet is one, search.twitter.com is another and Twitter Grader is a third).

Twitter isn't alone in working this way, either. Most sites these days are just skins for the underlying data, which is increasingly being shared with others who are making new skins for it. This isn't news to those who build things on the web, but I think it's a fundamentally different way of working than the average user understands. Just something to think about.

The second point I wanted to make is around this public/private thing. In a world where everything is just skins for the underlying data, you have fewer and fewer controls over how that data is displayed when you sign up to use a service. Some services (like Flickr) allow you to specify a license for your work (full copyright, Creative Commons, etc.) and they report that to the people who want to work with the data. But even then, the API user can choose to ignore the licensing entirely and just take the photo, unless the user has specified that it CANNOT be used (either because it's private or because there is no access to the full-size version).
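
As a sketch of that gap, here is a toy Python example. The photo records, license strings, and function names are all hypothetical and are not Flickr's actual API. The service enforces privacy, but the license is only a label that each client decides whether to read.

```python
# Made-up photo records; the license values are illustrative labels only.
photos = [
    {"url": "https://example.com/1.jpg", "license": "all-rights-reserved", "private": False},
    {"url": "https://example.com/2.jpg", "license": "cc-by", "private": False},
    {"url": "https://example.com/3.jpg", "license": "cc-by", "private": True},
]

def api_search():
    """What the service enforces: private photos are never returned."""
    return [p for p in photos if not p["private"]]

def respectful_client(results):
    """A client that reads the license field and keeps only CC photos."""
    return [p["url"] for p in results if p["license"].startswith("cc-")]

def careless_client(results):
    """A client that ignores the license field and takes everything."""
    return [p["url"] for p in results]

results = api_search()
print(respectful_client(results))
print(careless_client(results))
```

Both clients are making perfectly valid requests. The only hard stop in the whole exchange is the `private` flag.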

As someone who develops with APIs, I find this kind of flexibility pretty awesome. I can get access to pretty much anything I want (which is rad). But for some users, clearly this is worrying. I don't know that more safeguards need to be put in place, but I do think that this wholesale data access needs to be better explained (there's a tendency to live in a world where we assume people know what an API is¹).

As usual, no hard answers here, just some stuff to think about.

¹ While I'm no technician, I do think it's worth trying to explain what an API is, since the term is thrown around quite a bit these days. Essentially, an API is just wholesale access to the data/functionality from a web service. If you're Google Maps, that can manifest itself in letting people send you an address and returning the latitude and longitude, or if you're Flickr, that can mean returning the URLs for photos tagged with noah. Developers can then find lots of different ways to use the data/functionality. Essentially, with access to the raw data the sky is the limit. In some ways, RSS feeds are kind of like APIs for websites. They provide people with some access to the underlying data (which is separated from the presentation layer that you see when you visit NoahBrier.com, for instance). (I don't know if this definition is helpful at all. If anyone wants I can take another shot, or maybe someone else can try to give a better definition in the comments.)