Sunday, October 3, 2010

Wikipedia

I've had a lot of random thoughts about Wikipedia stewing for a while. It's finally time for me to get them out, and hopefully some interesting discussion will come out of it.

A Little History
One of the most interesting aspects of Wikipedia's history, to me, is that its creators initially didn't think it would work. It was a lark.

Before Wikipedia, Jimmy Wales and Larry Sanger worked on a project called "Nupedia", a free online encyclopedia that was not a wiki and was not open for everyone to freely contribute. It had an extensive peer-review process that made it awkward for people to add content. Jimmy and Larry were trying to figure out how to make it easier for users to contribute, and they heard about the WikiWikiWeb project (http://c2.com/cgi/wiki), the world's first wiki. They were short on cash and it was cheap and easy to set up a wiki, so they decided to give it a shot and see how badly it would fail. It didn't, and Wikipedia was born.

See Larry Sanger's article about it for more details.

What It's For
Wikipedia is a pretty reliable and extensive source of facts. Not all information belongs on Wikipedia (see "What Wikipedia Isn't" in "Wikipedia: The Missing Manual"), but there are several companion sites to fill some of those gaps (e.g., Wiktionary, Wikiquote, and Wikia; EDIT: see Wikipedia:Alternative outlets for more), and what is on Wikipedia is extremely useful. And if something isn't useful, you can fix it, without even having to register!

I've heard a lot of complaints, mostly from English teachers, about how you can't trust Wikipedia because "anyone can say anything" on it. There are statistics showing that Wikipedia is (gasp!) slightly less reliable than the Encyclopedia Britannica, and for some reason people think that's a big deal. And, by the way, it makes perfect sense that you can't cite Wikipedia as a source in a research paper, because doing so is just plain lazy. Every claim that's cited on Wikipedia directs you straight to the source(s), and anything that isn't cited doesn't belong in a research paper because you have no idea who said it ("Yes I do! 78.133.30.247 said it.").



Who Creates It?

Talking to people around me, I've been surprised by how few of them feel competent to edit Wikipedia. I ran across a pretty interesting article from 2006 analyzing how "clustered" the editing of Wikipedia is. At the time, Jimmy Wales had stated that 0.7% of the users (524 people) were responsible for 50% of the edits. The author of the article uses a different measurement on a few arbitrarily selected articles (how many letters in the current version were typed by which users) and finds a very different picture, but there are some big weaknesses in that approach, too (deleting and rearranging blocks of text can improve an article substantially, yet neither adds any letters).
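For the programmers reading, here's a rough idea of what a "letters that survived" measurement could look like. This is only a sketch in Python using the standard difflib module, assuming a page's history is available as a list of (user, text) pairs, oldest first. The function name and the sample revisions are made up for illustration, and it's not meant to reproduce the method that article actually used.

```python
from collections import Counter
from difflib import SequenceMatcher


def surviving_characters(revisions):
    """Count how many characters of the latest text each user typed.

    `revisions` is a hypothetical list of (user, text) pairs, oldest first.
    """
    owners = []   # owners[i] is the user credited with character i
    current = ""  # text of the revision processed so far
    for user, text in revisions:
        matcher = SequenceMatcher(a=current, b=text, autojunk=False)
        new_owners = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged characters keep their original author.
                new_owners.extend(owners[i1:i2])
            elif tag in ("replace", "insert"):
                # New or rewritten characters are credited to this editor.
                new_owners.extend([user] * (j2 - j1))
            # "delete" contributes nothing to the new text.
        owners, current = new_owners, text
    return Counter(owners)


# Made-up sample history: carol only deletes a character.
sample = [("alice", "abc"), ("bob", "abcdef"), ("carol", "abdef")]
counts = surviving_characters(sample)
for user in ("alice", "bob", "carol"):
    print(user, counts[user])
```

In this toy history carol's only edit is a deletion, so she ends up with zero characters to her name, which is exactly the weakness mentioned above: cleanup work doesn't show up in this kind of count.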

My background with Wikipedia is that I've done a lot of reading and a little writing. I've made a few small edits, just little things that bugged me, and I guess it just seemed natural to me since I'm a programmer and I'm used to collaborating online. I've found the Wikipedia community to be very friendly and cooperative. There are vandals and lots of opinionated people who barely speak English making edits, and the Wikipedians are used to it, so it's not some pristine, elite community you're imposing on. You really are welcome to edit, and you don't have to apologize for being inexperienced. If there's something clearly wrong with your changes, they'll fix it without batting an eye, and usually they'll explain what they don't like about it.

There's also a "discussion" page attached to each regular page. Scanning through the discussion pages is a great way to learn about how Wikipedia works. If you're not sure about an edit you're making, or it's a big change, just add a section on the discussion page and describe what you're doing and why. And even though it's called a "discussion" page, I don't think you're necessarily expected to check back for responses to your questions and explanations, so there's no reason to shy away thinking it'll be a big commitment.

I really think everyone should try editing Wikipedia at least once (there's a guide with lots of links if you want help getting started), just so you understand the process and are comfortable with it. If I were an English teacher, I think I'd make an assignment of making one edit on some page somewhere on Wikipedia so the students understand where the content comes from, and get some hands-on experience supporting their claims.

Clogs in the System
There are some sorts of "traffic jams" that crop up a lot on Wikipedia. One is that most or all of the content in math-related articles is incomprehensible, and I think I've figured out why. When you see content that's written in poor English, and you can't understand any of it, you'll usually be pretty quick to delete it or criticize it. But if it's heady and academic-sounding and you don't understand any of it, you'll usually assume the author must know what they're talking about, that it's helpful to other people but not you, and then leave it where it is. I think there's a sort of "natural selection" for confusing academic content on Wikipedia, where it's relatively easy to create and difficult to delete. You'll find the same thing in a lot of philosophy-related articles.

I think it's a sort of benign tumor that Wikipedia would do well to excise wherever possible. My stance is that long mathematical proofs and derivations don't belong on Wikipedia any more than extensive plot summaries of books and movies do. They're not citable claims, even if they're useful information. For these, Wikipedia articles should just summarize the proofs and derivations, and provide excellent links to further information (like math.wikia.com or Wolfram MathWorld), just the way any other citation works on Wikipedia.

Another little bug for me is that it would be nice to be able to collect pronunciation information for the names of famous people and places on Wikipedia, but citable references are scarce. There is an "International Phonetic Alphabet" that they use to annotate pronunciations, but it's very hard to find textual evidence of a certain pronunciation in the real world, and public domain audio or video evidence is just as hard to come by.

The good news is there are people actively strategizing and cleaning up issues. In addition to the paid staff working on improving the Wikipedia system in general, I recently discovered there are groups called WikiProjects that set out to assess and improve the quality of articles in different subject areas of Wikipedia.

The Future
Wikipedia has a lot of happy users and quite a few active editors, and honestly I think they'd be doing just fine to maintain their current growth for the foreseeable future. I can't imagine it being replaced or obsoleted. But there are several ways I could imagine it being even better.

One feature I can't believe they haven't implemented yet is a common feature in software development tools called "blame" or "annotate". They already have a history for each page that shows all the versions there have ever been of that page, through time. You can select any two dates and see a neat comparison or "diff" of what text was added, deleted, or otherwise changed in between. An "annotate" view turns that feature on its head. It shows, next to each line, the date when that line was last modified. This is very helpful for tracking down long-standing vandalism, for instance. There is a prototype of that feature here, but it's still just a prototype.
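If you're curious how an annotate view could be computed from a page's history, here's a minimal sketch in Python using the standard difflib module. It assumes the history is available as a list of (date, text) pairs, oldest first. The annotate function and the sample history are made up for illustration, and none of this is meant to reflect how the prototype or Wikipedia's own software works.

```python
from difflib import SequenceMatcher


def annotate(revisions):
    """Return (date, line) pairs for the latest revision.

    `revisions` is a hypothetical list of (date, text) pairs, oldest first.
    Each line is tagged with the date of the revision that last changed it.
    """
    annotated = []  # (date, line) pairs for the revision processed so far
    for date, text in revisions:
        new_lines = text.splitlines()
        old_lines = [line for _, line in annotated]
        matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Untouched lines keep the date they already had.
                updated.extend(annotated[i1:i2])
            elif tag in ("replace", "insert"):
                # Added or rewritten lines get this revision's date.
                updated.extend((date, line) for line in new_lines[j1:j2])
            # Deleted lines simply drop out.
        annotated = updated
    return annotated


# Made-up sample history with three revisions.
history = [
    ("2010-01-05", "Wikipedia is a free encyclopedia.\n"
                   "It launched in 2001."),
    ("2010-03-12", "Wikipedia is a free encyclopedia.\n"
                   "It launched in 2001.\n"
                   "Anyone can edit it."),
    ("2010-09-30", "Wikipedia is a free online encyclopedia.\n"
                   "It launched in 2001.\n"
                   "Anyone can edit it."),
]
for date, line in annotate(history):
    print(date, line)
```

The idea is just to replay the history from the oldest version forward, carrying each unchanged line's date along and stamping anything new or rewritten with the current revision's date. A real tool would probably also want the editor's name next to each line, and would need something smarter for text that gets moved around.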

On another note, Wikipedia naturally has two tiers of content, cited and uncited, and I love that I can see both on one site. But sometimes there are claims that are too relevant to delete from an article and surprisingly hard to verify from an independent source, that just seem doomed to linger uncited. If you search the internet for sources, a lot of times you'll only turn up people who have copied and pasted that very article from Wikipedia, often without even mentioning where it came from. I've been wondering, as time goes on, will these citation problems stabilize and decrease, or will they only get worse as Wikipedia grows?

You've probably seen the "citation needed" tags floating around for stuff like that, but sometimes I wish the difference between cited and uncited content were a bit more obvious. As it is, a lot of people don't even notice the difference, and just repeat everything they read on Wikipedia as fact. Even the uncited material on Wikipedia is somewhat reliable, because if it can be easily disproven, someone will probably delete it before too long. But I've been wondering how it would look if the uncited content, instead of being marked with "citation needed" tags everywhere, were marked with a very pale yellow background, so that you could easily scan it and see exactly which parts of the content were less reliable.

One final thing I think the future will hold for Wikipedia: more non-human readers. That's right, I'm talking about learning AI systems. It will certainly be a good source of raw knowledge for certain types of software projects once we get some more tools to extract semantic information from it, but I think it's also worth noting that those tools will help us in improving Wikipedia, too: automatically finding weak claims, directing Wikipedia editors straight to possible sources across the internet for specific claims, and possibly doing active editing to clean up problem areas. It's a thought, anyway.