These are real questions. A Texas company called Infochimps has datasets quite similar to these, ready for you to use. Some of the datasets are free; others you have to pay for. More interesting is that if you have a dataset you think other people might be interested in, or even pay for, Infochimps will host it for you and help you find customers. (Infochimps just announced it had raised $1.2 million in its first round of institutional funding.)
One of the datasets you can get from Infochimps for free is the set of smileys used in tweets sent on Twitter between March 2006 and November 2009. It tells you that the smiley ":)" was used 13,458,831 times, while ";-}" was used only 1,822 times.
If you're willing to fork over $300, you can get a 160MB file containing a month-by-month summary of all the hashtags, URLs, and smileys used on Twitter during the same period. That dataset will tell you that during September 2009, the hashtag #kanyeisagayfish was used 11 times while #takekanyeinstead was used 141 times.
If you're a Scrabble player, you can spend $4 for a list of the 113,809 official words, with definitions. Or you can get them free, without the definitions.
[Image courtesy of Infochimps, Inc. CC-BY-A]
The character connection data was fed to Cytoscape, an open-source visualization program used in bioinformatics; Mike Bergman has a nice article about its use for large RDF graphs. The networks are laid out using a force-directed algorithm (pretty much the simplest thing you can do), and coloring is applied arbitrarily.
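To give a feel for what "force-directed" means, here is a toy sketch in the spirit of the Fruchterman-Reingold algorithm: every pair of nodes repels, every edge attracts, and positions settle into a readable layout. This is my own illustrative code, not what Cytoscape actually does; the graph and constants are made up.

```python
import math
import random

def force_directed_layout(nodes, edges, iterations=200, k=1.0, max_step=0.1):
    """Toy Fruchterman-Reingold-style layout (illustrative only).

    All node pairs repel with force k^2/d; connected nodes attract
    with force d^2/k. Movement per iteration is capped at max_step,
    playing the role of the algorithm's 'temperature'.
    """
    random.seed(42)  # deterministic starting positions for this sketch
    pos = {n: [random.uniform(-1, 1), random.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += f * dx / d; disp[a][1] += f * dy / d
                disp[b][0] -= f * dx / d; disp[b][1] -= f * dy / d
        # Attraction along each edge.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= f * dx / d; disp[a][1] -= f * dy / d
            disp[b][0] += f * dx / d; disp[b][1] += f * dy / d
        # Apply displacements, capped to keep the layout stable.
        for n in nodes:
            dl = math.hypot(*disp[n]) or 1e-9
            step = min(dl, max_step)
            pos[n][0] += disp[n][0] / dl * step
            pos[n][1] += disp[n][1] / dl * step
    return pos

# A connected triangle plus one isolated node: the triangle pulls
# together while the loner gets pushed away, mirroring how anomalous
# character subnets separate out in the Cytoscape graphs.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A")]
layout = force_directed_layout(nodes, edges)
```

After a few hundred iterations the connected nodes sit near the equilibrium distance (roughly k) while the disconnected node drifts away, which is why oddball subnets like the one discussed below show up as visually distinct clusters.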
The other anomalous character subnet is labeled "???" in the graph. A closer look reveals this to be the set of characters that look like upside-down Roman text.
Kromer has noticed that the price (or perhaps cost) of a partial data set follows a non-monotonic curve (see graphic). Small amounts of data are essentially free, but a peak value is reached when well-chosen portions are extracted from the full data set. If we were discussing book metadata, for example, peak value might accrue to a set of the 100,000 top-selling books.
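To make the shape of that claim concrete, here is a purely hypothetical toy model of such a curve (my own construction, not Kromer's): value rises from near zero for tiny free samples, peaks at some curated fraction of the full set, and declines as the slice approaches the raw bulk.

```python
import math

def partial_set_value(fraction, peak=0.1):
    """Hypothetical, illustrative value curve for a partial data set.

    'fraction' is the share of the full set extracted; 'peak' is the
    (made-up) fraction at which a curated slice -- say, the 100,000
    top-selling books out of a much larger catalog -- is most valuable.
    The function is 0 at fraction=0, maxes out at 1.0 when
    fraction == peak, and decays toward 0 as fraction -> 1.
    """
    return (fraction / peak) * math.exp(1.0 - fraction / peak)

# Sample the curve: tiny samples are near-worthless (free), the
# curated slice is the sweet spot, the whole dump is worth less per unit.
samples = [0.001, 0.01, 0.1, 0.5, 1.0]
values = [partial_set_value(f) for f in samples]
```

The specific functional form is arbitrary; the point is only that a single-peaked curve like this captures "small amounts free, curated extracts valuable, raw bulk cheap again."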
This pattern seems plausible to me, but I'd like to see some clearer examples. I've previously written about having too much data, but that article looked at the effect of error rates on data collection; Kromer's curve is about utility.
For me, the most interesting thing about Infochimps is the idea that the best way to make data flow in large volumes and create new types of knowledge is to provide the right incentives for data producers through the establishment of a market. This makes a lot of sense to me; however, I'm not sure the Infochimps market has also established the incentives needed for data set maintenance: the world's most valuable and expensive data sets are ones that change rapidly.
Kromer contrasted the Infochimps approach with that of Wolfram, whose Alpha service is produced by "putting 100 PhDs and data in a lab". He also feels that much of the work being put into the semantic web is a "crock" because its technology stack solves problems that we don't have. Humans are pretty good at extracting meaning from data, given a good visualization.
We can even recognize upside-down text.