Wednesday, June 17, 2009

Is Semantic Web Technology Scalable?

"Scalable" is a politician of a word. It is attractive enough to win solid backing from diverse factions: it has something to offer both the engineericans and the marketerists. At the same time it has the dexterity to mean different things to different people, so that the sales team can always argue that the competition's product lacks "scalability". The word even supports multiple mental images: you can think of soldiers scaling a wall or climbers scaling a mountain; a more correct image is that of scaling a picture to make it bigger. Even technology cynics can get behind the word "scalable": if a technology is scalable, they would argue, that means it hasn't been scaled.

The fact is that scalability is a complex attribute, more easily done in the abstract than in the concrete. I've long been a cynic about scalability. A significant fraction of engineers who worry about scalability end up with solutions that are too expensive or too late to meet the customer problems at hand, or else they build systems that scale poorly along an axis of unexpected growth. Another fraction of engineers who worry too little about scalability get lucky and avoid problems by the grace of Moore's Law and its analogs in memory storage density, processor power and bandwidth. On the other hand, ignorance of scalability issues in the early phases of a design can have catastrophic effects if a system or service stops working once it grows beyond a certain size.

Before considering the scalability of Semantic Technology, let's define terms a bit. The overarching definition of scalability in information systems is that the resources needed to solve a problem should not grow much faster than the size of the problem. From the business point of view, it's a requirement that 100 customers should cost less to serve than 100 times what it would cost to serve one customer (the scaling should be less than linear). If you are trying to build a Facebook, for example, you can tolerate linear scaling in the number of processors needed per million customers if you have sublinear costs for other parts of the technology or significantly superlinear revenue per customer. Anything superlinear will eventually kill you. If there are any bits of your technology which scale quadratically or even exponentially, then you will very quickly "run into a brick wall".
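To make the difference between these growth rates concrete, here is a minimal sketch in Python. The growth models and the 0.8 exponent chosen for "sublinear" are illustrative assumptions, not measurements of any real system; the point is only how the cost per customer behaves as the customer count grows.

```python
# Illustrative growth models (assumptions, in arbitrary cost units).
GROWTH_MODELS = {
    "sublinear": lambda n: n ** 0.8,   # hypothetical exponent below 1
    "linear": lambda n: float(n),
    "quadratic": lambda n: float(n) ** 2,
}


def total_cost(customers, model):
    """Total cost of serving `customers` under the named growth model."""
    return GROWTH_MODELS[model](customers)


def cost_per_customer(customers, model):
    """Average cost per customer; this is what the business cares about."""
    return total_cost(customers, model) / customers


if __name__ == "__main__":
    for n in (1, 100, 10_000, 1_000_000):
        row = ", ".join(
            f"{name}={cost_per_customer(n, name):.3g}" for name in GROWTH_MODELS
        )
        print(f"{n:>9} customers -> cost per customer: {row}")
```

Under the sublinear model each additional customer gets cheaper to serve, under the linear model the per-customer cost stays flat, and under the quadratic model the per-customer cost grows in step with the number of customers, which is the "brick wall".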

In my post on curated datasets, I touched on an example where a poorly designed knowledge model could "explode" a semantic database. This is one example of how the Semantic Web might fail the scalability criterion. My understanding of the model scaling issue is that it's something that can be addressed, and is in fact addressed in the best semantic technology databases. The semantic analysis component of semantic technology can quite easily be parallelized, so that appears to pose no fundamental problems. What I'd like to address here is whether there are scalability issues in the semantic databases and inference engines that are at the core of Semantic Web technology.

Enterprise-quality semantic databases (using triple-stores) are designed to scale well in the sense that the number of RDF triples they can hold and process scales linearly with the amount of memory available to the CPU. So if you have a knowledge model that has 1 billion triples, you just need to get yourself a box with 8GB of RAM. This type of scaling is called "vertical scaling". Unfortunately, if you wanted to build a Semantic Google or a Semantic Facebook, you would probably need a knowledge model with trillions of triples. You would have a very hard time doing it with a reasoning triple store, because you can't buy a CPU with that much RAM attached. The variety of scaling you would want in order to solve bigger problems is called "horizontal scaling". Horizontal scaling distributes a problem across a farm of servers, and the scaling imperative is that the number of servers required should scale (at worst) linearly with the size of the problem. At this time, there is NO well-developed capability for semantic databases with inference engines to distribute problems across multiple servers. (Mere storage is not a problem.)
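A rough back-of-the-envelope calculation shows why vertical scaling runs out. The sketch below simply extrapolates the figure quoted above (1 billion triples in 8GB of RAM) linearly; reading "8GB" as 8 GiB and assuming the ratio stays constant at larger sizes are both assumptions made for illustration, and the function names are hypothetical.

```python
GIB = 2 ** 30  # bytes in one gibibyte

# Ratio implied by the post's figure: 1 billion triples per 8 GiB of RAM.
# Assumption: memory use stays proportional to the number of triples.
BYTES_PER_TRIPLE = 8 * GIB / 1_000_000_000


def ram_needed_gib(triples):
    """RAM in GiB needed to hold `triples` at the assumed bytes-per-triple ratio."""
    return triples * BYTES_PER_TRIPLE / GIB


if __name__ == "__main__":
    for triples in (1_000_000_000, 1_000_000_000_000):
        print(f"{triples:>16,} triples -> about {ram_needed_gib(triples):,.0f} GiB of RAM")
```

At this ratio a single trillion triples already needs on the order of thousands of gigabytes of RAM in one box, and "trillions" multiplies that again.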

I'll do my best to explain the difficulties of horizontal scaling in semantic databases. If you're an expert in this, please forgive my simplifications (and please comment if I've gotten anything horribly wrong). Horizontal scaling in typical web applications uses partitioning. Partitioning of a relational database typically takes advantage of the structure of the data in the application. So for example, if you're building a Facebook, you might choose to partition your data by user. The data for any particular user would be stored on one or two of a hundred machines. Any request for your information is routed to the particular machine that holds your data. That machine can make processing decisions very quickly if all your data is stored on the same machine. So instead of sharing one huge Facebook web application with 100 million other Facebook users, you might be sharing one of a hundred identical Facebook application servers with "only" a million other users. This works well if the memory size needed for 1 million users is a good match to that available on a cheap machine.
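Partitioning by user can be sketched in a few lines. This is a simplified illustration, not how Facebook or any particular system actually does it: the `ShardedStore` class and `shard_for_user` function are hypothetical names, the "machines" are just in-memory dictionaries, and hashing the user ID is only one of several possible routing schemes.

```python
import hashlib

NUM_SHARDS = 100  # "a hundred machines", as in the example above


def shard_for_user(user_id, num_shards=NUM_SHARDS):
    """Map a user ID to a shard index using a stable hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store in which all data for one user lives on a single shard."""

    def __init__(self, num_shards=NUM_SHARDS):
        # Each dict stands in for one machine's local storage.
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, user_id):
        return self.shards[shard_for_user(user_id, len(self.shards))]

    def put(self, user_id, key, value):
        self._shard(user_id).setdefault(user_id, {})[key] = value

    def get(self, user_id, key):
        return self._shard(user_id)[user_id][key]


if __name__ == "__main__":
    store = ShardedStore()
    store.put("alice", "status", "at the conference")
    print(shard_for_user("alice"), store.get("alice", "status"))
```

Because every request for a given user is routed to the same shard, that shard can answer from local data without talking to any other machine.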

In a semantic (triplestore) database, information is chopped up into smaller pieces (triples), with the result that much of the information will be dispersed into multiple pieces. A partitioned semantic database would need to intelligently distribute the information across machines so that closely related information will reside on the same machine. Communication between machines is typically 100 times slower than communication within the same machine, so the consequences of doing a bad job of distributing information can be disastrous. Figuring out how to build partitioning into a semantic database is not impossible, but it's not easy.
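The effect of placement on a partitioned triple store can be illustrated with a toy model. Everything here is an assumption made for the sake of the sketch: the triples are invented, each node is assigned to one of two machines by hand, following a triple whose subject and object sit on different machines counts as one remote hop, and a remote hop costs 100 times a local one (the rough figure quoted above).

```python
LOCAL_COST = 1
REMOTE_COST = 100  # assumption: the "100 times slower" figure from the text

# Invented example data: (subject, predicate, object).
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "acme"),
    ("dave", "knows", "erin"),
    ("erin", "worksAt", "globex"),
]


def count_remote_hops(triples, placement):
    """Count triples whose subject and object live on different machines."""
    return sum(1 for s, _, o in triples if placement[s] != placement[o])


def traversal_cost(triples, placement):
    """Cost of following every triple once under the toy cost model."""
    remote = count_remote_hops(triples, placement)
    local = len(triples) - remote
    return local * LOCAL_COST + remote * REMOTE_COST


# Placement that keeps closely related nodes together.
CLUSTERED = {"alice": 0, "bob": 0, "carol": 0, "acme": 0,
             "dave": 1, "erin": 1, "globex": 1}

# Placement that ignores the links between nodes.
SCATTERED = {"alice": 0, "bob": 1, "carol": 0, "acme": 1,
             "dave": 0, "erin": 1, "globex": 0}

if __name__ == "__main__":
    for name, placement in (("clustered", CLUSTERED), ("scattered", SCATTERED)):
        print(name, count_remote_hops(TRIPLES, placement),
              traversal_cost(TRIPLES, placement))
```

In this toy case the good placement is obvious by inspection. In a real knowledge model the graph is large, densely connected and constantly changing, which is what makes choosing the placement hard.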

I'm getting ahead of myself a bit, because a billion triples is nothing to sneeze at. Semantic database technology is exciting today in applications where you can put everything on one machine. But if you read my last post, you may remember my argument that the Semantic Web is NOT loading information into a big database of facts. It's a social construct for connections of meaning between machines. Current semantic database technology is designed for reasoning on facts loaded onto a single machine; it's capable of building semantic spaces up to a rather large size; but it's not capable of building a semantic Google, for example.

I've learned a lot at the Semantic Technology Conference around this analysis. What I see is that there is a divergence in the technologies being developed. One thread is to focus on the problems that can be addressed on single machines. In practice, that technology has advanced so that the vast majority of problems, particularly the enterprise problems, can be addressed by vertically scaled systems. This is a great achievement, and is one reason for the excitement around semantic technologies. The other thread is to achieve horizontal scaling by layering the semantic technologies on top of horizontally scaled conventional database technologies.

I've been going around the conference provoking people into interesting conversations by asserting that there is no such thing (today) as Semantic Web Technology: there are only Semantic Technology and Web Technology, and combinations thereof. The answer to the question in the title is then that if there were such a thing as Semantic Web Technology, then it would be scalable.


  1. I wonder if there has been any progress around this since this posting.

    1. Good question! I think it's fair to say that "big data" and "data science" have mostly ignored the "semantic technology stack".